A new computer program can read any genome sequence and decipher its genetic code


Yekaterina “Kate” Shulgina was a first-year student in the Graduate School of Arts and Sciences, looking for a short computational biology project so that she could fulfill a requirement of her systems biology program. She wondered how the genetic code, once considered universal, could evolve and change.

That was 2016. Today Shulgina has come out the other end of that short-term project with a way to unravel this genetic mystery. She describes it in a new paper in the journal eLife with Harvard biologist Sean Eddy.

The report details a new computer program that can read the genome sequence of any organism and then determine its genetic code. The program, called Codetta, has the potential to help scientists deepen their understanding of the evolution of the genetic code and correctly interpret the genetic code of newly sequenced organisms.

“This in itself is a very basic biology question,” said Shulgina, who does her graduate research in Eddy’s lab.

The genetic code is the set of rules that tells cells how to translate three-letter combinations of nucleotides into proteins, often called the building blocks of life. Almost all organisms, from E. coli to humans, use the same genetic code. This is why it was once thought that the code was set in stone. But scientists have discovered a handful of outliers, organisms that use alternative genetic codes in which the set of instructions is different.
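As a rough illustration of what "a set of rules" means here, the standard genetic code can be written as a lookup table from each of the 64 three-letter codons to an amino acid (one-letter abbreviations, with `*` marking a stop signal). The sketch below is not taken from the paper; the function name `translate` and the example sequence are made up for illustration.

```python
# Build the standard genetic code as a codon -> amino acid lookup table.
# Codons are enumerated in T, C, A, G order at each of the three positions,
# and the string below lists the matching amino acids in that same order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def translate(dna, code=STANDARD_CODE):
    """Translate a DNA coding sequence codon by codon (illustrative helper)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))


# ATG -> M (methionine), GCC -> A (alanine), AGA -> R (arginine), TAA -> stop
print(translate("ATGGCCAGATAA"))  # prints MAR*
```

An organism with an alternative genetic code would use a table that differs from this one in at least one entry.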

This is where Codetta can shine. The program can help identify more organisms that are using these alternative genetic codes, helping to shed new light on how genetic codes can even change in the first place.

“Understanding how this happened would help us reconcile why we originally thought it was impossible… and how these really fundamental processes actually work.”

Yekaterina “Kate” Shulgina

Already, Codetta has analyzed the genome sequences of more than 250,000 bacteria and other single-celled organisms called archaea for alternative genetic codes, and has identified five that had never been seen before. In all five cases, the code for the amino acid arginine was reassigned to a different amino acid. It is believed to be the first time scientists have seen this swap in bacteria, and it may hint at the evolutionary forces that alter the genetic code.

Researchers say the study marks the largest screening ever for alternative genetic codes. Codetta analyzed essentially every genome available for bacteria and archaea. The name of the program is a cross between "codon," the sequence of three nucleotides that forms a piece of the genetic code, and the Rosetta Stone, a slab of rock inscribed in three scripts.

The work marks a watershed moment for Shulgina, who has spent the past five years developing the statistical theory behind Codetta, writing the program, testing it, and then analyzing the genomes. It works by reading an organism’s genome, then tapping into a database of known proteins to infer a likely genetic code. It differs from other similar methods because of the scale at which it can analyze genomes.

Shulgina joined Eddy’s lab, which specializes in comparing genomes, in 2016 after coming to him for advice on the algorithm she was devising to interpret genetic codes.

Until now, no one had done such a large survey for alternative genetic codes.
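The general idea of inferring a code from known proteins can be sketched with a toy example. The version below is a deliberately simplified majority vote, not Codetta's statistical model: it assumes we already have stretches of genomic DNA lined up against the known protein sequences they resemble, and it simply counts which amino acid each codon most often sits opposite. The function name `infer_code`, the `min_count` threshold, and the tiny input data are all invented for illustration.

```python
from collections import Counter, defaultdict


def infer_code(alignments, min_count=2):
    """Guess each codon's meaning by majority vote (toy sketch, not Codetta).

    alignments: list of (dna, protein) pairs in which the i-th codon of
    `dna` is assumed to line up with the i-th residue of `protein`.
    Codons whose top amino acid was seen fewer than `min_count` times
    are reported as '?' (not enough evidence).
    """
    tallies = defaultdict(Counter)
    for dna, protein in alignments:
        for i, amino_acid in enumerate(protein):
            codon = dna[3 * i:3 * i + 3].upper()
            if len(codon) == 3:  # skip a truncated final codon
                tallies[codon][amino_acid] += 1

    inferred = {}
    for codon, counts in tallies.items():
        amino_acid, seen = counts.most_common(1)[0]
        inferred[codon] = amino_acid if seen >= min_count else "?"
    return inferred


# Made-up data: in this imaginary genome, AGG (arginine in the standard
# code) keeps lining up with methionine (M) in known proteins.
toy_alignments = [
    ("ATGAGGGCC", "MMA"),
    ("GCCAGGATG", "AMM"),
    ("AGGGCC", "MA"),
]
inferred = infer_code(toy_alignments)
print(inferred["AGG"])  # prints M
```

A real screen has to cope with noisy matches, rare codons, and sequences that are not genes at all, which is where a proper statistical treatment comes in.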

“It was great to see new codes because for all we knew Kate would be doing all this work and there wouldn’t be any new ones to find,” said Eddy, who is also a Howard Hughes Medical Institute investigator. He also noted the potential of the system to be used to ensure the accuracy of the many databases that house protein sequences.

“Many protein sequences in databases these days are just conceptual translations of genomic DNA sequences,” Eddy said. “People extract these protein sequences for all kinds of useful things, like new enzymes or new gene-editing tools and so on. You would like those protein sequences to be precise, but if the organism uses a non-standard code, they will be translated in error.”

The researchers say the next step in the work is to use Codetta to search for alternative codes in viruses, eukaryotes and organellar genomes such as those of mitochondria and chloroplasts.
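The kind of error Eddy describes can be shown with a small sketch that translates the same DNA under two different tables. The reassignment used here (the arginine codon AGG read as methionine) and the example sequence are hypothetical choices for illustration; the helper names are likewise made up.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
standard_code = dict(zip(CODONS, AMINO_ACIDS))

# Hypothetical alternative code: identical except that AGG, normally
# arginine (R), is read as methionine (M). Chosen only for illustration.
alternative_code = dict(standard_code, AGG="M")


def translate_with(dna, code):
    """Translate complete codons of `dna` using the given codon table."""
    usable = len(dna) - len(dna) % 3
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))


gene = "ATGAGGGCCAGGTAA"  # made-up coding sequence
assumed = translate_with(gene, standard_code)    # what a database might store
actual = translate_with(gene, alternative_code)  # what the cell would make
wrong_positions = [i for i, (x, y) in enumerate(zip(assumed, actual)) if x != y]

print(assumed)          # prints MRAR*
print(actual)           # prints MMAM*
print(wrong_positions)  # prints [1, 3]
```

Every occurrence of the reassigned codon produces a wrong residue in the stored protein, so the damage grows with how often that codon appears in the genome.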

“There’s still a lot of diversity of life where we haven’t done this systematic screening yet,” Shulgina said.


Journal reference:

Shulgina, Y. & Eddy, S. R. (2021) A computational screen for alternative genetic codes in over 250,000 genomes. eLife. doi.org/10.7554/eLife.71402.


Gordon K. Morehouse