Just as a shell can be described with a few simple parameters representing its shape, a genome can be described with a few simple parameters representing its composition. The tetrahedron on the right-hand side shows genome compositions for several hundred organisms: remarkably, the total coding sequence of every organism clusters in a small range of the possible space, around Chargaff's Axis where C=G and A=T.
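As a minimal sketch of what "a few simple parameters" can mean here: the four base frequencies of a sequence sum to one, which leaves three free parameters, and that is why compositions can be plotted inside a tetrahedron. The function names `composition` and `chargaff_deviation` and the toy sequence below are illustrative assumptions, not part of any published analysis pipeline.

```python
from collections import Counter


def composition(seq):
    """Return the fractions of A, C, G and T among the unambiguous bases in seq."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def chargaff_deviation(freqs):
    """Hypothetical distance from Chargaff's Axis: |A - T| + |C - G|.

    A value of 0 means the composition lies exactly on the axis (A=T and C=G).
    """
    return abs(freqs["A"] - freqs["T"]) + abs(freqs["C"] - freqs["G"])


# Toy sequence chosen for illustration only; it is not real genome data.
toy = "ATGCGCATTAGCGCTA"
freqs = composition(toy)
print(freqs, chargaff_deviation(freqs))
```

Applied to the concatenated coding sequences of a genome rather than a toy string, the same two functions would give one point in the tetrahedron and a rough measure of how far that point sits from the axis.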
By investigating the trajectories of evolving genes and genomes in the space of possible compositions, and by tracking the changes that affect a pool of random RNA sequences when they are selected to perform a catalytic task, we hope to uncover fundamental rules relating composition to structure and function. This research will help us understand how mutation and selection affect genomes, and how life could have arisen from random RNA molecules.
What is genetic information? Shannon's information theory tells us that information is a decrease in uncertainty, where 'uncertainty' is defined as the negative of the sum, over all outcomes, of each outcome's probability multiplied by the log of that probability. For example, a coin toss has two equally likely outcomes, so the uncertainty about it beforehand is −(0.5 × log(0.5) [heads] + 0.5 × log(0.5) [tails]), or one bit (using base-two logarithms). After the coin toss, there is no uncertainty about the result, so we have gained one bit of information.
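The coin-toss arithmetic can be checked with a short sketch. It uses the standard Shannon formula H = −Σ p log2(p), with the conventional minus sign so that uncertainty comes out non-negative; the function name `uncertainty_bits` is an assumption made for illustration.

```python
import math


def uncertainty_bits(probabilities):
    """Shannon uncertainty H = -sum(p * log2(p)), in bits.

    Outcomes with probability 0 contribute nothing and are skipped.
    """
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


before = uncertainty_bits([0.5, 0.5])  # fair coin, before the toss
after = uncertainty_bits([1.0])        # outcome known, after the toss
gained = before - after                # information gained by observing the toss

# The same function applies to a nucleotide position with four equally likely bases.
per_base = uncertainty_bits([0.25, 0.25, 0.25, 0.25])
print(before, after, gained, per_base)
```

With four equally likely bases the uncertainty is two bits per position, which is the upper limit on what a single nucleotide can carry.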
Genetic information, therefore, consists of inherited states that allow us to make better predictions about an organism's phenotype. In principle, there is no need for information to be localized inside a single molecule, or even for any physical continuity - the germ plasm could be created anew in each generation by gemmules from throughout the body, as Darwin originally suggested. However, we now know that a vast source of heritable information - ranging from a few kilobytes in viruses to over a gigabyte in the diploid human genome - is stored and transmitted as DNA sequences.
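The storage figures above follow from simple arithmetic, sketched here under two stated assumptions: each base carries at most two bits (four equally likely, independent bases), and the genome sizes used are approximate values supplied for illustration (bacteriophage phiX174 at 5,386 nucleotides, and a haploid human genome of roughly 3.1 billion base pairs). Real sequences are not random, so their actual information content is lower than this upper bound.

```python
BITS_PER_BASE = 2  # log2(4): upper bound for one of four equally likely bases


def capacity_bytes(n_bases):
    """Maximum information capacity of n_bases nucleotides, in bytes."""
    return n_bases * BITS_PER_BASE / 8


# Illustrative genome sizes (assumptions, rounded):
PHIX174_BASES = 5_386                      # a small viral genome
HUMAN_DIPLOID_BASES = 2 * 3_100_000_000    # two copies of ~3.1 Gbp

virus_kb = capacity_bytes(PHIX174_BASES) / 1_000
human_gb = capacity_bytes(HUMAN_DIPLOID_BASES) / 1_000_000_000
print(f"phiX174: about {virus_kb:.1f} kB; diploid human: about {human_gb:.2f} GB")
```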
My work focuses on how this information is translated from DNA to protein, and how both the information itself and the translation mechanism by which it is expressed evolve in different lineages. The genetic code, the mapping between trinucleotide 'codons' in genes and amino acids in proteins, is highly resistant to certain types of genetic errors - is this because the code has been selected from a vast population of inferior alternatives, or because rules of chemistry fixed codon assignments to their present states early in evolution? How and why does the genetic code vary in modern cells and organelles? Why are codons used nonrandomly, even in cases where several codons have the same meaning? How does mutation affect the information content of individual genes and genomes?
More recently, I have become interested in the distribution of functional RNA molecules in the space of all possible sequences. How many random sequences do you need to search to find a particular binding or catalytic function? I am currently using a range of analytical and computational techniques to address questions about the frequency of interesting RNA sequences and structures, and to find whether there are general rules that unite functional RNA molecules. Some of the predictions from the mathematical work and from the supercomputer simulations are currently being tested experimentally by others in the Yarus lab.
The Evolution of Information. MCD Biology, CU Boulder, January 2003
The Origin and Evolution of the Genetic Code. Final Public Oral examination, EEB, Princeton University, April 2001
Codon Assignments as Molecular Fossils: Did ancient amino acid binding sites shape the genetic code?
Information Loss in Mitochondria: Are Mutation Biases to Blame?