Over the summer I’ve been preparing some new biochemistry courses. This has given me the excuse to browse through the literature from the dawn of molecular biology, starting with Watson and Crick’s DNA structure paper from 1953 1. From there, I quickly got distracted (as is the way when you are supposed to be doing something else) by the amazing ideas and theories that followed this seminal paper. It must have been a fabulous time: there were so many questions, and so little data to restrict thought, that some of the most beautiful wrong ideas in science were hatched.
One of the most pressing questions of the time was: how does DNA code for proteins? The problem was simple to state. There are just 4 bases in DNA: adenine (A), thymine (T), guanine (G) and cytosine (C). But DNA has to code for the 20 different amino acids found in proteins. So how can a 4-letter alphabet be translated into a 20-letter alphabet? Most of the subsequent ideas seem to stem from a desire to get 20 from 4.
The first serious stab at answering this conundrum came from a surprising direction. George Gamow was a theoretical physicist and cosmologist, more famous for the Big Bang Theory than for his contributions to molecular biology. Nevertheless, his background didn’t stop him publishing, in Nature, an intriguing theory for the genetic code 2.
Gamow proposed an overlapping triplet code. It had to be a triplet of bases because a doublet only produces 16 (4 x 4) combinations, whilst a triplet produces 64 (4 x 4 x 4) possible combinations, more than enough to code for all the amino acids. He suggested the triplets overlapped because this would allow the double-stranded helix of DNA to act as a direct template on which the amino acids could be assembled into proteins. So, for example, the sequence ATGCTA would contain the triplets ATG, TGC, GCT and CTA, each of which would code for an amino acid.
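For anyone who likes to see the counting spelled out, here is a small Python sketch of the arithmetic and of how overlapping triplets are read off a sequence. The function name `overlapping_triplets` is my own, made up for illustration; nothing like it appears in Gamow's paper.

```python
from itertools import product

BASES = "ATGC"

# Every possible doublet and triplet over the 4-letter DNA alphabet.
doublets = ["".join(p) for p in product(BASES, repeat=2)]
triplets = ["".join(p) for p in product(BASES, repeat=3)]
print(len(doublets), len(triplets))  # 16 64

def overlapping_triplets(seq):
    """Read triplets with a window that slides along one base at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

print(overlapping_triplets("ATGCTA"))  # ['ATG', 'TGC', 'GCT', 'CTA']
```

Note that a sequence of n bases yields n - 2 overlapping triplets, which is why each extra amino acid costs only one extra base in this scheme.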
Gamow also proposed a mechanism (which became known as Gamow’s Diamond Hypothesis) to back up his theory. Each amino acid would fit directly into distinct diamond shaped pockets formed within the grooves of DNA where the 4 sides of each pocket would be defined by the 4 bases. And when he did the math, hey presto, it turns out there are 20 possible uniquely shaped pockets!
The numerology is compelling, but there is a significant limitation to this theory: it does not allow all possible combinations of amino acids. This problem is best illustrated with a dipeptide. In the overlapping triplet hypothesis a dipeptide would be coded by 4 bases, which gives 256 (4 x 4 x 4 x 4) possible combinations. However, given the 20 amino acids, there are 400 (20 x 20) possible dipeptide sequences. So at least 144 dipeptide combinations are not possible. Of course this limitation should be easily testable by simply looking at protein sequences and seeing if there are more than 256 dipeptide combinations. But let’s remember that at the time (1954) the data on protein sequences were pretty limited (Sanger only published the first protein sequence in 1951 3). So the test had to wait until 1957, by which time there were just enough sequence data for Brenner to publish the clearly titled paper “On the Impossibility of all Overlapping Triplet Codes in Information Transfer from Nucleic Acid to Proteins” 4.
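The dipeptide argument is just a comparison of two powers, and a few lines of Python make it concrete. This is only a sketch of the counting, not of Brenner's actual analysis, and the helper names `max_codable` and `possible_peptides` are invented here for illustration.

```python
N_BASES = 4
N_AMINO_ACIDS = 20

def max_codable(n_residues):
    """Upper bound on distinct peptides a fully overlapping triplet code can spell.

    Adjacent triplets share two bases, so n residues need only n + 2 bases.
    """
    return N_BASES ** (n_residues + 2)

def possible_peptides(n_residues):
    """Number of distinct peptides of this length built from 20 amino acids."""
    return N_AMINO_ACIDS ** n_residues

codable = max_codable(2)
possible = possible_peptides(2)
print(codable, possible, possible - codable)  # 256 400 144
```

The 256 is an upper bound: if several base quartets spelled the same dipeptide, even fewer dipeptides would be reachable.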
So back to square one. And Francis Crick steps back into the picture.
Crick saw the problem like so: the genetic code had to consist of non-overlapping triplets of bases. But if that is the case, how can one triplet be distinguished from the next? After all, there is no punctuation in DNA. It’s like trying to find the three-letter words in SATEATEATS without any commas. They could be SAT EAT EAT, or ATE ATE ATS, or TEA TEA, depending on where you start. So Crick decided there must be “codes without commas”. And that’s what he called his paper 5.
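The three ways of chopping up SATEATEATS are what we would now call reading frames. Here is a minimal Python sketch that lists them; `reading_frames` is a name I've made up for the purpose, and leftover letters that don't fill a complete triplet are simply dropped.

```python
def reading_frames(seq, size=3):
    """Split seq into non-overlapping words of `size`, once per starting offset."""
    frames = []
    for offset in range(size):
        words = [seq[i:i + size]
                 for i in range(offset, len(seq) - size + 1, size)]
        frames.append(words)
    return frames

for offset, words in enumerate(reading_frames("SATEATEATS")):
    print(offset, " ".join(words))
# 0 SAT EAT EAT
# 1 ATE ATE ATS
# 2 TEA TEA
```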
It was a brilliantly elegant theory. He took the 64 triplet codons and put them together in groups of circular permutations of one another, i.e. ACG, CGA and GAC are in one group; CCG, CGC and GCC form a second group; and so on. He then hypothesised that only one sequence from each group would be used to code for an amino acid. These he called ‘sense’ codons. The remainder were termed ‘nonsense’. So if ACG and CCG are sense, then the sequence ACGCCGACG can only be read ACG CCG ACG, because the other two reading frames (CGC CGA and GCC GAC) give only nonsense codons.
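The comma-free property can be checked mechanically. The Python sketch below is my own illustration of the idea rather than anything from Crick's paper: `is_comma_free` tests every pair of adjacent sense codons for an out-of-frame triplet that is also sense, and `sense_frames` reports which reading frames of a sequence contain only sense codons. Both names are hypothetical.

```python
from itertools import product

def is_comma_free(code):
    """True if no triplet straddling two adjacent sense codons is itself sense."""
    code = set(code)
    for first, second in product(code, repeat=2):
        joined = first + second
        # The two out-of-frame triplets that span the codon boundary.
        if joined[1:4] in code or joined[2:5] in code:
            return False
    return True

def sense_frames(seq, code):
    """Frame offsets (0, 1, 2) in which every complete triplet is a sense codon."""
    code = set(code)
    good = []
    for offset in range(3):
        triplets = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        if all(t in code for t in triplets):
            good.append(offset)
    return good

sense = {"ACG", "CCG"}
print(is_comma_free(sense))               # True
print(sense_frames("ACGCCGACG", sense))   # [0]
print(is_comma_free({"ACG", "CGA"}))      # False: ACGACG contains CGA out of frame
```

Picking two members of the same circular-permutation group, as in the last line, breaks the property straight away, which is why only one codon per group can be sense.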
Also into the nonsense pile went AAA, TTT, GGG and CCC, because when they appeared they would cause ambiguity about where a codon starts (e.g. is CCCCGGG read CCC CGG G or C CCC GGG?). So that’s 64 possible codons; subtracting CCC, GGG, AAA and TTT leaves us with 60. Of those remaining, only one in every three is ‘sense’. EUREKA! There are the 20 codons needed to code for the 20 amino acids. Brilliant, everything fits perfectly.
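The magic number 20 can be reproduced by brute force. This Python sketch (again my own illustration, with the made-up helper `rotation_class`) groups all 64 codons by circular permutation and counts the groups.

```python
from itertools import product

BASES = "ATGC"

def rotation_class(codon):
    """The set of all circular permutations of a codon, e.g. ACG, CGA, GAC."""
    return frozenset(codon[i:] + codon[:i] for i in range(3))

classes = {rotation_class("".join(p)) for p in product(BASES, repeat=3)}

# AAA, TTT, GGG and CCC are their own rotations, so they sit in groups of one.
singletons = [c for c in classes if len(c) == 1]
triples = [c for c in classes if len(c) == 3]
print(len(singletons), len(triples))  # 4 20
```

Strictly speaking this only shows that a comma-free triplet code can have at most 20 sense codons, one from each group of three; the counting alone doesn't prove that a full set of 20 can be chosen.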
The comma-free code was so elegant, and the numbers fitted so well, that everyone believed it for the best part of 5 years. Until, that is, pesky experimental data got in the way. In 1961 Marshall Nirenberg and J. Heinrich Matthaei produced a stretch of RNA composed entirely of uracil (RNA’s equivalent of thymine) 6. When they added it to a mix of ribosomes, tRNAs and amino acids, the result was a polypeptide of pure phenylalanine. In other words UUU, which the comma-free code insisted had to be nonsense, was coding for an amino acid. And so a theory that was too elegant for nature was shot down in flames.
The rest of the story is in the textbooks.
4. Brenner, S. On the impossibility of all overlapping triplet codes in information transfer from nucleic acid to proteins. Proceedings of the National Academy of Sciences of the U.S.A. 1957. 43:687–694.
6. Nirenberg, M.W., and Matthaei, J.H. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings of the National Academy of Sciences of the U.S.A. 1961. 47:1588–1602.