I’m not all that thrilled that my first post here is a techie one. I was kind of hoping I could write about flowers or something. But Mary was so impressed by my decoding skills that she prevailed upon me to write this. So blame her. Here is a picture of flowers anyway. For the record, my decoding skills are OK, but not great. I am mostly pretty good at it because I am so lazy. I’ll write more about that later.
In this post, I will describe how to figure out the encoding scheme for the DNA watermarks Mary described in her recent post.
My main goal is to give an example of how a code gets deciphered. It’s an art as well as a science. This particular code is not insanely difficult, so it makes a good example.
On to the watermarks.
I got the watermarks themselves from this paper. I also read an article that said that there were quotes from James Joyce and Richard Feynman in the watermarks. That is all the information I will need to decode them.
I will concentrate mostly on watermark #2, because it looked to me most amenable to analysis. Here it is:
My first thought is that I am strongly inclined to treat the bases in threes, because that is how the genome encodes amino acids. Taking 4 symbols in sets of three gives a total of 4^3 = 64 possibilities, which is enough space for the alphabet and numbers and some symbols. It’s also just enough for upper- and lowercase letters and numbers and maybe a space but nothing else.
I can’t tell which is which, but I heard somewhere that one of the messages has a Web page in it, and that requires several punctuation marks, so I am sticking with the uppercase letters and symbols theory.
Either way, this code amounts to a single-substitution cipher, kind of like the cryptogram in the newspaper, only with punctuation included.
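The counting argument above is easy to check. Here is a small Python sketch of it; the variable names are made up for illustration, and the two "schemes" are just my guesses at what the alphabet might contain, not anything stated in the paper.

```python
from itertools import product
import string

BASES = "ACGT"

# Every possible three-base symbol: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

# Guess 1: uppercase letters plus digits, with the rest free for punctuation.
uppercase_scheme = len(string.ascii_uppercase) + len(string.digits)

# Guess 2: upper- and lowercase letters, digits, and a space.
mixed_case_scheme = len(string.ascii_letters) + len(string.digits) + 1

print(len(codons))                      # 64
print(len(codons) - uppercase_scheme)   # 28 symbols left over for punctuation
print(len(codons) - mixed_case_scheme)  # 1 symbol left over
```

So the uppercase theory leaves plenty of room for the punctuation a Web page would need, while the mixed-case theory leaves almost none.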
The start and end tags were given in the paper, so I can remove those. Additional noncoding data is also marked. Here is what I get for the second watermark, with the start and end tags and noncoding data removed, taken three at a time:
CAA CTG GCA GCA TAA AAC ATA TAG AAC TAC CTG CTA TAA GTG ATA
CAA CTG TTT TCA TAG TAA AAC ATA CAA CGT TGC TGA TAG TAC TCC
TAA GTG ATA GCT TAG TGC GTT TAG CAT ATA TTG TAG GCT TCA TAA
TAA GTG ATA TTT TAG CTA CGT AAC TAA ATA AAC TAG CTA TGA CTG
TAC TCC TAA GTG ATA TTT TCA TCC TTT GCA ATA CAA TAA CTA CTA
CAT CAA TAG TGC GTG ATA TGC CTG TGC TAG ATA TAG AAC ACA TAA
CTA CGT TTG CTG TTT TCA GTG ATA TGC TAG TTT CAT CTA TAG ATA
TAG GCT GCT TAG ATT CCC TAC TAG CTA TTT CTG TAG GTG ATA TAC
GTC CAT TGC ATA AGT TAA TGC ATT TAA CTA GCT GTG ATA CTA TAG
CAT CCC CAT TCC TAG TGC ATA TTT TCA TCC TAG TGC TAC GTG ATA
TAA TTG TAC TAA TGC CTG TAG ATA ATT TAA TGC CTG GCT CGT TTG
TAG GTG ATA ATT TAG TGC CTG TAA AAC ATA TAC CTG AGT GCT CGT
TGC GTG ATA GTT CGT TCA TGC ATA TAC AAC TAG GCT GCT GTG ATA
TGG TCA CTG CCC TTA CTG TGC TAC ATA TTA CTG CGA GGG GGA TGA
CGT ATA AAC CTG TTG TAA GTG ATA TGA CGT ATA TAA CTA CTA GTG
ATA TGA CGT ATA GGC TAG AAC AAC GTG ATA TGA CGT ATA TGA CTA
CTG TCC CAA ACA TCA GTG ATA TGA CGT ATA CTA TAA TTT CTA TAA
TAG TGA TAA ATA AAC CTG GGC TAA ATA CGT TCC TGA ATA CGT GGC
ATA AAC CTG GGC TAA CGA GGA ATA CCC ATA GTT TAG CAA TAA GCT
ATA GTT CGT CAT TTT TAA
The first place to start in decoding any code is a frequency analysis. I just count the number of times each three-base symbol appears in the text:
ATA: 41 TAG: 27 TAA: 25 CTG: 18
TGC: 16 GTG: 16 CTA: 15 CGT: 14
AAC: 13 TTT: 10 TGA: 10 TAC: 10
GCT: 10 TCA: 8 TCC: 7 CAT: 7
CAA: 7 TTG: 5 GTT: 4 GGC: 4
CCC: 4 ATT: 4 GCA: 3 TTA: 2
GGA: 2 CGA: 2 AGT: 2 ACA: 2
TGG: 1 GTC: 1 GGG: 1
I’ve listed the symbols here in decreasing order of frequency. There is one that really stands out: ATA occurs 41 times, and the next one, TAG, occurs only 27 times. That’s a big gap. I am pretty confident that ATA will be a space. There is a simple test for that: do two ATAs ever occur next to each other? If not, then ATA is probably a space. And sure enough, ATA ATA never appears.
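Both the counting and the "two in a row" test can be sketched in a few lines of Python. This is only an illustration, not the way the counts above were produced: the helper names are made up, and to keep it short it runs on just the last 21 symbols of the watermark rather than the whole thing.

```python
from collections import Counter

def split_codons(sequence, size=3):
    """Drop any whitespace, then cut the base string into chunks of `size`."""
    bases = "".join(sequence.split())
    return [bases[i:i + size] for i in range(0, len(bases), size)]

def has_adjacent_pair(codons, symbol):
    """True if `symbol` ever appears twice in a row."""
    return any(a == symbol and b == symbol for a, b in zip(codons, codons[1:]))

# The tail end of watermark #2, as listed above.
tail = ("ATA AAC CTG GGC TAA CGA GGA ATA CCC ATA GTT "
        "TAG CAA TAA GCT ATA GTT CGT CAT TTT TAA")

codons = split_codons(tail)
counts = Counter(codons)

print(counts.most_common(3))             # [('ATA', 4), ('TAA', 3), ('GTT', 2)]
print(has_adjacent_pair(codons, "ATA"))  # False
```

Even on this small slice ATA comes out on top and never sits next to itself, which is what you would expect from a space.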
The next two most frequent are TAG and TAA. I am not so sure about those; they could be commas, if the watermark is a list of names, or they might be letters. In English, the letters in order of decreasing frequency are ETAON RISHD LFCMU … so it is likely that those two are from the ETAON group. Let’s try it out with TAG as E and TAA as T:
????T? E????T? ????ET? ????E??T? ?E??E? ?E??TT? ?E???T ?E?????T? ????? ?T????E?? ???E E??T??????? ?E???E E??E???E???E? ???? ?T??T??? ?E????E? ???E??? T??T??E ?T?????E? ?E??T? ??????? ???? ??E??? ???????? ??????? ???T? ?? T??? ?? ?E??? ?? ???????? ?? ?T??TE?T ???T ??? ?? ???T?? ? ?E?T? ????T
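Applying a partial guess like this is mechanical, so here is a minimal Python sketch of it. The function and variable names are hypothetical, and it only decodes the last thirteen symbols of the watermark to keep things short.

```python
def decode(codons, key, unknown="?"):
    """Translate each symbol through `key`; anything not in the key becomes `unknown`."""
    return "".join(key.get(codon, unknown) for codon in codons)

# First guess: ATA is a space, TAG is E, TAA is T.
guess = {"ATA": " ", "TAG": "E", "TAA": "T"}

# The last thirteen symbols of watermark #2.
tail = "CCC ATA GTT TAG CAA TAA GCT ATA GTT CGT CAT TTT TAA".split()

print(decode(tail, guess))  # ? ?E?T? ????T
```

Keeping the guess in a dictionary makes it cheap to change one entry and decode again, which is most of what code breaking like this consists of.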
I have labeled all the characters I don’t know as question marks. That’s not optimal, because you can’t tell which characters are the same and which are different, but it will have to do. It looks like I got the spaces right; the words look like words. I can’t tell about the other letters. But here I can try a “crib”: a known piece of plaintext. Notice those last three words? They might be an author name from a quote:
? ?E?T? ????T
- JAMES JOYCE
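Testing a crib comes down to one question: can the guessed plaintext line up with the symbols so that each symbol always means the same character, and no two symbols mean the same character? Here is a rough Python sketch of that check. The function name and its arguments are made up for illustration, and it assumes the code really is a one-to-one substitution.

```python
def fit_crib(codons, crib, key=None):
    """Return the key extended by the crib, or None if the crib cannot fit.

    Assumes a one-to-one substitution: each symbol stands for exactly one
    character and each character is written with exactly one symbol.
    """
    if len(codons) != len(crib):
        return None
    forward = dict(key or {})                      # symbol -> character
    reverse = {ch: sym for sym, ch in forward.items()}  # character -> symbol
    for codon, ch in zip(codons, crib):
        if forward.get(codon, ch) != ch:       # symbol already means something else
            return None
        if reverse.get(ch, codon) != codon:    # character already has another symbol
            return None
        forward[codon] = ch
        reverse[ch] = codon
    return forward

# The last thirteen symbols of watermark #2.
tail = "CCC ATA GTT TAG CAA TAA GCT ATA GTT CGT CAT TTT TAA".split()

# Keeping only the space, the crib fits and hands back new letters.
fitted = fit_crib(tail, "- JAMES JOYCE", {"ATA": " "})
print(fitted["TAG"], fitted["TAA"])  # A E

# Keeping the earlier guess (TAG = E, TAA = T), the crib is rejected.
print(fit_crib(tail, "- JAMES JOYCE", {"ATA": " ", "TAG": "E", "TAA": "T"}))  # None
```

The repeated J and the repeated E in the crib both have to land on repeated symbols, which is what makes the fit convincing rather than a coincidence.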
If TAG is an A and TAA is an E, then they fit! So let’s assume JAMES JOYCE is correct and try again:
M???E? A????E? M?C?AE? MO??A??E? SA?JAY ?AS?EE? CA?O?E ?A?????E? C??C? ME??YMA?? ???A A??E?O??C?? ?ACY?A ASSA?-?A?C?A? ??Y? ?E??E?S? ?AY-Y?A? C??A??? E??E??A ?E??SO?A? ?A??E? ???SO?? JO?? ??ASS? ???-???? ??????O ???E? ?O E??? ?O ?A??? ?O ????M??? ?O ?EC?EA?E ???E O?? O? ???E?? - JAMES JOYCE
That is starting to look like something! I see SA?JAY, so I am going to guess that is SANJAY, and I see ?EC?EA?E, which I will guess is RECREATE. Those give me three very common letters: N, T, and R. Putting those in, I get:
M???E? A???RE? M?C?AE? MONTA??E? SANJAY ?AS?EE? CARO?E ?ART???E? C??C? MERRYMAN? N?NA A??ERO??C?? NACYRA ASSA?-?ARC?A? ??YN ?EN?ERS? RAY-Y?AN C??AN?? E??EN?A ?EN?SO?A? ?AN?E? ???SON? JO?N ??ASS? ???-??N? ?????TO ???E? TO ERR? TO ?A??? TO TR??M??? TO RECREATE ???E O?T O? ???E?? - JAMES JOYCE
At this point I could probably just Google the quote, but that would be cheating. So I get the L from CAROLE and the I from NINA. One symbol also occurs a lot at the ends of words, so it looks like a comma:
MI??EL AL?IRE, MIC?AEL MONTA??E, SANJAY ?AS?EE, CAROLE LARTI??E, C??C? MERRYMAN, NINA AL?ERO?IC?, NACYRA ASSA?-?ARCIA, ??YN ?EN?ERS, RAY-Y?AN C??AN?, E??ENIA ?ENISO?A, ?ANIEL ?I?SON, JO?N ?LASS, ??I-?IN? ?I???TO LI?E, TO ERR, TO ?ALL, TO TRI?M??, TO RECREATE LI?E O?T O? LI?E?? - JAMES JOYCE
Now I am getting somewhere. I probably have enough letters right to try that first watermark, which may contain HTML code. Here is what I get from it:
J? CRAI? ?ENTER INSTIT?TE ?????A?C?E???IJ?LMNO??RST????Y?? ??????????????-????????????????????,?SYNT?ETIC ?ENOMICS, INC?????OCTY?E ?TML???TML???EA???TITLE??ENOME TEAM??TITLE????EA????O?Y??A ?RE????TT????????JC?I?OR????T?E JC?I??A?????RO?E YO???E ?ECO?E? T?IS ?ATERMAR? ?Y EMAILIN? ?S ?A ?RE???MAILTO?MRO?STI??JC?I?OR????ERE???A????????O?Y????TML?
Note the IJ?LMNO??RST. There is an alphabet in there! That really helps:
J? CRAIG VENTER INSTITUTE ?????ABCDEFGHIJKLMNOPQRSTUVWXYZ? ??????????????-????????????????????,?SYNTHETIC GENOMICS, INC????DOCTYPE HTML??HTML??HEAD??TITLE?GENOME TEAM??TITLE???HEAD??BODY??A HREF??HTTP???WWW?JCVI?ORG???THE JCVI??A??P?PROVE YOU?VE DECODED THIS WATERMARK BY EMAILING US ?A HREF??MAILTO?MROQSTIZ?JCVI?ORG??HERE???A???P???BODY???HTML?
That’s looking like good HTML. I will put the complete alphabet into the second watermark again:
MIKKEL ALGIRE, MICHAEL MONTAGUE, SANJAY VASHEE, CAROLE LARTIGUE, CHUCK MERRYMAN, NINA ALPEROVICH, NACYRA ASSAD-GARCIA, GWYN BENDERS, RAY-YUAN CHUANG, EVGENIA DENISOVA, DANIEL GIBSON, JOHN GLASS, ZHI-QING QI???TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE?? - JAMES JOYCE
And that’s pretty much it.