To test the idea that DNA might contain a spoken language, I needed the DNA code of an organism. I chose one of the simplest organisms, E. coli. An E. coli DNA sequence can easily be obtained online here:
http://www.ncbi.nlm.nih.gov/nuccore/X01714
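This E. coli sequence record has only 1609 bases (it is a short fragment; the complete E. coli genome runs to millions of bases):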
CAGAGAAAATCAAAAAGCAGGCCACGCAGGGTGATGAATTAACAATAAAA ATGGTTAAAAACCCCGATAT
CGTCGCAGGCGTTGCCGCACTAAAAGACCATCGACCCTACGTCGTTGGAT TTGCCGCCGAAACAAATAAT
GTGGAAGAATACGCCCGGCAAAAACGTATCCGTAAAAACCTTGATCTGAT CTGCGCGAACGATGTTTCCC
AGCCAACTCAAGGATTTAACAGCGACAACAACGCATTACACCTTTTCTGG CAGGACGGAGATAAAGTCTT
ACCGCTTGAGCGCAAAGAGCTCCTTGGCCAATTATTACTCGACGAGATCG TGACCCGTTATGATGAAAAA
AATCGACGTTAAGATTCTGGACCCGCGCGTTGGGAAGGAATTTCCGCTCC CGACTTATGCCACCTCTGGC
TCTGCCGGACTTGACCTGCGTGCCTGTCTCAACGACGCCGTAGAACTGGC TCCGGGTGACACTACGCTGG
TTCCGACCGGGCTGGCGATTCATATTGCCGATCCTTCACTGGCGGCAATG ATGCTGCCGCGCTCCGGATT
GGGACATAAGCACGGTATCGTGCTTGGTAACCTGGTAGGATTGATCGATT CTGACTATCAGGGCCAGTTG
ATGATTTCCGTGTGGAACCGTGGTCAGGACAGCTTCACCATTCAACCTGG CGAACGCATCGCCCAGATGA
TTTTTGTTCCGGTAGTACAGGCTGAATTTAATCTGGTGGAAGATTTCGAC GCCACCGACCGCGGTGAAGG
CGGCTTTGGTCACTCTGGTCGTCAGTAACACATACGCATCCGAATAACGT CATAACATAGCCGCAAACAT
TTCGTTTGCGGTCATAGCGTGGGTGCCGCCTGGCAAGTGCTTATTTTCAG GGGTATTTTGTAACATGGCA
GAAAAACAAACTGCGAAAAGGAACCGTCGCGAGGAAATACTTCAGTCTCT GGCGCTGATGCTGGAATCCA
GCGATGGAAGCCAACGTATCACGACGGCAAAACTGGCCGCCTCTGTCGGC GTTTCCGAAGCGGCACTGTA
TCGCCACTTCCCCAGTAAGACCCGCATGTTCGATAGCCTGATTGAGTTTA TCGAAGATAGCCTGATTACT
CGCATCAACCTGATTCTGAAAGATGAGAAAGACACCACAGCGCGCCTGCG TCTGATTGTGTTGCTGCTTC
TCGGTTTTGGTGAGCGTAATCCTGGCCTGACCCGCATCCTCACTGGTCAT GCGCTAATGTTTGAACAGGA
TCGCCTGCAAGGGCGCATCAACCAGCTGTTCGAGCGTATTGAAGCGCAGC TGCGCCAGGTATTGCGTGAA
AAGAGAATGCGTGAGGGTGAAGGTTACACCACCGATGAAACCCTGCTGGC AAGCCAGATCCTGGCCTTCT
GTGAAGGTATGCTGTCACGTTTTGTCCGCAGCGAATTTAAATACCGCCCG ACGGATGATTTTGACGCCCG
CTGGCCGCTAATTGCGGCCAGTTGCAGTAATATGACGCCGGATGACTTTT CATCCGGCGAGTTTCTTTAA
ACGCCAAACTCTTCGCGATAGGCCTTAACCGCCGCCAGATGTTCCGCCAT TTCCGGCTTCTCTTCCAGG
It is amazing that such a short code can carry working instructions for a living organism.
Then I divided this DNA sequence into triplets (codons) using VB.NET, starting at the first base. The number in brackets is simply the position of the codon's first base in the DNA sequence.
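The splitting step can be sketched in a few lines of Python. This is only an illustration, not the original VB.NET program; the function names split_codons and format_codons are made up for the example, and it assumes the reading frame starts at the first base and that any leftover one or two bases at the end are dropped.

```python
def split_codons(sequence):
    """Return (position, codon) pairs; positions are 1-based, as in the listing."""
    # Remove the spaces and line breaks from the pasted sequence, normalise case
    bases = "".join(sequence.split()).upper()
    codons = []
    # Step through the sequence three bases at a time, skipping an incomplete tail
    for i in range(0, len(bases) - 2, 3):
        codons.append((i + 1, bases[i:i + 3]))
    return codons


def format_codons(codons):
    """Format the pairs as '(1) CAG (4) AGA ...'."""
    return " ".join(f"({pos}) {codon}" for pos, codon in codons)


# First 18 bases of the sequence above
print(format_codons(split_codons("CAGAGAAAATCAAAAAGC")))
# (1) CAG (4) AGA (7) AAA (10) TCA (13) AAA (16) AGC
```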
According to the same record,
http://www.ncbi.nlm.nih.gov/nuccore/X01714, this E. coli DNA has two coding regions. I have highlighted these two coding regions below in red. The rest of the DNA is noncoding.
(1) CAG (4) AGA (7) AAA (10) TCA (13) AAA (16) AGC (19) AGG (22) CCA (25) CGC (28) AGG (31) GTG (34) ATG (37) AAT (40) TAA (43) CAA (46) TAA (49) AAA (52) TGG (55) TTA (58) AAA (61) ACC (64) CCG (67) ATA (70) TCG (73) TCG (76) CAG (79) GCG (82) TTG (85) CCG (88) CAC (91) TAA (94) AAG (97) ACC (100) ATC (103) GAC (106) CCT (109) ACG (112) TCG (115) TTG (118) GAT (121) TTG (124) CCG (127) CCG (130) AAA (133) CAA (136) ATA (139) ATG (142) TGG (145) AAG (148) AAT (151) ACG (154) CCC (157) GGC (160) AAA (163) AAC (166) GTA (169) TCC (172) GTA (175) AAA (178) ACC (181) TTG (184) ATC (187) TGA (190) TCT (193) GCG (196) CGA (199) ACG (202) ATG (205) TTT (208) CCC (211) AGC (214) CAA (217) CTC (220) AAG (223) GAT (226) TTA (229) ACA (232) GCG (235) ACA (238) ACA (241) ACG (244) CAT (247) TAC (250) ACC (253) TTT (256) TCT (259) GGC (262) AGG (265) ACG (268) GAG (271) ATA (274) AAG (277) TCT (280) TAC (283) CGC (286) TTG (289) AGC (292) GCA (295) AAG (298) AGC (301) TCC (304) TTG (307) GCC (310) AAT (313) TAT (316) TAC (319) TCG (322) ACG (325) AGA (328) TCG (331) TGA (334) CCC (337) GTT (340) ATG
(343) ATG (346) AAA (349) AAA (352) ATC (355) GAC (358) GTT (361) AAG (364) ATT (367) CTG (370) GAC (373) CCG (376) CGC (379) GTT (382) GGG (385) AAG (388) GAA (391) TTT (394) CCG (397) CTC (400) CCG (403) ACT (406) TAT (409) GCC (412) ACC (415) TCT (418) GGC (421) TCT (424) GCC (427) GGA (430) CTT (433) GAC (436) CTG (439) CGT (442) GCC (445) TGT (448) CTC (451) AAC (454) GAC (457) GCC (460) GTA (463) GAA (466) CTG (469) GCT (472) CCG (475) GGT (478) GAC (481) ACT (484) ACG (487) CTG (490) GTT (493) CCG (496) ACC (499) GGG (502) CTG (505) GCG (508) ATT (511) CAT (514) ATT (517) GCC (520) GAT (523) CCT (526) TCA (529) CTG (532) GCG (535) GCA (538) ATG (541) ATG (544) CTG (547) CCG (550) CGC (553) TCC (556) GGA (559) TTG (562) GGA (565) CAT (568) AAG (571) CAC (574) GGT (577) ATC (580) GTG (583) CTT (586) GGT (589) AAC (592) CTG (595) GTA (598) GGA (601) TTG (604) ATC (607) GAT (610) TCT (613) GAC (616) TAT (619) CAG (622) GGC (625) CAG (628) TTG (631) ATG (634) ATT (637) TCC (640) GTG (643) TGG (646) AAC (649) CGT (652) GGT (655) CAG (658) GAC (661) AGC (664) TTC (667) ACC (670) ATT (673) CAA (676) CCT (679) GGC (682) GAA (685) CGC (688) ATC (691) GCC (694) CAG (697) ATG (700) ATT (703) TTT (706) GTT (709) CCG (712) GTA (715) GTA (718) CAG (721) GCT (724) GAA (727) TTT (730) AAT (733) CTG (736) GTG (739) GAA (742) GAT (745) TTC (748) GAC (751) GCC (754) ACC (757) GAC (760) CGC (763) GGT (766) GAA (769) GGC (772) GGC (775) TTT (778) GGT (781) CAC (784) TCT (787) GGT (790) CGT (793) CAG (796) TAA (799) CAC (802) ATA (805) CGC (808) ATC (811) CGA (814) ATA (817) ACG (820) TCA (823) TAA (826) CAT (829) AGC (832) CGC (835) AAA (838) CAT (841) TTC (844) GTT (847) TGC (850) GGT (853) CAT (856) AGC (859) GTG (862) GGT (865) GCC (868) GCC (871) TGG (874) CAA (877) GTG (880) CTT (883) ATT (886) TTC (889) AGG (892) GGT (895) ATT (898) TTG (901) TAA (904)
CAT (907) GGC (910) AGA (913) AAA (916) ACA (919) AAC (922) TGC (925) GAA (928) AAG (931) GAA (934) CCG (937) TCG (940) CGA (943) GGA (946) AAT (949) ACT (952) TCA (955) GTC (958) TCT (961) GGC (964) GCT (967) GAT (970) GCT (973) GGA (976) ATC (979) CAG (982) CGA (985) TGG (988) AAG (991) CCA (994) ACG (997) TAT (1000) CAC (1003) GAC (1006) GGC (1009) AAA (1012) ACT (1015) GGC (1018) CGC (1021) CTC (1024) TGT (1027) CGG (1030) CGT (1033) TTC (1036) CGA (1039) AGC (1042) GGC (1045) ACT (1048) GTA (1051) TCG (1054) CCA (1057) CTT (1060) CCC (1063) CAG (1066) TAA (1069) GAC (1072) CCG (1075) CAT (1078) GTT (1081) CGA (1084) TAG (1087) CCT (1090) GAT (1093) TGA (1096) GTT (1099) TAT (1102) CGA (1105) AGA (1108) TAG (1111) CCT (1114) GAT (1117) TAC (1120) TCG (1123) CAT (1126) CAA (1129) CCT (1132) GAT (1135) TCT (1138) GAA (1141) AGA (1144) TGA (1147) GAA (1150) AGA (1153) CAC (1156) CAC (1159) AGC (1162) GCG (1165) CCT (1168) GCG (1171) TCT (1174) GAT (1177) TGT (1180) GTT (1183) GCT (1186) GCT (1189) TCT (1192) CGG (1195) TTT (1198) TGG (1201) TGA (1204) GCG (1207) TAA (1210) TCC (1213) TGG (1216) CCT (1219) GAC (1222) CCG (1225) CAT (1228) CCT (1231) CAC (1234) TGG (1237) TCA (1240) TGC (1243) GCT (1246) AAT (1249) GTT (1252) TGA (1255) ACA (1258) GGA (1261) TCG (1264) CCT (1267) GCA (1270) AGG (1273) GCG (1276) CAT (1279) CAA (1282) CCA (1285) GCT (1288) GTT (1291) CGA (1294) GCG (1297) TAT (1300) TGA (1303) AGC (1306) GCA (1309) GCT (1312) GCG (1315) CCA (1318) GGT (1321) ATT (1324) GCG (1327) TGA (1330) AAA (1333) GAG (1336) AAT (1339) GCG (1342) TGA (1345) GGG (1348) TGA (1351) AGG (1354) TTA (1357) CAC (1360) CAC (1363) CGA (1366) TGA (1369) AAC (1372) CCT (1375) GCT (1378) GGC (1381) AAG (1384) CCA (1387) GAT (1390) CCT (1393) GGC (1396) CTT (1399) CTG (1402) TGA (1405) AGG (1408) TAT (1411) GCT (1414) GTC (1417) ACG (1420) TTT (1423) TGT (1426) CCG (1429) CAG (1432) CGA (1435) ATT (1438) TAA (1441) ATA (1444) CCG (1447) CCC (1450) GAC (1453) GGA (1456) TGA 
(1459) TTT (1462) TGA (1465) CGC (1468) CCG (1471) CTG (1474) GCC (1477) GCT (1480) AAT (1483) TGC (1486) GGC (1489) CAG (1492) TTG (1495) CAG (1498) TAA (1501) TAT (1504) GAC (1507) GCC (1510) GGA (1513) TGA (1516) CTT (1519) TTC (1522) ATC (1525) CGG (1528) CGA (1531) GTT (1534) TCT (1537) TTA (1540) AAC (1543) GCC (1546) AAA (1549) CTC (1552) TTC (1555) GCG (1558) ATA (1561) GGC (1564) CTT (1567) AAC (1570) CGC (1573) CGC (1576) CAG (1579) ATG (1582) TTC (1585) CGC (1588) CAT (1591) TTC (1594) CGG (1597) CTT (1600) CTC (1603) TTC (1606) CAG
Then I simply counted the occurrences of the stop codons TAA and TAG in the coding and noncoding regions and compared them.
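The counting step could look something like the Python sketch below. It is an illustration under stated assumptions, not the program actually used: the function name count_stop_codons and its parameters are hypothetical, the triplets are read in the frame that starts at base 1 (as in the listing above), and the region boundaries have to be supplied by hand from the sequence record. The standard genetic code also has a third stop codon, TGA, which is not part of the count described here; the stops parameter makes it easy to include.

```python
def count_stop_codons(sequence, start, end, stops=("TAA", "TAG")):
    """Count stop codons among triplets lying fully inside positions start..end.

    Positions are 1-based and inclusive. Triplets are read in the frame that
    begins at position 1 of the whole sequence. Returns (stop_count, codon_count).
    """
    bases = "".join(sequence.split()).upper()
    stop_count = 0
    codon_count = 0
    for i in range(0, len(bases) - 2, 3):
        pos = i + 1  # 1-based position of the codon's first base
        if pos >= start and pos + 2 <= end:
            codon_count += 1
            if bases[i:i + 3] in stops:
                stop_count += 1
    return stop_count, codon_count


# First 48 bases of the sequence above: TAA appears at positions 40 and 46
first_48 = "CAGAGAAAATCAAAAAGCAGGCCACGCAGGGTGATGAATTAACAATAA"
print(count_stop_codons(first_48, 1, 48))
# (2, 16)
```

Calling it once per region (coding and noncoding) with the boundaries from the record gives the counts to compare.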
Result
In the non-coding region before the first red area, the stop codon TAA occurs 3 times, whereas it occurs only once in the first coding area. So the first hypothesis is supported: there is a clear difference in the frequency of stop codons between coding and noncoding areas. In the coding areas, a stop codon occurs only once, at the end of the coding sequence, whereas in non-coding areas stop codons occur much more frequently.
This isn't really anything new. Scientists already use this criterion to identify coding areas.
Also, the frequency of the stop codons in the noncoding area is 3 in 113 codons = 2.65%. This is similar to the frequency of a letter in the Hebrew alphabet; see here:
http://www.sttmedia.com/characterfrequency-hebrew
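The 2.65% figure is just the stop-codon count divided by the number of codons in the region. A tiny Python sketch of that arithmetic (the helper name percent_frequency is made up for illustration):

```python
def percent_frequency(count, total):
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


# 3 stop codons among 113 codons in the noncoding region
print(round(percent_frequency(3, 113), 2))
# 2.65
```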