Rather than trying to discover an amino acid alphabet by analysing DNA, it would be much easier to see whether the amino acids fall naturally into a sequence. We can arrange the amino acids according to:
1. their mass
2. the order in which their codons appear when you cycle through the genetic code.
Shcherbak discovered that if all the amino acids are arranged in a circle, then there is a perfect balance of masses - see here - www.craigdemo.co.uk/circleoflife.pdf. He thought it very odd that all the amino acids should sum in this way - almost as if they were conceived in one go. Shcherbak's pattern may hold the key to finding the amino acid alphabet.
What is interesting is that all the amino acids are together and form this perfect balance. The order of the amino acids in the circle is determined solely by cycling through the bases, in the order T C A G.
Cycling through the bases in the order TCAG generates the 64 codons in a particular order - consequently it generates the amino acids in a particular order - an amino acid alphabet.
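To make the idea concrete, here is a minimal Python sketch of one way to read "cycling through the bases". It assumes the first base of the codon changes slowest and the third base fastest (the layout of the usual codon table), that the standard genetic code is used, and that each amino acid takes its place in the alphabet at its first appearance, with stop codons skipped. The names `CODON_TABLE`, `codons_in_order` and `amino_acid_alphabet` are made up for this illustration.

```python
from itertools import product

# Standard genetic code, listed with the bases in T, C, A, G order
# (first base slowest, third base fastest); '*' marks a stop codon.
STANDARD_ORDER = "TCAG"
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(STANDARD_ORDER, repeat=3), STANDARD_AMINO_ACIDS)
}


def codons_in_order(base_order):
    """Return all 64 codons, cycling the third base fastest (an assumption)."""
    return ["".join(codon) for codon in product(base_order, repeat=3)]


def amino_acid_alphabet(base_order):
    """Amino acids in order of first appearance; stop codons are skipped."""
    alphabet = []
    for codon in codons_in_order(base_order):
        aa = CODON_TABLE[codon]
        if aa != "*" and aa not in alphabet:
            alphabet.append(aa)
    return "".join(alphabet)


print(codons_in_order("TCAG")[:5])  # ['TTT', 'TTC', 'TTA', 'TTG', 'TCT']
print(amino_acid_alphabet("TCAG"))  # FLSYCWPHQRIMTNKVADEG
```

Under these assumptions the TCAG alphabet starts F, L, S, Y. A different convention (for example, cycling the first base fastest) would give a different order, so the choice needs to be fixed before comparing alphabets.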
There are only 4 x 3 x 2 x 1 = 24 ways of choosing the order in which you cycle through the bases, so there can be only 24 different possible orders of the codons produced - which would give us 24 different amino acid alphabets.
It is then quite simple to test each alphabet to see which one generates the most meaningful translation.
So I will create a program that cycles through the bases in each of the 24 possible ways, each time generating the codons in a particular order. This will give me the 20 amino acids in a particular order each time. So I will end up with 24 amino acid alphabets - and can then test each one to see if it produces meaningful words.
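A rough sketch of such a program in Python might look like the following. It is only an illustration, and it assumes the standard genetic code, that the same base order is applied to all three codon positions with the third base cycling fastest, and that an amino acid enters the alphabet at its first codon (stop codons ignored). The names `CODON_TABLE`, `amino_acid_alphabet` and `alphabets` are made up for the example. The step of testing each alphabet for meaningful words is not covered here.

```python
from itertools import permutations, product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip(("".join(c) for c in product("TCAG", repeat=3)), AMINO_ACIDS)
)


def amino_acid_alphabet(base_order):
    """Amino acids in order of first appearance for one base order."""
    alphabet = []
    for codon in product(base_order, repeat=3):  # third base cycles fastest
        aa = CODON_TABLE["".join(codon)]
        if aa != "*" and aa not in alphabet:
            alphabet.append(aa)
    return "".join(alphabet)


# One alphabet for each of the 4 x 3 x 2 x 1 = 24 base orders.
alphabets = {
    "".join(order): amino_acid_alphabet(order)
    for order in permutations("TCAG")
}

for order, alphabet in alphabets.items():
    print(order, alphabet)

# How many of the 24 alphabets are actually different from one another.
print(len(set(alphabets.values())), "distinct alphabets")
```

The last line is there as a check: 24 base orders do not automatically guarantee 24 different alphabets, so it is worth counting the distinct ones before testing them.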