## Abstract

In this article, we present a mathematical framework based on redundant (non-power) representations of integer numbers as a paradigm for the interpretation of genomic information. The core of the approach relies on modelling the degeneracy of the genetic code. The model allows one to explain many features and symmetries of the genetic code and to uncover hidden symmetries. It also provides new tools for the analysis of genomic sequences. We briefly review three main areas: (i) the Euplotid nuclear code, (ii) the vertebrate mitochondrial code, and (iii) the main coding/decoding strategies used in the three domains of life. In every case, we show how the non-power model is a natural unified framework for describing degeneracy and deriving sound biological hypotheses on protein coding. The approach is rooted in number theory and group theory; nevertheless, we have kept the technical level to a minimum by focusing on key concepts and on the biological implications.

## 1. Introduction

In this article, we present an innovative mathematical description of the genetic code that opens new avenues in the interpretation of genetic information. The approach is based on number theory and was first proposed in [1]. Since then, new routes have been explored and have led to several advances, but the full potential of the approach has yet to be exploited. The main idea relies on modelling the degeneracy of the genetic code. Degeneracy is a universal property shared by all the variants of the genetic code and it might be associated with key biological functions. Our model is based on redundant numeration systems where an integer number can have more than one binary representation. We establish a correspondence between codons and length-6 binary strings, on the one side, and between amino acids and integer numbers, on the other side.

The article is structured as follows. In §2, we present a brief historical account of the mathematical attempts to understand the mechanisms of protein coding from an informational point of view since the discovery of DNA in 1953. After more than 50 years, the challenge is still open and indicates clearly that the quest for mathematical regularities is important since it is universally accepted that these are associated to physico-chemical properties. In §3, we outline the three main aspects of the problem. First, we show that the non-power model describes exactly the degeneracy distribution of the Euplotid nuclear genetic code. The analysis allows also the uncovering of hidden symmetries in its structure. The mathematical properties of the model have a biochemical interpretation that leads to the definition of dichotomic classes, nonlinear operators that allow one to code the genetic information in a new fashion and provide innovative tools for the statistical analysis of genomic sequences. The second aspect concerns the structure of the vertebrate mitochondrial code. We present a non-power model of the degeneracy distribution and a biological hypothesis on the origin of amino acid coding based on ancient tRNA adaptors acting on a special set of four-base codons that we call tesserae. Symmetry, again, is the key feature for connecting the mathematical model with the chemical and biological features of the genetic code. The third aspect concerns the coding strategies used in the different domains of life. Each strategy has an observed minimal number of adaptors needed to implement it. We show that such numbers can be obtained exactly by means of our non-power modelling. In §4, we present the conclusion and further perspectives.

## 2. Background and history

Since the discovery of the double helix structure of DNA by Watson and Crick in 1953, many mathematical hypotheses regarding how proteins are coded have been proposed [2]. Among the most remarkable are those of George Gamow (crystal model [3]) and Francis Crick (comma-free codes [4]). Despite the fact that both hypotheses were experimentally proved incorrect [5], Crick's proposal indicated clearly that coding theory is ‘the’ framework for interpreting the mechanisms of protein synthesis. We quote here the words of Gamow in a letter to Watson and Crick [6, p. 10]:

[I] think that this brings biology over into the group of ‘exact’ sciences […]. If your point of view is correct, and I am sure it is at least in its essentials, each organism will be characterized by a long number written in quadrucal^{1} system […]

Gamow trusted that this digital representation of genetic information would open a new era in the understanding of life.

Somehow, in the following years, the interest in a mathematical description of the genetic code faded away, and subsequent studies were more targeted to explain features of the code mainly on a biological and molecular basis. However, after more than 50 years, we think that the original expectation of Gamow still makes sense. The mathematical description of the regularities of the genetic code has important theoretical and practical consequences. From a theoretical point of view, it provides information about the origin and evolution of the code that is difficult to obtain otherwise. From a practical point of view, it might be useful to improve bioinformatics algorithms and, in medicine, for the diagnosis and therapy of genetic diseases. Indeed, many regularities in the genetic code were clear from the very beginning (e.g. [7,8]).^{2} For example, in most cases, codons sharing the first two nucleotides code for the same amino acid. Furthermore, some biochemical properties of amino acids are correlated with specific nucleotides in different codon positions: codons with the same nucleotide in the second position are associated with similar amino acids, and there is a correlation between the first base of codons and the precursors from which the encoded amino acids should be synthesized [9,10].

The genetic code is a translation table that connects the world of nucleic acids, where biological information is stored, to the world of proteins, the chemical bricks of cellular metabolism. The genetic information is stored in double helix DNA molecules. Part of this information is converted into the single helix messenger RNA (mRNA) through a process called *transcription*. In this process, thymine (T), one of the four bases that compose DNA (thymine (T), cytosine (C), adenine (A) and guanine (G)), is replaced by uracil (U). Counting from the *start* signal, every group of three contiguous bases in mRNA forms a *codon*. The genetic code assigns each of the 64 possible codons to one of the 20 amino acids (or to a stop signal) so as to form the protein polymeric chain. This process is called *translation*. Since there are 64 codons and 20 amino acids, the genetic code is not a one-to-one mapping, that is, more than one codon can code for the same amino acid. For this reason, amino acids are called degenerate and codons are called redundant or synonymous. The degeneracy of an amino acid is the number of codons that code for it, and the degeneracy distribution counts how many amino acids share a given degeneracy. All the known variants of the genetic code are degenerate, and each one has its peculiar degeneracy distribution. The variants can be grouped into two main classes, i.e. nuclear and mitochondrial. In table 1, we show the Euplotid nuclear and the vertebrate mitochondrial genetic codes; these two variants are the most symmetric within their classes. The corresponding degeneracy distributions are shown in table 2. In the following, we will show that there is an underlying strong mathematical structure and several hidden symmetries that can be uncovered by means of a unifying approach based on number theory.

## 3. A mathematical model of the genetic code

In order to model a genetic code and its degeneracy distribution, we need to find a non-bijective mapping that describes it. We use numeration systems where each amino acid is represented by an integer number and a codon by its 6-bit binary representation (2^{6}=64). For instance, the integer number 18 has the binary representation 010010 since 0⋅2^{5}+1⋅2^{4}+0⋅2^{3}+0⋅2^{2}+1⋅2^{1}+0⋅2^{0}=16+2=18. However, usual number representation systems (like the binary system) are not redundant since each integer number has a unique representation (i.e. a unique binary string). One way of creating a redundant numeration system is to replace the powers of the base 2 with a series that grows more slowly than 2^{n} [11]. A known example is the binary Fibonacci non-power system where the powers of 2 are replaced by the Fibonacci numbers, i.e. 1 1 2 3 5 8 (note that this series grows more slowly than the powers of 2). In this system, number 18 has two representations, 111011 and 111100: 1⋅8+1⋅5+1⋅3+0⋅2+1⋅1+1⋅1=18=1⋅8+1⋅5+1⋅3+1⋅2+0⋅1+0⋅1. The degeneracy distribution of the Fibonacci non-power system is different from that of all the known versions of the genetic code. In the following, we show the unique non-power solutions that describe exactly both the Euplotid nuclear and the vertebrate mitochondrial genetic codes. For a brief review on redundant numeration systems, see appendix A.
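The two numerical examples above can be checked by brute force. The following is a minimal sketch, not the authors' code: the helper name `representations` is hypothetical, and it simply enumerates all 64 binary strings of length 6 and keeps those whose weighted sum equals the target number.

```python
from itertools import product

def representations(n, weights):
    """Return every binary string whose weighted digit sum equals n."""
    found = []
    for digits in product((0, 1), repeat=len(weights)):
        if sum(d * w for d, w in zip(digits, weights)) == n:
            found.append("".join(map(str, digits)))
    return found

BINARY = (32, 16, 8, 4, 2, 1)    # powers of 2: every number has one string
FIBONACCI = (8, 5, 3, 2, 1, 1)   # Fibonacci weights: redundant system

binary_18 = representations(18, BINARY)
fibonacci_18 = representations(18, FIBONACCI)
print(binary_18)     # strings for 18 in the power system
print(fibonacci_18)  # strings for 18 in the Fibonacci system
```

The number of strings returned for a given integer is its degeneracy in that numeration system.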

### (a) The Euplotid nuclear genetic code

The degeneracy distribution inside quartets of the Euplotid nuclear genetic code is shown in table 2(left). We chose this version of the genetic code because it differs only in one codon assignment from the standard code and represents the most symmetric version of all nuclear codes; we will discuss in the second section the crucial role of symmetry in our approach. A quartet is a set of four codons that share the first two letters, for example CTT, CTC, CTA and CTG. The main motivation for the use of quartets is that the genetic code is implemented by means of tRNA adaptors that match anti-codons to amino acids. Because of wobble pairing, a tRNA can recognize up to four codons of the same quartet. For this reason, amino acids with degeneracy 6 (Arg, Leu, Ser) are composed of (at least) two elements, one of degeneracy 2 and one of degeneracy 4.

The key result is that the set of non-power weights (8,7,4,2,1,1) is the unique solution that describes exactly the degeneracy distribution of the Euplotid nuclear genetic code of table 2(left). It is possible to show that no other non-trivial solution exists. The study of the symmetries allows one to establish connections between the genetic code and the non-power system so as to obtain a proper mathematical model. The final result is the model shown in figure 1, where each codon is assigned a 6-bit binary string and each amino acid is assigned an integer number between 0 and 23. For example, the amino acid Asn (asparagine) is coded by AAT and AAC (degeneracy 2). The model associates Asn with number 18, and the two strings that represent it, 110110 and 110101, with codons AAT and AAC, respectively. The model possesses all the observed symmetries of the genetic code. For instance, the pyrimidine exchange (C↔T) in the third letter of the codon does not change the coded amino acid. This symmetry is particularly remarkable since it is shared by all the known variants of genetic codes (both nuclear and mitochondrial). Indeed, in the model, strings of the kind xxxx01 and xxxx10 represent the same integer number. This implies also that strings ending in 01 or 10 represent codons ending in C or T (pyrimidine), whereas strings ending in 00 or 11 represent codons ending with a purine (table 3). Even more striking, the analysis of the model allows one to uncover hidden symmetries in the genetic code. For a detailed analysis of the model and the rationale of its assignments, see [1,12].
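The numerical claims of this paragraph can be verified by enumeration. The sketch below is illustrative only (the names `value`, `degeneracy` and `distribution` are hypothetical): it computes the degeneracy of every integer under the weights (8,7,4,2,1,1), lists the strings that represent 18, and checks that swapping the last two bits between 01 and 10 never changes the represented number.

```python
from collections import Counter
from itertools import product

EUPLOTID_WEIGHTS = (8, 7, 4, 2, 1, 1)

def value(bits, weights=EUPLOTID_WEIGHTS):
    """Integer represented by a 6-bit string under the non-power weights."""
    return sum(int(b) * w for b, w in zip(bits, weights))

strings = ["".join(p) for p in product("01", repeat=6)]

# number -> how many strings represent it (its degeneracy)
degeneracy = Counter(value(s) for s in strings)
# degeneracy -> how many numbers have that degeneracy
distribution = Counter(degeneracy.values())

# Strings representing 18, the number the model associates with Asn
asn_strings = [s for s in strings if value(s) == 18]

# The last two weights are both 1, so xxxx01 and xxxx10 always coincide
pyrimidine_symmetry = all(
    value(s[:4] + "01") == value(s[:4] + "10") for s in strings
)

print(dict(sorted(distribution.items())))
print(asn_strings, pyrimidine_symmetry)
```

The resulting distribution has 24 elements in total, matching the integers 0 to 23 of figure 1.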

In [1,12–14], we have shown that the mathematical properties of the binary strings are linked to the chemical properties of the bases of the codons. The analysis led us to the definition of *dichotomic classes* that divide the 64 codons into two halves. The first dichotomic class is the *parity* of a codon, defined as the parity of the associated binary string. It is possible to show that this mathematical operation on the binary string coincides with a chemical algorithm acting on the last two bases of the codon. Note that the nucleotides U, C, A and G can be classified according to their chemical character as follows:

- purine (R = A, G) or pyrimidine (Y = U, C);
- amino (M = A, C) or keto (K = U, G);
- strong (S = C, G) or weak (W = A, U).

The algorithmic representation of the parity is shown in figure 2*a*. The algorithm can be described as follows. If the last letter of the codon is a purine (R = A, G), then the parity of the binary string is determined directly by that letter. If the last letter is a pyrimidine (Y = U, C), then the parity is inferred from the chemical class of the second base of the codon: amino (M = A, C) even, keto (K = U, G) odd. In addition to the parity, there are two more dichotomic classes. The three of them are based on chemical algorithms where the chemical characters involved are always the same in the three bases: YR on the third base, KM on the second base and SW on the first base.

Notably, if we shift the parity algorithm so as to act on the first two bases of the codon, we obtain a dichotomic class that coincides exactly with Rumer's class (figure 2*b*). It is a bi-partition of the genetic code observed by the theoretical physicist Rumer in the 1960s and is related to the degeneracy of amino acids inside quartets: 32 codons code for amino acids with degeneracy 4, whereas the other 32 code for amino acids with degeneracy non-4 (1, 2 or 3 in this case) (see also figure 1). Rumer also found out that the class is anti-symmetric with respect to the KM transformation (i.e. if we apply the KM transformation to a codon, the Rumer class changes). Note that the parity class is anti-symmetric with respect to the YR transformation.

At this point, a complete framework for dichotomic classes can be derived: if we shift the algorithm one more base towards the 5′ end of the nucleic acid, we can define a third dichotomic class, which we called the hidden class (figure 2*c*). The hidden class can be defined either inside a codon (first and third base) or between two contiguous codons and is anti-symmetric with respect to the SW transformation. It is possible to derive dichotomic classes in terms of matrix operators and show that they are nonlinear functions of the chemical properties of a dinucleotide. Moreover, they are related to the Klein V group of transformations.

Dichotomic classes provide a new, non-trivial way for recoding the information of nucleotide sequences in a binary code. This has several important implications for the statistical analysis of genomic data and for bioinformatics. In [14], we have shown that coding sequences are characterized by universal short-range correlations that show up only by looking at dichotomic classes. Moreover, in [15], we have shown that dichotomic classes can be used to design algorithms for frame retrieval.
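As an illustration of a dichotomic class, the sketch below encodes Rumer's class in two ways and checks that they agree. The first uses the eight first-two-base prefixes that define degeneracy-4 quartets in the standard code. The second is a two-step decision rule written from the description above (KM character of the second base, then SW character of the first); it is our reading of the text, not a transcription of figure 2*b*. The KM transformation is taken here to be the exchange U↔G, A↔C, which is an assumption about the notation. All function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"

# Prefixes whose whole quartet codes one amino acid (degeneracy 4):
# Leu CU, Val GU, Ser UC, Pro CC, Thr AC, Ala GC, Arg CG, Gly GG.
DEGENERACY_4_PREFIXES = {"CU", "GU", "UC", "CC", "AC", "GC", "CG", "GG"}

# Assumed KM transformation: exchange within keto (U<->G) and amino (A<->C).
KM_SWAP = str.maketrans("UGAC", "GUCA")

def rumer_from_table(codon):
    """True if the codon belongs to a degeneracy-4 quartet."""
    return codon[:2] in DEGENERACY_4_PREFIXES

def rumer_from_rule(codon):
    """Same class, obtained from the chemical character of two bases."""
    first, second = codon[0], codon[1]
    if second in "AC":        # amino second base: class is read off directly
        return second == "C"
    return first in "CG"      # keto second base: strong first base gives 4

codons = ["".join(c) for c in product(BASES, repeat=3)]

rule_matches_table = all(
    rumer_from_table(c) == rumer_from_rule(c) for c in codons
)
antisymmetric = all(
    rumer_from_table(c) != rumer_from_table(c.translate(KM_SWAP))
    for c in codons
)
n_degeneracy_4 = sum(rumer_from_table(c) for c in codons)

print(rule_matches_table, antisymmetric, n_degeneracy_4)
```

Mapping each codon of a sequence to the Boolean returned by such a function is one way to obtain the binary recoding of nucleotide sequences mentioned above.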

### (b) The mitochondrial genetic code

The main motivation that led to the mathematical model of the Euplotid nuclear code was the search for informational error detection/correction mechanisms. Indeed, we found the main ingredients of error detection/correction codes. For instance, there are connections between dichotomic classes, orthogonal arrays and finite groups. Also, redundancy/degeneracy is related to discrete symmetry groups. Can we hope to crack the hypothesized mechanisms of error detection/correction? The problem is a very difficult one but if these mechanisms are universally based on the genetic codes, each one with its own degeneracy, then it is natural to focus on the simplest life system that possesses a genetic code: the mitochondrion. The genetic code of the mitochondrion is the simplest and most symmetric of all code variants and has been proposed as a model for the early code [16,17], the progenitor of the universal genetic code of LUCA (Last Universal Common Ancestor).

The vertebrate mitochondrial genetic code differs from the standard nuclear genetic code in just four codons (table 1). However, the degeneracy distribution of the mitochondrial code is much simpler and more symmetric (table 2(right)). Here, amino acids have either degeneracy 2 or 4. Remarkably, it is possible to show that, also for the mitochondrial code, there is a unique non-power representation system that describes exactly its degeneracy distribution. The representation is determined by the six non-power weights (8,8,4,2,1,0). In table 4, we show the non-power weights for the two genetic codes. Notably, the simplification in the degeneracy distribution of the mitochondrial code is associated to the presence of a 0 weight (i.e. there are no elements with degeneracy 1). Since the 0 weight does not enter in the additive decomposition of a number, its effect is that of a duplication label. This implies that the representation is split in two identical halves. This fact has profound consequences in the description of the associated degeneracy and led us to a biological hypothesis on the origin of degeneracy in protein coding [18].
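The effect of the 0 weight can be made explicit with a short enumeration. This is a minimal sketch with hypothetical variable names: it groups the 64 strings by the integer they represent under (8,8,4,2,1,0) and checks that, for every integer, the strings ending in 0 and those ending in 1 share exactly the same first five bits.

```python
from collections import Counter, defaultdict
from itertools import product

MITO_WEIGHTS = (8, 8, 4, 2, 1, 0)

# integer -> list of 6-bit strings that represent it
by_number = defaultdict(list)
for bits in product("01", repeat=6):
    s = "".join(bits)
    n = sum(int(b) * w for b, w in zip(s, MITO_WEIGHTS))
    by_number[n].append(s)

# degeneracy -> how many integers have that degeneracy
distribution = Counter(len(strs) for strs in by_number.values())

# The last bit carries weight 0, so it only duplicates each representation
halves_identical = all(
    {s[:5] for s in strs if s[5] == "0"} == {s[:5] for s in strs if s[5] == "1"}
    for strs in by_number.values()
)

print(len(by_number), dict(sorted(distribution.items())), halves_identical)
```

Because every representation is duplicated, no integer can have odd degeneracy, which is why degeneracy 1 disappears.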

Similarly to the mathematical model, the biological hypothesis is based on symmetry properties. Symmetry is the key feature for connecting the mathematical model with the chemical and biological features of the genetic code. In brief, the main result is that the degeneracy distribution of the vertebrate mitochondrial genetic code can be exactly described by primeval tRNA adaptors acting on a particular set of four-base codons that we called tesserae (from the Greek tessera=*four*). These adaptors possess the reverse and the self-complementary symmetries. These symmetries imply the invariance of the interaction Hamiltonian with respect to such spatial transformations. An interesting possibility to be explored is that conserved quantities associated to these symmetries might have contributed heavily, in evolutionary terms, to the shape of present codes. The tessera set is presented in table 5.

Our results on the study of the mitochondrial genetic code have important implications regarding: (i) molecular evolution and the origin of protein coding, (ii) the information-based error detection/correction mechanisms in the synthesis machinery, and (iii) the description of decoding strategies in the three domains of life in terms of wobble symmetries and redundant representation systems. In evolutionary terms, the proposed decoding system is placed before the early code, the code that preceded the universal genetic code of LUCA. The early code has been hypothesized to have the same symmetry as the mitochondrial genetic code [17,19,20]. Our theory complements this hypothesis since our tessera code, possessing the same symmetry as the present mitochondrial genetic code, can be seen as a pre-early code, an ancestor of the early code. The tessera code exhibits error-detecting capabilities (complete immunity to point errors) and +1 frame-shift immunity. Such features explain why evolutionary pressure might have led to the selection of this mechanism. In fact, accuracy in protein synthesis must have been preserved to some extent in extant organisms, and the preservation of the degeneracy distribution over geological times might be the molecular evidence of such an origin. Note that it has been proposed that extant triplet codes derive from primeval codes having codons with more than three nucleotides [21]. This could ensure the necessary bonding stability for direct decoding without ribosomes. Moreover, the feasibility of decoding codons of four letters has been demonstrated with evolved extant ribosomes [22]. In [23], ribosomes were evolved to reduce premature termination; in [24], such ribosomes were further evolved so as to obtain the same fidelity and efficiency as with triplet codons. These ribosomes were termed ribo-Q due to their capability to efficiently read quadruplet codons.

Regarding the connection with extant decoding strategies, both the non-power model and the hypothesis of primeval symmetric tRNAs provide an exact prediction about the number of adaptors needed to implement the code. Note that this number is 22, the minimal number used in all extant forms of life. Amino acid decoding is implemented by tRNA adaptors, so that one might state that the actual degeneracy distribution is observed at the level of tRNAs. The case of the mitochondrial genetic code is a peculiar one, not only because it uses the minimal number of adaptors, but also because the degeneracy distribution of the code coincides with the degeneracy distribution of its tRNAs, namely, each element of the degeneracy distribution is described by exactly one adaptor (each amino acid is associated to exactly one adaptor). This is an extreme case of the decoding strategies used in extant forms of life.

### (c) Coding strategies

A hypothesis about the evolution of the genetic code from the putative early code is that the decoding of all codons was performed with a minimum number of tRNA species [17]. Since post-transcriptional modifications of the first base of the anti-codon differ widely between organism lines, such modifications are supposed to have been introduced after the birth of the early code. However, the constraint of the minimum number of adaptors also holds for different extant decoding strategies, developed after the introduction of post-transcriptional modifications. The work in [25] provides a summary of the coding strategies in the different domains of life and associates the minimal number of adaptors with each strategy. Remarkably, in all cases, this number can be predicted by the non-power modelling approach (table 6). In order to see this, we show how to predict the minimal number of adaptors needed to decode the 61 codons (64−3 stop codons) plus one additional tRNA for the elongation of Met (recall that Met also indicates the start of protein synthesis).

In the mitochondrial case, this elongation tRNA is not present and we have two groups of two stop codons. Thus, the non-power representation for the vertebrate mitochondrial code (8,8,4,2,1,0) codes a total of 24 elements and predicts 22 tRNA adaptors when the two groups of stop codons are subtracted (24−2=22). In [25], this is denoted as strategy III and allows the decoding of an entire family (super wobble) by using a non-modified U in position 34, the main wobble position (first base) of the anti-codon. This strategy is used only in bacteria and mitochondria and implies that, inside a quartet, amino acids have either degeneracy 4 or 2+2.

In strategy I, codons ending in a pyrimidine (T or C) are decoded by a post-transcriptionally modified G in position 34 for Archaea and Bacteria, or a modified A in Eukarya. Codons ending in a purine (A or G) are recognized separately by a modified U or C in the same position of the anti-codon. In this strategy, the minimum number of adaptors is 46 and the degeneracy distribution inside quartets is of the type 2+1+1. We describe it with the non-power representation (16,16,8,4,2,1) that gives a total of 48 elements. Now, if we subtract three elements corresponding to the stop codons (they all end with a purine base so that they correspond to a single adaptor) and add an additional tRNA for the elongation Met, we have exactly the minimal number of adaptors for strategy I, i.e. 48−3+1=46.

Finally, in strategy II, a family is decoded by means of two tRNAs, i.e. post-transcriptionally modified U and G in position 34 for Archaea and Bacteria, or post-transcriptionally modified A and U for Eukarya. The minimal number of adaptors for this strategy is 33. In the general case, the degeneracy inside quartets can be described as 2+2. Thus, we can represent the degeneracy associated with the strategy through the non-power weights (16,8,4,2,1,0). Similar to the representation of the mitochondrial genetic code, there is a 0 weight. Here, the total number of elements is 32, but if we take into account the stop codons and the special decoding boxes (Ile-Met and Cys-Stop-Trp) plus one additional tRNA for elongation Met, we obtain 33, which is exactly the minimal predicted value for this strategy.
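The element counts quoted for the three strategies follow directly from the weights. The sketch below (hypothetical names, not the authors' code) counts the distinct integers representable with each weight set and then reproduces the two adaptor counts whose arithmetic is spelled out in the text. The correction for strategy II is not reproduced, since the text gives only its result (33) and not the individual terms.

```python
from itertools import product

STRATEGY_WEIGHTS = {
    "I": (16, 16, 8, 4, 2, 1),
    "II": (16, 8, 4, 2, 1, 0),
    "III": (8, 8, 4, 2, 1, 0),   # vertebrate mitochondrial code
}

def n_elements(weights):
    """Number of distinct integers representable with 6 binary digits."""
    return len({
        sum(d * w for d, w in zip(digits, weights))
        for digits in product((0, 1), repeat=len(weights))
    })

elements = {name: n_elements(w) for name, w in STRATEGY_WEIGHTS.items()}

# Strategy III: subtract the two groups of stop codons.
adaptors_III = elements["III"] - 2
# Strategy I: subtract three stop elements, add the elongation tRNA for Met.
adaptors_I = elements["I"] - 3 + 1

print(elements, adaptors_I, adaptors_III)
```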

## 4. Conclusion

In this article, we have shown that both the nuclear and the mitochondrial variants of the genetic code can be modelled by using non-power representation systems of integer numbers. The model describes exactly many salient features of the genetic code, including its symmetry properties. Note that there is a wide literature on the idea that the genetic code is related to symmetry properties and, eventually, to symmetry breaking (see for instance the seminal work of Hornos & Hornos [26]). That paper prompted a series of attempts to apply continuous and discrete symmetry groups and algebras to the problem [27–29]. However, this approach has not been well received by biologists; as Maddox states [30]: ‘The application of the theory of mathematical groups to the origin of the genetic code will startle molecular biologists, but is best regarded as a valuable exercise in classification’. For further works about symmetry and symmetry breaking, see among others [31–37]. We argue that the above criticism does not apply to our model since it describes naturally the degeneracy of the actual genetic code without recourse to an improbable chain of symmetry breaking steps. Moreover, it allows a clear biological interpretation of the results.

The mitochondrial genetic code has been proposed as a model for the code that gave rise to the universal genetic code (the code characterizing the LUCA). This pre-universal code has been called the early code. The non-power representation of the vertebrate mitochondrial genetic code allows one to infer a biological hypothesis for the organization of the code that predated the early code and for the origin of protein coding based on the symmetries of ancient tRNA adaptors and a set of four-base codons (tesserae) [18]. This paradigmatic genetic code exhibits strong properties of error detection and correction which might be responsible for its selection in evolutionary terms (synthesis accuracy and expression fitness). Furthermore, the model provides an explanation of why the present genetic code is characterized precisely by 64 codons and 20+2 amino acids. The non-power approach (i) describes the evolution starting from putative ancient codes characterized by a small number of amino acids and coding words and (ii) supports an origin of protein synthesis with adaptors containing oligonucleotides longer than three bases and a ribosome-less direct template synthesis.

Finally, we recall that it is now widely accepted that the genetic code is optimized for conveying more information than the linear coding of proteins [38,39]. The degeneracy of amino acids allows the use of synonymous codons for conveying additional information. This in turn affects the distribution of codons in coding sequences and produces the so-called ‘codon bias’ [40–42]. Codon bias correlates with translation efficiency (accuracy and speed of protein synthesis) and many other key cellular processes from differential protein production to protein folding [43,44]. The mathematical structure of the model and its evolutionary implications lead naturally to the problem of protein synthesis accuracy and efficiency related to mechanisms of error detection/correction in terms of point mutations and frame-shift. In this respect, there are connections with the theory of comma-free and circular codes. These are important for studying the problem of retrieving and maintaining the correct reading frame [45–48]. The non-power approach allows one to study sequences of DNA and mRNA, including their codon bias, under a new perspective [13–15,49] and can lead to new bioinformatics tools and new applications in medicine.

## Authors' contributions

All the authors contributed equally to this work.

## Competing interests

We declare we have no competing interests.

## Funding

We received no funding for this study.

## Appendix A. Redundant numeration systems and the degeneracy of the genetic code

Historically, the notion of number has been introduced to count the elements of a set or to compare quantitatively two or more sets. A numeration system is a way of expressing a number by using a string of other numbers, called digits, so as to facilitate its management, for instance for the implementation of arithmetic operations. Nowadays, a proper choice of a numeration system may be crucial for solving specific problems and developing and improving mathematical models and algorithms [11]. A *positional* numeration system allows one to represent numbers by means of a set of digits *d*_{i} and a set of weights *w*_{i}. A given integer number *N* has the following additive representation:

$$N=\sum_{i=0}^{k-1} d_i\, w_i .$$

In other words, a system is positional if each digit *d*_{i} is weighted with a different value *w*_{i} according to its position. For practical reasons, e.g. commerce, accounting, etc., these representations have always been univocal since every number has a unique representation. One notable exception is the Maya representation system called serpent number used in their Long Count calendar to describe astronomical times [50].

**(a) Power numeration systems**

Usual numeration systems are based on the additive decomposition of a number using the powers of a base *b* as weights, i.e. *w*_{i}=*b*^{i}. Each integer number 0≤*N*≤*b*^{k}−1 has a unique representation of the kind

$$N=\sum_{i=0}^{k-1} d_i\, b^{i} .$$

It is easy to prove that power numeration systems are univocal since they satisfy the following two conditions:

(i) The digits *d*_{i} range from 0 to *b*−1: 0≤*d*_{i}≤*b*−1.

(ii) The weights are the powers of the base *b*: *w*_{i}=*b*^{i}.

Note that the two conditions imply that $N\le\sum_{i=0}^{k-1}(b-1)\,b^{i}=b^{k}-1$, which is the maximum representable number, the minimum being zero. Here, we focus on numeration systems that allow one to represent all the integer numbers in this interval (for instance, if *w*_{i}>*b*^{i} some numbers do not have a representation).

### Example A.1

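Representation of the number seventeen both in the binary and in the decimal system.

$$\text{binary: } 010001 = 0\cdot2^{5}+1\cdot2^{4}+0\cdot2^{3}+0\cdot2^{2}+0\cdot2^{1}+1\cdot2^{0}$$

$$\text{decimal: } 17 = 1\cdot10^{1}+7\cdot10^{0}$$

- In both cases, seventeen has a unique representation/string (degeneracy 1).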

Note that in example A.1, we used the term ‘seventeen’ (in letters) on purpose in order to distinguish between the concept of a number (seventeen) and its representations (strings).

The binary system plays a central role since it is used by computers to represent numbers. Hence, as in the example above, from now on we adopt the 6-bit binary system and the usual decimal system in place of the letters so that, for instance, we use 17 in place of seventeen. In the binary example above, the digits range from 0 to *b*−1=1, i.e. *d*_{i}∈{0,1}. Also, there are 2^{6}=64 represented numbers that range from 000000=0 to 111111=63. In order to build a redundant numeration system, we need to relax at least one of the two conditions above as in the following two examples.

**(b) Signed-digits numeration systems**

In this system, the digits *d*_{i} do not satisfy the condition 0≤*d*_{i}≤*b*−1.

### Example A.2

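Representation of the number seventeen in the signed binary system (*w*_{i}=2^{i}) with *d*_{i}∈{−1,0,1}.

$$17 = 0\cdot2^{5}+1\cdot2^{4}+0\cdot2^{3}+0\cdot2^{2}+0\cdot2^{1}+1\cdot2^{0} = 1\cdot2^{5}-1\cdot2^{4}+0\cdot2^{3}+0\cdot2^{2}+0\cdot2^{1}+1\cdot2^{0}$$

- 17 is no longer represented uniquely: 010001 and 1$\bar{1}$0001 are two of its representations (we use $\bar{1}$ to denote −1).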

**(c) Non-power numeration systems**

In non-power systems, the weights *w*_{i} grow more slowly than the power of the base *b*, i.e. *w*_{i}≤*b*^{i}.

### Example A.3

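Representation of the number 17 in the binary Fibonacci system: *w*_{i}=*F*_{i}=*F*_{i−1}+*F*_{i−2}, with *F*_{0}=*F*_{1}=1, so that the six weights are (8,5,3,2,1,1).

$$17 = 8+5+3+1 \;\;(111001 \text{ or } 111010) \;=\; 8+5+2+1+1 \;\;(110111)$$

- With these weights, 17 has three representations (degeneracy 3): 111001, 111010 and 110111.

- Note that the Fibonacci sequence 1,1,2,3,5,8 grows more slowly than the powers of two, 1,2,4,8,16,32.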

**(d) Degeneracy distribution**

Every numeration system possesses a degeneracy table that records the degeneracy *D*(*N*) of each number *N*. The degeneracy table of the 6-bit binary system presented in example A.1 is

$$D(N)=1 \quad \text{for } N=0,1,\dots,63 .$$

The table is trivial: every number from 0 to 63 has a unique representation. The degeneracy table of the binary Fibonacci system presented in example A.3 is more interesting:

| *N* | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *D*(*N*) | 1 | 2 | 2 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | 4 | 5 | 4 | 3 | 4 | 3 | 3 | 3 | 2 | 2 | 1 |

It is easy to show that *D*(*N*)=*D*(*K*−*N*), i.e. *N* and *K*−*N* have the same degeneracy, where $K=\sum_{i} w_i$ is the maximum representable number. In the binary Fibonacci system above *K*=20 and, for example, 1 and 19 have the same degeneracy (i.e. 2). This implies that the degeneracy table is symmetric. From the degeneracy table, one can derive the degeneracy distribution that counts how many numbers have the same degeneracy. The degeneracy distribution of univocal numeration systems is trivial since all the numbers have a unique representation (degeneracy 1). In the following, we show the degeneracy distribution of the three binary systems shown in the above examples.
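The symmetry *D*(*N*)=*D*(*K*−*N*) holds because complementing every bit of a string that represents *N* gives a string that represents *K*−*N*. The sketch below, with hypothetical function and variable names, computes the degeneracy tables of the 6-bit binary and Fibonacci systems by enumeration and checks the symmetry numerically.

```python
from collections import Counter
from itertools import product

def degeneracy_table(weights):
    """List D(0), D(1), ..., D(K) for a binary system with the given weights."""
    counts = Counter(
        sum(d * w for d, w in zip(digits, weights))
        for digits in product((0, 1), repeat=len(weights))
    )
    return [counts[n] for n in range(sum(weights) + 1)]

binary = degeneracy_table((32, 16, 8, 4, 2, 1))
fib = degeneracy_table((8, 5, 3, 2, 1, 1))

K = len(fib) - 1                       # maximum representable number
symmetric = all(fib[n] == fib[K - n] for n in range(K + 1))

# degeneracy -> how many numbers share it (the degeneracy distribution)
fib_distribution = dict(sorted(Counter(fib).items()))

print(fib)
print(K, symmetric, fib_distribution)
```

The same function applied to any other set of six weights gives the table from which the corresponding degeneracy distribution can be read off.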

**(e) The non-power model of the genetic code**

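None of the three systems above matches the degeneracy distributions inside quartets of either the nuclear or the mitochondrial genetic code shown in table 2. In both cases, the unique exact non-power solution is given by the following set of weights:

$$\text{Euplotid nuclear code: } (8,7,4,2,1,1), \qquad \text{vertebrate mitochondrial code: } (8,8,4,2,1,0).$$

The complete non-power representation for the two systems is given in table 7. The associated degeneracy distributions below match exactly those of the genetic codes of table 2:

| Degeneracy | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Numbers, weights (8,7,4,2,1,1) | 2 | 12 | 2 | 8 |
| Numbers, weights (8,8,4,2,1,0) | 0 | 16 | 0 | 8 |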

For example, in the Euplotid nuclear code, there are two amino acids (Ile, Cys) that are represented by three codons each (table 1). In the same way, in the non-power model, numbers 7 and 16 have degeneracy 3 since they are represented by three binary strings each (in green).

We have seen that the degeneracy distributions of both the Euplotid nuclear and the vertebrate mitochondrial genetic codes admit a non-power representation. Is this a fortunate coincidence?

The degeneracy distributions that admit a non-power representation are extremely rare. In order to see this, note that symmetry of the degeneracy table is a necessary condition for the existence of a non-power representation. The ratio of the number of symmetric degeneracy tables to the total number of possible tables gives a coarse assessment of this fact under the assumption that every table occurs with the same probability.

Counting the number of degeneracy tables is analogous to what in physics is called the Bose–Einstein statistics, that is, counting the ways one can distribute *k* indistinguishable balls into *n* labelled urns. In our 6-bit binary system, we have 64 strings/balls and 24 numbers/urns. Hence, the total number of tables is

$$\binom{64+24-1}{24-1}=\binom{87}{23}\approx 6.43\times10^{20}.$$

A symmetric table is fixed by its first 12 entries, which must account for 32 of the 64 strings, so the number of symmetric tables is

$$\binom{32+12-1}{12-1}=\binom{43}{11}\approx 5.75\times10^{9},$$

and the ratio results in 8.95×10^{−12}. Note that this is just a necessary condition and the actual number of degeneracy tables that admit a non-power representation might be much smaller. This is an important indication that the non-power representation of the genetic code cannot be seen as a sheer product of chance. More importantly, the genetic code and the non-power model share a number of symmetry properties that extend far beyond the degeneracy distribution [12] and that reflect a deep mathematical connection between them.
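The ratio quoted above can be reproduced with the standard stars-and-bars count. The sketch below assumes that urns may be empty (the counting that is consistent with the quoted value) and that a symmetric table is obtained by distributing 32 pairs of strings over 12 pairs of numbers; the function name `occupancy_tables` is hypothetical.

```python
from math import comb

def occupancy_tables(balls, urns):
    """Ways to place indistinguishable balls into labelled urns (empty urns allowed)."""
    return comb(balls + urns - 1, urns - 1)

total = occupancy_tables(64, 24)       # all degeneracy tables: C(87, 23)
symmetric = occupancy_tables(32, 12)   # tables with D(N) = D(23 - N): C(43, 11)
ratio = symmetric / total

print(total, symmetric, f"{ratio:.3g}")
```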

## Footnotes

One contribution of 21 to a theme issue ‘DNA as information’.

^{1} That is, quaternary (authors’ note).

^{2} For the first time, the three articles by Rumer on the symmetry of the genetic code have been translated from Russian and are included in this issue.

- Accepted October 27, 2015.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.