## Abstract

The circular code theory proposes that genes are constituted of two trinucleotide codes: the classical genetic code with 61 trinucleotides for coding the 20 amino acids (except the three stop codons {*TAA*,*TAG*,*TGA*}) and a circular code based on 20 trinucleotides for retrieving, maintaining and synchronizing the reading frame. It relies on two main results: the identification of a maximal *C*^{3} self-complementary trinucleotide circular code *X* in genes of bacteria, eukaryotes, plasmids and viruses (Michel 2015 *J. Theor. Biol.* 380, 156–177. (doi:10.1016/j.jtbi.2015.04.009); Arquès & Michel 1996 *J. Theor. Biol.* 182, 45–58. (doi:10.1006/jtbi.1996.0142)) and the finding of *X* circular code motifs in tRNAs and rRNAs, in particular in the ribosome decoding centre (Michel 2012 *Comput. Biol. Chem.* 37, 24–37. (doi:10.1016/j.compbiolchem.2011.10.002); El Soufi & Michel 2014 *Comput. Biol. Chem.* 52, 9–17. (doi:10.1016/j.compbiolchem.2014.08.001)). The univerally conserved nucleotides A1492 and A1493 and the conserved nucleotide G530 are included in *X* circular code motifs. Recently, dinucleotide circular codes were also investigated (Michel & Pirillo 2013 *ISRN Biomath.* 2013, 538631. (doi:10.1155/2013/538631); Fimmel *et al.* 2015 *J. Theor. Biol.* 386, 159–165. (doi:10.1016/j.jtbi.2015.08.034)). As the genetic motifs of different lengths are ubiquitous in genes and genomes, we introduce a new approach based on graph theory to study in full generality *n*-nucleotide circular codes *X*, i.e. of length 2 (dinucleotide), 3 (trinucleotide), 4 (tetranucleotide), etc. Indeed, we prove that an *n*-nucleotide code *X* is circular if and only if the corresponding graph is acyclic. Moreover, the maximal length of a path in corresponds to the window of nucleotides in a sequence for detecting the correct reading frame. Finally, the graph theory of tournaments is applied to the study of dinucleotide circular codes. It has full equivalence between the combinatorics theory (Michel & Pirillo 2013 *ISRN Biomath.* 2013, 538631. (doi:10.1155/2013/538631)) and the group theory (Fimmel *et al.* 2015 *J. Theor. Biol.* 386, 159–165. (doi:10.1016/j.jtbi.2015.08.034)) of dinucleotide circular codes while its mathematical approach is simpler.

## 1. Introduction

Trinucleotide codes, such as the genetic code, provide a fascinating theory that combines the search for solutions to old and open problems with modern techniques from different fields of science. About 60 years ago, before the discovery of the genetic code, a class of trinucleotide codes, called comma-free codes, was proposed by Crick *et al.* [1] for explaining how the reading of a sequence of trinucleotides could code amino acids. In particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {*A*,*C*,*G*,*T*} as well as the 16 dinucleotides {*AA*,…,*TT*} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick *et al.* [1] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {*AAA*,…,*TTT*} code the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame—the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:

(i) A periodic trinucleotide from the set {

*AAA*,*CCC*,*GGG*,*TTT*} must be excluded from such a code. Indeed, the concatenation of*AAA*with itself, for instance, does not allow the (original) reading frame to be retrieved as there are three possible decompositions: …*AAA*,*AAA*,*AAA*… (original frame), …*A*,*AAA*,*AAA*,*AA*… and …*AA*,*AAA*,*AAA*,*A*…, the commas showing the adopted decomposition.(ii) Two non-periodic permuted trinucleotides, i.e. two trinucleotides related by a circular permutation, e.g.

*ACG*and*CGA*, must also be excluded from such a code. Indeed, the concatenation of*ACG*with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: …*ACG*,*ACG*,*ACG*… (original frame) and …*A*,*CGA*,*CGA*,*CG*…

Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g. *ACG*, *CGA* and *GAC*, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity. A few combinatorial results on trinucleotide comma-free codes were obtained by Golomb *et al.* [2,3]. However, no trinucleotide comma-free code was identified in genes statistically. Furthermore, at the beginning of the 1960s, the discovery that the trinucleotide *TTT*, an excluded trinucleotide in a comma-free code, codes phenylalanine [4] led to the abandonment of the concept of comma-freeness over the alphabet {*A*,*C*,*G*,*T*}. For several biological reasons, in particular the interaction between mRNA and tRNA, this concept was again taken up later over the purine/pyrimidine alphabet {*R*,*Y* } (*R*={*A*,*G*}, *Y* ={*C*,*T*}) with two trinucleotide comma-free codes: *RRY* [5] and *RNY*={*RRY*,*RYY*} (*N* being any letter on {*R*,*Y* }) [6]. Some statistical results studying and identifying these two comma-free codes were obtained at the sequence level by Shepherd [7] and at the population level by Michel [8]. In 1986, it was shown that introns, in contrast to exons, have no nucleotide periodicity modulo 3 ([9], fig. 2, with a statistical analysis of 90 introns). One year later, with the increase in sequence data, a nucleotide periodicity modulo 2 was identified in introns by two different statistical methods [10,11]. So far, no circular code has been found in introns.

In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides {*AAA*,…,*TTT*} in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [12]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of highest occurrence frequency), three subsets *X*=*X*_{0}, *X*_{1} and *X*_{2} of 20 trinucleotides were found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide) and 2 (frame 0 shifted by two nucleotides) in genes of both prokaryotes and eukaryotes. This set *X* contains the following 20 trinucleotides:

The two sets *X*_{1} and *X*_{2}, of 20 trinucleotides each, in the shifted frames 1 and 2, respectively, of genes can be deduced from *X* by a circular permutation. These three trinucleotide sets present several strong mathematical properties, particularly the fact that *X* is a maximal *C*^{3} self-complementary trinucleotide circular code [12]. The subset {*CAG*,*CTC*,*CTG*,*GAG*} of the circular code *X* is a trinucleotide comma-free code and, furthermore, *C*^{3} and self-complementary [13]. Comma-free codes are important codes as they represent a limit class of circular codes with three particular properties: (i) no word of a comma-free code is found in a shifted frame; (ii) the length of reading frame retrieval is the shortest; and (iii) the probability of reading frame coding is maximal and equal to 1 [14]. In 2015, by quantifying the approach used in 1996 for identifying a preferential frame and by applying a massive statistical analysis of gene taxonomic groups, the circular code *X* was strengthened in genes of prokaryotes (7 851 762 genes, 2 481 566 882 trinucleotides) and eukaryotes (1 662 579 genes, 824 825 761 trinucleotides), and has now also been identified in genes of plasmids (237 486 genes, 68 244 356 trinucleotides) and viruses (184 344 genes, 45 688 798 trinucleotides) [15]. Several non-maximal *C*^{3} self-complementary circular codes have been identified in genes of viruses (table 1 in [15]) which are all subsets of *X* (table 7d in [15]):

—

*X*∖{*CAG*,*CTG*,*GTA*,*TAC*} of 16 trinucleotides in genes of double-stranded DNA viruses (172 198 genes, 39 934 299 trinucleotides) and single-stranded RNA viruses (4492 genes, 3 510 773 trinucleotides);—

*X*∖{*ACC*,*CAG*,*CTC*,*CTG*,*GAG*,*GCC*,*GGC*,*GGT*} of 12 trinucleotides in genes of double-stranded RNA viruses (973 genes, 654 931 trinucleotides);—

*X*∖{*CAG*,*CTC*,*CTG*,*GAG*,*GCC*,*GGC*,*GTA*,*TAC*} of 12 trinucleotides in genes of single-stranded DNA viruses (3562 genes, 796 401 trinucleotides);—

*X*∖{*AAC*,*ACC*,*CAG*,*CTG*,*GCC*,*GGC*,*GGT*,*GTA*,*GTT*,*TAC*} of 10 trinucleotides in genes of retro-transcribing viruses (559 genes, 269 070 trinucleotides); and—

*X*∖{*CAG*,*CTG*,*CTC*,*GAG*,*GTA*,*TAC*} of 14 trinucleotides in genes of phages (2560 genes, 523 324 trinucleotides).

A trinucleotide circular code has the fundamental property to always retrieve the reading frame in any position of any sequence generated with the circular code. In particular, initiation and stop trinucleotides as well as any frame signals are not necessary to define the reading frame. Indeed, a window of a few nucleotides, whose nucleotide length depends on the class of circular codes, positioned anywhere in a sequence generated with the circular code always retrieves the reading frame.

As an example let us take the word *w*=…*AGGTAATTACCAG*… of the circular code *X*. Is the first nucleotide of *w*, i.e. *A*, the 1st, the 2nd or the 3rd nucleotide of a trinucleotide of *X*? By trying the three possible factorizations (frames) *w*_{0}, *w*_{1} and *w*_{2} (*w*_{1} and *w*_{2} being *w*_{0} shifted by one and two nucleotides, respectively) into trinucleotides of *X*, only one factorization, i.e. *w*_{1}, is possible. Thus, the first nucleotide *A* of *w* is the 3rd nucleotide of a trinucleotide of *X*. Indeed, the factorization *w*_{1} leads to the trinucleotides *NNA*, *GGT*, *AAT*, *TAC* and *CAG* (*N* being any appropriate letter of *X*) which belong to *X*. The factorizations *w*_{0} and *w*_{2} are impossible as no trinucleotide of *X* starts with the prefix *AG*. This case occurs immediately for *w*_{0} and after 11 letters for *w*_{2}. Thus, the unique factorization of *w* is *w*_{1}=…*A*,*GGT*,*AAT*,*TAC*,*CAG*,…. This word *w* can be located anywhere in a sequence of *X*, i.e. the sequence of *X* does not require an initiator codon, a stop codon or any frame signal to retrieve the reading frame. The word *w*′=*AGGTAATTACCA* (*w* without the last *G*) with a length of 12 nucleotides is ambiguous as it has two factorizations *w*_{1} and *w*_{2} into trinucleotides of *X*. The word *w*′ is called an ambiguous word of *X*. By definition of a circular code, all the ambiguous words are finite words. The word *w*′, taken as an example here, is one of the four longest ambiguous words of *X* (see below). Thus, the window length *l* to retrieve the construction frame of a word of a circular code *Y* is the letter length of the longest ambiguous words *w*′, plus one letter. With the circular code *X*, *l*=12+1=13 nucleotides [16]. The window lengths *l* for the trinucleotide circular codes *X*_{1} and *X*_{2} are also equal to *l*=13 nucleotides [16]. In conclusion, the retrieval of the reading frame with the circular code *X*, the frame 1 with the circular code *X*_{1} and the frame 2 with the circular code *X*_{2} needs the same window length *l* of 13 nucleotides (*l*≥13).

In 2012, in addition to the circular code *X* in genes (mRNA), a second major step of this circular code theory was revealed by the identification of *X* motifs, i.e. motifs generated with the circular code *X*, in the 5′ and/or 3′ regions of tRNAs of prokaryotes and eukaryotes [13,17] and 16S rRNAs, in particular in the ribosome decoding centre where the universally conserved nucleotides A1492 and A1493 and the conserved nucleotide G530 are included in the *X* motifs [13,18]. A three-dimensional visualization of *X* motifs in the ribosome shows several spatial configurations involving mRNA *X* motifs, tRNA *X* motifs and 16S rRNA *X* motifs [13,18]. These results led to the concept of a possible translation (framing) code based on the circular code which was proposed in Michel [13].

Trinucleotides are the fundamental words for genes. Dinucleotides are also words with important biological functions in genomes as they are involved in some genome sites, e.g. the splice sites of introns in eukaryotic genomes are based on the dinucleotides *GT* and *AT* [19,20]; and in some genome regions, e.g. the dinucleotide *CG* in animal and plant genomes allows a positive or negative control over gene expression [21]; the dinucleotides *CA* [22,23], *CT* [24] and *TG* [25] in eukaryotic genomes occur as concatenated words (*l*_{1}*l*_{2})^{+}, *l*_{1},*l*_{2}∈{*A*,*C*,*G*,*T*} (called tandem repeats in biology), etc. Thus, dinucleotide circular codes have been studied according to two approaches, by the combinatorics theory [26] and the group theory [27].

As the genetic motifs of different lengths are ubiquitous in genes and genomes, we introduce here a new approach based on graph theory to study in full generality *n*-nucleotide circular codes *X*, i.e. of length 2 (dinucleotide), 3 (trinucleotide), 4 (tetranucleotide), etc. To each such code, a graph is associated and the main theorem states that an *n*-nucleotide code *X* is circular if and only if the corresponding graph is acyclic. Moreover, many properties of the codes can be seen in its representing graph.

## 2. The graph theory of *n*-nucleotide circular codes

Throughout this section, let be the set of nucleotide bases, where *A* stands for *adenine*, *C* stands for *cytosine*, *G* stands for *guanine* and *T* stands for *thymine*. For with *n*≥2 an * n-nucleotide code* is a subset . The following definition relates a directed graph to any

*n*-nucleotide code. Recall from graph theory [28] that a

*graph*consists of a finite set of

*vertices (nodes)*

*V*and a finite set of

*edges*

*E*. Here, an edge is a set {

*v*,

*w*} of vertices from

*V*. The graph is called

*oriented*if the edges have an orientation, i.e. edges are considered to be ordered pairs [

*v*,

*w*] in this case.

### Definition 2.1

Let be an *n*-nucleotide code (). We define a directed graph with a set of vertices *V* (*X*) and a set of edges *E*(*X*) as follows:

—

*V*(*X*)={*N*_{1}…*N*_{i},*N*_{i+1}…*N*_{n}:*N*_{1}*N*_{2}*N*_{3}…*N*_{n}∈*X*,1≤*i*≤*n*−1}—

*E*(*X*)={[*N*_{1}…*N*_{i},*N*_{i+1}…*N*_{n}]:*N*_{1}*N*_{2}*N*_{3}…*N*_{n}∈*X*,1≤*i*≤*n*−1}.

The graph is called the *representing graph* of *X* or the graph *associated* with *X*.

Basically, the graph associated with a code *X* interprets *n*-nucleotide words from *X* in (*n*−1) ways by pairs of *i*-nucleotides and (*n*−*i*)-nucleotides for 1≤*i*≤*n*−1. Figures 1–3 give examples of codes and their representing graphs in the case of *n*=2 (dinucleotide code), *n*=3 (trinucleotide code) and *n*=4 (tetranucleotide code).

As we can see, the graph of the tetranucleotide code has four disjoint parts. However, note that two parts are built by vertices labelled with dinucleotides and two parts are built by vertices labelled with nucleotides and trinucleotides. These parts are called *components* of . Recall that a subset *V* ′ of the set of vertices *V* is called *connected* if for any two nodes *v*,*w*∈*V* ′ there is a path [*v*,*v*_{1}][*v*_{1},*v*_{2}]…[*v*_{n−1},*v*_{n}][*v*_{n},*w*] of vertices from *V* ′ connecting *v* and *w*. Any graph decomposes uniquely into connected components which are pairwise disjoint. Recall also that a graph is *bipartite* if its set of vertices *V* can be decomposed into two disjoint subsets *V* ′ and *V* ′′ such that the edges of connect only nodes from *V* ′ with nodes from *V* ′′ and vice versa. Obviously, if *X* is an *n*-nucleotide code, then the components of are exactly the graphs
with
and
These components do not have to be connected, as we can see in figure 3. However, quite often they are. In fact, consists exactly of the nodes (and their corresponding edges) that interpret the elements of *X* in two ways: as a pair of a *j*-nucleotide and an (*n*−*j*)-nucleotide and as a pair of an (*n*−*j*)-nucleotide and a *j*-nucleotide. Note that by symmetry we have for all *j*<*n*−1. For instance, in figure 3 the two components of the graph associated with the tetranucleotide code are and . The next observation is obvious.

### Lemma 2.2

*Let X be an n-nucleotide code for some* . *Then the following statements hold*:

(

*1*)—*If n is odd, then**is a bipartite graph. In particular, all its components**are bipartite*.(

*2*)—*If n is even, then all components of**are bipartite except for perhaps*.

We now start to investigate our desired objects, namely *n*-nucleotide circular codes.

### Definition 2.3

Let be a code. We say that *X* is a *circular code* if for any concatenation *c*_{1}…*c*_{m} of *n*-nucleotide words from *X* there is only one partition into *n*-nucleotide words from *X* when read on a circle.

The first observation shows that for circular codes the associated graph is already simple. Recall from graph theory [28] that an oriented graph is *simple* if it does not contain loops, i.e. edges between a node and itself, and does not have multiple edges with the same orientation between two nodes. Note that the orientations for the multiple edges do play a role, i.e. for a simple oriented graph , we can still have and , which means that there is a cycle (circle) of length 2. However, for circular codes this structure is also excluded. Recall that a *cycle* in is an oriented closed path in . A *circle* is a cycle that visits no node twice except for the starting node (which is the end node at the same time). For instance, in figure 2, the sequence of vertices *T*,*GT*,*A*,*TG*,*T* is a circle while the sequence *T*,*GT*,*A*,*GT*,*A*,*TG*,*T* is a cycle that is not a circle.

### Lemma 2.4

*Let* *be a circular code. Then its representing graph is a simple oriented graph without circles of length* 2.

### Proof.

A proof can be found in appendix A. ▪

Let us remark that lemma 2.4 simply says that for circular codes the representing graph has an *underlying* simple unoriented graph. Moreover, the circularity is only needed in a weaker form requiring definition 2.3 only for *m*=1. These codes are called *1-circular* (see, for instance, [29] for more details on these codes) and will appear again in §3.

### Example 2.5

In figure 4, we show two examples of trinucleotide codes and their representing graphs. The code {*ATG*,*CAC*,*CAT*,*GTG*} (*a*) is circular and has a simple graph, while the code {*ATG*,*ATT*,*TGA*,*TGT*} (*b*) is non-circular and its graph is not simple.

We now state our first main theorem which proves the connection between the circularity of codes and the acyclicity of graphs. Recall from graph theory [28] that a graph is called *acyclic* if it does not contain cycles, i.e. oriented closed paths.

### Theorem 2.6

*Given a code* *the following statements are equivalent:*

(

*1*)—*X is circular.*(

*2*)—*is acyclic.*

### Proof.

Let be any code and assume that it is circular. If *G*(*X*) is not acyclic, then one of its components is not acyclic. Hence there is a cycle in of the form
for some natural number *k*. This means that the *n*-nucleotides
as well as the *n*-nucleotides
are in the code. In the case *i*≠*n*−*i*, the cycle has an even length. However, for even *n*, if *i*=*n*−*i*, i.e. *i*=*n*/2, it can happen that , and the last edge is missing. Hence, the cycle has odd length. In both cases taking every *second edge* from the cycle and starting with the edge until we get the same edge repeated for the first time, we obtain the word
This word consists of *k* *n*-nucleotides if the cycle length is even and of (2*k*−1) *n*-nucleotides if the cycle length is odd. Now, taking every *second edge* from the cycle but this time starting with the edge until we get the same edge repeated for the first time, we obtain a second decomposition of the word on the circle, namely
This is a contradiction to the circularity of *X*. The converse follows with similar arguments. ▪

Clearly, the above result also gives a handy criterion for the * C^{3} property* of trinucleotide codes, namely by the fact that a code is

*C*

^{3}if and only if the graph of

*X*as well as the graphs of the two

*circularly permuted*sets of trinucleotides of

*X*are acyclic. Recently,

*C*

^{3}codes played an important role in the theory of error detection in genetic information. In particular, the maximal trinucleotide circular code

*X*observed in genes of bacteria, eukaryotes, plasmids and viruses [12,15] initiated a renewed interest and had another interesting property, namely self-complementarity. Recall that an

*n*-nucleotide code is

*self-complementary*if for each

*n*-nucleotide from

*X*the reversed complemented

*n*-nucleotide is in

*X*(see [30] for more details on

*C*

^{3}codes). As an illustration we give the graph associated with the maximal self-complementary

*C*

^{3}code found in [12,15].

### Example 2.7

There are 12 964 440 maximal circular codes of 20 trinucleotides [12]. The maximal trinucleotide circular code *X* observed in genes of bacteria, eukaryotes, plasmids and viruses [12,15] has the following 20 trinucleotides:
The graph associated with *X* is shown in figure 5.

In order to see self-complementarity of a code *X*, we need to investigate first the reversing (mirroring) transformation which plays an important biological role. Recall that the *reversing transformation* inverts the order of bases in any *n*-nucleotide, i.e. for we have . If *X* is a code of *n*-nucleotides, then is the *reversed code* of *X*. Similarly, the *complementing map* that exchanges *A* and *T* as well as *C* and *G* induces the *complemented code* *c*(*X*)={*c*(*x*):*x*∈*X*}, where *c*(*N*_{1}*N*_{2}…*N*_{n−1}*N*_{n})=*c*(*N*_{1})*c*(*N*_{2})…*c*(*N*_{n−1})*c*(*N*_{n}) for any *n*-nucleotide . Note that for trinucleotides (codons) *x*=*N*_{1}*N*_{2}*N*_{3} the anti-codon of *x* is exactly .

The next lemma shows that the graphs and of a code *X* and its reversed code are *anti-isomorphic* while at the same time the graphs and of the complemented code *c*(*X*) are isomorphic. Recall that an (*anti*-) *isomorphism* between two graphs and is a bijective map that preserves edges in the sense that [*g*_{1},*g*_{2}]∈*E* if and only if [*f*(*g*_{1}),*f*(*g*_{2})]∈*E*′ ([*g*_{1},*g*_{2}]∈*E* if and only if [*f*(*g*_{2}),*f*(*g*_{1})]∈*E*′). For example, the graphs *G*=({1,2,3}, {[1,2],[2,1],[1,3]}) and *G*′=({1,2,3}, {[1,2],[2,1],[3,1]}) are anti-isomorphic. Their anti-isomorphism is easy to see considering the identical map *f*=*id*.

### Lemma 2.8

*Let* *be a code and* *its associated graph. Moreover, let c be the usual complementing map. Then*

(

*1*)—*The map**that sends a vertex N*_{1}…*N*_{j}*to N*_{j}…*N*_{1}*is an anti-isomorphism between**and*.(

*2*)—*The map**that sends a vertex N*_{1}…*N*_{j}*to c*(*N*_{1})…*c*(*N*_{j})*is an isomorphism between**and*.

### Proof.

The claims are obvious by the construction of the reversed code and the complemented code. ▪

We can now formulate self-complementarity in our graphs. The proof of the following result is obvious.

### Theorem 2.9

*Let* *be a code. Then X is self-complementary if and only if* *equals the reversed complemented graph* .

The next theorem shows that we can even see the comma-freeness property of a code in its associated graph. Recall that an *n*-nucleotide code is *comma-free* if for any two *n*-nucleotides *N*_{1}…*N*_{n} and *N*_{1}′…*N*_{n}′ from *X* the *n*-nucleotides in frame 1 to *n*−1, i.e. *N*_{j}…*N*_{n}*N*_{1}′…*N*_{j−1}′ for 2≤*j*≤*n*, do not belong to *X*. Comma-free codes are obviously circular but the converse is not true.

As an illustration, we give the graph associated with a maximal comma-free trinucleotide code.

### Example 2.10

There are 408 maximal comma-free codes of 20 trinucleotides [2,3,31]. As an example, let *Y* be the following maximal comma-free code:

The graph associated with *Y* is shown in figure 6.

Our second main theorem shows that comma-freeness of a code *X* is connected to the maximal length of paths in its representing graph. In example 2.10, this length is 2, which is not a coincidence as we show now.

### Theorem 2.11

*Given a code* *the following statements are equivalent:*

(

*1*)—*The maximal length of a path in**is*2.(

*2*)—*The code X is comma-free.*

### Proof.

A proof can be found in appendix B. ▪

In general, the maximal length of a path in a representing graph of a code has a relation to the *error correcting window* of the code, i.e. the longest number of nucleotides that have to be read in an arbitrary sequence of words from the code *X* in order to retrieve the correct frame. In fact, any path in such a graph yields a sequence of nucleotides that can be read in two frames just by concatenating the labels of the vertices of the path. Conversely, any sequence (of words from the code) that can be read in two frames yields a path in the associated graph. The exact relation is not yet clear and has to be investigated in the future but we would like to present an example.

### Example 2.12

For the circular code *X* from example 2.7, the longest paths in have 12 nucleotides if we start with a nucleotide. They are as follows: [*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*AG*], [*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TC*] and [*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TG*]. Thus, the two longest ambiguous words of 11 nucleotides which can be read in at least two frames, namely frame 0 and frame 1, are: *GGTAATTACCA* and *GGTAATTACCT* where *GGT*∈*X* is in frame 0.

If we start with a dinucleotide, then the longest paths in have 14 nucleotides and are given by: [*CA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*AG*], [*CA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TC*], [*CA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TG*], [*CT*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*AG*], [*CT*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TC*], [*CT*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TG*], [*GA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*AG*], [*GA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TC*] and [*GA*,*G*,*GT*,*A*,*AT*,*T*,*AC*,*C*,*TG*].

Thus, the four longest ambiguous words of 12 nucleotides which can be read in at least two frames, namely frame 0 and frame 1, are: *AGGTAATTACCA*, *AGGTAATTACCT*, *TGGTAATTACCA* and *TGGTAATTACCT* as *AG* and *TG* are suffixes of trinucleotides from *X*.

## 3. Application for dinucleotide circular codes

In this section, we investigate our graph theoretic approach for the case *n*=2, i.e. for dinucleotides, nucleotide words of length two. Given a dinucleotide code the associated graph has at most four vertices, labelled by the nucleotide bases, and at most 12 directed edges. Each of the vertices can have at most four ingoing and four outgoing edges (see [27] for a classification of these codes).

### Example 3.1

Example of the maximal dinucleotide circular code *X*= {*AC*,*AG*,*AT*,*CG*,*CT*,*TG*}. The associated graph is shown in figure 7.

Recall that a dinucleotide code is * k-circular* for if for any concatenation

*c*

_{1}…

*c*

_{m}, (

*m*≤

*k*) of dinucleotides from

*X*there is only one partition into dinucleotides from

*X*when read on a circle [27]. Obviously, 1-circularity of a dinucleotide code

*X*means that for each dinucleotide

*N*

_{1}

*N*

_{2}∈

*X*the reversed dinucleotide

*N*

_{2}

*N*

_{1}is not a member of

*X*. This already implies that the associated graph of such a code can have at most six edges. Moreover, it is known [27] that for dinucleotide codes 3-circularity already implies circularity.

### Example 3.2

Two examples of a 1-circular but not 2-circular dinucleotide code and a 2-circular but not 3-circular dinucleotide code.

(1) Let

*X*={*AG*,*AT*,*CA*,*CT*,*GC*,*TG*}, then*X*is 1-circular but not 2-circular and the associated graph is shown in figure 8.(2) Let

*X*={*AG*,*CA*,*CG*,*CT*,*GT*,*TA*}, then*X*is 2-circular but not 3-circular and the associated graph is shown in figure 9.

Recall from graph theory [28] that a graph is a *tournament* if it is obtained by assigning a direction to each edge of a complete (and hence simple) graph, i.e. it has |*V* |(|*V* |−1)/2 edges. The following lemma shows that the tournaments on four vertices correspond exactly to the maximal 1-circular dinucleotide codes.

### Lemma 3.3

*Given a code* *the following statements are equivalent*:

(

*1*)*X is a maximal*1-*circular dinucleotide code*.(

*2*)*is a tournament on four vertices*.

### Proof.

A proof can be found in appendix C. ▪

It follows immediately from lemma 3.3 that there are 2^{6}=64 different maximal 1-circular dinucleotide codes since every tournament on four vertices has six edges and every edge can be oriented in two ways. We now characterize the graphs associated with 2- and 3-circular dinucleotide codes. Recall that a *Hamiltonian cycle* in some (oriented) graph is a (oriented) cycle that visits every node of the graph exactly once (except for the vertex that is both the start and end, which is visited twice). The proof of the following theorem can be found in appendix D.

### Theorem 3.4

*Let* *be a 1-circular dinucleotide code. Then*

(

*1*)*X is circular if and only if**is acyclic, i.e.**does not contain any oriented cycle.*(

*2*)*X is a 1-circular but not 2-circular code if and only if**contains a Hamiltonian cycle of length 4.*(

*3*)*X is a 2-circular but not 3-circular code if and only if**contains an oriented cycle of length 3 and has no Hamiltonian cycle.*

Figure 10 visualizes the situations described in theorem 3.4.

As we have seen each maximal 1-circular dinucleotide code corresponds to a tournament on four vertices. The theory of tournaments is well studied in graph theory (see, for instance, [28]). Recall that the *score sequence* of a tournament is the set of out-degrees of its vertices {*d*^{+}(*v*):*v*∈*V* }, where *d*^{+}(*v*)=|{[*v*,*w*]∈*E*:*w*∈*V* }| for a vertex *v*∈*V* . Thus, we count how many edges start in each vertex. For instance, in figure 10 the score sequences are (clockwise beginning from the upper left-hand vertex) (*a*) 1, 0, 3, 2; (*b*) 1, 1, 2, 2; (*c*) 1, 1, 3, 1.

### Theorem 3.5

*The following statements are equivalent for a tournament T=(V,E) on n vertices:*

(

*T-1*)*T is acyclic.*(

*T-2*)*T does not contain a cycle of length 3.*(

*T-3*)*The score sequence of T is {0,1,2,…,(n−1)}.*(

*T-4*)*T is transitive (i.e. from [x,y]∈E and [y,z]∈E it follows that [x,z]∈E).*(

*T-5*)*T has exactly one Hamiltonian path.*

We are now in a position to transfer theorem 3.5 one to one to circular dinucleotide codes showing a beautiful equivalence between the theory of tournaments on four vertices and the theory of maximal 1-circular dinucleotide codes. The equivalence of (*C*-1),(*C*-2) and (*C*-3) is known (see, for instance, [26,27]) but was proved using different techniques.

### Theorem 3.6

*Let* *be a maximal 1-circular dinucleotide code. Then the following statements are equivalent:*

(

*C-1*)*X is circular.*(

*C-2*)*X is 3-circular.*(

*C-3*)*X has the form X={N*_{1}*N*_{2}*,N*_{1}*N*_{3}*,N*_{1}*N*_{4}*,N*_{2}*N*_{3}*,N*_{2}*N*_{4}*,N*_{3}*N*_{4}*},*.(

*C-4*)*X is transitive in the following sense: from N*_{1}*N*_{2}*,N*_{2}*N*_{3}*∈X it follows that N*_{1}*N*_{3}*∈X.*(

*C-5*)*The relation < defined on**by**is a total order.*

### Proof.

A proof can be found in appendix E. ▪

As an immediate corollary, we obtain the classifications of all maximal *k*-circular dinucleotide codes that were obtained in [26,27]. The idea of the proof which can be found in appendix F is that for a maximal circular dinucleotide code *X* the associated tournament is acyclic and hence completely determined by its unique Hamiltonian path. Hence there are 24 such codes since there are 24 different Hamiltonian paths possible. For 2-circular but not 3-circular codes, has a Hamiltonian cycle which determines four such codes. Since there are only six possible Hamiltonian cycles we get 24 such codes in total.

### Corollary 3.7

*Let* *be a maximal 1-circular code. The following statements are true*:

(

*1*)*There are 24 different maximal dinucleotide circular*(=3-*circular*)*codes*.(

*2*)*There are 24 different maximal dinucleotide*1-*but not*2-*circular codes*.(

*3*)*There are 16 different maximal dinucleotide*2-*circular but not circular codes*.

Figure 10 illustrates the graphs associated with maximal dinucleotide circular, 2-circular but not circular and 1- but not 2-circular codes. By labelling of the vertices with nucleotide bases *A*,*C*,*G*,*T*, all 24 maximal dinucleotide circular codes can be obtained from figure 10*a*, all 24 maximal dinucleotide 1- but not 2-circular codes from figure 10*b* and eight different 2-circular but not circular codes from figure 10*c* (some of the labels will lead to the same code). The other eight of the 2-circular but not circular codes can be obtained from figure 10*c* by reversing all edges (compare lemma 2.8 and corollary 3.7).

Finally, we consider maximal dinucleotide comma-free codes which have been classified recently in [32]. Also in this case, our new graph theoretical approach recovers the same result in a more elegant way. We begin with embedding circular dinucleotide codes into maximal circular dinucleotide codes. According to theorem 3.4, every dinucleotide circular code can be represented by an acyclic graph with at most four vertices. Straightforward calculations show that every such graph can be expanded to an acyclic tournament which represents a maximal dinucleotide circular code according to theorem 3.4. Hence we have the following.

### Lemma 3.8

*Every dinucleotide circular code is contained in a maximal dinucleotide circular code. In particular, every comma-free dinucleotide code can be extended to a maximal circular dinucleotide code*.

### Example 3.9

In figure 11, a maximal comma-free dinucleotide code and its two circular extensions are shown.

According to lemma 3.8, every comma-free code is contained in some maximal dinucleotide circular code. On the other hand, no maximal dinucleotide circular code is comma-free since its representing graph contains a Hamiltonian path of length 3 (compare theorem 2.11). If we remove one of the three edges which belongs to the (unique) Hamiltonian path from the corresponding acyclic tournament we obtain a graph which contains paths of at most length 2. According to theorem 2.11, we obtain in each case a graph that represents a dinucleotide comma-free code which will be maximal since it contains five edges (dinucleotides).

### Example 3.10

In figure 12, a maximal dinucleotide circular code and its three maximal comma-free subcodes are shown.

Thus, we easily obtain the following.

### Theorem 3.11 [32]

*There are exactly 36 maximal dinucleotide comma-free codes. These are given as*

(

*1*)*12 codes of the form {N*_{1}*N*_{2}*,N*_{1}*N*_{3}*,N*_{1}*N*_{4}*,N*_{2}*N*_{4}*,N*_{3}*N*_{4}*},*(

*2*)*12 codes of the form {N*_{1}*N*_{2}*,N*_{1}*N*_{3}*,N*_{1}*N*_{4}*,N*_{2}*N*_{3}*,N*_{2}*N*_{4}*},*(

*3*)*12 codes of the form {N*_{2}*N*_{1}*,N*_{3}*N*_{1}*,N*_{4}*N*_{1}*,N*_{3}*N*_{2}*,N*_{4}*N*_{2}*},*

*where* *and N*_{i}*≠N*_{j}*,i≠j.*

*Moreover, each of these codes can be obtained from a maximal dinucleotide circular code
**with* *by removing one of its dinucleotides N*_{1}*N*_{2}*,N*_{2}*N*_{3} *or N*_{3}*N*_{4}.

## 4. Conclusion and perspectives

The circular code theory proposes that genes are constituted of two trinucleotide codes: the amino acid code and the circular code. The classical amino acid code contains 64 trinucleotides {*AAA*,…,*TTT*} with 61 trinucleotides coding the 20 amino acids and three stop codons which do not code for amino acid. The amino acid code in today's genes do not use all 64 available trinucleotides but a subset of 61 trinucleotides for coding the 20 amino acids. It is a surjective code. Furthermore, it contains two particular codes, a start code and a stop code, related to the reading frame of genes, precisely to initiate and close it. The main start trinucleotide code is both a signal for the beginning of a gene and a code for the amino acid *Met*. The main stop trinucleotide code (also called stop codons) is only a signal for the end of a gene, i.e. without amino acid coding. The two codes and have great variability among the variant genetic codes, showing an important evolution of the start and stop codes among species (see the genetic codes in http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=t). Indeed, in the standard code (code 1), the start trinucleotide code is extended to coding the multi-set of amino acids {*Met*,*Leu*,*Val*,*Leu*} (amino acid according to trinucleotide order) in eukaryotic genes. However, in the ciliate, dasycladacean and hexamita nuclear code (code 6), the euplotid nuclear code (code 10), the alternative flatworm mitochondrial code (code 14), the chlorophycean mitochondrial code (code 16) and the scenedesmus obliquus mitochondrial code (code 22), the start trinucleotide code is restricted to one trinucleotide coding *Met*. By contrast, in the mould, protozoan and coelenterate mitochondrial code and the mycoplasma/spiroplasma code (code 4), the start trinucleotide code is extended up to eight trinucleotides *T*_{start}={*ATA*,*ATC*,*ATG*,*ATT*,*CTG*,*GTG*,*TTA*,*TTG*} coding the multi-set of amino acids {*Ile*,*Ile*,*Met*,*Ile*,*Leu*,*Val*,*Leu*,*Leu*} (amino acid according to trinucleotide order). In the codes 6 and 14, the stop trinucleotide code is restricted to one trinucleotide and , respectively. In the thraustochytrium mitochondrial code (code 23), the stop trinucleotide code is extended to four trinucleotides .

The circular code *X* identified in genes of bacteria, eukaryotes, plasmids and viruses [12,15] is based on 20 trinucleotides with two mathematical properties involved in translation. It codes 12 amino acids ** G**(

*X*)={

*Ala*,

*Asn*,

*Asp*,

*Gln*,

*Glu*,

*Gly*,

*Ile*,

*Leu*,

*Phe*,

*Thr*,

*Tyr*,

*Val*} according to the standard genetic code

**[12], table 4(a). Thus, it is also a surjective code. Furthermore, it allows the reading frame to be retrieved, maintained and synchronized at any position in a gene generated by**

*G**X*. It is an extended mathematical property compared with the start and stop codes occurring only at the beginning and end positions of genes. The start code with a coding function and a frame function reduced to the first three nucleotides of a gene may be an evolutionary relic of a circular code. Thus, the circular code

*X*allows translation without the aid of proteins. As a consequence, we think that the circular code

*X*has occurred before the classical amino acid code existing in today's genes. According to this hypothesis, primitive genes of short lengths would be directly decoded by the circular code

*X*. This primitive coding of genes would be based on the property of reading frame retrieval of the circular code

*X*and the coding of oligopeptides (peptides of short lengths) using the 12 amino acids

**(**

*G**X*). It would not use the ribosome, which is a complex apparatus containing proteins, e.g. the 22 proteins S1–S22 in the small subunit and the 34 proteins L1–L36 in the large subunit of

*Escherichia coli*, the two classes of aminoacyl tRNA synthetase, etc. Then, the complexity evolution of gene coding would have required an increase in the size of the protein alphabet (from 12 to 20), an increase in the length of proteins allowing diversity (the average size of today's genes is 1000 nucleotides for coding proteins of about 300 amino acids) and the development of the ribosomal apparatus as the property of reading frame retrieval of the circular code

*X*would become inefficient with the size and topology of today's genes. The property of reading frame retrieval of the circular code has not completely disappeared during evolution as today's genes contain start and stop codes and as

*X*motifs have been identified in tRNAs [13,17] and rRNAs, in particular in the ribosome decoding centre [13,18].

In this paper, we present a new approach to circular codes based on graph theory and extended to *n*-nucleotide codes. To each such code, a graph was assigned that interprets the *n*-nucleotides of the code as pairs of prefixes and suffixes. The general theorem established here identifies among the *n*-nucleotide codes those which are circular. Moreover, several properties of the circular codes can be seen in the representing graph, e.g. the error detecting window. Since dinucleotide circular codes may also have a biological function in the coding process of amino acids [26] we applied our approach to such codes and have established a beautiful correspondence between the theory of these codes and the theory of tournaments.

Tetranucleotide codes may also be involved in the amino acid coding [33–36] and thus our approach may help to investigate also the tetranucleotide circular codes in the future.

## Authors' contributions

All authors contributed equally to this work.

## Competing interests

We declare we have no competing interests.

## Funding

We received no funding for this study.

## Acknowledgements

E.F. and L.S. would like to thank the Karl Völker Foundation for its support.

## Appendix A. Proof of lemma 2.4

### Proof.

A self-loop in can only arise for even *n* and would mean that *N*_{1}…*N*_{n/2}*N*_{1}…*N*_{n/2}∈*X* for some . The existence of multiple edges means that the same *n*-nucleotide is represented in the graph twice and the existence of inverted edges that *N*_{1}…*N*_{i}*N*_{i+1}…*N*_{n}∈*X* as well as *N*_{i+1}…*N*_{n}*N*_{1}…*N*_{i}∈*X*. All this is forbidden by the construction of the graph and the circularity of the code *X*. ▪

## Appendix B. Proof of theorem 2.11

### Proof.

Let be given and assume that *X* is comma-free. If contains a path of length at least 3, then it has one of length 3 which must belong to one of the components of . Thus, it is either of the form
or
In the first case, the three *n*-nucleotides
are in *X*. Thus, the concatenation *N*_{1}…*N*_{j}*N*_{j+1}…*N*_{n}*N*_{1}′…*N*_{j}′*N*_{j+1}′…*N*_{n}′ violates comma-freeness since *N*_{j+1}…*N*_{n}*N*_{1}′ is also in *X*. In the latter case, the three words
are in *X*. Thus, the concatenation *N*_{1}…*N*_{n−j}*N*_{n−j+1}…*N*_{n}*N*_{1}′…*N*_{n−j}′*N*_{1}′′…*N*_{n−j+1}′′ violates comma-freeness since the *n*-nucleotide *N*_{n−j+1}…*N*_{n}*N*_{1}′…*N*_{n−j}′ is also in *X*.Conversely, assume that the maximal length of a path in is at most 2 and assume that *X* is not comma-free. Then there are two *n*-nucleotides *N*_{1}…*N*_{n} and *N*_{1}′…*N*_{n}′ in *X* such that the concatenation *N*_{1}…*N*_{n}*N*_{1}′…*N*_{n}′ violates comma-freeness, i.e. there is 1≤*j*≤*n*−1 such that *N*_{j+1}…*N*_{n}*N*_{1}′…*N*_{j}′ is in *X* as well. Thus, we obtain a path of length 3 in the *j*th components of , namely the [*N*_{1}…*N*_{j}*N*_{j+1}…*N*_{n}][*N*_{j+1}…*N*_{n}*N*_{1}′…*N*_{j}′][*N*_{1}′…*N*_{j}′*N*_{j+1}′…*N*_{n}′] contradiction. ▪

## Appendix C. Proof of lemma 3.3

### Proof.

Let *X* be a maximal 1-circular code. Then *X* contains exactly six dinucleotides with all nucleotide bases appearing at least in one of its dinucleotides (see, for instance, [27]). Moreover, due to lemma 2.4 the associated graph is simple. Hence, is a complete graph on four vertices, which means that is a tournament on four vertices. The converse is obvious. ▪

## Appendix D. Proof of theorem 3.4

### Proof.

For claim (1) see theorem 2.6.

For claim (2) assume that *X* is 1- but not 2-circular. Then there are so that the word has two decompositions on the circle. That means that also . However, are different bases due to the 1-circularity of *X* and, thus, is a Hamiltonian cycle in .

Let us assume now that is a Hamiltonian cycle in . Then *N*_{1}*N*_{2}, *N*_{2}*N*_{3}, *N*_{3}*N*_{4}, *N*_{4}*N*_{1}∈*X* and the word *N*_{1}*N*_{2}*N*_{3}*N*_{4} has two decompositions on the circle. Thus, *X* is not 2-circular.

Finally, for claim (3) let *X* be a 2- but not 3-circular code. Then according to (2) cannot have a Hamiltonian cycle. According to (1) cannot be acyclic since there are 2-circular codes which are not circular. The only case left is that has a cycle of length 3 but is not Hamiltonian.

Assume now that has a cycle of length 3 . Then the word *N*_{1}*N*_{2}*N*_{3}*N*_{1}*N*_{2}*N*_{3} has two decompositions on the circle since *N*_{1}*N*_{2}, *N*_{2}*N*_{3}, *N*_{3}*N*_{1}∈*X* and, thus, *X* is not 3-circular. In addition, *X* is 2-circular according to (2) since has no Hamiltonian cycle. ▪

## Appendix E. Proof of theorem 3.6

### Proof.

(C-1) ⇒ (C-2): is obvious.

(C-2) ⇒ (C-1): let *X* be a 3-circular code. According to theorem 3.4 (3) does not contain cycles of length 3 and, thus, according to theorem 3.5 is acyclic. It follows from theorem 2.6 (1) that *X* is circular.

(C-1) ⇔ (C-3): clearly, if *X*={*N*_{1}*N*_{2},*N*_{2}*N*_{3},*N*_{3}*N*_{4},*N*_{1}*N*_{4},*N*_{1}*N*_{3},*N*_{2}*N*_{4}}, it follows that the score sequence of is 0,1,2,3. According to theorem 3.5, this is equivalent to the acyclicity of and, thus, to the circularity of *X*.

(C-1) ⇔ (C-4): is obvious in view of the definition of .

(C-1) ⇔ (C-5): according to theorem 3.5, the acyclicity of and, thus, the circularity of *X* is equivalent to the existence of the unique Hamiltonian path in . Given the unique Hamiltonian path in we define the order *N*_{1}<*N*_{2}<*N*_{3}<*N*_{4} which is a total order on and vice versa. ▪

## Appendix F. Proof of corollary 3.7

### Proof.

For claim (1), the representing graph of a maximal dinucleotide circular code *X* is an acyclic tournament on four vertices. In such a tournament, there is exactly one Hamiltonian path [28] . The remaining three edges can be oriented in the unique way to avoid cycles, namely and . There are 24=4! possibilities to choose such a Hamiltonian path, which proves (1).

For claim (2), let *X* be a maximal dinucleotide 1- but not 2-circular code. Owing to theorem 3.4 the representing graph has a Hamiltonian cycle . There are 6=3×2 possibilities to choose such a cycle and in each case 4=2×2 additional possibilities to orient the remaining two edges. It is easy to see that any orientation of the remaining two edges does not lead to new Hamiltonian cycles. Therefore, we have 24=6×4 different maximal dinucleotide 1- but not 2-circular codes.

For claim (3), the remaining codes which are not covered by (1) or (2) must be 2-circular but not circular. There are 16=64−24−24 such codes. ▪

## Footnotes

One contribution of 21 to a theme issue ‘DNA as information’.

- Accepted September 5, 2015.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.