## Abstract

Self-replication is one of the fundamental aspects of computing, whereby a program or a system may duplicate, evolve and mutate. Our point of view is that Kleene's (second) recursion theorem is essential to understand self-replication mechanisms. An interesting example of self-replicating code is given by computer viruses, as initially explained in the seminal works of Cohen and of Adleman in the 1980s. In fact, the different variants of recursion theorems provide and explain constructions of self-replicating code and, as a result, of various classes of malware. None of the results are new from the point of view of computability theory. We propose a self-modifying register machine as a model of computation in which we can effectively deal with self-reproduction and in which new offspring can be activated as independent organisms.

*A virus is a virus!* (André Lwoff, 1959)

## 1. Introduction

### (a) From Turing machines to computer viruses

The Turing model of computation (MOC) is the bedrock of computability and complexity. Turing machines (TMs) are simple and efficient. A TM consists of a control unit, which is a hard-wired internal program, and of one or several tapes. In his original paper, Turing [1,2] designed a universal Turing machine (UTM), which corresponds nowadays to a self-interpreter. A UTM is able to interpret any TM from an encoding of it. In particular, a UTM is able to interpret itself. Turing's result suggested that a machine can have access to its own code, that is, to a self-description. The second recursion theorem of Kleene [3] allows us to understand the notion of self-reference, and its proof suggests the description of a self-reproducing machine. Moschovakis [4] conducted a survey on the applications of this little gem. The reader may also consult the books of Smullyan [5,6].

About 50 years later, Cohen [7] defined computer viruses: ‘A virus is a program that is able to infect other programs by modifying them to include a possibly evolved copy of itself'. More recently, Cohen's definition was rephrased by Ször [8] in his book. Cohen's definition seems to be accepted by the computer security community. There are only a few works on theoretical aspects of computer viruses (e.g. [9–12]).

#### (i) What are we studying and what will we not consider?

Because there is a consensus on the definitions of Cohen and of Adleman [9], they will be our starting point. Therefore, we will focus on the features of computer viruses related to self-reproduction, code mutation and self-modification. Thus, for this study, a computer virus is defined as a self-replicating program whose offspring may be a mutation of the original program. Moreover, the offspring may be activated, yielding a new generation, and so the contamination can go on. Despite the widespread literature on malware, there are only a few theoretical studies of computer viruses. This study is not about the practical side of viruses; for this, the reader may refer, for example, to Ször's book [8].

We are going to illustrate our approach by a first example. A more complete example will be given in the last part of the introduction. Consider a simple virus, say `VA`, that tries to contaminate each running process in a system. For this, `VA` checks first whether or not a process is infected. Thus, it avoids superinfection and, in particular, it avoids infecting itself when executed. If the process is sane, i.e. if it has not already been infected by the virus `VA`, then the virus injects a copy of itself. The specification of the virus is written in pseudo-code as follows:
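As a toy illustration, such a specification can be sketched in Python; the process list, the infection test and all helper names below are hypothetical, invented for this example only.

```python
# Hypothetical sketch of Specification_of_VA: the parameter v stands for
# a self-description of the virus; processes are modelled as code strings.
def specification_of_VA(v, processes):
    infected = []
    for proc in processes:
        if v in proc:              # already infected: avoid superinfection
            infected.append(proc)
        else:                      # sane process: inject a copy of the virus
            infected.append(proc + v)
    return infected

processes = ["proc-a", "proc-b"]
v = "<viral-code>"                 # placeholder for the fixpoint program
print(specification_of_VA(v, processes))
```

Running the specification a second time leaves the system unchanged, since every process already carries the mark of the virus.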

The parameter `v` stands for a self-description of `VA`. To design `VA`, we have to construct a program `v` such that, roughly speaking, `v`=`Specification_of_VA(v)`. In other words, `v` is a fixpoint of the specification of `VA`. This example illustrates the fact that Cohen's or Adleman's model bears a resemblance to the second recursion theorem of Kleene [3] and to von Neumann [13] self-reproducing machines.

#### (ii) A complete model of self-reproduction and main objectives

Before going further, it is worth defining three concepts that are essential for this study. *Self-reference* is the ability of a program to refer to itself. A recursive algorithm is a typical example of self-reference in programming languages. In this case, a program has a pointer referring to its own code. *Self-replication* is the ability of a program to copy itself. A well-known example is quines. The duplication process may produce a program (the offspring) different from the original one (the father). Indeed, there could be a mutation of the original program; in this case, the offspring may or may not compute the same thing as the original program. *Self-modification* is the ability of a program to alter its own code. For example, it may change its instructions on the fly, or it may generate code from data that it can access, so that the overall program may be seen as a sequence of different code layers. These three notions are distinct: a programming language can be self-referential without having the ability to alter its code. They correspond to the notion of self-knowledge developed by Case [14]. We think that a *complete model of self-reproducing systems* must reflect all three notions, as well as the important fact that a copy can be activated.

The main objective of this study was to propose a synthetic model of computer viruses and more generally of complete self-reproduction. From a programming perspective, this means a model in which computer viruses are generated from a viral specification similar to the one given previously. For these reasons, we focus on effective models of computation from the theory of programming languages, computability and complexity theory. We propose an MOC that explains, intuitively, the essential ideas behind self-reproduction with their connections to recursion theorems. This MOC is a simple model of self-reproduction and is complete in the sense that we have just explained. Moreover, it might be seen as a basic model to study self-modifying programs.

#### (iii) A short tour in the land of models of computation

Seminal models, such as TMs or Post [15] tag systems, process strings. Shepherdson & Sturgis [16] seem to have been the first to introduce register machines over words. More recently, Moss [17] proposed *text register machines* (TRMs) in order to have a simple and elegant computation model in which constructions from computability theory can be implemented. The TRM MOC can manipulate a program as data.

We present an MOC named a *self-modifying register machine* (SRM), which is an extension of TRMs. SRMs work over a finite alphabet. Programs are stored in registers and can be processed as data. The crucial difference with TRMs is that any register may be activated and run. As a result, a program may generate a copy of itself, and then run it. Thus, SRMs seem to be an adequate model of self-reproducing systems.

An SRM is a stored program abstract machine. As in current computer architectures, a program is loaded into memory. It may modify itself either by changing its instructions or by executing instructions outside the original program. Turing & Wilkinson [18] with the Automatic Computing Engine project and Burks *et al.* [19] were the first main protagonists to conceive such architectures. Elgot & Robinson [20], Hartmanis [21] and Cook & Reckhow [22] defined *random access stored program machines*, RASPs, in a similar way. Unlike RASPs, SRMs have no indirect jumps, which is a questionable feature because it tends to ignore the computing locality principle.

#### (iv) Self-reproducing Turing machines

UTMs and the second recursion theorem form the core of the study of self-reproducing machines. Later, in his paper on morphogenesis, Turing [23] studied a mathematical model of growing and evolving organisms. At the same time, von Neumann & Burks [24] constructed a self-replicating cellular automaton. The clear intention of these works was to have synthetic models of living organisms. More than 50 years later, the theory of self-reproduction was extended in different directions, as described by Sipper [25]. The relations between constructions within the von Neumann model and the recursion theorem are quite well known, and are well explained by Case [26] and Arbib [27], for example. Within the Turing MOC, it is essentially Lee [28] and Thatcher [29] who worked on self-describing TMs, and more recently Margenstern & Rogozhin [30]. In essence, the design of a self-describing machine follows Rogers [31], which is based on the recursion theorem proof: suppose that **Diag**(*z*) is the code of a TM that, on any input, outputs **M**_{z}(*z*) (where, for any *z*, **M**_{z} is the machine with code number *z*). There is a code *q* such that for all **d**, **M**_{q}(**d**)=**Diag**(**d**). Hence, by taking **Diag**(*q*) for *e*, we obtain for all **d**, **M**_{e}(**d**)=**M**_{q}(*q*)=**Diag**(*q*)=*e*. The machine of code *e* duplicates itself. This mechanism is behind all further constructions that we will see.

### (b) A compiler virus

We now turn to a second example. In his Turing award lecture, Thompson [32] gives a construction of a *compiler virus*. We will use Thompson's construction to illustrate that mechanisms, underlying computer viruses, depend directly on fundamental theorems of computability.

We are now given a self-interpreter `interp`. We write ⟦`p`⟧ to denote the function that is computed by the program `p`. For any program `p`, a self-interpreter computes ⟦`interp`⟧(`p`,**d**)=⟦`p`⟧(**d**).

Say that an *interpreter virus* is a virus which infects an interpreter. Then, the infected interpreter may compromise the programs it runs. That is, an interpreter virus, say `v`, transforms `interp` into an infected interpreter `interp`_{v}. Thus, if we replace the trusted interpreter `interp` by `interp`_{v}, instead of running a program `p`, we will run an infected version of `p`.

The program `interp`_{v} is easily built, thanks to the s-m-n theorem. The concept behind the s-m-n function is that of program specializers. A program specializer `spec` is a program transformer that takes two inputs: a program `p` and data **c**. The result is a program ⟦`spec`⟧(`p`,**c**) such that ⟦⟦`spec`⟧(`p`,**c**)⟧(**d**)=⟦`p`⟧(**c**,**d**). In particular, the infected version of a sane program `p` is ⟦`spec`⟧(`v`,`p`); indeed, ⟦⟦`spec`⟧(`v`,`p`)⟧(**d**)=⟦`v`⟧(`p`,**d**), so running the infected program hands `p` over to the virus. The infected interpreter `interp`_{v} is then the program that, on input (`p`,**d**), runs ⟦`spec`⟧(`v`,`p`) on **d**; it is the result of the infection of `interp` by the virus `v`.

From there, the virus may also inject itself into `p`. For this, `v` uses a propagation vector in order to infect `p`. In accordance with Bonfante *et al.* [11], the propagation vector is represented by a specializer `spec`. For example, `v` may insert its program at the beginning or at the end of `p`, or it may even shuffle its code inside `p`. However, this is not our real concern in this study.

Thus, the infected interpreter executes the infected program ⟦`spec`⟧(`v`,`p`) instead of the sane program `p`. The problem is to show that there must exist a virus `v` that satisfies the following equation:

$$⟦v⟧(p,\mathbf{d}) = ⟦\mathit{interp}⟧(⟦\mathit{spec}⟧(v,p),\mathbf{d}). \tag{1.1}$$

The solution of this problem consists of finding a fixpoint `v`. This can be obtained by the second recursion theorem of Kleene [3].

Following Jones [33], we can generate a compiler from an interpreter and a specializer. A compiler `comp` is defined by

$$⟦⟦\mathit{comp}⟧(p)⟧(\mathbf{d}) = ⟦p⟧(\mathbf{d}), \tag{1.2}$$

and it is implemented by a self-application of specializers,

$$\mathit{comp} = ⟦\mathit{spec}⟧(\mathit{spec},\mathit{interp}). \tag{1.3}$$

Now, if the interpreter is compromised, then the new compiler `comp`′, after recompilation, will also be compromised,

$$\mathit{comp}' = ⟦\mathit{spec}⟧(\mathit{spec},\mathit{interp}_v), \tag{1.4}$$

where we recall that `interp`_{v} is the code of the infected interpreter, and

$$⟦⟦\mathit{comp}'⟧(p)⟧(\mathbf{d}) = ⟦\mathit{interp}_v⟧(p,\mathbf{d}). \tag{1.5}$$

As a result, the compromised compiler will insert the virus into each newly compiled program. The construction that we have presented follows the main lines of the design of Thompson's *Trojan horse*. The theoretical grounds of this construction are the existence of a UTM, of the s-m-n functions and of program fixpoints. The aim of the remaining paper is to explain all these connections in more detail.
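The interpreter–specializer–compiler triangle can be mimicked in Python by modelling programs as functions. This is only an illustrative sketch (a real specializer transforms program texts, not closures); the function names are ours.

```python
def interp(p, d):
    # a (trivial) self-interpreter over "programs as functions"
    return p(d)

def spec(p, c):
    # s-m-n / specializer: freeze the first argument of p
    return lambda d: p(c, d)

# first Futamura projection: a compiler obtained by self-application
# of the specializer to the interpreter, comp = [[spec]](spec, interp)
comp = spec(spec, interp)

inc = lambda d: d + 1
target = comp(inc)        # "compiled" version of inc
print(target(41))         # behaves exactly like inc itself
```

If `interp` is replaced by a compromised `interp_v`, the same self-application yields a compromised compiler, which is the essence of Thompson's construction.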

## 2. Self-modifying register machines

### (a) Self-modifying register machines in a nutshell

The central notion of this study is that of SRMs, which allow us to study the diverse notions of self-reproduction. SRMs are based on the definition of TRMs suggested by Moss [17]. TRMs work on words over the alphabet {1,#}. Programs and data are words over {1,#}. An important feature of Moss' TRMs is that the composition of programs is performed by mere concatenation. Thus, there are no tedious coding devices. As a result, it is a fascinating exercise to explore the correspondence between programming constructions and demonstrations of classical computability theory, such as the design of a program that computes program fixpoints. The elegance of this work is in the lineage of Jones [33].

An SRM is a text-register stored-program machine. There are two main differences with TRMs. First, programs are stored in registers, and second, any register can be executed. As a result, an SRM program is able to produce another program and to execute it. An SRM has a finite number of registers `R`_{0},`R`_{1},…,`R`_{m}, a *register pointer* `RP` and an *instruction pointer* `IP`. A register `R`_{n} contains a word that may be decoded as a sequence of instructions. The program counter is broken into a register pointer and an instruction pointer. Hence, the current instruction is at address `IP` of the register pointed to by `RP`, which we denote `R`_{RP}. At every step, the machine fetches and executes the current instruction, and `IP` then moves according to the instruction kind. The execution flow may move from one register to another by changing the value of the register pointer. One might think of an SRM as an abstract machine where each register contains a process. A single process is running at any time, and a process may activate another one. Thus, any register can be activated and run. In the SRM model, there is no distinction between programs and data: both are manipulated during a computation. So, programs are ‘first-class citizens' because they are represented by data, whereas data are also ‘first-class citizens' because they can be effectively executed. This last feature is shared with RASPs, but not with the other models, where a universal machine or an interpreter is necessary.

### (b) Design of the self-modifying register machines

Each register holds a word over the two-symbol alphabet {1,#}. Registers are seen as unbounded first-in, first-out queues. That is, we pop the first letter to read it, and we write a letter by appending it at the bottom of the register. Program instructions are described in Table 1.

The program is initially stored in register `R`_{0}. In an execution of an SRM, the code can move. The code motion is controlled by two features. The first is the ability to copy and move data from one register to another. The second is the ability to activate a program in a register, owing to the instruction `exec`. A program inside a register may generate a new program in another register and run it. As a result, the overall program, i.e. the set of instructions executed, is spread over several registers.

All instructions except `exec` increment `IP` and leave `RP` unchanged. Jumps (`jmp`) and conditional branches (`case`) are performed inside a register, and addresses are relative to the current instruction (`IP`). An important decision in the design of SRMs is the fact that register addressing is relative to the register pointer `RP`. A register address consists of two parts: `RP` and an offset *n*. This means that the instruction `put` 1,*n* appends 1 at the bottom of the register `R`_{RP+n}. Similarly, the instruction `exec` *n* executes the instruction on the top (i.e. `IP` is initialized to 0) of the register `R`_{RP+n}. This mapping allows us to load a program into any location, by a right translation. To sum up, a program loaded in a register is ‘right handed'; that is, it manipulates and runs registers only on its right. In return, a program has no access to registers on its left. In this way, we clearly simplify program relocation. This feature will be taken into account by the sequential semantics introduced later.
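Since Table 1 is not reproduced here, the following Python sketch interprets an *assumed* minimal instruction set in the spirit of the SRM: `put` (append a letter to a register on the right), `jmp` (relative jump), `case` (pop the head of a register and branch three ways on empty/`1`/`#`) and `exec` (activate a register on the right). Instructions are kept as symbolic tuples rather than {1,#} opcodes for readability, and a "letter" may itself be an instruction tuple, standing in for an encoded instruction.

```python
from collections import deque

def run_srm(regs, rp=0, fuel=10_000):
    """Interpret a self-modifying register machine configuration.

    regs: list of queues holding letters ('1', '#') or instruction tuples.
    Addressing is relative to RP and to the right, as in the paper."""
    regs = [deque(r) for r in regs]
    ip = 0
    while fuel > 0:
        fuel -= 1
        prog = list(regs[rp])
        # halting: IP outside the register, or pointing at a non-instruction
        if not (0 <= ip < len(prog)) or not isinstance(prog[ip], tuple):
            break
        op = prog[ip]
        if op[0] == "put":              # put l, n: append l to R[RP+n]
            _, letter, n = op
            regs[rp + n].append(letter)
            ip += 1
        elif op[0] == "jmp":            # jmp k: relative jump inside register
            ip += op[1]
        elif op[0] == "case":           # case n, k_empty, k_one, k_sharp
            _, n, ke, k1, ks = op
            if not regs[rp + n]:
                ip += ke
            else:
                head = regs[rp + n].popleft()
                ip += k1 if head == "1" else ks
        elif op[0] == "exec":           # exec n: activate R[RP+n] from its top
            rp, ip = rp + op[1], 0
        else:
            break
    return [list(r) for r in regs]

# R0 writes a one-instruction program into R1, then activates it; the
# activated program writes the letter '1' into the register on ITS right (R2)
prog0 = [("put", ("put", "1", 1), 1), ("exec", 1)]
print(run_srm([prog0, [], []]))
```

Note how the two-dimensional address space (register, instruction) and the right-handed relative addressing show up directly in the interpreter loop.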

*Discussion.* This design allows us to express the fact that an offspring can be created and activated. However, the offspring is independent and it cannot ‘communicate’ with its father. One might think of a different instruction set or architecture. For example, we may have a relative addressing for registers with positive and negative addresses. Thus, an instruction such as `put` ℓ,−*n* would mean that we add ℓ to register `R`_{RP−n}. We could also have `exec` −*n* and `exec` +*n* in order to run registers on the left and on the right of the current program. It is worth noticing that these architectural choices have consequences for fixpoint constructions that we will not explore in this work. Here, we choose simplicity. Finally, note that the overall address space also has two dimensions: one for instructions and one for registers.

### (c) Semantics of the self-modifying register machine language

A small-step semantics in the spirit of the structural operational semantics of Plotkin [34] or of natural semantics [35] is now given. A *configuration* consists of a store *σ*, a register pointer `RP` and a value *n*, which is either the instruction pointer `IP` or `Halt`. In the latter case, we say the configuration is final.

A *store* is a finite mapping *σ* such that *σ*(*m*)∈{1,#}* for all *m*. The notation *σ*[*m*:=**d**], where **d**∈{1,#}*, denotes the store such that *σ*[*m*:=**d**](*n*)=*σ*(*n*) for all *n* except *m*, where *σ*[*m*:=**d**](*m*)=**d**. Actually, this notation is a convenient shortcut that we will eventually abandon in favour of the more practical sequential semantics.

The *current program* is the program indicated by `RP`. Initially, the current program is in `R`_{0}. We write `RP`[`IP`] for the instruction pointed to by `IP` in the current program, i.e. the instruction at the address `IP` in the register `R`_{RP}. Each instruction has an opcode, which is a word over {1,#}. The program code is a word over {1,#}, which consists of the concatenation of the opcodes of the instructions composing the program. We write `p`•`q` to mean the concatenation of the words `p` and `q`. The size of the instruction encoding is variable, and pointer arithmetic works as usual. So, we write `IP`:=`IP`+1 to increment the register `IP` by the length of one instruction opcode, i.e. to move forwards by one instruction, and `IP`:=`IP`−1 to decrement the register `IP` by one instruction, i.e. to move backwards by one instruction. The exact address is obtained by determining the size of opcodes, which is easily calculated because the set of well-formed instruction opcodes is regular.

We define the *single-step transition* → as follows: (*σ*,`RP`,`IP`) → (*σ*′,`RP`′,`IP`′) holds if the execution of the instruction `RP`[`IP`] in the store *σ* gives the store *σ*′, following the rules of Table 2. The next instruction is then `RP`′[`IP`′].

A computation is defined by the reflexive and transitive closure of →, noted →*. At the beginning of the computation, the program is loaded into the register `R`_{0}. Both `RP` and `IP` are initially equal to 0. That is, `IP` points to the first instruction contained in `R`_{0}.

Typically, the program halts if `IP` ‘goes below' the current program. There are other reasons to halt: the register `IP` may point to a word which is not the opcode of an instruction, or a jump may be performed with an address outside the register.

### (d) Sequential semantics

We now introduce the notion of *sequential semantics*, which provides an easy way to express SRM computations. We write $(\mathbf{d}_0,\ldots,\mathbf{d}_i\bullet\underline{p},\ldots,\mathbf{d}_n)$ to mean the configuration where `RP`=*i*, `IP` points to the first letter of `p`, *σ*(*j*)=**d**_{j} for *j*=0,…,*n* except for *j*=*i*, where *σ*(*i*)=**d**_{i}•`p`, and *σ*(*j*)=ε for *j*>*n*. When no ambiguity can arise, we also call $(\mathbf{d}_0,\ldots,\mathbf{d}_i\bullet\underline{p},\ldots,\mathbf{d}_n)$ a configuration.

The intuition is that the program denoted by **d**_{i} has already been executed, and we begin to run its sequel `p`. So, the underlined word is the currently executed program. In particular, $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n)$ means that `IP` points to the first letter of `p`, and so we run `p`; this is the initial configuration. Next, a computation is written as $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{e}_0,\ldots,\mathbf{e}_j\bullet\underline{q},\ldots,\mathbf{e}_m)$ to say that if the SRM runs the program `p`, then the computation goes to the configuration on the right-hand side. Sometimes, no word is underlined on the right, e.g. $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{e}_0,\ldots,\mathbf{e}_m)$; in this case, the computation halts. Sequential semantics will be used throughout the remainder of the paper. The examples in §2*e* illustrate how it works and allow us to get familiar with it.

### (e) Program motions

In §2*b*, we saw that data contained in registers on the left of the current program are useless to the ongoing computation. Indeed, there is no instruction that can use registers on the left of the current program, that is, on the left of `R`_{RP}. We now state this property and use it as a shortcut to display computations in the remainder.

### Proposition 2.1

(i) *For any* `p`, *if* $(\mathbf{c}_0,\ldots,\mathbf{c}_i\bullet\underline{p},\ldots,\mathbf{c}_n) \to^* (\mathbf{c}'_0,\ldots,\mathbf{c}'_j\bullet\underline{q},\ldots,\mathbf{c}'_m)$, *then* *i*=*j* *and* **c**_{k}=**c**_{k}′ *for any* 0≤*k*≤*i*.

(ii) *For any* *i*, *j* *and* `p`, *if* $(\mathbf{c}_0,\ldots,\mathbf{c}_i,\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{c}_0,\ldots,\mathbf{c}_i,\mathbf{e}_0,\ldots,\mathbf{e}_m)$, *then* $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{e}_0,\ldots,\mathbf{e}_m)$.

This proposition means that the registers on the left of `p` may be forgotten. For this reason, we will tend not to write registers on the left side of the current program. When we want to drop the left part of the current program `p`, we will write

$$(\mathbf{c}_0,\ldots,\mathbf{c}_i,\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \rightsquigarrow (\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{2.1}$$

Another important consequence is that we can relocate a program by translating it to the right: the result of its computation will be exactly the same. This means that if $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{e}_0,\ldots,\mathbf{e}_m)$, then for any *i* and all **c**_{0},…,**c**_{i}, $(\mathbf{c}_0,\ldots,\mathbf{c}_i,\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathbf{c}_0,\ldots,\mathbf{c}_i,\mathbf{e}_0,\ldots,\mathbf{e}_m)$.

In the remainder of this paper, we will use the statement (2.1) either as a simplification rule or as a program relocation rule.

### (f) Sequential connectivity and universality

Now, we come to the notion of composition: for any `p`, `q` and **d**_{1},…,**d**_{n}, suppose that running `p` from $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n)$ halts with `RP` unchanged and `IP` pointing just after the last instruction of `p`, leaving the registers on the right in a state (**e**_{1},…,**e**_{m}); if, moreover, $(\underline{q},\mathbf{e}_1,\ldots,\mathbf{e}_m) \to^* C$, then $(\underline{p\bullet q},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* C$. In the SRM framework, we prefer to refer to this compositional condition as a *sequential connectivity* condition. Indeed, it really suggests the idea that we have connected two computations. SRMs offer other connectivity mechanisms that we will not explore in this study.

It is evident that the SRM MOC is Turing complete. Nevertheless, it is worth seeing how universality boils down. A self-interpreter is a program `interp` such that the computation of $(\underline{\mathit{interp}},p,\mathbf{d}_1,\ldots,\mathbf{d}_n)$ produces the same results as the computation of $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n)$. This equation holds because we can translate a program to the right; this is a consequence of equation (2.1). A very simple SRM self-interpreter is the one-instruction program `exec` 1, which activates the program `p` stored in `R`_{1}.

## 3. Illustrations

The following examples illustrate how to program SRMs. They will be used in the constructions of the remainder. Note that programs are implicitly parametrized by a number *n*, which is an upper bound on the number of saved registers (or arguments). Thus, the reading is still rigorous, but notations are uncluttered. Registers beyond *n* will be used as working registers to perform intermediate steps in a computation. We will suppose that they are empty. Lastly, to reduce notational overhead, we henceforth use the same notation for a program and its opcode.

### (a) Move

The program *move* `R`_{i},`R`_{j} moves the contents of the register `R`_{i} into the register `R`_{j} and erases the content of `R`_{i}. We can easily express the sequential semantics of *move* `R`_{i},`R`_{j}:

$$(\underline{\mathit{move}\ R_i,R_j},\mathbf{d}_1,\ldots,\mathbf{d}_i,\ldots,\mathbf{d}_j,\ldots,\mathbf{d}_n) \to^* (\mathit{move}\ R_i,R_j,\mathbf{d}_1,\ldots,\varepsilon,\ldots,\mathbf{d}_j\bullet\mathbf{d}_i,\ldots,\mathbf{d}_n). \tag{3.1}$$
The program of *move*`R`_{i},`R`_{j} is the following:
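Since the listing is not reproduced here, the queue discipline of *move* can be mimicked directly in Python with deques; this is a sketch of the semantics (3.1), not the actual SRM opcode sequence.

```python
from collections import deque

def move(regs, i, j):
    # pop letters from the front of R_i and append them to the bottom of R_j,
    # until R_i is empty -- the loop that `move R_i,R_j` performs
    while regs[i]:
        regs[j].append(regs[i].popleft())
    return regs

regs = [deque(), deque("1#1"), deque("#")]
move(regs, 1, 2)
print([''.join(r) for r in regs])
```

The first-in, first-out discipline matters: the content of `R`_{i} arrives at the bottom of `R`_{j} in its original order.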

### (b) Erase

Next, we define the program *erase* `R`_{j}, which erases the content of `R`_{j}:

$$(\underline{\mathit{erase}\ R_j},\mathbf{d}_1,\ldots,\mathbf{d}_j,\ldots,\mathbf{d}_n) \to^* (\mathit{erase}\ R_j,\mathbf{d}_1,\ldots,\varepsilon,\ldots,\mathbf{d}_n). \tag{3.2}$$

### (c) Shift

We now define a program *shift*_{n} that shifts the registers `R`_{1},…,`R`_{n} to the right:

$$(\underline{\mathit{shift}_n},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{shift}_n,\varepsilon,\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{3.3}$$
We implement *shift*_{n} as follows:

### (d) Copy

The program *cp* `R`_{i},`R`_{j} copies the content of the register `R`_{i} into the register `R`_{j}. Unlike *move* `R`_{i},`R`_{j}, the register `R`_{i} is not erased:

$$(\underline{\mathit{cp}\ R_i,R_j},\mathbf{d}_1,\ldots,\mathbf{d}_i,\ldots,\mathbf{d}_j,\ldots,\mathbf{d}_n) \to^* (\mathit{cp}\ R_i,R_j,\mathbf{d}_1,\ldots,\mathbf{d}_i,\ldots,\mathbf{d}_j\bullet\mathbf{d}_i,\ldots,\mathbf{d}_n). \tag{3.4}$$

Hence, there are two copies of the same value at two different places. For this, we use `R`_{n+1}, which we suppose is empty. The program of *cp* `R`_{i},`R`_{j} is the following:
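The temporary-register trick can again be sketched with deques: each letter popped from `R`_{i} is pushed both to `R`_{j} and to the scratch register `R`_{n+1}, whose content is then moved back to `R`_{i}. The helper names are ours, not the paper's.

```python
from collections import deque

def cp(regs, i, j, tmp):
    # duplicate each letter of R_i into R_j and into the scratch register,
    # then restore R_i from the scratch register
    while regs[i]:
        letter = regs[i].popleft()
        regs[j].append(letter)
        regs[tmp].append(letter)
    while regs[tmp]:
        regs[i].append(regs[tmp].popleft())
    return regs

regs = [deque(), deque("1#"), deque(), deque()]
cp(regs, 1, 2, 3)
print([''.join(r) for r in regs])
```

The detour through the scratch register is forced by the queue discipline: a register can only be read destructively.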

## 4. Kleene's recursion theorem

We now come to fixpoint constructions in the SRM model. There are various fixpoint constructions induced by demonstrations of the recursion theorem of Kleene and of its variants. We will focus on a uniform version of the recursion theorem due to Myhill [36], according to Smullyan [6].

There is a *fixpoint compiling function* *fix* such that for all `p`∈{1,#}* and all **d**_{1},…,**d**_{n}∈{1,#}*,

$$(\underline{\mathit{fix}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},\mathit{fix}(p),\mathbf{d}_1,\ldots,\mathbf{d}_n),$$

where the program `p` can be interpreted as a specification to perform a computation with a self-copy. The function *fix* generates a fixpoint of the program `p`. That is, it creates a copy of `p`, and then it runs the copy of `p` on *fix*(`p`). So, the computation goes on.

Actually, the function *fix* is called a *program transformation*, which is a central notion in this part. More generally, a program transformation is a function, say *Ψ*, that takes a program `p` (encoded by a word in {1,#}*), and possibly other inputs **d**_{1},…,**d**_{n}, and constructs a new program *Ψ*(`p`,**d**_{1},…,**d**_{n}) (encoded by a word in {1,#}*). A compiler, or an s-m-n function, is a well-known example of a program transformation (see Jones [33]). In an SRM model, a program transformation is implemented by a program, say `e`, that satisfies an equation of the form

$$(\underline{e},p,\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (e,\Psi(p,\mathbf{d}_1,\ldots,\mathbf{d}_n),\ldots).$$

The result *Ψ*(`p`,**d**_{1},…,**d**_{n}) is a program that can be activated in turn in the SRM model.

For example, recall from the introduction that a program specializer `spec` implements an s-m-n function. In an SRM model, specialization is a program transformation implemented by a program `spec` satisfying

$$(\underline{\mathit{spec}},p,\mathbf{d}) \to^* (\mathit{spec},\mathit{spec}(p,\mathbf{d})).$$

Note that activating *spec*(`p`,**d**) runs `p` on **d**.

Following Hansen *et al*. [37], the second recursion theorem does not provide an efficient solution for recursive programs. This observation changes if the programming language is reflexive. Indeed, it is more efficient to have a self-description than a complete copy of oneself, which would be duplicated all along a computation. However, it is not the same thing when we are dealing with computer viruses. In that case, we really need to create an independent copy to contaminate a system. For this reason, and following Case [14], a virus needs complete self-knowledge. This means that a virus must be able to generate an operational and independent copy of itself.

The construction of the fixpoint compiling function *fix* proceeds step by step, by defining four different program transformations. The parts below describe each program transformation in detail. The idea behind the construction is very similar to the von Neumann [13] method to build self-reproducing systems. This section concludes by stating both Kleene's recursion theorem and Myhill's variant.

### (a) Write

We begin with an amazing hack of Moss [17]. A *writing function* *wrt*:{1,#}*→{1,#}* is a program transformation such that for every `p`∈{1,#}*,

$$(\underline{\mathit{wrt}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{wrt}(p),\mathbf{d}_1\bullet p,\mathbf{d}_2,\ldots,\mathbf{d}_n). \tag{4.1}$$

In other words, *wrt*(`p`) appends a copy of `p` to the register `R`_{1}. We devise a program `write`, which computes *wrt*(`p`) by generating a program that reconstructs its parameter `p`:

$$(\underline{\mathit{write}},p,\mathbf{d}_2,\ldots,\mathbf{d}_n) \to^* (\mathit{write},\mathit{wrt}(p),\mathbf{d}_2,\ldots,\mathbf{d}_n). \tag{4.2}$$

We implement `write` using `R`_{n+2} as a temporary register. It is worth noticing that the `write` program is similar to quine constructions or to the Thompson [32] construction of a self-reproducing C program.

### (b) Controller

The next program transformation is a function *ctrl*, whose task is to control the duplication as follows:

$$(\underline{\mathit{ctrl}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{ctrl}(p),p,\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.3}$$

The function *ctrl*(`p`) shifts the registers to the right and then drops a copy of `p` into the freed register. This is performed by setting *ctrl*(`p`)=*shift*_{n}•*wrt*(`p`). We check that equation (4.3) is satisfied: *shift*_{n} moves **d**_{1},…,**d**_{n} one register to the right, and *wrt*(`p`) then appends `p` to the now empty register `R`_{1}.

We implement *ctrl* by the program `ctrl` defined as follows:

From now on, we put on the right the sequential semantics of the program or of the instruction executed, to help the reader. On the very first line, we indicate the initial configuration. The configuration on line 1 is obtained after executing the instruction `write`, and so on. It follows that

$$(\underline{\mathit{ctrl}},p,\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{ctrl},\mathit{ctrl}(p),\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.4}$$

Following §2*e*, the function *ctrl* inserts `p` correctly if we want to run it on **d**_{1},…,**d**_{n}. We define *ctrlX*, which, in addition, runs `p`. We set *ctrlX*(`p`)=*ctrl*(`p`)•`exec` 1, which thus satisfies

$$(\underline{\mathit{ctrlX}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{ctrlX}(p),\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n) \tag{4.5}$$

and the computation then proceeds as the computation of

$$(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.6}$$

In the first step, the program *ctrlX*(`p`) shifts the registers to the right and copies `p`. The second step consists of activating and running `p` on **d**_{1},…,**d**_{n}. As a result, the computation is identical to the computation of $(\underline{p},\mathbf{d}_1,\ldots,\mathbf{d}_n)$, because a program is relocatable by a right translation.

Finally, note that *ctrlX*(`p`) corresponds to a specialization of the self-interpreter to `p` in the SRM model. Actually, we can now easily devise a specialization program from *ctrl* and `exec`, and thus recover the s-m-n construction in the SRM model.

### (c) Diagonalizer

A function *diag*:{1,#}*→{1,#}* is a *diagonalization function* if

$$(\underline{\mathit{diag}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},p,\mathbf{d}_1,\ldots,\mathbf{d}_n); \tag{4.7}$$

that is, running *diag*(`p`) amounts to running `p` on a copy of `p` followed by **d**_{1},…,**d**_{n}.

A *diagonalizer* `diag` is a program that computes a diagonalization function,

$$(\underline{\mathit{diag}},p,\mathbf{d}_2,\ldots,\mathbf{d}_n) \to^* (\mathit{diag},\mathit{diag}(p),\mathbf{d}_2,\ldots,\mathbf{d}_n). \tag{4.8}$$

We implement `diag` as follows. As a result, *diag*(`p`)=*ctrl*(`p`)•*ctrlX*(`p`). We check that equation (4.7) is satisfied: *ctrl*(`p`) shifts the registers and writes a copy of `p` into `R`_{1}; then *ctrlX*(`p`) shifts again, writes a second copy of `p` and activates it, so that `p` runs with a copy of itself in the register on its right.

Note that a diagonalizer is a particular case of a specializer: it specializes a program with a copy of itself. That is, *diag*(`p`) behaves as *spec*(`p`,`p`).

### (d) Fixpoint compiler

A function *fix*:{1,#}*→{1,#}* is a *fixpoint function* if and only if for every `p`∈{1,#}*,

$$(\underline{\mathit{fix}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},\mathit{fix}(p),\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.9}$$

We define *fix*(`p`)=*diag*(`diag`•*ctrlX*(`p`)). We can check that (4.9) holds: by (4.7), running *fix*(`p`) runs `diag`•*ctrlX*(`p`) on a copy of itself; the program `diag` first transforms this copy into *diag*(`diag`•*ctrlX*(`p`))=*fix*(`p`), and *ctrlX*(`p`) then activates `p` on *fix*(`p`),**d**_{1},…,**d**_{n}.

### (e) Recursion theorems

We now state and demonstrate the Kleene [3] second recursion theorem in the SRM model.

### Theorem 4.1 (Kleene)

*For any* *n* *and any program* `p`∈{1,#}*, *there is a program* `e`∈{1,#}* *such that, for all* **d**_{1},…,**d**_{n} *in* {1,#}*,

$$(\underline{e},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},e,\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.10}$$

### Proof.

Put `e`=*fix*(`p`). ■
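The fixpoint construction can be imitated in Python over program texts; `fix` below is our own illustrative analogue of the theorem (assuming the specification source defines a one-argument function `spec` and contains no `%` characters), not the SRM implementation.

```python
captured = []

def fix(spec_src):
    """Return program text e such that exec(e) calls spec on e's own text.

    spec_src must define a one-argument function `spec` and must not
    contain '%' characters (they would clash with the template)."""
    h = spec_src + "\nh = %r\nspec(h %% (h,))\n"
    return h % (h,)

spec_src = "def spec(v):\n    captured.append(v)\n"
e = fix(spec_src)
exec(e)                       # running e hands e its own text to spec
print(captured[0] == e)       # the program knows its own code
```

As in the proof above, the fixpoint is obtained by diagonalization: the template `h` plays the role of `diag`, and formatting `h` with its own representation produces the self-knowing program.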

The uniform version of Myhill [36] is a direct consequence of the above construction. It constructs a fixpoint compiler from a given specification.

### Theorem 4.2 (Myhill)

*There is a program* `fix`∈{1,#}* *such that, for any program* `p`∈{1,#}* *and all* **d**_{1},…,**d**_{n} *in* {1,#}*,

$$(\underline{\mathit{fix}},p,\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\mathit{fix},\mathit{fix}(p),\mathbf{d}_1,\ldots,\mathbf{d}_n) \tag{4.11}$$

*and*

$$(\underline{\mathit{fix}(p)},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},\mathit{fix}(p),\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{4.12}$$

### Proof.

A *fixpoint compiler* `fix` is implemented as follows:

The program `fix` implements the fixpoint compiling function *fix*. ■

## 5. Computer viruses

### (a) Blueprint viruses

In Bonfante *et al.* [11], we defined a class of blueprint viruses that duplicate themselves without modifying their own programs. The virus copy is injected inside a register. If the register is not empty, the virus copy is injected into the register's data/program; otherwise, a stand-alone offspring is created. Contrary to the earlier-mentioned study, the virus copy may be activated in the SRM MOC. Thus, the virus copy may be seen as an independent organism, owing to the program motion property of the SRM. The first example given in the introduction is a blueprint virus. Adleman [9] proposed a model of computer viruses based on recursion theory, in which several classes of viruses are defined. In each case, when a virus duplicates, its copy is identical to its source. In fact, the viruses studied by Adleman are blueprint viruses.

The aim of this section is to show how blueprint virus construction depends on the recursion theorem, and how to construct a blueprint virus compiler.

Given a program `p`, a *blueprint virus* is a program `v` that satisfies the equation

$$(\underline{v},\mathbf{d}_1,\ldots,\mathbf{d}_n) \to^* (\ldots,\underline{p},v,\mathbf{d}_1,\ldots,\mathbf{d}_n). \tag{5.1}$$

In this equation, we say that `p` is a viral specification. It is a program that has at least one argument, `v`, which denotes a copy of itself. In turn, `v` may be executed, and so it may pursue the infection. Let us return to the example of the introduction (§1*a*(i)) in order to understand what is going on. In the introductory example, `Specification_of_VA` corresponds to a viral specification of a blueprint virus, which we could formalize in the SRM model. The solution of the above fixed-point equation is the virus that we are seeking.

So, the construction of a blueprint virus corresponds to solving the fixpoint equation (5.1). The Kleene recursion theorem 4.1 gives an immediate solution by setting `v`=*fix*(` p`).

It is important to see that Myhill's recursion theorem allows us to define a compiler of viruses, which is called a *virus distribution*. A virus distribution is a program that takes a viral specification and outputs a virus that realizes it. Thus, ` fix` is a virus distribution. Let us now give two illustrations of blueprint viruses.

#### (i) An overwriting virus

We begin with a virus that erases the content of each register, one by one. The specification of an overwriting virus, ` owrt`, is given by equation (5.2). The recursion theorem gives a solution of equation (5.1) where `v`_{owrt}=*fix*(` owrt`). The computation goes on because the offspring `v`_{owrt} is activated; the computation is repeated, and the virus thus spreads and erases the content of each register.
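To make the mechanics concrete, here is a toy Python analogue (our own modelling, not the paper's SRM code): registers are a list of strings, the viral specification overwrites every register with the copy of the virus it receives, and the virus itself is the fixpoint. The `fix` helper repeats the standard diagonal construction so the sketch is self-contained.

```python
def smn(p, d):
    # s-m-n / specialization on source strings
    return f"(lambda *a: ({p})({d!r}, *a))"

def fix(p):
    # standard diagonal construction of a fixpoint compiler
    diag_src = "(lambda x: '(lambda *a: (' + x + ')(' + repr(x) + ', *a))')"
    q = f"(lambda x, *a: ({p})({diag_src}(x), *a))"
    return smn(q, q)

# viral specification: overwrite every register with the virus copy
owrt = "(lambda self, regs: [self for _ in regs])"

v_owrt = fix(owrt)
regs = ["data-1", "data-2", "data-3"]
infected = eval(v_owrt)(regs)
assert infected == [v_owrt] * 3          # every register erased and overwritten

# activating an offspring repeats the computation, as in the SRM model
assert eval(infected[0])(infected) == [v_owrt] * 3
```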

#### (ii) An ecto-symbiote

The next example is called an *ecto-symbiote* because it lives on the surface of its host. The construction of an ecto-symbiote follows the lines of the previous virus. We define ` ecto`=` move` `R`_{1},`R`_{2}•` exec` 2. Thus, the program ` ecto` verifies equation (5.3). The viral specification ` ecto` says that the ecto-symbiote moves a copy of itself and appends it to the register just on its right; then, it runs the register on its right. If the computation of **d**_{1} activates *x*, for example by just halting normally, then the virus offspring will be run.

Again, the construction of the virus corresponds to solving the fixpoint equation
⟦`v`⟧(*d*_{1},…,*d*_{n}) = ⟦` ecto`⟧(`v`, *d*_{1},…,*d*_{n}).
5.4
A solution is the fixpoint `v`_{ecto}=*fix*(` ecto`). By composing (5.3) and (5.4), we obtain the behaviour of the ecto-symbiote `v`_{ecto}.
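In the same toy Python model (again our own sketch, self-contained), we can play out the infection step of the ecto-symbiote: the specification appends a copy of the virus to the host register on its right. The activation step (` exec` 2) is elided, since an arbitrary host string is not necessarily executable in this toy setting.

```python
def smn(p, d):
    # s-m-n / specialization on source strings
    return f"(lambda *a: ({p})({d!r}, *a))"

def fix(p):
    # standard diagonal construction of a fixpoint compiler
    diag_src = "(lambda x: '(lambda *a: (' + x + ')(' + repr(x) + ', *a))')"
    q = f"(lambda x, *a: ({p})({diag_src}(x), *a))"
    return smn(q, q)

# viral specification: append a copy of the virus to the host register on
# the right; the exec step that would then run the host is elided here
ecto = "(lambda self, regs: [regs[0] + chr(10) + self] + regs[1:])"

v_ecto = fix(ecto)
out = eval(v_ecto)(["HOST PROGRAM", "other"])
assert out[0] == "HOST PROGRAM\n" + v_ecto   # the symbiote lives on its host
assert out[1] == "other"
```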

### (b) Cohen's model of viruses

In his PhD thesis, Cohen [7] provided a model of computer viruses, which was published by Cohen [38], based on one-tape TMs. Informally, a virus is a code that is able to copy itself and to evolve on a given TM **M**. To take into account virus evolutions, Cohen's model does not consider a single virus but a set *V* , named a *viral set*, denoting the set of viruses of the machine **M**. All evolutions of a virus `v` are in *V* . As a result, the model is a pair *VS*=(**M**,*V*), where **M** is a TM and *V* is a viral set that satisfies the following property: if **M** reads a virus `v`∈*V* , then **M** will write an evolution `v`′∈*V* on its tape. This definition is obviously recursive and `v`′ will produce in turn a word of *V* . To sum up, the two main traits of Cohen's model are (i) viruses are self-reproducing programs and (ii) viruses may evolve. The next two parts are devoted to the construction of simple mutating viruses, which encompass Cohen's definition with respect to an SRM model.

### (c) Self-reproduction and semantics-preserving mutation

We now turn to the question of self-reproduction mechanisms in which the offspring mutates. There are several forms of possible mutations. Following Rogers [31], there is an infinite number of solutions to a given viral specification. So there is a sequence of viruses `v`_{1},`v`_{2},…,`v`_{n},… such that
⟦`v`_{i}⟧(*d*_{1},…,*d*_{n}) = ⟦` p`⟧(`v`_{i}, *d*_{1},…,*d*_{n}).
5.5
Such a sequence of viruses may be obtained by applying a program transformation that preserves the original semantics, that is, a program transformation *Obs* such that `v`_{i+1} behaves exactly like *Obs*(`v`_{i}). From a theoretical point of view, a padding function works well. In practice, code obfuscation techniques are the usual mutation method: the program denotation is concealed from a human analyst, which allows the virus to bypass detection procedures. Although Barak *et al*. [39] established that there is no good obfuscation method, obfuscation methods are very common in practice. Giacobazzi *et al.* [40] have shown relationships with meta-programming and code generation. In the same spirit, we have yet to analyse the role of the padding lemma as a theoretical basis of code obfuscation.
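As a toy illustration (our own, in Python source-string terms), a padding transformation that wraps a program in a semantically inert forwarding layer is both injective and semantics-preserving, which is all the mutation mechanism needs:

```python
# Obs wraps a program source string in an inert forwarding layer: the
# result is syntactically different but computes the same function
def Obs(src):
    return f"(lambda *a: ({src})(*a))"

p = "(lambda n: n + 1)"
assert Obs(p) != p                              # a syntactically new mutant
assert eval(Obs(p))(41) == eval(p)(41) == 42    # same denotation
# Obs is injective: the source p can be read back out of Obs(p)
assert Obs(p)[len("(lambda *a: (") : -len(")(*a))")] == p
```

Iterating `Obs` yields the infinite sequence `v`, *Obs*(`v`), *Obs*(*Obs*(`v`)), … of syntactically distinct but behaviourally identical programs, as in equation (5.5).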

### (d) Virus mutation

Until now, we have an infinite sequence of viruses with the same behaviour. However, an offspring is still an exact copy of its father because there is no in-built mutation mechanism. In practice, this means that the generation of a new mutant has to be made by hand. Now, a direct application of the recursion theorem allows us to generate a different copy at each generation. For this, we need to solve the following fixed-point equation, where ` p` denotes a viral specification:
⟦`v`⟧(*d*_{1},…,*d*_{n}) = ⟦` p`⟧(*Obs*(`v`), *d*_{1},…,*d*_{n}).
5.6
Because *Obs* is a semantics-preserving transformation, we have
⟦*Obs*(`v`)⟧(*d*_{1},…,*d*_{n}) = ⟦`v`⟧(*d*_{1},…,*d*_{n}).
5.7
If *Obs* is injective, then each virus mutation will be different. Thus, at each generation, the virus is different and operational. Because SRMs have the ability to activate offspring, the viral infection will spread by injecting mutating viruses. A solution is also given by the second recursion theorem by setting `v`=*fix*(` obs`•` p`), where ` obs` computes *Obs*, that is, ⟦` obs`⟧(`x`)=*Obs*(`x`).
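Equation (5.6) can be played out in the same Python source-string toy model (again our own sketch, with the diagonal `fix` repeated for self-containedness): the specification prepends a copy of the virus to the registers, but the copy it receives has first been passed through an injective padding, so the emitted offspring differs syntactically from its parent while behaving identically, as in (5.7).

```python
def smn(p, d):
    # s-m-n / specialization on source strings
    return f"(lambda *a: ({p})({d!r}, *a))"

def fix(p):
    # standard diagonal construction of a fixpoint compiler
    diag_src = "(lambda x: '(lambda *a: (' + x + ')(' + repr(x) + ', *a))')"
    q = f"(lambda x, *a: ({p})({diag_src}(x), *a))"
    return smn(q, q)

# obs: source text of an injective, semantics-preserving padding Obs
OBS = "(lambda s: '(lambda *a: (' + s + ')(*a))')"

# viral specification: prepend a copy of the virus to the registers
spec = "(lambda self, regs: [self] + regs)"

# solve equation (5.6): the copy handed to spec is first mutated by obs
mut = f"(lambda self, *a: ({spec})({OBS}(self), *a))"
v = fix(mut)

offspring = eval(v)(["data"])[0]
assert offspring != v                                    # syntactically new
assert eval(offspring)(["data"]) == eval(v)(["data"])    # same behaviour (5.7)
```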

## 6. Conclusion

We have considered only a small portion of the interesting fixpoint theorems. We have seen that a program may mutate by applying a semantics-preserving transformation. Other more elaborate recursion theorems give more complex viruses. Indeed, a program may also evolve by applying a non-semantic-preserving transformation, say *Evo*. In this case, `v` and *Evo*(`v`) do not behave in the same way. We may think that from one generation to another, viruses evolve. In practice, this happens when a virus is updated either to apply a patch or to embed new functionalities. The construction of evolving viruses is not a direct consequence of the second recursion theorem. Indeed, we have to ensure that the evolution *Evo*(`v`) of `v` is able to reproduce itself. As a result, the construction of an evolving virus is based on stronger variants of the recursion theorem. A variant of the extended fixpoint theorem of Smullyan [5] is well suited. In SRM settings, we can state it as follows:
⟦*V*(*x*)⟧(*d*_{1},…,*d*_{n}) = ⟦*x*⟧(*V*(*Evo*(*x*)), *d*_{1},…,*d*_{n}).
In this equation, *V*(*x*) is the evolving virus and *x* is its viral specification. The viral specification *x* is modified at each replication, owing to the *Evo* function. As a result, *V*(*Evo*(*x*)) is an operational virus, whose behaviour has been updated to compute *Evo*(*x*). The evolution consists of adding computational capabilities to self-replicating structures. There may be an analogy to explore with the works of Myhill [41] and an efficient construction of a conservative and consistent hierarchy of arithmetic theories. Indeed, a similar construction would allow us to build viruses getting ‘smarter’. It is fascinating to see that most of the constructions in computability theory are now foundational ideas in computer science. To end this discussion, we may have a second look at the function *V* defined earlier. It is a program transformation that corresponds to a compiling function, but where the target program has a different behaviour from the source. The target program has evolved compared with the source.

An SRM MOC may also be seen as a simple abstract machine with which to study self-modifying programs. When an offspring is activated, self-reproduction is a particular case of self-modification. Self-modifying programs are in fact commonplace: they are mostly used to speed up execution, as in just-in-time compilers, or to protect codes against human or automatic analysis. A self-modifying program generates its code at runtime. Thus, we can represent a self-modifying program by a sequence of code waves (i.e. code layers). Each wave is a non-self-modifying program, which is generated by a previous wave. The full program is formed by a set, or rather by a directed graph, of waves.
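The wave decomposition can be sketched in miniature (a toy of ours; real waves are typically generated machine code, as in a JIT): each wave is an ordinary, non-self-modifying piece of code produced as a source string by the previous wave and then activated.

```python
# Sketch: a self-modifying program as a sequence of "code waves".
# Wave 0 (this script) builds wave 1's source at runtime; wave 1 in turn
# carries and activates wave 2. Each wave is itself non-self-modifying.
wave2_src = "out.append('wave 2 ran')"
wave1_src = f"out.append('wave 1 ran'); exec({wave2_src!r})"

ns = {"out": []}
exec(wave1_src, ns)   # activate wave 1, which generates and runs wave 2
assert ns["out"] == ['wave 1 ran', 'wave 2 ran']
```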

## Acknowledgements

I thank Serge Grigorieff and Maurice Margenstern, who introduced me to computability theory and many other topics. I also thank Neil Jones for many profound talks, and John Case, with whom I discussed this topic during his stay in Nancy.

## Footnotes

One contribution of 18 to a Theme Issue ‘The foundations of computation, physics and mentality: the Turing legacy’.

- This journal is © 2012 The Royal Society