## Abstract

Mapping reads against a genome sequence is an interesting and useful problem in computational molecular biology and bioinformatics. In this paper, we focus on the problem of indexing a sequence for mapping reads with a single mismatch. We first focus on a simpler variant in which the length of the pattern is given beforehand, during the data structure construction. This version of the problem is interesting in its own right in the context of next generation sequencing. In the sequel, we show how to solve the more general problem. In both cases, our algorithm can construct an efficient data structure in *O*(*n* log^{ε}*n*) time and space and can answer subsequent queries in *O*(*m* log log *n*+*K*) time. Here, *n* is the length of the sequence, *m* is the length of the read, 0<*ε*<1 is an arbitrary constant and *K* is the optimal output size.

## 1. Introduction

In the classical string-matching problem, we are given a text of length *n* and a pattern of length *m*, and we need to answer whether the pattern occurs in the text as a factor or substring (*existence query*) and, further, to report the start (or, equivalently, end) positions of the corresponding occurrences (*occurrence query*). Using the famous KMP algorithm, due to Knuth *et al.* [1], the classical string-matching problem can be solved in optimal *O*(*n*+*m*) time. For a review of the KMP algorithm, as well as various other (exact) string-matching algorithms, readers are referred to [2].

In many application settings, a number of patterns are searched in a particular text which gives rise to the *indexing version* of the problem. In this version, the text is given beforehand for preprocessing with a goal to construct an index data structure. Subsequently, a number of patterns are queried against the text. Clearly, this indexing variant can also be solved using the KMP algorithm; however, if we have *k* patterns (each of length *m*), the running time will be *O*(*k*(*n*+*m*)). As all the queries are done against a single text, it is only natural to construct an index data structure for it once, preferably in *O*(*n*) time and then answer the subsequent queries in *O*(*m*) time each, which gives a total of *O*(*km*+*n*) running time. And, indeed there exist efficient data structures (e.g. suffix trees [3–6] and suffix arrays [7–9]) that can be constructed in *O*(*n*) time and are capable of answering an existence query in *O*(*m*) time and an occurrence query in *O*(*m*+|Occ|) time, where Occ is the set of output.

The string-matching problem has tremendous applications in different branches of science, including, but not limited to, Computational Molecular Biology, Bioinformatics, Computer Vision, Information retrieval, Computational Musicology, Data Mining, Network Security, etc. However, in many, if not most, practical settings, instead of the exact matching some approximate matching schemes become more relevant and useful. The requirement of such approximate matching schemes is more prominent in Computational Molecular Biology and Bioinformatics, as discussed below. Notably, in the context of Computational Biology, the string-matching problem is essential for mapping *reads* (i.e. patterns) to a reference *biological sequence* (i.e. a text).

While an increasing number of biology laboratories are using dedicated high-throughput equipment to produce many DNA sequences on a daily basis, the need for automatic annotation and content analysis is greater every day. Unfortunately, even with the tremendous advancements of the current state of the art technology, the quality of the automatically obtained sequences is sometimes questionable owing to the intrinsic limitations of the equipment (for instance, Minoche *et al.* [10] evaluated the substitution error rate of the control genome PhiX174 to be in the range of 0.11–0.28% excluding uncalled bases: all positions were affected by substitution errors). Moreover, the re-sequencing methods are affected by the natural polymorphism that can be observed between individual samples (e.g. a single nucleotide polymorphism (SNP), that is an isolated mutation, can either stop the translation of an mRNA into a protein sequence or create a binding site for a protein complex that will prevent the complete formation of the functional protein [11,12]). Analysing such uncertain sequences is therefore much more complicated than the traditional pattern-matching problem. This gives computer scientists, in particular stringology researchers, the challenge of solving the string-matching problem where some positions of the given text or pattern may be uncertain in some sense. To capture this phenomenon of uncertainty, the idea of mismatches and gaps was introduced. Additionally, a popular and useful framework of don't care pattern matching was introduced under this approximate matching scheme, where a pattern and/or text may contain don't care characters that match with any character in the underlying alphabet.

The use of the don't care paradigm to capture the approximate pattern-matching scenario mentioned above is not new. It was introduced and solved efficiently using convolutional methods in [13]. Slightly tighter solutions have been presented in [14–17]. There exist some solutions avoiding the convolution method as well [18–20]. A number of solutions exist in the literature that consider the problem of text indexing with don't cares [21–24]. Notably, in the literature, the don't cares are also referred to as wildcards. Some work has also been done on a generalized model of the don't care paradigm known as the degenerate or indeterminate string model in the literature [25–28]; in this model, some positions of a string may contain more than one letter and don't care is essentially modelled as a position that contains all the letters of the alphabet.

Another popular approximate model in the literature is the one where some *k*≥0 mismatches are allowed while doing the pattern matching. There exist a number of results for this problem [29–35]. However, from the indexing point of view the results are only a few (see, for instance, [21,36,37]). Notably, the *k*-mismatch model can be captured in the don't care model by assuming *k* don't care characters at all possible combinations of positions in the pattern (and/or the text).

In this paper, we focus on the indexed version of the pattern-matching problem with a restricted number of mismatches and/or gaps. In particular, we are interested in the pattern-matching problem when at most one mismatch or gap is allowed. Here, a gap refers to a run of consecutive mismatches. To the best of our knowledge, the only work that deals with this problem directly is that of Amir *et al.* [36]. We will give a brief review of the work of Amir *et al*. [36] and compare their results with ours in a subsequent section.

The contribution of this paper is twofold. First, we attack a restricted version of the problem at hand where we assume that the size of the pattern is fixed, i.e. we are given the size of the pattern to be queried against our data structure beforehand. As discussed below, this particular version of the problem has strong motivation from Computational Biology and Bioinformatics, especially in the context of Next Generation Sequencing. In the sequel, we consider the general version of the problem where this restriction is lifted. The solution we propose makes use of suffix arrays and range search data structures borrowed from the Computational Geometry literature. In particular, in order to present an efficient data structure for our problem, we use a reduction from the problem at hand to the range search problem in geometry. As will be further discussed in later sections, a similar reduction has also been used in [36] to solve this particular problem and in different other papers (e.g. [38–41]) to solve some other interesting problems. Note, however, that the reduction itself is not enough to get a good solution. As will be clear later, we need to do some non-trivial work based on some useful observations and lemmas to achieve an efficient running time for the queries.

The motivation of our work comes both from the field of Stringology and from Computational Biology and Bioinformatics. Firstly, a solution to approximate pattern matching with a single mismatch would be useful as a theoretical advancement in the context of the general variant where *k* mismatches are considered. Secondly, the approximate pattern-matching problem with at most a single mismatch is a useful problem in its own right. This is especially true in Computational Molecular Biology where, with the advent of new state-of-the-art technologies, the chance of experimental mistakes has become much lower than it was before. Also, single mismatches, called SNPs, rarely occur at consecutive positions (see [42, Table 1] for experimental results carried out on 10 individual human genomes from the 1000 Genomes Project). Thirdly, because of the recent high-throughput sequencing technologies, we are particularly interested in providing a fast solution to the restricted version of the problem where the pattern size is fixed. In the so-called next generation sequencing (NGS) technologies, millions of short reads are generated. Usually, the first region of these reads, called the seed, contains almost no sequencing error. The seed region is followed by a region having a very small probability of error and hence, from the mapping point of view, only a single mismatch or gap (i.e. consecutive mismatches) is expected. Now, in some NGS platforms, the generated sequences, which are usually several millions in number, are of the same size; this size depends on the technology used (e.g. Illumina, etc.). When sequencing platforms generate variable-length reads, they can be reduced to prefix seeds of fixed length. As a result, the fixed pattern length version of the problem is of particular interest in this specific scenario.

The rest of the paper is organized as follows. After the definitions and problem setting in §2, we give in §3 the general idea of the solution. It serves as a guideline for the development of algorithms. Section 4 solves a simpler problem, while §5 describes a solution for the complete problem. We put some relevant discussion and conclusion in §§6 and 7.

## 2. Preliminaries

A string is a sequence of zero or more symbols from an alphabet *Σ*. A string *X* of length *n* is denoted by *X*[1 . . *n*]=*X*[1]*X*[2] . . . *X*[*n*], where *X*[*i*]∈*Σ* for 1≤*i*≤*n*. The length of *X* is denoted by |*X*|. The string *X*^{r} denotes the reverse of the string *X*, i.e. *X*^{r}=*X*[*n*]*X*[*n*−1] . . . *X*[1].

A string *w* is a factor of *X* if *X*=*uwv* for *u*, *v*∈*Σ**; in this case, the string *w* occurs at position |*u*|+1 in *X*. The factor *w* is denoted by *X*[*x* . . *y*], where *x*=|*u*|+1 and *y*=|*u*|+|*w*|. A *k*-factor is a factor of length *k*. A prefix (or suffix) of *X* is a factor *X*[*x* . . *y*] such that *x*=1 (*y*=*n*), 1≤*y*≤*n* (1≤*x*≤*n*). We define the *i*th prefix to be the prefix ending at position *i*, i.e. *X*[1 . . *i*], 1≤*i*≤*n*. Dually, the *i*th suffix is the suffix starting at position *i*, i.e. *X*[*i* . . *n*], 1≤*i*≤*n*.

The *Hamming distance* between two strings of equal length is the number of positions at which the corresponding letters are different. More formally, the Hamming distance between *u* and *v* is *δ*(*u*,*v*)=|{*i*|*u*_{i}≠*v*_{i},1≤*i*≤|*u*|=|*v*|}|. Given two equal length strings *u*,*v*∈*Σ**, *u* is said to match *v* (or equivalently *v* is said to match *u*) with at most *k* mismatches if *δ*(*u*,*v*)≤*k*. Essentially, the exact match is characterized by a zero Hamming distance.
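The definitions above can be checked directly with a minimal Python sketch (the function names are ours, chosen for illustration):

```python
def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(u, v) if a != b)

def matches_with_k_mismatches(u: str, v: str, k: int) -> bool:
    """True if u matches v with at most k mismatches."""
    return hamming(u, v) <= k
```

An exact match corresponds to `hamming(u, v) == 0`.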

Given a text *T* of length *n* and a pattern *P* of length *m* such that *m*≤*n*, *P* is said to occur in *T* at position *i* (i.e. exact match) if and only if *T*[*i* . . *i*+*m*−1]=*P*. The position *i* is said to be an occurrence of *P* in *T*. We denote by *Occ*_{T}^{P} the set of occurrences of *P* in *T*. As an extension, *P* is said to occur in *T* at position *i* with at most *k* mismatches if and only if *δ*(*T*[*i* . . *i*+*m*−1],*P*)≤*k*. We use *Occ*_{T}^{P,≤k} to denote the set of positions where *P* matches *T* with at most *k* mismatches. Similarly, we use *Occ*_{T}^{P,=k} to denote the set of positions where *P* matches *T* with exactly *k* mismatches. Clearly, our interest is in calculating *Occ*_{T}^{P,≤1}.

In traditional full-text indexing problems, one of the basic data structures used is the suffix array. Others are suffix trees and suffix automata. In our indexing problem, we make use of the suffix array data structure. A complete description of a suffix array is beyond the scope of this paper and can be found in any textbook on stringology (e.g. [43–45]). Here, we give a very concise definition. The suffix array *SA*_{T} of a text *T* is an array of integers *j*∈[1 . . *n*] such that *SA*_{T}[*i*]=*j* if and only if *T*[*j* . . *n*] is the *i*th suffix of *T* in (ascending) lexicographic order. Suffix arrays were first introduced in [46], where an *O*(*n* log *n*) construction algorithm and an *O*(*m* log *n*) query time were presented. Later, linear time construction algorithms for space efficient suffix arrays were presented [8,9,47,48]. The query time was also improved to optimal in [49] with the help of another array essentially storing the lengths of the longest common prefixes between lexicographically consecutive suffixes. We recall that the result of a query for a pattern *P* on a suffix array of *T* is given in the form of an interval [*s* . . *e*] such that *Occ*_{T}^{P}={*SA*_{T}[*i*] | *s*≤*i*≤*e*}. In this case, the interval [*s* . . *e*] is denoted by *Interval*_{SA_{T}}(*P*).
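The suffix array and the interval query can be sketched as follows (naive construction for illustration only; the helper names `suffix_array` and `sa_interval` are ours, and a linear-time construction algorithm would replace the sort in practice):

```python
from bisect import bisect_left, bisect_right

def suffix_array(t: str) -> list:
    """Suffix array of t as 1-based start positions, naively sorted."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def sa_interval(t: str, sa: list, p: str):
    """Interval [s..e] (1-based, inclusive) such that the occurrences of p
    are {sa[i-1] : s <= i <= e}; returns None if p does not occur in t."""
    suffixes = [t[i - 1:] for i in sa]
    s = bisect_left(suffixes, p) + 1
    e = bisect_right(suffixes, p + "\uffff")  # "\uffff" acts as a sentinel
    return (s, e) if s <= e else None
```

For example, for the text `banana`, the interval of `ana` contains the suffix-array ranks of the occurrences at positions 2 and 4.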

We remark that the query time of a suffix array (and of similar data structures) always contains a hidden factor depending on the alphabet size *σ*=|*Σ*|. However, as in most cases the size of the underlying alphabet *Σ* is a constant, the trend in the literature is to omit this factor from the running times. Indeed, this is all the more applicable in our case because we are focused on biological sequences, where the underlying alphabet size is usually a small constant (e.g. 4 for DNA/RNA sequences and 20 for protein/amino acid sequences). Finally, we note that there are several linear time suffix array construction methods that work with integer alphabets as well (e.g. [9,47]).

## 3. The underlying idea for a solution

In this section, we discuss how we can efficiently compute *Occ*_{T}^{P,≤1}, given *T* and *P*. In other words, we will in fact discuss a solution of our problem, albeit not from the indexing point of view. In the sequel, we will be using the underlying idea to construct the index data structures of our interest. Notably, the same idea was used by Amir *et al.* [36] to provide an indexing solution for the problem. We will discuss the problems with the solution in [36] and highlight the differences between their solution and the solution presented in this paper in a later section.

Suppose we know the position of the mismatch and assume that the position is *j*. In other words, we suppose that the mismatch is only allowed with *P*[*j*]. Let us call *P*^{(j)} the pattern with a ⁎ (don't care) in position *j*, and let us use *Occ*_{T}^{P^{(j)}} to denote the corresponding set of occurrences over *T*. Then, we may proceed as follows. We compute *Occ*_{T}^{P[1 . . j−1]} and *Occ*_{T}^{P[j+1 . . m]}, where the empty prefix (*j*=1) and the empty suffix (*j*=*m*) are assumed to occur at every position. Then, clearly, our desired set of occurrences can be computed as follows: *Occ*_{T}^{P^{(j)}}=*Occ*_{T}^{P[1 . . j−1]}∩{*i*−*j* | *i*∈*Occ*_{T}^{P[j+1 . . m]}}∩[1 . . *n*−*m*+1].
Now to lift the restriction, we can simply run a loop on all possible values of *j* and take *Occ*_{T}^{P,≤1}=⋃_{1≤*j*≤*m*} *Occ*_{T}^{P^{(j)}}. This simple idea works perfectly and the steps are formally presented in algorithm 1. An example run of this algorithm is reported in tables 1 and 2. However, transforming this idea to handle the indexing version is a non-trivial task.
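The idea above can be sketched as follows (a naive Python rendering of the idea behind algorithm 1, not the paper's actual pseudocode; the function names are ours and exact occurrences are found by brute force rather than by an index):

```python
def occurrences(t: str, p: str) -> set:
    """Start positions (1-based) of exact occurrences of p in t;
    the empty pattern is taken to occur at every position."""
    n, m = len(t), len(p)
    return {i for i in range(1, n - m + 2) if t[i - 1:i - 1 + m] == p}

def one_mismatch_occurrences(t: str, p: str) -> set:
    """Start positions where p matches t with at most one mismatch."""
    n, m = len(t), len(p)
    occ = set()
    for j in range(1, m + 1):                # assumed mismatch position j
        pre = occurrences(t, p[:j - 1])      # occurrences of p[1..j-1]
        suf = occurrences(t, p[j:])          # occurrences of p[j+1..m]
        # with a don't care at position j, p occurs at i iff the prefix
        # occurs at i and the suffix occurs at i + j
        cand = pre & {i - j for i in suf}
        occ |= {i for i in cand if 1 <= i <= n - m + 1}
    return occ
```

Each iteration of the loop computes the occurrence set of the pattern with a don't care at position *j*; their union gives the answer.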

## 4. The index data structure

In this section, we focus on constructing an index data structure for our problem. We, however, first consider a restricted variant of the original problem. In particular, for the time being, we assume that we are given the pattern length *m* which we use during the index data structure construction. In other words, the index constructed will be *m*-specific, i.e. it will be able to handle patterns of length *m* only. Recall from §1 that this particular variant of the problem is of particular interest especially in the context of NGS. The general version of the problem will be handled in a later section.

We basically extend the idea of algorithm 1. We maintain two suffix array data structures: *SA*, built over *T*, and *SA*′, built over the reverse string *T*^{r}. We use *SA* to find the occurrences of *P*[1 . . *j*−1]. We could find the occurrences of *P*[*j*+1 . . *m*] using *SA* as well. But we need to take a different approach because we have to ‘align’ the occurrences of *P*[*j*+1 . . *m*] (line 5) with the occurrences of *P*[1 . . *j*−1] so that we can compute *Occ*_{T}^{P^{(j)}} by intersecting (line 8) them just as is done in algorithm 1. However, it is not as straightforward as algorithm 1 because our aim is to maintain an index rather than finding a match for a particular pattern. We use the following trick, which has been applied to solve this and different other problems in the literature before (e.g. [36,38,40]).

Our approach is as follows. We use the suffix array of the reverse string of *T*, i.e. *SA*′, to find the occurrences of (*P*[*j*+1 . . *m*])^{r}. By doing so, in fact, we get the end positions of the occurrences of *P*[*j*+1 . . *m*] in *T*. However, we still have to do a bit more ‘shifting’ for the intersection to work, because from *SA* we get the start positions of the occurrences of *P*[1 . . *j*−1]. This is where the information of a fixed pattern length, i.e. *m*, comes in handy. To achieve the ‘shifting’ effect automatically, we appropriately rearrange the entries of *SA*′ as follows. For all 1≤*i*≤*n*, if originally *SA*′[*i*]=*p*, we assign *SA*′[*i*]=*n*−*p*−*m*+2. It is easy to see that this will effectively transform the position of each of the occurrences of (*P*[*j*+1 . . *m*])^{r} to the appropriate position, namely the start position of an occurrence of a pattern of length *m* ending there, which facilitates the intersection. As will be clarified later, the candidate starting positions of an occurrence of a pattern of length *m* over a text of length *n* are those in the range [1 . . *n*−*m*+1]. Obviously, any solution working for a generic range [1 . . *n*] is applicable as well to a subrange [1 . . *n*−*m*+1].
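The relabelling can be illustrated as follows (a hypothetical sketch with our own function name and 1-based positions; if the reversed suffix part starts at position *p* in *T*^{r}, the corresponding occurrence ends at *n*−*p*+1 in *T*, so a length-*m* pattern ending there starts at *n*−*p*−*m*+2):

```python
def shifted_reverse_sa(t: str, m: int) -> list:
    """Suffix array of the reversed text, with each entry p relabelled to
    n - p - m + 2: the start position in t of a length-m pattern whose
    occurrence ends where the reversed match begins (1-based)."""
    n = len(t)
    tr = t[::-1]
    sa_r = sorted(range(1, n + 1), key=lambda i: tr[i - 1:])
    return [n - p - m + 2 for p in sa_r]
```

The relabelled values range over [−*m*+2 . . *n*−*m*+1]; entries below 1 correspond to no valid pattern start and are ignored later.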

Now, it remains to show how we can perform the intersection (line 8) efficiently in the context of indexing. Clearly, we now have two arrays, namely *SA* and the shifted *SA*′, and two intervals, namely *Interval*_{SA}(*P*[1 . . *j*−1]) and *Interval*_{SA′}((*P*[*j*+1 . . *m*])^{r}). Our problem now is to find the intersection of the positions within these two intervals. This is a known problem in geometry and is called the *Range Set Intersection* (RSI) *Problem*.

### Problem 4.1 (Problem RSI)

*Let* *V* [1 . . *n*] *and* *W*[1 . . *n*] *be two permutations of* [1 . . *n*]. *Preprocess* *V* *and* *W* *to answer the following form of queries*.

*Query: Find the intersection of the elements of* *V* [*i* . . *j*] *and* *W*[*k* . . ℓ], 1≤*i*≤*j*≤*n*,1≤*k*≤ℓ≤*n*.

In order to solve the above problem, we reduce it to the well-studied *Range Search Problem on a Grid* (RSG).

### Problem 4.2 (Problem RSG)

*Let* *A* *be a set of* *n* *points on the grid* [0 . . *U*]× [0 . . *U*]. *Preprocess* *A* *to answer the following form of queries*.

*Query: Given a query rectangle* *q*≡(*a*,*b*)×(*c*,*d*) *find the set of points of* *A* *contained in* *q*.

We can see that Problem RSI is just a different formulation of Problem RSG. This can be realized as follows. We set *U*=*n*. As *V* and *W* in Problem RSI are permutations of [1 . . *n*], every number in [1 . . *n*] appears precisely once in each of them. We define the coordinates of every number *i*∈[1 . . *n*] to be (*x*,*y*), where *V*[*x*]=*W*[*y*]=*i*. Thus, we get *n* points on the grid [1 . . *n*]×[1 . . *n*], i.e. the set *A* of Problem RSG. The query rectangle *q* is deduced from the two intervals [*i* . . *j*] and [*k* . . ℓ] as follows: *q*≡(*i*,*k*)×(*j*,ℓ). It is easy to verify that the above reduction is correct, and hence we can solve Problem RSI using the solution of Problem RSG. This completes our description of the construction of the index data structure for our problem. Algorithm 2 formally states the steps to build our data structure, while tables 3 and 4 show an example.
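The reduction can be sketched as follows (the class name is ours; the rectangle query is answered by a naive scan here, standing in for an efficient structure such as that of [50]):

```python
class RangeSetIntersection:
    """Reduction of Problem RSI to rectangle queries on a grid (Problem RSG).
    V and W are permutations of 1..n, given as Python lists."""

    def __init__(self, V, W):
        pos_w = {v: y for y, v in enumerate(W, start=1)}
        # the number v gets coordinates (x, y) with V[x] = W[y] = v
        self.points = [(x, pos_w[v], v) for x, v in enumerate(V, start=1)]

    def query(self, i, j, k, l):
        """Intersection of V[i..j] and W[k..l] via the rectangle (i,k)x(j,l)."""
        return {v for (x, y, v) in self.points if i <= x <= j and k <= y <= l}
```

For instance, with *V*=[3,1,4,2] and *W*=[2,4,1,3], the query for *V*[1 . . 2]∩*W*[3 . . 4] returns {1,3}.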

### (a) Analysis

Let us analyse the running time of algorithm 2, i.e. the data structure construction. The computational effort spent for lines 1, 2 and the `for` loop at line 3 is *O*(*n*). In line 7, we construct the set *A* of points in the grid [1 . . *n*]×[1 . . *n*] on which we will apply the range search. This step can also be done in *O*(*n*) as follows. We construct the inverse array *SA*^{−1} such that *SA*^{−1}[*SA*[*i*]]=*i* for all 1≤*i*≤*n*. Similarly, we can construct *SA*′^{−1}. Then, it is easy to construct *A* in *O*(*n*). A detail is that in our case there may exist values *v* such that *SA*′[*j*]≠*v* for all 1≤*j*≤*n*. This is because the shifted *SA*′ is a permutation of [−*m*+2 . . *n*−*m*+1] instead of [1 . . *n*]. Now, it is easy to observe that any entry *SA*′[*j*]=*v* such that *v*>*n* or *v*<1 is irrelevant in the context of our search. So we ignore any such entry while creating the set *A*. After *A* is constructed, we perform line 12. Note that from now on the shifted array as well as the inverse arrays *SA*^{−1} and *SA*′^{−1} can be discarded, as they are not involved in the query process.

There has been significant research work on Problem RSG. For example, we can use the data structure of Alstrup *et al.* [50]. This data structure can answer the query of Problem RSG in *O*(log log *n*+*k*) time, where *k* is the number of points contained in the query rectangle *q*. The data structure can be constructed in *O*(*n* log^{ε}*n*) time and space, for any constant 0<*ε*<1. So, line 12 runs in *O*(*n* log^{ε}*n*) time and space, for any constant 0<*ε*<1. Therefore, the overall time and space complexity of the index remains *O*(*n* log^{ε}*n*).

### (b) Query processing

Now, let us focus on the query processing. Again, for the sake of ease, let us suppose that we know the position of the mismatch and assume that the position is *j*. Then the query can be answered as follows. We first compute [*s*_{ℓ}|^{j} . . *e*_{ℓ}|^{j}]=*Interval*_{SA}(*P*[1 . . *j*−1]) using *SA*. Then we compute [*s*_{r}|^{j} . . *e*_{r}|^{j}]=*Interval*_{SA′}((*P*[*j*+1 . . *m*])^{r}) using the original (i.e. the unshifted) *SA*′; note that the shifting relabels the values of *SA*′ but does not change the lexicographic order of the underlying suffixes, so the interval is the same. Now, we find all the points in *A* that are inside the rectangle *q*_{j}≡(*s*_{ℓ}|^{j},*s*_{r}|^{j})×(*e*_{ℓ}|^{j},*e*_{r}|^{j}). Let *B*|^{j} be the set of those points. Then it is easy to verify that *Occ*_{T}^{P^{(j)}}={*SA*[*x*] | (*x*,*y*)∈*B*|^{j}}.

Now, we can easily lift the restriction that *j* is the position of the mismatch. We simply compute *B*|^{j} for all values of *j* and compute *Occ*_{T}^{P,≤1}=⋃_{1≤*j*≤*m*}{*SA*[*x*] | (*x*,*y*)∈*B*|^{j}}. The steps are formally presented in the form of algorithm 3. Tables 5 and 6 show an example run of this algorithm.

The running time of the query processing is deduced as follows. Lines 3 and 4 can be done in *O*(*m*) time. And line 5, i.e. the range search query, can be done in *O*(log log *n*+*K*_{j}) time, where *K*_{j} is the number of points in the query rectangle. The algorithm mostly consists of a loop over the *m* possible mismatch positions. So, a straightforward analysis of algorithm 3 leads to a running time of *O*(*m*^{2}+*m* log log *n*+*K*). Here, we assume that *K* is the size of the output returned by the algorithm. We will focus on *K* shortly. However, we can do far better than *O*(*m*^{2}+*m* log log *n*+*K*) as follows. Note that we can compute *Interval*_{SA}(*P*[1 . . *j*]) incrementally as we increment *j* from 1 to *m*. This means that we can get all the intervals from *SA*, namely, [*s*_{ℓ}|^{j} . . *e*_{ℓ}|^{j}], 1≤*j*≤*m*, spending *O*(*m*) time in total. Similarly, we can get all the intervals from *SA*′, namely [*s*_{r}|^{j} . . *e*_{r}|^{j}], 1≤*j*≤*m*, spending *O*(*m*) time in total; this is because the reversed suffixes (*P*[*j*+1 . . *m*])^{r} are exactly the prefixes of *P*^{r}. Hence, we can compute all the intervals first and store them with a little bookkeeping to implement line 5 for all *j*, 1≤*j*≤*m*, afterwards. This will give a much better running time of *O*(*m* log log *n*+*K*).

Let us note that the pattern-matching part of our solution, namely lines 3 and 4, where the intervals are provided by suffix arrays, is alphabet dependent, while the geometrical part, that is the rectangle query in line 5, is not. Hence, in the case of a large (integer) alphabet, the query time is dominated by the pattern-matching time and the resulting query time is *O*(*m*(log *σ*+log log *n*)+*K*).

### (c) Discussion on *K*

Finally, a discussion on the value of *K* is in order. As we are computing the occurrences with at most 1 mismatch, the exact occurrences will also be reported. And in our algorithm, when we compute *Occ*_{T}^{P^{(j)}} for a particular value of *j*, it will also contain the exact occurrences. In other words, for all values of *j*, we have *Occ*_{T}^{P}⊆*Occ*_{T}^{P^{(j)}}. So, every exact occurrence will be reported *m* times. To make the value of *K* optimal, we need to use some further tricks. First, we need the following easy lemma.

### Lemma 4.3

*Suppose we are given a suffix array* *SA* *of a text* *T* *and a pattern* *P*[1 . . *m*]. *Suppose* [*s* . . *e*]=*Interval*_{SA}(*P*) *and* [*s*|^{j} . . *e*|^{j}]=*Interval*_{SA}(*P*[1 . . *j*]). *If* 1≤*j*≤*m*, *we must have* *s*|^{j}≤*s*≤*e*≤*e*|^{j}.

Clearly, lemma 4.3 tells us that the range identifying the occurrences of a pattern must be contained in or equal to the range identifying the occurrences of a prefix of that pattern. How do we use the lemma to ensure the optimal value of *K*? We modify our query algorithm as follows. We only need to modify the part that uses *SA* and do some more work while we compute *B*_{j}. At the beginning (before the loop at line 2 of algorithm 3), we compute [*s* . . *e*]=*Interval*_{SA}(*P*) using *SA*. Now, suppose that we have computed [*s*_{ℓ}|^{j} . . *e*_{ℓ}|^{j}]=*Interval*_{SA}(*P*[1 . . *j*−1]) using *SA*. By lemma 4.3, we know that *s*_{ℓ}|^{j}≤*s*≤*e*≤*e*_{ℓ}|^{j}. So, we divide the range into (at most) two ranges [*s*_{ℓ}|^{j} . . *s*−1] and [*e*+1 . . *e*_{ℓ}|^{j}]. Then the computation of *B*_{j} is modified as follows: *B*_{j}^{1} is the set of points of *A* inside the rectangle (*s*_{ℓ}|^{j},*s*_{r}|^{j})×(*s*−1,*e*_{r}|^{j})
and *B*_{j}^{2} is the set of points of *A* inside the rectangle (*e*+1,*s*_{r}|^{j})×(*e*_{ℓ}|^{j},*e*_{r}|^{j}); we then take *B*_{j}=*B*_{j}^{1}∪*B*_{j}^{2}.

In other words, what we are doing is that the interval of each of the prefixes *P*[1 . . *j*−1], 1≤*j*≤*m*, of *P* is divided into at most two sub-intervals so as to remove the exact occurrences of *P*. Hence, we finally need to include the exact occurrences of *P* before returning the set. In other words, instead of returning ⋃_{1≤*j*≤*m*}{*SA*[*x*] | (*x*,*y*)∈*B*_{j}}, we need to return *Occ*_{T}^{P}∪⋃_{1≤*j*≤*m*}{*SA*[*x*] | (*x*,*y*)∈*B*_{j}}. Clearly, now the output size *K* will be optimal.
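The splitting trick based on lemma 4.3 can be sketched as follows (a hypothetical helper operating on suffix ranks; the naive interval computation stands in for the index):

```python
from bisect import bisect_left, bisect_right

def sa_interval(t: str, p: str):
    """1-based rank interval [s..e] of the sorted suffixes of t prefixed
    by p (s > e when p does not occur); naive, for illustration."""
    suffixes = sorted(t[i:] for i in range(len(t)))
    s = bisect_left(suffixes, p) + 1
    e = bisect_right(suffixes, p + "\uffff")
    return s, e

def split_prefix_interval(t: str, p: str, j: int):
    """Interval of the prefix p[1..j-1] with the sub-interval of the exact
    occurrences of p removed, returned as at most two rank ranges."""
    s, e = sa_interval(t, p)            # interval of the complete pattern
    sl, el = sa_interval(t, p[:j - 1])  # interval of the prefix p[1..j-1]
    assert sl <= s and e <= el          # the nesting guaranteed by lemma 4.3
    return [(a, b) for (a, b) in ((sl, s - 1), (e + 1, el)) if a <= b]
```

When the prefix interval coincides with the full-pattern interval, both residual ranges are empty and no extra rectangle query is needed for that *j*.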

Finally, what will be the query time of the augmented query algorithm? Clearly, now for each of the prefixes *P*[1 . . *j*−1], we would have at most two ranges (against one range before). So, we need to make at most 2*m* range search queries in total, keeping the total query time asymptotically the same as before. The results of this section can be summarized in the following theorem.

### Theorem 4.4

*Given a sequence* *T* *of length n over a constant alphabet and an integer m, we can construct a data structure in* *O*(*n* log^{ε}*n*) *time and space that, given a pattern* *P* *of length m as a query, can compute* *Occ*_{T}^{P,≤1} *in* *O*(*m* log log *n*+*K*) *time, where* *K*=|*Occ*_{T}^{P,≤1}|.

### (d) Exactly one mismatch

In some practical applications, especially in Bioinformatics, the occurrences with exactly one mismatch are sought. In other words, in such cases, the exact occurrences are to be excluded, i.e. we are to compute *Occ*_{T}^{P,=1} instead of *Occ*_{T}^{P,≤1}. Clearly, this can be achieved without any change in the query time. In this case, we simply need to return ⋃_{1≤*j*≤*m*}{*SA*[*x*] | (*x*,*y*)∈*B*_{j}} in the end in our augmented algorithm (without including *Occ*_{T}^{P}).

## 5. General case

In this section, we relax the assumption that the pattern length *m* is given as part of the input. So, we cannot take advantage of using *m* during the data structure construction as we did in line 3 of algorithm 2. For this case, we use the same idea, but the shifting needs to be done in a different way so that we do not require the knowledge of *m*. This in the sequel requires a different query processing algorithm, as will be discussed shortly. Below, we first discuss the index construction and then focus on the query processing algorithm.

### (a) Index construction

As has been mentioned previously, we will use the same basic idea and hence will depend on algorithm 1. Let us suppose that we know the position of the mismatch and assume that the position is *j*. Similar to the previous approach, we will maintain two suffix arrays, *SA* and *SA*′. However, we now reverse the roles of the two suffix arrays as follows. We use *SA* to find the occurrences of *P*[*j*+1 . . *m*] (rather than *P*[1 . . *j*−1] as we did previously). We use the suffix array of the reverse string of *T*, i.e. *SA*′, to find the occurrences of (*P*[1 . . *j*−1])^{r}. By doing so, in fact, we get the end positions of the occurrences of *P*[1 . . *j*−1] in *T*. Now, note that we have the end positions of *P*[1 . . *j*−1] in *SA*′ and the start positions of *P*[*j*+1 . . *m*] in *SA*. So, to get the desired alignment, all we need to do is to consider the mismatch position and relabel *SA*′ accordingly. As now the alignment is done at the start position of *P*[*j*+1 . . *m*] (instead of the start position of *P*), we do not need the knowledge of *m*. It is easy to verify that, now, each *SA*′[*i*]=*p* needs to be relabelled as (*n*−*p*+1)+2=*n*−*p*+3 to get the desired alignment. The rest of the steps are identical to algorithm 2. Clearly, the running time of the index construction remains the same as before.
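The relabelling for the general case can be illustrated as follows (a hypothetical sketch with our own function name; note that, unlike the fixed-length case, the formula *n*−*p*+3 does not involve *m*):

```python
def relabel_reverse_sa(t: str) -> list:
    """Suffix array of the reversed text, with each entry p relabelled to
    n - p + 3. A reversed prefix part starting at p in the reversed text
    ends at n - p + 1 in t, the mismatch sits at n - p + 2, and the suffix
    part starts at n - p + 3, so the relabelled value is the alignment
    point independent of the pattern length (1-based positions)."""
    n = len(t)
    tr = t[::-1]
    sa_r = sorted(range(1, n + 1), key=lambda i: tr[i - 1:])
    return [n - p + 3 for p in sa_r]
```

The relabelled values range over [3 . . *n*+2]; values above *n* are irrelevant for the search and are discarded when building the point set.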

### (b) Query processing

The query processing algorithm is built on the same principle as before. Let us suppose that we know the position of the mismatch and assume that the position is *j*. Then the query can be answered as follows. We first compute [*s*_{ℓ}|^{j} . . *e*_{ℓ}|^{j}]=*Interval*_{SA}(*P*[*j*+1 . . *m*]) using *SA*. Then we compute [*s*_{r}|^{j} . . *e*_{r}|^{j}]=*Interval*_{SA′}((*P*[1 . . *j*−1])^{r}) using *SA*′. Now, we find all the points in *A* that are inside the rectangle *q*≡(*s*_{ℓ}|^{j},*s*_{r}|^{j})×(*e*_{ℓ}|^{j},*e*_{r}|^{j}).

Let *B*|^{j} be the set of those points. Now, note that we have done the alignment at the start position of *P*[*j*+1 . . *m*]. So, we need to undo this shift to get the actual start position of the desired occurrence of the pattern *P*. So we compute *B*′|^{j}={*SA*[*x*]−*j* | (*x*,*y*)∈*B*|^{j}}. It is easy to verify that *Occ*_{T}^{P^{(j)}}=*B*′|^{j}.

Now, we can easily lift the restriction that *j* is the position of the mismatch. We simply compute *B*|^{j} and *B*′|^{j} for all values of *j* and compute *Occ*_{T}^{P,≤1}=⋃_{1≤*j*≤*m*}*B*′|^{j}.

### (c) Analysis of the query processing algorithm

The analyses of the two algorithms, namely, the query processing algorithm presented in §4*b* and that presented in §5*b*, are similar but with a distinct difference that makes the running time of the latter worse. For the query processing of the fixed *m* case, we computed *Interval*_{SA}(*P*[1 . . *j*]) incrementally as we incremented *j* from 1 to *m* and got all the intervals from *SA*, namely, [*s*_{ℓ}|^{j} . . *e*_{ℓ}|^{j}], 1≤*j*≤*m*, spending *O*(*m*) time in total. Similarly, we computed all the intervals from *SA*′, namely, [*s*_{r}|^{j} . . *e*_{r}|^{j}], 1≤*j*≤*m*, spending *O*(*m*) time in total. For the general case, unfortunately, we cannot apply the above trick readily. This is because, now, we need to compute *Interval*_{SA}(*P*[*j*+1 . . *m*]) incrementally as we increment *j* from 1 to *m*. This means that, unlike the previous case, where we were looking for intervals for prefixes of increasing sizes, now we have to compute intervals for suffixes of decreasing sizes. The same applies to the computation of the intervals for (*P*[1 . . *j*−1])^{r} using *SA*′. As a result, the query time becomes *O*(*m*^{2}+*m* log log *n*+*K*).

### (d) Improved query time

Clearly, the bottleneck is the computation of the appropriate intervals for *P*[*j*+1 . . *m*] and (*P*[1 . . *j*−1])^{r} for different values of *j*. To achieve the same query time of *O*(*m* log log *n*+*K*), we need to be able to compute the intervals for the suffixes of decreasing sizes incrementally, as we did for the prefixes of increasing sizes. This would ensure an *O*(*m*) running time for the computation of the appropriate intervals and thereby keep the desired running time intact. To achieve this improvement, we resort to the famous backward search technique of Ferragina & Manzini [51–53] using their data structure, popularly known as the FM-index. In particular, we use the following result, which is the heart of the Backward Search Algorithm (BSA) of [51] using an FM-index.

### Lemma 5.1 (Ferragina & Manzini [51])

*Assume that* *Interval*_{SA}(*X*) *is already computed. Then, for any character* *c*, *the interval* *Interval*_{SA}(*cX*) *can be computed in* *O*(1) *time*.
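Lemma 5.1 corresponds to one backward-search step over the BWT. A minimal sketch (naive *Rank* computation in place of a wavelet tree, a `\x00` sentinel appended to the text, and function names of our choosing):

```python
def bwt_from_sa(t: str):
    """BWT via the suffix array: bwt[i] = t[sa[i] - 1] (1-based), wrapping
    around at the front of the text."""
    t += "\x00"                                # unique smallest sentinel
    sa = sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])
    return "".join(t[p - 2] for p in sa), sa   # t[-1] handles the wrap

def backward_step(bwt, C, s, e, c):
    """Interval of cX from the interval [s..e] of X (lemma 5.1). rank is
    naive here; a wavelet tree makes each step O(1) for constant alphabets."""
    rank = lambda i: bwt[:i].count(c)          # occurrences of c in bwt[1..i]
    s2 = C[c] + rank(s - 1) + 1
    e2 = C[c] + rank(e)
    return (s2, e2) if s2 <= e2 else None

def search(t, p):
    """Backward search: the occurrences of p in t (1-based positions)."""
    bwt, sa = bwt_from_sa(t)
    C = {c: sum(a < c for a in t + "\x00") for c in set(t + "\x00")}
    s, e = 1, len(bwt)
    for c in reversed(p):
        if c not in C:
            return None
        iv = backward_step(bwt, C, s, e, c)
        if iv is None:
            return None
        s, e = iv
    return {sa[i - 1] for i in range(s, e + 1)}
```

Each call to `backward_step` prepends one character to the current match, which is exactly what is needed to enumerate the intervals of suffixes of decreasing sizes.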

It is clear that with lemma 5.1 at our disposal, we can compute the appropriate intervals for suffixes of decreasing sizes of a pattern spending *O*(*m*) time in total. So, using the BSA of [51–53], we are now able to compute the appropriate intervals for *P*[*j*+1 . . *m*] and (*P*[1 . . *j*−1])^{r} for different values of *j* spending a total time of *O*(*m*). Therefore, the query time improves to *O*(*m* log log *n*+*K*), where *K* is the output size.

### (e) Data structure construction with Ferragina & Manzini-index

In the previous section, we used the BSA (lemma 5.1) to achieve the desired running time for the query. However, this means that we need to include the FM-index in our index data structure. We actually delegate the pattern-matching part of our solution, that is, the part in charge of computing the intervals of all the prefixes and suffixes of a given pattern in the suffix arrays of the text and the reverse text, to two FM-indexes: one for the text and another for the reverse text. Note that, as the FM-index is a self-index, it does not need to access the original text at query time. This leads to a practical reduction in the space occupancy of the proposed solution, owing to the sublinear space occupancy of the FM-index for compressible texts. However, the space complexity of our solution remains dominated by the range search data structure.

A further analysis of the data structure construction time is in order. In what follows, we refer to the version of the FM-index presented in [53]. The FM-index, and hence the BSA, makes clever use of the so-called *Rank* query. Suppose we have a string *X*, *c*∈*Σ* and an integer *f*. Then, *Rank*_{X}(*c*,*f*) is defined to be the number of occurrences of *c* (i.e. its rank) in the prefix *X*[1 . . *f*]. The FM-index uses a wavelet tree [54] to facilitate constant time *Rank* queries. It also requires an auxiliary array *C* such that *C*[*c*] stores the total number of occurrences of all *c*′≤*c*, where ‘≤’ here means lexicographically smaller than or equal to. Finally, the FM-index and the BSA use the famous Burrows–Wheeler transform (BWT) [55]. A complete description of the BWT is out of the scope of, and not required for, our discussion. We just give a brief definition of the BWT in relation to the suffix array as follows. The BWT encoding of the string *T* is another string *T*^{bwt}, as defined below. Assume that the *i*th element of the suffix array of *T* is *SA*[*i*]. Then *T*^{bwt}[*i*]=*T*[*SA*[*i*]−1], where *T*[0] is taken to be *T*[*n*]. The computation of *T*^{bwt} can be done in *O*(*n*) time. The FM-index needs to preprocess *T*^{bwt} for constant-time rank queries. As wavelet trees can be constructed in linear time, this can also be achieved in *O*(*n*) time. The auxiliary array *C* can also be prepared in *O*(*n*) time. So, overall, the data structure construction time remains dominated by the range search data structure construction.

### (f) Optimality of the output size

The problem with the optimality of the output size, as discussed in §4*c*, applies to the current algorithm as well. Unfortunately, the method used in §4*c* to make the output size optimal cannot be used now. The technique of §4*c* was based on lemma 4.3, which basically states that the interval provided by a suffix array for a pattern always lies within the interval of any of its prefixes. But this relation does not hold for suffixes. Therefore, we cannot use lemma 4.3 here. Nevertheless, we apply another technique to achieve our goal, as described below. We need the following lemma.

### Lemma 5.2

*Suppose we are given the suffix array* SA *of a text* *T* *and a pattern* *P*[1 . . *m*]. *Suppose* [*s*_{0} . . *e*_{0}] *is the suffix array interval of* *P*[*j* . . *m*] *and* [*s*_{1} . . *e*_{1}] *is that of* *P*[*j*+1 . . *m*], *where* 1≤*j*<*m*. *Then the following hold true*:

(1) *The size of the interval* [*s*_{1} . . *e*_{1}] *is greater than or equal to the size of the interval* [*s*_{0} . . *e*_{0}], *i.e.* *e*_{1}−*s*_{1}+1≥*e*_{0}−*s*_{0}+1.

(2) *We have* SA[*p*]=SA[*s*_{0}]+1 *for some* *s*_{1}≤*p*≤*e*_{1}.

(3) *Suppose* SA[*p*]=SA[*s*_{0}]+1. *Then* SA[*p*+*i*]=SA[*s*_{0}+*i*]+1 *for all* 1≤*i*≤*e*_{0}−*s*_{0}.
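Lemma 5.2 can be checked empirically on a small example. The snippet below is our own naive illustration (a real index would obtain the intervals via the FM-index); following the discussion below, we take the two intervals to be those of *P*[*j* . . *m*] and *P*[*j*+1 . . *m*].

```python
# Our own naive illustration of lemma 5.2 (not the paper's code).
def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])

def interval(t, sa, pat):
    """Naive (s, e) interval of suffixes of t starting with pat."""
    hits = [i for i in range(len(sa)) if t[sa[i]:].startswith(pat)]
    return (hits[0], hits[-1]) if hits else None

t = "banana$"
sa = suffix_array(t)
s0, e0 = interval(t, sa, "nan")   # plays the role of P[j..m]
s1, e1 = interval(t, sa, "an")    # plays the role of P[j+1..m]
# (1) the interval of the shorter suffix is at least as large.
assert e1 - s1 + 1 >= e0 - s0 + 1
# (2) some p in [s1..e1] satisfies sa[p] == sa[s0] + 1.
p = next(p for p in range(s1, e1 + 1) if sa[p] == sa[s0] + 1)
# (3) the correspondence extends over the whole of [s0..e0].
assert all(sa[p + i] == sa[s0 + i] + 1 for i in range(e0 - s0 + 1))
```

Intuitively, every occurrence of *P*[*j* . . *m*] induces an occurrence of *P*[*j*+1 . . *m*] one position to its right, and these induced occurrences sit contiguously inside the larger interval.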

Now let us discuss how we can obtain the optimal output size. The basic idea is similar to before. We want to divide an interval into at most two subintervals such that the subinterval responsible for the exact occurrences of the complete pattern can be excluded when we do the intersection. Now, suppose that we have computed the intervals [*s*_{0} . . *e*_{0}] of *P*[*j* . . *m*] and [*s*_{1} . . *e*_{1}] of *P*[*j*+1 . . *m*]. By lemmas 5.2(2) and 5.2(3), we know that the occurrences of *P*[*j*+1 . . *m*] in some sense ‘contain’ the occurrences of *P*[*j* . . *m*]. This is because every occurrence of *P*[*j* . . *m*] is immediately followed by an occurrence of *P*[*j*+1 . . *m*].

Now, let us go back to our computation, assuming that we know the position of the mismatch and that this position is *j*. So, we need to do the intersection between the interval computed using the prefix *P*[1 . . *j*−1] and the interval computed using the suffix *P*[*j*+1 . . *m*]. Note carefully that here *j* corresponds to the mismatch position. So, if from the interval of *P*[*j*+1 . . *m*] we can remove the positions that follow an occurrence of *P*[*j*] in the text, we are done. Assuming that we have the interval of *P*[*j* . . *m*] at hand, we can easily identify those positions using lemma 5.2 as follows. Following the hypothesis of lemma 5.2, let us assume that [*s*_{0} . . *e*_{0}] is the interval of *P*[*j* . . *m*] and [*s*_{1} . . *e*_{1}] is the interval of *P*[*j*+1 . . *m*]. By lemma 5.2(2), we have an *s*_{1}≤*p*≤*e*_{1} such that SA[*p*]=SA[*s*_{0}]+1. We identify this *p*. How can we identify *p* efficiently? To do this, we slightly augment our index data structure by maintaining an auxiliary array *Next*[1 . . *n*] of length *n* as follows. We store *Next*[*i*]=*j* if and only if SA[*j*]=SA[*i*]+1. Clearly, this facilitates *O*(1)-time identification of the index *p*. Also, note that the computation of *Next*[1 . . *n*] can be done in linear time during the index data structure construction.
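The linear-time construction of *Next* can be sketched via the inverse suffix array. This is our own illustration (0-based indices), assuming the definition *Next*[*i*]=*j* iff SA[*j*]=SA[*i*]+1 as stated above.

```python
# Sketch (our own illustration, 0-based) of the Next[] construction:
# nxt[i] = j  iff  sa[j] == sa[i] + 1.
def build_next(sa):
    n = len(sa)
    inv = [0] * n                  # inverse suffix array: inv[sa[i]] = i
    for i, pos in enumerate(sa):
        inv[pos] = i
    nxt = [None] * n
    for i, pos in enumerate(sa):
        if pos + 1 < n:
            nxt[i] = inv[pos + 1]  # O(1) lookup of the index p of lemma 5.2(2)
    return nxt
```

With this array at hand, the index *p* of lemma 5.2(2) is simply `nxt[s0]`, retrieved in constant time.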

Once *p* is identified, we compute *q*=*p*+*e*_{0}−*s*_{0}. Now, we have two subintervals, namely [*s*_{1} . . *p*−1] and [*q*+1 . . *e*_{1}]. It is not difficult to realize that these two subintervals give us the positions where *P*[*j*+1 . . *m*] occurs in the text but is not preceded by an occurrence of *P*[*j*]; this is exactly what we desired. So, we now finish by modifying the computation of *B*_{j} appropriately. To show the modification, we need to keep the notational convention followed so far. Note that the position of *P*[*j*+1 . . *m*] is to the right of the fixed mismatch position *j*. So, to keep the notational symmetry, we now rename *s*_{1} and *e*_{1} as *s*_{r}|^{j} and *e*_{r}|^{j}, respectively. The modified computation of *B*_{j} then simply uses the two subintervals [*s*_{r}|^{j} . . *p*−1] and [*q*+1 . . *e*_{r}|^{j}] in place of the single interval [*s*_{r}|^{j} . . *e*_{r}|^{j}].
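Putting the pieces together, the splitting step above can be sketched as follows (our own illustration, 0-based). The variable names mirror the discussion, and `nxt` is assumed to be the auxiliary array just described; either subinterval may be empty, in which case we return `None` for it.

```python
# Sketch (our own illustration, 0-based): split [s1..e1] into the two
# subintervals that exclude the exact occurrences of the longer suffix.
def split_interval(nxt, s0, e0, s1, e1):
    p = nxt[s0]            # lemma 5.2(2): sa[p] == sa[s0] + 1, O(1) via Next
    q = p + e0 - s0        # [p..q] covers exactly the occurrences to exclude
    left = (s1, p - 1) if s1 <= p - 1 else None
    right = (q + 1, e1) if q + 1 <= e1 else None
    return left, right
```

For instance, on `"banana$"` with the longer suffix `"nan"` (interval [6 . . 6]) and the shorter suffix `"an"` (interval [2 . . 3]), only the occurrence of `"an"` not preceded by `'n'` survives the split.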

Finally, we need to include the exact occurrences of the pattern before returning the final output set. In other words, instead of returning only the occurrences with exactly one mismatch, we need to return their union with the set of exact occurrences of *P*. Clearly, the output size is now optimal. By a similar argument to that presented in §4*c*, the query time remains asymptotically the same. The results of this section can be summarized in the following theorem.

### Theorem 5.3

*Given a sequence* *of length n, we can construct a data structure in* *time and space that, given a pattern* *of length m as a query, can compute* *in* *time, where* .

## 6. Discussion on the solution of Amir *et al.* [36]

As has been mentioned above, the underlying idea used in our solution approach is identical to that of the solution proposed by Amir *et al.* [36]. In this section, we present a detailed comparison between the two solutions and identify the shortcomings of the solution of Amir *et al.* [36]. In what follows, we will refer to the data structure of Amir *et al.* [36] as *DS_AKLLLR* (using the first letters of the authors' surnames). And we will use *DS_Fixed* and *DS_Gen* to refer to our data structures for the fixed *m* version and the general version, respectively. In both cases, we will use the names to refer to the corresponding algorithms as well.

*DS_AKLLLR* consists of two suffix trees, one for the text and another for its reverse. The role of these two suffix trees is exactly the same as that of the two suffix arrays in our data structure *DS_Fixed*. Recall that in *DS_Gen* the roles of the two suffix arrays (FM-indexes, to be precise) are in fact reversed. The range search data structure used by *DS_AKLLLR* is similar, but not identical, to the one used by *DS_Fixed* and *DS_Gen*, as will become clear shortly. So, overall, the data structure construction is broadly similar in both algorithms. The query algorithms, however, are drastically different, as discussed below.

The *DS_AKLLLR* query algorithm requires that the suffix trees are constructed using the Weiner construction [6]. According to Weiner's algorithm, given a text *T* of length *n*, the suffix *T*[*n* . . *n*] is considered first, followed by *T*[*n*−1 . . *n*], then *T*[*n*−2 . . *n*] and so on. Now, similarly to *DS_Fixed*, *DS_AKLLLR* needs to compute one range using the text and another using the reverse text. So, with the suffix tree *ST*_{T} of the text at its disposal, it proceeds as follows. *DS_AKLLLR* ‘feeds’ the pattern to the suffix tree in some sense. In particular, it continues the Weiner construction from *ST*_{T} and builds a suffix tree of *P*#*T*, where #∉*Σ*. While doing this extended construction, *DS_AKLLLR* cleverly keeps track of the locus positions for each suffix of *P*. Hence, it can easily get the required range. After the construction ends, it ‘undoes’ this extension to restore the original suffix tree. An identical operation on the suffix tree of the reverse text gives the other range. Subsequently, all that is needed is the application of an appropriate range search query on the range search data structure, which is a part of *DS_AKLLLR*.

Now, there are two important points that need to be carefully noted, as highlighted below.

— First, the discussion in §4*c*, and later again in §5*f*, on the output size applies to *DS_AKLLLR* as well. To make the output size optimal, *DS_AKLLLR* uses a different technique from ours. In particular, it resorts to a higher-dimensional range search query. This is where the range search data structure used by *DS_AKLLLR* differs from the one used by our data structure. We omit the details because they are not relevant here. But the point is that this technique requires a three-dimensional range search data structure and three-dimensional queries. This makes both the data structure construction time and the query time (slightly) inferior to ours.

— Secondly, and more importantly, the claim that the computation of the ranges by ‘feeding’ the pattern to the suffix tree runs in *O*(*m*) time is somewhat flawed, as follows. The linear running time of the Weiner construction of a suffix tree (and, in fact, of all other linear-time constructions, e.g. [5]) depends on an amortized analysis. Hence, while we can certainly say that the construction of a suffix tree for the string *P*#*T* can be done in *O*(*n*+*m*) time, given a suffix tree for *T* we cannot always claim that extending it to one for *P*#*T* can be done in *O*(*m*) time.

## 7. Conclusion

In this paper, we have focused on the problem of indexing sequences for mapping reads with a single mismatch. We have first considered a simpler problem, where the pattern length *m* is given beforehand along with the text of length *n* for preprocessing. So, in this version, the patterns to be queried must all be of the same length *m*. This simpler problem is interesting in its own right, especially in the context of NGS. Subsequently, we have discussed how to solve the more general problem, which can handle patterns of different lengths. In both cases, our algorithm can construct an efficient data structure in time and space, which is able to answer subsequent queries in time, where 0<*ε*<1 and is the optimal output size.

## Funding statement

Maxime Crochemore was supported by EPSRC grant EP/J017108/1. M. Sohel Rahman was supported by a Commonwealth Fellowship.

## Footnotes

One contribution of 11 to a Theo Murphy Meeting Issue ‘Storage and indexing of massive data’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.