## Abstract

Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show that this also significantly reduces query times.

## 1. Introduction

The British government recently announced plans to sequence the genomes of up to one hundred thousand citizens [1]. Although sequencing so many people is still a challenge, the cost has dropped sharply in the past few years, and the decline is expected to continue. Now, scientists must consider how to store such a massive genomic database in a useful form.

Because human genomes are very similar, this database will be highly repetitive and easy to compress well with, for example, the lossless data compression algorithm LZ77 [2]. Unfortunately, conventional indexes [3–5] for approximate pattern matching—a basic operation in bioinformatics—are based on FM-indexes [6] or other technologies that do not take good advantage of repetitive structure. Therefore, these indexes quickly outgrow internal memory when applied to many genomes and must then reside on disk, which slows them down by orders of magnitude.

There are already some experimental LZ- and grammar-based compressed indexes for exact pattern matching [7–13] and these could eventually provide a basis for approximate pattern matching as well (see [14]), but science cannot wait. In this paper, we introduce a simple technique, hybrid indexing, for reducing the size of conventional indexes on highly repetitive texts while preserving most or all of their functionality.

Given upper bounds on pattern lengths and edit distances, we pre-process the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text.

We describe our hybrid index in more detail in §2; this includes details of our implementation.^{1} In §3, we present experimental results, which show that our technique also significantly reduces query times.

## 2. Hybrid indexing

Suppose we are given upper bounds on pattern lengths and edit distances, and asked to index a text *T*[1..*n*]. A query to the index consists of a pattern and an edit distance (which is 0 in the case of exact matching). We store different data structures to be able to find queries’ primary matches and secondary matches. A match for a query is a substring within the given edit distance of the given pattern; it is primary if it contains the first occurrence of a distinct character in *T* or crosses a boundary between phrases in the LZ77 parse of *T*, and secondary otherwise.

To be able to find primary matches, we pre-process the text with LZ77 to obtain a filtered text, which is essentially the subsequence of *T* containing characters close enough to phrase boundaries in *T*’s LZ77 parse, where ‘close enough’ depends on the given upper bounds. We store a conventional index on this filtered text, and a mapping from it to *T*. To be able to find secondary matches, we store a data structure by Kärkkäinen & Ukkonen [15].

Later, given a query, we use the conventional index to find all matches in the filtered text; use the mapping to determine which of those matches correspond to primary matches in *T*; use the mapping again to find those primary matches’ positions in *T*; and apply Kärkkäinen & Ukkonen’s data structure.

We briefly describe LZ77 in §2*a*. In §2*b*, we give more details about the filtered text and how we find primary matches. We describe Kärkkäinen & Ukkonen’s data structure in §2*c*. Finally, in §2*d*, we describe details of our implementation.

### (a) LZ77

We use the variant of LZ77 according to which, for each phrase *T*[*i*..*j*] in the parse of *T*, either *i*=*j* and *T*[*i*] is the first occurrence of that distinct character, or *T*[*i*..*j*] occurs in *T*[1..*j*−1] but *T*[*i*..*j*+1] does not occur in *T*[1..*j*]. In the first case, *T*[*i*] is encoded as itself. In the second case, *T*[*i*..*j*] is encoded as the pair (*i*′,*j*−*i*+1), where *i*′ is the starting point of the leftmost occurrence of *T*[*i*..*j*] in *T*; we call this leftmost occurrence *T*[*i*..*j*]’s source.
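For concreteness, the following Python sketch computes this variant of the parse naively, by scanning for the leftmost earlier occurrence of each candidate phrase. A real implementation would use suffix arrays; the quadratic version below only makes the definition explicit. (The function name and the 0-based indexing are ours, not part of the original formulation.)

```python
def lz77_parse(t):
    """Naive LZ77 parse (a quadratic sketch, not the suffix-array
    construction a real implementation would use).

    Returns a list of phrases: a bare character for the first
    occurrence of a distinct character, or a pair (src, length)
    giving the start of the leftmost earlier occurrence (the source)
    and the phrase length.  All positions are 0-based.
    """
    phrases, i, n = [], 0, len(t)
    while i < n:
        if t[i] not in t[:i]:
            # first occurrence of a distinct character: a literal phrase
            phrases.append(t[i])
            i += 1
        else:
            # extend the phrase while the extended substring still has an
            # occurrence starting before i (sources may overlap the phrase)
            length = 1
            while i + length < n and t.find(t[i:i + length + 1]) < i:
                length += 1
            src = t.find(t[i:i + length])  # leftmost occurrence = the source
            phrases.append((src, length))
            i += length
    return phrases
```

For instance, `lz77_parse("abab")` yields `['a', 'b', (0, 2)]`: two literal phrases followed by a phrase of length 2 copied from position 0, and `lz77_parse("aaaa")` yields `['a', (0, 3)]`, whose second phrase overlaps its own source.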

For example, if *T* is
then the parse of *T* (with parentheses around phrases) is
and *T*’s encoding is

Note that some phrases—e.g. 8-bottles-of-beer-on-the-wall-98-bottles-of-beer-—overlap their own sources. In addition, while the first verse takes the first four lines of the encoding, the next three verses together take only the last line. This is typical of LZ77’s performance on repetitive inputs. Finally, although these verses are annoyingly similar, they are much less similar than human genomes.

### (b) Finding primary matches

Let *M* and *K* be the given upper bounds on pattern lengths and edit distances. Let *T*_{M,K} be the text containing only those characters of *T* within distance *M*+*K*−1 of their nearest phrase boundary in the LZ77 parse; characters not adjacent in *T* are separated in *T*_{M,K} by *K*+1 copies of a special character # not in the normal alphabet. In other words, to obtain *T*_{M,K}, in each phrase, we replace all but the *M*+*K*−1 characters at either end of that phrase by *K*+1 copies of #.
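As a sketch of this construction (assuming, as above, that # is not in *T*'s alphabet, and taking the phrase starting positions as 0-based; the function name is ours), the filtered text can be built phrase by phrase:

```python
def filter_text(t, starts, m, k):
    """Build the filtered text T_{M,K} from T and the 0-based starting
    positions of the phrases in its LZ77 parse: in each phrase, keep the
    first and last M+K-1 characters and replace the rest by K+1 copies
    of '#'.  A sketch; assumes '#' does not occur in T."""
    keep = m + k - 1
    sep = '#' * (k + 1)
    out = []
    bounds = starts + [len(t)]
    for a, b in zip(bounds, bounds[1:]):
        phrase = t[a:b]
        if len(phrase) <= 2 * keep:
            out.append(phrase)          # short phrase: nothing to elide
        else:
            out.append(phrase[:keep] + sep + phrase[-keep:])
    return ''.join(out)
```

For example, `filter_text("abcdefghij", [0], 2, 1)` keeps the *M*+*K*−1 = 2 characters at either end of the single phrase and returns `"ab##ij"`.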

For example, if *T* is the example given above then *T*_{4,1} is
or, with parentheses indicating the phrases of *T*,

Note that, for any substring of *T* with length at most *M*+*K* that contains the first occurrence of a distinct character in *T* or crosses a phrase boundary in the LZ77 parse of *T*, there is a corresponding and equal substring in *T*_{M,K}.

We do not store *T*_{M,K} itself explicitly, but we store a conventional index *I*_{M,K} on *T*_{M,K}. We assume *I*_{M,K} can handle queries with pattern lengths up to *M* and edit distances up to *K*. Because *T*_{M,K} consists of at most 2*M*+3*K*−1 characters for each phrase, if *T* is highly repetitive and *M* and *K* are reasonable, then *I*_{M,K} should be smaller than a conventional index on all of *T*. In addition, for any valid query and any match in *T*_{M,K}, there is at least one match (primary or secondary) in *T*, so querying *I*_{M,K} should be faster than querying a conventional index for all of *T*.

Let *L* be the sorted list containing the positions of the first character of each phrase in the parse of *T*, and let *L*_{M,K} be the sorted list containing the positions of the corresponding characters in *T*_{M,K}. We store *L* and *L*_{M,K}. If *T*[*i*] is the first occurrence of a distinct character in *T* and *T*_{M,K}[*j*] is the corresponding character in *T*_{M,K}, then we mark *j* in *L*_{M,K}.

For our example, *L* is
and *L*_{4,1} (with asterisks indicating marked numbers) is

Given the endpoints *i* and *j* of a substring *T*_{M,K}[*i*..*j*] of *T*_{M,K} that does not include any occurrences of #, we can use *L*_{M,K} to determine whether the corresponding substring *T*[*i*′..*j*′] of *T* contains the first occurrence of a distinct character in *T* or crosses a phrase boundary in the LZ77 parse of *T*. To do this, we use binary search to find *i*’s successor *L*_{M,K}[*s*]. There are three cases to consider:

— if *i*<*L*_{M,K}[*s*]≤*j*, then *T*[*i*′..*j*′] crosses a phrase boundary;

— if *i*≤*j*<*L*_{M,K}[*s*], then *T*[*i*′..*j*′] neither contains the first occurrence of a distinct character nor crosses a phrase boundary;

— if *i*=*L*_{M,K}[*s*]≤*j*, then *T*[*i*′..*j*′] contains the first occurrence of a distinct character or crosses a phrase boundary if and only if *L*_{M,K}[*s*] is marked or *L*_{M,K}[*s*+1]≤*j*.

In addition, if *T*[*i*′..*j*′] contains the first occurrence of a distinct character or crosses a phrase boundary, then *i*′=*L*[*s*]−*L*_{M,K}[*s*]+*i* and *j*′=*i*′+*j*−*i*. In other words, we can use *L* and *L*_{M,K} as a mapping from *T*_{M,K} to *T*.
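A minimal sketch of this case analysis and mapping, using 0-based positions and Python’s `bisect` for the successor search (the function name and argument conventions are ours):

```python
from bisect import bisect_left

def map_match(i, j, L, Lmk, marked):
    """Decide whether the match T_{M,K}[i..j] (0-based, inclusive,
    containing no '#') corresponds to a primary match in T, and if so
    return its endpoints in T; otherwise return None.

    L      -- phrase start positions in T (sorted)
    Lmk    -- corresponding positions in T_{M,K} (sorted)
    marked -- indices s such that T_{M,K}[Lmk[s]] is the first
              occurrence of a distinct character in T
    """
    s = bisect_left(Lmk, i)        # successor: smallest s with Lmk[s] >= i
    if s == len(Lmk) or j < Lmk[s]:
        return None                # no boundary or first occurrence in range
    if Lmk[s] == i:
        # match starts exactly at a phrase boundary: primary only if that
        # character is a marked first occurrence or another boundary follows
        if s not in marked and (s + 1 == len(Lmk) or Lmk[s + 1] > j):
            return None
    ip = L[s] - Lmk[s] + i         # map back to T via the offset at boundary s
    return (ip, ip + j - i)
```

For the text `abab` with phrase starts `L = [0, 1, 2]`, filtered text equal to the text itself, `Lmk = [0, 1, 2]` and `marked = {0, 1}`, the match at positions (1, 2) maps to the primary match (1, 2) in *T* (it contains the first occurrence of b), while the match at (2, 3) is rejected as secondary.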

Given a query consisting of a pattern *P*[1..*m*] with *m*≤*M* and an edit distance *k*≤*K*, we use *I*_{M,K}, *L* and *L*_{M,K} to find all primary matches in *T*. First, we query *I*_{M,K} to find all matches in *T*_{M,K}. We then discard any matches containing copies of #. We use binary search on *L*_{M,K}, as described above, to determine which of the remaining matches correspond to primary matches in *T*. Finally, we use *L* and *L*_{M,K}, as described above, to find those primary matches’ positions in *T*.

### (c) Finding secondary matches

By definition, any secondary match is completely contained in some phrase. In addition, a phrase contains a secondary match if and only if that phrase’s source contains an earlier match (primary or secondary). More generally, a phrase contains a substring if and only if that phrase’s source contains the same substring. It follows that each secondary match is an exact copy of some primary match; more importantly, Kärkkäinen & Ukkonen [15] showed how, once we have found all the primary matches, we can then find all the secondary matches from the structure of the LZ77 parse.

To do this, for each primary match *T*[ℓ..*r*], we find each phrase *T*[*i*..*j*] whose source *T*[*i*′..*i*′+*j*−*i*] includes *T*[ℓ..*r*]—i.e. such that *i*′≤ℓ≤*r*≤*i*′+*j*−*i*. Note that *T*[ℓ′..*r*′]=*T*[ℓ..*r*], where ℓ′=*i*+ℓ−*i*′ and *r*′=*i*+*r*−*i*′. We record *T*[ℓ′..*r*′] as a secondary match and recurse on it.

To be able to find quickly all the sources that cover a match, we store a data structure for two-sided range reporting on the *n*×*n* grid containing a marker at (*i*′,*j*′) for every phrase’s source *T*[*i*′..*j*′]. With each marker, we store as a satellite datum the starting point of the actual phrase. In other words, if a phrase *T*[*i*..*j*] is encoded as (*i*′,*j*−*i*+1) by LZ77, then there is a marker on the grid at (*i*′,*i*′+*j*−*i*) with satellite datum *i*.

For example, for the phrases shown in §2*a,* there are markers at
with satellite data

Note that the markers are simply the encodings of the phrases not consisting of the first occurrences of distinct characters in *T* (see §2*a*), with each second component *j*−*i*+1 replaced by *i*′+*j*−*i*, where *i*′ is the first component. In addition, the satellite data are the positions of the first characters in those phrases, which are a subset of *L*.
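Under the same 0-based conventions as before, building the markers and their satellite data from a parse given as literal characters and (source, length) pairs might look like this (a sketch; the input format is our assumption):

```python
def build_grid(parse):
    """Build the two-sided range-reporting points from an LZ77 parse
    given as a list of phrases: either a literal character or a pair
    (src, length).  A phrase starting at position i and encoded as
    (src, length) yields a marker at (src, src + length - 1) with
    satellite datum i.  All positions are 0-based."""
    points, i = [], 0
    for ph in parse:
        if isinstance(ph, tuple):
            src, length = ph
            points.append((src, src + length - 1, i))
            i += length
        else:
            i += 1                      # literal phrase of length 1
    return points
```

For the parse `['a', 'b', (0, 2)]` of the text `abab`, this produces the single point `(0, 1, 2)`: the phrase starting at position 2 is copied from the source spanning positions 0 to 1.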

### (d) Implementation

Recall that, to be able to find primary matches, we store the conventional index *I*_{M,K} on *T*_{M,K}, the list *L* and the list *L*_{M,K}. Because we want our hybrid index to be flexible, we do not consider how *I*_{M,K} works. (However, we note that it may sometimes be better to use fewer than *K*+1 copies of # as separators, as they serve only to limit the worst-case number of matches in *T*_{M,K}.)

We store *L* and *L*_{M,K} using gap coding—i.e. storing the differences between consecutive values—with every *g*th value stored un-encoded, where *g* is a parameter. We write the differences as ⌈log_{2} *d*⌉-bit integers, where *d* is the largest difference in the list, and we write the un-encoded values as ⌈log_{2} *n*⌉-bit integers. To speed up binary search in *L*_{M,K}, we also sample every *b*th value in it, where *b* is another parameter (typically a multiple of *g*).
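The following Python class sketches the gap-coding scheme. It computes the ⌈log_{2} *d*⌉-bit width but, for readability, does not actually bit-pack the values; every *g*th value is kept un-encoded and the others are recovered by summing differences from the nearest preceding sample. (The class and attribute names are ours.)

```python
import math

class GapCodedList:
    """Sorted list stored with gap coding: every g-th value is kept
    un-encoded; the rest are differences from their predecessors.
    A sketch of the scheme in the text, without real bit-packing."""

    def __init__(self, values, g):
        self.g = g
        self.samples = values[::g]                    # un-encoded samples
        self.gaps = [values[i] - values[i - 1]
                     for i in range(len(values)) if i % g]
        d = max(self.gaps, default=1)                 # largest difference
        self.bits_per_gap = max(1, math.ceil(math.log2(d + 1)))

    def __getitem__(self, i):
        # decode value i by starting from its sample and summing gaps
        base = (i // self.g) * self.g
        v = self.samples[i // self.g]
        gap_start = base - base // self.g             # gaps before position base
        for k in range(gap_start, gap_start + (i - base)):
            v += self.gaps[k]
        return v
```

With *g*=3, the list [3, 7, 8, 15, 20, 21, 30] is stored as samples [3, 15, 30] and gaps [4, 1, 5, 1], each of which would fit in ⌈log_{2} 6⌉ = 3 bits.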

Instead of marking values in *L*_{M,K}, we store an array containing the position in *L*_{M,K} of the first occurrence of each distinct character, in order of appearance. We note, however, that this array is only necessary if there may be matches of length one.

To be able to find secondary matches, we store a data structure for two-sided range reporting on the grid described in §2*c*. To build this data structure, we sort the points by their *x*-coordinates. We store the sorted list *X* of *x*-coordinates using gap encoding, with every *g*th value stored un-encoded as before, and every *b*th value sampled to speed up binary search. We store a position-only range-maximum data structure (see [16]) over the list *Y* of *y*-coordinates, sorted by *x*-coordinate. Finally, we store each satellite datum as a ⌈log_{2} *z*⌉-bit pointer to the cell of *L* holding that datum, where *z* is the number of phrases.

We need not store points’ *y*-coordinates explicitly because, if a point has *x*-coordinate *i*′ and satellite datum *i*, then that point’s *y*-coordinate is *i*′+*j*−*i*, where *j*+1 is the value that follows *i* in *L*. (We append *n*+1 to *L* to ensure that we can always compute *j* in this way, although this is not necessary when the last character of *T* is a special end-of-file symbol.) Because we can access *i*′, *i* and *j*, we can access *Y*.

Once we have found all primary matches, we apply recursive two-sided range reporting. To do this, we place the endpoints of the primary matches in a linked list and set a pointer to the head of the list. Until we reach the end of the list, for each match *T*[ℓ..*r*] in the list, we repeat the following procedure:

— We use binary search to find ℓ’s predecessor *X*[*k*] in *X*.

— We use recursive range-maximum queries to find the values in *Y*[1..*k*] that are at least *r*.

— For each point (*i*′,*j*′) we find with *i*′≤ℓ≤*r*≤*j*′, we compute ℓ′ and *r*′ as described in §2*c*.

— We append the pair (ℓ′,*r*′) of endpoints to the list and move the pointer forward one element in the list.

When we finish, the list contains the endpoints of all primary matches followed by the endpoints of all secondary matches.
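The whole procedure can be sketched as follows, with the worklist doubling as the output. For clarity, the recursive range-maximum report below scans for the maximum instead of using the succinct structure of [16], so it illustrates the logic rather than the performance; the point format (source start, source end, phrase start), 0-based and inclusive, is our assumption.

```python
from bisect import bisect_right

def find_secondary(primary, sources):
    """Given primary matches [(l, r), ...] and the LZ77 sources as
    triples (x, y, phrase_start), meaning the phrase starting at
    phrase_start is copied from T[x..y], return all matches.
    A sketch of Karkkainen & Ukkonen's recursive reporting."""
    sources = sorted(sources)                 # sort points by x-coordinate
    X = [s[0] for s in sources]
    Y = [s[1] for s in sources]
    P = [s[2] for s in sources]

    def report(lo, hi, r, out):
        # recursive range-maximum reporting: find the maximum in
        # Y[lo..hi]; if it is >= r, report it and recurse on both sides
        if lo > hi:
            return
        idx = max(range(lo, hi + 1), key=Y.__getitem__)
        if Y[idx] < r:
            return
        out.append(idx)
        report(lo, idx - 1, r, out)
        report(idx + 1, hi, r, out)

    matches = list(primary)                   # worklist doubles as output
    head = 0
    while head < len(matches):
        l, r = matches[head]
        head += 1
        k = bisect_right(X, l) - 1            # predecessor of l in X
        if k >= 0:
            covering = []
            report(0, k, r, covering)
            for idx in covering:              # source T[x..y] covers T[l..r]
                off = P[idx] - X[idx]
                matches.append((l + off, r + off))
    return matches
```

For the text `abab`, whose only copying phrase gives the point (0, 1, 2), the primary match (0, 1) for the pattern ab yields the secondary match (2, 3).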

## 3. Experiments

In our experiments, we compared a hybrid index based on an FM-index for the filtered text against an FM-index for the original text. We always used the same implementation of an FM-index^{2} with the same parameters, chosen to achieve reasonable compression for highly repetitive datasets. We used an FM-index instead of popular indexes for compressed pattern matching because the latter are usually heavily optimized to take advantage of, for example, multiple processors; we plan to compare against them after we have parallelized the hybrid index. We performed our experiments on an Intel Xeon with 96 GB RAM and eight processors at 2.4 GHz with 12 MB cache, running Linux 2.6.32-46-server. We compiled both indexes with g++ using full optimization.

We used benchmark datasets from the repetitive corpus of the Pizza&Chili website.^{3} Specifically, we used the following files:

— cere — 37 *Saccharomyces cerevisiae* genomes from the Saccharomyces Genome Resequencing Project;

— einstein — versions of the English Wikipedia page on Albert Einstein up to 10 November 2006;

— fib41 — the 41st Fibonacci word *F*_{41}, where *F*_{1}=0, *F*_{2}=1, *F*_{i}=*F*_{i−1}*F*_{i−2};

— kernel — 36 versions of the Linux 1.0.x and 1.1.x kernels.

We generally set *M*=100, as that seemed a reasonable value for many applications. Because standard FM-indexes do not support approximate pattern matching, we set *K*=0 throughout. Based on preliminary tests, we set the sampling parameters *g* and *b* for our hybrid index to 32 and 512, respectively. Note that these parameters have no effect on the FM-indexes.

Table 1 shows the sizes of the uncompressed files, the files compressed with 7zip^{4} (which does not support pattern matching), the FM-indexes and the hybrid indexes. It also shows the construction times for the FM-indexes and hybrid indexes. The times to build the hybrid indexes include the times to compute the files’ LZ77 parses (which, in turn, include the times to build the files’ suffix arrays).

To estimate how well hybrid indexing takes advantage of repetitive structure, relative to FM-indexing, we truncated cere at 100, 200, 300 and 400 MB, then built FM-indexes and hybrid indexes for those prefixes. Figure 1 shows the sizes of those indexes.

For lengths 10, 20, 40 and 80, we randomly chose 3000 non-unary substrings of that length from cere and searched for them with its FM-index and hybrid index. Figure 2 shows the average query times, using a logarithmic scale. The large difference between the query times for patterns of length 10 and those of length 20 seems to be because there are usually far more matches for patterns of length 10; the average time per match stayed roughly the same for all four lengths.

On reflection, it is not surprising that the hybrid index performs well here: while the FM-index finds all matches with its locate functionality, the hybrid index finds secondary matches with two-sided range reporting, which is relatively fast; because cere consists of 37 genomes from individuals of the same species, most matches are secondary.

## 4. Conclusions and future work

We have introduced a simple technique, hybrid indexing, for reducing the size of conventional indexes on highly repetitive texts. In our experiments, this technique worked well on benchmark datasets and seemed to scale well. It also significantly reduced query times. We plan to optimize our implementation to use, for example, parallelism across multiple processors, use a more powerful conventional index on the filtered texts, and then compare our hybrid index with popular conventional indexes for approximate pattern matching.

We are also working to optimize hybrid indexing in other ways. For example, readers may have noted that, in our example *T*_{4,1}, there are many copies of ## eer-take ##. Including all these copies seems wasteful, because they could be replaced by, for example, dummy phrases. We are currently testing whether this notably further reduces the size of hybrid indexes in practice.

## 5. Postscript

We first circulated the basic ideas contained in this paper in a preprint [17], section 1, p. 3 in August 2012, then presented them at a scientific meeting of the Royal Society of London in February 2013. We have now learnt that Wandelt *et al.* [18] independently and roughly simultaneously developed many of the same ideas and submitted them in March 2013.

## Funding statement

This research was supported by Fondecyt grant no. 1-110066, the Helsinki Institute for Information Technology, and Academy of Finland grant nos. 134287, 250345 (CoECGR) and 258308. Parts of this research were done while the first author was visiting the University of Helsinki and while the second author was at Aalto University.

## Acknowledgements

Many thanks to Paweł Gawrychowski, Juha Kärkkäinen, Veli Mäkinen, Gonzalo Navarro, Jouni Sirén and Jorma Tarhio, for helpful discussions.

## Footnotes

One contribution of 11 to a Theo Murphy Meeting Issue ‘Storage and indexing of massive data’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.