## Abstract

*Histogram indexing*, also known as *jumbled pattern indexing* and *permutation indexing*, is one of the important current open problems in pattern matching. It was introduced about six years ago and has seen active research since. Yet, to date, there is no algorithm that can preprocess a text *T* in time *o*(|*T*|^{2}/polylog|*T*|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. The *block-mass pattern matching* problem is a recently introduced problem, motivated by issues in mass spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, there has been progress. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is remarkably simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than in the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text *T* in time *o*(|*T*|^{2}/polylog|*T*|) and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a trade-off between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature; (ii) we take the first step in introducing a class of indexing problems that, we believe, cannot be preprocessed in time *o*(|*T*|^{2}/polylog|*T*|) in a way that enables linear-time query processing.

## 1. Introduction

### (a) Motivation

Traditional pattern matching regards the text *T* and pattern *P* as *sequential* strings, provided and stored in sequence (e.g. from left to right). Therefore, implicit in conventional approximate pattern matching was the assumption that there may indeed be errors in the *content* of the data, but the *order* of the data is inviolate. However, some non-conforming problems have changed this view. Such examples range from text editing [1,2], through computational biology [3–6], to architecture [7,8].

Motivated by these questions, a new pattern matching paradigm—*pattern matching with rearrangements*—was proposed by Amir *et al*. [9]. In this model, the pattern *content* remains intact, but the relative positions (addresses) may change. Several papers [10,9,11,7,12,8,13] followed the initial definition of the new paradigm. The simplest pattern matching with rearrangements problem is that of *histogram matching*, also known as *jumbled pattern matching*, *Parikh matching* or *permutation matching*. The goal of histogram matching is, given a text *T* and a pattern *P*, to find all locations *i* in the text where *T*[*i*],…,*T*[*i*+|*P*|−1] has the same alphabet symbols and the same frequency of symbols as *P*. Formally,

### Definition 1.1

Let

*Σ* be an alphabet and *S*=*S*[1],…,*S*[*n*], *S*[*i*]∈*Σ*. Let {*σ*_{i1},…,*σ*_{ik}}⊆*Σ* be the set of alphabet symbols appearing in *S*. Let #_{σij} be the number of times *σ*_{ij} occurs in *S*, *j*=1,…,*k*. Call #_{σij} the *histogram frequency* of *σ*_{ij}. Then {(*σ*_{i1},#_{σi1}),…,(*σ*_{ik},#_{σik})} is the *histogram* of *S*.

The *histogram matching* problem is the following: *Input*: a text *T*=*T*[1],…,*T*[*n*] and a pattern *P*=*P*[1],…,*P*[*m*] over alphabet *Σ*. *Output*: All indices *i* where the substring *T*[*i*],…,*T*[*i*+*m*−1] of *T* has the same histogram as *P*.

In §2*a*, we will show that the histogram matching problem can be solved in time *O*(*n* log|*Σ*|).

We now consider another interesting, and efficiently solved, problem in the rearrangement model that was recently introduced by Ng *et al.* [14]. The problem is *block-mass indexing* (BMI).

From a biological point of view, the BMI problem arises in mass spectrometry. *Tandem mass spectrometry* (MS/MS) analyses *peptides* (short 8–30 amino acid long fragments of proteins) by generating their *spectra*.^{1} The still unsolved problem in computational proteomics is to reconstruct a peptide from its spectrum: even the most advanced de novo peptide sequencing tools correctly reconstruct only 30–45% of the peptides identified in MS/MS database searches [15]. After two decades of algorithmic developments, it seems that de novo peptide sequencing ‘hits a wall’ and that accurate full-length peptide reconstruction is nearly impossible due to the limited information content of MS/MS spectra. Recently, Jeong *et al.* [16] advocated the use of *gapped* (rather than full length) peptides to address the existing bottleneck in de novo peptide sequencing.

Given a string of *n* integers *a*_{1},*a*_{2},…,*a*_{n} and *k* integers 1≤*i*_{1}<⋯<*i*_{k}<*n*, a gapped peptide is the string of (*k*+1) integers *a*_{1}+⋯+*a*_{i1}, *a*_{i1+1}+⋯+*a*_{i2},…,*a*_{ik+1}+⋯+*a*_{n}. For example, if a peptide LNRVSQGK is represented as the sequence of its rounded amino acid masses 113, 114, 156, 99, 87, 128, 57, 128, then 113+114, 156+99+87, 128+57, 128 represents the gapped peptide 227, 342, 185, 128.
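As a small illustration (a hypothetical helper, not taken from [14] or [16]), the gapped peptide of the example can be computed by summing the masses between consecutive cut points:

```python
def gapped_peptide(masses, cuts):
    """Sum the masses between consecutive cut points: `masses` is a_1..a_n and
    `cuts` lists the k indices 1 <= i_1 < ... < i_k < n after which to split."""
    bounds = [0] + list(cuts) + [len(masses)]
    return [sum(masses[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

# LNRVSQGK as rounded amino acid masses; cuts after positions 2, 5 and 7:
print(gapped_peptide([113, 114, 156, 99, 87, 128, 57, 128], [2, 5, 7]))
# [227, 342, 185, 128]
```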

In [14], the BPM algorithm was presented. BPM supports the modelling and identification of mutated peptides and of fused peptides in tumour genomes.

Combinatorially, the block-mass pattern matching problem is defined below.

### Definition 1.2

Let *T*=*T*[1],*T*[2],…,*T*[*n*] be a text over finite alphabet *Σ*⊂ℕ. The *mass* of an alphabet symbol *σ* is its value, i.e. mass(*σ*)=*σ*. Let *S*=*T*[*i*],*T*[*i*+1],…,*T*[*j*] be a substring of *T*. Then mass(*S*) (the *mass* of substring *S*) is ∑_{ℓ=*i*}^{*j*}*T*[ℓ].

Substrings *T*[*i*],…,*T*[*j*] and *T*[*j*+1],…,*T*[*k*] are called *consecutive*. A *block* in *T* is a sequence of consecutive substrings. The *mass of a block B* (mass(*B*)) is a string comprising the masses of the consecutive substrings of *B*, respectively. Formally, if *B*=*S*_{1}⋯*S*_{k}, then mass(*B*)=mass(*S*_{1}),…,mass(*S*_{k}).

### Example 1.3

Assume an alphabet of 18 symbols that represents the integer masses of the 20 amino acids. Let *T*=114,77,57,112,113,186,57,99,112,112,186,113,99. The substrings (57,112,113), (186,57) and (99,112,112) define a block *B* in *T*, and mass(*B*)=282,243,323.
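A minimal sketch of the mass definitions, reproducing example 1.3:

```python
def mass(S):
    # the mass of a substring is the sum of its symbols' masses
    return sum(S)

def block_mass(B):
    # the mass of a block is the string of masses of its consecutive substrings
    return [mass(S) for S in B]

# the block of example 1.3:
B = [(57, 112, 113), (186, 57), (99, 112, 112)]
print(block_mass(B))  # [282, 243, 323]
```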

### Definition 1.4

The *block-mass pattern matching problem* is defined as follows:

*Input*: A text *T* over alphabet *Σ* and a pattern *P* over ℕ. *Output*: All blocks *B* in the text *T* such that *mass*(*B*)=*P*.

We now introduce the indexing model. *Pattern matching with indexing* is the problem where, given text *T*, one preprocesses it so that subsequent pattern matching queries of the type ‘Given pattern *P*, find all occurrences of *P* in *T*’ can be answered efficiently. For a fixed finite alphabet, it is known how to preprocess a text in time *O*(|*T*|) such that subsequent queries can be solved in time *O*(|*P*|). These results are quite sophisticated and involve using suffix trees or arrays [17,18,19,20,21,22]. It is of interest to consider both the histogram matching problem and the block-mass matching problem in an indexing model. We formally define them below.

The *histogram indexing* (*Jumbled Indexing*, *Parikh Indexing*, *Permutation Indexing*) problem is defined as follows:

### Definition 1.5

Let *T*=*T*[1],…,*T*[*n*] be a text over alphabet *Σ*. *Preprocess*: *T*, to enable the following queries. *Query*: Given pattern *P*=*P*[1],…,*P*[*m*] over *Σ*, decide whether there is a substring *T*[*i*],…,*T*[*i*+*m*−1] of *T* that has the same histogram as *P*.

The *indexing* version of the block-mass pattern matching problem, where the text is preprocessed so that subsequent queries can be answered without the need to search the entire text, was considered by Ng *et al.* [14]. Their indexing scheme handles a single-element pattern. The problem is formally defined below.

### Definition 1.6

Let *T*=*T*[1],*T*[2],…,*T*[*n*] be a text over finite alphabet *Σ*={*i*_{1},…,*i*_{c}}⊂ℕ, where *i*_{1}<*i*_{2}<⋯<*i*_{c}.

The *BMI problem* is that of preprocessing text *T* in a manner that allows efficient solutions to queries of the form:

*Query*: Given pattern *P* over alphabet ℕ, find all blocks in *T* whose mass is equal to the mass of *P*.

If indexing is not required, and we only consider the sequential version of the histogram pattern matching problem, and of the block-mass matching problem, these problems can be efficiently solved, as we will show in §2. In this sense, they are similar to the exact matching problem that is also efficiently solved, albeit using non-trivial techniques (e.g. [23,24]). Surprisingly enough, there are various ways to solve the exact indexing problem with linear-time text preprocessing and linear-time pattern query, but the histogram indexing problem has proved elusively hard and is one of the important open problems in pattern matching today [25,26,27,28,29].

Similarly, the block-mass indexing problem does not have a pleasing solution. Ng *et al.* [14] present an efficient *O*(*m*^{2}+tocc) query algorithm for seeking blocks of mass *m*, following an *O*(*n*^{1.5}) text preprocessing, where the text length is *n* and tocc is the number of blocks of mass *m* in the text. They also present an algorithm with a linear-time preprocessing but a slower query time. Finally, they prove that there is an algorithm whose expected query time is *O*(*m*+tocc), following a suitable preprocessing of the text. The Ng *et al.* results hold for finite alphabets, and their main attractiveness is the algorithm with fast expected query time.

The main contribution of this paper is in showing that there is a strong relationship between the histogram indexing problem and the BMI problem with a single-element pattern, where a fixed finite alphabet is considered. Proving this relationship is not too complex. However, this is a case where a very *simple* observation is *important* because it illuminates a useful relationship between apparently different notions that no one thought to juxtapose. In this specific case, this relationship is useful for a number of reasons:

- Sometimes there is more insight about a particular problem, and such insight can help solve equivalent problems.
- It may have seemed strange that Ng *et al.* [14] considered a seemingly simple case of the BMI problem: one where the pattern has a single element. Our reduction shows that the single-element pattern case is indeed quite challenging.
- Most importantly, this note offers some more theoretical understanding of indexing in the rearrangement model, which has proved tougher than exact indexing. We conjecture that there is a class of indexing problems of greater complexity than exact matching, where it is impossible to preprocess the text *T* in time *o*(|*T*|^{2}/polylog|*T*|) and enable linear-time query processing. The problems considered in this paper are the first two examples of such problems.

Another major contribution of this paper is a new concept in measuring the complexity of this class of indexing problems. Six years of research have not produced a histogram indexing algorithm with pre-processing time *o*(|*T*|^{2}/polylog|*T*|) and linear-time queries. Having conjectured that we cannot achieve linear-time query processing following a pre-processing that is significantly faster than quadratic, we lower our sights. We develop a trade-off algorithm whereby, for any given *ϵ*, the text *T* is preprocessed in time *O*(|*T*|^{1+ϵ} polylog|*T*|), and then query pattern *P* is processed in time *O*(|*P*|^{1/ϵ}). We show that this trade-off scheme exists for both BMI and histogram indexing over unbounded alphabets. A similar idea was independently shown by Kociumaka *et al*. [30].

## 2. Sequential solutions

One of the most well-known problems for which there is no known efficient indexing solution is *Approximate Indexing*: can one preprocess a text and then allow queries that seek all text substrings within edit distance *k* of the query pattern *P*? Much work has been done on this problem (e.g. [31–33]) and yet no efficient indexing algorithm has been found. This may not be surprising, as there are no known linear-time sequential solutions to the problem [34–37].

In this section, we show that this is not the case for either the histogram matching problem or the block-mass matching problem. Both have efficient sequential solutions.

### (a) Sequential histogram matching

The following simple sliding window algorithm solves the histogram matching problem in time *O*(*n* log|*Σ*|), where *Σ* is the set of distinct alphabet symbols in the pattern. The algorithm considers a window of length *m* that slides over the text. The window is initialized to *T*[1],…,*T*[*m*]. A slide to the right changes window *T*[*i*],…,*T*[*i*+*m*−1] to window *T*[*i*+1],…,*T*[*i*+*m*]. For every window, we keep the *window frequency* of every symbol in the window, i.e. the number of times the symbol occurs in the window, and the *histogram equality counter*, which keeps count of the number of symbols in the window whose window frequency is not equal to their histogram frequency. For every window *T*[*i*],…,*T*[*i*+*m*−1] whose histogram equality counter is 0, we have a histogram match at location *i*.

We initialize the first window in time *O*(*m*); every subsequent window slide then requires only a constant-time update of the two window frequencies that change and of the histogram equality counter.
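The initialization and the slide update can be sketched as follows (a minimal sketch using a hash-based counter, so each update is expected constant time):

```python
from collections import Counter

def histogram_matches(T, P):
    """All 1-based locations i where T[i..i+m-1] has the same histogram as P."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    hist = Counter(P)                 # histogram frequencies of P
    window = Counter(T[:m])           # first window, initialized in O(m)
    # histogram equality counter: symbols whose window frequency differs
    mismatched = sum(1 for s in set(hist) | set(window) if hist[s] != window[s])
    out = [1] if mismatched == 0 else []

    def bump(sym, delta):             # adjust one window frequency and the counter
        nonlocal mismatched
        was_equal = window[sym] == hist[sym]
        window[sym] += delta
        mismatched += was_equal - (window[sym] == hist[sym])

    for i in range(1, n - m + 1):     # slide: window now starts at i+1 (1-based)
        bump(T[i - 1], -1)            # symbol leaving on the left
        bump(T[i + m - 1], +1)        # symbol entering on the right
        if mismatched == 0:
            out.append(i + 1)
    return out

print(histogram_matches("aabcab", "aba"))  # [1]
```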

### (b) Block-mass pattern matching

We provide a convolution-based algorithm for the sequential version of the block-mass pattern matching problem.

Consider the following reduction. Let *u* be the function that maps a natural number *n* to the binary string 10^{*n*−1}, where *a*^{i} means the symbol *a* repeated *i* times.

### Example 2.1

*u*(5)=10000, *u*(2)=10, *u*(7)=1000000 and *u*(1)=1.

For string *T*, construct bit array *B*_{T}=*B*_{T}[1],…,*B*_{T}[*mass*(*T*)], where *B*_{T} is the concatenation of *u*(*T*[1]),…,*u*(*T*[*n*]).

### Definition 2.2

Let *A*=*A*[1],…,*A*[*n*] and *B*=*B*[1],…,*B*[*m*] be binary arrays. We say that *B* *point matches* *A* *at location* *i*, if *A*[*i*+ℓ−1]=1 for all ℓ such that *B*[ℓ]=1.

### Example 2.3

Let *A*=001001000010101000100 and *B*=10001. Then *B* point matches in locations 11 and 15 of *A* because those are the only two locations where *every* 1 in *B* is aligned with a 1 in *A*. Note that there is no problem with the fact that when *B* is positioned at location 11 of *A*, *A*[13]=1 whereas *B*[3]=0. We do not care if 1’s in *A* are aligned with 0’s in *B*. It is only the other direction that is problematic.

Our reduction is now complete. It is obvious that

### Lemma 2.4

*P* block-mass matches *T* at location *i* iff there is a point matching of *B*_{P} in *B*_{T} at location mass(*T*[1],…,*T*[*i*−1])+1.
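A minimal sketch of this reduction in Python; the trailing sentinel 1, which closes the final substring boundary, is our addition:

```python
def u(n):
    # u maps a natural number n to the binary string 1 0^(n-1)
    return "1" + "0" * (n - 1)

def encode(S):
    # B_S: the concatenation u(S[1]) ... u(S[n]); every 1 marks an element start
    return "".join(u(x) for x in S)

def point_match(A, B, i):
    # B point matches A at (1-based) location i iff every 1 of B aligns with a 1 of A
    return all(A[i - 1 + l] == "1" for l, b in enumerate(B) if b == "1")

def pm_positions(T, P):
    # trailing "1" sentinels (our addition) close the final substring boundary
    BT, BP = encode(T) + "1", encode(P) + "1"
    return [i for i in range(1, len(BT) - len(BP) + 2) if point_match(BT, BP, i)]

# P = (3, 1) block-mass matches T at element 2; mass(T[1]) + 1 = 3:
print(pm_positions([2, 3, 1, 2], [3, 1]))  # [3]
```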

Point matching can be efficiently computed by *convolutions*.

#### (i) The finite alphabet case

Convolutions are used for filtering in signal processing and other applications. A convolution uses two initial functions, *t* and *p*, to produce a third function *t*⊗*p*. We formally define a discrete convolution.

### Definition 2.5

Let *T* be a function whose domain is {1,…,*n*} and *P* a function whose domain is {1,…,*m*}. We may view *T* and *P* as arrays of numbers, whose lengths are *n* and *m*, respectively. The *discrete convolution of T and P* is the polynomial multiplication *T*⊗*P*, where (*T*⊗*P*)[*j*]=∑_{*i*=1}^{*m*}*T*[*j*+*i*−1]⋅*P*[*i*], *j*=1,…,*n*−*m*+1.

In the general case, the convolution can be computed by using the fast Fourier transform (FFT) [38] on *T* and *P*^{R}, the reverse of *P*. This can be done in time *O*(*n* log *m*), in a computational model with word size *O*(log *m*).

*Important Property*: The crucial property contributing to the usefulness of convolutions is the following. For every fixed location *j*_{1} in *T*, we are, in essence, overlaying *P* on *T*, starting at *j*_{1}, i.e. *P*[1] corresponds to *T*[*j*_{1}], *P*[2] to *T*[*j*_{1}+1],…,*P*[*i*] to *T*[*j*_{1}+*i*−1],…,*P*[*m*] to *T*[*j*_{1}+*m*−1]. We multiply each element of *P* by its corresponding element of *T* and add all *m* resulting products. This is the convolution’s value at location *j*_{1}.

Clearly, computing the convolution’s value for every text location *j* can be done in time *O*(*nm*). The fortunate property of convolutions over algebraically closed fields is that they can be computed *for all* *n* text locations in time *O*(*n* log *m*) using the FFT.

This property of convolutions can be used to efficiently compute relations of patterns in texts. This is generally done via *linear reductions* to convolutions. In the definition below, ℕ represents the natural numbers and ℝ represents the real numbers.

### Definition 2.6

Let *P* be a pattern of length *m* and *T* a text of length *n* over some alphabet *Σ*. Let *R*(*S*_{1},*S*_{2}) be a relation on strings of length *m* over *Σ*. We say that the relation *R* *holds between P and location j of T* if *R*(*P*[1]⋯*P*[*m*], *T*[*j*]*T*[*j*+1]⋯*T*[*j*+*m*−1]) holds.

We say that *R* is *linearly reduced* to convolutions if there exist a natural number *c*, a constant time computable function *f*:ℝ^{c}→{0,1}, and linear-time functions ℓ_{i} and *r*_{i}, where ℓ_{i}(*P*)∈ℝ^{m} and *r*_{i}(*T*)∈ℝ^{n}, *i*=1,…,*c*, such that *R* holds between *P* and location *j* in *T* iff *f*((ℓ_{1}(*P*)⊗*r*_{1}(*T*))[*j*],…,(ℓ_{c}(*P*)⊗*r*_{c}(*T*))[*j*])=1.

Let *R* be a relation that is linearly reduced to convolutions. It follows immediately from the definition that, using the FFT to compute the *c* convolutions, it is possible to find all locations *j* in *T* where relation *R* holds in time *O*(*cn* log *m*).

### Lemma 2.7

Let *T* be a binary array of length *n* and *P* a binary array of length *m*. The point matching of *P* in *T* can be computed in time *O*(*n* log *m*).

### Proof.

*Σ*={0,1} and *R* is the point matching relation. The locations where *R* holds between *P* and *T* are the locations *j* where *T*[*j*+*i*−1]≥*P*[*i*], *i*=1,…,*m*. This can be computed in time *O*(*n* log *m*) by the following simple reduction to convolutions.

Let *c*=1, ℓ_{1}(*P*)=*P* and *r*_{1}(*T*)=*χ*_{0}(*T*), where *χ*_{0}(0)=1 and *χ*_{0}(1)=0, and where we extend the definition of the functions *χ*_{σ} to strings in the usual manner, i.e. for *S*=*S*[1]*S*[2]⋯*S*[*n*], *χ*_{σ}(*S*)=*χ*_{σ}(*S*[1])*χ*_{σ}(*S*[2])⋯*χ*_{σ}(*S*[*n*]).

Let *f*(*x*)=1 if *x*=0, and *f*(*x*)=0 otherwise. Then for every text location *j*, *f*((ℓ_{1}(*P*)⊗*r*_{1}(*T*))[*j*])=1 iff there is a point matching of *P* at location *j* of *T*, since the convolution counts the 1’s of *P* that are aligned with 0’s of *T*.
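The reduction in the proof can be exercised numerically. The sketch below assumes the complement trick (*c*=1, ℓ_{1}(*P*)=*P*, *r*_{1}(*T*) the bitwise complement of *T*, and *f*(*x*)=1 iff *x*=0) and evaluates the convolution naively rather than with the FFT:

```python
def convolve_at(P, T, j):
    # the convolution's value with P overlaid on T starting at (1-based) location j
    return sum(P[i] * T[j - 1 + i] for i in range(len(P)))

def point_match_positions(T, P):
    """Point matching via the reduction: l1(P) = P, r1(T) = chi_0(T) (the
    complement of T), f(x) = 1 iff x = 0.  A fast implementation would compute
    the single convolution for all j at once with the FFT; here it is naive."""
    comp = [1 - t for t in T]      # chi_0(T): 1 exactly where T has a 0
    return [j for j in range(1, len(T) - len(P) + 2)
            if convolve_at(P, comp, j) == 0]

# example 2.3:
A = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
B = [1, 0, 0, 0, 1]
print(point_match_positions(A, B))  # [11, 15]
```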

*Time*: For a finite alphabet *Σ*, |*B*_{T}|=*O*(|*T*|), since |*B*_{T}|=mass(*T*)≤*σ*_{max}⋅|*T*|, where *σ*_{max}=max{*σ*:*σ*∈*Σ*} is a constant. Similarly, |*B*_{P}|=*O*(|*P*|). We conclude that we can solve the block-mass pattern matching problem over finite alphabets in time *O*(|*T*| log|*P*|).

#### (ii) The infinite alphabet case

In the infinite alphabet case, the bit vectors *B*_{T} and *B*_{P} produced by our reduction may be too large, even exponential in the input size. However, these vectors are quite sparse, with the vast majority of entries being 0. Therefore, we can overcome the exponential blowup in size by encoding the arrays as sets of the indices of the bit vector locations whose value is 1. The sizes of these sets are proportional to the lengths of the original strings *T* and *P*.

We are now confronted with the problem of finding the convolution vector *W* of the two vectors *B*_{T},*B*_{P} when the two vectors are not given explicitly. While the running time of the regular fast convolution is proportional to the full (possibly huge) vector lengths, the aim here is to compute *W* in time proportional to the number of non-zero entries in *W*. This problem was posed by Muthukrishnan [39].

Cole & Hariharan [40] proposed the following algorithm:

### Theorem 2.8

*Let V*_{1} *and V*_{2} *be two sparse arrays, where n*_{i} *is the number of non-zero entries in V*_{i}*, i=1,2. The convolution of V*_{1} *and V*_{2} *can be computed in time O(w* log^{2}*n*_{2}*) by a Las Vegas randomized algorithm with failure probability that is inverse polynomial in n*_{2}*, where w is the number of non-zero entries in the convolution vector.*

This result, although randomized, is very fast in practice. However, our situation is even better. We are assuming a static database *T* where we can afford some preprocessing that subsequently will allow fast queries. In this case, there is a fast deterministic algorithm [41], which preprocesses the *V*_{1} array in time *O*(*n*_{1}^{2}), where *n*_{1} is the number of non-zero entries of *V*_{1}, and subsequently achieves the following time for the sparse convolution:

### Theorem 2.9

*Let V*_{1} *and V*_{2} *be two sparse arrays, where n*_{i} *is the number of non-zero entries in V*_{i}*, i=1,2. The convolution of V*_{1} *and V*_{2} *can be computed in time O(w* polylog *n*_{2}*), where w is the number of non-zero entries in the convolution vector, following an O(n*_{1}^{2}*) preprocessing of V*_{1}.

The above results mean that the block-mass pattern matching problem over alphabet ℕ can be solved in the following time.

*Time*: Let *T* be a text of length *n* and *P* a pattern of length *m* over alphabet ℕ. The block-mass pattern matching problem can be solved, in time proportional to the number of non-zero entries of the output vector (up to polylogarithmic factors), by a Las Vegas randomized algorithm with failure probability that is inverse polynomial in *m*. If *T* is preprocessed in time *O*(*n*^{2}), then subsequent block-mass pattern matching queries can be solved deterministically within the same bound.

Ng *et al.* [14] consider the simpler problem of indexing a string for block-mass matching of length-1 patterns. We will be considering the indexing version of this special case. Note, however, that block-mass pattern matching of length-1 patterns does not need the convolutions mechanism. It is easy to compute a ‘sliding window’ over *T*, where we compute the sum of all the numbers in the window and compare it to (the now single number in) *P*. Move the window by subtracting numbers on the left and adding numbers to the right. This algorithm gives a linear-time solution for the block-mass pattern matching problem with *P* of length 1.
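The sliding window for a length-1 pattern can be sketched as follows (positions are 1-based; symbol masses are positive integers, so both ends of the window only ever move right):

```python
def blocks_of_mass(T, m):
    """All (1-based) start positions of substrings of T with mass m."""
    out, lo, total = [], 0, 0       # window is T[lo..hi] (0-based), total its mass
    for hi in range(len(T)):
        total += T[hi]              # add the entering number on the right
        while total > m:            # subtract numbers on the left
            total -= T[lo]
            lo += 1
        if total == m:
            out.append(lo + 1)
    return out

# 57 + 112 + 113 = 282, starting at position 3 of the text of example 1.3:
print(blocks_of_mass([114, 77, 57, 112, 113, 186, 57], 282))  # [3]
```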

## 3. Indexing

As previously mentioned, the histogram indexing problem has proved elusively hard [25–29]. No known solution exists that answers pattern queries in time linear in the pattern length and preprocessing whose time is *o*(*n*^{2}/polylog *n*), where *n* is the text length.

The situation is not much better for the BMI problem. However, research there evolved in a different direction. Ng *et al.* [14] considered the special case of indexing a string for block-mass matching of length-1 patterns. Their algorithm assumes a finite fixed alphabet. Its bottleneck is computing the intersection of two sets. Assume we have an algorithm Set Intersection whose running time is *O*(*t*_{i}). Ng *et al.* show:

### Theorem 3.1

*Let T be a text over fixed finite alphabet Σ⊂ℕ. Then it is possible to preprocess T such that subsequent block-mass matching queries for patterns P=P[1] can be answered in time O(P[1]⋅t*_{i}*+tocc), where tocc is the number of blocks in T with mass P, and O(P[1]) is the length of a block with mass P.*

In the attempt to circumvent the problem posed by set intersection, the following two solutions were provided:

### Theorem 3.2

*Let T be a text over fixed finite alphabet Σ⊂ℕ. Then it is possible to preprocess T in time O(n*^{1.5}*) such that subsequent block-mass matching queries for patterns P=P[1] can be answered in time O(P[1]*^{2}*+tocc), where tocc is the number of blocks in T with mass P, and O(P[1]) is the length of a block with mass P.*

### Theorem 3.3

*Let T be a text over fixed finite alphabet Σ⊂ℕ. Then it is possible to preprocess T such that subsequent block-mass matching queries for patterns P=P[1] can be answered in expected time O(P[1]+tocc), where tocc is the number of blocks in T with mass P.*

## 4. The reductions

In this section, we begin the study of a class of indexing problems for which, we conjecture, no linear-time query can be achieved with *o*(|*T*|^{2}/polylog|*T*|) pre-processing time. We start with the BMI problem and the histogram indexing problem. We prove the following two lemmas.

### Lemma 4.1

The histogram indexing problem over alphabet *Σ*={1,…,*k*} is polynomially reducible to the BMI problem with length-1 pattern, over an alphabet of size *k*. In fact, the reduction is in time *O*(*n* log *n*), where *n* is the input size.

### Proof.

Given text *T* of length *n* over *Σ*={1,…,*k*}, construct text *T*^{(j)} of length 2*n*+1, where for *i*=1,…,*n*: *T*^{(j)}[2*i*−1]=*j*^{k}, *T*^{(j)}[2*i*]=*j*^{T[i]−1}, and *T*^{(j)}[2*n*+1]=*j*^{k}.

### Example 4.2

Let *T*=1,1,2,1,3,2,1,2,3 and take *j*=4. Then *T*^{(4)}=64,1,64,1,64,4,64,1,64,16,64,4,64,1,64,4,64,16,64.

Clearly, the length of *T*^{(j)} is 2*n*+1 words. The size of each word is *k*× the word size of *T*, but *k* is a constant, thus the construction is linear.

The intuition behind this reduction is as follows. Assume that *j*>*m*, where *m* is the pattern length. The *j*^{k} padding guarantees that any pattern of length *m*, regardless of the histogram, will have total block mass less than all patterns of length *m*+1 and more than all patterns of length *m*−1. In addition, because of the encoding of symbol *i* to mass *j*^{i−1} it is impossible for any combination of symbols smaller than *i* to achieve a mass larger than or even equal to that of *i*. This forces a uniqueness in the mapping of histograms to block masses that enables the reduction. We proceed with the details.

Construct *T*^{(2^ℓ)}, for ℓ=1,…,⌈log *n*⌉.

If we can preprocess each of the *T*^{(j)}’s in time *o*(*n*^{2}) and subsequently solve block-mass queries of length 1 in time *O*(*f*(*n*)), then we can solve histogram queries in time *O*(*m*^{k+1}+*f*(*n*)). When presented with a pattern of length *m* with *k*_{i} occurrences of symbol *i*, *i*=1,…,*k*, simply seek, in *T*^{(j)} for the smallest *j* such that *j*≥*m*, a block with mass *k*_{1}*j*^{0}+*k*_{2}*j*^{1}+⋯+*k*_{k}*j*^{k−1}+(*m*+1)*j*^{k}. The mass is bounded by 2*m*^{k+1}, and such a block exists iff there is a substring with the input histogram.

### Example 4.3

In the above example, consider a histogram of a pattern of length 3 that has one each of the elements 1, 2 and 3. *j*=4>3=*m*, so *T*^{(4)} is the string we use for the block-mass search. We seek locations whose mass is 1+4+16+4×64=277. This occurs at positions 5, 7, 9 and 13. Note that if only three pads are used, even if the symbols in between are the largest (16), the total sum is 16×2+3×64=224<277. If five pads are used, even if the symbols between them are the smallest (1), the total sum is 4×1+5×64=324>277. Note also that no combination of two elements less than 16 adds up to 16 (since 4+4=8<16). This is true for every element: none can be replaced by a combination of smaller symbols.
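The construction and the query-mass computation can be sketched as follows (the pad placement is our reconstruction, consistent with example 4.2):

```python
def encode_text(T, j, k):
    # T^(j): a pad j^k before every symbol and at the end; symbol i becomes j^(i-1)
    out = []
    for sym in T:
        out += [j ** k, j ** (sym - 1)]
    return out + [j ** k]

def query_mass(hist, j, k):
    # the block mass that a length-m pattern with histogram {symbol: count} maps to
    m = sum(hist.values())
    return sum(c * j ** (i - 1) for i, c in hist.items()) + (m + 1) * j ** k

T = [1, 1, 2, 1, 3, 2, 1, 2, 3]              # alphabet {1, 2, 3}, so k = 3
print(encode_text(T, 4, 3))                  # the T^(4) of example 4.2
print(query_mass({1: 1, 2: 1, 3: 1}, 4, 3))  # 277, as in example 4.3
```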

Because of this complexity relation, algorithms derived for BMI can be carried over to histogram indexing.

### Corollary 4.4

Histogram indexing over a fixed finite alphabet can be solved in *O*(*m*^{2k+2}+tocc) time for seeking permutations of patterns of length *m*, following an *O*(*n*^{1.5} polylog *n*) text preprocessing, where the text length is *n*, and tocc is the number of length-*m* substrings in the text that are permutations of the pattern.

### Proof.

This is a direct result of lemma 4.1 and corollary 5.2.

Note that the binary alphabet is a special case. We can construct *T*^{(j)} as follows: for *i*=1,…,*n*, set *T*^{(j)}[2*i*−1]=*j* and *T*^{(j)}[2*i*]=*T*[*i*], with *T*^{(j)}[2*n*+1]=*j*; with only two symbols there is no need for exponential masses, so the block mass of a length-*m* pattern is *O*(*m*^{2}). Thus, the query time for a pattern of length *m* is *O*(*m*^{4}), with an *O*(*n*^{1.5} polylog *n*) text preprocessing.

In §5*b*, we will show a direct trade-off algorithm for histogram indexing, which applies also to unbounded alphabets.

To complete the complexity relation between the two problems, we show that the converse is also true.

### Lemma 4.5

The BMI problem with length-1 pattern over fixed finite alphabet {*a*_{1},…,*a*_{k}}, where *a*_{i}∈ℕ, *i*=1,…,*k*, is linearly reducible to the histogram indexing problem over a ternary alphabet, resulting in a query time with a multiplicative factor of the pattern length.

### Proof.

Given text *T* over fixed finite alphabet {*a*_{1},…,*a*_{k}}, where *a*_{i}∈ℕ, construct a text *T*′ as follows.

First, construct *T*_{u}, where *T*_{u}[*i*]=1^{T[i]}, i.e. every element of *T* is written in unary. Because the alphabet is fixed and finite, |*T*_{u}|≤*a*_{k}⋅|*T*|=*O*(|*T*|), where *a*_{k} is the largest number in {*a*_{1},…,*a*_{k}}.

Now construct *T*′ by replacing every *T*_{u}[*i*]=1^{T[i]} by the substring *c* *a*^{T[i]} *c* *b*^{T[i]}, where *σ*^{x} means *σ* concatenated to itself *x* times. Clearly, |*T*′|=*O*(|*T*_{u}|)=*O*(|*T*|).

### Example 4.6

Assume *T*=4,4,1,2,3. Then *T*_{u}=1111,1111,1,11,111 and *T*′=*c* *aaaa* *c* *bbbb* *c* *aaaa* *c* *bbbb* *c* *a* *c* *b* *c* *aa* *c* *bb* *c* *aaa* *c* *bbb*.

Suppose we want to answer a BMI query for pattern *P*[1]=*m*. Consider the patterns *P*_{j}=*a*^{m}*b*^{m}*c*^{2j}, where 1≤*j*≤*m*. We claim that any location in *T*′ that histogram matches *P*_{j} corresponds to a block of mass *m* in *T*. For the justification, consider the following cases:

- It cannot be the case that a solution to the histogram query starts with an *a* and ends with a *b*, or starts with a *b* and ends with an *a*, because then the number of *c*’s must be odd, and that contradicts the query.
- It cannot be the case that a solution to the histogram query starts and ends with a *c*, because then the number of *c*’s must be odd, and that contradicts the query.
- If the solution to the histogram query starts with an *a* and ends with an *a*, then the number of *b*’s is exactly equal to *m* and this correctly answers our BMI query in the affirmative. A similar argument holds for a solution to the histogram query that starts and ends with a *b*.
- If the solution to the histogram query starts with a *ca* and ends in an *a*, then the number of *b*’s is exactly equal to *m* and this correctly answers our BMI query in the affirmative. A similar argument holds for a solution to the histogram query that starts with a *cb* and ends with a *b*.
- Finally, if the solution to the histogram query starts with a *ca* and ends in a *b*, then the number of *a*’s is exactly *m* and therefore this is an affirmative answer to the BMI query. A similar argument holds for a solution to the histogram query that starts with a *cb* and ends with an *a*.

Note that while the solutions to the histogram queries exactly denote solutions to the BMI queries, it is still the case that up to *m* consecutive solutions to the histogram query may indicate the same BMI query solution. It is easy to extract the correct solution from these responses, but it introduces a multiplicative factor of *m* to our query response time.

Therefore, if we can preprocess *T*′ in time *o*(*n*^{2}) and subsequently solve histogram indexing queries in time *O*(*f*(*m*)), then we can solve the BMI query in time *O*(*m*^{2}*f*(*m*)) by simply querying all *m* patterns: *P*_{1},…,*P*_{m}.
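The construction of *T*′ and the *P*_{j} queries above can be sketched as follows; the per-element encoding *c a*^{T[i]} *c b*^{T[i]} is our reconstruction, consistent with example 4.6, and a naive window scan stands in for the histogram index:

```python
from collections import Counter

def to_ternary(T):
    # T': every element T[i] becomes c a^{T[i]} c b^{T[i]} (reconstruction
    # consistent with example 4.6)
    return "".join("c" + "a" * x + "c" + "b" * x for x in T)

def has_block_of_mass(T, m):
    # a block of mass m exists iff some window of T' histogram matches some
    # P_j = a^m b^m c^{2j}: a block of j elements contributes exactly 2j c's
    Tp = to_ternary(T)
    for j in range(1, m + 1):
        w = 2 * m + 2 * j                 # window length matching P_j
        Pj = Counter({"a": m, "b": m, "c": 2 * j})
        if any(Counter(Tp[s:s + w]) == Pj for s in range(len(Tp) - w + 1)):
            return True
    return False

T = [4, 4, 1, 2, 3]
print(to_ternary(T))        # caaaacbbbbcaaaacbbbbcacbcaacbbcaaacbbb
print(has_block_of_mass(T, 8), has_block_of_mass(T, 100))  # True False
```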

Putting the above two lemmas together we get:

### Theorem 4.7

*The histogram indexing problem over fixed finite alphabet is equivalent to the BMI problem with length-1 pattern over a fixed finite alphabet, up to a multiplicative factor m*^{k} *slowdown in query processing time.*

## 5. The trade-off algorithms

We have seen that to date there is no known algorithm for either the histogram indexing or the BMI problem that answers pattern queries in time linear in the pattern length following preprocessing whose time is *o*(*n*^{2}/polylog *n*), where *n* is the text length. Ng *et al.* [14] suggested reducing the preprocessing time on the BMI problem by increasing the query time. The query time is still dependent only on the pattern size and not the text length, but it is no longer linear in the pattern size. A similar relaxation was exploited by Kociumaka *et al.* [30] for the histogram indexing problem. In this section, we show trade-off algorithms for these two indexing problems that work for unbounded alphabets. The similarity between the two trade-off algorithms further strengthens our belief that these problems have similar complexities.

### (a) Alphabet independent block-mass indexing

In this section, we provide an algorithm for BMI that works for infinite alphabets. The algorithm has a trade-off between the preprocessing time and the query time: for a given *ϵ*, if the preprocessing time is *O*(*n*^{1+ϵ} polylog *n*), then the query time, for pattern *P*[1], is *O*(*P*[1]^{1/ϵ}). Let *T* be a text of length *n* over alphabet ℕ.

The preprocessing algorithm indexes, by mass, all blocks of *T* whose mass is at most *n*^{ϵ}; since symbol masses are positive integers, each text position starts at most *n*^{ϵ} such blocks.
**Preprocessing Time:** *O*(*n*^{1+ϵ} polylog *n*).

Upon the arrival of query pattern *P*[1]: if *P*[1]≤*n*^{ϵ}, answer directly from the index; otherwise, run the linear-time sliding-window scan of §2.
**Query Processing time:** For *P*[1]≤*n*^{ϵ} the time is *O*(log *n*+tocc). For *P*[1]>*n*^{ϵ}, the time is *O*(*n*); however, *n*=(*n*^{ϵ})^{1/ϵ}<*P*[1]^{1/ϵ}.
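A plausible minimal reconstruction of this trade-off in Python, assuming the natural strategy of indexing every block of mass at most *n*^{ϵ} and falling back to the linear sliding-window scan for heavier queries:

```python
from collections import defaultdict
from math import ceil

def bmi_preprocess(T, eps):
    """Index every block (substring) of T whose mass is at most n^eps.
    With positive integer masses, each start position contributes at most
    n^eps such blocks, so the table holds O(n^(1+eps)) entries."""
    bound = ceil(len(T) ** eps)
    table = defaultdict(list)          # mass -> increasing list of start positions
    for i in range(len(T)):
        total = 0
        for j in range(i, len(T)):
            total += T[j]
            if total > bound:
                break
            table[total].append(i + 1)
    return bound, table

def bmi_query(T, index, m):
    bound, table = index
    if m <= bound:                     # light pattern: one table lookup
        return table.get(m, [])
    out, lo, total = [], 0, 0          # heavy pattern: linear sliding-window
    for hi in range(len(T)):           # scan, O(n) <= O(m^(1/eps))
        total += T[hi]
        while total > m:
            total -= T[lo]
            lo += 1
        if total == m:
            out.append(lo + 1)
    return out

T = [114, 77, 57, 112, 113, 186, 57]
idx = bmi_preprocess(T, 0.5)
print(bmi_query(T, idx, 282))  # [3]
```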

Conclude:

### Theorem 5.1

*Let T be a text over ℕ, ϵ∈(0,1]. Then it is possible to preprocess T in time O(n*^{1+ϵ} polylog *n) such that subsequent block-mass matching queries for patterns P=P[1] can be answered in time O(P[1]*^{1/ϵ}*+tocc), where tocc is the number of blocks in T with mass P, and O(P[1]) is the length of a block with mass P.*

### Corollary 5.2

Let *T* be a text over ℕ. Then it is possible to preprocess *T* in time *O*(*n*^{1.5} polylog *n*) such that subsequent block-mass matching queries for patterns *P*=*P*[1] can be answered in time *O*(*P*[1]^{2}+tocc), where tocc is the number of blocks in *T* with mass *P*, and *O*(*P*[1]) is the length of a block with mass *P*.

### (b) Alphabet independent histogram indexing

We present a histogram indexing algorithm with a trade-off between the pre-processing and query processing time. Assume we have a text *T* of length *n*.

Consider patterns of length ℓ. Using a sliding window of length ℓ, we can scan the text and update the histogram for every index *i* by deleting element *T*[*i*−1] and adding element *T*[*i*+ℓ−1]. This update takes constant time for a finite fixed alphabet *Σ*, and *O*(log|*Σ*|) time for unbounded alphabets, since we can keep the histogram in a balanced search tree, for example. Thus, the time to prepare all histograms of length ℓ is *O*(*n* log|*Σ*|), where *Σ* is the text alphabet. The necessary space is *O*(*n*). We can index the text indices by histograms in various ways. One possible way is by sorting the symbols in every histogram and keeping the sorted lists in a trie format. For every trie leaf, we have a list of vectors, representing the values of the histograms with these symbols. This list of vectors can also be sorted and kept in a balanced sorted tree, where every leaf is a list of indices having the histogram represented by this leaf. Thus, processing a query of length ℓ takes time *O*(ℓ log *n*). We call the data structure described above the *histogram index data structure for length ℓ* and denote it by *HT*_{ℓ}.

We are now ready for our trade-off algorithm. Given *ϵ*, construct *HT*_{ℓ} for every ℓ=1,…,⌈*n*^{ϵ}⌉.
**Preprocessing Time:** *O*(*n*^{1+ϵ} polylog *n*).

Upon the arrival of query pattern *P*: if |*P*|≤*n*^{ϵ}, look up the histogram of *P* in *HT*_{|P|}; otherwise, run the sliding-window scan of §2*a*.
*Query Processing time*: For |*P*|≤*n*^{ϵ} the time is *O*(|*P*| log *n*). For |*P*|>*n*^{ϵ}, the time is *O*(*n*); however, *n*=(*n*^{ϵ})^{1/ϵ}<|*P*|^{1/ϵ}.
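A minimal sketch of this trade-off, assuming the *HT*_{ℓ} tables are realized as hash maps keyed by sorted (symbol, frequency) pairs rather than tries:

```python
from collections import Counter, defaultdict
from math import ceil

def key(hist):
    # canonical key for a histogram: its (symbol, frequency) pairs, sorted
    return tuple(sorted(hist.items()))

def hist_preprocess(T, eps):
    """Build HT_l, a map from histogram keys to window start positions,
    for every window length l up to n^eps."""
    bound = ceil(len(T) ** eps)
    HT = {}
    for l in range(1, min(bound, len(T)) + 1):
        table = defaultdict(list)
        win = Counter(T[:l])
        table[key(win)].append(1)
        for i in range(1, len(T) - l + 1):
            win[T[i - 1]] -= 1             # symbol leaving on the left
            if win[T[i - 1]] == 0:
                del win[T[i - 1]]
            win[T[i + l - 1]] += 1         # symbol entering on the right
            table[key(win)].append(i + 1)
        HT[l] = table
    return bound, HT

def hist_query(T, index, P):
    bound, HT = index
    m = len(P)
    if m in HT:                            # light pattern: one lookup in HT_m
        return HT[m].get(key(Counter(P)), [])
    target, win = Counter(P), Counter(T[:m])   # heavy pattern: O(n) scan,
    out = [1] if win == target else []         # and n < |P|^(1/eps)
    for i in range(1, len(T) - m + 1):
        win[T[i - 1]] -= 1
        if win[T[i - 1]] == 0:
            del win[T[i - 1]]
        win[T[i + m - 1]] += 1
        if win == target:
            out.append(i + 1)
    return out

idx = hist_preprocess("aabcab", 0.5)
print(hist_query("aabcab", idx, "aba"))  # [1]
```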

## 6. Conclusion and open problem

We have proved the equivalence of two important and challenging indexing problems in the rearrangement model—histogram indexing and BMI.

The key open problem, then, is whether these problems can indeed be solved with *o*(|*T*|^{2}/polylog|*T*|) text indexing time and *O*(|*P*|) query time. We conjecture that this is not the case. We believe that they are both members of a class of indexing problems of equivalent complexity.

A less sweeping, but interesting, problem is showing that these two problems are equivalent without query-time degradation.

## Footnotes

One contribution of 11 to a Theo Murphy Meeting Issue ‘Storage and indexing of massive data’.

↵1 Spectra are complex objects that, for the sake of brevity, are not formally described in this paper.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.