## Abstract

Problems related to string inclusion and non-inclusion have been vigorously studied in diverse fields such as data compression, molecular biology and computer security. Given a finite set of positive strings and a finite set of negative strings , a string *α* is a consistent superstring if every positive string is a substring of *α* and no negative string is a substring of *α*. The shortest (resp. longest) consistent superstring problem is to find a string *α* that is the shortest (resp. longest) among all the consistent superstrings for the given sets of strings. In this paper, we first propose a new graph model for consistent superstrings for given and . In our graph model, the set of strings represented by paths satisfying some conditions is the same as the set of consistent superstrings for and . We also present algorithms for the shortest and the longest consistent superstring problems. Our algorithms solve the consistent superstring problems for all cases, including cases that are not considered in previous work. Moreover, our algorithms solve in polynomial time the consistent superstring problems for more cases than the previous algorithms. For the polynomially solvable cases, our algorithms are more efficient than the previous ones.

## 1. Introduction

Problems related to string inclusion and non-inclusion have been vigorously studied in diverse fields such as data compression [1,2], molecular biology [3,4] and computer security [5].

Some examples of well-known inclusion or non-inclusion related strings are as follows. Consider a set of strings and a string *α* over a constant-size alphabet.

—

*Longest common substrings.*If*α*is a substring of*w*_{i}for all 1≤*i*≤*m*, then*α*is called a common substring of . Among such*α*, the longest one is called the longest common substring of . Finding the longest common substring of is solvable in polynomial time by dynamic programming or using generalized suffix trees [6].—

*Shortest common superstrings.*If*α*is a superstring of*w*_{i}for all 1≤*i*≤*m*, then*α*is called a common superstring of . Among such*α*, the shortest one is called the shortest common superstring of . The problem of finding the shortest common superstring is NP-hard [7,8].—

*Shortest common non-substrings.*If*α*is not a substring of any*w*_{i}for all 1≤*i*≤*m*, then*α*is called a common non-substring of . Among such*α*, the shortest one is called the shortest common non-substring of . The problem of finding the shortest common non-substring can be solved in polynomial time [9].—

*Longest common non-superstrings.*If*α*is not a superstring of any*w*_{i}for all 1≤*i*≤*m*, then*α*is called a common non-superstring of . Among such*α*, the longest one is called the longest common non-superstring (LCNSS) of . The problem of finding the LCNSS is solvable in polynomial time [9–11].

There are some notions that consider string inclusion and non-inclusion at the same time. Given a finite set of positive strings and a finite set of negative strings , a string *α* is a *consistent superstring* (CSS) for and if *α* is both a common superstring of and a common non-superstring of .

—

*Shortest consistent superstrings.*Among the CSSs for and , the shortest one is called the shortest CSS for and . Jiang & Li [12] introduced the notion of the CSS and they provided approximation algorithms for finding the shortest CSS when is bounded by a constant or when is inclusion-free, i.e. no string in the union includes another as a substring. Jiang & Timkovsky [10] proposed an algorithm for solving this problem using the directed graph model proposed in [9]. There are two non-trivial conditions assumed in reference [10] for finding the shortest CSS for and . The first condition is that every symbol of the alphabet appears at the end of some negative string (*final closure*). The second condition is that is inclusion-free. Jiang & Timkovsky’s algorithm runs in polynomial time when the LCNSS for exists or when is bounded by a constant.—

*Longest consistent superstrings.*Among the CSS for and , the longest one is called the longest CSS for and . Under the same conditions as above, Jiang & Timkovsky [10] proposed a polynomial-time algorithm for finding the longest CSS when the LCNSS for exists.

In this paper, we first propose a new graph model for CSSs for given and . Then, we present algorithms for finding the shortest CSS and the longest CSS using our graph model. Our contributions are as follows:

— We propose a new graph model to solve the CSS problems, which is based on the Aho–Corasick machine. In our graph model, the set of strings represented by paths satisfying some conditions are the same as the set of CSSs for and . While Jiang & Timkovsky [10] modelled CSSs for and using only the strings in , we model CSSs using all the strings in and . This makes our graph model more intuitive than that in [10], and makes our algorithms simpler than those in [10]. Furthermore, we do not assume any of the non-trivial conditions in [10], i.e. our graph model is more general than that in [10].

— We also present algorithms for finding the shortest and the longest CSSs using our graph model. Our algorithms solve the CSS problems for all cases, including cases that are not considered in reference [10], i.e. the cases when is not inclusion-free or when the final closure condition does not hold. Moreover, our algorithms solve in polynomial time the CSS problems for more cases than those in [10]. The shortest CSS problem can be solved in polynomial time when the LCNSS for exists and is inclusion-free, or when is bounded by a constant. The longest CSS problem can be solved in polynomial time when is inclusion-free or when is bounded by a constant. In particular, when the LCNSS for exists and is inclusion-free, our algorithms for finding the shortest and the longest consistent superstrings run in time linear in the sum of the lengths of the strings in and , which are more efficient than the algorithms in [10].

The remainder of the paper is organized as follows. In §2, we give some notation and definitions. In §3, we propose a new graph model for CSSs. In §4, we present algorithms for finding the shortest CSS and the longest CSS for the given sets of strings. Then, we give concluding remarks in §5.

## 2. Preliminaries

### (a) Problem definition

Let *α* be a string over a constant-size alphabet *Σ*. We denote the length of *α* by |*α*| and the *i*th (1≤*i*≤|*α*|) character of *α* by *α*[*i*]. When a string *β* is *α*[*i*]*α*[*i*+1]⋯*α*[*j*], *β* is denoted by *α*[*i*..*j*] and called a *substring* of *α*. Conversely, *α* is called a *superstring* of *β*. Moreover, *α*[1..*j*] (1≤*j*≤|*α*|) is called a *prefix* of *α*, and when *j*≠|*α*|, *α*[1..*j*] is called a *proper prefix* of *α*. Similarly, we can define a *suffix* and a *proper suffix* of *α*. For convenience, we assume that the empty string *λ* is both a prefix and a suffix of *α*.

Consider two finite sets of strings over a constant-size alphabet *Σ*: a set of *positive* strings and a set of *negative* strings . We assume that neither nor contains the empty string *λ*. Let *P* and *N* denote the sum of the lengths of all strings in and , respectively. We denote the number of elements in a set by . Then, and .

We define a common superstring, a common non-superstring, and a consistent superstring. A *common superstring* for is a string that includes every positive string *x*_{i} (1≤*i*≤*p*) in as a substring. Similarly, a *common non-superstring* for is a string that includes no negative string *y*_{j} (1≤*j*≤*n*) in as a substring. A *consistent superstring* (CSS) for and is a string that is both a common superstring for and a common non-superstring for . Among CSSs, the shortest one and the longest one are called the *shortest consistent superstring* (SCSS) and the *longest consistent superstring* (LCSS), respectively. If we can make an arbitrarily long CSS, then the number of CSSs is infinite and thus the LCSS does not exist.

Now, we define the shortest consistent superstring problem and the longest consistent superstring problem:

### Problem 2.1

*The shortest consistent superstring problem (the SCSS problem).*

**Input:** *a finite set of positive strings* *and a finite set of negative strings* *over a constant-size alphabet Σ*.

**Output:** *if no CSS exists, the output is ‘no SCSS exists’. Otherwise, the output is an SCSS for* *and* .

### Problem 2.2

*The longest consistent superstring problem (the LCSS problem)*.

**Input:** *a finite set of positive strings* *and a finite set of negative strings* *over a constant-size alphabet Σ*.

**Output:** *if no CSS exists or an arbitrarily long CSS can be made, the output is ‘no LCSS exists.’ Otherwise, the output is an LCSS for* *and* .

To avoid some trivial cases, we assume that the following three *inclusion-free conditions* hold for strings in and :

(1) For all

*x*_{i}and*x*_{j}(*i*≠*j*),*x*_{i}is not a substring of*x*_{j}. If*x*_{i}is a substring of*x*_{j}, then any superstring of*x*_{j}is a superstring of*x*_{i}. Thus, we can remove*x*_{i}from .(2) For all

*y*_{i}and*y*_{j}(*i*≠*j*),*y*_{i}is not a substring of*y*_{j}. If*y*_{i}is a substring of*y*_{j}, then any non-superstring of*y*_{i}is a non-superstring of*y*_{j}, Thus, we can remove*y*_{j}from .(3) For all

*x*_{i}and*y*_{j},*y*_{j}is not a substring of*x*_{i}. If*y*_{j}is a substring of*x*_{i}, then any superstring of*x*_{i}cannot be a non-superstring of*y*_{j}. Thus, no CSS for and exists in this case.

The following is another inclusion-free condition that is non-trivial.

(4) For all

*x*_{i}and*y*_{j},*x*_{i}is not a substring of*y*_{j}.

We say that is *inclusion-free* when and satisfy the condition (4) as well as the conditions (1)–(3).

### (b) Aho–Corasick machine

Our graph model for solving the CSS problems is related to the Aho–Corasick machine (AC machine), which is a useful tool for multiple pattern matching problems [13]. The AC machine consists of vertices (states) and three functions (transitions): the *goto* function, the *failure* function and the *output* function. A vertex represents a prefix of given pattern strings. For a vertex *u*, let *str*(*u*) denote the string represented by *u*. For a vertex *u* and a character *σ*, the *goto* function is defined as a vertex *v* (if it exists) such that *str*(*v*) is *str*(*u*)*σ*. For a vertex *u*, let *α* be the longest proper suffix of *str*(*u*) that is represented by a vertex in the AC machine. Then, the *failure* function for the vertex *u* is defined as a vertex *w* representing *α*. The *output* function for a vertex *u* is defined as a set of pattern strings that are suffixes of *str*(*u*). Figure 1*a* shows an example of the AC machine for {*aba*,*bb*,*aa*,*abba*,*bbb*}. Note that the values of the output functions for *v*_{3},*v*_{4},*v*_{6},*v*_{7},*v*_{8} and *v*_{9} are {*aba*},{*bb*},{*bb*},{*aa*},{*abba*} and {*bb*,*bbb*}, respectively. To avoid all failure functions, the AC machine can be represented as a deterministic finite automaton (DFA). Figure 1*b* shows the DFA version of figure 1*a*.

## 3. Graph model

Here, we first define our new graph model, denoted by *G*_{CSS}, for consistent superstrings, and then give some properties of *G*_{CSS}. Let *G*_{AC} be the graph representing the DFA version of the AC machine for . Informally, *G*_{CSS} is a graph made from *G*_{AC} by removing every vertex *v* representing a negative string, and incoming and outgoing edges of *v* as well. For example, given and , *G*_{AC} is shown in figure 1*b* and *G*_{CSS} is shown in figure 2.

Now, we formally define *G*_{CSS}. Let be the set of all prefixes of all positive strings and all *proper* prefixes of all negative strings. For example, given and , . We first define a function as follows: for and *σ*∈*Σ*, *δ*(*α*,*σ*) is the longest suffix of *ασ* in . Then, *G*_{CSS}=(*V*,*E*) is a directed graph defined as follows:

— Vertex set

*V*. We define vertices so that there is a one-to-one correspondence between the vertices in*V*and the strings in . We denote a corresponding vertex of a string*α*by*ver*(*α*) and conversely, we denote a corresponding string of a vertex*v*by*str*(*v*). We denote*ver*(*λ*) by*v*_{λ}as a special vertex.— Edge set

*E*. For strings and*σ*∈*Σ*, a directed edge from*ver*(*α*) to*ver*(*β*) labelled with*σ*is defined if and only if*β*=*δ*(*α*,*σ*). An edge labelled with*σ*is called a*σ*-edge. Note that there exists at most one outgoing*σ*-edge from*ver*(*α*), and if and only if , the outgoing*σ*-edge from*ver*(*α*) does not exist.

The number of vertices is at most *P*+*N*, because the number of prefixes of a string is the same as the length of the string. The number of edges is *O*(*P*+*N*), because every vertex has at most |*Σ*| outgoing edges that is a constant.

A *path* in *G*_{CSS} is denoted by a sequence *A*=(*u*_{0},*u*_{1},…,*u*_{k}) of vertices such that edge (*u*_{i−1},*u*_{i}), for every *i*=1,2,…,*k*, is defined. The *length* of *A* is the number of edges in *A*, denoted by |*A*|. A path is called a *λ*-*path* if it starts at the vertex *v*_{λ}. A path *A*=(*u*_{0},…,*u*_{k}) represents the unique string, denoted by *pstr*(*A*), whose *i*th character is the label of edge (*u*_{i−1},*u*_{i}) for 1≤*i*≤*k*. Meanwhile, a same string may be represented by more than one path. However, a *λ*-path representing a string is unique (if it exists), because for each character *σ*∈*Σ*, there is at most one outgoing *σ*-edge from each vertex. We denote the *λ*-path representing a string *α* by *λ*-*path*(*α*). For a string *α*, a suffix of *α* is called a *τ*-suffix if the suffix is in .

### Lemma 3.1

*For a λ-path A*=(*v*_{λ},…,*v*), *str*(*v*) *is the longest τ-suffix of pstr*(*A*).

### Proof.

We prove it by induction on the lengths of *λ*-paths. Initially, let us consider the case when |*A*|=0, i.e. *A*=(*v*_{λ}). Then, it is obvious that *str*(*v*_{λ}) (=*λ*) is the longest *τ*-suffix of *pstr*(*A*) (=*λ*).

Next, suppose that the lemma holds for |*A*|=*k*≥0. We show that the lemma holds for |*A*|=*k*+1 as well. Assume that *A* is the concatenated path of a path *A*′=(*v*_{λ},…,*u*) of length *k* and a *σ*-edge (*u*,*v*). Then, *str*(*u*) is the longest *τ*-suffix of *pstr*(*A*′) by the inductive hypothesis. Suppose *str*(*v*) is not the longest *τ*-suffix of *pstr*(*A*), i.e. there is a *τ*-suffix (say *ασ*) of *pstr*(*A*) longer than *str*(*v*). Then, we have two cases according to the length of *α*. First, consider the case when |*str*(*u*)|≥|*α*|. Because *str*(*u*)*σ*, *ασ* and *str*(*v*) are suffixes of *pstr*(*A*) and |*str*(*u*)*σ*|≥|*ασ*|>|*str*(*v*)|, then *ασ* is a *τ*-suffix of *str*(*u*)*σ* longer than *str*(*v*). This contradicts the definition of edge (*u*,*v*) by which *str*(*v*) is the longest *τ*-suffix of *str*(*u*)*σ*. Next consider the case when |*α*|>|*str*(*u*)|. Then, *α* is a *τ*-suffix of *str*(*A*′) longer than *str*(*u*), which contradicts the inductive hypothesis that *str*(*u*) is the longest *τ*-suffix of *pstr*(*A*′). Therefore, *str*(*v*) is the longest *τ*-suffix of *pstr*(*A*). ▪

For example, consider a *λ*-path *A*_{1}=(*v*_{λ},*v*_{1},*v*_{2},*v*_{3},*v*_{2}) in figure 2. Then, *pstr*(*A*_{1})=*abab* and its longest *τ*-suffix is *ab* (=*str*(*v*_{2})) because *abab* and *bab* are not in . For another *λ*-path *A*_{2}=(*v*_{λ},*v*_{5},*v*_{6},*v*_{1}), *pstr*(*A*_{2})=*bba* and its longest *τ*-suffix is *a* (=*str*(*v*_{1})).

Now, we give some relations between *G*_{CSS} and common non-superstrings for .

### Lemma 3.2

*A string α is a common non-superstring for* *if and only if λ*-*path*(*α*) *exists in G*_{CSS}.

### Proof.

(If) We prove it by contradiction. Suppose that there exists a string which is represented by a *λ*-path in *G*_{CSS} and includes a negative string *y*. Let *α* be the shortest one among such strings. Then, *y* is a suffix of *α*. Let (*u*,*v*) be the last edge of *λ*-*path*(*α*) and *σ* be the label of edge (*u*,*v*). Now, we show * str*(

*v*)=

*y*, which contradicts the definition of

*G*

_{CSS}that

*(*

*str**v*) is a string in including no negative string. Let

*y*′ and

*α*′ be the longest proper prefix of

*y*and

*α*, respectively, i.e.

*y*=

*y*′

*σ*and

*α*=

*α*′

*σ*. Because

*str*(

*u*) is the longest

*τ*-suffix of

*α*′ by lemma 3.1 and

*y*′ is in , |

*str*(

*u*)|≥|

*y*′| and thus

*y*(=

*y*′

*σ*) is a suffix of

*str*(

*u*)

*σ*. Furthermore,

*y*is the longest suffix of

*str*(

*u*)

*σ*in by the inclusion-free conditions (2) and (3) that the negative string

*y*is not included in any string in except for itself. Thus,

*δ*(

*str*(

*u*),

*σ*)=

*y*and

*str*(

*v*)=

*y*.

(Only if) We prove it by contradiction. Suppose that there exists a common non-superstring for which is represented by no *λ*-path in *G*_{CSS}. Let *α* be the shortest one among such strings. (Note that *α* cannot be *λ*, because *λ* is represented in *G*_{CSS}.) Let *α*′ be the longest proper prefix of *α* and *σ* be the last character of *α*, i.e. *α*=*α*′*σ*. Then *α*′ is a common non-superstring for because so is *α*. Thus, *λ*-*path*(*α*′) exists, because *α*′ is shorter than *α*. Let *u* be the last vertex of *λ*-*path*(*α*′). Then, *str*(*u*) is a suffix of *α*′ by lemma 3.1. Because *str*(*u*)*σ* is a suffix of the common non-superstring *α*, *str*(*u*)*σ* is also a common non-superstring for . Therefore, no suffix of *str*(*u*)*σ* is a negative string and . That is, the *σ*-edge from *u* to a node *v* is defined and the concatenated path of *A* and (*u*,*v*) represents *α*. This contradicts the assumption that *λ*-*path*(*α*) does not exist in *G*_{CSS}. ▪

Because every vertex in *G*_{CSS} is reachable from *v*_{λ}, an arbitrary longest *λ*-path can be made if and only if *G*_{CSS} is cyclic. Thus, we can obtain the following corollary from lemma 3.2.

### Corollary 3.3

*The longest common non-superstring for* *exists if and only if G*_{CSS} *is acyclic*.

Next, we give some relations between *G*_{CSS} and common superstrings for . For a positive string *x*_{i} (1≤*i*≤*p*), we call a vertex *v* a *q-vertex* of *x*_{i} in *G*_{CSS} if *x*_{i} is a suffix of *str*(*v*), and denote the set of *q*-vertices of *x*_{i} by *Q*(*x*_{i}). (Intuitively, *Q*(*x*_{i}) is a set of vertices whose output function values are *x*_{i} in the AC machine *G*_{AC}.) For example, *Q*(*aba*)={*v*_{3}} and *Q*(*bb*)={*v*_{4},*v*_{6}} in figure 2. Note that a vertex belongs to at most one *Q*(*x*_{i}) owing to the inclusion-free condition (1). Furthermore, if and satisfy the inclusion-free conditions (1)–(4), |*Q*(*x*_{i})|=1 for every positive string *x*_{i}.

### Lemma 3.4

*For a positive string x*_{i}, *a λ-path A passes a vertex in Q*(*x*_{i}) *if and only if pstr*(*A*) *is a superstring of x*_{i}.

### Proof.

(If) Without loss of generality, assume that *x*_{i} is a suffix of *pstr*(*A*). Let *v* be the last vertex on path *A*. Then, *str*(*v*) is the longest *τ*-suffix of *pstr*(*A*) by lemma 3.1. Because , the length of the longest *τ*-suffix of *pstr*(*A*) is larger than or equal to |*x*_{i}|, i.e. |*str*(*v*)|≥|*x*_{i}|. Moreover, because both *str*(*v*) and *x*_{i} are suffixes of *pstr*(*A*), *x*_{i} is a suffix of *str*(*v*) and thus *v*∈*Q*(*x*_{i}). Therefore, *A* passes a vertex in *Q*(*x*_{i}).

(Only if) Let *v* be a vertex in *Q*(*x*_{i}) passed by *A*. Without loss of generality, assume that *v* is the last vertex of *A*. Then, *x*_{i} is a suffix of *str*(*v*) by definition of *Q*(*x*_{i}) and *str*(*v*) is a suffix of *pstr*(*A*) by lemma 3.1. Thus, *x*_{i} is a suffix of *pstr*(*A*). ▪

For example, consider *G*_{CSS} in figure 2. For a *λ*-path *A*_{1}=(*v*_{λ},*v*_{1},*v*_{2},*v*_{3},*v*_{2}) representing a superstring *abab* of *aba*, *A*_{1} passes a vertex *v*_{3} in *Q*(*aba*). Conversely, for *λ*-path *A*_{2}=(*v*_{λ},*v*_{5},*v*_{6},*v*_{1}) passing a vertex *v*_{6} in *Q*(*bb*), *pstr*(*A*_{2}) (=*bba*) is a superstring of *bb*.

Lemma 3.4 can be directly extended to the set of positive strings. We call a *λ*-path *A* a -*path* if *A* passes at least one vertex in *Q*(*x*_{i}) for *every* . Then, corollary 3.5 can be derived directly from lemma 3.4.

### Corollary 3.5

*A λ-path A is a* -*path if and only if pstr*(*A*) *is a common superstring for* .

For example, consider a *λ*-path *A*_{1}=(*v*_{λ},*v*_{1},*v*_{2},*v*_{3},*v*_{2},*v*_{4}) in figure 2, which represents a common superstring *ababb* for . Because *A*_{1} passes *v*_{3}∈*Q*(*aba*) and *v*_{4}∈*Q*(*bb*), *A*_{1} is a -path. Conversely, for a -path *A*_{2}=(*v*_{λ},*v*_{5},*v*_{6},*v*_{1},*v*_{2},*v*_{3}) passing *v*_{3}∈*Q*(*aba*) and *v*_{6}∈*Q*(*bb*), *pstr*(*A*_{2}) (=*bbaba*) is a common superstring for .

Owing to lemma 3.2 and corollary 3.5, we obtain the following theorem, which says that the set of consistent superstrings for and is the same as the set of strings represented by -paths in *G*_{CSS}.

### Theorem 3.6

*A string α is a consistent superstring for* *and* *if and only if a* *-path representing α exists in G*_{CSS}.

## 4. Algorithms for consistent superstring problems

Here, we present algorithms for the CSS problems. We can find CSSs for and by finding -paths in *G*_{CSS} by theorem 3.6. Because the length of a CSS *α* is the same as the length (the number of edges) of *λ*-*path*(*α*), we can find the shortest (resp. longest) CSS by finding the shortest (resp. longest) -path in *G*_{CSS}. Our algorithms for the consistent superstring problems consist of the following three phases:

1. Construct

*G*_{CSS}and compute*Q*(*x*_{i}).2. Find the shortest (or longest) -path.

3. Compute the SCSS (resp. LCSS) if the string represented by the shortest (resp. longest) -path is found in phase 2.

Phase 1 can be done in *O*(*P*+*N*) time by constructing the AC machine and computing its output function, and by removing unnecessary vertices and edges. Phase 3 can be done simply by concatenating the label of each edge on the -path found in phase 2 while traversing the -path. For phase 2, we give details in the following subsections.

### (a) Finding the shortest -path

We describe how to find the shortest -path in *G*_{CSS}. We use different approaches according to whether is inclusion-free or not. Recall that is inclusion-free when and satisfy the inclusion-free condition (4) as well as the inclusion-free conditions (1)–(3).

#### (i) Case when is inclusion-free

In this case, |*Q*(*x*_{i})|=1 for every positive string *x*_{i}. We first check whether *G*_{CSS} is acyclic or not, i.e. the LCNSS for exists or not (corollary 3.3), which can be done in *O*(*P*+*N*) time using depth-first search. Let us consider the case when *G*_{CSS} is acyclic. Let *v*_{1}, *v*_{2}, …, *v*_{p} be the *q*-vertices in topological order. Then, the shortest -path in *G*_{CSS} is the shortest path passing over *v*_{λ},*v*_{1},*v*_{2},…,*v*_{p} in order. This order *v*_{1},*v*_{2},…,*v*_{p} of the *q*-vertices must be unique if there is a -path in *G*_{CSS}, because *G*_{CSS} is acyclic. The shortest -path can be found by concatenating all the shortest paths *A*_{i} from *v*_{i−1} to *v*_{i} (1≤*i*≤*p*), where *v*_{0}=*v*_{λ}. Each *A*_{i} can be easily found by considering only the vertices (and their outgoing edges) from *v*_{i−1} to *v*_{i} in topological order, and thus all *A*_{i} can be found in *O*(*P*+*N*) time. Therefore, we can find the shortest -path in *O*(*P*+*N*) time when *G*_{CSS} is acyclic.

Next, let us consider the case when *G*_{CSS} is cyclic. We first construct a weighted directed graph *G*_{QS} from *G*_{CSS} as follows: Vertices of *G*_{QS} consist of *v*_{λ} and *q*-vertices of *G*_{CSS}. For vertices *u* and *v* in *G*_{QS}, if and only if there is a path from *u* to *v* in *G*_{CSS}, the edge (*u*,*v*) of *G*_{QS} is defined, and its weight is the length of the shortest path from *u* to *v* in *G*_{CSS}. (Note that the shortest path may pass over other *q*-vertices in *G*_{CSS}.) The length of a path in *G*_{QS} is defined as the sum of the weights of the edges on the path (figure 3). (In this example, is not inclusion-free; nevertheless, *G*_{QS} is well defined.) We can find the shortest paths from a vertex *u* to other vertices in *O*(*P*+*N*) time by performing breadth-first search (BFS) starting at *u* in *G*_{CSS}. Because the number of vertices in *G*_{QS} is *p*+1, *G*_{QS} can be constructed in *O*(*p*(*P*+*N*)) time.

Then, finding the shortest -path in *G*_{CSS} is the same as finding the shortest path *A*_{s} in *G*_{QS} such that *A*_{s} starts at *v*_{λ} and passes over each vertex at least once. The problem of finding *A*_{s} in *G*_{QS} can be easily modelled as the travelling salesman problem (TSP). It is well known that the TSP can be solved in *O*(*p*^{2}2^{p}) time using dynamic programming [14,15]. Therefore, we can find the shortest -path in *O*(*p*(*P*+*N*)+*p*^{2}2^{p}) time when *G*_{CSS} is cyclic.

In the case when *G*_{QS} is acyclic, we can solve this problem more efficiently by a topological sort. Note that *G*_{QS} may be acyclic even though *G*_{CSS} is not acyclic. If *G*_{QS} is acyclic, *A*_{s} must pass over all vertices of *G*_{QS} in topological order. If such a path does not exist, then the CSS does not exist. A topological order can be computed in *O*(*p*^{2}) time, because there are *p*+1 vertices and at most *p*(*p*+1) edges in *G*_{QS}. Therefore, we can find the shortest -path in *O*(*p*(*P*+*N*)) time when *G*_{QS} is acyclic.

#### (ii) Case when is not inclusion-free

We first construct *G*_{QS} from *G*_{CSS} as in the previous case. Let *k* be the number of all *q*-vertices, i.e, . Then, *p*≤*k*≤*N*−*n*+*p*, because a vertex representing a proper prefix of negative strings can be a *q*-vertex but no vertex representing a proper prefix of positive strings can be a *q*-vertex owing to the inclusion-free condition (1). Because *G*_{QS} has *k*+1 vertices (including *v*_{λ}), *G*_{QS} can be constructed in *O*(*k*(*P*+*N*)) time using BFS. Then, finding the shortest -path in *G*_{CSS} is the same as finding the shortest path *A*_{s} in *G*_{QS} such that *A*_{s} starts at *v*_{λ} and passes over at least one vertex in every *Q*(*x*_{i}). The problem of finding *A*_{s} in *G*_{QS} is easily modelled as the generalized travelling salesman problem (GTSP), which can be solved in *O*(*k*^{2}2^{p}) time using dynamic programming [16,17]. Therefore, we can find the shortest -path in *O*(*k*(*P*+*N*)+*k*^{2}2^{p}) time. For example, the path (*v*_{λ},*v*_{3},*v*_{4}) in figure 3 is the shortest path that starts at *v*_{λ} and passes over a vertex in *Q*(*aba*) and a vertex in *Q*(*bb*). Hence, its corresponding path (*v*_{λ},*v*_{1},*v*_{2},*v*_{3},*v*_{2},*v*_{4}) is the shortest -path in *G*_{CSS} and the string *ababb* represented by this -path is an SCSS for and .

### Remark 4.1

Even though is not inclusion-free, |*Q*(*x*_{i})| can be 1 for every positive string *x*_{i}. In this case, we can find the shortest -path more efficiently using the algorithm for the case when is inclusion-free.

Table 1 shows the time complexities of our algorithm and the previous one [10] for the SCSS problem. Our algorithm is more efficient than the previous one. Furthermore, our algorithm gives solutions for all cases.

### (b) Finding the longest -path

In this problem, we consider not only simple paths but also paths containing a cycle, i.e. a vertex (including starting and ending vertices) can be visited more than once in a path. If there is a -path whose length is , i.e. containing a cycle, then there is no feasible longest -path and the LCSS does not exist.

#### (i) Case when is inclusion-free

We have two subcases according to whether *G*_{CSS} is acyclic or not. When *G*_{CSS} is acyclic, the longest -path can be easily found in *O*(*P*+*N*) time using an approach similar to finding the shortest -path. Let *v*_{p} be the last *q*-vertex in topological order and let *A*′ be the longest -path ending at *v*_{p}. Note that, unlike the shortest -path, which ends at *v*_{p}, the longest -path can go further. Let *A*′′ be the longest path from *v*_{p} to any vertex in *G*_{CSS}. Then, the longest -path is the concatenated path of *A*′ and *A*′′. When *G*_{CSS} is acyclic, because *A*′ and *A*′′ can be found in *O*(*P*+*N*) time, we can find the longest -path in *O*(*P*+*N*) time.

Next, let us consider the case when *G*_{CSS} is cyclic. We first construct a weighted directed graph *G*_{QL} from *G*_{CSS} defined as follows: vertices of *G*_{QL} consist of *v*_{λ}, *q*-vertices of *G*_{CSS}, and a special vertex *v*_{f}. There are two kinds of edges in *G*_{QL}. (We assume that all edge weights are −1 in *G*_{CSS} when computing edge weights in *G*_{QL}.)

— For vertices

*u*≠*v*_{f}and*v*≠*v*_{f}, if and only if there is a path from*u*to*v*in*G*_{CSS}, the edge (*u*,*v*) of*G*_{QL}is defined and its weight is the length of the shortest path from*u*to*v*in*G*_{CSS}. If a path from*u*to*v*in*G*_{CSS}contains a cycle, the weight of edge (*u*,*v*) is −|*V*|, where*V*is the vertex set of*G*_{CSS}. Note that if no path from*u*to*v*in*G*_{CSS}contains a cycle, the weight of edge (*u*,*v*) is greater than −|*V*|.— For a vertex

*u*≠*v*_{f}and*v*_{f}in*G*_{QL}, the edge (*u*,*v*_{f}) is always defined and its weight is the length of the shortest path from*u*to any vertex in*G*_{CSS}. If a path from*u*to any vertex in*G*_{CSS}contains a cycle, the weight of edge (*u*,*v*_{f}) is −|*V*|. No outgoing edge from*v*_{f}is defined in*G*_{QL}.

The length of a path in *G*_{QL} is defined as the sum of the weights of the edges on the path. Figure 4 shows *G*_{CSS} and *G*_{QL} for and . We compute the weights of outgoing edges from each vertex *u* (≠*v*_{f}) in *G*_{QL} as follows: Set all edge weights to −1 in *G*_{CSS} and compute the shortest paths from *u* to other vertices in *G*_{CSS} using the Bellman–Ford algorithm [18,19], which takes *O*((*P*+*N*)^{2}) time. Note that the Bellman–Ford algorithm can be easily modified to check if there is a path from *u* to *v* containing a cycle [20]. Because *G*_{QL} has *p*+1 vertices excluding *v*_{f}, it takes *O*(*p*(*P*+*N*)^{2}) time to compute all edge weights of *G*_{QL}.

Then, finding the longest -path in *G*_{CSS} is the same as finding the shortest path *A*_{s} in *G*_{QL} such that *A*_{s} starts at *v*_{λ} and passes over each vertex at least once. We have two subcases according to whether *G*_{QL} is cyclic or not.

— If

*G*_{QL}is cyclic, which can be checked in*O*(*p*^{2}) time, either there is no path starting at*v*_{λ}and passing all vertices, or we can find such a path with arbitrarily small weight, because this path contains a cycle. Thus, there is no feasible shortest path in*G*_{QL}, i.e. no feasible longest -path in*G*_{CSS}, and thus no LCSS exists in this case.— If

*G*_{QL}is acyclic,*A*_{s}(if it exists) must pass over all vertices of*G*_{QL}in topological order (if such a path does not exist, then the CSS does not exist). If the length of*A*_{s}is less than or equal to −|*V*|, its corresponding -path in*G*_{CSS}contains a cycle and thus there is no feasible longest -path. Otherwise (if the length of*A*_{s}is greater than −|*V*|), its corresponding path in*G*_{CSS}is the longest -path. (Note that the length of a feasible longest -path is less than |*V*| because it cannot contain a cycle.)

Therefore, we can find the longest -path in *O*(*p*(*P*+*N*)^{2}) time when *G*_{CSS} is cyclic. For example, consider the path (*v*_{λ},*v*_{6},*v*_{4},*v*_{f}) in figure 4*b*. This is the only path passing over all vertices in *G*_{QL} and its length is −11. It means that we can make an arbitrarily long -path in *G*_{CSS}, e.g. (*v*_{λ},*v*_{2},*v*_{5},*v*_{6},*v*_{4},*v*_{1},*v*_{1},…) which represents an arbitrarily long CSS *bbaaaa*…. Therefore, no LCSS exists in this example.

#### (ii) Case when is not inclusion-free

We also solve the problem using *G*_{QL} again. Finding the longest -path in *G*_{CSS} is the same as finding the shortest path *A*_{s} in *G*_{QL} such that *A*_{s} starts at *v*_{λ}, passes over at least one vertex in every *Q*(*x*_{i}), and ends at *v*_{f}. As in the SCSS problem, this problem can be solved by modelling as the GTSP. (Note that we can make all edge weights non-negative without changing the optimal solutions by adding |*V* | to all edge weights in *G*_{QL} [21].) Therefore, we can find the shortest -path in *O*(*k*(*P*+*N*)^{2}+*k*^{2}2^{p}) time. Note that the problem can be solved more efficiently in the case when |*Q*(*x*_{i})|=1 for every positive string *x*_{i} as in the SCSS problem.

Table 2 shows the time complexities of our algorithm and the previous one [10] for the LCSS problem. As shown in the table, our algorithm is more efficient than the previous one and gives solutions for all cases.

## 5. Concluding remarks

We have proposed a new graph model for the CSS problems where -paths have a one-to-one correspondence with CSSs. Thus, the SCSS (resp. LCSS) problems can be solved by finding the shortest (resp. longest) -path. We have also presented how to find the shortest -path and the longest -path in our graph. In some cases, this problem can be solved in polynomial time according to the inclusion-freeness and the topology of the graph. The polynomial-time solvable cases remain polynomial even though the alphabet size |*Σ*| is not constant. For example, when is inclusion-free and LCNSS for exists, the SCSS and the LCSS problems can be solved in *O*((*P*+*N*)|*Σ*|) time. Furthermore, for the LCSS problem, our algorithm can distinguish whether no CSS exists or an arbitrarily long CSS can be made.

In general, however, the CSS problems are NP-hard [22]. The CSS problems using our graph are closely related to the TSP (including the GTSP). Although the TSP is known as NP-hard, many algorithms have been developed for the TSP owing to its diverse applicability [23–25]. In this paper, we adopted algorithms using the dynamic programming approach. However, other approaches for the TSP, such as linear programming algorithms and approximation, can be applied to our problem in order to improve the performance in theory and practice.

Furthermore, the LCNSS problem, where , can be solved using our graph. Recall that the LCNSS for exists if and only if *G*_{CSS} is acyclic (corollary 3.3). We can check whether *G*_{CSS} is acyclic and find the longest *λ*-path in *G*_{CSS} using *O*(*N*) time. Therefore, the LCNSS can be found in *O*(*N*) time.

## Funding statement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0007860); by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0029924); by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2012R1A2A2A01014892), by the IT R&D programme of MSIP/KEIT (10038768, Development of Supercomputing System for Genome Analysis), by the Industrial Strategic Technology Development Program (10041971, Development of Power Efficient High-Performance Multimedia Contents Service Technology using Context-Adapting Distributed Transcoding) funded by the Ministry of Knowledge Economy (MKE, Korea) and by an Inha University research grant.

## Footnotes

↵† A preliminary version of this paper appeared in ISAAC 2009.

One contribution of 11 to a Theo Murphy Meeting Issue ‘Storage and indexing of massive data’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.