Downloaded from http://rsta.royalsocietypublishing.org/ on August 1, 2015

A new graph model and algorithms for consistent superstring problems†

Joong Chae Na1, Sukhyeun Cho2, Siwon Choi2, Jin Wook Kim3, Kunsoo Park4 and Jeong Seop Sim2

1 Department of Computer Science and Engineering, Sejong University, Seoul 143-747, South Korea
2 Department of Computer and Information Engineering, Inha University, Incheon 402-751, South Korea
3 Department of Computer Science, Korea National Open University, Seoul 110-791, South Korea
4 School of Computer Science and Engineering, Seoul National University, Seoul 151-742, South Korea

Research

Cite this article: Na JC, Cho S, Choi S, Kim JW, Park K, Sim JS. 2014 A new graph model and algorithms for consistent superstring problems. Phil. Trans. R. Soc. A 372: 20130134. http://dx.doi.org/10.1098/rsta.2013.0134

One contribution of 11 to a Theo Murphy Meeting Issue ‘Storage and indexing of massive data’.

Subject Areas: algorithmic information theory, theory of computing

Keywords: consistent superstring, string inclusion, string non-inclusion

Author for correspondence: Jeong Seop Sim e-mail: [email protected]

Problems related to string inclusion and non-inclusion have been vigorously studied in diverse fields such as data compression, molecular biology and computer security. Given a finite set of positive strings P and a finite set of negative strings N, a string α is a consistent superstring if every positive string is a substring of α and no negative string is a substring of α. The shortest (resp. longest) consistent superstring problem is to find a string α that is the shortest (resp. longest) among all the consistent superstrings for the given sets of strings. In this paper, we first propose a new graph model for consistent superstrings for given P and N. In our graph model, the set of strings represented by paths satisfying some conditions is the same as the set of consistent superstrings for P and N. We also present algorithms for the shortest and the longest consistent superstring problems. Our algorithms solve the consistent superstring problems for all cases, including cases that are not considered in previous work. Moreover, our algorithms solve in polynomial time the consistent superstring problems for more cases than the previous algorithms. For the polynomially solvable cases, our algorithms are more efficient than the previous ones.



† A preliminary version of this paper appeared in ISAAC 2009.

2014 The Author(s) Published by the Royal Society. All rights reserved.


1. Introduction

Problems related to string inclusion and non-inclusion have been vigorously studied in diverse fields such as data compression [1,2], molecular biology [3,4] and computer security [5]. Some examples of well-known inclusion or non-inclusion related strings are as follows. Consider a set of strings W = {w1, . . . , wm} and a string α over a constant-size alphabet.

— Longest common substrings. If α is a substring of wi for all 1 ≤ i ≤ m, then α is called a common substring of W. Among such α, the longest one is called the longest common substring of W. Finding the longest common substring of W is solvable in polynomial time by dynamic programming or using generalized suffix trees [6].
— Shortest common superstrings. If α is a superstring of wi for all 1 ≤ i ≤ m, then α is called a common superstring of W. Among such α, the shortest one is called the shortest common superstring of W. The problem of finding the shortest common superstring is NP-hard [7,8].
— Shortest common non-substrings. If α is not a substring of wi for any 1 ≤ i ≤ m, then α is called a common non-substring of W. Among such α, the shortest one is called the shortest common non-substring of W. The problem of finding the shortest common non-substring can be solved in polynomial time [9].
— Longest common non-superstrings. If α is not a superstring of wi for any 1 ≤ i ≤ m, then α is called a common non-superstring of W. Among such α, the longest one is called the longest common non-superstring (LCNSS) of W. The problem of finding the LCNSS is solvable in polynomial time [9–11].

There are some notions that consider string inclusion and non-inclusion at the same time. Given a finite set of positive strings P and a finite set of negative strings N, a string α is a consistent superstring (CSS) for P and N if α is both a common superstring of P and a common non-superstring of N.

— Shortest consistent superstrings. Among the CSSs for P and N, the shortest one is called the shortest CSS for P and N. Jiang & Li [12] introduced the notion of the CSS and they provided approximation algorithms for finding the shortest CSS when |N| is bounded by a constant or when P ∪ N is inclusion-free, i.e. no string in the union includes another as a substring. Jiang & Timkovsky [10] proposed an algorithm for solving this problem using the directed graph model proposed in [9]. There are two non-trivial conditions assumed in reference [10] for finding the shortest CSS for P and N. The first condition is that every symbol of the alphabet appears at the end of some negative string (final closure). The second condition is that P ∪ N is inclusion-free. Jiang & Timkovsky’s algorithm runs in polynomial time when the LCNSS for N exists or when |P| is bounded by a constant.
— Longest consistent superstrings. Among the CSSs for P and N, the longest one is called the longest CSS for P and N. Under the same conditions as above, Jiang & Timkovsky [10] proposed a polynomial-time algorithm for finding the longest CSS when the LCNSS for N exists.

In this paper, we first propose a new graph model for CSSs for given P and N. Then, we present algorithms for finding the shortest CSS and the longest CSS using our graph model. Our contributions are as follows:

— We propose a new graph model to solve the CSS problems, which is based on the Aho–Corasick machine. In our graph model, the set of strings represented by paths satisfying some conditions is the same as the set of CSSs for P and N. While Jiang & Timkovsky [10] modelled CSSs for P and N using only the strings in N, we model CSSs using all the strings in P and N. This makes our graph model more intuitive than that in [10], and makes our algorithms simpler than those in [10]. Furthermore, we do not assume any of the non-trivial conditions in [10], i.e. our graph model is more general than that in [10].
— We also present algorithms for finding the shortest and the longest CSSs using our graph model. Our algorithms solve the CSS problems for all cases, including cases that are not considered in reference [10], i.e. the cases when P ∪ N is not inclusion-free or when the final closure condition does not hold. Moreover, our algorithms solve in polynomial time the CSS problems for more cases than those in [10]. The shortest CSS problem can be solved in polynomial time when the LCNSS for N exists and P ∪ N is inclusion-free, or when |P| is bounded by a constant. The longest CSS problem can be solved in polynomial time when P ∪ N is inclusion-free or when |P| is bounded by a constant. In particular, when the LCNSS for N exists and P ∪ N is inclusion-free, our algorithms for finding the shortest and the longest consistent superstrings run in time linear in the sum of the lengths of the strings in P and N, which is more efficient than the algorithms in [10].

The remainder of the paper is organized as follows. In §2, we give some notation and definitions. In §3, we propose a new graph model for CSSs. In §4, we present algorithms for finding the shortest CSS and the longest CSS for the given sets of strings. Then, we give concluding remarks in §5.

2. Preliminaries

(a) Problem definition

Let α be a string over a constant-size alphabet Σ. We denote the length of α by |α| and the ith (1 ≤ i ≤ |α|) character of α by α[i]. When a string β is α[i]α[i + 1] · · · α[j], β is denoted by α[i..j] and called a substring of α. Conversely, α is called a superstring of β. Moreover, α[1..j] (1 ≤ j ≤ |α|) is called a prefix of α, and when j ≠ |α|, α[1..j] is called a proper prefix of α. Similarly, we can define a suffix and a proper suffix of α. For convenience, we assume that the empty string λ is both a prefix and a suffix of α.

Consider two finite sets of strings over a constant-size alphabet Σ: a set of positive strings P = {x1, x2, . . . , xp} and a set of negative strings N = {y1, y2, . . . , yn}. We assume that neither P nor N contains the empty string λ. With a slight abuse of notation, in complexity bounds and size estimates such as O(P + N) we also write P and N for the sums of the lengths of all strings in P and N, respectively. We denote the number of elements in a set W by |W|. Then, |P| = p and |N| = n.

We define a common superstring, a common non-superstring and a consistent superstring. A common superstring for P is a string that includes every positive string xi (1 ≤ i ≤ p) in P as a substring. Similarly, a common non-superstring for N is a string that includes no negative string yj (1 ≤ j ≤ n) in N as a substring. A consistent superstring (CSS) for P and N is a string that is both a common superstring for P and a common non-superstring for N. Among CSSs, the shortest one and the longest one are called the shortest consistent superstring (SCSS) and the longest consistent superstring (LCSS), respectively. If we can make an arbitrarily long CSS, then the number of CSSs is infinite and thus the LCSS does not exist. Now, we define the shortest consistent superstring problem and the longest consistent superstring problem:

Problem 2.1. The shortest consistent superstring problem (the SCSS problem).
Input: a finite set of positive strings P = {x1, x2, . . . , xp} and a finite set of negative strings N = {y1, y2, . . . , yn} over a constant-size alphabet Σ.
Output: if no CSS exists, the output is ‘no SCSS exists’. Otherwise, the output is an SCSS for P and N.

Problem 2.2. The longest consistent superstring problem (the LCSS problem).
Input: a finite set of positive strings P = {x1, x2, . . . , xp} and a finite set of negative strings N = {y1, y2, . . . , yn} over a constant-size alphabet Σ.
Output: if no CSS exists or an arbitrarily long CSS can be made, the output is ‘no LCSS exists’. Otherwise, the output is an LCSS for P and N.

Figure 1. (a) The AC machine for {aba, bb, aa, abba, bbb} and (b) its DFA version, where dashed arrows show the failure functions and double-circled vertices mean that their output functions are not empty.

To avoid some trivial cases, we assume that the following three inclusion-free conditions hold for strings in P and N:

(1) For all xi and xj (i ≠ j), xi is not a substring of xj. If xi is a substring of xj, then any superstring of xj is a superstring of xi. Thus, we can remove xi from P.
(2) For all yi and yj (i ≠ j), yi is not a substring of yj. If yi is a substring of yj, then any non-superstring of yi is a non-superstring of yj. Thus, we can remove yj from N.
(3) For all xi and yj, yj is not a substring of xi. If yj is a substring of xi, then any superstring of xi cannot be a non-superstring of yj. Thus, no CSS for P and N exists in this case.

The following is another inclusion-free condition that is non-trivial.

(4) For all xi and yj, xi is not a substring of yj.

We say that P ∪ N is inclusion-free when P and N satisfy the condition (4) as well as the conditions (1)–(3).
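The definitions of a CSS and of the inclusion-free conditions can be checked mechanically on small inputs. The following Python sketch is only an illustration under stated assumptions: the function names (`is_css`, `is_inclusion_free`, `shortest_css_bruteforce`) are hypothetical and do not come from the paper, and the brute-force search is exponential, so it serves as a reference for tiny examples rather than as one of the algorithms of §4.

```python
from itertools import product


def is_css(alpha, P, N):
    """True if alpha contains every string of P and no string of N."""
    return all(x in alpha for x in P) and not any(y in alpha for y in N)


def is_inclusion_free(P, N):
    """Conditions (1)-(4): no string of P ∪ N is a substring of another one."""
    strings = list(P) + list(N)
    return not any(
        s in t
        for i, s in enumerate(strings)
        for j, t in enumerate(strings)
        if i != j
    )


def shortest_css_bruteforce(P, N, alphabet, max_len):
    """Enumerate candidates by increasing length (exponential; tiny inputs only).

    Returns the first CSS found, or None if none has length <= max_len.
    """
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            alpha = "".join(chars)
            if is_css(alpha, P, N):
                return alpha
    return None
```

For P = {aba, bb} and N = {aa, abba, bbb}, the running example of the paper, `is_inclusion_free` returns False (bb is a substring of abba), and the brute-force search returns ababb, which agrees with the SCSS reported in §4.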

(b) Aho–Corasick machine

Our graph model for solving the CSS problems is related to the Aho–Corasick machine (AC machine), which is a useful tool for multiple pattern matching problems [13]. The AC machine consists of vertices (states) and three functions (transitions): the goto function, the failure function and the output function. Each vertex represents a prefix of one of the given pattern strings. For a vertex u, let str(u) denote the string represented by u. For a vertex u and a character σ, the goto function is defined as a vertex v (if it exists) such that str(v) is str(u)σ. For a vertex u, let α be the longest proper suffix of str(u) that is represented by a vertex in the AC machine. Then, the failure function for the vertex u is defined as a vertex w representing α. The output function for a vertex u is defined as the set of pattern strings that are suffixes of str(u).

Figure 1a shows an example of the AC machine for {aba, bb, aa, abba, bbb}. Note that the values of the output functions for v3, v4, v6, v7, v8 and v9 are {aba}, {bb}, {bb}, {aa}, {abba} and {bb, bbb}, respectively. To avoid all failure functions, the AC machine can be represented as a deterministic finite automaton (DFA). Figure 1b shows the DFA version of figure 1a.
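The goto, failure and output functions described above, and their folding into a DFA, can be sketched in a few lines of Python. This is a minimal textbook-style construction written for illustration, not code from the paper; the function name `build_ac_dfa` and the returned data layout are assumptions, and every pattern character is assumed to belong to `alphabet`.

```python
from collections import deque


def build_ac_dfa(patterns, alphabet):
    """Return (states, delta, output) for the DFA version of the AC machine.

    states[i] : the prefix string represented by state i (state 0 is the empty string)
    delta[i][c]: the state reached from state i on character c
    output[i] : the set of patterns that are suffixes of states[i]
    """
    goto = [{}]        # trie edges (the goto function)
    states = [""]
    output = [set()]
    for pat in patterns:
        u = 0
        for c in pat:
            if c not in goto[u]:
                goto.append({})
                states.append(states[u] + c)
                output.append(set())
                goto[u][c] = len(goto) - 1
            u = goto[u][c]
        output[u].add(pat)

    fail = [0] * len(goto)               # the failure function
    delta = [dict() for _ in goto]
    queue = deque()
    for c in alphabet:                   # depth-1 states fail to the root
        v = goto[0].get(c, 0)
        delta[0][c] = v
        if v:
            queue.append(v)
    while queue:                         # breadth-first, so fail[u] is already complete
        u = queue.popleft()
        output[u] |= output[fail[u]]     # inherit patterns ending at the failure state
        for c in alphabet:
            if c in goto[u]:
                v = goto[u][c]
                fail[v] = delta[fail[u]][c]
                delta[u][c] = v
                queue.append(v)
            else:
                delta[u][c] = delta[fail[u]][c]
    return states, delta, output
```

On the pattern set of figure 1, this yields ten states, and the output sets of the states for aba, abb, abba and bbb are {aba}, {bb}, {abba} and {bb, bbb}, matching the values listed in the text.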


3. Graph model

Here, we first define our new graph model, denoted by GCSS, for consistent superstrings, and then give some properties of GCSS. Let GAC be the graph representing the DFA version of the AC machine for P ∪ N. Informally, GCSS is a graph made from GAC by removing every vertex v representing a negative string, together with the incoming and outgoing edges of v. For example, given P = {aba, bb} and N = {aa, abba, bbb}, GAC is shown in figure 1b and GCSS is shown in figure 2.

Now, we formally define GCSS. Let T be the set of all prefixes of all positive strings and all proper prefixes of all negative strings. For example, given P = {aba, bb} and N = {aa, abba, bbb}, T = {λ, a, ab, aba, abb, b, bb}. We first define a function δ : T × Σ → (T ∪ N) as follows: for α ∈ T and σ ∈ Σ, δ(α, σ) is the longest suffix of ασ in T ∪ N. Then, GCSS = (V, E) is a directed graph defined as follows:

— Vertex set V. We define vertices so that there is a one-to-one correspondence between the vertices in V and the strings in T. We denote the vertex corresponding to a string α by ver(α) and, conversely, the string corresponding to a vertex v by str(v). We denote ver(λ) by vλ as a special vertex.
— Edge set E. For strings α, β ∈ T and σ ∈ Σ, a directed edge from ver(α) to ver(β) labelled with σ is defined if and only if β = δ(α, σ). An edge labelled with σ is called a σ-edge. Note that there exists at most one outgoing σ-edge from ver(α), and that this edge fails to exist if and only if δ(α, σ) ∈ N.

The number of vertices is at most P + N, because the number of prefixes of a string is the same as the length of the string. The number of edges is O(P + N), because every vertex has at most |Σ| outgoing edges, which is a constant.

A path in GCSS is denoted by a sequence A = (u0, u1, . . . , uk) of vertices such that edge (ui−1, ui) is defined for every i = 1, 2, . . . , k. The length of A is the number of edges in A, denoted by |A|. A path is called a λ-path if it starts at the vertex vλ. A path A = (u0, . . . , uk) represents the unique string, denoted by pstr(A), whose ith character is the label of edge (ui−1, ui) for 1 ≤ i ≤ k. The same string may be represented by more than one path. However, a λ-path representing a string is unique (if it exists), because for each character σ ∈ Σ, there is at most one outgoing σ-edge from each vertex. We denote the λ-path representing a string α by λ-path(α). For a string α, a suffix of α is called a τ-suffix if the suffix is in T.

Lemma 3.1. For a λ-path A = (vλ, . . . , v), str(v) is the longest τ-suffix of pstr(A).

Proof. We prove it by induction on the lengths of λ-paths. Initially, let us consider the case when |A| = 0, i.e. A = (vλ). Then, it is obvious that str(vλ) (= λ) is the longest τ-suffix of pstr(A) (= λ). Next, suppose that the lemma holds for |A| = k ≥ 0. We show that the lemma holds for |A| = k + 1 as well. Assume that A is the concatenated path of a path A′ = (vλ, . . . , u) of length k and a σ-edge (u, v). Then, str(u) is the longest τ-suffix of pstr(A′) by the inductive hypothesis. Suppose str(v) is not the longest τ-suffix of pstr(A), i.e. there is a τ-suffix (say ασ) of pstr(A) longer than str(v). Then, we have two cases according to the length of α. First, consider the case when |str(u)| ≥ |α|. Because str(u)σ, ασ and str(v) are suffixes of pstr(A) and |str(u)σ| ≥ |ασ| > |str(v)|, ασ is a τ-suffix of str(u)σ longer than str(v). This contradicts the definition of edge (u, v), by which str(v) is the longest τ-suffix of str(u)σ. Next, consider the case when |α| > |str(u)|. Then, α is a τ-suffix of pstr(A′) longer than str(u), which contradicts the inductive hypothesis that str(u) is the longest τ-suffix of pstr(A′). Therefore, str(v) is the longest τ-suffix of pstr(A).

For example, consider a λ-path A1 = (vλ, v1, v2, v3, v2) in figure 2. Then, pstr(A1) = abab and its longest τ-suffix is ab (= str(v2)) because abab and bab are not in T. For another λ-path A2 = (vλ, v5, v6, v1), pstr(A2) = bba and its longest τ-suffix is a (= str(v1)).

Now, we give some relations between GCSS and common non-superstrings for N.

Lemma 3.2. A string α is a common non-superstring for N if and only if λ-path(α) exists in GCSS.

Figure 2. The graph GCSS for P = {aba, bb} and N = {aa, abba, bbb}. The vertices represent the elements of T: str(vλ) = λ, str(v1) = a, str(v2) = ab, str(v3) = aba, str(v4) = abb, str(v5) = b and str(v6) = bb.
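The graph of figure 2 can be rebuilt directly from the definition of T and δ, which also makes it possible to exercise lemmas 3.1 and 3.2 on small strings. The Python sketch below is an illustration only: `build_gcss` and `lambda_path_end` are hypothetical names, the inclusion-free conditions (1)–(3) are assumed to hold, and δ is computed naively by scanning suffixes rather than through the linear-time AC construction the paper relies on.

```python
def build_gcss(P, N, alphabet):
    """Build G_CSS as {vertex string: {character: target vertex string}}."""
    T = {""}                                          # the empty string plays the role of λ
    for x in P:
        T |= {x[:i] for i in range(1, len(x) + 1)}    # all prefixes of positive strings
    for y in N:
        T |= {y[:i] for i in range(1, len(y))}        # proper prefixes of negative strings
    targets = T | set(N)
    edges = {}
    for alpha in T:
        edges[alpha] = {}
        for c in alphabet:
            s = alpha + c
            # delta(alpha, c): longest suffix of alpha+c in T ∪ N ("" is always in T)
            beta = next(s[i:] for i in range(len(s) + 1) if s[i:] in targets)
            if beta not in N:                         # no edge when delta lands in N
                edges[alpha][c] = beta
    return edges


def lambda_path_end(edges, alpha):
    """Follow the λ-path of alpha; return its last vertex, or None if it does not exist."""
    v = ""
    for c in alpha:
        if c not in edges[v]:
            return None
        v = edges[v][c]
    return v
```

For P = {aba, bb} and N = {aa, abba, bbb}, the vertex set is exactly T = {λ, a, ab, aba, abb, b, bb}, the λ-path of abab ends at the vertex for ab, and the λ-path of bba ends at the vertex for a, as in the examples following lemma 3.1. A λ-path exists for a string precisely when the string contains no negative string, which is the statement of lemma 3.2.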

Proof. (If) We prove it by contradiction. Suppose that there exists a string which is represented by a λ-path in GCSS and includes a negative string y. Let α be the shortest one among such strings. Then, y is a suffix of α. Let (u, v) be the last edge of λ-path(α) and σ be the label of edge (u, v). Now, we show str(v) = y, which contradicts the definition of GCSS that str(v) is a string in T including no negative string. Let y′ and α′ be the longest proper prefixes of y and α, respectively, i.e. y = y′σ and α = α′σ. Because str(u) is the longest τ-suffix of α′ by lemma 3.1 and y′ is in T, |str(u)| ≥ |y′| and thus y (= y′σ) is a suffix of str(u)σ. Furthermore, y is the longest suffix of str(u)σ in T ∪ N by the inclusion-free conditions (2) and (3), by which the negative string y is not included in any string in P ∪ N except for itself. Thus, δ(str(u), σ) = y and str(v) = y.

(Only if) We prove it by contradiction. Suppose that there exists a common non-superstring for N which is represented by no λ-path in GCSS. Let α be the shortest one among such strings. (Note that α cannot be λ, because λ is represented in GCSS.) Let α′ be the longest proper prefix of α and σ be the last character of α, i.e. α = α′σ. Then α′ is a common non-superstring for N because so is α. Thus, λ-path(α′) exists, because α′ is shorter than α. Let u be the last vertex of λ-path(α′). Then, str(u) is a suffix of α′ by lemma 3.1. Because str(u)σ is a suffix of the common non-superstring α, str(u)σ is also a common non-superstring for N. Therefore, no suffix of str(u)σ is a negative string and δ(str(u), σ) ∈ T. That is, the σ-edge from u to a vertex v is defined and the concatenated path of λ-path(α′) and (u, v) represents α. This contradicts the assumption that λ-path(α) does not exist in GCSS.

Because every vertex in GCSS is reachable from vλ, an arbitrarily long λ-path can be made if and only if GCSS is cyclic. Thus, we can obtain the following corollary from lemma 3.2.

Corollary 3.3. The longest common non-superstring for N exists if and only if GCSS is acyclic.

Next, we give some relations between GCSS and common superstrings for P. For a positive string xi (1 ≤ i ≤ p), we call a vertex v a q-vertex of xi in GCSS if xi is a suffix of str(v), and denote the set of q-vertices of xi by Q(xi). (Intuitively, Q(xi) is the set of vertices whose output function values include xi in the AC machine GAC.) For example, Q(aba) = {v3} and Q(bb) = {v4, v6} in figure 2. Note that a vertex belongs to at most one Q(xi) owing to the inclusion-free condition (1). Furthermore, if P and N satisfy the inclusion-free conditions (1)–(4), |Q(xi)| = 1 for every positive string xi.

Lemma 3.4. For a positive string xi, a λ-path A passes a vertex in Q(xi) if and only if pstr(A) is a superstring of xi.

Proof. (If) Without loss of generality, assume that xi is a suffix of pstr(A). Let v be the last vertex on path A. Then, str(v) is the longest τ-suffix of pstr(A) by lemma 3.1. Because xi ∈ T, the length of the longest τ-suffix of pstr(A) is larger than or equal to |xi|, i.e. |str(v)| ≥ |xi|. Moreover, because both str(v) and xi are suffixes of pstr(A), xi is a suffix of str(v) and thus v ∈ Q(xi). Therefore, A passes a vertex in Q(xi).

(Only if) Let v be a vertex in Q(xi) passed by A. Without loss of generality, assume that v is the last vertex of A. Then, xi is a suffix of str(v) by definition of Q(xi) and str(v) is a suffix of pstr(A) by lemma 3.1. Thus, xi is a suffix of pstr(A).

For example, consider GCSS in figure 2. For a λ-path A1 = (vλ, v1, v2, v3, v2) representing a superstring abab of aba, A1 passes a vertex v3 in Q(aba). Conversely, for a λ-path A2 = (vλ, v5, v6, v1) passing a vertex v6 in Q(bb), pstr(A2) (= bba) is a superstring of bb.

Lemma 3.4 can be directly extended to the set P of positive strings. We call a λ-path A a Q-path if A passes at least one vertex in Q(xi) for every xi ∈ P. Then, corollary 3.5 can be derived directly from lemma 3.4.

Corollary 3.5. A λ-path A is a Q-path if and only if pstr(A) is a common superstring for P.

For example, consider a λ-path A1 = (vλ, v1, v2, v3, v2, v4) in figure 2, which represents a common superstring ababb for P = {aba, bb}. Because A1 passes v3 ∈ Q(aba) and v4 ∈ Q(bb), A1 is a Q-path. Conversely, for a Q-path A2 = (vλ, v5, v6, v1, v2, v3) passing v3 ∈ Q(aba) and v6 ∈ Q(bb), pstr(A2) (= bbaba) is a common superstring for P.

Owing to lemma 3.2 and corollary 3.5, we obtain the following theorem, which says that the set of consistent superstrings for P and N is the same as the set of strings represented by Q-paths in GCSS.

Theorem 3.6. A string α is a consistent superstring for P and N if and only if a Q-path representing α exists in GCSS.

4. Algorithms for consistent superstring problems

Here, we present algorithms for the CSS problems. We can find CSSs for P and N by finding Q-paths in GCSS by theorem 3.6. Because the length of a CSS α is the same as the length (the number of edges) of λ-path(α), we can find the shortest (resp. longest) CSS by finding the shortest (resp. longest) Q-path in GCSS. Our algorithms for the consistent superstring problems consist of the following three phases:

1. Construct GCSS and compute Q(xi).
2. Find the shortest (or longest) Q-path.
3. Compute the SCSS (resp. LCSS), i.e. the string represented by the shortest (resp. longest) Q-path, if such a path is found in phase 2.

Phase 1 can be done in O(P + N) time by constructing the AC machine and computing its output function, and by removing unnecessary vertices and edges. Phase 3 can be done simply by concatenating the label of each edge on the Q-path found in phase 2 while traversing the Q-path. For phase 2, we give details in the following subsections.

(a) Finding the shortest Q-path

We describe how to find the shortest Q-path in GCSS. We use different approaches according to whether P ∪ N is inclusion-free or not. Recall that P ∪ N is inclusion-free when P and N satisfy the inclusion-free condition (4) as well as the inclusion-free conditions (1)–(3).

(i) Case when P ∪ N is inclusion-free

In this case, |Q(xi)| = 1 for every positive string xi. We first check whether GCSS is acyclic or not, i.e. whether the LCNSS for N exists or not (corollary 3.3), which can be done in O(P + N) time using depth-first search.

Let us consider the case when GCSS is acyclic. Let v1, v2, . . . , vp be the q-vertices in topological order. Then, the shortest Q-path in GCSS is the shortest path passing over vλ, v1, v2, . . . , vp in order. This order v1, v2, . . . , vp of the q-vertices must be unique if there is a Q-path in GCSS, because GCSS is acyclic. The shortest Q-path can be found by concatenating all the shortest paths Ai from vi−1 to vi (1 ≤ i ≤ p), where v0 = vλ. Each Ai can be easily found by considering only the vertices (and their outgoing edges) from vi−1 to vi in topological order, and thus all Ai can be found in O(P + N) time. Therefore, we can find the shortest Q-path in O(P + N) time when GCSS is acyclic.

Next, let us consider the case when GCSS is cyclic. We first construct a weighted directed graph GQS from GCSS as follows: vertices of GQS consist of vλ and the q-vertices of GCSS. For vertices u and v in GQS, the edge (u, v) of GQS is defined if and only if there is a path from u to v in GCSS, and its weight is the length of the shortest path from u to v in GCSS. (Note that the shortest path may pass over other q-vertices in GCSS.) The length of a path in GQS is defined as the sum of the weights of the edges on the path (figure 3). (In this example, P ∪ N is not inclusion-free; nevertheless, GQS is well defined.) We can find the shortest paths from a vertex u to other vertices in O(P + N) time by performing breadth-first search (BFS) starting at u in GCSS. Because the number of vertices in GQS is p + 1, GQS can be constructed in O(p(P + N)) time. Then, finding the shortest Q-path in GCSS is the same as finding the shortest path As in GQS such that As starts at vλ and passes over each vertex at least once. The problem of finding As in GQS can be easily modelled as the travelling salesman problem (TSP). It is well known that the TSP can be solved in O(p^2 2^p) time using dynamic programming [14,15]. Therefore, we can find the shortest Q-path in O(p(P + N) + p^2 2^p) time when GCSS is cyclic.

Figure 3. The graph GQS constructed from GCSS in figure 2.

In the case when GQS is acyclic, we can solve this problem more efficiently by a topological sort. Note that GQS may be acyclic even though GCSS is not acyclic. If GQS is acyclic, As must pass over all vertices of GQS in topological order. If such a path does not exist, then the CSS does not exist. A topological order can be computed in O(p^2) time, because there are p + 1 vertices and at most p(p + 1) edges in GQS. Therefore, we can find the shortest Q-path in O(p(P + N)) time when GQS is acyclic.
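The GQS construction and the subset dynamic programme used for the cyclic case can be sketched as follows. This is a hedged illustration rather than the authors' implementation: all function names are hypothetical, GCSS is built naively from the definition of δ, the sketch assumes |Q(xi)| = 1 for every positive string, and it always runs the O(p^2 2^p) dynamic programme, without the topological-order shortcuts described above for acyclic GCSS or acyclic GQS.

```python
from collections import deque


def build_gcss(P, N, alphabet):
    """Naive G_CSS: {vertex string: {character: target vertex string}}."""
    T = {""}
    for x in P:
        T |= {x[:i] for i in range(1, len(x) + 1)}
    for y in N:
        T |= {y[:i] for i in range(1, len(y))}
    targets = T | set(N)
    edges = {}
    for alpha in T:
        edges[alpha] = {}
        for c in alphabet:
            s = alpha + c
            beta = next(s[i:] for i in range(len(s) + 1) if s[i:] in targets)
            if beta not in N:
                edges[alpha][c] = beta
    return edges


def bfs_labels(edges, src):
    """Label string of one shortest path from src to every reachable vertex."""
    best = {src: ""}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for c, v in edges[u].items():
            if v not in best:
                best[v] = best[u] + c
                queue.append(v)
    return best


def shortest_css(P, N, alphabet):
    """SCSS via G_QS and a subset DP; returns None when no CSS exists."""
    if not P:
        return ""
    edges = build_gcss(P, N, alphabet)
    q = []
    for x in P:
        qx = [v for v in edges if v.endswith(x)]      # Q(x)
        if len(qx) != 1:
            raise ValueError("sketch assumes exactly one q-vertex per positive string")
        q.append(qx[0])
    p = len(q)
    from_start = bfs_labels(edges, "")                # edges of G_QS leaving v_lambda
    between = [bfs_labels(edges, v) for v in q]       # edges of G_QS between q-vertices
    # dp[(mask, i)]: shortest string of a λ-path that covers the q-vertices in mask, ending at q[i]
    dp = {(1 << i, i): from_start[q[i]] for i in range(p) if q[i] in from_start}
    for mask in range(1, 1 << p):
        for i in range(p):
            cur = dp.get((mask, i))
            if cur is None:
                continue
            for j in range(p):
                if mask >> j & 1 or q[j] not in between[i]:
                    continue
                cand = cur + between[i][q[j]]
                key = (mask | 1 << j, j)
                if key not in dp or len(cand) < len(dp[key]):
                    dp[key] = cand
    full = (1 << p) - 1
    done = [dp[full, i] for i in range(p) if (full, i) in dp]
    return min(done, key=len) if done else None
```

For the inclusion-free instance P = {baa, bba}, N = {ab, bbb}, whose GCSS is cyclic, the sketch returns bbaa. For P = {aba, bb}, N = {aa, abba, bbb} it raises ValueError, because Q(bb) contains two vertices and that instance falls under the next case.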

(ii) Case when P ∪ N is not inclusion-free We first construct GQS from GCSS as in the previous case. Let k be the number of all q-vertices, i.e, p k = i=1 |Q(xi )|. Then, p ≤ k ≤ N − n + p, because a vertex representing a proper prefix of negative strings can be a q-vertex but no vertex representing a proper prefix of positive strings can be a q-vertex owing to the inclusion-free condition (1). Because GQS has k + 1 vertices (including vλ ), GQS can be constructed in O(k(P + N)) time using BFS. Then, finding the shortest Q-path in GCSS is the same as finding the shortest path As in GQS such that As starts at vλ and passes over at least one vertex in every Q(xi ). The problem of finding As in GQS is easily modelled as the generalized travelling salesman problem (GTSP), which can be solved in O(k2 2p ) time using dynamic programming [16,17]. Therefore, we can find the shortest Q-path in O(k(P + N) + k2 2p ) time. For example, the path (vλ , v3 , v4 ) in figure 3 is the shortest path that starts at vλ and passes over a vertex in Q(aba) and a vertex in Q(bb). Hence, its corresponding path (vλ , v1 , v2 , v3 , v2 , v4 )

.........................................................

v4

rsta.royalsocietypublishing.org Phil. Trans. R. Soc. A 372: 20130134

2 3

8

Downloaded from http://rsta.royalsocietypublishing.org/ on August 1, 2015

algorithms JT95 [10]

cases IF and final closurea

LCNSS for N exists O(p2 + pN2 )b

GQS is acyclic

O(P + N)

O(p(P + N))

GQS is cyclic O(pN + p2 2p )b 2

..........................................................................................................................................................................................................

IF

Q1

......

ours



IF

O(p(P + N) + p2 2p )

................................................................................................................................................



Q1

O(k(P + N) + k2 2p )

..........................................................................................................................................................................................................

a It is assumed in [10] that P

∪ N is inclusion-free and every symbol of the alphabet appears at the end of some negative string (final closure). b In addition to these complexities specified in [10], O(P) is required because O(P + N) is the input size. is the shortest Q-path in GCSS and the string ababb represented by this Q-path is an SCSS for P = {aba, bb} and N = {aa, abba, bbb}. Remark 4.1. Even though P ∪ N is not inclusion-free, |Q(xi )| can be 1 for every positive string xi . In this case, we can find the shortest Q-path more efficiently using the algorithm for the case when P ∪ N is inclusion-free. Table 1 shows the time complexities of our algorithm and the previous one [10] for the SCSS problem. Our algorithm is more efficient than the previous one. Furthermore, our algorithm gives solutions for all cases.

(b) Finding the longest Q-path In this problem, we consider not only simple paths but also paths containing a cycle, i.e. a vertex (including starting and ending vertices) can be visited more than once in a path. If there is a Qpath whose length is ∞, i.e. containing a cycle, then there is no feasible longest Q-path and the LCSS does not exist.

(i) Case when P ∪ N is inclusion-free We have two subcases according to whether GCSS is acyclic or not. When GCSS is acyclic, the longest Q-path can be easily found in O(P + N) time using an approach similar to finding the shortest Q-path. Let vp be the last q-vertex in topological order and let A be the longest Q-path ending at vp . Note that, unlike the shortest Q-path, which ends at vp , the longest Q-path can go further. Let A be the longest path from vp to any vertex in GCSS . Then, the longest Q-path is the concatenated path of A and A . When GCSS is acyclic, because A and A can be found in O(P + N) time, we can find the longest Q-path in O(P + N) time. Next, let us consider the case when GCSS is cyclic. We first construct a weighted directed graph GQL from GCSS defined as follows: vertices of GQL consist of vλ , q-vertices of GCSS , and a special vertex vf . There are two kinds of edges in GQL . (We assume that all edge weights are −1 in GCSS when computing edge weights in GQL .) — For vertices u = vf and v = vf , if and only if there is a path from u to v in GCSS , the edge (u, v) of GQL is defined and its weight is the length of the shortest path from u to v in GCSS . If a path from u to v in GCSS contains a cycle, the weight of edge (u, v) is −|V|, where V is the vertex set of GCSS . Note that if no path from u to v in GCSS contains a cycle, the weight of edge (u, v) is greater than −|V|.


Table 1. Time complexities for the SCSS problem, where k is the number of all q-vertices in GCSS (p ≤ k ≤ N − n + p). ‘IF’ represents the case when P ∪ N is inclusion-free and ‘Q1’ represents the case when |Q(xi )| = 1 for every positive string xi . The symbol ∼ means the negation of the case.


Figure 4. The graphs (a) GCSS and (b) GQL for P = {baa, bba} and N = {ab, bbb}.

— For a vertex u ≠ vf in GQL, the edge (u, vf) is always defined and its weight is the length of the shortest path from u to any vertex in GCSS. If a path from u to any vertex in GCSS contains a cycle, the weight of edge (u, vf) is −|V|. No outgoing edge from vf is defined in GQL.

The length of a path in GQL is defined as the sum of the weights of the edges on the path. Figure 4 shows GCSS and GQL for P = {baa, bba} and N = {ab, bbb}.

We compute the weights of the outgoing edges from each vertex u (≠ vf) in GQL as follows: set all edge weights to −1 in GCSS and compute the shortest paths from u to the other vertices in GCSS using the Bellman–Ford algorithm [18,19], which takes O((P + N)²) time. Note that the Bellman–Ford algorithm can be easily modified to check whether there is a path from u to v containing a cycle [20]. Because GQL has p + 1 vertices excluding vf, it takes O(p(P + N)²) time to compute all edge weights of GQL.

Then, finding the longest Q-path in GCSS is the same as finding the shortest path As in GQL such that As starts at vλ and passes over each vertex at least once. We have two subcases according to whether GQL is cyclic or not.

— If GQL is cyclic, which can be checked in O(p²) time, either there is no path starting at vλ and passing over all vertices, or we can find such a path with arbitrarily small weight, because this path contains a cycle. Thus, there is no feasible shortest path in GQL, i.e. no feasible longest Q-path in GCSS, and thus no LCSS exists in this case.
— If GQL is acyclic, As (if it exists) must pass over all vertices of GQL in topological order (if such a path does not exist, then no CSS exists). If the length of As is less than or equal to −|V|, its corresponding Q-path in GCSS contains a cycle and thus there is no feasible longest Q-path. Otherwise (if the length of As is greater than −|V|), its corresponding path in GCSS is the longest Q-path. (Note that the length of a feasible longest Q-path is less than |V| because it cannot contain a cycle.)

Therefore, we can find the longest Q-path in O(p(P + N)²) time when GCSS is cyclic. For example, consider the path (vλ, v6, v4, vf) in figure 4b. This is the only path passing over all vertices in GQL and its length is −11. Because this length is at most −|V|, we can make an arbitrarily long Q-path in GCSS, e.g. (vλ, v2, v5, v6, v4, v1, v1, . . .), which represents an arbitrarily long CSS bbaaaa . . . . Therefore, no LCSS exists in this example.
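The weight computation for the outgoing edges of one vertex u of GQL can be sketched as follows. This is a minimal illustration under stated assumptions: GCSS is given as a plain edge list, every edge has weight −1, and the function name `gql_weights_from` and the toy graphs are hypothetical (they are not the graph of figure 4). Vertices that can still be relaxed after |V| − 1 rounds, together with everything reachable from them, are exactly those reached by some path containing a cycle, so they receive weight −|V|.

```python
def gql_weights_from(u, vertices, edges):
    """Bellman-Ford from u with every edge weight equal to -1.

    Returns (weights, to_vf): weights[v] is the weight of edge (u, v) for
    every reachable v != u, and to_vf is the weight of edge (u, vf).
    A caller would keep only the entries for v_lambda and the q-vertices.
    """
    INF = float("inf")
    n = len(vertices)
    dist = {v: INF for v in vertices}
    dist[u] = 0
    for _ in range(n - 1):                  # |V| - 1 relaxation rounds
        for a, b in edges:
            if dist[a] - 1 < dist[b]:
                dist[b] = dist[a] - 1

    # Heads of edges that can still be relaxed are affected by a cycle.
    succ = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
    stack = [b for a, b in edges if dist[a] - 1 < dist[b]]
    on_cyclic_path = set()
    while stack:                            # propagate along successors
        x = stack.pop()
        if x not in on_cyclic_path:
            on_cyclic_path.add(x)
            stack.extend(succ[x])

    weights = {}
    for v in vertices:
        if v == u:
            continue
        if v in on_cyclic_path:
            weights[v] = -n                 # some path from u to v has a cycle
        elif dist[v] != INF:
            weights[v] = dist[v]            # minus the longest path length
    to_vf = -n if on_cyclic_path else min(d for d in dist.values() if d != INF)
    return weights, to_vf


# Toy graph with a cycle a -> b -> a reachable from s.
V = ["s", "a", "b", "c"]
E = [("s", "a"), ("a", "b"), ("b", "a"), ("s", "c")]
print(gql_weights_from("s", V, E))  # ({'a': -4, 'b': -4, 'c': -1}, -4)
```

Because all weights are −1, a shortest path in this setting is a path with the most edges, which is why the acyclic entries equal minus the longest path length and are always greater than −|V|.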

(ii) Case when P ∪ N is not inclusion-free

We again solve the problem using GQL. Finding the longest Q-path in GCSS is the same as finding the shortest path As in GQL such that As starts at vλ, passes over at least one vertex in every Q(xi), and ends at vf. As in the SCSS problem, this problem can be solved by modelling it as the GTSP. (Note that we can make all edge weights non-negative without changing the optimal solutions by adding |V| to all edge weights in GQL [21].) Therefore, we can find the longest Q-path in O(k(P + N)² + k² 2^p) time. Note that the problem can be solved more efficiently when |Q(xi)| = 1 for every positive string xi, as in the SCSS problem.

Table 2 shows the time complexities of our algorithm and the previous one [10] for the LCSS problem. As shown in the table, our algorithm is more efficient than the previous one and gives solutions for all cases.

algorithms | cases | LCNSS for N exists | no LCNSS for N exists
JT95 [10] | IF and final closure | O(p² + pN²) |
ours | IF, or ∼IF and Q1 | O(P + N) | O(p(P + N)²)
ours | ∼IF and ∼Q1 | O(k(P + N)² + k² 2^p) | O(k(P + N)² + k² 2^p)
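The dynamic programming step behind the k² 2^p term of case (ii) can be sketched as a Held–Karp-style recurrence over subsets of groups. The sketch rests on simplifying assumptions that are not taken from the paper: each q-vertex belongs to exactly one group Q(xi), the path visits exactly one vertex per group, the weights are already non-negative, and the names `gtsp_shortest_path`, `groups` and `weight` are hypothetical. Under the one-vertex-per-group restriction every candidate path has p + 1 edges, so adding a constant to all weights shifts every candidate by the same amount.

```python
def gtsp_shortest_path(start, end, groups, weight):
    """Cheapest cost of a path start -> ... -> end that visits one vertex
    from each group.

    groups: list of p lists of vertices (the sets Q(x_i)).
    weight: dict mapping (u, v) to a non-negative edge cost; missing pairs
            mean that the edge is not defined.
    """
    INF = float("inf")
    p = len(groups)
    if p == 0:
        return weight.get((start, end), INF)
    full = (1 << p) - 1

    # dp[(mask, v)]: cheapest path from start that ends at v and has
    # visited exactly the groups whose bits are set in mask.
    dp = {}
    for g, members in enumerate(groups):
        for v in members:
            if (start, v) in weight:
                dp[(1 << g, v)] = weight[(start, v)]

    for mask in range(1, full + 1):         # masks only grow, so this order is safe
        for g, members in enumerate(groups):
            if not mask & (1 << g):
                continue
            for v in members:
                cur = dp.get((mask, v), INF)
                if cur == INF:
                    continue
                for h, others in enumerate(groups):
                    if mask & (1 << h):
                        continue            # group h is already covered
                    for w in others:
                        if (v, w) in weight:
                            key = (mask | (1 << h), w)
                            cand = cur + weight[(v, w)]
                            if cand < dp.get(key, INF):
                                dp[key] = cand

    best = INF
    for members in groups:
        for v in members:
            if (full, v) in dp and (v, end) in weight:
                best = min(best, dp[(full, v)] + weight[(v, end)])
    return best


groups = [["a1", "a2"], ["b1"]]
weight = {("s", "a1"): 5, ("s", "a2"): 1, ("s", "b1"): 2,
          ("a1", "b1"): 1, ("a2", "b1"): 10, ("b1", "a1"): 1, ("b1", "a2"): 1,
          ("a1", "t"): 1, ("a2", "t"): 1, ("b1", "t"): 1}
print(gtsp_shortest_path("s", "t", groups, weight))  # prints 4
```

For each of the 2^p masks the two nested loops over vertices cost O(k²), which gives the k² 2^p term; the remaining k(P + N)² term in the text comes from computing the edge weights of GQL beforehand.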

5. Concluding remarks

We have proposed a new graph model for the CSS problems in which Q-paths have a one-to-one correspondence with CSSs. Thus, the SCSS (resp. LCSS) problem can be solved by finding the shortest (resp. longest) Q-path. We have also shown how to find the shortest Q-path and the longest Q-path in our graph. In some cases, these problems can be solved in polynomial time, depending on the inclusion-freeness and the topology of the graph. The polynomial-time solvable cases remain polynomial even when the alphabet size |Σ| is not constant. For example, when P ∪ N is inclusion-free and the LCNSS for N exists, the SCSS and the LCSS problems can be solved in O((P + N)|Σ|) time. Furthermore, for the LCSS problem, our algorithm can distinguish whether no CSS exists or an arbitrarily long CSS can be made. In general, however, the CSS problems are NP-hard [22].

The CSS problems using our graph are closely related to the TSP (including the GTSP). Although the TSP is known to be NP-hard, many algorithms have been developed for it owing to its diverse applicability [23–25]. In this paper, we adopted algorithms using the dynamic programming approach. However, other approaches for the TSP, such as linear programming algorithms and approximation, can be applied to our problem in order to improve the performance in theory and practice.

Furthermore, the LCNSS problem, where P = ∅, can be solved using our graph. Recall that the LCNSS for N exists if and only if GCSS is acyclic (corollary 3.3). We can check whether GCSS is acyclic and find the longest λ-path in GCSS in O(N) time. Therefore, the LCNSS can be found in O(N) time.

Funding statement. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (20110007860); by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (20110029924); by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2012R1A2A2A01014892); by the IT R&D programme of MSIP/KEIT (10038768, Development of Supercomputing System for Genome Analysis); by the Industrial Strategic Technology Development Program (10041971, Development of Power Efficient High-Performance Multimedia Contents Service Technology using Context-Adapting Distributed Transcoding) funded by the Ministry of Knowledge Economy (MKE, Korea); and by an Inha University research grant.


Table 2. Time complexities for the LCSS problem, where k is the number of all q-vertices in GCSS (p ≤ k ≤ N − n + p). ‘IF’ represents the case when P ∪ N is inclusion-free and ‘Q1’ represents the case when |Q(xi )| = 1 for every positive string xi . The symbol ∼ means the negation of the case.


References


1. Storer JA, Szymanski TG. 1982 Data compression via textual substitution. J. ACM 29, 928–951. (doi:10.1145/322344.322346)
2. Storer JA. 1988 Data compression: methods and theory. New York, NY: Computer Science Press.
3. Drmanac R, Crkvenjakov C. 1992 Sequencing by hybridization (SBH) with oligonucleotide probes as an integral approach for the analysis of complex genomes. Int. J. Genome Res. 1, 59–79.
4. Pevzner PA, Lipshutz RJ. 1994 Towards DNA sequencing chips. In Mathematical Foundations of Computer Science 1994, Proc. 19th Int. Symp., MFCS’94, Košice, Slovakia, 22–26 August (eds I Prívara, B Rovan, P Ružička). Lecture Notes in Computer Science, vol. 841, pp. 143–158. Berlin, Germany: Springer. (doi:10.1007/3-540-58338-6_64)
5. Kerschbaum F. 2007 A new way to think about secure computation: language-based secure computation. In Proc. 5th Int. Workshop on Security in Information Systems, Funchal, Madeira, Portugal, June (eds MI Yagüe del Valle, E Fernández-Medina), pp. 33–42. Setúbal, Portugal: INSTICC Press.
6. Gusfield D. 1997 Algorithms on strings, trees, and sequences. Cambridge, UK: Cambridge University Press.
7. Maier D, Storer JA. 1977 A note on the complexity of superstring problem. Report no. 233. Princeton, NJ: Princeton University, Computer Science Laboratory.
8. Garey MR, Johnson DS. 1979 Computers and intractability: a guide to the theory of NP-completeness. San Francisco, CA: W.H. Freeman.
9. Rubinov AR, Timkovsky VG. 1998 String noninclusion optimization problems. SIAM J. Discrete Math. 11, 456–467. (doi:10.1137/S0895480192234277)
10. Jiang T, Timkovsky VG. 1995 Shortest consistent superstrings computable in polynomial time. Theor. Comput. Sci. 143, 113–122. (doi:10.1016/0304-3975(95)80027-7)
11. Na JC, Kim DK, Sim JS. 2009 Finding the longest common nonsuperstring in linear time. Inf. Process. Lett. 109, 1066–1070. (doi:10.1016/j.ipl.2009.06.010)
12. Jiang T, Li M. 1994 Approximating shortest superstrings with constraints. Theor. Comput. Sci. 134, 473–491. (doi:10.1016/0304-3975(94)90249-6)
13. Aho AV, Corasick MJ. 1975 Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 333–340. (doi:10.1145/360825.360855)
14. Bellman R. 1962 Dynamic programming treatment of the travelling salesman problem. J. ACM 9, 61–63. (doi:10.1145/321105.321111)
15. Held M, Karp RM. 1962 A dynamic programming approach to sequencing problems. J. Soc. Ind. Appl. Math. 10, 196–210. (doi:10.1137/0110015)
16. Henry-Labordere AL. 1969 The record balancing problem: a dynamic programming solution of a generalized travelling salesman problem. RAIRO Oper. Res. B2, 43–49.
17. Saskena JP. 1970 Mathematical model of scheduling clients through welfare agencies. J. Can. Oper. Res. Soc. 8, 185–200.
18. Bellman R. 1958 On a routing problem. Q. Appl. Math. 16, 87–90.
19. Ford LR, Fulkerson DR. 1962 Flows in networks. Princeton, NJ: Princeton University Press.
20. Cormen TH, Leiserson CE, Rivest RL, Stein C. 2009 Introduction to algorithms, 3rd edn. Cambridge, MA: MIT Press.
21. Punnen AP. 2007 The traveling salesman problem: applications, formulations and variations. In The traveling salesman problem and its variations (eds G Gutin, A Punnen). Combinatorial Optimization, vol. 12, pp. 1–28. Berlin, Germany: Springer. (doi:10.1007/0-306-48213-4_1)
22. Jiang T, Li M. 1993 On the complexity of learning strings and sequences. Theor. Comput. Sci. 119, 363–371. (doi:10.1016/0304-3975(93)90167-R)
23. Laporte G. 1992 The vehicle routing problem: an overview of exact and approximate algorithms. Eur. J. Oper. Res. 59, 345–358. (doi:10.1016/0377-2217(92)90192-C)
24. Bektas T. 2006 The multiple traveling salesman problem: an overview of formulations and solution procedures. Omega 34, 206–219. (doi:10.1016/j.omega.2004.10.004)
25. Öncan T, Altinel IK, Laporte G. 2009 A comparative analysis of several asymmetric traveling salesman problem formulations. Comput. Oper. Res. 36, 637–654. (doi:10.1016/j.cor.2007.11.008)
