Existence of constants in regular splicing languages.

HHS Public Access Author manuscript Author Manuscript

Inf Comput. Author manuscript; available in PMC 2016 June 01. Published in final edited form as: Inf Comput. 2015 June ; 242: 340–353. doi:10.1016/j.ic.2015.04.001.

Paola Bonizzoni (a) and Nataša Jonoska (b)

Paola Bonizzoni: [email protected]; Nataša Jonoska: [email protected]

(a) Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy

(b) Department of Mathematics and Statistics, University of South Florida, Tampa, FL, USA


Abstract

In spite of wide investigations of finite splicing systems in formal language theory, basic questions, such as their characterization, remain unsolved. It has been conjectured that a necessary condition for a regular language L to be a splicing language is that L must have a constant in the Schützenberger sense. We prove this longstanding conjecture to be true. The result is based on properties of strongly connected components of the minimal deterministic finite state automaton for a regular splicing language. Using constants of the corresponding languages, we also provide properties of transitive automata and path-automata.

1. Introduction

A splicing system, originally introduced in [12], is a formal model that uses a contextual cross-over operation over words to generate languages called splicing languages. This cross-over splicing formalizes the behavior of basic biomolecular processes involving cut and paste of DNA performed by restriction enzymes and a ligase. Restriction enzymes act on double-stranded DNA molecules by cleaving certain recognized segments, leaving short single-stranded overhangs. Molecules with the same overhangs can join (in a cross-over fashion) in the presence of a ligase enzyme. In the introductory paper, T. Head proved that if the splicing is performed by a finite set of certain simple rules, then splicing of a finite set of words can generate the class of strictly locally testable languages [9]. The splicing notion was reformulated by G. Păun at a less restrictive level of generality, giving rise to the splicing operation that is commonly adopted and appears nowadays as a standard [17].


Theoretical results in splicing systems have contributed to new research in formal language theory focused on modeling of biochemical processes [18]. On the other hand, the field suggested new ideas in the framework of biomolecular science, for example, the design of automated enzymatic processes. In this paper, we focus on finite splicing systems, here simply called splicing systems. A splicing system is meant to have a finite set of rules (modeling enzymes) applied to a finite set of initial strings (modeling DNA sequences). A splicing system (or H-system) is a triple H = (A, I, R), where A is a finite alphabet, I ⊆ A* is the initial language and R is the set of rules (see Section 4 for the definitions). The formal language generated by the splicing system is the smallest language containing I and closed under the splicing operation. There have been successes in characterizing certain subclasses of splicing languages, for example those generated by reflexive rules and those generated by symmetric rules [2]. Reflexivity and symmetry are natural properties for splicing systems because they ensure splicing of molecules cut with the same enzyme, as well as recombining of molecules resulting from the same type of cut [12]. The formal language of a general splicing system may have a set of rules R that is not necessarily symmetric, nor reflexive. Under the formal model, a splicing system is a generative mechanism for a language which belongs to a class that is a proper subclass of the regular languages. This basic result was first proved in [8], and later proved in several other papers by using different approaches (see for example [19,21]).

Correspondence to: Paola Bonizzoni, [email protected].


In spite of the vast literature on the topic, a structural characterization of the finite splicing systems is still an open problem, although decidability of regular splicing languages has been recently proved in [15]. On the other hand, progress has been made towards the characterization of certain subclasses of splicing systems. The authors of [11] prove that it is decidable whether a regular language is a reflexive splicing language and provide an example of a regular splicing language that is neither reflexive nor symmetric. A quite different characterization of reflexive symmetric splicing languages is given in [3] and it has been extended to the general class of reflexive regular languages in [4,5]. This characterization has been given by using the concept of a constant of a language introduced by Schützenberger [20].


In order to solve the open problem of characterizing the whole class of splicing languages, it seems necessary to understand the role of constants. Indeed, since the introduction of splicing languages it has been conjectured, more formally in [10] and in [11], that the existence of a constant is a necessary condition for a regular language to be a splicing language. In this paper we solve this longstanding open question by proving the conjecture true. This result is proved by investigating structural properties of connected components of the transition graph given by the minimal finite state automaton for a regular splicing language. More precisely, properties of the factor language of transitive components are related to the notion of synchronizing words [7]. Synchronizing words have been studied in automata theory for a long time and are of interest in both coding theory [1] and symbolic dynamics [16,14]. Our proof uses an old observation that a synchronizing word for an automaton is a constant for the language recognized by the automaton [20].


The paper is organized as follows. In Section 2 we introduce preliminary concepts, including the notion of a synchronizing word and a constant. In Section 3 we introduce the notion of a transitive automaton and a path-automaton, and show several results connecting terminal components of automata and synchronizing words. Moreover, we show a relationship between transitive languages, transitive automata, transitive components, and constants of the language. Then in Section 4 we recall the basic notion of a splicing system and revisit the notion of splicing rules of a splicing system by providing properties that are necessary in proving the main result of the paper. Finally in Section 5 we give examples of non-reflexive splicing languages, show a relationship between transitive languages and splicing languages and prove the main result of the paper. A preliminary extended abstract of this paper appeared in [6].

2. Preliminaries


We refer the reader to [13] for the background of automata theory, and assume some familiarity with the subject. Let A* be the free monoid over a finite alphabet A and let A+ = A* \ {1}, where 1 is the empty word.

A deterministic finite state automaton (DFA) is a 5-tuple $\mathcal{A} = (Q, A, I, T, \delta)$, where Q is a finite set of states, I ⊆ Q is the set of initial states, T ⊆ Q is the set of terminal (final) states and $\delta \subseteq Q \times A \times Q$ is the set of transitions such that for every q ∈ Q and every a ∈ A the set $\{q' \mid (q, a, q') \in \delta\}$ consists of at most one element. Given a deterministic finite state automaton $\mathcal{A}$, the set of transitions $\delta$ defines a partial action of A* on Q. It is generated by the maps a : Q → Q for a ∈ A, defined by a(q) = q′ iff q′ ∈ Q is the unique state with $(q, a, q') \in \delta$. We use the standard notation qa to denote q′. If such q′ does not exist, we write qa = ∅. Inductively, we extend the notation to words with qwa = (qw)a. Similarly, we write Qw for the image of the set Q under the map w : Q → Q defined by w(q) = qw. If qa is defined for all q ∈ Q and a ∈ A we say that $\mathcal{A}$ is complete.

A deterministic finite state automaton is usually depicted as a directed graph with vertices Q and a set of directed edges $\delta$. For an edge e = (q, a, q′) we say that q is its “start” state, q′ is its “end” state (also referred to as an end-point) and a is its label. A word w is accepted by an automaton $\mathcal{A}$ if there is a path with label w that starts at an initial state and ends at a terminal state. We denote by $L(\mathcal{A})$ the language recognized by $\mathcal{A}$, that is, the set of all words accepted by $\mathcal{A}$ [13].

Given a regular language L ⊆ A* it is well known that there is a unique minimal complete deterministic finite state automaton (mDFA) $\mathcal{A} = (Q, A, \{q_0\}, T, \delta)$ that recognizes L such that all other complete DFAs with one initial state that recognize L map homomorphically onto $\mathcal{A}$ [13]. This automaton is unique up to possible renaming of the states, i.e., up to an isomorphism. We reserve the notation $\mathcal{A}(L)$ to denote this automaton.

Given a language L, the language F(L) is the set of all factors of words in L, where x is a factor of a word w if w = zxy for z, y ∈ A*. We say L is factor-closed if F(L) = L. The right context of a word w ∈ A* with respect to a language L is defined by $R_L(w) = \{x \in A^* \mid wx \in L\}$. Symmetrically, the left context of w with respect to L is the set $L_L(w) = \{x \in A^* \mid xw \in L\}$.
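The partial action of words on states described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the transitions, the function names `act`, `image` and `accepts`, and the small example automaton are all hypothetical and do not come from the paper.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

State = Hashable
# Hypothetical encoding: (q, a) -> q', with at most one target per pair (determinism).
Delta = Dict[Tuple[State, str], State]


def act(delta: Delta, q: State, word: str) -> Optional[State]:
    """Return the state qw, or None when qw is undefined (qw = ∅ in the text)."""
    current: Optional[State] = q
    for a in word:
        if current is None:
            return None
        current = delta.get((current, a))
    return current


def image(delta: Delta, states: Iterable[State], word: str) -> Set[State]:
    """Return Qw, the set of all defined states qw for q in the given set."""
    targets = (act(delta, q, word) for q in states)
    return {t for t in targets if t is not None}


def accepts(delta: Delta, initial: Set[State], terminal: Set[State], word: str) -> bool:
    """A word is accepted if some initial state is taken to a terminal state."""
    return any(act(delta, q, word) in terminal for q in initial)


# Hypothetical example: a partial DFA recognizing a*b over the alphabet {a, b}.
delta = {(0, "a"): 0, (0, "b"): 1}
print(act(delta, 0, "aab"), act(delta, 0, "aba"), image(delta, {0, 1}, "a"))
```

Representing an undefined transition by a missing dictionary key keeps the automaton partial, which matches the use of trimmed automata later in the text.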


The right context of a state q in $\mathcal{A}$ is $R_{\mathcal{A}}(q) = \{x \in A^* \mid qx \in T\}$. An automaton $\mathcal{A}$ is said to be reduced if there are no two states in $\mathcal{A}$ with the same right context. Observe that the right context depends only on the terminal states in the automaton. In other words, if the initial state(s) are changed in $\mathcal{A}$ but the transitions and the set of terminal states remain, the right contexts of the states do not change. It is well known (see for example [13]) that given a regular language L, there is a one-to-one correspondence between the right contexts of words with respect to L and the right contexts of the states in the minimal deterministic finite state automaton $\mathcal{A}$ for L, i.e.,

$$R_L(w) = R_{\mathcal{A}}(q_0 w) \quad \text{for all } w \in A^*.$$

In fact, in the mDFA $\mathcal{A}$, it also holds that $R_L(w) = R_{\mathcal{A}}(q)$ iff $R_L(wa) = R_{\mathcal{A}}(qa)$ for all a ∈ A, and therefore $R_{\mathcal{A}}(q) = R_{\mathcal{A}}(q')$ implies q = q′.

When the language and the DFA are fixed, we drop the subscripts and write R(w) and R(q).


Note that every state in an mDFA is accessible, i.e., for each state q ∈ Q there is an x ∈ A* such that $q_0 x = q$. A state q is co-accessible if R(q) ≠ ∅. In an mDFA, there is at most one state that is not co-accessible, since for each q ∈ Q, there is u ∈ A* such that qu ∈ T iff R(q) ≠ ∅, and two states with the same (empty) right context coincide. If such a state in $\mathcal{A}$ exists, we call it zero and denote it by z. A trimmed mDFA for a language L is the DFA obtained from the mDFA $\mathcal{A}$ for L by erasing the state z and all transitions that terminate in z. The trimmed mDFA is denoted $\mathrm{trim}\,\mathcal{A}$. More generally, a trimmed DFA is an automaton in which all states are both accessible and co-accessible.

Finally, for a finite set S, we denote by #S the cardinality of S.

Definition 1—Given a DFA $\mathcal{A}$ and a state q of the automaton, the set of follower words for q relative to $\mathcal{A}$ is the set $F_{\mathcal{A}}(q) = \{x \mid qx \neq \emptyset\}$. For states q and q′ of $\mathcal{A}$, we say that they are follower-equivalent if $F_{\mathcal{A}}(q) = F_{\mathcal{A}}(q')$. For a state q the set of states in $\mathcal{A}$ that are follower-equivalent to q is denoted $\mu_q(\mathcal{A})$.

We say that a state q of $\mathcal{A}$ is minimal-follower with respect to $\mathcal{A}$ if whenever $F_{\mathcal{A}}(q') \subseteq F_{\mathcal{A}}(q)$ for a state q′ of $\mathcal{A}$, the states q and q′ are follower-equivalent.

Recall the definition of a constant of a language L introduced by Schützenberger in [20].

Definition 2—A word w ∈ A+ is a constant of a language L if w is a factor of some word in L and for all words u1, u2, v1, v2 in A* we have:

$$u_1 w u_2 \in L \ \text{ and } \ v_1 w v_2 \in L \ \Longrightarrow \ u_1 w v_2 \in L.$$

A characterization of constants, which is more or less folklore, is stated below.


Proposition 1—Let L ⊆ A* be a regular language and let $\mathcal{A}$ be the mDFA recognizing L. A word w ∈ A+ is a constant of L if and only if $Qw \setminus \{z\}$ is a singleton, i.e., there is a unique non-zero state $q_w$ such that $qw \neq z$ implies $qw = q_w$ for all q ∈ Q.

Suppose w is a label of a path in a finite state automaton. If there is a state $q_w$ such that every path in the automaton with label w terminates in $q_w$, we say that w is a synchronizing word and that $q_w$ is a synchronizing state, synchronized by w. By Proposition 1, in a trimmed mDFA, $\mathrm{trim}\,\mathcal{A}$, of a regular language L, the set of synchronizing words for $\mathrm{trim}\,\mathcal{A}$ coincides with the set of constants of L. In general, if w is a synchronizing word for an automaton $\mathcal{A}$ then it is a constant for the language recognized by $\mathcal{A}$.

The context of w with respect to L is the set $C_L(w) = \{(u, v) \mid u, v \in A^*,\ uwv \in L\}$. We define the left projection of the context of w (respectively, the right projection) as the set $C_L^{\ell}(w) = \{u \mid (u, v) \in C_L(w) \text{ for some } v\}$ (respectively, $C_L^{r}(w) = \{v \mid (u, v) \in C_L(w) \text{ for some } u\}$). A constant w of L defines a constant language Const(w) with respect to the language L, namely the set $C_L^{\ell}(w)\, w\, C_L^{r}(w)$. Given two constants w1 and w2 of L, a split language for w1 and w2 with respect to L is a language (possibly empty) $C_L^{\ell}(w_1)\, w_1'\, w_2'\, C_L^{r}(w_2)$, where $w_1'$ is a prefix (possibly empty) of w1 and $w_2'$ is a suffix (possibly empty) of w2.
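The test in Proposition 1 is easy to try out on a small automaton. The following Python sketch assumes a dictionary encoding of transitions; the function names and the example automaton (a complete mDFA for the language a*ba*, with zero state `"z"`) are hypothetical illustrations and are not taken from the paper.

```python
def image(delta, states, word):
    """Return Qw for a deterministic (possibly partial) transition dictionary."""
    current = set(states)
    for a in word:
        current = {delta[(q, a)] for q in current if (q, a) in delta}
    return current


def is_synchronizing(delta, states, word):
    """True if every path labeled `word` ends in one and the same state."""
    return len(image(delta, states, word)) == 1


def is_constant(delta, states, zero, word):
    """Proposition 1 test on a complete mDFA: Qw minus {z} is a singleton."""
    return len(image(delta, states, word) - {zero}) == 1


# Hypothetical example: complete mDFA of a*ba* (words with exactly one b).
STATES = {0, 1, "z"}
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 1, (1, "b"): "z",
    ("z", "a"): "z", ("z", "b"): "z",
}
# The trimmed automaton drops the zero state and the transitions into it.
TRIM = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1}

print(is_constant(DELTA, STATES, "z", "b"))   # b is a constant of a*ba*
print(is_constant(DELTA, STATES, "z", "a"))   # a is not: ab, ba in L but a not in L
print(is_synchronizing(TRIM, {0, 1}, "b"))    # b synchronizes the trimmed automaton
```

In this example the constants found on the complete mDFA and the synchronizing words of the trimmed automaton agree, as the paragraph following Proposition 1 states.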

3. Transitive components and synchronizing words

In this section we provide structural characterizations of transitive components in a minimal DFA using the notion of synchronizing words. We define the notions of a transitive automaton and of a path-automaton, and give properties that are used to prove the main result of the paper. We first introduce definitions and properties that are used in the rest of the paper.


Recall the notion of a transitive component in a deterministic automaton. A strongly connected component of the directed graph for a deterministic automaton $\mathcal{A}$ is called a transitive component for $\mathcal{A}$. If, in a transitive component, every edge that starts at a state in this component also ends in the same component, then the transitive component is called terminal. For every state in the mDFA of a language L, there is a path that leads from that state to a terminal component. For a transitive component $\mathcal{C}$, we say that $\mathcal{C}$ is induced by q if q is a state in $\mathcal{C}$. We write $L(\mathcal{C})$ for the set of labels of all paths in $\mathcal{C}$ and say that $\mathcal{C}$ recognizes $L(\mathcal{C})$. A transitive component is called trivial if $L(\mathcal{C}) = \{1\}$. A language L is said to be transitive if for every pair of words u, v ∈ L there is a word w ∈ A* such that uwv ∈ L. Note that for a transitive component $\mathcal{C}$ the language $L = L(\mathcal{C})$ is transitive.

Remark 1—Notice that if $\mathcal{C}$ is a transitive component, then $L(\mathcal{C})$ is factor-closed, i.e., $F(L(\mathcal{C})) = L(\mathcal{C})$.
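Transitive components and the terminal ones among them can be computed directly from the transition graph. The Python sketch below uses a simple mutual-reachability test rather than an optimized strongly-connected-components algorithm; the function names and the example automaton are hypothetical and only meant as an illustration.

```python
def reachable(delta, start):
    """Return all states reachable from `start` (including `start` itself)."""
    seen, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for (p, _a), r in delta.items():
            if p == q and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def transitive_components(delta, states):
    """Group states into strongly connected (transitive) components."""
    reach = {q: reachable(delta, q) for q in states}
    components, assigned = [], set()
    for q in states:
        if q in assigned:
            continue
        # p and q share a component iff each is reachable from the other.
        comp = frozenset(p for p in states if p in reach[q] and q in reach[p])
        components.append(comp)
        assigned |= comp
    return components


def is_terminal(delta, comp):
    """A component is terminal if no transition leaves it."""
    return all(r in comp for (p, _a), r in delta.items() if p in comp)


# Hypothetical example: a two-state cycle followed by a one-state loop.
DELTA = {(0, "a"): 1, (1, "a"): 0, (1, "b"): 2, (2, "c"): 2}
comps = transitive_components(DELTA, [0, 1, 2])
print(comps, [is_terminal(DELTA, c) for c in comps])
```

The quadratic reachability test is chosen for readability; for large automata one would normally substitute a linear-time algorithm such as Tarjan's.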


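Two transitive components $\mathcal{C}$ and $\mathcal{C}'$ are called factor-equivalent if $L(\mathcal{C}) = L(\mathcal{C}')$. In the following we often use the term component to denote a transitive component. A component $\mathcal{C}$ is said to be maximal for a collection of components C if for every transitive component $\mathcal{C}'$ in C, we have that $L(\mathcal{C}) \subseteq L(\mathcal{C}')$ implies $L(\mathcal{C}) = L(\mathcal{C}')$. Analogously, a transitive component $\mathcal{C}$ is called minimal for a collection C if whenever $L(\mathcal{C}') \subseteq L(\mathcal{C})$ for a component $\mathcal{C}'$ in C we have $L(\mathcal{C}') = L(\mathcal{C})$.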




3.1. Transitive automata


In this section we relate the notion of a synchronizing word to properties of a transitive automaton. An automaton is called transitive if it consists of only one transitive component.

Remark 2—Note that if $\mathcal{A}$ is transitive, then $L(\mathcal{A})$ is also transitive. Consider two words u, v ∈ $L(\mathcal{A})$. There are initial states $q_0$ and $q_0'$ such that $q_0 u, q_0' v \in T$. Since $\mathcal{A}$ is transitive, there is a word w that is a label of a path from $q_0 u$ to $q_0'$ in $\mathcal{A}$, so $uwv \in L(\mathcal{A})$.

Example 3.1—Consider the example shown in Fig. 1. This language is transitive, the automaton is reduced and deterministic, hence it is the mDFA for the language. However, there is no deterministic transitive automaton that recognizes this language. Notice that this language has no constants.


Remark 3—If L is transitive such that $L = L(\mathcal{C})$ for a transitive component $\mathcal{C}$, then for each state q in $\mathcal{C}$, $R_{\mathcal{C}}(q) = F_{\mathcal{C}}(q)$ since all states in $\mathcal{C}$ are terminal.

We consider several observations about transitive automata, transitive components and languages. The following observations are proved in [14] (see also [16]):

Lemma 2—For every regular factor-closed transitive language L there is a unique minimal deterministic transitive automaton recognizing L.

Lemma 3—For a regular factor-closed transitive language L and its unique minimal deterministic transitive automaton $\hat{\mathcal{A}}$ the following properties hold:

i. Every state in $\hat{\mathcal{A}}$ is synchronizing.

ii. A word w ∈ L is a constant for L if and only if w is synchronizing for $\hat{\mathcal{A}}$.

iii. Every two states q̂ and p̂ in $\hat{\mathcal{A}}$ (q̂ ≠ p̂) are not follower-equivalent.

iv. For every transitive DFA $\mathcal{A}$ with $L(\mathcal{A}) = L$ there is an onto homomorphism $\phi : \mathcal{A} \to \hat{\mathcal{A}}$ such that for every state q̂ in $\hat{\mathcal{A}}$, $F_{\mathcal{A}}(q) = F_{\hat{\mathcal{A}}}(\hat{q})$ for each $q \in \phi^{-1}(\hat{q})$.

Observe that if a state q of a transitive component $\mathcal{C}$ is synchronizing, then all states in $\mathcal{C}$ are synchronizing.

Consider the action of A* on the set of states Q of $\mathcal{A}$. In order to simplify the notation, the action of w on the set Q is denoted $\mathcal{A}w$ instead of $Qw$, and moreover we say that q is a state of $\mathcal{A}$, writing $q \in \mathcal{A}$, if q ∈ Q.


Remark 4—If c is a constant of $L(\mathcal{A})$ for a transitive automaton $\mathcal{A}$, and $\hat{\mathcal{A}}$ is the minimal transitive deterministic automaton for $L(\mathcal{A})$ such that c synchronizes onto q̂, then by Remark 3 and Lemma 3(ii–iv) every state q in $\mathcal{A}c$ maps with ϕ onto q̂, and has the same follower set as q̂. We say that q is follower-equivalent to q̂. In particular, if c is a constant such that qc = q in $\mathcal{A}$, then q̂c = q̂ in $\hat{\mathcal{A}}$, and for every $q' \in \mathcal{A}c$, the state q′c is in $\mathcal{A}c$ and is follower-equivalent to q̂, and to q.




Remark 5—If q is a state in a transitive automaton $\mathcal{A}$ and $\hat{\mathcal{A}}$ is the minimal transitive deterministic automaton as in Lemma 3 with $\hat{q} \in \hat{\mathcal{A}}$ follower-equivalent to q, then for all $q' \in \mu_q(\mathcal{A})$ there are constants c, c′ ∈ $L(\mathcal{A})$ such that q̂c = q̂c′ = q̂, q′c = q and qc′ = q′. Take any constant $c_1 \in L(\mathcal{A})$ such that $\hat{q} c_1 = \hat{q}$. Then by transitivity there are x, y such that $q c_1 x = q'$ and $q' c_1 y = q$. By Remark 4, $c = c_1 y$ and $c' = c_1 x$ are the constants sought.

If $\mathcal{A}$ is a transitive deterministic automaton without a synchronizing word, then there is some k ≥ 2 such that $\#\mathcal{A}w \geq k$ for every word w with $\mathcal{A}w \neq \emptyset$. We call the largest such k, that is, the minimum of $\#\mathcal{A}w$ over these words, the degree of $\mathcal{A}$. A word w such that $\#\mathcal{A}w = k$ is called k-synchronizing. In particular, the minimal transitive DFA for $L(\mathcal{A})$ has degree 1, and all constants coincide with (1-)synchronizing words. It follows from Lemmas 2 and 3 that if $\mathcal{A}$ has degree k and w is a constant that is k-synchronizing, then all states in $\mathcal{A}w$ are follower-equivalent. Moreover, in that case for all x ∈ A* with $\mathcal{A}wx \neq \emptyset$, wx is also k-synchronizing.
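The degree of a transitive automaton can be found by exploring the images of the full state set under all words. The Python sketch below is a minimal illustration, assuming the degree is read as the smallest size of a non-empty image; the function name `degree` and both example automata are hypothetical. The first example is a three-state cycle on the letter a, which is consistent with the right contexts listed in Example 3.2, and the second has the synchronizing word b.

```python
from collections import deque


def degree(delta, states, alphabet):
    """Smallest size of a non-empty image Qw, found by breadth-first search."""
    start = frozenset(states)
    seen = {start}
    queue = deque([start])
    best = len(start)
    while queue:
        current = queue.popleft()
        best = min(best, len(current))
        for a in alphabet:
            # Image of the current subset under the letter a (partial action).
            nxt = frozenset(delta[(q, a)] for q in current if (q, a) in delta)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best


# Hypothetical example 1: a three-state cycle on a single letter.
CYCLE = {("q1", "a"): "q2", ("q2", "a"): "q3", ("q3", "a"): "q1"}
# Hypothetical example 2: the letter b sends every state to state 0.
SYNC = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 0}

print(degree(CYCLE, {"q1", "q2", "q3"}, "a"))
print(degree(SYNC, {0, 1}, "ab"))
```

The search visits subsets of states, so it is exponential in the worst case; it is meant for small examples only.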


The following lemma relates the right contexts of states in a transitive automaton reached by reading a word w that is k-synchronizing.

Lemma 4—Let $\mathcal{A} = (Q, A, Q, Q, \delta)$ be a transitive DFA with degree k ≥ 2 and let $\mathcal{A}' = (Q, A, I, T, \delta)$ be a reduced DFA obtained from $\mathcal{A}$ by choosing a subset of states I as initial states, and some proper subset T of Q as terminal states. If w is k-synchronizing and q, q′ are distinct states in $\mathcal{A}w$, then $R_{\mathcal{A}'}(q) \setminus R_{\mathcal{A}'}(q') \neq \emptyset$.

Proof: Let k ≥ 2 be the degree of $\mathcal{A}$, w a k-synchronizing word, and $q_1, q_1' \in \mathcal{A}w$ with $q_1 \neq q_1'$. Since w is k-synchronizing, for all words z ∈ A*, either both $q_1 z$, $q_1' z$ are undefined or both are defined and $q_1 z \neq q_1' z$. In the rest of the proof we drop the subscript $\mathcal{A}'$ from $R_{\mathcal{A}'}$. Suppose $R(q_1) \subseteq R(q_1')$. Since $\mathcal{A}'$ is reduced, there is a word $x_1$ such that $q_1' x_1 \in T$ and $q_1 x_1 \notin T$. Set $q_2 = q_1 x_1$ and $q_2' = q_1' x_1$. Then $R(q_2) \subseteq R(q_2')$ because otherwise, if $y \in R(q_2) \setminus R(q_2')$, then $x_1 y \in R(q_1) \setminus R(q_1')$, which is a contradiction with our assumption that $R(q_1) \subseteq R(q_1')$. Since $\mathcal{A}$ is transitive, there is $x_2$ such that […]. We have that […], which implies that both […] are in T. In fact, […]. Denote […] with $q_3$, and set […]. Again, similarly as with $q_2$ and $q_2'$, […]. We continue in this way […] and consider the pairs of states […]. Since Q is finite, there are i and j such that […] for some i < j. But […]. Because $\mathcal{A}'$ is reduced, […]. Therefore, […]; this is a contradiction with the assumption that […].

Example 3.2—Consider the reduced automaton in Fig. 2(a). It contains two terminal components that are mutually factor-equivalent, both recognizing a*. Moreover, the states q1, q2 and q3 in Fig. 2(a) are follower-equivalent, and the component that contains these three states has degree 3. Every word $a^k$ for k ≥ 0 is 3-synchronizing. Consider the automaton $\mathcal{A}'$ that consists of {q1, q2, q3} with q1 being the initial and also the terminal state. Then $R(q_1) = (a^3)^*$, $R(q_2) = aa(a^3)^*$ and $R(q_3) = a(a^3)^*$.




Example 3.3—The transitive component $\mathcal{C}$ of the automaton in Fig. 2(b) has degree 2. All words that end with the symbol a are labels of paths that end in states q2 and q3, and all words that end with the symbol b are labels of paths that end in states q1 and q2. The action of c on the states of $\mathcal{C}$ is the identity. Hence every word that contains the symbol a or b is 2-synchronizing. Note that all states are follower-equivalent but $\mathcal{C}$ is reduced. Moreover, a ∈ R(q1) \ R(q2) and aa ∈ R(q2) \ R(q1), also aa ∈ R(q2) \ R(q3) and a ∈ R(q3) \ R(q2). However, c ∈ R(q1) \ R(q3), but R(q3) ⊂ R(q1). This last condition does not violate Lemma 4 because there are no 2-synchronizing words that label paths ending at states q1 and q3.

3.2. Path-automata


In this section we provide structural characterizations of path-automata that do not have synchronizing words. More precisely, we show that a path-automaton having no synchronizing words has a unique maximal component, which is the terminal one, whose language contains all factors of the language accepted by the path-automaton.

Definition 3 (Path-automaton)—An automaton $\mathcal{A}$ with an initial state q0 is called a path-automaton if the following is satisfied:

i. There is at most one transition in $\mathcal{A}$ which starts at the component induced by q0 and terminates in another component.

ii. There is only one terminal transitive component in $\mathcal{A}$.

iii. For every transitive component $\mathcal{C}$ which does not contain q0 there is precisely one transition that starts in a state outside $\mathcal{C}$ but terminates in $\mathcal{C}$, and if $\mathcal{C}$ is not terminal, there is precisely one transition that starts at a state in $\mathcal{C}$ but terminates in a state outside $\mathcal{C}$.
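The three conditions of Definition 3 can be checked mechanically by counting the transitions that cross between transitive components. The Python sketch below is one possible reading of the definition, under the assumptions that all states are accessible from q0 and that every transition target is listed among the states; the function names and example automata are hypothetical.

```python
def reachable(delta, start):
    """Return all states reachable from `start` (including `start` itself)."""
    seen, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for (p, _a), r in delta.items():
            if p == q and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def component_of(delta, states):
    """Map each state to its transitive (strongly connected) component."""
    reach = {q: reachable(delta, q) for q in states}
    return {q: frozenset(p for p in states if p in reach[q] and q in reach[p])
            for q in states}


def is_path_automaton(delta, states, q0):
    """Check conditions (i)-(iii) of the path-automaton definition."""
    comp = component_of(delta, states)
    out_deg = {c: 0 for c in comp.values()}
    in_deg = {c: 0 for c in comp.values()}
    for (p, _a), r in delta.items():
        if comp[p] != comp[r]:          # transition crossing two components
            out_deg[comp[p]] += 1
            in_deg[comp[r]] += 1
    initial = comp[q0]
    terminal = [c for c in out_deg if out_deg[c] == 0]
    if out_deg[initial] > 1:            # (i) at most one exit from the initial component
        return False
    if len(terminal) != 1:              # (ii) exactly one terminal component
        return False
    for c in out_deg:                   # (iii) one entrance, and one exit unless terminal
        if c == initial:
            continue
        if in_deg[c] != 1 or out_deg[c] > 1:
            return False
    return True


# Hypothetical examples: a loop at state 0 followed by a two-state cycle.
GOOD = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (2, "a"): 1}
BAD = dict(GOOD)
BAD[(0, "c")] = 2   # a second transition leaving the initial component
print(is_path_automaton(GOOD, [0, 1, 2], 0), is_path_automaton(BAD, [0, 1, 2], 0))
```

Counting crossing transitions per component is enough here because a terminal component is exactly one with no outgoing crossing transition.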


Let $\mathcal{A}$ be a path automaton and $\mathcal{C}$ one of its transitive components. The state of $\mathcal{C}$ that is the end point of the transition starting outside $\mathcal{C}$ but ending at $\mathcal{C}$ is called the entrance state for $\mathcal{C}$ and the state that is the start point of a transition that starts in $\mathcal{C}$ but terminates outside $\mathcal{C}$ is called the exit of $\mathcal{C}$. The initial component of $\mathcal{A}$ has no entrance, and the terminal component has no exit. A path π from an initial state in an automaton $\mathcal{A}$ to a terminal component in $\mathcal{A}$ induces a path-automaton $\mathcal{A}_\pi$ which consists of all transitive components in $\mathcal{A}$ induced by states visited by π.


Let $\mathcal{C}$ be a terminal component of the path-automaton $\mathcal{A}_\pi$, and let q be the entrance of $\mathcal{C}$. We define the language accepted by the component $\mathcal{C}$ induced by the path π, denoted by $L_\pi(\mathcal{C})$, as the language accepted by the automaton $\mathcal{C}$ with initial state q.

Lemma 5—Every trimmed deterministic path-automaton with two transitive components, and whose terminal component is trivial, has a synchronizing word.

Proof: Let q0 be an initial state for a path-automaton $\mathcal{A}$ with two transitive components having a trivial terminal component. Let x be the label of the edge starting from the initial component (say at state s) and ending at the terminal component (say at state t). Let $\mathcal{C}$ be the initial component for $\mathcal{A}$, and let k be the degree of $\mathcal{C}$. Let w be k-synchronizing for $\mathcal{C}$. Because $\mathcal{C}$ is transitive, we can extend w such that there is a path with label w that ends at s, i.e., $\#\mathcal{C}w = k$ and $s \in \mathcal{C}w$. Since k is the degree of $\mathcal{C}$, for all z ∈ A*, either $wz \notin L(\mathcal{C})$ or $\mathcal{C}wz \subseteq \mathcal{C}$ with $\#\mathcal{C}wz = k$. As $\mathcal{A}$ is deterministic, there is only one edge starting at s with label x and this edge leads outside $\mathcal{C}$. Therefore $wx \notin L(\mathcal{C})$, but $wx \in F(L(\mathcal{A}))$. Hence wx is a synchronizing word that synchronizes onto t.

We first give a technical lemma that is used later.

Lemma 6—Given a deterministic path automaton $\mathcal{A}$, let $\mathcal{C}$ be the terminal component of $\mathcal{A}$. If $\mathcal{A}$ has no synchronizing word, then $\mathcal{C}$ is the unique maximal transitive component in $\mathcal{A}$.

Proof: Assume $\mathcal{A}$ is an automaton with transition function δ that has no synchronizing words. Let $\mathcal{C}_1, \ldots, \mathcal{C}_k = \mathcal{C}$ be the transitive components of $\mathcal{A}$. Let $q_i^{\mathrm{in}}$ and $q_i^{\mathrm{out}}$ be the states in the component $\mathcal{C}_i$ (i = 2, …, k − 1) such that $q_i^{\mathrm{in}}$ is the entrance state of $\mathcal{C}_i$ and $q_i^{\mathrm{out}}$ is the exit state of $\mathcal{C}_i$. For i = 1 we only have $q_1^{\mathrm{out}}$ and for i = k we only have $q_k^{\mathrm{in}}$. We set $q_1^{\mathrm{in}} = q_0$ for the initial state q0, and $q_k^{\mathrm{out}} = q'$ for a fixed terminal state q′. For i = 1, …, k − 1, let $x_i$ be the label of the transition from state $q_i^{\mathrm{out}}$ to state $q_{i+1}^{\mathrm{in}}$, i.e., $\delta(q_i^{\mathrm{out}}, x_i) = q_{i+1}^{\mathrm{in}}$. Consider $L(\mathcal{C}_1), \ldots, L(\mathcal{C}_k)$. Because these languages are all transitive, there is a maximal transitive language among them. Assume $L_1, \ldots, L_s$ are all distinct maximal transitive languages such that for each j = 1, …, s, there is a transitive component $\mathcal{C}_{i_j}$ with $L_j = L(\mathcal{C}_{i_j})$. Then for each i = 1, …, s, there are words $w_i \in L_i$ such that $w_i \notin L_j$ if i ≠ j (as $L_i$ is maximal, for each j ≠ i there is $w_{ij} \in L_i \setminus L_j$, and due to the transitivity of $L_i$, there are $z_{ij}$ such that $w_i = w_{i1} z_{i1} w_{i2} \cdots z_{i,s-1} w_{is}$, the factor with j = i being omitted). Note that for each language $L_i$ there might be several transitive components that recognize it.


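We consider words $y_i$ (i = 1, …, k) such that $y_i$ is a label of a path in $\mathcal{C}_i$ from the entrance state $q_i^{\mathrm{in}}$ to the exit state $q_i^{\mathrm{out}}$ in the following way. (i) If $L(\mathcal{C}_i) = L_j$ is a maximal transitive language, then $w_j$ is a factor of $y_i$, and $y_i$ is a constant for $L(\mathcal{C}_i)$ which uniquely determines the follower-equivalence class of $q_i^{\mathrm{out}}$, meaning […]. This is always possible by Lemma 3 and the transitivity of $L(\mathcal{C}_i)$. (ii) If $L(\mathcal{C}_i)$ is not a maximal transitive language, then $y_i$ is a label of the shortest path between $q_i^{\mathrm{in}}$ and $q_i^{\mathrm{out}}$.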


Consider the word $y_1 x_1 \cdots y_{k-1} x_{k-1} y_k$. Let p be the smallest index in 1, …, k such that $L(\mathcal{C}_p)$ is maximal transitive and r be the largest index such that $L(\mathcal{C}_r)$ is maximal transitive. Then $u = y_p x_p y_{p+1} \cdots x_{r-1} y_r$ is the label of a path that starts at a maximal transitive component, visits all maximal transitive components, and terminates at the last maximal transitive component. Since $\mathcal{A}$ has no synchronizing words, there must be at least one more path in $\mathcal{A}$ with label u. But, by the choice of $y_i$ and Lemma 5, every path with label $y_p x_p$ must start in a transitive component recognizing $L(\mathcal{C}_p)$ and must have a transition with label $x_p$ leading outside the component, because $y_p x_p \notin L(\mathcal{C}_p)$. Let $i_1, i_2, \ldots, i_\nu$ be all indices between p and r such that $i_1 = p$, $i_\nu = r$ and $L(\mathcal{C}_{i_j})$ is a maximal transitive language. By the choice of the $y_i$'s, $y_{i_1} = y_p, y_{i_2}, \ldots, y_{i_\nu} = y_r$ uniquely determine the languages of the transitive components $\mathcal{C}_{i_1}, \ldots, \mathcal{C}_{i_\nu}$, that is, $y_{i_j} \in L(\mathcal{C}_{i_j})$ but $y_{i_j} \notin L(\mathcal{C}_{i_t})$ if j ≠ t. Therefore, there is a one-to-one correspondence between the order of appearance of $y_p, y_{p+1}, \ldots, y_r$ in u and the order of the maximal transitive components. Hence, the only possibility for existence of another path with label u is if such a path also starts at $\mathcal{C}_p$. Although there might be many paths with label $y_p$ in $\mathcal{C}_p$, by Lemma 3 they all end at follower-equivalent states, and due to determinism, there is at most one of those states that is the start of a transition with label $x_p$, and that is the exit state of $\mathcal{C}_p$ (by Lemma 5, $y_p x_p$ is synchronizing for […]). Hence, u (or $u x_r$) is a synchronizing word unless p = r = k, i.e., $x_p$ does not exist. As we assumed that there are no synchronizing words for $\mathcal{A}$, there is at most one maximal transitive language and it must be recognized by the terminal transitive component.

The following proposition characterizes a path-automaton with no synchronizing words.

Proposition 7—Given a deterministic path-automaton $\mathcal{A}$, let $\mathcal{C}$ be the terminal component of $\mathcal{A}$. Then one of the following holds:

a. $\mathcal{A}$ has a synchronizing word, or,

b. $F(L(\mathcal{A})) = L(\mathcal{C})$.

Proof: We prove the proposition by induction on the number of transitive components in the path automaton. If $\mathcal{A}$ consists of a single component, then the proposition holds trivially as $\mathcal{A} = \mathcal{C}$. Now assume that the proposition holds for all path automata with fewer than k transitive components, and suppose that $\mathcal{A}$ has k transitive components $\mathcal{C}_1, \ldots, \mathcal{C}_k$ with $\mathcal{C}_1$ being initial and $\mathcal{C}_k = \mathcal{C}$ terminal. Denote by $q_i^{\mathrm{in}}$ the entrance of $\mathcal{C}_i$ and by $q_i^{\mathrm{out}}$ the exit of $\mathcal{C}_i$. Consider the path automaton $\mathcal{A}'$ with initial state $q_2^{\mathrm{in}}$ and transitive components $\mathcal{C}_2, \ldots, \mathcal{C}_k$. As this path automaton has k − 1 components, by the inductive hypothesis, either $\mathcal{A}'$ has a synchronizing word, or $F(L(\mathcal{A}')) = L(\mathcal{C})$. Note that $L(\mathcal{C}) \subseteq F(L(\mathcal{A}))$ holds trivially, so we only consider the converse inclusion.

Case 1: The path automaton $\mathcal{A}'$ has a synchronizing word. Let y be the synchronizing word for the automaton $\mathcal{A}_1$ which consists of $\mathcal{C}_1$ with trivial terminal component consisting of $q_2^{\mathrm{in}}$. By Lemma 5, y exists, and we can assume that y synchronizes onto $q_2^{\mathrm{in}}$. Let w be synchronizing for $\mathcal{A}'$, and since for every state q in $\mathcal{A}'$ there is a path from $q_2^{\mathrm{in}}$ to q, we can assume that $q_2^{\mathrm{in}} w \neq \emptyset$. We observe that yw is synchronizing for $\mathcal{A}$. There is no path in $\mathcal{C}_1$ with label y, since y is synchronizing for $\mathcal{A}_1$, hence every path in $\mathcal{A}$ that has a label y terminates in a state in $\mathcal{A}'$. Since w is synchronizing for $\mathcal{A}'$, every path in $\mathcal{A}$ with label yw terminates in a single state. Thus yw is synchronizing for $\mathcal{A}$ and part (a) is satisfied.


Case 2: The path automaton $\mathcal{A}'$ (formed by the components $\mathcal{C}_2, \ldots, \mathcal{C}_k$) has no synchronizing word. Then by the inductive hypothesis, $F(L(\mathcal{A}')) \subseteq L(\mathcal{C})$. Assume that $\mathcal{A}$ has no synchronizing word. We show that all words in $F(L(\mathcal{A}))$ appear as labels of paths in $\mathcal{C}_k = \mathcal{C}$. As in Case 1, consider $\mathcal{A}_1$ which consists of $\mathcal{C}_1$ with trivial terminal component consisting of the entrance state $q_2^{\mathrm{in}}$ of $\mathcal{C}_2$. Let w be a label of a path in $\mathcal{A}$. If there is a path in $\mathcal{A}'$ with label w, then $w \in L(\mathcal{C})$.

Assume now that all paths with label w start in $\mathcal{C}_1$. If all paths with label w also end in $\mathcal{C}_1$ then, by Lemma 5, w is a factor of a word y that synchronizes onto $q_2^{\mathrm{in}}$ of $\mathcal{A}_1$, and hence y is synchronizing for $\mathcal{A}$, and the proposition holds.

Suppose there is a path in $\mathcal{A}$ with label w that starts at $\mathcal{C}_1$ and terminates in $\mathcal{A}'$. We observe that in this case also, $\mathcal{A}$ has a synchronizing word. Let u be the shortest word such that w = uxv where x is a symbol, […] and […]. Let $c \in L(\mathcal{C}_1)$ be a constant for $L(\mathcal{C}_1)$ that fixes the follower-equivalence class of the exit state $q_1^{\mathrm{out}}$ of $\mathcal{C}_1$, meaning […]. Such c exists by Lemma 2 and Lemma 3. By transitivity of $\mathcal{C}_1$, there is a word c′ such that cc′u also fixes the follower-equivalence class of $q_1^{\mathrm{out}}$ and is a label of a path that terminates at $q_1^{\mathrm{out}}$. Consider cc′w = cc′uxv. Then cc′ux synchronizes onto $q_2^{\mathrm{in}}$ in $\mathcal{A}_1$, by Lemma 5. But cc′w is not synchronizing for $\mathcal{A}$, hence there must be another path in $\mathcal{A}$ with label cc′w, and by our assumption, it starts in $\mathcal{C}_1$ and must terminate in $\mathcal{A}'$. Such a path must use the transition $(q_1^{\mathrm{out}}, x, q_2^{\mathrm{in}})$, either with a portion of the path labeled cc′ or with a portion labeled w. In the first case w is a label of a path in $\mathcal{A}'$, hence $w \in L(\mathcal{C})$. In the second case, there must be u′ and v′ such that cc′w = cc′u′xv′ = cc′uxv. Since u was the shortest word such that […], it must be that u = u′, in which case cc′ux is synchronizing for $\mathcal{A}$. It is impossible that u is a proper prefix of u′ because this would imply $\mathcal{C}_1 cc'ux \subseteq \mathcal{C}_1$, which would contradict the fact that cc′ux synchronizes onto $q_2^{\mathrm{in}}$ in $\mathcal{A}_1$.

Example 3.4—The automaton in Fig. 3 is a path-automaton with no synchronizing words. It has only one terminal component which is maximal and the factors of all words in the language are labels of paths in the terminal component. This illustrates the situation (b) in Proposition 7.


The following result is used to prove the main result (Theorem 15, Section 5.3) of the paper.

Proposition 8—Let L be a regular language, x ∈ F(L) and let $\mathrm{trim}\,\mathcal{A}$ be the trimmed mDFA for L. At least one of the two cases holds:

i. x is a factor of a constant for L,

ii. there is a path-automaton induced by a path of $\mathrm{trim}\,\mathcal{A}$ containing a path labeled x and having a non-trivial terminal transitive component with at least two states.


Proof: Let trim = (Q, A, {q0}, T, ) be the trimmed mDFA for the language L. Suppose x ∈ F(L) is not a factor of a constant, i.e., for every v, v′ ∈ A*, vxv′ is not a constant for L, and therefore not synchronizing for trim . Consider a word w such that #Q xw = min#{Q xu|u ∈ A*} and let Pw = Q xw. Since xw is not synchronizing, by Proposition 1, #Pw > 1. Then for every word u ∈ A* we have that either Q xwu = ∅ or #Q xwu = #Pw. Therefore, we can assume that all states in Pw are in terminal components of trim , (if not, we can concatenate w with words that are labels of paths that lead to terminal components). If all terminal components in trim are trivial, then because trim is reduced, there is only one trivial terminal transitive component implying #Pw = 1, which is a contradiction with our assumption that x does not extend to a constant. Thus there must be at least one terminal




transitive component which is not trivial. If there is a state in Pw that belongs to a component that is not a single state component then (ii) holds. Assume to the contrary that each state in Pw is in a distinct transitive component consisting of only one state having loops at itself. Let y be a label of one of these loops. Since Pw y ≠ ∅ implies Pw y = Pw, we have qy = q for every q ∈ Pw. This means that all states in Pw are terminal, their loops must have the same labels, and therefore their right contexts are equal. Hence the states in Pw cannot be distinct in a reduced automaton. This again implies that Pw has cardinality 1, a contradiction. Hence, there must be at least one state in Pw that belongs to a terminal transitive component with at least two states.

4. Splicing languages and properties of splicing rules


As mentioned, in this paper we consider the general notion of the splicing operation and the splicing system given by Paun [17], as defined below. Definition 4—A finite splicing system is a triple S = (A, I, R) where, I ⊂ A* is a finite set of strings, called an initial language, R is a finite set of splicing rules of the form r = (u1, u2) (u3, u4), with ui ∈ A* for i = 1, 2, 3, 4. Given two words x = x1u1u2x2, y = y1u3u4y2, with x1, x2, y1, y2 ∈ A* and a rule r = (u1, u2) (u3, u4), the splicing rule produces w = x1u1u4y2 denoted (x, y) ⊢r w. We also say that u1u2, u3u4 are splice sites of r and u1u4 is the paste site of r. To simplify the notation, in the following, by a splicing system we mean a finite splicing system.


Let L ⊆ A*. We denote σ(L) = {w ∈ A*|(x, y) ⊢r w, x, y ∈ L, r ∈ R}. The (iterated) splicing operation is defined as follows: σ^0(L) = L, σ^{i+1}(L) = σ^i(L) ∪ σ(σ^i(L)), i ≥ 0. Finally, σ*(L) = ⋃i≥0 σ^i(L).

Definition 5 (Splicing language)—Given a finite splicing system S = (A, I, R), the language L(S) = σ*(I) is the language generated by S. A language L is a splicing language if there is a splicing system S such that L = L(S).

For a word w and a set of states Q, the right context of Q w is the union of the right contexts of qw over all q ∈ Q.
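The splicing operation of Definition 4 and the iteration σ can be sketched in a few lines of Python. This is only an illustrative sketch: the names `splice`, `sigma` and `bounded_closure` are ours, rules are encoded as pairs of pairs of strings with `""` for the empty word, and the length bound `max_len` is an assumption introduced to make the iteration terminate. Because words longer than the bound are discarded, `bounded_closure` returns a subset of σ*(I) restricted to short words, not necessarily all of it.

```python
from itertools import product


def splice(x, y, rule):
    """Return all w with (x, y) |-r w for the rule r = ((u1, u2), (u3, u4))."""
    (u1, u2), (u3, u4) = rule
    site_x, site_y = u1 + u2, u3 + u4
    results = set()
    # Try every occurrence of the splice site u1u2 in x ...
    for i in range(len(x) - len(site_x) + 1):
        if not x.startswith(site_x, i):
            continue
        prefix = x[: i + len(u1)]  # x1 u1
        # ... against every occurrence of the splice site u3u4 in y.
        for j in range(len(y) - len(site_y) + 1):
            if y.startswith(site_y, j):
                suffix = y[j + len(u3):]  # u4 y2
                results.add(prefix + suffix)
    return results


def sigma(language, rules):
    """One splicing step: all words obtained from pairs of words of `language`."""
    return {
        w
        for rule in rules
        for x, y in product(language, repeat=2)
        for w in splice(x, y, rule)
    }


def bounded_closure(initial, rules, max_len):
    """Iterate sigma, keeping only words of length <= max_len (our assumption)."""
    current = {w for w in initial if len(w) <= max_len}
    while True:
        new = {w for w in sigma(current, rules) if len(w) <= max_len} - current
        if not new:
            return current
        current |= new
```

For instance, with the hypothetical toy system I = {ab} and the single rule (a, 1)(1, a), `bounded_closure({"ab"}, [(("a", ""), ("", "a"))], 5)` collects the words a^n b with 1 ≤ n ≤ 4.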

Definition 6 (Paste site at p)—Consider a DFA with state set Q for a regular splicing language L. The word u1u4 is said to be a paste site at a state p ∈ Q for a splicing rule r = (u1, u2)(u3, u4) if the right context of Q u3u4 is included in the right context of pu1u4, and pu1u2 ≠ ∅.


More precisely, the notion of a paste site at a state p is used to identify states of the automaton where a rule can be applied. Fig. 4 depicts the situation for a paste site at state p. The dotted path with label u3 may not exist in the automaton, but the right context of qu3u4 (wherever a path with such a label exists) must be included in the right context of pu1u4. In what follows we assume that every splicing system is such that all rules are applied at least once during the generation of the splicing language. The following lemma shows an




equivalence between splicing systems with respect to the extension of sites and paste sites of rules.

Lemma 9—Let S = (A, I, R) be a finite splicing system and r = (u1, u2)(u3, u4) be a splicing rule in R. Let c ∈ A*. Then L(S) is the language generated with the splicing system S′ = (A, I, R′) where R′ = R ∪ {r′} for r′ = (u1, u2)(u3, u4c).

Proof: It is clear that L(S) ⊆ L(S′) since R′ contains R. The converse also holds since whenever we have (x, y) ⊢r′ w we also have (x, y) ⊢r w.

Lemma 10—Let S = (A, I, R) be a finite splicing system and a DFA for L = L(S). If u1u4 is a paste site at state p for a rule r = (u1, u2)(u3, u4) ∈ R then for every c ∈ A* with pu1u4c ≠ ∅, u1u4c is a paste site at p for a rule r′ = (u1, u2)(u3, u4c).


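Proof: Suppose that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4), and let pu1u4c ≠ ∅. Then by Lemma 9, L is also generated by the splicing system S′ = (A, I, R′) for the set of rules R′ = R ∪ {r′} where r′ = (u1, u2)(u3, u4c). The first splice site of r′ equals that of r, thus pu1u2 ≠ ∅. It only remains to show that the right context of Q u3u4c is included in the right context of pu1u4c. But if y is in the right context of Q u3u4c, then cy is in the right context of Q u3u4, which is included in the right context of pu1u4, and so y is in the right context of pu1u4c. It follows that u1u4c is a paste site at state p for rule r′.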

5. Splicing languages must have a constant

5.1. Reflexive and non-reflexive splicing languages


It is known that every splicing language generated by a finite splicing system is always regular [8,19]. More precisely, regular splicing languages form a proper subclass of the class of regular languages.


Recall that a splicing system S is said to be reflexive if for every rule r = (u1, u2)(u3, u4) in R, both (u1, u2)(u1, u2) and (u3, u4)(u3, u4) are rules in R. A language L is said to be a reflexive splicing language if there is a reflexive splicing system S such that L = L(S). It is said that S is symmetric if (u1, u2), (u3, u4) being in R implies that (u3, u4), (u1, u2) is in R. The notion of a constant of a language turned out to be essential in providing a characterization of the class of reflexive regular splicing languages [11,3]. Indeed, a fundamental property of a reflexive regular splicing language L is that there exists a splicing system generating L that has rules whose splicing sites consist of constants for the language L. A more precise characterization shows that the class of reflexive and symmetric splicing languages is equivalent to a class of regular languages, the so-called PA-con-split languages [3]. This result has been extended to the non-symmetric case [4]. In [4], it is shown that each language L in this class is constructed from a finite set of constants for L, as L is expressed as a union consisting of a finite set X, and a finite union of constant and split languages (see end of Section 2). The characterization is given by the following proposition.




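Proposition 11. (See [4].)—A regular language L is a reflexive splicing language if and only if there is a finite set X ⊂ A*, a finite set of constants K1 of L and a finite set K2 of pairs of constants of L such that L is the union of X, of the constant languages of the constants in K1, and of the split languages of the pairs in K2.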

The characterization of reflexive languages in Proposition 11 helps to describe factor-closed transitive regular languages as reflexive splicing languages. Proposition 12—If L is a factor-closed transitive regular language then L is a reflexive splicing language.


Proof: Since L is factor-closed, by Lemma 2, consider its minimal deterministic transitive automaton . By Lemma 3, all states in are synchronizing. Every word w ∈ L is a label of a path in that passes through some state q, hence w = w′w″ where w′ is in the left context of a constant that labels a path starting at q and w″ is in the right context of a constant that synchronizes onto q. Because all states in are initial and terminal, we have that and for every constant w of L. Let M be a set that consists of constants by choosing one constant mq for each state q in that is a label of a path starting and ending at the state q. Then L = ⋃mq∈M Split(mq, mq) where and splitting is performed by taking the empty prefix and the empty suffix of mq. The conclusion follows directly from the characterization in Proposition 11.


An example of non-reflexive regular splicing language is given in [11]; this is the language L = a+b+a+b+a+ ∪ a+b+a+. Example 5.1—The path-automaton of Fig. 5 generalizes the example of regular nonreflexive language given in [11]. More precisely, it is possible to show, similarly as in [11], that any splicing system for the language Lk = ⋃1≤i≤k(a+b+)ia+ with k ≥ 3 must have a rule whose both splice sites are not constants for the language. A splicing system S for Lk can be defined with an initial language Ik = ∪1≤i≤k{a(ba)i, a2(ba)i, (ab)ia2} and rules:


The proof that such splicing system generates the language Lk is along the same lines of the ones given in [11]. Observe that both splice sites of rules r3,i = (a, (ba)i)((ab)k−i, ab), for i > 1, are not constants for the language Lk. More precisely, rules r1,k and r2,k are used to increase the initial and final number of a’s in language a+(ab)ka+, respectively. Rules r3,i are used to increase the number of a’s in the (k − i)th appearance of a’s in (a+b+)ka+, for i ≤ k.




Similarly, rules r4,i are used to increase the number of b’s in language Lk. The rules r3,i are also used to obtain (a+b+)ja+, for j < k. The following lemma shows another example of a non-reflexive splicing language whose trimmed mDFA is not a path-automaton (Fig. 6).

Lemma 13—The regular language L = b(a3)* + cba* + da(a3)* is a non-reflexive splicing language.


Proof: First we note that L ⊆ A*, for A = {a, b, c, d}, is splicing. A splicing system S = (A, I, R) for language L consists of rules R = {r1 = (cba, 1)(cb, a), r2 = (daa3, 1)(da, 1), r3 = (b, a3)(da, 1)} and the initial language I = {ba3, b, cba, cb, daa3, da}. By induction on the number k of iteration steps of splicing rules, we first show that L(S) ⊆ L. If k = 0, since I ⊆ L, the inclusion holds. Assume that w ∈ L(S) is generated with k > 0 iterations by applying a rule r to a pair of words w1, w2 ∈ L(S). Since w1 and w2 are obtained with at most k − 1 iterations, by induction w1, w2 ∈ L. Checking splice sites in w1 and w2 for all of the rules, it is immediate to see that w ∈ L. In order to show that L ⊆ L(S), we observe that language L1 = da(a3)* is generated by rule r2 applied to words of the same language, starting from the initial words da and daa3. Similarly, we see that language L2 = cba* is generated by rule r1 starting from words of the same language. Language L3 = b(a3)* is generated by rule r3 applied to words of language b(a3)* and of language da(a3)*. Indeed, by induction on i ≥ 0 we can observe that b(a3)i ∈ L(S). If i = 0 or i = 1, since b, ba3 ∈ I, the result is immediate. Otherwise, given the word b(a3)i−1 ∈ L(S), for i > 1, and the word da(a3)i ∈ L(S), rule r3 immediately generates the word b(a3)i ∈ L(S).
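The two inclusions of this proof can be sanity-checked on short words. The sketch below is not part of the proof: it encodes the rules r1, r2, r3 and the initial language I as Python strings (with `""` for the empty word 1 and `aaa` for a3), iterates splicing up to an assumed length bound of 11, and compares the result with the words of L of that length. The helper names and the bound are ours. The last lines illustrate, on one particular instance, the kind of word a reflexive rule would produce; the rule `(b, a3)(b, a3)` is used only as an example of a reflexive rule whose splice site is a factor of b(a3)*.

```python
import re
from itertools import product

# L = b(a^3)* + cba* + da(a^3)*, as a Python regular expression.
L_PATTERN = re.compile(r"b(aaa)*|cba*|da(aaa)*")

# Rules and initial language from the proof; "" encodes the empty word 1.
RULES = [
    (("cba", ""), ("cb", "a")),    # r1
    (("daaaa", ""), ("da", "")),   # r2
    (("b", "aaa"), ("da", "")),    # r3
]
INITIAL = {"baaa", "b", "cba", "cb", "daaaa", "da"}

# Hypothetical reflexive rule, used only for the counterexample below.
REFLEXIVE_RULE = (("b", "aaa"), ("b", "aaa"))


def in_L(word):
    return L_PATTERN.fullmatch(word) is not None


def splice(x, y, rule):
    """All w with (x, y) |-r w for r = ((u1, u2), (u3, u4))."""
    (u1, u2), (u3, u4) = rule
    out = set()
    for i in range(len(x) - len(u1 + u2) + 1):
        if x.startswith(u1 + u2, i):
            for j in range(len(y) - len(u3 + u4) + 1):
                if y.startswith(u3 + u4, j):
                    out.add(x[: i + len(u1)] + y[j + len(u3):])
    return out


def bounded_closure(initial, rules, max_len):
    """Iterated splicing restricted to words of length <= max_len."""
    current = set(initial)
    while True:
        new = {
            w
            for r in rules
            for x, y in product(current, repeat=2)
            for w in splice(x, y, r)
            if len(w) <= max_len
        } - current
        if not new:
            return current
        current |= new


def words_of_L(max_len):
    """Enumerate the words of L of length <= max_len from its definition."""
    words = set()
    for i in range(max_len + 1):
        words.update({"b" + "aaa" * i, "cb" + "a" * i, "da" + "aaa" * i})
    return {w for w in words if len(w) <= max_len}


closure = bounded_closure(INITIAL, RULES, 11)
# A reflexive rule applied to x = b a^3 and y = c b a^3 a leaves the language.
escaped = splice("baaa", "cbaaaa", REFLEXIVE_RULE)
```

With this bound, every word in `closure` matches `L_PATTERN`, and `escaped` contains the word b a^4, which is not in L. A bounded check of this kind can only support the proof, not replace it.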


Finally, notice that language L is not reflexive, that is, it cannot be generated by a splicing system with reflexive splicing rules. Suppose L is a reflexive splicing language generated by a reflexive system S. We obtain a contradiction by considering generation of words in language b(a3)*. Since in language L the only words that start with a b are those in language b(a3)* and there must be splicing rules in S to generate words of the form b(a3)k for arbitrarily large k, there must be a rule r with splice site u1u2 that is a factor of b(a3)*. But, because S is reflexive, S must also contain a rule (u1, u2)(u1, u2). Then this rule can be applied to x = b(a3)k and y = cb(a3)ka, for some large k > 0, to generate a word w = b(a3)ka ∉ L. Therefore the language L cannot be generated by reflexive rules.

5.2. Canonical and special words

The proof of the main result (Theorem 15) is based on special words in a regular splicing language L that must be generated by a splicing rule whose splice site u3u4 is a constant of L. For lack of a better name, we call these words q-canonical and k-special words.


Informally, the q-canonical word of a component is a word c such that qc = q and every such path with label c crosses all states in the component, and moreover, the word c is able to identify the language L( ) of the component .

Definition 7 (q-Canonical)—Let q be a state of a component of an automaton. A word c ∈ A+ that belongs to the language of the component and satisfies qc = q is called q-canonical for the component with respect to the automaton if, whenever c belongs to the language of another component of the automaton, the language of the first component is included in the language of that other component.

In the following we show the existence of a q-canonical word for every state in every transitive component in . We give a constructive proof based on the notion of a k-special word for L( ) as defined below.

Definition 8 (k-Special)—Let be an automaton. A word c in L( ) is k-special for the language L if every word of F(L) of length ≤k is a factor of c.

Example 5.2—Consider the automaton of Fig. 6. Then the word a3 is q2-canonical for the terminal component consisting of states q2, q3, q4. Given the language L = L( ), then the word a3 is k-special for the language L for k ≤ 3.
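The k-special condition of Definition 8 is easy to test on a finite sample of words. The following sketch is ours and makes two assumptions: the factor set F(L) is approximated by the factors of a finite sample of L, and, following Example 5.6, the language of the terminal component of Fig. 6 is taken to be a*. The function names are hypothetical.

```python
def factors_up_to(words, k):
    """All factors (contiguous substrings) of length <= k of the given words."""
    found = set()
    for w in words:
        for i in range(len(w) + 1):
            # j ranges over end positions giving a factor of length 0..k
            for j in range(i, min(len(w), i + k) + 1):
                found.add(w[i:j])
    return found


def is_k_special(c, sample, k):
    """True if every factor of length <= k of the sample is a factor of c."""
    return all(f in c for f in factors_up_to(sample, k))


# Finite sample of a*, standing in for the language of the component.
SAMPLE = {"a" * n for n in range(8)}
```

Under these assumptions, `is_k_special("aaa", SAMPLE, 3)` holds while `is_k_special("aaa", SAMPLE, 4)` fails, in line with the bound k ≤ 3 of Example 5.2.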


Lemma 14—Given a non-trivial transitive component in a DFA , let k = (#Q)². Then for every state q in there is a q-canonical word of that is a k-special constant of L( ).

Proof: Let {x1,…, xn} = L( ) ∩ A≤k. Being a transitive component, there are y1,…, yn−1 such that x1 y1x2 ···yn−1xn ∈ L( ). Set c = x1 y1x2 ···yn−1xn. Due to transitivity, for every q ∈ there are yq and y′q such that wq = yq c y′q is a label of a path that starts and ends at q. By Remark 5, yq, y′q can be chosen so that wq is a constant. We show that wq is q-canonical. Assume that wq ∈ L( ) for some transitive component . Take the shortest word z ∈ L( ) \ L( ). Since L( ) \ L( ) = L( ) ∩ (L( ))^c can be recognized by an automaton with at most #Q ( ) · #Q ( ) ≤ k states [13], the shortest word in this language has length at most k. Thus |z| ≤ k and therefore z must be a factor of c, i.e., z must be in L( ), contradicting the existence of z.

5.3. Proof of the main result

Considering the importance of constants in characterization of sub-classes of regular splicing languages, it has been conjectured that every splicing language must have a constant [10,11]. Our main result proves this conjecture to be true.

Theorem 15 (Main result)—If L is a regular splicing language, then L has a constant.

Example 5.3—The path-automaton of Fig. 3 has no synchronizing word (see Example 3.4) and thus the language L( ) = a*c(c*ac*a)* has no constant. By Theorem 15, L( ) is not a regular splicing language.
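The absence of synchronizing words in Example 5.3 can be checked mechanically by exploring the sets Q w for all words w. The sketch below is an illustration under stated assumptions: the four-state transition table `DELTA` is our own construction of a trimmed deterministic automaton for a*c(c*ac*a)* and is not copied from Fig. 3, and the function names are ours. The function `minimal_rank` performs a breadth-first search over the subsets reachable from the full state set and returns the smallest non-empty one in size; a synchronizing word exists exactly when this value is 1.

```python
import re
from collections import deque
from itertools import product

ALPHABET = "ac"
# Assumed DFA for a*c(c*ac*a)*: state 0 is initial, state 1 is terminal.
DELTA = {
    (0, "a"): 0, (0, "c"): 1,
    (1, "a"): 2, (1, "c"): 3,
    (2, "a"): 1, (2, "c"): 2,
    (3, "a"): 2, (3, "c"): 3,
}
STATES = frozenset(range(4))
START, FINAL = 0, {1}


def accepts(word):
    """Run the DFA from the initial state and test for a terminal state."""
    state = START
    for letter in word:
        state = DELTA.get((state, letter))
        if state is None:
            return False
    return state in FINAL


def image(states, letter):
    """The set of states reached from `states` by reading `letter`."""
    return frozenset(
        DELTA[(q, letter)] for q in states if (q, letter) in DELTA
    )


def minimal_rank(states=STATES):
    """Smallest size of a non-empty set Q w over all words w."""
    seen = {states}
    queue = deque([states])
    best = len(states)
    while queue:
        current = queue.popleft()
        best = min(best, len(current))
        for letter in ALPHABET:
            nxt = image(current, letter)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best
```

For this table the search never reaches a singleton, so no word is synchronizing, which agrees with the example. The same search applies to any partial DFA given as a transition dictionary.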


Example 5.4—The transitive regular language L recognized by the automaton in Fig. 1 has no constants. By Theorem 15, the language L is not a splicing language.

Example 5.5—The regular language L = b(a3)* + cba* + da(a3)* is another example of a non-reflexive splicing language, as proved in Lemma 13. Fig. 2(a) shows the trimmed mDFA graph for language L. Observe that not every path-automaton induced by a path in the mDFA from the initial state q0 to a terminal component necessarily has a constant of L.




Indeed, the path-automaton in Fig. 2(a) recognizing language b(a3)* ⊂ L does not have any constant of the language L because every word in b(a3)* is also a substring of a word in cba* and therefore is not a synchronizing word for the automaton of L. Given a splicing regular language L, the proof of Theorem 15 shows existence of a splicing rule r = (u1, u2)(u3, u4) such that the word u1u4 ends in a non-trivial terminal component of the trimmed mDFA trim . More precisely, u1u4 ends in a state which we show to be synchronizing for the automaton trim . Let L be a regular splicing language and let trim = (Q, A, {q0}, T, ) be the trimmed mDFA for language L. We introduce some basic notations that are used in the proof. We are interested in states of the automaton trim that are found as follows.


Consider a non-trivial terminal component that is minimal among the non-trivial terminal components in the automaton trim . If a non-trivial component does not exist, then by Proposition 8, trim must have a constant and Theorem 15 holds. Let q ∈ be a minimal-follower state with respect to and recall that with μq( ) we denote the set of states in that are follower-equivalent to q. Let C = { = , , ···, } be the set of all terminal components of the automaton that are factor-equivalent to . Consider the set follower sets in (note that by Lemma 3, for each i = 1,…, k, the collection of coincides with the collection of follower sets in ).

Then, a candidate state of trim is a state q̄ ∈ F with q̄ ∈ for some component ∈ C such that the right context of q̄ is minimal in the following sense: for all q ∈ F, whenever the right context of q is included in the right context of q̄, the two right contexts are equal, i.e., since trim is reduced, q = q̄.


The main idea of the proof is to show that either the automaton has no non-trivial components, and in this case a constant exists (see Proposition 8), or there exists a candidate state that is synchronizing for the automaton trim .

Example 5.6—Consider the automaton in Fig. 6. Observe that the minimal terminal component induced by state q2 has language L( ) = a*, with L( ) = {a3}*, and is factor-equivalent to the component induced by the state q1. Then the set F, corresponding to the candidate component , is the set of states F = {q1, q2, q3, q4} because all these states belong to only one follower-equivalence class, which is therefore the minimal follower-equivalence class. Then the candidate states are q2, q3, q4. We will use the following lemma.


Lemma 16—Let q̄ ∈ be a candidate state and let q̄1 ∈ F ∩ . Then q̄1 is also a candidate state.

Proof: Let be the minimal deterministic transitive automaton for ∈ C from Lemma 3. Suppose q̄ ∈ is a candidate state and let q̄1 ∈ F ∩ . Let further q̂ ∈ be the follower-equivalent state to q̄. Then by Remark 5 there is a constant c of L( ) such that q̂c = q̂ and q̄1c = q̄.




Let q′ ∈ F. First, suppose that q′ ∈

for some

Because c is a constant, by Remark 4,

≠ . Consider

such that

.

is follower-equivalent to q̂, so we have

.

Because q̄ is a candidate state, there are two possibilities (a) right contexts of q̄ and incomparable, i.e.,

and

, or (b)

(equality cannot hold because trim must be a Therefore, in both cases Also, if q′ ∈

(q′) ⊈

∩ F then by Lemma 4,

are

is reduced). In both cases here

. Then q′cz is a terminal state while q̄1cz is not. (q̄1). (q′) \

(q̄1) ≠ ∅.

Therefore q̄1 is a candidate state.


Before we present details of the proof of Theorem 15 we outline the steps involved in the proof by illustrating the situation in Example 5.6 shown in Fig. 6.

i. In trim we identify a candidate state q̄ within a non-trivial component C̄ as outlined above. (For Example 5.6 we choose state q2.)

ii. We consider a q̄-canonical word c and observe that there must be a rule (u1, u2)(u3, u4) with a paste site u1u4 at a state p that lies on a path labeled wcsx, for some s, where q0 w = q̄ and q̄x is terminal (see Fig. 7). (For Example 5.6, p = q0 and wcsx = b(a3)s with w = b and c = a3, the rule in question is r3 = (b, a3)(da, 1).)

iii. We observe that there is a state q ∈ Q u3u4 such that, for arbitrarily large i, ci is a factor of the right context of q. (For Example 5.6, such states are only q2, q3, q4, because q1 ∉ Q daa3x for any x.) We choose a sufficiently large i such that for some z, all states in Q u3u4zci belong to non-trivial components, and we replace u4 by u4zci. We observe that all states in Q u3u4zci lie in non-trivial components that are factor-equivalent to the non-trivial component . By Lemma 10 we obtain that u1u4zci is a paste site at p for the rule (u1, u2)(u3, u4zci). (For Example 5.6, we can choose z = 1 and have a new rule r3 = (b, a3)(da, a3), and Q daa3 = q2.)

iv. We show that for every q ∈ Q u3u4zci, it must be q = pu1u4zci, therefore u3u4zci is synchronizing.

We now present the proof of the main result.


Proof of Theorem 15—Let L be a regular splicing language, and let trim = (Q, A, {q0}, T, ) be its trimmed mDFA. By Proposition 8, if the automaton trim has only a trivial terminal component (note that since trim is reduced, there could be only one such component), it must have a constant and thus the theorem holds. Therefore we consider the case that trim has at least one non-trivial terminal component.

i. Consider a non-trivial terminal component that is minimal among the non-trivial terminal components in the automaton trim and let q ∈ be a minimal-follower state in . Let C = { = , ···, } be the set of all terminal components of the automaton that are factor-equivalent to and let F be the corresponding set of states defined above. We choose a candidate state q̄ ∈ F in a component ∈ C.


ii. Let w ∈ A* be the shortest word such that q0w = q̄. Consider a word c which is a constant of L( ) and is q̄-canonical for . Such a word exists by Lemma 14. Then wc*x ⊆ L for some x ∈ A*. Since there is a finite number of rules in the splicing system, there are infinitely many indices s such that the words wcsx are obtained by using the same splicing rule r = (u1, u2)(u3, u4) where u1u4 is a subword of wcsx for every such s. More precisely, there must exist an infinite number of pairs of words v = v′u1u2 v″ ∈ L and w′u3u4 w″ ∈ L such that v′u1u4 w″ ∈ wc*x. Thus v′u1 is a prefix of wcix for some i ≥ 0. Let p be such that pu1u2 ≠ ∅ where p = q0 v′. Moreover, if y″ ∈ (Qu3u4), since there is y′ such that y = y′u3u4 y″ ∈ L, by splicing words v = v′u1u2 v″ and y = y′u3u4 y″ with rule r, we obtain v′u1u4 y″ ∈ L and thus y″ ∈ (pu1u4). Therefore, (Qu3u4) ⊆ (pu1u4). We obtain that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4). Refer to Fig. 7.

iii. In the following we show that there are states in Q u3u4 such that ci is a factor of a word in their right context for arbitrarily large i’s.


Let p′ = q0 v′u1u4 where v′ is a prefix of wc*x. Being trim deterministic, and since is terminal, by the choice of p′ it must be that p′ is either inside the component or otherwise lies along a path with label w from state q0 to state q̄. In the latter case when p′ is not a state inside , v′u1u4 is a prefix of w. In this case cix must be a suffix of w″ in the splicing of v′u1u2 v″ ∈ L and w′u3u4 w″ ∈ L that produces v′u1u4 w″ = wci x. Hence, for arbitrarily large i’s, it must be that ci is a factor of a word in the right context of a state q ∈ Q u3u4. Since there are infinite number of i’s with this property, there is a state q ∈ Q u3u4 such that ci ∈ F ( (qu3u4)) for arbitrarily large i.


Now suppose p′ is a state in (see Fig. 7). By Proposition 8, u3u4 is either a factor of a constant, which proves the statement of the theorem, or there is a path-automaton (a sub-automaton of trim ) with a non-trivial terminal component such that u3u4 is a label of path π in . Then, by Lemma 14, there is a q-canonical word c′ for some q ∈ such that u3u4zc′ is a label of a path in for some z, that is zc′ ∈ (qu3u4). Since u1u4 is a paste site for the rule r at state p, we have that zc′ ∈ (qu3u4) ⊆ (pu1u4) = (p′) ⊆ L( ). But because c′ is q-canonical for it follows that L( ) ⊆ L( ) and by the minimality of component and since is factor equivalent to we have that L( ) = L( ), i.e., c* ⊆ L( ). Therefore ci is a factor of the right context of a state q ∈ Q u3u4 for arbitrarily large i. We now consider states in Q u3u4 whose right context has words with factors ci for arbitrarily large i’s. We fix i sufficiently large, such that for some z, every state in Q u3u4zci belongs to a non-trivial component, and for every state q̂ ∈ Q u3u4zci, the language of the component containing q̂ contains the word c. By Lemma 10, u1u4zci is a paste site at the same state p for the rule (u1, u2)(u3, u4zci).

Observe that z can be chosen such that . If p′ = pu1u4 is not in , by the argument above, cix is a suffix of w″, hence we can choose z such that w″ = zcix, i.e., p′z = q̄. Because c is a constant for L( ) such that q̄c = q̄, by Lemma 3 and Remark 4, every state q ∈ c is follower-equivalent to q̄. Therefore, the state q̄0 = pu1u4zci is follower-equivalent to q̄, and hence in F. Having q̄0 ∈ F ∩ , by Lemma 16, q̄0 is also a candidate state.

iv. Let q be a state in Q u3u4zci. We conclude with the observation that q = q̄0 and therefore u3u4zci is a synchronizing word for q̄0, proving the theorem. The proof of this last step consists first in showing that L( ) = L( ) where is the component in trim containing q. Then we show that is terminal and thus ∈ C. By the fact that q̄0 is a candidate state we are able to show that q̄0 = q.

We first observe that L( ) = L( ). As q ∈ Q u3u4zci, by Definition 6 of paste site, it holds that the right context of q is included in the right context of q̄0. (*)

Since c is q̄-canonical, by Definition 7 we have that L( ) ⊆ L( ). If L( ) \ L( ) ≠ ∅ then (q) ⊈ (q̄0) which contradicts (*). Therefore it must be that L( ) = L( ), that is and are factor-equivalent.


Next we see that is terminal. Assume to the contrary that is not terminal and thus there is an edge labeled a that starts in and terminates in a state q′ outside . By Lemma 5 the automaton that consists of together with the edge labeled a ending at q′ has a synchronizing word that ends at q′. Let ua be that word. Then ua ∉ L( ) = L( ), because otherwise ua would not be synchronizing. By the transitivity of we can assume that ua is a label of a path that starts at q. This implies that ua must be a prefix of a word in (q)\ (q̄0), again contradicting (*). Consequently is terminal and in C. Moreover q ∈ F, since by the choice of the constant c, by Remark 4, q is factor-equivalent to q̄ (hence, to q̄0), and F consists of all states that are factor-equivalent to q̄ and belong to components in C. Thus by (*) (i.e., (q) ⊆ (q̄0)) and the fact that q̄0 is a candidate state, we have (q̄0) = (q). Because trim is reduced, q̄0 = q, which concludes the proof.


The proof of Theorem 15 is based on the effective computation of a synchronizing state in the automaton for a regular splicing language in the case of an automaton having non-trivial terminal components. As a main corollary of the above theorem we can state the following fact.

Corollary 17—Let trim be the trimmed minimal deterministic automaton recognizing a splicing regular language. Then every state in a terminal component that contains a candidate state for trim is synchronizing.




6. Concluding remarks

In this paper we solve a conjecture posed by T. Head in his seminal works on regular splicing languages about the existence of a constant as a necessary condition for a regular language to be splicing. We solve this open problem in the affirmative, by providing a constructive proof that leads to a procedure for finding a synchronizing state in an mDFA for a regular splicing language. The use of constants allows one to determine a necessary and sufficient condition for a regular language to be reflexive splicing [3,4]; identifying such a condition for non-reflexive splicing languages is still an open problem.


Recently, decidability of regular splicing languages has been proved in [15] by providing an upper bound on the lengths of the words included in the splicing rules. This bound is quadratic with respect to the size of the syntactic monoid of the language. The decidability follows from the fact that the bound allows brute-force search and comparison of the given language with splicing languages obtained through all possible finite sets of rules of certain size. Although the existence of the algorithm was long awaited, the procedure it provides is useless for all practical purposes. Having a practical procedure to decide whether a regular language is splicing remains a challenging open problem. We believe that finding a characterization of minimal splicing systems recognizing splicing languages, where minimality of the system is given in terms of both the number of splice sites of rules and the length of the splicing sites, would be a promising direction for obtaining a practical decision procedure. Moreover, since splicing rules are built from constants in reflexive languages, the notions of constants and synchronizing words again seem to be vital for answering most of the above questions.


Acknowledgments

We thank the reviewers for numerous valuable comments. P. Bonizzoni is partially supported by MIUR PRIN 2010–2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi”, code H41J12000190001. N. Jonoska is supported in part by the NSF grant CCF-1117254 and the NIH grant R01GM109459-01.

References


1. Berstel J, Perrin D. Theory of Codes. Academic Press, Inc; Orlando, Florida: 1985.
2. Bonizzoni P, De Felice C, Mauri G, Zizza R. Regular languages generated by reflexive finite linear splicing systems. Lect Notes Comput Sci; Proc. Development in Language Theory; Berlin: Springer; 2003. p. 134–145.
3. Bonizzoni P, De Felice C, Zizza R. The structure of reflexive regular splicing languages via Schützenberger constants. Theor Comput Sci. 2005; 334(1–3):71–98.
4. Bonizzoni P, Mauri G. Regular splicing languages and subclasses. Theor Comput Sci. 2005; 340:349–363.
5. Bonizzoni P. Constants and label-equivalence: a decision procedure for reflexive regular splicing languages. Theor Comput Sci. 2010; 411(6):865–877.
6. Bonizzoni P, Jonoska N. Regular splicing languages must have a constant. Lect Notes Comput Sci; Proc. Developments in Language Theory; Berlin: Springer; 2011. p. 82–92.
7. Černý J. Poznámka k homogénnym eksperimentom s konecnými automatami. Mat-Fyz čas Slov Akad Vied. 1964; 14:208–216.
8. Culik K, Harju T. Splicing semigroups of dominoes and DNA. Discrete Appl Math. 1991; 31:261–277.
9. De Luca A, Restivo A. A characterization of strictly locally testable languages and its application to semigroups of free semigroup. Inf Control. 1980; 44:300–319.
10. Goode E. Constants and splicing systems. PhD Thesis. Binghamton University; 1999.
11. Goode E, Pixton D. Recognizing splicing languages: syntactic monoids and simultaneous pumping. Discrete Appl Math. 2007; 155:989–1006.
12. Head T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours. Bull Math Biol. 1987; 49:737–759. [PubMed: 2832024]
13. Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation. Addison–Wesley; Reading, Mass: 2001.
14. Jonoska N. Sofic systems with synchronizing representations. Theor Comput Sci. 1996; 158(1–2):81–115.
15. Kari L, Kopecki S. Deciding if a regular language is generated by a splicing system. Lect Notes Comput Sci; Proc. DNA Computing and Molecular Programming – 18th International Conference; Berlin: Springer; 2012. p. 98–109.
16. Lind D, Marcus B. An Introduction to Symbolic Dynamics. Cambridge University Press; New York: 1995.
17. Paun G. On the splicing operation. Discrete Appl Math. 1996; 70:57–79.
18. Paun G, Rozenberg G, Salomaa A. DNA Computing. New Computing Paradigms. Springer-Verlag; Berlin: 1998.
19. Pixton D. Regularity of splicing languages. Discrete Appl Math. 1996; 69:101–124.
20. Schützenberger MP. Sur certaines opérations de fermeture dans le langages rationnels. Symp Math. 1975; 15:245–253.
21. Verlan S. Head systems and applications to bioinformatics. PhD Thesis. University of Metz; 2004.

Fig. 1. A transitive language that doesn’t have a transitive deterministic automaton. The initial state is indicated with an arrow and the terminal states are shaded.

Fig. 2. Two automata, initial states are indicated with an arrow pointing to them and the terminal states are shaded.

Fig. 3. A path-automaton with no synchronizing words.

Fig. 4. Paste site at state p, the dotted path with label u3 may or may not exist. But the right context of qu3u4 is included in the right context of pu1u4 for every q.

Fig. 5. A path automaton that recognizes a non reflexive splicing language.

Fig. 6. The figure reports the automaton of Fig. 2(a) detailing the minimal terminal component which consists of states q2, q4 and q3.

Fig. 7. A possible paste site at state p.