
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 17, NO. 2, MARCH 2013

Autoregressive and Iterative Hidden Markov Models for Periodicity Detection and Solenoid Structure Recognition in Protein Sequences

Nancy Yu Song and Hong Yan, Fellow, IEEE

Abstract—Traditional signal processing methods cannot detect interspersed repeats and generally cannot handle nonstationary signals. In this paper, we propose a new method for periodicity detection in protein sequences to locate interspersed repeats. We first apply an autoregressive model with a sliding window to find possible repeating subsequences within a protein sequence. Then, we utilize an iterative hidden Markov model (HMM) to count the number of subsequences similar to each of the possible repeating subsequences. An iterative HMM search of the potential repeating subsequences can help identify interspersed repeats. Finally, the numbers of repeating subsequences are aggregated together as a feature and used in the classification process. Experimental results show that our method improves the performance of solenoid protein recognition substantially.

Index Terms—Autoregressive (AR) model, feature extraction, Markov model, solenoid protein recognition, spectral analysis.

I. INTRODUCTION

Protein structures are uniquely determined by their sequences. At least 14% of known proteins contain repeating sequences. Eukaryotic proteins are three times more likely to have internal repeats than prokaryotic ones, and repetitive proteins evolve more quickly than nonrepetitive ones [1]. Proteins can be classified into four structural groups based on repeat length: the first group consists of crystalline structures with no more than two repeating residues; the second group contains fibrous proteins with three to four repeating residues; the third group consists of solenoid proteins with 5 to 42 repeating residues; and the fourth group contains domain-forming proteins with 30 or more repeating residues. Solenoid refers to protein structures that contain superhelical arrangements of repeating structural units, and proteins with such structures are referred to as solenoid proteins. Although solenoid proteins can be divided into different classes, they share many structural and functional properties because of their nonglobular shapes. Large

Manuscript received April 26, 2012; revised July 17, 2012; October 3, 2012; accepted December 5, 2012. Date of publication December 21, 2012; date of current version March 8, 2013. N. Y. Song is with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]). H. Yan is with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, and also with the School of Electrical and Information Engineering, University of Sydney, Sydney, N.S.W. 2006, Australia (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2012.2235852

and diverse interfaces as well as cooperative multivalent interactions can be formed by solenoid proteins [2]. As a result, an accurate and efficient way of distinguishing solenoid proteins from other types will help determine how proteins are involved in protein–protein interactions.

Internal sequence repeats often exist within protein sequences of solenoid structures. Although internal protein sequence repeats do not sufficiently imply 3-D structural repeats of a protein, sequence repeats do provide strong evidence for predicting structural repeats. Thus, a basic idea for identifying solenoid protein structures is to detect internal sequence repeats.

Some approaches utilize signal processing methods to search for repeats. These include [3]–[5], REPPER [6], REPETITA [7], and [8]. These methods first convert the symbolic protein sequences into numerical signals; then, signal processing methods such as the Fourier transform or wavelet analysis are applied to detect repeats. Tandem repeats in protein sequences refer to contiguous repeating sequences, while interspersed repeats refer to repeating sequences with gaps or insertions in between. Only tandem repeats can be detected by these signal processing methods, and their performance degrades when a sequence has interspersed repeats.

Protein sequence repeats can also be detected based on self-alignment of the sequence data. These approaches include REPRO [9], RADAR [10], TRUST [11], HHrep [12], and HHrepID [13]. They work directly on the symbolic sequences of proteins, and most of them use direct sequence-to-sequence alignment to find internal sequence repeats. The detection processes of these methods do not involve any conversion from symbolic sequences to numerical ones. Their sensitivity is low because weak similarity within a sequence cannot be detected, but their specificity is generally higher than that of methods based on signal processing. These methods can detect both tandem and interspersed repeats as long as the sequence similarity is strong enough.

The stationary wavelet packet transform (SWPT) method proposed by Vo et al. is the most recently reported method developed specifically for solenoid and nonsolenoid protein classification [8]. In this method, a symbolic protein sequence is first converted into five numerical signals based on five physical properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [14], [15]. Then, a four-level SWPT is applied to decompose each numerical signal into 16 subbands. Finally, the energy of each subband is calculated as a feature and used for the final classification. The SWPT is a critical step in this algorithm. However,

2168-2194/$31.00 © 2012 IEEE


TABLE I
COMPARISON OF SEPARATE FEATURE GROUPS FOR SOLENOID PROTEIN STRUCTURE RECOGNITION USING THE METHOD OF VO et al. [8]

             Features used                          Training (%)  Testing (%)  Overall (%)
Sensitivity  SWPT subband energy features only      86            76.4         81
             Amino acid statistical features only   92            92.7         92.4
             Both types of features                 96            90.9         93.3
Specificity  SWPT subband energy features only      79            85.2         82.2
             Amino acid statistical features only   90.8          80.5         85.4
             Both types of features                 93.3          91.4         92.3
Accuracy     SWPT subband energy features only      81.1          82.5         81.8
             Amino acid statistical features only   91.1          84.2         87.5
             Both types of features                 94.1          91.3         92.6

it cannot be ignored that statistical features of amino acids are also used for classification. These features were not used in previous methods such as REPETITA [7], and no SWPT is involved in obtaining them. One may wonder how the statistical features of amino acids influence the classification result. A simple experiment was carried out to find the answer. This experiment follows exactly the algorithm proposed in [8], except that the feature groups are used separately, and the experimental dataset is the same as that in the work of Vo et al. [8]. A total of 100 features are divided into two groups: the first group includes 80 SWPT subband energy features and the second group 20 statistical features of amino acids. Table I lists the classification results using the SWPT subband energy features, the amino acid statistical features, and both types of features. From Table I, it is observed that the classification performance using statistical features of amino acids is better than that using the SWPT subband energy features. Although the focus of Vo et al. is to deploy the SWPT to obtain better classification features, the inclusion of simple statistical features of amino acids clearly makes a significant contribution to the final classification.

Usually, tandem and interspersed repeats in a protein sequence cannot both be detected by a signal processing method or a sequence alignment method alone. However, if we combine the two methods, the problem can be solved. Therefore, we include hidden Markov model (HMM) subsequence alignment information of protein sequences as input features. The new algorithm is superior to previous ones, as demonstrated by our experiments.

II. METHODS

A flowchart of all the steps used in our algorithm is shown in Fig. 1. The green blocks represent the new feature extraction procedures proposed by us, while the blue ones are the same as those described in [8].
We include the number of repeating subsequences in a protein sequence as a new feature. Power spectral density estimation based on the autoregressive model (ARPSD) and HMM sequence comparison are utilized to find the repeating subsequences. Unlike the old features, the new feature captures information about the interspersed repeats in a protein sequence.

We convert a protein sequence into five numerical signals, using a scheme similar to those in REPETITA [7] and the SWPT method [8]. Unlike the symbolic protein sequence, the numerical signals can be easily analyzed by signal processing methods. Also, the five numerical signals are biologically more meaningful than the

Fig. 1. Flowchart of the algorithm for solenoid and nonsolenoid classification. The green blocks show our main contributions, while the blue ones are described in [8].

amino acid variability simply generated as ad hoc quantitative indices [14], [15]. ARPSD, which has high resolution especially for short signals, is used to analyze the five numerical signals [16]. ARPSD has been shown to be superior to traditional signal processing methods such as the discrete Fourier transform (DFT) [17], [18].

Protein sequences are nonstationary and noisy signals, and different parts of a protein sequence usually contain different frequency components. A protein with a solenoid structure, however, is very likely to have the same frequency component throughout the whole sequence, because similar coils usually reflect similar amino acid attributes along the sequence. The frequency of the solenoid structure should therefore be very strong and dominate all the frequencies embedded in the protein sequence. The strongest frequency component is defined as the dominant frequency component in this paper, and the corresponding period is defined as the dominant period. If the protein has a solenoid structure, the dominant frequency component can be discovered by ARPSD. However, even if the protein does not have a solenoid structure, a dominant frequency component can still be detected in some cases. ARPSD with a window size equal to the complete numerical signal of a protein sequence can only tell which frequency components the signal contains, but not where a frequency component is located. If one segment of the signal is strongly periodic, its frequency component could be


very strong and become the dominant one. This means that a false alarm can be produced if ARPSD is applied to a nonsolenoid protein, which makes the final classification inaccurate. Therefore, a sliding-windowed ARPSD is further applied to identify the potential repeating subsequences, and these potential repeating subsequences are then verified by subsequence alignments.

A sliding window is a good way to process a nonstationary signal: the signal within the window is assumed to be stationary even though the signal may not be stationary over its total length. After the dominant periodicity is detected, a window with length equal to two or four times the dominant period slides along the sequence. Each time the window advances by one step, a power spectral density (PSD) value is calculated, and peak PSD values indicate repeating subsequences. The repeating subsequences on the original symbolic sequence are then extracted. Next, an iterative HMM is employed to find, for each extracted subsequence, other similar subsequences in the whole protein sequence. The number of similar subsequences found in the symbolic protein sequence is obtained for each potential repeating subsequence. To enhance the classification power of the new feature, the counts obtained for the different potential subsequences are aggregated together as one feature.
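The sliding-window scan just described can be sketched as follows. This is a minimal illustration rather than the paper's code: a simple periodogram value at the dominant period stands in for the full ARPSD estimate, and the function name and parameters are our own.

```python
import numpy as np

def scan_for_repeats(signal, period, win_mult=2):
    """Slide a window of win_mult * period along a numerical signal and
    record, at each position, the spectral power at the given period.
    Peak values point at likely repeating subsequences."""
    y = np.asarray(signal, dtype=float)
    n = len(y)
    w = win_mult * period
    if w > n:                        # fall back to the smaller window size
        w = 2 * period
    omega = 2 * np.pi / period
    probe = np.exp(-1j * omega * np.arange(w))
    psd = []
    for start in range(n - w + 1):   # advance one data point at a time
        seg = y[start:start + w]
        seg = seg - seg.mean()
        psd.append(abs(probe @ seg) ** 2 / w)  # power at the target period
    psd = np.array(psd)
    top_two = np.argsort(psd)[::-1][:2]  # strongest and second strongest windows
    return psd, top_two
```

In the full method, the per-window values from the five numerical signals are summed before the two peak positions are taken.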

A. PSD Estimation Based on the Autoregressive Model

The autoregressive (AR) model can overcome short-signal problems, give a higher resolution, and produce smaller artifacts in spectral estimation when compared with the DFT [17]. We briefly review this method in the following. Let $S = [y_1, y_2, y_3, \ldots, y_t, \ldots, y_n]$ be a stationary time series that follows an AR model of order $p$; the series is normalized to zero mean. In matrix form, the AR model can be described as

$$y = Ya + \varepsilon \tag{1}$$

where $a$ is the vector of AR model coefficients and $\varepsilon$ is a noise sequence assumed to be normally distributed with zero mean and variance $\sigma^2$. If we use the forward–backward linear prediction method, (1) can be written as

$$
\begin{bmatrix}
y[p+1] \\ y[p+2] \\ \vdots \\ y[n] \\ y[1] \\ y[2] \\ \vdots \\ y[n-p]
\end{bmatrix}
=
\begin{bmatrix}
y[p] & y[p-1] & \cdots & y[1] \\
y[p+1] & y[p] & \cdots & y[2] \\
\vdots & \vdots & \cdots & \vdots \\
y[n-1] & y[n-2] & \cdots & y[n-p] \\
y[2] & y[3] & \cdots & y[p+1] \\
y[3] & y[4] & \cdots & y[p+2] \\
\vdots & \vdots & \cdots & \vdots \\
y[n-p+1] & y[n-p+2] & \cdots & y[n]
\end{bmatrix}
\begin{bmatrix}
a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{p-1} \\ a_p
\end{bmatrix}
+ \varepsilon. \tag{2}
$$

Equation (2) can be ill-conditioned or inconsistent in many applications. In these cases, we can use the singular value decomposition to overcome the problem. That is, the matrix $Y$ is decomposed into three matrices as follows:

$$Y_{[2(n-p)] \times p} = U_{[2(n-p)] \times p}\, \Lambda_{p \times p}\, V^{T}_{p \times p} \tag{3}$$

where $\Lambda$ is a diagonal matrix containing the singular values:

$$\Lambda_{p \times p} =
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{bmatrix}
= \operatorname{diag}(\lambda_j). \tag{4}$$

The AR coefficients can then be found from the following equation:

$$a = V_{p \times p}\, \Lambda^{-1}_{p \times p}\, U^{T}_{p \times [2(n-p)]}\, y \tag{5}$$

where $\Lambda^{-1}_{p \times p} = \operatorname{diag}(1/\lambda_j)$. The prediction order $p$ is chosen to be $n/2$, where $n$ is the signal length. The reason for selecting this order is that Lang and McClellan recommended that the number of AR coefficients be in the range of $n/3$ to $n/2$ for the best frequency estimation [19]. Finally, the PSD can be calculated based on the following equation:

$$P_{AR}(\omega) = \frac{\sigma^2}{\left| 1 + \sum_{k=1}^{p} a_k \exp(-j\omega k) \right|^2} \tag{6}$$

where $\sigma^2$ is the variance of the noise and $\omega$ is the angular frequency, which is related to the period of a repeat as follows:

$$\omega = \frac{2\pi}{\text{period}}. \tag{7}$$
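As a concrete sketch of (1)–(7), the following numpy code builds the forward–backward prediction system, solves it with `numpy.linalg.lstsq` (which uses the SVD internally, as in (3)–(5)), and evaluates the PSD on a grid of candidate periods with σ² set to 1. One convention detail: because the coefficients are estimated from the prediction form y = Ya + ε, the AR polynomial here is written 1 − Σ aₖexp(−jωk), i.e., (6) with the signs of the coefficients flipped. The function name is ours, not from the paper.

```python
import numpy as np

def ar_psd(signal, periods, order=None):
    """AR power spectral density, estimated by forward-backward linear
    prediction, evaluated at the given candidate repeat periods."""
    y = np.asarray(signal, dtype=float)
    y = y - y.mean()                 # the AR model assumes a zero-mean series
    n = len(y)
    p = order if order is not None else n // 2  # p = n/2, as recommended in [19]

    # Forward-backward system (2): 2*(n - p) equations in p unknowns.
    rows, rhs = [], []
    for t in range(p, n):            # forward: y[t] from y[t-1], ..., y[t-p]
        rows.append(y[t - 1::-1][:p])
        rhs.append(y[t])
    for t in range(n - p):           # backward: y[t] from y[t+1], ..., y[t+p]
        rows.append(y[t + 1:t + 1 + p])
        rhs.append(y[t])
    # Least-squares solution via the SVD, as in (3)-(5).
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)

    # PSD (6) with sigma^2 = 1, evaluated at omega = 2*pi/period (7).
    k = np.arange(1, p + 1)
    psd = []
    for period in periods:
        omega = 2 * np.pi / period
        denom = np.abs(1 - np.sum(a * np.exp(-1j * omega * k))) ** 2
        psd.append(1.0 / denom)
    return np.array(psd)
```

For dominant-period detection, the PSD would be evaluated at periods 5 to 42 on each of the five numerical signals and the five resulting curves summed.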

Since we are interested in the peaks of the PSD for detecting potential repeats, only the relative values of $P_{AR}(\omega)$ are needed. Thus, we simply set $\sigma^2$ to 1.

B. Dominant Periodicity Detection

We first convert a protein sequence into the five numerical signals discussed at the beginning of Section II. Each numerical signal is treated as a stationary time series S in an AR model. We set the window size n of the AR model to the entire protein sequence length, and the prediction order p to half the window size, n/2. We then calculate the AR coefficients a using (1)–(5) in Section II-A. Because a solenoid protein can have 5 to 42 repeating residues, the period in (7) is set from 5 to 42. Thus, 38 PSD values for 38 different period values are estimated: one PSD value is obtained for each period on each of the five numerical signals separately, giving five groups of 38 PSD values. The five groups of PSD values are then added together. The five numerical signals reflect


different properties of a protein sequence, and not all five numerical signals contain repeating signals. Aggregating the five groups of PSD values can enhance dominant period detection, reduce the influence of noise, and eliminate false periodic components. This is similar to averaging multiple observations to reduce noise, a strategy commonly used in signal processing systems. The periods with the largest and second largest PSD values are identified as the dominant and second dominant periods, respectively; the inclusion of the second dominant period mitigates the negative effects of noise within the numerical sequences. Finally, the locations of the largest and second largest PSD values are identified.

C. Searching for Repeating Subsequences

After the dominant periods within a protein sequence are detected, the exact subsequences that caused the periodicity should be identified. The length of the subsequence is assumed to be equal to the dominant period or the second dominant period. A sliding window is deployed to locate the exact repeating subsequences on a protein sequence based on the following steps.
1) Set the sliding window size n to twice the dominant period.
2) Put the window at the beginning of each of the five numerical sequences.
3) Estimate the ARPSD value of the windowed signal according to (1)–(6), setting the period in (7) to the dominant period. The prediction order p is set to half the window size, i.e., n/2.
4) Compute the PSD value.
5) Advance the window one data point at a time.
6) Repeat Steps 3–5 until the sliding window reaches the end of the numerical sequence.
7) Add the five groups of PSD values.
8) Determine the two positions where the largest and second largest PSD values are located.
9) Locate the two positions from Step 8 in the original symbolic sequence and record two subsequences, each with length equal to one dominant period, at those positions.
10) Set the sliding window size n to four times the dominant period; if this window size is larger than the signal length, set n to twice the dominant period.
11) Repeat Steps 2–9.
12) Set the sliding window size n to twice the second dominant period and repeat Steps 2–9.
13) Set the sliding window size n to four times the second dominant period; if this window size is larger than the signal length, set n to twice the second dominant period.
14) Repeat Steps 2–9.
After these steps, eight model subsequences are obtained, which are potentially the typical repeating subsequences in the protein sequence. Even if a model subsequence is as short as five residues, ARPSD can still be performed to locate the subsequence's position, because the window size n and prediction order p are chosen based on the length of the subsequence. The window size


n, which is set to twice the dominant period, cannot be larger than the protein sequence length. Also, the protein sequence length must be 10 or larger in order to have solenoid repeats, which contain five or more repeating residues.

D. Counting the Number of Repeating Subsequences

In the previous steps, eight model subsequences are obtained for a protein sequence. For each model subsequence, there may be other similar subsequences in the protein sequence if the protein contains solenoid repeats. An HMM is used to find the number of similar subsequences in the protein sequence. In this paper, HMMER (http://hmmer.janelia.org/) is deployed to compare the model subsequences to the original protein sequence. HMMER implements profile HMMs to align protein sequence segments. The jackhmmer program in the HMMER package is used to iteratively search for subsequences similar to a model subsequence against the entire protein sequence. Iterative HMM searches are used frequently in bioinformatics because they are more sensitive than a single HMM search. The first round of jackhmmer finds all the matches that pass the inclusion thresholds, and the matched subsequences are put in a multiple alignment. In the second and subsequent rounds, a profile is produced from the multiple alignment obtained in the previous round; the original input subsequence is also included in the profile. The original protein sequence is then searched again with the new profile. The search is iterated until no new subsequence is detected or the maximum number of iterations is reached. In our work, the number of iterations is set to the default value of 5. The parameter "--incdomE", which controls the inclusion threshold and hence which hits are actually used in the next iteration, is set to 1. In addition, the option "-A <f>" is turned on to save an annotated multiple alignment of all detected subsequences similar to the input one to the file <f> in Stockholm format after the final iteration.
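As a sketch, the jackhmmer run and the subsequent counting could look as follows; the file names are placeholders, HMMER must be installed and on the PATH, and the minimal Stockholm-parsing helper is our own illustration, not part of HMMER.

```python
import subprocess

def run_jackhmmer(seed_fasta, protein_fasta, out_sto, max_iter=5, incdom_e=1.0):
    """Iteratively search a model subsequence against the full protein
    sequence with HMMER's jackhmmer, saving the final multiple alignment."""
    subprocess.run(
        ["jackhmmer",
         "-N", str(max_iter),         # maximum number of iterations (default 5)
         "--incdomE", str(incdom_e),  # inclusion threshold for the next round
         "-A", out_sto,               # save the final alignment in Stockholm format
         seed_fasta, protein_fasta],
        check=True, capture_output=True)

def count_stockholm_sequences(path):
    """Count distinct aligned subsequences in a Stockholm file: sequence
    lines are those that are neither '#' annotation nor the '//' terminator."""
    names = set()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and line != "//":
                names.add(line.split()[0])
    return len(names)
```

For each protein, the counts from the eight model subsequences would then be summed into the single new feature.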
For each potential repeating subsequence, a file of detected similar subsequences is created. Therefore, for each protein sequence, eight files are created for the eight different potential repeating subsequences, and the number of repeating subsequences in each file is counted. To mitigate the negative influence of false potential repeating subsequences, the eight counts from the eight files are added together. This final number, which indicates the total number of similar subsequences, is taken as a feature for the final classification.

III. EXPERIMENTS AND DISCUSSIONS

The dataset used in our experiment is downloaded from the website http://protein.bio.unipd.it/repetita/. It contains 105 solenoid proteins and 247 nonsolenoid proteins, and has been used in REPETITA [7] and in Vo et al.'s method [8]. We compare our experimental results with four other methods: TRUST, HHrepID, REPETITA, and Vo et al.'s method. Both TRUST and HHrepID are sequence-alignment-based algorithms: TRUST uses logical inference from alignment and transitivity, while HHrepID uses HMM–HMM comparison to detect repeats. Both REPETITA and Vo et al.'s method are signal-processing-based algorithms; REPETITA uses the DFT to separate solenoid and nonsolenoid proteins. The comparison results are listed in Table II, which show that our


TABLE II
COMPARISON OF OUR METHOD WITH FOUR OTHER METHODS FOR SOLENOID STRUCTURE DETECTION

             Method                     Training (%)  Testing (%)  Overall (%)
Sensitivity  TRUST                      56            50.9         53.3
             HHrepID                    70            63.6         66.7
             REPETITA                   70            69           70
             Vo et al.'s method (SWPT)  96            90.9         93.3
             Our method                 90            98.2         94.3
Specificity  TRUST                      94.1          91.4         92.7
             HHrepID                    93.3          89.8         91.5
             REPETITA                   85            83           84
             Vo et al.'s method (SWPT)  93.3          91.4         92.3
             Our method                 96.6          93.8         95.1
Accuracy     TRUST                      82.8          79.2         81
             HHrepID                    86.4          82           84.1
             REPETITA                   80.5          78.7         79.6
             Vo et al.'s method (SWPT)  94.1          91.3         92.6
             Our method                 94.7          95.1         94.9

TABLE III
FEATURES SELECTED FOR CLASSIFICATION

Ranking  1   2   3    4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21
Feature  90  92  101  86  11  25  14  20  35  42  66  98  21  49  28  37  99  13  18  77  79

Fig. 2. Ribbon representation of solenoid protein 1wm5A00.

approach performs slightly worse than Vo et al.'s approach in terms of sensitivity on the training data. However, our approach is better than the other methods in terms of all other measurements.

The features used for classification are listed in Table III. Only 21 features are selected because the prediction accuracy is not improved when more features are included. We rank the features from 1 to 21 in descending order of importance. For example, feature 90 is ranked first because removing it from the classification process causes the largest decrease in classification rate, in comparison with the removal of any other feature. We observe that the new feature introduced by us, numbered 101, is the third most important one among all the features.

A solenoid protein (1wm5A00), which is detected by our approach but not by Vo et al.'s approach, is shown as an example in Fig. 2, Fig. 3, and Table IV. From the ribbon representation produced using UCSF Chimera [20] in Fig. 2, it is observed that the protein's structure is solenoid. Although the shape of the protein is generally periodic, the protein structure contains helices, coils, as well as sheets, which add noise to the protein signal. This noise can cause some of the repeating subsequences in the protein to become interspersed. Therefore, it is difficult to detect repeats in this protein using a signal processing method alone. However, with

Fig. 3. Ribbon representation of solenoid protein 1wm5A00 with the identified repeating subsequences marked in green.

the aid of HMM subsequence alignments in our approach, it is possible to identify the interspersed repeats. These repeats are listed in Table IV. The second column of Table IV shows the subsequence positions: the first number is the starting index of a repeating subsequence in the protein sequence and the second number is the ending index (the symbol "*" is also counted). The first row shows the result of the HMM iterative search seeded with the subsequence obtained by ARPSD, which is shown in bold. In the second row, only the overlapping positions of the four subsequences are counted; the nonoverlapping areas are discarded because no evidence of sequence similarity is found in them. The similarity scores, given in the last column of Table IV, show the percentage of identical amino acid occurrences over the whole subsequence. The positions of the four subsequences on the original protein sequence are marked in green in Fig. 3. It can be observed that the four repeating subsequences are indeed interspersed, which explains why the protein can be detected by our approach but not by Vo et al.'s method.


TABLE IV
SUBSEQUENCE ALIGNMENT IN SOLENOID PROTEIN 1WM5A00

Row  Positions (Start-End)  Subsequence                                Similarity Score
1    34-72                  FSAVQDPHSRICFNIGCMYTILKNMTEAEKAFTRSINRD    100%
     3-41                   ***********LWNEGVLAADKKDWKGALDAF*******    15.4%
     75-113                 **********************KDLKEAL**********    7.7%
     118-156                **********VLYNIAFMYAKKEEWKKAEE*********    15.4%
2    44-65                  ICFNIGCMYTILKNMTEAEKAF                     100%
     13-34                  *LWNEGVLAADKKDWKGALDAF                     27.3%
     85-106                 ************KDLKEAL***                     13.6%
     128-149                VLYNIAFMYAKKEEWKKAEE**                     27.3%

IV. CONCLUSION

Traditional signal processing methods cannot be used directly to detect interspersed repeats in protein sequences, and they are generally only suitable for processing stationary signals. In this paper, we propose a new feature for solenoid protein recognition. We first apply ARPSD to determine the dominant periods of a protein sequence. Then, we use a sliding window to find possible repeating subsequences within the sequence; ARPSD with a sliding window is well suited to processing nonstationary, weak, and short signals. An iterative HMM is employed to count the number of subsequences similar to each of the possible repeating subsequences, and this iterative HMM search of the potential repeating subsequences helps identify interspersed repeats. Finally, the numbers of repeating subsequences are aggregated together as a new feature. This feature is used together with the original features proposed by Vo et al. for solenoid protein detection. Experimental results show that this feature indeed improves the performance of solenoid protein recognition.

REFERENCES

[1] E. Marcotte, M. Pellegrini, T. Yeates, and D. Eisenberg, "A census of protein repeats," J. Molecul. Biol., vol. 293, no. 1, pp. 151–160, 1999.
[2] B. Kobe and A. Kajava, "When protein folding is simplified to protein coiling: The continuum of solenoid protein structures," Trends Biochem. Sci., vol. 25, no. 10, pp. 509–515, 2000.
[3] E. Coward and F. Drabløs, "Detecting periodic patterns in biological sequences," Bioinformatics, vol. 14, no. 6, pp. 498–507, 1998.
[4] K. Murray, D. Gorse, and J. Thornton, "Wavelet transforms for the characterization and detection of repeating motifs," J. Molecul. Biol., vol. 316, no. 2, pp. 341–363, 2002.
[5] K. Murray, W. Taylor, and J. Thornton, "Toward the detection and validation of repeats in protein structure," Proteins: Struct. Funct. Bioinformat., vol. 57, no. 2, pp. 365–380, 2004.
[6] M. Gruber, J. Söding, and A. Lupas, "REPPER—Repeats and their periodicities in fibrous proteins," Nucleic Acids Res., vol. 33, suppl. 2, pp. W239–W243, 2005.
[7] L. Marsella, F. Sirocco, A. Trovato, F. Seno, and S. Tosatto, "REPETITA: Detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform," Bioinformatics, vol. 25, no. 12, pp. i289–i295, 2009.
[8] A. Vo, N. Nguyen, and H. Huang, "Solenoid and non-solenoid protein recognition using stationary wavelet packet transform," Bioinformatics, vol. 26, no. 18, pp. i467–i473, 2010.
[9] R. George and J. Heringa, "The REPRO server: Finding protein internal sequence repeats through the web," Trends Biochem. Sci., vol. 25, no. 10, pp. 515–517, 2000.
[10] A. Heger and L. Holm, "Rapid automatic detection and alignment of repeats in protein sequences," Proteins: Struct. Funct. Bioinformat., vol. 41, no. 2, pp. 224–237, 2000.
[11] R. Szklarczyk and J. Heringa, "Tracking repeats using significance and transitivity," Bioinformatics, vol. 20, suppl. 1, pp. i311–i317, 2004.

[12] J. Söding, M. Remmert, and A. Biegert, "HHrep: De novo protein repeat detection and the origin of TIM barrels," Nucleic Acids Res., vol. 34, suppl. 2, pp. W137–W142, 2006.
[13] A. Biegert and J. Söding, "De novo identification of highly diverged protein repeats by probabilistic consistency," Bioinformatics, vol. 24, no. 6, pp. 807–814, 2008.
[14] W. Atchley, J. Zhao, A. Fernandes, and T. Drüke, "Solving the protein sequence metric problem," Proc. Nat. Acad. Sci. U.S.A., vol. 102, no. 18, pp. 6395–6400, 2005.
[15] S. Kawashima, H. Ogata, and M. Kanehisa, "AAindex: Amino acid index database," Nucleic Acids Res., vol. 27, no. 1, pp. 368–369, 1999.
[16] N. Y. Song and H. Yan, "Short exon detection in DNA sequences based on multifeature spectral analysis," EURASIP J. Adv. Signal Process., vol. 2011, art. 780794, 8 pp., 2010, doi: 10.1155/2011/780794.
[17] H. Yan and T. Pham, "Spectral estimation techniques for DNA sequence and microarray data analysis," Current Bioinformat., vol. 2, no. 2, pp. 145–156, 2007.
[18] H. Zhou, L. Du, and H. Yan, "Detection of tandem repeats in DNA sequences based on parametric spectral estimation," IEEE Trans. Inf. Technol. Biomed., vol. 13, no. 5, pp. 747–755, Sep. 2009.
[19] S. Lang and J. McClellan, "Frequency estimation with maximum entropy spectral estimators," IEEE Trans. Acoust. Speech Signal Process., vol. 28, no. 6, pp. 716–724, Dec. 1980.
[20] E. Pettersen, T. Goddard, C. Huang, G. Couch, D. Greenblatt, E. Meng, and T. Ferrin, "UCSF Chimera: A visualization system for exploratory research and analysis," J. Comput. Chem., vol. 25, no. 13, pp. 1605–1612, 2004.

Nancy Yu Song received the B.E. degree (first class Hons.) in electronic engineering from the City University of Hong Kong, Kowloon, Hong Kong, in 2009, where she is currently working toward the Ph.D. degree. Her current research interests include signal processing, DNA sequence analysis, and bioinformatics.

Hong Yan (F’06) received the B.E. degree from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 1982, the M.S.E. degree from the University of Michigan, Ann Arbor, in 1984, and the Ph.D. degree from Yale University, New Haven, CT, in 1989, all in electrical engineering. During 1982–1983, he was a Graduate Student and a Research Assistant at Tsinghua University, Beijing, China. From 1986 to 1989, he was a Research Scientist at General Network Corporation, New Haven, where he was involved in design and optimization of computer and telecommunications networks. In 1989, he joined the University of Sydney, Sydney, N.S.W., Australia, where in 1997, he became a Professor of imaging science. He is currently a Professor of computer engineering at the City University of Hong Kong, Kowloon, Hong Kong. He is the author or coauthor of more than 300 papers published in refereed journals and conference proceedings. His current research interests include image processing, pattern recognition, and bioinformatics. Dr. Yan is a Fellow of the International Association for Pattern Recognition.
