HHS Public Access Author manuscript Author Manuscript
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18. Published in final edited form as:
IEEE/ACM Trans Comput Biol Bioinform. 2017 ; 14(3): 522–533. doi:10.1109/TCBB.2016.2591533.
Efficient approach to correct read alignment for pseudogene abundance estimates Chelsea J.-T. Ju, Zhuangtian Zhao, and Wei Wang [Member, IEEE] The authors are with the Department of Computer Science, University of California, Los Angeles, CA, 90095
Author Manuscript
Abstract
Author Manuscript
RNA-Sequencing has been the leading technology to quantify expression of thousands genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of an genome. One group of these homologies is the paralog pseudogenes. Pseudogenes arise from duplication of a set of protein coding genes, and have been considered as degraded paralogs in the genome due to their lost of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With the growing interests in quantifying the expression level of pseudogenes at different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate the expression level among homologous genes. Previously in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach is able to retain the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic growth to linear with respect to the number of genes.
Keywords Index Terms—Bioinformatics; RNA-Seq; Pseudogene; Linear Regression; Community Detection
Author Manuscript
1 Introduction THE advances of high throughput sequencing (HTS) have an incredible impact in the field of genetics. One of the main advantages is to simultaneously measure the expression of many genes in different tissues. The technology that utilizes HTS to capture the snapshot of existing RNA and quantify the expression abundance is referred to as RNA-Sequencing (RNA-Seq). RNA-Seq experiments start with converting purified and sheared RNA molecules to complementary DNA (cDNA). The sequence of cDNA fragments are read out
Ju et al.
Page 2
Author Manuscript
by sequencers, producing millions of short reads, ranging from 50bp to 300bp in length. The fragment can be read from one end only (single-end sequencing) or from both ends (pairedend sequencing). Following a series of data analysis, the amount of reads from a transcript can be used to estimate the expression abundance.
Author Manuscript
Current methods for RNA-seq quantification can be classified into two main approaches: alignment-based, and assembly-based [1] [2]. The alignment-based approach requires a reference genome or a reference transcriptome in the initial step, and determines where the short reads come from by aligning them to the reference. This approach is limited to those species with a reference genome, and is biased to the information provided by the reference. Several commonly used aligners include Tophat [3], MapSplice [4], and SpliceMap [5]. The assembly-based approach reconstructs the transcriptome by assembling reads from an experiment, and aligns reads to the reconstructed transcriptome. It lifts the constraint of having a reference genome or transcriptome, and removes the bias introduced by the reference. Several algorithms have been developed, such as Trinity [6] and Trans-ABySS [7]. However, transcriptome reconstruction is a computational intensive task, specifically with a large amount of short reads from a single experiment.
Author Manuscript
Despite the limitation, alignment-based approach is generally preferred due to its computational advantage. Most of the implementations can be easily parallelized. After aligning million of short reads to the reference, the abundance of each transcript is quantified based on the number of mapped reads, and normalized to account for transcript size and the total number of mapped reads. Statistical analysis can be further applied to identify significant changes in gene expression across different experiments. During the course of data analysis, each step has a cascading effect, and the outcome from read alignment can predominantly change the results of any downstream analysis. For this reason, we focus on the computational challenges of read alignment for RNA-Seq. The main challenge of read alignment lies on mapping reads that come from repeated parts of the genome. Most Eukaryotic genomes, including human genome, are full of large repeated segments. As a result, reads generated from these regions can be mapped to more than one position in the genome. This type of reads is referred to as the “multiread”, and is the main source of challenge for many computational biology problems ranges from de-novo assembly, copy number detection, and homologous genes expression profiling. One type of these repeats in the genome is the pseudogenes, which has received a notable attention due to the novel discovery of their regulatory roles in different biological processes [8], [9], [10].
Author Manuscript
Pseudogenes share a significant amount of sequence similarity with a set of protein-coding genes. They can be categorized into two forms based on their copying mechanisms [11]. The first form of pseudogenes derives from a direct genomic duplication of genes, and thus retain the intron-exon structure. They are often termed duplicated or non-processed pseudogenes. The second form of pseudogenes is the products of retrotransposition, where the mRNA transcripts of original genes are reverse transcribed and reintegrated into the genome at new locations. Consequently, they lose the introns and the 5′-end promoter sequence, and are referred to as processed pseudogenes. Both forms of pseudogenes can accumulate a substantial amount of mutations over time, which may disrupt their functions during
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 3
Author Manuscript
transcription or/and translation. For this reason, they are often labeled as “junk” DNA. Succeeding the ENCODE project, many studies have revealed that although pseudogenes lack functional gene products, they may interact with functional protein coding genes and regulate different biochemical process in cells [8], [9], [10], [12], [13], [14]. Genome-wide studies of pseudogenes have been extensively focused on identifying their genomic locations, and the knowledge from these studies are comprehensively integrated into publicly available databases, such as the GENCODE pseudogene resource [15] and Yale pseudogene resource [11]. On the other hand, genome-wide analyses of pseudogene expression remain challenging due to the highly sequence similarity with their homologous protein-coding genes.
Author Manuscript
In the effort of quantifying the expression of pseudogene, Tonner et al. [16] developed a pipeline that created a composite genome to include the human genome and the mRNA sequence of ribosomal protein genes, and kept only the uniquely mapped read to this genome for abundance estimation. The idea of discarding all mutireads is one of the most common approaches in resolving the multiple alignment issue. It is easy to apply, but tosses out critical information for quantification step. As a result, it creates a bias toward genes with a more unique sequence in the genome. Other related approaches to address the issue include assigning multireads to the best locus based on local coverage estimation from uniquely mapped reads [17], or based on probabilistic models [18], [19]. Nevertheless, none of these methods leverages the relationship of read alignment among homologous regions.
Author Manuscript
Recently, we propose a novel method, PseudoLasso [20], that keeps all multireads and correctly reassigns both unique reads and multireads between pseudogenes and their homologous regions. PseudoLasso is based on an observation that reads from a given gene are linearly distributed to different genomic loci. Since the homologous regions are sparse throughout the genome, this relation can be modeled using linear regression with L1 regularization (Lasso). Once the read count of each region is accurately estimated, reads are reassigned to regions in which the observed read count is lower than the estimation. In this work, we extend PseudoLasso by incorporating a community detection algorithm to identify and cluster the homologous genes. One of the bottlenecks of PseudoLasso resides on the computational time in solving the non-negative linear squares equations to estimate read counts for a new dataset. Building on the sparsity of homologous regions, we refine the framework to partition these genomic regions into different groups. The downstream analyses for read count estimation and alignment correction can be applied concurrently on smaller sets of genes.
Author Manuscript
2 Methods The ultimate goal of our method is to accurately estimate the read count for each gene, and leverage this information to guide the correction of short read alignments among homologous loci. In a paired-end sequencing experiment, a pair of mate reads are considered as a fragment, and thus the fragment count will be estimated. For generality, we use read count in this article to refer to fragment count for paired-end sequencing. The problem
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 4
Author Manuscript
statement is formally described below, and the mathematical notations are summarized in Table 1. PROBLEM STATEMENT 1 Given a set of read alignment records, the amount of reads in each genomic locus can be computed by counting the number of alignments fall into each region. A locus here refers to a region that span through a well-annotated gene or pseudogene, or a fragment of an intergenic region. Isoforms of the same gene is considered as one locus. Reads for expressed genes are mostly aligned back to themselves (the corresponding genomic loci), but they can also be misaligned or ambiguously aligned to other homologous loci. Thus, the goal is to estimate the true read counts for these expressed genes. Input: {s1, s2, …, sm} = a vector of read count for locus j, where 1 < j < m.
Author Manuscript
Output:
1 = a vector of estimated read count for gene i, where 1 < i < n.
Knowing only the read alignments and read counts for all loci is insufficient to recover the true read counts of expressed genes. However, we can consider reads aligned to each locus come from the corresponding known gene (itself) and its homologous genes. The ratio of this composition can be learned from simulated data. We use a distribution matrix X to keep track of the aligned loci for each gene, where rows represent the origin of reads, and columns represent the destination of the alignment. The distribution matrix is defined below. DEFINITION 1
Author Manuscript
Distribution Matrix—Let R be a set of reads in the experiment, R = {r1, r2, rp, G be a set of genes, G = {g1, g2, ⋯, gn}, and L·be a set of genomic loci, L = {l1, l2, ⋯, lm}. X is a n×m distribution matrix with n genes and m loci. We use
to denote a subset of reads,
such that if and only if ϕ(rk, gi, lj) = 1. ϕ(rk, gi, lj) is an indicating function as described in Equation (1). Each value in the distribution matrix represents the number of reads from gene gi aligned to locus lj, and thus to represent the number of element in a set.
. We use the cardinality notation
(1)
Author Manuscript
Using the distribution matrix, the input of our problem statement can be viewed as the column sum of each locus. The data input can be rewritten as Equation (2), where the errors follow a normal distribution with mean of zero.
1We use the “^” symbol to represent estimated variables.
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 5
Author Manuscript
(2)
Author Manuscript
As observed in PseudoLasso, the amount of reads at xij is proportional to the expected number of reads yi for gene gi across different simulated replicates. Due to space limitation in a figure panel, we use ten pairs of parent-pseudogenes to illustrate this phenomenon in Figure 1. Paired-end reads are simulated from parent genes only. The expected number of reads for each gene (x-axis) is plotted against the observed read counts to itself and its pseudogene (y-axis), showing a linear relationship across different replicates. This linear property allows us to learn a general distribution ratio from simulated replicates. We define a pseudo matrix A for these ratios, where aij is the normalized value of xij, aij = xij/yi. Thus, xij can be rewritten as aijyi, and the pseudo matrix is defined below. DEFINITION 2 Pseudo Matrix—The pseudo matrix An×m is the normalized version of the distribution matrix. It is defined by the proportion of reads mapped to locus lj out of the total number of reads for a given gene gi. Let Y be a vector of expected number of reads for genes G, Y = {y1, y2, ⋯ yn}. yi is defined as
, such that rk ∈
if and only if ψ(rk, gi) = 1. ψ(rk, gi)
is another indicating function described in Equation (3). Each value is computed as
.
(3)
Author Manuscript
With Pseudo Matrix, we can rewrite Equation 2 as
Putting all n genes and m loci in a matrix form (n × m), we have (4)
Author Manuscript
Reinstating our problem statement using Equation 4, S is the data input, which is a vector of observed read count in each genomic locus. A is the pseudo matrix obtained from training data, and Y is the unknown read count of expressed genes. The goal is to estimate the read count for all expressed genes (Y) with the following objective function:
(5)
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 6
Author Manuscript
Since the number of reads for each gene cannot be negative, this restriction is reflected on the non-negative constraint in our linear least squares equation. In general, there are more loci (m) than expressed genes (n) in both of the distribution matrix and the pseudo matrix due to homologous repeats in the genome. The complexity and running time in solving the linear equation increases exponentially with the number of expressed genes. An important observation is that these matrices are sparse, since each gene only has a few homologous loci across the genome. The sparsity allows us to partition both distribution and pseudo matrices into clusters of homologous loci. Each partition can be assessed independently to elevate the computational burden. We use a community detection algorithm to explore and identify these homologous communities, and the algorithm is described in Section 2.1.
Author Manuscript
The pseudo matrix represents ratios between the expected read count and the read count at different loci. This ratio may fluctuate slightly among replicates due to different sources of noise. To capture the noises and augment the prediction accuracy, we match the read count of loci (input) to the best profile obtained from the training replicates. The matching step is carried out through a classification approach, specifically the k-nearest neighbors algorithm described in Section 2.2. 2.1 Homologous Community Partition
Author Manuscript
The read count estimation for an expressed gene relies on the information of reads aligned to itself and its homologous loci. A group of homologous loci can be identified by their sequence similarity. From data mining perspective, a set of objects can be grouped together if they share a certain amount of features. In order to cluster these loci, we consider all substrings of the DNA sequence to be the features of expressed genes. A conventional approach is the k-mean clustering algorithm, which partitions the genomes into k clusters. Since each cluster has to capture all the adequate homologous loci, not just the parent and pseudogene regions, the challenge falls in predetermining the number of cluster, k. Instead, when we examine the alignments of these substrings, we can interpret them as information flow from a source (the locus of an expressed gene) to destinations (homologous loci). These flows depend on how an aligner recognize the reads. Thus, we propose a network model with directed graphs and use a community detection algorithm to divide genomic regions into different communities based on the behavior of a specific aligner.
Author Manuscript
In a directed graph, the vertex represents a locus in the genome, and the direct edge connecting two vertices describes the information flow between two loci. The weight of each edge is depicted by the number of reads from one locus aligned to another locus. Intuitively, the heavier the weight between two vertices indicates that more features are shared between this pair of loci. An edge to itself is expected to have a high weight since most of the reads come from one gene are likely to map back to itself. On the other hand, an edge with a light weight may due to random misalignment. The information flow approach is able to remove this type of noise and exclude the vertices with weak connections in the community. Weights for this network graph can be acquired through the global distribution matrix, which keeps track of all alignment behaviors for gene features.
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 7
Author Manuscript
The partition step uses the information from a global distribution matrix (X) to separate the genomic loci into different homologous communities. Since each express gene corresponds to one of these loci, the partition step also segregates these genes into different communities. The distribution matrix is now partitioned into several smaller matrices, and each of them can be computed independently with the same objective function described in Equation 5. 2.2 Read Count Profile Classification The pseudo matrix models the distribution ratio of reads among homologous loci, and this ratio may be slightly affected by various sources of noise. To capture the best pseudo matrix describing reads alignment behavior, each homologous community in different experiments is evaluated separately. The observed read counts of a set of homologous loci are matched to the best profile from the training sets using the k-nearest neighbors algorithm for classification.
Author Manuscript
3 IMPEMENTATION
Author Manuscript
In this paper, we enhance the framework of PseudoLasso by incorporating the community detection technique to separate genomic regions into non-overlapping homologous communities. Read distribution behavior of each cluster can be learned individually through linear regression. The workflow of this enhanced approach is illustrated in Figure 2. The training stage contains two tasks. The first task involves feature generation for homologous community detection; the second task computes read distribution within each community. In the validation stage, a set of normalized distribution matrices (pseudo matrices) from the training stage are used to estimate the read mapping behavior for a given data, and the true expression values are optimized using a non-negative least-square model. Details of our framework are described below. 3.1 Training Stage
Author Manuscript
To cluster the homologous loci, we use all substrings of transcript sequences as features. Given a list of transcript sequences, substrings are generated using a sliding window approach with a window size of 100bp. The choice of this window size is consistent with the read length we intent to model. These substrings are tagged with an gene ID of their origins, and are aligned back to the reference genome using TopHat2. We iterate through all alignment records to construct a distribution matrix indicating the origins and the destinations of all reads. Since all substrings are included, this distribution matrix is referred to as the global distribution matrix, providing the complete knowledge of information flows among genomic regions. We use infomap implemented in the igraph package of R [21] [22] to identify and partition homologous loci. infomap uses map equation as the objective function, which aims at maintaining the information flow during community partition. The distribution matrix is constructed by iterating through all ID-tagged reads in simulation. In a real experiment, this information is unknown, and can only be inferred through the normalized distribution matrices from simulations in the training stage. RNAseqSim [23] is used to simulate paired-end reads for a list of genes, with fragment size ranging from 100bp to 400bp. In order to mark the origin of each read, the software is modified to inherit the
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 8
Author Manuscript
gene ID in read names. Ten different levels of read coverage are used to imitate low (5X, 7X, and 10X), medium (13X, 15X, 17X, and 20X) and deep (23X, 27X, and 30X) sequencing; transcript abundance is assigned either with a fix number across all transcripts (4A, 6A, 8A)2 or with three different sets of random numbers (R1A, R2A, R3A)3. In total, the combination yields 60 sets of data. Six datasets are randomly chosen for validation, and the remaining sets serve as technical replicates during training. We use TopHat2 for short reads alignment. Applying its default settings, multiple alignments are reported up to 20 records, and these multireads are kept in our analysis. In addition, we use Samtools [24] to retrieve the alignment information for mapped reads, and Bedtools to facilitate the matching between genomic loci and gene annotations. Since our methods focus on correcting the read count for each loci, isoforms are treated as the same gene in the matching step. In another word, we target the alignment correction on gene level.
Author Manuscript
3.2 Validation/Experimental Stage As described in Equation 5, given a new set of reads, read count for each gene can be estimated through the observed read counts of all loci along with the pseudo matrices from training sets. The observed read counts are first matched to the best pseudo matrix using knearest neighbor classification, and the predicted read counts are optimized by solving the non-negative least squares problem. Both of these steps are implemented in Matlab.
Author Manuscript
For validation, a separate set of replicates is used to verified the predicted read counts for a list of genes, including pseudogenes and parent genes. We use accuracy to evaluate the performance of different methods. Accuracy is computed as 1 − err, where err is prediction error. The prediction error is the absolute relative error with respect to the true read count. Formally, accuracy is defined as:
After obtaining estimated read count for each gene, reads are realigned using the alignment correction algorithm proposed in PseudoLasso. The algorithm focuses on examining the alignment records and reassign the plausible alignments to the location of these genes up to the estimated amount. The algorithm can be divided into three phases:
Author Manuscript
•
Retain the uniquely mapped reads
•
Assign the multireads to the most likely region
•
Relocate the uniquely mapped reads from homologous loci
The first phase is summarized in Algorithm 1. The algorithm collects all uniquely mapped reads that fall within the genomic locus lj. These reads are sorted from good to bad based on the sequence quality (MAPQ) and whether the mate read is properly mapped for paired-end
2We use letter ‘A’ to denote the magnitude of abundance level. 4A, 6A, and 8A mean that transcripts are expressed 4,6,8 times respectively. 3R1A, R2A, and R3A denote the dataset where abundance levels are randomly assigned to each gene, ranging from 5A to 10A
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 9
Author Manuscript
sequencing. Each gene gi is associated with a locus lj in the distribution matrix. Since there are more loci than genes (m > n), loci that do not overlap with any expressed genes are estimated with read count of zero. The top uniquely mapped reads are kept for the associated locus lj. If the count of retained reads reaches the estimated amount , the locus is marked as resolved. On the other hand, if the number of uniquely mapped reads exceeds the estimated abundance, leftover reads are subjected to be reassigned to another homologous locus in the third phase.
Author Manuscript
In the second phase, Algorithm 2, “unresolved loci” from Phase I are sorted based on their remaining counts. Unassigned multireads are retrieved for these loci and sorted using the same criteria mentioned above. The correction starts with a locus with the least amount of remaining counts, and assign the sorted multireads to locus lj until either it is resolved or there is no more multireads aligned to this locus. Once a multiread has been assigned, it is removed from the multireads pool. The remaining count is updated after each assignment, and if it reaches zero, the corresponding locus is resolved.
Author Manuscript
The remaining unresolved loci do not have enough aligned reads, and thus require realigning leftover reads of their homologous loci from phase I. Identifying the homologous loci can be facilitated through the result of community detection. The global distribution matrix is divided into several subsets after homologous community partition. Within each community, homologous locus can be revealed by the non-zero value in the sub-matrix, where xij > 0 for gene gi and locus lj, and lj is not the associated gene for gi. Similar to phase II, the correction starts with the locus with the least amount of remaining counts. Leftover reads are aligned back to the sequence spanned unresolved loci using Blastn [25]. The top alignments with evalue ≤ 1.00E − 05 are assigned to the unresolved gene until either there is no more alignment or the gene is resolved. The Blastn alignment record is converted to BAM format. Details of this phase is described in Algorithm 3. The final list of alignments combines the results from these three phases, denoted as SList in algorithms. The corrected alignments are stored in BAM format, and can be processed for downstream analyses, such as transcriptome reconstruction and statistical analysis.
Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 10
Author Manuscript
Algorithm 1 Retain Uniquely Mapped Read
Author Manuscript
Algorithm 2 Assign Multiread
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 11
Author Manuscript
Algorithm 3 Realign Reads from Homologous Loci
Author Manuscript Author Manuscript
4 RESULTS
Author Manuscript
The Yale pseudogene knowledgebase provides the information describeing the relationship between pseudogenes and their homologous parents. Build 79 contains 15774 pseudogenes and 6206 parent genes locating on canonical chromosomes (Chr1-22, X, Y, and mitochondria). We cross-reference these pseudogenes to the gene definitions from Ensembl release 79 of Human Genome GRCh38. Of these 15774 pseudogenes, 14778 are annotated by Ensembl. There are only 2610 transcripts annotated as transcribed pseudogenes. We use these transcribed pseudogenes to fine-tune the parameters, and to evaluate the effectiveness and efficiency of our approach. We start with a small dataset by randomly selecting 600 parent transcripts along with their transcribed pseudogenes (Dataset A). We then gradually increase our dataset size to model the read distribution for more pseudogenes (Dataset B, C, and D). The model is further applied to three larger datasets. The first one includes all transcribed pseudogenes and their parents (Dataset E); the second one includes all Ensembl pseudogenes with their parents (Dataset F; those without parents are omitted); the third dataset contains a complete list of annotated transcripts in human transcriptome (Dataset G). Ambiguous regions in a genome do not limit to pseudogenes; other segments such as long non-coding RNAs (lncRNA) can also be modeled. For this reason, we conduct an analysis
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 12
Author Manuscript
on the full human transcriptome to evaluate our model beyond pseudogenes. The composition of different possible types of repeated sequences is shown in Figure 3. Table 2 summarizes all of the datasets. For each dataset, we address the number of expressed transcripts, corresponding genes, and the number of aligned genomic regions. Since our approach focuses on gene level, genomic regions are characterized based on gene definitions. We also report the proportion of multireads in each dataset. There are about 9– 12% of multireads in datasets containing highly repeated sequences (Datasets A–F). It is also expected to see a drop of multiread proportion on Dataset E since this dataset contains a lot more genes with unique sequences. 4.1 Community Detection
Author Manuscript
We use datasets A – D to illustrate the advantage of incorporating community partition to our framework. Using the sliding window approach, consecutive substrings of the parent and pseudogene transcript sequences are treated as short reads and aligned to the reference genome. These aligned loci are partitioned into different clusters through community detection algorithms. The igraph package in R provides two different methods that perform community detection for a directed graph, infomap and edge-betweenness. Edgebetweenness focuses on the property of community structure, and has the tendency to produce larger communities. We analyze the performance of three different approaches, including no community partition, partition with infomap, and partition with edgebetweenness. Figure 4a illustrates the running time of these two algorithms across four datasets. Edge-betweenness requires significantly more time when there are more genes. The partition results of these three approaches are summarized in Table 3. Compare to infomap, edge-betweenness consistently partitions genomic loci into fewer communities.
Author Manuscript
After the partition, the read count of each community is estimated through linear regressions. We compare the performance of two non-negative linear regression solvers. One of them is the default solver provided by Matlab, lsqnonneg, and the other is an efficient sparse learning solver, nnLeastR, which utilizes the l1-norm regularization properties to improve the efficiency [26]. In nnLeastR, there is a λ parameter to specify the sparsity. Since the community detection algorithm has already partitioned the genomic regions into many dense communities, an ideal value for λ parameter would be 0.1 to minimize the sparsity. We use both 0.1 and 0.5 in our evaluation.
Author Manuscript
To evaluate the prediction running time and accuracy, we repeat each experiment five times for cross-validation. In each experiment, training and testing data are generated as described in Section 3.1, with 54 replicates randomly selected for training and 6 replicates for validation. The average running time and standard deviations of these 30 measurements are plotted in Figure 4b. In the comparison between keeping and discarding the partition, running time grows quadratically as we increase the number of genes without community partition. On the other hand, the running time grows linearly with the number of genes when incorporating the community detection approaches. Comparing the two community detection methods, infomap is able to provide a more efficient growth in running time than edge-betweenness.
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 13
Author Manuscript
We analyze the accuracy in the same settings. The average accuracy and standard deviations are plotted in Figure 4c. We include the read count results from TopHat2 as our baseline, where multireads are discarded. Overall, our accuracy is much higher than TopHat2. Comparing the prediction power between infomap and edge-betweenness, the accuracy is higher for infomap in larger dataset (Dataset D). We also break down the accuracy assessment for each method based on the community size and the read coverage in A.1a and A.1b. The running time is much faster when coupling with nnLeastR as our linear regression solver than lsqnonneg. The speed increases as we raise the degree of sparsity. Although the difference in accuracy is not obvious in the Figure, we observe a fair drop in accuracy as we increase the sparsity. As a result, nnLeastR with λ = 0.1 provides an efficient and accurate approach.
Author Manuscript
Besides the advantage of time efficiency and prediction accuracy, the concept of information flows provides a better explanation of our distribution matrix, and thus is able to remove noise and capture the underlying structure of homologous communities. The partition of Dataset A is visualized in Figure 5. Dataset A contains 600 parent transcripts and 304 pseudogene transcripts. These transcripts correspond to 712 genes. The substrings of these transcript sequences align to 4926 loci. Applying infomap, these 4926 loci are grouped into 875 homologous communities, with the biggest community of 2377 loci.
Author Manuscript
We also examine the tightness of all communities by evaluating the information flow within and between communities. For each pair of vertices, v1 and v2, connected by a non-zero weighted edge e12, we add up the weight of e12 to within-community score if v1 and v2 are in the same community. Otherwise, the weight of e12 is added to between-community score. These scores are normalized by the number of edges. We report the ratio of withincommunity score and between-community score. The higher the ratio represents a more compact set of communities. We report the overall ratios of the partition generated by infomap and a random partition in Table 4. The results show that infomap consistently preserves the compactness of each community for different datasets. We use the same analysis to evaluate the compactness of the largest community in Figure 6.
Author Manuscript
Dataset G contains 198278 transcripts, including 14778 pairs of parent-pseudogene. Infomap divides these transcripts into 36114 clusters. The information flow is based on how a specific aligner maps reads to different regions in the genome. Loci from a pair of parent gene and pseudogene may not end up in the same cluster if the aligner does not map reads from pseudogene to their parents and vice versa. Table 5 examines the proportion of these pairs being placed in the same or different clusters. There are around 18% of these pairs being partitioned into the same community. Among these 18%, 41% are processed pseudogenes and 55% are unprocessed pseudogenes. Others include unitary pseudogenes or other pseudogenes that are yet to be classified. Majority (82%) of these pairs are partitioned into different communities; 75% of them are processed pseudogenes, and 22% of them are unprocessed pseudogenes.
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 14
4.2 k-NN Classification
Author Manuscript Author Manuscript
k-NN classification is used to select the best pseudo matrix describing reads alignment behavior for a new dataset. In the training stage, there are 54 sets of replicates representing a wide range of coverage, and hence the alignment profile of each homologous community can be classified into 54 possible models. However, the running time increases as we raise the number of models (k-parameter) in this classification step, especially for datasets with a large number of expressed genes. In order to find the optimal k-parameter, we use two small gene lists (Datasets A and B) to evaluate the trade-off between running time and accuracy for different number of k. Six replicates are used for the evaluation, and the average running time and prediction errors are plotted in Figure 7. The results show that average error rates drop rapidly at first, and reach a steady phase after k = 10, as highlighted by the grey boxes in the figure. Compares to using all 54 replicates, a similar accuracy can be achieved at 10 replicates with a five-fold increase in speed. Consequently for all of our analyses, we set the k-parameter to 10. 4.3 Read Count Estimation We use infomap algorithm and nnLeastR linear solver with λ = 0.1 to estimate the read count for three larger gene sets (Datasets E, F, and G). Assuming all of the transcripts are expressed, the average running time and accuracy are listed in Table 6.
Author Manuscript Author Manuscript
We next take a closer look at the read count estimation for each 59129 genes on Dataset G. We randomly select 12 replicates (two result sets from the cross-validation experiments), and compute the average accuracy of read count estimated by our method and read count reported by TopHat2 after removing multireads. We use an accuracy threshold of 0.8 to separate out well-estimated and poorly-estimated read count. Consequently, we can divide these 59129 genes into four different categories. The first two categories are well-estimated by both TopHat2 and our method, and poorly-estimated by both TopHat2 and our method. The third category depicts those that are well-estimated by TopHat2, but deteriorate after applying our method. The last category refers to those that are poorly-estimated by TopHat2, but are rescued by our method. The second category can be further divided into four subgroups to describe whether each method over-estimate or underestimate the read count. Over-estimation refers to the estimated read count being higher than the true read count, and under-estimation refers to the estimated read count being lower than the true read count. Similarly, the last two categories can be further divided into two subgroups. Figure 8 illustrates the proportion of each category and subgroups. In general, our method is able to accurately estimate the read count for 89.84% (73.33% + 16.51%) of all 59129 genes, whereas TopHat2 performs well only for 76.96% (73.33% + 3.63%). Among all the genes that are poorly-estimated by TopHat2 (16.51%+6.53%), our method is able to recover the read count for 71.6% of these genes (6.53%/(16.51% + 6.53%)). On the other hand, our method misses only 4.7% of the genes that are well-estimated by TopHat2 (3.63%/(3.63% + 73.33%)). To demonstrate the ability of our method recovering the read count estimation from TopHat2, we list a few genes as examples in Table 7. We randomly select the results from three experiments representing low, medium and high coverage. The first two pairs are
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 15
Author Manuscript
examples of processed pseudogene. ENSG00000111640 and ENSG00000106153 are protein-coding genes. According Pseudogene.org, they both contain at least one processed pseudogene located on a different chromosome, ENSG00000228232 (PGOHUM00000304558) and ENSG00000235084 (PGOHUM00000296224) respectively. These examples show that TopHat2 has the tendency to align reads from parent genes to their processed pseudogenes. The next example set contains an unprocessed pseudogene. The reads from ENSG00000154582 align to two other genomic regions besides itself, ENSG00000265545 and ENSG00000259838. ENSG00000265545 (PGOHUM00000294709) is a processed pseudogene of ENSG00000154582, according to Pseudogene.org. However, ENSG00000259838 is missing from the same pseudogene database, which could be a potential pseudogene of ENSG00000154582. The last four genes are the lincRNA, where TopHat2 consistently under-estimates their read counts.
Author Manuscript
5 CONCLUSION Pseudogenes quantification via RNA-Seq remains challenging due to the high sequence similarity with their parent genes. Short read sequences from low complexity genome region are likely to align to multiple places, and thus makes it difficult to re-establish the origins of the reads. Many approaches have been proposed to resolve the multiple alignment issues; however, none of these approaches consider the alignment relationship between homologous loci nor attempt to correct the falsely aligned reads.
Author Manuscript
Recently, we proposed a linear regression approach, PseudoLasso, to learn the read alignment behaviors, and leverage this knowledge to accurately estimate the abundance and correctly realign reads among homologous loci. One of the bottlenecks of PseudoLasso is the computational time in read count estimation for a new set of data. In this work, we refine the framework of PseudoLasso to elevate this computational burden. We incorporate community detection algorithm to partition genomic regions into different homologous communities, and build a linear regression model separately for each community. The nonnegative least square problem is then solved by an efficient solver, nnLeastR from the SLEP package, on a smaller dataset with a further potential of parallelism.
Author Manuscript
To demonstrate the efficiency and accuracy of our approach, we build the linear models for a list of parent genes and their corresponding expressed pseudogenes based on the annotation provided by Yale pseudogene knowledge-base. These genes are first partitioned into homologous communities using infomap. We use 100bp paired-end reads to learn the alignment behavior of TopHat2 for each homologous community, and apply these models to simulated datasets. The results show that our approach is able to estimate the abundance with a high accuracy, and complete the analysis with a reasonable time frame. We further apply our approach to model homologous regions across the entire human transcriptome. The read count estimated by our approach presents a much higher accuracy than TopHat2. The alignment behavior is sensitive to read length, and thus a new training is required for different read length. Currently, many publicly available RNA-Seq data are generated with read length of 50bp and 75bp. Therefore, we plan on training a new model with 75bp of read
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 16
Author Manuscript
length and evaluate the pseudogene expression on the Human Body Map 2.0 Project (NCBI GEO accession GSE30611).
Acknowledgments We thank Zhaojun Zhang for his comments and discussion, and Shunping Huang for his assistance in setting up the RNA simulator software, RNAseqSim. This work was funded by NIH R01HG006703, NIH P50 GM076468-08 and NSF IIS-1313606.
Appendix A; Detail Accuracy
Author Manuscript Author Manuscript
Fig. A.1.
Author Manuscript
Breakdown of Accuracy Comparison. The accuracy of different methods are further compared in two different aspects: the size of community and the read coverage. The size of community is categorized into four groups: large refers to the largest community; medium refers to community with more than 30 members (exclude the largest community); small refers to community with 2–30 members; tiny refers to community with single member. The read coverage is divided into three groups: 5X, 7X, and 10X represent low sequencing; 13X, 15X, 17X and 20X represent medium sequencing; 23X, 27X, and 30X are deep sequencing (labeled as “high” in the figure). The average accuracy and standard deviation are plotted in both figures.
References 1. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nature reviews Genetics. Feb; 2011 12(2):87–98. 2. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods. Jun; 2011 8(6):469–77. [PubMed: 21623353]
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 17
Author Manuscript Author Manuscript Author Manuscript Author Manuscript
3. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology. Apr.2013 14(4):R36. [PubMed: 23618408] 4. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. Oct.2010 38(18):e178. [PubMed: 20802226] 5. Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic acids research. Aug; 2010 38(14):4570–8. [PubMed: 20371516] 6. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology. Jul; 2011 29(7):644–52. 7. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I. De novo assembly and analysis of RNA-seq data. Nature methods. Nov; 2010 7(11):909–12. [PubMed: 20935650] 8. Hawkins PG, Morris KV. Transcriptional regulation of Oct4 by a long non-coding RNA antisense to Oct4-pseudogene 5. Transcription. Jan; 2010 1(3):165–175. [PubMed: 21151833] 9. Lin H, Shabbir A, Molnar M, Lee T. Stem cell regulatory function mediated by expression of a novel mouse Oct4 pseudogene. Biochemical and biophysical research communications. Mar; 2007 355(1):111–6. [PubMed: 17280643] 10. Zou M, Baitei EY, Alzahrani AS, Al-mohanna F, Farid NR, Meyer B. Oncogenic Activation of MAP Kinase by BRAF Pseudogene in Thyroid Tumors 1. Neoplasia. 2009; 11(1):57–65. [PubMed: 19107232] 11. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic acids research. Jan; 2007 35(Database issue):D55–D60. [PubMed: 17099229] 12. Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei C-L, Gingeras TR, Guigó R, Harrow J, Gerstein MB. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome research. Jun; 2007 17(6):839–51. [PubMed: 17568002] 13. Han YJ, Ma SF, Yourek G, Park YD, Garcia JGN. A transcribed pseudogene of MYLK promotes cell proliferation. FASEB journal : official publication of the Federation of American Societies for Experimental Biology. Jul; 2011 25(7):2305–12. [PubMed: 21441351] 14. Svensson O, Arvestad L, Lagergren J. Genome-wide survey for biologically functional pseudogenes. PLoS computational biology. May.2006 2(5):e46. [PubMed: 16680195] 15. Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu X, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. The GENCODE pseudogene resource. Genome Biology. Jan.2012 13(9):R51. [PubMed: 22951037] 16. Tonner P, Srinivasasainagendra V, Zhang S, Zhi D. Detecting transcription of ribosomal protein pseudogenes in diverse human tissues from RNA-seq data. BMC Genomics. Jan.2012 13(1):412. [PubMed: 22908858] 17. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008; 5(7):1–8. [PubMed: 18175409] 18. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. Jan.2011 12:323. [PubMed: 21816040] 19. Paaniuc B, Zaitlen N, Halperin E. Accurate estimation of expression levels of homologous genes in RNA-seq experiments. Journal of computational biology : a journal of computational molecular cell biology. Mar; 2011 18(3):459–68. [PubMed: 21385047] 20. Ju, CJT., Zhao, Z., Wang, W. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: ACM; 2014. PseudoLasso:
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 18
Author Manuscript Author Manuscript
Leveraging Read Alignment in Homologous Regions to Correct Pseudogene Expression Estimates via RNASeq; p. 569-578.ser BCB ’14 21. Rosvall M, Bergstrom CT. Maps of information flow reveal community structure in complex networks. 2007 arXiv preprint physics.soc-ph/0707.0609. 22. Rosvall M, Axelsson D, Bergstrom CT. The map equation. The European Physical Journal-Special Topics. 2009; 178(1):13–23. 23. Zhang Z, Huang S, Wang J, Zhang X, Pardo Manuel de Villena F, McMillan L, Wang W. GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment. Bioinformatics (Oxford, England). Jul; 2013 29(13):i291–9. 24. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England). Aug; 2009 25(16):2078–9. 25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. Oct; 1990 215(3):403–10. [PubMed: 2231712] 26. Liu, J., Ji, S., Ye, J. SLEP: Sparse Learning with Efficient Projections. Arizona State University; 2009. [Online]. Available: http://www.public.asu.edu/jye02/Software/SLEP
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 19
Author Manuscript Author Manuscript
Fig. 1.
Comparison of Expected Read Count and Observed Read Count. Paired-end reads of 100bp are simulated for functional parent genes in 60 sets of experiment. The expected number of reads from a parent gene (x-axis) is plotted against the number of reads observed in either the parent gene itself, or its pseudogene (y-axis). Parent gene and pseudogene profiles are depicted by different symbols, but the parent-pseudogene pair is indicated by the same color.
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 20
Author Manuscript Author Manuscript Fig. 2.
Author Manuscript Author Manuscript
The enhanced framework of PseudoLasso. In the training stage, gene features are represented by all substrings of gene sequences. These features are used to identify homologous loci through a community detection approach. Mining on the same list of genes, paired-end short reads are simulated with different coverages and are aligned back to the reference genome, providing the potential models to describe read distribution of experimental data. These distribution matrices are normalized, and split based on the homologous community partition. The read counts of genes for validated data or experimental data are estimated through these models. The two tasks in the training stage are indicated by different colors: steps with blue arrows describe the first task of feature generation and community detection; steps with purple arrows describe the second task of read distribution computation within each community. In the validation stage, reads from other experiments are first aligned to the reference genome. The alignment profiles are matched to the best normalized matrices from the training stage using k-nearest neighbor classification. The predicted read counts are optimized by solving the non-negative least squares equations.
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 21
Author Manuscript Fig. 3.
Author Manuscript
Distribution of possible types of repeated segments. Repeated sequences can be found in different types of transcripts, including pseudogenes, non-coding RNA, lincRNA, immunoglobulin pseudogenes and antisense RNA. There are 16810 pseudogenes reported by Ensembl, but only 14778 of them are cross-referenced to Pseudogene.org with annotated parents.
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 22
Author Manuscript Author Manuscript
Fig. 4.
Running Time and Accuracy. Three different community partition algorithms are compared (no community partition, partition with infomap, and partition with edge-betweenness). The running time of infomap and edge-betweenness is plotted and labeled in 4a. The prediction stage requires non-negative linear regression solvers. Two different solvers are evaluated, lsqnonneg and nnLeastR. The λ parameter is set to 0.1 and 0.5 for nnLeastR. The average running time for the prediction stage is plotted in 4b, and the average accuracy is plotted in 4c. The standard deviations are also drawn, but the values are too small in 4c. The accuracy of TopHat2 is included as our baseline.
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 23
Author Manuscript Author Manuscript Fig. 5.
Author Manuscript
Homologous Communities Partition. 4926 genomic regions from dataset A are partitioned into 875 homologous communities. The colors may be repeated to represent different communities (since there are only 256 colors), but loci in the same community are either positioned in a closed proximity or linked with an arrow. The arrow indicates information flow between two loci in the same community.
Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 24
Author Manuscript Author Manuscript Fig. 6.
Author Manuscript
The compactness of the largest community is evaluated by the ratio of within- and betweencommunity scores. The ratio of the largest partition generated by infomap algorithm is consistently higher than the ratio from a community with randomly selected vertices of a similar size.
Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 25
Author Manuscript Author Manuscript
Fig. 7.
Parameter Selection for kNN-Classification. The running time is plotted against the prediction error for different k-parameter in the kNN-Classification. The y-axis represents the prediction error in percentage scale, and the x-axis represents the running time in minute. The number next to each data point indicates the k-parameter used. The grey box highlight the best k-parameters judging from the trade-off between accuracy and efficiency. Dataset A (bottom) and B (top) are evaluated, and values are collected by taking the average over six replicates.
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 26
Author Manuscript Author Manuscript
Fig. 8.
Accuracy Evaluation Between TopHat2 and Our Method on 59129 Genes. Well-estimation is defined by having accuracy above 0.8.
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 27
TABLE 1
Author Manuscript
Mathematical Notations Symbol
Description
G = {g1, g2, ⋯, gn}
a set of gene gi, where 1 < i ≤ n
L = {l1, l2, ⋯, lm}
a set of mapped loci lj, where 1 < j ≤ m and n ≤ m
Y = {y1, y2, ⋯, yn}
a set of expected read count yi for gene gi
R = {r1,r2, ⋯, rp}
a set of reads in an experiment a subset of reads come from gene gi a subset of reads come from gene gi and aligned to locus lj
Author Manuscript
Xn×m
an n by m distribution matrix
An×m
an n by m pseudo matrix, which is the normalized distribution matrix Xn×m.
S = {s1, s2, ⋯, sm}
read count sj for locus li, it is equivalent to the column sum of distribution matrix
Muniq
alignment records of uniquely mapped reads
Mmulti
alignment records of multireads
C = {c1, c2, …, cn} SList
a set of variables associated with gene i to keep track of the number of missing reads final list of corrected alignments
ψ(rk, gi)
an indicating function for read rk comes from gene gi
ϕ(rk, gi, lj)
an indicating function for read rk comes from gene gi and aligned to locus lj
Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Author Manuscript
Author Manuscript
Author Manuscript 930 2610 14778
2000
6202
6202
D
E
G
F
616
198278
918
408
800
1200
B
C
59129
17863
5890
2248
1398
712
304
600
A
Corresponding Genes
Pseudogene Transcripts
Parent Transcripts
Dataset
117022
40329
29071
12997
8622
7146
4926
Aligned Loci
3.89%
9.86%
9.66%
9.94%
11.98%
11.14%
12.67%
Multiread Proportion
Dataset Description: Datasets A-D contain a list of randomly selected parent genes, along with the transcripts of their corresponding transcribed pseudogenes. Isoforms are merged and labeled with the same gene ID. Gene features are aligned to the reference genome. Keeping all multiple alignments reported by TopHat2, number of aligned regions and multireads proportion are recorded for each dataset. Dataset E contains the full set of transcribed pseudogenes and their parents; dataset F contains all pseudogenes and their parents. Dataset G contains all annotated transcripts (198278 transcripts) in human transcriptome.
Author Manuscript
TABLE 2 Ju et al. Page 28
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Author Manuscript
Author Manuscript
Author Manuscript NA NA NA
6186
36114
G
1872
2611
D
14702
1012
1540
C
F
686
E
544
864
1084
B
Betweenness
Infomap
1
1
1
1
1
1
1
None
Number of Communities
A
Dataset
45302
9500
14692
6699
4453
4227
2390
Infomap
NA
NA
NA
7672
5064
4462
2439
Betweenness
117022
40329
29071
12997
8622
7146
4926
None
Size of the Largest Community
Summary of Community Partition. Three different approaches are applied: partition with infomap, partition with edge-betweenness, and no partition. The number of communities and the size of the largest community are compared for each dataset among these three approaches. The comparison with edgebetweenness is omitted for big datasets (E, F, and G)
Author Manuscript
TABLE 3 Ju et al. Page 29
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Author Manuscript
Author Manuscript
Author Manuscript Infomap
2840461
68225 231777
81743814
733453072
F
36117
54790997
E
G
1726540
15461
18849184
D
98952
11959068
343799
127663
6272 10682
7374696
12140773
B
42139
6018
5894824
Raw Score
No. Edges
388422
37598
22626
3835
1508
3385
588
No. Edges
Between Community
Raw Score
Within Community
C
A
Dataset
Methods
102.78
15.86
19.88
13.60
13.43
40.22
13.67
Norm. Ratio
566696974
52387004
38924120
13661693
8344840
5551001
4065941
Raw Score
147423
22576
17667
6933
3850
2456
2244
No. Edges
Within Community
178715166
32197271
17593417
5531290
3923596
1922647
1871022
Raw Score
472776
83247
41076
12363
8340
7201
4362
No. Edges
Between Community
Random
10.17
6.00
5.14
4.40
4.61
8.47
4.22
Norm. Ratio
Compactness of Community Partition. The compactness of a partition is evaluated by the ratio of within and between community scores. The overall ratio of the partition generated by infomap algorithm is compared to a random partition
Author Manuscript
TABLE 4 Ju et al. Page 30
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 31
TABLE 5
Author Manuscript
Partition Information for 14778 Pairs of Parent-Pseudogene Same Community
Different Community
Processed Pseudogene
1095
9114
Unprocessed Pseudogene
1472
2668
Others
101
328
Author Manuscript Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Ju et al.
Page 32
TABLE 6
Author Manuscript
Average Running Time and Accuracy for Large Datasets. Dataset
Expressed Genes
Running Time (hrs)
TopHat2 Accuracy
Accuracy of Our Method
E
5890
2.516
0.786
0.899
F
17863
1.095
0.831
0.860
G
59129
11.491
0.569
0.752
Author Manuscript Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.
Author Manuscript
9:138199933-138203325
ENSG00000154582
ENSG00000237419
2:131983937-131994161
8:73939169-73972287
ENSG00000265545
ENSG00000236485
18:4459230-4459458
ENSG00000259838
7:77038-80418
15:41557027-41557435
ENSG00000235084
13:18540497-18550697
1:15604597-15605043
ENSG00000106153
ENSG00000231238
7:56101569-56106576
ENSG00000228232
ENSG00000242611
12:6533927-6538374
X:39787132-39788136
ENSG00000111640
Location
lincRNA
lincRNA
lincRNA
lincRNA
protein coding
processed pseudogene
unprocessed pseudogene
processed pseudogene
protein coding
processed pseudogene
protein coding
Biotype
101
114
57
67
2014
25
29
69
181
149
1776
Expected
Author Manuscript
Gene Name
55
16
43
45
1858
157
104
86
131
410
1307
TopHat2
78
153
57
70
2014
21
38
71
182
150
1745
Our Method
Replicate 5X_R1A
281
325
210
229
4840
109
157
110
601
321
4874
Expected
170
65
158
167
4528
420
329
156
422
1067
3571
TopHat2
279
323
214
248
4865
66
128
118
609
383
4781
Our Method
Replicate 15X_R3A
Author Manuscript
Read Count Estimations from TopHat2 and Our Method.
502
361
195
239
8111
134
169
203
908
658
7664
Expected
299
70
146
172
7567
759
536
269
640
1849
5680
TopHat2
517
372
202
283
8132
154
202
205
890
730
7653
Our Method
Replicate 23X_R3A
Author Manuscript
TABLE 7 Ju et al. Page 33
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.