HHS Public Access Author manuscript Author Manuscript

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18. Published in final edited form as:

IEEE/ACM Trans Comput Biol Bioinform. 2017 ; 14(3): 522–533. doi:10.1109/TCBB.2016.2591533.

Efficient approach to correct read alignment for pseudogene abundance estimates Chelsea J.-T. Ju, Zhuangtian Zhao, and Wei Wang [Member, IEEE] The authors are with the Department of Computer Science, University of California, Los Angeles, CA, 90095

Author Manuscript

Abstract

Author Manuscript

RNA-Sequencing has been the leading technology to quantify expression of thousands genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of an genome. One group of these homologies is the paralog pseudogenes. Pseudogenes arise from duplication of a set of protein coding genes, and have been considered as degraded paralogs in the genome due to their lost of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With the growing interests in quantifying the expression level of pseudogenes at different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate the expression level among homologous genes. Previously in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach is able to retain the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic growth to linear with respect to the number of genes.

Keywords Index Terms—Bioinformatics; RNA-Seq; Pseudogene; Linear Regression; Community Detection

Author Manuscript

1 Introduction THE advances of high throughput sequencing (HTS) have an incredible impact in the field of genetics. One of the main advantages is to simultaneously measure the expression of many genes in different tissues. The technology that utilizes HTS to capture the snapshot of existing RNA and quantify the expression abundance is referred to as RNA-Sequencing (RNA-Seq). RNA-Seq experiments start with converting purified and sheared RNA molecules to complementary DNA (cDNA). The sequence of cDNA fragments are read out

Ju et al.

Page 2

Author Manuscript

by sequencers, producing millions of short reads, ranging from 50bp to 300bp in length. The fragment can be read from one end only (single-end sequencing) or from both ends (pairedend sequencing). Following a series of data analysis, the amount of reads from a transcript can be used to estimate the expression abundance.

Author Manuscript

Current methods for RNA-seq quantification can be classified into two main approaches: alignment-based, and assembly-based [1] [2]. The alignment-based approach requires a reference genome or a reference transcriptome in the initial step, and determines where the short reads come from by aligning them to the reference. This approach is limited to those species with a reference genome, and is biased to the information provided by the reference. Several commonly used aligners include Tophat [3], MapSplice [4], and SpliceMap [5]. The assembly-based approach reconstructs the transcriptome by assembling reads from an experiment, and aligns reads to the reconstructed transcriptome. It lifts the constraint of having a reference genome or transcriptome, and removes the bias introduced by the reference. Several algorithms have been developed, such as Trinity [6] and Trans-ABySS [7]. However, transcriptome reconstruction is a computational intensive task, specifically with a large amount of short reads from a single experiment.

Author Manuscript

Despite the limitation, alignment-based approach is generally preferred due to its computational advantage. Most of the implementations can be easily parallelized. After aligning million of short reads to the reference, the abundance of each transcript is quantified based on the number of mapped reads, and normalized to account for transcript size and the total number of mapped reads. Statistical analysis can be further applied to identify significant changes in gene expression across different experiments. During the course of data analysis, each step has a cascading effect, and the outcome from read alignment can predominantly change the results of any downstream analysis. For this reason, we focus on the computational challenges of read alignment for RNA-Seq. The main challenge of read alignment lies on mapping reads that come from repeated parts of the genome. Most Eukaryotic genomes, including human genome, are full of large repeated segments. As a result, reads generated from these regions can be mapped to more than one position in the genome. This type of reads is referred to as the “multiread”, and is the main source of challenge for many computational biology problems ranges from de-novo assembly, copy number detection, and homologous genes expression profiling. One type of these repeats in the genome is the pseudogenes, which has received a notable attention due to the novel discovery of their regulatory roles in different biological processes [8], [9], [10].

Author Manuscript

Pseudogenes share a significant amount of sequence similarity with a set of protein-coding genes. They can be categorized into two forms based on their copying mechanisms [11]. The first form of pseudogenes derives from a direct genomic duplication of genes, and thus retain the intron-exon structure. They are often termed duplicated or non-processed pseudogenes. The second form of pseudogenes is the products of retrotransposition, where the mRNA transcripts of original genes are reverse transcribed and reintegrated into the genome at new locations. Consequently, they lose the introns and the 5′-end promoter sequence, and are referred to as processed pseudogenes. Both forms of pseudogenes can accumulate a substantial amount of mutations over time, which may disrupt their functions during

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 3

Author Manuscript

transcription or/and translation. For this reason, they are often labeled as “junk” DNA. Succeeding the ENCODE project, many studies have revealed that although pseudogenes lack functional gene products, they may interact with functional protein coding genes and regulate different biochemical process in cells [8], [9], [10], [12], [13], [14]. Genome-wide studies of pseudogenes have been extensively focused on identifying their genomic locations, and the knowledge from these studies are comprehensively integrated into publicly available databases, such as the GENCODE pseudogene resource [15] and Yale pseudogene resource [11]. On the other hand, genome-wide analyses of pseudogene expression remain challenging due to the highly sequence similarity with their homologous protein-coding genes.

Author Manuscript

In the effort of quantifying the expression of pseudogene, Tonner et al. [16] developed a pipeline that created a composite genome to include the human genome and the mRNA sequence of ribosomal protein genes, and kept only the uniquely mapped read to this genome for abundance estimation. The idea of discarding all mutireads is one of the most common approaches in resolving the multiple alignment issue. It is easy to apply, but tosses out critical information for quantification step. As a result, it creates a bias toward genes with a more unique sequence in the genome. Other related approaches to address the issue include assigning multireads to the best locus based on local coverage estimation from uniquely mapped reads [17], or based on probabilistic models [18], [19]. Nevertheless, none of these methods leverages the relationship of read alignment among homologous regions.

Author Manuscript

Recently, we propose a novel method, PseudoLasso [20], that keeps all multireads and correctly reassigns both unique reads and multireads between pseudogenes and their homologous regions. PseudoLasso is based on an observation that reads from a given gene are linearly distributed to different genomic loci. Since the homologous regions are sparse throughout the genome, this relation can be modeled using linear regression with L1 regularization (Lasso). Once the read count of each region is accurately estimated, reads are reassigned to regions in which the observed read count is lower than the estimation. In this work, we extend PseudoLasso by incorporating a community detection algorithm to identify and cluster the homologous genes. One of the bottlenecks of PseudoLasso resides on the computational time in solving the non-negative linear squares equations to estimate read counts for a new dataset. Building on the sparsity of homologous regions, we refine the framework to partition these genomic regions into different groups. The downstream analyses for read count estimation and alignment correction can be applied concurrently on smaller sets of genes.

Author Manuscript

2 Methods The ultimate goal of our method is to accurately estimate the read count for each gene, and leverage this information to guide the correction of short read alignments among homologous loci. In a paired-end sequencing experiment, a pair of mate reads are considered as a fragment, and thus the fragment count will be estimated. For generality, we use read count in this article to refer to fragment count for paired-end sequencing. The problem

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 4

Author Manuscript

statement is formally described below, and the mathematical notations are summarized in Table 1. PROBLEM STATEMENT 1 Given a set of read alignment records, the amount of reads in each genomic locus can be computed by counting the number of alignments fall into each region. A locus here refers to a region that span through a well-annotated gene or pseudogene, or a fragment of an intergenic region. Isoforms of the same gene is considered as one locus. Reads for expressed genes are mostly aligned back to themselves (the corresponding genomic loci), but they can also be misaligned or ambiguously aligned to other homologous loci. Thus, the goal is to estimate the true read counts for these expressed genes. Input: {s1, s2, …, sm} = a vector of read count for locus j, where 1 < j < m.

Author Manuscript

Output:

1 = a vector of estimated read count for gene i, where 1 < i < n.

Knowing only the read alignments and read counts for all loci is insufficient to recover the true read counts of expressed genes. However, we can consider reads aligned to each locus come from the corresponding known gene (itself) and its homologous genes. The ratio of this composition can be learned from simulated data. We use a distribution matrix X to keep track of the aligned loci for each gene, where rows represent the origin of reads, and columns represent the destination of the alignment. The distribution matrix is defined below. DEFINITION 1

Author Manuscript

Distribution Matrix—Let R be a set of reads in the experiment, R = {r1, r2, rp, G be a set of genes, G = {g1, g2, ⋯, gn}, and L·be a set of genomic loci, L = {l1, l2, ⋯, lm}. X is a n×m distribution matrix with n genes and m loci. We use

to denote a subset of reads,

such that if and only if ϕ(rk, gi, lj) = 1. ϕ(rk, gi, lj) is an indicating function as described in Equation (1). Each value in the distribution matrix represents the number of reads from gene gi aligned to locus lj, and thus to represent the number of element in a set.

. We use the cardinality notation

(1)

Author Manuscript

Using the distribution matrix, the input of our problem statement can be viewed as the column sum of each locus. The data input can be rewritten as Equation (2), where the errors follow a normal distribution with mean of zero.

1We use the “^” symbol to represent estimated variables.

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 5

Author Manuscript

(2)

Author Manuscript

As observed in PseudoLasso, the amount of reads at xij is proportional to the expected number of reads yi for gene gi across different simulated replicates. Due to space limitation in a figure panel, we use ten pairs of parent-pseudogenes to illustrate this phenomenon in Figure 1. Paired-end reads are simulated from parent genes only. The expected number of reads for each gene (x-axis) is plotted against the observed read counts to itself and its pseudogene (y-axis), showing a linear relationship across different replicates. This linear property allows us to learn a general distribution ratio from simulated replicates. We define a pseudo matrix A for these ratios, where aij is the normalized value of xij, aij = xij/yi. Thus, xij can be rewritten as aijyi, and the pseudo matrix is defined below. DEFINITION 2 Pseudo Matrix—The pseudo matrix An×m is the normalized version of the distribution matrix. It is defined by the proportion of reads mapped to locus lj out of the total number of reads for a given gene gi. Let Y be a vector of expected number of reads for genes G, Y = {y1, y2, ⋯ yn}. yi is defined as

, such that rk ∈

if and only if ψ(rk, gi) = 1. ψ(rk, gi)

is another indicating function described in Equation (3). Each value is computed as

.

(3)

Author Manuscript

With Pseudo Matrix, we can rewrite Equation 2 as

Putting all n genes and m loci in a matrix form (n × m), we have (4)

Author Manuscript

Reinstating our problem statement using Equation 4, S is the data input, which is a vector of observed read count in each genomic locus. A is the pseudo matrix obtained from training data, and Y is the unknown read count of expressed genes. The goal is to estimate the read count for all expressed genes (Y) with the following objective function:

(5)

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 6

Author Manuscript

Since the number of reads for each gene cannot be negative, this restriction is reflected on the non-negative constraint in our linear least squares equation. In general, there are more loci (m) than expressed genes (n) in both of the distribution matrix and the pseudo matrix due to homologous repeats in the genome. The complexity and running time in solving the linear equation increases exponentially with the number of expressed genes. An important observation is that these matrices are sparse, since each gene only has a few homologous loci across the genome. The sparsity allows us to partition both distribution and pseudo matrices into clusters of homologous loci. Each partition can be assessed independently to elevate the computational burden. We use a community detection algorithm to explore and identify these homologous communities, and the algorithm is described in Section 2.1.

Author Manuscript

The pseudo matrix represents ratios between the expected read count and the read count at different loci. This ratio may fluctuate slightly among replicates due to different sources of noise. To capture the noises and augment the prediction accuracy, we match the read count of loci (input) to the best profile obtained from the training replicates. The matching step is carried out through a classification approach, specifically the k-nearest neighbors algorithm described in Section 2.2. 2.1 Homologous Community Partition

Author Manuscript

The read count estimation for an expressed gene relies on the information of reads aligned to itself and its homologous loci. A group of homologous loci can be identified by their sequence similarity. From data mining perspective, a set of objects can be grouped together if they share a certain amount of features. In order to cluster these loci, we consider all substrings of the DNA sequence to be the features of expressed genes. A conventional approach is the k-mean clustering algorithm, which partitions the genomes into k clusters. Since each cluster has to capture all the adequate homologous loci, not just the parent and pseudogene regions, the challenge falls in predetermining the number of cluster, k. Instead, when we examine the alignments of these substrings, we can interpret them as information flow from a source (the locus of an expressed gene) to destinations (homologous loci). These flows depend on how an aligner recognize the reads. Thus, we propose a network model with directed graphs and use a community detection algorithm to divide genomic regions into different communities based on the behavior of a specific aligner.

Author Manuscript

In a directed graph, the vertex represents a locus in the genome, and the direct edge connecting two vertices describes the information flow between two loci. The weight of each edge is depicted by the number of reads from one locus aligned to another locus. Intuitively, the heavier the weight between two vertices indicates that more features are shared between this pair of loci. An edge to itself is expected to have a high weight since most of the reads come from one gene are likely to map back to itself. On the other hand, an edge with a light weight may due to random misalignment. The information flow approach is able to remove this type of noise and exclude the vertices with weak connections in the community. Weights for this network graph can be acquired through the global distribution matrix, which keeps track of all alignment behaviors for gene features.

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 7

Author Manuscript

The partition step uses the information from a global distribution matrix (X) to separate the genomic loci into different homologous communities. Since each express gene corresponds to one of these loci, the partition step also segregates these genes into different communities. The distribution matrix is now partitioned into several smaller matrices, and each of them can be computed independently with the same objective function described in Equation 5. 2.2 Read Count Profile Classification The pseudo matrix models the distribution ratio of reads among homologous loci, and this ratio may be slightly affected by various sources of noise. To capture the best pseudo matrix describing reads alignment behavior, each homologous community in different experiments is evaluated separately. The observed read counts of a set of homologous loci are matched to the best profile from the training sets using the k-nearest neighbors algorithm for classification.

Author Manuscript

3 IMPEMENTATION

Author Manuscript

In this paper, we enhance the framework of PseudoLasso by incorporating the community detection technique to separate genomic regions into non-overlapping homologous communities. Read distribution behavior of each cluster can be learned individually through linear regression. The workflow of this enhanced approach is illustrated in Figure 2. The training stage contains two tasks. The first task involves feature generation for homologous community detection; the second task computes read distribution within each community. In the validation stage, a set of normalized distribution matrices (pseudo matrices) from the training stage are used to estimate the read mapping behavior for a given data, and the true expression values are optimized using a non-negative least-square model. Details of our framework are described below. 3.1 Training Stage

Author Manuscript

To cluster the homologous loci, we use all substrings of transcript sequences as features. Given a list of transcript sequences, substrings are generated using a sliding window approach with a window size of 100bp. The choice of this window size is consistent with the read length we intent to model. These substrings are tagged with an gene ID of their origins, and are aligned back to the reference genome using TopHat2. We iterate through all alignment records to construct a distribution matrix indicating the origins and the destinations of all reads. Since all substrings are included, this distribution matrix is referred to as the global distribution matrix, providing the complete knowledge of information flows among genomic regions. We use infomap implemented in the igraph package of R [21] [22] to identify and partition homologous loci. infomap uses map equation as the objective function, which aims at maintaining the information flow during community partition. The distribution matrix is constructed by iterating through all ID-tagged reads in simulation. In a real experiment, this information is unknown, and can only be inferred through the normalized distribution matrices from simulations in the training stage. RNAseqSim [23] is used to simulate paired-end reads for a list of genes, with fragment size ranging from 100bp to 400bp. In order to mark the origin of each read, the software is modified to inherit the

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 8

Author Manuscript

gene ID in read names. Ten different levels of read coverage are used to imitate low (5X, 7X, and 10X), medium (13X, 15X, 17X, and 20X) and deep (23X, 27X, and 30X) sequencing; transcript abundance is assigned either with a fix number across all transcripts (4A, 6A, 8A)2 or with three different sets of random numbers (R1A, R2A, R3A)3. In total, the combination yields 60 sets of data. Six datasets are randomly chosen for validation, and the remaining sets serve as technical replicates during training. We use TopHat2 for short reads alignment. Applying its default settings, multiple alignments are reported up to 20 records, and these multireads are kept in our analysis. In addition, we use Samtools [24] to retrieve the alignment information for mapped reads, and Bedtools to facilitate the matching between genomic loci and gene annotations. Since our methods focus on correcting the read count for each loci, isoforms are treated as the same gene in the matching step. In another word, we target the alignment correction on gene level.

Author Manuscript

3.2 Validation/Experimental Stage As described in Equation 5, given a new set of reads, read count for each gene can be estimated through the observed read counts of all loci along with the pseudo matrices from training sets. The observed read counts are first matched to the best pseudo matrix using knearest neighbor classification, and the predicted read counts are optimized by solving the non-negative least squares problem. Both of these steps are implemented in Matlab.

Author Manuscript

For validation, a separate set of replicates is used to verified the predicted read counts for a list of genes, including pseudogenes and parent genes. We use accuracy to evaluate the performance of different methods. Accuracy is computed as 1 − err, where err is prediction error. The prediction error is the absolute relative error with respect to the true read count. Formally, accuracy is defined as:

After obtaining estimated read count for each gene, reads are realigned using the alignment correction algorithm proposed in PseudoLasso. The algorithm focuses on examining the alignment records and reassign the plausible alignments to the location of these genes up to the estimated amount. The algorithm can be divided into three phases:

Author Manuscript



Retain the uniquely mapped reads



Assign the multireads to the most likely region



Relocate the uniquely mapped reads from homologous loci

The first phase is summarized in Algorithm 1. The algorithm collects all uniquely mapped reads that fall within the genomic locus lj. These reads are sorted from good to bad based on the sequence quality (MAPQ) and whether the mate read is properly mapped for paired-end

2We use letter ‘A’ to denote the magnitude of abundance level. 4A, 6A, and 8A mean that transcripts are expressed 4,6,8 times respectively. 3R1A, R2A, and R3A denote the dataset where abundance levels are randomly assigned to each gene, ranging from 5A to 10A

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 9

Author Manuscript

sequencing. Each gene gi is associated with a locus lj in the distribution matrix. Since there are more loci than genes (m > n), loci that do not overlap with any expressed genes are estimated with read count of zero. The top uniquely mapped reads are kept for the associated locus lj. If the count of retained reads reaches the estimated amount , the locus is marked as resolved. On the other hand, if the number of uniquely mapped reads exceeds the estimated abundance, leftover reads are subjected to be reassigned to another homologous locus in the third phase.

Author Manuscript

In the second phase, Algorithm 2, “unresolved loci” from Phase I are sorted based on their remaining counts. Unassigned multireads are retrieved for these loci and sorted using the same criteria mentioned above. The correction starts with a locus with the least amount of remaining counts, and assign the sorted multireads to locus lj until either it is resolved or there is no more multireads aligned to this locus. Once a multiread has been assigned, it is removed from the multireads pool. The remaining count is updated after each assignment, and if it reaches zero, the corresponding locus is resolved.

Author Manuscript

The remaining unresolved loci do not have enough aligned reads, and thus require realigning leftover reads of their homologous loci from phase I. Identifying the homologous loci can be facilitated through the result of community detection. The global distribution matrix is divided into several subsets after homologous community partition. Within each community, homologous locus can be revealed by the non-zero value in the sub-matrix, where xij > 0 for gene gi and locus lj, and lj is not the associated gene for gi. Similar to phase II, the correction starts with the locus with the least amount of remaining counts. Leftover reads are aligned back to the sequence spanned unresolved loci using Blastn [25]. The top alignments with evalue ≤ 1.00E − 05 are assigned to the unresolved gene until either there is no more alignment or the gene is resolved. The Blastn alignment record is converted to BAM format. Details of this phase is described in Algorithm 3. The final list of alignments combines the results from these three phases, denoted as SList in algorithms. The corrected alignments are stored in BAM format, and can be processed for downstream analyses, such as transcriptome reconstruction and statistical analysis.

Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 10

Author Manuscript

Algorithm 1 Retain Uniquely Mapped Read

Author Manuscript

Algorithm 2 Assign Multiread

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 11

Author Manuscript

Algorithm 3 Realign Reads from Homologous Loci

Author Manuscript Author Manuscript

4 RESULTS

Author Manuscript

The Yale pseudogene knowledgebase provides the information describeing the relationship between pseudogenes and their homologous parents. Build 79 contains 15774 pseudogenes and 6206 parent genes locating on canonical chromosomes (Chr1-22, X, Y, and mitochondria). We cross-reference these pseudogenes to the gene definitions from Ensembl release 79 of Human Genome GRCh38. Of these 15774 pseudogenes, 14778 are annotated by Ensembl. There are only 2610 transcripts annotated as transcribed pseudogenes. We use these transcribed pseudogenes to fine-tune the parameters, and to evaluate the effectiveness and efficiency of our approach. We start with a small dataset by randomly selecting 600 parent transcripts along with their transcribed pseudogenes (Dataset A). We then gradually increase our dataset size to model the read distribution for more pseudogenes (Dataset B, C, and D). The model is further applied to three larger datasets. The first one includes all transcribed pseudogenes and their parents (Dataset E); the second one includes all Ensembl pseudogenes with their parents (Dataset F; those without parents are omitted); the third dataset contains a complete list of annotated transcripts in human transcriptome (Dataset G). Ambiguous regions in a genome do not limit to pseudogenes; other segments such as long non-coding RNAs (lncRNA) can also be modeled. For this reason, we conduct an analysis

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 12

Author Manuscript

on the full human transcriptome to evaluate our model beyond pseudogenes. The composition of different possible types of repeated sequences is shown in Figure 3. Table 2 summarizes all of the datasets. For each dataset, we address the number of expressed transcripts, corresponding genes, and the number of aligned genomic regions. Since our approach focuses on gene level, genomic regions are characterized based on gene definitions. We also report the proportion of multireads in each dataset. There are about 9– 12% of multireads in datasets containing highly repeated sequences (Datasets A–F). It is also expected to see a drop of multiread proportion on Dataset E since this dataset contains a lot more genes with unique sequences. 4.1 Community Detection

Author Manuscript

We use datasets A – D to illustrate the advantage of incorporating community partition to our framework. Using the sliding window approach, consecutive substrings of the parent and pseudogene transcript sequences are treated as short reads and aligned to the reference genome. These aligned loci are partitioned into different clusters through community detection algorithms. The igraph package in R provides two different methods that perform community detection for a directed graph, infomap and edge-betweenness. Edgebetweenness focuses on the property of community structure, and has the tendency to produce larger communities. We analyze the performance of three different approaches, including no community partition, partition with infomap, and partition with edgebetweenness. Figure 4a illustrates the running time of these two algorithms across four datasets. Edge-betweenness requires significantly more time when there are more genes. The partition results of these three approaches are summarized in Table 3. Compare to infomap, edge-betweenness consistently partitions genomic loci into fewer communities.

Author Manuscript

After the partition, the read count of each community is estimated through linear regressions. We compare the performance of two non-negative linear regression solvers. One of them is the default solver provided by Matlab, lsqnonneg, and the other is an efficient sparse learning solver, nnLeastR, which utilizes the l1-norm regularization properties to improve the efficiency [26]. In nnLeastR, there is a λ parameter to specify the sparsity. Since the community detection algorithm has already partitioned the genomic regions into many dense communities, an ideal value for λ parameter would be 0.1 to minimize the sparsity. We use both 0.1 and 0.5 in our evaluation.

Author Manuscript

To evaluate the prediction running time and accuracy, we repeat each experiment five times for cross-validation. In each experiment, training and testing data are generated as described in Section 3.1, with 54 replicates randomly selected for training and 6 replicates for validation. The average running time and standard deviations of these 30 measurements are plotted in Figure 4b. In the comparison between keeping and discarding the partition, running time grows quadratically as we increase the number of genes without community partition. On the other hand, the running time grows linearly with the number of genes when incorporating the community detection approaches. Comparing the two community detection methods, infomap is able to provide a more efficient growth in running time than edge-betweenness.

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 13

Author Manuscript

We analyze the accuracy in the same settings. The average accuracy and standard deviations are plotted in Figure 4c. We include the read count results from TopHat2 as our baseline, where multireads are discarded. Overall, our accuracy is much higher than TopHat2. Comparing the prediction power between infomap and edge-betweenness, the accuracy is higher for infomap in larger dataset (Dataset D). We also break down the accuracy assessment for each method based on the community size and the read coverage in A.1a and A.1b. The running time is much faster when coupling with nnLeastR as our linear regression solver than lsqnonneg. The speed increases as we raise the degree of sparsity. Although the difference in accuracy is not obvious in the Figure, we observe a fair drop in accuracy as we increase the sparsity. As a result, nnLeastR with λ = 0.1 provides an efficient and accurate approach.

Author Manuscript

Besides the advantage of time efficiency and prediction accuracy, the concept of information flows provides a better explanation of our distribution matrix, and thus is able to remove noise and capture the underlying structure of homologous communities. The partition of Dataset A is visualized in Figure 5. Dataset A contains 600 parent transcripts and 304 pseudogene transcripts. These transcripts correspond to 712 genes. The substrings of these transcript sequences align to 4926 loci. Applying infomap, these 4926 loci are grouped into 875 homologous communities, with the biggest community of 2377 loci.

Author Manuscript

We also examine the tightness of all communities by evaluating the information flow within and between communities. For each pair of vertices, v1 and v2, connected by a non-zero weighted edge e12, we add up the weight of e12 to within-community score if v1 and v2 are in the same community. Otherwise, the weight of e12 is added to between-community score. These scores are normalized by the number of edges. We report the ratio of withincommunity score and between-community score. The higher the ratio represents a more compact set of communities. We report the overall ratios of the partition generated by infomap and a random partition in Table 4. The results show that infomap consistently preserves the compactness of each community for different datasets. We use the same analysis to evaluate the compactness of the largest community in Figure 6.

Author Manuscript

Dataset G contains 198278 transcripts, including 14778 pairs of parent-pseudogene. Infomap divides these transcripts into 36114 clusters. The information flow is based on how a specific aligner maps reads to different regions in the genome. Loci from a pair of parent gene and pseudogene may not end up in the same cluster if the aligner does not map reads from pseudogene to their parents and vice versa. Table 5 examines the proportion of these pairs being placed in the same or different clusters. There are around 18% of these pairs being partitioned into the same community. Among these 18%, 41% are processed pseudogenes and 55% are unprocessed pseudogenes. Others include unitary pseudogenes or other pseudogenes that are yet to be classified. Majority (82%) of these pairs are partitioned into different communities; 75% of them are processed pseudogenes, and 22% of them are unprocessed pseudogenes.

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 14

4.2 k-NN Classification

Author Manuscript Author Manuscript

k-NN classification is used to select the best pseudo matrix describing reads alignment behavior for a new dataset. In the training stage, there are 54 sets of replicates representing a wide range of coverage, and hence the alignment profile of each homologous community can be classified into 54 possible models. However, the running time increases as we raise the number of models (k-parameter) in this classification step, especially for datasets with a large number of expressed genes. In order to find the optimal k-parameter, we use two small gene lists (Datasets A and B) to evaluate the trade-off between running time and accuracy for different number of k. Six replicates are used for the evaluation, and the average running time and prediction errors are plotted in Figure 7. The results show that average error rates drop rapidly at first, and reach a steady phase after k = 10, as highlighted by the grey boxes in the figure. Compares to using all 54 replicates, a similar accuracy can be achieved at 10 replicates with a five-fold increase in speed. Consequently for all of our analyses, we set the k-parameter to 10. 4.3 Read Count Estimation We use infomap algorithm and nnLeastR linear solver with λ = 0.1 to estimate the read count for three larger gene sets (Datasets E, F, and G). Assuming all of the transcripts are expressed, the average running time and accuracy are listed in Table 6.

Author Manuscript Author Manuscript

We next take a closer look at the read count estimation for each 59129 genes on Dataset G. We randomly select 12 replicates (two result sets from the cross-validation experiments), and compute the average accuracy of read count estimated by our method and read count reported by TopHat2 after removing multireads. We use an accuracy threshold of 0.8 to separate out well-estimated and poorly-estimated read count. Consequently, we can divide these 59129 genes into four different categories. The first two categories are well-estimated by both TopHat2 and our method, and poorly-estimated by both TopHat2 and our method. The third category depicts those that are well-estimated by TopHat2, but deteriorate after applying our method. The last category refers to those that are poorly-estimated by TopHat2, but are rescued by our method. The second category can be further divided into four subgroups to describe whether each method over-estimate or underestimate the read count. Over-estimation refers to the estimated read count being higher than the true read count, and under-estimation refers to the estimated read count being lower than the true read count. Similarly, the last two categories can be further divided into two subgroups. Figure 8 illustrates the proportion of each category and subgroups. In general, our method is able to accurately estimate the read count for 89.84% (73.33% + 16.51%) of all 59129 genes, whereas TopHat2 performs well only for 76.96% (73.33% + 3.63%). Among all the genes that are poorly-estimated by TopHat2 (16.51%+6.53%), our method is able to recover the read count for 71.6% of these genes (6.53%/(16.51% + 6.53%)). On the other hand, our method misses only 4.7% of the genes that are well-estimated by TopHat2 (3.63%/(3.63% + 73.33%)). To demonstrate the ability of our method recovering the read count estimation from TopHat2, we list a few genes as examples in Table 7. We randomly select the results from three experiments representing low, medium and high coverage. The first two pairs are

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 15

Author Manuscript

examples of processed pseudogene. ENSG00000111640 and ENSG00000106153 are protein-coding genes. According Pseudogene.org, they both contain at least one processed pseudogene located on a different chromosome, ENSG00000228232 (PGOHUM00000304558) and ENSG00000235084 (PGOHUM00000296224) respectively. These examples show that TopHat2 has the tendency to align reads from parent genes to their processed pseudogenes. The next example set contains an unprocessed pseudogene. The reads from ENSG00000154582 align to two other genomic regions besides itself, ENSG00000265545 and ENSG00000259838. ENSG00000265545 (PGOHUM00000294709) is a processed pseudogene of ENSG00000154582, according to Pseudogene.org. However, ENSG00000259838 is missing from the same pseudogene database, which could be a potential pseudogene of ENSG00000154582. The last four genes are the lincRNA, where TopHat2 consistently under-estimates their read counts.

Author Manuscript

5 CONCLUSION Pseudogenes quantification via RNA-Seq remains challenging due to the high sequence similarity with their parent genes. Short read sequences from low complexity genome region are likely to align to multiple places, and thus makes it difficult to re-establish the origins of the reads. Many approaches have been proposed to resolve the multiple alignment issues; however, none of these approaches consider the alignment relationship between homologous loci nor attempt to correct the falsely aligned reads.

Author Manuscript

Recently, we proposed a linear regression approach, PseudoLasso, to learn the read alignment behaviors, and leverage this knowledge to accurately estimate the abundance and correctly realign reads among homologous loci. One of the bottlenecks of PseudoLasso is the computational time in read count estimation for a new set of data. In this work, we refine the framework of PseudoLasso to elevate this computational burden. We incorporate community detection algorithm to partition genomic regions into different homologous communities, and build a linear regression model separately for each community. The nonnegative least square problem is then solved by an efficient solver, nnLeastR from the SLEP package, on a smaller dataset with a further potential of parallelism.

Author Manuscript

To demonstrate the efficiency and accuracy of our approach, we build the linear models for a list of parent genes and their corresponding expressed pseudogenes based on the annotation provided by Yale pseudogene knowledge-base. These genes are first partitioned into homologous communities using infomap. We use 100bp paired-end reads to learn the alignment behavior of TopHat2 for each homologous community, and apply these models to simulated datasets. The results show that our approach is able to estimate the abundance with a high accuracy, and complete the analysis with a reasonable time frame. We further apply our approach to model homologous regions across the entire human transcriptome. The read count estimated by our approach presents a much higher accuracy than TopHat2. The alignment behavior is sensitive to read length, and thus a new training is required for different read length. Currently, many publicly available RNA-Seq data are generated with read length of 50bp and 75bp. Therefore, we plan on training a new model with 75bp of read

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 16

Author Manuscript

length and evaluate the pseudogene expression on the Human Body Map 2.0 Project (NCBI GEO accession GSE30611).

Acknowledgments We thank Zhaojun Zhang for his comments and discussion, and Shunping Huang for his assistance in setting up the RNA simulator software, RNAseqSim. This work was funded by NIH R01HG006703, NIH P50 GM076468-08 and NSF IIS-1313606.

Appendix A; Detail Accuracy

Author Manuscript Author Manuscript

Fig. A.1.

Author Manuscript

Breakdown of Accuracy Comparison. The accuracy of different methods are further compared in two different aspects: the size of community and the read coverage. The size of community is categorized into four groups: large refers to the largest community; medium refers to community with more than 30 members (exclude the largest community); small refers to community with 2–30 members; tiny refers to community with single member. The read coverage is divided into three groups: 5X, 7X, and 10X represent low sequencing; 13X, 15X, 17X and 20X represent medium sequencing; 23X, 27X, and 30X are deep sequencing (labeled as “high” in the figure). The average accuracy and standard deviation are plotted in both figures.

References 1. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nature reviews Genetics. Feb; 2011 12(2):87–98. 2. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods. Jun; 2011 8(6):469–77. [PubMed: 21623353]

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 17

Author Manuscript Author Manuscript Author Manuscript Author Manuscript

3. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology. Apr.2013 14(4):R36. [PubMed: 23618408] 4. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. Oct.2010 38(18):e178. [PubMed: 20802226] 5. Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic acids research. Aug; 2010 38(14):4570–8. [PubMed: 20371516] 6. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology. Jul; 2011 29(7):644–52. 7. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I. De novo assembly and analysis of RNA-seq data. Nature methods. Nov; 2010 7(11):909–12. [PubMed: 20935650] 8. Hawkins PG, Morris KV. Transcriptional regulation of Oct4 by a long non-coding RNA antisense to Oct4-pseudogene 5. Transcription. Jan; 2010 1(3):165–175. [PubMed: 21151833] 9. Lin H, Shabbir A, Molnar M, Lee T. Stem cell regulatory function mediated by expression of a novel mouse Oct4 pseudogene. Biochemical and biophysical research communications. Mar; 2007 355(1):111–6. [PubMed: 17280643] 10. Zou M, Baitei EY, Alzahrani AS, Al-mohanna F, Farid NR, Meyer B. Oncogenic Activation of MAP Kinase by BRAF Pseudogene in Thyroid Tumors 1. Neoplasia. 2009; 11(1):57–65. [PubMed: 19107232] 11. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic acids research. Jan; 2007 35(Database issue):D55–D60. [PubMed: 17099229] 12. Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei C-L, Gingeras TR, Guigó R, Harrow J, Gerstein MB. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome research. Jun; 2007 17(6):839–51. [PubMed: 17568002] 13. Han YJ, Ma SF, Yourek G, Park YD, Garcia JGN. A transcribed pseudogene of MYLK promotes cell proliferation. FASEB journal : official publication of the Federation of American Societies for Experimental Biology. Jul; 2011 25(7):2305–12. [PubMed: 21441351] 14. Svensson O, Arvestad L, Lagergren J. Genome-wide survey for biologically functional pseudogenes. PLoS computational biology. May.2006 2(5):e46. [PubMed: 16680195] 15. Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu X, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. The GENCODE pseudogene resource. Genome Biology. Jan.2012 13(9):R51. [PubMed: 22951037] 16. Tonner P, Srinivasasainagendra V, Zhang S, Zhi D. Detecting transcription of ribosomal protein pseudogenes in diverse human tissues from RNA-seq data. BMC Genomics. Jan.2012 13(1):412. [PubMed: 22908858] 17. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008; 5(7):1–8. [PubMed: 18175409] 18. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. Jan.2011 12:323. [PubMed: 21816040] 19. Paaniuc B, Zaitlen N, Halperin E. Accurate estimation of expression levels of homologous genes in RNA-seq experiments. Journal of computational biology : a journal of computational molecular cell biology. Mar; 2011 18(3):459–68. [PubMed: 21385047] 20. Ju, CJT., Zhao, Z., Wang, W. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: ACM; 2014. PseudoLasso:

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 18

Author Manuscript Author Manuscript

Leveraging Read Alignment in Homologous Regions to Correct Pseudogene Expression Estimates via RNASeq; p. 569-578.ser BCB ’14 21. Rosvall M, Bergstrom CT. Maps of information flow reveal community structure in complex networks. 2007 arXiv preprint physics.soc-ph/0707.0609. 22. Rosvall M, Axelsson D, Bergstrom CT. The map equation. The European Physical Journal-Special Topics. 2009; 178(1):13–23. 23. Zhang Z, Huang S, Wang J, Zhang X, Pardo Manuel de Villena F, McMillan L, Wang W. GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment. Bioinformatics (Oxford, England). Jul; 2013 29(13):i291–9. 24. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England). Aug; 2009 25(16):2078–9. 25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. Oct; 1990 215(3):403–10. [PubMed: 2231712] 26. Liu, J., Ji, S., Ye, J. SLEP: Sparse Learning with Efficient Projections. Arizona State University; 2009. [Online]. Available: http://www.public.asu.edu/jye02/Software/SLEP

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 19

Author Manuscript Author Manuscript

Fig. 1.

Comparison of Expected Read Count and Observed Read Count. Paired-end reads of 100bp are simulated for functional parent genes in 60 sets of experiment. The expected number of reads from a parent gene (x-axis) is plotted against the number of reads observed in either the parent gene itself, or its pseudogene (y-axis). Parent gene and pseudogene profiles are depicted by different symbols, but the parent-pseudogene pair is indicated by the same color.

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 20

Author Manuscript Author Manuscript Fig. 2.

Author Manuscript Author Manuscript

The enhanced framework of PseudoLasso. In the training stage, gene features are represented by all substrings of gene sequences. These features are used to identify homologous loci through a community detection approach. Mining on the same list of genes, paired-end short reads are simulated with different coverages and are aligned back to the reference genome, providing the potential models to describe read distribution of experimental data. These distribution matrices are normalized, and split based on the homologous community partition. The read counts of genes for validated data or experimental data are estimated through these models. The two tasks in the training stage are indicated by different colors: steps with blue arrows describe the first task of feature generation and community detection; steps with purple arrows describe the second task of read distribution computation within each community. In the validation stage, reads from other experiments are first aligned to the reference genome. The alignment profiles are matched to the best normalized matrices from the training stage using k-nearest neighbor classification. The predicted read counts are optimized by solving the non-negative least squares equations.

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 21

Author Manuscript Fig. 3.

Author Manuscript

Distribution of possible types of repeated segments. Repeated sequences can be found in different types of transcripts, including pseudogenes, non-coding RNA, lincRNA, immunoglobulin pseudogenes and antisense RNA. There are 16810 pseudogenes reported by Ensembl, but only 14778 of them are cross-referenced to Pseudogene.org with annotated parents.

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 22

Author Manuscript Author Manuscript

Fig. 4.

Running Time and Accuracy. Three different community partition algorithms are compared (no community partition, partition with infomap, and partition with edge-betweenness). The running time of infomap and edge-betweenness is plotted and labeled in 4a. The prediction stage requires non-negative linear regression solvers. Two different solvers are evaluated, lsqnonneg and nnLeastR. The λ parameter is set to 0.1 and 0.5 for nnLeastR. The average running time for the prediction stage is plotted in 4b, and the average accuracy is plotted in 4c. The standard deviations are also drawn, but the values are too small in 4c. The accuracy of TopHat2 is included as our baseline.

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 23

Author Manuscript Author Manuscript Fig. 5.

Author Manuscript

Homologous Communities Partition. 4926 genomic regions from dataset A are partitioned into 875 homologous communities. The colors may be repeated to represent different communities (since there are only 256 colors), but loci in the same community are either positioned in a closed proximity or linked with an arrow. The arrow indicates information flow between two loci in the same community.

Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 24

Author Manuscript Author Manuscript Fig. 6.

Author Manuscript

The compactness of the largest community is evaluated by the ratio of within- and betweencommunity scores. The ratio of the largest partition generated by infomap algorithm is consistently higher than the ratio from a community with randomly selected vertices of a similar size.

Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 25

Author Manuscript Author Manuscript

Fig. 7.

Parameter Selection for kNN-Classification. The running time is plotted against the prediction error for different k-parameter in the kNN-Classification. The y-axis represents the prediction error in percentage scale, and the x-axis represents the running time in minute. The number next to each data point indicates the k-parameter used. The grey box highlight the best k-parameters judging from the trade-off between accuracy and efficiency. Dataset A (bottom) and B (top) are evaluated, and values are collected by taking the average over six replicates.

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 26

Author Manuscript Author Manuscript

Fig. 8.

Accuracy Evaluation Between TopHat2 and Our Method on 59129 Genes. Well-estimation is defined by having accuracy above 0.8.

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 27

TABLE 1

Author Manuscript

Mathematical Notations Symbol

Description

G = {g1, g2, ⋯, gn}

a set of gene gi, where 1 < i ≤ n

L = {l1, l2, ⋯, lm}

a set of mapped loci lj, where 1 < j ≤ m and n ≤ m

Y = {y1, y2, ⋯, yn}

a set of expected read count yi for gene gi

R = {r1,r2, ⋯, rp}

a set of reads in an experiment a subset of reads come from gene gi a subset of reads come from gene gi and aligned to locus lj

Author Manuscript

Xn×m

an n by m distribution matrix

An×m

an n by m pseudo matrix, which is the normalized distribution matrix Xn×m.

S = {s1, s2, ⋯, sm}

read count sj for locus li, it is equivalent to the column sum of distribution matrix

Muniq

alignment records of uniquely mapped reads

Mmulti

alignment records of multireads

C = {c1, c2, …, cn} SList

a set of variables associated with gene i to keep track of the number of missing reads final list of corrected alignments

ψ(rk, gi)

an indicating function for read rk comes from gene gi

ϕ(rk, gi, lj)

an indicating function for read rk comes from gene gi and aligned to locus lj

Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Author Manuscript

Author Manuscript

Author Manuscript 930 2610 14778

2000

6202

6202

D

E

G

F

616

198278

918

408

800

1200

B

C

59129

17863

5890

2248

1398

712

304

600

A

Corresponding Genes

Pseudogene Transcripts

Parent Transcripts

Dataset

117022

40329

29071

12997

8622

7146

4926

Aligned Loci

3.89%

9.86%

9.66%

9.94%

11.98%

11.14%

12.67%

Multiread Proportion

Dataset Description: Datasets A-D contain a list of randomly selected parent genes, along with the transcripts of their corresponding transcribed pseudogenes. Isoforms are merged and labeled with the same gene ID. Gene features are aligned to the reference genome. Keeping all multiple alignments reported by TopHat2, number of aligned regions and multireads proportion are recorded for each dataset. Dataset E contains the full set of transcribed pseudogenes and their parents; dataset F contains all pseudogenes and their parents. Dataset G contains all annotated transcripts (198278 transcripts) in human transcriptome.

Author Manuscript

TABLE 2 Ju et al. Page 28

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Author Manuscript

Author Manuscript

Author Manuscript NA NA NA

6186

36114

G

1872

2611

D

14702

1012

1540

C

F

686

E

544

864

1084

B

Betweenness

Infomap

1

1

1

1

1

1

1

None

Number of Communities

A

Dataset

45302

9500

14692

6699

4453

4227

2390

Infomap

NA

NA

NA

7672

5064

4462

2439

Betweenness

117022

40329

29071

12997

8622

7146

4926

None

Size of the Largest Community

Summary of Community Partition. Three different approaches are applied: partition with infomap, partition with edge-betweenness, and no partition. The number of communities and the size of the largest community are compared for each dataset among these three approaches. The comparison with edgebetweenness is omitted for big datasets (E, F, and G)

Author Manuscript

TABLE 3 Ju et al. Page 29

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Author Manuscript

Author Manuscript

Author Manuscript Infomap

2840461

68225 231777

81743814

733453072

F

36117

54790997

E

G

1726540

15461

18849184

D

98952

11959068

343799

127663

6272 10682

7374696

12140773

B

42139

6018

5894824

Raw Score

No. Edges

388422

37598

22626

3835

1508

3385

588

No. Edges

Between Community

Raw Score

Within Community

C

A

Dataset

Methods

102.78

15.86

19.88

13.60

13.43

40.22

13.67

Norm. Ratio

566696974

52387004

38924120

13661693

8344840

5551001

4065941

Raw Score

147423

22576

17667

6933

3850

2456

2244

No. Edges

Within Community

178715166

32197271

17593417

5531290

3923596

1922647

1871022

Raw Score

472776

83247

41076

12363

8340

7201

4362

No. Edges

Between Community

Random

10.17

6.00

5.14

4.40

4.61

8.47

4.22

Norm. Ratio

Compactness of Community Partition. The compactness of a partition is evaluated by the ratio of within and between community scores. The overall ratio of the partition generated by infomap algorithm is compared to a random partition

Author Manuscript

TABLE 4 Ju et al. Page 30

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 31

TABLE 5

Author Manuscript

Partition Information for 14778 Pairs of Parent-Pseudogene Same Community

Different Community

Processed Pseudogene

1095

9114

Unprocessed Pseudogene

1472

2668

Others

101

328

Author Manuscript Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Ju et al.

Page 32

TABLE 6

Author Manuscript

Average Running Time and Accuracy for Large Datasets. Dataset

Expressed Genes

Running Time (hrs)

TopHat2 Accuracy

Accuracy of Our Method

E

5890

2.516

0.786

0.899

F

17863

1.095

0.831

0.860

G

59129

11.491

0.569

0.752

Author Manuscript Author Manuscript Author Manuscript IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Author Manuscript

9:138199933-138203325

ENSG00000154582

ENSG00000237419

2:131983937-131994161

8:73939169-73972287

ENSG00000265545

ENSG00000236485

18:4459230-4459458

ENSG00000259838

7:77038-80418

15:41557027-41557435

ENSG00000235084

13:18540497-18550697

1:15604597-15605043

ENSG00000106153

ENSG00000231238

7:56101569-56106576

ENSG00000228232

ENSG00000242611

12:6533927-6538374

X:39787132-39788136

ENSG00000111640

Location

lincRNA

lincRNA

lincRNA

lincRNA

protein coding

processed pseudogene

unprocessed pseudogene

processed pseudogene

protein coding

processed pseudogene

protein coding

Biotype

101

114

57

67

2014

25

29

69

181

149

1776

Expected

Author Manuscript

Gene Name

55

16

43

45

1858

157

104

86

131

410

1307

TopHat2

78

153

57

70

2014

21

38

71

182

150

1745

Our Method

Replicate 5X_R1A

281

325

210

229

4840

109

157

110

601

321

4874

Expected

170

65

158

167

4528

420

329

156

422

1067

3571

TopHat2

279

323

214

248

4865

66

128

118

609

383

4781

Our Method

Replicate 15X_R3A

Author Manuscript

Read Count Estimations from TopHat2 and Our Method.

502

361

195

239

8111

134

169

203

908

658

7664

Expected

299

70

146

172

7567

759

536

269

640

1849

5680

TopHat2

517

372

202

283

8132

154

202

205

890

730

7653

Our Method

Replicate 23X_R3A

Author Manuscript

TABLE 7 Ju et al. Page 33

IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 July 18.

Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

RNA-Sequencing has been the leading technology to quantify expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment...
1MB Sizes 0 Downloads 6 Views