Published online 26 November 2014

Nucleic Acids Research, 2015, Vol. 43, No. 3 e16 doi: 10.1093/nar/gku1197

HiTSelect: a comprehensive tool for high-complexity-pooled screen analysis Aaron A. Diaz1,2,3,† , Han Qin2,4,5,† , Miguel Ramalho-Santos2,4,5,* and Jun S. Song1,2,3,* 1

Institute for Human Genetics, University of California, San Francisco, CA, USA, 2 The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, CA, USA, 3 Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA, 4 Departments of Obstetrics and Gynecology and Pathology and Center for Reproductive Sciences, University of California, San Francisco, CA, USA and 5 Diabetes Center, University of California, San Francisco, CA, USA

Received July 24, 2014; Revised October 14, 2014; Accepted November 04, 2014

ABSTRACT

INTRODUCTION RNA interference (RNAi) provides a powerful technique for gene knockdown by exploiting a cell’s endogenous machinery for mRNA degradation. RNA-induced silencing complexes target mRNAs via short oligonucleotide guide strands excised from short-hairpin RNA (shRNA). Lossof-function analysis via high-complexity shRNA screening has successfully identified pathways associated with cancer (1–4), modulators of ricin susceptibility (5), regulators * To

whom correspondence should be addressed. Tel: +217 244 7750; Fax: +217 333 9819; Email: [email protected] Correspondence may also be addressed to Miguel Ramalho-Santos. Tel: +415 502 8543; Fax: +415 514 2346; Email: [email protected]



The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors. Present address: Jun S. Song, Departments of Bioengineering and Physics, University of Illinois, Urbana-Champaign, Urbana, IL, USA

 C The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

Genetic screens of an unprecedented scale have recently been made possible by the availability of highcomplexity libraries of synthetic oligonucleotides designed to mediate either gene knockdown or gene knockout, coupled with next-generation sequencing. However, several sources of random noise and statistical biases complicate the interpretation of the resulting high-throughput data. We developed HiTSelect, a comprehensive analysis pipeline for rigorously selecting screen hits and identifying functionally relevant genes and pathways by addressing offtarget effects, controlling for variance in both gene silencing efficiency and sequencing depth of coverage and integrating relevant metadata. We document the superior performance of HiTSelect using data from both genome-wide RNAi and CRISPR/Cas9 screens. HiTSelect is implemented as an open-source package, with a user-friendly interface for data visualization and pathway exploration. Binary executables are available at http://sourceforge.net/projects/ hitselect/, and the source code is available at https: //github.com/diazlab/HiTSelect.

of protein degradation (6) and barriers to cellular reprogramming (7). Additionally, the clustered regularly interspaced palindromic repeats (CRISPR) pathway has been adapted from the bacterial and archaeal immune systems for gene knockout in eukaryotic cells via synthetic single guide-RNA (sgRNA). CRISPR-based screening in human cells has been shown to scale genome-wide effectively (8,9). A typical approach is to use barcoded lentiviral vectors to express genome-wide libraries of oligonucleotide guideRNA (shRNA or sgRNA). Cells are infected in a pooled fashion. Subsequently, cells exhibiting a phenotype of interest (often in response to a treatment) are identified and isolated. Barcodes are recovered from both phenotype-positive and phenotype-negative cells via polymerase chain reaction (PCR) amplification and quantified via next-generation sequencing (NGS). The relative abundance of guide-RNA in the phenotype-positive population, as measured by the abundance of reads mapping to the corresponding barcode, allows one to infer the effect of inhibiting the targeted gene on phenotype. Despite the power of genome-wide screens, off-target effects, variance in gene silencing efficiency and variance in sequencing depth of coverage can be significant, making it difficult to distinguish knockdowns/knockouts that truly regulate a given phenotype from background noise (10). Moreover, identifying relevant genes and pathways not only requires rigorous statistical methods for hit selection but also meta-analysis of relevant, secondary genomic data such as gene expression, chromatin state, known physical interactions and functional annotations. Thus, the enormous potential of pooled genome-wide screens is currently hampered by limitations in the methodologies for analysis. To address this roadblock, we developed a computational pipeline for analyzing pooled, high-complexity screens and implemented it into a user-friendly software package called HiTSelect (Supplementary Software). The

e16 Nucleic Acids Research, 2015, Vol. 43, No. 3

HiTSelect pipeline includes comprehensive modules for screen hit selection based on robust statistics, a module for the integration of gene expression data, a module for functional annotation analysis, a module for genetic interaction analysis and a module for gene network visualization (Figure 1). We illustrate the power of HiTSelect using the data from our recent screen for genes that function as barriers to reprogramming somatic cells to induced pluripotent stem (iPS) cells, as well as another recently published data set from a screen for growth factors in ovarian carcinoma (4,7). We compare the gene-ranking module of HiTSelect to both the RNAi Gene Enrichment Ranking (RIGER) and the Redundant siRNA Activity (RSA) algorithms (11,12). We show that HiTSelect’s ranking algorithm is both more sensitive and more specific, HiTSelect’s predictions correlate better with secondary validation assays and HiTSelect is less prone to off-target effects than RSA or RIGER. While RSA and RIGER were designed for multi-well platebased screens, and/or pooled screens with microarray readout, HiTSelect is specifically designed for high-complexity shRNA and CRISPR screens with NGS readout. In addition, HiTSelect provides tools for metadata integration, analysis and visualization that are not available in any other software.

knockdown/knockout efficiency across guide-RNA, which can be either shRNA in RNAi screens or sgRNA in CRISPR screens. To address the former, we use multiobjective optimization to identify genes with multiple highly active guide-RNA. To address the latter, we utilize a random effects model to assess sampling variance in the estimation of gene effect size. We focus on a two-group problem and assume that guide-RNA read counts have been tabulated from a phenotype positive sample S+ and a phenotype negative control sample S− . Let N+ and N− denote the total numbers of sequenced reads from each population, respectively. For clarity, we use Greek letters to index guide-RNA and Roman letters to index genes. To avoid division by zero in the following formulae, we add a pseudo-count of 1 read to each count. Let rα+ denote the read count of the α th guide-RNA in S+ . Let rα− denote the read count in S− normalized by sequencing depth N+ /N− . If the user selects the option to normalize by a user-provided list of control sequences, then we instead normalize by the ratio M+ /M− , where M+ and M− are the median read counts for the control sequences from S+ and S− , respectively. We model the activity level of the α th guideRNA in the positive population by the read count log odds ratio (log-odds) Xα that we estimate by the sample log-odds  pα (1−qα ) ˆ Xα = ln qα (1− pα ) . Here pα = rα+ /N+ and qα = rα− /N+ are the (normalized) frequencies of reads in S+ and S− , respectively. In the case of normalization by user-defined con  trol sequences pα = rα+ /N+ and qα = rα− / N− M+ /M− . ˆ α > 0 and rα+ > We designate a guide-RNA as ‘active’ if X m, where m = the sample median absolute deviation of the read counts in S+ . The latter criterion eliminates undersequenced guide-RNA. We call a guide-RNA active if it has greater odds of being sequenced in the treatment population relative to control. Given a gene G j targeted by n distinct guide-RNA {h α }nα=1 with log-odds {Xα }nα=1 , we wish to combine them into a single measure of gene effect size. If h α had perfect efficiency in its knockdown/knockout of G j , reads mapping to h α would occur in S+ with some log-odds Yj determined solely by G j ’s effect on phenotype. In practice, there will be less than perfect efficiency and reads will occur with log-odds Xα = Yj + Zα , for some Zα to be determined. Thus, var(Z) gives the variance in knockdown/knockout efficiency across guide-RNA targeting G j . The Q-statistic (13) of the residuals Xα allows us to assess the homogeneity of knockdown/knockout efficiency across {h α }nα=1 as well ˆ as compute an estimator  of var(Z). Given that Xα ’s stan1 1 1 1 dard error (SE) is sα = r + + N+ −r + + r − + N+ −r − by norα α α α mal approximation (in the case of sequencing depth normalization; for control  sequence normalization, replace this 1 formula with sα = r1+ + N+1−r + + r1− + N− (M+ /M − )−r − ), we α α α α n n n      2 ˆα − X ¯ ¯ = ˆ α/ have Q = uα X where X uα X uα α=1

MATERIALS AND METHODS HiTSelect’s gene ranking algorithms and statistics Our hit selection method was developed with two concerns in mind: dealing with off-target effects and handling variances in sequencing depth-of-coverage and in

α=1

α=1

and u α = 1/sα2 . Notice that the formula for the log-odds standard error is a decreasing function of increasing depthof-coverage. From Q, DerSimonian and Laird (14) derive Q−k+1  estimators of var(Z), given by 2 = n u α − , n u2 / n uα α=1

α=1

α

α=1

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

Figure 1. Flowchart for the HiTSelect de-convolution method. Screen deconvolution begins with loading screen readout and proceeds to hit selection. HiTSelect’s hit selection module estimates and controls for variances in intra-library knockdown efficiency and sequencing depth-of-coverage. HiTSelect users may then iterate between rounds of (i) functional analysis and clustering, (ii) gene interaction and gene network-centrality estimation, (iii) interactive data visualization of gene ontology and gene interaction (via Cytoscape) data and (iv) searching, curating and comparing gene sets generated from their analyses.

PAGE 2 OF 9

PAGE 3 OF 9

Nucleic Acids Research, 2015, Vol. 43, No. 3 e16

n n   1 ˆ α/ wα X wα . Here wα = s 2 + and Yj , given by Yˆ j = 2 α α=1 α=1 −1/2  n    and SE Yˆ j = wα . Thus, we model the effect α=1

ALGORITHM 1: Deterministic multi-objective ranking (i) For each gene G j : (a) Set score j = 0 (b) For each gene G i ∈ R; i = j : (1) If (ti , Yi ) ≺ (t j , Yj ), then score j = score j + 1 (2) If (ti , Yi ) ≡ (t j , Yj ), then score j = score j + 0.5 (ii) Rank genes based on their scores. To make our method more robust, we incorporated uncertainty in the estimation of (t j , Yj ), using a probabilistic version of this algorithm. For the independent random variables A = (A1 , . . . , Ak ) and B = (B1 , . . . , Bk ), we can define k

p (B ≺ A) = p(E [Bl ] < E[Al ]|al , bl ), where al and bl are l=1

sampled values of Al and Bl , respectively. In order to be able to compute p(E [Bl ] < E[Al ]|al , bl ), we need to make additional assumptions about how Al and Bl are distributed. In

p (E [Bl ] < E [Al ] |al , bl ) = ⎤ ⎡ 2 −(al −μ) −y2 ∞ ∞ 1 1 2σ 2 2σ 2 ∫ ⎣ √ e Al ∫ √ e Bl dy⎦ dμ. −∞ σ Al 2π bl −μ σ Bl 2π Hughes (16) demonstrates that this formula can be further reduced to p (E [Bl ] < E [Al ] |al , bl ) = 12 [1 + erf( √m2 )]. Here m =

 al −bl σ A2 +σ B2 l

, and erf(x) denotes the error function.

l

We formulas to estimate p(G i ≺ G j ) ≡  can use the  above   p E [Yi ] < E Yj |Yˆi , Yˆ j p(E [ti ] < E[t j ]|tˆi , tˆj ) for the gene     pair G i and G j . We can compute p E [Yi ] < E Yj |Yˆi , Yˆ j by setting m =

Yˆi −Yˆ j  . SE2 (Yˆi )+SE2 (Yˆ j )

We treat the number of ac-

tive guide-RNA, t j , as Poisson distributed, modeling the binary activities of guide-RNAs targeting a given gene j as independent identically distributed Bernoulli   random   variables (17). Thus, to compute p E [ti ] < E t j |tˆi , tˆj we need to compute the probability that tˆi is drawn from a Poisson distribution with mean less than the mean of the Poisson distribution from which tˆj was drawn. The test statistˆ −tˆ tic m = √i j is asymptotically standard normal under tˆi +tˆj

the null hypothesis of equal Poisson means (18) and has been shown to be more powerful than an exact test in determining whether two samples were drawn from Poisson distributions with the same rate (19). Thus, we replace m   above with m when we compute p E [ti ] < E t j |tˆi , tˆj . In general, guide-RNA pools targeting different genes will have different complexities, both in the original viral library and in the initial population of  infected cells. We apply the Anscombe transform (x → 2 x + 3/8) to tˆi and tˆj , before computing m , to stabilize variance in sampling across genes targeted by guide-RNA pools of varying degrees of complexity (20). This has the practical effect of producing a more uniform Pareto dominance estimate for small differences in the number of observed active guide-RNA (Supplementary Figure S1). Note that once we define p(G i ≺ G j ) we can immediately define the probability    of noncomparability: p G i ≡ G j = 1 − p G i ≺ G j − p(G j ≺ G i ). Let R denote the set of all genes that have at least one active guide-RNA. We compute all pairwise Pareto dominance probabilities between genes in R and assign a score   to gene G j via the formula scor e j = p Gi ≺ G j + i ∈R;i = j  p(G i ≡ G j ). We then rank genes by their score. 0.5 i ∈R;i = j

Our algorithm, as implemented in HiTSelect, can be summarized as follows. ALGORITHM 2: STOCHASTIC MULTI-OBJECTIVE RANKING (i) For each gene G j ∈ R: (a) Set score j = 0 (b) For each gene G i ∈ R; i = j : (1) score j = score j + p(G i ≺ G j )

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

size for gene G j by the summary log-odds Yj that we estimate via Yˆ j . This model accounts for variance in knockdown/knockout efficiency across guide-RNA via 2 and variance in the sequencing depth of individual guideRNA via sα2 . We now describe how both the estimate of gene effect size Yˆ j and its variance estimate SE2 (Yˆ j ) are used to formulate a probabilistic ranking scheme. If we denote the number of active guide-RNA targeting gene G j as t j , then the pair (t j , Yj ) characterizes the evidence for G j being a hit. We want to preferentially rank genes whose knockdown/knockout exhibits a large effect (Yˆ j is high) and for which that effect is reproducible (tˆj is high). To this end we developed a multi-objective optimization algorithm to maximize both quantities simultaneously. Our method for gene ranking uses the concept of Pareto dominance. A k-tuple A = (A1 , . . . , Ak ) of real numbers measuring k criteria for ranking is said to dominate another k-tuple B = (B1 , . . . , Bk ) if: (i) for all l, Bl ≤ Al , and (ii) there exists at least one l ∈ {1, . . . , k} such that Bl < Al . If these conditions hold, then we write B ≺ A. Since this relationship defines only a partial ordering on the set of ktuples, it is possible that neither A ≺ B nor B ≺ A. We will write A ≡ B in the case that A and B are not comparable. The gist of our algorithm will be to allow genes to compete for Pareto dominance in the multi-objective criterion (t j , Yj ). A Pareto dominant gene is a superior candidate for a hit, since it shows an increase in summary guide-RNA log-odds, in the number of distinct active guide-RNA, or both. We therefore score genes by the number of other genes that they dominate. In the absence of uncertainty, our method would be to give each gene 1 point for every gene it dominates and 1/2 point for every gene to which it is not comparable. We then rank genes by the number of points they have. In this way, genes compete for dominance in the ranked list. This algorithm can be summarized as:

the case of normal distributions with unknown means and known variance σ A2l and σ B2l , it can be shown (15) that

e16 Nucleic Acids Research, 2015, Vol. 43, No. 3 (2) score j = score j + 0.5 p(G i ≡ G j ) (ii) Rank genes based on their scores.

i, j =1...k

action strength estimates can be made comparable across networks. In matrix notation we write Am = [aimj ]. We formulate a statistic on the set of graphs with common nodes  k labeled by the genes G j j =1 , and vertex sets that have the {Am }rm=1 as their adjacency matrices. We refer to this ensemble of networks as the joint network. Bonacich first identified the components of the dominant eigenvector of a connected network’s adjacency matrix as a measure of network centrality for the corresponding nodes (21). This eigenvector solves Am x = λm x, where λm is the largest eigenvalue of Am in absolute value. We can extend network centrality to an estimate of joint-network centrality by solving the following optimization problem: r 

Ai x − λi x 2 + |1 − x 2 |. The first term assuresx minx i =1

is an approximate eigenvector for each of the r networks. The second term, |1 − x 2 |, assures convergence of iterative solution algorithms to a non-trivial solution by keeping iterates on the unit sphere and away from the origin. In the absence of this regularization term, the origin would be a basin of attraction. This restriction is acceptable, since we are only interested in the relative importance of genes to a particular pathway. HiTSelect solves this optimization problem using a sequential-quadratic program, with the centroid of the dominant eigenvectors of {Am }rm=1 (normalized to unit length) as a starting value. As a termination criterion, HiTSelect uses the criterion that the objective function is ≤ 10−6 or the algorithm has performed more than 400 iterations. Comparison of HiTSelect to the RIGER and RSA methods We compared the gene-ranking module of HiTSelect to RIGER and RSA on readout from two independent highcomplexity shRNA screens: (i) Tan et al. screened mesenchymal (HeyA8) and stem-like (Stem-A) ovarian carcinoma cell subpopulations for genes responsible for growth

and proliferation (4). Pooled libraries of 80 000 clones targeting 16 000 genes were introduced in the HeyA8 and Stem-A subpopulations separately. After 14 days, shRNA copy numbers were quantified in each population via NGS. The authors used the Stem-A population as a control to assess enrichment in the HeyA8 population. We here refer to HeyA8 as the treatment population. (ii) We co-infected human fibroblasts with lentivirus expressing a genome-wide high-coverage (600 000 shRNAs with ∼30 shRNA/gene) shRNA library along with SOX2, KLF4, OCT4 and cMYC reprogramming factors and an shRNA targeting P53 (7). Following the appearance of colonies with iPS cell characteristics 28 days post induction, the transduced cells were FACS-purified for TRA-1-81, a marker of fully reprogrammed cells. Then, shRNAs recovered from the TRA1-81 positive (TRA-1-81+) and TRA-1-81 negative (TRA1-81−) populations were sequenced (Supplementary Table S1). Using the TRA-1-81− population as a control, shRNA enrichment in the TRA-1-81+ population (which we call the treatment population) was determined. We chose these two screens as test data sets, because both used genomewide high-complexity libraries, used NGS as readout and performed extensive screen validation assays that we can use to estimate the sensitivity and specificity of hit selection algorithms. RIGER analysis was performed using the RIGER.jar Java archive downloaded from the GENEE website (http://www.broadinstitute.org/cancer/software/ GENE-E/extensions.html). As originally described by Luo et al. (12), this implementation uses a Kolmogorov– Smirnov test for significance assessment. RSA was designed for well-based screens (22), and codes for RSA analysis are available (http://carrier.gnf.org/publications/RSA/). However, the algorithm itself can be adapted to NGS screen readout by replacing well activity level with barcode read count odds ratio. We implemented RSA for NGS readout in MATLAB, using an odds ratio of 1 for the lower-bound parameter and 2 as an upper-bound parameter. Following their primary screen, Tan et al. (4) validated 135 genes using short interfering RNA (siRNA). Tan et al., having identified ovarian carcinoma subtypes via microarray and clinicopathological parameters, chose these genes with the intention of specifically targeting pathways enriched in the Stem-A subtype. One siRNA was transfected per well in a 96-well plate format. After 96 h of incubation, a colorimetric cell-proliferation assay was performed to quantify siRNA effect on growth. The assay was performed in quadruplicate, and there were two negative-control wells per plate. We compared siRNA-well to control-well proliferation ratios (obtained from the authors) between the HeyA8 and Stem-A populations using single-tailed t-tests, corrected for multiple-hypothesis testing via the Benjamini– Hochberg method (23). We found 65 genes in HeyA8 cells and 47 genes in Stem-A cells with differential negative effect on cell growth at a t-test P-value cutoff of P = 0.05. We then tabulated true-positive (TP), true-negative (TN), false-positive (FP) and false-negative (FN) gene-hit calls for HiTSelect, RIGER and RSA. Hit calls were tabulated in the following fashion: we consider a gene TP if it is ranked within the top 5% of genes in the primary screen of HeyA8 and, in the validation assay, has a mean proliferation ratio which is higher in HeyA8 at a t-test P ≤ 0.05; we call a

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

Joint-spectral network centrality for gene interaction networks  k Given a set of genes of interest G j j =1 , we wish to summarize multiple lines of evidence for their interaction into a single metric of a gene’s influence. HiTSelect interfaces with GeneMANIA (http://genemania.org) to aggregate genetic interaction (the perturbation of one gene causing a change in expression of a second gene), physical interaction (protein–protein interaction) and co-localization (genes expressed in the same tissue) experimental data. Suppose we have r total types of interaction, and for each interaction type, we have a measurement of the interaction strength ximj ∈, m = 1, . . . , r , between genes G i and G j for the mth interaction type. We want to summarize these r interaction networks. HitSelect computes the interaction strengths, ximj , using GeneMANIA. We then replace each ximj with its network-wide quantile aimj , i.e. its quantile estimated from   . This is done so that interthe set of numbers ximj

PAGE 4 OF 9

PAGE 5 OF 9

Single-gene knockdown validation BJ fibroblasts were seeded on a 6-well plate and infected with lentivirus expressing shRNA the following day. Each shRNA was cloned into a separate pSicoR-CMV-PuroT2A-GFP lentiviral vector. To make lentivirus, 293T cells at 60–70% confluency were transfected in 10-cm plates with 4 ␮g of the lentiviral vectors together with 1 ␮g each of the packaging plasmids VSV-G, MDL-RRE and RSVr using Fugene 6 from Roche. After 72 h, viral supernatants were harvested, filtered, titered and stored at −80◦ C. Five days after infection, fibroblasts were harvested, and RNA was isolated using the RNeasy Mini RNA Isolation kit (Qiagen) and reverse-transcribed using the High-Capacity cDNA Reverse Transcription kit (Applied BioSystems). The cDNA reaction was diluted 1:5 in TE (10-mM TrisCl/1-mM EDTA, pH 7.6) and used in Sybr Green real-time PCR reactions (Applied BioSystems). PCR primers were designed to amplify 100–400-bp fragments spanning exons. Housekeeping genes GAPDH and UBB were used as controls (Supplementary Table S2). Reactions were run in triplicates on a 7900HT machine (Applied BioSystems) according to the manufacturer’s instructions. Only samples with

single and matching end-point melting curve peaks were used for subsequent analysis. Cycle threshold values were imported into the REST software for fold-change calculations, using the housekeeping genes GAPDH and UBB as controls. RESULTS HiTSelect gene ranking outperforms on cancer stem cell and iPS cell shRNA screens Although RSA and RIGER were not designed for screens with NGS readout, they can be adapted for that purpose. One important insight of the RSA algorithm is the use of a test statistic that is sensitive to the number of distinct active guide-RNA (the hyper-geometric test). Thus, in principle, RSA controls for off-target effects, and it works well for smaller-scale screens on the order of hundreds of wells. However, we find that at the scale of the screens we considered (80 000 and 600 000 shRNA), the hyper-geometric test is not very sensitive. In fact, in both screens, RSA enriches for ‘singletons’, i.e. genes with only one active shRNA. By contrast, HiTSelect preferentially ranks genes for which the effect of knockdown is reproducible across multiple shRNA (Figure 2A and B). Compared to HiTSelect, RSA’s predictions for the primary screen of Tan et al. do not correlate as well with the results of their siRNA validation assay (Figure 2C). In comparing HiTSelect to RSA and RIGER in our iPS cell screen, we found that the upper-quartiles of the HiTSelect and RIGER gene rankings were enriched for positive-control genes to a statistically significant extent (hypergeometric tests: 3.3 × 10−5 ≤ P ≤ 2.4 × 10−3 ), indicating good sensitivity in hit detection (Figure 2D). However, RSA was not very sensitive and did not rank positive control genes highly (0.17 ≤ P ≤ 0.7). While neither HiTSelect nor RSA frequently ranked negative-control genes in the upper quartile range (0.16 ≤ P ≤ 0.68), RIGER ranks control sequences highly (P = 1.5 × 10−2 ). Thus, RIGER’s false-positive rate is high for some samples. RIGER also enriches for singletons (Figure 2A and B), and RIGER has the least correlation between the primary screen and the siRNA validation assay of Tan et al. (Figure 2C). Furthermore, RIGER ranks negative control sequences highly in our iPS cell screen (Figure 2D), indicating that RIGER’s FDR is high. Figure 2E shows the overlap of the three algorithms. In a typical workflow, a researcher may first call hits at the 5% level, in order to have a large enough sample to perform functional annotation clustering and gene network analysis. After obtaining relevant pathways, one may then choose genes from those pathways for validation using a more stringent threshold, such as 1%. At this threshold, the distributions of the number of active guide-RNA are shifted to the left for RSA and RIGER compared to HiTSelect in both screens. In particular, RSA and RIGER preferentially rank singletons in the top 1%. HiTSelect’s top 1% ranked genes all have at least two (and most have three or more) active shRNA. Thus, HiTSelect is less prone to offtarget effects since, for all genes ranked in the top 1%, the effect of knockdown is reproducible over multiple guideRNA. To compare the impact of off-target effects on hit

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

gene FP if it is enriched in the HeyA8 primary screen but not in the validation assay (P > 0.05). Similarly, if a gene is ranked within the top 5% in the screen of Stem-A and has a higher proliferation ratio in the Stem-A validation with P ≤ 0.05, then it is a TN. If it is enriched in Stem-A’s primary screen but fails to validate (P > 0.05), then we call it an FN. Sample sizes were unbalanced in the validation assay (65 genes down in HeyA8 cells versus 47 genes down in Stem-A). We chose the Matthews correlation coefficient as an estimator of an algorithm’s predictive value, since it is generally considered to be the most robust estimator of a classifier’s success for unbalanced sample sizes (24). We also chose balanced accuracy (BA) as a second measurement of these algorithms’ performance as classifiers. Balanced ac0.5TP 0.5TN curacy (BA = TP+FN + TN+FP ) is a renormalized accuracy that avoids overestimation of classifier performance arising when the sample is imbalanced (25). Lastly, we assembled lists of negative-control genes and positive-control genes for our iPS cell screen and compared their rank distributions between algorithms. In the iPS cell screen, we screened for barriers to reprogramming, i.e. genes whose knockdown enhances reprogramming efficiency. As negative controls, we used the validated targets of known tumor suppressor micro-RNA (miRNA) miR218 (26), a random gene list and explicit negative-control sequences. As positive control genes, we used a list of 26 barrier genes whose knockdown has been already shown to enhance reprogramming efficiency (Supplementary Table S2). Additionally, we used the experimentally validated targets of miR-17 (whose targets include the tumor suppressor P21) and miR-200 (an inhibitor of the epithelial-tomesenchymal transition) as positive controls. Both of these miRNA have been shown to enhance reprogramming efficiency (27,28). We also found that HiTSelect outperformed RSA and RIGER on simulated data by computing a Receiver Operating Characteristic curve shown in Supplementary Figure S2.

Nucleic Acids Research, 2015, Vol. 43, No. 3 e16

e16 Nucleic Acids Research, 2015, Vol. 43, No. 3

selection of the three algorithms, we spot-checked a handful of gene hits for ineffective shRNA and also employed the software GESS (29) to identify off-targets attributable to ‘seed effects’. Firstly, we identified the seven most highly ranked singleton genes that were mutually ranked in the top 1% by RIGER and RSA in our iPS cell screen. None of HiTSelect’s top 1% are singletons. We measured the knockdown efficiency of the single active shRNA for each sin-

Figure 3. Validation of knockdown for singleton genes highly ranked by RSA and RIGER. (A) RSA and RIGER enriched for genes with only one active shRNA in both the carcinoma screen of Tan et al. and in our iPS cell screen. In contrast to these methods, all of HiTSelect’s top genes demonstrate a robust effect over at least two shRNA. To validate that this phenomenon leaves RSA and RIGER prone to off-target effects, we performed knockdowns in human fibroblasts for the top seven ranked genes that were ranked in the 1% by both RSA and RIGER and that were also singletons. The ranking was relative to the iPS cell screen. Using the single shRNA identified by their algorithms per gene, we were able to produce a knockdown greater than 2-fold for only five out of seven genes. In contrast, all of HiTSelect’s genes identified in the top 1% had at least two active shRNA which produced a knockdown of greater than 2-fold. (B) GESS analysis to identify shRNAs that may exhibit off-target effects due to homology between their seed sequences and unintended gene targets. GESS compared the seed-match frequency (SMF) of shRNAs targeting the top 1% gene hits of our iPS screen (phenotype) to those targeting the bottom 1% ranked genes (no-phenotype). GESS did not find any HiTSelect hits with statistically significant off-target effects at the GESS (Benjamini–Hochberg corrected) P-value cutoff of 0.05. However, 17% of shRNA from RIGER and 35% of shRNA from RSA showed statistically significant off-target effects.

gleton gene via reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR), comparing the effect of RNAi to that of a nonsense-sequence control. We then compared these knockdown effect sizes to those observed in knockdown experiments of genes identified in the top 1% by HiTSelect. All of HiTSelect’s top 1% hits have at least two active shRNA, and we performed knockdowns using the two most active shRNA (identified by HiTSelect) separately and averaged the results. All knockdowns were performed in human fibroblasts. The median knockdown efficiency was slightly greater in HiTSelect’s picks (log2 foldchange = −2) than in the RSA/RIGER singletons (log2 fold-change = −1.8), and two of the seven RSA/RIGER singletons did not show significant knockdown efficiency (0) in the PA/LFnDTA and DT screens, respectively (Supplementary Figure S4A and B). All seven of the genes validated by Zhou et al. (via a cell viability assay) were identified in HiTSelect’s picks. In particular, anthrax receptor (ANTXR1) and diphtheria toxin receptor (HBEGF) were the number 1 genes in HiTSelect’s ranking of the PA/LFnDTA and DT screens, respectively. Moreover, knockout of these two genes in Zhou et al.’s validation assay produced the greatest increases in cell viability over control, following treatment with PA/LFnDTA or DT, respectively. Due to the small number of gene hits in the screens of Zhou et al. (14–16 genes), DAVID did not identify any over-represented functional annotation terms, and GeneMANIA also did not identify significant interactions among these genes, limiting our re-analysis of the screens. HiTSelect thus provides a rigorous analysis method for selecting hits that correlate well with validation. In addition to these cell-surface receptors, HiTSelect also identifies adenosine triphosphate (ATP) binding proteins in the PA/LFnDTA screen: CFTR and PECR. PECR knockout was validated by Zhou et al. as conferring resistance to PA/LFnDTA, but CFTR was not. Another bacterial toxin, cholera toxin, induces the secretion of intestinal fluid via CFTR upregulation (in an ATPmediated fashion); thus, HiTSelect revealed that the role of CFTR in anthrax toxicity is a potential avenue for further investigation (33).

PAGE 8 OF 9

PAGE 9 OF 9

10. 11.

12.

13. 14. 15. 16.

18. 19. 20. 21. 22.

23. Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Methodol., 57, 289–300. 24. Jurman,G., Riccadonna,S. and Furlanello,C. (2012) A comparison of MCC and CEN error measures in multi-class prediction. PLoS One, 7, e41882. 25. Brodersen,K.H., Ong,C.S., Stephan,K.E. and Buhmann,J.M. (2010) The balanced accuracy and its posterior distribution. In: Proceedings of the 20th International Conference on Pattern Recognition. IEEE Conference, Istanbul, 3121–3124. 26. Uesugi,A., Kozaki,K.-I., Tsuruta,T., Furuta,M., Morita,K.-I., Imoto,I., Omura,K. and Inazawa,J. (2011) The tumor suppressive microRNA miR-218 targets the mTOR component Rictor and inhibits AKT phosphorylation in oral cancer. Cancer Res., 71, 5765–5778. 27. Li,Z., Yang,C.-S., Nakashima,K. and Rana,T.M. (2011) Small RNA-mediated regulation of iPS cell generation. EMBO J., 30, 823–834. 28. Gregory,P.a, Bert,A.G., Paterson,E.L., Barry,S.C., Tsykin,A., Farshid,G., Vadas,M.a, Khew-Goodall,Y. and Goodall,G.J. (2008) The miR-200 family and miR-205 regulate epithelial to mesenchymal transition by targeting ZEB1 and SIP1. Nat. Cell Biol., 10, 593–601. 29. Sigoillot,F.D., Lyman,S., Huckins,J.F., Adamson,B., Chung,E., Quattrochi,B. and King,R.W. (2012) A bioinformatics method identifies prominent off-targeted transcripts in RNAi screens. Nat. Methods, 9, 363–366. 30. Jiao,X., Sherman,B.T., Huang,D.W., Stephens,R., Baseler,M.W., Lane,H.C. and Lempicki,R.a (2012) DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics, 28, 1805–1806. 31. Zhou,Y., Zhu,S., Cai,C., Yuan,P., Li,C., Huang,Y. and Wei,W. (2014) High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature, 509, 487–491. 32. Bader,G. and Hogue,C. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 27, 1–27. 33. Liu,S., Moayeri,M. and Leppla,S.H. (2014) Anthrax lethal and edema toxins in anthrax pathogenesis. Trends Microbiol., 22, 317–325.

Downloaded from http://nar.oxfordjournals.org/ at UPVA on May 15, 2015

17.

(2014) Genome-scale CRISPR-Cas9 knockout screening in human cells. Science, 343, 84–87. Rusk,N. (2013) Genetics: mammalian genes interacting. Nat. Methods, 10, 281–281. ¨ Konig,R., Chiang,C., Tu,B.P., Yan,S.F., DeJesus,P.D., Romero,A., Bergauer,T., Orth,A., Krueger,U., Zhou,Y. et al. (2007) A probability-based approach for the analysis of large-scale RNAi screens. Nat. Methods, 4, 847–849. Luo,B., Cheung,H.W., Subramanian,A., Sharifnia,T., Okamoto,M., Yang,X., Hinkle,G., Boehm,J.S., Beroukhim,R., Weir,B.a. et al. (2008) Highly parallel identification of essential genes in cancer cells. Proc. Natl Acad. Sci. U.S.A., 105, 20380–20385. Cochran,W. (1954) The combination of estimates from different experiments. Biometrics, 10, 101–129. DerSimonian,R. and Laird,N. (1986) Meta-analysis in clinical trials. Control. Clin. Trials, 7, 177–188. Fieldsend,J. (2005) Multi-objective optimisation in the presence of uncertainty. Evol. Comput. 2005., 1, 243–250. Hughes,E. (2001) Evolutionary multi-objective ranking with uncertainty and noise. In: Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization. Springer-Verlag, Berlin, pp. 329–343. Haight,F.A. (1967) Handbook of the Poisson Distribution. John Wiley & Sons, Inc., NY. Thode,H.C. (1997) Power and sample size requirements for tests of differences between two Poisson rates. J. R. Stat. Soc. D Stat., 46, 227–230. K. Detre,C.W. (1970) The comparison of two Poisson-distributed observations. Biometrics, 26, 851–854. Anscombe,F. (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika, 35, 246–254. Bonacich,P. (1972) Factoring and weighting approaches to status scores and clique identification. J. Math. Sociol, 2, 113–120. Chiang,C., Tu,B.P., Ko,R., Yan,S.F., Dejesus,P.D., Romero,A., Bergauer,T., Orth,A., Krueger,U., Zhou,Y. et al. (2007) A probability-based approach for the analysis of large-scale RNAi screens. Nat. Methods, 4, 847–849.

Nucleic Acids Research, 2015, Vol. 43, No. 3 e16

HiTSelect: a comprehensive tool for high-complexity-pooled screen analysis.

Genetic screens of an unprecedented scale have recently been made possible by the availability of high-complexity libraries of synthetic oligonucleoti...
2MB Sizes 4 Downloads 6 Views