Accepted Manuscript AGEseq: Analysis of Genome Editing by Sequencing Liang-Jiao Xue, Chung-Jui Tsai PII:
S1674-2052(15)00263-4
DOI:
10.1016/j.molp.2015.06.001
Reference:
MOLP 143
To appear in:
MOLECULAR PLANT
Received Date: 12 April 2015 Revised Date:
30 May 2015
Accepted Date: 2 June 2015
Please cite this article as: Xue L.-J., and Tsai C.-J. (2015). AGEseq: Analysis of Genome Editing by Sequencing. Mol. Plant. doi: 10.1016/j.molp.2015.06.001. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
1
AGEseq: Analysis of Genome Editing by Sequencing
2 Liang-Jiao Xue1,2,3*, Chung-Jui Tsai1,2,3
5
1
6
2
7
3
RI PT
4
School of Forestry and Natural Resources, University of Georgia, Athens, Georgia 30602, USA. Department of Genetics, University of Georgia, Athens, Georgia 30602, USA.
Institute of Bioinformatics, University of Georgia, Athens, Georgia 30602, USA.
8 *
Correspondence: Liang-Jiao Xue (
[email protected])
M AN U
9 10 11
TE D EP
13
Running title: Analysis of Genome Editing by Sequencing
AC C
12
SC
3
ACCEPTED MANUSCRIPT
Dear Editor,
15
Knock-out experiments are critical for the evaluation of gene function. Researchers have increasingly
16
relied on genome editing technologies for precise mutagenesis at loci of interest, using engineered
17
nucleases such as Zinc finger nucleases, transcription activator-like effector nucleases (TALENs), and
18
CRISPR (clustered regularly interspaced short palindromic repeats)-associated proteins. Sequence-
19
specific targeting and cleavage by these systems generate double-stranded breaks and trigger
20
endogenous repair machineries, resulting in small indels that can disrupt reading frames and gene
21
function (Curtin et al., 2012). These methods have been successfully applied to plants, with the CRISPR
22
system particularly powerful for non-model species (Belhaj et al., 2013; Lozano-Juste and Cutler, 2014).
23
Several tools, such as TALENT (Cermak et al., 2011) and CRISPR-P (Lei et al., 2014), have been developed
24
to facilitate design of genome editing experiments. However, few tools are available to evaluate the
25
outcome of genome editing. Amplicon-sequencing is commonly employed for genome editing analysis
26
where genomic sequences that span the target loci are amplified, sometimes cloned, and sequenced. A
27
number of programs have been developed to decode heterozygous chromatograms from direct-
28
sequencing of PCR products for identification of sequence polymorphisms (Crowe, 2005; Dmitriev and
29
Rakitov, 2008; Ma et al., 2015). However, the throughput of Sanger sequencing, even without cloning, is
30
not amenable to screening large numbers of transgenic lines, especially with increasingly sophisticated
31
multiplex targeting (Xie et al., 2015).
32
No open-source programs are currently available for analysis of amplicon-sequencing data from high-
33
throughput sequencers. After quality-control filtering and demultiplexing, amplicon sequence analysis
34
usually involves alignment with target/reference sequences and detection of editing events, such as
35
indels or single nucleotide polymorphisms (SNPs). Much bioinformatic effort is required, unless
36
commercial software is available. A web-based tool for amplicon-sequencing data analysis was recently
37
reported (Guell et al., 2014). However, only one reference sequence is accepted at a time, which makes
38
application to large datasets cumbersome. Here we report a versatile and user-friendly tool, Analysis of
39
Genome Editing by Sequencing (AGEseq), to address this limitation. AGEseq is available from AspenDB
40
(http://aspendb.uga.edu) as a standalone program or a Galaxy (Goecks et al., 2010)-based webtool.
41
AGEseq supports both Sanger and deep-sequencing reads. For deep-sequencing, degenerate primers
42
can be designed to amplify both alleles of the target gene as well as closely related gene(s). Amplicons
43
from unrelated genes or across samples are then barcoded and pooled for sequencing (Figure 1A). For
AC C
EP
TE D
M AN U
SC
RI PT
14
ACCEPTED MANUSCRIPT
data analysis, AGEseq requires a design file and a directory of read files as inputs. The design file
45
describes the reference sequences, usually containing 30-40 bp flanking regions of the target editing
46
site(s) (Figure 1B). The read files are stored in a directory named “reads” by default, and multiple file
47
formats are accepted (Figure 1A). AGEseq uses BLAT to align reference and read sequences. Aligned
48
reads are assigned to the best hit among the reference sequences provided in the design file, and
49
matching regions are extracted for indel- or SNP-calling. The output file reports the aligned (target and
50
read) sequences and detection frequency for each editing event (Figure 1C-D).
51
Our lab has recently applied CRISPR-based genome editing to perturb lignin biosynthesis in Populus. A
52
gene-specific guide RNA (gRNA) was designed to target 4-coumarate:CoA ligase 1 (4CL1), but not the
53
paralogous 4CL5 (Zhou et al., 2015). Degenerate primers were designed to amplify both 4CL1 (target)
54
and 4CL5 (off-target) sequences from independent transgenic lines to assess editing specificity. AGEseq
55
successfully distinguished the duplicates as well as their alleles (Figure 1B), and confirmed biallelic
56
mutations in all transgenic lines examined, with no off-target cleavage of 4CL5 (Figure 1E). In support of
57
a null 4CL1, all primary transformants exhibited a reddish-brown wood discoloration (Figure 1F) known
58
to be associated with lignin modification (Zhou et al., 2015). As a further test, AGEseq was applied to
59
amplicon data of soybean with DDM1 (Decrease in DNA Methylation) editing in one or two
60
homoeologous loci as described in Jacobs et al. (2015). The editing patterns detected by AGEseq were
61
consistent with those obtained by Geneious R7 (Biomatters Ltd.) used in that study, ranging from small
62
indels (10 nt), with varying (1-98%) editing efficiencies (Supplemental Table 1)
63
(Jacobs et al., 2015). AGEseq flags events with a long stretch of indels and/or mismatches as “strange
64
events” that require manual examination, and three such cases were identified. Manual inspection
65
confirmed a large (44 nt) deletion in one case, while the other two were found by Jacobs et al. (2015) to
66
harbor unusual insertions from the A. rhizogenes root-inducing plasmid after additional cloning and
67
sequencing. The results demonstrate the versatility of AGEseq in detecting or flagging genome editing
68
patterns across a wide range of data scenarios.
69
Detailed instructions of AGEseq are provided for all operating systems (Supplemental Text 1). The
70
analysis sensitivity can be adjusted by two user-configurable parameters: mismatch allowance (default
71
at 10%) and minimum read coverage (default at 0). Systematic errors introduced during amplicon
72
library preparation and sequencing that involve PCR, or by base-calling algorithms are common in deep-
73
sequencing data, and they will appear as “SNPs” in the AGEseq report (Figure 1C and Supplemental Text
74
1). For this reason, AGEseq considers indels as potential genome editing events by default, although
AC C
EP
TE D
M AN U
SC
RI PT
44
ACCEPTED MANUSCRIPT
SNPs are also reported. If SNPs are of interest, setting a minimum read coverage is recommended to
76
reduce random errors. A known limitation with BLAT and similar aligners is their inconsistent gap-
77
handling in the presence of homo-nucleotides, as shown for both 4CL1 alleles in Figure 1C (red boxes, 1-
78
nt deletion at position 56 or 57). AGEseq does not consider these differences and reports, by default,
79
the sum of all indel reads as well as WT-like (non-edited) reads from each sample in the summary
80
(Figure 1D). User inspection is therefore recommended. As mentioned, AGEseq also facilitates
81
identification of unusual events that require manual inspection, and sometimes follow-up experiments
82
to confirm the editing patterns.
83
The ability of AGEseq to effectively discriminate allelic sequences of duplicated genes suggests that it
84
can support analysis with polyploid genomes. When only one reference sequence is provided, the
85
AGEseq output can be mined for allelic variations, if any, in the target region. As a standalone software,
86
AGEseq is 1) easy to use—no command line or programming skill is required for Windows or Mac users;
87
2) versatile—multiple sequencing platforms and file types are supported for assessing genome editing,
88
allelic variation and/or off-target cleavage; and 3) extensible—the Perl script can easily be imported to
89
other bioinformatics pipelines. As an example, we adapted AGEseq as a utility in the Galaxy platform
90
(Goecks et al., 2010) to support web-based analysis. It is accessible at AspenDB
91
(http://aspendb.uga.edu/ageseq) or through Galaxy Tool Shed (https://toolshed.g2.bx.psu.edu) for
92
installation in local instances. A limitation of the webtool is that only one sequence read file can be
93
processed at a time. For a multiplexed dataset with a large number of samples, the use of the
94
standalone AGEseq program is recommended. Although developed for genome editing analysis, AGEseq
95
can be adapted for SNP genotyping, metagenomic analysis or other amplicon sequencing applications.
96
Author Contributions
97
L.X. and C.T. designed the software and wrote the paper, L.X. wrote the software.
98
Acknowledgments
99
This work was supported by the Georgia Research Alliance-Haynes Forest Biotechnology endowment.
AC C
EP
TE D
M AN U
SC
RI PT
75
100
The authors thank Thomas Jacobs and Xiaohong Zhou for providing amplicon sequences used in the
101
analysis, Camille Annan, Xi Gu, Swarnali Louha, Xiaohan Mei, and Elizabeth Trippe for testing the
102
software, Mohammad Mohebbi for developing redirecting website link to Galaxy/TsaiLab, and Scott
103
Harding for critical review of the manuscript.
104
ACCEPTED MANUSCRIPT
105
Competing interests
106
The authors declare no competing interests.
RI PT
107 Figure Legend
109
Figure 1. The workflow and output of AGEseq.
110
(A) Experimental considerations (left panel) and data analysis process (right panel) of AGEseq.
111
Amplicons derived from different alleles and/or homologous genes can be obtained using degenerate
112
primers. Amplicons from different samples or unrelated genes can be indexed, normalized and pooled
113
for sequencing. AGEseq supports multiple sequencing platforms and file types, and calls BLAT for
114
sequence alignment against user-supplied reference sequences. AGEseq then identifies indels and SNPs
115
to report genome editing events.
116
(B) Example of the design file. Multiple genes and alleles can be included.
117
(C) Example of an AGEseq output file with annotated genome editing events. Results are populated in
118
eight columns as follows. A, input file name; B, gene/allele name; C-D, target sequence (C) from the
119
design file that best matches the amplicon sequence in D; E, number of reads shown in D; F-G, target (F)
120
and read (G) sequences displayed according to their alignment. Dashes are introduced to denote indels
121
between the pair. “B” and “E” in F and G denotes beginning and end of the alignment; H, patterns of
122
indels, if any, denoted by position (first integer) of insertion (I) or deletion (D), followed by the number
123
(second integer) of nucleotides affected; I, patterns of SNPs, if any, denoted by the nucleotide
124
position(s). By default, only indels are considered as genome editing events, as SNPs may arise from
125
sequence errors introduced by PCR during library preparation and/or sequencing, or by base-callers. For
126
illustration purpose, only representative events with highest and lowest read support are shown for
127
each allele. Red boxed areas are examples of alignment artifacts that warrant manual inspection. Each
128
analysis group (one allele from one sample) ends with read statistics that include “total reads” in the
129
input file, “total hits” matching all target sequences in the design file, “sub hits” matching the specific
130
target, and all “indel hits” for the given target. See Supplementary Text for further explanation.
131
(D) Summary of AGEseq analysis is provided for each sample. Columns A-G are as above. Column H
132
(indel or WT rate %) is calculated as the fraction of “indel or WT hits” over “sub hits”. Column I denotes
AC C
EP
TE D
M AN U
SC
108
ACCEPTED MANUSCRIPT
the editing patterns as + (insertion) or – (deletion) with the number of nucleotides affected, or WT-like.
134
See Supplementary Text for further explanation.
135
(E) The summary information from (D) can be easily converted to a table suitable for publication. Allelic
136
detection frequencies (% Reads) were recalculated by dividing “indel hits” of each allele over the sum of
137
the “sub hits” from both alleles. For off-target assessment, it is more appropriate to report the
138
predominant, unedited (WT-like) event, as shown for the paralogous 4CL5.
139
(F) The wood discoloration phenotype of line 61 due to biallelic mutations in 4CL1 that affect lignin
140
biosynthesis (from Zhou et al., 2015). Scare bar is 1 cm.
141
Reference
142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
Belhaj, K., Chaparro-Garcia, A., Kamoun, S., and Nekrasov, V. (2013). Plant genome editing made easy: targeted mutagenesis in model and crop plants using the CRISPR/Cas system. Plant Methods 9:39. Cermak, T., Doyle, E.L., Christian, M., Wang, L., Zhang, Y., Schmidt, C., Baller, J.A., Somia, N.V., Bogdanove, A.J., and Voytas, D.F. (2011). Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting. Nucleic Acids Res 39:e82. Crowe, M.L. (2005). SeqDoC: rapid SNP and mutation detection by direct comparison of DNA sequence chromatograms. BMC Bioinformatics 6:133. Dmitriev, D.A., and Rakitov, R.A. (2008). Decoding of Superimposed Traces Produced by Direct Sequencing of Heterozygous Indels. Plos Comput Biol 4:e1000113. Goecks, J., Nekrutenko, A., Taylor, J., and Team, T.G. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11:R86. Guell, M., Yang, L.H., and Church, G.M. (2014). Genome editing assessment using CRISPR Genome Analyzer (CRISPR-GA). Bioinformatics 30:2968-2970. Jacobs, T.B., LaFayette, P.R., Schmitz, R.J., and Parrott, W.A. (2015). Targeted genome modifications in soybean with CRISPR/Cas9. BMC Biotechnology 15:16. Lei, Y., Lu, L., Liu, H.Y., Li, S., Xing, F., and Chen, L.L. (2014). CRISPR-P: A Web Tool for Synthetic SingleGuide RNA Design of CRISPR-System in Plants. Mol Plant 7:1494-1496. Lozano-Juste, J., and Cutler, S.R. (2014). Plant genome engineering in full bloom. Trends Plant Sci 19:284-287. Ma, X., Chen, L., Zhu, Q., Chen, Y., and Liu, Y.-G. (2015). Rapid decoding of sequence-specific nucleaseinduced heterozygous and biallelic mutations by direct sequencing of PCR products. Molecular Plant:10.1016/j.molp.2015.1002.1012. Xie, K.B., Minkenberg, B., and Yang, Y.N. (2015). Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. PNAS 112:3570-3575. Zhou, X., Jacobs, T.B., Xue, L.-J., Harding, S.A., and Tsai, C.-J. (2015). Exploiting SNPs for biallelic CRISPR mutations in the outcrossing woody perennial Populus reveals 4-coumarate:CoA ligase specificity and redundancy. New Phytol:doi:10.1111/nph.13470.
171
AC C
EP
TE D
M AN U
SC
RI PT
133
Amplicon primer design considerations • Degeneracy to amplify both alleles, if heterozygous • Degeneracy to amplify homologous genes, if desired Barcoding & multiplexing • Pooling across amplicons and/or samples Reference sequence extraction (see B) • Reference sequences of both alleles, if known • Reference sequences of homologous genes, if any
AGEseq
B. Design file
ACCEPTED MANUSCRIPT Input sequence data parsing
• Type: fa, fas, fasta, fastq, fq, seq, or txt • Single‐end or paired‐end Alignment of read and target sequences • Blat search and read assignment Detection of editing events • Indels or SNPs • Read counting
Sequence
4CL1_1
CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGG AGGGGCTCCATTG
4CL1_2
CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGG AGGGGCTCCATTG
4CL5_1
CCAAGTCACCCGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGG AGGGTCGCCATTG
4CL5_2 CCAAGTCACCTGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGG AGGGTCGCCGTTG
AC C
D. Output file (summary)
EP
TE D
M AN U
SC
C. Output file (all events)
Target
RI PT
A. Pre‐AGEseq
E.
Plant #
Target
Sequence
Total Read #
% Reads
Indels
WT
4CL1
CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGGAGGGGCTCCATTG CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGGAGGGGCTCCATTG
‐
‐
‐
61
4CL1
CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCT-GAGGGGCTCCATTG CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCT-GAGGGGCTCCATTG
2531 2580
49% 50%
‐1 ‐1
61
4CL5
CCAAGTCACCCGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGGAGGGTCGCCATTG CCAAGTCACCTGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGGAGGGTCGCCGTTG
2102 2379
47% 53%
‐ ‐
F.
Figure 1. The workflow and output of AGEseq. (A) Experimental considerations (left panel) and data analysis process (right panel) of AGEseq. Amplicons derived from different alleles and/or homologous genes can be obtained using degenerate primers. Amplicons from different samples or unrelated genes can be indexed, normalized and pooled for sequencing. AGEseq supports multiple sequencing platforms and file types, and calls BLAT for sequence alignment against user‐supplied reference sequences. AGEseq then identifies indels and SNPs to report genome editing events. (B) Example of the design file. Multiple genes and alleles can be included. (C) Example of an AGEseq output file with annotated genome editing events. Results are populated in eight columns as follows. A, input file name; B, gene/allele name; C‐D, target sequence (C) from the design file that best matches the amplicon sequence in D; E, number of reads shown in D; F‐G, target (F) and read (G) sequences displayed according to their alignment. Dashes are introduced to denote indels between the pair. “B” and “E” in F and G denotes beginning and end of the alignment; H, patterns of indels, if any, denoted by position (first integer) of insertion (I) or deletion (D), followed by the number (second integer) of nucleotides affected; I, patterns of SNPs, if any, denoted by the nucleotide position(s). By default, only indels are considered as genome editing events, as SNPs may arise from sequence errors introduced by PCR during library preparation and/or sequencing, or by base‐callers. For illustration purpose, only representative events with highest and lowest read support are shown for each allele. Red boxed areas are examples of alignment artifacts that warrant manual inspection. Each analysis group (one allele from one sample) ends with read statistics that include “total reads” in the input file, “total hits” matching all target sequences in the design file, “sub hits” matching the specific target, and all “indel hits” for the given target. See Supplementary Text for further explanation. (D) Summary of AGEseq analysis is provided for each sample. Columns A‐G are as above. Column H (indel or WT rate %) is calculated as the fraction of “indel or WT hits” over “sub hits”. Column I denotes the editing patterns as + (insertion) or – (deletion) with the number of nucleotides affected, or WT‐like. See Supplementary Text for further explanation. (E) The summary information from (D) can be easily converted to a table suitable for publication. Allelic detection frequencies (% Reads) were recalculated by dividing “indel hits” of each allele over the sum of the “sub hits” from both alleles. For off‐target assessment, it is more appropriate to report the predominant, unedited (WT‐like) event, as shown for the paralogous 4CL5. (F) The wood discoloration phenotype of line 61 due to biallelic mutations in 4CL1 that affect lignin biosynthesis (from Zhou et al., 2015). Scare bar is 1 cm.