AGEseq: Analysis of Genome Editing by Sequencing.

Accepted Manuscript AGEseq: Analysis of Genome Editing by Sequencing Liang-Jiao Xue, Chung-Jui Tsai PII:

S1674-2052(15)00263-4

DOI:

10.1016/j.molp.2015.06.001

Reference:

MOLP 143

To appear in:

MOLECULAR PLANT

Received Date: 12 April 2015 Revised Date:

30 May 2015

Accepted Date: 2 June 2015

Please cite this article as: Xue L.-J., and Tsai C.-J. (2015). AGEseq: Analysis of Genome Editing by Sequencing. Mol. Plant. doi: 10.1016/j.molp.2015.06.001. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

1

AGEseq: Analysis of Genome Editing by Sequencing

2 Liang-Jiao Xue1,2,3*, Chung-Jui Tsai1,2,3

5

1

6

2

7

3

RI PT

4

School of Forestry and Natural Resources, University of Georgia, Athens, Georgia 30602, USA. Department of Genetics, University of Georgia, Athens, Georgia 30602, USA.

Institute of Bioinformatics, University of Georgia, Athens, Georgia 30602, USA.

8 *

Correspondence: Liang-Jiao Xue ([email protected])

M AN U

9 10 11

TE D EP

13

Running title: Analysis of Genome Editing by Sequencing

AC C

12

SC

3

ACCEPTED MANUSCRIPT

Dear Editor,

15

Knock-out experiments are critical for the evaluation of gene function. Researchers have increasingly

16

relied on genome editing technologies for precise mutagenesis at loci of interest, using engineered

17

nucleases such as Zinc finger nucleases, transcription activator-like effector nucleases (TALENs), and

18

CRISPR (clustered regularly interspaced short palindromic repeats)-associated proteins. Sequence-

19

specific targeting and cleavage by these systems generate double-stranded breaks and trigger

20

endogenous repair machineries, resulting in small indels that can disrupt reading frames and gene

21

function (Curtin et al., 2012). These methods have been successfully applied to plants, with the CRISPR

22

system particularly powerful for non-model species (Belhaj et al., 2013; Lozano-Juste and Cutler, 2014).

23

Several tools, such as TALENT (Cermak et al., 2011) and CRISPR-P (Lei et al., 2014), have been developed

24

to facilitate design of genome editing experiments. However, few tools are available to evaluate the

25

outcome of genome editing. Amplicon-sequencing is commonly employed for genome editing analysis

26

where genomic sequences that span the target loci are amplified, sometimes cloned, and sequenced. A

27

number of programs have been developed to decode heterozygous chromatograms from direct-

28

sequencing of PCR products for identification of sequence polymorphisms (Crowe, 2005; Dmitriev and

29

Rakitov, 2008; Ma et al., 2015). However, the throughput of Sanger sequencing, even without cloning, is

30

not amenable to screening large numbers of transgenic lines, especially with increasingly sophisticated

31

multiplex targeting (Xie et al., 2015).

32

No open-source programs are currently available for analysis of amplicon-sequencing data from high-

33

throughput sequencers. After quality-control filtering and demultiplexing, amplicon sequence analysis

34

usually involves alignment with target/reference sequences and detection of editing events, such as

35

indels or single nucleotide polymorphisms (SNPs). Much bioinformatic effort is required, unless

36

commercial software is available. A web-based tool for amplicon-sequencing data analysis was recently

37

reported (Guell et al., 2014). However, only one reference sequence is accepted at a time, which makes

38

application to large datasets cumbersome. Here we report a versatile and user-friendly tool, Analysis of

39

Genome Editing by Sequencing (AGEseq), to address this limitation. AGEseq is available from AspenDB

40

(http://aspendb.uga.edu) as a standalone program or a Galaxy (Goecks et al., 2010)-based webtool.

41

AGEseq supports both Sanger and deep-sequencing reads. For deep-sequencing, degenerate primers

42

can be designed to amplify both alleles of the target gene as well as closely related gene(s). Amplicons

43

from unrelated genes or across samples are then barcoded and pooled for sequencing (Figure 1A). For

AC C

EP

TE D

M AN U

SC

RI PT

14

ACCEPTED MANUSCRIPT

data analysis, AGEseq requires a design file and a directory of read files as inputs. The design file

45

describes the reference sequences, usually containing 30-40 bp flanking regions of the target editing

46

site(s) (Figure 1B). The read files are stored in a directory named “reads” by default, and multiple file

47

formats are accepted (Figure 1A). AGEseq uses BLAT to align reference and read sequences. Aligned

48

reads are assigned to the best hit among the reference sequences provided in the design file, and

49

matching regions are extracted for indel- or SNP-calling. The output file reports the aligned (target and

50

read) sequences and detection frequency for each editing event (Figure 1C-D).

51

Our lab has recently applied CRISPR-based genome editing to perturb lignin biosynthesis in Populus. A

52

gene-specific guide RNA (gRNA) was designed to target 4-coumarate:CoA ligase 1 (4CL1), but not the

53

paralogous 4CL5 (Zhou et al., 2015). Degenerate primers were designed to amplify both 4CL1 (target)

54

and 4CL5 (off-target) sequences from independent transgenic lines to assess editing specificity. AGEseq

55

successfully distinguished the duplicates as well as their alleles (Figure 1B), and confirmed biallelic

56

mutations in all transgenic lines examined, with no off-target cleavage of 4CL5 (Figure 1E). In support of

57

a null 4CL1, all primary transformants exhibited a reddish-brown wood discoloration (Figure 1F) known

58

to be associated with lignin modification (Zhou et al., 2015). As a further test, AGEseq was applied to

59

amplicon data of soybean with DDM1 (Decrease in DNA Methylation) editing in one or two

60

homoeologous loci as described in Jacobs et al. (2015). The editing patterns detected by AGEseq were

61

consistent with those obtained by Geneious R7 (Biomatters Ltd.) used in that study, ranging from small

62

indels (10 nt), with varying (1-98%) editing efficiencies (Supplemental Table 1)

63

(Jacobs et al., 2015). AGEseq flags events with a long stretch of indels and/or mismatches as “strange

64

events” that require manual examination, and three such cases were identified. Manual inspection

65

confirmed a large (44 nt) deletion in one case, while the other two were found by Jacobs et al. (2015) to

66

harbor unusual insertions from the A. rhizogenes root-inducing plasmid after additional cloning and

67

sequencing. The results demonstrate the versatility of AGEseq in detecting or flagging genome editing

68

patterns across a wide range of data scenarios.

69

Detailed instructions of AGEseq are provided for all operating systems (Supplemental Text 1). The

70

analysis sensitivity can be adjusted by two user-configurable parameters: mismatch allowance (default

71

at 10%) and minimum read coverage (default at 0). Systematic errors introduced during amplicon

72

library preparation and sequencing that involve PCR, or by base-calling algorithms are common in deep-

73

sequencing data, and they will appear as “SNPs” in the AGEseq report (Figure 1C and Supplemental Text

74

1). For this reason, AGEseq considers indels as potential genome editing events by default, although

AC C

EP

TE D

M AN U

SC

RI PT

44

ACCEPTED MANUSCRIPT

SNPs are also reported. If SNPs are of interest, setting a minimum read coverage is recommended to

76

reduce random errors. A known limitation with BLAT and similar aligners is their inconsistent gap-

77

handling in the presence of homo-nucleotides, as shown for both 4CL1 alleles in Figure 1C (red boxes, 1-

78

nt deletion at position 56 or 57). AGEseq does not consider these differences and reports, by default,

79

the sum of all indel reads as well as WT-like (non-edited) reads from each sample in the summary

80

(Figure 1D). User inspection is therefore recommended. As mentioned, AGEseq also facilitates

81

identification of unusual events that require manual inspection, and sometimes follow-up experiments

82

to confirm the editing patterns.

83

The ability of AGEseq to effectively discriminate allelic sequences of duplicated genes suggests that it

84

can support analysis with polyploid genomes. When only one reference sequence is provided, the

85

AGEseq output can be mined for allelic variations, if any, in the target region. As a standalone software,

86

AGEseq is 1) easy to use—no command line or programming skill is required for Windows or Mac users;

87

2) versatile—multiple sequencing platforms and file types are supported for assessing genome editing,

88

allelic variation and/or off-target cleavage; and 3) extensible—the Perl script can easily be imported to

89

other bioinformatics pipelines. As an example, we adapted AGEseq as a utility in the Galaxy platform

90

(Goecks et al., 2010) to support web-based analysis. It is accessible at AspenDB

91

(http://aspendb.uga.edu/ageseq) or through Galaxy Tool Shed (https://toolshed.g2.bx.psu.edu) for

92

installation in local instances. A limitation of the webtool is that only one sequence read file can be

93

processed at a time. For a multiplexed dataset with a large number of samples, the use of the

94

standalone AGEseq program is recommended. Although developed for genome editing analysis, AGEseq

95

can be adapted for SNP genotyping, metagenomic analysis or other amplicon sequencing applications.

96

Author Contributions

97

L.X. and C.T. designed the software and wrote the paper, L.X. wrote the software.

98

Acknowledgments

99

This work was supported by the Georgia Research Alliance-Haynes Forest Biotechnology endowment.

AC C

EP

TE D

M AN U

SC

RI PT

75

100

The authors thank Thomas Jacobs and Xiaohong Zhou for providing amplicon sequences used in the

101

analysis, Camille Annan, Xi Gu, Swarnali Louha, Xiaohan Mei, and Elizabeth Trippe for testing the

102

software, Mohammad Mohebbi for developing redirecting website link to Galaxy/TsaiLab, and Scott

103

Harding for critical review of the manuscript.

104

ACCEPTED MANUSCRIPT

105

Competing interests

106

The authors declare no competing interests.

RI PT

107 Figure Legend

109

Figure 1. The workflow and output of AGEseq.

110

(A) Experimental considerations (left panel) and data analysis process (right panel) of AGEseq.

111

Amplicons derived from different alleles and/or homologous genes can be obtained using degenerate

112

primers. Amplicons from different samples or unrelated genes can be indexed, normalized and pooled

113

for sequencing. AGEseq supports multiple sequencing platforms and file types, and calls BLAT for

114

sequence alignment against user-supplied reference sequences. AGEseq then identifies indels and SNPs

115

to report genome editing events.

116

(B) Example of the design file. Multiple genes and alleles can be included.

117

(C) Example of an AGEseq output file with annotated genome editing events. Results are populated in

118

eight columns as follows. A, input file name; B, gene/allele name; C-D, target sequence (C) from the

119

design file that best matches the amplicon sequence in D; E, number of reads shown in D; F-G, target (F)

120

and read (G) sequences displayed according to their alignment. Dashes are introduced to denote indels

121

between the pair. “B” and “E” in F and G denotes beginning and end of the alignment; H, patterns of

122

indels, if any, denoted by position (first integer) of insertion (I) or deletion (D), followed by the number

123

(second integer) of nucleotides affected; I, patterns of SNPs, if any, denoted by the nucleotide

124

position(s). By default, only indels are considered as genome editing events, as SNPs may arise from

125

sequence errors introduced by PCR during library preparation and/or sequencing, or by base-callers. For

126

illustration purpose, only representative events with highest and lowest read support are shown for

127

each allele. Red boxed areas are examples of alignment artifacts that warrant manual inspection. Each

128

analysis group (one allele from one sample) ends with read statistics that include “total reads” in the

129

input file, “total hits” matching all target sequences in the design file, “sub hits” matching the specific

130

target, and all “indel hits” for the given target. See Supplementary Text for further explanation.

131

(D) Summary of AGEseq analysis is provided for each sample. Columns A-G are as above. Column H

132

(indel or WT rate %) is calculated as the fraction of “indel or WT hits” over “sub hits”. Column I denotes

AC C

EP

TE D

M AN U

SC

108

ACCEPTED MANUSCRIPT

the editing patterns as + (insertion) or – (deletion) with the number of nucleotides affected, or WT-like.

134

See Supplementary Text for further explanation.

135

(E) The summary information from (D) can be easily converted to a table suitable for publication. Allelic

136

detection frequencies (% Reads) were recalculated by dividing “indel hits” of each allele over the sum of

137

the “sub hits” from both alleles. For off-target assessment, it is more appropriate to report the

138

predominant, unedited (WT-like) event, as shown for the paralogous 4CL5.

139

(F) The wood discoloration phenotype of line 61 due to biallelic mutations in 4CL1 that affect lignin

140

biosynthesis (from Zhou et al., 2015). Scare bar is 1 cm.

141

Reference

142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170

Belhaj, K., Chaparro-Garcia, A., Kamoun, S., and Nekrasov, V. (2013). Plant genome editing made easy: targeted mutagenesis in model and crop plants using the CRISPR/Cas system. Plant Methods 9:39. Cermak, T., Doyle, E.L., Christian, M., Wang, L., Zhang, Y., Schmidt, C., Baller, J.A., Somia, N.V., Bogdanove, A.J., and Voytas, D.F. (2011). Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting. Nucleic Acids Res 39:e82. Crowe, M.L. (2005). SeqDoC: rapid SNP and mutation detection by direct comparison of DNA sequence chromatograms. BMC Bioinformatics 6:133. Dmitriev, D.A., and Rakitov, R.A. (2008). Decoding of Superimposed Traces Produced by Direct Sequencing of Heterozygous Indels. Plos Comput Biol 4:e1000113. Goecks, J., Nekrutenko, A., Taylor, J., and Team, T.G. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11:R86. Guell, M., Yang, L.H., and Church, G.M. (2014). Genome editing assessment using CRISPR Genome Analyzer (CRISPR-GA). Bioinformatics 30:2968-2970. Jacobs, T.B., LaFayette, P.R., Schmitz, R.J., and Parrott, W.A. (2015). Targeted genome modifications in soybean with CRISPR/Cas9. BMC Biotechnology 15:16. Lei, Y., Lu, L., Liu, H.Y., Li, S., Xing, F., and Chen, L.L. (2014). CRISPR-P: A Web Tool for Synthetic SingleGuide RNA Design of CRISPR-System in Plants. Mol Plant 7:1494-1496. Lozano-Juste, J., and Cutler, S.R. (2014). Plant genome engineering in full bloom. Trends Plant Sci 19:284-287. Ma, X., Chen, L., Zhu, Q., Chen, Y., and Liu, Y.-G. (2015). Rapid decoding of sequence-specific nucleaseinduced heterozygous and biallelic mutations by direct sequencing of PCR products. Molecular Plant:10.1016/j.molp.2015.1002.1012. Xie, K.B., Minkenberg, B., and Yang, Y.N. (2015). Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. PNAS 112:3570-3575. Zhou, X., Jacobs, T.B., Xue, L.-J., Harding, S.A., and Tsai, C.-J. (2015). Exploiting SNPs for biallelic CRISPR mutations in the outcrossing woody perennial Populus reveals 4-coumarate:CoA ligase specificity and redundancy. New Phytol:doi:10.1111/nph.13470.

171

AC C

EP

TE D

M AN U

SC

RI PT

133

Amplicon primer design considerations • Degeneracy to amplify both alleles, if heterozygous • Degeneracy to amplify homologous genes, if desired Barcoding & multiplexing • Pooling across amplicons and/or samples Reference sequence extraction (see B) • Reference sequences of both alleles, if known • Reference sequences of homologous genes, if any

AGEseq

B. Design file

ACCEPTED MANUSCRIPT Input sequence data parsing

• Type: fa, fas, fasta, fastq, fq, seq, or txt • Single‐end or paired‐end Alignment of read and target sequences • Blat search and read assignment Detection of editing events • Indels or SNPs • Read counting

Sequence

4CL1_1

CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGG AGGGGCTCCATTG

4CL1_2

CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGG AGGGGCTCCATTG

4CL5_1

CCAAGTCACCCGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGG AGGGTCGCCATTG

4CL5_2     CCAAGTCACCTGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGG AGGGTCGCCGTTG

AC C

D. Output file (summary)

EP

TE D

M AN U

SC

C. Output file (all events)

Target

RI PT

A.                             Pre‐AGEseq

E.

Plant #

Target

Sequence

Total Read #

% Reads

Indels

WT

4CL1

CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGGAGGGGCTCCATTG CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCTGGAGGGGCTCCATTG

‐

‐

‐

61

4CL1

CTAAGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCT-GAGGGGCTCCATTG CTAGGTCACCTGATCTTGACAAGCATGACTTGTCTTCTTTGAGGATGATAAAATCT-GAGGGGCTCCATTG

2531 2580

49% 50%

‐1 ‐1

61

4CL5

CCAAGTCACCCGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGGAGGGTCGCCATTG CCAAGTCACCTGATCTTGATAAACATGACTTGTCTTCGTTGAGGATGTTGAAGTCTGGAGGGTCGCCGTTG

2102 2379

47% 53%

‐ ‐

F.

Figure 1. The workflow and output of AGEseq. (A) Experimental considerations (left panel) and data analysis process (right panel) of AGEseq. Amplicons derived from different alleles and/or homologous genes can be obtained using degenerate primers.  Amplicons from different samples or unrelated genes can be indexed, normalized and pooled for sequencing. AGEseq supports multiple sequencing platforms and file types, and calls BLAT for sequence alignment against user‐supplied reference sequences. AGEseq then identifies indels and SNPs to report genome editing events. (B) Example of the design file.  Multiple genes and alleles can be included. (C) Example of an AGEseq output file with annotated genome editing events.  Results are populated in eight columns as follows. A, input file name; B, gene/allele name; C‐D, target sequence (C) from the design file that best matches the amplicon sequence in D; E, number of reads shown in D; F‐G, target (F) and read (G) sequences displayed according to their alignment. Dashes are introduced to denote indels between the pair. “B” and “E” in F and G denotes beginning and end of the alignment; H, patterns of indels, if any, denoted by position (first integer) of insertion (I) or deletion (D), followed by the number (second integer) of nucleotides affected; I, patterns of SNPs, if any, denoted by the nucleotide position(s). By default, only indels are considered as genome editing events, as SNPs may arise from sequence errors introduced by PCR during library preparation and/or sequencing, or by base‐callers.  For illustration purpose, only representative events with highest and lowest read support are shown for each allele.  Red boxed areas are examples of alignment artifacts that warrant manual inspection.  Each analysis group (one allele from one sample) ends with read statistics that include “total reads” in the input file, “total hits” matching all target sequences in the design file, “sub hits” matching the specific target, and all “indel hits” for the given target. See Supplementary Text for further explanation. (D) Summary of AGEseq analysis is provided for each sample. Columns A‐G are as above.  Column H (indel or WT rate %) is calculated as the fraction of “indel or WT hits” over “sub hits”.  Column I denotes the editing patterns as + (insertion) or – (deletion) with the number of nucleotides affected, or WT‐like.  See Supplementary Text for further explanation. (E) The summary information from (D) can be easily converted to a table suitable for publication. Allelic detection frequencies (% Reads) were recalculated by dividing “indel hits” of each allele over the sum of the “sub hits” from both alleles.  For off‐target assessment, it is more appropriate to report the predominant, unedited (WT‐like) event, as shown for the paralogous 4CL5. (F) The wood discoloration phenotype of line 61 due to biallelic mutations in 4CL1 that affect lignin biosynthesis (from Zhou et al., 2015). Scare bar is 1 cm.

Cas9 gene editing in butterflies.

Quantifying genome-editing outcomes at endogenous loci with SMRT sequencing.

Next-Generation Sequencing and Genome Editing in Plant Virology.

RNA over-editing of BLCAP contributes to hepatocarcinogenesis identified by whole-genome and transcriptome sequencing.

Multiplex genome editing by natural transformation.

Precise genome editing by homologous recombination.

Cas9.

Genome-wide analysis of differential RNA editing in epilepsy.

Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing.

Easy quantitative assessment of genome editing by sequence trace decomposition.

Simple Genome Editing of Rodent Intact Embryos by Electroporation.

Optimizing cancer genome sequencing and analysis.

Whole-genome sequencing in outbreak analysis.

High-throughput genome editing and phenotyping facilitated by high resolution melting curve analysis.

Correction: High-throughput genome editing and phenotyping facilitated by high resolution melting curve analysis.

Genome Editing in Large Animals.

Genome editing-based HIV therapies.

Whole-organism lineage tracing by combinatorial and cumulative genome editing.

Cas9.

Bacterial cellular engineering by genome editing and gene silencing.

Cas9 genome editing.

Genome editing: The domestication of Cas9.

Ethical issues raised by whole genome sequencing.

Genome walking by next generation sequencing approaches.