Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia.

Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia James S. Blachlya, Amy S. Rupperta, Weiqiang Zhaob, Susan Longb, Joseph Flynna, Ian Flinnc, Jeffrey Jonesa, Kami Maddocksa, Leslie Andritsosa, Emanuela M. Ghiad, Laura Z. Rassentid, Thomas J. Kippsd, Albert de la Chapellee,1, and John C. Byrda,f,1 a Division of Hematology, Department of Internal Medicine, The Ohio State University, Columbus, OH 43210; bDepartment of Pathology, The Ohio State University, Columbus, OH 43210; cHematologic Malignancies Research Program, Sarah Cannon Research Institute, Nashville, TN 37203; dMoores Cancer Center, University of California at San Diego, La Jolla, CA 92092-0820; eDepartment of Molecular Virology, Immunology, and Medical Genetics, The Ohio State University, Columbus, OH 43210; and fCollege of Pharmacy, The Ohio State University, Columbus, OH 43210

Contributed by Albert de la Chapelle, February 24, 2015 (sent for review January 28, 2015; reviewed by Megan S. Lim and Adrian Wiestner)

Immunoglobulins (Ig) are produced by B lymphocytes as secreted antibodies or as part of the B-cell receptor. There is tremendous diversity of potential Ig transcripts (>1 × 1012) as a result of hundreds of germ-line gene segments, random nucleotide incorporation during joining of gene segments into a complete transcript, and the process of somatic hypermutation at individual nucleotides. This recombination and mutation process takes place in the maturing B cell and is responsible for the diversity of potential epitope recognition. Cancers arising from mature B cells are characterized by clonal production of Ig heavy (IGH@) and light chain transcripts, although whether the sequence has undergone somatic hypermutation is dependent on the maturation stage at which the neoplastic clone arose. Chronic lymphocytic leukemia (CLL) is the most common leukemia in adults and arises from a mature B cell with either mutated or unmutated IGH@ transcripts, the latter having worse prognosis and the assessment of which is routinely performed in the clinic. Currently, IGHV mutation status is assessed by Sanger sequencing and comparing the transcript to known germ-line genes. In this paper, we demonstrate that complete IGH@ V-D-J sequences can be computed from unselected RNA-seq reads with results equal or superior to the clinical procedure: in the only discordant case, the clinical transcript was outof-frame. Therefore, a single RNA-seq assay can simultaneously yield gene expression profile, SNP and mutation information, as well as IGHV mutation status, and may one day be performed as a general test to capture multidimensional clinically relevant data in CLL. RNA sequencing

experiment could replace many other discrete tests (qPCR, genotyping, microarray, IGHV mutation analysis, etc.), we hypothesized that in the presence of a clonal B-cell population, patientspecific or consensus degenerate primers and a dedicated sequencing experiment were not necessary to fully characterize the clonal IGH@ transcript. Here, using the “Ig-ID” pipeline we developed, we demonstrate that Ig heavy chain transcripts, including, critically, the complete V-D-J sequence, can be computed from unselected (i.e., using standard random hexamer priming vice IGH@-targeting primers; ref. 5) RNA-sequencing reads from CLL patient tumor cells. These computed transcripts matched those obtained from a CLIA-approved clinical laboratory with high concordance, in some cases uncovering possible misamplification in the traditional approach. Results Seventeen CLL patient tumor samples with clinical data available were subjected to RNA sequencing (Materials and Methods) in a pilot study. First, to establish that traditional mapping strategies inadequately handle the complex germ-line rearrangements Significance IGHV mutation status is a well established prognostic factor in chronic lymphocytic leukemia, and also provides crucial insights into tumor cell biology and function. Currently, determination of IGHV transcript sequence, from which mutation status is calculated, requires a specialized laboratory procedure. RNA sequencing is a method that provides high resolution, high dynamic range transcriptome data that can be used for differential expression, isoform discovery, and variant determination. In this paper, we demonstrate that unselected nextgeneration RNA sequencing can accurately determine the IGH@ sequence, including the complete sequence of the complementarity-determining region 3 (CDR3), and mutation status of CLL cells, potentially replacing the current method which is a specialized, single-purpose Sanger-sequencing based test.

| immunoglobulin | somatic hypermutation | B cells | CLL

I

mmunoglobulins (Igs) are proteins produced by mature B-lymphocytes that recognize foreign antigens, both as soluble antibody molecules and as part of the B-cell receptor. The generation of Ig diversity through gene recombination and hypermutation of the heavy chain (H) variable region (V) is essential to adaptive immunity. The extent of this process is strongly associated with both pathology and prognosis in chronic lymphocytic leukemia (CLL), wherein CLL that expresses an unmutated IGHV tends to be more aggressive than CLL using unmutated IGHV (1, 2). The accurate assessment of this IGHV mutation status is thus of a high clinical priority. As each patient’s leukemia generally expresses only a single IGH@, the mutation status of IGHV is determined by amplifying the expressed transcript via RT-PCR, sequencing the gene via the Sanger technique, and then comparing this sequence with known inherited IGHV sequences. However, there are limitations to such methods, including variation in technique across institutions. RNA-sequencing is a powerful technology that can simultaneously yield information about gene and isoform expression as well as underlying DNA sequence (3, 4). Motivated by the notion that a single RNA sequencing

4322–4327 | PNAS | April 7, 2015 | vol. 112 | no. 14

Author contributions: J.S.B. and J.C.B. designed research; J.S.B., W.Z., S.L., E.M.G., and L.Z.R. performed research; J.F., I.F., J.J., K.M., L.A., and A.d.l.C. contributed new reagents/analytic tools; J.S.B. and A.S.R. analyzed data; J.S.B., T.J.K., A.d.l.C., and J.C.B. wrote the paper; and J.F., I.F., J.J., K.M., L.A., and T.J.K. contributed patient material. Reviewers: M.S.L., University of Michigan; and A.W., Hematology Branch, National Heart, Lung, and Blood Institute. The authors declare no conflict of interest. Data deposition: The sequences reported in this paper have been deposited in the Gene Expression Omnibus database (accession no. GSE66228). 1

To whom correspondence should be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1503587112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1503587112

in the IGH@ locus (Fig. S1), we performed a genome-wide splicedmapping and examination of IGH@. On average, ∼1% of all reads mapped to this region (Table S1). However, of reads mapping to this region, 21–48% could not be clearly and unambiguously assigned to a single feature in the region; this total also reflects the few nonimmunoglobulin features annotated in this region. Using a naïve classifier frequently used for digital gene expression estimates, we counted the number of unambiguously mapped reads at V, D, and J genes. In each case, a V gene clearly emerged with the highest read count. In contrast, the D and J genes could not reliably be determined by simple counting, due either to the lack of a clear consensus highest mapping or the complete absence of mappings (Fig. S2). The identity of the V gene with the highest mapping was compared with the clinical (and later computed) V gene reported, and showed 94% and 100% concordance, respectively. The D and J genes with highest counts were not as informative. Likewise, neither mutation status, nor Ig nucleotide or translated peptide sequence could be obtained directly from these mapped data, indicating the need for an alternative method to correctly identify these genes. Ig Transcript Reconstruction. Next, using a genome-free method (Materials and Methods), each sample’s transcriptome was reconstructed de novo in the way the transcriptome of a poorly characterized or nonmodel organism would be (6), except that each sample was processed independently (whereas for transcriptome discovery samples are typically pooled before assembly), yielding ∼3 million putative transcripts per sample. Somatic mutations in heterogeneous cancer cell populations and assembly graph discontiguities in areas of low coverage inflated the size of the reconstructed transcriptome beyond the recently-estimated 196,520 human transcripts (7, 8). After selecting for transcripts bearing sequence homology to IGHV genes, between 6 and 43 transcripts remained. This diversity reflected in part minor populations of B cells present in the sequenced sample, but in some samples several closely related transcripts with identical V-D-J sequence (e.g., with/without poly-A tails; transcript reverse complements) were represented as distinct transcripts, also increasing this number. Kappa and lambda light chain transcripts are also frequently recovered at

this step, depending on their homology to heavy chain V genes. Light chain transcripts may also be targeted directly at this selection step by altering the homology affinity selector from heavychain V genes to light-chain V genes. Next, multiply-mapped reads were disambiguated and the transcript with the highest read count was determined. Likeliest V, D, and J alleles as well as percent identity to these references were calculated with IgBLAST (9). The V, D, J gene use and percent mutation (1 – identity) were then compared with available clinical data (Table 1). The Percent Mutation Is Similar and the Binary Classifier Mutated/ Unmutated Is Perfectly Concordant. Seventeen sequenced samples

with percent IGHV mutation (as determined by clinical laboratory test) recorded in the medical record were evaluated through our “Ig-ID” pipeline. The actual percent mutation obtained from Ig-ID was identical to the results provided by the clinical laboratory in 11 of 17 cases, with percentages within 1% for 5 cases, and within 2% for 1 case. Fig. 1 illustrates the substantial concordance between the computed and clinical results (Pearson’s r 0.992, 95% CI 0.976–0.998; concordance index 0.988, 95% CI 0.968–0.996), with differences between the two measures for each case versus their average value depicted in the form of a Bland–Altman plot. When samples were classified as mutated (M-CLL) or unmutated (U-CLL) according to the established 98% identity cutoff (10), unmutated cases comprised 12 of the 17 cases (71%), whereas mutated cases numbered 5 (29%). Within the unmutated subgroup, the percent IGHV mutation was identical in 8 (67%), and differences were within 1% in the other 4 cases (mean squared error of 0.216). Within the mutated subgroup, the percent IGHV mutation was identical in 3 (60%), with the two nonmatching cases differing by 0.4% and 1.7% (Table 1). Critically, the classification of IGHV mutated versus unmutated using RNA-seq data with the Ig-ID pipeline was perfectly concordant with reported clinical data in all 17 cases (Table 1 and Table S2). Discrepancies in Percent Mutation Are Due to Sequence Length. To explore reasons for discrepant percent mutation, we compared RNA-seq data with transcripts from the original clinical laboratory procedure. Of 17 samples, 12 had PCR-amplified, Sanger-sequenced

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Table 1. Comparison of methods

Concordance of V gene prediction and percent mutation between clinical laboratory and Ig-ID in the pilot set. Italic and bold cells indicate a mismatch with respect to clinical data: blue, samples US-1422368 and US-1422309 were identified as mislabeled (Table S3); red, the Sanger sequence from the clinical laboratory was out-of-frame and would not produce functional Ig.

Blachly et al.

PNAS | April 7, 2015 | vol. 112 | no. 14 | 4323

B Difference (Ig-ID - Sanger)

Percent IGHV mutation by Ig-ID

A

Percent IGHV mutation by Sanger sequencing

Average, Ig-ID and Sanger

Fig. 1. Comparison of Ig-ID computational pipeline values and clinical laboratory methods using PCR amplification in the pilot set. Five patients were zero percent mutated by both methods. (A) Scatter plot with identity line, correcting for samples US-1422368 and US-1422309. Dashed lines at 2% represent standardized cutoff for the mutated/unmutated classifier. (B) Bland–Altman plot of the same data with a continuous line of zero difference and dashed lines for the estimated mean difference ± 2 SDs.

transcripts available for comparison. The supplied transcripts were input into IgBLAST and the output was confirmed to match that reported in the medical record. For samples with percent mutation discrepancy between the clinical laboratory and the Ig-ID pipeline, computed and empirical transcripts were aligned to one another. Examining alignments from the most discrepant case (US-1422335; 10.2% with Sanger sequencing versus 8.5% with Ig-ID), revealed that the higher percent mutation in the Sanger-sequenced sample was due to

a smaller amplified PCR product (Fig. 2). This is an inescapable consequence of occasionally difficult amplification with leader-region primers, necessitating the use of amplification primers within the framework region of the V gene that lead to shortening of the resultant transcript. Although in this case both mutation percentages were above 2%, and hence clearly classified as mutated (10), this important consequence indicates that cases nearer the 2% threshold are at risk for being called incorrectly when framework primers must be used.

Fig. 2. Comparison of Sanger sequencing and Ig-ID. Percent mutation calculation is ideally performed for all nucleotides within the highlighted region. Amplification of clonal transcripts often requires use of framework region primers, with the result that the entire V gene is not amplified and sequenced. The side-effect is that the denominator in the identity calculation is smaller; this may inflate the percent mutation compared with analysis of the full-length transcript. In this case, the reported percent mutation was 10.2%, whereas the Ig-ID calculated percent mutation was 8.5%.

4324 | www.pnas.org/cgi/doi/10.1073/pnas.1503587112

Blachly et al.

Ig-ID Can Identify Mislabeled Samples. After V-D-J use and mutation status had been determined for all samples, the results of the 17 that had IGHV documentation in the medical record were directly compared. Two samples with consecutive study numbers matched exactly (including percent mutation) if the RNA-seq sample identifiers were switched (Table S3). All study numbers and medical record numbers from multiple sources matched; the RNA-seq identifier from the outside sequencing facility was the sole source of the mismatch. Identification of V Gene Use. Adjusting for the mislabeled samples, 16 of 17 samples (94%) matched the predicted V gene use reported in the medical record. One sample (US-1422294) was predicted to use IGHV3-74*01 with 100% identity to reference, but was reported in the medical record to have used IGHV3-9 with 100% identity. However, when the actual transcript sequence from the clinical laboratory was examined, its V-J was out of frame; thus, this transcript would not yield a translated, functional Ig molecule. This erroneous transcript was likely the result of errant priming during the PCR step. If this case of misamplification is removed from analysis, Ig-ID matches the clinical laboratory’s V gene prediction in 16 of 16 (100%) cases. Identification of Complete V-D-J Sequence. Identification of the V gene and [potentially] mutated V-gene sequence is adequate for clinical determination of hypermutation status, but is not sufficient on its own to fully characterize the peptide sequence of the BCR or antibody molecule’s antigen recognition region. The complementarity-determining region (CDR) 3, which shares responsibility for antigen recognition with CDR1 and CDR2, lies only partly in the V gene and is more largely made up of N-junctional diversity sequence and the D gene (Fig. 3). Sanger sequencing, if using a reverse primer in the J gene, can recover this sequence. Iglesia et al. computed V genes, including mutated sequence, from polyA/mRNA-seq data, but this method was unable to cross the V-D boundary and recover the CDR3 sequence (11). In addition to correctly determining the germ-line V gene present in the transcript and its exact sequence inclusive of hypermutation, the Ig-ID procedure recovers the entire V-D-J sequence; reconstruction of the entire V-D-J sequence and thus definition of the complete CDR3 is important to study BCR specificity and stereotypy. Available clinical laboratory Sanger sequences were compared with Ig-ID computed transcripts with

pairwise alignment by the Smith-Waterman algorithm (12) with representative results illustrated in Fig. 3B. Ig-ID Is Robust to Shorter Sequencing Length. Because optimal RNAsequencing experiment parameters are unknown, we sought to confirm Ig-ID in an independent validation set, reducing the read length, a key sequencing parameter. In the validation set, we examined five CLL patient samples and one CLL cell line that had undergone paired-end sequencing with a read length of 50 nucleotides (versus 91 for the pilot set). All six cases had both RNAsequencing and clinical IGHV results available. All six computed clonal transcripts matched by V gene, and four of six matched precisely the percent mutation, whereas in two cases the percent mutation differences were 1.4% and 0.8%. Overall, this suggests that 50-nt paired-end RNA-seq is also able to identify IGH@ transcripts, a savings in time and expense over 100-nt sequencing. Ig-ID Can Detect Biallelic and Second-Clone IGH@ Transcripts. Multiple clonal immunoglobulins can be found in CLL patients due either to a lack of allelic exclusion by a single neoplastic clone, ongoing mutation of the IGH@ locus as a single clone diverges, or due to CLL clones that have neoplastically arisen independent of one another (“biclonal”) (13, 14). After initial comparison of computed data with data from the medical record, one record was identified wherein three transcripts were reported. In this case, the two highest computed transcripts corresponded to the first two reported transcripts in the medical record. The third reported transcript was not detected.

Discussion The prognostic importance of somatic hypermutation within the Ig heavy chain variable region (IGHV) in CLL has been appreciated for well over a decade (1, 2). More recently, specific V gene use has been found to be informative as well. CLL cells are known to have a different distribution of V gene use compared with normal B cells (15). Furthermore, use of certain V genes (e.g., V1-69, V3-21) or even more specific restricted combinations of V(D)J and light chains are independent poor prognostic factors (16, 17), with specific types having distinct gene expression patterns (18). It is postulated that this stereotypy results in surface BCR with specific affinity for immunogen capable of tonically stimulating the BCR, thereby promoting cancer cell survival

A

CDR3 FW1

FW2

FW3

V

N

D

N

J

constant region

B

CDR3 < FW3

Sanger

Rfv3 during retroviral infection.

Chronic myeloid leukemia and chronic lymphocytic leukemia.

Immunoglobulin replacement in patients with chronic lymphocytic leukemia (CLL): kinetics of immunoglobulin metabolism.

Hsp90 inhibition increases SOCS3 transcript and regulates migration and cell death in chronic lymphocytic leukemia.

High-throughput sequencing using the Ion Torrent personal genome machine for clinical evaluation of somatic hypermutation status in chronic lymphocytic leukemia.

SnapShot: chronic lymphocytic leukemia.

Chronic lymphocytic leukemia.

Lenalidomide and chronic lymphocytic leukemia.

Prognosis in Chronic Lymphocytic Leukemia.

Immunoglobulin genes in chronic lymphocytic leukemia: key to understanding the disease and improving risk stratification.

Surface immunoglobulin density in the differential diagnosis of B-cell chronic lymphocytic leukemia and leukemic immunocytoma.

Targeting of somatic hypermutation by immunoglobulin enhancer and enhancer-like sequences.

Mutation status and immunoglobulin gene rearrangements in patients from northwest and central region of Spain with chronic lymphocytic leukemia.

Somatic MED12 mutations are associated with poor prognosis markers in chronic lymphocytic leukemia.

Application of a Novel "Pan-Genome"-Based Strategy for Assigning RNAseq Transcript Reads to Staphylococcus aureus Strains.

In situ studies of the antigen-driven somatic hypermutation of immunoglobulin genes.

Chemoimmunotherapy for chronic lymphocytic leukemia.

Management of chronic lymphocytic leukemia.

Acrocyanosis revealing chronic lymphocytic leukemia.

Obinutuzumab for chronic lymphocytic leukemia.

Preferential linkage of bcl-2 to immunoglobulin light chain gene in chronic lymphocytic leukemia.

[Soft palate chronic lymphocytic leukemia].

Chronic lymphocytic leukemia: therapeutic options.

Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data.