
Article

Proteomic detection of immunoglobulin light chain variable region peptides from amyloidosis patient biopsies Surendra Dasari, Jason D Theis, Julie A Vrana, Oana M Mereuta, Patrick Quint, Prasuna Muppa, Roman M. Zenka, Renee C Tschumper, Diane F Jelinek, Jaime I Davila, Vivekananda Sarangi, Paul J Kurtin, and Ahmet Dogan J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00015 • Publication Date (Web): 03 Mar 2015 Downloaded from http://pubs.acs.org on March 19, 2015




Journal of Proteome Research


Proteomic detection of immunoglobulin light chain variable region peptides from amyloidosis patient biopsies

Surendra Dasari (1), Jason D. Theis (2), Julie A. Vrana (2), Oana M. Mereuta (2,3), Patrick S. Quint (2), Prasuna Muppa (2), Roman M. Zenka (4), Renee C. Tschumper (5), Diane F. Jelinek (5), Jaime I. Davila (1), Vivekananda Sarangi (1), Paul J. Kurtin (2,6) and Ahmet Dogan (2,3)

1. Department of Health Sciences Research, Mayo Clinic, Rochester, MN
2. Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN
3. Present address: Department of Pathology, Memorial Sloan-Kettering Cancer Center, New York, NY
4. Information Technology Administration, Mayo Clinic, Rochester, MN
5. Department of Immunology and Division of Hematology, Mayo Clinic, Rochester, MN
6. Corresponding Author, E-mail: [email protected], Tel: 507-284-9725

Keywords Amyloidosis, immunoglobulin, light chain, variable gene, variable gene family, peptides, proteomics, bioinformatics

Abstract

Immunoglobulin light chain (LC) amyloidosis (AL) is caused by deposition of clonal LCs produced by an underlying plasma cell neoplasm. The clonotypic LC sequences are unique to each patient and they cannot be reliably detected by either immunoassays or standard proteomic workflows that target the constant regions of LCs. We addressed this issue by developing a novel sequence template-based workflow to detect LC variable (LCV) region peptides directly from AL amyloid deposits. The workflow was implemented in a CAP/CLIA compliant clinical laboratory dedicated to proteomic subtyping of amyloid deposits extracted from either formalin-fixed paraffin-embedded tissues or subcutaneous fat aspirates. We evaluated the performance of the workflow on a validation cohort of 30 AL patients, whose amyloidogenic clone was identified using a novel proteogenomics method, and 30 controls. The recall and negative predictive value of the workflow, when identifying the gene family of the AL clone, were 93% and 98%, respectively. Application of the workflow to a clinical cohort of 500 AL amyloidosis samples highlighted a bias in the LCV gene families used by the AL clones. We also detected similarity between AL clones deposited in multiple organs of systemic AL patients. In summary, AL proteomic data sets are rich in LCV region peptides of potential clinical significance that are recoverable with advanced bioinformatics.

Introduction

Immunoglobulin (Ig) light chain (LC) amyloidosis (AL) is characterized by an abnormal extracellular deposition of a monoclonal Ig LC protein of either lambda (λ) or kappa (κ) isotype. The deposited LC variable region (LCV) is clonotypic and unique to each patient, and is thought to be the primary pathogenic driver of the disease. Several pioneering studies utilized LC mRNA sequencing of bone marrow neoplastic plasma cells to establish the LCV gene usage of AL clones [1-6] and have suggested that the LCV gene used by the neoplastic plasma cell clone may determine the clinical features of AL amyloidosis such as organ involvement and systemic or localized presentation. However, most of these studies were restricted to small cohorts ( 0.9) and more than five spectral matches were considered for clinical interpretation. For every case, we created a clinical proteomics profile that lists all of the confident protein identifications present in each replicate dissection with their respective spectral counts. A licensed pathologist scrutinized the detected amyloid proteome for the presence of a universal amyloid molecular signature (APOE, SAP and APOA4) [11] and constant regions of Ig κ or λ light chains. The pathologist finalized the AL subtype by correlating the clinical factors in the patient’s health record with the most abundant amyloidogenic protein that was consistently detected across all replicates.

Bioinformatic Detection of Light Chain Clones

Every LC clone is made by rearranging a variable (V) gene and a junction (J) gene at the DNA level, which is then spliced at the RNA level with either a κ or λ constant (C) region gene. Sequence diversity of the variable region (VJ) is further enhanced by junctional diversity and/or somatic hypermutation. Human κ LCs have approximately 76 V genes available and they are arranged into 7 families. Similarly, λ LCs have approximately 70 V genes arranged into 11 families. We developed two orthogonal workflows to identify the V gene or gene family of the deposited LC clone (Figure 1). The proteogenomics workflow (Figure 1A) started by sequencing the LC mRNA repertoire of the AL patient’s neoplastic plasma cells to determine the identities and sequences of all putative clones. These annotated mRNA sequences were translated into protein sequences and mapped to the patient’s amyloid MS/MS to determine the deposited amyloidogenic clone. In contrast, the proteomics workflow (Figure 1B) started by assembling a list of generic LCV (protein) sequences from public sources. These sequences act as anchoring templates against which patient MS/MS were matched using a database search-based strategy, a sequence tag-based search strategy, or a de novo sequence-based homology search strategy. The LCV genes or gene families that had the most MS/MS evidence were considered to represent the pathogenic LC.
The key difference between these two workflows is that the proteogenomics workflow relies on the availability of bone marrow material for generating LCV gene mRNA sequences whereas the proteomics workflow attempts to partially reconstruct the identity of the deposited LC clone using generic LCV region sequence templates at the protein level.

LCV mRNA Sequence Processing and Annotation. LC mRNA CCS reads shorter than the expected length of a mature V-J-C combination were removed. Each of the remaining reads was aligned against a database of “wild type” V, J and C human gene sequences of κ and λ isotype (version hg19). Variable and constant genes were aligned using BWASW [21] software. Junction gene sequences were aligned using BLASTALL [22] configured to require at least 90% sequence homology and at least 30 base pairs of sequence homology. CCS reads with an identifiable V-J-C combination and at least 10 subreads were considered for indel correction. Deletions were replaced with corresponding bases from the reference sequence and homopolymeric insertions were deleted. Corrected reads were clustered by sequence similarity (threshold of 100%) and a consensus sequence was generated for each cluster. Each consensus sequence was realigned to the V-J-C “wild type” sequences in order to detect the potential fusion boundaries between its constituent V, J and C regions. The raw CCS reads of the corresponding cluster were annotated with the identity of the LCV gene and the potential fusion boundaries. Annotated consensus CCS reads (mRNA) were stored in a database for reconstructing the protein sequences associated with a patient’s LC clonal population.

Deriving Patient Specific LCV Protein Sequences from mRNA Sequencing Data. We developed a custom algorithm in the Python programming language to translate the annotated raw CCS reads into putative LCV protein sequences. The script started by detecting the fusion boundaries between the V, J and C regions in each read using its annotation, and the constant region was trimmed. The V-J fusion boundary in the remaining sequence was combinatorially edited to remove up to one, two or three nucleotides on either side of the boundary, and a new read was generated for each edit.
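The boundary editing step can be illustrated with a short sketch. It assumes one plausible reading of the description: zero to three nucleotides are trimmed independently from each side of the annotated V-J boundary, and every combination yields one edited read. The function name, the 0-based `boundary` convention and the `max_trim` parameter are hypothetical and are not taken from the authors' script.

```python
from itertools import product


def edit_vj_boundary(read: str, boundary: int, max_trim: int = 3) -> list[str]:
    """Return copies of `read` with 0..max_trim nucleotides removed on each
    side of the V-J boundary.

    `boundary` is the 0-based index of the first J-region nucleotide
    (an assumed convention). The unedited read (no trimming) is included.
    """
    v_part, j_part = read[:boundary], read[boundary:]
    edited = []
    for left, right in product(range(max_trim + 1), repeat=2):
        # Skip trims that would consume an entire region.
        if left >= len(v_part) or right >= len(j_part):
            continue
        new_read = v_part[: len(v_part) - left] + j_part[right:]
        if new_read not in edited:  # different trims can yield the same read
            edited.append(new_read)
    return edited


# Toy read: "AAAACC" stands in for the V side, "GGTTTT" for the J side.
variants = edit_vj_boundary("AAAACCGGTTTT", boundary=6)
```

Each edited read would then go on to translation, so the number of candidates per read grows with `max_trim`; a small value keeps the search space manageable.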
Edited reads were six-frame translated into protein sequences and sequences resulting in gaps or premature stop codons were discarded. Remaining protein sequences were clustered by sequence similarity (threshold of 100%) and represented with a consensus protein sequence. Consensus LCV region protein sequences were annotated with LCV gene identity extracted from its constituent sequence annotations. A total of 91,446 unique LCV clonal sequences were generated from all patient samples.

Construction of LCV Generic Sequence Templates. Known LCV region sequences were collected from the ImMunoGeneTics database (www.imgt.org) [23], Boston University’s Amyloid Light Chain Database [24], and a Mayo Clinic internal LCV sequence database. Sequences shorter than 30 amino acids were discarded. Redundant sequences were clustered and represented with a single entry. Each sequence entry was annotated with the corresponding V gene name following the IMGT nomenclature rules [25]. This resulted in a total of 4,334 protein sequence templates mapping to LCV genes of λ, κ and γ isotypes.

Assembling Protein Sequence Databases for LCV Clone Sequencing and Identification. We created two distinct databases for matching the amyloid MS/MS spectra: “Patient Specific LCV Sequence Database” and “Generic LCV Sequence Template Database.” The first database was derived by combining a total of 111,692 protein sequences obtained from the SwissProt database (download date 08/2012) subselected for human species, common contaminants (proteolytic enzymes, cotton, keratins, etc.) and patient specific LCV sequences determined from bone marrow LC mRNA sequencing. The second database was similar to the first database except that the patient specific light chain clones were swapped with generic LCV sequence templates (number of sequences = 24,580). Reversed protein sequence entries were appended to both databases for estimating peptide and protein identification FDRs.

Patient Specific LCV Clonal Sequencing for “Validation Cohort” Dataset. The MyriMatch [26] database search engine matched the MS/MS of each AL patient against the “Patient Specific LCV Sequence Database” (Figure 1A). The software was configured to use either 10 ppm m/z tolerance for LTQ-Orbitrap precursors or 1.25 Da m/z tolerance for LTQ-Velos precursors.
A fragment tolerance of 0.5 Da m/z was used for peptide-spectrum matching. MyriMatch derived semitryptic peptides from the sequence database while looking for the following variable modifications: oxidation of methionine (+15.994 Da) and formation of N-terminal pyroglutamic acid (-17.023 Da). IDPicker [27, 28] filtered the peptide-spectrum matches (PSMs) at 2% FDR. The software was configured to use an optimal combination of MVH, mzFidelity and XCorr scores for filtering [27]. Protein identifications with at least two unique peptide identifications were considered to be present in the sample. Resulting proteins were clustered into groups of proteins that match the same set of peptides. The LCV gene identity of the most abundant (by MS/MS counts) protein group detected in a patient’s amyloid deposit was assumed to be the amyloidogenic clone. The LCV gene identity and the MS/MS of the amyloidogenic clone were used as a ground truth for characterizing the performance of the generic sequence template database-based search workflows.

Determining LCV Clonal Identity via Generic Sequence Template Database Searches. We evaluated three orthogonal workflows that use known LCV sequence templates to identify the gene and gene families of the amyloidogenic clone (Figure 1B). This template-based search approach relies on the fact that clones produced from the same LCV gene share common framework regions (Supplemental File 1). Peptides in these framework regions are also idiosyncratic to each LCV gene and act as unique peptides that distinguish between clones produced by different LCV genes (Supplemental File 1). Also, proteins in the amyloid deposits are in a degraded state. Hence, given enough sequence templates covering all LCV genes, any type of database search could recover LCV peptides present in the AL deposits. The database search workflow matched the patient’s MS/MS against the “Generic LCV Sequence Template Database” using three different database search engines: Sequest, X!Tandem and Mascot. These search engines were configured as described above.
Scaffold software filtered the results and proteins with at least one independent peptide identification (peptide probability > 0.9) were considered for LCV gene and gene family typing (Figure 1B).

The error-tolerant search workflow employed a sequence tagging-based approach to match the MS/MS against the template database. For this, DirectTag [29] software inferred partial sequence tags for all MS/MS spectra present in a raw file. The software was configured to derive the best 50 tags of three amino acids in length from each spectrum. TagRecon [30] software reconciled these tags against the sequence template database while allowing for single amino acid mutations. The software derived semitryptic peptides from the database and looked for the following variable modifications: oxidation of methionine (+15.9946 Da) and formation of N-terminal pyroglutamic acid (-17.023 Da). IDPicker filtered the peptide identifications at 2% FDR and assembled them into protein identifications as described above. Protein identifications with at least two unique peptide matches were considered for LCV gene and gene family typing.

The sequence homology search employed a de novo sequencing-based workflow to match the MS/MS against the template database. For this, Peaks [31] software (version 7, Bioinformatics Solutions Inc., Waterloo, Canada) was configured to derive de novo sequences assuming either 10 ppm mass error for LTQ-Orbitrap precursors or 1.0 Da mass error for LTQ-Velos precursors. A fragment mass error of 0.5 Da was assumed for all raw files. Derived sequences were matched against the sequence template database (containing no reversed sequence entries) using the PeaksPTM [32] module configured to derive semitryptic peptides from the database and look for the same variable modifications as described above. The modification panel was augmented with all possible mutations. Identified proteins were also interrogated for additional mutations using the SPIDER [33] homology search configured according to the vendor recommendations. All modules were restricted to at most three sequence alterations. Final peptide identification results were filtered at 2% FDR using the “decoy fusion” approach available in the Peaks software.
Proteins with at least two unique peptide identifications were considered to be present in the sample and exported for further analysis. Note that we required two unique peptide matches per protein when processing the error-tolerant search results. We relaxed this constraint to one unique peptide for the traditional database search results because a single LCV gene sequence template often had only one peptide identification when mutations were not considered. However, distinct peptides matching to different templates of a particular LCV gene fulfil the two unique peptides requirement at the LCV gene and LCV gene family level. Supplemental File 2 presents the settings used for all the search engines and results processing software.

LCV Gene and Gene Family Typing Algorithm for Generic LCV Sequence Template Database Search Results. For each patient, we computed the number of MS/MS that exclusively matched to each LCV clonal sequence. These exclusive spectra of all detected clones were aggregated to the LCV gene and LCV gene family level. LCV genes and LCV gene families that did not have at least two unique peptide identifications were removed from further analysis. The most abundant LCV gene family (by unique spectral abundance) was assumed to be present in the patient’s amyloid deposit, provided the following two criteria were met: a) the top ranking gene family had at least 5 unique MS/MS matches; and b) either there were no other LCV gene families detected in the deposit or the first ranked LCV gene family had 50% more spectral matches than the second ranked LCV gene family. Patient samples with no detectable LCV region peptides were labeled as “None Detected.” In parallel, a similar protocol was followed to infer the LCV gene of the deposited clone.
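The typing rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the input format (one `(gene_family, peptide)` pair per exclusive MS/MS match), the function name and the parameter names are hypothetical, "50% more spectral matches" is read as "at least 1.5 times the runner-up", and returning `None` when the criteria are not met is an assumed convention because the text does not say how such cases are labeled.

```python
from collections import defaultdict


def call_gene_family(psms, min_peptides=2, min_spectra=5, min_ratio=1.5):
    """Call the LCV gene family from exclusive peptide-spectrum matches.

    `psms` is an iterable of (gene_family, peptide) pairs, one pair per
    MS/MS spectrum that matched exclusively to a clone of that family.
    Returns the family name, "None Detected" when there are no LCV matches,
    or None when the calling criteria are not met (assumed convention).
    """
    spectra = defaultdict(int)
    peptides = defaultdict(set)
    for family, peptide in psms:
        spectra[family] += 1
        peptides[family].add(peptide)

    if not spectra:
        return "None Detected"

    # Drop families supported by fewer than two unique peptides.
    kept = {f: n for f, n in spectra.items() if len(peptides[f]) >= min_peptides}
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None

    top_family, top_count = ranked[0]
    if top_count < min_spectra:  # criterion (a)
        return None
    if len(ranked) > 1 and top_count < min_ratio * ranked[1][1]:  # criterion (b)
        return None
    return top_family


# Hypothetical sample: 7 spectra for LV3 (two peptides), 4 for LV1 (two peptides).
example = (
    [("LV3", "PEPA")] * 4 + [("LV3", "PEPB")] * 3
    + [("LV1", "PEPC")] * 2 + [("LV1", "PEPD")] * 2
)
call = call_gene_family(example)
```

The same function could be applied at the gene level by passing gene names instead of family names, mirroring the parallel protocol mentioned in the text.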

Results

We developed a novel informatics approach that uses known LCV region (protein) sequence templates to identify the amyloidogenic clone present in AL deposits. To test this approach, we first established a validation data set containing 30 AL patients and 30 non-AL patients. The amyloidogenic clone of AL patients was identified using an integrated proteogenomics method. The non-AL patients served as negative controls while characterizing the performance of the workflow. We implemented the workflow in a CAP/CLIA compliant clinical laboratory and applied it to a large clinical cohort (N=500). We detected a bias in the LCV gene usage of AL clones as well as similarity between the LC clones deposited in multiple organs of systemic AL amyloidosis patients.

Proteomic Typing of Patient Amyloid Deposits

Each patient’s AL diagnosis was made using established proteomics-based clinical assays [10, 11]. Figure 2 summarizes the amyloid proteomic profile of eight patients present in the validation cohort. The shotgun proteomics-based amyloid classification assay successfully identified either κ or λ constant chain proteins corresponding to either AL-κ or AL-λ subtypes while ruling out the presence of amyloid markers associated with non-AL subtypes (Figure 2). With this assay, the 30 AL subjects in the validation cohort were classified as 16 AL-λ amyloidosis patients and 14 AL-κ amyloidosis patients. The 500 clinical cohort patients were classified as 368 patients with AL-λ amyloidosis and 132 patients with AL-κ amyloidosis (Table 1). We selected roughly equal numbers of AL-λ and AL-κ patients for the validation cohort. However, the clinical cohort has an approximately 2.8:1 ratio of AL-λ to AL-κ patients, which reflects the higher propensity of the λ LCs to form amyloid deposits [34].

Proteogenomic Identification of Amyloidogenic LC Clones in AL Validation Cohort

In order to establish a truth set, we sequenced the LC mRNA repertoires of neoplastic plasma cells of the 30 AL patients present in the validation cohort (Figure 1A). A patient’s amyloid MS/MS spectra were matched to the corresponding LC repertoire to identify the most abundant clone. Table 2 presents the LCV gene identity of the detected amyloidogenic clones, their corresponding MS/MS counts and sequence coverage. The number of MS/MS matched to the clone’s V-J region and its corresponding constant region were normalized using the total number of MS/MS identified in the respective sample and scaled to 5000 (Table 2). Sequence coverage of the detected amyloidogenic clonotypic LCV proteins varied between patients (µ=69.8%, σ=22.8%) and ranged from 22.2% to 99.1%.
This suggests that the LCV protein can be degraded prior to deposition and not all AL patients will have a full length LCV protein fragment in their amyloid deposit. One might argue that trypsin digestion hindered recovery of full-length clones from patient samples. However, 15 out of 30 patients have LCV sequence coverage of at least 80%. The number of normalized MS/MS matched to the clonotypic LCV protein also varied between patients (µ=112.9, σ=88.6) and ranged from 16 to 372. We did not observe a strong correlation between the LCV sequence coverage and its corresponding normalized MS/MS counts (Spearman’s ρ=0.31). This suggests that the nature (full length vs. degraded) and amount of the clonotypic LCV proteins present in the amyloid deposit are independent of each other. The proportion of MS/MS matching to LCV and LC constant regions varied between patients. For example, patient #1 had 10 times more normalized MS/MS matching to the variable region of the clone compared to the constant region. This suggests that AL deposits can contain the variable region portion of the amyloidogenic clone without the constant region (Table 2).

Proteomic Detection of LCV Gene/Gene Family using Generic Sequence Templates

We developed three orthogonal bioinformatics workflows to infer the identity (not the exact sequence) of the LCV clonotypic protein present in the AL deposits (Figure 1B). We utilized the well characterized validation cohort samples to evaluate the ability of these workflows to not only recall the identity of the amyloidogenic LCV but also to recover the MS/MS spectra belonging to the clone. Table 3 presents the LCV gene and gene family calls made by the three orthogonal workflows for each AL patient in the validation cohort. The recall rate of the LCV gene and LCV gene family by each of the workflows was also computed (Table 3). The traditional database, sequence tag and de novo homology searches had LCV gene recall rates of 80%, 90% and 93%, respectively. This is expected because the sequence templates in the database are near representations of the amyloidogenic clone in the patient biopsy (Figure 3). Hence, allowing for amino acid mutations while matching the amyloid MS/MS spectra against the sequence templates progressively improves the identifiability of the amyloidogenic clone (Figure 3).
Only 24 out of the 30 AL cases had the LCV gene called by all three different search strategies and we observed a 100% concordance between the different strategies for those cases (Table 3). For the 24 cases, on average, the percentage of amyloidogenic LCV clonal spectra (shown in Table 2) recovered by the traditional database, sequence tag and de novo homology searches were 49%, 54% and 86%, respectively. This supports the hypothesis that error-tolerant mutation search strategies have a higher chance of recovering the clonotypic MS/MS spectra when using generic LCV sequence templates. When compared to the LCV gene recall rate, the LCV gene family recall rates for traditional database, sequence tag and de novo homology searches are slightly higher at 93%, 97% and 97%, respectively. This is because distinct peptides matching to different LCV genes of the same family contribute to the family identification. None of the negative control (non-AL) samples in the validation cohort had an LCV gene or gene family detected by any of the three workflows when following the above mentioned protocol. In contrast, the numbers of positive controls that did not have an LCV gene assigned by the traditional database, sequence tag and de novo homology search strategies were 5, 3 and 1, respectively (Table 3). Hence, the LCV gene negative predictive values for the aforementioned three search strategies are 86%, 90% and 97%, respectively. Similarly, the numbers of positive controls that did not have an LCV gene family assigned by the traditional database, sequence tag and de novo homology search strategies were 1, 1 and 0, respectively (Table 3). Hence, the LCV gene family negative predictive values for the aforementioned three search strategies are 98%, 98% and 100%, respectively. It is well understood that error-tolerant searches are more error-prone when compared to traditional database searches due to their increased search space. We tested this by taking the number of MS/MS matched to the amyloidogenic LCV by the “Patient-specific LCV Sequence Database” search as the golden threshold (shown in Table 2). We next computed the number of cases for which each of the “Generic LCV Sequence Template” search strategies identified more LCV clonotypic MS/MS compared to the golden threshold.
The extra spectra recovered by the template-based search strategies are more likely to be either false positives or contributions from the light chains present in the serum background. The traditional database search of the templates had one such case whereas the sequence tag and de novo homology searches had two and eight cases, respectively. Considering this, we implemented the generic sequence template-based traditional database search workflow for LCV gene family typing in a CAP/CLIA compliant laboratory dedicated to amyloid proteomic subtyping. Figure 4 illustrates the exclusivity of the LCV region peptides detected by this workflow.

LCV Gene Family Usage in AL Amyloidosis

We inferred the frequency of the LCV gene families utilized by the amyloidogenic clones. For this, clinical cohort biopsy specimens were separated into AL-κ or AL-λ based on the LC isotype. Identity of the deposited LC clone was inferred using the generic LCV sequence template-based traditional database search workflow. Figure 5 presents the frequencies of various κ and λ LCV gene families detected in the cohort. Even though there are 7 κ and 11 λ LCV gene families in the human genome, the majority of the detected amyloidogenic clones were restricted to only three κ and four λ LCV gene families (Figure 5). KV1 was the most frequently detected AL-κ clone (Figure 5; 60%; χ² goodness of fit [GOF] pvalue95% probability) peptide identification are shown below. Blue stars indicate universal amyloid tissue markers. Yellow stars indicate amyloid subtype specific biomarkers. Most abundant (by spectral count) amyloid subtype marker determines the patient subtype. The assay classified the AL-κ and AL-λ patients accordingly.

Figure 3: Sequence Recovery of LCV Region. The patient specific amyloidogenic LCV sequence of a validation cohort patient (derived from bone marrow sequencing) is shown. The most abundant (by MS/MS counts) generic LCV sequence templates identified by the database (DB), sequence tag (ST) and de novo (DN) searches were aligned to the patient specific clone. Regions of protein with sequence coverage were highlighted in green. Red colored amino acids represent mutation sites where the patient specific sequence deviates from the generic LCV sequence templates. Both “ST” and “DN” searches were able to recover mutant peptides and improve the sequence coverage.

Patient Specific:
YVLTQTPSVSVAPGQTARITCGGNNIGYKSVHWHQQKPGQAPVLVVYDDSDRPSGIPERFSGSNSGNTATLTISRVEAGDEADFYCQVWDSSSDQWVFGGGTKLTVL
Generic LCV Sequence Templates:
DB: SYVLTQPPSVSVAPGQTARITCGGNNIGSKSVHWYQQKPGQAPVLVVYDDSDRPSGIPERFSGSNSGNTATLTISRVEAGDEADYYCQVWDSSSDH
ST: SYVLTQPPSVSVAPGQTARITCGGNNIGSKSVHWYQQKPGQAPVLVVYDDSDRPSGIPERFSGSNSGNTATLTISRVEAGDEADYYCQVWDSSSDHWVFGGGTKLTVL
DN: YVLTQPPSVSVAPGQTARITCGGNNIGSKSVHWYQQKPGQAPVLVVYDDSDRPSGIPERFSGSNSGNTATLTISRVEAGDEADYYCQVWDSSSDHVVFGGGTKLTVL
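Because the color coding of the figure does not survive in text form, the mutation sites can be recovered by comparing the sequences position by position. The sketch below is only an illustration: the function name is hypothetical, it assumes the two sequences are already aligned to equal length (no gaps), and it uses the first 30 residues of the patient specific clone and of the DN template listed above.

```python
def mutation_sites(patient: str, template: str) -> list[tuple[int, str, str]]:
    """List 1-based positions where two equal-length sequences differ.

    Each entry is (position, patient_residue, template_residue).
    """
    if len(patient) != len(template):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (pos, p, t)
        for pos, (p, t) in enumerate(zip(patient, template), start=1)
        if p != t
    ]


# First 30 residues of the patient specific clone and of the DN template.
patient_prefix = "YVLTQTPSVSVAPGQTARITCGGNNIGYKS"
template_prefix = "YVLTQPPSVSVAPGQTARITCGGNNIGSKS"
sites = mutation_sites(patient_prefix, template_prefix)
# sites == [(6, "T", "P"), (28, "Y", "S")]
```

Templates that carry an extra leading residue (such as the DB and ST entries, which start with "S") would need that offset removed, or a proper pairwise alignment, before this comparison applies.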

Figure 4. LCV Region Peptides are Specific to AL Amyloidosis and not Seen in Non-AL Amyloidosis. Four each of AL and non-AL (AA and ALECT2) patients in the validation cohort were analyzed with proteomics and LCV region peptides were detected using a generic LCV sequence template-based traditional database search workflow. Spectral counts of selected biomarkers with at least one high-confidence (>95% probability) peptide identification are shown below. Blue stars indicate universal amyloid tissue markers. Yellow stars indicate amyloid subtype specific biomarkers. Double stars indicate the LCV gene templates. LCV region peptides were detected in all AL patients but in none of the non-AL patients. LC isotype of the detected variable region peptides in AL patients also correlated with the LC isotype of the detected constant regions.


Figure 5. LCV Gene Family Usage in AL Amyloidosis. MS/MS data from a total of 500 AL cases was analyzed to identify the LCV gene family of the amyloidogenic clone. This figure illustrates the most frequently used LCV gene families by amyloidogenic clones of κ and λ isotype. #None-det represents cases where no LCV region peptides were detected.

[Bar chart. x-axis: LCV Gene Families (KV1/LV1, KV2/LV2, KV3/LV3, KV4/LV4, KV6/LV6, KV8/LV8, None-Det#). y-axis: Percentage of Cases (0 to 80). Series: AL-κ (N=132) and AL-λ (N=368).]


Tables

Table 1. Demographic Profile of Patients Who Participated in This Study. (a) The amyloidogenic LC clone present in the validation cohort’s AL patients was sequenced using proteogenomics. Non-AL cases were equally distributed among ATTR, AA and ALECT2. (b) F stands for female, M stands for male and U stands for unknown. (c) Number of patients typed as either AL-λ or AL-κ.

| Cohort (a) | No. of cases (F/M/U) (b) | Age (years) | AL-λ (c) | AL-κ (c) |
|---|---|---|---|---|
| Validation (AL) | 30 (16/14/0) | 67.5 ± 10.4 | 16 | 14 |
| Validation (non-AL) | 30 (10/20/0) | 63.4 ± 9.7 | N/A | N/A |
| Clinical | 500 (195/296/9) | 65.8 ± 13.0 | 368 | 132 |


Table 2. Proteogenomic Identification of Amyloidogenic LC Clones. MS/MS spectra from the validation cohort’s AL patients were matched against their corresponding putative clones derived from bone marrow. (a) Most abundant (by MS/MS counts) LCV gene. LCV genes are referred to by their IMGT names (ref 25). Prefix KV stands for κ variable. Prefix LV stands for λ variable. (b) Variable region sequence coverage of the clone. (c) Total MS/MS matched to the clone’s variable and constant regions (ConstReg). (d) Spectral counts were normalized using the total number of MS/MS identified in the respective sample and scaled to 5000.


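| Patient# | LCV Gene (a) | SeqCov (b) | Raw MS/MS Counts (c): LCV Gene | Raw MS/MS Counts (c): LC ConstReg | Norm. MS/MS Counts (d): LCV Gene | Norm. MS/MS Counts (d): LC ConstReg |
|---|---|---|---|---|---|---|
| 1 | KV1-16 | 42.1 | 66 | 7 | 70 | 7 |
| 2 | KV1-16 | 86.9 | 47 | 58 | 28 | 34 |
| 3 | KV1-16 | 47.7 | 75 | 42 | 117 | 66 |
| 4 | KV1-33 | 86 | 73 | 74 | 115 | 117 |
| 5 | KV1-33 | 86 | 356 | 219 | 177 | 109 |
| 6 | KV1-33 | 64.5 | 62 | 116 | 75 | 141 |
| 7 | KV1-33 | 95.3 | 42 | 50 | 18 | 21 |
| 8 | KV1-33 | 59.8 | 182 | 103 | 260 | 147 |
| 9 | KV1-33 | 42.1 | 80 | 157 | 79 | 155 |
| 10 | KV1-33 | 80.4 | 285 | 173 | 135 | 82 |
| 11 | KV1-39 | 60.7 | 82 | 154 | 63 | 118 |
| 12 | KV1-5 | 56.1 | 60 | 133 | 60 | 133 |
| 13 | KV3-20 | 22.2 | 47 | 154 | 37 | 122 |
| 14 | KV3-20 | 99.1 | 1122 | 829 | 372 | 275 |
| 15 | LV1-44 | 68.2 | 63 | 75 | 62 | 74 |
| 16 | LV2-14 | 95.5 | 48 | 117 | 37 | 89 |
| 17 | LV2-14 | 45.5 | 21 | 32 | 16 | 24 |
| 18 | LV3-1 | 89.6 | 227 | 166 | 109 | 80 |
| 19 | LV2-14 | 48.2 | 31 | 108 | 44 | 154 |
| 20 | LV2-14 | 95.5 | 231 | 267 | 84 | 97 |
| 21 | LV2-14 | 28.2 | 42 | 50 | 65 | 78 |
| 22 | LV2-8 | 92.7 | 247 | 426 | 157 | 271 |
| 23 | LV3-21 | 54.6 | 80 | 15 | 83 | 16 |
| 24 | LV3-21 | 54.1 | 85 | 108 | 53 | 67 |
| 25 | LV3-21 | 81.7 | 169 | 70 | 190 | 79 |
| 26 | LV3-21 | 95.4 | 581 | 217 | 329 | 123 |
| 27 | LV3-21 | 88.1 | 152 | 67 | 169 | 74 |
| 28 | LV6-57 | 48.2 | 97 | 187 | 94 | 181 |
| 29 | LV6-57 | 85.6 | 67 | 77 | 66 | 76 |
| 30 | LV6-57 | 95.5 | 162 | 84 | 225 | 117 |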
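Footnote (d) of Table 2 describes the normalization in words: each raw spectral count is divided by the total number of MS/MS identified in the sample and scaled to 5000. A minimal sketch of that arithmetic is shown here. The function name `normalize_spectral_count` and the example totals are hypothetical; the per-sample totals are not reported in the table, and the rounding used by the authors is not stated.

```python
def normalize_spectral_count(raw_count, total_identified, scale=5000):
    """Scale a raw spectral count to a common total of `scale` identified MS/MS.

    Hypothetical helper illustrating footnote (d); `total_identified` is the
    total number of MS/MS identified in the same sample.
    """
    if total_identified <= 0:
        raise ValueError("total_identified must be positive")
    return raw_count * scale / total_identified


# Illustrative numbers only (not taken from the study): 50 spectra matched
# out of 2500 identified in the sample.
example = normalize_spectral_count(50, 2500)
print(example)
```

Normalizing this way makes counts comparable between samples that yielded different numbers of identified spectra, which is why the raw and normalized columns can rank patients differently.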


Table 3. Recovery of Amyloidogenic LC Clone via Generic LCV Sequence Template Database Searches. (a) LCV recovery from AL patients in the validation cohort using a traditional database search of LCV sequence templates. (b) LCV gene of the top-ranking (by MS/MS counts) clone. NU stands for non-unique. ND stands for not detected. Prefix KV stands for κ variable. Prefix LV stands for λ variable. (c) Unique peptide sequences were mapped back to LCV gene family loci, and the top-ranking gene family by MS/MS abundance is listed. (d) Total MS/MS matched to the clone’s gene and gene family. (e) LCV recovery using sequence tag-based searches configured to look for single amino acid mutations in peptides. (f) LCV recovery using de novo homology-based searches that looked for up to 3 amino acid mutations per peptide. (g) Recall was computed using Table 2 as ground truth.


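Column groups: DB = Database Search (No-mut) (a); ST = Sequence Tag Search (1-mut) (e); DN = De Novo Homology Search (3-muts) (f). Gene = LCV Gene (b); Family = LCV Gene Family (c); Gene SpC and Family SpC = LCV Gene SpC and LCV Gene Family SpC (d).

| Patient# | DB Gene | DB Family | DB Gene SpC | DB Family SpC | ST Gene | ST Family | ST Gene SpC | ST Family SpC | DN Gene | DN Family | DN Gene SpC | DN Family SpC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NU | KV1 | | 21 | KV1-16 | KV1 | 31 | 94 | KV1-16 | KV1 | 27 | 48 |
| 2 | KV1-16 | KV1 | 27 | 27 | KV1-16 | KV1 | 63 | 77 | KV1-16 | KV1 | 57 | 70 |
| 3 | KV1-16 | KV1 | 48 | 84 | KV1-16 | KV1 | 32 | 149 | KV1-16 | KV1 | 93 | 163 |
| 4 | NU | KV1 | | 38 | KV1-33 | KV1 | 52 | 101 | KV1-33 | KV1 | 57 | 114 |
| 5 | KV1-33 | KV1 | 109 | 154 | KV1-33 | KV1 | 164 | 770 | KV1-33 | KV1 | 281 | 368 |
| 6 | KV1-33 | KV1 | 19 | 19 | KV1-33 | KV1 | 41 | 41 | KV1-33 | KV1 | 33 | 53 |
| 7 | KV1-33 | KV1 | 28 | 28 | KV1-33 | KV1 | 30 | 40 | KV1-33 | KV1 | 19 | 31 |
| 8 | KV1-33 | KV1 | 74 | 104 | KV1-33 | KV1 | 37 | 56 | KV1-33 | KV1 | 35 | 257 |
| 9 | KV1-33 | KV1 | 37 | 56 | KV1-33 | KV1 | 41 | 44 | KV1-33 | KV1 | 45 | 95 |
| 10 | KV1-33 | KV1 | 75 | 87 | KV1-33 | KV1 | 118 | 203 | KV1-33 | KV1 | 41 | 240 |
| 11 | NU | KV1 | | 84 | NU | KV1 | | 124 | KV1-39 | KV1 | 54 | 162 |
| 12 | KV1-5 | KV1 | 47 | 67 | KV1-5 | KV1 | 39 | 100 | KV1-5 | KV1 | 116 | 172 |
| 13 | KV3-20 | KV3 | 47 | 109 | KV3-20 | KV3 | 60 | 176 | KV3-20 | KV3 | 33 | 137 |
| 14 | KV3-20 | KV3 | 307 | 1130 | KV3-20 | KV3 | 323 | 1493 | KV3-20 | KV3 | 367 | 1779 |
| 15 | LV1-44 | LV1 | 32 | 60 | LV1-44 | LV1 | 37 | 42 | LV1-44 | LV1 | 90 | 145 |
| 16 | ND | ND | | | LV2-14 | LV2 | 10 | 12 | ND | ND | | |
| 17 | ND | ND | | | ND | ND | | | LV2-14 | LV2 | 10 | 10 |
| 18 | NU | LV3 | | 22 | NU | LV3 | | 17 | ND | LV3 | | 20 |
| 19 | LV2-14 | LV2 | 21 | 21 | LV2-14 | LV2 | 12 | 86 | LV2-14 | LV2 | 67 | 136 |
| 20 | LV2-14 | LV2 | 37 | 40 | LV2-14 | LV2 | 22 | 27 | LV2-14 | LV2 | 49 | 99 |
| 21 | LV2-14 | LV2 | 21 | 39 | LV2-14 | LV2 | 26 | 81 | LV2-14 | LV2 | 32 | 120 |
| 22 | LV2-8 | LV2 | 19 | 28 | LV2-8 | LV2 | 27 | 95 | LV2-8 | LV2 | 14 | 27 |
| 23 | LV3-21 | LV3 | 14 | 16 | LV3-21 | LV3 | 37 | 37 | LV3-21 | LV3 | 29 | 70 |
| 24 | LV3-21 | LV3 | 43 | 195 | LV3-21 | LV3 | 63 | 287 | LV3-21 | LV3 | 190 | 361 |
| 25 | LV3-21 | LV3 | 49 | 49 | LV3-21 | LV3 | 77 | 86 | LV3-21 | LV3 | 134 | 389 |
| 26 | LV3-21 | LV3 | 169 | 233 | LV3-21 | LV3 | 488 | 509 | LV3-21 | LV3 | 556 | 841 |
| 27 | LV3-21 | LV3 | 67 | 86 | LV3-21 | LV3 | 29 | 64 | LV3-21 | LV3 | 63 | 228 |
| 28 | LV6-57 | LV6 | 79 | 195 | LV6-57 | LV6 | 81 | 385 | LV6-57 | LV6 | 125 | 214 |
| 29 | LV6-57 | LV6 | 71 | 78 | LV6-57 | LV6 | 21 | 68 | LV6-57 | LV6 | 78 | 124 |
| 30 | LV6-57 | LV2 | 127 | 127 | LV6-57 | LV6 | 84 | 102 | LV6-57 | LV6 | 136 | 319 |
| Recall (g) | 80% | 93% | | | 90% | 97% | | | 93% | 97% | | |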
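The gene-level recall values in Table 3 can be reproduced from the tabulated calls. The sketch below assumes that recall means the fraction of the 30 validation patients whose recovered LCV gene equals the Table 2 gene, with NU and ND counted as misses; the source states only that Table 2 was the ground truth. The names `TRUTH`, `with_overrides` and `recall` are hypothetical, and each search result is encoded as the Table 2 list plus the patients where the call differs.

```python
# Table 2 LCV gene per patient (patients 1 to 30), used as ground truth.
TRUTH = (
    ["KV1-16"] * 3 + ["KV1-33"] * 7
    + ["KV1-39", "KV1-5", "KV3-20", "KV3-20", "LV1-44", "LV2-14", "LV2-14", "LV3-1"]
    + ["LV2-14"] * 3 + ["LV2-8"] + ["LV3-21"] * 5 + ["LV6-57"] * 3
)


def with_overrides(calls, overrides):
    """Copy `calls`, replacing entries for the given 1-based patient numbers."""
    updated = list(calls)
    for patient, call in overrides.items():
        updated[patient - 1] = call
    return updated


# Table 3 gene calls; only the patients that differ from Table 2 are listed.
DB_CALLS = with_overrides(TRUTH, {1: "NU", 4: "NU", 11: "NU", 16: "ND", 17: "ND", 18: "NU"})
ST_CALLS = with_overrides(TRUTH, {11: "NU", 17: "ND", 18: "NU"})
DN_CALLS = with_overrides(TRUTH, {16: "ND", 18: "ND"})


def recall(calls, truth):
    """Fraction of patients whose call matches the ground-truth gene."""
    if len(calls) != len(truth):
        raise ValueError("calls and truth must cover the same patients")
    hits = sum(call == expected for call, expected in zip(calls, truth))
    return hits / len(truth)


for name, calls in [("DB", DB_CALLS), ("ST", ST_CALLS), ("DN", DN_CALLS)]:
    print(f"{name} gene recall: {recall(calls, TRUTH):.0%}")
```

Under this assumption the three searches give 24/30, 27/30 and 28/30 correct gene calls, matching the 80%, 90% and 93% reported in the table.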


Table 4. Consistency of LCV Gene Family Detection in Patients with Multiple Anatomical Site Biopsies. Forty-eight patients in the clinical cohort had biopsies obtained from multiple sites. MS/MS data from each biopsy were independently analyzed to identify the LCV gene family of the amyloidogenic clone. The organ and LCV gene family pairs of each patient are shown below. For all patients, the same LCV gene family was detected in all biopsied sites.


| Patient# | (Organ;LCV Gene Family) Pairs |
|---|---|
| 1 | (Fat aspirate;KV1) (Bone marrow;KV1) |
| 2 | (Fat aspirate;KV1) (Heart;KV1) |
| 3 | (Fat aspirate;KV1) (Heart;KV1) |
| 4 | (Fat aspirate;KV1) (Heart;KV1) |
| 5 | (Fat aspirate;KV1) (GI tract;KV1) |
| 6 | (Fat aspirate;KV3) (Bone marrow;KV3) |
| 7 | (Fat aspirate;KV4) (Skin;KV4) |
| 8 | (Fat aspirate;KV4) (GI tract;KV4) |
| 9 | (Fat aspirate;KV6) (Bone marrow;KV6) |
| 10 | (Fat aspirate;LV1) (Bone marrow;LV1) |
| 11 | (Fat aspirate;LV1) (Bone marrow;LV1) |
| 12 | (Fat aspirate;LV1) (Bone marrow;LV1) |
| 13 | (Fat aspirate;LV1) (Skin;LV1) |
| 14 | (Fat aspirate;LV2) (Heart;LV2) |
| 15 | (Fat aspirate;LV2) (Skin;LV2) |
| 16 | (Fat aspirate;LV3) (Bone marrow;LV3) |
| 17 | (Fat aspirate;LV3) (Bone marrow;LV3) |
| 18 | (Fat aspirate;LV3) (GI tract;LV3) |
| 19 | (Fat aspirate;LV3) (Heart;LV3) |
| 20 | (Fat aspirate;LV3) (Skin;LV3) |
| 21 | (Fat aspirate;LV3) (GI tract;LV3) |
| 22 | (Fat aspirate;LV3) (Heart;LV3) |
| 23 | (Fat aspirate;LV3) (Heart;LV3) |
| 24 | (Fat aspirate;LV3) (Kidney;LV3) |
| 25 | (Fat aspirate;LV3) (Heart;LV3) |
| 26 | (Fat aspirate;LV6) (Bone marrow;LV6) (Heart;LV6) |
| 27 | (Fat aspirate;LV6) (Bone marrow;LV6) |
| 28 | (Fat aspirate;LV6) (Bone marrow;LV6) |
| 29 | (Fat aspirate;LV6) (Heart;LV6) |
| 30 | (Fat aspirate;LV6) (GI tract;LV6) |
| 31 | (Fat aspirate;LV6) (Heart;LV6) |
| 32 | (Fat aspirate;LV6) (Heart;LV6) |
| 33 | (Fat aspirate;LV6) (Heart;LV6) |
| 34 | (Bone marrow;KV1) (Liver;KV1) |
| 35 | (Bone marrow;KV1) (Kidney;KV1) |
| 36 | (Bone marrow;KV1) (Heart;KV1) |
| 37 | (Bone marrow;KV1) (Kidney;KV1) |
| 38 | (Bone marrow;KV1) (Liver;KV1) |
| 39 | (Bone marrow;KV1) (Kidney;KV1) |
| 40 | (Bone marrow;KV3) (Kidney;KV3) |
| 41 | (Bone marrow;LV1) (Liver;LV1) |
| 42 | (Bone marrow;LV2) (Heart;LV2) |
| 43 | (Bone marrow;LV2) (Skin;LV2) |
| 44 | (Bone marrow;LV2) (GI tract;LV2) |
| 45 | (Bone marrow;LV3) (Heart;LV3) |
| 46 | (Bone marrow;LV3) (Heart;LV3) |
| 47 | (Bone marrow;LV3) (Heart;LV3) |
| 48 | (GI tract;KV1) (Kidney;KV1) |
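The consistency claim in the Table 4 caption amounts to checking that every patient maps to exactly one LCV gene family across all biopsied sites. A minimal sketch of such a check follows; the function name `inconsistent_patients` and the tuple layout are assumptions made for illustration, and only a few (organ, family) pairs from the table are included.

```python
from collections import defaultdict


def inconsistent_patients(biopsies):
    """Return patients whose biopsies map to more than one LCV gene family.

    `biopsies` is an iterable of (patient_id, organ, family) tuples.
    """
    families = defaultdict(set)
    for patient_id, _organ, family in biopsies:
        families[patient_id].add(family)
    return sorted(pid for pid, seen in families.items() if len(seen) > 1)


# A few (organ, family) pairs transcribed from Table 4 (patients 1, 25 and 48).
BIOPSIES = [
    (1, "Fat aspirate", "KV1"), (1, "Bone marrow", "KV1"),
    (25, "Fat aspirate", "LV3"), (25, "Heart", "LV3"),
    (48, "GI tract", "KV1"), (48, "Kidney", "KV1"),
]

print(inconsistent_patients(BIOPSIES))
```

An empty result means every listed patient had the same gene family at all sites, which is the outcome the caption reports for all 48 patients.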


Table of Contents Graphic

