Immunogenetics DOI 10.1007/s00251-015-0830-9

ORIGINAL PAPER

V genes in primates from whole genome sequencing data D. N. Olivieri & F. Gambón-Deza

Received: 19 November 2014 / Accepted: 16 February 2015 # Springer-Verlag Berlin Heidelberg 2015

Abstract The adaptive immune system uses V genes for antigen recognition. However, the evolutionary diversification and selection processes within and across species and orders remain poorly understood. Here, we studied the amino acid (AA) sequences obtained from the translated in-frame Vexons of immunoglobulins (IG) and T cell receptors (TR) from 16 primate species whose genomes have been sequenced. Multispecies comparative analysis supports the hypothesis that V genes in the IG loci undergo birth/death processes, thereby permitting rapid adaptability over evolutionary time. We also show that multiple cladistic groupings exist in the TRA (35 clades) and TRB (25 clades) V gene loci and that each primate species typically contributes at least one V gene to each of these clades. The results demonstrate that IG V genes and TR V genes have quite different evolutionary pathways; multiple duplications can explain the IG loci results, while coevolutionary pressures can explain the phylogenetic results of the TR V gene loci. Our results suggest that there exist evolutionary relationships between V gene clades in the TRA and TRB loci. Due to the long-standing preservation of these clades, such genes may have specific and necessary roles for the viability of a species.

D. N. Olivieri (*) Department of Informatics, University of Vigo, 32004 Ourense, Spain e-mail: [email protected] F. Gambón-Deza Servicio Gallego de Salud (SERGAS), Unidad de In-munología, Hospital do Meixoeiro, 36210 Vigo, Spain e-mail: [email protected] F. Gambón-Deza Instituto Biomédico de Vigo, 36310 Vigo, Spain

Keywords Immunologic repertoire . Primate evolution . Immunoglobulin (IG) . T cell receptor (TR) . Variable (V) gene . WGS data

Introduction The adaptive immune system contains natural molecular recognition machinery that is able to distinguish self from nonself and defend the body against infections (Janeway 1992). This molecular recognition system consists of two molecular structures, immunoglobulins (IG) and T receptors (TR). Immunoglobulins recognize antigen in soluble form and are composed of two types of chains, a heavy chain (IGH) and a light chain (either IGK or IGL). The recognition site is composed of the variable (V) domains present in the NH2-terminus of both chains. The antigen binding site is composed of two V domains, one from each chain. Within these protein domains, regions have been described that interact with antigen, called the complementarity determining regions (CDR) and the framework regions (FR). In IG, there are three CDR and three FR within each V domain. At the interaction site with antigen, six CDR are involved and supported by six FR. The TR recognize antigen that are presented by the major histocompatibility (MH) proteins, as peptide-MH (or pMH). Despite this substantial mechanistic difference of antigen recognition between TR and IG, both possess similar molecular structures. In particular, each has two chains possessing V domains at the site within the molecule that is responsible for pMH recognition. These domains are similar to those in IG, containing three FR and three CDR per chain (Janeway et al. 2005). Because the amino acid (AA) sequences of the IG and TR V domains are so similar, it has been hypothesized that all such sequences present today were derived from an ancestral gene

Immunogenetics

(Hughes 1994). This ancestral gene was assigned to immune recognition in the epoch coinciding with the origin of vertebrates. Evidence comes from the present-day IG and TR sequences in fish, whose structures have been maintained in all extant vertebrates (Flajnik and Kasahara 2010; Ghaffari and Lobb 1991). Genes of the V domains are distributed across seven unique loci. The genes from three of these loci are used to construct IG chains, while the other four loci contain genes that encode TR chains (Brack et al. 1978; Davis and Bjorkman 1988; Janeway et al. 2005; Tonegawa 1983) (see also: The Immunoglobulin FactsBook (Lefranc and Lefranc 2001a); The T cell receptor FactsBook (Lefranc and Lefranc 2001b)). The antigen recognition repertoire that an organism possesses is dictated by the total available set of these genes. These genes have two exons, one for the peptide leader and the other that encodes most of the V domain (V in-frame exon) (Lefranc 2014). Within the V exon, there are coding sequences for the first two complementarity determining regions (CDR1 and CDR2) for antigen recognition and the three framework regions (FR1, FR2, and FR3) (see the International ImMunoGeneTics Information System, http://www.imgt.org (Lefranc et al. 2009; Lefranc 2011b), IMGT/GENE-DB (Giudicelli et al. 2005). A third complementarity determining region (CDR3) is generated through a gene rearrangement process, called V-D-J rearrangement, whereby V exons are moved from their location in order to join with other genes, called D and J. This process is somatic and only occurs within lymphoid cells (Tonegawa 1983). The number of V exons for IG and TR is highly variable across different species, especially with respect to the IG loci. For example, there are approximately 600 IGHV genes for the microbat (Myotis lucifugus), while other mammals, such as those living in aquatic environments (e.g., seals, dolphins, and walruses) have much fewer IGHV genes (Olivieri et al. 2013). From the V exon sequence data available at http:// vgenerepertoire.org, the number of genes in the TRBV locus between species is approximately constant, while there is a large variance of the number of TRAV genes, particularly pronounced in the Bovine species of the Laurasiatheria. The reason for this variability amongst species is presently unknown. In this paper, we describe organizational and phylogenetic relationships of the amino acid (AA) sequences derived from the V exons of the Primate order uncovered from whole genome sequencing (WGS) datasets. In particular, we studied 16 representative primate species whose genomes have been sequenced in order to identify evolutionary patterns that could explain the present-day genomic repertoire of V genes.

Material and methods For these studies, we used genome data from whole genome shotgun (WGS) assemblies of species that are publicly

available at the NCBI. For the majority of these species, the V genes have not been annotated, or only partial annotations have been performed in specific loci (See the IMGT Repertoire, http://www.imgt.org and note that all curated genes were entered in IMGT/GENE-DB (Giudicelli et al. 2005), and IMGT gene nomenclature has been submitted to the NCBI (Lefranc 2014)). We used our VgenExtractor bioinformatics tool (Olivieri et al. 2013) (available at the repository, http:// vgenerepertoire.org) to identify the in-frame V exon sequences from an analysis of these genome files by searching for well-established signatures and motifs. Our software algorithm searches and extracts in-frame V exon sequences based upon known motifs. In particular, the algorithm scans large genome files and extracts candidate V exon sequences. These V exon sequences are delimited at the nucleotide level by an acceptor splice at the 5′ end and at the 3′ end by the V recombination signal (V-RS). These V exon sequences (and their translation) include the second part of the signal peptide (LPART2) and the V-REGION (IMGT-ONTOLOGY 2012, for a detailed description (Giudicelli and Lefranc 2012)). The IMGT unique numbering starts at the beginning of the V-REGION. The exons of functional (F) and open reading frame (ORF) fulfill specific criteria: they are flanked in 3′ by the V recombination signal (V-RS), they have a reading frame without stop codon, the length is at least 280 bp long, and they contain two canonical cysteines and a tryptophan at specific positions (1st-CYS 23 and 2nd-CYS 104, CONSERVEDTRP 41 according to the IMGT unique numbering ((Lefranc et al. 2003; Lefranc 2011a)). Since the VgenExtractor algorithm scans entire genomes by matching specific motif patterns along the exon sequence, a fraction of the extracted sequences may fulfill the conditions established by our algorithm for being functional V genes, yet are structurally very different. Such sequences are easily discarded with a BLASTP comparison against a V gene consensus sequence. We found that an ample threshold of the BLASTP E-value parameter (∼1e-15) is sufficient for eliminating all sequences that are not V genes. The VgenExtractor algorithm can be modified to identify pseudogenes by relaxing the motif filters or by accepting sequences containing a stop codon. Nonetheless, such a procedure would only uncover a fraction of the complete set of pseudogenes that could otherwise fulfill different criteria. A complete set of pseudogenes would remain elusive due to random alterations of sequences over evolutionary history. Thus, we limited all further studies to specifically targeted functional genes or those exons that possess the requirements seen in all V genes sequences annotated to date (see the IMGT Repertoire, http://www.imgt.org). Once we identified functional V exons, we analyzed the set of translated amino acid (AA) sequences with a pipeline that we implemented within the Galaxy toolset (https://usegalaxy. org/). The steps of the workflow are as follows. First, we

Immunogenetics

performed multiple BLAST alignment of the AA sequences against Vexon consensus sequences obtained from previously annotated V genes (IMGT Repertoire, http://www.imgt.org). Those sequences with a BLAST similarity score >0.001 were retained, while other sequences were discarded. From the resulting AA translated V exon sequences, we performed multiple alignment with clustalO (Sievers and Higgins 2014) and phylogenetic comparison studies using SEAVIEW (Gouy et al. 2010). For the tree construction, we used a maximum likelihood algorithm and the LG matrix. Finally, we used the MEGA5 (Tamura et al. 2011) and FigTree (http://tree.bio.ed. ac.uk/software/figtree/) to produce tree graphics. We classified the V exon sequences into one of the seven loci (IGHV, IGKV, IGLV, TRAV, TRBV, TRDV, and TRGV) by obtaining a heuristic score based upon a BLASTP against the NCBI NR protein database. The score is computed by mining the text description from protein hits that have a similarity score above a predetermined threshold in order to obtain a relevant word frequency indicative of exon type. For each protein description, the word frequency is weighted by the BLAST similarity score so that the most similar protein descriptions contribute more to the final loci classification. V gene identification with random forest Using an alternative technique, we independently confirmed the V exons (and their loci classification) obtained from the VgenExtractor by developing a random forest-based algorithm. The random forest method (Breiman 2001) is a powerful ensemble classification technique that is widely used for machine learning tasks and has gained popularity in bioinformatics (Zhang 2012), particularly in proteomic and genomic studies because it is nonparametric, relatively easy to implement, and is not sensitive to overfitting (Bylander 2002). Another aspect that makes random forests particularly appealing is that even for extremely large feature vectors, the method provides an automatic measure of feature importance without the need for dimensionality reduction. Such large feature vectors are typical for characterizing protein homology. Therefore, we developed an algorithm based upon the conventional random forest technique to predict functional in-frame V exon from WGS nucleotide datasets. To train the random forest, we obtained a large set of positively identified mammalian V exon sequences from vgenerepertoire.org and the IMGT. We transformed each inframe amino acid exon sequence to a 500-element vector by using the Bphysicochemical distance transform^ (PDT) (Liu et al. 2012). In order to classify putative exons into a particular locus, we trained the random forest separately for each locus. Our training set consisted of 16,829 positively identified V exon sequences (the positive sequences in each locus were IGHV: 3512, IGKV: 2396, IGLV: 3065, TRAV 4384, TRBV: 2150, TRDV: 600, and TRGV: 722) and 51,790 random

sequences representing a negative background. Thus, the training set consists of data having a background/signal ratio of 3:1, which we found sufficient in practice for classifying positive V exons from similar (but not V exons) sequences. For each locus, we performed a binary supervised learning procedure with the random forest classifier. Candidate exons to be tested by the random forest were selected from each WGS contig by extracting all sequences between the exon start nucleotide sequence, BAG^, and the three first conserved nucleotides of the RSS signal, BCAC^ (Hassanin et al. 2000). From this initial set of sequences, only those having a minimum size of 100 bp are tested by the random forest (such a condition could be relaxed for studies of pseudogenes). No other condition was imposed upon the candidate sequence before testing with the random forest. Selected sequences were first transformed using the PDT and then tested against each of the locus random forest models. Those sequences having a score greater than 50 % in any of the loci were assigned to the locus with maximal voting score. Finally, we aligned the sequences with those obtained from the VgenExtractor to construct a phylogenetic tree, thereby confirming the consistency of the two different methods for predicting Vexons and their corresponding locus assignments. We have made our V exon discovery algorithm based upon random forest freely available through vgenerepertoire.org. CDR and FR identification For this work, we also developed a python analysis script (we call Trozos; freely available at vgenextractor.org), which we used for extracting the CDR and FR sequences from the AA translated V exons. First, we studied the V exon amino acid sequences and could identify the presence of the two canonical cysteines (1st-CYS 23 and 2nd-CYS 104) and the presence of tryptophan, W41 (IMGT unique numbering (Lefranc et al. 2003; Lefranc 2011a)). Between each V exon, the number of amino acids in the CDR varies; however, we used the standardized IMGT naming/nomenclature to define the regions. In particular, CDR1 contains six amino acids and begins at the position of the first cysteine (i.e., +3 to cysteine+10). The CDR2 is defined as the sequence located between W41+15 and W41+22. The FR sequences are located between the CDR. Stretches of sequences, which we indicate by (i), refer to sequences we obtained computationally. For a particular set of sequences, some parameter adjustment in the Trozos algorithm is necessary for consistency, validating the final result against a visualization of the sequence alignment. For a detailed study of the alignments and a study of the conservation sites, we used the software Jalview (Waterhouse et al. 2009). We obtained the V exon sequences of 16 primates whose WGS sequences are available at the NCBI. The primates included in our study and their corresponding abbreviated accession numbers are the following: the Lemuriformes:

Immunogenetics

Daubentonia madagascariensis (AGTM01), Otolemur garnettii (AAQR03), Microcebus murinus (ABDC01), the Tarsiiformes: Tarsius syrichta (ABRT01), the New World Monkeys: Callithrix jacchus (ACFV01), Saimiri boliviensis (AGCE01), the Old World Monkeys: Macaca mulatta (AANU01), Macaca fascicularis (CAEC01), Chlorocebus sabaeus (AQIB01), Papio anubis (AHZZ01), the Hominids: Nomascus leucogenys (ADFV01), Pongo abelii (ABGA01), Gorilla gorilla (CABD02), Pan paniscus (AJFE01), Pan troglodytes (AACZ03), and Homo sapiens (AADD01). Some details of these WGS data sets are provided in Table 1.

Results Previous work described variations in the number of V genes between species (Guo et al. 2011; Niku et al. 2012). Likewise, we recently demonstrated the presence of distinct evolutionary processes between the IG and TR V genes (Olivieri et al. 2013). Nonetheless, the origin of this variation in V gene number and evolutionary processes is still unknown. To gain insight into this V gene number variation, we compared these genes across species of several mammalian orders and families. Here, we describe such a comparative study in Primates. In particular, we studied the V exon sequences from the 16 primate species represented in Fig. 1, covering the five major simian branches. Table 2 shows number of V exon sequences extracted per locus from the WGS datasets. While the study takes a broad view of the simian evolutionary branches, there is a greater representation (six species) from the hominid group simply because of the asymmetric availability of the WGS data in this primate order. Immunoglobulin genes The IGHV locus has been described in many vertebrate species (Berman et al. 1988; Gambon-Deza et al. 2009; GambonDeza et al. 2010; Miller et al. 1998). The joint study of all species with all V exon sequences has identified three evolutionary clans (IMGT (Lefranc 2001)). The clans are defined by specific sequences in the framework 1 (FR1-IMGT) regions and framework 3 (FR3-IMGT) regions, which are influential in antigen recognition functionality (Kirkham et al. 1992). The IGKV and IGKL loci are less well studied, and there are no published works that clearly indicate the existence of clans in these loci, contrary to the case in the IGHV locus. From the 16 species of primates studied (Table 2), we obtained a total of 701 IGHV exons sequences. While nearly all primates have on average between 30 and 60 genes in the IGHV locus, two species, D. madagascariensis and O. garnetti, have markedly fewer IGHV genes than the average.

To compare the V exon sequences in extant primate species, we carried out multi-species phylogenetic analysis by first aligning the AA translated sequences with clustalO (Sievers and Higgins 2014) and then performing tree construction with FastTree (Price et al. 2010) (using maximum likelihood and WAG matrices). Subsequently, we used FigTree or MEGA5 (Tamura et al. 2011) for visualization. Figure 2 shows the resulting phylogenetic trees with the presence of three major clades, or clans (Clan I, II, and III), which have already been described in vertebrates (Giudicelli and Lefranc 1999; Kirkham et al. 1992; Lefranc 2001) (see the IMGT clans http://www.imgt.org). Additionally, due to the large number of sequences in this study, we can discern the presence of subclades within each of the defined IGHV locus clans. Thus, from the primate V exon sequences found within Clan I, the 182 sequences from three subclades (A-29 Seq, B22 Seq, and C-131 Seq), the 139 sequences in Clan II form two subclades (A-41 Seq and B-98 Seq), and the 380 sequence in Clan III 380 can be grouped into three subclades (A-104 Seq, B-94 Seq, and C-182 Seq). The three principal clades have an origin prior to the appearance of mammals (Lefranc 2001). While some subclades may have an earlier origin to that of Primates (e.g., clade II-A and clade III-A), the overall subclade distribution seen in Fig. 2 is specific to the Primate order and does not necessarily coincide with the suborders of other mammals. Clues about the origins of V gene sequences can be gained by observing the distribution of the primate species amongst the clades. While Old World monkeys and Hominids have sequences in all clades, New World Monkeys have no sequences within clade III-B, and Tassiforms and Lemures have no sequences within the IGHV clade II (Table 3). The loss of these clades must have occurred in independent events within infraorder evolution. We also observe that species typically have several V genes per clade and that duplication events were recent, as indicated by the ages of the nodes in Fig. 2. From each of the clans, we determined the 90 % consensus sequences. These sequences are represented in the lower part of Fig. 2, showing sequence conservation in the framework regions (FR) and the existence of motifs that can be used to define separate clades in these regions. As expected, the least conserved regions correspond to those of the CDR. We found similar results in the IG light chain V genes. In particular, we found that across the primate species, there was wide variability in number IGKV genes (coding for the kappa chains or IGK), ranging between 10 and 78 V genes. Results are shown in Table 4. With respect to the number of IGLV (coding for the lambda chain or IGL), we found a similar variability, between 10 and 51 V genes. Also, some variation exists between the number ratios IGKV/IGLV within each of the primate species studied. These ratios are of interest, because it is well established that in humans and mice, there is more IGK than IGL both in serum as well as with respect to

Washington University 22 January 2010 Broad Institute 17 November 2011 Macaca mulatta Genome Sequencing Consortium 27 February 2006 F. Hoffmann-La Roche Ltd 23 July 2011 Vervet Genomics Consortium 25 March 2014 Baylor College of Medicine 04 February 2013 Gibbon Genome Sequencing Consortium 08 December 2012 Orangutan Genome Sequencing Consortium 14 August 2013 Wellcome Trust Sanger Institute 05 October 2011 Max-Planck Institute for Evolutionary Anthropology 24 April 2012 Chimpanzee Sequencing and Analysis Consortium 25 March 2011 J. Craig Venter Institute 24 September 2007

ACFV01 GCF_000004665.1 AGCE01 GCF_000235385.1 AANU01 GCA_000002255.1

ADFV01 GCA_000146795.3 ABGA01 GCF_000001545.1 CABD02 GCF_000151905.1 AJFE01 GCF_000258655.1 AACZ03 GCF_000001515.6 ABBA01 GCA_000002125.2

Gorilla gorilla Pan paniscus

Pan troglodytes

Homo sapiens

CAEC01 GCA_000222185.1 AQIB01 GCF_000409795.2 AHZZ01 GCF_000264685.2

Washington University 17 November 2008

Pongo abelii

Macaca fascicularis Chlorocebus sabaeus Papio anubis Hominids Nomascus leucogenys

Tarsius syrichta New World monkeys Callithrix jacchus Saimiri boliviensis Old World monkeys Macaca mulatta

Weill Cornell Medical College 01 January 2012 Broad Institute 16 March 2011 Broad Institute; The Genome Sequencing Platform 26 July 2007

Submitter date

ABRT01 GCA_000164805.1

AGTM01 GCA_000241425.1 AAQR03 GCF_000181295.1 ABDC01 GCA_000165445.1

Lemuriformes Daubentonia madagascariensis Otolemur garnettii Microcebus murinus

Tarsiiformes

WGS assembly no.

WGS Data for the 16 primates included in this study.NP/DS indicates no publication, direct submission

Species

Table 1

Sanger 2.844Gb

Sanger 6× 3.309Gb

Solexa 35× (3.035Gb) 454 GS FLX; 454 GS FLX Titanium 26× (2.869Gb)

Sanger 6× (3.441Gb)

Sanger 5.6× 2.962Gb

454 (2.878Gb) 454 Titanium Illumina HiSeq; ABI (2.789Gb) Sanger & Illumina 90× (2.948Gb)

Sanger 3.097Gb

ABI 3730 6.6× (2.914Gb) Illumina HiSeq 80× (2.608Gb)

Sanger (3.179Gb)

Illumina GA II× 38× (2,855Gb) Illumina GAII× 137× (2.519Gb) Sanger 1.9×

Seq tech., coverage and length

108,431 (71,333)

50,656 183,860

11,661 (464,875) 66,775 (121,235)

15,648 (408,559)

35,148 (197,900)

8925 (718,093) 90,449 (162,724) 40,262 (198,932)

25,707 (301,015)

29,273 (201,371) 38,823 (151,414)

2876 (1,201,173)

3653 (3,231,305) 27,100 (200,240) 3511 (669,735)

Contig N50 N.Contigs

Immunogenetics

Immunogenetics Fig. 1 Primate species tree. Phylogenetic tree of the primates included in the study. The tree was constructed from recent molecular phylogenetic data provided in (Perelman et al. 2011)

the number of genes (i.e., in humans, there are 46 IGKV genes and 33 IGLV, while in serum, there is approximately 70 % antibodies with IGK chains as compared to 30 % antibodies with IGL chains). In all primate species, we found a larger number of IGKV compared to IGLV genes, with only two exceptions, in prosimians, where the ratio is unity, and in the case of the M. murinus, where the ratio is inverted. From the 16 primate species in our study, we obtained a total of 629 IGKVexon sequences. The resulting phylogenetic tree from the corresponding AA translated sequences indicates the existence of two large (or principal) clades and two smaller clades having a lower number of sequences (Clade I— 210 sequences and II—419 sequences seen in Fig. 3). The two principal clades have representative sequences in all the primate species studied, and they must have a previous origin to the rise of Primates. In prosimians, they do not have sequences in clade II-A. As in the case of IGHV described previously, the subclades probably are specific to Primates. We obtained a total of 522 IGLV exon sequences, which group into five principal clades from a phylogenetic analysis. Within each clade, there are representatives from several or all of the 16 primate species in this study. This clade structure may be significant, since it corresponds to the five clades in IGLV locus found in reptiles, which we described in a previous publication (Olivieri et al. 2014). Since this structural conservation of the IGLV clades is maintained in distant reptile species, it is suggestive of underlying functionality required of these genes. The nodes in the IGLV tree of Fig. 3 show the antiquity of the clades. From sequence alignments within each subclade, we obtained the 90 % consensus sequences (those sequences whose

AA positions possess 90 % similarity) given in Fig. 3 (bottom). The AA in the sequences are present in nearly all sequences, while the B*^ represents sites of variability. Similar to what we showed for the IGHVexons, the clades are defined by sequence motifs present in the frameworks FR1 and FR3. The sequences in the FR2 region from different clades are similar, while variability can be detected in regions that contain the CDR. The TR V genes We used our gene-calling algorithm, VgenExtractor, to obtain the TR V exon sequences and studied the AA sequences of the V exons from the TRA and TRB loci in 16 different primate species. In particular, we obtained 670 TRAV exon sequences and found that the number of TRAV exons ranges between 30 (in C. jacchus) and 69 (in O. garnettii). Figure 4 shows a phylogenetic study of the AA sequences of the TRAV translated exons, where five major clades can be identified. From the age of these nodes and from a previous study, we established that the nodes of these principal clades were generated before the diversification of mammals, having origins in reptiles (Olivieri et al. 2014). From the tree in Fig. 4 (left), each of the five principal TRAV clades have several subclades that were generated during the evolution of mammals, several of which may be found in other mammalian orders (based upon our preliminary analysis of TRAV in non-primate species). As can be seen, the five major TRAV clades divide into a natural grouping of 35 subclades, each one having sequences from approximately 12 species. Within each subclade, each species contributes at

Immunogenetics Table 2 Distribution of V genes amongst the IG and TR loci

Species

IGHV

IGKV

IGLV

TRAV

TRBV

TRGV

TRDV

ALL

Lemuriformes D. madagascariensis O. garnettii M. murinus Tarsiiformes

4 9 40

10 49 26

10 42 52

40 69 38

17 31 30

2 5 3

5 13 3

88 218 191

38

50

26

40

12

3

6

175

75 27

24 29

27 14

30 37

43 36

3 4

6 2

208 159

63 67 64 68

65 66 63 78

51 48 44 47

45 44 36 38

57 56 53 58

5 4 3 5

6 7 7 6

292 292 270 300

28 72 44 30 53 44

33 42 30 26 41 46

20 31 22 28 27 33

42 42 41 36 52 42

37 62 49 51 55 49

4 4 3 6 5 3

5 9 5 6 4 6

169 262 194 183 237 193

T. syrichta New World monkeys C. jacchus S. boliviensis Old World monkeys M. mulatta M. fascicularis C. sabaeus P. anubis Hominids N. leucogenys P. abelii G. gorilla P. paniscus P. troglodytes H. sapiens

least one or two genes. As specific examples, Fig. 4 (right) shows an expanded view of specific subclades, explicitly exposing the names of the taxa of the constituent sequences. From a detailed examination of these subclades, at least one sequence from each species is found to be common, indicating that these are clades of orthologous genes. A detailed enumeration of the number of TRAV sequences contributed by each primate species to each of the 35 subclades is given in the heatmaps of Fig. 5 (left). From this figure, each clade contain sequences from 83 % of the primates in this study (an average of 13.2 species/subclade from a total of 16 species), demonstrating the evolutionary stability of these subclades. There are clades where the V exon sequences of some species are absent (approximately 17 % of subclade/species pairs); however, this could be due to genome sequencing gaps in the incomplete (draft) WGS assemblies or because our V gene calling exon failed to detect these sequences. In previous publications, we showed that our algorithm detects >90 % of V exon sequences (Olivieri et al. 2013). Despite these gene recognition limitations, the existence of subclades, their stability as seen in Fig. 5, and their orthology are clearly demonstrated and would not change with the addition or absence of a small number of V genes. Similar to the method we used in the IG loci, we determined the 90 % consensus sequences for each of the 35 subclades for V exons in the TRA locus, shown in Fig. 4

(bottom). Clear differences can be seen between sequences from different clades, and conservation exists within the same clade. Table 5 shows the variability that exists within each of the V exon sequence regions (FR1, CDR1, FR2, CDR2, and FR3). Within each of these five regions, we determined the fraction of the number of locations having variability (with positions shown as B*^) to the total number of positions including the amino acid conserved sites. As expected, the CDR regions are those that exhibit the most AA site variability; however, CDR2 is less variable than the CDR1, suggesting an underlying conservation process. When compared against the framework regions, the CDR2 region is slightly more variable. In Fig. 6 (top), we show the phylogenetic tree analysis for the AA translated TRBV exon sequences. Similar to that found in the TRA locus, we found that the TRBV sequences organize into defined clades. In particular, from the 696 TRBV exon sequences, we identified nine principal clades, from which we could further deduced 25 more recent subclades. Just as with the TRAV sequences, the ancestral TRBV genes from the nine major clades may have their origins prior to the speciation of mammals. Moreover, these subclades may be present in other suborders of mammals. As in the TRAV locus, each TRBV subclade contains V sequences from nearly all primate species of the study. Figure 5 (right) shows a detailed distribution of TRBV

Immunogenetics

Immunogenetics

ƒFig. 2

Analysis of the IGHV locus. The phylogenetic trees of the AA translated sequences from IGHV exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. Left: the tree of all IGHV exon sequences; Right: a detailed view of the subclade, Clan I-C, showing the names of constituent species. The leaves of the taxon, P. albelii, are highlighted in order to easily illustrate the distribution of a particular species within the subclade. In the bottom part of the figure, the consensus sequences of each clade are given. The amino acids that are found in more than 90 % of the sequences are marked by their letter, while the more diverse AA are represented by an asterisk. The AA alignments are presented with IMGT standardized labels and according to the IMGT unique numbering ((Lefranc et al. 2003; Lefranc 2011a, 2014))

sequences per species across each subclade. Within each subclade, approximately 88 % of the species studied contribute at least one or more sequences (an average of 14.2 species/ subclade from the 16 primate species of this study). Because nearly all species contribute a sequence (or more) to each subclade, Fig. 5 suggests that the TRAV and TRBV clades are evolutionarily stable. Figure 6 (bottom) shows the alignment within each clade carried out to obtain the set of 90 % consensus sequences. As before, we studied the variability between the canonical IMGT-defined regions. The results are summarized in Table 5, Table 3

where the greatest variability is detected within the CDR, as to be expected. As seen with the TRAV genes, the sequences of CDR1 have lower variability than those of CDR2. The phylogenetic results from the TRBV and TRAV loci show that V exon sequences exist are maintained throughout evolution across primate species, since each species contributes one such gene to each subclade of the tree. To further confirm these results and to determine which parts of the sequences are involved in the process of selection, we studied the FR and CDR sequences separately from each V exon sequence. To separate the four regions (i.e., FR1, FR2, FR3, and the CDR) from the AA translated V exon sequences, we used our algorithm Trozos (see BMethods^ section). Three sequences correspond to framework regions FR1, FR2, and FR3, while the fourth sequence is constructed by combining the two sequences from CDR1 and CDR2. As an illustration, Fig. 7 shows the alignment of constituent TRA V exon (AA translated) sequences from the clades 18, 19, and 20 in the primates studied, where the marked regions were defined by our software Trozos. Ortholog sequences The phylogenetic tree analysis just described suggests that the TR V genes are orthologous between primate species and that

Distribution of IGHV exons across the clans

Species

Lemuriformes D. madagasca O. garnettii M. murinus Tarsiiformes T. syrichta New World monkeys C. jacchus S. boliviensis Old World monkeys M. mulatta M. fascicularis C. sabaeus P. anubis Humanoids N. leucogenys P. abelii G. gorilla P. paniscus P. troglodytes H. sapiens

Clan I

Clan II C

A

Clan III

A

B

B

0 2 0

0 0 0

2 0 9

0 0 0

0 0 0

2

0

16

0

7 2

2 2

18 8

2 2 2

1 2 0

2 1 3 1 1 1 1

A

B

C

0 2 5

0 4 10

0 1 13

0

5

8

8

1 2

2 1

3 2

0 0

40 10

4 6 10

4 3 1

14 15 7

14 13 11

9 9 8

15 17 8

3

9

4

18

12

7

13

2 3 2 1 2 2

6 9 9 7 8 10

2 6 3 3 7 5

4 18 6 2 8 3

4 13 6 3 6 5

3 10 6 5 8 7

6 10 11 8 12 11

Immunogenetics Table 4

Distribution of V exons from IGKV and IGLV across clades and species

Species

IGKV I

Lemuriformes D. madagasca O. garnettii M. murinus Tarsiiformes T. syrichta New World monkeys C. jacchus S. boliviensis Old World monkeys M. mulatta M. fascicularis C. sabaeus P. anubis Humanoids N. leucogenys P. abelii G. gorilla P. paniscus P. troglodytes H. sapiens

IGLV II-A

II-B

II-C

mI

mII

I

II

III

IV

V

2 17 15

0 0 0

3 2 2

3 25 8

1 2 0

1 1 1

3 5 13

4 13 24

2 4 3

1 16 5

0 4 7

15

3

1

27

0

3

6

8

5

4

3

8 11

2 2

4 4

8 10

1 2

0 0

13 4

10 3

1 3

1 3

2 1

19 20 21 24

9 9 5 14

6 6 3 6

29 28 15 32

1 1 1 1

1 1 0 1

14 10 12 12

22 22 16 20

6 6 6 6

5 5 3 4

4 5 7 5

15 9 5 9 6 1

2 8 3 3 4 2

1 4 3 2 3 4

14 20 17 12 27 7

1 1 1 0 1 1

0 0 0 0 0 0

6 10 9 8 10 10

11 14 9 11 11 14

0 2 2 3 1 3

1 1 1 3 3 2

2 4 1 3 2 4

the pool of these genes preexisted prior to the diversification of this mammalian order. Apart from this standard phylogenetic analysis (Patterson 1988; Fitch 2000), we carried out a complementary procedure that could additionally confirm the existence of orthologs; i.e., whether each V exon of the TRV loci are unique to each species and whether there exists an ortholog in other species. The procedure is as follows: (1) we obtained the FR/CDR for each V exon for all species; (2) for the corresponding FR/CDR sequences of each V exon, we performed a pairwise sequence comparison with all other FR/CDR sequences from each V exon within and across all species; (3) we used a pair wise sequence similarity measure found by counting the exact positional matches of AA along the CDR/FR sequence; and (4) we observed the number distribution of exact matches for all sequences from all species. Therefore, if the corresponding CDR/FR sequences from V exons are found from one species having a large number of direct AA matches with a CDR/FR sequences (from V exons) in another species, they provide further evidence that they are orthologs. This procedure is carried out for each FR/CDR sequence against all FD/ CDR within and across all species. As an illustration of this procedure for determining orthologs, Fig. 8 shows the statistical distribution of exact sequence matches of two randomly chosen V exons that were

compared against all sequences within and across all other species studied. In particular, we randomly chose the corresponding CDR/FR sequences from Vs367 of Homo sapiens from the TRAV locus and Vs168 of Macaca mulatta from the TRBV locus (all sequences in this study are available at http:// vgenerepertoire.org). As described, these sequences were then compared against the corresponding CDR/FR sequences from all other species. In the plots of Fig. 8, the statistical mean and standard deviation are indicated by the box plots in the standard way, and the outlier points are given by red dots. The outliers of the figure indicate sequences that have high number of AA that exactly match the AA sequences of Vs367 of H. sapiens in the top plots, or Vs168 in the bottom plots). Orthologs, appearing as outliers, are seen for each of the four segments (CDRS, FR1, FR2, and FR3) in nearly all species. For example, in the CDRs (top left of Fig. 8), the sequences Vs1097 of Pan troglodytes, Vs531 of Pan paniscus, Vs133 of Gorilla gorilla, etc., all have high number of exact AA matches with the sequence Vs367 of H. sapiens, while all other sequences have 3× less AA matches. This phenomenon occurs in each of the segments, indicating the uniqueness of each V gene. We repeated the same experiment for the TRBV genes (i.e., the Vs168 exon sequence of M. mulatta, shown in Fig. 8 (bottom)). The results are similar to those of the TRAV; however, unique V exon sequences were not found in FR2.

Immunogenetics

Fig. 3 Analysis of IGKV and IGLV loci. The phylogenetic trees of the AA translated sequences from (IGKV (left) and IGLV (right)) exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. The root of the major clades is marked with roman

numerals. Significant subclades in the tree are identified at the right. In the bottom part of the figure, the consensus sequences of each clade are given. The amino acids that are found in more than 90 % of the sequences are marked by their letter, while more diverse AA are represented by an asterisk. The AA alignments are presented with IMGT standardized labels and according to the IMGT unique numbering ((Lefranc et al. 2003; Lefranc 2011a, 2014))

The data generated from the phylogenetic tree suggests frequent changes in the gene loci of IG as well as a reduced permissiveness in the genes of the TR chains. The theory of birth and death of genes has been postulated as a mechanism that directs the evolutionary processes of these genes. In Fig. 9, we studied this hypothesis by quantifying sequence similarities higher than 90 % in segments over 3000 bases in the IGHV, TRAV, and TRBV loci between the orangutan, human, and

macaque species. The results show that in the IGHV locus, there are more tracks and cross-links than compared to the TR loci. Also, comparing species uncovers relationships in the IGHV locus over evolutionary time. For example, the number of IGHV tracks and crossovers is higher between human and macaque (more distantly related species) than between human and orangutan (evolutionarily closer species). In the TRAV and TRBV loci, the tracks are approximately parallel;

Immunogenetics

Immunogenetics

ƒFig. 4

Analysis of the TRAV locus. The phylogenetic trees of the AA translated sequences from TRAV exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. The original tree with all V exon sequences are shown to the left. Clades are collapsed in the center tree to better illustrate the sequence similarities. Representative subclades, 3, 16, and 33, are shown to the right to illustrate the distribution of constituent primate taxa. At the bottom part of the figure, the consensus sequences of each clade are shown, where the amino acids that are found in more than 90 % of the sequences are marked by their letter, while the more diverse AA are represented by an asterisk. The AA alignments are presented with IMGT standardized labels and according to the IMGT unique numbering ((Lefranc et al. 2003; Lefranc 2011a, 2014))

indicating that less duplication/deletion processes took place between speciation events in these loci, thereby contrasting what is observed in the IGHV locus.

Discussion

mu

TRAV Clade distribution

mu

M.

D.

Cla

de ma da

gas car . r i n O. gar us n T. s etti i yri C. chta jac c S. b hus oli M. viens is mu M. latta fas C. cicula sab ris P. a aeus nub is N. leu cog P. a e bel nys ii G. go P. p rilla ani P. t scus rog lod H. sap ytes ien s

M.

D.

Cla

de ma da

gas car . r in O. gar us T. s netti i yri C. chta jac c S. b hus oli M. viens is mu M. latta fas C. cicula sab ris P. a aeus nub is N. leu cog P. a e bel nys ii G. go P. p rilla ani P. t scus rog lod H. sap ytes ien s

In previous studies (Olivieri et al. 2013; Olivieri et al. 2014), we showed data that indicates different evolutionary processes between the V genes of IG and TR. In IG, the processes of

birth and death are evident. We also highlighted the grouping of sequences into established IMGT clans and proposed new clades. For the species studied in this work, we describe the clustering of the IG light chain sequences into major clades. While the light chain cladistic groupings are not as well established in the literature as the three clans of the IGH chains, the large number of sequences in this study lends solid support to the existence of such a structure. The grouping of the IGL chains into five clades is of particular interest, since these clades originated prior to the diversification of mammals and reptiles and interestingly both have remained in evolutionary lines for over 300 million years, suggesting a functional significance of each clade that is still unknown. All loci containing V exons are very similar. Despite this wide similarity, however, there are stark evolutionary differences amongst the IG and TR loci. The IG loci exhibit more pronounced rates of change than found in the TR loci. This is seen by observing sequences between species of primates where there is greater sequence conservation in the TR loci. Apart from the evidence left as relics in genomic sequences, frequent duplications of IG genes generated recent clades with multiple members. Evolution provides a defense mechanism of an organism for rapid adaptation of IG chains to a rapidly changing external infectious environment. The fact that some

TRBV Clade distribution

Fig. 5 Number of species per clade. The tables show the number of species within each clade of the TRAV locus (left) and the TRBV locus (right) obtained from the phylogenetic analysis of each locus

Immunogenetics Table 5

Percentage of variable AA sites along the translated V exon primate sequences, derived from the alignments provided in Figs. 3 and 4

Locus

FR1 (%)

CDR1 (%)

FR2 (%)

CDR2 (%)

FR3 (%)

TRAV TRBV IGHV IGKV IGLV

32 34 24 44 50

54 43 54 56 85

30 27 31 35 51

38 59 68 50 83

30 31 29 32 39

clades are not present in individual infraorders may be the consequence of this accelerated evolution. Under this hypothesis, some genes no longer have an evolutionary advantage and, due to mutations, are not functionally expressed, thereby contributing to an ever increasing collection of pseudogenes in these loci. In the TRA and TRB loci, there is a conservation of V exons and low duplication permissiveness. In particular, we found a conservation of 35 TRAV exon sequences and 25 TRBV exon sequences. Nonetheless, in some species, we did not detect any conserved V exons. This may be a methodological error (as described previously, VgenExtractor only detects 90 % of the V exons from WGS data sets, and some sequencing gaps still exist in the draft versions of the primate WGS assemblies) or the system may be slightly redundant, permitting some V exon loss without compromising the survival of the individual. Similarly, in the TR loci, we detected duplication events but never observed multiple duplications, such as those in the IG loci. Thus, the phylogenetic trees suggest that the TCR V genes are orthologs between primates and that the pool of these genes preexisted prior to primate diversification from other mammals. It is likely that in future studies of V genes in Eutheria, representatives of these clades will be found and help to further clarify this diversification. The uniqueness of each gene in the TR loci is of particular interest. The number of V genes from these loci is not arbitrary. Large repertoire variations can be generated by the V-DJ rearrangement process and somatic mutations, giving rise to the possibility that very few V genes could be sufficient for somatic diversification. Previous work (Suarez et al. 2006), which studied the IG loci, has shown that nearly complete repertoires can be generated with only a few V regions. The results expressed in this work indicate that the genomic diversity of the V genes in the TR loci should have a functional basis maintained throughout evolution. Our results provide new insights into the evolution of CDR and FR regions. From an evolutionary point of view, the CDRs are sequence segments that should be permissive to mutations, while changes in the framework regions should be less permissive since they provide a well-defined structure. In general, when the AA sequences deduced from the V exons are aligned, the CDR are grouped into what are known as

hypervariable regions. However, when each clade is studied independently, the FR have variability similar to that found in the CDR (Fig. 7) especially in the CDR2 of the TRAV locus and the CDR1 of the TRBV locus. These results show that sequences of these CDR are maintained in evolution and that there is not a greater permissiveness to mutations than in the framework regions. The hypervariability found in the alignment of sequences of one species is due to the presence of different CDR within each V exon, but there exists an evolutionary ortholog maintained in other primate species. Our data indicates that the sequences of each TRAV exon may be positively selected with a specific, non-redundant function. Why have these genes been maintained in the TR V loci? A probable explanation is that this maintenance is due to a coevolution with other interacting molecules, such as MHC, that provide natural evolutionary pressures. These same evolutionary pressures may also condition the pairing of the TRA and TRB V domains. Therefore, the evolution of each V region must be constrained by modifications that occur equally in the MH exons as well as other changes in V region pairing. These coevolutionary mechanisms do not occur in the IG loci, since antigen recognition by antibodies is not conditioned by MH, making a greater permissibility of evolutionary modifications more likely. Why are there 25 TRBV and 35 TRAV subclades? A possible explanation could be that a minimum number of genes are required to form TRA/TRB pairs needed for T lymphocytes to recognize antigen presented by the large structural variations of MH1 and MH2. Indeed, it is known that MH Fig. 6 Analysis of the TRBV locus. The phylogenetic trees of the AA„ translated sequences from TRBV exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. The original tree with all V exon sequences are shown to the left. Clades are collapsed in the center tree to better illustrate the sequence similarities. Representative subclades, 3, 16, and 33, are shown to the right to illustrate the distribution of constituent primate taxa. At the bottom part of the figure, the consensus sequences of each clade are shown, while the more diverse AA are represented by an asterisk. The AA alignments are presented with IMGT standardized labels and according to the IMGT unique numbering ((Lefranc et al. 2003; Lefranc 2011a, 2014))

Immunogenetics

Immunogenetics

Fig. 7 Comparison of clades from CDR1 and CDR2. Detailed representation of the clades 18, 19, and 20 formed from the alignment of the AA sequences of TRAV exon sequences of primates. The phylogenetic tree (left) show the CDR1 (i) and CDR2 (i) sequences in

order to demonstrate the similarity of these sequences between members of the same clade. Sequence alignment (right) is shown for clade-18 and clade-20. The regions that are marked have been defined by our analysis software Trozos (see “Material and methods”)

can have multiple forms, particularly MH2. If this hypothesis was true, we would expect to find specific pairings of TRA/ TRB for putative MH proteins. Also, we would expect to find evidence of the association between V exons and the presence or absence of MH genes in evolutive studies.

A plausible explanation for the results presented in this study is that the MH genes that coexist with the TR V genes must act as evolutionary guides. In this scenario, the capacity for the TR to recognize the MH should be coded directly within the germline, while the antigen recognition of the TR/

Fig. 8 Sequence similarity the FR/CDR between primate species. The figure shows the results of a procedure to elucidate the number distribution of exact AA matches for CDR/FR in the TRAV and TRBV loci. The statistical distribution of CDR/FR sequence similarity from two randomly selected V exons compared to the corresponding CDR/FR sequences of V exons from all other species (top plots). The

corresponding CDR/FR of the Vs367 of H. sapiens from the TRAV locus is compared against all other CDR/FR of V exons from other species (bottom plots) The corresponding CDR/FR of Vs168 of M. mulatta from the TRBV locus is compared with all other CDR/FR of V exons from all other species

Immunogenetics

Fig. 9 Nucleotide sequence comparisons between primate species. Identical sequence within the IGHV, TRAV, and TRBV loci from the three species of primates, H. sapiens, P. abelii, and M. mulatta. For each species, the V exon sequences were extracted from genomic segments available at the Ensemble repository www.ensembl.org. For our analysis pipeline, we used Galaxy (http://galaxy.wur.nl). Sequences for the tracks were obtained in the following way: we performed a

BLASTN against the orangutan and macaque sequences (i.e., the db), with the query consisting of human sequence. We selected sequence identities >90 % and an alignment length >3000 bases and the figure was made with a custom python script. The locations of the V exons, marked in red, were obtained with VgenExtractor (http:// vgenerepertoire.org/)

pMC complex is a consequence of random somatic variations in the individual (VDJ rearrangements and somatic mutations). Studies suggest that recognition of MH is mediated by the CDR1 and CDR2 which are within the V exon, while the antigenic component is recognized by the CDR3 (encoded by the V-(D)-J junction) (Deng et al. 2012; Marrack et al. 2008). Our data is consistent with this description and that the MH recognition system must be encoded in the genome. This would explain the coevolution of both molecules. It is logical that the processes of somatic variability are directed towards antigen recognition and have a stochastic quality. If this description is correct, the MHC recognition structures would be limited in order to accompany evolutionary allowed changes. Our work points to the fact that these constraining structures are the FR and CDR amino acid sequences generated from the V exons. In summary, the present study describes the genomic relationships between V exon sequences of the IG and TR loci in primates. We observed evidence for evolutionary conservation of constituent CDR and FR sequences in the TRAV and TRBV loci across all species in the primate order. Our results

show that in the IG loci, duplications and losses of Vexons are common, while in the TRAV and TRBV loci, complex selection mechanisms may be responsible in order to conserve V exons between species.

References Berman JE, Mellis SJ, Pollock R, Smith CL, Suh H, Heinke B, Kowal C, Surti U, Chess L, Cantor CR et al (1988) Content and organization of the human Ig VH locus: definition of three new VH families and linkage to the Ig CH locus. EMBO J 7(3):727–738 Brack C, Hirama M, Lenhard-Schuller R, Tonegawa S (1978) A complete immunoglobulin gene is created by somatic recombination. Cell 15(1):1–14 Breiman L (2001) Random forests. Mach Learn 45(1):4–32 Bylander T (2002) Estimating generalization error on two-class datasets using out-of-bag estimates. J Mach Learn 48(1–3):287–297 Davis MM, Bjorkman PJ (1988) T-cell antigen receptor genes and T-cell recognition. Nature 334(6181):395–402. doi:10.1038/334395a0 Deng L, Langley RJ, Wang Q, Topalian SL, Mariuzza RA (2012) Structural insights into the editing of germ-line-encoded interactions

Immunogenetics between T-cell receptor and MHC class II by Valpha CDR3. Proc Natl Acad Sci U S A 109(37):14960–14965. doi:10.1073/pnas. 1207186109 Fitch WM (2000) Homology a personal view on some of the problems. Trends Genet 16(5):227–231 Flajnik MF, Kasahara M (2010) Origin and evolution of the adaptive immune system: genetic events and selective pressures. Nat Rev Genet 11(1):47–59. doi:10.1038/nrg2703 Gambon Deza F, Sanchez Espinel C, Magadan Mompo S (2009) The immunoglobulin heavy chain locus in the reptile Anolis carolinensis. Mol Immunol 46(8–9):1679–1687. doi:10.1016/j. molimm.2009.02.019 Gambon-Deza F, Sanchez-Espinel C, Magadan-Mompo S (2010) Presence of an unique IgT on the IGH locus in three-spined stickleback fish (Gasterosteus aculeatus) and the very recent generation of a repertoire of VH genes. Dev Comp Immunol 34(2):114–122. doi: 10.1016/j.dci.2009.08.011 Ghaffari SH, Lobb CJ (1991) Heavy chain variable region gene families evolved early in phylogeny. Ig complexity in fish. J Immunol 146(3):1037–1046 Giudicelli V, Lefranc MP (1999) Ontology for immunogenetics: the IMGT-ONTOLOGY. Bioinformatics 15(12):1047–1054 Giudicelli V, Lefranc MP (2012) IMGT-ONTOLOGY 2012. Front Genet 3:79. doi:10.3389/fgene.2012.00079 Giudicelli V, Chaume D, Lefranc MP (2005) IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res 33(Database issue):D256– 261. doi:10.1093/nar/gki010 Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27(2):221–224. doi:10. 1093/molbev/msp259 Guo Y, Bao Y, Wang H, Hu X, Zhao Z, Li N, Zhao Y (2011) A preliminary analysis of the immunoglobulin genes in the African elephant (Loxodonta africana). PLoS One 6(2):e16889. doi:10.1371/journal. pone.0016889 Hassanin A, Golub R, Lewis SM, Wu GE (2000) Evolution of the recombination signal sequences in the Ig heavy-chain variable region locus of mammals. Proc Natl Acad Sci U S A 97(21):11415–11420. doi: 10.1073/pnas.97.21.11415 Hughes AL (1994) The evolution of functionally novel proteins after gene duplication. Proc Biol Sci R Soc 256(1346):119–124. doi:10. 1098/rspb.1994.0058 Janeway CA Jr (1992) The immune system evolved to discriminate infectious nonself from noninfectious self. Immunol Today 13(1):11– 16. doi:10.1016/0167-5699(92)90198-g Janeway CA, Jr., Travers P, Walport M, Shlomchik MJ (2005) Immunobiology Garland Science Kirkham PM, Mortari F, Newton JA, Schroeder HW Jr (1992) Immunoglobulin VH clan and family identity predicts variable domain structure and may influence antigen binding. EMBO J 11(2): 603–609 Lefranc MP (2001) Nomenclature of the human immunoglobulin heavy (IGH) genes. Exp Clin Immunogenet 18(2):100–116 Lefranc MP (2011a) From IMGT-ONTOLOGY DESCRIPTION axiom to IMGT standardized labels: for immunoglobulin (IG) and T cell receptor (TR) sequences and structures. Cold Spring Harb Protoc 2011(6):614–626. doi:10.1101/pdb.ip83 Lefranc MP (2011b) IMGT, the international ImMunoGeneTics information system. Cold Spring Harb Protoc 2011(6):595–603. doi:10. 1101/pdb.top115 Lefranc MP (2014) Immunoglobulin and T cell receptor genes: IMGT ((R)) and the birth and rise of immunoinformatics. Front Immunol 5: 22. doi:10.3389/fimmu.2014.00022 Lefranc MP, Lefranc G (2001a) The Immunoglobulin Factsbook. Academic, San Diego

Lefranc MP, Lefranc G (2001b) The T cell receptor FactsBook. Academic, San Diego Lefranc MP, Pommie C, Ruiz M, Giudicelli V, Foulquier E, Truong L, Thouvenin-Contet V, Lefranc G (2003) IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol 27(1):55–77 Lefranc MP, Giudicelli V, Ginestoux C, Jabado-Michaloud J, Folch G, Bellahcene F, Wu Y, Gemrot E, Brochet X, Lane J, Regnier L, Ehrenmann F, Lefranc G, Duroux P (2009) IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res 37(Database issue):D1006–1012. doi:10.1093/nar/gkn838 Liu B, Wang X, Chen Q, Dong Q, Lan X (2012) Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One 7(9):e46633. doi:10.1371/journal. pone.0046633 Marrack P, Scott-Browne JP, Dai S, Gapin L, Kappler JW (2008) Evolutionarily conserved amino acids that control TCR-MHC interaction. Annu Rev Immunol 26:171–203. doi:10.1146/annurev. immunol.26.021607.090421 Miller RD, Grabe H, Rosenberg GH (1998) V (H) repertoire of a marsupial (Monodelphis domestica). J Immunol 160(1):259–265 Niku M, Liljavirta J, Durkin K, Schroderus E, Iivanainen A (2012) The bovine genomic DNA sequence data reveal three IGHV subgroups, only one of which is functionally expressed. Dev Comp Immunol 37(3–4):457–461. doi:10.1016/j.dci.2012.02.006 Olivieri D, Faro J, von Haeften B, Sanchez-Espinel C, Gambon-Deza F (2013) An automated algorithm for extracting functional immunologic V-genes from genomes in jawed vertebrates. Immunogenetics 65(9):691–702. doi:10.1007/s00251-013-0715-8 Olivieri DN, von Haeften B, Sanchez-Espinel C, Faro J, Gambon-Deza F (2014) Genomic V exons from whole genome shotgun data in reptiles. Immunogenetics 66(7–8):479–492. doi:10.1007/s00251-0140784-3 Patterson C (1988) Homology in classical and molecular biology. Mol Biol Evol 5(6):603–625 Perelman P, Johnson WE, Roos C, Seuanez HN, Horvath JE, Moreira MA, Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MP, Silva A, O’Brien SJ, Pecon-Slattery J (2011) A molecular phylogeny of living primates. PLoS Genet 7(3):e1001342. doi:10.1371/ journal.pgen.1001342 Price MN, Dehal PS, Arkin AP (2010) FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One 5(3): e9490. doi:10.1371/journal.pone.0009490 Sievers F, Higgins DG (2014) Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol Biol 1079:105–116. doi: 10.1007/978-1-62703-646-7_6 Suarez E, Magadan S, Sanjuan I, Valladares M, Molina A, Gambon F, Diaz-Espada F, Gonzalez-Fernandez A (2006) Rearrangement of only one human IGHV gene is sufficient to generate a wide repertoire of antigen specific antibody responses in transgenic mice. Mol Immunol 43(11):1827–1835. doi:10.1016/j.molimm.2005.10.015 Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28(10):2731–2739. doi:10.1093/molbev/ msr121 Tonegawa S (1983) Somatic generation of antibody diversity. Nature 302(5909):575–581 Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9):1189–1191. doi:10.1093/ bioinformatics/btp033 Zhang C (2012) Ensemble machine learning methods and applications. Springer, New York

V genes in primates from whole genome sequencing data.

The adaptive immune system uses V genes for antigen recognition. However, the evolutionary diversification and selection processes within and across s...
7MB Sizes 1 Downloads 15 Views