Marine Environmental Research xxx (2014) 1–11


Marine Environmental Research journal homepage: www.elsevier.com/locate/marenvrev

Machine learning approaches to investigate the impact of PCBs on the transcriptome of the common bottlenose dolphin (Tursiops truncatus)

Annalaura Mancia a, b, *, James C. Ryan c, Frances M. Van Dolah c, John R. Kucklick d, Teresa K. Rowles e, Randall S. Wells f, Patricia E. Rosel g, Aleta A. Hohn h, Lori H. Schwacke c

a Department of Life Sciences and Biotechnology, University of Ferrara, 44121 Ferrara, Italy
b Marine Biomedicine and Environmental Science Center, Medical University of South Carolina, Hollings Marine Laboratory, Charleston, SC 29412, USA
c NOAA, National Ocean Service, Hollings Marine Laboratory, Charleston, SC 29412, USA
d National Institute of Standards and Technology, Hollings Marine Laboratory, Charleston, SC 29412, USA
e NOAA, National Marine Fisheries Service, Office of Protected Species, Silver Spring, MD 20910, USA
f Chicago Zoological Society, c/o Mote Marine Laboratory, Sarasota, FL 34236, USA
g NOAA, National Marine Fisheries Service, Southeast Fisheries Science Center, Lafayette, LA 70506, USA
h NOAA, National Marine Fisheries Service, Southeast Fisheries Science Center, Beaufort, NC 28516, USA

Article info

Article history: Received 12 September 2013; received in revised form 1 March 2014; accepted 10 March 2014; available online xxx

Abstract

As top-level predators, common bottlenose dolphins (Tursiops truncatus) are particularly sensitive to chemical and biological contaminants that accumulate and biomagnify in the marine food chain. This work investigates the potential use of microarray technology and gene expression profile analysis to screen common bottlenose dolphins for exposure to environmental contaminants through the immunological and/or endocrine perturbations associated with these agents. A dolphin microarray representing 24,418 unigene sequences was used to analyze blood samples collected from 47 dolphins during capture-release health assessments from five different US coastal locations (Beaufort, NC; Sarasota Bay, FL; Saint Joseph Bay, FL; Sapelo Island, GA; and Brunswick, GA). Organohalogen contaminants including pesticides, polychlorinated biphenyl congeners (PCBs) and polybrominated diphenyl ether congeners were determined in blubber biopsy samples from the same animals. A subset of samples (n = 10, males; n = 8, females) with the highest and the lowest measured values of PCBs in their blubber was used as strata to determine the differential gene expression of the exposure extremes through machine learning classification algorithms. A set of genes associated primarily with nuclear and DNA stability, cell division and apoptosis regulation, intra- and extra-cellular traffic, and immune response activation was selected by the algorithm for identifying the two exposure extremes. In order to test the hypothesis that these gene expression patterns reflect PCB exposure, we next investigated the blood transcriptomes of the remaining dolphin samples using machine-learning approaches, including K-nn and Support Vector Machines classifiers. Using the derived gene sets, the algorithms worked very well (100% success rate) at classifying dolphins according to the contaminant load accumulated in their blubber. These results suggest that gene expression profile analysis may provide a valuable means to screen for indicators of chemical exposure.

© 2014 Elsevier Ltd. All rights reserved.

Keywords: Machine learning; Transcriptome; Bottlenose dolphin; Ecogenomics; Environmental contaminant exposure; Biotoxin exposure

* Corresponding author. Department of Life Sciences and Biotechnology, University of Ferrara, 44121 Ferrara, Italy. E-mail address: [email protected] (A. Mancia).

http://dx.doi.org/10.1016/j.marenvres.2014.03.007 0141-1136/© 2014 Elsevier Ltd. All rights reserved.

Please cite this article in press as: Mancia, A., et al., Machine learning approaches to investigate the impact of PCBs on the transcriptome of the common bottlenose dolphin (Tursiops truncatus), Marine Environmental Research (2014), http://dx.doi.org/10.1016/j.marenvres.2014.03.007

1. Introduction

The response to infection or environmental stressors is neither well characterized nor well understood in dolphins, especially when compared to what is known in model species or humans. However, this is an important area of investigation given the protected status of dolphins in many countries and, as top-level predators, the species’ relevance as a sentinel for coastal ecosystem health. Among the early responses to infection or environmental stress are changes in gene expression. qPCR techniques are widely used to accurately assay the level of expression for a specific gene, and knowledge of the cetacean immune system has been expanded primarily through cloning by RT-PCR of cardinal immune response genes, such as major cytokines and immunoglobulins (Beineke et al., 2010). While qPCR has become routine and remains an


important tool for analysis of transcriptomic responses in cetacean species, more extensive screens of the transcriptome are now available using microarrays. Genomic technologies and functional genomics have accelerated novel gene discovery, offering the opportunity to study molecular responses on a broad ecological scale through the deployment of gene microarrays as transcriptomic biosensors (Almeida et al., 2005; Gracey and Cossins, 2003; Gracey et al., 2004).

The underlying paradigm of the transcript profiling approach is that all stimuli impinging on a cell will affect both gene and protein expression in that cell. Transcript profiling yields a quantitative “snapshot” of an entire expressed genome and is now an established technique in the biomedical models of human physiology and disease (Scherf et al., 2000; Weinstein, 1998, 2000; Staunton et al., 2001). Thousands of genes can be examined simultaneously resulting in large amounts of information about the interactions of many physiological systems. Furthermore, functional genomics approaches can yield results that are characteristic of the sum total of the environmental impacts an organism is experiencing (Chapman et al., 2009, 2010), thus making it not only informative of basic mechanisms, but also a valuable diagnostic tool for health and disease.

Various studies have examined the effects of environmental factors (such as toxins and contaminants) on dolphins and other marine mammals all over the world (Wells et al., 1994; Scholin et al., 2000; Cardellicchio et al., 2000, 2002; Parsons and Chan, 2001; Roditi-Elasar et al., 2003; Houde et al., 2006; Fair et al., 2007; Stavros et al., 2008). The majority of these studies have focused on effects of specific agents as determined by cellular/tissue distribution and more descriptive biological outcomes. These studies are necessary to develop an understanding of how the compromised status of coastal ecosystems may affect the resident organisms.
The susceptibility of dolphins to toxicants may also be very different from what has been previously described for humans. For example, in bottlenose dolphins the minimal body burden of methylmercury that produces mild symptoms is 2 mg/kg, which is seven times the human threshold (Rawson et al., 1995). In addition, the response to a specific stressor may vary dramatically as a function of a specific cell type (for example, epithelial vs. endothelial, or kidney vs. liver) (Kültz, 2005).

In order to identify the health status of an organism it is necessary to identify biomarkers of both healthy and diseased states. A useful approach to select specific biomarkers involves the availability of genomic information and an understanding of the relationship between genes and proteins in the cellular context. The combination of transcriptomics and machine learning methods (algorithms that can learn from defined data, and then be used to classify undefined data) in the study of protected species, like dolphins, can provide informative pictures of the environmental impacts of stressors and be predictive of unhealthy environments and diseased states (Orrù et al., 2012; Weber et al., 2009).

High-throughput, global measurements of gene expression can be accomplished through both sequence (RNA-Seq) and hybridization (microarray) methods (Friedman and Maniatis, 2011). Although RNA-Seq has emerged as a powerful technique for transcriptomics with a major advantage in the much simpler bioinformatics, in certain circumstances microarrays have advantages, particularly cost. The effective application of microarrays for the study of stress and disease in dolphins has already been established (Mancia et al., 2008, 2010; Ellis et al., 2009). The first dolphin microarray constructed was small (4500 features), and consisted of contact printed cDNAs from dolphin peripheral blood leukocytes for studies of immune function and stress reactions (Mancia et al., 2007).
The recent availability of the dolphin genome sequence, as well as large expressed sequence tag (EST) collections from specific dolphin tissues, prompted the design of the in-situ synthesized, 60-mer oligonucleotide dolphin microarray used here, with global coverage of the genome (Mancia et al., submitted). This new tool was coupled with machine learning methods, including K-nearest neighbors (K-nn) and Support Vector Machines (SVMs), two supervised classification algorithms capable of learning from ‘known’ data and classifying ‘unknown’ data (Frades and Matthiesen, 2010). The goal of this study was to identify a suite of bottlenose dolphin genes indicative of exposure to PCBs using transcriptome information.

2. Materials and methods

2.1. Dolphin samples: capture-release and processing

Common bottlenose dolphin sampling was conducted during four non-consecutive summer seasons (2004–2006 and 2009) in Sapelo Island and Brunswick, Georgia (n = 25; 14 males and 11 females, 2009), Beaufort, North Carolina (n = 7; 2 males and 5 females, 2006), Sarasota Bay, Florida (n = 6; 3 males and 3 females, 2005) and Saint Joseph Bay, Florida (n = 9; 4 males and 5 females, 2006) (Table 1). Methods of dolphin capture-release for health assessment have been previously described (Schwacke et al., 2010; Wells et al., 2004). Briefly, dolphins were encircled with a seine net and then restrained by handlers. Female dolphins greater than or equal to 220 cm were held in the water until an ultrasound could be conducted to assess pregnancy status. Only non-pregnant females and males were brought aboard a processing boat for physical examination, weighing and morphometric measurements, and diagnostic sampling. Depending on the stage of pregnancy, pregnant females were not taken on board the boat; instead, an abbreviated in-water examination and sampling were conducted. Standardized data and blood collection were conducted as previously described (Schwacke et al., 2010; Wells et al., 2004). All dolphins were released on site immediately after sampling.
Blood samples were collected from 47 dolphins in PAXgene™ Blood tubes (Qiagen, Valencia, CA, USA), mixed immediately to stabilize the RNA, and stored according to the manufacturer’s instructions, i.e., at room temperature for up to 24 h prior to RNA purification, and at 4 °C when longer storage times were needed.

Table 1
Dolphin samples and site of collection along the coast of US South-Eastern states.

Sampling site                              Males   Females   All
Beaufort (BEA), NC                         2       5         7
Sapelo Island (SAP)–Brunswick (BRN), GA    14      11        25
Sarasota (SAR), FL                         3       3         6
Saint Joseph Bay (SJB), FL                 4       5         9
Total                                      23      24        47

2.2. Chemical analysis

Fifty-five PCB congeners or congener groups were measured in blubber as detailed in Litz et al. (2007). Blubber organohalogen contaminant concentrations correlate closely with blood organohalogen concentrations in bottlenose dolphins (Yordy et al., 2010); hence blubber concentrations are appropriate for use in this work. The measured values of the 55 PCB congeners (18, 28 + 31, 44, 49, 52, 56, 66, 70, 74, 87, 92, 95, 99, 101, 105, 110, 118, 119, 128, 130, 137, 138, 146, 149, 153 + 132, 151, 154, 156, 157, 158, 163, 167, 170, 172, 174, 176, 177, 178, 180 + 193, 183, 185, 187, 189, 194, 195, 197, 199, 200, 201, 202, 203 + 196, 206, 207, 208 and 209) were normalized to lipid present in the blubber for each animal. Concentrations of other organohalogens (DDT-related compounds, chlordane and related compounds, mirex, chlorobenzenes, and hexachlorocyclohexanes) are given elsewhere in Kucklick et al. (2011).
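The lipid normalization of the summed PCB concentration can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the formula, so dividing the summed concentration by the lipid fraction is assumed here, and the function name `lipid_normalize` is hypothetical. The example values are the lipid percentage and summed PCB concentration reported for dolphin 232 in Table 2.

```python
def lipid_normalize(sum_pcbs, lipid_percent):
    """Return the summed PCB concentration expressed per unit of lipid.

    Assumption: normalization divides the summed concentration by the
    lipid fraction of the blubber sample (lipid_percent / 100).
    """
    if not 0 < lipid_percent <= 100:
        raise ValueError("lipid_percent must be in the interval (0, 100]")
    return sum_pcbs / (lipid_percent / 100.0)


# Dolphin 232 in Table 2: 66.36 % lipid, summed PCBs of 11,294.23
lip_sum_pcbs = lipid_normalize(11294.23, 66.36)
print(round(lip_sum_pcbs, 2))
```

Under this assumption the result agrees closely with the lipid-normalized value of 17,019.64 listed for the same animal in Table 2, which suggests the assumed formula is consistent with the reported data.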

2.3. The dolphin microarray

The microarray used in this work is a species-specific, custom 4X44K Agilent oligo array. The microarray construction has been described in detail in Mancia et al. (submitted). Briefly, the 60-mers on the array represent 24,418 unigene sequences selected after filtering/trimming/assembling the common bottlenose dolphin expressed sequence tag (EST) collection, publicly available at the National Center for Biotechnology Information (NCBI). The ESTs used were (for the most part) from cDNA libraries that originated from common bottlenose dolphin peripheral blood leukocytes (PBL), liver, kidney, spleen, muscle and skin. The dolphin libraries were developed at the Hollings Marine Lab, Charleston, SC (HML), sequenced at the Baylor College of Medicine and deposited at NCBI. The EST sequences were generated through both Roche 454 pyrosequencing and Sanger sequencing of total RNA: five (5) datasets from the short read archive (SRA), 7971 Sanger ESTs (dbEST) and 75 mRNA sequences from the NCBI nucleotide database. All dolphin sequences were downloaded and trimmed, including removal of vector, low quality regions, low complexity regions and polyA/T tails, using Clemson University’s high-performance computing cluster, Palmetto (http://citi.clemson.edu/palmetto). Contiguous sequences (contigs) were uploaded to Agilent’s eArray website for probe discovery. Sequences that were shown to have potential cross-hybridizations using the Agilent software were filtered to only include the longest, highest sequence count contigs, leaving 24,418 unigene contigs. Thirty housekeeping genes were picked from the unigene list and printed in replicates (n = 5) to be used as internal array controls. These controls are used by Agilent Feature Extraction software to calculate CVs and determine hybridization efficiency. Sixty-mer unigene probes and internal controls (custom and Agilent) were printed in a 4X44K format using the Agilent eArray interface.

2.4. Dolphin microarray hybridization

All RNA labeling and microarray hybridizations were performed according to the manufacturer’s instructions in the One-Color Microarray-Based Gene Expression Analysis manuals (Agilent Technologies, Santa Clara, CA). The microarray was validated using in vitro cultures of dolphin cells (Mancia et al., 2012). For gene expression profiling, total RNA from the 47 blood samples from wild dolphins was extracted using the RNeasy Mini Kit according to the manufacturer (Qiagen, Valencia, CA). RNA was quantified using a NanoDrop ND-1000 (Wilmington, DE), qualified on an Agilent 2100 Bioanalyzer (Foster City, CA) and stored at −80 °C until needed for gene expression profiling. Using the Agilent Quick Amp labeling kit, 500 ng of RNA was amplified and fluorescently labeled with Cy3 dye. This amplification product was measured for quantity and dye incorporation using the NanoDrop ND-1000. 1600 ng of fluorescently labeled RNA was hybridized to the microarray at 65 °C in a rotating oven. After 17 h, the arrays were washed consecutively in solutions of 6× SSPE with 0.005% N-lauroylsarcosine and 0.06× SSPE with 0.005% N-lauroylsarcosine for 1 min each at room temperature. This wash was followed by a final 15 s wash in acetonitrile. Microarrays were then imaged on an Agilent microarray scanner, extracted with Agilent Feature Extraction software version A8.5.3, and the data analyzed with Rosetta Resolver 7.0 gene expression analysis system (Rosetta Informatics, Seattle, WA). The microarray build and hybridization data have been deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO accession numbers GSM1226274–GSM1226295; GSM1226297–GSM1226313; GSM1226326–GSM1226329; GSM1226332–GSM1226334; GSM1226337, GSM1226339 (www.ncbi.nlm.nih.gov/geo/).

2.5. Dolphin microarray data analysis

One-color gene expression arrays were normalized by removing control and flagged data, applying a trimming function to remove the top and bottom 5% of intensity data, then scaling to the mean intensity. Feature intensities underwent a weighted averaging for each probe using the error model for Agilent data in the Resolver software environment (Weng et al., 2006). These data were then used to build ratios between different groupings of animals (e.g. Ht vs Lt; H vs L, where contaminant levels are H, high; L, low; Ht, highest; Lt, lowest). The ratio data were initially filtered using p < 0.0001 and a 2-fold expression cutoff for significant differential expression between and within groups. Differentially expressed genes were then analyzed for gene ontology (GO) enrichment using the web tool DAVID (Database for Annotation, Visualization and Integrated Discovery).

2.6. Machine learning: K-nn and SVM

The K-nearest neighbor algorithm (K-nn), a non-parametric method for classifying objects based on the closest training examples in the feature space, was used for pattern recognition. K-nn is one of the simplest of all machine learning algorithms: an object/sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common amongst its k nearest neighbors. Samples in our study were the animals’ transcriptomes (gene expression values) and the two different classes were defined by chemical contaminant information (SumPCBs data, ΣPCBs, Tables 2 and 3). SumPCBs was chosen as the parameter to correlate with gene expression data because chemical contamination data were available both from the same animals used in the gene expression analysis and from the locations sampled.
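The majority-vote rule of K-nn can be sketched in a few lines of Python. This is a minimal illustration and not the classifier used in the study: the Euclidean distance, the choice k = 3, the function name `knn_classify` and the expression values are all assumptions made for the example; only the class labels H and L follow the text.

```python
import math
from collections import Counter


def knn_classify(train_profiles, train_labels, query, k=3):
    """Assign `query` the most common label among its k nearest training profiles."""
    # Pair each training profile's Euclidean distance to the query with its label,
    # then sort so the closest profiles come first.
    distances = sorted(
        (math.dist(profile, query), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Synthetic expression values for three hypothetical genes (illustrative only).
train_profiles = [
    (2.1, 0.4, 1.8),
    (1.9, 0.5, 2.0),
    (2.3, 0.3, 1.7),
    (0.6, 1.9, 0.5),
    (0.4, 2.1, 0.7),
    (0.5, 2.2, 0.4),
]
train_labels = ["H", "H", "H", "L", "L", "L"]

predicted_high = knn_classify(train_profiles, train_labels, (2.0, 0.5, 1.9))
predicted_low = knn_classify(train_profiles, train_labels, (0.5, 2.0, 0.6))
print(predicted_high, predicted_low)
```

An odd k with two classes avoids tied votes, which is why k = 3 is used in this sketch.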
To confirm the K-nn results we also used a second supervised learning model, the support vector machine (SVM), with associated learning algorithms that analyze data and recognize patterns and that are used for classification and regression analysis. A basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.

Dolphins were separated by sex in the first phase of analysis to eliminate a known gender bias in the transcriptome. For each sex, a K-nn classifier was run to identify a set of genes to discriminate between samples with the greatest and least SumPCBs in their blubber. For each set of genes identified by sex, the classifier algorithm was trained with SumPCBs contaminant levels (H for high and L for low, Ht for highest and Lt for lowest), this time using both male and female samples, for a total of 47 samples. Forty (40) random samples (both male and female) were used for training while the remaining 7 samples were used as ‘unknown’ samples to classify.

2.7. Quantitative real time PCR (qPCR)

Results from the microarray analysis were tested by measuring mRNA expression of three selected genes for six different animals by qPCR (animals were selected based on quantity of RNA remaining after microarrays and content of PCBs measured in blubber): eukaryotic translation initiation factor 4E family member 2 (EIF4E2; Acc. No. 54500296) and GTP binding protein 5 (GTPBP5;


Table 2
Male dolphins used in the microarray analysis and concentration of PCBs measured in blubber.

Site   ID     Sex   Age    Length   Age class   Lipid %   ΣPCBs        Lip ΣPCBs    Group
SAR    232*   M     2.5    195      Subadult    66.36     11,294.23    17,019.64    Lt
SAR    220*   M     5.5    211      Subadult    51.24     15,401.60    30,057.77    Lt
BEA    460    M     –      236      Subadult    59.30     19,396.95    32,708.02    Lt
SAR    218*   M     5.5    216      Subadult    53.74     21,428.67    39,874.71    Lt
BEA    464    M     –      258      Adult       61.96     33,951.50    54,794.20    Lt
SJB    X02    M     32     274      Adult       51.31     29,535.24    57,562.35    L
SJB    X08    M     24     247      Adult       50.55     33,586.75    66,442.63    L
SAP    Z12    M     17     251      Adult       40.63     28,132.00    69,231.21    L
SJB    X06    M     –      178      Calf        66.23     49,650.78    74,967.21    L
SJB    X00    M     2      243      Subadult    61.49     49,193.45    80,002.36    L
SAP    Z00    M     16     242      Adult       36.37     29,263.00    80,469.56    L
BRN    Z16    M     15     238      Adult       45.69     44,862.00    98,190.63    L
BRN    Z20    M            250      Adult       44.55     54,516.00    122,365.10   H
BRN    Z24    M     22     248      Adult       40.00     52,548.00    131,370.00   H
BRN    Z26    M     –      243      Adult       37.27     56,396.00    151,314.41   H
SAP    Z06*   M     18     239      Adult       30.70     49,900.00    162,539.59   H
SAP    Z02    M     –      231      Subadult    30.14     53,767.40    178,400.58   H
SAP    Z08    M     27     257      Adult       23.13     45,826.50    198,158.89   H
SAP    Z10    M     –      257      Adult       28.60     59,433.87    207,823.03   Ht
BRN    Z18*   M     11     224      Adult       35.89     86,996.00    242,386.97   Ht
BRN    Z14    M     –      254      Adult       35.98     118,137.00   328,309.53   Ht
BRN    Z22    M     32     251      Adult       41.90     180,755.00   431,390.30   Ht
SAP    Z04*   M     16     241      Adult       33.16     211,484.00   637,711.61   Ht

SAR, Sarasota, FL; BEA, Beaufort, NC; SJB, Saint Joseph Bay, FL; SAP, Sapelo Island, GA; BRN, Brunswick, GA. L, Low; Lt, Lowest; H, High; Ht, Highest. Lip ΣPCBs are measured in mg/g. (*), samples used in qPCR. Age was determined by reading dentinal and cemental growth layer groups (GLGs): Adult ≥ 10 GLGs; 2 ≤ Subadult < 10 GLGs; Calf:
