Marine Genomics xxx (2015) xxx–xxx


Marine Genomics journal homepage: www.elsevier.com/locate/margen

Optimizing de novo transcriptome assembly and extending genomic resources for striped catfish (Pangasianodon hypophthalmus)

Nguyen Minh Thanh a,⁎,1, Hyungtaek Jung b,c,⁎⁎,1, Russell E. Lyons d, Isaac Njaci b, Byoung-Ha Yoon e,f, Vincent Chand c, Nguyen Viet Tuan c, Vo Thi Minh Thu a, Peter Mather c

a International University - VNU HCMC, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Viet Nam
b Centre for Tropical Crops and Biocommodities, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia
c Science and Engineering Faculty, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia
d Animal Genetics Laboratory, School of Veterinary Science, University of Queensland, Gatton, QLD 4343, Australia
e Medical Genomics Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Republic of Korea
f Department of Functional Genomics, Korea University of Science and Technology, Daejeon 305-333, Republic of Korea

Article info

Article history: Received 15 December 2014; Received in revised form 3 May 2015; Accepted 3 May 2015; Available online xxxx

Keywords: Ion Torrent; Pangasianodon hypophthalmus; Salinity tolerance; Simple sequence repeat; Single nucleotide polymorphism; Transcriptome

Abstract

Striped catfish (Pangasianodon hypophthalmus) is a commercially important freshwater fish used in inland aquaculture in the Mekong Delta, Vietnam. The culture industry is, however, facing a significant challenge from saltwater intrusion into many low-lying coastal provinces across the Mekong Delta as a result of predicted climate change impacts. Developing genomic resources for this species can facilitate the production of improved culture lines that can withstand raised salinity conditions, and so we have applied high-throughput Ion Torrent sequencing of transcriptome libraries from six target osmoregulatory organs from striped catfish as a genomic resource for use in future selection strategies. We obtained 12,177,770 reads after trimming and processing with an average length of 97 bp. De novo assemblies were generated using CLC Genomic Workbench, Trinity and Velvet/Oases with the best overall contig performance resulting from the CLC assembly. De novo assembly using CLC yielded 66,451 contigs with an average length of 478 bp and N50 length of 506 bp. A total of 37,969 contigs (57%) possessed significant similarity with proteins in the non-redundant database. Comparative analyses revealed that a significant number of contigs matched sequences reported in other teleost fishes, ranging in similarity from 45.2% with Atlantic cod to 52% with zebrafish. In addition, 28,879 simple sequence repeats (SSRs) and 55,721 single nucleotide polymorphisms (SNPs) were detected in the striped catfish transcriptome. The sequence collection generated in the current study represents the most comprehensive genomic resource for P. hypophthalmus available to date. Our results illustrate the utility of next-generation sequencing as an efficient tool for constructing a large genomic database for marker development in non-model species. © 2015 Elsevier B.V. All rights reserved.

⁎ Corresponding author. Tel.: +84 8 3724 4270.
⁎⁎ Correspondence to: H. Jung, Centre for Tropical Crops and Biocommodities, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia. Tel.: +61 7 3138 1111.
1 Equal contribution.

1. Introduction

Striped catfish, Pangasianodon hypophthalmus, is one of the most important farmed freshwater fish produced in the Mekong Delta and has contributed remarkably to the growth of the fishery sector in Vietnam. In 2014, striped catfish production was 1.1 million tonnes with an estimated export value of US$ 1.77 billion, which was only slightly lower than the total export value for marine shrimps (Directorate of Fisheries, 2015). In spite of this high commercial significance, there has been little real progress towards improving the productivity of striped catfish culture stocks to meet increasing demand for product from both domestic and overseas markets. A breeding programme to improve economically important traits (body weight and fillet yield) in striped catfish was initiated in 2001 by the Southern National Breeding Centre for Freshwater Aquaculture under the auspices of Research Institute for Aquaculture No. 2, Vietnam (Sang et al., 2009, 2012). While progress with this breeding programme has yet to be reported, apart from growth rate and certain fillet traits, an emerging trait of importance is salinity tolerance. This is because striped catfish culture lines with tolerance to raised salinity conditions can contribute to long-term sustainability and growth of the industry in the Mekong Delta where most catfish farming is practiced. To date however, there have been relatively few physiological or genetic studies

http://dx.doi.org/10.1016/j.margen.2015.05.001 1874-7787/© 2015 Elsevier B.V. All rights reserved.

Please cite this article as: Thanh, N.M., et al., Optimizing de novo transcriptome assembly and extending genomic resources for striped catfish (Pangasianodon hypophthalmus), Mar. Genomics (2015), http://dx.doi.org/10.1016/j.margen.2015.05.001


N.M. Thanh et al. / Marine Genomics xxx (2015) xxx–xxx

of striped catfish and information on genetic resources in this species is also very limited (but see Ha et al. (2009), Na-Nakorn and Moeikum (2009), Nguyen (2009) for microsatellite studies; Wong et al. (2011) who employed DNA barcoding for species identification; and a study of hypoxia tolerance by Lefevre et al. (2011)).

Next-generation sequencing (NGS) technologies have opened many opportunities to develop molecular resources for non-model species that are of biological and economic interest. While whole genome sequencing remains out of reach for many non-model species, transcriptome sequencing has become more accessible, allowing deep understanding at the genomic level. Comprehensive transcriptomes are useful because they offer the essential resources for comparing many candidate genes among taxa, which allows insights into the molecular differences that may underlie adaptive changes (Vasemägi and Primmer, 2005). In addition, functional genomics datasets represent valuable collections of biomarker genes that may be mined to identify suitable candidates for SNP marker development, whereby high-throughput SNPs potentially can be utilized for marker assisted selection (MAS) to develop superior aquaculture lines (Dunham et al., 2014). These strategies are relevant to the genetic improvement of farmed striped catfish, a non-model species lacking a complete and well-annotated genome.

While transcriptomic data for striped catfish have become available recently (Thanh et al., 2014), the data are only representative of two tissue types (intestine and swim bladder). In the current study, RNA libraries were generated for additional key osmoregulatory organs (gill, kidney, liver and muscle) using RNA derived from the heaviest and lightest weight individuals cultured over a set timeframe to maximise transcript representation and diversity (Bilyk and Cheng, 2013). Osmoregulation in teleost fish is mediated by a range of organs including intestine, kidney and gill (Laverty and Skadhauge, 2012) while the liver plays an essential physiological role in response to stress due to its fundamental role in metabolism (Liu et al., 2013a). In addition, muscle development is also dependent on salinity level (Gong et al., 2004).

The transcriptomes were developed using the Ion Torrent platform. The transcriptomes from intestine and swim bladder generated by Thanh et al. (2014) were also added to the current datasets to increase genetic resources for striped catfish. Transcriptome sequences were assembled into contigs via an optimization of de novo transcriptome assembly strategies. Functional annotation and gene ontology were performed, and a large number of simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) were identified. From this database, protein domains, putative genes and gene families likely to have a role in salinity adaptation in striped catfish were also determined and refined with extended genomic resources. To our knowledge, this is the most comprehensive report of a transcriptome from striped catfish developed to date. The improved and extended data will be used to construct a genomic database to support current and future genetic breeding and stock improvement programmes for this important culture species.

2. Materials and methods

2.1. Experimental fish

Grow-out experiments were carried out at four salinity levels (6, 9, 12 and 15 ppt) to assess sub-lethal salinity effects on individual growth performance of striped catfish fingerlings. Fingerlings (8–10 g/fish) were reared in 500 L fibreglass tanks at a stocking density of 50 fish/tank for six weeks. Fish were fed ad libitum twice a day using a commercial pelleted feed (30% protein content). Striped catfish fingerlings reared at 9 ppt salinity grew faster but their growth was not significantly different from those reared at any other salinity level or the control. As a result, tissues from fingerlings reared at 9 ppt salinity level were sampled for further transcriptomic analysis.

2.2. Sample collection and RNA extraction

Four target osmoregulatory tissues, including gill, kidney, liver and muscle, were sampled from individuals adapted to a raised salinity level of 9 ppt. Tissue samples were preserved in RNAlater (Ambion) prior to RNA extraction and were transported to the Molecular Genetics Research Facility at the Queensland University of Technology (QUT), Brisbane, Australia for transcriptomic analysis. A total of six samples per tissue (the three heaviest individuals and three lightest individuals raised at 9 ppt) were used in the current study. Total RNA was extracted using TRIzol/Chloroform reagent (Invitrogen) (Chomczynski and Mackey, 1995). Total RNA was treated with Turbo DNA-free (Ambion) to remove any contaminating gDNA and was purified further using an RNeasy Mini Kit (QIAGEN). RNA quality and quantity were assessed using both a Bioanalyzer (Agilent) and a Qubit 2.0 fluorometer (Invitrogen). The mRNA from each tissue type was isolated using a Dynabead mRNA Purification Kit (Invitrogen) according to the manufacturer's protocol.

2.3. Library construction and Ion-Torrent sequencing

High quality mRNA was fragmented into 100–200 bp fragments using an Ion Total RNA-Seq kit (Life Technologies) and was cleaned with RiboMinus Concentration Module (Invitrogen). The mRNA fragments were then converted to cDNA using the Ion Total RNA-Seq kit (Life Technologies) according to the manufacturer's guidelines. The cDNA library from each tissue comprised a pool of cDNAs prepared from the three heaviest and three lightest individuals and was quantified using a Qubit 2.0 fluorometer (Invitrogen) and a Bioanalyzer (Agilent). Templates for sequencing were prepared using a OneTouch Ion™ Template Kit (Life Technologies) following the Ion Xpress™ single read template 200 sequencing protocol. Templates from each tissue library were sequenced on 316 semiconductor chips using the Ion PGM™200 Sequencing Kit and PGM chemistry (Life Technologies) according to the manufacturer's protocol. A total of four 316 chips (one for each tissue library) were used for Ion Torrent sequencing.

2.4. Assembly strategy

All sequence reads, including two previously published data sets (Thanh et al., 2014), were taken directly from the PGM sequencer and run through the Ion-Torrent server using default quality filtering parameters to remove sequencing adapters, poor sequences and very short sequences (<20 bp). Sequence reads were converted to FastQ files and further assessed for quality scores (Q > 20). Pre-processed sequences were then assembled using assembly programmes applying default or optimal parameters. Of the various genomics software available, we employed CLC Genomic Workbench (v6.0.4), Velvet/Oases (Robertson et al., 2010) and Trinity (r2013-08-14) (Grabherr et al., 2011). For CLC and Velvet analyses, we applied a multiple k-mer assembly approach to maximise assembly contiguity and sensitivity (Liu et al., 2013b). The CLC de novo assembly was performed with multiple k-mer lengths, from the default (k = 20) to the maximum (k = 60) setting in steps of 10, using a minimum contig length of 200 bp. Velvet was run using different k-mer lengths of 21 to 71 in addition to other default parameters. Trinity was run using default parameters with a default k-mer of 25. Metrics used to assess assembly quality included: number of contigs, N50 length, average contig length and maximum contig length (Jiang et al., 2011). The mRNA sequence datasets from each tissue were considered separately as being representative of the transcriptome of that tissue type at the time of sampling. Assuming that some transcripts would be replicated across tissue datasets, they were merged into a combined dataset.
While some identical contigs could be generated from more than one assembly or library introducing duplicates, further redundancy removal was not conducted here to maximise total contig numbers. Only outcomes


from the best assembler (namely the CLC Genomic Workbench here) were used in all further downstream analyses. All catfish EST sequences obtained were submitted to the NCBI Sequence Read Archive under Accession No. SRP028517.

2.5. Identification of protein domains and open reading frames (ORFs)

Functional annotation of all contigs was performed using BLASTx searches (Altschul et al., 1997) against non-redundant (NR) database in NCBI (E-value threshold < 1e−5). For further downstream analyses, only contigs generated from CLC Genomic Workbench were utilized using a number of programmes and databases publicly available for non-model species, including Blast2GO (Götz et al., 2008), InterProScan (Hunter et al., 2012) and orfPredictor (Min et al., 2005).
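As an illustration of the E-value filter described above, the following minimal Python sketch counts contigs with at least one significant hit. It assumes BLASTx results saved in the standard 12-column tabular format (E-value in column 11); the function name and the toy rows are hypothetical and are not taken from the study, which relied on Blast2GO and related tools.

```python
import csv
import io


def best_significant_hits(tabular_text, max_evalue=1e-5):
    """Map each query contig to its lowest-E-value hit below the cutoff.

    Expects 12-column BLAST tabular rows:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip comments and malformed rows
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical toy data: three contigs, one of which has no significant hit.
toy = "\n".join(
    "\t".join(fields)
    for fields in [
        ["contig_1", "protA", "80.0", "100", "20", "0", "1", "300", "1", "100", "1e-30", "150"],
        ["contig_1", "protB", "90.0", "100", "10", "0", "1", "300", "1", "100", "1e-50", "200"],
        ["contig_2", "protC", "40.0", "50", "30", "0", "1", "150", "1", "50", "0.01", "30"],
        ["contig_3", "protD", "60.0", "80", "32", "0", "1", "240", "1", "80", "2e-6", "55"],
    ]
)

hits = best_significant_hits(toy)
total_contigs = 3
print(f"{len(hits)} of {total_contigs} contigs have a significant hit")
```

Dividing the number of annotated contigs by the total contig count gives the kind of percentage reported for each assembly.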


2.6. Comparative transcriptomic analysis

Only contigs from the CLC assembly were searched against the NR database using BLASTx (E-value < 1e−5). For transcriptome comparisons, a reciprocal comparison approach was employed to identify putative orthologous genes in P. hypophthalmus (Du et al., 2012; Liu et al., 2013b). A comparison was conducted between P. hypophthalmus and seven teleost species with whole genomic sequences available in the Ensembl database, including Atlantic cod (Gadus morhua), fugu (Takifugu rubripes), medaka (Oryzias latipes), three-spined stickleback (Gasterosteus aculeatus), green spotted pufferfish (Tetraodon nigroviridis), Nile tilapia (Oreochromis niloticus) and zebrafish (Danio rerio) (Liao et al., 2013). First, tBLASTn was performed with the protein sequences of the seven teleosts against P. hypophthalmus contigs applying an E-value < 1e−10. Then BLASTx and BLASTn (E-value < 1e−10) were performed for striped catfish contigs against the seven teleosts to estimate the number of transcripts and genes represented in striped catfish.

2.7. EST-SSR and EST-SNP discovery

The QDD programme (Meglécz et al., 2010) was used to identify simple sequence repeat (SSR) motifs in all unique sequences generated using the CLC Genomic Workbench. Default settings were employed to search all types of SSRs from dinucleotides to hexanucleotides. To be assigned, the minimum number of repeat units was defined as 6 for dinucleotides and 5 for all other SSR types. Perl script modules linked to primer modelling software Primer3 (Rozen and Skaletsky, 2000) were used to design PCR primers flanking each unique SSR region identified. To determine putative SNPs, the best assembled contigs generated from CLC were used to call SNPs using BWA (Li and Durbin, 2009) and SAMtools (Li et al., 2009). A sequence variation was counted as a SNP or indel (insertion or deletion) when a mismatch was identified in contigs in four or more sequences and the minor allele sequence was present in at least two sequences within a contig (Gao et al., 2012). Total number of transitions or transversions (Ts/Tv) and overall ratio were calculated across the dataset.

3. Results and discussion

3.1. Ion-Torrent sequencing

Table 1 presents the results of Ion-Torrent sequencing and read processing from six tissue types. Sequencing produced 1678.32 Mbp across all six libraries that in combination produced a total of 13,488,357 raw reads, resulting in 122 bp per read on average. Kidney tissue produced the largest number of raw reads (2,873,310), while gill tissue produced the longest average length per read (140 bp). After pre-processing to remove ambiguous nucleotides, low-quality reads (quality scores < Q20) and sequences less than 20 bp, 1197.47 Mbp and 12,177,770 trimmed reads were obtained for all six libraries combined with an average 97 bp per read, ranging from 89 bp per read for muscle to 105 bp per read for gill.

3.2. De novo transcriptome assembly

To achieve a reliable assembly result, we applied de novo assembly of six single libraries and all libraries combined using three assembly programmes, namely CLC Genomic Workbench, Trinity and Velvet/Oases applying default (Table 2) or optimized parameters (Fig. 1). We compared the three assemblies based on the following criteria: number of contigs, N50 length of contigs, average contig length, maximum contig length, and number of contigs > 1 Kbp. Each assembly programme was observed to have individual relative strengths and weaknesses. For example, Trinity produced an assembly with the highest total number of contigs (109,092), Velvet/Oases generated an assembly with the largest contig size (52,331 bp), while CLC generated an assembly with the largest contig N50 (506 bp), mean contig length (478 bp) and number of contigs > 1000 bp (18,563). In general, the better assembler would be expected to return a high number of contigs with significant hits and will show a high coverage of the NR database (Zhou et al., 2012). By these standards, CLC produced the largest proportion of contigs with significant hits (57.49%) and average coverage (30.69×). The performance of de novo assemblies using the three assemblers showed very similar trends for individual and the combined dataset. Despite a number of assemblers being available that can handle a vast volume of short-reads efficiently, transcriptome assembly can still be difficult because of alternative splice transcripts (Surget-Groba and Montoya-Burgos, 2010) and highly variable transcriptome coverage depending on gene expression level (Zerbino and Birney, 2008). Therefore, the quality of a de novo transcriptome assembly largely depends on the user-defined sequence overlap length between two reads indicated as k-mer length (Surget-Groba and Montoya-Burgos, 2010).
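To make the role of the k-mer length concrete: a read of length L contributes L − k + 1 overlapping k-mers to a de Bruijn graph, so a longer k leaves fewer k-mers per short read. The short Python sketch below illustrates this with the mean trimmed read length (97 bp) and the CLC k-mer settings (20 to 60 in steps of 10) stated in the text; the function names are hypothetical and the code is not part of any assembler.

```python
def kmers(read, k):
    """Return all overlapping k-mers of a read (empty list if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmers_per_read(read_length, k):
    """Number of k-mers a read of the given length contributes: L - k + 1, floored at 0."""
    return max(read_length - k + 1, 0)


print(kmers("ATGGCGT", 4))  # ['ATGG', 'TGGC', 'GGCG', 'GCGT']

# k-mers contributed by a 97 bp read at each k-mer setting
for k in range(20, 61, 10):
    print(k, kmers_per_read(97, k))
```

With 97 bp reads the count falls from 78 k-mers at k = 20 to 38 at k = 60, which is one reason short-read datasets can respond strongly to the choice of k.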
To obtain the best k-mer for de novo assembly in the current datasets, assembly optimization was performed applying different k-mer lengths with the CLC Genomic Workbench and Velvet/Oases (Fig. 1). CLC with a k-mer length of 20 generated the largest N50, best average contig length and the greatest number of contigs above 1000 bp in length while the number of contigs and maximum

Table 1. Overview of Ion Torrent sequencing and read processing. Intestine and swim bladder data have been published before in Thanh et al. (2014).

| Statistics / Dataset name | All | Intestine^a | Gill | Kidney | Liver | Muscle | Swim bladder^a |
|---|---|---|---|---|---|---|---|
| Total number of bases before processing (Mbp) | 1678.32 | 149.72 | 390.14 | 378.14 | 273.83 | 168.35 | 318.14 |
| Total number of Q20 bases (Mbp) | 1400.74 | 121.71 | 326.47 | 319.35 | 222.87 | 137.83 | 272.51 |
| Total number of raw reads | 13,488,357 | 1,436,720 | 2,785,823 | 2,873,310 | 2,209,003 | 1,503,797 | 2,679,704 |
| Average read length (bp) | 122 | 104 | 140 | 132 | 123 | 114 | 119 |
| Total number of bases after trimming and processing (Mbp) | 1197.47 | 110.02 | 278.02 | 272.73 | 185.87 | 116.24 | 234.59 |
| Total number of trimmed reads used for assembly | 12,177,770 | 1,264,476 | 2,648,594 | 2,623,929 | 1,903,147 | 1,317,064 | 2,420,560 |
| Average read length after trimming and processing (bp) | 97 | 87 | 105 | 104 | 98 | 89 | 97 |

Trimming and processing indicate > Q20 and > 20 bp for Ion-Torrent.
a Dataset added from previous work (Thanh et al., 2014).
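The trimming criteria can be sketched in a few lines of Python. This is an illustrative filter only: it assumes Phred+33 encoded FASTQ records and applies the Q20 threshold to the mean quality of each read, whereas the Ion-Torrent server's default filtering may work differently. Reads shorter than 20 bp are discarded, following the description in Section 2.4. The function names and toy reads are hypothetical.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def filter_reads(fastq_lines, min_mean_q=20, min_length=20):
    """Yield (header, sequence, quality) for reads passing both thresholds.

    fastq_lines is a list of lines in 4-line FASTQ records.
    """
    for i in range(0, len(fastq_lines) - 3, 4):
        header, seq, _, qual = (line.rstrip("\n") for line in fastq_lines[i:i + 4])
        # Length is checked first so very short or empty reads are dropped early.
        if len(seq) >= min_length and mean_phred(qual) > min_mean_q:
            yield header, seq, qual


# Hypothetical toy records: 'I' encodes Q40 and '+' encodes Q10 in Phred+33.
toy = [
    "@read1", "ACGT" * 6 + "A", "+", "I" * 25,   # 25 bp, Q40: kept
    "@read2", "ACGT" * 6 + "A", "+", "+" * 25,   # 25 bp, Q10: low quality
    "@read3", "ACGTACGTAC", "+", "I" * 10,       # 10 bp: too short
]

kept = list(filter_reads(toy))
print([header for header, _, _ in kept])
```

In practice an established trimming tool would normally be preferred over hand-written filtering; the sketch only shows what the two thresholds mean.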


Table 2. Results of de novo assemblies using different assemblers run with default parameters. Intestine and swim bladder data have been published before in Thanh et al. (2014).

| Assembler | Parameter | All | Intestine^b | Gill | Kidney | Liver | Muscle | Swim bladder^b |
|---|---|---|---|---|---|---|---|---|
| CLC Bio (K-mer: 20) | No. of total contigs | 66,451 | 8431 | 32,485 | 29,940 | 9160 | 4386 | 25,518 |
| | Total bases of contigs (bp) | 31,755,136 | 3,486,271 | 13,170,782 | 12,392,014 | 4,002,243 | 1,802,243 | 10,668,072 |
| | No. of contig ≥ 1000 bp | 18,563 | 1629 | 6144 | 6089 | 1973 | 782 | 5257 |
| | Contig N50 (bp) | 506 | 411 | 409 | 417 | 437 | 410 | 421 |
| | Average contig length (bp) | 478 | 414 | 405 | 414 | 437 | 411 | 418 |
| | Largest contig (bp) | 7743 | 4056 | 6596 | 3462 | 6784 | 8025 | 6305 |
| | Contigs with significant hits^a | 38,206 (57.49%) | 6211 (73.67%) | 18,420 (56.70%) | 18,199 (60.78%) | 5835 (63.71%) | 3290 (75.03%) | 17,444 (68.36%) |
| | Average coverage (x) | 30.69 | 22.04 | 14.48 | 15.72 | 37.22 | 49.82 | 16.44 |
| Trinity (K-mer: 25) | No. of total contigs | 109,092 | 13,530 | 53,836 | 47,964 | 15,496 | 3385 | 38,647 |
| | Total bases of contigs (bp) | 43,502,400 | 4,716,859 | 19,019,702 | 17,322,804 | 5,529,670 | 1,224,276 | 14,037,167 |
| | No. of contig ≥ 1000 bp | 2887 | 195 | 634 | 744 | 246 | 71 | 685 |
| | Contig N50 (bp) | 427 | 352 | 361 | 371 | 363 | 370 | 372 |
| | Average contig length (bp) | 398 | 348 | 353 | 361 | 357 | 361 | 363 |
| | Largest contig (bp) | 3830 | 2486 | 2598 | 2571 | 4010 | 2922 | 2889 |
| | Contigs with significant hits^a | 59,676 (54.70%) | 9508 (70.23%) | 30,112 (55.93%) | 27,137 (56.58%) | 9481 (61.18%) | 2698 (79.70%) | 24,758 (64.06%) |
| | Average coverage (x) | 21.53 | 18.32 | 11.62 | 12.74 | 28.61 | 71.95 | 12.71 |
| Velvet/Oases (K-mer: 31) | No. of total contigs | 97,784 | 8500 | 48,051 | 36,512 | 13,316 | 2111 | 24,735 |
| | Total bases of contigs (bp) | 34,560,975 | 2,825,143 | 13,980,742 | 11,116,409 | 4,020,153 | 793,007 | 7,714,672 |
| | No. of contig ≥ 1000 bp | 4982 | 422 | 1294 | 1172 | 382 | 139 | 931 |
| | Contig N50 (bp) | 471 | 447 | 349 | 372 | 368 | 561 | 379 |
| | Average contig length (bp) | 353 | 332 | 291 | 304 | 302 | 376 | 312 |
| | Largest contig (bp) | 52,331 | 5432 | 23,598 | 14,498 | 34,332 | 7235 | 6131 |
| | Contigs with significant hits^a | 40,023 (40.93%) | 4817 (56.67%) | 20,107 (41.85%) | 15,948 (43.68%) | 5864 (44.04%) | 1389 (65.80%) | 12,437 (50.28%) |
| | Average coverage (x) | 26.64 | 27.94 | 13.86 | 17.53 | 37.23 | 123.58 | 22.41 |

a Contigs showing significant hits (E < 1e−5) with non-redundant database (BLASTx).
b Dataset added from previous work (Thanh et al., 2014).
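For readers unfamiliar with the contig metrics reported in Table 2, the following Python sketch shows how they are conventionally computed from a list of contig lengths. N50 is taken here as the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The function name and the toy lengths are hypothetical; the assemblers report these statistics themselves.

```python
def assembly_stats(contig_lengths, long_cutoff=1000):
    """Summarise an assembly: contig count, total bases, N50, mean and largest contig."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # cumulative length has reached half the total
            n50 = length
            break
    return {
        "n_contigs": len(lengths),
        "total_bases": total,
        "n_long": sum(1 for x in lengths if x >= long_cutoff),
        "n50": n50,
        "mean_length": total / len(lengths) if lengths else 0,
        "largest": lengths[0] if lengths else 0,
    }


# Hypothetical toy assembly of five contigs (lengths in bp)
stats = assembly_stats([200, 300, 400, 500, 1200])
print(stats)
```

The average contig length is simply total bases divided by contig count. For the combined CLC assembly in Table 2, 31,755,136 bp over 66,451 contigs gives about 478 bp, which agrees with the reported value.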

contig size appeared to plateau with a k-mer length of 60. In the Velvet/Oases assembly, efficiency of assembly varied with the length of k-mer (Fig. 1). The highest number of contigs and greatest number of contigs > 1 Kbp were obtained with a k-mer length of 20, and maximum contig size resulted from a k-mer length of 30. Largest N50 length and average contig length appeared, however, to plateau with a broad k-mer range of 20 to 70. It is generally agreed that larger values of these criteria indicate better assembly performance (Sadamoto et al., 2012). As highlighted in Fig. 1, assembly optimization for CLC and Velvet/Oases in the current study did not show the best k-mer for all criteria. Choosing an appropriate assembler with the best parameters is critical for achieving optimal assembly performance, and this is of particular importance in transcriptomic analyses involving non-model species. While the assemblers used here were developed for de novo genome/transcriptome assembly based on the de Bruijn graph algorithm (Garg et al., 2011; Duan et al., 2012), they apply different methods for dealing with sequence errors and single/pair-end data and may also differ in their relative abilities to capture different portions of the transcriptome with accuracy (Liu et al., 2013b). Trinity is designed to reconstruct highly expressed transcripts to full length using only a single k-mer length (Schulz et al., 2012) and is reported to be efficient in recovering full-length transcripts and spliced isoforms (Grabherr et al., 2011). Velvet targets de novo assembly of short reads with paired ends (Duan et al., 2012) and is a very popular choice due to its effectiveness in capturing both high and low expressed transcripts (Liu et al., 2013b). Oases is designed to deal with RNA-seq data that often includes uneven coverage and alternative splicing (Schulz et al., 2012). 
CLC, which is commercially available for short-read assembly (Miller et al., 2010), is a very comprehensive package that integrates analysis functions for both nucleotide and protein sequences (Wang and Liu, 2011). Ideally, the optimal assembler will use almost all of the reads input (Zhou et al., 2012). In this respect, Trinity utilized the larger number of bases and generated the highest number of contigs compared with Velvet/Oases and CLC. It should be recognised however, that unlike genomic sequences, transcriptomic datasets contain multiple variants of singular transcripts (Garg et al., 2011) and reads

may be joined together, even if they do not belong together (Haridas et al., 2011). These outcomes are not suitable for gene ontology (GO) analysis. Therefore, the highest number of contigs is not the best criterion for choosing the most appropriate assembler. According to Liu et al. (2013b), a higher N50 length and average contig length are considered to be the benchmarks for choosing the best assembly. Our results show that CLC behaved optimally for these criteria (Fig. 1C, E). In addition, in terms of assembly sensitivity, a longer k-mer will provide higher specificity but may not assemble all the available data, while a smaller k-mer is more sensitive and may allow joining of more reads; a smaller k-mer can also produce an efficient assembly with improved input data quality (Haridas et al., 2011). CLC (applying a default k-mer length of 20) also performed best for all assembly criteria. Direct comparisons between assemblers can be difficult and the choice of best assembler will depend on the dataset and will need to be optimized (Garg et al., 2011). From the data generated by the assemblies assessed, CLC appeared to provide the best assembly statistics and produced the lowest number of contigs and highest values for N50 length and average contig length. The CLC assembly therefore, was selected for further downstream analysis with contig lengths ranging from 200 bp to 7743 bp and 581 contigs exceeding 2 Kb in length (Fig. S1).

3.3. Comparative genomic analysis

A total of 66,451 P. hypophthalmus contigs generated with CLC were searched against the GenBank NR database. From BLASTx searches, 37,969 of the 66,451 contigs (57%) possessed significant similarity (E-value < 1e−5) with proteins in the NR database (Table S1). In comparison with the top 30 hit species, de novo assembly of the P. hypophthalmus contig sequences matched with 61.5% teleost proteins and approximately 7% with other vertebrate proteins (Fig. S2). The top five species ‘hits’ in BLASTx analyses were zebrafish (22,937 hits), Nile tilapia (4930 hits), green spotted pufferfish (1785 hits), channel catfish Ictalurus punctatus (1482 hits) and Atlantic salmon Salmo salar (1482 hits), respectively. These results reflect the relatively close phylogenetic relationship of the striped catfish with other teleost fishes or the abundant genomic information available for these species in the


Fig. 1. Comparison of de novo assemblies generated by CLC, Trinity, Velvet/Oases with multi k-values for a number of contigs (A), N50 lengths (B), average contig lengths (C), maximum contig lengths (D) and a number of contigs > 1 Kbp (E).

GenBank database. A small fraction of P. hypophthalmus sequences (0.85%) matched protozoan/parasitic sequences, including Paramecium tetraurelia, Tetrahymena thermophila and Plasmodium falciparum (Fig. S2). In our study, mRNA was isolated from multiple tissues, particularly from intestine and gill libraries where Protozoa and bacteria may be present as naturally symbiotic microorganisms. Identification of genes from xenobiotic organisms in target species transcriptomes has been reported in many earlier studies (Miller et al., 2008; Vera et al., 2008; Hale et al., 2010; Liu et al., 2013b).

While striped catfish and ictalurid catfish share a close evolutionary relationship as both belong to the order Siluriformes (Jondeung et al., 2007), only 5% of striped catfish sequences (1969 of the 37,969 contigs) matched sequences from I. punctatus and I. furcatus. Of interest, no hits were recorded with the genus Pangasianodon or Pangasius, a close congener of Pangasianodon, in the BLASTx top hit species. This may be due to the limited number of protein sequences for Pangasianodon and Pangasius (168 and 423 proteins respectively) currently available in the NCBI database (accessed on 27 February 2015). Transcriptomic


resources generated for striped catfish in the current study will therefore provide a wealth of new genes for future studies in P. hypophthalmus and other related catfish species. In addition, ORF predictions resulted in 65,963 ORFs being identified and 45,288 ORFs (approximately 68%) possessed a minimum length ranging from 50 to 100 amino acids (Fig. S3, Table S2). This outcome was relatively low but again most likely results from the limited number of striped catfish gene and protein sequences currently available in public databases. In an effort to assess our transcriptome assembly in more detail, P. hypophthalmus assembled contigs were compared with Ensembl sequences from seven teleosts. Reciprocal tBLASTn showed that 85% Atlantic cod (18,887 hits) and medaka (20,987 hits) and up to 93% fugu (44,396 hits) matched with striped catfish transcripts (Table 3). A high degree of similarity between the transcriptome generated from striped catfish here and those of other teleosts indicated a strong conservation of gene content among teleost fish. From the BLASTn searches, striped catfish transcripts also showed the highest number of significant hits with zebrafish (25,295 hits out of 66,451 contigs or 38%). This can be explained because striped catfish and zebrafish are quite closely related phylogenetically as both fishes belong to the group Ostariophysi (Sarropoulou and Fernandes, 2011), and the sequence dataset available for zebrafish is larger than that available for other teleost species compared here. The transcriptome from striped catfish also shared a limited number of genes with annotated genes from Atlantic cod, fugu, medaka, three-spined stickleback, green spotted pufferfish and Nile tilapia (25 to 28%). Lower similarity levels may imply that many transcripts represented in striped catfish are unique genes novel to this species and may not have been identified previously in other teleosts. 
In addition, the transcriptomic analysis for striped catfish in the current study originated from only six tissues and was unlikely to cover the complete transcriptome diversity. As a result, some rare transcripts may have been missed or only collected as singletons during assembly.
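The percentages reported in Table 3 can be re-derived from the hit counts. The short Python sketch below is an illustration only and is not part of the original analysis. The BLASTn denominator (66,451 contigs) is stated in the text; the sketch additionally assumes that BLASTx percentages use the same denominator and that tBLASTn percentages are taken relative to the number of proteins in each subject species, which is inferred from the table rather than stated. The names `TABLE3`, `TOTAL_CONTIGS` and `pct` are hypothetical.

```python
# Hit counts transcribed from Table 3.
# Tuple layout (an illustrative choice, not from the paper):
# (subject proteins, tBLASTn hits, BLASTx hits, BLASTn hits)
TOTAL_CONTIGS = 66_451  # assembled P. hypophthalmus contigs

TABLE3 = {
    "Atlantic cod": (22_100, 18_887, 30_065, 16_990),
    "Fugu": (47_841, 44_396, 31_239, 16_757),
    "Medaka": (24_661, 20_987, 30_713, 16_342),
    "Three-spined stickleback": (27_576, 23_870, 31_511, 18_062),
    "Green spotted pufferfish": (23_118, 20_610, 30_579, 16_510),
    "Nile tilapia": (26_763, 23_915, 32_555, 18_640),
    "Zebrafish": (42_555, 36_584, 34_563, 25_295),
}


def pct(hits, total):
    """Return hits as a percentage of total, rounded to one decimal place."""
    return round(100 * hits / total, 1)


def recompute(table=TABLE3, total_contigs=TOTAL_CONTIGS):
    """Recompute the three percentage columns for every species."""
    result = {}
    for species, (n_prot, tblastn, blastx, blastn) in table.items():
        result[species] = (
            pct(tblastn, n_prot),        # assumed denominator: subject proteins
            pct(blastx, total_contigs),  # assumed denominator: contigs
            pct(blastn, total_contigs),  # denominator stated in the text
        )
    return result


for species, values in recompute().items():
    print(species, values)
```

Under these assumptions the recomputed values agree with the published columns, which supports reading the BLASTx and BLASTn percentages as fractions of the contig set rather than of the subject database.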

3.4. Protein domain

InterProScan searches identified 6369 protein domains based on 66,451 contigs that originated from six tissues from striped catfish (Table S3). Highly represented protein domains included immunoglobulin, protein kinase and WD domain (Table 4). While the results are similar to a transcriptomic analysis conducted on intestine and swim bladder tissues from striped catfish (Thanh et al., 2014), the improved and refined outcomes generated from the current study will be more informative for identifying putative genes and structural variants in future studies. Of interest, several additional protein domains not previously observed in striped catfish were detected in the current study when additional osmoregulatory organs including gill and kidney were analyzed. Large domains found among the striped catfish EST sequences included zinc finger, C2H2 (n = 355), zinc finger C2H2-type/integrase DNA-binding domain (n = 265), zinc finger, C2H2-like (n = 265) and zinc finger, RING/FYVE/PHD-type (n = 220) (Table 4). Another large domain detected was fibronectin, type III with 251 sequences predicted (Table 4). Fibronectin is a high-molecular-weight glycoprotein and is present in plasma and the extracellular matrix found in most tissues. The protein plays an important role in several cellular processes including cell adhesion, migration, growth, differentiation, and many other factors that influence survival in invertebrates (Hynes, 1990). In fish, fibronectin was first identified and characterized in zebrafish (Zhao et al., 2001) and later in Japanese catfish Silurus asotus (Mori et al., 2007). The other abundant domain present in the P. hypophthalmus transcriptome was serine/threonine-/dual specificity protein kinase, catalytic domain (n = 237) (Table 4). Protein kinases are a large family of enzymes, many of which mediate responses of eukaryotic cells to external stimuli and they commonly include highly conserved residues in the protein kinase catalytic domain that are predicted to play important roles in catalysis (Zhao et al., 2001). According to Hanks et al. (1988), protein kinases fall into three broad classes, characterized with respect to substrate specificity and include serine/threonine-protein kinases, tyrosine-protein kinases and dual specificity protein kinases. Sun et al. (2013) reported that G-type lectin S-receptor-like serine/threonine protein kinase with a highly conserved serine/threonine protein kinase catalytic domain plays a crucial role in plant responses to salt stress. Furthermore, PKR (protein kinase R), a serine–threonine kinase, has roles in the innate immune response in fugu (Del Castillo et al., 2012).
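A domain summary such as Table 4 can be tallied from InterProScan output. The following Python sketch is a hypothetical illustration, not the authors' pipeline. It assumes the standard tab-separated InterProScan layout, in which column 1 holds the sequence identifier and columns 12 and 13 hold the InterPro accession and description, and it assumes that "occurrence" means the number of distinct sequences carrying a domain. The function name `count_domains` and the toy records are invented for the example.

```python
import csv
import io
from collections import defaultdict


def count_domains(tsv_handle):
    """Count distinct sequences per InterPro entry in InterProScan TSV output.

    Returns a dict mapping (InterPro accession, description) to a count.
    """
    seqs_by_domain = defaultdict(set)
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip matches that carry no InterPro annotation (columns 12 and 13).
        if len(row) < 13 or not row[11].startswith("IPR"):
            continue
        seqs_by_domain[(row[11], row[12])].add(row[0])
    return {domain: len(seqs) for domain, seqs in seqs_by_domain.items()}


def _toy_row(seq_id, ipr, desc):
    """Build one made-up TSV line with placeholder values in unused columns."""
    fields = [seq_id, "md5", "100", "Pfam", "PF00000", "signature",
              "1", "50", "1e-5", "T", "01-01-2015", ipr, desc]
    return "\t".join(fields)


toy_tsv = "\n".join([
    _toy_row("contig_1", "IPR007087", "Zinc finger, C2H2"),
    _toy_row("contig_1", "IPR007087", "Zinc finger, C2H2"),  # repeat hit, same contig
    _toy_row("contig_2", "IPR007087", "Zinc finger, C2H2"),
    _toy_row("contig_3", "IPR003961", "Fibronectin, type III"),
])

counts = count_domains(io.StringIO(toy_tsv))
# Rank domains from most to least frequent, as in a "top 20" summary.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for (accession, description), n in ranked:
    print(accession, description, n)
```

Counting distinct sequences rather than raw match lines avoids inflating a domain that occurs several times within one contig; whether the published counts were derived this way is not stated in the text.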

3.5. Putative genes affecting salinity tolerance and growth rate in striped catfish

The current study further explored and refined EST sequences to identify functional genes potentially involved in salinity adaptation and growth rate in striped catfish based on a comprehensive literature survey. This approach has been applied in several studies to determine functional genes related to growth, reproduction and immune function (Wu et al., 2009; Jung et al., 2011; Pereiro et al., 2012; Huang et al., 2013) because many genes have been functionally conserved through evolution from bacteria to humans (Jung et al., 2013). Genes of interest related to salinity tolerance and growth are presented in Table 5 and discussed in greater detail below. Much of the energy consumed during osmoregulation is used to synthesize a variety of ion transporting proteins and for active maintenance of electrochemical gradients. Na+/K+-ATPase (NKA) is a fundamental ion transporting protein in osmoregulation and ion exchange. It is abundant in osmoregulatory organs including gill, intestine and kidney. Notably, the intestinal epithelium exhibited the highest NKA activity among the osmoregulatory tissues compared (Grosell et al., 1999). The NKA enzyme contains two major subunits (α and β) and a minor subunit (γ) also known as FXYD (Blanco and Mercer, 1998). We detected transcripts of both NKA

Table 3
Reciprocal BLAST comparison between P. hypophthalmus and seven teleost fishes.

| Species | Scientific name | #Proteins for subject species (a) | #Proteins with hits in P. hypophthalmus (tBLASTn) | % | #Protein hits by P. hypophthalmus (BLASTx) | % | #cDNAs for subject species (b) | Gene hits by P. hypophthalmus (BLASTn) | % |
|---|---|---|---|---|---|---|---|---|---|
| Atlantic cod | Gadus morhua | 22,100 | 18,887 | 85.5 | 30,065 | 45.2 | 22,618 | 16,990 | 25.6 |
| Fugu | Takifugu rubripes | 47,841 | 44,396 | 92.8 | 31,239 | 47.0 | 48,003 | 16,757 | 25.2 |
| Medaka | Oryzias latipes | 24,661 | 20,987 | 85.1 | 30,713 | 46.2 | 24,662 | 16,342 | 24.6 |
| Three-spined stickleback | Gasterosteus aculeatus | 27,576 | 23,870 | 86.6 | 31,511 | 47.4 | 27,628 | 18,062 | 27.2 |
| Green spotted pufferfish | Tetraodon nigroviridis | 23,118 | 20,610 | 89.2 | 30,579 | 46.0 | 23,265 | 16,510 | 24.8 |
| Nile tilapia | Oreochromis niloticus | 26,763 | 23,915 | 89.4 | 32,555 | 49.0 | 26,788 | 18,640 | 28.1 |
| Zebrafish | Danio rerio | 42,555 | 36,584 | 86.0 | 34,563 | 52.0 | 49,647 | 25,295 | 38.1 |

(a) Using protein database from Ensembl.
(b) Using cDNA database from Ensembl, E-value cut-off < 1e−10.


Table 4
Summary of top 20 protein domains for P. hypophthalmus sequences.

| No. | IPR | Domain name | Domain description | No. of occurrence |
|---|---|---|---|---|
| 1 | IPR013783 | Ig-like_fold | Immunoglobulin-like fold | 772 |
| 2 | IPR011009 | Kinase-like_dom | Protein kinase-like domain | 579 |
| 3 | IPR000719 | Prot_kinase_cat_dom | Protein kinase, catalytic domain | 505 |
| 4 | IPR015943 | WD40/YVTN_repeat-like_dom | WD40/YVTN repeat-like-containing domain | 357 |
| 5 | IPR007087 | Znf_C2H2 | Zinc finger, C2H2 | 355 |
| 6 | IPR007110 | Ig-like | Immunoglobulin-like | 336 |
| 7 | IPR011993 | PH_like_dom | Pleckstrin homology-like domain | 314 |
| 8 | IPR016024 | ARM-type_fold | Armadillo-type fold | 267 |
| 9 | IPR013087 | Znf_C2H2/integrase_DNA-bd | Zinc finger C2H2-type/integrase DNA-binding domain | 265 |
| 10 | IPR015880 | Znf_C2H2-like | Zinc finger, C2H2-like | 265 |
| 11 | IPR003961 | Fibronectin_type3 | Fibronectin, type III | 251 |
| 12 | IPR017986 | WD40_repeat_dom | WD40-repeat-containing domain | 249 |
| 13 | IPR002290 | Ser/Thr_dual-sp_kinase_dom | Serine/threonine-/dual specificity protein kinase, catalytic domain | 237 |
| 14 | IPR001680 | WD40_repeat | WD40 repeat | 222 |
| 15 | IPR013083 | Znf_RING/FYVE/PHD | Zinc finger, RING/FYVE/PHD-type | 220 |
| 16 | IPR011989 | ARM-like | Armadillo-like helical | 216 |
| 17 | IPR020683 | Ankyrin_rpt-contain_dom | Ankyrin repeat-containing domain | 211 |
| 18 | IPR008985 | ConA-like_lec_gl_sf | Concanavalin A-like lectin/glucanases superfamily | 198 |
| 19 | IPR002110 | Ankyrin_rpt | Ankyrin repeat | 197 |
| 20 | IPR001849 | Pleckstrin_homology | Pleckstrin homology domain | 191 |

and FXYD in our experimental tissue libraries. NKA mRNA expression levels are known to correlate with environmental salinity, demonstrating the important role of this ion pump during the salinity acclimation process in fish. As a consequence, information about mRNA expression of this enzyme during salinity exposure is relatively robust. Effects of elevated salinity exposure on NKA have been reported in many species including rainbow trout (Oncorhynchus mykiss) (Richards et al., 2003; Singer et al., 2007), Atlantic salmon (D'Cotta et al., 2000), brown trout (Salmo trutta) (Madsen et al., 1995), and killifish (Fundulus heteroclitus) (Scott and Schulte, 2005). Many studies have also indicated that sodium–potassium–chloride cotransporter or Na–K–Cl cotransporter (NKCC) acts in parallel with NKA to ensure osmoregulatory capacity under elevated salinity in freshwater fish (Hiroi and McCormick, 2007). It can be very difficult, however, to analyze NKCC responses since a simple assay is not yet available for this transporter (Mackie et al., 2007). Recently,

Table 5
Potential genes involved in salinity tolerance and growth trait in P. hypophthalmus sequences.

| Candidate genes | E value | Matched species | Length range (bp) | Total of occurrence |
|---|---|---|---|---|
| NADH dehydrogenase | 0–3.73e−6 | Anoplopoma fimbria, Danio rerio, Galaxias anomalus, Gallus gallus, Hepsetus odoe, Hoplomyzon sexpapilostoma, Ictalurus punctatus, Myotis brandtii, Pangasianodon gigas, Pangasius larnaudii, Pangasius larnaudii, Prochilodus lineatus, Salmo salar, Symphodus roissali | 220–2746 | 32 |
| Claudin family (1, 5, 7, 8, 10, 11, 15 …) | 3.70e−119–7.49e−6 | Danio rerio, Ictalurus punctatus, Salmo salar, Takifugu rubripes | 213–1518 | 31 |
| Insulin-like growth factor | 9.04e−151–1.78e−9 | Cyprinus carpio, Danio rerio, Ictalurus punctatus, Pelteobagrus fulvidraco, Salmo salar | 207–2815 | 18 |
| Thyroid hormone | 0–1.15e−12 | Danio rerio, Ictalurus furcatus | 230–1214 | 12 |
| Glutathione S-transferase | 2.53e−140–2.61e−7 | Ctenopharyngodon idella, Danio rerio, Ictalurus furcatus, Ictalurus punctatus, Tanichthys albonubes | 268–1333 | 11 |
| Na+/K+ ATPase alpha | 2.43e−155–2.43e−15 | Carassius auratus, Danio rerio, Dicentrarchus labrax, Galaxias maculatus, Pseudopleuronectes americanus, Scophthalmus maximus, Thunnus orientalis | 228–992 | 10 |
| Aquaporin family (1, 3, 7, 8, 10) | 2.23e−134–2.83e−22 | Anguilla anguilla, Anguilla japonica, Danio rerio, Ictalurus furcatus | 226–1232 | 9 |
| Glucocorticoid receptor | 0–5.88e−6 | Cyprinus carpio, Danio rerio, Gallus gallus, Pimephales promelas, Salmo marmoratus | 264–1497 | 7 |
| Voltage-dependent anion-selective channel protein | 0–1.67e−17 | Danio rerio, Gallus gallus, Ictalurus furcatus, Ictalurus punctatus, Micropterus salmoides, Salmo salar | 223–1872 | 7 |
| FXYD domain | 7.37e−23–9.52e−13 | Danio rerio, Salmo salar | 298–765 | 6 |
| Carbonic anhydrase | 1.76e−150–1.13e−11 | Danio rerio, Ictalurus punctatus, Oncorhynchus mykiss | 261–1640 | 5 |
| 14-3-3 protein | 9.35e−175–3.11e−30 | Artemia franciscana, Danio rerio, Ictalurus punctatus, Oncorhynchus mykiss | 627–1347 | 5 |
| Sodium- and chloride-dependent taurine transporter | 2.99e−149–1.96e−12 | Anoplopoma fimbria, Danio rerio | 247–1516 | 5 |
| Sodium bicarbonate cotransporter | 0–7.68e−45 | Danio rerio, Tribolodon hakonensis | 292–1433 | 5 |
| Growth hormone receptor | 1.14e−77–1.04e−13 | Pelteobagrus vachellii, Silurus meridionalis | 290–823 | 4 |
| Sodium/glucose cotransporter 1/4 | 0–5.27e−28 | Danio rerio | 367–2211 | 4 |
| Cystic fibrosis transmembrane conductance regulator | 9.70e−117–2.34e−10 | Danio rerio | 393–755 | 4 |
| Sodium–potassium–chloride cotransporter 1/SLC12A2 | 0–4.67e−10 | Danio rerio | 293–1515 | 4 |
| Solute carrier family 15 (H+/peptide transporter) | 8.63e−71–3.29e−12 | Cyprinus carpio, Danio rerio | 339–651 | 3 |
| Calpastatin | 3.56e−17–8.31e−67 | Dicentrarchus labrax, Ictalurus punctatus, Salmo salar | 444–992 | 3 |


immunofluorescence staining showed an NKCC signal on the basolateral membrane of mitochondrion-rich (MR) cells (previously known as “chloride cells”) (Laverty and Skadhauge, 2012) in seawater-acclimated sailfin molly (Poecilia latipinna) (Yang et al., 2011). A similar approach also yielded a positive correlation between NKCC and salinity elevation in climbing perch (Anabas testudineus) (Ching et al., 2013). Two isoforms of NKCC have been recorded, NKCC1 (generally referred to as the secretory isoform, expressed particularly in ion secretory epithelial cells) and NKCC2 (the absorptive isoform, found in the apical membrane of epithelial cells) (Kang et al., 2010). Inokuchi et al. (2008) also suggested that MR cells with basolateral NKCC1 are the ion-secreting cells in Mozambique tilapia (Oreochromis mossambicus). NKCC also plays a significant role in the Cl− secretion pathway in European eel (Anguilla anguilla) (Cutler and Cramb, 2002), killifish (Scott et al., 2004) and striped bass (Morone saxatilis) (Tipsmark et al., 2004), and in the salmonids Salvelinus namaycush, Salvelinus fontinalis and Salmo salar (Hiroi and McCormick, 2007) after transfer to seawater. The absorptive isoform, NKCC2, is believed to play an important role in water reabsorption in the intestine, where it acts as an uptake transporter through which Na+ enters the enterocyte (intestinal absorptive cell) (Watanabe et al., 2011).

Cystic fibrosis transmembrane conductance regulator (CFTR) is another candidate gene identified in the current study that may enhance osmoregulatory capacity in P. hypophthalmus (Hwang and Lee, 2007). A detailed review by Marshall (2002) summarized the various mechanisms and actions of CFTR in chloride cells during salinity exposure of freshwater fish. In this model, the author suggested that CFTR acts as a passive channel providing exit for Cl− anions, while Na+ is secreted from the cell via the local paracellular pathway between chloride and accessory cells (Marshall, 2002). CFTR is a well-characterized member of the ATP-binding cassette (ABC) transporter family that is responsible for molecule transfer using ATP (Holland et al., 2003). Chen et al. (2001) suggested that there are two isoforms (CFTR I and II) in gill tissue that appear to be regulated independently from each other in Atlantic salmon. The two isoforms of CFTR have been shown to play a role in adaptation of salmon smolts following transfer to seawater (Singer et al., 2002). Gill CFTR activity increases after transfer to seawater in many fish species including killifish (Marshall et al., 1999; Marshall et al., 2002), Hawaiian goby Stenogobius hawaiiensis (McCormick et al., 2003), Atlantic salmon (Singer et al., 2003) and Mozambique tilapia (Hiroi et al., 2005; Ouattara et al., 2009).

For genes affecting growth rate in P. hypophthalmus, the somatotropic axis essentially consists of growth hormone-releasing hormone (GHRH), growth hormone inhibiting hormone (GHIH or somatostatin), growth hormone (GH), insulin-like growth factors (IGF-I and -II), and associated carrier proteins and receptors. Growth hormone has been shown to play a major role in growth promotion in teleosts, in addition to playing a role in osmoregulation (McCormick, 2001). In mammals, most of the actions of growth hormone are mediated indirectly through IGF-I. IGF is a system of peptide hormones, cell surface receptors and circulating binding proteins. We also detected calpastatin, which could have a potential role in muscle development in P. hypophthalmus. Calpastatin (CAST) is an inhibitor of calcium dependent neutral proteases, calpains, which have previously been suggested to play a potential role in muscle growth and to affect fillet quality in fish (Salem et al., 2005). This protein was reported to be modulated by the GH axis. Moreover, this pattern has been confirmed in transgenic Coho salmon (Overturf et al., 2010). The role of CAST in muscle growth in teleost fish is, however, currently not as well characterized as it is in livestock.

Apart from a function in enhancement of growth rate, the GH and IGF-I axis is also believed to be involved in osmoregulation. GH and IGF-I modulate upregulation of various ion transporters including NKA and NKCC that are critical for salt secretion in fish gill. Transfer of fish from freshwater to seawater resulted in an increase of IGF in four-spine sculpin (Cottus kazika) (Inoue et al., 2003), rainbow trout (Poppinga et al., 2007), Atlantic salmon (Agustsson et al., 2001) and Mozambique tilapia (Magdeldin et al., 2007). Injection of IGF also appeared to increase salinity tolerance of tilapia (O. mossambicus and O. niloticus), striped bass and killifish (Mancera and McCormick, 1998), as well as of Atlantic salmon (McCormick, 1996). In brown trout, long term IGF-I treatment can increase the number of gill chloride cells and NKA activity while increasing salt secretory capacity (Seidelin et al., 1999). In contrast, Imsland et al. (2007) found no correlation between environmental salinity and IGF-I levels in juvenile turbot (Scophthalmus maximus). Another study of black-chinned tilapia Sarotherodon melanotheron showed the reverse effects of IGF in liver, intestine and gill (Link et al., 2010). This study reported that IGF-I and IGF-II effects vary greatly depending on the individual osmoregulatory tissue. IGF-I and IGF-II concentrations have also been reported to be up-regulated in gill and down-regulated in liver in striped bass after seawater transfer (Tipsmark et al., 2007).

While osmoregulation in other catfish (Furspan et al., 1984; Eckert et al., 2001) has been better characterized, the molecular and genetic basis of salinity adaptation and growth is currently poorly understood in striped catfish. Therefore, a number of putative genes identified here can contribute to studies of individual response to salinity stress and growth rate in striped catfish. Further analysis of differential gene expression and quantitative PCR will be required, however, to confirm that individual putative candidate genes actually play roles in salinity tolerance and growth performance in the target species.

3.6. Putative SSRs and SNPs

The transcriptome data provided a huge resource for data mining and discovery of gene-associated markers. In total, 28,879 SSRs or microsatellites were identified, of which 13,440 SSRs (46.5%) were detected from the assembled sequences and 15,439 SSRs (53.5%) were detected from non-assembled sequences or singletons. The most common repeat motifs were dinucleotides that accounted for 82.02%, followed by trinucleotides (14.33%) and tetra/penta/hexanucleotides (3.65%) (Table 6, Table S4). Among dinucleotide repeats, AC/CA types were most abundant (68.35%), followed by AG/GA (24.64%), AT/TA (5.4%) and CG/GC (1.60%). For trinucleotide repeats, AAT/TAA/ATA types were the most abundant (19.91%) while ACT/CTA/TAC types were rare (0.87%). From SSR-containing ESTs, we were able to design 3229 and 4875 primer sets for contig sequences and singleton sequences, respectively. These consisted of 81.03% dinucleotide repeat primers, 14.88% trinucleotide repeat primers, and 4.08% tetra/penta/hexanucleotide repeat primers (Table S4).

Table 6
Summary of simple sequence repeat (SSR) types and primer sets in the P. hypophthalmus transcriptome.

| SSR type | No. of SSRs: Contig | No. of SSRs: Singleton | No. of SSRs: Total | No. of primer sets: Contig | No. of primer sets: Singleton | No. of primer sets: Total |
|---|---|---|---|---|---|---|
| Di-nucleotide | 11,184 | 12,502 | 23,686 | 2656 | 3911 | 6567 |
| AT/TA | 688 | 592 | 1280 | 182 | 190 | 372 |
| CA/AC | 7242 | 8948 | 16,190 | 1718 | 2799 | 4517 |
| CG/GC | 226 | 154 | 380 | 54 | 48 | 102 |
| GA/AG | 3028 | 2808 | 5836 | 702 | 874 | 1576 |
| Tri-nucleotide | 1990 | 2148 | 4138 | 504 | 702 | 1206 |
| AAC/ACA/CAA | 127 | 352 | 479 | 34 | 117 | 151 |
| ACG/CGA/GAC | 15 | 27 | 42 | 5 | 7 | 12 |
| ACT/CTA/TAC | 27 | 9 | 36 | 6 | 3 | 9 |
| AGC/GCA/CAG | 497 | 210 | 707 | 121 | 69 | 190 |
| AGG/GAG/GGA | 360 | 195 | 555 | 88 | 63 | 151 |
| CAT/ATC/TCA | 329 | 485 | 814 | 83 | 161 | 244 |
| CCA/CAC/ACC | 58 | 49 | 107 | 14 | 17 | 31 |
| CCG/CGC/GCC | 39 | 3 | 42 | 8 | 1 | 9 |
| GAA/AAG/AGA | 212 | 320 | 532 | 54 | 100 | 154 |
| TAA/ATA/AAT | 326 | 498 | 824 | 91 | 164 | 255 |
| Others (Tetra/Penta/Hexa) | 266 | 789 | 1055 | 69 | 262 | 331 |
| Total | 13,440 | 15,439 | 28,879 | 3229 | 4875 | 8104 |

Similarly, a total of 55,721 putative SNPs and 15,441 indels were identified from alignments of multiple sequences used for contig assembly. Putative SNPs consisted of 36,469 transitions and 19,252 transversions (Fig. 2, Table S5). Transitions occurred at a higher rate than transversions in the striped catfish transcriptome, with a ratio of 1.89:1.00. A/G and C/T were the most frequent SNP types observed while G/T and G/C were the least common types. These findings are in line with previous SNP studies based on transcriptome datasets in a number of aquatic species (Franchini et al., 2011; Jung et al., 2011; Wang et al., 2013; Cui et al., 2014).

Fig. 2. Summary of SNPs/indels identified from P. hypophthalmus transcriptome.

The microsatellites and SNPs identified in our study will provide valuable resources for population genetic studies and resource assessment, and provide tools for genetic linkage and QTL analysis that can facilitate marker-assisted selection in P. hypophthalmus in the future. SSRs and SNPs identified from transcriptomic sequences have advantages over molecular markers developed in non-transcribed regions because they are likely to be linked to protein coding genes (Gao et al., 2012). They can therefore facilitate detection of functional variation (Bouck and Vision, 2007) and potentially may have substantial physiological impacts (Gao et al., 2012). According to Salem et al. (2012), SNPs explain 90% of the genetic differences between individuals, and crossing over is less likely to separate SNP markers from genes when SNPs are detected within or near coding sequences. These SNPs are therefore potentially very useful in aquaculture species where complete genome sequences are not available, for example striped catfish.

While next generation sequencing can provide excellent resources for molecular marker mining, EST-derived SSR or SNP markers can produce false positives due to sequencing errors, misassembly of paralogous sequence variants or multisite sequence variants (Liu et al., 2011; Jung et al., 2014). As a result, SSRs and SNPs reported here require further validation and evaluation before they are applied in striped catfish or other closely related species.

4. Conclusions

The current study has improved information on the transcriptome response in freshwater striped catfish (P. hypophthalmus) exposed to an optimal salinity level. The de novo transcriptome developed has significantly expanded genomic resources available for this important culture species. The complete assembly was composed of 66,451 contigs assembled from 12,177,770 ESTs that were generated from six target tissue libraries using the Ion Torrent sequencing platform. Approximately 57% of the contigs were successfully annotated and presented homology with proteins deposited in the databases. A number of extended putative genes potentially related to salinity tolerance and growth were also identified. Further examination of the physiological roles of these genes and their variants will be necessary to develop associated genomic markers. In addition, thousands of SSRs and SNPs detected here should enable population genomic and gene-based association studies of striped catfish. Such studies will contribute to our understanding of the genetic basis of salinity adaptation. The current study represents the first large scale sequencing effort using NGS for striped catfish. The results demonstrate that high-throughput transcriptome sequencing is a fast, economical and effective method for obtaining genomic resources and molecular markers in species where complete genome sequences are not available. Transcriptomic analysis is applicable to many other non-model aquaculture species without genome references.

Acknowledgement

This research was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 106.99-2011.63. The authors are grateful to the Molecular Genetics Research Facility at QUT that offered facilities and technical assistance to run Ion PGM.

Appendix A. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.margen.2015.05.001.

References

Agustsson, T., Sundell, K., Sakamoto, T., Johansson, V., Ando, M., Th Björnsson, B., 2001. Growth hormone endocrinology of Atlantic salmon (Salmo salar): pituitary gene expression, hormone storage, secretion and plasma levels during parr-smolt transformation. J. Endocrinol. 170, 227–234.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Bilyk, K.T., Cheng, C.H.C., 2013. Model of gene expression in extreme cold - reference transcriptome for the high-Antarctic cryopelagic notothenioid fish Pagothenia borchgrevinki. BMC Genomics 14, 634.
Blanco, G., Mercer, R.W., 1998. Isozymes of the Na–K-ATPase: heterogeneity in structure, diversity in function. Am. J. Physiol. 275, F633–F650.
Bouck, A., Vision, T., 2007. The molecular ecologist's guide to expressed sequence tags. Mol. Ecol. 16, 907–924.
Chen, J., Cutler, C., Jacques, C., Boeuf, G., Denamur, E., Lecointre, G., Mercier, B., Cramb, G., Férec, C., 2001. A combined analysis of the cystic fibrosis transmembrane conductance regulator: implications for structure and disease models. Mol. Biol. Evol. 18, 1771–1788.
Ching, B., Chen, X.L., Yong, J.H., Wilson, J.M., Hiong, K.C., Sim, E.W., Wong, W.P., Lam, S.H., Chew, S.F., Ip, Y.K., 2013. Increases in apoptosis, caspase activity and expression of p53 and bax, and the transition between two types of mitochondrion-rich cells, in the gills of the climbing perch, Anabas testudineus, during a progressive acclimation from freshwater to seawater. Front. Physiol. 4, 135.
Chomczynski, P., Mackey, K., 1995. Short technical report: modification of TRIZOL reagent procedure for isolation of RNA from polysaccharide- and proteoglycan-rich sources. Biotechniques 19, 942–945.
Cui, J., Wang, H., Liu, S., Qiu, X., Jiang, Z., Wang, X., 2014. Transcriptome analysis of the gill of Takifugu rubripes using Illumina sequencing for discovery of SNPs. Comp. Biochem. Physiol. D 10, 44–51.
Cutler, C.P., Cramb, G., 2002. Two isoforms of the Na+/K+/2Cl− cotransporter are expressed in the European eel (Anguilla anguilla). Biochim. Biophys. Acta Biomembr. 1566, 92–103.
D'Cotta, H., Valotaire, C., le Gac, F., Prunet, P., 2000. Synthesis of gill Na+–K+-ATPase in Atlantic salmon smolts: differences in α-mRNA and α-protein levels. Am. J. Physiol. Regul. Integr. Comp. Physiol. 278, R101–R110.
Del Castillo, C.S., Hikima, J., Ohtani, M., Jung, T.S., Aoki, T., 2012. Characterization and functional analysis of two PKR genes in fugu (Takifugu rubripes). Fish Shellfish Immunol. 32, 79–88.
Directorate of Fisheries, 2015. Tinh hinh san xuat thuy san nam 2014 (26/02/2015) (in Vietnamese). http://www.fistenet.gov.vn/thong-tin-huu-ich/thong-tin-thong-ke/thong-ke-1/tinh-hinh-san-xuat-thuy-san-nam-2014 (Accessed 20 March 2015).
Du, H., Bao, Z., Hou, R., Wang, S., Su, H., Yan, J., Tian, M., Li, Y., Wei, W., Lu, W., Hu, X., Wang, S., Hu, J., 2012. Transcriptome sequencing and characterization for the sea cucumber Apostichopus japonicus (Selenka, 1867). PLoS ONE 7 (e33311).
Duan, J., Xia, C., Zhao, G., Jia, J., Kong, X., 2012. Optimizing de novo common wheat transcriptome assembly using short-read RNA-Seq data. BMC Genomics 13, 392.


Dunham, R.A., Taylor, J.F., Rise, M.L., Liu, Z., 2014. Development of strategies for integrated breeding, genetics and applied genomics for genetic improvement of aquatic organisms. Aquaculture 420–421, S121–S123. Eckert, S.M., Yada, T., Shepherd, B.S., Stetson, M.H., Hirano, T., Grau, E.G., 2001. Hormonal control of osmoregulation in the Channel catfish Ictalurus punctatus. Gen. Comp. Endocrinol. 122, 270–286. Franchini, P., van der Merwe, M., Roodt-Wilding, R., 2011. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis. BMC Res. Notes 4, 59. Furspan, P., Prange, H.D., Greenwald, L., 1984. Energetics and osmoregulation in the catfish, Ictalurus nebulosus and I. punctatus. Comp. Biochem. Physiol. 77A, 773–778. Gao, Z., Luo, W., Liu, H., Zeng, C., Liu, X., Yi, S., Wang, W., 2012. Transcriptome analysis and SSR/SNP markers information of the blunt snout bream (Megalobrama amblycephala). PLoS ONE 7 (e42637). Garg, R., Patel, R.K., Jhanwar, S., Priya, P., Bhattacharjee, A., Yadav, G., Bhatia, S., Chattopadhyay, D., Tyagi, A.K., Jain, M., 2011. Gene discovery and tissue-specific transcriptome analysis in chickpea with massively parallel pyrosequencing and web resource development. Plant Physiol. 156, 1661–1678. Gong, H.Y., Wu, J.L., Huang, W.T., Lin, C.J.F., Weng, C.F., 2004. Response to acute changes in salinity of two different muscle type creatine kinase isoforms, from euryhaline teleost (Oreochromis mossambicus) gills. Biochim. Biophys. Acta 1675, 184–191. Götz, S., Garcia-Gomez, J.M., Terol, J., William, T.D., Gagaraj, S.H., 2008. High-throughput functional annotation and data mining with Blast2GO suite. Nucleic Acids Res. 36, 3420–3435. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A., 2011. 
Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652. Grosell, M., de Boeck, G., Johannsson, O., Wood, C.M., 1999. The effects of silver on intestinal ion and acid–base regulation in the marine teleost fish, Parophrys vetulus. Comp. Biochem. Physiol. C Pharmacol. Toxicol. Endocrinol. 24, 259–270. Ha, H.P., Nguyen, T.T.T., Poompuang, S., Na-Nakorn, U., 2009. Microsatellites revealed no genetic differentiation between hatchery and contemporary wild populations of striped catfish, Pangasianodon hypophthalmus (Sauvage 1878) in Vietnam. Aquaculture 29, 154–160. Hale, M.C., Jackson, J.R., de Woody, J.A., 2010. Discovery and evaluation of candidate sexdetermining genes and xenobiotics in the gonads of lake sturgeon (Acipenser fulvescens). Genetica 138, 45–456. Hanks, S.K., Quinn, A.M., Hunter, T., 1988. The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science 241, 42–52. Haridas, S., Breuill, C., Bohlmann, J., Hsiang, T., 2011. A biologist's guide to de novo genome assembly using next-generation sequence data: a test with fungal genomes. J. Microbiol. Methods 86, 368–375. Hiroi, J., McCormick, S.D., 2007. Variation in salinity tolerance, gill Na+/K+-ATPase, Na+/ K+/2Cl− cotransporter and mitochondria-rich cell distribution in three salmonids Salvelinus namaycush, Salvelinus fontinalis and Salmo salar. J. Exp. Biol. 210, 1015–1024. Hiroi, J., McCormick, S.D., Ohtani-Kaneko, R., Kaneko, T., 2005. Functional classification of mitochondrion-rich cells in euryhaline Mozambique tilapia (Oreochromis mossambicus) embryos, by means of triple immunofluorescence staining for Na+/ K+-ATPase, Na+/K+/2Cl− cotransporter and CFTR anion channel. J. Exp. Biol. 208, 2023–2036. Holland, I.B., Cole, S.P.C., Kuchler, K., Higgins, C.F., 2003. ABC Proteins: From Bacteria to Man. Elsevier Science, London. Huang, X.D., Zhao, M., Liu, W.G., Guan, Y.Y., Shi, Y., Wang, Q., Wu, S.Z., He, M.X., 2013. 
Gigabase-scale transcriptome analysis on four species of pearl oysters. Mar. Biotechnol. 15, 253–264. Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., de Castro, E., Coggill, P., Corbett, M., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Fraser, M., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., McMenamin, C., Mi, H., Mutowo-Muellenet, P., Mulder, N., Natale, D., Orengo, C., Pesseat, S., Punta, M., Quinn, A.F., Rivoire, C., Sangrador-Vegas, A., Selengut, J.D., Sigrist, C.J., Scheremetjew, M., Tate, J., Thimmajanarthanan, M., Thomas, P.D., Wu, C.H., Yeats, C., Yong, S.Y., 2012. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40, D306–D312. Hwang, P.P., Lee, T.H., 2007. New insights into fish ion regulation and mitochondrion-rich cells. Comp. Biochem. Physiol. A Mol. Integr. Physiol. 148, 479–497. Hynes, O.R., 1990. Fibronectins. Springer-Verlag, New York. Imsland, A.K., Björnsson, B.T., Gunnarsson, S., Foss, A., Stefansson, S.O., 2007. Temperature and salinity effects on plasma insulin-like growth factor-I concentrations and growth in juvenile turbot (Scophthalmus maximus). Aquaculture 271, 546–552. Inokuchi, M., Hiroi, J., Watanabe, S., Lee, K.M., Kaneko, T., 2008. Gene expression and morphological localization of NHE3, NCC and NKCC1a in branchial mitochondria-rich cells of Mozambique tilapia (Oreochromis mossambicus) acclimated to a wide range of salinities. Comp. Biochem. Physiol. A Mol. Integr. Physiol. 151, 151–158. Inoue, K., Iwatani, H., Takei, Y., 2003. Growth hormone and insulin-like growth factor I of a euryhaline fish Cottus kazika: cDNA cloning and expression after seawater acclimation. Gen. Comp. Endocrinol. 131, 77–84. Jiang, Y., Lu, J., Peatman, E., Kucuktas, H., Liu, S., Wang, S., Sun, F., Liu, Z., 2011. 
A pilot study for channel catfish whole genome sequencing and de novo assembly. BMC Genomics 12, 629. Jondeung, A., Sangthong, P., Zardoya, R., 2007. The complete mitochondrial DNA sequence of the Mekong giant catfish (Pangasianodon gigas), and the phylogenetic relationships among Siluriformes. Gene 387, 49–57.

Jung, H., Lyons, R.E., Dinh, H., Hurwood, D.A., McWilliam, S., Mather, P.B., 2011. Transcriptomics of a giant freshwater prawn (Macrobrachium rosenbergii): de novo assembly, annotation and marker discovery. PLoS ONE 6 (e27938). Jung, H., Lyons, R.E., Hurwood, D.A., Mather, P.B., 2013. Genes and growth performance in crustacean species: a review of relevant genomic studies in crustaceans and other taxa. Rev. Aquac. 5, 77–110. Jung, H., Lyons, R.E., Yutao, L., Thanh, N.M., Dinh, H., Hurwood, D.A., Salin, K.R., Mather, P.B., 2014. A candidate gene association study for growth performance in an improved giant freshwater prawn (Macrobrachium rosenbergii) culture line. Mar. Biotechnol. 16, 161–180. Kang, C.K., Tsai, H.J., Liu, C.C., Lee, T.H., Hwang, P.P., 2010. Salinity-dependent expression of a Na+, K+, 2Cl− cotransporter in gills of the brackish medaka Oryzias dancena: a molecular correlate for hyposmoregulatory endurance. Comp. Biochem. Physiol. A Mol. Integr. Physiol. 157, 7–18. Laverty, G., Skadhauge, E., 2012. Adaptation of teleosts to very high salinity. Comp. Biochem. Physiol. A 163, 1–6. Lefevre, S., Huong, D.T.T., Wang, T., Phuong, N.T., Bayley, M., 2011. Hypoxia tolerance and partitioning of bimodal respiration in the striped catfish (Pangasianodon hypophthalmus). Comp. Biochem. Physiol. A 158, 207–214. Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 2009. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25, 2078–2079. Liao, X., Cheng, L., Xu, P., Lu, G., Wachholtz, M., Sun, X., Chen, S., 2013. Transcriptome analysis of crucian carp (Carassius auratus), an important aquaculture and hypoxia-tolerant species. PLoS ONE 8 (e62308). Link, K., Berishvili, G., Shved, N., D'Cotta, H., Baroiller, J.F., Reinecke, M., Eppler, E., 2010. 
Seawater and freshwater challenges affect the insulin-like growth factors IGF-I and IGF-II in liver and osmoregulatory organs of the tilapia. Mol. Cell. Endocrinol. 327, 40–46. Liu, S., Zhou, Z., Lu, J., Sun, F., Wang, S., Liu, H., Jiang, Y., Kucuktas, H., Kaltenboeck, L., Peatman, E., Liu, Z., 2011. Generation of genome-scale gene-associated SNPs in catfish for the construction of a high-density SNP array. BMC Genomics 12, 53. Liu, S., Wang, X., Sun, F., Zhang, J., Feng, J., Liu, H., Rajendran, K.V., Sun, L., Zhang, Y., Jiang, Y., Peatman, E., Kaltenboeck, L., Kucuktas, H., Liu, Z., 2013a. RNA-Seq reveals expression signatures of genes involved in oxygen transport, protein synthesis, folding, and degradation in response to heat stress in catfish. Physiol. Genomics 45, 462–476. Liu, S., Zhang, Y., Zhou, Z., Waldbieser, G., Sun, F., Lu, J., Zhang, J., Jiang, Y., Zhang, H., Wang, X., Rajendran, K.V., Khoo, L., Kucuktas, H., Peatman, E., Liu, Z., 2013b. Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote. BMC Genomics 13, 595. Mackie, P.M., Gharbi, K., Ballantyne, J.S., McCormick, S.D., Wright, P.A., 2007. Na+/K+/2Cl− cotransporter and CFTR gill expression after seawater transfer in smolts (0+) of different Atlantic salmon (Salmo salar) families. Aquaculture 272, 625–635. Madsen, S.S., Jensen, M.K., Nøhr, J., Kristiansen, K., 1995. Expression of Na(+)–K(+)-ATPase in the brown trout, Salmo trutta: in vivo modulation by hormones and seawater. Am. J. Physiol. Regul. Integr. Comp. Physiol. 269, R1339–R1345. Magdeldin, S., Uchida, K., Hirano, T., Grau, G., Abdelfattah, A., Nozaki, M., 2007. Effects of environmental salinity on somatic growth and growth hormone/insulin-like growth factor-I axis in juvenile tilapia Oreochromis mossambicus. Fish. Sci. 73, 1025–1034. Mancera, J.M., McCormick, S.D., 1998. Osmoregulatory actions of the GH/IGF axis in nonsalmonid teleosts. Comp. Biochem. Physiol. B 121, 43–48. 
Marshall, W.S., 2002. Na+, Cl−, Ca2+ and Zn2+ transport by fish gills: retrospective review and prospective synthesis. J. Exp. Zool. 293, 264–283. Marshall, W.S., Emberley, T.R., Singer, T.D., Bryson, S.E., McCormick, S.D., 1999. Time course of salinity adaptation in a strongly euryhaline estuarine teleost, Fundulus heteroclitus: a multivariable approach. J. Exp. Biol. 202, 1535–1544. Marshall, W.S., Lynch, E.M., Cozzi, R.R., 2002. Redistribution of immunofluorescence of CFTR anion channel and NKCC cotransporter in chloride cells during adaptation of the killifish Fundulus heteroclitus to sea water. J. Exp. Biol. 205, 1265–1273. McCormick, S.D., 1996. Effects of growth hormone and insulin-like growth factor I on salinity tolerance and gill Na+, K+-ATPase in Atlantic salmon (Salmo salar): interaction with cortisol. Gen. Comp. Endocrinol. 101, 3–11. McCormick, S.D., 2001. Endocrine control of osmoregulation in teleost fish. Integr. Comp. Biol. 41, 781–794. McCormick, S.D., Sundell, K., Björnsson, B.T., Brown, C.L., Hiroi, J., 2003. Influence of salinity on the localization of Na+/K+-ATPase, Na+/K+/2Cl− cotransporter (NKCC) and CFTR anion channel in chloride cells of the Hawaiian goby (Stenogobius hawaiiensis). J. Exp. Biol. 206, 4575–4583. Meglécz, E., Costedoat, C., Dubut, V., Gilles, A., Malausa, T., Pech, N., Martin, J.F., 2010. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics 26, 403–404. Miller, W., Drautz, D.I., Ratan, A., Pusey, B., Qi, J., Lesk, A.M., Tomsho, L.P., Packard, M.D., Zhao, F., Sher, A., Tikhonov, A., Raney, B., Patterson, N., Lindblad-Toh, K., Lander, E.S., Knight, J.R., Irzyk, G.P., Fredrikson, K.M., Harkins, T.T., Sheridan, S., Pringle, T., Schuster, S.C., 2008. Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456, 387–390. Miller, J.R., Koren, S., Sutton, G., 2010. Assembly algorithms for next-generation sequencing data. 
Genomics 95, 315–327. Min, X.J., Butler, G., Storms, R., Tsang, A., 2005. OrfPredictor: predicting protein coding regions in EST-derived sequences. Nucleic Acids Res. 33 (Web Server Issue), W677–W680. Mori, T., Kawaguchi, W., Hiraka, I., Kurata, Y., Saito, T., Uchida, N., 2007. Isolation and identification of the major plasma fibronectin cDNA from Japanese catfish Silurus asotus. Comp. Biochem. Physiol. B 146, 53–59.

Please cite this article as: Thanh, N.M., et al., Optimizing de novo transcriptome assembly and extending genomic resources for striped catfish (Pangasianodon hypophthalmus), Mar. Genomics (2015), http://dx.doi.org/10.1016/j.margen.2015.05.001

Na-Nakorn, U., Moeikum, T., 2009. Genetic diversity of domesticated stocks of striped catfish, Pangasianodon hypophthalmus (Sauvage 1878), in Thailand: relevance to broodstock management regimes. Aquaculture 297, 70–77. Nguyen, T.T.T., 2009. Patterns of use and exchange of genetic resources of the striped catfish Pangasianodon hypophthalmus (Sauvage 1878). Rev. Aquac. 1, 224–231. Ouattara, N.G., Bodinier, C., Nègre-Sadargues, G., D'Cotta, H., Messad, S., Charmantier, G., Panfili, J., Baroiller, J.F., 2009. Changes in gill ionocyte morphology and function following transfer from fresh to hypersaline waters in the tilapia Sarotherodon melanotheron. Aquaculture 290, 155–164. Overturf, K., Sakhrani, D., Devlin, R.H., 2010. Expression profile for metabolic and growth-related genes in domesticated and transgenic coho salmon (Oncorhynchus kisutch) modified for increased growth hormone production. Aquaculture 307, 111–122. Pereiro, P., Balseiro, P., Romero, A., Dios, S., Forn-Cuni, G., Fuste, B., Planas, J.V., Beltran, S., Novoa, B., Figueras, A., 2012. High-throughput sequence analysis of turbot (Scophthalmus maximus) transcriptome using 454-pyrosequencing for the discovery of antiviral immune genes. PLoS ONE 7 (e35369). Poppinga, J., Kittilson, J., McCormick, S.D., Sheridan, M.A., 2007. Effects of somatostatin on the growth hormone-insulin-like growth factor axis and seawater adaptation of rainbow trout (Oncorhynchus mykiss). Aquaculture 273, 312–319. Richards, J.G., Semple, J.W., Bystriansky, J.S., Schulte, P.M., 2003. Na+/K+-ATPase α-isoform switching in gills of rainbow trout (Oncorhynchus mykiss) during salinity transfer. J. Exp. Biol. 206, 4475–4486. 
Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S.D., Mungall, K., Lee, S., Okada, H.M., Qian, J.Q., Griffith, M., Raymond, A., Thiessen, N., Cezard, T., Butterfield, Y.S., Newsome, R., Chan, S.K., She, R., Varhol, R., Kamoh, B., Prabhu, A.L., Tam, A., Zhao, Y., Moore, R.A., Hirst, M., Marra, M.A., Jones, S.J., Hoodless, P.A., Birol, I., 2010. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912. Rozen, S., Skaletsky, H., 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132, 365–386. Sadamoto, H., Takahashi, H., Okada, T., Kenmoku, H., Toyota, M., Asakawa, Y., 2012. De novo sequencing and transcriptome analysis of the central nervous system of mollusc Lymnaea stagnalis by deep RNA sequencing. PLoS ONE 7 (e42546). Salem, M., Yao, J., Rexroad, C.E., Kenney, P.B., Semmens, K., Killefer, J., Nath, J., 2005. Characterization of calpastatin gene in fish: its potential role in muscle growth and fillet quality. Comp. Biochem. Physiol. B Biochem. Mol. Biol. 141, 488–497. Salem, M., Vallejo, R.L., Leeds, T.D., Palti, Y., Liu, S., Sabbagh, A., Rexroad III, C.E., Yao, J., 2012. RNA-seq identifies SNP markers for growth traits in rainbow trout. PLoS ONE 7 (e36264). Sang, N.V., Thomassen, M., Klemetsdal, G., Gjøen, H.M., 2009. Prediction of fillet weight, fillet yield, and fillet fat for live river catfish (Pangasianodon hypophthalmus). Aquaculture 288, 166–171. Sang, N.V., Klemetsdal, G., Ødegård, J., Gjøen, H.M., 2012. Genetic parameters of economically important traits recorded at a given age in striped catfish (Pangasianodon hypophthalmus). Aquaculture 344–349, 82–89. Sarropoulou, E., Fernandes, J.M.O., 2011. Comparative genomics in teleost species: knowledge transfer by linking the genomes of model and non-model fish species. Comp. Biochem. Physiol. D 6, 92–102. Schulz, M.H., Zerbino, D.R., Vingron, M., Birney, E., 2012. 
Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28, 1086–1092. Scott, G.R., Schulte, P.M., 2005. Intraspecific variation in gene expression after seawater transfer in gills of the euryhaline killifish Fundulus heteroclitus. Comp. Biochem. Physiol. A Mol. Integr. Physiol. 141, 176–182. Scott, G.R., Richards, J.G., Forbush, B., Isenring, P., Schulte, P.M., 2004. Changes in gene expression in gills of the euryhaline killifish Fundulus heteroclitus after abrupt salinity transfer. Am. J. Physiol. Cell Physiol. 287, C300–C309. Seidelin, M., Madsen, S.S., Byrialsen, A., Kristiansen, K., 1999. Effects of insulin-like growth factor-I and cortisol on Na+, K+-ATPase expression in osmoregulatory tissues of brown trout (Salmo trutta). Gen. Comp. Endocrinol. 113, 331–342.

Singer, T.D., Clements, K.M., Semple, J.W., Schulte, P.M., Bystriansky, J.S., Finstad, B., Fleming, I.A., McKinley, R.S., 2002. Seawater tolerance and gene expression in two strains of Atlantic salmon smolts. Can. J. Fish. Aquat. Sci. 59, 125–135. Singer, T.D., Finstad, B., McCormick, S.D., Wiseman, S.B., Schulte, P.M., McKinley, R.S., 2003. Interactive effects of cortisol treatment and ambient seawater challenge on gill Na+, K+-ATPase and CFTR expression in two strains of Atlantic salmon smolts. Aquaculture 222, 15–28. Singer, T.D., Raptis, S., Sathiyaa, R., Nichols, J.W., Playle, R.C., Vijayan, M.M., 2007. Tissue-specific modulation of glucocorticoid receptor expression in response to salinity acclimation in rainbow trout. Comp. Biochem. Physiol. B Biochem. Mol. Biol. 146, 271–278. Sun, X.L., Yu, Q.Y., Tang, L.L., Ji, W., Bai, X., Cai, H., Liu, X.F., Ding, X.D., Zhu, Y.M., 2013. GsSRK, a G-type lectin S-receptor-like serine/threonine protein kinase, is a positive regulator of plant tolerance to salt stress. J. Plant Physiol. 170, 505–515. Surget-Groba, Y., Montoya-Burgos, J.I., 2010. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 20, 1432–1440. Thanh, N.M., Jung, H., Lyons, R.E., Chand, V., Tuan, N.V., Thu, V.T.M., Mather, P., 2014. A transcriptomic analysis of striped catfish (Pangasianodon hypophthalmus) in response to salinity adaptation: de novo assembly, gene annotation and marker discovery. Comp. Biochem. Physiol. D 10, 52–63. Tipsmark, C.K., Madsen, S.S., Borski, R.J., 2004. Effect of salinity on expression of branchial ion transporters in striped bass (Morone saxatilis). J. Exp. Zool. A Comp. Exp. Biol. 301, 979–991. Tipsmark, C.K., Luckenbach, J.A., Madsen, S.S., Borski, R.J., 2007. IGF-I and branchial IGF receptor expression and localization during salinity acclimation in striped bass. Am. J. Physiol. Regul. Integr. Comp. Physiol. 292, R535–R543. Vasemägi, A., Primmer, C.R., 2005. 
Challenges for identifying functionally important genetic variation: the promise of combining complementary research strategies. Mol. Ecol. 14, 3623–3642. Vera, J.C., Wheat, C.W., Fescemyer, H.W., Frilander, M.J., Crawford, D.L., Hanski, I., Marden, J.H., 2008. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 17, 1636–1647. Wang, S., Liu, Z., 2011. SNP discovery through EST data mining. In: Liu, Z. (Ed.), Next Generation Sequencing & Whole Genome Selection in Aquaculture. Wiley-Blackwell, pp. 91–108. Wang, S., Hou, R., Bao, Z., Du, H., He, Y., Su, H., Zhang, Y., Fu, X., Jiao, W., Li, Y., Zhang, L., Wang, S., Hu, X., 2013. Transcriptome sequencing of Zhikong scallop (Chlamys farreri) and comparative transcriptomic analysis with Yesso scallop (Patinopecten yessoensis). PLoS ONE 8 (e63927). Watanabe, S., Mekuchi, M., Ideuchi, H., Kim, Y.K., Kaneko, T., 2011. Electroneutral cation-Cl− cotransporters NKCC2β and NCCβ expressed in the intestinal tract of Japanese eel Anguilla japonica. Comp. Biochem. Physiol. A Mol. Integr. Physiol. 159, 427–435. Wong, L.L., Peatman, E., Lu, J., Kucuktas, H., He, S., Zhou, C., Na-Nakorn, U., Liu, Z., 2011. DNA barcoding of catfish: species authentication and phylogenetic assessment. PLoS ONE 6 (e17812). Wu, P., Qi, D., Chen, L., Zhang, H., Zhang, X., Qin, G.J., Hu, S., 2009. Gene discovery from an ovary cDNA library of oriental river prawn Macrobrachium nipponense by ESTs annotation. Comp. Biochem. Physiol. D 4, 111–120. Yang, W.K., Kang, C.K., Chen, T.Y., Chang, W.B., Lee, T.H., 2011. Salinity-dependent expression of the branchial Na+/K+/2Cl− cotransporter and Na+/K+-ATPase in the sailfin molly correlates with hypoosmoregulatory endurance. J. Comp. Physiol. B 181, 953–964. Zerbino, D.R., Birney, E., 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829. Zhao, Q., Liu, X., Collodi, P., 2001. 
Identification and characterization of a novel fibronectin in zebrafish. Exp. Cell Res. 268, 211–219. Zhou, Y., Gao, F., Liu, R., Feng, J., Li, H., 2012. De novo sequencing and analysis of root transcriptome using 454 pyrosequencing to discover putative genes associated with drought tolerance in Ammopiptanthus mongolicus. BMC Genomics 13, 266.
