GENE-41175; No. of pages: 7; 4C: Gene xxx (2016) xxx–xxx


Gene journal homepage: www.elsevier.com/locate/gene

Research Paper

Survey and analysis of simple sequence repeats (SSRs) in three genomes of Candida species

Dongmei Jia

Department of Dermatology, Qingpu Branch of Zhongshan Hospital, Fudan University, 1158 East Gongyuan Road, 201700, Shanghai, PR China

Article info

Article history:
Received 26 October 2015
Received in revised form 15 January 2016
Accepted 12 February 2016
Available online xxxx

Keywords: Simple sequence repeats; SSRs; Microsatellites; Candida

Abstract

Simple sequence repeats (SSRs), or microsatellites, which are composed of tandemly repeated short units of 1–6 bp, have attracted continuous attention. Here, the distribution, composition and polymorphism of microsatellites and compound microsatellites were analyzed in three available genomes of Candida species (Candida dubliniensis, Candida glabrata and Candida orthopsilosis). The results show that there were 118,047, 66,259 and 61,119 microsatellites in the genomes of C. dubliniensis, C. glabrata and C. orthopsilosis, respectively. The SSRs covered more than 1/3 of the length of the genomes in the three species. Microsatellites consisting only of the bases A and/or T, such as (A)n, (T)n, (AT)n, (TA)n, (AAT)n, (TAA)n, (TTA)n, (ATA)n, (ATT)n and (TAT)n, were predominant in the three genomes. The lengths of microsatellites were concentrated at 6 bp and 9 bp, both in the three genomes and in their coding sequences. Moreover, the relative abundance (19.89/kbp) and relative density (167.87 bp/kbp) of SSRs in the mitochondrial sequence of C. glabrata were significantly greater than those in any of the genomes or chromosomes of the three species. In addition, the distance between any two adjacent microsatellites was an important factor influencing the formation of compound microsatellites. The analysis may be helpful for further studying the roles of microsatellites in the origination, organization and evolution of the genomes of Candida species. © 2016 Elsevier B.V. All rights reserved.

Abbreviations: SSRs, simple sequence repeats; CM, the count of microsatellites; LM, the length of microsatellites; RA, relative abundance; RD, relative density; CCM, the count of compound microsatellites; LCM, the length of compound microsatellites.

E-mail address: [email protected].

1. Introduction

Candida species are pathogenic yeasts and are the most prevalent cause of opportunistic fungal infections in humans (Butler et al., 2009). The diseases caused by these yeasts include superficial infections of the oral cavity and vagina (commonly known as thrush) and deep-seated systemic infections, and are associated with high levels of morbidity and mortality (Jackson et al., 2009). In normal populations, superficial Candida infections of the skin occur because of a combination of skin barrier deficiency and the opportunistic dissemination of endogenous Candida species colonizing the skin (Hube et al., 2015). Candida infection can also be activated by antibiotic or steroid therapy, along with predisposing factors for cutaneous candidiasis such as obesity and diabetes mellitus.

Candida dubliniensis is the species most closely related to Candida albicans (Sullivan et al., 2004), which is the most frequent cause of superficial and systemic candidosis. Although C. dubliniensis is not widely recognized as a pathogenic yeast species, about 2%–3% of the isolates recovered from blood samples are C. dubliniensis (Kibbler et al., 2003; Odds et al., 2007). Epidemiological and infection model data suggest that C. dubliniensis is less prevalent than C. albicans because it is substantially less pathogenic (Vilela et al., 2002). Candida glabrata, which is responsible for 15% of cases of systemic candidosis, can cause disease independently, even though it shows no secreted proteinase activity and apparently cannot make true hyphae (Kaur et al., 2005). Candida orthopsilosis is closely related to Candida parapsilosis, which is also one of the most common causes of Candida infection, particularly in South America (Pfaller et al., 2011). Whereas C. parapsilosis is a major cause of disease in immunosuppressed individuals and in premature neonates, C. orthopsilosis is more rarely associated with infection (Riccombeni et al., 2012).

Microsatellites, also referred to as short tandem repeats (STRs) or simple sequence repeats (SSRs), are DNA/RNA sequences made of tandemly repeated units of 1–6 bp (Tautz, 1993). Although experimental confirmation is needed, microsatellites may also be a good choice for studying genome evolution (Madsen et al., 2008) because of their polymorphism and high mutability (Kim et al., 2008; Madsen et al., 2008). These length polymorphisms are caused by slipped-strand mutations and may also affect the local structure of the DNA molecule or the encoded proteins (Ellegren, 2004). SSRs have been extensively surveyed at the genome-wide level in eukaryotes, prokaryotes and viruses; the results revealed that SSRs are not randomly distributed in genomes (Hong et al., 2007; Rajendrakumar et al., 2007). Moreover, SSRs are thought not only to influence transcriptional activity (Kashi et al., 1997) but also to play a functional role in the evolution of gene regulation (Gur-Arie et al., 2000; Huang et al.,

http://dx.doi.org/10.1016/j.gene.2016.02.018 0378-1119/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: Jia, D., Survey and analysis of simple sequence repeats (SSRs) in three genomes of Candida species, Gene (2016), http:// dx.doi.org/10.1016/j.gene.2016.02.018


D. Jia / Gene xxx (2016) xxx–xxx

2003; Kashi and King, 2006). Although the importance of microsatellites in genomes has gradually been recognized, their ubiquitous occurrence has puzzled geneticists ever since their discovery in the early 1980s (Miesfeld et al., 1981). In addition, it has been proposed that recombination between homologous microsatellites generates compound microsatellites (Jakupciak and Wells, 1999), such as (TG)n-(T)n-(TG)n-(ACG)n, which consist of two or more individual microsatellites (Kofler et al., 2008) and are expected to be more polymorphic than single microsatellites (Chen et al., 2011). Compound microsatellites have been investigated in human genomes (Bull et al., 1999; Weber, 1990), eight eukaryotic genomes (Kofler et al., 2008), Escherichia coli genomes (Chen et al., 2011) and some viral genomes (Alam et al., 2014a, 2014b; Wu et al., 2014). Although debates over microsatellite evolution are never-ending, these studies provide useful information for exploring the possible roles of microsatellites.

In this study, the distribution, composition and polymorphism of microsatellites and compound microsatellites were analyzed in three available genomes of Candida (Candida dubliniensis CD36, Candida glabrata CBS 138 and Candida orthopsilosis Co 90-125). The analysis may be helpful for further studying the roles of repeat sequences in the genome origination, organization and evolution of Candida species.

2. Material and methods

2.1. Genome sequences

All three available genomes (each chromosome was assembled completely) and annotations of Candida species (Candida dubliniensis CD36, Candida glabrata CBS 138 and Candida orthopsilosis Co 90-125) were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/genome/). A total of 30 chromosomes exist in the three Candida species, one of which is the complete mitochondrial chromosome (No. 22, Chr-MT) of Candida glabrata CBS 138 (Table 1, Supplementary Table 1).

2.2. Classification of microsatellites

Mono-SSRs, di-SSRs, tri-SSRs, tetra-SSRs, penta-SSRs and hexa-SSRs were divided into 4, 6, 10, 33, 102 and 350 kinds, respectively, according to reading frame and complementary strand (Jurka and Pethiyagoda, 1995; Katti et al., 2001). For example, mono-SSRs

Table 1. The overview of three genomes of Candida species.

| No. | Chr. (a) | Acc. No. (b) | Size (bp) (c) | GC content of sequence (%) (d) | CM (e) | LM (f) | GC content of SSRs (%) (g) | RA (h) | RD (i) |
|---|---|---|---|---|---|---|---|---|---|
| Candida dubliniensis | | | | | | | | | |
| 1 | Chr-1 | NC_012860.1 | 3,214,061 (1,971,465) | 32.96 (34.69) | 26,083 (9442) | 238,103 (77,990) | 18.35 (23.36) | 8.12 (4.79) | 74.08 (39.56) |
| 2 | Chr-2 | NC_012861.1 | 2,289,089 (1,440,477) | 33.29 (34.87) | 18,183 (7041) | 165,324 (58,911) | 18.63 (23.57) | 7.94 (4.89) | 72.22 (40.90) |
| 3 | Chr-3 | NC_012862.1 | 1,863,824 (1,147,182) | 33.23 (34.81) | 15,076 (5593) | 137,565 (46,199) | 18.65 (23.32) | 8.09 (4.88) | 73.81 (40.27) |
| 4 | Chr-4 | NC_012863.1 | 1,641,709 (1,005,912) | 33.12 (34.54) | 13,197 (4900) | 119,424 (40,644) | 19.02 (23.61) | 8.04 (4.87) | 72.74 (40.41) |
| 5 | Chr-5 | NC_012864.1 | 1,245,899 (726,456) | 33.09 (34.63) | 10,427 (3458) | 95,377 (28,642) | 18.40 (23.29) | 8.37 (4.76) | 76.55 (39.43) |
| 6 | Chr-6 | NC_012865.1 | 1,073,895 (665,028) | 33.18 (34.63) | 8497 (3176) | 77,363 (26,719) | 19.18 (24.18) | 7.91 (4.78) | 72.04 (40.18) |
| 7 | Chr-7 | NC_012866.1 | 1,022,435 (581,247) | 33.74 (34.62) | 8110 (2895) | 73,011 (24,386) | 19.05 (23.35) | 7.93 (4.98) | 71.41 (41.95) |
| 8 | Chr-R | NC_012867.1 | 2,267,510 (1,388,880) | 33.63 (35.50) | 18,474 (6646) | 165,396 (53,407) | 18.42 (24.14) | 8.15 (4.79) | 72.94 (38.45) |
| | Genome | | 14,618,422 (8,926,647) | 33.25 (34.83) | 118,047 (43,151) | 1,071,563 (356,898) | 18.63 (23.59) | 8.08 (4.83) | 73.3 (39.98) |
| Candida glabrata | | | | | | | | | |
| 9 | Chr-A | NC_005967.2 | 491,328 (298,614) | 40.11 (42.60) | 2702 (1111) | 19,501 (8022) | 27.50 (35.04) | 5.50 (3.72) | 39.69 (26.86) |
| 10 | Chr-B | NC_005968.1 | 502,101 (316,323) | 38.91 (40.83) | 2672 (1172) | 19,054 (8134) | 23.62 (30.21) | 5.32 (3.71) | 37.95 (25.71) |
| 11 | Chr-C | NC_006026.1 | 558,504 (348,186) | 39.69 (41.64) | 2845 (1269) | 20,310 (9036) | 25.53 (33.61) | 5.09 (3.64) | 36.37 (25.95) |
| 12 | Chr-D | NC_006027.1 | 651,701 (418,713) | 39.09 (40.87) | 3360 (1536) | 23,558 (10,641) | 25.37 (31.45) | 5.16 (3.67) | 36.15 (25.41) |
| 13 | Chr-E | NC_006028.2 | 687,738 (441,120) | 39.00 (40.90) | 3663 (1683) | 26,151 (11,954) | 24.54 (32.25) | 5.33 (3.82) | 38.02 (27.10) |
| 14 | Chr-F | NC_006029.1 | 927,101 (578,139) | 38.10 (39.98) | 5194 (2280) | 36,954 (16,108) | 23.61 (29.75) | 5.60 (3.94) | 39.86 (27.86) |
| 15 | Chr-G | NC_006030.1 | 992,211 (649,206) | 38.58 (40.39) | 5482 (2523) | 39,181 (17,773) | 23.91 (29.30) | 5.53 (3.89) | 39.49 (27.38) |
| 16 | Chr-H | NC_006031.1 | 1,050,361 (690,678) | 38.05 (39.78) | 5659 (2697) | 39,827 (18,857) | 22.76 (28.28) | 5.39 (3.90) | 37.92 (27.30) |
| 17 | Chr-I | NC_006032.2 | 1,100,349 (697,779) | 38.62 (40.60) | 5989 (2623) | 42,315 (18,063) | 23.34 (29.90) | 5.44 (3.76) | 38.46 (25.89) |
| 18 | Chr-J | NC_006033.2 | 1,195,132 (801,393) | 38.88 (40.85) | 6158 (2971) | 44,540 (21,567) | 25.10 (32.35) | 5.15 (3.71) | 37.27 (26.91) |
| 19 | Chr-K | NC_006034.2 | 1,302,831 (824,157) | 38.25 (40.16) | 6995 (3122) | 49,580 (21,739) | 23.60 (30.12) | 5.37 (3.79) | 38.06 (26.38) |
| 20 | Chr-L | NC_006035.2 | 1,455,689 (919,188) | 38.47 (40.32) | 7781 (3390) | 55,147 (23,715) | 23.32 (29.75) | 5.35 (3.69) | 37.88 (25.80) |
| 21 | Chr-M | NC_006036.2 | 1,402,899 (937,767) | 38.54 (40.16) | 7360 (3459) | 52,422 (24,290) | 22.98 (28.86) | 5.25 (3.69) | 37.37 (25.90) |
| 22 | Chr-MT | NC_004691.1 | 20,063 (9528) | 17.64 (18.99) | 399 (119) | 3368 (1034) | 1.01 (1.55) | 19.89 (12.49) | 167.87 (108.52) |
| | Genome | | 12,338,008 (7,930,791) | 38.62 (40.50) | 66,259 (29,955) | 471,908 (210,933) | 23.8 (30.27) | 5.37 (3.78) | 38.25 (26.6) |
| Candida orthopsilosis | | | | | | | | | |
| 23 | Chr-1 | NC_018292.1 | 2,899,497 (2,022,726) | 37.53 (38.75) | 13,882 (7086) | 102,648 (52,872) | 25.05 (30.56) | 4.79 (3.50) | 35.40 (26.14) |
| 24 | Chr-2 | NC_018295.1 | 2,405,696 (1,655,355) | 37.37 (38.46) | 11,523 (5694) | 84,977 (42,205) | 24.43 (29.36) | 4.79 (3.44) | 35.32 (25.50) |
| 25 | Chr-3 | NC_018296.1 | 1,622,663 (1,066,067) | 37.44 (38.72) | 8164 (3779) | 60,568 (28,931) | 24.97 (31.12) | 5.03 (3.54) | 37.33 (27.14) |
| 26 | Chr-4 | NC_018297.1 | 1,577,249 (1,080,161) | 37.52 (38.73) | 7736 (3674) | 57,056 (27,285) | 24.63 (30.20) | 4.90 (3.40) | 36.17 (25.26) |
| 27 | Chr-5 | NC_018298.1 | 1,441,205 (968,764) | 37.38 (38.51) | 7121 (3434) | 53,432 (25,763) | 25.29 (30.90) | 4.94 (3.54) | 37.07 (26.59) |
| 28 | Chr-6 | NC_018300.1 | 1,012,063 (676,896) | 37.32 (38.45) | 5119 (2392) | 39,375 (18,829) | 25.41 (30.41) | 5.06 (3.53) | 38.91 (27.82) |
| 29 | Chr-7 | NC_018301.1 | 922,188 (604,064) | 37.59 (38.60) | 4559 (2171) | 34,271 (16,257) | 25.11 (30.34) | 4.94 (3.59) | 37.16 (26.91) |
| 30 | Chr-8 | NC_018302.1 | 601,572 (397,615) | 37.53 (38.69) | 3015 (1409) | 22,907 (10,871) | 25.97 (32.04) | 5.01 (3.54) | 38.08 (27.34) |
| | Genome | | 12,482,133 (8,471,648) | 37.46 (38.62) | 61,119 (29,639) | 455,234 (223,013) | 24.99 (30.44) | 4.9 (3.5) | 36.47 (26.32) |

(a) Chr.: the names of each chromosome in NCBI.
(b) Acc. No.: accession number.
(c) Size: the chromosome size and its coding sequences size in parentheses.
(d) GC content of sequence: the GC content in each chromosome and in its coding sequences in parentheses.
(e) CM: the counts of microsatellites in each chromosome and in its coding sequences in parentheses.
(f) LM: the length of microsatellites in each chromosome and in its coding sequences in parentheses.
(g) GC content of SSRs: the GC content of microsatellites in each chromosome and in its coding sequences in parentheses.
(h) RA: relative abundance of microsatellites (the number of microsatellites per kbp) in each chromosome and in its coding sequences in parentheses.
(i) RD: relative density of microsatellites (the total length (bp) contributed by each microsatellite per kbp of sequence analyzed) in each chromosome and in its coding sequences in parentheses.


fall into: (A)n, (T)n, (C)n and (G)n; di-SSRs consist of (AC)n/(CA)n, (AT)n/(TA)n, (AG)n/(GA)n, (CT)n/(TC)n, (CG)n/(GC)n and (TG)n/(GT)n; tri-SSRs fall into: (AAT)n/(TAA)n/(TTA)n/(ATA)n/(ATT)n/(TAT)n, (AAC)n/(CAA)n/(TTG)n/(ACA)n/(GTT)n/(TGT)n, (AAG)n/(GAA)n/(TTC)n/(AGA)n/(CTT)n/(TCT)n, (AGG)n/(GAG)n/(TCC)n/(GGA)n/(CCT)n/(CTC)n, (ACC)n/(CAC)n/(TGG)n/(CCA)n/(GGT)n/(GTG)n, (GGC)n/(GCG)n/(CGC)n/(CGG)n/(CCG)n/(GCC)n, (ATC)n/(CAT)n/(ATG)n/(TCA)n/(GAT)n/(TGA)n, (ACG)n/(GAC)n/(GTC)n/(CGA)n/(TCG)n/(CGT)n, (ACT)n/(TAC)n/(AGT)n/(CTA)n/(GTA)n/(TAG)n, and (AGC)n/(CAG)n/(CTG)n/(GCA)n/(TGC)n/(GCT)n.

2.3. Extraction of microsatellites and compound microsatellites

To work more efficiently, two Perl scripts were written to identify microsatellites (Supplementary program 1) and compound microsatellites (Supplementary program 2) in nucleotide sequences (FASTA format). The microsatellite finder is built around a regular expression that matches SSRs. The output file of the microsatellite finder is the input file of the compound microsatellite finder. In this paper, perfect mono-SSRs, di-SSRs, tri-SSRs, tetra-SSRs, penta-SSRs and hexa-SSRs were identified when the number of repeats was greater than or equal to 6, 3, 3, 3, 3 and 3, respectively, based on empirical criteria (Alam et al., 2013; Chen et al., 2012; Rajendrakumar et al., 2007; Singh et al., 2014). Adjacent microsatellites separated by less than a maximum threshold, called dmax (the distance allowed between any two microsatellites), were classified as a compound microsatellite (Kofler et al., 2008; Wu et al., 2014). The count of compound microsatellites was calculated for dmax values from 10 to 500 in steps of 10 (dmax = 10, 20, 30, …, 480, 490 and 500). The programs were run on Linux (CentOS 6.5).

2.4. Relative abundance and relative density

Relative abundance (RA) and relative density (RD) were used so that SSRs could be compared among genome sequences of different sizes. RA is defined as the number of microsatellites per unit of genomic length (kbp). RD is the number of bases (bp) contained in SSRs per unit of genomic length (kbp). Data analysis and plotting were performed in R (version 3.2.2).

3. Results

3.1. The count, length, RA and RD of SSRs in the three genomes

Here, we selected and analyzed the three genomes of C. dubliniensis, C. glabrata and C. orthopsilosis. The genome of C. dubliniensis has 8 chromosomes with a total length of 14,618,422 bp; the genomes of C. glabrata and C. orthopsilosis have 14 chromosomes with 12,338,008 bp and 8 chromosomes with 12,482,133 bp, respectively. It is worth noting that Chr-MT (No. 22) is the complete mitochondrial sequence of C. glabrata (Table 1, Supplementary Table 1). The GC contents were 33.25%, 38.62% and 37.46% in the genomes of C. dubliniensis, C. glabrata and C. orthopsilosis, respectively, and the GC contents of the coding sequences (CDS) were slightly higher than those of the corresponding genomes.

The distribution of SSRs was surveyed in these genomes, and the results are shown in Table 1 and Supplementary Table 2. The counts of microsatellites (CM) were 118,047 and 43,151 in the genome and the coding sequences of C. dubliniensis, respectively; the lengths of microsatellites (LM) were 1,071,563 bp and 356,898 bp in the genome and the coding sequences, respectively. As with the sequences themselves, the GC content of SSRs in the coding sequences (23.59%) was greater than that in the genome (18.63%) of C. dubliniensis. The RA and RD were 8.08/kbp and 73.30 bp/kbp in C. dubliniensis, the highest among the three genomes. The genome of C. glabrata has 13 nuclear chromosomes and a single mitochondrial chromosome; it has the highest GC content (38.62%) and the smallest genome size of the three species. The CM, LM, RA and RD of SSRs were 66,259, 471,908 bp, 5.37/kbp and 38.25 bp/kbp in the C. glabrata genome, respectively. The genomes of C. glabrata and C. orthopsilosis were more similar to each other in size and GC content than either was to C. dubliniensis, and their SSR distributions were also more similar.

3.2. The length of SSRs with different repeated units

In the distribution of SSRs by repeat type, the vast majority of microsatellites were mono-SSRs, di-SSRs and tri-SSRs; tetra-SSRs, penta-SSRs and hexa-SSRs were rarely extracted from the genomes and coding sequences (Fig. 1, Supplementary Table 2). Among mono-SSRs, (A)n and (T)n were predominant in the three genomes and coding sequences, especially in C. dubliniensis (Fig. 2, Supplementary Table 3). Mono-SSRs were distributed preferentially in non-coding sequences and were rare in coding sequences. There were very few (G)n and (C)n in the three genomes. Among di-SSRs, (AT)n/(TA)n had the greatest total length, followed by (AC)n/(CA)n, (AG)n/(GA)n, (CT)n/(TC)n and (TG)n/(GT)n; (CG)n/(GC)n were the least abundant in the three genomes and coding sequences (Fig. 3, Supplementary Table 3). In addition, the total length of tri-SSRs detected in the genomes of C. dubliniensis and C. orthopsilosis was significantly greater than that detected in the genome of C. glabrata (Fig. 4, Supplementary Table 3). The tri-SSRs were mainly concentrated in (AAT)n/(TAA)n/(TTA)n/(ATA)n/(ATT)n/(TAT)n, (AAC)n/(CAA)n/(TTG)n/(ACA)n/(GTT)n/(TGT)n and (AAG)n/(GAA)n/(TTC)n/(AGA)n/(CTT)n/(TCT)n. The tri-SSRs (GGC)n/(GCG)n/(CGC)n/(CGG)n/(CCG)n/(GCC)n barely occurred in these three genomes. Compared with the other two genomes, the genome of C. glabrata had fewer tri-SSRs, and these were located mainly in coding sequences. Because they were so rare, tetra-, penta- and hexanucleotide repeats were not analyzed in detail in this paper.

3.3. The distribution of SSRs' length

We also investigated the length distribution of SSRs located in the complete genomes and in their coding sequences (Fig. 5, Supplementary Table 4). The lengths of microsatellites were clearly concentrated at 6 bp and 9 bp, both in the genomes and in the coding sequences. Most SSRs were shorter than 10 bp, and SSRs longer than 30 bp were very rare. Longer SSRs (>20 bp) tended to occur more often in the genomes of C. dubliniensis and C. orthopsilosis. In all three genomes and their coding sequences, the counts changed in a similar way as the length of microsatellites increased.
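As an illustration of the regular-expression search described in Section 2.3 and of the length tally reported in this section, the following is a minimal Python sketch. It is not the Perl program of the paper (Supplementary program 1), which is not reproduced in this text; the function names, the way overlapping matches are handled and the test sequence are assumptions made for the example. Only the minimum repeat counts (6 for mono-SSRs, 3 for di- to hexa-SSRs) are taken from the paper. Under these thresholds, the shortest detectable mono- and di-SSRs are 6 bp long and the shortest tri-SSR is 9 bp long.

```python
import re
from collections import Counter

# Minimum number of repeats per unit size, as stated in Section 2.3.
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq):
    """Find perfect SSRs; return sorted (start, end, motif, repeats) tuples.

    Coordinates are 0-based and half-open. This is a simplified search
    (assumption): each unit size is scanned independently, and motifs that
    are themselves repeats of a shorter unit are skipped so that a run such
    as AAAAAA is reported once, as a mono-SSR.
    """
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // unit
                hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


def length_distribution(ssrs):
    """Count SSRs by total length in bp."""
    return Counter(end - start for start, end, _, _ in ssrs)


if __name__ == "__main__":
    toy = "GGAAAAAACCATATATGAATAATAATC"  # hypothetical test sequence
    for hit in find_ssrs(toy):
        print(hit)
    print(dict(length_distribution(find_ssrs(toy))))
```

On the hypothetical sequence above, the sketch reports one (A)6, one (AT)3 and one (AAT)3, that is, two SSRs of 6 bp and one of 9 bp.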

Fig. 1. The LM and CM in the three genomes and in their coding sequences. LM, the length of microsatellites; CM, the count of microsatellites.


Fig. 2. The length of mono-SSRs in the three genomes and in their coding sequences.
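The motif classes of Section 2.2, which are the groups plotted in Figs. 2–4, can be reproduced with a short Python sketch. The grouping rule used here is an assumption inferred from the class lists in Section 2.2: mono- and di-SSR motifs are merged only across reading frames (cyclic shifts), whereas tri- to hexa-SSR motifs are merged across reading frames and the complementary strand. The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_primitive(motif):
    """Return True if the motif is not a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def rotations(motif):
    """All cyclic shifts of a motif, i.e. the same repeat read in another frame."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def canonical(motif, both_strands=True):
    """Alphabetically smallest member of the motif's class."""
    forms = rotations(motif)
    if both_strands:
        reverse_complement = motif.translate(COMPLEMENT)[::-1]
        forms |= rotations(reverse_complement)
    return min(forms)


def count_classes(unit, both_strands=True):
    """Number of distinct motif classes for a given unit size."""
    motifs = ("".join(p) for p in product("ACGT", repeat=unit))
    return len({canonical(m, both_strands) for m in motifs if is_primitive(m)})


if __name__ == "__main__":
    # Assumed rule: frame shifts only for unit sizes 1-2, both strands for 3-6.
    for unit in range(1, 7):
        print(unit, count_classes(unit, both_strands=unit >= 3))
```

Under this assumed rule the sketch yields 4, 6, 10, 33, 102 and 350 classes for unit sizes 1 to 6, matching the numbers given in Section 2.2; for example, (TTA)n and (ATA)n both map to the class represented by AAT.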

3.4. SSRs in the complete mitochondrial chromosome of C. glabrata

The genome of C. glabrata includes a complete mitochondrial sequence (Chr-MT) of 20,063 bp (Table 1). The GC content (17.64%), CM (399) and LM (3368 bp) of the complete mitochondrial sequence were lower than those of the whole genomes, but its relative abundance (19.89/kbp) and relative density (167.87 bp/kbp) were significantly greater than those of any of the genomes or chromosomes.

3.5. Distance between adjacent microsatellites

Whether two or more adjacent microsatellites count as a compound microsatellite depends on the value of dmax (Kofler et al., 2008). Adjacent microsatellites are counted as a compound microsatellite when the distance separating any two of them is less than or equal to dmax (Wu et al., 2014). The values of dmax were set from 10 to 500, increasing in steps of 10. To assess the effect of dmax on compound microsatellites, we measured the count of compound microsatellites (CCM) and the length of compound microsatellites (LCM) in the three genomes and in their coding sequences (Fig. 6). The results show that CCM increased gradually with dmax at low values of dmax, but decreased gradually with increasing dmax once dmax was greater than 100 in C. dubliniensis, or greater than 50 in C. glabrata and C. orthopsilosis. In contrast, the LCM increased continuously with dmax, although more slowly once dmax was greater than 200.

Fig. 3. The length of di-SSRs in the three genomes and in their coding sequences.

Fig. 4. The length of tri-SSRs in the three genomes and in their coding sequences.

4. Discussion

The occurrence of repeated sequences has attracted continuous attention. So far, microsatellites have been extensively studied in eukaryotic (Bacolla et al., 2008; Ellegren, 2004; Katti et al., 2001; Kelkar et al., 2008; Li et al., 2004; Rajendrakumar et al., 2007), prokaryotic (Gur-Arie et al., 2000; Kim et al., 2008; Mrazek et al., 2007) and even viral genomes, such as those of human immunodeficiency virus type 1 (Chen et al., 2009), Potyvirus (Zhao et al., 2011), Tobamoviruses (Alam et al., 2013), Carlaviruses (Alam et al., 2014a), Potexvirus (Alam et al., 2014b) and Herpesvirus (Wu et al., 2014). Although some controversies remain, all of this work has advanced the study of SSRs.

Candida species are pathogenic yeasts and are the most prevalent cause of opportunistic fungal infections in humans (Butler et al., 2009). The pathogenesis and clinical characterization of common dermatomycotic species have been compared in many previous works (Butler et al., 2009; Hube et al., 2015; Odds et al., 2007; Vilela et al., 2002). Three completely assembled genomes of Candida species were available in GenBank, which gave us the chance to perform a comparative analysis of SSRs at the genome level. Microsatellites are a good choice for studying genomes because of their polymorphism and high mutability (Kim et al., 2008; Madsen et al., 2008). Here, the distribution, composition and polymorphism of microsatellites and compound microsatellites were analyzed in the genomes of C. dubliniensis, C. glabrata and C. orthopsilosis.
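The dependence of CCM and LCM on dmax reported in Section 3.5 can be illustrated with a small Python sketch. It is not the Perl program of the paper (Supplementary program 2); it assumes that microsatellites are given as 0-based, half-open (start, end) intervals, that two neighbours are merged when the gap between them is less than or equal to dmax, and that the length of a compound microsatellite is measured from the start of its first member to the end of its last member, spacers included. The coordinates in `EXAMPLE_SSRS` are invented for the example.

```python
# Hypothetical SSR coordinates (start, end); the gaps are 10, 78, 24 and 264 bp.
EXAMPLE_SSRS = [(0, 6), (16, 22), (100, 106), (130, 136), (400, 406)]


def group_compound(ssrs, dmax):
    """Group adjacent SSRs whose separating gap is <= dmax.

    Only groups with at least two members are returned, since a compound
    microsatellite consists of two or more individual microsatellites.
    """
    clusters = []
    current = []
    current_end = None
    for start, end in sorted(ssrs):
        if current and start - current_end <= dmax:
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if len(current) >= 2:
                clusters.append(current)
            current = [(start, end)]
            current_end = end
    if len(current) >= 2:
        clusters.append(current)
    return clusters


def ccm_and_lcm(ssrs, dmax):
    """Return (count, total length in bp) of compound microsatellites."""
    clusters = group_compound(ssrs, dmax)
    total_length = sum(
        max(end for _, end in cluster) - cluster[0][0] for cluster in clusters
    )
    return len(clusters), total_length


if __name__ == "__main__":
    for dmax in (10, 30, 100, 300):
        print(dmax, ccm_and_lcm(EXAMPLE_SSRS, dmax))
```

With these invented coordinates, the count is 1 at dmax = 10, rises to 2 at dmax = 30 and falls back to 1 at dmax = 100 and 300, because neighbouring compound microsatellites fuse into one, while the total length grows from 22 bp to 58, 136 and 406 bp. This is the same qualitative pattern as described for CCM and LCM above.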
Because of their complexity, imperfect microsatellites were not examined in the present work. Among mono-SSRs, (A)n/(T)n were predominant, while (G)n/(C)n were rare in the genomes of C. dubliniensis, C. glabrata and C. orthopsilosis (Fig. 2). This pattern has also been found in many eukaryotic genomes, such as those of humans, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and Saccharomyces cerevisiae (Katti et al., 2001). On the other hand, the dinucleotide repeats (AT)n/(TA)n were significantly more abundant than other di-SSRs (Fig. 3), which is consistent only with Arabidopsis chromosomes. The three Candida species showed a common feature of yeast, whose genomes have many tri-SSRs of (AAT)n/(TAA)n/(TTA)n/(ATA)n/(ATT)n/(TAT)n (Katti et al., 2001). Interestingly, (GC)n and (GGC)n/(GCG)n/(CGC)n/(CGG)n/(CCG)n/(GCC)n repeats were extremely rare in all of the eukaryotic genomes studied. In vertebrate genomes, the low frequency of (CG)n repeats has been attributed to the methylation of cytosine, which increases its chances of mutation to thymine by deamination (Schorderet and Gartler, 1992). However, this mechanism of CpG suppression cannot explain the rarity of (CG)n dinucleotide repeats in Candida species, since they do not show cytosine methylation (Katti et al., 2001).

Fig. 5. The distribution of SSRs with different length.

Fig. 6. a. The effects of dmax on CCM. b. The effects of dmax on LCM. CCM: the count of compound microsatellites; LCM: the length of compound microsatellites; dmax: the maximum distance allowed between any two adjacent microsatellites of a compound microsatellite.

There was a limit to the length of microsatellites: most SSRs were shorter than 10 bp, and SSRs longer than 30 bp were very rare in the genomes and coding sequences of C. dubliniensis, C. glabrata and C. orthopsilosis. Replication slippage (Ellegren, 2004) tends to lengthen microsatellites, whereas point mutations tend to shorten long SSRs; the length of microsatellites might therefore be the result of a balance between slippage events and point mutations (Bell and Jurka, 1997; Kruglyak et al., 1998). Moreover, compared with expansion mutations, contraction mutations occur more frequently as allele size increases (Xu et al., 2000); long alleles tend to mutate to shorter lengths, which prevents their infinite growth (Ellegren, 2000). This may be why the overwhelming majority of SSRs were short microsatellites (≤10 bp) in the three genomes of Candida species (Fig. 1). The length and count distributions of microsatellites indicate that the frequency of repeats decreased rapidly as the length of the repeated unit increased (Supplementary Table 2). The paucity of longer microsatellites could also be due to their downward mutation bias and short persistence time (Harr and Schlotterer, 2000); longer SSRs have higher mutation rates and are more unstable than shorter SSRs (Kruglyak et al., 1998; Wierdl et al., 1997). Shorter SSRs are also variable in length and may play important roles in physiology and/or evolution (Metzgar et al., 2001; Moxon et al., 1994; Mrazek, 2006; Rocha and Blanchard, 2002). The lengths of shorter microsatellites were concentrated at 6 bp and 9 bp, both in the genomes and in their coding sequences (Fig. 5). Microsatellites of 6 bp and 9 bp may represent a fairly stable state in the three genomes of Candida species.
Moreover, microsatellites of 6 bp and 9 bp may be related to codon structure, because a codon consists of three bases and both lengths are multiples of three.

The GC content of the mitochondrial chromosome of C. glabrata was 17.64%, the lowest value among the 30 chromosomes. Meanwhile, the relative abundance (19.89/kbp) and relative density (167.87 bp/kbp) of SSRs in the complete mitochondrial sequence of C. glabrata were significantly greater than those of any of the genomes or chromosomes. GC content is an important factor affecting the formation of microsatellites (Ouyang et al., 2012). Sequences with extremely low or high GC content have a reduced range of sequence variability (lower informational entropy) and therefore a tendency to be more repetitious. This may be why the RA and RD of the mitochondrial chromosome were the highest of all the chromosomes.
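As a worked check of the RA and RD definitions in Section 2.4, the values quoted above can be recomputed from the counts, lengths and sizes in Table 1. The Python sketch below is only an illustration; the function names are hypothetical, and the input numbers are copied from Table 1.

```python
def relative_abundance(ssr_count, sequence_length_bp):
    """RA: number of microsatellites per kbp of sequence."""
    return ssr_count / (sequence_length_bp / 1000)


def relative_density(ssr_length_bp, sequence_length_bp):
    """RD: number of bases (bp) contained in SSRs per kbp of sequence."""
    return ssr_length_bp / (sequence_length_bp / 1000)


if __name__ == "__main__":
    # Chr-MT of C. glabrata (Table 1): CM = 399, LM = 3368 bp, size = 20,063 bp.
    print(round(relative_abundance(399, 20063), 2))
    print(round(relative_density(3368, 20063), 2))
    # Whole C. glabrata genome: CM = 66,259, LM = 471,908 bp, size = 12,338,008 bp.
    print(round(relative_abundance(66259, 12338008), 2))
    print(round(relative_density(471908, 12338008), 2))
```

The recomputed values are 19.89/kbp and 167.87 bp/kbp for Chr-MT and 5.37/kbp and 38.25 bp/kbp for the whole C. glabrata genome, in agreement with Table 1.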


Compound microsatellites are composed of two or more adjacent microsatellites and have been investigated in many species. Previous studies reported that the number of compound microsatellites (CCM) and their RA increase with dmax, and that the increase is not linear (Chen et al., 2011; Kofler et al., 2008; Wu et al., 2014); this was explained by adjacent SSRs being more likely to count as a compound microsatellite as dmax increases (Wu et al., 2014). However, our results show that in the three genomes of Candida species CCM increased gradually with dmax when dmax was low, and decreased gradually with increasing dmax when dmax was higher (Fig. 6). In the herpesvirus genomes (Wu et al., 2014), dmax was only set between 0 and 50, so a decrease in the count of compound microsatellites at larger values of dmax could not be observed. Here, by extending the range of dmax, we captured the overall trend of CCM in the genomes of Candida species. The CCM first increased and then decreased once dmax was greater than 100 (in C. dubliniensis) or 50 (in C. glabrata and C. orthopsilosis); the reason may be that the typical distance between two adjacent microsatellites is about 100 bp in the genome of C. dubliniensis and about 50 bp in the genomes of C. glabrata and C. orthopsilosis. On the other hand, the LCM always increased with dmax, although more slowly once dmax was greater than 200.

5. Conclusions

In conclusion, the distribution, composition and polymorphism of microsatellites and compound microsatellites were analyzed in the three available genomes of Candida species. The results show that there were 118,047, 66,259 and 61,119 microsatellites in the genomes of C. dubliniensis, C. glabrata and C. orthopsilosis, respectively. The SSRs covered more than 1/3 of the length of the genomes in the three species.
Microsatellites consisting only of the bases A and/or T, such as (A)n, (T)n, (AT)n, (TA)n, (AAT)n, (TAA)n, (TTA)n, (ATA)n, (ATT)n and (TAT)n, were predominant in the three genomes. The lengths of microsatellites were concentrated at 6 bp and 9 bp, both in the three genomes and in their coding sequences. Moreover, the relative abundance (19.89/kbp) and relative density (167.87 bp/kbp) of SSRs in the mitochondrial sequence of C. glabrata were significantly greater than those of any of the genomes or chromosomes of the three species. In addition, the distance between any two adjacent microsatellites was an important factor influencing the formation of compound microsatellites. The analysis may be helpful for further studying the roles of microsatellites in the genome origination and evolution of Candida species.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.gene.2016.02.018.

Acknowledgements

We thank the editor and the anonymous reviewers for their valuable comments. This work was supported by funding from the Shanghai Municipal Commission of Health and Family Planning (grant number 20134362) and Study of Qingpu Branch of Zhongshan Hospital, Fudan University (grant number QYM2012-09).

References

Alam, M.C., Singh, K.A., Sharfuddin, C., Ali, S., 2013. In-silico analysis of simple and imperfect microsatellites in diverse tobamovirus genomes. Gene 530, 193–200.
Alam, M.C., Singh, K.A., Sharfuddin, C., Ali, S., 2014a. Genome-wide scan for analysis of simple and imperfect microsatellites in diverse carlaviruses. Infect. Genet. Evol. 21, 287–294.
Alam, M.C., Singh, K.A., Sharfuddin, C., Ali, S., 2014b. Incidence, complexity and diversity of simple sequence repeats across potexvirus genomes. Gene 537, 189–196.
Bacolla, A., Larson, E.J., Collins, R.J., et al., 2008. Abundance and length of simple repeats in vertebrate genomes are determined by their structural properties. Genome Res. 18, 1545–1553.

Bell, G.I., Jurka, J., 1997. The length distribution of perfect dimer repetitive DNA is consistent with its evolution by an unbiased single-step mutation process. J. Mol. Evol. 44, 414–421.
Bull, N.L., Pabon-Pena, R.C., Freimer, B.N., 1999. Compound microsatellite repeats: practical and theoretical features. Genome Res. 9, 830–838.
Butler, G., Rasmussen, D.M., Lin, F.M., Santos, M., Sakthikumar, S., Munro, A.C., Rheinbay, E., 2009. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Nature 459, 657–662.
Chen, M., Tan, Z., Jiang, J., Li, M., Chen, H., et al., 2009. Similar distribution of simple sequence repeats in diverse completed Human Immunodeficiency Virus Type 1 genomes. FEBS Lett. 583, 2959–2963.
Chen, M., Zeng, G., Tan, Z., Jiang, M., Zhang, J., Zhang, C., Lu, L., Lin, Y., Peng, J., 2011. Compound microsatellites in complete Escherichia coli genomes. FEBS Lett. 585, 1072–1076.
Chen, M., Tan, Z., Zeng, G., Zeng, Z., 2012. Differential distribution of compound microsatellites in various Human Immunodeficiency Virus Type 1 complete genomes. Infect. Genet. Evol. 12, 1452–1457.
Ellegren, H., 2000. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24, 400–402.
Ellegren, H., 2004. Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5, 435–445.
Gur-Arie, R., Cohen, C.J., Eitan, Y., Shelef, L., Hallerman, E.M., Kashi, Y., 2000. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res. 10, 62–71.
Harr, B., Schlotterer, C., 2000. Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genomewide underrepresentation. Genetics 155, 1213–1220.
Hong, C.P., Piao, Z.Y., Kang, T.W., Kwon, S.J., Kim, J.S., Yang, T.J., Park, B.S., Lim, Y.P., 2007. Genomic distribution of simple sequence repeats in Brassica rapa. Mol. Cells 23, 349–356.
Huang, T.S., Lee, C.C., Chang, A.C., Lin, S., et al., 2003. Shortening of microsatellite deoxy (CA) repeats involved in GL331-induced down-regulation of matrix metalloproteinase-9 gene expression. Biochem. Biophys. Res. Commun. 300, 901–907. Hube, B., Hay, R., Brasch, J., Veraldi, S., Schaller, M., 2015. Dermatomycoses and inflammation: the adaptive balance between growth, damage, and survival. J. Mycol. Med. 25, e44–e58. Jackson, P.A., Gamble, A.J., Yeomans, T., Moran, P.G., Saunders, D., Harris, D., Aslett, M., 2009. Comparative genomics of the fungal pathogens Candida dubliniensis and Candida albicans. Genome Res. 19, 2231–2244. Jakupciak, J.P., Wells, R.D., 1999. Genetic instabilities in (CTG·CAG) repeats occur by recombination. J. Biol. Chem. 274, 23468–23479. Jurka, J., Pethiyagoda, C., 1995. Simple repetitive DNA Sequences from primates: compilation and analysis. J. Mol. Evol. 40, 120–126. Kashi, Y., King, D.G., 2006. Simple sequence repeats as advantageous mutators in evolution. Trends Genet. 22, 253–259. Kashi, Y., King, D., Soller, M., 1997. Simple sequence repeats as a source of quantitative genetic variation. Trends Genet. 13, 74–78. Katti, V.M., Ranjekar, K.P., Gupta, S.V., 2001. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18, 1161–1167. Kaur, R., Domergue, R., Zupancic, M.L., Cormack, B.P., 2005. A yeast by any other name: Candida glabrata and its interaction with the host. Curr. Opin. Microbiol. 8, 378–384. Kelkar, D.Y., Tyekucheva, S., Chiaromonte, F., et al., 2008. The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Res. 18, 30–38. Kibbler, C.C., Seaton, S., Barnes, R.A., Gransden, W.R., Holliman, R.E., et al., 2003. Management and outcome of bloodstream infections due to Candida species in England and Wales. J. Hosp. Infect. 54, 18–24. Kim, T.S., Booth, J.G., Gauch, H.G., Sun, Q., Park, J., Lee, Y.H., Lee, K., 2008. 
Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference. BMC Genomics 9, 31. Kofler, R., Schlotterer, C., Luschutzky, E., Lelley, T., 2008. Survey of microsatellite clustering in eight fully sequenced species sheds light on the origin of compound microsatellites. BMC Genomics 9, 612. Kruglyak, S., Durrett, R.T., Schug, M.D., Aquadro, C.F., 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. PNAS 95, 10774–10778. Li, Y.C., Korol, A.B., Fahima, T., Nevo, E., 2004. Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. 21, 991–1007. Madsen, E.B., Villesen, P., Wiuf, C., 2008. Short tandem repeats in human exons: a target for disease mutations. BMC Genomics 9, 410. Metzgar, D., Thomas, E., Davis, C., Field, D., Wills, C., 2001. The microsatellites of Escherichia coli: rapidly evolving repetitive DNAs in a non-pathogenic prokaryote. Mol. Microbiol. 39, 183–190. Miesfeld, R., Krystal, M., Arnheim, N., 1981. A member of a new repeated sequence family which is conserved throughout eucaryotic evolution is found between the human δ- and β-globin genes. Nucleic Acids Res. 9, 5931–5947. Moxon, E.R., Rainey, P.B., Nowak, M.A., Lenski, R.E., 1994. Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr. Biol. 4, 24–33. Mrazek, J., 2006. Analysis of distribution indicates diverse functions of simple sequence repeats in Mycoplasma genomes. Mol. Biol. Evol. 23, 1370–1385. Mrazek, J., Guo, X., Shah, A., 2007. Simple sequence repeats in prokaryotic genomes. PNAS 104, 8472–8477. Odds, F.C., Hanson, M.F., Davidson, A.D., Jacobsen, M.D., Wright, P., Whyte, J.A., Gow, N.A., Jones, B.L., 2007. One year prospective survey of Candida bloodstream infections in Scotland. J. Med. Microbiol. 56, 1066–1075. Ouyang, Q., Zhao, X., Feng, H., Tian, Y., Li, D., Li, M., Tan, Z., 2012. 
High GC content of simple sequence repeats in herpes simplex virus type 1 genome. Gene 499, 37–40.


Pfaller, M.A., Messer, S.A., Moet, G.J., Jones, R.N., Castanheira, M., 2011. Candida bloodstream infections: comparison of species distribution and resistance to echinocandin and azole antifungal agents in Intensive Care unit (ICU) and non-ICU settings in the SENTRY antimicrobial surveillance program (2008–2009). Int. J. Antimicrob. Agents 38, 65–69.
Rajendrakumar, P., Biswal, A.K., Balachandran, S.M., Srinivasarao, K., Sundaram, R.M., 2007. Simple sequence repeats in organellar genomes of rice: frequency and distribution in genic and intergenic regions. Bioinformatics 23, 1–4.
Riccombeni, A., Vidanes, G., Proux-Wera, E., Wolfe, K.H., Butler, G., 2012. Sequence and analysis of the genome of the pathogenic yeast Candida orthopsilosis. PLoS One 7, e35750.
Rocha, E.P.C., Blanchard, A., 2002. Genomic repeats, genome plasticity and the dynamics of Mycoplasma evolution. Nucleic Acids Res. 30, 2031–2042.
Schorderet, D.F., Gartler, S.M., 1992. Analysis of CpG suppression in methylated and nonmethylated species. Proc. Natl. Acad. Sci. 89, 957–961.
Singh, K.A., Alam, M.C., Sharfuddin, C., Ali, S., 2014. Frequency and distribution of simple and compound microsatellites in forty-eight Human papillomavirus (HPV) genomes. Infect. Genet. Evol. 24, 92–98.


Sullivan, D.J., Moran, G.P., Pinjon, E., Al-Mosaid, A., Stokes, C., Vaughan, C., Coleman, D.C., 2004. Comparison of the epidemiology, drug resistance mechanisms, and virulence of Candida dubliniensis and Candida albicans. FEMS Yeast Res. 4, 369–376.
Tautz, D., 1993. Notes on the definition and nomenclature of tandemly repetitive DNA sequences. EXS 67, 21–28.
Vilela, M.M., Kamei, K., Sano, A., Tanaka, R., Uno, J., Takahashi, I., Ito, J., Yarita, K., Miyaji, M., 2002. Pathogenicity and virulence of Candida dubliniensis: comparison with C. albicans. Med. Mycol. 40, 249–257.
Weber, J.L., 1990. Informativeness of human (dC-dA)n. (dG-dT)n polymorphisms. Genomics 7, 524–530.
Wierdl, M., Dominska, M., Petes, T.D., 1997. Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146, 769–779.
Wu, X., Zhou, L., Zhao, X., Tan, Z., 2014. The analysis of microsatellites and compound microsatellites in 56 complete genomes of Herpesvirales. Gene 551, 103–109.
Xu, X., Peng, M., Fang, Z., Xu, X., 2000. The direction of microsatellite mutations is dependent upon allele length. Nat. Genet. 24, 396–399.
Zhao, X., Tan, Z., Feng, H., Yang, R., Li, M., Jiang, J., Shen, G., Yu, R., 2011. Microsatellites in different Potyvirus genomes: survey and analysis. Gene 488, 52–56.

Please cite this article as: Jia, D., Survey and analysis of simple sequence repeats (SSRs) in three genomes of Candida species, Gene (2016), http://dx.doi.org/10.1016/j.gene.2016.02.018
