YEAST VOL. 8: 987-995 (1992)
Yeast Sequencing Reports
The Sequence of a 9.3 kb Segment Located on the Left Arm of the Yeast Chromosome XI Reveals Five Open Reading Frames Including the CCE1 Gene and Putative Products Related to MYO2 and to the Ribosomal Protein L10
STEVE PASCOLO*, MARJAN GHAZVINI†, JEANNE BOYER, LAURENCE COLLEAUX, AGNÈS THIERRY AND BERNARD DUJON
Unité de Génétique Moléculaire des Levures (URA 1149 du C.N.R.S.), Institut Pasteur, 25 Rue du Dr Roux, F-75724 Paris Cedex 15, France
Received 29 June 1992; accepted 6 August 1992
We report here the sequence of a 9.3 kb DNA segment of chromosome XI of Saccharomyces cerevisiae, located between the MAK11 locus and the centromere. This sequence contains four long open reading frames (ORFs), YKL160, YKL162, YKL164, YKL165 and part of another ORF, YKL166, covering altogether 90% of the entire sequence. One of these ORFs, YKL164, corresponds to CCE1. Translation products of two other ORFs, YKL160 and YKL165, exhibit homology with previously known S. cerevisiae proteins: the ribosomal protein L10, and the MYO2 gene product, respectively.
KEY WORDS - Saccharomyces cerevisiae; chromosome XI; CCE1; shotgun sequencing.
INTRODUCTION
As a part of the BRIDGE program of the Commission of European Communities to sequence chromosome XI of Saccharomyces cerevisiae, we have determined the sequence of a 9.3 kb fragment located close to the centromere on the left arm of that chromosome. This sequence is contiguous to the sequence reported by Boyer et al. (1992) in the accompanying paper. No genetic locus has previously been mapped in this fragment. The sequence was determined by the shotgun strategy, using the double-strand sequencing method on fragments of cosmid pEKG021. Five long open reading frames (ORFs) have been discovered.

*Present address: Unité des Papillomavirus, Institut Pasteur, 25 Rue du Dr Roux, 75724 Paris Cedex 15, France.
†Present address: Unité des Virus Oncogènes, Institut Pasteur, 25 Rue du Dr Roux, 75724 Paris Cedex 15, France.
0749-503X/92/011987-09 $09.50 © 1992 by John Wiley & Sons Ltd

MATERIALS AND METHODS
Strain and vectors
Escherichia coli JM101/TG1 (Δ(lac pro), thi1, supE44, hsdD5, F'(traD36, proA+B+, lacIq, lacZΔM15)) was used throughout this work. Cosmid pEKG021, received from A. Thierry (manuscript in preparation), is a pWE15 derivative containing a 37 kb insert of S. cerevisiae chromosome XI. Plasmid pBoSAl (Jacquier et al., 1992) is a derivative from pBluescript used for subcloning EcoRI fragments from pEKG021. Plasmid pBluescript (Stratagene) was used for the other cloning steps and sequencing.
S. PASCOLO ET AL.
[Figure 1: restriction map of the cosmid insert, oriented from the left telomere to the centromere (CEN); graphical residue omitted.]
Figure 1. Partial restriction map of cosmid pEKG021. EcoRI fragments sequenced in this work as well as the adjacent fragments sequenced by Boyer et al. (1992) are indicated with their exact sizes in base pairs. Adjacent EcoRI fragments larger than 1 kb that have also been mapped are indicated with approximate sizes. One XbaI fragment which was cloned to complete the sequence is also indicated (dashed line). Note that only part of the chromosome XI insert is shown here: ca. 4 kb and 8 kb are missing at the right end and the left end, respectively. Inserts of cosmids pEKG047 and pUKG041 which partially overlap pEKG021 are indicated. Abbreviations: B: BamHI; P: PstI; RI: EcoRI; RV: EcoRV; X: XbaI; X*: uncleavable XbaI site in dam+ strains (overlapped by two GATC methylation sites).
The three EcoRI fragments (2647, 1960 and 3647) were subcloned into the pBoSAl vector giving rise to plasmids pE2647, pE1960 and pE3647, respectively. The XbaI fragment 3560 was subcloned into the pBluescript SK+ vector giving rise to plasmid pX3560.
Sequencing strategy
The insert of cosmid pEKG021 was first mapped by digestions with EcoRI, EcoRV, XbaI, BamHI and PstI and by hybridizations using purified EcoRI fragments as probes (Figure 1). The sequencing strategy is outlined in Figure 2.

Random sequencing
The sequencing strategy was adapted from Thierry et al. (1990). 20 µg of EcoRI inserts from pE2647, pE1960 and pE3647 (see Figure 2) were purified using β-agarase after electrophoresis. Each EcoRI fragment was polymerized using T4 DNA ligase and then subjected to sonication (20 kHz for 8 min). Polymerization is needed to avoid biases of DNA fragmentation during sonication. DNA fragments ranging in size from 500 to 1000 bp were purified using Geneclean II (Bio 101) and then ligated with SmaI-digested and dephosphorylated pBluescript SK+ vector. These constructions were used to transform E. coli cells. Transformant clones were hybridized using purified EcoRI fragments of cosmid pEKG021 as probes. DNA was extracted from positive clones. Sequencing reactions were performed on double-stranded DNA using Sequenase (United States Biochemical Corp.), KS primer and, for some chosen clones, SK primer (Stratagene) as well. The products were analysed on 50 cm long acrylamide-urea gels.

Directed sequencing
Gaps were filled using specific oligonucleotides as primers and double-stranded DNA from pE2647, pE1960, pE3647 and pX3560 as templates. Other gaps were filled using oligonucleotides complementary to T3 or T7 promoters as primers and double-stranded DNA from deletions of pE2647 and pE1960 as templates (to generate appropriate deletions, pE2647 was digested with SalI or NsiI plus PstI and religated, and pE1960 was digested with HindIII or SpeI and religated). The EcoRI site separating fragments 2647 and 1960 was sequenced across using an asymmetric PCR reaction (Colleaux et al., 1992).

Sequence analysis software
Reading and assembling of the clones as well as interpretation of sequences were performed using DNA Strider 1.2 software (Marck, 1988). Comparisons between products of ORFs and protein databanks were performed using the SASIP package (Claverie, 1984) and the MIPS package (Mewes).

RESULTS AND DISCUSSION
Sequence determination
For the three EcoRI fragments sequenced by the random strategy (see Figure 2), 32 clones were
PRODUCTS RELATED TO MYO2 AND TO THE RIBOSOMAL PROTEIN L10
[Figure 2: map of the sequenced segment, oriented from the left telomere to the centromere; restriction sites shown: EcoRI, XbaI, EcoRV, HindIII, BamHI; ORFs shown: YKL166, YKL165, YKL164, YKL162, YKL160.]
Figure 2. Sequencing strategy and results. Small arrows indicate individual sequence reads on random clones from each of the four subfragments (filled boxes). Stars indicate sequences obtained using specific oligonucleotides, squares indicate sequences obtained using deletions and the circle indicates the sequence obtained from asymmetric PCR. Positions of the restriction sites used for mapping and cloning of the four subfragments are shown. Bottom: occurrence of ORFs in the six registers: initiation codons (small bars), stop codons (full bars). Long ORFs are indicated by thick arrows with their numbers.
sequenced from one primer only and 55 clones were sequenced from the two primers. As shown in Figure 2, the clones used for sequencing were randomly distributed except for an excessive representation of clones from the extremities of the fragment 2647 (this was due to a poor ligation of this fragment prior to sonication). Altogether, the random sequencing has provided us with 93% of the double-strand sequence of the three fragments. To complete the entire sequence, four reads were determined from deleted clones (see Materials and Methods) and 19 reads were determined from synthetic oligonucleotides in order to fill in the gaps located within the EcoRI fragments and between fragments EcoRI 1960 and EcoRI 3647. With an average length of 259 bp per read, our strategy has led to only 4.6 readings per base pair of final sequence.
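The redundancy figure quoted above can be checked with simple arithmetic. The sketch below is only an illustration: it assumes that each clone sequenced from the two primers contributes two reads, that the single asymmetric PCR read is not counted, and that the denominator is the 9350-nucleotide final sequence given under Sequence analysis. The function and variable names are ours, not the authors'.

```python
def sequencing_redundancy(n_reads: int, mean_read_length_bp: float,
                          final_length_bp: int) -> float:
    """Average number of times each base pair of the final sequence was read."""
    return n_reads * mean_read_length_bp / final_length_bp


# Read counts taken from the text (assumption: two reads per two-primer clone).
single_primer_clones = 32
double_primer_clones = 55
deletion_reads = 4
oligonucleotide_reads = 19

n_reads = (single_primer_clones + 2 * double_primer_clones
           + deletion_reads + oligonucleotide_reads)

redundancy = sequencing_redundancy(n_reads, mean_read_length_bp=259,
                                   final_length_bp=9350)
print(f"{n_reads} reads -> {redundancy:.1f} readings per base pair")
```

Under these assumptions the 165 reads reproduce the stated value of about 4.6 readings per base pair.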
Sequence analysis
Figure 3 shows the entire sequence of 9350 nucleotides. The sequence has been extended to the end of YKL160 using data from Boyer et al. (1992). Five long ORFs representing 89.6% of the total sequence have been recognized (note that YKL166 is incomplete). Each ORF has been considered from its most upstream ATG codon. Codon adaptation indexes (Sharp and Li, 1987) for the ORFs are: 0.279, 0.139, 0.112, 0.130 and 0.128 for YKL160, 162, 164, 165 and 166, respectively. This is indicative of genes expressed at low or moderate levels, except for YKL160, which encodes a product related to a ribosomal protein (see below). Part of the sequence presented here (positions 2464 to 3973) has independently been reported recently (Kleff et al., 1992). The sequence reported
To left telomere
GAATTCMTA ATAAAATCM CCMTTTCCA GTTTTCATTG ATTCAGTATG CTTATTAGTT ATTAGAAMG AAATACTGTA GCCAGGTATT GGTACACGTC
I
100
TCAGAATGTA GAATGCTTCA GCACGCTGTT CCAAMATCT AGTAAACTTG TGMCTAAAA TCTGCTCMT TTCATCTGCC TGTTTCACCA TTMACTCAT
200
ACGCACCGAG TTTACACTGG G C T C M T C M TACTTGTTCA TTTTCATTAC GAGATATATG CATCGGTTGG A G C M C M T T CGGCGCTAGT GTTAGGGACT
300
TCMCTTCCG GTCTGTTGTG CCTCTCCACT TCCTGCGAAG AGAMTTGCT CAMGTCAAC GCAGCTTCCA GGGAATMCG MCAGCCGTT AGATACGGAC I
400 500 600
AGTTGATGAC ATCGCMTAG CGTGCCTACC TATCAGGATA ACMGGTCTG MGCCAGATC GACCACTAGA CMGCATTGC CGGAGGTGGG AGACGTGAAA
700
GTCTAT T T G G M G G M GCTAAAGATG C M G T G G M G GATATATTAT TACMTACTT TGACAAAGAA ATCTACGTGG GAGAAGCCCA AGGAACTMT
800
TTCTCAGGAG GAGCTACTTC TTCGAGAAAA TGGCTGGMG GCGGCCMGA CGGCAGACGG CAAAGTATAC TATTATMTC C M C M C M G AGAAACCAGC
900
TGGACTATTC CGGCCTTCGA GMGAAAGTA GMCCCATCG CAGMCAAAA ACATGATACA GTATCCCATG CACAGGTTM TGGAAATAGA ATAGCCCTTA
1000
CGGCTGGAGA AAAACMGAG CCGGGACGM CTATAAACGA GGAGGAAAGC CMTATGCTA A T M C T C T M ACTGCTTMT GTCAGGAGM GGACTMAGA
1100
AGAAGCAGAG MGGAATTTA TTACCATGCT GAAGGAAMT CMGTAGACT CTACTTGGTC ATTCAGTAGA ATTATTTCAG MCTGGGGAC CAGAGATCCA
1200
AGGTATTGGA TGGTCGATGA TGACCCCTTA TGGAAGAAAG AAATGTTTGA GAAATATCTT TCCMTAGAT CAGCCGATCA ACTTCTTMG GAACACMTG
1300
A A A C M G C M ATTCAAAGM GCCTTTCAGA AAATGTTGCA AAACMTTCT CATATAAAAT ATTACACCCG TTGGCCTACC GCAAAGAGAC TMTTGCCGA
1400
CGAACCMTA TACAAACACT CCGTGGTCM TGAAAAGACA MGAGACAGA CCTTTCMGA TTATATAGAT ACCCTCATCG ACACTCAGM A G M T C A M A
1500
W T T G A AAACACAGGC CCTAMAGAA CTMGAGAGT ATTTAAACGG TATTATMCA ACATCATCCT CTGAAACTTT CATMCCTGG CAGCAGCTTT
1600
TAAATCACTA TGTTTTTGAT M G A G T M G A GATATATGGC CAACCGGCAC TTCAAAGTCT TMCCCACGA AGATGTTTTA MCGAGTATC TGAAAATAGT
1700
AAATACGATT GAAAACGATC T T C A M A C M ACTAAATGAG CTCCGACTGC GCMTTATAC CAGAGACCGT ATTGCTAGAG A T M C T T T M MGCTTATTA
1800
AGAGMGTGC CMTCAAAAT CAAAGCAAAT ACTAGATGGT CAGATATTTA TCCTCATATA MGTCTGATC CGCGCTTTTT ACATATGCTT GGAAGGAATG
1900
GCTCGTCCTG CCTTGATTTA TTTTTAGATT TTGTTGATGA ACAAAGGATG TACATCTTTG CACAAAGATC MTAGCCCAA CAGACGTTGA TAGATCAAAA
2000
TTTTGMTGG MTGATGCCG ATAGCGACGA GATCACCMG CAAAACATAG AAMAGTTCT GGAMATGAC CGGAAATTTG ACAAGGTGGA T M A G M G A C
2100
ATCAGTTTGA TTGTTGATGG TTTGATAAAG CAAAGAAACG AAAAGATACA ACACAAACTC CAAAATGAGC GTAGGATATT GGAGCAMAG MGCACTATT
2200
TTTGGTTACT TTTGCAAAGG ACATATACM AAACCGGTM GCCCMGCCT AGTACGTGGG ATTTAGCTTC CAAAGAGCTT GGCGMTCTC TTGAATACAA
2300
,
GGCACTAGGC GATGMGATA A C A T M G M G ACAAATTTTC GAGGATTTTA AGCCTCAAAG CTCTGCACCG ACTGCCGAAA GCGCTACTGC AAACTTMCG
2400
TTGACCGCGT CAAAAAAGAG GCATTTMCT CCGGCTGTGG MTTGGACTA TTG
TCTGA CGTGTCGACC TCTATCTTGT TMTCATTAT ATAAATTATA
2500
CATAGAGATT ACATGTTATT TCMGTTTGC GATCACCGCA C M T T T
TA GTCATTCTTG TMGTGTTCT GCMAAATTT CAGCTTTTGT ACCTTATTTT
2600
CACAAAATTC AAACACCTGT CCGMCTGTG TTTTMCCAG TGTTTTTGAA TTCMGAGTT CAGTMTACT TTCATMTTT TTTMCCACT CCATCCAAGA
2700
CAAACMTGG AGGAATGAAT CTGCCAAATC GTCATCTTTT CTCACCCCCG MTTATCTTG GATCTCTAGT ATATCACATA GCTTGAAACT TTTTTTTTTG
2800
GTMGGGCAT TTCTTATCCT ATTATTCCM ACTCCTATGA ACTCGACCAG TTTTGTAGM CTAGTTGAAT TACCTTCTAG TATTGAGGTT GAAAGTATTT
2900
TTTTCACTAG CTTTATTCGA GAATCTTTGC TATGTTTGTT AGATTTTMC TTTTTTGAAC TGGTCGGTGT CTCTTCTCTT GGAATGCACC M T A T G M G T
3000
CATCCGATGT GGATCGGACG MCATACCAT ATACCTCMC TTGGACGTAT TCGGTATTTT ATTCGTATAC TTCATTTTAT TTTCCMGTT AGAGAAAAGA
3100
ATCTGTTCGA GMTATTCAC TTTTAAAATT GGGTCTAMA TATGCCTCGA AGACATAGTT CTGGTACGTT GCCTTTCMT TGTAAACATA TCTGGTATCG
3200
GCATAGATTC AAATAAATAC TCCGTMGGT TAAATACMG CTCAGMGTT TCAGCAGGAT T C M G C T T M CTTTTTGAGG T T T T G W ATTTCTCCTC
3300
TAGATTTATC TTTTGCCAGT CTAGTACTTT AGGGAGCGGA TCATTATTGA GCMTTGCAT CTTAGAGAAA GCAAAGTTAG AAACACCAGC ATCCATAGAC
3400
MTATGTTM TTCTTCCCTC TCTTATCTTT TGTTGTCGTA ACTTCTCCM AAATTCACAC TGTTCTTGAA TGTAGGTTCT TTTAGCTTCT TTCGTCGTGC
3500
CATTTACTGC T C C M T M C A MTGATAAAG ATTTCAGTTG TGTGCTTTTT GCATTTTGGC AGCAGGAATC GATGAGTTGC MTATCTTAG CTTTCTGTGC
3600
TTACTATCTT TTTTAGCTTT AGACTGTATA ACCGTMGCC TTCGTTGTAT GCMTGGATA MCAAATATG GMTCGCCAG TTGAAWAAA
3700
AAAATGAGM MGTACGGAT MTCCGAAGA MTTTAGTCT GATTCCATCC ATTTTTTGTA T C T T M G A M T M T T T T G M GT-CTTAAATAG
3800
A A M A G T M C TAAAA
3900
TAG GAMGTAGAA MGCTCCTGC ACCCTCTTCA ATGGCTTGAC AAAGACGAGA CCGCATMTA TCTTTGCTAG TATACTTCGG
CMTTTCAAA TMTTAGCAC ATGTCATTAC ACTTGGTAGA TATTCGTCTG CTGTTAGGCC ATCTTCAGCA TGCTTTMCA CMCTGTAAA CTTTGGGTTC
4000
AAACTTTTM ATCCCCCMT TGGAAGCTTG GGAGATCCCG TTAAAAATTG CAAAAATMT CTTCTTTCAT GCTTACCAAA CGCGGATATT ATTGATATM
4100
M T C A T G M T G A T T G M G M TCCATTGTAT AGCCATGTTC AGCGTTCMG TTTGTGTATA MGTTGCCAT AGACCAGTCT TCCTCMCTC GTCCGAAAAT
4200
ATCCACTMT TCATCCGGAA AAAGTATTAG CATCCTCTCA T A G W C A CCTTTGAAAA ACCTTCMTA MTGCTTTTA ACTGTTTTTC MTGCCCTTA
4300
CCTMTATTT GGTCGATMC GCCATGGATA TATTCTTCAA CATTAGMGA GTTCMGGAT TTATTACMC CCCCCGGMT CAACTCMTG TCATCATTTC
4400
CAGGMCGGT AAATGTCMG GACMCGATT CTAGGGTCAT ATTGTCATCC TTATTCGCTA CTATGTATTT MGGGATTTT GCGAGTMCG GATCTACCM
4500
TTCGATCATT MCAGACAGG TTTCAACGTC GCTCGGCACT GTCGTCACAT TGGGCGTAGA CATTCTGTGC M T M C T C M AAAAGACTTT GCTMATCTA
4600
M G T C M G A A TTCTATTATC M G C M C G A T C T G G C M C M ATGTCCCCAA ATATCCAAAA A G T T C M T M CTTTTTCATT ATTGG-C
GGGTTGAGGG
4700
GCTCTGGGAA CMTAAAGTG G T M T A T M T CGTCAGTAGT ATCMCATCC ATTTCGCTTC TGTMCTATA AGAGTTACM CGCCACATAT TTAACGACTT
4800
TCTTGCAAAA TACTTGGAAA CTACGGAGTA AAATTCCAM GTCGGTCCTA MCCTGTTCC TGCTTCTTCT TGATATTCM TTTCCAGTAC GTCAGGGCTA
4900
CTTCCGTACT TGGATAAAAT CTTGAGACCG GTAGCGAATA TTGTTTTTCT TGAAATCCGC AGCTTACGCC TAGTMTTCT CCCMGTTGT TGTMAGCTT
5000
CGTCATTCCT TAAATCTTTC GAGCCTTTAC TCTTATTCTT CCAAAGTTGA ATCMTCTTC CGTMCCAAA TGAAGTACAT TGTAGGAAAA GCATCCTGGT
5100
ATCAAACGGA M C A A M A T G GGMTCTCCT GGTC-T
5200
AGTGACCMT CCGGCAAAGC TCCACTTGCT ACTACCMTG GTTCATCTAG TTGTCTAGCG
AGCTFAGCGC TTAGTTTTGA GTTGATAAAA CTGTCGCTTT TMCACCGCA GCTATGTAGA AAATCCAGCA ATGTCAGAAT ATCCGCTGM GATWAGTAT
5300
CMCCTGTGC GMTTCTCTT TTTTTATAAA M T C T C T T M CTTTTTCCCC TCATTAGCTT CCTCGGCTGC CTCACTCTCT CTATTGTTAC CTTCTAAACT
5400
TTTGCAAAAT TTGATTGTAT GTGTATCATC CCATAAAGTT TTTMGTCAC GATTTCCCCT GACGAATGTA TTAAATATCA CACCAAATAC TGTAGACTCC
5500
ATGTCMCTT TTTCATTATC ATAAAAAAAG TCAAAGTTCT TTTTTCTCAT ATGATCCAAG CAATTTTCTT CTTCTTCCCT ATCAGCTTCG
5600
GTACTGGMG
ATGTMGGTT TGGGATTMT GAATTCAAAA MCGCATTCT TACCATTCTG TGTCTCAAM ACTCATTMG TGAGGTAAAA GMGCTATGC MTGGACCGA MCGATAGTA GATGATAAAT CAGTACCMT ATTATCTTTG CTTGCATCGC CATCATAAAC C M C T T M T T TTTATCTCTT TAGCCAGTGA AGA'L'AChCCA
5700 5800
CCACCATCGT GTAMCCGCA ATCMCTATA GACAAATTCT CCAGCCTTGT GAGAGCAGAT TGTAGGATTT CT-TCT CCTCT-
G T C M T A C M TCCTCAAATA
T ~ T T T A G C AAGAATGAMT GGGATACCGT TGAGGATGTA ATTCTTTTAG TTATGGAGGA AGCTAGCCCT GTAGMGTM
5900 6000
ATTCGAMCC
TGACACGTCG AAATCTTCAT GGAAAATACA TTTTTTTAM ACAGACCAM TTCCCTTCCA ATCCTCTTCA GTTTTGTCCG GAGTGGMGG ATTTTCTAAA
6100
A T A G A M C M CGCCTTCGAT TTGATGGAGC TCCTCTGTTA TTGCTTCTTG CTCMCGTTC ATCTGAGAM GTACTCTATT MCTAGGTTC ACACCCTTGT
6200
TTTTMTATA AGCTAGAGAT AGAGTTCTGA AAATGTGTAT TGAMTTTTC TTTGGTTTTA CTGAATCAGG MTTTCCATA TCAGTAMTT CGTMTCATA
6300
TTCTTCATCA CCCTCATCAC ATTCCTCMT ACTGCTATGC AAATCCCCTT CTTCGTCAGA MGTGAAATA TTTTCATTCC CGTCTTCCTT T A M T C M T G
6400
T T A T T G A M T CCACAGACM ATCCTTMCC AAATCAMAA TGCCCTCTCT TTTGATGGM GGMAGAACA GTTCGGAAM CTTTTTACM A T T M G T C M
6500
GCMCGAGAG ACCACCMCC MCAGTGTAC CAGCTTCTGA TGAGTMGTA CCATTAGCGT TAGACGCTGT TTCTTTTTGG GCCAGGATAG ATCCGATTM
6600
C T T M T M G T TGATCATTGA TTGCCTTTGC TGTGGMTTA TTTATGCATG ATACAACCCT CAGTAAAGCA ATMGTACGT ATCTTCTTAC GTCAAAGTCA
6700
GCAGCATTTG TATAMTTTC M C G A G M T T GGMTTAGAC ATTGMCTAG GGAATCAMC TTTTCCTGGT TAGAMTTAC GCCACGGTCG CTATTTCCGG
6800
TATATTTATC CGCTGACAGT ATTCTTTCAT CCTCGGGAGG M C A A T A C A ACTATAMTC TACAAATGCT M T C A A T A M CTGTTTGGGA CATAAATCAG
6900
TGTTTCATGT MCCCTGCGT TAGGACTTTT ACTATMTGC TGGMTGATC GTGTTGCCAT GTCGACMTG TCAGTTTTCT CTCTCAGTTC TCTTGAMGT
7000
ACATCACTAC TCATCGCTM TACGGTTAAA ATATCCAMC ATTTCAGTTT ATTCTCCMG GGGGTATCCT G M T A G M C TAGCTGMCT ATTCTTTCGA
7100
TTAAATCCM CGAAAACAM GTCTCAAATT TCTCAACCCC ATGCAACGCC CCGCAMTAC CGTACATGGC ATTTACMGC CTGGTTMTA TTGGTTGGTC
7200
TGTCGCATTC G A G M M T T G GCTTCAGCGT TGGMGTACT T C M C M T G G TCTTAMGTC ATCCGTTCGG ATACTGCTAC AGGCGTTCGA M C M T T G C G
7300
ATAGCCTTCC TCTGCGCATG TATAGTTAM AMTCGAAGA ATTGGACGTA GATTGATMT TGGCCCGTTT T T M M T G T C TCTCCCATGT ACTCTAGAM
7400
TATATTCCAC CGTTTCTAM ACTTGTTCTG CGAGGTCMT GTMCTGATC TCTACCMTT TTCCTTGTM MTTGGTATA ACGTGTTCAT CMCAGCTAT
7500
TGAMTAGAT TCAGGGCAGA CCTCIhhMG ATTATACATG CATCTACMG CTTGCATTTG T M T T C T M T TCTTCCCGTA AMTTTTATC AGAGAGTATG
7600
GCAGCTATAT TACCTATTM CGTTTCCATC GGTATMTTC TATCGACMC CATTTGATTC ATCATTMTA TGTTTTCAGA MGTTCTTTT AMCTCTCCA
7700
TTGCMTATA AGGATCCTCG GAGGCATTCC CAGTATTCTC TATCMTTTA GAMTCCGCT CATTCCTTGC CGMCTCTCT GCGCTCCTCT CCATCCTTCC
7800
TCCMTCATC G A T M M T C T CGGGTAGTGT TCTTCCMCC GGGTGTTGCC CMGTCCTTC ACTTGTTTGT CTTCTTTGTT CTAGTCTTTG TGCGMTGTT
7900
T C C M M T G T CAGGTAMTG TAGGGGGTTG CTGCCAMTT CATTATTATC GTGAGAACGT TCMGTCCAT CTTCTTCTTC ATTTCTGTCT TCGTCACTGC
8000
CATTCTGAGT TCTATGATAT CCATATAAAA GTCCAGAATC ATCTTCATGA TGGTMTCGT TATCGTCATC TTCACCATCA TCTGCCTCM MCTTCCGAC
8100
GCTAGAMGC ATATGTTCAT CTTCATCTTC ATCAGGATM TMGAGTACT CACCCTGTAC ATGGCCATCC TCATCATAGT CATCTTCTAC CTGCGTATCC
8200
ATCATATMT
CACTGTTTTC GCTATGGGAC TCATGTTCAT CMGGTTGTG
CGMTTATTT
T C A G A C A ~ CTTGGCTACTC TTCTATTTTA MTGTTTATC
8300
~~
C M M T T T T C ACACCTGAGA TATATTGTTG TTMTAAAGC ACACAGAMG GAMGTATCG ATCCCAGAM GTGGTCTTTC CCTGGGTATG ATCAMGCTA
8400
TTCCCCTTTC CMCTACTGT TGGCTATGAT CCCACMTTT GATTTCAGTT C A G M M C C C MCCATTAGG ACTTTACCCG GCGMGGTGA M G A h h A M T
8500
TTTTTTCCAG W T G C G A TGAGCTTTTG AAAAGTTGCA MTATTTGTA GTACTTTGGC TMGCAGAGC TGATTCCTTT TGCTATACGT ACATCACTGT
I
CTATCGTCCA T M G A T T T ATTATAGTTG MTdATGCCA A G A T C M M C GTTCCMGCT AGTCACTTTA GCACAAACCG ATAMAAGGG CAGAGAAAAT
I
8600
I
8700
TTTTCGATGA AGTGAGGGM GCCTTAGATA CTTATAGATA CGTCTGGGTC CTACATCTTG ACGACGTMG MCTCCAGTT T T A C M G A M
8800
CATTGGGAGA A M G A G G G M G M G M T A C A M G N T C T
8900
GTATCMTTG AGTMGCTTT GTAGCGGTGT CACTGGTTTA TTGTTCACTG ATGAGGATGT CMCACTGTC M G G M T A C T TTMGTCATA CGTTCGTTCA
9000
GACTATTCM GACCGMCAC GAMGCACCA CTMCATTTA CMTTCCTGA GGGCATTGTC TACTCACGTG GTGGTCAMT TCCAGCTGAG GMGATGTTC
9100
CMTGATTCA TTCTTTAGAG CCMCTATGA G A M C A M T T TGAMTTCCA ACTAAAATCA M G C T G G T M GATCACCATT GACAGTCCAT ATTTAGTTTG
9200
CACTGMGGA G M A M T T A G ATGTTCGTCA AGCTTTMTA C T G M A C M T TCGGTATCGC TGCTTCTGM TTCAMGTCA AGGTGTCGGC CTACTATGAC
9300
AAAGAMGM
TCAGMCATC TTGGGCAGGC T C T M G T T M TTATGGGTM GAGGAMGTT TTAC-G
MCGATAGCT CCACTGTTGA M G C A C T M C ATCMCATGG M T
To centromere
Figure 3. Complete sequence of the 9.3 kb segment. The sequence is written from 5' to 3'. The five ORFs are boxed and their orientation is given by thick arrows. Maximum extension of each ORF is given using the first ATG encountered from the next upstream stop codon. Consensus sequences falling between ORFs are indicated: a TATA box (box) upstream of YKL165, a termination site (underlined) downstream of YKL162.
Table 1. Optimized FASTA scores obtained when the putative translation product of each ORF is compared with the MIPSX databank.

ORF name   Size (amino acids)   Optimized score   Homologous or identical (*) protein                        Reference
YKL160     236                  108               S. cerevisiae ribosomal protein L10                        Mitsui and Tsurugi (1988)
YKL162     1483                 118               S. cerevisiae IRA2 gene product                            Tanaka et al. (1990)
YKL164     353                  1734(*)           S. cerevisiae CCE1 gene product                            Kleff et al. (1992)
YKL165     583                  135               S. cerevisiae MYO2 gene product (myosin-I isoform)         Johnston et al. (1991)
[Figure 4, YKL162 panel: hydrophobicity profile, acido-basic profile (lanes A and B) and Cys & His map; axis tick residue omitted.]
EHESHSENSD YMMDTQVEDD YDEDGHVQGE YSYYPDEDED EHMLSSVGSF DYHHEDDSGL LYGYHRTQNG SDEDRNEEED GLERSHDNNE FGSNPLHLPD QRRQTSEGLG QHPVGRTLPE ILSMIGGRME RSAESSARNE RISKLIENTG ESLKELSENI LMMNQMVVDR IIPMETLIGN IAAILSDKIL REELELQMQA CPESISIAVD EHVIPILMK LVEISYIDIA EQVLETVEYI SRVHGRDILK FDFLTIHAQR KAIAIVSNAC SSIRTDDFKT IVEVLPTLKP IFSNATDQPI ICGALHGVDK FETLFSLDLI ERIVQLVSIQ DTPLENKLKC LDILTVLAMS KTDIVDMATR SFQHYSKSPN AGLHETLIW PNSLLISISR FIWLFPPED GNSDRGVISN QEKFDSLVQC LIPILVEIYT NAADFDVRRY VLIALLRWS NDQLIKLIGS ILAQKETASN ANGTYSSEAG TLLVGGLSLL DLICKKFSEL FDLVKDLSVD FNNIDLKEDG NENISLSDEE GDLHSSIEEC DEGDEEYDYE KPKKISIHIF RTLSLAYIKN KGVNLVNRVL SQMNVEQEAI TEELHQIEGV DKTEEDWKGI WSVLKKCIFH EDFDVSGFEF TSTGLASSIT KRITSSTVSH FEDCIDRFLE ILQSALTRLE NFSIVDCGLH DGGGVSSLAK EIKIKLVYDG LSSTIVSVHC IASFTSLNEF LRHRMVRMRF LNSLIPNLTS SSTEADREEE NFDFFYDNEK VDMESTVFGV IFNTFVRRNR DLKTLWDDTH TIKFCKSLEG ANEGKKLRDF YKKREFAQVD TGSSADILTL LDFLHSCGVK SDSFINSKLS LWASGALPD WSLFLTRRFP FLFPFDTRML FLWTSFGYG RLIQLWKNKS ALWLGRITR RKLRISRKTI FATGLKILSK YGSSPDVLEI EYQEEAGTGL SKYFARKSLN MWRCNSYSYR SEMDVDTTDD YITTLLFPEP LNPFSNNEKV VARSLLDNRI LDFRFSKVFF ELLHRMSTPN VTTVPSDYET CLLMIELVDP ANKDDNMTLE SLSLTFTVPG NDDIELIPGG CNKSLNSSNV EEYIHGVIDQ KAFIEGFSKV FSYERMLILF PDELVDIFGR VEEDWSMATL YTNLNAEHGY ISIISAFGKH ERRLFLQFLT GSPKLPIGGF KSLNPKFTW LKHAEDGLTA ANYLKLPKYT SKDIMRSRLC QAIEEGAGAF LLS
AKLARQLDEP KGSKDLRNDE GPTLEFYSW IELFGYLCTF LLAKSLKYIV ILGKGIEKQL TMDSSIIHDF DEYLPSVMTC
NNRESEAAEE
MSENNSHNLD EADDGEDDDN ILETFAQRLE NASEDPYIAM CRCMYNLFEV TGQLSIYVQF LTRLVNAMYG SDVLSRELRE ERILSADKYT CINNSTAKAI FFPSIKREGI FTDMEIPDSV VSILENPSTP FILAKSFLEV DASKDNIGTD ENCLDHMRKK
YKL162
Figure 4. Sequences of the putative gene products from YKL160, YKL162 and YKL165. Amino acid sequences were analysed using DNA Strider (Marck, 1988). Top: hydrophobicity profiles (Kyte and Doolittle, 1982). Middle: acido-basic profile (lane A: acidic: full bar = E; intermediate bar = D; lane B: basic: full bar = R; intermediate bar = K; small bar = H). Bottom: cysteine-histidine map (on the bottom lane: Y, C, F, L and H are plotted as bars of increasing length, H being plotted as a full bar; on the top lane only C and H are plotted).
[Figure 4, YKL165 panel: hydrophobicity profile, acido-basic profile (lanes A and B) and Cys & His map; axis tick residue omitted.]
MSIWKEAKDA SGRIYYYNTL TKKSTWEKPK ELISQEELLL RENGWXAAKT ADGKWYYNP TTRETSWTIP AFEKKVEPIA EQKHDTVSHA QVNGNRIALT AGEKQEPGRT INEEESQYAN NSKLLNVRRR TKEEAEKEFI TMXENQVDS TWSFSRIISE LGTRDPRYWM VDDDPLWKKE MTEKYLSNRS ADQLLKEt!NE T S K F W Q K MLQNNSHIKY YTRWPTAKRL IADEPIYKHS WNEKTKRQT FQCYIDTLID TQKKESKKKLX TQALKELREY LNGIITTSSS ETFITWQQLL VHYGDKSKR YHANRHFKVL THEDVLNEYL KIVNTIENDL QNKLSELKLR NYTRDRIARD NFKSLLREVP IKIKANTRWS DIYPHIKSDP RFLHMLCRNG SSCLDLFLDF VDEQRMYlFA QRSIAQQTLI DQNFEWNDA3 SUEITKQNIE KVLENDRKFD KVDKE3ISLI VCGLIKQRNE KIGQKLQNEil RILEQKKHYF WLLIQRTYTK TGKPKlSTWD LASKELGESL EYKALGDEDN IRilQIFEDFK PESSAF'TAES ATANLTLTAS KKRHLTPAVE LDY
YKL165
[Figure 4, YKL160 panel: hydrophobicity profile, acido-basic profile (lanes A and B) and Cys & His map; axis tick residue omitted.]
MPRSKRSKLV TLAQTDKKGR ENKERIFDEV REALDTYRYV WVLHLDDVRT
PVLQEIRTSW AGSKLIMGKR KVLQKALGEK REEEYKENLY QLSKLCSGVT
GLLFTDEDVN TVKEYFKSYV RSDYSRPNTK APLTFTIPEG IVYSRGGQIP
AEEDVPMIHS LEPTMRNKFE IPTKIKAGKI TIDSPYLVCT EGEKLDVRQA
LILKQFGIAA SEFKVKVSAY YDNDSSTVES TNINME
YKL160
[Figure 5: alignment rows as extracted; the match lines (stars and dots) and exact column positions are not recoverable from the extracted text.]
YKL160  MPRSKRSKLVTLAQTDKKGRENKERIFDEVREALDTYRYVWVLHLDDVRT
L10     MGGI---------------REKKAEYFAKLREYLEEYKSLFWGVDNVSS

YKL160  PVLQEIRTSWAGSKLI-MGKRKVLQKALGEKREEEYKENLYQLSKL---C
L10     QQMHEVRKELRGRAWLMGKNTMVFGUU-----RGFLSDLPDFEKLLPFV

YKL160  SGVTGLLFTDEDVNTVKEYFKSYVRSDYSRPNTKAPLTFTIPEGIVYSRG
L10     KGYVGFVFTNEPLTEIKNVIVS------NRVAAPA-------------RA

YKL160  GQIPAEED-VPMIHS-LEPTMRNKFE----IPTKIKAGKITIDSPYLVCTE
L10     GAVAPEDIWVRAVNTGMEPGKTSFFQALGVPTKIARGTIEIVSDVKWDA

YKL160  GEKLDVRQALILKQFGIAASEFKVKVSAYYDND----SSTVESTNINM--
L10     GNKVGQSEASLLNLLNISPFTFGLTWQVYDNGQVFPSSILDITDEELVS
Figure 5. Alignment between the amino acid sequences of YKL160 and the ribosomal protein L10 of Saccharomyces cerevisiae. The two sequences have been aligned using CLUSTAL (Higgins and Sharp, 1988). Stars indicate identical amino acids, dots indicate conservative substitutions. Sequence of the L10 protein is from Mitsui and Tsurugi (1988).
by these authors is identical to ours except for the replacement of the A at position 3870 by a C (this changes codon 1314 of YKL162 from a Ser codon to an Ala codon). We have rechecked this position on our gel readings and confirm the presence of an A. The sequence was analysed for the presence of ARS, transcription and splicing signals and for consensus sequences for protein binding using a list established by MIPS (unpublished). Consensus sequences located at expected positions with respect to ORFs are indicated in Figure 3. One TATA box and one termination site are located upstream of YKL165 and downstream of YKL162, respectively. No intron, no tRNA and no ARS sequence was discovered.
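The Ser/Ala difference reported for position 3870 can be illustrated with a small check. The sketch below rests on two assumptions that the text does not state explicitly: that YKL162 is read from the strand complementary to the one printed in Figure 3, and that position 3870 corresponds to the first base of the affected codon. Under these assumptions an A to C change on the printed strand is a T to G change in the coding strand, which turns any TCN (Ser) codon into a GCN (Ala) codon.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard genetic code, restricted to the two amino acids of interest.
SER_CODONS = {"TCT", "TCC", "TCA", "TCG", "AGT", "AGC"}
ALA_CODONS = {"GCT", "GCC", "GCA", "GCG"}


def substitute_first_base(codon: str, new_base: str) -> str:
    """Return the codon with its first base replaced by new_base."""
    return new_base + codon[1:]


# A -> C on the printed strand, seen from the complementary (coding) strand.
old_base = COMPLEMENT["A"]  # T
new_base = COMPLEMENT["C"]  # G

# Apply the change to every Ser codon that starts with the old base.
converted = {
    codon: substitute_first_base(codon, new_base)
    for codon in sorted(SER_CODONS)
    if codon[0] == old_base
}
all_become_ala = all(new in ALA_CODONS for new in converted.values())

for before, after in converted.items():
    print(f"{before} (Ser) -> {after} (Ala)")
```

Whatever the second and third bases of the codon are, the substitution is therefore compatible with the Ser to Ala difference described above.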
Analysis of ORF products
The nucleotide sequences of the five ORFs have been translated into amino acid sequences, considering the first AUG codon of each ORF as the start codon, and compared with the databanks PGTrans, PSeqIP, NBRF and MIPSX. YKL164 corresponds to the previously sequenced CCE1 gene, which encodes a cruciform cutting endonuclease (Kleff et al., 1992), and will not be discussed further here. The products of the three other complete ORFs, YKL165, YKL162 and YKL160, show borderline homologies with previously characterized yeast proteins, as shown in Table 1. Their amino acid sequences and profiles are given in Figure 4.
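The convention used here, in which each ORF runs from the first ATG that follows the preceding in-frame stop codon to the next stop, can be sketched as a six-frame scan. This is a minimal illustration, not the DNA Strider procedure used by the authors; the function names, the `min_codons` threshold and the toy sequence are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int) -> List[Tuple[int, int]]:
    """ORFs in one reading frame as (start, end) offsets, end exclusive.

    An ORF starts at the first ATG after the previous in-frame stop codon
    and ends with the next in-frame stop codon (included in the span).
    """
    found = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            if start is not None and (i - start) // 3 >= min_codons:
                found.append((start, i + 3))
            start = None
        elif codon == "ATG" and start is None:
            start = i
    return found


def six_frame_orfs(seq: str, min_codons: int = 100) -> List[Tuple[str, int, int, int]]:
    """Scan both strands and return (strand, frame, start, end) for each ORF.

    Offsets on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame, min_codons):
                results.append((strand, frame, start, end))
    return results


# Toy example: one short ORF in frame 2 of the forward strand.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(six_frame_orfs(toy, min_codons=3))
```

Applied to a full sequence with a larger `min_codons` value, a scan of this kind lists candidate long ORFs on both strands; ORFs that run off the end of the sequence without a stop codon, like the incomplete YKL166, are not reported by this sketch.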
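[Figure 6: partial alignment of YKL165 and MYO2; the match lines (stars and dots), the boxes and the exact column positions are not recoverable from the extracted text. Recoverable segments, in reading order:]
YKL165  L---NHYVFDKS-    LTHEDVLNEYLKIVN----------    TIENDLQNKLNELR
MYO2    LRTKKDTVWQSLI    LKADAKSVNHLKEVS...68 aa..    TIENNLQSTEQTLK
Unassigned fragment: TITN. Length labels in the figure: 857 aa, 235 aa, 552 aa.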
Figure 6. Partial alignment between YKL165 gene product and the yeast MYO2 protein. The two sequences have been aligned using CLUSTAL. Stars indicate identical amino acids, dots indicate conservative substitutions. Sequence of the MYO2 protein is from Johnston et al. (1991). Conserved amino acids of the heptad repeats of MYO2 that are present in YKL165 are boxed.
YKL160 shows homology with the L10 ribosomal protein of yeast. The alignment between the two proteins is given in Figure 5. YKL160 can be aligned with the amino-terminal part of L10 but is missing the carboxy-terminal part. The L10 and L12 proteins, characterized from a variety of archaebacterial and eukaryotic sources, have a highly conserved sequence at their carboxy-terminal end containing the hexapeptide motif DDDMGF but diverge at their amino-terminal end (Newton et al., 1990). However, YKL160 does not show the hexapeptide motif, which raises doubts about its being a ribosomal protein of the L10 family.
The product of YKL165 shows a weak homology with the MYO2 protein of S. cerevisiae even though the two proteins differ significantly in length (583 and 1574 amino acids, respectively). Partial alignment between the two proteins is shown in Figure 6. Some of the amino acids which are conserved in the heptad repeats of MYO2 are also found in the YKL165 gene product. Other parts of the two proteins, however, do not reveal obvious possible alignments.
Finally, YKL162 shows a weak homology with the IRA2 gene product. However, no significant alignment could be found.

ACKNOWLEDGEMENTS
S.P. and M.G. have contributed equally to this work. We thank Martina Haasemann and all MIPS staff for help with the sequence analysis, and all members of the Unité de Génétique Moléculaire des Levures for advice and critical discussions. This work was supported by the BRIDGE program of the Division of Biotechnology of the Commission of European Communities and by the 'Action Spécifique Génome' of the Ministère de l'Education Nationale (DRED). B.D. is Professor of Molecular Genetics at the Université Pierre et Marie Curie.

REFERENCES
Boyer, J., Pascolo, S., Richard, G.-F. and Dujon, B. (1992). Sequence of a 7.8 kb segment on the left arm of yeast chromosome XI reveals four open reading frames including the CAP1 gene, a homolog to an RNA polymerase II elongation factor and an intron-containing gene. Yeast, in press.
Claverie, J. M. (1984). A common philosophy and FORTRAN77 software package for implementing and searching sequence databases. Nucl. Acids Res. 12, 397-407.
Colleaux, L., Richard, G.-F., Thierry, A. and Dujon, B. (1992). Sequence of a segment of yeast chromosome XI identifies a new mitochondrial carrier, a new member of the G protein family, and a protein with the PAAKK motif of the H1 histone. Yeast 8, 325-336.
Higgins, D. G. and Sharp, P. M. (1988). Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244.
Jacquier, A., Legrain, P. and Dujon, B. (1992). Sequence of a 10.7 kb segment of yeast chromosome XI identifies the APN1 and the BAF1 loci and reveals one tRNA gene and several new open reading frames including homologs to RAD2 and kinases. Yeast 8, 121-132.
Johnston, G. C., Prendergast, J. A. and Singer, R. A. (1991). The Saccharomyces cerevisiae MYO2 gene encodes an essential myosin for vectorial transport of vesicles. J. Cell Biol. 113, 539-551.
Kleff, S., Kemper, B. and Sternglanz, R. (1992). Identification and characterization of yeast mutants and the gene for a cruciform cutting endonuclease. EMBO J. 11, 699-704.
Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydrophobic character of a protein. J. Mol. Biol. 157, 105-132.
Marck, C. (1988). 'DNA Strider': a 'C' program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers. Nucl. Acids Res. 16, 1829-1836.
Mitsui, K. and Tsurugi, K. (1988). cDNA and deduced amino acid sequence of 38 kDa-type acidic ribosomal protein A0 from Saccharomyces cerevisiae. Nucl. Acids Res. 16, 3573.
Newton, C. H., Shimmin, L. C., Yee, J. and Dennis, P. P. (1990). A family of genes encode the multiple forms of the Saccharomyces cerevisiae ribosomal proteins equivalent to the Escherichia coli L12 protein and a single form of the L10 equivalent ribosomal protein. J. Bacteriol. 172, 579-588.
Sharp, P. M. and Li, W. H. (1987). The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucl. Acids Res. 15, 1281-1295.
Tanaka, K., Nakafuku, M., Tamanoi, F., Kaziro, Y., Matsumoto, K. and Toh-E, A. (1990). IRA2, a second gene of Saccharomyces cerevisiae that encodes a protein with a domain homologous to mammalian ras GTPase-activating protein. Mol. Cell. Biol. 10, 4303-4313.
Thierry, A., Fairhead, C. and Dujon, B. (1990). The complete sequence of the 8.2 kb segment left of MAT on chromosome III reveals five ORFs, including a gene for a yeast ribokinase. Yeast 6, 521-534.