The sequence of a 9.3 kb segment located on the left arm of the yeast chromosome XI reveals five open reading frames including the CCE1 gene and putative products related to MYO2 and to the ribosomal protein L10.

YEAST VOL. 8: 987-995 (1992)

Yeast Sequencing Reports

STEVE PASCOLO*, MARJAN GHAZVINI†, JEANNE BOYER, LAURENCE COLLEAUX, AGNES THIERRY AND BERNARD DUJON

Unité de Génétique Moléculaire des Levures (URA 1149 du C.N.R.S.), Institut Pasteur, 25 Rue du Dr Roux, F-75724 Paris Cedex 15, France

Received 29 June 1992; accepted 6 August 1992

We report here the sequence of a 9.3 kb DNA segment of chromosome XI of Saccharomyces cerevisiae, located between the MAK11 locus and the centromere. This sequence contains four long open reading frames (ORFs), YKL160, YKL162, YKL164 and YKL165, and part of another ORF, YKL166, together covering 90% of the entire sequence. One of these ORFs, YKL164, corresponds to CCE1. Translation products of two other ORFs, YKL160 and YKL165, exhibit homology with previously known S. cerevisiae proteins: the ribosomal protein L10 and the MYO2 gene product, respectively.

KEY WORDS - Saccharomyces cerevisiae; chromosome XI; CCE1; shotgun sequencing.

*Present address: Unité des Papillomavirus, Institut Pasteur, 25 Rue du Dr Roux, 75724 Paris Cedex 15, France.
†Present address: Unité des Virus Oncogènes, Institut Pasteur, 25 Rue du Dr Roux, 75724 Paris Cedex 15, France.

0749-503X/92/011987-09 $09.50 © 1992 by John Wiley & Sons Ltd

INTRODUCTION

As a part of the BRIDGE program of the Commission of European Communities to sequence chromosome XI of Saccharomyces cerevisiae, we have determined the sequence of a 9.3 kb fragment located close to the centromere on the left arm of that chromosome. This sequence is contiguous to the sequence reported by Boyer et al. (1992) in the accompanying paper. No genetic locus has previously been mapped in this fragment. The sequence was determined by the shotgun strategy, using the double-strand sequencing method on fragments of cosmid pEKG021. Five long open reading frames (ORFs) have been discovered.

MATERIALS AND METHODS

Strain and vectors

Escherichia coli JM101/TG1 (Δ(lac pro), thi1, supE44, hsdD5, F'(traD36, proA+B+, lacIq, lacZΔM15)) was used throughout this work. Cosmid pEKG021, received from A. Thierry (manuscript in preparation), is a pWE15 derivative containing a 37 kb insert of S. cerevisiae chromosome XI. Plasmid pBoSA1 (Jacquier et al., 1992) is a derivative of pBluescript used for subcloning EcoRI fragments from pEKG021. Plasmid pBluescript (Stratagene) was used for the other cloning steps and sequencing.

Figure 1. Partial restriction map of cosmid pEKG021. EcoRI fragments sequenced in this work as well as the adjacent fragments sequenced by Boyer et al. (1992) are indicated with their exact sizes in base pairs. Adjacent EcoRI fragments larger than 1 kb that have also been mapped are indicated with approximate sizes. One XbaI fragment which was cloned to complete the sequence is also indicated (dashed line). Note that only part of the chromosome XI insert is shown here: ca. 4 kb and 8 kb are missing at the right end and the left end, respectively. Inserts of cosmids pEKG047 and pUKG041 which partially overlap pEKG021 are indicated. Abbreviations: B: BamHI; P: PstI; RI: EcoRI; RV: EcoRV; X: XbaI; X*: uncleavable XbaI site in dam+ strains (overlapped by two GATC methylation sites).
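The X* annotation in the Figure 1 legend can be illustrated with a short sketch. It assumes the usual recognition sequences (XbaI: TCTAGA; dam methylase: GATC) and flags XbaI sites whose flanking bases create an overlapping GATC on either side. The function name and the toy sequence are hypothetical and are not taken from the paper.

```python
# Minimal sketch: locate XbaI sites (TCTAGA) and report whether a dam
# site (GATC) overlaps the left and/or right end of each site.
# The demo sequence below is invented for illustration only.

def xbai_sites(seq):
    """Return a list of (position, left_overlap, right_overlap) tuples."""
    seq = seq.upper()
    site = "TCTAGA"
    hits = []
    start = seq.find(site)
    while start != -1:
        # GA immediately upstream plus the leading TC of the site gives GATC
        left = seq[max(start - 2, 0):start + 2] == "GATC"
        # trailing GA of the site plus TC immediately downstream gives GATC
        right = seq[start + 4:start + 8] == "GATC"
        hits.append((start, left, right))
        start = seq.find(site, start + 1)
    return hits


demo = "AAGATCTAGATCAACCTCTAGACC"
sites = xbai_sites(demo)
for position, left, right in sites:
    print(position, "dam-overlapped" if (left or right) else "free")
```

A site flagged on both sides corresponds to the situation described in the legend (two overlapping GATC sites); whether cleavage is actually blocked in a given strain remains an experimental question.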

The three EcoRI fragments (2647, 1960 and 3647) were subcloned into the pBoSA1 vector giving rise to plasmids pE2647, pE1960 and pE3647, respectively. The XbaI fragment 3560 was subcloned into the pBluescript SK+ vector giving rise to plasmid pX3560.

Sequencing strategy

The insert of cosmid pEKG021 was first mapped by digestion with EcoRI, EcoRV, XbaI, BamHI and PstI and by hybridization using purified EcoRI fragments as probes (Figure 1). The sequencing strategy is outlined in Figure 2.

Random sequencing

The sequencing strategy was adapted from Thierry et al. (1990). 20 µg of EcoRI inserts from pE2647, pE1960 and pE3647 (see Figure 2) were purified using β-agarase after electrophoresis. Each EcoRI fragment was polymerized using T4 DNA ligase and then subjected to sonication (20 kHz for 8 min). Polymerization is needed to avoid biases of DNA fragmentation during sonication. DNA fragments ranging in size from 500 to 1000 bp were purified using Geneclean II (Bio 101) and then ligated with SmaI-digested and dephosphorylated pBluescript SK+ vector. These constructions were used to transform E. coli cells. Transformant clones were hybridized using purified EcoRI fragments of cosmid pEKG021 as probes. DNA was extracted from positive clones. Sequencing reactions were performed on double-stranded DNA using Sequenase (United States Biochemical Corp.), KS primer and, for some chosen clones, SK primer (Stratagene) as well. The products were analysed on 50 cm long acrylamide-urea gels.

Directed sequencing

Gaps were filled using specific oligonucleotides as primers and double-stranded DNA from pE2647, pE1960, pE3647 and pX3560 as templates. Other gaps were filled using oligonucleotides complementary to T3 or T7 promoters as primers and double-stranded DNA from deletions of pE2647 and pE1960 as templates (to generate appropriate deletions, pE2647 was digested with SalI or NsiI plus PstI and religated, and pE1960 was digested with HindIII or SpeI and religated). The EcoRI site separating fragments 2647 and 1960 was sequenced across using an asymmetric PCR reaction (Colleaux et al., 1992).

Sequence analysis software

Reading and assembling of the clones as well as interpretation of sequences were performed using DNA Strider 1.2 software (Marck, 1988). Comparisons between products of ORFs and protein databanks were performed using the SASIP package (Claverie, 1984) and the MIPS package (Mewes).

Figure 2. Sequencing strategy and results. Small arrows indicate individual sequence reads on random clones from each of the four subfragments (filled boxes). Stars indicate sequences obtained using specific oligonucleotides, squares indicate sequences obtained using deletions and the circle indicates the sequence obtained from asymmetric PCR. Positions of the restriction sites used for mapping and cloning of the four subfragments are shown. Bottom: occurrence of ORFs in the six registers: initiation codons (small bars), stop codons (full bars). Long ORFs are indicated by thick arrows with their numbers.

RESULTS AND DISCUSSION

Sequence determination

For the three EcoRI fragments sequenced by the random strategy (see Figure 2), 32 clones were sequenced from one primer only and 55 clones were sequenced from the two primers. As shown in Figure 2, the clones used for sequencing were randomly distributed except for an excessive representation of clones from the extremities of the fragment 2647 (this was due to a poor ligation of this fragment prior to sonication). Altogether, the random sequencing has provided us with 93% of the double-strand sequence of the three fragments. To complete the entire sequence, four reads were determined from deleted clones (see Materials and Methods) and 19 reads were determined from synthetic oligonucleotides in order to fill in the gaps located within the EcoRI fragments and between fragments EcoRI 1960 and EcoRI 3647. With an average length of 259 bp per read, our strategy has led to only 4.6 readings per base pair of final sequence.
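The redundancy quoted above can be checked with a few lines of arithmetic. The sketch below assumes that each clone sequenced from both primers contributes two reads, that the single asymmetric PCR read is not counted, and that the denominator is the 9350-nucleotide final sequence. These are assumptions about how the figure was derived, not statements from the text; all variable names are illustrative.

```python
# Back-of-the-envelope check of the sequencing redundancy.
# Read counts and mean read length are taken from the text;
# the way they are combined is an assumption.

single_primer_clones = 32   # one read each
double_primer_clones = 55   # assumed two reads each
deletion_reads = 4
oligo_reads = 19
mean_read_length = 259      # bp per read
final_length = 9350         # nucleotides in the final sequence

random_reads = single_primer_clones + 2 * double_primer_clones
total_reads = random_reads + deletion_reads + oligo_reads
redundancy = total_reads * mean_read_length / final_length

print(f"{total_reads} reads, {redundancy:.2f} readings per base pair")
```

Under these assumptions the estimate rounds to the 4.6 readings per base pair reported above.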

Sequence analysis

Figure 3 shows the entire sequence of 9350 nucleotides. The sequence has been extended to the end of YKL160 using data from Boyer et al. (1992). Five long ORFs representing 89.6% of the total sequence have been recognized (note that YKL166 is incomplete). Each ORF has been considered from its most upstream ATG codon. Codon adaptation indexes (Sharp and Li, 1987) for the ORFs are: 0.279, 0.139, 0.112, 0.130 and 0.128 for YKL160, 162, 164, 165 and 166, respectively. This is indicative of genes expressed at low or moderate levels, except for YKL160, which encodes a product related to a ribosomal protein (see below). Part of the sequence presented here (positions 2464 to 3973) has independently been reported recently (Kleff et al., 1992). The sequence reported

Figure 3. Complete sequence of the 9.3 kb segment (nucleotides 1 to 9350, written from the left telomere side to the centromere side). The sequence is written from 5' to 3'. The five ORFs are boxed and their orientation is given by thick arrows. Maximum extension of each ORF is given using the first ATG encountered from the next upstream stop codon. Consensus sequences falling between ORFs are indicated: a TATA box (box) upstream of YKL165, a termination site (underlined) downstream of YKL162.

Table 1. Optimized FASTA scores obtained when the putative translation product of each ORF is compared with the MIPSX databank.

ORF name   Size (amino acids)   Optimized score   Homologous or identical (*) protein                        Reference
YKL160     236                  108               S. cerevisiae ribosomal protein L10                        Mitsui and Tsurugi (1988)
YKL162     1483                 118               S. cerevisiae IRA2 gene product                            Tanaka et al. (1990)
YKL164     353                  1734 (*)          S. cerevisiae CCE1 gene product                            Kleff et al. (1992)
YKL165     583                  135               S. cerevisiae MYO2 gene product (myosin-I isoform)         Johnston et al. (1991)

Figure 4. Sequences of the putative gene products from YKL160, YKL162 and YKL165. Amino acid sequences were analysed using DNA Strider (Marck, 1988). Top: hydrophobicity profiles (Kyte and Doolittle, 1982). Middle: acido-basic profile (lane A: acidic: full bar = E; intermediate bar = D; lane B: basic: full bar = R; intermediate bar = K; small bar = H). Bottom: cysteine-histidine map (on the bottom lane: Y, C, F, L and H are plotted as bars of increasing length, H being plotted as a full bar; on the top lane only C and H are plotted).

Figure 5. Alignment between the amino acid sequences of YKL160 and the ribosomal protein L10 of Saccharomyces cerevisiae. The two sequences have been aligned using CLUSTAL (Higgins and Sharp, 1988). Stars indicate identical amino acids, dots indicate conservative substitutions. Sequence of the L10 protein is from Mitsui and Tsurugi (1988).

by these authors is identical to ours except for the replacement of the A at position 3870 by a C (this changes codon 1314 of YKL162 from a Ser codon to an Ala codon). We have rechecked this position on our gel readings and confirm the presence of an A. The sequence was analysed for the presence of ARS, transcription and splicing signals and for consensus sequences for protein binding using a list established by MIPS (unpublished). Consensus sequences located at expected positions with respect to ORFs are indicated in Figure 3. One TATA box and one termination site are located upstream of YKL165 and downstream of YKL162, respectively. No intron, no tRNA and no ARS sequence was discovered.
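The A/C discrepancy can be reasoned through with a small worked example. Codon 1314 of a 1483-codon ORF lies at position 3870, and 3 × 1313 = 3939 nucleotides of upstream coding sequence would not fit before that position, so YKL162 presumably runs on the strand opposite to the one written. An A to C change on the written strand is then a T to G change on the coding strand, which converts a TCN (Ser) codon into GCN (Ala). The actual codon is not given in the text; TCT below is a hypothetical stand-in.

```python
# Worked example: an A->C change on the written strand read as a
# Ser->Ala change on the opposite (coding) strand.
# The codon TCT is a hypothetical choice; any TCN codon behaves the same.

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG ordering
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


ser_codon = "TCT"                              # coding strand (hypothetical)
written = reverse_complement(ser_codon)        # triplet as it appears in the written strand
variant_written = written[:-1] + "C"           # A -> C at the base pairing with codon position 1
variant_codon = reverse_complement(variant_written)

print(ser_codon, CODON_TABLE[ser_codon], "->", variant_codon, CODON_TABLE[variant_codon])
```

The same substitution could not turn an AGT or AGC serine codon into alanine with a single base change, which is consistent with the codon being of the TCN type.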

Analysis of ORF products

The nucleotide sequences of the five ORFs have been translated into amino acid sequences, considering the first AUG codon of each ORF as the start codon, and compared with the databanks PGTrans, PSeqIP, NBRF and MIPSX. YKL164 corresponds to the previously sequenced CCE1 gene, which encodes a cruciform cutting endonuclease (Kleff et al., 1992), and will not be further discussed here. The products of the three other complete ORFs, YKL165, YKL162 and YKL160, show borderline homologies with previously characterized yeast proteins as shown in Table 1. Their amino acid sequences and profiles are given in Figure 4.
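The translation convention described above (each ORF read from the first ATG after the preceding in-frame stop) can be sketched as a six-frame scan. This is an illustrative reimplementation using the standard genetic code, not the DNA Strider procedure used in the paper. The function name and the toy sequence are hypothetical, and ORFs that run off the end of the sequence without a stop codon are ignored.

```python
# Minimal six-frame ORF scan: for each strand and frame, start at the first
# ATG after the previous in-frame stop and extend to the next stop codon.

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(dna):
    """Return (codons, strand, frame, protein) for the longest closed ORF."""
    dna = dna.upper()
    best = (0, "+", 0, "")
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            current = None  # residues of the ORF being extended, if any
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous bases
                if aa == "*":
                    if current is not None and len(current) > best[0]:
                        best = (len(current), strand, frame, "".join(current))
                    current = None
                elif current is not None:
                    current.append(aa)
                elif aa == "M":
                    current = ["M"]  # most upstream ATG after the last stop
    return best


demo = "CCATGCCTAGATCTAAGTAACC"  # invented toy sequence
result = longest_orf(demo)
print(result)
```

On a real contig the same scan would be applied to the full sequence, and the resulting protein strings could then be passed to a database search program.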

Figure 6. Partial alignment between the YKL165 gene product and the yeast MYO2 protein. The two sequences have been aligned using CLUSTAL. Stars indicate identical amino acids, dots indicate conservative substitutions. Sequence of the MYO2 protein is from Johnston et al. (1991). Conserved amino acids of the heptad repeats of MYO2 that are present in YKL165 are boxed.

YKL160 shows homology with the L10 ribosomal protein of yeast. Alignment between the two proteins is given in Figure 5. YKL160 can be aligned with the amino-terminal part of L10 but is missing the carboxy-terminal part. The L10 and L12 proteins, characterized from a variety of archaebacterial and eukaryotic sources, have a highly conserved sequence at their carboxy-terminal end containing the hexapeptide motif DDDMGF but diverge at their amino-terminal end (Newton et al., 1990). However, YKL160 does not show the hexapeptide motif, which raises doubts about whether it is a ribosomal protein of the L10 family. The product of YKL165 shows a weak homology with the MYO2 protein of S. cerevisiae even though the two proteins differ significantly in length (583 and 1574 amino acids, respectively). Partial alignment between the two proteins is shown in Figure 6. Some of the amino acids which are conserved in the heptad repeats of MYO2 are also found in the YKL165 gene product. Other parts of the two proteins, however, do not reveal obvious possible alignments. Finally, YKL162 shows a weak homology with the IRA2 gene product. However, no significant alignment could be found.

ACKNOWLEDGEMENTS

S.P. and M.G. have contributed equally to this work. We thank Martina Haasemann and all MIPS staff for help with the sequence analysis, and all members of the Unité de Génétique Moléculaire des Levures for advice and critical discussions. This work was supported by the BRIDGE program of the Division of Biotechnology of the Commission of European Communities and by the 'Action Spécifique Génome' of the Ministère de l'Education Nationale (DRED). B.D. is Professor of Molecular Genetics at the Université Pierre et Marie Curie.

REFERENCES

Boyer, J., Pascolo, S., Richard, G.-F. and Dujon, B. (1992). Sequence of a 7.8 kb segment on the left arm of yeast chromosome XI reveals four open reading frames including the CAP1 gene, a homolog to an RNA polymerase II elongation factor and an intron-containing gene. Yeast, in press.

Claverie, J. M. (1984). A common philosophy and FORTRAN77 software package for implementing and searching sequence databases. Nucl. Acids Res. 12, 397-407.

Colleaux, L., Richard, G.-F., Thierry, A. and Dujon, B. (1992). Sequence of a segment of yeast chromosome XI identifies a new mitochondrial carrier, a new member of the G protein family, and a protein with the PAAKK motif of the H1 histone. Yeast 8, 325-336.

Higgins, D. G. and Sharp, P. M. (1988). Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244.

Jacquier, A., Legrain, P. and Dujon, B. (1992). Sequence of a 10.7 kb segment of yeast chromosome XI identifies the APN1 and the BAF1 loci and reveals one tRNA gene and several new open reading frames including homologs to RAD2 and kinases. Yeast 8, 121-132.

Johnston, G. C., Prendergast, J. A. and Singer, R. A. (1991). The Saccharomyces cerevisiae MYO2 gene encodes an essential myosin for vectorial transport of vesicles. J. Cell Biol. 113, 539-551.

Kleff, S., Kemper, B. and Sternglanz, R. (1992). Identification and characterization of yeast mutants and the gene for a cruciform cutting endonuclease. EMBO J. 11, 699-704.

Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydrophobic character of a protein. J. Mol. Biol. 157, 105-132.

Marck, C. (1988). 'DNA Strider': a 'C' program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers. Nucl. Acids Res. 16, 1829-1836.

Mitsui, K. and Tsurugi, K. (1988). cDNA and deduced amino acid sequence of 38 kDa-type acidic ribosomal protein A0 from Saccharomyces cerevisiae. Nucl. Acids Res. 16, 3573.

Newton, C. H., Shimmin, L. C., Yee, J. and Dennis, P. P. (1990). A family of genes encode the multiple forms of the Saccharomyces cerevisiae ribosomal proteins equivalent to the Escherichia coli L12 protein and a single form of the L10 equivalent ribosomal protein. J. Bacteriol. 172, 579-588.

Sharp, P. M. and Li, W. H. (1987). The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucl. Acids Res. 15, 1281-1295.

Tanaka, K., Nakafuku, M., Tamanoi, F., Kaziro, Y., Matsumoto, K. and Toh-E, A. (1990). IRA2, a second gene of Saccharomyces cerevisiae that encodes a protein with a domain homologous to mammalian ras GTPase-activating protein. Mol. Cell. Biol. 10, 4303-4313.

Thierry, A., Fairhead, C. and Dujon, B. (1990). The complete sequence of the 8.2 kb segment left of MAT on chromosome III reveals five ORFs, including a gene for a yeast ribokinase. Yeast 6, 521-534.
