GENOMICS 14, 449-454 (1992)

Length and Sequence Variation in the Apolipoprotein B Intron 20 Alu Repeat

MARK D. SHRIVER,* GERARD SIEST,† AND ERIC BOERWINKLE*,1

*Center for Demographic and Population Genetics, Graduate School of Biomedical Sciences, University of Texas Health Science Center at Houston, P.O. Box 20334, Houston, Texas 77225; and †Center for Preventive Medicine, B.P. 7, 54501 Nancy, France

Received February 26, 1992; revised July 17, 1992

We have developed a single-stranded conformation polymorphism (SSCP) protocol for typing both sequence and length variations in an Alu element located in intron 20 of the human apolipoprotein B (apo B) gene. Using the polymerase chain reaction (PCR), we simultaneously amplified and isotopically labeled the apo B intron 20 Alu. The Alu tail, which is composed of two arrays of variable numbers of tandem repeats, (TTTX)y (X = A or G) and (T)z, was separated from the rest of the PCR product by restriction enzyme digestion with PstI. Length variation in the Alu tail (IN20-REP) was thus separated from sequence variation in the Alu body (IN20-SEQ), rendering the SSCP patterns both easier to interpret and more informative. In a sample of 242 unrelated individuals from Nancy, France, we observed 11 SSCP alleles at the IN20-SEQ locus that differed only in sequence. At the IN20-REP locus, we observed 7 alleles that differed in both sequence and length. All alleles at both loci were subcloned and sequenced. One additional allele that did not undergo a detectable mobility shift in SSCP gels was uncovered at each locus during sequencing of the SSCP alleles. The additional IN20-SEQ allele was typed by restriction enzyme digestion. Although the number of IN20-SEQ and IN20-REP alleles was large, most were uncommon; the three most common alleles at each locus represented more than 94% of those sampled. We also typed the children of the 242 unrelated French individuals, enabling verification of the Mendelian segregation of the two loci and construction of haplotypes. Twenty-three out of a possible 84 haplotypes were observed with a heterozygosity of 0.813. As expected given their close proximity, these two loci are in significant linkage disequilibrium. Using maximum parsimony, we were able to unambiguously place 11 of the 12 IN20-SEQ alleles on a phylogenetic network. We conclude that Alu length and sequence polymorphisms are a source of extensive and widely dispersed variation for a variety of genetic applications. © 1992 Academic Press, Inc.

INTRODUCTION

Alu elements are a family of interspersed repeated sequences of approximately 300 bp in length and are present an estimated 500,000 (Schmid and Jelinek, 1982) to 1,000,000 (Hwu et al., 1986) times per human haploid genome. These elements are the most abundant member of a class of retroposons known as short interspersed repeats. Alus are almost invariably flanked on both sides by direct repeats of 8-19 bp, which are indicative of their retroposon origin. In addition, some Alus have tandem repeat tails (usually Ax or (NAx)y where N is frequently cytosine), which are thought to be created at the time of insertion (reviewed in Rodgers, 1985). Recently, it was shown that the number of repeats in a particular Alu tail is often variable among chromosomes (Economou et al., 1990; Zuliani and Hobbs, 1990a). Separate from the length variation, there also may be extensive sequence variation in human Alu elements, which is accessible via single-strand conformation polymorphism (SSCP) protocols (Orita et al., 1989, 1990). Little is known about the mutational mechanisms and dynamics of length or sequence variation in Alu elements and their tail repeats. In this paper we report on the molecular nature of variation between Alu sequence and Alu tail repeat alleles in the human apolipoprotein B gene.
Apolipoprotein B (apo B) is a large and hydrophobic protein of 4536 amino acids and is the major protein component of low-density lipoprotein (LDL), very-low-density lipoprotein, and chylomicron particles. In addition to serving a structural role in these lipoproteins, apo B is the primary ligand for the LDL receptor, facilitating removal of LDL from the plasma. During genomic sequencing of the apo B locus, an Alu element at the 3' end of intron 20 was found (Ludwig et al., 1987). It was later reported that the apo B intron 20 Alu element tail was polymorphic in length due to variable numbers of a tetranucleotide repeat, (TTTA)n (Zuliani and Hobbs, 1990b).

Here, we investigated whether the apo B intron 20 Alu also had SSCP-detectable sequence variation. Primers were designed to flank the direct repeats and thus amplify the entire Alu sequence and the 3' tail repeat. We discovered that there was extensive sequence variation in the intron 20 Alu element detectable by SSCP and that we could type both sequence variation in the Alu body and length variation in the tail on the same SSCP gel. Sequence and length variations were typed and haplotypes were constructed in 121 nuclear families from Nancy, France. Each of the observed alleles was subcloned and sequenced in several haplotype combinations. In addition, an evolutionary tree relating the Alu body alleles, which differed only in sequence, was constructed.

1 To whom correspondence should be addressed.

0888-7543/92 $5.00 Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

FIG. 1. (A) Diagrammatic representation of the apo B intron 20 Alu. "P" denotes the PstI restriction enzyme site. Direct repeats are labeled "dr" and represented by dotted boxes. Arrows indicate the PCR priming sites. (B) Sequence of the apo B intron 20 Alu element as originally reported (Morton and Wu, 1988). The sequence shown corresponds to IN20-SEQ allele B and IN20-REP allele 1. Numbering begins with the first base of the PCR product. The 5' amplimer sequence and the 3' amplimer binding site are double underlined. Direct repeats are shown in bold. Variant nucleotide positions are indicated by open triangles (△). The PstI restriction enzyme site used to separate the two loci is indicated by solid triangles (▼ and ▲). The NlaIII and MscI restriction enzyme sites are underlined. A solid diamond (♦) indicates the beginning of the Alu sequence, corresponding to nucleotide position 3 of the consensus (Baines, 1986).

[Panel A diagram not reproduced; its labels are IN20-REP (tail array), IN20-SEQ (Alu body), P, and dr.]

[Panel B. Triangles, underlining, and bold type are not reproduced; N marks a base that is illegible in the source.]

  1  ACATTGATTC CTTGGAGTTT CTCTACCTTT TCCTCTCTTT CCTTCCAAAA
 51  CATAGCTTAT TTATTTATTT ATTTATTTGT TTGTTTATAT ATTTATTTAT
101  TTATTTATTT TTTTTTTGAG ATGGAGTCTC GCTCTTTTGC CCAGGCTGCA
151  GTGCAGTGGT GCCATCTCGG CTCACTGCAA GCCCCGCCTC CCGGGTTCAT
201  GCCATTCTCC TGCCTCAGCC TCCTGAGTAG CTGGGACTAC AGGCACCCAC
251  CAACGCGCCC GGCTAATTTT TTGTATTTTT AGTAGAGACG GGGTTTCACC
301  ATGTTAGCCA GAATGGTCTT GATCTCCTGA CCTCATGATC TGCCCGCCTT
351  GGCCTCCCAA AGTGCTGGGA TTACAGGTGT GAGCCACCGC ACCCGGCCCA
401  NNNCATAGCT TCTTACCACA CATCTCTTGA TTCTCT

METHODS

SSCP typing. Genomic DNA was prepared from buffy-layer white blood cells. Polymerase chain reaction (PCR) primers (5'ALU, ACATTGATTCCTTGGAGTTTCTCTA; 3'ALU, AGAGAATCAAGAGATGTGTGGTAAG) were selected from the genomic sequence of apo B (Zuliani and Hobbs, 1990b; Ludwig et al., 1987) and synthesized on an Applied Biosystems 391 PCR-Mate DNA synthesizer (Foster City, CA). Briefly, the PCR conditions used consisted of 30 cycles of 92°C for 1 min, 62°C for 1 min, and 72°C for 1 min and 20 s. Prior to amplification, 5 µl (about 1 µg) of genomic DNA sample was incubated (57°C for 30 min and 100°C for 10 min) with 1 µl proteinase K buffer (100 µg/ml proteinase K in standard 10X PCR buffer with 1% Laureth 12 and 0.5% Tween 20). Two microliters of the treated genomic DNA was then subjected to PCR in a 50-µl reaction volume containing 1 µM each oligo, 200 µM each dNTP, 1 µCi [α-32P]dATP, 5 µl 10X PCR buffer (Promega, Madison, WI), and 0.125 U Taq polymerase (Promega), topped with two drops of mineral oil. The PCR product was digested overnight at 37°C with 5 U of PstI, which cuts the product once (see Fig. 1), separating the majority of the Alu sequence from the tail tandem repeats. Both the sequence (IN20-SEQ) locus and the tandem tail repeat (IN20-REP) locus polymorphisms were typed by running the labeled product on an 8% neutral polyacrylamide (20 X 40 X 0.4 cm) gel. Gels were run in 1X TBE at 4°C for 6-8 h at 1000 V, dried, and subjected to autoradiography overnight at room temperature.

Sequencing. PCR products were sequenced using the standard Sequenase (U.S. Biochemical, Cleveland, OH) protocol after being subcloned into M13mp19. All alleles were subcloned and sequenced in most (21 of 23) haplotype combinations to look for hidden allele heterogeneity. For each haplotype, several clones were sequenced to ensure that the observed variation was biological and not due to artifactual Taq mutations. Restriction enzyme digestion of the PCR product was performed by adding a total of 1 U enzyme to 50 µl of a 100-µl PCR reaction. To increase the effectiveness of the digestion, enzyme was added to the PCR product twice (0.5 U each time), 4 h apart, and the reaction was allowed to proceed overnight.

Statistical tests. Goodness-of-fit to Hardy-Weinberg equilibrium expectations was tested by pooling the observed number of heterozygotes and homozygotes and testing against their expected numbers, as suggested by Chakraborty et al. (1991). Several standard measures of linkage disequilibrium were calculated (Hedrick, 1987; Morton and Wu, 1988) to measure the association between the IN20-SEQ and the IN20-REP loci. Monte Carlo methods (de Andrade et al., in preparation) were used to test the statistical significance of the observed association between IN20-SEQ and IN20-REP alleles.

FIG. 2. Apo B IN20-SEQ SSCP allele banding patterns. (A) Representative genotypes containing the 11 detectable SSCP variants at the apo B IN20-SEQ locus. All individuals shown are heterozygous for one of the common alleles, A or B, and another allele. Genotypes are lane 1, A/D; lane 2, A/F; lane 3, A/G; lane 4, B/M; lane 5, A/E; lane 6, A/C; lane 7, B/I; lane 8, A/K; and lane 9, B/J. Gels were run as described under Materials and Methods. (B) Graphical depiction of the IN20-SEQ allele bands. The + and - signs indicate the two strands of each allele.

FIG. 3. Apo B IN20-REP allele SSCP banding patterns and segregation. (A) Representative genotypes containing the seven apo B IN20-REP alleles that were observed in this sample. The three most common alleles are shown in homozygous form: lane 1, 1/1; lane 2, 2/2; and lane 3, 3/3. The four rare alleles are shown in the heterozygous state with allele 1: lane 4, 1/4; lane 5, 1/5; lane 6, 1/6; and lane 7, 1/7. Brackets indicate the clusters of bands indicative of alleles 6 and 7. (B) Segregation of IN20-REP alleles in two families: lane 1, 2/6; lane 2, 1/2; lane 3, 1/2; lane 4, 1/6; lane 5, 1/3; lane 6, 1/7; lane 7, 3/7; lane 8, 3/7.

RESULTS

Figure 1 shows the relevant features of the apolipoprotein B intron 20 Alu element. Among individuals, the amplified product ranged in size from 436 to 443 bp. Prior to typing, the Alu tail was separated from the Alu body by restriction enzyme digestion because the large number of alleles in the two systems together would generate SSCP patterns that were prohibitively complex. In addition, the two elements are structurally distinct within the whole Alu element, and the apo B Alu tail has already been typed as a separate genetic locus (Zuliani and Hobbs, 1990b). When the amplification product was digested with PstI, two fragments were generated. The larger fragment included nucleotides 151-436 in Fig. 1 and was constant in size (285 bp). The smaller fragment ranged in size from 150 to 157 bp among individuals. The 285-bp fragment, which contained the Alu body, was named the IN20-SEQ locus because variation in this fragment was due solely to sequence differences (see below). The shorter fragment, which was variable in length and contains the Alu tail, became the IN20-REP locus because its length variation was due to a variable number of tail repeats (see below). These two loci were amplified and typed in a sample of 484 individuals in 121 nuclear families randomly ascertained from Nancy, France.
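The PstI digestion described above can be mimicked in silico. The following Python sketch is an illustration only and is not part of the original protocol. It takes the first 160 bases of the amplified product as transcribed from Fig. 1B (the transcription, like every variable and function name here, is an assumption of this example), confirms that the product begins with the 5'ALU primer, and locates the PstI site (CTGCAG, cut after CTGCA on the strand shown).

```python
# First 160 bases of the PCR product, transcribed from Fig. 1B
# (IN20-REP allele 1). The transcription is an assumption of this sketch.
AMPLICON_5PRIME = (
    "ACATTGATTC" "CTTGGAGTTT" "CTCTACCTTT" "TCCTCTCTTT" "CCTTCCAAAA"
    "CATAGCTTAT" "TTATTTATTT" "ATTTATTTGT" "TTGTTTATAT" "ATTTATTTAT"
    "TTATTTATTT" "TTTTTTTGAG" "ATGGAGTCTC" "GCTCTTTTGC" "CCAGGCTGCA"
    "GTGCAGTGGT"
)

PRIMER_5ALU = "ACATTGATTCCTTGGAGTTTCTCTA"  # 5'ALU primer from Methods
PSTI_SITE = "CTGCAG"                       # PstI recognition sequence
PSTI_CUT_OFFSET = 5                        # PstI cuts CTGCA^G


def top_strand_cut(seq, site=PSTI_SITE, offset=PSTI_CUT_OFFSET):
    """Return the number of bases 5' of the first cut, or None if no site."""
    start = seq.find(site)
    return None if start < 0 else start + offset


primer_matches = AMPLICON_5PRIME.startswith(PRIMER_5ALU)
site_count = AMPLICON_5PRIME.count(PSTI_SITE)
tail_fragment_length = top_strand_cut(AMPLICON_5PRIME)

print("5' primer at start of product:", primer_matches)
print("PstI sites in first 160 bases:", site_count)
print("Tail (IN20-REP) fragment length:", tail_fragment_length)
```

Under these assumptions the cut falls after base 150, which agrees with the 150-bp lower bound reported for the smaller fragment.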
Twelve alleles were identified and characterized at the IN20-SEQ locus; 11 of the 12 alleles were recognized as distinct on SSCP gels and are shown in Fig. 2A and diagramed in Fig. 2B. Homozygotes showed two bands on the SSCP gels (data not shown), one for each DNA strand. Heterozygotes often appeared as four bands (lanes 1-9 in Fig. 2). The banding patterns of three IN20-SEQ alleles, F, I, and K (lanes 2, 7, and 8, respectively, in Fig. 2), were perceptibly different but too similar to definitively type by SSCP. Therefore, the typing of these three alleles was verified by digestion with the restriction enzymes MscI and NlaIII. The banding patterns of the other IN20-SEQ alleles were distinct and easily typed by SSCP.

Seven alleles were identified and characterized at the IN20-REP locus; all of these are shown in Fig. 3A. One strand of the IN20-REP locus was labeled more than twice as much as the other strand because of differences in the number of adenines between the two. The strand that labeled more intensely ran below the less labeled strand. Homozygotes, therefore, had one dark and one light band, and heterozygotes had two dark and two light bands. IN20-REP alleles 1, 2, 3, 4, and 5 (lanes 1, 2, 3, 4, and 5, respectively, in Fig. 3A) ran according to size. Alleles 6 and 7 (lanes 6 and 7, respectively, in Fig. 3A) also ran according to size, but in addition exhibited small clusters of bands near the primary bands. By subcloning and sequencing, we have determined that these clusters are likely the result of Taq polymerase slippage during amplification in the long poly(T) tracts of these alleles (data not shown). However, these bands were consistently produced on successive amplifications and segregated in a Mendelian fashion as shown in Fig. 3B.

TABLE 1
IN20-SEQ and IN20-REP Allele Frequencies

Allele    Count    Frequency

A. IN20-SEQ alleles
A          181      0.374
B          254      0.525
C           26      0.054
D            1      0.002
E            1      0.002
F            7      0.014
G            1      0.002
H            1      0.002
I            2      0.004
J            1      0.002
K            6      0.012
M            3      0.006
Total      484      1.000

B. IN20-REP alleles
1          225      0.465
2          140      0.289
3           89      0.184
4            1      0.002
5            3      0.006
6           25      0.052
7            1      0.002
Total      484      1.000

TABLE 2
IN20-SEQ-REP Haplotype Counts in Unrelated French Individuals

IN20-SEQ                   IN20-REP allele
allele        1     2     3     4     5     6     7   Totals
A           138    27     5     -     -     -     -     170
B            65    97    80     1     -     -     1     244
C             1     -     1     -     3    21     -      26
D             1     -     -     -     -     -     -       1
E             -     1     -     -     -     -     -       1
F             -     5     1     -     -     -     -       6
G             -     -     1     -     -     -     -       1
H             1     -     -     -     -     -     -       1
I             1     1     -     -     -     -     -       2
J             -     -     -     -     -     1     -       1
K             6     -     -     -     -     -     -       6
M             -     -     -     -     -     3     -       3
Totals      213   131    88     1     3    25     1     462

TABLE 3
Position and Identity of Variant Nucleotides in IN20-SEQ Alleles

Allele    Nucleotide position (a): 200, 201, 290, 306, 312, 316, 336, 362, 378

[Rows for alleles A through M; the individual entries are not legible in the source.]

a Nucleotide position numbers are as indicated in Fig. 1.
b Periods indicate identity with the sequence in the first line.

The frequencies of the sequence and length alleles as estimated from the unrelated individuals in our sample are presented in Tables 1A and 1B, respectively. Although there were many alleles at both loci, most were uncommon. Alleles 1, 2, and 3 at the IN20-REP locus made up 94% of the repeat alleles in this sample. Alleles A, B, and C at the IN20-SEQ locus represented 95% of the sequence alleles in this sample. The unbiased estimate of heterozygosity was 0.665 for the IN20-REP locus and 0.583 for the IN20-SEQ locus. Both loci were in Hardy-Weinberg equilibrium (IN20-SEQ: χ2 = 0.774, df = 1; IN20-SEQ: χ2 = 2.04, df = 1).

Haplotypes of the two loci were constructed by examining the segregation of alleles in 121 nuclear families. Twenty-three of a possible 84 haplotypes were found in this sample and the numbers of each observed are given in Table 2. Considered together as a haplotype, the IN20-SEQ and IN20-REP loci had an unbiased heterozygosity of 0.813. The two loci, IN20-REP and IN20-SEQ, are nonrandomly associated. Two of the most frequently used measures of linkage disequilibrium (Hedrick, 1987), D~ and D*, had values of 0.00083 and 0.00213, respectively (P < 0.001).
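The single-locus heterozygosities quoted above can be recomputed from the allele counts in Table 1. The Python sketch below assumes the common small-sample correction, H = n/(n - 1) × (1 - Σp²), with n the number of chromosomes sampled. The estimator actually used is not stated in the text, so this choice, like the function and variable names, is an assumption of the example.

```python
# Allele counts from Table 1 (484 chromosomes from 242 unrelated individuals).
IN20_SEQ_COUNTS = {"A": 181, "B": 254, "C": 26, "D": 1, "E": 1, "F": 7,
                   "G": 1, "H": 1, "I": 2, "J": 1, "K": 6, "M": 3}
IN20_REP_COUNTS = {1: 225, 2: 140, 3: 89, 4: 1, 5: 3, 6: 25, 7: 1}


def unbiased_heterozygosity(counts):
    """Small-sample-corrected heterozygosity: n/(n-1) * (1 - sum(p_i^2))."""
    counts = list(counts)
    n = sum(counts)
    sum_p_squared = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_p_squared)


def top_share(counts, k=3):
    """Fraction of the sample carried by the k most common alleles."""
    counts = sorted(counts, reverse=True)
    return sum(counts[:k]) / sum(counts)


h_seq = unbiased_heterozygosity(IN20_SEQ_COUNTS.values())
h_rep = unbiased_heterozygosity(IN20_REP_COUNTS.values())

print(f"IN20-SEQ: H = {h_seq:.3f}, top three alleles = {top_share(IN20_SEQ_COUNTS.values()):.1%}")
print(f"IN20-REP: H = {h_rep:.3f}, top three alleles = {top_share(IN20_REP_COUNTS.values()):.1%}")
```

With the Table 1 counts this estimator gives values that round to 0.583 for IN20-SEQ and 0.665 for IN20-REP, in agreement with the figures reported above.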
To characterize the observed SSCP variation at the sequence level and to look for hidden variation, PCR products amplified from individuals containing each allele observed in our SSCP typing were subcloned into M13mp19 and sequenced. Table 4 shows the number and identity of the IN20-SEQ-REP haplotypes that were sequenced. In all, we sequenced 35 IN20-SEQ-REP haplotypes from 18 individuals. Multiple representatives of each allele were sequenced as were most of the observed haplotype combinations (21 of 23). The positions and identities of the variant nucleotides in each allele are shown in Table 3. Nine variant nucleotide positions were observed. Eight of these had only two alternative bases, while one had three.

TABLE 4
IN20-REP Allele Sequences and Haplotypes That Were Sequenced

Allele  Size  Repeat array sequence                                 Haplotypes sequenced
1       150   GCTTA (TTTA)4 (TTTG)2 TTTA TATA (TTTA)4 (T)10         A1(5),a B1(3), D1, H1, K1, I1
2       154   GCTTC (TTTA)5 (TTTG)2 (TTTA)7 (T)6                    A2(2), B2(6), E2, F2, I2
3       157   GCTTC (TTTA)5 (TTTG)3 (TTTA)6 (T)9                    A3, B3(3)
4       153   GCTTC (TTTA)5 (TTTG)3 (TTTA)5 (T)9                    B4
5       155   GCTTC (TTTA)5 (TTTG)2 (TTTA)5 (TTA)2 (T)9             C5
6       157   GCTTC (TTTA)5 (TTTG)2 (TTTA)5 TTA (T)14               C6(2), J6, M6
7       153   GCTTC (TTTA)5 (TTTG)2 (TTTA)4 (T)17                   B7

a Numbers in parentheses are the numbers of haplotypes sequenced when more than one was sequenced.

We uncovered one allele at the IN20-SEQ locus that did not undergo an SSCP mobility shift and identified restriction site differences between alleles by sequencing. IN20-SEQ alleles H and M were typed as the same allele on SSCP gels but sequencing showed that they were different. Restriction enzyme digestion with MscI and NlaIII enabled us to resolve sequence differences between these two alleles. The restriction site of NlaIII was present when variant nucleotide positions 200 and 201 were TG and absent when these positions were CG or CA. Likewise, the variable restriction site for MscI was present when nucleotide position 306 was G and absent when it was A.

Sequences of seven Alu tandem tail repeat (IN20-REP locus) alleles are shown in Table 4. Alleles differed not only in the number of tetranucleotide repeats as reported by Zuliani and Hobbs (1990b), but also in the organization of the repeats and in the number of T's in the poly(T) section of the Alu tail. The tandem repeat organization of the IN20-REP locus can be divided into four juxtaposed arrays: (a) (TTTA)w, W = 4-5; (b) (TTTG)x, X = 2-4; (c) (TTTA)y, Y = 4-7; and (d) (T)z, Z = 6-17. Alleles 5 and 6 shared an added tandem repeat that is unique to these two alleles; the repeat TTA was found once in allele 6 and twice in allele 5. In addition to variation in repeat array organization and number, there were also two variant nucleotide positions at the IN20-REP locus: nucleotide position 59 (A ↔ C) and nucleotide position 89 (T ↔ A) in the second TTTA array. All IN20-REP 1 alleles that we sequenced had A at nucleotide positions 59 and 89. All other alleles sequenced had C and T at these two positions, respectively.

A phylogeny of the evolutionary relationships between IN20-SEQ alleles was constructed using maximum parsimony. These relationships are represented in Fig. 4. The second and third most frequent IN20-SEQ alleles (A and C) can each be derived from the most frequent allele (B) by one change. All of the other alleles, except one, could be derived from one of the three common alleles by one or two steps. IN20-SEQ allele J could not be definitively placed on the tree because it can be derived from either A or C by one change. Unlike the IN20-SEQ locus, a single most parsimonious tree was not evident for the IN20-REP locus, and the three common alleles at the IN20-REP locus could not be derived from one another by single mutational steps.

FIG. 4. Most parsimonious IN20-SEQ allele phylogeny. Circles represent the 11 IN20-SEQ alleles that can be placed. The area of each circle is proportional to the relative frequency of that allele in the sample. Short lines represent one mutational event; the long lines represent two mutational events.

DISCUSSION

We report here a detailed analysis of two adjacent multiallelic loci that together constitute the intron 20 Alu of the human apo B gene. Variability among alleles was detected using SSCP, a technique that is sensitive to sequence- and length-dependent conformational changes in single-stranded DNA (Orita et al., 1990). In a sample of 242 unrelated French individuals, we observed 12 alleles at the IN20-SEQ locus that differed only in sequence. As expected, the majority of the nucleotide differences (8 of 11) were due to C ↔ T or A ↔ G transitions (Vogel and Kopun, 1977). Two of these 8 transitions occurred at CpG dinucleotides, which are reported hot spots for mutation (Barker et al., 1984; Cooper et al., 1985). These results are similar to those reported by Orita et al. (1990), where 7 of 10 substitutions found to be causing SSCP shifts in amplified Alu elements were transitions, 3 of which occurred at CpG dinucleotides.

We have described seven alleles at the IN20-REP locus, differing in both length and sequence. Differences between the IN20-REP alleles were complex and are more easily discussed if the variation is divided into three groups: nucleotide substitutions, number of tetranucleotide repeats, and length of the poly(T) tract. Nucleotide position 59 in Fig. 1 was variable among IN20-REP alleles. Allele 1 had nucleotide A while all other IN20-REP alleles had nucleotide C at this position. Allele 1 had another unique feature that was due to nucleotide substitution; the second repeat in the second array of (TTTA) repeats was TATA and not TTTA. Apparently, the second position in this repeat (nucleotide position 89) had mutated from T to A.

Like many other short tandem repeat arrays, the IN20-REP locus had more than one repeat unit sequence (TTTA and TTTG). Therefore, alleles with the same number of repeats could have different sequences. For example, the tetranucleotide repeat arrays from alleles 2 [(TTTA)5 (TTTG)2 (TTTA)7] and 3 [(TTTA)5 (TTTG)3 (TTTA)6] had different sequences but the same number of tandem repeats. Alleles 5 and 6 also shared a tandem repeat array that was unique to these two alleles. Immediately 5' of the poly(T) tract, allele 5 had two copies and allele 6 had one copy of the repeat TTA.

In addition to variable numbers of tetranucleotide repeats, there was considerable variability among IN20-REP alleles in the number of T's in the poly(T) array. The poly(T) array ranged in size from 6 to 17 nucleotides and determined the size of the IN20-REP alleles to a similar degree as the number of tetranucleotide repeats.

There was very little hidden heterogeneity among IN20-REP alleles. For example, we sequenced 12 IN20-REP 1 alleles found in association with different IN20-SEQ alleles (see Table 4) and found that they all had the same tetranucleotide repeat structures and poly(T) array lengths.
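The way repeat-array structure and poly(T) length jointly set the IN20-REP fragment size can be checked with a short calculation. The Python sketch below is illustrative only. The array structures are this example's reading of Table 4, taking 5 as the copy number of any (TTTA) array whose subscript is unclear because that value reproduces the reported size. The two flanking lengths, 54 and 33 bases, are counted from the Fig. 1 sequence on either side of the tail array, up to the PstI cut. All names and both flank lengths are assumptions of this example.

```python
FLANK_5 = 54  # bases 1-54 of Fig. 1, upstream of GCTTA/GCTTC (assumed)
FLANK_3 = 33  # bases from the end of the poly(T) tract to the PstI cut (assumed)

# allele: (leading pentamer, [(repeat unit, copies), ...], size reported in Table 4)
ALLELES = {
    1: ("GCTTA", [("TTTA", 4), ("TTTG", 2), ("TTTA", 1), ("TATA", 1), ("TTTA", 4), ("T", 10)], 150),
    2: ("GCTTC", [("TTTA", 5), ("TTTG", 2), ("TTTA", 7), ("T", 6)], 154),
    3: ("GCTTC", [("TTTA", 5), ("TTTG", 3), ("TTTA", 6), ("T", 9)], 157),
    4: ("GCTTC", [("TTTA", 5), ("TTTG", 3), ("TTTA", 5), ("T", 9)], 153),
    5: ("GCTTC", [("TTTA", 5), ("TTTG", 2), ("TTTA", 5), ("TTA", 2), ("T", 9)], 155),
    6: ("GCTTC", [("TTTA", 5), ("TTTG", 2), ("TTTA", 5), ("TTA", 1), ("T", 14)], 157),
    7: ("GCTTC", [("TTTA", 5), ("TTTG", 2), ("TTTA", 4), ("T", 17)], 153),
}


def fragment_size(lead, arrays):
    """Length of the PstI tail fragment implied by a repeat-array structure."""
    return FLANK_5 + len(lead) + sum(len(unit) * n for unit, n in arrays) + FLANK_3


def tetranucleotide_repeats(arrays):
    """Total copies of four-base repeat units, whatever their sequence."""
    return sum(n for unit, n in arrays if len(unit) == 4)


computed = {a: fragment_size(lead, arrays) for a, (lead, arrays, _) in ALLELES.items()}
repeats = {a: tetranucleotide_repeats(arrays) for a, (_, arrays, _) in ALLELES.items()}

for allele, (_, _, reported) in ALLELES.items():
    print(allele, computed[allele], reported, repeats[allele])
```

Under these assumptions every computed size equals the size listed in Table 4, and alleles 2 and 3 carry the same number of tetranucleotide repeats (14) while differing by 3 bp through their poly(T) tracts.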
However, one variation in the IN20-REP locus that was not detectable in SSCP gels was uncovered by sequencing. The IN20-REP allele 3 associated with IN20-SEQ allele G was found to have the same poly(T) array size as the other IN20-REP 3 alleles, but a different tetranucleotide array structure. As there was no obvious means of differentiating this new allele other than sequencing, we did not type this heterogeneity in the entire sample.

When considered together, the IN20-REP and IN20-SEQ loci formed a highly informative genetic marker (H = 0.813), which could be useful for genetic linkage, anthropological, and disease association studies. There are between 500,000 (Schmid and Jelinek, 1982) and 1,000,000 (Hwu et al., 1986) Alu elements in the human genome. A survey of human sequences in the GenBank database (Moyzis et al., 1989) showed that Alus are randomly dispersed in these sequences and that one is present on average every 4000 bp. Most genes therefore have one or more Alu elements. Orita et al. (1990) reported that 50% of the Alus that they screened with SSCP had more than one allele. In addition, Economou et al. (1990) and Zuliani and Hobbs (1990a) demonstrated that a selection of Alu repeats showed length variation when amplified and run on denaturing gels. We have combined these two techniques to type both sequence and length variations in one Alu element in a single SSCP gel. In conclusion, the associated loci, IN20-REP and IN20-SEQ, in intron 20 of the human apo B gene discussed here are easily typed, highly informative, and may be useful for a variety of genetic studies.

ACKNOWLEDGMENT

Support for this research was from Grants NIH-HL-40613 and IJCX-0038 to E.B.

REFERENCES

Baines, W. (1986). The multiple origins of the human Alu sequences. J. Mol. Evol. 23: 189-199.
Barker, D., Shafer, M., and White, R. (1984). Restriction sites containing CpG show a higher frequency of polymorphism in human DNA. Cell 36: 131-138.
Chakraborty, R., Fornage, M., Gueguen, R., and Boerwinkle, E. (1991). Population genetics of hypervariable loci: Analysis of PCR based VNTR polymorphism within a population. In "DNA Fingerprinting: Approaches and Applications" (T. Burke, G. Dolf, A. J. Jeffreys, and R. Wolff, Eds.), pp. 127-143, Birkhäuser Verlag, Basel, Switzerland.
Cooper, D. N., Smith, B. A., Cooke, H. J., Niemann, S., and Schmidtke, J. (1985). An estimate of unique DNA sequence heterozygosity in the human genome. Hum. Genet. 69: 201-205.
Economou, E. P., Bergen, A. W., Warren, A. C., and Antonarakis, S. E. (1990). The polyadenylate tract of Alu repetitive elements is polymorphic in the human genome. Proc. Natl. Acad. Sci. USA 87: 2951-2954.
Hedrick, P. W. (1987). Gametic disequilibrium measures: Proceed with caution. Genetics 117: 331-341.
Huang, L. S., Ripps, M. E., and Breslow, J. L. (1991). Molecular basis of five apolipoprotein B gene polymorphisms in noncoding regions. J. Lipid Res. 31: 71-77.
Hwu, H. R., Roberts, J. W., Davidson, E. H., and Britten, R. J. (1986). Insertion and/or deletion of many repeated DNA sequences in human and higher ape evolution. Proc. Natl. Acad. Sci. USA 83: 3875-3879.
Ludwig, E. H., Blackhart, B. D., Pierotti, V. R., Caiati, L., Fortier, C., Knott, T., Scott, J., Mahley, R. W., Levy-Wilson, B., and McCarthy, B. J. (1987). DNA sequence of the human apolipoprotein B gene. DNA 6: 363-372.
Morton, N. E., and Wu, D. (1988). Alternative bioassays of kinship between loci. Am. J. Hum. Genet. 42: 173-177.
Moyzis, R. K., Torney, D. C., Meyne, J., Buckingham, J. M., Wu, J-R., Burks, C., Sirotkin, K. M., and Goad, W. B. (1989). The distribution of interspersed repetitive DNA sequences in the human genome. Genomics 4: 273-289.
Orita, M., Suzuki, Y., Sekiya, T., and Hayashi, K. (1989). Rapid and sensitive detection of point mutations and DNA polymorphisms using the polymerase chain reaction. Genomics 5: 874-879.
Orita, M., Sekiya, T., and Hayashi, K. (1990). DNA polymorphism in Alu repeats. Genomics 8: 271-278.
Rodgers, J. H. (1985). The origin and evolution of retroposons. Part 2: The structure and evolution of retroposons. Int. Rev. Cytol. 93: 231-279.
Schmid, C. W., and Jelinek, W. R. (1982). The Alu family of dispersed repetitive sequences. Science 216: 1065-1070.
Vogel, F., and Kopun, M. (1977). Higher frequencies of transitions among point mutations. J. Mol. Evol. 9: 159-180.
Zuliani, G., and Hobbs, H. H. (1990a). A high frequency of length polymorphism in repeated sequences adjacent to Alu sequences. Am. J. Hum. Genet. 46: 963-969.
Zuliani, G., and Hobbs, H. H. (1990b). Tetranucleotide repeat polymorphism in the apolipoprotein B gene. Nucleic Acids Res. 18: 4299.
