INVESTIGATION

High-Resolution Specificity from DNA Sequencing Highlights Alternative Modes of Lac Repressor Binding Zheng Zuo and Gary D. Stormo1 Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63108-8510 ORCID ID: 0000-0001-6896-1850 (G.S.)

ABSTRACT Knowing the specificity of transcription factors is critical to understanding regulatory networks in cells. The lac repressor– operator system has been studied for many years, but not with high-throughput methods capable of determining specificity comprehensively. Details of its binding interaction and its selection of an asymmetric binding site have been controversial. We employed a new method to accurately determine relative binding affinities to thousands of sequences simultaneously, requiring only sequencing of bound and unbound fractions. An analysis of 2560 different DNA sequence variants, including both base changes and variations in operator length, provides a detailed view of lac repressor sequence specificity. We find that the protein can bind with nearly equal affinities to operators of three different lengths, but the sequence preference changes depending on the length, demonstrating alternative modes of interaction between the protein and DNA. The wild-type operator has an odd length, causing the two monomers to bind in alternative modes, making the asymmetric operator the preferred binding site. We tested two other members of the LacI/ GalR protein family and find that neither can bind with high affinity to sites with alternative lengths or shows evidence of alternative binding modes. A further comparison with known and predicted motifs suggests that the lac repressor may be unique in this ability and that this may contribute to its selection.

T

HE lactose regulatory system established the paradigm of a trans-acting factor binding to a cis-acting element to regulate the expression of the adjacent gene in response to an environmental signal (Jacob and Monod 1961). Many aspects of the lac repressor protein have been studied extensively (reviewed in Lewis 2005). Our primary interest is in the DNA binding specificity of the lac repressor. Measurements of affinity changes due to operator sequence variation, by base replacement, by the use of base analogs, or by changing the length of the operator, have been performed almost since the operator sequence was first determined (Goeddel et al. 1978; Sadler et al. 1983; Betz et al. 1986; Copyright © 2014 by the Genetics Society of America doi: 10.1534/genetics.114.170100 Manuscript received August 21, 2014; accepted for publication September 6, 2014; published Early Online September 9, 2014. Supporting information is available online at http://www.genetics.org/lookup/suppl/ doi:10.1534/genetics.114.170100/-/DC1. Sequence data from this article have been deposited with the NCBI GEO database under accession no. GSE61223. 1 Corresponding author: Department of Genetics, Center for Genome Sciences and Systems Biology, 4444 Forest Park Blvd., Washington University School of Medicine, St. Louis, MO 63108-8510. E-mail: [email protected]

Sartorius et al. 1989; Lehming et al. 1990; Sasmor and Betz 1990; Frank et al. 1997; Spronk et al. 1999; Falcon and Matthews 2001; Kalodimos et al. 2002, 2004b; Daber and Lewis 2009). But those analyses all measured binding affinity to only a few sequences. The lac repressor has not, to our knowledge, been analyzed by current high-throughput methods that can determine specificity over thousands, or even millions, of sequences in parallel (Stormo and Zhao 2010), such as protein-binding microarrays (PBM) (Berger et al. 2006; Gordan et al. 2013), SELEX-seq [or HT-SELEX (Zhao et al. 2009; Zykovich et al. 2009; Jolma et al. 2010; Wong et al. 2011)], bacterial one-hybrid (B1H) (Meng et al. 2005; Noyes et al. 2008; Christensen et al. 2011), and mechanically induced trapping of molecular interactions (MITOMI) (Maerkl and Quake 2007). While those methods offer an expansive overview of binding specificity, the accuracies of the resulting models are highly variable. This can be due to low-quality data, but even with highly reproducible experiments specificity determination requires complex computational modeling (Stormo 2013; Weirauch et al. 2013; Orenstein and Shamir

Genetics, Vol. 198, 1329–1343 November 2014

1329

2014) and the results can vary considerably, depending on the method used and the protein being analyzed, including their ability to predict in vivo binding site locations (see Weirauch et al. 2013 for a comparison of many different approaches to the analysis of PBM data for many different transcription factors). Here we introduce a new high-throughput method, Specseq, that directly measures specificity by sequencing. It has several advantages over existing methods for quantifying large collections (thousands) of binding site energies in one experiment. Using lac repressor as an example, we show that this method has excellent reproducibility, giving energy measurements generally consistent within 0.1 kT (k is the Boltzmann constant and T is temperature in degrees kelvin; at 0 degrees C, 1 kT = 0.54 kcal/mol). While not measuring in parallel as many different sequences as some of the higher-throughput methods, we obtain high accuracies because we measure exactly what is necessary, the relative affinity to a large collection of sequences, without any mathematical fitting procedures or approximations required. It is similar to MITOMI in the number of different sites that can be analyzed in parallel, but it is much simpler to perform, requiring only a means to separate bound and unbound sequences, which are then sequenced using high-throughput, short-read sequencing machines. When applied to the lac repressor we obtain, in a single experiment, data covering a large fraction of all the previous studies, plus thousands of additional variants, allowing us to compile a much more comprehensive profile of its specificity. We confirm that the lac repressor can bind to sites of different lengths but that the preferred sequence, and the mode of binding, depends on the length. We also applied Spec-seq to two other members of the LacI/GalR protein family, PurR and YcjW, to obtain extensive models of their specificity and test the generality of the lac repressor’s ability to bind to operators of variable length. We find that the lac repressor is apparently unique in its ability to bind with high affinity to sites of different lengths and with different modes of binding. In addition to generating extensive data sets of the specificity of particular proteins, the high-throughput methods allow for assessments of the accuracies of simple models, such as position weight matrices (PWMs). The earliest theoretical analyses of binding specificity assumed that each position contributed independently (additively) to binding affinity and that each variation from the preferred sequence had the same energetic cost (Von Hippel 1979; von Hippel and Berg 1986). As data accumulated showing that variations at different positions had different effects on binding, PWM models were developed to accommodate position-specific costs (reviewed in Stormo 2000; Stormo 2013), but those models still assume additivity. Quantitative analyses subsequently found that, while additivity is often violated (e.g., Stormo et al. 1986; Frank et al. 1997; Man and Stormo 2001; Bulyk et al. 2002; Maerkl and Quake 2007; Zhao et al. 2012; Weirauch et al. 2013), it is often

1330

Z. Zuo and G. D. Stormo

a surprisingly good approximation (Sarai and Takeda 1989; Takeda et al. 1989; Benos et al. 2002; Zhao et al. 2012; Weirauch et al. 2013), indicating that the average effect of each variant, which is captured in the PWM, is similar in most contexts. Deviations from additivity are often rather modest, ,0.5 kT between observed and predicted binding energies, so the data from high-throughput experiments may not be sufficiently accurate to distinguish between models. Spec-seq provides quite accurate relative binding energies, usually within 0.1 kT, allowing us to make reliable determinations of nonadditive contributions. Recent analyses of high-throughput data sets have explicitly considered higher-order models to account for nonadditive contributions (Slattery et al. 2011; Zhao et al. 2012; Weirauch et al. 2013; Orenstein and Shamir 2014) and they are beginning to populate motif repositories such as JASPAR (Mathelier and Wasserman 2013; Mathelier et al. 2014). A separate issue is whether proteins can bind DNA in multiple modes, binding in two (or more) distinct ways, so that the specificity cannot be captured with a single PWM or even a higher-order model. A classic example is SREBP1, a basic helix-loop-helix (bHLH) protein that can bind either in a standard antiparallel mode to an E-box motif or in an alternative parallel mode to a quite distinct motif (Kim et al. 1995; Parraga et al. 1998). One recent article suggested multiple modes of binding are quite common (Badis et al. 2009) but further analyses of the same and similar data sets suggested that good PWMs can be found for most proteins and that multiple modes of binding are relatively rare (Zhao and Stormo 2011; Weirauch et al. 2013). Structural studies of the lac repressor bound to the wild-type operator have yielded conflicting results. X-ray crystallography shows the two monomers binding in nearly identical conformation whereas NMR studies show the right monomer binding in an extended conformation (Bell and Lewis 2001; Kalodimos et al. 2002, 2004b; Lewis 2005). From our quantitative data it is clear that the lac repressor binds in alternative modes that depend on the length of the binding site and that the asymmetry of the wild-type operator is not merely tolerated, but is actually the preferred sequence for sites with an odd length.

Materials and Methods Theory for high-throughput determination of relative affinities

The binding of a protein, P, for a particular DNA sequence, Si, is represented by the reaction diagram kon P þ Si

P  Si : koff

The affinity of the protein for that DNA is usually defined as the association constant, KA (or its reciprocal, the KD), and determined by measuring the equilibrium concentrations of each reactant and the complex, P  Si ; as in

KA ðSi Þ ¼

kon ½P  Si  ; ¼ ½P½Si  koff

(1)

where inside the brackets, [ . . . ] refers to concentrations. If multiple different DNA sequences compete for the same pool of proteins, their relative affinities (the ratios of their KA) can be determined by measuring the concentrations of each sequence in the bound and unbound fractions without measuring the free protein concentration, which is usually the most difficult quantity to measure accurately: KA ðS1 Þ : KA ðS2 Þ : ::: : KA ðSn Þ ¼

½P  S1  ½P  S2  ½P  Sn  : : ::: : : ½S1  ½S2  ½Sn  (2)

Current sequencing technology allows us to determine those concentration ratios for thousands of possible binding sites in parallel and therefore to determine the relative binding affinity, or equivalently the differences in binding free energy, for the entire collection of sequences in one experiment, producing a detailed specificity profile for the protein of interest. To be explicit, Equation 2 states that the ratios of concentrations in the bound and unbound fractions are equivalent to the ratios of binding constants, but we do not need to determine absolute concentrations, only their ratios that can be obtained directly from sequencing, as diagrammed in Figure 1. After separating the bound and unbound fractions the DNA from each fraction is barcoded and sequenced on an Illumina MiSeq from which we obtain a certain number of counts for each sequence, Nb in the bound fraction and Nu in the unbound fraction. Provided we have enough counts in each fraction, we can accurately estimate the ratios of the concentrations from those counts, which for the basis of the Spec-seq method are ½P  Si  Nb ðSi Þ  ; ½P  Sj  Nb ðSj Þ

½Si  Nu ðSi Þ  : ½Sj  Nu ðSj Þ

Therefore KA ðSi Þ ½P  Si  ½Sj  Nb ðSi Þ Nu ðSj Þ ¼  : KA ðSj Þ ½P  Sj  ½Si  Nb ðSj Þ Nu ðSi Þ

(3)

From a clean separation into bound and unbound fractions, the accuracy of the relative affinities is limited only by the accuracy of the count ratios. By obtaining hundreds to thousands of counts of each sequence in each fraction we derive very highly reproducible relative affinity values. The raw counts and ratios for each sequence in each experiment described below are available in Supporting Information, Table S1. The natural logarithms of those ratios are the relative (differences in) free energies of binding in units of kT (T = 273° K in these experiments). Setting the relative free energy of the preferred site to 0, all of the relative free energies are provided in Table S2. An expanded description of the theoretical justification

for Spec-seq and a comparison with standard methods for determining association constants is provided in File S1. Library design and preparation

Using known, or predicted, consensus binding sites of transcription factors, we randomize specific positions, using degenerate oligonucleotides. To ensure the shifted fractions of DNA binding sites are visible under the fluorescent imager, no more than five positions in each library are randomized. To make double-stranded DNA (dsDNA) libraries, 100 pmole single-strand degenerate template oligos were mixed with an equal amount of FAM-labeled reverse complement primer (PE2). In the presence of Taq Polymerase, brief 10-sec denaturing followed by 10 min of 55° annealing/extension is sufficient to make dsDNA libraries. Because any unextended single-stranded DNA (ssDNA) could contaminate the unbound band, the reaction mix was digested by 1 ml NEB Exo I exonuclease (New England Biolabs, Beverly, MA) for 30 min. All final DNA products were purified by QIAGEN (Valencia, CA) PCR purification columns. Sequences of all designed oligos and primers are provided in File S1. Protein expression and purification

We have used two methods for obtaining the proteins used in Spec-seq. In one, Escherichia coli strain BL21(DE3) was induced by IPTG for 2 hr to express the His-tagged protein of interest from a plasmid with an upstream T7 promoter. TALON His-Tag resin (Clontech) was then employed to purify the C-terminal tagged protein of interest. Alternatively, in vitro protein expression can be used. Twenty-five microliters NEB PURExpress system containing 100 ng PCR product of linear DNA construct encoding the protein with an upstream T7 promoter was run for 2 hr at 37°. Since all the existing protein components in the NEB system carry a His tag, it is possible to reverse purify the synthesized protein without any specific tag and further concentrate it by an Amicon (Danvers, MA) centrifugal filter. After reverse purification, protein concentration was estimated by Protein A280 (NanoDrop) and showed consistent values as expected from NEB PURExpress specification (100 ng/ml). Bound and unbound DNA separation for sequencing

All binding reactions were done in 13 NEBuffer 4 (50 mM potassium acetate, 20 mM Tris-acetate, 10 mM magnesium acetate, 1 mM DTT, pH 7.9) supplemented with 5% glycerol, and the purified protein of interest (LacI, PurR, and YcjW) was sequentially titrated into a 15-ml binding reaction system by twofold increase per lane, starting at 50 ng. For LacI this is 80 nM of monomer, well above the KD for tetramerization of 5 nM (Royer et al. 1990), so we expect binding to tetramers and we observe only one bound band. For the PurR protein guanine was included at 20 mM. At least 30 min incubation on ice can establish the equilibrium

Specificity by DNA Sequencing

1331

Figure 1 Overview of Spec-seq method. The protein of interest (TF) is mixed with a library of DNA sequences. Each sequence will be partially bound by the protein and occur, to varying extents, in both the bound and unbound fractions. Bound and unbound fractions are separated and sent for sequencing, using barcodes to distinguish the fractions. Counts of each sequence in each fraction are used to obtain the ratios that are proportional to relative affinities.

state of protein–DNA interaction. All EMSA experiments were run in 6% 0.53 TBE PAGE gels at 120 V for 30 min in the cold room. The DNA reaction mix was added to a running gel to hasten entry of the DNA into the gel matrix. 59-FAM-labeled DNA fragments can be visualized by a BioRad (Hercules, CA) Imager with a 520-nm bandpass filter (Figure S1). Shifted and unshifted bands were cut, including all of the visible bands and somewhat above and below them; extracted; and purified separately, using acrylamide gel extraction buffer and QIAGEN PCR columns. Each fraction of DNA was barcoded and amplified, respectively, by indexedIllumina primers (PE1-Genetics1/2, PE2.0). To minimize possible PCR bias, we recommend no more than 10 cycles of amplification (although in independent tests we found very minimal PCR bias). For typical randomized DNA libraries with diversity up to 2000, a standard 1x75 Miseq run is sufficient to produce 15 million reads covering all possible variants with sufficient counts to get reliable ratios. Each read is sorted according to its barcode and binding site content. All sequences with variations in the conserved regions are filtered out. For each unique variant in the library, the ratio between bound reads and unbound

1332

Z. Zuo and G. D. Stormo

reads is calculated and designated as relative binding affinity, compared to the preferred sequence, as described above. All sequence data are available from the NCBI Gene Expression Omnibus (GEO) database under accession no. GSE61223. Given the relative affinity/binding energy values of all variants, linear regression can produce an energy matrix [as a PWM (Stormo et al. 1986)], which can be visualized in logo form. Often variants with more than two mismatches will saturate to the nonspecific plateau level, so we use those variants with no more than two mismatches to the consensus site for regression analysis. We can then assess the fit of that PWM to the entire distribution as well as to just those sequences used to obtain the PWM.

Results Relative affinities of lac operator variants

Figure 2 shows the sequence of the lac operator, O1, which overlaps the promoter for the lacZYA operon and controls its expression, as well as the other two accessory operators O2 and O3. All three operator sequences are approximately symmetric, as expected for binding by a dimeric (actually

Figure 2 Lac operator sequences from E. coli and randomized library designs.

a dimer of dimers) repressor. We focus on the central 11 bp of O1 that contains the asymmetric positions T G A G C G G A T A A: We divide the sequence into a central 3-bp “spacer” and two flanking 4-bp regions that we refer to as L for the left half-site (TGAG) and R for the right half-site (ATAA). The underlined bases are asymmetric about the central G, which must be asymmetric. Figure 2 also shows the libraries of randomized sequences included in this study with spacer lengths from 2 to 4 bp (R2 to R4). To facilitate the comparison of different sequences, we number those 11 positions from 25 to +5 with 0 being the center of the 3-bp spacer (Figure 2). Operators with 2-bp spacers omit the 0 position, while operators with 4-bp spacers contain two 0 positions (referred to as 20 and +0). The R2 library deletes the central G to create a symmetric 2-bp spacer and randomizes the other underlined bases (24, 22, +2, +4) to include all 256 possible binding site sequences at those variable positions. The R4 library inserts a C before the central G to make a symmetric 4-bp spacer and also randomizes the other underlined bases to include all 256 possible binding sites at those variable positions. The R3.1 and R3.2 libraries maintain the 3-bp spacer length of the wild-type operator but randomize the central base and also all 4 bases of one of the half-sites, R in R3.1 and L in R3.2. These libraries each contain 1024 different binding site sequences. These four randomized libraries were combined in a single binding reaction with the lac repressor. This allows us to determine relative binding affinities (and free energies) for all of the 2560 variants, both within each library and between libraries. Figure 3A demonstrates the reproducibility of the method, where the relative ratios of all sequences in the R2 and R4 libraries are shown for two different binding reactions with twofold differences in repressor concentration; the points fit a straight line with r2 = 0.996 and nearly all of the replicate ratios are within 5% of the mean. Figure 3B shows the same data transformed to show binding free energy differences compared to the best binding sequence (with free energy defined as 0), demonstrating that most replicates are within

0.1 kT of each other, although there is a bit more variance among the low-affinity (high-energy) sites. An alternative way to estimate our measurement reproducibility is to compare those ratio differences between complementary variants. For example, if the four random positions of an R2 sequence are AAAA, then the sequence TTTT contains the same binding site in the opposite orientation. For each of the R2 and R4 libraries there are 120 pairs of complementary sequences, after excluding 16 self-complementary sites. The ratios for each identical pair for each library and in both experimental conditions (from Figure 3A) are plotted in Figure 3C and their free energies in Figure 3D. Again nearly all of the ratios are within 5% and energy differences are mostly within 0.1 kT, but with somewhat higher variance for the low-affinity sites. The entire list of all site counts in both the bound and unbound fractions, and their ratios, from each experiment is provided in Table S1. We define the relative free energy of the highest-affinity site to be 0, and the differences in free energy for all of the remaining 2559 binding sites are included in Table S2, organized by library. The total range of binding free energy is 5.4 kT between the highest- and lowest-affinity sites. It should be noted that this is substantially less than the difference in binding free energy between a high-affinity operator and complete nonspecific binding of the lac repressor because all of our randomized sites contain many matches to the preferred binding site. Every site contains the preferred outer positions 210 to 26 and +6 to +10, and every site contains at least 16 of the preferred bases in each library and at most five mismatches. These data give us a comprehensive view of the contributions to specificity from the central portion of the binding sites, the region that is asymmetric in the wild-type O1 operator. Figure 3 demonstrates the precision of Spec-seq, showing the highly reproducible values obtained from independent experiments with different conditions. To assess the accuracy of Spec-seq we compare its results to published values of the specificity of the lac repressor. Spronk et al. (1999) measured binding affinities to nine binding sites that overlap our set, including variants of both sequence and spacing. They used a standard approach of measuring the fraction of DNA bound, in an EMSA gel, at different protein concentrations and fitting the results to a binding curve, as in Equation 1. Table 1 shows their calculated relative KD’s (the reciprocal of KA) compared to the O1 operator and the results obtained from Spec-seq. For three sequences they were unable to determine the binding affinity because they failed to reach saturation in the binding curves; those three sequences are among the four lowest-affinity sequences in our experiments, consistent with their results. For the sequences in which they did obtain affinities, the rank orders of relative affinities coincide and for most cases our measurement is within one standard deviation of their reported value. But our precision is much higher; their standard deviations vary between 42% and 92% of the measured value whereas we assume a 5% variation based on Figure 3. In addition, recent experiments that measured single-molecule

Specificity by DNA Sequencing

1333

Figure 3 Consistency of relative affinity and energy measurements. (A) Bound/unbound ratios, normalized to highest value, for the R2 and R4 libraries from two independent experiments with different protein concentrations. The horizontal axis is the value from the first experiment and the vertical axis is the value from the second experiment. Lines indicate deviations of 5% from the perfect match in the two experiments (r2 . 0.99). (B) The negative natural logarithms of the data in A. The lines represent differences of 0.1 kT. (C) From both experiments from A, the normalized bound/unbound ratio of each sequence with its complement is shown. Self-complementary sequences are excluded (r2 . 0.99). (D) The negative natural logarithms of the data in C.

on and off rates in vivo report the relative affinity (ratios of on/off rates) between O1 and the 2-bp symmetry operator that is nearly identical to our measured value (0.61 vs. 0.51) (Hammar et al. 2014). Taken together these results indicate that Spec-seq can provide highly precise and accurate measurements of relative affinity for thousands of sequences in parallel. Lac repressor binding models

Figure 4 shows energy logos (Foat et al. 2006; Zhao et al. 2012) for each class (2-, 3-, and 4-bp spacers) of randomized library where the energy values are from the singlebase variants from the preferred binding site sequence (energy matrices are in Figure S2). For the 2-bp CG spacer the preferred binding site is symmetric, with both half-sites the same as L (we refer to this binding site as L2L9, where L9 indicates the complement of L). This site was previously shown to have higher affinity than O1 (Sadler et al. 1983) and is, in fact, the highest-affinity binding site in the entire collection. Changes from the preferred base to any of the other bases cost 0.5 kT at the inner positions (22, +2) and 1.4 kT at the outer positions (24, +4). The lowest-affinity

1334

Z. Zuo and G. D. Stormo

sites in the 2-bp library have 5.1 kT higher free energy than the preferred site. The preferred binding site for the 4-bp CCGG spacer library is also symmetric, but now both half-sites are the same as R (the site is R94R). That site has a free energy 0.3 kT above that of L2L9. For this operator changing the T/A at positions 22/+2 to C/G costs only 0.2 kT, whereas changes to G/C cost 1.4 kT and those to A/T cost 2.1 kT. Changes at the outer positions, 24/+4, all cost between 0.8 and 1.3 kT. The lowest-affinity sites have energies 4.5 kT higher than that of R94R for this library. The distinct differences in preferred binding sites between the 2- and 4-bp libraries demonstrates that the protein binds in alternative modes, depending on the distance between the half-sites. The binding energies are nearly the same and both preferred sites are symmetric but the base preferences differ such that operators with a 2-bp spacer prefer two copies of the L halfsite and those with a 4-bp spacer prefer two copies of the R half-site. The preferred binding site for the 3-bp spacer libraries is the same asymmetric site as that of the wild-type O1

Table 1 Relative KD comparison to published values Sequence from Spronk et al. (1999) WT (21) WT (+1) SymL (21) SymL SymL (+1) SymR (21) SymR SymR (+1)

Sequence from this work

Relative KD (Spronk et al. 1999)

Relative KD (this work)

L2R L4R L2L9 L3L9 L4L9 R92R R93R R94R

ND 31 6 13 0.8 6 0.3 24 6 16 ND ND 31 6 21 2.4 6 2.2

22.2 11.0 0.61 4.6 54.6 60.3 27.1 0.82

Sequence names are from Spronk et al. (1999) and from this work. ND, not determined, saturation not obtained.

operator, L3R, consistent with the results from the 2- and 4-bp libraries. The center is defined by the CG at positions 21/0, where the lac repressor hinge helix interacts with the minor groove (Kalodimos et al. 2004a; Lewis 2005), and the preferred site is asymmetric with each half-site preferring the sequence that corresponds to its distance from the center. The binding energy to this preferred site is 0.5 kT higher than to L2L9. Changing the central G to any other base greatly reduces the affinity. Changing the G to A costs 2.1 kT, to C costs 2.9 kT, and to T costs 3.6 kT. Changes to L at positions 22 and 24 are very similar to those observed in the L2L9 operator. This library also contains changes to positions 23 and 25 that are somewhat more costly than to positions 22 and 24, consistent with them being more conserved among the three wild-type operators (Figure 2). Changes to R at positions 2 and 4 are also similar to those observed in R94R, although most of the costs are somewhat diminished. Changes to R at position 3 are more costly than to the other positions. The total range of energies between the highest- and lowest-affinity sites for L variants is 4.8 kT, whereas for R variants the range is only 3.9 kT. In general variations in R are less costly than those in L, consistent with L’s greater conservation and the fact that most of the O1 mutations obtained in genetic screens for inactive operators altered positions in the L half-site (Gilbert et al. 1975; Betz et al. 1986). Figure 5 shows the 10 possible combinations of L and R with each of the three spacings and lists the measured relative binding free energy to each one (9 of these, all but R93L9, were analyzed by Spronk et al. 1999 and are listed in Table 1). The configuration L2L9 has the lowest energy (highest affinity) of all of the operators tested, which we define to be 0 relative free energy. The next-lowest energy combination is R94R with a relative free energy of 0.3 kT. The wild-type operator O1, L3R, is the next lowest with a relative free energy of 0.5 kT. The other 7 operators all have considerably larger free energies, between 1.9 and 4.6 kT, demonstrating that the preference for the L or R half-site depends on the distance from the CG in the spacer. Figure 6 plots the relative free energies for all of the variants to L and R for each library compared to the predicted energy, assuming additive contributions of the single variants. The symbol used for each point indicates the number of variant positions it contains compared to the preferred

sequence for that library. For libraries R2 and R4 (Figure 6, A and B), sites with two variants are further distinguished as to whether the variants are on the same, or opposite, halfsites. Several features of the plots are immediately obvious. One is that for each library there is a plateau in binding energy such that adding more variants no longer increases the binding energy and many different sequences have essentially the same energy. For the R2 and R3.2 libraries that each vary the L half-site (Figure 6, A and C), that plateau occurs at 5 kT. For the R4 library (Figure 6B) the plateau is at 4.5 kT, which is only 4 kT above the preferred sequence, R94R, for that library. The difference in these energy wells indicates the more favorable contacts made between the repressor and the L half-site compared to the R half-site. The R3.1 library (Figure 6D) shows a wide dispersal of energies between 2 and 4 kT for a large variety of sequences. The difference between the R4 and R3.1 libraries, which both vary the R half-site, is due to the additional positions that are varied in the R3.1 library (3 and 5). If one considers only variations at positions 2 and 4, which are varied in R4, then the two plots are much more similar. The R3.1 plot indicates that the protein can accommodate many different sequences so that the total energy cost is in the range of 2–4 kT, often much less than the sum of the individual variants. The Figure 6 plots allow us to assess the additivity approximation, which assumes that binding energy contributions from individual base variants can be summed to obtain the energy of the multivariant binding sites. If the protein binds to its sites in a perfectly additive fashion, then all the variants would fall on the diagonal straight lines of Figure 6. We expect additivity to be violated at the high-energy plateau, but it is often observed that additivity is a reasonably good approximation at lower energies (Sarai and Takeda 1989; Takeda et al. 1989; Benos et al. 2002; Zhao and Stormo 2011; Stormo 2013; Weirauch et al. 2013). Surprisingly, for library R2 all of the multivariant sites (with two minor exceptions among high-energy sites) have higher measured binding energy than predicted from the single variants, often .1 kT higher than predicted. One might expect that the sum of individual energies would set an upper bound on the combined energy and that the protein may compensate in some way for the multiple variants to have a somewhat lower energy than the sum. This is observed in the bHLH proteins (Maerkl and Quake 2007)

Specificity by DNA Sequencing

1335

similar with both over- and underpredicted energies, but many more of the multivariants have reasonably low energy. As described above, the R3.1 library has a wide dispersal of binding energies over the entire range of variants, including both under- and overpredictions. This suggests the protein can make favorable contacts with many different sequences when all four positions of R are varied. The Figure 6 plots use the single-variant energies to predict the binding energies of all the multivariant sequences. One can also use linear regression to find the best fit to all of the data or at least to all of the data below the nonspecific binding energy plateau (Stormo et al. 1986; Stormo and Zhao 2007; Zhao et al. 2009). Applying linear regression to these data sets obtains energy models (as PWMs) that fit the data reasonably well. For example, using all sequences in the R2 library with two or fewer variants to the preferred sequence, one gets an energy PWM that fits the training data with an r2 = 0.96 and nearly all of the predictions are within 0.5 kT of the measured values (Figure S3). The fit to the entire data set, including the plateau region, has an r2 = 0.73 (Figure S3). Similar results can be obtained for the other libraries (Figure S3), indicating that the additivity assumption, while clearly an approximation, can often provide reasonably good estimates of the true binding energies, as has been reported previously for many different transcription factors (Benos et al. 2002; Zhao et al. 2009; Zhao and Stormo 2011; Weirauch et al. 2013). In fact, if the accuracy of our measurements were limited to 0.5 kT, which is a range typical of many high-throughput techniques, the predictions obtained from linear regression would fit the data nearly as well as possible given its inherent variance. However, because we can measure energies to within 0.1 kT, we can easily see deviations from additivity even after applying regression to obtain optimal PWMs. Specificity of other LacI/GalR proteins

Figure 4 Energy Logos. (A) R2 library. (B) R3.2 (left half) and R3.1 (right half) libraries. (C) R4 library. The heights of each letter are the energies (in kT units) for each base, adjusted so the sum for each position is 0, from the single-base variants to the preferred sequence for each library. Note that the energy scale is negative so the preferred sequences (lowest energy) are on top and the bases are in order, top to bottom, from highest to lowest affinity.

where essentially all of the multivariant sites had lower energy than predicted from additivity. In the R3.2 library nearly all of the sites with two or more variants have greatly increased energy, at least 3 kT, but there are ones that are both above and below the predicted energy because of the large effects of some of the single variants. Library R4 is

1336

Z. Zuo and G. D. Stormo

The LacI/GalR family of transcription factors has many known and predicted binding site motifs (Novichkov et al. 2013), nearly all of them symmetric with a CG central pair. However, since many of those motifs are predicted based on comparative analyses that often assume symmetry, it is plausible that more members of the LacI/GalR family might also have flexible binding specificities that allow for variable length spacers and asymmetric binding sites. Methods that assume symmetric binding motifs can give misleading results, just as methods that assume asymmetry can if the sites are truly symmetric (Motlhabi and Stormo 2011). Having quantitative binding affinity data is the best way to resolve the issue of symmetry. Based on the known and predicted motifs for PurR and YcjW we synthesized libraries that vary four positions of one half-site to determine the specificity more accurately and that also contain both 2- and 3-bp spacer regions (Figure 7A). Note that the predicted consensus site for YcjW has a 2-bp spacer but is asymmetric, differing at positions +/22

Figure 5 Schematic model for different energy and conformation states. Shown are 10 different arrangements of half-sites L (TGAG) and R (ATAA) with each of the three different spacers. L9 and R9 refer to their complements. The measured energies are shown for each arrangement.

and +/24 in the two half-sites, similar to the lac operator O1. Figure 7B shows the energy logo of PurR, which is consistent with the known motif with a 2-bp spacer. The energy values at position +2 are in the same order as in previous measurements, but somewhat lower in magnitude (Glasfeld et al. 1999). But the previous measurements were made on symmetric double mutants so our smaller observed energies could be due to superadditivity of double mutants, as observed for the lac repressor. All randomized PurR sites with the 3-bp spacers had much lower affinity, indicating that it does not have the flexibility of the lac repressor to bind to extended sites (all data available in Table S3). The highestaffinity site in the 3-bp spacer library, ACGCAAA-CGGTTGCCGT (variable positions underlined), has an energy of 1.0 kT but is best explained as a G substitution at position +2 with a 2-bp spacer. Most of the highest-affinity sites from this library are best explained by substitutions to the 2-bp spacer. The highest-affinity site that is best explained as a 3-bp spacer, ACGCAAA-CGG-TTTGCGT, has an energy of 3.3 kT, indicating a large cost to binding in an extended conformation and no alternative sequence preference. The preferred site for YcjW is also symmetric (Figure 7C), contrary to the predicted consensus. But this protein is quite nonspecific with .50 different sequences binding within 1 kT of the optimal site (all data available in Table S3). Again most of the highest-affinity sites in the 3-bp library can best be explained by variations to a 2-bp spacer site, and the highest-affinity site that is best explained with a 3-bp spacer, TTCATGGGAC-CGC-GTCCCACGGA, has an energy of 1.2 kT, also indicating a substantial cost to binding in an extended conformation. The flexibility to bind to operators with variable length spacers, and with alternative motifs depending on the spacer length, is not a general characteristic of the LacI/GalR family and may even be unique to the

lac repressor protein itself. In fact, previous experiments found that most variants of the lac repressor that bind to L2L9 with high affinity lose high-affinity binding to O1 (Daber and Lewis 2009). PWMs based on regression analysis of the PurR and YcjW data are provided in Figure S4.

Discussion Lac repressor specificity

The NMR structure of the lac repressor bound to the O1 operator shows the asymmetric arrangement of the protein with each half-site (Kalodimos et al. 2002). The hinge helix from each monomer sits in the minor groove where Leu56 inserts between the CG dinucleotide at operator positions 21/0. The monomer interacting with the R half-site is extended and rotated, compared to the monomer interacting with the L half-site, because it must reach across positions 0 and +1 to fit into the major groove where the recognition helix of the protein interacts with positions +2 to +10. Figure 5 shows our model of the interaction of the protein with 10 different combinations of L and R operator sequences with each of three different spacers, CG, CGG, and CCGG. A protein monomer can be in either a compressed or an extended conformation, which is determined by the spacer length. For the CG spacer, both monomers are in the compressed state, for the CCGG spacer both are in the extended state, and for the CGG spacer there is one of each. The color of the oval in each operator configuration indicates whether the sequence matches the preferred site for that state of the monomer, blue if they match and red if they conflict. The lowest-energy site, L2L9, is the compressed state of the protein interacting with its preferred binding sequence in both half-sites. Replacing one or both monomers with extended states, with the CGG and CCGG spacers,

Specificity by DNA Sequencing

1337

Figure 6 Predicted vs. measured variants binding energy distribution. (A) R2 library. (B) R4 library. (C) R3.2 library. (D) R3.1 library. The predictions in A–D are based on the single variants. Each symbol represents a specific sequence and the number of mismatches from the preferred sequence for that library is shown in the inset. For the R2 and R4 libraries the sites with two variants are distinguished by whether they occur on the same (0, 2) or opposite (1, 1) half-sites. For libraries R3.1 and R3.2 only variations in the half-sites, not the central base, are included.

respectively, and also replacing the sequences with the preferred sites for each state, L3R and R94R, costs a small amount of free energy, between 0.2 and 0.5 kT for each substitution; there is some context effect because replacing just one half-site is more costly than replacing both. But it is much more costly to have a mismatch between the conformation of the monomer and that of the binding site sequence. The extended state interacting with the L half-site sequence (shown as red, thin ovals in L3L9, L4L9, and L4R) costs 2.4 6 0.5 kT (again context matters to a small extent). The compressed state interacting with the R half-site sequence (red thick ovals in L2R, R92R, and R93R) costs 3.7 kT, except that two such substitutions (in R92R) hit the low-affinity plateau of 4.6 kT. There is one large exception to this model, the site R93L9, which should have both types of conflicts but instead has a cost of 3.4 kT. We think this is best explained with the hinge helices sitting over the GG at positions 0/+1 instead of over the CG at positions 21/0 (note the red hinge helix). This site is equivalent to the O1 operator with the central G changed to C whose energy was measured in both R3 libraries (Figure 4B). Our model of L3R (the O1 operator) is equivalent to the model based on the NMR structure (Kalodimos et al. 2004b), and our measurements of binding energy differences are consistent with the subset of previously published measurements (Spronk et al. 1999). But in addition to comparing those 10 variations of L, R, and spacing, we have

1338

Z. Zuo and G. D. Stormo

measured energy differences over a collection of 2560 binding sites. We determined that among the 2048 variants with a 3-bp spacer that we examined, the natural operator has the highest affinity. Furthermore we can address questions about the additivity of variant energies. Figure 6 shows that in each library the energies of many binding sites with multiple variations are poorly predicted from the sums of the energies of the individual variations. For the R2 library the deviations are nearly always underpredicted, where the actual change in binding energy exceeds the sum from the individual changes. This can be explained if the binding to the L half-site is a very tight interaction with many favorable contacts and very low entropy. A single variation to the sequence loses some favorable contacts but that free energy increase is somewhat offset by an increase in entropy; even a single variation allows the protein some flexibility in the interaction. When two variations occur, they each lose some favorable contacts, but some of the entropy compensation has already happened with the single changes, leading to a larger increase in free energy than their sum. The effect is much larger for variations at positions 24/+4 than for variations at positions 22/+2. This is likely because those are more disruptive to the interaction of the recognition helix, being in its center rather than at the end, creating more entropy in the resulting complex. It is a bit surprising that the effect is more dramatic when two variations occur on both half-sites than if they both occur on the same half-site.

where additivity breaks down in general. For libraries R4 and R3.1, which vary the R half-site, there is a mixture of over- and underpredicted sites. The overpredictions are likely due to the protein adapting to multiple variants, making alternative contacts, so that the total free energy increase is less than the sum. The protein appears not to fit as tightly into the R half-site so that individual variants and combinations of variants are not as deleterious to binding as to the L half-site. The tightness of the L interaction may be partly due to the extra contact between the hinge helix and the HTH domain (Q54–N25) in the compressed conformation (Kalodimos et al. 2004b). Nonetheless, the total difference in binding energy between L2L9, L3R, and R94R is rather modest, suggesting that although the R half-site has fewer favorable contacts, it maintains a higher entropy than the L half-site interaction. Selection for an asymmetric binding site

Figure 7 Analysis of PurR and YcjW. (A) Consensus sequences and randomized libraries with 2- and 3-bp spacers for PurR and YcjW. (B) Energy logo for 2-bp PurR library. (C) Energy logo for 2-bp YcjW library.

This implies that the two monomers communicate across the dimer interface so that variations in the complex on one side of the operator influence the complex on the other side. The underprediction of energies for multivariant sites is most pronounced in the R2 library but is also seen in the R3.2 library, which also varies the L half-site. It is less obvious because all of the double variants include at least one of the highly specific positions, from 23 to 25, and most of them include combinations of those positions. This puts most of the multivariant sites into the nonspecific plateau range

Our current work shows that among 3-bp spacer sites the asymmetric wild-type operator, O1, has optimal binding affinity. Although an alternative operator with a 2-bp spacer has a somewhat higher affinity, all of the known lac operators from many different species maintain the same asymmetric site (Novichkov et al. 2013). No other members of the LacI/GalR family are known to bind asymmetric motifs, and even most variants of the lac repressor that bind with high affinity to the L2L9 site do not bind with high affinity to O1 (Daber and Lewis 2009). It appears that both the lac repressor protein and the asymmetric binding site have been selected because they offer some advantage to the cell. It is possible that the asymmetric site has a kinetic advantage for binding, although that was not seen in in vivo singlemolecule studies where facilitated diffusion was observed but on rates were, if anything, slightly faster for the symmetric operator L2L9 (Hammar et al. 2012, 2014). It is possible that the asymmetric site facilitates looping of the DNA and some simulation studies have suggested that could be the case (Colasanti et al. 2013), but other studies have not observed facilitated looping with O1 (Boedicker et al. 2013). A third possibility is that the asymmetric binding site influences the accessibility of the inducer to its binding site and facilitates induction. The recognition helix in the extended conformation loses a contact with the hinge helix compared to the compressed conformation (Kalodimos et al. 2004b) and the allosteric effect of inducer binding is transmitted through the hinge helix to destabilize binding and has a diminished effect on the L2L9 operator (Falcon and Matthews 2001). In fact, the L2L9 operator, while increasing repression in vivo, also has reduced levels of expression under inducing conditions (Daber and Lewis 2009). Daber and Lewis also make another interesting point about the lac repressor binding with high affinity to an asymmetric site. While allowing the repressor to bind to sites with multiple spacings decreases its specificity compared to a protein that is restricted to a single spacing class, the specificity of the operator for different proteins is increased. Most bacterial transcription factors bind as dimers and prefer symmetric binding sites and so are

Specificity by DNA Sequencing

1339

unlikely to bind with high affinity to O1, even if they have a preference for either the L or the R half-sites. The unique (as far as we know) capability of the lac repressor to bind with very high (near maximum) affinity to an asymmetric site means that it has little competition from other repressors in binding to O1. That feature, combined with the very high induction capacity, can explain the selective advantage of the unusual lac regulatory system. A number of further experiments are suggested by our results. We have not varied the outer segments of the operator sequence, positions 210 to 26 and +6 to +10 or positions 21 and +1. Those are all known to contribute to the specificity of the lac repressor but are outside the asymmetric region of O1 that was the focus of this work. It would also be interesting to examine the specificity profile of some of the mutant lac repressors that have been obtained in various studies (Sartorius et al. 1989; Lehming et al. 1990; Markiewicz et al. 1994; Daber and Lewis 2009; Milk et al. 2010). Since a single lane of an EMSA gel is sufficient to determine the relative affinities of thousands of variant binding sites, many different proteins could be assayed in parallel and more thorough assessments of recognition models could be obtained. Furthermore it would be informative to study some of the proteins in more detail for biophysical properties of the interactions, such as how specificity is affected by ionic strength, temperature, or pH (Mossing and Record 1985; Frank et al. 1997), all of which can be accomplished efficiently using Spec-seq. Assessing the contributions of enthalpy and entropy to binding to the L and R half-sites independently would also be interesting because our results suggest the two conformations of the protein have a different mix of contributions. Spec-seq is rapid and accurate for determining specificity

The method of determining specificity by sequencing, Specseq, is similar to some previously published methods but has some distinct advantages. We first employed analysis of bound and unbound fractions in the study of the Mnt repressor protein, using restriction enzymes (Stormo and Yoshioka 1991; Stormo et al. 1993), which limited the number of positions that could be studied. We then used DNA sequencing performed on pooled DNA, not single sequences, so the signal was analog with limited range and higher uncertainty than we can now obtain (Fields and Stormo 1994; Fields et al. 1997). Furthermore, since we did not determine sequences of individual binding sites, but rather pooled collections, the analysis was forced to assume independent contributions at each position. While that method allowed us to obtain a reasonable PWM approximation of binding (Fields et al. 1997; Stormo and Fields 1998), it did not allow for the measurement of nonindependent effects. We then developed a method that used fluorescently tagged DNA binding sites, QuMFRA, so that the ratio of the bound and unbound fluorescent signals is proportional to relative binding affinities (Man and Stormo 2001; Man et al. 2004; Liu

1340

Z. Zuo and G. D. Stormo

and Stormo 2005a,b). This allows for the determination of relative affinity without assumptions of additivity, but only a few different sequences can be compared at a time. With four different fluorescent tags we were limited to comparing four different sequences in a single reaction. Spec-seq overcomes both types of limitations. It generates digital signals, the number of counts for each sequence in both fractions from which probabilities of each sequence in the bound and unbound fractions are obtained. This increases the range of signals and decreases the uncertainty compared to the initial sequencing method. And like QuMFRA it does not require the independence assumption and lets us assess the goodness of that approximation rigorously. By allowing for the simultaneous determination of relative affinity for thousands of sequences in a single reaction, and not requiring assumptions about independence, Spec-seq gives us a rapid and accurate method to determine quantitative specificity for any DNA-binding protein of interest. If one requires knowing absolute binding affinities, that needs to be determined only for one sequence along with the relative affinities for the all of the rest. The same approach could be used to determine the specificity of RNA-binding proteins with the complication that secondary structure as well as sequence has to be considered in the modeling. We chose to separate bound and unbound fractions using a standard EMSA followed by cutting out the bands for sequencing, but any method of separating the two fractions could be used. One could use simple filter binding, although background effects may be larger. One could also use an antibody to the protein, or to a tag attached to the protein, to pull out the bound fraction. One advantage of using EMSA is that there may be multiple shifted bands if the protein can bind in multiple copies or if one assays the binding of two or more proteins to the DNA library. For example, one could assay the binding of two proteins, A and B, by generating libraries that have variable binding sites for each and also have variable spacings between them. If the EMSA gel has four bands, one for each protein bound alone as well as the unbound and doubly bound DNA (e.g., Ng et al. 2012), one could obtain the specificity profile for each protein as well as the cooperativity as a function of the spacer DNA all in one experiment. While limited to a few thousand variants in any one reaction, by performing sets of reactions with different components varied, one can expect to get a very thorough view of the binding of the protein complexes to DNA. Several high-throughput methods have been developed in the last decade for determining protein–DNA interaction specificity (reviewed in Stormo and Zhao 2010). Compared to those methods Spec-seq has one significant disadvantage, but several valuable advantages. The disadvantage is that the number of sequences that can be assayed in a single experiment is in the thousands (but see alternative below), whereas PBM and related methods (Berger et al. 2006; Warren et al. 2006; Nutiu et al. 2011), SELEX-seq (Zhao et al. 2009; Zykovich et al. 2009; Jolma et al. 2010, 2013; Wong et al.

2011), and B1H (Meng et al. 2005; Christensen et al. 2011; Gupta et al. 2014) can assay millions, or more, of sequences simultaneously. But those methods do not measure binding affinity directly, but rather something related to it such as binding site counts in SELEX-seq and B1H or fluorescence intensity in PBM. Estimating binding affinities requires fitting those measurements to some model. A variety of methods have been developed, but the best usually involve a biophysical model of the binding interaction and how that relates to the measured quantities. For example, the method of “binding energy estimates by maximum likelihood” (BEEML) (Zhao et al. 2009) uses SELEX-seq data for both the selected binding sites and the input (or prior) DNA and estimates the binding energies for individual sequences from the following biophysical model, PðSi jBÞ 1 } ; PðSi jIÞ 1 þ eEi 2m where P(Si|B) is the probability of sequence Si in the bound fraction and P(Si|I) is that in the input DNA (which could be from any round of SELEX). Ei is the energy value wanted and m is the chemical potential that is related to the protein concentration. Similar approaches are typically used for PBM and B1H data as well, although details vary, depending on the experimental approach (Stormo 2013). Ei is always modeled as some function of the sequence, such as a PWM, although more complex models, including k-mer and shape contributions to binding energy, can also be used (Slattery et al. 2011; Zhao et al. 2012; Gordan et al. 2013; Mathelier and Wasserman 2013; Stormo 2013; Weirauch et al. 2013; Orenstein and Shamir 2014). Determination of relative binding energies, compared to a reference sequence with binding free energy defined as 0, can be obtaining by fitting to PðSi jBÞ PðSref IÞ 1 þ e2m ¼ : 1 þ eEi 2m PðSref BÞ PðSi jIÞ Note that the protein concentration does not cancel out and estimates of all parameters, including the chemical potential and the parameters of the energy function, require nonlinear regression methods. In contrast, with Spec-seq the relative binding energies are obtained from the ratios of each sequence in the bound and unbound fractions from PðSi jBÞ PðSref UÞ ¼ e2Ei ; PðSref BÞ PðSi jUÞ where P(Si|U) is the probability of sequence Si in the unbound fraction (see expanded theory description in File S1). No energy model, such as a PWM, is needed nor is any regression required. We measure exactly what we want, relative binding affinities (and energies by taking

the logarithm). Of course we can use the measured binding energies to assess how well various models fit the data, as we described in Results. Spec-seq can provide accurate results and does not require any complex mathematical modeling or any assumptions about the recognition process. Given a consensus sequence one can generate whatever variations one wants to test, often covering the vast majority of likely binding sites within a reasonable range of relative affinities. For example, for a 10-long binding site there are only 3676 sequences with no more than three mismatches to the consensus, a number that could be assayed in a single Spec-seq run. If one starts with no information about the specificity of the protein, one could do two stages: first, a general method such as PBM, SELEX-seq, or B1H, to get a consensus sequence followed by, second, accurate quantitation of relative affinities with Spec-seq. Alternatively one could perform a standard SELEX procedure, with a very diverse library, but sequence both the bound and unbound fractions. If the number of sequences is large, not all of them will be counted in both fractions, and for the missing ones one could obtain only an estimate (upper or lower bound) on the relative affinity. But for every sequence that occurs in both fractions with sufficient counts, relative affinities are obtained and this could easily include all of the most relevant (highest affinities) binding sites. The results of Spec-seq are most similar to those of the MITOMI method, as is the number of sequences that can be assayed in parallel (Maerkl and Quake 2007). In the first description of the method Maerkl and Quake (2007) assayed four bHLH transcription factors, each to 464 different binding sites. The method requires construction of a microfluidic device with multiple chambers and control elements, the synthesis of the entire set of individual sites to be analyzed, and then the synthesis of proteins within the chambers, using in vitro transcription/translation protocols. In contrast, Spec-seq could obtain equivalent data (as relative affinities) for many more sequences from one lane of an EMSA gel for each protein. If absolute affinities are required, that needs to be determined for only one sequence. Parallel analyses of many different proteins can be obtained by simply running additional lanes in the EMSA gels and submitting each for sequencing with the appropriate barcodes to separate the data from individual experiments.

Acknowledgments We thank members of the G. D. Stormo laboratory for useful comments on the work and Rohit Pappu for stimulating discussions about the interpretation of some results. We thank the anonymous reviewers who provided helpful suggestions that improved the article. Funding for this work was from National Institutes of Health grants HG000249 and HG005970.

Specificity by DNA Sequencing

1341

Literature Cited Badis, G., M. F. Berger, A. A. Philippakis, S. Talukder, A. R. Gehrke et al., 2009 Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720–1723. Bell, C. E., and M. Lewis, 2001 Crystallographic analysis of Lac repressor bound to natural operator O1. J. Mol. Biol. 312: 921– 926. Benos, P. V., M. L. Bulyk, and G. D. Stormo, 2002 Additivity in protein-DNA interactions: How good an approximation is it? Nucleic Acids Res. 30: 4442–4451. Berger, M. F., A. A. Philippakis, A. M. Qureshi, F. S. He, P. W. Estep, 3rd et al., 2006 Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24: 1429–1435. Betz, J. L., H. M. Sasmor, F. Buck, M. Y. Insley, and M. H. Caruthers, 1986 Base substitution mutants of the lac operator: in vivo and in vitro affinities for lac repressor. Gene 50: 123–132. Boedicker, J. Q., H. G. Garcia, and R. Phillips, 2013 Theoretical and experimental dissection of DNA loop-mediated repression. Phys. Rev. Lett. 110: 018101. Bulyk, M. L., P. L. Johnson, and G. M. Church, 2002 Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 30: 1255–1261. Christensen, R. G., A. Gupta, Z. Zuo, L. A. Schriefer, S. A. Wolfe et al., 2011 A modified bacterial one-hybrid system yields improved quantitative models of transcription factor specificity. Nucleic Acids Res. 39: e83. Colasanti, A. V., M. A. Grosner, P. J. Perez, N. Clauvelin, X. J. Lu et al., 2013 Weak operator binding enhances simulated lac repressor-mediated DNA looping. Biopolymers 99: 1070–1081. Daber, R., and M. Lewis, 2009 Towards evolving a better repressor. Protein Eng. Des. Sel. 22: 673–683. Falcon, C. M., and K. S. Matthews, 2001 Engineered disulfide linking the hinge regions within lactose repressor dimer increases operator affinity, decreases sequence selectivity, and alters allostery. Biochemistry 40: 15650–15659. Fields, D. S., and G. D. Stormo, 1994 Quantitative DNA sequencing to determine the relative protein-DNA binding constants to multiple DNA sequences. Anal. Biochem. 219: 230–239. Fields, D. S., Y. He, A. Y. Al-Uzri, and G. D. Stormo, 1997 Quantitative specificity of the Mnt repressor. J. Mol. Biol. 271: 178–194. Foat, B. C., A. V. Morozov, and H. J. Bussemaker, 2006 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22: e141–e149. Frank, D. E., R. M. Saecker, J. P. Bond, M. W. Capp, O. V. Tsodikov et al., 1997 Thermodynamics of the interactions of lac repressor with variants of the symmetric lac operator: effects of converting a consensus site to a non-specific site. J. Mol. Biol. 267: 1186–1206. Gilbert, W., J. Gralla, J. Majors, and A. Maxam, 1975 Lactose operator sequences and the action of lac repressors, pp. 193–210 in Protein-Ligand Interactions, edited by B. Sund. De Gruyter, Berlin/ New York. Glasfeld, A., A. N. Koehler, M. A. Schumacher, and R. G. Brennan, 1999 The role of lysine 55 in determining the specificity of the purine repressor for its operators through minor groove interactions. J. Mol. Biol. 291: 347–361. Goeddel, D. V., D. G. Yansura, and M. H. Caruthers, 1978 How lac repressor recognizes lac operator. Proc. Natl. Acad. Sci. USA 75: 3578–3582. Gordan, R., N. Shen, I. Dror, T. Zhou, J. Horton et al., 2013 Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 3: 1093–1104.

1342

Z. Zuo and G. D. Stormo

Gupta, A., R. G. Christensen, H. A. Bell, M. Goodwin, R. Y. Patel et al., 2014 An improved predictive recognition model for Cys (2)-His(2) zinc finger proteins. Nucleic Acids Res. 42: 4800– 4812. Hammar, P., P. Leroy, A. Mahmutovic, E. G. Marklund, O. G. Berg et al., 2012 The lac repressor displays facilitated diffusion in living cells. Science 336: 1595–1598. Hammar, P., M. Wallden, D. Fange, F. Persson, O. Baltekin et al., 2014 Direct measurement of transcription factor dissociation excludes a simple operator occupancy model for gene regulation. Nat. Genet. 46: 405–408. Jacob, F., and J. Monod, 1961 Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3: 318–356. Jolma, A., T. Kivioja, J. Toivonen, L. Cheng, G. Wei et al., 2010 Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20: 861–873. Jolma, A., J. Yan, T. Whitington, J. Toivonen, K. R. Nitta et al., 2013 DNA-binding specificities of human transcription factors. Cell 152: 327–339. Kalodimos, C. G., A. M. Bonvin, R. K. Salinas, R. Wechselberger, R. Boelens et al., 2002 Plasticity in protein-DNA recognition: lac repressor interacts with its natural operator 01 through alternative conformations of its DNA-binding domain. EMBO J. 21: 2866–2876. Kalodimos, C. G., N. Biris, A. M. Bonvin, M. M. Levandoski, M. Guennuegues et al., 2004a Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science 305: 386–389. Kalodimos, C. G., R. Boelens, and R. Kaptein, 2004b Toward an integrated model of protein-DNA recognition as inferred from NMR studies on the Lac repressor system. Chem. Rev. 104: 3567–3586. Kim, J. B., G. D. Spotts, Y. D. Halvorsen, H. M. Shih, T. Ellenberger et al., 1995 Dual DNA binding specificity of ADD1/SREBP1 controlled by a single amino acid in the basic helix-loop-helix domain. Mol. Cell. Biol. 15: 2582–2588. Lehming, N., J. Sartorius, B. Kisters-Woike, B. von Wilcken-Bergmann, and B. Muller-Hill, 1990 Mutant lac repressors with new specificities hint at rules for protein–DNA recognition. EMBO J. 9: 615–621. Lewis, M., 2005 The lac repressor. C. R. Biol. 328: 521–548. Liu, J., and G. D. Stormo, 2005a Combining SELEX with quantitative assays to rapidly obtain accurate models of protein-DNA interactions. Nucleic Acids Res. 33: e141. Liu, J., and G. D. Stormo, 2005b Quantitative analysis of EGR proteins binding to DNA: assessing additivity in both the binding site and the protein. BMC Bioinformatics 6: 176. Maerkl, S. J., and S. R. Quake, 2007 A systems approach to measuring the binding energy landscapes of transcription factors. Science 315: 233–237. Man, T. K., and G. D. Stormo, 2001 Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 29: 2471–2478. Man, T. K., J. S. Yang, and G. D. Stormo, 2004 Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor. Nucleic Acids Res. 32: 4026–4032. Markiewicz, P., L. G. Kleina, C. Cruz, S. Ehret, and J. H. Miller, 1994 Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as “spacers” which do not require a specific sequence. J. Mol. Biol. 240: 421–433. Mathelier, A., and W. W. Wasserman, 2013 The next generation of transcription factor binding site prediction. PLoS Comput. Biol. 9: e1003214.

Mathelier, A., X. Zhao, A. W. Zhang, F. Parcy, R. Worsley-Hunt et al., 2014 JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 42: D142–D147. Meng, X., M. H. Brodsky, and S. A. Wolfe, 2005 A bacterial onehybrid system for determining the DNA-binding specificity of transcription factors. Nat. Biotechnol. 23: 988–994. Milk, L., R. Daber, and M. Lewis, 2010 Functional rules for lac repressor-operator associations and implications for proteinDNA interactions. Protein Sci. 19: 1162–1172. Mossing, M. C., and M. T. Record, Jr.., 1985 Thermodynamic origins of specificity in the lac repressor-operator interaction. Adaptability in the recognition of mutant operator sites. J. Mol. Biol. 186: 295–305. Motlhabi, L. M., and G. D. Stormo, 2011 Assessing the effects of symmetry on motif discovery and modeling. PLoS ONE 6: e24908. Ng, C. K., N. X. Li, S. Chee, S. Prabhakar, P. R. Kolatkar et al., 2012 Deciphering the Sox-Oct partner code by quantitative cooperativity measurements. Nucleic Acids Res. 40: 4933–4941. Novichkov, P. S., A. E. Kazakov, D. A. Ravcheev, S. A. Leyn, G. Y. Kovaleva et al., 2013 RegPrecise 3.0–a resource for genomescale exploration of transcriptional regulation in bacteria. BMC Genomics 14: 745. Noyes, M. B., R. G. Christensen, A. Wakabayashi, G. D. Stormo, M. H. Brodsky et al., 2008 Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133: 1277–1289. Nutiu, R., R. C. Friedman, S. Luo, I. Khrebtukova, D. Silva et al., 2011 Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29: 659–664. Orenstein, Y., and R. Shamir, 2014 A comparative analysis of transcription factor binding models learned from PBM, HTSELEX and ChIP data. Nucleic Acids Res. 42: e63. Parraga, A., L. Bellsolell, A. R. Ferre-D’Amare, and S. K. Burley, 1998 Co-crystal structure of sterol regulatory element binding protein 1a at 2.3 A resolution. Structure 6: 661–672. Royer, C. A., A. E. Chakerian, and K. S. Matthews, 1990 Macromolecular binding equilibria in the lac repressor system: studies using high-pressure fluorescence spectroscopy. Biochemistry 29: 4959–4966. Sadler, J. R., H. Sasmor, and J. L. Betz, 1983 A perfectly symmetric lac operator binds the lac repressor very tightly. Proc. Natl. Acad. Sci. USA 80: 6785–6789. Sarai, A., and Y. Takeda, 1989 Lambda repressor recognizes the approximately 2-fold symmetric half-operator sequences asymmetrically. Proc. Natl. Acad. Sci. USA 86: 6513–6517. Sartorius, J., N. Lehming, B. Kisters, B. von Wilcken-Bergmann, and B. Muller-Hill, 1989 lac repressor mutants with double or triple exchanges in the recognition helix bind specifically to lac operator variants with multiple exchanges. EMBO J. 8: 1265– 1270. Sasmor, H. M., and J. L. Betz, 1990 Symmetric lac operator derivatives: effects of half-operator sequence and spacing on repressor affinity. Gene 89: 1–6. Slattery, M., T. Riley, P. Liu, N. Abe, P. Gomez-Alcala et al., 2011 Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147: 1270–1282.

Spronk, C. A., G. E. Folkers, A. M. Noordman, R. Wechselberger, N. van den Brink et al., 1999 Hinge-helix formation and DNA bending in various lac repressor-operator complexes. EMBO J. 18: 6472–6480. Stormo, G. D., 2000 DNA binding sites: representation and discovery. Bioinformatics 16: 16–23. Stormo, G. D., 2013 Modeling the specificity of protein-DNA interactions. Quant. Biol. 1: 115–130. Stormo, G. D., and D. S. Fields, 1998 Specificity, free energy and information content in protein-DNA interactions. Trends Biochem. Sci. 23: 109–113. Stormo, G. D., and M. Yoshioka, 1991 Specificity of the Mnt protein determined by binding to randomized operators. Proc. Natl. Acad. Sci. USA 88: 5699–5703. Stormo, G. D., and Y. Zhao, 2007 Putting numbers on the network connections. BioEssays 29: 717–721. Stormo, G. D., and Y. Zhao, 2010 Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 11: 751–760. Stormo, G. D., T. D. Schneider, and L. Gold, 1986 Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res. 14: 6661–6679. Stormo, G. D., S. Strobl, M. Yoshioka, and J. S. Lee, 1993 Specificity of the Mnt protein. Independent effects of mutations at different positions in the operator. J. Mol. Biol. 229: 821–826. Takeda, Y., A. Sarai, and V. M. Rivera, 1989 Analysis of the sequence-specific interactions between Cro repressor and operator DNA by systematic base substitution experiments. Proc. Natl. Acad. Sci. USA 86: 439–443. Von Hippel, P. H., 1979 On the molecular bases of the specificity of interaction of transcriptional proteins with genome DNA, pp. 279–347 in Biological Regulation and Development, edited by R. F. Goldberger. Plenum, New York. von Hippel, P. H., and O. G. Berg, 1986 On the specificity of DNAprotein interactions. Proc. Natl. Acad. Sci. USA 83: 1608–1612. Warren, C. L., N. C. Kratochvil, K. E. Hauschild, S. Foister, M. L. Brezinski et al., 2006 Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl. Acad. Sci. USA 103: 867–872. Weirauch, M. T., A. Cote, R. Norel, M. Annala, Y. Zhao et al., 2013 Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31: 126–134. Wong, D., A. Teixeira, S. Oikonomopoulos, P. Humburg, I. N. Lone et al., 2011 Extensive characterization of NF-kappaB binding uncovers non-canonical motifs and advances the interpretation of genetic functional traits. Genome Biol. 12: R70. Zhao, Y., and G. D. Stormo, 2011 Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29: 480–483. Zhao, Y., D. Granas, and G. D. Stormo, 2009 Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5: e1000590. Zhao, Y., S. Ruan, M. Pandey, and G. D. Stormo, 2012 Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 191: 781–790. Zykovich, A., I. Korf, and D. J. Segal, 2009 Bind-n-Seq: highthroughput analysis of in vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res. 37: e151. Communicating editor: M. Johnston

Specificity by DNA Sequencing

1343

GENETICS Supporting Information http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.170100/-/DC1

High-Resolution Specificity from DNA Sequencing Highlights Alternative Modes of Lac Repressor Binding Zheng Zuo and Gary D. Stormo

Copyright © 2014 by the Genetics Society of America DOI: 10.1534/genetics.114.170100

File S1 Supplemental materials Expanded Spec-seq theory For a transcription factor TF binding to a DNA sequence Si, the reaction diagram is: on   TF + Si   TF ⋅ Si k

k

off

Si will be a collection of sequences in the discussion below, but for now just consider it one specific sequence because that is the way traditional binding constants are determined:

K A (= Si )

kon [TF ⋅ Si ] = ≡ Ki koff [TF ][ Si ]

(1)

We will use Ki for the association constant for sequence Si just to simplify the notation. In a traditional experiment to determine Ki one would vary the protein concentration and measure the fraction of Si that is bound to the protein, Fb:

Fb ( Si ) =

[TF ⋅ Si ] K i [TF ] P ( Si | B ) P ( B ) ( B | Si ) = = P= [TF ⋅ Si ] + [Si ] K i [TF ] + 1 P ( Si )

(2)

The last two parts just change to conditional (Bayesian) probability notation. P ( B | Si ) means “probability of bound given Si”, and P ( Si | B ) means “probability of Si given bound”. Those are related by Bayes’ rule as shown in the last two terms that include the non-conditional (prior) probabilities for being bound and for sequence Si. If there is only one sequence then this just shows the trivial relationship Fb(Si)=P(B), but when comparing multiple sequences this notation is critical for keeping track of everything. (Note: if we substitute Ei = − ln K i and µ = ln[TF ] then we can write this as:

= Fb ( Si )

K i [TF ] 1 = K i [TF ] + 1 1 + e Ei − µ

the Fermi-Dirac form of binding probability, relating it to free energy and chemical potential, µ .) We could also take the same reaction described above but instead of determining the fraction bound we could measure the fraction unbound, Fu:

= Fu ( Si )

[ Si ] P( Si | U ) P(U ) 1 (3) (U | Si ) = = P= [TF ⋅ Si ] + [Si ] K i [TF ] + 1 P ( Si )

And the ratio of the bound and unbound is:

2 SI

Z. Zuo and G. D. Stormo

Fb ( Si ) [TF  Si ] P ( B | Si ) P ( Si | B ) P ( B )         (4)    K i [TF ]   Fu ( Si ) P (U | Si ) P ( Si | U ) P (U ) [ Si ]

 

The second and third terms (in red) are just rewriting of equation (1) and we could have gotten there  directly, but this puts it in context of the normal experimental procedure which uses equation (2). Now  the interesting thing occurs when we want to know the relative affinities, the ratios of the K’s, for two  different sequences, Si and Sj. If we mix them together in the reaction, so it is the same as the reaction  diagram in equation (1) but it now has two different sequences and we can measure the fraction bound  of each of them separately (for example if they have different fluorescent labels), then:   

Fb ( Si ) Fu ( S j ) [TF  Si ] [ S j ] K P( B | Si ) P(U | S j ) P( Si | B) P( S j | U )   i       (5)  Fu ( Si ) Fb ( S j ) [ Si ] [TF  S j ] K j P( B | S j ) P(U | Si ) P( S j | B) P( Si | U )

The protein concentration in the middle term (in red) of equation (4) has dropped out since it is same  for both sequences (i.e. they are both competing for the same pool of protein), which eliminates the  need to determine the free protein concentration and we get what we want, the ratios of binding  affinities (red in equation 5). In fact the result is independent of the protein concentration as long as it is  in the range (i.e. not too high or too low) where none of the concentrations (or probabilities)  approaches zero. The last term is the key to the method because that describes what we actually  measure, the probabilities of each sequence in each fraction, bound and unbound. The normalizing  terms all drop out of the equation, which shows that we can get what we want, relative affinities, by  simply separating bound and unbound fractions and determining the relative proportions of each  sequence in each fraction. Of course there is nothing that limits this approach to two sequences, the  mixture can have any number of sequences and simply determining the probabilities of each sequence  in each fraction is sufficient to determining the relative affinities for every sequence.         

 

   

Z. Zuo and G. D. Stormo 

3 SI 

Primer sequences  PE1‐Genetics1:  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT NN  GATAGTCTCATTTTCACC  PE1‐Genetics2:  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNN  GATAGTCTCATTTTCACC    PE2‐reversed‐FAM:  /56‐FAM/TGCTGAACCGCTCTTCCGATCT    Randomized libraries sequences:  GATAGTCTCATTTTCACC AATTGTNANCGNTNACAATT AGATCGGAAGAGCGGTT (LacOR2)  GATAGTCTCATTTTCACC AATTGTGAGCNGNNNNCAATT AGATCGGAAGAGCGGTT (LacOR3.1)  GATAGTCTCATTTTCACC AATTGTTATCNGNNNNCAATT AGATCGGAAGAGCGGTT (LacOR3.2)  GATAGTCTCATTTTCACC AATTGTNANCCGGNTNACAATT AGATCGGAAGAGCGGTT (LacOR4)    GATAGTCTCATTTTCACC ACGCAAACGNNNNCGT AGATCGGAAGAGCACACG (PurR‐rand‐even)  GATAGTCTCATTTTCACC ACGCAAACGGNNNNCGT AGATCGGAAGAGCACACG (PurR‐rand‐odd)  GATAGTCTCATTTTCACC TTTCATGGGACCGNNNNCACGGA AGATCGGAAGAGCACACG (Ycjw‐rand‐even)  GATAGTCTCATTTTCACC TTTCATGGGACCGCNNNNCACGGA AGATCGGAAGAGCACACG (Ycjw‐rand‐odd)      Computational tools used for generating energy matrices and logos can be found on following links:  (http://stormo.wustl.edu/cgi‐bin/mkhan/motif_mlr.pl) for regression analysis  (http://stormo.wustl.edu/EnergyModel) for energy logos.   

4 SI 

 

 

Z. Zuo and G. D. Stormo 

  Fig. S1A lacI EMSA 

  Fig. S1B PurR EMSA 

  Fig. S1C YcjW EMSA    Figure S1   DNA fragments separation by EMSA. Each lane contains 100ng FAM‐labeled DNA fragments. From left to right, lane  1 is DNA‐only control, lanes 2‐6 have 2‐fold protein concentration increase per lane starting with about 50ng.  

 

 

   

Z. Zuo and G. D. Stormo 

5 SI 

A.  

PWM_R2 -4

-2

+2

+4

A

1.39 0.51 0.63 1.39

C

1.48 0.44 0

G T

0

0

0

0.38 1.48

1.38 0.57 0.51 1.40

  B. (position 0 is average value from R3.1 and R3.2) 

-5

PWM_R3

-4

-3

0

0.57 2.13

+2

+3

+4

+5

0

1.23

0

0

A

1.57 1.85

C

2.62 2.08 4.15 0.42 2.86 0.59 0.80 0.47 0.30

G

1.85

T

0

0

0

-2

2.25

0

0

0.17 1.25 0.41 0.60

1.81 2.32 0.53 3.63 0.99

0

0.41 0.53

  C. 

PWM_R4 -4

-2

+2

+4

0

0

A

1.30 2.13

C

0.95 0.27 1.38 0.75

G

0.80 1.47 0.10 1.05

T

0

0

2.09 1.26

  Figure S2   Energy position weight matrices (PWMs) from single variations to the preferred sequence, which is given an energy  of 0 for each library. Energy units are kT. These are all derived from the same experiment, #1 (in Table S1). 

   

6 SI 

 

 

Z. Zuo and G. D. Stormo 

Figure S3   Energy matrices, logos and fits to the data for each library from regression on variants with ≤  2 variants. In these matrices the sum at each position equals 0, to match the logos, shown below. 

A. Energy values from regression for R2 library. -2 +2 +4 PWM_R2 -4 A

0.70 0.29 0.56 0.69

C

0.61 -0.03 -0.88 -2.01

G

-1.99 -0.88 -0.02 0.59

T

0.68 0.62 0.33 0.73

B. Energy values from regression for R4 library -2 +2 +4 PWM_R4 -4 A

0.55 1.00 -0.84 -0.84

C

0.13 -0.66 0.53 0.13

G

0.17 0.56 -0.68 0.21

T

-0.85 -0.90 0.99 0.50

C. Energy values from regression for R3.1 and R3.2 libraries. Position 0 omitted. -4 -3 -2 +2 +3 +4 +5 PWM_R3 -5 A

0.52 0.49 -1.74 0.45 -0.34 0.36 -0.39 -0.82

C

0.90 0.58 0.96 0.30 0.20 0.04 -0.03 0.03

G

0.32 -1.36 0.43 -1.21 -0.19 0.47 0.26 0.55

T

-1.73 0.29 0.35 0.47 0.32 -0.88 0.16 0.24

  D. Energy logo for R2 PWM from regression.   

         

Z. Zuo and G. D. Stormo 

7 SI 

E. Energy logo from regression PWM for libraries R3.1 and R3.2.  

  F. Energy logo from regression PWM for library R4. 

  G. Observed vs predicted values using the regression PWM for the R2 library. For sites with ≤2  variants (the training data) r2 = 0.96. For all data r2 = 0.73. 

 

8 SI 

 

Z. Zuo and G. D. Stormo 

H. Observed vs predicted values using the regression PWM for the R4 library. For sites with ≤2  variants (the training data) r2 = 0.86. For all data r2 = 0.79.   

I.

  Observed vs predicted values using the regression PWM for the R3.2 library. For sites with ≤2  variants (the training data) r2 = 0.85 and for all data r2 = 0.48.  

  J.

Observed vs predicted values using the regression PWM for R3.1 library. For sites with ≤2  variants (the training data) r2 = 0.74 and for all data r2 = 0.55. 

     

Z. Zuo and G. D. Stormo 

9 SI 

   A. 

PWM_PurR

+2

+3

+4

+5

A

-0.46 0.35 0.48 -0.14

C

0.72 1.14 0.60 0.40

G

0.35 -0.43 0.59 -0.24

T

-0.62 -1.20 -1.66 -0.03

  B. 

PWM_YcjW

+2

+3

+4

+5

A

0.13 0.49 -0.20 0.44

C

-0.11 1.14 -0.48 -1.25

G

-0.16 -0.43 0.85 1.36

T

0.14 -1.20 -0.17 -0.56

  Figure S4   (A) Energy PWMs for PurR on regression for sites with ≤2 variants.  Fit to the training data is r2 = 0.90 and for all data  is r2 = 0.85. (B) Energy PWMs YcjW based on regression for sites with ≤2 variants. Fit to the training data is  r2 = 0.91 and for all  data is r2 = 0.88. 

                     

10 SI 

 

Z. Zuo and G. D. Stormo 

Tables S1‐S3  Available for download as .txt files at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.170100/‐/DC1    Table S1   Counts in bound and unbound fractions and ratios for each library in three different experiments.    Table S2   Relative energies from experiment 1 of Table S1. Relative binding energy = log(ratio(GAGCGCTC in library R2)) ‐  log(ratio(sequence)); Also shown are the number of mismatches compared to the preferred sequence of each library.    Table S3   PurR experiment 

   

Z. Zuo and G. D. Stormo 

11 SI 

High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding.

Knowing the specificity of transcription factors is critical to understanding regulatory networks in cells. The lac repressor-operator system has been...
2MB Sizes 1 Downloads 7 Views