How Many Numbers Are Required to Specify Sequence-Dependent Properties of Polynucleotides? ROBERT F. COLDSTEIN’ and ALBERT S. BENIGHT’

’Computer Center, m/c 135, and Department of Chemistry, and 2Department of Chemistry, m/c 1 1 1 , University of Illinois at Chicago, Chicago, Illinois 60680

SYNOPSIS

There are 10 unique dinucleotidesof double-stranded DNA, but only 8 independent nearestneighbor energies that occur in circular DNA, as shown by D. M. Gray and I. Tinoco [ ( 1970) Biopolymers 9,223-2441. We extend that analysis to include end effects, and show that the number of unique dinucleotide pairs (including ends) is 14, but there are only 12 independent energies. We discuss how these 12 energies (or spectra or any other pairwise additive property) can be measured and displayed, and how they should and should not be compared between experimenters. As an example, we analyzed the recently reported melting curves [ M. J. Doktycz et al. (1992) Biopolymers, 32, 849-864.1 of 16 DNA dumbbells in two different Na+ environments. This analysis reveals a new means for evaluating end effects and the emergence of longer than nearest-neighborinteractions at low salt concentration. 0 1992 John Wiley & Sons, Inc.

INTRODUCTION Many physical properties of polynucleotides, such as thermal stability or CD spectra, depend on the specific sequence of constituent nucleotide bases. As a practical matter, one would like to be able to predict this type of property for any given sequence, or conversely, infer the sequence from a measurement of the property. The question that naturally arises is, what is the minimum number of numbers (or energies or functions or spectra) needed to predict a property for any particular sequence, and how should these numbers be represented? Gray and Tinoco’ solved this problem for circular polymers (or polymers that are long enough so that end effects can be ignored) by introducing a “polymer approach.’’ They showed that the energy, for example, of any given sequence can be expressed as a linear combination of the energies of a selected basis set of long polymers. The number of polymers in this basis set depends on the number of unique monomers in the polymer, and on whether the basic interaction is nearest-neighbor, next nearest-neighBiopolymers, Vol. 32.1679-1693 (1992) 0 1992 John Wiley & Sons,Inc.

CCC 0006-3525/92/121679-15

bor, and so forth. Gray and Tinoco gave formulas for finding the number of polymers in the basis set and for constructing the set itself, although in general a given set is not unique. The polymer approach was contrasted with a “subunit” approach, in which one would construct a different basis consisting of the energy of every possible subunit, and in which one would compute a weighted average of these subunit energies, depending on the relative occurrence of those subunits in the polymer. For example, with nearest-neighbor interactions in double-stranded DNA, there are 10 unique subunits, namely AA/TT, CC/GG, AT, TA, CG, GC, AC/GT, CA/TG, AG/CT, and GA/TC. (In our notation, we use a slash to separate the complimentary strands of double-stranded DNA. In both strands, however, the left-to-right order corresponds to the 5’-to-3’ direction. If a strand is self-complimentary, such as AT, we simply do not list the opposing strand.) If one knew the energy of each subunit, and if the total energy of a polymer is just the sum of the energies of the substituent subunits (by definition of nearest-neighbor interactions ) , then these 10 numbers are sufficient for the calculation of the energy of any circular double-stranded DNA. Surprisingly, Gray and Tinoco showed that due 1679

1680

GOLDSTEIN AND BENIGHT

to constraints on the frequency of the subunits in a circular polymer (or for a long polymer where end effects are ignored), only 8 linear combinations of the 10 possible energies could ever be used. Therefore in these polymers, the nearest-neighbor interactions are expressed in terms of only 8 quantities, not 10. Furthermore, in the polymer approach, one only needed to construct and measure 8 specific long polymers, not 10 subunits. Thus it was claimed that the polymer approach, although complimentary to the subunit approach, had some experimental advantages, particularly when the number of subunits was large. In this paper, we will extend the Gray and Tinoco approach to polymers in which end effects are significant by introducing a fictitious new base called a “cut.” When a cut is inserted into a circular polymer, it becomes, in reality, a straight polymer with two ends. But we will treat it topologically as a circular polymer with a special new base. Using this artifice, we can exploit the previous results of Gray and Tinoco for circular polymers, with the addition of special rules for the cut base that provide new constraint equations. We will show, for example, that there are 12 independent energies and therefore also 12 independent polymers for double-stranded DNA. Although there are 14 apparent types of “subunits” within a DNA sequence, one need only measure 12 small double-stranded nucleotides ( 10 dinucleotides and 2 mononucleotides) . In this view in which end effects are explicitly included, the polymer approach and subunit approach differ only in the way information is presented; there is no practical difference. This means, in particular, that it is impossible to experimentally assign a unique energy to every conceivable subunit. For example, one simply cannot make a molecule that only consists of the TA subunit (there are always end effects), and one cannot make any number of molecules, the linear combination of which gives the TA energy without any other terms. We will also discuss a way, using the singular value decomposition ( SVD ) , 2 to extract the unique energies from experimental data, and to (nonuniquely) assign energies to all of the subunits, if desired. These nonunique energies can be used to calculate the energy of any given polymer, and are a convenient way to represent this ability. They cannot, however, be compared between experiments, or experimenters, or between experiments and theoretical calculations, due to the nonuniqueness. One can compare, for instance, the sum of the energies of T A and AT, which is unique, but not the energy of T A or A T separately.

Finally, we will analyze the melting of 16 DNA dumbbells3 as an extended example. This example will illustrate ( 1 ) the necessity of using a complete set of molecules, ( 2 ) the emergence of longer than nearest-neighbor interactions a t low salt, (3) the importance of using the full variance-covariance matrix for prediction of energies and errors of arbitrary sequences, and ( 4 ) the ability to measure end effects and to check the validity of the nearestneighbor model even in a regime where the leastsquares fit is poor (this last point is specific to molecules with ends capped with long fixed sequences).

MODEL OF NEAREST-NEIGHBOR AND HIGHER ORDER INTERACTIONS We assume a sequence-dependent property of a polymer can be expressed as a finite series,

where 0 ( k ) represents the kth-order interaction. That is, 0 is the singlet (single base-pair) interaction, 0 (’) is the doublet ( nearest-neighbor ) interaction, and so forth. The singlet and doublet interactions may have many physical origins, but higher than doublet interactions are likely to derive principally from electrostatic interactions. Therefore these higher order interactions will become less significant with increasing distance and increasing interaction order, and Eq. ( 1 ) should converge rapidly. As long as k,, is not too large, this model provides a reasonable and extensible way of interpreting sequencedependent data on polymers. If k,, is large, however, the total number of parameters needed becomes quite cumbersome, and a different description should be sought. A determination of k,, ultimately depends on experimental precision. Higher precision and improved experiments (such as experiments with larger numbers of molecules) will allow the experimenter to probe ever higher orders of interaction. For a given set of data, however, there will come a point where increasing k,, will not lead to a statistically significant increase in goodness of fit. This point can be determined by a sequential F test (see, for example, Draper and Smith4).And at this point, one simply cannot distinguish effects of higher order interactions without better experiments. Even if one has determined k,, , the uniqueness of the lower order k < k,, interactions must be con-

HOW MANY NUMBERS ARE REQUIRED?

sidered. In general, all or any part of the interactions at order k can be included in the ( k 1)-order interactions; the fitted parameters of 8 ( k ) and 8 ( k + l ) are not independent. For example, any energy one ascribes to each single type of base pair ( A / T or C/ G ) can always be included in the doublet interaction energies, without changing the quality of the fit to the doublet interactions. The total energy of a molecule can be experimentally measured, but the apportionment of that energy between singlet and doublet interactions may be largely arbitrary. (We will often use the term “energy” for concreteness, but energy is only one of many sequence-dependent properties that one might consider. Our general method will work equally well for any property with a reasonably small k,, .) Although a working definition of k < k,,,,, energies is thus arbitrary, we propose the following working definition that will allow different workers to compare results even if they arrive at different values of k,,,. In this method, one performs sequential fits to models with increasing order interactions. First, the singlet energies are determined by a weighted leastsquares fit to the experimental energies of each molecule. Then the doublet energies are determined by a weighted least-squares fit to the residuals from the singlet fit. In this way, the kth-order energies are systematically obtained by fitting to the residuals from the ( k - 1)th-order fit. By definition of k,,,, the final fit to the k,,,-order energies cannot be rejected due to lack of fit. But our working definition of the interaction hierarchy allows assignment of lower order interactions (with poor goodness-of-fit values) even if one does not have enough data to continue fitting up to k,, . Theoretically, the evaluation of singlet energies requires measurements on only two molecules because there are only two types of base pairs in double stranded DNA. But a good experimental design that minimizes the influence of experimental and roundoff errors will require measurements on a few more than the minimum number of molecules. Thus one could determine the singlet energies with measurements on, say, 5 or 6 molecules, which are not enough to determine doublet energies, even if the experimental precision is high enough to nominally reject the model due to lack of fit. Likewise, one could determine the doublet energies with measurements on 10-20 molecules, which are not enough to determine triplet energies, even if the fit is poor. The presumption is that the goodness-of-fit parameter would finally become acceptable when k,, is reached, if one had enough data, and that the lower order interactions are relatively insensitive to ex-

+

1681

actly which molecules are measured (provided the design matrix has maximal rank, as discussed below). This will generally work as long as the kthorder energies are much larger than the ( k 1)thorder energies, as seems to be the case for doublestranded DNA. So if one does not have enough data to determine k,, , the fitted parameters at lower order can be compared with other experiments, but any attempt to use these parameters to predict energies of independent sequences is suspect.

+

INTERACTIONS WITH ENDSINTRODUCTION OF THE FICTITIOUS BASE PAIR End effects can be analyzed by introducing a fictitious base pair into double-stranded DNA, which we call a cut, and denote by E/E. This artifice allows us to treat linear DNA as if it were circular. In this ATCG notation, for example, the polymer TAGC is cir~

ATCGE EATCGE or is linear, and is TAGCE ETAGCE TCGEA exactly the same as AGCET‘ We also introduce the notation Niand 4 . The subscript i stands for the type of nucleotide subunit, such as A / T for singlet fitting or AC/GT for doublet fitting. Ni is the number of the type i subunits that occur in a given molecule and 8i is the additive property (such as CD spectra or interaction energy) attributable to each type i subunit within the polynucleotide. The additivity assumption is that the 8 property ( a t a given order) of the molecule can be expressed as 8 = C iNiei. (Actually, this equation should have a k index to indicate which order interaction is being considered. We will suppress that index for notational convenience, with the understanding that if i refers to a singlet-type subunit, then Ni refers to the number of singlet subunits, Bi refers to the singlet energy for that subunit, and so forth for the higher order interactions.) Although a cut would naturally be interpreted as the solvent around an exposed end, it could actually be interpreted as any type of end cap, such as a hairpin loop or long section of constant sequence DNA. (Experimental advantages of such an unusual end will be discussed later.) In the following treatment, however, we will assume there is only one type of cut being considered at any time. If we treat a cut exactly the same as a third type of base pair (denoted as, say, X / Y ) ,then in addition to the 10 unique pairs of real bases ( ~ A A / T ~O,C C I G G , cular, but

1682

OAT,

GOLDSTEIN AND BENIGHT

OCA/TG* OAGICT, and there are 11pairs that involve the end, or cut, namely BAXIYT, ~ X T / A Y , ~ T X / Y A , BXAITY, OCX/YG, ~ X G I C YOGX/YC, , ~ X C I G Y ,OXX/YY, ~ X Y and , ~ Y X Of . these 21 subunits, Gray and Tinoco showed that only 18 linear combinations of the energies are independent. That is because there are three constraints, one for each base pair, namely that the number of bases following a given type (A/T, for example) equals the number preceding it. In particular, the three constraints are ~ T A , ~ C G , ~ G C ,OAC/GT,

&A/TC),

Of course the cut is not like any other base pair. Two cuts next to each other are not allowed so there are three additional constraint equations:

Of the remaining 8 subunits that involve cuts, there are four equivalence relations, because the X and Y bases are equivalent, and will henceforth be called “E,” for end. The new relations are Table I

AX/YT

=

XT/AY = AE/ET

(8)

TX/YA

=

XA/TY = TE/EA

(9)

CX/YG

=

XG/CY = CE/EG

(10)

GX/YC

=

XC/GY = GE/EC

(11)

Thus the 21 types of subunits that arise when including the cut are reduced to 14, and the corresponding 18 independent energies are reduced to 11 by virtue of the 7 additional constraints due to the special nature of the cut. However, these 7 additional constraints turn one of the original Gray and Tinoco constraints [ Eq. ( 2 ) ] into a tautology. Due to the assignment of X / Y = E / E , Eq. ( 2 ) is linearly dependent on Eqs. (8)-(l l ) .So in this special case, this Gray and Tinoco constraint is not independent-the final number of independent constraints is 6, and the number of independent energies is 12. One representation of these 12 linear combinations is shown in Table I for subunit counts Ni and correspondingly in Table I1 for the energies Oi . The first 8 energies are needed to compute the energy of an arbitrary circular polymer or to compute the difference between any two polymers with fixed ends (that is, any change in energy due to insertion, deletion, or mutation of real bases). The last 4 energies are needed to compute the energy of introducing a cut into a circular polymer, i.e., of computing the end energies in a straight polymer. We will refer to these 12 linear combinations as “canonical,” although any 12 linearly independent combinations of these canonical forms are equally acceptable. The only mathematical requirement for choosing these combinations is that they be linearly independent and that each Ni be orthogonal to the constraints in Eqs. ( 3 ) - ( 4 ) . We have chosen the forms in Table I to satisfy this mathematical requirement, and also so that the Ni are mutually orthogonal, so that they match Gray and Tinoco’ as

Twelve Canonical Linear Combinations of Subunit Counts

HOW MANY NUMBERS ARE REQUIRED?

1683

Table I1 Twelve Canonical Linear Combinations of the Subunit Properties (Such as Energies or CD Spectra) that Can Be Determined, in Principle, from Measurements

much as possible, and so that they possess a high degree of symmetry. Once the form of N Lare determined in Table I, the form of 19,in Table I1 is simply the pseudo-inverse of the matrix represented by N, in Table I. That is, if 8, is considered a vector, then it is chosen so that fi, 3, = a,, where 6, is 1 if i = j and 0 otherwise. This choice of 0, is necessary so that the total energy of a polymer can be written N,O, as discussed later in the sections on Data Analysis and the Example. In comparing the polymer and subunit approaches, Gray and Tinoco showed that to fully characterize a nearest neighbor property of any circular polymer, one would have to measure the 10 unique dinucleotides, but only 8 properly selected long polymers. But when we consider end effects explicitly, the differences between the polymer and subunit approaches vanish. That is, there are 12 doublet energies, which can correspond to 12 properly selected long polymers, or to the 12 unique smallest nucleotides, namely the 10 dinucleotides (EAAE/ETTE, EATE, ETAE, ECCE/EGGE, ECGE, EGCE, EACE/EGTE, ECAE/ETGE, EAGE/ECTE, EGAE/ETCE) and the 2 mononucleotides (EAE/ETE and ECE/EGE). Of course, what one normally thinks of as a “mononucleotide” really contains two doublet interactions between a real base pair and a cut on either side; the “dinucleotide” correspondingly contains three doublet interactions between the two real base pairs and the cut. Although it is true that each of these 12 small nucleotides does not represent a single subunit within a long polymer, it is clear that the only difference between the subunit and polymer approaches is the way information is displayed, not obtained. Any physical differences between the two ap-

-

c,

proaches (i.e., differencesbetween measurements on long or short polymers) will necessarily indicate the presence of longer than doublet interactions, which may be either sequence dependent or simply generally length dependent. And as a practical measure, even if one knows the highest order interactions are in fact nearest neighbor, one should measure more than the minimum number of necessary polymers, so that measurement errors are minimized and the final estimated parameters have greater precision.

DATA ANALYSIS One normally measures a fixed-order property (perhaps singlet or doublet) by measuring the behavior of many different polymers, and constructing the “best” set of parameters in a least-squares sense. This is equivalent to solving the linear equation (in a least-squares sense)

where 3 is the vector of energy parameters to be found, E is the vector of measured energies, one for each polymer. The u is a diagonal matrix, whose elements are the estimated errors of the experimental measurements. M is a matrix whose rows are labeled by which polymer is measured, and whose columns are labeled by subunit type or linear combination thereof. Thus Mij is the number of times the j t h type of subunit (or “canonical” linear combination) occurs in the ith polymer. For example, in singlet fitting, M would have 3 columns, one each for A / T , C/G, and E/E. In doublet fitting, M might have 12 columns, for the 12 canonical linear combinations shown in Table I, and so on for the higher order interactions. Gener-

1684

GOLDSTEIN AND BENIGHT

-

ally, M has more rows than columns; a-1 M is usually called the design matrix.' One possibility for doublet fitting uses an M with 14 columns, one for each unique type of doublet subunit, including ends. In this case, M is necessarily rank deficient, no matter how many rows it has or how many different polymers were measured. One simply cannot solve for all 14 values without additional constraints or assumptions. (If M has 12 columns, one for each canonical linear combination, then M may still be rank deficient if less than 12 molecules have been measured, or if there is too much degeneracy among the measured molecules. But SVD can be used to detect this circumstancethe rank of a matrix is equal to the number of nonzero singular values.) If M is not rank deficient, then any standard method can be used to solve Eq. ( 1 2 ) , such as Gaussian elimination, LU decomposition, GaussSeidel iterations, SVD, and so forth. All of these methods are equivalent to inverting M, and differ only in their efficiency and sensitivity to round-off error. If one chooses M with 14 columns (and is thereby rank deficient), a unique solution does not exist, but a useful solution can still be obtained by use of the SVD. In the process of using SVD to solve for 8, one computes the singular values of M and uses their inverses. A rank deficient matrix will have one or more zero singular values, but instead of dividing by zero, one sets the inverse of these zero singular values to zero. This is tantamount to ?moving those linear combinations of energies from 8 that Eq. (12) could not be solved for in the first place. See Press et al.' for more details. The resulting 3 will minimize the square of the weighted residuals,

-

It can be used to compute the energy of any polymer in the original set, and by extension, any polymer at all, provided the original set was complete. It is just that the individual energies obtained in this manner cannot be physically interpreted. The canonical linear combinations of these energies, however, can be interpreted and compared-they are exactly the unique values one would have obtained if M had been constructed with 12 columns and without rank deficiency from the beginning. One can obtain error estimates on the fitted parameters Bi from the variance-covariance matrix C given by

Note that when the 12 independent parameters are extracted from data, the errors on those parameters tend to be highly correlated (see the example later). Thus if one uses these parameter estimates to compute the energy of a new sequence, and if an estimate of the error in this prediction is desired, the full variance-covariance matrix should be used in computing the error. That is, once the doublet parameters 8i and the variance-covariance matrix Cij have been determined from a weighted least-squares fit, the energy of a new sequence with subunit counts N iis predicted to be Ci NiBik NiCijNj.

ixi,

EXAMPLE We will illustrate the above techniques with an example analysis of thermal stability measurements on 16 duplex DNA sequences, linked on each end by a T, single-strand hairpin l00p.~The sequences all have common end caps ( a six base-pair duplex sequence GGATAC, with a T, hairpin loop), but the central set of base pairs are different for each molecule (see Figure 1 ) . Doktycz et al.3 used 17 such sequences measured a t four different Na' concentrations to evaluate the salt dependence of the nearest-neighbor interactions (without explicit consideration of possible end effects). Our aim in this paper, however, is to examine the applicability of a more general analysis method, which includes end effects, rather than to reanalyze any one set of data. In that regard, we have selected 16 of these sequences measured in two different Na' concentrations to best illustrate the techniques involved. The 16 sequences, the melting temperatures, and the stabilization energies are shown in Table I11 for both 115 and 25 m M Na' concentrations. We have assumed the stabilization energies are directly re-

T Figure 1. Example of one of the 16 DNA dumbbells. The dumbbell with the unique central sequence TTAA, enclosed by the box, is shown. The sequences of the 15 other molecules differ only in their unique central sequences (see Tables 111-VII, XI, and XIII). That is, the 5'-GGATAC-3'duplex sequences and T4hairpin loops on both ends are the same in all molecules.

HOW MANY NUMBERS ARE REQUIRED?

1685

Table I11 Melting Temperatures and Energies tor 16 Sequences, Measured at Two Different Na+ Concentrations* 25 m M Na+ tm

115 mM Na+ tm

No.

Central Sequence

(“C)

Energy (cal/mol)

(“C)

Energy (cal/mol)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

ATAT TATA TTAA TTTT/AAAA ATAC/GTAT ATCA/TGAT ACAC/GTGT C ACA/TGTG GAGA/TCTC CTCT/AGAG GCAC/GTGC CGGA/TCCG CCCC/GGGG GCGC CGCG ACACAC/GTGTGT

76.47 75.16 75.31 76.78 78.84 79.00 82.77 81.39 81.55 80.71 85.60 82.91 85.17 89.71 87.44 82.89

-138669 -138149 -138209 -138792 -139610 -139673 -141169 -140621 -140685 -140352 -142292 -141225 -142121 -143923 -143022 -158869

83.52 82.22 82.72 84.14 86.15 86.25 89.44 88.29 88.43 87.67 92.85 90.64 92.38 96.18 94.18 89.87

-141467 -140951 -141149 -141713 -142510 -142550 -143816 -143359 -143415 -143113 -145169 -144292 -144982 -146490 -145697 -161985

a

The estimated experimental error on each T,,, is 0.14”C.

lated to the melting temperatures by AH = NAST,,,, where A S = 24.8 cal/K and N is the total number of duplex base pairs. This relation depends on various assumptions and approximations3that will affect the physical interpretation of the final results. In this paper, however, we are less interested in assumptions underlying the melting curves and more interested in reliable ways of determining the parameters of additive properties, such as CD spectra or stabilization energies. For this example, then, we will simply take the stabilization energies in Table I11 as given, and not attempt to question or justify the underlying thermodynamics. The first step is to fit the singlet energies to this data, which can be done using two different models, namely with and without the hairpin represented as a special base pair, E/E. The first model recognizes that the hairpin may somehow contribute to the stability of the molecule a t the singlet interaction level, even though its hydrogen-bonding capacity is not engaged in the usual Watson-Crick pairing. Variations on this model may include one or more adjacent Watson-Crick pairs as part of the special base E / E , with the expectation that bases next to the hairpin may have significantly perturbed hydrogen bonds. As it turns out for this data, it does not matter a t all if these base pairs are included in E /

E or not; the BAIT and O C / G energies are completely unaffected. This model is fit to three explicit parameters: 8 A / T , 8c/c, and 8E/E. The second model uses the entire sequence of Watson-Crick base pairs, but simply ignores the presence of the hairpin, since it has no hydrogen bonds to melt. The assumption is that OE/E = 0, and therefore there are only two fitted parameters in this model. It is important to note that the decision of which model to use will affect the interpretation of the singlet and doublet energies, but it will not affect the prediction of energy for any sequence based on a consistent set of combined singlet/doublet energy parameters. The splitting of energies into singlet and doublet (and higher order) is mathematically arbitrary, but the total energy of a molecule (measured or predicted) is well defined. M for the singlet fit is shown in Table IV. The parameters and residuals are shown in Table V, and are marked “E” where the E / E base refers to the hairpin itself, and “no E” where the hairpin is ignored. These results show that DNA base pairs are more stable in high salt than low salt, and that C/ G base pairs are more stable than A/T, as expected. What is new is that the hairpin itself seems to affect the thermal stability by a small amount in a salt-

1686

GOLDSTEIN AND BENIGHT

Table IV M from 16 Sequences for Singlet Fitting, with Explicit Ends and Inclusion of all HydrogenBonded Base Pairs in the End Caps”

Ni No.

Central Sequence

A/T

C/G

E/E

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

ATAT TATA TTAA TTTT/AAAA ATAC/GTAT ATCA/TGAT ACAC/GTGT CACA/TGTG GAGA/TCTC CTCT/AGAG GCAC/GTGC CGGA/TCCG CCCC/GGGG GCGC CGCG ACACAC/TGTGTG

10 10 10 10 9 9 8 8 8 8 7 7 6 6 6 9

6 6 6 6 7 7 8 8 8 8 9 9 10 10 10 9

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

This matrix has 3 nonzero singular values.

Table 7 Residuals and Parameters from Singlet Energy Fitting (Here, a Residual Energy Is t,.e Energy of a Molecule After the Best Fit Singlet Energy Has Been Subtracted) Residual Energies (cal/mol) 25 mM Na+

No.

Central Sequence

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

ATAT TATA TTAA TTTT/AAAA ATAC/GTAT ATCA/TGAT ACAC/GTGT CACA/TGTG GAGA/TCTC CTCT/AGAG GCAC/GTGC CGGA/TCCG CCCC/GGGG GCGC CGCG ACACAC/GTGTGT 0AfT 0C/G

%EfE

E

115 m M Na’

No E

E

No E

-211.23 308.57 249.05 -334.24 -21.54 -85.03 -450.86 96.71 33.22 366.53 -443.70 623.68 857.02 -944.45 -43.71 -0.00

-183.76 336.04 276.52 -306.77 7.88 -55.59 -419.46 128.11 64.62 397.94 -410.34 657.04 892.34 -909.12 -8.38 -525.46

-143.54 372.29 173.89 -389.56 -86.20 -125.88 -290.74 165.57 110.02 411.58 -542.90 334.02 744.51 -763.32 30.27 -0.00

-118.11 397.72 199.32 -364.13 -58.95 -98.63 -261.68 194.63 139.08 440.65 -512.02 364.89 777.21 -730.62 62.97 -486.34

-8510.26 -9640.37 2243.18

-8230.85 -9362.91 0.00

-8679.40 -9780.93 2076.17

-8420.78 -9523.53 0.00

a Columns marked “E” used a special end base for the hairpin loops; columns marked “no E” used the 6 G/C and 6 A/T composition of the end cap explicitly, but ignored the hairpin loop.

HOW MANY NUMBERS ARE REQUIRED?

dependent way. In other words, if the hairpin is ignored, the average stabilities of A / T and C / G must be correspondingly adjusted, Once the singlet energies are determined (with or without the hairpin end cap represented by E / E ) ,we can proceed to determine the doublet (nearest-neighbor stacking) interactions by fitting the residuals of the singlet fit. M for the doublet fit is shown in Table VI. This matrix has 14 columns, one for each unique nearest-neighbor subunit, but it is rank deficient-it has only 12 finite singular values, and therefore the 14 resulting doublet energies will not be unique. These 14 energies can, however, be used to find the 12 unique canonical energy values, and can be used by themselves to predict thermal stabilities of independent DNA sequences. Although only 12 of the sequences are actually necessary (because the maximal rank of M is 12 ) , measurements on the additional sequences improve the computed results by reducing the effects of measurement and roundoff errors. If 12 sequences were to be chosen, however, they should not be chosen at random. If sequence 16 were left off, for example, the remaining matrix has 11 nonzero singular values. The energies computed from such a matrix may well reproduce the 15 measured energies, but they might not be reliable for predicting the energies of other independent polynucleotides, such as sequence 16. One way to deal with the rank deficient matrix in Table VI is to transform the columns to the 12 canonical linear combinations as shown in Table VII. This matrix is not rank deficient (although it would be if sequence 16 were left off), and any standard method of linear least-squares fitting would work. Alternatively, one could directly solve Eq. (12) for the ( nonunique ) dinucleotide subunit energies with the matrix in Table VI by using the SVD as discussed in the data analysis section. The 12 canonical linear combinations of energies formed from the 14 dinucleotide subunit energies will be exactly the same as those obtained from the 12 column matrix (Table VII). But to emphasize that these dinucleotide energies are not unique and should not be directly interpreted, we list two different sets of dinucleotide subunit energies (with very different values) in Table VIII, both of which give the same energies calculated according to Table 11. For example, for e3 = $ ( O A T O T A ) ; then from both sets 1 and 2 in Table VIII, OAT OTA = 492.48. Therefore these two sets are indistinguishable when used to compute the energy of any double-stranded DNA sequence. To demonstrate that the SVD technique should not be applied blindly, however, we show in Table

+

+

1687

IX the doublet parameters and the energies of the 16 sequences ( a ) measured experimentally, ( b ) computed from the fit to all 16 rows of M from Table VII, and ( c ) computed from the fit to the first 15 rows of M. Both fits correctly reproduce the first 15 experimentally measured energies, and do so equally well. But because the 15 row matrix has only 11 nonzero singular values, the energies so obtained do not represent all the values (or even the correct values) necessary to predict the energies of other sequences, such as sequence 16. Note that the parameters in the column labeled “Fit from 16” are exactly the same as would be obtained from substituting either column from Table VIII into the linear combinations of Table 11. In contrast, the parameters in the “Fit from 15” column are clearly different. The differences in doublet parameters between columns at the bottom of Table IX represent physical differences in energies, but the differences between columns in Table VIII are purely cosmetic. We thus recommend that experimental results be reported and compared in the canonical form of Table IX. Even though a given set of parameters can fit a set of experimental data, this alone does not prove that this parameter set can independently predict the energy of any other polymer. For this to be true, it is also necessary that the original experimental set contain a complete set of subunits, i.e., all 12 linearly independent combinations must be included. This completeness can be verified by counting the number of nonzero singular values. Furthermore, if the rank of M is reduced by removal of any single molecule (i.e., the deletion of a single row), then there exists a linear combination of Oi that is determined by experiments on a single molecule and is highly sensitive to any error in these measurements. For a more precise estimation of Oi , an experimenter should make sure that removal of any single molecule (or preferably a larger number of molecules) does not reduce the rank of M. The variance-covariance matrix from the fit to 16 molecules (whose energy parameters were shown in Table IX) is shown in Table X. Note that the magnitudes of many of the off-diagonal elements are similar to the corresponding diagonal elements. That means the estimated errors on the Oi are highly correlated. As an example, consider using the parameters from Table IX and the variance-covariance matrix from Table X to predict the energy of EATCCGATE. One first determines the subunits counts N . and then applies the formula Ci NiOi k For this particular sequence, the

m.

~~

5 6 7 8 9 10 11 12 13 14 15 16

4

1 2 3

No.

~

~~~~~

1 1

2 1

AT

1

1 2 1

TA

2 3

AA/TT

This matrix has 12 nonzero singular values.

~

ATAT TATA TTAA TTTT/AAAA ATAC/GTAT ATCA/TGAT ACAC/GTGT CACA/TGTG GAGA/TCTC CTCT/AGAG GCAC/GTGC CGGA/TCCG CCCC/GGGG GCGC CGCG ACACAC

Central Seauence

1

2

2 1

1

3

CA/TG

1 1 2

1

AC/GT

1

2 1

I

1 2

CT/AG

1 2

1

CG

2 1

1

GC

Number of Doublet Subunits TC/GA

Table VI Rank Deficient M from 16 Sequences with 14 Columns"

1 3

GG/CC

1

1

1 1 1 1

2

EA/TE

1

1 1

1

2 2 1

ET/AE

1

1 2

2

1

1

1

EG/CE

2

1 1

1

1

EC/GE

5

30

B

m

3

5

2

8

L!

P

0

HOW MANY NUMBERS ARE REQUIRED?

Table VII

1689

M from 16 Sequences' Number of Canonical Subunits

No.

Central Sequence

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

ATAT TATA TTAA TTTT/AAAA ATAC/GTAT ATCA/TGAT ACAC/GTGT CACA/TGTG GAGA/TCTC CTCT/AG AG GCAC/GTGC CGGA/TCCG CCCC/GGGG GCGC CGCG ACACAC/GTGTGT

a

Nl

N2

N3

N4

N6

3 3 1

2 3

2 1

1 1 3 3

1

3 3 1 1

1 3

2 1

3 3

N7

N8

N9

1 -1 -1

1 -1 -1

2 2 2 2 1 2 1 1 1 1

3

2 -2 -1 3 -1 1

5

-2 3 -2 2

1 -1 1 -1 -2

1

1

N,,,

N,,

N,, 14 -14 -14

1

-7

1

-7 7 -7 7 -14 7

1 1 1 2 1 2 2 2 1

-14 14 -7

7 7 -7 -7 7 -7

7

There are 12 nonzero singular values.

numbers of doublet subunits of this sequence are Ni = (0, 1, 2, 1, 0, 2, 7, 1, 2, 0, 0, 14) and the resulting doublet energy is 323 k 161 cal/mol. (If we had

Table VIII Two Complete and Equivalent Sets of Doublet Interaction Energies, Both Derived from the Same Measurements on 16 Seqences" Subunit Energy

No.

Subunit

Set 1

Set 2

1 2 3 4 5 6 7

AT TA AA/TT AC/TG CA/TG TC/GA CT/AG CG GC GG/CC EA/TE ET/AE EG/CE EC/GE

146.930 345.554 86.832 175.820 123.362 176.126 339.519 72.985 -39.612 386.547 -391.061 -209.664 -383.123 -46.995

238.060 254.4 23 86.832 278.4 70 20.712 187.645 327.999 -41.184 74.557 386.547 -436.626 -164.098 -440.208 10.090

8 9 10 11 12 13 14

N5

The canonical linear combinations of these energies, determined according to Table 11, are identical for the two sets, while the individual doublet energies given here are clearly not equal.

ignored the correlation of 0 estimates by only using the diagonal values of Ci,,the estimated error for this sequence would have been 268 cal/mol.) Now consider a different sequence, EATGGATE. In this case Ni = (0, 1, 2, 0, 1, 1, 4, 4, 2, 0, 0, 14) and the resulting energy is 198 f 117 cal/mol. But if we were interested in the difference in energy between these two sequences, we need to recognize that the errors are still correlated. If we simply subtract the two energy predictions and estimate the predicted error as the rms of the two individual error estimates (which is appropriate only if they are uncorrelated), the difference is 125 f 199 cal/molapparently not a very precise prediction. But the correct method is to subtract the Ni first, and then recompute the difference and error directly. The difference in subunit counts is Ni = (0, 0, 0, 1, -1, 1, 3, -3,0,0,0, 0 ) ,and the predicted energy difference is 125 f 68 cal/mol, a substantial improvement due to the cancellations that occur when subtracting the Ni. The total molecular energy is obtained by adding the singlet and doublet values. But the error on this total energy is just the error on the doublet value. Due to the sequential nature of the fit and due to the fact that the parameters from different interaction orders are not independent, the error on the total energy is computed only from the last (highest order) fit.

1690

GOLDSTEIN AND BENIGHT

Table IX Top: Total Doublet Energy for Each Molecule, Either Measured or Computed from Fits to 16 or 16 Sequences";Bottom: The Doublet Parameters for the Two Fitsb

Energies (cal/mol) No.

Experimental

Fit From 16

Fit From 15

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

-143.54 372.29 173.89 -389.56 -86.20 -125.88 -290.74 165.57 110.02 411.58 -542.90 334.02 744.51 -763.32 30.27 0.00

-142.70 418.71 99.88 -340.22 -105.87 -154.30 -299.18 165.88 98.98 417.10 -506.67 378.99 729.52 -772.48 12.36 0.00

-142.70 418.71 99.88 -340.22 -105.87 -154.30 -299.18 165.88 98.98 417.10 -506.67 378.99 729.52 -772.48 12.36

81 82

63 04

85 86

07

8s 89

el0 ell 812

86.832 386.547 246.242 16.687 149.591 257.822 -34.401 -34.678 -300.362 -215.059 28.011 -15.116

-365.1 I -95.726 203.990 63.685 -165.871 -32.966 75.265 -34.401 -34.678 -26.526 58.777 28.011 -15.116

Molecules are numbered in the same order as the other tables. Both sets of parameters (shown a t bottom) give identical energies for the first 15 molecules (shown a t top), but very different energies for molecule 16 (shown in bold face). a

TEST OF THE NEAREST-NEIGHBOR APPROXIMATION The above analysis has properly dealt with the way to extract nearest-neighbor interaction parameters from experimental data. In practice, the question of whether nearest-neighbor interactions are entirely sufficient to describe the data in the first place should be asked. This is answerable by considering the goodness-of-fit parameter, and is described in standard books.',* As it turns out for the data in this example, however, there is an additional test one can perform.

That is because the "end' nucleotide used in these measurements was not the solvent exposed end, but rather a fixed long sequence. Thus molecule 1, for example, whose central sequence is ATAT, is actually part of a longer sequence, * CATATG . Instead of analyzing this as EATATE, as done in the earlier example, we can analyze it as CATATG, as done previ~usly.~ If the nearest neighbor approximation is correct, this reanalysis (essentially replacing EA with CA and so forth) should produce very little change in the internucleotide interactions, in particular in the 8 energy values (Oi, i = 1 8) that describe cire

---

-

- e

HOW MANY NUMBERS ARE REQUIRED?

Table X

1691

Variance-Covariance Matrix from Fit to 16 Molecules” 09

3880 1870 3570 1820 2230 2750 300 322 -5380 -2700 38 -23

1870 2460 1960 2150 1670 1990 -103 163 -2840 -3190 -5 -6

3570 1960 3720 1910 2260 2780 248 316 -5360 -2840 38 -31

1820 2150 1910 2360 1650 1980 -98 141 -2760 -3240 -4 -6

2230 1670 2260 1650 1690 1950 67 198 -3350 -2460 19 - 19

2750 1990 2780 1980 1950 2550 79 222 -4130 -2940 17 -19

300 -103 248 -98 67 79 182 4 -412 126 2 -3

322 163 316 141 198 222 4 136 -460 -208 5 -2

-5380 -2840 -5360 -2760 -3350 -4130 -412 -460 8020 4100 -55 46

-2700 -3190 -2840 -3240 -2460 -2940 126 -208 4100 4770 8 12

38 -5 38 -4 19 17 2 5

-55 8 5 0

-23 -6 -31 -6 -19 -19 -3 -2 46 12 0 4

~~

This is used to estimate errors on the 0, and on energy predictions for new DNA sequences.

cular or infinite polymers. But if higher than nearest neighbor interactions are needed, or particularly if end effects are significant, this reanalysis may produce a large change in the estimated interactions. By replacing the E base with C or G, the maximum rank of M is reduced from 12 to 9. And fitting the same data to a model with fewer parameters is a more stringent test of the nearest neighbor approximation. But the result does not provide much explicit information about end interactions. The M matrix for this new fit is shown in Table

XI. The matrix has rank 9-the 8 usual linear combinations that describe circular polymers, and a ninth (column Nll) that comes from “breaking” the equivalent circular polymer at a G -C bond and adding ends. If the broken bond were something other than G-C, the ninth column would be a different linear combination of Nll and N12from Table I. Note that Nll is constant for all molecules in this series. The corresponding 011 will then summarize all the implicit information about ends, and it is not

Table XI M from 16 sequencesa Number of Canonical Subunits

No. 1 2 3 4

5 6 7 8 9 10 11 12 13 14 15 16 a

Central Sequence CATATG CTATAG CTTAAG CTTTTG/CAAAAG CATACG/CGTATG CATCAG/CTGATG CACACG/CGTGTG CCACAG/CTGTGG CGAGAG/CTCTCG CCTCTG/CAGAGG CGCACG/CGTGCG CCGGAG/CTCCGG CCCCCG/CGGGGG CGCGCG CCGCGG

CACACACG/CGTGTGTG There are 9 nonzero singular values.

3 3 2 3

2 2 2 1

1

2 1

1

2 1

1 1

1

1 2

2 4

1 4 3 2

2 6

1 -5 -5 -2 1 1 1 -2 1 -2 1 1 1 1 1 1

5 -1 -1 2 -1 5 -1 2 -1 2 -1 -1 -1 -1 -1 -1

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

1692

GOLDSTEIN AND BENIGHT

Table XI1 The 12 Canonical Energy Values Obtained by Fitting to 16 Molecules With and Without Designating a Special “E” Base, for Two Different Salt Concentrations* Canonical Energies (cal/mol) 115 mM Na’

25 mM Na+

0,

With E Base

Without E Base

With E Base

Without E Base

205.389 463.388 327.583 11.874 189.431 305.376 6.608 -36.139 -439.608 -244.070 29.852 -18.113

141.362 467.303 265.595 22.131 162.569 277.619 -6.093 -49.424 0.000 0.000 -520.437 0.000

86.832 386.547 246.242 16.687 149.591 257.822 -34.40 1 -34.678 -300.362 -215.059 28.011 -15.116

51.973 395.700 214.320 23.045 141.237 234.4 23 -32.276 -35.038 0.000 0.000 -436.615 0.000

32 2 x 10-6

35 I x 10-5

4.9 0.29

7.7 0.36

a Both sets give similar values for the first 8 parameters, indicating that the nearest-neighbor approximation is reasonably good. The last four values of 0 differ due to different treatment of the ends.

Table XI11 The Residuals Obtained by Fitting Doublet Parameters to 16 Molecules With and Without Designating a Special “E” Base, for Two Different Salt Concentrations” Residual Energies (cal/mol) 25 mM Na’

No. 1 2 3

4 5 6 7 8 9 10 11 12 13 14 15 16

115 mM Na+

Central Sequence

With E Base

Without E Base

With E Base

Without E Base

CATATG CTATAG CTTAAG CTTTTG/C AAAAG CATACG/CGTATG CATCAG/CTGATG CACACG/CGTGTG CCACAG/CTGTGG CGAGAG/CTCTCG CCTCTG/CAGAGG CGCACG/CGTGCG CCGGAG/CTCCGG CCCCCG/CGGGGG CGCGCG CCGCGG CACACACG/CGTGTGTG

31.66 -78.07 106.79 -71.19 81.01 60.38 -72.00 -51.37 -130.25 65.12 114.37 135.00 -45.00 -31.25 -51.87

-39.07 -82.46 106.47 -70.98 97.53 63.07 -125.73 -8.37 -101.83 31.34 162.30 109.25 -36.78 -57.56 -47.17

-0.84 -46.41 74.00 -49.33 19.67 28.42 8.43 -0.31 11.03 -5.51 -36.23 -44.97 14.99 9.16 17.90

11.71 -62.70 63.58 -42.39 50.10 -10.82 -8.27 -9.50 19.74 50.13 -24.04 -78.80 9.14 -8.08 40.21 0.00

-0.00

0.00

-0.00

The residuals here are the energies after the best fit singlet and doublet energies have been subtracted.

HOW MANY NUMBERS ARE REQUIRED?

useful for predicting the energy of any other molecule with different ends. But the first 8 energies are useful for predicting the energy of any circular polymer or the energy difference between any two polymers with identical ends. The fitted energies are shown in Table XII, and the residuals from these fits are shown in Table XIII. These tables allow comparisons of the corresponding energies found when ends are explicitly included, for both high and low Na' concentrations. Also shown are the x 2 values and Q, the probability that a x 2this poor or poorer could be obtained by random chance alone. First of all, the value of Q at low salt concentration is quite small, indicating that the nearest-neighbor approximation is unable to account for the measured energies within the experimental error. At 115 m M Na+, however, Q $ 0.1, so the high salt parameters cannot be rejected due to lack of fit. Thus k,,, = 2 at 115 mM, but k,,, > 2 for 25 m M salt. Nonetheless, the first 8 energies at each salt concentration are relatively insensitive to the explicit consideration of ends, which lends credibility to the nearest-neighbor approximation. Furthermore, the value of Q actually decreases slightly when the ends are made explicit, indicating that the increase in goodness of fit due to increasing the number of fitted parameters is not significant. Thus the nearest-neighbor approximation appears valid a t high salt concentration. Even at low salt concentration, it seems that the poor fit is not due to special interactions with the end cap, but probably due to higher order interactions overall.

1693

trary sequence. Such a complete set of subunit energies can only be obtained from a matrix M of rank 12. (The rank can be checked by SVD). Alternatively, nonunique values may be assigned to the 14 subunits (possibly by using a SVD technique), and may be used for predicting the property in other nucleotides. In such cases, we prefer not to compare such sets between experiments, due to the nonuniqueness. Rather, one should first construct the more physically meaningful canonical combinations (Table 11) from the subunit values. By using fixed sequences for ends, such as DNA dumbbells, one can actually fit a nearest-neighbor approximation with 9 parameters, 8 of which are useful in computing energy differences between molecules with fixed ends. Further, fitting with 9 parameters (rather than 12) provides a more stringent test of the nearest-neighbor approximation. Although the determination of a complete basis of energy parameters requires a design matrix M of maximal rank (8, 9, or 12, depending on type of molecules and approximations employed), a highprecision estimation of these parameters will be enhanced by a somewhat more stringent requirement. That is, the rank of M should not decrease when any single row is deleted (or better yet, when any n rows are deleted, where n is as large as practical). This prevents the domination of the parameter estimates by too few measurements and their attendant errors. Our discussion has assumed nearest-neighbor interactions in double-stranded DNA for concreteness, but the inclusion of ends in the analysis can be extended to other polymers and other types of interactions without difficulty.

SUMMARY If a property of double-stranded DNA, including end effects, depends on nearest-neighbor interactions, there are 14 unique subunits, but only 12 independent linear combinations or 12 polymers for which the interaction property can (and must) be measured. These 12 values can be compared between experiments and experimenters. If a particular set of sequences possesses less than 12 linearly independent combinations, subunit energies can be calculated by SVD. But these energies are incomplete-they will correctly predict the measured energies of the original set (and possibly selected other sequences), but will not, in general, accurately predict the energy of an arbitrary sequence. That is, a good fit to the experimental data is necessary, but not sufficient, for producing a set of subunit energies that can be applied to an arbi-

This work was supported by NIH grant GM-39471.

REFERENCES 1. Gray, D. M. & Tinoco, I. (1970) Biopolymers 9,223244. 2. Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1986) Numerical Recipes Cambridge University Press, New York. 3. Doktycz, M. J., Goldstein, R. I?., Paner, T. M., Gallo, F. J. & Benight, A. S. (1992) Biopolymers, 32, 849864. 4. Draper, N. R. & Smith, H. (1981) Applied Regression Analysis, John Wiley & Sons, New York. Received December 31, 1991 Accepted May 28, 1992

How many numbers are required to specify sequence-dependent properties of polynucleotides?

There are 10 unique dinucleotides of double-stranded DNA, but only 8 independent nearest-neighbor energies that occur in circular DNA, as shown by D. ...
1MB Sizes 0 Downloads 0 Views