
Ann. Hum. Genet. (1992), 56, 223-232. Printed in Great Britain

Algorithms for a location database

N. E. MORTON, A. COLLINS, S. LAWRENCE AND D. C. SHIELDS

CRC Research Group in Genetic Epidemiology, University of Southampton, Level C, Princess Anne Hospital, Southampton SO9 4HA, U.K.

SUMMARY

The algorithms that drive the ldb location database are described. The program captures data on genetic and physical maps and combines information from different sources into a summary map. To assure portability it was developed in Fortran on a SUN SPARCStation under Unix. The algorithms, which combine rule-based seriation with a minimum deviance bootstrap, allow investigators and chromosome committees to produce a composite location in Mb that integrates partial maps. The program and manual are now available from the authors.

INTRODUCTION

A location database is defined in linear space by a vector of genetic and physical locations for each locus, which may be ordered by composite location (Morton, 1991a). The structure is hierarchical, like computer directories, the highest level corresponding to single chromosomes and the lowest level to DNA sequences. We are concerned here with the highest level, which is the initial focus of the Human Genome Project. Although the Human Genome Project is understandably preoccupied with databases (Pearson et al. 1991), there is no agreement on the structure and operation of a mapping database, which has passed through three stages. In the first stage each chromosome committee at a Human Gene Mapping Workshop attempted to reconcile linkage and physical evidence subjectively by placing loci in a tentative order, distinguishing provisional and confirmed assignments without inferring a composite location (Hamerton, 1976). The limit of this approach was reached as the data grew exponentially, and so in the second stage an algorithm was introduced to order loci by interval, sorting on the cytogenetic assignment in relation to the telomere of the short arm and taking the proximal band as the major key (de la Chapelle, 1985). For this purpose genetic data were projected subjectively onto the cytogenetic map, obscuring the source and dependence of this information. Many cytogenetic assignments are coarse, and more precise genetic and physical information was neglected. By this algorithm two loci listed in proximity need not be close, while two loci listed with many intervening loci might be adjacent on the gene map. There is often conclusive evidence against the order specified by this algorithm. The investigator who wants to identify genes in proximity to a locus of interest must make a subjective evaluation of increasingly complex evidence that is not organized in a helpful way.
The third stage is represented by the Genome Data Base (GDB), which emphasizes description of loci and probes but also gives gene orders based on small subsets of the data (Pearson et al. 1991). There is no limit to the number of such orders, location is not specified, information is fragmented and therefore difficult to extract, the system is not locus-oriented, and the relational language is arcane. Unlike a publication where the data are protected, any partial map may be altered or deleted by an editor with administrative approval, and an editor may accept or reject personal communications or incomplete references. No attempt is made to synthesize the data into a summary map. By contrast, a location database emphasizes synthesis of maps. In our proposal data are captured in flat files to facilitate reading and analysis. The system is object-oriented, where the object is a DNA sequence (locus). The prototype of this location database (ldb) has been briefly described (Morton, 1991a; Collins et al. 1992). Here we present the algorithms that drive it. They can be improved, but no database that neglects these principles can satisfy the needs of workers on any genome for which linkage and physical maps are both important.

DEFINITIONS

A locus is a DNA sequence represented by its midpoint. This correspondence between sequences and points presents practical problems that in principle are resolved by a nomenclature committee for expressed loci (Shows et al. 1987) and by a DNA committee for anonymous sequences (Willard et al. 1985). Neither has met recently to consider basic definitions. Currently they attempt with limited success to assign the same symbol for all clones with a common sequence (excluding YACs), but a series of partially overlapping clones (contig) is not a locus. Since these two rules lead to contradictions, the definition of a locus is in flux (Fig. 1). However a locus is defined, the database is locus-oriented. Each locus may appear in a summary map and one or more partial maps as a record containing its symbol, a vector of locations, and perhaps its rank defined in terms of spacing, polymorphism, and support (Table 1). Index loci are a subset of reference loci, which in turn are a subset of framework loci, whose order is well established (Lawrence et al. 1991). Other loci are locally unordered but with significant regional support either from a genetic or physical map or by cytogenetic or somatic cell assignment to an interval of less than 10 Mb. Loci with the two lowest ranks (- and w) are retained in the database but usually omitted from published maps (Collins et al. 1992). Location relative to an origin is specified in megabases (Mb) for physical maps and in centimorgans (cM) for sex-specific linkage maps. For absolute location the origin is at the short arm telomere (pter). For relative location the working origin (offset) is proximal to pter. Physical maps include radiation hybrids, which may be scaled to Mb on the assumption of uniform breakage, and a number of methods that give Mb by sequence or restriction fragments of contigs.
The composite location is the only variable that attempts to reconcile different types of information, whereas other locations are direct estimates and information from one variable is not interpolated into another. In its early stages the composite location serves primarily to order loci more reliably than by cytogenetic assignment alone, but ultimately it converges to the physical map. Homology with the mouse or other mammal may be indicated by the symbol y:w, where y is the mammalian chromosome and w is location in its sex-averaged linkage map (for example 3:40 means location at 40 cM on chromosome 3). While homology is of obvious interest, locations in the mouse (and a fortiori in other mammals) are imprecise and there have been complex rearrangements, so we hesitate to use this information at present in constructing the human map (Collins et al. 1992). Cytogenetic assignment takes two fields for the leftmost and rightmost band included in the


Fig. 1. Three probes (A, B, C) define 5 sequences (1 to 5): are there 1, 3 or 5 loci?

Table 1. Definition of locus rank

Rank  Definition
w     locally unordered; cytogenetic or somatic cell assignment to an interval > 25 Mb
-     locally unordered; cytogenetic or somatic cell assignment to an interval 10-25 Mb
0     locally unordered; either genetic or physical evidence insufficient to order with respect to the closest flanking loci, or cytogenetic or somatic cell assignment to an interval of less than 10 Mb
1     lesser framework locus; evidence sufficient to order with respect to the closest framework loci, but there may be a more polymorphic locus nearby
2     reference locus; highly polymorphic framework locus, spaced 5-10 cM in males, with no more polymorphic locus in this interval
3     index locus; very highly polymorphic framework locus, spaced 10-15 cM in males, with no more polymorphic locus in this interval

Collectively ranks 1-3 are framework loci, reliably ordered with respect to the closest framework loci, which are also called skeletal loci (Morton & Collins, 1990). The higher ranks are defined on polymorphism and spacing as a rough guide to detecting linkage in the absence of good candidate loci, although probe accessibility, technical simplicity, and proximity to a locus of interest are more important than polymorphism, which is not an advantage for sequencing.

interval (ISCN, 1985). Region is defined by a somatic cell panel or by rare-cutter restriction enzymes, but unfortunately there is no consensus. The promise of rare-cutter fragments has not yet been realized, partly because of failure to partition each chromosome in somatic cell hybrids and to assign loci with known location to these fragments (Cantor et al. 1988). No chromosome committee has yet defined a somatic cell panel that represents a consensus, is public, has high resolution, and provides a hierarchical nomenclature for segments comparable to the salivary band designations in Drosophila (Lindsley & Grell, 1968). However, the coarseness of metaphase bands and the difficulty of identifying early prophase bands make it likely that regionalization by somatic cell hybrids or restriction fragments will be developed. Indeed, there is such a development for chromosome 11 (Tanigami et al. 1992). Each partial map is associated with a reference to a paper published or in press. Any change in the map is documented in a note. Partial maps are designed as permanent repositories of useful data, and therefore are subject to minimal change. A partial map may contain more than one delimited region beginning with an offset and may include more than one type of data. A summary map typically includes multiple types of data constituting a vector of locations (Table 2).
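The three lowest ranks of Table 1 depend only on the width of the cytogenetic or regional interval to which a locus is assigned, so they can be illustrated by a small function. The sketch below is not taken from ldb: the function name and the treatment of the boundaries (10 Mb and 25 Mb counted as inclusive upper limits) are assumptions made for illustration.

```python
def rank_from_interval(interval_mb: float) -> str:
    """Return the rank symbol for a locus placed only by cytogenetic
    or somatic cell (regional) assignment.

    Assumption: boundaries are inclusive upper limits (<= 10 Mb, <= 25 Mb).
    """
    if interval_mb < 0:
        raise ValueError("interval width must be non-negative")
    if interval_mb <= 10:
        return "0"   # locally unordered, but regionally secure
    if interval_mb <= 25:
        return "-"   # retained in the database, usually not published
    return "w"       # widest intervals, least precise


if __name__ == "__main__":
    for width in (4.0, 18.0, 60.0):
        print(width, rank_from_interval(width))
```

Ranks 1 to 3 are not covered, since they depend on ordering evidence, polymorphism, and spacing rather than on interval width.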


Table 2. Definition of maps

Framework:      consisting only of reliably ordered loci
Comprehensive:  including locally unordered loci whose regional assignment is secure (this implies assignment to an interval of less than 10 Mb, and cytogenetic or somatic cell assignments to larger intervals are omitted)
Inclusive:      including all loci, however uncertain their location
Partial:        containing a subset of locations (genetic, physical or both) for a subset of loci
Summary:        containing a vector of genetic and physical locations that represent the consensus of partial maps and a composite location that projects all summary maps into the physical scale (Mb)

Summary maps are formed in four phases. In the first phase information from each partial map is entered in one of two modes, called incorporation (addition) and rectification (correction). In the second phase the composite location is formed from the summary map by a priority rule (composition). These two phases constitute the provisional summary. The third phase (revision) recalculates each summary location by iterative weighting of the data in partial maps. The fourth phase (reconciliation) recalculates composite locations by iterative weighting of the data in the summary map. These two phases constitute the bootstrap by analogy with gene mapping (Morton & Andrews, 1989). Besides improving the maps, they make the results effectively independent of the way in which the provisional summary was created. At any time a summary map may be scaled to preassigned arm lengths. A list of valid locus symbols together with obsolete aliases is maintained, and symbols in partial and summary maps may be updated (aliased). Besides partial and summary maps, ldb contains other types of files. Parameter files give locus synonyms (alias), genetic and physical arm lengths (arm), lengths of cytogenetic bands (band) and somatic cell regions (rg), and constants A, B that determine weights for each type of map (weight). Lod files may be partial (reporting a single source of data) or summary (combining all accepted, non-overlapping partial lod files). The lod operation recreates the summary lod file from the set of partial lod files. Housekeeping files give results of operations on map, parameter, and lod files. Small partial maps are sometimes not ordered with respect to the origin, which presents a problem for a location database. In such cases the location with respect to an arbitrary origin may be given in the variable note, which is not used by the program, and the operative location taken as zero to indicate a megalocus to be resolved later.
Maps of regions larger than 1 Mb are usually of known polarity.

SYMBOLS

Locus symbols follow convention except that an alphabetical suffix always signifies a subsequence, and E is not used to signify an expressed sequence of unknown function. When a conventional locus symbol has not yet been assigned, the provisional symbol is enclosed in parentheses. Other variables are abbreviated as follows: rank (rk), mouse homology (mus), physical location (phymb), location from radiation hybrids (rhmb), genetic location in males (mcm), genetic location in females (fcm), leftmost cytogenetic band (lbd), rightmost cytogenetic band (rbd), and region (rg). The pair of bands lbd and rbd is bd and the midpoint is taken as location. The mean of sex-specific genetic locations is cm. Indices i, j, k denote locus, partial map, and variable, respectively. Locations of variables are indicated by p, s, and c in a partial map, summary map, and composite map, respectively. A location to be incorporated takes index x, and the flanking markers in the partial map take indices 1, 2, respectively: the locus corresponding to x may or may not already exist in the summary map because of information on other variables. The telomere of the short arm is denoted by ptr, the telomere of the long arm by qtr, and the centromere by cen.
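As an illustration of the record implied by these symbols, the following sketch collects the variables of one locus in a Python dataclass. The class name, the field types, and the example symbol are hypothetical; ldb itself stores records in flat files and is written in Fortran, so this shows only the data layout, not the program's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LocusRecord:
    """One locus with its vector of locations (field names follow the text)."""
    symbol: str
    rk: Optional[str] = None       # rank: "w", "-", "0", "1", "2" or "3"
    mus: Optional[str] = None      # mouse homology, e.g. "3:40"
    phymb: Optional[float] = None  # physical location (Mb)
    rhmb: Optional[float] = None   # radiation hybrid location (Mb)
    mcm: Optional[float] = None    # genetic location in males (cM)
    fcm: Optional[float] = None    # genetic location in females (cM)
    lbd: Optional[str] = None      # leftmost cytogenetic band
    rbd: Optional[str] = None      # rightmost cytogenetic band
    rg: Optional[str] = None       # region

    @property
    def cm(self) -> Optional[float]:
        """Mean of the sex-specific genetic locations (autosomal case only)."""
        if self.mcm is None or self.fcm is None:
            return None
        return (self.mcm + self.fcm) / 2.0

    def mouse_homology(self) -> Optional[Tuple[str, float]]:
        """Split 'chromosome:location' into its two parts."""
        if self.mus is None:
            return None
        chrom, loc = self.mus.split(":")
        return chrom.strip(), float(loc)


if __name__ == "__main__":
    rec = LocusRecord("EXAMPLE1", mcm=10.0, fcm=20.0, mus="3:40")  # made-up locus
    print(rec.cm, rec.mouse_homology())
```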

INCORPORATION

For this operation it is assumed that a partial map is useful to fill gaps in the summary map, but not to alter summary variables. This is straightforward for rg, bd, and mus. For rhmb, phymb, and cm the closest left flanker common to the partial and summary maps takes index 1 and the closest common right flanker takes index 2. Then

$$ s_x = s_1 + (s_2 - s_1)\,\frac{p_x - p_1}{p_2 - p_1}. $$

If no left flanker common to the partial and summary maps is found, the most distant common right flanker is sought. If this is not the same as the closest common right flanker, they are identified as 2 and 1, respectively. Similarly, if no right flanker common to the partial and summary maps is found, the most distant common left flanker is sought. If this is not the same as the closest common left flanker, they are identified as 1 and 2, respectively. This extrapolation is avoided if ptr and qtr are included in the partial map, as is sometimes feasible. It is not necessary that cen be included in the partial map, although of course its inclusion is desirable (if feasible) for incorporation near the centromere. Incorporation of a locus does not depend on the order of loci in the partial and summary maps. Rank is taken from the summary file if given, otherwise it is left blank.
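The flanker rules and the linear interpolation for incorporation can be sketched as follows. This is a minimal illustration rather than the ldb code: the function name, the use of dictionaries mapping locus symbol to location, and the decision to return None when fewer than two usable common flankers exist (or when the two flankers coincide in the partial map) are assumptions.

```python
from typing import Dict, Optional


def incorporate(name: str,
                partial: Dict[str, float],
                summary: Dict[str, float]) -> Optional[float]:
    """Estimate the summary location of `name` for one variable.

    partial, summary: locus symbol -> location for the same variable.
    An existing summary location is never altered by incorporation.
    """
    if name in summary:
        return summary[name]
    px = partial[name]
    # Loci common to both maps as (partial location, summary location),
    # sorted by partial location.
    common = sorted((p, summary[n]) for n, p in partial.items() if n in summary)
    left = [(p, s) for p, s in common if p <= px]
    right = [(p, s) for p, s in common if p > px]
    if left and right:            # interpolation between closest flankers
        (p1, s1), (p2, s2) = left[-1], right[0]
    elif len(right) >= 2:         # no left flanker: closest right is 1, most distant is 2
        (p1, s1), (p2, s2) = right[0], right[-1]
    elif len(left) >= 2:          # no right flanker: most distant left is 1, closest is 2
        (p1, s1), (p2, s2) = left[0], left[-1]
    else:
        return None               # assumption: too few common flankers
    if p2 == p1:
        return None               # assumption: coincident flankers give no scale
    return s1 + (s2 - s1) * (px - p1) / (p2 - p1)


if __name__ == "__main__":
    partial = {"A": 0.0, "X": 10.0, "C": 20.0}
    summary = {"A": 100.0, "C": 140.0}
    print(incorporate("X", partial, summary))  # midway between A and C
```

Because the flankers are found by location in the partial map alone, the result does not depend on the order in which loci are listed in either map.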

RECTIFICATION

For this operation it is assumed that a particular partial map is more reliable than the summary map. Therefore rectification differs from incorporation by retaining all orders and intervals specified in the partial map. The working origin ('offset') is made to conform to the summary map, and other locations not given in the partial map are incorporated into it. In the rectify operation the variable of interest in the partial map is assumed to be sorted from ptr to qtr. The algorithm searches for a locus common to the summary and partial map, beginning with the first locus in the latter. Suppose the first common locus has location $s_0$ on the summary map and $p_0$ on the partial map. Then if $p_0 \le s_0$ the quantity $s_0 - p_0$ is added to all locations in the partial map and the summary map is incorporated into it. In this way all distances in the partial map and all other information in the summary map are conserved. When applied to cytogenetic assignment, lbd and rbd are unchanged in the summary file if not given in the partial map. Otherwise they are taken from the partial map. Rank is taken from the partial map if given, otherwise from the summary file if given, otherwise it is left blank.
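The offset step of rectification can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name and data layout are hypothetical, the shift is applied whatever its sign, and summary loci absent from the partial map are simply kept at their existing locations instead of being re-incorporated by interpolation as the text describes.

```python
from typing import Dict, List, Tuple


def rectify(partial: List[Tuple[str, float]],
            summary: Dict[str, float]) -> Dict[str, float]:
    """Shift a partial map onto the summary origin and let it override the summary.

    partial: (symbol, location) pairs sorted from ptr to qtr.
    summary: symbol -> location for the same variable.
    """
    # First locus of the partial map that is also in the summary map.
    anchor = next(((n, p) for n, p in partial if n in summary), None)
    if anchor is None:
        raise ValueError("no locus common to partial and summary maps")
    name0, p0 = anchor
    shift = summary[name0] - p0          # s0 - p0
    rectified = dict(summary)            # other summary information is kept
    for name, p in partial:
        rectified[name] = p + shift      # intervals of the partial map are conserved
    return rectified


if __name__ == "__main__":
    partial = [("N", 0.0), ("A", 2.0), ("B", 5.0)]
    summary = {"A": 12.0, "B": 20.0, "Z": 50.0}
    print(rectify(partial, summary))
```

In the example the interval between A and B becomes the 3 units given by the partial map, replacing the 8 units of the summary map, while Z is untouched.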


COMPOSITION

The compose operation attempts to combine all information in a summary file into the variable comp, leaving other variables unchanged. Partial maps are not used. Variables are prioritized from highest to lowest, with the default being rhmb, phymb, cm, rg, and bd, where cm denotes genetic map and bd is cytogenetic assignment. For a particular chromosome it may be that some other ordering of priority is appropriate. If a variable of high priority is known for a particular locus, variables of lower priority will not be used to establish composite location. Genetic location agrees with physical locations in order but not in scale, and so an unknown location x may be incorporated from closest flanking markers 1, 2 common to the genetic and composite maps. Let

$$ s = \frac{\mathrm{mcm} + \mathrm{fcm}}{2} $$

be the location in the sex-averaged map, where both variables must be coded if autosomal. For the X chromosome s = fcm, and for the Y chromosome s = mcm. Then the estimated composite location is

$$ c_x = c_1 + (c_2 - c_1)\,\frac{s_x - s_1}{s_2 - s_1} $$

which is also valid for extrapolation when x is distal to 1 and 2, where 1 is the closest and 2 the most distant common marker. If $s_1 = s_2$, then $c_x = c_1 = c_2$. The physical variables phymb and rhmb may locally be scaled differently from each other and from comp. Therefore the above equation is also used for them. A locus assigned to a region is given location (lmb + rmb)/2, assuming that arm lengths have been scaled to the currently best estimates. On that assumption a locus assigned to chromosome bands is given a location $(\mathrm{lmb}_{\min} + \mathrm{rmb}_{\max})/2$. This cytogenetic assignment is usually the least reliable estimate of physical location. The variable of highest priority is introduced into a composite map that contains only the locations of ptr, cen, and qtr given in arm, to which comp is thereby scaled. For compose, rank is not modified unless composite location is determined from cytogenetic or regional assignment in the absence of other information. Then if the interval from physical locations in the band or rg file is ≤ 10 Mb, rk = 0. If the interval is ≤ 25 Mb but > 10 Mb, rk = - (minus). Otherwise rk = w: these loci remain in the map file but would usually be deleted for presentation as unacceptably imprecise.

REVISION

This is a global operation on the whole set of partial maps, replacing each variable in the summary map by a weighted combination of the partial maps, with display of discrepancies. The composite locations are unchanged. Partial maps considered obsolete should be eliminated before revision. The weights are estimated iteratively as follows. Define a quadratic form

$$ Q_k = \sum_i \sum_j w_{ijk}\,(s_{ijk} - s_{ik})^2 $$

where $w_{ijk}$ is the weight for the projection $s_{ijk}$ into the summary map of the ith locus from the jth partial map for the kth variable. The summary map is constant for a given iteration, but changes at the next cycle. Projection follows the logic for incorporation, except that a summary location $s_{ik}$ is not a bar to computation of $s_{ijk}$. At the first iteration $s_{ik}$ is taken from the rule-based map. Let

$$ w_{ijk} = \frac{1}{1 + B_k e_{ijk}} $$

where $e_{ijk}$ is the interval in the partial map by which the interpolation is made. The parameter $B_k$ is estimated by minimizing $Q_k/W_k$ with the Gemini program (Lalouel, 1983), taking the trial value from the weight file, where $W_k$ is the sum of weights over loci represented for the kth variable in more than one partial map. Change in $w_{ijk}$ is governed by iteration of $B_k$. To transform these arbitrary weights into information, note that $Q_k$ has $N_k - n_k - 1$ degrees of freedom, where $n_k$ is the number of loci with a value of $s_{ik}$ and $N_k$ is the total number of interpolates for such loci. Then the information weight is $W_{ijk} = A_k w_{ijk}$, where

$$ A_k = \frac{N_k - n_k - 1}{Q_k}. $$

For the kth variable in the summary map the parameters $A_k$, $B_k$ are stored in the weight file. The standard error of the location of i in a summary map is $1/\sqrt{\sum_j W_{ijk}}$. Final values of $A_k$, $B_k$ replace initial values in the weight file. If for a given variable there is an authoritative partial map, it should be rectified after revision.
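One cycle of revision for a single variable can be sketched numerically as below. The sketch takes B as given and evaluates the weights, the weighted summary locations, the quadratic form, the scale factor A, and the standard errors; it does not perform the minimization over B, which the text assigns to the Gemini program. The function name, the input layout, and the choice to use the weighted mean of the projections as the summary location within the cycle are assumptions.

```python
import math
from typing import Dict, List, Tuple


def revise_variable(interpolates: Dict[str, List[Tuple[float, float]]],
                    B: float):
    """One revision cycle for one variable at a fixed B.

    interpolates: locus -> list of (projected location, interval e) pairs,
                  one pair per partial map contributing to that locus.
    Returns (summary locations, standard errors, A, objective Q/W).
    """
    means, raw_weights = {}, {}
    Q = 0.0       # weighted sum of squared deviations
    N = 0         # total number of interpolates
    W_sum = 0.0   # sum of weights over loci with more than one interpolate
    for locus, obs in interpolates.items():
        w = [1.0 / (1.0 + B * e) for _, e in obs]          # w = 1 / (1 + B e)
        mean = sum(wi * s for wi, (s, _) in zip(w, obs)) / sum(w)
        means[locus], raw_weights[locus] = mean, w
        Q += sum(wi * (s - mean) ** 2 for wi, (s, _) in zip(w, obs))
        N += len(obs)
        if len(obs) > 1:
            W_sum += sum(w)
    n = len(interpolates)
    df = N - n - 1                                         # degrees of freedom
    if df <= 0 or Q <= 0.0 or W_sum <= 0.0:
        raise ValueError("not enough replicated loci to scale the weights")
    A = df / Q                                             # information scale
    se = {locus: 1.0 / math.sqrt(A * sum(w))               # 1 / sqrt(sum of A w)
          for locus, w in raw_weights.items()}
    return means, se, A, Q / W_sum


if __name__ == "__main__":
    data = {"L1": [(10.0, 0.0), (12.0, 0.0)],
            "L2": [(20.0, 0.0), (24.0, 0.0), (22.0, 0.0)]}
    print(revise_variable(data, B=0.5))
```

A full implementation would wrap this evaluation in an optimizer over B and repeat the cycle until the summary locations stabilize.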

RECONCILIATION

This is a global operation on the summary map, replacing composite locations by a weighted combination of the other variables, which are unchanged. We define a quadratic form

$$ Q = \sum_i \sum_k w_{ik}\,(c_{ik} - c_i)^2 $$

with degrees of freedom $M - m - 2(K-1) - 1$, where $c_{ik}$ is the interpolate of the composite location of the ith locus from the kth variable, m is the number of loci with a composite location $c_i$, M is the total number of interpolates, and there are K informative variables. Let

$$ w_{ik} = \frac{A_k}{1 + B_k S_{ik}} $$

where $S_{ik}$ is the interval in the kth variable which provides an interpolate for the ith locus and $A_k$ is fixed for one variable. The $2(K-1)+1$ remaining parameters are estimated by minimizing $Q/W$ with the Gemini program, where W is the sum of weights over loci with more than one variable. The information weights are $W'_{ik} = A'_k/(1 + B_k S_{ik})$, where

$$ A'_k = \frac{A_k\,\{M - m - 2(K-1) - 1\}}{Q}. $$

The final values of $A'_k$, $B_k$ are stored in the weight file. The composite location for the ith locus has standard error $1/\sqrt{\sum_k W'_{ik}}$.

NOMENCLATURE OF FILES

The directory chromi for the ith chromosome (i = 1, 2, ..., 22, x, y) includes four kinds of files (map, housekeeping, parameter, and lods). Files named p1, p2, ..., contain partial maps, and the summary file is called map. Operations on partial maps update the summary file, to which they add in variable ref an audit trail of up to 3 references per locus. As additional references are entered, the oldest reference is deleted. Partial maps (and therefore references) not cited by at least one locus may be purged. Operations on the summary map give rise to two housekeeping files. The old summary map is written to map.bak and comments generated by the operation are written to memo, which should be scanned and perhaps printed. Other housekeeping files band.bak and rg.bak are generated by the scale operation. The ref housekeeping file compiles references from partial maps and partial lod files by the reference command. Housekeeping files are refreshed when next used. For the ith chromosome there are five parameter files that give locus synonyms (alias), variable weights (weight), and sizes of arms (arm), cytogenetic bands (band) and regions (rg). File arm has one record for each chromosome arm, with variables for arm (p or q), length in megabases (mbsz), and lengths in centimorgans for males (mcmsz) and females (fcmsz), currently given by Morton (1991b). The band and rg files have physical size (mbsz), and the locations in megabases of the left and right limits of the band or region (lmb and rmb). For regions a definition in terms of a somatic cell panel is also included as a 0, 1 matrix (note). The sum of physical sizes for a given chromosome arm corresponds to the physical length in the arm file. Records in the summary lod file (lod) give a pair of loci (locus1, locus2), source, sample, and three pairs of θ, z scores (thm, zm, thf, zf, thu, zu) for males, females, and unspecified sex. The ith source corresponds to partial lod file Li. Sample is used to distinguish families and groups of families within a source when genetic heterogeneity is suspected. The last update of the LODSOURCE database constitutes one partial lod file (Keats et al. 1991), and the latest update of the CEPH lods is another (Dausset et al. 1990).
The remaining partial lod files represent sources subsequent to those updates, which may be made obsolete by later revisions of the primary files.

OTHER OPERATIONS

The operation scale makes entries in the band, rg, and map files conform to the parameters in the arm file, which should be the best values currently available (Morton, 1991b). The operation order puts a map file into order by a specified variable. The operation format rearranges and deletes fields in a file, writing the old file to *.bak where * denotes map, p, band, or rg. Only a subset of variables recognized by ldb (including note) is retained in the output. The operation alias replaces locus symbols in map and lod files by preferred symbols taken from the alias file and produces a list of locus symbols and the files in which they appear. Records with the same alias are combined.

DISCUSSION

These algorithms combine rule-based seriation (incorporate, rectify, and compose) with a minimum deviance bootstrap (revise and reconcile). Rule-based logic arbitrarily dichotomizes information as 0, 1 to create a trial map quickly, but locations depend on the sequence of operations on partial maps. Rule-based logic is not applicable to data of uncertain precision and cannot resolve discrepancies as objectively or efficiently as minimum deviance, which weights each piece of information appropriately and thus produces a final map that does not depend on the sequence of operations. A neural network differs from weighted analysis by assuming a training set of known errors, which is progressively enlarged and the weights recalculated. Summary maps in man are too sparse and the resolution of discrepancies too uncertain to consider a neural network, which under ideal conditions might be applicable to order, but not location. Only experience will tell whether our use of minimum deviance can be improved. With classical markers the linkage map and cytogenetic assignment were complementary and the physical map within a chromosome band was dispensable. The need to integrate genetic and physical data into a composite map arose with dense DNA markers, where the limit of resolution is the sequence. The very large number of loci and mapping methods makes one or more computer databases essential as the source for analysis and summary in hard-copy tables and catalogs, which are also required. This is a novel problem in all organisms, and several approaches have been taken. All use location databases in which loci are the primary objects of reference, with the exception of the GDB system for man, which is in part an interval database, presenting difficulties that have been discussed elsewhere (Morton, 1991a). With at most one exception the graphic interface, if any, is designed to work with hierarchical ASCII files independently of specific data management systems (Kosowsky et al. 1991; Yoshida et al. 1991). Map reconciliation, if attempted, is usually delegated to an authoritative curator, but Letovsky & Berlyn (1992) have pioneered a rule-based program. All the working systems are programmed in the Fortran or C language, since relational systems using the SQL language have substantially increased cost for programming and data entry and retrieval. In one example a run of 3 h with an extended version of SQL was reduced to less than a second when the analysis was programmed in C (Letovsky & Berlyn, 1992). Programs with a defined objective (e.g. reconciliation, graphical display) on restricted datasets (e.g. locations, YACs, lods, sequences) have proved more flexible and easier to develop than a central repository for all mapping and sequencing data, of necessity based on untested principles. Experience with special purpose databases is indispensable to eventual development of a successful system, which is likely to involve a network of nodes managed by scientists rather than an informatic monolith. It would not be difficult to adapt ldb to high-resolution maps and sequences, for example by assigning to each contig or fragment a directory in which the files represent partially or wholly sequenced probes. A consensus on the logic and nomenclature of such elements is essential, but the Human Genome Project has not addressed serious problems of definitions, symbols, mapping conventions, and long-range connectivity. There must also be an initiative from scientists working on a particular chromosome to establish their own centre to update maps continually by algorithms for a location database, following the pattern in experimental organisms where annual meetings to update maps are perceived as ineffectual. Data should be shared with GDB and other public databases, but that cannot be the main objective of workshops and centres. Future progress in mapping the human genome depends on wresting direction from the international bureaucracy and returning to scientific initiatives. The mapping enterprise that stimulated and justified the Human Genome Project must try to survive it.

REFERENCES

CANTOR, C. R., SMITH, C. L. & MATHEW, M. K. (1988). Pulsed field gel electrophoresis of very large DNA molecules. In Annual Review of Biophysics and Biophysical Chemistry (ed. D. M. Engelman, C. R. Cantor and D. T. Pollard). Palo Alto: Annual Reviews, Inc.

COLLINS, A., KEATS, B. J., DRACOPOLI, N., SHIELDS, D. C. & MORTON, N. E. (1992). Integration of gene maps: chromosome 1. Proc. Natl. Acad. Sci. USA 89, 4598-4602.

DAUSSET, J., CANN, H., COHEN, D., LATHROP, M., LALOUEL, J.-M. & WHITE, R. (1990). Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6, 575-577.

DE LA CHAPELLE, A. (1985). The 1985 human gene map and human gene mapping in 1985. Cytogenet. Cell Genet. 40, 1-7.

HAMERTON, J. L. (1976). Report of the committee on the genetic constitution of chromosomes 1 and 2. In Human Gene Mapping 3, Birth Defects Original Article Series 12(7) (ed. D. Bergsma), pp. 7-23. Basel: S. Karger.

ISCN (1985). An International System for Human Cytogenetic Nomenclature. Basel: S. Karger.

KEATS, B. J. B., SHERMAN, S. L., MORTON, N. E., ROBSON, E. B., BUETOW, K. H., CARTWRIGHT, P. E., CHAKRAVARTI, A., FRANCKE, U., GREEN, P. P. & OTT, J. (1991). Guidelines for human linkage maps (ISLM, 1990). Ann. Hum. Genet. 55, 1-6.

KOSOWSKY, M., BLAKE, C., BRADT, D., EPPIG, J., GRANT, P., MOBRAATEN, L., NADEAU,
