Clinical Genetics 1979: 15: 137-146

Electronic data processing in the Danish Cytogenetic Central Register and EDP problems of registers in general

POUL VIDEBECH AND JOHANNES NIELSEN

Cytogenetic Laboratory, Psychiatric Hospital, Århus, Denmark

A brief introduction to the Danish Cytogenetic Central Register (DCCR) is given, and possibilities, principles and problems concerning the establishment and maintenance of a national cytogenetic register are presented. Various data carrier media for registers in general are discussed, of which the magnetic disc is considered most appropriate. General principles for programs capable of performing insertions, deletions and other modifications in the database are outlined, as well as the principles for the programs in the DCCR. The individual records should preferably be identified by means of a central person registration number (CPR) rather than by name. The data should be stored and sorted by this identification in order to facilitate retrieval of a desired record. The structure of the records is discussed with regard to prevention of certain errors as well as optimization of processing. Flexibility and economy of space are achieved by using programs able to handle records of unequal length, and problems occurring in connection with this are discussed. The question of how to protect sensitive data is dealt with, and two different methods used in the DCCR are outlined. Programs capable of analyzing karyotypes with the purpose of recognizing various cytogenetic syndromes have been developed for use in the DCCR. Various examples of computing times of typical program runs are presented.

Received 22 May, revised 17 July, accepted for publication 11 August 1978

Key words: Chromosome abnormalities; cytogenetic register; database; data security; electronic data processing; karyotype analyzing; prenatal chromosome examinations.

The aim of the present paper is to outline some problems and possibilities of planning the electronic data processing (EDP) of a cytogenetic register by discussing the structure of the Danish Cytogenetic Central Register and the principles of the programs necessary to maintain it. It has further been our intention to put forward some propositions concerning improvements of already existing registers as well as to contribute to the dialogue among those with experience in using EDP for cytogenetic registers.

The definition of some of the words relating to electronic data processing may not be generally known, and so a glossary is presented on page 17.

The Danish Cytogenetic Central Register (DCCR)

The data in the DCCR are treated in five main groups: 1. Prenatal chromosome examinations. 2. Abortions. 3. Persons with chromosome abnormalities

0009-9163/79/020137-10 $02.50/0 © 1979 Munksgaard, Copenhagen


Table 1
All cases registered in the Danish Cytogenetic Central Register, March, 1978

Karyotypes    Abortions        Prenatal           Other                 Total
                               investigations     investigations (1)
              Total      %     Total       %      Total        %        Total        %
Abnormal        261   60.4       203     6.0       3,582    23.7        4,046    21.4
Variants         10    2.3       225     6.6       1,339     8.9        1,574     8.3
Normal          161   37.3     2,962    87.4      10,176    67.4       13,299    70.3
Total           432  100.0     3,390   100.0      15,097   100.0       18,919   100.0

(1) All persons referred for cytogenetic examination; all persons with abnormal karyotype and chromosome variants in population studies; all relatives to persons with chromosome abnormalities and variants.

and variants from population studies as well as referrals to cytogenetic examination. 4. Persons referred for cytogenetic examination, but having a normal karyotype. 5. Relatives of probands with chromosome abnormalities or variants.

These groups are stored separately on magnetic tape and read onto disc when they are going to be used for a period. Disc space is consequently not used to store the data permanently.

Descriptions of the Register have previously been given by Nielsen et al. (1973), and Human Cytogenetic Registers (1977) presented data concerning the DCCR as well as other cytogenetic registers in use around the world in 1976. Table 1 shows the present case load in the Register, a total of 18,919.

The Register is used by all participating laboratories for immediate information concerning results of chromosome examination in certain persons, and printouts with data from the Register are distributed to the participating laboratories once a year. Data from the Register are also used for national cytogenetic service purposes: they have revealed a skewed geographical and social distribution of cytogenetic service coverage in Denmark and have shown the need for more cytogenetic laboratories and a better information service.

In cooperation with the Danish National Central Person Number Register, the Cytogenetic Register is used for mortality studies of persons with different chromosome abnormalities, as well as for studies of seasonal variations in the birth of children with different chromosome abnormalities. Follow-up studies are planned of persons with different chromosome abnormalities, in cooperation with other Danish national registers such as the Register for Mentally Ill, the Register for Mentally Retarded, the Register for Epileptics and the Cancer Register. Follow-up of all males in the Register with 46,XX as well as males with 48,XXYY has been carried out recently (Sørensen et al. 1978a, b), and data from the Register have been used for a study of all cases with Turner's syndrome among Danish schoolgirls aged 7 to 17.

The programs used in connection with the DCCR have mainly been written in the programming language Fortran Extended, which is described in manual 1. Some of the principles of text management have been described by Day (1970, 1972). Other programs have been written in a highly specialized language called sort/merge, which is constructed exclusively for sorting data (manual 3). All the programs are executed under the control of the KRONOS 2.1 operating system (manual 2) on a Control Data Corporation CYBER 173 computer.

General Principles of the Organization and Management of the Database

Data carriers

On most modern computers there are several devices for storing data, such as punched cards, magnetic tapes and discs. For more intensive use the disc is preferable, partly for its high transmission speed and partly because it can be read randomly. This means that retrieving data from the middle of the database does not require reading the entire file until the desired data are found, as is the case with the other data carriers.

Data handling equipment

When maintaining an EDP register, it is necessary to have programs which can make corrections and additions to it. These tasks might be executed by programs supplied by the computer manufacturer, such as MODIFY and UPDATE on the CYBER. Because of their general nature, however, they are seldom as effective in use as specially designed user-constructed programs. Before the construction of the present system, the DCCR was processed by such a prefabricated program.

The programs must as a minimum be able to perform the following processing functions:
a. Insert new data into the Register.
b. Delete or correct existing data.
c. Retrieve selected cases from the Register for further investigation.

Tasks a and b can be done in at least two fundamentally different ways. One easily programmable approach is to copy the entire database to a new file up to the point where the modification is to be performed, then carry it out, copy until the next modification is to take place, and so on. Finally the rest of the database is copied to the new file, which is the version to be used for any further processing. This method thus demands processing of the entire file, no matter how few modifications one wishes to make. As the database grows, this will become more and more expensive. Programs supplied by the computer manufacturer frequently use this solution.

It is, however, possible to write programs which are able to perform the previously mentioned tasks without having to process the entire database. This can be done by utilizing a so-called indexed sequential file organization, which together with other fundamentals of database systems is described by Martin (1975). An indexed sequential database furthermore has the advantage that it can be read very simply by any program, in contrast to databases constructed by most prefabricated database systems, which can only be processed by the system itself.

The internal structure of the database

It is rather difficult to give advice concerning the internal structure of the database, as great differences exist in the individual user's demands on the register. However, some points may be stressed.

a. Record order and identification. The records should be kept in a well-defined sequence, which minimizes the effort of retrieving desired data. The sequence is defined by each person's identification, which can be his or her name or a number associated with the person, such as the birthday combined with a serial number or a case record number. The use of names for identification purposes is, however, not recommended because of the high risk of misspelling.
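The two ideas above, an indexed sequential organization and records kept in key order, can be illustrated together. The following is only a minimal in-memory sketch in Python, not the Fortran programs used in the DCCR: the class name, the block size and the integer keys are assumptions made for illustration, and the sketch assumes at least one record and never splits a block that has grown large.

```python
from bisect import bisect_right, insort


class IndexedSequentialFile:
    """Toy indexed sequential file: sorted blocks plus an index of first keys."""

    def __init__(self, records, block_size=4):
        ordered = sorted(records)                      # (key, data) pairs
        self.blocks = [ordered[i:i + block_size]
                       for i in range(0, len(ordered), block_size)]
        self.index = [block[0][0] for block in self.blocks]

    def _block_for(self, key):
        # Last block whose first key is <= key (block 0 for smaller keys).
        return max(bisect_right(self.index, key) - 1, 0)

    def find(self, key):
        # Only one block is scanned, never the whole file.
        for k, data in self.blocks[self._block_for(key)]:
            if k == key:
                return data
        return None

    def insert(self, key, data):
        b = self._block_for(key)
        insort(self.blocks[b], (key, data))            # only one block is touched
        self.index[b] = self.blocks[b][0][0]


register = IndexedSequentialFile(
    [(k, f"record {k}") for k in (10, 20, 30, 40, 50, 60)], block_size=2)
register.insert(35, "record 35")
print(register.find(35))
```

In contrast to the copy-everything method, both `find` and `insert` here inspect the index and a single block, so their cost does not grow with the number of untouched records.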

Fig. 1. The structure of the central person registration number (CPR). [Figure not reproduced. Legible labels: date, month and year of birth; serial number; check digit, odd for males, even for females.]

A better solution than assigning an “invented” number to every record is, of course, to use a central person registration number (CPR). For an explanation of the structure of the CPR number, see Figure 1. In countries like Denmark, where such numbers are in use, the CPR number is of great importance to a register. This is partly because it identifies every person in the country unambiguously, which makes comparisons between different registers possible, and partly because its check digit makes it possible to detect the most common transcription errors, such as a single wrong digit or two adjacent digits interchanged, by the so-called modulo-11 test (see glossary).
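The modulo-11 test can be sketched in a few lines. The weights below are the ones commonly documented for the Danish CPR number; they are not quoted from this paper, whose glossary is not reproduced here, and the function names and the sample number are made up for illustration.

```python
# Commonly documented CPR weights (an assumption; not taken from this paper).
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def cpr_is_valid(cpr):
    """Return True if the ten digits pass the modulo-11 test."""
    digits = [int(ch) for ch in cpr if ch.isdigit()]
    if len(digits) != 10:
        return False
    return sum(w * d for w, d in zip(CPR_WEIGHTS, digits)) % 11 == 0


def cpr_sex(cpr):
    """The last digit is odd for males and even for females (Fig. 1)."""
    return "male" if int(cpr[-1]) % 2 else "female"


print(cpr_is_valid("070761-4285"))   # weighted sum is 154 = 14 * 11
print(cpr_is_valid("070761-4286"))   # one wrong digit
print(cpr_is_valid("070716-4285"))   # two adjacent digits interchanged
print(cpr_sex("0707614285"))
```

The sample number is synthetic and was chosen only because its weighted sum is divisible by 11; the two altered versions fail the test.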

b. Record format. A typical record consists of a fixed number of card images, each containing the relevant data. This separation of data onto a number of cards constitutes, however, a potential threat to the integrity of the system, because when the records are manipulated outside the database, erroneous addition or deletion of just a single card will shift the “reading frame” in the same way as a frameshift mutation in the DNA molecule. The result will be disastrous.

The cards in each record must therefore be tied together, for instance by letting the identification number occur on all cards belonging to the record in question. A sequence error can then easily be detected and corrected. However, this method consumes extra storage space because of the repeated identifications. Alternatively, the programmer may reserve some columns on the first card to denote how many cards the record comprises, and some on the last card in the record for a special end-of-record mark. In the case of an unintentional deletion of a card, the end-of-record mark will no longer correspond to the record length designator, which can easily be detected. This will be further described later on.

If the last-mentioned solution is chosen, the computer will always be able to locate the boundaries of a record, which means that neither the records nor the individual cards need to be of equal length. The flexibility thus achieved makes it possible to let, for instance, unusually long karyotypes continue on additional cards without having to reserve the same space for all the karyotypes in the register. It will likewise be possible to extend the records with new information later on if desired.

The data should not be distributed randomly on the cards, but organized so that data often used together occur on the same card. Programs reading the database then often need to process only a single card in every record to retrieve the data needed for calculation. The processing is therefore simplified and can be done much faster.

Identification numbers should be constructed so that, if possible, they can be contained in one computer word, which facilitates comparisons between identifications. If the identification requires two computer words, it will take nearly twice as much time to locate a person in the register as if it is stored in only one word.

In summary, there are three aspects that must be considered when choosing the format of the record, namely:
1. Resistance to errors, as mentioned above.
2. Flexibility with respect to permitting unequal length of records.

3. Fast accessibility of data, especially the identification number.

It should be borne in mind that after the data have been written into the records it will be extremely difficult to alter them to another layout.
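The record-length designator and end-of-record mark discussed under record format can be sketched as follows. The layout is hypothetical: it assumes that columns 1-2 of the first card hold the number of cards in the record and that the last card ends with the mark `*END`; neither detail is taken from the paper, and the sample cards are invented.

```python
END_MARK = "*END"   # assumed end-of-record mark in the last columns of the last card


def split_records(cards):
    """Group a flat list of card images into records and check every boundary.

    Assumed layout: columns 1-2 of a record's first card give the number of
    cards in the record, and its last card ends with END_MARK.
    """
    records = []
    i = 0
    while i < len(cards):
        count = int(cards[i][:2])
        record = cards[i:i + count]
        if count < 1 or len(record) < count or not record[-1].endswith(END_MARK):
            raise ValueError(f"damaged record starting at card {i}")
        records.append(record)
        i += count
    return records


deck = [
    "03 0707614285",      # first card: card count, then identification
    "47,XXY",
    "PARISH 0751 *END",
    "02 1111111118",
    "46,XX *END",
]
print([len(r) for r in split_records(deck)])

# Dropping one card shifts the "reading frame"; the mismatch is reported.
try:
    split_records(deck[:1] + deck[2:])
except ValueError as err:
    print(err)
```

Because every boundary is checked against the count on the first card, records may have different numbers of cards and cards may have different lengths.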

Organization of the Danish Cytogenetic Central Register

The DCCR is stored on a disc-pack when in current use. Safety copies, from which the database can be restored, are made on magnetic tape every time the data are altered in any way, because no matter how careful one is, mistakes or machine errors do occur. The data are organized in an indexed sequential file, which makes it possible to access the data by random reads.

Internal structure of the DCCR data

The records constituting the DCCR are sorted strictly in ascending CPR number order. As seen in Figure 1, the CPR consists of six figures designating the birth date, followed by a four-digit serial number. The record format for persons with chromosome aberrations is outlined in Figure 2. As can be seen from the figure, every record comprises a maximum of five cards, plus possible continuation cards, but the typical record will span only the three cards marked, because cards for which no data are ascertained from the person in question are omitted. This, of course, saves a lot of disc space. The length of the records varies from 90 to 200 characters.

[Figure 2, the record format, is not reproduced here. Legible field labels: CPR, RL, PARISH, OCCUP., ASCT., DEATH, MARITAL STATE; card addresses A1, A2, A7, A8.]

Every card has a card number in column 1 that indicates the type of data stored on the rest of the card. Therefore the computer does not need to investigate more than the first column of a card to know whether to continue on that card or not. The data have been arranged so that information often used together in one program execution is situated on the same card. As previously mentioned, this will make the processing faster.
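The role of the card number in column 1 can be illustrated with a short sketch. The card contents below are invented and the function name is hypothetical; the only properties taken from the text are that column 1 identifies the card type and that cards without data are simply omitted from the record.

```python
def get_card(record_cards, card_no):
    """Return the card whose column 1 equals card_no, or None if it was omitted."""
    for card in record_cards:
        if card[:1] == card_no:      # only column 1 is inspected
            return card
    return None


# A made-up three-card record: the 3-card is absent because nothing was recorded.
record = ["10707614285", "247,XXY", "8070761428"]

karyotype_card = get_card(record, "2")
print(karyotype_card[1:] if karyotype_card else "no karyotype card")
print(get_card(record, "3"))
```

A program that needs only one kind of information can thus skip every other card after looking at a single character.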

Fig. 3. Example of detection of certain errors in the records. Due to a mistake, the 3-card of the first record has been deleted. A8 therefore no longer points to the last card in record 1, but to the first card in record 2. The mistake is detected because the information found there is not equal to the first 9 digits of the CPR number.
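The situation in Fig. 3 can be imitated with a small sketch. The layout is an assumption made for illustration: addresses are counted in characters rather than computer words, each is stored as three digits on the 1-card, `$` stands in for the reserved end-of-card character, and the 8-card repeats the first 9 digits of the CPR. The actual DCCR column layout is not recoverable from the text.

```python
TERM = "$"   # assumed reserved end-of-card character


def build_record(cpr, card2, card3):
    """Assemble a record: a 1-card with addresses, then 2-, 3- and 8-cards."""
    head_len = 1 + 10 + 3 * 3 + 1          # "1" + CPR + A2, A3, A8 + TERM
    c2 = "2" + card2 + TERM
    c3 = "3" + card3 + TERM
    c8 = "8" + cpr[:9] + TERM              # end-of-record card repeats 9 CPR digits
    a2 = head_len
    a3 = a2 + len(c2)
    a8 = a3 + len(c3)
    head = "1" + cpr + f"{a2:03d}{a3:03d}{a8:03d}" + TERM
    return head + c2 + c3 + c8


def record_is_intact(data, start):
    """Follow address A8 and compare what is found there with the CPR."""
    cpr = data[start + 1:start + 11]
    a8 = int(data[start + 17:start + 20])
    pos = start + a8
    return (data[pos - 1:pos] == TERM
            and data[pos:pos + 11] == "8" + cpr[:9] + TERM)


rec1 = build_record("0707614285", "47,XXY", "PARISH 0751 OCCUP 12")
rec2 = build_record("1111111118", "46,XX", "PARISH 0461 OCCUP 03")
data = rec1 + rec2
print(record_is_intact(data, 0), record_is_intact(data, len(rec1)))

# Delete the 3-card of record 1: A8 now points into record 2 and the test fails.
damaged = rec1.replace("3PARISH 0751 OCCUP 12" + TERM, "", 1) + rec2
print(record_is_intact(damaged, 0))
```

Both safeguards appear in the check: the character before the addressed position must be the end-of-card character, and the addressed card must repeat the CPR digits.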

We have developed programs that can interpret the Paris nomenclature directly, and a karyotype is therefore stored according to this nomenclature as readable text. Transforming the karyotype into coded numbers prior to processing is thus not only redundant, but also undesirable because of the decreased readability in print-outs and the risk of coding errors. Should a karyotype occur that overflows the 2-card, although this is rather unlikely, a continuation card will automatically be set up by the computer. This is programmed in preparation for the possibility of more detailed karyotype descriptions in the future within the framework of the Paris nomenclature. The Register will thus never run out of space for any karyotype, irrespective of its length. Furthermore, more than one karyotype can be registered if karyotypes from different tissues are available. The karyotypes and the codes for the tissue are then stored on as many 2-cards as necessary.

The start of a card can vary relative to the start of the record, as the cards do not necessarily have a fixed length. This saves disc space, but raises the problem that the computer could be unable to recognize the start of the individual cards following the 1-card. For security reasons this problem is solved in two ways. The normal processing of the records utilizes the addresses of the beginning of each card, counted in computer words from the start of the record, which are written in the columns marked A2, A3 etc. These addresses, which point to the first column of the cards, enable the computer to obtain each card very fast. Every card is furthermore terminated by a special character reserved for this purpose. It is thus easy to verify that the addresses actually point to the start of their corresponding card.

Part of the CPR number is repeated on the last card as an end-of-record symbol in order to prevent errors of the “reading frame” type. By comparing this number, which is found with the aid of the address of the 8-card, with the CPR, any error of this type can be ruled out. If, for example, the 3-card was deleted by mistake and never replaced (Fig. 3), the computer would use the address A8 to find card 8. Because of the error, this address would not point to the start of card 8, but to some columns in the succeeding record, which would fail the test against the CPR. The error is thus detected.

Data Security

There are two vulnerable points in electronic data processing as far as leaks of confidential information are concerned: 1. the database stored in the computer; and 2. the various EDP printouts kept in offices. The database can often be protected by passwords administered by the operating system, and this protection should of course be utilized fully. It can, however, be evaded by a skilled programmer, and it does not include printouts and often not magnetic tapes either. It is therefore not satisfactory, as a computer register, like an ordinary file with confidential data, should always be secured against any possible misuse. The best solution is to encode only the sensitive parts of the data, i.e. the name and the identification number if official central person registration numbers are being used. If this is done, there is of course no need for any general scrambling program. The encoding must, however, be planned so that it does not complicate the authorized use of the data by making decoding necessary for routine runs. The encoding should thus leave frequently used information in the CPR number unchanged. We have written a scrambling program for the CPR number, and as this number is generally used for identification purposes, for obtaining the year of birth, as well as for designating the sex of the person, it is an obvious advantage that the encoding preserves these features. As seen in Figure 4, the check cipher is not changed, and the year of birth is shifted to the leftmost position, which makes it easy for the computer to access. The other digits are added to one and multiplied by a series of numbers, and the products are
