Intelligent Management of Epidemiologic Data

Fernando Ferni 1,2, Leonardo Meo Evoli 3, Domenico M. Pisanelli 4, Fabrizio L. Ricci 3

1. University of Rome, DIS, V.Salaria 113, 00198 Roma, Italy
2. University of Rome, IV Surgical Clinic, V.le Policlinico, 00185 Roma, Italy
3. CNR, ISRDS, V.C.De Loulis 12, 00185 Roma, Italy
4. CNR, ITBM, V.le Marx 15, 00156 Roma, Italy

Abstract

In the lifecycle of epidemiologic data three steps can be identified: production, interpretation and exploitation for decision making. Computerized support can be valuable, if not indispensable, at each of the three steps, and several epidemiologic data management systems have therefore been developed. In this paper we focus on the intelligent management of epidemiologic data, where intelligence is needed to analyze trends or to compare observed values with reference values and possibly detect abnormalities. After outlining the problems involved in such a task, we present the features of ADAMS, a system built to manage aggregated data and implemented in a personal computer environment.

0195-4210/91/$5.00 © 1992 AMIA, Inc.

1. Introduction

The importance of epidemiologic data for an effective health care policy is recognized worldwide. Such a "unique source of readily-available health status indicators" [6] is also of priceless value for the prevention of diseases, a value acknowledged by the World Health Organization (WHO) and ratified by the twenty-ninth World Health Assembly [13: pp.699-700]. These data are therefore relevant instruments to support decisions in health care management and planning.

The work of the epidemiologist essentially consists of assessing and interpreting data. In the lifecycle of these data three steps can be identified: production, interpretation and exploitation for decision making. Computerized support can be valuable, if not indispensable, at each of the three steps, and this has led to the development of several epidemiologic data management systems. Most of those systems are actually database management tools integrated with statistical functions [8]; others incorporate some functions for the management of knowledge in the domains of medicine, epidemiology and health care in order to achieve a better exploitation of data for health-care planning and epidemiologic research (see for instance [2]). These systems seem to have a rather preparatory character, as they do not provide direct support for this kind of action.

What is really needed is an intelligent management of data in order to analyze trends or to compare observed values with reference values and possibly detect abnormalities. Knowledge-based techniques can be employed to represent statistical rules [4] (e.g.: how to merge two tables with average values), epidemiologic procedural knowledge (e.g.: the steps to follow to perform an analysis) and epidemiologic heuristic knowledge (e.g.: which kind of analysis yields the most significant result in a given situation).

TEA (Tools for Epidemiologic Assistance) is a research project of the National Research Council which deals with the issue of "intelligent" management of epidemiologic data, covering the broad spectrum of these problems [5]. We use the term "intelligent" to denote both a high-level management of data (macrodata management) and knowledge-based techniques.

In this paper we outline the problems involved in such a task and briefly present the features of ADAMS, a system built in this framework to manage aggregated data. In the second section we emphasize the differences between common database management systems and statistical database management systems and introduce the basic issues, particularly the well-known dichotomy between microdata and macrodata (i.e. elementary vs. aggregated data). The third section points out the utility of managing data only at the aggregated level, without considering microdata. This led to the design of a conceptual model to handle macrodata: GRASS*, which is the groundwork for implementing software tools for macrodata management. One of these tools, the ADAMS system, is described in the fourth section. Its usefulness as a decision aid and the overall relevance of the TEA project for health care are discussed in the final section.
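As an illustration of the kind of statistical rule mentioned in this introduction (merging two tables with average values), the sketch below combines two group averages. It is not taken from TEA or ADAMS: the function name `merge_averages` and the figures are hypothetical. The point it shows is that averages cannot simply be averaged again; each one must be weighted by the number of observations behind it.

```python
def merge_averages(avg_a, n_a, avg_b, n_b):
    """Return the average of the pooled group, given two group averages
    and the number of observations each average is based on."""
    if n_a + n_b == 0:
        raise ValueError("cannot merge two empty groups")
    # Weight each average by its group size before dividing by the total.
    return (avg_a * n_a + avg_b * n_b) / (n_a + n_b)


# Hypothetical figures: 30 patients averaging 10.0 days, 10 averaging 20.0.
merged = merge_averages(10.0, 30, 20.0, 10)
naive = (10.0 + 20.0) / 2  # ignores group sizes and overstates the result
```

A table that stores only averages, without the underlying counts, therefore cannot be merged correctly, which is one reason such rules are worth representing explicitly.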


2. Peculiarities of Statistical Data Bases

Over the last years there has been a growing interest in applying a database theoretic approach to Statistical Databases (SDBs) [7]. In the literature this term refers to a kind of database representing statistical or summary information. Examples of SDBs are found in several domains pertaining to socio-economic issues (e.g.: population censuses, health statistics, energy production), commercial applications (e.g.: marketing projections, financial reports) and epidemiology. The relevance of SDBs in applied informatics is constantly growing because of their fundamental role as decision support in statistical analyses.

SDBs address theoretical issues not completely covered by conventional DB theory and, in this respect, two main differences can be emphasized [10]:

a) the data structure: most of the existing models for conventional DBs support merely simple data structures, such as "relations", whereas SDBs need to support complex data structures;
b) the data manipulation: boolean operations or logical associations between data are not of primary importance to statisticians; in fact, the most common manipulation is related to the encoding of data, or to the reclassification of the descriptive data.

Wong has singled out two broad classes of SDBs: micro SDBs (mSDBs) and macro SDBs (MSDBs) [14]. The former contain microdata (mD) (i.e.: records of individual entities or events), whereas the latter contain macrodata (MD) resulting from the application of statistical aggregate operations to individual data. An example of mSDB is a hospital admission register collecting individual events such as the reason for admission for each incoming patient. A corresponding MSDB could show the overall number of admissions by age, sex and geographical origin of the patient.

Conceptual modeling of microdata does not involve problems peculiar to SDBs. Our experience confirms that the well-established Entity-Relationship (ER) model [1] used for conventional DB modeling is perfectly suited for the task, especially if we employ extended models (EER) allowing, among other things, the representation of the generalization abstraction.

Macrodata are a different story, having a more complex data structure. They do not concern single events but rather trace out trends associated with phenomena such as mortality due to viral hepatitis according to age and sex, or the distribution of viral hepatitis according to countries and years.

In each statistical table (ST) where macrodata appear, we distinguish between the statistical data themselves and the data used to describe them, called metadata. Within metadata we identify a set of category attributes (also called variables) (e.g. "sex"); each of them is associated with a set of values (also called modalities) constituting the domain of the table variable (e.g. "male", "female"). Finally, the data type identifies the aggregate function generating the values contained in the ST (e.g. "count", "average",...). To clarify this peculiar nomenclature, we report a simple example of a statistical table in figure 1.

[Figure 1: a statistical table annotated with its summary attribute, its category attributes and its summary values.]

Figure 1. A simple example of a statistical table (data type is "average"; source is [12]).

In conclusion, the complex structure of statistical tables requires more sophisticated data models. In the next section we emphasize the relevance of this modeling task for epidemiologic investigation.
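The relation between microdata, macrodata and metadata can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the admission records, the attribute names and the `aggregate` function are hypothetical and do not come from the paper. The function derives a statistical table from individual records and keeps the metadata (category attributes, their domains and the data type) next to the summary values.

```python
from collections import defaultdict

# Hypothetical microdata: one record per hospital admission.
admissions = [
    {"sex": "male", "age_class": "0-14", "stay_days": 3},
    {"sex": "male", "age_class": "0-14", "stay_days": 5},
    {"sex": "female", "age_class": "0-14", "stay_days": 4},
    {"sex": "female", "age_class": "15-64", "stay_days": 7},
    {"sex": "female", "age_class": "15-64", "stay_days": 9},
]


def aggregate(records, category_attributes, summary_attribute, data_type):
    """Build a statistical table (macrodata plus metadata) from microdata."""
    groups = defaultdict(list)
    for record in records:
        # One cell per combination of modalities of the category attributes.
        key = tuple(record[a] for a in category_attributes)
        groups[key].append(record[summary_attribute])

    if data_type == "count":
        values = {key: len(items) for key, items in groups.items()}
    elif data_type == "average":
        values = {key: sum(items) / len(items) for key, items in groups.items()}
    else:
        raise ValueError(f"unsupported data type: {data_type}")

    return {
        "category_attributes": list(category_attributes),
        # Domain of each variable: the modalities observed in the records.
        "domains": {a: sorted({r[a] for r in records}) for a in category_attributes},
        "summary_attribute": summary_attribute,
        "data_type": data_type,
        "values": values,
    }


table = aggregate(admissions, ["sex", "age_class"], "stay_days", "average")
```

Here `table["values"]` holds the macrodata, while the other entries are the metadata needed to interpret and later manipulate them.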

3. Toward a high-level management of data

Very often an accurate evaluation of epidemiologic data involves transforming tables into formats more suitable for statistical analysis. A typical case is that of a user who has to compare locally collected macrodata with a reference table having a different structure. To cope with this mismatch, the tables must be "reshaped". The need for reshaping is the essential motivation for a high-level management of macrodata. To perform this high-level management we need a formal representation of macrodata by means of a model. We adopted GRASS* to implement our tools [9]. Although a formal review of this model is outside the scope of this paper, it can be helpful to give its "flavour" with an example. In figure 2 we report a set of tables whose features are graphically represented according to the GRASS* paradigm. In particular we have four kinds of nodes:

S-nodes, semantically grouping STs or other S-nodes concerning the same phenomenon or the same part of the described reality (e.g.: epidemiologic data on viral hepatitis).
T-nodes, denoting the summary attribute of a ST (e.g.: job distribution).
A-nodes, denoting the aggregation (as defined in [11]) of the category attributes which describe the statistical entity. An A-node represents the Cartesian product of all the modality instances of the ascending category attributes (e.g.: given countries * years, all the possible pairs of instances of countries and years). The A-nodes can be organized at various levels of aggregation.
C-nodes, denoting a single category attribute which describes a ST (e.g.: age).

Figure 2. Statistical tables represented with the GRASS* model.

This model captures the static aspect of the tables and is the starting point for the user to perform dynamic manipulation of tables, i.e. high-level management of macrodata.

4. The ADAMS system

The fundamental manipulations on STs are: summarization, which aggregates the data of a ST by eliminating a variable, and reclassification, which substitutes a variable in a ST. For example, starting from:

(STa) "Disease Distribution by countries-years"

we can summarize on the variable "years", getting:

(STb) "Disease Distribution by countries"

then we can use the relation:

(r1) countries --> continents

in order to reclassify STb into:

(STc) "Disease Distribution by continents"

ADAMS (Aggregated DAta Management System) [3] allows statistical tables to be managed following two interaction paradigms: one based on a keyword language and the other on visual interaction. The system runs on a Macintosh II and has been developed in the MPW environment in object-oriented Pascal using the MacApp program library. The window-based interface (figure 3) consists of: 1) the GRASS window, showing the logical structure schema of the SDB by means of the GRASS* model and employing the same graphic formalism shown in figure 2; 2) the QUERY-GRAPH window, where manipulation is possible by means of visual interaction; 3) the STAQUEL window, for keyword-based interaction; 4) the GUIDE window, which displays help messages.

Figure 3. The interface of ADAMS.

One can use the STAQUEL language by typing a query aimed at obtaining the desired table. It is conceived for trained users, allowing quick interaction but requiring experience with the proper syntax. Otherwise, queries can be formulated as a graph whose nodes represent manipulations and whose leaves represent arguments. Queries are edited interactively by selecting objects under the control of the system, which checks the syntax and the semantics of the query in progress. User errors are prevented by enabling or disabling the icons representing data (GRASS graph nodes) or manipulations (operators). On the screen, enabled icons are colored and disabled icons are shaded. This interaction technique does not require the keyboard: enabled objects can be selected with the mouse.
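The two fundamental manipulations, summarization and reclassification, can be sketched in a few lines of Python. This is a minimal illustration and not the ADAMS implementation or the STAQUEL language: the table layout, the function names `summarize` and `reclassify`, the counts and the country-to-continent mapping are all assumptions made for the example, and only the "count" data type is handled.

```python
from collections import defaultdict

# Hypothetical STa: disease counts by (country, year).
st_a = {
    "variables": ("countries", "years"),
    "data_type": "count",
    "values": {
        ("Italy", 1989): 120, ("Italy", 1990): 100,
        ("France", 1989): 80, ("France", 1990): 90,
        ("Japan", 1989): 60, ("Japan", 1990): 50,
    },
}

# Hypothetical relation r1: countries --> continents.
r1 = {"Italy": "Europe", "France": "Europe", "Japan": "Asia"}


def _check_count(table):
    # Counts can be added up; other data types would need their own rule.
    if table["data_type"] != "count":
        raise ValueError("this sketch only handles the 'count' data type")


def summarize(table, variable):
    """Eliminate `variable`, adding up the counts of the cells that merge."""
    _check_count(table)
    idx = table["variables"].index(variable)
    values = defaultdict(int)
    for key, count in table["values"].items():
        values[key[:idx] + key[idx + 1:]] += count
    kept = tuple(v for v in table["variables"] if v != variable)
    return {"variables": kept, "data_type": "count", "values": dict(values)}


def reclassify(table, variable, new_variable, relation):
    """Substitute `variable` with `new_variable` by mapping its modalities."""
    _check_count(table)
    idx = table["variables"].index(variable)
    values = defaultdict(int)
    for key, count in table["values"].items():
        new_key = key[:idx] + (relation[key[idx]],) + key[idx + 1:]
        values[new_key] += count
    renamed = tuple(new_variable if v == variable else v
                    for v in table["variables"])
    return {"variables": renamed, "data_type": "count", "values": dict(values)}


st_b = summarize(st_a, "years")                         # by countries
st_c = reclassify(st_b, "countries", "continents", r1)  # by continents
```

Both operations work only on the aggregated table and its metadata; the individual records behind the counts are never consulted.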


To give an example of the functionality of ADAMS, let us show the steps followed by a user to transform table STa into table STc by performing summarization and reclassification. In the case of keyboard interaction, the only thing to do is to type a query in the STAQUEL window. In the other case, the user selects the summarization operator from the "operators" menu, then (s)he selects the C-node "years" and the T-node "Disease distribution" in the GRASS window to obtain table STb. If (s)he is interested in a table containing aggregated data on population distribution by continents, (s)he can reclassify the result of the summarization (STb) by means of the reclassification operator in the menu, obtaining STc. The relation used (r1) must be selected from an appropriate list. The query-graph used for this last modification is shown in its window in figure 3. Edited queries are displayed in both the STAQUEL and the QUERY-GRAPH windows regardless of the type of interaction, and the user can change interaction technique while formulating the query.

The system allows tables to be exported in a spreadsheet format, so that all the spreadsheet functionalities usually employed by PC users are available. Furthermore, all the most important statistics packages can be used on these formats. The printing format of STs may be specified by means of a printing editor (figure 4). Subsequently the system exports modalities and values in a spreadsheet format (figure 5).

Figure 4. The printing editor of ADAMS.

Figure 5. The export of data in a spreadsheet format.

ADAMS provides two different interaction techniques in order to satisfy different user profiles. It is conceived to be used directly by epidemiologists after a period of training on the system. As a research prototype, it has not been released yet. Physicians at the IV Surgical Clinic of the University of Rome are experimenting with the interaction paradigms, assisted by one of the system's designers (F.F.). Preliminary results confirm that visual interaction is more intuitive and easier to use than keyboard interaction, which is preferred for its conciseness by people who use the system more systematically. Although obscure "at first sight", STAQUEL syntax can become the most rapid and profitable way of using ADAMS once sufficient training is attained. On the other hand, the effective impact of visual manipulation on novice users encourages us to emphasize and improve this interaction technique.

ADAMS has been designed to help in defining the context of statistical analysis and in identifying data of significance for working hypotheses. In fact, when such data are scattered over various statistical tables, the system offers the possibility of drawing up one single table containing only the data considered relevant. Epidemiologic investigation is further eased by the fact that data are shown in the format chosen by the user and exported in a spreadsheet format compatible with the most popular software packages. This makes it possible to exploit common visual presentation features such as histograms or pie charts.
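The export step can be illustrated with a small Python sketch. The paper does not state which spreadsheet format ADAMS produces, so the use of comma-separated values here is an assumption, as are the function name `export_table` and the sample counts. The sketch lays out a two-variable statistical table with the modalities of one variable as rows and those of the other as columns, followed by the summary values.

```python
import csv
import io


def export_table(values, row_variable, out):
    """Write a two-variable table as CSV: one row per modality of the first
    variable, one column per modality of the second variable."""
    rows = sorted({row for row, _ in values})
    cols = sorted({col for _, col in values})
    writer = csv.writer(out, lineterminator="\n")
    # Header line: name of the row variable, then the column modalities.
    writer.writerow([row_variable] + cols)
    for row in rows:
        # Cells with no value in the table are left empty.
        writer.writerow([row] + [values.get((row, col), "") for col in cols])


# Hypothetical table: disease counts by (continent, year).
values = {
    ("Europe", 1989): 200, ("Europe", 1990): 190,
    ("Asia", 1989): 60, ("Asia", 1990): 50,
}

buffer = io.StringIO()
export_table(values, "continents", buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file object instead would produce a file that a spreadsheet or statistics package can read.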

5. Conclusions

An appropriate policy of prevention and resource management in health care is not possible without the support of epidemiology. Intelligent management of epidemiologic data aims at being helpful in the following phases: 1) in the process leading to the production of epidemiologic data; 2) in the analysis and interpretation of these data; 3) in exploiting the information obtained from these data.


The tools conceived in the TEA framework, like ADAMS, are active in all the steps of the epidemiologic data "lifecycle": production, examination and usage. In particular, with respect to these phases: 1) they help the user in building macrodata tables (aggregated data) starting from microdata (disaggregated data); 2) at the aggregated level, ADAMS allows the examination of a given table (e.g.: local values) by generating the reference table (e.g.: national values) with the same structure; 3) statistical knowledge-based techniques can act as valid tools for extracting information from data, providing useful decision support for epidemiologists, health care managers and planners. Tools for intelligent management of data aim at contributing to the quality of health care by supporting the tasks that epidemiology traditionally carries out for the prevention of diseases and the early detection of socially dangerous morbid events.

References

[1] Chen PPS, "The Entity Relationship Model: Toward a Unifying View of Data", ACM Transactions on Database Systems, 1 (1), 1976.
[2] De Rosis F, Pizzutilo S, Greco D, "MICROIDEA: Improving Decisions in Epidemiological Analysis by a Microcomputer", Medical Informatics, 11 (3), 1986.
[3] Ferni F, Grifoni P, Meo-Evoli L, Ricci FL, "ADAMS: Data-base Management as a Decision Support System for Epidemiology", to appear in: Proceedings Medical Informatics Europe '91, Berlin et alibi, Springer Verlag, 1991.
[4] Falcitelli G, Meo-Evoli L, Nardelli E, Ricci FL, "The Mefisto* Model: an Object Oriented Representation for Statistical Data Management", in: Data Analysis, Learning Symbolic and Numeric Knowledge, Nova Science Publishers, 1989.
[5] Falcitelli G, Pisanelli DM, Ricci FL, "Intelligent Databases for Medical Statistical Analysis", in: Proceedings IEEE Engineering in Medicine and Biology Conference, Seattle, 1989.
[6] Kleinman JC, "The Continued Vitality of Vital Statistics", editorial, American Journal of Public Health, 72 (2), 1982.
[7] Michalewicz Z, "Proceedings of the Symposium on Scientific and Statistical Database Management", Lecture Notes in Medical Informatics, 420, Berlin et alibi, Springer Verlag, 1990.
[8] Ozsoyoglu G, Ozsoyoglu ZM, Matos V, "Extending Relational Algebra and Relational Calculus with Set-Valued Attributes and Aggregate Functions", ACM Transactions on Database Systems, 12 (4), 1987.
[9] Rafanelli M, Ricci FL, "Proposal of a Logical Model for Statistical Data Bases", in: Proceedings of the Second International Workshop on Statistical Database Management, Los Altos, California, 1983.
[10] Shoshani A, "Statistical Databases: Characteristics, Problems and Solutions", in: Proceedings Eighth Very Large Data Bases, Mexico City, Mexico, 1982.
[11] Smith JM, Smith DCP, "Database Abstraction: Aggregation and Generalization", ACM Transactions on Database Systems, 2 (2), 1977.
[12] Tooley JA, Carle LA, US News & World Report, 1989.
[13] World Health Organization, International Classification of Diseases. Manual of the International Statistical Classification of Diseases, Injuries and Causes of Death, Vols. 1 and 2, 9th Revision, Geneva, WHO Press, 1977.
[14] Wong HKT, "Micro and Macro Statistical / Scientific Database Management", in: Proceedings First IEEE International Conference on Data Engineering, Los Angeles, 1984.
