Rheumatol Int DOI 10.1007/s00296-014-2954-x

Review Article

Databases and registers: useful tools for research, no studies

Rafael J. Curbelo · Estíbaliz Loza · Maria Jesús García de Yébenes · Loreto Carmona

Received: 30 September 2013 / Accepted: 27 January 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract  There are many misunderstandings about databases. Database is a commonly misused term in reference to any set of data entered into a computer. However, true databases serve a main purpose, organising data. They do so by establishing several layers of relationships; databases are hierarchical. Databases commonly organise data over different levels and over time, where time can be measured as the time between visits, or between treatments, or adverse events, etc. In this sense, medical databases are closely related to longitudinal observational studies, as databases allow the introduction of data on the same patient over time. Basically, we could establish four types of databases in medicine, depending on their purpose: (1) administrative databases, (2) clinical databases, (3) registers, and (4) study-oriented databases. But a database is a useful tool for a large variety of studies, not a type of study itself. Different types of databases serve very different purposes, and a clear understanding of the different research designs mentioned in this paper would prevent many of the databases we launch from being just a lot of work and very little science.

Keywords  Databases · Registries · Cohorts

R. J. Curbelo · E. Loza · M. J. G. de Yébenes · L. Carmona (*)
Instituto de Salud Musculoesquelética, INMUSC, Ofelia Nieto, 10, 28039 Madrid, Spain
e-mail: [email protected]

R. J. Curbelo
Physiotherapy School, University of Valladolid, Valladolid, Spain

It often happens that the only point on which researchers agree in a given project is the need to collect data. One constantly hears about databases or registers for specific diseases or for specific treatments. And at any meeting, one learns of more databases that are being used for investigative purposes and treated as if they were observational studies themselves. There are indeed many misunderstandings about databases. The main one is the complete neglect of the first principle of data collection, namely, that we should not collect data just for the sake of it, without explicit and specified objectives [1].

What is a database?

Database is a commonly misused term in reference to any set of data entered into a computer. Actually, most of the time we are talking about datasets or flat-type data: spreadsheets with columns that identify variables or fields and rows that identify patients or study units. A database is a more complex collection of interrelated data without unnecessary redundancy and independent of the software used [1].

A database serves a main purpose, to organise data. It does so by establishing several layers of relationships; databases are hierarchical. Data are entered in different tables accounting for the different levels, which, most crucially, are all interrelated. In clinical databases, the main level is usually the subject/patient, and the secondary levels may be visits, hospital stays, treatments, or adverse events, all of which can be repeated over time, either periodically or as they spontaneously occur (Table 1).

An additional key point in defining a database is the notion of time. Databases commonly organise data over different levels and over time, where time can be measured as the time between visits or between treatments or adverse events, etc. In this sense, medical databases are closely related to longitudinal observational studies, as databases allow the introduction of data on the same patient over time.

Table 1  Database: key points
Useful tool for a large variety of studies, not a type of study
Organises the data, creating levels of hierarchical relationship
Takes into account the notion of time
Source of tables for further statistical analysis

A dataset, on the other hand, is a single table. An example of a dataset would be the data collected in a cross-sectional study, for example in a health survey, or in a case–control study, or, as happens in most retrospective cohorts, when data are entered into a spreadsheet or directly into the statistical file. Subjects or patients have just one visit in which data are recorded, regardless of when the measure of interest occurred, that is, on the day of the interview or in the past. Even when the data are recorded with a specific date (e.g. the date when the disease was diagnosed) and thus there is a reference to time, this does not mean that data are actually recorded “over time.” The “over time” concept has a prospective sense, with sequential episodes of data collection at separate time points.

Sometimes, it may be convenient to enter the data not directly into the spreadsheet, but into user-friendly screens that may give the researcher the impression that several levels or sets of information actually exist, while underneath all the information may be recorded in just one table. In this case, we may still have a dataset, not a database.
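The distinction between a database and a dataset can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and two hypothetical tables, patient (main level) and visit (secondary level); the table names, columns, and values are invented for illustration and do not come from any real register.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two inter-related tables.

    Table and column names are hypothetical, chosen only for illustration.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE patient (              -- main level: one row per patient
            patient_id INTEGER PRIMARY KEY,
            sex        TEXT,
            birth_year INTEGER
        );
        CREATE TABLE visit (                -- secondary level: repeated over time
            visit_id   INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
            visit_date TEXT,
            score      REAL
        );
        """
    )
    con.executemany(
        "INSERT INTO patient (patient_id, sex, birth_year) VALUES (?, ?, ?)",
        [(1, "F", 1960), (2, "M", 1975)],
    )
    con.executemany(
        "INSERT INTO visit (patient_id, visit_date, score) VALUES (?, ?, ?)",
        [(1, "2013-01-10", 5.1), (1, "2013-07-12", 3.8), (2, "2013-02-03", 4.4)],
    )
    con.commit()
    return con


con = build_example_db()
n_patients = con.execute("SELECT COUNT(*) FROM patient").fetchone()[0]
n_visits = con.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
# Number of visits recorded for each patient, keyed by patient_id.
visits_per_patient = dict(
    con.execute(
        "SELECT patient_id, COUNT(*) FROM visit GROUP BY patient_id"
    ).fetchall()
)
```

Sex and birth year are stored once per patient, while visits accumulate in their own table and point back to the patient through patient_id. A spreadsheet holding the same information in a single table would be a dataset in the sense described above.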

Analysing databases

Databases can be explored with a predetermined objective, and the data obtained can then be analysed. The practical truth is that information can only be analysed statistically if presented in a table, regardless of its length. This implies that the database must be queried and the inter-related data from different tables compiled into a single table or dataset (see Fig. 1).

In tables obtained from inter-related data, several records or rows may correspond to the same individual. This is what is usually referred to as long format. One must be aware of this when doing the analysis, as data in different rows are often interpreted as independent when they actually relate to the same patient. For some specific analyses, the statistician may need each row to correspond to one patient, or to whatever the main level of the database is, with each secondary level (visit, treatment, etc.) identified by a numerical index (visit 1, visit 2, treatment 1, treatment 2, etc.). This is called wide format, and some statistical packages can convert back and forth between it and the long format (see Fig. 1).
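The conversion between long and wide format can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the variable names (age, drug), and the drug codes are hypothetical, and the reverse conversion assumes that no variable name itself ends in a digit. Statistical packages provide their own reshape commands, which are not reproduced here.

```python
from collections import defaultdict

# Hypothetical long-format table: one row per visit, several rows per patient.
long_rows = [
    {"patient_id": 1, "visit": 1, "age": 52, "drug": "MTX"},
    {"patient_id": 1, "visit": 2, "age": 53, "drug": "MTX"},
    {"patient_id": 2, "visit": 1, "age": 38, "drug": "SSZ"},
]


def long_to_wide(rows, id_key="patient_id", index_key="visit"):
    """Return one row per patient, with variables suffixed by visit number."""
    wide = defaultdict(dict)
    for row in rows:
        record = wide[row[id_key]]
        record[id_key] = row[id_key]
        for name, value in row.items():
            if name not in (id_key, index_key):
                record[f"{name}{row[index_key]}"] = value  # e.g. age1, drug2
    return [wide[key] for key in sorted(wide)]


def wide_to_long(rows, id_key="patient_id", index_key="visit"):
    """Return one row per patient and visit (assumes stems do not end in digits)."""
    long = []
    for row in rows:
        by_index = defaultdict(dict)
        for name, value in row.items():
            if name == id_key:
                continue
            stem = name.rstrip("0123456789")   # "age2" -> "age"
            index = int(name[len(stem):])      # "age2" -> 2
            by_index[index][stem] = value
        for index in sorted(by_index):
            long.append({id_key: row[id_key], index_key: index, **by_index[index]})
    return long


wide_rows = long_to_wide(long_rows)
# Rows are not independent observations: 3 rows, but only 2 patients.
n_rows = len(long_rows)
n_patients = len({row["patient_id"] for row in long_rows})
```

Comparing n_rows with n_patients before any analysis is a quick check on the warning above: when the two differ, rows belonging to the same patient must not be treated as independent.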

Fig. 1  A database consists of a repository of inter-related tables, as seen in the upper panel, each table being related to others by an index item or field. To analyse a database, one must retrieve tables by querying the database. This produces tables that contain repeated items. When data are inter-related in tables, records may be repeated (long format), as in the right panel, with each record corresponding to a visit and a patient spread over several records. Although many clinicians tend to enter the data of the same patient in a single row (not a database), analysis is not as straightforward as in the long format. When every patient occupies a single row (wide format; left panel), the information from each visit must be represented by a different variable and numeral (e.g. visit1, visit2, age1, age2, drug1, drug2…)

Types of databases

Basically, we could establish four types of databases in medicine, depending on their purpose: (1) administrative databases, (2) clinical databases, (3) registers, and (4) study-oriented databases (see Fig. 2).

Administrative databases are set up with the specific purpose of managing the organisational and economic aspects of a given population. They allow assessing and forecasting the workforce needs of the organisation or system, monitoring processes, allocation of resources, logistic needs, insurance monitoring, costs, etc. In some cases, variables that can be obtained from administrative databases behave as good surrogate measures for other outcomes (e.g. hospital admissions for serious events), and there are plenty of examples in the scientific literature in which these databases are used to estimate the prevalence of certain diseases or to explore research hypotheses [2, 3]. They are mainly used when a rapid response to a clinical question is needed, but one must acknowledge their limitations. They are usually very complex and not prepared for all clinical questions, and their use requires a thorough knowledge of their structure and definitions [4]. An advantage is that they include large numbers of persons, sick or not, which clearly increases the statistical power for questions on rare outcomes, for instance. Also, a pertinent use of these databases is to cross-check the data included in clinical databases, for example to confirm death and cause of death or work status, as well as drug consumption.

Quite often, physicians record patients’ characteristics and evolution directly from their practice in an attempt to monitor health care practices. The result is the establishment of clinical databases. These are useful to retrieve patients’ reports and individual clinical histories, and are a good way to assess the individual practice composition of a department or a clinic [5]; however, many researchers use these clinical databases to answer research questions or, even more, to explore associations. Many of these databases are quite complex and collect important outcome measures that may give the illusion of a real research tool. However, they may have important biases that must be considered when testing research hypotheses. The patients included have not been selected in a random manner, which precludes the use of statistical instruments for formal hypothesis testing. They might also pose problems with regard to how representative the sample is, which cannot be easily detected. And most important, it is very difficult to record all variables with the expected quality of a main outcome variable if one does not know ahead of time what the main outcome variable will be, without a clear definition or a study protocol [6, 7]. Related challenges are selection and detection biases, missing data, associations by chance, post hoc hypotheses, etc. These are exactly the same problems that any retrospective study based on the review of clinical charts faces, but with the delusion of a computerised database that looks more scientific. A different issue is to establish a cohort of patients with specific research objectives and a protocol, for which researchers collect data in the clinic with an investigative purpose [8].

Registers are sometimes confused with clinical databases. A true register is a system that captures new elements, these elements being incident cases of a disease or incident treatments with a specific drug. The timing for entering data in a register is not pre-planned at the start; it depends on when a new case is identified or a new treatment is started. Disease registers are structured around geographical areas and units, and involve active investigation to identify cases and to notify them [9, 10]. Databases from disease registers make it possible to explore and identify disease clusters, to generate risk and prognostic hypotheses and related studies, and to manage research. These registers play a very important role in the epidemiological research of rare and deadly diseases [11]. In a drug register, the pace of data entry is given by the initiation of new treatments, with a target drug or therapeutic group, in patients who meet the inclusion criteria. The main purpose of such registers is the identification of adverse events not shown in randomised controlled trials and of rare long-term adverse events that may be related to certain drugs [12]. Drug registers are usually used to select patients for cohort or case–control studies.




Fig. 2  Types of databases (according to the purpose and design of the study it serves). The pyramid denotes how solid a database is. A database that has been built for a specific study is usually a very solid one and easy to analyse. Registries try to build databases that are close to study-oriented databases and therefore can be solid as well, although they have many missing data and are difficult to analyse. Clinical databases are very similar to registers, but more problematic (missing data, “dirty” data, unclear definitions, biases…). The most problematic databases are administrative databases, unless we are interested in administrative data, as the health data are usually of unclear quality

The databases with the greatest utility for research are the study-oriented ones. Cohorts use databases in studies set up for testing specific hypotheses. Sensu stricto, cohorts are set up to test aetiological hypotheses, that is, the association between exposure and outcome (e.g. disease occurrence), but they can also be assembled to test prognosis, that is, the association between patient or environment features at the beginning of follow-up and outcome (e.g. mortality or poor outcome), or to test resource-use hypotheses, that is, the association between context and patient characteristics and resource use. Patients or subjects in a cohort must be sampled randomly to be representative [13]. The timing of the data entry is pre-established in visits or examinations at a given interval (periodical). Interestingly, many biologics registers in rheumatology are actually set up like cohorts [14–16]. In this sense, registers can be called studies, although they would better be called cohorts.

Some terms related to cohorts may need a reminder. First of all, an inception cohort is one formed from all the new (incident) cases in a given population. In some studies, inception cohort is used to refer to a sample of patients in the same stage of disease evolution, usually at the beginning, close to diagnosis [17].

Cohorts allow nested case–control studies. There are clear advantages of using a cohort for a case–control study: the controls are truly representative of the target population and have been selected at random, in the same way as the cases, allowing for valid extrapolation of results to the population from which the cohort is sampled.

Open cohorts allow for permanent incorporation of new subjects, either newcomers to the general population area or new cases in a cohort of patients. Open cohorts allow the study of the simultaneous influence of calendar time, age (duration), and cohort (onset) in demography, epidemiology, and clinical follow-up, in what is called a Lexis diagram [18] (see Fig. 3). The average age may remain stable over time, as young patients keep being added to the cohort. Closed cohorts are usually set up for a specific research objective, for which a sample size has been calculated, and recruitment is finished after a given period of time. In closed cohorts, subjects become older with follow-up.

Permanent surveys are a type of cohort in which the examination of the same type of subjects over time is established in repeated cross-sectional surveys. They are actually large open cohorts, which are monitored over such a long time that investigators, as well as research questions, change at different time points, e.g. the Framingham Heart Study [19].

Finally, randomised controlled trials are in fact cohorts in which the hypothesis being tested is the efficacy of a treatment or an intervention controlled by the investigator. They differ from observational cohorts by the existence of a planned intervention on subjects by design.

Other study designs that use study-oriented databases are called longitudinal observational studies. Prognosis and resource-use studies will only be considered cohorts when the sample is selected in a random fashion and when the hypothesis is pre-established before the launch of the cohort. In the case of prognostic studies, the hypothesis would be to study the link between the exposure and potential prognostic factors (i.e. determinants) and the rate of occurrence of a given outcome. In the second case, the rate of use of a given resource or a hypothetical cost during a specified time is the focus of interest for description and for the search for determinants.

Finally, another type of database is the repository. Repositories are databases that keep patients’ samples, data, or any material, including studies or research instruments, that can be further used for research by running the proper queries and linking them to more elaborate data.

Fig. 3  A typical Lexis diagram representing an open cohort. Each individual is represented by a 45° line. Participants enter the study at different time points (T calendar time), usually when they become incident cases. Then, they are followed up until they have an event, are lost to follow-up, or die, during a period of time (A time in study) that may differ from participant to participant
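The geometry of the Lexis diagram for an open cohort (Fig. 3) can be sketched as follows. This is a minimal illustration, assuming a hypothetical Participant record with entry and exit dates expressed as decimal calendar years; the three participants are invented. The person-years total is a standard epidemiological quantity added here for illustration and is not computed in the text.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical follow-up record for one member of an open cohort."""
    entry_year: float  # calendar time at entry (T axis)
    exit_year: float   # calendar time at event, loss to follow-up, or death
    had_event: bool

    @property
    def time_in_study(self) -> float:
        """Follow-up duration (A axis), in years."""
        return self.exit_year - self.entry_year

    def lexis_segment(self):
        """Endpoints (T, A) of the participant's line in the Lexis diagram.

        Calendar time and time in study advance together, so the slope is 1,
        which is the 45-degree line of the diagram.
        """
        return ((self.entry_year, 0.0), (self.exit_year, self.time_in_study))


# Participants enter at different calendar times, as in an open cohort.
cohort = [
    Participant(entry_year=2005.0, exit_year=2009.0, had_event=True),
    Participant(entry_year=2006.5, exit_year=2010.0, had_event=False),
    Participant(entry_year=2008.0, exit_year=2010.0, had_event=False),
]

segments = [p.lexis_segment() for p in cohort]
person_years = sum(p.time_in_study for p in cohort)
events = sum(p.had_event for p in cohort)
```

Each segment starts on the calendar-time axis at the participant's entry date and rises at 45° until exit, so the length of follow-up differs from participant to participant, as the caption of Fig. 3 describes.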

Databases are not studies

Databases whose objective is merely to collect data are useless. Those who work in clinical research units supporting research in clinical centres can all recall colleagues who come to them with datasets and want them to build a clinical question and a study around their data [20]. It is their discouraging duty to explain that clinical databases, like computerised medical records, are a clinical tool, not studies. Studies start with a research question and then build on a protocol to answer the question; databases are not always created on these grounds, only the study-oriented ones. Keeping track of your patients is not the same as running a study.

Nevertheless, an acceptable use of many databases would be to identify patients with a given characteristic, to use them to draw a random sample, and to test a research hypothesis prospectively, always bearing in mind the selection bias inherent to any sample drawn from a clinical database, which limits the validity of the results obtained [21]. Another fruitful use of a database is the possibility to select incident or prevalent cases for a specific study, and also controls, by filtering the patients by specific characteristics or selection criteria. When doing this, we should bear in mind that there may be considerable diagnostic errors and that we may need a secondary validation of selection criteria.

In brief, we should bear in mind the meaning and use of databases. As has been highlighted, a database is a useful tool for a large variety of studies, not a type of study itself. Different types of databases serve very different purposes, and a clear understanding of the different research designs mentioned in this paper would prevent many of the databases we launch from being just a lot of work and very little science.

Conflict of interest  None.

References

1. Hulley SB, Cummings S, Browner WS, Grady DB, Newman TB (2007) Designing clinical research, 3rd edn. Lippincott Williams & Wilkins, Philadelphia
2. Ludwig KA, Kosinski LA (2013) The risk-benefit ratio. Does the administrative database help? JAMA Surg 148(4):322
3. Beaudet N, Courteau J, Sarret P, Vanasse A (2013) Prevalence of claims-based recurrent low back pain in a Canadian population: a secondary analysis of an administrative database. BMC Musculoskelet Disord 14:151
4. Patkar NM, Curtis JR, Teng GG, Allison JJ, Saag M, Martin C, Saag KG (2009) Administrative codes combined with medical records based criteria accurately identified bacterial infections among rheumatoid arthritis patients. J Clin Epidemiol 62(3):321–327, 327 e321–327
5. Wolfe F (1999) Critical issues in longitudinal and observational studies: purpose, short versus long term, selection of study instruments, methods, outcomes, and biases. J Rheumatol 26(2):469–472
6. Silman A, Symmons D (1999) Reporting requirements for longitudinal observational studies in rheumatology. J Rheumatol 26(2):481–483
7. Symmons DP (2004) Methodological issues in conducting and analyzing longitudinal observational studies in rheumatoid arthritis. J Rheumatol Suppl 69:30–34
8. Kremer JM (2006) The CORRONA database. Autoimmun Rev 5(1):46–54
9. Ceballos M, Lopez-Revuelta K, Saracho R, Garcia Lopez F, Castro P, Gutierrez JA, Martin-Martinez E, Alonso R, Bernabeu R, Lorenzo V et al (2005) Dialysis and transplant patients Registry of the Spanish Society of Nephrology. Nefrologia 25(2):121–124, 126–129
10. Stel VS, Tomson C, Ansell D, Casino FG, Collart F, Finne P, Ioannidis GA, De Meester J, Salomone M, Traynor JP et al (2010) Level of renal function in patients starting dialysis: an ERA-EDTA Registry study. Nephrol Dial Transplant 25(10):3315–3325
11. Zurriaga Llorens O, Martinez Garcia C, Arizo Luque V, Sanchez Perez MJ, Ramos Aceitero JM, Garcia Blasco MJ, Ferrari Arroyo MJ, Perestelo Perez L, Ramalle Gomara E, Martinez Frias ML et al (2006) Disease registries in the epidemiological researching of rare diseases in Spain. Rev Esp Salud Publica 80(3):249–257
12. Dixon WG, Carmona L, Finckh A, Hetland ML, Kvien TK, Landewe R, Listing J, Nicola PJ, Tarp U, Zink A et al (2010) EULAR points to consider when establishing, analysing and reporting safety data of biologics registers in rheumatology. Ann Rheum Dis 69(9):1596–1602
13. Mann CJ (2003) Observational research methods. Research design II: cohort, cross sectional, and case–control studies. Emerg Med J 20(1):54–60
14. Listing J, Strangfeld A, Kekow J, Schneider M, Kapelle A, Wassenberg S, Zink A (2008) Does tumor necrosis factor alpha inhibition promote or prevent heart failure in patients with rheumatoid arthritis? Arthr Rheum 58(3):667–677
15. Askling J, Fored CM, Geborek P, Jacobsson LT, van Vollenhoven R, Feltelius N, Lindblad S, Klareskog L (2006) Swedish registers to examine drug safety and clinical issues in RA. Ann Rheum Dis 65(6):707–712
16. Dixon WG, Symmons DP, Lunt M, Watson KD, Hyrich KL, Silman AJ (2007) Serious infection following anti-tumor necrosis factor alpha therapy in patients with rheumatoid arthritis: lessons from interpreting data from observational studies. Arthr Rheum 56(9):2896–2904
17. Pratt AG, Lorenzi AR, Wilson G, Platt PN, Isaacs JD (2013) Predicting persistent inflammatory arthritis amongst early arthritis clinic patients in the UK: is musculoskeletal ultrasound required? Arthr Res Ther 15(5):R118
18. Carstensen B (2007) Age-period-cohort models for the Lexis diagram. Stat Med 26(15):3018–3045
19. Kaess BM, Gona P, Larson MG, Cheng S, Aragam J, Kenchaiah S, Benjamin EJ, Vasan RS (2013) Secular trends in echocardiographic left ventricular mass in the community: the Framingham Heart Study. Heart 99(22):1693–1698
20. Rao JK, Callahan LF (1995) Systems for data analysis. Rheum Dis Clin N Am 21(2):359–378
21. Pincus T, Sokka T (2003) Uniform databases in early arthritis: specific measures to complement classification criteria and indices of clinical change. Clin Exp Rheumatol 21(5 Suppl 31):S79–S88
