Feature Article

Practical Considerations for Matching STD and HIV Surveillance Data with Data from Other Sources

Lori M. Newman, MD (a); Michael C. Samuel, DrPH (b); Mark R. Stenger, MA (c); Todd M. Gerber, MS (d); Kathryn Macomber, MPH (e); Jeffrey A. Stover, MPH (f, g); Wendy Wise, MPH (h)

SYNOPSIS Data to guide programmatic decisions in public health are needed, but frequently epidemiologists are limited to routine case report data for notifiable conditions such as sexually transmitted diseases (STDs) and human immunodeficiency virus (HIV). However, case report data are frequently incomplete or provide limited information on comorbidity or risk factors. Supplemental data often exist but are not easily accessible, due to a variety of real and perceived obstacles. Data matching, defined as the linkage of records across two or more data sources, can be a useful method to obtain better or additional data, using existing resources. This article reviews the practical considerations for matching STD and HIV surveillance data with other data sources, including examples of how STD and HIV programs have used data matching.

(a) Division of STD Prevention, Centers for Disease Control and Prevention, Atlanta, GA

(b) STD Control Branch, Division of Communicable Disease Control, California Department of Public Health, Richmond, CA

(c) Infectious Disease and Reproductive Health, Washington State Department of Health, Olympia, WA

(d) Division of Family Health, New York State Department of Health, Albany, NY

(e) Bureau of Epidemiology, Michigan Department of Community Health, Lansing, MI

(f) Office of Epidemiology, Division of Disease Prevention, Virginia Department of Health, Richmond, VA

(g) Department of Epidemiology and Community Health, School of Medicine, Virginia Commonwealth University, Richmond, VA

(h) STD/TB Surveillance, Ohio Department of Health, Columbus, OH

Address correspondence to: Lori M. Newman, MD, Division of STD Prevention, Centers for Disease Control and Prevention, 1600 Clifton Rd. NE, MS E-02, Atlanta, GA 30333; tel. 404-639-6183; fax 404-639-8610; e-mail .

Public Health Reports / 2009 Supplement 2 / Volume 124


Epidemiologists looking for data to guide programmatic decisions in public health often rely upon routine case report data for notifiable conditions. At a national level, case report data for sexually transmitted diseases (STDs) such as gonorrhea, chlamydia, and syphilis are limited to information on age, race/ethnicity, sex, and geographic location, providing little or no information on risk factors. And, STD and human immunodeficiency virus (HIV) comorbidity cannot be assessed at a national level because the Centers for Disease Control and Prevention (CDC) does not use a common system of unique identifiers for STDs and HIV. Although additional data such as gender of sex partners or treatment may be available at a local level, case report data are frequently of poor quality or incomplete. These data limitations may be due in part to insufficient funding, lack of staff with appropriate training or expertise, limited time, or limited organizational or policy support for supplemental surveillance activities. In many situations, supplemental data exist but are not readily accessible by the epidemiologist. The traditional silo approach to surveillance and research, whereby programs focus on a single disease, may create a duplication of effort as well as systems, communication, and software barriers to integrating data from different sources. For example, an STD epidemiologist interested in evaluating gonorrhea risk factors may not be able to utilize information on behavioral risk factors collected through HIV surveillance, due to perceived or real restrictions on access to HIV surveillance data by HIV security and confidentiality guidelines. 
Utilization of existing data may also be hindered by (1) epidemiologists’ lack of awareness of the kind of data being collected through other surveillance programs, (2) technological obstacles created from use of different data collection systems or software, and (3) lack of the skills or resources needed to conduct matching of databases.

Data matching is an important means of strengthening STD and HIV public health interventions through improved use of existing data. Data matching (also known as record linkage) in this article is defined as the integration or linking of records between two or more data sources to provide more complete and comprehensive information, to test hypotheses, or to assess prevention and control activities. Detailed descriptions of matching methodology can be found elsewhere in the literature.1–7 The objective of this article is to review the practical considerations for matching STD and HIV surveillance data with other data sources, including examples of how STD and HIV programs have used data matching. These examples show how matching can be used to generate information not otherwise available; improve completeness of existing databases; guide public health program activities; identify co-infection, new cases, or repeat infection; answer research questions; and promote collaboration among programs.

MATCHING TECHNIQUES

There are two fundamentally different types of data matching: individual and ecologic. In individual matching, data for an individual from one data source are matched with data for the same individual from another data source. In ecologic matching, data from one source for a stratum or group (e.g., the population of a county) are matched with data from another source for that same stratum or group. Although this article addresses primarily individual matching, many of the principles are applicable to ecologic matching as well.

The strength of data matching is that it provides epidemiologists with access to a wide variety of existing data sources that are alternatives to case report data. Examples of alternative data sources are presented in Figure 1. This list is not exhaustive; any two or more datasets with line-listed data for individuals from the same or overlapping populations can potentially be matched. It is important to note, however, that data sources without unique identifying variables (e.g., Youth Risk Behavior Surveillance System [YRBSS]) may be useful only for ecologic matches.

Deterministic and probabilistic matching

Record-matching methods fall into two categories: deterministic and probabilistic. All record-matching projects employ some variant or combination of these two methods. To briefly define both methods, deterministic methods match records through agreement between data elements common to both records, while probabilistic methods match records through making assumptions about the data, based on the distribution of the variable values in the general population, rather than upon just an individual record itself.2,6

Deterministic matching.
Deterministic methods are the most widely used in routine practice and can be as simple as establishing record linkages based on an exact character correspondence between one or more common data elements, such as first and last name. Most database and statistical software packages have logical comparison or string functions available for this straightforward matching method. Deterministic matching based on exact agreements may have serious drawbacks, however, if the underlying data elements are subject to variation in spelling, data-entry accuracy, completeness of data ascertainment, common-usage pseudonyms (e.g., “Bill” for “William”), or common


Matching STD and HIV Surveillance Data with Other Source Data


Figure 1. Potential data sources for matching STD and/or HIV case report data (a)

1. Administrative/service utilization (e.g., AIDS Drug Assistance Program, hospital, Title X, Medicaid, and pharmaceutical usage and sales)
2. Clinical (e.g., laboratory and hospital)
3. Disease registries (e.g., STD, HIV, tuberculosis, and enteric diseases)
4. Vital statistics (e.g., birth and death records)
5. Population-based surveys (e.g., census data, Youth Risk Behavior Surveillance System, Behavioral Risk Factor Surveillance System, and Pregnancy Risk Assessment Monitoring System)
6. Geographic information systems shape files and address dictionaries
7. Other (e.g., detention records, drug abuse treatment program records, and provider and facility reference files)

(a) Prepared by the Centers for Disease Control and Prevention

STD = sexually transmitted disease; HIV = human immunodeficiency virus; AIDS = acquired immunodeficiency syndrome

last names such as Smith or Gomez. These variations may result in a high level of missed true matches when simple exact comparisons are used (i.e., low match sensitivity) and may create the potential for a high level of specious, false matches for common data elements (i.e., low match specificity). These drawbacks may limit the usefulness of simple deterministic methods in practical matching applications that use large public health datasets, such as STD and HIV surveillance registries.

Refinements within deterministic methods can greatly enhance the sensitivity and specificity for matching unique individuals across datasets. One common method is to search for a match on multiple criteria for each record and to assign point values to each matching element, based on the relative utility of each element in identifying unique individuals. This allows the analyst to specify an upper match threshold point value above which all or most matches are true matches, and a lower match threshold point value below which no or few true matches exist. Visual examination of all available data elements in a representative sample of records can determine these threshold values. Some proportion of true matches will exist in a range of scores between the upper and lower threshold values (because of the variations discussed previously). Possible matches in the middle range are subject to manual verification before being considered true matches, to maximize match sensitivity and specificity. The point values and thresholds can be adjusted to produce the narrowest practical range of point values where potentially matching records are found. Figure 2 shows one example of such an algorithm and point values.

Figure 2. Example of a deterministic algorithm used to match data by California Department of Public Health, Division of Communicable Disease Control, STD Control Branch (a)

(a) Source: Personal communication, Michael C. Samuel, DrPH, STD Control Branch, Division of Communicable Disease Control, California Department of Public Health, July 2009.

STD = sexually transmitted disease

Probabilistic matching.

The second principal record linkage method, probabilistic matching, involves calculating a probability score for potentially matching records. A commonly occurring identifying element or combination of elements in the population from which the data are collected results in a lower probability that the specific match is a “true” matching record and increases the probability that the match occurred by chance. For instance, if the data show several records for multiple people named John Smith, there is a lower probability that a match for any specific John Smith actually represents the true John Smith of interest. However, an uncommon element or combination of elements that naturally occurs infrequently in the population is much more likely to produce a true match. In many ways, probabilistic matching replicates a key intuitive element of the logic employed in manual review of data for matched records.

In probabilistic matching, the score for a pair of records is based on the probability that particular identifying elements (or combinations of elements) will match by chance (the u probability) in a nonmatching record pair and on the probability that an element or variable will match for a true matched pair (the m probability).2 These values either are estimated, based on frequency distributions in the underlying datasets, or are directly calculated, where sufficient information is available on the distribution of the variable of interest in the population (e.g., u value for dichotomous variables, such as gender, is usually 0.5). For each identifying variable used in the matching criteria, u and m probabilities are used to derive agreement (or disagreement) weights. For each pair of records compared, a total weight is calculated by summing the agreement or disagreement weights for all the variables of interest. Point value scoring, as described previously for deterministic matching, can also be used with probabilistic matching to achieve the desired sensitivity and specificity.

Probabilistic methods are computationally considerably more complex than deterministic methods and may, therefore, require more sophisticated software to implement routinely. However, the value added in terms of potentially improved predictive value for matching records might justify the additional resources needed.
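As a rough illustration of how u and m probabilities can be turned into agreement and disagreement weights, the sketch below uses the log-ratio form commonly associated with the Fellegi-Sunter model of record linkage. The m and u values, field names, threshold values, and helper functions are all hypothetical and chosen only for illustration; treating a missing field as contributing no weight is likewise an assumption, not a recommendation from the literature cited here.

```python
import math

# Hypothetical probabilities for each identifying variable (illustrative only).
# m = probability the field agrees when the two records are a true match
# u = probability the field agrees by chance when the records are not a match
MU = {
    "last_name": {"m": 0.95, "u": 0.01},
    "birth_date": {"m": 0.97, "u": 0.003},
    "sex": {"m": 0.99, "u": 0.5},  # dichotomous variable: u is about 0.5
}


def field_weight(m, u, agrees):
    """Return the log2 agreement weight, or the disagreement weight if the field differs."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_weight(rec_a, rec_b, mu=MU):
    """Sum the agreement or disagreement weights over all identifying variables."""
    total = 0.0
    for field, p in mu.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # assumption: a missing value contributes no evidence
        total += field_weight(p["m"], p["u"], a == b)
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Apply hypothetical lower and upper thresholds to a total weight."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "nonmatch"
    return "possible match"  # would go to manual review


case = {"last_name": "SMITH", "birth_date": "1980-02-14", "sex": "M"}
same_person = {"last_name": "SMITH", "birth_date": "1980-02-14", "sex": "M"}
same_name_only = {"last_name": "SMITH", "birth_date": "1975-06-01", "sex": "M"}
different = {"last_name": "GOMEZ", "birth_date": "1990-11-30", "sex": "F"}

for other in (same_person, same_name_only, different):
    w = total_weight(case, other)
    print(round(w, 2), classify(w))
```

Note that agreement on sex adds far less weight than agreement on last name or birth date, because a chance agreement on a two-valued variable is so likely. In practice the thresholds would be tuned by reviewing a sample of scored pairs, as described for point-value scoring.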
Several comparisons of deterministic and probabilistic matching strategies that use actual data in demonstrating the advantages and disadvantages of these strategies have been published.6,8

Hybrid matching.

Hybrid methods combine elements of probabilistic and deterministic matching techniques in a single matching algorithm. An example of this approach would be to use a deterministic match on a blocking, or grouping, variable (such as disease, or county or state of residence) to eliminate known nonmatching records, and then process all the blocked records using probabilistic methods. This approach has the advantage of being quicker for very large datasets and more manageable in desktop environments with limited processing power.
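The blocking step of a hybrid approach can be sketched in a few lines. The example below is a minimal illustration, assuming records held as dictionaries and a hypothetical blocking variable named birth_year; the record layout, identifiers, and function names are invented for the example and do not come from any of the software packages discussed in this article.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key):
    """Group records by the value of a blocking variable; skip records missing it."""
    blocks = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value is not None:
            blocks[value].append(rec)
    return blocks


def candidate_pairs(source_a, source_b, key="birth_year"):
    """Yield only those record pairs that agree exactly on the blocking variable."""
    blocks_a = block_by(source_a, key)
    blocks_b = block_by(source_b, key)
    for value in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[value], blocks_b[value])


# Toy data; all values are fabricated for illustration.
std_cases = [
    {"id": "S1", "last_name": "SMITH", "birth_year": 1980},
    {"id": "S2", "last_name": "GOMEZ", "birth_year": 1975},
    {"id": "S3", "last_name": "LEE", "birth_year": 1980},
]
hiv_cases = [
    {"id": "H1", "last_name": "SMITH", "birth_year": 1980},
    {"id": "H2", "last_name": "JONES", "birth_year": 1990},
]

pairs = list(candidate_pairs(std_cases, hiv_cases))
print(len(pairs), "candidate pairs instead of", len(std_cases) * len(hiv_cases))
```

Each surviving pair would then be passed to a probabilistic scoring step. One consequence of this design is that a record with a missing or mistyped blocking value can never be matched, which is why the choice of blocking variable matters.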

Data elements

In addition to the selection of match method, the data elements (also known as variables) used to compare and match records should be carefully considered. The choice of data elements to use in matching records, regardless of matching method used, is usually constrained by the information available, but, in general, those data elements with the greatest discriminating power between records should be used (i.e., those that have the largest number and most even distribution of potential response options). Where available, an ideal match element is a unique identifier, such as a social security number or medical record number. However, few public health surveillance registries have reliable unique identifiers, so a combination of other matching data elements is often used. Considerations regarding selection of match criteria and how such selection can affect the sensitivity, specificity, and positive predictive value of the match have been published.1,3,6,9

Aside from the obvious requirement that potential matching elements must be present in both datasets, a number of considerations may affect the discriminating power of specific data elements. If either dataset has a significant proportion of records with missing data or unknown values for a given element, this will reduce the discriminating power of that item in any type of match conducted. Race and ethnicity, for example, are often missing in STD data, making these data elements generally poor candidates for matching.

An additional consideration is the potential for a given variable to legitimately have different values in the two datasets. An example of this potentially problematic issue would be records collected at different time periods, such as STD morbidity reported for multiple years. Individuals often change residence.
Therefore, using address information as matching criteria would be problematic, in that two or more legitimate values for address may exist for the same unique individual, resulting in a nonmatch and, consequently, reducing overall match sensitivity. In addition to these considerations, the coding and standardization of variables to be used in a matching strategy should be addressed. Race and ethnicity are good examples of data elements commonly subject to different coding schema. Many public health registries have historically coded race and ethnicity categorically as a single variable, allowing patients to respond with only a single option. However, the use of multiple variables for race and a separate variable for ethnicity, such as mandated by the U.S. Office of Management and Budget (OMB)10 and used by the U.S. Census Bureau for the 2000 Census,11 adds complexity to the


use of race as a matching criterion. Although the OMB standards outline a common method of classifying race and ethnicity that allows for a more accurate or detailed description of an individual’s race and ethnicity, the likelihood of a respondent’s race and ethnicity being recorded in a consistent manner is diminished. Issues that may affect the speed, efficiency, and potential sensitivity in processing a match should also be considered. These may include structural database issues, such as variable format type and length, and how missing data are stored. In general, good candidate matching variables will be stored in comparable data formats. Dates, for example, can be stored as either numeric (thereby facilitating date calculations and automated match comparisons) or character (thereby facilitating collection of incomplete date information, such as year and month without day). Although format conversion can be included as part of the matching process, similarly formatted variables facilitate and improve the quality of the match. Longer variables may make less ideal matching variables, all other things being equal, in that the chance of transposition and other forms of data-entry error increases with increasing length. Additionally, match efficiency and sensitivity may be affected by how missing data are stored. Use of inconsistent coding (e.g., where “missing” is coded as “N/A,” “999,” or “.”) may unnecessarily eliminate records or move records into categories requiring manual review. As mentioned previously, blocking (or grouping) on a given variable often expedites match processing for very large datasets. Blocking variables are used to create a subset of a dataset and prevent further match processing by excluding all records that fail to match on the particular data element. Year of birth, for example, is often complete and reliable in many different datasets. 
If year of birth is used as a blocking variable, any time it does not match for a given candidate pair of records, the pair is considered nonmatching and no further processing to compare the two records would take place. Records with matching year of birth would be further processed to generate a score. Blocking variables should be robust (unlikely to change over time and relatively resistant to data-entry errors). Further examples of how to select blocking variables are available in the literature.2,4,7

Software

A wide variety of software utilizing deterministic, probabilistic, or hybrid methodologies is available to assist with data matching. Many different types of specialized matching software are available (e.g., Dataflux,12 LinkPlus,13 Netrics,14 QuadraMed,15 and Febrl16). In


general, probabilistic matching requires specialized matching software, while deterministic matching often can be accomplished with simple programming in many standard database applications (e.g., Microsoft Access®,17 and Foxpro18) or statistical software (e.g., SAS®,19 SPSS®,20 R,21 and Epi Info™22). In some public domain surveillance software, the matching functionality is either incorporated into the software (e.g., HIV/acquired immunodeficiency syndrome [AIDS] Reporting System [HARS], eHARS, and STD Management Information System [STD*MIS],23–25 all from CDC), or is designed to be compatible with commercial software for matching. Important issues to consider when selecting matching software include the format of the data to be used, the size of the dataset, the type of analysis desired, the skill level of the analyst, and the monetary and computer resources available.

Confidentiality, ethical, and legal issues

When working with two or more datasets, the confidentiality standards for the set requiring the strictest standards must be applied to the entire match. This is especially true when working with HIV datasets, as there are restrictions regarding the storage, use, and dissemination of these data.26 For example, the Michigan Department of Community Health found it more practical to use HARS/eHARS SAS-based matching programs to match HIV and STD data due to stricter standards for HIV than for STD data in Michigan (Personal communication, Kathryn Macomber, MPH, Bureau of Epidemiology, Michigan Department of Community Health, February 2009). It may be necessary to negotiate for access to datasets containing sensitive protected health information.
Various techniques can be used to alleviate confidentiality concerns in owners of sensitive identified data, such as doing the match on their computer and deleting the prelink datasets with identifiers, or conducting a virtual match in which the datasets are matched within a network, but personal identifiers from either dataset are never downloaded to the analyst’s computer. However, the establishment of data security, confidentiality, and release guidelines that are either integrated, or at least uniform across program areas, is a critical step toward removing organizational barriers to matching activities and obviating the need for such steps. Given the complexity and variety of potential uses of matched datasets, some level of ethical review is encouraged. Matching routinely collected communicable disease datasets, such as STDs and HIV, is frequently considered surveillance and may not require Institutional Review Board (IRB) approval. However, matching to databases outside of mandated disease


reporting, such as some laboratory results or health maintenance organization records, might require IRB approval. When addressing how matched data shall be handled confidentially, preliminary issues may involve the determination of who is going to conduct the match, formation of agreements among parties that data will be kept confidential, and identification of the variables necessary to be shared. An important step in this process is the formation of a data release policy. A data release policy sets in writing an agreement between the owners of two datasets specifying who conducts the matching, how the data will and will not be used, the parameters of the match, and how the data will be stored or destroyed once the matching activity is complete. An example of a data request form (which encourages documentation of proposed data usage) and a data recipient agreement (which explicitly outlines how data should be used) for STD, HIV, and other program data, are available from the Virginia Department of Health.27 Matching may provide information that is beyond the original intent of the match. Before conducting any matching, the agencies involved need to consider the possible results of the matching so that if serious ethical or other issues arise, the agencies are clear about what steps to take. For example, if a dataset with chlamydia in individuals of unknown age is linked to a demographic dataset, the matching might identify cases among minors. In Michigan, Act 238 of the 1975 Child Protection Law, Section 722.623, outlines special protective services that must be in place when reporting STDs for children younger than 12 years of age.28 In this case, for example, the parties involved should agree upon a plan of action for managing an STD identified in children younger than age 12 years prior to conducting the data matching. 
Another example of a match providing information beyond the original intent is when HIV data are linked to STD data as a means of evaluating comorbidity. Through such matching, it is possible to identify HIV-positive patients who acquired an STD after diagnosis of HIV, which suggests continued high-risk behavior. Because some states have laws pertaining to HIV transmission by HIV-infected individuals, management of such situations should be discussed in advance. For example, it is a Class I misdemeanor under the Code of Virginia (§18.2-67.4:1)29 for people to fail to disclose their HIV, syphilis, or hepatitis B status to a sex partner. Organizational policies, in addition to state and local laws, should be reviewed so that all parties agree in advance on how to handle any such situations that may arise.

Routine matching/database integration

Once an organization has initiated data matching, such matching can be made routine. The group can write programs so that routine analysis of matched data occurs with little effort. The Michigan Department of Community Health, for example, had developed protocols to use SAS software programs to routinely match HIV case report data annually to other types of data such as tuberculosis (TB) case reporting, cancer registries, electronic laboratory reporting, and birth and death records.

HARS had an internal interactive matching program, which could be used to easily match line-listed STD or other types of data to HIV/AIDS cases in the system. The eHARS system is the HIV/AIDS surveillance system replacing HARS nationally and was scheduled to be deployed in all 59 sites by September 2009 (Personal communication, Sam Costa, MA, Centers for Disease Control and Prevention, July 2009). Matching software compatible with that in HARS will soon be routine in eHARS. Matching skills will be important for eHARS users in order to import the large volume of electronic laboratory reports.

Currently, many local health jurisdictions use SAS programs to match their HIV data to other data sources, including TB, syphilis cases, and cancer registries. These SAS programs are available in many states, and can typically be accessed by talking to HIV reporting staff. Many of these SAS programs originated from the need to match electronic laboratory test results to HIV case report data from providers. (An example of Michigan’s SAS program for STD and HIV case report matching in eHARS is available to those joining the public Outcome Assessment through Systems of Integrated Surveillance workgroup at www.stdprevention.org.)

Data warehouses, into which data from different sources are loaded and stored in a common system with a similar format, are another means of facilitating easy data matching.
For example, the Ohio Department of Health’s data warehouse contained STD and TB case report data, as well as vital statistics (births and deaths) and cancer incidence data. Although the classic definition of a data warehouse focuses on data storage, essential components of a data warehousing system include the means to retrieve and analyze data; to extract, transform, and load data; and to manage the dictionary data. Increasing numbers of states are also beginning to collect data in integrated electronic systems, such as CDC’s National Electronic Disease Surveillance System’s Base System or other Public Health Information Network (PHIN)-compliant systems.30,31 Use of such systems may increase the frequency of data matching,


as well as heighten awareness of the need to proactively address matching issues such as standardization, accuracy, speed, confidentiality, and utility. These systems, which are generally person-based, rather than event-based, use matching techniques to link new events to a person who may already be in the system. Michigan has a PHIN-compliant integrated electronic system, the Michigan Disease Surveillance System (MDSS),32 which uses a deterministic record linkage process to compare a string of variables among records. In MDSS, the variables used to determine matches are first name, last name, date of birth, sex, and street address. MDSS automatically calculates a matching weight from the string of matches to distinguish among a match, a nonmatch, or a possible match. The user is then prompted to classify the new information as representing a new person and event, a new event for a previously reported person, or an amendment of a previous event for a previously reported person. Person-based systems such as these force the determination of comorbidity at the time of data entry into the system, rather than at a later point in time.

MATCHING EXAMPLES

Although nearly any two overlapping data sources theoretically can be matched, it is important to ask why one should match data. Ideally, a single dataset would contain all desired information for every individual. However, in reality, complete information on all individuals rarely exists. Therefore, matching can be a way of generating information that is not otherwise available in a given dataset. For example, the California Department of Public Health (CDPH) used SAS software to match existing administrative enrollment and paid claims data from Family Planning, Access, Care, and Treatment (Family PACT), a state-funded program providing health-care services to low-income women, with existing laboratory data from Unilab (West Hills, California), a laboratory providing chlamydia testing services (Figure 3).
CDPH chose SAS for this matching activity because of its powerful data manipulation functionality and ease of use for deterministic matching. The resulting matched data demonstrated high chlamydia positivity in specific minority populations, information that was not available from either data source when analyzed independently.33

Matching also can be a means of improving the completeness of routinely collected data, or adding additional demographic, clinical, or behavioral information to existing data. In Virginia, matching TB case report data with HIV/AIDS and STD case report data identified injection drug use among HIV/TB co-infected patients that had not been elicited during the TB case interviews, although information on injection drug use is routinely sought in TB interviews. Matches with AIDS Drug Assistance Program (ADAP) data and HIV morbidity data have also proven to be a useful tool for the Virginia Department of Health to ensure completeness of HIV/AIDS case reporting, as well as completeness of updates to CD4 and viral load laboratory data (Personal communication, Jeffrey A. Stover, MPH, Office of Epidemiology, Division of Disease Prevention, Virginia Department of Health, June 2008).

Another form of matching to estimate the completeness (sensitivity) and specificity of a disease reporting system is the capture-recapture technique. The capture-recapture technique matches two independent data sources to estimate the true incidence of a condition.34 The Massachusetts Department of Public Health used the capture-recapture technique in an evaluation of the completeness of reporting of chlamydial infection, and estimated that laboratory reports alone were 46% complete, health-care provider reports alone were 54% complete, and either type of report accounted for 75% of the estimated total number of cases.35

The following examples of ecologic matching provided data for targeting underserved areas and to guide public health program activities.
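As a brief aside on the capture-recapture technique described above, the arithmetic behind a two-source completeness estimate can be sketched as follows. The sketch assumes the simple two-source (Lincoln-Petersen) estimator; the source does not state which estimator the Massachusetts evaluation used, and the counts below are hypothetical rather than the Massachusetts data. The number of cases found in both sources is exactly what a record match between the two sources provides.

```python
def lincoln_petersen(n_a, n_b, n_both):
    """Estimate the total number of cases from two independent sources.

    n_a, n_b: number of cases reported by each source
    n_both:   number of cases found in both sources (from record matching)
    """
    if n_both == 0:
        raise ValueError("no matched records: the estimate is undefined")
    return n_a * n_b / n_both


def completeness(n_a, n_b, n_both):
    """Return the estimated total and the completeness of each source and of either source."""
    total = lincoln_petersen(n_a, n_b, n_both)
    return {
        "estimated_total": total,
        "source_a": n_a / total,
        "source_b": n_b / total,
        "either": (n_a + n_b - n_both) / total,
    }


# Hypothetical counts: 400 laboratory reports, 500 provider reports,
# 200 cases appearing in both after matching.
result = completeness(400, 500, 200)
for name, value in result.items():
    print(name, round(value, 2))
```

The estimate relies on the two sources being independent and on the match itself being accurate: missed true matches lower n_both and inflate the estimated total, while false matches do the opposite.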
The New York State Department of Health has matched gonorrhea case report data with a database of disease outreach

Figure 3. Matching chlamydia laboratory data and Family Planning, Access, Care, and Treatment (Family PACT) program administrative data, California, 2000 (a)

                          Laboratory data    Administrative data    Merged data
Chlamydia test results    Complete           Missing 100%           Complete
Race/ethnicity            Missing 100%       Complete               Complete
Gender                    Missing 7%         Complete               Complete

(a) Source: Chow JM, Packel LJ, Miller J, Treat J, Barbosa S, Hanbury M, et al. Commercial laboratory participation in chlamydia prevalence monitoring: the case for public-private partnerships in STD control [abstract no. 0273]. Presented at the International Society for Sexually Transmitted Disease Research Congress; 2003 Jul 27–30; Ottawa, ON, Canada.

Public Health Reports / 2009 Supplement 2 / Volume 124


worker activity to assess whether outreach worker activity is occurring in areas with the greatest morbidity.36 The Virginia Department of Health has matched STD/HIV data with bus routes for targeting HIV prevention resources. The Virginia Department of Health has also spatially matched gonorrhea morbidity reports and public housing polygons within the city of Richmond to visually assess and further refine understanding of core areas of STD transmission. This type of nonroutine data visualization has assisted in targeting STD prevention efforts within block groups of high gonorrhea morbidity that are independent of public housing correlations (Figure 4) (Personal communication, Jeffrey A. Stover, MPH, Office of Epidemiology, Division of Disease Prevention, Virginia Department of Health, June 2008). The Baltimore City Health Department has matched STD case report data, socioeconomic data from the U.S. Census, and HIV counseling and testing data from a mobile van service to guide mobile van outreach activities (Personal communication, Jonathan Ellen, MD, Johns Hopkins School of Medicine, July 2009).

Matching can also be a method of identifying co-infection. Several health departments (Michigan, Indiana, Massachusetts, San Francisco, Washington State, Baltimore, Florida, Ohio, New York State, and Virginia) have matched STD case report data with HIV case report data.37 For example, both the Michigan and Indiana health departments found through matching datasets that the majority of syphilis cases among people reported with HIV infection were reported after HIV had been reported, suggesting continued high-risk sexual behavior in the population most likely

Figure 4. Distribution of gonorrhea morbidity by block group and geographic relevance to public housing within the city of Richmond, Virginia, 2003–2007a

a Source: Personal communication, Jeffrey A. Stover, MPH, Office of Epidemiology, Division of Disease Prevention, Virginia Department of Health, July 2009.


to transmit HIV (Personal communication, Kathryn Macomber, MPH, Bureau of Epidemiology, Michigan Department of Community Health, February 2004). Such findings have important implications for HIV prevention initiatives, including counseling, testing, and referral programs.

Matching has also been used as a mechanism for identifying new cases and validating completeness of reporting. In Virginia, for example, matching the HIV/AIDS case report data and ADAP data resulted in a 7% increase in HIV/AIDS case reports for a given calendar year (Personal communication, Jeffrey A. Stover, MPH, Office of Epidemiology, Division of Disease Prevention, Virginia Department of Health, March 2005). This finding led to the establishment of procedures to ensure routine matches of HIV/AIDS morbidity and ADAP data. The New York State Department of Health used SAS matching functionality to identify electronic laboratory reports that did not generate STD morbidity reports at the county level. The department flagged potential unmatched cases, which were then prioritized and assigned for follow-up to determine morbidity and case management status (Personal communication, Todd Gerber, MS, Division of Family Health, New York State Department of Health, March 2006). These types of procedures assist in producing more complete disease reporting and can have a significant impact on funding decisions and resource allocation.

Matching a database to itself can also assist in identifying cases of repeat infection or duplicate reports of the same case.
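One way such a self-match might look in practice is sketched below. This is a minimal illustration, not the procedure used by any particular health department. It assumes each report already carries a reliable patient identifier (here the hypothetical field `patient_id`) and a report date, and it treats a report that arrives within a chosen window of the patient's last counted case as a probable duplicate. The 30-day default is an assumption for illustration; the applicable window depends on the reporting regulations in force.

```python
from collections import defaultdict
from datetime import date


def classify_reports(reports, window_days=30):
    """Label each (patient_id, report_date) pair from a single case registry.

    'initial'   : first report for the patient
    'duplicate' : reported within window_days of the last counted case
    'repeat'    : reported more than window_days after the last counted case
    """
    dates_by_patient = defaultdict(list)
    for patient_id, report_date in reports:
        dates_by_patient[patient_id].append(report_date)

    labeled = []
    for patient_id, dates in dates_by_patient.items():
        last_counted = None  # date of the most recent report counted as a case
        for report_date in sorted(dates):
            if last_counted is None:
                label = "initial"
                last_counted = report_date
            elif (report_date - last_counted).days <= window_days:
                label = "duplicate"  # not counted as a new case
            else:
                label = "repeat"
                last_counted = report_date
            labeled.append((patient_id, report_date, label))
    return labeled


# Hypothetical reports for two patients.
example = [
    ("A", date(2004, 1, 5)),
    ("A", date(2004, 1, 20)),  # 15 days after the first report
    ("A", date(2004, 4, 1)),   # 87 days after the first report
    ("B", date(2004, 2, 10)),
]
for row in classify_reports(example):
    print(row)
```

Comparing each report with the last counted case, rather than with the immediately preceding report, keeps a chain of closely spaced duplicate reports from postponing recognition of a true repeat infection. When no unique identifier exists, the grouping step would instead have to rely on deterministic or probabilistic matching of names, dates of birth, and similar fields.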
In California, matching data for individual patients over several years in the routine case-based surveillance dataset identified 6% of chlamydia patients and 4% of gonorrhea patients as having a repeat infection within one to six months of initial report.38 CDPH also assessed the time between infections and was able to identify a large number of patients who were reported twice within a month, highlighting the need to adhere to reporting regulations that an infection reported within 30 days of a prior infection should not be considered a new case for surveillance purposes.

Matching may be used to address research questions. CDPH, for example, matched chlamydia case report data from pregnant women with birth certificate data and identified an association between chlamydia and an adverse birth outcome.39 The Washington State Department of Health matched HIV and STD registries and used multivariate models to investigate demographic and behavioral factors significantly associated with HIV infection risk and STD incidence among HIV-infected populations.37,40

Registry matching may also be useful even if little or no overlap of records occurs. For example, matching HIV cases with chlamydia cases in Washington State demonstrated that HIV co-infection among people diagnosed with chlamydia was relatively rare (less than 0.5% of patients with chlamydia were co-infected with HIV) and that chlamydial infection was not significantly associated with risk of HIV infection.40 This finding led HIV prevention planners in Washington State to conclude that routine HIV screening of people diagnosed with chlamydial infection, in the absence of any other HIV risk factors, was not an efficient use of scarce HIV prevention resources. Several state health departments (including Massachusetts, Virginia, and Ohio) found that matching TB and HIV datasets produced so few cases of co-infection that routinely matching TB and HIV data was not a useful activity for case finding.

Regardless of the match results, data matching may also prove valuable in establishing collaborative working relationships that promote and expedite future collaborations among organizations. For example, in the process of establishing an ongoing matching process for HIV and STD data in Florida, the HIV and STD programs realized the need for and subsequently created technical assistance guidelines and service-level agreements. Data matching may also be a means of addressing duplicate areas of data collection, as well as areas of unmet need.

LIMITATIONS OF MATCHING

We have outlined some important considerations for conducting matched data analyses and have presented general and specific examples of how matching has been useful for STD and HIV programs.
In the examples provided in this article, matching was used for the following purposes: (1) to obtain information not otherwise available; (2) to improve the completeness and quality of routinely collected data; (3) to provide data for targeting underserved areas and guide public health program activities; (4) to identify co-infection, new cases, repeat infection, and duplicate reporting; (5) to address research questions; (6) to identify inefficient uses of resources; and (7) to promote collaboration among organizations.

In many settings, however, data matching is not an efficient, practical, ethical, or effective method of data analysis. Matching may be technically difficult or even impossible. Matching based on demographic or other variables not unique to an individual can be time-consuming and difficult. Matching can result in errors (false matches or false nonmatches) that may be unacceptable in some settings. For example, comorbidity information obtained through matching generally should not be used to guide treatment decisions for an individual patient. In such a case, it may be preferable to obtain the information directly from the patient. The results of an ecologic match may be invalid if the matched datasets do not cover overlapping or similar populations. For example, address data from a police dataset on drug and prostitution arrests matched with address data from STD case reports might lead to a match for a given address, but may not identify the same individual in that household. Matching may also not be practical, given strict confidentiality guidelines for one or both datasets. Even if data can be matched, the results may not necessarily be useful to public health practitioners or researchers.

FUTURE CONSIDERATIONS

In the future, use of PHIN-compliant systems may eliminate the need to conduct matching procedures for data within such systems and may ease the matching of data outside of such systems. Implementation of such systems is intended explicitly to promote integrated analysis through use of common fields and coding, and to provide standards for electronic reporting of surveillance data. PHIN-compliant systems are generally patient-centered at the state level, thus creating a unique identifier for each patient at the state level. PHIN-compliant systems may facilitate the creation of automated analysis and reporting of merged data.

Implementation of PHIN-compliant systems will not obviate the need for matching datasets. Such systems may shift the process of matching to a more distributed model that relies on a greater number of individuals at the local level with varying degrees of training and time to conduct matching. PHIN data stewards may find it necessary to conduct matches to periodically check accuracy and validity of data. PHIN-compliant systems are designed primarily for surveillance data of reportable diseases.
Although the systems will likely be used for other types of data (e.g., prevalence monitoring data, enhanced surveillance data, or special studies), there are still a wide variety of data sources for nonreportable diseases or conditions that will not use PHIN-compliant systems but could potentially be used to enhance data analyses. For example, an ecologic analysis of drug treatment program data and STD data, as performed in Massachusetts, would still require matching PHIN-compliant systems data with other datasets. Furthermore, PHIN-compliant systems are not in use or planned for use by all states and cities, and may not be used for all notifiable conditions. Even once such systems are widely used, however, data confidentiality precautions intrinsic to these systems may restrict access to data matching across diseases to a limited number of individuals with security rights. Due to such obstacles, it is important to identify the specific settings in which use of merged data adds value to analysis of single datasets. Implementation of PHIN-compliant systems will increase the need to proactively discuss confidentiality and ethical issues related to data matching and to create an atmosphere in which integrated service-level agreements, data request forms, and data recipient agreements can be developed.

CONCLUSION

Data matching can be an effective, inexpensive, and simple way to augment knowledge obtained from one data source with that from another. For example, matching can be made simple through the programming of automatic analyses, generation of routine reports, or creation of data warehouses, and has the potential to rapidly change the business processes for STD and HIV programs. Although it is important to assess the potential utility and validity of a match before allocating human or technical resources to the analysis, many public health programs have found data matching to be an important tool in improving disease control activities, allocating resources, identifying high-risk individuals, and improving public health surveillance systems.

The Outcome Assessment through Systems of Integrated Surveillance (OASIS) Project was funded by the National Center for HIV/AIDS, STD, and TB Prevention, Centers for Disease Control and Prevention (CDC) from 1995 through 2005.
We are grateful to the following OASIS sites for sharing their collective experience in data matching: Baltimore City Health Department, California Department of Health Services, Florida Department of Health, Indiana State Department of Health, Massachusetts Department of Health, Michigan Department of Community Health, Missouri Department of Health, New York City Department of Health and Mental Hygiene, New York State Health Department, North Carolina Department of Health, Ohio Department of Health, Oregon Health Division, San Francisco Department of Health, Texas Department of Health, Virginia Department of Health, and Washington State Department of Health. We also thank Dr. Richard Selik, Division of HIV/AIDS Prevention, and Dr. Samuel Groseclose, Greg Pierce, and Dr. Hillard Weinstock, Division of STD Prevention, CDC, for their assistance in reviewing this article.

REFERENCES

1. Gill LE, Baldwin JA. Methods and technology of record linkage: some practical considerations. In: Baldwin JA, Acheson ED, Graham W, editors. Textbook of medical record linkage. Oxford (UK): Oxford University Press; 1987. p. 39-54.
2. Blakely T, Salmond C. Probabilistic record linkage and a method to calculate the positive predictive value. Int J Epidemiol 2002;31:1246-52.
3. Cook LJ, Olson LM, Dean JM. Probabilistic record linkage: relationships between file sizes, identifiers, and match weights. Methods Inf Med 2001;40:196-203.
4. Jaro MA. Probabilistic linkage of large public health data files. Stat Med 1995;14:491-8.
5. Newcombe HB. Distinguishing individual linkages of personal records from family linkages. Methods Inf Med 1993;32:358-64.
6. Roos LL, Wajda A. Record linkage strategies: part I: estimating information and evaluating approaches. Methods Inf Med 1991;30:117-23.
7. Wajda A, Roos LL, Layefsky M, Singleton JA. Record linkage strategies: part II: portable software and deterministic matching. Methods Inf Med 1991;30:210-4.
8. Gomatam S, Carter R, Ariet M, Mitchell G. An empirical comparison of record linkage procedures. Stat Med 2002;21:1485-96.
9. Etkind P, Tang Y, Whelan M, Ratelle S, Murphy J, Sharnprapai S, et al. Estimating the sensitivity and specificity of matching name-based with non-name-based case registries. Epidemiol Infect 2003;131:669-74.
10. Office of Management and Budget. Revisions to the standards for the classification of federal data on race and ethnicity. Federal Register Notice, October 30, 1997 [cited 2008 Apr 2]. Available from: URL: http://www.whitehouse.gov/omb/fedreg/1997standards.html
11. Census Bureau (US). Racial and ethnic classifications used in Census 2000 and beyond [cited 2008 Apr 2]. Available from: URL: http://www.census.gov/population/www/socdemo/race/racefactcb.html
12. Dataflux Corp. Dataflux [cited 2009 Jul 21]. Available from: URL: http://www.dataflux.com
13. Informer Technologies Inc. LinkPlus [cited 2009 Jul 21]. Available from: URL: http://linkplus.software.informer.com
14. Netrics Inc. Netrics [cited 2009 Jul 21]. Available from: URL: http://www.netrics.com
15. QuadraMed Corp. QuadraMed [cited 2009 Jul 21]. Available from: URL: http://www.quadramed.com
16. SourceForge, Inc. Febrl [cited 2009 Jul 21]. Available from: URL: http://sourceforge.net/projects/febrl
17. Microsoft Corp. Microsoft® Access [cited 2009 Jul 21]. Available from: URL: http://office.microsoft.com/en-us/access/default.aspx
18. Hallogram Publishing. Foxpro [cited 2009 Jul 21]. Available from: URL: http://www.foxprosoftware.com
19. SAS Institute, Inc. SAS® [cited 2009 Jul 21]. Available from: URL: http://www.sas.com
20. SPSS Inc. SPSS® [cited 2009 Jul 21]. Available from: URL: http://www.spss.com
21. Lucent Technologies. R [cited 2009 Jul 21]. Available from: URL: http://www.r-project.org
22. Centers for Disease Control and Prevention (US), Division of Integrated Surveillance Systems and Services (DISSS). Epi Info™ [cited 2009 Jul 21]. Available from: URL: http://www.cdc.gov/epiinfo
23. CDC (US). HIV/AIDS statistics and surveillance [cited 2009 Jul 21]. Available from: URL: http://www.cdc.gov/hiv/topics/surveillance
24. CDC (US). eHARS [cited 2009 Jul 21]. Available from: URL: http://www.cdc.gov/hiv/topics/surveillance/resources
25. CDC (US), Division of STD Prevention. Sexually transmitted diseases: STD*MIS [cited 2009 Jul 21]. Available from: URL: http://www.cdc.gov/std/std-mis
26. CDC (US), Council of State and Territorial Epidemiologists. Technical guidance for HIV/AIDS surveillance programs, volume III: security and confidentiality guidelines [cited 2008 Apr 2]. Available from: URL: http://www.cdc.gov/hiv/topics/surveillance/resources/guidelines/guidance/index.htm
27. Virginia Department of Health, Office of Epidemiology, Division of Disease Prevention. Data and statistics [cited 2008 Apr 2]. Available from: URL: http://www.vdh.state.va.us/epidemiology/DiseasePrevention/DAta/datarequest.htm
28. State of Michigan, Child Protection Law, Act 238 of 1975, Section 722.623 [cited 2008 Aug 14]. Available from: URL: http://www.legislature.mi.gov/(S(y2ea2kfkrmpjdyap2kzrjyrm))/mileg.aspx?page=getObject&objectName=mcl-722-623
29. Code of Virginia, Title 18.2: crimes and offenses generally, Section 18.2-67.4:1: infected sexual battery; penalty [cited 2009 Jul 21]. Available from: URL: http://law.onecle.com/virginia/crimes-and-offenses-generally/18.2-67.4:1.html
30. CDC (US). Public Health Information Network, National Electronic Disease Surveillance System: NEDSS Base System (NBS) [cited 2009 Jul 16]. Available from: URL: http://www.cdc.gov/phin/activities/applications-services/nedss/nbs.html
31. CDC (US). Public Health Information Network (PHIN) [cited 2009 Jul 16]. Available from: URL: http://www.cdc.gov/phin/resources/requirements.html
32. Michigan Department of Community Health, Scientific Technologies Corp. Michigan Disease Surveillance System, system description: version 1.05: supporting documentation, 2005. Lansing (MI) and Tucson (AZ): Michigan Department of Community Health and Scientific Technologies Corp.; 2005.
33. Chow JM, Packel LJ, Miller J, Treat J, Barbosa S, Hanbury M, et al. Commercial laboratory participation in chlamydia prevalence monitoring: the case for public-private partnerships in STD control [abstract no. 0273]. Presented at the International Society for Sexually Transmitted Disease Research Congress; 2003 Jul 27–30; Ottawa, ON, Canada.
34. Brenner H. Use and limitations of the capture-recapture method in disease monitoring with two dependent sources. Epidemiology 1995;6:42-8.
35. Reporting of chlamydial infection--Massachusetts, January–June 2003. MMWR Morb Mortal Wkly Rep 2005;54(22):558-60.
36. Du P, Coles FB, Gerber T, McNutt LA. Effects of partner notification on reducing gonorrhea incidence rate. Sex Transm Dis 2007;34:189-94.
37. Stenger MR, Courogen MT, Carr JB. Trends in Neisseria gonorrhoeae incidence among HIV-negative and HIV-positive men in Washington State, 1996–2007. Public Health Rep 2009;124(Suppl 2):18-23.
38. Chow JM, Guo J, Gilson D, Samuel MC, Barbosa S, Sisco K, et al. Repeat chlamydia and gonorrhea infection using case-based surveillance reports and laboratory-based prevalence monitoring data, California, 2003–2004. Presented at the National STD Prevention Conference; 2006 May 8–11; Jacksonville, FL.
39. Kang MS, Chow JM, Bolan G. Association of Chlamydia trachomatis infection and adverse pregnancy outcomes: results of matching morbidity and vital statistics databases [abstract no. C8]. Presented at the National STD Prevention Conference; 2002 March 4–11; San Diego, CA.
40. Stenger MR, Djomand G. Sexually transmitted infection morbidity subsequent to HIV infection in Washington State: results of a surveillance registry matching project [abstract no. MoOrC1014]. Presented at the XIV International AIDS Conference; 2002 Jul 7–12; Barcelona, Spain.
