
Research Method

Data Validation and Other Strategies for Data Entry

Western Journal of Nursing Research, 2015, Vol. 37(4), 546–556. © The Author(s) 2014. DOI: 10.1177/0193945914532550

Kevin A. Kupzyk1 and Marlene Z. Cohen1

Abstract

Data entry can result in errors that cause analytic problems and delays in disseminating research. Invalid responses can lead to incorrect statistics and statistical conclusions. The purpose of this article is to provide researchers with some basic strategies for avoiding out-of-range data entry errors and streamlining data collection. This article identifies some basic strategies using Microsoft® Excel, an inexpensive method of data entry that can be used when research budgets are constrained. Data files can be structured so that out-of-range values cannot be entered. When string variables are entered, researchers may be inconsistent in the way they code responses. Data validation can be accomplished by restricting response options, and inadvertently skipped items can be detected by using count functions to tabulate the number of valid responses. We also discuss advantages and disadvantages of several methods of data entry, including web-based data entry and relational databases.

Keywords

instrument development, methods, descriptive quantitative, gerontology, data entry

1University of Nebraska Medical Center, Omaha, USA

Corresponding Author: Kevin Kupzyk, University of Nebraska Medical Center, 685330 Nebraska Medical Center, Omaha, NE 68198-5330, USA. Email: [email protected]


Accurate data entry is essential for carrying out high-quality research, and increasing data quality is an ongoing concern in medical research (Beretta, Aldrovandi, Grandi, Citerio, & Stocchetti, 2007; Greiver et al., 2011). Invalid values are not flagged automatically by statistical procedures, so data validation should be considered prior to data entry. Many different database and spreadsheet-based programs are available for data entry. The purpose of this article is to provide researchers with some basic strategies for avoiding out-of-range data entry errors and streamlining computerized data collection. We will also discuss the advantages and disadvantages of several different methods of data entry.

Manual and Computerized Data Entry

Several problems can arise in studies where data have been hand entered. Hand entry of data from paper surveys or questionnaires, as well as computerized data entry, can result in data entry errors. Analysts should, as they usually learn through experience, perform double data entry and check all variables for out-of-range values and inconsistencies (Horn, Miksch, Egghart, Popow, & Paky, 1997). Double entry is known to be significantly more accurate than single entry (Paulsen, Overgaard, & Lauritsen, 2012), but it is also costly and time-consuming. The process of finding and resolving data entry errors can take a great deal of time, delay analyses, or possibly change descriptive statistics or statistical conclusions if analyses have previously been performed using the uncorrected data. Drawing invalid conclusions from incorrect data would be very problematic, and it is avoidable with careful planning prior to the onset of data collection.

It is important to note that data validation and data verification are two separate issues. A data point can be a valid value but still be incorrect; on a 5-point scale, for example, 4 could be entered instead of 3. The strategies described in this article do not address data verification, or data cleaning. Double data entry is one strategy that can help with data verification. Analysts can also select random samples of entered data and check the values entered against paper copies. These strategies are only feasible when there are paper copies of the data. Paperless forms of data entry, such as computerized, web-based, or mobile data entry, are becoming more common, but hand entry of paper data will likely be used by many researchers for quite some time. Any strategies that can be implemented to improve the accuracy of hand-entered data will improve the quality of research based on those data.

The occurrence of data entry errors can be avoided with creative use of data validation rules and logic functions. Studies that employ computer programmers to create databases and data entry systems will not likely need to take these steps, but research budgets are often constrained such that they can only support a few faculty and staff members.


With the growing use of small, portable computers (e.g., netbooks) for data collection, analysts and database designers can create computer-based data entry systems that greatly reduce the possibilities of entry errors. Many different database and statistical analysis programs can be used for data entry. We present examples that show how several data entry strategies can be implemented using Microsoft® Excel. Excel has the advantage of being a simple program that is accessible and familiar to most users, and it can perform some very complex functions. It has been widely used for data collection, calculation, and custom analyses (Zhang, Huo, Zhou, & Xie, 2010), and it is also a valuable tool for teaching data analysis and computer skills (Warner & Meehan, 2001).

Method

To provide an example of these data entry strategies, we describe a recently designed pilot study, which had the goal of recruiting 40 nursing home residents as participants and following them across 4 time points. Five measures comprising a total of 90 items were used. The 40 individuals across 4 time points with 90 items per occasion resulted in 40 × 4 × 90 = 14,400 potential data points (assuming no attrition). The fact that this was a pilot study with only 40 individuals makes 14,400 data points a conservative number for a research study; many studies will have more participants and possibly more measures. Traditionally, these data would have been collected on paper and then hand entered into a computer. The potential for entry errors is great with such a large amount of data. As seasoned researchers know, no study is immune to data entry errors, even with well-trained and competent research assistants (RAs).

Programs for Data Entry

A growing number of studies use advanced technology for online data entry, such as REDCap (e.g., Harris et al., 2009), or mobile data entry devices. REDCap, which stands for Research Electronic Data Capture, is free and can be used easily by data entry personnel once a measure or set of measures has been properly set up. Using REDCap requires a reasonable amount of training or self-teaching, and REDCap provides a number of instructional videos. The strategies presented here can be implemented with REDCap as well, including built-in data validation. Using REDCap, however, requires an end-user license agreement between your university or institution and Vanderbilt University, as well as a support person at your institution who can review and approve projects.


Data entered using REDCap are stored on web servers hosted and maintained by each institution's technology specialists, as opposed to being stored on your own computer, external hard drive, or network drive. REDCap does have the ability to restrict response options and check for missing data, but it does not allow changes in the design of the page or data entry template. In a web-based survey tool such as REDCap, response options can be restricted by setting up the question in a multiple-choice format, and missing data can be avoided by requiring a response for each item.

Many studies have used either software (Goldberg, Niemierko, Shubina, & Turchin, 2010; Kim & Park, 2012) or websites (Cooper et al., 2006) to collect data with the specific intent of preventing data entry errors and omissions. Wagner, Capezuti, Taylor, Sattin, and Ouslander (2005) discussed the utility of computerized data entry systems in reducing error in a study reporting incidents of falls. In their system, the user selected from a list of valid options, and missing data were minimized by not allowing the assessor to proceed through the report without answering all questions on a page. They found that the system enhanced reporting and that the nursing home facilities identified strengths in the computerized data entry system, but the project required collaborating with a computer programmer. Such studies often involve extensive programming in computer languages with which most researchers have no experience, such as JavaScript, XML, or HTML. In small studies, or those without large budgets, it may not be feasible to hire a computer programmer to create a software program for data entry.

Microsoft® Access is a highly flexible relational database program that allows users to link multiple data files using participant ID values, create queries (filtered tables), use custom forms for data entry, and create reports. These features can improve the efficiency of data collection, management, and analysis.

Researchers can use many data entry programs and systems. However, Excel is a commonly used program for data entry, especially in small studies, because it is easy to use, widely accessible, and at least somewhat familiar to most researchers and students. Using the example study, and for the purposes of this article, Microsoft® Excel 2010 is discussed for use with entering data and checking for errors. The strategies described in our study can be implemented by a statistician, data analyst, or researcher with a working knowledge of Microsoft® Excel. Excel is a surprisingly flexible program with a great deal of functionality of which many researchers are not aware. Excel data files can easily be transferred into SAS, SPSS, Stata, and many other programs for advanced statistical analyses. Researchers who solely use their statistical analysis software of choice for data entry are limited to post hoc handling of errors.


Data that are entered directly into SPSS or SAS cannot be constrained to allow only valid responses. The strategies discussed here, however, can be used to improve the accuracy of data entry and help prevent data entry errors from ever occurring. Researchers can use Excel to restrict response options, automate checks for missing data, and create data entry templates that have the same design as the measurement tools. Each of these strategies is discussed below, along with some general recommendations for computerized data collection in the field.

Restricting Response Options

The simplest way to reduce errors is to restrict response options to only valid responses. Regardless of the type of question, response options can be restricted to a finite set of valid responses to ensure that out-of-range values are not entered. For example, items that are rated on a 1 to 5 Likert-type scale can be constrained to allow only one of those five options (or no response). This makes it impossible for a data entry person to accidentally type in a 6 instead of a 5 or a 44 instead of a 4.

Inconsistencies can exist in how demographic information is coded, especially in studies with multiple data entry people. For example, analysts often discover when calculating frequencies that, instead of one count for female and one for male, separate counts exist for male, female, M, F, mal, fem, boy, girl, man, and woman. All of the aberrant responses must then be recoded into the two desired categories. Such post hoc data cleaning takes valuable time, which can be saved by making it impossible for implausible or varying values to be entered in the first place. Changing the data entry format from free response to selecting from a dropdown list will save time and keep responses consistent.

To restrict response options using Excel, the column for each variable can be constrained to allow only one of the valid responses on that variable to be selected. This is often referred to as data validation. A data dictionary, which describes each variable and what the values mean, and the validation rules, including ranges and possible values, should be reviewed prior to and while constructing a database (Gliklich & Dreyer, 2010). Once all variables have been named in the first row in Excel, each unique set of responses, or value labels, can be listed anywhere in the spreadsheet where data will not be entered. In this study, for example, several columns past the last variable, male and female were entered in the first and second cells of one column, and marital status options were entered in the next column. Subsequent columns could contain yes and no, ethnicity options, religious affiliations, and so forth. To create a dropdown list in the data entry area of the spreadsheet, the column where data were to be entered was selected (e.g., cells A2:A200, directly below the first variable name).


Under the "Data" tab in Excel is a "Data Validation . . ." button. In the Settings tab of the data validation window, by selecting "Allow: List," the user can input the "Source:" of the list, which will be the set of valid responses previously entered for that variable (note that the dollar signs in the source reference denote the fixed location of the selected cells). Now when a cell in the data entry area is selected, the user finds a dropdown menu of the valid options for that variable. The resulting data set is more accurate, and data entry is consistent with less typing needed. Researchers who want to hand enter values using the numeric keypad instead of clicking on the dropdown list can still do so, as long as the values meet the criteria specified by the data validation rules.

Using the dropdown is especially useful for string values, such as ethnicity and marital status. Although this requires the analyst to recode the string values to numeric values for use in an analysis, the standardization of possible values eliminates the frustration of dealing with multiple different spellings of each value. Numeric codes could be used instead if later recoding is problematic, but data entry personnel would then need to refer to a key each time to ensure that the right numeric code is being entered.

For Likert-type scales or numeric variables, data validation simply involves constraining values entered into those variables to be within the valid numeric range. In the data validation window, the user can specify "Allow: Whole Number" or "Allow: Decimal" (a continuous number) between the lower and upper bounds. For example, a 1 to 5 Likert-type response variable would be validated by allowing only whole numbers between 1 and 5. If any out-of-range value is entered, an error message appears. In our pilot study, all variables that were not free response variables were validated using the data validation function, preventing entry of out-of-range values.
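The same validation rules can also be written into a workbook programmatically. The sketch below is illustrative rather than part of the original study: it uses Python's openpyxl library (variable, range, and file names are hypothetical) to attach a dropdown list and a whole-number rule to two columns of an entry template.

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws["A1"] = "sex"    # hypothetical string variable
ws["B1"] = "item1"  # hypothetical 1-5 Likert-type item

# Dropdown list: only "male" or "female" may be entered in column A.
sex_dv = DataValidation(type="list", formula1='"male,female"', allow_blank=True)
sex_dv.errorTitle = "Invalid entry"
sex_dv.error = "Please choose a value from the dropdown list."
ws.add_data_validation(sex_dv)
sex_dv.add("A2:A200")

# Whole-number rule: responses in column B must be between 1 and 5.
likert_dv = DataValidation(type="whole", operator="between",
                           formula1=1, formula2=5, allow_blank=True)
likert_dv.errorTitle = "Out-of-range value"
likert_dv.error = "Enter a whole number between 1 and 5."
ws.add_data_validation(likert_dv)
likert_dv.add("B2:B200")

wb.save("entry_template.xlsx")
```

Building the rules in code rather than through the dialog makes it easier to apply identical validation to every template in a multi-measure study.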

Preventing Missing Data

Data are genuinely and justifiably missing when a patient is unwilling or unable to answer a question. Items accidentally skipped over during data entry, in contrast, result in missing values where there were actual responses. An advantage of using built-in functions for data entry is preventing these inadvertent omissions of items when data are being entered directly into a computer or netbook in the field. In Excel, a count function can be used to make sure all items on each scale have been answered. In our study, the spreadsheet contained a cell at the end of each scale that counted the number of responses and printed a "Completed" message when all had been completed. If not, the assessor was shown the number of items remaining. The Excel COUNT and IF functions were used to accomplish this task.


For example, if 10 items were to be filled out in the A1 through J1 cells of a spreadsheet, cell J11 could contain the following: =IF(COUNT(A1:J1)=10, "Completed", "Not complete"). The IF function follows the if-then-else logic of traditional computer programming languages. If the condition in the first argument is met, then the value given in the second argument is printed. If the condition is not met, the result shown is the value in the third argument. The else result may be thought of as the if not result (i.e., if true, then this; if not, then this). The IF function uses the following syntax: =IF(condition, value_if_true, value_if_false). The values can be numeric or text-based, provided the text is in quotes. The COUNT function returns the number of valid responses in the row of 10 cells. Therefore, the condition that must be met for the resulting value in the cell to be "Completed" is that 10 valid responses must be in the row of 10 cells. While it may be easy to see whether items are missing when only a few items are being entered, visual inspection becomes much more difficult when a scale contains many items.

In the event that a data point is legitimately missing, the result of the IF function will still be "Not complete" unless some value is filled in for the missing item. Researchers typically use -9, -999, 999, or a similar numeric code to indicate that a value is legitimately missing. We would caution against using string values, zero, or any numeric value that could be a plausible value on any variable in the data set. If a specific missing value code is used, then it needs to be included in the list of valid values.
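For researchers who compile entered files with a script, the same completeness check can be repeated after the fact. The minimal Python sketch below (not part of the original study; the file and column names are hypothetical) uses pandas to count valid responses per row and flag incomplete records, mirroring the in-sheet COUNT/IF check.

```python
import pandas as pd

# Hypothetical entered data: one row per participant, items item1-item10.
df = pd.read_excel("participant_001_time1.xlsx")
items = [f"item{i}" for i in range(1, 11)]

# Count non-missing responses per row, mirroring =IF(COUNT(...)=10, ...).
n_valid = df[items].notna().sum(axis=1)
df["entry_status"] = n_valid.apply(
    lambda n: "Completed" if n == len(items)
    else f"{len(items) - n} item(s) remaining"
)

print(df[["entry_status"]])
```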

Data Entry Templates

Templates that match research tools can assist in data collection and can incorporate the strategies just described. Another strategy used in our pilot study was that a new template was opened in which to record data each time the RA collected data from a new individual or at a new time point. A template is a type of Excel file that, once opened, only allows the user to save the file as a regular Excel spreadsheet (.xls or .xlsx). The template file itself is not overwritten once data are entered, so a new file with no data is opened each time the template is opened. It would be sufficient to use the same spreadsheet and add a new column to each worksheet for each new entry. However, having a clean template for each occasion makes data entry easier because the RA does not need to scroll past previously collected data, and there is less possibility that previously collected data will inadvertently be changed or erased. This strategy does, however, make data management more challenging because the data are spread out across many spreadsheets in many files (one for each participant at each time point). Therefore, it is important to plan in advance how to combine the data into a comprehensive data set.
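If the entry workbook is generated in code, it can be saved directly as a template file. A minimal sketch, again assuming openpyxl and hypothetical names:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Measure1"   # one worksheet per measure, as in the study
ws["A1"] = "item1"      # hypothetical item columns

# Saving with template=True produces an .xltx file; opening that file
# later yields a fresh, unsaved copy, so the blank template itself is
# never overwritten with collected data.
wb.template = True
wb.save("entry_template.xltx")
```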


A very compact netbook was used to collect the data in our pilot study, which made it difficult for some nursing home residents to read items and response options. Thus, the RA carried large-type paper copies of the instruments. Some participants were not comfortable with the RA using the computer at all; in those cases, the RA collected the data on paper copies and later transferred the responses to the data file. In some cases, a participant asked to see the question in print. Having a paper copy also allowed the assessor to point to the questions and response options if needed.

When measurement tools are converted from paper forms to computer forms, it is helpful to maintain the layout and look of the paper copy in the computer format to aid the RA in relating the computerized form to the paper copies. For example, the size, format, and color of text, cell highlighting, borders, visual spacing of items, and other aspects of layout can all be manipulated in Excel to closely match the original instruments. Therefore, each of the worksheets in our Excel file had a different look to match the format of each measure, which is difficult to achieve using a database or web-based data collection tool. Our template was constructed using the strategies introduced earlier, including restricted response options and checks for missing data on each measure. Once the data have all been entered in Excel, the analyst can use SPSS, SAS, or another statistical software package to import the data from Excel. The columns that contained the lists of valid responses may be deleted before or after exporting the data.

After data collection, all the completed Excel spreadsheets were put into one file folder designated for collected data. A SAS macro was then built that took every Excel file and stacked them all into one SAS data file (a sketch of this stacking step appears at the end of this section). A macro is a text-based syntax file that runs a batch of code a set number of times or performs identical actions on a large set of data files. Our macro was constructed to avoid having to open each Excel file and cut and paste them together for data analysis. Building a macro could also be accomplished using SPSS or the Visual Basic interface in Excel. Creating a macro, however, does involve some trial and error, so it is important to test the Excel template and macro by filling out several templates with mock data to ensure that the macro will be able to stack the files properly. Some modifications to the Excel template may be necessary for the macro to work properly; it is therefore imperative to make these modifications before any real data are collected. The process of building an Excel template, and a plan for compiling data if separate templates are used, can be somewhat time-consuming prior to the onset of a study. Researchers and analysts should plan adequate time for database construction and data entry.
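The study's stacking step was implemented as a SAS macro; the sketch below shows the same idea in Python with pandas (folder and file names are hypothetical), for readers without access to SAS.

```python
import glob
import pandas as pd

# Read every completed template in the designated folder and stack
# them into one long file: one row per participant per time point.
files = sorted(glob.glob("collected_data/*.xlsx"))
frames = [pd.read_excel(f) for f in files]
combined = pd.concat(frames, ignore_index=True)

combined.to_csv("combined_data.csv", index=False)
print(f"Stacked {len(files)} files into {len(combined)} rows.")
```

As with the SAS macro, this assumes every template keeps the same column names in the same positions, which is another reason to finalize and test the template with mock data before real data collection begins.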


Even for low-budget studies, relational databases may be useful alternatives to simple spreadsheets if the researcher is comfortable with using them. One disadvantage of databases is that they can be difficult to master. Researchers who wish to use databases should seek assistance from a computer programmer or participate in training courses to learn and practice using them before relying on them for their research. As mentioned earlier, relational database programs, such as Microsoft® Access, are highly flexible tools that have several advantages, such as linking multiple data files, creating queries, using custom data entry forms, and automatically generating reports when new data are entered. Spreadsheets, such as those used in Excel, do not have these advantages. One example of an advantage of using Access is when a researcher wishes to link multiple data files, such as when demographic data are in one file and outcome measures are in other files. This helps avoid duplicating information in longitudinal studies and keeps files smaller and more manageable. Linking files by participants' ID values enables the user to construct a full data set when needed (a sketch of this step appears at the end of this section). Using a spreadsheet, all data for a participant need to be kept in one row, and merging files can be problematic if ID values are not kept in order.

Another disadvantage of Access is that changes or deletions are saved immediately. Excel spreadsheets offer the chance to not save changes when the researcher knows that something has been altered, and being able to revert to an earlier saved version of the data may aid in correcting errors in data management.

Researchers may also encounter problems when data are to be transferred to a statistical analysis program, such as SAS or SPSS. Both SAS and SPSS are able to import files from both Excel and Access, but problems are more likely to occur with spreadsheets. For example, spreadsheets enable the user to add charts or graphics, keep variable names in a place other than the top row, and change the configuration of the spreadsheet by merging multiple cells together; all of these can cause problems with data transfer. Database programs, in contrast, do not allow this flexibility within their tables. The increased flexibility of a spreadsheet can therefore be a hindrance when transferring data. Although spreadsheets may make it easier to see what one is doing, spreadsheet users are certainly not immune to errors. Researchers must be very careful with important data regardless of the system used for data entry and storage.
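To make the linking concrete, here is a minimal pandas sketch (illustrative only; the file and column names are hypothetical) that joins a demographics file to an outcomes file on participant ID, as a relational database would.

```python
import pandas as pd

demographics = pd.read_excel("demographics.xlsx")  # one row per participant
outcomes = pd.read_excel("outcomes_time1.xlsx")    # one row per participant per occasion

# Link on the shared ID column to build a full analysis data set; a left
# join keeps every participant, even those with missing outcome data.
full = demographics.merge(outcomes, on="participant_id", how="left")
print(full.head())
```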

Summary

Many options exist for methods of data entry. REDCap and other online or mobile methods are growing in popularity, but software programs such as Excel and SPSS remain the most commonly used methods. Each comes with its own advantages and disadvantages, and the best choice depends on the situation and on what the researcher is most comfortable with.


Excel is a good choice for many researchers because it is widely available and familiar, and no further help is needed from computer programmers or technology specialists. Accurate data collection can be greatly improved by building in constraints that ensure valid data. The greatest advantage of these strategies is realized after data collection is complete: without invalid data values and inadvertently missing data, a great deal of the time, effort, and stress associated with continued data checking, error fixing (or data cleaning), and repeatedly analyzing data can be saved. Accurate data entry is a matter of quality assurance. This article provides some basic database design strategies that any researcher can use to improve data accuracy without a large budget.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Beretta, L., Aldrovandi, V., Grandi, E., Citerio, G., & Stocchetti, N. (2007). Improving the quality of data entry in a low-budget head injury database. Acta Neurochirurgica, 149, 903-909.

Cooper, C. J., Cooper, S. P., del Junco, D. J., Shipp, E. M., Whitworth, R., & Cooper, S. R. (2006). Web-based data collection: Detailed methods of a questionnaire and data gathering tool. Epidemiologic Perspectives & Innovations, 3, 1-11.

Gliklich, R. E., & Dreyer, N. A. (Eds.). (2010). Registries for evaluating patient outcomes: A user's guide (2nd ed., AHRQ Publication No. 10-EHC049). Rockville, MD: Agency for Healthcare Research and Quality.

Goldberg, S. I., Niemierko, A., Shubina, M., & Turchin, A. (2010). "Summary Page": A novel tool that reduces omitted data in research databases. BMC Medical Research Methodology, 10, 91-97.

Greiver, M., Barnsley, J., Aliarzadeh, B., Krueger, P., Moineddin, R., Butt, D. A., & Kaplan, D. (2011). Using a data entry clerk to improve data quality in primary care electronic medical records: A pilot study. Informatics in Primary Care, 19, 241-250.

Harris, P. A., Taylor, R., Thielke, R., Payne, J., Gonzalez, N., & Conde, J. G. (2009). Research Electronic Data Capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics, 42, 377-381.


Horn, W., Miksch, S., Egghart, G., Popow, C., & Paky, F. (1997). Effective data validation of high-frequency data: Time-point-, time-interval-, and trend-based methods. Computers in Biology and Medicine, 27, 389-409.

Kim, H.-Y., & Park, H.-A. (2012). Development and evaluation of data entry templates based on the entity-attribute-value model for clinical decision support of pressure ulcer wound management. International Journal of Medical Informatics, 81, 485-492.

Paulsen, A., Overgaard, S., & Lauritsen, J. M. (2012). Quality of data entry using single entry, double entry and automated forms processing—An example based on a study of patient-reported outcomes. PLoS ONE, 7, e35087.

Wagner, L. M., Capezuti, E., Taylor, J. A., Sattin, R. W., & Ouslander, J. G. (2005). Impact of a falls menu-driven incident-reporting system on documentation and quality improvement in nursing homes. The Gerontologist, 45, 835-842.

Warner, C. B., & Meehan, A. M. (2001). Microsoft Excel™ as a tool for teaching basic statistics. Teaching of Psychology, 28, 295-298.

Zhang, Y., Huo, M., Zhou, J., & Xie, S. (2010). PKSolver: An add-in program for pharmacokinetic and pharmacodynamic data analysis in Microsoft Excel. Computer Methods and Programs in Biomedicine, 99, 306-314.
