Accepted Manuscript

Should we Trust Survey Data? Assessing Response Simplification and Data Fabrication

Jörg Blasius, Victor Thiessen

PII: S0049-089X(15)00076-9
DOI: http://dx.doi.org/10.1016/j.ssresearch.2015.03.006
Reference: YSSRE 1773

To appear in: Social Science Research

Received Date: 5 February 2014
Revised Date: 21 February 2015
Accepted Date: 19 March 2015

Please cite this article as: Blasius, J., Thiessen, V., Should we Trust Survey Data? Assessing Response Simplification and Data Fabrication, Social Science Research (2015), doi: http://dx.doi.org/10.1016/j.ssresearch.2015.03.006


Should we Trust Survey Data? Assessing Response Simplification and Data Fabrication

Jörg Blasius (1) and Victor Thiessen (2)

(1) University of Bonn, Institute of Political Sciences and Sociology, Lennéstr. 27, 53113 Bonn, Germany
Tel.: +49 (0) 228 738421, Fax: +49 (0) 228 738430, Email: [email protected]

(2) Dept. of Sociology and Social Anthropology, Dalhousie University, Halifax, NS, Canada B3H 1P9
Tel: +1 902 494-1777, Fax: +1 902 494-2897, Email: [email protected]

Highlights

• Internationally important survey projects such as PISA can generate dynamics that undermine the integrity of the data.
• Even among well-educated respondents, such as school principals, extreme response simplification occurs surprisingly frequently.
• We uncovered strong evidence of data fabrication in several countries in PISA 2009.
• Unit nonresponse rates are, at best, ambiguous indicators of data quality.



Abstract

While many factors, such as unit and item nonresponse, threaten data quality, we focus on data contamination that arises primarily from task simplification processes. We argue that such processes can occur at two levels. First, respondents themselves may engage in various response strategies that minimize their time and effort in completing the survey. Second, interviewers and other employees of the research institute might take various shortcuts to reduce their time and/or to fulfill the requirements of their contracts; in the simplest form, this can be done via copy-and-paste procedures. This paper examines the cross-national quality of the reports from principals of schools participating in the 2009 PISA. We introduce two measures of data quality to document that extreme response simplification characterizes the behavior of substantial numbers of school principals in numerous countries. Additionally, we discovered strong evidence of data fabrication in several countries.

Keywords: data quality, data fabrication, undifferentiated responses, identical response pattern.



1. Introduction

It is a truism that survey data are vulnerable to numerous sources of error, some of which can be minimized through stringent quality controls by survey research institutes. Perhaps for this reason, researchers increasingly rely on well-known and reputable data sets such as the International Social Survey, the European Social Survey, some National Election Studies, and well-known panel data such as the PSID. Another important data set that generates much attention in the news media and in politics is PISA, the Program for International Student Assessment. The PISA surveys are known to be executed by well-qualified research groups, with stringent technical quality control mechanisms implemented by the Organization for Economic Cooperation and Development (OECD). Part of the data consists of school-level information provided by the principals of the participating schools. Given the strict quality controls and the high educational levels of the respondents (principals), one might expect these data to be of exceptionally good quality. On the other hand, the very importance of the PISA surveys may generate dynamics that undermine their quality. The purpose of this paper is to examine this issue using the 2009 school-level data, which are freely accessible (http://pisa2009.acer.edu.au/downloads.php). Our investigation focuses on two components of data quality. First, we search for signs of respondent task simplification behaviors. Here we identify statistically improbable response patterns on sets of items, such as those on the nature of the school climate. Second, we look for evidence of duplicated data that might have been fabricated simply through copy-and-paste procedures. For these two tasks, we compare each case with all others within each country and between all countries.

To accomplish this efficiently, we apply principal component analysis (PCA) and multiple correspondence analysis (MCA) to obtain factor scores for each respondent. We employ these techniques solely for the purpose of examining the frequency distributions of factor scores to identify multiple occurrences of identical response patterns; i.e., unlike the traditional use of these techniques, we are not interested in possible substantive interpretations.
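To make this screening logic concrete, a minimal sketch is given below. It is not the authors' code; the data frame, the numeric item coding, and the rounding threshold are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): use first-dimension PCA factor scores
# as fingerprints of response patterns and count how often each score recurs.
import pandas as pd
from sklearn.decomposition import PCA

def recurring_pattern_scores(items: pd.DataFrame) -> pd.Series:
    """items: one row per respondent, one column per battery item,
    complete cases only, responses coded numerically (e.g. 1-4)."""
    scores = PCA(n_components=1).fit_transform(items.values)[:, 0]
    # Identical response patterns necessarily receive identical factor scores;
    # rounding guards against spurious floating-point differences.
    fingerprints = pd.Series(scores, index=items.index).round(10)
    counts = fingerprints.value_counts()
    return counts[counts > 1]  # factor scores that occur more than once
```

Any factor score returned by such a sketch corresponds to a response pattern shared by two or more respondents and can then be inspected, for example to see whether it is an undifferentiated pattern.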

2. The Quality of Survey Data

Most previous research on poor data quality due to measurement error focused on respondents. Two main approaches were developed in this regard: response styles (Baumgartner and Steenkamp, 2001) and “satisficing” strategies (Krosnick and Alwin, 1987; Krosnick, 1991, 1999). These are briefly summarized next, followed by a review of the more scant literature on the role of interviewers and research institutes.

Oskamp (1977: 37) defined response styles as “systematic ways of answering which are not directly related to the question content”. A large literature has identified various response tendencies such as acquiescence, selecting the extreme options, mid-point responding, limited response differentiation, and random or arbitrary responding (Aichholzer, 2013; Baumgartner and Steenkamp, 2001; Van Rosmalen, van Herk and Groenen, 2010; Watkins and Cheung, 1995). Controlling for response styles is particularly important for cross-cultural studies, since response styles have been found to vary by both race and culture (Bachman and O'Malley, 1984; Dayton, Zhan, Sangl, Darby, and Moy, 2006; Hamamura, Heine, and Paulhus, 2008; Hui and Triandis, 1989). Including a latent style factor increased the number of countries with factorial invariance (Cambré, Welkenhuysen-Gybels, and Billiet, 2002; Diamantopoulos, Reynolds, and Simintiras, 2006; Khorramdel and von Davier, 2014) and produced fewer counter-intuitive findings (Elliott et al., 2009; Weech-Maldonado, Elliott, Oluwole, and Hays, 2008).

Krosnick and others popularized the term “satisficing” to provide a theoretical rationale to explain various response tendencies (cf. Krosnick, 1991, 1999; Krosnick and Alwin, 1987). Krosnick’s application is based on Tourangeau’s (Tourangeau and Rasinski, 1988; Tourangeau, Rips and Rasinski, 2000) four-step cognitive process model for producing high-quality information: the respondent must 1) understand the question, 2) retrieve the relevant information, 3) synthesize the retrieved information into a summary judgment, and 4) choose a response option that most closely corresponds with the summary judgment. Satisficing theory emphasizes that it is the limited cognitive ability of some respondents that induces them to engage in satisficing behavior. The empirical evidence is consistent with this emphasis, since educational attainment and cognitive skills are consistent predictors of various indicators of satisficing (Converse, 1976; Krosnick, 1991; Marsh, 1986, 1996; Meisenberg and Williams, 2008; Thiessen and Blasius, 2008; Wilkinson, 1970).

Krosnick, Narayan and Smith (1996) distinguished between weak and strong satisficing. Weak satisficing entails executing all four stages described by Tourangeau, but being less than thorough in doing so. In contrast, strong satisficing omits the retrieval and judgment steps altogether, with respondents simply selecting what they consider to be reasonable responses. An extreme form of “strong satisficing” might consist of selecting only a single response option (such as “strongly agree”) for all items in a large set of items. A related form is to consistently choose the most positive answer, especially when it is the socially most desirable one, and hence easily defended (Krosnick et al., 1996: 32). Such behavior results in undifferentiated response patterns (URPs), also known as straight-lining. Respondents producing any kind of URP simultaneously simplify their task and save time. In terms of Tourangeau’s model, there is minimal need to understand the question or to retrieve relevant information; as a consequence, there is little information that requires synthesis into a summary judgment, and without a summary judgment there is no response option that could most closely correspond to it.

We emphasize task simplification rather than satisficing responses for several reasons. While we agree that cognitive limitation can spawn satisficing behaviors, our concept of task simplification is more general. It permits us to understand satisficing dynamics that occur not only because of high cognitive demands, but also because of lack of interest in or knowledge of the topics being investigated (Blasius and Thiessen, 2001). Further, it is also more parsimonious in that it recognizes that all response styles also simplify the task for the respondent. Rather than having to weigh numerous response options, only subsets of them need to be considered. It is arguably for this reason that response styles are, just like manifestations of satisficing, consistently negatively associated with education (Bachman and O'Malley, 1984; Elliott, Haviland, Kanouse, Hambarsoomian, and Hays, 2009; Greenleaf, 1992; Watson, 1992). Perhaps most importantly, it alerts the researcher that it is not only respondents who might wish to minimize their time and energy commitment: interviewers and employees in research institutes might have the same motivation. For interviewers this can take the form of deviating from the prescribed interview protocol, which includes faking, or partially faking, their interviews (Winker, Menold and Porst, 2013). For example, they can skip time-consuming questions, such as long item batteries, and fill in plausible fabricated responses later (Blasius and Thiessen, 2013). Under the header of interviewer motivation, cheating was discussed as early as the mid-forties (Crespi, 1945, 1946; Durand, 1946). Crespi (1945: 431) argued that cheating “lies as much in the structure of the ballots and the conditions of administration as in the personal integrity of the interviewer”. He identified questionnaire design features that might demoralize interviewers, such as unreasonable length, apparent repetition of questions, as well as complex questions. Another reason for task-simplifying rule-breaking could be “pressure from a supervisor or even higher up” (Nelson and Kiecker, 1996: 1110). Likewise, staff of research institutes can reduce their effort by fabricating interviews through copy-and-paste procedures. As Nelson and Kiecker (1996) stated for the interviewers, higher-level employees might pressure their subordinates to be more efficient; that is, to process more cases in less time while simultaneously maintaining a high response rate.

Greater efficiency increases profits, and higher response rates improve the institute’s reputation. We are not suggesting that supervisors explicitly instruct their staff to fabricate data or that the head of an institute would condone its employees copying and pasting interviews. Nevertheless, instances of fabricated interviews as well as of copied and pasted data in well-known international projects such as the World Value Survey 2005-2008 have been documented (Blasius and Thiessen, 2012).

Common to all three sources of task simplification (respondent, interviewer, and staff of research institutes) is that they minimize the time and effort necessary to complete interviews, or to realize the required sample size (or response rate). Of course, all three sources of task shortcuts also reduce the quality of the data. Although not all URPs indicate low data quality, all of them can be seen as manifestations of time-saving task simplification (Blasius and Thiessen, 2012). By simplifying the task, respondents can appear to be cooperative while simultaneously minimizing the amount of time and effort required to complete it. Furthermore, since only a limited number of simplified response patterns is possible within a set of items, such as straight-lining or alternating between “strongly agree” and “agree”, or between “strongly disagree” and “disagree”, this could result in several respondents producing identical response patterns (IRPs). Therefore, not all IRPs necessarily imply data fabrication. Our analyses will provide statistical guidelines to help draw appropriate conclusions. Since the PISA school-level data are obtained via self-administered questionnaires, only two sources of simplification are possible: satisficing respondents and dishonest employees in research institutes.

3. Data and Hypotheses


In 2009, PISA data from 515,958 pupils in 18,641 schools in 73 countries were obtained. Since France did not provide school-level data and since in Liechtenstein only 12 schools were surveyed, we excluded both countries, which reduced the number of schools to 18,461. We concentrate on the quality of the school-level data obtained from the principals (or their designates). The principals were asked about the structure and organization of their school, its resources, and its climate, among other topics. Principals were assured that the information provided would be kept confidential and that their responses would be combined with those of other principals such that the individual schools could not be identified (see school questionnaire, p. 3). The sample design of the PISA study was quite complex. To start, a minimum response rate of 85% was required for the initially-selected schools. Where the initial school response rate fell between 65% and 85%, the use of replacement schools was allowed to achieve an acceptable school response rate. Furthermore, schools with student participation rates between 25 and 50% were not classified as participating schools but were nevertheless included in the data set, while schools with a student participation rate of less than 25% were excluded from the database (OECD, 2012: 60). We assume that all countries included in the PISA 2009 data set could document that they fulfilled these stringent conditions. Further information on sampling design, measurement, and technical considerations is available in the PISA 2009 Technical Report (OECD, 2012). Principals of schools represent an elite group that has achieved high levels of education, usually a university degree. As a result, in line with the research findings cited earlier, they should produce responses of superior quality. That is, principals can be expected to be relatively immune to such things as response styles, and relatively sophisticated in their ability to comprehend the questions, recall and synthesize the relevant information, and select the appropriate responses.


Given the OECD sponsorship and the importance of the data for national and international educational policies, and given the fact that their school was selected to take part, some principals probably felt obligated to complete the questionnaire even if they were not keen to participate. The cover sheet of the school questionnaire states that completing the questionnaire should take about 30 minutes. However, there are 21 pages of questions, with some questions requiring the principal to estimate various percentages (such as foreign-born students), or to provide precise counts (such as the number of boys and girls in their school). On average the principals would have to spend less than 1.5 minutes per page to complete the questionnaire in the indicated amount of time, which is rather optimistic. A diligent principal who provides thoughtful answers would require substantially more time. The length of the questionnaire and the high workload might entice some principals to employ undesirable shortcuts. The magnitude and complexity of the PISA project means it can be undertaken only by major research institutes within each country. These institutes would require a nationwide net of staff and supervisors to be able to complete the field work within a reasonable time (on average 260 schools participated in each country). To administer several thousand student achievement tests and questionnaires (in addition to the school principal survey) all over the country involves many employees operating at different levels and with different responsibilities within the organization. Superb technical and organizational skills are required if mistakes at each position are to be minimized. Since PISA constitutes an important project for research institutes, pressure by the head of the institute might be exerted on the persons responsible for the entire study to do a good job. An important indicator for doing a superb job is to convince 85% of the randomly selected schools (or their eligible replacements) to participate in PISA, a condition that was fulfilled in all countries except the United States (OECD, 2012: Table 11.3, pages 165-166, unweighted schools); many countries even reported values of 100%. A second indicator is to obtain high response rates from the target students; the reported country means range between 81.1% in 9

Canada and 99.6% in Macao-China (OECD, 2012: Table 11.4. pages 166-167, unweighted cases). In addition to high response rates from the schools and the students, a high response rate from the principals is required to obtain the necessary school-level information -- ideally, to convince all of them to complete the questionnaire. The pressure on the staff responsible for the survey conceivably filters through all levels of the hierarchy of persons involved in the process of collecting and processing the data. The question arises, what happens if some forms get lost or some principals fail to participate, or a principal responsible for several schools fills in only one form -- perhaps noting that “the same holds for the other schools”? In most surveys missing cases might be replaced by alternates, but this is not possible in PISA. Here the target sample consists only of principals from the participating schools and there is no possibility of substitution. Opportunities to cheat (in this case to fabricate the missing data) exist in principle at all levels of the hierarchy. If an employee has access to the data entry process, an easy solution to missing interviews is to copy-and-paste them. That an employee might copy-and-paste interviews would be unexpected by the respective supervisors (or the head of the institute), and inordinate resources would be required to check for duplicates in hundreds or thousands of interviews with hundreds of questions. Therefore, the chance of being detected and risking sanctions for such deviant behavior is rather low, while the rewards for “exemplary performance” might be high.

4. Assessing the Quality of the PISA 2009 Data

In a recent study, Blasius and Thiessen (2012) proposed several procedures to assess the quality of survey data. Using a variety of well-known and publicly available data sets, they demonstrated how these procedures can be used to screen data, among them principal component analysis (PCA) and multiple correspondence analysis (MCA).

Both techniques produce latent continuous scales that are derived from the observed variables. In PCA, input data are assumed to have metric properties, while in MCA they can be unordered categorical. One advantage of the latter property is that missing values can easily be included as an additional category. While both methods are typically employed to create latent variables for substantive purposes, our sole interest in these methods resides in the fact that each combination of responses results in a unique factor score on each dimension. For example, in a set of ten variables with four response categories each, PCA (or MCA) generates in theory 4^10 = 1,048,576 unique response patterns or distinct factor scores on each dimension. Of course, since the variables are usually inter-correlated and unevenly distributed, the various response combinations have different probabilities of occurring. Granted, some response combinations might occur more than once due to simplified response patterns. However, with 1,000 respondents and one million possible response patterns, the probability that by sheer coincidence more than two or three respondents selected an identical response pattern is low. Fortunately, any simplified (or “popular”) response pattern, such as undifferentiated responses, would become apparent through multiple occurrences of identical factor scores. We will use the factor scores to count the number of distinct response patterns in a given set of items and to identify response patterns that appear multiple times.

Two response patterns are of particular importance with respect to data quality. The first concerns principals who selected the same response option across all items in a given domain; we call this form of task simplification “undifferentiated response patterns” (URPs). If four response categories are provided, then four different URPs are possible, such as consistently responding with “strongly agree”. The second is multiple occurrences of identical response patterns (IRPs); i.e., when several principals apparently provide identical responses across numerous items.

While some IRPs might be manifestations of task simplification strategies (depending on their nature, number, and distribution), including all URPs, a large number of different IRPs arouses suspicions of data fabrication. We analyse three item batteries representing three domains (school climate, resource shortages, and management practices) to detect IRPs and URPs. The school climate domain was introduced by the statement “In your school, to what extent is the learning of students hindered by the following phenomenon?”, namely “a) teachers’ low expectations of students, b) student absenteeism, c) poor student-teacher relations, d) disruption of classes by students, e) teachers not meeting individual student’s needs, f) teachers absenteeism, g) students skipping classes, h) students lacking respect for teachers, i) staff resisting change, j) student use of alcohol or illegal drugs, k) teachers being too strict with students, l) students intimidating or bullying other students, m) students not being encouraged to achieve their full potential”. Four response options were provided: “not at all”, “very little”, “to some extent”, and “a lot” (questions q17a to q17m, principal questionnaire). The resource shortages domain was tapped with the question “Is your school’s capacity to provide instruction hindered by any of the following issues?” Thirteen types of shortages or resource inadequacies are listed, with the same response options as for the first domain (q11a to q11m, principal questionnaire). For both sets of items, considering only the four substantive categories, 4^13 = 67,108,864 response combinations are possible, i.e. the average probability of occurrence of each response combination is 1.49 × 10^-8. The management practices domain comprises 14 items regarding the principal’s own behaviour: “Below you can find statements about your management of this school. Please indicate the frequency of the following activities and behaviours in your school during the last school year.” (q26a to q26n, principal questionnaire) In contrast to the other two domains, these items are formulated in a positive direction (for example, “I check to see whether classroom activities are in keeping with our educational goals”). Four response options (“never”, “seldom”, “quite often”, and “very often”) were given for each of the 14 items.

To assess the quality of the school principals’ data, we first report the traditional measures of nonresponse. We then search for simplification strategies (or simple response patterns) exhibited by the principals, as detected by URPs and IRPs within a given domain. Finally, we look for institutional data simplification practices, as revealed by IRPs across a large number of items spanning numerous domains where no (or only low) inter-correlations are to be expected.
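To make the probability argument concrete, the counts quoted above can be reproduced in a few lines. This is a back-of-the-envelope illustration under the (admittedly unrealistic) assumption of independent, uniformly distributed responses, not the authors' computation.

```python
# Back-of-the-envelope counts for a battery of 13 four-category items (sketch).
n_items, n_categories = 13, 4
n_patterns = n_categories ** n_items   # 4^13 = 67,108,864 possible patterns
print(n_patterns, 1 / n_patterns)      # average probability of each pattern ~1.49e-08
print(2 ** n_items)                    # still 8,192 patterns with only two categories
```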

5. Findings

It is unlikely that every principal will complete the entire questionnaire, and some may refuse to participate altogether, since both unit and item nonresponse are found in all social surveys. We define (and consequently exclude) a case as constituting a “unit nonresponse” if only missing values occur for all 184 variables comprising question 11 through question 26 (see school questionnaire, pp. 10-24), since these are the items that will be used in our subsequent analyses. The institutes included these cases, as well as unit-nonresponse cases, to simplify matching and adding the principal data to the pupil data. Against our expectations, in most countries sufficient information was available from all principals; only in Iceland and Ireland did more than 10% of principals meet our criterion of unit nonresponse. For our analyses we had to exclude only 1.0% of the cases, or 1.2% on country average (see Appendix). Starting with the school climate items, no missing values occurred in Australia, Macao and Romania. At the other end of the continuum, 13.6% of principals in Azerbaijan and 9.8% in Kyrgyzstan produced at least one missing value. A somewhat larger variation in the percentages of missing values occurs in the resource shortages domain, ranging from none in Australia, Shanghai, Macao, and Romania, up to 21.0% in Azerbaijan.

A comparable distribution of missing values occurs for the management practices items. Again, no missing values were found in Australia, Japan, Macao, and Singapore, while Azerbaijan again had the highest percentage of missing values (13.0%).

5.1 Response Simplification Patterns

The simplest response pattern on the 13 school climate items consists of selecting the same response to all items, i.e. an undifferentiated response pattern (13 times “not at all” or 13 times “very little” or 13 times “to some extent” or 13 times “a lot”). The item content argues against interpreting a large number of such URPs as reflecting realistic assessments; it might be more prudent to consider many of them to be manifestations of (extreme) response simplification. To simplify matters, we restrict our calculations to the 17,635 principals (96.4% on country average) who gave a valid response to all 13 items. Out of a total of 229,255 responses (13 × 17,635) we found 68,888 “not at all” (= 30.1%), 103,784 “very little” (= 45.3%), 45,898 “to some extent” (= 20.0%), and 10,685 “a lot” (= 4.7%) (cf. Table 1).

Table 1: Percentages of responses and number of undifferentiated response patterns (URPs)

Although clearly more principals judged their schools positively than negatively, even outstandingly positive judgments should rarely lead to 13 ticks on “not at all”; rather, they should generate a mixture of “not at all” and “very little” responses. Even if just the first two response categories were chosen, 2^13 = 8,192 response combinations are possible. On purely probabilistic grounds, very few instances should be found of principals who chose the same response 13 times. Yet for the school climate domain, 283 principals (= 1.6%) responded 13 times with “not at all”, another 135 consistently chose “very little” (= 0.8%), 11 ticked “to some extent” on all items, and another 13 repeatedly chose “a lot”.

Note that although the ratio of “very little” to “not at all” responses is roughly 3 to 2, more than twice as many URPs occur on the “not at all” response than on the “very little” response. This is a strong indicator of extreme response simplification, where we have straight-lining together with some recency effect. In addition to the four types of URPs, many less extreme forms of response simplification could occur, such as 1-1-1-1-1-1-1-1-1-1-1-1-2. On country average we find 6.2% IRPs (see Appendix); for purely probabilistic reasons, almost all of them probably represent simple response patterns such as URPs. Since both the form and the prevalence of response simplification behaviors likely differ by country, we explore this possibility next. The number of valid cases ranges between 39 (Luxembourg) and 1,492 (Mexico), with 49 countries having fewer than 200 valid cases. Especially in countries with relatively few cases, say fewer than 200, few response patterns should appear more than once. In the following we first perform PCA across the 17,635 valid cases to make it easier to compare response patterns across countries. We use bar charts to visualize the factor scores on the first dimension to portray country-wise multiply occurring instances of IRPs, but we also counted their frequencies. A factor score that occurred twice is equal to one IRP, a factor score that occurred three times counts as two IRPs, a factor score that occurred four times corresponds to three IRPs, and so on. As an example, Figure 1 shows the frequencies of the unrotated PCA factor scores on the first dimension for Slovenia and the USA.
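The counting rule just described (a factor score occurring k times contributes k − 1 IRPs), together with the URP definition, can be written compactly. The sketch below is illustrative only and assumes numerically coded, complete-case item responses; the score counts are of the kind produced by the sketch in the introduction.

```python
# Sketch: URP and IRP counts for one item battery (not the authors' code).
import pandas as pd

def count_urps(items: pd.DataFrame) -> int:
    # A URP is a row in which every item received the same response.
    return int((items.nunique(axis=1) == 1).sum())

def count_irps(score_counts: pd.Series) -> int:
    # A factor score occurring k times contributes k - 1 identical response patterns.
    recurring = score_counts[score_counts > 1]
    return int((recurring - 1).sum())
```

Applied country by country and divided by the number of complete cases, these counts give percentages of the kind reported in the Appendix.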

Figure 1: Bar charts of the school climate domain factor scores for USA and Slovenia

For the USA, the factor scores of 139 out of 162 principals who responded to all items appeared only once, eight appeared twice (= 8 IRPs), and one response combination appeared seven times (= 6 IRPs), resulting in a total of 14 IRPs or 8.6%.

As expected, the most frequent response pattern is a simplified one, in this case a URP (checking 13 times “very little”; the respective factor score is 0.03192). Another URP (13 times “not at all”) appears twice at the very left of Figure 1; the respective factor score is -1.86977. Note that the same distribution of frequencies would occur when counting the factor scores of the second, third or any higher dimension, for both rotated and unrotated solutions. In contrast, in Slovenia, out of 333 cases with no missing data, we find 156 unique response patterns, 52 duplicates (= 52 IRPs), 19 triplicates (= 38 IRPs) and four quadruplets (= 12 IRPs), or a total of 102 (= 30.6%) IRPs. Finding nine different patterns of identical responses within 162 cases, as found for the USA, is not implausible; they might be caused by simplification strategies of the principals. However, obtaining 75 different patterns of identical responses within 333 cases is troublesome, and will be examined in depth below.

In the Appendix we show the numerical solutions (URPs and IRPs) for all countries and for all three domains. Both URPs and IRPs are unevenly distributed across countries. In Finland, Romania, and Slovakia not a single URP occurs, but in India, Macao and Azerbaijan there are 8.5%, 8.9%, and 9.3%, respectively. The highest percentages of IRPs occur in Slovenia (30.6%) and in Dubai/UAE (22.3%), while no IRPs appear in Moldova and Romania. If the number of different IRPs is relatively small, and if the respective response patterns represent instances of URPs or other simple response patterns, then principals are the likely source of this simplification.

For the resource shortages items, 1,016 principals (5.9%) gave a “not at all” URP, an additional 20 principals produced a “very little” URP, 11 principals a “to some extent” URP, and 15 principals consistently checked “a lot” (Table 1). It is hard to imagine that so many instances of undifferentiated responses to all 13 items could accurately describe the school situation. Under the admittedly unrealistic condition of independence, less than one case of the most positive URP would be expected.

This is again a strong indicator of extreme response simplification, together with some form of socially desirable impression management (Kuncel and Tellegen, 2009; Blasius and Thiessen, 2012; Knoll, 2013), since some principals may have wished to portray “their school” in the best possible light. With respect to URPs in the various countries (see Appendix), the largest percentages are found in Shanghai, Taipei, Hong Kong, India, Japan, Qatar, Singapore, United Arab Emirates, and the United States. In contrast, the percentages of URPs are gratifyingly low in Estonia, Georgia, Kyrgyzstan, Latvia, Montenegro, Norway, and Russia. We also find high between-country variation for the IRPs. As was the case for the school climate domain, Slovenia has the highest percentage (46.7%), followed by Hong Kong (32.2%) and Dubai/UAE (31.3%). At the other end, no IRPs were found for Kyrgyzstan and Norway; Albania has just one duplicate, and Germany has 1.0%. To illustrate the similarities in patterns between the school climate and resource shortages domains, Figure 2 provides the same information as Figure 1, but now for the resource shortage domain.

Figure 2: Bar charts of the resource shortage domain factor scores for USA and Slovenia

The solution is basically the same as reported for the school climate domain. For the USA we find 112 unique response patterns among the 162 cases without any missing value, seven response patterns that appeared twice, two that appeared three times, and one that appeared 30 times; the latter is a URP in which the respective principals consistently responded “not at all”. In total we have ten distinct simplified response patterns, all of them quite close to the last-mentioned URP (see Figure 2, keeping in mind that the substantive meaning of the first PCA dimension reflects the extent of reported shortages and lack of resources).

In Slovenia, 110 unique response patterns (RPs) occurred within the 336 cases: 42 RPs appear twice (= 42 IRPs), 17 RPs three times (= 34 IRPs), four RPs four times (= 12 IRPs), one RP five times (= 4 IRPs), two RPs seven times (= 12 IRPs), one RP eight times (= 7 IRPs), one RP 13 times (= 12 IRPs), and the last RP (again the URP containing 13 times “not at all”) appears 35 times (= 34 IRPs). In total 69 different patterns of identical responses occur (out of a total of 157 IRPs, or 46.7%), and these are distributed over the entire scale, i.e., they are also independent of the marginal distributions of the items and of their substantive content.

Turning to the management practices items, these were formulated in the opposite polarity to that of the previous domains. Not surprisingly from a response simplification and impression management point of view (Kuncel and Tellegen, 2009), we find 346 instances where all 14 items were checked “very often” (Table 1). Not unlike the other two domains, the percentage of URPs and IRPs varies substantially by country. While in some countries not a single principal employed a URP across all items (among others in Belgium, Denmark, and Finland, see Appendix), more than 20% in India and more than 15% in Jordan did so. With respect to IRPs, Slovenia, Dubai/UAE, and Kazakhstan show percentages above 30%, India above 25%, and Brazil just below 25% (see Appendix). In contrast, no IRP at all occurs in Germany, Ireland, Luxembourg, and Malta. Comparing these lists with the previous ones on resource shortages and school climate shows that response simplification strategies are both country- and domain-specific.

So far we have documented that URPs are unevenly distributed across countries. Further, we have shown that the number of IRPs, their location on the latent dimension, as well as the number of distinct types of IRPs vary by country. In some countries they are alarmingly high, and would undoubtedly be even higher if we included less extreme forms of task simplification, such as response patterns that are identical except for one or two items. On purely probabilistic grounds, these less extreme simplifications could result in IRPs, which would also be detectable with our procedures.

So far we have also implicitly considered principals to be the source of these IRPs. However, copying and pasting data will also result in IRPs. We turn to this possibility next.

5.2 Institutional Task Simplification Practices

As stated earlier, task completion shortcuts can also be utilized by employees of the research institutes (Blasius and Thiessen, 2012). If the staff have access to the electronic data, the simplest way to fabricate data is to copy-and-paste large segments of some of the principals’ responses. Where employees practice this kind of simplification, one should expect the questionnaire identification number and perhaps some of the initial entries to be changed, since detection of fabrication would otherwise be too easy. Additionally, one would not expect metric data, such as the precise number of boys and girls attending the respective schools (q6, cf. school questionnaire), to be copied and pasted. Instead of fabricating entire interviews, employees might just duplicate large segments of the questionnaire.

One straightforward procedure to test whether some of the cases might be fabricated is to search for IRPs in a large segment of uncorrelated items. We chose 184 consecutive items between question 11 on resource shortages and question 26 on management practices. Excluding the nonresponse categories, the number of theoretically possible response patterns for these items is 4^13 × 3^2 × 2^14 × 2^5 × 5^5 × 2^8 × 4^13 × 3 × 3^7 × 3^6 × 2^3 × 2^5 × 2^4 × 2^(12×5) × 2^(6×4) × 4^14 = 1.729 × 10^72, which corresponds to an average probability of 5.78 × 10^-73 per pattern, a number that is hard to fathom. Comparing this number with a stringent standard of statistical significance such as p < .001, or with an average probability of 1.49 × 10^-8 within a set of correlated items, shows how immensely distant it is from the typical standard. Allowing for high inter-correlations between the items will not be of much help, since these correlations are primarily within the single domains but not between all of them. Taking into consideration the unequal marginal distributions will reduce the value by a few digits, but surely not by 68. Including the missing response options will increase the number of digits.

The only possibility for artificially-induced IRPs would be if there were large percentages of missing values. To minimize this possibility, we already excluded those 184 cases consisting only of missing values (see Appendix). However, a high number of missing data could coincide with mis-coding of other items, which was indeed sometimes the case. For example, we found for some countries only “no” responses in a large segment of dichotomous data (questions 24 and 25, in total 84 items) while the other variables were missing. Since “no” was a valid answer for these questions, a response pattern of 84 “no’s” while the other variables are missing is probably just based on a coding error. This happened, for example, in Brazil (14 cases) and in Panama (seven cases). Although the information contained in these items is wrong, we would not consider them to be instances of data fabrication. To exclude this kind of artefactual IRP, we included only cases with fewer than 80 missing values, which reduced our data set from 18,277 to 18,233 cases. To permit inclusion of the remaining missing data as “valid responses”, we performed MCA on the entire data set across all 184 variables; the important part of the numerical solution is shown in Table 2.
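Before turning to the counts in Table 2, the within- versus between-country tally of exact duplicates can be sketched as follows. This is an illustrative reconstruction rather than the authors' code; the column name `country` and the list of item columns are assumptions.

```python
# Sketch (assumed column names): flag exact duplicates across a wide block of
# items and record whether the duplicated cases come from one country or several.
import pandas as pd

def duplicate_summary(data: pd.DataFrame, item_cols: list, country_col: str = "country"):
    key = data[item_cols].astype(str).apply(lambda row: "|".join(row), axis=1)
    within = between = 0
    for _, group in data.groupby(key):
        if len(group) > 1:                        # this response string recurs
            if group[country_col].nunique() == 1:
                within += 1                       # all cases from the same country
            else:
                between += 1                      # cases spread across countries
    return within, between
```

Re-running the same tally on the 40 items of the three domains only, as done for Table 4, would additionally flag near-duplicates that differ only outside those domains.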

Table 2: Number of duplicates by country

According to Table 2, across all countries we find 18,019 unique response patterns, 91 duplicates (= 182 cases), eight triplicates (= 24 cases), and two quadruplets (= eight cases), which sums to 18,233 cases. Using the factor scores provided for the entire sample, we calculated how many of them occur within single countries and how many occur between countries. If IRPs occur just by chance, almost all of them should occur between countries. This is not the case, as Table 2 shows: Italy has 13 duplicates and one triplicate, Slovenia shows 20 duplicates, six triplicates and one quadruplet, while Dubai/UAE has 33 duplicates and one quadruplet.

The remaining 25 duplicates are distributed between the following countries: Australia (1), Austria (3), Belgium (1), Colombia (2), Czech Republic (1), Mexico (3), The Netherlands (2), Portugal (1), Qatar (2), Slovak Republic (1), Spain (4), Switzerland (2), Trinidad and Tobago (1), Uruguay (1); the remaining triplicate belongs to Latvia. Note especially that not a single duplicate occurs between countries; that is, there is no case in which two principals from two different countries provided the same IRP.

What might explain the duplicates within the countries in a benign way? One possibility is that one, two or even three duplicates were filled in by principals who are responsible for two schools that are fully identical in all domains asked about, and both schools were selected for the target sample. But the sheer number of duplicates makes such an interpretation quite unreasonable for Italy, Slovenia, and Dubai/UAE. For additional possible explanations we consulted the Technical Report of the study (OECD, 2012). For Italy and the other countries with a few duplicates we did not find any clues in the technical report. For Dubai/UAE we find a comment in the table describing the “sample frame unit” (OECD, 2012: 77): “Schools with mixed genders that have two separate campuses will be split into two schools with the same ID but differentiated with M for males and F for females”. This note does not help to explain the IRPs, since the duplicates have different school IDs, and the respective schools have both boys and girls, but with the same numbers in the replicated cases. A comment for Slovenia in the same table (OECD, 2012: 78) states: “The preferred approach to sampling in Slovenia is by study programme. Many programmes share the same school building, however they operate largely independently from each other, sometimes even having different school principals, and in most cases a vice-principal for each programme”. A best-case explanation could be that there was confusion between “programme” and “school” (building). Perhaps the principal of the building provided the information for all programs located in the same building and just noted that the information provided holds for all programs of the school (building).

To explore this possibility, we examined information on the six triplicates and the one quadruplet in greater detail. To do so, we used metric information provided by the principals at the start of the questionnaire and additional information from the achievement tests in the student data (see Table 3).

Table 3: Program (school) characteristics for triplets and quadruplets in Slovenia

Table 3 shows that the school (or a “programme”) with the ID number 57 consists of 42 boys and six girls (information from the principal data, q6, principal questionnaire), 19 of them being in the modal grade (information from the principal data, q10a, principal questionnaire, “At your school, what is the total number of students in the ?”). From the student data we know that 14 of them took the achievement tests (own computation, cases are easy to identify via school and country ID), and received an average reading score of 292 (averaged over the five “plausible values” in reading as provided in the student data, and over the 14 students, own computation); the respective scores in maths and sciences are 377 and 279 (own computation). For the 48 students in this program the school provides 17 computers and six part time teachers (information from the principal data, q10b and q9a, principal questionnaire). This means that there are 0.354 computers and 0.0625 teachers per student (if we count part time teachers as 0.5). Calculating the same values for the other programs (schools) in the same building, the “school” with the ID 166 has 0.029 computers and 0.032 teachers per students, and the “school” with the ID 329 has 0.033 computers and 0.046 teachers per student. The students in these three schools/programs also differ substantially (often by more than one standard deviation) in their mean achievement scores (PISA scores are normed to an international mean of 500 and a standard deviation of 100).
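The per-student ratios quoted for the programme with ID 57 follow directly from the reported counts (part-time teachers weighted 0.5, as in the text); a quick arithmetic check, included purely as an illustration:

```python
# Quick check of the ratios quoted for the Slovenian school/programme with ID 57.
students = 42 + 6                      # boys + girls reported by the principal
computers, part_time_teachers = 17, 6
print(round(computers / students, 3))                 # 0.354 computers per student
print(round(0.5 * part_time_teachers / students, 4))  # 0.0625 teachers per student
```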


Comparable dissimilarities characterize the other triplicates and the quadruplet. Clearly, the schools comprising each triplicate/quadruplet have students performing at different achievement levels and with quite unequal resources; for example, 0.354 vs. 0.029 computers per student (IDs 57 and 166, respectively). There might be reasons for this unequal distribution of computers within the same school (building) but if this is true it defies credulity that all of them nevertheless have exactly the same “school climate”, “resources and resource shortages”, “management practices”, and so on. We therefore rule out confusion between the terms “school” and “programme” as an explanation for the IRPs since the principals/designates clearly distinguished between the characteristics of their programs at the beginning of the questionnaire, but not for the 184 variables between questions 11 and 26. In the “best case scenario” we can imagine a “strong simplification” performed by the respective 27 principals were they to just note “all the same” (Table 2) and the institute then just copied the respective cases from question 11 to question 26 (184 variables), obtaining 62 cases for them (Table 2). The “best case scenario” includes some relative strong assumptions, including that in quite a few instances several programs from different school buildings were selected for the final sample and that the principals simply noted that the programs within the school were identical across 184 consecutive variables. A less benign scenario would be one in which just a few parameters were changed in the second part of the questionnaire. In this case our procedures would have found fewer IRPs than there really are because a change of a single value in the set of 184 items already produces different factor scores. If we restrict our test to fewer items and find a substantially greater increase of IRPs in the three suspicious countries than in all other countries we would have a strong indicator of data fabrication. To run this test we reduced the items to those that belong to the three domains discussed above. We use the same criteria for excluding missing cases as before to keep the number of valid cases constant; a 23

possible drawback is that we include cases with a large number of missing values, which artificially would increase the number of IRPs. However, this will hold for all countries -- and Slovenia has a very low number of missing values (cf. Appendix). Furthermore, a few IRPs might be caused by URPs across all three domains. The solutions of the MCA are summarized in Table 4.

Table 4: Number of duplicates by country, three domains only (40 variables)

Comparing Tables 2 and 4 and starting with the totals of all countries, the number of unique response combinations in the entire sample decreases from 18,019 (Table 2) to 17,844 (Table 4). Almost half of these lost unique response patterns can be assigned to Slovenia (279-209 = 70 cases), an additional 25 cases to Dubai/UAE and an additional 22 cases to Italy, while 31 cases belong to the remaining 68 countries. Note that there are 27 cases more in the single countries than in “all countries”; these cases belong to IRPs between the countries (see the notes for Table 4 for details on the various calculations). With respect to the duplicates, Italy now has eight more, Slovenia has 21 more, Dubai has 11 more, and the remaining countries increase their common amount from 25 to 37. With respect to the increase in triplicates, again the majority occurred in the three suspicious countries, and again the majority of cases belong to Slovenia. Comparing the numbers of IRPs between the countries, with the IRPs within the countries, even a single IRP within a country is suspicious. In the best case scenario there might be some country-specific circumstances that might explain a few cases of IRPs, but this will not explain the strong and country-specific increase in IRPs when reducing the number of items involved in the analysis from 184 to 40. The only possibility we can think of is data fabrication. 24

6. Discussion

In this paper we searched for indications of problematic responses from highly educated individuals, namely the principals of schools participating in PISA 2009. Not only are these respondents well educated, they were asked questions on topics on which they were personally knowledgeable and interested. Applying Tourangeau’s framework (Tourangeau and Rasinski, 1988; Tourangeau, Rips and Rasinski, 2000), these principals were highly adept at understanding the questions, retrieving the relevant information, synthesizing their experiences, and selecting the appropriate response. On four grounds (the quality control procedures, the high reputation and the importance of the study for the educational system in each country, the educational attainment of the principals, and the interest and knowledge that these respondents would have regarding their own school), we anticipated finding data of superior quality.

With respect to nonresponse, in a majority of participating countries every principal returned at least a partially completed questionnaire, producing overall admirably high unit response rates. Likewise, across a large spectrum of 184 variables, the data of only one percent of the principals had to be classified as unit nonresponse because of too many missing values. Considering the length of the questionnaire, this is an extraordinarily high response rate.

For the three domains we investigated, the crudest form of task simplification would be to provide undifferentiated responses across an entire battery of items. Because of its primitive nature, we expected no more than a handful of such cases. Surprisingly, we found rather more than a handful – as many as 1,062 instances in the resource shortages domain (cf. Table 1). Given the sophisticated nature of our respondents, one can expect additional instances of more subtle (and consequently more difficult to detect) task simplification techniques to have been applied. In the literature, these are usually discussed, in various forms, under the term response styles (Baumgartner and Steenkamp, 2001). In contrast to common practice, we do not differentiate among these response styles, since we argue that all of them are merely manifestations of task simplification.

The implication is that there is no distinct response style that can be assigned to a particular group of the population, such as the better educated. URPs and other forms of task simplification caused by the respondents are conflated with the substantive topic of the items. Using the three domains simultaneously, with an admittedly arbitrary threshold of three percent on each to classify the quality of the respective data as high, the following countries meet these criteria: Czech Republic, Denmark, Estonia, Finland, Germany, Italy, Latvia, Lithuania, Moldova, Montenegro, Norway, Russia, Serbia, Slovakia, Tunisia, and Turkey. Conversely, in India and Qatar the URPs in all three domains exceed seven percent. URPs can be clearly assigned to the respondents; for simple probabilistic reasons, the use of other response styles will result in IRPs. Using the same lower-bound threshold for the IRPs, the following countries can be classified as having high-quality data across the three domains: Estonia, Germany, Malta, Moldova, and Tunisia. At the opposite end, the following countries had IRPs in excess of 10% in all three domains: India, Qatar, Singapore, United Kingdom (all values exceed 15%), Dubai/UAE (all values exceed 20%), and Slovenia (all values exceed 30%).

Most suspicious was the large number of different IRPs found in some countries. These, we argued, could be the result of copy-and-paste procedures. Running some analyses to test for fabricated data via copy-and-paste, we found strong evidence that in some countries employees of the institutes actually duplicated large parts of their data. In the crudest case, complete duplicates (except for the school identification number) were found. In countries such as Italy, Slovenia, and Dubai/UAE, an alarming number of cases met this strong criterion. Note that a single change in the entire string of numbers would change the respective factor scores, and consequently we would have failed to identify these cases as identical.

The pressure exerted on principals to participate also holds for research institutes and their employees. The principals might feel under pressure because their school participated in PISA and because, for the success of the study, it was important to have school-level information.

The heads of the institutes were under pressure to do a “good job”, since landing a PISA project offers a chance to increase the reputation of the institute (and it is probably also economically lucrative). The heads of the institutes transmit this pressure to the next level of the hierarchy, which in turn passes it on to the level below, and so on. Finally, every employee who is involved in the study will try to avoid “mistakes” and to do a “good job”. This includes, in addition to fulfilling the PISA requirements, collecting the necessary school-level information from, if possible, all principals (or their designates). However, either not every principal completed the questionnaires or some of them were lost. In contrast to other surveys, replacements were not possible, since the target persons were the principals of the participating schools; nobody else could do so. The simplest “solution” to avoid nonresponse is to use copy-and-paste procedures. Since we did not screen for more sophisticated forms of fabrication, the instances we uncovered here represent the lower bound of its prevalence. However, the strong forms of response simplification and the data fabrication via copy-and-paste uncovered for PISA are not exceptions; analogous results have been found in the World Value Survey (Blasius and Thiessen, 2012).

What are the effects of fabricated questionnaire responses? If only very few instances occur, as is the case for PISA 2009 in Australia, Austria and Belgium, there should be almost no effect. If the number of instances is large, as is the case for Slovenia and Dubai/UAE, at a minimum the level of significance in any cross-table, correlation and multivariate analysis is over-estimated, since the true sample size is smaller than the apparent one. The main difference between simplification techniques generated by respondents and by interviewers/institutes is the number of cases involved. While respondents can simplify only their own responses, interviewers are involved in multiple cases. If they conduct a large number of interviews in which they apply some kind of simplification, as shown for the German General Social Survey (Blasius and Thiessen, 2013), this has noticeable effects on the quality of the data.

In contrast, employees of the institutes who have access to the electronic files are able to inflict the most damage. We restricted our study to the 2009 PISA principal data, for which we could show that at least in some countries a substantial number of respondents simplified their task, for example via URPs. Furthermore, we could show that at least in Slovenia, Dubai/UAE, and Italy the (employees of the) institutes simplified their task via copy-and-paste procedures.

At this point we would like to stress again that the PISA studies are among the most used and most cited social survey data in the entire world. Compared to all other well-known international surveys, PISA also has exceptionally stringent quality control mechanisms, implemented by the OECD. Institutes solidify their reputation if they do a “good job” when collecting the data. Having a high response rate is certainly an indicator of having done a “good job”, while copying-and-pasting data is certainly not (assuming the practice is discovered and made public). Considering these facts, the quality of the PISA principal data should be very high. Although we found strong evidence of task simplification, from the principals as well as from some institutes, we still believe that, compared to all other well-known and publicly available international social surveys, the PISA data are of high quality – and this is the reason why we used them for this paper.

Consider the three possible sources of task simplification: the respondents, the interviewers, and the institutes. For the respondents there are various reasons to do the job as fast as possible; in most studies there is no compelling reason why they should take part, especially in general surveys such as the International Social Survey Program. Evidence for this low and decreasing interest is provided by increasing non-response rates (De Leeuw and de Heer, 2002), which we find for all survey modes. Incentives may help to increase the response rate, but do they help get better data? A high response rate is meaningless when many respondents simplify their task.


The institutes collecting the data have contracts according to which they have to deliver a certain number of interviews within a certain time frame, and, in the case of PISA, from a specified set of schools. What happens if they fail, for whatever reason, to fulfill the conditions of the contract? Somebody will be blamed for this failure, and s/he may therefore use a copy-and-paste procedure to obtain the required number of cases. This task simplification procedure has also been used in some countries in the latest World Values Survey (Blasius and Thiessen, 2012).

Interviewers earn more money in a given amount of time if they shorten their interviews, at least when they are paid per interview, which is the standard practice in face-to-face surveys. Fabricating an entire interview is relatively easy for institutes to detect, since they typically re-contact the target persons and ask a few control questions (Blasius and Thiessen, 2012, 2013). However, if interviewers are mainly extrinsically motivated, they will want to complete the interviews as fast as possible. This may result, for example, in asking only one question from a set of items and "generalizing" the answer to the remaining items. Such task simplification produces various kinds of simplified response patterns that can be discovered in the data set. For the German General Social Survey 2008, Blasius and Thiessen (2013) showed that a large share of URPs and other simplified response patterns were produced by the interviewers rather than by the respondents.

Finally, when assessing the quality of survey data, independent of the survey mode, scholars often use the response rate and item nonresponse as partial indicators, since high response rates and low item nonresponse are necessary to ensure that the sample is representative of the population (Biemer and Lyberg, 2003; Groves and Lyberg, 2010; Smith, 2011). While useful, these two measures are increasingly recognized as insufficient for a variety of reasons (Groves, 2006; Groves et al., 2008; Yan and Curtin, 2010; Wagner, 2010). Occasionally they can actually be misleading, since interviewers who fake or partly fake interviews seldom produce missing responses (Schäfer et al., 2005; Blasius and Thiessen, 2012, 2013). Likewise, interviewers as well as institutes who fabricate entire questionnaires automatically, albeit artificially, increase their response rates.
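Copy-and-paste fabrication of the kind discussed above leaves a simple footprint: complete response patterns that occur more than once. A minimal screening sketch, again in Python with pandas and with hypothetical file and variable names (the analyses in this paper instead compare each case with all others via principal component analysis), counts identical records and summarizes them by country:

    import pandas as pd

    def duplicate_summary(df: pd.DataFrame, items: list[str], country_col: str = "CNT") -> pd.DataFrame:
        """Summarize, per country, how many cases share their complete response pattern with another case."""
        size = df.groupby(items, dropna=False)[country_col].transform("size")  # size of each identical-pattern group
        flagged = df.loc[size > 1, [country_col]].copy()
        flagged["pattern_size"] = size[size > 1]
        return (flagged.groupby(country_col)["pattern_size"]
                       .agg(cases="size", largest_group="max")
                       .sort_values("cases", ascending=False))

    # Example usage (hypothetical file and column names):
    # pisa = pd.read_spss("pisa2009_school_questionnaire.sav")
    # item_cols = [c for c in pisa.columns if c.startswith("SC")]
    # print(duplicate_summary(pisa, item_cols))

Restricting the item list to the three attitudinal domains corresponds to the 40-variable comparison reported in Table 4, while using all questionnaire items corresponds to the 184-variable comparison in Table 2.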


Taking all our findings together, we come to the conclusion that precisely in internationally successful and policy-relevant projects such as PISA, pressures to produce complete data and high response rates are likely exerted on both respondents and survey institutes, sometimes at the cost of data integrity. For the principals, this manifests itself in various shortcuts that reduce their time and energy commitments; for the research institutes, it increases the temptation to raise response rates artificially via (partial) data duplication. An overview of the findings by country is provided in the Appendix.

References
Aichholzer, J. (2013). Intra-Individual Variation of Extreme Response Style in Mixed-Mode Panel Studies. Social Science Research, 42, 957-970.
Bachman, J. G., and O'Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48(2), 491-509.
Baumgartner, H., and Steenkamp, J.-B. E. M. (2001). Response Styles in Marketing Research: A Cross-National Investigation. Journal of Marketing Research, 38, 143-156.
Biemer, P., and Lyberg, L. (2003). Introduction to Survey Quality. New York: Wiley.
Blasius, J., and Thiessen, V. (2001). Methodological Artifacts in Measures of Political Efficacy and Trust: A Multiple Correspondence Analysis. Political Analysis, 9, 1-20.
Blasius, J., and Thiessen, V. (2012). Assessing the Quality of Survey Data. London: Sage.
Blasius, J., and Thiessen, V. (2013). Detecting Poorly Conducted Interviews. In Interviewers' Deviations in Surveys. Impact, Reasons, Detection and Prevention, (ed.) P. Winker, N. Menold and R. Porst (pp. 67-88), Frankfurt/Main: Peter Lang.
Cambré, B., Welkenhuysen-Gybels, J., and Billiet, J. (2002). Is it content or is it style? An evaluation of two competitive measurement models applied to a balanced set of ethnocentrism items. International Journal of Comparative Sociology, 43(1), 1-20.
Converse, J. M. (1976). Predicting No Opinion in the Polls. Public Opinion Quarterly, 40, 515-530.
Crespi, L. P. (1945). The Cheater Problem in Polling. Public Opinion Quarterly, 9, 431-445.
Crespi, L. P. (1946). Further Observations on the 'Cheater' Problem. Public Opinion Quarterly, 10, 646-649.

Dayton, E., Zhan, C., Sangl, J., Darby, C., and Moy, E. (2006). Racial and ethnic differences in patient assessments of interactions with providers: Disparities or measurement biases? American Journal of Medical Quality, 21(2), 109-114.
De Leeuw, E. D., and de Heer, W. (2002). Trends in Household Survey Nonresponse: A Longitudinal and International Comparison. In Survey Nonresponse, (ed.) R. M. Groves, D. A. Dillman, J. Eltinge and R. J. A. Little (pp. 41-54), New York: Wiley.
Diamantopoulos, A., Reynolds, N. L., and Simintiras, A. C. (2006). The impact of response styles on the stability of cross-national comparisons. Journal of Business Research, 59, 925-935.
Durant, H. (1946). The 'Cheater' Problem. Public Opinion Quarterly, 10, 228-291.

Elliott, M. N., Haviland, A. M., Kanouse, D. E., Hambarsoomian, K., and Hays, R. D. (2009). Adjusting for subgroup differences in extreme response tendency in ratings of health care: Impact on disparity estimates. Health Services Research, 44(2), 542-561.
Groves, R. M. (2006). Nonresponse Rates and Nonresponse Bias in Household Surveys. Public Opinion Quarterly, 70, 646-675.
Groves, R. M., Brick, J. M., Couper, M., Kalsbeek, W., Harris-Kojetin, B., Kreuter, F., Pennell, B.-E., Raghunathan, T., Smith, T. W., Tourangeau, R., Bowers, A., Jans, M., Kennedy, C., Levenstein, R., Olson, K., Peytcheva, E., Ziniel, S., and Wagner, J. (2008). Issues Facing the Field: Alternative Practical Measures of Representativeness of Survey Respondent Pools. URL=http://www.nonresponse.org/c/417/Workshop%202008?preid=0 (Accessed May 2013).
Groves, R., and Lyberg, L. (2010). Total Survey Error. Past, Present, and Future. Public Opinion Quarterly, 74, 849-879.
Hamamura, T., Heine, S. J., and Paulhus, D. L. (2008). Cultural differences in response styles: The role of dialectical thinking. Personality and Individual Differences, 44(4), 932-942.
Heerwegh, D., and Loosveldt, G. (2011). Assessing mode effects in a national crime victimization survey using structural equation models: Social desirability bias and acquiescence. Journal of Official Statistics, 27, 49-63.
Hui, C. H., and Triandis, H. C. (1989). Effects of culture and response format on extreme response style. Journal of Cross-Cultural Psychology, 20(3), 296-309.
Knoll, B. R. (2013). Assessing the Effect of Social Desirability on Nativism Attitude Responses. Social Science Research, 42, 1587-1598.

Khorramdel, L., and von Davier, M. (2014). Measuring response styles across the Big Five: A multiscale extension of an approach using multinomial processing trees. Multivariate Behavioral Research, 49(2), 161-177.
Krosnick, J. A. (1991). Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys. Applied Cognitive Psychology, 5, 213-236.
Krosnick, J. (1999). Survey Research. Annual Review of Psychology, 50, 337-367.
Krosnick, J. A., and Alwin, D. F. (1987). An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement. Public Opinion Quarterly, 51, 201-219.
Krosnick, J. A., Narayan, S., and Smith, W. R. (1996). Satisficing in Surveys: Initial Evidence. In Advances in Survey Research, (ed.) M. T. Braverman and J. K. Slater (pp. 29-44), San Francisco: Jossey-Bass.
Kuncel, N., and Tellegen, A. (2009). A Conceptual and Empirical Reexamination of the Measurement of the Social Desirability of Items: Implications for Detecting Desirable Response Style and Scale Development. Personnel Psychology, 62, 201-228.
Marsh, H. W. (1986). Negative item bias in ratings scales for preadolescent children: A cognitive-developmental phenomenon. Developmental Psychology, 22(1), 37-49.
Marsh, H. W. (1996). Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of Personality and Social Psychology, 70(4), 810-819.
Meisenberg, G., and Williams, A. (2008). Are acquiescent and extreme response styles related to low intelligence and education? Personality and Individual Differences, 44, 1539-1550.
Nelson, J. E., and Kiecker, P. L. (1996). Marketing Research Interviewers and Their Perceived Necessity of Moral Compromise. Journal of Business Ethics, 15, 1107-1117.
OECD (2012). PISA 2009 Technical Report. PISA, OECD Publishing.
Oskamp, S. (1977). Attitudes and Opinions. Englewood Cliffs, N.J.: Prentice-Hall.
Schäfer, C., Schräpler, J.-P., Müller, K.-R., and Wagner, G. G. (2005). Automatic Identification of Faked and Fraudulent Interviews in the German SOEP. Schmollers Jahrbuch, 125, 183-193.
Smith, T. W. (2011). Refining the Total Survey Error Perspective. International Journal of Public Opinion Research, 23, 464-484.
Thiessen, V., and Blasius, J. (2008). Mathematics achievement and mathematics learning strategies: Cognitive competencies and construct differentiation. International Journal of Educational Research, 47(4), 362-371.


Tourangeau, R., and Rasinski, K. A. (1988). Cognitive Processes Underlying Context Effects in Attitude Measurement. Psychological Bulletin, 103, 299-314.
Tourangeau, R., Rips, L. J., and Rasinski, K. A. (2000). The Psychology of Survey Response. New York: Cambridge University Press.
Van Rosmalen, J., Van Herk, H., and Groenen, P. J. F. (2010). Identifying Response Styles: A Latent-Class Bilinear Multinomial Logit Model. Journal of Marketing Research, 47, 157-172.
Wagner, J. (2010). The Fraction of Missing Information as a Tool for Monitoring the Quality of Survey Data. Public Opinion Quarterly, 74, 223-243.
Watkins, D., and Cheung, S. (1995). Culture, Gender, and Response Bias: An Analysis of Responses to the Self-Description Questionnaire. Journal of Cross-Cultural Psychology, 26, 490-504.
Weech-Maldonado, R., Elliott, M. N., Oluwole, A., and Hays, R. D. (2008). Survey response style and differential use of CAHPS rating scales by Hispanics. Medical Care, 46(9), 963-968.
Wilkinson, A. E. (1970). Relationship between measures of intellectual functioning and extreme response style. Journal of Social Psychology, 81(2), 271-272.
Winker, P., Menold, N., and Porst, R. (eds.) (2013). Interviewers' Deviations in Surveys. Impact, Reasons, Detection and Prevention. Frankfurt/Main: Peter Lang.
Yan, T., and Curtin, R. (2010). The Relation Between Unit Nonresponse and Item Nonresponse: A Response Continuum Perspective. International Journal of Public Opinion Research, 22, 535-551.


Table 1: Percentages of responses and number of undifferentiated response patterns (URPs)

                                School climate          Resource shortages      Management practices
Category                        In %     Nb. of URPs    In %     Nb. of URPs    In %     Nb. of URPs
Not at all / Never              30.1        283         44.2       1,016         3.4          7
Very little / Seldom            45.3        135         25.1          20        15.1          6
To some extent / Quite often    20.0         11         20.7          11        44.1        109
A lot / Very often               4.7         13         10.0          15        37.5        346
N                             17,635        442       17,310       1,062      17,440        468

Table 2: Number of duplicates by country, 184 variables

                          Single cases   Duplicates   Triplicates   Quadruplets        N
All countries                   18,019           91             8             2   18,233
Italy                            1,051           13             1             -    1,080
Slovenia                           279           20             6             1      341
Dubai/UAE                          298           33             -             1      368
All other countries(1)          16,391           25             1             -   16,444

(1) "All other countries" includes duplicates, triplicates, and quadruplets occurring within and between the countries. Duplicates within the remaining countries: Australia (1), Austria (3), Belgium (1), Colombia (2), Czech Republic (1), Mexico (3), The Netherlands (2), Portugal (1), Qatar (2), Slovak Republic (1), Spain (4), Switzerland (2), Trinidad and Tobago (1), Uruguay (1); sum = 25. Hence there is no duplicate between countries, i.e., no case in which principals from two different countries share the same response pattern. Triplicates within the countries: Latvia (1).

Table 3: Indicators for triplicates and quadruplets in Slovenia

Student data for the seven groups of schools flagged as triplicates or quadruplets in Slovenia (cf. Table 2): school ID, number of tested students (N), and mean reading, mathematics, and science scores. Principal data reported for these schools comprise the number of boys, the number of girls, the modal grade, the number of computers, and the numbers of full-time and part-time teachers.

Group   School ID    N    Read   Math   Science
1          57       14    292    377    279
1         166       24    438    441    432
1         329       19    410    438    453
2          98       22    357    404    417
2         134        5    326    379    346
2         282       20    460    523    526
3          45       24    483    530    567
3         101        2    347    408    367
3         170       18    392    432    452
3         234       17    384    382    413
4          30       23    327    359    398
4          87       25    462    534    529
4         143        5    331    392    371
5         200       24    428    476    485
5         212       13    301    376    320
5         325       25    373    402    399
6          21       20    381    417    410
6         306       24    485    552    544
6         319       23    469    538    530
7          36       24    380    426    425
7         240       24    402    422    468
7         323        2    310    374    340

Table 4: Number of duplicates by country, three domains only (40 variables)

                          Single cases   Duplicates   Triplicates   Quadruplets           N
All countries                   17,844          146            19             4   18,233(1)
Italy                            1,029           21             3             -       1,080
Slovenia                           209           41            14             2         341
Dubai/UAE                          273           44             1             1         368
All other countries(3)          16,360           37          2(2)             1      16,444

(1) In addition, there is one instance of six IRPs, one of seven IRPs, and one of 11 IRPs.
(2) At least one of these triplicates appears in the entire data set as part of (1).
(3) The difference between the sum of the single cases in the separate rows (all other countries, Italy, Slovenia, and Dubai/UAE) and the single cases under "all countries" can be explained by a few IRPs that occur between countries: 6 + 7 + 11 + 3×2 (from the differences in the duplicates) - 1×3 (from the differences in the triplicates) = 1,029 + 209 + 273 + 16,360 - 17,844 = 27 cases. Duplicates (d), triplicates (t), and quadruplets (q) within the remaining countries: Australia (1d), Austria (5d), Belgium (1d), Colombia (3d), Czech Republic (1d), India (1q), Latvia (2d, 1t), Mexico (4d), Montenegro (1d), The Netherlands (2d), Portugal (1d), Qatar (2d), Slovak Republic (1d), Spain (7d), Sweden (1d), Switzerland (3d, 1t), Trinidad and Tobago (1d), Uruguay (1d).

Figure 1: Bar charts of the school climate domain factor scores for USA and Slovenia
[Two panels (USA, Slovenia) showing the frequency distributions of the school climate domain factor scores.]

Figure 2: Bar charts of the resource shortage domain factor scores for USA and Slovenia
[Two panels (USA, Slovenia) showing the frequency distributions of the resource shortage domain factor scores.]

Appendix: Overview of findings by country

                  Nb. of   % of   Nb. of schools,    School climate           Resource shortage        Management practices
Country           schools   MD      corrected        % MV  % URPs  % IRPs     % MV  % URPs  % IRPs     % MV  % URPs  % IRPs
Albania              181    0.0        181            6.1    5.9    10.0       7.2    1.2     0.6       6.1    2.4     2.9
Azerbaijan           162    0.0        162           13.6    9.3    12.1      21.0    3.9     4.7      13.0    5.0     5.0
Argentina            199    0.0        199            7.0    2.7     2.2       8.0    3.3     3.8       8.0    0.5     1.1
Australia            353    0.0        353            0.0    5.7    10.8       0.0   13.9    20.1       0.0    3.1     5.1
Austria              282    2.8        274            2.2    1.1     3.7       6.9    3.5    12.5       6.9    0.4     3.5
Belgium              278    0.0        278            3.6    2.2     7.1      10.1    4.4     5.6      10.1    0.0     2.0
Brazil               947    0.0        947            5.7    0.9     2.6      11.0    3.6     4.6       5.7    3.9    24.1
Bulgaria             178    0.0        178            5.1    1.8     1.8       7.3    3.6     4.2       5.6    1.2     1.8
Canada               978    0.6        972            2.5    2.5    13.8       3.2   10.4    19.9       6.4    2.3     8.2
Chile                200    6.0        188            2.7    3.3     3.3       5.9    3.4     4.0       8.5    4.7     5.8
Shanghai             152    0.0        152            0.7    2.6     4.0       0.0   15.1    15.8       2.0    2.0     7.4
Taipei               158    0.0        158            8.9    6.9     6.3       4.4   19.9    19.9       3.8    4.6    12.5
Colombia             275    0.0        275            4.7    1.1     4.2       4.4    3.8     5.3       3.3    4.9    10.5
Costa Rica           181    0.0        181            3.3    1.7     0.6       4.4    2.9     2.9       3.3    6.9    20.0
Croatia              158    0.0        158            3.2    0.7     0.7       5.1    3.3     2.7       3.2    0.7     3.3
Czech Rep.           261    3.4        252            1.2    0.4     3.6       1.6    0.8     2.0       1.2    0.4     2.8
Denmark              285    1.4        281            1.4    2.9    11.2       2.5    2.9     7.3       5.3    0.0     1.5
Estonia              175    0.0        175            2.3    0.6     0.6       2.3    1.8     1.8       5.1    2.4     1.2
Finland              203    0.0        203            0.5    0.0     4.0       4.4    2.6     2.6       3.4    0.0     0.5
Georgia              226    0.0        226            8.4    1.4    15.9       8.0    0.0     1.4       8.4    5.3     6.3
Germany              226    5.8        213            2.8    1.0     1.4       6.1    1.5     1.0       5.6    0.0     0.0
Greece               184    0.0        184            6.5    0.6     5.2      10.3    4.2     4.8       7.6    0.0     0.6
Hong Kong            151    0.7        150            2.0    2.0     7.5       2.7   27.4    32.2       3.3    5.5    13.1
Hungary              187    0.0        187            3.2    3.3     6.1       1.6    7.1     9.8       3.2    2.2     3.9
Iceland              131   13.0        114            3.5    0.9     0.9       2.6   10.8    16.2       3.5    0.0     0.9
India                213    4.7        203            7.4    8.5    13.3       5.9   16.8    17.3       5.4   21.4    25.5
Indonesia            183    0.0        183            0.5    2.7     8.8       2.7    1.7     1.1       1.6    7.2    13.3
Ireland              144   11.8        127            4.7    0.0     1.7       7.1    3.4     4.2       6.3    0.0     0.0
Israel               176    2.3        172            2.9    0.6     1.2       9.3    8.3     8.3       7.0    1.3     4.4
Italy              1,097    0.0      1,097            5.0    0.9     8.0       7.4    2.7     8.0       5.5    0.3    11.1
Japan                186    0.0        186            1.1    3.3     6.0       1.6   17.5    25.7       0.0    2.7     9.7
Kazakhstan           199    0.0        199            3.5    3.1     5.2       3.5    4.2     3.6       0.5    0.0     2.5
Jordan               210    0.0        210            2.9    1.5     3.9       5.7    6.6     7.1       3.8   16.3    33.2
Korea                157    0.0        157            0.6    3.8    10.9       3.8    8.6     9.3       0.6    1.3     6.4
Kyrgyzstan           173    0.0        173            9.8    3.2     3.2      11.6    0.7     0.0       7.5    5.0     7.5
Latvia               184    0.0        184            2.7    1.7     5.6       4.9    0.6     3.4       2.7    0.6     2.8
Lithuania            196    0.5        195            3.1    2.1    11.1       3.1    1.1     4.8       2.6    0.0     0.5
Luxembourg            39    0.0         39            5.1    5.4     5.4       2.6    5.3     2.6       7.7    0.0     0.0
Macao                 45    0.0         45            0.0    8.9     6.7       0.0   13.3     8.9       0.0    4.4     2.2
Malaysia             152    0.0        152            7.9    5.0     8.6       1.3    2.0     1.3       0.7    8.6    19.2
Malta                 53    5.7         50            2.0    6.1     2.0       4.0    2.1     2.1       4.0    0.0     0.0
Mauritius            185    0.0        185            2.7    1.7     3.3       6.5    9.2     9.8       3.8    5.1    15.7
Mexico             1,535    0.2      1,532            2.6    2.1    11.2       4.4    4.2     7.1       2.3    3.1    16.4
Moldova              186    0.0        186            5.4    1.1     0.0       3.2    1.1     1.7       3.8    1.7     2.8
Montenegro            52    0.0         52            1.9    2.0     7.8       1.9    0.0     2.0       0.0    0.0     3.8
Netherlands          186    2.2        182            2.7    4.0     7.9       5.5    1.7     5.2       5.5    0.0     4.1
New Zealand          163    2.5        159            1.9    3.2     7.7       3.1    7.8    11.7       6.3    2.0     4.0
Norway               197    3.6        190            1.6    0.5     5.3       1.1    0.5     0.0       4.2    0.0     1.6
Panama               188    0.0        188            8.0    2.3     3.5      12.8    8.5     7.3      10.6    5.4    10.7
Peru                 240    0.0        240            2.9    1.7     1.3       9.2    3.7     4.6       3.3    5.2    10.3
Poland               185    0.0        185            3.2    1.7     3.4       3.2   10.6    19.0       3.2    0.0     0.6
Portugal             214    0.0        214            2.3    3.8     4.3       3.7    3.4     3.9       4.2    0.0     0.5
Qatar                153    0.0        153            5.9    7.6    11.8       9.8   15.2    20.3       5.9    9.0    18.8
Romania              159    0.0        159            0.0    0.0     0.0       0.0    6.9    10.1       0.6    0.6     5.7
Russia               213    0.0        213            7.0    1.0     0.5       5.6    1.0     1.0       6.1    2.5     4.5
Serbia               190    1.1        188            2.7    0.5     0.5       3.2    2.7     3.3       4.3    2.8     4.4
Singapore            171    0.0        171            1.2    4.7    10.7       0.6   19.4    28.8       0.0    1.8    11.1
Slovakia             189    0.0        189            1.6    0.0     2.2       8.5    1.2     1.7       4.2    1.1     3.9
Slovenia             341    0.0        341            2.3    0.3    30.6       1.5   10.4    46.7       1.8    0.0    38.2
Spain                889    2.0        871            3.2    3.8     9.3       4.6    4.3    11.4       4.6    0.5     3.5
Sweden               189    0.0        189            2.1    1.6     2.7       4.2    3.9     3.3       4.2    0.6     2.2
Switzerland          426    0.7        423            2.1    3.1     8.0       5.7    7.3    10.5       7.1    0.0     2.3
Thailand             230    0.0        230            0.4    1.3     3.1       2.6    4.5     4.5       3.5    2.7     8.1
Trinidad & T.        158    3.8        152            3.3    2.0     2.0       5.3    2.1     1.4       3.9    3.4     9.6
UA Emirates          369    0.0        369            7.6    6.7    22.3       6.5   18.3    31.3       4.3    7.4    30.0
Tunisia              165    0.0        165            2.4    1.2     0.6       1.8    1.2     1.2       3.0    2.5     1.9
Turkey               170    0.0        170            2.9    2.4     3.6       1.2    1.8     4.8       2.4    0.6     1.2
UK                   482    5.0        458            2.8    5.2    18.2       3.9   13.2    18.0       4.1    3.0    17.8
United States        165    0.0        165            1.8    5.6     8.6       1.8   19.1    24.7       4.8    2.5    10.2
Uruguay              232    0.0        232            3.9    3.6     5.4       6.0    5.5     8.3      11.6    2.4     2.9
Venezuela            121    6.6        113            4.4    1.9     0.9      17.7    7.5     7.5       6.2   10.4    14.2
Total             18,461    1.2     18,277            3.6    2.5     6.2       5.1    6.1     8.8       4.9    2.7     7.7

