Big Data, Volume 5, Number 2, 2017. © Mary Ann Liebert, Inc. DOI: 10.1089/big.2016.0043

ORIGINAL ARTICLE

Research Dilemmas with Behavioral Big Data


Galit Shmueli*

Institute of Service Science, National Tsing Hua University, Hsinchu, Taiwan.
*Address correspondence to: Galit Shmueli, Institute of Service Science, College of Technology Management, National Tsing Hua University, 101, Sec. 2, Kuang Fu Road, Hsinchu 30013, Taiwan. E-mail: [email protected]

Abstract
Behavioral big data (BBD) refers to very large and rich multidimensional data sets on human and social behaviors, actions, and interactions, which have become available to companies, governments, and researchers. A growing number of researchers in social science and management fields acquire and analyze BBD for the purpose of extracting knowledge and scientific discoveries. However, the relationships between the researcher, data, subjects, and research questions differ in the BBD context compared to traditional behavioral data. Behavioral researchers using BBD face not only methodological and technical challenges but also ethical and moral dilemmas. In this article, we discuss several dilemmas, challenges, and trade-offs related to acquiring and analyzing BBD for causal behavioral research.

Keywords: behavioral big data; academic research; social science; data acquisition; data analysis; ethics

What Is Behavioral Big Data and What Is Unique About It?
Big data has become available in many fields due to technological advancements in measurement and storage. The term "big data" can now be found in almost any field, including manufacturing, engineering, the physical and life sciences, business, and, more recently, the social sciences and management. The focus of this article is on behavioral big data (BBD), which captures human and social actions and interactions at a new level of detail. BBD studies and applications are quickly growing in popularity in industry as well as in academic research.

The term BBD highlights both its focus on human and social behavior and the novelty of its scale. The combination of "behavioral" and "big" creates challenges and opportunities for statisticians and data miners, since the great majority of researchers in these communities are not trained in the behavioral sciences and are therefore unfamiliar with study design, ethical conduct, and research methods for studies with human subjects. It also creates challenges for social and behavioral scientists whose training and experience lie in classic statistical modeling. Academic researchers are now using BBD in fields that earlier had small behavioral data (such as psychology, management, marketing, information systems, sociology, political science, and education) as well as in fields that earlier dealt with big inanimate data (production, biology, and engineering). Based on BBD studies presented at conferences and published in journals, it appears that both types of researchers are encountering new dilemmas and challenges.

In this article, we look at BBD from the perspective of behavioral researchers and the new challenges and dilemmas they face in the BBD context. We start by describing how BBD differs from other types of big data, and then describe the challenges and pitfalls that face the behavioral researcher in the two stages of BBD acquisition and data analysis.

BBD versus inanimate big data
BBD differs from inanimate big data (IBD) collected on items (e.g., products) in a fundamental way: unlike with IBD, the human subjects being measured have an aware, ongoing interaction with the BBD. Humans, unlike items, enrich (or "contaminate") the data with intention, deception, emotion, reciprocation, herding, and other human and social aspects. They can also be harmed by BBD, not only physically but also socially, emotionally, and psychologically. For example:


1. BBD collection can change the subjects' behaviors.
2. BBD study design and analysis can harm and risk subjects in intangible ways.
3. BBD study design is complicated by free will.
4. BBD can change over time on its own or in response to inquiry or examination.

Whereas in IBD studies the research context includes three main components (the research question, the researcher, and the data), in BBD studies there is a fourth critical entity: the human subjects. Figure 1 schematically illustrates the IBD versus the BBD study context. The diagram also includes the types of relationships between the different entities. In the IBD case, the researcher is the only active agent, applying his/her knowledge, skills, and tools to craft the research question and to collect and analyze the data. In contrast, in the BBD case, the researcher is not the only active agent. The human subjects are also active: their actions can affect the research design and outcomes, the availability and quality of the data, and more.

FIG. 1. Research environment in studies of inanimate big data (left) versus BBD (right). BBD, behavioral big data.

BBD versus physiological big data
One step closer to BBD is physiological big data, which also involves human subjects. This includes biomedical data at the DNA level from the Human Genome Project, and physiological data from large clinical trials, from hospitals, and from wearable devices measuring heartbeat, hydration, or other physiological attributes of the body. The BBD schematic (Fig. 1), which includes human subjects, is therefore also relevant to physiological big data. However, BBD differs from biomedical and physiological big data in two key ways. First, biomedical and physiological big data rely on physical measurements of (many) individual human bodies, whereas BBD includes behavioral data on individuals as well as their interactions with others. These fundamental differences have led to different research methodologies in the life sciences compared to the behavioral sciences. For example, due to the different measurement instruments typically used (surveys in behavioral studies versus medical devices in life sciences studies), the statistical methods for evaluating the measurement instrument differ (testing the reliability and validity of a questionnaire compared to measuring the accuracy of a medical device). Another example is the prevalence of latent variable models, such as structural equation models, in the behavioral sciences and their absence from the life sciences: in behavioral studies, measurements are often treated as proxies for abstract underlying constructs (e.g., perceived stress), whereas in the life sciences, measurements are treated either as values of interest (e.g., type of tumor) or as proxies for other measurements of interest that are unavailable or expensive (e.g., biomarkers). A third example is the proliferation of social network analysis in fields such as sociology, capturing and measuring interactions between observations, whereas in physiological big data, models tend to assume independence between observations. Such differences transfer from the "small data" environments to today's big data environments.


The second difference between physiological big data and BBD is the relationship between the human subjects and the researcher when manipulations are involved. Biomedical and physiological data that arise from large clinical trials are fundamentally different from large-scale behavioral experiments: participants in clinical trials are aware that they are part of an experiment and have a vested interest in being part of it. In contrast, subjects of behavioral experiments in the age of BBD are often unaware that they are part of an experiment, and in many cases the experiment is aimed at helping the company, sometimes at the expense of the human subject. In other words, unlike in clinical trials, in large-scale behavioral experiments there is usually no direct value for the participant.

BBD versus small behavioral data
Academic empirical research aims at discovering scientifically valid regularities that generalize from the data sample to a population of interest. In the behavioral sciences (including marketing, management, and information systems), causal questions are the most common, both in the world of small data and in the new era of BBD. It is therefore not surprising that many of the BBD studies conducted in management schools and behavioral departments are causal in nature. BBD offers a valuable addition (and sometimes an alternative) to surveys, case studies, interviews, and other traditional social science data collection methods that aim to capture human intentions, feelings, and thoughts. BBD therefore holds the promise of transforming our understanding of individuals, communities, and societies. The transformational power of BBD lies in its richness of information on previously unmeasurable human and social phenomena.

The key difference between studies with small behavioral data and studies with BBD is the relationship between the researcher, research question, data, and human subjects. Figure 2 provides a schematic of the main principles and issues along these relationships.

Relationship between researcher and research question: formulating and justifying the research question.
Among causal BBD-based studies, some examine age-old questions with newly available BBD, while others identify and ask new questions, often related to new technological capabilities and their effect on behavior. These two types of BBD research present behavioral researchers with new challenges in terms of formulating and justifying the research question.
Unlike in traditional behavioral research, researchers using BBD can rely less on their own experience and domain knowledge and on previous literature. In the first case, where researchers use BBD to answer an existing question, the main challenge is formulating clear operational definitions of the BBD variables. In other words, the researcher must justify why the newly measured variables can operationalize the constructs of interest. For example, Hinz et al.1 investigate whether conspicuous consumption increases social capital using BBD from virtual worlds. The authors claim that empirically testing this age-old research question has not been possible "due in part to the difficulty of objectively observing and measuring social capital." In their BBD study, they operationalize the network metric of "node degree" as a measure of social capital and evaluate the effect of conspicuous consumption by comparing users' numbers of "friends." Another example is the study by Muchnik et al.,2 testing the impact of social influence on collective judgment, where "collective judgment" was operationalized using user ratings and discourse on a social news aggregation website. Specifically, the outcome variable was the aggregate current rating of each posted comment (equal to the number of up-votes minus the number of down-votes). The authors offer multiple reasons why this is a valid and useful measure of the effect of social influence, indicating the burden of justifying new measures in studies with newly available BBD. Finally, Chetty et al.3 use BBD that combines school district and tax records for more than one million children* to test whether teachers' impacts on students' test scores are a good measure of their quality. The authors operationalize "teacher quality" by teachers' value-added (VA) scores,† and the long-term impact of "teacher quality" by measuring the students' college attendance, salary, and having children as teenagers.

In the case of BBD studies that ask completely new research questions, the challenge to researchers is the lack of previous literature and the need to identify relevant theories, often from a variety of disciplines and through additional studies. For example, Belo et al.4 study the impact of broadband at school on student performance. The authors explain the novelty of this research question ("Despite the large investments in computers and Internet access in schools, there are only a few studies that examine the impact of the Internet in schools on students' performance") and address the lack of previous literature by supplementing their randomized experimental study with a survey to understand how the Internet is used in schools.

*The BBD linked information from an administrative data set on students and teachers in grades 3–8 from a large urban school district spanning 1989–2009 with selected data from United States tax records spanning 1996–2011.
†VA scores are a controversial measure of teacher quality that is based on students' test scores.
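To make the idea of operationalizing a construct from BBD concrete (e.g., node degree as a proxy for social capital, as in the Hinz et al. study above), here is a minimal sketch assuming a simple friendship edge list; the user IDs and variable names are hypothetical and the data are synthetic.

```python
# Minimal sketch: operationalizing "social capital" as node degree
# in a friendship network built from a synthetic edge list.
import networkx as nx

# Each tuple is a friendship tie between two (hypothetical) user IDs
friendship_edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u4", "u1")]

G = nx.Graph()
G.add_edges_from(friendship_edges)

# Node degree (number of "friends") as the operational measure of social capital
social_capital = dict(G.degree())
print(social_capital)  # {'u1': 3, 'u2': 2, 'u3': 2, 'u4': 1}
```

The burden described above lies not in computing such a metric but in justifying that it is a valid proxy for the underlying construct.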


FIG. 2. Schematic of main principles and issues in the relationships between the researcher, research question, data, and human subjects in the BBD context.

Relationship between BBD and research question: generalization and bias.
A fundamental goal in conducting a causal study is generalization, which includes notions such as internal validity (the ability to draw a causal conclusion from the data), external validity (the ability to generalize the effect to other contexts), and statistical generalization (inferring whether the sample effect generalizes to a larger population). BBD challenges traditional approaches to generalization in two main ways, both arising from a gap between what is measured (the BBD) and the research question: the units measured and the sample coverage.

The first question that arises with regard to a sample of BBD concerns the unit of observation versus the unit of analysis. In BBD, the unit of observation (what is measured) can often differ from the unit of analysis (what the research question is about) due to human interaction. For example, if the units of observation are user accounts, then whom does an account represent? Is it a single user, a household, friends sharing a single account, a child impersonating an adult? Any insights or decisions made on the basis of analyzing BBD make some assumptions about the definition of an "observation." However, such assumptions are often violated in practice (e.g., Verstrepen and Goethals5 discuss the challenges of recommender systems applied to shared user accounts). Another issue that leads to a difference between the unit of observation and the unit of analysis in BBD is the trade-off between information and privacy. Technically, individual-level data can lead to more nuanced insights, but privacy and confidentiality often necessitate aggregation. Both issues are more pronounced in BBD than in small behavioral studies, due to BBD-generating technologies and platforms and the way they are used in practice.

The second question is what the BBD sample represents and to whom it generalizes beyond the data in our hands. Specifically, selection bias arises due to over- or under-coverage of the BBD sample compared to the population of interest defined by the research question. There is often a (false) notion that, due to the "bigness" of our data, they can be used to extrapolate as if they were a random sample from the researcher's population of interest. In fact, BBD samples often provide over-coverage of some populations and under-coverage (or no coverage) of others. This is obviously the case when BBD from online sources is treated as representative of the entire population, for example, in countries where large proportions of the population do not have access to the Internet. While bias due to over-/under-coverage is also common in small behavioral studies, the popular modes of acquiring BBD (through companies, web application programming interfaces (APIs), and online labor markets; see the Acquisition of BBD section) make this type of error especially challenging for behavioral researchers.
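One simple diagnostic, sketched below with made-up numbers, is to compare the composition of the BBD sample with known population benchmarks (e.g., census shares) on whatever demographic variables are available; this does not correct coverage bias, but it makes over- and under-coverage visible.

```python
# Sketch: compare sample composition to population benchmarks to reveal
# over-/under-coverage. All shares below are made up for illustration.
import pandas as pd

population_share = pd.Series({"18-29": 0.20, "30-49": 0.34, "50-64": 0.26, "65+": 0.20})
sample_share = pd.Series({"18-29": 0.41, "30-49": 0.38, "50-64": 0.16, "65+": 0.05})

coverage = pd.DataFrame({"population": population_share, "sample": sample_share})
coverage["coverage_ratio"] = coverage["sample"] / coverage["population"]
print(coverage)  # ratio > 1: over-coverage; ratio < 1: under-coverage
```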


Relationship between researcher and BBD: new technical and methodological challenges.
BBD offers behavioral researchers richer and larger data on phenomena, entities, and relationships that previously were invisible in data, as well as substantially enriched data on already measured phenomena and entities. This novelty poses technical challenges for behavioral researchers (who typically lack advanced technical skills) in collecting, storing, and transmitting the data, and in the computing resources and statistical software needed to analyze them. BBD also poses statistical methodological challenges, because the more complex data and the more complex signals they hold typically require more complex study designs and modeling. Some traditional methodologies and modeling approaches do not scale up, while others pose new challenges when used as-is (see the section Capturing Heterogeneity: Shortcomings of Traditional Statistical Methods). In addition, the predominant mode of data analysis in behavioral research has been linear models, which typically come with distributional assumptions and require user specification; it is rare to see the use of machine learning approaches, where models are more data driven and make fewer assumptions about functional form but come with other attendant risks, such as overfitting (see the section Prediction in Causal Research).

Relationship between researcher and BBD subjects: institutional review board and ethics.
Behavioral research is bound by ethical rules of conduct. The trouble is that guidelines for conducting scientifically sound and ethically acceptable research do not directly transfer to the context of BBD studies. Collecting BBD through experiments or surveys typically, but not always, requires academic researchers to obtain approval from an ethics committee. Academic researchers in the biomedical and behavioral sciences are well familiar with institutional review boards (IRBs) and the process of obtaining IRB approval before carrying out a study that involves human subjects. Top journals require authors to confirm that their study has IRB approval. The IRB is a university-level committee designated to approve, monitor, and review biomedical and behavioral research involving humans. In the United States, any university or body that receives federal funds is required to have an IRB, which is governed by the Common Rule, a rule of ethics regarding biomedical and behavioral research involving human subjects. Such "ethics committees" also exist in other countries under different names.


The IRB performs a benefit/risk analysis for proposed studies, aimed at ensuring that the study will potentially make a sufficient contribution to justify the risks to the human subjects involved. Shmueli6 summarizes the main guidelines in the U.S. Code of Federal Regulations on Protection of Human Subjects,* which focus on beneficence, justice, and respect for persons:

- Risks to subjects are minimized (beneficence).
- Risks are reasonable in relation to benefits (beneficence).
- Selection of subjects is equitable (justice).
- Provisions are adequate to monitor the data and ensure confidentiality and the safety of subjects (beneficence).
- Informed consent is obtained and documented (respect for persons), including assuring comprehension of the information and voluntary agreement.
- Safeguards are in place for vulnerable populations (respect for persons).

The traditional IRB process does not transition smoothly from the biomedical context in which IRB procedures were shaped to today's BBD environment. Four specific issues are as follows:

1. Exemption for publicly available and pre-existing data: Current IRB regulations exempt research that uses already existing, publicly available data from the need to undergo IRB review, under the assumption that data already publicly available cannot cause any further harm to individuals.7 One result is that BBD researchers in machine learning, statistics, operations research, and related data science disciplines are mostly unfamiliar with the ethics of human subjects studies. This gap creates multiple dilemmas and unexpected trade-offs.

2. Different rules for academia and companies, and the definition of human subjects research: Clinical trials run by pharmaceutical companies in developed countries are regulated in terms of ethical conduct, because they involve a physical intervention in human life. In contrast, BBD-type companies (such as Google and Facebook) are not legally required to go through IRB approval (because they do not receive federal funding) and are not bound by regulations related to interventions even when they run experiments, because "data analytics techniques rarely appear as direct interventions in the life or body of an individual human being."7 Recently, Facebook created an Internal Ethics Review Committee. While such initiatives are important, they are self-regulated and are not required to be transparent outside the company. Moreover, the review criteria used by such committees are tailored to the typical questions the specific company's researchers address and the data they use, and these differ from the Common Rule used by IRBs (see Jackman and Kanerva8). Hence, the different ethical requirements for academic researchers versus companies create dilemmas and trade-offs.

3. Subjects' risks, harm, and consent: In BBD studies, the distance between the researchers and subjects is very large, with subjects almost invisible; risks and harm are often unanticipated and intangible, and it is therefore more difficult to determine what protections are needed.7 The IRB notion of minimal risk means that "the probability and magnitude of harm or discomfort anticipated in the research are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests."† However, risks in BBD studies are typically intangible (e.g., information privacy and data discrimination) and are therefore not defined as risks by the Common Rule.7 Consent is often obtained through a long and largely unread online "data use policy" page, which gives blanket coverage for all future company uses of the data; such policies are often modified by the company over time and are no longer binding in cases of mergers or company acquisitions. These challenges mean that ethical evaluation should take place not just before data acquisition (as is done in IRB applications) but also during and after the study.

4. Blurred distinction between research and practice: IRB approval is required only for studies considered "research." While in biomedical and psychological contexts the difference between research and practice is clear (the line between physician-as-caregiver and physician-as-researcher7), in BBD studies the distinction is blurred, both for company and academic researchers. While academic researchers might consider their work "research," if the work is done in collaboration with a company it also constitutes practice.

*See Part 46 on Protection of Human Subjects, 46.111 Criteria for IRB approval of research, in Code of Federal Regulations by the Department of Health and Human Services, www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr46/index.html#46.111
†"Code of Federal Regulations," www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.html
Acquisition of BBD
In this section, we discuss dilemmas related to generalization, technical and methodological issues, and ethical conduct that arise at the data acquisition step. In the "old times," collecting behavioral data through experiments, surveys, case studies, and other methods was a costly, slow, human-intensive process. In the new BBD era, data acquisition is cheaper, faster, and more automated, and data are available from new sources. These novelties create new challenges and dilemmas regarding generalization, modeling, and ethical conduct. In the following, we discuss several popular data sources used by researchers in causal BBD studies and the challenges and dilemmas associated with each. Table 1 summarizes these points.

Collaborating with a company
Company versus academic researcher objectives.
Many BBD-based studies published in top academic journals are based on a partnership between academic researchers and a company, where the data are obtained from the company's BBD. Adar9 notes that "there is rarely a perfect alignment between commercial and academic research interests. Clearly, the agenda of companies will focus [on] the kinds of questions they ask and consequently the kinds of data they capture: can I understand my customers and predict what they will do next?" Academic researchers, in contrast, are interested in the more generalizable scientific question and in publishing their results. The different ethical codes of conduct and review procedures for industry and academia widen the divide even further.

One way to ensure that the study is driven by the scientific goal is to make sure that answering the scientific question also benefits the company. Adar9 gives the example of the Facebook study by Kramer et al.,10 where researchers from Cornell collaborated with Facebook researchers on an experiment that manipulated the extent to which users were exposed to emotional expressions in their Facebook News Feed. The authors showed that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness.


Table 1. Research dilemmas at the data acquisition stage

Data source: Collaborating with a company
- Researcher–subjects relationship (ethical dilemmas): 1. Scientific value of intervention vs. risks. 2. Adverse effects on treated and nontreated subjects. 3. Adverse effects on the company. 4. Cannot terminate the experiment early.
- Researcher–BBD relationship (technical and methodology challenges): 1. Technical ability and resources for secure transfer and storage of BBD. 2. Study design constraints.
- BBD–research question relationship (generalization issues): 1. BBD sample not random. 2. Company BBD sample does not represent a larger population.

Data source: Open data, public data, website APIs, and web scraping
- Researcher–subjects relationship (ethical dilemmas): 1. Hacked data. 2. Privacy violation (even with deidentified BBD). 3. Use of web scraping (legality, effect on the website).
- Researcher–BBD relationship (technical and methodology challenges): 1. Inconvenient data formats and access. 2. Limiting sharing rules. 3. Programming ability (web scraping and using APIs). 4. Aggregation level of data does not support the research question.
- BBD–research question relationship (generalization issues): 1. API/scraped sample representative of what population? 2. Lack of demographic data. 3. Nonsampling errors.

Data source: Online data collection platforms
- Researcher–subjects relationship (ethical dilemmas): Larger distance between researcher and subjects (many issues, e.g., fair treatment and payment of subjects).
- Researcher–BBD relationship (technical and methodology challenges): 1. Costs, time, and programming knowledge to create and run a DIY platform. 2. Technical ability to set up an experiment/survey on an online labor platform and reach adequate workers. 3. Spill-over effects and revealing of deceptive treatments. 4. Reuse of the MTurk subject pool by many researchers.
- BBD–research question relationship (generalization issues): 1. Replicability of studies using DIY platforms. 2. Correspondence between subject pool and target population.

API, application programming interface; BBD, behavioral big data; DIY, do it yourself.

This study goal offered a potential benefit to the company (e.g., understanding how one user's posting behavior varies with their friends' behavior can be used to design interfaces or algorithms that encourage posting behavior), as well as to scientific inquiry (e.g., how emotional contagion works).

In terms of publication, it is advantageous for academic researchers to work with companies that understand and see value in scientific research, such as companies that reward their staff for academic publications. Large corporations with research divisions (such as Microsoft, Google, and IBM) have long traditions of publication and scientifically oriented research, and they even allow their research staff to perform independent research. A related dilemma can arise when the analysis results are risky or unfavorable to the company, a situation that might cause the company to disallow scientific publication. A popular solution is publishing the research while maintaining the company's anonymity.*

Finally, data acquisition often takes place during student internships and faculty sabbatical leaves at a company. While the presence of the academic researcher inside the company can make data access easier (e.g., some companies only allow data access from within their facilities), strengthen collaboration and trust, and improve understanding of the data context, it also might lead the researcher to feel more obligated toward the company's goals than toward scientific goals and codes of conduct.

*Even when anonymizing the company name is insufficient to mask it completely, in some cases this is sufficient from the company's perspective.

In addition to the ethical dilemmas, a researcher will often face two generalization challenges. First, because the BBD sample is typically not randomly drawn from the company's databases, there are issues of bias when inferring to the larger population. Second, while for the company a BBD sample of its users can reasonably represent its company-specific population of interest,† from a research point of view, results based on a company's BBD sample do not necessarily generalize to a larger population of interest beyond the company's users.

Conducting experiments on company users.
Researchers interested in running experiments to answer causal scientific hypotheses often partner with a company in designing the experiment and the intervention. A few recent examples of interventions in BBD experiments include manipulating the ratings of news article comments;2 providing only the treatment group with a prestige good;1 gifting an anonymous browsing feature on an online dating site to a treatment group, while the control group continued to browse nonanonymously;11 and manipulating the emotional content in users' Facebook News Feed.10

†Even for a company using its own BBD, generalization can be a challenge due to events such as mergers and acquisitions and economic changes, which can lead to bias when inferring from a sample in one period to a population in another period.


Researchers face several ethical dilemmas. One dilemma concerns the scientific value of the intended manipulation versus the possible risks that it poses to the participants, and possibly even to nonparticipants due to spill-over effects. Another dilemma relates to unintended adverse effects of the treatment on the participants. Adverse effects raise ethical questions regarding risk, fairness, and benefit. A third dilemma involves risks to the company from running the researcher's experiment.

In some cases the adverse effects on subjects can affect the company directly. For example, some companies with BBD have run experiments to evaluate new features or to compare strategies. Amazon.com was one of the first online companies to make massive use of what is known as A/B testing (a single-factor experiment with two levels). In one of their experiments they manipulated the prices of top-selling DVDs. However, users discovered the differential pricing scheme and voiced their anger, and the company discontinued the experiment and compensated customers.* Amazon continues to carry out A/B tests to study various factors, such as new home page designs, moving features around the page, different recommendation algorithms, and changes to search relevance rankings.† The Facebook experiment mentioned earlier led to strong reactions from users, the scientific community, the media, and others. Since such experiments do not require IRB approval (perhaps only approval by an internal company review committee), a researcher collaborating on the design of such experiments might face dilemmas about potentially unexpected adverse effects.

A researcher collaborating on an experimental design to be run by the company also faces methodological dilemmas regarding study design. Ideally, researchers aim to design experiments that optimize the chance of detecting the effect of interest, given practical constraints. Various dilemmas and trade-offs arise in such study design scenarios that differ from smaller scale, researcher-controlled behavioral experiments: choices such as sample size, sampling approach, duration of the experiment, and deployment approach can be dramatically constrained by company restrictions and requirements, such as limited access to some user populations (e.g., high-value customers), schedules and methods of deployment determined by the company's IT, marketing, or other groups, and often an inability to terminate the experiment early.

*"Amazon backs away from test prices," www.cnet.com, January 2, 2002.
†"Amazon's business strategy and revenue model: A history and 2014 update," www.smartinsights.com, June 30, 2014.
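To make the A/B comparison concrete, here is a minimal sketch (with synthetic counts, not figures from any study cited here) of the analysis step: comparing conversion rates of the control (A) and treatment (B) conditions with a two-proportion z-test. Company-scale experiments add stopping rules, guardrail metrics, and segment-level analyses on top of this.

```python
# Minimal A/B test analysis sketch (synthetic numbers): compare conversion
# rates of control (A) and treatment (B) with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1320, 1415]    # users who converted in A and B
exposures = [50000, 50000]    # users assigned to A and B

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
lift = conversions[1] / exposures[1] - conversions[0] / exposures[0]
print(f"absolute lift = {lift:.4f}, z = {z_stat:.2f}, p = {p_value:.4f}")
```

Note that with samples of this size even tiny differences reach statistical significance, which is one of the large-sample caveats listed later in Table 2.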


In addition, the technological nature of many such experiments means that the researcher relies heavily on the company for conducting the experiment and for compiling and sharing the relevant data. Researchers should possess the technical ability to handle secure transfer and storage of BBD, either in an ongoing manner (during different phases of the experiment) or after the experiment and data collection period are completed. These requirements call for technical knowledge and resources that differ from those of traditional behavioral studies.

Acquiring pre-existing experimental (and other) data.
Researchers can obtain experimental data that were already collected by a company. Currently, they are not required to obtain IRB approval for such data use. The Facebook experiment10 created controversy that eventually led to an editorial Expression of Concern published in the Proceedings of the National Academy of Sciences (PNAS), the journal in which the article was published. In a nutshell, the Cornell researchers were able to obtain an exemption from the IRB because the experiment had already been run by Facebook, and hence they were using a pre-existing data set. The editorial Expression of Concern highlights the issues of whether research using pre-existing data sets constitutes human subjects research, and whether the study should have triggered a full IRB review (see Metcalf and Crawford7 for further details):

  This paper represents an important and emerging area of social science research that needs to be approached with sensitivity and with vigilance regarding personal privacy issues. Questions have been raised about the principles of informed consent and opportunity to opt out in connection with the research in this paper. The authors noted in their paper, "[The work] was consistent with Facebook's Data Use Policy, to which all users agree before creating an account on Facebook, constituting informed consent for this research." When the authors prepared their paper for publication in PNAS, they stated that: "Because this experiment was conducted by Facebook, Inc. for internal purposes, the Cornell University IRB determined that the project did not fall under Cornell's Human Research Protection Program." This statement has since been confirmed by Cornell University. Obtaining informed consent and allowing participants to opt out are best practices in most instances under the "Common Rule." Adherence to the Common Rule is PNAS policy, but as a private company Facebook was under no obligation to conform to the provisions of the Common Rule when it collected the data used by the authors, and the Common Rule does not preclude their use of the data. Based on the information provided by the authors, PNAS editors deemed it appropriate to publish the paper. It is nevertheless a matter of concern that the collection of the data by Facebook may have involved practices that were not fully consistent with the principles of obtaining informed consent and allowing participants to opt out.


Adar9 claims that the controversy around the Facebook emotional contagion study, as reflected in the broad variation in responses from the public, academics (where computational scientists tended to be more in favor, compared with social scientists opposing it), the press, ethicists, and corporations, "demonstrates that we have not yet converged [to a] solution that can balance the demands of scientists, the public, and corporate interests."

Finally, sometimes a third party helps connect companies with researchers. For example, the Wharton Customer Analytics Initiative (WCAI),* an academic research center, describes its mission: "WCAI enables academic researchers from around the world to help companies understand how to better monetize the individual-level data they collect about customers through the development and application of new predictive models." WCAI works with companies with BBD to provide data to researchers, who are selected based on their research proposals. In this scenario researchers are again not required to obtain IRB approval. Unlike in a direct collaboration with a company, researchers in WCAI projects typically have limited direct access to the company.

Open data, public data, website APIs, and web scraping
BBD is also available outside of company collaborations: multiple organizations have been making their data publicly available through simple downloads or through APIs. Some make BBD available via a single data release (e.g., the Netflix Prize data on user ratings of movies, or the AOL release of users' query logs), while others provide an ongoing feed. In some cases, access is limited to researchers. Twitter is probably one of the most heavily used BBD sources among researchers today: one can download all tweets for a certain search term from the last 7 days. Amazon and eBay share some of their data via APIs. In contrast, Facebook does not provide data downloads. A second source is "open data" released by government agencies and organizations, which make their collected BBD publicly available.† Websites such as data.gov, data.gov.uk, and data.taipei provide data sets collected by government agencies: traffic accidents, consumer complaints, crimes, health surveys, and more.

*http://wcai.wharton.upenn.edu
†Another source for obtaining data is through a legal process, such as requests under the Freedom of Information Act (e.g., the Enron emails12 and Hillary Clinton's emails).


data.worldbank.org provides data on economic growth, education, and so on. While this trend and the number of available data sets have been growing, the data are often not easily accessible, due to limited APIs, inconvenient data formats (such as PDF files), and limiting sharing rules.9 Also, many data sets are provided at aggregation levels that do not support BBD research questions. For example, Taipei City provides data on its bicycle sharing system‡ aggregated to the bicycle station level rather than at the trip level. Such data might be useful for studying station-level demand, but not the movement of riders.

A third source of public BBD is websites that aggregate individual (single-release) data sets from disparate sources, such as the UCI Machine Learning Repository§ and, more recently, data mining contest platforms such as Kaggle.com and crowdanalytix.com, which host contests for various companies that share a large data set. Many of these contests include BBD, such as consumer ratings from the restaurant rating website yelp.com, Hillary Clinton's emails, customer bookings on airbnb.com, crimes in San Francisco, purchase and browsing behavior on ponpare.jp, restaurant bookings by customers on eztable.com.tw, and more.6 Data sets from repositories and contest websites are heavily used by researchers in machine learning to test new algorithms. While highly convenient to download and use, the secondary nature of such BBD and the insufficient context and collection history can make it inadequate for answering behavioral questions beyond the question the data were originally collected for. Hoerl et al.13 explain, regarding the use of data from contest websites:

  A better question to ask with a given data set would be: "What can I learn from this data set that would help me collect even better data in the future so I can solve the original problem and continuously learn?"

‡YouBike data is currently available at http://data.taipei/opendata/datalist/datasetMeta?oid=8ef1626a-892a-4218-8344-f7ac46e1aa48
§http://archive.ics.uci.edu/ml

In other words, it can be useful to explore such secondary data sets to discover the types of research questions that might be answerable, and then pursue a dedicated data acquisition strategy to procure more appropriate data, using a sound study design and an understanding of the domain from which the data arise. An example that highlights the role of context in using public BBD from a contest is the Netflix Prize contest that took place in 2006–2008.


At the time, Netflix was the largest movie rental company in North America, providing DVDs by mail to its members. The company set up a contest with the intention of improving its movie recommendation system. The contest, open to the public and carrying a $1 million prize, involved releasing a large BBD of movie ratings by individual users. The winning team included computer scientists and a statistician.14 While the company's initial goal for the contest was to improve its own recommendation system, the open contest ended up contributing to research on recommender systems without providing an improved solution: by the time the contest was over, Netflix was shifting from primarily DVD-by-mail to movie streaming, rendering the developed algorithm incompatible with the new type of BBD ("it turns out that the recommendation for streaming videos is different than for rental viewing a few days later"*). This example highlights the importance of context for answering relevant questions.

Finally, a data acquisition tool commonly used by academic researchers is "web scraping": automated programs that collect data from a website in a methodical way. Some websites disallow web scraping from some or all pages by setting technological barriers and legal notices. Yet many websites do tolerate web scraping by researchers, as long as they do not overload the servers or scrape massively.† Allen et al.15 discuss legal and ethical issues pertaining to web data collection and offer guidelines for academic researchers. From a technical point of view, researchers who are not savvy programmers face challenges in writing code that collects and parses the needed data. They also run the risk of violating ethical scraping conduct, causing a denial of service, or having their university's access to the website blocked if their code violates the company's rules.

*"Streaming has not only changed the way our members interact with the service, but also the type of data available to use in our algorithms." Netflix blog, April 6, 2012, http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5stars.html
†Legally, anyone breaking a website's terms of service to collect information is guilty of a federal crime; several researchers are now trying to change this law for the purpose of identifying company discrimination. See, for example, www.wired.com/2016/06/researchers-sue-government-computer-hacking-law
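To make the mechanics concrete, here is a minimal sketch of "polite" scraping along the lines discussed above: check robots.txt, identify the crawler, rate-limit requests, and parse only what is needed. The site URL, user-agent string, and HTML markup are placeholders, and any real project must also respect the target site's terms of service.

```python
# Sketch of "polite" web scraping: respect robots.txt, identify the crawler,
# and rate-limit requests. The URL and HTML structure are placeholders.
import time
import requests
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser

BASE = "https://www.example.com"  # placeholder site
HEADERS = {"User-Agent": "academic-research-bot (purpose: academic research)"}

robots = RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

def fetch_reviews(path):
    url = BASE + path
    if not robots.can_fetch(HEADERS["User-Agent"], url):
        return []  # page disallowed for crawlers
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code != 200:
        return []
    soup = BeautifulSoup(response.text, "html.parser")
    # hypothetical markup: each review sits in <div class="review">
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="review")]

for page in ["/reviews?page=1", "/reviews?page=2"]:
    print(len(fetch_reviews(page)), "reviews collected from", page)
    time.sleep(2)  # pause between requests to avoid overloading the server
```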

As in the case of company collaborations, and even more so, a major concern researchers face when obtaining public data from company or aggregator websites is that of generalization. What population does an API-provided or website-downloadable BBD sample represent? How was it sampled from the database, and what is omitted? Such BBD is also less likely to include detailed demographic information on subjects, for purposes of privacy protection. Hence, inference from the BBD sample to the larger population from which it was drawn is uncertain, and inference to a population beyond it is even less clear. For example, while Twitter is a popular source of BBD among researchers, even if we can treat a Twitter BBD as a random sample of the larger population of tweets,‡ Twitter users are different from the general population: they tend to be younger and college educated.§ Generalization is also a serious problem when BBD is obtained by web scraping. Although scraping gives the researcher more control over the collection process, there are still many unknowns regarding the relationship between the information available on the website and the information that the company does not make available on the website. In addition, server problems and Internet congestion, website refresh policies, poor website design, and the nonrandom nature of "search" results are just some of the factors that introduce nonsampling errors into the BBD sample (chapter 2 in Jank and Shmueli17 discusses these and other issues).

As described earlier, researchers acquiring publicly available BBD do not require IRB approval, yet such BBD can pose serious, if intangible, risks to the subjects. According to Bender et al.,18 the two pillars on which so much of social science has rested, informed consent and anonymization, are virtually useless in a BBD setting where multiple data sets can be linked. Metcalf and Crawford7 describe the study by Hauge et al.,19 which uncovered the true identity of the anonymous artist Banksy by linking multiple publicly available data sets, thereby violating his intended anonymity. Netflix was sued by a user claiming her movie preferences were revealed by the anonymized data the company made public.** Narayanan and Shmatikov20 showed that an adversary with minimal knowledge about an individual subscriber can easily identify that subscriber's record in the Netflix data set by linking it to user reviews on the Internet Movie Database, thereby identifying apparent political preferences and other potentially sensitive information.

‡Studies such as Gonzalez-Bailon et al.16 have shown different types of biases associated with Twitter samples.
§Pew Research Center, Social Media Update 2016, www.pewinternet.org/2016/11/11/social-media-update-2016
**Netflix Sued for "Largest Voluntary Privacy Breach To Date," http://privacylaw.proskauer.com, December 28, 2009, http://privacylaw.proskauer.com/2009/12/articles/invasion-of-privacy/netflix-sued-for-largest-voluntary-privacybreach-to-date
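As a toy illustration of the linkage risk described above (entirely synthetic data, hypothetical column names): once a "de-identified" release and a public data set share a few quasi-identifiers, a simple join re-attaches identities.

```python
# Toy illustration (synthetic data): linking a "de-identified" ratings table
# to a public profile table via shared quasi-identifiers re-identifies people.
import pandas as pd

# "Anonymized" release: direct identifiers removed, quasi-identifiers kept
ratings = pd.DataFrame({
    "zip_code": ["30013", "30013", "10001"],
    "birth_year": [1980, 1992, 1975],
    "gender": ["F", "M", "F"],
    "item_rated": ["movie_A", "movie_B", "movie_C"],
})

# Public data set with names attached (e.g., scraped public profiles)
profiles = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "zip_code": ["30013", "30013"],
    "birth_year": [1980, 1992],
    "gender": ["F", "M"],
})

# A join on the quasi-identifiers attaches names to the "anonymous" ratings
linked = ratings.merge(profiles, on=["zip_code", "birth_year", "gender"], how="inner")
print(linked[["name", "item_rated"]])
```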


Another ethical dilemma arises when considering private BBD made publicly available by hackers, by other researchers aiming at reproducible research, or otherwise. One example is a large BBD from the online dating site OKCupid.com that was (illegally) scraped by Danish researchers and then made publicly available.* A second example is the AOL search logs release, which the company quickly withdrew from its website, but the data had already been distributed online. Researchers conveyed their dilemma about using the AOL data:† "many were torn, loath to conduct research with it as they balanced a chronic thirst for useful data against concerns over individual privacy." Another example is the customer database of the adultery website www.ashleymadison.com, which was hacked and then made publicly downloadable on the Internet. The database was then used by journalists, other hackers, churches, blackmailing individuals, and more.‡ The researcher must therefore decide whether using such BBD is ethically and morally acceptable, and if so, in what form. While individual-level data can lead to more nuanced insights, privacy and confidentiality considerations might lead to aggregation before analysis. For example, in their study on online dating, Burtch and Ramprasad21 write: "We employ aggregate, anonymized data from the Ashley Madison data leak of 2015. In anonymizing the data, we follow the approach of other recent academic research that has drawn on the same data set."22,23

Online data collection platforms
Virtual Labs, online survey platforms, and online labor markets have become an important tool for collecting BBD. Ideally, these tools combine the technological platform with an adequate pool of human subjects. For example, Survey Monkey§ and Prolific Academic** have panels of on-demand survey respondents (some online survey platforms offer only the survey platform, but not a respondent pool). The novelty of such platforms for BBD researchers includes a more diverse pool of respondents compared with the traditional student population used by many researchers and, importantly, cheaper and faster turnaround, allowing the collection of larger samples at fast rates. Such platforms are highly valuable to global research, and especially to researchers outside of North America and Europe, who can get access to respondents from those regions.

*www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release
†"Researchers Yearn to Use AOL Logs, but They Hesitate," NY Times, August 23, 2006, www.nytimes.com/2006/08/23/technology/23search.html
‡"Life after the Ashley Madison Affair," The Guardian, February 28, 2016, www.theguardian.com/technology/2016/feb/28/what-happened-after-ashleymadison-was-hacked
§www.surveymonkey.com
**www.prolific.ac


Virtual Labs are another important BBD collection tool. These platforms are designed to replace on-the-ground behavioral laboratories, which have been heavily used by behavioral researchers, typically with undergraduate student populations. Virtual Labs make it possible for researchers to run large-scale experiments. Some researchers build their own online experimental platform: for example, Salganik et al.24 created an artificial music market (the Music Lab) to test the effect of social influence on the success of songs. They recruited participants from teen-oriented social networking sites and obtained 14,000 subjects. Watts25 describes the challenges they had with this do-it-yourself (DIY) approach, including the costs, time, and technical programming knowledge required to build the platform (it took the researchers a year), the nonreusable nature of the platform, the issue of replicability by other researchers, and the challenge of recruiting subjects. Adar9 describes another DIY virtual lab (MTogether††) created by the University of Michigan, designed as "an observational and interventional platform built into desktop and mobile platforms that tracks social media use and can 'manipulate' a user's experience. The initial releases were designed to leverage the alumni and fan base for Michigan."

Given the challenges associated with building and using DIY platforms, BBD researchers have increasingly embraced online labor markets. Such platforms have a large, stable, and diverse subject pool of "workers" who use the platform to earn money by performing different tasks, including responding to online surveys and serving as subjects in online experiments. These platforms allow researchers to perform studies at lower cost and with faster iteration between developing theory and executing experiments.26 Unlike DIY platforms, using online labor markets for BBD collection does not require technical knowledge or an investment of time and money to build the platform (although they can be used in conjunction with a DIY survey or experiment website), nor does it require the researcher to find subjects. Moreover, the large pool of subjects makes it easier to conduct synchronous experiments, where multiple subjects must be present at the same time.

††www.mtogether.us


We consider online labor markets to be sources of BBD because researchers are able to obtain much larger and richer behavioral data (and metadata) using such platforms compared with traditional data collection mechanisms. They are also able to answer new research questions, and scientific replicability becomes possible. At present, as indicated by published articles, the most popular platform used by BBD researchers is Amazon Mechanical Turk (MTurk*); other markets include Taskcn† in China. MTurk is used for a variety of tasks, including using workers as experiment subjects or survey respondents, as well as for manually cleaning or tagging data and performing other operations that humans are currently better at than computers (e.g., tagging the gender of people in photos). The latter use can enrich BBD from other sources, for example, by tagging behaviors observed in videos. There are also online experiment platforms, such as Xperiment,‡ that provide a platform for running an online experiment and use the MTurk worker pool. Some researchers run experiments using MTurk as a single BBD source (e.g., Mao et al.27), while others use it as supportive BBD for comparing and evaluating the MTurk results against those from a field experiment (e.g., Burtch et al.28). Finally, experimental BBD from online labor markets has led to a surge in labor-related studies, such as those on motivation (e.g., Liu et al.29) and collaboration (e.g., Watts25).

BBD collected using online labor markets raises technical, ethical, and generalization dilemmas. From a technical point of view, researchers must learn how to plan and execute surveys and experiments using the labor platform's tools. For example, the survey tool on MTurk is not as versatile and user friendly as many popular online survey tools (e.g., SurveyMonkey.com); linking to an external survey tool or experiment website requires strategies for linking the external information with the MTurk platform; and reaching adequate subjects requires understanding the incentive system for workers. Mason and Suri26 describe the mechanics of setting up a task on MTurk, including recruiting subjects, executing the task, and reviewing the submitted work.

From a generalization point of view, the advantage of online survey platforms with subject pools and of online labor markets is the greater diversity of their subjects compared with the typical subject in traditional social science and management studies: an undergraduate university student, typically in North America.

*www.mturk.com
†www.taskcn.com
‡www.xperiment.mobi


raphy, education, and culture. This profile is often more aligned with researchers’ population of interest than a sample of university students. However, it too might not be sufficiently reflective of the population of interest, leading to exclusion of specific and often sensitive populations. For example, if the population of interest includes computer-illiterate subjects and those not connected to the Internet (at present, 53% of the world population is not connected to the Internetx), then results from an MTurk BBD will not necessarily generalize to the population of interest. Therefore, researchers still must carefully consider whether the online worker community represents the population of interest in the study. According to Mason and Suri,26 numerous studies show correspondence between the behavior of MTurk workers and offline subjects on various dimensions. They conclude: ‘‘While there are clearly differences between Mechanical Turk and offline contexts, evidence that Mechanical Turk is a valid means of collecting data is consistent and continues to accumulate.’’ Yet, the literature also includes articles showing the discrepancy between MTurk and various populations of interest to researchers. From a methodological point of view, using MTurk increases the risk of spill-over effects (treatment effects spill over from the treatment group to the control group) and ineffective interventions due to the higher communication level between MTurk workers on forums (see also the Analysis of Behavioral Big Data section). For example, in experiments where deception is used and communicated to subjects after they complete the experiment, the deception information might be shared by those who have completed the experiment with other workers. A related challenge is the heavy reuse of the same subject pool—specifically MTurk workers—by many teams of researchers. Stewart et al.30 describe the potential consequences of nonnaı¨vete` of MTurk workers, many of whom report having taken part in common research paradigms: Experienced workers show practice effects which may inflate measures of ability or attentiveness to trick questions. cooperation in social games on MTurk has declined, perhaps as the result of too much experience or learning. Participants often conform to demand characteristics. and MTurk workers may infer demands, correctly or otherwise, from debriefings from earlier experiments. Workers may also have been previously deceived, a key concern in behavioral economics.

Finally, from an ethical point of view, while IRB rules apply to research studies using the online data collection platforms, additional ethical issues arise from the larger distance between researchers and subjects/workers. An in-depth discussion of the ethical considerations of such platforms is beyond the scope of this article. (One issue is fair treatment and payment of workers; the Wiki page http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters, created by several MTurk workers and signed by over 60 researchers, provides guidelines for researchers.)



Table 2. Research dilemmas at the data analysis stage

Capturing Heterogeneity (experiments and observational data)
- Researcher–subjects relationship (ethical dilemmas): 1. T-tests, ANOVA, and regression models ignore minorities by focusing on the average effect. 2. Unmatched observations in the treatment or control group are dropped from PSM and thereby excluded from treatment effect evaluation. 3. ATE-level analyses avoid privacy compromises.
- Researcher–BBD relationship (technical and methodology challenges): 1. Technical difficulty running statistical models on large, high-dimensional data. 2. T-tests, regression models, and PSM are not designed to capture heterogeneous effects. 3. Statistical significance in large samples. 4. Multiple testing can lead to false discoveries. 5. Self-selection modeling procedures are complex and suffer scaling issues.
- BBD–research question relationship (generalization issues): 1. The average effect does not necessarily generalize to subgroups or individuals. 2. Statistical generalization (small p values) does not imply practically significant effects. 3. False discoveries due to multiple testing. 4. ATE-level analyses are useful for assessing aggregate/overall social benefit.

Contamination of Treatment Effect in BBD Experiments (experiments)
- Researcher–subjects relationship (ethical dilemmas): 1. Difficult to use blinding and placebo in online BBD experiments. 2. How to identify and evaluate knowledge sharing between subjects that leads to treatment effect contamination?
- Researcher–BBD relationship (technical and methodology challenges): 1. The treatment effect can change over time, requiring the design of fast and recurring experiments. 2. How to design a network-based experiment, or an experiment in a networked environment, that leads to statistically valid results? 3. How to model dependence between subjects due to network effects?
- BBD–research question relationship (generalization issues): Internal validity concerns: 1. Overlap of experiments (the effect is of which treatment?). 2. Knowledge of allocation and gift effect (effect due to being treated, or to not getting the treatment). 3. Spill-over effect (the control group is also affected by the treatment).

Prediction in Causal Research (experiments and observational data)
- Researcher–subjects relationship (ethical dilemmas): 1. Predictive ranking algorithms can discriminate against minorities. 2. Labeling individuals using "top tier" ranking models can risk and cause damage to labeled subjects. 3. Studies based on BBD from recommender systems can deepen the recommender system biases.
- Researcher–BBD relationship (technical and methodology challenges): 1. Behavioral researchers are unfamiliar with predictive modeling and assessment. 2. Use of predictive modeling and assessment for causal research is different from application-driven prediction.
- BBD–research question relationship (generalization issues): 1. The company is interested in short-term prediction for immediate actions, while the researcher is interested in a generalizable causal effect. 2. The company is interested in individual-level predictions; the researcher is interested in group-level effects. 3. BBD from recommender systems, search queries, social network likes, and so on reflects the system's algorithm setting (bias).

Performance Measures
- Researcher–subjects relationship (ethical dilemmas): 1. Choice of metric (e.g., sensitivity + specificity vs. FPR + FNR) optimized for the company vs. the subjects.
- Researcher–BBD relationship (technical and methodology challenges): 1. Causal explanation measures (ATE, p values, R², etc.) differ from predictive measures (out-of-sample RMSE, AUC, etc.). 2. Some predictive measures (AUC) are incoherent or insensitive for imbalanced data.
- BBD–research question relationship (generalization issues): 1. Company metrics optimize for a specific application vs. researcher metrics that optimize for scientific discovery of a causal effect.

ANOVA, analysis of variance; ATE, average treatment effect; AUC, area under the curve; FNR, false-negative rate; FPR, false-positive rate; PSM, propensity score matching; RMSE, root mean squared error.

Analysis of BBD
In this section, we discuss dilemmas related to generalization, technical and methodological practice, and ethical conduct that arise at the data analysis stage. Compared to the analysis of traditional behavioral data, BBD poses more extreme as well as new challenges in terms of internal and external validity (generalization), technical and methodological

implementation, and ethical dilemmas. We discuss four key types of issues and relate them to the changing relationships between researcher, research question, data, and human subjects (see Table 2 for a summary).
Capturing heterogeneity: shortcomings of traditional statistical methods
The most common statistical analysis methods for testing and quantifying causality from experimental data are analysis of variance and regression models. Such methods are designed to extract the causal effect from randomized controlled experiments in the most efficient and statistically generalizable way. Regression models are also extremely popular in observational studies that test causal hypotheses. Yet, the small-sample


behavioral inference approach stumbles into several challenges when applied to BBD. The advantage of BBD is its richness in terms of diversity: we are more likely to have data on rare minorities compared to small samples or to lower-dimensional data. However, when we apply these statistical models to large BBD, the rare minorities are either filtered out (e.g., considered outliers) or are overshadowed by averaging with the large majority. For example, with very large samples the effect of small minorities and outliers on regression coefficients and statistical tests is very small. BBD also challenges the usefulness of statistical significance and inference due to large samples and multiple testing. We briefly describe each of these next.
Average treatment effect: ignoring heterogeneity and minorities. A/B testing, and more generally the design and analysis of randomized experiments, are aimed at gauging the differences between group averages; in other words, they compare "the average behavior" under different treatments and quantify the average treatment effect (ATE). This is useful for macrolevel decision-making, where a single decision is made for a large population to optimize the aggregate benefit (e.g., whether or not to institute a new policy or to change the website design). ATE-level analysis is also advantageous in terms of protecting the privacy of individual observations. While conducting statistical tests for the ATE on subgroups is technically possible (e.g., measuring gender-specific ATE), the subgroups must be prespecified and accounted for in the design stage of the experiment to derive valid conclusions. While the researcher might have theoretical justifications for a few such subgroup-level comparisons, these are far from taking advantage of the rich BBD measurements. It is possible that treatment effects differ dramatically for certain user groups or minorities, yet unless the researcher has theory or domain knowledge about this before the experiment, the statistical testing will ignore such heterogeneous effects. In observational studies, the same issue arises, although in a slightly different form. As in the design and analysis of randomized experiments, causal models based on observational data are also optimized to detect the ATE, thereby discarding or ignoring observations far from the average. In methods based on creating matched treatment and control groups (e.g., propensity score matching), observations that do not have a match are discarded from the analysis. These effects increase significantly with sample size.31
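A minimal simulation makes the point concrete. All numbers below are invented: a small minority experiences a strongly negative effect, yet the overall ATE looks comfortably positive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
minority = rng.random(n) < 0.05            # 5% minority subgroup
treat = rng.integers(0, 2, n)              # randomized treatment assignment

# Hypothetical effects: majority benefits (+1), minority is harmed (-3); noise on top
effect = np.where(minority, -3.0, 1.0)
y = treat * effect + rng.normal(0, 1, n)

df = pd.DataFrame({"treat": treat, "minority": minority, "y": y})
overall_ate = df[df.treat == 1].y.mean() - df[df.treat == 0].y.mean()
group_ate = {
    flag: d[d.treat == 1].y.mean() - d[d.treat == 0].y.mean()
    for flag, d in df.groupby("minority")
}
print(f"Overall ATE: {overall_ate:.2f}")   # about +0.8, looks clearly beneficial
print(group_ate)                           # majority about +1.0, minority about -3.0
```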


The same issue arises with regression and path models, which are frequently used in behavioral causal studies. For example, Chetty et al.3 estimated the impact of teachers on student outcomes using education and tax BBD. They combined BBD from administrative school district records and federal income tax records to study whether high value-added (VA) teachers improve students' long-term outcomes. The question of the long-term impact of teachers on student outcomes has long been of interest in economic policy. The novelty of this study is its use of BBD, which includes many life events such as test scores, teachers, demographics, college attendance and quality, teenage pregnancy, childbirth, earnings, and more.6 The authors used regression models and statistical inference to quantify the effects, concluding:
We find that teacher VA has substantial impacts on a broad range of outcomes. We find that students assigned to higher VA [Value-Added] teachers are more successful in many dimensions. They are more likely to attend college, earn higher salaries, live in better neighborhoods, and save more for retirement. They are also less likely to have children as teenagers.

Such a conclusion could lead policymakers to a binary decision of whether or not to use the VA system of teacher ratings. It can also lead school administrators, parents, and students to draw conclusions about their own teachers' impact on students' long-term outcomes. However, such modeling does not tell us about the nonaverage teacher and the nonaverage student. (The American Statistical Association issued a Statement on Using Value-Added Models for Educational Assessment criticizing current VAM use: www.amstat.org/policy/pdfs/ASA_VAM_Statement.pdf.) Another issue related to aggregation and heterogeneity is Simpson's paradox, where the direction of a causal effect is reversed in the aggregated data compared to the disaggregated data. It is important to detect whether Simpson's paradox occurs in a data set used for decision-making. Given a large and rich BBD and a causal question, it is useful to be able to determine whether a Simpson's paradox is present. In the presence of a paradox, the researcher or decision maker must determine whether to customize decisions or make a single overall decision. To address this need, Shmueli and Yahav32 introduced a method that uses classification and regression trees for automated detection of potential Simpson's paradoxes in data with few or many potential confounding variables, and which scales to large samples.
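The sketch below is not the tree-based method of Shmueli and Yahav32; it is only a brute-force illustration of the underlying idea, flagging candidate confounders for which every subgroup-level effect has the opposite sign of the aggregate effect (the data frame and column names are hypothetical).

```python
import pandas as pd

def simpson_flags(df, treatment_col, outcome_col, candidate_confounders):
    """Flag confounders whose subgroup-level effects all reverse the sign of the aggregate effect."""
    def effect(d):
        return (d.loc[d[treatment_col] == 1, outcome_col].mean()
                - d.loc[d[treatment_col] == 0, outcome_col].mean())

    aggregate = effect(df)
    flagged = []
    for col in candidate_confounders:
        subgroup_effects = [effect(d) for _, d in df.groupby(col)]
        # Potential Simpson's paradox: every subgroup effect has the opposite sign
        if all(s * aggregate < 0 for s in subgroup_effects):
            flagged.append(col)
    return aggregate, flagged
```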


If the researcher is interested in the effects of a policy or intervention on individuals' outcomes in addition to the average effect, then predictive modeling and validation can be a useful approach. For example, the causal statistical model can be used to generate predictions for a holdout set of observations (which is hopefully as diverse as the population of interest). The predictions and their errors can then be compared across different subgroups, or even simply sorted, to identify subgroups for whom the effects differ substantially from "the average majority." This is an example of using predictive validation for improving causal studies.33 Another example of using predictive modeling in conjunction with causal modeling is uplift modeling (or "true lift modeling"),34,35 which combines A/B testing with data from a prior treatment to model the incremental reaction to a treatment. An A/B test identifies which treatment does better on average, but says nothing about which treatment does better for which individual.36 The uplift model is then used for predicting which person will respond favorably to the treatment. Political campaigns now maintain extensive data on voters to help guide decisions about outreach to individual voters. They use uplift modeling to predict which voters are most likely to respond favorably to their outreach. This is typically done by first conducting a survey of voters to determine their inclination to vote for a certain party or candidate. Given the survey results, an A/B test is conducted, randomly promoting the party/candidate to half of the sample. The experiment is followed by another survey to evaluate whether voters' opinions have shifted. In summary, researchers should realize that classic statistical causal modeling and inference, based on experimental or observational data, are aimed at capturing the overall population causal effect. Communicating results should therefore be done with caution, so that stakeholders do not misinterpret the overall effect as applicable at the individual level.
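As a concrete reference for the uplift idea above, the following sketch shows a common "two-model" implementation: fit separate response models to the treatment and control arms of a randomized test and score individuals by the difference in predicted response probabilities. This is a generic sketch, not the specific procedure of Radcliffe and Surry34 or Lo35, and the arrays X, y, and treat are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def two_model_uplift(X, y, treat):
    """Fit separate response models on the treatment and control arms of an A/B test;
    uplift = predicted response probability if treated minus if not treated."""
    model_t = GradientBoostingClassifier().fit(X[treat == 1], y[treat == 1])
    model_c = GradientBoostingClassifier().fit(X[treat == 0], y[treat == 0])

    def predict_uplift(X_new):
        return model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]

    return predict_uplift

# Usage sketch: score new individuals and target those with the largest predicted uplift
# uplift_scores = two_model_uplift(X, y, treat)(X_new)
```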


ATE: statistical significance with large samples. Many behavioral researchers continue to apply the same small-sample methodology to BBD, thinking that the only challenge is computational scalability. Technically, running regression models on very large, high-dimensional samples is resource and time consuming, and is often addressed by brute force with more powerful computing resources. However, the larger and more dangerous pitfall is the methodological scalability of statistical inference. Relying on p values and statistical significance for drawing conclusions from large BBD is misleading at best. Specifically, in very large samples, even minuscule effects are statistically significant.37 While in some fields there are efforts to move away from using p-value cutoffs as decision points,38 the practice is very common in both academia and industry, possibly because proposed alternatives (e.g., ignoring p values altogether,39 using estimation in place of testing, adopting a Bayesian approach or a model selection approach40) are not common in behavioral research and do not support the popular null-hypothesis testing approach. T-tests are commonly used to test the statistical significance of a treatment in randomized experiments, by testing whether the difference between the group means reflects a nonzero difference in the population. Variability around the group means is used to determine statistical significance. However, with a sufficiently large sample, even large variability will result in a statistically significant average causal effect. For instance, the t (or z) statistic used in A/B testing is heavily influenced by the sample sizes of the two groups (n_1, n_2), regardless of the pooled standard deviation (S_p):

$t = \dfrac{\bar{Y}_{\mathrm{treat}} - \bar{Y}_{\mathrm{control}}}{S_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}$    (1)
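The following minimal simulation illustrates the point: a practically negligible effect (one-hundredth of a standard deviation, a value chosen purely for illustration) becomes overwhelmingly "significant" once the groups are large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ab_test(n_per_group, effect=0.01, sd=1.0):
    """Simulate an A/B test with a minuscule true effect; return the t statistic and p value."""
    control = rng.normal(0.0, sd, n_per_group)
    treatment = rng.normal(effect, sd, n_per_group)   # true ATE = 0.01 sd units
    return stats.ttest_ind(treatment, control)

for n in [1_000, 100_000, 10_000_000]:
    t_stat, p_val = ab_test(n)
    print(f"n per group = {n:>10,}  t = {t_stat:6.2f}  p = {p_val:.2e}")
# With 10 million observations per group the p value is astronomically small,
# even though the true effect (0.01 sd) is practically meaningless.
```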

An illustration of the danger of relying on statistical significance with large BBD is the VA teacher impact study by Chetty et al.,3 which had over one million records, yet the article reports many statistically significant effects that are practically meaningless (for more about this issue, see the blog post www.bzst.com/2012/05/policy-changing-results-or-artifacts-of.html). For example, the statistically significant effect of having a high-VA teacher on students' long-term financial earnings is in fact less than USD 1000 per year, on average. The authors get around this embarrassing magnitude by looking at the "lifetime value" of a student: "On average, having such a teacher for one year raises a child's cumulative lifetime income by $80,000 (equivalent to $14,500 in present value at age 12 with a 5% interest rate)."
Multiple testing. Agarwal and Chen41 summarized several of the challenges in designing algorithms for computational advertising and content recommendation. One is the multivariate nature of outcomes, in multiple different contexts, with multiple objectives. They call this the "3Ms": multiresponse (clicks, shares, comments, likes) and multicontext (mobile, desktop, e-mail) modeling to optimize multiple objectives (trade-offs in engagement, revenue, and viral activities).


Such multiplicity, when using statistical inference, translates into testing many hypotheses, which runs the risk of false discoveries. For example, when testing m independent hypotheses, if each hypothesis is tested at significance level α, then the chance of a false discovery in at least one of the m tests grows quickly with m:


$\alpha_{\mathrm{combined}} = 1 - (1 - \alpha)^m$    (2)
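A quick numeric illustration of Equation (2) with α = 0.05:

```python
alpha = 0.05
for m in [1, 10, 50, 100]:
    combined = 1 - (1 - alpha) ** m   # probability of at least one false discovery
    print(f"m = {m:3d}  P(at least one false discovery) = {combined:.3f}")
# m =   1 -> 0.050,  m =  10 -> 0.401,  m =  50 -> 0.923,  m = 100 -> 0.994
```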

While experiments conducted for scientific research typically do not suffer the extreme multiplicity issues of the industrial environment, multiple testing arises in other ways. Behavioral researchers are prone to false discoveries with BBD because they rely heavily on statistical inference for testing their hypotheses. The dilemma then becomes which hypotheses to focus on: only the theory-based hypotheses, or also hypotheses suggested by domain knowledge holders at the company? Should the many insignificant effects be reported or not? With BBD, which is usually heterogeneous and high dimensional, the extra trade-off is between post hoc testing of effects for different subgroups to identify heterogeneous effects (e.g., by gender, age, race, and other measured variables and their combinations) and the risk of false discoveries due to multiple testing. The methodological dilemma is how to account for multiple testing (in many published articles, there is no adjustment at all). Should one use statistical methods such as the Bonferroni family-wise adjustment or the more powerful false discovery rate procedure of Benjamini and Hochberg?42 Should one consider machine learning algorithms that search for data-driven heterogeneous effects (e.g., Imai and Ratkovic43 and Yahav et al.31)? While the latter may not yield results in the structure familiar to behavioral researchers, they are tuned to account for some of the false discovery challenges of BBD.
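For reference, the sketch below applies both the Bonferroni and the Benjamini–Hochberg adjustments to a set of p values using statsmodels; the p values themselves are made up for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p values from many subgroup-level tests (illustrative only)
pvals = np.array([0.001, 0.004, 0.012, 0.03, 0.04, 0.21, 0.38, 0.49, 0.62, 0.88])

for method in ["bonferroni", "fdr_bh"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, "rejects", reject.sum(), "of", len(pvals), "hypotheses")
# Bonferroni controls the family-wise error rate and is more conservative;
# Benjamini-Hochberg controls the false discovery rate and retains more power.
```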


Modeling self-selection. Quasi-experiments are similar to randomized experiments except that they lack random assignment. Quasi-experimental BBD is quite common in cases where users self-select whether to be in the treatment or the control group. Lambert and Pregibon44 described a self-selection challenge in the context of online advertising, where Google wanted to test a new feature but could not randomize which advertisers would receive it. An additional challenge they faced was assessing whether the new feature made advertisers happier, given the advertisers' self-selection of whether or not to use it. The two most common approaches for inferring causality from quasi-experiments are propensity score matching (PSM45) and the Heckman approach.46 Both methods attempt to match the self-selected treatment group with a control group that has the same propensity to select the treatment. The methods differ mainly in whether they assume that the selection process can be modeled using observable data (PSM) or treat it as unobservable (Heckman approach46). With very rich BBD, it becomes more plausible to model the selection process, and therefore PSM is common in BBD studies. While PSM can handle these issues, Lambert and Pregibon44 note:
Our main reservation about all variants of matching is the degree of care required in building the propensity score model and the degree to which the matched sets must balance the advertiser characteristics. If analysis is automated, then the care needed may not be taken. In our idealized view, we want our cake and we want to eat it too; specifically, we require an estimator that has good performance and that can be applied routinely by non-statisticians.
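To make the matching mechanics concrete, here is a bare-bones PSM sketch: a logistic regression propensity model followed by 1:1 nearest-neighbor matching, with no calipers, balance diagnostics, or variance estimation. It is a simplified illustration under these assumptions, not the procedure used in any of the cited studies, and the data frame and column names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_effect(df, treatment_col, outcome_col, covariate_cols):
    """Estimate a treatment effect via 1:1 nearest-neighbor propensity score matching."""
    X = df[covariate_cols].values
    t = df[treatment_col].values
    y = df[outcome_col].values

    # 1. Model the self-selection (propensity) using observable covariates
    propensity = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]

    # 2. For each treated unit, find the control unit with the closest propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(propensity[control].reshape(-1, 1))
    _, idx = nn.kneighbors(propensity[treated].reshape(-1, 1))
    matched_control = control[idx.ravel()]

    # 3. Average difference in outcomes over matched pairs (effect on the treated)
    return (y[treated] - y[matched_control]).mean()
```

Even this minimal version exposes the modeling choices (propensity specification, matching rule) that Lambert and Pregibon44 caution require care.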

With an increasing number of quasi-experiments in industry and in academia, there is a need for more robust, automated, insightful, and user-understandable techniques. Moreover, matching techniques do not scale well to big data. Recently, Yahav et al.31 developed a tree-based approach that offers an automated, data-driven, nonparametric, computationally scalable, and easy-to-understand alternative to PSM. They illustrated the usefulness of the tree-based approach in a variety of scenarios, such as heterogeneous treatment effects and a continuous treatment variable; the method also highlights pretreatment variables that are unbalanced across the treatment and control groups, helping the analyst draw insights about what might be driving the self-selection.6
Contamination of treatment effect in BBD experiments
The origins of experimental design and analysis are in a nonhuman-subjects context. These designs were later adapted for human-subjects experiments such as clinical trials. Like clinical trials, BBD experiments pose challenges such as compliance and awareness of being treated. Moreover, BBD experiments are conducted by companies in new, fast, and less-regulated ways, which leads to several challenges and dilemmas both for industry and for research.


In the book Amazonia: Five Years at the Epicentre of the Dot-Com Juggernaut,47 Marcus, an Amazon.com ex-employee, describes some of the challenges that Amazon encounters (quoted from "Amazon's business strategy and revenue model: A history and 2014 update," www.smartinsights.com, June 30, 2014):
Amazon has a culture of experiments of which A/B tests are key components. These involve testing a new treatment against a previous control for a limited time of a few days or a week. The system will randomly show one or more treatments to visitors and measure a range of parameters such as units sold and revenue by category (and total), session time, session length, etc. The new features will usually be launched if the desired metrics are statistically significantly better. Statistical tests are a challenge though as distributions are not normal (they have a large mass at zero for example of no purchase). There are other challenges since multiple A/B tests are running every day and A/B tests may overlap and so conflict. There are also longer-term effects where some features are 'cool' for the first two weeks and the opposite effect where changing navigation may degrade performance temporarily. Amazon also finds that as its users evolve in their online experience the way they act online has changed. This means that Amazon has to constantly test and evolve its features.

This description highlights three challenges that companies face when inferring causality from A/B testing: reliance on statistical significance in large-scale experiments; overlap of different experiments, causing confounding of effects; and effects that change over time, either due to fashion or due to users' experience.
Lack of "clean" baseline. Academic researchers typically conduct a single experiment rather than an ongoing series of A/B tests for different effects. Clinical trials are designed to avoid overlap of subjects across multiple simultaneous trials. Yet, the first two challenges at Amazon (reliance on statistical inference and the lack of a "clean" baseline) are also relevant to researchers. Companies run multiple overlapping experiments for various reasons (e.g., Tang et al.48 describe an overlapping experiment infrastructure developed at Google). When multiple experiments run with even some overlap, internal validity is endangered: is the effect of interest the cause of the outcome, or is some other experiment driving the outcome? Although a clean baseline is highly desirable, researchers often find it challenging to identify a BBD sample that lacks effects from other interventions performed by the company at the same time, and sometimes it is even difficult to ascertain whether, and what type of, interventions (e.g., promotions) or experiments occurred during the data collection period.
Knowledge of allocation and gift effect. In human-subjects experiments in both the behavioral and


biomedical fields, a concern arises regarding the effect of subjects' knowledge of their allocation to the treatment or control group on the outcome. Knowledge of allocation can also affect subjects' compliance levels. In clinical trials, solutions include placebos as well as single, double, and even triple blinding (where the subjects, the doctors, and even the data analysts are blind to the group allocation). However, placebo and blinding strategies can be difficult to use in BBD experiments, especially when they are carried out online, due to the highly networked environment. For example, subjects sometimes identify a manipulation and their group allocation through online communication channels such as forums. An example is the experiment by Amazon that manipulated prices of top-selling DVDs: consumers quickly detected the price variations shown to different users.6 In their BBD study, Hinz et al.1 addressed the issue of subjects' possible knowledge of the premium-gift manipulation by surveying users in both the treatment and control groups and verifying that they perceived their chance of receiving the premium gift as equal. Another issue that arises from the higher connectivity between subjects is a potential "gift effect." In behavioral experiments, as in clinical trials where the treatment group receives a gift or preferential treatment, it is possible that the treated members react to the act of receiving a gift rather than (or in addition to) the treatment itself. While clinical trials try to avoid this issue by using placebos, it is more difficult to implement placebos in BBD experiments. In their online dating experiment, Bapna et al.11 evaluated the possibility of a gift effect by comparing outcomes at the very end of the treatment period to those immediately after the treatment ended. They found that the effect was still present at the very end of the treatment period but disappeared immediately at the beginning of the post-treatment month, thereby ruling out a possible gift effect. Knowledge of allocation and the gift effect are difficult to control, requiring the researcher to find ways to evaluate their extent and then account for them. The researcher's creativity often plays a major role in doing so.
Spill-over effects. A methodological challenge that arises in randomized BBD experiments is that the treatment can sometimes "spill over" and affect the control group subjects or even peers of the treatment subjects.49 This is especially challenging in social network environments, where control group members might


become "contaminated" by the treatment through connections with members of the treatment group. Spill-over can easily happen when the treatment is some type of information that can be easily shared. While spill-over can also occur in small-scale, researcher-controlled behavioral experiments, the issue is more severe in the BBD environment, where the company runs the experiment and users are networked via the platform. Bapna and Umyarov,50 who conducted an experiment on a social network, emphasize:
Researchers working on network experiments have to be careful in dealing with possible biases that can arise because of the presence of network structure among the peers of manipulated users. The failure to appropriately account for this intersection problem could introduce various kinds of biases threatening either internal or external validity of the experiment. How one deals with this issue depends on the particular methodology and design choices deployed by the researchers.

The networked environment creates design and analysis challenges. Fienberg51 described the challenge of using standard randomized designs and the role of randomization with networked subjects: "How to design [a] network-based experiment with randomization and statistically valid results?" While the treatment and control subjects might be chosen to be sufficiently far apart in the network to avoid spill-over effects, Fienberg51 emphasizes that the analysis should account for the dependence between observations that arises due to the network structure.
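One widely used design response to spill-over, sketched below, is cluster (community-level) randomization: whole network communities are assigned to treatment or control so that most of a subject's ties stay within the subject's own condition. This is a generic sketch using networkx, not the design of any study cited here, and the toy graph is only for demonstration; as Fienberg51 notes, the analysis must still account for the remaining dependence.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cluster_randomize(G, seed=42):
    """Assign whole network communities to treatment or control to limit spill-over."""
    rng = random.Random(seed)
    assignment = {}
    for community in greedy_modularity_communities(G):
        arm = rng.choice(["treatment", "control"])
        for node in community:
            assignment[node] = arm
    return assignment

# Example on a toy graph
G = nx.karate_club_graph()
assignment = cluster_randomize(G)
same_arm = sum(assignment[u] == assignment[v] for u, v in G.edges()) / G.number_of_edges()
print(f"Share of edges within the same arm: {same_arm:.2f}")  # most ties stay within one arm
```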


Prediction in causal research
Lack of prediction in behavioral research. A fundamental difference in the use of analytics by companies versus behavioral researchers is the heavy focus of companies on predictive analytics, while behavioral academic research focuses more on causal modeling. Predictive modeling and assessment are necessary and beneficial for scientific development.33,52 However, their use for theory building, evaluation, and development is different from the deployment of predictive analytics for commercial purposes. Companies use predictive analytics for short-term decisions and actions such as direct marketing, customer churn, fraud detection, human resources analytics, and personalized recommendations. Such analytics provide immediate solutions and actions based on correlations between measured behaviors, aimed at improving the company's operations in the short term. However, they do not provide an understanding of the underlying causes of problems such as employee churn, prisoner recidivism, or customer dissatisfaction. In contrast, academic studies using causal modeling often do not use predictive assessment and limit their scope to statistical inference on prespecified overall population effects. Using more data-driven methodologies and performing predictive assessment can help establish the relevance of the results to applications and enhance novel discoveries. At the same time, using machine learning methods for predicting or mining individual observations (e.g., social network analysis) compromises privacy. One ethical dilemma is therefore the application of data mining algorithms designed for non-BBD to BBD. Another methodological challenge and ethical pitfall is using BBD that results from organizations that utilize a predictive model (e.g., credit scoring data). Applying predictive models to such data simply constitutes "reverse-engineering" the organization's algorithm and perpetuates the biases introduced by that algorithm.
Algorithms for scoring and ranking "special" populations. A common predictive goal is ranking a new set of observations to identify those with the highest probability of behaving in some way, or classifying observations as "special" if their probability crosses some threshold. Examples include fraud detection, direct marketing, crime prediction, and early warning systems in education. Companies and organizations using such algorithms can cause much damage. The ProPublica journalism organization recently "found that an algorithm being used across the country to predict future criminals is biased against black defendants" (www.propublica.org/podcast/item/how-we-decided-to-test-racial-bias-in-algorithms). O'Neil53 calls such algorithms weapons of math destruction (WMD), giving the example of sentencing algorithms used by judges: "sentencing models that profile a person by his or her circumstances help to create the environment that justifies their assumptions. This destructive loop goes round and round, and in the process the model becomes more and more unfair." O'Neil53 further describes the contagious effect of predictive algorithms used by different organizations applied to the same "special" populations:
Poor people are more likely to have bad credit and live in high-crime neighborhoods, surrounded by other poor people. Once the dark universe of WMDs digests that data, it showers them with predatory ads for subprime loans or for-profit schools. It sends more police to arrest them, and when they're convicted it sentences them to longer terms. This data feeds into other WMDs, which score the same people as high risks or easy targets and proceed to block them from jobs, while jacking up their rates for mortgages, car loans, and every kind of insurance imaginable. This drives their credit rating down further, creating nothing less than a death spiral of modeling. Being poor in a world of WMDs is getting more and more dangerous


and expensive. The same WMDs that abuse the poor also place the comfortable classes of society in their own marketing silos. They jet them off to vacations in Aruba and wait-list them at Wharton. For many of them, it can feel as though the world is getting smarter and easier. Models highlight bargains on prosciutto and chianti, recommend a great movie on Amazon Prime, or lead them, turn by turn, to a café in what used to be a "sketchy" neighborhood. The quiet and personal nature of this targeting keeps society's winners from seeing how the very same models are destroying lives, sometimes just a few blocks away.

Researchers interested in causal goals do not typically use scoring and ranking algorithms. Yet, the results from such algorithms can uncover novel insights about human and social phenomena at the extremes of behavior, with negative consequences (high likelihood of addiction, failure at school, loan default) as well as positive ones (high sporting performance, job promotion, finding a partner). The trade-off of using ranking algorithms for studying human and social behavior is the ethical risk of labeling individuals, leading to unjust accusations and labels that can have damaging psychological, legal, financial, and other consequences. More dangerously, researchers using ranking and classification algorithms to model BBD that results from an organization that already uses some such algorithm (e.g., court sentencing data that are affected by recidivism scores, or credit risk data from an agency that uses algorithmic credit scores) perpetuate the bias introduced by the organization's algorithm. Thus, analyzing data that have been "contaminated" by an organization's predictive algorithm creates biased and typically unfair and unjust insights. The contagious effect of WMDs means that even BBD that appears to be collected from an organization that does not use predictive algorithms might be contaminated by other organizations that apply such algorithms to the same (or part of the same) population. At the same time, behavioral researchers can use this knowledge to apply ranking and scoring algorithms to reverse engineer and uncover the biases introduced by companies and other organizations. Such discoveries can lead to improved transparency, better algorithm design (O'Neil53 suggests explicitly embedding better values into algorithms, "creating Big Data models that follow our ethical lead"), and better policy making, thereby contributing to behavioral research and policy in this age of BBD.
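As an example of this reverse-engineering direction, the sketch below audits a classifier's outputs by comparing error and flag rates across levels of a protected attribute, in the spirit of (though not reproducing) the ProPublica analysis. All arrays are hypothetical, and the rates use the conventional definitions (FPR among actual negatives, FNR among actual positives).

```python
import pandas as pd

def groupwise_error_rates(y_true, y_pred, group):
    """Compare false-positive, false-negative, and flag rates across a protected attribute."""
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "group": group})
    rows = []
    for g, d in df.groupby("group"):
        fpr = ((d.pred == 1) & (d.y == 0)).sum() / max((d.y == 0).sum(), 1)  # among actual negatives
        fnr = ((d.pred == 0) & (d.y == 1)).sum() / max((d.y == 1).sum(), 1)  # among actual positives
        rows.append({"group": g, "FPR": fpr, "FNR": fnr, "flag_rate": (d.pred == 1).mean()})
    return pd.DataFrame(rows)
# Large gaps in FPR or flag_rate across groups suggest the score treats groups unequally.
```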


Biased data due to recommendation algorithms. Personalized recommendation agents are ubiquitous on websites that offer products and services, and are considered an e-commerce feature that is highly valued by consumers. Recommendations are often based on association rules and/or collaborative filtering algorithms. However, since such algorithms are chosen by the vendor, they might be set to be biased toward the vendor's benefit, or toward what the vendor perceives customers want. For example, Chau et al.54 study the effects of malfunctioning recommendation systems on users' distrust and behaviors. Similarly, Xiao and Benbasat55 examine recommendation systems that are designed to produce recommendations on the basis of benefiting e-commerce merchants rather than benefiting consumers. Regardless of the company's motivation behind the choice of recommendations, researchers analyzing data that arise from recommendations, such as network data on "likes" of different products by users, might in fact be deepening the vendor bias by "discovering" relationships that are vendor induced. This reflects a mismatch between the researcher's scientific goal and the company's goal. A related bias arises when analyzing BBD from information systems that use a page-rank-type algorithm, such as Internet search queries (e.g., Google queries) or shares/likes on social networks, where "the rich get richer": the most popular items become even more popular. Another example of how a recommendation-engine policy biases data is the recent series of changes Facebook has made to its "Trending Topics" feature ("Facebook Moves to Curtail Fake News on 'Trending' Feature," Wall Street Journal, January 25, 2017), with the latest version aimed at surfacing only stories covered by what it deems credible publishers. Analyzing Facebook BBD from periods with different recommendation policies would therefore capture users' behavior in light of the company policy in force at the time.
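A simple way to gauge the "rich get richer" concern in BBD from a recommender-driven platform is to track how concentrated interactions become among the most popular items over time. The sketch below assumes a hypothetical interaction log with item and timestamp columns; a rising series suggests the recommendation engine is concentrating attention on already-popular items, a bias that any downstream analysis will inherit.

```python
import pandas as pd

def top_share_by_period(log, item_col="item_id", time_col="timestamp",
                        freq="M", top_frac=0.01):
    """Share of interactions going to the top `top_frac` most popular items, per period."""
    log = log.assign(period=pd.to_datetime(log[time_col]).dt.to_period(freq))
    shares = {}
    for period, d in log.groupby("period"):
        counts = d[item_col].value_counts()          # items sorted by popularity
        k = max(int(len(counts) * top_frac), 1)
        shares[period] = counts.iloc[:k].sum() / counts.sum()
    return pd.Series(shares).sort_index()
```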

Performance measures
The choice of performance metrics and approaches sometimes differs between researchers and companies, as the former optimize for scientific discovery while the latter optimize for a specific application. An example is the Netflix Prize contest described earlier (see the Open Data, Public Data, Website APIs, and Web Scraping section): the winning criterion was an improvement in a metric measuring the accuracy of recommendations for new movie–user pairs. Unsurprisingly, the competing teams consisted mostly of machine learning researchers and a few statisticians, but no behavioral researchers. A causal research study examining user movie ratings would most likely require information about the users and about the movies to ascertain why users rate certain movies high or low (interestingly, although many teams initially collected additional information about movie features, the winning team found those features to be useless for improving predictive accuracy13), and model performance would be measured by evaluating model fit and the strength of effects of different possible causes.



Below we look at the metrics used by researchers. However, there is a need for developing adaptations of analysis methods that directly optimize a function accounting for stakeholder goals (e.g., profit or real-time scoring56) or requirements (e.g., regulatory compliance or interpretability), rather than optimizing statistical metrics such as mean squared error, likelihood, Akaike's information criterion, or area under the curve (AUC). Developing such methods will not only be useful to the stakeholders (businesses, government) but can also potentially make their functions of interest more transparent to the data science community and to their customers and users.
Measures of causal explanatory performance. Typically, behavioral researchers pursue causal questions, apply causal models, and use statistical metrics to estimate effects (e.g., ATE, regression coefficients, R²) and to test the generalization of the effect to the population (e.g., p values and confidence intervals). As discussed in the context of causal modeling, such metrics of explanatory performance are aimed at quantifying average population effects and generalizing these overall effects to the population. These metrics are useful and appropriate when the overall population number is of importance, as with economic indicators. For example, Bapna et al.57 used BBD from eBay to estimate the consumer surplus from eBay in 2003. With BBD it is possible to quantify more nuanced effects. The behavioral researcher's dilemma is whether to pursue more nuanced effects at the risk of multiple testing and, if so, whether and how to use automated machine learning algorithms for identifying causal effects that are not sample specific, as well as which adjusted metrics to use. Moreover, behavioral researchers still do not typically use predictive evaluation to support their discoveries and models.
Predictive measures. Researchers developing predictive algorithms typically use predictive metrics such as precision, recall, precision–recall charts, sensitivity, specificity, receiver operating characteristic (ROC) curves, AUC, lift charts, and out-of-sample


predictive performance. However, some of these metrics are inappropriate for use in BBD research. Specifically, the popularly used AUC has been criticized for several reasons. According to Hand,58 the AUC can give potentially misleading results if the ROC curves cross (a common situation in real applications), and, more critically, the AUC is fundamentally incoherent in terms of misclassification costs because it uses different misclassification cost distributions for different classifiers. Hanczar et al.59 showed that the AUC has high variance and is therefore not precise, especially in imbalanced and small samples. Most importantly for BBD studies, in real classification studies a single threshold is chosen, yet the AUC may not reflect the expected classification accuracy at this single threshold. Because BBD studies often involve imbalanced binary outcomes (modeling differences between a small minority and a majority), the choice of metrics should be suitable for such situations. Saito and Rehmsmeier60 show that ROC curves are insensitive to the imbalance ratio, while precision–recall charts do change with the imbalance ratio.
A social issue arises in the context of choosing between reporting and optimizing for sensitivity and specificity or for false-negative rate (FNR) and false-positive rate (FPR). The two sets differ in terms of the point of view of who is using the model. If the resulting decision/conclusion is at the overall population level, for example, a public health policy, then sensitivity and specificity (and ROC curves) are appropriate. Sensitivity and specificity measure the ability of the model to correctly detect the important class (sensitivity) and to correctly rule out the unimportant class (specificity). In contrast, when we are evaluating the results of an algorithm from the perspective of an individual, then FPR and FNR are appropriate. To illustrate this, consider the Wellcome ELISA test for HIV, where the important class is testing positive. The four metrics are:
Sensitivity = P(positive test result | person is HIV positive).
Specificity = P(negative test result | person is not HIV positive).
FNR = P(person is HIV positive | negative test result).
FPR = P(person is not HIV positive | positive test result).
While the public health researcher is interested in the first two, a person who just tested positive (and his/her doctor) is interested in the last two, and specifically in 1 − FPR = P(person is HIV positive | positive test result).
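Under the definitions above, the individual-perspective quantities follow from Bayes' rule and depend on prevalence, which sensitivity and specificity ignore. The sensitivity, specificity, and prevalence values in the sketch below are purely illustrative, not properties of any actual test.

```python
def individual_view(sensitivity, specificity, prevalence):
    """P(condition | test result), per the individual-perspective definitions in the text."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_condition_given_pos = sensitivity * prevalence / p_pos              # = 1 - FPR as defined above
    p_condition_given_neg = ((1 - sensitivity) * prevalence /
                             (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence))  # = FNR
    return p_condition_given_pos, p_condition_given_neg

pos, neg = individual_view(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"P(HIV+ | positive test) = {pos:.3f}")   # ~0.09: at low prevalence most positives are false
print(f"P(HIV+ | negative test) = {neg:.6f}")
```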


This misalignment of interests creates a dilemma for the model builder in terms of which metrics to optimize and to report.
Conclusion
Our discussion of the different dilemmas facing a researcher using BBD has tried to highlight the close relationship between the data and the analysis context, and the disruption to the researcher–research question–data–human subjects relationships caused by BBD. This integration raises issues that do not typically arise when considering the data as simply numbers and the analysis as a data science exercise; nor do they arise when using small behavioral data or big data that is not behavioral. While this article mostly raises questions rather than offers solutions, we hope that the various challenges and dilemmas faced by researchers, both in the behavioral sciences and in the data sciences, will lead to broader discussions and innovations in empirical scientific research, as well as to the development of more sophisticated procedures, practices, and norms for identifying and working through these dilemmas. One immediate lesson we have learned is that conducting BBD studies, and providing solutions to the multiple dilemmas arising in the new BBD landscape, requires a combination of knowledge and skills: an understanding of social science questions and ethical human-subjects issues, technical (programming) skills and statistical know-how for big data, and communication skills for managing collaborations with companies. This combination of knowledge and skills is typically not possessed by researchers from a single field. Yet, as data scientists become more heavily involved in BBD studies, sometimes in leadership roles, it is important to revise the data science curriculum to include the ethics of research with human subjects, which is currently strikingly missing. The emotional contagion experiment by Facebook highlighted the ethical and moral issues that large-scale experiments on human subjects raise. The ease of running a large-scale experiment quickly and at low cost carries the danger of harming many people at a rapid rate. One suggestion for reducing such risk is to perform a small-scale pilot study to evaluate risks and unintended effects. However, this alone is clearly insufficient to provide a principled approach going forward. It is clear that the various ethical questions that face a researcher cannot be effectively addressed by individual researchers, and leaving it to individuals'


choices can in fact be dangerous and harmful to BBD subjects. Researchers in academia and in industry often have incentives that conflict with moral or ethical choices. Given that there is no universal moral or ethical code, it is essential to form communities of practice. A first step in that direction is raising awareness of the issues and conflicts and starting discussions about possible approaches.
Acknowledgments
I thank Bart Baesens, Patrick Chau, Tomer Geva, Soumya Ray, and Inbal Yahav for valuable suggestions. Three reviewers and an associate editor provided helpful feedback that improved the structure and content of the article. I am grateful for the insightful questions and feedback from Vasant Dhar that greatly improved the article. This work was supported, in part, by grant 105-2410-H-007-034-MY3 from the Ministry of Science and Technology in Taiwan.
Author Disclosure Statement
No competing financial interests exist.
References
1. Hinz O, Spann M, Hahn I-H. Can't buy me love... or can I? Social capital attainment through conspicuous consumption in virtual environments. Inf Syst Res. 2015;26:849–870.
2. Muchnik L, Aral S, Taylor S. Social influence bias: A randomized experiment. Science. 2014;341:647–651.
3. Chetty R, Friedman JN, Rockoff JE. Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. Am Econ Rev. 2014;104:2633–2679.
4. Belo R, Ferreira P, Telang R. Broadband in school: Impact on student performance. Manage Sci. 2013;60:265–282.
5. Verstrepen K, Goethals B. Top-N recommendation for shared accounts. In: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys'15, New York, NY, 2015, ACM, pp. 59–66.
6. Shmueli G. Analyzing behavioral big data: Methodological, practical, ethical and moral issues. Qual Eng. 2017;29:57–74.
7. Metcalf J, Crawford K. Where are human subjects in big data research? The emerging ethics divide. Big Data Soc. 2016;3:1–14.
8. Jackman M, Kanerva L. Evolving the IRB: Building robust review for industry research. Wash Lee Law Rev Online. 2016;72:442–457.
9. Adar E. The two cultures and big data research. I/S: J Law Policy Inf Soc. 2015;10:765–781.
10. Kramer ADI, Guillory JE, Hancock JT. Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci U S A. 2014;111:8788–8790.
11. Bapna R, Ramaprasad J, Shmueli G, Umyarov A. One-way mirrors in online dating: A randomized field experiment. Manage Sci. 2016;62:3100–3122.
12. Golbeck J. Analyzing the social web. Waltham, MA: Morgan Kaufmann, 2013.
13. Hoerl R, Snee R, De Veaux R. Applying statistical thinking to 'big data' problems. WIREs Comput Stat. 2014;6:222–232.
14. Bell RM, Koren Y, Volinsky C. All together now: A perspective on the Netflix Prize. Chance. 2010;23:24–29.
15. Allen GN, Burk DL, Davis GB. Academic data collection in electronic environments: Defining acceptable use of internet resources. MIS Q. 2006;30:599–610.
16. Gonzalez-Bailon S, Wang N, Rivero A, et al. Assessing the bias in samples of large online networks. Soc Netw. 2014;38:16–27.


17. Jank W, Shmueli G. Modeling online auctions. Hoboken, NJ: John Wiley and Sons, 2010.
18. Bender S, Jarmin R, Kreuter F, Lane J. Privacy and confidentiality. In: Big Data and Social Science Research: Theory and Practical Approaches. CRC Press, 2016.
19. Hauge M, Stevenson M, Rossmo D, Le Comber S. Tagging Banksy: Using geographic profiling to investigate a modern art mystery. J Spat Sci. 2016;61:185–190.
20. Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Proceedings of the 29th IEEE Symposium on Security and Privacy, 2008.
21. Burtch G, Ramaprasad J. Assessing and quantifying network effects in an online dating market. 2016. Available at http://ssrn.com/abstract=2832917 (last accessed June 9, 2017).
22. Griffin JM, Kruger SA, Maturana G. Do personal ethics influence corporate ethics? 2016. Available at http://ssrn.com/abstract=2745062 (last accessed June 9, 2017).
23. Grieser WD, Kapadia N, Li R, Simonov A. Fifty shades of corporate culture. 2016. Available at http://ssrn.com/abstract=2741049 (last accessed June 9, 2017).
24. Salganik MJ, Dodds PS, Watts DJ. Experimental study of inequality and unpredictability in an artificial cultural market. Science. 2006;311:854–856.
25. Watts DJ. A brief history of the virtual lab. Talk, New Directions Lecture Series, Stanford University, March 16, 2017.
26. Mason W, Suri S. Conducting behavioral research on Amazon's Mechanical Turk. Behav Res. 2012;44:647–651.
27. Mao A, Mason W, Suri S, Watts DJ. An experimental study of team size and performance on a complex task. PLoS One. 2016;11:e0153048.
28. Burtch G, Hong Y, Bapna R, Griskevicius V. Stimulating online reviews by combining financial incentives and social norms. Manage Sci. 2017, [Epub ahead of print]; DOI: 10.1287/mnsc.2016.2715.
29. Liu TX, Yang J, Adamic LA, Chen Y. Crowdsourcing with all-pay auctions: A field experiment on Taskcn. Manage Sci. 2014;60:2020–2037.
30. Stewart N, Ungemach C, Harris AJL, et al. The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgm Decis Making. 2015;10:479–491.
31. Yahav I, Shmueli G, Mani D. A tree-based approach for addressing self-selection in impact studies with big data. MIS Q. 2016;40:819–848.
32. Shmueli G, Yahav I. The forest or the trees? Tackling Simpson's paradox with classification and regression trees. Prod Oper Manage. 2017, in press.
33. Shmueli G, Koppius O. Predictive analytics in information systems research. MIS Q. 2011;35:553–572.
34. Radcliffe NJ, Surry PD. Differential response analysis: Modelling true response by isolating the effect of a single action. In: Proceedings of Credit Scoring and Credit Control VI, Credit Research Centre, University of Edinburgh Management School: Edinburgh, UK, 1999.
35. Lo V. The true lift model. ACM SIGKDD Explor Newslett. 2002;4:78–86.
36. Shmueli G, Bruce PC, Patel NR. Data mining for business analytics: Concepts, techniques, and applications with XLMiner, 3rd ed. John Wiley and Sons, 2016.
37. Lin M, Lucas H Jr., Shmueli G. Too big to fail: Large samples and the p-value problem. Inf Syst Res. 2013;24:906–917.
38. Wasserstein RL, Lazar N. The ASA's statement on p-values: Context, process, and purpose. Am Stat. 2016;70:129–133.
39. Trafimow D, Marks M. Editorial. Basic Appl Soc Psychol. 2015;37:1–2.
40. Burnham KP, Anderson DR. Model selection and multimodel inference: A practical information-theoretic approach. Springer Science & Business Media, 2003.
41. Agarwal DK, Chen B-C. Statistical methods for recommender systems. Cambridge University Press, 2016.
42. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
43. Imai K, Ratkovic M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat. 2013;7:443–470.


44. Lambert D, Pregibon D. More bang for their bucks: Assessing new features for online advertisers. In: Proceedings of the 1st International Workshop on Data Mining and Audience Intelligence for Advertising, ADKDD'07, New York, NY, USA, 2007. ACM, pp. 7–15.
45. Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
46. Heckman J. Sample selection bias as a specification error. Econometrica. 1979;47:153–161.
47. Marcus J. Amazonia: Five years at the epicentre of the dot-com juggernaut. New York: The New Press, 2014.
48. Tang D, Agarwal A, O'Brien D, Meyer M. Overlapping experiment infrastructure: More, better, faster experimentation. In: Proceedings of the 16th Conference on Knowledge Discovery and Data Mining, 2010, ACM, pp. 17–26.
49. Aral S, Walker D. Creating social contagion through viral product design: A randomized trial of peer influence in networks. Manage Sci. 2011;57:1623–1639.
50. Bapna R, Umyarov A. Do your online friends make you pay? A randomized field experiment on peer influence in online social networks. Manage Sci. 2015;61:1902–1920.
51. Fienberg S. The promise and perils of big data for statistical inference. In: Israel Statistical Association Annual Meeting, 2015.
52. Shmueli G. To explain or to predict? Stat Sci. 2010;25:289–310.
53. O'Neil C. Weapons of math destruction: How big data increases inequality and threatens democracy. New York: Crown Publishers, 2016.
54. Chau PYK, Ho SY, Ho KKW, Yao Y. Examining the effects of malfunctioning personalized services on online users' distrust and behaviors. Decis Support Syst. 2013;56:180–191.
55. Xiao B, Benbasat I. Designing warning messages for detecting biased online product recommendations: An empirical investigation. MIS Q. 2015;26:793–811.
56. Verbraken T, Verbeke W, Baesens B. A novel profit maximizing metric for measuring classification performance of customer churn prediction models. IEEE Trans Knowl Data Eng. 2013;25:961–973.
57. Bapna R, Jank W, Shmueli G. Consumer surplus in online auctions. Inf Syst Res. 2008;19:400–416.
58. Hand DJ. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Mach Learn. 2009;77:103–123.
59. Hanczar B, Hua J, Sima C, et al. Small-sample precision of ROC-related estimates. Bioinformatics. 2010;26:822–830.
60. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10:e0118432.

Cite this article as: Shmueli G (2017) Research dilemmas with behavioral big data. Big Data 5:2, 98–119, DOI: 10.1089/big.2016.0043.

Abbreviations Used
API = application programming interface
ATE = average treatment effect
BBD = behavioral big data
DIY = do it yourself
FNR = false-negative rate
FPR = false-positive rate
IBD = inanimate big data
IRB = institutional review board
PSM = propensity score matching
VA = value-added
WCAI = Wharton Customer Analytics Initiative
WMD = weapons of math destruction
