Collecting a Citizen's Digital Footprint for Health Data Mining

Oguzhan Gencoglu*1, Heidi Similä**2, Harri Honko*3, Minna Isomursu***4

Abstract: This paper describes a case study on collecting digital footprint data for the purpose of health data mining. The case study involved 20 subjects residing in Finland who were instructed to collect data from registries that they considered useful for understanding their current or past health or health behaviour. Eleven subjects were active, sending a total of 100 data requests to 49 distinct organizations. Our results indicate that there are still practical challenges in collecting actionable digital footprint data. Our subjects received a total of 75 replies (reply rate of 75.0%) and 61 datasets (reception rate of 61.0%). Of the received datasets, 44 (72.1%) were delivered in paper format, 4 (6.6%) in portable document format and 13 (21.3%) in structured digital form. The time between sending an information request and receiving a reply was 26.4 days on average.

*Department of Signal Processing, Tampere University of Technology, Tampere, Finland (authors 1 and 3)
**VTT Technical Research Centre of Finland Ltd, Oulu, Finland (author 2)
***Information Processing Science, University of Oulu, Oulu, Finland (author 4)

978-1-4244-9270-1/15/$31.00 ©2015 IEEE

I. INTRODUCTION

In contemporary society, our everyday activities create increasingly detailed digital traces scattered across databases managed by different organizations. This collection of digital records of personal data can be called a "digital footprint", i.e. the trail of digital information created about us and by our actions [1]. Digital footprints can be categorized into two groups, active and passive. The former refers to stored data that is deliberately shared by the individual, while the latter refers to data collected without the individual's knowledge. In either case, when analyzed properly, digital footprints can tell a lot about the behaviour, characteristics and preferences of an individual [2] [3] [4] [5] [6], provided they are accessible in a digitally digestible, machine-readable form. Increasingly, data sets, whether open or closed, are being made available over an application programming interface (API). Where accessible, a person's digital footprint is used today, for example, for personalized recommendation services, for advertising targeted by person, income and even location context [7], and for time management [8] [9].

It has been proposed that digital footprint data, when properly gathered and analyzed with modern data analytics, could provide significant opportunities for new, more personalized and timely health services. Aggregated and analyzed data can help individuals themselves learn about their health condition [10] [11]. Better access to electronic health records can help communication between carers, health professionals and other service providers [12]. This can create opportunities for entirely new kinds of health and wellbeing services, which in turn create new business opportunities for companies and help increase the efficiency of health interventions through targeted care.

In this paper, we examine the state of the practice of collecting the personal digital footprint of a citizen of the 2010s for the purpose of health data mining. Our research question is "Can the digital footprint of an individual be collected successfully today for health data mining?". We approach the research question through a case study. For the purpose of the study, volunteer individuals were instructed to make formal information requests to organizations that they considered relevant controllers of personal data bearing on their health. The choice of organizations was left to the participants. The individual's right to check his or her own data in a data controller's registry [13] was used as a supporting "lever" to maximize the number of responses. All research subjects resided in Finland.

Our results summarize how successful our case subjects were in collecting their digital footprint data, i.e. whether the organizations gave them access to their personal footprint data, in what format the data was presented to them, and roughly what procedures would be needed to make that data actionable, so that anyone attempting to refine and analyze it for insights and health-related value could use it for computerized health data mining. Our discussion summarizes our experience and suggests further work on how such data can be examined to reveal health behaviour patterns.

II. METHODOLOGY

A total of 20 volunteer participants were recruited from among the researchers active in this study and their contacts. A uniform information request form and a covering letter describing the study were prepared.
The participants were instructed to print, sign and mail the information request with the covering letter to 5-10 target organizations of their own choice. A preliminary list of candidate sources of digital footprint information was collected to serve as an example for the participants, although they were instructed to decide for themselves which data sources could be valuable for health data analytics. The focus was on covering as many different representative registry data sets as possible rather than on the longitudinal extent of the data. To follow the process, the participants kept a record of the dates when the information requests were sent, the dates when the replies were received, and the format of each reply. The data was requested to be delivered to each participant's home address or email. The information request form stated that delivery via an API, a memory stick or a DVD was preferred over printed paper documents.

After receiving the data, the participants were instructed to go through it and decide which representative set of each individual register's data they were willing to donate to the research program. Sensitive personal information was removed or redacted when needed. Each participant signed an informed consent form when handing over the data.

III. RESULTS AND DISCUSSION

The study had 20 voluntary participants (18 natives, 2 foreigners), all residing in Finland. Eleven (55.0%) individuals were active during a period of five months (11/2014-03/2015), sending 100 information requests (9.09 per person) to 49 distinct data sources (2.04 requests per source) in total. Based on their content, the researchers classified these data sources into 15 categories: banking, education, energy, fitness, groceries, healthcare, housing, insurance, library, mobility, municipality, police, retail, telecommunication and web. The average number of distinct data sources and of sent requests per category was 3.27 and 6.67, respectively. The healthcare category had both the largest number of distinct data sources and the largest number of sent requests, with 30 requests to 13 data sources. For each category, the number of data sources, sent requests, received replies and replies resulting in access to data is summarized in Table I.
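The averages quoted above follow directly from the per-category counts in Table I. As a minimal sketch, the arithmetic can be reproduced as follows; the dictionary layout and the function name `summarize` are illustrative assumptions and not part of the study's tooling.

```python
# Per-category counts from Table I: (distinct data sources, requests sent).
TABLE_I = {
    "banking": (7, 17),
    "education": (2, 2),
    "energy": (4, 5),
    "fitness": (3, 3),
    "groceries": (3, 13),
    "healthcare": (13, 30),
    "housing": (1, 1),
    "insurance": (5, 6),
    "library": (2, 3),
    "mobility": (2, 2),
    "municipality": (1, 2),
    "police": (1, 1),
    "retail": (1, 2),
    "telecom": (2, 9),
    "web": (2, 4),
}
ACTIVE_PARTICIPANTS = 11  # participants who actually sent requests


def summarize(table, active_participants):
    """Return the summary figures quoted in the text (hypothetical helper)."""
    total_sources = sum(sources for sources, _ in table.values())
    total_requests = sum(requests for _, requests in table.values())
    n_categories = len(table)
    # Category with the largest number of sent requests.
    busiest = max(table, key=lambda category: table[category][1])
    return {
        "total_sources": total_sources,
        "total_requests": total_requests,
        "requests_per_participant": round(total_requests / active_participants, 2),
        "requests_per_source": round(total_requests / total_sources, 2),
        "sources_per_category": round(total_sources / n_categories, 2),
        "requests_per_category": round(total_requests / n_categories, 2),
        "busiest_category": busiest,
    }


summary = summarize(TABLE_I, ACTIVE_PARTICIPANTS)
print(summary)
```

With the Table I counts, this yields 9.09 requests per active participant, 2.04 requests per data source, and 3.27 data sources and 6.67 requests per category, with healthcare as the category receiving the most requests.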

Significant metrics for evaluating such data collection processes are the response rate (percentage of requests answered) and the data reception rate (percentage of requests answered with data or with instructions for reaching the data). As not all responses contain or lead to data, the latter is less than or, in an ideal situation, equal to the former. The overall response rate and data reception rate of the study were 75.0% and 61.0%, respectively. Both measures vary between categories. For instance, all of the requests to energy-related data sources resulted in data, while only a small portion of the requests to telecommunication companies did.

TABLE I
SUMMARY STATISTICS OF INFORMATION REQUESTS

Category      Data Sources  Requests  Reply Received  Data Received
banking                  7        17  15 (88.2%)      10 (58.8%)
education                2         2  1 (50.0%)       1 (50.0%)
energy                   4         5  5 (100.0%)      5 (100.0%)
fitness                  3         3  2 (66.7%)       2 (66.7%)
groceries                3        13  12 (92.3%)      12 (92.3%)
healthcare              13        30  23 (76.7%)      17 (56.7%)
housing                  1         1  0 (0.0%)        0 (0.0%)
insurance                5         6  3 (50.0%)       2 (33.3%)
library                  2         3  3 (100.0%)      3 (100.0%)
mobility                 2         2  0 (0.0%)        0 (0.0%)
municipality             1         2  1 (50.0%)       1 (50.0%)
police                   1         1  1 (100.0%)      0 (0.0%)
retail                   1         2  2 (100.0%)      2 (100.0%)
telecom                  2         9  4 (44.4%)       3 (33.3%)
web                      2         4  3 (75.0%)       3 (75.0%)
Total                   49       100  75 (75.0%)      61 (61.0%)

As the main purpose of a digital footprint collection process is eventually to perform data analysis on each individual's data, the amount of collected data has a great effect on the performance of the analysis. The statistical power of inferential analysis depends strongly on the amount of data. Similarly, state-of-the-art predictive analysis algorithms (e.g. deep learning) are data-driven, and their performance improves with more data. Thus, a high data reception rate is of utmost importance for any such data collection process.

The format of the collected data is also crucial for the analysis to be conducted properly. Even though more than half of the data sources provided some data to the individuals, in most cases the returned data was not in an analysis-friendly format, and often not even digitized. The format of the delivered data can be categorized into three groups: paper format (hard copy), portable document format (PDF) and spreadsheet/structured format, which includes formats such as comma-separated values (CSV), Microsoft Excel file formats (XLS/XLSX) and JavaScript Object Notation (JSON). The groups are listed from least to most analysis-friendly. A detailed view of the format of the collected data for the different categories is given in Table II. Hard copy, i.e. paper format, accounts for the majority of the collected data with 72.1%. Only 21.3% of the collected data can be considered structured. None of the data sources offered an API for such data ingestion.

TABLE II
SUMMARY STATISTICS OF INFORMATION REQUEST FORMATS

Format      Per-category counts (non-empty cells, in category order)  Total
Paper       8, 1, 1, 7, 16, 2, 3, 1, 2, 3                             44 (72.1%)
PDF         2, 1, 1                                                   4 (6.6%)
Structured  4, 1, 5, 3                                                13 (21.3%)
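The two metrics defined in this section, together with the format shares, are simple ratios over the reported counts. The following is a minimal sketch of that computation; the names `COUNTS`, `FORMATS`, `pct` and `rates` are illustrative assumptions, and the counts are those reported in Tables I and II.

```python
# Per-category counts from Table I:
# (requests sent, replies received, replies that gave access to data).
COUNTS = {
    "banking": (17, 15, 10), "education": (2, 1, 1), "energy": (5, 5, 5),
    "fitness": (3, 2, 2), "groceries": (13, 12, 12), "healthcare": (30, 23, 17),
    "housing": (1, 0, 0), "insurance": (6, 3, 2), "library": (3, 3, 3),
    "mobility": (2, 0, 0), "municipality": (2, 1, 1), "police": (1, 1, 0),
    "retail": (2, 2, 2), "telecom": (9, 4, 3), "web": (4, 3, 3),
}
# Delivery format totals from Table II.
FORMATS = {"paper": 44, "pdf": 4, "structured": 13}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def rates(counts):
    """Map each category to (response rate, data reception rate) in percent."""
    result = {}
    for category, (requests, replies, with_data) in counts.items():
        # A reply with data is still a reply, so the data reception rate
        # can never exceed the response rate.
        assert with_data <= replies <= requests
        result[category] = (pct(replies, requests), pct(with_data, requests))
    return result


per_category = rates(COUNTS)
total_requests = sum(requests for requests, _, _ in COUNTS.values())
total_replies = sum(replies for _, replies, _ in COUNTS.values())
total_with_data = sum(with_data for _, _, with_data in COUNTS.values())
overall = (pct(total_replies, total_requests), pct(total_with_data, total_requests))
format_share = {name: pct(n, sum(FORMATS.values())) for name, n in FORMATS.items()}

print(overall)
print(format_share)
```

For these counts the overall rates are 75.0% and 61.0%, and the format shares are 72.1% paper, 6.6% PDF and 21.3% structured. For the banking category, 15 replies and 10 datasets out of 17 requests give 88.2% and 58.8%.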

When the process of transforming non-analysis-friendly data into an analysis-friendly form is considered, the drawbacks become more obvious. First of all, data delivered in paper format has to be printed and mailed, which comes at a cost. As an individual can easily have hundreds of pages of data residing in several data sources, problems of logistics, security and storage arise. The data then has to be digitized by the recipient, for example by scanning. Such a process is not only burdensome but also error-prone. After digitization, the data is in the form of PDF files or digital images, which have to be fed into an optical character recognition (OCR) algorithm. As paper-form data is likely to contain artifacts (lines, logos, bright or dark spots due to scanning, irrelevant text, folded or torn parts) that act as noise for the OCR system, the likelihood of error increases. Furthermore, the OCR system has to be tuned specifically for the structure of the text on the paper; thus, parsing the relevant information becomes even more demanding. In addition, as there is no guarantee that the data source will deliver the data on paper in the same format in the future, such tasks are discouraged with respect to the reproducible research paradigm. As the data sources in all likelihood already hold each individual's data in a structured electronic format, the abovementioned process can be considered a redundant and fallible reverse engineering procedure.

Fig. 1. Data registry categories, delivery formats and reply times.

Another interesting aspect of the data collection process is the swiftness of the data sources, i.e. how quickly each registry replies to the requests. Of the requests, 56 have both sending and reply dates recorded. On average, a reply (providing data or not) took 26.4 days to arrive. Average reply times for the different categories are given in Table III. The average durations for the data registries with a small number of recorded times are given for the sake of completeness rather than for inference. The average reply time for requests resulting in data reception was 29.6 days, while replies that did not provide data arrived in 14.8 days on average.

TABLE III
SUMMARY STATISTICS OF REPLY TIMES TO INFORMATION REQUESTS

Category      Number of Recorded Reply Times  Average Duration (days)
banking                                   11  22.3
education                                  1  6.0
energy                                     4  44.8
fitness                                    2  4.5
groceries                                  9  43.0
healthcare                                16  23.0
housing                                    -  -
insurance                                  3  36.3
library                                    1  20.0
mobility                                   -  -
municipality                               1  37.0
police                                     1  7.0
retail                                     2  44.0
telecom                                    2  12.0
web                                        3  0.0
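The overall average of 26.4 days is a mean over individual replies, so it corresponds to the category averages in Table III weighted by the number of recorded reply times, not to a plain mean of the category averages. The sketch below checks this. The category labels assume that the two categories without any replies (housing and mobility) are the ones without recorded times, and the variable names are illustrative.

```python
# (number of recorded reply times, average reply time in days), from Table III.
# Assumption: housing and mobility received no replies and are therefore
# the two categories absent from the table's value columns.
TABLE_III = {
    "banking": (11, 22.3), "education": (1, 6.0), "energy": (4, 44.8),
    "fitness": (2, 4.5), "groceries": (9, 43.0), "healthcare": (16, 23.0),
    "insurance": (3, 36.3), "library": (1, 20.0), "municipality": (1, 37.0),
    "police": (1, 7.0), "retail": (2, 44.0), "telecom": (2, 12.0),
    "web": (3, 0.0),
}

# Total number of replies with both sending and reply dates recorded.
total_recorded = sum(n for n, _ in TABLE_III.values())

# Mean over individual replies: weight each category average by its count.
weighted_mean = sum(n * days for n, days in TABLE_III.values()) / total_recorded

# Mean over categories: every category counts equally, regardless of size.
unweighted_mean = sum(days for _, days in TABLE_III.values()) / len(TABLE_III)

print(total_recorded, round(weighted_mean, 1), round(unweighted_mean, 1))
```

The counts sum to the 56 recorded replies and the weighted mean rounds to 26.4 days, matching the figure in the text. The plain mean over categories is about 23.1 days, which is why averages for categories with only one or two recorded times should not be over-interpreted.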

An overall view of the data source categories, the formats of the provided data and the duration of responses is visualized in Figure 1. Each circular segment on the left half of the image represents a data registry category. The size of each segment is proportional to the number of data requests sent in that category. Green and red parts represent, for each source, the data requests that did and did not result in the return of data, respectively. The right-hand half corresponds to the delivered data formats. The links between data source categories and data format categories show the formats delivered within each category. Furthermore, reply times are visualized on the outer circle for each category, with a thick line representing the mean reply duration for that data source.

IV. CONCLUSION

One's behaviour is reflected in one's actions, and those actions are recorded in great amounts in today's world as a digital footprint. As advancing data mining algorithms enable efficient harmonization of multi-modal data for inferential, predictive and even causal analysis of people's behaviour, these digital footprints are of considerable value for health data mining purposes. An expected rise in the demand for personal data from various data registries is likely to change the current state of the information retrieval process presented in this paper.

Our results show that the utilization of digital footprints in services currently faces practical challenges. Companies and institutions in control of individuals' data are not responsive or attentive to the emerging value of the digital footprint. Even in the Finnish context, where individuals have a legal right to access their personal data, many organizations ignored the request or refused access to the data. Very few provided data in a format that could be easily digested by digital tools. Providing high-quality data to cutting-edge data mining and machine learning systems is essential for high-performance predictive analysis, health behavioural modeling and personalized services. To achieve this goal, controlled and secure data access via service web portals or, even better, through machine-readable APIs is needed.

This paper does not yet address the actual health-relatedness of the received data. It reflects the citizens' voluntary selection of registries when freely considering possible connections to their health, and the attitude of these registries towards such data requests. Our work continues with an exploration of the collected datasets in terms of validity, suitability and information value for health data mining, leading to an in-depth analysis of how the digital footprint can be used in health services.

ACKNOWLEDGMENT

This research has been supported by a grant from Tekes, the Finnish Funding Agency for Innovation, as part of the Digital Health Revolution programme.

REFERENCES

[1] A. Sellen, Y. Rogers, R. Harper, and T. Rodden, “Reflecting human values in the digital age,” Communications of the ACM, vol. 52, no. 3, pp. 58–66, 2009.
[2] World Economic Forum, “Rethinking personal data: Strengthening trust,” 2012.
[3] D. Zhang, B. Guo, B. Li, and Z. Yu, “Extracting social and community intelligence from digital footprints: an emerging research area,” in Ubiquitous Intelligence and Computing. Springer, 2010, pp. 4–18.
[4] C. Moiso and R. Minerva, “Towards a user-centric personal data ecosystem: the role of the bank of individuals’ data,” in Intelligence in Next Generation Networks (ICIN), 2012 16th International Conference on. IEEE, 2012, pp. 202–209.
[5] A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida, “Studying user footprints in different online social networks,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, 2012, pp. 1065–1070.
[6] N. Eagle and A. Pentland, “Reality mining: sensing complex social systems,” Personal and Ubiquitous Computing, vol. 10, no. 4, pp. 255–268, 2006.
[7] M. Venkataramanan, “My identity for sale,” http://www.wired.co.uk/magazine/archive/2014/11/features/my-identity-for-sale/viewall, accessed: 2015-03-27.
[8] “Mac basics: Notifications keep you informed,” https://support.apple.com/en-lb/HT204079, accessed: 2015-03-27.
[9] “Google Now,” https://www.google.com/landing/now/, accessed: 2015-03-27.
[10] J. H. Frost and M. P. Massagli, “Social uses of personal health information within PatientsLikeMe, an online patient community: what can happen when patients have access to one another’s data,” Journal of Medical Internet Research, vol. 10, no. 3, 2008.
[11] S. Kumar, W. Nilsen, M. Pavel, and M. Srivastava, “Mobile health: Revolutionizing healthcare through transdisciplinary research,” Computer, no. 1, pp. 28–35, 2013.
[12] C. Pagliari, D. Detmer, and P. Singleton, “Potential of electronic personal health records,” BMJ: British Medical Journal, vol. 335, no. 7615, p. 330, 2007.
[13] “Finnish legislation - Personal Data Act, 523/1999,” translation completed: 2001-03-31.

