On the reporting of new information from open data sets.

LETTERS

To the Editor: Open data sets are becoming more common, although they remain rare on a relative basis. It has not been well defined whether it is reasonable for authors to base their work on, and subsequently submit for publication, results from an examination of a single open data set that was built by others. Is this parasitism, or do the original authors recognize that they cannot summarize their data from all perspectives and therefore graciously make the data available for others to study? The reluctance of editors to commit space to a full-length manuscript, much of which has already been published elsewhere, is understandable. From an author’s perspective, it can be redundant to repeat descriptions in order to report what may simply be a few “pearls” of supplemental information.

Yet there is no reason to assume that a discovery from the public domain should be of less value than one from private property. As an example, Table 1 provides an outline for an interesting exercise of discovery using the open source R statistical package [1]. An open data set is available [2] that was summarized in a widely referenced New England Journal of Medicine paper [3]. It provides breast cancer data on 295 younger patients, including the results of several gene-based assays (among them the seventy gene assay and the recurrence score). The histologic grade is also listed as an attribute. This might be a data set for use by residents or fellows who are interested in research. An assigned problem could be for a student to parallel what was shown by the Nottingham group [4]. Specifically, the problem would be to examine how the seventy gene profile (instead of Ki-67) can stratify histologic grade. Table 1 shows what is required at the R command line to find the answer.

Editors and reviewers may regard discoveries based on an exploration such as that outlined in Table 1 as unacceptable for publication because they arise from data that have already been published. However, Table 1 outlines a path not reported by the original authors. And if, indeed, the data indicate that the value of gene-based markers is limited to stratifying intermediate-grade cancer, then such supplemental information should be of interest to pathologists. Thus, it is recommended that editors seek a way to accommodate new information derived from “mining” open data sets. One suggestion is brief communications. Print space would be saved, as there would be no need to repeat much of what has already passed peer review (through the original article). That original authors and the relevant journals admirably allow a data set to be open can be construed, at least in part, as a hope that their work will be expanded upon. Open data sets are regarded as a positive trend [5], and it may be time to prepare guidelines on how best to report new insights from “old” data.

TABLE 1. R Commands Needed to Build Survival Curves and Test for Statistical Significance by the Log Rank Method

R Command | Explanatory Note

njm
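The kind of exercise that Table 1 outlines (survival curves for a gene profile within one histologic grade, followed by a log rank test) might be sketched as below. This is only an illustrative sketch and not the commands of Table 1: the data frame `pts` is simulated, and the column names `grade`, `sig70`, `time`, and `event` are hypothetical stand-ins for whatever the open data set actually uses. The built-in difference between the simulated groups is arbitrary and says nothing about the real data. The sketch assumes the `survival` package, which is distributed with R as a recommended package.

```r
library(survival)  # provides Surv(), survfit(), and survdiff()

# Simulated stand-in for the open data set (295 patients).
# All column names and values are hypothetical, not the real data.
set.seed(1)
n <- 295
pts <- data.frame(
  grade = sample(1:3, n, replace = TRUE),               # histologic grade
  sig70 = sample(c("good", "poor"), n, replace = TRUE)  # seventy gene profile
)

# Arbitrary simulated follow-up: a higher event rate for the "poor" profile.
rate    <- ifelse(pts$sig70 == "poor", 0.10, 0.04)
t_event <- rexp(n, rate)
t_cens  <- runif(n, 2, 18)                   # censoring time, in years
pts$time  <- pmin(t_event, t_cens)
pts$event <- as.integer(t_event <= t_cens)   # 1 = event observed, 0 = censored

# Restrict to intermediate-grade tumors, then stratify by the gene profile.
g2 <- subset(pts, grade == 2)

# Kaplan-Meier curves, one per profile group.
fit <- survfit(Surv(time, event) ~ sig70, data = g2)
plot(fit, lty = 1:2, xlab = "Years", ylab = "Proportion event-free")
legend("bottomleft", legend = names(fit$strata), lty = 1:2)

# Log rank test (the default, rho = 0, in survdiff).
lr <- survdiff(Surv(time, event) ~ sig70, data = g2)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(lr)
print(p_value)
```

Repeating the last two steps with `grade == 1` and `grade == 3` would show whether the profile separates outcomes in the other grades as well, which is the comparison the letter proposes as a student exercise.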
