On the Reporting of New Information From Open Data Sets

To the Editor:

Open data sets are becoming more common, although they remain rare on a relative basis. It has not been well defined whether it is reasonable for authors to base their work on, and subsequently submit for publication, results from an examination of a single open data set that was built by others. Is this parasitism, or are the original authors recognizing that they cannot summarize their data from all perspectives and therefore graciously making the data available for others to study? The reluctance of editors to commit space for a full-length manuscript of which much has already been published elsewhere is understandable. From an author’s perspective, it can be redundant to repeat descriptions in order to report what may simply be a few “pearls” of supplemental information.
Yet there is no reason to assume that a discovery from the public domain should be of less value than one from private property. As an example, Table 1 provides an outline for an interesting exercise of discovery using the open-source R statistical package.1 An open data set is available2 that was summarized in a widely referenced New England Journal of Medicine paper.3 It provides breast cancer data on 295 younger patients, including the results of several gene-based assays (among them the seventy-gene assay and the recurrence score). The histologic grade is also listed as an attribute. This might be a data set for use by residents or fellows who are interested in research. An assigned problem could be for a student to parallel what was shown by the Nottingham group.4 Specifically, the problem would be to examine how the seventy-gene profile (instead of Ki-67) can stratify histologic grade. Table 1 shows what is required at the R command line to find the answer. Editors and reviewers may take the view that discoveries based
on an exploration such as that outlined in Table 1 may not be acceptable for publication because they arise from data that have already been published. However, Table 1 outlines a path not reported by the original authors. And if the data indeed indicate that the value of gene-based markers is limited to stratifying intermediate-grade cancer, then such supplemental information should be of interest to pathologists. Thus, it is recommended that editors seek a way to accommodate new information derived from “mining” open data sets. One suggestion would be brief communications. Print space would be saved, as there would be no need to repeat much of what has already passed peer review (through the original article). That original authors, and the relevant journals, admirably allow a data set to be open can be construed, at least in part, as a hope that their work will be expanded upon. Open data sets are regarded as a positive trend,5 and it may be time to prepare guidelines on how best to report new insights from “old” data.
TABLE 1. R Commands Needed to Build Survival Curves and Test for Statistical Significance by the Log-Rank Method
R Command
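
As a rough illustration of the kind of exploration described in this letter, and not a reproduction of the commands in Table 1, the following minimal R sketch builds survival curves and applies the log-rank test within one histologic grade. It assumes the `survival` package, which is distributed with standard R installations. The data frame `patients`, its column names (`years`, `death`, `grade`, `gene70`), and every value in it are hypothetical stand-ins invented for the example; they are not taken from the open data set.

```r
library(survival)  # provides Surv(), survfit(), and survdiff()

# Hypothetical toy data: names and values are invented for illustration only.
patients <- data.frame(
  years  = c(2.1, 3.4, 4.0, 5.2, 6.8, 7.5, 9.0, 10.3, 11.1, 12.0, 12.5, 13.0,
             8.0, 14.0),                      # follow-up time in years
  death  = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
             1, 0),                           # 1 = died, 0 = censored
  grade  = c(rep("intermediate", 12), "poor", "well"),
  gene70 = c(rep("poor", 6), rep("good", 6), "poor", "good")
)

# Restrict the analysis to a single histologic grade.
inter <- subset(patients, grade == "intermediate")

# Kaplan-Meier survival curves, one per seventy-gene class.
fit <- survfit(Surv(years, death) ~ gene70, data = inter)
plot(fit, lty = 1:2, xlab = "Years of follow-up", ylab = "Proportion surviving")
legend("bottomleft", legend = names(fit$strata), lty = 1:2)

# Log-rank test comparing the two curves.
lr <- survdiff(Surv(years, death) ~ gene70, data = inter)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(lr)
print(p_value)
```

Repeating the same steps for each grade in turn would show whether the gene-based classes separate the survival curves in all grades or only in some of them, which is the question the proposed exercise poses.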