Simon Robin Evans published an article in PLOS Biology on April 8th, 2016, "Gauging the Purported Costs of Public Data Archiving for Long-Term Population Studies". The work illustrates some of the problems that emerge when even well-meaning scientists jump into the data sharing debate.

The article is a review of one element of the debate over the reuse of data as a threat to original research, part of the "research parasite" pejorative that had a brief moment of fame before the NEJM backed away from the stance. Evans seeks to examine and ultimately to negate the concerns raised by Mills et al. (2015), among others, regarding the negative aspects of data sharing in the physical sciences. The straw-man excuses against sharing include fears of being scooped, claims that no one else can understand the data, the overuse of a single dataset, and a variety of other stereotypes that have long since been disproved by archival research on the use of secondary data.

In general, the author's perspective on the benefits of data sharing is well reasoned, and his support for the importance of data sharing is commendable. The concerns about the article itself revolve around the approach and the conclusions drawn from a very narrow analysis of a limited number of studies from an archival storage facility with little penetration into the research community that engages in secondary analysis. Evans argues for data storage and the importance of providing a means to share data; both are admirable. Unfortunately, the approach taken and the conclusions drawn are misleading because of the choice of research site.

The author employs data archived by the Dryad Digital Repository, a consortium-based data storage model that allows data from publications to be stored under a variety of access rules that potentially encourage reuse. Analyzing the use of a selected set of 72 datasets from Dryad, the author draws the following conclusion: "I found no evidence that PDA results in reuse of data by independent parties, suggesting the purported costs of PDA for long-term population studies have been overstated." This crystallization of what the author learned from the analysis is the core of the problem with the article.

While it may not have been the author's purpose to make non-use the central takeaway point, it is the point that stands out. Of all the potential arguments for why data sharing is safe and beneficial to science, the idea that "no one will use your data" if you put it in a data storage facility such as Dryad is one of the most harmful to the broader acceptance of secondary analysis among researchers outside the BSR arena.

In general, Evans' arguments about the importance of data sharing are correct, and the author's dismissal of the usual excuses for not sharing data represents good practice. Unfortunately, his research approach and his conclusions about use are based upon a faulty understanding of the difference between data, data extracts, and metadata. The information resources maintained by repositories such as Dryad largely represent data extracts: summaries of information that allow for the generation of the tables and charts used in a specific publication. In this regard, they are an improvement over the kind of metadata used in meta-analysis, generally the tables and charts extracted from primary publications. Not all data in Dryad is this narrow, but this is the most common type of deposit. This kind of information, while potentially interesting, is limited in application. It might be used in replicating a study, for example, but that is little different from replicating primary publication material, as it is not "data" in the sense understood from an archival science perspective.

When an archivist talks about data, we are generally talking about origin data: the original and complete data that underlie the analysis in question. This misunderstanding regarding the nature of data is common in the biological and biomedical sciences where, often, the data is equated with the publication itself. From the perspective of the investigator depositing these supplemental publication files, the nice thing about metadata and data extracts is that they will generally replicate the findings of the primary publication. The sharing of origin data is more problematic, as it has not been organized, recoded, and framed in a way that facilitates any one specific analysis. Consequently, origin data is often seen as more "dangerous": analysis is often idiosyncratic to the researcher, and coding decisions can impact findings in ways that make replication more difficult.

The provision of origin data, all of the origin data, is also seen as problematic because secondary analysts can perform new and independent studies using the data. Generally, this is seen as healthy for science, generating new knowledge and increasing the return on investment in a research study. Some investigators, however, see it as a threat, offering the opportunity for someone else to publish the study they want to do but have simply not yet gotten around to. We all get "scooped"; it is the nature of scientific inquiry. The best way to avoid this is to selfishly withhold access to a dataset. The more common approach is to maintain control of a dataset and require a co-author credit, a common and widely accepted behavior in the biomedical community. This issue underlies the mistaken idea of "ownership," which Evans argues against in the body of his argument, making several good points, particularly regarding the generational nature of ownership in long-term studies.

Data sharing inherently involves a certain degree of risk, but it also presents significant benefits to the data provider. Reuse should result in new and original findings: new publications, new grants, and new collaborations. A properly managed data archive will seek to ensure that the creator of a dataset is properly cited and acknowledged for providing access to the origin data when it is used for secondary analysis. This will often involve a formal citation that gives credit to the investigator for the data used in a publication. It does not imply that the secondary analyst has any obligation to share credit for their original research, merely that they acknowledge the contribution of the data collector to the development of new research. Increasingly, funded research is being judged on its use by the investigator and by independent others for new discoveries and new science. This can only be achieved through the broad sharing of well-documented origin data. Metadata sites such as Dryad are relatively new services that serve an important purpose in the research community, as they can increase metadata granularity for validation studies, but their primary purpose is not to encourage reuse or new discovery using these data. They are uncurated storage facilities that meet the emerging requirement among many journals that the data used in a publication be stored in an open-access environment.

This distinction between repositories that focus on uncurated data extracts and repositories that support fully curated origin data is an important one. Both serve an important function in the science community, but their functions and missions are not directly comparable. The issues that Evans is attempting to address in his article are not ones appropriately examined using the kind of information maintained in the Dryad Repository. Article-based data is preserved primarily as a tool for validation of published results, an additional potential check for accuracy. There are issues here as well, as there is no general requirement to provide the source code used to perform the analysis that generates the results. Raw data extracts alone are an imprecise tool for replicating published results, the ongoing controversies in psychology being an example of this concern. While much of the data in publication repositories such as Dryad serves an important purpose, it has a very narrow focus. Consequently, there are only limited ways in which the provided data can be employed in independent analysis. This in turn limits use, which is the most likely explanation for Evans' finding.

Another important issue is the support and promotion of data resources. Just as Dryad data are self-archived, they also depend upon self-promotion if the investigator wants to encourage use. The application of a DOI to the deposit does make it discoverable, but the interested user has to know what to look for in order to make the search. For successful reuse to occur, the data resource needs to be actively and aggressively publicized. This is a task that often falls to the repository, as the depositor has often moved on to new avenues of research after depositing their data. Lack of promotion of a dataset means lack of use of a dataset. Making a resource discoverable is only the first step in successful data sharing; promoting and encouraging use is by far the most important and time-consuming activity. This kind of active promotion is not in the mission of most publication data repositories, and it is often not seen as being in the best interest of the investigator who deposited the data. Data sharing in the physical sciences remains a reluctant avocation, as it has not been properly incentivized to seem beneficial to the data collector. As Evans accurately notes, until journals began to require the deposit of the data associated with a publication, there was little interest on the part of scientists in voluntarily making their data available. Lacking incentives and responding to unwanted pressure, reluctant authors are unlikely to put more than the required minimum into a pay-for-storage repository such as Dryad.

This summarizes the concerns. We have a publication-based data repository with no specific review of data acquisitions, no formatting standards, and unstructured supporting documentation. We have an associated DOI, but no clear promotion or support services to publicize the availability of a data resource. You can find the data easily enough, but you first have to know it exists. From the perspective of an archivist, this is a clear path to non-use of a resource. It is a way to store data, to keep it safe, but it is not a model that encourages reuse. Knowing this, the conclusion that Evans draws is inaccurate, and more importantly it provides support for researchers who seek rationales to minimize data sharing. This is clearly not Evans' intent, but it is an outcome of his publication.

The idea that we do not have to worry about the potential risks of data sharing because no one uses data put in repositories is an artifact of the repository Evans chooses to use in his analysis. His stance on the importance of sharing is admirable, and his open discussion of the weaknesses of the arguments put forth by those reluctant to share data is philosophically correct. It is consistent with what we have learned in the decades that archival scientists have worked with BSR data resources. His error is attempting to study the problem using the wrong kind of data resource, and as a consequence he draws the wrong conclusion. The lack of reuse of Dryad data he finds in his analysis is an artifact of the archival model used by publication-driven repositories. It is not due to any lack of interest in the reuse of well-curated research data. With a better resource, Evans might find a very different result.

The statement "I found no evidence that PDA results in reuse of data by independent parties, suggesting the purported costs of PDA for long-term population studies have been overstated" is not only an unsupported generalization regarding the nature and motivations of secondary analysis; it is also contrary to what we know about the growing acceptance of, and value science places on, secondary analysis of data. The Evans publication has already been cited in a number of news reports, and the lack of use of the archived data is always the takeaway point. In a tight funding environment, the last thing a researcher wants to report to their funding agency is that "no one used my data." Increasingly, at organizations such as NIH, the lack of reuse is seen as a concern, if not a rationale for ending a project.

What we need to focus on are the benefits of data reuse for secondary analysis. The more publications, the more grants, the more original research that emerges from a properly curated and shared secondary resource, the more value the data has to the broader research community. Lack of use is as much a failure of the data sharing process as is the refusal to share. In both cases the data is essentially dead.

Most, if not all, of the problems in the Evans article can be traced to the ongoing perspective of the physical and biomedical sciences, which philosophically see the publication as the center of the research process. This is an antiquated and increasingly outmoded way of looking at research and its deliverables. Publications emerge from data, and it is the origin data that sits at the center of the research process. If we focus on the original publication as the key and then try to expand from that focus, we should expect findings such as those reported in the analysis. All Evans found was an artifact of the repository model used by Dryad.

The concerns offered by members of the research community who do not want to share their data are well established. The objections are a broken record, repeating a litany of fears already resolved by a research literature of tens of thousands of publications across disciplines that use secondary analysis of existing data. There is no need for complex content analysis to disprove what has already been dismissed by the majority of scientists across disciplines. To encourage more data sharing, we do not need to argue with the holdouts; we need to show and emphasize the successful reuse of existing data resources. This is how the culture of data sharing changed in fields such as epidemiology, and it is how it is now changing in psychology. It will be how we finally gain more open access to clinical trials data and to data in other scientific traditions that currently lack a culture promoting data sharing.

While we applaud the motivations and philosophical stance that Simon Robin Evans puts forth in his recent publication in PLOS Biology, his analysis approach and his attempt to operationalize data reuse are flawed and unfortunately lead to an incorrect conclusion. Data sharing is not a problem; it universally improves knowledge and makes science better. The idea that data sharing should not be a concern because no one will use data placed in a repository is simply wrong.

 

Evans SR (2016) Gauging the Purported Costs of Public Data Archiving for Long-Term Population Studies. PLoS Biol 14(4): e1002432. doi:10.1371/journal.pbio.1002432

Mills JA, Teplitsky C, Arroyo B, Charmantier A, Becker PH, Birkhead TR, et al. (2015) Archiving primary data: solutions for long-term studies. Trends Ecol Evol 30(10): 581–589. doi:10.1016/j.tree.2015.07.006
