Working draft for survey of computational reproducibility
Geir Kjetil Sandve
Department of Informatics
University of Oslo
June 19, 2013
The rise of computational science has led to exciting and fast-moving developments in many scientific areas. New technologies, increased computing power, and methodological advances have dramatically improved our ability to collect complex high-dimensional data (DOI:10.1126/science.1213847). In the fields of biology and medicine, new technologies continue to produce a deluge of data of different varieties, raising expectations for new knowledge that will translate into meaningful therapeutics and insights into health (PMID:22144612). However, as exemplified by studies showing difficulties in replicating published findings (e.g. PMID:19174838), as well as a growing number of retractions of scientific papers (PMID:21186208), these fields are currently experiencing challenges with the replication and validity of published findings. On a related note, John Ioannidis created quite a stir by claiming, in a now highly cited PLoS Medicine article (PMID:16060722), that most research findings are false. Although debated (PMID:17456002), the main message - that many individual studies may show false-positive results - still seems to stand (e.g. DOI:10.1136/amiajnl-2012-000972).
That these challenges represent more than a purely academic problem is reflected by a recent rise in failures of clinical trials, which has partly been attributed to a lack of validation of pre-clinical research (PMID:21892149, 22460880). Some authors go so far as to refer to the current challenges as a "credibility crisis" of scientific research (DOI:10.1126/science.1218263, 10.1109/MCSE.2009.15). It has also been claimed that recent omics studies are particularly prone to these problems (PMID:?).
Although replicability and validity of claims have been cornerstones of science since the scientific revolution, the issue appears to have experienced renewed interest in recent years, as exemplified by two recent issues of the journal Science devoted to the sharing of data and the reproducibility of findings (issues 6018 and 6060, both 2011). Crocker and Cooper (DOI:10.1126/science.1216775) state that "replication is a cornerstone of a cumulative science", thus emphasizing the cumulative nature of science - where publication is not the end point of a research finding, but perhaps closer to a starting point (REF?). The cumulative aspect of science touches upon several components of scientific activity: first and foremost the ability to trust and build further on previous findings, but also the ability to reuse data and methodology from previous research (either one's own or others'), and the role of replication as a way to learn the state of the art and confirm that one's understanding and way of working are in line with those of other leading groups. Some go so far as to suggest that scientific papers are merely the advertisement of scholarship, whereas the computer programs, input data, parameter values, etc. embody the scholarship itself (PMID:20093459, DOI:10.1109/5992.881708).
As full replication may often be very difficult or expensive, as exemplified by e.g. field observations (PMID:22144615, 22144614), it has been suggested that reproducibility of research findings based on the original data could instead be adopted as an attainable minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study (based on fresh data) is not feasible (DOI:10.1126/science.1213847). Another way to put this is that "papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension" (PMID:20093459). Although the nature of computational procedures means that exact reproduction should be possible, too many published studies unfortunately remain irreproducible today due to a lack of sharing of the data, computer code, or software required to reproduce the study results (DOI:10.1093/bib/bbs078). There have recently been several high-profile editorials and perspectives on the importance of increasing the level of reproducibility in computational science (PMID:17339612, 22144584, 22898652, 22144613, 22499926, 22358837, 22144612), with the 2013 Nature editorial "Raising standards" (PMID:23598386) as one of the latest.
In parallel with the discussions on the nature of the reproducibility problem, and the importance of improvement, there have also been many publications on concrete strategies and solutions to improve the state of reproducibility. Two essential prerequisites for reproducibility (each of which has received a slate of high-profile publications) are the public availability of data (PMID:21310971, DOI:10.1096/fj.12-218164) and code (PMID:22358837, 23236269, 21330998). In addition, a reproducible analysis requires all parameters and program versions of the whole analysis workflow to be precisely specified (REF?). Although it is possible to track the input data and analysis scripts used to generate every final and intermediate data file, doing this manually while working on the command line is typically both very time-consuming and error-prone (DOI:10.1145/1376616.1376747). It is therefore advisable to do this in a more structured and automated manner. This can be done on the command line using e.g. makefiles, but requires a certain level of technical expertise (DOI:10.1109/SECSE.2009.5069157). A more accessible alternative is graphical interface platforms that perform automatic tracking to ensure data and analysis provenance for whole customizable workflows, available both in web-based (PMID:16169926) and desktop versions (PMID:16642009).
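To illustrate the makefile approach, the sketch below expresses a two-step analysis as make rules; every target lists the data files and scripts it depends on, so make re-runs only the steps whose inputs have changed. The file and script names are hypothetical placeholders, not taken from any cited work:

```make
# Hypothetical two-step analysis: raw data is filtered, then analyzed.
# Each rule records its inputs (data + script), making the provenance
# of every intermediate and final file explicit and the build repeatable.

all: results.tsv

# filtered.tsv is regenerated whenever the raw data or the filter script changes
filtered.tsv: raw_data.tsv filter.py
	python filter.py raw_data.tsv > filtered.tsv

# results.tsv likewise depends on both its input data and its analysis script
results.tsv: filtered.tsv analyze.py
	python analyze.py filtered.tsv > results.tsv
```

Running `make` rebuilds only what is out of date, which gives a minimal, automated form of the dependency tracking discussed above.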
With analyses performed in a reproducible manner, the next step is to integrate this information on availability and reproducibility into publications. A few alternatives are available for integrating text and reproducible analyses, such as an add-in that allows GenePattern analyses to be embedded in Microsoft Word documents (PMID:20093459), the Sweave module (DOI:10.1007/978-3-642-57489-4_89), which allows integration of R code into text documents, and Galaxy Pages, which allow Galaxy analysis workflows to be embedded in online text documents (PMID:20738864).
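As a brief illustration of the Sweave style of integration, a source document interleaves LaTeX text with R code chunks; when the document is processed, the chunks are executed and their results woven into the final text. The fragment below is a generic, minimal example with made-up data, not drawn from any of the cited works:

```latex
\documentclass{article}
\begin{document}
% An R code chunk; Sweave executes it and inserts code and output here.
<<summary, echo=TRUE>>=
x <- c(1.2, 3.4, 2.2)  # hypothetical measurement vector
summary(x)
@
% Inline result: \Sexpr{} embeds a computed value directly in the prose.
The mean of the (hypothetical) measurements was \Sexpr{mean(x)}.
\end{document}
```

Because the reported numbers are computed at compile time from the actual data, the text cannot drift out of sync with the analysis.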
When data and code are available, and the analysis is described appropriately, the next desideratum is that the code and data be of as high quality as possible. Several publications have discussed good software development practices in the context of reproducibility and reuse (PMID:20944712, arXiv:1210.0530), including advice on how data can be represented and shared in order to support reuse (DOI:10.7287/peerj.preprints.7).