Working draft for note to the opening of GCC 2013
The Galaxy Community Conference 2013 - Some thoughts on Computational Life Science at the University of Oslo
Nils Christophersen and Geir Kjetil Sandve
Department of Informatics
University of Oslo
June 19, 2013
As we all know, life science is going through a revolution driven by the huge amounts of new data rapidly becoming available through high-throughput technologies in areas such as sequencing and imaging. This offers immense opportunities in many areas, raising expectations for new knowledge that will, for example, translate into meaningful therapeutics and insights into health (Science 2011, PMID:22144612). The University of Oslo (UiO) is committed to life science in a broad sense as one of its main areas of research in the coming years. Our strategic plan towards 2020 states: "During the strategy period, UiO will prioritise an interdisciplinary effort in the Life Sciences through academic development and infrastructure. UiO has a special potential in this area that will help achieve key research policy objectives and meet societal needs for new knowledge in health care, the environment, sustainable energy and the effects of global climate change on life and health."
To achieve this, we are, for example, working hard to obtain public funding for a large new life science building, more than twice the size of the informatics building. This is a long-term project, but one we believe will be realized.
For such an ambitious strategic initiative to succeed, UiO must strive for the highest scientific standards in all aspects of the life science under study, and Galaxy is an important part of this picture. In fact, Galaxy has been chosen as the framework on which UiO will base its development of web portals supporting our life scientists. The Bioinformatics Core Facility, shared between UiO and Oslo University Hospital, has already developed the Genomic Hyperbrowser (doi:10.1186/gb-2010-11-12-r121) on top of Galaxy, and it will be presented at the conference. User-friendly web-based portals are essential to enable life scientists to take advantage of the highly complex IT infrastructure and data processing needed to realize the potential inherent in the current deluge of data.
However, there are many challenges left before web solutions emerge that will allow life scientists to answer complex biological questions in competent ways with only a bare minimum of technological prowess. Before addressing some of these issues, most notably the questions of replicability and reproducibility of results, we need to be a bit more specific and look at the generic data flow in an experimental environment as sketched in Figure 1.
Figure 1. Generic flow of data in an experimental research environment.
Data collection is the first step in an experimental environment. It includes lab analyses and digitization, where a proper LIMS (Laboratory Information Management System) is required, at least for moderate to huge amounts of data, as well as proper labeling in the form of metadata. Data analysis and processing follow, possibly involving merging with other data sets and requiring HPC (High Performance Computing). Results are then published and should be archived in such a way that the primary data, as well as all software needed to reproduce the results, are freely and easily available. In fact, we believe that successful research in the life sciences will depend on properly managing the whole data flow as sketched in Figure 1. The chain is no stronger than its weakest link. Without this overall view, one may well put great effort into one part, for example sequencing a huge number of samples with high coverage and extensive corresponding metadata, but perform the data analyses in ways that make the computational results impossible to reproduce at a later time, even for the research group itself.
The demand for replicable and reproducible results is, of course, a cornerstone of the natural sciences and a main topic at this conference. In modern life science, this requirement represents a new challenge that has received considerable attention in recent years in prestigious journals such as Nature (most recently in the 2013 editorial "Raising standards", PMID:23598386) and Science, which devoted two special issues to it in 2011 (see Figure 2).
The editorial in the issue of December 2nd, 2011 (DOI: 10.1126/science.334.6060.1225) expresses the situation as follows:
"Replication - the confirmation of results and conclusions from one study obtained independently in another - is considered the scientific gold standard. New tools and technologies, massive amounts of data, long-term studies, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research."
Here, the term "replication" refers to the whole chain in Figure 1. The term "reproducible", on the other hand, is often used to describe the results from the main computational stage, i.e. the middle part of Figure 1. The situation here has been described in a report by Victoria Stodden et al. from a workshop held at the Institute for Computational and Experimental Research in Mathematics (ICERM) at Brown University in December 2012:
"Unfortunately the scientific culture surrounding computational work has evolved in ways that often make it difficult to verify findings, efficiently build on past research, or even to apply the basic tenets of the scientific method to computational procedures. Bench scientists are taught to keep careful lab notebooks documenting all aspects of the materials and methods they use including their negative as well as positive results, but computational work is often done in a much less careful, transparent, or well-documented manner. Often there is no record of the workflow process or the code actually used to obtain the published results, let alone a record of the false starts. This ultimately has a detrimental effect on researchers’ own productivity, their ability to build on past results or participate in community efforts, and the credibility of the research among other scientists and the public."
The overall picture that emerges from this is a serious one, and clearly something UiO must address both strategically and at the practical research level. It is interesting, though, to observe that journals seem to be of two minds: on the one hand, well-reputed journals have published editorials and comments on these issues for some years; on the other hand, papers still get published that do not live up to the desired standards (e.g. Nature 2012, doi:10.1038/483531a). But the trend is clear: standards will rise, and the only rational approach for a credible research institution is to prepare for this and not be left in a backwater.
So the question for us is: what should UiO do? One obvious measure, recommended universally, is to promote full openness of data as well as algorithms and software. This may be difficult or even impossible with regard to sensitive human data, not least in Norway, which has a very strict legal framework in this context. But let this question rest here, and consider only non-sensitive data. We think that the question of which further measures to adopt, besides promoting openness, is a critical one for the conference. Life scientists may have the best of intentions to comply with the gold standard, but they must also have the tools to achieve it. In other words, they must find compliance practical and in their own interest, even within a hectic working day. Unfortunately, UiO has some experience here. After a serious case of scientific fraud in 2006, the university developed a system in which scientists were to deposit their data, as well as descriptions of all procedures, following publication in order to be able to document honest conduct. However, this turned out to be impractical and thus impossible to impose. A challenge for the Galaxy community is the development of systems that are both practical to use and advanced enough to allow the required standards to be achieved across the whole chain in Figure 1.
Here, Galaxy is well positioned as an institutional platform for computational life science. It already contains full support for the data analysis stage, automatically tracking all aspects of the computations it carries out. The details include not only the primary and intermediate data, but also program versions and parameters. It will be of great importance to take the next logical steps and develop connections with data generation through LIMS upstream, as well as better integration with publication and archiving at the downstream end. The journal GigaScience may provide an interesting template, as it links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources.
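To make the idea of tracking program versions and parameters concrete, the following is a minimal sketch in Python of what a record for a single analysis step might contain. It is an illustration only and does not represent Galaxy's actual data model or API: the names ProvenanceRecord, run_step and to_upper are hypothetical, and the use of SHA-256 checksums to identify data is an assumption made for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one analysis step (not Galaxy's data model)."""
    tool_name: str
    tool_version: str
    parameters: dict
    input_checksums: list = field(default_factory=list)
    output_checksum: str = ""


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest used here to identify a data set."""
    return hashlib.sha256(data).hexdigest()


def run_step(tool_name, tool_version, func, inputs, **parameters):
    """Run func on the inputs and return (output, provenance record)."""
    output = func(*inputs, **parameters)
    record = ProvenanceRecord(
        tool_name=tool_name,
        tool_version=tool_version,
        parameters=parameters,
        input_checksums=[checksum(d) for d in inputs],
        output_checksum=checksum(output),
    )
    return output, record


def to_upper(data: bytes, strip: bool = False) -> bytes:
    """Toy 'tool': upper-case a sequence, optionally stripping whitespace."""
    result = data.upper()
    return result.strip() if strip else result


if __name__ == "__main__":
    output, record = run_step("to_upper", "0.1", to_upper, [b" acgt\n"], strip=True)
    # The serialized record could be archived alongside the published result.
    print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Because the record is created by the same function that runs the step, the researcher does not have to remember to document anything by hand, which is the practical property argued for above.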
Looking ahead, the immense opportunities promised by modern life science can only be fully realized by paying attention to the issues discussed here. Otherwise, there will be many false starts, and the field will be in danger of being considered over-hyped, with a possible backlash as a consequence. To approach the gold standard, or at least to establish generally accepted minimum standards, efforts are required at the level of individual researchers, scientific journals, funding agencies and scientific institutions. At UiO, we know the stakes and must continuously encourage and support good life science to the best of our abilities.