

We organized a one-day workshop on Use Cases for Provenance in eScience, on April 20, 2009, as part of an eScience Institute Thematic Program on Principles of Provenance.

The goal of the workshop is to identify usage scenarios, open problems and ambitious goals for research concerning provenance in eScience and scientific data management. Presentations need not be on technical research contributions, but should introduce or clarify issues concerning provenance, accountability, integrity, and data quality in a relevant discipline. Such information would be important raw material to help computer science researchers and eScience application developers focus on relevant problems.

We wish to involve participants representing a variety of perspectives:

  • scientific disciplines in which sophisticated database, Grid, or eScience technology is already in widespread use (such as physics, astronomy and biology)
  • disciplines where such technology is not widely used but conceivably will be in the future, and
  • researchers and developers working on provenance and scientific data management or eScience technology

Though the main goal of the Theme is to support provenance in eScience, there is significant overlap with provenance, auditing, and forensic analysis of data in other settings, such as manufacturing, finance, and e-government. We therefore welcome participants interested in provenance in settings outside of eScience.

Some support for invited participants is available from the Theme (depending on attendance). We also intend to arrange for remote participation, by recording and "webcasting" the talks. The link to the video stream will be available from 9AM GMT at the event information page.



Please register. There is a small registration fee covering lunch and coffee breaks. Invited participants should register so that we can arrange accommodation, but their registration fee is waived.



8:45-9:00 Opening remarks
Session 1
9:00-9:45 Provenance issues in 3R systems, Larry Hunter, University of Colorado
9:45-10:30 Provenance in a collaborative bio-database: RAASWiki, Donald Dunbar and Jon Manning, Centre for Bioinformatics, Queen's Medical Research Institute, University of Edinburgh (CANCELLED)
10:30-11:00 Coffee
Session 2
11:00-11:45 Common provenance questions across e-science experiments, Simon Miles, King's College London
11:45-12:30 Data security aspects of provenance, Jens Jensen, STFC
12:30-2:00 Lunch
Session 3
2:00-2:30 Provenance & Evidence-Based Policy Research, Peter Edwards and Lorna Philip, University of Aberdeen
2:30-3:00 Provenance: Capture it or else!, Kenneth Lawrie (BGS)
3:00-3:30 Provenance in Engineering: Industrial perspectives on the provenance of design data, Alex Ball, UKOLN/University of Bath
3:30-4:00 Coffee
Session 4
4:00-4:45 In your worst nightmares: How experimental scientists are doing provenance for themselves, Cameron Neylon, STFC
4:45-5:30 Provenance for the safety control of civil engineering structures, José Barateiro, Portuguese National Laboratory of Civil Engineering (LNEC) (CANCELLED)
5:30-6:00 Concluding discussion

Speakers and Abstracts (tentative)

  • Alex Ball (UKOLN, University of Bath), Provenance in Engineering: Industrial perspectives on the provenance of design data

Abstract: In industries such as aerospace, unexpected behaviour of products can have catastrophic consequences. Thus, when incidents of such behaviour occur, it is important to track down their cause to prevent recurrence. While some incidents may be put down to in-service causes -- operator error, inadequate maintenance -- some investigations must consider the details of the design and manufacture of the product. In such cases, investigators need to be sure that they have access to the genuine documentation for the product, and if they are to track down the root cause of the incident they must be able to trace the effects of design decisions and information transactions across the various pieces of evidence. The complexity of this task is frequently compounded by the fact that many product designs are adaptations of earlier designs. The possibility of design re-use has a further consequence: provenance has an additional role to play in tracking the intellectual property of an engineering organization.

  • Jose Barateiro (Technical University of Lisbon/Portuguese National Laboratory of Civil Engineering), Challenges in Dam Safety Control Data Management

Abstract: To support the safety control of large civil engineering structures such as dams and bridges, raw data is collected from sensors placed at strategic points of the structures. This data is gathered through automatic or manual methods, and transformed into engineering quantities by specific algorithms. The life-cycle of this information is affected by different processes and systems, often involving different data schemas and file formats to store the same type of information. The Portuguese National Laboratory of Civil Engineering (LNEC) is using a new system to store and manage the data collected from dams, which is also shared with EDP (the national power company that owns the dams). However, this system was designed for operational purposes, and not for long-term preservation and interoperability. Thus, it does not handle crucial provenance information required for data quality, preservation and interoperability purposes. According to Portuguese law, LNEC is specifically responsible for maintaining an updated record of the data and knowledge about the dams’ safety and behavior. Thus, the preservation of this data is not only an option but a legal obligation. To that end, this data also requires complementary detailed context and provenance information, which might imply the need to define and put in place new descriptive and curation processes.

  • Donald Dunbar, Jon Manning (University of Edinburgh), Provenance in a collaborative bio-database: RAASWiki

Abstract: We have generated a database of information about a biologically and clinically relevant pathway: the renin-angiotensin-aldosterone system. The database contains information about biological objects (genes, drugs, animal models, etc.) and scientific publications and is aimed at helping a targeted group of researchers collaboratively annotate this information. Much of the data is 'seeded' from publicly available resources. This and the collaborative nature of the database's wiki format generate provenance issues. I will describe these issues and show some initial attempts to address them, highlighting solutions from other bio-databases.

  • Peter Edwards and Lorna Philip (University of Aberdeen), Provenance & Evidence-Based Policy Research

Abstract: The PolicyGrid project is investigating how social scientists can be supported in their policy research activities through the application of eScience tools. An important element of 21st-century policy making, at all levels, is a move away from opinion-based decision making to decision making that is grounded in evidence and subject to rigorous evaluation. An evidence base supports transparency and accountability in the policy decision-making process. UK government departments require that research reports should provide sufficient evidence to support their conclusions and recommendations, and should provide an easy audit trail to allow decision makers to understand the assumptions underlying conclusions and recommendations. In this talk we will discuss some of the issues surrounding documentation of policy evidence bases, before outlining a solution based on use of OWL and RDF. We will conclude by discussing an additional aspect of the provenance record, namely the social interactions that occur between researchers during the evidence gathering process.

  • Larry Hunter (University of Colorado), Provenance issues in 3R Systems

Abstract: Our laboratory builds knowledge-based systems for the analysis of genome-scale datasets. These systems populate a knowledge-base by semantic data integration from many public databases, and augment that knowledge with the results of natural language processing from the biomedical literature and various inference techniques (for more information, see [Leach, et al., 2009]). This work raises three provenance issues: (1) Independence: information in public databases (such as MINT or GO annotations) is generally curated from publications. It is possible for one publication to be the source of several instances of an assertion (e.g. from NLP as well as multiple databases picking it up), but assuming that all assertions traced to a single article are the same fails, since some publications are the source of multiple assertions. What is the best way to identify independent confirmation of a scientific fact without succumbing to simple redundancy of presentation? (2) Estimating the reliability: We use consensus methods (e.g. [Leach, et al., 2007]) to estimate the reliability of each knowledge source, and then combination methods [Leach, et al., 2009] to increase the estimated reliability of multiple reports of the same assertion. To be effective, this method requires an independence assumption among the different sources of information [see issue 1], and fails to take into account assigned reliabilities from within particular sources. Are there better global ways to assess the reliability of statements in the scientific literature? (3) Tracking changes over time: scientists are often interested in new assertions or in assertions that have been recently withdrawn. However, assigning dates to assertions is non-trivial; the date of a database entry may not properly reflect when the assertion was first made for a variety of reasons. What is the best method for assigning priority dates to scientific assertions?

Leach, S; Gabow, A; Hunter, L “Assessing and Combining Reliability of Protein Interaction Scores,” Pacific Symposium on Biocomputing 12:433-444 (2007)

Leach, S; Tipney, H; Feng, W; Baumgartner Jr, W; Kasliwal, P; Schuyler, R; Williams, T; Spritz, R; Hunter, L “Biomedical Discovery Acceleration, with Applications to Craniofacial Development,” PLoS Computational Biology 5(3): e1000215 (2009)

  • Jens Jensen (STFC), Data security aspects of provenance

Abstract: This presentation will look at practical security aspects of provenance: what "security" (in data management) can bring to provenance and what provenance can bring to security. Rather than looking at theoretical "solutions", we focus instead on current practical problems in projects with requirements for provenance in data management. In particular, we introduce the ASPiS project, which will provide a Shibboleth-enabled iRODS data store integrated with provenance metadata management. Providing such a service in the current e-Science environment enables us to meet some of the customers' requirements for data provenance and security, but also introduces its own set of problems. While the main part of the talk focuses on real cases, we will also use some of the lessons learned to try to distill advice for new users.

  • Kenneth Lawrie (British Geological Survey), Provenance, capture it or else!

Abstract: I run a large project which electronically captures and extracts key terms/phrases from within descriptive text in borehole logs and compares them to existing vocabularies/thesauri in order to produce appropriately targeted and standardised translations for modellers and mappers with varying degrees of geoscience understanding.

(The datasets in question are some 2 million boreholes (1.3 million electronically indexed) representing some 20+ million interval descriptions, of which we have approximately 3 million intervals coded to date.) The origins of these data span many branches of geoscience and very nearly 180 years of collection.

We are becoming increasingly aware that confidence in our stored data is significantly eroded without a comprehensive knowledge and understanding of provenance, data and its subsequent manipulation, a fact borne out by persistent 'reinvention of the wheel' scenarios with regard to that data.

  • Simon Miles (King's College London), Common provenance questions across e-science experiments

Abstract: Over the course of recent research projects, colleagues and I interviewed scientists working in a variety of disciplines about the questions they would like to be able to answer regarding the provenance of their experiment process' result data. Both substantial commonalities and a wide range of potential uses of provenance data were found in projects covering small-scale bioinformatics to large-scale physics experiments, primarily electronic processes to primarily physical processes, natural and social sciences and more. In this presentation, I will describe the range of general purposes of provenance data discovered, with illustrations using particular case studies, and the technical issues they raise.

  • Cameron Neylon (STFC), In your worst nightmares: How experimental scientists are doing provenance for themselves

Abstract: On the whole, experimental scientists, particularly those working in traditional, small research groups, have little knowledge of, or interest in, the issues surrounding provenance and data curation. There is, however, an emerging and evolving community of practice developing the use of the tools and social conventions related to the broad set of web-based resources that can be characterised as "Web 2.0". This approach emphasises social, rather than technical, means of enforcing citation and attribution practice as well as maintaining provenance. I will give examples of how this approach has been applied, and discuss the emerging social conventions of this community from the perspective of an insider.

This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.