Symposium on Provenance in Scientific Workflows, October 13-17 2008

This event will be held at the University of Utah, Salt Lake City, USA.

Confirmed Participants/Speakers

  • Roger Barga (Microsoft Research)

In my presentation I will outline a novel provenance algebra consisting of a common provenance model called provenir, defined in the description-logic-based W3C Web Ontology Language (OWL-DL), along with a set of provenance query operators derived from a classification of provenance queries. We also introduce a practical provenance storage solution using materialized views over a generic relational database system. Our approach takes advantage of the provenance query operators and well-defined indices to efficiently process complex provenance queries over very large datasets. To support our claims, we present an evaluation of both the performance and the scalability of our initial implementation. To the best of our knowledge, this is the first provenance management system that supports the complete process, from a formal provenance model and query operators to storage and efficient queries over provenance data.

Contributors: Satya Sahoo, Roger Barga, Yogesh Simmhan

  • Juliana Freire (University of Utah)

Provenance Analytics

As workflow systems are deployed (and extended with provenance capabilities), large volumes of provenance information are being collected. Just collecting provenance information, however, does not bring major benefits. For the data to be truly useful, we need effective and efficient ways to analyze and re-use it. In this talk I will present ongoing work on techniques to mine provenance data. I will discuss the challenges and opportunities, as well as new applications that are enabled by provenance analytics.

  • Jan Hidders (University of Antwerp/Delft University of Technology)

Provenance models for Petri-net based workflow models

In Petri nets the run of a net is usually defined as a sequence of vectors where each vector assigns a number to a place that indicates the number of tokens. If the Petri net represents a computation or a workflow that produces a certain result then clearly a richer notion of run is required to store provenance information. In this talk we will discuss such a notion and investigate how it relates to the original Petri net notion of run. In addition we will investigate how this notion of run can be adapted to extended types of Petri nets such as DFL that are adapted to deal with scientific workflows.
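The classical notion of run described in this abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the speaker's formalism: the toy net, its place and transition names, and the helpers `enabled`, `fire`, and `run` are all hypothetical.

```python
from typing import Dict, List, Tuple

# A marking assigns a token count to every place (the "vector" in the abstract).
Marking = Dict[str, int]

# Hypothetical toy net: each transition maps to (tokens consumed, tokens produced).
TRANSITIONS: Dict[str, Tuple[Marking, Marking]] = {
    "load": ({"input": 1}, {"loaded": 1}),
    "analyze": ({"loaded": 1}, {"result": 1}),
}


def enabled(marking: Marking, transition: str) -> bool:
    """A transition is enabled if every input place holds enough tokens."""
    consumed, _ = TRANSITIONS[transition]
    return all(marking.get(place, 0) >= n for place, n in consumed.items())


def fire(marking: Marking, transition: str) -> Marking:
    """Return the marking reached by firing one enabled transition."""
    if not enabled(marking, transition):
        raise ValueError(f"transition {transition!r} is not enabled")
    consumed, produced = TRANSITIONS[transition]
    successor = dict(marking)
    for place, n in consumed.items():
        successor[place] = successor.get(place, 0) - n
    for place, n in produced.items():
        successor[place] = successor.get(place, 0) + n
    return successor


def run(initial: Marking, firing_sequence: List[str]) -> List[Marking]:
    """A run in the classical sense: the sequence of markings visited."""
    markings = [dict(initial)]
    for transition in firing_sequence:
        markings.append(fire(markings[-1], transition))
    return markings


if __name__ == "__main__":
    start = {"input": 1, "loaded": 0, "result": 0}
    for marking in run(start, ["load", "analyze"]):
        print(marking)
```

Each element of the returned list records only how many tokens sit in each place. Nothing in it says which input token led to the token in `result`, which is the kind of information a richer notion of run would have to retain for provenance purposes.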

Additional slides

  • Bertram Ludaescher (University of California, Davis)

From Models of Computation to Models of Provenance

The recent increased interest in provenance across various communities has given rise to many, rather different notions and definitions of "provenance". For example, apart from more or less obvious commonalities, there are different requirements, use cases, and research questions being considered in databases and scientific workflows. In this talk, I will provide an overview of some of the different uses and notions of provenance in (scientific) workflows, and then attempt to provide a framework that makes it possible to distinguish and compare (some of the) different types of provenance. The framework highlights the importance of the underlying 'model of computation' (including observables) of a workflow as the basis for different 'models of provenance'.

  • Paolo Missier (University of Manchester)

Workflow provenance research in myGrid

The myGrid project at the School of Computer Science, University of Manchester, is the home of Taverna, a workflow management system commonly used by e-scientists to automate their in silico experiments. In this talk I will give a broad overview of some of the research issues associated with the development of provenance management and query capabilities for Taverna, recently undertaken within myGrid. These issues include, among others: a precise notion of granularity of provenance information over collection values; the role of annotations in shifting from black box to gray box processors, and the consequent notions of "provenance-active" processors and "provenance friendly" workflows; and the role of semantic overlays on top of raw provenance.

  • Beth Plale (Indiana University)

Provenance collection as a commodity, or spreading the word through a tool

Provenance or lineage of simple or complex data objects is essential to the reuse of the object because it enables, among other things, proper attribution and assessment of quality. For a provenance collection tool to be broadly beneficial, however, it must be flexible in how it gathers information. In this talk we discuss extensions to the Karma provenance collection and representation system that we are undertaking to make the tool useful to a broad set of users. In a project with the pharmaceutical giant Eli Lilly and the University of Manchester, we are extending the Lilly open source Life Science Grid with semantically annotated provenance, and we discuss the challenges in fitting into the framework.

Bio: Beth Plale is Director of the Center for Data and Search Informatics at Indiana University and professor in the Department of Computer Science. Her Ph.D. focused on real-time monitoring to detect behaviors of remote objects. This is her eighth year at Indiana University, preceded by a Postdoctoral Fellow position at Georgia Institute of Technology, a Ph.D. from SUNY Binghamton, and a stint in industry.

  • Yogesh Simmhan (Microsoft Research)

A Tale of Two Users: Supporting Provenance for the Scientist and the Data Valet (Yogesh Simmhan and Roger Barga)

Workflows are used in eScience not just by the scientists modeling their in silico experiments but also by “data valets” (or data managers) who clean and prepare the data to be science ready. For example, when data from instruments or sensors is shared with a scientific community, data valets first need to pre-process it using well-defined quality-check algorithms and data format transformations, move it to a shared location, and register its metadata before it can be used by the scientists.

Scientific workflows can be used by both these classes of users. However, the kind of provenance each requires is quite different. While the scientist is interested in the semantic aspects of the workflow execution without having to worry about the infrastructure, data valets wear two hats (science data manager and system administrator) and need provenance that helps them with both data analysis and fault recovery. This talk uses examples from the Pan-STARRS astronomy sky survey and the NEPTUNE oceanography project to compare the provenance requirements of these user classes, and describes the approaches we have adopted as part of the Trident Scientific Workflow workbench to collect and query this provenance.

  • Val Tannen (University of Pennsylvania)
  • Jan Van den Bussche (Hasselt University and transnational University of Limburg, Belgium)

Workflow provenance in the presence of complex data manipulation

Modeling the complex data manipulations that happen between the computational steps in a workflow is an issue of particular importance to scientific workflows. The issue has received attention in database provenance work (Buneman, Tannen, et al.), but also in workflow research in the COMAD model, in the VDL, in Taverna, and in our DFL and NRC dataflow models. We revisit the NRC dataflow model and its associated notion of subvalue provenance. In this context we also discuss the relationship between forward propagation of annotations (Buneman, Cheney, Vansummeren; Tannen et al.) and backward tracing of subvalue provenance.

  • Natalia Kwasnikowska (Hasselt University and transnational University of Limburg, Belgium)

Design issues in complex-object workflow databases

We consider the design of databases containing workflow specifications, their executions, and their metadata. We focus on dataflow applications involving the manipulation of data with a complex-object structure, combined with service calls. Services can be internal or external; internal services are dataflows acting as a subprogram of another dataflow, whereas external services are modeled as functions with a possibly nondeterministic behavior. Workflow specifications are expressed in a high-level programming language based on the nested relational calculus, the operators of which provide the right "glue" needed to combine different service calls into a complex-object dataflow. We envisage an integrated workflow repository database in which multiple workflow specifications can be stored, together with multiple executions of these workflows, including the complex object data that are involved. We discuss how such repositories can be queried in a variety of ways, including provenance queries. We show that a modern SQL platform with stored procedures and SQL/XML suffices to support all types of repository queries.



8:00-9:00 Breakfast

9:00-9:15 Opening remarks

9:15-10:00 Val Tannen

10:00-10:45 Yogesh Simmhan

10:45-11:15 Coffee break

11:15-12:00 Roger Barga

12:00-1:30 Lunch

1:30-2:15 Jan Hidders

2:15-2:45 Paolo Missier

2:45-3:15 Coffee break

3:15-4:00 Discussions

4:00-5:00 Discussions


8:00-9:00 Breakfast

9:00-9:45 Jan Van den Bussche

9:45-10:15 Natalia Kwasnikowska

10:15-10:30 Coffee break

10:30-11:15 Beth Plale

11:15-12:00 Juliana Freire

12:00-1:30 Lunch

1:30-2:15 Bertram Ludaescher

2:15-3:00 Val Tannen

2:15 - TBD


8:00-9:00 Breakfast

9:00-12:00 TBD

12:00-1:30 Lunch

1:30-5:00 TBD

Local information

We encourage participants to stay at the University of Utah Guest House. It is inexpensive and convenient since it is on campus and a short walk or bus ride from the workshop venue, and breakfast and lunch will be available on campus. Although there are few restaurants nearby, it should be easy to organize group taxis into downtown Salt Lake City for dinner.

Salt Lake City is a Delta hub. For the Europeans: there is a direct flight between Paris and Salt Lake City.

The Salt Lake City airport is relatively close to the city. A shared shuttle between the airport and the Guest House costs US$18, and a cab costs approximately US$30.

The workshop will be held in the Warnock Engineering Building (WEB), room 3760.


Travel and subsistence expenses of invited participants will be reimbursed by the Theme. You should retain receipts and submit an expense claim form after the event. See Reimbursement for Visitors (under "invited speaker") for more details. Note that since we are not meeting in Edinburgh, we cannot use the eSI to arrange hotel bookings, so you will need to book accommodation yourself.

This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.