ProvenanceInDatabases

From ESIWiki

Provenance in Databases Symposium, May 19-23, 2008

Overview

Provenance has been studied in a database setting for almost 20 years, and a number of techniques for supporting provenance in database queries, views, and updates have been proposed. However, consensus has yet to emerge on many key questions, such as:

  • what kinds of provenance are there, and how are they related?
  • what distinguishes provenance from other forms of metadata?
  • what problems do various techniques address, and how can we evaluate their success in doing so?
  • how is database provenance related to other forms of provenance in, for example, scientific workflows?

These questions are especially important because of the growing need for provenance in "curated" scientific databases, including biomedical and astronomical databases. This workshop will feature lectures from computer scientists involved in leading research on provenance in databases and from working scientists involved in scientific data curation.

Program

May 19-20 Free time for collaboration. Base meeting room in Appleton Tower M1
May 21 Public workshop day with lectures/panel discussion. Meeting in Cramond Room, National eScience Centre, 15 South College Street. Detailed program as follows:
8:45-9:00 Opening remarks
9:00-9:45 Bob Mann, Data provenance in astronomy
9:45-10:30 Donald Dunbar, Data provenance in biomedical discovery
10:30-11:00 Coffee
11:00-11:45 TJ Green, Containment of Conjunctive Queries on Annotated Relations
11:45-12:30 Val Tannen, Annotated XML: Queries and Provenance
12:30-2:00 Lunch
2:00-2:45 Stijn Vansummeren, Extending Where-Provenance with External Functions
2:45-3:30 Natalia Kwasnikowska (Hasselt University, Belgium), A formal model for dataflows, runs of dataflows, and provenance within runs
3:30-4:00 Coffee
4:00-4:45 Dirk van Gucht, Reflection: A Tool to Model and Reason about Provenance Databases
4:45-5:30 Closing remarks/discussion
May 22-23 Free time for collaboration. Base meeting room in Leith Room, Old College

Abstracts & Slides

Bob Mann, Data provenance in astronomy
Sky survey data processing is typically viewed by astronomers as comprising two parts: data reduction and data analysis. Data reduction is the removal of instrumental signatures from raw observational data together with their calibration to yield 'science-ready' data products which can be used by an astronomer without expert knowledge of the particular instrument. Data analysis is then the extraction of knowledge from those science-ready data products. I shall present a case study, based on the UK Infrared Deep Sky Survey (UKIDSS), which illustrates the way that provenance information is managed (or not) within astronomical data reduction and data analysis systems.
Donald Dunbar, Data provenance in biomedical discovery
I will talk about the work we do in biomedical discovery, specifically outlining the range of databases we create and administer. I'll ask some questions about whether and where we need to know the sources, quality and history of the data we store and use. I'll also give examples of how provenance information can affect our data mining success.
TJ Green, Containment of Conjunctive Queries on Annotated Relations
We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage and why-provenance annotations, for both conjunctive queries and unions of conjunctive queries. We also obtain positive decidability results for the case of provenance polynomial annotations, again for both conjunctive queries and unions of conjunctive queries. This is surprising given the close relation between provenance polynomials and bag semantics. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism. Finally, we show that for Datalog programs, equivalence under bag semantics coincides with equivalence under provenance polynomial annotations. This implies that equivalence of unions of conjunctive queries under bag semantics is decidable (and the same as isomorphism), thus resolving an open problem of Chaudhuri and Vardi.
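As a rough illustration of what "relations annotated with elements of a commutative semiring" means, the following Python sketch evaluates one join-and-project query under two different semirings. It is not taken from the talk: the names Semiring, compose, tag, and the sample relations are all made up for illustration. Joining tuples multiplies their annotations, and merging duplicate output tuples adds them, so the same query yields multiplicities under bag semantics and sets of witnesses under why-provenance.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, add, mul, zero, one). Hypothetical helper."""
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


# A binary relation maps each tuple to its annotation.
Relation = Dict[Tuple[str, str], Any]

# Bag semantics: natural numbers with ordinary + and *.
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

# Why-provenance: sets of witness sets, with union as addition
# and pairwise union as multiplication.
WHY = Semiring(
    frozenset(),
    frozenset({frozenset()}),
    lambda a, b: a | b,
    lambda a, b: frozenset(x | y for x in a for y in b),
)


def compose(k: Semiring, r: Relation, s: Relation) -> Relation:
    """Evaluate Q(x, z) :- R(x, y), S(y, z) over K-annotated relations."""
    out: Relation = {}
    for (x, y), kr in r.items():
        for (y2, z), ks in s.items():
            if y == y2:
                # Joint use of two tuples multiplies their annotations.
                product = k.mul(kr, ks)
                # Alternative derivations of the same output tuple are added.
                out[(x, z)] = k.add(out.get((x, z), k.zero), product)
    return out


def tag(name: str) -> frozenset:
    """Why-provenance annotation of a base tuple identified by `name`."""
    return frozenset({frozenset({name})})


bag_r = {("a", "b"): 2, ("a", "c"): 1}
bag_s = {("b", "d"): 3, ("c", "d"): 1}
why_r = {("a", "b"): tag("r1"), ("a", "c"): tag("r2")}
why_s = {("b", "d"): tag("s1"), ("c", "d"): tag("s2")}

bag_result = compose(BAG, bag_r, bag_s)  # multiplicity 2*3 + 1*1 for ("a", "d")
why_result = compose(WHY, why_r, why_s)  # witnesses {r1, s1} and {r2, s2}
```

The query code is identical in both runs; only the semiring passed in changes, which is the sense in which the annotated semantics generalizes set, bag, and provenance semantics.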
Val Tannen, Annotated XML: Queries and Provenance
We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications. Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.
Stijn Vansummeren, Extending Where-Provenance with External Functions
A particularly important form of provenance for digital data is the so-called "where-provenance" that describes where a piece of data was copied or adapted from. Although natural definitions of where-provenance for both query and update languages have been given and studied in the literature, these definitions currently do not deal with features such as aggregate operators and external function calls. In this talk, we propose a general model of where-provenance for digital data, and discuss how existing results about the expressiveness of provenance may generalize to the new model.
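To make the idea of where-provenance concrete, here is a minimal Python sketch in which every cell carries the set of source locations it was copied from. It is an illustration only, not the model proposed in the talk: the names Loc, Cell, load, select_project, and apply_external are hypothetical, and giving a computed value empty where-provenance is just one possible convention, chosen here to show why external functions need separate treatment.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple


class Loc(NamedTuple):
    """A source location: table name, row index, column name."""
    table: str
    row: int
    column: str


class Cell(NamedTuple):
    """A value together with the locations it was copied from."""
    value: object
    where: FrozenSet[Loc]


Row = Dict[str, Cell]


def load(table: str, records: List[Dict[str, object]]) -> List[Row]:
    """Annotate each cell of a source table with its own location."""
    return [
        {col: Cell(val, frozenset({Loc(table, i, col)})) for col, val in rec.items()}
        for i, rec in enumerate(records)
    ]


def select_project(rows: List[Row], keep: Callable[[Row], bool],
                   columns: List[str]) -> List[Row]:
    """Select rows and copy the chosen columns; copied cells keep their provenance."""
    return [{c: row[c] for c in columns} for row in rows if keep(row)]


def apply_external(cell: Cell, fn: Callable[[object], object]) -> Cell:
    """Apply an external function. The result is computed, not copied, so this
    sketch assigns it no where-provenance (an assumption, not a standard rule)."""
    return Cell(fn(cell.value), frozenset())


people = load("people", [
    {"name": "Ada", "city": "Edinburgh"},
    {"name": "Grace", "city": "Leith"},
])
result = select_project(people, lambda r: r["city"].value == "Edinburgh", ["name"])
shouted = apply_external(result[0]["name"], str.upper)
```

In this sketch the output cell for "Ada" still points back to row 0 of the people table, while the upper-cased value has lost that link, which is the kind of gap the abstract refers to.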
Natalia Kwasnikowska, A formal model for dataflows, runs of dataflows, and provenance within runs
Dataflow repositories are databases containing dataflows and their different runs. Such repositories can provide effective management of all experimental and workflow data kept in a large laboratory or enterprise setting, and can facilitate verification of results and tracking of the provenance (origin) of dataflow results. We present a formal model in which we use the Nested Relational Calculus (NRC), enhanced with external functions and subtyping, as the dataflow programming language. Our model includes careful formalisations of such features as complex data manipulation, external service calls, subdataflows, and the provenance of result values. We argue for our choice of NRC, which is a well-studied language with exactly the right set of operations that are needed for the manipulation of the types of complex data that occur in a scientific dataflow. Moreover, we present our formalisation of the notion of a run of a dataflow, i.e., the information necessary to reproduce the result value of the run. As such result values are often both large and complex, we have also developed a set of rules that trace back the provenance (origin) of a subvalue of a dataflow result inside its run.
Dirk van Gucht, Reflection: A Tool to Model and Reason about Provenance Databases
Reflection in programming languages is the ability, within the run of a program, to inspect the program in the current environment, and also generate and execute other programs. If we replace the word "program" with "query" and the word "environment" with "database" in this sentence, we get that reflection in query languages is the ability, within the run of a query, to inspect the query in the current database, and also generate and execute other queries. When the database contains data, meta-data, queries, annotations, and provenance data, reflection becomes a powerful tool to reason in such contexts. In this talk, I will examine the benefits and drawbacks of reflection for provenance databases.
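The following Python sketch loosely emulates this idea at the application level, using the standard sqlite3 module: a query is stored as data in the database, retrieved and inspected by another query, and then executed. It is only an approximation under stated assumptions, since the inspection and execution happen in the host language rather than inside a reflective query language, and the table and column names (readings, stored_queries, query_text) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary data.
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 2.5), (2, 4.0)])

# Queries stored as data, alongside the data they refer to.
conn.execute("CREATE TABLE stored_queries (name TEXT, query_text TEXT)")
conn.execute(
    "INSERT INTO stored_queries VALUES (?, ?)",
    ("total", "SELECT SUM(value) FROM readings"),
)

# Inspect: a query over the stored queries returns query text as a value.
(query_text,) = conn.execute(
    "SELECT query_text FROM stored_queries WHERE name = ?", ("total",)
).fetchone()

# Execute: the retrieved text is run as a query in the same database.
(total,) = conn.execute(query_text).fetchone()

conn.close()
```

Because stored queries are ordinary rows, they can carry annotations or provenance columns of their own, which hints at why reflection is attractive when queries and provenance live in the same database.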


Registration

If you would like to attend only one or two talks, please feel free to do so; however, if you plan to attend all day, please register. There is no registration fee.

Please register here.

Discussion

ProvenanceInDatabasesDiscussion
