Techniques for Managing Data Provenance in Scientific Workflow Systems

Shawn Bowers
UC Davis Genome Center

Scientific data analysis frequently requires the integration of multiple domain-specific tools and applications. Scientists have traditionally used batch files and scripting languages to automate the execution of these programs and the routing of data between them. These approaches, however, are often ad hoc and require considerable technical skill, especially for data- or compute-intensive analyses. Scientific workflow systems aim to address these shortcomings by employing dataflow languages in which workflows are modeled as directed graphs: nodes denote computation steps (wrapped into reusable components) and edges denote the desired dataflow between steps. These systems additionally offer support for workflow design and analysis, workflow optimization, and automatic recording of the data and process dependencies (i.e., provenance) introduced during workflow runs.

In this presentation, I will describe a number of challenges in managing data provenance within scientific workflow systems. In particular, most workflow systems employ a simple dependency model for representing provenance information; these approaches do not capture the explicit data dependencies introduced by "provenance-aware" processes, and they largely ignore computation models that work over structured data such as XML. I will first describe a general provenance model that extends conventional approaches with support for representing explicit data dependencies and for workflow steps that employ update semantics over nested data collections (e.g., XML). I will also describe efficient representation schemes for data provenance, focusing on the trade-offs among update time, storage size, and query response time. Finally, this work is being carried out within the Kepler scientific workflow system, and I will briefly describe our current prototypes and ongoing provenance work in that setting.
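To make the dataflow model above concrete, the following is a minimal sketch in plain Python. It is not Kepler's actual API; the Workflow class, step names, and the (output, step, inputs) provenance record format are all illustrative. It shows a workflow as a directed graph of computation steps with data pushed along edges, where each step execution is logged as a coarse-grained provenance record.

    # Minimal sketch of a dataflow-style workflow with provenance recording.
    # All names here are illustrative, not Kepler's API.
    from collections import defaultdict

    class Workflow:
        def __init__(self):
            self.steps = {}                 # node name -> callable component
            self.edges = defaultdict(list)  # producer name -> [consumer names]
            self.provenance = []            # (output, step, inputs) records

        def add_step(self, name, func):
            self.steps[name] = func

        def connect(self, producer, consumer):
            self.edges[producer].append(consumer)

        def run(self, source, value):
            # Push data along edges in arrival order (acyclic graph assumed).
            pending = [(source, value)]
            while pending:
                step, data = pending.pop(0)
                result = self.steps[step](data)
                # Record which step produced the result and from which input.
                self.provenance.append((result, step, data))
                for consumer in self.edges[step]:
                    pending.append((consumer, result))

    # Hypothetical three-step analysis pipeline.
    wf = Workflow()
    wf.add_step("parse",  lambda s: s.split())
    wf.add_step("count",  lambda ws: len(ws))
    wf.add_step("report", lambda n: f"{n} tokens")
    wf.connect("parse", "count")
    wf.connect("count", "report")
    wf.run("parse", "scientific workflow systems record provenance")

    for output, step, inputs in wf.provenance:
        print(f"{output!r} was produced by {step!r} from {inputs!r}")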
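The extended provenance model mentioned in the abstract can likewise be sketched under simplifying assumptions. Here, nested Python lists stand in for an XML-like nested collection, and a "provenance-aware" step updates the collection in place while explicitly declaring which existing items each new item was derived from; a conventional dependency model would instead have to assume every output depends on every input. The function names (record, add_gc_content) and record format are hypothetical, not part of Kepler.

    # Sketch: update semantics over a nested collection with explicit,
    # fine-grained data dependencies. Names are illustrative.
    dependencies = []  # (new_item_id, source_item_id) pairs

    def record(new_id, source_ids):
        for src in source_ids:
            dependencies.append((new_id, src))

    # A nested collection of (id, value) leaves, e.g. sequences grouped by sample.
    collection = [
        ("sample1", [("s1", "ACGT"), ("s2", "GGCA")]),
        ("sample2", [("s3", "TTAG")]),
    ]

    def add_gc_content(coll):
        # Update semantics: append a derived leaf to each group, leaving the
        # rest of the collection untouched. Each new leaf depends only on the
        # single sequence it was computed from.
        counter = 0
        for group_id, leaves in coll:
            for leaf_id, seq in list(leaves):  # snapshot: skip appended leaves
                counter += 1
                new_id = f"gc{counter}"
                gc = sum(c in "GC" for c in seq) / len(seq)
                leaves.append((new_id, gc))
                record(new_id, [leaf_id])  # explicit data dependency

    add_gc_content(collection)
    for new_id, src in dependencies:
        print(f"{new_id} was derived from {src}")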
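Finally, the trade-off among update time, storage size, and query response time can be illustrated with two naive representations of the same provenance graph (again a sketch, not the representation schemes developed in this work): storing only immediate dependencies is compact and cheap to update but answers lineage queries by traversal, while materializing the transitive closure makes lineage queries direct lookups at the cost of storage and update work.

    # Sketch: two provenance representations with opposite cost profiles.
    # Assumes records arrive in dependency order, as during a workflow run.
    from collections import defaultdict

    immediate = defaultdict(set)  # item -> items it directly depends on
    closure   = defaultdict(set)  # item -> all items it transitively depends on

    def add_dependency(item, source):
        immediate[item].add(source)
        # Closure update costs more: inherit everything the source depends on.
        closure[item] |= {source} | closure[source]

    def lineage_by_traversal(item):
        # Query against the compact representation: walk the graph on demand.
        seen, stack = set(), [item]
        while stack:
            for src in immediate[stack.pop()]:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

    add_dependency("b", "a")
    add_dependency("c", "b")
    add_dependency("d", "c")

    print(lineage_by_traversal("d"))  # {'a', 'b', 'c'} computed at query time
    print(closure["d"])               # same answer, precomputed at update time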