Techniques for Managing Data Provenance in Scientific Workflow Systems

Shawn Bowers
UC Davis Genome Center

Scientific data analysis frequently requires the integration of multiple domain-specific tools and applications. Scientists have traditionally used batch files and scripting languages to automate the execution of these programs and the routing of data between them. These approaches, however, are often ad hoc and require considerable technical skill, especially for data- or compute-intensive analyses. Scientific workflow systems aim to address these shortcomings by employing dataflow languages in which workflows are modeled as directed graphs: nodes denote computation steps (wrapped into reusable components) and edges denote the desired dataflow between steps. These systems additionally offer support for workflow design and analysis, workflow optimization, and automatic recording of the data and process dependencies (i.e., provenance) introduced during workflow runs.

In this presentation, I will describe a number of challenges in managing data provenance within scientific workflow systems. In particular, most workflow systems employ a simple dependency model for representing provenance information; these approaches do not capture the explicit data dependencies introduced by "provenance-aware" processes, and they largely ignore computation models that work over structured data such as XML. I will first describe a general provenance model that extends conventional approaches with support for representing explicit data dependencies and for workflow steps that employ update semantics over nested data collections (e.g., XML). I will also describe efficient representation schemes for data provenance, focusing on the trade-offs among update time, storage size, and query response time. Finally, this work is being carried out within the Kepler scientific workflow system, and I will briefly describe our current prototypes and ongoing provenance work in that setting.
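To make the dataflow model above concrete, the following is a minimal sketch in plain Python. It is not Kepler's actual API; the Workflow class, step names, and the (output, step, inputs) provenance record format are all illustrative. It shows a workflow as a directed graph of computation steps with data pushed along edges, where each step execution is logged as a coarse-grained provenance record.

    # Minimal sketch of a dataflow-style workflow with provenance recording.
    # All names here are illustrative, not Kepler's API.
    from collections import defaultdict

    class Workflow:
        def __init__(self):
            self.steps = {}                 # node name -> callable component
            self.edges = defaultdict(list)  # producer name -> [consumer names]
            self.provenance = []            # (output, step, inputs) records

        def add_step(self, name, func):
            self.steps[name] = func

        def connect(self, producer, consumer):
            self.edges[producer].append(consumer)

        def run(self, source, value):
            # Push data along edges in arrival order (acyclic graph assumed).
            pending = [(source, value)]
            while pending:
                step, data = pending.pop(0)
                result = self.steps[step](data)
                # Record which step produced the result and from which input.
                self.provenance.append((result, step, data))
                for consumer in self.edges[step]:
                    pending.append((consumer, result))

    # Hypothetical three-step analysis pipeline.
    wf = Workflow()
    wf.add_step("parse",  lambda s: s.split())
    wf.add_step("count",  lambda ws: len(ws))
    wf.add_step("report", lambda n: f"{n} tokens")
    wf.connect("parse", "count")
    wf.connect("count", "report")
    wf.run("parse", "scientific workflow systems record provenance")

    for output, step, inputs in wf.provenance:
        print(f"{output!r} was produced by {step!r} from {inputs!r}")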
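The extended provenance model mentioned in the abstract can likewise be sketched under simplifying assumptions. Here, nested Python lists stand in for an XML-like nested collection, and a "provenance-aware" step updates the collection in place while explicitly declaring which existing items each new item was derived from; a conventional dependency model would instead have to assume every output depends on every input. The function names (record, add_gc_content) and record format are hypothetical, not part of Kepler.

    # Sketch: update semantics over a nested collection with explicit,
    # fine-grained data dependencies. Names are illustrative.
    dependencies = []  # (new_item_id, source_item_id) pairs

    def record(new_id, source_ids):
        for src in source_ids:
            dependencies.append((new_id, src))

    # A nested collection of (id, value) leaves, e.g. sequences grouped by sample.
    collection = [
        ("sample1", [("s1", "ACGT"), ("s2", "GGCA")]),
        ("sample2", [("s3", "TTAG")]),
    ]

    def add_gc_content(coll):
        # Update semantics: append a derived leaf to each group, leaving the
        # rest of the collection untouched. Each new leaf depends only on the
        # single sequence it was computed from.
        counter = 0
        for group_id, leaves in coll:
            for leaf_id, seq in list(leaves):  # snapshot: skip appended leaves
                counter += 1
                new_id = f"gc{counter}"
                gc = sum(c in "GC" for c in seq) / len(seq)
                leaves.append((new_id, gc))
                record(new_id, [leaf_id])  # explicit data dependency

    add_gc_content(collection)
    for new_id, src in dependencies:
        print(f"{new_id} was derived from {src}")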
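Finally, the trade-off among update time, storage size, and query response time can be illustrated with two naive representations of the same provenance graph (again a sketch, not the representation schemes developed in this work): storing only immediate dependencies is compact and cheap to update but answers lineage queries by traversal, while materializing the transitive closure makes lineage queries direct lookups at the cost of storage and update work.

    # Sketch: two provenance representations with opposite cost profiles.
    # Assumes records arrive in dependency order, as during a workflow run.
    from collections import defaultdict

    immediate = defaultdict(set)  # item -> items it directly depends on
    closure   = defaultdict(set)  # item -> all items it transitively depends on

    def add_dependency(item, source):
        immediate[item].add(source)
        # Closure update costs more: inherit everything the source depends on.
        closure[item] |= {source} | closure[source]

    def lineage_by_traversal(item):
        # Query against the compact representation: walk the graph on demand.
        seen, stack = set(), [item]
        while stack:
            for src in immediate[stack.pop()]:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

    add_dependency("b", "a")
    add_dependency("c", "b")
    add_dependency("d", "c")

    print(lineage_by_traversal("d"))  # {'a', 'b', 'c'} computed at query time
    print(closure["d"])               # same answer, precomputed at update time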