CSE 712 Provenance and lineage

Registration #21948

Instructor

Dr. Jan Chomicki, Professor. Office hours: TBD.

Time and location

T 10-12, Davis 113A.

Talks

Resources

To access the papers in the UB digital library, you may need to use the proxy server and reload the appropriate page:
 http://libweb.lib.buffalo.edu/help/help.asp?ID1=442
Many papers can be googled on the author pages or retrieved from dblp.

Workload

  1. Prepare and present a talk based on one or more papers from the current computer science literature (I will distribute the papers and help with the presentation). The talks will be prepared and presented in two-person teams. A substantial journal paper warrants two presenters; a conference paper warrants a single presenter. In the latter case, the papers presented by a team must be related.
  2. Prepare a report based on the same material.
  3. Attend all the classes and participate in the discussions.
  4. There may also be presentations by the instructor and/or invited speakers.

Prerequisites

Required background: a course in databases. Some knowledge of logic is helpful.

Grading

The seminar is graded S/U and can be taken for 3 credits. A research or implementation project is also possible: see the instructor.

Policies

  1. Academic integrity
  2. Class attendance is mandatory. No late arrivals in class.
  3. Other relevant university policies.

Accessibility resources

Students with physical or learning disabilities should register with Accessibility Resources in order to receive accommodations.

Summary

Provenance is one of the central topics in Big Data. Data provenance keeps track of how the data is derived from the sources. Workflow provenance/lineage represents the information about specific processes and makes it possible to query and replay them. An important class of provenance applications deals with query result explanation. Another class deals with scientific workflows. This seminar will address practical and theoretical issues in provenance, lineage and related areas. Each participating student will present one or more research papers from the current database literature, and prepare a report.

Topics

General

James Cheney, Laura Chiticariu, Wang Chiew Tan: Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 1(4): 379-474 (2009).

Todd J. Green, Val Tannen: The Semiring Framework for Database Provenance. PODS 2017: 93-99.

Collections

Applications of Provenance. IEEE Data Engineering Bulletin, Volume 41, Number 1, March 2018.

Provenance

Chen Chen, Harshal Tushar Lehri, Lay Kuan Loh, Anupam Alur, Limin Jia, Boon Thau Loo, Wenchao Zhou: Distributed Provenance Compression. SIGMOD Conference 2017: 203-218.

Luc Moreau, Paul T. Groth, James Cheney, Timothy Lebo, Simon Miles: The rationale of PROV. J. Web Sem. 35: 235-257 (2015).

Peter Buneman, James Cheney, Stijn Vansummeren: On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst. 33(4): 28:1-28:47 (2008).

Yael Amsterdamer, Daniel Deutch, Val Tannen: Provenance for aggregate queries. PODS 2011: 153-164.

Yael Amsterdamer, Daniel Deutch, Tova Milo, Val Tannen: On Provenance Minimization. ACM Trans. Database Syst. 37(4): 30:1-30:36 (2012).

Applications

Daniel Deutch, Yuval Moskovitch, Val Tannen: Provenance-based analysis of data-centric processes. VLDB J. 24(4): 583-607 (2015).

Val Tannen: Provenance analysis for FOL model checking. SIGLOG News 4(1): 24-36 (2017).

Luc Moreau, Paul T. Groth: Provenance of Publications: A PROV Style for LaTeX. TaPP 2015.

Daniel Deutch, Amir Gilad: Reverse-Engineering Conjunctive Queries from Provenance Examples. EDBT 2019: 277-288.

Peter Buneman, Sanjeev Khanna, Keishi Tajima, Wang Chiew Tan: Archiving scientific data. ACM Trans. Database Syst. 29: 2-42 (2004).

Pingcheng Ruan, Gang Chen, Anh Dinh, Qian Lin, Beng Chin Ooi, Meihui Zhang: Fine-Grained, Secure and Efficient Data Provenance for Blockchain. PVLDB 12(9): 975-988 (2019).

Nan Zheng, Abdussalam Alawini, Zachary G. Ives: Fine-Grained Provenance for Matching & ETL. ICDE 2019: 184-195.

Grigoris Karvounarakis, Todd J. Green, Zachary G. Ives, Val Tannen: Collaborative data sharing via update exchange and provenance. ACM Trans. Database Syst. 38(3): 19:1-19:42 (2013).

David W. Archer, Lois M. L. Delcambre, David Maier: User Trust and Judgments in a Curated Database with Explicit Provenance. In Search of Elegance in the Theory and Practice of Computation 2013: 89-111.

Lineage

Rajendra Bose, James Frew: Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv. 37(1): 1-28 (2005).

Paolo Missier, Norman W. Paton, Khalid Belhajjame: Fine-grained and efficient lineage querying of collection-based workflow provenance. EDBT 2010: 299-310.

Query result explanation, causality

Seokki Lee, Sven Köhler, Bertram Ludäscher, Boris Glavic: A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries. ICDE 2017: 485-496.

Sudeepa Roy, Laurel Orr, Dan Suciu: Explaining Query Answers with Explanation-Ready Databases. PVLDB 9(4): 348-359 (2015).

Zhengjie Miao, Sudeepa Roy, Jun Yang: Explaining Wrong Queries Using Small Examples. SIGMOD Conference 2019: 503-520.

Daniel Deutch, Yuval Moskovitch, Noam Rinetzky: Hypothetical Reasoning via Provenance Abstraction. SIGMOD Conference 2019: 537-554.

Daniel Deutch, Amir Gilad: Reverse-Engineering Conjunctive Queries from Provenance Examples. EDBT 2019: 277-288.

Melanie Herschel: A Hybrid Approach to Answering Why-Not Questions on Relational Query Results. J. Data and Information Quality 5(3): 10:1-10:29 (2015).

Seokki Lee, Bertram Ludäscher, Boris Glavic: PUG: a framework and practical implementation for why and why-not provenance. VLDB J. 28(1): 47-71 (2019).

Data profiling

Ziawasch Abedjan, Lukasz Golab, Felix Naumann: Profiling relational data: a survey. VLDB J. 24(4): 557-581 (2015).

Jens Ehrlich, Mandy Roick, Lukas Schulze, Jakob Zwiener, Thorsten Papenbrock, Felix Naumann: Holistic Data Profiling: Simultaneous Discovery of Various Metadata. EDBT 2016: 305-316.

Schema evolution