CSE 703 Data Cleaning

Registration #19210

Instructor

Dr. Jan Chomicki, Professor. Office hours: TBD.

Time and location

Thursdays 9:30-11:30 am, Davis 113A.

Resources

To access the papers in the UB digital library, you may need to use the proxy server and reload the appropriate page:
 http://libweb.lib.buffalo.edu/help/help.asp?ID1=442
Many papers can be found on the authors' pages via Google or retrieved from dblp.

Workload

  1. Prepare and present a talk based on one or more papers from the current computer science literature (I will distribute the papers and help with the presentation). Talks will be prepared and presented in two-person teams. A substantial journal paper warrants two presenters; a conference paper warrants a single presenter. In the latter case, the papers presented by the team must be related.
  2. Prepare and present a brief (5-10 minutes) highlight of your presentation by the end of February.
  3. Prepare a brief summary of your talk by the end of the semester.
  4. Before each class read the relevant part of the book.
  5. Attend all the classes and participate in the discussions.
  6. There may also be presentations by the instructor and/or invited speakers.

Prerequisites

Required background: a course in databases. Some knowledge of logic is helpful.

Grading

The seminar is graded S/U and can be taken for 1-3 credits. A research or an implementation project is a possibility: see the instructor.

Policies

  1. Academic integrity
  2. Class attendance is mandatory. No late arrivals in class.
  3. Other relevant university policies.

Accessibility resources

Students with physical or learning disabilities should register with Accessibility Resources in order to receive accommodations.

Summary

Topics

General

Ihab F. Ilyas, Xu Chu: Data Cleaning. ACM 2019, ISBN 978-1-4503-7152-0, pp. 1-285

Deduplication

Chapter 3 of the Data Cleaning book.

Mauricio A. Hernández, Salvatore J. Stolfo: Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Min. Knowl. Discov. 2(1): 9-37 (1998)

Oktie Hassanzadeh, Fei Chiang, Renée J. Miller, Hyun Chul Lee: Framework for Evaluating Clustering Algorithms in Duplicate Detection. PVLDB 2(1): 1282-1293 (2009)

Jiannan Wang, Tim Kraska, Michael J. Franklin, Jianhua Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB 5(11): 1483-1494 (2012)

Data Transformation

Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, Jeffrey Heer: Wrangler: interactive visual specification of data transformation scripts. CHI 2011: 3363-3372

Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: A robust transformation discovery system. ICDE 2016: 1134-1145

Integrity constraints

Chapter 5 of the Data Cleaning book.

Rule-based data cleaning

Chapter 6 of the Data Cleaning book.

Xu Chu, Ihab F. Ilyas, Paolo Papotti: Holistic data cleaning: Putting violations into context. ICDE 2013: 458-469

George Beskales, Ihab F. Ilyas, Lukasz Golab: Sampling the Repairs of Functional Dependency Violations under Hard Constraints. PVLDB 3(1): 197-207 (2010)

Leopoldo E. Bertossi: Consistent query answering in databases. SIGMOD Record 35(2): 68-76 (2006)

Leopoldo E. Bertossi: Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management, Morgan & Claypool Publishers 2011

Repairing and Machine Learning

Chapter 7 of the Data Cleaning book.

Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré: HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 10(11): 1190-1201 (2017)