CSE 703 Data Cleaning
Registration #19210
Instructor
Dr. Jan Chomicki, Professor.
Office hours: TBD.
Time and location
Thursdays 9:30-11:30 am, Davis 113A.
Resources
To access the papers in the UB digital library, you may need to use the proxy server and reload the appropriate page:
http://libweb.lib.buffalo.edu/help/help.asp?ID1=442
Many papers can be found on the authors' pages or retrieved from dblp.
Workload
- Prepare and present a talk based on one or more papers from the
current computer science literature (I will distribute the papers
and help with the presentation). The talks will be prepared and presented
in two-person teams. A substantial journal paper counts for two presenters;
a conference paper counts for a single presenter. In the latter case, the
papers presented by the team must be related.
- Prepare and present a brief (5-10 minutes) highlight of your presentation by the end of February.
- Prepare a brief summary of your talk by the end of the semester.
- Before each class read the relevant part of the book.
- Attend all the classes and participate in the discussions.
- There may also be presentations by the instructor
and/or invited speakers.
Prerequisites
Required background: a course in databases.
Some knowledge of logic is helpful.
Grading
The seminar is graded S/U and can be taken for 1-3 credits.
A research or an implementation project is a possibility: see the instructor.
Policies
- Academic integrity
- Class attendance is mandatory. No late arrivals in class.
- Other relevant university policies.
Accessibility resources
Students with physical or learning disabilities should register
with Accessibility Resources in order to receive accommodations.
Summary
Topics
General
Ihab F. Ilyas, Xu Chu:
Data Cleaning. ACM 2019, ISBN 978-1-4503-7152-0, pp. 1-285
Deduplication
Chapter 3 of the Data Cleaning book.
Mauricio A. Hernández, Salvatore J. Stolfo:
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem.
Data Min. Knowl. Discov. 2(1): 9-37 (1998)
Oktie Hassanzadeh, Fei Chiang, Renée J. Miller, Hyun Chul Lee:
Framework for Evaluating Clustering Algorithms in Duplicate Detection. PVLDB 2(1): 1282-1293 (2009)
Jiannan Wang, Tim Kraska, Michael J. Franklin, Jianhua Feng:
CrowdER: Crowdsourcing Entity Resolution. PVLDB 5(11): 1483-1494 (2012)
Data Transformation
Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, Jeffrey Heer:
Wrangler: interactive visual specification of data transformation scripts. CHI 2011: 3363-3372
Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker:
DataXFormer: A robust transformation discovery system. ICDE 2016: 1134-1145
Integrity constraints
Chapter 5 of the Data Cleaning book.
Rule-based data cleaning
Chapter 6 of the Data Cleaning book.
Xu Chu, Ihab F. Ilyas, Paolo Papotti:
Holistic data cleaning: Putting violations into context. ICDE 2013: 458-469
George Beskales, Ihab F. Ilyas, Lukasz Golab:
Sampling the Repairs of Functional Dependency Violations under Hard Constraints.
PVLDB 3(1): 197-207 (2010)
Leopoldo E. Bertossi:
Consistent query answering in databases. SIGMOD Record 35(2): 68-76 (2006)
Leopoldo E. Bertossi:
Database Repairing and Consistent Query Answering.
Synthesis Lectures on Data Management, Morgan & Claypool Publishers 2011
Repairing and Machine Learning
Chapter 7 of the Data Cleaning book.
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré:
HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 10(11): 1190-1201 (2017)