Data Integration (#24401)
Instructor
Dr. Jan Chomicki, Associate Professor
Location and time
Talbert 107, T R 5:00-6:20.
Textbook
Anhai Doan, Alon Y. Halevy, Zachary Ives. Principles of Data Integration.
Morgan Kaufmann Publishers, 2012.
Piazza
Handouts
- Datalog
- XML
-
Querying Data through Ontologies (slides for the book
Web Data Management
by S. Abiteboul et al.)
-
RDF and SPARQL
(Marcelo Arenas)
bibliography
- Schema matching and mapping
-
Description Logic Reasoning (I. Horrocks)
- Querying integrated databases
- Source capabilities
- Consistent query answering (my ICDT'07 keynote)
- CQA: query rewriting
- Metadata
- Provenance
- Data quality
(Floris Geerts)
Resources
Tests
- Test 1 (due 10/16/2012)
- Test 2 (test due 12/11/12)
Summary
The availability of integrated data from multiple independent,
heterogenous data sources is crucial for many applications. Data
integration requires combining and matching information from different
sources, and resolving a variety of discrepancies. XML is becoming a
de facto data integration standard. Recently, Semantic Web techniques
have been considered for data integration.
This course will survey selected
issues arising in data integration, focusing on the theoretical
foundations of the area, and algorithms and software systems facilitating integration. The students in the class will be working on
team projects (2-3 people) involving research and/or programming, and
will give class presentations about their projects. There will also be
two take-home exams.
The project in this course can be used as a CSE M.S. project (under the condition that the course grade is at least B+).
Projects
Prerequisites
Good knowledge of database systems, some knowledge of logic and computational complexity.
Policies
Grading:
- projects (50%, includes class presentation and final report)
- midterm (25%)
- final exam (25%)
Academic integrity policy: I will follow the
UB and CSE academic integrity policies.
Make-up policy: The request should be made sufficiently in advance of
the test, for valid reasons.
Late submission policy: The submissions are due at midnight on the due
date. No late submissions are accepted. Exceptions will be made only for medical
reasons. Questions about the grading have to be raised within a week
after the graded assignment has been returned.
Course outline (tentative)
- Datalog, negation, recursive query evaluation.
- XML: data model, schemas, integrity constraints, logics.
- XML query languages: XPath, XQuery, query evaluation.
- Semantic Web: RDF, SPARQL, OWL, description logics.
- Schema matching and mapping.
- Query containment and rewriting.
- Query evaluation, source descriptions and limitations.
- Data exchange, source-to-target dependencies.
- Database incompleteness and inconsistency, nulls, consistent query answers.
- Schematic discrepancies, metadata, SchemaSQL.
- Data cleaning.
- Data quality.
- Data provenance.
Bibliography (under construction)
Information Integration Systems
Useful URL's
Tutorials
Massive data integration and mining projects
Real-life stories
On-line bibliography
This bibliography is far from complete and typically does not contain references to papers for which I haven't been able
to find a freely available on-line version. Any additions/modifications are appreciated.
Collections of articles
-
Reasoning Web 2011, Springer, LNCS 6848.
- Special issue
on Information Integration, IBM Systems Journal, 41(4), 2002.
-
Special issue Towards Quality Data with Fusion and Cleaning,
Data Engineering Bulletin, Sept. 2011, IEEE Computer Society.
-
Special issue on Data Provenance,
Data Engineering Bulletin, December 2007, IEEE Computer Society.
-
Special issue on Data Quality,
Data Engineering Bulletin, June 2006, IEEE Computer Society.
-
Special issue on Integration Management, Data Engineering Bulletin, Sept. 2002, IEEE Computer Society.
-
Special issue on Data Cleaning, Data Engineering Bulletin, Dec. 2000, IEEE Computer Society.
General background reference:
- J. D. Ullman, J. Widom: "A First Course in Database Systems," 3rd edition, Prentice Hall, 2008.
- S. Abiteboul, R. Hull, V. Vianu: "Foundations of Databases," Addison Wesley, 1995.
Schema integration
- A. Halevy:
Why Your Data Won't Mix: Semantic Heterogeneity.
ACM Queue, 2005.
- E. Rahm, P. A. Bernstein:
A survey of approaches to automatic schema matching. VLDB Journal, 2001.
- S. Melnik, H. Garcia-Molina, E. Rahm.
Similarity Flooding: A Versatile Graph Matching Algorithm. ICDE'02.
Full version.
RECOMMENDED
- A. Doan, P. Domingos, and A. Halevy.
Learning to Match the Schemas of Databases: A Multistrategy Approach.
Machine Learning, 2003. Doan's Ph.D. won the 2003 ACM Doctoral Dissertation Award!
- J. Madhavan, P. A. Bernstein, E. Rahm.
Generic Schema Matching with CUPID.
VLDB'01.
- L. Popa et al.
Translating Web Data. VLDB'02.
RECOMMENDED
- R. J. Miller, L. M. Haas, M. A. Hernandez.
Schema Mapping as Query Discovery.
VLDB'00.
Schematic discrepancies
- Ravi Krishnamurthy, Witold Litwin, William Kent:
Language Features for Interoperability of Databases with Schematic Discrepancies.
SIGMOD 1991, 40-49.
- Laks V. S. Lakshmanan, Fereidoon Sadri, Iyer N. Subramanian:
SchemaSQL: An extension to SQL for multidatabase interoperability.
ACM TODS, 26(4), December 2001.
- Catharine M. Wyss, Edward M. Robinson: Relational languages for metadata integration.
ACM TODS, 30(2), June 2005.
RECOMMENDED
Data cleaning
- Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, C-A. Saita:
Declarative Data Cleaning: Language, Models, and Algorithms. VLDB'01.
RECOMMENDED
- Vijayshankar Raman, Joseph M. Hellerstein.
Potter's Wheel: An Interactive Data Cleaning System.
VLDB'01.
- M. Hernandez and S. Stolfo:
Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem.
Journal of Data Mining and Knowledge Discovery, 1997.
- S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani.
Robust and Efficient Fuzzy Match for Online Data Cleaning.
SIGMOD'03.
- Alvaro E. Monge.
An adaptive and efficient algorithm for detecting approximately duplicate
database records. Submitted.
Consistent query answers
- J. Chomicki:
Consistent Query Answering: Five Easy Pieces. ICDT'07.
- L. Bertossi:
Consistent Query Answering in Databases. ACM SIGMOD Record, June 2006.
RECOMMENDED
- L. Bertossi, J. Chomicki:
Query Answering in Inconsistent Databases. In Logics for
Emerging Applications of Databases, Springer-Verlag, 2003.
RECOMMENDED
- M. Arenas, L. Bertossi, J. Chomicki:
Consistent Query Answers in Inconsistent Databases.
PODS'99.
- M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, J. Spinrad:
Scalar Aggregation in Inconsistent Databases.
TCS, 2003.
Query Evaluation for Distributed Data Sources
- Jeffrey D. Ullman:
Information Integration Using Logical Views. ICDT 1997: 19-40.
- M. Lenzerini: Data Integration; A Theoretical Perspective, Proc. ACM Symposium
on Principles of Database Systems, 2002.
Slides.
- Alon Y. Halevy:
Theory of Answering Queries Using Views, SIGMOD Record 29(4), December 2000.
- Alon Y. Levy:
Answering Queries Using Views: A Survey,VLDB Journal, 2001.
- Alon Y. Levy, Anand Rajaraman, Joann J. Ordille:
Query-Answering Algorithms for Information Agents.
AAAI 1996.
- Oliver M. Duschka:
Query Planning and Optimization in Information Integration.
Ph.D. Thesis, Stanford University,
Computer Science Technical Report STAN-CS-TR-97-1598,
1997.
- Rachel Pottinger, Alon Levy:
A Scalable Algorithm for Answering Queries Using Views, VLDB Journal, 2001.
- Oliver Duschka, Michael Genesereth, Alon Levy:
Recursive Query Plans for Data Integration.
Journal of Logic Programming, special issue on Logic Based
Heterogeneous Information Systems, 2000.
RECOMMENDED
-
Jarek Gryz:
Query Rewriting Using Views in the Presence of Functional and
Inclusion Dependencies.
Information Systems, Vol. 24, No. 7, 1999, pp. 597-612.
-
R. Fagin, P. G. Kolaitis, R. J. Miller, L. Popa.
Data Exchange: Semantics and Query Answering.
ICDT'03. RECOMMENDED
Source limitations
- R. Yerneni, H. Garcia-Molina, J. Ullman:
Computing Capabilities of Mediators. SIGMOD'99.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman: A Query Translation Scheme
for Rapid Implementation of Wrappers. Deductive and
Object-Oriented Databases (DOOD) 95.
- Y. Papakonstantinou, A. Gupta, L. Haas: Capabilities-Based Query
Rewriting in Mediator Systems. Distributed And Parallel
Databases, 6(1), Jan 98.
- Alon Levy, Anand Rajaraman, Jeffrey Ullman:
Answering Queries Using Limited External Query Processors.
Journal of Computer and System Sciences, special issue dedicated to PODS-96, 1999.
XPath/XQuery
- M. Benedikt, Ch. Koch:
XPath Leashed. ACM TOCL, to appear.
- G. Gottlob, Ch. Koch, R. Pichler:
XPath Processing in a Nutshell. SIGMOD Record, March'03.
- G. Gottlob, Ch. Koch, R. Pichler:
Efficient Algorithms for Processing XPath Queries.
ACM TODS, June 2005.
- David DeHaan, David Toman, Mariano P. Consens, and M. Tamer Qzsu:
A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding.
SIGMOD'03.
Incremental validation of XML documents
- A. Balmin, Y. Papakonstantinou, V. Vianu:
Incremental Validation of XML Documents. ACM TODS, December 2004.
- Denilson Barbosa, Alberto Mendelzon, Leonid Libkin, Laurent Mignet and Marcelo Arenas:
Efficient Incremental Validation of XML Documents. ICDE'04.
- B. Bouchou, M. Halfeld Ferrari Alves:
Updates and Incremental Validation of XML Documents. DBPL'03.
- Maria Adriana Abrão, Béatrice Bouchou, Mirian Halfeld Ferrari Alves, Dominique Laurent, Martin A. Musicante:
Incremental Constraint Checking for XML Documents. XSym'04.
Repairing XML documents
XML security
XML data exchange
XML query relaxation
XML indexing
Mediators and wrappers for semistructured data and XML
- G.Wiederhold:
Value-added Middleware: Mediators.
- H. Garcia-Molina et al:
The TSIMMIS Approach to Mediation: Data Models and Languages.
JIIS.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman:
A Query Translation Scheme for Rapid Implementation of Wrappers.
DOOD'95.
- Y. Papakonstantinou, H. Garcia-Molina, J. Ullman:
MedMaker: A Mediation System Based on Declarative Specifications.
ICDE'96.
- Y.Papakonstantinou, S. Abiteboul, H. Garcia-Molina:
Object fusion in Mediator systems.VLDB'96.
- J-R. Gruser, L. Raschid, M. E. Vidal, L. Bright:
Wrapper Generation for Web Accessible Data Sources.
CoopIS'98.
Wrapper Generation Project.
- I. Manolescu, D. Florescu, D. Kossmann:
Answering XML Queries over
Heterogenous Data Sources. VLDB'01.
Storing semistructured data in relational DBMS
Semantic Web
- T. Berners-Lee, J. Hendler, O. Lassila:
The Semantic Web. Scientific American, May 2001.
- RDF Resource Guide
- D. McGuinness et al.:
DAML+OIL: An Ontology Language for the Semantic Web.
IEEE Intelligent Systems, Sept/Oct 2002.
- F. Baader, I. Horrocks, U. Sattler:
Description Logics as Ontology Languages for the Semantic Web.
Festschrift in honor of Jörg Siekmann, Springer, 2003.
- I. Horrocks, S. Tessaris:
Querying the Semantic Web: a Formal Approach.
ISWC'02.
- D. Calvanese et al.:
DL-Lite: Tractable description logics for ontologies.
AAAI'05.
- D. Calvanese et al.:
Ontologies and databases: The DL-Lite approach.
Reasoning Web'09.
- M. Arenas, C. Gutierrez, A. Mendelzon, J. Perez:
Foundations of Semantic Web Databases.
JCSS, 2011 (SEL has subscription).
Provenance
Keyword queries
Combining rank information
- R. Fagin:
Combining Fuzzy Information: an Overview.
ACM SIGMOD Record 31, 2, 2002. RECOMMENDED
- R. Fagin, A. Lotem, M. Naor:
Optimal aggregation algorithms for middleware.
Extended abstract in PODS'01.
- W.F. Cody, L.M. Haas, W. Niblack, M. Arya, M.J. Carey, R. Fagin, D. Lee, D. Petkovic, P.M. Schwarz, J. Thomas,
M. Tork Roth, J.H. Williams and E.L. Wimmers:
Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project.
IFIP 2.6 Third Working Conference on Visual Database Systems (VDB-3), 1995.
- R. Fagin:
Combining Fuzzy Information From Multiple Systems.
JCSS, 1999.
- E.L. Wimmers, L.M. Haas, M. Tork Roth, Ch. Braendli:
Using Fagin's Algorithm for Merging Ranked Results in Multimedia Middleware.
COOPIS'99.
- R. Fagin, E.L. Wimmers:
A Formula for Incorporating Weights into Scoring Rules.
ICDT'97.
- R. Fagin:
Fuzzy Queries in Multimedia Database Systems.
PODS'98, invited talk.