Data Integration
Instructor
Dr. Jan Chomicki, Associate Professor
Location and time
260 Capen, T R 9:30-10:50.
Handouts
- Datalog
- Relational calculus
- Negation
- XML
- XPath/XQuery
- Schema mapping
- Data integration and exchange
- Metadata
- Consistent query answering (ICDT'07 keynote)
- CQA: query rewriting
- Source capabilities
- Interactive Query Formulation over Web Service-Accessed Sources
(Michalis Petropoulos)
- RDF and SPARQL
(Marcelo Arenas); bibliography
- Description Logic Reasoning
(Ian Horrocks)
Resources
- Problem set #1
- Problem set #2
Tests
- Test 1 (due March 20, 2008)
- Test 2 (due May 6, 2008)
Summary
The availability of integrated data from multiple independent,
heterogenous data sources is crucial for many applications. Data
integration requires combining and matching information from different
sources, and resolving a variety of discrepancies. XML is becoming a
de facto data integration standard.
This course will survey selected
issues arising in data integration, focusing on the theoretical
foundations of the area. The students in the class will be working on
team projects (2-3 people) involving research and/or programming, and
will give class presentations about their projects. There will also be
1-2 take-home exams.
Projects
Prerequisites
Good knowledge of database systems, some knowledge of logic and computational complexity.
Policies
Grading:
- projects (50%, includes class presentation and final report)
- midterm (25%)
- final exam (25%)
Academic integrity policy: I will follow the CSE department
academic integrity policy.
Make-up policy: The request should be made sufficiently in advance of
the test, for valid reasons.
Late submission policy: The submissions are due at midnight on the due
date. No late submissions are accepted. Exceptions will be made only for medical
reasons. Questions about the grading have to be raised with the TA within a week
after the graded assignment has been returned.
Course outline (tentative)
- Datalog: syntax, semantics, query evaluation.
- Datalog with negation: stratified programs, stable models.
- XML: data model, schemas, types, integrity constraints, logics.
- XML query languages: XPath, XQuery, query evaluation.
- Schema matching and mapping.
- Data integration and exchange, source-to-target dependencies.
- Schematic discrepancies, metadata, SchemaSQL.
- Database inconsistency and incompleteness, consistent query answers.
- Semantic Web: RDF, OWL, description logics.
Bibliography (under construction)
Information Integration Systems
Useful URL's
Tutorials
Massive data integration and mining projects
Real-life stories
On-line bibliography
This bibliography is far from complete and typically does not contain references to papers for which I haven't been able
to find a freely available on-line version. Any additions/modifications are appreciated.
Collections of articles
- Special
issue on Information Integration on the Web,
IEEE Intelligent Systems, Sept/Oct. 2003.
- Special issue
on Information Integration, IBM Systems Journal, 41(4), 2002.
-
Special issue on Integration Management, Data Engineering Bulletin, Sept. 2002, IEEE Computer Society.
-
Special issue on Data Cleaning, Data Engineering Bulletin, Dec. 2000, IEEE Computer Society.
-
Special issue on XML Data Management, Data Engineering Bulletin, June 2001, IEEE Computer Society.
General background reference:
J. D. Ullman, J. Widom: "A First Course in Database Systems," 3rd edition, Prentice Hall, 2008.
Datalog and negation
Schema integration
- E. Rahm, P. A. Bernstein:
A survey of approaches to automatic schema matching. VLDB Journal, 2001.
- S. Melnik, H. Garcia-Molina, E. Rahm.
Similarity Flooding: A Versatile Graph Matching Algorithm. ICDE'02.
Full version.
RECOMMENDED
- A. Doan, P. Domingos, and A. Halevy.
Learning to Match the Schemas of Databases: A Multistrategy Approach.
Machine Learning, 2003. Doan's Ph.D. won the 2003 ACM Doctoral Dissertation Award!
- J. Madhavan, P. A. Bernstein, E. Rahm.
Generic Schema Matching with CUPID.
VLDB'01.
- L. Popa et al.
Translating Web Data. VLDB'02.
RECOMMENDED
- R. J. Miller, L. M. Haas, M. A. Hernandez.
Schema Mapping as Query Discovery.
VLDB'00.
Schematic discrepancies
- Ravi Krishnamurthy, Witold Litwin, William Kent:
Language Features for Interoperability of Databases with Schematic Discrepancies.
SIGMOD 1991, 40-49.
- Laks V. S. Lakshmanan, Fereidoon Sadri, Iyer N. Subramanian:
SchemaSQL: An extension to SQL for multidatabase interoperability.
ACM TODS, 26(4), December 2001.
- Catharine M. Wyss, Edward M. Robinson: Relational languages for metadata integration.
ACM TODS, 30(2), June 2005.
RECOMMENDED
Data cleaning
- Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, C-A. Saita:
Declarative Data Cleaning: Language, Models, and Algorithms. VLDB'01.
RECOMMENDED
- Vijayshankar Raman, Joseph M. Hellerstein.
Potter's Wheel: An Interactive Data Cleaning System.
VLDB'01.
- M. Hernandez and S. Stolfo:
Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem.
Journal of Data Mining and Knowledge Discovery, 1997.
- S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani.
Robust and Efficient Fuzzy Match for Online Data Cleaning.
SIGMOD'03.
- Alvaro E. Monge.
An adaptive and efficient algorithm for detecting approximately duplicate
database records. Submitted.
Consistent query answers
- J. Chomicki:
Consistent Query Answering: Five Easy Pieces. ICDT'07.
- L. Bertossi:
Consistent Query Answering in Databases. ACM SIGMOD Record, June 2006.
RECOMMENDED
- L. Bertossi, J. Chomicki:
Query Answering in Inconsistent Databases. In Logics for
Emerging Applications of Databases, Springer-Verlag, 2003.
RECOMMENDED
- M. Arenas, L. Bertossi, J. Chomicki:
Consistent Query Answers in Inconsistent Databases.
PODS'99.
- M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, J. Spinrad:
Scalar Aggregation in Inconsistent Databases.
TCS, 2003.
Query Evaluation for Distributed Data Sources
- Jeffrey D. Ullman:
Information Integration Using Logical Views. ICDT 1997: 19-40.
- M. Lenzerini: Data Integration; A Theoretical Perspective, Proc. ACM Symposium
on Principles of Database Systems, 2002.
Slides.
- Alon Y. Halevy:
Theory of Answering Queries Using Views, SIGMOD Record 29(4), December 2000.
- Alon Y. Levy:
Answering Queries Using Views: A Survey,VLDB Journal, 2001.
- Alon Y. Levy, Anand Rajaraman, Joann J. Ordille:
Query-Answering Algorithms for Information Agents.
AAAI 1996.
- Oliver M. Duschka:
Query Planning and Optimization in Information Integration.
Ph.D. Thesis, Stanford University,
Computer Science Technical Report STAN-CS-TR-97-1598,
1997.
- Rachel Pottinger, Alon Levy:
A Scalable Algorithm for Answering Queries Using Views, VLDB Journal, 2001.
- Oliver Duschka, Michael Genesereth, Alon Levy:
Recursive Query Plans for Data Integration.
Journal of Logic Programming, special issue on Logic Based
Heterogeneous Information Systems, 2000.
RECOMMENDED
-
Jarek Gryz:
Query Rewriting Using Views in the Presence of Functional and
Inclusion Dependencies.
Information Systems, Vol. 24, No. 7, 1999, pp. 597-612.
-
R. Fagin, P. G. Kolaitis, R. J. Miller, L. Popa.
Data Exchange: Semantics and Query Answering.
ICDT'03. RECOMMENDED
Source limitations
- R. Yerneni, H. Garcia-Molina, J. Ullman:
Computing Capabilities of Mediators. SIGMOD'99.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman: A Query Translation Scheme
for Rapid Implementation of Wrappers. Deductive and
Object-Oriented Databases (DOOD) 95.
- Y. Papakonstantinou, A. Gupta, L. Haas: Capabilities-Based Query
Rewriting in Mediator Systems. Distributed And Parallel
Databases, 6(1), Jan 98.
- Alon Levy, Anand Rajaraman, Jeffrey Ullman:
Answering Queries Using Limited External Query Processors.
Journal of Computer and System Sciences, special issue dedicated to PODS-96, 1999.
OEM and XML basics
XQuery
XPath/XQuery
- M. Benedikt, Ch. Koch:
XPath Leashed. ACM TOCL, to appear.
- G. Gottlob, Ch. Koch, R. Pichler:
XPath Processing in a Nutshell. SIGMOD Record, March'03.
- G. Gottlob, Ch. Koch, R. Pichler:
Efficient Algorithms for Processing XPath Queries.
ACM TODS, June 2005.
- David DeHaan, David Toman, Mariano P. Consens, and M. Tamer Qzsu:
A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding.
SIGMOD'03.
Incremental validation of XML documents
- A. Balmin, Y. Papakonstantinou, V. Vianu:
Incremental Validation of XML Documents. ACM TODS, December 2004.
- Denilson Barbosa, Alberto Mendelzon, Leonid Libkin, Laurent Mignet and Marcelo Arenas:
Efficient Incremental Validation of XML Documents. ICDE'04.
- B. Bouchou, M. Halfeld Ferrari Alves:
Updates and Incremental Validation of XML Documents. DBPL'03.
- Maria Adriana Abrão, Béatrice Bouchou, Mirian Halfeld Ferrari Alves, Dominique Laurent, Martin A. Musicante:
Incremental Constraint Checking for XML Documents. XSym'04.
Repairing XML documents
XML security
XML data exchange
XML query relaxation
XML indexing
Mediators and wrappers for semistructured data and XML
- G.Wiederhold:
Value-added Middleware: Mediators.
- H. Garcia-Molina et al:
The TSIMMIS Approach to Mediation: Data Models and Languages.
JIIS.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman:
A Query Translation Scheme for Rapid Implementation of Wrappers.
DOOD'95.
- Y. Papakonstantinou, H. Garcia-Molina, J. Ullman:
MedMaker: A Mediation System Based on Declarative Specifications.
ICDE'96.
- Y.Papakonstantinou, S. Abiteboul, H. Garcia-Molina:
Object fusion in Mediator systems.VLDB'96.
- J-R. Gruser, L. Raschid, M. E. Vidal, L. Bright:
Wrapper Generation for Web Accessible Data Sources.
CoopIS'98.
Wrapper Generation Project.
- I. Manolescu, D. Florescu, D. Kossmann:
Answering XML Queries over
Heterogenous Data Sources. VLDB'01.
Storing semistructured data in relational DBMS
Semantic Web
Combining rank information
- R. Fagin:
Combining Fuzzy Information: an Overview.
ACM SIGMOD Record 31, 2, 2002. RECOMMENDED
- R. Fagin, A. Lotem, M. Naor:
Optimal aggregation algorithms for middleware.
Extended abstract in PODS'01.
- W.F. Cody, L.M. Haas, W. Niblack, M. Arya, M.J. Carey, R. Fagin, D. Lee, D. Petkovic, P.M. Schwarz, J. Thomas,
M. Tork Roth, J.H. Williams and E.L. Wimmers:
Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project.
IFIP 2.6 Third Working Conference on Visual Database Systems (VDB-3), 1995.
- R. Fagin:
Combining Fuzzy Information From Multiple Systems.
JCSS, 1999.
- E.L. Wimmers, L.M. Haas, M. Tork Roth, Ch. Braendli:
Using Fagin's Algorithm for Merging Ranked Results in Multimedia Middleware.
COOPIS'99.
- R. Fagin, E.L. Wimmers:
A Formula for Incorporating Weights into Scoring Rules.
ICDT'97.
- R. Fagin:
Fuzzy Queries in Multimedia Database Systems.
PODS'98, invited talk.