Data Integration

Instructor

Dr. Jan Chomicki, Associate Professor

Location and time

260 Capen, T R 9:30-10:50.

Handouts

  1. Datalog
  2. Relational calculus
  3. Negation
  4. XML
  5. XPath/XQuery
  6. Schema mapping
  7. Data integration and exchange
  8. Metadata
  9. Consistent query answering (ICDT'07 keynote)
  10. CQA: query rewriting
  11. Source capabilities
  12. Interactive Query Formulation over Web Service-Accessed Sources (Michalis Petropoulos)
  13. RDF and SPARQL (Marcelo Arenas); bibliography
  14. Description Logic Reasoning (Ian Horrocks)

Resources

  1. Problem set #1
  2. Problem set #2

Tests

  1. Test 1 (due March 20, 2008)
  2. Test 2 (due May 6, 2008)

Summary

The availability of integrated data from multiple independent, heterogeneous data sources is crucial for many applications. Data integration requires combining and matching information from different sources, and resolving a variety of discrepancies. XML is becoming a de facto data integration standard.

This course will survey selected issues arising in data integration, focusing on the theoretical foundations of the area. The students in the class will be working on team projects (2-3 people) involving research and/or programming, and will give class presentations about their projects. There will also be 1-2 take-home exams.

Projects

Prerequisites

Good knowledge of database systems, some knowledge of logic and computational complexity.

Policies

Grading:

  1. projects (50%, includes class presentation and final report)
  2. midterm (25%)
  3. final exam (25%)

Academic integrity policy: I will follow the CSE department academic integrity policy.

Make-up policy: A request for a make-up test must be made sufficiently in advance of the test and must be supported by a valid reason.

Late submission policy: Submissions are due at midnight on the due date. Late submissions are not accepted; exceptions will be made only for medical reasons. Questions about grading must be raised with the TA within a week after the graded assignment has been returned.

Course outline (tentative)

  1. Datalog: syntax, semantics, query evaluation.
  2. Datalog with negation: stratified programs, stable models.
  3. XML: data model, schemas, types, integrity constraints, logics.
  4. XML query languages: XPath, XQuery, query evaluation.
  5. Schema matching and mapping.
  6. Data integration and exchange, source-to-target dependencies.
  7. Schematic discrepancies, metadata, SchemaSQL.
  8. Database inconsistency and incompleteness, consistent query answers.
  9. Semantic Web: RDF, OWL, description logics.

Bibliography (under construction)

Information Integration Systems

System                     | Institution/Company | Type                  | Technology
XQuark Bridge/Fusion       | XQuark              | Open source           | XML/XQuery
Liquid Data (Enosys)       | BEA                 | Commercial            | XML
Nimble                     | Actuate             | Commercial            | XML
DB2 Information Integrator | IBM                 | Commercial            | SQL
Power Center               | Informatica         | Commercial            | Web services
XML Information Workbench  | XML Global          | Commercial            | XML
Callixa                    | callixa.com         | Commercial            | SQL
Metamatrix                 | metamatrix.com      | Commercial            | XML/SQL
Xyleme                     | xyleme.com          | Commercial            | XML
Infomaster                 | Stanford U.         | Academic, operational | SQL
SIMS                       | ISI (USC)           | Academic prototype    | Relational
Tukwila                    | U. Washington       | Academic prototype    | XML
Raccoon                    | UCI                 | Academic prototype    | XML
Garlic                     | IBM Almaden         | Industrial prototype  | Object-relational

Useful URLs

Tutorials

Massive data integration and mining projects

Real-life stories

On-line bibliography

This bibliography is far from complete and typically does not contain references to papers for which I haven't been able to find a freely available on-line version. Any additions/modifications are appreciated.

Collections of articles

General background reference: J. D. Ullman, J. Widom: "A First Course in Database Systems," 3rd edition, Prentice Hall, 2008.

Datalog and negation

Schema integration

Schematic discrepancies

Data cleaning

Consistent query answers

Query Evaluation for Distributed Data Sources

Source limitations

OEM and XML basics

XQuery

XPath/XQuery

Incremental validation of XML documents

Repairing XML documents

XML security

XML data exchange

XML query relaxation

XML indexing

Mediators and wrappers for semistructured data and XML

Storing semistructured data in relational DBMS

Semantic Web

Combining rank information