Web Scale Data Management (Fall 2012)

Prof. Oliver Kennedy

This seminar course will cover a range of topics related to storing and querying large datasets. Topics will include a variety of distributed systems and primitives: data processing, synchronization, key-value stores, stream processors, and full SQL database systems.

Important Times

This course meets WEEKLY on MONDAYS from 9:00 AM to 10:50 AM in Davis 113A

Office hours are Monday and Thursday from 2:00 to 4:00, or by appointment

Course Requirements

Grading is S/U. (Yes, the system says letter grades; grading will be S/U regardless.)

All students are expected to submit a short weekly abstract and critical analysis of the week's papers, and to participate in class discussion of the papers. The weekly report is due via email to okennedy at buffalo before class starts, or in class. Students are allowed to miss up to 2 weeks' worth of abstracts (out of a total of 10) without penalty.

The top 3 abstracts for each week will be posted on the site.

Students enrolled for at least 2 credits MUST contact the instructor to sign up to present and lead a discussion about at least one of the papers below. Students enrolled for 1 credit may also choose to present, pending availability of the desired papers.

Students enrolled for 3 credits will also be required to submit a simple experimental project and a short report/presentation on their results. Computing resources will be provided.

Project Ideas

Student project ideas should be approved by the instructor by the beginning of October.

Project Resources

Several resources will be made available for student use and testing. See me for access details.

How to Write A Good Critique

(in 4-ish easy questions)

When writing a critique, I like to distill the essence of each paper by asking myself a few questions (presented here with some example answers for Map/Reduce):

  1. What technical challenges is the system being designed to address? (Writing distributed code is hard)
  2. In one sentence, what approach did the authors take? (A domain-specific language/framework amenable to distribution)
  3. What assumptions did the authors make? (Heterogeneous hardware; failures on at least one node highly likely; users happy to write C++ -> not entirely true)
  4. What interesting design decisions or observations about the application domain did the authors make? (A centralized command server -> jobs take a long time, so this is scalable; Map+Shuffle+Reduce is enough to encode most of the tasks in their application domain -> can make use of a distributed *streaming* filesystem service; etc.)

Questions 1 and 2 should help with the summary, and a good critique is based on your answers to 3 and 4.

Course Schedule

Week | Presenter | Theme | System (url) | Notes
August 27 | | No Class - Oliver away | |
Sept 3 | | No Class - Labor Day | |
Sept 10 | Oliver (slides) | Data Flow | Course Introduction |
Sept 17 | | No Class - Rosh Hashanah | |
Sept 24 | Jon L. (slides); Ying Y. (slides) | Map/Reduce | MapReduce (Original Paper) |
Oct 1 | Raghav A. (slides); Ying Y. (slides) | Extremely Parallel Query Languages 1 | Hive (Demo Paper); HadoopDB (Demo Paper) | Project proposals due; Top Critiques
Oct 8 | Gomathivinayagam M. (slides); Raghav A. (slides) | Extremely Parallel Query Languages 2 | Pig | Top Critiques
Oct 15 | Janhavi D. (slides); Niccolo M. (slides) | Column Stores | MonetDB | Top Critiques
Oct 22 | Ravi M. (slides); Kyungho J. (slides) | NoSQL Databases | Cassandra | Top Critiques
Oct 29 | Mike O. (slides); Dinesh R. (slides) | Distributed Consistency | Percolator | First Project Milestone; Top Critiques
Nov 5 | Gomathivinayagam M. (slides); Sakthi G. (slides) | Distributed Hash Tables | Chord | Top Critiques
Nov 12 | Niccolo M. (slides); Ravi M. (slides) | WAN Datastores | PNUTS | Top Critiques
Nov 19 | Kyungho J. (slides); Oliver (slides) | Stream Processors | Borealis; DBToaster and Laasie (no paper) |
Nov 26 | Janhavi D. (slides); Sakthi G. (slides) | Misc Topics | PIQL | Second Project Milestone
Dec 3 | | Student Presentations | |