CSE 726

Data Intensive Distributed Computing

Spring 2011

 

 

Instructor:

 

Prof. Tevfik Kosar

Office: 245 Bell Hall

Phone: 645-2323

Email: tkosar@buffalo.edu

Office hours: Wed 1:00pm-2:30pm

 

Course Description:

 

Scientific applications and experiments across all areas of science are becoming increasingly complex and more demanding in their computational and data requirements. Large experiments and simulations in fields such as genome mapping, climate modeling, astrophysics, health sciences, and high-energy physics generate data volumes reaching thousands of terabytes per year. As scientific applications become more data intensive, managing data resources and the flow of data between storage and compute resources is becoming the main bottleneck. Analyzing, visualizing, and disseminating these large data sets has become a major challenge, and Data Intensive Computing is now considered the Fourth Paradigm in scientific discovery, after the Empirical, Theoretical, and Computational branches of scientific thought.

 

This seminar will discuss state-of-the-art research, development, and deployment efforts in running Data Intensive Computing workloads on cluster, grid, and cloud infrastructures. We will read and discuss two papers every week in one of the following areas:

 

      Parallel Cluster File Systems

      Wide Area Distributed File Systems

      Wide Area Data Placement and Optimization

      Cloud and Cluster Scheduling

      MapReduce Improvements

      Scalable Data Placement

      Remote Data Access

      Global Scale Distributed Testbed Design

 

Course Location and Time:

 

The seminar will meet Thursdays 9am-11am in Bell 224. The first class will be on Thursday, January 20.

 

Reading List:

 

The reading list and topics for this seminar are available here.

 

Grading:

 

This is a research course. There will be no exams and no projects (unless an individual student requests a term project). Each student will present 1 paper and write reviews of 2 others. Each student is expected to read all papers, attend classes, and join the discussion of the papers. Grading will be P/F.

 

Useful Links:

 

      How to Read a Paper, by S. Keshav.

      Reviewing a Technical Paper, by M. Ernst.

 

Paper Review Format Guidelines:

 

      1 paragraph executive summary (what are the authors trying to achieve? potential contributions of the paper?)

      2-3 paragraphs of details (key ideas? motivation & justification? strengths and weaknesses? technical flaws? supported with results? comparison with other systems? future work? anything on which you disagree with the authors?)

      1-2 paragraphs summarizing the discussions in the class.

 

Course Blog:

 

All presentation slides and paper reviews will be posted on the course blog at http://cse726.blogspot.com/. Please make sure you visit this blog regularly. Also, do not forget to post your questions on the papers to be discussed by noon every Wednesday.

 

Seminar Schedule:

 

Date | Week | Papers to be Discussed | Presenter | Reviewers
Jan. 20 | 1 | Introduction – Data Intensive Distributed Computing. | Kosar |
Jan. 27 | 2 | [1] GPFS: A Shared-Disk File System for Large Computing Clusters | Avula | Mupparaju, Nayak
Jan. 27 | 2 | [2] PVFS: A Parallel File System for Linux Clusters | Sathish | Baldawa, Kadkol
Feb. 3 | 3 | [3] Black-Box Problem Diagnosis in Parallel File Systems | Baldawa | Bhagat, Agrawal
Feb. 3 | 3 | [4] Panache: A Parallel File System Cache for Global File Access | Alekar | Verma, Baldawa
Feb. 10 | 4 | [5] Availability in Globally Distributed Storage Systems | Agrawal | Baisane, Verma
Feb. 10 | 4 | [6] Safety, Visibility, and Performance in a Wide-Area File System | Verma | Sarikaya, Venkatesh
Feb. 17 | 5 | [7] Integrating Portable and Distributed Storage | Vyavahare | Sathish, Rajput
Feb. 17 | 5 | [8] The Google File System | Mupparaju | Nayak, Bhide
Feb. 24 | 6 | [9] Adaptive Data Placement for Wide-Area Sensing Services | Yilmaz | Sarikaya, Avula
Feb. 24 | 6 | [10] Volley: Automated Data Placement for Geo-Distributed Cloud Services | Ramakrishnaprasad | Venkatesh, Huang
Mar. 3 | 7 | [11] The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM | Huang | Patel, Kadkol
Mar. 3 | 7 | [12] Quincy: Fair Scheduling for Distributed Computing Clusters | Baisane | Patel, Ramakrishnaprasad
Mar. 10 | 8 | [13] Hedera: Dynamic Flow Scheduling for Data Center Networks | Puranik | Kaufman, Vyavahare
Mar. 10 | 8 | [14] Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations | Kaufman | Rodolph, Maiti
Mar. 17 | | Spring Break | |
Mar. 24 | 9 | [15] MapReduce Online | Patel | Baisane, Bhagat
Mar. 24 | 9 | [16] Improving MapReduce Performance in Heterogeneous Environments | Nayak | Mupparaju, Bhide
Mar. 31 | 10 | [17] quFiles: The right file at the right time | Sarikaya | Iyer, Puranik
Mar. 31 | 10 | [18] Nectar: Automatic Management of Data and Computation in Datacenters | Iyer | Alekar, Ramakrishnaprasad
Apr. 7 | 11 | [19] Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems | Maiti | Yilmaz, Rajput
Apr. 7 | 11 | [20] Adaptive File Transfers for Diverse Environments | Venkatesh | Kaufman, Huang
Apr. 14 | 12 | [21] Structure and Performance of the Direct Access File System | Bhide | Sathish, Avula
Apr. 14 | 12 | [22] RFS: Efficient and Flexible Remote File Access for MPI-IO | Rajput | Maiti, Puranik
Apr. 21 | 13 | [23] A toolkit for user-level file systems | Bhagat | Agrawal
Apr. 21 | 13 | [24] Experiences Building PlanetLab | Rodolph | Alekar, Iyer
Apr. 28 | 14 | [25] Large-scale Virtualization in the Emulab Network Testbed | Kadkol | Vyavahare
Apr. 28 | 14 | [26] iPlane: An Information Plane for Distributed Services | TBA | Yilmaz, Rodolph