CSE 726 - Data Intensive Distributed Computing

Spring 2011

Instructor:

Prof. Tevfik Kosar

Office: 245 Bell Hall

Phone: 645-2323

Email: tkosar@buffalo.edu

Office hours: Wed 1:00pm-2:30pm

Course Description:

Scientific applications and experiments in all areas of science are becoming increasingly complex and more demanding in terms of their computational and data requirements. Large experiments, such as genome mapping, climate modeling, astrophysics, health sciences, and high-energy physics simulations, generate data volumes reaching thousands of terabytes per year. As scientific applications become more data intensive, the management of data resources and of dataflow between storage and compute resources is becoming the main bottleneck. Analyzing, visualizing, and disseminating these large data sets has become a major challenge, and Data Intensive Computing is now considered the “Fourth Paradigm” in scientific discovery, after the Empirical, Theoretical, and Computational branches of scientific thought.

This seminar will discuss state-of-the-art research, development, and deployment efforts in running Data Intensive Computing workloads on cluster, grid, and cloud infrastructures. We will read and discuss two papers every week in one of the following areas:

· Parallel Cluster File Systems

· Wide Area Distributed File Systems

· Wide Area Data Placement and Optimization

· Cloud and Cluster Scheduling

· MapReduce Improvements

· Scalable Data Placement

· Remote Data Access

· Global Scale Distributed Testbed Design

Course Location and Time:

The seminar will be held on Thursdays, 9am-11am, in Bell 224. The first class will be on Thursday, January 20.

Reading List:

The reading list and topics for this seminar are available here.

Grading:

This is a research course. There will be no exams and no projects (unless an individual student requests a term project). Each student will present 1 paper and write reviews for 2 others. Each student is expected to read all papers, attend classes, and join the discussion of the papers. Grading will be P/F.

Useful Links:

· How to Read a Paper, by S. Keshav.

· Reviewing a Technical Paper, by M. Ernst.

Paper Review Format Guidelines:

· 1 paragraph executive summary (what are the authors trying to achieve? potential contributions of the paper?)

· 2-3 paragraphs of details (key ideas? motivation & justification? strengths and weaknesses? technical flaws? supported with results? comparison with other systems? future work? anything on which you disagree with the authors?)

· 1-2 paragraphs summarizing the discussions in the class.

Course Blog:

All presentation slides and paper reviews will be posted on the course blog at http://cse726.blogspot.com/. Please make sure you visit this blog regularly. Also, do not forget to post your questions on the papers to be discussed by noon every Wednesday.

Seminar Schedule:

Date	Week	Papers to be Discussed	Presenter	Reviewers
Jan. 20	1	Introduction – Data Intensive Distributed Computing.	Kosar
Jan. 27	2	[1] GPFS: A Shared-Disk File System for Large Computing Clusters	Avula	Mupparaju, Nayak
Jan. 27	2	[2] PVFS: A Parallel File System for Linux Clusters	Sathish	Baldawa, Kadkol
Feb. 3	3	[3] Black-Box Problem Diagnosis in Parallel File Systems	Baldawa	Bhagat, Agrawal
Feb. 3	3	[4] Panache: A Parallel File System Cache for Global File Access	Alekar	Verma, Baldawa
Feb. 10	4	[5] Availability in Globally Distributed Storage Systems	Agrawal	Baisane, Verma
Feb. 10	4	[6] Safety, Visibility, and Performance in a Wide-Area File System	Verma	Sarikaya, Venkatesh
Feb. 17	5	[7] Integrating Portable and Distributed Storage	Vyavahare	Sathish, Rajput
Feb. 17	5	[8] The Google File System	Mupparaju	Nayak, Bhide
Feb. 24	6	[9] Adaptive Data Placement for Wide-Area Sensing Services	Yilmaz	Sarikaya, Avula
Feb. 24	6	[10] Volley: Automated Data Placement for Geo-Distributed Cloud Services	Ramakrishnaprasad	Venkatesh, Huang
Mar. 3	7	[11] The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM	Huang	Patel, Kadkol
Mar. 3	7	[12] Quincy: Fair Scheduling for Distributed Computing Clusters	Baisane	Patel, Ramakrishnaprasad
Mar. 10	8	[13] Hedera: Dynamic Flow Scheduling for Data Center Networks	Puranik	Kaufman, Vyavahare
Mar. 10	8	[14] Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations	Kaufman	Rodolph, Maiti
Mar. 17		Spring Break
Mar. 24	9	[15] MapReduce Online	Patel	Baisane, Bhagat
Mar. 24	9	[16] Improving MapReduce Performance in Heterogeneous Environments	Nayak	Mupparaju, Bhide
Mar. 31	10	[17] quFiles: The right file at the right time	Sarikaya	Iyer, Puranik
Mar. 31	10	[18] Nectar: Automatic Management of Data and Computation in Datacenters	Iyer	Alekar, Ramakrishnaprasad
Apr. 7	11	[19] Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems	Maiti	Yilmaz, Rajput
Apr. 7	11	[20] Adaptive File Transfers for Diverse Environments	Venkatesh	Kaufman, Huang
Apr. 14	12	[21] Structure and Performance of the Direct Access File System	Bhide	Sathish, Avula
Apr. 14	12	[22] RFS: Efficient and Flexible Remote File Access for MPI-IO	Rajput	Maiti, Puranik
Apr. 21	13	[23] A toolkit for user-level file systems	Bhagat	Agrawal
Apr. 21	13	[24] Experiences Building PlanetLab	Rodolph	Alekar, Iyer
Apr. 28	14	[25] Large-scale Virtualization in the Emulab Network Testbed	Kadkol	Vyavahare
Apr. 28	14	[26] iPlane: An Information Plane for Distributed Services	TBA	Yilmaz, Rodolph