CSE 726
Data Intensive Distributed Computing
Spring 2011
Instructor:
Prof. Tevfik Kosar
Office: 245 Bell Hall
Phone: 645-2323
Email: tkosar@buffalo.edu
Office hours: Wed 1:00pm-2:30pm
Scientific applications and experiments in all areas of science are becoming
increasingly complex and more demanding in their computational and data
requirements. Large experiments and simulations in areas such as genome
mapping, climate modeling, astrophysics, health sciences, and high-energy
physics generate data volumes reaching thousands of terabytes per year. As
scientific applications become more data intensive, managing data resources
and the dataflow between storage and compute resources is becoming the main
bottleneck. Analyzing, visualizing, and disseminating these large data sets
has become a major challenge, and Data Intensive Computing is now considered
the “Fourth Paradigm” in scientific discovery, after the Empirical,
Theoretical, and Computational branches of scientific thought.
This seminar will discuss state-of-the-art research, development, and
deployment efforts in running Data Intensive Computing workloads on cluster,
grid, and cloud infrastructures. We will read and discuss two papers every
week in one of the following areas:
· Parallel Cluster File Systems
· Wide Area Distributed File Systems
· Wide Area Data Placement and Optimization
· Cloud and Cluster Scheduling
· MapReduce Improvements
· Scalable Data Placement
· Remote Data Access
· Global Scale Distributed Testbed Design
Course Location and Time:
The seminars will be held Thursdays, 9am-11am, in Bell 224.
The first day of class will be Thursday, January 20.
Reading List:
The reading list and topics for this seminar are available here.
Grading:
This is a research course. There will be no exams and no projects (unless an
individual student requests a term project). Each student will present 1 paper
and write reviews for 2 others. Each student is expected to read all papers,
attend classes, and join the discussion of the papers. Grading will be P/F.
Useful Links:
· How to Read a Paper, by S. Keshav.
· Reviewing a Technical Paper, by M. Ernst.
Paper Review Format Guidelines:
· 1 paragraph executive summary (what are the authors trying to achieve? potential contributions of the paper?)
· 2-3 paragraphs of details (key ideas? motivation & justification? strengths and weaknesses? technical flaws? supported with results? comparison with other systems? future work? anything you disagree with the authors on?)
· 1-2 paragraphs summarizing the class discussion.
Course Blog:
All presentation slides and paper reviews will be posted on the course blog
at http://cse726.blogspot.com/. Please visit this blog regularly. Also, do not
forget to post your questions on the papers to be discussed by noon every
Wednesday.
Seminar Schedule:
| Date | Week | Papers to be Discussed | Presenter | Reviewers |
| Jan. 20 | 1 | Introduction – Data Intensive Distributed Computing. | Kosar | |
| Jan. 27 | 2 | [1] GPFS: A Shared-Disk File System for Large Computing Clusters | Avula | Mupparaju, Nayak |
| | | | Sathish | Baldawa, Kadkol |
| Feb. 3 | 3 | | Baldawa | Bhagat, Agrawal |
| | | [4] Panache: A Parallel File System Cache for Global File Access | Alekar | Verma, Baldawa |
| Feb. 10 | 4 | | Agrawal | Baisane, Verma |
| | | [6] Safety, Visibility, and Performance in a Wide-Area File System | Verma | Sarikaya, Venkatesh |
| Feb. 17 | 5 | | Vyavahare | Sathish, Rajput |
| | | | Mupparaju | Nayak, Bhide |
| Feb. 24 | 6 | | Yilmaz | Sarikaya, Avula |
| | | [10] Volley: Automated Data Placement for Geo-Distributed Cloud Services | Ramakrishnaprasad | Venkatesh, Huang |
| Mar. 3 | 7 | [11] The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM | Huang | Patel, Kadkol |
| | | [12] Quincy: Fair Scheduling for Distributed Computing Clusters | Baisane | Patel, Ramakrishnaprasad |
| Mar. 10 | 8 | [13] Hedera: Dynamic Flow Scheduling for Data Center Networks | Puranik | Kaufman, Vyavahare |
| | | [14] Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations | Kaufman | Rodolph, Maiti |
| Mar. 17 | | Spring Break | | |
| Mar. 24 | 9 | [15] MapReduce Online | Patel | Baisane, Bhagat |
| | | [16] Improving MapReduce Performance in Heterogeneous Environments | Nayak | Mupparaju, Bhide |
| Mar. 31 | 10 | | Sarikaya | Iyer, Puranik |
| | | [18] Nectar: Automatic Management of Data and Computation in Datacenters | Iyer | Alekar, Ramakrishnaprasad |
| Apr. 7 | 11 | [19] Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems | Maiti | Yilmaz, Rajput |
| | | | Venkatesh | Kaufman, Huang |
| Apr. 14 | 12 | [21] Structure and Performance of the Direct Access File System | Bhide | Sathish, Avula |
| | | [22] RFS: Efficient and Flexible Remote File Access for MPI-IO | Rajput | Maiti, Puranik |
| Apr. 21 | 13 | | Bhagat | Agrawal |
| | | | Rodolph | Alekar, Iyer |
| Apr. 28 | 14 | [25] Large-scale Virtualization in the Emulab Network Testbed | Kadkol | Vyavahare |
| | | | TBA | Yilmaz, Rodolph |