CSE4/587 Data-intensive Computing

Course Description for Spring 2021

Data-intensive computing deals with diverse data formats, storage models, application architectures, programming models, and the algorithms and tools for large-scale data analytics. In particular, we study approaches that address challenges in managing and utilizing large-scale data, and the methods for transforming voluminous datasets (big data) into discoveries and intelligence for human understanding and decision making. Topics include: intelligent representation of data, approaches for discovering intelligence in data, data-driven computing, storage requirements of big data, organization of big data repositories such as Hadoop, characteristics of Write-Once-Read-Many (WORM) data, data-intensive programming models such as MapReduce and Spark analytics, web services-based cloud computing middleware, and scalable analytics and visualization. This course has four major goals: (i) understand data-intensive computing, which has been defined as the fourth paradigm for the sciences by the late Jim Gray; (ii) study, design, and develop solutions using data-intensive computing models; (iii) perform predictive analytics and visualization using tools such as Python and Spark analytics; and (iv) focus on methods for scalability using cloud computing infrastructures such as Google Compute Engine and Amazon Web Services (AWS).

On completion of this course, students will be able to analyze, design, and implement effective solutions for data-intensive applications with very large-scale data sets. More specifically, a student will be able to:

Recognize a data-intensive problem.
Assess the scale of data and requirements.
Retrieve data using appropriate methods.
Describe the data layout and define the data repository format (Ex: store).
Decide on the algorithms (Ex: Bayesian) and programming models (Ex: MapReduce).
Define application-specific algorithms and analytics (Ex: network analysis).
Design the data-intensive program solution and system configuration.
Implement the data-intensive solution and test the solution.
Write a report summarizing the solution and results.
Incorporate services from cloud computing platforms.
Study the foundational concepts enabling cloud computing: services-based interface, programmatic consumption of services, virtualization, PKI-based security, large-scale storage, load-balancing, machine images and on-demand services.
Formulate data-intensive visualization solutions for presenting the results.

Website	http://www.cse.buffalo.edu/~bina/cse487/spring2021
Instructor	Bina Ramamurthy (bina@buffalo.edu)
Office Hours	Mon/Wed: 8:30-10:00 AM
Office Location	Online only
Lecture Time	Tue/Thu: 3:55-5:10 PM
Lecture Location	Online, virtual, synchronous

The overall grade for the course will be based on the student's performance in: class attendance and participation (10%), in-class quizzes (30%), and 3 hands-on projects (60%).
Letter grades will be mapped from the overall percentage: 95% or above is an A, 90% or above is an A-, and so on. A curve will be applied at the end based on the relative performance of the students in the course. We will use separate curves for graduates and undergraduates.
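
As a rough illustration of the arithmetic, the weighted overall percentage can be sketched in Python as below. This is only a sketch: the component scores are hypothetical, the function names are made up for this example, and only the two letter cutoffs stated above are encoded. The end-of-course curve is not modeled.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"attendance": 0.10, "quizzes": 0.30, "projects": 0.60}


def overall_percent(scores):
    """Combine per-component percentages (0-100) into the overall percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_before_curve(percent):
    """Map a percentage to a letter using only the cutoffs stated above."""
    if percent >= 95:
        return "A"
    if percent >= 90:
        return "A-"
    return None  # lower cutoffs are not listed in this document


# Hypothetical component scores, for illustration only.
example = {"attendance": 100, "quizzes": 90, "projects": 92}
total = overall_percent(example)
letter = letter_before_curve(total)
print(f"overall = {total:.1f}%, letter before curve = {letter}")
```

Because projects carry 60% of the weight, a few points gained or lost on a project move the overall percentage more than the same change on quizzes or attendance.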

There is one main textbook that covers the major concepts defined in the description (algorithms and statistical models for data-intensive computing). We will cover the rest of the topics, hands-on lab material, big-data infrastructure details, and cloud computing using online reference material and open-source tools.

Data Science: Python Data Science Handbook. J. VanderPlas, O'Reilly. ISBN: 9781491912058, November 2016.
Big data analytics: Data-Intensive Text Processing with MapReduce. J. Lin and C. Dyer, Morgan & Claypool Publishers (October 10, 2010), ASIN: B0094J3FXM.
Hadoop and Spark infrastructure: Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem. D. Eadline, Addison-Wesley Professional, ASIN: B017A8UACW, 2015.

We will be using many other references, online sources and textbooks throughout the semester. The details will be provided in the References tab.

There are 3 projects (data products) planned, each lasting about 4 weeks. Each project will cover one or more concepts and will involve design, development, and testing. You will need a laptop for this. The problems solved in the projects may or may not be related. Each solution is expected to represent an entire pipeline / workflow, leveraging the expertise you have developed in various areas through the project work. NO late projects will be accepted.

Lecture material will be posted before the lecture starts in the format shown.

Week	Concepts	Tools	Labs, Exams, Term Project
2/2	Introduction	Data-intensive computing	First day handout
5/6	Last day of classes	Review

Attendance Policy:

This is a fully online, synchronous course. That means you have to be present during the lecture in REAL TIME. Do not FAKE attendance. If you do, you will be penalized with a grade reduction. Remember that attendance accounts for 10% of your grade. You are responsible for the contents of all lectures. We reserve the right to take attendance during the lectures. We may use this information to determine how to resolve borderline grades at the end of the course, especially if we see a lack of attendance and participation during lecture sessions. During lectures, we will cover material from the textbook and work out several of the problems from the text. Lectures will also explore several real-world problems not covered in the book. You will be given a reading assignment at the end of each lecture for the next class.

Assessment Policy:

Online quizzes are planned as a major assessment instrument. These are closed-book, closed-neighbor, no-discussion quizzes. Please make sure that you are the only one involved in answering the quizzes. No sharing of information through social media or any other medium is allowed during the quiz. Quizzes cannot be retaken. You have to keep your video on while taking the quizzes.

Projects will be graded in phases using a special digital grading scheme (0, 5, 10, 15) that I will explain during the first lecture. If you have any questions about your grade, do not post them on Piazza or social media; other students cannot do anything about it. Meet your TA first and then your instructor about any grading issues.

Incomplete Policy:

We only grant incompletes in this course under the direst of circumstances. By definition, an incomplete is warranted if the student is capable of completing the course satisfactorily, but some traumatic event has interfered with their ability to finish within the timeframe of the semester. Incompletes are not designed as a stalling tactic to defer a poor performance in a class.

Academic Integrity Policy: We take this issue so seriously that I have created a separate page for this. See here.

If you have special needs, you must be registered with the Office of Accessibility Resources. If you are registered with them, please let your instructors know so that they can make special arrangements for you.
