Data-intensive Computing

CSE4/587 Spring 2021

"Hmm... What will I learn in this course?"

Introduction

Welcome to Spring 2021 and the Data-intensive Computing course. This course covers topics relevant to the emerging areas of Data Science and Data-intensive Computing. Data Science deals with data acquisition, cleaning, exploratory data analysis, statistical modeling, algorithmic data processing, knowledge extraction, prediction, and prescriptive analytics. Data-intensive Computing deals with the computing aspects that enable Data Science, such as infrastructure, big-data architectures, data structures, and algorithms. We will cover both aspects in this course.


There are three recommended texts. All are available for free online.
  1. Data Science: Python Data Science Handbook. J. VanderPlas, O'Reilly. ISBN: 9781491912058, November 2016.
  2. Big data analytics: Data-Intensive Text Processing with MapReduce. J. Lin and C. Dyer, Morgan & Claypool Publishers (October 10, 2010), ASIN : B0094J3FXM.
  3. Hadoop and Spark infrastructure: Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem. D. Eadline, Addison Wesley Professional, ASIN : B017A8UACW, 2015.

We will be using many other references, online sources and textbooks throughout the semester. The details will be provided in the References tab.

Tentative Curriculum

A broad overview of the topics to be covered is given below.

Introduction to Data.
Data Acquisition and cleaning.
Exploratory Data Analysis (EDA) using Python.
Data Visualization.
Statistical modeling.
Algorithms for big-data processing.
Databases for small data and big data.
Infrastructures for big data (Hadoop ecosystem).
High-speed, scalable big-data processing (Spark ecosystem).
Computing on the cloud: Amazon AWS and Google Cloud.
Research issues in Data-intensive computing.

All concepts discussed in the lectures will be reinforced as you design and build three data products that demonstrate (i) a data analytics pipeline in Python, (ii) big-data analytics on Hadoop infrastructure, and (iii) streaming data processing using Spark.