CSE4/587 Data-Intensive Computing

Introduction: What will I learn in the course?

January 14 by Bina

Welcome to Spring 2016 and the Data-Intensive Computing course. This course covers topics relevant to the emerging areas of Data Science and Data-Intensive Computing.

Data Science deals with data acquisition, cleaning, exploratory data analysis, modeling and knowledge extraction. Data-intensive computing deals with the computing aspects, such as the infrastructure, data structures and algorithms, that enable Data Science. We will cover both aspects in this course.

The required textbook for the course is:

Title: Doing Data Science: Straight Talk from the Frontline, 1st Edition
Author(s): Cathy O'Neil and Rachel Schutt
ISBN: 978-1449358655
Publisher: O'Reilly Media

There will be many other external/online references for the data-intensive computing topics of the course. That list will be provided soon.

Tentative Curriculum

A broad overview of the topics to be covered is given below. All the components will be explored hands-on through at least 3 programming projects.

Data Acquisition and cleaning; Exploratory Data Analysis (EDA) using the R language

We will use RStudio

Data Visualization

We will use Tableau or a comparable tool

Statistical modeling

First 5 chapters of the Doing Data Science book above

Hadoop V1 and V2: principles, storage and the entire ecosystem

We will use resources on Amazon AWS or Google Cloud Platform

Hive, HBase, Pig and similar systems
Solving large-scale problems using Hadoop-based systems
Spark data stack: in-memory computations, data frames, DAGs, and applications

We will use Amazon EC2 with a Spark image or virtual machines on your laptop

Research issues in Data-intensive computing

You can use Apache resources for studying Hadoop and Spark. For algorithm development for unstructured data, look at the following text:

Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Synthesis Lectures on Human Language Technologies, 2010, Vol. 3, No. 1, Pages 1-177 (doi: 10.2200/S00274ED1V01Y201006HLT007). This text is available online through UB Libraries, since UB subscribes to Morgan and Claypool Publishers. An online version is also available at: http://lintool.github.com/MapReduceAlgorithms/index.html

Data-Intensive Computing and Data Science at Buffalo
