Data-Intensive Computing and Data Science at Buffalo

CSE4/587 Data-Intensive Computing

Introduction: What will I learn in the course?

Welcome to Spring 2016 and the Data-Intensive Computing course. This course covers topics relevant to the emerging areas of Data Science and Data-Intensive Computing.


Data Science deals with data acquisition, cleaning, exploratory data analysis, modeling and knowledge extraction. Data-intensive computing deals with the computing aspects, such as the infrastructure, data structures and algorithms, that enable Data Science. We will cover both aspects in this course.

The required textbook for the course is Doing Data Science: Straight Talk from the Frontline, 1st Edition, by Cathy O'Neil and Rachel Schutt (O'Reilly Media, ISBN: 978-1449358655).

There will be many other external/online references for the Data-intensive computing topics of the course. That list will be provided soon.

Tentative Curriculum

A broad overview of the topics to be covered is given below. All the components will be explored hands-on through at least three programming projects.

  • Data Acquisition and cleaning; Exploratory Data Analysis (EDA) using the R language
  • We will use RStudio
  • Data Visualization
  • We will use Tableau or a comparable tool
  • Statistical modeling
  • First 5 chapters of the Doing Data Science book above
  • Hadoop V1 and V2: principles, storage and the entire ecosystem
  • MapReduce algorithms and applications
    We will use resources on Amazon AWS or Google Cloud Platform
  • Hive, HBase, Pig and similar systems
  • Solving large-scale problems using Hadoop-based systems
  • Spark data stack: in-memory computations, data frames, DAGs, and applications
  • We will use Amazon EC2 with a Spark image or virtual machines on your laptop
  • Research issues in Data-intensive computing

You can use Apache resources for studying Hadoop and Spark. For algorithm development for unstructured data, see the following text:

Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Synthesis Lectures on Human Language Technologies, 2010, Vol. 3, No. 1, Pages 1-177 (doi: 10.2200/S00274ED1V01Y201006HLT007). An online version of this text is also available through UB Libraries, since UB subscribes to Morgan and Claypool Publishers. Online version available at: http://lintool.github.com/MapReduceAlgorithms/index.html