CSE4/587 Data-Intensive Computing
Welcome to the Spring 2016 Data-Intensive Computing course. This course covers topics relevant to the emerging areas of Data Science and Data-Intensive Computing.
Data Science deals with data acquisition, cleaning, exploratory data analysis, modeling, and knowledge extraction. Data-intensive computing deals with the computing aspects, such as the infrastructure, data structures, and algorithms, that enable Data Science. We will cover both aspects in this course.
The required textbook for the course is:
Title: Doing Data Science: Straight Talk from the Frontline, 1st Edition
Author(s): Cathy O'Neil and Rachel Schutt
ISBN: 978-1449358655
Publisher: O'Reilly Media
There will be many other external/online references for the Data-intensive computing topics of the course. That list will be provided soon.
A broad overview of the topics to be covered is given below. All the components will be explored hands-on through at least 3 programming projects.
We will use RStudio
We will use Tableau or a comparable tool
The first 5 chapters of the Doing Data Science book above
We will use resources on Amazon AWS or Google Cloud Platform
We will use Amazon EC2 with a Spark image or virtual machines on your laptop
You can use Apache resources for studying Hadoop and Spark. For algorithm development for unstructured data, look at the following text:
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Synthesis Lectures on Human Language Technologies, 2010, Vol. 3, No. 1, Pages 1-177 (doi: 10.2200/S00274ED1V01Y201006HLT007). An online version of this text is also available through UB Libraries, since UB subscribes to Morgan and Claypool Publishers. Online version available at: http://lintool.github.com/MapReduceAlgorithms/index.html