CSE4/587 Spring 2017
Date | Topic and Chapter | Classnotes | Handouts |
---|---|---|---|
1/29 | Introduction to Data-Intensive Computing: Ch. 1 | classnotes | |
1/31 | The Data Science Road Map: Ch. 2 | DS Roadmap | Jupyter/RStudio Handout |
2/5 | Lab 1 Discussion | Lab1 | Lab1 Files |
2/7 | Exploratory Data Analysis (EDA) | Intro to R | Demo Data |
2/12,14 | Statistical Modeling and ML Algorithms | Models and Algorithms in R | Demo Description |
2/14 | Let's Review EDA & Algorithms for EDA | Models and Algorithms in R | Demo links inline |
2/19 | Data Infrastructure | Introduction to MapReduce | InfraStructure I |
2/26 | Big Data Infrastructure | Hadoop, Yarn and Beyond | InfraStructure II |
2/28 | Big Data Algorithms | MapReduce | Algorithms |
3/5 | Design MR Solutions | MapReduce Combiner | Algorithms |
3/12 | Demo and discussion of MR/HDFS | Download Hadoop, Demo: US presidential speeches | MapReduce: How to code? |
3/29 | Review for Exam; Clarification on Lab 2 | MR/Hadoop; Data Analysis Algorithms | Exam Review |
4/2 | Word Co-occurrence | MR/Hadoop; Data Analysis Algorithms | WordCO |
4/4 | Graph Algorithms | MR/Hadoop; Data Analysis Algorithms | Graph Alg |
4/9 | Apache Spark | Notes | Apache Spark |
4/18 | Naive Bayes | Notes | Bayesian Classification |
4/23 | Logistic Regression & ML Overview | Notes | ML and Logit Classification |
4/25 | Working on Spark | Lab3: How to get started? | Working on Spark |
4/25 | Evaluating the Classifiers | Applies to Lab3 and Final Exam | Evaluation Metrics |
5/2 | Take inventory of what you learned | Final Exam Topics | Course Review |
5/7 | Final Review | Final Exam Topics | Review for the Final Exam |
Consider the sample S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}. What next? What is the value of y for x = 40? We observe that a linear model fits the data. A linear model is defined by the functional form:

y = β0 + β1*x

For this sample β0 = 0 and β1 = 25, so the model predicts y = 25 × 40 = 1000 for x = 40.
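To make this concrete, here is a minimal check in base R using the sample S above (the names `x`, `y`, and `fit` are just illustrative):

```r
# The sample S: y grows linearly with x
x <- c(1, 10, 100, 200)
y <- c(25, 250, 2500, 5000)

# Fit the linear model y = beta0 + beta1 * x by least squares
fit <- lm(y ~ x)
coef(fit)   # beta0 is (numerically) 0, beta1 is 25

# Predict y for x = 40: 25 * 40 = 1000
predict(fit, newdata = data.frame(x = 40))
```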
Let's work on two demos. The first uses a small credit data set:
age  <- c(25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33)        # applicant age in years
loan <- c(40, 60, 80, 20, 120, 18, 95, 62, 100, 220, 150)    # loan amount (units as given in the demo)
default <- c("N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y")  # whether the applicant defaulted
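As a minimal sketch of how we might model this demo in base R (the names `credit` and `model` are illustrative, and with only 11 observations `glm` may warn that fitted probabilities are numerically 0 or 1):

```r
# Assemble the three demo vectors into a data frame
credit <- data.frame(age, loan,
                     default = factor(default, levels = c("N", "Y")))

# Logistic regression: model P(default = "Y") from age and loan
model <- glm(default ~ age + loan, data = credit, family = binomial)
summary(model)

# Predicted default probability for a hypothetical applicant
predict(model, newdata = data.frame(age = 30, loan = 90), type = "response")
```

We will return to logistic regression and classifier evaluation later in the course (see the 4/23 and 4/25 entries above).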
In the 1990s, during the early days of the Internet, the access mode for home computing was a bulky modem that made its characteristic noise. People subscribed to one of two or three service providers, such as Prodigy or CompuServe, and the web with its search capability was probably the only application. Even that was slow. Web pages were not fancy, since most people accessed the Internet through line-based terminals. One of the popular browsers was called "Mosaic". When you queried for something, it took so long that you could clean up your room by the time the results arrived.
In the late 1990s, Larry Page and Sergey Brin, then graduate students at Stanford, worked on the idea of treating the web as a large graph and ranking the nodes of this connected web. This became the core idea of their thesis and of the company they founded: Google, which has since become a byword for "search". The core idea is an algorithm called "PageRank", well explained in a paper [1] written about the thesis.
Until Google came along, the focus was on improving the algorithms, and the data infrastructure was secondary. There were open questions about the storage needed to support this mammoth network of pages and the content being searched. In 2004, two Google engineers presented an overview paper [2] on the Google File System (GFS) and a novel parallel processing model, "MapReduce". A newer version of the paper is available at [3].
Of course, Google did not release the code as open source. Doug Cutting, an independent programmer then living in Washington State, built an open-source implementation from the published papers and named it Hadoop, after the yellow toy elephant that belonged to his son. Yahoo! hired him and backed Hadoop as an open-source project.
In 2008, Yahoo!, the National Science Foundation, and the Computing Community Consortium brought together educators, industry, and other important stakeholders for the inaugural Hadoop Summit. Look at the list of attendees here; I learned Hadoop from Doug Cutting. It is imperative that we study Hadoop and MapReduce together to understand their collective contribution.
Let us understand the foundation of the Big Data infrastructure that has become an integral part of most computing environments. We will approach this study by processing unstructured data, as discussed in the text [4].
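To preview the programming model before we install anything, here is a minimal single-machine word-count sketch in base R (the sample `lines` and the names `map_fn`, `mapped`, `grouped` are illustrative; a real MapReduce system runs the map and reduce phases in parallel across a cluster and handles the shuffle for you):

```r
# Unstructured input: a few lines of text
lines <- c("the quick brown fox",
           "the lazy dog",
           "the quick dog")

# Map phase: each line emits (word, 1) pairs
map_fn <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  lapply(words, function(w) list(key = w, value = 1))
}
mapped <- unlist(lapply(lines, map_fn), recursive = FALSE)

# Shuffle phase: group the emitted values by key (word)
keys    <- sapply(mapped, `[[`, "key")
grouped <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce phase: sum the counts for each word
counts <- sapply(grouped, sum)
counts   # e.g., the = 3, quick = 2, dog = 2, ...
```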