CSE4/587 Spring 2021
Date | Topic | Chapter | Classnotes | To do
---|---|---|---|---
2/2 | Course goals and plans | Course description | |
Today's plan:
Motivation 1 for "lm": here is a list of the top ten machine learning algorithms.
Motivation 2 for "lm": Bloomberg's oil-price prediction method.
We are in Chapter 3. Read Chapter 3. Three important data processing algorithms are discussed in this chapter. We will look into these today, starting with linear regression. Problem 1: Consider the data S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}. What next? What is the value of y for x = 40? We observe that the points lie on a straight line.
A linear model is defined by the functional form:
y = β0 + β1*x
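As a sketch of how Problem 1 can be answered, here is a minimal least-squares fit of β0 and β1 using the standard closed-form formulas (plain Python, no libraries; the variable names are my own):

```python
# Least-squares fit of y = b0 + b1*x for Problem 1's data.
S = [(1, 25), (10, 250), (100, 2500), (200, 5000)]

n = len(S)
sx = sum(x for x, _ in S)            # sum of x
sy = sum(y for _, y in S)            # sum of y
sxx = sum(x * x for x, _ in S)       # sum of x^2
sxy = sum(x * y for x, y in S)       # sum of x*y

# Closed-form least-squares estimates for slope and intercept.
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n

print(b0, b1)          # this data is exactly linear: b0 = 0.0, b1 = 25.0
print(b0 + b1 * 40)    # prediction at x = 40 -> 1000.0
```

Because the four points fall exactly on the line y = 25x, the fit recovers β0 = 0 and β1 = 25, and the prediction for x = 40 is 1000.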
Let's work on two demos:
In the 1990s, during the early days of the Internet, the access mode for home computing was a humongous modem that made a characteristic noise. People subscribed to one of two or three service providers, such as Prodigy or CompuServe, and the web with its search was probably the only application. Even that was slow. Web pages were not fancy, since most people had only line-based terminals for accessing the Internet. One of the popular browsers was called "Mosaic". When you ran a query, it took so long that you could clean up your room by the time the results arrived.
In the late 1990s, Larry Page and Sergey Brin, graduate students at Stanford, worked on the idea of treating the web as a large graph and ranking the nodes of this connected web. This became the core idea for their thesis and for the company they founded, Google, which has since become the byword for "search". The core idea is an algorithm called "PageRank". It is well explained in a paper [1] written about this thesis.
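To make the "web as a graph" idea concrete, here is a tiny illustrative sketch of PageRank by power iteration on a hypothetical three-page web (not Google's actual code; the damping factor 0.85 is a conventional choice, and the link graph is made up):

```python
# Power-iteration PageRank on a toy 3-page link graph (illustrative only).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical pages

d = 0.85                          # damping factor (conventional value)
n = len(links)
rank = {p: 1.0 / n for p in links}  # start with uniform ranks

for _ in range(100):              # iterate until ranks stabilize
    new = {}
    for p in links:
        # A page's rank flows evenly to each page it links to.
        incoming = sum(rank[q] / len(links[q]) for q in links if p in links[q])
        new[p] = (1 - d) / n + d * incoming
    rank = new

# C receives links from both A and B, so it ends up ranked highest.
print(sorted(rank, key=rank.get, reverse=True))   # ['C', 'A', 'B']
```

The key property: a page is important if important pages link to it, which is exactly the recursive idea the iteration converges on.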
Until Google came about, the focus was on improving the algorithms, and the data infrastructure was secondary. There were questions about the storage needed to support this mammoth network of pages and the content being searched. In 2004, two Google engineers presented an overview paper [2] on the Google File System (GFS) and a novel parallel processing algorithm, "MapReduce". A newer version of the paper is available at [3].
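The canonical MapReduce example in the paper is word count. Here is a single-machine sketch of the map, shuffle, and reduce phases (the real system distributes these steps across many machines; the function names here are my own):

```python
from collections import defaultdict

# Map phase: each document emits (word, 1) pairs.
def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

# Shuffle: group intermediate pairs by key (the word).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts emitted for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data data everywhere"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)   # {'big': 2, 'data': 3, 'ideas': 1, 'everywhere': 1}
```

The design point: because map runs independently per document and reduce runs independently per key, both phases parallelize naturally, which is what GFS plus MapReduce exploited at scale.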
Of course, Google did not release the code as open source. Doug Cutting, an independent programmer living in Washington State, reverse-engineered GFS and named his implementation Hadoop, after the yellow toy elephant that belonged to his son. Yahoo hired him and made Hadoop open source.
In 2008, Yahoo, the National Science Foundation, and the Computing Community Consortium gathered educators, industry, and other important stakeholders for the inaugural Hadoop Summit. Look at the list of attendees here. I learned Hadoop from Doug Cutting. It is imperative that we study Hadoop and MapReduce together to understand their collective contribution.
Let us understand the foundations of the Big Data infrastructure that has become an integral part of most computing environments. We will approach this study by processing unstructured data, as discussed in the text [4].