Lecture Notes

Working with Hadoop MR

Today we will work on some fundamentals:
  • First we will learn how to formulate problems for parallel processing with MR, starting with the simple word count (a minimal sketch follows below).
  • Then a numerical problem to understand how MR works
  • We will write another parallel solution for a subject of topical interest: elections
  • We will discuss the PageRank problem and how it was solved using MR. See the 25 Billion Dollar Algorithm
More discussion on PageRank and a presentation.
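
As a reference point, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API. It follows the canonical WordCount example; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenize each line and emit (word, 1) for every token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional: local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits (word, 1) for every token and the reducer sums the counts per word; reusing the reducer as a combiner is optional, but it reduces the amount of data shuffled between the map and reduce phases.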

For deploying and running MR solutions:
  • You can create a cluster from scratch: you need at least one NameNode and one DataNode in the minimal configuration.
  • You can leverage virtualization to create a virtual cluster.
  • You can run MR jobs on cloud providers such as Amazon AWS, Google Cloud, Windows Azure, and many others.
  • You can install a simple virtual image on your laptop, for learning purposes only.
We will examine a few of the solutions above. In the virtualization approach you have a host, a guest, the guest's file system, and HDFS on the guest. For MR details, see the Apache Hadoop documentation.

Midterm Discussion & Project 1 & MR

We will discuss the midterm, Project 1, and MR.

Working with Amazon AWS-EMR Environment

The preferred method for working with prj2 is to develop the solution in Java (or Python) and test the jar in your local environment. Then, if you want, you can run it on AWS EMR. A critical step we tend to overlook in this process is actually solving the problem using MR. Word count and other simple problems are discussed in the literature. The fun and the challenge are in discovering the beauty of MR for solving other, less common big data problems.
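
As a sketch of that workflow, the driver below uses Hadoop's Tool/ToolRunner so the same jar entry point can be tested against a local Hadoop installation and then submitted unchanged as an EMR step. The class name Prj2Driver and the argument layout are assumptions for illustration, not part of the project specification.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver skeleton (class name is not part of the project spec)
public class Prj2Driver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "prj2");
    job.setJarByClass(Prj2Driver.class);
    // set the mapper, reducer, and key/value classes for the actual solution here
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (e.g. -D key=value) before calling run()
    System.exit(ToolRunner.run(new Prj2Driver(), args));
  }
}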


We will also discuss Tableau-like visualization:
  • Tableau User Interface: Connect to data sources
  • Tableau worksheet
  • Working with data: the Sheets --> Dashboard --> Story concepts

Tableau was suggested just to give you examples of effective visualization (useful, engaging, requiring no interpretation).

If we have time, we will start learning about problem solving using MR, beginning with PageRank.

MapReduce Algorithm Design

We will
  • Review MR: Chapter 2: Lin and Dyer
  • Discuss items yet to be covered from March 7: PageRank
  • Discuss iterative MapReduce: example K-means, with a Counter in the configuration; see the handout (a driver-loop sketch follows this list)
  • Section 3.2 Lin & Dyer text: MapReduce with complex keys: pairs and stripes for text processing
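
As a hedged sketch of the iterative pattern (not a full K-means), the driver below runs one MR job per iteration, passes the current centroid location through the Configuration, and reads a Counter back after each job to decide whether to iterate again. The class, counter, and configuration-key names are assumptions; the mapper and reducer bodies are placeholders for the real assignment and recomputation steps.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIterativeDriver {

  // Reducers bump this counter whenever a centroid moves; the driver reads it
  // back after each job to decide whether another iteration is needed.
  public enum Convergence { CENTROIDS_MOVED }

  public static class AssignMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Real K-means: load the current centroids named in the configuration,
      // find the nearest centroid for this point, emit (centroidId, point).
      String centroidPath = context.getConfiguration().get("kmeans.centroid.path");
      context.write(new Text("centroid-0"), value);   // placeholder assignment
    }
  }

  public static class RecomputeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Real K-means: average the points to get the new centroid, compare it
      // with the old one, and increment the counter only if it moved.
      context.getCounter(Convergence.CENTROIDS_MOVED).increment(1);  // placeholder: always "moved"
      context.write(key, new Text("new-centroid"));                  // placeholder output
    }
  }

  public static void main(String[] args) throws Exception {
    long moved = Long.MAX_VALUE;
    int iteration = 0;
    while (moved > 0 && iteration < 20) {            // stop when no centroid moves (or after 20 rounds)
      Configuration conf = new Configuration();
      conf.set("kmeans.centroid.path", args[1] + "/iter-" + iteration);  // pass centroid location via the configuration

      Job job = Job.getInstance(conf, "kmeans iteration " + iteration);
      job.setJarByClass(KMeansIterativeDriver.class);
      job.setMapperClass(AssignMapper.class);
      job.setReducerClass(RecomputeReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1] + "/iter-" + (iteration + 1)));
      job.waitForCompletion(true);

      Counter c = job.getCounters().findCounter(Convergence.CENTROIDS_MOVED);
      moved = c.getValue();
      iteration++;
    }
  }
}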

MapReduce Algorithm Design (Continued)

We will
  • Discuss iterative MapReduce: example K-means, with a Counter in the configuration; see the block diagram for KMeans MR
  • Section 3.2 Lin & Dyer text: MapReduce with complex keys: pairs and stripes for text processing (a sketch of both patterns follows this list)
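
To make the two Section 3.2 patterns concrete, here is a sketch of a pairs mapper and a stripes mapper/reducer for word co-occurrence counts. Co-occurrence is taken to be "appearing on the same input line" (an assumption for brevity), and the pairs key is a tab-joined Text; a real implementation would use a custom WritableComparable pair. The driver wiring is omitted.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoOccurrence {

  // PAIRS: emit ((w, u), 1) for every co-occurring pair; a tab-joined Text key
  // stands in for a proper custom WritableComparable pair.
  public static class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] terms = value.toString().trim().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        for (int j = 0; j < terms.length; j++) {
          if (i != j && !terms[i].isEmpty() && !terms[j].isEmpty()) {
            context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
          }
        }
      }
    }
  }

  // STRIPES: emit one (w, {u1: c1, u2: c2, ...}) associative array per word per line.
  public static class StripesMapper extends Mapper<Object, Text, Text, MapWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] terms = value.toString().trim().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        if (terms[i].isEmpty()) continue;
        MapWritable stripe = new MapWritable();
        for (int j = 0; j < terms.length; j++) {
          if (i == j || terms[j].isEmpty()) continue;
          Text neighbor = new Text(terms[j]);
          IntWritable count = (IntWritable) stripe.get(neighbor);
          stripe.put(neighbor, new IntWritable(count == null ? 1 : count.get() + 1));
        }
        context.write(new Text(terms[i]), stripe);
      }
    }
  }

  // STRIPES reducer: element-wise sum of the stripes received for each word.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable total = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable prev = (IntWritable) total.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          total.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
        }
      }
      context.write(key, total);
    }
  }
}

The pairs reducer is the same summing reducer as in word count; the trade-off is many small (pair, count) records for pairs versus fewer but larger in-memory associative arrays for stripes.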

Other MR algorithms

We will discuss
  • Project 2 briefly
  • Section 3.2 Lin & Dyer text: MapReduce with complex keys: pairs and stripes for text processing
  • Computing relative frequencies: the map emits more than a simple (key, value) pair
  • Reduce-side join and map-side join
  • Inverted Index: Chapter 4
  • Value-to-key conversion design pattern to improve the baseline MR inverted index (a baseline sketch follows this list)
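
For reference, here is a baseline inverted-index sketch in the Chapter 4 style: the mapper emits (term, docId) and the reducer buffers, sorts, and concatenates the posting list. Taking the docId from the input file name is an assumption made to keep the sketch self-contained.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Mapper: emit (term, docId) for every term in the document; the docId is
  // taken from the input file name to keep the sketch self-contained.
  public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
      for (String term : value.toString().split("\\s+")) {
        if (!term.isEmpty()) {
          context.write(new Text(term), new Text(docId));
        }
      }
    }
  }

  // Baseline reducer: buffer the postings in memory, sort, and emit the list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> postings = new ArrayList<>();
      for (Text docId : values) {
        postings.add(docId.toString());
      }
      Collections.sort(postings);
      context.write(key, new Text(String.join(",", postings)));
    }
  }
}

The value-to-key conversion improves on this baseline by moving the docId into a composite (term, docId) key, partitioning on the term alone, and adding a grouping comparator, so the framework sorts the postings during the shuffle and the reducer no longer has to buffer them in memory.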