Lecture Notes

Other MR algorithms

We will discuss
  • Project 2 briefly
  • Section 3.2 Lin & Dyer text: MapReduce with complex keys: pairs and stripes for text processing
  • Computing relative frequency: Map emits more than simple key,value
  • Reduce side join and Map Side join
  • Inverted Index: Chapter 4
  • Value-Key conversion design pattern to improve baseline MR for inverted index

MR algorithms: Graph algorithms

This is from Ch.4 and 5 of Lyn and Dyer. Ch.4 is on MR invwrted index and Ch.5 is on graph algorithms.
  • Web search problem has three major components: web crawling, inverted index development and information retrieval. See Web search
  • Web crawling and inverted index are done offline to facilitate a fast realtime response for IR. We will begin with an example. Then we will look at MR algs for inverted indexing. We will begin with a baseline implementation, and then a revised one
  • Graph algorithms: We will discuss graph problems, graph representations
  • We will discuss Dijkstra's algorithm and then MR alg for paralell breadth first search for large graphs
  • We will move onto pagerank algorithm as specified in Ch.5

MR algorithms: Graph algorithms (contd.)

This is from 5 of Lin and Dyer. Ch.5 is on graph algorithms.
  • The details of graph algorithms from Dijkstra's shortest path to parallel MR version of this algorithm
  • Graph algorithms: We will discuss graph problems, graph representations; Mr Graphs
  • MR graph algorithm tracing the steps. Graph data
  • We will move onto pagerank algorithm as specified in Ch.5

Apache Pig

Notes from last three lectures: Inverted index MR, Breadth first Search MR We will study Apache Pig as a means of specifying MR workflow.
  • Abstractions for defining MR workflows
  • Pig Data flow lanaguage Pig Dataflow
  • How to run Pig on AWS?: as a EMR step: Demo1; as an interactive shell to a cluster: Demo2

Apache Spark

We will look couple of demos of Pig on AWS.
  • PIG demos on aws Hadoop
  • What is Spark? Spark Intro
  • pySpark demo

Apache Spark

Spark details as described in RDD paper based on PhD. thesis of Matei Zaharia. Spark Programming Model

Final Review

Final Exam Topics