Chapter 3
Jimmy Lin and Chris Dyer
MapReduce Algorithm Design
"Simplicity" is the theme:
fast "simple operations" on a large set of data.
Most web-mobile-internet application data yield to embarrassingly
parallel processing
General idea: you write the Mapper and Reducer (and optionally the
Combiner and Partitioner); the execution framework takes care of the rest.
Of course, you configure... the splits, the # of reducers, the input
path, the output path, etc.
The programmer has NO control over
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific
reducer
However, what control does a programmer have?
1. The ability to construct complex structures as keys and values to
store and communicate partial results
2. The ability to execute user-specified initialization code at the
beginning of a map or reduce task, and termination code at the end
3. The ability to preserve state in both mappers and reducers across
multiple input/intermediate key-value pairs (e.g., counters)
4. The ability to control the sort order of intermediate keys, and hence
the order in which a reducer encounters them
5. The ability to partition the key space among reducers
Go through 3.2, 3.3, 3.4, 3.5, 3.6, 3.7.
1. Start with 3.2: simple map and reduce; note that sum is not the
only operation; we can use other operations (mean, max, etc.) and
other complex operations.
2. Figures 3.2, 3.3 beautifully illustrate the "local combiner"
optimization and, more importantly,
the concept of "preserving state across keys"
with the Initialize and Close methods.
This represents taking control back from Hadoop
into your own code: the in-mapper ("in-line") combiner.
Three approaches: default combiner, user-provided
custom combiner (the Hadoop runtime schedules it), in-mapper combiner.
Issues: scalability bottleneck, heap space problem
(memory usage).
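The in-mapper combining pattern above can be sketched in plain Python (an illustrative simulation, not Hadoop code; the dict stands in for task-level state set up in Initialize and flushed in Close):

```python
from collections import defaultdict

def mapper_with_inmapper_combining(lines):
    """Word-count mapper that preserves state across input records:
    counts accumulate in a dict (Initialize) and are emitted only
    once, at the end of the task (Close)."""
    counts = defaultdict(int)       # Initialize: per-task state
    for line in lines:              # Map: process each input record
        for word in line.split():
            counts[word] += 1
    return sorted(counts.items())   # Close: emit the partial sums

pairs = mapper_with_inmapper_combining(["a b a", "b c"])
# emits one pair per distinct word instead of one per occurrence
```

The memory-usage issue mentioned above is visible here: the dict grows with the number of distinct words seen by the task, which is where the heap-space problem comes from.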
3. How about the algorithmic correctness of the local combiner?
Suppose we wish to compute the "mean"; how will you do this? Identity
mapper + reducer.
Look at 3.4.
If we apply the solution we discussed in figs. 3.2, 3.3, it is
incorrect:
mean(1,2,3,4,5) is not equal to mean(mean(1,2), mean(3,4,5)) == wrong!
Introduce a combiner: see 3.5.
Rule: the input <key, value> type of a combiner should be the same as
its output <key, value> type.
3.5 violates this rule.
3.6 is the correct version.
Let's discuss 3.6 further. The input to the mapper need not be <string t,
integer r>; it is more efficient to have <docid, doc>.
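The fix in 3.6 can be sketched in Python (a simulation of the logic, not Hadoop code): mapper and combiner both emit (sum, count) pairs, so the combiner's input and output types match and the reducer recovers the true mean.

```python
def mapper(key, values):
    # emit (sum, count) rather than a pre-computed mean
    return (sum(values), len(values))

def combiner(partials):
    # input and output are both (sum, count) pairs -- types match,
    # and the merge is associative, so combining is safe
    s = sum(p[0] for p in partials)
    c = sum(p[1] for p in partials)
    return (s, c)

def reducer(partials):
    s, c = combiner(partials)   # same associative merge
    return s / c                # divide only at the very end

# mean(1,2,3,4,5) is correct even when split across two combiners
left  = combiner([mapper(None, [1, 2])])
right = combiner([mapper(None, [3, 4, 5])])
assert reducer([left, right]) == 3.0
```

Contrast with the broken version: averaging the two partial means (1.5 and 4.0) would give 2.75, not 3.0.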
Section 3.2 Pairs and stripes pattern
Example: co-occurrence matrix of a
large corpus: an n x n matrix where n is the number of unique words in
the corpus. ("corpora" is the plural of corpus)
Observations:
Assume n words, with i and j as row and column indices; cell M(i,j) will
hold the number of times w(i) co-occurred with w(j).
For example, w(i) = <Ron Paul> and w(j) = <election>
in the Twitter feed we collected in March 2012.
For example, in Dec 2013 <Sandy> would have co-occurred the most
number of times with <?>
48MB of data blows up to 888MB in a naive MR co-occurrence
implementation: O(n^2) space.
We need better solutions/algorithms.
Let's look at Algorithm 3.8 (pairs) and 3.9 (stripes).
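The two representations can be sketched in Python (an illustrative simulation, not the book's pseudocode; here "co-occurrence" means appearing in the same token list, one possible neighbor definition):

```python
from collections import defaultdict

def pairs_mapper(tokens):
    # pairs: emit ((wi, wj), 1) for every co-occurring pair --
    # many small key-value pairs, heavy shuffle traffic
    out = []
    for i, wi in enumerate(tokens):
        for wj in tokens[:i] + tokens[i+1:]:
            out.append(((wi, wj), 1))
    return out

def stripes_mapper(tokens):
    # stripes: emit (wi, {wj: count}) -- one associative array
    # ("stripe") per word, far fewer but larger key-value pairs
    out = []
    for i, wi in enumerate(tokens):
        stripe = defaultdict(int)
        for wj in tokens[:i] + tokens[i+1:]:
            stripe[wj] += 1
        out.append((wi, dict(stripe)))
    return out
```

The experiment numbers below reflect exactly this trade-off: stripes emits orders of magnitude fewer intermediate pairs, which combiners can also aggregate more effectively.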
Experiment: 19 slaves (2-core machines): 666 sec for stripes
vs. 3758 sec for pairs.
For pairs:
5.7GB corpus --> mappers generated 2.6 billion intermediate
key-value pairs (31.2GB)
--> after combiners, 1.1 billion key-value pairs --> reducers
emitted 142 million <k,v> pairs.
For stripes:
5.7GB --> 653 million intermediate <k,v> pairs (48.1GB) --> after
combiners, 28.8 million <k,v>
--> reducers emitted 1.69 million <k,v>.
Both stripes and pairs gave linear scalability... see fig. 3.10.
The next improvement is the relative frequency of co-occurrence rather
than the absolute count:
f(wj | wi) = count(wi, wj) / count(wi, *),
where count(wi, *) is the total count of all pairs with wi on the left.
In the pairs approach we emit a special key <wi, *> which
holds the total count, and we update the partitioner to
use only the left word of the pair to partition the data, so that all
keys with the same left word go to the same reducer.
The order inversion design pattern is used in the above example; see the
example case in fig. 3.13.
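Order inversion can be simulated in a few lines of Python (a sketch, not Hadoop code): the sort order is engineered so the marginal key (wi, '*') reaches the reducer before any (wi, wj) key.

```python
def relative_frequencies(pair_counts):
    """Order-inversion sketch: '*' sorts before letters in ASCII,
    mimicking the custom sort order that delivers the marginal key
    (wi, '*') to the reducer first. Assumes all keys sharing a left
    word reach the same reducer (partitioner hashes the left word)."""
    result = {}
    marginal = 0
    for (wi, wj), count in sorted(pair_counts.items()):
        if wj == '*':
            marginal = count            # remember count(wi, *)
        else:
            result[(wi, wj)] = count / marginal
    return result

# hypothetical reducer input for the left word 'dog'
counts = {('dog', '*'): 4, ('dog', 'cat'): 1, ('dog', 'runs'): 3}
# f(cat|dog) = 1/4, f(runs|dog) = 3/4
```

The key point: the reducer needs only O(1) state (the current marginal), because the sort order guarantees the divisor arrives before anything it must divide.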
End of Chapter 3.
(Don't worry about 3.4 through the end.)
Source code:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop
Write the partitioner class and set it in the job conf.
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/class-use/Partitioner.html
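The logic the custom Partitioner must implement can be sketched in Python (in Hadoop you would subclass org.apache.hadoop.mapreduce.Partitioner, override getPartition, and register it in the job configuration; this is only a simulation of the routing decision):

```python
def pair_partitioner(key, num_reducers):
    # hash only the left word of the (wi, wj) pair, so that all
    # pairs sharing wi -- including the marginal key (wi, '*') --
    # are routed to the same reducer
    wi, _ = key
    return hash(wi) % num_reducers

# (wi, '*') and (wi, wj) land on the same reducer partition
p1 = pair_partitioner(("dog", "*"), 4)
p2 = pair_partitioner(("dog", "cat"), 4)
assert p1 == p2
```

Without this, the default partitioner would hash the whole pair, and the marginal count could end up on a different reducer than the counts it must normalize.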
For books
http://www.gutenberg.org/
To facilitate transfers between file systems (Hadoop to any and back):
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html