Chapter 3
Jimmy Lin and Chris Dyer
MapReduce Algorithm Design
"Simplicity" is the theme:
fast "simple operations" on a large set of data.
Most web-mobile-internet application data yield to embarrassingly
parallel processing
General idea: you write the Mapper and Reducer (and optionally the
Combiner and Partitioner); the execution framework takes care of the rest.
Of course, you configure... the splits, the # of reducers, the input
path, the output path, etc.
The programmer has NO control over
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific
reducer
However, what control does a programmer have?
1. The ability to construct complex structures as keys and values to
store and communicate partial results
2. The ability to execute user-specified initialization code at the
beginning of a map or reduce task, and termination code at the end
3. The ability to preserve state in both mappers and reducers across
multiple input/intermediate key-value pairs (e.g., counters)
4. The ability to control the sort order of intermediate keys, and hence
the order in which a reducer encounters them
5. The ability to partition the key space among reducers
Go through 3.2, 3.3, 3.4, 3.5, 3.6, 3.7.
1. Start with 3.2: simple map and reduce; note that sum is not the
only operation; we can use other operations (mean, max, etc.) and
other complex operations.
2. Figures 3.2, 3.3 beautifully illustrate the "local combiner"
optimization and, more importantly,
the concept of "preserving state across keys"
with the Initialize and Close methods.
This represents taking control back from Hadoop
into your own code: the in-mapper ("in-line") combiner.
Three approaches: default combiner, user-provided
custom combiner (the Hadoop runtime schedules it), in-mapper combiner.
Issues: scalability bottleneck, heap space problem
(memory usage).
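The in-mapper combining pattern above can be sketched in plain Python (an illustrative simulation, not Hadoop code; the dict stands in for task-level state set up in Initialize and flushed in Close):

```python
from collections import defaultdict

def mapper_with_inmapper_combining(lines):
    """Word-count mapper that preserves state across input records:
    counts accumulate in a dict (Initialize) and are emitted only
    once, at the end of the task (Close)."""
    counts = defaultdict(int)       # Initialize: per-task state
    for line in lines:              # Map: process each input record
        for word in line.split():
            counts[word] += 1
    return sorted(counts.items())   # Close: emit the partial sums

pairs = mapper_with_inmapper_combining(["a b a", "b c"])
# emits one pair per distinct word instead of one per occurrence
```

The memory-usage issue mentioned above is visible here: the dict grows with the number of distinct words seen by the task, which is where the heap-space problem comes from.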
3. How about the algorithmic correctness of the local combiner?
Suppose we wish to compute the "mean"; how will you do this? Identity
mapper + reducer.
Look at 3.4.
If we apply the solution we discussed in figs. 3.2, 3.3, it is
incorrect:
mean(1,2,3,4,5) is not equal to mean(mean(1,2), mean(3,4,5)) == wrong!
Introduce a combiner: see 3.5.
Rule: the input <key, value> type of a combiner should be the same as
its output <key, value> type.
3.5 violates this rule.
3.6 is the correct version.
Let's discuss 3.6 further. The input to the mapper need not be <string t,
integer r>; it is more efficient to have <docid, doc>.
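The fix in 3.6 can be sketched in Python (a simulation of the logic, not Hadoop code): mapper and combiner both emit (sum, count) pairs, so the combiner's input and output types match and the reducer recovers the true mean.

```python
def mapper(key, values):
    # emit (sum, count) rather than a pre-computed mean
    return (sum(values), len(values))

def combiner(partials):
    # input and output are both (sum, count) pairs -- types match,
    # and the merge is associative, so combining is safe
    s = sum(p[0] for p in partials)
    c = sum(p[1] for p in partials)
    return (s, c)

def reducer(partials):
    s, c = combiner(partials)   # same associative merge
    return s / c                # divide only at the very end

# mean(1,2,3,4,5) is correct even when split across two combiners
left  = combiner([mapper(None, [1, 2])])
right = combiner([mapper(None, [3, 4, 5])])
assert reducer([left, right]) == 3.0
```

Contrast with the broken version: averaging the two partial means (1.5 and 4.0) would give 2.75, not 3.0.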
Section 3.2 Pairs and stripes pattern
Example: co-occurrence matrix of a
large corpus: an n x n matrix where n is the number of unique words in
the corpus. ("corpora" is the plural of corpus)
Observations:
Assume n words, with i and j as row and column indices; cell M(i,j) will
hold the number of times w(i) co-occurred with w(j).
For example, w(i) = <Ron Paul> and w(j) = <election>
in the Twitter feed we collected in March 2012.
For example, in Dec 2013 <Sandy> would have co-occurred the most
number of times with <?>
48MB of data blows up to 888MB in a naive MR co-occurrence
implementation: O(n^2) space.
We need better solutions/algorithms.
Let's look at Algorithm 3.8 (pairs) and 3.9 (stripes).
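The two representations can be sketched in Python (an illustrative simulation, not the book's pseudocode; here "co-occurrence" means appearing in the same token list, one possible neighbor definition):

```python
from collections import defaultdict

def pairs_mapper(tokens):
    # pairs: emit ((wi, wj), 1) for every co-occurring pair --
    # many small key-value pairs, heavy shuffle traffic
    out = []
    for i, wi in enumerate(tokens):
        for wj in tokens[:i] + tokens[i+1:]:
            out.append(((wi, wj), 1))
    return out

def stripes_mapper(tokens):
    # stripes: emit (wi, {wj: count}) -- one associative array
    # ("stripe") per word, far fewer but larger key-value pairs
    out = []
    for i, wi in enumerate(tokens):
        stripe = defaultdict(int)
        for wj in tokens[:i] + tokens[i+1:]:
            stripe[wj] += 1
        out.append((wi, dict(stripe)))
    return out
```

The experiment numbers below reflect exactly this trade-off: stripes emits orders of magnitude fewer intermediate pairs, which combiners can also aggregate more effectively.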
Experiment: 19 slaves (2-core machines): 666 sec for stripes
vs. 3758 sec for pairs.
For pairs:
5.7GB corpus --> mappers generated 2.6 billion intermediate
key-value pairs (31.2GB)
--> after combiners, 1.1 billion key-value pairs --> reducers
emitted 142 million <k,v> pairs.
For stripes:
5.7GB --> 653 million intermediate <k,v> pairs (48.1GB) --> after
combiners, 28.8 million <k,v>
--> reducers emitted 1.69 million <k,v>.
Both stripes and pairs gave linear scalability... see fig. 3.10.
The next improvement is the relative frequency of co-occurrence rather
than the absolute count:
f(wj | wi) = count(wi, wj) / count(wi, *),
where count(wi, *) is the total count of all pairs with wi on the left.
In the pairs approach we emit a special key <wi, *> which
holds the total count, and we update the partitioner to
use only the left word of the pair to partition the data, so that all
keys with the same left word go to the same reducer.
The order inversion design pattern is used in the above example; see the
example case in fig. 3.13.
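Order inversion can be simulated in a few lines of Python (a sketch, not Hadoop code): the sort order is engineered so the marginal key (wi, '*') reaches the reducer before any (wi, wj) key.

```python
def relative_frequencies(pair_counts):
    """Order-inversion sketch: '*' sorts before letters in ASCII,
    mimicking the custom sort order that delivers the marginal key
    (wi, '*') to the reducer first. Assumes all keys sharing a left
    word reach the same reducer (partitioner hashes the left word)."""
    result = {}
    marginal = 0
    for (wi, wj), count in sorted(pair_counts.items()):
        if wj == '*':
            marginal = count            # remember count(wi, *)
        else:
            result[(wi, wj)] = count / marginal
    return result

# hypothetical reducer input for the left word 'dog'
counts = {('dog', '*'): 4, ('dog', 'cat'): 1, ('dog', 'runs'): 3}
# f(cat|dog) = 1/4, f(runs|dog) = 3/4
```

The key point: the reducer needs only O(1) state (the current marginal), because the sort order guarantees the divisor arrives before anything it must divide.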
End of Chapter 3.
(Don't worry about 3.4 through the end.)
Source code:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop
Write the partitioner class and set it in the job conf.
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/class-use/Partitioner.html
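The logic the custom Partitioner must implement can be sketched in Python (in Hadoop you would subclass org.apache.hadoop.mapreduce.Partitioner, override getPartition, and register it in the job configuration; this is only a simulation of the routing decision):

```python
def pair_partitioner(key, num_reducers):
    # hash only the left word of the (wi, wj) pair, so that all
    # pairs sharing wi -- including the marginal key (wi, '*') --
    # are routed to the same reducer
    wi, _ = key
    return hash(wi) % num_reducers

# (wi, '*') and (wi, wj) land on the same reducer partition
p1 = pair_partitioner(("dog", "*"), 4)
p2 = pair_partitioner(("dog", "cat"), 4)
assert p1 == p2
```

Without this, the default partitioner would hash the whole pair, and the marginal count could end up on a different reducer than the counts it must normalize.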
For books
http://www.gutenberg.org/
To facilitate transfers between file systems (Hadoop to any and back):
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html