Chapter 3
    Jimmy Lin and Chris Dyer
    
    MapReduce Algorithm Design
    "Simplicity" is the theme:
    fast, "simple" operations on a large set of data.
    Most web/mobile/internet application data yield to embarrassingly
    parallel processing.
    General idea: you write the Mapper and Reducer (and optionally a Combiner and
    Partitioner); the execution framework takes care of the rest.
    Of course, you configure the splits, the number of reducers, the input
    path, the output path, etc.
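    A minimal word-count sketch of that division of labor, assuming Hadoop's newer
    org.apache.hadoop.mapreduce API (class names are illustrative and the
    driver/job configuration is omitted): you supply only map() and reduce();
    splitting the input, shuffling, and sorting are the framework's job.

// WordCount sketch -- illustrative only, not the book's code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: for every term in the line, emit <term, 1>.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: the framework groups the 1s by term; sum them up.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}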
    
    
    The programmer has NO control over
    -- where a mapper or reducer runs (which node in the cluster)
    -- when a mapper or reducer begins or finishes
    -- which input key-value pairs are processed by a specific mapper
    -- which intermediate key-value pairs are processed by a specific
    reducer
    
    However, what control does a programmer have?
    1. The ability to construct complex structures as keys and values to
    store and communicate partial results
    2. The ability to execute user-specified code at the beginning of a
    map or reduce task, and termination code at the end
    3. The ability to preserve state in both mappers and reducers across
    multiple input/intermediate key-value pairs: counters (points 2 and 3
    are sketched in code right after this list)
    4. The ability to control the sort order of intermediate keys, and hence the
    order in which a reducer encounters them
    5. The ability to partition the key space, and hence which keys go to which reducer
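    A small sketch of points 2 and 3, under the same Hadoop API assumption as
    above: setup() and cleanup() are the per-task initialization and termination
    hooks, and Counters carry simple aggregate statistics back to the driver. The
    counter names and the blank-line bookkeeping are made up for illustration.

// Illustrative mapper showing setup()/cleanup() hooks and Counters.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineStatsMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Hypothetical counters, reported back to the driver when the job finishes.
  public enum LineCounter { BLANK, NON_BLANK }

  @Override
  protected void setup(Context context) {
    // Runs once before the first map() call of this task:
    // open side files, load dictionaries, initialize per-task state, ...
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.toString().trim().isEmpty()) {
      context.getCounter(LineCounter.BLANK).increment(1);
      return;
    }
    context.getCounter(LineCounter.NON_BLANK).increment(1);
    context.write(new Text("lines"), new IntWritable(1));
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once after the last map() call: flush buffered state, close files.
  }
}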
    
    Go through 3.2, 3.3, 3.4, 3.5, 3.6, 3.7.
    1. Start with 3.2: simple map and reduce; note that sum is not the
    only operation; we can use other operations (mean, max, etc.) and
    other, more complex operations.
    2. Figures 3.2 and 3.3 beautifully illustrate the "local combiner"
    optimization and, more importantly,
        the concept of "preserving state across keys"
    with the Initialize and Close methods.
        This represents taking control back from Hadoop
    into your own code: the in-mapper ("in-line") combiner, sketched below.
        Three approaches: the default combiner, a user-provided
    custom combiner (the Hadoop runtime schedules it), and the in-mapper combiner.
        Issues: scalability bottleneck, heap space problem
    (memory usage).
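    A sketch of that in-mapper combining pattern (the Figure 3.3 idea), under the
    same assumptions as the word-count sketch above: partial counts are held in a
    per-task HashMap across map() calls and emitted once in cleanup(). This is
    also where the scalability/heap-space caveat comes from, since the map grows
    with the number of distinct terms the task sees.

// In-mapper combining sketch -- the Figure 3.3 idea, not the book's exact code.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {        // the "Initialize" method
    counts = new HashMap<>();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      String term = tok.nextToken();
      counts.put(term, counts.getOrDefault(term, 0) + 1);  // buffer, don't emit yet
    }
  }

  @Override
  protected void cleanup(Context context)        // the "Close" method
      throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}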
    3. What about the algorithmic correctness of a local combiner?
    Suppose we wish to compute the "mean"; how would you do this? An identity
    mapper + a reducer that averages.
    Look at 3.4.
    If we apply the combiner solution we discussed for figs. 3.2 and 3.3, it is
    incorrect:
    mean(1,2,3,4,5) is not equal to mean(mean(1,2), mean(3,4,5)) == wrong!

    Introduce a combiner: see 3.5.
    Rule: a combiner's input <key, value> type must be the same as
    its output <key, value> type (both matching the mapper's output type),
    because the framework may run the combiner zero, one, or many times.
    3.5 violates this rule;
    3.6 is the correct version (a code sketch follows below).
    
    Let's discuss 3.6 further. The input to the mapper need not be <string t,
    integer r>; it is more efficient to take <docid, doc> and have the mapper
    parse out the (t, r) pairs itself.
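    A sketch of the correct combiner-based mean (the 3.6 idea): every stage passes
    (sum, count) pairs, so the combiner's input and output types are identical and
    it can safely run zero, one, or many times. SumCountWritable is a small helper
    written for this sketch (not a book or Hadoop class), and the mapper assumes
    the input already arrives as <Text key, LongWritable value>, e.g. from a
    SequenceFile.

// Mean-by-key sketch with a correct combiner -- illustrative only.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanByKey {

  // (sum, count) pair carried between mapper, combiner, and reducer.
  public static class SumCountWritable implements Writable {
    public long sum;
    public long count;
    public SumCountWritable() {}
    public SumCountWritable(long sum, long count) { this.sum = sum; this.count = count; }
    public void write(DataOutput out) throws IOException { out.writeLong(sum); out.writeLong(count); }
    public void readFields(DataInput in) throws IOException { sum = in.readLong(); count = in.readLong(); }
  }

  // Mapper: emit <key, (value, 1)>.
  public static class MeanMapper
      extends Mapper<Text, LongWritable, Text, SumCountWritable> {
    @Override
    protected void map(Text key, LongWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(key, new SumCountWritable(value.get(), 1));
    }
  }

  // Combiner: consumes and emits the same <Text, SumCountWritable> type.
  public static class MeanCombiner
      extends Reducer<Text, SumCountWritable, Text, SumCountWritable> {
    @Override
    protected void reduce(Text key, Iterable<SumCountWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (SumCountWritable v : values) { sum += v.sum; count += v.count; }
      context.write(key, new SumCountWritable(sum, count));
    }
  }

  // Reducer: only here do we divide, so the mean is computed exactly once.
  public static class MeanReducer
      extends Reducer<Text, SumCountWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<SumCountWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (SumCountWritable v : values) { sum += v.sum; count += v.count; }
      context.write(key, new DoubleWritable((double) sum / count));
    }
  }
}

    Wire it up with job.setCombinerClass(MeanByKey.MeanCombiner.class); leaving
    the combiner out, or running it several times, leaves the result unchanged,
    which is exactly the correctness property 3.5 lacks.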
    Section 3.2: the pairs and stripes patterns
    Example: word co-occurrence matrix
    Large corpus: an n x n matrix, where n is the number of unique words in
    the corpus. (corpora is the plural)
    Observations:
    Let i and j be the row and column indices; cell m(i,j) holds
    the number of times w(i) co-occurred with w(j).
    For example, w(i) = <Ron Paul> and w(j) = <election>
    on a Twitter feed we collected in March 2012.
    For example, in Dec 2013 <Sandy> would have co-occurred the most
    with <?>
    48 MB of data blows up to 888 MB in a naive MR co-occurrence
    implementation; the matrix is O(n^2) in space.
    We need better solutions/algorithms...
    Let's look at Algorithms 3.8 (pairs) and 3.9 (stripes).
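    Hedged sketches of the two mappers follow. For brevity the pairs key is
    flattened into a single Text "left:right" instead of a proper pair Writable,
    and "co-occurrence" here means "appears on the same input line" rather than
    the book's sliding window; both are simplifications of the 3.8/3.9 idea, not
    the book's code.

// Co-occurrence mappers, pairs vs. stripes -- illustrative only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CooccurrenceMappers {

  // Pairs: one <(w, u), 1> per co-occurring pair -> many small records.
  public static class PairsMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] terms = line.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        for (int j = 0; j < terms.length; j++) {
          if (i != j) {
            context.write(new Text(terms[i] + ":" + terms[j]), ONE);
          }
        }
      }
    }
  }

  // Stripes: one <w, {u -> count}> per word -> fewer, fatter records.
  public static class StripesMapper
      extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] terms = line.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        MapWritable stripe = new MapWritable();
        for (int j = 0; j < terms.length; j++) {
          if (i != j) {
            Text neighbor = new Text(terms[j]);
            IntWritable old = (IntWritable) stripe.get(neighbor);
            stripe.put(neighbor, new IntWritable(old == null ? 1 : old.get() + 1));
          }
        }
        context.write(new Text(terms[i]), stripe);
      }
    }
  }
}

    The pairs reducer then simply sums the 1s (like word count), while the
    stripes reducer sums the associative arrays element-wise; the experimental
    numbers below show the resulting trade-off of far fewer but fatter records
    for stripes.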
    Experiment: 19 slaves, 2-core machines; 666 sec for stripes
    vs. 3758 sec for pairs.
    For pairs:
    5.7 GB corpus --> mappers generated 2.6 billion intermediate
    key-value pairs (31.2 GB)
    --> after combiners, 1.1 billion key-value pairs --> reducers
    emitted 142 million k,v pairs.
    For stripes:
    5.7 GB corpus --> 653 million intermediate k,v pairs (48.1 GB) --> after
    combiners, 28.8 million k,v pairs
    --> reducers emitted 1.69 million k,v pairs.
    Both stripes and pairs gave linear scalability... see fig. 3.10.
    
    The next improvement is the relative frequency of co-occurrence rather than
    the absolute count:
    f(wj | wi) = count of occurrences of <wi, wj> / count of occurrences
    of <wi, *>, where <wi, *> is the marginal (the total count over all words
    co-occurring with wi).
    In the pairs approach we emit a special key <wi, *> which
    holds the total count, and we have to update the partitioner to
    use only the left word of the pair to partition the data, so that all
    keys with the same left word go to the same reducer.
    The order inversion design pattern is used in the above example: the sort
    order ensures <wi, *> reaches the reducer before any <wi, wj>, so the
    marginal is known before the individual counts arrive. See the
    example case in fig. 3.13.
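    A sketch of that partitioner, reusing the illustrative "left:right" Text key
    encoding from the pairs sketch above and "*" as the special marginal marker.
    Hashing only the left word guarantees <wi, *> and every <wi, wj> land on the
    same reducer; with Hadoop's default byte-wise Text ordering, '*' (ASCII 42)
    also sorts before letters, so the marginal arrives first, as order inversion
    requires.

// Illustrative partitioner for the relative-frequency (order inversion) job.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text pairKey, IntWritable count, int numPartitions) {
    // Hash only the left element of the pair, e.g. "dog" from "dog:cat" or "dog:*".
    String left = pairKey.toString().split(":", 2)[0];
    return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

    The mapper additionally emits <wi:*, number of neighbors> alongside the
    individual pairs; the reducer stores the marginal when it sees the "*" key
    and divides every subsequent count for the same left word by it.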
    End of Chapter 3.
    (Don't worry about 3.4 through the end for now.)
    
    
    
    
    Source code:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop
    Write the partitioner class and set it in the job conf.
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/class-use/Partitioner.html
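    A hedged driver-side sketch of "set it in the job conf", assuming a recent
    Hadoop release and reusing the illustrative class names from the sketches
    above; the reducer/combiner for the actual frequency computation are omitted.

// Registering a custom partitioner on the job -- illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RelativeFrequencyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "relative frequencies");
    job.setJarByClass(RelativeFrequencyDriver.class);
    job.setMapperClass(CooccurrenceMappers.PairsMapper.class);  // sketch above
    job.setPartitionerClass(LeftWordPartitioner.class);         // partition on the left word only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}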
    For books
    http://www.gutenberg.org/
    To transfer files between file systems (Hadoop to any other and back):
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html