#DS #flashcards/ds410 #ds410

MapReduce

Developed by Google circa 2004 to scale its search engine's performance as the size of the internet exploded (many new webpages being added daily).

Key innovations of MapReduce

Pioneered by Google around 2004.

Impacts:

MapReduce at google

MapReduce cluster requirements

MapReduce requires:

Resource allocations

MapReduce performs scalable operations through “look-ahead” planning:

(Skipped ahead to divide and conquer to catch up.)

MapReduce - Divide and conquer

Partition

Each node in the first counting step (a map worker) generates 26 output files, one for each key/word subspace (planning ahead for scalable aggregation). Each aggregation task is assigned to a reduce worker, one per key subspace.
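A minimal sketch of one such map worker for the word-count example, assuming keys are partitioned into 26 subspaces by first letter; the `map_worker` name and file-naming scheme are illustrative assumptions, not any framework's API.

```python
# Hypothetical map worker for word count: count words in one input
# chunk, then write 26 output files, one per first-letter subspace,
# so each reduce worker later reads exactly one file from this node.
from collections import Counter
import string

def map_worker(worker_id: int, chunk: str) -> None:
    counts = Counter(w.lower() for w in chunk.split() if w[0].isalpha())
    for letter in string.ascii_lowercase:          # one file per key subspace
        with open(f"map_{worker_id}_{letter}.txt", "w") as f:
            for word, n in counts.items():
                if word[0] == letter:
                    f.write(f"{word}\t{n}\n")

map_worker(0, "big data and bigger data clusters")
```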

Boundary conditions

When analyzing gene sequences, MapReduce fails: the sequences can be chopped up at chunk boundaries, so information is lost, possibly including the very sequence we are looking for. It would not crash; it would just return erroneous results depending on the situation. MapReduce computations must be embarrassingly parallel: the computation must require no communication or coordination between different map tasks, or between different reduce tasks.

Innovation

The key innovations of MapReduce are:

  1. Division of the input data into chunks, each allocated to a different worker.
  2. Partitioning of the space of keys into subspaces, each associated with a different reduce worker.
  3. A map step in which map workers generate a separate file for each subspace of keys, to be read by the corresponding reduce workers.
    The combination of these three steps leads to a dramatic reduction in time-to-solution for several classes of problems (those with small intermediate data sizes, such as counting problems).

Plan-Ahead

The combination of partitioning the key space and generating intermediate pairs according to the number of reduce workers is called Plan-Ahead.

Map in MapReduce

The map in MapReduce is more than simply a map operation.
It divides the input data into partitions/chunks and distributes the application of map functions among a set of computers/nodes in a cluster (unsurprising).

It also plans ahead/prepares for the following reduce step so that the output key-value pairs are distributed across multiple output files scalably.

MapReduce and Hadoop

Implementing MapReduce in a Cluster requires:

Robustness

Requires the concept of a master node:

Hadoop Distributed File System (HDFS)

Usefulness of HDFS

HDFS was especially useful in the early days of Big Data, allowing organizations to leverage old PCs to test/evaluate big data processing opportunities.

Hash-functions

Data partitioning

Hash functions are used to determine which split or partition a particular <key,value> pair should go to. By applying a hash function to the key, the framework can determine the partition where the data should be sent.
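A minimal sketch of this routing rule, assuming `num_reducers` reduce tasks; the modulo scheme mirrors a common default (e.g., Hadoop's HashPartitioner), but the function below is illustrative, not framework code.

```python
# Route a key to a reducer by hashing it and taking the remainder
# modulo the number of reducers; all pairs sharing a key land on the
# same reducer. (Python's built-in hash is process-specific; real
# frameworks use a deterministic hash of the serialized key.)
def partition(key: str, num_reducers: int) -> int:
    return hash(key) % num_reducers

# The same key always maps to the same partition within a run.
print(partition("apple", 4) == partition("apple", 4))  # True
```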

Shuffling and Sorting

In this phase, the framework collects the output from all mapper tasks and sorts it based on the keys. Here, hash functions are used again to group together <key, value> pairs with the same key.
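A single-process sketch of shuffling and sorting, assuming the small list of mapper outputs shown below; `groupby` stands in for the framework's grouping machinery.

```python
# Gather <key, value> pairs from all mappers, sort by key, and group
# pairs sharing a key so one reducer call sees all of that key's values.
from itertools import groupby
from operator import itemgetter

mapper_outputs = [("big", 1), ("data", 1), ("big", 1), ("cluster", 1)]

shuffled = sorted(mapper_outputs, key=itemgetter(0))
grouped = {k: [v for _, v in pairs] for k, pairs in groupby(shuffled, key=itemgetter(0))}
print(grouped)  # {'big': [1, 1], 'cluster': [1], 'data': [1]}
```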

Data Retrieval in Reduce Phase

In the reduce phase of MapReduce, each reducer task takes the set of grouped <key, value> pairs routed to it (via the hash function) and processes them to generate the final output.
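A minimal sketch of a reduce task for word count, assuming it receives the grouped pairs produced by the shuffle step above; the summing logic stands in for the user-supplied reduce function.

```python
# Aggregate each key's list of values into a single final result.
def reduce_word_count(key, values):
    return key, sum(values)

grouped = {"big": [1, 1], "cluster": [1], "data": [1]}
print([reduce_word_count(k, vs) for k, vs in grouped.items()])
# [('big', 2), ('cluster', 1), ('data', 1)]
```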

Applications of MapReduce

Embarrassingly parallel applications that can be decomposed into a map step and a reduce (aggregation) step.
The intermediate results of the map step can be represented as key-value pairs.
The aggregation (reduce) step groups intermediate results by keys.

  1. Data transformations and feature engineering for AI modeling.
  2. Text Analysis and Natural Language Processing (NLP)
  3. Image and Video Processing
  4. Graph Analysis and Social Network Analysis
  5. Log Analysis and Anomaly Detection
  6. Genomics and Bioinformatics
  7. Recommendation Systems
  8. Large-Scale Data Mining
  9. Spatial Analysis and GIS (Geographic Information Systems)
  10. Data Aggregation and Summarization
  11. Fraud Detection

Limitations of MapReduce

  1. MapReduce is NOT designed for iterative big data analytics.
  2. If we use MapReduce in a loop, there is no general/flexible way to save/reuse data between iterations.
  3. Cost of iterative big data analytics using MapReduce is high. - Note: This limitation motivates the design of Spark.
  4. Any problem that introduces a ‘race condition’, i.e., one that requires coordination between tasks.

Distributed computing by dividing a task into two phases (see the sketch after this list):

  1. A completely decomposable (Map) phase
  2. A key-based aggregation (Reduce) phase
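A toy, single-process illustration of this two-phase decomposition for word count; all names are illustrative assumptions, and a real cluster runs the same logic across many machines with files and the network in between.

```python
from collections import defaultdict

def map_phase(chunks):
    # Completely decomposable: each chunk is processed independently.
    for chunk in chunks:
        for word in chunk.split():
            yield word.lower(), 1          # emit <key, value>

def reduce_phase(pairs):
    # Key-based aggregation: group by key, then combine each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big ideas", "bigger data clusters"]
print(reduce_phase(map_phase(chunks)))
# {'big': 2, 'data': 2, 'ideas': 1, 'bigger': 1, 'clusters': 1}
```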

Key-guided structure

A key is the dependency between the two phases: <key, value> pairs are interpreted as intermediate results of the Map phase in order to facilitate the Reduce phase.

Coordinated Output/Input

Each Mapper generates n files, one for each Reducer (to reduce the cost of copying files).