MapReduce
Developed by Google circa 2004 to scale its search engine performance with the
exploding size of the internet (many new webpages being added daily).
Key innovations of MapReduce
Pioneered by Google around 2004:
- Motivated by streaming webpage processing for page indexing.
- Apache Hadoop (released in 2006, only 2 years later) brought MapReduce to the wider industry.
Impacts:
- Applied to a wide range of big data analytics problems very quickly.
- The Hadoop Distributed File System (HDFS) became the industry standard for storing Big Data (later also adopted by Spark).
- Motivated the design of Spark.
MapReduce at Google
- Inputs to MapReduce (for this application): a large number of (new) webpages arriving on a regular basis.
- Outputs from MapReduce:
  - The term frequency (TF) of each word on a page.
  - For each word observed, an update to its inverse document frequency (IDF).
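A minimal sketch in Python of this page-indexing pipeline (the function names, the (word, (url, tf)) pair layout, and the log-based IDF formula are illustrative assumptions, not Google's implementation):

```python
import math
from collections import Counter

def map_page(url, text):
    """Map: for one webpage, emit (word, (url, tf)) pairs, where tf is
    the word's term frequency on that page."""
    words = text.lower().split()
    counts = Counter(words)
    for word, n in counts.items():
        yield word, (url, n / len(words))

def reduce_word(word, page_tfs, total_pages):
    """Reduce: all pairs for one word arrive at the same reducer; the
    number of distinct pages gives the document frequency, from which
    the IDF update is computed."""
    df = len(page_tfs)                  # pages containing this word
    idf = math.log(total_pages / df)    # standard IDF definition
    return word, idf, dict(page_tfs)
```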
MapReduce cluster requirements
MapReduce requires:
- Specification of M worker nodes for map.
- Specification of R worker nodes for reduce.
- One master node for coordination and planning.
MapReduce provides:
- (key, value) pairs as the data representation shared between the map and reduce steps.
Resource allocations
MapReduce performs scalable operations through "look-ahead" planning:
- Partitions the keys into R groups, with each group having its own reduce worker.
- Partitions the document/webpage corpus into M groups, with each group having its own map worker.
MapReduce then performs aggregation as follows (sketched below):
- Each map worker generates R files of (key, value) pairs and saves them to disk.
- Each reduce worker aggregates M files (one from each map worker), all for the group of keys assigned to that reduce worker.
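A minimal in-memory sketch of this plan-ahead aggregation (real map workers write the R groups to disk as files; here lists of dicts stand in for files):

```python
from collections import defaultdict

M, R = 3, 2  # illustrative numbers of map and reduce workers
chunks = [["big", "data"], ["big", "ideas"], ["data", "rules"]]  # one chunk per map worker

def map_worker(chunk):
    """Count words in one chunk, splitting the (key, value) pairs into
    R groups by hash(key) % R: one output 'file' per reduce worker."""
    outputs = [defaultdict(int) for _ in range(R)]
    for word in chunk:
        outputs[hash(word) % R][word] += 1
    return outputs            # outputs[r] is destined for reduce worker r

map_outputs = [map_worker(c) for c in chunks]   # M x R intermediate files

def reduce_worker(r):
    """Merge the M files (one per map worker) for this key group."""
    totals = defaultdict(int)
    for m in range(M):
        for word, n in map_outputs[m][r].items():
            totals[word] += n
    return dict(totals)

print([reduce_worker(r) for r in range(R)])
```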
MapReduce - Divide and conquer
- Idea: Split your large data across computing resources to get acceleration.
- Very well suited for "embarrassingly parallel" problems, such as local transformations to data (e.g., finding the sum of each row in a large file by sending subsets of rows to different nodes; sketched below).
- Several problems require some form of aggregation across resources; MapReduce allows this type of aggregation to be performed in a scalable manner.
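A toy sketch of the row-sum example, using Python's multiprocessing pool as a stand-in for cluster nodes:

```python
from multiprocessing import Pool

def sum_rows(rows):
    """Local transformation: each worker sums its own subset of rows,
    with no communication between workers."""
    return [sum(row) for row in rows]

if __name__ == "__main__":
    data = [[i, i + 1, i + 2] for i in range(12)]         # tiny stand-in for a large file
    chunks = [data[i * 3:(i + 1) * 3] for i in range(4)]  # split rows across 4 "nodes"
    with Pool(4) as pool:
        row_sums = pool.map(sum_rows, chunks)             # embarrassingly parallel map
    print(row_sums)
```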
Partition
Each node in the first counting step (a map worker) generates 26 output files, one for each key/word subspace (planning ahead for scalable aggregation). Each aggregation task is then assigned to a reduce worker, one per key subspace.
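A one-line version of that 26-way split (assuming lowercase alphabetic words):

```python
def subspace(word):
    """Assign a word to one of 26 key subspaces by its first letter
    (assumes lowercase alphabetic words)."""
    return ord(word[0]) - ord("a")

print(subspace("apple"), subspace("zebra"))   # 0 25
```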
Boundary conditions
When analyzing gene sequences, MapReduce can fail: a gene sequence may be chopped up at chunk boundaries, so information is lost (possibly even the very sequence we are looking for). The job would not crash; it would simply return incomplete or incorrect results, depending on the situation. This is because MapReduce computations must be embarrassingly parallel: the computation must require no communication or coordination between different map tasks, or between different reduce tasks.
Innovation
The key innovations of MapReduce are:
- Division of the input data into chunks, each allocated to a different worker.
- Partitioning of the space of keys into subspaces, each associated with a different reduce worker.
- A map step where map workers generate separate files for each subspace of keys, which can be read by the reduce workers.
The combination of these three steps leads to a dramatic reduction in time-to-solution for several classes of problems (those with small intermediate data sizes, such as counting problems).
Plan-Ahead
The combination of partitioning and the intermediate generation of (key, value) pairs according to the number of reduce workers is called Plan-Ahead.
Map in MapReduce
The map in MapReduce is more than simply a map operation. It:
- Divides the input data into partitions/chunks and distributes the application of the map function among a set of computers/nodes in a cluster (unsurprising).
- Plans ahead/prepares for the following reduce step, so that the output (key, value) pairs are distributed across multiple output files in a scalable way (see the sketch below).
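A sketch of a single map worker that does both jobs; the file-naming scheme and the helper name run_map_worker are assumptions for illustration:

```python
import os

def run_map_worker(worker_id, records, map_fn, num_reducers, out_dir="mr_tmp"):
    """Apply map_fn to one worker's chunk of records, writing each
    (key, value) pair into one of num_reducers files chosen by
    hash(key) % num_reducers, so every reduce worker later reads
    exactly one file from this map worker."""
    os.makedirs(out_dir, exist_ok=True)
    files = [open(os.path.join(out_dir, f"map{worker_id}_part{r}.txt"), "w")
             for r in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            files[hash(key) % num_reducers].write(f"{key}\t{value}\n")
    for f in files:
        f.close()
```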
MapReduce and Hadoop
Implementing MapReduce in a cluster requires:
- A file system that supports Big Data (e.g., allows a big dataset to be stored in multiple "chunks") -> HDFS.
- A cluster manager that assigns data/tasks to nodes in the cluster -> YARN.
- Efficient detection and recovery from failures of any node (robustness requirement).
Robustness
Requires the concept of a master node:
- A special node that assigns map tasks/data and reduce tasks to nodes in a cluster.
- The special node also monitors the other nodes (i.e., map workers and reduce workers) to detect failures and orchestrate recovery.
How do we get robustness?
- If the master node detects a failed map worker:
  - The master dynamically reassigns all of its tasks to other nodes.
  - The master also informs all reduce workers about the changed location of the map worker (and its intermediate files); see the sketch below.
- If the master detects a failed reduce worker:
  - The master dynamically reassigns all of its tasks to another reduce worker.
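A toy sketch of the master's failure-detection logic, assuming heartbeat timestamps (the data structures are invented for illustration; a real master is far more involved):

```python
def detect_failures(assignments, heartbeats, now, timeout=10.0):
    """Toy master logic: a worker that has not sent a heartbeat within
    the timeout is treated as failed, and every task assigned to it is
    queued for reassignment. For a failed map worker, the master would
    also tell all reduce workers where its intermediate files now live."""
    failed = {w for w, last in heartbeats.items() if now - last > timeout}
    to_reassign = [task for task, w in assignments.items() if w in failed]
    return failed, to_reassign

# Example: worker "w2" last checked in 30 s ago and is declared failed.
print(detect_failures({"map_0": "w1", "map_1": "w2"},
                      {"w1": 95.0, "w2": 70.0}, now=100.0))
```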
Hadoop Distributed File System (HDFS)
- A distributed file system (DFS) that can run on commodity hardware (e.g., PCs). A DFS implies data stored in separate physical locations, but with communication (read/write) between them.

Usefulness of HDFS
- Especially useful in the early days of Big Data, allowing organizations to leverage old PCs to test/evaluate big data processing opportunities.
- Supports high-throughput access to data (e.g., for gathering/storing streaming data).
- Supports the storage of large data sets, which can be distributed among multiple machines/racks.
- Supports the recovery of data from machine/disk failure.
- Replicates data across machines/racks for fault tolerance.
- Uses a NameNode to map file names (and their copies) to actual locations (see the sketch below).
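A toy picture of the NameNode's mapping (paths, block names, and node names are made up):

```python
# A file is split into blocks, and each block is replicated on several
# machines/racks for fault tolerance; the NameNode holds this metadata.
namenode = {
    "logs/day1.txt": {
        "block_0": ["rack1/node3", "rack2/node7", "rack2/node9"],
        "block_1": ["rack1/node4", "rack2/node7", "rack3/node2"],
    },
}

def locate(path, block):
    """A client asks the NameNode where a block's replicas live, then
    reads the data directly from one of those DataNodes."""
    return namenode[path][block]

print(locate("logs/day1.txt", "block_0"))
```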
Hash-functions
- A hash function takes the key as input and computes a hash value, which is typically an integer. This hash value is used to determine the index (or position) where the corresponding value will be stored in the hash table.
- In the context of storing information in computer memory, hash functions are used to efficiently organize and retrieve data from data structures known as hash tables or hash maps. These data structures are designed to provide fast access to data based on a key, allowing for rapid insertion, deletion, and retrieval of information.
- The output of MapReduce is not reversible, since the hash values are lost in the process (hash functions are one-way).
Data partitioning
Hash functions are used to determine which split or partition a particular <key,value> pair should go to. By applying a hash function to the key, the framework can determine the partition where the data should be sent.
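For example, a hash-mod partitioner (the usual default choice; a sketch, not Hadoop's exact code):

```python
NUM_PARTITIONS = 4   # i.e., the number of reduce workers, R

def partition_for(key):
    """Pick the partition (reduce worker) for a key. Within one run,
    the same key always hashes to the same partition, so all of its
    values end up at the same reducer."""
    return hash(key) % NUM_PARTITIONS

print(partition_for("apple"), partition_for("apple"))   # always equal within a run
```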
Shuffling and Sorting
In this phase, the framework collects the output from all mapper tasks and sorts it based on the keys. Here, hash functions are used again to group together <key, value> pairs with the same key.
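A small illustration of the shuffle/sort outcome using Python's sort and groupby:

```python
from itertools import groupby
from operator import itemgetter

# Unsorted (key, value) pairs collected from all mapper tasks.
pairs = [("big", 1), ("data", 1), ("big", 1), ("rules", 1), ("data", 1)]

# Sort by key, then group: all pairs sharing a key form one group,
# ready to be handed to a single reducer.
pairs.sort(key=itemgetter(0))
groups = {k: [v for _, v in g] for k, g in groupby(pairs, key=itemgetter(0))}
print(groups)   # {'big': [1, 1], 'data': [1, 1], 'rules': [1]}
```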
Data Retrieval in Reduce Phase
In the reduce phase of MapReduce, each reducer task takes a set of grouped <key, value> pairs (routed to it by the hash function) and processes them to generate the final output.
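Continuing the example above, a sketch of one reducer call per grouped key (the sum aggregate is illustrative):

```python
def reduce_counts(key, values):
    """One reducer call: receive a key and all of its grouped values,
    and emit the final (key, aggregate) pair, here a simple sum."""
    return key, sum(values)

groups = {"big": [1, 1], "data": [1, 1], "rules": [1]}
print([reduce_counts(k, vs) for k, vs in groups.items()])
```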
Applications of MapReduce
Embarrassingly parallel applications that can be decomposed into a map step and a reduce (aggregation) step, where:
- the intermediate results of the map step can be represented as key-value pairs, and
- the reduce step aggregates intermediate results grouped by key.
Examples:
- Data transformations and feature engineering for AI modeling.
- Text Analysis and Natural Language Processing (NLP)
- Image and Video Processing
- Graph Analysis and Social Network Analysis
- Log Analysis and Anomaly Detection
- Genomics and Bioinformatics
- Recommendation Systems
- Large-Scale Data Mining
- Spatial Analysis and GIS (Geographic Information Systems)
- Data Aggregation and Summarization
- Fraud Detection
Limitations of MapReduce
- MapReduce is NOT designed for iterative big data analytics.
- If we use MapReduce in a loop, there is no general/flexible way to save/reuse data between iterations.
- The cost of iterative big data analytics using MapReduce is therefore high (simulated in the sketch below).
- Note: This limitation motivated the design of Spark.
- Any problem that introduces a "race condition", i.e., one whose map or reduce tasks would need to coordinate, is also out of scope.
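A tiny simulation of that iterative cost: each "job" below must round-trip its entire dataset through storage, because nothing is cached between jobs (the file layout and the halving computation are invented for illustration):

```python
import json, os, tempfile

def mapreduce_job(in_path, out_path):
    """Stand-in for one MapReduce job: read all input from storage,
    transform it, and write all output back. No state survives the job."""
    with open(in_path) as f:
        data = json.load(f)
    data = [x / 2 for x in data]          # the per-iteration computation
    with open(out_path, "w") as f:
        json.dump(data, f)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "iter_0.json")
with open(path, "w") as f:
    json.dump([1024.0, 512.0], f)

for i in range(5):                        # every iteration pays full disk I/O
    nxt = os.path.join(workdir, f"iter_{i + 1}.json")
    mapreduce_job(path, nxt)
    path = nxt
```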
Distributed computing by dividing a task into:
- A completely decomposable (Map) phase
- A key-based aggregation (Reduce) phase
Key-guided structure
The key is the dependency between the two phases: <key, value> pairs are interpreted as intermediate results of the map phase to facilitate the reduce phase.
Coordinated Output/Input
Each mapper generates n files, one for each reducer (reducing the cost of copying files).