MapReduce and Lazy Evaluation in Spark

#ds410 #swe #flashcards/ds410

Related: Software engineering | Cloud computing

Key Concepts: Understanding how Spark optimizes big data processing through lazy evaluation and RDD transformations.

Conventional Evaluation

Each statement in a program is executed immediately after it is loaded into the CPU.

Lazy Evaluation

A statement in a program is NOT executed immediately; its execution is postponed until after some other statements have been evaluated — in Spark, until an action actually requires its result.

Lazy evaluation helps Spark optimize the computation: by postponing execution, Spark can plan the whole job before any data is moved. Remember the adage:

“Computation movement is cheaper than data movement”
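
A minimal sketch of the contrast, assuming an existing SparkContext `sc` (the sample data and variable names are illustrative):

```python
# Conventional (eager) evaluation: each line is computed as soon as it runs.
squares = [x * x for x in range(10)]            # result materialized immediately

# Lazy evaluation in Spark: transformations only record a plan (the lineage).
nums_rdd = sc.parallelize(range(10))            # no computation yet
squares_rdd = nums_rdd.map(lambda x: x * x)     # still no computation

# Only an action forces evaluation, so Spark can ship the (cheap-to-move)
# computation to wherever the (expensive-to-move) data already sits.
total = squares_rdd.sum()                       # computation happens here
```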

RDDs and lazy evaluation

RDDs and immutability

RDDs and data dependencies

Narrow dependencies:

Each partition of the parent RDD is used by at most one partition of the child RDD.
![[Pasted image 20250908162754.png]]
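
For instance, element-wise transformations such as map and filter produce narrow dependencies, because each child partition is built from a single parent partition. A minimal sketch, assuming an existing SparkContext `sc`:

```python
# Narrow dependencies: each child partition reads from exactly one parent
# partition, so no shuffle of data across the cluster is needed.
parent_rdd = sc.parallelize(range(100), 4)              # 4 partitions
child_rdd = parent_rdd.map(lambda x: x * 2) \
                      .filter(lambda x: x % 3 == 0)     # still 4 partitions, no shuffle
```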

Wide dependencies

Each child partition depends on multiple partitions of the parent RDD.
![[Pasted image 20250908162855.png]]
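
In contrast, key-based operations such as reduceByKey, groupByKey, and join create wide dependencies, because each child partition needs records from many parent partitions. A sketch under the same assumption of an existing `sc`:

```python
# Wide dependencies: records must be shuffled across the cluster so that all
# values with the same key end up in the same child partition.
pairs_rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)
summed_rdd = pairs_rdd.reduceByKey(lambda x, y: x + y)   # shuffle -> wide dependency
```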

Key-value RDDs

Spark operations

Examples of transformations:

- .map(function f): passes each element of the input RDD through f; returns an output RDD with the same structure.
- .flatMap(function f): passes each element of the input RDD through f, then flattens the results; returns an output RDD whose structure may change.
- .filter(function f): returns an output RDD that (after lazy evaluation) includes only the elements of the input RDD for which f returns True.
- .join(inputRDD2): joins two key-value input RDDs on their keys (after lazy evaluation); returns an output RDD.
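
A short sketch of these transformations in PySpark (the sample data, variable names, and an existing SparkContext `sc` are assumptions):

```python
lines_rdd = sc.parallelize(["to be or", "not to be"])

tokens_rdd = lines_rdd.flatMap(lambda line: line.split())   # flatten lists of words into one RDD
pairs_rdd = tokens_rdd.map(lambda w: (w, 1))                # one output element per input element
long_rdd = tokens_rdd.filter(lambda w: len(w) > 2)          # keep elements where f returns True

ages_rdd = sc.parallelize([("alice", 30), ("bob", 25)])
cities_rdd = sc.parallelize([("alice", "NYC"), ("bob", "LA")])
joined_rdd = ages_rdd.join(cities_rdd)                      # key-value join: ("alice", (30, "NYC")), ...

# None of these lines triggers computation: each one only extends the lineage.
```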

Examples of actions
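
Common actions in these notes include count, collect, and saveAsTextFile (see the comparison table below). A sketch reusing the RDDs from the transformation example above; the output path is hypothetical:

```python
# Actions trigger evaluation of the lineage built by the transformations.
n = joined_rdd.count()                 # number of elements, returned to the driver
rows = joined_rdd.collect()            # all elements, pulled back to the driver as a list
pairs_rdd.saveAsTextFile("out_dir")    # write one file per partition to a (distributed) file system
```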

Spark operations and lazy evaluation

token_1_RDD = tokensRDD.map(lambda x: (x, 1))
token_1_RDD  # object created, but NOT evaluated/computed, since Spark does not yet know how best to partition it for the following steps
token_count_RDD = token_1_RDD.reduceByKey(lambda x, y: x + y, 4)
token_count_RDD  # object created, but NOT evaluated/computed; however, Spark now knows that the output
token_count_RDD  # is better partitioned by keys into 4 chunks
token_count_RDD.saveAsTextFile("/storage/home/rmm7011/Lab2/token_count.txt")  # action: triggers evaluation of the whole lineage above

Difference between RDDs and Objects

PySpark

Two Modes of Running Spark:

Driver program and SparkContext

SparkContext

SparkContext serves as the entry point for interacting with a Spark cluster. It represents the connection to the Spark cluster manager and coordinates the execution of distributed computations across the cluster. The SparkContext provides the necessary configuration settings for your Spark application and manages the resources, such as memory and computing cores, needed for executing Spark jobs.
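
A minimal sketch of creating a SparkContext for a local run (the application name, master setting, and input path are assumptions):

```python
from pyspark import SparkConf, SparkContext

# Configuration: application name, master URL, and (optionally) resource settings.
conf = (SparkConf()
        .setAppName("ds410-lazy-eval-demo")    # hypothetical app name
        .setMaster("local[4]"))                # 4 local worker threads; a cluster URL would go here instead

sc = SparkContext(conf=conf)                   # entry point: connects to the cluster manager

tokensRDD = sc.textFile("/path/to/input.txt")  # hypothetical input path; creates an RDD lazily

sc.stop()                                      # release the resources when the application is done
```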

Terminology

Vanilla MapReduce vs. Spark’s MapReduce

| Plan-ahead in MapReduce | Lazy Evaluation in Spark | Impacts of Spark Innovations |
| --- | --- | --- |
| Map workers | Executors assigned to perform a transformation | Transformation is broader than Map, including map, filter, reading an input file, etc. |
| Reduce workers | Executors assigned to perform an action | Action is broader than Reduce, including count, collect, and saving output to a distributed file system. |
| Partition of keys into R output files of the Mappers | Various ways to optimize: (1) partitioning data into RDDs, (2) assigning tasks (RDD + transformation/action) to executors, (3) key-based partitioning of key-value pairs, and (4) saving/reusing RDDs across iterations with persist/cache | A more flexible, powerful, developer-friendly programming model for Big Data with effective and scalable resource allocation. |
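
Point (4) above, saving and reusing an RDD across iterations, can be sketched with persist/cache; this reuses the token-count example from earlier, and the thresholds are illustrative:

```python
# cache() (equivalently, persist() with the default storage level) tells Spark to keep
# the evaluated RDD in memory instead of recomputing its lineage for every action.
token_count_RDD = token_1_RDD.reduceByKey(lambda x, y: x + y, 4).cache()

for threshold in (1, 5, 10):
    # Each count() is an action; after the first one, the cached RDD is reused.
    print(threshold, token_count_RDD.filter(lambda kv: kv[1] >= threshold).count())
```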

Spark in local mode