MapReduce and Lazy Evaluation in Spark
Related: Software engineering | Cloud computing
Key Concepts: Understanding how Spark optimizes big data processing through lazy evaluation and RDD transformations.
Conventional Evaluation
Each statement in a program is executed immediately after it is loaded into the CPU.
Lazy Evaluation
A statement in a program is NOT executed immediately; its execution is postponed until after some other statements have been evaluated.
Lazy evaluation helps Spark optimize its computation. Remember the adage:
“Computation movement is cheaper than data movement”
RDDs and lazy evaluation
- A graph of data transformations/actions forms a lineage chain.
- Transformations are not evaluated immediately. Physical resources (memory, executors) for an RDD are not allocated until the action step is evaluated (i.e., lazy evaluation).
- All information necessary for computing a transformation is stored in its RDD.
- An RDD contains both computing information and data-partition information.
- The lineage chain is constructed incrementally, with one RDD added as each transformation is declared.
- When a lineage chain reaches an action, it executes all of the transformations earlier on the chain (see the sketch below).
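A minimal sketch of this behavior, assuming an existing SparkContext named sc (as in the pyspark shell) and a hypothetical input file lines.txt:
lines_RDD = sc.textFile("lines.txt")                      # transformation: the read is recorded, not performed
long_RDD = lines_RDD.filter(lambda line: len(line) > 20)  # transformation: recorded, not performed
print(long_RDD.count())                                   # action: triggers the whole chain (read -> filter -> count)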
RDDs and immutability
- All transformation operations in Spark generate RDDs. RDDs are immutable.
- Invoking computation of a transformation-action chain does NOT change the input RDDs, but generates new RDDs.
- An RDD can be used in multiple transformation-action chains (see the sketch below).
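A small sketch of such reuse, assuming an existing SparkContext sc; neither chain modifies the parent RDD:
nums_RDD = sc.parallelize([1, 2, 3, 4, 5])
squares = nums_RDD.map(lambda x: x * x).collect()            # chain 1: new RDD, then an action
even_count = nums_RDD.filter(lambda x: x % 2 == 0).count()   # chain 2: another new RDD, then an action
# nums_RDD itself is unchanged; each chain produced new RDDs from the same immutable parent.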
RDDs and data dependencies
- Narrow (i.e., loose) dependency:
  - Example: map. Each child partition uses only its corresponding parent partition.
  - One-to-one mapping from partitions in the parent to partitions in the child.
  - More efficient to pipeline.
  - Recovery and transformations are faster.
- Wide (i.e., strong) dependency:
  - Example: join. Many-to-many mapping from partitions in the parent to partitions in the child.
  - More costly, since data must be shuffled across partitions (remember "race conditions").
  - Recovery may require complete re-execution of the pipeline.
RDDs and data dependencies
Narrow dependencies:
Each partition of the parent RDD is used by at most one partition of the child RDD.

Wide dependencies:
Each child partition depends on multiple partitions of the parent RDD.
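A sketch contrasting the two kinds of dependency, assuming an existing SparkContext sc: map preserves partition boundaries, while groupByKey and join must shuffle records across partitions.
pairs_RDD = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
mapped_RDD = pairs_RDD.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: each child partition reads exactly one parent partition
grouped_RDD = pairs_RDD.groupByKey()                         # wide: records with the same key are shuffled into one child partition
other_RDD = sc.parallelize([("a", "x"), ("b", "y")], 2)
joined_RDD = pairs_RDD.join(other_RDD)                       # wide: many-to-many dependency between parent and child partitions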

Key-value RDDs
- Enable grouping, counting, and other aggregations by key.
- A key-value RDD can be generated by map or flatMap (explained later) whose lambda function returns a two-element pair: (key, value).
- A key-value RDD can be aggregated using:
  - reduceByKey
  - groupByKey
- You will see examples of this in your second Lab (a small sketch follows below).
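A minimal key-value sketch in that spirit, assuming an existing SparkContext sc:
tokens_RDD = sc.parallelize(["to", "be", "or", "not", "to", "be"])
pairs_RDD = tokens_RDD.map(lambda token: (token, 1))      # (key, value) pairs
counts_RDD = pairs_RDD.reduceByKey(lambda x, y: x + y)    # sums the values for each key
groups_RDD = pairs_RDD.groupByKey().mapValues(list)       # collects the values for each key
print(counts_RDD.collect())   # e.g., [('to', 2), ('be', 2), ('or', 1), ('not', 1)]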
Spark operations
- Steps that can be postponed are called transformations (e.g., map, reduceByKey).
- Transformations generate an output RDD from one or more input RDD(s).
- Steps that can NOT be postponed are called actions (e.g., saveAsTextFile).
- Actions aggregate one or more RDDs to return a result or write to storage.
- The execution of a transformation is postponed until an action uses the output of that transformation (directly or indirectly).
- Consequently, information later in a pipeline can be used to optimize partitioning, task assignment (to executors), etc. for a transformation step earlier in the pipeline.
Examples of transformations
- .map(function f): passes each element of the input RDD through f; returns an output RDD with the same structure.
- .flatMap(function f): passes each element of the input RDD through f, then flattens the results; returns an output RDD of the flattened list (the structure may change).
- .filter(function f): returns an output RDD that (after lazy evaluation) includes the elements of the input RDD for which f is TRUE.
- .join(inputRDD2): joins two input RDDs on their keys (after lazy evaluation); returns an output RDD.
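A short sketch exercising each of these transformations, assuming an existing SparkContext named sc; the data here is illustrative:
lines_RDD = sc.parallelize(["hello world", "hello spark"])
upper_RDD = lines_RDD.map(lambda line: line.upper())            # one output element per input element
tokens_RDD = lines_RDD.flatMap(lambda line: line.split())       # flattens the per-line token lists
hello_RDD = tokens_RDD.filter(lambda token: token == "hello")   # keeps elements for which f is True
pairs_RDD = tokens_RDD.map(lambda token: (token, 1))
labels_RDD = sc.parallelize([("hello", "greeting"), ("spark", "engine")])
joined_RDD = pairs_RDD.join(labels_RDD)                         # joins the two key-value RDDs on their keys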
Examples of actions
- .collect(): returns the actual content of the RDD.
- .count(): returns the total number of elements in the input RDD.
- .take(n): returns the first n elements of the input RDD.
- .saveAsTextFile(path): saves the content of the RDD in the specified directory.
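A self-contained sketch of these actions under the same assumption (an existing SparkContext sc); the output directory name is hypothetical:
words_RDD = sc.parallelize(["to", "be", "or", "not", "to", "be"])
print(words_RDD.collect())   # ['to', 'be', 'or', 'not', 'to', 'be']
print(words_RDD.count())     # 6
print(words_RDD.take(3))     # ['to', 'be', 'or']
words_RDD.saveAsTextFile("words_out")   # hypothetical output directory; one part file is written per partition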
Spark operations and lazy evaluations
token_1_RDD = tokensRDD.map(lambda x: (x, 1))
token_1_RDD  # object created, but NOT evaluated/computed, since we do not yet know how best to partition it for the following steps
token_count_RDD = token_1_RDD.reduceByKey(lambda x, y: x + y, 4)
token_count_RDD  # object created, but NOT evaluated/computed; however, Spark now knows the output
token_count_RDD  # is better partitioned by key into 4 chunks
token_count_RDD.saveAsTextFile("/storage/home/rmm7011/Lab2/token_count.txt")
Difference between RDDs and Objects
- An assignment statement to an object typically computes the value/content of the object and stores it in the object.
- However, an assignment statement to an RDD in Spark does NOT compute the value/content of the RDD.
- The assignment statement only creates an RDD object that contains the vital information needed (later) for computing the RDD:
- The input data (another RDD(s) or the path of an input file)
- The method (e.g., map, flatMap) of the transformation
- The function (e.g., lambda function) used in the transformation
- How the output data should be partitioned, and how many partitions.
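A sketch that makes this difference visible, assuming an existing SparkContext sc: toDebugString prints the recorded lineage (the "recipe"), not computed data.
pairs_RDD = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts_RDD = pairs_RDD.reduceByKey(lambda x, y: x + y, 2)
# No values have been computed yet; counts_RDD only records its input RDD, the method
# (reduceByKey), the lambda function, and the requested partitioning (2 partitions).
print(counts_RDD.toDebugString())      # the recorded lineage chain
print(counts_RDD.getNumPartitions())   # 2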
PySpark
Two Modes of Running Spark:
- Local Mode: PySpark in a Jupyter Notebook (for developing and debugging code using a small dataset)
- Cluster Mode: spark-submit in a YARN/HDFS/standalone cluster (for deploying, evaluating, and optimizing Spark to process Big Data). We will use a standalone cluster, since ICDS uses Slurm.
Driver program and SparkContext
- Spark's driver program is like the master node in MapReduce.
- A special object, called SparkContext, provides critical information about the job.
SparkContext
SparkContext serves as the entry point for interacting with a Spark cluster. It represents the connection to the Spark cluster manager and coordinates the execution of distributed computations across the cluster. The SparkContext provides the necessary configuration settings for your Spark application and manages the resources, such as memory and computing cores, needed for executing Spark jobs.
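A minimal setup sketch; the application name, master URL, and memory setting below are illustrative placeholders, not required values:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("Lab2WordCount")          # hypothetical application name
        .setMaster("local[*]")                # local mode; a cluster would use its manager's URL instead
        .set("spark.executor.memory", "2g"))  # example resource setting
sc = SparkContext(conf=conf)
# sc is now the entry point for creating RDDs, e.g., sc.textFile(...) or sc.parallelize(...)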
Terminology
- A Node: a physical machine in an HPC cluster, typically consisting of multiple processors.
- Executor: like a worker in MapReduce; each executor can be assigned one or more Spark tasks.
- A Spark task is a transformation operation or an action operation coupled with the data partition associated with it.
- Multiple tasks can be assigned to the same executor.
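A small sketch of the partition/task relationship, assuming an existing SparkContext sc: each partition of an RDD corresponds to one task in the stage that processes it, and tasks are distributed among executors.
nums_RDD = sc.parallelize(range(8), 4)   # 4 partitions -> 4 tasks per stage
print(nums_RDD.getNumPartitions())       # 4
print(nums_RDD.glom().collect())         # e.g., [[0, 1], [2, 3], [4, 5], [6, 7]]
# Each of the 4 tasks is assigned to an executor; one executor may run several tasks.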
Vanilla MapReduce vs. Spark’s MapReduce
| Plan-ahead in MapReduce | Lazy Evaluation in Spark | Impacts of Spark Innovations |
|---|---|---|
| Map Workers | Executors assigned to perform a transformation | Transformation is broader than Map, including map, filter, read an input file, etc. |
| Reduce Workers | Executors assigned to perform an action | Action is broader than Reduce, including count, collect, save output in a distributed file system. |
| Partition of keys into R output files of Mappers | Various ways to optimize: (1) partitioning data into RDDs, (2) assigning tasks (RDD + transformation/action) to executors, (3) key-based partitioning of key-value pairs, and (4) saving/reusing RDDs across iterations using persist/cache | A more flexible, powerful, developer-friendly programming model for Big Data with effective and scalable resource allocation |
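As a sketch of optimization (4) above, persist/cache lets later chains reuse an already-computed RDD (assuming an existing SparkContext sc and a hypothetical input file):
from pyspark import StorageLevel

base_RDD = sc.textFile("lines.txt").flatMap(lambda line: line.split())   # hypothetical input file
base_RDD.persist(StorageLevel.MEMORY_ONLY)   # equivalently: base_RDD.cache()
print(base_RDD.count())              # first action: computes and caches the partitions
print(base_RDD.distinct().count())   # later action: reuses the cached partitions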
Spark in local mode
- Runs on a single computer
- Uses a small part of the total data
- Useful for debugging as you can work with a small dataset and test your features
- The code should still be designed to scale up (using multiple cores for processing Big Data) in a cluster.