
🚀 Complete Guide to Apache Spark Memory Management 🚀
Apache Spark is widely known for its powerful in-memory computations, which can
significantly speed up big data processing. However, to truly harness this power,
understanding how Spark manages memory is crucial. Efficient memory management is the
key to optimizing Spark jobs and avoiding costly out-of-memory (OOM) errors or
performance bottlenecks.

This guide provides a comprehensive breakdown of Spark memory management, focusing on practical examples to help you manage memory effectively, whether you're a beginner or an experienced user.

🔥1. Overview of Spark Memory Management 🔥


When a Spark job is executed, two main components handle memory management:

● Driver: Manages task scheduling and coordination.
● Executors: Run tasks on worker nodes and process the data.

Each executor has a set amount of memory allocated, typically configured using the spark.executor.memory setting. This memory is divided into different sections, each serving a distinct role.
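For illustration, here's a minimal way to set that allocation when creating a PySpark session; the app name and the 4g value are arbitrary examples, not recommendations:

from pyspark.sql import SparkSession

# Request a 4GB heap for every executor JVM in this application.
spark = (
    SparkSession.builder
        .appName("memory-demo")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
)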

Each executor's memory is a fixed allocation managed by the JVM (Java Virtual Machine). Spark divides this memory into different regions, mainly:

● Execution Memory: Used for tasks like shuffling, sorting, and aggregating data.
● Storage Memory: Used for caching and persisting data (like DataFrames or RDDs).

These two memory areas share the same pool, which is dynamically allocated based on the workload. Let's break this down further.
🔥2. Executor Memory Breakdown 🔥
The memory allocated to each executor is split into several regions, each
responsible for different tasks. For a 4GB executor, here’s how the memory would
be divided:

💡2.1 Reserved Memory💡


Reserved memory is a small portion of memory (typically 300MB) that Spark uses
for internal operations like tracking metadata or task execution information. It is
non-configurable and ensures that essential system tasks have enough space to
operate.

Example of Metadata/Bookkeeping:

● Task Metadata: Keeps track of each task's progress, including its start/end
times, input/output data, and shuffle metrics.
● Partition Metadata: Maintains information about how data is split across
partitions on different nodes.

These small but important pieces of metadata ensure that Spark can efficiently
coordinate tasks across the cluster. Reserved memory acts as a safety buffer,
preventing the entire system from crashing due to insufficient memory.

💡2.2 User Memory💡


After the reserved memory is accounted for, the remaining memory is divided
between Spark Memory and User Memory.

User Memory is the part of memory available for user-defined objects, data
structures, and transformations. Spark does not directly manage this memory,
leaving it to the user. You might use user memory when defining custom
aggregations, UDFs (User-Defined Functions), or creating your own data structures
in transformations.
🧠Formula for User Memory Calculation🧠:

User Memory is the memory left over after Spark allocates Reserved Memory and Spark Memory. Here's the formula:

User Memory = (Total Memory - Reserved Memory) * (1 - spark.memory.fraction)

Example Calculation (4GB Executor):

Let’s calculate User Memory for a 4GB executor:

● Total Memory = 4096MB
● Reserved Memory = 300MB
● spark.memory.fraction = 0.6

Using the formula:

● User Memory = (4096MB - 300MB) * (1 - 0.6)
● User Memory = 3796MB * 0.4 = 1518.4MB (approx 1.5GB)
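The same arithmetic as a quick Python sanity check (the 4096MB heap is just the example size used above):

total_mb = 4096        # spark.executor.memory
reserved_mb = 300      # fixed reserved memory
memory_fraction = 0.6  # spark.memory.fraction (default)

user_memory_mb = (total_mb - reserved_mb) * (1 - memory_fraction)
print(user_memory_mb)  # 1518.4 MB, roughly 1.5GB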

Examples of User Memory Usage:

User-Defined Functions (UDFs): UDFs can consume significant memory, especially when performing complex operations:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def multiply(x):
    return x * 2

# Declare the return type so Spark doesn't fall back to strings.
multiply_udf = udf(multiply, IntegerType())

# df is an existing DataFrame with a column named "existing_col".
df.withColumn("new_col", multiply_udf(df["existing_col"])).show()

The intermediate data and function outputs consume User Memory, which is why it's important to be cautious when using UDFs in large-scale jobs.

Custom Aggregation Logic: If you write your own logic using mapPartitions() for aggregation, you might need to maintain custom objects (like hash maps) to keep track of intermediate results.

def custom_aggregate(iterator):
    # Accumulate a running sum per key in a plain Python dict;
    # this dict lives in User Memory, not in Spark-managed memory.
    result = {}
    for record in iterator:
        key = record[0]
        value = record[1]
        if key not in result:
            result[key] = value
        else:
            result[key] += value
    return iter(result.items())

In this example, the result dictionary stores intermediate results during the
aggregation. The memory used by this custom object (a Python dictionary) comes
from User Memory.
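To show where this runs, here's a hypothetical way to apply it with mapPartitions() over an RDD of (key, value) pairs; the sample data is made up and a SparkSession named spark is assumed to exist:

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
partial_sums = rdd.mapPartitions(custom_aggregate)
print(partial_sums.collect())  # per-partition partial sums; keys may repeat across partitions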
💡2.3 Spark Memory💡
Spark Memory is the core part of executor memory, split into two sections:
Execution Memory and Storage Memory. These sections handle Spark's internal
operations and can dynamically borrow from each other based on the task's needs.

🌟Execution Memory:🌟
This part is responsible for storing intermediate data during tasks like shuffles,
joins, and aggregations. For example, if Spark is performing a sort or join, it uses
execution memory to hold temporary buffers for sorting data before writing it to disk
or network.

🌟Storage Memory:🌟
This section is used to cache data for reuse. When you call cache() or persist()
on a DataFrame, Spark stores the cached data in Storage Memory for faster
retrieval during subsequent actions.

Key Point: Spark uses a unified memory model, meaning execution and storage
memory share the same pool. If one memory section needs more space, it can
borrow from the other as long as there’s available memory.

Spark Memory = (Total Memory - Reserved Memory) * spark.memory.fraction

💡3. Unified Memory Model (Execution vs. Storage Memory)💡
In Spark’s Unified Memory Model, execution memory and storage memory share
the same pool. This allows Spark to dynamically allocate memory to either execution
or storage based on the needs of the current task. This dynamic nature is a key
feature of Spark's memory management, as it provides flexibility in handling different
workloads.

🧠3.1 Execution Memory🧠


This is where the majority of computations in Spark actually occur. When we talk
about Execution Memory, we're referring to the memory space where Spark
performs its data processing operations. This includes:

● Shuffles (redistributing data across partitions)
● Joins
● Sorting
● Aggregations
● Any other data transformations or actions

🧠Formula for Execution Memory🧠:

The amount of memory reserved for execution depends on the Usable Memory and the configuration of spark.memory.storageFraction:

● Execution Memory = Usable Memory * (1 - spark.memory.storageFraction)
● Usable Memory: This is the memory left after Reserved Memory and User Memory have been accounted for.
● spark.memory.storageFraction: Defines the percentage of the usable memory dedicated to Storage Memory.

Example Calculation (4GB Executor):

With a 4GB executor and default settings (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5):

● Usable Memory = (4096MB - 300MB reserved) * 0.6 = 2277.6MB
● Execution Memory = 2277.6MB * (1 - 0.5) = 1138.8MB (approx 1.11GB)

This means that approximately 1.1GB of memory will be available for execution tasks such as shuffles and aggregations.

🧠3.2 Storage Memory🧠


Storage memory is used for caching or persisting data that needs to be reused
across multiple stages of a job. For example, when you call .cache() or
.persist() on an RDD or DataFrame, Spark keeps the cached data in Storage
Memory. If the data exceeds the available memory, Spark starts spilling the cached
data to disk to free up memory for other tasks.

🧠Formula for Storage Memory🧠:

Storage memory is calculated based on the Usable Memory and the configuration
of spark.memory.storageFraction:

● Storage Memory = Usable Memory * spark.memory.storageFraction
● Usable Memory: This is the same pool of memory used by execution memory.
● spark.memory.storageFraction: Defines the percentage of the usable memory allocated to Storage Memory.

Example Calculation (4GB Executor):

Using the same 4GB executor with default settings (spark.memory.storageFraction = 0.5):

● Usable Memory = 2277.6MB
● Storage Memory = 2277.6MB * 0.5 = 1138.8MB (approx 1.11GB)

Thus, approximately 1.1GB of memory will be available for storage, i.e., caching and persisting DataFrames or RDDs.
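For instance, here's a hypothetical snippet that places a DataFrame in Storage Memory and lets Spark spill to disk when the cache doesn't fit; the DataFrame df is assumed to exist already:

from pyspark import StorageLevel

# MEMORY_AND_DISK keeps what fits in Storage Memory and spills the rest to disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # an action is needed to actually materialize the cache
df.unpersist()  # release Storage Memory once the data is no longer needed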

🧠3.3 Dynamic Memory Allocation (Unified Model)🧠


The key feature of Spark's Unified Memory Model is the dynamic allocation of
memory between execution memory and storage memory. If one of the memory
pools (execution or storage) is running low on available memory, it can borrow from
the other pool, provided the other pool is not using it for active tasks.

Dynamic Sharing Example:

Let’s consider a scenario where you're caching several large DataFrames. As the job
progresses:

● Storage Memory is full: Spark may start spilling cached data to disk if it runs
out of storage memory. However, if Execution Memory is underutilized (e.g.,
there are no ongoing shuffle operations), Spark can borrow memory from the
execution pool to hold more cached data.

Conversely:

● If Execution Memory needs more space for intermediate operations (like shuffles or joins), and Storage Memory has free space (e.g., no additional data is being cached), Spark can borrow storage memory for the ongoing execution task.

This dynamic sharing between the two memory pools is referred to as the dynamic
occupancy mechanism.

🔥Unified Memory Model: Dynamic Sharing🔥


The Unified Memory Model in Spark allows for flexible sharing of memory
between Execution and Storage based on workload requirements. Here's a
breakdown of how this works in real time:

1. Execution Memory borrows from Storage:
○ If a task involving shuffles or aggregations needs more memory for intermediate results, Spark can dynamically reduce the size of the Storage Memory pool and allocate more memory to Execution Memory.
2. Storage Memory borrows from Execution:
○ If a job requires caching large datasets and there is idle Execution Memory (i.e., no active shuffles or aggregations), Spark can borrow that memory for caching without impacting ongoing computations.

Dynamic Sharing Example in Practice:

Let’s assume you're running a job that needs to cache several large DataFrames. If
the Storage Memory is fully utilized and Spark starts spilling cached data to disk, it
can borrow from the Execution Memory pool to prevent performance degradation
(assuming there are no active shuffle operations). This reduces the frequency of disk
IO operations, enhancing the performance of the job.

Similarly, if a job requires significant Execution Memory for large shuffles or joins,
and the Storage Memory isn’t fully utilized (e.g., minimal caching), Spark will borrow
from the Storage Memory pool to complete the task without spilling intermediate
data to disk.

This dynamic allocation of memory resources ensures that Spark can flexibly
handle varying workloads without hard boundaries between memory pools, resulting
in better performance and fewer memory-related issues.

💡4. Key Spark Memory Configurations💡


Here are the essential configurations you need to tune Spark’s memory
management effectively:

4.1 spark.executor.memory

● What it does: Sets the total amount of memory allocated to each executor.
● Example: spark.executor.memory=8g will allocate 8GB to each
executor.

4.2 spark.memory.fraction

● What it does: Specifies the fraction of executor memory (after Reserved Memory) used for Spark's execution and storage memory.
● Default: 0.6, meaning 60% of that memory is reserved for Spark's execution and storage pools, while the rest is left for user-defined operations and overhead.

4.3 spark.memory.storageFraction

● What it does: Sets the fraction of Spark memory allocated to storage tasks
(like caching). The remainder goes to execution memory.
● Default: 0.5, meaning storage and execution memory are evenly split.

4.4 spark.executor.memoryOverhead

● What it does: Allocates additional memory for overhead tasks (e.g., Python
processes in PySpark). This prevents executors from running out of memory
due to non-JVM tasks.
● Example: Increasing spark.executor.memoryOverhead is useful when
using PySpark or other non-JVM languages.
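Putting these together, here's a hypothetical PySpark session that sets all four options at once; the sizes and fractions are illustrative examples, not tuned recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("memory-tuning-demo")
        .config("spark.executor.memory", "8g")          # total heap per executor
        .config("spark.executor.memoryOverhead", "2g")  # non-JVM / overhead headroom
        .config("spark.memory.fraction", "0.6")         # share of usable heap for Spark memory
        .config("spark.memory.storageFraction", "0.5")  # storage vs. execution split
        .getOrCreate()
)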

💡5. Real-World Example: 4GB Executor Setup💡


Let’s break down memory allocation for an executor with 4GB of total memory using
the default configurations:

Total Executor Memory: 4GB

● Reserved Memory: 300MB
● Memory after Reserved Memory: 4096MB - 300MB = 3796MB (approx 3.7GB)

User Memory:

● Formula: (Total Memory - Reserved Memory) * (1 - spark.memory.fraction)
● Calculation: (4096MB - 300MB) * (1 - 0.6) = 1518.4MB (approx 1.48GB)

Spark Memory (Execution + Storage):

● Formula: (Total Memory - Reserved Memory) * spark.memory.fraction
● Calculation: (4096MB - 300MB) * 0.6 = 2277.6MB (approx 2.22GB)

Execution Memory:

● Formula: Spark Memory * (1 - spark.memory.storageFraction)
● Calculation: 2277.6MB * (1 - 0.5) = 1138.8MB (approx 1.11GB)

Storage Memory:

● Formula: Spark Memory * spark.memory.storageFraction
● Calculation: 2277.6MB * 0.5 = 1138.8MB (approx 1.11GB)
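The same breakdown can be reproduced with a small helper. This is a minimal sketch of the formulas above, not how Spark computes its pools internally:

def memory_breakdown(total_mb, reserved_mb=300,
                     memory_fraction=0.6, storage_fraction=0.5):
    # Apply the formulas from sections 2 and 3 to a single executor heap.
    usable_mb = total_mb - reserved_mb
    spark_mb = usable_mb * memory_fraction
    return {
        "user_memory_mb": round(usable_mb * (1 - memory_fraction), 1),
        "spark_memory_mb": round(spark_mb, 1),
        "execution_memory_mb": round(spark_mb * (1 - storage_fraction), 1),
        "storage_memory_mb": round(spark_mb * storage_fraction, 1),
    }

print(memory_breakdown(4096))
# {'user_memory_mb': 1518.4, 'spark_memory_mb': 2277.6,
#  'execution_memory_mb': 1138.8, 'storage_memory_mb': 1138.8}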
💡6. Troubleshooting Common Memory Issues💡
Problem 1: OutOfMemory Errors

● Cause: Executor memory is insufficient for large datasets or shuffles.
● Solution: Increase spark.executor.memory or optimize the data pipeline (e.g., by using MEMORY_AND_DISK persistence).

Problem 2: Garbage Collection Delays

● Cause: Too much time is spent in garbage collection.
● Solution: Tune the JVM GC options or enable off-heap memory (spark.memory.offHeap.enabled).

Problem 3: Slow Caching or Frequent Recalculation

● Cause: Storage memory is too small to hold cached data.
● Solution: Increase spark.memory.storageFraction or use a persistence mode that allows spilling to disk (e.g., MEMORY_AND_DISK).
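For Problem 2, off-heap memory is enabled with two related settings; the snippet below is a minimal sketch, and the 2g size is purely illustrative:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("offheap-demo")
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "2g")  # required whenever off-heap is enabled
        .getOrCreate()
)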

7. Conclusion
Spark memory management can seem complex, but by understanding how memory
is divided and managed between tasks, storage, and custom operations, you can
optimize your jobs for performance and stability. The key is to balance execution
and storage memory, carefully manage user memory (especially with UDFs and
custom objects), and adjust memory configurations to fit your workloads.

With practical examples and detailed breakdowns, you now have the tools to handle
any scenario involving Spark memory management.
