Complete Guide To Spark Memory Management 1726709042
🚀1. Memory Management 🚀
Apache Spark is widely known for its powerful in-memory computations, which can
significantly speed up big data processing. However, to truly harness this power,
understanding how Spark manages memory is crucial. Efficient memory management is the
key to optimizing Spark jobs and avoiding costly out-of-memory (OOM) errors or
performance bottlenecks.
Each executor in Spark has a fixed amount of memory allocated to it, which is
managed by the JVM (Java Virtual Machine). Spark divides this memory into
different regions, mainly:
● Execution Memory: holds intermediate data for shuffles, joins, sorts, and aggregations.
● Storage Memory: holds cached or persisted data.
These two memory areas share the same pool, which is dynamically allocated
based on the workload. Let's break this down further.
🔥2. Executor Memory Breakdown 🔥
The memory allocated to each executor is split into several regions, each
responsible for different tasks. For a 4GB executor with the default settings, the
memory would be divided roughly as follows:
● Reserved Memory: a fixed 300MB for Spark's internal objects.
● User Memory: roughly 1.5GB for user-defined objects and data structures.
● Spark Memory: roughly 2.2GB, shared between Execution and Storage Memory.
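This split can be sketched as simple arithmetic. The sketch below assumes the default values spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5, and the fixed 300MB of Reserved Memory; the numbers are approximate, not exact runtime figures:

```python
# Back-of-the-envelope breakdown for a 4GB (4096MB) executor,
# assuming default spark.memory.fraction=0.6 and
# spark.memory.storageFraction=0.5.

heap_mb = 4096
reserved_mb = 300                       # fixed Reserved Memory
usable_mb = heap_mb - reserved_mb       # 3796 MB left after the reserve
spark_mem_mb = usable_mb * 0.6          # ~2278 MB for execution + storage
user_mem_mb = usable_mb * 0.4           # ~1518 MB of User Memory
storage_mb = spark_mem_mb * 0.5         # ~1139 MB of Storage Memory
execution_mb = spark_mem_mb * 0.5       # ~1139 MB of Execution Memory

print(round(spark_mem_mb), round(user_mem_mb), round(storage_mb))
# -> 2278 1518 1139
```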
💡2.1 Reserved Memory💡:
Reserved Memory is a fixed amount (300MB per executor) that Spark sets aside for
its own internal objects and bookkeeping.
Example of Metadata/Bookkeeping:
● Task Metadata: Keeps track of each task's progress, including its start/end
times, input/output data, and shuffle metrics.
● Partition Metadata: Maintains information about how data is split across
partitions on different nodes.
These small but important pieces of metadata ensure that Spark can efficiently
coordinate tasks across the cluster. Reserved memory acts as a safety buffer,
preventing the entire system from crashing due to insufficient memory.
💡2.2 User Memory💡:
User Memory is the part of memory available for user-defined objects, data
structures, and transformations. Spark does not directly manage this memory,
leaving it to the user. You might use user memory when defining custom
aggregations, UDFs (User-Defined Functions), or creating your own data structures
in transformations.
🧠Formula for User Memory Calculation🧠:
User Memory is calculated as the memory leftover after Spark allocates memory for Spark
Memory and Reserved Memory. Here’s the formula:
User Memory = (Executor Memory − Reserved Memory) × (1 − spark.memory.fraction)
For a 4GB executor with the default spark.memory.fraction of 0.6, that is
(4096MB − 300MB) × 0.4 ≈ 1518MB, or roughly 1.5GB.
Example: Using a UDF (User-Defined Function):
from pyspark.sql.functions import udf

def multiply(x):
    return x * 2

multiply_udf = udf(multiply)
df.withColumn("new_col", multiply_udf(df["existing_col"])).show()
The intermediate data and function outputs consume User Memory, which is why it’s
important to be cautious when using UDFs in large-scale jobs.
Custom Aggregation Logic: If you write your own logic using
mapPartitions() for aggregation, you might need to maintain custom objects
(like hash maps) to keep track of intermediate results.
def custom_aggregate(iterator):
    result = {}
    for record in iterator:          # iterate over the partition's records
        key = record[0]
        value = record[1]
        if key not in result:
            result[key] = value      # first time we see this key
        else:
            result[key] += value     # accumulate into the running total
    return iter(result.items())
In this example, the result dictionary stores intermediate results during the
aggregation. The memory used by this custom object (a Python dictionary) comes
from User Memory.
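Because the per-partition function is plain Python, you can exercise it outside Spark by feeding it an ordinary iterator of (key, value) records, which makes the User Memory cost of the dictionary easy to reason about:

```python
# The per-partition aggregation logic, runnable without Spark.
def custom_aggregate(iterator):
    result = {}                      # intermediate state lives in User Memory
    for record in iterator:
        key, value = record[0], record[1]
        if key not in result:
            result[key] = value
        else:
            result[key] += value
    return iter(result.items())

records = [("a", 1), ("b", 2), ("a", 3)]
print(dict(custom_aggregate(iter(records))))   # -> {'a': 4, 'b': 2}
```

In a real job you would pass the same function to rdd.mapPartitions(custom_aggregate), and each executor would hold one such dictionary per partition.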
💡2.3 Spark Memory💡:
Spark Memory is the core part of executor memory, split into two sections:
Execution Memory and Storage Memory. These sections handle Spark's internal
operations and can dynamically borrow from each other based on the task's needs.
🌟Execution Memory:🌟
This part is responsible for storing intermediate data during tasks like shuffles,
joins, and aggregations. For example, if Spark is performing a sort or join, it uses
execution memory to hold temporary buffers for sorting data before writing it to disk
or sending it over the network.
🌟Storage Memory:🌟
This section is used to cache data for reuse. When you call cache() or persist()
on a DataFrame, Spark stores the cached data in Storage Memory for faster
retrieval during subsequent actions.
Key Point: Spark uses a unified memory model, meaning execution and storage
memory share the same pool. If one memory section needs more space, it can
borrow from the other as long as there’s available memory.
Storage memory is calculated based on the Usable Memory and the configuration
of spark.memory.storageFraction:
Storage Memory = (Executor Memory − Reserved Memory) × spark.memory.fraction × spark.memory.storageFraction
For our 4GB executor with the defaults (0.6 and 0.5), that is
(4096MB − 300MB) × 0.6 × 0.5 ≈ 1139MB. Thus, approximately 1.1GB of memory
will be available for storage, i.e., caching and persisting DataFrames or RDDs.
🔥3. The Dynamic Occupancy Mechanism 🔥
Let’s consider a scenario where you're caching several large DataFrames. As the job
progresses:
● Storage Memory is full: Spark may start spilling cached data to disk if it runs
out of storage memory. However, if Execution Memory is underutilized (e.g.,
there are no ongoing shuffle operations), Spark can borrow memory from the
execution pool to hold more cached data.
Conversely:
● Execution Memory is under pressure: if a large shuffle or join needs more
space and cached blocks are occupying it, Spark can evict cached blocks from
Storage Memory to make room for execution.
This dynamic sharing between the two memory pools is referred to as the dynamic
occupancy mechanism.
Let’s assume you're running a job that needs to cache several large DataFrames. If
the Storage Memory is fully utilized, Spark can borrow from the Execution Memory
pool (assuming there are no active shuffle operations) before spilling cached data to
disk. This reduces the frequency of disk IO operations, enhancing the performance
of the job.
Similarly, if a job requires significant Execution Memory for large shuffles or joins,
and the Storage Memory isn’t fully utilized (e.g., minimal caching), Spark will borrow
from the Storage Memory pool to complete the task without spilling intermediate
data to disk.
This dynamic allocation of memory resources ensures that Spark can flexibly
handle varying workloads without hard boundaries between memory pools, resulting
in better performance and fewer memory-related issues.
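The borrowing rules can be illustrated with a toy model. This is a deliberately simplified sketch, not Spark's actual implementation (real Spark also protects a storageFraction-sized region of cached data from eviction): storage may only take free space, while execution may evict cached blocks to grow.

```python
# Toy model of the unified memory pool (NOT Spark's real code).
class UnifiedPool:
    def __init__(self, total):
        self.total = total      # total Spark Memory (execution + storage)
        self.execution = 0      # units used by shuffles/joins/sorts
        self.storage = 0        # units used by cached blocks

    def acquire_storage(self, n):
        """Caching may only use free space; it never evicts execution data."""
        if self.execution + self.storage + n <= self.total:
            self.storage += n
            return True
        return False            # caller would spill the block to disk instead

    def acquire_execution(self, n):
        """Execution may evict cached blocks to make room for itself."""
        free = self.total - self.execution - self.storage
        if n > free:
            evicted = min(self.storage, n - free)
            self.storage -= evicted          # drop cached blocks
        if self.execution + self.storage + n <= self.total:
            self.execution += n
            return True
        return False            # still not enough room -> spill to disk

pool = UnifiedPool(total=100)
pool.acquire_storage(80)        # caching fills most of the pool
pool.acquire_execution(50)      # a shuffle evicts 30 units of cached data
print(pool.storage, pool.execution)   # -> 50 50
```

The asymmetry (execution evicts storage, but not the reverse) mirrors the behavior described above: cached data can be recomputed or re-read, while in-flight shuffle buffers cannot.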
🔥4. Key Memory Configuration Settings 🔥
4.1 spark.executor.memory
● What it does: Sets the total amount of memory allocated to each executor.
● Example: spark.executor.memory=8g will allocate 8GB to each
executor.
4.2 spark.memory.fraction
● What it does: Sets the fraction of usable memory (heap minus Reserved
Memory) given to Spark Memory (execution + storage). The remainder
becomes User Memory.
● Default: 0.6, meaning 60% of usable memory goes to Spark Memory.
4.3 spark.memory.storageFraction
● What it does: Sets the fraction of Spark memory allocated to storage tasks
(like caching). The remainder goes to execution memory.
● Default: 0.5, meaning storage and execution memory are evenly split.
4.4 spark.executor.memoryOverhead
● What it does: Allocates additional memory for overhead tasks (e.g., Python
processes in PySpark). This prevents executors from running out of memory
due to non-JVM tasks.
● Example: Increasing spark.executor.memoryOverhead is useful when
using PySpark or other non-JVM languages.
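Putting these settings together, a submission might look like the following sketch (the script name your_job.py is a placeholder, and the values should be tuned to your cluster):

```shell
# Example spark-submit invocation combining the settings above.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --conf spark.executor.memoryOverhead=2g \
  your_job.py
```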
A quick recap of the three main regions:
● User Memory: holds user-defined objects, data structures, and UDF
intermediates; Spark does not manage it.
● Execution Memory: holds intermediate data for shuffles, joins, sorts, and
aggregations.
● Storage Memory: holds cached or persisted DataFrames and RDDs.
7. Conclusion
Spark memory management can seem complex, but by understanding how memory
is divided and managed between tasks, storage, and custom operations, you can
optimize your jobs for performance and stability. The key is to balance execution
and storage memory, carefully manage user memory (especially with UDFs and
custom objects), and adjust memory configurations to fit your workloads.
With practical examples and detailed breakdowns, you now have the tools to
diagnose and tune most scenarios involving Spark memory management.