Re-evaluate the ML node memory availability formula #126535

valeriy42 · 2025-04-09T14:13:35Z

Currently, if ml.use_auto_machine_memory_percent is set to true, the amount of available memory on an ML node is calculated as
NODE_MEMORY - JVM_HEAP_SIZE - 200MB OFF-HEAP MEMORY

Here JVM_HEAP_SIZE is the heap size configured when Elasticsearch starts, and the JVM's off-heap memory is assumed to be a fixed 200MB.
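
For illustration, the formula above can be written out as a few lines of Python. This is only a sketch of the arithmetic, not the actual Elasticsearch code: the function name is made up, 200MB is treated as 200 MiB, the result is floored at zero, and the 4 GiB node size in the usage line is a hypothetical value (the issue does not state the node size).

```python
MIB = 1024 * 1024

# Fixed off-heap allowance from the formula above (200MB, treated here as 200 MiB).
JVM_OFF_HEAP_ALLOWANCE_BYTES = 200 * MIB


def ml_available_memory_bytes(node_memory_bytes: int,
                              jvm_heap_bytes: int,
                              off_heap_bytes: int = JVM_OFF_HEAP_ALLOWANCE_BYTES) -> int:
    """NODE_MEMORY - JVM_HEAP_SIZE - fixed off-heap allowance.

    Flooring at zero is an assumption of this sketch, not something
    stated in the issue.
    """
    return max(0, node_memory_bytes - jvm_heap_bytes - off_heap_bytes)


# Hypothetical 4 GiB node with the 1636 MiB heap seen later in this thread.
available = ml_available_memory_bytes(4096 * MIB, 1636 * MIB)
print(f"available for ML: {available // MIB} MiB")
```

Any off-heap usage beyond the fixed allowance is not subtracted, so it ends up counted as memory that ML native processes may use.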

Some empirical evidence suggests that the off-heap memory can be significantly larger, which can lead to the Java process being killed by the OOM-killer.

We need to re-evaluate how the ML code determines the available memory and whether that calculation needs to change.


elasticsearchmachine · 2025-04-09T14:14:21Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2025-04-09T14:14:21Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

gbanasiak · 2025-04-10T12:22:46Z

Example 1:

[  pid  ]   uid  tgid   total_vm      rss pgtables_bytes swapents oom_score_adj name
[4157257] 65535 4157257      243        1          28672        0          -998 pause
[4157720]  1000 4157720      637      284          45056        0           936 tini
[4157732]  1000 4157732   635756    21208         380928        0           936 java
[4157809]  1000 4157809  2300095   536664        5029888        0           936 java <--- killed
[4157901]  1000 4157901    45202     2319          98304        0           936 controller
[   3316]  1000    3316   546772   501997        4124672        0           936 data_frame_anal <--- triggered OOM

JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=CLDR, -Dorg.apache.lucene.vectorization.upperJavaFeatureVersion=24, -Des.distribution.type=docker, -Des.java.type=bundled JDK, --enable-native-access=org.elasticsearch.nativeaccess,org.apache.lucene.core, --enable-native-access=ALL-UNNAMED, --illegal-native-access=deny, -Des.cgroups.hierarchy.override=/, -XX:ReplayDataFile=logs/replay_pid%p.log, -Des.entitlements.enabled=true, -XX:+EnableDynamicAgentLoading, -Djdk.attach.allowAttachSelf=true, --patch-module=java.base=lib/entitlement-bridge/elasticsearch-entitlement-bridge-9.1.0.jar, --add-exports=java.base/org.elasticsearch.entitlement.bridge=org.elasticsearch.entitlement,java.logging,java.net.http,java.naming,jdk.net, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-9874597656439209350, --add-modules=jdk.incubator.vector, -Dorg.apache.lucene.store.defaultReadAdvice=normal, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m, -Des.serverless_transport=true, -Des.search.rank_supported=false, -Des.security.security_index.wait_timeout=5s, -Xms1636m, -Xmx1636m, -XX:MaxDirectMemorySize=857735168, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -javaagent:/usr/share/elasticsearch/modules/apm/elastic-apm-agent-1.52.2.jar=c=/tmp/elasticsearch-9874597656439209350/.elstcapm.17705332215789177581.tmp, -Delastic.apm.central_config=false, -Delastic.apm.transaction_sample_rate=0.10, 
-Delastic.apm.application_packages=org.elasticsearch,org.apache.lucene, -Delastic.apm.log_level=warn, -Delastic.apm.enable_experimental_instrumentations=true, -Delastic.apm.instrument=false, -Delastic.apm.server_url=http://apm-server.elastic-agent:8200/, --module-path=/usr/share/elasticsearch/lib, --add-modules=jdk.net, --add-modules=jdk.management.agent, --add-modules=ALL-MODULE-PATH, -Djdk.module.main=org.elasticsearch.server]

Example 2:

[226625.308234] [  pid  ]   uid  tgid  total_vm      rss pgtables_bytes swapents oom_score_adj name
[226625.308237] [ 745717] 65535 745717      243        1          28672        0          -998 pause
[226625.308243] [ 746534]  1000 746534      637      268          49152        0           937 tini
[226625.308249] [ 746565]  1000 746565   718971    26291         446464        0           937 java
[226625.308258] [ 746987]  1000 746987  2995745   667999        6533120        0           937 java <--- killed
[226625.308271] [ 747394]  1000 747394    43666     2807         102400        0           937 controller
[226625.308278] [ 769598]  1000 769598   664358   377609        3743744        0           937 pytorch_inferen <--- triggered OOM

JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=CLDR, -Dorg.apache.lucene.vectorization.upperJavaFeatureVersion=24, -Des.distribution.type=docker, -Des.java.type=bundled JDK, --enable-native-access=org.elasticsearch.nativeaccess,org.apache.lucene.core, --enable-native-access=ALL-UNNAMED, --illegal-native-access=deny, -Des.cgroups.hierarchy.override=/, -XX:ReplayDataFile=logs/replay_pid%p.log, -Des.entitlements.enabled=true, -XX:+EnableDynamicAgentLoading, -Djdk.attach.allowAttachSelf=true, --patch-module=java.base=lib/entitlement-bridge/elasticsearch-entitlement-bridge-9.1.0.jar, --add-exports=java.base/org.elasticsearch.entitlement.bridge=org.elasticsearch.entitlement,java.logging,java.net.http,java.naming,jdk.net, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-16789498905363193644, --add-modules=jdk.incubator.vector, -Dorg.apache.lucene.store.defaultReadAdvice=normal, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m, -Des.serverless_transport=true, -Des.search.rank_supported=false, -Des.security.security_index.wait_timeout=5s, -Xms1636m, -Xmx1636m, -XX:MaxDirectMemorySize=857735168, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -javaagent:/usr/share/elasticsearch/modules/apm/elastic-apm-agent-1.52.2.jar=c=/tmp/elasticsearch-16789498905363193644/.elstcapm.17954687461798677298.tmp, -Delastic.apm.central_config=false, 
-Delastic.apm.transaction_sample_rate=0.10, -Delastic.apm.application_packages=org.elasticsearch,org.apache.lucene, -Delastic.apm.log_level=warn, -Delastic.apm.enable_experimental_instrumentations=true, -Delastic.apm.instrument=false, -Delastic.apm.server_url=http://apm-server.elastic-agent:8200/, --module-path=/usr/share/elasticsearch/lib, --add-modules=jdk.net, --add-modules=jdk.management.agent, --add-modules=ALL-MODULE-PATH, -Djdk.module.main=org.elasticsearch.server]
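
To put the two OOM-killer tables in terms of the formula, here is a rough sketch that converts the rss column (reported in pages) to MiB and subtracts the configured heap. It assumes 4 KiB pages and assumes the heap is fully resident (plausible given -XX:+AlwaysPreTouch with equal -Xms and -Xmx), so the result is only an estimate of the JVM's resident memory outside the heap. The helper names are made up for this example.

```python
import re

PAGE_SIZE_BYTES = 4096      # assumption: 4 KiB pages on the affected hosts
MIB = 1024 * 1024
HEAP_MIB = 1636             # from -Xms1636m / -Xmx1636m in both examples
FIXED_OFF_HEAP_MIB = 200    # the fixed allowance in the current formula

# Columns: [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+\d+\s+\d+\s+\d+\s+(\d+)\s+\d+\s+\d+\s+-?\d+\s+(\S+)"
)

# Rows copied from the two examples above (annotations dropped).
EXAMPLE_1 = """\
[4157732]  1000 4157732   635756    21208         380928        0           936 java
[4157809]  1000 4157809  2300095   536664        5029888        0           936 java
[   3316]  1000    3316   546772   501997        4124672        0           936 data_frame_anal
"""
EXAMPLE_2 = """\
[226625.308249] [ 746565]  1000 746565   718971    26291         446464        0           937 java
[226625.308258] [ 746987]  1000 746987  2995745   667999        6533120        0           937 java
[226625.308278] [ 769598]  1000 769598   664358   377609        3743744        0           937 pytorch_inferen
"""


def rss_mib_by_process(table):
    """Return (pid, name, rss in MiB) for every process row in the table."""
    out = []
    for line in table.splitlines():
        m = ROW.search(line)
        if m:
            pid, rss_pages, name = int(m.group(1)), int(m.group(2)), m.group(3)
            out.append((pid, name, rss_pages * PAGE_SIZE_BYTES / MIB))
    return out


def jvm_off_heap_estimate_mib(table):
    """RSS of the largest java process minus the configured heap."""
    java_rss = max(rss for _, name, rss in rss_mib_by_process(table) if name == "java")
    return java_rss - HEAP_MIB


for label, table in (("Example 1", EXAMPLE_1), ("Example 2", EXAMPLE_2)):
    est = jvm_off_heap_estimate_mib(table)
    print(f"{label}: ~{est:.0f} MiB outside the heap vs {FIXED_OFF_HEAP_MIB} MiB assumed")
```

Under these assumptions the killed java process holds roughly 460 MiB (Example 1) and 973 MiB (Example 2) beyond its heap, both well above the 200MB allowance. Note also that -XX:MaxDirectMemorySize=857735168 is 818 MiB, so direct buffers alone are permitted to exceed the fixed estimate.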

sunilemanjee · 2025-04-11T17:14:49Z

I was able to reproduce this error using this notebook: https://colab.research.google.com/drive/1mmB1adtRTpmdwtbiw9SXCQoWAzwELbOr#scrollTo=mGr_pki7eX1w

elasticsearchmachine added the needs:triage label Apr 9, 2025

valeriy42 added the :Core/Infra/Core, :ml, Team:Core/Infra, and Team:ML labels and removed the needs:triage label Apr 9, 2025

davidkyle mentioned this issue Apr 11, 2025

Log failure to adjust OOM as a warning elastic/ml-cpp#2847

Closed

valeriy42 self-assigned this Apr 15, 2025
