Re-evaluate the ML node memory availability formula #126535

Open
valeriy42 opened this issue Apr 9, 2025 · 4 comments
Labels
:Core/Infra/Core (Core issues without another label), :ml (Machine learning), Team:Core/Infra (Meta label for core/infra team), Team:ML (Meta label for the ML team)

Comments

@valeriy42
Contributor

Currently, if ml.use_auto_machine_memory_percent is set to true, the amount of available memory on an ML node is calculated as
NODE_MEMORY - JVM_HEAP_SIZE - 200MB OFF-HEAP MEMORY

Here JVM_HEAP_SIZE is configured when Elasticsearch starts, and the off-heap memory is estimated at a fixed 200MB.
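
To make the arithmetic concrete, here is a minimal sketch of the formula as described above. It is not the Elasticsearch implementation: the function name, the floor at zero, the reading of "200MB" as 200 MiB, and the 4096 MiB node size are all assumptions made for illustration. Only the 1636 MiB heap comes from this issue (the -Xmx1636m flag in the JVM arguments posted in the comments).

```python
MIB = 1024 * 1024

# Fixed off-heap estimate quoted in the issue (interpreted here as 200 MiB).
FIXED_OFF_HEAP_BYTES = 200 * MIB


def available_ml_memory(node_memory_bytes: int,
                        jvm_heap_bytes: int,
                        off_heap_bytes: int = FIXED_OFF_HEAP_BYTES) -> int:
    """NODE_MEMORY - JVM_HEAP_SIZE - off-heap estimate (floored at zero by assumption)."""
    return max(0, node_memory_bytes - jvm_heap_bytes - off_heap_bytes)


# Hypothetical node with 4096 MiB of memory and a 1636 MiB heap.
node = 4096 * MIB
heap = 1636 * MIB

# With the fixed 200 MiB estimate.
print(available_ml_memory(node, heap) // MIB)             # 2260

# If the JVM actually needed 600 MiB off-heap (hypothetical figure).
print(available_ml_memory(node, heap, 600 * MIB) // MIB)  # 1860
```

In this hypothetical case the fixed estimate would let ML processes claim 400 MiB more than the node can spare, which is the kind of gap that ends with the OOM-killer stepping in.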

Some empirical evidence suggests that the off-heap memory can be significantly larger. The formula then overestimates the memory available to ML processes, which can lead to the Java process being killed by the OOM-killer.

We need to re-evaluate whether the way the ML code determines the available memory should be adjusted or changed.

elasticsearchmachine added the needs:triage label Apr 9, 2025
valeriy42 added the :Core/Infra/Core, :ml, Team:Core/Infra, and Team:ML labels and removed the needs:triage label Apr 9, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@gbanasiak
Contributor

Example 1:

[  pid  ]   uid  tgid   total_vm      rss pgtables_bytes swapents oom_score_adj name
[4157257] 65535 4157257      243        1          28672        0          -998 pause
[4157720]  1000 4157720      637      284          45056        0           936 tini
[4157732]  1000 4157732   635756    21208         380928        0           936 java
[4157809]  1000 4157809  2300095   536664        5029888        0           936 java <--- killed
[4157901]  1000 4157901    45202     2319          98304        0           936 controller
[   3316]  1000    3316   546772   501997        4124672        0           936 data_frame_anal <--- triggered OOM

JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=CLDR, -Dorg.apache.lucene.vectorization.upperJavaFeatureVersion=24, -Des.distribution.type=docker, -Des.java.type=bundled JDK, --enable-native-access=org.elasticsearch.nativeaccess,org.apache.lucene.core, --enable-native-access=ALL-UNNAMED, --illegal-native-access=deny, -Des.cgroups.hierarchy.override=/, -XX:ReplayDataFile=logs/replay_pid%p.log, -Des.entitlements.enabled=true, -XX:+EnableDynamicAgentLoading, -Djdk.attach.allowAttachSelf=true, --patch-module=java.base=lib/entitlement-bridge/elasticsearch-entitlement-bridge-9.1.0.jar, --add-exports=java.base/org.elasticsearch.entitlement.bridge=org.elasticsearch.entitlement,java.logging,java.net.http,java.naming,jdk.net, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-9874597656439209350, --add-modules=jdk.incubator.vector, -Dorg.apache.lucene.store.defaultReadAdvice=normal, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m, -Des.serverless_transport=true, -Des.search.rank_supported=false, -Des.security.security_index.wait_timeout=5s, -Xms1636m, -Xmx1636m, -XX:MaxDirectMemorySize=857735168, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -javaagent:/usr/share/elasticsearch/modules/apm/elastic-apm-agent-1.52.2.jar=c=/tmp/elasticsearch-9874597656439209350/.elstcapm.17705332215789177581.tmp, -Delastic.apm.central_config=false, -Delastic.apm.transaction_sample_rate=0.10, 
-Delastic.apm.application_packages=org.elasticsearch,org.apache.lucene, -Delastic.apm.log_level=warn, -Delastic.apm.enable_experimental_instrumentations=true, -Delastic.apm.instrument=false, -Delastic.apm.server_url=http://apm-server.elastic-agent:8200/, --module-path=/usr/share/elasticsearch/lib, --add-modules=jdk.net, --add-modules=jdk.management.agent, --add-modules=ALL-MODULE-PATH, -Djdk.module.main=org.elasticsearch.server]
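
As a reading aid for the JVM arguments above, the sketch below pulls out the two flags most relevant to the formula: the maximum heap and the direct-memory cap. The helper names are hypothetical and the argument string is an abbreviated copy of the flags shown above. Note that -XX:MaxDirectMemorySize limits NIO direct buffers only; other native memory (metaspace, thread stacks, code cache, GC structures) is not bounded by it.

```python
import re

MIB = 1024 * 1024
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Abbreviated copy of the memory-related flags from the JVM arguments above.
JVM_ARGS = ("-XX:+AlwaysPreTouch, -Xss1m, -Xms1636m, -Xmx1636m, "
            "-XX:MaxDirectMemorySize=857735168, -XX:G1HeapRegionSize=4m")


def parse_size(text: str) -> int:
    """Convert a JVM size such as '1636m' or '857735168' to bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", text.lower())
    if match is None:
        raise ValueError(f"unrecognised JVM size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def memory_flags(args: str) -> dict:
    """Extract the max heap and direct-memory cap (in bytes) from a flag string."""
    heap = re.search(r"-Xmx(\d+[kKmMgG]?)", args)
    direct = re.search(r"-XX:MaxDirectMemorySize=(\d+[kKmMgG]?)", args)
    return {
        "max_heap": parse_size(heap.group(1)) if heap else None,
        "max_direct": parse_size(direct.group(1)) if direct else None,
    }


flags = memory_flags(JVM_ARGS)
print(flags["max_heap"] // MIB)    # 1636
print(flags["max_direct"] // MIB)  # 818
```

With these settings, direct buffers alone are allowed to grow to 818 MiB, roughly four times the fixed 200MB off-heap estimate.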

Example 2:

[226625.308234] [  pid  ]   uid  tgid  total_vm      rss pgtables_bytes swapents oom_score_adj name
[226625.308237] [ 745717] 65535 745717      243        1          28672        0          -998 pause
[226625.308243] [ 746534]  1000 746534      637      268          49152        0           937 tini
[226625.308249] [ 746565]  1000 746565   718971    26291         446464        0           937 java
[226625.308258] [ 746987]  1000 746987  2995745   667999        6533120        0           937 java <--- killed
[226625.308271] [ 747394]  1000 747394    43666     2807         102400        0           937 controller
[226625.308278] [ 769598]  1000 769598   664358   377609        3743744        0           937 pytorch_inferen <--- triggered OOM

JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=CLDR, -Dorg.apache.lucene.vectorization.upperJavaFeatureVersion=24, -Des.distribution.type=docker, -Des.java.type=bundled JDK, --enable-native-access=org.elasticsearch.nativeaccess,org.apache.lucene.core, --enable-native-access=ALL-UNNAMED, --illegal-native-access=deny, -Des.cgroups.hierarchy.override=/, -XX:ReplayDataFile=logs/replay_pid%p.log, -Des.entitlements.enabled=true, -XX:+EnableDynamicAgentLoading, -Djdk.attach.allowAttachSelf=true, --patch-module=java.base=lib/entitlement-bridge/elasticsearch-entitlement-bridge-9.1.0.jar, --add-exports=java.base/org.elasticsearch.entitlement.bridge=org.elasticsearch.entitlement,java.logging,java.net.http,java.naming,jdk.net, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-16789498905363193644, --add-modules=jdk.incubator.vector, -Dorg.apache.lucene.store.defaultReadAdvice=normal, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m, -Des.serverless_transport=true, -Des.search.rank_supported=false, -Des.security.security_index.wait_timeout=5s, -Xms1636m, -Xmx1636m, -XX:MaxDirectMemorySize=857735168, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -javaagent:/usr/share/elasticsearch/modules/apm/elastic-apm-agent-1.52.2.jar=c=/tmp/elasticsearch-16789498905363193644/.elstcapm.17954687461798677298.tmp, -Delastic.apm.central_config=false, 
-Delastic.apm.transaction_sample_rate=0.10, -Delastic.apm.application_packages=org.elasticsearch,org.apache.lucene, -Delastic.apm.log_level=warn, -Delastic.apm.enable_experimental_instrumentations=true, -Delastic.apm.instrument=false, -Delastic.apm.server_url=http://apm-server.elastic-agent:8200/, --module-path=/usr/share/elasticsearch/lib, --add-modules=jdk.net, --add-modules=jdk.management.agent, --add-modules=ALL-MODULE-PATH, -Djdk.module.main=org.elasticsearch.server]
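
The two examples above can be turned into a rough estimate of how far the killed Java process exceeded the 200MB assumption. This is a sketch under stated assumptions: the OOM-killer table reports rss in pages, the page size is taken to be 4 KiB (not stated in the logs), "200MB" is read as 200 MiB, and the whole 1636 MiB heap is treated as resident, which seems plausible given -XX:+AlwaysPreTouch and equal -Xms/-Xmx. Since rss also counts file-backed pages such as memory-mapped files, the result is better read as an upper bound on off-heap allocations than as an exact figure.

```python
MIB = 1024 * 1024
PAGE_SIZE = 4096     # assumption: 4 KiB pages; the logs do not state the page size
HEAP_MIB = 1636      # from -Xms1636m / -Xmx1636m
ESTIMATE_MIB = 200   # fixed off-heap estimate from the issue (taken as MiB)

# rss column (in pages) of the killed java process in each example.
killed_java_rss_pages = {"Example 1": 536664, "Example 2": 667999}


def resident_beyond_heap_mib(rss_pages: int,
                             heap_mib: int = HEAP_MIB,
                             page_size: int = PAGE_SIZE) -> float:
    """Resident memory of the process that the heap does not account for, in MiB."""
    return rss_pages * page_size / MIB - heap_mib


for name, pages in killed_java_rss_pages.items():
    rss_mib = pages * PAGE_SIZE / MIB
    extra = resident_beyond_heap_mib(pages)
    print(f"{name}: rss={rss_mib:.1f} MiB, beyond heap={extra:.1f} MiB, "
          f"over estimate by {extra - ESTIMATE_MIB:.1f} MiB")
# Example 1: rss=2096.3 MiB, beyond heap=460.3 MiB, over estimate by 260.3 MiB
# Example 2: rss=2609.4 MiB, beyond heap=973.4 MiB, over estimate by 773.4 MiB
```

Under these assumptions, the Java process held roughly 460 MiB and 973 MiB outside the heap in the two cases, well above the fixed estimate.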

@sunilemanjee

I was able to reproduce this error using this notebook: https://colab.research.google.com/drive/1mmB1adtRTpmdwtbiw9SXCQoWAzwELbOr#scrollTo=mGr_pki7eX1w

valeriy42 self-assigned this Apr 15, 2025