Scaling EsExecutors with core size 0 might starve work due to missing workers #124667
Labels: blocker, >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)
Comments
Pinging @elastic/es-core-infra (Team:Core/Infra)
Note, #18613 relates to this bug. We shouldn't be using unbounded queues for scaling executors. Our usage of the unbounded …
mosche added a commit to mosche/elasticsearch that referenced this issue on Mar 13, 2025
mosche added a commit to mosche/elasticsearch that referenced this issue on Mar 16, 2025

…h core pool size = 0 (elastic#124732)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in elastic#124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes elastic#124667
Relates to elastic#18613
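For readers skimming the timeline, here is a hedged sketch of the two-part approach the commit message describes: use a plain queue when max pool size is 1, otherwise probe the pool after force-queueing. The names (`ScalingFixSketch`, `chooseQueue`, `probeWorkerPool`, `ScalingQueueSketch`) are illustrative stand-ins, not the actual `EsExecutors` implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.ThreadPoolExecutor;

final class ScalingFixSketch {

    // Illustrative stand-in for ExecutorScalingQueue: offers only succeed if an idle
    // worker is already waiting, so the executor is pushed to add workers up to max pool size.
    static final class ScalingQueueSketch<E> extends LinkedTransferQueue<E> {
        @Override
        public boolean offer(E e) {
            return tryTransfer(e);
        }
    }

    // With core pool size 0 and max pool size 1 there is no benefit in rejecting work
    // to trigger scale-up, so a plain LinkedTransferQueue avoids the race entirely.
    static BlockingQueue<Runnable> chooseQueue(int corePoolSize, int maxPoolSize) {
        return (corePoolSize == 0 && maxPoolSize == 1)
            ? new LinkedTransferQueue<>()
            : new ScalingQueueSketch<>();
    }

    // Hypothetical probe run after a rejected task has been force-queued: if every
    // worker timed out in the meantime, submit a no-op so at least one worker is
    // created again; that worker then also drains the force-queued task.
    static void probeWorkerPool(ThreadPoolExecutor executor) {
        if (executor.getPoolSize() == 0) {
            executor.execute(() -> {});
        }
    }
}
```

In this sketch the probe is best-effort: submitting a no-op forces `ThreadPoolExecutor` to create a worker when none exists, which is enough to unblock a task that was force-queued just as the last worker exited.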
elasticsearchmachine pushed a commit that referenced this issue on Mar 16, 2025

…h core pool size = 0 (#124732) (#124965)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in #124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes #124667
Relates to #18613
mosche added a commit to mosche/elasticsearch that referenced this issue on Mar 17, 2025

…k with max pool size > 0.

Relates to elastic#124867, ES-10640
mosche added a commit to mosche/elasticsearch that referenced this issue on Mar 18, 2025

…h core pool size = 0 (elastic#124732)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in elastic#124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes elastic#124667
Relates to elastic#18613

(cherry picked from commit 36874e8)

# Conflicts:
#   test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java
mosche added a commit to mosche/elasticsearch that referenced this issue on Mar 18, 2025

…h core pool size = 0 (elastic#124732)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in elastic#124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes elastic#124667
Relates to elastic#18613

(cherry picked from commit 36874e8)

# Conflicts:
#   server/src/main/java/org/elasticsearch/threadpool/ScalingExecutorBuilder.java
#   test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java
elasticsearchmachine pushed a commit that referenced this issue on Mar 18, 2025

…h core pool size = 0 (#124732) (#125066)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in #124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes #124667
Relates to #18613

(cherry picked from commit 36874e8)

# Conflicts:
#   test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java
mosche added a commit that referenced this issue on Mar 18, 2025

…h core pool size = 0 (#124732) (#125067)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in #124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes #124667
Relates to #18613

(cherry picked from commit 36874e8)

# Conflicts:
#   test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java
elasticsearchmachine added a commit that referenced this issue on Mar 18, 2025

…tor with core pool size = 0 (#124732) (#125069)

* Prevent work starvation bug if using scaling EsThreadPoolExecutor with core pool size = 0 (#124732)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in #124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes #124667
Relates to #18613

(cherry picked from commit 36874e8)

# Conflicts:
#   server/src/main/java/org/elasticsearch/threadpool/ScalingExecutorBuilder.java
#   test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java

* remove timeout

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
mosche added a commit that referenced this issue on Mar 18, 2025

…tor with core pool size = 0 (#124732) (#125068)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in #124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes #124667
Relates to #18613
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this issue on Mar 28, 2025

…h core pool size = 0 (elastic#124732)

When `ExecutorScalingQueue` rejects work to make the worker pool scale up while already being at max pool size (and a new worker consequently cannot be added), available workers might timeout just about at the same time as the task is then force queued by `ForceQueuePolicy`. This has caused starvation of work as observed for `masterService#updateTask` in elastic#124667 where max pool size 1 is used. This configuration is most likely to expose the bug.

This PR changes `EsExecutors.newScaling` to not use `ExecutorScalingQueue` if max pool size is 1 (and core pool size is 0). A regular `LinkedTransferQueue` works perfectly fine in this case. If max pool size > 1, a probing approach is used to ensure the worker pool is adequately scaled to at least 1 worker after force queueing work in `ForceQueuePolicy`.

Fixes elastic#124667
Relates to elastic#18613
This was discovered in cases where `masterService#updateTask` had work enqueued, but no worker to process it.

The root cause of the issue is a bug in `EsExecutors`. When the pool core size is set to 0 and max pool size is 1 (also possible with a higher max pool size, but less likely), `EsThreadPoolExecutor` sometimes fails to add another worker to execute the task because we're already at the max pool size (expected). However, in rare cases, a single worker thread (or threads) can time out (based on their `keepAliveTime`) at about the same time as the new task is then queued via `ForceQueuePolicy` (triggered by the initial rejection after we failed to add a worker). Unless more tasks are submitted later (which is not the case for `masterService#updateTask`), this task will starve in the queue without any worker to process it.

The respective code in `EsExecutors` is old and unchanged. We were able to reproduce the bug on `main` using Java 21, 22 and 23, as well as on `8.0` using Java 17. Likely the same is possible for older versions of ES. It looks as if the bug is triggered more frequently with more recent versions of the JDK, but this might just be observation bias, as we haven't been aware of this bug earlier.
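For illustration, here is a minimal, self-contained sketch of the configuration and the force-queue behaviour described above. `RejectingQueue` and the inline rejection handler are hypothetical stand-ins that mimic `ExecutorScalingQueue` and `ForceQueuePolicy`; this is not the Elasticsearch code, only a sketch of the race window.

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScalingStarvationSketch {

    // Hypothetical stand-in for ExecutorScalingQueue: offer() only succeeds if an idle
    // worker is already waiting, otherwise the executor is pushed to add a new worker.
    static class RejectingQueue<E> extends LinkedTransferQueue<E> {
        @Override
        public boolean offer(E e) {
            return tryTransfer(e);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            0, 1,                  // core pool size 0, max pool size 1: the issue's configuration
            5, TimeUnit.SECONDS,   // keepAliveTime after which the lone worker exits
            new RejectingQueue<>(),
            (task, exec) -> {
                // Stand-in for ForceQueuePolicy: force-queue the rejected task. If the lone
                // worker times out at exactly this moment, the task sits in the queue with
                // no worker left to pick it up: the starvation window described above.
                exec.getQueue().add(task);
            }
        );
        executor.execute(() -> System.out.println("task ran"));
        executor.shutdown();
    }
}
```

With this setup, a task submitted while the single worker is busy is rejected by the queue, then force-queued by the handler; it is only processed if that worker is still alive to poll the queue afterwards, which is exactly the timing the issue describes.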