
[ML] Enhancements to ml.allocated_processors_scale for increased flexibility in model allocations. #110023


Open · wants to merge 1 commit into base: main

Conversation

@Rassyan (Contributor) commented Jun 21, 2024

Related Issue

#109001

Motivation

The current implementation of ml.allocated_processors_scale is limited to integer values, primarily used for scaling down processor counts to account for hyper-threading. This proposal aims to extend its functionality to better utilize excess capacity on nodes by allowing the scaling up of processor counts.

Proposed Changes

  1. Modify ml.allocated_processors_scale to accept floating-point values for finer granularity in scaling.
  2. Allow ml.allocated_processors_scale to support values less than 1, enabling an increase in the effective processor count used in model planning.
  3. Update documentation to clearly describe the effects of ml.allocated_processors_scale on model allocations and thread usage.

These changes will make the setting more adaptable to various resource availability scenarios.
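To make the proposal concrete, here is a hypothetical example of applying a fractional value through the cluster settings API, written as a small Python sketch. The URL, credentials, exact setting key (the full name in a real cluster may carry an xpack. prefix), and the assumption that the setting is dynamically updatable are all illustrative, not confirmed details of this change.

```python
# Hypothetical sketch: applying a fractional scale via the cluster settings API.
# The setting name below follows this PR's wording; the real key may be prefixed
# (e.g. xpack.ml.allocated_processors_scale), and the endpoint/credentials are placeholders.
import requests

resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"ml.allocated_processors_scale": 0.5}},
    auth=("elastic", "changeme"),  # placeholder credentials
)
print(resp.status_code, resp.json())
```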

@elasticsearchmachine added the needs:triage, v8.15.0, and external-contributor labels Jun 21, 2024
@Rassyan changed the title to [ML] Enhancements to ml.allocated_processors_scale for increased flexibility in model allocations. Jun 21, 2024
@kingherc added the :ml (Machine learning) and Team:ML labels Jun 21, 2024
@thecoop removed the needs:triage label Jun 28, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@davidkyle self-requested a review June 28, 2024 15:14
@davidkyle (Member) commented Jun 28, 2024

It is correct that the ml.allocated_processors_scale setting was designed to scale down the number of processors to account for hyper-threading. Inference speed is linearly related to the number of physical cores on the machine, so increasing the number of physical cores a model can use increases inference speed in a predictable manner. Once a model is using more threads than there are physical cores, the performance improvements tail off because those extra threads are hyper-threaded. You can see the effect of hyper-threading in this chart: https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html#_elser_v2_2.

Setting ml.allocated_processors_scale: 2 makes the performance increase from every new thread predictable, at the cost of a slight loss in overall performance.

If ml.allocated_processors_scale is a double and allowed to be < 1, then it would allow over-subscription of the CPU resources. For example, on a machine with 16 vCPUs, setting ml.allocated_processors_scale: 0.5 would make the model assignment logic think there are 32 vCPUs on the machine and allow a model to be deployed using 32 threads, but those 32 threads are backed by only 16 hardware threads.
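As a minimal sketch of the arithmetic in the example above, assuming the assignment logic simply divides the node's processor count by the scale (this is illustrative only, not the actual Elasticsearch planner code):

```python
# Illustrative only: how a scale above or below 1 changes the vCPU count the
# model-assignment logic plans against, per the 16 vCPU example above.
def planned_vcpus(hardware_vcpus: int, scale: float) -> int:
    return int(hardware_vcpus / scale)

print(planned_vcpus(16, 2))    # 8  -> scale down to physical cores (hyper-threading)
print(planned_vcpus(16, 1))    # 16 -> default behaviour
print(planned_vcpus(16, 0.5))  # 32 -> over-subscription, as proposed in this PR
```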

Have you tested this change on your server? I would have expected over-subscribing the thread count would introduce contention. Does such a change increase throughput for you?

This proposal aims to extend its functionality to better utilize excess capacity on nodes by allowing the scaling up of processor counts.

Is the problem that Elasticsearch thinks the machine has fewer vCPUs than it actually has, or that for some reason Elasticsearch cannot use all the available CPUs? For example, if Elasticsearch thinks that a 16 vCPU machine has only 8 vCPUs, then setting ml.allocated_processors_scale: 0.5 would allow Elasticsearch to use the true number of vCPUs. Is this the problem scenario you are experiencing?

@Rassyan (Contributor, Author) commented Jul 1, 2024

Hi, @davidkyle

Thank you for your detailed explanation and the insights provided by the chart regarding the performance implications of threading beyond physical core counts. I fully appreciate Elasticsearch's rigorous approach to this matter.

The primary scenario prompting my proposal arises when clients need to deploy multiple inference models within ES. Currently, the total number of deployment threads, calculated as number_of_allocations * threads_per_allocation, must not exceed the total available vCPUs across ML nodes in the cluster. For instance, if inference Model A serves Business A and Model B serves Business B, users may wish to maximize inference capabilities for both models without concurrent high-throughput usage. Under the existing constraints, deploying both models at their maximum thread capacity simultaneously isn't feasible without first halting one. Given that both are online services requiring uninterrupted operation, the ability to set ml.allocated_processors_scale to a value less than 1 would offer expert users the flexibility to deploy more models to handle complex operations, thereby placing more control over node throughput and performance in their hands.
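As a toy sketch of the sizing constraint described above, with hypothetical deployment sizes (this is just the arithmetic, not the actual assignment planner):

```python
# Illustrative only: total requested threads across deployments must fit within the
# vCPUs the assignment logic believes the ML nodes have; a scale < 1 inflates that view.
def deployments_fit(deployments, ml_node_vcpus, scale=1.0):
    planned = int(ml_node_vcpus / scale)
    required = sum(allocations * threads for allocations, threads in deployments)
    return required <= planned

two_models = [(1, 16), (1, 16)]  # Model A and Model B, each wanting 16 threads
print(deployments_fit(two_models, ml_node_vcpus=16))             # False: rejected today
print(deployments_fit(two_models, ml_node_vcpus=16, scale=0.5))  # True: possible with scale < 1
```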

Secondly, our performance tests have shown that even when the full count of available vCPUs is used for intensive inference stress testing, CPU utilization on the dedicated ML nodes remains low, hovering between 50% and 60%. This bottleneck appears to be linked directly to the current restrictions. Relaxing these constraints for expert-level testing could therefore help users identify more optimal deployment configurations and thread counts. After applying the adjustment in this PR, I have verified that this kind of over-subscription does raise CPU usage on the dedicated ML nodes.

Since the default setting remains unchanged, this modification poses no risk to general users, who are still protected by the vCPU count limitation. For expert users with the needs outlined above, this change would grant them greater flexibility to maximize performance and conduct more thorough testing. Given that similar constraints exist in ml-cpp, a coordinated strategy adjustment might be necessary. I look forward to discussing this further with you and exploring potential collaborative adjustments.

Have you tested this change on your server? I would have expected over-subscribing the thread count would introduce contention. Does such a change increase throughput for you?

Yes, I have conducted tests on our servers, and while it's generally expected that over-subscribing thread counts could lead to contention, our specific use case has shown a net increase in throughput. This is primarily due to the underutilization of CPU resources under current constraints, as mentioned earlier.

Is the problem that Elasticsearch thinks the machine has fewer vCPUs than it actually has, or that for some reason Elasticsearch cannot use all the available CPUs? For example, if Elasticsearch thinks that a 16 vCPU machine has only 8 vCPUs, then setting ml.allocated_processors_scale: 0.5 would allow Elasticsearch to use the true number of vCPUs. Is this the problem scenario you are experiencing?

The issue isn't that Elasticsearch misinterprets the number of vCPUs; rather, it's about how Elasticsearch currently limits the thread count per allocation based on the vCPUs available. This can prevent it from utilizing the full potential of the hardware, especially in scenarios where the workload is not consistently high, allowing for safe over-subscription without contention. The ml.allocated_processors_scale setting, when adjusted to below 1, is intended to offer more flexibility in such cases, not to correct a miscount of vCPUs but to optimize resource usage during varying load conditions.

I appreciate your engagement on this topic and look forward to further discussions to refine and enhance this feature.

@davidkyle (Member)

Thank you @Rassyan, that is a very interesting idea: allowing over-subscription of the CPU cores means that if you have 2 models deployed but only one is actively used, that model can acquire all the CPU resources. I now see how this change would be helpful to you. My team has a meeting tomorrow; I've put this item on the agenda for discussion, and we will get back to you after the meeting.

@Rassyan (Contributor, Author) commented Apr 21, 2025

Hi @davidkyle , since you're most familiar with this part of the codebase, would you consider checking this PR when convenient? I'd value your expertise on the implementation approach.

Labels
>enhancement, external-contributor, :ml, Team:ML, v9.1.0
6 participants