Skip to content

ES-10063 Add multi-project support for more stats APIs #127650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

PeteGillinElastic
Copy link
Member

@PeteGillinElastic PeteGillinElastic commented May 2, 2025

This affects the following APIs:

  • GET _nodes/stats:
    • For indices, it now prefixes the index name with the project ID (for non-default projects). Previously, it didn't tell you which project an index was in, and it failed if two projects had the same index name.
    • For ingest, it now gets the pipeline and processor stats for all projects, and prefixes the pipeline ID with the project ID. Previously, it only got them for the default project.
  • GET /_cluster/stats:
    • For ingest, it now aggregates the pipeline and processor stats for all projects. Previously, it only got them for the default project.
  • GET /_info:
    • For ingest, same as for GET /_nodes/stats.

It does not add multi-project support for the repositories section on _nodes/stats, since repositories aren't MP-ready yet AFAIK.

This is done by making IndicesService.stats() and IngestService.stats() include project IDs in the NodeIndicesStats and IngestStats objects they return, and making those stats objects incorporate the project IDs when converting to XContent.

The transitive callers of these two methods are rather extensive (including all callers to NodeService.stats(), all callers of TransportNodesStatsAction, and so on). To ensure the change is safe, the callers were all checked out, and they fall into the following cases:

  • The behaviour change is one of the desired enhancements described above.
  • There is no behaviour change because it was getting node stats but neither indices nor ingest stats were requested.
  • There is no behaviour change because it was getting indices and/or ingest stats but only using aggregate values.
  • In MachineLearningUsageTransportAction and TransportGetTrainedModelsStatsAction, the IngestStats returned will return stats from all projects instead of just the default with this change, but they have been changed to filter the non-default project stats out, so this change is a noop there. (These actions are not MP-ready yet.)
  • MonitoringService will be affected, but this is the legacy monitoring module which is not in use anywhere that MP is going to be enabled. (If anything, the behaviour is probably improved by this change, as it will now include project IDs, rather than producing ambiguous unqualified results and failing in the case of duplicates.)

This affects the following APIs:
 - `GET _nodes/stats`:
   - For `indices`, it now prefixes the index name with the project ID (for non-default projects). Previously, it didn't tell you which project an index was in, and it failed if two projects had the same index name.
   - For `ingest`, it now gets the pipeline and processor stats for all projects, and prefixes the pipeline ID with the project ID. Previously, it only got them for the default project.
 - `GET /_cluster/stats`:
   - For `ingest`, it now aggregates the pipeline and processor stats for all projects. Previously, it only got them for the default project.
 - `GET /_info`:
   - For `ingest`, same as for `GET /_nodes/stats`.

This is done by making `IndicesService.stats()` and `IngestService.stats()` include project IDs in the `NodeIndicesStats` and `IngestStats` objects they return, and making those stats objects incorporate the project IDs when converting to XContent.

The transitive callers of these two methods are rather extensive (including all callers to `NodeService.stats()`, all callers of `TransportNodesStatsAction`, and so on). To ensure the change is safe, the callers were all checked out, and they fall into the following cases:
 - The behaviour change is one of the desired enhancements described above.
 - There is no behaviour change because it was getting node stats but neither `indices` nor `ingest` stats were requested.
 - There is no behaviour change because it was getting `indices` and/or `ingest` stats but only using aggregate values.
 - In `MachineLearningUsageTransportAction` and `TransportGetTrainedModelsStatsAction`, the `IngestStats` returned will return stats from all projects instead of just the default with this change, but they have been changed to filter the non-default project stats out, so this change is a noop there. (These actions are not MP-ready yet.)
 - `MonitoringService` will be affected, but this is the legacy monitoring module which is not in use anywhere that MP is going to be enabled. (If anything, the behaviour is probably improved by this change, as it will now include project IDs, rather than producing ambiguous unqualified results and failing in the case of duplicates.)
@PeteGillinElastic PeteGillinElastic added >non-issue :Data Management/Stats Statistics tracking and retrieval APIs v9.1.0 labels May 2, 2025
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review May 2, 2025 16:40
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label May 2, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@PeteGillinElastic PeteGillinElastic changed the title Add multi-project support for more stats APIs ES-10063 Add multi-project support for more stats APIs May 2, 2025
Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this, Pete! I did a first round of review by just looking at the production files. I'll have a look at the tests later on. This feels like the right direction; my comments are mostly small and/or thinking out loud.

Comment on lines 322 to 324
if (Objects.equals(projectId, Metadata.DEFAULT_PROJECT_ID)) {
return index.getName();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit on the fence about this approach. Thinking out loud: on the one hand, this maintains BwC with little effort. On the other hand, if we ever have indices in the default project this'll be a bit off. It's been mentioned a handful of times that we might want to consider (so, super hypothetically) having a shared cluster-level GeoIP index to avoid downloading the same DB over again. We shouldn't make this decision purely based on the GeoIP index, but the idea of the faint possibility of cluster-level indices is something to think about. At the same time, this approach doesn't "break" anything in that case either, it just makes the output inconsistent. So, we either try to account for that situation proactively now, or we decide to go with this approach and we solve that issue if/when it comes. Any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the options are:

  1. Include the default/ prefix on every index in single-project mode.
  2. Live with the fact that people will need to know that the absence of a prefix indicates the default project in multi-project mode.
  3. Somehow pass in information about whether we're in single- or multi-project mode. I guess that we could go back to making the map nullable and use the null value to indicate that we're in single-project mode and don't need the prefixes.

What I've currently implement is option 2. I think this is certainly better than option 1. But if we're seriously considering having indices in the default project when in MP mode (which I hadn't realized was a possibility) and if we think that people might not realize that no prefix means default (which I'm not sure about) then we could switch to option 3 easily enough. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, I think I'm leaning towards option 3.

This would also mean that, when deserializing an instance from an older node, we could make the map be null and so not show project prefixes: I think this would be valid, because I assume that we will never have to support BWC with nodes older than the current main in MP clusters (there's no way that could possibly work!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, ideally, I'd want to go with option 3 too (not showing the project ID in single-project mode and showing the project ID for all indices in multi-project mode), for consistency. But let's park this discussion for a bit until we resolve #127650 (comment) (or choose not to resolve it and live with it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some YAML tests from the ingest-common module are muted in this file because they depend on stats APIs:

    '^test/ingest/15_info_ingest/*', // uses cluster info API
    '^test/ingest/70_bulk/Test bulk request with default pipeline', // uses stats API
    '^test/ingest/70_bulk/Test bulk request without default pipeline', // uses stats API

However, because we decided to prefix the project ID, we won't be able to unmute those tests... I don't immediately have a good suggestion, but I thought I'd mention it so we're aware. If we decide that that is OK, we should move these lines to the bottom so they're next to the comment explaining why some stats API tests are muted in MP.

Comment on lines 576 to 579
return indices.values()
.stream()
.map(AbstractIndexComponent::index)
.collect(Collectors.toMap(index -> index, index -> clusterService.state().metadata().projectFor(index).id()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud: by using Metadata#projectFor, we throw an exception if the index is not found. I'm not familiar enough with the IndicesService to know whether there is a possibility of the service being "behind" the state in ClusterService. For instance, when an index is deleted from the cluster state, is the indices service updated atomically? Or if an index is added, is it first added to the cluster state or the IndicesService? If you don't already have an answer to those questions (which is completely fine), I think it's worth having a look to be sure what kind of behavior we could see here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. That's a good question.

Looking at the creation flow, I believe that creating the IndexService and adding it to the indices map in IndicesService happens as part of an applyClusterState method in a ClusterStateApplier, and looking at ClusterApplierService it seems that this will be invoked before the state is updated. So it seems plausible that we could call IndicesService.stats on another thread and observe it in a state where this fails for an index which is in the process of being created on this node.

At any rate, I think that good defensive coding principles dictate that we should make things work as well as possible in these circumstances.

I think the best option might be to switch to using lookupProject(Index). That means that we'll just omit the index from the map in that case. Then we have to figure out how to handle that. Right now, it would use null for the project ID, which is probably confusing. We could make it use something like <unknown>, which is sort of horrible but at least it's explicit. (We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

I don't like that, but I'm struggling to think of any better alternatives right now.

It would be simple if the IndexService knew its project ID. I think you said it was a deliberate design decision not to do that... Or did was the deliberate decision just that Index and/or IndexMetadata wouldn't know their project ID? I've had a glance, and it would be a moderately messy change: there are dozens of code paths we'd have to wire the project ID through to get back to somewhere it's available (not counting tests).

The refactoring to ensure that things happen in a different order which guarantees it would be okay is waaaaay to large and risky to justify.

So I'm going with the <unknown> option for now. But I'd be very interested to hear your thoughts!

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

N.B. If we go with the thing I called option 3 in the other thread, this comment will not be true. The only way that I think this the index could be missing from the map is if we hit this rare timing issue. So I think the question becomes: Do we want to do the plumbing to make an IndexService know its project ID in order to avoid this? Or is there a better solution that I'm not seeing?

(There is, incidentally, a risk that the stats call could fail if there are two indices with the same name in different projects both of which hit this rare timing issue at the same time, causing a key collision. I'm inclined not to worry about that, as it would be transient, although it could be avoided if we just add a counter to the prefix.)

Copy link
Contributor

@samxbr samxbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall! I haven't gone over the tests yet, will run another pass for the tests tomorrow.

public NodeIndicesStats(
CommonStats oldStats,
Map<Index, CommonStats> statsByIndex,
Map<Index, List<IndexShardStats>> statsByShard,
Map<Index, ProjectId> projectsByIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like this approach of passing in the index to project map

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I just pushed a commit which responds to most of the comments here. There's some discussion below.

Comment on lines 576 to 579
return indices.values()
.stream()
.map(AbstractIndexComponent::index)
.collect(Collectors.toMap(index -> index, index -> clusterService.state().metadata().projectFor(index).id()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. That's a good question.

Looking at the creation flow, I believe that creating the IndexService and adding it to the indices map in IndicesService happens as part of an applyClusterState method in a ClusterStateApplier, and looking at ClusterApplierService it seems that this will be invoked before the state is updated. So it seems plausible that we could call IndicesService.stats on another thread and observe it in a state where this fails for an index which is in the process of being created on this node.

At any rate, I think that good defensive coding principles dictate that we should make things work as well as possible in these circumstances.

I think the best option might be to switch to using lookupProject(Index). That means that we'll just omit the index from the map in that case. Then we have to figure out how to handle that. Right now, it would use null for the project ID, which is probably confusing. We could make it use something like <unknown>, which is sort of horrible but at least it's explicit. (We already have the possibility that the map is empty because it came from an older node, although at least that situation is only temporary.)

I don't like that, but I'm struggling to think of any better alternatives right now.

It would be simple if the IndexService knew its project ID. I think you said it was a deliberate design decision not to do that... Or did was the deliberate decision just that Index and/or IndexMetadata wouldn't know their project ID? I've had a glance, and it would be a moderately messy change: there are dozens of code paths we'd have to wire the project ID through to get back to somewhere it's available (not counting tests).

The refactoring to ensure that things happen in a different order which guarantees it would be okay is waaaaay to large and risky to justify.

So I'm going with the <unknown> option for now. But I'd be very interested to hear your thoughts!

Comment on lines 322 to 324
if (Objects.equals(projectId, Metadata.DEFAULT_PROJECT_ID)) {
return index.getName();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the options are:

  1. Include the default/ prefix on every index in single-project mode.
  2. Live with the fact that people will need to know that the absence of a prefix indicates the default project in multi-project mode.
  3. Somehow pass in information about whether we're in single- or multi-project mode. I guess that we could go back to making the map nullable and use the null value to indicate that we're in single-project mode and don't need the prefixes.

What I've currently implement is option 2. I think this is certainly better than option 1. But if we're seriously considering having indices in the default project when in MP mode (which I hadn't realized was a possibility) and if we think that people might not realize that no prefix means default (which I'm not sure about) then we could switch to option 3 easily enough. What do you think?

@@ -564,6 +571,15 @@ IndexShardStats indexShardStats(final IndicesService indicesService, final Index
);
}

private Map<Index, ProjectId> projectsByIndex() {
Map<Index, ProjectId> map = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
Map<Index, ProjectId> map = new HashMap<>();
Map<Index, ProjectId> map = new HashMap<>(indices.size());

@@ -66,6 +70,7 @@ public class NodeIndicesStats implements Writeable, ChunkedToXContent {
private final CommonStats stats;
private final Map<Index, List<IndexShardStats>> statsByShard;
private final Map<Index, CommonStats> statsByIndex;
private final Map<Index, ProjectId> projectsByIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame we have to pass in this map ourselves (and (de)serialize it), as it's pretty much just the Metadata.ProjectLookup... It looks like we only need it for the XContent serialization, but I can't think of a way to pass the ProjectLookup to that serialization method. I do think it's worth thinking about that some more (and maybe asking others for suggestions). This is going to result in a lot more serialization over the wire.

COMMON_STATS,
Map.of(),
indexStats,
/* projectsByIndex is only used for rendering as XContent, which not needed here */ Map.of(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Teensiest nit ever:

Suggested change
/* projectsByIndex is only used for rendering as XContent, which not needed here */ Map.of(),
// projectsByIndex is only used for rendering as XContent, which not needed here
Map.of(),

The syntax you used is not very common here (AFAIK) and it took me longer than I'm willing to admit to realize the parameter came after the comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >non-issue Team:Data Management Meta label for data/management team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants