Skip to content

Update sparse_vector field mapping to include default setting for token pruning #126739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

markjhoy
Copy link
Contributor

@markjhoy markjhoy commented Apr 12, 2025

Updates the SparseVectorFieldMapper type to include index options for pruning tokens and associated configuration values.

Before this update, token pruning for sparse vector types is only available via the query (see parameters for the sparse vector query ).

With this PR, by default, any new indices with a sparse_vector field type will by default have token pruning turned on.

Example:

{
  "properties": {
    "example_field": {
       "type": "sparse_vector",
        "index_options": {
          "prune": (boolean, default is `true`),
          "pruning_config": {
            "tokens_freq_ratio_threshold": (integer, range 1-100, default is 5),
            "tokens_weight_threshold": (double, range 0.0-1.0, default if 0.4)
          }
        }
     }
  }
}

@markjhoy markjhoy requested review from kderusso and a team May 5, 2025 17:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label May 5, 2025
@markjhoy markjhoy added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label May 5, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label May 5, 2025
@markjhoy markjhoy added the Team:Enterprise Search Meta label for Enterprise Search team label May 5, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Enterprise Search Meta label for Enterprise Search team label May 5, 2025
@markjhoy markjhoy added Team:Enterprise Search Meta label for Enterprise Search team and removed needs:triage Requires assignment of a team area label labels May 5, 2025
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:Enterprise Search Meta label for Enterprise Search team labels May 5, 2025
@markjhoy markjhoy added Team:ML Meta label for the ML team and removed needs:triage Requires assignment of a team area label labels May 5, 2025
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:ML Meta label for the ML team labels May 5, 2025
Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress! I've left a few comments.

@@ -17,7 +17,14 @@ PUT my-index
"mappings": {
"properties": {
"text.tokens": {
"type": "sparse_vector"
"type": "sparse_vector",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we decide not to have multiple examples here?

@@ -242,6 +242,7 @@ static TransportVersion def(int id) {
public static final TransportVersion WRITE_LOAD_INCLUDES_BUFFER_WRITES = def(9_070_00_0);
public static final TransportVersion INTRODUCE_FAILURES_DEFAULT_RETENTION = def(9_071_0_00);
public static final TransportVersion FILE_SETTINGS_HEALTH_INFO = def(9_072_0_00);
public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS = def(9_073_0_00);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to proactively add an 8.x transport version here as well, for an easier backport to 8.19. (Lessons we're learning now...)

} else if (numberObject instanceof Double doubleValue) {
return ((Double) numberObject).floatValue();
}
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this throw if it's not a number instead of doing multiple validation checks on null?

b.endObject();
}

protected void mappingWithIndexOptionsPruningConfig(XContentBuilder b) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still not adding prune:true here?

assertTrue(freq1 < freq2);
}

public void testWithIndexOptionsPruningConfigOnly() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think this should fail, because without sending in prune:true to the query we will not prune.


- match: { status: 400 }
- match: { error.type: "mapper_parsing_exception" }
- match: { error.reason: "Failed to parse mapping: [index_options] field [pruning_config] should not be set if [prune] is false" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shortcut tip: Instead of catch: bad_request you can shortcut this to something like catch: Failed to parse mapping: \[index_options\] field \[pruning_config\] should not be set if \[prune\] is false - then you could still keep the status check but the other error checks are already taken care of in the catch.

- match: { error.reason: "Failed to parse mapping: [pruning_config] field [tokens_weight_threshold] field should be a number between 0.0 and 1.0" }

---
"Check sparse_vector token pruning index_options in query":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please also add tests that override the default behavior?

For example, explicitly sending in a pruning config on queries where pruning is disabled in the mapping, and also explicitly sending in prune:false where pruning is set in the mapping?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should pretty much be a copy of the multi_cluster file, didn't review this file in this round

@@ -89,6 +89,29 @@ setup:
model_id: text_expansion_model
wait_for: started

---
teardown:
- requires:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have requires in the tear down? Isn't it enough to remove indices and ignore 404s?

- match: { error.reason: "Failed to parse mapping: [pruning_config] field [tokens_weight_threshold] field should be a number between 0.0 and 1.0" }

---
"Check sparse_vector token pruning index_options in query":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this test is flakey because of shard counts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs:triage Requires assignment of a team area label v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants