Update sparse_vector field mapping to include default setting for token pruning #126739

markjhoy · 2025-04-12T01:05:31Z

Updates the SparseVectorFieldMapper type to include index options for pruning tokens and associated configuration values.

Before this update, token pruning for sparse_vector fields was only available at query time (see the parameters for the sparse vector query).

With this PR, any new index with a sparse_vector field will have token pruning turned on by default.

Example:

{
  "properties": {
    "example_field": {
      "type": "sparse_vector",
      "index_options": {
        "prune": (boolean, default is `true`),
        "pruning_config": {
          "tokens_freq_ratio_threshold": (integer, range 1-100, default is 5),
          "tokens_weight_threshold": (double, range 0.0-1.0, default is 0.4)
        }
      }
    }
  }
}
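
As a rough illustration of the defaults and constraints listed above, here is a minimal, self-contained Java sketch. It is not the code in this PR: `PruningOptions`, `fromMap`, and the error messages are hypothetical names chosen for the example, and it assumes the mapping has already been parsed into plain `Map` objects.

```java
import java.util.Map;

/** Hypothetical value holder for the index_options shown above; not the PR's actual classes. */
record PruningOptions(boolean prune, int tokensFreqRatioThreshold, float tokensWeightThreshold) {

    static final int DEFAULT_FREQ_RATIO = 5;   // default stated in the PR description
    static final float DEFAULT_WEIGHT = 0.4f;  // default stated in the PR description

    /** Builds options from a parsed index_options map, applying defaults and range checks. */
    static PruningOptions fromMap(Map<String, Object> indexOptions) {
        boolean prune = true; // pruning is on unless explicitly disabled
        Object pruneValue = indexOptions.get("prune");
        if (pruneValue instanceof Boolean b) {
            prune = b;
        } else if (pruneValue != null) {
            throw new IllegalArgumentException("[prune] should be a boolean");
        }

        Object configValue = indexOptions.get("pruning_config");
        if (configValue == null) {
            return new PruningOptions(prune, DEFAULT_FREQ_RATIO, DEFAULT_WEIGHT);
        }
        if (prune == false) {
            // Mirrors the rule exercised by the YAML tests in this PR.
            throw new IllegalArgumentException("[pruning_config] should not be set if [prune] is false");
        }
        if (configValue instanceof Map<?, ?> config) {
            int ratio = freqRatio(config.get("tokens_freq_ratio_threshold"));
            float weight = weight(config.get("tokens_weight_threshold"));
            return new PruningOptions(true, ratio, weight);
        }
        throw new IllegalArgumentException("[pruning_config] should be an object");
    }

    private static int freqRatio(Object value) {
        if (value == null) {
            return DEFAULT_FREQ_RATIO;
        }
        if (value instanceof Integer i && i >= 1 && i <= 100) {
            return i;
        }
        throw new IllegalArgumentException("[tokens_freq_ratio_threshold] should be an integer between 1 and 100");
    }

    private static float weight(Object value) {
        if (value == null) {
            return DEFAULT_WEIGHT;
        }
        if (value instanceof Number n && n.floatValue() >= 0.0f && n.floatValue() <= 1.0f) {
            return n.floatValue();
        }
        throw new IllegalArgumentException("[tokens_weight_threshold] should be a number between 0.0 and 1.0");
    }
}
```

With this sketch, an empty `index_options` map yields pruning enabled with thresholds 5 and 0.4, while `"prune": false` combined with any `pruning_config` is rejected.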

kderusso

Nice progress! I've left a few comments.

kderusso · 2025-05-05T20:15:30Z

docs/reference/elasticsearch/mapping-reference/sparse-vector.md

@@ -17,7 +17,14 @@ PUT my-index
  "mappings": {
    "properties": {
      "text.tokens": {
-        "type": "sparse_vector"
+        "type": "sparse_vector",


Did we decide not to have multiple examples here?

kderusso · 2025-05-05T20:16:57Z

server/src/main/java/org/elasticsearch/TransportVersions.java

@@ -242,6 +242,7 @@ static TransportVersion def(int id) {
    public static final TransportVersion WRITE_LOAD_INCLUDES_BUFFER_WRITES = def(9_070_00_0);
    public static final TransportVersion INTRODUCE_FAILURES_DEFAULT_RETENTION = def(9_071_0_00);
    public static final TransportVersion FILE_SETTINGS_HEALTH_INFO = def(9_072_0_00);
+    public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS = def(9_073_0_00);


We may want to proactively add an 8.x transport version here as well, for an easier backport to 8.19. (Lessons we're learning now...)

kderusso · 2025-05-05T20:21:33Z

server/src/main/java/org/elasticsearch/index/mapper/vectors/TokenPruningConfig.java

+        } else if (numberObject instanceof Double doubleValue) {
+            return ((Double) numberObject).floatValue();
+        }
+        return null;


Should this throw if it's not a number instead of doing multiple validation checks on null?
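
To make the suggestion concrete, a throwing variant could look roughly like the sketch below. The class and method names (`NumberParsing`, `toFloat`) and the message text are hypothetical and not taken from TokenPruningConfig.java; the sketch also assumes that accepting any `Number` subtype is acceptable.

```java
/** Hypothetical helper illustrating "throw instead of returning null"; not the PR's code. */
final class NumberParsing {

    private NumberParsing() {}

    /** Converts a parsed value to a float, failing fast when the value is not numeric. */
    static float toFloat(String fieldName, Object value) {
        if (value instanceof Number number) {
            // Covers Integer, Long, Float, Double, and other Number subtypes in one branch.
            return number.floatValue();
        }
        throw new IllegalArgumentException("[" + fieldName + "] should be a number but was [" + value + "]");
    }
}
```

Callers then get a primitive `float` and no longer need their own null checks; a missing or non-numeric value surfaces as a single exception at the parse site.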

kderusso · 2025-05-05T20:24:23Z

server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java

+        b.endObject();
+    }
+
+    protected void mappingWithIndexOptionsPruningConfig(XContentBuilder b) throws IOException {


We're still not adding prune:true here?

kderusso · 2025-05-05T20:25:02Z

server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java

+        assertTrue(freq1 < freq2);
+    }
+
+    public void testWithIndexOptionsPruningConfigOnly() throws Exception {


I really think this should fail, because without sending in prune:true to the query we will not prune.

kderusso · 2025-05-05T20:35:36Z

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

+
+  - match: { status: 400 }
+  - match: { error.type: "mapper_parsing_exception" }
+  - match: { error.reason: "Failed to parse mapping: [index_options] field [pruning_config] should not be set if [prune] is false" }


Shortcut tip: instead of catch: bad_request, you can shorten this to something like catch: Failed to parse mapping: \[index_options\] field \[pruning_config\] should not be set if \[prune\] is false; you could still keep the status check, but the other error checks are already taken care of by the catch.

kderusso · 2025-05-05T20:37:17Z

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

+  - match: { error.reason: "Failed to parse mapping: [pruning_config] field [tokens_weight_threshold] field should be a number between 0.0 and 1.0" }
+
+---
+"Check sparse_vector token pruning index_options in query":


Could we please also add tests that override the default behavior?

For example, explicitly sending in a pruning config on queries where pruning is disabled in the mapping, and also explicitly sending in prune:false where pruning is set in the mapping?
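
The two override cases can be stated as a small truth table. The sketch below assumes one plausible precedence rule, namely that an explicit query-level `prune` value wins and the mapping's setting is only a fallback; that rule is an assumption made for illustration and is not confirmed anywhere in this thread. `PruneResolution` and `shouldPrune` are hypothetical names.

```java
/** Hypothetical precedence rule for illustration only: query-level setting overrides the mapping. */
final class PruneResolution {

    private PruneResolution() {}

    /**
     * @param queryPrune   the query's explicit prune value, or null when the query does not set it
     * @param mappingPrune the prune value from the field's index_options
     */
    static boolean shouldPrune(Boolean queryPrune, boolean mappingPrune) {
        return queryPrune != null ? queryPrune : mappingPrune;
    }
}
```

Under that assumption, the requested tests correspond to `shouldPrune(true, false)` (query enables pruning although the mapping disables it) and `shouldPrune(false, true)` (query disables pruning although the mapping enables it).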

kderusso · 2025-05-05T20:38:01Z

...ests-with-security/src/test/resources/rest-api-spec/test/remote_cluster/50_sparse_vector.yml

This should pretty much be a copy of the multi_cluster file; I didn't review this file in this round.

kderusso · 2025-05-05T20:39:28Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/sparse_vector_search.yml

@@ -89,6 +89,29 @@ setup:
        model_id: text_expansion_model
        wait_for: started

+---
+teardown:
+  - requires:


Why do we have requires in the teardown? Isn't it enough to remove indices and ignore 404s?

kderusso · 2025-05-05T20:42:43Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/sparse_vector_search.yml

+  - match: { error.reason: "Failed to parse mapping: [pruning_config] field [tokens_weight_threshold] field should be a number between 0.0 and 1.0" }
+
+---
+"Check sparse_vector token pruning index_options in query":


I wonder if this test is flaky because of shard counts?

Initial checkin - needs tests

e02cd3a

elasticsearchmachine added the v9.1.0 label Apr 12, 2025

markjhoy added 2 commits April 11, 2025 21:34

Missing s in IndexVersions

e24ab76

add changelog and docs for index_options

f39b78a

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

eeebfd8

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

51aab0c

correct index version

983ddf1

update tests

9545a0c

Complete tests for SparseVectorFieldMapper

19fe72d

[CI] Auto commit changes from spotless

58f9909

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

5f8e7b9

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

d7d27ba

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

d342656

fix docs

96096ba

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

eed88c6

fix lint

6a6052a

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

f977ea8

markjhoy requested review from kderusso and a team May 5, 2025 17:59

elasticsearchmachine added the needs:triage label May 5, 2025

markjhoy added the Team:Search Relevance label May 5, 2025

elasticsearchmachine removed the Team:Search Relevance label May 5, 2025

markjhoy added the Team:Enterprise Search label May 5, 2025

elasticsearchmachine removed the Team:Enterprise Search label May 5, 2025

markjhoy added Team:Enterprise Search and removed needs:triage labels May 5, 2025

elasticsearchmachine added needs:triage and removed Team:Enterprise Search labels May 5, 2025

markjhoy added Team:ML and removed needs:triage labels May 5, 2025

elasticsearchmachine added needs:triage and removed Team:ML labels May 5, 2025

fix yaml test

74b19ca

kderusso reviewed May 5, 2025

finally fix yaml tests?

a341322

update docs

a47b915

add 8.x tx version; fix yaml tests; optimizations

4e681bd

[CI] Auto commit changes from spotless

6e50539

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

fcf682f

markjhoy requested a review from kderusso May 6, 2025 00:15
