Adding support to exclude semantic_text subfields #127664

Draft: wants to merge 5 commits into main

Samiul-TheSoccerFan
Contributor

Update the field capabilities API (_field_caps) to exclude semantic_text subfields from its response, for indices using either the legacy or the new semantic_text format.

Legacy format:

Setup:

PUT test-field-caps-with-legacy
{
    "settings": {
        "index.mapping.semantic_text.use_legacy_format": true
    },
    "mappings": {
        "properties": {
            "test_field_legacy": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "non_infer_field_legacy": {
                "type": "text"
            },
            "sparse_vector_legacy": {
                "type": "sparse_vector"
            },
            "dense_vector_legacy": {
                "type": "dense_vector",
                "dims": 3,
                "similarity": "l2_norm"
            }
        }
    }
}

PUT test-field-caps-with-legacy/_doc/doc1
{
    "test_field_legacy": "these are not the droids you're looking for. He's free to go around",
    "sparse_vector_legacy": {
        "these": 1,
        "are": 2,
        "not": 3
    },
    "dense_vector_legacy": [1, 2, 3]
}

Query:

GET /_field_caps?allow_no_indices=true&fields=*&index=test*&ignore_unavailable=true&expand_wildcards=open

Response before update (trimmed):

{
  "indices": [
    "test-field-caps-with-legacy"
  ],
  "fields": {
    "non_infer_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks.text": {
      "keyword": {
        "type": "keyword",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference": {
      "object": {
        "type": "object",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "sparse_vector_legacy": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks.embeddings": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector_legacy": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks": {
      "nested": {
        "type": "nested",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    }
  }
}

Response after update (trimmed):

{
  "indices": [
    "test-field-caps-with-legacy"
  ],
  "fields": {
    "non_infer_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "sparse_vector_legacy": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector_legacy": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}
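
The difference between the two legacy-format responses above can be approximated as a prefix filter over field names. The following is a minimal sketch, not the implementation in this PR: the class SemanticTextSubfieldFilter and its visibleFields method are hypothetical names, and the sketch assumes the caller already knows which fields are mapped as semantic_text.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Elasticsearch: drops every field name that
// lives under "<semantic_text field>.inference" from a list of field names.
final class SemanticTextSubfieldFilter {

    static List<String> visibleFields(List<String> fieldNames, Set<String> semanticTextFields) {
        return fieldNames.stream()
            // Keep a name only if no semantic_text parent owns it as an inference subfield.
            .filter(name -> semanticTextFields.stream().noneMatch(parent ->
                name.equals(parent + ".inference") || name.startsWith(parent + ".inference.")))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Field names taken from the "before" response of the legacy-format example.
        List<String> before = List.of(
            "non_infer_field_legacy",
            "test_field_legacy.inference.chunks.text",
            "test_field_legacy.inference",
            "sparse_vector_legacy",
            "test_field_legacy",
            "test_field_legacy.inference.chunks.embeddings",
            "dense_vector_legacy",
            "test_field_legacy.inference.chunks");

        // Prints the names that remain once the inference subfields are dropped.
        System.out.println(visibleFields(before, Set.of("test_field_legacy")));
    }
}
```

Matching on the parent field name plus the ".inference" segment keeps unrelated fields whose names merely begin with the same characters.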

New format:

Setup:

PUT test-field-caps
{
    "mappings": {
        "properties": {
            "test_field": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "non_infer_field": {
                "type": "text"
            },
            "sparse_vector": {
                "type": "sparse_vector"
            },
            "dense_vector": {
                "type": "dense_vector",
                "dims": 3,
                "similarity": "l2_norm"
            }
        }
    }
}

PUT test-field-caps/_doc/doc1
{
    "test_field": "these are not the droids you're looking for. He's free to go around",
    "sparse_vector": {
        "these": 1,
        "are": 2,
        "not": 3
    },
    "dense_vector": [1, 2, 3]
}

Query:

GET /_field_caps?allow_no_indices=true&fields=*&index=test*&ignore_unavailable=true&expand_wildcards=open

Response before update (trimmed):

{
  "indices": [
    "test-field-caps"
  ],
  "fields": {
    "_ignored_source": {
      "_ignored_source": {
        "type": "_ignored_source",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "non_infer_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_index": {
      "_index": {
        "type": "_index",
        "metadata_field": true,
        "searchable": true,
        "aggregatable": true
      }
    },
    "_feature": {
      "_feature": {
        "type": "_feature",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "sparse_vector": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks.embeddings": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks.offset": {
      "offset_source": {
        "type": "offset_source",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_inference_fields": {
      "_inference_fields": {
        "type": "_inference_fields",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field.inference": {
      "object": {
        "type": "object",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "dense_vector": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks": {
      "nested": {
        "type": "nested",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    }
  }
}

Response after update (trimmed):

{
  "indices": [
    "test-field-caps"
  ],
  "fields": {
    "test_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_inference_fields": {
      "_inference_fields": {
        "type": "_inference_fields",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "non_infer_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "sparse_vector": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}

@Samiul-TheSoccerFan added the labels >enhancement, v9.1.0, :Search Foundations/Mapping, :Search Relevance/Vectors, and :SearchOrg/Relevance on May 2, 2025.
@elasticsearchmachine
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Comment on lines +365 to +367
  - requires:
      cluster_features: "gte_v8.16.0"
      reason: field_caps support for semantic_text added in 8.16.0
Contributor Author

Do we need to define a new cluster feature? As I understand it, these subfields are not expected from the field_caps API, so excluding them should not affect the API or Discover. Backward compatibility is also covered by the other YAML file.

Member

I think it would be good to create a test feature for these tests.

Member

@kderusso kderusso left a comment

Agreed with @Mikep86's comments in Slack, but good start!

Comment on lines 155 to 165
/**
 * Returns true if the field should be excluded from the field capabilities response.
 * This is used to exclude fields that are not useful for the user, such as
 * offset_source and inference chunk embeddings.
 */
private static boolean shouldExcludeField(MappedFieldType ft) {
    return ft.typeName().equals("offset_source")
        || ((ft instanceof SparseVectorFieldMapper.SparseVectorFieldType
            || ft instanceof DenseVectorFieldMapper.DenseVectorFieldType
            || ft instanceof KeywordFieldMapper.KeywordFieldType) && ft.name().contains(".inference.chunks"));
}
Contributor

Reiterating my message offline, this is a brittle solution. We shouldn't be hard-coding field names to exclude from field caps. Instead, I recommend investigating a solution where we add a flag to MappedFieldType to control if a field is excluded from field caps.
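
To make the concern concrete, here is a minimal, self-contained sketch contrasting the two approaches. It does not use Elasticsearch classes: FieldType, SimpleFieldType, excludeFromFieldCaps(), and the field name notes.inference.chunks_summary are all hypothetical stand-ins, and excludedByName is a simplified model of shouldExcludeField that compares type names instead of using instanceof.

```java
import java.util.Set;

final class FieldCapsExclusionSketch {

    // Hypothetical stand-in for MappedFieldType; fields are visible by default.
    interface FieldType {
        String name();

        String typeName();

        default boolean excludeFromFieldCaps() {
            return false;
        }
    }

    // Minimal field type whose creator decides the flag at construction time.
    static final class SimpleFieldType implements FieldType {
        private final String name;
        private final String typeName;
        private final boolean excluded;

        SimpleFieldType(String name, String typeName, boolean excluded) {
            this.name = name;
            this.typeName = typeName;
            this.excluded = excluded;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public String typeName() {
            return typeName;
        }

        @Override
        public boolean excludeFromFieldCaps() {
            return excluded;
        }
    }

    private static final Set<String> CHUNK_TYPES = Set.of("sparse_vector", "dense_vector", "keyword");

    // Simplified model of the name-based check under review.
    static boolean excludedByName(FieldType ft) {
        return ft.typeName().equals("offset_source")
            || (CHUNK_TYPES.contains(ft.typeName()) && ft.name().contains(".inference.chunks"));
    }

    // Flag-based check: the decision travels with the field type itself.
    static boolean excludedByFlag(FieldType ft) {
        return ft.excludeFromFieldCaps();
    }

    public static void main(String[] args) {
        // A semantic_text subfield, flagged as excluded by whoever builds it.
        FieldType embeddings = new SimpleFieldType("test_field.inference.chunks.embeddings", "sparse_vector", true);
        // A hypothetical user-defined keyword field whose name happens to contain the pattern.
        FieldType userField = new SimpleFieldType("notes.inference.chunks_summary", "keyword", false);

        for (FieldType ft : new FieldType[] { embeddings, userField }) {
            System.out.println(ft.name() + " byName=" + excludedByName(ft) + " byFlag=" + excludedByFlag(ft));
        }
    }
}
```

In this sketch the name-based check also hides the user-defined keyword field because its name contains ".inference.chunks", while the flag-based check hides only the field that was explicitly marked.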
