ES|QL dense vector field type support #126456

Closed

@carlosdelest carlosdelest commented Apr 8, 2025

Add support for the dense_vector field type. This is the first step toward allowing kNN queries and making dense_vector a first-class citizen in ES|QL.

This allows a mapping that has dense_vector like the following:

{
  "properties": {
    "id": {
      "type": "long"
    },
    "vector": {
      "type": "dense_vector",
      "similarity": "l2_norm"
    }
  }
}

To be retrieved via ES|QL:

FROM dense_vector
| KEEP id, vector
| SORT id
id   | vector
0    | [1.0, 2.0, 3.0]
1    | [4.0, 5.0, 6.0]

For now, only float element types are allowed. Similar work will follow to allow byte and bit element types, but I wanted this implementation reviewed first to ensure it's in line with what we need.

Both indexed and non-indexed types are supported, as is synthetic source.

Support for CSV tests has been added. For now the CSV tests are simple; we can expand on them and also support additional operations on dense_vector field types in subsequent PRs. An integration test has been added to test extensively across different index options and document storage structures.

The dense_vector field type is behind a feature flag, as it will require follow-up work.

@carlosdelest carlosdelest added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch :Search Relevance/Search Catch all for Search Relevance labels Apr 8, 2025
@@ -504,6 +506,80 @@ public String toString() {
}
}

public static class DenseVectorBlockLoader extends DocValuesBlockLoader {
Member Author

Added a BlockLoader for dense vectors, which uses FloatVectorValues to retrieve indexed vector data.

@Override
public BlockLoader blockLoader(MappedFieldType.BlockLoaderContext blContext) {
if (elementType != ElementType.FLOAT) {
throw new UnsupportedOperationException("Only float dense vectors are supported for now");
Member Author

We can work on this next, creating specific BlockLoaders for Byte and Bit field types.

@@ -145,6 +146,10 @@ private static void assertMetadata(
// Type.asType translates all bytes references into keywords
continue;
}
if (blockType == Type.DOUBLE && expectedType == DENSE_VECTOR) {
// DENSE_VECTOR is internally represented as a double block
Member Author

This could potentially change when we support byte and bit element types - we could create the appropriate blocks.

@@ -63,18 +63,7 @@
"type" : "keyword"
},
"salary_change": {
"type": "float",
Member Author

Changing CSV loading made this a problem, as there were parsing exceptions when trying to index float numeric data into integers. I didn't see a convenient way out of this and decided to remove it, as this field is not being tested.

Member

If we can keep this unchanged, that would be great; this is a good example of nested fields.

Member Author

It is just changed for the mapping-default-incompatible mapping, which was created to test some incompatible field mappings that did not include subfields. I'll try to give this another shot but it will require changes to the CSV loader or the dataset 😢

Member

It is just changed for the mapping-default-incompatible mapping, which was created to test some incompatible field mappings that did not include subfields. I'll try to give this another shot but it will require changes to the CSV loader or the dataset 😢

Is it easier if we make another copy of the employees schema and data for dense_vector related tests?

Member Author

Is it easier if we make another copy of the employees schema and data for dense_vector related tests?

The problem is that changing how the CSV tests load data impacted this dataset. Before this change, multivalues were being uploaded as arrays of strings, which is something we don't want to do for dense_vectors, as that format is not supported by the DenseVectorFieldMapper.

It seemed like too much work to change the actual dataset when that particular field is not actually used in the tests.

I'm open to other solutions here!
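
To illustrate the loading difference described above, here is a minimal sketch, not the actual CsvTestsDataLoader code. It assumes a CSV cell in the `[1.0, 2.0, 3.0]` form used by the dense_vector test data, parses it into floats, and renders it as a JSON array of numbers rather than strings. The class and method names (`CsvVectorCell`, `parse`, `toJsonNumbers`) are hypothetical.

```java
final class CsvVectorCell {
    // Parses a cell such as "[1.0, 2.0, 3.0]" into floats (hypothetical helper).
    static float[] parse(String cell) {
        String trimmed = cell.trim();
        if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalArgumentException("expected [v1, v2, ...] but got: " + cell);
        }
        String inner = trimmed.substring(1, trimmed.length() - 1).trim();
        if (inner.isEmpty()) {
            return new float[0];
        }
        String[] parts = inner.split(",");
        float[] values = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Float.parseFloat(parts[i].trim());
        }
        return values;
    }

    // Renders the values as a JSON array of numbers, not of strings.
    // NaN and infinities are not valid JSON and are not handled in this sketch.
    static String toJsonNumbers(float[] values) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                json.append(", ");
            }
            json.append(values[i]);
        }
        return json.append("]").toString();
    }
}
```

The point of the sketch is only the output shape: numbers in the array instead of quoted strings.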

Contributor

@fang-xing-esql are these fields used in other tests somehow?
I see this was actually added by @carlosdelest a while back #117555.
The employees_incompatible index that is set with this mapping is only used in match function/operator tests.
We don't modify any of those tests here, so this looks like a safe change to me.

@@ -827,6 +829,7 @@ public static Literal randomLiteral(DataType type) {
throw new UncheckedIOException(e);
}
}
case DENSE_VECTOR -> Arrays.asList(randomArray(10, 10, i -> new Double[10], ESTestCase::randomDouble));
Contributor

@ioanatia ioanatia Apr 10, 2025

Should we actually use randomFloat() and convert the value to double?
The rationale being that if we just use random double values, we might actually end up with something that can't be represented as a float and can't be used to actually index a dense_vector value.

Member Author

Makes sense - Done in e8878a0
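
The rationale above can be checked with a small standalone sketch. It uses `java.util.Random` as a stand-in for the test framework's random helpers, and the names `FloatBackedDoubles`, `isExactFloat` and `randomFloatAsDouble` are hypothetical. It shows that an arbitrary double may not survive a round trip through float, while a float widened to double always does.

```java
import java.util.Random;

final class FloatBackedDoubles {
    // True when the double survives a round trip through float unchanged.
    static boolean isExactFloat(double value) {
        return (double) (float) value == value;
    }

    // Hypothetical stand-in for the test helper: draw a float, then widen it.
    static double randomFloatAsDouble(Random random) {
        return (double) random.nextFloat();
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        // 0.1 as a double has more precision than a float can hold.
        System.out.println(isExactFloat(0.1)); // false
        // Widening a float to double is always exact.
        System.out.println(isExactFloat(randomFloatAsDouble(random))); // true
    }
}
```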

retrieveDenseVectorData
required_capability: dense_vector_field_type

FROM dense_vector
Contributor

Can we add more csv tests here with commands we know should be supported 🙈 ?
We have KEEP already, but I'm thinking DROP, RENAME and simple EVALs (EVAL a = dense_vector_field) might be supported already.
I know it should all just work - but for our own peace of mind it would be good to cover them.
Can be a single test that uses a combination of commands we know should be supported already.

Member Author

I've added one in 93d45fc. I'm not sure what this would catch, as the inner representation of dense_vector is a DoubleBlock and that is extensively tested for other fields.

I'm sure we will keep adding tests once we include arithmetic operators, conversion, etc to dense_vector.

Member

@fang-xing-esql fang-xing-esql Apr 12, 2025

Do we expect the items in a dense_vector to have a fixed order? I'm asking because a dense_vector looks very much like a multi-valued double field: it is hard to tell from the value alone whether it is a dense_vector field or a multi-valued double field, and the order of the items in an MV is not guaranteed. I wonder what the relationship is between a multi-valued double field and a dense_vector.

Do we expect the functions/commands that take multi-valued fields to apply to dense_vector? For example the mv_xxx and to_xxx functions, mv_expand, stats by mv_fields, etc.

If I understand correctly, dense_vector does not support sort or aggregation. Does dense_vector support comparison, and does comparison make sense for dense_vector fields?

Member Author

Do we expect the items in a dense_vector to have a fixed order?

Yes. It's crucial that the vector dimensions match between different vectors, so they can be compared.

a dense_vector looks very much like a multi-valued double field: it is hard to tell from the value alone whether it is a dense_vector field or a multi-valued double field

I've used MV as this seemed a supported way of internally representing a double array. From a user perspective, it's a different data type - it's a dense_vector, which will not necessarily be supported by the same functions, and it will always have the same number of dimensions / ordering for a specific mapping.

I wonder what the relationship is between a multi-valued double field and a dense_vector.

We were thinking of adding a TO_DENSE_VECTOR cast function so users can specify a dense_vector using the MV double syntax:

WHERE knn(field, TO_DENSE_VECTOR([0.1, 0.2, 0.3, ... , 1.0]))

Besides that, there should be no relation between the two. They are different data types that have the same representation (an array of elements).

Do we expect the functions/commands that take multi-valued fields to apply to dense_vector? For example the mv_xxx and to_xxx functions, mv_expand, stats by mv_fields, etc.

It would probably help to differentiate the two data types if dense_vector fields do not support MV functions.

We can provide support for multivalued functions, but most of them will not make sense in the context of a dense_vector (MV_APPEND, MV_CONCAT, MV_DEDUPE, MV_SORT, MV_SUM). Others can be useful even though they are not necessarily vector related (MV_COUNT, MV_FIRST, MV_MEDIAN, MV_MAX, MV_MIN), but supporting those could confuse ESQL users.

dense_vector does not support sort or aggregation.

Correct.

Does dense_vector support comparison, and does comparison make sense for dense_vector fields?

We could support equality. Binary comparisons like greater / less than make no sense for dense_vector.

Contributor

Thanks for adding the tests - this is addressed from my POV.
We can follow up on supporting more functions/operators - e.g. equality makes sense to me.

@carlosdelest carlosdelest mentioned this pull request Apr 11, 2025
23 tasks
Member

@fang-xing-esql fang-xing-esql left a comment

Thank you @carlosdelest! At first glance, the value of a dense_vector field looks very similar to a multi-valued double field, so I'm wondering whether they work in similar ways, and when we expect a dense_vector field to behave differently from a double field in ES|QL. I tried some queries to validate my thoughts: much of the time the two behave very similarly, and sometimes they don't. My observations are below; please have a look and see whether they make sense. I haven't seen an example of a knn query yet: does a knn query work on multi-valued double fields? I was looking for where a dense_vector field works differently from a multi-valued double field, and where it is expected to work the same.

mapping

curl -u elastic:password -X PUT "localhost:9200/idx001?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "numericfield": {"type": "double"},
      "mixedfield": {"type": "dense_vector","similarity": "l2_norm"}
    }
  }
}
'

curl -u elastic:password -X PUT "localhost:9200/idx002?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "numericfield": {"type": "double"},
      "mixedfield": {"type" : "double"}
    }
  }
}
'

curl -X PUT "localhost:9200/idx001/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index": {}}
{"numericfield": 1,  "mixedfield" : [1.0, 2.0]}
'

curl -X PUT "localhost:9200/idx002/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index": {}}
{"numericfield": 2, "mixedfield" : [3.0, 4.0]}
'

queries

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx00*"
}
'
  mixedfield   | numericfield  
---------------+---------------
null           |2.0            
null           |1.0            

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx00* | eval x = mixedfield::double"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "verification_exception",
        "reason" : "Found 1 problem\nline 1:24: Cannot use field [mixedfield] due to ambiguities being mapped as [2] incompatible types: [dense_vector] in [idx001], [double] in [idx002]"
      }
    ],
    "type" : "verification_exception",
    "reason" : "Found 1 problem\nline 1:24: Cannot use field [mixedfield] due to ambiguities being mapped as [2] incompatible types: [dense_vector] in [idx001], [double] in [idx002]"
  },
  "status" : 400
}

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx001 | mv_expand mixedfield"
}
'
 numericfield  |  mixedfield   
---------------+---------------
1.0            |1.0            
1.0            |2.0            

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx001 | stats count(*) by mixedfield"
}
'
   count(*)    |  mixedfield   
---------------+---------------
1              |1.0            
1              |2.0            

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx001 | eval x = mv_sort(mixedfield)"
}
'
  mixedfield   | numericfield  |       x       
---------------+---------------+---------------
[1.0, 2.0]     |1.0            |[1.0, 2.0]     

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx001 | eval x = mixedfield::string"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "verification_exception",
        "reason" : "Found 1 problem\nline 1:24: argument of [mixedfield::string] must be [aggregate_metric_double or boolean or cartesian_point or cartesian_shape or date_nanos or datetime or geo_point or geo_shape or ip or numeric or string or version], found value [mixedfield] type [dense_vector]"
      }
    ],
    "type" : "verification_exception",
    "reason" : "Found 1 problem\nline 1:24: argument of [mixedfield::string] must be [aggregate_metric_double or boolean or cartesian_point or cartesian_shape or date_nanos or datetime or geo_point or geo_shape or ip or numeric or string or version], found value [mixedfield] type [dense_vector]"
  },
  "status" : 400
}

+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
  "query": "from idx002 | eval x= mixedfield::string"
}
'
  mixedfield   | numericfield  |       x       
---------------+---------------+---------------
[3.0, 4.0]     |2.0            |[3.0, 4.0]     

@@ -827,6 +829,7 @@ public static Literal randomLiteral(DataType type) {
throw new UncheckedIOException(e);
}
}
case DENSE_VECTOR -> Arrays.asList(randomArray(10, 10, i -> new Double[10], () -> (double) randomFloat()));
Member

randomDouble() is available and could be used here; is randomFloat() used on purpose?

Contributor

this is on purpose, see #126456 (comment)

@carlosdelest
Member Author

@fang-xing-esql both dense_vector and a double multivalued field share some common traits:

  • Both are arrays of double values
  • Internal representation uses the multivalue mechanism in Blocks
  • String representation should work the same way (unless we want it to be different)
  • Users will use the multivalue format to specify them (potentially converting / casting to a dense_vector type)

They differ in that, for dense_vector fields:

  • Values are always ordered
  • They are not sortable
  • Most MV_ functions make no sense to operate on dense_vector fields.
  • dense_vector fields will have specific operations similar to script score existing vector functions.

I think what you observed makes sense for dense_vector fields. We may provide MV_ functions support, but in my mind that would make it confusing from a user perspective. dense_vectors should be treated as a single value, and not a collection of values.
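
A short sketch of why element order is part of a vector's identity while it is not for a bag of values. The helper names (`VectorOrder`, `l2Distance`, `sameValuesIgnoringOrder`) are hypothetical and plain arrays stand in for field values; this is not ES|QL code. Two arrays with the same values in a different order are equal when treated as an unordered collection, yet they are distinct points under the l2_norm distance used in the mapping above.

```java
import java.util.Arrays;

final class VectorOrder {
    // Euclidean (L2) distance; both vectors must have the same dimensions.
    static double l2Distance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Compares the arrays as unordered collections by sorting copies of them.
    static boolean sameValuesIgnoringOrder(float[] a, float[] b) {
        float[] x = a.clone();
        float[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    public static void main(String[] args) {
        float[] v1 = {1.0f, 2.0f, 3.0f};
        float[] v2 = {3.0f, 2.0f, 1.0f};
        System.out.println(sameValuesIgnoringOrder(v1, v2)); // true
        System.out.println(l2Distance(v1, v2) > 0);          // true: different vectors
    }
}
```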

@carlosdelest
Member Author

@ioanatia @fang-xing-esql I was able to use FLOAT blocks for dense_vector - changed that in cd462b8.

As Ioana noted, we're not publicly exposing float as a type here but dense_vector, so the change is doable as an implementation detail 👍 . Thanks!

Contributor

@ChrisHegarty ChrisHegarty left a comment

I'm happy to see FloatBlocks being used, but they have too much flexibility - all different dims per position and also nulls. Maybe we just need an easy way/utility to assert their correct shape?
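
One possible shape check, as a rough sketch only. It does not use the real Block API: a `List<float[]>` stands in for the per-position values of a block, and `DenseVectorShape.isWellFormed` is a hypothetical name. It rejects null positions and positions whose length differs from the expected number of dimensions.

```java
import java.util.List;

final class DenseVectorShape {
    // Returns true only if every position holds a non-null vector of exactly `dims` values.
    // `rows` is an assumed stand-in for the positions of a block, not the real Block type.
    static boolean isWellFormed(List<float[]> rows, int dims) {
        for (float[] row : rows) {
            if (row == null || row.length != dims) {
                return false;
            }
        }
        return true;
    }
}
```

In a test this could be wrapped in an assertion; whether such a check belongs in a utility or in a dedicated block type is the open question in this thread.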

Contributor

@ioanatia ioanatia left a comment

there's an outstanding question from @ChrisHegarty that would be good to address but otherwise LGTM

…ctor_support

# Conflicts:
#	x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@carlosdelest
Member Author

I'm happy to see FloatBlocks being used, but they have too much flexibility - all different dims per position and also nulls. Maybe we just need an easy way/utility to assert their correct shape?

@ChrisHegarty, I renamed some methods to make clear that we're creating dense vectors instead of floats, and added some checks for them, in c951ee7.

I had to revert some of those in ba0a6b9, as I can't easily add a new float block builder given the sealed structure of builders. We may have to add a new Block type for dense_vector, maybe as a follow up?

LMKWYT

@@ -0,0 +1,3 @@
id:l, vector:dense_vector
0, [1.0, 2.0, 3.0]
1, [4.0, 5.0, 6.0]
Contributor

This might be very nitpicky, but can we add another dense_vector value whose values are not in sorted order?
This might be why we did not catch #126456 (comment) during tests.

Member Author

mvOrdered does not impact retrieval as it's used as an optimization in some cases. But adding unordered data uncovered a small fix that needed to be done to support multivalued style fields: 4b2126e

.field("type", "dense_vector")
.field("index", index);
if (index) {
mapping.field("similarity", "l2_norm");
Contributor

nit: is it possible to randomize the similarity option? Otherwise we could leave it out completely, since it is optional.

Member Author

The problem with that is some similarities normalize the vector values, so what we retrieve back is not what was stored, thus making comparisons difficult. I think we're ok with keeping this simple.
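
A sketch of the effect described above, under the assumption that normalization means scaling the vector to unit L2 length; the thread does not say which similarities do this. The class and method names are hypothetical. After normalization, a test comparing the retrieved vector against the indexed input would fail even though nothing went wrong.

```java
final class NormalizedRoundTrip {
    // Returns a unit-length copy of the vector. Zero vectors are not handled in this sketch.
    static float[] l2Normalize(float[] vector) {
        double sumOfSquares = 0;
        for (float v : vector) {
            sumOfSquares += (double) v * v;
        }
        float norm = (float) Math.sqrt(sumOfSquares);
        float[] result = new float[vector.length];
        for (int i = 0; i < vector.length; i++) {
            result[i] = vector[i] / norm;
        }
        return result;
    }

    public static void main(String[] args) {
        float[] indexed = {3.0f, 4.0f};
        float[] retrieved = l2Normalize(indexed);
        // The values no longer match the input, so an equality assertion would fail.
        System.out.println(java.util.Arrays.toString(retrieved));
    }
}
```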


try (var resp = run(query)) {
assertColumnNames(resp.columns(), List.of("id", "vector"));
assertColumnTypes(resp.columns(), List.of("integer", "dense_vector"));
Contributor

are we missing some assertions here for the values? to at least check that they are not nulls?

Member Author

That's done in other tests - I started with this test to just check that the field types are retrieved correctly.


@Override
public String toString() {
return "BlockSourceReader.Floats";
Contributor

Should this be "BlockSourceReader.DenseVectors"?

Member Author

++, afe79f7


@Override
protected String name() {
return "Floats";
Contributor

Again, just checking whether this is the value we want to return, given that the class is called DenseVectorBlockLoader.

Member Author

Good catch, I renamed this multiple times but didn't follow through on this method. Thanks! afe79f7

carlosdelest and others added 4 commits April 30, 2025 19:55
…ctor_support

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
#	x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@carlosdelest
Member Author

Closing this PR, as the final approach will involve multiple field types for the different dense_vector element types.

Labels
:Analytics/ES|QL AKA ESQL >non-issue :Search Relevance/Search Catch all for Search Relevance Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.1.0
6 participants