Add ability to redirect ingestion failures on data streams to a failure store #126973

jbaiera · 2025-04-17T03:00:04Z

Documents that encountered ingest pipeline failures or mapping conflicts would previously be returned to the client as errors in the bulk and index operations. Many client applications are not equipped to respond to these failures. As a result, failed documents are often dropped by the client, which cannot hold the broken documents indefinitely. In many end user workloads, these failed documents represent events that could be critical signals for observability or security use cases.

To help mitigate this problem, data streams now maintain a "failure store" which is used to accept and hold documents that fail to be ingested due to preventable configuration errors. The data stream's failure store operates like a separate set of backing indices with their own mappings and access patterns that allow Elasticsearch to accept documents that would otherwise be rejected due to unhandled ingest pipeline exceptions or mapping conflicts.

Users can enable redirection of ingest failures to the failure store on new data streams by specifying it in the new data_stream_options field inside of a component or index template:

PUT _index_template/my-template
{
  "index_patterns": ["logs-test-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    }
  }
}
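
For clients that generate templates programmatically, the request body above can be assembled as a plain dictionary. The following is a minimal sketch in Python using only the standard library. The helper name `failure_store_template` is hypothetical and not part of any Elasticsearch client; the sketch only builds the body and sends nothing to a cluster.

```python
import json


def failure_store_template(index_patterns, enabled=True):
    """Build an index template body that enables the failure store.

    Hypothetical helper for illustration; it mirrors the request body
    shown in the PR description.
    """
    return {
        "index_patterns": list(index_patterns),
        "data_stream": {},
        "template": {
            "data_stream_options": {
                # A real boolean, so it serializes as true rather than "true"
                "failure_store": {"enabled": bool(enabled)}
            }
        },
    }


body = failure_store_template(["logs-test-*"])
# Serialized form, usable as the body of PUT _index_template/my-template
payload = json.dumps(body, indent=2)
```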

Existing data streams can be configured with the new data stream _options endpoint:

PUT _data_stream/logs-test-apache/_options
{
  "failure_store": {
    "enabled": true
  }
}
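
As an illustration of calling the _options endpoint from a script, the sketch below builds the same PUT request with Python's standard library. The function name `failure_store_options_request` and the cluster address `http://localhost:9200` are assumptions made for the example, and authentication is left out. The request is constructed but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def failure_store_options_request(base_url, data_stream, enabled):
    """Build (but do not send) a PUT _data_stream/<name>/_options request.

    base_url is an assumed cluster address; adjust for a real deployment.
    """
    body = {"failure_store": {"enabled": bool(enabled)}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_data_stream/{data_stream}/_options",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = failure_store_options_request(
    "http://localhost:9200", "logs-test-apache", True
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```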

When redirection is enabled, ingestion-related failures are captured in the failure store whenever the cluster is able to do so, along with the timestamp at which the failure occurred, details about the error encountered, and the document that could not be ingested. Since failure stores are a kind of Elasticsearch index, we can search the data stream for the failures that it has collected. The failures are not shown by default because they are stored in different indices than the normal data stream data. To retrieve the failures, we use the _search API along with a new piece of index pattern syntax, the :: selector.

POST logs-test-apache::failures/_search

This index syntax tells the search operation to target the indices in the data stream's failure store instead of its backing indices. It can be combined with other index patterns in a number of ways to include their failure store indices in the search operation:

POST logs-*::failures/_search
POST logs-*,logs-*::failures/_search
POST *::failures/_search
POST _query
{
  "query": "FROM my_data_stream*::failures"
}
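
The index expressions above follow a simple pattern that can be generated in client code. The sketch below is a hedged illustration in Python: `with_failures` and `data_and_failures` are hypothetical helper names, and the code assumes a plain comma-separated expression. It does not handle exclusions, remote cluster prefixes, or date math, and it leaves any pattern that already carries a selector untouched.

```python
def with_failures(expression):
    """Append the ::failures selector to each pattern in an index expression."""
    patterns = [p.strip() for p in expression.split(",") if p.strip()]
    return ",".join(
        # Patterns that already have a selector are passed through unchanged
        p if "::" in p else f"{p}::failures"
        for p in patterns
    )


def data_and_failures(expression):
    """Target both the backing indices and the failure store indices."""
    return f"{expression},{with_failures(expression)}"


search_path = f"/{with_failures('logs-*')}/_search"
```

With these helpers, `data_and_failures("logs-*")` yields the same expression as the second request in the list above.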

This PR removes the feature flags and guards that prevent the new failure store functionality from operating in production runtimes.

elasticsearchmachine · 2025-04-17T03:00:28Z

Hi @jbaiera, I've created a changelog YAML for you.

elasticsearchmachine · 2025-04-17T16:33:34Z

Hi @jbaiera, I've updated the changelog YAML for you. Note that since this PR is labelled release highlight, you need to update the changelog YAML to fill out the extended information sections.

jbaiera · 2025-04-17T18:14:38Z

@elasticmachine update branch

jbaiera · 2025-04-17T22:17:28Z

@elasticmachine update branch

jbaiera · 2025-04-18T02:41:55Z

@elasticmachine update branch

elasticsearchmachine · 2025-04-18T04:09:30Z

Pinging @elastic/es-data-management (Team:Data Management)

dakrone

LGTM, I left a few minor comments

docs/changelog/126973.yaml

server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java

dakrone · 2025-04-18T14:55:55Z

server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java

+        // Should be removed after backport
+        PARSER.declareBoolean(ConstructingObjectParser.optionalConstructorArg(), FAILURE_STORE_FIELD);


Why should this be removed after backport?

This is an old field - we use the data stream options to determine if failure store is enabled now. Currently, the field is read and used as a fallback if data stream options are not present, but I think that's mostly for BWC testing during development. I think logic-wise we could simply ignore the field if present because it would always be overridden by data stream options. The only situation where this field is relevant is mixed clusters with 8.19 nodes and very old nodes running with the feature flag on, which we would not support.

I opened #127071 for this. I also cleaned up the serialization logic a little.
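
The precedence described in this thread can be sketched as follows. This is an illustration in Python of the stated rule only (data stream options win, the old field is a fallback); it is not the Java code in DataStream.java. The function name is hypothetical, and the default of disabled when neither value is present is an assumption.

```python
from typing import Optional


def failure_store_enabled(
    options_enabled: Optional[bool], legacy_field: Optional[bool]
) -> bool:
    """Resolve whether the failure store is enabled.

    options_enabled: value from data stream options, or None if absent.
    legacy_field: value of the old boolean field, or None if absent.
    """
    if options_enabled is not None:
        # Data stream options always override the old field
        return options_enabled
    # Fallback path; assumed to default to disabled when nothing is set
    return bool(legacy_field)
```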


jbaiera added 8 commits April 16, 2025 15:21

Remove feature flag from prod code

8222e76

Remove feature flag from xpack main code

56a7989

Remove feature flag from test code

efed960

Remove feature flag from test cluster configs

f9ff7cc

Set rest spec to stable/public

d312f39

Cleanup and precommit

0b84f0d

Remove last references to flag and flag itself

40fe4c0

Remove dev feature guard from ESQL antlr files

2b0c91b

jbaiera added the >feature, :Data Management/Data streams (Data streams and their lifecycles), v8.19.0, and v9.1.0 labels Apr 17, 2025

Update docs/changelog/126973.yaml

71919f1

jbaiera requested a review from gmarouli April 17, 2025 03:09

dakrone added the release highlight label Apr 17, 2025

Update docs/changelog/126973.yaml

da6fd85

Fix changelog

bb6f9e5

Merge branch 'main' into failure-store-feature-flag-removal

adb14b1

elasticsearchmachine added the serverless-linked label (Added by automation, don't add manually) Apr 17, 2025

Merge branch 'main' into failure-store-feature-flag-removal

762ea7e

Merge branch 'main' into failure-store-feature-flag-removal

63f275e

jbaiera marked this pull request as ready for review April 18, 2025 04:09

jbaiera requested review from a team as code owners April 18, 2025 04:09

elasticsearchmachine added the Team:Data Management label (Meta label for data/management team) Apr 18, 2025

dakrone approved these changes Apr 18, 2025

jbaiera and others added 3 commits April 18, 2025 12:13

Update docs/changelog/126973.yaml

3d492d0

Co-authored-by: Lee Hinman <[email protected]>

Clean up DataStream serialization

a3207ff

Merge branch 'main' into failure-store-feature-flag-removal

82a7458

jbaiera added the auto-backport label (Automatically create backport pull requests when merged) Apr 18, 2025

Fix merge issues

5b20486

jbaiera removed the auto-backport label (Automatically create backport pull requests when merged) Apr 18, 2025

jbaiera merged commit 7b89f4d into elastic:main Apr 18, 2025
17 checks passed

jbaiera deleted the failure-store-feature-flag-removal branch April 18, 2025 20:34

jbaiera added the backport pending label Apr 19, 2025

dakrone mentioned this pull request Apr 28, 2025

Failed document handler #95534

Closed

jbaiera mentioned this pull request Apr 30, 2025

[8.19] Add ability to redirect ingestion failures on data streams to a failure store (#126973) #127546

Merged

jbaiera removed the backport pending label May 6, 2025
