
Add ability to redirect ingestion failures on data streams to a failure store #126973

Merged
merged 18 commits into from
Apr 18, 2025

Conversation

jbaiera
Member

@jbaiera jbaiera commented Apr 17, 2025

Documents that encountered ingest pipeline failures or mapping conflicts were previously returned to the client as errors in the bulk and index responses. Many client applications are not equipped to respond to these failures, and because a client cannot hold broken documents indefinitely, the failed documents are often dropped. In many end-user workloads, these failed documents represent events that could be critical signals for observability or security use cases.

To help mitigate this problem, data streams now maintain a "failure store" that accepts and holds documents that fail to be ingested due to preventable configuration errors. The data stream's failure store operates like a separate set of backing indices with their own mappings and access patterns, which allows Elasticsearch to accept documents that would otherwise be rejected due to unhandled ingest pipeline exceptions or mapping conflicts.

Users can enable redirection of ingest failures to the failure store on new data streams by specifying it in the new data_stream_options field inside a component template or index template:

PUT _index_template/my-template
{
  "index_patterns": ["logs-test-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    }
  }
}

Existing data streams can be configured with the new data stream _options endpoint:

PUT _data_stream/logs-test-apache/_options
{
  "failure_store": {
    "enabled": true
  }
}
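
As a rough illustration of the two requests above, here is a minimal Python sketch that builds both JSON bodies from one helper so the `failure_store` block stays identical in each place. The function names are hypothetical and not part of Elasticsearch or any client library. The type check is only a convention of this sketch, to keep `enabled` a JSON boolean as in the template example.

```python
import json


def failure_store_options(enabled: bool) -> dict:
    """Body for PUT _data_stream/<name>/_options (hypothetical helper)."""
    # Convention of this sketch: insist on a real boolean so the
    # serialized JSON contains true/false rather than a quoted string.
    if not isinstance(enabled, bool):
        raise TypeError("enabled must be a bool")
    return {"failure_store": {"enabled": enabled}}


def index_template(patterns: list, enabled: bool) -> dict:
    """Body for PUT _index_template/<name> (hypothetical helper)."""
    return {
        "index_patterns": list(patterns),
        "data_stream": {},
        # data_stream_options sits inside "template", as in the PR description.
        "template": {"data_stream_options": failure_store_options(enabled)},
    }


if __name__ == "__main__":
    print("PUT _index_template/my-template")
    print(json.dumps(index_template(["logs-test-*"], True), indent=2))
    print("PUT _data_stream/logs-test-apache/_options")
    print(json.dumps(failure_store_options(True), indent=2))
```

The printed bodies can be pasted into whatever HTTP client is in use. The sketch does not send anything to a cluster.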

When redirection is enabled, ingestion-related failures are captured in the failure store whenever the cluster is able to do so. Each captured failure records the timestamp at which the failure occurred, details about the error encountered, and the document that could not be ingested. Since failure stores are a kind of Elasticsearch index, we can search the data stream for the failures that it has collected. The failures are not shown by default because they are stored in different indices from the normal data stream data. To retrieve the failures, we use the _search API along with a new piece of index pattern syntax, the :: selector.

POST logs-test-apache::failures/_search

This index syntax tells the search operation to target the indices in the data stream's failure store instead of its backing indices. It can be combined with other index patterns in a number of ways to include their failure store indices in the search operation:

POST logs-*::failures/_search
POST logs-*,logs-*::failures/_search
POST *::failures/_search
POST _query
{
  "query": "FROM my_data_stream*::failures"
}
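
The patterns above all follow the same shape, so a small helper can produce them. This is a minimal Python sketch that only does string manipulation. The function name `with_failures` and its `include_data` parameter are hypothetical, and only the `::failures` selector itself comes from this PR.

```python
def with_failures(expression: str, include_data: bool = False) -> str:
    """Append the ::failures selector to each pattern in a comma-separated
    index expression. With include_data=True, each original pattern is kept
    next to its ::failures variant (hypothetical helper)."""
    parts = [p.strip() for p in expression.split(",") if p.strip()]
    if not parts:
        raise ValueError("index expression is empty")
    out = []
    for pattern in parts:
        if "::" in pattern:
            # Avoid producing a doubled selector such as logs-*::failures::failures.
            raise ValueError(f"pattern already has a selector: {pattern}")
        if include_data:
            out.append(pattern)
        out.append(f"{pattern}::failures")
    return ",".join(out)


if __name__ == "__main__":
    print(f"POST {with_failures('logs-*')}/_search")
    print(f"POST {with_failures('logs-*', include_data=True)}/_search")
    print(f"FROM {with_failures('my_data_stream*')}")
```

Keeping the selector logic in one place avoids hand-editing index expressions when a search needs to cover both the regular data and the failure store.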

This PR removes the feature flags and guards that prevent the new failure store functionality from operating in production runtimes.

@elasticsearchmachine
Collaborator

Hi @jbaiera, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Hi @jbaiera, I've updated the changelog YAML for you. Note that since this PR is labelled release highlight, you need to update the changelog YAML to fill out the extended information sections.

@jbaiera
Member Author

jbaiera commented Apr 17, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 17, 2025
@jbaiera
Member Author

jbaiera commented Apr 17, 2025

@elasticmachine update branch

@jbaiera
Member Author

jbaiera commented Apr 18, 2025

@elasticmachine update branch

@jbaiera jbaiera marked this pull request as ready for review April 18, 2025 04:09
@jbaiera jbaiera requested review from a team as code owners April 18, 2025 04:09
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Apr 18, 2025
Member

@dakrone dakrone left a comment

LGTM, I left a few minor comments

Comment on lines 1350 to 1351
// Should be removed after backport
PARSER.declareBoolean(ConstructingObjectParser.optionalConstructorArg(), FAILURE_STORE_FIELD);
Member

Why should this be removed after backport?

Member Author

This is an old field. We now use the data stream options to determine whether the failure store is enabled. Currently, the field is read and used as a fallback if data stream options are not present, but I think that's mostly for BWC testing during development. Logic-wise, I think we could simply ignore the field if present, because it would always be overridden by data stream options. The only situation where this field is relevant is a mixed cluster with 8.19 nodes and very old nodes running with the feature flag on, which we would not support.

I opened #127071 for this. I also cleaned up the serialization logic a little.

@jbaiera jbaiera added the auto-backport Automatically create backport pull requests when merged label Apr 18, 2025
@jbaiera jbaiera removed the auto-backport Automatically create backport pull requests when merged label Apr 18, 2025
@jbaiera jbaiera merged commit 7b89f4d into elastic:main Apr 18, 2025
17 checks passed
@jbaiera jbaiera deleted the failure-store-feature-flag-removal branch April 18, 2025 20:34
jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request Apr 30, 2025
…re store (elastic#126973)

Removes the feature flags and guards that prevent the new failure store functionality
from operating in production runtimes.
elasticsearchmachine added a commit that referenced this pull request Apr 30, 2025
…a failure store (#126973) (#127546)

* Add ability to redirect ingestion failures on data streams to a failure store (#126973)

Removes the feature flags and guards that prevent the new failure store functionality
from operating in production runtimes.

* Fix build

* [CI] Auto commit changes from spotless

* Fix build

* Fix build

* Fix build

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Labels
:Data Management/Data streams Data streams and their lifecycles >feature release highlight serverless-linked Added by automation, don't add manually Team:Data Management Meta label for data/management team v8.19.0 v9.1.0

4 participants