Skip to content

[Failure Store] Expose failure store lifecycle information via the GET data stream API #126668

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

gmarouli
Copy link
Contributor

@gmarouli gmarouli commented Apr 11, 2025

To retrieve the effective configuration you need to use the GET data streams API, for example, if a data stream has empty data stream options, it might still have failure store enabled from a cluster setting. The failure store is managed by default with a lifecycle with infinite (for now) retention, so the response will look like this:

GET _data_stream/*
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "timestamp_field": {
        "name": "@timestamp"
      },
      .....
      "failure_store": {
        "enabled": true,
        "lifecycle": {
          "enabled": true
        },
        "rollover_on_write": false,
        "indices": [
           {
            "index_name": ".fs-my-data-stream-2099.03.08-000003",
            "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
            "managed_by": "Data stream lifecycle"
          }
        ]
      }
    },...
]

In case there is a failure indexed managed by ILM the failure index info will be displayed as follows.

      {
          "index_name": ".fs-my-data-stream-2099.03.08-000002",
          "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
          "prefer_ilm": true,
          "ilm_policy": "my-lifecycle-policy",
          "managed_by": "Index Lifecycle Management"
        }

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Apr 11, 2025
@gmarouli gmarouli added >non-issue :Data Management/Data streams Data streams and their lifecycles auto-backport Automatically create backport pull requests when merged v8.19.0 labels Apr 11, 2025
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels Apr 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 14, 2025
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a couple small comments. I think also now that we've removed the feature flag some of the code can be simplified.

@@ -29,6 +29,7 @@
* supports the following configurations only explicitly enabling or disabling the failure store
*/
public record DataStreamFailureStore(Boolean enabled) implements SimpleDiffable<DataStreamFailureStore>, ToXContentObject {
public static final String FAILURES_LIFECYCLE_API_CAPABILITY = "failures_lifecycle";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It looks like this is only used in the rest action. Does it make sense to move this there? Alternatively, it's describing something about the structure of the response, should it be moved to the response object/action class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, name wise, should we make this something like failure_store.lifecycle to denote specifically what part of the request is new? I see we have other changes to the response as well, so this is a soft suggestion. If you think this should be more generic to describe those changes that's also fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the naming, about placement, I put it here because it will be used also by the data stream options later. But we can always move it then.

@gmarouli gmarouli added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 23, 2025
@elasticsearchmachine elasticsearchmachine merged commit db2992f into elastic:main Apr 23, 2025
19 checks passed
@gmarouli gmarouli deleted the failure-lifecycle/expose-in-get-data-streams branch April 23, 2025 13:45
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126668

@gmarouli
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Apr 23, 2025
…ET` data stream API (elastic#126668)

To retrieve the effective configuration you need to use the `GET` data
streams API, for example, if a data stream has empty data stream
options, it might still have failure store enabled from a cluster
setting. The failure store is managed by default with a lifecycle with
infinite (for now) retention, so the response will look like this:

```
GET _data_stream/*
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "timestamp_field": {
        "name": "@timestamp"
      },
      .....
      "failure_store": {
        "enabled": true,
        "lifecycle": {
          "enabled": true
        },
        "rollover_on_write": false,
        "indices": [
           {
            "index_name": ".fs-my-data-stream-2099.03.08-000003",
            "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
            "managed_by": "Data stream lifecycle"
          }
        ]
      }
    },...
]
```

In case there is a failure indexed managed by ILM the failure index info
will be displayed as follows.

```
      {
          "index_name": ".fs-my-data-stream-2099.03.08-000002",
          "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
          "prefer_ilm": true,
          "ilm_policy": "my-lifecycle-policy",
          "managed_by": "Index Lifecycle Management"
        }
```

(cherry picked from commit db2992f)

# Conflicts:
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/action/GetDataStreamsResponseTests.java
#	server/src/main/java/org/elasticsearch/action/datastreams/GetDataStreamAction.java
elasticsearchmachine pushed a commit that referenced this pull request Apr 29, 2025
…the `GET` data stream API (#126668) (#127255)

* [Failure Store] Expose failure store lifecycle information via the `GET` data stream API (#126668)

To retrieve the effective configuration you need to use the `GET` data
streams API, for example, if a data stream has empty data stream
options, it might still have failure store enabled from a cluster
setting. The failure store is managed by default with a lifecycle with
infinite (for now) retention, so the response will look like this:

```
GET _data_stream/*
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "timestamp_field": {
        "name": "@timestamp"
      },
      .....
      "failure_store": {
        "enabled": true,
        "lifecycle": {
          "enabled": true
        },
        "rollover_on_write": false,
        "indices": [
           {
            "index_name": ".fs-my-data-stream-2099.03.08-000003",
            "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
            "managed_by": "Data stream lifecycle"
          }
        ]
      }
    },...
]
```

In case there is a failure indexed managed by ILM the failure index info
will be displayed as follows.

```
      {
          "index_name": ".fs-my-data-stream-2099.03.08-000002",
          "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
          "prefer_ilm": true,
          "ilm_policy": "my-lifecycle-policy",
          "managed_by": "Index Lifecycle Management"
        }
```

(cherry picked from commit db2992f)

# Conflicts:
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/action/GetDataStreamsResponseTests.java
#	server/src/main/java/org/elasticsearch/action/datastreams/GetDataStreamAction.java

* Ignore correctly the failure store response
gmarouli added a commit that referenced this pull request Apr 30, 2025
…tion (#127314)

The failure store is a set of data stream indices that are used to store certain type of ingestion failures. Until this moment they were sharing the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that typically the failures will need to be retained much less than the data. Considering this we believe the lifecycle needs of the failures also more limited and they fit better the simplicity of the data stream lifecycle feature.

This allows the user to only set the desired retention and we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that these data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too, this configuration reflects the user's configuration only as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
To retrieve the effective configuration you need to use the GET data streams API, see #126668

Functionality

The data stream lifecycle (DLM) will manage the failure indices regardless if the failure store is enabled or not. This will ensure that if the failure store gets disabled we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
     "available": true,
     "enabled": true,
     "data_streams": 10,
     "indices_count": 50,
     "failure_store": {
       "explicitly_enabled_count": 1,
       "effectively_enabled_count": 15,
       "failure_indices_count": 30
       "lifecycle": { 
         "explicitly_enabled_count": 5,
         "effectively_enabled_count": 20,
         "data_retention": {
           "configured_data_streams": 5,
           "minimum_millis": X,
           "maximum_millis": Y,
           "average_millis": Z,
          },
          "effective_retention": {
            "retained_data_streams": 20,
            "minimum_millis": X,
            "maximum_millis": Y, 
            "average_millis": Z
          },
         "global_retention": {
           "max": {
             "defined": false
           },
           "default": {
             "defined": true,  <------ this is the default value applicable for the failure store
             "millis": X
           }
        }
      }
   }
}
Implementation details

We ensure that partially reset failure store will create valid failure store configuration.
We ensure that when a node communicates with a note with a previous version it will ensure it will not send an invalid failure store configuration enabled: null.
gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Apr 30, 2025
…tion (elastic#127314)

The failure store is a set of data stream indices that are used to store certain type of ingestion failures. Until this moment they were sharing the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that typically the failures will need to be retained much less than the data. Considering this we believe the lifecycle needs of the failures also more limited and they fit better the simplicity of the data stream lifecycle feature.

This allows the user to only set the desired retention and we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that these data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too, this configuration reflects the user's configuration only as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
To retrieve the effective configuration you need to use the GET data streams API, see elastic#126668

Functionality

The data stream lifecycle (DLM) will manage the failure indices regardless if the failure store is enabled or not. This will ensure that if the failure store gets disabled we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
     "available": true,
     "enabled": true,
     "data_streams": 10,
     "indices_count": 50,
     "failure_store": {
       "explicitly_enabled_count": 1,
       "effectively_enabled_count": 15,
       "failure_indices_count": 30
       "lifecycle": {
         "explicitly_enabled_count": 5,
         "effectively_enabled_count": 20,
         "data_retention": {
           "configured_data_streams": 5,
           "minimum_millis": X,
           "maximum_millis": Y,
           "average_millis": Z,
          },
          "effective_retention": {
            "retained_data_streams": 20,
            "minimum_millis": X,
            "maximum_millis": Y,
            "average_millis": Z
          },
         "global_retention": {
           "max": {
             "defined": false
           },
           "default": {
             "defined": true,  <------ this is the default value applicable for the failure store
             "millis": X
           }
        }
      }
   }
}
Implementation details

We ensure that partially reset failure store will create valid failure store configuration.
We ensure that when a node communicates with a note with a previous version it will ensure it will not send an invalid failure store configuration enabled: null.

(cherry picked from commit 03d7781)

# Conflicts:
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/DataStreamsPlugin.java
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java
elasticsearchmachine added a commit that referenced this pull request May 3, 2025
…nfiguration (#127314) (#127577)

* [Failure store] Introduce dedicated failure store lifecycle configuration (#127314)

The failure store is a set of data stream indices that are used to store certain type of ingestion failures. Until this moment they were sharing the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that typically the failures will need to be retained much less than the data. Considering this we believe the lifecycle needs of the failures also more limited and they fit better the simplicity of the data stream lifecycle feature.

This allows the user to only set the desired retention and we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that these data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too, this configuration reflects the user's configuration only as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
To retrieve the effective configuration you need to use the GET data streams API, see #126668

Functionality

The data stream lifecycle (DLM) will manage the failure indices regardless if the failure store is enabled or not. This will ensure that if the failure store gets disabled we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
     "available": true,
     "enabled": true,
     "data_streams": 10,
     "indices_count": 50,
     "failure_store": {
       "explicitly_enabled_count": 1,
       "effectively_enabled_count": 15,
       "failure_indices_count": 30
       "lifecycle": {
         "explicitly_enabled_count": 5,
         "effectively_enabled_count": 20,
         "data_retention": {
           "configured_data_streams": 5,
           "minimum_millis": X,
           "maximum_millis": Y,
           "average_millis": Z,
          },
          "effective_retention": {
            "retained_data_streams": 20,
            "minimum_millis": X,
            "maximum_millis": Y,
            "average_millis": Z
          },
         "global_retention": {
           "max": {
             "defined": false
           },
           "default": {
             "defined": true,  <------ this is the default value applicable for the failure store
             "millis": X
           }
        }
      }
   }
}
Implementation details

We ensure that partially reset failure store will create valid failure store configuration.
We ensure that when a node communicates with a note with a previous version it will ensure it will not send an invalid failure store configuration enabled: null.

(cherry picked from commit 03d7781)

# Conflicts:
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/DataStreamsPlugin.java
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Data Management/Data streams Data streams and their lifecycles >non-issue serverless-linked Added by automation, don't add manually Team:Data Management Meta label for data/management team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants