
prometheus.remote_write: prevent unbounded growth of inactive series in WAL #3927


Open: wants to merge 4 commits into main from kgeckhart/alloy-wal-leaking-active-series

Conversation


@kgeckhart kgeckhart commented Jul 1, 2025

PR Description

This PR fixes an issue, described in #3926, that can cause active series in the WAL to grow without bound. It does so by refactoring the loadWAL implementation to more closely mirror the Prometheus tsdb implementation it was likely based upon (https://github.com/prometheus/prometheus/blob/e83dc66bdb67a01ff06a96e81faf66b0095d9948/tsdb/head_wal.go#L78). The Notes to the Reviewer section has a breakdown of the changesets and their rationale.

Which issue(s) this PR fixes

Fixes #3926

Notes to the Reviewer

  • The first changeset adds the test that reproduces the issue described in the linked issue
  • 901772e - Ensure WAL replay accounts for the fact that the same labels can have duplicate SeriesRefs
    • These changes are all inspired by how tsdb head decodes series during WAL replay
    • New series are created only if they represent a unique label hash, via the new stripeSeries.GetOrSet
    • Only series that were not created (i.e. duplicates) are tracked in multiRef
    • Duplicate series are tracked in the WAL as deleted with the lastSegment, allowing them to be kept in checkpoints until they are gone from the WAL (see the sketch after this list)
  • abd5e63 - Rename multiRef to duplicateRefToValidRef and use it to rewrite duplicate series as expected
    • Alloy was using multiRef to track all created series, so changing it to only track duplicate series breaks the logic in all sample decode steps, resulting in most samples being considered nonExistentSeriesRefs
    • multiRef is used to rewrite the sample SeriesRef if needed
    • A SeriesRef is considered a nonExistentSeriesRef only if it does not exist in series
    • Since I struggled to work out the intent of multiRef from its short name, I renamed it to duplicateRefToValidRef, which IMO illustrates its purpose much more clearly
  • 1575ac5 - Keep duplicated SeriesRefs until the last segment they are read from
    • When adding duplicate SeriesRefs to deleted we were using the WAL's last segment, but since we are replaying we know exactly which segment we are in
    • If we instead track the last segment in which the duplicate SeriesRef was actually found, we should be able to clear these duplicate refs from checkpoints earlier
    • This is potentially upstreamable to tsdb head as well
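
For orientation, here is a minimal, self-contained sketch of the replay flow described in the list above. The types and names are simplified stand-ins (the PR's stripeSeries.GetOrSet, deleted tracking, and segment handling use the real WAL types and different signatures); it only models the decision of when a ref is a duplicate and what gets recorded for it:

package main

import "fmt"

// seriesRecord is a simplified stand-in for a decoded WAL series record.
type seriesRecord struct {
    ref    uint64
    labels string // stand-in for a label set; the real code hashes labels.Labels
}

func main() {
    // Two refs for the same label set, as can accumulate across WAL replays.
    records := []seriesRecord{
        {ref: 1, labels: `{job="a"}`},
        {ref: 2, labels: `{job="a"}`}, // duplicate of ref 1
        {ref: 3, labels: `{job="b"}`},
    }

    byLabels := map[string]uint64{}               // GetOrSet stand-in: labels -> first (valid) ref
    duplicateRefToValidRef := map[uint64]uint64{} // duplicate ref -> valid ref, used to rewrite samples
    deletedAtSegment := map[uint64]int{}          // duplicate ref -> segment it was last read from
    currentSegment := 7                           // the segment currently being replayed

    for _, rec := range records {
        if validRef, ok := byLabels[rec.labels]; ok {
            // Same labels under a different ref: do not create a new series.
            // Remember the mapping so samples for the duplicate ref can be
            // rewritten, and mark the duplicate as deleted at the segment it
            // was read from so checkpoints can drop it once the WAL moves on.
            duplicateRefToValidRef[rec.ref] = validRef
            deletedAtSegment[rec.ref] = currentSegment
            continue
        }
        // Unique label hash: create (track) the series once.
        byLabels[rec.labels] = rec.ref
    }

    fmt.Println(duplicateRefToValidRef) // map[2:1]
    fmt.Println(deletedAtSegment)       // map[2:7]
}

The real change keys on the label hash rather than a label string, and tracking the segment the duplicate was actually read from (rather than the WAL's last segment) is what lets checkpoints clear stale refs earlier.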

PR Checklist

  • CHANGELOG.md updated (will add when closer to ready to merge to prevent ongoing conflicts)
  • Tests updated

@kgeckhart kgeckhart requested a review from a team as a code owner July 1, 2025 12:19

bboreham commented Jul 1, 2025

Is this relevant to the upstream Prometheus (Agent) WAL too?

@kgeckhart (Contributor, Author) replied:

Is this relevant to the upstream Prometheus (Agent) WAL too?

@bboreham great question. I believe so, as the code is largely the same between the two, but I haven't run the same test upstream yet. If this fix makes sense and works as I expect, I'm happy to upstream it as well.

@kgeckhart kgeckhart changed the title prometheus wal: prevent active series from being unable to be GC'ed prometheus.remote_write: prevent unbounded growth of inactive series in WAL Jul 1, 2025

bboreham commented Jul 2, 2025

I didn’t really follow the code in this PR.

There is an upstream issue which might be related: WAL Checkpoint holds deleted series for 1 extra compaction cycle #12286


kgeckhart commented Jul 2, 2025

I didn’t really follow the code in this PR.

There is an upstream issue which might be related: WAL Checkpoint holds deleted series for 1 extra compaction cycle #12286

Thanks for pointing out the issue @bboreham, it's absolutely related but of a different flavor. I'll adapt the flow you originally described in prometheus/prometheus#12286 (comment) with modifications for this problem.

edit: Visualization was cleaned up and moved to the issue #3926 (comment)

@ptodev ptodev self-assigned this Jul 4, 2025
A reviewer (Contributor) commented:

Is it necessary to do any additional tests, e.g. in a real environment?

The author (kgeckhart, Contributor) replied:

Is there a good place to try to add an integration / acceptance test? Perhaps adding another one similar to

// Test is an integration-level test which ensures that metrics can get sent to
// a prometheus.remote_write component and forwarded to a
// remote_write-compatible server.
func Test(t *testing.T) {
    writeResult := make(chan *prompb.WriteRequest)

    // Create a remote_write server which forwards any received payloads to the
    // writeResult channel.
    srv := newTestServer(t, writeResult)
    defer srv.Close()

    // Create our component and wait for it to start running, so we can write
    // metrics to the WAL.
    args := testArgsForConfig(t, fmt.Sprintf(`
        external_labels = {
            cluster = "local",
        }

        endpoint {
            name           = "test-url"
            url            = "%s/api/v1/write"
            remote_timeout = "100ms"

            queue_config {
                batch_send_deadline = "100ms"
            }
        }
    `, srv.URL))
    tc, err := componenttest.NewControllerFromID(util.TestLogger(t), "prometheus.remote_write")
    require.NoError(t, err)

    go func() {
        err = tc.Run(componenttest.TestContext(t), args)
        require.NoError(t, err)
    }()
    require.NoError(t, tc.WaitRunning(5*time.Second))

    // We need to use a future timestamp since remote_write will ignore any
    // sample which is earlier than the time when it started. Adding a minute
    // ensures that our samples will never get ignored.
    sampleTimestamp := time.Now().Add(time.Minute).UnixMilli()

    // Send metrics to our component. These will be written to the WAL and
    // subsequently written to our HTTP server.
    sendMetric(t, tc, labels.FromStrings("foo", "bar"), sampleTimestamp, 12)
    sendMetric(t, tc, labels.FromStrings("fizz", "buzz"), sampleTimestamp, 34)

    expect := []prompb.TimeSeries{{
        Labels: []prompb.Label{
            {Name: "cluster", Value: "local"},
            {Name: "foo", Value: "bar"},
        },
        Samples: []prompb.Sample{
            {Timestamp: sampleTimestamp, Value: 12},
        },
    }, {
        Labels: []prompb.Label{
            {Name: "cluster", Value: "local"},
            {Name: "fizz", Value: "buzz"},
        },
        Samples: []prompb.Sample{
            {Timestamp: sampleTimestamp, Value: 34},
        },
    }}

    select {
    case <-time.After(time.Minute):
        require.FailNow(t, "timed out waiting for metrics")
    case res := <-writeResult:
        require.Equal(t, expect, res.Timeseries)
    }
}
?

I did run it against a real WAL that had accumulated ~14 million inactive series in the checkpoint, and it pruned all of them, keeping the ~500k I expected to be active.


kgeckhart commented Jul 9, 2025

@bboreham I went down the rabbit hole to figure out the potential impact upstream, for both the prometheus agent and the prometheus tsdb itself (which has very similar WAL replay code).

The prometheus agent is impacted because it makes the same assumption during WAL replay as Alloy: a series whose RefID is not yet known is treated as a series to be created (https://github.com/prometheus/prometheus/blob/e83dc66bdb67a01ff06a96e81faf66b0095d9948/tsdb/agent/db.go#L522-L533).

The tsdb itself is not impacted because it does not trust the series RefID to decide whether a series should be created; it ends up using the label hash.
https://github.com/prometheus/prometheus/blob/e83dc66bdb67a01ff06a96e81faf66b0095d9948/tsdb/head_wal.go#L255-L261 calls https://github.com/prometheus/prometheus/blob/a11772234de54be257fe60e07b313d974058d4a9/tsdb/head.go#L1752-L1753, which decides based on the existence of the hash whether the series already exists or should be created. Edit: further, if it wasn't created, it is added back to be tracked as deleted.
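
As an illustration only (simplified hypothetical types and functions, not the actual Prometheus or Alloy APIs), the difference comes down to what keys the create decision during replay:

package main

import "fmt"

// agentStyleReplay treats any unseen ref as a brand new series, so the same
// labels can be created repeatedly under different refs across replays.
func agentStyleReplay(byRef map[uint64]string, ref uint64, lbls string) {
    if _, ok := byRef[ref]; !ok {
        byRef[ref] = lbls
    }
}

// tsdbStyleReplay keys the decision on the label hash (modeled here by the label
// string): known labels under a new ref are not re-created, and would instead be
// tracked as deleted so checkpoints can drop the stale ref.
func tsdbStyleReplay(byLabels map[string]uint64, ref uint64, lbls string) (created bool) {
    if _, ok := byLabels[lbls]; ok {
        return false
    }
    byLabels[lbls] = ref
    return true
}

func main() {
    byRef := map[uint64]string{}
    agentStyleReplay(byRef, 1, `{job="a"}`)
    agentStyleReplay(byRef, 2, `{job="a"}`) // second copy of the same labels
    fmt.Println(len(byRef))                 // 2: the duplicate is kept as "active"

    byLabels := map[string]uint64{}
    fmt.Println(tsdbStyleReplay(byLabels, 1, `{job="a"}`)) // true
    fmt.Println(tsdbStyleReplay(byLabels, 2, `{job="a"}`)) // false: same labels, not re-created
}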

I'll see what it looks like to mirror the tsdb behavior in Alloy, as I think it should be a simpler fix than what I proposed here.

@kgeckhart kgeckhart marked this pull request as draft July 9, 2025 20:24
@kgeckhart kgeckhart force-pushed the kgeckhart/alloy-wal-leaking-active-series branch from 52e92ad to 901772e on July 9, 2025 20:30
@kgeckhart kgeckhart marked this pull request as ready for review July 10, 2025 12:35