
[CORE-12747] storage: Assorted minor log improvements #26677

Open: oleiman wants to merge 6 commits into dev from s/noticket/minor-log-improvements

Conversation


@oleiman oleiman commented Jul 3, 2025

What it says on the tin. Setting the table for #26676

  • Add the ability to supply a lock deadline to log_reader
  • Collect min/max timestamps in log::offset_range_size
  • Some improvements to locking behavior in offset_range_size & get_file_offset
  • Introduce log::eligible_for_compacted_reupload

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none


vbotbuildovich commented Jul 3, 2025

CI test results

test results on build#68385

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e616-4763-9eef-e8a63205cdc9 | FLAKY | 19/21 | upstream reliability is '96.16519174041298'. current run reliability is '90.47619047619048'. drift is 5.689 and the allowed drift is set to 50. The test should PASS
ClusterConfigTest | test_valid_settings | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e614-4e50-937b-9b8a5a66817b | FLAKY | 16/21 | upstream reliability is '84.0'. current run reliability is '76.19047619047619'. drift is 7.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": false, "mixed_versions": false, "with_iceberg": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e619-4ddd-a946-0b668d1dfef7 | FLAKY | 20/21 | upstream reliability is '98.77551020408163'. current run reliability is '95.23809523809523'. drift is 3.53741 and the allowed drift is set to 50. The test should PASS

test results on build#68505

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
MasterTestSuite | test_replica_pair_frequency | | unit | https://buildkite.com/redpanda/redpanda/builds/68505#0197e277-7384-45e5-b233-a60703f78bef | FAIL | 0/1 |

test results on build#68536

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
ArchivalTest | test_single_partition_leadership_transfer | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-4090-400c-bd22-f6f1713bd323 | FLAKY | 20/21 | upstream reliability is '99.72375690607734'. current run reliability is '95.23809523809523'. drift is 4.48566 and the allowed drift is set to 50. The test should PASS
ClusterConfigTest | test_valid_settings | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-408e-4933-93c0-eac24cb5cb08 | FLAKY | 18/21 | upstream reliability is '83.78033205619413'. current run reliability is '85.71428571428571'. drift is -1.93395 and the allowed drift is set to 50. The test should PASS
ConsumerGroupBalancingTest | test_coordinator_nodes_balance | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-3d13-4a09-8c65-95f05f9f6e76 | FLAKY | 19/21 | upstream reliability is '99.44444444444444'. current run reliability is '90.47619047619048'. drift is 8.96825 and the allowed drift is set to 50. The test should PASS

test results on build#68611

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68611#0197ec7c-6d7b-4f14-a911-6923c515b579 | FLAKY | 20/21 | upstream reliability is '96.41791044776119'. current run reliability is '95.23809523809523'. drift is 1.17982 and the allowed drift is set to 50. The test should PASS
DeleteRecordsTest | test_delete_records_concurrent_truncations | {"cloud_storage_enabled": false, "truncate_point": "start_offset"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68611#0197ec7c-0cb7-4ba4-a44a-a8b16a625e77 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS

// If set to true, the resulting batch reader will completely ignore the
// contents of the batch cache, reading the specified offset range directly
// from disk.
bool force_ignore_batch_cache{false};
Contributor

I still find this controversial. Per my mental model, only a bug could lead to this, and we are introducing a hack to circumvent the bug...

If we want to say that we are comfortable with the bug/feature, can we document it with a test?

Not only is this a weird bug leaking out of the storage layer, we are also accepting a performance penalty by default, seemingly for no good reason.

Member Author

In my mind, the explicit use case here is to have direct access to a segment on disk, so it doesn't feel especially like a hack. Does the batch cache generally provide that guarantee?

As I understood it, we should be invalidating the batch cache after or during adjacent merge compaction, but is that a correctness bug? The contents of the cache would still be "correct" with respect to the abstract log, no? Just out of sync with batches dropped by compaction.

Member Author

In any case, I can try to add a test for this.

Contributor

I somewhat agree with Nicolae here; it's cool that this exists, but I'd rather investigate fixing what we think is the race between adjacent merge compaction and uploads before accepting that it exists and settling for a workaround instead.

Member Author

I hear what you both are saying, but I can't quite get there. The idea here isn't exactly to work around the bug but to decouple TS upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.

Of course the commit now contains a test for this edge-case behavior in adjacent merge compaction, but my read is that it doesn't affect correctness. When/if that compaction behavior changes, we can remove the test, but I would still want the new config.

Contributor

@WillemKauf is the following scenario possible?

3 segments, key k appears once in all of them and is a tombstone in the 3rd. Adjacent compaction happens between the first and second; key k is removed from the first but remains in the batch cache. Adjacent compaction happens between the 2nd and 3rd, and only the tombstone is left.

Now, can the tombstone be removed without the first segment being rewritten/the batch cache reset? That's the only scenario I can come up with that would violate Kafka semantics. Otherwise, @oleiman's statement is true.

Member Author

Ok, so now that Willem's fix is in flight, I still think it's safer (for TS upload purposes) to reach around the batch cache straight to the segments on disk. While it is a pessimization in principle, in practice it's exactly what we do now, and it avoids tying TS uploads to similar bugs that might emerge in the future, however unlikely that is. Any thoughts?

Contributor

> is the following scenario possible?

No, that situation would thankfully not have been possible even pre-fix, because the tombstone removal would have to go through self compaction, which would drop the batch cache.

> I still think it's safer (for ts upload purposes) to reach around the batch cache straight to segments on disk

I think that if this bug was the only motivating case and it is a pessimization in general, we should not reach around the batch cache, but that's just my opinion.

Member Author

> if this bug was the only motivating case

That's not how I'm thinking about it. Probably easier to chat briefly f2f; I'll try to grab you both in the morning.

Contributor

> that's not how I'm thinking about it. probably easier to chat briefly f2f. i'll try to grab you both in the morning.

Sure, happy to chat. I think I understand the idea behind

> but to decouple ts upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.

but I wonder if it's really necessary or just a pessimization.

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch 2 times, most recently from a5fc43c to af46134 on July 7, 2025 00:57
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from af46134 to ea079e2 on July 7, 2025 01:18
@oleiman oleiman changed the title from "storage: Assorted minor log improvements" to "[CORE-12747] storage: Assorted minor log improvements" on Jul 7, 2025
Lazin previously approved these changes Jul 7, 2025
if (it == _segs.end()) {
    return std::nullopt;
}
auto compaction_complete = [](const segment& seg) {
Contributor

What purposes will this function be used for? What does a compacted_offset mean?

finished_windowed_compaction() only means a segment has been deduplicated with some subset of keys in the log.

has_self_compact_timestamp() only means a segment has been deduplicated with some subset of keys in the log (those being the keys within the segment itself).

I think we need to document exactly what guarantees this function provides w/r/t compaction and offsets, because these are all currently very loose definitions of "compacted" (the strongest guarantee we can provide is has_clean_compact_timestamp(), which means a key at a given offset is fully de-duplicated with the prefix of the log before it).

Member Author

Interesting, thanks for the context. I don't think I was fully aware of the distinction. The intended usage here is to say whether an offset range satisfies the criteria for compacted reupload; currently this is checked segment-by-segment in the archival layer:

bool eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        return s.finished_windowed_compaction();
    }
    return s.has_self_compact_timestamp();
}

It's an interesting point though. AIUI, remote segments are reuploaded at most once, so maybe the current condition is suboptimal and we could use has_clean_compact_timestamp instead. @Lazin, any opinion?

Contributor

> so maybe the current condition is suboptimal

Yes, it is, somewhat, but at the end of the day all TS compaction is suboptimal until we perform full compaction in TS.

We can make the condition better by checking for has_clean_compact_timestamp() (only if log_compaction_use_sliding_window), otherwise has_self_compact_timestamp().
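A minimal sketch of that adjusted per-segment check, assuming the segment accessors named in this thread exist with these spellings (the actual signatures in the tree may differ):

// Sketch only: per-segment eligibility under the condition proposed above.
bool segment_eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        // Prefer the stronger guarantee: keys in the segment are fully
        // deduplicated against the prefix of the log before it.
        return s.has_clean_compact_timestamp();
    }
    // Without sliding window compaction, self compaction is the best
    // available signal.
    return s.has_self_compact_timestamp();
}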

Member Author

sgtm.

Contributor
@Lazin Jul 9, 2025

There are two objectives here:

  1. we want to upload data which is compacted well
  2. we don't want the local, really well compacted data to be deleted before we upload it.

If the data is not reuploaded, we will be stuck with a non-compacted offset range in the cloud. I guess if we know that local retention will not remove any data, we can afford to wait. If both compact and delete cleanup policies are used, maybe it makes sense to reupload greedily.

Member Author

Ah, my mental model was that local data wouldn't be truncated until we had advanced the collectible offset (fed by last_offset in the manifest or similar), but I guess local retention policy can supersede this, or space management can intervene?

Maybe it makes sense to predicate this on ntp_config then, e.g.:

  • if compact-only, wait for a clean timestamp
  • if compact+delete, wait for one round of windowed compaction
  • if deletion-only or none, treat the range as non-compacted (i.e. do not reupload)

Contributor

The compacted reupload advances last_compacted_offset in the manifest, and a normal upload advances last_offset. Max collectible offset depends only on the latter. We're not preventing local retention if the compacted reupload hasn't happened yet.

> if compact-only, wait for clean timestamp
> if compact+delete, wait for one round of windowed compaction
> if deletion-only or none, then treat range as non-compacted (i.e. do not reupload)

That makes sense. In the deletion-only case we will not even attempt compacted reupload, so there is nothing we need to do.

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from ea079e2 to 7879b0f on July 7, 2025 20:12
@oleiman oleiman requested review from WillemKauf and Lazin July 7, 2025 20:12
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 7879b0f to 0579496 on July 8, 2025 23:04

oleiman commented Jul 8, 2025

force push: remove defunct compaction_e2e_test


nvartolomei commented Jul 9, 2025 via email

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 0579496 to 172e5e2 on July 13, 2025 03:43

oleiman commented Jul 13, 2025

force push:

  • drop force_ignore_batch_cache commit (discussed offline)
  • rename log::compaction_complete to log::eligible_for_compacted_reupload
    • and adjust logic dependent on cleanup policy (see comment)
    • and add a unit test for this

oleiman added 6 commits July 12, 2025 20:50
Set through log_reader_config. Currently we take a range lock over the segments
for the configured offset range, with no timeout.

When using log_readers in the TS upload loop, we're much more likely to race
against things like prefix truncation or segment rolls, so it's better to have
the option to configure a timeout.

Signed-off-by: Oren Leiman <[email protected]>
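
As a rough illustration of the shape this option might take, here is a standalone sketch of an optional lock deadline on the reader configuration (the field name and plumbing are hypothetical, not the actual log_reader_config member):

#include <chrono>
#include <optional>

// Sketch only: an optional deadline for acquiring the segment range lock.
// A disengaged optional means "wait indefinitely", preserving the existing
// behavior for readers that don't care. 'lock_deadline' is an assumed name.
struct reader_lock_options {
    std::optional<std::chrono::steady_clock::time_point> lock_deadline;
};

// A TS upload caller might bound the wait to a few seconds:
//   opts.lock_deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);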
And improve some logs

Signed-off-by: Oren Leiman <[email protected]>
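
The PR description also mentions collecting min/max timestamps in log::offset_range_size; a standalone illustration of that accumulation, with redpanda's timestamp and batch header types stood in by plain integers so the sketch compiles on its own:

#include <algorithm>
#include <cstdint>
#include <optional>

// Sketch only: fold each batch header's first/max timestamp into a running
// min/max as the offset range is walked. Names are simplified stand-ins for
// the storage layer's own types.
struct timestamp_range {
    std::optional<int64_t> min;
    std::optional<int64_t> max;

    void observe(int64_t batch_first_ts, int64_t batch_max_ts) {
        min = min ? std::min(*min, batch_first_ts) : batch_first_ts;
        max = max ? std::max(*max, batch_max_ts) : batch_max_ts;
    }
};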
If it fails, throw a timeout error so we skip across this upload attempt
and retry.

The main client of this API is offset_range_size, which takes an offset range
and aims to determine its size on disk. To this end, it first examines the
log to find the range of segments which contains the offset range boundaries
in its first and last element, respectively. Once the segment range is
determined, it is critical that the target segments do not change on disk for
the duration of the computation. To this end, offset_range_size first acquires
read locks for all the segments in its collected range.

Between the time we acquire this range of locks and the time we make a
subsequent call to get_file_offset as part of the range_size computation, it
is possible that a write_lock request for some segment has queued itself
behind the outstanding read lock. If we then try to acquire a read lock for
this segment in the file offset computation, we have a cyclic dependency
that leads to deadlock.

We must implement one of two courses of action here:
1. add a timeout to the lock acquisition in get_file_offset
2. check for this condition explicitly w/ a non-blocking lock call

It is unlikely that we will ever approach the read lock limit for any such segment, so in the common case any amount of waiting (as in (1)) is wasted time. Instead, we can immediately fail the request with a 'synthetic' timeout error, allow whatever write activity is pending to proceed as normal, and leave it to the client of offset_range_size to try again.

Signed-off-by: Oren Leiman <[email protected]>
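
To illustrate option (2) in isolation, a standalone sketch using std::shared_mutex in place of the segment's read/write lock (the names and the error type are illustrative, not the tree's API):

#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <stdexcept>

// Sketch only: attempt the read lock without waiting. If it isn't available
// (e.g. a writer is queued behind the range lock already held by
// offset_range_size), surface a 'synthetic' timeout so the caller skips this
// upload attempt and retries later, instead of entering the cyclic wait
// described above.
struct synthetic_timeout : std::runtime_error {
    synthetic_timeout() : std::runtime_error("segment read lock unavailable") {}
};

std::size_t get_file_offset_sketch(std::shared_mutex& segment_lock) {
    if (!segment_lock.try_lock_shared()) {
        throw synthetic_timeout{};
    }
    std::shared_lock guard(segment_lock, std::adopt_lock);
    // ... compute the file offset while the segment cannot change on disk ...
    return 0;
}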
For acquiring read locks.

Signed-off-by: Oren Leiman <[email protected]>
Determine whether an offset range is eligible for compacted reupload by
the archival system.

The result depends on whether sliding window compaction is enabled and
on the configured cleanup policy.

Returns 'false' unless all segments covering the offset range are marked
compacted.

When sliding window compaction is enabled, returns true iff:
  - delete policy: all segments are marked as having finished windowed
    compaction
  - no-delete policy: all segments have a clean compact timestamp
Otherwise, returns true iff all segments have a self compact timestamp.

Signed-off-by: Oren Leiman <[email protected]>
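
Restated as a sketch, with the segment predicates spelled as in the review discussion; the actual log::eligible_for_compacted_reupload signature, the "marked compacted" check (is_compacted_segment() here is an assumption), and how the covering segments and cleanup policy are obtained will differ:

#include <vector>

// Sketch only: the decision described above, as a fold over the segments
// covering the offset range.
bool eligible_for_compacted_reupload_sketch(
  const std::vector<const storage::segment*>& covering_segments,
  bool sliding_window_enabled,
  bool cleanup_policy_includes_delete) {
    for (const auto* s : covering_segments) {
        if (!s->is_compacted_segment()) {
            // Every segment in the range must be marked compacted.
            return false;
        }
        if (sliding_window_enabled) {
            // With delete in the cleanup policy, one round of windowed
            // compaction is enough; otherwise wait for a clean compact
            // timestamp.
            const bool ok = cleanup_policy_includes_delete
                              ? s->finished_windowed_compaction()
                              : s->has_clean_compact_timestamp();
            if (!ok) {
                return false;
            }
        } else if (!s->has_self_compact_timestamp()) {
            return false;
        }
    }
    return true;
}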
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 172e5e2 to 91d224f on July 13, 2025 04:03

oleiman commented Jul 13, 2025

force push: resolve merge conflict
