
[CORE-12747] storage: Assorted minor log improvements #26677

Open: oleiman wants to merge 6 commits into dev from s/noticket/minor-log-improvements

Conversation


@oleiman oleiman commented Jul 3, 2025

What it says on the tin. Setting the table for #26676

  • Add the ability to supply a lock deadline to log_reader
  • Collect min/max timestamps in log::offset_range_size
  • Some improvements to locking behavior in offset_range_size & get_file_offset
  • Introduce log::eligible_for_compacted_reupload

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none


vbotbuildovich commented Jul 3, 2025

CI test results

test results on build#68385

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e616-4763-9eef-e8a63205cdc9 | FLAKY | 19/21 | upstream reliability is '96.16519174041298'. current run reliability is '90.47619047619048'. drift is 5.689 and the allowed drift is set to 50. The test should PASS
ClusterConfigTest | test_valid_settings | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e614-4e50-937b-9b8a5a66817b | FLAKY | 16/21 | upstream reliability is '84.0'. current run reliability is '76.19047619047619'. drift is 7.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": false, "mixed_versions": false, "with_iceberg": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68385#0197cefd-e619-4ddd-a946-0b668d1dfef7 | FLAKY | 20/21 | upstream reliability is '98.77551020408163'. current run reliability is '95.23809523809523'. drift is 3.53741 and the allowed drift is set to 50. The test should PASS

test results on build#68505

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
MasterTestSuite | test_replica_pair_frequency | | unit | https://buildkite.com/redpanda/redpanda/builds/68505#0197e277-7384-45e5-b233-a60703f78bef | FAIL | 0/1 |

test results on build#68536

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
ArchivalTest | test_single_partition_leadership_transfer | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-4090-400c-bd22-f6f1713bd323 | FLAKY | 20/21 | upstream reliability is '99.72375690607734'. current run reliability is '95.23809523809523'. drift is 4.48566 and the allowed drift is set to 50. The test should PASS
ClusterConfigTest | test_valid_settings | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-408e-4933-93c0-eac24cb5cb08 | FLAKY | 18/21 | upstream reliability is '83.78033205619413'. current run reliability is '85.71428571428571'. drift is -1.93395 and the allowed drift is set to 50. The test should PASS
ConsumerGroupBalancingTest | test_coordinator_nodes_balance | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68536#0197e6b5-3d13-4a09-8c65-95f05f9f6e76 | FLAKY | 19/21 | upstream reliability is '99.44444444444444'. current run reliability is '90.47619047619048'. drift is 8.96825 and the allowed drift is set to 50. The test should PASS

test results on build#68611

test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason
CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68611#0197ec7c-6d7b-4f14-a911-6923c515b579 | FLAKY | 20/21 | upstream reliability is '96.41791044776119'. current run reliability is '95.23809523809523'. drift is 1.17982 and the allowed drift is set to 50. The test should PASS
DeleteRecordsTest | test_delete_records_concurrent_truncations | {"cloud_storage_enabled": false, "truncate_point": "start_offset"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68611#0197ec7c-0cb7-4ba4-a44a-a8b16a625e77 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS

// If set to true, the resulting batch reader will completely ignore the
// contents of the batch cache, reading the specified offset range directly
// from disk.
bool force_ignore_batch_cache{false};
Contributor

I still find this controversial. Per my mental model, only a bug could lead to this, and we are introducing a hack to circumvent the bug...

If we want to say that we are comfortable with the bug/feature, can we document it with a test?

Not only is this a weird bug leaking out of the storage layer, we are also accepting a performance penalty by default, seemingly for no good reason.

Member Author

In my mind, the explicit use case here is to have direct access to a segment on disk, so it doesn't feel especially like a hack. Does the batch cache generally provide that guarantee?

As I understood it, we should be invalidating the batch cache after or during adjacent merge compaction, but is that a correctness bug? The contents of the cache would still be "correct" with respect to the abstract log, no? Just out of sync with batches dropped by compaction.

Member Author

In any case, I can try to add a test for this.

Contributor

I somewhat agree with Nicolae here; it's cool that this exists, but I'd rather investigate fixing what we think is the race between adjacent merge compaction and uploads before accepting that it exists and settling for a workaround instead.

Member Author

I hear what you both are saying, but I can't quite get there. The idea here isn't exactly to work around the bug but to decouple TS upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.

Of course the commit now contains a test for this edge-case behavior in adjacent merge compaction, but my read is that it doesn't affect correctness. When/if that compaction behavior changes, we can remove the test, but I would still want the new config.

Contributor

@WillemKauf is the following scenario possible?

3 segments, key k appears once in all of them and is a tombstone in the 3rd. Adjacent compaction happens between the first and second; key k is removed from the first but remains in the batch cache. Adjacent compaction happens between the 2nd and 3rd, and only the tombstone is left.

Now, can the tombstone be removed without the first segment being rewritten/the batch cache reset? That's the only scenario I can come up with that would violate Kafka semantics. Otherwise, @oleiman's statement is true.

Member Author

Ok, so now that Willem's fix is in flight, I still think it's safer (for TS upload purposes) to reach around the batch cache straight to the segments on disk. While it is a pessimization in principle, in practice it's exactly what we do now, and it avoids tying TS uploads to similar bugs that might emerge in the future, however unlikely that is. Any thoughts?

Contributor

> is the following scenario possible?

No, that situation would thankfully not have been possible even pre-fix, because the tombstone removal would have to go through self compaction, which would drop the batch cache.

> I still think it's safer (for ts upload purposes) to reach around the batch cache straight to segments on disk

I think that if this bug was the only motivating case and it is a pessimization in general, we should not reach around the batch cache, but that's just my opinion.

Member Author

> if this bug was the only motivating case

That's not how I'm thinking about it. Probably easier to chat briefly f2f; I'll try to grab you both in the morning.

Contributor

> that's not how I'm thinking about it. probably easier to chat briefly f2f. i'll try to grab you both in the morning.

Sure, happy to chat. I think I understand the idea behind

> but to decouple ts upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.

but I wonder if it's really necessary or just a pessimization.

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch 2 times, most recently from a5fc43c to af46134 on July 7, 2025 00:57
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from af46134 to ea079e2 on July 7, 2025 01:18
@oleiman oleiman changed the title from "storage: Assorted minor log improvements" to "[CORE-12747] storage: Assorted minor log improvements" on Jul 7, 2025
Lazin previously approved these changes Jul 7, 2025
if (it == _segs.end()) {
    return std::nullopt;
}
auto compaction_complete = [](const segment& seg) {
Contributor

What purposes will this function be used for? What does a compacted_offset mean?

finished_windowed_compaction() only means a segment has been deduplicated with some subset of keys in the log.

has_self_compact_timestamp() only means a segment has been deduplicated with some subset of keys in the log (those being the keys within the segment itself).

I think we need to document exactly what guarantees this function provides w/r/t compaction and offsets, because these are all currently very loose definitions of "compacted" (the strongest guarantee we can provide is has_clean_compact_timestamp(), which means a key at a given offset is fully de-duplicated with the prefix of the log before it).

Member Author

Interesting, thanks for the context. I don't think I was fully aware of the distinction. The intended usage here is to say whether an offset range satisfies the criteria for compacted reupload; currently this is checked segment-by-segment in the archival layer:

bool eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        return s.finished_windowed_compaction();
    }
    return s.has_self_compact_timestamp();
}

It's an interesting point though. AIUI, remote segments are reuploaded at most once, so maybe the current condition is suboptimal and we could use has_clean_compact_timestamp instead. @Lazin, any opinion?

Contributor

> so maybe the current condition is suboptimal

Yes, it is, somewhat, but at the end of the day all TS compaction is suboptimal until we perform full compaction in TS.

We can make the condition better by checking for has_clean_compact_timestamp() (only if log_compaction_use_sliding_window), otherwise has_self_compact_timestamp().
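A minimal sketch of that adjusted per-segment check, assuming the segment accessors named in this thread exist with these spellings (the actual signatures in the tree may differ):

// Sketch only: per-segment eligibility under the condition proposed above.
bool segment_eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        // Prefer the stronger guarantee: keys in the segment are fully
        // deduplicated against the prefix of the log before it.
        return s.has_clean_compact_timestamp();
    }
    // Without sliding window compaction, self compaction is the best
    // available signal.
    return s.has_self_compact_timestamp();
}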

Member Author

sgtm.

Contributor
@Lazin Jul 9, 2025

There are two objectives here:

  1. we want to upload data which is compacted well
  2. we don't want the local, really well compacted data to be deleted before we upload it.

If the data is not reuploaded, we will be stuck with a non-compacted offset range in the cloud. I guess if we know that local retention will not remove any data, we can afford to wait. If both compact and delete cleanup policies are used, maybe it makes sense to reupload greedily.

Member Author

Ah, my mental model was that local data wouldn't be truncated until we had advanced the collectible offset (fed by last_offset in the manifest or similar), but I guess local retention policy can supersede this, or space management can intervene?

Maybe it makes sense to predicate this on ntp_config then, e.g.:

  • if compact-only, wait for a clean timestamp
  • if compact+delete, wait for one round of windowed compaction
  • if deletion-only or none, treat the range as non-compacted (i.e. do not reupload)

Contributor

The compacted reupload advances last_compacted_offset in the manifest, and a normal upload advances last_offset. Max collectible offset depends only on the latter. We're not preventing local retention if the compacted reupload hasn't happened yet.

> if compact-only, wait for clean timestamp
> if compact+delete, wait for one round of windowed compaction
> if deletion-only or none, then treat range as non-compacted (i.e. do not reupload)

That makes sense. In the deletion-only case we will not even attempt compacted reupload, so there is nothing we need to do.

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from ea079e2 to 7879b0f on July 7, 2025 20:12
@oleiman oleiman requested review from WillemKauf and Lazin July 7, 2025 20:12
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 7879b0f to 0579496 on July 8, 2025 23:04

oleiman commented Jul 8, 2025

force push: remove defunct compaction_e2e_test


nvartolomei commented Jul 9, 2025 via email

@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 0579496 to 172e5e2 on July 13, 2025 03:43

oleiman commented Jul 13, 2025

force push:

  • drop force_ignore_batch_cache commit (discussed offline)
  • rename log::compaction_complete to log::eligible_for_compacted_reupload
    • and adjust logic dependent on cleanup policy (see comment)
    • and add a unit test for this

oleiman added 6 commits July 12, 2025 20:50
Set through log_reader_config. Currently we take a range lock over the segments
for the configured offset range, with no timeout.

When using log_readers in the TS upload loop, we're much more likely to race
against things like prefix truncation or segment rolls, so it's better to have
the option to configure a timeout.

Signed-off-by: Oren Leiman <[email protected]>
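
As a rough illustration of the shape this option might take, here is a standalone sketch of an optional lock deadline on the reader configuration (the field name and plumbing are hypothetical, not the actual log_reader_config member):

#include <chrono>
#include <optional>

// Sketch only: an optional deadline for acquiring the segment range lock.
// A disengaged optional means "wait indefinitely", preserving the existing
// behavior for readers that don't care. 'lock_deadline' is an assumed name.
struct reader_lock_options {
    std::optional<std::chrono::steady_clock::time_point> lock_deadline;
};

// A TS upload caller might bound the wait to a few seconds:
//   opts.lock_deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);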
And improve some logs

Signed-off-by: Oren Leiman <[email protected]>
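
The PR description also mentions collecting min/max timestamps in log::offset_range_size; a standalone illustration of that accumulation, with redpanda's timestamp and batch header types stood in by plain integers so the sketch compiles on its own:

#include <algorithm>
#include <cstdint>
#include <optional>

// Sketch only: fold each batch header's first/max timestamp into a running
// min/max as the offset range is walked. Names are simplified stand-ins for
// the storage layer's own types.
struct timestamp_range {
    std::optional<int64_t> min;
    std::optional<int64_t> max;

    void observe(int64_t batch_first_ts, int64_t batch_max_ts) {
        min = min ? std::min(*min, batch_first_ts) : batch_first_ts;
        max = max ? std::max(*max, batch_max_ts) : batch_max_ts;
    }
};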
If it fails, throw a timeout error so we skip across this upload attempt
and retry.

The main client of this API is offset_range_size, which takes an offset range
and aims to determine its size on disk. To this end, it first examines the
log to find the range of segments which contains the offset range boundaries
in its first and last element, respectively. Once the segment range is
determined, it is critical that the target segments do not change on disk for
the duration of the computation. To this end, offset_range_size first acquires
read locks for all the segments in its collected range.

Between the time we acquire this range of locks and the time we make a
subsequent call to get_file_offset as part of the range_size computation, it
is possible that a write_lock request for some segment has queued itself
behind the outstanding read lock. If we then try to acquire a read lock for
this segment in the file offset computation, we have a cyclic dependency
that leads to deadlock.

We must implement one of two courses of action here:
1. add a timeout to the lock acquisition in get_file_offset
2. check for this condition explicitly w/ a non-blocking lock call

It is unlikely that we will ever approach the read lock limit for any such segment, so in the common case any amount of waiting (as in (1)) is wasted time. Instead, we can immediately fail the request with a 'synthetic' timeout error, allow whatever write activity is pending to proceed as normal, and leave it to the client of offset_range_size to try again.

Signed-off-by: Oren Leiman <[email protected]>
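
To illustrate option (2) in isolation, a standalone sketch using std::shared_mutex in place of the segment's read/write lock (the names and the error type are illustrative, not the tree's API):

#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <stdexcept>

// Sketch only: attempt the read lock without waiting. If it isn't available
// (e.g. a writer is queued behind the range lock already held by
// offset_range_size), surface a 'synthetic' timeout so the caller skips this
// upload attempt and retries later, instead of entering the cyclic wait
// described above.
struct synthetic_timeout : std::runtime_error {
    synthetic_timeout() : std::runtime_error("segment read lock unavailable") {}
};

std::size_t get_file_offset_sketch(std::shared_mutex& segment_lock) {
    if (!segment_lock.try_lock_shared()) {
        throw synthetic_timeout{};
    }
    std::shared_lock guard(segment_lock, std::adopt_lock);
    // ... compute the file offset while the segment cannot change on disk ...
    return 0;
}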
For acquiring read locks.

Signed-off-by: Oren Leiman <[email protected]>
Determine whether an offset range is eligible for compacted reupload by
the archival system.

The result depends on whether sliding window compaction is enabled and
on the configured cleanup policy.

Returns 'false' unless all segments covering the offset range are marked
compacted.

When sliding window compaction is enabled, returns true iff:
  - delete policy: all segments are marked as having finished windowed
    compaction
  - no-delete policy: all segments have a clean compact timestamp
Otherwise, returns true iff all segments have a self compact timestamp.

Signed-off-by: Oren Leiman <[email protected]>
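
Restated as a sketch, with the segment predicates spelled as in the review discussion; the actual log::eligible_for_compacted_reupload signature, the "marked compacted" check (is_compacted_segment() here is an assumption), and how the covering segments and cleanup policy are obtained will differ:

#include <vector>

// Sketch only: the decision described above, as a fold over the segments
// covering the offset range.
bool eligible_for_compacted_reupload_sketch(
  const std::vector<const storage::segment*>& covering_segments,
  bool sliding_window_enabled,
  bool cleanup_policy_includes_delete) {
    for (const auto* s : covering_segments) {
        if (!s->is_compacted_segment()) {
            // Every segment in the range must be marked compacted.
            return false;
        }
        if (sliding_window_enabled) {
            // With delete in the cleanup policy, one round of windowed
            // compaction is enough; otherwise wait for a clean compact
            // timestamp.
            const bool ok = cleanup_policy_includes_delete
                              ? s->finished_windowed_compaction()
                              : s->has_clean_compact_timestamp();
            if (!ok) {
                return false;
            }
        } else if (!s->has_self_compact_timestamp()) {
            return false;
        }
    }
    return true;
}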
@oleiman oleiman force-pushed the s/noticket/minor-log-improvements branch from 172e5e2 to 91d224f on July 13, 2025 04:03

oleiman commented Jul 13, 2025

force push: resolve merge conflict
