[CORE-12747] storage: Assorted minor log improvements #26677
Conversation
CI test results:
- test results on build#68385
- test results on build#68505
- test results on build#68536
- test results on build#68611
src/v/storage/types.h
Outdated
// If set to true, the resulting batch reader will completely ignore the
// contents of the batch cache, reading the specified offset range directly
// from disk.
bool force_ignore_batch_cache{false};
I still find this controversial. Per my mental model, only a bug could lead to this, and we are introducing a hack to circumvent the bug...
If we want to say that we are comfortable with the bug/feature, can we document it with a test?
Not only is this a weird bug leaking out of the storage layer, we are also accepting a performance penalty by default, seemingly for no good reason.
In my mind the explicit use case here is to have direct access to a segment on disk, so it doesn't feel especially like a hack. Does the batch cache generally provide that guarantee?
As I understood it, we should be invalidating the batch cache after or during adjacent merge compaction, but is that a correctness bug? The contents of the cache would still be "correct" with respect to the abstract log, no? Just out of sync w/ batches dropped by compaction.
In any case, I can try to add a test for this.
I somewhat agree with Nicolae here. It's cool that this exists, but I'd rather investigate fixing what we think is the race between adjacent merge compaction and uploads before accepting that it exists and doing a workaround fix instead.
I hear what you both are saying, but I can't quite get there. The idea here isn't exactly to work around the bug but to decouple ts upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.
Of course the commit now contains a test for this edge-case behavior in adjacent merge compaction, but my read is that it doesn't affect correctness. When/if that compaction behavior changes, we can remove the test, but I would still want the new config.
@WillemKauf is the following scenario possible?
3 segments; key k appears once in all of them and is a tombstone in the 3rd. Adjacent compaction happens between the first and second; key k is removed from the first but remains in the batch cache. Adjacent compaction happens between the 2nd and 3rd and only the tombstone is left.
Now, can the tombstone be removed without the first segment being rewritten / the batch cache being reset? That's the only scenario I can come up with that would violate Kafka semantics. Otherwise, @oleiman's statement is true.
Ok, so now that Willem's fix is in flight, I still think it's safer (for ts upload purposes) to reach around the batch cache straight to segments on disk. While it is a pessimization in principle, in practice it's exactly what we do now and avoids tying ts uploads to similar bugs that might emerge in future, however unlikely that is. Any thoughts?
is the following scenario possible?
No, thankfully that situation would not have been possible even pre-fix, due to the fact that the tombstone removal would have to go through self-compaction, which would drop the batch cache.
I still think it's safer (for ts upload purposes) to reach around the batch cache straight to segments on disk
I think that if this bug was the only motivating case and it is a pessimization in general, we should not reach around the batch cache, but that's just my opinion.
if this bug was the only motivating case
that's not how I'm thinking about it. probably easier to chat briefly f2f. i'll try to grab you both in the morning.
that's not how I'm thinking about it. probably easier to chat briefly f2f. i'll try to grab you both in the morning.
Sure, happy to chat. I think I understand the idea behind
but to decouple ts upload from the batch cache entirely, effectively providing segment_reader-style semantics for log_reader.
but I wonder if it's really necessary or just a pessimization.
src/v/storage/disk_log_impl.cc
Outdated
if (it == _segs.end()) {
    return std::nullopt;
}
auto compaction_complete = [](const segment& seg) {
What purposes will this function be used for? What does a compacted_offset mean?
finished_windowed_compaction() only means a segment has been deduplicated with some subset of keys in the log.
has_self_compact_timestamp() only means a segment has been deduplicated with some subset of keys in the log (those being the keys within the segment itself).
I think we need to document exactly what guarantees this function provides w/r/t compaction and offsets, because these are all currently very loose definitions of "compacted" (the strongest guarantee we can provide is has_clean_compact_timestamp(), which means a key at a given offset is fully de-duplicated with the prefix of the log before it).
Interesting, thanks for the context. I don't think I was fully aware of the distinction. The intended usage here is to say whether an offset range satisfies the criteria for compacted reupload. Currently this is checked segment-by-segment in the archival layer:
redpanda/src/v/cluster/archival/segment_reupload.cc
Lines 43 to 48 in 215e649
bool eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        return s.finished_windowed_compaction();
    }
    return s.has_self_compact_timestamp();
}
It's an interesting point though - aiui, remote segments are reuploaded at most once, so maybe the current condition is suboptimal and we could use has_clean_compact_timestamp instead. @Lazin any opinion?
so maybe the current condition is suboptimal
Yes, it is, somewhat, but at the end of the day all TS compaction is suboptimal until we perform full compaction in TS.
We can make the condition better by checking for has_clean_compact_timestamp() (only if log_compaction_use_sliding_window is enabled), otherwise has_self_compact_timestamp().
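For illustration, a minimal sketch of the change being suggested here, assuming the archival helper quoted above keeps its current shape (this is not the code from the PR itself):

// Sketch of the suggested tightening (not this PR's code): prefer the
// stronger "clean" guarantee when sliding window compaction is enabled,
// and fall back to the self-compaction timestamp otherwise.
bool eligible_for_compacted_reupload(const storage::segment& s) {
    if (config::shard_local_cfg().log_compaction_use_sliding_window) {
        return s.has_clean_compact_timestamp();
    }
    return s.has_self_compact_timestamp();
}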
sgtm.
There are two objectives here:
- we want to upload data which is compacted well
- we don't want the local and really well compacted data to be deleted before we upload it
If the data is not reuploaded, we will be stuck with a non-compacted offset range in the cloud. I guess if we know that local retention will not remove any data, we can afford to wait. If both compact and delete cleanup policies are used, maybe it makes sense to reupload greedily.
Ah, my mental model was that local data wouldn't be truncated until we had advanced the collectible offset (fed by last_offset in the manifest or similar), but I guess local retention policy can supersede this or space management can intervene?
Maybe it makes sense to predicate this on ntp_config then, e.g.:
- if compact-only, wait for clean timestamp
- if compact+delete, wait for one round of windowed compaction
- if deletion-only or none, then treat range as non-compacted (i.e. do not reupload)
The compacted reupload advances last_compacted_offset in the manifest, and normal upload advances last_offset. Max collectible offset depends only on the latter. We're not preventing local retention if compacted reupload didn't happen yet.
if compact-only, wait for clean timestamp
if compact+delete, wait for one round of windowed compaction
if deletion-only or none, then treat range as non-compacted (i.e. do not reupload)
That makes sense. Only, in the deletion-only case we will not even attempt compacted reupload, so there is nothing we need to do.
force push: remove defunct compaction_e2e_test
From my point of view the batch cache is an implementation detail of the log reader. None of the callers should be aware of it. Skipping batch cache touching (existing functionality) is merely a performance hint.
The newly introduced option leaks an implementation detail for no good reason (I probably said that already). It also increases the API surface, which comes with its own downsides, and we should resist doing that whenever we can. Not only that, it also likely causes a performance regression for no good reason.
Ideally, if there are still semantic differences in running with the batch cache vs. not, and we require both modes, then we should make that semantic difference part of the interface rather than of the implementation detail.
Conceptually it just doesn't make sense long term to have that option be part of the public interface. I can easily imagine someone removing it in the future because it doesn't make sense to them either. There can also be other changes in parts of log reading that again could break semantic expectations regardless of the batch cache. Do they introduce one more option for an implementation detail then?
Set through log_reader_config. Currently we take a range lock over the segments for the configured offset range, with no timeout. When using log_readers in the TS upload loop, we're much more likely to race against things like prefix truncation or segment rolls, so it's better to have the option to configure a timeout. Signed-off-by: Oren Leiman <[email protected]>
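For illustration, a rough sketch of how a TS upload call site might set this (the constructor arguments, the make_reader() call, and the 5-second value are placeholders and assumptions for the example, not code from this PR):

// Hypothetical call site: bound how long the reader may wait for the
// segment range read locks, and (per the review discussion) bypass the
// batch cache so the range is read directly from disk.
// start_offset, max_offset, io_priority, and log are placeholders.
using namespace std::chrono_literals;
storage::log_reader_config cfg(start_offset, max_offset, io_priority);
cfg.read_lock_deadline = ss::semaphore::clock::now() + 5s;
cfg.force_ignore_batch_cache = true;
auto reader = co_await log->make_reader(cfg);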
And improve some logs Signed-off-by: Oren Leiman <[email protected]>
Signed-off-by: Oren Leiman <[email protected]>
If it fails, throw a timeout error so we skip across this upload attempt and retry.

The main client of this API is offset_range_size, which takes an offset range and aims to determine its size on disk. To this end, it first examines the log to find the range of segments which contains the offset range boundaries in its first and last element, respectively. Once the segment range is determined, it is critical that the target segments do not change on disk for the duration of the computation. To this end, offset_range_size first acquires read locks for all the segments in its collected range.

Between the time we acquire this range of locks and the time we make a subsequent call to get_file_offset as part of the range_size computation, it is possible that a write_lock request for some segment has queued itself behind the outstanding read lock. If we then try to acquire a read lock for this segment in the file offset computation, we have a cyclic dependency that leads to deadlock.

We must implement one of two courses of action here:
1. add a timeout to the lock acquisition in get_file_offset
2. check for this condition explicitly w/ a non-blocking lock call

It is unlikely that we ever approach the read lock limit for any such segment, so the common case should be that any amount of waiting (as in (1)) is wasted time. Instead, we can immediately fail the request w/ a 'synthetic' timeout error, allow whatever write activity to take place as normal, and leave it to the client of offset_range_size to try again.

Signed-off-by: Oren Leiman <[email protected]>
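A minimal sketch of option (1) in Seastar terms; the function and parameter names below are illustrative stand-ins, not the actual get_file_offset signature:

#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

namespace ss = seastar;

// Illustrative only: acquire the segment read lock with a deadline so a
// write lock queued behind our outstanding range lock can't deadlock the
// offset_range_size computation. ss::get_units() throws
// ss::semaphore_timed_out if the deadline passes, which the caller treats
// as "skip this upload attempt and retry later".
ss::future<size_t> file_offset_with_deadline(
  ss::semaphore& segment_read_gate, // stand-in for the segment's read lock
  ss::semaphore::clock::time_point deadline) {
    auto units = co_await ss::get_units(segment_read_gate, 1, deadline);
    // ... compute the file offset while 'units' keeps the lock held ...
    co_return 0;
}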
For acquiring read locks. Signed-off-by: Oren Leiman <[email protected]>
Determine whether an offset range is eligible for compacted reupload by the archival system. The result depends on whether sliding window compaction is enabled and on the configured cleanup policy.

Returns 'false' unless all segments covering the offset range are marked compacted.

When sliding window compaction is enabled, returns true iff:
- delete policy: all segments are marked as having finished windowed compaction
- no-delete policy: all segments have a clean compact timestamp

Otherwise, returns true iff all segments have a self compact timestamp.

Signed-off-by: Oren Leiman <[email protected]>
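A sketch of the shape of that check, assuming the log can hand us the set of segments covering the offset range (the function name, parameters, and collection type here are assumptions, not the exact code in this PR):

// Illustrative sketch of the range-level predicate described above: every
// segment covering the offset range must satisfy the policy-dependent
// compaction condition, otherwise the range is not eligible.
bool range_eligible_for_compacted_reupload(
  const std::vector<ss::lw_shared_ptr<storage::segment>>& segs,
  bool sliding_window_enabled,
  bool has_delete_policy) {
    for (const auto& seg : segs) {
        bool ok = false;
        if (sliding_window_enabled) {
            // delete policy: one round of windowed compaction is enough;
            // compact-only: wait for the stronger clean compact timestamp.
            ok = has_delete_policy ? seg->finished_windowed_compaction()
                                   : seg->has_clean_compact_timestamp();
        } else {
            ok = seg->has_self_compact_timestamp();
        }
        if (!ok) {
            return false;
        }
    }
    return true;
}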
force push: merge conflict
What it says on the tin. Setting the table for #26676
- log::offset_range_size
- offset_range_size & get_file_offset
- log::eligible_for_compacted_reupload
Backports Required
Release Notes