Fix reported progress around compaction / defrag #287

Merged: simolus3 merged 1 commit into main from fix-progress-around-compaction on Jun 2, 2025

Conversation

simolus3 (Contributor)

In some cases, it was possible for reported download progress to go out of the normal range of 0..1, e.g.:

  1. Start with an empty db, create 100k rows, let the client sync and stop at around 50k rows.
  2. Disconnect the client, delete 90k rows, re-deploy sync rules.
  3. Reconnect the client.
  4. Observe that it claims to have downloaded 50k rows out of 10k total.

This is because the sync progress heuristic assumes that buckets won't shrink. In this PR, we try to detect a bucket being reduced so much that we'd be over 100% right when the sync starts; in that case, we throw the local progress away and start from 0. There are still some issues with progress around defrags (e.g. since the checksums changed, we download everything again and progress goes from 0 to 1 twice), but this fixes the most egregious mismatch.
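To make the failure mode concrete, here is a minimal sketch of the heuristic and the reset added here, in TypeScript. The names (`BucketProgress`, `normalizedProgress`, `onCheckpoint`) are illustrative only, not the SDK's actual API.

```typescript
// Minimal sketch, with illustrative names that are not the SDK's actual API.
interface BucketProgress {
  atLast: number;      // ops already counted in ps_buckets at the last checkpoint
  sinceLast: number;   // ops seen in data lines since that checkpoint
  targetCount: number; // total ops the newest checkpoint says the bucket contains
}

// Reported progress: downloaded ops over the checkpoint's total. If buckets only
// ever grow, this stays within 0..1.
function normalizedProgress(buckets: BucketProgress[]): number {
  const downloaded = buckets.reduce((sum, b) => sum + b.atLast + b.sinceLast, 0);
  const total = buckets.reduce((sum, b) => sum + b.targetCount, 0);
  return total === 0 ? 1 : downloaded / total;
}

// The fix, roughly: when a new checkpoint arrives and the counters carried over
// from before already exceed the (shrunken) total, discard local progress and
// start counting from zero instead of reporting more than 100%.
function onCheckpoint(buckets: BucketProgress[]): void {
  const downloaded = buckets.reduce((sum, b) => sum + b.atLast + b.sinceLast, 0);
  const total = buckets.reduce((sum, b) => sum + b.targetCount, 0);
  if (downloaded > total) {
    for (const b of buckets) {
      b.atLast = 0;
      b.sinceLast = 0;
    }
  }
}
```

In the repro above, the downloaded counter would sit around 50k while the new total is only 10k, so the counters are discarded and progress restarts at 0.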

simolus3 merged commit 5b9b08c into main on Jun 2, 2025
13 of 15 checks passed
simolus3 deleted the fix-progress-around-compaction branch on Jun 2, 2025 at 08:05
rkistner (Contributor) commented on Jun 2, 2025

@simolus3 Could we also explicitly reset the bucket progress when we get a checksum failure? Or more specifically, whenever we hit the deleteBucket logic, the progress for that bucket should be cleared.

simolus3 (Contributor, Author) commented on Jun 2, 2025

> Or more specifically, whenever we hit the deleteBucket logic, the progress for that bucket should be cleared.

This should already be the case. Our recovery from checksum failures is to restart the connection, which means we get a new checkpoint and call getBucketOperationProgress(), which selects the current counters from ps_buckets. So if the bucket was deleted, we'd restart at zero. The internal progress we keep in memory is always just "state from ps_buckets at the last checkpoint message + progress inferred from data lines since then".
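As a rough illustration of that recovery path (the `Database` interface and the `ps_buckets` column name below are assumptions for the sketch, not the real schema or API):

```typescript
interface Database {
  getAll<T>(sql: string): Promise<T[]>;
}

// On each checkpoint line, in-memory progress is reseeded from whatever counters
// are currently persisted in ps_buckets.
async function getBucketOperationProgress(db: Database): Promise<Map<string, number>> {
  const rows = await db.getAll<{ name: string; count_at_last: number }>(
    'SELECT name, count_at_last FROM ps_buckets'
  );
  // A bucket removed by the deleteBucket logic no longer has a row here, so its
  // progress restarts at zero once the connection is re-established.
  return new Map(rows.map((r) => [r.name, r.count_at_last]));
}
```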

simolus3 mentioned this pull request on Jun 3, 2025