Skip to content

triedb/pathdb, eth: use double-buffer mechanism in pathdb #30464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 22, 2025

Conversation

rjl493456442
Copy link
Member

@rjl493456442 rjl493456442 commented Sep 19, 2024

Previously, PathDB used a single buffer to aggregate database writes, which needed to be flushed atomically. However, flushing large amounts of data (e.g., 256MB) caused significant overhead, often blocking the system for around 3 seconds during the flush.

To mitigate this overhead and reduce performance spikes, a double-buffer mechanism is introduced. When the active buffer fills up, it is marked as frozen and a background flushing process is triggered. Meanwhile, a new buffer is allocated for incoming writes, allowing operations to continue uninterrupted.

This approach reduces system blocking times and provides flexibility in adjusting buffer parameters for improved performance.

TODO:

  • release the content in the frozen buffer after flushing

@rjl493456442 rjl493456442 force-pushed the multibuffer branch 2 times, most recently from b48c0c9 to 20b4ffd Compare September 19, 2024 07:08
Copy link
Contributor

@holiman holiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all, this looks promising, I suspect this could help quite a bit

nodes := writeNodes(batch, b.nodes, clean)
rawdb.WritePersistentStateID(batch, id)

// Flush all mutations in a single batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: at this point, mutations were already applied on the clean, i.e, dl.cleans cache. That happened during writeNodes. I've tried to figure out if that is a problem, but come to the conclusion that it's fine, but just wanted to raise it so you can also give it a think.

Regarding "flush all mutations in a single batch" -- is that important only because of crash-safety, or some other more subtle reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this
in disklayer.go, function node(), we lookup a node. Order:

  1. buffer
  2. frozen
  3. cleans
  4. database
    • And if found, write to cleans
	if dl.cleans != nil && len(blob) > 0 {
		dl.cleans.Set(key, blob)
		cleanWriteMeter.Mark(int64(len(blob)))
	}

I'm trying to think of a case where this write-to-cleans conflicts with the write-to-cleans in the background committer writeNodes method.

Copy link
Member Author

@rjl493456442 rjl493456442 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • if it's found in buffer/frozen => return and no interaction with cache
  • if it's found in cache => return
  • if it's found in disk (it implicitly means the item is not in these places above, even the item is marked as deleted, it will still be caught in buffer/frozen/cache), load it from db and add it into the cache

so, no conflict should happen

But i have to say it's a really good point, i haven't thought about it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding "flush all mutations in a single batch" -- is that important only because of crash-safety, or some other more subtle reason?

Only because of crash-safety

holiman
holiman previously approved these changes Sep 23, 2024
Copy link
Contributor

@holiman holiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, would be interesting to see some performance-charts. This PR needs some runtime before merging, IMO

@rjl493456442
Copy link
Member Author

rjl493456442 commented Sep 23, 2024 via email

@joey0612
Copy link

joey0612 commented Oct 10, 2024

Referenced #28471

@holiman holiman changed the title triedb/pathdb, eth: introduce Double-Buffer Mechanism in PathDB triedb/pathdb, eth: use double-buffer mechanism in pathdb Dec 5, 2024
@rjl493456442 rjl493456442 force-pushed the multibuffer branch 2 times, most recently from 2383977 to 28ee3bc Compare December 6, 2024 06:17
@rjl493456442
Copy link
Member Author

截屏2024-12-10 10 23 16 截屏2024-12-10 10 23 32 截屏2024-12-10 10 23 49 截屏2024-12-10 10 25 44

The triedb commit is improved as expected, but the EVM execution is constantly slower, with unknown reasons.

@waynercheung
Copy link

Referenced #28471

I have the same question, is this PR referenced #28471

@rjl493456442
Copy link
Member Author

@joey0612 @waynercheung

No, double-buffering has been part of my design since day one when I first implemented the path database.
I even had a prototype a few years ago. However, it hasn’t been merged yet for several reasons:
(a) the overall performance improvement was not significant; and (b) after flushing the write buffer into the database, subsequent database reads became 2–3× slower, which in turn slowed down the entire chain progression.

However, with the fix to block prefetcher, the state read has been significantly improved. And I want to pile
up this change on top one more time.

cockroachdb/pebble#4109

Copy link
Member

@MariusVanDerWijden MariusVanDerWijden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return fmt.Errorf("buffer layers (%d) cannot be applied on top of persisted state id (%d) to reach requested state id (%d)", b.layers, head, id)
func (b *buffer) flush(root common.Hash, db ethdb.KeyValueStore, freezer ethdb.AncientWriter, progress []byte, nodesCache, statesCache *fastcache.Cache, id uint64, postFlush func()) {
if b.done != nil {
panic("duplicated flush operation")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can never happen, even if the flushing takes a long time to do, right? Because we are always rotating out the buffers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. The buffer is supposed to be flushed for one time.

@rjl493456442 rjl493456442 added this to the 1.15.12 milestone Jun 22, 2025
@rjl493456442 rjl493456442 merged commit 2192020 into ethereum:master Jun 22, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants