aristo: switch to vector memtable #3447

arnetheduck · 2025-07-03T14:52:36Z

Every time we persist, we collect all changes into a batch and write that batch to a memtable, which rocksdb lazily writes to disk using a background thread.

The default implementation of the memtable in rocksdb is a skip list which can handle concurrent writes while still allowing lookups. We're not using concurrent inserts and the skip list comes with significant overhead both when writing and when reading.

Here, we switch to a vector memtable which is faster to write but terrible to read. To compensate, we then proceed to flush the memtable eagerly to disk which is a blocking operation.
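For illustration only, here is a toy model of the trade-off described above. It is not rocksdb's implementation or API: `VectorMemtable` and `persist` are hypothetical names, and the sketch only assumes that a vector memtable appends without ordering, scans linearly on lookup, and sorts once when flushed.

```python
class VectorMemtable:
    """Toy append-only memtable: cheap writes, expensive reads."""

    def __init__(self):
        self._entries = []  # (key, value) pairs in insertion order

    def put(self, key, value):
        # Appending does no ordering work at all.
        self._entries.append((key, value))

    def get(self, key):
        # Lookups scan from newest to oldest, which is why reads are slow.
        for k, v in reversed(self._entries):
            if k == key:
                return v
        return None

    def flush(self):
        # Sort once at flush time; the latest write for each key wins.
        latest = {}
        for k, v in self._entries:
            latest[k] = v
        self._entries.clear()
        return sorted(latest.items())


def persist(memtable, batch):
    """Write the whole batch, then flush eagerly so nothing is read from the vector."""
    for key, value in batch:
        memtable.put(key, value)
    return memtable.flush()


table = VectorMemtable()
run = persist(table, [(b"b", 1), (b"a", 2), (b"b", 3)])
print(run)
```

Because the flush happens right after the batch is written, the memtable is empty again by the time any lookup arrives, so the slow read path is never exercised in this model.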

One would think that blocking the main thread this way would be bad, but it turns out that creating the skip list, also a blocking operation, is even slower, resulting in a net win.

Coupled with this change, we also make the "lower" levels bigger, effectively reducing the average number of levels that must be looked at to find recently written data. This could lead to some write amplification, which is offset by making each file smaller and therefore making compactions more targeted.
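A rough way to see why bigger lower levels mean fewer levels to search is a simplified leveled-compaction model in which each level is a fixed multiple of the one above it. The description does not say which options or sizes were changed, so `levels_needed`, the 300 GiB data size, the base sizes and the multiplier of 10 are all assumptions made up for this sketch.

```python
def levels_needed(total_bytes, base_bytes, multiplier):
    """Count the levels needed to hold total_bytes when level 1 holds
    base_bytes and every further level is `multiplier` times larger."""
    level = 1
    level_size = base_bytes
    capacity = base_bytes
    while capacity < total_bytes:
        level += 1
        level_size *= multiplier
        capacity += level_size
    return level


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical numbers: same data size, two different base level sizes.
small_base = levels_needed(300 * GIB, 256 * MIB, 10)
large_base = levels_needed(300 * GIB, 1 * GIB, 10)
print(small_base, large_base)
```

In this model, quadrupling the base level removes one level from the tree, so a lookup for recently written data has one level fewer to consult.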

Taken together, this results in an overall import speed boost of about 3-4%, but above all, it reduces the main thread blocking time during persist.

pre (for 8k blocks persisted around block 11M):

DBG 2025-07-03 15:58:14.053+02:00 Core DB persisted
kvtDur=8ms182us947ns mptDur=4s640ms879us492ns endDur=10s50ms862us669ns
stateRoot=none()

post:

DBG 2025-07-03 14:48:59.426+02:00 Core DB persisted
kvtDur=12ms476us833ns mptDur=4s273ms629us840ns endDur=3s331ms171us989ns
stateRoot=none()
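To put the two log lines side by side, the duration fields can be converted to nanoseconds and compared. This is a small sketch that only assumes the `NsNmsNusNns` format seen in the logs above; `parse_duration_ns` is a hypothetical helper, not part of the codebase.

```python
import re

_UNIT_NS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
# Two-letter units are listed first so "ms" is not read as a bare "s".
_TOKEN = re.compile(r"(\d+)(ms|us|ns|s)")


def parse_duration_ns(text):
    """Convert a string such as '10s50ms862us669ns' to nanoseconds."""
    return sum(int(amount) * _UNIT_NS[unit] for amount, unit in _TOKEN.findall(text))


pre_end = parse_duration_ns("10s50ms862us669ns")
post_end = parse_duration_ns("3s331ms171us989ns")
pre_mpt = parse_duration_ns("4s640ms879us492ns")
post_mpt = parse_duration_ns("4s273ms629us840ns")

end_reduction = (pre_end - post_end) / pre_end
mpt_reduction = (pre_mpt - post_mpt) / pre_mpt
print(f"endDur: {pre_end / 1e9:.2f}s -> {post_end / 1e9:.2f}s ({end_reduction:.0%} less)")
print(f"mptDur: {pre_mpt / 1e9:.2f}s -> {post_mpt / 1e9:.2f}s ({mpt_reduction:.0%} less)")
```

For this sample, `endDur` drops by roughly two thirds while `mptDur` improves only modestly, which matches the claim that the main gain is in blocking time during persist.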


arnetheduck merged commit 0eea2fa into master Jul 4, 2025
23 checks passed

arnetheduck deleted the vector-memtable branch July 4, 2025 08:39
