Translog architecture guide Distributed team #126416
Closes ES-7879
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
As a newbie consumer this looks great, very helpful, thank you Iraklis.
Thank you all for the feedback! Feel free to review again.
Dear reviewers, handled all feedback, gentle reminder for review!
LGTM
LGTM
One thing to consider, you can declare the links outside of the text and reference them by name inside the text so that they're not so disruptive when reading the MD in plain-text form.
See example here.
Thanks @JeremyDahlgren @nicktindall !
Unfortunately, that does not work if I'd like to make the linked text code-styled as well (with the backticks). It only works for regular styled text. :(
You might be able to do (if you haven't tried already, I used this elsewhere to have text other than the link ID inline)
Still a bit clunky but better than the full link inline IMO
Thanks @nicktindall that works! Replaced all links as such.
All good, just some suggestions.
A [`MultiSnapshot`] can be used to iterate operations over multiple [`TranslogSnapshot`]s.
A [`TranslogWriter`] can be used to write operations to the translog.

#### Real-time GETs from the translog
I know the Recovery section isn't written yet, but it may be worth at least linking to it here and saying something about how we replay the operations from the translog during recovery by just reading them sequentially.
Good point, added in the introduction:
, so they can be replayed by just reading them sequentially from the translog during recovery in the event of ephemeral failures such as a crash or power loss.
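For readers new to the idea, the sequential replay mentioned above can be pictured with a minimal sketch. This is not Elasticsearch code: `Op` and `replay` are hypothetical names, and the sketch ignores versioning, conflict handling, and the on-disk format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Op:
    """Hypothetical logged operation; source=None models a delete."""
    seqno: int
    doc_id: str
    source: Optional[str]


def replay(ops: Iterable[Op]) -> Dict[str, str]:
    """Rebuild in-memory state by applying ops in the order they were logged."""
    docs: Dict[str, str] = {}
    for op in ops:
        if op.source is None:
            docs.pop(op.doc_id, None)    # delete
        else:
            docs[op.doc_id] = op.source  # index or update
    return docs


logged = [
    Op(0, "a", "v1"),
    Op(1, "b", "v1"),
    Op(2, "a", "v2"),
    Op(3, "b", None),
]
recovered = replay(logged)
```

Reading the log front to back and reapplying each operation is enough here to rebuild the state that was lost with the process.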
Each translog is a sequence of files, each identified by a translog generation ID, each containing a sequence of operations, with the last file open for writes.
The last file has a part which has been fsync'ed to disk, and a part which has been written but not necessarily fsync'ed yet to disk.
Each operation is identified by a sequence number (`seqno`), which is monotonically increased by the engine's ingestion functionality.
A [`Checkpoint`] file is also maintained, that contains, among other information, the current translog generation ID, and its last fsync'ed operation and location, the minimum translog generation ID, and the minimum and maximum sequence number of operations the sequence of translog generations include.
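The layout in this hunk can be modelled roughly as follows. This is a toy sketch only: `ToyTranslog`, `ToyCheckpoint`, and all field names are made up for illustration and do not mirror the real classes, and "fsync" is simulated by refreshing an in-memory checkpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCheckpoint:
    generation: int = 1      # generation currently open for writes
    synced_ops: int = 0      # ops of that generation covered by the last sync
    min_generation: int = 1  # oldest generation still retained
    min_seqno: int = -1      # -1 means "no operations yet"
    max_seqno: int = -1


@dataclass
class ToyTranslog:
    generations: Dict[int, List[int]] = field(default_factory=lambda: {1: []})
    current: int = 1
    checkpoint: ToyCheckpoint = field(default_factory=ToyCheckpoint)

    def add(self, seqno: int) -> None:
        # Written to the open generation, but not yet covered by the checkpoint.
        self.generations[self.current].append(seqno)

    def sync(self) -> None:
        # Simulated fsync: the checkpoint now covers everything written so far.
        seqnos = [s for ops in self.generations.values() for s in ops]
        self.checkpoint = ToyCheckpoint(
            generation=self.current,
            synced_ops=len(self.generations[self.current]),
            min_generation=min(self.generations),
            min_seqno=min(seqnos, default=-1),
            max_seqno=max(seqnos, default=-1),
        )

    def roll_generation(self) -> None:
        # Close the current file and open a new, empty generation.
        self.sync()
        self.current += 1
        self.generations[self.current] = []
        self.sync()
```

The point of the sketch is the gap between `add` and `sync`: operations exist in the open generation before the checkpoint acknowledges them.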
Could you explain here why the separate `Checkpoint` is necessary?
Extended.
Thanks @DaveCTurner ! Feel free to review again.
The last file has a part which has been fsync'ed to disk, and a part which has been written but not necessarily fsync'ed yet to disk.
Each operation is identified by a sequence number (`seqno`), which is monotonically increased by the engine's ingestion functionality.
Typically the entries in the translog are in increasing order of their sequence number, but not necessarily.
A [`Checkpoint`] file is also maintained, which is written on each fsync operation of the translog, and records important metadata and statistics about the translog, such as the current translog generation ID, its last fsync'ed operation and location, the minimum translog generation ID, and the minimum and maximum sequence number of operations the sequence of translog generations include, all of which are useful to identify the translog operations needed to be replayed upon recovery.
It's more than just "useful", this is essential for correctness. The `Checkpoint` records the last fsynced location in the translog file, i.e. it is safe to read up to the location in the checkpoint but beyond that point are only dragons.
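A small sketch of that safe-read boundary, under stated assumptions: the record format (one 8-byte seqno per record), the file name, and the helper names are invented for illustration, and the checkpointed offset is kept in memory rather than persisted to its own file as a real checkpoint would be.

```python
import os
import struct
import tempfile

RECORD = struct.Struct(">q")  # toy format: one 8-byte big-endian seqno per record


def append(f, seqno):
    f.write(RECORD.pack(seqno))


def sync(f):
    """Flush and fsync, then return the offset that is now safe to read up to."""
    f.flush()
    os.fsync(f.fileno())
    return f.tell()


def read_synced(path, synced_offset):
    """Read records only up to the checkpointed offset and ignore the rest."""
    with open(path, "rb") as f:
        data = f.read(synced_offset)
    return [RECORD.unpack_from(data, i)[0] for i in range(0, len(data), RECORD.size)]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "toy-translog.bin")
        with open(path, "wb") as f:
            append(f, 0)
            append(f, 1)
            checkpoint_offset = sync(f)  # the checkpoint covers seqno 0 and 1
            append(f, 2)                 # written, but not covered by the checkpoint
            f.write(b"\x00\x01\x02")     # simulated torn write at the tail
            f.flush()
        return checkpoint_offset, read_synced(path, checkpoint_offset)
```

A reader that trusted the file length instead of the checkpointed offset would hit the unsynced record and the torn bytes after it, which is the "dragons" region.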
Reworded, thanks!
LGTM (could probably iterate on this forever but this is a great start)