Translog architecture guide Distributed team #126416

Merged: 19 commits, Apr 25, 2025

Contributor

@kingherc kingherc commented Apr 7, 2025

Closes ES-7879

@kingherc kingherc added the >docs, >non-issue, :Distributed Indexing/Engine, Team:Distributed Indexing, and v9.1.0 labels Apr 7, 2025
@kingherc kingherc self-assigned this Apr 7, 2025
@kingherc kingherc marked this pull request as ready for review April 7, 2025 15:40
@elasticsearchmachine elasticsearchmachine added the Team:Docs label Apr 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Contributor

@JeremyDahlgren JeremyDahlgren left a comment

As a newbie consumer this looks great, very helpful, thank you Iraklis.

Contributor Author

@kingherc kingherc left a comment

Thank you all for the feedback! Feel free to review again.

@kingherc
Contributor Author

Dear reviewers, handled all feedback, gentle reminder for review!

Contributor

@JeremyDahlgren JeremyDahlgren left a comment

LGTM

Contributor

@nicktindall nicktindall left a comment

LGTM

One thing to consider: you can declare the links outside of the text and reference them by name inside the text, so that they're not so disruptive when reading the Markdown in plain-text form.

See example here.

@kingherc
Contributor Author

Thanks @JeremyDahlgren @nicktindall !

> One thing to consider: you can declare the links outside of the text and reference them by name inside the text, so that they're not so disruptive when reading the Markdown in plain-text form.

Unfortunately, that does not work if I also want the linked text in code style (with backticks). It only works for regularly styled text. :(

@nicktindall
Contributor

nicktindall commented Apr 22, 2025

> Thanks @JeremyDahlgren @nicktindall !
>
> > One thing to consider: you can declare the links outside of the text and reference them by name inside the text, so that they're not so disruptive when reading the Markdown in plain-text form.
>
> Unfortunately, that does not work if I also want the linked text in code style (with backticks). It only works for regularly styled text. :(

You might be able to do the following (if you haven't tried it already; I used this elsewhere to show text other than the link ID inline):

[`ClassName`][LinkIDForClassName]

Still a bit clunky but better than the full link inline IMO

Contributor Author

@kingherc kingherc left a comment

> Still a bit clunky but better than the full link inline IMO

Thanks @nicktindall, that works! I replaced all links accordingly.

@kingherc kingherc requested a review from DaveCTurner April 25, 2025 09:47
Contributor

@DaveCTurner DaveCTurner left a comment

All good, just some suggestions.

A [`MultiSnapshot`] can be used to iterate operations over multiple [`TranslogSnapshot`]s.
A [`TranslogWriter`] can be used to write operations to the translog.

#### Real-time GETs from the translog

Contributor

I know the Recovery section isn't written yet, but it may be worth at least linking to it here and saying something about how we replay the operations from the translog during recovery by just reading them sequentially.

Contributor Author

Good point, added in the introduction:

, so they can be replayed by just reading them sequentially from the translog during recovery in the event of ephemeral failures such as a crash or power loss.
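
To make "replayed by just reading them sequentially" concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch's implementation: the `Operation` tuple, the dict-of-lists translog, and the `replay` function are all hypothetical names invented for this illustration, and the per-document seqno check is an assumed way of coping with entries that are not in strictly increasing seqno order.

```python
from typing import Dict, List, NamedTuple, Optional


class Operation(NamedTuple):
    """Hypothetical translog entry: an index or delete keyed by document ID."""
    seqno: int
    doc_id: str
    source: Optional[str]  # None models a delete


def replay(generations: Dict[int, List[Operation]]) -> Dict[str, str]:
    """Rebuild document state by reading every generation in order.

    `generations` maps a generation ID to the operations stored in that
    file. Files are visited by ascending generation ID and each file is
    read front to back; no index or random access is needed.
    """
    docs: Dict[str, str] = {}
    applied: Dict[str, int] = {}  # highest seqno applied per document
    for generation_id in sorted(generations):
        for op in generations[generation_id]:
            # Entries are usually, but not necessarily, in seqno order,
            # so skip an operation older than one already applied.
            if applied.get(op.doc_id, -1) >= op.seqno:
                continue
            applied[op.doc_id] = op.seqno
            if op.source is None:
                docs.pop(op.doc_id, None)
            else:
                docs[op.doc_id] = op.source
    return docs


# Toy translog with two generations; seqno 2 arrives after seqno 3.
translog = {
    1: [Operation(0, "a", "v1"), Operation(1, "b", "v1")],
    2: [Operation(3, "a", "v3"), Operation(2, "a", "v2"), Operation(4, "b", None)],
}
recovered = replay(translog)
print(recovered)
```

The point of the sketch is only that recovery needs a forward scan over the files, nothing more elaborate.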

Each translog is a sequence of files, each identified by a translog generation ID, each containing a sequence of operations, with the last file open for writes.
The last file has a part which has been fsync'ed to disk, and a part which has been written but not necessarily fsync'ed yet to disk.
Each operation is identified by a sequence number (`seqno`), which is monotonically increased by the engine's ingestion functionality.
A [`Checkpoint`] file is also maintained, that contains, among other information, the current translog generation ID, and its last fsync'ed operation and location, the minimum translog generation ID, and the minimum and maximum sequence number of operations the sequence of translog generations include.

Contributor

Could you explain here why the separate Checkpoint is necessary?

Contributor Author

Extended.

Contributor Author

@kingherc kingherc left a comment

Thanks @DaveCTurner ! Feel free to review again.

@kingherc kingherc requested a review from DaveCTurner April 25, 2025 12:31
The last file has a part which has been fsync'ed to disk, and a part which has been written but not necessarily fsync'ed yet to disk.
Each operation is identified by a sequence number (`seqno`), which is monotonically increased by the engine's ingestion functionality.
Typically the entries in the translog are in increasing order of their sequence number, but not necessarily.
A [`Checkpoint`] file is also maintained, which is written on each fsync operation of the translog, and records important metadata and statistics about the translog, such as the current translog generation ID, its last fsync'ed operation and location, the minimum translog generation ID, and the minimum and maximum sequence number of operations the sequence of translog generations include, all of which are useful to identify the translog operations needed to be replayed upon recovery.

Contributor

It's more than just "useful": this is essential for correctness. The Checkpoint records the last fsynced location in the translog file, i.e. it is safe to read up to the location in the checkpoint, but beyond that point there are only dragons.
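
As a rough illustration of why reading must stop at the checkpointed location, the following Python sketch models a translog file as length-prefixed records in a byte string. Everything here is an assumption made for the example: the `Checkpoint` tuple (reduced to two fields), the 4-byte length framing, and the `encode` and `read_safe_operations` helpers are hypothetical and do not reflect Elasticsearch's actual file format or API.

```python
import struct
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    """Hypothetical, heavily reduced checkpoint: only the fsync'ed offset."""
    generation: int
    offset: int  # number of bytes known to be durable in the translog file


def encode(payload: bytes) -> bytes:
    """Frame a payload as a 4-byte big-endian length followed by the bytes."""
    return struct.pack(">I", len(payload)) + payload


def read_safe_operations(data: bytes, checkpoint: Checkpoint) -> List[bytes]:
    """Return the operations stored below the checkpointed offset.

    Bytes at or beyond `checkpoint.offset` are never inspected: they may be
    a partially written record, so their content cannot be trusted.
    """
    operations: List[bytes] = []
    position = 0
    while position < checkpoint.offset:
        (length,) = struct.unpack_from(">I", data, position)
        start = position + 4
        operations.append(data[start:start + length])
        position = start + length
    return operations


# Two operations are written and fsync'ed, then the checkpoint is updated.
durable = encode(b"index doc-1") + encode(b"index doc-2")
checkpoint = Checkpoint(generation=1, offset=len(durable))

# A third write is cut short by a crash: the length prefix promises 11 bytes
# but only 3 reached the file, and the checkpoint was never advanced.
torn = struct.pack(">I", 11) + b"ind"
file_contents = durable + torn

safe = read_safe_operations(file_contents, checkpoint)
print(safe)
```

Without the checkpoint, a reader would have no reliable way to tell the torn third record from a valid one, which is the correctness concern raised above.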

Contributor Author

Reworded, thanks!

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM (could probably iterate on this forever but this is a great start)

@kingherc kingherc merged commit fd7d973 into elastic:main Apr 25, 2025
16 checks passed
@kingherc kingherc deleted the non-issue/ES-7879-translog-guide branch April 25, 2025 13:50
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Apr 28, 2025
Labels
>docs, >non-issue, :Distributed Indexing/Engine, Team:Distributed Indexing, Team:Docs, v9.1.0
6 participants