Translog architecture guide Distributed team #126416

Merged · 19 commits · Apr 25, 2025
Commit: Translog architecture guide Distributed team
Closes ES-7879
kingherc committed Apr 7, 2025
commit 7b47e98b1ea42c3d09d56870e940c1fef89d3082
73 changes: 69 additions & 4 deletions docs/internal/DistributedArchitectureGuide.md

### Translog

[Basic write model]:https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html
[Translog]:https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/translog/Translog.java
[History retention]:https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-history-retention.html

It is important to first understand the [Basic write model] of documents:
documents are written to Lucene in-memory buffers, then "refreshed" to searchable segments which may not be persisted on disk, and finally "flushed" to a durable Lucene commit on disk.
This means that newly ingested data may be lost if there is an outage before the next persisted commit on disk.
For this reason, newly ingested data is also written to a shard's [Translog], whose main purpose is to persist uncommitted operations (e.g., document insertions or deletions), so they can be replayed in the event of ephemeral failures such as a crash or power loss.
The translog is always persisted and fsync'ed on disk before acknowledging writes back to the user.
This can be seen in `InternalEngine`, which calls the translog's `add()` method to append operations; e.g., its `index()` method at some point adds a document insertion operation to the translog.
The translog ultimately truncates operations once they have been flushed to disk by a Lucene commit.
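
To make the ordering concrete, here is a minimal, hypothetical sketch (not the real `Translog` API) of the append-then-fsync contract: the operation is appended to a translog file and fsync'ed before the write may be acknowledged.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the real Translog class: operations are appended
// to a file and fsync'ed before the write is acknowledged. Real translog
// operations are binary-encoded and checksummed; one line per operation is
// used here purely for illustration.
class TranslogSketch {
    private final RandomAccessFile file;

    TranslogSketch(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        file.seek(file.length()); // append at the end of the current file
    }

    // Appends an operation and returns its location (offset) in the file.
    long add(String operation) throws IOException {
        long location = file.getFilePointer();
        file.write((operation + "\n").getBytes(StandardCharsets.UTF_8));
        return location;
    }

    // Fsyncs everything written so far; only after this may writes be acked.
    void sync() throws IOException {
        file.getFD().sync();
    }

    public static void main(String[] args) throws IOException {
        TranslogSketch translog = new TranslogSketch("translog-sketch.log");
        long location = translog.add("INDEX doc-1 {\"field\":\"value\"}");
        translog.sync(); // durable before acknowledging the write to the user
        System.out.println("operation durable at offset " + location);
    }
}
```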

Main usages of the translog are:

* During recovery, an index shard can be recovered up to at least the last acknowledged operation by replaying the translog onto the last flushed commit of the shard (see the replay sketch below).
* During a replica recovery, some lost operations may be recovered from the primary's translog if needed, before falling back to a complete recovery of Lucene files from the primary.
* Real-time (m)GETs of documents are facilitated without requiring a refresh.
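
The first usage can be sketched as a simple sequential replay loop. This is a hypothetical, simplified sketch (the real recovery code is more involved): operations are read sequentially from a translog snapshot and re-applied if they are newer than what the last flushed commit already persisted.

```java
// Hypothetical sketch of sequential translog replay during recovery; the
// interfaces below are illustrative stand-ins, not the real classes.
interface OperationSketch {
    long seqNo(); // sequence number assigned by the engine at ingestion time
}

interface SnapshotSketch {
    OperationSketch next(); // next operation in the snapshot, or null when done
}

final class RecoverySketch {
    // Replays onto the engine every operation above the seqno already
    // persisted by the last flushed Lucene commit; returns how many.
    static long replay(SnapshotSketch snapshot, long lastCommittedSeqNo) {
        long replayed = 0;
        OperationSketch op;
        while ((op = snapshot.next()) != null) {
            if (op.seqNo() > lastCommittedSeqNo) {
                // re-apply the operation (indexing, deletion, or no-op)
                replayed++;
            }
        }
        return replayed;
    }
}
```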

#### Translog Truncation

[Flush API]:https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-flush.html

Translog files are automatically truncated when they are no longer needed, specifically after all their operations have been persisted by Lucene commits on disk.
Lucene commits are initiated by flushes (e.g., with the index [Flush API]).

Flushes may also be automatically initiated by Elasticsearch, e.g., if the translog exceeds a configurable size (`INDEX_TRANSLOG_FLUSH_THRESHOLD_SIZE_SETTING`) or age (`INDEX_TRANSLOG_FLUSH_THRESHOLD_AGE_SETTING`), which ultimately truncates the translog as well.
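
A minimal sketch of such a trigger is shown below, with illustrative names and fields; the real decision logic in the engine is more involved.

```java
// Hypothetical sketch of the size/age flush trigger described above.
// Field names are illustrative; the real thresholds come from the index
// settings behind INDEX_TRANSLOG_FLUSH_THRESHOLD_SIZE_SETTING and
// INDEX_TRANSLOG_FLUSH_THRESHOLD_AGE_SETTING.
final class FlushPolicySketch {
    private final long flushThresholdBytes;
    private final long flushThresholdAgeMillis;

    FlushPolicySketch(long flushThresholdBytes, long flushThresholdAgeMillis) {
        this.flushThresholdBytes = flushThresholdBytes;
        this.flushThresholdAgeMillis = flushThresholdAgeMillis;
    }

    // A flush (and hence eventual translog truncation) is due when the
    // uncommitted translog grows too large or too old.
    boolean shouldPeriodicallyFlush(long translogSizeBytes, long translogAgeMillis) {
        return translogSizeBytes >= flushThresholdBytes
                || translogAgeMillis >= flushThresholdAgeMillis;
    }
}
```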

#### Acknowledging writes

A bulk request will ultimately call Engine methods such as `index()` or `delete()` repeatedly, which add operations to the translog.
Finally, the after-write action of the `TransportWriteAction` will call `indexShard.syncAfterWrite()`, which puts the last written translog `Location` of the bulk request into an `AsyncIOProcessor` that is responsible for gradually fsync'ing the translog and notifying any waiters.
Ultimately the bulk request is notified that the translog has been fsync'ed past the requested location, and the write can be acknowledged back to the user.
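
A hypothetical sketch of this pattern follows (simplified names, not the real `AsyncIOProcessor` API): many writers enqueue their last translog location together with a listener, a single thread performs one fsync for the whole batch, and every waiter is then notified. Batching many writers behind one fsync amortizes the most expensive step of the write path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch of batched translog fsyncs with waiter notification;
// names are illustrative stand-ins for the AsyncIOProcessor pattern.
final class SyncProcessorSketch {
    private record Waiter(long location, LongConsumer listener) {}

    private final List<Waiter> queue = new ArrayList<>();

    // Called by each bulk request with the last location it wrote.
    synchronized void put(long location, LongConsumer listener) {
        queue.add(new Waiter(location, listener));
    }

    // Called by whichever thread processes the queue: one fsync covers every
    // queued location, then all waiters are notified (i.e., acknowledged).
    synchronized void drainAndSync(Runnable fsync) {
        if (queue.isEmpty()) {
            return;
        }
        fsync.run(); // e.g., translog.sync(): a single fsync for the batch
        for (Waiter waiter : queue) {
            waiter.listener().accept(waiter.location()); // ack this request
        }
        queue.clear();
    }
}
```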

#### Translog internals

The translog is a sequence of files, each identified by a translog generation ID and containing a sequence of operations, with the last file open for writes.
The last file has a part which has been fsync'ed to disk, and a part which has been written but not necessarily fsync'ed yet to disk.
Each operation is identified by a sequence number (`seqno`), which is monotonically increased by the engine's ingestion functionality.
A `Checkpoint` file is also maintained that contains, among other information, the current translog generation ID, its last fsync'ed operation and location, the minimum translog generation ID, and the minimum and maximum sequence numbers of the operations that the sequence of translog generations includes.
When the translog rolls over, e.g., upon the translog file exceeding a configurable size, a new file in the sequence is created for writes, and the last one becomes read-only.
A new commit flushed to disk will also induce a translog rollover, since the operations in the translog so far become eligible for truncation.
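
As an illustration, the checkpoint contents described above could be modeled roughly as follows; the field names are hypothetical, not the real `Checkpoint` class, which carries additional information.

```java
// Hypothetical model of the information a Checkpoint carries, per the
// description above ("among other information"); names are illustrative.
record CheckpointSketch(
    long generation,            // ID of the current (writable) translog generation
    long offset,                // last fsync'ed location within that file
    int numOps,                 // number of fsync'ed operations in that file
    long minTranslogGeneration, // oldest generation still needed
    long minSeqNo,              // smallest seqno across the retained generations
    long maxSeqNo               // largest seqno across the retained generations
) {}
```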

A few more words on the terminology and classes used in the translog Java package.
A `Location` of an operation is defined by the translog generation file it is contained in, the offset of the operation in that file, and the number of bytes that encode that operation.
An `Operation` can be a document indexed, a document deletion, or a no-op operation.
A `Snapshot` iterator can be created to iterate over a range of requested operation sequence numbers read from the translog files.
A retention lock can be acquired for [History retention] purposes, e.g., for potentially facilitating a replica shard's recovery, which prohibits truncating the translog files.
The `sync()` method is the one that fsync's the current translog generation file to disk, and updates the checkpoint file with the last fsync'ed operation and location.
The `rollGeneration()` method is the one that rolls the translog, creating a new translog generation, e.g., called during an index flush.
The `createEmptyTranslog()` method creates a new translog, e.g., for a new empty index shard.
Each translog file starts with a `TranslogHeader` that is followed by translog operations.
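
For instance, a `Location` as described above can be thought of as the following triple; ordering by generation and then by offset is what makes "fsync'ed past the requested location" well defined. This is a hypothetical model, not the real class.

```java
// Hypothetical model of a translog Location as described above: the
// generation file the operation lives in, its offset in that file, and the
// size of its encoded bytes. Locations order naturally by generation, then
// by offset, which gives meaning to "synced past a location".
record LocationSketch(long generation, long offset, int size)
        implements Comparable<LocationSketch> {
    @Override
    public int compareTo(LocationSketch other) {
        int byGeneration = Long.compare(generation, other.generation);
        return byGeneration != 0 ? byGeneration : Long.compare(offset, other.offset);
    }
}
```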

Some internal classes used for reading from and writing to the translog are the following.
A `TranslogReader` can be used to read operation bytes from a translog file.
A `TranslogSnapshot` can be used to iterate operations from a translog reader.
A `MultiSnapshot` can be used to iterate operations over multiple `TranslogSnapshot`s.
A `TranslogWriter` can be used to write operations to the translog.

#### Real-time GETs from the translog

> **Review comment (Contributor):** I know the Recovery section isn't written yet, but may be worth at least linking to it here and saying something about how we replay the operations from the translog during recovery by just reading them sequentially.
>
> **Reply (kingherc, author):** Good point, added in the introduction: ", so they can be replayed by just reading them sequentially from the translog during recovery in the event of ephemeral failures such as a crash or power loss."

[Get API]:https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html

The [Get API] (and by extension, the multi-get API) supports a real-time mode, which can query documents by ID, including recently ingested documents that have not yet been refreshed and are thus not yet searchable.
This capability is facilitated by another data structure, the `LiveVersionMap`, which maps recently ingested documents by their ID to the translog location that encodes their indexing operation.
That way, we can return the document by reading the translog operation.

The tracking in the version map is not enabled by default.
The first real-time GET induces a refresh of the index shard and a search to get the document, but it also enables tracking in the version map for newly ingested documents.
Thus, subsequent real-time GETs are serviced by first going through the version map to query the translog and, if the document is not found there, by searching the already refreshed data, without requiring another refresh of the index shard.

On a refresh, the code safely swaps the old map with a new empty map.
That is because after a refresh, any documents in the old map are now searchable in Lucene, and thus we do not need them in the version map anymore.
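
Putting the last few paragraphs together, here is a hypothetical sketch of the real-time GET path (simplified names, not the real `LiveVersionMap`): look the ID up in the version map and read the document from its translog location, otherwise fall back to search; on refresh, swap in a fresh empty map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of real-time GETs backed by a version map; the
// interfaces are illustrative stand-ins for translog reads and searches.
final class RealTimeGetSketch {
    interface TranslogLookup { byte[] read(long location); }
    interface Searcher { byte[] searchById(String id); }

    // Maps recently ingested document IDs to their translog locations.
    private volatile Map<String, Long> versionMap = new ConcurrentHashMap<>();

    void trackIndexedDocument(String id, long translogLocation) {
        versionMap.put(id, translogLocation);
    }

    byte[] get(String id, TranslogLookup translog, Searcher searcher) {
        Long location = versionMap.get(id);
        if (location != null) {
            return translog.read(location); // ingested but not yet refreshed
        }
        return searcher.searchById(id);     // already refreshed and searchable
    }

    // On refresh, swap in a fresh empty map: everything in the old map has
    // become searchable in Lucene, so translog lookups are no longer needed.
    void afterRefresh() {
        versionMap = new ConcurrentHashMap<>();
    }
}
```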

### Index Version
