Move to Lucene 9.12's new PostingsFormat. #115021

Open
1 of 5 tasks
jpountz opened this issue Oct 17, 2024 · 3 comments
Labels: blocker, stateful (Marking issues only relevant for stateful releases), :StorageEngine/Codec, Team:StorageEngine

Comments

@jpountz
Contributor

jpountz commented Oct 17, 2024

A while back, Lucene changed the way it encodes doc IDs from PFOR-delta to FOR-delta, which is a bit faster but less space-efficient. To avoid introducing space-efficiency regressions (especially on dense postings lists, which are common in logging datasets), @iverase moved Elasticsearch to a copy of the Lucene postings format that still uses PFOR-delta for compression (#103601).
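
To illustrate the space trade-off described above, here is a minimal sketch of a per-block cost model. It is not Lucene's actual on-disk format: the class name, the one-byte header, the budget of 7 exceptions per block, and the 2 bytes per exception are all assumptions chosen for illustration, and the model ignores any limit on how large an exception may be. FOR packs every delta at the width of the largest one, while PFOR may pick a narrower width and patch a few outliers separately.

```java
import java.util.Arrays;

/** Simplified cost model for one block of doc deltas (illustrative only). */
final class BlockCostSketch {
    static final int BLOCK_SIZE = 128;          // assumed block size
    static final int MAX_EXCEPTIONS = 7;        // assumed exception budget
    static final int BYTES_PER_EXCEPTION = 2;   // assumed: position + patched bits

    /** Number of bits needed to represent v, with a floor of 1. */
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    /** FOR: all values share the width of the largest value in the block. */
    static int forBytes(long[] deltas) {
        int bpv = 0;
        for (long d : deltas) {
            bpv = Math.max(bpv, bitsRequired(d));
        }
        return 1 + (deltas.length * bpv + 7) / 8; // header byte + packed payload
    }

    /** PFOR: try patching the e largest values and keep the cheapest layout. */
    static int pforBytes(long[] deltas) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int best = forBytes(deltas); // zero exceptions is always allowed
        for (int e = 1; e <= Math.min(MAX_EXCEPTIONS, n - 1); e++) {
            // Width required once the e largest values are stored as exceptions.
            int bpv = bitsRequired(sorted[n - 1 - e]);
            int cost = 1 + (n * bpv + 7) / 8 + e * BYTES_PER_EXCEPTION;
            best = Math.min(best, cost);
        }
        return best;
    }

    public static void main(String[] args) {
        // A dense postings list: mostly deltas of 1, with three large gaps.
        long[] deltas = new long[BLOCK_SIZE];
        Arrays.fill(deltas, 1L);
        deltas[10] = 1000;
        deltas[50] = 1000;
        deltas[100] = 1000;
        System.out.println("FOR bytes:  " + forBytes(deltas));
        System.out.println("PFOR bytes: " + pforBytes(deltas));
    }
}
```

In this model, a handful of large gaps in an otherwise dense block forces FOR to widen every value, whereas PFOR pays only for the few patched entries.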

But Lucene 9.12 introduced a new postings format that has better skipping logic (in general). It would be nice to take advantage of it. I would suggest the following plan:

  • Use 'Lucene912PostingsFormat' when storage efficiency isn't critical #119051
  • Create a new postings format that is a copy of Lucene912PostingsFormat but with a more space-efficient encoding of doc deltas. @dnhatn and I played with it earlier this year; there is room for significant improvement by storing exceptions (the P from PFOR stands for "patched") more efficiently and allowing more exceptions per block.
  • Move indexes whose storage efficiency is important to this new postings format instead of ES812PostingsFormat.
  • Disallow using ES812PostingsFormat on new indexes.
  • Move the write logic of ES812PostingsFormat to the test folder.
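
As a rough illustration of the second item in the plan, the sketch below models how a block's size might respond to a larger exception budget or a cheaper exception encoding. Everything here is hypothetical: the class and method names, the one-byte header, and the sample data are assumptions for illustration, not the encoding that was actually prototyped.

```java
import java.util.Arrays;

/** Hypothetical model: block size as a function of the exception budget. */
final class ExceptionBudgetSketch {

    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    /** Smallest modeled size in bytes for one block under the given budget. */
    static int patchedBytes(long[] deltas, int maxExceptions, int bytesPerException) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int best = Integer.MAX_VALUE;
        for (int e = 0; e <= Math.min(maxExceptions, n - 1); e++) {
            // Width of the packed values once the e largest are patched out.
            int bpv = bitsRequired(sorted[n - 1 - e]);
            int cost = 1 + (n * bpv + 7) / 8 + e * bytesPerException;
            best = Math.min(best, cost);
        }
        return best;
    }

    public static void main(String[] args) {
        // 128 deltas, mostly 1, with 12 large gaps: more outliers than a budget of 7 can absorb.
        long[] deltas = new long[128];
        Arrays.fill(deltas, 1L);
        for (int i = 0; i < 12; i++) {
            deltas[i * 10] = 5000;
        }
        for (int budget : new int[] {0, 7, 16}) {
            System.out.println("budget=" + budget + " -> "
                + patchedBytes(deltas, budget, 2) + " bytes");
        }
        System.out.println("budget=16, 1 byte per exception -> "
            + patchedBytes(deltas, 16, 1) + " bytes");
    }
}
```

With this sample data, a budget smaller than the number of outliers gives no saving at all in the model, because the packed width is still dictated by the remaining outliers; only a budget that covers all of them lets the width drop.
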
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@martijnvg
Member

At least part of this issue is now a blocker for the 9.1.0 release. The postings format that ships with the current Lucene version is already better for search use cases than the forked ES812PostingsFormat that Elasticsearch uses.

On top of this, the next Lucene minor versions come with more improvements to the postings format, which could also benefit use cases other than pure search. But to make use of them, we need to start using the postings format that ships with Lucene.

@martijnvg martijnvg added blocker stateful Marking issues only relevant for stateful releases and removed blocker labels Mar 28, 2025
@jdcryans

@martijnvg when you get back, we'll need to discuss what the next step is here, as it's not obvious to @jordan-powers and me.
