Releases: apache/lucene

9.12.2

20 Jun 16:19

Bug fixes

  • Reduce NeighborArray on-heap memory during HNSW graph building
  • Fix IndexSortSortedNumericDocValuesRangeQuery for int sort
  • ValueSource.fromDoubleValuesSource(dvs).getSortField() would throw errors when used if the DoubleValuesSource needed scores
  • Disable connectedComponents logic in HNSW graph building.

10.2.2

20 Jun 20:51

Bug fixes

  • Reduce NeighborArray on-heap memory during HNSW graph building
  • Fix IndexSortSortedNumericDocValuesRangeQuery for int sort
  • ValueSource.fromDoubleValuesSource(dvs).getSortField() would throw errors when used if the DoubleValuesSource needed scores

10.2.1

01 May 13:21
1b2451b

This patch release contains bug fixes that are highlighted below.

  • Fix DISIDocIdStream::count so that it does not try to count beyond max.
  • Correct TermOrdValComparator competitive iterator so that it forces sparse field iteration to be at least scoring window baseline when doing intoBitSet.
  • Provide better impacts for fields indexed with IndexOptions.DOCS
  • Fixed lead cost computations for bulk scorers of conjunctive queries that mix MUST and FILTER clauses, and disjunctive queries that configure a minimum number of matching SHOULD clauses.

10.2.0

10 Apr 10:17

Lucene 10.2 includes major search-time performance improvements for a wide variety of queries. This is most notably due to:

  • Improved storage format of doc IDs in BKD trees for faster decoding.
  • More vectorization when processing PointRangeQuerys and non-scoring BooleanQuerys.
  • Encoding of dense blocks of postings lists as bit sets instead of FOR-delta. This change also saves a bit of storage.
  • Merging matches of dense conjunctive clauses using bitwise ANDs. This especially helps on postings blocks that are encoded as bit sets.
  • Implementing the ACORN-1 algorithm for pre-filtered vector searches.

Searches that don't require scores and match many docs should generally see good speedups, depending on how expensive the Collector is. Compared with Lucene 10.1.0, Lucene's nightly benchmarks report the following speedups when counting the number of hits of the following queries:

  • Disjunctions of term queries: 77% to 4x faster
  • Conjunctions of term queries: 38% to 5x faster
  • Filtered disjunctions of term queries: 2.5x to 4x faster
  • Filtered PointRangeQuery: 3.5x faster

And the following speedup when computing top-100 hits:

  • Pre-filtered vector search: 3.5x faster
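
The idea of merging dense conjunctive clauses with bitwise ANDs can be pictured with a small standalone sketch. It is not Lucene's internal code: the class name, method names, and the two-word block size are assumptions chosen for illustration. Two blocks that cover the same doc ID range are encoded as bit sets, and their intersection costs one AND per 64 doc IDs.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and block layout are hypothetical, not Lucene's. */
final class DenseBlockIntersection {

  /** Encodes doc IDs in [base, base + 64 * words) as a bit set, one bit per doc ID. */
  static long[] toBitSet(int[] docIds, int base, int words) {
    long[] bits = new long[words];
    for (int doc : docIds) {
      int offset = doc - base;
      bits[offset >>> 6] |= 1L << (offset & 63);
    }
    return bits;
  }

  /** Intersects two blocks covering the same doc ID range using one AND per word. */
  static List<Integer> intersect(long[] a, long[] b, int base) {
    List<Integer> matches = new ArrayList<>();
    for (int i = 0; i < a.length; i++) {
      long word = a[i] & b[i];
      while (word != 0) {
        int bit = Long.numberOfTrailingZeros(word);
        matches.add(base + (i << 6) + bit);
        word &= word - 1; // clear the lowest set bit
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    long[] a = toBitSet(new int[] {1000, 1001, 1005, 1070, 1100}, 1000, 2);
    long[] b = toBitSet(new int[] {1001, 1005, 1006, 1100, 1127}, 1000, 2);
    System.out.println(intersect(a, b, 1000)); // prints [1001, 1005, 1100]
  }
}
```

A delta-encoded block would need to be decoded and merged doc by doc, whereas the bit-set form lets 64 candidate docs be tested in a single instruction, which is why the two changes complement each other.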

Changes in Runtime Behavior

  • TieredMergePolicy's default floor segment size was increased from 2MB to 16MB. This is expected to result in slightly slower indexing and about 10 fewer segments per index for applications that flush frequently. This should in turn help speed up queries that have a high per-segment overhead such as multi-term queries, point queries and vector search.

New Features

  • Added TopDocs#rrf to combine multiple TopDocs instances using reciprocal rank fusion.
  • Added SeededKnnVectorQuery, an optimization to KnnVectorQuery that allows selecting better entry points for vector search using a seed Query.
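
For readers unfamiliar with reciprocal rank fusion, the following is a minimal standalone sketch of the technique, not the TopDocs#rrf implementation. It assumes the commonly used formula in which each document receives 1 / (k + rank) from every ranking it appears in, with 1-based ranks. The class name, the fuse method, and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of reciprocal rank fusion; not Lucene's TopDocs#rrf. */
final class ReciprocalRankFusionSketch {

  /** Fuses rankings of doc IDs (each ordered best-first) and returns the top N doc IDs. */
  static List<Integer> fuse(int topN, int k, List<List<Integer>> rankings) {
    Map<Integer, Double> scores = new HashMap<>();
    for (List<Integer> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        int rank = i + 1; // ranks are 1-based
        scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    List<Integer> docs = new ArrayList<>(scores.keySet());
    // Highest fused score first; ties are broken by smaller doc ID for determinism.
    docs.sort((x, y) -> {
      int cmp = Double.compare(scores.get(y), scores.get(x));
      return cmp != 0 ? cmp : Integer.compare(x, y);
    });
    return new ArrayList<>(docs.subList(0, Math.min(topN, docs.size())));
  }

  public static void main(String[] args) {
    List<Integer> lexical = List.of(3, 1, 7); // e.g. hits from a term query
    List<Integer> vector = List.of(1, 9, 3);  // e.g. hits from a vector query
    System.out.println(fuse(3, 60, List.of(lexical, vector))); // prints [1, 3, 9]
  }
}
```

Because only ranks are used, the fused order does not depend on the score scales of the individual result lists, which is what makes the approach convenient for combining lexical and vector hits.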

Improvements

  • RegexpQuery support for unicode case-insensitive characters and ranges.

Optimizations

  • Java 24 vector API support
  • Efficiency improvements to Automaton and RegExp
  • Faster merging of HNSW graphs, which translated into a 25% indexing speedup in Lucene's nightly benchmarks.
  • Conjunctive queries can now skip applying clauses when they have long runs of matching docs, a case which is not uncommon when an index sort is configured.
  • Reduce heap usage during BKD tree merges.

10.1.0

20 Dec 20:32

New Features

  • Add IndexInput::isLoaded to determine if the contents of an input are resident in physical memory.
  • FeatureField now supports storing term vectors.

Improvements

  • TieredMergePolicy now allows merging up to maxMergeAtOnce segments for merges below the floor segment size, even if maxMergeAtOnce is greater than segmentsPerTier. This makes it more efficient to configure TieredMergePolicy to merge segments aggressively by configuring a high value of floorSegmentSize (e.g. 64MB), a low value of segmentsPerTier (e.g. 4) and a high value of maxMergeAtOnce (e.g. 32).

Optimizations

  • Many speedups to top-k query evaluation, in particular: top-level disjunctions, filtered disjunctions, conjunctions, DisjunctionMaxQuery.
  • Speedup to exhaustive evaluation of conjunctive queries by vectorizing the intersection of postings lists.
  • Reduced contention for top-k query evaluation when IndexSearcher is configured with an executor.

9.12.1

13 Dec 11:23

Improvements

  • Allow easier configuration of the Panama vectorization provider with newer Java versions. Set the org.apache.lucene.vectorization.upperJavaFeatureVersion system property to increase the set of Java versions that Panama vectorization will provide optimized implementations for.

Bug fixes

  • Fixed backwards compatibility bug that caused sparse (not all documents have a vector) KNN indices written with 9.0.0 to silently (with no exception) give terrible recall results when searched by any 9.x release.
  • Improve Tessellator logic when two holes share the same vertex with the polygon, which was failing on valid polygons.
  • Fix backwards compatibility bug that caused 9.12.0 to incorrectly throw IllegalStateException when trying to open an IndexReader on an index created with quantized (int4, int7, int8) KNN vectors using Lucene99HnswScalarQuantizedVectorsFormat.

10.0.0

14 Oct 13:02

System requirements

  • Lucene 10.0 requires JDK 21 or newer

API changes

  • KNN vector values now have a random-access API.
  • Deprecated APIs have been removed and a number of API changes have been made. Please consult the migration guide for an extensive list and actions to take to migrate to 10.0.

New Features

  • A new IndexInput#prefetch API has been added, allowing query evaluation logic to let the Directory know about regions of data that are about to be read. This helps perform I/O concurrently under the hood. MMapDirectory implements this API using the madvise system call and the MADV_WILLNEED flag on Linux and Mac OS.
  • Lucene now supports sparse indexing on doc values via FieldType#setDocValuesSkipIndexType. The sparse index will record the minimum and maximum values per block of doc IDs. Used in conjunction with index sorting to cluster similar documents together, this allows for very space-efficient and CPU-efficient filtering.
  • Search concurrency is now decoupled from the index geometry, so that an index can be searched using any number of threads, regardless of its number of segments.
  • Kmeans clustering on vectors
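
To make the sparse doc-values index described above more concrete, here is a minimal standalone sketch of the general idea: record the minimum and maximum value per block of doc IDs, then use them to skip or bulk-accept blocks during a range filter. It is not Lucene's on-disk format or API. The class name, method names, and the tiny BLOCK_SIZE are assumptions made for readability.

```java
/** Illustrative sketch of a per-block min/max skip index; not Lucene's implementation. */
final class DocValuesSkipSketch {
  static final int BLOCK_SIZE = 4; // hypothetical block size, kept tiny for readability

  /** Returns {min, max} for each block of BLOCK_SIZE consecutive doc IDs. */
  static long[][] buildSkipIndex(long[] values) {
    int blocks = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    long[][] index = new long[blocks][];
    for (int b = 0; b < blocks; b++) {
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      int end = Math.min(values.length, (b + 1) * BLOCK_SIZE);
      for (int doc = b * BLOCK_SIZE; doc < end; doc++) {
        min = Math.min(min, values[doc]);
        max = Math.max(max, values[doc]);
      }
      index[b] = new long[] {min, max};
    }
    return index;
  }

  /** Counts docs with lo <= value <= hi; returns {matchCount, individualValuesRead}. */
  static int[] countInRange(long[] values, long[][] index, long lo, long hi) {
    int count = 0;
    int read = 0;
    for (int b = 0; b < index.length; b++) {
      long min = index[b][0];
      long max = index[b][1];
      int start = b * BLOCK_SIZE;
      int end = Math.min(values.length, start + BLOCK_SIZE);
      if (max < lo || min > hi) {
        continue; // no doc in this block can match
      }
      if (min >= lo && max <= hi) {
        count += end - start; // every doc in this block matches
        continue;
      }
      for (int doc = start; doc < end; doc++) { // partial overlap: check each value
        read++;
        if (values[doc] >= lo && values[doc] <= hi) {
          count++;
        }
      }
    }
    return new int[] {count, read};
  }

  public static void main(String[] args) {
    // Values are clustered, as they would be with an index sort on this field.
    long[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int[] result = countInRange(values, buildSkipIndex(values), 6, 12);
    System.out.println(result[0] + " matches, " + result[1] + " values read"); // prints 7 matches, 4 values read
  }
}
```

With sorted values, only the block that straddles the range boundary needs per-document checks; without clustering, most blocks would overlap the range and the index would help far less.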

Improvements

  • Lucene now opens files with the MADV_RANDOM advice by default on Linux and Mac OS. This results in better efficiency for indexes that exceed the size of the page cache, but can make it slower to load indexes in the page cache. It is possible to revert to the MADV_NORMAL read advice by default by passing -Dorg.apache.lucene.store.defaultReadAdvice=NORMAL as a JVM startup flag.
  • Snowball dictionaries have been upgraded, resulting in improved tokenization. This may require reindexing to ensure consistency of search results with pre-10.0 indexes.
  • The expressions module now uses MethodHandles and Dynamic Class-File Constants (JEP 309) in combination with hidden classes (JEP 371) to implement strict and type-safe calls to external functions. This makes it easier to extend expressions with custom functions in a secure way, because runtime linking of custom functions is no longer the responsibility of the expressions scripting engine. In addition, the hidden classes created by the expressions engine no longer suffer from global classloader locks.

... plus a multitude of helpful bug fixes!

9.12.0

28 Sep 20:19

Security Fixes

  • Deserialization of Untrusted Data vulnerability in Apache Lucene Replicator - CVE-2024-45772

New Features

  • Improve intra-merge parallelism for many value types. (Ben Trent)
  • Add support for JDK 23 to the Panama Vectorization Provider. (Chris Hegarty)

Improvements

  • Add Intervals.regexp and Intervals.range methods to produce IntervalsSource for regexp and range queries. (Mayya Sharipova)
  • Remove support for writing 8 bit scalar vector quantization. 4 and 7 bit quantization are still supported. (Michael McCandless)

Optimizations

  • Inline postings skip data to improve performance of queries that need skipping such as conjunctions. (Adrien Grand)
  • Optimizations to the decoding logic of blocks of postings. (Adrien Grand, Uwe Schindler, Greg Miller)
  • Avoid performance degradation with closing shared mapped segment data (Chris Hegarty, Michael Gibney, Uwe Schindler)

... plus a multitude of helpful bug fixes!

9.11.1

27 Jun 13:46

Bug Fixes

  • Fix performance regression in NumericComparator.
  • Remove intra-merge parallelism for everything except HNSW graph merges.
  • Fix bug that prevented adding a parent field to an index with no fields.
  • Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches.
  • StringValueFacetCounts stops throwing NPE when faceting over an empty match-set.

9.11.0

06 Jun 14:29

New features

  • Add support for posix_madvise to MMapDirectory: If running on Linux/macOS and Java 21 or later, MMapDirectory uses IOContext to pass suitable MADV flags to kernel of operating system. This may improve paging logic especially when working with large indexes under memory pressure.
  • Expand support for new scalar bit levels for HNSW vectors. This includes 4-bit vectors and an option to compress them to gain a 50% reduction in memory usage.
  • Recursive graph bisection is now supported on indexes that have blocks
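
The 50% memory reduction for compressed 4-bit vectors follows from storing two 4-bit values in each byte. The sketch below shows one simple way to do that. The adjacent-pair layout, the class name, and the method names are assumptions for illustration and are not claimed to match the layout Lucene uses.

```java
import java.util.Arrays;

/** Illustrative sketch of packing 4-bit values two per byte; layout is hypothetical. */
final class Int4PackingSketch {

  /** Packs values in [0, 15] two per byte; assumes an even number of dimensions. */
  static byte[] pack(byte[] unpacked) {
    byte[] packed = new byte[unpacked.length / 2];
    for (int i = 0; i < packed.length; i++) {
      int high = unpacked[2 * i] & 0x0F;
      int low = unpacked[2 * i + 1] & 0x0F;
      packed[i] = (byte) ((high << 4) | low);
    }
    return packed;
  }

  /** Reverses pack(): each byte yields its high nibble, then its low nibble. */
  static byte[] unpack(byte[] packed) {
    byte[] unpacked = new byte[packed.length * 2];
    for (int i = 0; i < packed.length; i++) {
      unpacked[2 * i] = (byte) ((packed[i] >>> 4) & 0x0F);
      unpacked[2 * i + 1] = (byte) (packed[i] & 0x0F);
    }
    return unpacked;
  }

  public static void main(String[] args) {
    byte[] quantized = {1, 15, 0, 7}; // four 4-bit values, one per byte
    byte[] packed = pack(quantized);  // two bytes: half the original size
    System.out.println(packed.length + " bytes, round trip ok: "
        + Arrays.equals(unpack(packed), quantized)); // prints 2 bytes, round trip ok: true
  }
}
```

The trade-off is that values must be unpacked, or scored with nibble-aware arithmetic, at search time, which is why compression is offered as an option rather than applied unconditionally.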

Improvements

  • MergeScheduler can now provide an executor for intra-merge parallelism. The first implementation is the ConcurrentMergeScheduler.
  • Upgrade icu4j to version 74.2.

Optimizations

  • Use RWLock to access LRUQueryCache to reduce contention.
  • Speedup multi-segment HNSW graph search for diversifying child kNN queries.
  • Add a MemorySegment Vector scorer - for scoring without copying on-heap. This can improve search latency by almost 2x for byte vectors.
  • Switch to using optimized, primitive collections where possible to improve performance and heap utilization.

Full Changelog: releases/lucene/9.10.0...releases/lucene/9.11.0