Skip to content

Conversation

@joein
Copy link
Member

@joein joein commented Nov 12, 2025

No description provided.

@coderabbitai
Copy link

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Walkthrough

This PR modernizes type annotations to Python 3.10+ PEP 604 union syntax across the repository, replacing Optional[...] and Union[...] with X | None and A | B respectively in 30+ modules. It updates CI/workflows and pyproject to require Python ≥3.10, expands the NumpyArray alias to include integer dtypes, adjusts some dataclass defaults/type hints (e.g., ONNX output context), and adds an avg_len parameter to SparseVectorConverter. A few functions enforce stricter typing/validation (notably task_id handling in multitask embedding). No widespread runtime logic changes besides the noted validations and signature refinements.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • fastembed/common/onnx_model.py — dataclass default for OnnxOutputContext.attention_mask and related initialization.
  • fastembed/sparse/utils/sparse_vectors_converter.py — new avg_len parameter, changed container types, and INT32_MAX constant; verify call sites and behavior.
  • fastembed/text/multitask_embedding.py — _preprocess_onnx_input now enforces non-None task_id (raises ValueError); confirm intended contract.
  • fastembed/common/types.py — broadened NumpyArray alias (adds integer dtypes); check downstream typing assumptions.
  • Cross-module signature consistency — ensure subclass/override signatures remain compatible after Optional → X | None conversions.

Possibly related PRs

Suggested reviewers

  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.28% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ❓ Inconclusive No description was provided; however, the changes comprehensively modernize Python type hints across the codebase and update Python version requirements, which is substantial enough to warrant review despite the missing description. Consider adding a description explaining the motivation (e.g., aligning with Python 3.10+ best practices) and noting any breaking changes or migration notes for users.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: dropping Python 3.9 support and modernizing type hints from Optional/Union to PEP 604 union syntax (|).
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch drop-python3.9

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fastembed/common/types.py (1)

13-20: Reconsider the NumpyArray expansion scope—it masks a semantic type violation in BM25.

The expansion to include integer types was added to support token IDs (appropriate), but it overly broadens the contract for SparseEmbedding.values, which should semantically represent score values (floats).

Issue: BM25 creates integer values (np.ones_like(token_ids) with dtype=np.int32) that now pass strict type checking only because NumpyArray was expanded. Before this change, BM25 would have failed type validation—the bug was latent. The # type: ignore comment in as_dict() reveals the developers recognized the type mismatch.

Recommended approach:

  • Narrow NumpyArray to float types only for SparseEmbedding.values
  • Use a separate/specific type for token ID arrays (e.g., in tokenize(), convert_ids_to_tokens())
  • Fix BM25 to create actual float values (BM25 scores) instead of integer ones

The current change allows type-incorrect code to pass strict mode validation without addressing the underlying semantic contract violation.

🧹 Nitpick comments (6)
fastembed/late_interaction/token_embeddings.py (1)

2-69: Use PEP 604 for documents, drop Union import

Switching this parameter to str | Iterable[str] lets us remove the remaining Union import and keeps the file consistent with the rest of the modernization.

-from typing import Union, Iterable, Any, Type
+from typing import Iterable, Any, Type
...
-        documents: Union[str, Iterable[str]],
+        documents: str | Iterable[str],
fastembed/late_interaction_multimodal/colpali.py (1)

60-75: Docstring still references Optional[...]

The annotations now use | None, so consider updating the docstring wording (e.g., “Sequence[OnnxProvider] | None”) to avoid confusing readers skimming the docs.

fastembed/sparse/splade_pp.py (1)

80-95: Bring the docstring in line with the new union syntax.

Signature now advertises str | None, Sequence[...] | None, etc., but the docstring still talks about Optional[...]. Updating the prose keeps API docs consistent with the public surface.

-            cache_dir (str, optional): The path to the cache directory.
+            cache_dir (str | None, optional): The path to the cache directory.
@@
-            providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use.
+            providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use.
@@
-            device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in
+            device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in
@@
-            device_id (Optional[int], optional): The device id to use for loading the model in the worker process.
-            specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else
+            device_id (int | None, optional): The device id to use for loading the model in the worker process.
+            specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere else
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)

92-106: Update the docstring to mirror the union annotations.

After switching the signature to str | None / Sequence[...] | None, the docstring should follow suit to avoid stale Optional[...] references.

-            cache_dir (str, optional): The path to the cache directory.
+            cache_dir (str | None, optional): The path to the cache directory.
@@
-            threads (int, optional): The number of threads single onnxruntime session can use. Defaults to None.
+            threads (int | None, optional): The number of threads single onnxruntime session can use. Defaults to None.
@@
-            providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use.
+            providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use.
@@
-            device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in
+            device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in
@@
-            device_id (Optional[int], optional): The device id to use for loading the model in the worker process.
-            specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else
+            device_id (int | None, optional): The device id to use for loading the model in the worker process.
+            specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere else
fastembed/late_interaction/late_interaction_embedding_base.py (1)

1-61: LGTM! Type hint modernization is correct.

The conversion from Optional[T] to T | None and Union[A, B] to A | B is syntactically correct and maintains functional equivalence. All parameter types in __init__, embed, and query_embed have been updated consistently.

Consider updating the docstring at line 51 to align with the new type syntax—it still references Union[str, Iterable[str]] while the actual parameter type now uses str | Iterable[str]. This applies to similar docstrings in other methods as well.

fastembed/late_interaction/late_interaction_text_embedding.py (1)

1-153: LGTM! Type annotations updated correctly.

All type hint conversions to Python 3.10+ union syntax are correct. The changes maintain functional equivalence while modernizing the codebase.

The docstring at line 146 still references the old Union[str, Iterable[str]] syntax. For consistency, consider updating docstrings to match the new type annotation style throughout the codebase.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba1f605 and be498c2.

📒 Files selected for processing (42)
  • .github/workflows/python-publish.yml (1 hunks)
  • .github/workflows/python-tests.yml (0 hunks)
  • .github/workflows/type-checkers.yml (1 hunks)
  • fastembed/common/model_description.py (1 hunks)
  • fastembed/common/model_management.py (4 hunks)
  • fastembed/common/onnx_model.py (4 hunks)
  • fastembed/common/types.py (1 hunks)
  • fastembed/embedding.py (2 hunks)
  • fastembed/image/image_embedding.py (4 hunks)
  • fastembed/image/image_embedding_base.py (2 hunks)
  • fastembed/image/onnx_embedding.py (4 hunks)
  • fastembed/image/onnx_image_model.py (4 hunks)
  • fastembed/image/transform/functional.py (5 hunks)
  • fastembed/image/transform/operators.py (7 hunks)
  • fastembed/late_interaction/colbert.py (6 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (3 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (5 hunks)
  • fastembed/late_interaction/token_embeddings.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (5 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (5 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3 hunks)
  • fastembed/parallel_processor.py (4 hunks)
  • fastembed/postprocess/muvera.py (1 hunks)
  • fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (2 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (4 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (4 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder.py (4 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder_base.py (3 hunks)
  • fastembed/sparse/bm25.py (5 hunks)
  • fastembed/sparse/bm42.py (5 hunks)
  • fastembed/sparse/minicoil.py (5 hunks)
  • fastembed/sparse/sparse_embedding_base.py (4 hunks)
  • fastembed/sparse/sparse_text_embedding.py (4 hunks)
  • fastembed/sparse/splade_pp.py (4 hunks)
  • fastembed/sparse/utils/sparse_vectors_converter.py (7 hunks)
  • fastembed/text/custom_text_embedding.py (2 hunks)
  • fastembed/text/multitask_embedding.py (4 hunks)
  • fastembed/text/onnx_embedding.py (4 hunks)
  • fastembed/text/onnx_text_model.py (5 hunks)
  • fastembed/text/text_embedding.py (6 hunks)
  • fastembed/text/text_embedding_base.py (3 hunks)
  • tests/utils.py (2 hunks)
💤 Files with no reviewable changes (1)
  • .github/workflows/python-tests.yml
🧰 Additional context used
🧬 Code graph analysis (21)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
  • embed_text (209-242)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • embed_text (120-141)
fastembed/image/image_embedding_base.py (2)
fastembed/image/image_embedding.py (1)
  • embed (114-135)
fastembed/image/onnx_embedding.py (1)
  • embed (149-184)
fastembed/sparse/sparse_embedding_base.py (4)
fastembed/sparse/bm25.py (1)
  • query_embed (305-321)
fastembed/sparse/bm42.py (1)
  • query_embed (317-336)
fastembed/sparse/minicoil.py (1)
  • query_embed (220-238)
fastembed/sparse/sparse_text_embedding.py (1)
  • query_embed (118-128)
fastembed/text/text_embedding_base.py (4)
fastembed/image/image_embedding_base.py (1)
  • embed (23-45)
fastembed/late_interaction/late_interaction_embedding_base.py (2)
  • embed (22-29)
  • query_embed (46-61)
fastembed/sparse/sparse_embedding_base.py (2)
  • embed (47-54)
  • query_embed (71-86)
fastembed/text/text_embedding.py (2)
  • embed (165-187)
  • query_embed (189-200)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
fastembed/late_interaction/colbert.py (2)
  • embed (205-239)
  • query_embed (241-251)
fastembed/late_interaction/token_embeddings.py (1)
  • embed (64-71)
fastembed/text/text_embedding_base.py (2)
  • embed (22-29)
  • query_embed (46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • query_embed (46-61)
fastembed/text/text_embedding.py (1)
  • query_embed (189-200)
fastembed/late_interaction/colbert.py (1)
  • query_embed (241-251)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
fastembed/image/image_embedding.py (1)
  • embedding_size (81-85)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • embedding_size (84-88)
fastembed/text/text_embedding.py (1)
  • embedding_size (132-136)
fastembed/late_interaction/colbert.py (2)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • query_embed (46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • query_embed (141-153)
fastembed/sparse/utils/sparse_vectors_converter.py (3)
fastembed/common/utils.py (2)
  • get_all_punctuation (62-65)
  • remove_non_alphanumeric (68-69)
fastembed/sparse/utils/vocab_resolver.py (1)
  • vocab_size (54-56)
fastembed/sparse/sparse_embedding_base.py (1)
  • SparseEmbedding (13-31)
fastembed/sparse/bm25.py (4)
fastembed/late_interaction/colbert.py (1)
  • query_embed (241-251)
fastembed/sparse/bm42.py (1)
  • query_embed (317-336)
fastembed/sparse/sparse_embedding_base.py (2)
  • query_embed (71-86)
  • SparseEmbedding (13-31)
fastembed/sparse/sparse_text_embedding.py (1)
  • query_embed (118-128)
fastembed/sparse/minicoil.py (5)
fastembed/sparse/utils/vocab_resolver.py (1)
  • VocabResolver (31-201)
fastembed/sparse/utils/minicoil_encoder.py (1)
  • Encoder (11-146)
fastembed/sparse/utils/sparse_vectors_converter.py (1)
  • SparseVectorConverter (24-244)
fastembed/sparse/sparse_embedding_base.py (2)
  • query_embed (71-86)
  • SparseEmbedding (13-31)
fastembed/sparse/sparse_text_embedding.py (1)
  • query_embed (118-128)
fastembed/postprocess/muvera.py (2)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • LateInteractionTextEmbeddingBase (8-71)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • LateInteractionMultimodalEmbeddingBase (10-78)
fastembed/text/onnx_text_model.py (2)
fastembed/common/onnx_model.py (1)
  • _preprocess_onnx_input (49-55)
fastembed/text/multitask_embedding.py (1)
  • _preprocess_onnx_input (60-69)
tests/utils.py (1)
fastembed/common/model_description.py (1)
  • BaseModelDescription (24-31)
fastembed/text/multitask_embedding.py (2)
fastembed/image/image_embedding.py (1)
  • embed (114-135)
fastembed/text/text_embedding.py (1)
  • embed (165-187)
fastembed/image/onnx_image_model.py (1)
fastembed/image/transform/operators.py (1)
  • Compose (85-269)
fastembed/sparse/bm42.py (2)
fastembed/sparse/bm25.py (1)
  • query_embed (305-321)
fastembed/sparse/sparse_embedding_base.py (2)
  • query_embed (71-86)
  • SparseEmbedding (13-31)
fastembed/image/transform/operators.py (1)
fastembed/image/transform/functional.py (1)
  • pil2ndarray (118-121)
fastembed/image/image_embedding.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • embedding_size (84-88)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • embedding_size (87-91)
fastembed/text/text_embedding.py (1)
  • embedding_size (132-136)
fastembed/sparse/sparse_text_embedding.py (4)
fastembed/sparse/bm25.py (1)
  • query_embed (305-321)
fastembed/sparse/bm42.py (1)
  • query_embed (317-336)
fastembed/sparse/minicoil.py (1)
  • query_embed (220-238)
fastembed/sparse/sparse_embedding_base.py (2)
  • query_embed (71-86)
  • SparseEmbedding (13-31)
fastembed/text/text_embedding.py (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • query_embed (141-153)
fastembed/text/text_embedding_base.py (1)
  • query_embed (46-61)
fastembed/sparse/sparse_embedding_base.py (1)
  • query_embed (71-86)
fastembed/text/multitask_embedding.py (1)
  • query_embed (84-85)
🪛 Ruff (0.14.4)
fastembed/late_interaction/colbert.py

241-241: Unused method argument: kwargs

(ARG002)

fastembed/sparse/bm25.py

305-305: Unused method argument: kwargs

(ARG002)

fastembed/text/onnx_text_model.py

42-42: Unused method argument: kwargs

(ARG002)

fastembed/sparse/bm42.py

317-317: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (42)
fastembed/sparse/minicoil.py (4)

3-3: LGTM! Import statement correctly updated.

The removal of Optional and Union imports is appropriate since the code now uses PEP 604 union syntax (|).


68-78: LGTM! Constructor parameters correctly modernized.

All optional parameter type hints have been properly converted to PEP 604 syntax. The conversions maintain functional equivalence while aligning with Python 3.10+ conventions.


118-126: LGTM! Instance attributes properly typed with appropriate None guards.

The optional type annotations correctly reflect the lazy loading pattern. The assertions at lines 151 and 264-266 properly guard against None usage before accessing these attributes.


180-220: LGTM! Method signatures correctly updated and consistent with base classes.

The type hint modernization for embed() and query_embed() methods maintains consistency with the base class signatures shown in the related code.

.github/workflows/type-checkers.yml (1)

11-11: Matrix drop matches supported runtimes

Removing 3.9 keeps our type-checking matrix aligned with the new 3.10+ typing surface—looks good.

.github/workflows/python-publish.yml (1)

28-28: Publish workflow aligned with 3.10 floor

Bumping the release job to Python 3.10 keeps packaging consistent with the rest of the toolchain.

fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)

11-45: Constructor unions look good

The move to str | None / int | None mirrors the rest of the PEP 604 migration and keeps the interface unchanged—LGTM.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

14-52: Base multimodal signatures updated cleanly

Nice work updating the constructor and embed entry points to the new union syntax while keeping behavior intact.

fastembed/image/transform/functional.py (1)

16-128: Image utilities now speak the same typing dialect

The widened unions for PIL images vs. ndarrays and the refined mean/std annotations match the rest of the refactor and preserve runtime logic—looks solid.

fastembed/image/transform/operators.py (1)

18-92: Transform stack consistent with functional helpers

The union-typed signatures align perfectly with the helper functions and keep the transform pipeline typings coherent—nicely done.

fastembed/sparse/utils/sparse_vectors_converter.py (3)

18-21: Type annotations modernized successfully.

The dataclass fields now use built-in generic types (list[str] instead of List[str]), aligning with Python 3.10+ conventions.


25-41: New parameter avg_len adds flexibility to BM25 calculation.

The avg_len parameter is now exposed in the constructor with a sensible default of 150.0. This maintains backward compatibility while allowing customization of the average document length used in BM25 scoring.


60-64: Method signatures consistently updated.

All method signatures throughout the file now use built-in generics (list[float] instead of List[float]), maintaining consistency with the modernization effort.

tests/utils.py (2)

11-11: Type annotations modernized with union syntax.

The function signature now uses str | Path instead of Union[str, Path], aligning with Python 3.10+ typing conventions.


42-47: Parameter type updated consistently.

The is_ci parameter now uses str | None instead of Optional[str], maintaining consistency with the modernization effort across the codebase.

fastembed/common/model_description.py (1)

34-40: Type hints updated to modern union syntax.

The DenseModelDescription fields now use int | None and dict[str, Any] | None instead of Optional[...]. The __post_init__ validation for dim remains unchanged, correctly enforcing that dim must be provided at runtime.

fastembed/postprocess/muvera.py (1)

12-12: Type alias modernized with union operator.

The MultiVectorModel type alias now uses the | operator instead of Union, consistent with Python 3.10+ typing conventions.

fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (1)

11-23: Constructor signature modernized comprehensively.

All optional parameters now use the modern union syntax (str | None, int | None, Sequence[OnnxProvider] | None, list[int] | None). The changes are consistent and maintain full backward compatibility.

fastembed/common/model_management.py (3)

180-194: Nested function types updated consistently.

The _collect_file_metadata function's return type now uses dict[str, dict[str, int | str]] with the modern union syntax for the inner dict values. This correctly expresses that metadata values can be either int or str.


195-201: Parameter typing aligned with return type.

The _save_file_metadata function parameter type matches the return type of _collect_file_metadata, maintaining type consistency across the helper functions.


376-404: Public method signature modernized.

The download_model method's specific_model_path parameter now uses str | None instead of Optional[str], consistent with the modernization effort across the codebase.

fastembed/rerank/cross_encoder/onnx_text_model.py (3)

21-22: Class variable type correctly updated.

The ONNX_OUTPUT_NAMES class attribute now uses list[str] | None instead of Optional[list[str]], maintaining consistency with Python 3.10+ conventions.


28-46: Method signature parameters modernized.

The _load_onnx_model method parameters now use the union operator for optional types (int | None, Sequence[OnnxProvider] | None), aligning with modern Python typing.


87-100: Complex method signature updated consistently.

All optional parameters in _rerank_pairs now use the modern union syntax. The method signature is cleaner and more readable with the | operator.

fastembed/embedding.py (1)

16-24: Deprecated class updated with modern type hints.

The JinaEmbedding constructor parameters now use str | None and int | None instead of Optional types. While this class is deprecated, maintaining consistent typing is good practice.

fastembed/sparse/sparse_text_embedding.py (1)

1-128: LGTM! Type hint modernization is consistent.

The updates to union syntax are correct and align with the broader project changes.

fastembed/sparse/bm25.py (1)

1-321: LGTM! Type hint updates are correct.

The conversion to Python 3.10+ union syntax is consistent throughout the file.

Note: The static analysis hint about unused kwargs at line 305 can be safely ignored—the parameter is maintained for interface consistency with other embedding classes' query_embed methods, as seen in the codebase pattern.

fastembed/image/image_embedding.py (1)

1-135: LGTM! Type annotations modernized correctly.

All parameter and variable type hints have been updated to Python 3.10+ union syntax consistently.

fastembed/text/text_embedding.py (1)

1-214: LGTM! Type hint modernization is complete and correct.

All type annotations have been successfully updated to Python 3.10+ union syntax across the public API.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

1-164: LGTM! Type annotations updated consistently.

All type hints have been correctly modernized to Python 3.10+ union syntax.

fastembed/common/types.py (1)

9-12: LGTM! Type alias updates are correct.

The conversion of PathInput, ImageInput, and OnnxProvider to Python 3.10+ union syntax is correct and maintains semantic equivalence.

fastembed/parallel_processor.py (1)

11-11: LGTM! Type annotations modernized correctly.

The type hint updates from Optional[T] to T | None are syntactically correct and align with Python 3.10+ PEP 604 union syntax. The removal of Optional from the typing imports is also appropriate.

Also applies to: 41-41, 96-97, 102-103, 110-110, 223-223

fastembed/sparse/sparse_embedding_base.py (1)

38-39: LGTM! Type annotations modernized correctly.

The updates to cache_dir, threads, documents, parallel, and query parameters correctly adopt the Python 3.10+ union syntax.

Also applies to: 49-51, 71-71

fastembed/rerank/cross_encoder/text_cross_encoder.py (1)

1-1: LGTM! Type annotations modernized correctly.

All type hint updates from Optional[T] to T | None are correct and consistent. The removal of Optional from imports is appropriate.

Also applies to: 56-58, 60-60, 105-105, 143-143

fastembed/text/onnx_embedding.py (1)

1-1: LGTM! Type annotations modernized correctly.

All type hint updates throughout the OnnxTextEmbedding class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.

Also applies to: 202-204, 206-206, 208-209, 242-242, 263-263, 265-265

fastembed/text/text_embedding_base.py (1)

1-1: LGTM! Type annotations modernized correctly.

All type hint updates in the TextEmbeddingBase class are correct and consistent with Python 3.10+ union syntax. The removal of Optional and Union from imports is appropriate.

Also applies to: 12-13, 20-20, 24-24, 26-26, 46-46

fastembed/image/onnx_image_model.py (1)

5-5: LGTM! Type annotations modernized correctly.

All type hint updates in the OnnxImageModel class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.

Also applies to: 40-40, 54-55, 57-57, 94-94, 96-97, 99-99, 101-101

fastembed/image/onnx_embedding.py (1)

1-1: LGTM! Type annotations modernized correctly.

All type hint updates in the OnnxImageEmbedding class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.

Also applies to: 66-68, 70-70, 72-73, 107-107, 151-151, 153-153

fastembed/text/onnx_text_model.py (1)

4-4: LGTM! Type annotations modernized correctly.

All type hint updates in the OnnxTextModel class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.

The return type on line 43 (dict[str, NumpyArray | NDArray[np.int64]]) correctly uses the new union syntax for flexible preprocessing return types.

Also applies to: 18-18, 38-38, 43-43, 53-54, 56-56, 105-105, 107-108, 110-110, 112-112

fastembed/late_interaction/colbert.py (1)

2-2: LGTM! Clean type annotation modernization.

The conversion from Optional[T] and Union[A, B] to Python 3.10+ union syntax (T | None, A | B) is correctly applied throughout the file. All type changes maintain functional equivalence while aligning with modern Python typing conventions.

Note: The static analysis warning about unused kwargs in query_embed (line 241) can be safely ignored—it's part of the API contract inherited from the base class for extensibility.

Also applies to: 111-118, 152-152, 168-169, 172-172, 207-207, 209-209, 241-241

fastembed/sparse/bm42.py (1)

4-4: LGTM! Consistent type annotation updates.

The PEP 604 union syntax conversion is correctly and consistently applied. The changes maintain functional equivalence while modernizing the type hints.

Note: The static analysis warning about unused kwargs in query_embed (line 317) can be safely ignored—it follows the same extensibility pattern as other embedding classes in the codebase.

Also applies to: 60-68, 104-104, 275-275, 277-277, 317-317

fastembed/text/multitask_embedding.py (1)

2-2: LGTM! Type annotations correctly modernized.

The conversion to PEP 604 union syntax is properly applied across all method signatures and attribute annotations. The task_id handling pattern is correct: the public embed method accepts an optional task_id and defaults it to self.default_task_id, ensuring that by the time _preprocess_onnx_input is called, a non-None value is present. The ValueError in _preprocess_onnx_input (line 67) serves as an appropriate safeguard.

Also applies to: 48-48, 50-50, 63-63, 73-73, 75-76, 84-84

@@ -1,5 +1,5 @@
from dataclasses import dataclass
from typing import Iterable, Optional, Union, Any
from typing import Iterable, Any, Union
Copy link

@coderabbitai coderabbitai bot Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Incomplete Union conversion in dataclass field.

Line 15 still uses the old Union[NDArray[np.int64], NDArray[np.int32]] syntax, which is inconsistent with the PR's goal to modernize type annotations. This should be updated to NDArray[np.int64] | NDArray[np.int32] for consistency.

Once updated, the Union import on line 2 can likely be removed (unless it's used elsewhere in the file).

Apply this diff:

-from typing import Iterable, Any, Union
+from typing import Iterable, Any
 @dataclass
 class SparseEmbedding:
     values: NumpyArray
-    indices: Union[NDArray[np.int64], NDArray[np.int32]]
+    indices: NDArray[np.int64] | NDArray[np.int32]

Also applies to: 15-15

🤖 Prompt for AI Agents
In fastembed/sparse/sparse_embedding_base.py around line 15, the dataclass field
still uses the old Union[...] syntax (Union[NDArray[np.int64],
NDArray[np.int32]]); update it to the modern PEP 604 pipe form NDArray[np.int64]
| NDArray[np.int32] and remove the now-unused Union import from line 2 if it
isn't referenced elsewhere in the file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataclass complaints on this change

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🌐 Web query:

Python dataclass PEP 604 union operator pipe syntax issues

💡 Result:

Short answer: this happens because the runtime | operator is being evaluated when annotations are executed, and one or both operands aren't real types that implement or (e.g. numpy.array is a function, not the ndarray class; some generics or runtime objects are types that don't support |). PEP 604 lets you write int | str in annotations, but only when the expression can be evaluated to a Union at runtime (or when annotations are postponed). See examples and discussion. [1][2][3]

Why it fails (concise):

  • PEP 604 makes X | Y call type.or at runtime; that will raise TypeError if X or Y don't implement or (or are incompatible runtime objects). [2][3]
  • Libraries like numpy expose numpy.array as a function/builtin, not the ndarray type, so numpy.array | list[tuple] triggers "unsupported operand type(s) for |". (Real reports and examples on Stack Overflow.) [1]

How to fix:

  • Use typing.Union (typing.Union[np.ndarray, list[tuple]]) instead of |. [2]
  • Use the actual types (e.g. np.ndarray) rather than factory functions (np.array). [1]
  • Postpone evaluation of annotations so the "|" is treated as a type-hint expression (from future import annotations) and handled by type checkers rather than evaluated at class definition time; then use typing.get_type_hints() when you need resolved types. [3]
  • Or annotate with strings (e.g. "np.ndarray | list[tuple]") if you must avoid runtime evaluation.

If you want, tell me the exact annotation you're using and your Python version and I’ll show the minimal change.

Sources: Stack Overflow example of the error and discussion; PEP 604; typing documentation. [1][2][3]


@joein, you're absolutely right! This happens because PEP 604's pipe operator is evaluated at runtime, and NumPy's NDArray types don't support the __or__ operator required for the pipe syntax.

Your PR already handles a similar case correctly in fastembed/common/onnx_model.py with the comment "dataclasses can't handle | as a replacement for Optional". The same issue applies here.

Recommendation: Keep Union for this dataclass field and add a clarifying comment:

from typing import Iterable, Any, Union

@dataclass
class SparseEmbedding:
    values: NumpyArray
    indices: Union[NDArray[np.int64], NDArray[np.int32]]  # Union required: dataclasses can't handle | with NDArray

The Union import should remain since it's needed for this field. This is the correct approach for the dataclass limitation, and your PR is handling it consistently.


✏️ Learnings added
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.959Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be498c2 and e1aa84b.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.959Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
🔇 Additional comments (1)
pyproject.toml (1)

14-14: Python 3.10 minimum version requirement correctly updated.

The change from >=3.9.0 to >=3.10.0 aligns with the PR objectives and enables use of modern type-hinting features (PEP 604 union syntax with |).

Comment on lines 15 to 23
numpy = [
{ version = ">=1.21", python = ">=3.10,<3.12" },
{ version = ">=1.26", python = ">=3.12,<3.13" },
{ version = ">=2.1.0", python = ">=3.13" },
{ version = ">=1.21,<2.1.0", python = "<3.10" },
]
onnxruntime = [
{ version = ">=1.17.0,<1.20.0", python = "<3.10" },
{ version = ">1.20.0", python = ">=3.13" },
{ version = ">=1.17.0,!=1.20.0", python = ">=3.10,<3.13" },
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that pinned dependency versions remain valid and secure.

The simplified numpy and onnxruntime constraints look reasonable for Python 3.10+, but since my knowledge cutoff is March 2025 and the current date is November 2025, security advisories or version incompatibilities may have emerged. Confirm that the specified versions are still current, maintained, and free from vulnerabilities.


🌐 Web query:

What are the latest secure versions of numpy and onnxruntime as of November 2025?

💡 Result:

  • NumPy: 2.3.4 — released 15 Oct 2025. [1]
  • ONNX Runtime: v1.23.2 — released 25 Oct 2025. [2]

Sources:
[1] NumPy news/releases.
[2] ONNX Runtime GitHub releases.


I need to verify security advisories and Python 3.13 compatibility for these latest versions before finalizing the review comment.

Numpy 2.3.4 dropped Python 3.10 support; constraint issue for Python 3.10

NumPy 2.3.4 supports Python versions 3.11-3.13, which means the pinned constraint numpy >=1.21 for Python 3.10 is problematic. While older numpy versions (1.21–1.25) technically satisfy this constraint, NumPy 2.1.0 supports Python 3.13 and NumPy 2.1 first introduced Python 3.13 support. For Python 3.10, either pin a specific older numpy version (1.26.x series) or accept that users on Python 3.10 cannot use the latest numpy releases.

For ONNX Runtime, no direct vulnerabilities have been found for numpy 2.3.4 or onnxruntime 1.17.3 in Snyk's vulnerability database. ONNX Runtime 1.23.2 now supports Python 3.13, and the current constraints appropriately support Python 3.10–3.13.

Consider adding an explicit upper bound on numpy for Python 3.10 (e.g., numpy >=1.21,<2.0 for Python 3.10) to prevent unintended downgrades.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants