-
Notifications
You must be signed in to change notification settings - Fork 164
new: drop python3.9, replace optional and union with | #574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThis PR modernizes type annotations to Python 3.10+ PEP 604 union syntax across the repository, replacing Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fastembed/common/types.py (1)
13-20: Reconsider the NumpyArray expansion scope—it masks a semantic type violation in BM25.The expansion to include integer types was added to support token IDs (appropriate), but it overly broadens the contract for
SparseEmbedding.values, which should semantically represent score values (floats).Issue: BM25 creates integer values (
np.ones_like(token_ids)withdtype=np.int32) that now pass strict type checking only becauseNumpyArraywas expanded. Before this change, BM25 would have failed type validation—the bug was latent. The# type: ignorecomment inas_dict()reveals the developers recognized the type mismatch.Recommended approach:
- Narrow
NumpyArrayto float types only forSparseEmbedding.values- Use a separate/specific type for token ID arrays (e.g., in
tokenize(),convert_ids_to_tokens())- Fix BM25 to create actual float values (BM25 scores) instead of integer ones
The current change allows type-incorrect code to pass strict mode validation without addressing the underlying semantic contract violation.
🧹 Nitpick comments (6)
fastembed/late_interaction/token_embeddings.py (1)
2-69: Use PEP 604 for documents, dropUnionimportSwitching this parameter to
str | Iterable[str]lets us remove the remainingUnionimport and keeps the file consistent with the rest of the modernization.-from typing import Union, Iterable, Any, Type +from typing import Iterable, Any, Type ... - documents: Union[str, Iterable[str]], + documents: str | Iterable[str],fastembed/late_interaction_multimodal/colpali.py (1)
60-75: Docstring still referencesOptional[...]The annotations now use
| None, so consider updating the docstring wording (e.g., “Sequence[OnnxProvider] | None”) to avoid confusing readers skimming the docs.fastembed/sparse/splade_pp.py (1)
80-95: Bring the docstring in line with the new union syntax.Signature now advertises
str | None,Sequence[...] | None, etc., but the docstring still talks aboutOptional[...]. Updating the prose keeps API docs consistent with the public surface.- cache_dir (str, optional): The path to the cache directory. + cache_dir (str | None, optional): The path to the cache directory. @@ - providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use. + providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use. @@ - device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in + device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in @@ - device_id (Optional[int], optional): The device id to use for loading the model in the worker process. - specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else + device_id (int | None, optional): The device id to use for loading the model in the worker process. + specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere elsefastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
92-106: Update the docstring to mirror the union annotations.After switching the signature to
str | None/Sequence[...] | None, the docstring should follow suit to avoid staleOptional[...]references.- cache_dir (str, optional): The path to the cache directory. + cache_dir (str | None, optional): The path to the cache directory. @@ - threads (int, optional): The number of threads single onnxruntime session can use. Defaults to None. + threads (int | None, optional): The number of threads single onnxruntime session can use. Defaults to None. @@ - providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use. + providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use. @@ - device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in + device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in @@ - device_id (Optional[int], optional): The device id to use for loading the model in the worker process. - specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else + device_id (int | None, optional): The device id to use for loading the model in the worker process. + specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere elsefastembed/late_interaction/late_interaction_embedding_base.py (1)
1-61: LGTM! Type hint modernization is correct.The conversion from
Optional[T]toT | NoneandUnion[A, B]toA | Bis syntactically correct and maintains functional equivalence. All parameter types in__init__,embed, andquery_embedhave been updated consistently.Consider updating the docstring at line 51 to align with the new type syntax—it still references
Union[str, Iterable[str]]while the actual parameter type now usesstr | Iterable[str]. This applies to similar docstrings in other methods as well.fastembed/late_interaction/late_interaction_text_embedding.py (1)
1-153: LGTM! Type annotations updated correctly.All type hint conversions to Python 3.10+ union syntax are correct. The changes maintain functional equivalence while modernizing the codebase.
The docstring at line 146 still references the old
Union[str, Iterable[str]]syntax. For consistency, consider updating docstrings to match the new type annotation style throughout the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
.github/workflows/python-publish.yml(1 hunks).github/workflows/python-tests.yml(0 hunks).github/workflows/type-checkers.yml(1 hunks)fastembed/common/model_description.py(1 hunks)fastembed/common/model_management.py(4 hunks)fastembed/common/onnx_model.py(4 hunks)fastembed/common/types.py(1 hunks)fastembed/embedding.py(2 hunks)fastembed/image/image_embedding.py(4 hunks)fastembed/image/image_embedding_base.py(2 hunks)fastembed/image/onnx_embedding.py(4 hunks)fastembed/image/onnx_image_model.py(4 hunks)fastembed/image/transform/functional.py(5 hunks)fastembed/image/transform/operators.py(7 hunks)fastembed/late_interaction/colbert.py(6 hunks)fastembed/late_interaction/late_interaction_embedding_base.py(3 hunks)fastembed/late_interaction/late_interaction_text_embedding.py(5 hunks)fastembed/late_interaction/token_embeddings.py(2 hunks)fastembed/late_interaction_multimodal/colpali.py(5 hunks)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py(5 hunks)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py(3 hunks)fastembed/parallel_processor.py(4 hunks)fastembed/postprocess/muvera.py(1 hunks)fastembed/rerank/cross_encoder/custom_text_cross_encoder.py(2 hunks)fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py(4 hunks)fastembed/rerank/cross_encoder/onnx_text_model.py(4 hunks)fastembed/rerank/cross_encoder/text_cross_encoder.py(4 hunks)fastembed/rerank/cross_encoder/text_cross_encoder_base.py(3 hunks)fastembed/sparse/bm25.py(5 hunks)fastembed/sparse/bm42.py(5 hunks)fastembed/sparse/minicoil.py(5 hunks)fastembed/sparse/sparse_embedding_base.py(4 hunks)fastembed/sparse/sparse_text_embedding.py(4 hunks)fastembed/sparse/splade_pp.py(4 hunks)fastembed/sparse/utils/sparse_vectors_converter.py(7 hunks)fastembed/text/custom_text_embedding.py(2 hunks)fastembed/text/multitask_embedding.py(4 hunks)fastembed/text/onnx_embedding.py(4 hunks)fastembed/text/onnx_text_model.py(5 hunks)fastembed/text/text_embedding.py(6 hunks)fastembed/text/text_embedding_base.py(3 hunks)tests/utils.py(2 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/python-tests.yml
🧰 Additional context used
🧬 Code graph analysis (21)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
embed_text(209-242)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
embed_text(120-141)
fastembed/image/image_embedding_base.py (2)
fastembed/image/image_embedding.py (1)
embed(114-135)fastembed/image/onnx_embedding.py (1)
embed(149-184)
fastembed/sparse/sparse_embedding_base.py (4)
fastembed/sparse/bm25.py (1)
query_embed(305-321)fastembed/sparse/bm42.py (1)
query_embed(317-336)fastembed/sparse/minicoil.py (1)
query_embed(220-238)fastembed/sparse/sparse_text_embedding.py (1)
query_embed(118-128)
fastembed/text/text_embedding_base.py (4)
fastembed/image/image_embedding_base.py (1)
embed(23-45)fastembed/late_interaction/late_interaction_embedding_base.py (2)
embed(22-29)query_embed(46-61)fastembed/sparse/sparse_embedding_base.py (2)
embed(47-54)query_embed(71-86)fastembed/text/text_embedding.py (2)
embed(165-187)query_embed(189-200)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
fastembed/late_interaction/colbert.py (2)
embed(205-239)query_embed(241-251)fastembed/late_interaction/token_embeddings.py (1)
embed(64-71)fastembed/text/text_embedding_base.py (2)
embed(22-29)query_embed(46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)fastembed/text/text_embedding.py (1)
query_embed(189-200)fastembed/late_interaction/colbert.py (1)
query_embed(241-251)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
fastembed/image/image_embedding.py (1)
embedding_size(81-85)fastembed/late_interaction/late_interaction_text_embedding.py (1)
embedding_size(84-88)fastembed/text/text_embedding.py (1)
embedding_size(132-136)
fastembed/late_interaction/colbert.py (2)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)fastembed/late_interaction/late_interaction_text_embedding.py (1)
query_embed(141-153)
fastembed/sparse/utils/sparse_vectors_converter.py (3)
fastembed/common/utils.py (2)
get_all_punctuation(62-65)remove_non_alphanumeric(68-69)fastembed/sparse/utils/vocab_resolver.py (1)
vocab_size(54-56)fastembed/sparse/sparse_embedding_base.py (1)
SparseEmbedding(13-31)
fastembed/sparse/bm25.py (4)
fastembed/late_interaction/colbert.py (1)
query_embed(241-251)fastembed/sparse/bm42.py (1)
query_embed(317-336)fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)SparseEmbedding(13-31)fastembed/sparse/sparse_text_embedding.py (1)
query_embed(118-128)
fastembed/sparse/minicoil.py (5)
fastembed/sparse/utils/vocab_resolver.py (1)
VocabResolver(31-201)fastembed/sparse/utils/minicoil_encoder.py (1)
Encoder(11-146)fastembed/sparse/utils/sparse_vectors_converter.py (1)
SparseVectorConverter(24-244)fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)SparseEmbedding(13-31)fastembed/sparse/sparse_text_embedding.py (1)
query_embed(118-128)
fastembed/postprocess/muvera.py (2)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
LateInteractionTextEmbeddingBase(8-71)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
LateInteractionMultimodalEmbeddingBase(10-78)
fastembed/text/onnx_text_model.py (2)
fastembed/common/onnx_model.py (1)
_preprocess_onnx_input(49-55)fastembed/text/multitask_embedding.py (1)
_preprocess_onnx_input(60-69)
tests/utils.py (1)
fastembed/common/model_description.py (1)
BaseModelDescription(24-31)
fastembed/text/multitask_embedding.py (2)
fastembed/image/image_embedding.py (1)
embed(114-135)fastembed/text/text_embedding.py (1)
embed(165-187)
fastembed/image/onnx_image_model.py (1)
fastembed/image/transform/operators.py (1)
Compose(85-269)
fastembed/sparse/bm42.py (2)
fastembed/sparse/bm25.py (1)
query_embed(305-321)fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)SparseEmbedding(13-31)
fastembed/image/transform/operators.py (1)
fastembed/image/transform/functional.py (1)
pil2ndarray(118-121)
fastembed/image/image_embedding.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
embedding_size(84-88)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
embedding_size(87-91)fastembed/text/text_embedding.py (1)
embedding_size(132-136)
fastembed/sparse/sparse_text_embedding.py (4)
fastembed/sparse/bm25.py (1)
query_embed(305-321)fastembed/sparse/bm42.py (1)
query_embed(317-336)fastembed/sparse/minicoil.py (1)
query_embed(220-238)fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)SparseEmbedding(13-31)
fastembed/text/text_embedding.py (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
query_embed(141-153)fastembed/text/text_embedding_base.py (1)
query_embed(46-61)fastembed/sparse/sparse_embedding_base.py (1)
query_embed(71-86)fastembed/text/multitask_embedding.py (1)
query_embed(84-85)
🪛 Ruff (0.14.4)
fastembed/late_interaction/colbert.py
241-241: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
305-305: Unused method argument: kwargs
(ARG002)
fastembed/text/onnx_text_model.py
42-42: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm42.py
317-317: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.10.x on windows-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (42)
fastembed/sparse/minicoil.py (4)
3-3: LGTM! Import statement correctly updated.The removal of
OptionalandUnionimports is appropriate since the code now uses PEP 604 union syntax (|).
68-78: LGTM! Constructor parameters correctly modernized.All optional parameter type hints have been properly converted to PEP 604 syntax. The conversions maintain functional equivalence while aligning with Python 3.10+ conventions.
118-126: LGTM! Instance attributes properly typed with appropriate None guards.The optional type annotations correctly reflect the lazy loading pattern. The assertions at lines 151 and 264-266 properly guard against None usage before accessing these attributes.
180-220: LGTM! Method signatures correctly updated and consistent with base classes.The type hint modernization for
embed()andquery_embed()methods maintains consistency with the base class signatures shown in the related code..github/workflows/type-checkers.yml (1)
11-11: Matrix drop matches supported runtimesRemoving 3.9 keeps our type-checking matrix aligned with the new 3.10+ typing surface—looks good.
.github/workflows/python-publish.yml (1)
28-28: Publish workflow aligned with 3.10 floorBumping the release job to Python 3.10 keeps packaging consistent with the rest of the toolchain.
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
11-45: Constructor unions look goodThe move to
str | None/int | Nonemirrors the rest of the PEP 604 migration and keeps the interface unchanged—LGTM.fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
14-52: Base multimodal signatures updated cleanlyNice work updating the constructor and embed entry points to the new union syntax while keeping behavior intact.
fastembed/image/transform/functional.py (1)
16-128: Image utilities now speak the same typing dialectThe widened unions for PIL images vs. ndarrays and the refined mean/std annotations match the rest of the refactor and preserve runtime logic—looks solid.
fastembed/image/transform/operators.py (1)
18-92: Transform stack consistent with functional helpersThe union-typed signatures align perfectly with the helper functions and keep the transform pipeline typings coherent—nicely done.
fastembed/sparse/utils/sparse_vectors_converter.py (3)
18-21: Type annotations modernized successfully.The dataclass fields now use built-in generic types (
list[str]instead ofList[str]), aligning with Python 3.10+ conventions.
25-41: New parameteravg_lenadds flexibility to BM25 calculation.The
avg_lenparameter is now exposed in the constructor with a sensible default of 150.0. This maintains backward compatibility while allowing customization of the average document length used in BM25 scoring.
60-64: Method signatures consistently updated.All method signatures throughout the file now use built-in generics (
list[float]instead ofList[float]), maintaining consistency with the modernization effort.tests/utils.py (2)
11-11: Type annotations modernized with union syntax.The function signature now uses
str | Pathinstead ofUnion[str, Path], aligning with Python 3.10+ typing conventions.
42-47: Parameter type updated consistently.The
is_ciparameter now usesstr | Noneinstead ofOptional[str], maintaining consistency with the modernization effort across the codebase.fastembed/common/model_description.py (1)
34-40: Type hints updated to modern union syntax.The
DenseModelDescriptionfields now useint | Noneanddict[str, Any] | Noneinstead ofOptional[...]. The__post_init__validation fordimremains unchanged, correctly enforcing thatdimmust be provided at runtime.fastembed/postprocess/muvera.py (1)
12-12: Type alias modernized with union operator.The
MultiVectorModeltype alias now uses the|operator instead ofUnion, consistent with Python 3.10+ typing conventions.fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (1)
11-23: Constructor signature modernized comprehensively.All optional parameters now use the modern union syntax (
str | None,int | None,Sequence[OnnxProvider] | None,list[int] | None). The changes are consistent and maintain full backward compatibility.fastembed/common/model_management.py (3)
180-194: Nested function types updated consistently.The
_collect_file_metadatafunction's return type now usesdict[str, dict[str, int | str]]with the modern union syntax for the inner dict values. This correctly expresses that metadata values can be either int or str.
195-201: Parameter typing aligned with return type.The
_save_file_metadatafunction parameter type matches the return type of_collect_file_metadata, maintaining type consistency across the helper functions.
376-404: Public method signature modernized.The
download_modelmethod'sspecific_model_pathparameter now usesstr | Noneinstead ofOptional[str], consistent with the modernization effort across the codebase.fastembed/rerank/cross_encoder/onnx_text_model.py (3)
21-22: Class variable type correctly updated.The
ONNX_OUTPUT_NAMESclass attribute now useslist[str] | Noneinstead ofOptional[list[str]], maintaining consistency with Python 3.10+ conventions.
28-46: Method signature parameters modernized.The
_load_onnx_modelmethod parameters now use the union operator for optional types (int | None,Sequence[OnnxProvider] | None), aligning with modern Python typing.
87-100: Complex method signature updated consistently.All optional parameters in
_rerank_pairsnow use the modern union syntax. The method signature is cleaner and more readable with the|operator.fastembed/embedding.py (1)
16-24: Deprecated class updated with modern type hints.The
JinaEmbeddingconstructor parameters now usestr | Noneandint | Noneinstead ofOptionaltypes. While this class is deprecated, maintaining consistent typing is good practice.fastembed/sparse/sparse_text_embedding.py (1)
1-128: LGTM! Type hint modernization is consistent.The updates to union syntax are correct and align with the broader project changes.
fastembed/sparse/bm25.py (1)
1-321: LGTM! Type hint updates are correct.The conversion to Python 3.10+ union syntax is consistent throughout the file.
Note: The static analysis hint about unused
kwargsat line 305 can be safely ignored—the parameter is maintained for interface consistency with other embedding classes'query_embedmethods, as seen in the codebase pattern.fastembed/image/image_embedding.py (1)
1-135: LGTM! Type annotations modernized correctly.All parameter and variable type hints have been updated to Python 3.10+ union syntax consistently.
fastembed/text/text_embedding.py (1)
1-214: LGTM! Type hint modernization is complete and correct.All type annotations have been successfully updated to Python 3.10+ union syntax across the public API.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
1-164: LGTM! Type annotations updated consistently.All type hints have been correctly modernized to Python 3.10+ union syntax.
fastembed/common/types.py (1)
9-12: LGTM! Type alias updates are correct.The conversion of
PathInput,ImageInput, andOnnxProviderto Python 3.10+ union syntax is correct and maintains semantic equivalence.fastembed/parallel_processor.py (1)
11-11: LGTM! Type annotations modernized correctly.The type hint updates from
Optional[T]toT | Noneare syntactically correct and align with Python 3.10+ PEP 604 union syntax. The removal ofOptionalfrom the typing imports is also appropriate.Also applies to: 41-41, 96-97, 102-103, 110-110, 223-223
fastembed/sparse/sparse_embedding_base.py (1)
38-39: LGTM! Type annotations modernized correctly.The updates to
cache_dir,threads,documents,parallel, andqueryparameters correctly adopt the Python 3.10+ union syntax.Also applies to: 49-51, 71-71
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
1-1: LGTM! Type annotations modernized correctly.All type hint updates from
Optional[T]toT | Noneare correct and consistent. The removal ofOptionalfrom imports is appropriate.Also applies to: 56-58, 60-60, 105-105, 143-143
fastembed/text/onnx_embedding.py (1)
1-1: LGTM! Type annotations modernized correctly.All type hint updates throughout the
OnnxTextEmbeddingclass are syntactically correct and align with Python 3.10+ union syntax. The removal ofOptionalfrom imports is appropriate.Also applies to: 202-204, 206-206, 208-209, 242-242, 263-263, 265-265
fastembed/text/text_embedding_base.py (1)
1-1: LGTM! Type annotations modernized correctly.All type hint updates in the
TextEmbeddingBaseclass are correct and consistent with Python 3.10+ union syntax. The removal ofOptionalandUnionfrom imports is appropriate.Also applies to: 12-13, 20-20, 24-24, 26-26, 46-46
fastembed/image/onnx_image_model.py (1)
5-5: LGTM! Type annotations modernized correctly.All type hint updates in the
OnnxImageModelclass are syntactically correct and align with Python 3.10+ union syntax. The removal ofOptionalfrom imports is appropriate.Also applies to: 40-40, 54-55, 57-57, 94-94, 96-97, 99-99, 101-101
fastembed/image/onnx_embedding.py (1)
1-1: LGTM! Type annotations modernized correctly.All type hint updates in the
OnnxImageEmbeddingclass are syntactically correct and align with Python 3.10+ union syntax. The removal ofOptionalfrom imports is appropriate.Also applies to: 66-68, 70-70, 72-73, 107-107, 151-151, 153-153
fastembed/text/onnx_text_model.py (1)
4-4: LGTM! Type annotations modernized correctly.All type hint updates in the
OnnxTextModelclass are syntactically correct and align with Python 3.10+ union syntax. The removal ofOptionalfrom imports is appropriate.The return type on line 43 (
dict[str, NumpyArray | NDArray[np.int64]]) correctly uses the new union syntax for flexible preprocessing return types.Also applies to: 18-18, 38-38, 43-43, 53-54, 56-56, 105-105, 107-108, 110-110, 112-112
fastembed/late_interaction/colbert.py (1)
2-2: LGTM! Clean type annotation modernization.The conversion from
Optional[T]andUnion[A, B]to Python 3.10+ union syntax (T | None,A | B) is correctly applied throughout the file. All type changes maintain functional equivalence while aligning with modern Python typing conventions.Note: The static analysis warning about unused
kwargsinquery_embed(line 241) can be safely ignored—it's part of the API contract inherited from the base class for extensibility.Also applies to: 111-118, 152-152, 168-169, 172-172, 207-207, 209-209, 241-241
fastembed/sparse/bm42.py (1)
4-4: LGTM! Consistent type annotation updates.The PEP 604 union syntax conversion is correctly and consistently applied. The changes maintain functional equivalence while modernizing the type hints.
Note: The static analysis warning about unused
kwargsinquery_embed(line 317) can be safely ignored—it follows the same extensibility pattern as other embedding classes in the codebase.Also applies to: 60-68, 104-104, 275-275, 277-277, 317-317
fastembed/text/multitask_embedding.py (1)
2-2: LGTM! Type annotations correctly modernized.The conversion to PEP 604 union syntax is properly applied across all method signatures and attribute annotations. The task_id handling pattern is correct: the public
embedmethod accepts an optional task_id and defaults it toself.default_task_id, ensuring that by the time_preprocess_onnx_inputis called, a non-None value is present. The ValueError in_preprocess_onnx_input(line 67) serves as an appropriate safeguard.Also applies to: 48-48, 50-50, 63-63, 73-73, 75-76, 84-84
| @@ -1,5 +1,5 @@ | |||
| from dataclasses import dataclass | |||
| from typing import Iterable, Optional, Union, Any | |||
| from typing import Iterable, Any, Union | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Incomplete Union conversion in dataclass field.
Line 15 still uses the old Union[NDArray[np.int64], NDArray[np.int32]] syntax, which is inconsistent with the PR's goal to modernize type annotations. This should be updated to NDArray[np.int64] | NDArray[np.int32] for consistency.
Once updated, the Union import on line 2 can likely be removed (unless it's used elsewhere in the file).
Apply this diff:
-from typing import Iterable, Any, Union
+from typing import Iterable, Any @dataclass
class SparseEmbedding:
values: NumpyArray
- indices: Union[NDArray[np.int64], NDArray[np.int32]]
+ indices: NDArray[np.int64] | NDArray[np.int32]Also applies to: 15-15
🤖 Prompt for AI Agents
In fastembed/sparse/sparse_embedding_base.py around line 15, the dataclass field
still uses the old Union[...] syntax (Union[NDArray[np.int64],
NDArray[np.int32]]); update it to the modern PEP 604 pipe form NDArray[np.int64]
| NDArray[np.int32] and remove the now-unused Union import from line 2 if it
isn't referenced elsewhere in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataclass complaints on this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
Python dataclass PEP 604 union operator pipe syntax issues
💡 Result:
Short answer: this happens because the runtime | operator is being evaluated when annotations are executed, and one or both operands aren't real types that implement or (e.g. numpy.array is a function, not the ndarray class; some generics or runtime objects are types that don't support |). PEP 604 lets you write int | str in annotations, but only when the expression can be evaluated to a Union at runtime (or when annotations are postponed). See examples and discussion. [1][2][3]
Why it fails (concise):
- PEP 604 makes X | Y call type.or at runtime; that will raise TypeError if X or Y don't implement or (or are incompatible runtime objects). [2][3]
- Libraries like numpy expose numpy.array as a function/builtin, not the ndarray type, so numpy.array | list[tuple] triggers "unsupported operand type(s) for |". (Real reports and examples on Stack Overflow.) [1]
How to fix:
- Use typing.Union (typing.Union[np.ndarray, list[tuple]]) instead of |. [2]
- Use the actual types (e.g. np.ndarray) rather than factory functions (np.array). [1]
- Postpone evaluation of annotations so the "|" is treated as a type-hint expression (from future import annotations) and handled by type checkers rather than evaluated at class definition time; then use typing.get_type_hints() when you need resolved types. [3]
- Or annotate with strings (e.g. "np.ndarray | list[tuple]") if you must avoid runtime evaluation.
If you want, tell me the exact annotation you're using and your Python version and I’ll show the minimal change.
Sources: Stack Overflow example of the error and discussion; PEP 604; typing documentation. [1][2][3]
@joein, you're absolutely right! This happens because PEP 604's pipe operator is evaluated at runtime, and NumPy's NDArray types don't support the __or__ operator required for the pipe syntax.
Your PR already handles a similar case correctly in fastembed/common/onnx_model.py with the comment "dataclasses can't handle | as a replacement for Optional". The same issue applies here.
Recommendation: Keep Union for this dataclass field and add a clarifying comment:
from typing import Iterable, Any, Union
@dataclass
class SparseEmbedding:
values: NumpyArray
indices: Union[NDArray[np.int64], NDArray[np.int32]] # Union required: dataclasses can't handle | with NDArrayThe Union import should remain since it's needed for this field. This is the correct approach for the dataclass limitation, and your PR is handling it consistently.
✏️ Learnings added
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.959Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (1)
pyproject.toml(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.959Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.10.x on windows-latest test
🔇 Additional comments (1)
pyproject.toml (1)
14-14: Python 3.10 minimum version requirement correctly updated.The change from
>=3.9.0to>=3.10.0aligns with the PR objectives and enables use of modern type-hinting features (PEP 604 union syntax with|).
| numpy = [ | ||
| { version = ">=1.21", python = ">=3.10,<3.12" }, | ||
| { version = ">=1.26", python = ">=3.12,<3.13" }, | ||
| { version = ">=2.1.0", python = ">=3.13" }, | ||
| { version = ">=1.21,<2.1.0", python = "<3.10" }, | ||
| ] | ||
| onnxruntime = [ | ||
| { version = ">=1.17.0,<1.20.0", python = "<3.10" }, | ||
| { version = ">1.20.0", python = ">=3.13" }, | ||
| { version = ">=1.17.0,!=1.20.0", python = ">=3.10,<3.13" }, | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify that pinned dependency versions remain valid and secure.
The simplified numpy and onnxruntime constraints look reasonable for Python 3.10+, but since my knowledge cutoff is March 2025 and the current date is November 2025, security advisories or version incompatibilities may have emerged. Confirm that the specified versions are still current, maintained, and free from vulnerabilities.
🌐 Web query:
What are the latest secure versions of numpy and onnxruntime as of November 2025?
💡 Result:
- NumPy: 2.3.4 — released 15 Oct 2025. [1]
- ONNX Runtime: v1.23.2 — released 25 Oct 2025. [2]
Sources:
[1] NumPy news/releases.
[2] ONNX Runtime GitHub releases.
I need to verify security advisories and Python 3.13 compatibility for these latest versions before finalizing the review comment.
Numpy 2.3.4 dropped Python 3.10 support; constraint issue for Python 3.10
NumPy 2.3.4 supports Python versions 3.11-3.13, which means the pinned constraint numpy >=1.21 for Python 3.10 is problematic. While older numpy versions (1.21–1.25) technically satisfy this constraint, NumPy 2.1.0 supports Python 3.13 and NumPy 2.1 first introduced Python 3.13 support. For Python 3.10, either pin a specific older numpy version (1.26.x series) or accept that users on Python 3.10 cannot use the latest numpy releases.
For ONNX Runtime, no direct vulnerabilities have been found for numpy 2.3.4 or onnxruntime 1.17.3 in Snyk's vulnerability database. ONNX Runtime 1.23.2 now supports Python 3.13, and the current constraints appropriately support Python 3.10–3.13.
Consider adding an explicit upper bound on numpy for Python 3.10 (e.g., numpy >=1.21,<2.0 for Python 3.10) to prevent unintended downgrades.
No description provided.