FEAT: add text embedder #1694


Open · wants to merge 1 commit into main
Conversation

@rhajou (Contributor) commented on May 2, 2025

Related Issues

Proposed Changes:

Added the Google Text Embedder

How did you test it?

Unit tests

Notes for the reviewer

I didn't add the Document Embedder; is it needed? A rough usage sketch of the text embedder follows below.
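
For context, here is a rough sketch of how the new text embedder might be used. The import path, class name, and output key are assumptions based on the linked issue and the usual Haystack conventions, not necessarily what this PR exposes:

# Hypothetical usage; import path and class name are assumptions based on the
# "Add a GoogleAIGeminiDocumentEmbedder and GoogleAIGeminiTextEmbedder" issue.
from haystack_integrations.components.embedders.google_ai import GoogleAIGeminiTextEmbedder

embedder = GoogleAIGeminiTextEmbedder(model="models/embedding-001")
result = embedder.run(text="What is the capital of France?")
print(len(result["embedding"]))  # Haystack text embedders conventionally return a single vector under "embedding"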

Checklist

@rhajou requested a review from a team as a code owner on May 2, 2025 at 15:16
@rhajou requested a review from @mpangrazzi and removed the request for the team on May 2, 2025 at 15:16
The github-actions bot added the integration:google-ai and type:documentation labels on May 2, 2025
@mpangrazzi (Contributor) left a comment:

I've left some comments! I recommend reading the official docs first, then updating the implementation and the tests. I also recommend adding an example (and testing it out) or an integration test. LMK if something is not clear! 😉
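
For example, a minimal integration test could look roughly like this. The component name, the "embedding" output key, and the GOOGLE_API_KEY gating are assumptions; adapt them to the actual implementation:

import os

import pytest

@pytest.mark.skipif(not os.environ.get("GOOGLE_API_KEY"), reason="requires a real Google AI API key")
def test_embed_text_live():
    # Hypothetical component name; calling the real API once catches response-shape mistakes
    embedder = GoogleAIGeminiTextEmbedder(model="models/embedding-001")
    result = embedder.run(text="hello world")
    assert isinstance(result["embedding"], list)
    assert all(isinstance(value, float) for value in result["embedding"])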

@@ -23,7 +23,7 @@ classifiers = [
 "Programming Language :: Python :: Implementation :: CPython",
 "Programming Language :: Python :: Implementation :: PyPy",
 ]
-dependencies = ["haystack-ai>=2.9.0", "google-generativeai>=0.3.1"]
+dependencies = ["haystack-ai>=2.9.0", "google-generativeai>=0.3.1", "google-genai==1.13.0"]

Adding google-genai==1.13.0 with exact version pinning could cause conflicts. What about google-genai>=1.13.0?

:param model: The name of the Google AI embedding model to use.
Defaults to "models/embedding-001".
:param api_key: The Google AI API key. It can be explicitly provided or automatically read from the
`GOOGLE_API_KEY` environment variable.

Note: above you're initializing this from GEMINI_API_KEY, but here you are referring to GOOGLE_API_KEY.
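
One way to keep these consistent is to accept either variable. A minimal sketch, assuming Haystack's Secret utility (the fallback order shown is just an illustration):

from haystack.utils import Secret

# Check GOOGLE_API_KEY first, then fall back to GEMINI_API_KEY, so the docstring
# and the default stay in sync (the order here is an assumption).
api_key: Secret = Secret.from_env_var(["GOOGLE_API_KEY", "GEMINI_API_KEY"])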

configs.title = self.title
elif self.title and self.task_type != "retrieval_document":
warnings.warn(
UserWarning("Warning: Title 'Should Be Ignored' is ignored because task_type is 'retrieval_query'"),

Shouldn't this be f"Warning: title '{self.title}' is ignored..."?
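
Something along these lines, interpolating the actual title and task type (a sketch based on the snippet above):

import warnings

if self.title and self.task_type != "retrieval_document":
    warnings.warn(
        UserWarning(f"Title '{self.title}' is ignored because task_type is '{self.task_type}'"),
        stacklevel=2,
    )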

raise RuntimeError(msg) from e

# Extract embeddings - result.embedding should be the list of lists
embeddings = result.get("embedding") # Use .get for safety, returns None if key missing

According to the docs, result should be an object and not a dict, so you should do result.embeddings.

I see that in the tests you're mocking this response (so the tests pass), but have you tried it outside tests (e.g. in an integration test or an example)?
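
For reference, the object-style extraction could look roughly like this (the .embeddings / .values attribute names follow the google-genai docs; worth double-checking against the pinned version):

# `result` is the object returned by client.models.embed_content(...); it is not a dict.
if result.embeddings is None:
    msg = "Google AI returned no embeddings."
    raise RuntimeError(msg)

# Each entry is a ContentEmbedding object; the float vector lives under `.values`.
embeddings = [embedding.values for embedding in result.embeddings]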

texts = ["text 1", "text 2"]
expected_embeddings = [[0.1, 0.2], [0.3, 0.4]]
# Configure the mock embed_content method to return a successful response
mock_client_instance.models.embed_content.return_value = {"embedding": expected_embeddings}

This is the wrong mocking I was mentioning above.

According to docs, a correct mock should be something like:

mock_response = MagicMock()
mock_response.embeddings = None # or e.g. [[0.1, 0.2], [0.3, 0.4]]
mock_client_instance.models.embed_content.return_value = mock_response

Can you please update tests accordingly?
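
If it helps, here is a slightly fuller sketch whose shape matches the object-style response (assuming each embedding exposes its vector under .values, per the google-genai docs):

from types import SimpleNamespace
from unittest.mock import MagicMock

expected_embeddings = [[0.1, 0.2], [0.3, 0.4]]

# Mimic an EmbedContentResponse: `.embeddings` is a list of objects, each carrying `.values`.
mock_response = MagicMock()
mock_response.embeddings = [SimpleNamespace(values=vector) for vector in expected_embeddings]
mock_client_instance.models.embed_content.return_value = mock_response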

Labels
integration:google-ai, type:documentation

Development
Successfully merging this pull request may close these issues:
Add a GoogleAIGeminiDocumentEmbedder and GoogleAIGeminiTextEmbedder