Skip to content

PYTHON-5172 bugfix: Add __repr__ and __eq__ to bson.binary.BinaryVector #2162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 10, 2025

Conversation

caseyclements
Copy link
Contributor

No ticket. I just noticed this while working on #2161

@caseyclements
Copy link
Contributor Author

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

@blink1073
Copy link
Member

Can you add the original PYTHON ticket for BSON Binary Vector if we're not adding a new ticket?

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a new python ticket for this because it's a new feature.

bson/binary.py Outdated
@@ -247,6 +247,9 @@ def __init__(self, data: Sequence[float | int], dtype: BinaryVectorDtype, paddin
self.dtype = dtype
self.padding = padding

def __repr__(self) -> str:
return f"BinaryVector - {self.dtype=}, {self.dtype},{self.padding=}, {self.data=}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to follow Python's convention for repr: https://docs.python.org/3/library/functions.html#repr

And we should add tests for this like we do for our other custom type reprs.

@NoahStapp
Copy link
Contributor

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

@caseyclements
Copy link
Contributor Author

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

@caseyclements
Copy link
Contributor Author

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

Our pre-commit automatically converts str.format into f-string! What's the workaround?

@NoahStapp
Copy link
Contributor

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

Our pre-commit automatically converts str.format into f-string! What's the workaround?

You can ignore specific ruff rules with # noqa: <rule> on the offending line.

@caseyclements
Copy link
Contributor Author

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

Our pre-commit automatically converts str.format into f-string! What's the workaround?

You can ignore specific ruff rules with # noqa: <rule> on the offending line.

Should we consider adding the following to pyproject.toml?

[tool.ruff]
ignore = ["RUF100"]

@NoahStapp
Copy link
Contributor

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

Our pre-commit automatically converts str.format into f-string! What's the workaround?

You can ignore specific ruff rules with # noqa: <rule> on the offending line.

Should we consider adding the following to pyproject.toml?

[tool.ruff]
ignore = ["RUF100"]

Avoiding cluttering the codebase with unused directives is a pattern we follow consistently, are you seeing linting failures without this change?

@caseyclements
Copy link
Contributor Author

@NoahStapp One question in my mind. Is using an f-string an issue? If the vector is really large, will __repr__ create the string even when it's not called?

f-strings are always evaluated, even if never used. In this case, I don't know if the performance benefit of f-strings is offset by a very large vector being parsed into a string at runtime. It's probably worth running some quick benchmarks to see if there is a performance difference here.

Yeah. I'm going to switch to format. These vectors could get massive. I don't want users to think mongodb is slow when they've simply got something in a logger.warning statement..

Our pre-commit automatically converts str.format into f-string! What's the workaround?

You can ignore specific ruff rules with # noqa: <rule> on the offending line.

Should we consider adding the following to pyproject.toml?

[tool.ruff]
ignore = ["RUF100"]

Avoiding cluttering the codebase with unused directives is a pattern we follow consistently, are you seeing linting failures without this change?

Wrong rule. That was from ChatGPT. UP032 is the f-string one. I don't know. It feels like when someone is using format they probably have chosen to do so for a reason.

bson/binary.py Outdated
@@ -247,6 +247,11 @@ def __init__(self, data: Sequence[float | int], dtype: BinaryVectorDtype, paddin
self.dtype = dtype
self.padding = padding

def __repr__(self) -> str:
return "BinaryVector(dtype={}, padding={}, data={})".format( # noqa: UP032
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will repr create the string even when it's not called?

f-string vs format makes no difference. If anything f-string will be faster. __repr__ only runs when repr() is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__ only runs when repr() is called.

So if one has logger.debug(vector), is it not true that str.format won't be called, but the f-string will be created?

Copy link
Member

@ShaneHarvey ShaneHarvey Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging behavior is unrelated to this PR since we're just adding a new method. The behavior of repr(vector) is the same whether we use f-string vs format() so let's go back to using f-string.

@caseyclements caseyclements changed the title Adds __repr__ to BinaryVector dataclass PYTHON-5172 Adds __repr__ to BinaryVector dataclass Feb 28, 2025
@caseyclements
Copy link
Contributor Author

Test failures are async ones, unrelated to these.

four = BinaryVector(data, BinaryVectorDtype.PACKED_BIT, padding=3)
self.assertEqual(
repr(four), f"BinaryVector(dtype=BinaryVectorDtype.PACKED_BIT, padding=3, data={data})"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add asserts using assertRepr like we do in test_connection_monitoring.py?:

    def assertRepr(self, obj):
        new_obj = eval(repr(obj))
        self.assertEqual(type(new_obj), type(obj))
        self.assertEqual(repr(new_obj), repr(obj))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. That's very cool, that one can create an object using eval(repr(obj))

@ShaneHarvey ShaneHarvey changed the title PYTHON-5172 Adds __repr__ to BinaryVector dataclass PYTHON-5172 Add __repr__ to BinaryVector dataclass Feb 28, 2025
@@ -247,6 +247,9 @@ def __init__(self, data: Sequence[float | int], dtype: BinaryVectorDtype, paddin
self.dtype = dtype
self.padding = padding

def __repr__(self) -> str:
return f"BinaryVector(dtype={self.dtype}, padding={self.padding}, data={self.data})"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold up, the real bug here is that we're using @dataclass but also defining __init__. Why is that? The whole point of dataclass is that it provides __init__ and __repr__ automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we had issues with the combo of {dataclass, slots, defaults}. So I've just removed the dataclass decorator.

Copy link
Member

@ShaneHarvey ShaneHarvey Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other features that will break by removing @dataclass? We may be stuck with it until 5.0 if it's a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. As mentioned in the closed PR, we've done init, and repr, and we don't want to rely on equality except via the binary representation.

Copy link
Contributor Author

@caseyclements caseyclements Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems happy to be a dataclass or not. I have no strong preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@ShaneHarvey ShaneHarvey Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show an example of == and > before and after this change? I'm wondering if the old code (with @dataclass) even works at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. It's easy to try. Put a breakpoint on line 83 of test_bson_binary.

# With the decorator:
vector_obs == vector_exp.as_vector()
True
vector_obs > vector_exp.as_vector()
# TypeError: '>' not supported between instances of 'BinaryVector' and 'BinaryVector'

# Without the decorator:
vector_obs == vector_exp.as_vector()
False
vector_obs > vector_exp.as_vector()
# TypeError: '>' not supported between instances of 'BinaryVector' and 'BinaryVector'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== is already broken with our current @dataclass approach:

>>> from bson.binary import *
>>> BinaryVector([1], BinaryVectorDtype.INT8) == BinaryVector([2], BinaryVectorDtype.INT8)
True

So we need to remove dataclass and implement == as well as __repr__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Take a look. I hate subtleties with == and floats..

@caseyclements caseyclements requested review from blink1073 and removed request for blink1073 March 4, 2025 16:49
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Could you add a note for this in the changelog?

@caseyclements
Copy link
Contributor Author

Looking good! Could you add a note for this in the changelog?

Sure! Is this targeting 4.11 or 4.12? If the former, should I bump the micro version and add one line for description and another for jira? If the latter, should I additionallyx start up a whole new section?

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's add this to changelog later since I'm adding the new 4.12 section in another PR right now and we may end up being backporting this as a bug fix. Could you update the PR title and Jira ticket to say this is a bug fix for repr and ==?

bson/binary.py Outdated
def __repr__(self) -> str:
return f"BinaryVector(dtype={self.dtype}, padding={self.padding}, data={self.data})"

def __eq__(self, other: BinaryVector) -> bool:
Copy link
Member

@ShaneHarvey ShaneHarvey Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typing failure:

bson/binary.py:251: error: Argument 1 of "__eq__" is incompatible with
supertype "object"; supertype defines the argument type as "object"  [override]
        def __eq__(self, other: BinaryVector) -> bool:
                         ^~~~~~~~~~~~~~~~~~~

@caseyclements caseyclements changed the title PYTHON-5172 Add __repr__ to BinaryVector dataclass PYTHON-5172 bugfix: Add __repr__ and __eq__ to bson.binary.BinaryVector Mar 10, 2025
@caseyclements
Copy link
Contributor Author

On second thought, let's add this to changelog later since I'm adding the new 4.12 section in another PR right now and we may end up being backporting this as a bug fix. Could you update the PR title and Jira ticket to say this is a bug fix for repr and ==?

Updated names. In the JIRA, I marked it as bug instead of improvement. Please adjust if that's not what you meant.

@caseyclements caseyclements merged commit b66a5cb into mongodb:master Mar 10, 2025
35 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants