bson.decode_file_iter: allow to bypass read/decode errors #433


Closed

Conversation

nicolas33

If a file is corrupted or cannot be processed by the Python driver, the iterator used to stop processing the file at the first error. With the new yield_errors optional argument the user can bypass errors.

Signed-off-by: Nicolas Sebrecht [email protected]

@ShaneHarvey
Member

Hi @nicolas33, thanks for the pull. Unfortunately, I don't think we're going to merge it for a few reasons:

  • We generally don't like to change the return value types of a function based on an argument. We could fix this by adding this feature as a new function.
  • I'm not convinced that this feature is generally useful enough or robust enough for other users to rely on. For example, if one document in the file is corrupted it can cause the processing of all the subsequent documents to fail too: if one document's size is changed we may not be able to decode any of the following documents. The third call to decode_yield_errors below highlights this issue.
  • Applications can already implement this feature by writing a custom version of decode_file_iter on top of bson.decode. This requires only a minimal amount of knowledge about the BSON format (the code just needs to duplicate the BSON object size parsing of the first 4 bytes). See the example below.

You can implement your own version of decode_file_iter like this:

import struct
import io

import bson
from bson.errors import InvalidBSON


_UNPACK_INT = struct.Struct("<i").unpack


def decode_yield_errors(file_obj, codec_options=bson.DEFAULT_CODEC_OPTIONS):
    while True:
        # Read the size of the next BSON object.
        # http://bsonspec.org/spec.html
        size_data = file_obj.read(4)
        if not size_data:
            break  # Finished with file normally.

        # Decode the next document, yielding (dict, invalid bytes,
        # exception)
        elements = size_data
        try:
            if len(size_data) != 4:
                raise InvalidBSON("cut off in middle of objsize")
            obj_size = _UNPACK_INT(size_data)[0] - 4
            elements = size_data + file_obj.read(max(0, obj_size))
            yield bson.decode(elements, codec_options), b'', None
        except InvalidBSON as exc:
            # Yield the invalid bytes and the exception.
            yield {}, elements, exc


# Test valid file.
valid = io.BytesIO(bson.encode({'a': 1})*2)
print(list(decode_yield_errors(valid)))

# Test invalid file.
invalid = io.BytesIO(bson.encode({'a': 1}) + b'invalid')
print(list(decode_yield_errors(invalid)))

# Corrupting a BSON object's size will (very likely) cause all subsequent
# documents to fail as well.
invalid_bytes = (
        b'\x0d\x00\x00\x00' + bson.encode({'a': 1})[4:] + bson.encode({'a': 1}))
invalid = io.BytesIO(invalid_bytes)
print(list(decode_yield_errors(invalid)))

Output:

$  python3.8 decode_ingore.py
[({'a': 1}, b'', None), ({'a': 1}, b'', None)]
[({'a': 1}, b'', None), ({}, b'invalid', InvalidBSON('objsize too large'))]
[({}, b'\r\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00\x0c', InvalidBSON('bad eoo')), ({}, b'\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00', InvalidBSON('objsize too large'))]

Note that if one document in the file is corrupted it can cause the processing of all the subsequent documents to fail too, e.g. if one document's size is changed then we may not be able to decode any of the following documents. This is another reason I'm not convinced of this feature.
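To make this concrete, here is a minimal byte-level sketch (separate from the example above) of why a corrupted size prefix misaligns every subsequent read:

import struct

import bson

doc = bson.encode({'a': 1})
print(len(doc))  # 12 bytes: size(4) + type(1) + b'a\x00'(2) + int32(4) + eoo(1)

# The corrupted file above claims a size of 13 (0x0d) instead of 12.
claimed_size = struct.Struct("<i").unpack(b'\x0d\x00\x00\x00')[0]
print(claimed_size)  # 13

# A reader that trusts the corrupted prefix consumes one byte too many,
# so the next document's 4-byte size field is read one byte off and every
# read after that is misaligned.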

Before we close this issue though, I'd like to know more about why you want this feature. What is the use case you're trying to address?

@nicolas33
Author

Before we close this issue though, I'd like to know more about why you want this feature. What is the use case you're trying to address?

I've been able to use a BSON file containing a lot of documents by ignoring many decode issues with my own iterator. Without this, loading the data would not be possible at all. I don't care much what you do about this PR; I can maintain my own iterator. That is not possible without importing private variables from the module, though.

@ShaneHarvey
Member

I can maintain my own iterator. That is not possible without importing private variables from the module, though.

You don't need to use private variables. My example above only uses public ones (bson.decode, bson.DEFAULT_CODEC_OPTIONS, and bson.errors.InvalidBSON).

I've been able to use a BSON file containing a lot of documents by ignoring many decode issues with my own iterator.

What kind of decode errors are you trying to recover from? You can ignore unicode decode errors with the unicode_decode_error_handler option:

from bson import decode_file_iter
from bson.codec_options import CodecOptions
opts = CodecOptions(unicode_decode_error_handler='ignore')
...
decode_file_iter(file_obj, codec_options=opts)
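For example, here is a self-contained sketch (the corrupted test data below is made up) showing the handler in action on a string value that is not valid UTF-8:

import io

import bson
from bson import decode_file_iter
from bson.codec_options import CodecOptions

# Encode a document, then corrupt the string's bytes so they are not
# valid UTF-8 (same byte length, so the size prefixes stay consistent).
data = bson.encode({'a': 'abc'}).replace(b'abc', b'\xff\xfe\xfd')

opts = CodecOptions(unicode_decode_error_handler='replace')
print(list(decode_file_iter(io.BytesIO(data), codec_options=opts)))
# Expected: [{'a': '\ufffd\ufffd\ufffd'}]; with 'ignore' the value would be ''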

@nicolas33
Author

You don't need to use private variables. My example above only uses public ones (bson.decode, bson.DEFAULT_CODEC_OPTIONS, and bson.errors.InvalidBSON)

You didn't import _UNPACK_INT.

What kind of decode errors are you trying to recover from?

I'm avoiding bson decode errors like unsupported datetimes.

@behackett
Member

I'm avoiding bson decode errors like unsupported datetimes.

That's interesting. We've discussed possibly providing a configuration option to avoid this. Perhaps an opt-in that would cause out of range datetimes to just be returned as datetime.datetime.min or datetime.datetime.max. It would have to be opt-in since round tripping the value would change the value in the server.
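Roughly, the clamping could behave like this sketch (a hypothetical illustration, not an agreed design; the bounds are just Python's datetime limits expressed as milliseconds since the epoch):

import datetime

_EPOCH = datetime.datetime(1970, 1, 1)
_MIN_MS = int((datetime.datetime.min - _EPOCH).total_seconds() * 1000)
_MAX_MS = int((datetime.datetime.max - _EPOCH).total_seconds() * 1000)


def clamped_datetime(millis):
    # BSON stores datetimes as int64 milliseconds since the Unix epoch;
    # values outside Python's representable range would be clamped
    # instead of raising during decoding.
    if millis <= _MIN_MS:
        return datetime.datetime.min
    if millis >= _MAX_MS:
        return datetime.datetime.max
    return _EPOCH + datetime.timedelta(milliseconds=millis)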

@nicolas33
Author

We've discussed possibly providing a configuration option to avoid this. Perhaps an opt-in that would cause out of range datetimes to just be returned as datetime.datetime.min or datetime.datetime.max.

It would be better to return something like "err: datetime above MAX" or "err: datetime under MIN", or better yet the error itself, to let the user know that the Python driver is the one at fault and that the data might be valid.

It would have to be opt-in since round tripping the value would change the value in the server.

AFAICT there's no server involved for bson. In the MongoDB use case I guess that improving the current datetime support to whatever fits best is fine. IMHO, this isn't enough to reject the iterator of this PR, because it provides more flexibility to the user.

@nicolas33
Author

You didn't import _UNPACK_INT.

(Of course, I assume that rebuilding _UNPACK_INT is worse than importing the value from bson.)

@ShaneHarvey
Member

ShaneHarvey commented Dec 13, 2019

PYTHON-1824 is the ticket to follow where we are planning to add a workaround for pymongo's datetime range limitations.

I assume that rebuilding _UNPACK_INT is worse than importing the value from bson.

It's definitely worse (imo) to use the private bson attributes because there are no compatibility guarantees on our private APIs. We can and do change private things even in patch releases. You should define _UNPACK_INT as in the example above (_UNPACK_INT = struct.Struct("<i").unpack).

@nicolas33
Author

nicolas33 commented Dec 14, 2019

You should define _UNPACK_INT as in the example above

If the definition of _UNPACK_INT has to change, then chances are that the copy redefined from struct becomes wrong too. Neither way is satisfactory. The private definition would at least follow whatever changes keep it correct, unlike a re-definition snapshotted at some random point in time.

Actually, the full "write your own iterator" approach is not very satisfactory for users, especially when we know that the Python driver doesn't correctly handle the full BSON format.

As said previously, I don't care much about this PR; I can maintain my own variation of the iterator. IMHO, the maintainers should worry more than I do about handling the weird cases coming from the tool itself.

@ShaneHarvey
Member

If the definition of _UNPACK_INT has to change then there is all the chances that the redefined definition from struct becomes wrong too. ...

Yes that is an interesting and totally valid perspective.

As stated earlier, we're closing this PR and plan to address the underlying problem described in PYTHON-1824 instead. Thanks for all the info @nicolas33.
