bson.decode_file_iter: allow to bypass read/decode errors #433


Closed

Conversation

nicolas33

If a file is corrupted or cannot be processed by the Python driver, the iterator used to stop processing the file at the first error. With the new yield_errors optional argument the user can bypass errors.

Signed-off-by: Nicolas Sebrecht [email protected]

@ShaneHarvey
Member

Hi @nicolas33, thanks for the pull. Unfortunately, I don't think we're going to merge it for a few reasons:

  • We generally don't like to change the return value types of a function based on an argument. We could fix this by adding this feature as a new function.
  • I'm not convinced that this feature is generally useful enough or robust enough for other users to rely on. For example, if one document in the file is corrupted it can cause the processing of all the subsequent documents to fail too: if one document's size is changed we may not be able to decode any of the following documents. The third call to decode_yield_errors below highlights this issue.
  • Applications can already implement this feature by writing a custom version of decode_file_iter on top of bson.decode. This requires only a minimal amount of knowledge about the BSON format (the code just needs to duplicate the BSON object size parsing of the first 4 bytes). See the example below.

You can implement your own version of decode_file_iter like this:

import struct
import io

import bson
from bson.errors import InvalidBSON


_UNPACK_INT = struct.Struct("<i").unpack


def decode_yield_errors(file_obj, codec_options=bson.DEFAULT_CODEC_OPTIONS):
    while True:
        # Read the size of the next BSON object.
        # http://bsonspec.org/spec.html
        size_data = file_obj.read(4)
        if not size_data:
            break  # Finished with file normally.

        # Decode the next document, yielding (dict, invalid bytes,
        # exception)
        elements = size_data
        try:
            if len(size_data) != 4:
                raise InvalidBSON("cut off in middle of objsize")
            obj_size = _UNPACK_INT(size_data)[0] - 4
            elements = size_data + file_obj.read(max(0, obj_size))
            yield bson.decode(elements, codec_options), b'', None
        except InvalidBSON as exc:
            # Yield the invalid bytes and the exception.
            yield {}, elements, exc


# Test valid file.
valid = io.BytesIO(bson.encode({'a': 1})*2)
print(list(decode_yield_errors(valid)))

# Test invalid file.
invalid = io.BytesIO(bson.encode({'a': 1}) + b'invalid')
print(list(decode_yield_errors(invalid)))

# Corrupting a BSON object's size will (very likely) cause all subsequent
# documents to fail as well.
invalid_bytes = (
        b'\x0d\x00\x00\x00' + bson.encode({'a': 1})[4:] + bson.encode({'a': 1}))
invalid = io.BytesIO(invalid_bytes)
print(list(decode_yield_errors(invalid)))

Output:

$  python3.8 decode_ingore.py
[({'a': 1}, b'', None), ({'a': 1}, b'', None)]
[({'a': 1}, b'', None), ({}, b'invalid', InvalidBSON('objsize too large'))]
[({}, b'\r\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00\x0c', InvalidBSON('bad eoo')), ({}, b'\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00', InvalidBSON('objsize too large'))]

Note that if one document in the file is corrupted it can cause the processing of all the subsequent documents to fail too, e.g. if one document's size is changed then we may not be able to decode any of the following documents. This is another reason I'm not convinced of this feature.
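To make this concrete, here is a minimal byte-level sketch (separate from the example above) of why a corrupted size prefix misaligns every subsequent read:

import struct

import bson

doc = bson.encode({'a': 1})
print(len(doc))  # 12 bytes: size(4) + type(1) + b'a\x00'(2) + int32(4) + eoo(1)

# The corrupted file above claims a size of 13 (0x0d) instead of 12.
claimed_size = struct.Struct("<i").unpack(b'\x0d\x00\x00\x00')[0]
print(claimed_size)  # 13

# A reader that trusts the corrupted prefix consumes one byte too many,
# so the next document's 4-byte size field is read one byte off and every
# read after that is misaligned.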

Before we close this issue though, I'd like to know more about why you want this feature. What is the use case you're trying to address?

@nicolas33
Author

Before we close this issue though, I'd like to know more about why you want this feature. What is the use case you're trying to address?

I've been able to use a BSON file containing a lot of documents by ignoring many decode issues with my own iterator. Without this, loading the data would not be possible at all. I don't care much what you do about this PR; I can maintain my own iterator. That is not possible without importing private variables from the module, though.

@ShaneHarvey
Member

I can maintain my own iterator. That is not possible without importing private variables from the module, though.

You don't need to use private variables. My example above only uses public ones (bson.decode, bson.DEFAULT_CODEC_OPTIONS, and bson.errors.InvalidBSON).

I've been able to use a BSON file containing a lot of documents by ignoring many decode issues with my own iterator.

What kind of decode errors are you trying to recover from? You can ignore unicode decode errors with the unicode_decode_error_handler option:

from bson import decode_file_iter
from bson.codec_options import CodecOptions
opts = CodecOptions(unicode_decode_error_handler='ignore')
...
decode_file_iter(file_obj, codec_options=opts)
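For example, here is a self-contained sketch (the corrupted test data below is made up) showing the handler in action on a string value that is not valid UTF-8:

import io

import bson
from bson import decode_file_iter
from bson.codec_options import CodecOptions

# Encode a document, then corrupt the string's bytes so they are not
# valid UTF-8 (same byte length, so the size prefixes stay consistent).
data = bson.encode({'a': 'abc'}).replace(b'abc', b'\xff\xfe\xfd')

opts = CodecOptions(unicode_decode_error_handler='replace')
print(list(decode_file_iter(io.BytesIO(data), codec_options=opts)))
# Expected: [{'a': '\ufffd\ufffd\ufffd'}]; with 'ignore' the value would be ''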

@nicolas33
Author

You don't need to use private variables. My example above only uses public ones (bson.decode, bson.DEFAULT_CODEC_OPTIONS, and bson.errors.InvalidBSON)

You didn't import _UNPACK_INT.

What kind of decode errors are you trying to recover from?

I'm avoiding bson decode errors like unsupported datetimes.

@behackett
Member

I'm avoiding bson decode errors like unsupported datetimes.

That's interesting. We've discussed possibly providing a configuration option to avoid this. Perhaps an opt-in that would cause out of range datetimes to just be returned as datetime.datetime.min or datetime.datetime.max. It would have to be opt-in since round tripping the value would change the value in the server.
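Roughly, the clamping could behave like this sketch (a hypothetical illustration, not an agreed design; the bounds are just Python's datetime limits expressed as milliseconds since the epoch):

import datetime

_EPOCH = datetime.datetime(1970, 1, 1)
_MIN_MS = int((datetime.datetime.min - _EPOCH).total_seconds() * 1000)
_MAX_MS = int((datetime.datetime.max - _EPOCH).total_seconds() * 1000)


def clamped_datetime(millis):
    # BSON stores datetimes as int64 milliseconds since the Unix epoch;
    # values outside Python's representable range would be clamped
    # instead of raising during decoding.
    if millis <= _MIN_MS:
        return datetime.datetime.min
    if millis >= _MAX_MS:
        return datetime.datetime.max
    return _EPOCH + datetime.timedelta(milliseconds=millis)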

@nicolas33
Author

We've discussed possibly providing a configuration option to avoid this. Perhaps an opt-in that would cause out of range datetimes to just be returned as datetime.datetime.min or datetime.datetime.max.

It would be better to return something like "err: datetime above MAX" or "err: datetime under MIN", or better yet the error itself, to let the user know that the Python driver is the one at fault and that the data might be valid.

It would have to be opt-in since round tripping the value would change the value in the server.

AFAICT there's no server involved for bson. In the MongoDB use case I guess that improving the current datetime support to whatever fits best is fine. IMHO, this isn't enough to reject the iterator of this PR, because it provides more flexibility to the user.

@nicolas33
Author

You didn't import _UNPACK_INT.

(Of course, I assume that rebuilding _UNPACK_INT is worse than importing the value from bson.)

@ShaneHarvey
Member

ShaneHarvey commented Dec 13, 2019

PYTHON-1824 is the ticket to follow where we are planning to add a workaround for pymongo's datetime range limitations.

I assume that rebuilding _UNPACK_INT is worse than importing the value from bson.

It's definitely worse (imo) to use the private bson attributes because there are no compatibility guarantees on our private APIs. We can and do change private things even in patch releases. You should define _UNPACK_INT as in the example above (_UNPACK_INT = struct.Struct("<i").unpack).

@nicolas33
Author

nicolas33 commented Dec 14, 2019

You should define _UNPACK_INT as in the example above

If the definition of _UNPACK_INT has to change, then chances are that the copy redefined from struct becomes wrong too. Neither way is satisfactory. The private definition would at least follow whatever changes keep it correct, unlike a re-definition snapshotted at some random point in time.

Actually, the full "write your own iterator" approach is not very satisfactory for users, especially when we know that the Python driver doesn't correctly handle the full BSON format.

As said previously, I don't care much about this PR; I can maintain my own variation of the iterator. IMHO, the maintainers should worry more than I do about handling the weird cases coming from the tool itself.

@ShaneHarvey
Member

If the definition of _UNPACK_INT has to change then there is all the chances that the redefined definition from struct becomes wrong too. ...

Yes that is an interesting and totally valid perspective.

As stated earlier, we're closing this PR and plan to address the underlying problem described in PYTHON-1824 instead. Thanks for all the info @nicolas33.
