bson.decode_file_iter: allow bypassing read/decode errors #433
Conversation
If a file is corrupted or cannot be processed by the Python driver, the iterator stops processing the file at the first error. With the new `yield_errors` optional argument the user can bypass errors. Signed-off-by: Nicolas Sebrecht <[email protected]>
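A usage sketch of the proposed argument (hypothetical: `yield_errors` exists only in this PR's patch, and the exact shape of the yielded error values is defined there, not here):

```python
import bson

# Hypothetical: yield_errors is the optional argument proposed in this
# PR; it is not part of the released bson API.
with open('dump.bson', 'rb') as file_obj:
    for item in bson.decode_file_iter(file_obj, yield_errors=True):
        print(item)  # errors are yielded instead of aborting the loop
```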
Hi @nicolas33, thanks for the pull. Unfortunately, I don't think we're going to merge it for a few reasons:
You can implement your own version of decode_file_iter like this:

```python
import struct
import io

import bson
from bson.errors import InvalidBSON

_UNPACK_INT = struct.Struct("<i").unpack


def decode_yield_errors(file_obj, codec_options=bson.DEFAULT_CODEC_OPTIONS):
    while True:
        # Read the size of the next BSON object.
        # http://bsonspec.org/spec.html
        size_data = file_obj.read(4)
        if not size_data:
            break  # Finished with file normally.
        # Decode the next document, yielding (dict, invalid bytes, exception).
        elements = size_data
        try:
            if len(size_data) != 4:
                raise InvalidBSON("cut off in middle of objsize")
            obj_size = _UNPACK_INT(size_data)[0] - 4
            elements = size_data + file_obj.read(max(0, obj_size))
            yield bson.decode(elements, codec_options), b'', None
        except InvalidBSON as exc:
            # Yield the invalid bytes and the exception.
            yield {}, elements, exc


# Test valid file.
valid = io.BytesIO(bson.encode({'a': 1}) * 2)
print(list(decode_yield_errors(valid)))

# Test invalid file.
invalid = io.BytesIO(bson.encode({'a': 1}) + b'invalid')
print(list(decode_yield_errors(invalid)))

# Corrupting a BSON object's size will (very likely) cause all subsequent
# documents to fail as well.
invalid_bytes = (
    b'\x0d\x00\x00\x00' + bson.encode({'a': 1})[4:] + bson.encode({'a': 1}))
invalid = io.BytesIO(invalid_bytes)
print(list(decode_yield_errors(invalid)))
```

Output:
Note that if one document in the file is corrupted, it can cause the processing of all the subsequent documents to fail too; e.g., if one document's size is changed, then we may not be able to decode any of the following documents. This is another reason that I'm not convinced of this feature. Before we close this issue, though, I'd like to know more about why you want this feature. What is the use case you're trying to address?
I've been able to use a BSON file containing a lot of documents by ignoring many decode issues with my own iterator. Without this, loading the data would not be possible at all. I don't care much what you do about this PR; I can maintain my own iterator. That is not possible without importing private variables from the module, though.
You don't need to use private variables. My example above only uses public ones (it redefines `_UNPACK_INT` locally instead of importing the private name from bson).
What kind of decode errors are you trying to recover from? You can ignore unicode decode errors with the `unicode_decode_error_handler` codec option:

```python
from bson import decode_file_iter
from bson.codec_options import CodecOptions

opts = CodecOptions(unicode_decode_error_handler='ignore')
...
decode_file_iter(file_obj, codec_options=opts)
```
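A quick runnable illustration of that handler (a sketch; the hand-crafted invalid bytes below are my own, not from the thread):

```python
import bson
from bson.codec_options import CodecOptions

# Hand-craft a document whose string payload is invalid UTF-8
# (0xff never appears in well-formed UTF-8); the byte length is
# unchanged, so the BSON structure stays valid.
data = bson.encode({'s': 'abc'}).replace(b'abc', b'a\xffc')

opts = CodecOptions(unicode_decode_error_handler='ignore')
print(bson.decode(data, codec_options=opts))  # {'s': 'ac'}
```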
You didn't import
I'm avoiding BSON decode errors, like unsupported datetimes.
That's interesting. We've discussed possibly providing a configuration option to avoid this. Perhaps an opt-in that would cause out-of-range datetimes to just be returned as datetime.datetime.min or datetime.datetime.max. It would have to be opt-in, since round-tripping the value would change the value in the server.
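As a sketch of what that opt-in looks like in practice: PyMongo 4.3 later shipped a `DatetimeConversion` codec option out of PYTHON-1824; `DATETIME_CLAMP` below is that released API, not something available at the time of this discussion.

```python
import bson
from bson.codec_options import CodecOptions, DatetimeConversion
from bson.datetime_ms import DatetimeMS

# A BSON datetime is an int64 of milliseconds since the epoch, so it
# can hold values far outside Python's datetime range.
data = bson.encode({'when': DatetimeMS(2**60)})

# With the default options, decoding this raises; DATETIME_CLAMP
# clamps out-of-range values to the nearest representable datetime.
opts = CodecOptions(datetime_conversion=DatetimeConversion.DATETIME_CLAMP)
print(bson.decode(data, codec_options=opts))
```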
It would be better to return something like "err: datetime above MAX" or "err: datetime under MIN", or even better the error itself, to let the user know that the Python driver is the offending part but that the data might be valid.
AFAICT there's no server for bson. In the MongoDB use case, I guess that improving the current datetime support to whatever fits best is fine. IMHO, this isn't enough to reject the iterator of this PR, because it provides more flexibility to the user.
(Of course, I assume that rebuilding `_UNPACK_INT` is worse than importing the value from bson.)
PYTHON-1824 is the ticket to follow; that is where we are planning to add a workaround for pymongo's datetime range limitations.
It's definitely worse (imo) to use the private `_UNPACK_INT` from bson: a private name can change or disappear in any release.
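For concreteness, the two alternatives being weighed look like this (a sketch; the private import is shown only as the discouraged option):

```python
import struct

# Option 1: redefine the unpacker locally. "<i" (little-endian int32)
# is fixed by the BSON spec, so this only breaks if the spec changes.
_UNPACK_INT = struct.Struct("<i").unpack

# Option 2: reuse the driver's private helper (discouraged; private
# names can be renamed or removed in any release).
# from bson import _UNPACK_INT
```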
If the definition of `_UNPACK_INT` has to change, then there is every chance that the redefinition from struct becomes wrong too. Either way is not satisfactory. The private definition might include whatever changes are needed to stay correct, unlike a re-definition snapshotted at some random time. Actually, the full "write your own iterator" approach is not very satisfactory for users, especially when we know that the Python driver doesn't correctly handle the BSON format. As said previously, I don't care much about this PR; I can maintain my own variation of the iterator. IMHO, the maintainers should worry more than I do about handling the weird cases coming from the tool itself.
Yes that is an interesting and totally valid perspective. As stated earlier, we're closing this PR and plan to address the underlying problem described in PYTHON-1824 instead. Thanks for all the info @nicolas33. |