Description
🐛 Bug
Using HashCheckerIterDataPipe to implement an SST2 dataset in torchtext causes test failures in unittest_linux_py3.6
and in the unit tests for all Python versions on Windows.
- Here is the CircleCI link for all the test failures: failures.
- Here is the Dataset implementation where the HashCheckerIterDataPipe is used: code pointer
I believe there may be changes to how io.seek() works from Python 3.6 to 3.7 that could be causing the failures in unittest_linux_py3.6 and unittest_windows_py3.6. I'm not really sure why the other Windows unit tests are failing.
To Reproduce
Steps to reproduce the behavior:
- Patch commit 62e6fb2 in Nayef211/text repo
- Create PR against pytorch/text repo
- Look at CircleCI unit test failures
Error for unittest_linux_py3.6 and unittest_windows_py3.6
self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x7f937f867ba8>

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
>               stream.seek(0)
E               io.UnsupportedOperation: seek

env/lib/python3.6/site-packages/torchdata-0.1.0a0+7772406-py3.6.egg/torchdata/datapipes/iter/util/hashchecker.py:51: UnsupportedOperation
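The traceback above fails at the unconditional `stream.seek(0)`. As a minimal sketch of one possible workaround (not the torchdata implementation), the rewind could be guarded with `stream.seekable()`, falling back to an in-memory copy when the stream cannot rewind. The names `NonSeekableStream` and `hash_and_rewind` are hypothetical and exist only for this illustration. The actual stream type that failed on Python 3.6 is not shown in the log, so the stand-in class is an assumption.

```python
import hashlib
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a stream that cannot rewind."""

    def __init__(self, data):
        super().__init__()
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Copy as many bytes as are available into the caller's buffer.
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def hash_and_rewind(stream, hash_type="sha256", chunk_size=1024 ** 2):
    """Hash a stream by chunks and return (hexdigest, rewound_stream).

    Seekable streams are rewound in place. Non-seekable streams are
    copied into a BytesIO while hashing, and that copy is returned.
    """
    hash_func = hashlib.sha256() if hash_type == "sha256" else hashlib.md5()
    seekable = stream.seekable()
    buffered = None if seekable else io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hash_func.update(chunk)
        if buffered is not None:
            buffered.write(chunk)
    if seekable:
        stream.seek(0)
        return hash_func.hexdigest(), stream
    buffered.seek(0)
    return hash_func.hexdigest(), buffered


data = b"hello world" * 10
digest, rewound = hash_and_rewind(NonSeekableStream(data))
print(digest == hashlib.sha256(data).hexdigest(), rewound.read() == data)
```

The fallback keeps the whole payload in memory, which gives up the benefit of chunked reading for large files, so it is a trade-off rather than a general fix.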
Error for all other unittest_windows_py*
self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x000001929F2B5548>

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
                stream.seek(0)
            if file_name not in self.hash_dict:
>               raise RuntimeError("Unspecified hash for file {}".format(file_name))
E               RuntimeError: Unspecified hash for file C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv

env\lib\site-packages\torchdata-0.1.0a0+7772406-py3.7.egg\torchdata\datapipes\iter\util\hashchecker.py:54: RuntimeError
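The second traceback fails on the exact-string lookup `file_name not in self.hash_dict`. One possible cause, which the logs do not confirm, is that the keys in `hash_dict` use a different path separator than the `file_name` produced on Windows. The sketch below shows how such a mismatch would behave and how a separator-insensitive comparison could look. The helper `lookup_hash`, the example paths, and the placeholder hash value are all hypothetical.

```python
from pathlib import PureWindowsPath


def lookup_hash(file_name, hash_dict):
    """Find the hash for file_name, ignoring / vs \\ separator differences.

    PureWindowsPath parses both separators, so it can be used on any
    platform to compare Windows-style paths.
    """
    wanted = PureWindowsPath(file_name)
    for key, value in hash_dict.items():
        if PureWindowsPath(key) == wanted:
            return value
    raise RuntimeError("Unspecified hash for file {}".format(file_name))


# Hypothetical data: key written with forward slashes, lookup with backslashes.
hash_dict = {"C:/cache/SST2/SST-2/train.tsv": "placeholder-hash"}
file_name = "C:\\cache\\SST2\\SST-2\\train.tsv"

print(file_name in hash_dict)             # plain string lookup misses
print(lookup_hash(file_name, hash_dict))  # normalized lookup matches
```

If the mismatch turns out to be something else, for example a different cache root, this comparison would not help, so printing the actual `hash_dict` keys next to `file_name` on a Windows runner would be the first thing to check.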
Expected behavior
All tests should pass.
Environment
Tests pass in the devserver environment but fail on CircleCI.