
Error with HashCheckerIterDataPipe #75

Closed
@Nayef211

🐛 Bug

Using HashCheckerIterDataPipe to implement an SST2 dataset within torchtext causes test failures in unittest_linux_py3.6 and in the unit tests for all Python versions on Windows.

  • Here is the CircleCI link for all the test failures: failures.
  • Here is the Dataset implementation where the HashCheckerIterDataPipe is used: code pointer

I suspect that the behavior of seek() on io streams changed between Python 3.6 and 3.7, which could explain the failures in unittest_linux_py3.6 and unittest_windows_py3.6. I'm not sure why the other Windows unit tests are failing.

To Reproduce

Steps to reproduce the behavior:

  1. Patch commit 62e6fb2 in Nayef211/text repo
  2. Create PR against pytorch/text repo
  3. Look at CircleCI unit test failures

Error for unittest_linux_py3.6 and unittest_windows_py3.6

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x7f937f867ba8>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
>               stream.seek(0)
E               io.UnsupportedOperation: seek

env/lib/python3.6/site-packages/torchdata-0.1.0a0+7772406-py3.6.egg/torchdata/datapipes/iter/util/hashchecker.py:51: UnsupportedOperation

Link to Circle CI Error
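
The traceback above shows seek(0) being called on a stream that does not support it. As a minimal sketch of a possible guard (not the torchdata implementation), the rewind could be conditioned on stream.seekable(). The NonSeekableStream class and the hash_stream helper below are hypothetical names used only for illustration:

```python
import hashlib
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a stream that cannot be rewound."""

    def __init__(self, payload):
        super().__init__()
        self._buffer = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        # Report that seek() is unsupported; the inherited seek() then
        # raises io.UnsupportedOperation, as in the traceback.
        return False

    def readinto(self, b):
        return self._buffer.readinto(b)


def hash_stream(stream, rewind=True):
    """Hash a stream chunk by chunk and rewind it only if it is seekable."""
    hash_func = hashlib.sha256()
    while True:
        # Read by chunk to avoid filling memory
        chunk = stream.read(1024 ** 2)
        if not chunk:
            break
        hash_func.update(chunk)
    # Assumption: silently skipping the rewind is acceptable to the caller.
    if rewind and stream.seekable():
        stream.seek(0)
    return hash_func.hexdigest()
```

With this guard a non-seekable stream is left exhausted after hashing, so downstream consumers would still need the data buffered or re-opened; the sketch only avoids the exception.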

Error for all other unittest_windows_py*

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x000001929F2B5548>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
                stream.seek(0)
    
            if file_name not in self.hash_dict:
>               raise RuntimeError("Unspecified hash for file {}".format(file_name))
E               RuntimeError: Unspecified hash for file C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv

env\lib\site-packages\torchdata-0.1.0a0+7772406-py3.7.egg\torchdata\datapipes\iter\util\hashchecker.py:54: RuntimeError

Link to Circle CI Error
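
The cause of this RuntimeError is not identified in this report. One hypothesis worth checking is that the keys in hash_dict are written with a different separator or letter case than the Windows path yielded by the source datapipe. The sketch below assumes that hypothesis; normalize_key, lookup_hash, and the placeholder hash value are hypothetical and not part of torchdata:

```python
import ntpath  # Windows path rules, importable on any platform


def normalize_key(path):
    """Normalize separators and case the way Windows compares paths."""
    return ntpath.normcase(ntpath.normpath(path))


def lookup_hash(hash_dict, file_name):
    """Look up a file's expected hash using normalized path keys."""
    normalized = {normalize_key(k): v for k, v in hash_dict.items()}
    key = normalize_key(file_name)
    if key not in normalized:
        raise RuntimeError("Unspecified hash for file {}".format(file_name))
    return normalized[key]


# Hypothetical dictionary keyed with forward slashes.
example_hash_dict = {
    "C:/Users/circleci/.torchtext/cache/SST2/SST-2/train.tsv": "placeholder-hash",
}
# Path as reported in the error message, with backslashes.
reported_path = r"C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv"
expected_hash = lookup_hash(example_hash_dict, reported_path)
```

If the lookup succeeds only after normalization, the mismatch is in how the keys are built; if it still fails, the cause lies elsewhere.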

Expected behavior

Expect all tests to pass

Environment

Tests pass in the devserver environment but fail on CircleCI.
