Description
🚀 Feature
Revamp our dataset testing strategy to reduce the amount of time spent waiting for tests to complete before merging PRs in torchtext.
Motivation
TorchText dataset testing currently relies on downloading and caching the datasets daily and then running CircleCI tests on the cached data. This can be slow and unreliable for the first PR that kicks off the dataset download and caching. In addition, dataset extraction can be time consuming for some of the larger datasets within torchtext, and this extraction process occurs each time the dataset tests are run on a PR. For these reasons, tests on CircleCI can take up to an hour to run for each PR, whereas vision/audio tests run in mere minutes. We want to revamp our dataset testing strategy in order to reduce the amount of time we spend waiting on tests to complete before merging our PRs in torchtext.
Pitch
We need to update the legacy dataset tests within torchtext. Currently we test for things including:
- URL link
- MD5 hash of the entire dataset
- dataset name
- number of lines in dataset
Going forward, it doesn’t make sense to test the MD5 hash or the number of lines in the dataset. Instead we will:
- Use mocking to test the implementation of our dataset
- Use smoke tests for URLs and integrity of data (potentially with Github Actions)
Backlog of Dataset Tests
- AG_NEWS mock up AG NEWS test for faster testing. #1553
- AmazonReviewFull Add AmazonReviewFull Mocked Unit Test #1561
- AmazonReviewPolarity Add AmazonReviewPolarity Mocked Unit Test #1532
- DBpedia Add DBpedia Mocked Unit Test #1566
- SogouNews Add SogouNews Mocked Unit Test #1576
- YelpReviewFull Merge YelpReviewPolarity and YelpReviewFull Mocked Unit Tests #1567
- YelpReviewPolarity Merge YelpReviewPolarity and YelpReviewFull Mocked Unit Tests #1567
- YahooAnswers Add YahooAnswers Mocked Unit Test #1574 #1577
- CoNLL2000Chunking Add CoNLL2000Checking Mocked Unit Test #1570
- UDPOS Add UDPOS Mocked Unit Test #1569
- IWSLT2016 mock up IWSLT2016 test for faster testing. #1563, IWSLT testing to start from compressed file #1596
- IWSLT2017 Add Mock test for IWSLT2017 dataset #1598
- Multi30K Multi30k mocked testing #1554
- SQuAD1 Add SQuAD1 Mocked Unit Test #1574
- SQuAD2 Add SQuAD2 Mocked Unit Test #1575
- PennTreebank Add PennTreebank Mocked Unit Test #1578
- WikiText103 Add WikiText103 and WikiText2 Mocked Unit Tests #1592
- WikiText2 Add WikiText103 and WikiText2 Mocked Unit Tests #1592
- EnWik9 Add EnWik9 Mocked Unit Test #1560
- IMDB Add IMDB Mocked Unit Test #1579
- SST2 Add SST2 Mocked Unit Test #1542
- CC-100 add CC100 mocking test #1583
Contributing
We have already implemented a dataset test for AmazonReviewPolarity (#1532) as an example to follow when implementing future dataset tests. Please leave a message below if you plan to work on a particular dataset test to avoid duplication of effort. Also, please link to the corresponding PRs.
Follow-Up Items
- Encode all strings as `utf8` before writing to file when creating mocked data (see Multi30k mocked testing #1554 (comment))
- Parameterize tests for similar datasets (see Add SQuAD2 Mocked Unit Test #1575 (comment)) Parameterize tests for similar datasets #1600
- Fix formatting for all dataset tests [FORMATTING] Update formatting for dataset tests #1601
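To illustrate the first follow-up item, a hypothetical sketch of writing mocked data with an explicit UTF-8 encoding, so non-ASCII text (e.g. German sentences in Multi30k) round-trips correctly regardless of the platform's default encoding:

```python
import tempfile

# Illustrative mocked samples containing non-ASCII text.
lines = ["ein Haus", "caf\u00e9"]

# Encode each string as UTF-8 explicitly before writing bytes to the file,
# rather than relying on the platform's default text encoding.
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as f:
    for line in lines:
        f.write(line.encode("utf-8") + b"\n")
    path = f.name

# Reading back with the same encoding recovers the original strings.
with open(path, encoding="utf-8") as f:
    assert f.read().splitlines() == lines
```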
Additional Context
We explored how other domains implement dataset testing and summarize our findings below. We will implement our new testing strategy by taking inspiration from TorchAudio and TorchVision.
Possible Approaches
- Download and cache the dataset daily before running tests (current state of testing)
- Create mock data for each dataset (used by torchaudio and torchvision)
- Requires us to understand the structure of the datasets before creating tests
- Store a small portion of the dataset (10 lines) in an assets folder
- Might run into legal problems since we aren’t allowed to host datasets
TorchAudio Approach
- Each test lives in its own file
- Plan to add integration tests in the future to check dataset URLs
- Each test class extends `TestBaseMixin` and `PytorchTestCase` (link)
  - The `TestBaseMixin` base class provides a consistent way to define device/dtype/backend aware TestCases
- Each test file contains a `get_mock_dataset()` method which is responsible for creating the mocked data and saving it to a file in a temp dir (link)
  - This method gets called in the `setUp` classmethod within each test class
- The actual test method creates a dataset from the mocked dataset file and tests the dataset
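The pattern described above can be sketched roughly as follows. The structure (a `get_mock_dataset()` helper called from class-level setup, with the test reading from the mocked file) follows the description, but the bodies are illustrative assumptions rather than TorchAudio's actual code:

```python
import os
import tempfile
import unittest


def get_mock_dataset(root_dir):
    """Create the mocked data and save it to a file in a temp dir (sketch)."""
    path = os.path.join(root_dir, "data.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("mock sample\n")
    return path


class TestMockedDataset(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # get_mock_dataset() is called once during setup, as described above,
        # so every test method in the class shares the same mocked file.
        cls.tmp_dir = tempfile.TemporaryDirectory()
        cls.data_path = get_mock_dataset(cls.tmp_dir.name)

    @classmethod
    def tearDownClass(cls):
        cls.tmp_dir.cleanup()

    def test_dataset(self):
        # The actual test builds the dataset from the mocked file and
        # checks its contents (here simplified to a raw file read).
        with open(self.data_path, encoding="utf-8") as f:
            self.assertEqual(f.read(), "mock sample\n")
```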
TorchVision Approach
- All tests live in the `test_datasets.py` file. This file is really long (~1300 lines) and a little hard to read, as opposed to separating tests for each dataset into its own file
- Tests whether dataset URLs are available and download correctly (link)
- `CocoDetectionTestCase` for the COCO dataset extends the `DatasetTestCase` base class (link)
  - `DatasetTestCase` is the abstract base class for all dataset test cases and expects child classes to overwrite class attributes such as `DATASET_CLASS` and `FEATURE_TYPES` (link)
- Here are all the tests from DatasetTestCase that get run for each dataset (link)
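A heavily simplified sketch of this base-class pattern is below. The class attribute names (`DATASET_CLASS`, `FEATURE_TYPES`) follow the description above, but `FakeDataset`, `create_dataset`, and the test body are illustrative assumptions, not TorchVision's actual implementation:

```python
import unittest


class FakeDataset:
    """Minimal stand-in for a real dataset class (illustrative only)."""

    def __init__(self, samples):
        self._samples = samples

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)


class DatasetTestCase(unittest.TestCase):
    """Simplified sketch of the abstract base class described above.

    Child classes overwrite DATASET_CLASS and FEATURE_TYPES; the shared
    test methods then run unchanged against every dataset.
    """

    DATASET_CLASS = None
    FEATURE_TYPES = None

    def create_dataset(self):
        raise NotImplementedError

    def test_feature_types(self):
        if self.DATASET_CLASS is None:
            self.skipTest("abstract base class")
        dataset = self.create_dataset()
        # Every sample's field types must match the declared FEATURE_TYPES.
        for sample in dataset:
            self.assertEqual(
                tuple(type(field) for field in sample), self.FEATURE_TYPES
            )


class FakeDatasetTestCase(DatasetTestCase):
    DATASET_CLASS = FakeDataset
    FEATURE_TYPES = (int, str)

    def create_dataset(self):
        return self.DATASET_CLASS([(0, "a"), (1, "b")])
```

The appeal of this design is that adding a new dataset test mostly reduces to declaring class attributes; the trade-off, noted above, is that keeping everything in one file makes it long and harder to navigate.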