Issue with Dolly Dataloader: context key not found! #1760

Open
@pytholic

Bug description

I ran into the following issue while running LoRA fine-tuning.

Stack Trace

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/litgpt/data/base.py", line 80, in __getitem__
    example = self.transform(example)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/litgpt/data/dolly.py", line 74, in _transform
    item["input"] = item.pop("context")
                    ^^^^^^^^^^^^^^^^^^^
KeyError: 'context'

Command

litgpt finetune_lora checkpoints/EleutherAI/pythia-70m --data Dolly --precision 16-mixed --data.num_workers 4 --train.global_batch_size 1 --train.max_seq_length 512 --data.val_split_fraction 0.0

I spent some time debugging it. It seems the _transform method is being called twice at the beginning for some reason. On the second call the keys are no longer there, because the first call removed them with pop. Replacing pop with get does work, though.
In src/litgpt/litgpt/data/dolly.py (the commented-out parts are for debugging):

# import sys
# from pprint import pprint

def _transform(idx: int, item: dict) -> dict:
    # if "context" not in item.keys():
        # print(f"{idx}: Missing Key!")
        # pprint(item)
        # sys.exit()
    item["input"] = item.pop("context")
    item["output"] = item.pop("response")
    return item

I couldn't figure out why it is being called twice though.
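
To make the failure mode concrete, here is a minimal standalone sketch. It does not use LitGPT, and it does not explain why the transform runs twice; it only assumes that the same dict object reaches the transform a second time. The names transform_in_place and transform_copy and the sample record are hypothetical, for illustration only.

```python
def transform_in_place(item: dict) -> dict:
    # Same pattern as the traceback: pop() renames the keys by mutating the input dict.
    item["input"] = item.pop("context")
    item["output"] = item.pop("response")
    return item


def transform_copy(item: dict) -> dict:
    # Hypothetical non-mutating variant: build a new dict and leave the input untouched,
    # so calling it again on the same record is harmless.
    out = {k: v for k, v in item.items() if k not in ("context", "response")}
    out["input"] = item["context"]
    out["output"] = item["response"]
    return out


def second_call_raises(transform, record: dict) -> bool:
    # Apply the transform twice to the same dict and report whether the second call fails.
    transform(record)
    try:
        transform(record)
    except KeyError:
        return True
    return False


if __name__ == "__main__":
    sample = {"instruction": "Say hi", "context": "", "response": "hi"}
    print("pop variant fails on second call:", second_call_raises(transform_in_place, dict(sample)))
    print("copy variant fails on second call:", second_call_raises(transform_copy, dict(sample)))
```

If the double call turns out to be intended, a non-mutating transform along these lines would sidestep the KeyError; whether that is the right fix for dolly.py is an open question here.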

What operating system are you using?

macOS

LitGPT Version

Tested on two versions and on two platforms (macOS and Linux).

litgpt                                   0.4.13
litgpt                                   0.4.14.dev1

Labels: 3rd party, bug, help wanted
