Skip to content

torchtext iterator that tokenizes each line of words between the tokens <sos> and <eos> #654

Closed
@h56cho

Description

@h56cho

Hello,

I generated a text file called openbookQA_train. The contents of this file are shown below:

<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>
<sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>
<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>
<sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>

I am trying to use or define torchtext Iterator to generate the input that I can pass into my Transformer.

I want each sample in my next(iter(openbookQA_train)).text to be a series of integers that are obtained by tokenizing each line of words between <sos> and <eos> (including those special tokens), and for a sample that contains lesser number of tokens than the bptt length, I want the sample to include all of the tokenized words between <sos> and <eos> and the rest of the slots to be filled with the token <pad> up to the bptt length.

How can I achieve this objective?

Thank you,

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions