torchtext iterator that tokenizes each line of words between the tokens `<sos>` and `<eos>` #654

Hello,

I generated a text file called `openbookQA_train`. The contents of this file are shown below:
```
<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>
<sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>
<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>
<sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>
```

I am trying to use (or define) a torchtext `Iterator` that generates input I can pass to my Transformer.

I want **each sample** in `next(iter(openbookQA_train)).text` to be a sequence of integers obtained by tokenizing one line of the file, from `<sos>` through `<eos>` (including both special tokens). If a line contains fewer tokens than the bptt length, the sample should hold all of the tokens between `<sos>` and `<eos>`, with the remaining slots filled with the `<pad>` token up to the bptt length.
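To make the target concrete, here is a minimal plain-Python sketch of the output I am after for each line. It does not use torchtext at all. The bptt value of 16, the whitespace tokenizer, the choice of id 0 for `<pad>`, the helper names `build_vocab` and `encode_line`, and raising an error on lines longer than bptt are all assumptions made for illustration.

```python
# Plain-Python sketch of the desired per-line encoding (no torchtext involved).
# BPTT, PAD, and the helper names below are illustrative assumptions.

BPTT = 16       # assumed bptt length; every sample is padded to this size
PAD = "<pad>"   # padding token, assigned id 0 in this sketch

lines = [
    "<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>",
    "<sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>",
    "<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>",
    "<sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>",
]


def build_vocab(lines):
    """Map each whitespace-separated token to an integer id; <pad> gets id 0."""
    vocab = {PAD: 0}
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_line(line, vocab, bptt=BPTT):
    """Turn one line into exactly `bptt` ids, right-padded with the <pad> id."""
    ids = [vocab[token] for token in line.split()]
    if len(ids) > bptt:
        # How to handle over-long lines is not specified here, so fail loudly.
        raise ValueError(f"line has {len(ids)} tokens, more than bptt={bptt}")
    return ids + [vocab[PAD]] * (bptt - len(ids))


vocab = build_vocab(lines)
samples = [encode_line(line, vocab) for line in lines]

for sample in samples:
    print(sample)
```

Each entry of `samples` corresponds to one line of the file, starts with the id of `<sos>`, contains the id of `<eos>`, and ends with as many `<pad>` ids as needed to reach the bptt length. This is the per-sample layout I would like the torchtext iterator to produce.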

How can I achieve this objective?

Thank you,


