Skip to content

Refined Example.fromJSON() to make JSON processing more powerful. #563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from Jul 24, 2019
Merged

Refined Example.fromJSON() to make JSON processing more powerful. #563

merged 7 commits into from Jul 24, 2019

Conversation

ghost
Copy link

@ghost ghost commented Jul 20, 2019

Ability to parse nested key for parsing nested JSON dataset.

sample.json:

{"foods": {"fruits": ["Apple", "Banana"], "vegetables": [{"name": "lettuce"}, {"name": "marrow"}] }}

Before:

In [1]: from torchtext import data

In [2]: fields = {'foods.vegetables.name': ('vegs', data.Field())}

In [3]: dataset = data.TabularDataset(path='sample.json', format='json', fields=fields)

ValueError: Specified key foods.vegetables.name was not found in the input data

Now:

In [1]: from torchtext import data

In [2]: fields = {'foods.vegetables.name': ('vegs', data.Field())}

In [3]: dataset = data.TabularDataset(path='sample.json', format='json', fields=fields)

In [4]: dataset.examples[0].vegs
Out[4]: ['lettuce', 'marrow']

Also added unit test and doc example.

@zhangguanheng66 zhangguanheng66 requested a review from mttk July 21, 2019 15:51
@mttk
Copy link
Contributor

mttk commented Jul 22, 2019

Thanks for the PR. Do you have a concrete example of a dataset in this format?

@ghost
Copy link
Author

ghost commented Jul 22, 2019

Thanks for the PR. Do you have a concrete example of a dataset in this format?

Well, it's obvious that most json datasets have nested architectures. However, when dealing with nested JSONs, the original fromJSON() only allows us to pass in the keys of the first layer, which is very inconvenient.

For example, I am dealing with a dataset with this format:

{  
   "dialog":[  ],
   "uid":[  ],
   "profile":[  
      {  
         "tag":[  
            "星座命理;健康"
         ],
         "loc":"贵州 遵义",
         "gender":"female"
      },
      {  
         "tag":[  
            ""
         ],
         "loc":"海南",
         "gender":"male"
      }
   ],
   "responder_profile":{  
      "tag":[  
         "星座命理;健康"
      ],
      "loc":"贵州 遵义",
      "gender":"female"
   },
   "golden_response":[  ]
}

With the old version, if I need to extract 'tag' information, I have to extract twice at least. But now, we could extract just once by passing nested key: 'profile.tag'

In addition, the original fromJSON() actually will call fromDict(), and JSON is quite different from dict obviously.

@mttk
Copy link
Contributor

mttk commented Jul 22, 2019

Well, it's obvious that most json datasets have nested architectures. However, when dealing with nested JSONs, the original fromJSON() only allows us to pass in the keys of the first layer, which is very inconvenient.

Agreed.

With the old version, if I need to extract 'tag' information, I have to extract twice at least. But now, we could extract just once by passing nested key: 'profile.tag'

In addition, the original fromJSON() actually will call fromDict(), and JSON is quite different from dict obviously.

The original fromJSON assumed a flat (single-level) JSON object, which is an obviously bad limitation. This seems fine. I'd extend the test to cover a multiple-instance JSON object in this format. Also, when naming fields which are parsed from JSON users should be aware that periods "." are a reserved symbol for depth-splitting keys.

@ghost
Copy link
Author

ghost commented Jul 22, 2019

My solution is:

  1. Splitting keys to a list.
  2. Use python built in func reduce() to pass in each key in the list to extract information layer by layer.

This solution should work well.
And I will refine my code according to your test results. :)

@mttk
Copy link
Contributor

mttk commented Jul 22, 2019

My solution is:

  1. Splitting keys to a list.
  2. Use python built in func reduce() to pass in each key in the list to extract information layer by layer.

This solution should work well.

Understood, I was referring to the case where a JSON dataset in this format could have an element with a key containing a period. Ex.:

{  
   "utterance.1":[  ],
   "...": "..."
}

Where the splitting by "." would result in unexpected behavior.

@ghost
Copy link
Author

ghost commented Jul 22, 2019

{  
   "utterance.1":[  ],
   "...": "..."
}

Where the splitting by "." would result in unexpected behavior.

Well, I could add a parameter 'separator=' to fromJSON(), but this needs to modify TabularDataset to pass the param.

@mttk
Copy link
Contributor

mttk commented Jul 22, 2019

@zhangguanheng66 ideas on the last comment? I'm not sure if we want to change TabularDataset right now due to the new dataset pattern.

We could simply add a comment in example.fromJSON and leave it at that, but keep in mind that this can cause an issue in some extreme cases.

@ghost
Copy link
Author

ghost commented Jul 22, 2019

@zhangguanheng66 ideas on the last comment? I'm not sure if we want to change TabularDataset right now due to the new dataset pattern.

We could simply add a comment in example.fromJSON and leave it at that, but keep in mind that this can cause an issue in some extreme cases.

I don't think it's necessary to consider this exception because keys like 'utterance.1' are substandard.

@zhangguanheng66
Copy link
Contributor

@JahoJiang @mttk Thanks guys. For now, IMO, let's not change TabularDataset. For some extreme cases, we could 1) add a note to the docs and says not covering a key with "." 2) forced exit with proper warning and error message.

@ghost
Copy link
Author

ghost commented Jul 24, 2019

@mttk @zhangguanheng66 So, are you gonna review and merge this request?

@mttk
Copy link
Contributor

mttk commented Jul 24, 2019

@JahoJiang Sorry for not being clear -- I'll approve it as soon as you extend the test to cover a dataset sample with multiple instances.

@ghost
Copy link
Author

ghost commented Jul 24, 2019

@mttk done. ✌️

Copy link
Contributor

@mttk mttk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangguanheng66 zhangguanheng66 merged commit eb0d24b into pytorch:master Jul 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants