-
Notifications
You must be signed in to change notification settings - Fork 813
Refined Example.fromJSON() to make JSON processing more powerful. #563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the PR. Do you have a concrete example of a dataset in this format? |
Well, it's obvious that most json datasets have nested architectures. However, when dealing with nested JSONs, the original fromJSON() only allows us to pass in the keys of the first layer, which is very inconvenient. For example, I am dealing with a dataset with this format: {
"dialog":[ ],
"uid":[ ],
"profile":[
{
"tag":[
"星座命理;健康"
],
"loc":"贵州 遵义",
"gender":"female"
},
{
"tag":[
""
],
"loc":"海南",
"gender":"male"
}
],
"responder_profile":{
"tag":[
"星座命理;健康"
],
"loc":"贵州 遵义",
"gender":"female"
},
"golden_response":[ ]
} With the old version, if I need to extract 'tag' information, I have to extract twice at least. But now, we could extract just once by passing nested key: 'profile.tag' In addition, the original fromJSON() actually will call fromDict(), and JSON is quite different from dict obviously. |
Agreed.
The original fromJSON assumed a flat (single-level) JSON object, which is an obviously bad limitation. This seems fine. I'd extend the test to cover a multiple-instance JSON object in this format. Also, when naming fields which are parsed from JSON users should be aware that periods "." are a reserved symbol for depth-splitting keys. |
My solution is:
This solution should work well. |
Understood, I was referring to the case where a JSON dataset in this format could have an element with a key containing a period. Ex.: {
"utterance.1":[ ],
"...": "..."
} Where the splitting by "." would result in unexpected behavior. |
Well, I could add a parameter 'separator=' to fromJSON(), but this needs to modify TabularDataset to pass the param. |
@zhangguanheng66 ideas on the last comment? I'm not sure if we want to change TabularDataset right now due to the new dataset pattern. We could simply add a comment in |
I don't think it's necessary to consider this exception because keys like 'utterance.1' are substandard. |
@JahoJiang @mttk Thanks guys. For now, IMO, let's not change TabularDataset. For some extreme cases, we could 1) add a note to the docs and says not covering a key with "." 2) forced exit with proper warning and error message. |
@mttk @zhangguanheng66 So, are you gonna review and merge this request? |
@JahoJiang Sorry for not being clear -- I'll approve it as soon as you extend the test to cover a dataset sample with multiple instances. |
@mttk done. ✌️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Ability to parse nested key for parsing nested JSON dataset.
sample.json:
Before:
Now:
Also added unit test and doc example.