Canonical JSON is unclear #457
Posting this publicly to serve as documentation should anyone else hit this issue. |
Whitespace is not permitted between tokens in Canonical JSON. I don't think CJSON modifies the actual data in the dictionary values; they are permitted to contain anything. Assuming that "bar\nbaz" is the literal string for the "foo" key, I think B is correct. I'm curious, why do you claim A is correct? |
I imagined this being a writer that streams bytes out, and if you pass it a … Also, from the definition here:
and also
This means that CJSON is not a subset of JSON (despite the docs claiming it is), since a literal tab may be present in a string, as with a literal newline (not …). So the question is: when you run a string through the …? |
I don't quite follow your objection to the use of the word "subset." CJSON is designed to take valid JSON as input, and modify it so that its output (canonical JSON) produces repeatable hashes. To quote the laptop.org page:
If CJSON is given invalid JSON input, then you should expect invalid JSON as output. Which docs? The linked page at laptop.org simply says, "The grammar for canonical JSON closely matches the grammar presented at json.org: ..." We do use the word "subset" in the TUF specification, so is this the objection? We can change that to use the wording used by laptop.org. Perhaps "restricted dialect" would be a better term to use.
If there are 8 bytes in a string, the streamer should read 8 uninterpreted bytes out of it and escape any hex 22 (") or hex 5C (\) that it encounters. You are free to dump JSON in whatever format you wish, to "pretty print" it for example. However, when the hashes are generated, you should run the JSON through CJSON to produce repeatable hashes. Here's an example using the reference implementation's encode_canonical() and Python's json.dumps():

>>> import json
>>> import securesystemslib.formats
>>> json_data = {'foo': '1"23', 'bar': '456'}
>>> securesystemslib.formats.encode_canonical(json_data)
u'{"bar":"456","foo":"1\\"23"}'
>>> json.dumps(json_data)
'{"foo": "1\\"23", "bar": "456"}'
>>> |
It would be very useful for embedded implementations if the bytes that are fed into the checksum routine can* exactly match the bytes that are sent over the wire. There would be two advantages:
(*) CAN not MUST: this requires cooperation from the server to generate the JSON in canonical form |
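To illustrate the point about wire bytes, here is a minimal sketch of the idea that the hashed bytes can equal the transmitted bytes. The `canonical_bytes` helper is hypothetical (it uses plain `json.dumps` with sorted keys and no inter-token whitespace) and is not the securesystemslib encoder:

```python
import hashlib
import json

def canonical_bytes(obj):
    # Hypothetical minimal canonical encoding: sorted keys, no whitespace
    # between tokens. An illustration only, not the reference implementation.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Server side: emit canonical bytes directly on the wire.
wire = canonical_bytes({"foo": "1", "bar": "2"})

# Client side: hash the received bytes as-is, with no re-encoding step.
digest = hashlib.sha256(wire).hexdigest()
```

If the server cooperates by sending canonical form, an embedded client never needs to parse and re-serialize before hashing.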
The CJSON definition grammar (not the human-friendly text that says "CJSON is parseable with a JSON parser") is not compatible with JSON. Further, the CJSON definition doesn't allow the sequence … (edit: the quoted definition of CJSON isn't really even a grammar and should have just used ABNF, and as a result my above claim is wrong). If we do CJSON the way it's done in …

Proposed Definition

CJSON is a subset of JSON with these restrictions: …
|
Related ticket: advancedtelematic/tuf-test-vectors#42 |
So it turns out that in #457 (comment), the choice A actually is what … |
Additional information:

Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from securesystemslib.formats import encode_canonical as cjson
>>> import json
>>> json.loads('{"foo": "bar\\nbaz"}')
{'foo': 'bar\nbaz'}
>>> json.loads(cjson(json.loads('{"foo":"bar\\nbaz"}')))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 12 (char 11) |
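The traceback above can be reproduced without securesystemslib at all: the canonical output contains a raw newline byte (0x0A) inside the string, and a strict JSON parser rejects unescaped control characters. A minimal sketch:

```python
import json

# A raw newline inside the string literal (the Python escape '\n' produces
# the actual 0x0A byte, not the two characters backslash + n).
bad = '{"foo": "bar\nbaz"}'

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    # Strict JSON parsers reject raw control characters in strings,
    # e.g. "Invalid control character at: line 1 column 12 (char 11)".
    pass
```

This is exactly why output containing unescaped newlines cannot round-trip through `json.loads`.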
The input to …

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson
>>> json.loads(cjson({"foo":"bar\\nbaz"}))
{u'foo': u'bar\\nbaz'}
>>>
The behaviour of …
How so? Feel free to link to the JSON specification for the specific violation(s), so I can better understand your objection.
I don't follow your reasoning. Is this Rule 1? "Strings are uninterpreted bytes, with only two escaped byte values." Hmm, I understand this to mean that there are only two bytes that are escaped by CJSON: hex 22 (") and hex 5C (\). Not that the backslash is disallowed. In other words, whenever CJSON encounters hex 5C, it escapes it (…). |
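Under that reading, a string encoder would look roughly like the following sketch. `cjson_escape` is a hypothetical helper, not the securesystemslib code:

```python
def cjson_escape(s):
    # Sketch of OLPC-style string escaping: only 0x22 (") and 0x5C (\) are
    # escaped; every other byte, including newlines and tabs, passes through
    # uninterpreted.
    out = []
    for ch in s:
        if ch in ('"', "\\"):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'
```

Note that by this rule a newline in the input appears raw in the output, which is the crux of the disagreement about whether the result is still JSON.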
Ok, so that's super non-intuitive. When you do a … This also doesn't disprove the point that the output of canonical JSON isn't parseable as JSON.
That means that …
If inside the TUF lib, wherever you're calling …:

>>> from securesystemslib.formats import encode_canonical as cjson
>>> x = 'foo\nbar'
>>> print(x)
foo
bar
>>> print(cjson(x))
"foo
bar"
>>> assert cjson(x) == '"{}"'.format(x)
>>>

If CJSON is a subset of JSON, then the following should hold:

assert json.loads(encode_canonical(some_json)) == json.loads(json.dumps(some_json))

Which is not the case. Even still, the … |
This also might explain why I couldn't get the same key IDs as you guys in this comment: https://github.com/theupdateframework/tuf/issues/362#issuecomment-295322447 |
You appear to have mixed up the loads(str) and dumps(obj) functions. In your example, you are comparing a dictionary to a string. I think the following code snippet is the example that you wanted:

>>> obj = {"foo":"bar\\nbaz"}
>>> json.loads(cjson(obj)) == obj
True

Here is the full code:

$ python
Python 3.5.3 (default, Apr 22 2017, 00:00:00)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson
>>> obj = {"foo":"bar\\nbaz"}
>>> json.dumps(obj)
'{"foo": "bar\\\\nbaz"}'
>>> cjson(obj)
'{"foo":"bar\\\\nbaz"}'
>>> json.loads(cjson(obj))
{'foo': 'bar\\nbaz'}
>>> json.loads(cjson(obj)) == obj
True |
@vladimir-v-diaz Yeah, I typo'd that and dropped one of the … You're still double escaping things. If I want the following text to display:

Hello
world

I need a Python dict like this:

{ 'message': 'Hello\nworld' }

Which can be represented as JSON like this:

{ "message": "Hello\nworld" }

So if I want to canonicalize that JSON, I would do this:

import json
from securesystemslib.formats import encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref
print(json_dict_ref['message'])
#Hello
#world
print(json_loaded['message'])
#Hello
#world
json.loads(encode_canonical(json_loaded))
# raises json.decoder.JSONDecodeError: Invalid control character at: line 1 column 18 (char 17) |
FWIW: this issue is apparently a result of our home-brewed implementation of canonical JSON:

import json
from canonicaljson import encode_canonical_json as encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref
print(json_dict_ref['message'])
#Hello
#world
print(json_loaded['message'])
#Hello
#world
json.loads(encode_canonical(json_loaded))
# {'message': 'Hello\nworld'} |
The PyPI package canonicaljson … Here is a Rust implementation:

const QU: u8 = b'"'; // \x22
const BS: u8 = b'\\'; // \x5C
// Lookup table of escape sequences. A value of b'x' at index i means that byte
// i is escaped as "\x" in JSON. A value of 0 means that byte i is not escaped.
#[cfg_attr(rustfmt, rustfmt_skip)]
static ESCAPE: [u8; 256] = [
// 1 2 3 4 5 6 7 8 9 A B C D E F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 1
0, 0, QU, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 2
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 3
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 4
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, BS, 0, 0, 0, // 5
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 6
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 7
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 8
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 9
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // A
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // B
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // C
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // D
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // E
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // F
];

There are exactly two escapes, and you are allowed to have completely arbitrary input strings. There are two answers.

Answer 1

If we ignore the incorrect sentence in the OLPC definition and follow the grammar that is defined on the wiki, then the output is not JSON, and we continue using the OLPC definition regardless.

Answer 2

We define CJSON as:

Canonical JSON is the subset of JSON where: …
There is no way to say that the output of the OLPC grammar is JSON. It is not. You are allowed to have arbitrary strings as input to the encode function, and (among other things) newlines are not escaped in the output. JSON escapes unicode as \uXXXX. If we pick Answer 2 (which is the most sane), then all references to the OLPC wiki need to be removed and the … I don't care if Answer 1 or Answer 2 becomes the spec. I just care that whatever is being used now is not fully understood by everyone and relies on (the OLPC) ambiguous grammar that is in direct contradiction to the first sentence of its own spec. |
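The point about unicode escaping can be seen with Python's standard `json` module, which escapes non-ASCII characters as \uXXXX by default and emits them raw only when asked. A canonical form has to pick exactly one of these behaviours:

```python
import json

# By default (ensure_ascii=True), non-ASCII characters are escaped as \uXXXX.
# With ensure_ascii=False they are emitted raw. Both outputs are valid JSON
# for the same value, which is why a canonical form must choose one.
s = {"k": "\u00e9"}  # é

assert json.dumps(s) == '{"k": "\\u00e9"}'
assert json.dumps(s, ensure_ascii=False) == '{"k": "é"}'
```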
Thanks for the explanation! OLPC's canonical JSON does allow arbitrary bytes in strings for compatibility reasons: "A previous version of this specification required strings to be valid unicode, and relied on JSON's \u escape. This was abandoned as it doesn't allow representing arbitrary binary data in a string, and it doesn't preserve the identity of non-canonical unicode strings." Although the grammar for OLPC's canonical JSON can produce invalid JSON data if non-unicode bytes are present in strings, it nevertheless still produces valid JSON if only unicode characters are used, and the final JSON is encoded properly (e.g., UTF-8 in our case). Please note that TUF metadata is in JSON object format, and we canonicalize the valid JSON to calculate digests and signatures. From the TUF spec:
|
Yes, both …

>>> json_loaded = {'message': 'Hello\nworld'}
>>> json.dumps(json_loaded)
#'{"message": "Hello\\nworld"}'
>>> from canonicaljson import encode_canonical_json as ecj
>>> ecj(json_loaded)
#'{"message":"Hello\\nworld"}'
>>> from securesystemslib.formats import encode_canonical as cjson
>>> cjson(json_loaded)
#u'{"message":"Hello\nworld"}'

Since TUF metadata is not written in canonical JSON, nor needs to load CJSON with json.dumps() or …

The … |
Hey @vladimir-v-diaz. Slow response here. Since it seems that the OLPC grammar is what we want, can we change the TUF spec to not reference it and use a real ABNF to describe the CJSON grammar? This seems like the best way to avoid other people getting as bogged down with this as we did. :) Also, good point on the unicode normalization. I hadn't thought about that and need to bring this up with the team here. |
@heartsucker
|
This is already disallowed by both the JSON and the OLPC CJSON grammars.
We should use a formal grammar, because from these sentences I can't tell if you've settled on the "JSON-subset" approach or the "OLPC Canonical JSON" approach. A formal grammar makes it unambiguous how things are encoded and escaped. |
The JSON restrictions listed come from the OLPC page. I want to list the OLPC restrictions in the specification, and to clear up any confusion about certain parts of it (e.g., strings are unicode, etc.). The OLPC page might explicitly mention trailing commas because certain parsers, ECMAScript 5 objects/functions, Python objects, etc. do allow them. It might be better to make it clear that trailing commas are disallowed. The following two JSON linters seem to think that … If you think a formal grammar is better, please share it so we can judge. I personally think saying that we use standard JSON with certain restrictions is simple and clear. If an implementor wants a grammar, they can just visit json.org. |
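For what it's worth, Python's standard parser agrees with those linters: RFC 7159 semantics reject a trailing comma. A quick sketch:

```python
import json

def accepts(text):
    # Returns True if the standard (RFC 7159) parser accepts the document.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

assert accepts('["foo", "bar"]')        # valid JSON
assert not accepts('["foo", "bar",]')   # trailing comma is rejected
```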
Note: We might not even discuss JSON in the specification if we generalize the format of Metadata documents. This might just be documentation in the reference implementation if we decide to continue using this format. |
From RFC-7159 section 4:
Trailing commas are not allowed.
In both cases, there is not a 1-to-1 mapping between Python dicts/arrays/etc. and JSON. It just happens to be that JS/Python simple types look a whole awful lot like JSON, and that's more an artifact of the convenience of representing data as maps, lists, strings, numbers, booleans, and nulls.

x = {
"foo": lambda x: x + 1,
}

Could be "mapped" to JS:

x = {
foo: function(x) { return x + 1 },
}

But both of these are not JSON, and there is no meaningful way to convert either to JSON. And in fact, the trailing commas are stripped by the interpreters and don't even exist in memory.

Anyway. When the doc changes away from JSON only, it should just include a note that the signing format needs to be deterministic with respect to ordering of elements. The implementation details don't actually matter. |
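The determinism requirement described above can be sketched in a few lines. The specific encoder here (plain `json.dumps` with sorted keys and fixed separators) is just a stand-in; any encoder works as long as equal objects always produce equal bytes:

```python
import json

# Two logically-equal objects must serialize to identical bytes for the
# signing format to be deterministic with respect to element ordering.
def encode(obj):
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

a = {"b": 1, "a": 2}
b = {"a": 2, "b": 1}
assert encode(a) == encode(b)
```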
Yup, this will be a helpful note to have for implementors. In practice, some formats don't always match up exactly across different applications. Sometimes a missing trailing newline can break things. |
Two nits:
|
I'm going to close this: I think the discussion is valuable but there does not seem to be an action we can take within python-tuf. There are securesystemslib and the specification issues linked above, both seem like better places to continue the discussion. Please open a new issue if there is a possible python-tuf improvement in this area. |
We're having an internal debate and we're not sure we're using CJSON in the correct way. Let's say we want to calculate a signature over this:
So what we do is:
Should the output of the function be A or B?
My claim is A is correct, but others claim B is correct. Though in some sense it doesn't really matter which is used so long as all clients use the same (assuming both are deterministic).