-
Notifications
You must be signed in to change notification settings - Fork 19
Fix URL encoding and decoding #160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix URL encoding and decoding #160
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor suggestion - but overall this looks great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR improves but does not fix URL encoding. The set of encoded characters varies by the PURL component, but this code is not aware of what component it is encoding. For example, this implementation will incorrectly encode a PURL like the proposed pkg:go/example.com%2Fwidgets
as pkg:go/example.com/widgets
.
It fixes it, in general. Which characters to encode is a separate issue, see #174, with the goal being to pass the test suite. The test suite is still inconsistent. I talked with the purl spec team recently, and the intended goal actually seems to be to encode all characters except the ASCII letter/digit and a few others mentioned. This means that |
f2c891f
to
0ca5d84
Compare
Encoding everything would be simpler but it would be a breaking change. Regardless, |
Did you consider using If you shade |
byte[] bytes = source.getBytes(StandardCharsets.UTF_8); | ||
int length = bytes.length; | ||
ByteArrayOutputStream buffer = new ByteArrayOutputStream(length); | ||
buffer.write(bytes, 0, percent); | ||
|
||
for (int i = percent; i < length; i++) { | ||
int b = bytes[i]; | ||
|
||
if (b == '%') { | ||
if (i + 2 >= length) { | ||
return null; | ||
} | ||
|
||
int b1 = Character.digit(bytes[++i], 16); | ||
int b2 = Character.digit(bytes[++i], 16); | ||
buffer.write((char) ((b1 << 4) + b2)); | ||
} else { | ||
buffer.write(b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a thought that we could improve the optimization of this even more:
byte[] bytes = source.getBytes(StandardCharsets.UTF_8); | |
int length = bytes.length; | |
ByteArrayOutputStream buffer = new ByteArrayOutputStream(length); | |
buffer.write(bytes, 0, percent); | |
for (int i = percent; i < length; i++) { | |
int b = bytes[i]; | |
if (b == '%') { | |
if (i + 2 >= length) { | |
return null; | |
} | |
int b1 = Character.digit(bytes[++i], 16); | |
int b2 = Character.digit(bytes[++i], 16); | |
buffer.write((char) ((b1 << 4) + b2)); | |
} else { | |
buffer.write(b); | |
byte[] bytes = source.getBytes(StandardCharsets.UTF_8); | |
int length = bytes.length; | |
int writePos = percent; | |
for (int i = percent; i < length; i++) { | |
int b = bytes[i]; | |
if (b == '%') { | |
if (i + 2 >= length) { | |
return null; | |
} | |
int b1 = Character.digit(bytes[++i], 16); | |
int b2 = Character.digit(bytes[++i], 16); | |
if (b1 == -1 || b2 == -1) { | |
return null; | |
} | |
bytes[writePos++] = (byte) ((d1 << 4) + d2); | |
} else { | |
bytes[writePos++] = b; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only other comment on this method would be to replace the return null
with something like throw new IllegalArgumentException("Invalid percent encoding at index " + i);
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ppkarwasz I completely understand the benefits of proper shading. However, for a library like packageurl-java
I am violently opposed to introducing any runtime dependencies - even if properly shaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that returning null
was wrong. It's too bad that we can't throw a checked exception MalformedPackageURLException
(or we can if you're OK with changing a public
signature).
The other option is to be lenient and just ignore the '%'
. I think I have seen some URI parsers do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use ValidationException
instead of MalformedPackageURLException
as in other parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ValidationException
is unchecked. I actually don't like this and think the code should be changed to throw a checked exception, but that's off-topic for this pull request.
But, good point in that it doesn't change the method signature. I am in favor of deprecating the public
method since these methods should be static
anyway. It probably wasn't meant to be public since only decode
is public
but not encode
, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I checked, and it's impossible to make it to this code with an invalid percent-encoding, as the URI
constructor catches it. I added a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can reach the code through the deprecated, but still public uriDecode
method.
Currently new URI
detects invalid percent encoding sequences, but it might be useful to check it again, in case the code is optimized not to use URI
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the code now throws ValidationException
. I added more tests.
} | ||
} | ||
return builder.toString(); | ||
|
||
return new String(buffer.toByteArray(), StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new String(buffer.toByteArray(), StandardCharsets.UTF_8); | |
return new String(bytes, 0, writePos, StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the code to use ByteBuffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this should move to ByteBuffer
. Doesn't that decrease the performance of the algorithm in comparison to #160 (comment) - with the above suggestion as the return? With ByteBuffer we have to loop over the entire array of bytes to count the number of %
and then make a complete copy one byte at a time. If we kept closer to your original implementation with my small suggestion we skip having to copy the first portion of the byte array and then its just shifting the values down in the remaining portion of the array; never having to allocate a new buffer or loop over the bytes to count the number of %
characters. Maybe I'm missing something.
@dwalluck the suggested optimization doesn't have to be done - but would you mind fixing the merge conflicts? |
else { | ||
builder.append(source.charAt(i)); | ||
|
||
byte[] bytes = source.getBytes(StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
byte[] bytes = source.getBytes(StandardCharsets.UTF_8); | |
byte[] bytes = source.getBytes(StandardCharsets.US_ASCII); |
The encoded URI should only contain ASCII characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in incorrect behavior.
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
Meaning the proposed version would convert non-ASCII characters into ?
. For example: pkg:generic/a?b=☃
becomes pkg:generic/a?b=?
. If you want to reject non-ASCII characters you need to use a different conversion.
The spec says that PURLs are ASCII strings, but it seems like all parsers except maennchen/purl accept Unicode inputs, and it's probably better to continue accepting Unicode inputs:
althonos/packageurl.rs: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
anchore/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
giterlizzi/perl-URI-PackageURL: {"type": "generic", "name": "a", "qualifiers": {"b": "\u00e2\u0098\u0083"}}
maennchen/purl: {"error": "cannot parse due to reason invalid_uri: \":\""}
package-url/packageurl-dotnet: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-js: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-php: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-python: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-ruby: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-swift: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
phylum-dev/purl: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
sonatype/package-url-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, the Java URI
implementation also seems to tolerate non-blank Unicode characters outside of ASCII (cf. URI
Javadoc and it defines a new other
category of allowed characters:
The Unicode characters that are not in the
US-ASCII
character set, are not control characters (according to theCharacter.isISOControl
method), and are not space characters (according to theCharacter.isSpaceChar
method) (Deviation from RFC 2396, which is limited toUS-ASCII
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, use UTF_8
everywhere, then?
I think this one always converts |
Sure, I understand the drawbacks of shading, especially since Apache Commons uses a different license. 😉 Do you oppose any external dependency regardless of size (see #175 for example)? |
143e623
to
41b30c1
Compare
a168ac2
to
169207a
Compare
private static int getPercentCharCount(final byte[] bytes) { | ||
return (int) IntStream.range(0, bytes.length).map(i -> bytes[i]).filter(PackageURL::isPercent).count(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gives only an approximation of the number of encoded sequences, since invalid sequences like %%%
can contain multiple percent characters.
Using a loop would allow you to skip two bytes after each %
character.
if (bytes[i] == PERCENT_CHAR) { | ||
int b1 = Character.digit(bytes[++i], 16); | ||
int b2 = Character.digit(bytes[++i], 16); | ||
b = (byte) ((b1 << 4) + b2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe that checking if b1
or b2
is -1
here does not hurt…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test case? I know it's possible for Character::digit
to return -1
, but when will it in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uriDecode("%GH")
? The method is deprecated, but still public
.
It is not at all evident here, that bytes
must only contain valid percent sequences. This is true, if this method is called from parse
but not at all evident, so it merits at least a comment. Besides: this method already contains two throw
statements.
} | ||
return builder.toString(); | ||
|
||
return new String(buffer.array(), StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If buffer
still has some unwritten bytes (i.e. there is a mismatch between its capacity and the number of written bytes), the string could have some trailing garbage.
StandardCharsets.UTF_8.encode(buffer.flip()).toString()
should decode only the bytes that were actually written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple ways to do it. You can also create the String
with bounds.
Right now, it's counting the exact capacity it should need if the input is valid. I don't actually know if this is better than just setting an upper bound like the code was before. Or, make the code even more complex and copy in chunks to take advantage of System::arrayCopy
. These inputs are in general so small that it doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple ways to do it. You can also create the
String
with bounds.
new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8)
sounds great! 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what my above suggestion did without needing to allocate additional resources or perform additional looping. But as I said above - maybe I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, can you restate what you said in case I missed something? I do remember a suggested change, but it looked incomplete and I did not understand how to change it.
I thought about changing ByteArrayOutputStream
to ByteBuffer
just because it's a newer API, but the performance probably isn't any better (vs. heap byte buffer). About StringBuilder
, I am not as sure. If we're talking about performance then we'd need to benchmark it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a problem with the percent
optimization: percent
gives the index of the first percent character in Unicode codepoints, not in bytes. The index in bytes might be different if source
is not ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worse than that. Java strings are not made of bytes or Unicode codepoints. indexOf gives the index of the Java char, which is not the index of the byte if the string is not ASCII and it's not the index of the code point either if the string contains surrogate pairs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so maybe this:
private String percentEncode(final String source) {
if (source == null || source.isEmpty()) {
return source;
}
byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
int length = bytes.length;
int percent = -1;
for (int i = 0; i < length-2; i++) {
if (bytes[i] == PERCENT_CHAR) {
percent = i;
}
}
if (percent < 0) {
return source;
}
int writePos = percent;
for (int i = percent; i < length; i++) {
byte b = bytes[i];
if (b == PERCENT_CHAR) {
if (i + 2 >= length) {
throw new ValidationException("Incomplete percent encoding at index " + i);
}
int b1 = Character.digit(bytes[++i], 16);
int b2 = Character.digit(bytes[++i], 16);
if (b1 == -1 || b2 == -1) {
throw new ValidationException("Invalid percent encoding at index " + (i - 2));
}
bytes[writePos++] = (byte) ((b1 << 4) + b2);
} else {
bytes[writePos++] = b;
}
}
return new String(bytes, 0, writePos, StandardCharsets.UTF_8));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can look for percent characters multiple times. At the beginning it is worth using String.indexOf
:
if (source.indexOf(PERCENT_CHAR) == -1) {
return source;
}
since this prevents a relatively expensive encoding of the string into byte[]
. Later we can also skip all the bytes until the first percent %
character. It is worth running some benchmarks though, since intuition can really be misleading (especially on a highly optimized JVM 23).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremylong @matt-phylum @ppkarwasz I think the latest version is good and I incorporated all of your suggestions:
Decoding and Encoding:
- Split the input by char, so that
System.arrayCopy
is ultimately used instead of copying single bytes - Since we split, don't bother pre-counting char occurrences, just set an upper bounds on size if we need to
- I could be wrong, but I think this might outperform the version which doesn't split the input by a factor of 2 or 3
Decoding:
- Better error checking for bad percent encodings (short or invalid hex digits), moved to its own method
- Reuse the existing array (I used
ByteBuffer.wrap
so it automatically tracks thewritePos
)
Encoding:
- Use
UTF_8
everywhere, notUS_ASCII
. In my benchmarks, ASCII was faster, but only slightly anyway
169207a
to
0a11b17
Compare
I ran a small JMH benchmark to check the performance difference in decoding between
(each iteration decodes 1000 string of 256 characters, where 10% are not ASCII and the rest letters) |
0a11b17
to
c3f02af
Compare
Using the benchmark in the
Unless there are bugs in the benchmark (which is quite probable), |
The methods `uriEncode` and `uriDecode` did not properly handle percent-encoding. In particular, `uriEncode` didn't properly output two uppercase hex digits and `urlDecode` did not properly handle non-ASCII characters. Aditionally, if no percent-encoding was performed, these methods will now return the original string. Fixes package-url#150 Closes package-url#153 Fixes package-url#154
c3f02af
to
78f897b
Compare
This is what I found, too. It is probably faster if you just need a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It looks a little bit over-engineered (as if it contained suggestions from 3 different people 😉), but it fixes the problem, which is the important part. LGTM |
The methods
uriEncode
anduriDecode
did not properly handle percent-encoding. In particular,uriEncode
didn't properly output two uppercase hex digits andurlDecode
did not properly handle non-ASCII characters.Aditionally, if no percent-encoding was performed, these methods will now return the original string.
Fixes #150
Closes #153
Fixes #154