Skip to content

Fix URL encoding and decoding #160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 16, 2025

Conversation

dwalluck
Copy link
Contributor

The methods uriEncode and uriDecode did not properly handle percent-encoding. In particular, uriEncode didn't properly output two uppercase hex digits and urlDecode did not properly handle non-ASCII characters.

Aditionally, if no percent-encoding was performed, these methods will now return the original string.

Fixes #150
Closes #153
Fixes #154

Copy link
Collaborator

@jeremylong jeremylong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor suggestion - but overall this looks great.

Copy link

@matt-phylum matt-phylum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR improves but does not fix URL encoding. The set of encoded characters varies by the PURL component, but this code is not aware of what component it is encoding. For example, this implementation will incorrectly encode a PURL like the proposed pkg:go/example.com%2Fwidgets as pkg:go/example.com/widgets.

@dwalluck
Copy link
Contributor Author

This PR improves but does not fix URL encoding. The set of encoded characters varies by the PURL component, but this code is not aware of what component it is encoding. For example, this implementation will incorrectly encode a PURL like the proposed pkg:go/example.com%2Fwidgets as pkg:go/example.com/widgets.

It fixes it, in general. Which characters to encode is a separate issue, see #174, with the goal being to pass the test suite.

The test suite is still inconsistent. I talked with the purl spec team recently, and the intended goal actually seems to be to encode all characters except the ASCII letter/digit and a few others mentioned. This means that '/' and ':' would always be encoded everywhere, which isn't the case right now, and it's not worth chasing it really, until the spec is made clear. My opinion is to test the current test suite as the current goal and then update in the future.

@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch from f2c891f to 0ca5d84 Compare March 10, 2025 13:29
@matt-phylum
Copy link

Encoding everything would be simpler but it would be a breaking change.

Regardless, / must not be encoded in the namespace. The PURL spec is weird and treats the namespace as a series of segments, but I think most implementations, this one included, treat it as a single value. The escaping for the single-value namespace and the name are necessarily different.

@ppkarwasz
Copy link
Contributor

Did you consider using o.a.c.codec.net.PercentCodec from Apache Commons Codec here?

If you shade PercentCodec and its dependencies, you'll end up with 8 additional classes and 10 kB of additional uncompressed bytecode. This is a nice example of proper shading, where you shade only a small (~3%) fraction of a dependency to profit from code reuse and automatic improvements.

Comment on lines 547 to 564
byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
int length = bytes.length;
ByteArrayOutputStream buffer = new ByteArrayOutputStream(length);
buffer.write(bytes, 0, percent);

for (int i = percent; i < length; i++) {
int b = bytes[i];

if (b == '%') {
if (i + 2 >= length) {
return null;
}

int b1 = Character.digit(bytes[++i], 16);
int b2 = Character.digit(bytes[++i], 16);
buffer.write((char) ((b1 << 4) + b2));
} else {
buffer.write(b);
Copy link
Collaborator

@jeremylong jeremylong Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a thought that we could improve the optimization of this even more:

Suggested change
byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
int length = bytes.length;
ByteArrayOutputStream buffer = new ByteArrayOutputStream(length);
buffer.write(bytes, 0, percent);
for (int i = percent; i < length; i++) {
int b = bytes[i];
if (b == '%') {
if (i + 2 >= length) {
return null;
}
int b1 = Character.digit(bytes[++i], 16);
int b2 = Character.digit(bytes[++i], 16);
buffer.write((char) ((b1 << 4) + b2));
} else {
buffer.write(b);
byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
int length = bytes.length;
int writePos = percent;
for (int i = percent; i < length; i++) {
int b = bytes[i];
if (b == '%') {
if (i + 2 >= length) {
return null;
}
int b1 = Character.digit(bytes[++i], 16);
int b2 = Character.digit(bytes[++i], 16);
if (b1 == -1 || b2 == -1) {
return null;
}
bytes[writePos++] = (byte) ((d1 << 4) + d2);
} else {
bytes[writePos++] = b;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only other comment on this method would be to replace the return null with something like throw new IllegalArgumentException("Invalid percent encoding at index " + i);.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppkarwasz I completely understand the benefits of proper shading. However, for a library like packageurl-java I am violently opposed to introducing any runtime dependencies - even if properly shaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that returning null was wrong. It's too bad that we can't throw a checked exception MalformedPackageURLException (or we can if you're OK with changing a public signature).

The other option is to be lenient and just ignore the '%'. I think I have seen some URI parsers do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use ValidationException instead of MalformedPackageURLException as in other parts of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ValidationException is unchecked. I actually don't like this and think the code should be changed to throw a checked exception, but that's off-topic for this pull request.

But, good point in that it doesn't change the method signature. I am in favor of deprecating the public method since these methods should be static anyway. It probably wasn't meant to be public since only decode is public but not encode, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I checked, and it's impossible to make it to this code with an invalid percent-encoding, as the URI constructor catches it. I added a test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can reach the code through the deprecated, but still public uriDecode method.

Currently new URI detects invalid percent encoding sequences, but it might be useful to check it again, in case the code is optimized not to use URI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the code now throws ValidationException. I added more tests.

}
}
return builder.toString();

return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
Copy link
Collaborator

@jeremylong jeremylong Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
return new String(bytes, 0, writePos, StandardCharsets.UTF_8));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the code to use ByteBuffer.

Copy link
Collaborator

@jeremylong jeremylong Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this should move to ByteBuffer. Doesn't that decrease the performance of the algorithm in comparison to #160 (comment) - with the above suggestion as the return? With ByteBuffer we have to loop over the entire array of bytes to count the number of % and then make a complete copy one byte at a time. If we kept closer to your original implementation with my small suggestion we skip having to copy the first portion of the byte array and then its just shifting the values down in the remaining portion of the array; never having to allocate a new buffer or loop over the bytes to count the number of % characters. Maybe I'm missing something.

@jeremylong
Copy link
Collaborator

@dwalluck the suggested optimization doesn't have to be done - but would you mind fixing the merge conflicts?

else {
builder.append(source.charAt(i));

byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
byte[] bytes = source.getBytes(StandardCharsets.US_ASCII);

The encoded URI should only contain ASCII characters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in incorrect behavior.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.

Meaning the proposed version would convert non-ASCII characters into ?. For example: pkg:generic/a?b=☃ becomes pkg:generic/a?b=?. If you want to reject non-ASCII characters you need to use a different conversion.

The spec says that PURLs are ASCII strings, but it seems like all parsers except maennchen/purl accept Unicode inputs, and it's probably better to continue accepting Unicode inputs:

althonos/packageurl.rs: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
anchore/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
giterlizzi/perl-URI-PackageURL: {"type": "generic", "name": "a", "qualifiers": {"b": "\u00e2\u0098\u0083"}}
maennchen/purl: {"error": "cannot parse due to reason invalid_uri: \":\""}
package-url/packageurl-dotnet: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-js: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-php: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-python: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-ruby: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
package-url/packageurl-swift: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
phylum-dev/purl: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}
sonatype/package-url-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the Java URI implementation also seems to tolerate non-blank Unicode characters outside of ASCII (cf. URI Javadoc and it defines a new other category of allowed characters:

The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method)  (Deviation from RFC 2396, which is limited to US-ASCII)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, use UTF_8 everywhere, then?

@dwalluck
Copy link
Contributor Author

dwalluck commented Mar 11, 2025

Did you consider using o.a.c.codec.net.PercentCodec from Apache Commons Codec here?

If you shade PercentCodec and its dependencies, you'll end up with 8 additional classes and 10 kB of additional uncompressed bytecode. This is a nice example of proper shading, where you shade only a small (~3%) fraction of a dependency to profit from code reuse and automatic improvements.

I think this one always converts + <-> ' '? Actually, it's a flag...

@ppkarwasz
Copy link
Contributor

@jeremylong,

@ppkarwasz I completely understand the benefits of proper shading. However, for a library like packageurl-java I am violently opposed to introducing any runtime dependencies - even if properly shaded.

Sure, I understand the drawbacks of shading, especially since Apache Commons uses a different license. 😉

Do you oppose any external dependency regardless of size (see #175 for example)?

@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch 2 times, most recently from 143e623 to 41b30c1 Compare March 11, 2025 19:54
@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch 10 times, most recently from a168ac2 to 169207a Compare March 11, 2025 21:55
Comment on lines 583 to 650
private static int getPercentCharCount(final byte[] bytes) {
return (int) IntStream.range(0, bytes.length).map(i -> bytes[i]).filter(PackageURL::isPercent).count();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives only an approximation of the number of encoded sequences, since invalid sequences like %%% can contain multiple percent characters.

Using a loop would allow you to skip two bytes after each % character.

Comment on lines 556 to 594
if (bytes[i] == PERCENT_CHAR) {
int b1 = Character.digit(bytes[++i], 16);
int b2 = Character.digit(bytes[++i], 16);
b = (byte) ((b1 << 4) + b2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe that checking if b1 or b2 is -1 here does not hurt…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test case? I know it's possible for Character::digit to return -1, but when will it in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uriDecode("%GH")? The method is deprecated, but still public.

It is not at all evident here, that bytes must only contain valid percent sequences. This is true, if this method is called from parse but not at all evident, so it merits at least a comment. Besides: this method already contains two throw statements.

}
return builder.toString();

return new String(buffer.array(), StandardCharsets.UTF_8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If buffer still has some unwritten bytes (i.e. there is a mismatch between its capacity and the number of written bytes), the string could have some trailing garbage.

StandardCharsets.UTF_8.encode(buffer.flip()).toString()

should decode only the bytes that were actually written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple ways to do it. You can also create the String with bounds.

Right now, it's counting the exact capacity it should need if the input is valid. I don't actually know if this is better than just setting an upper bound like the code was before. Or, make the code even more complex and copy in chunks to take advantage of System::arrayCopy. These inputs are in general so small that it doesn't matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple ways to do it. You can also create the String with bounds.

new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8) sounds great! 💯

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what my above suggestion did without needing to allocate additional resources or perform additional looping. But as I said above - maybe I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, can you restate what you said in case I missed something? I do remember a suggested change, but it looked incomplete and I did not understand how to change it.

I thought about changing ByteArrayOutputStream to ByteBuffer just because it's a newer API, but the performance probably isn't any better (vs. heap byte buffer). About StringBuilder, I am not as sure. If we're talking about performance then we'd need to benchmark it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem with the percent optimization: percent gives the index of the first percent character in Unicode codepoints, not in bytes. The index in bytes might be different if source is not ASCII.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worse than that. Java strings are not made of bytes or Unicode codepoints. indexOf gives the index of the Java char, which is not the index of the byte if the string is not ASCII and it's not the index of the code point either if the string contains surrogate pairs.

Copy link
Collaborator

@jeremylong jeremylong Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so maybe this:

    private String percentEncode(final String source) {
        if (source == null || source.isEmpty()) {
            return source;
        }
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        int length = bytes.length;
        int percent = -1;
        for (int i = 0; i < length-2; i++) {
            if (bytes[i] == PERCENT_CHAR) {
                percent = i;
            }
        }
        if (percent < 0) {
            return source;
        }
        int writePos = percent;
        for (int i = percent; i < length; i++) {
            byte b = bytes[i];
            if (b == PERCENT_CHAR) {
                if (i + 2 >= length) {
                    throw new ValidationException("Incomplete percent encoding at index " + i);
                }
                int b1 = Character.digit(bytes[++i], 16);
                int b2 = Character.digit(bytes[++i], 16);
                if (b1 == -1 || b2 == -1) {
                    throw new ValidationException("Invalid percent encoding at index " + (i - 2));
                }
                bytes[writePos++] = (byte) ((b1 << 4) + b2);
            } else {
                bytes[writePos++] = b;
            }
        }
        return new String(bytes, 0, writePos, StandardCharsets.UTF_8));
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can look for percent characters multiple times. At the beginning it is worth using String.indexOf:

if (source.indexOf(PERCENT_CHAR) == -1) {
    return source;
}

since this prevents a relatively expensive encoding of the string into byte[]. Later we can also skip all the bytes until the first percent % character. It is worth running some benchmarks though, since intuition can really be misleading (especially on a highly optimized JVM 23).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremylong @matt-phylum @ppkarwasz I think the latest version is good and I incorporated all of your suggestions:

Decoding and Encoding:

  • Split the input by char, so that System.arrayCopy is ultimately used instead of copying single bytes
  • Since we split, don't bother pre-counting char occurrences, just set an upper bounds on size if we need to
  • I could be wrong, but I think this might outperform the version which doesn't split the input by a factor of 2 or 3

Decoding:

  • Better error checking for bad percent encodings (short or invalid hex digits), moved to its own method
  • Reuse the existing array (I used ByteBuffer.wrap so it automatically tracks the writePos)

Encoding:

  • Use UTF_8 everywhere, not US_ASCII. In my benchmarks, ASCII was faster, but only slightly anyway

@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch from 169207a to 0a11b17 Compare March 12, 2025 19:43
@ppkarwasz
Copy link
Contributor

I ran a small JMH benchmark to check the performance difference in decoding between byte[] and ByteBuffer and the former is a clear winner:

Benchmark                                     Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.percentDecodeArray   avgt   25  1163,624 ± 32,962  us/op
PercentEncodingBenchmark.percentDecodeBuffer  avgt   25  2344,671 ± 22,177  us/op

(each iteration decodes 1000 string of 256 characters, where 10% are not ASCII and the rest letters)

@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch from 0a11b17 to c3f02af Compare March 13, 2025 21:54
@ppkarwasz
Copy link
Contributor

Using the benchmark in the feature/benchmark branch of ppkarwasz/packageurl-java I get the following results:

Benchmark                                     Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.baselineArray        avgt    5   585,449 ± 15,611  us/op
PercentEncodingBenchmark.baselineBuffer       avgt    5  1751,518 ± 38,290  us/op
PercentEncodingBenchmark.percentDecodeArray   avgt    5  1075,091 ± 29,880  us/op
PercentEncodingBenchmark.percentDecodeBuffer  avgt    5  2401,876 ± 97,551  us/op
PercentEncodingBenchmark.percentEncodeArray   avgt    5  1213,212 ± 12,130  us/op
PercentEncodingBenchmark.percentEncodeBuffer  avgt    5  1198,758 ± 50,471  us/op

Unless there are bugs in the benchmark (which is quite probable), UTF_8.encode(String) is very slow, which slows down baselineBuffer and percentDecodeBuffer. The baseline benchmarks convert a string into bytes and back without any other modification.

The methods `uriEncode` and `uriDecode` did not properly handle
percent-encoding. In particular, `uriEncode` didn't properly output two
uppercase hex digits and `urlDecode` did not properly handle non-ASCII
characters.

Aditionally, if no percent-encoding was performed, these methods will
now return the original string.

Fixes package-url#150
Closes package-url#153
Fixes package-url#154
@dwalluck dwalluck force-pushed the fix-url-encode-and-decode branch from c3f02af to 78f897b Compare March 14, 2025 15:10
@dwalluck
Copy link
Contributor Author

Unless there are bugs in the benchmark (which is quite probable), UTF_8.encode(String) is very slow, which slows down baselineBuffer and percentDecodeBuffer. The baseline benchmarks convert a string into bytes and back without any other modification.

This is what I found, too. It is probably faster if you just need a CharSequence, but if you're going to convert it into a String anyway, then you should do that directly.

Copy link
Collaborator

@jeremylong jeremylong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ppkarwasz
Copy link
Contributor

It looks a little bit over-engineered (as if it contained suggestions from 3 different people 😉), but it fixes the problem, which is the important part.

LGTM

@jeremylong jeremylong merged commit f2e1d7a into package-url:master Mar 16, 2025
2 checks passed
dwalluck added a commit to dwalluck/packageurl-java that referenced this pull request Mar 16, 2025
@dwalluck dwalluck deleted the fix-url-encode-and-decode branch April 15, 2025 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants