Fix URL encoding and decoding #160

dwalluck · 2025-02-20T21:32:05Z

The methods uriEncode and uriDecode did not properly handle percent-encoding. In particular, uriEncode didn't properly output two uppercase hex digits and urlDecode did not properly handle non-ASCII characters.

Aditionally, if no percent-encoding was performed, these methods will now return the original string.

Fixes #150
Closes #153
Fixes #154

jeremylong

One minor suggestion - but overall this looks great.

src/main/java/com/github/packageurl/PackageURL.java

matt-phylum

This PR improves but does not fix URL encoding. The set of encoded characters varies by the PURL component, but this code is not aware of what component it is encoding. For example, this implementation will incorrectly encode a PURL like the proposed pkg:go/example.com%2Fwidgets as pkg:go/example.com/widgets.

src/main/java/com/github/packageurl/PackageURL.java

dwalluck · 2025-03-10T13:25:47Z

This PR improves but does not fix URL encoding. The set of encoded characters varies by the PURL component, but this code is not aware of what component it is encoding. For example, this implementation will incorrectly encode a PURL like the proposed pkg:go/example.com%2Fwidgets as pkg:go/example.com/widgets.

It fixes it, in general. Which characters to encode is a separate issue, see #174, with the goal being to pass the test suite.

The test suite is still inconsistent. I talked with the purl spec team recently, and the intended goal actually seems to be to encode all characters except the ASCII letter/digit and a few others mentioned. This means that '/' and ':' would always be encoded everywhere, which isn't the case right now, and it's not worth chasing it really, until the spec is made clear. My opinion is to test the current test suite as the current goal and then update in the future.

matt-phylum · 2025-03-10T13:39:15Z

Encoding everything would be simpler but it would be a breaking change.

Regardless, / must not be encoded in the namespace. The PURL spec is weird and treats the namespace as a series of segments, but I think most implementations, this one included, treat it as a single value. The escaping for the single-value namespace and the name are necessarily different.

src/main/java/com/github/packageurl/PackageURL.java

ppkarwasz · 2025-03-11T10:26:57Z

Did you consider using o.a.c.codec.net.PercentCodec from Apache Commons Codec here?

If you shade PercentCodec and its dependencies, you'll end up with 8 additional classes and 10 kB of additional uncompressed bytecode. This is a nice example of proper shading, where you shade only a small (~3%) fraction of a dependency to profit from code reuse and automatic improvements.

jeremylong · 2025-03-11T10:37:00Z

src/main/java/com/github/packageurl/PackageURL.java

+        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
+        int length = bytes.length;
+        ByteArrayOutputStream buffer = new ByteArrayOutputStream(length);
+        buffer.write(bytes, 0, percent);
+
+        for (int i = percent; i < length; i++) {
+            int b = bytes[i];
+
+            if (b == '%') {
+                    if (i + 2 >= length) {
+                        return null;
+                    }
+
+                    int b1 = Character.digit(bytes[++i], 16);
+                    int b2 = Character.digit(bytes[++i], 16);
+                    buffer.write((char) ((b1 << 4) + b2));
+            } else {
+                buffer.write(b);


I had a thought that we could improve the optimization of this even more:

Suggested change

byte[] bytes = source.getBytes(StandardCharsets.UTF_8);

int length = bytes.length;

ByteArrayOutputStream buffer = new ByteArrayOutputStream(length);

buffer.write(bytes, 0, percent);

for (int i = percent; i < length; i++) {

int b = bytes[i];

if (b == '%') {

if (i + 2 >= length) {

return null;

}

int b1 = Character.digit(bytes[++i], 16);

int b2 = Character.digit(bytes[++i], 16);

buffer.write((char) ((b1 << 4) + b2));

} else {

buffer.write(b);

byte[] bytes = source.getBytes(StandardCharsets.UTF_8);

int length = bytes.length;

int writePos = percent;

for (int i = percent; i < length; i++) {

int b = bytes[i];

if (b == '%') {

if (i + 2 >= length) {

return null;

}

int b1 = Character.digit(bytes[++i], 16);

int b2 = Character.digit(bytes[++i], 16);

if (b1 == -1 || b2 == -1) {

return null;

}

bytes[writePos++] = (byte) ((d1 << 4) + d2);

} else {

bytes[writePos++] = b;

My only other comment on this method would be to replace the return null with something like throw new IllegalArgumentException("Invalid percent encoding at index " + i);.

@ppkarwasz I completely understand the benefits of proper shading. However, for a library like packageurl-java I am violently opposed to introducing any runtime dependencies - even if properly shaded.

I agree that returning null was wrong. It's too bad that we can't throw a checked exception MalformedPackageURLException (or we can if you're OK with changing a public signature).

The other option is to be lenient and just ignore the '%'. I think I have seen some URI parsers do this.

You can use ValidationException instead of MalformedPackageURLException as in other parts of the code.

I think ValidationException is unchecked. I actually don't like this and think the code should be changed to throw a checked exception, but that's off-topic for this pull request.

But, good point in that it doesn't change the method signature. I am in favor of deprecating the public method since these methods should be static anyway. It probably wasn't meant to be public since only decode is public but not encode, too?

So, I checked, and it's impossible to make it to this code with an invalid percent-encoding, as the URI constructor catches it. I added a test.

You can reach the code through the deprecated, but still public uriDecode method.

Currently new URI detects invalid percent encoding sequences, but it might be useful to check it again, in case the code is optimized not to use URI.

OK, the code now throws ValidationException. I added more tests.

jeremylong · 2025-03-11T10:37:12Z

src/main/java/com/github/packageurl/PackageURL.java

            }
        }
-        return builder.toString();
+
+        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);


Suggested change

return new String(buffer.toByteArray(), StandardCharsets.UTF_8);

return new String(bytes, 0, writePos, StandardCharsets.UTF_8));

I've changed the code to use ByteBuffer.

I don't understand why this should move to ByteBuffer. Doesn't that decrease the performance of the algorithm in comparison to #160 (comment) - with the above suggestion as the return? With ByteBuffer we have to loop over the entire array of bytes to count the number of % and then make a complete copy one byte at a time. If we kept closer to your original implementation with my small suggestion we skip having to copy the first portion of the byte array and then its just shifting the values down in the remaining portion of the array; never having to allocate a new buffer or loop over the bytes to count the number of % characters. Maybe I'm missing something.

jeremylong · 2025-03-11T10:41:41Z

@dwalluck the suggested optimization doesn't have to be done - but would you mind fixing the merge conflicts?

ppkarwasz · 2025-03-11T11:03:29Z

src/main/java/com/github/packageurl/PackageURL.java

-            else {
-                builder.append(source.charAt(i));
+
+        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);


Suggested change

byte[] bytes = source.getBytes(StandardCharsets.UTF_8);

byte[] bytes = source.getBytes(StandardCharsets.US_ASCII);

The encoded URI should only contain ASCII characters.

This results in incorrect behavior.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.

Meaning the proposed version would convert non-ASCII characters into ?. For example: pkg:generic/a?b=☃ becomes pkg:generic/a?b=?. If you want to reject non-ASCII characters you need to use a different conversion.

The spec says that PURLs are ASCII strings, but it seems like all parsers except maennchen/purl accept Unicode inputs, and it's probably better to continue accepting Unicode inputs:

althonos/packageurl.rs: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} anchore/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} giterlizzi/perl-URI-PackageURL: {"type": "generic", "name": "a", "qualifiers": {"b": "\u00e2\u0098\u0083"}} maennchen/purl: {"error": "cannot parse due to reason invalid_uri: \":\""} package-url/packageurl-dotnet: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-go: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-js: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-php: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-python: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-ruby: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} package-url/packageurl-swift: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} phylum-dev/purl: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}} sonatype/package-url-java: {"type": "generic", "name": "a", "qualifiers": {"b": "\u2603"}}

Good point, the Java URI implementation also seems to tolerate non-blank Unicode characters outside of ASCII (cf. URI Javadoc and it defines a new other category of allowed characters:

The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

So, use UTF_8 everywhere, then?

dwalluck · 2025-03-11T13:50:01Z

Did you consider using o.a.c.codec.net.PercentCodec from Apache Commons Codec here?

If you shade PercentCodec and its dependencies, you'll end up with 8 additional classes and 10 kB of additional uncompressed bytecode. This is a nice example of proper shading, where you shade only a small (~3%) fraction of a dependency to profit from code reuse and automatic improvements.

I think this one always converts + <-> ' '? Actually, it's a flag...

ppkarwasz · 2025-03-11T14:34:20Z

@jeremylong,

@ppkarwasz I completely understand the benefits of proper shading. However, for a library like packageurl-java I am violently opposed to introducing any runtime dependencies - even if properly shaded.

Sure, I understand the drawbacks of shading, especially since Apache Commons uses a different license. 😉

Do you oppose any external dependency regardless of size (see #175 for example)?

ppkarwasz · 2025-03-11T22:34:53Z

src/main/java/com/github/packageurl/PackageURL.java

+    private static int getPercentCharCount(final byte[] bytes) {
+        return (int)  IntStream.range(0, bytes.length).map(i -> bytes[i]).filter(PackageURL::isPercent).count();
+    }


This gives only an approximation of the number of encoded sequences, since invalid sequences like %%% can contain multiple percent characters.

Using a loop would allow you to skip two bytes after each % character.

ppkarwasz · 2025-03-11T22:36:05Z

src/main/java/com/github/packageurl/PackageURL.java

+            if (bytes[i] == PERCENT_CHAR) {
+                int b1 = Character.digit(bytes[++i], 16);
+                int b2 = Character.digit(bytes[++i], 16);
+                b = (byte) ((b1 << 4) + b2);


I still believe that checking if b1 or b2 is -1 here does not hurt…

Do we have a test case? I know it's possible for Character::digit to return -1, but when will it in this case?

uriDecode("%GH")? The method is deprecated, but still public.

It is not at all evident here, that bytes must only contain valid percent sequences. This is true, if this method is called from parse but not at all evident, so it merits at least a comment. Besides: this method already contains two throw statements.

ppkarwasz · 2025-03-11T22:40:42Z

src/main/java/com/github/packageurl/PackageURL.java

        }
-        return builder.toString();
+
+        return new String(buffer.array(), StandardCharsets.UTF_8);


If buffer still has some unwritten bytes (i.e. there is a mismatch between its capacity and the number of written bytes), the string could have some trailing garbage.

StandardCharsets.UTF_8.encode(buffer.flip()).toString()

should decode only the bytes that were actually written.

There are a couple ways to do it. You can also create the String with bounds.

Right now, it's counting the exact capacity it should need if the input is valid. I don't actually know if this is better than just setting an upper bound like the code was before. Or, make the code even more complex and copy in chunks to take advantage of System::arrayCopy. These inputs are in general so small that it doesn't matter.

There are a couple ways to do it. You can also create the String with bounds.

new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8) sounds great! 💯

This is what my above suggestion did without needing to allocate additional resources or perform additional looping. But as I said above - maybe I'm missing something.

Sorry, can you restate what you said in case I missed something? I do remember a suggested change, but it looked incomplete and I did not understand how to change it.

I thought about changing ByteArrayOutputStream to ByteBuffer just because it's a newer API, but the performance probably isn't any better (vs. heap byte buffer). About StringBuilder, I am not as sure. If we're talking about performance then we'd need to benchmark it.

There is a problem with the percent optimization: percent gives the index of the first percent character in Unicode codepoints, not in bytes. The index in bytes might be different if source is not ASCII.

It's worse than that. Java strings are not made of bytes or Unicode codepoints. indexOf gives the index of the Java char, which is not the index of the byte if the string is not ASCII and it's not the index of the code point either if the string contains surrogate pairs.

so maybe this:

private String percentEncode(final String source) { if (source == null || source.isEmpty()) { return source; } byte[] bytes = source.getBytes(StandardCharsets.UTF_8); int length = bytes.length; int percent = -1; for (int i = 0; i < length-2; i++) { if (bytes[i] == PERCENT_CHAR) { percent = i; } } if (percent < 0) { return source; } int writePos = percent; for (int i = percent; i < length; i++) { byte b = bytes[i]; if (b == PERCENT_CHAR) { if (i + 2 >= length) { throw new ValidationException("Incomplete percent encoding at index " + i); } int b1 = Character.digit(bytes[++i], 16); int b2 = Character.digit(bytes[++i], 16); if (b1 == -1 || b2 == -1) { throw new ValidationException("Invalid percent encoding at index " + (i - 2)); } bytes[writePos++] = (byte) ((b1 << 4) + b2); } else { bytes[writePos++] = b; } } return new String(bytes, 0, writePos, StandardCharsets.UTF_8)); }

I think that we can look for percent characters multiple times. At the beginning it is worth using String.indexOf:

if (source.indexOf(PERCENT_CHAR) == -1) { return source; }

since this prevents a relatively expensive encoding of the string into byte[]. Later we can also skip all the bytes until the first percent % character. It is worth running some benchmarks though, since intuition can really be misleading (especially on a highly optimized JVM 23).

@jeremylong @matt-phylum @ppkarwasz I think the latest version is good and I incorporated all of your suggestions:

Decoding and Encoding:

Split the input by char, so that System.arrayCopy is ultimately used instead of copying single bytes

Since we split, don't bother pre-counting char occurrences, just set an upper bounds on size if we need to

I could be wrong, but I think this might outperform the version which doesn't split the input by a factor of 2 or 3

Decoding:

Better error checking for bad percent encodings (short or invalid hex digits), moved to its own method

Reuse the existing array (I used ByteBuffer.wrap so it automatically tracks the writePos)

Encoding:

Use UTF_8 everywhere, not US_ASCII. In my benchmarks, ASCII was faster, but only slightly anyway

src/main/java/com/github/packageurl/PackageURL.java

ppkarwasz · 2025-03-13T19:53:35Z

I ran a small JMH benchmark to check the performance difference in decoding between byte[] and ByteBuffer and the former is a clear winner:

Benchmark                                     Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.percentDecodeArray   avgt   25  1163,624 ± 32,962  us/op
PercentEncodingBenchmark.percentDecodeBuffer  avgt   25  2344,671 ± 22,177  us/op

(each iteration decodes 1000 string of 256 characters, where 10% are not ASCII and the rest letters)

ppkarwasz · 2025-03-14T08:00:05Z

Using the benchmark in the feature/benchmark branch of ppkarwasz/packageurl-java I get the following results:

Benchmark                                     Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.baselineArray        avgt    5   585,449 ± 15,611  us/op
PercentEncodingBenchmark.baselineBuffer       avgt    5  1751,518 ± 38,290  us/op
PercentEncodingBenchmark.percentDecodeArray   avgt    5  1075,091 ± 29,880  us/op
PercentEncodingBenchmark.percentDecodeBuffer  avgt    5  2401,876 ± 97,551  us/op
PercentEncodingBenchmark.percentEncodeArray   avgt    5  1213,212 ± 12,130  us/op
PercentEncodingBenchmark.percentEncodeBuffer  avgt    5  1198,758 ± 50,471  us/op

Unless there are bugs in the benchmark (which is quite probable), UTF_8.encode(String) is very slow, which slows down baselineBuffer and percentDecodeBuffer. The baseline benchmarks convert a string into bytes and back without any other modification.

The methods `uriEncode` and `uriDecode` did not properly handle percent-encoding. In particular, `uriEncode` didn't properly output two uppercase hex digits and `urlDecode` did not properly handle non-ASCII characters. Aditionally, if no percent-encoding was performed, these methods will now return the original string. Fixes package-url#150 Closes package-url#153 Fixes package-url#154

dwalluck · 2025-03-14T15:13:01Z

Unless there are bugs in the benchmark (which is quite probable), UTF_8.encode(String) is very slow, which slows down baselineBuffer and percentDecodeBuffer. The baseline benchmarks convert a string into bytes and back without any other modification.

This is what I found, too. It is probably faster if you just need a CharSequence, but if you're going to convert it into a String anyway, then you should do that directly.

jeremylong

LGTM

ppkarwasz · 2025-03-14T22:44:48Z

It looks a little bit over-engineered (as if it contained suggestions from 3 different people 😉), but it fixes the problem, which is the important part.

LGTM

jeremylong requested changes Mar 9, 2025

View reviewed changes

src/main/java/com/github/packageurl/PackageURL.java Outdated Show resolved Hide resolved

jeremylong mentioned this pull request Mar 9, 2025

fix encoding of non-ascii characters #153

Closed

matt-phylum reviewed Mar 10, 2025

View reviewed changes

src/main/java/com/github/packageurl/PackageURL.java Outdated Show resolved Hide resolved

src/main/java/com/github/packageurl/PackageURL.java Outdated Show resolved Hide resolved

src/main/java/com/github/packageurl/PackageURL.java Outdated Show resolved Hide resolved

dwalluck force-pushed the fix-url-encode-and-decode branch from f2c891f to 0ca5d84 Compare March 10, 2025 13:29

matt-phylum reviewed Mar 10, 2025

View reviewed changes

src/main/java/com/github/packageurl/PackageURL.java Show resolved Hide resolved

ppkarwasz reviewed Mar 11, 2025

View reviewed changes

src/main/java/com/github/packageurl/PackageURL.java Outdated Show resolved Hide resolved

jeremylong reviewed Mar 11, 2025

View reviewed changes

ppkarwasz reviewed Mar 11, 2025

View reviewed changes

dwalluck force-pushed the fix-url-encode-and-decode branch 2 times, most recently from 143e623 to 41b30c1 Compare March 11, 2025 19:54

matt-phylum mentioned this pull request Mar 11, 2025

Update 'qualifiers' rule in core spec #382 package-url/purl-spec#398

Merged

dwalluck force-pushed the fix-url-encode-and-decode branch 10 times, most recently from a168ac2 to 169207a Compare March 11, 2025 21:55

ppkarwasz requested changes Mar 11, 2025

View reviewed changes

dwalluck force-pushed the fix-url-encode-and-decode branch from 169207a to 0a11b17 Compare March 12, 2025 19:43

dwalluck force-pushed the fix-url-encode-and-decode branch from 0a11b17 to c3f02af Compare March 13, 2025 21:54

dwalluck force-pushed the fix-url-encode-and-decode branch from c3f02af to 78f897b Compare March 14, 2025 15:10

jeremylong approved these changes Mar 14, 2025

View reviewed changes

jeremylong merged commit f2e1d7a into package-url:master Mar 16, 2025
2 checks passed

dwalluck added a commit to dwalluck/packageurl-java that referenced this pull request Mar 16, 2025

fix(test): fix compilation error introduced in package-url#160

ebef73e

jeremylong pushed a commit that referenced this pull request Mar 16, 2025

fix(test): fix compilation error introduced in #160 (#205)

a5032d1

dwalluck deleted the fix-url-encode-and-decode branch April 15, 2025 21:38

	return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
	return new String(bytes, 0, writePos, StandardCharsets.UTF_8));

	byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
	byte[] bytes = source.getBytes(StandardCharsets.US_ASCII);

Fix URL encoding and decoding #160

Fix URL encoding and decoding #160

Uh oh!

Conversation

dwalluck commented Feb 20, 2025

Uh oh!

jeremylong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

matt-phylum left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dwalluck commented Mar 10, 2025

Uh oh!

matt-phylum commented Mar 10, 2025

Uh oh!

Uh oh!

Uh oh!

ppkarwasz commented Mar 11, 2025

Uh oh!

jeremylong Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeremylong Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeremylong Mar 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeremylong commented Mar 11, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dwalluck commented Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ppkarwasz commented Mar 11, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jeremylong Mar 11, 2025 •

edited

Loading

jeremylong Mar 11, 2025 •

edited

Loading

jeremylong Mar 13, 2025 •

edited

Loading

dwalluck commented Mar 11, 2025 •

edited

Loading

jeremylong Mar 13, 2025 •

edited

Loading