
Improve packed field decoding #959


Merged: 20 commits, May 6, 2025


osa1 (Member) commented Feb 10, 2025:

  • Inline _readPacked by hand, and _withLimit via a pragma, to eliminate
    closure allocations and calls in packed decoding loops.

  • Introduce PbList._addUnchecked, which adds to the list without checking
    the value for validity or the list for mutability.

  • When decoding a packed field, check the list's mutability once instead of
    for every element.

  • When decoding a packed scalar field, don't check the values for validity.

    For scalar fields, we only need to make sure the value is not null, which
    the call sites already guarantee, as e.g. input.readDouble doesn't return
    a nullable value.

  • Add prefer-inline pragmas so that the VM reliably inlines one-liners.
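To make the optimizations above concrete, here is a minimal Dart sketch under simplifying assumptions. SketchReader, SketchList, decodePackedWithClosure and decodePackedInlined are hypothetical names, not the protobuf package's API; _addUnchecked and _checkModifiable mirror the PbList members mentioned in this PR in name only, and each packed bool is assumed to occupy exactly one byte. The closure version mirrors the old shape (a closure invoked per element, every element going through the fully checked add), while the inlined version checks mutability once and then appends without per-element checks.

```dart
// Hypothetical, simplified model; not the protobuf package's real classes.
class SketchReader {
  final List<int> _bytes;
  int _pos = 0;
  SketchReader(this._bytes);

  bool get isAtEnd => _pos >= _bytes.length;

  // Assumption: one byte per packed bool (0 or 1).
  bool readBool() => _bytes[_pos++] != 0;
}

class SketchList<E> {
  final List<E> _wrappedList = <E>[];
  final void Function(E value) _check;
  bool _isReadOnly = false;

  SketchList(this._check);

  int get length => _wrappedList.length;
  E operator [](int index) => _wrappedList[index];
  void freeze() => _isReadOnly = true;

  void _checkModifiable(String methodName) {
    if (_isReadOnly) {
      throw UnsupportedError("Can't call $methodName on a read-only list");
    }
  }

  // Public path: checks mutability and element validity on every call.
  void add(E value) {
    _checkModifiable('add');
    _check(value);
    _wrappedList.add(value);
  }

  // Decoder-only path: the caller has already done the necessary checks.
  @pragma('vm:prefer-inline')
  void _addUnchecked(E value) => _wrappedList.add(value);
}

// Old shape: a closure capturing `list` and `reader` is created per packed
// field and invoked for every element; each element uses the checked add().
void decodePackedWithClosure(SketchReader reader, SketchList<bool> list) {
  void readElement() => list.add(reader.readBool());
  while (!reader.isAtEnd) {
    readElement();
  }
}

// New shape: the loop is written out directly, mutability is checked once,
// and elements are appended without per-element validity checks.
void decodePackedInlined(SketchReader reader, SketchList<bool> list) {
  list._checkModifiable('add');
  while (!reader.isAtEnd) {
    list._addUnchecked(reader.readBool());
  }
}

void main() {
  final a = SketchList<bool>((_) {});
  final b = SketchList<bool>((_) {});
  decodePackedWithClosure(SketchReader([1, 0, 1]), a);
  decodePackedInlined(SketchReader([1, 0, 1]), b);
  print('${a.length} ${b.length} ${b[0]} ${b[1]}'); // 3 3 true false
}
```

Both functions produce the same list contents; a frozen list still fails on the first (and only) mutability check in the inlined version.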

VM benchmarks before:

protobuf_PackedInt32Decoding(RunTimeRaw): 25598.8125 us.
protobuf_PackedInt64Decoding(RunTimeRaw): 67932.43333333333 us.
protobuf_PackedUint32Decoding(RunTimeRaw): 24668.844444444443 us.
protobuf_PackedUint64Decoding(RunTimeRaw): 64615.066666666666 us.
protobuf_PackedSint32Decoding(RunTimeRaw): 26037.275 us.
protobuf_PackedSint64Decoding(RunTimeRaw): 100819.65 us.
protobuf_PackedBoolDecoding(RunTimeRaw): 34733.4 us.
protobuf_PackedEnumDecoding(RunTimeRaw): 48379.659999999996 us.

VM benchmarks after:

protobuf_PackedInt32Decoding(RunTimeRaw): 19653.9 us.
protobuf_PackedInt64Decoding(RunTimeRaw): 48627.9 us.
protobuf_PackedUint32Decoding(RunTimeRaw): 19279.29090909091 us.
protobuf_PackedUint64Decoding(RunTimeRaw): 50681.8 us.
protobuf_PackedSint32Decoding(RunTimeRaw): 20271.854545454546 us.
protobuf_PackedSint64Decoding(RunTimeRaw): 83777.8 us.
protobuf_PackedBoolDecoding(RunTimeRaw): 24850.555555555555 us.
protobuf_PackedEnumDecoding(RunTimeRaw): 45205.659999999996 us.

Wasm benchmarks before (-O2):

protobuf_PackedInt32Decoding(RunTimeRaw): 64220.0 us.
protobuf_PackedInt64Decoding(RunTimeRaw): 81033.33333333334 us.
protobuf_PackedUint32Decoding(RunTimeRaw): 60800.0 us.
protobuf_PackedUint64Decoding(RunTimeRaw): 82700.0 us.
protobuf_PackedSint32Decoding(RunTimeRaw): 72433.33333333334 us.
protobuf_PackedSint64Decoding(RunTimeRaw): 142150.0 us.
protobuf_PackedBoolDecoding(RunTimeRaw): 27775.0 us.
protobuf_PackedEnumDecoding(RunTimeRaw): 43980.0 us.

Wasm benchmarks after:

protobuf_PackedInt32Decoding(RunTimeRaw): 56050.0 us.
protobuf_PackedInt64Decoding(RunTimeRaw): 74633.33333333334 us.
protobuf_PackedUint32Decoding(RunTimeRaw): 56525.0 us.
protobuf_PackedUint64Decoding(RunTimeRaw): 69400.0 us.
protobuf_PackedSint32Decoding(RunTimeRaw): 51925.0 us.
protobuf_PackedSint64Decoding(RunTimeRaw): 116250.0 us.
protobuf_PackedBoolDecoding(RunTimeRaw): 18427.272727272728 us.
protobuf_PackedEnumDecoding(RunTimeRaw): 41600.0 us.
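The relative change in run time can be computed directly from the numbers above. The Dart sketch below is a hypothetical helper (percentChange and report are illustrative names) that uses only the reported values; negative percentages mean the benchmark got faster.

```dart
/// Percentage change in run time from [before] to [after]; negative is faster.
double percentChange(double before, double after) =>
    (after - before) / before * 100;

void report(String label, Map<String, List<double>> results) {
  print(label);
  results.forEach((name, r) {
    print('  $name: ${percentChange(r[0], r[1]).toStringAsFixed(1)}%');
  });
}

void main() {
  // [before, after] run times in microseconds, copied from the description.
  final vm = <String, List<double>>{
    'PackedInt32Decoding': [25598.8125, 19653.9],
    'PackedInt64Decoding': [67932.43333333333, 48627.9],
    'PackedUint32Decoding': [24668.844444444443, 19279.29090909091],
    'PackedUint64Decoding': [64615.066666666666, 50681.8],
    'PackedSint32Decoding': [26037.275, 20271.854545454546],
    'PackedSint64Decoding': [100819.65, 83777.8],
    'PackedBoolDecoding': [34733.4, 24850.555555555555],
    'PackedEnumDecoding': [48379.659999999996, 45205.659999999996],
  };
  final wasm = <String, List<double>>{
    'PackedInt32Decoding': [64220.0, 56050.0],
    'PackedInt64Decoding': [81033.33333333334, 74633.33333333334],
    'PackedUint32Decoding': [60800.0, 56525.0],
    'PackedUint64Decoding': [82700.0, 69400.0],
    'PackedSint32Decoding': [72433.33333333334, 51925.0],
    'PackedSint64Decoding': [142150.0, 116250.0],
    'PackedBoolDecoding': [27775.0, 18427.272727272728],
    'PackedEnumDecoding': [43980.0, 41600.0],
  };
  report('VM', vm);
  report('Wasm (-O2)', wasm);
}
```

For example, VM PackedInt32Decoding goes from 25598.8125 us to 19653.9 us, roughly a 23% reduction in run time.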

cl/755309114

osa1 marked this pull request as draft on February 11, 2025 10:26
mkustermann (Collaborator) commented:

It seems that in dart2wasm we have:

// before
protobuf_PackedInt64Decoding(RunTimeRaw): 77833.33333333334 us.

// after
protobuf_PackedInt64Decoding(RunTimeRaw): 90650.0 us.

That's a regression; is this expected?

mkustermann (Collaborator) left a review:

LGTM with comments.

Review comment on the following diff context:

}
break;
case PbFieldType._REPEATED_BYTES:
final list = fs._ensureRepeatedField(meta, fi);
list.add(input.readBytes());
list._checkModifiable('add');
mkustermann (Collaborator):

This seems to be hand-inlined code that omits the _check call. It's not very nice that we have to copy and paste this so many times.

Why not pass a bool that controls whether to check, and force-inline the function? E.g.:

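_addInternal(input.readBytes(), omitCheck: true);

// One prefer-inline pragma per compiler: VM, dart2js and dart2wasm.
@pragma('vm:prefer-inline')
@pragma('dart2js:prefer-inline')
@pragma('wasm:prefer-inline')
void _addInternal(E value, {bool omitCheck = false}) {
  _checkModifiable('add');
  // With a constant omitCheck at an inlined call site, this branch can fold away.
  if (!omitCheck) _check(value);
  _wrappedList.add(value);
}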

osa1 (Member, Author):

This function wouldn't work for the packed-field code below, where we check modifiability once before adding the elements and then never check it again while adding them.

We could have this version:

void _addInternal(E value,
    {bool omitElementCheck = false, bool omitModifiableCheck = false}) {
  // Each check can be skipped independently by the caller.
  if (!omitModifiableCheck) _checkModifiable('add');
  if (!omitElementCheck) _check(value);
  _wrappedList.add(value);
}

But we would still need a separate _checkModifiable call for packed fields, since we want to run it only once, before the loop.

Overall, I'm not sure this version is better than the current one.
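A minimal Dart sketch of how the two-flag variant would be used, assuming a simplified stand-in for PbList. FlagList, decodeRepeatedElement and decodePacked are hypothetical names; only the flag names come from the comment above. The checkModifiableCalls counter exists only to show that the packed path still needs its own up-front _checkModifiable call, which the flags alone cannot express.

```dart
// Hypothetical stand-in for PbList; not the protobuf package's real class.
class FlagList<E> {
  final List<E> _wrappedList = <E>[];
  final void Function(E value) _check;
  bool _isReadOnly = false;
  int checkModifiableCalls = 0; // counts how often the mutability check runs

  FlagList(this._check);

  int get length => _wrappedList.length;

  void _checkModifiable(String methodName) {
    checkModifiableCalls++;
    if (_isReadOnly) {
      throw UnsupportedError("Can't call $methodName on a read-only list");
    }
  }

  @pragma('vm:prefer-inline')
  void _addInternal(E value,
      {bool omitElementCheck = false, bool omitModifiableCheck = false}) {
    if (!omitModifiableCheck) _checkModifiable('add');
    if (!omitElementCheck) _check(value);
    _wrappedList.add(value);
  }
}

// Non-packed repeated field: one element per tag, so the mutability check
// runs on each add; only the element check is omitted.
void decodeRepeatedElement(FlagList<int> list, int value) {
  list._addInternal(value, omitElementCheck: true);
}

// Packed field: "check once, then add many" still needs an explicit
// _checkModifiable call before the loop.
void decodePacked(FlagList<int> list, List<int> values) {
  list._checkModifiable('add');
  for (final v in values) {
    list._addInternal(v, omitElementCheck: true, omitModifiableCheck: true);
  }
}

void main() {
  final list = FlagList<int>((_) {});
  decodePacked(list, [1, 2, 3, 4]);
  decodeRepeatedElement(list, 5);
  print('${list.length} ${list.checkModifiableCalls}'); // 5 2
}
```

Four packed elements cost one mutability check, while the single non-packed element costs another, which is the extra call site the comment refers to.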

osa1 (Member, Author) commented May 6, 2025:

> It seems in dart2wasm we have
>
> // before
> protobuf_PackedInt64Decoding(RunTimeRaw): 77833.33333333334 us.
>
> // after
> protobuf_PackedInt64Decoding(RunTimeRaw): 90650.0 us.
>
> i.e. a regression, is this expected?

Maybe my CPU was busy with other work during that run; when I rerun the benchmarks, I consistently get better numbers with this PR. I've updated the PR description with benchmark numbers from a fresh run.

osa1 (Member, Author) commented May 6, 2025:

I tested this internally in cl/755309114. Merging ...

osa1 merged commit 9daf5ca into google:master on May 6, 2025.
17 checks passed.
osa1 deleted the vm_inlines branch on May 6, 2025 16:07.