cmd/compile: on amd64, compiler's Zero operation is slower than memclrNoHeapPointers for 1024-2048 bytes

cc @randall77 @mknyszek

On amd64, it looks like memclrNoHeapPointers performs better than the code the compiler substitutes for it with a constant-length "Zero" SSA operation, for sizes between 1024 and 2048 bytes. In that range, memclrNoHeapPointers uses AVX2 instructions to do the clear and does not switch to rep stos until the size is at least 2048 bytes, while the compiler generates a rep stos.

There's a comment in the assembly memclrNoHeapPointers about why that's done:
[https://github.com/golang/go/blob/96a6e147b2b02b1f070d559cb2c8e1c25c9b78c3/src/runtime/memclr_amd64.s#L47](https://github.com/golang/go/blob/96a6e147b2b02b1f070d559cb2c8e1c25c9b78c3/src/runtime/memclr_amd64.s#L47)

I put together [CL 681496](https://go.dev/cl/681496) with some benchmarks for those sizes and ran it on the C3 perf gomotes. I put the results in the change description.
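For anyone who wants to poke at this locally, here is a rough sketch of the kind of comparison involved. It is not the benchmark from the CL: the size 1536 and the names `clearConst` and `clearVar` are made up for illustration, and whether the constant-size store really becomes a Zero op (and how that op is lowered) depends on the compiler version, so check the generated assembly before trusting the numbers.

```go
package main

import (
	"fmt"
	"testing"
)

// size is a hypothetical choice inside the 1024-2048 byte range.
const size = 1536

// clearConst zeroes a fixed-size array. The length is a compile-time
// constant, so the compiler is free to expand the clear inline.
//
//go:noinline
func clearConst(p *[size]byte) {
	*p = [size]byte{}
}

// clearVar zeroes a slice whose length is only known at run time. The
// compiler recognizes this loop idiom and calls the runtime's memclr routine.
//
//go:noinline
func clearVar(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

func main() {
	var arr [size]byte
	buf := make([]byte, size)

	constRes := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			clearConst(&arr)
		}
	})
	varRes := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			clearVar(buf)
		}
	})
	fmt.Println("constant size:", constRes)
	fmt.Println("variable size:", varRes)
}
```

Running `go build -gcflags=-S` on this file should show which instructions each function ends up with.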

I think in those cases we shouldn't turn a memclrNoHeapPointers call with a constant size into a Zero? Or we could copy what memclrNoHeapPointers does?

Also: I haven't tested this, but there's a branch in memclrNoHeapPointers that behaves differently for clears of 32M or larger, so we may want to investigate that too.

