Why do custom gates kill performance of my simulation? #2983

OliverAh · 2025-06-02T12:46:57Z

OliverAh
Jun 2, 2025

Problem

Executing a kernel of cudaq native gates takes approx. 1 second.
Executing an equivalent kernel of custom gates takes approx. 7 to 12 seconds. (7 first time, 12 following runs)

Description

I noticed, using custom operations defined via cudaq.register_operation(...) make the kernel execution time explode. I am not talking about some percentages, but factors of 10x-100x are easily reached.
For example, see the following mwe using Hadamard gates (or manually composed custom operations thereof) only.
The circuit consists of 10 qubits, with a circuit depth/length/layers of 10.
In each layer, a Hadamard gate is applied to every qubit but the first (initially, I thought it had to do with the control mechanism, but it turns out it doesn't). The first kernel uses th cudaq native h(qubit) gates. The second kernel uses a custom gate, defined via cudaq.register_operation(...).

Questions

Why does it take so much longer to simulate the circuit built using the custom operations?
Is this expected behaviour and I am missing anything?
Can I do something to "speed up" my custom gates?
Why does the execution time vary so much using custom gates?
Should I post this as an issue?

File: mwe.py

import cudaq
import numpy as np
import time

cudaq.set_target('nvidia', option='fp64')

###
# Define basic options and compute matrix for custom gate
###
circuit_depth = 10
num_qubits = 10 # first might be used as control qubit

array_hadamard = np.array([[1., 1.], [1., -1.]]) / np.sqrt(2)
array_custom_operation = array_hadamard.copy()

for i in range(num_qubits-1-1):
    array_custom_operation = np.kron(array_hadamard, array_custom_operation)
assert array_custom_operation.shape == (2**(num_qubits-1), 2**(num_qubits-1))

cudaq.register_operation('custom_multi_hadamard', array_custom_operation.flatten())

###
# Define kernels
#    1. Use native gates
#    2. Use custom gates
###

@cudaq.kernel
def kernel_native_gates():
    qvec = cudaq.qvector(num_qubits)
    for _ in range(circuit_depth):
        for j in range(1,num_qubits):
            h(qvec[j])
            #h.ctrl(qvec[0], qvec[j])

@cudaq.kernel
def kernel_custom_gates():
    qvec = cudaq.qvector(num_qubits)
    for _ in range(circuit_depth):
        custom_multi_hadamard(qvec[1], qvec[2], qvec[3], qvec[4], qvec[5], qvec[6], qvec[7], qvec[8], qvec[9])
        #custom_multi_hadamard.ctrl(qvec[0], qvec[1], qvec[2], qvec[3], qvec[4], qvec[5], qvec[6], qvec[7], qvec[8], qvec[9])

###
# Run kernels and measure execution time
###

print('Start sampling native gates')
tic_native = time.time()
state_native = cudaq.get_state(kernel_native_gates)
toc_native = time.time()
print('... finished sampling native gates')

print('Start sampling custom gates')
tic_custom = time.time()
state_custom= cudaq.get_state(kernel_custom_gates)
toc_custom = time.time()
print('... finished sampling custom gates', '\n')

assert np.allclose(state_native, state_custom)

print(f"Native gates time: {toc_native - tic_native:.2f} seconds")
print(f"Custom gates time: {toc_custom - tic_custom:.2f} seconds")

Which gives the following output:

$ python mwe.py

Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates 

Native gates time: 0.91 seconds
Custom gates time: 7.40 seconds

$ python mwe.py

Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates 

Native gates time: 0.96 seconds
Custom gates time: 12.70 seconds

Environment

Containerized environment from nvcr.io/nvidia/quantum/cuda-quantum:cu12-0.11.0
- CudaQ: 0.11.0
- Python: 3.10.12
Additionally installed tqdm, pyqsp, pennylane inside the container using pip install

bmhowe23 · 2025-06-02T18:00:16Z

bmhowe23
Jun 2, 2025
Maintainer

I believe the FLOP count for performing an operation scales roughly with the number of elements in the matrix (or the matrix representation of a gate). Your custom operation is a 512x512 matrix, and applying that matrix is much more expensive (from a FLOP perspective) than applying 9 2x2 matrices. I think this is expected behavior. My recommendation would be to make each custom operation operate on as few qubits as possible.

6 replies

bmhowe23 Jun 3, 2025
Maintainer

No, it only does that up to a limit. For example, see the table in https://nvidia.github.io/cuda-quantum/latest/using/backends/sims/svsims.html. Regarding the fusion parameter, it says:

The max number of qubits used for gate fusion. The default value depends on GPU Compute Capability (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are 4, 5, and 5 for FP32. For FP64 the corresponding defaults are 5, 6, and 4. For all other CC, the default is 4 for both precision modes.

Your example goes up to 9 qubits, which is much higher than the defaults. While the precise impact of that will be circuit dependent, take a look at https://nvidia.github.io/cuda-quantum/latest/examples/python/performance_optimizations.html#Gate-Fusion to see a case where setting the gate fusion parameter too high can have a bad impact.

OliverAh Jun 3, 2025
Author

I see. Thank you very much for the clarification! I will mark this as the answer, as it is the explanation to the main concern of the difference in runtime between native and custom gates.
May I still ask you to comment on the difference in runtime for the custom gates between the two executions? I.e. in the first run it took 7.4 seconds, and in the second run it took 12.7 seconds. All runs after the first take about 12-13 seconds, until I make a change in the .py file and get a new "first" execution, which will then take 7-8 seconds again. The two outputs given above were generated in the same shell session, on the same gpu. After finishing the first run there was a delay of at most 5 minutes until started the second run. In the meantime there was no other workload processed on the gpu.

bmhowe23 Jun 3, 2025
Maintainer

May I still ask you to comment on the difference in runtime for the custom gates between the two executions? I.e. in the first run it took 7.4 seconds, and in the second run it took 12.7 seconds.

I don't know about that one. Perhaps @1tnguyen has some ideas there?

1tnguyen Jun 3, 2025
Maintainer

I was unable to reproduce the time variation. For example, this is the elapsed time that I got running the above 4 times:

Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates

Native gates time: 0.56 seconds
Custom gates time: 6.01 seconds
=============================
Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates

Native gates time: 0.52 seconds
Custom gates time: 6.31 seconds
=============================
Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates

Native gates time: 0.60 seconds
Custom gates time: 6.19 seconds
=============================
Start sampling native gates
... finished sampling native gates
Start sampling custom gates
... finished sampling custom gates

Native gates time: 0.61 seconds
Custom gates time: 6.05 seconds

The run-run variation is not as high as what you have observed.
I'm not sure what might be causing the delay there.

OliverAh Jun 5, 2025
Author

I can still reproduce it. But I guess then it is probably an issue with my server.
Thanks for your efforts.
In case I figure it out I will post an update here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Why do custom gates kill performance of my simulation? #2983

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 6 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Why do custom gates kill performance of my simulation? #2983

Uh oh!

OliverAh Jun 2, 2025

Problem

Description

Questions

Environment

Replies: 1 comment · 6 replies

Uh oh!

bmhowe23 Jun 2, 2025 Maintainer

Uh oh!

bmhowe23 Jun 3, 2025 Maintainer

Uh oh!

OliverAh Jun 3, 2025 Author

Uh oh!

bmhowe23 Jun 3, 2025 Maintainer

Uh oh!

1tnguyen Jun 3, 2025 Maintainer

Uh oh!

OliverAh Jun 5, 2025 Author

OliverAh
Jun 2, 2025

Replies: 1 comment 6 replies

bmhowe23
Jun 2, 2025
Maintainer

bmhowe23 Jun 3, 2025
Maintainer

OliverAh Jun 3, 2025
Author

bmhowe23 Jun 3, 2025
Maintainer

1tnguyen Jun 3, 2025
Maintainer

OliverAh Jun 5, 2025
Author