
Qualcomm AI Engine Direct - GA model enablement (T5) #12234


Open · wants to merge 2 commits into main

Conversation

DannyYuyang-quic (Collaborator) commented Jul 4, 2025

Summary

  • e2e script / test case for GA T5 model
    • perf: 16a8w, avg encoding time 4.09 ms/inf, avg decoding time 6 ms/inf (SM8750)
    • acc: F1 score ≈ 76% on SQuAD
  • add QA dataset for Seq2SeqLM benchmarking

Test plan

python -m examples.qualcomm.oss_scripts.t5.t5 -b build-android -m ${soc} -H ${host_id} -s ${device_id} -d ./SQuAD-v1.1.csv

cc: @haowhsu-quic, @cccclai

pytorch-bot bot commented Jul 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12234

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 4e79e47 with merge base 1decf7a:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Jul 4, 2025
DannyYuyang-quic (Collaborator, Author) commented:

@pytorchbot label "release notes: qualcomm"

pytorch-bot bot added the release notes: qualcomm label Jul 4, 2025
facebook-github-bot (Contributor) commented:

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D77877631.

logger = logging.get_logger(__name__)


# Copy from transformers/models/t5/modeling_t5.py (transformers=4.47.1)
Contributor:

How different is it compared with transformers/models/t5/modeling_t5.py?

DannyYuyang-quic (Collaborator, Author) commented Jul 8, 2025:

There are two changes compared to transformers/models/t5/modeling_t5.py; both move the computation of the relative position out of runtime and into precomputed buffers, because T5Attention._relative_position_bucket is not QNN-friendly.

cccclai (Contributor) commented Jul 7, 2025:

Also, can you rebase?

DannyYuyang-quic force-pushed the dev1/danny/GA_T5 branch 2 times, most recently from fa38382 to f15420a on July 8, 2025 03:01
super().__init__(config, embed_tokens)

# ====================Qualcomm Changed=================================
DannyYuyang-quic (Collaborator, Author) commented:

The first change is that I precompute the relative_position_bucket using T5Attention._relative_position_bucket and register the result as a buffer (a sketch follows below).
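
For reference, a minimal sketch of the precompute-and-register idea, assuming it lives in the modified stack's __init__. The helper name and the maximum-length arguments are illustrative assumptions; only T5Attention._relative_position_bucket and the buffer names used later in forward() come from this PR.

import torch
from transformers.models.t5.modeling_t5 import T5Attention

def precompute_self_attn_buckets(config, max_length, bidirectional):
    # Hypothetical helper: pairwise (memory_position - query_position) over a
    # fixed maximum length, bucketed once ahead of time so the int64-heavy
    # _relative_position_bucket never has to be lowered to QNN.
    context_position = torch.arange(max_length, dtype=torch.long)[:, None]
    memory_position = torch.arange(max_length, dtype=torch.long)[None, :]
    relative_position = memory_position - context_position  # [query, key]
    return T5Attention._relative_position_bucket(
        relative_position,
        bidirectional=bidirectional,
        num_buckets=config.relative_attention_num_buckets,
        max_distance=config.relative_attention_max_distance,
    )

# Inside the modified stack's __init__ (sketch; max_cache_length and
# max_hidden_seq_length mirror the attributes referenced in forward()):
# self.register_buffer(
#     "decoder_self_attn_position_bias",
#     precompute_self_attn_buckets(config, self.max_cache_length, bidirectional=False),
# )
# self.register_buffer(
#     "encoder_self_attn_position_bias",
#     precompute_self_attn_buckets(config, self.max_hidden_seq_length, bidirectional=True),
# )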

Comment on lines +234 to +277
# ====================Qualcomm Changed=================================
# The bias is indexed by cache_position to select the correct positions for the current step.
if self.is_decoder:
    # For decoder, use the decoder's relative position bias table.
    position_bias = (
        self.block[0]
        .layer[0]
        .SelfAttention.relative_attention_bias(
            self.decoder_self_attn_position_bias[cache_position]
        )
        .permute([2, 0, 1])
        .unsqueeze(0)
    )
else:
    # For encoder, use the encoder's relative position bias table.
    position_bias = (
        self.block[0]
        .layer[0]
        .SelfAttention.relative_attention_bias(
            self.encoder_self_attn_position_bias[cache_position]
        )
        .permute([2, 0, 1])
        .unsqueeze(0)
    )
position_bias = position_bias[:, :, -seq_length:, :]
if self.is_decoder:
    position_bias = (
        position_bias + causal_mask[:, :, :, : self.max_cache_length]
    )
else:
    position_bias = position_bias + causal_mask[:, :, :, :seq_length]

# For cross-attention in decoder, precompute encoder-decoder position bias as zeros and add encoder attention mask.
encoder_decoder_position_bias = None
if self.is_decoder and encoder_hidden_states is not None:
    encoder_decoder_position_bias = torch.zeros(
        (1, self.config.num_heads, seq_length, self.max_hidden_seq_length),
        dtype=encoder_extended_attention_mask.dtype,
    )
    encoder_decoder_position_bias = (
        encoder_decoder_position_bias
        + encoder_extended_attention_mask[:, :, :, : self.max_hidden_seq_length]
    )
# ========================================================================
DannyYuyang-quic (Collaborator, Author) commented Jul 8, 2025:

The second change is in forward (shown above), where I retrieve the precomputed relative_position_bucket by indexing into the buffer with the correct cache position.

Contributor:

I see, thanks for optimizing the performance. Can the source model definition also be lowered?

DannyYuyang-quic (Collaborator, Author) commented Jul 9, 2025:

No, I would still need to write some source transforms or add passes, because the _relative_position_bucket function in the source T5 implementation has two main issues (see the annotated excerpt below):

  1. Unsupported ops on int64/int32 data types:
    The source function performs operations such as abs, min, and neg on int64 tensors, and these ops are not supported for int64/int32 in QNN.
  2. Unsupported casting:
    The source function casts from float32 to int64, but in the 16a8w quantization case QNN's cast op actually performs a cast from uint16 to int32, which is also unsupported.
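
For context, below is an abridged paraphrase of the upstream T5Attention._relative_position_bucket (transformers 4.47.1), with the problematic operations called out in comments. It is condensed for illustration, not a verbatim copy of the source.

import math
import torch

def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    relative_buckets = 0
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)  # issue 1: abs on int64
    else:
        relative_position = -torch.min(  # issue 1: min / neg on int64
            relative_position, torch.zeros_like(relative_position)
        )
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)  # issue 2: float32 -> int64 cast
    relative_position_if_large = torch.min(
        relative_position_if_large,
        torch.full_like(relative_position_if_large, num_buckets - 1),
    )
    relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
    return relative_buckets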

Contributor:

Taking a step back from performance: if a user tries this and it fails here because the op is not supported in QNN, should the op fall back to CPU?

DannyYuyang-quic (Collaborator, Author) commented:

Yes, if it's not supported in QNN, the op should fall back to CPU (a rough sketch of that flow is below).
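
A rough sketch of what partitioner-based CPU fallback could look like with ExecuTorch's to_edge_transform_and_lower flow. The Qualcomm partitioner construction is deliberately left as a placeholder because its exact import path and arguments are not shown in this PR; the point is only the general pattern that ops a partitioner does not claim stay on the portable CPU path.

import torch
from executorch.exir import to_edge_transform_and_lower

class TinyModel(torch.nn.Module):
    def forward(self, x):
        # Stand-in for an op QNN rejects (e.g. abs on int64).
        return torch.abs(x.to(torch.int64)).to(torch.float32) * 2.0

exported = torch.export.export(TinyModel(), (torch.randn(4),))

# Placeholder: the real flow would build the Qualcomm partitioner from QNN
# compiler specs for the target SoC (see examples/qualcomm); with an empty
# list everything stays on the portable CPU path.
partitioners = []

# Ops that no partitioner claims are not delegated; they remain in the
# default ExecuTorch (CPU) lowering instead of failing the whole export.
edge = to_edge_transform_and_lower(exported, partitioner=partitioners)
program = edge.to_executorch()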

facebook-github-bot (Contributor) commented:
@cccclai has imported this pull request. If you are a Meta employee, you can view this in D77877631.

cccclai (Contributor) left a comment:

Thanks! The diff train is kind of stuck, will try to merge it soon

Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
release notes: qualcomm: Changes to the Qualcomm backend delegate
3 participants