Qualcomm AI Engine Direct - GA model enablement (T5) #12234
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12234
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 4e79e47 with merge base 1decf7a:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
logger = logging.get_logger(__name__)

# Copy from transformers/models/t5/modeling_t5.py (transformers=4.47.1)
How different is it compared with transformers/models/t5/modeling_t5.py?
There are two changes compared to transformers/models/t5/modeling_t5.py: both were added to move the computation of relative position out of runtime and into precomputed buffers, because T5Attention._relative_position_bucket is not QNN-friendly.
Also, can you rebase?
Force-pushed fa38382 to f15420a
Summary:
- e2e script / test case for GA T5 model
- perf: 16a8w avg encoding time: 4.09ms/inf, avg decoding time: 6ms/inf (SM8750)
- acc: F1 Score ~= 76% in SQuAD
- add QA dataset for Seq2SeqLM benchmarking
Force-pushed f15420a to d1c2fec
):
    super().__init__(config, embed_tokens)

    # ====================Qualcomm Changed=================================
The first part of the change is that I precompute the relative_position_bucket using T5Attention._relative_position_bucket and register the result as a buffer.
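A minimal sketch of how that precomputation could look (the buffer names mirror the diff, while `_precompute_position_buckets` and the max-length arguments are assumptions, not code from this PR):

```python
import torch
from transformers.models.t5.modeling_t5 import T5Attention


def _precompute_position_buckets(length: int, bidirectional: bool, config) -> torch.Tensor:
    # Build relative positions for every (query, key) pair eagerly in Python,
    # so the int64 abs/min/neg ops never appear in the exported graph.
    context_position = torch.arange(length, dtype=torch.long)[:, None]
    memory_position = torch.arange(length, dtype=torch.long)[None, :]
    relative_position = memory_position - context_position
    return T5Attention._relative_position_bucket(
        relative_position,
        bidirectional=bidirectional,
        num_buckets=config.relative_attention_num_buckets,
        max_distance=config.relative_attention_max_distance,
    )


# Inside the stack's __init__, the results would then be registered as buffers,
# e.g. (buffer names taken from the diff; the lengths are assumptions):
# self.register_buffer(
#     "decoder_self_attn_position_bias",
#     _precompute_position_buckets(max_cache_length, bidirectional=False, config=config),
# )
# self.register_buffer(
#     "encoder_self_attn_position_bias",
#     _precompute_position_buckets(max_seq_length, bidirectional=True, config=config),
# )
```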
# ====================Qualcomm Changed=================================
# The bias is indexed by cache_position to select the correct positions for the current step.
if self.is_decoder:
    # For decoder, use the decoder's relative position bias table.
    position_bias = (
        self.block[0]
        .layer[0]
        .SelfAttention.relative_attention_bias(
            self.decoder_self_attn_position_bias[cache_position]
        )
        .permute([2, 0, 1])
        .unsqueeze(0)
    )
else:
    # For encoder, use the encoder's relative position bias table.
    position_bias = (
        self.block[0]
        .layer[0]
        .SelfAttention.relative_attention_bias(
            self.encoder_self_attn_position_bias[cache_position]
        )
        .permute([2, 0, 1])
        .unsqueeze(0)
    )
position_bias = position_bias[:, :, -seq_length:, :]
if self.is_decoder:
    position_bias = (
        position_bias + causal_mask[:, :, :, : self.max_cache_length]
    )
else:
    position_bias = position_bias + causal_mask[:, :, :, :seq_length]

# For cross-attention in decoder, precompute encoder-decoder position bias as zeros and add encoder attention mask.
encoder_decoder_position_bias = None
if self.is_decoder and encoder_hidden_states is not None:
    encoder_decoder_position_bias = torch.zeros(
        (1, self.config.num_heads, seq_length, self.max_hidden_seq_length),
        dtype=encoder_extended_attention_mask.dtype,
    )
    encoder_decoder_position_bias = (
        encoder_decoder_position_bias
        + encoder_extended_attention_mask[:, :, :, : self.max_hidden_seq_length]
    )
# ========================================================================
The second part is in the forward, where I retrieve the relative_position_bucket by indexing into the buffer using the correct cache_position.
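Concretely (an illustrative snippet, not code from the PR; a dummy tensor stands in for the registered buffer), prefill passes a multi-element cache_position while decode passes a single position, so the indexing simply gathers the matching rows of precomputed buckets:

```python
import torch

max_cache_length = 128
# Stand-in for the registered buffer of precomputed buckets,
# shape [max_cache_length, max_cache_length] (values are dummies).
decoder_self_attn_position_bias = torch.zeros(
    max_cache_length, max_cache_length, dtype=torch.long
)

# Prefill: cache_position covers the whole prompt, e.g. positions 0..6.
cache_position = torch.arange(7)
print(decoder_self_attn_position_bias[cache_position].shape)  # torch.Size([7, 128])

# Decode: one new token at position 7, so only its row of buckets is gathered.
cache_position = torch.tensor([7])
print(decoder_self_attn_position_bias[cache_position].shape)  # torch.Size([1, 128])
```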
I see, thanks for optimizing the performance. Can the source model definition also be lowered?
No, I still need to write some source transforms or add passes, because the function _relative_position_bucket in the source T5 implementation has two main issues (see the annotated sketch below):
- Unsupported ops for int64/int32 datatypes: the source function performs operations like abs, min, and neg on int64, but these ops are not supported on int64/int32 in QNN.
- Unsupported casting: the source function casts from float32 to int64, but in the 16a8w quantization case QNN's cast op actually performs a cast from uint16 to int32, which is also unsupported.
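For reference, a condensed paraphrase of the upstream function (abridged from transformers 4.47.1, not the exact source) with the QNN-unfriendly spots marked:

```python
import math
import torch


def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    # Paraphrased from transformers/models/t5/modeling_t5.py for illustration only.
    relative_buckets = 0
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)  # abs on int64
    else:
        relative_position = -torch.min(  # neg/min on int64
            relative_position, torch.zeros_like(relative_position)
        )
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)  # cast float32 -> int64
    relative_position_if_large = torch.min(  # min on int64
        relative_position_if_large,
        torch.full_like(relative_position_if_large, num_buckets - 1),
    )
    relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
    return relative_buckets
```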
Taking a step back from performance: if a user tried this and it failed here, and the op is not supported in QNN, should it fall back to CPU?
Yes, if it's not supported in QNN, this should fall back to CPU.
Thanks! The diff train is kind of stuck, will try to merge it soon
Summary
Test plan
cc: @haowhsu-quic, @cccclai