9082 educate users on mat mul precision #9103

yaoshiang · 2025-05-06T20:38:08Z

Adds runnable tutorial to teach users about mat mul precision.

Also includes new manual build instructions for runnable tutorials.

yaoshiang · 2025-05-06T20:39:14Z

PDF of doc is here
Controlling Floating Point Precision — PyTorch_XLA master documentation.pdf

mikegre-google · 2025-05-09T17:16:12Z

docs/source/tutorials/precision_tutorial.py

+# |
+# | ![bits](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tf32-bf16-fp16-fp32.png)
+
+# | ## Higher precision math on lower precision hardware


Perhaps we can just point readers to "BFloat16: The secret to high performance on Cloud TPUs" instead of duplicating that content here.

If you'd rather keep this section, let me know and I can clean it up. My preference would be to use the existing documentation.

That doc doesn't share the math of how the 3-pass and 6-pass methods work, and we have had customer questions, so I think it's important that we keep this description.
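As a rough illustration of the multi-pass idea discussed here, the sketch below emulates a higher-precision scalar product using only bfloat16-sized pieces. It is a pure-Python toy, not the XLA or TPU implementation: `to_f32`, `to_bf16`, `split_bf16`, and `emulated_product` are hypothetical helper names, the bfloat16 conversion truncates instead of rounding to nearest, and Python floats stand in for the float32 accumulator. Splitting each operand into 1, 2, or 3 pieces and keeping only the largest cross terms gives 1, 3, or 6 low-precision multiplications.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Truncate to bfloat16 by keeping the top 16 bits of the float32 encoding.

    Assumption: truncation is used for simplicity; real hardware may round.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def split_bf16(x: float, parts: int) -> list:
    """Split a float32 value into `parts` bfloat16 pieces (largest first)."""
    pieces, remainder = [], to_f32(x)
    for _ in range(parts):
        piece = to_bf16(remainder)
        pieces.append(piece)
        remainder -= piece  # what the pieces so far failed to capture
    return pieces


def emulated_product(a: float, b: float, parts: int) -> float:
    """Multiply using only products of bfloat16 pieces.

    Keeps the cross terms a[i] * b[j] with i + j < parts, so
    parts=1 uses 1 multiplication, parts=2 uses 3, and parts=3 uses 6.
    """
    a_parts, b_parts = split_bf16(a, parts), split_bf16(b, parts)
    return sum(a_parts[i] * b_parts[j]
               for i in range(parts)
               for j in range(parts)
               if i + j < parts)


a, b = to_f32(1 / 3), to_f32(3.14159265)
exact = a * b  # product of two float32 values, computed in double precision
for parts, passes in ((1, 1), (2, 3), (3, 6)):
    error = abs(exact - emulated_product(a, b, parts))
    print(f"{passes}-pass absolute error: {error:.3e}")
```

Running it shows the error shrinking as more passes are used, which is the trade-off between speed and precision that the tutorial section describes.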

tengyifei · 2025-05-09T23:07:18Z

docs/source/tutorials/precision_tutorial.py

+# | the differences between these three settings.
+# |
+# | Warning: Although this notebook demonstrates different precision settings,
+# | it is recommended to only set the precision once at the beginning of your


Not sure if it's already done; does it make sense to throw a "hard error" when the user sets the precision twice?

It's a good question and I'm not sure. If I were pretty sure mat_mul_precision was flaky, I'd say definitely yes. But I am also curious whether the error is not really in the platform but rather in the testing harness. It seems weird to me that I can set matmul precision dynamically in scripts and interactive Python interpreters (see PR 9083), yet it fails the unit test. I really don't get it.

Ok, I know why. Looks like our compilation caching is not sound.

I ran your test a bunch of times with different levels of matmul precision. Then I printed xm.metrics_report() and it only reported 1 compilation event. I think we need at least 3 compilation events if there are 3 different precision levels.

In contrast, JAX maintains an extensive context of ambient settings that will impact compilation results: https://github.com/jax-ml/jax/blob/35e2657be8308917c7fa407be5a0b53192134890/jax/_src/config.py#L230. Whenever any of those things change, JAX will recompile.

I think the reasonable thing here is to:

Print some warning when the precision level is changed, warning the user that existing cached graphs may nullify their precision level change. If possible, we could maybe only print this warning if there are cached graphs.

Someone should fix this in some follow-up.

As a corollary, probably we should advise users that they better change the precision once at the start of their script.
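The warning suggested above could look roughly like the following sketch. Everything in it is hypothetical: `PrecisionSetting` and its `has_cached_graphs` callback are illustrative names, not part of PyTorch/XLA, and the check for cached graphs is left to the caller because this thread does not say how that would be queried.

```python
import warnings


class PrecisionSetting:
    """Hypothetical holder for a process-wide matmul precision setting."""

    def __init__(self, has_cached_graphs):
        # has_cached_graphs: zero-argument callable returning True when
        # at least one compiled graph is already cached (assumption).
        self._precision = "default"
        self._has_cached_graphs = has_cached_graphs

    @property
    def precision(self) -> str:
        return self._precision

    def set(self, precision: str) -> None:
        # Warn only when the value actually changes and stale graphs may exist.
        if precision != self._precision and self._has_cached_graphs():
            warnings.warn(
                "Matmul precision changed after graphs were compiled; "
                "cached graphs may still use the previous precision.",
                UserWarning,
                stacklevel=2,
            )
        self._precision = precision
```

Setting the precision once at the start of a script, before anything is compiled, never triggers the warning in this sketch, which matches the advice given to users.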

Thanks for this analysis! I suspected there was global state (because runtime didn't triple) but didn't know the tools you know to dig deeper. I couldn't bisect it between the python -m unittest runner and something in libtpu.

Is there a deeper fix here for which I can file an issue? E.g. can we do what JAX does and force a recompile when this ambient setting is changed? Broadly speaking, can we be more stringent about what constitutes a cache hit?

Either way, the note to only set this setting once is in this guide, and I'll add it to the docstring as well. I'll also look into a warning if that's idiomatic to PyTorch.

> Is there a deeper fix here for which I can file an issue? E.g. can we do what Jax does and force a recompile when this ambient setting is changed? Broadly speaking, being more stringent about what constitutes a cache hit?

Absolutely. I'm only familiar with how JAX does it. But I could imagine a similar kind of lazy_tensor_trace_context() for PyTorch/XLA, and whatever dictionary we're currently inserting the cached compilation result into, we'll need to hash-combine the dict key with this trace context.

It's probably a good idea to audit the JAX list and see if any other items apply to ptxla as well, not just the matmul precision.
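To make the cache-key idea concrete, here is a toy model of a compilation cache, written as a sketch and not as PyTorch/XLA code. `ToyCompileCache`, `get_or_compile`, and the string "executables" are all hypothetical; the only point is the difference between keying on the graph hash alone and keying on the graph hash combined with the ambient precision setting.

```python
class ToyCompileCache:
    """Toy compilation cache; not the real PyTorch/XLA cache."""

    def __init__(self, include_context: bool):
        # include_context=True folds the ambient precision into the cache key.
        self.include_context = include_context
        self.compile_count = 0
        self._cache = {}

    def get_or_compile(self, graph_hash: str, precision: str) -> str:
        if self.include_context:
            key = (graph_hash, precision)
        else:
            key = (graph_hash,)
        if key not in self._cache:
            # Stand-in for an expensive compilation step.
            self.compile_count += 1
            self._cache[key] = f"executable({graph_hash}, precision={precision})"
        return self._cache[key]


for include_context in (False, True):
    cache = ToyCompileCache(include_context)
    for precision in ("default", "high", "highest"):
        cache.get_or_compile("matmul_graph", precision)
    print(f"include_context={include_context}: "
          f"{cache.compile_count} compilation(s)")
```

With the context left out of the key, the same graph run at three precision levels compiles once and keeps returning the first executable, which mirrors the single compilation event reported earlier in this thread. With the context included, each precision level gets its own compilation.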

Do we have the ability to force a cache flush? That would provide a workaround.

tengyifei

Thanks! I really enjoyed reading it.

yaoshiang added 13 commits May 2, 2025 14:45

initial commit: binding and backends package, no tests.

86200a5

Tests for default, high, and highest precision.

2901b77

clang-format

564685a

formatter

4c0b52e

Updates to error messages

0d3f44f

fixed test class names

3c333d8

typo

967ccda

typo on error message. unit tested and yapfed.

393257e

linter

f48d7f6

minor edits.

114456b

Merge branch '9080-expose-mat_mul_precision' into 9082-educate-users-…

2d05da1

…on-mat-mul-precision

initial commit

587f6e5

adding updated note for __init__

a247bde

yaoshiang requested review from qihqi and mikegre-google May 6, 2025 20:38

yaoshiang requested review from tengyifei, lsy323, ManfeiBai and zpcore as code owners May 6, 2025 20:38

yaoshiang added 10 commits May 6, 2025 20:42

removed todo (done!)

69fdb49

Updated TODO per review

87cff3d

Update todo and precision math per comment.

b96f671

Merge branch 'master' into 9080-expose-mat_mul_precision

c26ad8f

Merge branch '9080-expose-mat_mul_precision' into 9082-educate-users-…

a825e12

…on-mat-mul-precision

Adding image to second location. Minor typos on README.

8f4b12b

yapf

30d1c19

Merge branch '9080-expose-mat_mul_precision' into 9082-educate-users-…

324bde4

…on-mat-mul-precision

yapf

e15443f

linter

959b0dc

yaoshiang added 8 commits May 9, 2025 02:28

parameterized, but in a process isolated way.

02a5069

Revert "parameterized, but in a process isolated way."

0205431

This reverts commit 02a5069.

parameterized, but in a process isolated way.

8aaa979

removed dead code

04516c7

added issue for repeatable, unexpected behavior.

c3856c8

Merge branch '9080-expose-mat_mul_precision' into 9082-educate-users-…

3fae53c

…on-mat-mul-precision

indent

eafc98e

Updated docstring.

6e3fdf9

mikegre-google reviewed May 9, 2025

yaoshiang added 7 commits May 9, 2025 10:55

Revert "Updated docstring."

a2e2298

This reverts commit 6e3fdf9.

Updated docstring.

79af416

changed naming of is_on_tpu

6ff7f59

Merge branch 'master' into 9080-expose-mat_mul_precision

89c72a8

CICD friendly version, hopefully.

9db6e3e

Merge branch '9080-expose-mat_mul_precision' into 9082-educate-users-…

922b57f

…on-mat-mul-precision

Merge branch '9082-educate-users-on-mat-mul-precision' of https://git…

4b629d8

…hub.com/pytorch/xla into 9082-educate-users-on-mat-mul-precision

tengyifei reviewed May 9, 2025

Updates based on comments.

bcea8e9

yaoshiang changed the base branch from 9080-expose-mat_mul_precision to master May 9, 2025 22:58

tengyifei reviewed May 9, 2025

yaoshiang added 2 commits May 10, 2025 01:56

addressing comments

56cfbce

tutorial changes

9c53282

tengyifei approved these changes May 10, 2025

yaoshiang mentioned this pull request May 10, 2025

set_mat_mul_precision is flakey #9129

Open
