Added AMD CPU CI instances in scale-config.yml #6629


Merged
merged 1 commit on May 14, 2025

Conversation

amukho
Contributor

@amukho amukho commented May 12, 2025

Added AWS AMD CPU instances and their corresponding runner types to scale-config.yml. Once this is merged, we will add the corresponding changes to the PyTorch repo to trigger inductor CPU CI jobs using these runners.


vercel bot commented May 12, 2025

@amukho is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot
Contributor

Hi @amukho!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@amukho
Contributor Author

amukho commented May 12, 2025

@amukho I thought we already test AMD using the AMD ROCm runners that already exist in the system. Can you give me more context on this PR? Is this being added by AMD?

Hi @zxiiro, yes. We are adding the AMD CPU runners to run the inductor CPU CI jobs on AMD instances. This is in preparation for AMD CPU optimizations to be upstreamed in PyTorch going forward. I have updated the PR comment to reflect this.

cc @malfet Hi Nikita, I have added the AMD CPU runners in scale-config.yml. We are preparing the PyTorch repo changes in parallel. Once this is merged, we will push those in.

Contributor

@malfet malfet left a comment


LGTM, but please update the PR description to explain what an m7a runner is and why the same cannot be achieved on m6a (or maybe it can; in that case, I would advise setting the default runners to m6a, but probably adding one config on m7a just for benchmarking).

is_ephemeral: true
c.linux.12xlarge.amd:
disk_size: 200
instance_type: m6a.12xlarge
Contributor

Why is 12xlarge m6a rather than m7a?

Contributor Author

Hi @malfet, these configs were added in line with the current CPU configs in scale-config.yml:

  • m7a.24xlarge corresponds to c7i.metal-24xl (for the torch inductor perf dashboard and perf smoke tests)
  • m7a.8xlarge corresponds to m7i-flex.8xlarge (for inductor unit tests and periodic tests with AVX512 support)
  • m6a.12xlarge corresponds to m4.10xlarge (for inductor unit tests and periodic tests with AVX2 support)
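For reference, a complete runner entry combining these fields might look like the following sketch, assembled from the fragments quoted in this PR (illustrative only, not the exact merged diff):

```yaml
# Illustrative sketch of one non-prefixed AMD runner entry in
# scale-config.yml; field names and values are taken from the
# fragments quoted in this PR, not from the merged file itself.
linux.12xlarge.amd:
  disk_size: 200
  instance_type: m6a.12xlarge
  is_ephemeral: true
  os: linux
  variants:
    ephemeral:
      is_ephemeral: true
```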

Comment on lines 346 to 369
lf.linux.8xlarge.amd:
disk_size: 200
instance_type: m7a.8xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
lf.linux.12xlarge.amd:
disk_size: 200
instance_type: m6a.12xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
lf.linux.24xl.amd:
disk_size: 200
instance_type: m7a.24xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
Contributor

@jeanschmidt, @ZainRizvi: do these always need to be in sync? I.e., if someone adds runners, as this PR does, should they be added only to scale-config.yml and not to the LF one?

Collaborator

From the CI Sync meeting just now: yes, the 4 files do need to be in sync in order to work. It is also possible to exclude jobs from running on LF runners via the runner-determinator configuration in the job config. It's a feature we just added on Friday.

Here's an example of inductor jobs opting out of LF runners.

https://github.com/pytorch/pytorch/pull/153151/files

@malfet
Contributor

malfet commented May 12, 2025

@amukho, can you please sign the CLA? (AMD should have a corporate one with Meta signed already.)

@amukho
Contributor Author

amukho commented May 12, 2025

@amukho, can you please sign the CLA? (AMD should have a corporate one with Meta signed already.)

Hi @malfet, I am already part of the corporate CLA; not sure why this message is still appearing. I am trying to resolve this internally to see if we are missing anything.

Contributor

@malfet malfet left a comment

Please add information about runner names, CPU generations, and AWS list prices to the PR description.
And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

@zxiiro
Collaborator

zxiiro commented May 12, 2025

Please add information about runner names, CPU generations, and AWS list prices to the PR description. And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

c stands for "canary", a separate sandbox infra for testing CI changes that might be disruptive to the project.

zxiiro
zxiiro previously requested changes May 12, 2025
Comment on lines 363 to 365
variants:
ephemeral:
is_ephemeral: true
Collaborator

Same here.

Suggested change
variants:
ephemeral:
is_ephemeral: true

Contributor Author

Hi @zxiiro, I tried removing the variants section from scale-config.yml, but it fails validation with the following error:

Runner type linux.8xlarge.amd does not have a variants section defined
Runner type linux.12xlarge.amd does not have a variants section defined
Runner type linux.24xlarge.amd does not have a variants section defined
Found a total of 3 invalid runner configurations: linux.8xlarge.amd, linux.12xlarge.amd, linux.24xlarge.amd

From the script, it looks like a variants section with an ephemeral variant is mandatory in scale-config.yml.
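The validation behaviour described above can be sketched as follows. This is a hypothetical Python reimplementation of the check, matching the error messages quoted in the thread; the real script's name and structure are assumptions:

```python
# Hypothetical sketch of the variants-section check described above.
# The config dict mirrors entries from scale-config.yml; only the
# second entry defines the (apparently mandatory) variants section.
runner_types = {
    "linux.8xlarge.amd": {
        "disk_size": 200,
        "instance_type": "m7a.8xlarge",
    },
    "linux.12xlarge.amd": {
        "disk_size": 200,
        "instance_type": "m6a.12xlarge",
        "variants": {"ephemeral": {"is_ephemeral": True}},
    },
}

def find_invalid(runner_types):
    """Return the names of runner types missing a variants section."""
    invalid = []
    for name, config in runner_types.items():
        if "variants" not in config:
            print(f"Runner type {name} does not have a variants section defined")
            invalid.append(name)
    return invalid

invalid = find_invalid(runner_types)
if invalid:
    print(f"Found a total of {len(invalid)} invalid runner configurations: "
          + ", ".join(invalid))
```

Running this against the dict above flags only linux.8xlarge.amd, which is the same failure mode the real validation reported for all three new AMD runner types before the variants sections were added.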

Collaborator

Huh, odd. I thought variants were optional, but I see now that all of the other configs have it set. In that case, I will retract my suggestion. Maybe @jeanschmidt could provide more details here.

@amukho amukho force-pushed the amukho_fork_main branch from 4b82ce3 to fc88007 on May 13, 2025 at 12:24
@amukho
Contributor Author

amukho commented May 13, 2025

Please add information about runner names, CPU generations, and AWS list prices to the PR description. And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

c stands for "canary", a separate sandbox infra for testing CI changes that might be disruptive to the project.

Hi @malfet, I have aligned the runner names. linux.24xl.amd was based on the existing runner linux.24xl.spr-metal, but I have now aligned it with the rest of the AMD CPU runner names.

@zxiiro
Collaborator

zxiiro commented May 13, 2025

Flagging #6605 as it looks like it affects this PR.

Edit: @amukho, it looks like that change allows us to remove all of those variants configurations, so it should be possible to remove them now.

Added AWS AMD CPU instances and their corresponding runner types
to scale-config.yml. Once this is merged will add the corresponding
changes to the PyTorch repo to trigger inductor CPU CI jobs using
these runners

Signed-off-by: Arijit Mukhopadhyay <[email protected]>
@amukho amukho force-pushed the amukho_fork_main branch from fc88007 to 167b628 on May 13, 2025 at 16:57
@amukho
Contributor Author

amukho commented May 13, 2025

Flagging #6605 as it looks like it affects this PR.

Edit: @amukho, it looks like that change allows us to remove all of those variants configurations, so it should be possible to remove them now.

Hi @zxiiro, I have updated the commit to remove the variants section.
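Assuming #6605 made the variants section optional, the updated entries would presumably reduce to something like the following sketch (illustrative only, not the exact committed diff):

```yaml
# Sketch of a runner entry with the variants section removed,
# assuming #6605 made variants optional. Illustrative only.
linux.24xlarge.amd:
  disk_size: 200
  instance_type: m7a.24xlarge
  is_ephemeral: true
  os: linux
```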

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2025
@amukho
Contributor Author

amukho commented May 13, 2025

Hi @malfet, the CLA process is resolved. Can you please approve this PR?

@amukho amukho requested review from zxiiro and malfet May 14, 2025 04:12
@malfet malfet dismissed zxiiro’s stale review May 14, 2025 15:00

Review feedback has been addressed

@malfet malfet merged commit cf75659 into pytorch:main May 14, 2025
4 of 5 checks passed