Added AMD CPU CI instances in scale-config.yml #6629


Merged
merged 1 commit on May 14, 2025

Conversation

amukho
Contributor

@amukho amukho commented May 12, 2025

Added AWS AMD CPU instances and their corresponding runner types to scale-config.yml. Once this is merged, we will add the corresponding changes to the PyTorch repo to trigger inductor CPU CI jobs using these runners.


vercel bot commented May 12, 2025

@amukho is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot
Contributor

Hi @amukho!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@amukho
Contributor Author

amukho commented May 12, 2025

@amukho I thought we already test AMD using the AMD ROCm runners that already exist in the system. Can you give me more context on this PR? Is this being added by AMD?

Hi @zxiiro, yes. We are adding the AMD CPU runners to run the inductor CPU CI jobs on AMD instances. This is in preparation for AMD CPU optimizations to be upstreamed in PyTorch going forward. I have updated the PR comment to reflect this.

cc @malfet Hi Nikita, I have added the AMD CPU runners in scale-config.yml. We are preparing the PyTorch repo changes in parallel. Once this is merged, we will push those in.

Contributor

@malfet malfet left a comment


LGTM, but please update the PR description to explain what an m7a runner is and why the same cannot be achieved on m6a (or maybe it can; in that case, I would advise setting the default runners to m6a, but probably adding one config on m7a just for benchmarking).

is_ephemeral: true
c.linux.12xlarge.amd:
disk_size: 200
instance_type: m6a.12xlarge
Contributor

Why is 12xlarge m6a rather than m7a?

Contributor Author

Hi @malfet, these configs were added in line with the current CPU configs in scale-config.yml:

  • m7a.24xlarge corresponds to c7i.metal-24xl (for the torch inductor perf dashboard and perf smoke tests)
  • m7a.8xlarge corresponds to m7i-flex.8xlarge (for inductor unit tests and periodic tests with AVX512 support)
  • m6a.12xlarge corresponds to m4.10xlarge (for inductor unit tests and periodic tests with AVX2 support)
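For reference, a complete runner entry combining these fields might look like the following sketch, assembled from the fragments quoted in this PR (illustrative only, not the exact merged diff):

```yaml
# Illustrative sketch of one non-prefixed AMD runner entry in
# scale-config.yml; field names and values are taken from the
# fragments quoted in this PR, not from the merged file itself.
linux.12xlarge.amd:
  disk_size: 200
  instance_type: m6a.12xlarge
  is_ephemeral: true
  os: linux
  variants:
    ephemeral:
      is_ephemeral: true
```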

Comment on lines 346 to 369
lf.linux.8xlarge.amd:
disk_size: 200
instance_type: m7a.8xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
lf.linux.12xlarge.amd:
disk_size: 200
instance_type: m6a.12xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
lf.linux.24xl.amd:
disk_size: 200
instance_type: m7a.24xlarge
is_ephemeral: true
os: linux
variants:
ephemeral:
is_ephemeral: true
Contributor

@jeanschmidt, @ZainRizvi: do these always need to be in sync? I.e., if someone adds runners, as this PR does, should they be added only to scale-config.yml and not to the LF one?

Collaborator

From the CI Sync meeting just now: yes, the 4 files do need to be in sync in order to work. It is also possible to exclude jobs from running on LF runners via the runner-determinator configuration in the job config. It's a feature we just added on Friday.

Here's an example of inductor jobs opting out of LF runners.

https://github.com/pytorch/pytorch/pull/153151/files

@malfet
Contributor

malfet commented May 12, 2025

@amukho, can you please sign the CLA? (AMD should have a corporate one with Meta signed already.)

@amukho
Contributor Author

amukho commented May 12, 2025

@amukho, can you please sign the CLA? (AMD should have a corporate one with Meta signed already.)

Hi @malfet, I am already part of the corporate CLA; not sure why this message is still appearing. I am trying to resolve this internally to see if we are missing anything.

Contributor

@malfet malfet left a comment

Please add information about runner names, CPU generations, and AWS list prices to the PR description.
And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

@zxiiro
Collaborator

zxiiro commented May 12, 2025

Please add information about runner names, CPU generations, and AWS list prices to the PR description. And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

c stands for "canary", a separate sandbox infra for testing CI changes that might be disruptive to the project.

zxiiro
zxiiro previously requested changes May 12, 2025
Comment on lines 363 to 365
variants:
ephemeral:
is_ephemeral: true
Collaborator

Same here.

Suggested change
variants:
ephemeral:
is_ephemeral: true

Contributor Author

Hi @zxiiro, I tried removing the variants section from scale-config.yml, but it fails validation with the following error:

Runner type linux.8xlarge.amd does not have a variants section defined
Runner type linux.12xlarge.amd does not have a variants section defined
Runner type linux.24xlarge.amd does not have a variants section defined
Found a total of 3 invalid runner configurations: linux.8xlarge.amd, linux.12xlarge.amd, linux.24xlarge.amd

From the script, it looks like a variants section with an ephemeral variant is mandatory in scale-config.yml.
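The validation behaviour described above can be sketched as follows. This is a hypothetical Python reimplementation of the check, matching the error messages quoted in the thread; the real script's name and structure are assumptions:

```python
# Hypothetical sketch of the variants-section check described above.
# The config dict mirrors entries from scale-config.yml; only the
# second entry defines the (apparently mandatory) variants section.
runner_types = {
    "linux.8xlarge.amd": {
        "disk_size": 200,
        "instance_type": "m7a.8xlarge",
    },
    "linux.12xlarge.amd": {
        "disk_size": 200,
        "instance_type": "m6a.12xlarge",
        "variants": {"ephemeral": {"is_ephemeral": True}},
    },
}

def find_invalid(runner_types):
    """Return the names of runner types missing a variants section."""
    invalid = []
    for name, config in runner_types.items():
        if "variants" not in config:
            print(f"Runner type {name} does not have a variants section defined")
            invalid.append(name)
    return invalid

invalid = find_invalid(runner_types)
if invalid:
    print(f"Found a total of {len(invalid)} invalid runner configurations: "
          + ", ".join(invalid))
```

Running this against the dict above flags only linux.8xlarge.amd, which is the same failure mode the real validation reported for all three new AMD runner types before the variants sections were added.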

Collaborator

Huh, odd. I thought variants were optional, but I see now that all of the other configs have it set. In that case, I will retract my suggestion. Maybe @jeanschmidt could provide more details here.

@amukho amukho force-pushed the amukho_fork_main branch from 4b82ce3 to fc88007 on May 13, 2025 at 12:24
@amukho
Contributor Author

amukho commented May 13, 2025

Please add information about runner names, CPU generations, and AWS list prices to the PR description. And stick to a consistent naming schema (one that includes the CPU generation in its name).

  • I.e., what does the c. prefix stand for?
  • c.linux.8xlarge.amd and c.linux.12xlarge.amd, but c.linux.24xl.amd (it should be 24xlarge.amd, shouldn't it?)

c stands for "canary", a separate sandbox infra for testing CI changes that might be disruptive to the project.

Hi @malfet, I have aligned the runner names. linux.24xl.amd was based on the existing runner linux.24xl.spr-metal, but I have now aligned it with the rest of the AMD CPU runner names.

@zxiiro
Collaborator

zxiiro commented May 13, 2025

Flagging #6605 as it looks like it affects this PR.

Edit: @amukho, it looks like that change allows us to remove all of those variants configurations, so it should be possible to remove them now.

Added AWS AMD CPU instances and their corresponding runner types
to scale-config.yml. Once this is merged will add the corresponding
changes to the PyTorch repo to trigger inductor CPU CI jobs using
these runners

Signed-off-by: Arijit Mukhopadhyay <[email protected]>
@amukho amukho force-pushed the amukho_fork_main branch from fc88007 to 167b628 on May 13, 2025 at 16:57
@amukho
Contributor Author

amukho commented May 13, 2025

Flagging #6605 as it looks like it affects this PR.

Edit: @amukho, it looks like that change allows us to remove all of those variants configurations, so it should be possible to remove them now.

Hi @zxiiro, I have updated the commit to remove the variants section.
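Assuming #6605 made the variants section optional, the updated entries would presumably reduce to something like the following sketch (illustrative only, not the exact committed diff):

```yaml
# Sketch of a runner entry with the variants section removed,
# assuming #6605 made variants optional. Illustrative only.
linux.24xlarge.amd:
  disk_size: 200
  instance_type: m7a.24xlarge
  is_ephemeral: true
  os: linux
```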

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2025
@amukho
Contributor Author

amukho commented May 13, 2025

Hi @malfet, the CLA process is resolved. Can you please approve this PR?

@amukho amukho requested review from zxiiro and malfet May 14, 2025 04:12
@malfet malfet dismissed zxiiro’s stale review May 14, 2025 15:00

Review feedback has been addressed

@malfet malfet merged commit cf75659 into pytorch:main May 14, 2025
4 of 5 checks passed