Added AMD CPU CI instances in scale-config.yml #6629
Conversation
@amukho is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
Hi @amukho! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations, and the pull request will then be tagged.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Hi @zxiiro, yes. We are adding the AMD CPU runners to run the inductor CPU CI jobs on AMD instances. This is in preparation for AMD CPU optimizations to be upstreamed in PyTorch going forward. I have updated the PR comment to reflect this.

cc @malfet Hi Nikita, I have added the AMD CPU runners in scale-config.yml. We are preparing the PyTorch repo changes in parallel. Once this is merged we will push that in.
LGTM, but please update the PR description explaining what an m7a runner is and why the same cannot be achieved on m6a (or maybe it can; in that case, I would advise setting the default runners to m6a, but probably adding one config on m7a just for benchmarking).
```yaml
  is_ephemeral: true
c.linux.12xlarge.amd:
  disk_size: 200
  instance_type: m6a.12xlarge
```
Why is 12xlarge m6a rather than m7a?
Hi @malfet, these configs have been added in line with the current CPU configs in scale-config.yml:
- m7a.24xlarge corresponds to c7i.metal-24xl (for the torch inductor perf dashboard and perf smoke tests)
- m7a.8xlarge corresponds to m7i-flex.8xlarge (for inductor unittests and periodic tests with AVX512 support)
- m6a.12xlarge corresponds to m4.10xlarge (for inductor unittests and periodic tests with AVX2 support)
.github/lf-scale-config.yml (Outdated)
```yaml
lf.linux.8xlarge.amd:
  disk_size: 200
  instance_type: m7a.8xlarge
  is_ephemeral: true
  os: linux
  variants:
    ephemeral:
      is_ephemeral: true
lf.linux.12xlarge.amd:
  disk_size: 200
  instance_type: m6a.12xlarge
  is_ephemeral: true
  os: linux
  variants:
    ephemeral:
      is_ephemeral: true
lf.linux.24xl.amd:
  disk_size: 200
  instance_type: m7a.24xlarge
  is_ephemeral: true
  os: linux
  variants:
    ephemeral:
      is_ephemeral: true
```
@jeanschmidt, @ZainRizvi: do those need to always be in sync? I.e., if someone adds runners in a test PR, should they be added only to scale-config.yml and not to the LF one?
From the CI Sync meeting just now: yes, the 4 files do need to be in sync in order to work. It is also possible to exclude the jobs from running on LF runners via runner determinator configuration in the job config. It's a feature we just added on Friday.
Here's an example of inductor jobs opting out of LF runners.
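As a rough, hypothetical sketch of what such an opt-out could look like in a job config (the actual runner-determinator input names in pytorch/pytorch may differ; `opt-out-experiments: lf` here is an assumption, not confirmed syntax):

```yaml
# Hypothetical illustration only: the real input name used by the runner
# determinator in pytorch/pytorch may differ from "opt-out-experiments".
jobs:
  inductor-cpu-test:
    uses: ./.github/workflows/_linux-test.yml
    with:
      # Keep this job on Meta-managed runners instead of letting the
      # determinator route it to LF runners.
      opt-out-experiments: lf
```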
@amukho can you please sign the CLA (AMD should have a corporate one with Meta signed already)
Please add information about runner names, CPU generations, as well as AWS list prices to the PR description.
And stick to a consistent naming schema (one that includes the CPU generation in its name). I.e.:
- What does the `c.` prefix stand for?
- `c.linux.8xlarge.amd` and `c.linux.12xlarge.amd`, but `c.linux.24xl.amd` (it should be `24xlarge.amd`, shouldn't it?)
`c` stands for "canary", which is a separate sandbox infra for testing CI changes that might be disruptive to the project.
.github/scale-config.yml (Outdated)
```yaml
variants:
  ephemeral:
    is_ephemeral: true
```
Same here.
Suggested change (remove these lines):
```yaml
variants:
  ephemeral:
    is_ephemeral: true
```
Hi @zxiiro, I tried removing the variants from scale-config.yml but it fails validation with the following error:

```
Runner type linux.8xlarge.amd does not have a variants section defined
Runner type linux.12xlarge.amd does not have a variants section defined
Runner type linux.24xlarge.amd does not have a variants section defined
Found a total of 3 invalid runner configurations: linux.8xlarge.amd, linux.12xlarge.amd, linux.24xlarge.amd
```
From the script, it looks like a variants section with an ephemeral variant is mandatory in scale-config.yml.
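The mandatory-variants check described above can be sketched roughly as follows (a minimal illustration; the actual validation script in pytorch/test-infra is structured differently, and the function name here is hypothetical):

```python
def find_invalid_runner_types(runner_types):
    """Return names of runner types that lack a 'variants' section
    containing an 'ephemeral' variant, mirroring the validation error above."""
    invalid = []
    for name, cfg in runner_types.items():
        variants = cfg.get("variants") or {}
        if "ephemeral" not in variants:
            invalid.append(name)
    return invalid

# Example: an entry without 'variants' fails, one with an ephemeral variant passes.
runners = {
    "linux.8xlarge.amd": {"instance_type": "m7a.8xlarge", "os": "linux"},
    "linux.12xlarge.amd": {
        "instance_type": "m6a.12xlarge",
        "os": "linux",
        "variants": {"ephemeral": {"is_ephemeral": True}},
    },
}
print(find_invalid_runner_types(runners))  # ['linux.8xlarge.amd']
```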
Huh, odd. I thought variants were optional, but I see now that all of the other configs have it set. In that case I will retract my suggestion. Maybe @jeanschmidt could provide more details here.
Hi @malfet, I have aligned the runner names. linux.24xl.amd was based on the existing runner linux.24xl.spr-metal, but I have now aligned it with the rest of the AMD CPU runner names.
Added AWS AMD CPU instances and their corresponding runner types to scale-config.yml. Once this is merged, we will add the corresponding changes to the PyTorch repo to trigger inductor CPU CI jobs using these runners. Signed-off-by: Arijit Mukhopadhyay <[email protected]>
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hi @malfet, the CLA process is resolved. Can you please approve this PR?
Added AWS AMD CPU instances and their corresponding runner types to scale-config.yml. Once this is merged, we will add the corresponding changes to the PyTorch repo to trigger inductor CPU CI jobs using these runners.