Pull requests: aws-samples/awsome-distributed-training
Pull requests list
End-to-End LLM Model Development with Torchtitan and Torchtune
enhancement
New feature or request
#341
opened May 20, 2024 by
KeitaW
updated Mar 20, 2025
Update SMPv2 conda setup script with latest PT2.3.1 TSM2.4.0
#366
opened Jun 25, 2024 by
viclzhu
updated Mar 20, 2025
update dependencies of PyTorch base image
#375
opened Jul 15, 2024 by
KeitaW
updated Mar 20, 2025
Update bionemo test case + propose subdirectories per orchestrator
documentation
Improvements or additions to documentation
add tips to force NCCL comm to go through EFA
#531
opened Jan 23, 2025 by
KeitaW
updated Mar 20, 2025
Feat/ddp mlflow
enhancement
New feature or request
#655
opened Apr 28, 2025 by
KeitaW
updated May 3, 2025
Determine and display various versions of interest
#521
opened Jan 6, 2025 by
iankouls-aws
Draft
updated May 22, 2025
added new tool to scale up-down nodes on an instance group
#708
opened Jun 5, 2025 by
paragao
updated Jun 9, 2025
Lustre mount via Ansible for SMHP Slurm LCS
#682
opened May 15, 2025 by
amanshanbhag
updated Jun 17, 2025
Create fsdp-eks-regression.yml
#754
opened Jun 17, 2025 by
amanshanbhag
updated Jun 19, 2025
Feature/slinkly slurm hyperpod eks
enhancement
New feature or request
#651
opened Apr 25, 2025 by
bluecrayon52
updated Jun 23, 2025
Update data path in megatron-lm k8s test case
#762
opened Jun 26, 2025 by
KeitaW
updated Jun 27, 2025