
feat: Add LoRA fine-tuning optimum-neuron example for slurm #643

Merged: 14 commits into aws-samples:main from slurm-workshop, Jun 3, 2025

Conversation

Captainia (Contributor)

Issue #, if available:

Description of changes:

This example uses Slurm as the orchestrator for the Optimum Neuron LoRA fine-tuning example. It also targets the workshop at https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
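
For context, the Slurm version follows the same step-by-step scripts as the blog post referenced later in this thread; submitting a step might look like the sketch below. The directory and script names follow files discussed in the review comments, but the exact layout is an assumption.

# Hypothetical submission flow (sketch); paths and script names are
# assumptions based on the files discussed in this PR's review.
cd ~/awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs
sbatch 1_download_model.sh    # step 1: download the base model and tokenizer
squeue -u "$USER"             # monitor the queued job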

@nghtm (Contributor) commented Apr 16, 2025

Hi Captainia, thank you for the submission! Is this ready for review?

Please connect with me on Slack if you need permissions to the workshop. My Amazon alias is the same as my GitHub alias.

nghtm requested a review from allela-roy on April 16, 2025 at 20:41

@KeitaW (Contributor) left a comment

Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with
https://github.com/aws-samples/awsome-distributed-training/pull/631/files

@@ -0,0 +1,69 @@
# from transformers.models.llama.modeling_llama import LlamaForCausalLM

KeitaW:

You may download the model weights and tokenizer files using the CLI. No need to have get_model.py and 1_download_model.sh.
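
For reference, the suggested CLI route might look like the following sketch, run on the head node; the model ID is an assumption, and the target directory follows the MODEL_OUTPUT_PATH used later in this review.

# Sketch: fetch weights and tokenizer with the Hugging Face CLI instead of
# get_model.py. Model ID and target directory are assumptions.
pip install -U "huggingface_hub[cli]"
huggingface-cli login    # needed for gated models such as Llama 3
huggingface-cli download meta-llama/Meta-Llama-3-8B \
    --local-dir /fsx/ubuntu/peft_ft/model_artifacts/llama3-8B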

Captainia (author):

Thank you. We are currently trying to keep this workshop consistent with the structure of the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It may also be easier to run the Slurm job when both follow the same structure. Please let me know if you think this is a blocker, thanks!

@KeitaW (Contributor) commented Apr 17, 2025

Let's use x.filename instead of x_filename to align with the naming convention of other test cases (e.g. https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/cpu-ddp/slurm).

@Captainia (author) commented

> Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files

Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.

@KeitaW (Contributor) commented Apr 18, 2025

> Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files
>
> Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.

Sounds good! Thank you @Captainia for working on it.

Captainia force-pushed the slurm-workshop branch 2 times, most recently from e2a5146 to 9f0b523, May 12, 2025 at 20:22

@KeitaW (Contributor) left a comment

Left some initial comments.

Comment on lines 31 to 32
RUN git clone https://github.com/aws-samples/awsome-distributed-training.git
RUN cp -r awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/src workspace

KeitaW:

We may simply copy from the local directory.

Captainia (author):

Yes, updated to copy from the local directory.
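
A minimal sketch of that change, assuming the Docker build context is the test-case directory containing src/:

# Dockerfile sketch: copy training sources from the local build context
# instead of cloning the repository inside the image.
COPY src workspace/src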

@@ -39,6 +39,7 @@ export MAX_TRAINING_STEPS=1200
# Generate the final yaml files from templates
for template in tokenize_data compile_peft launch_peft_train consolidation merge_lora; do
cat templates/${template}.yaml-template | envsubst > ${template}.yaml
chmod +x ${template}.yaml

KeitaW:

Could you change the file mode instead of changing it on the fly?

Captainia (author):

Yes, of course. Thanks, updated.
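
For files tracked in git, the executable bit can be committed once instead of set at run time. A sketch; the template name here is illustrative:

# Sketch: persist the executable bit in the repository rather than
# chmod-ing the generated file on the fly (file must already be tracked).
git update-index --chmod=+x templates/launch_peft_train.yaml-template
git commit -m "Set executable bit on template"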

Comment on lines 28 to 29
cd ~/peft_ft
cp -r ~/awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs .

KeitaW:

Not sure why we need to copy the scripts to a separate directory.

Captainia (author):

Yes, this is not necessary; I updated it to just cd into the directory.

MODEL_OUTPUT_PATH="/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B"
TOKENIZER_OUTPUT_PATH="/fsx/ubuntu/peft_ft/tokenizer/llama3-8B"

srun python3 $INPUT_PATH \

KeitaW:

Not sure if we want to download the model using a compute node. Perhaps we can use the Hugging Face CLI on the head node?

Captainia (author):

Thank you. We are currently trying to keep this workshop consistent with the structure of the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It may also be easier to run the Slurm job when both follow the same structure. Please let me know if you think this is a blocker, thanks!

KeitaW merged commit 0cf6337 into aws-samples:main, Jun 3, 2025