
feat: Add LoRA fine-tuning optimum-neuron example for slurm #643

Merged: 14 commits into aws-samples:main from slurm-workshop, Jun 3, 2025

Conversation

Captainia (Contributor)

Issue #, if available:

Description of changes:

This example uses Slurm as the orchestrator for the Optimum Neuron LoRA fine-tuning example. It also targets the workshop at https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
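
For context, the Slurm version follows the same step-by-step scripts as the blog post referenced later in this thread; submitting a step might look like the sketch below. The directory and script names follow files discussed in the review comments, but the exact layout is an assumption.

# Hypothetical submission flow (sketch); paths and script names are
# assumptions based on the files discussed in this PR's review.
cd ~/awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs
sbatch 1_download_model.sh    # step 1: download the base model and tokenizer
squeue -u "$USER"             # monitor the queued job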

@nghtm (Contributor) commented Apr 16, 2025

Hi Captainia, thank you for the submission! Is this ready for review?

Please connect with me on Slack if you need permissions to the workshop. My Amazon alias is the same as my GitHub alias.

nghtm requested a review from allela-roy on April 16, 2025 at 20:41

@KeitaW (Contributor) left a comment

Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with
https://github.com/aws-samples/awsome-distributed-training/pull/631/files

@@ -0,0 +1,69 @@
# from transformers.models.llama.modeling_llama import LlamaForCausalLM

KeitaW:

You may download the model weights and tokenizer files using the CLI. No need to have get_model.py and 1_download_model.sh.
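
For reference, the suggested CLI route might look like the following sketch, run on the head node; the model ID is an assumption, and the target directory follows the MODEL_OUTPUT_PATH used later in this review.

# Sketch: fetch weights and tokenizer with the Hugging Face CLI instead of
# get_model.py. Model ID and target directory are assumptions.
pip install -U "huggingface_hub[cli]"
huggingface-cli login    # needed for gated models such as Llama 3
huggingface-cli download meta-llama/Meta-Llama-3-8B \
    --local-dir /fsx/ubuntu/peft_ft/model_artifacts/llama3-8B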

Captainia (author):

Thank you. We are currently trying to keep this workshop consistent with the structure of the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It may also be easier to run the Slurm job when both follow the same structure. Please let me know if you think this is a blocker, thanks!

@KeitaW (Contributor) commented Apr 17, 2025

Let's use x.filename instead of x_filename to align with the naming convention of other test cases (e.g. https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/cpu-ddp/slurm).

@Captainia (author) commented

> Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files

Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.

@KeitaW (Contributor) commented Apr 18, 2025

> Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files
>
> Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.

Sounds good! Thank you @Captainia for working on it.

Captainia force-pushed the slurm-workshop branch 2 times, most recently from e2a5146 to 9f0b523, May 12, 2025 at 20:22

@KeitaW (Contributor) left a comment

Left some initial comments.

Comment on lines 31 to 32
RUN git clone https://github.com/aws-samples/awsome-distributed-training.git
RUN cp -r awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/src workspace

KeitaW:

We may simply copy from the local directory.

Captainia (author):

Yes, updated to copy from the local directory.
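
A minimal sketch of that change, assuming the Docker build context is the test-case directory containing src/:

# Dockerfile sketch: copy training sources from the local build context
# instead of cloning the repository inside the image.
COPY src workspace/src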

@@ -39,6 +39,7 @@ export MAX_TRAINING_STEPS=1200
# Generate the final yaml files from templates
for template in tokenize_data compile_peft launch_peft_train consolidation merge_lora; do
cat templates/${template}.yaml-template | envsubst > ${template}.yaml
chmod +x ${template}.yaml

KeitaW:

Could you change the file mode instead of changing it on the fly?

Captainia (author):

Yes, of course. Thanks, updated.
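
For files tracked in git, the executable bit can be committed once instead of set at run time. A sketch; the template name here is illustrative:

# Sketch: persist the executable bit in the repository rather than
# chmod-ing the generated file on the fly (file must already be tracked).
git update-index --chmod=+x templates/launch_peft_train.yaml-template
git commit -m "Set executable bit on template"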

Comment on lines 28 to 29
cd ~/peft_ft
cp -r ~/awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs .

KeitaW:

Not sure why we need to copy the scripts to a separate directory.

Captainia (author):

Yes, this is not necessary; I updated it to just cd into the directory.

MODEL_OUTPUT_PATH="/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B"
TOKENIZER_OUTPUT_PATH="/fsx/ubuntu/peft_ft/tokenizer/llama3-8B"

srun python3 $INPUT_PATH \

KeitaW:

Not sure if we want to download the model using a compute node. Perhaps we can use the Hugging Face CLI on the head node?

Captainia (author):

Thank you. We are currently trying to keep this workshop consistent with the structure of the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It may also be easier to run the Slurm job when both follow the same structure. Please let me know if you think this is a blocker, thanks!

KeitaW merged commit 0cf6337 into aws-samples:main, Jun 3, 2025