feat: Add LoRA fine-tuning optimum-neuron example for slurm #643
Conversation
Hi Captainia, thank you for the submission! Is this ready for review? Please connect with me on Slack if you need permissions to the workshop. My Amazon alias is the same as my GitHub alias.
Could you please consolidate all the Python files for the k8s/slurm optimum test cases into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files
@@ -0,0 +1,69 @@
# from transformers.models.llama.modeling_llama import LlamaForCausalLM
You may download the model weights and tokenizer file using the CLI. No need to have get_model.py and 1_download_model.sh.
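For reference, a minimal sketch of the CLI-based download being suggested, assuming the gated Llama 3 8B repository on the Hugging Face Hub and a token already available as HF_TOKEN; the target directory reuses the model path that appears in the submit scripts:

```bash
# Sketch only, not the PR's actual script: pull weights and tokenizer with the
# Hugging Face CLI instead of get_model.py / 1_download_model.sh.
# Repo id and token handling are assumptions; tokenizer files land in the same directory.
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download meta-llama/Meta-Llama-3-8B \
    --local-dir /fsx/ubuntu/peft_ft/model_artifacts/llama3-8B
```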
Thank you, we are currently trying to keep this workshop in the same structure as the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It might be easier to run the Slurm job since they follow the same structure; please let me know if you think this is a blocker, thanks!
Let's use
…EADME.md Co-authored-by: Keita Watanabe <[email protected]>
…EADME.md Co-authored-by: Keita Watanabe <[email protected]>
Thank you, that sounds good. Will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, nor do I think we should merge the two into a single giant PR.
Sounds good! Thank you @Captainia for working on it.
Force-pushed from e2a5146 to 9f0b523
…ubmit_jobs/3_finetune.sh Co-authored-by: Keita Watanabe <[email protected]>
Left some initial comments.
RUN git clone https://github.com/aws-samples/awsome-distributed-training.git
RUN cp -r awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/src workspace
We may simply copy from the local directory.
Yes, updated to copy from the local directory.
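As an illustration of the change, a hedged sketch of the Dockerfile line, assuming `docker build` is run from 3.test_cases/pytorch/optimum-neuron/llama3 so that src/ is available in the build context:

```dockerfile
# Sketch only: copy the sources from the local build context instead of
# cloning the whole repository inside the image; the destination mirrors
# the original `cp -r ... src workspace`.
COPY src/ workspace/src/
```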
@@ -39,6 +39,7 @@ export MAX_TRAINING_STEPS=1200
# Generate the final yaml files from templates
for template in tokenize_data compile_peft launch_peft_train consolidation merge_lora; do
cat templates/${template}.yaml-template | envsubst > ${template}.yaml
chmod +x ${template}.yaml
Could you change the file mode instead of changing it on the fly?
Yes, of course, thanks and updated.
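For context, one common way to persist a file mode in git rather than setting it at job time, shown as a sketch only; the file name is illustrative, and this applies only to files tracked in the repository, not to files generated at runtime:

```bash
# Sketch only: record the executable bit in the repository so jobs do not
# need to run chmod on the fly. Illustrative file name.
git update-index --chmod=+x submit_jobs/3_finetune.sh
git commit -m "Set executable bit on submit script"
```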
cd ~/peft_ft
cp -r ~/awsome-distributed-training/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs .
Not sure why we need to copy the scripts to a separate directory.
Yes, this is not necessary; I updated it to just cd into this directory.
MODEL_OUTPUT_PATH="/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B"
TOKENIZER_OUTPUT_PATH="/fsx/ubuntu/peft_ft/tokenizer/llama3-8B"

srun python3 $INPUT_PATH \
Not sure if we want to download the model using the compute node. Perhaps we can use the hf CLI on the head node?
Thank you, we are currently trying to keep this workshop in the same structure as the blog post https://aws.amazon.com/blogs/machine-learning/peft-fine-tuning-of-llama-3-on-sagemaker-hyperpod-with-aws-trainium/. It might be easier to run the Slurm job since they follow the same structure; please let me know if you think this is a blocker, thanks!
Issue #, if available:
Description of changes:
This example uses Slurm as the orchestrator for the optimum-neuron LoRA fine-tuning example. It also targets the workshop at https://catalog.workshops.aws/sagemaker-hyperpod/en-US.
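For orientation, a hedged usage sketch of how the fine-tuning step of this example would be submitted on a Slurm cluster; only 3_finetune.sh is named with its full path in this PR, and the earlier numbered steps are assumed to follow the pattern described in the example's README:

```bash
# Sketch only: submit the LoRA fine-tuning job from the example directory.
# Earlier steps (model download, tokenization, compilation) are assumed to have
# their own numbered submit scripts alongside this one.
cd 3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/submit_jobs
sbatch 3_finetune.sh
```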
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.