This repository contains example scripts for deep learning, including pretraining configurations for Large Language Models (LLMs) and Multimodal Models.
- NeMo LLM Pretraining Scripts: Example scripts for pretraining LLMs with the NeMo Framework, adapted from NeMo-Framework-Launcher.
- Megatron-LM LLM Pretraining Scripts: Example scripts for pretraining LLMs, adapted from Megatron-LM.
- Megatron-DeepSpeed LLM Pretraining Scripts: Example scripts for pretraining LLMs, adapted from Megatron-DeepSpeed.
- training/nemo/neva: Scripts for pretraining the multimodal NeVA (LLaVA) model with the recommended configuration (from NeMo-Framework-Launcher) on NVIDIA H100 GPUs, in fp16, running on the NeMo Framework.
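For orientation, the sketch below shows how these script groups could be laid out. Only training/nemo/neva and launcher_scripts/k8s are named explicitly in this README; the other paths are assumptions inferred from the descriptions above.

```
deep_learning_examples/
├── training/
│   ├── nemo/                  # NeMo LLM pretraining scripts (neva/ holds the NeVA/LLaVA configs)
│   ├── megatron-lm/           # Megatron-LM pretraining scripts (assumed path)
│   └── megatron-deepspeed/    # Megatron-DeepSpeed pretraining scripts (assumed path)
└── launcher_scripts/
    └── k8s/                   # Kubernetes PyTorchJob launcher scripts
```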
Before running the examples, ensure the following:
- Container: Use the ScitiX NeMo container (registry-ap-southeast.scitix.ai/hpc/nemo:24.07) or the NGC NeMo container (nemo:24.07). If using NGC, clone this repository into the container or into shared storage accessible by the distributed worker containers, as sketched below.
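A minimal setup sketch follows. The image path is taken from this README; the repository URL and shared-storage mount point are placeholders, not values from this repository.

```bash
# Pull the ScitiX NeMo container (image path from this README).
docker pull registry-ap-southeast.scitix.ai/hpc/nemo:24.07

# If using the NGC nemo:24.07 image instead, clone this repository into
# shared storage reachable by every distributed worker container.
# The URL and mount point below are placeholders.
git clone https://github.com/<org>/deep_learning_examples.git \
  /shared/deep_learning_examples
export DEEP_LEARNING_EXAMPLES_DIR=/shared/deep_learning_examples
```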
- Datasets: Refer to the README.md under deep_learning_examples/training for dataset preparation.
  - For LLM pretraining based on NeMo or Megatron-LM, mock data can be used (a hedged example follows this list).
  - For ScitiX SiFlow or CKS, preset datasets are available.
- Pretrained Models: Prepare the corresponding pretrained models for fine-tuning and multimodal pretraining. Preset models are available for ScitiX SiFlow or CKS.
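To use mock data with the NeMo or Megatron-LM scripts, the relevant switches look roughly like the sketch below. Option names vary across framework versions, so treat these as assumptions and check each script's README for the exact flag.

```bash
# Hedged sketch; exact option names depend on the framework version.

# Megatron-LM: recent versions accept --mock-data, which synthesizes
# training samples so no dataset files are needed.
python pretrain_gpt.py --mock-data ...  # plus the usual model/parallelism args

# NeMo: GPT pretraining can select a mock data implementation through a
# Hydra override, so data path arguments can be omitted.
python megatron_gpt_pretraining.py model.data.data_impl=mock ...
```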
Refer to the README.md under deep_learning_examples/training for detailed instructions.
Using the PyTorchJob Operator
Scripts for launching PyTorch jobs on a Kubernetes cluster are located in launcher_scripts/k8s.
For example, to launch LLaMA2-13B pretraining, run the following commands:

```bash
cd ${DEEP_LEARNING_EXAMPLES_DIR}/launcher_scripts/k8s/training/llm
./launch_nemo_llama2_13b_bf16.sh
```
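Once submitted, the resulting PyTorchJob can be inspected with standard kubectl commands. The job name and namespace below are placeholders; use the values printed by the launch script.

```bash
# List PyTorchJob resources created by the launcher.
kubectl get pytorchjobs -n <namespace>

# Inspect status and events for a specific job (placeholder name).
kubectl describe pytorchjob llama2-13b-pretrain -n <namespace>

# Follow training logs from the master replica's pod.
kubectl logs -f llama2-13b-pretrain-master-0 -n <namespace>
```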