DeepSeek V3/R1/Prover-V2 671B SFT with LoRA #697

35 changes: 35 additions & 0 deletions 3.test_cases/pytorch/colossalai/README.md
@@ -0,0 +1,35 @@
# Colossal-AI

## Dependencies

As of the Apr 18th 2025 [commit](https://github.com/hpcaitech/ColossalAI/tree/46ed5d856b16b074325091a88e761544b3d4f9f0), Colossal-AI requires PyTorch 2.5.1, whose official builds use CUDA 12.4. We therefore use `nvidia/cuda:12.4.1-devel-ubuntu22.04` as the base image and install all dependencies on top of it in [colossalai.Dockerfile](colossalai.Dockerfile).
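Once the image is built (next section), you can verify this pairing with a quick check; this is a suggested sanity check, not part of the build:

```bash
# Suggested check: confirm PyTorch 2.5.1 built against CUDA 12.4 inside the image
docker run --rm --gpus all colossalai:latest \
    python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```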

## Build Docker Image

Building Colossal-AI from scratch requires GPU support, so the NVIDIA container runtime must be set as Docker's default runtime on the build node; the build job itself is launched on a GPU node via Slurm.
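If the NVIDIA runtime is not already the default, the typical NVIDIA Container Toolkit setup looks like the following (a sketch; paths may differ on your AMI):

```bash
# Make the NVIDIA runtime the Docker default, then restart the daemon
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```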

Log in to AWS ECR:
```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

aws ecr get-login-password ...
```
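The full login command follows the standard ECR pattern; `<account-id>` is a placeholder, and the region should match your `DOCKER_REPO`:

```bash
# Standard ECR login pattern; substitute your own account ID and region
aws ecr get-login-password --region ap-northeast-1 \
    | docker login --username AWS --password-stdin <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com
```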

Build the Docker image on the GPU node and push it to the repository:
```bash
export DOCKER_REPO=159553542841.dkr.ecr.ap-northeast-1.amazonaws.com/belevich/colossalai
srun ./build_docker.sh
```

Pull the Docker image from the repository:
```bash
docker pull $DOCKER_REPO:latest
```

Import the Docker image into an enroot squash file (you may need to remove a previously created one first with `rm ./colossalai.sqsh`):
```bash
enroot import -o ./colossalai.sqsh dockerd://$DOCKER_REPO:latest
```
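Optionally, verify that GPUs are visible inside the imported image (this assumes the cluster runs Slurm with the pyxis/enroot integration used throughout this example):

```bash
srun --container-image ./colossalai.sqsh nvidia-smi
```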


22 changes: 22 additions & 0 deletions 3.test_cases/pytorch/colossalai/build_docker.sh
@@ -0,0 +1,22 @@
#!/bin/bash

# Run on a GPU node via Slurm so the NVIDIA runtime is available during the build.
if [ -z "$SLURM_JOB_ID" ]; then
    echo "Run with slurm: srun ./build_docker.sh"
    exit 1
fi

# Fail fast if the target repository is not configured.
if [ -z "$DOCKER_REPO" ]; then
    echo "DOCKER_REPO is not set"
    exit 1
fi

docker build --progress=plain -f colossalai.Dockerfile -t colossalai:latest .

if [ $? -ne 0 ]; then
    echo "Failed to build docker image"
    exit 1
fi

docker tag colossalai:latest $DOCKER_REPO:latest

docker push $DOCKER_REPO:latest
179 changes: 179 additions & 0 deletions 3.test_cases/pytorch/colossalai/colossalai.Dockerfile
@@ -0,0 +1,179 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

ARG GDRCOPY_VERSION=v2.4.4
ARG EFA_INSTALLER_VERSION=1.38.1
ARG AWS_OFI_NCCL_VERSION=v1.14.0
ARG NCCL_VERSION=v2.26.2-1
ARG NCCL_TESTS_VERSION=v2.14.1

RUN apt-get update -y && apt-get upgrade -y
RUN apt-get remove -y --allow-change-held-packages \
ibverbs-utils \
libibverbs-dev \
libibverbs1 \
libmlx5-1 \
libnccl2 \
libnccl-dev

RUN rm -rf /opt/hpcx \
&& rm -rf /usr/local/mpi \
&& rm -f /etc/ld.so.conf.d/hpcx.conf \
&& ldconfig

ENV OPAL_PREFIX=

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
apt-utils \
autoconf \
automake \
build-essential \
check \
cmake \
curl \
debhelper \
devscripts \
git \
gcc \
gdb \
kmod \
libsubunit-dev \
libtool \
openssh-client \
openssh-server \
pkg-config \
python3-distutils \
vim
RUN apt-get purge -y cuda-compat-*

RUN mkdir -p /var/run/sshd
RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/lib:$LD_LIBRARY_PATH
ENV PATH /opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/bin:/usr/local/bin:$PATH

RUN curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py \
&& python3 /tmp/get-pip.py \
&& pip3 install awscli pynvml

#################################################
## Install NVIDIA GDRCopy
##
## NOTE: if `nccl-tests` or `/opt/gdrcopy/bin/sanity -v` crashes with incompatible version, ensure
## that the cuda-compat-xx-x package is the latest.
RUN git clone -b ${GDRCOPY_VERSION} https://github.com/NVIDIA/gdrcopy.git /tmp/gdrcopy \
&& cd /tmp/gdrcopy \
&& make prefix=/opt/gdrcopy install

ENV LD_LIBRARY_PATH /opt/gdrcopy/lib:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /opt/gdrcopy/lib:$LIBRARY_PATH
ENV CPATH /opt/gdrcopy/include:$CPATH
ENV PATH /opt/gdrcopy/bin:$PATH

#################################################
## Install EFA installer
RUN cd $HOME \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
&& tar -xf $HOME/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& rm -rf $HOME/aws-efa-installer

###################################################
## Install NCCL
RUN git clone -b ${NCCL_VERSION} https://github.com/NVIDIA/nccl.git /opt/nccl \
&& cd /opt/nccl \
&& make -j $(nproc) src.build CUDA_HOME=/usr/local/cuda \
NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_90,code=sm_90"

###################################################
## Install AWS-OFI-NCCL plugin
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y libhwloc-dev
# Switch from sh to bash to allow parameter expansion
SHELL ["/bin/bash", "-c"]
RUN curl -OL https://github.com/aws/aws-ofi-nccl/releases/download/${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz \
&& tar -xf aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz \
&& cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v} \
&& ./configure --prefix=/opt/aws-ofi-nccl/install \
--with-mpi=/opt/amazon/openmpi \
--with-libfabric=/opt/amazon/efa \
--with-cuda=/usr/local/cuda \
--enable-platform-aws \
&& make -j $(nproc) \
&& make install \
&& cd .. \
&& rm -rf aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v} \
&& rm aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz

SHELL ["/bin/sh", "-c"]

###################################################
## Install NCCL-tests
RUN git clone -b ${NCCL_TESTS_VERSION} https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \
&& cd /opt/nccl-tests \
&& make -j $(nproc) \
MPI=1 \
MPI_HOME=/opt/amazon/openmpi/ \
CUDA_HOME=/usr/local/cuda \
NCCL_HOME=/opt/nccl/build \
NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_90,code=sm_90"

RUN rm -rf /var/lib/apt/lists/*

## Set Open MPI variables to exclude network interface and conduit.
ENV OMPI_MCA_pml=^ucx \
    OMPI_MCA_btl=tcp,self \
    OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent \
    OPAL_PREFIX=/opt/amazon/openmpi \
    NCCL_SOCKET_IFNAME=^docker,lo,veth

## Turn off PMIx Error https://github.com/open-mpi/ompi/issues/7516
ENV PMIX_MCA_gds=hash

## Set LD_PRELOAD for NCCL library
ENV LD_PRELOAD /opt/nccl/build/lib/libnccl.so

# Install Miniconda to not depend on the base image python
RUN mkdir -p /opt/miniconda3 \
&& curl -L https://repo.anaconda.com/miniconda/Miniconda3-py312_25.3.1-1-Linux-x86_64.sh -o /tmp/Miniconda3-py312_25.3.1-1-Linux-x86_64.sh \
&& bash /tmp/Miniconda3-py312_25.3.1-1-Linux-x86_64.sh -b -f -p /opt/miniconda3 \
&& rm /tmp/Miniconda3-py312_25.3.1-1-Linux-x86_64.sh \
&& /opt/miniconda3/bin/conda init bash

ENV PATH="/opt/miniconda3/bin:${PATH}"

COPY gather_state_dict_fast.patch /tmp/gather_state_dict_fast.patch

RUN git clone https://github.com/hpcaitech/ColossalAI.git /tmp/colossalai && \
cd /tmp/colossalai && \
git checkout 46ed5d856b16b074325091a88e761544b3d4f9f0 && \
git apply /tmp/gather_state_dict_fast.patch && \
# optionally prepend BUILD_EXT=1 FORCE_CUDA=1 to pre-build the CUDA extensions at install time
pip install . && \
cd applications/ColossalChat && \
pip install .

# 9.0a targets Hopper GPUs (H100/H200, e.g. p5/p5en instances); used by the source builds below
ENV TORCH_CUDA_ARCH_LIST="9.0a"

# because of https://discuss.huggingface.co/t/valueerror-unable-to-avoid-copy-while-creating-an-array-as-requested/93584/5
RUN pip install "numpy<2.0"

# Install tensornvme from GitHub because the PyPI version is totally outdated
RUN apt update -y && apt install -y libaio-dev && pip install -v git+https://github.com/hpcaitech/TensorNVMe.git

# To use the fused RMSNorm kernel colossalai needs apex built from source:
RUN git clone https://github.com/NVIDIA/apex /tmp/apex && \
cd /tmp/apex && \
NVCC_APPEND_FLAGS="--threads 4" \
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 8" ./

# Build flash-attn
RUN MAX_JOBS=48 pip install flash-attn==2.7.4.post1 --no-cache-dir --no-deps --no-build-isolation --verbose --force-reinstall

# Install transformers==4.52.4 for better support of DeepSeek
RUN pip install transformers==4.52.4

RUN pip install math-verify==0.7.0 tqdm
2 changes: 2 additions & 0 deletions 3.test_cases/pytorch/colossalai/deepseek-lora-finetune/.gitignore
@@ -0,0 +1,2 @@
DeepSeek-V3
logs
122 changes: 122 additions & 0 deletions 3.test_cases/pytorch/colossalai/deepseek-lora-finetune/README.md
@@ -0,0 +1,122 @@
# DeepSeek V3/R1/Prover-V2 671B SFT with LoRA

This example uses the Colossal-AI container image built in the parent directory.

## Download model weights

```bash
pip install -U "huggingface_hub[cli]"
```
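Downloads of this size can be slow; optionally enable the `hf_transfer` accelerator (an optional extra, not required by this example):

```bash
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
```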

Choose the model you want to finetune:

- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepSeek-V3-0324
- deepseek-ai/DeepSeek-R1
- deepseek-ai/DeepSeek-Prover-V2-671B

and define the model name environment variable, for example:
```bash
export MODEL_NAME="deepseek-ai/DeepSeek-R1"
```

Download the model weights from Hugging Face and find the model path:
```bash
huggingface-cli download $MODEL_NAME
# Resolve the local snapshot directory of the downloaded model
export MODEL_PATH=`python -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('$MODEL_NAME', filename='config.json')).parent)"`
# Derive the HF cache root (five levels up from config.json) unless HF_HOME is already set
export HF_HOME=${HF_HOME:-$(python -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('$MODEL_NAME', filename='config.json')).parent.parent.parent.parent.parent)")}
```
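Optionally confirm the download before converting (expected contents are indicative):

```bash
ls $MODEL_PATH      # expect config.json, tokenizer files, and fp8 *.safetensors shards
du -sh $MODEL_PATH  # the 671B fp8 checkpoints are several hundred GB
```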

## Convert fp8 weights to bf16

Since the released model weights are fp8 and SFT requires bf16 weights, we use `fp8_cast_bf16.py` from the `DeepSeek-V3` repo to convert the weights to bf16.

Clone DeepSeek V3 repo:
```bash
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
```
Launch the conversion job on a GPU node:
```bash
srun \
--container-image ../colossalai.sqsh \
--container-mounts ./:/workdir,$HF_HOME:$HF_HOME \
python /workdir/DeepSeek-V3/inference/fp8_cast_bf16.py \
--input-fp8-hf-path $MODEL_PATH \
--output-bf16-hf-path /workdir/$MODEL_NAME-bf16
```

Copy the model config and tokenizer files to the output directory:
```bash
cp -L $MODEL_PATH/*.json ./$MODEL_NAME-bf16/
cp -L $MODEL_PATH/*.py ./$MODEL_NAME-bf16/
```

## Launch LoRA finetuning

```bash
sbatch lora_finetune.sbatch $MODEL_NAME AI-MO/NuminaMath-TIR train
```
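The same script should work for the other models listed above, provided their weights were converted the same way; e.g. a hypothetical Prover-V2 run:

```bash
sbatch lora_finetune.sbatch deepseek-ai/DeepSeek-Prover-V2-671B AI-MO/NuminaMath-TIR train
```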
Check the logs:
```bash
tail -f -n +0 slurm-XXX.out
```
Example output on 15 p5en nodes (DP=5, PP=3, EP=8):
```
Step: 3%|▎ | 6/224 [22:39<5:48:26, 95.90s/it, loss=0.794, grad_norm=0.161]
Step: 5%|▌ | 12/224 [25:24<2:14:47, 37.97s/it, loss=0.506, grad_norm=0.108]
Step: 8%|▊ | 17/224 [28:09<1:38:50, 28.65s/it, loss=0.442, grad_norm=0.124]
Step: 10%|█ | 23/224 [30:54<1:32:21, 27.57s/it, loss=0.429, grad_norm=0.0904]
Step: 13%|█▎ | 29/224 [33:16<1:34:26, 29.06s/it, loss=0.411, grad_norm=0.0404]
Step: 16%|█▌ | 34/224 [36:55<1:43:15, 32.61s/it, loss=0.383, grad_norm=0.0298]
Step: 18%|█▊ | 40/224 [40:09<1:33:32, 30.50s/it, loss=0.368, grad_norm=0.0255]
Step: 21%|██ | 46/224 [42:27<1:22:43, 27.89s/it, loss=0.367, grad_norm=0.0252]
Step: 23%|██▎ | 51/224 [45:13<1:19:52, 27.70s/it, loss=0.354, grad_norm=0.0262]
Step: 25%|██▌ | 57/224 [47:31<1:16:52, 27.62s/it, loss=0.346, grad_norm=0.0232]
Step: 28%|██▊ | 62/224 [50:16<1:14:09, 27.47s/it, loss=0.355, grad_norm=0.0211]
Step: 30%|███ | 68/224 [52:34<1:11:46, 27.61s/it, loss=0.336, grad_norm=0.0214]
Step: 33%|███▎ | 73/224 [55:36<1:14:31, 29.61s/it, loss=0.34, grad_norm=0.021]
Step: 35%|███▌ | 79/224 [57:57<1:07:50, 28.01s/it, loss=0.339, grad_norm=0.0212]
Step: 38%|███▊ | 84/224 [1:00:27<1:13:01, 31.30s/it, loss=0.325, grad_norm=0.0224]
Step: 40%|███▉ | 89/224 [1:03:35<1:07:18, 29.92s/it, loss=0.324, grad_norm=0.0206]
Step: 42%|████▏ | 95/224 [1:05:52<1:00:10, 27.78s/it, loss=0.338, grad_norm=0.0224]
Step: 45%|████▍ | 100/224 [1:08:08<56:34, 27.37s/it, loss=0.325, grad_norm=0.0213]
Step: 47%|████▋ | 105/224 [1:10:53<54:21, 27.41s/it, loss=0.318, grad_norm=0.0206]
Step: 49%|████▉ | 110/224 [1:13:11<51:55, 27.33s/it, loss=0.342, grad_norm=0.0208]
Step: 52%|█████▏ | 116/224 [1:15:40<51:53, 28.56s/it, loss=0.334, grad_norm=0.0214]
Step: 54%|█████▍ | 121/224 [1:18:04<48:21, 28.62s/it, loss=0.336, grad_norm=0.02]
Step: 56%|█████▋ | 126/224 [1:20:21<44:45, 27.40s/it, loss=0.33, grad_norm=0.0211]
Step: 58%|█████▊ | 131/224 [1:22:38<42:28, 27.41s/it, loss=0.326, grad_norm=0.022]
Step: 61%|██████ | 136/224 [1:25:29<46:47, 31.90s/it, loss=0.344, grad_norm=0.0233]
Step: 63%|██████▎ | 141/224 [1:28:45<38:47, 28.05s/it, loss=0.328, grad_norm=0.0218]
Step: 65%|██████▌ | 146/224 [1:30:29<35:42, 27.47s/it, loss=0.329, grad_norm=0.0218]
Step: 67%|██████▋ | 151/224 [1:32:47<33:11, 27.28s/it, loss=0.33, grad_norm=0.0208]
Step: 70%|██████▉ | 156/224 [1:35:16<32:35, 28.75s/it, loss=0.322, grad_norm=0.0216]
Step: 72%|███████▏ | 161/224 [1:37:37<30:45, 29.29s/it, loss=0.328, grad_norm=0.0238]
Step: 74%|███████▍ | 166/224 [1:39:26<26:40, 27.60s/it, loss=0.313, grad_norm=0.0236]
Step: 76%|███████▋ | 171/224 [1:41:43<24:12, 27.40s/it, loss=0.337, grad_norm=0.0435]
Step: 79%|███████▊ | 176/224 [1:44:00<21:54, 27.39s/it, loss=0.328, grad_norm=0.0222]
Step: 81%|████████ | 181/224 [1:46:24<21:01, 29.35s/it, loss=0.332, grad_norm=0.0226]
Step: 83%|████████▎ | 186/224 [1:49:02<18:43, 29.57s/it, loss=0.329, grad_norm=0.0215]
Step: 85%|████████▌ | 191/224 [1:51:18<15:45, 27.81s/it, loss=0.325, grad_norm=0.0217]
Step: 88%|████████▋ | 195/224 [1:53:42<14:02, 29.06s/it, loss=0.331, grad_norm=0.0221]
Step: 89%|████████▉ | 200/224 [1:55:59<11:04, 27.69s/it, loss=0.32, grad_norm=0.0207]
Step: 92%|█████████▏| 205/224 [1:58:00<09:17, 29.32s/it, loss=0.311, grad_norm=0.0224]
Step: 94%|█████████▍| 210/224 [2:00:16<06:27, 27.64s/it, loss=0.327, grad_norm=0.023]
Step: 96%|█████████▌| 214/224 [2:02:34<04:35, 27.57s/it, loss=0.315, grad_norm=0.0273]
Step: 98%|█████████▊| 219/224 [2:04:50<02:16, 27.38s/it, loss=0.348, grad_norm=0.0217]
Step: 100%|██████████| 224/224 [2:06:39<00:00, 27.30s/it, loss=0.325, grad_norm=0.0236]

Start saving final model checkpoint to /workdir/deepseek-ai/DeepSeek-R1-bf16-lora
Saved final model checkpoint at epoch 0 at folder /workdir/deepseek-ai/DeepSeek-R1-bf16-lora in 63.06 seconds

```

## Launch LoRA evaluation

```bash
srun \
--mpi=pmix --cpu-bind=none \
--container-image ../colossalai.sqsh \
--container-mounts ./:/workdir,$HF_HOME:$HF_HOME \
python /workdir/lora_eval.py -m deepseek-ai/DeepSeek-R1 -d AI-MO/NuminaMath-TIR
```