[📄 Paper] • [🐳 Docker] • [🗁 GitHub]
🔥 Official repo for "MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards".
❗️ Most of the files are inherited from AllenAI's great work. We show our greatest respect for their efforts, and all relevant rights are reserved for the ORIGINAL authors!
- [2025/01/23] 🔥🔥🔥 MoS has been accepted to ICLR 2025 (Poster)!
The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods.
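For intuition, the core mechanism can be sketched in a few lines of PyTorch. This is a conceptual illustration only, not the implementation in this repo: all class, parameter, and size names below are made up, and the routing is simplified to a plain top-k selection.

```python
# Conceptual sketch of MoS (illustrative only, not this repo's implementation):
# layers draw shards from a shared global pool and concatenate them, together
# with a few privatized shards, into their low-rank matrices.
import torch
import torch.nn as nn


class GlobalShardPool(nn.Module):
    """A pool of shard vectors shared across layers (inter-layer sharing)."""

    def __init__(self, num_shards: int, shard_dim: int):
        super().__init__()
        self.shards = nn.Parameter(torch.randn(num_shards, shard_dim) * 0.02)


class MoSLowRankMatrix(nn.Module):
    """Builds one low-rank matrix (e.g. LoRA's A) from shared + private shards."""

    def __init__(self, pool: GlobalShardPool, n_select: int, n_private: int):
        super().__init__()
        self.pool = pool
        self.n_select = n_select
        # MoE-like routing scores over the global pool (subset selection).
        self.router = nn.Parameter(torch.zeros(pool.shards.shape[0]))
        # Shards owned exclusively by this layer (shard privatization).
        self.private = nn.Parameter(torch.randn(n_private, pool.shards.shape[1]) * 0.02)

    def forward(self) -> torch.Tensor:
        # Pick the top-k shards from the global pool for this layer ...
        selected = self.pool.shards[torch.topk(self.router, self.n_select).indices]
        # ... and concatenate them with the private shards into a (rank, dim) matrix.
        return torch.cat([selected, self.private], dim=0)


# Two layers share one pool yet obtain differentiated low-rank matrices.
pool = GlobalShardPool(num_shards=64, shard_dim=4096)
layer1_A = MoSLowRankMatrix(pool, n_select=6, n_private=2)
layer2_A = MoSLowRankMatrix(pool, n_select=6, n_private=2)
print(layer1_A().shape)  # torch.Size([8, 4096]): an effective rank-8 LoRA A
```

In the actual method, the A and B sides draw from dissociated pools (pair dissociation) and the sharding and scaling details differ; please refer to the paper for the full design.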
To facilitate the integration of MoS into your own applications, we primarily rely on the call share_lora_chunkwisely(model, chunk_config)
in finetune_trainer.py
to substitute the loaded LoRA modules. To deploy MoS in a customized application, simply add this call after LoRA is loaded, as sketched below.
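As a reference, here is a minimal sketch of such a customized setup. It assumes a standard Transformers/PEFT LoRA pipeline; the model name, the LoRA hyperparameters, the import path of `share_lora_chunkwisely`, and the contents of `chunk_config` are placeholders to be adapted from `finetune_trainer.py`.

```python
# Sketch of deploying MoS in a custom script (a standard Transformers/PEFT LoRA
# pipeline is assumed; the model name and LoRA hyperparameters are only examples).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# share_lora_chunkwisely is provided by this repo; adjust the import to where
# it lives in your checkout (it is invoked in finetune_trainer.py).
from finetune_trainer import share_lora_chunkwisely

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # standard LoRA is attached here

# Apply MoS right after LoRA is loaded: the call below substitutes the loaded
# LoRA modules with mixtures of shards drawn from the shared pools.
chunk_config = {}  # fill with the MoS settings used in finetune_trainer.py
share_lora_chunkwisely(model, chunk_config)
```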
# Clone the repo to local machine
git clone https://github.com/Forence1999/MoS.git
cd MoS
We recommend setting up the environment with our Docker image, which prepares the whole environment and eases reproduction with minimal effort.
# Pull the image for finetuning on LLaMA2 from dockerhub
docker pull forence/open-instruct:v1
# Start the container, remember to replace <PROJECT_DIR> with your own project directory
docker run \
--name mos_llama2 \
--gpus all \
--network=host \
-v <PROJECT_DIR>:/workspace \
-it forence/open-instruct:v1 /bin/bash
cd /workspace
# Pull the image for finetuning on LLaMA3 from dockerhub
docker pull forence/mop_llama3:v0
# Start the container, remember to replace <PROJECT_DIR> with your own project directory
docker run \
--name mos_llama3 \
--gpus all \
--network=host \
-v <PROJECT_DIR>:/workspace \
-it forence/mop_llama3:v0 /bin/bash
cd /workspace
# Switch to the branch for LLaMA3.2-3B
git checkout LLaMA3
If you use the Docker image above, this step can be skipped because the conda environment is already prepared inside it.
# Create and activate conda environment
conda create -n mos python=3.11
conda activate mos
# Install required dependencies
pip install -r requirements.txt
The data preparation is inherited from the paper "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources" and the open-instruct GitHub repo, which can be referred to for detailed information. For simplicity, you can download and process the datasets for both fine-tuning and evaluation with the following scripts:
# Prepare the training data
./scripts/prepare_train_data.sh
# Prepare the evaluation data
./scripts/prepare_eval_data.sh
The LLaMA series requires additional access requests before the models can be downloaded. For LLaMA2 models, please refer to the Hugging Face documentation for LLaMA to request an access token.
There are two alternative methods to pass the access token:
- Pass as a parameter (Recommended)
# Set the <HF_TOKEN> in the shell script and pass it as:
--token ${HF_TOKEN}
- Save it locally through huggingface_hub
python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('<HF_TOKEN>')"
All the preparation work is done! Here's an example of fine-tuning LLaMA2-7B on SuperNI and evaluating on MMLU. The running script is as follows:
# Before running the following script, please replace the <HF_TOKEN> with your own huggingface token
bash ft_llama2_7b_superni_mmlu.sh <LORA_RANK> <SEED> <GPU_ID> <LEARNING_RATE> <FINE-TUNE_MODE> <INIT_LORA_A_VEC_VALUE> <INIT_NORM_STD> <NUM_PRIVATE_RANK> <VALID_LORA_RANK> <NUM_CHUNK>
| Positional argument | Corresponding setting |
| --- | --- |
| `<LORA_RANK>` | `LoRA_r` |
| `<SEED>` | `Seed` |
| `<GPU_ID>` | `GPU` |
| `<LEARNING_RATE>` | `Learning Rate` |
| `<FINE-TUNE_MODE>` | `Fine-Tune_Mode` |
| `<INIT_LORA_A_VEC_VALUE>` | `init_lora_A_vec_value` |
| `<INIT_NORM_STD>` | `init_norm_std` |
| `<NUM_PRIVATE_RANK>` | `valid_param_private_r` |
| `<VALID_LORA_RANK>` | `valid_param_lora_r` |
| `<NUM_CHUNK>` | `num_chunk` |
Here's a detailed description of each parameter:
- `LORA_RANK`: The rank of MoS, referred to as the variable *r* in our paper.
- `SEED`: Random seed.
- `GPU_ID`: The ID of the GPU assigned to the run.
- `LEARNING_RATE`: Linear learning rate.
- `FINE-TUNE_MODE`: Set to "mos" by default to activate MoS.
- `INIT_LORA_A_VEC_VALUE`: The *mean* of the Gaussian distribution used in *Random Scaling*.
- `INIT_NORM_STD`: The *standard deviation* of the Gaussian distribution used in *Random Scaling*.
- `NUM_PRIVATE_RANK`: The number of equivalent LoRA ranks kept private in *Shard Privatization*.
- `VALID_LORA_RANK`: The number of equivalent valid LoRA ranks for fine-tuning.
- `NUM_CHUNK`: The number of shards defined in *Vector Sharding*.
We also provide commands to postprocess and summarize the results. The running scripts are as follows:
# For MMLU
python mmlu_summarize.py --ts <TIME_SPAN>
# For TydiQA
python tydiqa_summarize.py --ts <TIME_SPAN>
- `TIME_SPAN`: Only result files whose last modification time falls within the past `TIME_SPAN` hours are included in the summary.
If you find our work helpful, please kindly cite the paper as follows:
@article{wang2025mos,
title={MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards},
author={Sheng Wang and Liheng Chen and Pengan Chen and Jingwei Dong and Boyang Xue and Jiyue Jiang and Lingpeng Kong and Chuan Wu},
year={2025},
eprint={2410.00938},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.00938},
}