[📄 Paper] • [🐳 Docker] • [🗁 GitHub]
🔥 Official repo for "MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards".
❗️ Most of the files are inherited from AllenAI's great work. We show our greatest respect for their efforts, and all relevant rights are reserved for the ORIGINAL authors!
- [2025/01/23] 🔥🔥🔥 MoS has been accepted to ICLR 2025 (Poster)!
The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods.
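For intuition, the core mechanism can be sketched in a few lines of PyTorch. This is a conceptual illustration only, not the implementation in this repo: all class, parameter, and size names below are made up, and the routing is simplified to a plain top-k selection.

```python
# Conceptual sketch of MoS (illustrative only, not this repo's implementation):
# layers draw shards from a shared global pool and concatenate them, together
# with a few privatized shards, into their low-rank matrices.
import torch
import torch.nn as nn


class GlobalShardPool(nn.Module):
    """A pool of shard vectors shared across layers (inter-layer sharing)."""

    def __init__(self, num_shards: int, shard_dim: int):
        super().__init__()
        self.shards = nn.Parameter(torch.randn(num_shards, shard_dim) * 0.02)


class MoSLowRankMatrix(nn.Module):
    """Builds one low-rank matrix (e.g. LoRA's A) from shared + private shards."""

    def __init__(self, pool: GlobalShardPool, n_select: int, n_private: int):
        super().__init__()
        self.pool = pool
        self.n_select = n_select
        # MoE-like routing scores over the global pool (subset selection).
        self.router = nn.Parameter(torch.zeros(pool.shards.shape[0]))
        # Shards owned exclusively by this layer (shard privatization).
        self.private = nn.Parameter(torch.randn(n_private, pool.shards.shape[1]) * 0.02)

    def forward(self) -> torch.Tensor:
        # Pick the top-k shards from the global pool for this layer ...
        selected = self.pool.shards[torch.topk(self.router, self.n_select).indices]
        # ... and concatenate them with the private shards into a (rank, dim) matrix.
        return torch.cat([selected, self.private], dim=0)


# Two layers share one pool yet obtain differentiated low-rank matrices.
pool = GlobalShardPool(num_shards=64, shard_dim=4096)
layer1_A = MoSLowRankMatrix(pool, n_select=6, n_private=2)
layer2_A = MoSLowRankMatrix(pool, n_select=6, n_private=2)
print(layer1_A().shape)  # torch.Size([8, 4096]): an effective rank-8 LoRA A
```

In the actual method, the A and B sides draw from dissociated pools (pair dissociation) and the sharding and scaling details differ; please refer to the paper for the full design.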
To facilitate the integration of MoS into your own applications, we primarily rely on the call share_lora_chunkwisely(model, chunk_config)
in finetune_trainer.py
to substitute the loaded LoRA modules. To deploy MoS in a customized application, simply add this call after LoRA is loaded, as sketched below.
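As a reference, here is a minimal sketch of such a customized setup. It assumes a standard Transformers/PEFT LoRA pipeline; the model name, the LoRA hyperparameters, the import path of `share_lora_chunkwisely`, and the contents of `chunk_config` are placeholders to be adapted from `finetune_trainer.py`.

```python
# Sketch of deploying MoS in a custom script (a standard Transformers/PEFT LoRA
# pipeline is assumed; the model name and LoRA hyperparameters are only examples).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# share_lora_chunkwisely is provided by this repo; adjust the import to where
# it lives in your checkout (it is invoked in finetune_trainer.py).
from finetune_trainer import share_lora_chunkwisely

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # standard LoRA is attached here

# Apply MoS right after LoRA is loaded: the call below substitutes the loaded
# LoRA modules with mixtures of shards drawn from the shared pools.
chunk_config = {}  # fill with the MoS settings used in finetune_trainer.py
share_lora_chunkwisely(model, chunk_config)
```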
# Clone the repo to local machine
git clone https://github.com/Forence1999/MoS.git
cd MoS
We recommend setting up the environment with our Docker image, which prepares the whole environment and eases reproduction with minimal effort.
# Pull the image for finetuning on LLaMA2 from dockerhub
docker pull forence/open-instruct:v1
# Start the container, remember to replace <PROJECT_DIR> with your own project directory
docker run \
--name mos_llama2 \
--gpus all \
--network=host \
-v <PROJECT_DIR>:/workspace \
-it forence/open-instruct:v1 /bin/bash
cd /workspace
# Pull the image for finetuning on LLaMA3 from dockerhub
docker pull forence/mop_llama3:v0
# Start the container, remember to replace <PROJECT_DIR> with your own project directory
docker run \
--name mos_llama3 \
--gpus all \
--network=host \
-v <PROJECT_DIR>:/workspace \
-it forence/mop_llama3:v0 /bin/bash
cd /workspace
# Switch to the branch for LLaMA3.2-3B
git checkout LLaMA3
If you use the Docker image above, this step can be skipped because the conda environment is already prepared inside it.
# Create and activate conda environment
conda create -n mos python=3.11
conda activate mos
# Install required dependencies
pip install -r requirements.txt
The data preparation is inherited from the paper "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources" and the open-instruct GitHub repo, which can be referred to for detailed information. For simplicity, you can download and process the datasets for both fine-tuning and evaluation with the following scripts:
# Prepare the training data
./scripts/prepare_train_data.sh
# Prepare the evaluation data
./scripts/prepare_eval_data.sh
The LLaMA series requires additional access requests before the models can be downloaded. For LLaMA2 models, please refer to the Hugging Face documentation for LLaMA to request an access token.
There are two alternative methods to pass the access token:
- Pass as a parameter (Recommended)
# Set the <HF_TOKEN> in the shell script and pass it as:
--token ${HF_TOKEN}
- Save it locally through huggingface_hub
python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('<HF_TOKEN>')"
All the preparation work is done! Here's an example of fine-tuning LLaMA2-7B on SuperNI and evaluating on MMLU. The running script is as follows:
# Before running the following script, please replace the <HF_TOKEN> with your own huggingface token
bash ft_llama2_7b_superni_mmlu.sh <LORA_RANK> <SEED> <GPU_ID> <LEARNING_RATE> <FINE-TUNE_MODE> <INIT_LORA_A_VEC_VALUE> <INIT_NORM_STD> <NUM_PRIVATE_RANK> <VALID_LORA_RANK> <NUM_CHUNK>
| Positional argument | Corresponding setting |
| --- | --- |
| `<LORA_RANK>` | `LoRA_r` |
| `<SEED>` | `Seed` |
| `<GPU_ID>` | `GPU` |
| `<LEARNING_RATE>` | `Learning Rate` |
| `<FINE-TUNE_MODE>` | `Fine-Tune_Mode` |
| `<INIT_LORA_A_VEC_VALUE>` | `init_lora_A_vec_value` |
| `<INIT_NORM_STD>` | `init_norm_std` |
| `<NUM_PRIVATE_RANK>` | `valid_param_private_r` |
| `<VALID_LORA_RANK>` | `valid_param_lora_r` |
| `<NUM_CHUNK>` | `num_chunk` |
Here's a detailed description of each parameter:
- `LORA_RANK`: The rank of MoS, referred to as the variable *r* in our paper.
- `SEED`: Random seed.
- `GPU_ID`: The ID of the GPU assigned to the run.
- `LEARNING_RATE`: Linear learning rate.
- `FINE-TUNE_MODE`: Set to "mos" by default to activate MoS.
- `INIT_LORA_A_VEC_VALUE`: The *mean* of the Gaussian distribution used in *Random Scaling*.
- `INIT_NORM_STD`: The *standard deviation* of the Gaussian distribution used in *Random Scaling*.
- `NUM_PRIVATE_RANK`: The number of equivalent LoRA ranks kept private in *Shard Privatization*.
- `VALID_LORA_RANK`: The number of equivalent valid LoRA ranks for fine-tuning.
- `NUM_CHUNK`: The number of shards defined in *Vector Sharding*.
We also provide commands to postprocess and summarize the results. The running scripts are as follows:
# For MMLU
python mmlu_summarize.py --ts <TIME_SPAN>
# For TydiQA
python tydiqa_summarize.py --ts <TIME_SPAN>
- `TIME_SPAN`: Only result files whose last modification time falls within the past `TIME_SPAN` hours are included in the summary.
If you find our work helpful, please kindly cite the paper as follows:
@article{wang2025mos,
title={MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards},
author={Sheng Wang and Liheng Chen and Pengan Chen and Jingwei Dong and Boyang Xue and Jiyue Jiang and Lingpeng Kong and Chuan Wu},
year={2025},
eprint={2410.00938},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.00938},
}