Code and models for the papers: (i) FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization, and (ii) LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
(The current code can be applied only to Llama and Llama 2 models.)
| Model | W4A16 | W3A16 | W4A8 |
|---|---|---|---|
| Llama-2-7b | Llama-2-7b-hf-LRQ-w4a16 | Llama-2-7b-hf-LRQ-w3a16 | Llama-2-7b-hf-LRQ-w4a8 |
| Llama-2-13b | Llama-2-13b-hf-LRQ-w4a16 | Llama-2-13b-hf-LRQ-w3a16 | Llama-2-13b-hf-LRQ-w4a8 |
| Llama-2-70b | Llama-2-70b-hf-LRQ-w4a16 | Llama-2-70b-hf-LRQ-w3a16 | Llama-2-70b-hf-LRQ-w4a8 |
Llama 2 models quantized with FlexRound will be uploaded soon.
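The checkpoints in the table above can be loaded like any standard Llama model, assuming the released weights are stored in the usual Hugging Face layout (i.e., already converted to nn.Linear). Below is a minimal sketch; the `<org>` prefix is a placeholder for the Hub namespace that actually hosts the checkpoints.

```python
# Minimal sketch for loading one of the checkpoints listed in the table above.
# "<org>" is a placeholder for the actual Hugging Face Hub namespace; this
# assumes the released weights are in the standard Llama layout.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/Llama-2-7b-hf-LRQ-w4a16"  # hypothetical Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Post-training quantization is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```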
Install the dependencies:
pip install -r requirement.txt
cd scripts/FlexRound
and run one of the bash scripts for the desired model and bit-width.
For example, to quantize the Llama 2 7B model to W4A16 with FlexRound, run
bash Llama-2-7b-hf-FlexRound-w4a16.sh
cd scripts/LRQ
and run one of the bash scripts for the desired model and bit-width.
For example, to quantize the Llama 2 7B model to W4A16 with LRQ, run
bash Llama-2-7b-hf-LRQ-w4a16.sh
Because a model quantized with FlexRound or LRQ consists of custom linear layers, we convert these custom linear layers into nn.Linear for convenience.
For example, if you quantized the Llama 2 7B model and saved it to path/to/quantized_model, run
cd utils
python transform.py --model meta-llama/Llama-2-7b --path path/to/quantized_model --output_dir path/to/output_dir
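For intuition, the conversion essentially swaps each custom quantized linear module for an equivalent nn.Linear that carries the simulated-quantized weights. The sketch below only illustrates this idea; QuantLinear and quantized_weight() are hypothetical stand-ins, and the actual logic lives in utils/transform.py.

```python
# Illustrative sketch of the custom-layer -> nn.Linear conversion (not transform.py).
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Hypothetical stand-in for the repository's custom quantized linear layer."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def quantized_weight(self):
        # The real layer would compute the simulated-quantized weight here.
        return self.weight.data

def replace_quant_linears(module: nn.Module) -> None:
    """Recursively replace every QuantLinear with an equivalent nn.Linear."""
    for name, child in module.named_children():
        if isinstance(child, QuantLinear):
            new = nn.Linear(child.in_features, child.out_features, bias=child.bias is not None)
            new.weight.data.copy_(child.quantized_weight())
            if child.bias is not None:
                new.bias.data.copy_(child.bias.data)
            setattr(module, name, new)
        else:
            replace_quant_linears(child)
```

After the conversion, the model behaves like an ordinary Llama checkpoint, so it can be saved with save_pretrained and reloaded with AutoModelForCausalLM.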
To evaluate perplexity on WikiText-2 with per-channel weight-only quantization, run
cd eval/per-channel-weight-only-quant/wikitext2
bash run.sh
To evaluate with lm-evaluation-harness under per-channel weight-only quantization, run
cd eval/per-channel-weight-only-quant/lm-evaluation-harness
bash run.sh
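If you want to sanity-check WikiText-2 perplexity directly on a transformed checkpoint rather than through run.sh, a standard chunked perplexity loop looks roughly like the sketch below (this is not the repository's evaluation script; path/to/output_dir is the directory produced by transform.py above).

```python
# Minimal WikiText-2 perplexity check on a transformed checkpoint (sketch only).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/output_dir", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("path/to/output_dir")

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(ids.size(1) // seqlen):
    batch = ids[:, i * seqlen : (i + 1) * seqlen].to(model.device)
    with torch.no_grad():
        # labels=batch makes the model return the mean token-level NLL for this chunk
        nlls.append(model(batch, labels=batch).loss * seqlen)
print("WikiText-2 perplexity:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item())
```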
To evaluate on MMLU with per-channel weight and per-token activation quantization, run
cd eval/per-channel-weight-per-token-activation-quant/mmlu
After downloading the MMLU test set as described in README.md, run
bash run.sh
To evaluate with lm-evaluation-harness under per-channel weight and per-token activation quantization, run
cd eval/per-channel-weight-per-token-activation-quant/lm-evaluation-harness
bash run.sh
@misc{lee2023flexroundlearnableroundingbased,
title={FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization},
author={Jung Hyun Lee and Jeonghoon Kim and Se Jung Kwon and Dongsoo Lee},
year={2023},
eprint={2306.00317},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2306.00317},
}
@misc{lee2024lrqoptimizingposttrainingquantization,
title={LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices},
author={Jung Hyun Lee and Jeonghoon Kim and June Yong Yang and Se Jung Kwon and Eunho Yang and Kang Min Yoo and Dongsoo Lee},
year={2024},
eprint={2407.11534},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.11534},
}