GPT 量化加速推理的几个方案

越来越多的开源优质模型，我们的选择也越来越多了，模型的质量上去了，硬件的要求更高了。

在我们有限的硬件条件下，甚至，在一些便携设备下，运行 LLM 几乎不可用。但是还是有些办法的：量化加速，CUDA，Vulcan，Metal，等。

量化加速通常会把LLM的 30G+ 的文件，处理到 10G以下。有一个通用标准：GGUF，单文件，可以让整个过程更轻松。

量化加速方案我目前推荐4种：

llama.cpp （支持的很多，HF上的GGUF能直接用的也有很多）
MLC-LLM（有自己的特定格式，但HF有自己能直接用的模型库，预构建的二进制文件安装很友好）
chatglm.cpp（ChatGLM的量化方案，对于国内的语言环境比较友好，但是缺点很严重）
LM Studio（桌面版安装，哪哪都好，但不开源。）

tips：本文假设你会安装 brew，miniconda，python 等。

image ref: https://textmine.com/post/an-introduction-to-llm-quantization

llama.cpp

quantize ref: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md

clone llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git

install python dependency

python3 -m pip install -r requirements.txt

install llama.cpp binary (or build)

brew install llama.cpp

download model (GLM4)

(or you can download gguf model from https://huggingface.co)

git clone https://www.modelscope.cn/ZhipuAI/glm-4-9b-chat.git

convert hf to gguf

# cd llama.cpp/
python3 convert_hf_to_gguf.py /tmp/glm-4-9b-chat/

# INFO:hf-to-gguf:Model successfully exported to /tmp/glm-4-9b-chat/glm-4-9B-chat-F16.gguf

quantize：

llama-quantize /tmp/glm-4-9b-chat/glm-4-9B-chat-F16.gguf /tmp/glm-4-9b-chat/glm-4-9B-chat-F16-Q4_K_M.gguf Q4_K_M

# main: quantizing '/tmp/glm-4-9b-chat/glm-4-9B-chat-F16.gguf' to '/tmp/glm-4-9b-chat/glm-4-9B-chat-F16-Q4_K_M.gguf' as Q4_K_M

llama-cli -m glm-4-9B-chat-F16-Q4_K_M.gguf -p "I believe the meaning of life is" -n 128

# or 
llama-cli -m glm-4-9B-chat-F16-Q4_K_M.gguf -p "You are a helpful assistant" -cnv

MLC-LLM

quantize ref: https://llm.mlc.ai/docs/install/mlc_llm.html

install (conda)

(ref: https://docs.anaconda.com/miniconda/)

conda create --name mlc-prebuilt  python=3.11

install python dependency (vulkan & dependency)

# vulkan-loader
conda install -c conda-forge gcc libvulkan-loader

# gcc-ng
conda install -c conda-forge libgcc-ng

# mlc-llm
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly

# (test)
python -c "import mlc_llm; print(mlc_llm)"

# (if no found vulkan driver, if you use cuda or etc..., you can ignore this)
sudo apt install libvulkan1

download model

git clone https://hf-mirror.com/mlc-ai/gemma-7b-it-q4f16_2-MLC

mlc_llm chat ./gemma-7b-it-q4f16_2-MLC
# or
mlc_llm serve ./gemma-7b-it-q4f16_2-MLC

chatglm.cpp

总结下来优点和缺点一样明显。

优点是：快到不行
缺点是：只能支持 ChatGLM，对于兼容OpenAI的WebServer太麻烦，安装的时候总是少依赖。

clone chatglm.cpp

git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
git submodule update --init --recursive

install python dependency

python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece

# install chatglm-cpp (or `pip install -U chatglm-cpp`)
pip install .

pip3 install gradio

convert

# you need pre download chatglm model, clone this repo: https://www.modelscope.cn/ZhipuAI/glm-4-9b-chat.git
python3 chatglm_cpp/convert.py -i glm-4-9b-chat -t q4_0 -o models/chatglm-ggml.bin

cd chatglm.cpp/examples

python3 cli_demo.py -m ../models/chatglm-ggml.bin -i
# or gradio web ui.
python3 web_demo.py -m ../models/chatglm-ggml.bin

Wayne's blog

GPT 量化加速推理的几个方案
点击返回顶部

首页

关于

GPT 量化加速推理的几个方案

GPT 量化加速推理的几个方案

llama.cpp

MLC-LLM

chatglm.cpp