Unlocking the Potential of Qwen1.5 LLMs: Inference and Quantization
Chapter 1: Introduction to Qwen1.5 Models
Recently, Alibaba introduced the Qwen1.5 models, which include a range of open pre-trained and chat LLMs available in various sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. Although detailed information about these models is limited, early indications suggest that they may outperform models like Mistral 7B, Mixtral-8x7B, and Llama 2.
The Qwen team has partnered with developers of well-known packages focused on quantization, fine-tuning, and serving LLMs. This collaboration means that Qwen1.5 is already well-integrated into current deep learning frameworks.
In this article, I will briefly summarize the Qwen1.5 models and assess their performance before demonstrating their use. It's important to note that utilizing Qwen1.5 on consumer hardware can present challenges. Additionally, I will explain how to quantize the models using AWQ and GPTQ techniques.
For the examples, I will focus on the Qwen1.5 7B model, although the process is similar for the other sizes. Note, however, that the 72B variant is far too large for fine-tuning or inference on consumer hardware. For the other models, a GPU with at least 24 GB of VRAM is sufficient.
The Qwen1.5 models can be found in this Hugging Face collection. The model license permits commercial applications for projects with fewer than 100 million users.
Section 1.1: Model Architecture and Training
The Qwen team has not released a technical report detailing the training process and architecture of these models. However, the model cards provide some insights. The models are trained on a substantial dataset and utilize a Transformer architecture with several advanced features, including SwiGLU activation, attention QKV bias, group query attention, and a combination of sliding window and full attention mechanisms.
When inspecting one of the Qwen models loaded through Hugging Face Transformers, you'll find that its neural architecture closely resembles that of Mistral 7B and Llama 2. Notably, Qwen1.5 supports a longer context than Llama 2, accommodating up to 32k tokens. Moreover, a "chat" version is available for every released size; these chat models were aligned with DPO (direct preference optimization).
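You can check some of these properties directly from the model configuration, without downloading the weights. The snippet below is a minimal sketch, assuming a recent version of Transformers (4.37 or later, which supports the Qwen2 architecture used by Qwen1.5); the exact values depend on the model size you pick:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen1.5-7B")
print(config.max_position_embeddings)  # maximum context length (32k tokens)
print(config.vocab_size)               # vocabulary size
print(config.hidden_size)              # embedding dimension
print(config.num_key_value_heads)      # key/value heads used by grouped-query attention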
The performance of the models is well-documented in the blog post announcing Qwen1.5, highlighting their results across various benchmarks.
The first video titled "LLMs Quantization Crash Course for Beginners" provides an overview of the quantization process for large language models, making it easier for beginners to understand the practical implications.
Section 1.2: Performance Benchmarks
The most notable comparisons are drawn between Qwen1.5 7B, Llama 2 7B, and Mistral 7B. In most benchmarks, excluding MMLU and BBH, Qwen1.5 7B significantly surpasses both Llama 2 7B and Mistral 7B, with the 14B version performing even better. Interestingly, the 72B version of Qwen1.5 appears to outperform Mixtral-8x7B.
However, it's crucial to approach these benchmark results with caution, as they are sensitive to the evaluation setup, such as the prompt format and the number of few-shot examples. Qwen1.5 models are also available in smaller sizes, making them more accessible for testing on consumer hardware.
Results for the smaller sizes are more mixed: several other models outperform Qwen1.5 1.8B and 4B on numerous tasks. On the other hand, Qwen1.5 models were trained on multilingual data and show impressive results on multilingual benchmarks, where Qwen1.5 14B outperforms Mixtral-8x7B, a model nearly 3.5 times larger.
However, this multilingual capability comes at a cost: the vocabulary of the Qwen1.5 models is almost five times larger than that of Llama 2 and Mistral 7B (151,936 tokens versus 32,000 for Llama 2). As a result, the models are larger and require more memory, with Qwen1.5 7B occupying 15.5 GB on disk compared to 13.5 GB for Llama 2 7B.
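You can verify the vocabulary difference by comparing the tokenizers directly. A minimal sketch (note that the Llama 2 checkpoint is gated, so this assumes you have accepted Meta's license on the Hugging Face Hub):

from transformers import AutoTokenizer

qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(len(qwen_tokenizer))   # roughly 152k tokens for Qwen1.5
print(len(llama_tokenizer))  # 32,000 tokens for Llama 2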
Chapter 2: Utilizing Qwen1.5 on Consumer Hardware
In this chapter, we will explore how to run inference with Qwen1.5 7B and how to quantize it. I will summarize the key observations; the complete code for all experiments is available in the accompanying notebook.
Section 2.1: Inference with Qwen1.5 Using vLLM
To achieve fast inference while optimizing memory usage, we can use vLLM (Apache 2.0 license). Below is the code to run it with the GPTQ-quantized model (the quantization itself is covered in Section 2.2):
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]

# Decoding hyperparameters recommended for the Qwen1.5 chat models
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

# Load the 4-bit GPTQ checkpoint with vLLM
loading_start = time.time()
llm = LLM(model="kaitchup/Qwen1.5-7B-gptq-4bit", quantization="gptq")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

# Generate completions for all prompts in one batch
generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')
I highly recommend using the specified decoding hyperparameters, as they are optimized for the chat models. Using standard parameters may lead to code-switching, where different languages appear in the same sentence, resulting in nonsensical outputs.
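If you run one of the chat models instead, the prompt should also follow Qwen's chat template. Here is a minimal sketch, assuming the original (non-quantized) Qwen/Qwen1.5-7B-Chat checkpoint; swap in a quantized version if you are short on VRAM:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# The tokenizer carries the chat template the chat models were trained with
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
messages = [{"role": "user", "content": "Give me the best recipe for pasta."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model="Qwen/Qwen1.5-7B-Chat")
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)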
Section 2.2: Quantization with GPTQ and AWQ
To quantize the Qwen1.5 model with GPTQ, the following code can be employed:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_path = 'Qwen/Qwen1.5-7B'
w = 4  # Quantization to 4-bit; change to 2, 3, or 8 for different precision
quant_path = 'Qwen1.5-7B-gptq-' + str(w) + 'bit'

# Load the original model and its tokenizer in float16
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Calibrate on the C4 dataset and quantize the weights to w bits
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model and tokenizer as safetensors
quantized_model.save_pretrained("./" + quant_path, safe_serialization=True)
tokenizer.save_pretrained("./" + quant_path)
It's worth noting that I used a model_seqlen of 2048, even though the model can handle sequences of up to 32k tokens. If your hardware has enough memory, increasing model_seqlen may yield a more accurate quantization, since the calibration examples will be longer. Unfortunately, I could not increase it on Google Colab due to hardware limitations.
I have also attempted to quantize the model using bitsandbytes NF4 and AWQ; however, the performance of AWQ appears to be subpar for this model. I compared the original model with the GPTQ and AWQ models across three different tasks.
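For reference, AWQ quantization follows a similar workflow through the AutoAWQ package. The snippet below is a sketch of AutoAWQ's standard usage rather than the exact code behind my experiments; the quantization configuration shown is AutoAWQ's typical 4-bit setup, and the output path is illustrative:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'Qwen/Qwen1.5-7B'
quant_path = 'Qwen1.5-7B-awq-4bit'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the original model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate, quantize the weights to 4-bit, and save the result
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)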
The second video titled "New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2" dives deeper into the quantization techniques and their applications, providing valuable insights for users looking to optimize their models.
Section 2.3: Benchmarking Inference Speed and Memory Usage
In my final analysis, I benchmarked the memory consumption and throughput of the fp16, GPTQ 4-bit, and AWQ 4-bit models. The GPTQ model requires nearly 9.5 GB less memory than the original model, allowing it to run on a 16 GB GPU with minimal loss in accuracy.
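If you want to reproduce this kind of measurement, a simple approach with Transformers is to time the generation and read the peak GPU memory tracked by PyTorch. This is only a rough sketch, assuming auto-gptq and optimum are installed for the GPTQ checkpoint; the model name is an example, and the numbers will depend on your hardware:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "kaitchup/Qwen1.5-7B-gptq-4bit"  # swap in the fp16 or AWQ checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("The best recipe for pasta is", return_tensors="pt").to(model.device)

# Reset the peak counter after loading; it still accounts for the weights
# already resident on the GPU
torch.cuda.reset_peak_memory_stats()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=150)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"Throughput: {new_tokens / elapsed:.1f} tokens/second")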
Conclusion: The Future of Qwen1.5 Models
In summary, Qwen1.5 models rank among the top contenders in the realm of open pre-trained LLMs. They excel in tasks across various languages, particularly those beyond English, and are user-friendly due to widespread framework support.
However, these models do demand considerable memory. The largest Qwen1.5 model manageable on a 24 GB GPU is the 7B version, while the 14B variant may be feasible with some CPU RAM offloading after quantization. For users with less than 24 GB of VRAM who prefer not to quantize, the Qwen1.5 4B model serves as a suitable alternative, requiring no more than 16 GB of VRAM.
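As an illustration, offloading with Transformers amounts to capping the GPU memory budget and letting the remaining layers spill into CPU RAM. The path below is hypothetical: it stands for the output of the GPTQ code above applied to Qwen/Qwen1.5-14B, and the memory limits should be adapted to your machine:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path: a 4-bit GPTQ version of Qwen1.5-14B produced as in Section 2.2
quant_path = "./Qwen1.5-14B-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(
    quant_path,
    torch_dtype=torch.float16,
    device_map="auto",
    # Cap GPU usage; layers that do not fit are placed in CPU RAM (slower inference)
    max_memory={0: "20GiB", "cpu": "30GiB"},
)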
To support my work, consider subscribing to The Kaitchup (my AI newsletter).