
Unlocking the Potential of Qwen1.5 LLMs: Inference and Quantization


Chapter 1: Introduction to Qwen1.5 Models

Recently, Alibaba introduced the Qwen1.5 models, which include a range of open pre-trained and chat LLMs available in various sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. Although detailed information about these models is limited, early indications suggest that they may outperform models like Mistral 7B, Mixtral-8x7B, and Llama 2.

The Qwen team has partnered with developers of well-known packages focused on quantization, fine-tuning, and serving LLMs. This collaboration means that Qwen1.5 is already well-integrated into current deep learning frameworks.

In this article, I will briefly summarize the Qwen1.5 models and assess their performance before demonstrating their use. It's important to note that utilizing Qwen1.5 on consumer hardware can present challenges. Additionally, I will explain how to quantize the models using AWQ and GPTQ techniques.

For the examples, I will focus on the Qwen1.5 7B model, although the process is similar for the other sizes. However, it should be noted that the 72B variant is not suitable for fine-tuning on consumer hardware. For the other models, a GPU with at least 24 GB of VRAM is sufficient.

The Qwen1.5 models can be found in this Hugging Face collection. The model license permits commercial use for products with fewer than 100 million monthly active users.

Section 1.1: Model Architecture and Training

The Qwen team has not released a technical report detailing the training process and architecture of these models. However, the model cards provide some insights. The models are trained on a substantial dataset and utilize a Transformer architecture with several advanced features, including SwiGLU activation, attention QKV bias, group query attention, and a combination of sliding window and full attention mechanisms.

When inspecting one of the Qwen models loaded through Hugging Face Transformers, you'll find that its neural architecture closely resembles that of Mistral 7B and Llama 2. Notably, Qwen1.5 supports a longer context than Llama 2, accommodating up to 32k tokens. Moreover, chat versions, aligned with DPO, have been released for all model sizes.
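For instance, here is a minimal sketch for inspecting the architecture without downloading the weights, assuming the transformers library and the Qwen/Qwen1.5-7B checkpoint on the Hugging Face Hub:

from transformers import AutoConfig

# Load only the configuration; this does not download the model weights
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-7B")

print(config.model_type)               # architecture family registered in Transformers
print(config.num_hidden_layers)        # number of decoder layers
print(config.hidden_size)              # hidden dimension
print(config.max_position_embeddings)  # maximum context length
print(config.vocab_size)               # vocabulary size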

The performance of the models is well-documented in the blog post announcing Qwen1.5, highlighting their results across various benchmarks.

The first video titled "LLMs Quantization Crash Course for Beginners" provides an overview of the quantization process for large language models, making it easier for beginners to understand the practical implications.

Section 1.2: Performance Benchmarks

The most notable comparisons are drawn between Qwen1.5 7B, Llama 2 7B, and Mistral 7B. In most benchmarks, excluding MMLU and BBH, Qwen1.5 7B significantly surpasses both Llama 2 7B and Mistral 7B, with the 14B version performing even better. Interestingly, the 72B version of Qwen1.5 appears to outperform Mixtral-8x7B.

However, it's crucial to approach these benchmark results with caution, as they are sensitive to evaluation settings and can be tuned to favor a particular model. Qwen1.5 models are also available in smaller sizes, making them more accessible for testing on consumer hardware.

While performance varies widely across the range, other open models outperform Qwen1.5 1.8B and 4B on numerous tasks. Additionally, Qwen1.5 models were trained on multilingual datasets and show impressive results on multilingual benchmarks. The Qwen1.5 14B model even outperformed Mixtral-8x7B, which is nearly 3.5 times larger.

However, this multilingual capability comes at a cost: the vocabulary of the Qwen1.5 models is almost five times larger than that of Llama 2 and Mistral 7B (151,936 tokens for Qwen1.5 versus 32,000 for Llama 2). As a result, the models are larger and require more memory: Qwen1.5 7B occupies 15.5 GB on disk, compared to 13.5 GB for Llama 2 7B.
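You can check the vocabulary sizes yourself with a quick sketch, assuming the transformers library; Mistral 7B is used here as the comparison point since its tokenizer is openly downloadable, and the counts reported by the tokenizer may differ slightly from the padded embedding sizes:

from transformers import AutoTokenizer

qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Compare vocabulary sizes; the embedding and output layers grow with these numbers
print("Qwen1.5 vocabulary:", len(qwen_tok))
print("Mistral 7B vocabulary:", len(mistral_tok))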

Chapter 2: Utilizing Qwen1.5 on Consumer Hardware

In this chapter, we will look at how to run inference with Qwen1.5 7B and how to quantize it. I will summarize the key observations; the complete code for all experiments is available in the accompanying notebook.

Section 2.1: Inference with Qwen1.5 Using vLLM

To achieve fast inference while optimizing memory usage, we can use vLLM (Apache 2.0 license). Below is the code to run it with the GPTQ-quantized model:

import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]

# Decoding hyperparameters recommended for Qwen1.5
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

# Load the GPTQ-quantized model with vLLM
loading_start = time.time()
llm = LLM(model="kaitchup/Qwen1.5-7B-gptq-4bit", quantization="gptq")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

# Generate completions for the prompts
generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

I highly recommend using the specified decoding hyperparameters, as they are optimized for the chat models. Using standard parameters may lead to code-switching, where different languages appear in the same sentence, resulting in nonsensical outputs.

Section 2.2: Quantization with GPTQ and AWQ

To quantize the Qwen1.5 model with GPTQ, the following code can be employed:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_path = 'Qwen/Qwen1.5-7B'
w = 4  # Quantization to 4-bit; change to 2, 3, or 8 for different precision
quant_path = 'Qwen1.5-7B-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Quantize with GPTQ, using C4 as the calibration dataset
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model and tokenizer
quantized_model.save_pretrained("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)

It's worth noting that I used a model_seqlen of 2048, even though the model can handle sequences of up to 32k tokens. If your hardware has enough memory, increasing model_seqlen may yield a more accurate quantization. Unfortunately, I could not increase it when using Google Colab due to its hardware limitations.

I have also attempted to quantize the model using bitsandbytes NF4 and AWQ; however, the performance of AWQ appears to be subpar for this model. I compared the original model with the GPTQ and AWQ models across three different tasks.
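For reference, loading the model with bitsandbytes NF4 looks roughly like this. This is a minimal sketch; unlike GPTQ and AWQ, the quantization happens on the fly at load time rather than producing a separate checkpoint, and no calibration dataset is needed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "Qwen/Qwen1.5-7B"

# NF4 quantization configuration; applied on the fly when the model is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)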

The second video titled "New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2" dives deeper into the quantization techniques and their applications, providing valuable insights for users looking to optimize their models.

Section 2.3: Benchmarking Inference Speed and Memory Usage

In my final analysis, I benchmarked the memory consumption and throughput of the fp16, GPTQ 4-bit, and AWQ 4-bit models. The GPTQ model requires nearly 9.5 GB less memory than the original model, allowing it to run on a 16 GB GPU with minimal loss in accuracy.
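The exact numbers depend on the benchmark setup, but measurements of this kind can be reproduced along these lines. Here is a rough Transformers-based sketch, not the exact benchmark used above; swap the model name for the fp16, GPTQ, or AWQ checkpoint you want to test:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaitchup/Qwen1.5-7B-gptq-4bit"  # replace with the checkpoint to benchmark

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

inputs = tokenizer("The best recipe for pasta is", return_tensors="pt").to(model.device)

# Measure peak VRAM and decoding throughput for one generation
torch.cuda.reset_peak_memory_stats()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=False)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print("Peak VRAM (GB):", round(torch.cuda.max_memory_allocated() / 1e9, 2))
print("Throughput (tokens/s):", round(new_tokens / elapsed, 2))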

Conclusion: The Future of Qwen1.5 Models

In summary, Qwen1.5 models rank among the top contenders in the realm of open pre-trained LLMs. They excel in tasks across various languages, particularly those beyond English, and are user-friendly due to widespread framework support.

However, these models do demand considerable memory. The largest Qwen1.5 model manageable on a 24 GB GPU is the 7B version, while the 14B variant may be feasible with some CPU RAM offloading after quantization (a rough sketch follows below). For users with less than 24 GB of VRAM who prefer not to quantize, the Qwen1.5 4B model is a suitable alternative, requiring no more than 16 GB of VRAM.
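If you do try the 14B model with offloading, the idea is to cap the GPU memory and let Accelerate place the remaining layers in CPU RAM. A minimal sketch; the checkpoint name and memory limits below are placeholders to adapt to your own quantized model and hardware:

import torch
from transformers import AutoModelForCausalLM

# Placeholder name for a 4-bit GPTQ version of Qwen1.5-14B
model_id = "Qwen1.5-14B-gptq-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "30GiB"},  # cap GPU usage; overflow goes to CPU RAM
)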

To support my work, consider subscribing to The Kaitchup (my AI newsletter).
