In this tutorial, we’ll walk through the process of deploying a quantized GPTQ model using vLLM. We are deploying a GPTQ 4-bit quantized version of the CodeLlama-34b-Python model.
We evaluated four different inference libraries, testing the CodeLlama-34b-Python model with max_tokens set to 512 and the default configuration of each library.
Hugging Face: The Transformers library provides an easy-to-use pipeline for quickly deploying an LLM, but it is not a good choice for LLM inference. We used bitsandbytes with 4-bit quantization; however, it did not improve inference latency.
AutoGPTQ: AutoGPTQ enables you to run LLMs with limited memory. We deployed a GPTQ 4-bit quantized model and achieved better inference latency and tokens/sec than Hugging Face (both unquantized and bitsandbytes).
Text Generation Inference (TGI): TGI allows you to deploy and serve LLMs. We deployed and tested several quantized variants, among which the 4-bit quantized AWQ model performed best.
vLLM: vLLM is a library for deploying and serving LLMs. We deployed both quantized and unquantized versions, using vLLM directly and vLLM as a backend for the Triton Inference Server. We used vLLM-GPTQ (the vLLM GPTQ branch) to deploy the GPTQ quantized model, since as of 24/11/23 mainline vLLM does not support GPTQ quantization.
In our experiments, we found that vLLM with the GPTQ 4-bit quantized model is a good setup. You can expect an average lowest latency of 3.51 sec and an average token generation rate of 58.40 tokens/sec. This setup has an average cold start time of 21.8 sec.
Note: You can use vLLM-GPTQ to deploy the same model.
We recommend using an NVIDIA A100 (80GB) GPU to achieve similar results.
This tutorial utilizes vLLM to load and serve the model. Define this library in the inferless-runtime-config.yaml file, which you need to upload during deployment.
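For reference, loading and generating with a GPTQ 4-bit CodeLlama checkpoint in vLLM looks roughly like the sketch below. This assumes the vLLM-GPTQ branch (or a later vLLM release with GPTQ support); the model ID TheBloke/CodeLlama-34B-Python-GPTQ and the sampling settings are illustrative assumptions, not values fixed by this tutorial.

```python
from vllm import LLM, SamplingParams

# Load a GPTQ 4-bit quantized CodeLlama checkpoint (model ID assumed for illustration).
llm = LLM(
    model="TheBloke/CodeLlama-34B-Python-GPTQ",
    quantization="gptq",
    dtype="float16",
)

# max_tokens=512 matches the setting used in the comparison above.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

outputs = llm.generate(["def fibonacci(n):"], sampling_params)
print(outputs[0].outputs[0].text)
```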
When uploading your model from GitHub/GitLab, you need to follow this format:
The app.py (Check the Github URL) file will load and serve the model.
The inferless-runtime-config.yaml (Check the Github URL) file will have all the software and Python dependencies.
You can also have any additional dependency files.
In the app.py (Check the Github URL) file, you will first import all the required classes and functions and then create a model class, for example, “InferlessPythonModel”. This class will have three functions:
def initialize: This function loads the model and tokenizer into memory whenever the container starts.
def infer: This function serves the loaded model. You can create a complex inference pipeline or chain multiple models together here.
def finalize: This function releases the memory allocated to the model and tokenizer whenever the container shuts down.
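A minimal sketch of that structure with vLLM as the backend is shown below; the model ID and sampling settings are illustrative assumptions, and the app.py in the repository remains the source of truth.

```python
from vllm import LLM, SamplingParams

class InferlessPythonModel:
    def initialize(self):
        # Runs once when the container starts: load the GPTQ model into GPU memory.
        self.llm = LLM(
            model="TheBloke/CodeLlama-34B-Python-GPTQ",  # assumed checkpoint
            quantization="gptq",
            dtype="float16",
        )
        self.sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

    def infer(self, inputs):
        # Serve the loaded model; `inputs` is the request dictionary.
        prompt = inputs["prompt"]
        outputs = self.llm.generate([prompt], self.sampling_params)
        return {"generated_text": outputs[0].outputs[0].text}

    def finalize(self):
        # Release the memory held by the model when the container shuts down.
        self.llm = None
```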
Whenever you upload the model through GitHub/GitLab, you must upload a custom runtime, i.e., an inferless-runtime-config.yaml file. This allows you to add all the system and Python packages required for the model. For this tutorial, we are using the libssl-dev system package, and we use the Python packages mentioned in section 1.
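The sketch below shows the general shape of such a file; the field names follow the usual Inferless runtime-config layout and the package versions are placeholders, so rely on the inferless-runtime-config.yaml in the repository for the exact contents.

```yaml
build:
  system_packages:
    - "libssl-dev"            # system package used in this tutorial
  python_packages:            # illustrative pins; see the repository file for the exact versions
    - "vllm==0.2.2"
    - "transformers==4.35.2"
    - "torch==2.1.0"
```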
You can use the inferless remote-run (installation guide here) command to test your model or any custom Python script in a remote GPU environment directly from your local machine. Make sure that you use Python 3.10 for a seamless experience.
To enable Remote Run, simply do the following:
1. Import the inferless library and initialize Cls(gpu="A100"). The available GPU options are T4, A10 and A100.
2. Decorate your initialize and infer functions with @app.load and @app.infer respectively.
3. Create your local entry-point function (for example, my_local_entry) and decorate it with @inferless.local_entry_point. Within this function, instantiate your model class, convert any incoming parameters into a RequestObjects object, and invoke the model’s infer method.
4. From your local terminal, navigate to the folder containing your app.py and your inferless-runtime-config.yaml and run the command shown after the sketch below.
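Putting steps 1-3 together, the remote-run wiring in app.py looks roughly like the sketch below. The RequestObjects fields, the dynamic_params argument name, and the prompt-in/text-out shape are assumptions for illustration; the repository’s app.py is authoritative.

```python
import inferless
from pydantic import BaseModel, Field
from vllm import LLM, SamplingParams

app = inferless.Cls(gpu="A100")

# Assumed request schema for this tutorial: a single prompt field.
class RequestObjects(BaseModel):
    prompt: str = Field(default="def fibonacci(n):")

class InferlessPythonModel:
    @app.load  # marks the loading step; the remote environment runs it before serving
    def initialize(self):
        self.llm = LLM(
            model="TheBloke/CodeLlama-34B-Python-GPTQ",  # assumed checkpoint
            quantization="gptq",
            dtype="float16",
        )
        self.sampling_params = SamplingParams(max_tokens=512)

    @app.infer
    def infer(self, inputs):
        outputs = self.llm.generate([inputs["prompt"]], self.sampling_params)
        return {"generated_text": outputs[0].outputs[0].text}

@inferless.local_entry_point
def my_local_entry(dynamic_params):
    # Convert the incoming parameters into a RequestObjects object and call infer.
    request = RequestObjects(**dynamic_params)
    model = InferlessPythonModel()
    return model.infer({"prompt": request.prompt})
```

The command then takes roughly this shape; the -c flag for the runtime config and passing the entry-point parameter as --prompt are assumptions, so check the remote-run documentation for the exact syntax:

```bash
# Run app.py on a remote A100 using the custom runtime;
# --prompt maps to the prompt parameter of the local entry point (assumed flag name).
inferless remote-run app.py -c inferless-runtime-config.yaml --prompt "def fibonacci(n):"
```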
You can pass other input parameters in the same way, as long as your code expects them in the inputs dictionary.
If you want to exclude certain files or directories from being uploaded, use the --exclude
or -e
flag.
Inferless supports multiple ways of importing your model. For this tutorial, we will use GitHub.
Navigate to your desired workspace in Inferless and click on the Add a custom model button at the top right. An import wizard will open up.
Once the model is in ‘Active’ status, you can go to the ‘API’ page to call the model.
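As a rough illustration of what such a call can look like from Python, here is a sketch using the requests library; the endpoint URL, the authorization header, and the exact payload schema are assumptions, so copy the authoritative request shown on the model’s ‘API’ page.

```python
import requests

# Placeholder values: take the real endpoint URL and API key from the model's 'API' page.
URL = "https://<your-endpoint>.inferless.com/<model>/infer"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <your-api-key>",
}

# Assumed payload shape: a single "prompt" input, matching the infer function's inputs dictionary.
payload = {
    "inputs": [
        {"name": "prompt", "shape": [1], "data": ["def fibonacci(n):"], "datatype": "BYTES"}
    ]
}

response = requests.post(URL, headers=HEADERS, json=payload)
print(response.json())
```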
Inferless allows you to deploy your model using the Inferless CLI. Follow the steps below to deploy with it.
Let’s begin by cloning the model repository:
To deploy the model using Inferless CLI, execute the following command:
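Based on the flag explanations that follow, the command takes this shape (a sketch; confirm the exact invocation against the repository and the CLI documentation):

```bash
inferless deploy --gpu A100 --runtime inferless-runtime-config.yaml
```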
Explanation of the Command:
--gpu A100: Specifies the GPU type for deployment. Available options include A10, A100, and T4.
--runtime inferless-runtime-config.yaml: Defines the runtime configuration file. If not specified, the default Inferless runtime is used.