In this tutorial, we’ll show the deployment process of a quantized GPTQ model using vLLM. We are deploying a GPTQ, 4-bit quantized version of the CodeLlama-Python-34B model. It has an average inference time of 3.51 sec and an average token generation rate of 58.40 tokens/sec, and this setup has an average cold start time of 21.8 sec.
Note: You can use vLLM-GPTQ to deploy the same model.
We use vLLM to load and serve the model. Declare this library in the inferless-runtime-config.yaml file, which you need to upload during deployment.
def initialize: This function loads the model and tokenizer into memory whenever the container starts.
def infer: This function serves the loaded model. You can create a complex inference pipeline or chain multiple models together here.
def finalize: This function deallocates the memory held by the model and tokenizer whenever the container shuts down.
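As a sketch, the three functions might come together in an app.py skeleton like the one below. The Hugging Face model id and sampling defaults are assumptions for illustration, and vLLM is imported inside initialize so it is only required at load time:

```python
class InferlessPythonModel:
    def initialize(self):
        # Called once when the container starts: load the quantized model.
        # Model id and sampling defaults are illustrative assumptions.
        from vllm import LLM, SamplingParams  # deferred: only needed at load time
        self.sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
        self.llm = LLM(model="TheBloke/CodeLlama-34B-Python-GPTQ", quantization="gptq")

    def infer(self, inputs):
        # Serve the loaded model; `inputs` arrives as a plain dictionary.
        prompt = inputs["prompt"]
        result = self.llm.generate([prompt], self.sampling_params)
        return {"generated_text": result[0].outputs[0].text}

    def finalize(self):
        # Called when the container shuts down: release the model's memory.
        self.llm = None
```

Keeping the heavy import inside initialize also makes the file cheap to inspect or unit-test on a machine without a GPU.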
Create the inferless-runtime-config.yaml file. This allows you to add all the system and Python packages required for the model. For this tutorial, we use the libssl-dev system package and the Python packages mentioned in section 1.
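For illustration, such a runtime config could look like the sketch below. The package list and any version pins are placeholders — use the exact packages from section 1 and verify the file layout against the current Inferless runtime schema:

```yaml
build:
  system_packages:
    - "libssl-dev"
  python_packages:
    - "vllm"
    - "transformers"
    - "torch"
```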
You can use the inferless remote-run command (installation guide here) to test your model or any custom Python script in a remote GPU environment directly from your local machine. Make sure to use Python 3.10 for a seamless experience.
Import the inferless library and initialize Cls(gpu="A100"). The available GPU options are T4, A10, and A100.
Decorate the initialize and infer functions with @app.load and @app.infer respectively.
Define your entry-point function (e.g., my_local_entry) and decorate it with @inferless.local_entry_point. Within this function, instantiate your model class, convert any incoming parameters into a RequestObjects object, and invoke the model’s infer method.
Finally, go to the directory containing your app.py and your inferless-runtime-config.yaml and run the inferless remote-run command.
Any input parameters your script expects are passed through the inputs dictionary.
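Putting the remote-run pieces together, the decorated script might be sketched as follows. The model body is a stand-in echo so the sketch stays self-contained, the try/except merely lets it run where the inferless SDK is not installed, and a real app would additionally wrap the incoming parameters in its RequestObjects class:

```python
try:
    import inferless
    app = inferless.Cls(gpu="A100")  # available GPU options: T4, A10, A100
    load_dec, infer_dec, entry_dec = app.load, app.infer, inferless.local_entry_point
except ImportError:
    # No-op stand-ins so the sketch is runnable without the SDK (illustration only).
    def load_dec(fn): return fn
    def infer_dec(fn): return fn
    def entry_dec(fn): return fn

class InferlessPythonModel:
    @load_dec
    def initialize(self):
        self.ready = True  # a real app would load the vLLM model here

    @infer_dec
    def infer(self, inputs):
        prompt = inputs["prompt"]  # inputs arrive as a plain dictionary
        return {"generated_text": f"completion for: {prompt}"}

@entry_dec
def my_local_entry(dynamic_params):
    # Instantiate the model class and invoke its infer method.
    model = InferlessPythonModel()
    model.initialize()
    return model.infer(dynamic_params)
```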
If you want to exclude certain files or directories from being uploaded, use the --exclude or -e flag.
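For example, assuming an exclusion list kept in a local file — the file name and the -c config flag here are illustrative, so check inferless remote-run --help for the exact syntax:

```
inferless remote-run app.py -c inferless-runtime-config.yaml --exclude .remote-ignore
```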
Click the Add a custom model button that you see on the top right. An import wizard will open up.
--gpu A100: Specifies the GPU type for deployment. Available options include A10, A100, and T4.
--runtime inferless-runtime-config.yaml: Defines the runtime configuration file. If not specified, the default Inferless runtime is used.
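Combined, the flags above would appear in a deploy command along these lines — the deploy subcommand name and any additional required options should be verified against the current Inferless CLI:

```
inferless deploy --gpu A100 --runtime inferless-runtime-config.yaml
```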