Experimentation with different libraries

We evaluated four inference libraries, testing the CodeLlama-34b-Python model with max_tokens set to 512 and each library's default configuration.

  1. Hugging Face: Transformers provides an easy-to-use pipeline for quick deployment of LLMs, but in our tests it was not a good choice for low-latency LLM inference. We used bitsandbytes with 4-bit quantization; however, it did not improve inference latency.

  2. AutoGPTQ: AutoGPTQ enables you to run LLMs with low memory usage. We deployed a GPTQ 4-bit quantized model and achieved better inference latency and tokens/sec than with Hugging Face (both unquantized and bitsandbytes).

  3. Text Generation Inference (TGI): TGI allows you to deploy and serve LLMs. We deployed and tested several quantized versions, of which the 4-bit quantized AWQ model performed best.

  4. vLLM: vLLM is a library for deploying and serving LLMs. We deployed both quantized and unquantized versions, using vLLM on its own and as a backend for the Triton Inference Server. To deploy the GPTQ quantized model we used vLLM-GPTQ (the vLLM GPTQ branch), because as of 24/11/23 vLLM did not support GPTQ quantization.

Our Observations

In our experiments, we found that vLLM with a GPTQ 4-bit quantized model is a good setup. You can expect a lowest average latency of 3.51 sec and an average token generation rate of 58.40 tokens/sec. This setup has an average cold start time of 21.8 sec.

Note: You can use vLLM-GPTQ to deploy this setup.
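
How the averages above were measured is not shown in this tutorial. The following is only a minimal sketch of one way average latency and tokens/sec could be computed. The names benchmark, generate, count_tokens and clock are hypothetical, and counting whitespace-separated words in the demo is a crude stand-in for a real tokenizer.

```python
import time
from statistics import mean


def benchmark(generate, prompts, count_tokens, clock=time.perf_counter):
    """Time generate(prompt) per prompt and average the results.

    generate:     callable taking a prompt string and returning generated text
                  (hypothetical wrapper around whichever library is tested)
    count_tokens: callable returning the number of tokens in the generated text
    clock:        time source, injectable so the function can be tested
    """
    latencies = []
    rates = []
    for prompt in prompts:
        start = clock()
        text = generate(prompt)
        elapsed = clock() - start
        latencies.append(elapsed)
        # Guard against a zero interval on very coarse clocks
        rates.append(count_tokens(text) / elapsed if elapsed > 0 else 0.0)
    return {
        "avg_latency_sec": mean(latencies),
        "avg_tokens_per_sec": mean(rates),
    }


if __name__ == "__main__":
    # Demo with a fake generator; replace it with a call into the served model.
    stats = benchmark(
        generate=lambda prompt: prompt + " return n",
        prompts=["def f(n):", "def g(n):"],
        count_tokens=lambda text: len(text.split()),
    )
    print(stats)
```

Averaging the per-request rates is one choice among several; dividing total tokens by total time would weight long generations more heavily.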

GPU Recommendation

We recommend using an NVIDIA A100 (80GB) GPU to achieve similar results.

Results with Different Experimentation

Defining Dependencies

This tutorial uses vLLM to load and serve the model. Define this library in the config.yaml file, which you upload during deployment.

Constructing the GitHub/GitLab Template

When uploading your model from GitHub/GitLab, you need to follow this format:

codellama-34b-python/
├── app.py
└── config.yaml
  • The app.py (Check the GitHub URL) file loads and serves the model.

  • The config.yaml (Check the GitHub URL) file lists all the software and Python dependencies.

  • You can also include any additional dependency files.

Creating the class for inference

In the app.py (Check the GitHub URL) file, you first import the required classes and functions (for the code below, LLM and SamplingParams from vllm) and then create a model class, for example “InferlessPythonModel”. This class has three functions:

  1. def initialize: This function loads the model and tokenizer into memory whenever the container starts.
def initialize(self):
    # Load the GPTQ-quantized model once, when the container starts
    self.llm = LLM(
        model="TheBloke/CodeLlama-34B-Python-GPTQ",
        quantization="gptq")
  2. def infer: This function serves the loaded model. You can create a complex inference pipeline or chain multiple models together here.
def infer(self, inputs):
    prompts = inputs["prompt"]
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1,
        max_tokens=512
    )
    result = self.llm.generate(prompts, sampling_params)
    # Keep the text of the first completion for each prompt
    result_output = [output.outputs[0].text for output in result]

    return {"result": result_output[0]}
  3. def finalize: This function releases the memory allocated for the model and tokenizer whenever the container shuts down.
def finalize(self, *args):
    # Drop the reference so the model can be freed
    self.llm = None
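
To make the role of the three functions concrete, here is a minimal sketch of the order in which a container might call them, based only on the descriptions above. StubLLM, StubModel and run_once are hypothetical names: StubLLM stands in for vllm.LLM so the sketch runs without a GPU, and the way the platform actually drives the class is an assumption.

```python
class StubLLM:
    """Hypothetical stand-in for vllm.LLM; it does not run a real model."""

    def generate(self, prompt):
        return prompt + "  # generated"


class StubModel:
    """Same three-method shape as the model class described above."""

    def initialize(self):
        # Container start: load the (stub) model once
        self.llm = StubLLM()

    def infer(self, inputs):
        # Per request: read the "prompt" key, return a "result" key
        return {"result": self.llm.generate(inputs["prompt"])}

    def finalize(self, *args):
        # Container shutdown: release the model
        self.llm = None


def run_once(model_cls, inputs):
    """Assumed lifecycle: initialize, serve one request, then finalize."""
    model = model_cls()
    model.initialize()
    try:
        return model.infer(inputs)
    finally:
        model.finalize()


if __name__ == "__main__":
    print(run_once(StubModel, {"prompt": "def add(a, b):"}))
```

In a real deployment, initialize would normally run once and infer many times before finalize; the single request here only keeps the sketch short.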

Creating the custom runtime

Whenever you upload the model through GitHub/GitLab, you must also upload a custom runtime, i.e. a config.yaml file. This lets you add all the system and Python packages required for the model. For this tutorial, we use the libssl-dev system package and the Python packages listed in the file below.

build:
  cuda_version: "12.1.1"
  system_packages:
    - "libssl-dev"
  python_packages:
    - "accelerate"
    - "transformers"
    - "git+https://github.com/chu-tianxiang/vllm-gptq.git"

Deploying the model on Inferless

Inferless supports multiple ways of importing your model. For this tutorial, we will use GitHub.

Import the Model through GitHub

Click on Repo (Custom code), then click Add provider to connect your GitHub account. Once the account integration is complete, click on your GitHub account and continue.

Provide the Model details

Enter the name of the model and pass the GitHub repository URL.

Now, you can define the model’s input and output parameters in JSON format. For more information, go through this page.

  • INPUT: Refer to the function def infer(self, inputs). It reads inputs["prompt"], which means the inputs parameter will have a key named prompt.

  • OUTPUT: The same applies here. The function returns the result as a key-value pair: return {"result": result_output[0]}.

The input JSON must include the key prompt to pass the instruction. The output JSON must include the key returned by infer (result in the code above) to retrieve the output.
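
As a quick illustration of this contract, the sketch below checks a request and a response against the key names used by the infer function above (prompt in, result out). The helper missing_keys and the example payloads are hypothetical, and this is not the platform's own schema format, only a plain-dictionary check.

```python
import json

# Key names taken from the infer function in app.py
REQUIRED_INPUT_KEYS = {"prompt"}
REQUIRED_OUTPUT_KEYS = {"result"}


def missing_keys(payload, required):
    """Return the required keys that are absent from payload, sorted."""
    return sorted(required - set(payload))


# Hypothetical example payloads
example_input = {"prompt": "def fibonacci(n):"}
example_output = {"result": "    return n if n < 2 else ..."}

if __name__ == "__main__":
    print(json.dumps(example_input))
    print("missing input keys:", missing_keys(example_input, REQUIRED_INPUT_KEYS))
    print("missing output keys:", missing_keys(example_output, REQUIRED_OUTPUT_KEYS))
```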

GPU Selection

On the Inferless platform, we support all the GPUs. For this tutorial, we recommend using an A100 GPU. Select A100 from the GPU Type drop-down menu.

Using the Custom Runtime

In the Advance configuration, you have the option to select a custom runtime. First, click Add runtime to upload the config.yaml file, give it a name, and save it. Then choose the runtime from the drop-down menu and click Continue.

Review and Deploy

In this final stage, carefully review all modifications. Once you’ve examined all changes, proceed to deploy the model by clicking the Import button.

Voilà, your model is now deployed!