Dynamic Batching

Dynamic batching is an Inferless feature that lets the server combine individual inference requests into batches on the fly. Processing requests in batches typically increases throughput. Use the dynamic batcher only with stateless models. Dynamic batching is enabled and configured independently for each model through its model configuration: BATCH_SIZE and BATCH_WINDOW for the Git method, or config.pbtxt for the file import method. These settings control the (preferred) size of the dynamically created batches and the batch window, that is, the maximum time a request can be held in the scheduler so that other requests can join its batch.
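To make the interaction between batch size and batch window concrete, here is a conceptual model only. It is not Inferless's actual scheduler. The hypothetical function `group_into_batches` assumes a batch is dispatched as soon as it holds `batch_size` requests, or once `batch_window_ms` has passed since its first request arrived, whichever comes first.

```python
from typing import List


def group_into_batches(
    arrivals_ms: List[float], batch_size: int, batch_window_ms: float
) -> List[List[int]]:
    """Group request indices into batches (conceptual model, hypothetical).

    A batch closes when it reaches `batch_size` requests, or when a request
    arrives after `batch_window_ms` has elapsed since the batch's first
    request. `arrivals_ms` must be sorted in ascending order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches: List[List[int]] = []
    current: List[int] = []
    window_start = 0.0
    for idx, t in enumerate(arrivals_ms):
        # The open batch's window expired before this request arrived:
        # dispatch it and start a new one.
        if current and t - window_start > batch_window_ms:
            batches.append(current)
            current = []
        if not current:
            window_start = t  # the window starts at the batch's first request
        current.append(idx)
        if len(current) == batch_size:
            # Batch is full: dispatch immediately without waiting for the window.
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush the last, partially filled batch
    return batches
```

With BATCH_SIZE = 4 and a 5000 ms window, four requests arriving within 30 ms fill one batch immediately. A fifth request waits alone until its window expires: `group_into_batches([0, 10, 20, 30, 40, 6000], 4, 5000)` yields `[[0, 1, 2, 3], [4], [5]]`. A larger window gives more requests a chance to share a batch, at the cost of extra latency for requests that arrive during quiet periods.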

Using Git Method

Define BATCH_SIZE and BATCH_WINDOW in input_schema.py, or, if you are using Pydantic, in app.py.

Input Schema Example

For a complete example, see this repository: https://github.com/inferless/template_input_batch

/
├── app.py
├── input_schema.py 

input_schema.py

INPUT_SCHEMA = {
    "prompt": {
        'datatype': 'STRING',
        'required': True,
        'shape': [1],
        'example': ["There is a fine house in the forest"]
    }
}
BATCH_SIZE = 4
BATCH_WINDOW = 5000 # milliseconds

app.py

from transformers import pipeline

class InferlessPythonModel:

    # Load the model once, when the model container starts
    def initialize(self):
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M", device=0)

    # inputs is a list of dictionaries, one per request in the dynamic batch, where the keys are input names and the values are the input data
    # e.g. in the code below the input name is prompt
    # infer must return a list of dictionaries, one per input, where the keys are output names and the values are the output data
    # e.g. in the code below the output name is generated_text
    def infer(self, inputs):
        output = []

        print("no of inputs to be processed " + str(len(inputs)), flush=True)
        for each in inputs:
            prompt = each["prompt"]
            pipeline_output = self.generator(prompt, do_sample=True, min_length=20)
            generated_txt = pipeline_output[0]["generated_text"]
            print("generated_txt", generated_txt, flush=True)
            output.append({"generated_text": generated_txt})
        return output

    # perform any cleanup activity here
    def finalize(self, args):
        self.generator = None
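The infer above still runs the pipeline once per prompt, so the model never sees the batch as a whole. The variant below is a sketch and not the template repository's code. It passes all prompts to the Hugging Face pipeline in a single call with `batch_size`. It assumes that the text-generation pipeline returns one list of generations per prompt when given a list, and that GPT-Neo's tokenizer needs its pad token set (reusing EOS here) before prompts of different lengths can be padded into one batch.

```python
class InferlessPythonModel:

    def initialize(self):
        # Imported here so infer() can be exercised with a stub generator
        # without transformers installed.
        from transformers import pipeline
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M", device=0)
        # Assumption: GPT-Neo has no pad token, which batching requires;
        # reuse the EOS token for padding.
        self.generator.tokenizer.pad_token_id = self.generator.model.config.eos_token_id
        # Decoder-only models are usually left-padded for generation.
        self.generator.tokenizer.padding_side = "left"

    def infer(self, inputs):
        # inputs holds one dict per request in the dynamic batch.
        if not inputs:
            return []
        prompts = [each["prompt"] for each in inputs]
        # One pipeline call for the whole batch; for a list of prompts the
        # text-generation pipeline returns one list of generations per prompt.
        pipeline_outputs = self.generator(
            prompts, do_sample=True, min_length=20, batch_size=len(prompts)
        )
        # One output dict per input, in the same order as inputs.
        return [{"generated_text": out[0]["generated_text"]} for out in pipeline_outputs]

    def finalize(self, args):
        self.generator = None
```

To check the input/output contract locally without a GPU, you can skip initialize() and assign any callable that mimics the pipeline's return shape to `generator`. For example, a stub that returns `[[{"generated_text": p + " ..."}] for p in prompts]` makes `infer([{"prompt": "a"}, {"prompt": "b"}])` return two dictionaries in input order.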

Pydantic Example

/
├── app.py

app.py

import inferless
from pydantic import BaseModel, Field

@inferless.config
class Config():
    is_batched_input: bool = True
    batch_size: int = 2
    batch_window: int = 50000

#...rest of the code 

Using File Import Method

If you are using the file import method, create a config.pbtxt file in the root of the model directory with the following content:

1/
├── model.onnx
├── config.pbtxt 

platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [3]
  }
]

dynamic_batching { 
  preferred_batch_size: [1,8]
}

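In the above configuration, max_batch_size is set to 8, so the dynamic batcher combines requests into batches of at most 8. preferred_batch_size: [1,8] tells the scheduler to prefer forming batches of 1 or 8 requests. Because max_batch_size is greater than 0, the dims listed for input and output exclude the batch dimension. The model actually receives input of shape [N, 3, 224, 224] and returns output of shape [N, 3], where N is the size of the dynamic batch.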
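The following sketch shows how the batch dimension adds up. Each request carries its own leading batch dimension, and the dynamic batcher concatenates requests along it. `batched_shapes` is a hypothetical helper, not part of Triton or Inferless. It copies max_batch_size and the dims from the config.pbtxt example and computes the tensor shapes the model would see for a given set of requests.

```python
from typing import Sequence, Tuple

# Values copied from the config.pbtxt example above.
MAX_BATCH_SIZE = 8
INPUT_DIMS = (3, 224, 224)  # "input" dims, batch dimension excluded
OUTPUT_DIMS = (3,)          # "output" dims, batch dimension excluded


def batched_shapes(
    request_shapes: Sequence[Tuple[int, ...]],
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Return the (input, output) shapes the model sees when the given
    requests are concatenated along dimension 0 into one dynamic batch.

    Each request shape includes its own leading batch dimension,
    e.g. (1, 3, 224, 224) for a single image.
    """
    total = 0
    for shape in request_shapes:
        if tuple(shape[1:]) != INPUT_DIMS:
            raise ValueError(f"expected a shape ending in {INPUT_DIMS}, got {tuple(shape)}")
        total += shape[0]
    # The combined batch may not exceed max_batch_size.
    if not 1 <= total <= MAX_BATCH_SIZE:
        raise ValueError(f"combined batch size {total} is outside 1..{MAX_BATCH_SIZE}")
    return (total, *INPUT_DIMS), (total, *OUTPUT_DIMS)
```

For example, eight single-image requests fill one batch of the maximum size: `batched_shapes([(1, 3, 224, 224)] * 8)` gives `((8, 3, 224, 224), (8, 3))`. A ninth single-image request would have to go into the next batch.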
