You can now run your code remotely using Inferless. This feature is useful when you want to run part or all of your code on a remote server. By adding a few annotations to your code, you can run it on a remote server with the command inferless remote-run app.py -c config.yaml.

import inferless

@inferless.method(gpu="T4")
def my_agent():
    return "Hello World"

Runtime Configuration

You can configure the runtime for remote-run using a configuration file. The configuration file is a YAML file in which you specify the custom packages you want to install on the remote server.

You can specify system packages (installed using apt-get), Python packages (installed using pip), and run commands (shell commands) that you want to configure on the remote server.

# runtime.yaml
build:
    system_packages:
        - libssl-dev
    python_packages:
        - accelerate==0.27.2
        - torch==2.1.1
    run_commands:
        - wget https://example.com/model.pth

Annotations

You can use the following annotations to specify the code that you want to run on the remote server. Each annotation accepts a gpu parameter that specifies the GPU to use on the remote server. Currently, the supported GPUs are T4 and A10.

@inferless.method

Use this annotation to specify the function that you want to run on the remote server.
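
For example, a minimal sketch (the function name and body here are only illustrative):

import inferless

@inferless.method(gpu="A10")
def summarize(text):
    # Heavy imports can be placed inside the function so they only need to exist on the remote server.
    return text[:100]

# Calling the decorated function from local code runs it on the remote server via remote-run.
print(summarize("Inferless lets you offload selected functions to a remote GPU."))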

@inferless.Cls

Use this annotation to specify the class that you want to run on the remote server. Cls provides two sub-annotations: @Cls.load and @Cls.infer.

@Cls.load is used to specify the function that you want to run before inference, typically to load the model.

@Cls.infer is used to specify the function that you want to run for inference.

import inferless

app = inferless.Cls(gpu="T4")

class MyLLModel:
    @app.load
    def load(self):
        # load the model
        pass

    @app.infer
    def infer(self):
        # run inference
        return "generated output"

Examples

This section contains working examples of how to use the remote-run command.

Example 1 - Run a function on a remote server

# app.py
import inferless


@inferless.method(gpu="T4")
def test_gpt_neo(prompt):
    from transformers import pipeline
    pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")
    expected_output = pipe(prompt, max_length=50, do_sample=False)[0]['generated_text']

    return expected_output


def post_process(output):
    import re
    processed_output = re.sub(r'\b(\d{3}-\d{2}-\d{4})\b', '[REDACTED]', output)
    return processed_output

prompt = "Hello, Write a story about a dragon"
output = test_gpt_neo(prompt)
processed_output = post_process(output)
print(processed_output)

# runtime.yaml
build:
    python_packages:
        - transformers
        - torch
        - accelerate

Run the command: inferless remote-run app.py -c runtime.yaml

In this example, only the test_gpt_neo function will run on the remote server. The post_process function will run on the local machine.

As you can see, transformers is imported inside the function, which means you do not need to install the transformers package on the local machine.

The first call may take some time, as the required packages are set up on the remote server.

Example 2 - Run a class on a remote server

# app.py

import os
import inferless

app = inferless.Cls(gpu="A10")
HF_TOKEN = "<HF_TOKEN>"

class InferlessPythonModel:

    @app.load
    def initialize(self):
        from vllm import LLM, SamplingParams
        from transformers import AutoTokenizer

        model_id = "TheBloke/Llama-2-7B-chat-AWQ"
        tokenizer_model = "meta-llama/Llama-2-7b-chat-hf"
        self.sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.1,
            repetition_penalty=1.18,
            top_k=40,
            max_tokens=512,
        )
        self.llm = LLM(model=model_id, enforce_eager=True, quantization="AWQ")
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_model, token=HF_TOKEN)

    @app.infer
    def infer(self, inputs):
        prompts = inputs["prompt"]
        input_text = self.tokenizer.apply_chat_template([{"role": "user", "content": prompts}], tokenize=False)
        result = self.llm.generate(input_text, self.sampling_params)
        result_output = [output.outputs[0].text for output in result]
        return {"generated_result": result_output[0]}


model = InferlessPythonModel()
print(model.infer({"prompt": "Hello, Write a story about a dragon"}))

# runtime.yaml
gpu:
  T4
build:
  python_packages:
    - "accelerate==0.30.1"
    - "transformers==4.41.2"
    - "torch==2.3.0"
    - "vllm==0.4.3"

Run the command: inferless remote-run app.py -c runtime.yaml

In this example, when the infer function is called, the initialize function runs first to load the model and tokenizer, and then the infer function runs the inference on the remote server.

Replace <HF_TOKEN> with your Hugging Face token.

Working Directory

By default, the files from the working directory are copied to the server, excluding ".git", "*.pyc", and "__pycache__".

If a .gitignore file is present in the working directory, the files mentioned in the .gitignore file will not be copied to the server.

You can also specify a custom ignore file using the --exclude (-e) option: inferless remote-run app.py -c runtime.yaml -e custom_ignore_file.txt
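
For example, a hypothetical custom_ignore_file.txt (assuming the same .gitignore-style pattern syntax) could look like this:

# custom_ignore_file.txt
*.pth
data/
notebooks/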

The maximum file size that can be copied to the server is 10 MB.

Notes

Use Python 3.10 for a seamless experience; other versions may face compatibility issues with some libraries.
Avoid unnecessary packages in the configuration file, as they can increase the time needed to set up the environment.