Tutorials

Changelog

Blog

Inferless

Documentation

How to Guides

Cookbook

Join Private Beta

This guide will help you to understand how you can configure the Inference Service using CLI and Platform

Configuring the Inference Service

Welcome to Inferless, the go-to platform for effortlessly deploying machine learning models in the cloud. Our developer-friendly solution takes the complexities out of managing hardware and provides autoscaling capabilities for a seamless experience. Import your models from popular providers like Huggingface, AWS Sagemaker, and Google Vertex AI, and let Inferless handle the rest.

Introduction to Inferless

There are several ways to import your model, but for the purpose of this example, we will be using Hugging Face. By the end of this tutorial, you will have the ability to deploy a Hugging Face model in Inferless.

Deploy a ML Model with Inferless

The mission is to make deployment for AI models simple and efficient. To accelerate this we provide a simple interface to run your custom model without worrying about infrastructure.

Deploy Serverless Containers

Deploy Serverless

CLI Import

File structure requirements

Importing from Github

Deploying models using CLI required files in a particular format. This guide will explain about all the required files

Import using CLI

Importing from Cloud Buckets

Import using Docker

Importing your file from your System

Importing from Sagemaker

Importing from Google Vertex AI

Input / Output Schema

Custom software and dependencies in your Runtime

Bring custom packages

Automatic Build via webhooks

Github

Docker

Hugging Face

How to enable web-hooks in AWS for activating auto-rebuilt/CI-CD function in Inferless

AWS Sagemaker

Google Vertex AI

My Volumes

My Secrets

Hugging face

Git (Custom Code)

GitHub - Demo

GitLab - Demo

Bring your own docker container images. ( This might have higher coldstarts )

Cloud Buckets - S3/ GCS

Demo S3

Demo GCS

You can upload your model file directly from your system or from a public downloadable link.

File Import from System

Import from system

Demo

Dynamic Batching

Model Endpoint

Use a sample input to test your model before deployment

Test your model endpoint

In Inferless, configuring the Scale Down, Inference Timeouts, and Container Concurrency settings is essential for optimizing performance and cost. Here’s an overview of what each setting does and how you can adjust them:

Configuring the Model Settings

Setting up Webhooks

Inferless Python Client

Debugging your model with Logs

Call Logs

Build Logs

Version Management

In this hands-on tutorial, you'll learn to build a serverless [Logo Generator application](https://github.com/inferless/Logo-Generator/tree/main) capable of creating unique logos based on text descriptions. Leveraging the power of diffusion models using the diffuser library, this application will allow you to input text prompts and receive corresponding logos in just a few steps.

Create a Serverless Logo Generator Application

Welcome to a hands-on tutorial designed to walk you through the creation of a PDF Q&A application, leveraging cutting-edge serverless technologies. In just 10 minutes, you'll have a working app capable of delivering precise answers from PDF documents, enriched with contextual understanding.

Build a Serverless PDF Q&A Application in 10 Minutes

Welcome to an immersive tutorial crafted to guide you through the development of a voice conversational chatbot application, leveraging state-of-the-art serverless technologies. Throughout this tutorial, you'll gain insights into seamlessly integrating multiple models within Inferless to construct a robust application.

Build a Serverless Voice Conversational Chatbot

20th October : Better error handling, Git fixes

3rd November: Better Logs & Efficient Autoscaling

10th November: Gitlab Integration, Secrets Manager & Better Billing

17th November: Enhanced Error Handling and User Interface Improvements

27th November: User Interface Enhancements and Reliability Improvements

4th December: UI Enhancements, Stable Builds, and Better Error Handling

18th December : Advanced Monitoring, Better Custom Runtime, and Enhanced Integration Stability

22nd December: Enhanced Metrics, Improved Logging, and Advanced Model Support

5th January - Faster Cold-starts, Security Upgrades, and Integration Efficiency

12th January 2024 - Enhanced Volume Management, Docker Integration, and Improved Billing Processes

January 29, 2024 - Removal of I/O JSON, Webhook Support for Docker and Improved Runtime Management

12th February 2024 - Enhanced Monitoring, Docker Flexibility, and One-click Model Deploy

26th February 2024 - Better Exception Handling, Dynamic Batching Support and more.

11th March 2024: Better Monitoring Tools and Enhanced User Control

28th March 2024: Reducing model import time, better error handling

8th April 2024: Workflow Optimization, Infrastructure Enhancements, and Runtime Updates

15th April 2024: Runtime Flexibility, Build Efficiency, and Autoscaling Improvements

In this tutorial, we'll show the deployment process of a quantized GPTQ model using vLLM. We are deploying a GPTQ, 4-bit quantized version of the codeLlama-Python-34B model.

Deploy a CodeLlama-Python-34B Model using Inferless

Stability AI released Stable Video Diffusion, a latent diffusion model for high-resolution video generation from text and images.

Deploy Stable Video Diffusion using Inferless

Mixtral 8x7B, a sparse mixture of experts (SMoE) model with open weights, outperforms Llama 2 70B on benchmarks. It excels as the strongest open-weight model, displaying superior cost/performance.

Deploy Mixtral-8x7B using Inferless

Starling 7B is an LLM trained by Reinforcement Learning from AI Feedback (RLAIF). Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench

Deploy Starling 7B using Inferless

Deploy Meditron using Inferless

OpenHermes 2.5 Mistral 7B is a state-of-the-art Mistral Fine-tune, a continuation of the OpenHermes 2 model, which is trained on additional code datasets.

Deploy OpenHermes using Inferless

Stability AI unveiled SDXL Turbo, a technology that facilitates high-quality image generation in just one step, utilizing an advanced distillation technique known as Adversarial Diffusion Distillation

Deploy Stable Diffusion XL Turbo using Inferless

OpenAI releases Whisper-large-v3, a pre-trained model for automatic speech recognition (ASR) and speech translation

Deploy Whisper Large V3 using Inferless

DeciLM-7B, a text generation model with 7.04 billion parameters, that leads the 7B base language models during its release

Deploy Deci 7B using Inferless

SOLAR-10.7B, an advanced large language model (LLM) with 10.7 billion parameters, demonstrates superior performance in various natural language processing (NLP) tasks

Deploy Quantized version of SOLAR 10.7B-Instruct using Inferless

Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference

Deploy Mixtral-8x7B for 52 Tokens/Sec on a Single GPU

TenyxChat-7B-v1, is trained using the Direct Preference Optimization (DPO) framework on the open-source AI feedback dataset UltraFeedback

Deploy TenyxChat 7B using Inferless

TenyxChat-8x7B-v1, is trained using the Direct Preference Optimization (DPO) framework on the open-source AI feedback dataset UltraFeedback

Deploy TenyxChat-8x7B-v1 using Inferless

Phi-2 is a Transformer with 2.7 billion parameters which showcased a nearly state-of-the-art performance among models with less than 13 billion parameters

How to Finetune, Quantize and Inference Phi-2

This tutorial demonstrates deploying a quantized CodeLlama 70B model using vLLM. We will be deploying a 4-bit quantized GPTQ version of the codeLlama-Python-70B model.

Deploy CodeLlama 70B using Inferless

This tutorial demonstrates deploying a quantized Smaug-72B model using vLLM. We will be deploying a 4-bit quantized GPTQ version of this model.

Deploy OpenLLM-leaderboard topper Smaug-72B using Inferless

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Deploy Gemma-7B using vLLM on Inferless

Stable Cascade distinguishes itself by operating within a significantly smaller latent space, offering faster inference and cost-effective training.

Deploy Stable Cascade using Inferless

Meta releases [MusicGen](https://audiocraft.metademolab.com/musicgen.html), a text-to-music model that converts text descriptions or audio prompts into high-quality music samples.

Deploy Musicgen Stereo Melody Large Model using Inferless

Llama 3 is an auto-regressive language model, leveraging a refined transformer architecture. The Llama 3 models were trained on 8x more data on over 15 trillion tokens. It has a context length of 8K tokens and increases the vocabulary size of the tokenizer to 128,256 (from 32K tokens in the previous version).

How to Finetune and Inference Llama-3

Llama 3 is an auto-regressive language model, leveraging a refined transformer architecture.It incorporate supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to ensure alignment with human preferences.

Deploy Meta-Llama-3-8B using Inferless

Llama-3-TenyxChat-70B is a model fine-tuned through Direct Preference Optimization (DPO). It leverages Tenyx's advance fine-tuning technology and the open-source AI feedback dataset, UltraFeedback, for its training.

Getting Started

Model Import

Integrations

API Reference

Configuring the Inference Service

Method A: Configure using Inferless Platform

Method B: Configure using Inferless CLI