Dynamic Batching

A comprehensive cheatsheet that provides an overview of the top open-source TTS models, inference libraries, training resources, and more to help you get started with or enhance your TTS projects.

Text-To-Speech (TTS) Cheatsheet

A comprehensive cheatsheet covering open-source code generation models, inference libraries, datasets, deployment strategies, and ethical considerations for developers and organizations.

Code LLMs CheatSheet

A comprehensive cheatsheet covering open-source text-to-image generation models, inference libraries, datasets, use cases, deployment strategies, training resources, evaluation methods, and ethical considerations for developers and organizations.

Text-to-Image Generation CheatSheet

An all-in-one cheatsheet for vision-language models, including open-source models, inference toolkits, datasets, use cases, deployment strategies, optimization techniques, and ethical considerations for developers and organizations.

Vision-Language Models CheatSheet

A comprehensive cheatsheet covering the types of AI agents, their use cases, popular frameworks, LLMs, deployment options, essential tools, optimization techniques, common challenges, and ethical considerations.

AI Agents CheatSheet

A comprehensive guide to open-source 3D generative models, datasets, toolkits, and resources for development, deployment, and evaluation.

3D Generative Models CheatSheet

There are several ways to import your model, but for the purpose of this example, we will be using Hugging Face. By the end of this tutorial, you will have the ability to deploy a Hugging Face model in Inferless.

Deploy a ML Model with Inferless

The mission is to make deployment of AI models simple and efficient. To accelerate this, we provide a simple interface to run your custom model without worrying about infrastructure.

Deploy Serverless Containers

Overview

Handling Input / Output with Inferless

Custom software and dependencies in your Runtime

Bring custom packages

Working with Files on Inferless 

Working with NFS - My Volumes

This will help you understand how to process multiple requests concurrently by the same replica.

Configuring Concurrent Requests

Automatic Build on Inferless

Managing Secrets on Inferless

Remote Run: Run your code remotely

Streaming with SSE Events

Hugging Face

Git (Custom Code)

GitHub - Demo

GitLab - Demo

Bring your own Docker container images. (This might result in longer cold starts.)

Docker

Cloud Buckets - S3/ GCS

Demo S3

Demo GCS

You can upload your model file directly from your system or from a public downloadable link.

File Import from System

Import from system

You can integrate with Inferless for alerts by using SNS to send notifications about critical events related to model health.

Model Alerts using AWS SNS

AWS PrivateLink - Inferless

Model Endpoint

Use a sample input to test your model before deployment

Test your model endpoint

In Inferless, configuring the Scale Down, Inference Timeouts, and Container Concurrency settings is essential for optimizing performance and cost. Here’s an overview of what each setting does and how you can adjust them:

Configuring the Model Settings

Setting up Webhooks

Inferless Python Client

Debugging your model with Logs

Call Logs

Build Logs

Version Management

This endpoint updates the settings of a model. You can configure Min/Max Replicas, Timeout, and Concurrency settings.

Model Settings - Update APIs

Get Logs - API 

File structure requirements

Importing from GitHub

Deploying models using the CLI requires files in a particular format. This guide explains all the required files.

Import using CLI

Importing from Cloud Buckets

Import using Docker

Importing your file from your System

Input / Output Schema

Automatic Build via webhooks

GitHub

Hugging Face

How to enable webhooks in AWS to activate the auto-rebuild (CI/CD) function in Inferless

AWS Sagemaker

Google Vertex AI

This guide explains how to configure the Inference Service using the CLI and the platform.

Configuring the Inference Service

My Volumes

My Secrets

In this hands-on tutorial, you'll learn to build a serverless [Logo Generator application](https://github.com/inferless/Logo-Generator/tree/main) capable of creating unique logos based on text descriptions. Built on diffusion models through the Diffusers library, this application lets you input text prompts and receive corresponding logos in just a few steps.

Create a Serverless Logo Generator Application

Welcome to a hands-on tutorial designed to walk you through the creation of a PDF Q&A application, leveraging cutting-edge serverless technologies. In just 10 minutes, you'll have a working app capable of delivering precise answers from PDF documents, enriched with contextual understanding.

Build a Serverless PDF Q&A Application in 10 Minutes

Welcome to an immersive tutorial crafted to guide you through the development of a voice conversational chatbot application, leveraging state-of-the-art serverless technologies. Throughout this tutorial, you'll gain insights into seamlessly integrating multiple models within Inferless to construct a robust application.

Build a Serverless Voice Conversational Chatbot

Welcome to an engaging tutorial designed to walk you through creating a customer support voicebot where users can voice their queries and receive solutions. You'll learn to integrate speech recognition, large language, and text-to-speech models to develop a responsive and efficient voice-based customer support application.

Build a Serverless Customer Service Voicebot

Welcome to an immersive tutorial that guides you through leveraging the power of ComfyUI's API capabilities and deploying your workflows on Inferless. This resource is designed to help you create and deploy custom workflows, extending ComfyUI's API functionality. You'll learn how to interact with ComfyUI and deploy on Inferless.

Deploy and Run ComfyUI as an API on Inferless

Welcome to this tutorial, where we create a book summarizer using an LLM and TTS. You'll learn how to use a large language model (LLM) with a text-to-speech model to process PDF books, extract key ideas, quotes, and actionable items, and convert them into engaging audio summaries. This application aims to help users learn faster, enhance reading comprehension, and retain more knowledge by distilling books down to their most essential concepts in an easily digestible audio format.

Build a Serverless Book Audio Summary Generator

In this tutorial, we'll build a serverless Product Hunt thread summarizer using Large Language Models (LLMs). You'll learn how to scrape and process Product Hunt threads and use an LLM to condense them into concise summaries that highlight key insights. By creating this application, you'll help users save time and quickly grasp community sentiment on a topic.

Build a Serverless Product Hunt Thread Summarizer

In this tutorial, you’ll build a serverless conversational agent that leverages Google Maps data via the Model Context Protocol (MCP), Inferless, Ollama, and LangChain.

Build a Google Maps Agent using MCP & Inferless

In this tutorial you’ll build a serverless Open-NotebookLM that turns any research paper or article into a lively, two-host audio podcast using Inferless.

Build an Open-NotebookLM with Inferless

20th October: Better error handling, Git fixes

3rd November: Better Logs & Efficient Autoscaling

10th November: GitLab Integration, Secrets Manager & Better Billing

17th November: Enhanced Error Handling and User Interface Improvements

27th November: User Interface Enhancements and Reliability Improvements

4th December: UI Enhancements, Stable Builds, and Better Error Handling

18th December: Advanced Monitoring, Better Custom Runtime, and Enhanced Integration Stability

22nd December: Enhanced Metrics, Improved Logging, and Advanced Model Support

5th January - Faster Cold-starts, Security Upgrades, and Integration Efficiency

12th January 2024 - Enhanced Volume Management, Docker Integration, and Improved Billing Processes

January 29, 2024 - Removal of I/O JSON, Webhook Support for Docker and Improved Runtime Management

12th February 2024 - Enhanced Monitoring, Docker Flexibility, and One-click Model Deploy

26th February 2024 - Better Exception Handling, Dynamic Batching Support and more.

11th March 2024: Better Monitoring Tools and Enhanced User Control

28th March 2024: Reducing model import time, better error handling

8th April 2024: Workflow Optimization, Infrastructure Enhancements, and Runtime Updates

15th April 2024: Runtime Flexibility, Build Efficiency, and Autoscaling Improvements

6th May 2024: Enhanced Serverless Speeds, Model Build Efficiency, and Runtime Improvements

May 27th Update - Enhanced Runtime Management, AutoFix Suggestions, and Improved Infrastructure Stability

June 10th Update - Streaming APIs and Flexible Logging Options

June 21st Update - Enhanced CLI Commands and Model Management APIs

July 16th Update - Inferless AI Chatbot, CLI Improvements, and 30% faster build times

30th September 2024: Remote Run, Infrastructure Stability, and Observability Improvements

7th October 2024: Enhanced Model Imports, Build Tracking, and Real-Time Logs

14th November 2024: Better Hugging Face Model Imports, Infrastructure Stability and Volume improvements

20th November 2024: Enhance Performance Tracking, New Runtime UI and more

9th December 2024: CLI v2.0: Faster and Smoother Experience

9th January 2025: Better Logs & Stability Fixes

28th February 2025: Enhanced one-click model deploy & faster CLI experience

31st March 2025: New Dashboard UI, CLI Enhancements and Simplified Explore Models 

30th April 2025: Better Playground, Docker support and more 

31st May 2025: Runtime Flexibility, Faster Remote Run, and Hugging Face Improvements 

30th June 2025: Runtime Updates, Websockets and more 

In this tutorial, we'll show the deployment process of a quantized GPTQ model using vLLM. We are deploying a 4-bit GPTQ-quantized version of the CodeLlama-Python-34B model.

Deploy a CodeLlama-Python-34B Model using Inferless

Stability AI released Stable Video Diffusion, a latent diffusion model for high-resolution video generation from text and images.

Deploy Stable Video Diffusion using Inferless

Mixtral 8x7B, a sparse mixture of experts (SMoE) model with open weights, outperforms Llama 2 70B on benchmarks. It excels as the strongest open-weight model, displaying superior cost/performance.

Deploy Mixtral-8x7B using Inferless

Starling 7B is an LLM trained by Reinforcement Learning from AI Feedback (RLAIF). Starling-7B-alpha scores 8.09 on MT-Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except OpenAI's GPT-4 and GPT-4 Turbo.

Deploy Starling 7B using Inferless

Deploy Meditron using Inferless

OpenHermes 2.5 Mistral 7B is a state-of-the-art Mistral fine-tune, a continuation of the OpenHermes 2 model, trained on additional code datasets.

Deploy OpenHermes using Inferless

Stability AI unveiled SDXL Turbo, a model that generates high-quality images in a single step using a distillation technique known as Adversarial Diffusion Distillation.

Deploy Stable Diffusion XL Turbo using Inferless

OpenAI releases Whisper-large-v3, a pre-trained model for automatic speech recognition (ASR) and speech translation.

Deploy Whisper Large V3 using Inferless

DeciLM-7B is a text generation model with 7.04 billion parameters that led the 7B base language models at the time of its release.

Deploy Deci 7B using Inferless

SOLAR-10.7B is an advanced large language model (LLM) with 10.7 billion parameters that demonstrates superior performance in various natural language processing (NLP) tasks.

Deploy Quantized version of SOLAR 10.7B-Instruct using Inferless

Mixtral 8x7B is a high-quality sparse mixture of experts (SMoE) model with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference.

Deploy Mixtral-8x7B for 52 Tokens/Sec on a Single GPU

TenyxChat-7B-v1 is trained using the Direct Preference Optimization (DPO) framework on the open-source AI feedback dataset UltraFeedback.

Deploy TenyxChat 7B using Inferless

TenyxChat-8x7B-v1 is trained using the Direct Preference Optimization (DPO) framework on the open-source AI feedback dataset UltraFeedback.

Deploy TenyxChat-8x7B-v1 using Inferless

Phi-2 is a Transformer with 2.7 billion parameters that showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.

How to Finetune, Quantize and Inference Phi-2

This tutorial demonstrates deploying a quantized CodeLlama 70B model using vLLM. We will be deploying a 4-bit quantized GPTQ version of the CodeLlama-Python-70B model.

Deploy CodeLlama 70B using Inferless

This tutorial demonstrates deploying a quantized Smaug-72B model using vLLM. We will be deploying a 4-bit quantized GPTQ version of this model.

Deploy OpenLLM-leaderboard topper Smaug-72B using Inferless

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Deploy Gemma-7B using vLLM on Inferless

Stable Cascade distinguishes itself by operating within a significantly smaller latent space, offering faster inference and cost-effective training.

Deploy Stable Cascade using Inferless

Meta releases [MusicGen](https://audiocraft.metademolab.com/musicgen.html), a text-to-music model that converts text descriptions or audio prompts into high-quality music samples.

Deploy Musicgen Stereo Melody Large Model using Inferless

Llama 3 is an auto-regressive language model that uses a refined transformer architecture. The Llama 3 models were trained on over 15 trillion tokens, about 8x more data than the previous version. Llama 3 has a context length of 8K tokens and increases the tokenizer's vocabulary size to 128,256 (from 32K tokens in the previous version).

How to Finetune and Inference Llama-3

Llama 3 is an auto-regressive language model that uses a refined transformer architecture. It incorporates supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to ensure alignment with human preferences.

Deploy Meta-Llama-3-8B using Inferless

Llama-3-TenyxChat-70B is a model fine-tuned through Direct Preference Optimization (DPO). It leverages Tenyx's advanced fine-tuning technology and the open-source AI feedback dataset, UltraFeedback, for its training.

Deploy Llama-3-TenyxChat-70B using Inferless

TimesFM is a cutting-edge time series forecasting model developed by Google. It is designed to understand and generate detailed forecasts based on temporal data, making it a powerful tool for tasks such as demand forecasting, anomaly detection, and trend analysis.

Deploy Google TimesFM using Inferless

PaliGemma is a cutting-edge open vision-language model (VLM) developed by Google. It is designed to understand and generate detailed insights from both images and text, making it a powerful tool for tasks such as image captioning, visual question answering, object detection, and object segmentation.

Deploy Google PaliGemma-3B using Inferless

Stability AI has released Stable Diffusion 3, an advanced text-to-image generation model with significant improvements over its predecessors. This new version features a range of models from 800M to 8B parameters, providing users with scalable options to suit their needs.

Deploy Stable Diffusion 3 using Inferless

Phi-3-mini-128k-instruct is a 3.8 billion-parameter lightweight state-of-the-art model fine-tuned for instruction-following tasks, leveraging advanced techniques and comprehensive datasets to deliver high performance in natural language understanding and generation.

Deploy Phi-3-mini-128k-instruct using Inferless

Qwen2-72B-Instruct is part of the Qwen2 series of large language models, which range from 0.5 to 72 billion parameters. This repository is for deploying the 72B instruction-tuned model on the Inferless platform.

Deploy Qwen2-72B-Instruct using Inferless

This tutorial demonstrates how to implement real-time text-to-speech (TTS) streaming using the parler_tts_mini model and Parler-TTS library.

How to Stream Speech with Parler-TTS using Inferless

Llama-3.1-8B-Instruct is a new state-of-the-art model from Meta's Llama-3.1 series of large language models. This repository is for deploying the Llama-3.1-8B-Instruct model on the Inferless platform.

Deploy Llama-3.1-8B-Instruct using Inferless

Llama-3.1-8B-Instruct GGUF is a quantized version of Meta's state-of-the-art Llama-3.1 series of large language models. This guide will take you through the deployment process of the GGUF model on the Inferless platform.

Deploy Llama-3.1-8B-Instruct GGUF using Inferless

Black Forest Labs has released FLUX.1-schnell, part of the FLUX.1 suite of text-to-image models that set a new state-of-the-art in image detail, prompt adherence, style diversity, and scene complexity. FLUX.1-schnell is the fastest model in the suite, tailored for local development and personal use.

Deploy FLUX.1-schnell using Inferless

The Llama 3.2 11B Vision Instruct model is part of Meta's latest series of large language models that introduce significant advancements in multimodal AI capabilities, allowing for both text and image inputs.
