Deploy Serverless Containers
Our mission is to make AI model deployment simple and efficient. To that end, we provide a simple interface for running your custom model without worrying about infrastructure.
Our solution offers far more competitive pricing than large cloud providers such as AWS or GCP, and lets you spin up services quickly.
Key Advantages of Choosing Serverless:
- Simplified Workflow: No infrastructure management, so you can concentrate on code and data.
- Rapid Cold Starts: Faster initialization of your services, with client wait times reduced to roughly 3 seconds.
- Custom Runtime Support: Bring your own software libraries, e.g. pip packages, system packages, or custom-built software.
- Adaptive Autoscaling: Scales with demand, ensuring optimal resource allocation.
- Dynamic Batching: Send multiple requests from clients and have them batched together automatically with a simple click-based configuration.
- Efficient Deployment: Streamline your ML rollouts without the operational bottlenecks.
- Maintenance-Free Environment: Stay up to date without manually applying software patches.
- Ready Integration: Deploy with the NVIDIA Triton Inference Server effortlessly (see the client sketch after this list).
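To illustrate the Triton integration and dynamic batching from the client's side, here is a minimal sketch using the tritonclient Python package. The endpoint URL, model name, and tensor names (INPUT__0, OUTPUT__0) are placeholders for illustration, not values defined by this guide; use the details from your own deployment.

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder endpoint -- substitute your deployment's host and port.
client = httpclient.InferenceServerClient(url="your-endpoint.example.com:8000")

# Build one request. With dynamic batching enabled, Triton can group
# concurrent requests like this one into a larger batch on the server side.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

# Hypothetical model name -- replace with the model you deployed.
result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT__0").shape)
```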
How does it work?
- Select your model - Select the model you want to deploy. You can deploy a custom model from Hugging Face, AWS, or GCP for NLP, computer vision, or other task types.
- Choose your model configuration - Once a call completes, we autoscale your container down according to your configuration, saving you inference costs. You are charged only for the inference you use.
- Create and manage your endpoint - You can load your model onto a machine of your choice. As of now, we offer 2 kinds of machines:
  - NVIDIA A100: The NVIDIA A100 is a high-performance graphics processing unit (GPU) designed for a variety of demanding workloads, including machine learning inference. It uses the Ampere architecture to provide a substantial performance boost over the T4, which is based on the older Turing architecture.
  - NVIDIA T4: The NVIDIA T4 is designed for energy efficiency, with relatively low power consumption, making it a more cost-effective way to deploy machine learning models. If your workloads are not latency-critical and your model sizes are relatively small, the T4 can give you much better cost efficiency.
- Call your APIs in Production - Get the endpoint details and the Model Workspace API keys, then simply call the model in production and enjoy the service (see the example call below).
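As an illustration of that last step, a production call might look like the sketch below. The endpoint URL, header name, and request payload are assumptions for illustration; use the exact endpoint details and API key shown in your Model Workspace.

```python
import requests

# Hypothetical values -- replace with the endpoint details and API key from your workspace.
ENDPOINT_URL = "https://your-endpoint.example.com/v1/infer"
API_KEY = "YOUR_MODEL_WORKSPACE_API_KEY"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"inputs": ["What is serverless inference?"]},  # payload shape is an assumption
    timeout=30,
)
response.raise_for_status()
print(response.json())
```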
How does Billing for Serverless work?
Your invoice will comprise:
- Setup Time: The duration required to load the model weights. Remarkably, Serverless trims this to one-third of container times.
- Inference Time: The actual processing time for an inference.
- Eviction Timeout: A custom setting dictating the ‘warm’ status duration for models. Adjustable between 5 seconds and 60 minutes.
Real-world Billing Example: Consider a deployment on a dedicated A10G that incurs 9 seconds for a cold start and 5 seconds for an inference. For 1,000 daily requests, where 10% hit a cold start:
Billed Duration: (10% of 1,000 requests × 9 seconds) + (1,000 requests × 5 seconds) = 900 seconds + 5,000 seconds = 5,900 seconds.
Your Bill: 5,900 seconds multiplied by the machine's rate of 2.006.
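The same arithmetic, written out as a small Python sketch. The figures mirror the example above; the rate is treated as an opaque per-unit price, so check your machine's pricing for its exact unit.

```python
# Reproduces the billing arithmetic from the example above.
requests_per_day = 1000
cold_start_fraction = 0.10   # 10% of requests hit a cold start
cold_start_seconds = 9
inference_seconds = 5

setup_seconds = requests_per_day * cold_start_fraction * cold_start_seconds  # 900
inference_total = requests_per_day * inference_seconds                       # 5000
billed_seconds = setup_seconds + inference_total

rate = 2.006  # rate quoted in the example; its pricing unit depends on the machine
print(billed_seconds)         # 5900.0
print(billed_seconds * rate)  # your bill, in the rate's pricing unit
```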