Key Components of the Application

To build this application, we will use three distinct types of models:

  1. Automatic Speech Recognition Model: This model converts spoken words into text. We use the Whisper large v3 model for this task.
  2. Text Generation Model: This model formulates responses to user queries and drives the conversational flow. We use the Mistral 7B Instruct v0.2 model for this task.
  3. Text-to-Audio Model: To provide a seamless conversational experience, the output of the text generation model is converted into speech using the Bark model.
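
The way these three models hand data to one another can be sketched as a simple chain of callables. This is only an illustrative outline: `run_voice_pipeline` and the three stub stages are hypothetical stand-ins for Whisper, Mistral, and Bark, not functions from any of the libraries used in this tutorial.

```python
from typing import Callable


def run_voice_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Chain the three stages: speech -> text -> reply text -> speech."""
    user_text = transcribe(audio)            # stage 1: automatic speech recognition
    reply_text = generate_reply(user_text)   # stage 2: text generation
    return synthesize(reply_text)            # stage 3: text-to-audio


# Hypothetical stubs that only mimic the shape of each stage.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_voice_pipeline(
    b"hello", fake_transcribe, fake_generate_reply, fake_synthesize
)
```

Keeping each stage behind a plain function boundary like this makes it easy to swap one model for another without touching the rest of the flow.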

Crafting Your Application

This tutorial walks you through building a voice conversational chatbot application using Bark, Faster-Whisper, Transformers, and Inferless.

Core Development Steps

Speech-to-Speech Generation

  • Objective: Accept the user’s voice as input and generate a response as audio.
  • Action: Implement a Python class (InferlessPythonModel) that handles the entire speech-to-speech generation process, including input handling, model integration, and audio generation.
from faster_whisper import WhisperModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from bark import SAMPLE_RATE, generate_audio, preload_models
import numpy as np
import io
import base64
import soundfile as sf
import nltk

class InferlessPythonModel:
    def initialize(self):
        # Load speech to text model
        self.audio_file = "output.mp3"
        model_size = "large-v3"
        self.model_whisper = WhisperModel(model_size, device="cuda", compute_type="float16")
        
        # Load Mistral instruct, text to text model
        model_id = "mistralai/Mistral-7B-Instruct-v0.2"
        self.model_mistral = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        
        # Load Bark, Text to Speech
        self.SPEAKER = "v2/en_speaker_6"
        preload_models()

        # Download nltk punkt
        nltk.download('punkt')
       
    def base64_to_mp3(self, base64_data, output_file_path):
        # Convert base64 audio data to mp3 file
        mp3_data = base64.b64decode(base64_data)
        with open(output_file_path, "wb") as mp3_file:
            mp3_file.write(mp3_data)

    def infer(self, inputs):
        audio_data = inputs["audio_base64"]
        self.base64_to_mp3(audio_data, self.audio_file)
        
        # Transcribe audio to text
        segments, info = self.model_whisper.transcribe(self.audio_file, beam_size=5)
        user_text = ''.join([segment.text for segment in segments])

        # Generate prompt for Mistral model
        messages = [{"role": "user", "content": f"You are a helpful, respectful and honest assistant. Answer the following question in a few words from the context. {user_text}"}]
        encodeds = self.tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
        model_inputs = encodeds.to("cuda")
        
        # Generate text response using Mistral model
        generated_ids = self.model_mistral.generate(model_inputs, max_new_tokens=80, do_sample=True)
        generated_text = self.tokenizer.batch_decode(generated_ids[:, encodeds.shape[1]:], skip_special_tokens=True)[0]

        # Process generated text into audio
        script = generated_text.replace("\n", " ").strip()
        sentences = nltk.sent_tokenize(script)
        
        silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence
        
        pieces = []
        for sentence in sentences:
            audio_array = generate_audio(sentence, history_prompt=self.SPEAKER)
            pieces += [audio_array, silence.copy()]
             
        # Convert audio pieces into base64
        buffer = io.BytesIO()
        sf.write(buffer, np.concatenate(pieces), SAMPLE_RATE, format='WAV')
        buffer.seek(0)
        base64_audio = base64.b64encode(buffer.read()).decode('utf-8')
         
        return {"generated_audio_base64": base64_audio}

    def finalize(self):
        # Finalize resources if needed
        pass
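
The class above exchanges audio as base64 strings: `infer` reads `inputs["audio_base64"]` and returns `generated_audio_base64` containing WAV data. A minimal client-side sketch of that round trip might look like the following. The helper names `encode_audio_file` and `save_generated_audio` and the file names are hypothetical, chosen only for illustration.

```python
import base64


def encode_audio_file(path: str) -> dict:
    """Read an audio file and wrap it in the input dict that infer() expects."""
    with open(path, "rb") as audio_file:
        encoded = base64.b64encode(audio_file.read()).decode("utf-8")
    return {"audio_base64": encoded}


def save_generated_audio(response: dict, path: str) -> None:
    """Decode the base64 WAV data returned by infer() and write it to disk."""
    wav_bytes = base64.b64decode(response["generated_audio_base64"])
    with open(path, "wb") as wav_file:
        wav_file.write(wav_bytes)


# Intended usage on a machine with a CUDA GPU and the models available
# (question.mp3 and reply.wav are hypothetical file names):
#
#   model = InferlessPythonModel()
#   model.initialize()
#   response = model.infer(encode_audio_file("question.mp3"))
#   save_generated_audio(response, "reply.wav")
```

Because the response is encoded with soundfile's WAV format, the saved file should carry a `.wav` extension even though the input is handled as MP3.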

Setting up the Environment

Dependencies:

  • Objective: Ensure all necessary libraries are installed.
  • Action: Run the command below to install dependencies:
pip install torchaudio==2.2.1 soundfile==0.12.1 git+https://github.com/suno-ai/bark.git@refs/pull/391/head faster-whisper==1.0.0 torch==2.2.1 nltk==3.8.1

This command ensures your environment has all the tools required for the application.
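
After installation, a quick check that every module imported by the application can be found may save a failed deployment later. The sketch below uses only the standard library. The `find_missing_modules` helper is a hypothetical name, and the module list is taken from the import statements in the class shown earlier.

```python
import importlib.util

# Top-level import names used by the application code.
REQUIRED_MODULES = ["faster_whisper", "transformers", "bark", "numpy", "soundfile", "nltk"]


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only confirms that the packages are discoverable. It does not load the models or verify that a GPU is available.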

Deploying Your Model with Inferless CLI

Inferless lets you deploy your model from the command line with the Inferless CLI. Follow the steps below.

Clone the repository of the model

Let’s begin by cloning the model repository:

git clone https://github.com/inferless/Voice-Conversational-Chatbot.git

Deploy the Model

To deploy the model using Inferless CLI, execute the following command:

inferless deploy --gpu A100 --runtime inferless-runtime-config.yaml

Explanation of the Command:

  • --gpu A100: Specifies the GPU type for deployment. Available options include A10, A100, and T4.
  • --runtime inferless-runtime-config.yaml: Defines the runtime configuration file. If not specified, the default Inferless runtime is used.

Demo of the Voice Conversational Chatbot.

Alternative Deployment Method

Inferless also supports a user-friendly UI for model deployment, catering to users at all skill levels. Refer to Inferless’s documentation for guidance on UI-based deployment.

Choosing Inferless for Deployment

Deploying your Voice Conversational Chatbot application with Inferless offers compelling advantages, making your development journey smoother and more cost-effective. Here’s why Inferless is the go-to choice:

  1. Ease of Use: Forget the complexities of infrastructure management. With Inferless, you simply bring your model, and within minutes, you have a working endpoint. Deployment is hassle-free, without the need for in-depth knowledge of scaling or infrastructure maintenance.
  2. Cold-start Times: Inferless’s unique load balancing ensures faster cold starts. Expect around 28.60 seconds of processing time per query, with a cold-start overhead of about 20.72 seconds, significantly faster than many traditional platforms.
  3. Cost Efficiency: Inferless optimizes resource utilization, translating to lower operational costs. Here’s a simplified cost comparison:

Scenario 1

You are looking to deploy a Voice Conversational Chatbot application for processing 100 queries per day.

Parameters:

  • Total number of queries: 100 daily.
  • Inference Time: All models are hypothetically deployed on A100 80GB, taking 28.60 seconds of processing time and a cold start overhead of 20.72 seconds.
  • Scale Down Timeout: Uniformly 60 seconds across all platforms, except Hugging Face, which requires a minimum of 15 minutes. Scale-down is assumed to occur 100 times a day.

Key Computations:

  1. Inference Duration:
    Processing 100 queries at 28.60 seconds each
    Total: 100 x 28.60 = 2860 seconds (or approximately 0.79 hours)
  2. Idle Timeout Duration:
    Post-processing idle time before scaling down: (60 seconds - 28.60 seconds) x 100 = 3140 seconds (or approximately 0.87 hours)
  3. Cold Start Overhead:
    Total: 100 x 20.72 = 2072 seconds (or approximately 0.58 hours)

Total Billable Hours with Inferless: 0.79 (inference duration) + 0.87 (idle time) + 0.58 (cold start overhead) = 2.24 hours
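
The arithmetic above can be reproduced with a few lines of Python. This is a sketch of the article's own formula, not an Inferless billing API. The function name `billable_hours` and its parameter names are hypothetical.

```python
def billable_hours(queries, inference_s, cold_start_s, scale_down_timeout_s, scale_downs):
    """Return (inference, idle, cold_start, total) billable time in hours.

    Follows the computation in the text: idle time is the gap between the
    scale-down timeout and the inference time, counted once per scale-down,
    and each scale-down is followed by one cold start.
    """
    inference_total_s = queries * inference_s
    idle_total_s = (scale_down_timeout_s - inference_s) * scale_downs
    cold_start_total_s = scale_downs * cold_start_s
    total_s = inference_total_s + idle_total_s + cold_start_total_s
    return tuple(
        seconds / 3600
        for seconds in (inference_total_s, idle_total_s, cold_start_total_s, total_s)
    )


# Scenario 1: 100 queries, 28.60 s inference, 20.72 s cold start,
# 60 s scale-down timeout, 100 scale-downs per day.
inference_h, idle_h, cold_start_h, total_h = billable_hours(100, 28.60, 20.72, 60, 100)
print(f"{inference_h:.2f} + {idle_h:.2f} + {cold_start_h:.2f} = {total_h:.2f} hours")
# prints "0.79 + 0.87 + 0.58 = 2.24 hours"
```

Changing only the `queries` argument lets you explore other daily volumes under the same assumptions.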

Scenario 2

You are looking to deploy a Voice Conversational Chatbot application for processing 1000 queries per day.

Key Computations:

  1. Inference Duration:
    Processing 1000 queries at 28.60 seconds each
    Total: 1000 x 28.60 = 28600 seconds (or approximately 7.94 hours)
  2. Idle Timeout Duration:
    As in Scenario 1, scale-down is assumed to occur 100 times a day, so the post-processing idle time before scaling down is (60 seconds - 28.60 seconds) x 100 = 3140 seconds (or approximately 0.87 hours)
  3. Cold Start Overhead:
    Total: 100 x 20.72 = 2072 seconds (or approximately 0.58 hours)

Total Billable Hours with Inferless: 7.94 (inference duration) + 0.87 (idle time) + 0.58 (cold start overhead) = 9.39 hours

Pricing Comparison for Both Scenarios

Scenario           | AWS SageMaker Cost                      | Inferless Cost
100 requests/day   | $28.8 (24 hours billed at $1.22/hour)   | $2.73 (2.24 hours billed at $1.22/hour)
1000 requests/day  | $28.8 (24 hours billed at $1.22/hour)   | $11.46 (9.39 hours billed at $1.22/hour)

By opting for Inferless, you can achieve up to 90.52% cost savings (the 100 requests/day scenario).
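
The savings figure follows directly from the two cost columns. The sketch below recomputes it from the numbers quoted in the comparison. The constants are copied from the text, and the helper names `inferless_daily_cost` and `savings_percent` are hypothetical.

```python
GPU_PRICE_PER_HOUR = 1.22     # hourly rate quoted in the comparison
SAGEMAKER_DAILY_COST = 28.8   # always-on daily cost quoted in the comparison


def inferless_daily_cost(billable_hours):
    """Daily cost in dollars for the given billable hours, rounded to cents."""
    return round(billable_hours * GPU_PRICE_PER_HOUR, 2)


def savings_percent(baseline_cost, cost):
    """Percentage saved relative to the baseline cost, rounded to two decimals."""
    return round((baseline_cost - cost) / baseline_cost * 100, 2)


for label, hours in [("100 requests/day", 2.24), ("1000 requests/day", 9.39)]:
    cost = inferless_daily_cost(hours)
    saved = savings_percent(SAGEMAKER_DAILY_COST, cost)
    print(f"{label}: ${cost:.2f} per day, {saved:.2f}% saved")
```

Under these figures the 100 requests/day case yields the 90.52% saving quoted above, while the 1000 requests/day case yields a smaller saving because more of the day is spent on billable inference.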

Please note that we used the A100 (80 GB) GPU for model benchmarking, while for the pricing comparison we referenced the A10G GPU price on both platforms. This is due to the unavailability of the A100 GPU in SageMaker.

Also, the above analysis is based on a smaller-scale scenario for demonstration purposes. Should the scale increase tenfold, traditional cloud services might require maintaining 2-4 GPUs constantly active to manage peak loads efficiently. In contrast, Inferless, with its dynamic scaling capabilities, adeptly adjusts to fluctuating demand without the need for continuously running hardware.

Conclusion

By following this guide, you’re now equipped to build and deploy a sophisticated voice conversational chatbot application. This tutorial showcases the seamless integration of advanced technologies, emphasizing the practical application of creating cost-effective solutions.