This tutorial demonstrates how to implement real-time text-to-speech (TTS) streaming using the parler_tts_mini model and the Parler-TTS library. The deployed model converts text inputs into speech and streams the audio to the client chunk by chunk.
We use the Parler-TTS and Transformers libraries for the deployment.
First, construct the GitHub/GitLab template. This step is mandatory; make sure you do not add any file named model.py.
You can also add other files to this directory.
Let’s begin by creating the input_schema.py file, which defines the input structure for our model. You can find the complete file in our GitHub repository. For this tutorial, we’ll use two text inputs:
- prompt_value: The main text to be converted to speech.
- input_value: The voice instructions for the TTS model.
Both inputs are of the string data type. The output will be streamed using Server-Sent Events (SSE), delivering audio chunks as base64-encoded strings. This approach allows for real-time audio playback as the speech is generated.
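To make the base64 handling concrete, here is a minimal sketch of how a client might process one streamed chunk; the handle_chunk helper and the way the chunk string is obtained are illustrative and not part of the Inferless API.

```python
import base64

def handle_chunk(chunk_b64: str, out_file) -> None:
    # Decode the base64 string back into raw audio bytes and append them
    # to a file (or feed them to an audio player) as they arrive.
    audio_bytes = base64.b64decode(chunk_b64)
    out_file.write(audio_bytes)
```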
To enable streaming with SSE, it’s crucial to set the IS_STREAMING_OUTPUT property to True in your model configuration. This tells the system to expect and handle a continuous output stream rather than a single response.
It’s important to note the limitations when working with streaming inputs:
- INT, STRING, and BOOLEAN are supported as input datatypes.
- For multiple inputs or complex objects, use json.dumps(object) to convert them to a string before passing.
Now, let’s create the input_schema.py file with the following content:
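A minimal sketch of what input_schema.py could look like, assuming Inferless’s usual INPUT_SCHEMA dictionary layout; the example strings are illustrative, and the exact file in the repository may differ.

```python
INPUT_SCHEMA = {
    "prompt_value": {
        "datatype": "STRING",
        "required": True,
        "shape": [1],
        "example": ["Hello! This audio is being generated and streamed in real time."],
    },
    "input_value": {
        "datatype": "STRING",
        "required": True,
        "shape": [1],
        "example": ["A calm female voice with clear, slightly fast delivery."],
    },
}

# Streaming flag described above; assumed to live alongside the schema.
IS_STREAMING_OUTPUT = True
```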
In the parler.py file, we define the ParlerTTSStreamer class and import all the required functions.
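As a rough sketch, the streamer class can wrap the model and tokenizer loading. The checkpoint name and the sampling-rate attribute follow the public Parler-TTS release; the actual ParlerTTSStreamer in the repository does more, since it implements chunk-by-chunk generation.

```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

class ParlerTTSStreamer:
    def __init__(self):
        # Use the GPU when available; streaming generation is slow on CPU.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        repo_id = "parler-tts/parler_tts_mini_v0.1"  # assumed checkpoint name
        self.model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)
        self.sampling_rate = self.model.config.sampling_rate
```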
In the app.py file, we define the class and import all the required functions.
def initialize: In this function, we create an object of the ParlerTTSStreamer class, which loads the model. You can define any variables that you want to use during inference.
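A minimal sketch of the class and its initialize method, assuming Inferless’s usual InferlessPythonModel layout; only the streamer object is created here.

```python
from parler import ParlerTTSStreamer

class InferlessPythonModel:
    def initialize(self):
        # Load the model once at container start-up and keep it on the
        # instance so every request reuses the same weights.
        self.streamer = ParlerTTSStreamer()
```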
def infer: The infer function is the core of your model’s inference process. It’s invoked for each incoming request and is responsible for processing the input and generating the streamed output. Here’s a breakdown of its key components:
a. Output Streaming Setup: We create an output_dict with a key 'OUT' that will carry each streamed chunk.
b. Processing and Streaming: For each generated audio chunk (base64-encoded as mp3_str), we update the output_dict. The infer function receives a stream_output_handler for streaming the generated audio output chunks; it provides the stream_output_handler.send_streamed_output() function to send this chunk to the client.
c. Finalizing the Stream: Once the last chunk has been sent, we close the stream so the client knows the response is complete. A sketch of this flow is shown below.
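A sketch of the streaming loop inside infer, continuing the class above. The self.streamer.generate() generator, the soundfile-based encoding, and the finalise_streamed_output() call are assumptions; the implementation in the repository may differ.

```python
import base64
import io

import numpy as np
import soundfile as sf

# Method of the InferlessPythonModel class sketched above.
def infer(self, inputs, stream_output_handler):
    prompt_value = inputs["prompt_value"]   # text to speak
    input_value = inputs["input_value"]     # voice description

    output_dict = {}
    # Hypothetical generator yielding audio chunks as numpy arrays.
    for audio_chunk in self.streamer.generate(prompt_value, input_value):
        # Serialize the chunk and base64-encode it so it can travel as a
        # string inside an SSE event; WAV is used here for simplicity even
        # though the tutorial names the variable mp3_str.
        buffer = io.BytesIO()
        sf.write(buffer, np.asarray(audio_chunk), self.streamer.sampling_rate, format="WAV")
        mp3_str = base64.b64encode(buffer.getvalue()).decode("utf-8")

        output_dict["OUT"] = mp3_str
        stream_output_handler.send_streamed_output(output_dict)

    # Close the stream once every chunk has been sent (assumed call name).
    stream_output_handler.finalise_streamed_output()
```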
def finalize: This function cleans up all the allocated memory.
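A small sketch of finalize under the same assumptions: drop the streamer reference and release GPU memory.

```python
import gc
import torch

# Method of the InferlessPythonModel class sketched above.
def finalize(self):
    self.streamer = None
    gc.collect()
    torch.cuda.empty_cache()
```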
This is a mandatory step in which users can upload their own custom runtime through inferless-runtime-config.yaml.
To enable streaming functionality, ensure you are using CUDA version 12.4.1.
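For reference, a custom runtime file might look roughly like the sketch below; the build/cuda_version/python_packages layout and the package list are assumptions, so check the file in the repository for the exact contents.

```yaml
build:
  cuda_version: "12.4.1"
  python_packages:
    - "torch"
    - "transformers"
    - "parler_tts"   # name is illustrative; install Parler-TTS however the repo specifies
    - "soundfile"
    - "numpy"
```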
Inferless supports multiple ways of importing your model. For this tutorial, we will use GitHub.
Navigate to your desired workspace in Inferless and click on the Add a custom model button at the top right. An import wizard will open up.
Once the model is in ‘Active’ status, you can click on the ‘API’ page to call the model.
Inferless also allows you to deploy your model using the Inferless CLI. Follow the steps below to deploy with the CLI.
Let’s begin by cloning the model repository:
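A hedged example of the clone step; <model-repo-url> and <repo-directory> are placeholders for the tutorial’s actual GitHub repository and folder name.

```bash
# Replace the placeholders with the tutorial's repository details.
git clone <model-repo-url>
cd <repo-directory>
```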
To deploy the model using Inferless CLI, execute the following command:
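A sketch of the deploy command, assuming the flags explained below; additional Inferless CLI options (for example, a model name) may also be required.

```bash
inferless deploy --gpu A100 --runtime inferless-runtime-config.yaml
```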
Explanation of the Command:
- --gpu A100: Specifies the GPU type for deployment. Available options include A10, A100, and T4.
- --runtime inferless-runtime-config.yaml: Defines the runtime configuration file. If not specified, the default Inferless runtime is used.