Key Components of the Application

In building this application, we’ll utilize these components:

  1. Text Generation Model: We’ll use the meta-llama/Meta-Llama-3.1-8B-Instruct model with vLLM as our text generation engine. This large language model is designed to understand and summarize large bodies of text effectively.
  2. Text-to-Audio Model: For converting text summaries into speech, we’ll employ the xTTS-v2 model with the TTS library.

Crafting Your Application

To build this application, you have to follow these steps:

  1. Book URL Input: The user provides a URL link to the book they wish to summarize.

  2. Book Retrieval and Chunking: The application retrieves the book content from the provided URL and splits it into manageable chunks for easier processing.

  3. Summarization with LLM: Each chunk is sent to a large language model (LLM), such as Meta-Llama-3.1-8B, which generates a concise summary of each text chunk individually.

  4. Final Summary Generation: Once all chunks are summarized, the chunk summaries are combined and processed again by the LLM to produce a cohesive final summary that encapsulates the book’s main ideas.

  5. Text-to-Speech Conversion: This final summary is then converted to audio using a TTS model like xTTS-v2.

  6. User Playback: The resulting audio summary is provided to the user, enabling them to listen to a streamlined, engaging version of the book’s key concepts.

Core Development Steps

Text-to-Speech Generation

  • Objective: Convert the final text summary of the book into natural-sounding speech, allowing users to listen to an audio version of the summarized content.
  • Action: Implement a Python class, such as InferlessPythonModel, to manage the entire text-to-speech process. This class should handle the text input from the user to integrate with the TTS model (xTTS-v2) for producing the final audio response.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from pypdf import PdfReader
from TTS.api import TTS
import torch
import io
import base64
import requests

class InferlessPythonModel:
    @staticmethod
    def pdf_to_text(pdf_path):
        reader = PdfReader(pdf_path)
        text = []
        for page in reader.pages:
            text.append(page.extract_text())
        return '\n'.join(text)

    @staticmethod
    def split_text_into_chunks(text, chunk_size=4000):
        words = text.split(" ")
        for i in range(0, len(words), chunk_size):
            yield " ".join(words[i:i + chunk_size])

    @staticmethod
    def download_book(url, file_name="textbook.pdf"):
        response = requests.get(url)
        with open(file_name, "wb") as file:
            file.write(response.content)
        return file_name

    def initialize(self):
        model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
        self.llm = LLM(model=model_id, dtype="float16")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
        self.prompts = prompts = [
            "Summarizing the following text from the chapters of a book.",
            "Write comprehensive notes summarizing the following text from the book.",
            """Based on the summaries provided, compose a comprehensive summary of the book titled '<Book-Name>' suitable for a speech. Follow this structure:**
                 1. **Summary of the Book '<Book-Name>':** Provide a brief overview capturing the essence of the book.
                 2. **Introduction:** Introduce the main themes, purposes, and significance of the book.
                 3. **Chapter Summaries:** Briefly summarize each chapter, highlighting key events, developments, and insights.
                 4. **Conclusion:** Conclude by summarizing the overall impact of the book and its contributions.
         
             **Ensure the speech is engaging, coherent, and maintains a consistent tone throughout."""
        ]
        
    def generate_summary(self,prompts_idx,max_tokens,chunk):
        sampling_params = SamplingParams(max_tokens=max_tokens)
        messages = [
                {"role": "system", "content": self.prompts[prompts_idx]},
                {"role": "user", "content": chunk}
            ]
        input_text = self.tokenizer.apply_chat_template(messages, tokenize=False)
        result = self.llm.generate(input_text, sampling_params)
        summary = [output.outputs[0].text for output in result][0].split("<|start_header_id|>assistant<|end_header_id|>")[-1]
        
        return summary
        
    def recursive_summarize(self,text_chunks, prompts_idx, max_tokens, batch_size):
        summaries = []
        for i in range(0, len(text_chunks), batch_size): 
            batch = "\n\n".join(text_chunks[i:i + batch_size])
            batch_summary = self.generate_summary(prompts_idx, max_tokens, batch)
            summaries.append(batch_summary)
            
        if len("\n\n".join(summaries).split(" "))>4000:
            return self.recursive_summarize(summaries, prompts_idx, max_tokens, batch_size)
        else:
            final_summaries = "\n\n".join(summaries)
            final_summary = self.generate_summary(2, 1024,final_summaries)
            return final_summary
        
    def infer(self,inputs):
        book_url = inputs['book_url']
        book_name = self.download_book(book_url)
        parsed_text = self.pdf_to_text(book_name)
        
        initial_summaries = [
            self.generate_summary(0, 200, chunk)
            for chunk in self.split_text_into_chunks(parsed_text, chunk_size=4000)
        ]
        final_summary = self.recursive_summarize(initial_summaries, 1, 1024, batch_size=5)
        
        wav_file = io.BytesIO()
        self.tts.tts_to_file(
                text=final_summary,
                file_path=wav_file,
                speaker="Kazuhiko Atallah",
                language="en",
        )    
        audio_base64 = base64.b64encode(wav_file.getvalue()).decode('utf-8')
            
        return {"generated_audio_base64":audio_base64}

    def finalize(self):
        self.llm = None
        self.tts = None

Setting up the Environment

Dependencies:

  • Objective: Ensure all necessary libraries are installed.
  • Action: Run the command below to install dependencies:
pip install vllm==0.6.2 coqui-tts==0.24.3 pypdf==4.3.1

This command ensures your environment has all the tools required for the application.

Deploying Your Model with Inferless CLI

  1. Run the following command to initialize your model:
inferless init
  1. Upload Custom Runtime: Use the following command to upload your custom runtime.
inferless runtime upload

Here’s the custom runtime for the application:

build:
  cuda_version: "12.1.1"
  python_packages:
    - vllm==0.6.2
    - coqui-tts==0.24.3
    - pypdf==4.3.1
  1. Deploy Model: Execute inferless deploy to deploy and monitor the build logs on Inferless.
inferless deploy

Demo of the Book Audio Summary Generator.

Alternative Deployment Method

Inferless also supports a user-friendly UI for model deployment, catering to users at all skill levels. Refer to Inferless’s documentation for guidance on UI-based deployment.

Choosing Inferless for Deployment

Deploying your book summarizer application with Inferless offers compelling advantages, making your development journey smoother and more cost-effective. Here’s why Inferless is the go-to choice:

  1. Ease of Use: Forget the complexities of infrastructure management. With Inferless, you simply bring your model, and within minutes, you have a working endpoint. Deployment is hassle-free, without the need for in-depth knowledge of scaling or infrastructure maintenance.
  2. Cold-start Times: Inferless’s unique load balancing ensures faster cold-starts.
  3. Cost Efficiency: Inferless optimizes resource utilization, translating to lower operational costs. Here’s a simplified cost comparison:

Scenario

You are looking to deploy a Customer Service Voicebot application for processing 100 queries.

Parameters:

  • Total number of queries: 100 daily.
  • Inference Time: All models are hypothetically deployed on A100 80GB, taking 284.51 seconds to process an average book size of 383 pages and a cold start overhead of 57.94 seconds.
  • Scale Down Timeout: Uniformly 60 seconds across all platforms, except Hugging Face, which requires a minimum of 15 minutes. This is assumed to happen 100 times a day.

Key Computations:

  1. Inference Duration:
    Processing 100 queries and each takes 2.87 seconds
    Total: 100 x 284.51 = 28451 seconds (or approximately 7.9 hours)
  2. Idle Timeout Duration:
    Post-processing idle time before scaling down: (300 seconds - 284.51 seconds) x 100 = 1549 seconds (or 0.43 hours approximately)
  3. Cold Start Overhead:
    Total: 100 x 57.94 = 5794 seconds (or 1.61 hours approximately)

Total Billable Hours with Inferless: 7.9 (inference duration) + 0.43 (idle time) + 1.61 (cold start overhead) = 9.94 hours
Total Billable Hours with Inferless: 9.94 hours

ScenarioOn-Demand CostServerless Cost
100 requests/day$28.8 (24 hours billed at $1.22/hour)$12.13 (9.94 hours billed at $1.22/hour)

By opting for Inferless, you can achieve up to 58.88% cost savings.

Please note that we have utilized the A100(80 GB) GPU for model benchmarking purposes, while for pricing comparison, we referenced the A10G GPU price from both platforms. This is due to the unavailability of the A100 GPU in SageMaker.

Also, the above analysis is based on a smaller-scale scenario for demonstration purposes. Should the scale increase tenfold, traditional cloud services might require maintaining 2-4 GPUs constantly active to manage peak loads efficiently. In contrast, Inferless, with its dynamic scaling capabilities, adeptly adjusts to fluctuating demand without the need for continuously running hardware.

Conclusion

By following this guide, you’re now equipped to build and deploy a sophisticated book summarizer application. This tutorial showcases the seamless integration of advanced technologies, emphasizing the practical application of creating cost-effective solutions.