Understanding Retrieval-Augmented Generation (RAG)

Before diving in, let’s briefly understand the technology behind our app. Retrieval-Augmented Generation (RAG) combines the power of large language models with external data sources to enhance response accuracy and context. It involves three steps, sketched in code just after this list:

  1. Retrieval: Fetching relevant data in response to queries.
  2. Augmentation: Enhancing the language model’s knowledge with this data.
  3. Generation: Producing accurate, context-aware responses.
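
In code, these three steps map onto a retriever, a prompt that injects the retrieved context, and a language model. Here is a minimal, framework-agnostic sketch; the retrieve and generate helpers are hypothetical placeholders, which this tutorial later fills in with LangChain, Pinecone, and an LLM:

# Minimal RAG flow: retrieve -> augment -> generate.
# retrieve() and generate() are hypothetical stand-ins for a vector-store
# query and an LLM call.

def retrieve(query: str) -> list[str]:
    """Fetch the text chunks most relevant to the query, e.g. from a vector database."""
    ...

def generate(prompt: str) -> str:
    """Produce the final answer with a large language model."""
    ...

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))                 # 1. Retrieval
    prompt = f"Context:\n{context}\n\nQuestion: {query}"   # 2. Augmentation
    return generate(prompt)                                # 3. Generation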

With this foundation, let’s start building.

Crafting Your Application

This tutorial is structured to ease you into creating a PDF Q&A application using technologies like LangChain, Pinecone, and Inferless.

Obtaining your Pinecone API Key

To seamlessly integrate Pinecone, a fully managed vector database, into our application, securing an API Key is essential. Here’s a simplified guide to getting started:

  1. Registration: Begin by signing up on the Pinecone platform using your email address. Pinecone serves as the backbone for managing and querying vector data efficiently.
  2. Index Creation: Once registered, proceed to create an index on Pinecone. Remember to note its name; this will be crucial for configuring your application.
  3. API Key Access: With your index ready, navigate to the API Keys section, easily found in the Pinecone dashboard’s left sidebar.
  4. API Key Retrieval: Copy the API Key presented here. This key will authenticate your application’s requests to Pinecone, enabling secure and efficient data operations.

By following these steps, you will have laid the groundwork for your PDF Q&A application, leveraging Pinecone’s powerful vector database capabilities.
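
If you prefer to create the index from code rather than the dashboard, the pinecone-client 3.x SDK (installed later in this guide) supports it. Below is a minimal sketch, assuming a serverless index; the cloud and region are illustrative choices, and the dimension of 384 matches the all-MiniLM-L6-v2 embedding model used throughout this tutorial:

from pinecone import Pinecone, ServerlessSpec

# Authenticate with the API key copied from the dashboard (placeholder value)
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# Create the "documents" index; 384 is the output dimension of all-MiniLM-L6-v2.
# Cloud and region below are example choices, adjust them to your account.
pc.create_index(
    name="documents",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)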

The next phase involves preparing your deployment environment. While this guide focuses on Inferless for its simplicity and efficiency in serverless deployments, you’re encouraged to select the platform that best fits your project needs.

Core Development Steps

PDF Upload Functionality:

  • Objective: Establish a process for uploading and managing PDFs, utilizing LangChain for document loading and Pinecone for efficient data indexing.
  • Action: Implement a Python class for PDF management. Refer to our GitHub example for detailed code.
import os
from langchain.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import Pinecone

class InferlessPythonModel:
    def initialize(self):
        # Define the Pinecone index name, the embedding model, and the Pinecone API key
        index_name = "documents"
        embed_model_id = "sentence-transformers/all-MiniLM-L6-v2"
        os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"

        # Initialize the embedding model, the text splitter, and the Pinecone vector store
        embeddings = HuggingFaceEmbeddings(model_name=embed_model_id)
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
        self.pinecone = Pinecone(index_name=index_name, embedding=embeddings)

    def infer(self, inputs):
        # Load the PDF from the supplied URL, split it into chunks,
        # and index those chunks in Pinecone
        pdf_link = inputs["pdf_url"]
        loader = OnlinePDFLoader(pdf_link)
        data = loader.load()
        documents = self.text_splitter.split_documents(data)
        response = self.pinecone.add_documents(documents)

        return {"result": response}

    def finalize(self):
        pass
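
Inferless invokes initialize() once when the container starts and infer() for every request, so the only input this class expects is a pdf_url key. For a quick local sanity check (the URL below is just a placeholder), you could run something like:

# Hypothetical local test of the upload class; the PDF URL is a placeholder
model = InferlessPythonModel()
model.initialize()
print(model.infer({"pdf_url": "https://example.com/sample.pdf"}))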

Integrating Q&A Functionality:

  • Objective: Enable the application to process user queries and extract answers from PDFs, employing the RAG technique.
  • Action: Develop a Python class to merge embeddings with LangChain’s retrieval capabilities and the Llama-2 7B model for answer generation. Our GitHub repository provides a comprehensive example.
import os
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import Pinecone
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

class InferlessPythonModel:
    def initialize(self):
        # Define the Pinecone index name, the embedding model, the LLM, and the Pinecone API key
        index_name = "documents"
        embed_model_id = "sentence-transformers/all-MiniLM-L6-v2"
        llm_model_id = "NousResearch/Nous-Hermes-llama-2-7b"
        os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"

        # Initialize the embedding model and build a retriever over the existing Pinecone index
        embeddings = HuggingFaceEmbeddings(model_name=embed_model_id)
        vectorstore = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)
        retriever = vectorstore.as_retriever()

        # Initialize the LLM
        tokenizer = AutoTokenizer.from_pretrained(llm_model_id)
        model = AutoModelForCausalLM.from_pretrained(llm_model_id, trust_remote_code=True, device_map="cuda")
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100)
        llm = HuggingFacePipeline(pipeline=pipe)

        # Define the chat template and the retrieval chain
        template = """Answer the question based only on the following context:
                      {context}
                      Question: {question}
                      """
        prompt = ChatPromptTemplate.from_template(template)
        self.chain = (
            RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
            | prompt
            | llm
            | StrOutputParser()
        )
      
    def infer(self, inputs):
        # Retrieve relevant chunks from Pinecone and generate an answer with the LLM
        question = inputs["question"]
        result = self.chain.invoke(question)
        return {"generated_result": result}

    def finalize(self):
        pass
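
As with the upload class, the only input is a question key. A quick local check with an example question might look like this:

# Hypothetical local test of the Q&A class; the question is only an example
model = InferlessPythonModel()
model.initialize()
print(model.infer({"question": "What is the main topic of the uploaded PDF?"}))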

Setting up the Environment

Dependencies:

  • Objective: Ensure all necessary libraries are installed.
  • Action: Run the command below to install dependencies:
pip install langchain==0.1.7 langchain-pinecone==0.0.2 pypdf==4.0.1 unstructured==0.12.4 sentence-transformers==2.3.1 pinecone-client==3.0.3 huggingface-hub==0.20.3 pdf2image==1.17.0 pdfminer==20191125 pdfminer.six==20221105 pillow_heif==0.15.0 unstructured-inference==0.7.24 pikepdf==8.12.0 transformers==4.37.2 accelerate==0.27.2

This command ensures that your environment is equipped with all the tools required for the PDF Q&A application.

Organizing Your Application:

  • Structure: Divide your application into two main parts for better modularity and maintenance:
    1. PDF Upload Component: Handles the uploading and initial processing of PDF documents. See GitHub Repository.
    2. Q&A Component: Manages the extraction of information and answering of user queries from PDFs. See GitHub Repository.

Deploying Your Model with Inferless CLI

PDF Upload Deployment:

  1. Run the following command to initialize your model:
inferless init
  2. Upload Custom Runtime: Use the following command to upload your custom runtime.
inferless runtime upload

Here’s the custom runtime for the application:


build:
  cuda_version: "12.1.1"
  system_packages:
    - "libssl-dev"
    - "libsm6"
    - "libxext6"
    - "libxrender-dev"
  python_packages:
    - "langchain==0.1.7"
    - "langchain-pinecone==0.0.2"
    - "pypdf==4.0.1"
    - "unstructured==0.12.4"
    - "sentence-transformers==2.3.1"
    - "pinecone-client==3.0.3"
    - "huggingface-hub==0.20.3"
    - "pdf2image==1.17.0"
    - "pdfminer==20191125"
    - "pdfminer.six==20221105"
    - "pillow_heif==0.15.0"
    - "unstructured-inference==0.7.24"
    - "pikepdf==8.12.0"
    - "transformers==4.37.2"
    - "accelerate==0.27.2"
  3. Deploy Model: Execute the following command to deploy and monitor the build logs on Inferless.
inferless deploy

PDF Q&A Deployment:

  1. Run the following command to initialize your model:
inferless init
  2. Upload Custom Runtime: Use the following command to upload your custom runtime.
inferless runtime upload

Here’s the custom runtime for the application:

build:
  cuda_version: "12.1.1"
  system_packages:
    - "libssl-dev"
    - "libsm6"
    - "libxext6"
    - "libxrender-dev"
  python_packages:
    - "langchain==0.1.7"
    - "langchain-pinecone==0.0.2"
    - "pypdf==4.0.1"
    - "unstructured==0.12.4"
    - "sentence-transformers==2.3.1"
    - "pinecone-client==3.0.3"
    - "huggingface-hub==0.20.3"
    - "pdf2image==1.17.0"
    - "pdfminer==20191125"
    - "pdfminer.six==20221105"
    - "pillow_heif==0.15.0"
    - "unstructured-inference==0.7.24"
    - "pikepdf==8.12.0"
    - "transformers==4.37.2"
    - "accelerate==0.27.2"
  3. Deploy Model: Execute the following command to deploy and monitor the build logs on Inferless.
inferless deploy

Alternative Deployment Method

Inferless also supports a user-friendly UI for model deployment, catering to users at all skill levels. Refer to Inferless’s documentation for guidance on UI-based deployment.

Choosing Inferless for Deployment

Deploying your PDF Q&A app with Inferless offers compelling advantages, making your development journey smoother and more cost-effective. Here’s why Inferless is the go-to choice:

  1. Ease of Use: Forget the complexities of infrastructure management. With Inferless, you simply bring your model, and within minutes, you have a working endpoint. Deployment is hassle-free, without the need for in-depth knowledge of scaling or infrastructure maintenance.
  2. Cold-start Times: Inferless’s unique load balancing ensures faster cold-starts. Expect around 6.20 seconds for PDF uploads and 11.7 seconds for Q&A functionalities, significantly faster than many traditional platforms.
  3. Cost Efficiency: Inferless optimizes resource utilization, translating to lower operational costs. Here’s a simplified cost comparison:

PDF Upload:

Assumptions:

  • 100 documents uploaded daily
  • Each document takes 20 seconds of processing time (inference)
  • Additional time considerations include a 6.20-second cold start for each upload

Inferless Cost:

Total Active Processing Time: 100 documents * 20 seconds each = 2000 seconds (or approximately 0.55 hours)
Cold Start Overhead: 100 uploads * 6.20 seconds each = 620 seconds (or 0.17 hours)
Idle Time Between Uploads: ((60 seconds - 20 seconds) * 100) / 3600 = approximately 1.1 hours
Total Billable Hours with Inferless: 1.82 hours

Q&A for PDFs

Assumptions:

  • 30 queries per document daily (3,000 queries in total across the 100 documents), with an 11.71-second cold start
  • Each query takes 5 seconds of processing time

Inferless Cost:

Total Active Processing Time: (5 seconds * 3000 queries) / 3600 = approximately 4.1 hours
Cold Start Overhead: 11.71 seconds * 100 = 1171 seconds (or 0.32 hours)
Idle Time Between Queries: ((60 seconds - 5 seconds) * 100) / 3600 = approximately 1.52 hours
Total Billable Hours with Inferless: 5.94 hours
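
As a quick sanity check of the Inferless side, multiplying the rounded hour estimates above by the assumed $1.22/hour rate reproduces the costs shown in the comparison table below:

# Reproduce the Inferless cost figures from the rounded hour estimates above
RATE_PER_HOUR = 1.22                 # assumed GPU rate in $/hour

upload_hours = 0.55 + 0.17 + 1.1     # processing + cold starts + idle (PDF upload)
qa_hours = 4.1 + 0.32 + 1.52         # processing + cold starts + idle (Q&A)

print(round(upload_hours * RATE_PER_HOUR, 2))               # ~2.22
print(round(qa_hours * RATE_PER_HOUR, 2))                   # ~7.25
print(round((upload_hours + qa_hours) * RATE_PER_HOUR, 2))  # ~9.47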

Comparative Cost Analysis

Operation Type    AWS SageMaker Cost                       Inferless Cost
PDF Upload        $28.8 (24 hours billed at $1.22/hour)    $2.22 (1.82 hours billed at $1.22/hour)
Q&A for PDFs      $28.8 (24 hours billed at $1.22/hour)    $7.25 (5.94 hours billed at $1.22/hour)
Total Cost        $57.6                                    $9.47

By opting for Inferless, you can achieve up to 84% cost savings, lowering your operational expenses from $57.6 to just $9.47.

Please note, the above analysis is based on a smaller-scale scenario for demonstration purposes. Should the scale increase tenfold, traditional cloud services might require maintaining 2-4 GPUs constantly active to manage peak loads efficiently. In contrast, Inferless, with its dynamic scaling capabilities, adeptly adjusts to fluctuating demand without the need for continuously running hardware.

Conclusion

By following this guide, you’re now equipped to build and deploy a sophisticated PDF Q&A application. This tutorial showcases the seamless integration of advanced technologies, emphasizing the practical application of RAG for creating cost-effective solutions.