LLM on Your Laptop Part 2: A Retrieval Augmented Generation (RAG) Guide — Using Ollama and Small LLMs
Hey there! This is the second article in the LLM on Your Laptop series (yes, my laptop is plotting to Caesar me any day now). In the last post, we created a synthetic dataset with the Phi3 model running on our local machine. You can check out the previous article in the series here.
In this post, we will use that same synthetic dataset to build a vanilla RAG pipeline. If you are already an expert in Retrieval Augmented Generation, or have seen a thousand such articles already, feel free to skip this one. This post is a stepping stone to further articles where we explore and benchmark techniques popular in industry right now, such as fine-tuning and graph-based approaches, to improve the performance of our LLM-based pipeline.
So, without further ado, let's build a RAG pipeline on the self-help articles dataset.
Retrieval Augmented Generation
You have probably heard this term a lot lately; it has gained popularity since the introduction of ChatGPT and other LLMs. At a high level, it happens in three steps:
- Retrieve: fetch the context relevant to the user query from an external source (typically a vector database).
- Augment: append the retrieved information to a prompt, producing a context-augmented version of the user query.
- Generate: send this context-augmented prompt to the LLM, which produces a final answer grounded in the retrieved context rather than only in what it was trained on.
Let's dig a little deeper into the retrieval step. The retriever usually sits on top of a vector database where the context is stored, but there is a catch: the context is not stored as plain text. Each document (and each incoming query) is converted into an embedding, a numeric vector, and retrieval works by comparing these vectors.
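To make the embedding idea concrete, here is a minimal sketch, assuming Ollama is running locally and the nomic-embed-text model has already been pulled (the variable names are just for illustration):
from langchain.embeddings import OllamaEmbeddings

emb = OllamaEmbeddings(model="nomic-embed-text")
# The text is turned into a fixed-length vector of floats
query_vector = emb.embed_query("how do I keep my chinchilla happy?")
print(len(query_vector))   # dimensionality of the embedding vector
print(query_vector[:5])    # first few values of the vector
Two texts about the same topic end up with vectors that are close together, and that closeness is exactly what the vector store uses to retrieve relevant context.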
Implementation
Step 1: Do the imports and read the dataset. We will be using the LangChain framework for our RAG pipeline.
## LangChain imports
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from uuid import uuid4
import os
import glob
import textwrap
import time
# loaders
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
# prompts
from langchain.prompts import PromptTemplate
# vector stores
import faiss
from langchain.vectorstores import FAISS
from langchain.docstore.in_memory import InMemoryDocstore
# models
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings
# retrievers
from langchain.chains import RetrievalQA
from langchain.embeddings import OllamaEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
Step 2: Create a vector store and index the dataset. For this, we will use FAISS as the vector store and the "nomic-embed-text" model to generate embeddings.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Convert DataFrame rows to Document objects (df_unique is the article dataset prepared earlier)
documents = [
    Document(
        page_content=row['page_content'],
        metadata={'id': row['content_id'], 'title': row['title'],
                  'intro': row['intro'], 'summary': row['summary']},
    )
    for _, row in df_unique.iterrows()
]
# One UUID per document, used as the docstore id
uuids = [str(uuid4()) for _ in range(len(documents))]
# Flat L2 index sized to the embedding dimensionality
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
# Initializing the vector store
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# Add the documents to the vector store
vector_store.add_documents(documents=documents, ids=uuids)
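If you do not want to re-embed the whole dataset every time the notebook restarts, the FAISS store can be persisted to disk and reloaded later. A small optional sketch (the folder name is just an example):
# Persist the index and docstore to a local folder
vector_store.save_local("faiss_selfhelp_index")

# Reload it later with the same embedding model
# (newer LangChain versions also require allow_dangerous_deserialization=True,
#  since the docstore is pickled)
reloaded_store = FAISS.load_local("faiss_selfhelp_index", embeddings)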
Step 3: Test whether retrieval is working.
# Let's test if the approach is working fine
# similarity_search_with_score returns (document, score) pairs; since we built an
# IndexFlatL2, the score is an L2 distance, so lower means more similar
# Currently we are retrieving one document for a sample query
query = "I want to learn to teach people"
results = vector_store.similarity_search_with_score(
query,
k=1,
)
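It helps to look at what actually came back. A quick sketch that prints the metadata and distance of each retrieved document (the metadata keys are the ones we attached in Step 2):
for doc, distance in results:
    print("id:", doc.metadata['id'])
    print("title:", doc.metadata['title'])
    print("L2 distance:", round(float(distance), 4))
    print(doc.page_content[:200], "...")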
Step 4: Let’s create the GENERATOR now!!
We will start by creating a prompt that takes in the user query and the context retrieved for it, and generates the answer.
PROMPT_TEMPLATE = """
You are an AI assistant bot who replies to users to answer their queries.
########
Let's do it step by step
Instructions:
Step 1: Forget everything you know beforehand
Step 2: You will understand the intent of the user_query provided as input
Step 3: You will understand the intent of context provided as input
Step 4: You will generate an answer to the user_query based on the context provided
Step 5: After generating the response, you will calculate a relevance score for the generated answer
Step 6: If relevance score is low, generate the answer again
########
Example Input:
user_query - 'how to keep chinchilla healthy and happy'
context - 'Intro \n Chinchillas are active, curious pets who thrive with plenty of physical activity. To make sure your chinchilla is healthy and happy, let it roam a room in your home for one to two hours each night. When a chinchilla is confined to its cage, provide toys that encourage activity. You'll find that, with a little dedication, you can keep your pet healthy, happy, and active. \n Summary \n None \n Tips \n \n Warnings \n'
########
Example Output:
Based on the provided context, here are some key points to keep a chinchilla healthy and happy:
- Allow your chinchilla to roam freely in a room for 1-2 hours each night.
- Provide toys in the cage to encourage activity when confined.
- Ensure plenty of physical activity opportunities.
- Be dedicated to your pet's well-being.
########
Generate the output in html formatted bullet points
{format_instructions}
########
Here is the user_query - {user_query}
Now, generate a response based on the context provided below
{context}
"""
PROMPT_TEMPLATE = """
You are an advanced AI assistant specialized in providing accurate and helpful responses to user queries. Your goal is to generate concise, relevant, and informative answers based on the given context.
#######
Instructions:
0. Forget everything you know, your knowledge is only limited to the context provided as input
1. Carefully analyze the user_query to understand the core question or request.
2. Thoroughly examine the provided context, focusing on relevant information.
3. Generate a clear and concise answer that directly addresses the user_query, using only the information from the context.
4. Format your response as bullet points, with each key point as a separate bullet.
5. Ensure your response is well-structured and easy to understand.
6. If the context doesn't contain sufficient information to answer the query, respond with 'fallback' in the mentioned json output instruction.
7. Always respond in below json formatted output instruction
#######
Follow the below instruction only to generate output
{format_instructions}
#######
Below are the example input and expected output. Do remember this is just for example reference and not to be used in user query or context
Example Input:
user_query1 - 'how to keep chinchilla healthy and happy'
user_query2 - 'how can i detox my gut'
context - 'Intro \n Chinchillas are active, curious pets who thrive with plenty of physical activity. To make sure your chinchilla is healthy and happy, let it roam a room in your home for one to two hours each night. When a chinchilla is confined to its cage, provide toys that encourage activity. You'll find that, with a little dedication, you can keep your pet healthy, happy, and active. \n Summary \n None \n Tips \n \n Warnings \n'
#######
Example Expected Output:
output_1 = "
<p>To keep your chinchilla healthy and happy follow the below steps</p>
<ul>
<li>Allow your chinchilla to roam freely in a room for 1-2 hours each night</li>
<li>Provide toys in the cage to encourage activity when confined</li>
<li>Ensure plenty of physical activity opportunities</li>
<li>Be dedicated to your pet's well-being</li>
</ul>
"
output_2 = "fallback"
#######
Now, please respond to the following user query based on the provided context:
user_query: {user_query}
context: {context}
"""
Step 5: Define the Pydantic model to parse the output from the LLM, initialize the prompt template, and chain the model, the prompt, and the parser together.
from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser
from langchain_ollama import OllamaLLM

# Pydantic model describing the JSON object we expect back from the LLM
class LLMOutput(BaseModel):
    response: str = Field(description="string output response")

model = OllamaLLM(model="phi3")

def generate_llm_output(user_query):
    try:
        parser = PydanticOutputParser(pydantic_object=LLMOutput)
        prompt = PromptTemplate(
            template=PROMPT_TEMPLATE,
            input_variables=["user_query", "context"],
            partial_variables={"format_instructions": parser.get_format_instructions()},
        )
        # Retrieve the single most similar document for this query
        list_documents = vector_store.similarity_search_with_score(
            query=user_query,
            k=1,
        )
        doc_id = list_documents[0][0].metadata['id']
        context = list_documents[0][0].page_content
        score = list_documents[0][1]
        # Chain: fill the prompt, call the model, parse the JSON output
        chain = prompt | model | parser
        response = chain.invoke({"user_query": user_query, "context": context})
        print('Response', response)
        answer = response.response
        print('answer', answer)
        return answer, doc_id, score
    except Exception as e:
        # Most failures here are outputs the parser cannot turn into valid JSON
        doc_id = 0
        score = 0
        print(type(e).__name__)
        return type(e).__name__, doc_id, score
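With a small model, many of the failures caught by the except block are malformed JSON that the strict PydanticOutputParser rejects. If you want to trade a little extra latency for robustness, LangChain's OutputFixingParser can wrap the strict parser and ask the model to repair broken output. A rough sketch of how it could slot into generate_llm_output, reusing the parser, prompt, and model defined above:
from langchain.output_parsers import OutputFixingParser

# Wrap the strict parser; on a parse failure, the fixing parser sends the
# broken output back to the LLM and asks it to fix the formatting
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=model)
chain = prompt | model | fixing_parser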
Step 6: Put the retriever and generator together and run them on a sample query.
# Pick a sample row from the synthetic query dataset
sample_idx = 1
query = df_wiki_sample.iloc[sample_idx]['LLM_Query']
content_id = df_wiki_sample.iloc[sample_idx]['content_id']
print(content_id, query)
# Calling the LLM model
answer, doc_id, score = generate_llm_output(query)
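Because generate_llm_output also returns the id of the retrieved document, we can loop over a handful of rows and check how often the retriever brings back the article the query was generated from. A rough sketch, not a full benchmark, assuming the same df_wiki_sample columns as above:
hits = 0
n_samples = min(20, len(df_wiki_sample))   # keep the run short on a laptop
for i in range(n_samples):
    row_query = df_wiki_sample.iloc[i]['LLM_Query']
    expected_id = df_wiki_sample.iloc[i]['content_id']
    _, retrieved_id, _ = generate_llm_output(row_query)
    if retrieved_id == expected_id:
        hits += 1
print(f"retrieval hit rate: {hits / n_samples:.2%} over {n_samples} queries")
A loop like this will come in handy in later posts when we compare the vanilla pipeline against the techniques we benchmark.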
Conclusion
The small LLM performs roughly as expected. A noticeable share of outputs still fail, but that is the trade-off of running a small model; when generation succeeds, the model produces the answer as HTML-formatted bullet points, exactly as the prompt instructs.