I am currently building a RAG (Retrieval Augmented Generation) LLM application, and recently
decided to share my journey on this blog.
I'll skip the intro and jump straight to my current status. Here’s where I am:
- Starting Point: To get started with all the boilerplate code, I used the Metaflow RAG demo repository and this Anyscale blog post for insights and a second point of view. Metaflow’s repo includes a Streamlit chat webapp, utilities for scraping and parsing markdown documentation files, etc.
Current Status
- I replicated what was done on Metaflow’s blog, augmented with Metaflow’s docs. Now, I want to do the same with Hugging Face’s docs, focusing on three libraries: transformers, accelerate, and TRL (Transformer Reinforcement Learning).
- It works, but it’s not amazing yet:
Make iteration cycles cheap
Currently, the pipeline uses the OpenAI text-embedding-ada-002 API for embeddings (the default in LlamaIndex). OpenAI now offers text-embedding-3-small, which is about a fifth of the price and more performant: a double win. Let’s use it:
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)
set_global_service_context(service_context)
You can also choose any Hugging Face model to compute the embeddings locally on your machine. One of the best at the time of writing is BAAI/bge-large-en-v1.5. You must tell LlamaIndex that it’s a local model with the local: prefix:
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"), embed_model="local:BAAI/bge-large-en-v1.5")
I kept the LLM as GPT-3.5 for now, as it’s cheaper, and I’m mostly testing the relevance of the context augmentation at this stage, not the model itself.
While doing this modification, I found the prompt used in the initial RAG repo.
Prompt modification
This is the prompt used to include the context retrieved from the vector DB and then ask the question to the LLM:
prompt = """
Answer this question only if there is relevant context below: {}
If there is nothing in the context say: "Could not find relevant context."
Here is the retrieved context: {}
"""
I changed it to make it more flexible; I think it will help me debug and evaluate the pipeline.
prompt = """
Answer this question using your existing knowledge and the relevant context below: {}
If the context doesn't help you answer, prefix your message with: "Context not relevant."
If the context is empty, prefix your message with: "No context."
If you don't know the answer and the context doesn't help you, say "I don't know".
Here is the retrieved context: {}
"""
It actually helped me later: I found out by chance that my context retrieval wasn’t working so well 😂 (or that I trained on the wrong data). Found in the HF docs:
Next Steps
- Check if embeddings are saved between runs, and what the default strategy is when iterating on the pipeline (see the sketch after this list)
- Run the Metaflow UI locally to easily track experiments
- Consider switching to a proper vector DB
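For the first point, here’s a minimal sketch of the persistence pattern I plan to try with LlamaIndex; the storage path and the get_index helper are placeholders, not what the demo repo currently does:
import os
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage

PERSIST_DIR = "./storage"  # hypothetical location for the persisted index

def get_index(documents):
    """Build and persist the index once; reuse the stored embeddings on later runs."""
    if os.path.exists(PERSIST_DIR):
        # Reload the existing index instead of re-embedding everything.
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index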
See you for day 2!