DocBot: A Llama 2 medical bot 🩺

Situation

Problem: The project aimed to develop a medical bot, called DocBot, to provide users with accurate and concise information from a medical encyclopedia.

Context: The data for DocBot was extracted from a collection of medical encyclopedia PDFs. The challenge was to build a system that could understand user queries and retrieve relevant information from these documents. This project is one of a series of learning projects exploring language models and conversational AI and applying them to real-world problems.

Task

As part of this project, I developed an ingestion script to process PDF documents from the Encyclopedia of Medicine. The script, stored in book.py, utilized the langchain library for document loading, text splitting, and creating vector embeddings. The resulting vector database, powered by the FAISS library, was then saved locally.

In the model.py file, I implemented a conversational QA system built around Llama 2, a state-of-the-art open language model, which was used to understand and respond to user queries. The retrieval QA chain, served to users through a chainlit chat interface, relied on prompt engineering techniques for effective question answering.

Action

Document Processing:

Used PyPDFLoader and DirectoryLoader from langchain to load the PDF documents and RecursiveCharacterTextSplitter to split them into overlapping chunks. Used Hugging Face embeddings (sentence-transformers/all-MiniLM-L6-v2) to create meaningful document embeddings, which were stored in a FAISS vector database for efficient retrieval.

A snippet of code showing the document ingestion and vector store creation:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter 

DATA_PATH = 'data/'
DB_FAISS_PATH = 'vectorstore/db_faiss'

# Create vector database
def create_vector_db():
    # Load every PDF in the data directory
    loader = DirectoryLoader(DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()

    # Split the documents into overlapping chunks for embedding
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                                   chunk_overlap=50)
    texts = text_splitter.split_documents(documents)

    # Embed the chunks on CPU with a sentence-transformers model
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})

    # Build the FAISS index and persist it locally
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)

if __name__ == "__main__":
    create_vector_db()
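
For a quick sanity check of the ingestion step, the saved index can be loaded back and queried directly. The snippet below is a minimal sketch, not part of book.py: the query string and k value are illustrative, and depending on the langchain version, FAISS.load_local may require an extra deserialization flag.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

DB_FAISS_PATH = 'vectorstore/db_faiss'

# Reload the persisted index with the same embedding model used at ingestion time
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                   model_kwargs={'device': 'cpu'})
db = FAISS.load_local(DB_FAISS_PATH, embeddings)

# Illustrative query; returns the two most similar chunks
docs = db.similarity_search("What are the symptoms of anemia?", k=2)
for doc in docs:
    print(doc.page_content[:200])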

Conversational QA System:

Integrated the Llama 2 model using CTransformers for handling user queries.

Prompt Engineering

Prompt engineering proved vital to the performance of the conversational QA system. The PromptTemplate class made it easy to formulate prompts that guide the model to use the retrieved context and generate helpful answers.

The code snippet below shows the custom_prompt_template used:

from langchain.prompts import PromptTemplate

custom_prompt_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vector store
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt
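
To show where this prompt plugs in, here is a minimal sketch of how a retrieval QA chain can be assembled from the custom prompt, the FAISS vector store, and the loaded Llama 2 model. The helper name retrieval_qa_chain and the k=2 setting are illustrative, and llm stands for the CTransformers model whose loading is shown in the next snippet; the exact wiring in model.py may differ.

from langchain.chains import RetrievalQA

def retrieval_qa_chain(llm, prompt, db):
    # "stuff" concatenates the retrieved chunks into the {context} slot of the prompt
    qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type='stuff',
                                           retriever=db.as_retriever(search_kwargs={'k': 2}),
                                           return_source_documents=True,
                                           chain_type_kwargs={'prompt': prompt})
    return qa_chain

With a chain like this, a user question is embedded, the closest chunks are fetched from FAISS, and the prompt above is filled with that context before being passed to Llama 2.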

The Llama 2 Model

The Llama 2 model, specifically the llama-2-7b-chat.ggmlv3.q4_0.bin variant, a 4-bit quantized version of the 7B chat model that can run on a CPU, powered the conversational QA system. It was integrated into the retrieval QA chain, allowing it to answer questions using the relevant passages retrieved from the vector database created during the ingestion process. With a focus on medical knowledge, the model demonstrated its proficiency in handling diverse user queries.

The code snippet below shows how the quantized model was loaded. The temperature was set to 0.7 to make the model's responses somewhat more creative: