In an era where data privacy is paramount, setting up your own local language model (LLM) provides a crucial solution for companies and individuals alike. This tutorial guides you through the process of creating a custom chatbot using Ollama, Python 3, and ChromaDB, all hosted locally on your system. The key reasons for doing this come down to privacy, control, and customization: your data never leaves your machine, and every part of the stack is yours to adapt.
This tutorial will empower you to build a robust and secure local chatbot, tailored to your needs, without compromising on privacy or control.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an advanced technique that combines
the strengths of information retrieval and text generation to create more accurate
and contextually relevant responses. Here's a breakdown of how RAG works and
why it's beneficial:
What is RAG?
RAG is a hybrid model that enhances the capabilities of language models by incorporating an external knowledge base or document store. The process involves two main components: a retrieval step, which searches the knowledge base for the documents most relevant to the user's question, and a generation step, in which the language model uses the retrieved documents as context to produce the final answer.
Benefits of RAG
Enhanced Accuracy: By leveraging external data, RAG models can provide
more precise and detailed answers, especially for domain-specific queries.
Contextual Relevance: The retrieval component ensures that the generated
response is grounded in relevant and up-to-date information, improving the
overall quality of the response.
Scalability: RAG systems can be easily scaled to incorporate vast amounts of
data, enabling them to handle a wide range of queries and topics.
Flexibility: These models can be adapted to various domains by simply
updating or expanding the external knowledge base, making them highly
versatile.
By setting up a local RAG application with tools like Ollama, Python, and
ChromaDB, you can enjoy the benefits of advanced language models while
maintaining control over your data and customization options.
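To make the two-step flow concrete, here is a minimal, self-contained sketch. The retrieve and generate functions below are toy stand-ins (simple keyword matching and prompt assembly), not the actual ChromaDB and Ollama calls, which are wired up later in this tutorial:
# Toy end-to-end RAG flow: retrieve relevant documents, then generate an answer from them.
def retrieve(question, knowledge_base, top_k=2):
    # Naive keyword-overlap "retrieval"; the real app uses ChromaDB embedding similarity search
    words = question.lower().split()
    scored = [(sum(w in doc.lower() for w in words), doc) for doc in knowledge_base]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_k] if score > 0]

def generate(question, context_docs):
    # Stand-in for the LLM call: builds the grounded prompt a real model (e.g. via Ollama) would answer
    context = "\n".join(context_docs)
    return f"Answer based ONLY on this context:\n{context}\n\nQuestion: {question}"

knowledge_base = [
    "Ollama runs large language models locally.",
    "ChromaDB stores document embeddings for similarity search.",
]
question = "Where are embeddings stored?"
print(generate(question, retrieve(question, knowledge_base)))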
GPU
Running large language models (LLMs) like the ones used in Retrieval-Augmented
Generation (RAG) requires significant computational power. One of the key
components that enable efficient processing and embedding of data in these
models is the Graphics Processing Unit (GPU). Here's why GPUs are essential for
this task and how they impact the performance of your local LLM setup:
What is a GPU?
A GPU is a specialized processor designed to accelerate the rendering of images
and videos. Unlike Central Processing Units (CPUs), which are optimized for
sequential processing tasks, GPUs excel at parallel processing. This makes them
particularly well-suited for the complex mathematical computations required by
machine learning and deep learning models.
When choosing a GPU for local LLM workloads, consider the following factors:
Memory Capacity: Larger models require more GPU memory. Look for GPUs with higher VRAM (video RAM) to accommodate extensive datasets and model parameters.
Compute Capability: The more CUDA cores a GPU has, the better it can handle
parallel processing tasks. GPUs with higher compute capabilities are more
efficient for deep learning tasks.
Bandwidth: Higher memory bandwidth allows for faster data transfer between
the GPU and its memory, improving overall processing speed.
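If you are running on an NVIDIA GPU, you can check which card and how much VRAM you have before pulling large models. This assumes the NVIDIA driver and its bundled nvidia-smi utility are installed:
$ nvidia-smi --query-gpu=name,memory.total --format=csv
# prints your GPU model and its total VRAM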
Prerequisites
Before diving into the setup, ensure you have the following prerequisites in place.
Check that Python 3 is installed:
$ python3 --version
# Python 3.11.7
Create a project folder, then create and activate a virtual environment:
$ mkdir local-rag
$ cd local-rag
$ python3 -m venv venv
$ source venv/bin/activate
# Windows
# venv\Scripts\activate
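With the virtual environment active, install the Python packages the app will import. The exact package list below is an assumption based on the code later in this tutorial, so adjust it to your setup:
$ pip install flask python-dotenv langchain langchain-community langchain-text-splitters chromadb "unstructured[pdf]"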
Make sure Ollama is installed and check its version:
$ ollama --version
# ollama version is 0.1.47
Pull the LLM model you need. For example, to use the Mistral model:
$ ollama pull mistral
Pull the text embedding model. For instance, to use the Nomic Embed Text model:
$ ollama pull nomic-embed-text
Then start the Ollama server so the app can reach the models:
$ ollama serve
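To confirm the server is up, you can hit Ollama's default local endpoint (it listens on port 11434 unless configured otherwise):
$ curl http://localhost:11434
# Ollama is running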
app.py
This is the main Flask application file. It defines routes for embedding files into the vector database and for retrieving responses from the model.
import os
from dotenv import load_dotenv

load_dotenv()

from flask import Flask, request, jsonify
from embed import embed
from query import query

app = Flask(__name__)

@app.route('/embed', methods=['POST'])
def route_embed():
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400

    embedded = embed(file)
    if embedded:
        return jsonify({"message": "File embedded successfully"}), 200

    return jsonify({"error": "File embedded unsuccessfully"}), 400

@app.route('/query', methods=['POST'])
def route_query():
    data = request.get_json()
    response = query(data.get('query'))
    if response:
        return jsonify({"message": response}), 200

    return jsonify({"error": "Something went wrong"}), 400

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080, debug=True)  # adjust host/port as needed
embed.py
This module handles the embedding process, including saving uploaded files,
loading and splitting data, and adding documents to the vector database.
import os
from datetime import datetime
from werkzeug.utils import secure_filename
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from get_vector_db import get_vector_db

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
os.makedirs(TEMP_FOLDER, exist_ok=True)

# Save the uploaded file to the temp folder under a timestamped name
def save_file(file):
    file_path = os.path.join(TEMP_FOLDER, f"{datetime.now().timestamp()}_{secure_filename(file.filename)}")
    file.save(file_path)
    return file_path

# Function to load and split the data from the PDF file into chunks
def load_and_split_data(file_path):
    loader = UnstructuredPDFLoader(file_path=file_path)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    chunks = text_splitter.split_documents(data)
    return chunks

# Save the upload, split it into chunks, and add the chunks to the vector database
def embed(file):
    if file.filename != '' and file.filename.lower().endswith('.pdf'):
        file_path = save_file(file)
        chunks = load_and_split_data(file_path)
        db = get_vector_db()
        db.add_documents(chunks)
        db.persist()
        os.remove(file_path)
        return True
    return False
query.py
This module processes user queries by generating multiple versions of the query,
retrieving relevant documents, and providing answers based on the context.
import os
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from get_vector_db import get_vector_db

LLM_MODEL = os.getenv('LLM_MODEL', 'mistral')

# Prompt used by the retriever to generate alternative versions of the user question (wording is illustrative)
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Generate five different versions
of the given user question to retrieve relevant documents from a vector database.
Provide these alternative questions separated by newlines. Original question: {question}""",
)

# Prompt used to answer the question based only on the retrieved context (adjust the wording as you like)
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

def query(input):
    if input:
        llm = ChatOllama(model=LLM_MODEL)
        db = get_vector_db()
        prompt = ChatPromptTemplate.from_template(template)
        # Set up the retriever to generate multiple queries using the language model
        retriever = MultiQueryRetriever.from_llm(
            db.as_retriever(),
            llm,
            prompt=QUERY_PROMPT
        )
        # Define the processing chain to retrieve context, generate the answer, and parse it
        chain = (
            {"context": retriever, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )
        response = chain.invoke(input)
        return response

    return None
get_vector_db.py
This module initializes and returns the vector database instance used for storing
and retrieving document embeddings.
import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores.chroma import Chroma

CHROMA_PATH = os.getenv('CHROMA_PATH', 'chroma')
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'local-rag')
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')

# Initialize and return the Chroma vector store backed by Ollama embeddings
def get_vector_db():
    embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, show_progress=True)

    db = Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_PATH,
        embedding_function=embedding
    )

    return db
Run your app!
Create a .env file to store your environment variables:
TEMP_FOLDER = './_temp'
CHROMA_PATH = 'chroma'
COLLECTION_NAME = 'local-rag'
LLM_MODEL = 'mistral'
TEXT_EMBEDDING_MODEL = 'nomic-embed-text'
Then start the app:
$ python3 app.py
Once the server is running, you can start making requests to the /embed and /query endpoints.
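For example, to embed a PDF (the file path below is a placeholder; the port matches the one used in app.py above):
$ curl --request POST \
  --url http://localhost:8080/embed \
  --form file=@/path/to/your-document.pdf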
Response
{
"message": "File embedded successfully"
}
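To query the embedded document, send a JSON body to the /query endpoint (the question below is only an example):
$ curl --request POST \
  --url http://localhost:8080/query \
  --header 'Content-Type: application/json' \
  --data '{ "query": "Who is Nasser Maronie?" }'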
Response
{
"message": "Nasser Maronie is a Full Stack Developer with experience in web and mobile ..."
}
Conclusion
By following these instructions, you can effectively run and interact with your
custom local RAG app using Python, Ollama, and ChromaDB, tailored to your
needs. Adjust and expand the functionality as necessary to enhance the
capabilities of your application.
Happy coding!