🦞 Rag

This project implements a Retrieval-Augmented Generation (RAG) pipeline for querying lecture materials (PDF slides and .txt transcripts) using local computation only.

Users can build a searchable vector database from lecture content and ask natural-language questions to receive answers grounded in the relevant documents.

Example outputs are located in example-outputs/. They were generated with Ollama's deepseek-r1:1.5b model, as set in config.py. The BERT question (q_BERT) is answered from the retrieved context; q_backpropagation produced formulas that are not in the retrieved documents; q_visualization retrieves relevant documents and its answer is grounded in them.


📁 Code Overview

rag/
├── loader.py        # Loads .pdf and .txt files from a directory
├── splitter.py      # Splits long texts into overlapping chunks
├── embedder.py      # Embeds text chunks and stores them in ChromaDB
├── retriever.py     # Retrieves top-k similar chunks for a query
├── generator.py     # Uses an LLM to generate answers from context
├── pipeline.py      # High-level build/query functions
main.py              # CLI entry point for building DB or asking questions
llm_interface.py     # Unified interface for multiple LLM backends
config.py            # Configuration for LLM provider and model

Component Notes

  • Loader: Uses PyMuPDF for PDFs and standard file reading for text files.
  • Splitter: Uses LangChain’s RecursiveCharacterTextSplitter for robust chunking.
  • Embedder: Utilizes all-MiniLM-L6-v2 for fast, high-quality text embeddings.
  • Retriever: Queries ChromaDB for top-k similar chunks using cosine similarity.
  • Generator: Leverages an LLM (as defined in config.py) to answer questions from retrieved context.
  • Pipeline: Orchestrates all steps: load → split → embed → retrieve → generate (see the sketch below).
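
For orientation, here is a minimal sketch of the split → embed → retrieve steps using the libraries named above. The file path, collection name, chunk sizes, and persistence directory are illustrative assumptions, not the repository's actual values:

# Sketch of split → embed → retrieve; paths and names are assumptions.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
# (older LangChain versions: from langchain.text_splitter import RecursiveCharacterTextSplitter)
from sentence_transformers import SentenceTransformer

# Split one transcript into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("data/lecture01.txt").read())

# Embed the chunks and store them in a persistent ChromaDB collection.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    "lectures", metadata={"hnsw:space": "cosine"}  # cosine similarity
)
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Retrieve the top-k chunks most similar to a query.
hits = collection.query(
    query_embeddings=embedder.encode(["What is backpropagation?"]).tolist(),
    n_results=3,
)
print(hits["documents"][0])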

⚙️ Requirements

🐍 Python Virtual Environment

Requires Python 3.12.3 or higher (older versions cause issues with ChromaDB).

# Create virtual environment (one-time)
python3 -m venv ~/virtualenvs/rag_env

# Activate environment (every time)
source ~/virtualenvs/rag_env/bin/activate

# Install dependencies (one time)
pip install -r requirements.txt

🧠 LLM Provider

Recommended: Ollama

  1. Install Ollama

    curl -fsSL https://ollama.com/install.sh | sh
  2. Download a model

    ollama pull deepseek-r1:1.5b
    # or try: ollama pull llama2:7b, etc.
  3. Run the model server

    ollama serve

Want to add another LLM backend?
Just create a new class in llm_interface.py and update config.py accordingly.
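
As a rough sketch only (the class and method names here are assumptions, not the repository's actual interface), such a backend might wrap an HTTP API like Ollama's:

# Hypothetical backend sketch; llm_interface.py's real interface may differ.
import requests

class OllamaBackend:
    """Minimal client for a locally running `ollama serve` instance."""

    def __init__(self, model: str = "deepseek-r1:1.5b",
                 url: str = "http://localhost:11434/api/generate"):
        self.model = model
        self.url = url

    def generate(self, prompt: str) -> str:
        # Non-streaming request to Ollama's REST API.
        resp = requests.post(
            self.url,
            json={"model": self.model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]

config.py would then name this class (or its provider string) so the pipeline can pick it up.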


🚀 Usage

0. Activate Environment

source ~/virtualenvs/rag_env/bin/activate

1. Start the LLM Server

For Ollama:

ollama serve

2. Build the Vector Database

This loads and processes the files in the data/ directory. Be aware that you have to build the database and query it with the same LLM: each LLM uses a different tokenizer, so a database built with one model may not work when queried with another. :)

python3 main.py --build

3. Ask Questions

python3 main.py --query "What is the chain rule in backpropagation?"
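
Internally, the generator grounds the answer by stuffing the retrieved chunks into the prompt. As an illustration only (the actual template lives in generator.py and may be worded differently):

# Illustrative RAG prompt template; generator.py's real wording may differ.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the answer is not in the context, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""

retrieved_chunks = ["Backpropagation applies the chain rule layer by layer."]
prompt = PROMPT_TEMPLATE.format(
    context="\n\n".join(retrieved_chunks),
    question="What is the chain rule in backpropagation?",
)
print(prompt)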

📜 Transcript Generation

I generated transcripts from lecture videos:

  1. Download .mp4 videos using download_lectures.sh.
  2. Run Whisper on MetaCentrum via run_whisper.pbs.

I typically run 3–4 videos at a time, each taking ~20 minutes to transcribe.
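
For reference, the core of such a transcription job (roughly what run_whisper.pbs wraps; the model size here is an assumption) comes down to a few lines with the openai-whisper package:

# Sketch of a Whisper transcription; "small" is an assumed model size.
import whisper

model = whisper.load_model("small")
result = model.transcribe("lecture01.mp4")
with open("lecture01.txt", "w") as f:
    f.write(result["text"])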


⚠️ Limitations

  • Runs fine on CPU-only machines, but LLM generation is slow (~1 minute per answer).
  • On a machine with a GPU, generation would drop to just a few seconds, but the aim here was that anyone can run this locally.
  • Retrieval is fast (1–2 seconds) even without a GPU — showcasing the efficiency of the vector search.

For real-time performance, I'd need better hardware or a cloud solution — but this system is intentionally designed for fully local use. ✅


💡 Improvement Ideas

  • Add evaluation of LLM answers.
  • Add test cases (and testing in general).
  • Check whether the LLM really only combines what is in the retrieved data and does not go off on its own.
  • Extract slide-change timestamps (e.g., for templated slides, watch the bottom-left page number change in the video) to sync the PDF slides with the transcripts.
  • Add metadata filtering (e.g., by lecture title, source).
  • Build a GUI like chatpdf.com.
  • Apply RAG to other domains: legal, medical, etc.
  • Test whether it would pass Milan Straka's publicly available questions for his subjects. :)

Small remarks:

  • In the future, compare this RAG system in a TREC dataset evaluation.
  • Generation ideas for the future:
    • Let an LLM judge the quality of the answer.
    • Query rewriting: generate synonyms and subquestions to obtain better results.
    • Cosine-similarity retrieval works badly for queries about specific entities (e.g., all mentions of the Google Pixel 6a), so hybrid search with BM25 is an industry standard (a sketch follows below).
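
As a minimal illustration of the BM25 side of such a hybrid setup (assuming the rank_bm25 package, which is not a dependency of this project), keyword scores could be computed like this and then fused with the cosine-similarity scores:

# Hypothetical BM25 keyword scoring with rank_bm25; not part of this repo.
from rank_bm25 import BM25Okapi

corpus = [
    "The Google Pixel 6a has a 6.1-inch display.",
    "Backpropagation applies the chain rule layer by layer.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Score every document against a keyword-heavy query.
scores = bm25.get_scores("google pixel 6a".split())
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])  # -> the Pixel 6a sentence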
