I Built a Fully Local Paper RAG on an RTX 4060 8GB — BGE-M3 + Qwen2.5-32B + ChromaDB

Source: DEV Community
I was using GPT-4o to read ArXiv papers. Throw in a PDF, say "summarize this," get a response in 30 seconds. Convenient. Then one day I tried to batch-process 50 papers related to an internal research topic and stopped cold. Security policy — could we even send these to an external API? I asked my manager. Predictably, the answer was no. So the only option was to do everything locally. That's how this project started.

I'd already confirmed in my previous article that Qwen2.5-32B runs under llama.cpp. The LLM was there. All I needed was a system to search paper contents and feed relevant passages to the LLM — in other words, RAG. Easy to say. The real question was how to cram it all into 8GB of VRAM.

Extracting Text from ArXiv Papers — Getting Data Out of PDFs First

First step: text extraction. Pull PDFs from the ArXiv API, then convert them to text with PyMuPDF.

```python
import arxiv
import fitz  # PyMuPDF
from pathlib import Path
```
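Before the extracted text can be embedded with BGE-M3 and stored in ChromaDB, it has to be split into passages. The article doesn't show its chunking code at this point, so here is a minimal sketch of one common approach — overlapping character windows; the function name, chunk size, and overlap values are my own illustrative assumptions:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks.

    Hypothetical helper (not from the article): overlap keeps a passage's
    context intact when a sentence straddles a chunk boundary.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip whitespace-only tails
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk then becomes one embedding and one ChromaDB document; the overlap is a trade-off between index size and retrieval quality.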