Advanced RAG Techniques with Arcee Trinity Mini (100% Local)

Jan 09, 2026

In this video, we build a fully local RAG chatbot that runs entirely on a MacBook — no cloud APIs, no usage costs, complete privacy.

We use Arcee’s Trinity Mini, a 26-billion-parameter mixture-of-experts model trained for real-world enterprise tasks, including RAG, function calling, and tool use. Running in Q8 quantization through llama.cpp with Metal acceleration, it’s surprisingly capable on Apple Silicon.

This builds on a previous video where we used Arcee Conductor for cloud-based inference. Same stack — LangChain for orchestration, ChromaDB for vector storage, Gradio for the UI — but now the model runs locally.

We also explore advanced retrieval techniques:

MMR (Maximal Marginal Relevance) for diverse results
Hybrid search combining vector similarity and BM25 keyword matching
Query rewriting to clean up messy questions before retrieval
Cross-encoder re-ranking for precision after recall
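
To make the first two techniques concrete, here is a minimal pure-Python sketch. It is not the code from the video: the function names (`cosine`, `mmr`, `reciprocal_rank_fusion`), the toy vectors, and the parameter defaults are all hypothetical. Reciprocal rank fusion is only one common way to merge a vector ranking with a BM25 ranking, and it is an assumption here, since the post does not say how the two result lists are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Pick k document indices by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to documents already picked.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (c + rank) per list it appears in (rank starts at 1).
    Ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)

# Hypothetical result lists from a vector search and a BM25 search.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "d"]])
```

With these toy vectors, MMR skips the near-duplicate and picks the more distinct document second, while pure relevance ranking returns both near-duplicates. In a real pipeline the vectors would come from the embedding model and the two rankings from the vector store and a BM25 index.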

All running on a Mac. No internet required.

Julien’s Newsletter
