No cloud services. No API subscriptions. No data leaves your machine.
The only recurring cost is electricity.
Use case: Deploy a local private chatbot that answers questions using documents stored in your local vector database. All documents remain within your network and are never sent to or hosted by third-party services. If the retrieved information is insufficient to answer a question, the chatbot responds with "I don't know."
Some random ideas for future exploration:
- Training the model to have a "persona" so that no system instruction is needeed
- Training the model on the documentation directly
- 100% local inference using
llama.cpp - 100% local document indexing and retrieval
- No OpenAI, Anthropic, Google, or other API dependencies
- OpenAI-compatible local endpoint
- Simple document upload workflow
- Works offline after initial setup
- Supports custom GGUF models
- Designed for small businesses, homelabs, and privacy-focused users
User
│
▼
Chat UI
│
▼
RAG Pipeline
│
├── Retrieve Relevant Chunks
│
▼
Vector Database
│
└── Embedded Documents
│
▼
llama.cpp
│
▼
Local LLM (GGUF)
│
▼
Response
- User uploads documents.
- Documents are chunked automatically.
- Chunks are converted into embeddings.
- Embeddings are stored locally.
- User asks a question.
- Relevant chunks are retrieved.
- Retrieved context is sent to the local LLM.
- The model generates a response.
No internet connection is required after setup.
This project uses llama.cpp as the inference engine.
Using a dedicated Conda environment helps isolate dependencies and keeps your local LLM setup organized.
Install one of the following:
- Anaconda
- Miniconda (recommended)
Verify Conda is installed:
conda --versionExample output:
conda 25.x.x
Create a new environment:
conda create -n LocalLLM python=3.11 -yActivate it:
conda activate LocalLLMVerify the environment:
conda info --envsThe active environment will be marked with *.
llama.cpp is distributed through Conda-Forge.
Add the repository:
conda config --add channels conda-forgeEnable strict channel priority:
conda config --set channel_priority strictThis helps prevent dependency conflicts.
Install the package:
conda install llama.cppConda will automatically install all required dependencies.
Verify the installation completed successfully:
llama-server --helpor
llama-cli --helpYou should see a list of available command-line options.
You can also verify the package directly:
conda list llama.cppDepending on the installed version, the Conda package may include:
| Tool | Purpose |
|---|---|
llama-server |
OpenAI-compatible API server |
llama-cli |
Command-line inference |
llama-quantize |
Model quantization |
llama-bench |
Performance benchmarking |
llama-perplexity |
Perplexity testing |
llama-batched |
Batch inference examples |
Download a GGUF model from a model repository.
Recommended starter models:
- Phi-4 Mini Instruct
- Qwen 3
- Gemma 3
- Llama 3.2
Create a model directory:
C:\LLMModels\
~/LLMModels/
Place your downloaded .gguf files in this directory.
Example:
llama-server \
-m "/path/to/model.gguf" \
-c 8192 \
--host 0.0.0.0 \
--port 8080| Parameter | Description |
|---|---|
-m |
Path to the GGUF model |
-c |
Context window size |
--host |
Network interface to bind to |
--port |
API server port |
After startup, the API will be available locally:
http://localhost:8080
There's a web UI that's similar to chatgpt at http://localhost:8080/
The endpoint is compatible with:
- C#
- Python
- JavaScript
- LangChain
- LlamaIndex
- Open WebUI
- Custom RAG applications
The LocalLLM folder contains two scripts.
Run this once to install Conda, create the LocalLLM environment, and install llama.cpp:
cd LocalLLM
.\install.ps1This script will:
- Verify that Conda is available in your PATH.
- Create (or reuse) the
LocalLLMConda environment with Python 3.11. - Add the
conda-forgechannel and installllama.cpp. - Verify the installation by running
llama-server --version.
After installing, edit the model path inside the script, then run it:
.\startLocalLLM.ps1This script:
- Activates the
LocalLLMConda environment. - Starts
llama-serverwith sensible defaults (8192 context, full GPU offload). - Exposes an OpenAI-compatible API on
http://localhost:8080.
Before running: Download a GGUF model (e.g. Phi-4 Mini, Qwen3, Llama 3.2 from huggingface.co) and update the
-mpath instartLocalLLM.ps1.
- llama.cpp setup
- Local API endpoint
- Automatic document ingestion (using LangChain loaders)
- Hybrid Chunking pipeline (Recursive Character Splitting)
- Local embedding generation (Dense + FastEmbed Sparse)
- Vector storage (Qdrant)
- Semantic & Lexical Hybrid search (Dense + Sparse/BM25)
- Contextual Reranking (FlashRank Cross-Encoder)
- Web UI
- Drag-and-drop document uploads
- Chat history
- Source citations with score ratings
- A test benchmark evaluation suite for the RAG pipeline and local models
- Multi-user support
- Authentication
- Docker deployment
- Monitoring
A fully local, premium-designed RAG (Retrieval-Augmented Generation) dashboard with Qdrant, FastAPI, and React/Vite is available in the LocalRAG folder.
- Install all dependencies:
Run the automated PowerShell installer:
cd LocalRAG .\install.ps1
- Start the local LLM:
Launch
llama-serverwith embedding support enabled (i.e. using the--embeddingflag). - Launch the servers:
Run the startup script:
.\start_rag.ps1
For detailed guides, component diagrams, and RAG tuning options, please check the LocalRAG README.
Ensure Conda is installed and available in your system PATH.
Update Conda:
conda update -n base -c defaults condaThen retry the installation.
Verify:
conda activate LocalLLMCheck:
- NVIDIA drivers are installed
- CUDA-compatible build is available
- Your model launch configuration includes GPU offloading settings
The goal of hoboLocalLLM is to make local AI accessible to anyone.
The intended user experience is simple:
- Start the model.
- Upload documents.
- Ask questions.
- Receive answers.
No cloud infrastructure. No API keys. No subscriptions. No vendor lock-in.
Just a local AI assistant that runs entirely on your own hardware.
