Skip to content

MarsPeper/hoboLocalLLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hoboLocalLLM

# hoboLocalLLM Right now, most of the stuff I wrote is mainly for Windows. A fully local AI chatbot pipeline with a fully local RAG (Retrieval-Augmented Generation) system.

No cloud services. No API subscriptions. No data leaves your machine.

The only recurring cost is electricity.

Use case: Deploy a local private chatbot that answers questions using documents stored in your local vector database. All documents remain within your network and are never sent to or hosted by third-party services. If the retrieved information is insufficient to answer a question, the chatbot responds with "I don't know."

Some random ideas for future exploration:

  • Training the model to have a "persona" so that no system instruction is needeed
  • Training the model on the documentation directly

Features

  • 100% local inference using llama.cpp
  • 100% local document indexing and retrieval
  • No OpenAI, Anthropic, Google, or other API dependencies
  • OpenAI-compatible local endpoint
  • Simple document upload workflow
  • Works offline after initial setup
  • Supports custom GGUF models
  • Designed for small businesses, homelabs, and privacy-focused users

Architecture

User
 │
 ▼
Chat UI
 │
 ▼
RAG Pipeline
 │
 ├── Retrieve Relevant Chunks
 │
 ▼
Vector Database
 │
 └── Embedded Documents
 │
 ▼
llama.cpp
 │
 ▼
Local LLM (GGUF)
 │
 ▼
Response

Planned Workflow

  1. User uploads documents.
  2. Documents are chunked automatically.
  3. Chunks are converted into embeddings.
  4. Embeddings are stored locally.
  5. User asks a question.
  6. Relevant chunks are retrieved.
  7. Retrieved context is sent to the local LLM.
  8. The model generates a response.

No internet connection is required after setup.


Installing llama.cpp with Conda

This project uses llama.cpp as the inference engine.

Using a dedicated Conda environment helps isolate dependencies and keeps your local LLM setup organized.

Prerequisites

Install one of the following:

  • Anaconda
  • Miniconda (recommended)

Verify Conda is installed:

conda --version

Example output:

conda 25.x.x

Step 1: Create a Dedicated Environment

Create a new environment:

conda create -n LocalLLM python=3.11 -y

Activate it:

conda activate LocalLLM

Verify the environment:

conda info --envs

The active environment will be marked with *.


Step 2: Add the Conda-Forge Repository

llama.cpp is distributed through Conda-Forge.

Add the repository:

conda config --add channels conda-forge

Enable strict channel priority:

conda config --set channel_priority strict

This helps prevent dependency conflicts.


Step 3: Install llama.cpp

Install the package:

conda install llama.cpp

Conda will automatically install all required dependencies.


Step 4: Verify Installation

Verify the installation completed successfully:

llama-server --help

or

llama-cli --help

You should see a list of available command-line options.

You can also verify the package directly:

conda list llama.cpp

Included Utilities

Depending on the installed version, the Conda package may include:

Tool Purpose
llama-server OpenAI-compatible API server
llama-cli Command-line inference
llama-quantize Model quantization
llama-bench Performance benchmarking
llama-perplexity Perplexity testing
llama-batched Batch inference examples

Step 5: Download a Model

Download a GGUF model from a model repository.

Recommended starter models:

  • Phi-4 Mini Instruct
  • Qwen 3
  • Gemma 3
  • Llama 3.2

Create a model directory:

Windows

C:\LLMModels\

Linux/macOS

~/LLMModels/

Place your downloaded .gguf files in this directory.


Step 6: Launch the Local API Server

Example:

llama-server \
  -m "/path/to/model.gguf" \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

Command Parameters

Parameter Description
-m Path to the GGUF model
-c Context window size
--host Network interface to bind to
--port API server port

After startup, the API will be available locally:

http://localhost:8080

There's a web UI that's similar to chatgpt at http://localhost:8080/

The endpoint is compatible with:

  • C#
  • Python
  • JavaScript
  • LangChain
  • LlamaIndex
  • Open WebUI
  • Custom RAG applications

Included Scripts — LocalLLM/

The LocalLLM folder contains two scripts.

install.ps1 — One-time setup

Run this once to install Conda, create the LocalLLM environment, and install llama.cpp:

cd LocalLLM
.\install.ps1

This script will:

  • Verify that Conda is available in your PATH.
  • Create (or reuse) the LocalLLM Conda environment with Python 3.11.
  • Add the conda-forge channel and install llama.cpp.
  • Verify the installation by running llama-server --version.

startLocalLLM.ps1 — Launch the server

After installing, edit the model path inside the script, then run it:

.\startLocalLLM.ps1

This script:

  • Activates the LocalLLM Conda environment.
  • Starts llama-server with sensible defaults (8192 context, full GPU offload).
  • Exposes an OpenAI-compatible API on http://localhost:8080.

Before running: Download a GGUF model (e.g. Phi-4 Mini, Qwen3, Llama 3.2 from huggingface.co) and update the -m path in startLocalLLM.ps1.


Project Roadmap

Phase 1 - Local Inference

  • llama.cpp setup
  • Local API endpoint

Phase 2 - Local RAG

  • Automatic document ingestion (using LangChain loaders)
  • Hybrid Chunking pipeline (Recursive Character Splitting)
  • Local embedding generation (Dense + FastEmbed Sparse)
  • Vector storage (Qdrant)
  • Semantic & Lexical Hybrid search (Dense + Sparse/BM25)
  • Contextual Reranking (FlashRank Cross-Encoder)

Phase 3 - User Experience

  • Web UI
  • Drag-and-drop document uploads
  • Chat history
  • Source citations with score ratings
  • A test benchmark evaluation suite for the RAG pipeline and local models

Phase 4 - Production

  • Multi-user support
  • Authentication
  • Docker deployment
  • Monitoring

📂 Local RAG Subsystem

A fully local, premium-designed RAG (Retrieval-Augmented Generation) dashboard with Qdrant, FastAPI, and React/Vite is available in the LocalRAG folder.

Quick Start Instructions:

  1. Install all dependencies: Run the automated PowerShell installer:
    cd LocalRAG
    .\install.ps1
  2. Start the local LLM: Launch llama-server with embedding support enabled (i.e. using the --embedding flag).
  3. Launch the servers: Run the startup script:
    .\start_rag.ps1

For detailed guides, component diagrams, and RAG tuning options, please check the LocalRAG README.


Troubleshooting

Conda Command Not Found

Ensure Conda is installed and available in your system PATH.

Package Not Found

Update Conda:

conda update -n base -c defaults conda

Then retry the installation.

Wrong Environment Active

Verify:

conda activate LocalLLM

GPU Not Being Used

Check:

  • NVIDIA drivers are installed
  • CUDA-compatible build is available
  • Your model launch configuration includes GPU offloading settings

Goal of This Project

The goal of hoboLocalLLM is to make local AI accessible to anyone.

The intended user experience is simple:

  1. Start the model.
  2. Upload documents.
  3. Ask questions.
  4. Receive answers.

No cloud infrastructure. No API keys. No subscriptions. No vendor lock-in.

Just a local AI assistant that runs entirely on your own hardware.

About

hoboLocalLLM, a fully local AI chat bot pipe line with a fully local RAG, no fees (except electricity)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors