Installing `llama.cpp` with Conda

# hoboLocalLLM Right now, most of the stuff I wrote is mainly for Windows. A fully local AI chatbot pipeline with a fully local RAG (Retrieval-Augmented Generation) system.

No cloud services. No API subscriptions. No data leaves your machine.

The only recurring cost is electricity.

Use case: Deploy a local private chatbot that answers questions using documents stored in your local vector database. All documents remain within your network and are never sent to or hosted by third-party services. If the retrieved information is insufficient to answer a question, the chatbot responds with "I don't know."

Some random ideas for future exploration:

Training the model to have a "persona" so that no system instruction is needeed
Training the model on the documentation directly

Features

100% local inference using llama.cpp
100% local document indexing and retrieval
No OpenAI, Anthropic, Google, or other API dependencies
OpenAI-compatible local endpoint
Simple document upload workflow
Works offline after initial setup
Supports custom GGUF models
Designed for small businesses, homelabs, and privacy-focused users

Architecture

User
 │
 ▼
Chat UI
 │
 ▼
RAG Pipeline
 │
 ├── Retrieve Relevant Chunks
 │
 ▼
Vector Database
 │
 └── Embedded Documents
 │
 ▼
llama.cpp
 │
 ▼
Local LLM (GGUF)
 │
 ▼
Response

Planned Workflow

User uploads documents.
Documents are chunked automatically.
Chunks are converted into embeddings.
Embeddings are stored locally.
User asks a question.
Relevant chunks are retrieved.
Retrieved context is sent to the local LLM.
The model generates a response.

No internet connection is required after setup.

Installing `llama.cpp` with Conda

This project uses llama.cpp as the inference engine.

Using a dedicated Conda environment helps isolate dependencies and keeps your local LLM setup organized.

Prerequisites

Install one of the following:

Anaconda
Miniconda (recommended)

Verify Conda is installed:

conda --version

Example output:

conda 25.x.x

Step 1: Create a Dedicated Environment

Create a new environment:

conda create -n LocalLLM python=3.11 -y

Activate it:

conda activate LocalLLM

Verify the environment:

conda info --envs

The active environment will be marked with *.

Step 2: Add the Conda-Forge Repository

llama.cpp is distributed through Conda-Forge.

Add the repository:

conda config --add channels conda-forge

Enable strict channel priority:

conda config --set channel_priority strict

This helps prevent dependency conflicts.

Step 3: Install `llama.cpp`

Install the package:

conda install llama.cpp

Conda will automatically install all required dependencies.

Step 4: Verify Installation

Verify the installation completed successfully:

llama-server --help

or

llama-cli --help

You should see a list of available command-line options.

You can also verify the package directly:

conda list llama.cpp

Included Utilities

Depending on the installed version, the Conda package may include:

Tool	Purpose
`llama-server`	OpenAI-compatible API server
`llama-cli`	Command-line inference
`llama-quantize`	Model quantization
`llama-bench`	Performance benchmarking
`llama-perplexity`	Perplexity testing
`llama-batched`	Batch inference examples

Step 5: Download a Model

Download a GGUF model from a model repository.

Recommended starter models:

Phi-4 Mini Instruct
Qwen 3
Gemma 3
Llama 3.2

Create a model directory:

Windows

C:\LLMModels\

Linux/macOS

~/LLMModels/

Place your downloaded .gguf files in this directory.

Step 6: Launch the Local API Server

Example:

llama-server \
  -m "/path/to/model.gguf" \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

Command Parameters

Parameter	Description
`-m`	Path to the GGUF model
`-c`	Context window size
`--host`	Network interface to bind to
`--port`	API server port

After startup, the API will be available locally:

http://localhost:8080

There's a web UI that's similar to chatgpt at http://localhost:8080/

The endpoint is compatible with:

C#
Python
JavaScript
LangChain
LlamaIndex
Open WebUI
Custom RAG applications

Included Scripts — `LocalLLM/`

The LocalLLM folder contains two scripts.

`install.ps1` — One-time setup

Run this once to install Conda, create the LocalLLM environment, and install llama.cpp:

cd LocalLLM
.\install.ps1

This script will:

Verify that Conda is available in your PATH.
Create (or reuse) the LocalLLM Conda environment with Python 3.11.
Add the conda-forge channel and install llama.cpp.
Verify the installation by running llama-server --version.

`startLocalLLM.ps1` — Launch the server

After installing, edit the model path inside the script, then run it:

.\startLocalLLM.ps1

This script:

Activates the LocalLLM Conda environment.
Starts llama-server with sensible defaults (8192 context, full GPU offload).
Exposes an OpenAI-compatible API on http://localhost:8080.

Before running: Download a GGUF model (e.g. Phi-4 Mini, Qwen3, Llama 3.2 from huggingface.co) and update the -m path in startLocalLLM.ps1.

Project Roadmap

Phase 1 - Local Inference

llama.cpp setup
Local API endpoint

Phase 2 - Local RAG

Automatic document ingestion (using LangChain loaders)
Hybrid Chunking pipeline (Recursive Character Splitting)
Local embedding generation (Dense + FastEmbed Sparse)
Vector storage (Qdrant)
Semantic & Lexical Hybrid search (Dense + Sparse/BM25)
Contextual Reranking (FlashRank Cross-Encoder)

Phase 3 - User Experience

Web UI
Drag-and-drop document uploads
Chat history
Source citations with score ratings
A test benchmark evaluation suite for the RAG pipeline and local models

Phase 4 - Production

Multi-user support
Authentication
Docker deployment
Monitoring

📂 Local RAG Subsystem

A fully local, premium-designed RAG (Retrieval-Augmented Generation) dashboard with Qdrant, FastAPI, and React/Vite is available in the LocalRAG folder.

Quick Start Instructions:

Install all dependencies: Run the automated PowerShell installer:
```
cd LocalRAG
.\install.ps1
```
Start the local LLM: Launch llama-server with embedding support enabled (i.e. using the --embedding flag).
Launch the servers: Run the startup script:
```
.\start_rag.ps1
```

For detailed guides, component diagrams, and RAG tuning options, please check the LocalRAG README.

Troubleshooting

Conda Command Not Found

Ensure Conda is installed and available in your system PATH.

Package Not Found

Update Conda:

conda update -n base -c defaults conda

Then retry the installation.

Wrong Environment Active

Verify:

conda activate LocalLLM

GPU Not Being Used

Check:

NVIDIA drivers are installed
CUDA-compatible build is available
Your model launch configuration includes GPU offloading settings

Goal of This Project

The goal of hoboLocalLLM is to make local AI accessible to anyone.

The intended user experience is simple:

Start the model.
Upload documents.
Ask questions.
Receive answers.

No cloud infrastructure. No API keys. No subscriptions. No vendor lock-in.

Just a local AI assistant that runs entirely on your own hardware.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
LocalLLM		LocalLLM
LocalRAG		LocalRAG
assets		assets
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Features

Architecture

Planned Workflow

Installing llama.cpp with Conda

Prerequisites

Step 1: Create a Dedicated Environment

Step 2: Add the Conda-Forge Repository

Step 3: Install llama.cpp

Step 4: Verify Installation

Included Utilities

Step 5: Download a Model

Windows

Linux/macOS

Step 6: Launch the Local API Server

Command Parameters

Included Scripts — LocalLLM/

install.ps1 — One-time setup

startLocalLLM.ps1 — Launch the server

Project Roadmap

Phase 1 - Local Inference

Phase 2 - Local RAG

Phase 3 - User Experience

Phase 4 - Production

📂 Local RAG Subsystem

Quick Start Instructions:

Troubleshooting

Conda Command Not Found

Package Not Found

Wrong Environment Active

GPU Not Being Used

Goal of This Project

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Installing `llama.cpp` with Conda

Step 3: Install `llama.cpp`

Included Scripts — `LocalLLM/`

`install.ps1` — One-time setup

`startLocalLLM.ps1` — Launch the server

Packages