Self-Hosted ChatGPT: Run Open WebUI with Local LLMs (Complete Guide)
Deploy a private ChatGPT alternative with Open WebUI and Ollama. Complete Docker Compose setup with model selection, RAG document upload, web search, multi-user config, and security hardening.

Why Self-Host a ChatGPT Alternative?
Every message you send to ChatGPT, Claude, or Gemini travels to someone else's server. For personal use, that's a reasonable trade-off. For a team handling proprietary code, medical records, legal documents, or internal strategy -- it's a non-starter. Self-hosting gives you complete data sovereignty: every conversation stays on hardware you control, with zero API costs after the initial setup.
Open WebUI is the leading open-source frontend for local LLMs. It connects to Ollama (or any OpenAI-compatible API), supports multiple users, and ships with features that rival commercial offerings -- RAG document upload, web browsing, image generation, conversation branching, and more. I've been running it for my team for six months, and it has replaced our ChatGPT Team subscription entirely.
What Is Open WebUI?
Definition: Open WebUI is an open-source, self-hosted web interface for interacting with large language models. It connects to model backends like Ollama, llama.cpp, or any OpenAI-compatible API, providing a ChatGPT-like experience with multi-user support, conversation history, document upload (RAG), web search, and administrative controls -- all running on your own infrastructure.
Think of Open WebUI as the frontend and Ollama as the backend. Ollama handles downloading, quantizing, and serving models. Open WebUI provides the chat interface, user management, and power features. Together, they form a complete self-hosted ChatGPT replacement.
Prerequisites and Hardware Requirements
Before you start, here's what you need:
| Component | Minimum | Recommended | For Teams (5-10 users) |
|---|---|---|---|
| CPU | 4 cores | 8+ cores (AVX2 support) | 16+ cores |
| RAM | 16GB | 32GB | 64GB+ |
| GPU (optional) | None (CPU-only works) | 8GB+ VRAM (RTX 3060/4060) | 24GB+ VRAM (RTX 4090) |
| Storage | 50GB SSD | 200GB NVMe SSD | 500GB+ NVMe SSD |
| Docker | Docker Engine 20.10+ | Docker Engine 24+ | Docker Engine 24+ |
| OS | Linux, macOS, WSL2 | Linux (Ubuntu 22.04+) | Linux (Ubuntu 22.04+) |
Watch out: Without a GPU, you're limited to smaller models (7B-14B parameters) at slower speeds. A 7B model on CPU generates around 14-18 tokens/sec on a modern desktop -- usable, but noticeably slower than the instant feel of ChatGPT. If you plan to serve multiple users or run 70B+ models, a dedicated GPU is strongly recommended.
Step 1: Install Ollama
Ollama is the model runtime. Install it first:
# Linux / WSL2
curl -fsSL https://ollama.com/install.sh | sh
# macOS (via Homebrew)
brew install ollama
# Verify installation
ollama --version
Ollama runs as a system service and exposes an API on port 11434 by default. On Linux, it starts automatically via systemd. On macOS, it runs as a background application.
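As a quick sanity check that the service is up and the API is reachable (assuming Ollama is running on the default port 11434), you can query two of its standard endpoints:

```shell
# Returns the server version as JSON, e.g. {"version":"..."}
curl -s http://localhost:11434/api/version
# Lists locally available models via the API (empty until you pull one)
curl -s http://localhost:11434/api/tags
```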
Step 2: Pull Models
Download the models you want to use. Here are the best options by use case and hardware tier:
| Model | Parameters | VRAM / RAM (Q4) | Best For | Speed (RTX 4090) |
|---|---|---|---|---|
| Llama 4 Scout | 17B active (109B total MoE) | ~12GB | General purpose, multilingual | ~55 t/s |
| Qwen 3 8B | 8B | ~5GB | Coding, reasoning, multilingual | ~95 t/s |
| Qwen 3 32B | 32B | ~20GB | Complex reasoning, analysis | ~35 t/s |
| Mistral Small 3.1 | 24B | ~15GB | Balanced quality/speed, tool use | ~45 t/s |
| DeepSeek-R1 14B | 14B | ~9GB | Math, reasoning, chain-of-thought | ~65 t/s |
| Llama 3.3 70B | 70B | ~42GB | Maximum quality (needs big GPU) | ~18 t/s |
| Gemma 3 4B | 4B | ~3GB | Lightweight, fast responses | ~130 t/s |
# Pull models -- each download runs once, models are cached locally
ollama pull qwen3:8b
ollama pull llama4-scout
ollama pull mistral-small3.1
ollama pull deepseek-r1:14b
# List downloaded models
ollama list
# Quick test
ollama run qwen3:8b "Explain Docker volumes in two sentences."
Pro tip: Start with Qwen 3 8B. It punches well above its weight class for coding and general tasks, runs on modest hardware, and generates tokens fast. Add larger models later once you've confirmed your hardware handles them well. You can switch models mid-conversation in Open WebUI.
Step 3: Deploy with Docker Compose
The production setup uses Docker Compose to orchestrate Open WebUI, Ollama, SearXNG (for web search), and ChromaDB (for RAG document storage). Here's the complete stack:
# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    # Uncomment for NVIDIA GPU passthrough:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=4       # concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=2  # models kept in memory
      - OLLAMA_KEEP_ALIVE=10m       # unload idle models after 10 min

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - DEFAULT_USER_ROLE=user
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_WEB_SEARCH_ENGINE=searxng
      - SEARXNG_QUERY_URL=http://searxng:8080/search?q=<query>&format=json
      - RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
      - CHROMA_HTTP_HOST=chromadb
      - CHROMA_HTTP_PORT=8000
    depends_on:
      - ollama
      - searxng
      - chromadb

  searxng:
    image: searxng/searxng:latest
    container_name: searxng
    restart: unless-stopped
    volumes:
      - searxng-data:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=http://localhost:8080/
    ports:
      - "8888:8080"

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    restart: unless-stopped
    volumes:
      - chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=false

volumes:
  ollama-data:
  open-webui-data:
  searxng-data:
  chroma-data:
# Start the entire stack
docker compose up -d
# Check all services are healthy
docker compose ps
# View logs
docker compose logs -f open-webui
Open your browser to http://localhost:3000. The first user to register becomes the admin. If you set ENABLE_SIGNUP=false, the admin creates additional accounts manually through the admin panel.
GPU Passthrough Setup
For NVIDIA GPUs, you need the NVIDIA Container Toolkit installed on the host:
# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU access inside Docker
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Then uncomment the deploy section in the Ollama service definition and restart with docker compose up -d.
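Once the deploy section is active, it's worth confirming that the running container actually sees the GPU (a quick check, assuming the stack is up):

```shell
# nvidia-smi inside the Ollama container should list your GPU
docker exec ollama nvidia-smi
# After the first request, `ollama ps` shows whether the loaded
# model landed on GPU or fell back to CPU
docker exec ollama ollama ps
```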
Model Preloading and Warm-Up
By default, Ollama loads a model into memory on first request, which adds 5-30 seconds of latency. To preload your primary model at boot:
# Add to crontab or a systemd timer
@reboot sleep 30 && curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"warmup","stream":false}' > /dev/null

# Or create a preload script
#!/bin/bash
MODELS=("qwen3:8b" "mistral-small3.1")
for model in "${MODELS[@]}"; do
  echo "Preloading $model..."
  # Inner quotes must be escaped so the JSON survives shell expansion
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"hello\",\"stream\":false}" > /dev/null
done
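If you prefer systemd over cron, a oneshot unit wired to boot does the same job. This is a sketch; the unit name and the script path /usr/local/bin/preload-models.sh are assumptions you should adapt:

```shell
# Create a oneshot systemd unit that runs the preload script after boot
cat <<'EOF' | sudo tee /etc/systemd/system/ollama-preload.service
[Unit]
Description=Preload Ollama models
After=network-online.target ollama.service

[Service]
Type=oneshot
ExecStartPre=/bin/sleep 30
ExecStart=/usr/local/bin/preload-models.sh

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable ollama-preload.service
```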
Multi-User Configuration
Open WebUI supports full multi-user setups with role-based access. Key admin settings to configure after first login:
- User roles: Admin, User, and Pending. Set DEFAULT_USER_ROLE=pending to require admin approval for new accounts.
- Model permissions: Restrict which models specific users can access. Useful for limiting expensive large models to senior team members.
- Shared conversations: Users can share conversation links within the instance. Admins can view all conversations if needed.
- Custom model presets: Create system-prompt templates (e.g., "Code Reviewer," "Technical Writer") that users can select as conversation modes.
- LDAP / OAuth: Integrate with existing identity providers for single sign-on. Supports Google, Microsoft, GitHub, and generic OIDC providers.
Power Features Worth Configuring
RAG: Document Upload and Querying
Open WebUI's RAG pipeline lets users upload PDFs, Word documents, text files, and web pages directly into a conversation. Documents are chunked, embedded, and stored in ChromaDB. When a user asks a question, relevant chunks are retrieved and injected into the model's context. This turns your local LLM into a knowledge base that can answer questions about your specific documents -- no data leaves your server.
# Fine-tune RAG in docker-compose environment variables
- RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
- RAG_CHUNK_SIZE=1000
- RAG_CHUNK_OVERLAP=200
- RAG_TOP_K=5
- RAG_RELEVANCE_THRESHOLD=0.3
Web Browsing via SearXNG
With SearXNG integrated, users can toggle web search per message. The system queries SearXNG, retrieves top results, scrapes content, and feeds it to the model as context. This gives your local LLM access to current information without sending your prompts to Google or Bing. SearXNG itself is a meta-search engine that aggregates results from multiple providers anonymously.
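One common stumbling block: a stock SearXNG instance only serves HTML results, while the integration above requests format=json. If web searches fail, check that the json format is enabled in the settings.yml stored in the searxng-data volume (a minimal sketch of the relevant section):

```yaml
# /etc/searxng/settings.yml (inside the searxng container)
search:
  formats:
    - html
    - json   # required for Open WebUI's format=json queries
```

Restart the searxng container after editing for the change to take effect.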
Image Generation
Open WebUI supports image generation via AUTOMATIC1111's Stable Diffusion WebUI or ComfyUI backends. Configure the connection in admin settings, and users can generate images directly in chat. For teams that need image generation without sending prompts to DALL-E or Midjourney, this is a complete local alternative.
Conversation Branching
One of the most underrated features: you can branch a conversation at any point, creating alternative response paths. Ask the same question to different models, or explore multiple approaches to a problem without losing the original thread. Each branch maintains its own history and can be continued independently.
Security: Reverse Proxy and Access Control
Never expose Open WebUI directly to the internet. Place it behind a reverse proxy with TLS termination:
# Caddyfile (simplest option)
chat.yourdomain.com {
    reverse_proxy localhost:3000
}

# Nginx alternative
# Note: limit_req_zone must be declared in the http {} context,
# not inside the server block.
limit_req_zone $binary_remote_addr zone=chat:10m rate=10r/s;

server {
    listen 443 ssl http2;
    server_name chat.yourdomain.com;
    ssl_certificate /etc/letsencrypt/live/chat.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.yourdomain.com/privkey.pem;

    location / {
        # Rate limiting
        limit_req zone=chat burst=20 nodelay;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (required for streaming responses)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Additional security measures worth implementing:
- Firewall rules: Block direct access to ports 11434 (Ollama), 8000 (ChromaDB), and 8888 (SearXNG) from external networks. Only Open WebUI should communicate with these services.
- Network isolation: Use a dedicated Docker network so backend services are not reachable from other containers or the host network.
- Backup: Schedule regular backups of the open-webui-data volume. This contains all conversations, user accounts, and uploaded documents.
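A complementary hardening step: publish the backend ports only on the loopback interface in docker-compose.yml, so Ollama, ChromaDB, and SearXNG are unreachable from other machines even if a firewall rule is missed. Open WebUI still reaches them over the internal Docker network by service name. A sketch of the changed port mappings:

```yaml
# Bind backend services to localhost only
services:
  ollama:
    ports:
      - "127.0.0.1:11434:11434"
  chromadb:
    ports:
      - "127.0.0.1:8000:8000"
  searxng:
    ports:
      - "127.0.0.1:8888:8080"
```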
Alternatives to Open WebUI
Open WebUI is the most feature-complete option, but these are worth evaluating depending on your needs:
| Project | Strengths | Best For | Limitations |
|---|---|---|---|
| LibreChat | Multi-provider (OpenAI, Anthropic, local), plugin system | Teams using both cloud and local models | More complex setup, heavier resource use |
| LobeChat | Polished UI, plugin marketplace, TTS/STT | Consumer-grade experience | Less focus on local-first deployment |
| text-generation-webui | Maximum model control, quantization options | ML engineers, model experimentation | Complex UI, single-user oriented |
| Jan | Desktop app, offline-first, simple | Individual users, non-technical | No multi-user, limited admin controls |
| AnythingLLM | Strong RAG focus, workspace-based | Document Q&A use cases | Smaller community, fewer integrations |
Frequently Asked Questions
How does Open WebUI compare to ChatGPT in terms of quality?
It depends entirely on the model you run behind it. A Qwen 3 8B or Llama 4 Scout will handle most general tasks competently -- summarization, coding assistance, writing, Q&A -- at quality roughly comparable to GPT-3.5. For GPT-4-level quality, you need 70B+ parameter models, which require 48GB+ VRAM. The interface itself matches or exceeds ChatGPT's feature set, especially with RAG and branching capabilities.
Can I connect Open WebUI to cloud APIs like OpenAI or Anthropic?
Yes. Open WebUI supports any OpenAI-compatible API endpoint. Set the OPENAI_API_BASE_URL and OPENAI_API_KEY environment variables to connect to OpenAI, Anthropic (via a proxy), Groq, Together AI, or any other provider. You can run local models via Ollama and cloud models simultaneously, letting users choose per conversation.
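In the Compose file, hooking up a cloud provider is just two more environment variables on the open-webui service (the API key value below is a placeholder):

```yaml
  open-webui:
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OPENAI_API_BASE_URL=https://api.openai.com/v1
      - OPENAI_API_KEY=sk-your-key-here   # placeholder, use your real key
```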
What happens when multiple users query the same model at once?
Ollama handles concurrent requests by queuing them. The OLLAMA_NUM_PARALLEL setting controls how many requests are processed simultaneously (default is 1). Set it to 2-4 for small teams. Each parallel request consumes additional memory for its KV cache, so monitor RAM/VRAM usage. With a 7B model on a 24GB GPU, you can comfortably handle 4 parallel requests.
How much disk space do models consume?
Quantized models at Q4 precision use roughly 0.5-0.6 GB per billion parameters. A 7B model is about 4.5GB, a 14B model is 9GB, and a 70B model is 42GB. Ollama stores models in its data directory and deduplicates shared layers across model variants. Plan for 50-100GB if you want 3-5 models available locally.
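The rule of thumb can be sanity-checked with quick arithmetic. This sketch uses ~0.6 GB per billion parameters; real file sizes vary with quantization scheme and model architecture:

```shell
# Rough Q4 size estimate: ~0.6 GB per billion parameters (rule of thumb)
for params in 7 14 32 70; do
  awk -v p="$params" 'BEGIN { printf "%sB model: ~%.1f GB at Q4\n", p, p * 0.6 }'
done
```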
Can I fine-tune models through Open WebUI?
Not directly. Open WebUI is an inference frontend, not a training tool. However, you can create custom Modelfiles in Ollama that set system prompts, temperature, and other parameters to tailor model behavior. For actual fine-tuning, use tools like Unsloth, Axolotl, or the Hugging Face TRL library, then import the resulting model into Ollama with ollama create.
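A minimal Modelfile sketch, assuming qwen3:8b is already pulled; the preset name code-reviewer is an arbitrary example:

```shell
# FROM, PARAMETER, and SYSTEM are standard Ollama Modelfile directives
cat > Modelfile <<'EOF'
FROM qwen3:8b
PARAMETER temperature 0.2
SYSTEM "You are a meticulous code reviewer. Point out bugs, security issues, and style problems, citing specific lines."
EOF
# Register the preset and try it out
ollama create code-reviewer -f Modelfile
ollama run code-reviewer "Review: def add(a, b): return a - b"
```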
Is this setup suitable for production use in a company?
For internal tools serving 5-20 users, absolutely. I've seen teams run this stack reliably for months with proper monitoring and backups. For customer-facing production at scale, you'll want more robust infrastructure: load balancing across multiple Ollama instances, dedicated model serving with vLLM or TGI, proper observability, and an SLA-backed hosting environment. Open WebUI is best suited for internal productivity tooling.
How do I update Open WebUI and Ollama?
With Docker Compose, updates are straightforward. Pull the latest images and recreate the containers. Your data persists in Docker volumes, so updates don't affect conversations or settings. Check the release notes before major version upgrades -- breaking changes are rare but do happen.
docker compose pull
docker compose up -d
From Setup to Daily Driver
The gap between commercial AI chat products and self-hosted alternatives has closed dramatically. Open WebUI with Ollama gives you a ChatGPT-equivalent experience where every byte of data stays on your hardware. The setup takes under an hour, the models keep improving every few months, and you eliminate recurring API costs entirely. Start with a single model on whatever hardware you have, add models and features as your needs grow, and stop worrying about who's reading your conversations.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.