English | 简体中文 | 繁體中文 | Русский
Deploy a complete, self-hosted AI stack on your own server with a single command.
- Zero-config: all services auto-configure on first start
- Secure: Ollama, LiteLLM, and MCP Gateway generate API keys automatically
- Private: audio, embeddings, and LLM inference all run locally — no data sent to third parties
- Optional auth: Whisper, Kokoro, and Embeddings work without API keys by default (set keys via env files for public deployments)
- Lightweight stacks for lower memory requirements (as low as ~2.5 GB)
- GPU acceleration via NVIDIA CUDA
Note: When using LiteLLM with external providers (e.g., OpenAI, Anthropic), your data will be sent to those providers.
Services included:
| Service | Role | Default port |
|---|---|---|
| Ollama (LLM) | Runs local LLM models (llama3, qwen, mistral, etc.) | 11434 |
| LiteLLM | AI gateway — routes requests to Ollama, OpenAI, Anthropic, and 100+ providers | 4000 |
| Embeddings | Converts text to vectors for semantic search and RAG | 8000 |
| Whisper (STT) | Transcribes spoken audio to text | 9000 |
| Kokoro (TTS) | Converts text to natural-sounding speech | 8880 |
| MCP Gateway | Provides MCP tools (filesystem, fetch, GitHub, search, databases) to AI clients | 3000 |
Also available:
- AI/Audio: WhisperLive (real-time STT)
- VPN: WireGuard, OpenVPN, IPsec VPN, Headscale
```mermaid
graph LR
A["🎤 Audio input"] -->|transcribe| W["Whisper<br/>(speech-to-text)"]
D["📄 Documents"] -->|embed| E["Embeddings<br/>(text → vectors)"]
E -->|store| VDB["Vector DB<br/>(Qdrant, Chroma)"]
W -->|query| E
VDB -->|context| L["LiteLLM<br/>(AI gateway)"]
W -->|text| L
L -->|routes to| O["Ollama<br/>(local LLM)"]
L -->|response| T["Kokoro TTS<br/>(text-to-speech)"]
T --> B["🔊 Audio output"]
C["🤖 AI client<br/>(Cline, Claude, etc.)"] -->|MCP tools| M["MCP Gateway<br/>(MCP endpoint)"]
C -->|chat| L
L -->|MCP protocol| M
```
Requirements:
- A Linux server (local or cloud) with Docker installed
- At least 8 GB of RAM when using small models. For larger models (8B+ parameters), 32 GB or more is recommended.
- You can comment out services you don't need to reduce memory usage.
Start the full stack:

```bash
# Clone the repository to get the compose files
git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack
docker compose up -d
```

Pull a model (required before making LLM requests):
```bash
docker exec ollama ollama_manage --pull llama3.2:3b
```

Check the logs to confirm all services are ready:
```bash
docker compose logs
```

Get the API keys:

```bash
# Ollama API key
docker exec ollama ollama_manage --showkey
# LiteLLM API key
docker exec litellm litellm_manage --getkey
# MCP Gateway API key
docker exec mcp mcp_manage --getkey
```
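As a quick sanity check, you can test the LiteLLM key against the gateway's OpenAI-compatible `/v1/models` endpoint (a standard LiteLLM proxy route); it should return the list of configured models:

```bash
LITELLM_KEY=$(docker exec litellm litellm_manage --getkey)
curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_KEY"
```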
Stop the stack:

```bash
docker compose down
```

For NVIDIA GPU acceleration, use the CUDA compose file:
```bash
docker compose -f docker-compose.cuda.yml up -d
```

Requirements: NVIDIA GPU, NVIDIA driver 535+, and the NVIDIA Container Toolkit installed on the host. CUDA images are linux/amd64 only.
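Before starting the CUDA stack, you can confirm that containers can see the GPU by running `nvidia-smi` inside a throwaway container (the CUDA base image tag below is only an example; any recent tag works):

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```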
Don't need the full stack? Use a pre-configured subset from the stacks/ folder:
| Stack | Services | Memory | Use case |
|---|---|---|---|
| voice-pipeline | Whisper + Ollama + LiteLLM + Kokoro | ~5 GB | Speech-to-text → LLM → text-to-speech |
| rag-pipeline | Ollama + LiteLLM + Embeddings | ~3 GB | Semantic search + LLM Q&A |
| ai-tools | Ollama + LiteLLM + MCP Gateway | ~3 GB | AI coding assistant with tool access |
| chat-only | Ollama + LiteLLM | ~2.5 GB | Minimal local ChatGPT replacement |
```bash
git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack/stacks/voice-pipeline # or rag-pipeline, ai-tools, chat-only
docker compose up -d
```

If you prefer using `docker run` commands directly, first create a shared network so services can communicate:
```bash
docker network create ai-stack
```

Then start each service on the shared network:

```bash
# Ollama (LLM)
docker run -d --name ollama --restart always \
--network ai-stack \
-v ollama-data:/var/lib/ollama \
hwdsl2/ollama-server
# LiteLLM (AI gateway)
docker run -d --name litellm --restart always \
--network ai-stack \
-p 4000:4000 \
-e LITELLM_OLLAMA_BASE_URL=http://ollama:11434 \
-v litellm-data:/etc/litellm \
hwdsl2/litellm-server
# Embeddings
docker run -d --name embeddings --restart always \
--network ai-stack \
-p 8000:8000 \
-v embeddings-data:/var/lib/embeddings \
hwdsl2/embeddings-server
# Whisper (STT)
docker run -d --name whisper --restart always \
--network ai-stack \
-p 9000:9000 \
-v whisper-data:/var/lib/whisper \
hwdsl2/whisper-server
# Kokoro (TTS)
docker run -d --name kokoro --restart always \
--network ai-stack \
-p 8880:8880 \
-v kokoro-data:/var/lib/kokoro \
hwdsl2/kokoro-server
# MCP Gateway
docker run -d --name mcp --restart always \
--network ai-stack \
-p 3000:3000 \
-v mcp-data:/var/lib/mcp \
hwdsl2/mcp-gateway
```

Note: The shared network allows services to reach each other by container name (e.g., LiteLLM connects to Ollama via http://ollama:11434). You can start only the services you need; they don't all have to run together.
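For a quick connectivity check, you can try to reach Ollama by container name from a throwaway container on the same network (the public curlimages/curl image is used here only as an example; even a non-200 status confirms the name resolves):

```bash
docker run --rm --network ai-stack curlimages/curl -s -o /dev/null -w "%{http_code}\n" http://ollama:11434/
```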
Pull a model (required before making LLM requests):
```bash
docker exec ollama ollama_manage --pull llama3.2:3b
```

```yaml
# In your LiteLLM config, add the MCP gateway as a tool source:
mcp_servers:
  - url: http://mcp:3000/mcp
    transport: sse
    headers:
      Authorization: "Bearer <mcp_api_key>"
```

Transcribe a spoken question, get a local LLM response via Ollama, and convert it to speech:
Tip: Need a sample audio file? Download this English speech sample (WAV, MIT License) from the Azure Samples repository:
```bash
curl -L -o sample_speech.wav \
"https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/sampledata/audiofiles/katiesteve.wav"
```

```bash
LITELLM_KEY=$(docker exec litellm litellm_manage --getkey)
# Step 1: Transcribe audio to text (Whisper)
TEXT=$(curl -s http://localhost:9000/v1/audio/transcriptions \
-F file=@sample_speech.wav -F model=whisper-1 | jq -r .text)
# Step 2: Send text to Ollama via LiteLLM and get a response
RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_KEY" \
-H "Content-Type: application/json" \
-d "{\"model\":\"ollama/llama3.2:3b\",\"messages\":[{\"role\":\"user\",\"content\":\"$TEXT\"}]}" \
| jq -r '.choices[0].message.content')
# Step 3: Convert the response to speech (Kokoro TTS)
curl -s http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d "{\"model\":\"tts-1\",\"input\":\"$RESPONSE\",\"voice\":\"af_heart\"}" \
--output response.mp3
```
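The resulting response.mp3 can be played with any audio player on the host, for example with ffmpeg's ffplay (assuming it is installed):

```bash
ffplay -autoexit -nodisp response.mp3
```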
Embed documents for semantic search, retrieve context, then answer questions with a local Ollama model:

```bash
LITELLM_KEY=$(docker exec litellm litellm_manage --getkey)
# Step 1: Embed a document chunk and store the vector in your vector DB
curl -s http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "Docker simplifies deployment by packaging apps in containers.", "model": "text-embedding-ada-002"}' \
| jq '.data[0].embedding'
# → Store the returned vector alongside the source text in Qdrant, Chroma, pgvector, etc.
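# (Optional sketch, not part of this stack) One concrete way to do the storage step,
# assuming a Qdrant instance started with: docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
# (the collection name "docs" and port 6333 are examples)
VEC=$(curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Docker simplifies deployment by packaging apps in containers.", "model": "text-embedding-ada-002"}' \
  | jq -c '.data[0].embedding')
# Create a collection sized to the embedding model's output dimension
curl -s -X PUT http://localhost:6333/collections/docs \
  -H "Content-Type: application/json" \
  -d "{\"vectors\": {\"size\": $(echo "$VEC" | jq length), \"distance\": \"Cosine\"}}"
# Upsert the vector together with its source text as payload
curl -s -X PUT http://localhost:6333/collections/docs/points \
  -H "Content-Type: application/json" \
  -d "{\"points\": [{\"id\": 1, \"vector\": $VEC, \"payload\": {\"text\": \"Docker simplifies deployment by packaging apps in containers.\"}}]}"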
# Step 2: At query time, embed the question, retrieve the top matching chunks from
# the vector DB, then send the question and retrieved context to Ollama via LiteLLM.
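# (Optional sketch, continuing the Qdrant example above) Embed the question, then
# retrieve the closest stored chunks to use as context:
QVEC=$(curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What does Docker do?", "model": "text-embedding-ada-002"}' \
  | jq -c '.data[0].embedding')
curl -s -X POST http://localhost:6333/collections/docs/points/search \
  -H "Content-Type: application/json" \
  -d "{\"vector\": $QVEC, \"limit\": 3, \"with_payload\": true}" \
  | jq -r '.result[].payload.text'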
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.2:3b",
"messages": [
{"role": "system", "content": "Answer using only the provided context."},
{"role": "user", "content": "What does Docker do?\n\nContext: Docker simplifies deployment by packaging apps in containers."}
]
}' \
| jq -r '.choices[0].message.content'
```

Use MCP Gateway to give your AI assistant access to files, web, and GitHub:
```bash
MCP_KEY=$(docker exec mcp mcp_manage --getkey)
# Use MCP endpoint with an AI client (e.g., Cline in VS Code)
# Set the MCP server URL: http://localhost:3000/mcp
# Set Authorization header: Bearer <api_key>
# Or test the MCP endpoint directly with an initialize request
curl -s http://localhost:3000/mcp \
-X POST \
-H "Authorization: Bearer $MCP_KEY" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}'
```

Each service can be configured with an optional env file. Copy the example env file from the respective repository, edit it, and uncomment the volume mount in docker-compose.yml:
| Service | Env file | Repository |
|---|---|---|
| Ollama | ollama.env | docker-ollama |
| LiteLLM | litellm.env | docker-litellm |
| Embeddings | embed.env | docker-embeddings |
| Whisper | whisper.env | docker-whisper |
| Kokoro | kokoro.env | docker-kokoro |
| MCP Gateway | mcp.env | docker-mcp-gateway |
For detailed configuration options, API reference, and model management, see the documentation in each service's repository.
To update all services to the latest versions:
```bash
docker compose pull
docker compose up -d
```

Your data is preserved in the Docker volumes.
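Optionally, remove superseded images afterwards to free disk space:

```bash
docker image prune -f
```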
Copyright (C) 2026 Lin Song
This work is licensed under the MIT License.
This project is an independent Docker configuration and is not affiliated with, endorsed by, or sponsored by Ollama, Berri AI (LiteLLM), Hugging Face, hexgrad (Kokoro), OpenAI, SYSTRAN, or MCPHub.