Skip to content

AI Inference

Local LLM inference using a multi-pod architecture with dedicated GPU nodes for both AMD ROCm and NVIDIA CUDA workloads.

Architecture

The AI server uses a multi-pod design where each component runs as a separate deployment, all sharing model files via hostPath mounts to /data/ai-models.

PodNodePurpose
ROCmaimaxllama.cpp with AMD ROCm GPU
ThorthorvLLM, ComfyUI, Speaches TTS
LiteLLManyOpenAI-compatible API gateway
Open WebUIanyChat interface (sidecar to LiteLLM)

ROCm Pod

Runs llama.cpp with ROCm GPU acceleration on the Minisforum MS-S1 Max (Ryzen AI Max+ 395). This node handles the bulk of LLM inference with 128GB of unified memory.

SettingValue
Nodeaimax
Taintrocm-inference=true:NoSchedule
GPUAMD Radeon integrated (ROCm)
ModelsServed from /data/ai-models hostPath

Thor Pod

Runs on the NVIDIA AGX Thor Jetson node with CUDA acceleration:

  • vLLM - High-throughput LLM serving
  • ComfyUI - Image generation workflows
  • Speaches - Text-to-speech and transcription
SettingValue
Nodethor
Taintcuda-inference=true:NoSchedule
GPUNVIDIA Blackwell (CUDA)
Memory128GB

LiteLLM

LiteLLM acts as a unified API gateway, exposing all models from both ROCm and CUDA backends through a single OpenAI-compatible API endpoint. Applications can switch between models and backends without changing API calls.

Open WebUI

Open WebUI provides a ChatGPT-style web interface for interacting with local models. It runs as a sidecar alongside LiteLLM, connecting through the local API gateway.

Homelab Infrastructure

Build: a625fb4