AI Inference

Local LLM inference using a multi-pod architecture with dedicated GPU nodes for both AMD ROCm and NVIDIA CUDA workloads.

Architecture

The AI server uses a multi-pod design where each component runs as a separate deployment, all sharing model files via hostPath mounts to /data/ai-models.

Pod	Node	Purpose
ROCm	aimax	llama.cpp with AMD ROCm GPU
Thor	thor	vLLM, ComfyUI, Speaches TTS
LiteLLM	any	OpenAI-compatible API gateway
Open WebUI	any	Chat interface (sidecar to LiteLLM)

ROCm Pod

Runs llama.cpp with ROCm GPU acceleration on the Minisforum MS-S1 Max (Ryzen AI Max+ 395). This node handles the bulk of LLM inference with 128GB of unified memory.

Setting	Value
Node	aimax
Taint	`rocm-inference=true:NoSchedule`
GPU	AMD Radeon integrated (ROCm)
Models	Served from `/data/ai-models` hostPath

Thor Pod

Runs on the NVIDIA AGX Thor Jetson node with CUDA acceleration:

vLLM - High-throughput LLM serving
ComfyUI - Image generation workflows
Speaches - Text-to-speech and transcription

Setting	Value
Node	thor
Taint	`cuda-inference=true:NoSchedule`
GPU	NVIDIA Blackwell (CUDA)
Memory	128GB

LiteLLM

LiteLLM acts as a unified API gateway, exposing all models from both ROCm and CUDA backends through a single OpenAI-compatible API endpoint. Applications can switch between models and backends without changing API calls.

Open WebUI

Open WebUI provides a ChatGPT-style web interface for interacting with local models. It runs as a sidecar alongside LiteLLM, connecting through the local API gateway.

AI Inference ​

Architecture ​

ROCm Pod ​

Thor Pod ​