01 The Pipeline
From wake word to spoken response — every step runs on local hardware except the LLM inference.
02 The Hardware
Waveshare ESP32-S3-Touch-LCD-1.85C-BOX — a compact round-display device purpose-built for voice interaction.
- 240 MHz CPU clock
- 16 MB flash
- Microphone via ES7210 ADC
- Speaker with PA amplifier
- Capacitive touch
- Li-Po battery
The dual-codec architecture (ES7210 ADC for mic, ES8311 DAC for speaker) enables simultaneous bidirectional audio without conflicts. Two independent I2S buses ensure capture and playback never block each other.
03 The Brain
A lean Node.js/TypeScript microservice (~500 LOC) exposing an OpenAI-compatible API. Under the hood, Claude Haiku 4.5 runs in an agentic tool-use loop — it decides what to do based on the user's request.
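The agentic tool-use loop described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `Model`, `Tool`, `Message`, and `ModelTurn` types, the `runAgentLoop` function, and the `maxSteps` limit are all hypothetical names, and the model is passed in as a plain function rather than a real Anthropic client call.

```typescript
// Hypothetical shapes for illustration; these are NOT the Anthropic SDK types.
type ToolCall = { id: string; name: string; input: Record<string, unknown> };

type ModelTurn =
  | { kind: "text"; text: string } // final answer to speak
  | { kind: "tool_use"; calls: ToolCall[] }; // model wants tools executed

type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; turn: ModelTurn }
  | { role: "tool"; callId: string; result: string };

// The model is abstracted as a function from the conversation so far to its next turn.
type Model = (history: Message[]) => Promise<ModelTurn>;
type Tool = (input: Record<string, unknown>) => Promise<string>;

async function runAgentLoop(
  model: Model,
  tools: Map<string, Tool>,
  userText: string,
  maxSteps = 8, // assumed safety limit so a confused model cannot loop forever
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userText }];

  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(history);
    history.push({ role: "assistant", turn });

    // A plain text turn means the model has finished; this is the spoken reply.
    if (turn.kind === "text") return turn.text;

    // Otherwise run every requested tool and feed the results back to the model.
    for (const call of turn.calls) {
      const tool = tools.get(call.name);
      let result: string;
      try {
        result = tool ? await tool(call.input) : `Unknown tool: ${call.name}`;
      } catch (err) {
        // Failures are reported to the model as text instead of aborting the loop.
        result = `Tool failed: ${err instanceof Error ? err.message : String(err)}`;
      }
      history.push({ role: "tool", callId: call.id, result });
    }
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}
```

Because the loop only ends on a text turn, a request that needs several tools is handled by the same code path as a single command: the model simply keeps asking for tools until it has what it needs.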
Tool Inventory
04 Infrastructure
| Component | Location | Stack |
|---|---|---|
| ESP32 Voice Satellite | Living room | ESPHome, Wyoming |
| Home Assistant | Home server | Docker, thin proxy |
| openWakeWord | Home server | Docker, "hey_jarvis" |
| Jarvis Voice Agent | Home server :3002 | Docker, Node.js 22 |
| Skyvu API (ioBroker) | Home server :3001 | Node.js, REST |
| faster-whisper STT | MacBook Pro M5 :10300 | Python, Wyoming |
| Piper TTS | MacBook Pro M5 :10200 | Python, Wyoming |
05 Design Decisions
- Claude Haiku over local LLMs: $0.01–0.02 per interaction, ~800 ms latency. Cheaper than the electricity to run a local GPU, and vastly better tool-use reasoning than any 7B model.
- HA as thin proxy, not brain: ESPHome's voice_assistant only speaks HA WebSocket natively. HA does zero intent processing and just routes audio and text between services.
- No intent classification layer: The LLM decides what to do. No regex, no NLU pipeline, no slot filling. "Turn off the kitchen lights and tell me tomorrow's weather" chains tools automatically.
- Anthropic native web search: No Tavily, no SerpAPI, no external search key. The server tool runs on Anthropic's infrastructure and returns results directly in the model context.
- German-first, TTS-optimized: No markdown, no bullet points, no URLs in responses. Numbers spoken naturally. Concise for commands, longer for knowledge queries.
- Browse-first music: Never invents Spotify URIs. Uses browse_media to discover real content IDs, then plays them. Zero hallucinated track names.
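The browse-first idea can be illustrated with a small lookup helper. This is a sketch under assumptions: `MediaItem`, `BrowseMedia`, and `findMedia` are hypothetical names, and the browse function stands in for whatever call actually wraps browse_media. The point it shows is that the only IDs ever returned are ones that came back from browsing, never ones assembled from the user's words.

```typescript
// Hypothetical shape of one entry returned by a browse call.
interface MediaItem {
  title: string;
  mediaContentId: string;
  mediaContentType: string;
}

// Hypothetical browse function: lists the children of a folder (the root when no id is given).
type BrowseMedia = (folderId?: string) => Promise<MediaItem[]>;

const normalize = (s: string): string => s.trim().toLowerCase();

// Returns a real catalog entry matching the query, or null when nothing matches.
// The function never builds an ID itself, so it cannot produce a hallucinated one.
async function findMedia(
  browse: BrowseMedia,
  query: string,
  folderId?: string,
): Promise<MediaItem | null> {
  const wanted = normalize(query);
  if (wanted === "") return null; // an empty query would otherwise match everything

  const items = await browse(folderId);
  const exact = items.find((item) => normalize(item.title) === wanted);
  const partial = items.find((item) => normalize(item.title).includes(wanted));
  return exact ?? partial ?? null;
}
```

A null result is the signal to tell the user that nothing was found, rather than to guess a URI.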
06 Running Cost
Everything else runs on existing hardware.
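As a rough worked example, the $0.01–0.02 per-interaction figure quoted under Design Decisions can be turned into a monthly estimate. The usage level is an assumption for illustration only; `monthlyCostUsd` and its parameters are hypothetical names, not part of the project.

```typescript
// Per-interaction cost range from the text, held in cents to avoid float rounding.
const COST_PER_INTERACTION_CENTS = { low: 1, high: 2 };

// Estimates the monthly LLM bill for an assumed number of voice interactions per day.
function monthlyCostUsd(
  interactionsPerDay: number,
  daysPerMonth = 30, // assumed month length
): { low: number; high: number } {
  const interactions = interactionsPerDay * daysPerMonth;
  return {
    low: (interactions * COST_PER_INTERACTION_CENTS.low) / 100,
    high: (interactions * COST_PER_INTERACTION_CENTS.high) / 100,
  };
}
```

At a hypothetical 30 interactions per day, this comes to 900 interactions per month, or roughly $9 to $18.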