STANDBY
🔶
Agent V2
Current production system
Process-per-call, Daily.co transport
Max ~300 concurrent calls
STANDBY
🟢
Agent V3
New architecture + TTS cache
Async coroutines, direct Telnyx WS
Max ~500+ concurrent calls
STANDBY
🔴
Reset
Unload ALL models from ALL GPUs
Clean slate — nothing running
8x H200 + 8x RTX 5090 idle
CURRENT MODE
No System Loaded
Click Agent V2 or Agent V3 to load a system
Calls in Queue
0
waiting for GPU slot
Active Telnyx Calls
0
SIP dials in progress
Human Conversations on GPU
0
active AI voice sessions
Total GPU Utilization
0%
all 16 GPUs combined
Dial Rate
0
simultaneous SIP calls
E2E Latency
—
human→bot response
Human Answer %
—
rolling 5-min window
Uptime
—
since mode activated
OFF
No caching
0%
hit rate
Max: 300 calls
NORMAL
Pre-defined only
~15%
hit rate
Max: 350 calls
MODERATE
Exact match + PII filter
~35%
hit rate
Max: 460 calls
AGGRESSIVE
Semantic match 0.97
~50%
hit rate
Max: 560 calls
GPU Cluster — 16 GPUs (1,384GB Total VRAM)
NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth)
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth)
5 Client Tenants Push Campaigns
Client A
IT Staffing
15K/day
Client B
Healthcare
12K/day
Client C
Industrial
10K/day
Client D
Professional
8K/day
Client E
Light Industrial
5K/day
ADAPTIVE DIALER → TELNYX SIP → AMD GATE
📠
MACHINE (60-70%)
Telnyx voicemail drop
No GPU used
🧑
HUMAN (20-30%)
Bridge to GPU pipeline
Telnyx WS → Voice AI
🔇
NO ANSWER (10-15%)
Retry with backoff
No GPU used
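As a sketch, the gate's routing could look like this; `drop_voicemail`, `bridge_to_gpu_pipeline`, and `redial` are hypothetical helpers (not part of any real Telnyx SDK), and the backoff schedule is illustrative:

```python
import asyncio
import random

MAX_RETRIES = 3  # illustrative retry cap

async def route_amd_result(call, amd_result: str, attempt: int = 0) -> None:
    # drop_voicemail, bridge_to_gpu_pipeline, and redial are hypothetical
    # helpers, not part of any real Telnyx SDK.
    if amd_result == "machine":             # 60-70% of dials
        await drop_voicemail(call)          # Telnyx-side drop, no GPU used
    elif amd_result == "human":             # 20-30% of dials
        await bridge_to_gpu_pipeline(call)  # only now is a GPU slot consumed
    elif attempt < MAX_RETRIES:             # no answer, 10-15% of dials
        # Exponential backoff with jitter before the retry.
        await asyncio.sleep((2 ** attempt) * 60 + random.uniform(0, 30))
        await redial(call, attempt + 1)
```

The key property: only the confirmed-human branch touches the GPU pipeline, which is why 70-80% of dials cost no GPU time at all.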
HUMAN CONFIRMED → GPU VOICE PIPELINE
GPU Allocation — AgentV3 (16 GPUs, 1,384GB Total)
NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth per card)
H0
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
88%
H1
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
85%
H2
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
55%
H3
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
52%
H4
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
65%
H5
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
60%
H6
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
25%
H7
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
20%
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth per card)
R0
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
60%
R1
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
58%
R2
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
55%
R3
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
53%
R4
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
50%
R5
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
48%
R6
32GB GDDR7
VAD + Audio
Silero v5 + DeepFilterNet
CPU noise suppress
1000+ (CPU-bound)
25%
R7
32GB GDDR7
OVERFLOW
SNAC burst
Auto-scale
On-demand
10%
TTS Cache reduces SNAC load by 35%: H200 TTS GPUs show ~55% util instead of ~78% because cached responses skip TTS entirely. SNAC GPUs show ~55% instead of ~85% because cached audio skips decode. The bottleneck shifts from SNAC (300 max) to LLM (500+ max).
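The "eff. 300+" and "460+" figures follow from simple arithmetic: with hit rate h, only (1 − h) of responses reach the GPU, so effective slots = physical slots / (1 − h). A quick sanity check, with hit rate and slot counts taken from the tables in this section:

```python
# Back-of-envelope capacity math for the cache figures above.

def effective_slots(physical: int, hit_rate: float) -> float:
    return physical / (1.0 - hit_rate)

print(effective_slots(200, 0.35))     # per H200 TTS GPU: ~308 -> "eff. 300+"
print(effective_slots(6 * 40, 0.35))  # 6 SNAC RTX cards: ~369 decode slots
print(effective_slots(300, 0.35))     # system: 300 base -> ~460 calls (MODERATE)
```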
VOICE PIPELINE (per call)
Voice Pipeline (with TTS Cache)
Telnyx WS ▸ DeepFilterNet ▸ Silero VAD ▸ STT ▸ LLM ▸ TTS Cache (check L1→L2→GPU, 0-120ms, 35% HIT) ▸ TTS Tokens ▸ SNAC Decode
~471ms uncached | ~170ms cached
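A minimal sketch of the L1→L2→GPU check, assuming a redis.asyncio client, zstd-compressed L2 values, and a hypothetical `synthesize_on_gpu` fallback (the real L1 is a slab-allocated mmap, per the cache section below; a plain dict stands in here):

```python
import zstandard
import redis.asyncio as redis

r = redis.Redis()
dctx = zstandard.ZstdDecompressor()
l1: dict[str, bytes] = {}  # stand-in for the slab-allocated mmap L1

async def get_tts_audio(key: str, text: str) -> bytes:
    # L1: in-process memory (32MB), effectively free.
    if (audio := l1.get(key)) is not None:
        return audio
    # L2: Redis, zstd-compressed entries (~200MB, 10K entries, 1hr TTL).
    if (blob := await r.get(f"tts:{key}")) is not None:
        audio = dctx.decompress(blob)
        l1[key] = audio  # promote to L1
        return audio
    # Miss: full GPU pipeline (Orpheus tokens -> SNAC decode), ~471ms.
    return await synthesize_on_gpu(text)  # assumed helper, not shown
```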
GPU Allocation — Select V2 or V3
Click Agent V2 or Agent V3 at the top to see GPU allocation for that architecture.
V2 REALITY: LLM, TTS tokens, and STT all run on REMOTE servers — not on these local GPUs. 12 of 16 local GPUs are completely idle. Only 4 RTX 5090s run SNAC+VAD per-call.
H200 — 141GB HBM3e × 8 = 1,128GB ALL 8 IDLE
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| H200 #0 | 141GB | IDLE | None loaded | — | 0% | LLM runs remote: 208.64.254.184:31470 |
| H200 #1 | 141GB | IDLE | None loaded | — | 0% | LLM runs remote: 208.64.254.184:31470 |
| H200 #2 | 141GB | IDLE | None loaded | — | 0% | TTS runs remote: 208.64.254.184:30755 |
| H200 #3 | 141GB | IDLE | None loaded | — | 0% | TTS fallback: 208.64.254.184:25348 |
| H200 #4 | 141GB | IDLE | None loaded | — | 0% | STT runs remote: 192.168.31.234:35992 |
| H200 #5 | 141GB | IDLE | None loaded | — | 0% | STT fallback: 192.168.31.234:35840 |
| H200 #6 | 141GB | IDLE | None loaded | — | 0% | Completely unused |
| H200 #7 | 141GB | IDLE | None loaded | — | 0% | Completely unused |
RTX 5090 — 32GB GDDR7 × 8 = 256GB (4 active, 4 idle)
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| RTX #0 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 65% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #1 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 62% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #2 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 58% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #3 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 55% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #4 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #5 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #6 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #7 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
V2 GPU Waste
12 of 16 GPUs IDLE
All 8 H200s (1,128GB) unused • 4 RTX (128GB) unused
V2 Actual Capacity
80 slots (4 RTX × 20)
Per-process SNAC+VAD • NOT batched • No cache
Remote Inference Servers (NOT on local GPUs):
LLM: Mistral-24B FP8 via vLLM @ 208.64.254.184:31470
TTS: Orpheus-3B FP8 via vLLM @ 208.64.254.184:30755 + :25348
STT: Whisper-Finetune-V4 @ 192.168.31.234:35992 + :35840
H200 — 141GB HBM3e × 8 = 1,128GB
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| H200 #0 | 141GB | LLM | Mistral-24B FP8 | ~250 (vLLM) | 88% | Cont. batching + prefix caching |
| H200 #1 | 141GB | LLM | Mistral-24B FP8 | ~250 (vLLM) | 85% | Cont. batching + prefix caching |
| H200 #2 | 141GB | TTS Tokens | Orpheus-3B FP8 | ~200 (vLLM) | 55% | 35% cached = 35% less GPU load |
| H200 #3 | 141GB | TTS Tokens | Orpheus-3B FP8 | ~200 (vLLM) | 52% | 35% cached = 35% less GPU load |
| H200 #4 | 141GB | STT | Faster-Whisper L-v3 | ~150 batched | 65% | CTranslate2 FP16, batched |
| H200 #5 | 141GB | STT | Faster-Whisper L-v3 | ~150 batched | 60% | CTranslate2 FP16, batched |
| H200 #6 | 141GB | Overflow | LLM+TTS burst | On demand | 25% | Auto-scale when #0-3 saturate |
| H200 #7 | 141GB | Overflow | LLM+TTS burst | On demand | 20% | Auto-scale when #0-3 saturate |
RTX 5090 — 32GB GDDR7 × 8 = 256GB
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| RTX #0 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 60% | Batched 16 frames + 35% cached |
| RTX #1 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 58% | Batched 16 frames + 35% cached |
| RTX #2 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 55% | Batched 16 frames + 35% cached |
| RTX #3 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 53% | Batched 16 frames + 35% cached |
| RTX #4 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 50% | Batched 16 frames + 35% cached |
| RTX #5 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 48% | Batched 16 frames + 35% cached |
| RTX #6 | 32GB | VAD + Audio | Silero v5 + DeepFilterNet | 1000+ (CPU) | 25% | CPU noise suppress + VAD |
| RTX #7 | 32GB | Overflow | SNAC burst | On demand | 10% | Auto-scale |
V3 System Bottleneck
LLM = 500+ max concurrent
TTS cache eliminates SNAC bottleneck • Batched SNAC for cache misses • Direct Telnyx WS • Shared inference
Key difference V2→V3: TTS cache (35% hit) skips H200 TTS + RTX SNAC for cached responses. Batched SNAC (16 frames/call) handles the remaining 65%. RTX utilization drops from 68-80% to 48-60%. SNAC is no longer the bottleneck — LLM is, and it scales via vLLM continuous batching.
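The batched SNAC path could be implemented as an asyncio micro-batcher along these lines; the queue shape, the `"codes"` input name, and the onnxruntime session are assumptions, not the production code:

```python
import asyncio
import numpy as np

BATCH = 16        # frames per SNAC decode batch, per the table above
MAX_WAIT = 0.010  # hold frames at most 10ms before running a partial batch

queue: asyncio.Queue = asyncio.Queue()  # items are (codes, future) pairs

async def snac_batcher(session) -> None:
    # Collect up to BATCH token frames across concurrent calls, decode
    # them in one ONNX run, then fan the audio back out. `session` is an
    # assumed onnxruntime InferenceSession for the SNAC 24kHz decoder.
    while True:
        items = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT
        while len(items) < BATCH and (left := deadline - loop.time()) > 0:
            try:
                items.append(await asyncio.wait_for(queue.get(), left))
            except asyncio.TimeoutError:
                break
        codes = np.stack([c for c, _ in items])
        outputs = await asyncio.to_thread(session.run, None, {"codes": codes})
        for (_, fut), wav in zip(items, outputs[0]):
            fut.set_result(wav)  # each call awaits its own future
```

This is the structural difference from V2: one shared session fed by many coroutines, instead of one 2GB per-call ONNX session.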
TTS Response Cache — 4-Level Switch
OFF
No caching
Every response hits GPU
0%
hit rate
NORMAL
Pre-defined phrases only
Admin-curated, zero risk
~15%
hit rate
MODERATE
Pre-defined + exact match
PII exclusion, 1hr TTL
~35%
hit rate
AGGRESSIVE
All above + semantic match
0.97 threshold, entity-safe
~50%
hit rate
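A sketch of what the AGGRESSIVE gate might look like; `embed()` and `contains_entities()` stand in for an embedding model and an NER pass, both assumptions:

```python
import numpy as np

SEMANTIC_THRESHOLD = 0.97  # cosine similarity gate from the card above

def semantic_lookup(text: str,
                    cache: list[tuple[np.ndarray, bytes]]) -> bytes | None:
    # Entity-safe: text carrying names, numbers, or dates is always a miss.
    if contains_entities(text):
        return None
    q = embed(text)
    q = q / np.linalg.norm(q)
    for emb, audio in cache:  # cached embeddings assumed pre-normalized
        if float(q @ emb) >= SEMANTIC_THRESHOLD:
            return audio
    return None
```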
L1 Cache (Memory)
32MB
Slab-allocated mmap per process
L2 Cache (Redis)
~200MB
zstd compressed, 10K entries
Cache Hit Latency
~170ms
includes 120-170ms artificial delay
Cache Miss Latency
~471ms
full GPU pipeline
GPU Load Reduction
35%
TTS + SNAC skipped on hit
Extra Concurrent Calls
+160
from 300 base to 460+
Bimodal latency fix: Cached responses add 120-170ms artificial delay (with jitter) to match the natural rhythm of uncached responses. Without this, the bot sounds "too fast" on cached phrases and normal on others.
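The fix is a one-liner in spirit, sketched here as an assumed post-lookup hook:

```python
import asyncio
import random

async def humanize_cache_hit(audio: bytes) -> bytes:
    # Cache lookups return in 0-120ms vs ~471ms for the full pipeline.
    # A jittered 120-170ms pause keeps cached and uncached responses
    # on the same conversational rhythm.
    await asyncio.sleep(random.uniform(0.120, 0.170))
    return audio
```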
Agent V2 vs V3 — Side-by-Side
| Metric | Agent V2 | Agent V3 |
| --- | --- | --- |
| Architecture | Process per call (fork) | Async coroutine per call |
| Memory per call | ~500MB | ~50KB (10,000× less) |
| Transport | Telnyx → Daily.co → Server | Telnyx → Server (direct WS) |
| Max concurrent calls | ~300 | 500+ (with cache) |
| E2E latency | ~516ms | ~471ms (uncached) / ~170ms (cached) |
| TTS caching | None | 4-level (Off/Normal/Moderate/Aggressive) |
| GPU target | Unmanaged | 90% with 40s hysteresis |
| Interruption handling | Broken | CancellationToken + Telnyx clear |
| Turn-taking | Basic VAD | State machine + backchannel detection |
| Conversation memory | None (repeats questions) | LLM fact extraction + sliding window |
| Multi-tenant | No | 5+ clients with fair-share |
| Noise suppression | Disabled | DeepFilterNet / RNNoise |
| Call recording | None | Dual-channel S3 |
| Process isolation | 1 process = 1 call | 4-8 processes, 60-75 calls each |
| Queue system | Ad-hoc Redis | arq (asyncio-native) |
| GPU monitoring | nvidia-smi subprocess | pynvml 200ms direct |
| Daily.co cost | Per-minute fees | $0 (eliminated) |
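The "GPU target" and "GPU monitoring" rows combine into a simple control loop. This sketch uses real pynvml calls, but the shed/admit gating is an assumption about how the dialer consumes the flag:

```python
import time
import pynvml

TARGET_UTIL = 90    # % GPU target from the table above
HYSTERESIS_S = 40   # keep shedding for 40s after crossing the target
POLL_S = 0.2        # 200ms direct NVML polls (vs V2's nvidia-smi subprocess)

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

shed_until = 0.0
while True:
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    if max(utils) > TARGET_UTIL:
        shed_until = time.monotonic() + HYSTERESIS_S  # restart the 40s window
    admit_new_calls = time.monotonic() >= shed_until  # flag the dialer reads
    time.sleep(POLL_S)
```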
A/B Testing Plan: Run V2 for 1 hour, then switch to V3 for 1 hour. Compare call quality, latency, voice quality, conversation quality, GPU utilization, concurrent calls achieved, and cache hit rate. V2 remains the production default until V3 is validated.