STANDBY
🔶
Agent V2
Current production system
Process-per-call, Daily.co transport
Max ~300 concurrent calls
STANDBY
🟢
Agent V3
New architecture + TTS cache
Async coroutines, direct Telnyx WS
Max ~500+ concurrent calls
STANDBY
🔴
Reset
Unload ALL models from ALL GPUs
Clean slate — nothing running
8x H200 + 8x RTX 5090 idle
CURRENT MODE
No System Loaded
Click Agent V2 or Agent V3 to load a system
Calls in Queue
0
waiting for GPU slot
Active Telnyx Calls
0
SIP dials in progress
Human Conversations on GPU
0
active AI voice sessions
Total GPU Utilization
0%
all 16 GPUs combined
Dial Rate
0
simultaneous SIP calls
Cache Hit Rate
N/A (V2)
E2E Latency
human→bot response
Human Answer %
rolling 5-min
Calls/Hour
0
all clients
Uptime
since mode activated
G

GPU Cluster — 16 GPUs (1,384GB Total VRAM)

NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth)
H0
141GB HBM3e
IDLE
0%
H1
141GB HBM3e
IDLE
0%
H2
141GB HBM3e
IDLE
0%
H3
141GB HBM3e
IDLE
0%
H4
141GB HBM3e
IDLE
0%
H5
141GB HBM3e
IDLE
0%
H6
141GB HBM3e
IDLE
0%
H7
141GB HBM3e
IDLE
0%
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth)
R0
32GB GDDR7
IDLE
0%
R1
32GB GDDR7
IDLE
0%
R2
32GB GDDR7
IDLE
0%
R3
32GB GDDR7
IDLE
0%
R4
32GB GDDR7
IDLE
0%
R5
32GB GDDR7
IDLE
0%
R6
32GB GDDR7
IDLE
0%
R7
32GB GDDR7
IDLE
0%
1

5 Client Tenants Push Campaigns

Client A
IT Staffing
15K/day
Client B
Healthcare
12K/day
Client C
Industrial
10K/day
Client D
Professional
8K/day
Client E
Light Industrial
5K/day
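The V3 column of the comparison table notes fair-share scheduling across clients. A minimal sketch of one way that could work, splitting available dial slots in proportion to each tenant's daily volume (the tenant names, volumes, and the `dial_plan` helper are illustrative, not the production scheduler):

```python
# Hypothetical fair-share split: tenants receive dial slots in proportion
# to their daily campaign volume (numbers from the tenant cards above).
TENANTS = {"A": 15000, "B": 12000, "C": 10000, "D": 8000, "E": 5000}

def dial_plan(total_slots: int) -> dict[str, int]:
    """Split available dial slots proportionally; leftovers go to the largest tenants."""
    total = sum(TENANTS.values())
    plan = {t: (v * total_slots) // total for t, v in TENANTS.items()}
    leftover = total_slots - sum(plan.values())
    for t in sorted(TENANTS, key=TENANTS.get, reverse=True)[:leftover]:
        plan[t] += 1
    return plan
```

With 500 slots this yields 150/120/100/80/50 for A through E, matching the 15K:12K:10K:8K:5K ratio.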
ADAPTIVE DIALER → TELNYX SIP → AMD (ANSWERING MACHINE DETECTION) GATE
📠
MACHINE (60-70%)
Telnyx voicemail drop
No GPU used
🧑
HUMAN (20-30%)
Bridge to GPU pipeline
Telnyx WS → Voice AI
🔇
NO ANSWER (10-15%)
Retry with backoff
No GPU used
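The three gate outcomes above can be sketched as a single routing decision; only confirmed humans consume GPU, and no-answers are requeued with exponential backoff. Function name, delay constants, and jitter range are assumptions for illustration, not the real dialer logic:

```python
import random

def route_amd(result: str, attempt: int) -> tuple[str, float]:
    """Return (action, retry_delay_seconds) for one AMD gate result."""
    if result == "human":
        return "bridge_to_gpu", 0.0        # hand off to the GPU voice pipeline
    if result == "machine":
        return "voicemail_drop", 0.0       # Telnyx-side drop, no GPU used
    # no answer: back off 2^attempt minutes with jitter, capped at one hour
    delay = min(60.0 * (2 ** attempt), 3600.0)
    return "requeue", delay * random.uniform(0.8, 1.2)
```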
HUMAN CONFIRMED → GPU VOICE PIPELINE
G

GPU Allocation — AgentV3 (16 GPUs, 1,384GB Total)

NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth per card)
H0
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
88%
H1
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
85%
H2
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
55%
H3
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
52%
H4
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
65%
H5
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
60%
H6
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
25%
H7
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
20%
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth per card)
R0
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
60%
R1
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
58%
R2
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
55%
R3
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
53%
R4
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
50%
R5
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
48%
R6
32GB GDDR7
VAD + Audio
Silero v5 + DeepFilterNet
CPU noise suppress
1000+ (CPU-bound)
25%
R7
32GB GDDR7
OVERFLOW
SNAC burst
Auto-scale
On-demand
10%
TTS Cache reduces SNAC load by 35%: H200 TTS GPUs show ~55% util instead of ~78% because cached responses skip TTS entirely. SNAC GPUs show ~55% instead of ~85% because cached audio skips decode. The bottleneck shifts from SNAC (300 max) to LLM (500+ max).
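The arithmetic behind that note: a stage that is skipped entirely on a cache hit carries roughly (1 − hit rate) of its uncached load. A quick check against the dashboard's figures:

```python
HIT_RATE = 0.35  # the 35% cache hit rate quoted above

def effective_util(uncached_util: float) -> float:
    """Expected utilization when cache hits bypass the stage entirely."""
    return uncached_util * (1.0 - HIT_RATE)

snac = effective_util(0.85)   # ~0.55, matching the ~55% SNAC figure above
tts = effective_util(0.78)    # ~0.51, close to the ~55% shown for TTS
```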
VOICE PIPELINE (per call)
Voice Pipeline (with TTS Cache)
Audio In
Telnyx WS
5ms
DeepFilter
Noise (CPU)
3ms
VAD
Silero v5
1ms
STT
Whisper L-v3
150ms
H200
LLM
Mistral-24B
200ms
H200
TTS Cache
Check L1→L2→GPU
0-120ms
35% HIT
SNAC
Batched ONNX
5ms
RTX
Audio Out
Telnyx WS
5ms
~471ms uncached | ~170ms cached
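The per-call pipeline above, combined with the "async coroutine per call" architecture in the V2/V3 table, suggests a shape like the following. This is an illustrative sketch only: the stage functions (`stt`, `llm`, `tts`, `send`) are placeholders for the real Whisper, Mistral, and Orpheus/SNAC services:

```python
import asyncio

async def handle_call(frames, stt, llm, tts, send):
    """One coroutine per call: stream frames in, stream audio out."""
    async for frame in frames:       # Telnyx WS audio in
        text = await stt(frame)      # Whisper on H200 (after VAD/denoise)
        if not text:
            continue                 # VAD: utterance not finished yet
        reply = await llm(text)      # Mistral-24B on H200
        audio = await tts(reply)     # cache check, else Orpheus + SNAC
        await send(audio)            # Telnyx WS audio out
```

Because each call is a coroutine rather than a forked process, per-call state stays tiny, which is the basis of the ~50KB-per-call figure in the comparison table.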
G

GPU Allocation — Select V2 or V3

Click Agent V2 or Agent V3 at the top to see GPU allocation for that architecture.
$

TTS Response Cache — 4-Level Switch

OFF
No caching
Every response hits GPU
0%
hit rate
NORMAL
Pre-defined phrases only
Admin-curated, zero risk
~15%
hit rate
MODERATE
Pre-defined + exact match
PII exclusion, 1hr TTL
~35%
hit rate
AGGRESSIVE
All above + semantic match
0.97 threshold, entity-safe
~50%
hit rate
L1 Cache (Memory)
32MB
Slab-allocated mmap per process
L2 Cache (Redis)
~200MB
zstd compressed, 10K entries
Cache Hit Latency
~170ms
includes 120-170ms artificial delay
Cache Miss Latency
~471ms
full GPU pipeline
GPU Load Reduction
35%
TTS + SNAC skipped on hit
Extra Concurrent Calls
+160
from 300 base to 460+
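The L1 → L2 → GPU check described above can be sketched as a simple fallthrough with promotion on hit. Plain dicts stand in for the mmap slab (L1) and the Redis store (L2); a real deployment would use redis-py with zstd-compressed values and the 1-hour TTL noted in the cache-switch card:

```python
import hashlib

L1: dict[str, bytes] = {}   # stand-in for the ~32MB in-process slab
L2: dict[str, bytes] = {}   # stand-in for the ~200MB Redis tier

def cache_key(text: str, voice: str) -> str:
    """Key on normalized response text plus voice id."""
    return hashlib.sha256(f"{voice}:{text.strip().lower()}".encode()).hexdigest()

def lookup_tts(text: str, voice: str, synthesize) -> tuple[bytes, str]:
    """Check L1 -> L2 -> GPU, promoting hits upward. Returns (audio, source)."""
    key = cache_key(text, voice)
    if key in L1:
        return L1[key], "L1"
    if key in L2:
        L1[key] = L2[key]           # promote to L1 for the next hit
        return L1[key], "L2"
    audio = synthesize(text)        # full GPU path: TTS tokens + SNAC decode
    L1[key] = L2[key] = audio
    return audio, "GPU"
```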
Bimodal latency fix: Cached responses add 120-170ms artificial delay (with jitter) to match the natural rhythm of uncached responses. Without this, the bot sounds "too fast" on cached phrases and normal on others.
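That pacing fix amounts to a jittered sleep before sending any cache hit. A minimal sketch, assuming the 120-170ms window from the dashboard (the function and its signature are illustrative):

```python
import asyncio
import random

CACHE_DELAY_MS = (120, 170)  # pad window from the bimodal-latency note

async def paced_cached_response(audio: bytes, send) -> None:
    """Delay a cache hit by a jittered interval so it matches uncached rhythm."""
    delay_ms = random.uniform(*CACHE_DELAY_MS)
    await asyncio.sleep(delay_ms / 1000.0)
    await send(audio)
```

The jitter matters as much as the delay: a fixed pad would make cached phrases sound mechanically uniform instead of merely not-too-fast.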
T

GPU-Aware Adaptive Throttle

Core Formula
dial_rate = gpu_capacity / human_answer_%
Target: 90% GPU | Hysteresis: 40s | Smoothing: ±25%/tick
GPU < 70% for 40s
RAMP UP
Dial rate × 1.25 (push more calls)
GPU > 90% for 40s
SLOW DOWN
Dial rate × 0.75 (reduce calls)
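The ramp/slow rules above translate to a small state machine: the utilization must hold outside the 70-90% band for the full 40s hysteresis window before the rate changes. A hedged sketch (class name and tick interface are assumptions, not the production throttle):

```python
LOW, HIGH = 0.70, 0.90   # GPU utilization band from the card above
HOLD_S = 40.0            # hysteresis window

class DialThrottle:
    def __init__(self, rate: float):
        self.rate = rate
        self._below_since = None   # when util first dropped under LOW
        self._above_since = None   # when util first rose over HIGH

    def tick(self, gpu_util: float, now: float) -> float:
        """Feed one utilization sample; returns the (possibly adjusted) dial rate."""
        if gpu_util < LOW:
            self._above_since = None
            if self._below_since is None:
                self._below_since = now
            if now - self._below_since >= HOLD_S:
                self.rate *= 1.25          # RAMP UP: push more calls
                self._below_since = now    # restart the hysteresis window
        elif gpu_util > HIGH:
            self._below_since = None
            if self._above_since is None:
                self._above_since = now
            if now - self._above_since >= HOLD_S:
                self.rate *= 0.75          # SLOW DOWN: reduce calls
                self._above_since = now
        else:
            self._below_since = self._above_since = None
        return self.rate
```

Resetting the timestamps whenever utilization re-enters the band is what prevents oscillation: a brief dip below 70% never triggers a ramp on its own.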
AB

Agent V2 vs V3 — Side-by-Side

Metric
Agent V2
Agent V3
Architecture
Process per call (fork)
Async coroutine per call
Memory per call
~500MB
~50KB (10,000x less)
Transport
Telnyx → Daily.co → Server
Telnyx → Server (direct WS)
Max concurrent calls
~300
500+ (with cache)
E2E latency
~516ms
~471ms (uncached) / ~170ms (cached)
TTS caching
None
4-level (Off/Normal/Moderate/Aggressive)
GPU target
Unmanaged
90% with 40s hysteresis
Interruption handling
Broken
CancellationToken + Telnyx clear
Turn-taking
Basic VAD
State machine + backchannel detection
Conversation memory
None (repeats questions)
LLM fact extraction + sliding window
Multi-tenant
No
5+ clients with fair-share
Noise suppression
Disabled
DeepFilterNet / RNNoise
Call recording
None
Dual-channel S3
Process isolation
1 process = 1 call
4-8 processes, 60-75 calls each
Queue system
Ad-hoc Redis
arq (asyncio-native)
GPU monitoring
nvidia-smi subprocess
pynvml 200ms direct
Daily.co cost
Per-minute fees
$0 (eliminated)
A/B Testing Plan: Run V2 for one hour, then V3 for one hour. Compare latency, voice quality, conversation quality, GPU utilization, concurrent calls achieved, and cache hit rate. V2 remains the production default until V3 is validated.
AgentV3 Command Center • JobTalk.ai • 2026