STANDBY
🔶
Agent V2
Current production system
Process-per-call, Daily.co transport
Max ~300 concurrent calls
STANDBY
🟢
Agent V3
New architecture + TTS cache
Async coroutines, direct Telnyx WS
Max ~500+ concurrent calls
STANDBY
🔴
Reset
Unload ALL models from ALL GPUs
Clean slate — nothing running
8x H200 + 8x RTX 5090 idle
CURRENT MODE
No System Loaded
Click Agent V2 or Agent V3 to load a system
Calls in Queue
0
waiting for GPU slot
Active Telnyx Calls
0
SIP dials in progress
Human Conversations on GPU
0
active AI voice sessions
Total GPU Utilization
0%
all 16 GPUs combined
Dial Rate
0
simultaneous SIP calls
E2E Latency
—
human→bot response
Human Answer %
—
rolling 5-min window
Uptime
—
since mode activated
OFF
No caching
0%
hit rate
Max: 300 calls
NORMAL
Pre-defined only
~15%
hit rate
Max: 350 calls
MODERATE
Exact match + PII filter
~35%
hit rate
Max: 460 calls
AGGRESSIVE
Semantic match 0.97
~50%
hit rate
Max: 560 calls
GPU Cluster — 16 GPUs (1,384GB Total VRAM)
NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth)
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth)
5 Client Tenants Push Campaigns
Client A
IT Staffing
15K/day
Client B
Healthcare
12K/day
Client C
Industrial
10K/day
Client D
Professional
8K/day
Client E
Light Industrial
5K/day
ADAPTIVE DIALER → TELNYX SIP → AMD GATE
📠
MACHINE (60-70%)
Telnyx voicemail drop
No GPU used
🧑
HUMAN (20-30%)
Bridge to GPU pipeline
Telnyx WS → Voice AI
🔇
NO ANSWER (10-15%)
Retry with backoff
No GPU used
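As a sketch, the gate's routing could look like this; `drop_voicemail`, `bridge_to_gpu_pipeline`, and `redial` are hypothetical helpers (not part of any real Telnyx SDK), and the backoff schedule is illustrative:

```python
import asyncio
import random

MAX_RETRIES = 3  # illustrative retry cap

async def route_amd_result(call, amd_result: str, attempt: int = 0) -> None:
    # drop_voicemail, bridge_to_gpu_pipeline, and redial are hypothetical
    # helpers, not part of any real Telnyx SDK.
    if amd_result == "machine":             # 60-70% of dials
        await drop_voicemail(call)          # Telnyx-side drop, no GPU used
    elif amd_result == "human":             # 20-30% of dials
        await bridge_to_gpu_pipeline(call)  # only now is a GPU slot consumed
    elif attempt < MAX_RETRIES:             # no answer, 10-15% of dials
        # Exponential backoff with jitter before the retry.
        await asyncio.sleep((2 ** attempt) * 60 + random.uniform(0, 30))
        await redial(call, attempt + 1)
```

The key property: only the confirmed-human branch touches the GPU pipeline, which is why 70-80% of dials cost no GPU time at all.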
HUMAN CONFIRMED → GPU VOICE PIPELINE
GPU Allocation — AgentV3 (16 GPUs, 1,384GB Total)
NVIDIA H200 — 141GB HBM3e × 8 = 1,128GB (4.8 TB/s bandwidth per card)
H0
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
88%
H1
141GB HBM3e
LLM
Mistral-24B FP8
vLLM cont. batching
~250 concurrent slots
85%
H2
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
55%
H3
141GB HBM3e
TTS Tokens
Orpheus-3B FP8
vLLM + 35% cached
~200 slots (eff. 300+)
52%
H4
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
65%
H5
141GB HBM3e
STT
Faster-Whisper L-v3
CTranslate2 FP16
~150 batched streams
60%
H6
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
25%
H7
141GB HBM3e
OVERFLOW
LLM+TTS burst
Auto-scale
On-demand
20%
NVIDIA RTX 5090 — 32GB GDDR7 × 8 = 256GB (1.79 TB/s bandwidth per card)
R0
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
60%
R1
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
58%
R2
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
55%
R3
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
53%
R4
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
50%
R5
32GB GDDR7
SNAC Decode
SNAC 24kHz ONNX
Batched (16 frames)
40 slots (35% cached)
48%
R6
32GB GDDR7
VAD + Audio
Silero v5 + DeepFilterNet
CPU noise suppress
1000+ (CPU-bound)
25%
R7
32GB GDDR7
OVERFLOW
SNAC burst
Auto-scale
On-demand
10%
TTS Cache reduces SNAC load by 35%: H200 TTS GPUs show ~55% util instead of ~78% because cached responses skip TTS entirely. SNAC GPUs show ~55% instead of ~85% because cached audio skips decode. The bottleneck shifts from SNAC (300 max) to LLM (500+ max).
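The "eff. 300+" and "460+" figures follow from simple arithmetic: with hit rate h, only (1 − h) of responses reach the GPU, so effective slots = physical slots / (1 − h). A quick sanity check, with hit rate and slot counts taken from the tables in this section:

```python
# Back-of-envelope capacity math for the cache figures above.

def effective_slots(physical: int, hit_rate: float) -> float:
    return physical / (1.0 - hit_rate)

print(effective_slots(200, 0.35))     # per H200 TTS GPU: ~308 -> "eff. 300+"
print(effective_slots(6 * 40, 0.35))  # 6 SNAC RTX cards: ~369 decode slots
print(effective_slots(300, 0.35))     # system: 300 base -> ~460 calls (MODERATE)
```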
VOICE PIPELINE (per call)
Voice Pipeline (with TTS Cache)
Telnyx WS ▸ DeepFilterNet ▸ Silero VAD ▸ STT ▸ LLM ▸ TTS Cache (check L1→L2→GPU, 0-120ms, 35% HIT) ▸ TTS Tokens ▸ SNAC Decode
~471ms uncached | ~170ms cached
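A minimal sketch of the L1→L2→GPU check, assuming a redis.asyncio client, zstd-compressed L2 values, and a hypothetical `synthesize_on_gpu` fallback (the real L1 is a slab-allocated mmap, per the cache section below; a plain dict stands in here):

```python
import zstandard
import redis.asyncio as redis

r = redis.Redis()
dctx = zstandard.ZstdDecompressor()
l1: dict[str, bytes] = {}  # stand-in for the slab-allocated mmap L1

async def get_tts_audio(key: str, text: str) -> bytes:
    # L1: in-process memory (32MB), effectively free.
    if (audio := l1.get(key)) is not None:
        return audio
    # L2: Redis, zstd-compressed entries (~200MB, 10K entries, 1hr TTL).
    if (blob := await r.get(f"tts:{key}")) is not None:
        audio = dctx.decompress(blob)
        l1[key] = audio  # promote to L1
        return audio
    # Miss: full GPU pipeline (Orpheus tokens -> SNAC decode), ~471ms.
    return await synthesize_on_gpu(text)  # assumed helper, not shown
```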
GPU Allocation — Select V2 or V3
Click Agent V2 or Agent V3 at the top to see GPU allocation for that architecture.
V2 REALITY: LLM, TTS tokens, and STT all run on REMOTE servers — not on these local GPUs. 12 of 16 local GPUs are completely idle. Only 4 RTX 5090s run SNAC+VAD per-call.
H200 — 141GB HBM3e × 8 = 1,128GB ALL 8 IDLE
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| H200 #0 | 141GB | IDLE | None loaded | — | 0% | LLM runs remote: 208.64.254.184:31470 |
| H200 #1 | 141GB | IDLE | None loaded | — | 0% | LLM runs remote: 208.64.254.184:31470 |
| H200 #2 | 141GB | IDLE | None loaded | — | 0% | TTS runs remote: 208.64.254.184:30755 |
| H200 #3 | 141GB | IDLE | None loaded | — | 0% | TTS fallback: 208.64.254.184:25348 |
| H200 #4 | 141GB | IDLE | None loaded | — | 0% | STT runs remote: 192.168.31.234:35992 |
| H200 #5 | 141GB | IDLE | None loaded | — | 0% | STT fallback: 192.168.31.234:35840 |
| H200 #6 | 141GB | IDLE | None loaded | — | 0% | Completely unused |
| H200 #7 | 141GB | IDLE | None loaded | — | 0% | Completely unused |
RTX 5090 — 32GB GDDR7 × 8 = 256GB (4 active, 4 idle)
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| RTX #0 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 65% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #1 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 62% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #2 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 58% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #3 | 32GB | SNAC + VAD | SNAC ONNX + Silero VAD | 20 per-process | 55% | Per-call ONNX session, 2GB cap, NOT batched |
| RTX #4 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #5 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #6 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
| RTX #7 | 32GB | IDLE | None loaded | — | 0% | Reserved but unused |
V2 GPU Waste
12 of 16 GPUs IDLE
All 8 H200s (1,128GB) unused • 4 RTX (128GB) unused
V2 Actual Capacity
80 slots (4 RTX × 20)
Per-process SNAC+VAD • NOT batched • No cache
Remote Inference Servers (NOT on local GPUs):
LLM: Mistral-24B FP8 via vLLM @ 208.64.254.184:31470
TTS: Orpheus-3B FP8 via vLLM @ 208.64.254.184:30755 + :25348
STT: Whisper-Finetune-V4 @ 192.168.31.234:35992 + :35840
H200 — 141GB HBM3e × 8 = 1,128GB
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| H200 #0 | 141GB | LLM | Mistral-24B FP8 | ~250 (vLLM) | 88% | Cont. batching + prefix caching |
| H200 #1 | 141GB | LLM | Mistral-24B FP8 | ~250 (vLLM) | 85% | Cont. batching + prefix caching |
| H200 #2 | 141GB | TTS Tokens | Orpheus-3B FP8 | ~200 (vLLM) | 55% | 35% cached = 35% less GPU load |
| H200 #3 | 141GB | TTS Tokens | Orpheus-3B FP8 | ~200 (vLLM) | 52% | 35% cached = 35% less GPU load |
| H200 #4 | 141GB | STT | Faster-Whisper L-v3 | ~150 batched | 65% | CTranslate2 FP16, batched |
| H200 #5 | 141GB | STT | Faster-Whisper L-v3 | ~150 batched | 60% | CTranslate2 FP16, batched |
| H200 #6 | 141GB | Overflow | LLM+TTS burst | On demand | 25% | Auto-scale when #0-3 saturate |
| H200 #7 | 141GB | Overflow | LLM+TTS burst | On demand | 20% | Auto-scale when #0-3 saturate |
RTX 5090 — 32GB GDDR7 × 8 = 256GB
| GPU | VRAM | Role | Model | Slots | Util | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| RTX #0 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 60% | Batched 16 frames + 35% cached |
| RTX #1 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 58% | Batched 16 frames + 35% cached |
| RTX #2 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 55% | Batched 16 frames + 35% cached |
| RTX #3 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 53% | Batched 16 frames + 35% cached |
| RTX #4 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 50% | Batched 16 frames + 35% cached |
| RTX #5 | 32GB | SNAC Decode | SNAC 24kHz ONNX | 40 batched | 48% | Batched 16 frames + 35% cached |
| RTX #6 | 32GB | VAD + Audio | Silero v5 + DeepFilterNet | 1000+ (CPU) | 25% | CPU noise suppress + VAD |
| RTX #7 | 32GB | Overflow | SNAC burst | On demand | 10% | Auto-scale |
V3 System Bottleneck
LLM = 500+ max concurrent
TTS cache eliminates SNAC bottleneck • Batched SNAC for cache misses • Direct Telnyx WS • Shared inference
Key difference V2→V3: TTS cache (35% hit) skips H200 TTS + RTX SNAC for cached responses. Batched SNAC (16 frames/call) handles the remaining 65%. RTX utilization drops from 68-80% to 48-60%. SNAC is no longer the bottleneck — LLM is, and it scales via vLLM continuous batching.
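The batched SNAC path could be implemented as an asyncio micro-batcher along these lines; the queue shape, the `"codes"` input name, and the onnxruntime session are assumptions, not the production code:

```python
import asyncio
import numpy as np

BATCH = 16        # frames per SNAC decode batch, per the table above
MAX_WAIT = 0.010  # hold frames at most 10ms before running a partial batch

queue: asyncio.Queue = asyncio.Queue()  # items are (codes, future) pairs

async def snac_batcher(session) -> None:
    # Collect up to BATCH token frames across concurrent calls, decode
    # them in one ONNX run, then fan the audio back out. `session` is an
    # assumed onnxruntime InferenceSession for the SNAC 24kHz decoder.
    while True:
        items = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT
        while len(items) < BATCH and (left := deadline - loop.time()) > 0:
            try:
                items.append(await asyncio.wait_for(queue.get(), left))
            except asyncio.TimeoutError:
                break
        codes = np.stack([c for c, _ in items])
        outputs = await asyncio.to_thread(session.run, None, {"codes": codes})
        for (_, fut), wav in zip(items, outputs[0]):
            fut.set_result(wav)  # each call awaits its own future
```

This is the structural difference from V2: one shared session fed by many coroutines, instead of one 2GB per-call ONNX session.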
TTS Response Cache — 4-Level Switch
OFF
No caching
Every response hits GPU
0%
hit rate
NORMAL
Pre-defined phrases only
Admin-curated, zero risk
~15%
hit rate
MODERATE
Pre-defined + exact match
PII exclusion, 1hr TTL
~35%
hit rate
AGGRESSIVE
All above + semantic match
0.97 threshold, entity-safe
~50%
hit rate
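A sketch of what the AGGRESSIVE gate might look like; `embed()` and `contains_entities()` stand in for an embedding model and an NER pass, both assumptions:

```python
import numpy as np

SEMANTIC_THRESHOLD = 0.97  # cosine similarity gate from the card above

def semantic_lookup(text: str,
                    cache: list[tuple[np.ndarray, bytes]]) -> bytes | None:
    # Entity-safe: text carrying names, numbers, or dates is always a miss.
    if contains_entities(text):
        return None
    q = embed(text)
    q = q / np.linalg.norm(q)
    for emb, audio in cache:  # cached embeddings assumed pre-normalized
        if float(q @ emb) >= SEMANTIC_THRESHOLD:
            return audio
    return None
```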
L1 Cache (Memory)
32MB
Slab-allocated mmap per process
L2 Cache (Redis)
~200MB
zstd compressed, 10K entries
Cache Hit Latency
~170ms
includes 120-170ms artificial delay
Cache Miss Latency
~471ms
full GPU pipeline
GPU Load Reduction
35%
TTS + SNAC skipped on hit
Extra Concurrent Calls
+160
from 300 base to 460+
Bimodal latency fix: Cached responses add 120-170ms artificial delay (with jitter) to match the natural rhythm of uncached responses. Without this, the bot sounds "too fast" on cached phrases and normal on others.
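The fix is a one-liner in spirit, sketched here as an assumed post-lookup hook:

```python
import asyncio
import random

async def humanize_cache_hit(audio: bytes) -> bytes:
    # Cache lookups return in 0-120ms vs ~471ms for the full pipeline.
    # A jittered 120-170ms pause keeps cached and uncached responses
    # on the same conversational rhythm.
    await asyncio.sleep(random.uniform(0.120, 0.170))
    return audio
```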
Agent V2 vs V3 — Side-by-Side
| Metric | Agent V2 | Agent V3 |
| --- | --- | --- |
| Architecture | Process per call (fork) | Async coroutine per call |
| Memory per call | ~500MB | ~50KB (10,000× less) |
| Transport | Telnyx → Daily.co → Server | Telnyx → Server (direct WS) |
| Max concurrent calls | ~300 | 500+ (with cache) |
| E2E latency | ~516ms | ~471ms (uncached) / ~170ms (cached) |
| TTS caching | None | 4-level (Off/Normal/Moderate/Aggressive) |
| GPU target | Unmanaged | 90% with 40s hysteresis |
| Interruption handling | Broken | CancellationToken + Telnyx clear |
| Turn-taking | Basic VAD | State machine + backchannel detection |
| Conversation memory | None (repeats questions) | LLM fact extraction + sliding window |
| Multi-tenant | No | 5+ clients with fair-share |
| Noise suppression | Disabled | DeepFilterNet / RNNoise |
| Call recording | None | Dual-channel S3 |
| Process isolation | 1 process = 1 call | 4-8 processes, 60-75 calls each |
| Queue system | Ad-hoc Redis | arq (asyncio-native) |
| GPU monitoring | nvidia-smi subprocess | pynvml 200ms direct |
| Daily.co cost | Per-minute fees | $0 (eliminated) |
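The "GPU target" and "GPU monitoring" rows combine into a simple control loop. This sketch uses real pynvml calls, but the shed/admit gating is an assumption about how the dialer consumes the flag:

```python
import time
import pynvml

TARGET_UTIL = 90    # % GPU target from the table above
HYSTERESIS_S = 40   # keep shedding for 40s after crossing the target
POLL_S = 0.2        # 200ms direct NVML polls (vs V2's nvidia-smi subprocess)

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

shed_until = 0.0
while True:
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    if max(utils) > TARGET_UTIL:
        shed_until = time.monotonic() + HYSTERESIS_S  # restart the 40s window
    admit_new_calls = time.monotonic() >= shed_until  # flag the dialer reads
    time.sleep(POLL_S)
```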
A/B Testing Plan: Run V2 for 1 hour, then switch to V3 for 1 hour. Compare call quality, latency, voice quality, conversation quality, GPU utilization, concurrent calls achieved, and cache hit rate. V2 remains the production default until V3 is validated.