For the ControlUp engineering team  ·  Concept brief

3-second telemetry. 6M endpoints. Run it at the edge.

ControlUp ONE collects 3-second telemetry across 6M+ seats and 2,000+ enterprise customers, then routes it through Pulse AI and Claude + GPT for autonomous endpoint management. That workload — millions of always-on agents, governed multi-LLM inference, per-endpoint state, multi-tenant isolation — maps almost 1:1 to Cloudflare's developer platform. Cloudflare didn't build these primitives for ControlUp, but they may as well have.

6M+ · Seats under management
2K+ · Enterprise tenants
3s · Telemetry cadence
330+ · Cloudflare POPs

The thesis

ControlUp isn't a monitoring tool.
It's a global stateful agent platform that happens to manage endpoints.

Every endpoint is a long-running session: heartbeat, last-seen, current alerts, remediation state, anomaly history. Every customer is an isolated tenant. Every Pulse AI call needs governance, audit, and cost attribution across two LLM vendors. That's not a DEX problem — that's a multi-tenant agentic infrastructure problem at planetary scale, and it's the workload Cloudflare's developer platform was purpose-built for.

What we noticed in your stack

Your product runs on Microsoft Azure: app.controlup.com and api.controlup.com both resolve into Azure's 20.168.x.x range, with Application Insights telemetry headers on every API response. Your marketing site is on GCP (nginx + WordPress). Your corporate DNS is on AWS Route 53. And your AI stack pairs Anthropic and OpenAI (per your TXT verifications, both confirmed in production). Three clouds, two LLM vendors, one telemetry firehose: Cloudflare's developer platform is the layer that consolidates the edge.

Value plays

Eight things Cloudflare changes for ControlUp.

Ranked by impact-per-effort for your specific workload shape — 3-second telemetry across millions of endpoints, with Pulse AI on top.

01 — Flagship

Edge ingest for 3-second telemetry

Your agents check in every 3 seconds from anywhere on Earth. Today that traffic flows to Azure regions. With Workers at 330+ POPs, agents check in to the nearest Cloudflare edge — sub-50ms ingest globally, no regional round-trip, no Azure egress on the back-haul.

Workers · Smart Placement · Magic Transit
6M endpoints × 28,800 check-ins/day each ≈ 172.8B events/day at the edge
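
The callout figure follows directly from the 3-second cadence. A minimal sketch of that arithmetic, assuming every endpoint is online around the clock (which overstates real traffic); the function name is illustrative only:

```typescript
// Back-of-envelope check-in volume for a fleet reporting on a fixed cadence.
// Assumption: every endpoint is online all day, so this is an upper bound.
const SECONDS_PER_DAY = 86400;

function checkInVolume(endpoints: number, cadenceSeconds: number) {
  const perEndpointPerDay = SECONDS_PER_DAY / cadenceSeconds;
  const perDay = endpoints * perEndpointPerDay;
  const perSecond = endpoints / cadenceSeconds;
  return { perEndpointPerDay, perDay, perSecond };
}

const fleet = checkInVolume(6000000, 3);
console.log(fleet.perEndpointPerDay); // 28800
console.log(fleet.perDay);            // 172800000000 (about 172.8B)
console.log(fleet.perSecond);         // 2000000
```

The per-second figure is the one that sizes the ingest tier: roughly two million check-ins per second at steady state.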
02 — Stateful agents

One Durable Object per endpoint

Each endpoint needs persistent state — heartbeat, last-seen, active alerts, remediation in-flight, anomaly history. Durable Objects give you a single-threaded actor per endpoint with strong consistency, geo-routing, and zero session-affinity infrastructure to manage. 6M endpoints = 6M DOs, each hibernating when idle and resuming on next check-in.

Durable Objects · Storage API · Alarms
Replaces a stateful service tier in Azure
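
To make the one-actor-per-endpoint idea concrete, here is a sketch of the state transition such an actor might own. It is plain TypeScript with in-memory state so it runs anywhere; in a Durable Object the same state would be persisted through the Storage API. The field names and the staleness rule are assumptions, not ControlUp's schema.

```typescript
// Hypothetical per-endpoint state; field names are illustrative.
interface EndpointState {
  lastSeenMs: number;
  activeAlerts: string[];
  remediationInFlight: boolean;
}

interface CheckIn {
  timestampMs: number;
  alerts: string[];
}

// One instance per endpoint ID, mirroring the one-actor-per-endpoint model.
// State lives in memory here; a Durable Object would persist it instead.
class EndpointActor {
  private state: EndpointState = {
    lastSeenMs: 0,
    activeAlerts: [],
    remediationInFlight: false,
  };

  // Apply a check-in. Out-of-order packets are ignored so state only moves forward.
  handleCheckIn(checkIn: CheckIn): EndpointState {
    if (checkIn.timestampMs > this.state.lastSeenMs) {
      this.state = {
        lastSeenMs: checkIn.timestampMs,
        activeAlerts: checkIn.alerts.slice(),
        remediationInFlight: this.state.remediationInFlight,
      };
    }
    return this.state;
  }

  // Stale once more than `missed` consecutive 3-second check-ins have been skipped.
  isStale(nowMs: number, missed = 3): boolean {
    return nowMs - this.state.lastSeenMs > missed * 3000;
  }
}
```

Because each actor processes one request at a time, the "ignore older packets" rule needs no locking.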
03 — Highest AI ROI

AI Gateway for Pulse AI's multi-LLM stack

You're running both Claude and GPT (your TXT records confirm). Pulse AI fires LLM calls constantly across 2,000+ tenants. AI Gateway sits in front of both providers and adds per-tenant cost attribution, semantic cache on repeated IT troubleshooting questions (the cache hit rate for "why is Outlook slow?" across 6M endpoints should be high), full audit logs, rate limiting, and fallback routing. Integration is a base-URL change per provider, not an application rewrite.

AI Gateway · Semantic Cache · Multi-provider
See calculator below ↓
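
In practice, fronting a provider with AI Gateway means pointing the existing SDK's base URL at the gateway endpoint. A sketch of that URL construction, assuming the documented `gateway.ai.cloudflare.com/v1/{account}/{gateway}/{provider}` shape; `ACCOUNT_ID` is a placeholder and the gateway name `pulse-ai` is hypothetical:

```typescript
type LlmProvider = "anthropic" | "openai";

// Build the per-provider base URL for one gateway.
// accountId and gatewayId are placeholders supplied by the caller.
function gatewayBaseUrl(accountId: string, gatewayId: string, provider: LlmProvider): string {
  return (
    "https://gateway.ai.cloudflare.com/v1/" +
    encodeURIComponent(accountId) + "/" +
    encodeURIComponent(gatewayId) + "/" +
    provider
  );
}

console.log(gatewayBaseUrl("ACCOUNT_ID", "pulse-ai", "anthropic"));
// https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/pulse-ai/anthropic
```

Both providers then share one gateway for logging and policy while the request bodies stay unchanged.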
04 — Tenant isolation

Per-customer runtimes with Workers for Platforms

2,000+ enterprise customers, each with their own agent rules, automation scripts, Pulse AI prompts, and remediation runbooks. Workers for Platforms dispatch namespaces give you one isolated worker per tenant — Microsoft's worker, Citrix-shop A's worker, Banking-customer-B's worker — all fully isolated, individually metered, no noisy-neighbor risk.

Workers for Platforms · Dispatch Namespaces
2,000 tenants × isolated compute, no rewrite
05 — Inference at the edge

Workers AI for anomaly detection where the telemetry lands

Anomaly Detection currently runs after telemetry batches reach Azure. With Workers AI, the model runs at the same POP that received the agent's check-in — sub-100ms anomaly detection, no regional hop. Plus Workers AI's catalog (Llama, Mistral, embedding models, Whisper) becomes a complement to your Claude + OpenAI stack for tasks where edge latency matters more than frontier capability.

Workers AI · Vectorize
3–5× faster anomaly surfacing
06 — Storage economics

R2 for telemetry archive (zero egress)

172 billion events per day adds up. Cold-storage archive of telemetry, screenshots, remediation logs, and pcaps on R2 instead of Azure Blob means zero egress when customers query historical data, when auditors need evidence, or when Pulse AI back-references patterns for similarity search. For a telemetry firehose at your scale, the egress line item is usually the silent margin tax.

R2 · Zero Egress · S3-compatible API
Typical 40–60% storage TCO reduction
07 — Knowledge / RAG

Vectorize for the AI Assistant's RCA knowledge base

Your in-console AI Assistant does root-cause analysis on signals across your platform. That's a textbook RAG workload — embed the symptoms, retrieve similar past incidents + remediations, ground the LLM. Vectorize gives you a managed vector DB at edge latency, isolated per tenant, with sub-30ms queries. Pair with Workers AI Embeddings for the indexing pipeline.

Vectorize · Workers AI Embeddings · R2
Per-tenant RAG, no external vector DB to wire
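
The retrieval step of that RAG loop reduces to nearest-neighbor search scoped to one tenant. The sketch below does it by brute force in memory to show the logic; a managed vector index would replace the scan. The record shape and two-dimensional embeddings are illustrative assumptions.

```typescript
// Hypothetical stored incident; real embeddings would have hundreds of dimensions.
interface IncidentVector {
  id: string;
  tenantId: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the IDs of the k most similar incidents, restricted to one tenant.
function topKForTenant(query: number[], incidents: IncidentVector[], tenantId: string, k: number): string[] {
  return incidents
    .filter((item) => item.tenantId === tenantId)
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((item) => item.id);
}
```

Filtering by tenant before scoring is what keeps one customer's incidents out of another customer's grounding context.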
08 — Orchestration

Workflows for autonomous remediation pipelines

"Detect → diagnose → propose remediation → execute → verify → close ticket" is a multi-step, long-running, retry-heavy workflow with checkpoints. Cloudflare Workflows is durable execution for exactly this shape — replaces Azure Logic Apps or hand-rolled Temporal clusters, lives next to the Workers + DOs handling the data.

Workflows · Queues · Cron Triggers
No external orchestrator to operate
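
The shape of that pipeline, checkpointed steps with bounded retries, can be sketched in a few lines. This is a synchronous, in-memory model of the pattern, not the Workflows API; the step labels and the plain-object checkpoint are assumptions for illustration.

```typescript
interface RemediationStep {
  label: string;
  run: () => string;
}

// Completed step results keyed by label. A durable execution engine persists
// the equivalent; a plain object stands in for it here.
type Checkpoint = { [label: string]: string };

function runPipeline(steps: RemediationStep[], checkpoint: Checkpoint, maxAttempts: number): Checkpoint {
  for (const step of steps) {
    if (checkpoint[step.label] !== undefined) continue; // finished on an earlier run
    let lastError: unknown = undefined;
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        checkpoint[step.label] = step.run();
        done = true;
      } catch (err) {
        lastError = err; // retry until attempts are exhausted
      }
    }
    if (!done) {
      throw new Error("step '" + step.label + "' failed: " + String(lastError));
    }
  }
  return checkpoint;
}
```

Passing a partially filled checkpoint back in resumes the pipeline at the first unfinished step, which is the property that makes long-running remediation safe to restart.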
Mapping

ControlUp ONE capabilities, mapped to Cloudflare primitives.

Each capability you ship maps to a specific Cloudflare developer primitive. Not approximately — exactly.

ControlUp capability | What it does | Cloudflare primitive
3-second telemetry ingest | Endpoint agents check in every 3 seconds from anywhere | Workers at 330+ POPs + Smart Placement
Per-endpoint session state | Heartbeat, last-seen, alerts, remediation state, anomaly history | Durable Objects (1 DO per endpoint, ~6M total)
Pulse AI agentic engine | Multi-LLM reasoning over telemetry across tenants | AI Gateway + Workers AI at edge
AI Assistant (in-console RCA) | Conversational root-cause analysis grounded in your data | Vectorize + Workers AI Embeddings + R2
Anomaly Detection | Behavior-based pattern learning, deviation surfacing | Workers AI + Durable Objects baseline storage
Per-customer tenant isolation | 2,000+ enterprise customers, isolated agent rules + policies | Workers for Platforms dispatch namespaces
Automation & Workflows | No-code workflows for routine fixes; autonomous remediation chains | Workflows + Queues + Cron Triggers
Telemetry archive | Historical telemetry, pcaps, screenshots, remediation logs | R2 (zero egress, S3-compatible API)
AI-Powered IT Self-Service | Conversational employee portal that resolves before tickets are filed | AI Gateway + Workers + Pages for UI
Live Remote Management | Real-time endpoint telemetry + silent remote remediation | WebSockets in Workers + Durable Objects
Quantify it

The AI Gateway cache math for Pulse AI across 2,000+ tenants.

Drag the sliders. The compounding insight: when you serve N tenants whose users ask similar IT questions, semantic caching scales with N. Across 6M endpoints, the cache-hit rate for "why is Outlook slow?" or "Teams won't connect" is the kind of math that makes CFOs lean in.

AI Gateway savings calculator

Annual LLM inference cost — with and without semantic cache

Assumes blended Claude + GPT pricing. Adjust for your actual model mix. The cache assumption baked in: cache hits cost ~5% of a full inference call (embedding lookup + small response stitch).

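Enterprise tenants: 2,000
Pulse AI calls per tenant per day: 5,000
Tokens per call: 2,500
Cache-hit rate: 50%
Blended price per 1M tokens: $15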
Total Pulse AI calls / year 3.6B
Total tokens / year 9.1T
Cost without AI Gateway $136.9M
Cost with semantic cache $72.0M
Annual savings $64.9M

Calculator is directional. Actual cache-hit rates depend on prompt structure, question repeatability, and TTL config — IT troubleshooting workloads typically run higher than the 50% default because the question space is so repetitive across tenants. AI Gateway also adds free observability, rate limiting, fallback routing, and request logging — none of which is priced into the chart above.
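
The headline numbers can be re-derived by hand. The sketch below assumes the five input values are tenants (2,000), Pulse AI calls per tenant per day (5,000), tokens per call (2,500), cache-hit rate (50%), and a blended price of $15 per million tokens. Those labels are inferred from the arithmetic, not stated by the calculator. Under that reading the call, token, and uncached-cost figures reproduce. The cached cost lands near $71.9M with the stated 5% hit cost, slightly under the $72.0M shown, so treat the last digit as rounding or an unstated term.

```typescript
interface CacheMathInputs {
  tenants: number;
  callsPerTenantPerDay: number;
  tokensPerCall: number;
  cacheHitRate: number;          // fraction between 0 and 1
  pricePerMillionTokens: number; // USD, blended across providers
  hitCostFraction: number;       // cost of a cache hit relative to a full call
}

function annualLlmCost(i: CacheMathInputs) {
  const callsPerYear = i.tenants * i.callsPerTenantPerDay * 365;
  const tokensPerYear = callsPerYear * i.tokensPerCall;
  const withoutCache = (tokensPerYear / 1e6) * i.pricePerMillionTokens;
  // Misses pay full price; hits pay only the small lookup fraction.
  const withCache = withoutCache * ((1 - i.cacheHitRate) + i.cacheHitRate * i.hitCostFraction);
  return { callsPerYear, tokensPerYear, withoutCache, withCache, savings: withoutCache - withCache };
}

const defaults = annualLlmCost({
  tenants: 2000,
  callsPerTenantPerDay: 5000,
  tokensPerCall: 2500,
  cacheHitRate: 0.5,
  pricePerMillionTokens: 15,
  hitCostFraction: 0.05,
});
console.log(defaults.callsPerYear); // 3650000000
console.log(defaults.withoutCache); // 136875000
```

Savings are linear in the hit rate: each additional point of cache hits removes about 0.95% of the uncached bill under the 5% hit-cost assumption.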

Architecture

How a single endpoint check-in flows on Cloudflare.

One Windows 11 laptop in São Paulo reports a Teams audio anomaly. Here is the full path.

1

Agent check-in hits the nearest Cloudflare POP (GRU)

The ControlUp agent on the São Paulo laptop sends its 3-second telemetry packet to agent.controlup.com, which resolves to the closest POP — São Paulo, not Azure us-east. Round-trip time drops from ~140ms to ~12ms.

Workers · Smart Placement
2

Workers for Platforms routes to the right tenant namespace

Agent hostname → dispatch namespace lookup. The customer's worker — with their custom remediation scripts, alert rules, and Pulse AI prompts — runs in an isolated runtime. Zero noisy-neighbor risk between tenants.

Workers for Platforms · Dispatch Namespaces
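
The hostname-to-tenant lookup in this step can be as small as one pure function. The sketch assumes a hypothetical convention where each tenant's agents call `<tenant>.agent.controlup.com` and the tenant's script is named `tenant-<tenant>`; neither convention comes from ControlUp.

```typescript
// Hypothetical convention: <tenant>.agent.controlup.com maps to script "tenant-<tenant>".
const AGENT_SUFFIX = ".agent.controlup.com";

// Returns the tenant script name, or null if the hostname is not a valid tenant host.
// In a dispatch Worker, the returned name would be handed to the dispatch
// namespace binding's get() to obtain the tenant's Worker.
function tenantScriptName(hostname: string): string | null {
  const host = hostname.toLowerCase();
  const cut = host.length - AGENT_SUFFIX.length;
  if (cut <= 0 || host.slice(cut) !== AGENT_SUFFIX) return null;
  const tenant = host.slice(0, cut);
  // Accept a single DNS label only, so nested subdomains cannot select another script.
  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(tenant)) return null;
  return "tenant-" + tenant;
}

console.log(tenantScriptName("acme.agent.controlup.com")); // tenant-acme
```

Rejecting anything that is not a single clean label keeps a crafted hostname from steering a request into the wrong tenant's runtime.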
3

The endpoint's Durable Object loads

One DO per endpoint, geo-pinned to GRU. Loads last 30 seconds of state in <5ms — heartbeat, active alerts, anomaly baseline, current remediation status. Tracks the new telemetry packet, updates state, hibernates when the check-in ends.

Durable Objects · Storage API
4

Anomaly Detection runs at the edge

Workers AI runs the behavior-baseline model on the new packet vs. the DO's stored baseline. Detects a Teams audio latency spike outside the normal envelope. Tags the event as anomaly-candidate, fires to the next stage. Total inference time: ~80ms, at the POP.

Workers AI · Vectorize
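
As a stand-in for the behavior-baseline model, the comparison of a new sample against a stored baseline can be illustrated with a plain z-score test. This is a deliberately simple substitute, not ControlUp's detector or a Workers AI model; the threshold of 3 standard deviations is an assumption.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Summarize recent samples (for example, Teams audio latency in ms) into a baseline.
function baselineFrom(samples: number[]): Baseline {
  const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  const variance = samples.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag a sample whose distance from the mean exceeds `threshold` standard deviations.
function isAnomalyCandidate(sample: number, baseline: Baseline, threshold = 3): boolean {
  if (baseline.stdDev <= 0) return sample !== baseline.mean;
  const z = Math.abs(sample - baseline.mean) / baseline.stdDev;
  return z > threshold;
}
```

The baseline is two numbers per metric, small enough to sit in the endpoint's own state and be read on every check-in.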
5

AI Gateway checks the Pulse AI cache

The anomaly fingerprint hits AI Gateway: "Teams audio latency, Windows 11, corporate VPN, São Paulo." Semantic search finds 47 similar incidents resolved in the last 7 days across other tenants. Cached resolution + remediation suggestion returned in 30ms. No LLM call needed.

AI Gateway · Semantic Cache
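
The lookup in this step can be modeled as a keyed cache with a time-to-live. The sketch below simplifies semantic matching to exact matching on a normalized fingerprint, which is an assumption made to keep the example self-contained; a semantic cache would compare embeddings instead. Class and method names are illustrative.

```typescript
interface CachedResolution {
  remediation: string;
  storedAtMs: number;
}

// Order-insensitive, case-insensitive key so equivalent fingerprints map to one entry.
function fingerprintKey(attributes: string[]): string {
  return attributes.map((a) => a.trim().toLowerCase()).sort().join("|");
}

class ResolutionCache {
  private entries: { [key: string]: CachedResolution } = {};
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  put(attributes: string[], remediation: string, nowMs: number): void {
    this.entries[fingerprintKey(attributes)] = { remediation, storedAtMs: nowMs };
  }

  // Returns the cached remediation, or null on a miss or an expired entry.
  get(attributes: string[], nowMs: number): string | null {
    const hit = this.entries[fingerprintKey(attributes)];
    if (!hit || nowMs - hit.storedAtMs > this.ttlMs) return null;
    return hit.remediation;
  }
}
```

A null result is the signal to fall through to a live LLM call and then `put` the answer for the next endpoint that shows the same fingerprint.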
6

If cache miss, route to Claude or GPT with policy enforcement

If novel, AI Gateway routes to the configured LLM (Claude for nuanced reasoning, GPT for structured output). Rate-limited per-tenant. Fallback routing if one provider is degraded. Full request + response logged to Logpush for audit. Per-tenant cost attribution recorded.

AI Gateway · Logpush
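
The routing policy in this step combines a per-tenant rate limit with provider fallback. AI Gateway handles both as configuration; the sketch below only models the decision logic so the behavior is easy to reason about. The fixed one-minute window, the policy shape, and the health map are assumptions.

```typescript
type ProviderName = "anthropic" | "openai";
type RouteDecision = ProviderName | "rate_limited" | "unavailable";

interface RoutingPolicy {
  preferred: ProviderName;
  fallback: ProviderName;
  maxCallsPerMinute: number;
}

class TenantRouter {
  private windowStartMs: { [tenantId: string]: number } = {};
  private counts: { [tenantId: string]: number } = {};

  route(tenantId: string, policy: RoutingPolicy, healthy: Record<ProviderName, boolean>, nowMs: number): RouteDecision {
    // Fixed-window limiter: start a fresh window once a minute has passed.
    const start = this.windowStartMs[tenantId];
    if (start === undefined || nowMs - start >= 60000) {
      this.windowStartMs[tenantId] = nowMs;
      this.counts[tenantId] = 0;
    }
    if (this.counts[tenantId] >= policy.maxCallsPerMinute) return "rate_limited";

    // Prefer the configured provider; fall back only if it is degraded.
    let chosen: ProviderName | null = null;
    if (healthy[policy.preferred]) chosen = policy.preferred;
    else if (healthy[policy.fallback]) chosen = policy.fallback;
    if (chosen === null) return "unavailable";

    this.counts[tenantId] += 1; // only routed calls consume quota
    return chosen;
  }
}
```

Counting only routed calls means an outage does not burn a tenant's quota while both providers are down.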
7

Workflows orchestrates the remediation

The Workflow starts: validate remediation against tenant policy → execute the script on the endpoint via the agent's command channel → wait for confirmation → verify the telemetry improves → close the loop. Durable, retry-able, with checkpoints.

Workflows · Queues
8

Event + remediation archived to R2, dashboards updated

Full event trace + LLM decisions + remediation outcome written to R2 (zero egress on later retrieval). Dashboards in the tenant's console update in real time via WebSocket. Pulse AI's cross-tenant pattern model learns from the new resolution. Total wall-clock time end-to-end: under 2 seconds.

R2 · WebSockets · Workers Analytics Engine
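
How the archive is keyed decides how cheaply it can be queried later. One possible layout, sketched below, puts the tenant first and the UTC date second so per-tenant listing and date-range scans are both prefix operations. The layout and identifier names are hypothetical, not an R2 requirement.

```typescript
// Build an object key of the form:
//   tenants/<tenant>/date=YYYY-MM-DD/endpoints/<endpoint>/<event>.json
function archiveKey(tenantId: string, endpointId: string, eventId: string, timestampMs: number): string {
  const d = new Date(timestampMs);
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const day = d.getUTCFullYear() + "-" + pad(d.getUTCMonth() + 1) + "-" + pad(d.getUTCDate());
  return ["tenants", tenantId, "date=" + day, "endpoints", endpointId, eventId + ".json"].join("/");
}

console.log(archiveKey("t1", "ep-9", "evt-1", Date.UTC(2025, 0, 5, 12)));
// tenants/t1/date=2025-01-05/endpoints/ep-9/evt-1.json
```

Using UTC for the date segment avoids an event landing under two different days depending on which POP wrote it.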

Let's talk about what 20M endpoints looks like.

6M is where the Azure-centric architecture still works. 20M is where the edge economics, cross-tenant cache math, and per-endpoint state model start to dominate the P&L. A 30-minute architecture conversation, no slides, no sales pitch — just the engineering math and a whiteboard.

Book 30 min with Matt Holscher
Matt Holscher · Solutions Engineer · Cloudflare Developer Platform