ControlUp ONE collects 3-second telemetry across 6M+ seats and 2,000+ enterprise customers, then routes it through Pulse AI and Claude + GPT for autonomous endpoint management. That workload — millions of always-on agents, governed multi-LLM inference, per-endpoint state, multi-tenant isolation — maps almost 1:1 to Cloudflare's developer platform. Cloudflare didn't build these primitives for ControlUp, but they may as well have.
Every endpoint is a long-running session: heartbeat, last-seen, current alerts, remediation state, anomaly history. Every customer is an isolated tenant. Every Pulse AI call needs governance, audit, and cost attribution across two LLM vendors. That's not a DEX problem — that's a multi-tenant agentic infrastructure problem at planetary scale, and it's the workload Cloudflare's developer platform was purpose-built for.
Your product runs on Microsoft Azure — app.controlup.com and api.controlup.com both resolve into Azure's 20.168.x.x range, with Application Insights telemetry headers on every API response.
Your marketing site is on GCP nginx + WordPress. Your corporate DNS is on AWS Route 53. And your AI stack pairs Anthropic and OpenAI (per your TXT verifications — both confirmed in production).
Three clouds, two LLM vendors, one telemetry firehose — Cloudflare's developer platform is the layer that consolidates the edge.
Ranked by impact-per-effort for your specific workload shape — 3-second telemetry across millions of endpoints, with Pulse AI on top.
Your agents check in every 3 seconds from anywhere on Earth. Today that traffic flows to Azure regions. With Workers at 330+ POPs, agents check in to the nearest Cloudflare edge — sub-50ms ingest globally, no regional round-trip, no Azure egress on the back-haul.
Each endpoint needs persistent state — heartbeat, last-seen, active alerts, remediation in-flight, anomaly history. Durable Objects give you a single-threaded actor per endpoint with strong consistency, geo-routing, and zero session-affinity infrastructure to manage. 6M endpoints = 6M DOs, each hibernating when idle and resuming on next check-in.
You're running both Claude and GPT (your TXT records confirm). Pulse AI fires LLM calls constantly across 2,000+ tenants. AI Gateway sits in front of both providers — per-tenant cost attribution, semantic cache on repeated IT troubleshooting questions (the cache hit rate for "why is Outlook slow?" across 6M endpoints is enormous), full audit logs, rate-limit + fallback routing. One config, no code change.
2,000+ enterprise customers, each with their own agent rules, automation scripts, Pulse AI prompts, and remediation runbooks. Workers for Platforms dispatch namespaces give you one isolated worker per tenant — Microsoft's worker, Citrix-shop A's worker, Banking-customer-B's worker — all fully isolated, individually metered, no noisy-neighbor risk.
Anomaly Detection currently runs after telemetry batches reach Azure. With Workers AI, the model runs at the same POP that received the agent's check-in — sub-100ms anomaly detection, no regional hop. Plus Workers AI's catalog (Llama, Mistral, embedding models, Whisper) becomes a complement to your Claude + OpenAI stack for tasks where edge latency matters more than frontier capability.
172 billion events per day adds up. Cold-storage archive of telemetry, screenshots, remediation logs, and pcaps on R2 instead of Azure Blob means zero egress when customers query historical data, when auditors need evidence, or when Pulse AI back-references patterns for similarity search. For a telemetry firehose at your scale, the egress line item is usually the silent margin tax.
Your in-console AI Assistant does root-cause analysis on signals across your platform. That's a textbook RAG workload — embed the symptoms, retrieve similar past incidents + remediations, ground the LLM. Vectorize gives you a managed vector DB at edge latency, isolated per tenant, with sub-30ms queries. Pair with Workers AI Embeddings for the indexing pipeline.
"Detect → diagnose → propose remediation → execute → verify → close ticket" is a multi-step, long-running, retry-heavy workflow with checkpoints. Cloudflare Workflows is durable execution for exactly this shape — replaces Azure Logic Apps or hand-rolled Temporal clusters, lives next to the Workers + DOs handling the data.
Each capability you ship maps to a specific Cloudflare developer primitive. Not approximately — exactly.
| ControlUp capability | What it does | Cloudflare primitive |
|---|---|---|
| 3-second telemetry ingest | Endpoint agents check in every 3 seconds from anywhere | Workers at 330+ POPs + Smart Placement |
| Per-endpoint session state | Heartbeat, last-seen, alerts, remediation state, anomaly history | Durable Objects (1 DO per endpoint, ~6M total) |
| Pulse AI agentic engine | Multi-LLM reasoning over telemetry across tenants | AI Gateway + Workers AI at edge |
| AI Assistant (in-console RCA) | Conversational root-cause analysis grounded in your data | Vectorize + Workers AI Embeddings + R2 |
| Anomaly Detection | Behavior-based pattern learning, deviation surfacing | Workers AI + Durable Objects baseline storage |
| Per-customer tenant isolation | 2,000+ enterprise customers, isolated agent rules + policies | Workers for Platforms dispatch namespaces |
| Automation & Workflows | No-code workflows for routine fixes; autonomous remediation chains | Workflows + Queues + Cron Triggers |
| Telemetry archive | Historical telemetry, pcaps, screenshots, remediation logs | R2 (zero egress, S3-compatible API) |
| AI-Powered IT Self-Service | Conversational employee portal that resolves before tickets are filed | AI Gateway + Workers + Pages for UI |
| Live Remote Management | Real-time endpoint telemetry + silent remote remediation | WebSockets in Workers + Durable Objects |
The compounding insight: when you serve N tenants whose users ask similar IT questions, semantic caching scales with N. Across 6M endpoints, the cache-hit rate for "why is Outlook slow?" or "Teams won't connect" is the kind of math that makes CFOs lean in.
Assumes blended Claude + GPT pricing. Adjust for your actual model mix. The cache assumption baked in: cache hits cost ~5% of a full inference call (embedding lookup + small response stitch).
Calculator is directional. Actual cache-hit rates depend on prompt structure, question repeatability, and TTL config — IT troubleshooting workloads typically run higher than the 50% default because the question space is so repetitive across tenants. AI Gateway also adds free observability, rate limiting, fallback routing, and request logging — none of which is priced into the chart above.
One Windows 11 laptop in São Paulo reports a Teams audio anomaly. Here is the full path, step by step.
1. The ControlUp agent on the São Paulo laptop sends its 3-second telemetry packet to agent.controlup.com, which resolves to the closest POP — São Paulo, not Azure us-east. Round-trip time drops from ~140ms to ~12ms.
2. Agent hostname → dispatch namespace lookup. The customer's worker — with their custom remediation scripts, alert rules, and Pulse AI prompts — runs in an isolated runtime. Zero noisy-neighbor risk between tenants.
3. One DO per endpoint, geo-pinned to GRU. Loads last 30 seconds of state in <5ms — heartbeat, active alerts, anomaly baseline, current remediation status. Tracks the new telemetry packet, updates state, hibernates when the check-in ends.
4. Workers AI runs the behavior-baseline model on the new packet vs. the DO's stored baseline. Detects a Teams audio latency spike outside the normal envelope. Tags the event as anomaly-candidate, fires to the next stage. Total inference time: ~80ms, at the POP.
5. The anomaly fingerprint hits AI Gateway: "Teams audio latency, Windows 11, corporate VPN, São Paulo." Semantic search finds 47 similar incidents resolved in the last 7 days across other tenants. Cached resolution + remediation suggestion returned in 30ms. No LLM call needed.
6. If novel, AI Gateway routes to the configured LLM (Claude for nuanced reasoning, GPT for structured output). Rate-limited per-tenant. Fallback routing if one provider is degraded. Full request + response logged to Logpush for audit. Per-tenant cost attribution recorded.
7. The Workflow starts: validate remediation against tenant policy → execute the script on the endpoint via the agent's command channel → wait for confirmation → verify the telemetry improves → close the loop. Durable, retry-able, with checkpoints.
8. Full event trace + LLM decisions + remediation outcome written to R2 (zero egress on later retrieval). Dashboards in the tenant's console update in real time via WebSocket. Pulse AI's cross-tenant pattern model learns from the new resolution. Total wall-clock time end-to-end: under 2 seconds.
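
The governance behaviours named in the LLM-routing step above (per-tenant rate limit, fallback when a provider is degraded, a full audit record, per-tenant cost attribution), modelled in plain JavaScript as an added example. It is not AI Gateway's API or configuration and not ControlUp's code; `TenantLlmRouter`, `costByTenant`, the provider objects and the `costUnits` field are all hypothetical.

```javascript
// Added illustration. All names are hypothetical; this models the behaviour, not a vendor API.
class TenantLlmRouter {
  constructor({ providers, limit, windowMs, auditLog }) {
    this.providers = providers; // ordered list: primary first, fallback after
    this.limit = limit;         // max calls per tenant per window
    this.windowMs = windowMs;
    this.auditLog = auditLog;   // append-only sink (an array here, a log pipeline in production)
    this.windows = new Map();   // tenantId -> { start, count }
  }

  // Fixed-window counter keyed by tenant: one tenant cannot exhaust another's budget.
  allow(tenantId, now) {
    const w = this.windows.get(tenantId);
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(tenantId, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }

  async complete(tenantId, prompt, now = Date.now()) {
    const rec = { tenantId, at: now, prompt, attempts: [], outcome: "rate_limited",
                  provider: null, response: null, costUnits: 0 };
    if (this.allow(tenantId, now)) {
      rec.outcome = "all_providers_failed";
      for (const p of this.providers) {
        try {
          const { text, costUnits } = await p.call(prompt);
          rec.attempts.push({ provider: p.name, ok: true });
          Object.assign(rec, { outcome: "ok", provider: p.name, response: text, costUnits });
          break;
        } catch (err) {
          // Record the failure and fall through to the next provider.
          rec.attempts.push({ provider: p.name, ok: false, error: String(err.message) });
        }
      }
    }
    this.auditLog.push(rec); // every decision is logged, including denials
    return rec;
  }
}

// Per-tenant cost attribution is a fold over the audit log.
function costByTenant(auditLog) {
  const totals = {};
  for (const r of auditLog) totals[r.tenantId] = (totals[r.tenantId] || 0) + r.costUnits;
  return totals;
}
```

Because denied and failed calls are written to the same log as successful ones, the audit trail can be reconciled against provider invoices and reviewed for a tenant that is repeatedly hitting its limit.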
6M is where the Azure-centric architecture still works. 20M is where the edge economics, cross-tenant cache math, and per-endpoint state model start to dominate the P&L. A 30-minute architecture conversation, no slides, no sales pitch — just the engineering math and a whiteboard.
Book 30 min with Matt Holscher →