ControlUp on Cloudflare — 3-second telemetry × 6M endpoints, at the edge

The thesis

ControlUp isn't a monitoring tool.
It's a global stateful agent platform that happens to manage endpoints.

Every endpoint is a long-running session: heartbeat, last-seen, current alerts, remediation state, anomaly history. Every customer is an isolated tenant. Every Pulse AI call needs governance, audit, and cost attribution across two LLM vendors. That's not a DEX problem — that's a multi-tenant agentic infrastructure problem at planetary scale, and it's the workload Cloudflare's developer platform was purpose-built for.

What we noticed in your stack

Your product runs on Microsoft Azure — app.controlup.com and api.controlup.com both resolve into Azure's 20.168.x.x range, with Application Insights telemetry headers on every API response. Your marketing site is on GCP nginx + WordPress. Your corporate DNS is on AWS Route 53. And your AI stack pairs Anthropic and OpenAI (per your TXT verifications — both confirmed in production). Three clouds, two LLM vendors, one telemetry firehose — Cloudflare's developer platform is the layer that consolidates the edge.

Value plays

Eight things Cloudflare changes for ControlUp.

Ranked by impact-per-effort for your specific workload shape — 3-second telemetry across millions of endpoints, with Pulse AI on top.

01 — Flagship

Edge ingest for 3-second telemetry

Your agents check in every 3 seconds from anywhere on Earth. Today that traffic flows to Azure regions. With Workers at 330+ POPs, agents check in to the nearest Cloudflare edge — sub-50ms ingest globally, no regional round-trip, no Azure egress on the back-haul.

Workers · Smart Placement · Magic Transit

6M endpoints × 28,800 check-ins/day each = 172B events/day at edge
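The arithmetic behind that figure can be sketched in a few lines. The helper names are illustrative only; the inputs are the 6M endpoints and the 3-second interval quoted above.

```typescript
// Check-in volume for a fleet of agents reporting on a fixed interval.
// `dailyCheckIns` and `checkInsPerSecond` are hypothetical helpers, not part
// of any ControlUp or Cloudflare API.
const SECONDS_PER_DAY = 86_400;

function dailyCheckIns(endpoints: number, intervalSeconds: number): number {
  const perEndpoint = SECONDS_PER_DAY / intervalSeconds; // 28,800 at 3 seconds
  return endpoints * perEndpoint;
}

function checkInsPerSecond(endpoints: number, intervalSeconds: number): number {
  // Average sustained rate, assuming check-ins are spread evenly over time.
  return endpoints / intervalSeconds;
}

const fleetTotal = dailyCheckIns(6_000_000, 3);
console.log(`${(fleetTotal / 1e9).toFixed(1)}B events/day`); // 172.8B events/day
console.log(`${checkInsPerSecond(6_000_000, 3)} check-ins/second`); // 2000000 check-ins/second
```

The per-second view is often the more useful sizing number: the same workload is a sustained average of about 2M requests per second.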

02 — Stateful agents

One Durable Object per endpoint

Each endpoint needs persistent state — heartbeat, last-seen, active alerts, remediation in-flight, anomaly history. Durable Objects give you a single-threaded actor per endpoint with strong consistency, geo-routing, and zero session-affinity infrastructure to manage. 6M endpoints = 6M DOs, each hibernating when idle and resuming on next check-in.

Durable Objects · Storage API · Alarms

Replaces a stateful service tier in Azure
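A minimal sketch of the "one actor per endpoint" idea, written as plain TypeScript so it runs anywhere. Every name here is hypothetical. In a Worker, the lookup by endpoint ID would go through the Durable Object namespace binding (for example via `idFromName`), and the fields would be persisted through the object's storage rather than held in an in-memory Map.

```typescript
// Plain-TypeScript model of per-endpoint session state. This is NOT the
// Durable Objects API; it only illustrates the shape of the state and the
// one-instance-per-endpoint routing.
interface CheckIn {
  endpointId: string;
  timestampMs: number;
  activeAlerts: string[];
}

class EndpointSession {
  lastSeenMs = 0;
  heartbeats = 0;
  activeAlerts: string[] = [];

  // Apply one telemetry check-in to this endpoint's state.
  apply(checkIn: CheckIn): void {
    this.lastSeenMs = checkIn.timestampMs;
    this.heartbeats += 1;
    this.activeAlerts = [...checkIn.activeAlerts];
  }

  // Treat an endpoint as stale once it has missed more than N intervals.
  isStale(nowMs: number, missedIntervals = 3, intervalMs = 3_000): boolean {
    return nowMs - this.lastSeenMs > missedIntervals * intervalMs;
  }
}

// Stand-in for platform routing: exactly one session per endpoint ID.
const sessions = new Map<string, EndpointSession>();

function sessionFor(endpointId: string): EndpointSession {
  let session = sessions.get(endpointId);
  if (session === undefined) {
    session = new EndpointSession();
    sessions.set(endpointId, session);
  }
  return session;
}

sessionFor("laptop-001").apply({ endpointId: "laptop-001", timestampMs: 1_000, activeAlerts: [] });
sessionFor("laptop-001").apply({ endpointId: "laptop-001", timestampMs: 4_000, activeAlerts: ["teams-audio"] });
console.log(sessionFor("laptop-001").heartbeats); // 2
```

Because each endpoint maps to exactly one session, updates for that endpoint are applied in order without locks, which is the property the actor model is being used for here.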

03 — Highest AI ROI

AI Gateway for Pulse AI's multi-LLM stack

You're running both Claude and GPT (your TXT records confirm). Pulse AI fires LLM calls constantly across 2,000+ tenants. AI Gateway sits in front of both providers: per-tenant cost attribution, semantic cache on repeated IT troubleshooting questions (the cache hit rate for "why is Outlook slow?" across 6M endpoints is enormous), full audit logs, rate limiting, and fallback routing. One gateway config and a base-URL change in the client, with no rewrite of the calling code.

AI Gateway · Semantic Cache · Multi-provider

See calculator below ↓

04 — Tenant isolation

Per-customer runtimes with Workers for Platforms

2,000+ enterprise customers, each with their own agent rules, automation scripts, Pulse AI prompts, and remediation runbooks. Workers for Platforms dispatch namespaces give you one isolated worker per tenant — Microsoft's worker, Citrix-shop A's worker, Banking-customer-B's worker — all fully isolated, individually metered, no noisy-neighbor risk.

Workers for Platforms · Dispatch Namespaces

2,000 tenants × isolated compute, no rewrite

05 — Inference at the edge

Workers AI for anomaly detection where the telemetry lands

Anomaly Detection currently runs after telemetry batches reach Azure. With Workers AI, the model runs at the same POP that received the agent's check-in — sub-100ms anomaly detection, no regional hop. Plus Workers AI's catalog (Llama, Mistral, embedding models, Whisper) becomes a complement to your Claude + OpenAI stack for tasks where edge latency matters more than frontier capability.

Workers AI · Vectorize

3–5× faster anomaly surfacing

06 — Storage economics

R2 for telemetry archive (zero egress)

172 billion events per day adds up. Cold-storage archive of telemetry, screenshots, remediation logs, and pcaps on R2 instead of Azure Blob means zero egress when customers query historical data, when auditors need evidence, or when Pulse AI back-references patterns for similarity search. For a telemetry firehose at your scale, the egress line item is usually the silent margin tax.

R2 · Zero Egress · S3-compatible API

Typical 40–60% storage TCO reduction

07 — Knowledge / RAG

Vectorize for the AI Assistant's RCA knowledge base

Your in-console AI Assistant does root-cause analysis on signals across your platform. That's a textbook RAG workload — embed the symptoms, retrieve similar past incidents + remediations, ground the LLM. Vectorize gives you a managed vector DB at edge latency, isolated per tenant, with sub-30ms queries. Pair with Workers AI Embeddings for the indexing pipeline.

Vectorize · Workers AI Embeddings · R2

Per-tenant RAG, no external vector DB to wire

08 — Orchestration

Workflows for autonomous remediation pipelines

"Detect → diagnose → propose remediation → execute → verify → close ticket" is a multi-step, long-running, retry-heavy workflow with checkpoints. Cloudflare Workflows is durable execution for exactly this shape — replaces Azure Logic Apps or hand-rolled Temporal clusters, lives next to the Workers + DOs handling the data.

Workflows · Queues · Cron Triggers

No external orchestrator to operate

Mapping

ControlUp ONE capabilities, mapped to Cloudflare primitives.

Each capability you ship maps to a specific Cloudflare developer primitive. Not approximately — exactly.

ControlUp capability	What it does	Cloudflare primitive
3-second telemetry ingest	Endpoint agents check in every 3 seconds from anywhere	`Workers` at 330+ POPs + `Smart Placement`
Per-endpoint session state	Heartbeat, last-seen, alerts, remediation state, anomaly history	`Durable Objects` (1 DO per endpoint, ~6M total)
Pulse AI agentic engine	Multi-LLM reasoning over telemetry across tenants	`AI Gateway` + `Workers AI` at edge
AI Assistant (in-console RCA)	Conversational root-cause analysis grounded in your data	`Vectorize` + `Workers AI Embeddings` + `R2`
Anomaly Detection	Behavior-based pattern learning, deviation surfacing	`Workers AI` + `Durable Objects` baseline storage
Per-customer tenant isolation	2,000+ enterprise customers, isolated agent rules + policies	`Workers for Platforms` dispatch namespaces
Automation & Workflows	No-code workflows for routine fixes; autonomous remediation chains	`Workflows` + `Queues` + `Cron Triggers`
Telemetry archive	Historical telemetry, pcaps, screenshots, remediation logs	`R2` (zero egress, S3-compatible API)
AI-Powered IT Self-Service	Conversational employee portal that resolves before tickets are filed	`AI Gateway` + `Workers` + `Pages` for UI
Live Remote Management	Real-time endpoint telemetry + silent remote remediation	`WebSockets` in Workers + `Durable Objects`

Quantify it

The AI Gateway cache math for Pulse AI across 2,000+ tenants.

The compounding insight: when you serve N tenants whose users ask similar IT questions, semantic caching scales with N. Across 6M endpoints, questions like "why is Outlook slow?" or "Teams won't connect" repeat often enough that the cache-hit rate is the kind of number that makes CFOs lean in.

AI Gateway savings calculator

Annual LLM inference cost — with and without semantic cache

Assumes blended Claude + GPT pricing. Adjust for your actual model mix. The cache assumption baked in: cache hits cost ~5% of a full inference call (embedding lookup + small response stitch).

Enterprise tenants on Pulse AI

2,000

Avg Pulse AI calls per tenant per day

5,000

Avg tokens per call (in + out)

2,500

Cross-tenant semantic cache hit rate

50%

Blended model cost per 1M tokens

$15

Total Pulse AI calls / year 3.65B

Total tokens / year 9.1T

Cost without AI Gateway $136.9M

Cost with semantic cache $71.9M

Annual savings $65.0M

Calculator is directional. Actual cache-hit rates depend on prompt structure, question repeatability, and TTL config — IT troubleshooting workloads typically run higher than the 50% default because the question space is so repetitive across tenants. AI Gateway also adds free observability, rate limiting, fallback routing, and request logging — none of which is priced into the chart above.
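The calculator's arithmetic can be reproduced with a short function. This is a sketch of the stated assumptions only (365 days per year, cache hits costing 5% of a full call); the type and function names are illustrative, and the real calculator may round or model costs differently.

```typescript
// Inputs mirror the calculator fields above. Names are hypothetical.
interface CacheModelInputs {
  tenants: number;
  callsPerTenantPerDay: number;
  tokensPerCall: number;        // input + output tokens
  cacheHitRate: number;         // fraction between 0 and 1
  costPerMillionTokens: number; // blended USD price
  hitCostFraction: number;      // cost of a cache hit relative to a full call
}

function annualLlmCost(inputs: CacheModelInputs) {
  const callsPerYear = inputs.tenants * inputs.callsPerTenantPerDay * 365;
  const tokensPerYear = callsPerYear * inputs.tokensPerCall;
  const baseline = (tokensPerYear / 1_000_000) * inputs.costPerMillionTokens;
  // Misses pay full price; hits pay only the small hit-cost fraction.
  const effectiveShare =
    (1 - inputs.cacheHitRate) + inputs.cacheHitRate * inputs.hitCostFraction;
  const withCache = baseline * effectiveShare;
  return { callsPerYear, tokensPerYear, baseline, withCache, savings: baseline - withCache };
}

const defaults = annualLlmCost({
  tenants: 2_000,
  callsPerTenantPerDay: 5_000,
  tokensPerCall: 2_500,
  cacheHitRate: 0.5,
  costPerMillionTokens: 15,
  hitCostFraction: 0.05,
});
console.log(`baseline: $${(defaults.baseline / 1e6).toFixed(1)}M`);
console.log(`with cache: $${(defaults.withCache / 1e6).toFixed(1)}M`);
console.log(`savings: $${(defaults.savings / 1e6).toFixed(1)}M`);
```

With the default inputs this comes to roughly $137M without caching and roughly $72M with it. The savings term is linear in the hit rate, so every additional point of cache-hit rate is worth about 0.95% of the baseline spend.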

Architecture

How a single endpoint check-in flows on Cloudflare.

One Windows 11 laptop in São Paulo reports a Teams audio anomaly. Here is the full path, step by step.

Agent check-in hits the nearest Cloudflare POP (GRU)

The ControlUp agent on the São Paulo laptop sends its 3-second telemetry packet to agent.controlup.com, which resolves to the closest POP — São Paulo, not Azure us-east. Round-trip time drops from ~140ms to ~12ms.

Workers · Smart Placement

Workers for Platforms routes to the right tenant namespace

Agent hostname → dispatch namespace lookup. The customer's worker — with their custom remediation scripts, alert rules, and Pulse AI prompts — runs in an isolated runtime. Zero noisy-neighbor risk between tenants.

Workers for Platforms · Dispatch Namespaces
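The hostname-to-tenant lookup in this step might look like the sketch below. The hostname convention (`<tenant>.agent.example.com`), the tenant names, and the Map standing in for a dispatch namespace are all assumptions for illustration. In a real dispatch Worker the Map lookup would be replaced by fetching the tenant's script from the dispatch namespace binding.

```typescript
// Hypothetical convention: the first DNS label is the tenant, the second is "agent".
function tenantFromHostname(hostname: string): string | null {
  const labels = hostname.toLowerCase().split(".");
  if (labels.length < 4 || labels[1] !== "agent") return null;
  const tenant = labels[0] ?? "";
  // Only accept simple DNS-safe tenant names.
  return /^[a-z0-9-]+$/.test(tenant) ? tenant : null;
}

type TenantHandler = (payload: string) => string;

// Stand-in for a dispatch namespace: tenant name -> that tenant's isolated handler.
const tenantWorkers = new Map<string, TenantHandler>([
  ["acme", (payload: string) => `acme handled ${payload}`],
]);

function dispatchCheckIn(hostname: string, payload: string): string {
  const tenant = tenantFromHostname(hostname);
  if (tenant === null) return "400 unknown host";
  const worker = tenantWorkers.get(tenant);
  if (worker === undefined) return "404 unknown tenant";
  return worker(payload);
}

console.log(dispatchCheckIn("acme.agent.example.com", "ping")); // acme handled ping
```

Rejecting unknown hosts and unknown tenants before any tenant code runs keeps the routing layer small and makes the isolation boundary explicit.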

The endpoint's Durable Object loads

One DO per endpoint, geo-pinned to GRU. Loads last 30 seconds of state in <5ms — heartbeat, active alerts, anomaly baseline, current remediation status. Tracks the new telemetry packet, updates state, hibernates when the check-in ends.

Durable Objects · Storage API

Anomaly Detection runs at the edge

Workers AI runs the behavior-baseline model on the new packet vs. the DO's stored baseline. Detects a Teams audio latency spike outside the normal envelope. Tags the event as anomaly-candidate, fires to the next stage. Total inference time: ~80ms, at the POP.

Workers AI · Vectorize
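To make "new packet vs. stored baseline" concrete, here is a deliberately simple stand-in: a running mean and variance (Welford's algorithm) with a z-score threshold. This is not the behavior-baseline model described above and not a Workers AI call. The class name, the threshold of 3, and the sample latencies are assumptions chosen for illustration.

```typescript
// Running baseline of a single metric, updated one observation at a time.
class MetricBaseline {
  private count = 0;
  private mean = 0;
  private m2 = 0; // sum of squared deviations from the running mean

  update(value: number): void {
    this.count += 1;
    const delta = value - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (value - this.mean);
  }

  // Z-score of a value against the baseline; 0 until there is enough history.
  zScore(value: number): number {
    if (this.count < 2) return 0;
    const std = Math.sqrt(this.m2 / (this.count - 1));
    return std === 0 ? 0 : (value - this.mean) / std;
  }
}

function isAnomalyCandidate(baseline: MetricBaseline, value: number, threshold = 3): boolean {
  return Math.abs(baseline.zScore(value)) > threshold;
}

// Hypothetical Teams audio latency samples in milliseconds.
const audioLatency = new MetricBaseline();
for (const sample of [40, 42, 38, 41, 39, 40, 43, 37]) audioLatency.update(sample);

console.log(isAnomalyCandidate(audioLatency, 41));  // false
console.log(isAnomalyCandidate(audioLatency, 120)); // true
```

The baseline needs only three numbers per metric, which is why it fits comfortably in per-endpoint state and can be checked before any heavier model is invoked.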

AI Gateway checks the Pulse AI cache

The anomaly fingerprint hits AI Gateway: "Teams audio latency, Windows 11, corporate VPN, São Paulo." Semantic search finds 47 similar incidents resolved in the last 7 days across other tenants. Cached resolution + remediation suggestion returned in 30ms. No LLM call needed.

AI Gateway · Semantic Cache
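Conceptually, a semantic cache compares the embedding of the incoming question with embeddings of previously answered ones and returns a stored answer when they are close enough. The sketch below shows that idea with toy 3-dimensional vectors. It is not AI Gateway's implementation; the 0.9 similarity threshold, the type names, and the sample entries are assumptions.

```typescript
interface CachedResolution {
  embedding: number[];
  answer: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best cached answer at or above the threshold, or null on a miss.
function lookupResolution(
  cache: CachedResolution[],
  query: number[],
  minSimilarity = 0.9,
): string | null {
  let best: CachedResolution | null = null;
  let bestScore = minSimilarity;
  for (const entry of cache) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best === null ? null : best.answer;
}

const resolutionCache: CachedResolution[] = [
  { embedding: [1, 0, 0], answer: "restart the audio service" },
  { embedding: [0, 1, 0], answer: "clear the Outlook cache" },
];

console.log(lookupResolution(resolutionCache, [0.98, 0.1, 0]));  // restart the audio service
console.log(lookupResolution(resolutionCache, [0.6, 0.6, 0.5])); // null
```

The threshold is the main tuning knob: set too low, tenants receive answers to questions they did not ask; set too high, the hit rate falls back toward exact-match caching.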

If cache miss, route to Claude or GPT with policy enforcement

If novel, AI Gateway routes to the configured LLM (Claude for nuanced reasoning, GPT for structured output). Rate-limited per-tenant. Fallback routing if one provider is degraded. Full request + response logged to Logpush for audit. Per-tenant cost attribution recorded.

AI Gateway · Logpush
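The policy logic in this step (per-tenant rate limit, provider fallback, cost attribution) can be modelled in plain TypeScript as below. In practice these are gateway settings rather than code you would write; the sketch only shows the decisions being made. All names, the fixed-window limit, and the two-provider list are illustrative assumptions.

```typescript
type Provider = "claude" | "gpt";

interface GatewayPolicy {
  maxCallsPerWindow: number;
  preferred: Provider;
}

// Tracks per-tenant call counts (for rate limiting) and token usage (for cost attribution).
class TenantLedger {
  private calls = new Map<string, number>();
  private tokens = new Map<string, number>();

  // Returns false once the tenant has used its calls for the current window.
  tryAcquire(tenant: string, policy: GatewayPolicy): boolean {
    const used = this.calls.get(tenant) ?? 0;
    if (used >= policy.maxCallsPerWindow) return false;
    this.calls.set(tenant, used + 1);
    return true;
  }

  recordTokens(tenant: string, count: number): void {
    this.tokens.set(tenant, (this.tokens.get(tenant) ?? 0) + count);
  }

  tokensFor(tenant: string): number {
    return this.tokens.get(tenant) ?? 0;
  }

  resetWindow(): void {
    this.calls.clear();
  }
}

// Prefer the configured provider; fall back to the other one if it is degraded.
function chooseProvider(policy: GatewayPolicy, degraded: ReadonlySet<Provider>): Provider | null {
  const order: Provider[] =
    policy.preferred === "claude" ? ["claude", "gpt"] : ["gpt", "claude"];
  for (const provider of order) {
    if (!degraded.has(provider)) return provider;
  }
  return null;
}

const tenantPolicy: GatewayPolicy = { maxCallsPerWindow: 2, preferred: "claude" };
const ledger = new TenantLedger();
if (ledger.tryAcquire("tenant-a", tenantPolicy)) {
  const provider = chooseProvider(tenantPolicy, new Set<Provider>(["claude"]));
  ledger.recordTokens("tenant-a", 2_500);
  console.log(provider); // gpt
}
```

Keeping the token ledger keyed by tenant is what makes per-tenant cost attribution a lookup rather than a log-mining job.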

Workflows orchestrates the remediation

The Workflow starts: validate remediation against tenant policy → execute the script on the endpoint via the agent's command channel → wait for confirmation → verify the telemetry improves → close the loop. Durable, retry-able, with checkpoints.

Workflows · Queues
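"Durable, retry-able, with checkpoints" can be illustrated with a small synchronous model: each named step runs at most once per run, failed steps are retried, and a replay skips steps that already completed. This is a sketch of the idea, not the Cloudflare Workflows API, which is asynchronous and persists step results for you. The class, step names, and retry count are assumptions.

```typescript
class CheckpointedRun {
  private completed = new Map<string, unknown>();
  attempts = 0;

  // Run a named step; completed steps return their saved result without re-running.
  step<T>(stepName: string, fn: () => T, maxRetries = 2): T {
    if (this.completed.has(stepName)) return this.completed.get(stepName) as T;
    let lastError: unknown;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      this.attempts += 1;
      try {
        const result = fn();
        this.completed.set(stepName, result); // checkpoint
        return result;
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  }
}

// The remediation chain from the step above, with the script execution injected.
function remediate(run: CheckpointedRun, executeScript: () => string): string {
  const allowed = run.step("validate-policy", () => true);
  if (!allowed) return "blocked";
  const outcome = run.step("execute-script", executeScript);
  const improved = run.step("verify-telemetry", () => outcome === "ok");
  return improved ? "closed" : "escalated";
}

// Hypothetical flaky command channel: fails once, then succeeds.
let scriptCalls = 0;
const flakyScript = (): string => {
  scriptCalls += 1;
  if (scriptCalls === 1) throw new Error("agent timeout");
  return "ok";
};

const remediationRun = new CheckpointedRun();
console.log(remediate(remediationRun, flakyScript)); // closed
console.log(remediate(remediationRun, flakyScript)); // closed (replayed from checkpoints)
```

On the second call nothing is re-executed, which is the property that matters for remediation: a script that already ran on an endpoint is not run again just because the orchestrator restarted.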

Event + remediation archived to R2, dashboards updated

Full event trace + LLM decisions + remediation outcome written to R2 (zero egress on later retrieval). Dashboards in the tenant's console update in real time via WebSocket. Pulse AI's cross-tenant pattern model learns from the new resolution. Total wall-clock time end-to-end: under 2 seconds.

R2 · WebSockets · Workers Analytics Engine

3-second telemetry. 6M endpoints. Run it at the edge.

Let's talk about what 20M endpoints looks like.
