Traditional uptime monitoring — pinging your server every minute from a single location — misses most of the failures that actually affect users. A site can be "up" at the origin while users in specific regions experience DNS resolution failures, CDN cache misses returning errors, or BGP routing changes that make the site unreachable from certain networks.

Comprehensive uptime monitoring in 2026 requires combining synthetic checks (probing from multiple locations), Real User Monitoring (measuring actual user experience), and SLO-based alerting (acting on meaningful thresholds rather than individual check failures).

Why simple uptime checks aren't enough

What they miss

A basic HTTP check from a monitoring service tells you the origin responded with 200 OK from one location at one moment. It does not tell you:

  • Whether the CDN is serving stale or error content to users
  • Whether DNS resolution is failing from specific resolver networks
  • Whether a BGP route change has made your site unreachable from an entire region
  • Whether TLS certificate issues are affecting certain clients
  • Whether page-load performance has degraded below usable thresholds

Real failure patterns

Common failures that simple monitoring misses:

  • CDN edge returning 5xx while origin is fine: CDN configuration error, expired cache, or origin timeout at specific edge locations
  • DNS propagation failure: nameserver change that hasn't propagated to all resolvers, or a resolver caching an NXDOMAIN
  • Regional routing failure: a transit provider drops your route, making you unreachable from their customer networks
  • Certificate trust issue: a mis-issued, revoked, or expired certificate that fails validation for some clients depending on their CA trust store
  • Intermittent failures: issues that occur for 10% of requests but are invisible to a single probe checking once per minute
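The intermittent case is easy to quantify. A hypothetical sketch: if a failure affects 10% of requests independently, the chance that n probes observe at least one failure is 1 − 0.9^n, so a single once-per-minute probe can run for a long time before noticing, while multiple locations multiply the probe count in the same window.

```python
# Probability that independent probes observe at least one failure,
# for an issue that affects a given fraction of requests.
def detection_probability(failure_rate: float, probes: int) -> float:
    return 1 - (1 - failure_rate) ** probes

# A 10% intermittent failure, probed once per minute from one location:
single = detection_probability(0.10, 1)      # 10% chance per probe
half_hour = detection_probability(0.10, 30)  # roughly 96% after 30 minutes
# Four locations probing once per minute give 120 probes in the same window:
four_locations = detection_probability(0.10, 120)
```

The numbers assume each request fails independently, which real outages often violate, but the direction of the argument holds: more vantage points shrink the blind spot.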

Synthetic monitoring

Synthetic monitoring sends automated requests to your site from multiple geographic locations at regular intervals. It simulates user access without actual users.

What to monitor

HTTP availability: GET your key pages and verify status code, response body content, and response headers.

# Example synthetic check configuration
checks:
  - name: "Homepage"
    url: "https://wplus.net/"
    interval: 60s
    locations: [us-east, eu-west, ap-southeast, us-west]
    assertions:
      - type: status_code
        value: 200
      - type: body_contains
        value: "wplus.net"
      - type: header
        name: "content-type"
        contains: "text/html"
      - type: response_time
        max_ms: 3000
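The assertion semantics in the config above can be sketched in code. This is a hypothetical evaluator, not any vendor's implementation; the assertion types mirror the config fields (status_code, body_contains, header, response_time):

```python
from typing import Any

def evaluate_assertions(status: int, body: str, headers: dict[str, str],
                        elapsed_ms: float,
                        assertions: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable failures; an empty list means the check passed."""
    failures = []
    # HTTP header names are case-insensitive, so normalise them once.
    lower_headers = {k.lower(): v for k, v in headers.items()}
    for a in assertions:
        kind = a["type"]
        if kind == "status_code" and status != a["value"]:
            failures.append(f"status {status} != {a['value']}")
        elif kind == "body_contains" and a["value"] not in body:
            failures.append(f"body missing {a['value']!r}")
        elif kind == "header" and a["contains"] not in lower_headers.get(a["name"].lower(), ""):
            failures.append(f"header {a['name']} missing {a['contains']!r}")
        elif kind == "response_time" and elapsed_ms > a["max_ms"]:
            failures.append(f"response took {elapsed_ms:.0f}ms > {a['max_ms']}ms")
    return failures
```

Returning all failures rather than stopping at the first makes the resulting alert more useful: a 503 that is also slow and serving the wrong content type is a different incident than a slow 200.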

DNS resolution: check that your domain resolves correctly from multiple DNS resolvers:

  - name: "DNS resolution"
    type: dns
    domain: "wplus.net"
    record_type: A
    nameserver: "1.1.1.1"
    assertions:
      - type: response_time
        max_ms: 100
      - type: record_count
        min: 1
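A basic resolution check can be approximated with the Python standard library, though getaddrinfo uses the system's configured resolver; targeting a specific nameserver such as 1.1.1.1, as the config above does, needs a dedicated DNS library (dnspython is one assumption, not shown here). A minimal sketch:

```python
import socket
import time

def resolve_with_timing(hostname: str) -> tuple[list[str], float]:
    """Resolve a hostname via the system resolver, returning (addresses, elapsed_ms)."""
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    elapsed_ms = (time.monotonic() - start) * 1000
    # De-duplicate while preserving order; one name may map to several records.
    addresses = list(dict.fromkeys(info[4][0] for info in infos))
    return addresses, elapsed_ms
```

This mirrors the two assertions in the config: a record_count floor (len(addresses) >= 1) and a response-time ceiling, keeping in mind that the system resolver's cache will make most measurements unrealistically fast.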

TLS certificate: verify certificate validity, expiration, and chain:

  - name: "TLS certificate"
    type: ssl
    hostname: "wplus.net"
    port: 443
    assertions:
      - type: certificate_expiry
        min_days: 14
      - type: certificate_chain
        valid: true
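The expiry assertion reduces to date arithmetic on the certificate's notAfter field. A sketch using only the standard library's ssl module; in practice the notAfter string would come from a live connection via getpeercert(), which is omitted here to keep the example self-contained:

```python
import ssl

def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """not_after uses the getpeercert() format, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400

def expiry_check_passes(not_after: str, now_epoch: float, min_days: int = 14) -> bool:
    # Mirrors the certificate_expiry assertion in the config above.
    return days_until_expiry(not_after, now_epoch) >= min_days
```

Fourteen days of lead time is a reasonable floor because it leaves room for a renewal automation failure to be noticed, escalated, and fixed during business hours.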

Location strategy

Deploy synthetic checks from at least 4 geographic regions that represent your user base. For a globally-accessible site:

  • North America (east and west coast)
  • Europe (west)
  • Asia-Pacific (southeast or east)
  • Optional: South America, Middle East, Africa

A failure detected from one location but not others usually indicates a regional issue (routing, DNS, or CDN edge), though it can also be a problem with that probe's own network. A failure from all locations is a global issue (origin down, DNS zone broken, or certificate expired).
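This regional-versus-global triage is mechanical enough to automate. A hypothetical sketch:

```python
def classify_outcome(results: dict[str, bool]) -> str:
    """results maps location name -> whether the check passed there."""
    failed = sorted(loc for loc, ok in results.items() if not ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "global failure"
    return "regional failure: " + ", ".join(failed)
```

In a real pipeline the classification would feed routing: a global failure pages whoever owns the origin, while a regional one points first at DNS, the CDN edge, or transit for those locations.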

Check frequency

  • Critical pages: every 30–60 seconds
  • Important pages: every 2–5 minutes
  • Secondary pages: every 10–15 minutes
  • DNS and TLS: every 5 minutes

More frequent checks detect issues faster but generate more data and potential alert noise.
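The trade-off can be made concrete. Assuming a policy of confirming k consecutive failures before alerting (a common noise-reduction tactic, not something the intervals above mandate), worst-case detection time is roughly k times the check interval:

```python
def worst_case_detection_s(interval_s: int, consecutive_failures: int) -> int:
    """Failure begins just after a check runs: up to one full interval passes
    before the first failing probe, then (k - 1) more intervals to confirm."""
    return interval_s * consecutive_failures

# Critical page at 60s requiring 2 consecutive failures: up to 2 minutes.
# Secondary page at 15 minutes with the same policy: up to 30 minutes.
```

That half-hour ceiling for secondary pages is usually acceptable precisely because they are secondary; if it is not, the page belongs in a faster tier.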

Real User Monitoring (RUM)

RUM collects performance and availability data from actual user browsers via JavaScript instrumentation.

What RUM captures that synthetic doesn't

  • Actual user geographic distribution: where your real users are, not where your probes are
  • Real device performance: mobile users on slow connections that synthetic checks don't simulate
  • Client-side errors: JavaScript failures, resource load failures, and rendering issues
  • CDN cache effectiveness: whether users are getting cache HITs or MISS responses
  • Navigation timing: DNS lookup, TCP connect, TLS handshake, TTFB, and full page load as experienced by real users

Key RUM metrics

  • Web Vitals: Largest Contentful Paint (LCP), Interaction to Next Paint (INP), Cumulative Layout Shift (CLS)
  • TTFB (Time to First Byte): measures server responsiveness including DNS, TCP, TLS, and server processing
  • Error rate: percentage of page loads that encounter HTTP errors or JavaScript exceptions
  • Geographic performance: TTFB and load times broken down by country/region
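Aggregating the geographic metric usually means a percentile per region rather than an average, because averages hide the slow tail that users actually complain about. A sketch using the nearest-rank percentile:

```python
import math
from collections import defaultdict

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttfb_p75_by_region(samples: list[tuple[str, float]]) -> dict[str, float]:
    """samples: (region, ttfb_ms) pairs as collected from RUM beacons."""
    by_region: dict[str, list[float]] = defaultdict(list)
    for region, ttfb in samples:
        by_region[region].append(ttfb)
    return {region: percentile(vals, 75) for region, vals in by_region.items()}
```

p75 is the threshold the Web Vitals program uses for "most users"; a region whose p75 TTFB suddenly doubles is a strong candidate for a CDN or routing investigation even when the global average looks flat.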

Implementation

Most analytics and observability platforms offer RUM SDKs:

<!-- Generic RUM beacon example -->
<script>
  // Defer one tick past the 'load' event: inside the load handler itself,
  // loadEventEnd is still 0 because the handlers have not finished running.
  window.addEventListener('load', () => {
    setTimeout(() => {
      const timing = performance.getEntriesByType('navigation')[0];
      const data = {
        dns: timing.domainLookupEnd - timing.domainLookupStart,
        tcp: timing.connectEnd - timing.connectStart,
        tls: timing.secureConnectionStart > 0
          ? timing.connectEnd - timing.secureConnectionStart : 0,
        // Navigation entries are relative to startTime (always 0), so
        // responseStart alone covers DNS, TCP, TLS, and server processing.
        ttfb: timing.responseStart,
        load: timing.loadEventEnd,
        protocol: timing.nextHopProtocol
      };
      // Send to your analytics endpoint
      navigator.sendBeacon('/analytics', JSON.stringify(data));
    }, 0);
  });
</script>

SLO-based alerting

Service Level Objectives (SLOs) replace noisy per-check alerts with meaningful thresholds.

Defining SLOs

An SLO defines what "good" looks like over a time window:

  • Availability SLO: 99.9% of requests return a successful response over a 30-day window
  • Latency SLO: 95% of requests complete within 500ms (p95) over a 30-day window
  • Error SLO: fewer than 0.1% of requests return 5xx errors over a 30-day window

Error budgets

If your availability SLO is 99.9% over 30 days, your error budget is 0.1% of total requests — approximately 43 minutes of downtime. When the error budget is being consumed faster than expected, you alert.
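The budget arithmetic is simple enough to check by hand or in code:

```python
def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Downtime allowance implied by an availability SLO over a window."""
    allowed_fraction = 1 - slo_target_pct / 100
    return allowed_fraction * window_days * 24 * 60

# 99.9% over 30 days allows 43.2 minutes; 99.99% allows only 4.32 minutes.
```

The drop from 43 minutes to 4 minutes per extra nine is why SLO targets should be chosen deliberately: each nine roughly divides the permissible downtime by ten.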

Burn rate alerting

Instead of alerting on every failed check, alert on the rate at which you're consuming your error budget:

  • Fast burn (14.4x): would exhaust the entire monthly budget in about 2 days, burning 2% of it per hour → page immediately
  • Medium burn (6x): would exhaust the budget in 5 days → alert within 30 minutes
  • Slow burn (1x): consumes the budget steadily over the full month → informational, no alert

This approach dramatically reduces alert noise while catching real incidents.
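Burn rate is just the observed error rate divided by the error rate the SLO permits. A sketch of the computation and the alert decision, using the thresholds above:

```python
def burn_rate(observed_error_rate: float, slo_target_pct: float) -> float:
    allowed = 1 - slo_target_pct / 100   # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed

def alert_severity(rate: float) -> str:
    if rate >= 14.4:
        return "critical"   # page immediately
    if rate >= 6:
        return "warning"    # ticket or non-paging alert
    return "none"           # within budget; no action
```

A 1.44% error rate against a 99.9% SLO is a 14.4x burn, which is exactly the fast-burn threshold; sustained for two days it would consume the whole month's budget.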

Example SLO configuration

slos:
  - name: "Website availability"
    target: 99.9
    window: 30d
    indicator:
      type: availability
      good: "http.status_code < 500"
      total: "http.request_count"
    alerts:
      - burn_rate: 14.4
        window: 1h
        severity: critical
      - burn_rate: 6
        window: 6h
        severity: warning

Combining the three signals

  • Synthetic: detects origin availability, DNS, TLS, and regional reachability. Blind spot: does not reflect real user experience.
  • RUM: captures actual user performance, client-side errors, and CDN effectiveness. Blind spot: only collects data when users visit, so there is no signal during off-hours.
  • SLO: provides meaningful trend-based alerting. Blind spot: requires sufficient data volume for statistical significance.

Use all three together:

  1. Synthetic catches issues immediately, even when no users are active
  2. RUM confirms whether real users are affected and measures severity
  3. SLOs determine whether the issue is worth waking someone up for

Common mistakes

Monitoring only from one location. Regional failures are invisible to single-location monitoring. Use at least 3–4 probe locations.

Alerting on every failed check. Single-check failures happen constantly (network glitches, probe-side issues). Alert on sustained failures or SLO burn rates.

Not monitoring DNS separately. If your DNS is down, your HTTP checks may fail with misleading errors. Monitor DNS resolution independently.

Ignoring RUM data for operational decisions. Synthetic checks tell you what's possible; RUM tells you what's actually happening. Base your SLOs on RUM data when available.

Setting unrealistic SLOs. A 99.99% availability target for a site behind a single CDN provider may not be achievable. Set SLOs based on what you can actually deliver, then improve.
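One way to sanity-check a target: if a request must traverse DNS, a CDN, and an origin in series, and each component can fail independently, composite availability is at best the product of the parts. The component numbers below are illustrative assumptions, not measured figures:

```python
import math

def serial_availability(components: list[float]) -> float:
    """Best-case availability of serially dependent components,
    assuming their failures are independent."""
    return math.prod(components)

# Hypothetical: DNS at 99.99%, a single CDN provider at 99.95%, origin at 99.9%.
composite = serial_availability([0.9999, 0.9995, 0.999])
# Roughly 99.84%: already below a 99.9% target before any deploys or incidents.
```

If the arithmetic says the target is unreachable with the current architecture, the fix is either a looser SLO or a change in architecture (multi-CDN, redundant origins), not a stricter alert.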

Verification

  1. Deploy synthetic checks from at least 3 locations and verify they return correct results
  2. Simulate a failure (block the origin, return a 503) and verify synthetic monitoring detects it within the expected interval
  3. Implement RUM and verify data flows to your analytics platform
  4. Define at least one SLO and verify burn-rate alerting triggers correctly with test data
  5. Test regional failure detection: configure one synthetic check to fail and verify it's identified as regional, not global

Related reading on wplus.net