Traditional uptime monitoring — pinging your server every minute from a single location — misses most of the failures that actually affect users. A site can be "up" at the origin while users in specific regions experience DNS resolution failures, CDN cache misses returning errors, or BGP routing changes that make the site unreachable from certain networks.

Comprehensive uptime monitoring in 2026 requires combining synthetic checks (probing from multiple locations), Real User Monitoring (measuring actual user experience), and SLO-based alerting (acting on meaningful thresholds rather than individual check failures).

Why simple uptime checks aren't enough

What they miss

A basic HTTP check from a monitoring service tells you the origin responded with 200 OK from one location at one moment. It does not tell you:

  • Whether the CDN is serving stale or error content to users
  • Whether DNS resolution is failing from specific resolver networks
  • Whether a BGP route change has made your site unreachable from an entire region
  • Whether TLS certificate issues are affecting certain clients
  • Whether page-load performance has degraded below usable thresholds

Real failure patterns

Common failures that simple monitoring misses:

  • CDN edge returning 5xx while origin is fine: CDN configuration error, expired cache, or origin timeout at specific edge locations
  • DNS propagation failure: nameserver change that hasn't propagated to all resolvers, or a resolver caching an NXDOMAIN
  • Regional routing failure: a transit provider drops your route, making you unreachable from their customer networks
  • Certificate trust issue: a mis-issued, revoked, or expired certificate that fails validation for some clients depending on their CA trust store
  • Intermittent failures: issues that occur for 10% of requests but are invisible to a single probe checking once per minute
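The intermittent case is easy to quantify. A hypothetical sketch: if a failure affects 10% of requests independently, the chance that n probes observe at least one failure is 1 − 0.9^n, so a single once-per-minute probe can run for a long time before noticing, while multiple locations multiply the probe count in the same window.

```python
# Probability that independent probes observe at least one failure,
# for an issue that affects a given fraction of requests.
def detection_probability(failure_rate: float, probes: int) -> float:
    return 1 - (1 - failure_rate) ** probes

# A 10% intermittent failure, probed once per minute from one location:
single = detection_probability(0.10, 1)      # 10% chance per probe
half_hour = detection_probability(0.10, 30)  # roughly 96% after 30 minutes
# Four locations probing once per minute give 120 probes in the same window:
four_locations = detection_probability(0.10, 120)
```

The numbers assume each request fails independently, which real outages often violate, but the direction of the argument holds: more vantage points shrink the blind spot.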

Synthetic monitoring

Synthetic monitoring sends automated requests to your site from multiple geographic locations at regular intervals. It simulates user access without actual users.

What to monitor

HTTP availability: GET your key pages and verify status code, response body content, and response headers.

# Example synthetic check configuration
checks:
  - name: "Homepage"
    url: "https://wplus.net/"
    interval: 60s
    locations: [us-east, eu-west, ap-southeast, us-west]
    assertions:
      - type: status_code
        value: 200
      - type: body_contains
        value: "wplus.net"
      - type: header
        name: "content-type"
        contains: "text/html"
      - type: response_time
        max_ms: 3000
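The assertion semantics in the config above can be sketched in code. This is a hypothetical evaluator, not any vendor's implementation; the assertion types mirror the config fields (status_code, body_contains, header, response_time):

```python
from typing import Any

def evaluate_assertions(status: int, body: str, headers: dict[str, str],
                        elapsed_ms: float,
                        assertions: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable failures; an empty list means the check passed."""
    failures = []
    # HTTP header names are case-insensitive, so normalise them once.
    lower_headers = {k.lower(): v for k, v in headers.items()}
    for a in assertions:
        kind = a["type"]
        if kind == "status_code" and status != a["value"]:
            failures.append(f"status {status} != {a['value']}")
        elif kind == "body_contains" and a["value"] not in body:
            failures.append(f"body missing {a['value']!r}")
        elif kind == "header" and a["contains"] not in lower_headers.get(a["name"].lower(), ""):
            failures.append(f"header {a['name']} missing {a['contains']!r}")
        elif kind == "response_time" and elapsed_ms > a["max_ms"]:
            failures.append(f"response took {elapsed_ms:.0f}ms > {a['max_ms']}ms")
    return failures
```

Returning all failures rather than stopping at the first makes the resulting alert more useful: a 503 that is also slow and serving the wrong content type is a different incident than a slow 200.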

DNS resolution: check that your domain resolves correctly from multiple DNS resolvers:

  - name: "DNS resolution"
    type: dns
    domain: "wplus.net"
    record_type: A
    nameserver: "1.1.1.1"
    assertions:
      - type: response_time
        max_ms: 100
      - type: record_count
        min: 1
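A basic resolution check can be approximated with the Python standard library, though getaddrinfo uses the system's configured resolver; targeting a specific nameserver such as 1.1.1.1, as the config above does, needs a dedicated DNS library (dnspython is one assumption, not shown here). A minimal sketch:

```python
import socket
import time

def resolve_with_timing(hostname: str) -> tuple[list[str], float]:
    """Resolve a hostname via the system resolver, returning (addresses, elapsed_ms)."""
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    elapsed_ms = (time.monotonic() - start) * 1000
    # De-duplicate while preserving order; one name may map to several records.
    addresses = list(dict.fromkeys(info[4][0] for info in infos))
    return addresses, elapsed_ms
```

This mirrors the two assertions in the config: a record_count floor (len(addresses) >= 1) and a response-time ceiling, keeping in mind that the system resolver's cache will make most measurements unrealistically fast.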

TLS certificate: verify certificate validity, expiration, and chain:

  - name: "TLS certificate"
    type: ssl
    hostname: "wplus.net"
    port: 443
    assertions:
      - type: certificate_expiry
        min_days: 14
      - type: certificate_chain
        valid: true
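The expiry assertion reduces to date arithmetic on the certificate's notAfter field. A sketch using only the standard library's ssl module; in practice the notAfter string would come from a live connection via getpeercert(), which is omitted here to keep the example self-contained:

```python
import ssl

def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """not_after uses the getpeercert() format, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400

def expiry_check_passes(not_after: str, now_epoch: float, min_days: int = 14) -> bool:
    # Mirrors the certificate_expiry assertion in the config above.
    return days_until_expiry(not_after, now_epoch) >= min_days
```

Fourteen days of lead time is a reasonable floor because it leaves room for a renewal automation failure to be noticed, escalated, and fixed during business hours.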

Location strategy

Deploy synthetic checks from at least 4 geographic regions that represent your user base. For a globally-accessible site:

  • North America (east and west coast)
  • Europe (west)
  • Asia-Pacific (southeast or east)
  • Optional: South America, Middle East, Africa

A failure detected from one location but not others usually indicates a regional issue (routing, DNS, or CDN edge), though it can also be a problem with that probe's own network. A failure from all locations is a global issue (origin down, DNS zone broken, or certificate expired).
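This regional-versus-global triage is mechanical enough to automate. A hypothetical sketch:

```python
def classify_outcome(results: dict[str, bool]) -> str:
    """results maps location name -> whether the check passed there."""
    failed = sorted(loc for loc, ok in results.items() if not ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "global failure"
    return "regional failure: " + ", ".join(failed)
```

In a real pipeline the classification would feed routing: a global failure pages whoever owns the origin, while a regional one points first at DNS, the CDN edge, or transit for those locations.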

Check frequency

  • Critical pages: every 30–60 seconds
  • Important pages: every 2–5 minutes
  • Secondary pages: every 10–15 minutes
  • DNS and TLS: every 5 minutes

More frequent checks detect issues faster but generate more data and potential alert noise.
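The trade-off can be made concrete. Assuming a policy of confirming k consecutive failures before alerting (a common noise-reduction tactic, not something the intervals above mandate), worst-case detection time is roughly k times the check interval:

```python
def worst_case_detection_s(interval_s: int, consecutive_failures: int) -> int:
    """Failure begins just after a check runs: up to one full interval passes
    before the first failing probe, then (k - 1) more intervals to confirm."""
    return interval_s * consecutive_failures

# Critical page at 60s requiring 2 consecutive failures: up to 2 minutes.
# Secondary page at 15 minutes with the same policy: up to 30 minutes.
```

That half-hour ceiling for secondary pages is usually acceptable precisely because they are secondary; if it is not, the page belongs in a faster tier.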

Real User Monitoring (RUM)

RUM collects performance and availability data from actual user browsers via JavaScript instrumentation.

What RUM captures that synthetic doesn't

  • Actual user geographic distribution: where your real users are, not where your probes are
  • Real device performance: mobile users on slow connections that synthetic checks don't simulate
  • Client-side errors: JavaScript failures, resource load failures, and rendering issues
  • CDN cache effectiveness: whether users are getting cache HITs or MISS responses
  • Navigation timing: DNS lookup, TCP connect, TLS handshake, TTFB, and full page load as experienced by real users

Key RUM metrics

  • Web Vitals: Largest Contentful Paint (LCP), Interaction to Next Paint (INP), Cumulative Layout Shift (CLS)
  • TTFB (Time to First Byte): measures server responsiveness including DNS, TCP, TLS, and server processing
  • Error rate: percentage of page loads that encounter HTTP errors or JavaScript exceptions
  • Geographic performance: TTFB and load times broken down by country/region
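Aggregating the geographic metric usually means a percentile per region rather than an average, because averages hide the slow tail that users actually complain about. A sketch using the nearest-rank percentile:

```python
import math
from collections import defaultdict

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttfb_p75_by_region(samples: list[tuple[str, float]]) -> dict[str, float]:
    """samples: (region, ttfb_ms) pairs as collected from RUM beacons."""
    by_region: dict[str, list[float]] = defaultdict(list)
    for region, ttfb in samples:
        by_region[region].append(ttfb)
    return {region: percentile(vals, 75) for region, vals in by_region.items()}
```

p75 is the threshold the Web Vitals program uses for "most users"; a region whose p75 TTFB suddenly doubles is a strong candidate for a CDN or routing investigation even when the global average looks flat.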

Implementation

Most analytics and observability platforms offer RUM SDKs:

<!-- Generic RUM beacon example -->
<script>
  // Defer one tick past the 'load' event: inside the load handler itself,
  // loadEventEnd is still 0 because the handlers have not finished running.
  window.addEventListener('load', () => {
    setTimeout(() => {
      const timing = performance.getEntriesByType('navigation')[0];
      const data = {
        dns: timing.domainLookupEnd - timing.domainLookupStart,
        tcp: timing.connectEnd - timing.connectStart,
        tls: timing.secureConnectionStart > 0
          ? timing.connectEnd - timing.secureConnectionStart : 0,
        // Navigation entries are relative to startTime (always 0), so
        // responseStart alone covers DNS, TCP, TLS, and server processing.
        ttfb: timing.responseStart,
        load: timing.loadEventEnd,
        protocol: timing.nextHopProtocol
      };
      // Send to your analytics endpoint
      navigator.sendBeacon('/analytics', JSON.stringify(data));
    }, 0);
  });
</script>

SLO-based alerting

Service Level Objectives (SLOs) replace noisy per-check alerts with meaningful thresholds.

Defining SLOs

An SLO defines what "good" looks like over a time window:

  • Availability SLO: 99.9% of requests return a successful response over a 30-day window
  • Latency SLO: 95% of requests complete within 500ms (p95) over a 30-day window
  • Error SLO: fewer than 0.1% of requests return 5xx errors over a 30-day window

Error budgets

If your availability SLO is 99.9% over 30 days, your error budget is 0.1% of total requests — approximately 43 minutes of downtime. When the error budget is being consumed faster than expected, you alert.
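The budget arithmetic is simple enough to check by hand or in code:

```python
def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Downtime allowance implied by an availability SLO over a window."""
    allowed_fraction = 1 - slo_target_pct / 100
    return allowed_fraction * window_days * 24 * 60

# 99.9% over 30 days allows 43.2 minutes; 99.99% allows only 4.32 minutes.
```

The drop from 43 minutes to 4 minutes per extra nine is why SLO targets should be chosen deliberately: each nine roughly divides the permissible downtime by ten.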

Burn rate alerting

Instead of alerting on every failed check, alert on the rate at which you're consuming your error budget:

  • Fast burn (14.4x): would exhaust the entire monthly budget in about 2 days, burning 2% of it per hour → page immediately
  • Medium burn (6x): would exhaust the budget in 5 days → alert within 30 minutes
  • Slow burn (1x): consumes the budget steadily over the full month → informational, no alert

This approach dramatically reduces alert noise while catching real incidents.
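Burn rate is just the observed error rate divided by the error rate the SLO permits. A sketch of the computation and the alert decision, using the thresholds above:

```python
def burn_rate(observed_error_rate: float, slo_target_pct: float) -> float:
    allowed = 1 - slo_target_pct / 100   # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed

def alert_severity(rate: float) -> str:
    if rate >= 14.4:
        return "critical"   # page immediately
    if rate >= 6:
        return "warning"    # ticket or non-paging alert
    return "none"           # within budget; no action
```

A 1.44% error rate against a 99.9% SLO is a 14.4x burn, which is exactly the fast-burn threshold; sustained for two days it would consume the whole month's budget.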

Example SLO configuration

slos:
  - name: "Website availability"
    target: 99.9
    window: 30d
    indicator:
      type: availability
      good: "http.status_code < 500"
      total: "http.request_count"
    alerts:
      - burn_rate: 14.4
        window: 1h
        severity: critical
      - burn_rate: 6
        window: 6h
        severity: warning

Combining the three signals

  • Synthetic: detects origin availability, DNS, TLS, and regional reachability. Blind spot: does not reflect real user experience.
  • RUM: captures actual user performance, client-side errors, and CDN effectiveness. Blind spot: only collects data when users visit, so there is no signal during off-hours.
  • SLO: provides meaningful trend-based alerting. Blind spot: requires sufficient data volume for statistical significance.

Use all three together:

  1. Synthetic catches issues immediately, even when no users are active
  2. RUM confirms whether real users are affected and measures severity
  3. SLOs determine whether the issue is worth waking someone up for

Common mistakes

Monitoring only from one location. Regional failures are invisible to single-location monitoring. Use at least 3–4 probe locations.

Alerting on every failed check. Single-check failures happen constantly (network glitches, probe-side issues). Alert on sustained failures or SLO burn rates.

Not monitoring DNS separately. If your DNS is down, your HTTP checks may fail with misleading errors. Monitor DNS resolution independently.

Ignoring RUM data for operational decisions. Synthetic checks tell you what's possible; RUM tells you what's actually happening. Base your SLOs on RUM data when available.

Setting unrealistic SLOs. A 99.99% availability target for a site behind a single CDN provider may not be achievable. Set SLOs based on what you can actually deliver, then improve.
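One way to sanity-check a target: if a request must traverse DNS, a CDN, and an origin in series, and each component can fail independently, composite availability is at best the product of the parts. The component numbers below are illustrative assumptions, not measured figures:

```python
import math

def serial_availability(components: list[float]) -> float:
    """Best-case availability of serially dependent components,
    assuming their failures are independent."""
    return math.prod(components)

# Hypothetical: DNS at 99.99%, a single CDN provider at 99.95%, origin at 99.9%.
composite = serial_availability([0.9999, 0.9995, 0.999])
# Roughly 99.84%: already below a 99.9% target before any deploys or incidents.
```

If the arithmetic says the target is unreachable with the current architecture, the fix is either a looser SLO or a change in architecture (multi-CDN, redundant origins), not a stricter alert.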

Verification

  1. Deploy synthetic checks from at least 3 locations and verify they return correct results
  2. Simulate a failure (block the origin, return a 503) and verify synthetic monitoring detects it within the expected interval
  3. Implement RUM and verify data flows to your analytics platform
  4. Define at least one SLO and verify burn-rate alerting triggers correctly with test data
  5. Test regional failure detection: configure one synthetic check to fail and verify it's identified as regional, not global

Related reading on wplus.net