Silent data corruption — bit rot, undetected disk errors, firmware bugs, faulty memory, incomplete writes — is a real and measurable phenomenon in any storage system operating at scale. Research from CERN, Google, and NetApp has documented corruption rates that, while individually rare, become statistically significant when you manage thousands of files over years.
Fixity checking — computing and verifying cryptographic hashes of stored files — is the primary defence. But doing it well at scale requires careful decisions about when to hash, what hash metadata to store, how often to verify, and how to handle the inevitable detection of corruption.
## Why fixity matters
Every file in a storage system can be silently modified by:
- Bit rot: random bit flips in storage media over time
- Firmware bugs: storage controller firmware that silently corrupts data during read/write operations
- Incomplete writes: power loss or crash during a write operation that leaves a file in a partially written state
- Silent disk errors: sectors that return incorrect data without signalling an I/O error
- Software bugs: application-level bugs that modify files unintentionally
Without fixity checking, you discover corruption only when a user reports a broken download, an archive fails to extract, or a backup restores a corrupted file. By then, the good copy may already be gone from your backup rotation.
## Hashing strategy

### Which hash algorithm
SHA-256 is the standard choice for fixity in 2026:
- Widely supported across all platforms and tools
- 256-bit output is collision-resistant for the foreseeable future
- Fast enough for bulk hashing with hardware acceleration (SHA-NI instructions on modern x86)
- Compatible with integrity manifests, SRI, and most verification tooling
SHA-512 offers a wider output and can be faster on 64-bit systems without SHA-NI, but the extra bits provide no practical benefit for fixity checking.
BLAKE3 is significantly faster than SHA-256 (especially on multi-core systems) and equally secure, but tooling support is still catching up. Use it if your verification pipeline supports it.
MD5 and SHA-1 are broken for collision resistance and should not be used for new fixity implementations. They are acceptable only for compatibility with existing systems that cannot be upgraded.
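Whichever algorithm you choose, hash files in fixed-size chunks rather than reading them into memory whole. A minimal sketch using Python's standard `hashlib` (the function name and chunk size are illustrative choices, not part of any standard):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so memory use
    stays constant regardless of file size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Swapping `hashlib.sha256` for `hashlib.sha512` (or a BLAKE3 binding, where available) changes nothing else in the loop.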
### When to hash
Hash at these points in the file lifecycle:
- At ingest: when a file first enters your system. This establishes the baseline hash.
- After transfer: when a file is copied to a new location (backup, mirror, CDN origin). Compare the hash at the destination with the origin hash.
- Periodically at rest: schedule regular verification of stored files against their recorded hashes.
- Before serving: optionally, verify the hash before serving a download to a user (adds latency; appropriate for high-assurance scenarios).
### What to store

For each file, store:

```yaml
filepath: /archives/project-v2.4.1.tar.gz
sha256: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
size: 248791040
hash_time: 2026-03-01T12:00:00Z
source: ingest
```
The size field is a cheap first check — if the file size doesn't match, something is wrong without needing to compute the hash. The hash_time tells you when the hash was last verified, and source indicates whether this hash is from original ingest or a periodic verification.
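The record and its verification logic can be sketched as follows, with the size comparison as the free early exit before any hashing (a sketch; the dataclass and function names are illustrative, with field names taken from the example record):

```python
import hashlib
import os
from dataclasses import dataclass

@dataclass
class FixityRecord:
    filepath: str
    sha256: str
    size: int
    hash_time: str   # ISO 8601 timestamp of the last verification
    source: str      # "ingest" or "verification"

def check_fixity(record: FixityRecord) -> bool:
    """Verify a stored file against its record: compare size first
    (cheap early exit), then compute the full SHA-256 only if it matches."""
    if os.path.getsize(record.filepath) != record.size:
        return False
    h = hashlib.sha256()
    with open(record.filepath, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest() == record.sha256
```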
## Manifest formats

### Simple checksum files

The traditional SHA256SUMS format:

```
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  project-v2.4.1.tar.gz
a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2  project-v2.4.0.tar.gz
```

Verify with: `sha256sum -c SHA256SUMS`
Advantages: universal tooling support, human-readable, trivial to generate.
Disadvantages: no metadata beyond filename and hash; no provenance information.
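When you need to consume these files programmatically (for example to compare a mirror against its origin manifest), a small parser is enough. A sketch, assuming coreutils-style lines where the digest and filename are separated by whitespace and binary-mode entries prefix the name with `*`:

```python
def parse_sha256sums(text: str) -> dict[str, str]:
    """Parse coreutils-style SHA256SUMS content into {filename: digest}."""
    entries: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # Binary-mode entries look like '<digest> *<filename>'
        entries[name.lstrip("*")] = digest.lower()
    return entries
```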
### Structured manifests (JSON/YAML)

For richer metadata:

```json
{
  "files": [
    {
      "path": "project-v2.4.1.tar.gz",
      "sha256": "9f86d081...",
      "size": 248791040,
      "created": "2026-03-01T12:00:00Z"
    }
  ],
  "manifest_version": "1.0",
  "generated": "2026-03-01T12:05:00Z"
}
```
Better for programmatic consumption, but requires custom verification tooling.
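A verifier for this manifest shape might look like the following sketch (it assumes the schema shown above; the function name is illustrative). It returns the failing paths rather than raising, so a caller can quarantine and report each one:

```python
import hashlib
import json
import os

def verify_manifest(manifest_path: str, base_dir: str) -> list[str]:
    """Check every file listed in a JSON manifest against base_dir;
    return the relative paths that fail the size or hash check."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    failures = []
    for entry in manifest["files"]:
        full = os.path.join(base_dir, entry["path"])
        # Size first: a missing file or size mismatch needs no hashing.
        if not os.path.isfile(full) or os.path.getsize(full) != entry["size"]:
            failures.append(entry["path"])
            continue
        h = hashlib.sha256()
        with open(full, "rb") as fh:
            while chunk := fh.read(1 << 20):
                h.update(chunk)
        if h.hexdigest() != entry["sha256"]:
            failures.append(entry["path"])
    return failures
```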
## Periodic verification schedule
How often to verify depends on your risk tolerance and storage characteristics:
- Critical archives (signed releases, legal documents): weekly or daily
- Standard content (general file storage): monthly
- Cold storage (rarely accessed backups): quarterly
- Active serving paths (CDN origin): continuously via sample-based checking
### Sample-based checking
For very large collections (millions of files), verifying everything daily is impractical. Instead:
- Verify a random sample of N files per day
- Over time, every file gets checked
- Prioritise recently-ingested files and files that haven't been verified recently
- Track verification coverage: what percentage of files have been verified in the last 30/90/365 days?
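One way to sketch the daily selection, assuming each record carries a `last_verified` timestamp (or `None` if never verified); the function name and 90-day staleness threshold are illustrative:

```python
import random
from datetime import datetime, timedelta, timezone

def pick_daily_sample(records: list[dict], n: int,
                      stale_days: int = 90) -> list[dict]:
    """Select up to n records for today's verification pass.
    Never-verified or stale records are taken first; any remaining
    slots are filled by random sampling from the rest."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=stale_days)

    def is_stale(r: dict) -> bool:
        return r["last_verified"] is None or r["last_verified"] < cutoff

    stale = [r for r in records if is_stale(r)]
    if len(stale) >= n:
        return random.sample(stale, n)
    fresh = [r for r in records if not is_stale(r)]
    return stale + random.sample(fresh, min(n - len(stale), len(fresh)))
```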
## Detecting and handling corruption

When a hash mismatch is detected:

### Immediate response
- Quarantine the file: do not serve, copy, or back up a corrupted file
- Log the event: record the file path, expected hash, actual hash, detection time, and storage device
- Alert operators: corruption detection is an urgent event that may indicate a larger problem
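The quarantine-and-log steps above can be combined into one handler; a minimal sketch, in which the function name and the choice of quarantine directory are assumptions for illustration:

```python
import logging
import os
import shutil
from datetime import datetime, timezone

def quarantine_corrupt_file(filepath: str, expected: str, actual: str,
                            quarantine_dir: str) -> str:
    """Move a file that failed fixity out of the serving path and log
    the details needed for later root-cause analysis."""
    os.makedirs(quarantine_dir, exist_ok=True)
    dest = os.path.join(quarantine_dir, os.path.basename(filepath))
    shutil.move(filepath, dest)
    logging.critical(
        "fixity failure: path=%s expected_sha256=%s actual_sha256=%s "
        "detected=%s quarantined_to=%s",
        filepath, expected, actual,
        datetime.now(timezone.utc).isoformat(), dest)
    return dest
```

Alerting (paging, ticketing) would hang off the same event; the log record carries everything an operator needs.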
### Recovery
- Check backups: find the most recent backup with a matching hash
- Check mirrors: if the file exists on multiple mirrors, verify which copies are correct
- Restore from verified copy: replace the corrupted file with a verified-good copy
- Re-verify after restore: confirm the restored file hashes correctly
### Root cause analysis
A single corrupted file may be a random event. Multiple corrupted files on the same storage device suggest a hardware problem:
- Check disk SMART data for signs of degradation
- Check recent I/O error logs
- Test the storage device with vendor diagnostic tools
- Consider proactively migrating data off the affected device
## Common mistakes
Storing hashes on the same disk as the data. If the disk fails catastrophically, you lose both the files and the hashes. Store fixity manifests on a separate storage system.
Hashing only at ingest and never again. A single hash at ingest proves the file was correct when received. It does not detect corruption that occurs afterward. Periodic verification is essential.
Using fast but weak hashes for "performance." CRC32 and Adler32 are fast but detect only a fraction of corruption patterns. Use SHA-256 — the performance cost is acceptable on modern hardware.
Not tracking verification timestamps. Without knowing when each file was last verified, you cannot assess your overall integrity coverage or prioritise verification of unverified files.
Ignoring file size as a pre-check. Checking file size before computing a hash is a free early-exit that catches truncated files, incomplete copies, and some corruption scenarios without any computation.
## Verification checklist

- Generate `SHA256SUMS` for your archive directory: `sha256sum /path/to/archives/* > SHA256SUMS`
- Verify immediately: `sha256sum -c SHA256SUMS` — all files should pass
- Schedule periodic re-verification (cron job or equivalent)
- Store `SHA256SUMS` on a separate storage system
- Set up alerting for any verification failure
- Test recovery: intentionally corrupt a test file and verify the detection and recovery pipeline works
## Related reading on wplus.net
- Tamper-Evident Downloads — signing for provenance verification
- Infrastructure hub — hosting and serving fundamentals
- Operations hub — monitoring and diagnostics