
From Durability Claims to Jepsen Evidence: Mixed-Fault Linearizability 45/45

Aydarbek Romanuly

- Last Updated: April 13, 2026


Durability claims are easy to write and hard to prove. In IoT infrastructure, that gap matters because real failures are the norm, not the exception.

In my previous article, I explained why durability-first design is necessary for event pipelines under crash and restart conditions. This follow-up is about the next step: moving from architecture claims to externally verifiable evidence.

Why this follow-up matters

If a system says “durable” or “strongly consistent,” teams deserve to know:

  • what failure modes were tested,
  • what invariants were checked,
  • how to reproduce the result.

Without that, we are still in “trust me” territory.

What we validated

For Ayder (an HTTP-native event log in C), we ran a Jepsen campaign focused on clustered failure behavior.

Fault modes:

  • mixed
  • partition-only
  • kill-only

Campaign shape:

  • multiple durations (short and long windows)
  • repeated independent runs per cell
  • strict claim path settings
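As a sketch only, a campaign shaped like this can be driven from a small matrix enumerator. The fault-mode names below come from this article; the durations and repeat count are illustrative placeholder values, not the actual campaign configuration.

```python
# Hypothetical campaign-matrix enumerator. The fault modes are the ones
# named in this article; the durations and repeat count are ASSUMED
# values for illustration, not the real campaign settings.
from itertools import product

FAULT_MODES = ["mixed", "partition-only", "kill-only"]
DURATIONS_S = [60, 600]   # short and long windows (assumed values)
REPEATS = 5               # independent runs per cell (assumed value)

def campaign_cells():
    """Yield one (mode, duration, run_index) tuple per scheduled run."""
    for mode, duration in product(FAULT_MODES, DURATIONS_S):
        for run in range(REPEATS):
            yield (mode, duration, run)

cells = list(campaign_cells())
print(len(cells))  # 3 modes x 2 durations x 5 repeats = 30 runs
```

Enumerating every cell up front makes "repeated independent runs per cell" auditable: the run metadata can record exactly which cells executed and which passed.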

Result:

  • mixed-fault linearizability: 45/45 pass

The point here is not “perfect forever.” The point is that the claim is tied to concrete, reproducible evidence.
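For readers unfamiliar with the invariant itself, here is a toy brute-force linearizability check for a single register. It is in the spirit of what a Jepsen-style checker verifies; real checkers such as Knossos use far more efficient search, and this sketch is not Ayder's or Jepsen's actual code.

```python
# Toy single-register linearizability check (brute force; illustrative
# only). Each op is a tuple (kind, value, start, end), where start/end
# are invocation and response times.
from itertools import permutations

def linearizable(history):
    """True if some total order of ops respects real time and register
    semantics: every read returns the most recently written value."""
    n = len(history)
    # (i, j) pairs where op i responded before op j was invoked:
    # any valid sequential order must place i before j.
    before = [(i, j) for i in range(n) for j in range(n)
              if i != j and history[i][3] < history[j][2]]
    for order in permutations(range(n)):
        pos = {op: k for k, op in enumerate(order)}
        if any(pos[i] >= pos[j] for i, j in before):
            continue  # violates real-time order
        current, legal = None, True
        for idx in order:
            kind, value, _, _ = history[idx]
            if kind == "write":
                current = value
            elif value != current:  # read must see the last write
                legal = False
                break
        if legal:
            return True
    return False

# A write that completes before a read starts must be visible to it.
ok  = [("write", 1, 0.0, 1.0), ("read", 1, 2.0, 3.0)]
bad = [("write", 1, 0.0, 1.0), ("read", None, 2.0, 3.0)]
print(linearizable(ok), linearizable(bad))  # True False
```

A "45/45 pass" means the checker found a valid sequential order for every observed history in every run; a single stale read under faults would have failed the cell.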

What changed technically before this pass

This was not a first-try success. We had to harden real failure paths, especially:

  • startup/recovery behavior around shared storage edge cases
  • readiness and quorum stabilization before run start
  • broker offset/ack/read ordering behavior under kill + mixed faults
  • harness behavior around local chaos privileges and pre-run healing

These are exactly the classes of issues that often slip through benchmark-driven development.
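The readiness/quorum stabilization item above can be sketched as a simple polling gate that blocks the workload generator until every node reports a stable quorum. The `/healthz` endpoint and its JSON field below are hypothetical stand-ins, not Ayder's actual API; the probe is injected so the gating logic is testable.

```python
# Pre-run stabilization gate: do not start the workload until every
# node reports quorum. The /healthz endpoint and "quorum" field are
# HYPOTHETICAL placeholders, not Ayder's real API.
import json
import time
import urllib.request

def http_quorum_probe(node):
    """Ask one node whether it currently sees a quorum (assumed API)."""
    try:
        url = f"http://{node}/healthz"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return bool(json.load(resp).get("quorum"))
    except OSError:
        return False  # node unreachable, e.g. still restarting

def wait_for_quorum(nodes, probe=http_quorum_probe,
                    timeout_s=120.0, interval_s=2.0):
    """Block until every node reports quorum; raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(probe(n) for n in nodes):
            return True
        time.sleep(interval_s)
    raise TimeoutError("cluster did not stabilize before run start")
```

Gating on stabilization keeps the campaign honest: a run that starts against a still-recovering cluster measures harness noise, not the system's behavior under the scheduled faults.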

Reproducibility artifacts

To keep the claim auditable, we published:

  • artifact bundle
  • checksum
  • run metadata and matrix summary inside the bundle

This allows independent review of what was run and what passed.

Why this is relevant to IoT teams

IoT workloads are especially sensitive to correctness drift:

  • intermittent networks
  • frequent process/node restarts
  • store-and-forward behavior at the edge
  • delayed reconciliation with central systems

In that environment, “mostly works” is not enough. You need confidence in ordering, commit semantics, and recovery correctness under disruption.
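As one illustration of commit semantics under disruption: a store-and-forward edge node should persist an event to stable storage before acknowledging it, so that an ack always implies the record survives a crash. This is a minimal sketch under assumed conventions (length-prefixed records, a single append-only file), not Ayder's actual on-disk format.

```python
# Store-and-forward sketch for an edge node: append, flush, fsync,
# and only then acknowledge. File layout (4-byte big-endian length
# prefix per record) is ASSUMED for illustration.
import os

def append_durably(log_path, record):
    """Append one length-prefixed record and force it to stable
    storage before returning; only after this may the sender be acked."""
    frame = len(record).to_bytes(4, "big") + record
    with open(log_path, "ab") as f:
        f.write(frame)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # force the OS to persist to disk
    return True  # safe point: acknowledge the producer now
```

The ordering is the whole point: crash between `write` and `fsync` may lose the record, but the producer was never acked, so store-and-forward retry recovers it; crash after `fsync` loses nothing.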

Lessons learned

  1. Durability requires testing failure transitions, not just steady state.
  2. Correctness campaigns should be designed as a product requirement, not an afterthought.
  3. Reproducible artifacts and checksums are part of trustworthy engineering communication.
  4. External critique is valuable: hard questions about AP/CP and guarantees improved both documentation and implementation quality.

Closing

The biggest shift was cultural as much as technical: from “we believe this is correct” to “here is the evidence and how to verify it.”

That shift is worth making for any IoT data system where correctness under failure matters more than headline benchmark numbers.
