
From Durability Claims to Jepsen Evidence: Mixed-Fault Linearizability 45/45

Aydarbek Romanuly

- Last Updated: April 13, 2026


Durability claims are easy to write and hard to prove. In IoT infrastructure, that gap matters because real failures are the norm, not the exception.

In my previous article, I explained why durability-first design is necessary for event pipelines under crash and restart conditions. This follow-up is about the next step: moving from architecture claims to externally verifiable evidence.

Why this follow-up matters

If a system says “durable” or “strongly consistent,” teams deserve to know:

  • what failure modes were tested,
  • what invariants were checked,
  • how to reproduce the result.

Without that, we are still in “trust me” territory.

What we validated

For Ayder (an HTTP-native event log in C), we ran a Jepsen campaign focused on clustered failure behavior.

Fault modes:

  • mixed
  • partition-only
  • kill-only

Campaign shape:

  • multiple durations (short and long windows)
  • repeated independent runs per cell
  • strict claim path settings
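As a sketch only, a campaign shaped like this can be driven from a small matrix enumerator. The fault-mode names below come from this article; the durations and repeat count are illustrative placeholder values, not the actual campaign configuration.

```python
# Hypothetical campaign-matrix enumerator. The fault modes are the ones
# named in this article; the durations and repeat count are ASSUMED
# values for illustration, not the real campaign settings.
from itertools import product

FAULT_MODES = ["mixed", "partition-only", "kill-only"]
DURATIONS_S = [60, 600]   # short and long windows (assumed values)
REPEATS = 5               # independent runs per cell (assumed value)

def campaign_cells():
    """Yield one (mode, duration, run_index) tuple per scheduled run."""
    for mode, duration in product(FAULT_MODES, DURATIONS_S):
        for run in range(REPEATS):
            yield (mode, duration, run)

cells = list(campaign_cells())
print(len(cells))  # 3 modes x 2 durations x 5 repeats = 30 runs
```

Enumerating every cell up front makes "repeated independent runs per cell" auditable: the run metadata can record exactly which cells executed and which passed.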

Result:

  • mixed-fault linearizability: 45/45 pass

The point here is not “perfect forever.” The point is that the claim is tied to concrete, reproducible evidence.
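For readers unfamiliar with the invariant itself, here is a toy brute-force linearizability check for a single register. It is in the spirit of what a Jepsen-style checker verifies; real checkers such as Knossos use far more efficient search, and this sketch is not Ayder's or Jepsen's actual code.

```python
# Toy single-register linearizability check (brute force; illustrative
# only). Each op is a tuple (kind, value, start, end), where start/end
# are invocation and response times.
from itertools import permutations

def linearizable(history):
    """True if some total order of ops respects real time and register
    semantics: every read returns the most recently written value."""
    n = len(history)
    # (i, j) pairs where op i responded before op j was invoked:
    # any valid sequential order must place i before j.
    before = [(i, j) for i in range(n) for j in range(n)
              if i != j and history[i][3] < history[j][2]]
    for order in permutations(range(n)):
        pos = {op: k for k, op in enumerate(order)}
        if any(pos[i] >= pos[j] for i, j in before):
            continue  # violates real-time order
        current, legal = None, True
        for idx in order:
            kind, value, _, _ = history[idx]
            if kind == "write":
                current = value
            elif value != current:  # read must see the last write
                legal = False
                break
        if legal:
            return True
    return False

# A write that completes before a read starts must be visible to it.
ok  = [("write", 1, 0.0, 1.0), ("read", 1, 2.0, 3.0)]
bad = [("write", 1, 0.0, 1.0), ("read", None, 2.0, 3.0)]
print(linearizable(ok), linearizable(bad))  # True False
```

A "45/45 pass" means the checker found a valid sequential order for every observed history in every run; a single stale read under faults would have failed the cell.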

What changed technically before this pass

This was not a first-try success. We had to harden real failure paths, especially:

  • startup/recovery behavior around shared storage edge cases
  • readiness and quorum stabilization before run start
  • broker offset/ack/read ordering behavior under kill + mixed faults
  • harness behavior around local chaos privileges and pre-run healing

These are exactly the classes of issues that often slip through benchmark-driven development.
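The readiness/quorum stabilization item above can be sketched as a simple polling gate that blocks the workload generator until every node reports a stable quorum. The `/healthz` endpoint and its JSON field below are hypothetical stand-ins, not Ayder's actual API; the probe is injected so the gating logic is testable.

```python
# Pre-run stabilization gate: do not start the workload until every
# node reports quorum. The /healthz endpoint and "quorum" field are
# HYPOTHETICAL placeholders, not Ayder's real API.
import json
import time
import urllib.request

def http_quorum_probe(node):
    """Ask one node whether it currently sees a quorum (assumed API)."""
    try:
        url = f"http://{node}/healthz"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return bool(json.load(resp).get("quorum"))
    except OSError:
        return False  # node unreachable, e.g. still restarting

def wait_for_quorum(nodes, probe=http_quorum_probe,
                    timeout_s=120.0, interval_s=2.0):
    """Block until every node reports quorum; raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(probe(n) for n in nodes):
            return True
        time.sleep(interval_s)
    raise TimeoutError("cluster did not stabilize before run start")
```

Gating on stabilization keeps the campaign honest: a run that starts against a still-recovering cluster measures harness noise, not the system's behavior under the scheduled faults.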

Reproducibility artifacts

To keep the claim auditable, we published:

  • artifact bundle
  • checksum
  • run metadata and matrix summary inside the bundle

This allows independent review of what was run and what passed.

Why this is relevant to IoT teams

IoT workloads are especially sensitive to correctness drift:

  • intermittent networks
  • frequent process/node restarts
  • store-and-forward behavior at the edge
  • delayed reconciliation with central systems

In that environment, “mostly works” is not enough. You need confidence in ordering, commit semantics, and recovery correctness under disruption.
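As one illustration of commit semantics under disruption: a store-and-forward edge node should persist an event to stable storage before acknowledging it, so that an ack always implies the record survives a crash. This is a minimal sketch under assumed conventions (length-prefixed records, a single append-only file), not Ayder's actual on-disk format.

```python
# Store-and-forward sketch for an edge node: append, flush, fsync,
# and only then acknowledge. File layout (4-byte big-endian length
# prefix per record) is ASSUMED for illustration.
import os

def append_durably(log_path, record):
    """Append one length-prefixed record and force it to stable
    storage before returning; only after this may the sender be acked."""
    frame = len(record).to_bytes(4, "big") + record
    with open(log_path, "ab") as f:
        f.write(frame)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # force the OS to persist to disk
    return True  # safe point: acknowledge the producer now
```

The ordering is the whole point: crash between `write` and `fsync` may lose the record, but the producer was never acked, so store-and-forward retry recovers it; crash after `fsync` loses nothing.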

Lessons learned

  1. Durability requires testing failure transitions, not just steady state.
  2. Correctness campaigns should be designed as a product requirement, not an afterthought.
  3. Reproducible artifacts and checksums are part of trustworthy engineering communication.
  4. External critique is valuable: hard questions about AP/CP and guarantees improved both documentation and implementation quality.

Closing

The biggest shift was cultural as much as technical: from “we believe this is correct” to “here is the evidence and how to verify it.”

That shift is worth making for any IoT data system where correctness under failure matters more than headline benchmark numbers.
