Building a Durability-First Event Log That Survives Real Failures
- Last Updated: February 19, 2026
Aydarbek Romanuly



Modern systems generate streams of events everywhere: devices at the edge, gateways, backend services, and cloud workloads. What often gets overlooked is that failure is the normal state, not the exception, especially outside perfectly managed cloud environments.
Disk pressure, power loss, partial network partitions, process crashes, and restarts are daily reality in IoT, edge, and hybrid systems. Yet many event pipelines assume stable infrastructure, heavy runtimes, or complex operational setups.
This article shares lessons from building a durability-first event log, designed to behave predictably under failure, with a focus on correctness, operational simplicity, and realistic constraints rather than maximum feature breadth.
In many real systems, especially those touching hardware or edge deployments, you can’t assume stable disks, reliable power, clean shutdowns, or well-behaved networks.
Yet many popular event systems are optimized primarily for throughput and scale, with durability and recovery treated as secondary concerns or operationally expensive features.
From experience, the most painful incidents don’t come from a lack of throughput; they come from what happens around failure: lost writes, slow or manual recovery, and surprising behavior during restarts.
The question that motivated this work was simple:
What would an event log look like if durability, recovery, and simplicity were the first constraints rather than optional features?
The system described here (Ayder) follows a few strict principles:
- Durability first: Writes are acknowledged only after being safely persisted and replicated (configurable, but explicit). If a process is killed mid-write, the system must recover without data loss.
- Automatic recovery: A restart should not trigger rebalancing storms, operator playbooks, or manual cleanup. Recovery should be automatic and fast.
- Operational simplicity: A single static binary, no JVM, no external coordinators, no client libraries required to get started. If you can curl, you can produce and consume events.
- Honest measurement: P99.999 latency and behavior during unclean shutdowns are more informative than peak throughput numbers.
At its core, the system is a replicated, durable event log served over plain HTTP from a single static binary. No ZooKeeper, no KRaft controllers, no sidecars.
Clients produce and consume events directly over plain HTTP and manage their interaction with the log explicitly; there is no required client library.
This explicitness is intentional. It avoids hidden magic and makes failure behavior visible.
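To make that concrete, here is a minimal Go sketch of what producing an event over plain HTTP could look like. The endpoint path, payload shape, and response contents are illustrative assumptions, not Ayder's documented API; the same call is equally expressible as a one-line curl command.

```go
// Minimal producer sketch. The endpoint (/topics/{name}/events) and the
// response contents are illustrative assumptions, not Ayder's real API.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	event := []byte(`{"device":"sensor-42","temp_c":21.7}`)

	// The write only counts once the server acknowledges it, i.e. after it
	// has been persisted (and replicated, if configured).
	resp, err := http.Post(
		"http://localhost:8080/topics/telemetry/events", // hypothetical endpoint
		"application/json",
		bytes.NewReader(event),
	)
	if err != nil {
		log.Fatalf("produce failed: %v", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
		log.Fatalf("server rejected write: %s: %s", resp.Status, body)
	}
	fmt.Printf("acknowledged: %s\n", body) // e.g. the assigned offset
}
```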
Instead of relying on theoretical guarantees, the system ships with a Jepsen-style smoke test that can be run locally.
The test repeatedly kills processes with SIGKILL mid-write and restarts them. The invariants checked include that no acknowledged write is lost and that the log recovers to a consistent state after every restart.
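As a rough illustration of that loop, the Go sketch below starts a server process, produces under load, SIGKILLs it mid-write, and records how many writes were acknowledged before the crash. The binary name, flags, and endpoint are assumptions for illustration, and the read-back verification step is omitted here.

```go
// Crash-recovery smoke loop (sketch): start the server, write under load,
// SIGKILL it mid-write, restart on the next round, and count acknowledged
// writes that must survive. Binary name, flags, and endpoint are assumed.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os/exec"
	"sync/atomic"
	"syscall"
	"time"
)

func produce(i int) bool {
	body := bytes.NewReader([]byte(fmt.Sprintf(`{"seq":%d}`, i)))
	resp, err := http.Post("http://localhost:8080/topics/smoke/events", "application/json", body)
	if err != nil {
		return false // unacknowledged writes may legitimately be lost
	}
	resp.Body.Close()
	return resp.StatusCode < 300 // acknowledged: must survive the crash
}

func main() {
	for round := 0; round < 10; round++ {
		srv := exec.Command("./ayder", "--data-dir", "./smoke-data") // hypothetical binary and flags
		if err := srv.Start(); err != nil {
			panic(err)
		}
		time.Sleep(2 * time.Second) // crude: wait for the server to come up

		var acked int64
		stop := make(chan struct{})
		go func() {
			for i := 0; ; i++ {
				select {
				case <-stop:
					return
				default:
					if produce(i) {
						atomic.AddInt64(&acked, 1)
					}
				}
			}
		}()

		time.Sleep(3 * time.Second)
		srv.Process.Signal(syscall.SIGKILL) // kill mid-write, no graceful shutdown
		srv.Wait()
		close(stop)

		// Invariant: after restart, every acknowledged write must still be
		// readable. Reading the log back and comparing is left out here.
		fmt.Printf("round %d: %d acknowledged writes must survive restart\n",
			round, atomic.LoadInt64(&acked))
	}
}
```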
If something breaks, the failure is reproducible. This has been more valuable than synthetic benchmarks alone.
One of the most revealing tests involved restarting a 3-node cluster holding roughly 8 million offsets.
Observed recovery time: ~40–50 seconds.
No operator intervention. No manual reassignment.
This contrasts sharply with experiences where cluster restarts take hours or require human coordination.
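One way to measure that kind of recovery time, sketched below in Go, is simply to restart the cluster and poll every node until it reports healthy again. The /healthz endpoint and node addresses are hypothetical placeholders.

```go
// Measure cluster recovery time by polling each node until it responds
// healthily after a restart. The /healthz endpoint is a hypothetical stand-in.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	nodes := []string{
		"http://node-1:8080/healthz",
		"http://node-2:8080/healthz",
		"http://node-3:8080/healthz",
	}
	client := &http.Client{Timeout: 2 * time.Second}

	start := time.Now()
	for _, url := range nodes {
		for {
			resp, err := client.Get(url)
			if err == nil && resp.StatusCode == http.StatusOK {
				resp.Body.Close()
				break // this node has recovered
			}
			if resp != nil {
				resp.Body.Close()
			}
			time.Sleep(500 * time.Millisecond)
		}
	}
	fmt.Printf("all nodes healthy after %s\n", time.Since(start).Round(time.Second))
}
```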
Performance was measured under real network conditions, not loopback, and with durability enabled.
The long client-side tail came primarily from network and kernel scheduling rather than server work. Server-side work remained consistently sub-2 ms even at extreme percentiles.
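For readers who want to reproduce this style of measurement, a minimal client-side latency harness might look like the following Go sketch: it records per-request durations and reports tail percentiles. The endpoint is the same illustrative assumption used in the earlier examples.

```go
// Client-side latency harness sketch: time each produce call and report
// tail percentiles. Endpoint and payload are illustrative assumptions.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sort"
	"time"
)

func main() {
	const n = 100000
	latencies := make([]time.Duration, 0, n)
	payload := []byte(`{"seq":0}`)

	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Post("http://localhost:8080/topics/bench/events",
			"application/json", bytes.NewReader(payload))
		if err == nil {
			resp.Body.Close()
		}
		latencies = append(latencies, time.Since(start))
	}

	// Sort once, then read off the requested quantiles.
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	percentiles := []struct {
		name string
		q    float64
	}{{"p50", 0.50}, {"p99", 0.99}, {"p99.9", 0.999}, {"p99.999", 0.99999}}

	for _, p := range percentiles {
		idx := int(p.q * float64(len(latencies)-1))
		fmt.Printf("%s: %s\n", p.name, latencies[idx])
	}
}
```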
Perhaps the most surprising result came from running the same system on consumer ARM hardware, where it continued to behave predictably with durability enabled. This reinforced the observation that a durability-first design does not depend on heavyweight infrastructure, and it makes local HA testing far more accessible.
HTTP is not the fastest protocol on paper, and that’s fine.
What HTTP provides is universality: any language can speak it, and everyday tooling (curl, logs, proxies) works out of the box. Measured results showed that HTTP parsing was not the bottleneck. The system spent more time waiting on disk sync and network replication than parsing requests.
In practice, this tradeoff improved operability far more than it hurt performance.
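To show what that operability looks like on the consumer side, here is a minimal Go sketch of a polling consumer that tracks its own offset explicitly. The offset-based query parameter and response shape are assumptions for illustration, not a documented API.

```go
// Minimal consumer sketch: poll for events after a given offset over plain
// HTTP. The /topics/{name}/events?from={offset} shape is an assumption.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type event struct {
	Offset  int64           `json:"offset"`
	Payload json.RawMessage `json:"payload"`
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	var next int64 // the consumer tracks its own offset explicitly

	for {
		url := fmt.Sprintf("http://localhost:8080/topics/telemetry/events?from=%d", next)
		resp, err := client.Get(url)
		if err != nil {
			log.Printf("fetch failed, retrying: %v", err)
			time.Sleep(time.Second)
			continue
		}

		var batch []event
		if err := json.NewDecoder(resp.Body).Decode(&batch); err != nil {
			log.Printf("decode failed: %v", err)
		}
		resp.Body.Close()

		for _, e := range batch {
			fmt.Printf("offset %d: %s\n", e.Offset, e.Payload)
			next = e.Offset + 1 // advance only past events we have processed
		}
		if len(batch) == 0 {
			time.Sleep(500 * time.Millisecond) // nothing new yet
		}
	}
}
```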
This approach is not ideal for every workload.
It does make sense for edge, IoT, and hybrid deployments where durability, predictable recovery, and operational simplicity matter more than peak throughput or feature breadth.
It’s not intended to cover every streaming use case or to compete on feature breadth. The goal is a predictable, durable core, not maximal abstraction.
At this stage, the most valuable input is not feature requests, but reality checks.
I’m looking for 2–3 teams willing to talk through the failure modes and operational constraints they actually face.
This is not a sales ask, and not a request to migrate production systems. Even a 20-minute conversation about constraints would be incredibly valuable.
Most distributed systems look elegant until something crashes at the wrong time.
Building with failure as the default constraint changes design decisions dramatically, from storage layout to APIs to recovery logic. The results may not be glamorous, but they’re often far more useful in practice.
If you’re operating or building event-driven systems under imperfect conditions, I’d love to compare notes.