
Building a Durability-First Event Log That Survives Real Failures

Aydarbek Romanuly

- Last Updated: February 19, 2026


Modern systems generate streams of events everywhere: devices at the edge, gateways, backend services, and cloud workloads. What often gets overlooked is that failure is the normal state, not the exception, especially outside perfectly managed cloud environments.

Disk pressure, power loss, partial network partitions, process crashes, and restarts are a daily reality in IoT, edge, and hybrid systems. Yet many event pipelines assume stable infrastructure, heavy runtimes, or complex operational setups.

This article shares lessons from building a durability-first event log, designed to behave predictably under failure, with a focus on correctness, operational simplicity, and realistic constraints rather than maximum feature breadth.


The Core Problem: Failure Isn’t an Edge Case

In many real systems, especially those touching hardware or edge deployments, you can’t assume:

  • stable network connectivity
  • graceful shutdowns
  • unlimited disk
  • a dedicated ops team
  • homogeneous x86 servers

Yet many popular event systems are optimized primarily for throughput and scale, with durability and recovery treated as secondary concerns or operationally expensive features.

From experience, the most painful incidents don’t come from a lack of throughput; they come from:

  • unclear recovery semantics
  • long restart times
  • manual intervention after crashes
  • partial data loss that’s hard to detect

The question that motivated this work was simple:

What would an event log look like if durability, recovery, and simplicity were the first constraints, not optional features?


Design Principles

The system described here (Ayder) follows a few strict principles:

1. Durability by Default

Writes are acknowledged only after being safely persisted and replicated (configurable, but explicit). If a process is killed mid-write, the system must recover without data loss.
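As a rough illustration only (this is not Ayder’s actual write path, and the replica interface here is hypothetical), the ordering this principle implies looks roughly like this in Python:

    import os

    # Sketch: acknowledge a write only after it is fsync'd locally and
    # confirmed by a majority of the cluster. Illustrative, not real code.
    def append_durably(fd, record: bytes, replicas) -> int:
        offset = os.lseek(fd, 0, os.SEEK_END)      # append-only: write at the tail
        os.write(fd, record)
        os.fsync(fd)                               # persist locally first
        acks = sum(1 for r in replicas if r.replicate(offset, record))  # hypothetical replica API
        majority = (len(replicas) + 1) // 2 + 1    # cluster size includes this node
        if acks + 1 < majority:
            raise RuntimeError("no majority; write must not be acknowledged")
        return offset                              # only now is the client acknowledged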

2. Crash Recovery Must Be Boring

A restart should not trigger rebalancing storms, operator playbooks, or manual cleanup. Recovery should be automatic and fast.

3. Operational Simplicity Matters

A single static binary: no JVM, no external coordinators, no client libraries required to get started. If you can curl, you can produce and consume events.

4. Measure the Worst Case, Not the Average

P99.999 latency and unclean shutdown behavior are more informative than peak throughput numbers.
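One practical note: tail percentiles that deep are only meaningful with enough samples; P99.999 needs on the order of 100,000+ measurements before it says anything beyond the observed maximum. A minimal nearest-rank computation, for reference:

    import math

    def percentile(samples_ms, p):
        """Nearest-rank percentile; p in (0, 100]."""
        ordered = sorted(samples_ms)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]

    # percentile(latencies, 99.999) needs roughly 100 / (100 - p) = 100,000+
    # samples before it differs from max(latencies).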


Architecture Overview (High Level)

At its core, the system is:

  • an append-only log with partitions and monotonically increasing offsets
  • replicated via Raft consensus (3/5/7 node clusters)
  • persisted using sealed append-only files (AOF)
  • accessed through a plain HTTP API

No ZooKeeper, no KRaft controllers, no sidecars.

Clients:

  • produce raw bytes via HTTP POST
  • consume via offset-based pulls
  • explicitly commit offsets

This explicitness is intentional. It avoids hidden magic and makes failure behavior visible.
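A sketch of what this looks like from a client, using only the Python standard library; the host, paths, and response fields below are hypothetical, since the exact API shape isn’t covered here:

    import json
    import urllib.request

    BASE = "http://localhost:8080"   # hypothetical address and routes

    def produce(topic, partition, payload: bytes) -> int:
        req = urllib.request.Request(
            f"{BASE}/topics/{topic}/partitions/{partition}/records",
            data=payload, method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["offset"]       # offset assigned by the log

    def consume(topic, partition, offset, max_records=100):
        url = (f"{BASE}/topics/{topic}/partitions/{partition}/records"
               f"?offset={offset}&max={max_records}")
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)                 # list of {offset, payload}

    def commit(topic, partition, group, offset):
        body = json.dumps({"group": group, "offset": offset}).encode()
        req = urllib.request.Request(
            f"{BASE}/topics/{topic}/partitions/{partition}/commit",
            data=body, method="POST")
        with urllib.request.urlopen(req):
            pass                                   # committing is an explicit, separate call

Because everything is plain HTTP, the same calls can be reproduced with curl when something goes wrong.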


Failure as a First-Class Test Case

Instead of relying on theoretical guarantees, the system ships with a Jepsen-style smoke test that can be run locally.

The test repeatedly:

  • kills nodes with SIGKILL mid-write
  • restarts them in random order
  • introduces optional network delay and jitter
  • verifies invariants

Invariants checked:

  • no gaps in offsets
  • no duplicates when idempotency keys are used
  • per-partition ordering preserved
  • committed offsets monotonic across restarts

If something breaks, the failure is reproducible. This has been more valuable than synthetic benchmarks alone.
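The offset invariants in particular are cheap to verify from consumed records. A rough sketch of such a check, assuming each record carries its offset and an optional idempotency key:

    def check_partition(records):
        # records: consumed from one partition, in the order the log returned them,
        # each a dict with "offset" and optionally "idempotency_key" (assumed shape).
        seen_keys = set()
        prev = None
        for rec in records:
            off = rec["offset"]
            if prev is not None:
                assert off == prev + 1, f"gap or reorder at offset {off}"   # no gaps, order preserved
            key = rec.get("idempotency_key")
            if key is not None:
                assert key not in seen_keys, f"duplicate for key {key}"     # no duplicates
                seen_keys.add(key)
            prev = off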


Recovery Behavior in Practice

One of the most revealing tests involved a 3-node cluster with ~8 million offsets:

  1. A follower is killed mid-write
  2. Leader continues accepting writes
  3. Follower is restarted
  4. Follower replays its local AOF
  5. It requests missing offsets from the leader
  6. Leader streams the delta
  7. Cluster becomes fully healthy

Observed recovery time: ~40–50 seconds
No operator intervention. No manual reassignment.

This contrasts sharply with experiences where cluster restarts take hours or require human coordination.
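For illustration, the follower-side catch-up (steps 4–6 above) reduces to: replay the local AOF, then pull the missing range from the leader in batches. The sketch below assumes a hypothetical leader read API; the real delta transfer runs through Raft replication rather than this simplified loop:

    def catch_up(local_log, leader, batch=1000):
        # After replaying its AOF, the follower knows its last durable offset.
        next_offset = local_log.last_offset() + 1
        while True:
            records = leader.read(next_offset, batch)   # hypothetical leader read API
            if not records:
                return                                  # fully caught up
            for rec in records:
                local_log.append(rec)                   # re-persist locally
            next_offset = records[-1]["offset"] + 1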


Performance Under Real Constraints

Performance was measured under real network conditions, not loopback, and with durability enabled.

Cloud (x86) — 3-Node Cluster

  • Sync-majority writes (2/3 nodes)
  • ~50K msg/s with client P99 ≈ 3.5ms
  • server P99.999 ≈ 1.2ms

The long client-side tail came primarily from network and kernel scheduling. Server-side work remained consistently sub-2ms even at extreme percentiles.

ARM64 (Snapdragon X Elite, WSL2, Battery)

Perhaps the most surprising result came from running the same system on consumer ARM hardware:

  • Snapdragon X Elite laptop
  • WSL2 Ubuntu
  • Running on battery
  • 3-node cluster on a single machine

Result:

  • ~106K msg/s
  • server P99.999 ≈ 0.65ms

This reinforced a few observations:

  • ARM64 is more than viable for server-style workloads
  • efficient C code benefits significantly from modern ARM cores
  • WSL2 overhead for async I/O is lower than often assumed

It also makes local HA testing far more accessible.


Why HTTP?

HTTP is not the fastest protocol on paper, and that’s fine.

What HTTP provides:

  • debuggability (curl, logs, proxies)
  • no client SDK lock-in
  • easier integration in constrained environments
  • predictable behavior across languages

Measured results showed that HTTP parsing was not the bottleneck. The system spent more time waiting on disk sync and network replication than parsing requests.

In practice, this tradeoff improved operability far more than it hurt performance.


Where This Is Useful (and Where It Isn’t)

This approach is not ideal for every workload.

It does make sense for:

  • edge → cloud pipelines
  • device or gateway event ingestion
  • systems where restart time matters more than raw throughput
  • teams without dedicated infra operators
  • environments where JVM-based stacks are heavy

It’s not intended to:

  • replace existing Kafka deployments overnight
  • act as a SQL database
  • provide magic exactly-once semantics without client discipline

The goal is a predictable, durable core, not maximal abstraction.


What I’m Looking for Next

At this stage, the most valuable input is not feature requests, but reality checks.

I’m looking for 2–3 teams willing to:

  • sanity-check this approach against their real constraints
  • share how they think about durability, recovery, and ops pain
  • optionally run a small pilot or failure test

This is not a sales ask, and not a request to migrate production systems. Even a 20-minute conversation about constraints would be incredibly valuable.


Closing Thoughts

Most distributed systems look elegant until something crashes at the wrong time.

Building with failure as the default constraint changes design decisions dramatically, from storage layout to APIs to recovery logic. The results may not be glamorous, but they’re often far more useful in practice.

If you’re operating or building event-driven systems under imperfect conditions, I’d love to compare notes.
