chaitanyadeshpande.com / work
Case study · Distributed systems

275 lines deleted. Zero race conditions added.

A subtle race was producing duplicate writes in production. The instinctive fix — distributed locks everywhere — would have worked. But it would have masked the real problem: shared ownership of state that should have had one owner.

Role
Diagnosis + architect
Scope
3+ services touched
Outcome
275 lines deleted
Duration
~10 weeks

The problem

Two services were writing to the same conceptual record from different code paths. One on the event-driven path: telemetry came in, that service updated the record. One on the request-response path: an operator action arrived through the API, that service updated the same record. Sometimes both events arrived within milliseconds of each other.

The symptoms were exactly the kind that wear engineers down:

  • Intermittent duplicate writes — sometimes once a day, sometimes never for a week.
  • Inconsistent records that violated invariants the schema didn't enforce.
  • Customers reporting "I just clicked X but the dashboard says Y."
  • Zero local reproducibility. The race only opened at production scale, with two services landing on the same key in the same millisecond.
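
The collision is easier to reason about in a toy model. The sketch below is not the production code: an in-memory map stands in for the record store, and every name in it (`store`, `readModifyWrite`, `telemetryPath`, `operatorPath`) is hypothetical. It forces the interleaving that production only hit occasionally, where both paths read the record before either one writes.

```typescript
// Toy model of the race. An in-memory map stands in for the shared record
// store; all names are hypothetical and nothing here is the production code.
type HealthRecord = { status: string; lastSeen: number; version: number };

const store = new Map<string, HealthRecord>();

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Read-modify-write with a gap between the read and the write.
// That gap is where the other writer slips in.
async function readModifyWrite(
  key: string,
  gapMs: number,
  patch: Partial<HealthRecord>,
): Promise<void> {
  const current =
    store.get(key) ?? { status: "unknown", lastSeen: 0, version: 0 };
  await sleep(gapMs); // simulated network or processing delay
  store.set(key, { ...current, ...patch, version: current.version + 1 });
}

// Event-driven path: telemetry updates lastSeen.
const telemetryPath = (key: string) =>
  readModifyWrite(key, 5, { lastSeen: 1700000000 });

// Request-response path: an operator action updates status.
const operatorPath = (key: string) =>
  readModifyWrite(key, 10, { status: "maintenance" });

async function demo(): Promise<HealthRecord> {
  store.clear();
  // Both writers read the same (empty) state before either one writes.
  await Promise.all([telemetryPath("device-1"), operatorPath("device-1")]);
  return store.get("device-1")!;
}
```

After `demo()` resolves, the record carries the operator's status but `lastSeen` is back at 0 and `version` reads 1 after two writes: the telemetry update was overwritten by a writer that never saw it.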

The instinctive fix was distributed locks: lock the record, serialize the writes, done. That would have worked — and it would have papered over the actual problem.

The instinct was right. The diagnosis was wrong.
Diagram: Service A (Ingress) and Service B (Enricher) both write to the same health record (row). Two panels: "Before: duplicate writes" and "After: serialized" (Redis lock applied).
Before: collisions on the same record. After: serialized writes via per-key Redis locks.

The diagnosis

I spent about three weeks on this before the right question surfaced. Not "how do we serialize the writes?" — but "why are two services writing this record at all?"

Walking the history: each service had been written by a different engineer at a different time. Each engineer had assumed their service was authoritative for that record. The duplicate writes weren't really a timing bug; they were two services each claiming authority over data that should never have had two authorities.

Locks would have made the symptom go away. They would have left the underlying ambiguity in place — and the next engineer who needed to update that record would have inherited the same ambiguity, and added their own service to the contention.

Three decisions worth defending

Ownership over concurrency control

Decision
Considered
  • Distributed lock per record (Redis SET NX)
  • Optimistic concurrency control via row version columns
  • Single-owner refactor: one service is authoritative; others go through it
Chose

Single-owner refactor

Reason

Locks would have papered over the symptom. Optimistic concurrency with retries would have introduced new failure modes (write storms under contention). The actual problem was conceptual: there was no single authoritative owner for this record. Refactoring to a single-owner design — one service writes, others call its API to read or request changes — eliminated the race entirely. As a side effect, we deleted 275 lines of duplicated logic across the affected services.
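
A minimal sketch of the single-owner shape, assuming an in-memory store and hypothetical names (`RecordOwner`, `requestChange`); the real services and their API are not shown in this write-up. The point is that exactly one component performs the read-modify-write, and every other caller submits a change request instead of touching the record. Within one process, chaining requests per key is enough to serialize them.

```typescript
// Single-owner sketch: only RecordOwner writes; everyone else requests changes.
// Names and the in-memory store are hypothetical stand-ins.
type HealthRecord = { status: string; lastSeen: number; version: number };
type Change = Partial<Omit<HealthRecord, "version">>;

class RecordOwner {
  private readonly records = new Map<string, HealthRecord>();
  // Tail of the pending-work chain for each key; serializes changes per key.
  private readonly queues = new Map<string, Promise<unknown>>();

  read(key: string): HealthRecord | undefined {
    return this.records.get(key);
  }

  // The only write path. Changes to the same key are applied one at a time,
  // in arrival order, so no change is computed from a stale read.
  requestChange(key: string, change: Change): Promise<HealthRecord> {
    const previous: Promise<unknown> =
      this.queues.get(key) ?? Promise.resolve();
    const next = previous.then(() => this.apply(key, change));
    // Keep the chain usable even if one change fails.
    this.queues.set(key, next.catch(() => undefined));
    return next;
  }

  private async apply(key: string, change: Change): Promise<HealthRecord> {
    const current =
      this.records.get(key) ?? { status: "unknown", lastSeen: 0, version: 0 };
    await Promise.resolve(); // stand-in for the real database round trip
    const updated = { ...current, ...change, version: current.version + 1 };
    this.records.set(key, updated);
    return updated;
  }
}

// Both former writers now go through the owner instead of writing directly.
async function demoSingleOwner(): Promise<HealthRecord> {
  const owner = new RecordOwner();
  await Promise.all([
    owner.requestChange("device-1", { lastSeen: 1700000000 }), // telemetry path
    owner.requestChange("device-1", { status: "maintenance" }), // operator path
  ]);
  return owner.read("device-1")!;
}
```

With the same two near-simultaneous updates as the original race, both changes survive and the version advances twice. This only covers a single process; several replicas of the owner still need cross-process coordination.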

Redis locks for legitimate concurrency, not for the diagnosis

After the ownership refactor, there were still operations that legitimately had concurrent contention — multiple replicas of the single-owner service updating the same key from different worker threads. For those, distributed locks were the right answer.

Decision
Considered
  • Postgres row-level locks on every write
  • Use a message queue for serialization
  • Redis SET NX EX for fast acquire/release
Chose

Redis SET NX EX

Reason

Operators watched a live dashboard, so queue latency would have been felt in the UI. Postgres row locks would have added connection-pool pressure on the hot path. Redis acquire/release was sub-millisecond. The design also degrades gracefully: if Redis is unreachable, writes proceed without the enrichment that the lock was protecting, and the data self-heals on the next heartbeat.
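
The fallback can be sketched as a thin wrapper around the write. This is an illustration under assumptions, not the shipped code: the lock is injected so the example does not depend on a particular Redis client, and `writeRecord`, `LockFn`, and `LockUnavailableError` are hypothetical names. It assumes the lock helper can tell "Redis is unreachable" apart from "someone else holds the lock".

```typescript
// Graceful-degradation sketch. The lock is injected so the example does not
// depend on a particular Redis client; all names are hypothetical.
type LockFn = <T>(
  key: string,
  ttlSeconds: number,
  fn: () => Promise<T>,
) => Promise<T>;

// Thrown by the lock function when Redis itself cannot be reached,
// as opposed to the lock simply being held by another worker.
class LockUnavailableError extends Error {
  readonly code = "LOCK_UNAVAILABLE";
}

const isLockUnavailable = (err: unknown): boolean =>
  err instanceof Error &&
  (err as Error & { code?: string }).code === "LOCK_UNAVAILABLE";

type WriteResult = { enriched: boolean };

async function writeRecord(
  key: string,
  withLock: LockFn,
  plainWrite: () => Promise<void>,
  enrichedWrite: () => Promise<void>,
): Promise<WriteResult> {
  try {
    // Normal path: the enrichment runs under the per-key lock.
    await withLock(`lock:${key}`, 5, enrichedWrite);
    return { enriched: true };
  } catch (err) {
    // Contention and write failures are not degradation; let them surface.
    if (!isLockUnavailable(err)) throw err;
    // Degraded path: Redis is down, so skip the enrichment the lock protects
    // and still persist the write. The next heartbeat repairs the record.
    await plainWrite();
    return { enriched: false };
  }
}
```

Returning `enriched: false` rather than swallowing the fallback silently gives callers something to count, which makes it visible how often the degraded path actually runs.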

Decoupling via middleware

While I was in the code, I added a middleware abstraction layer between the affected services and the shared utility package they all imported. At the time it felt like over-engineering.

Decision
Considered
  • Direct imports: services use the shared package as-is
  • Version-pin the shared package per service
  • Middleware abstraction layer between services and shared code
Chose

Middleware abstraction layer

Reason

Two months later, the shared utility package shipped a breaking change. Services that had decoupled via the middleware kept working; services that had directly imported the broken function would have gone down. The decoupling investment paid for itself the first time it was tested.
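
A sketch of what such a layer can look like. The shared package's real API is not shown in this write-up, so both "versions" of it below (`sharedUtilsV1`, `sharedUtilsV2`) and the `IdNormalizer` interface are invented for illustration. Services code against a narrow interface they own, and only the adapter knows the shape of the shared package.

```typescript
// Middleware/adapter sketch. The shared package's real API is not shown in
// this write-up, so both "versions" below are invented for illustration.

// Hypothetical shared package before the breaking change.
const sharedUtilsV1 = {
  normalizeId: (raw: string): string => raw.trim().toLowerCase(),
};

// Hypothetical shared package after the breaking change:
// renamed function, options object instead of a positional argument.
const sharedUtilsV2 = {
  normalize: (opts: { value: string; lowercase: boolean }): string => {
    const trimmed = opts.value.trim();
    return opts.lowercase ? trimmed.toLowerCase() : trimmed;
  },
};

// The narrow interface services code against. It is owned by the services,
// not by the shared package.
interface IdNormalizer {
  normalizeId(raw: string): string;
}

// One adapter per shared-package version; only this layer changes on upgrade.
const adapterV1: IdNormalizer = {
  normalizeId: (raw) => sharedUtilsV1.normalizeId(raw),
};
const adapterV2: IdNormalizer = {
  normalizeId: (raw) =>
    sharedUtilsV2.normalize({ value: raw, lowercase: true }),
};

// Service code: identical before and after the upgrade.
function recordKey(normalizer: IdNormalizer, deviceId: string): string {
  return `health:${normalizer.normalizeId(deviceId)}`;
}
```

Swapping `adapterV1` for `adapterV2` is the whole upgrade from the service's point of view; `recordKey` and everything built on it stay untouched.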

The Redis lock pattern

For the legitimate concurrency cases, the lock implementation needs to be careful about two things: TTL (so a crashed holder doesn't block the key forever) and ownership (so you don't accidentally release someone else's lock).

ts · distributed lock (sketch)
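import { randomUUID } from "node:crypto";
// Assumption: an ioredis client. Other clients take different
// arguments for set() and eval(), so adapt those two calls as needed.
import Redis from "ioredis";

const redis = new Redis();

class LockContentionError extends Error {
  constructor(key: string) {
    super(`lock already held: ${key}`);
    this.name = "LockContentionError";
  }
}

// Compare-and-delete, run atomically inside Redis:
// delete the key only if it still holds our token.
const RELEASE_SCRIPT = `
  if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
  else
    return 0
  end
`;

async function withLock<T>(
  key: string,
  ttlSeconds: number,
  fn: () => Promise<T>,
): Promise<T> {
  const lockId = randomUUID();
  // SET key value EX ttl NX: succeeds only if the key does not exist yet.
  const acquired = await redis.set(key, lockId, "EX", ttlSeconds, "NX");
  if (!acquired) throw new LockContentionError(key);

  try {
    return await fn();
  } finally {
    // Only release if we still own the lock. The TTL might already have
    // expired and a different holder might have acquired it in the meantime.
    await redis.eval(RELEASE_SCRIPT, 1, key, lockId);
  }
}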
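
Why the release needs the token check is easiest to see with an in-memory stand-in. The class below is a hypothetical helper for illustration only: it models the acquire and the compare-and-delete release, and replaces real TTL timers with an explicit `expire` call.

```typescript
// In-memory stand-in for the two Redis operations the lock relies on.
// Hypothetical helper for illustration; TTL expiry is triggered by hand.
class FakeLockStore {
  private readonly held = new Map<string, string>();

  // Models SET key token NX: succeeds only if nobody holds the key.
  acquire(key: string, token: string): boolean {
    if (this.held.has(key)) return false;
    this.held.set(key, token);
    return true;
  }

  // Models TTL expiry: the key disappears regardless of who holds it.
  expire(key: string): void {
    this.held.delete(key);
  }

  // Models the release script: delete only if the stored token matches.
  release(key: string, token: string): boolean {
    if (this.held.get(key) !== token) return false;
    this.held.delete(key);
    return true;
  }

  holder(key: string): string | undefined {
    return this.held.get(key);
  }
}

function demoStaleRelease(): {
  staleReleaseSucceeded: boolean;
  holder: string | undefined;
} {
  const locks = new FakeLockStore();
  locks.acquire("lock:device-1", "worker-a"); // A takes the lock
  locks.expire("lock:device-1"); // A stalls past the TTL
  locks.acquire("lock:device-1", "worker-b"); // B takes the lock
  // A finishes late and tries to release. The token check protects B.
  const staleReleaseSucceeded = locks.release("lock:device-1", "worker-a");
  return { staleReleaseSucceeded, holder: locks.holder("lock:device-1") };
}
```

With a blind `DEL`, worker A's late release would have removed worker B's lock. The token check only protects the release, though: if the protected work outlives the TTL, two holders can still run at once, so the TTL should comfortably exceed the expected duration of the work.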

The aftermath

Three things worth measuring:

Lines deleted
275
Race incidents since
0
Regression caught later
1

The 275 lines were the duplicated logic across the services whose ownership had previously been ambiguous. Race incidents went to zero, covering both the pre-existing duplicate writes and any new ones, because the ambiguity that caused them was gone, not just suppressed. The one regression caught was the breaking change in the shared package two months later; the middleware decoupling absorbed it without anyone noticing.

What I'd do differently

Invest in distributed tracing first. The diagnosis took three weeks. With trace IDs propagating across both the event path and the request path, the same investigation would have taken a few hours. You can't see the race without tracing the two sides into a single timeline.
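
A rough sketch of what "a single timeline" buys, written without any tracing library. In practice a tracing system would collect and correlate these spans; the `Span` fields and both function names are hypothetical. Once reads and writes from both paths carry a trace ID and land in one ordered list, the race has a recognizable signature: a write from one trace arriving while another trace still has an open read on the same record.

```typescript
// Trace-correlation sketch without a tracing library.
// The span shape and function names are hypothetical.
type Span = {
  traceId: string; // propagated from the originating event or request
  path: "event" | "request";
  recordKey: string;
  op: "read" | "write";
  atMs: number; // timestamp in milliseconds
};

// Put both paths on one timeline for a given record, ordered by time.
function timelineFor(spans: Span[], recordKey: string): Span[] {
  return spans
    .filter((s) => s.recordKey === recordKey)
    .sort((a, b) => a.atMs - b.atMs);
}

// Flag the race signature: a write from one trace landing between another
// trace's read and that trace's own write. Returns [writer, staleReader] pairs.
function findInterleavedWrites(timeline: Span[]): Array<[string, string]> {
  const openReads = new Set<string>(); // traces that have read but not yet written
  const conflicts: Array<[string, string]> = [];
  for (const span of timeline) {
    if (span.op === "read") {
      openReads.add(span.traceId);
      continue;
    }
    // A write closes this trace's own read and invalidates every other open read.
    openReads.delete(span.traceId);
    openReads.forEach((otherTrace) => {
      conflicts.push([span.traceId, otherTrace]);
    });
  }
  return conflicts;
}
```

Fed the four spans of the original incident (two reads, then two writes), this reports one pair: the first writer and the trace whose read had gone stale.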

Document ownership in API contracts. The original ambiguity wasn't a bug; it was an undocumented assumption that two engineers had each made differently. If service ownership of records had been part of the API contract, and reviewed whenever either service changed, the situation would never have accumulated.
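
One lightweight way to make that assumption explicit is an ownership table that write paths check against. This is a sketch under assumptions, not something the original services had: the record types and service names below are placeholders.

```typescript
// Ownership-registry sketch: one place that states who may write each record
// type. Record and service names are placeholders, not the real ones.
const RECORD_OWNERS = {
  healthRecord: "service-a",
  operatorAction: "service-b",
} as const;

type RecordType = keyof typeof RECORD_OWNERS;

// Guard for write paths: fails loudly when a non-owner tries to write.
function assertOwner(service: string, recordType: RecordType): void {
  const owner: string = RECORD_OWNERS[recordType];
  if (owner !== service) {
    throw new Error(
      `${service} may not write ${recordType}; owner is ${owner}`,
    );
  }
}
```

Because the table is plain code, any change of ownership shows up in a diff and goes through review like any other contract change.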

Write the graceful-degradation tests during development, not after Redis goes down once. The fallback path (Redis unreachable, writes proceed without enrichment, self-heal on next heartbeat) was correct in code and untested until production showed it was correct. Test it before you need it.