The problem
The platform was a multi-tenant IoT system: a large fleet of devices streamed telemetry, and every event needed to resolve to a facility: the warehouse, floor, dock, or zone where the device was physically located. Without that resolution, the data is coordinates without context, and operators can't answer the only question they ever care about: where is my asset right now?
The problem wasn't resolving facility — every service did that. The problem was that every service did that. Each microservice had its own copy of the lookup logic, often subtly different. One read directly from PostgreSQL with a recursive CTE on every event. Another ran an in-memory cache that was eventually consistent at best. The dashboard layer reconciled facility on the way out. Three sources of truth for the same question — and each had its own bugs.
Inconsistencies bled through to customer-facing dashboards in ways that were hard to attribute and harder to fix. Worse: the flat data model couldn't represent multi-root containment — a device sitting in zone A and warehouse 3, simultaneously. Several customers needed that. We had bolted on workarounds. Every workaround made the inconsistencies worse.
The constraints
Whatever I built had to absorb the existing reality, not replace it overnight. Specifically:
- Sub-millisecond lookups. Real-time pipeline budget was already tight; the resolver would be in the hot path of every telemetry event. Recursive SQL on every lookup was a non-starter.
- Tenant overrides without forking the schema. Tenants needed to override global facilities (rename them, restructure a sub-tree, add tenant-only nodes) without each customer getting their own database. The mechanism had to be data, not deployment.
- Multi-root trees. Zones overlap warehouses and floors; a device legitimately belongs to multiple parent chains. The flat model couldn't express this and never would.
- Zero-downtime cutover. Production traffic kept flowing. Migration had to be incremental, reversible, and gated per tenant.
- Cache coherence across replicas. Resolver state had to converge on tree changes within seconds, across N pods.
The architecture in one paragraph
A three-tier resolution chain — Tenant scope overlays Base tree overlays Global scope — backed by an in-memory tree per service replica, with Redis pub/sub invalidating the cache on any tree mutation. Multi-root trees are first-class; a device can resolve into multiple parent chains, with deterministic precedence. A single shared data-access package owns the algorithm; every microservice consumes the resolver from it.
Three decisions worth defending
In-memory tree over recursive SQL
The naive design is a flat facility table with parent_id self-references and a recursive CTE on every lookup. That works at scale-of-tens. It does not work in a hot path that runs thousands of times per second per pod.
- Considered
  - Recursive CTE per lookup (Postgres `WITH RECURSIVE`)
  - Materialized path columns (e.g. /global/factory-3/zone-a)
  - Closure table (one row per ancestor relationship)
  - In-memory tree, hydrated once per replica, invalidated via pub/sub
- Chose: In-memory tree per service replica
- Reason: Hot-path lookups become an in-memory parent-chain traversal: O(1) per hop, so O(depth) per lookup, with no database round trip. Hydration is a single query at boot. Invalidation is a Redis pub/sub message, ~10 ms to propagate across replicas. The alternative (recursive CTEs) was already pegging connection-pool wait time on the existing services; adding more callers would have taken it down.
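To make the trade-off concrete, here is a minimal sketch of the hydrate-once, walk-in-memory idea. It assumes a single parent per node for simplicity, and the names `FacilityRow`, `hydrateIndex`, and `chainFor` are hypothetical; this is an illustration, not the production package.

```typescript
// Hypothetical row shape: one row per facility, loaded once at boot.
interface FacilityRow {
  id: string;
  parentId: string | null; // null marks a root
  label: string;
}

// Hydration: index every row by id so each hop is a single Map lookup.
function hydrateIndex(rows: FacilityRow[]): Map<string, FacilityRow> {
  const index = new Map<string, FacilityRow>();
  for (const row of rows) index.set(row.id, row);
  return index;
}

// Walk from a leaf to its root. Each hop is O(1); the walk is O(depth).
// The `seen` set stops the walk if the data accidentally contains a cycle.
function chainFor(leafId: string, index: Map<string, FacilityRow>): FacilityRow[] {
  const chain: FacilityRow[] = [];
  const seen = new Set<string>();
  let current = index.get(leafId);
  while (current && !seen.has(current.id)) {
    chain.push(current);
    seen.add(current.id);
    current = current.parentId === null ? undefined : index.get(current.parentId);
  }
  return chain;
}

// Usage: three illustrative rows, resolved without touching the database.
const sampleIndex = hydrateIndex([
  { id: "dc-1", parentId: null, label: "Distribution center 1" },
  { id: "wh-3", parentId: "dc-1", label: "Warehouse 3" },
  { id: "zone-a", parentId: "wh-3", label: "Zone A" },
]);
const zoneChain = chainFor("zone-a", sampleIndex).map((n) => n.id);
```

Here `zoneChain` is ordered leaf to root, and an unknown id yields an empty chain rather than an error.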
Overlay semantics, not data forks
Tenants need to override the global tree. The lazy way: copy the tree per tenant, mutate. The cost compounds — every tenant inherits a shadow copy of every change you didn't mean to push to them.
Instead: tenant-specific facilities live in their own scope and shadow matching nodes in the base tree. A lookup walks the chain in priority order: tenant overlay first, base tree second, global scope last. The base tree stays canonical. Tenants compose on top of it without mutating it.
- Considered
  - Per-tenant copy of the base tree, fully forked
  - Patch records (deltas applied at read time)
  - Overlay scope with shadowing semantics
- Chose: Overlay scope with shadowing
- Reason: Forks couple tenant data to base tree changes: every base update risks polluting tenant-specific overrides. Patches require replaying every change on every read. Overlay shadowing is declarative: the base tree is canonical, tenants layer on top, and resolution is deterministic. Same model GitHub uses for forks vs. branches. Same model Kubernetes uses for namespace overrides.
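The priority order described above (tenant overlay first, base tree second, global scope last) can be sketched as a chain of map lookups. The layout below, one map per scope, and the names `ScopeLayers` and `lookupFacility` are assumptions made for illustration only.

```typescript
interface ScopedFacility {
  id: string;
  label: string;
}

// Hypothetical in-memory layout: one map per scope, keyed by facility id.
interface ScopeLayers {
  // tenantId -> (base facility id -> overlay node that shadows it)
  tenantOverlays: Map<string, Map<string, ScopedFacility>>;
  base: Map<string, ScopedFacility>;
  global: Map<string, ScopedFacility>;
}

// Resolve one facility id for one tenant, walking scopes in priority order.
// The base and global maps are only read, never mutated.
function lookupFacility(
  layers: ScopeLayers,
  tenantId: string,
  facilityId: string,
): ScopedFacility | undefined {
  return (
    layers.tenantOverlays.get(tenantId)?.get(facilityId) ??
    layers.base.get(facilityId) ??
    layers.global.get(facilityId)
  );
}

// Usage: tenant-a renames Warehouse 3; every other tenant sees the base node.
const scopeLayers: ScopeLayers = {
  tenantOverlays: new Map<string, Map<string, ScopedFacility>>([
    ["tenant-a", new Map<string, ScopedFacility>([["wh-3", { id: "wh-3-a", label: "North Hub" }]])],
  ]),
  base: new Map<string, ScopedFacility>([["wh-3", { id: "wh-3", label: "Warehouse 3" }]]),
  global: new Map<string, ScopedFacility>([["hq", { id: "hq", label: "Headquarters" }]]),
};
const forTenantA = lookupFacility(scopeLayers, "tenant-a", "wh-3");
const forTenantB = lookupFacility(scopeLayers, "tenant-b", "wh-3");
```

Because the overlay only shadows at read time, removing a tenant's override is a single delete in that tenant's map, and the base node reappears unchanged.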
Multi-root trees, first-class
A device sitting in a cold-storage zone also sits inside the warehouse that contains the zone, which also sits in the distribution center. Three valid parent chains, all true at once.
The flat model couldn't express this — parent_id is scalar. The resolver schema treats containment as a relation, not a property; a node can have multiple parents, each tagged with its relation kind (physical, operational, access-zone). Resolution returns the full set of valid chains; the consumer picks the chain it cares about. Most consumers want physical; some (like compliance audits) want all of them.
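As a rough sketch of containment as a relation, the snippet below enumerates every leaf-to-root chain over relation-tagged edges and lets the consumer pick by relation kind. The names `ContainmentEdge`, `TaggedChain`, and `enumerateChains` are hypothetical, and tagging a chain with the relation of its leaf-level edge is an assumption of this sketch.

```typescript
type RelationKind = "physical" | "operational" | "access-zone";

// One containment edge; a child may appear in many edges.
interface ContainmentEdge {
  parentId: string;
  childId: string;
  relation: RelationKind;
}

// Node ids ordered leaf to root, tagged with the relation of the leaf-level edge.
interface TaggedChain {
  relation: RelationKind | null; // null when the leaf has no parents
  ids: string[];
}

// Enumerate every leaf-to-root path. Assumes the containment graph is acyclic;
// an edge that would revisit a node already on the path is skipped.
function enumerateChains(leafId: string, edges: ContainmentEdge[]): TaggedChain[] {
  const parentsOf = new Map<string, ContainmentEdge[]>();
  for (const e of edges) {
    const list = parentsOf.get(e.childId) ?? [];
    list.push(e);
    parentsOf.set(e.childId, list);
  }

  const results: TaggedChain[] = [];
  const walk = (id: string, path: string[], first: RelationKind | null): void => {
    const nextPath = [...path, id];
    const ups = (parentsOf.get(id) ?? []).filter((e) => !nextPath.includes(e.parentId));
    if (ups.length === 0) {
      results.push({ relation: first, ids: nextPath });
      return;
    }
    // Multi-parent nodes branch: one recursive walk per parent edge.
    for (const e of ups) walk(e.parentId, nextPath, first ?? e.relation);
  };
  walk(leafId, [], null);
  return results;
}

// Usage: a device with a physical chain and an access-zone chain.
const demoEdges: ContainmentEdge[] = [
  { parentId: "wh-3", childId: "device-7", relation: "physical" },
  { parentId: "dc-1", childId: "wh-3", relation: "physical" },
  { parentId: "cold-zone", childId: "device-7", relation: "access-zone" },
];
const demoChains = enumerateChains("device-7", demoEdges);
const physicalChain = demoChains.find((c) => c.relation === "physical");
```

A compliance-style consumer would read all of `demoChains`; a dashboard would take only `physicalChain`.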
The data model
Three tables do most of the work. node holds the nodes themselves, node_relation represents the multi-parent containment, and node_overlay binds tenant-specific overrides to base nodes.
CREATE TYPE node_scope AS ENUM ('global', 'base', 'tenant_overlay');

CREATE TABLE node (
    id           UUID PRIMARY KEY,
    tenant_id    UUID,                 -- NULL for global / base
    scope        node_scope NOT NULL,
    name         TEXT NOT NULL,
    kind         TEXT NOT NULL,        -- 'warehouse', 'zone', 'dock'
    attributes   JSONB NOT NULL DEFAULT '{}',
    is_transient BOOLEAN NOT NULL DEFAULT false,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE node_relation (
    parent_id UUID NOT NULL REFERENCES node(id),
    child_id  UUID NOT NULL REFERENCES node(id),
    relation  TEXT NOT NULL,           -- 'physical', 'operational', ...
    PRIMARY KEY (parent_id, child_id, relation)
);

CREATE TABLE node_overlay (
    tenant_id  UUID NOT NULL,
    base_id    UUID NOT NULL REFERENCES node(id),
    overlay_id UUID NOT NULL REFERENCES node(id),
    PRIMARY KEY (tenant_id, base_id)
);

CREATE INDEX ON node (tenant_id, scope) WHERE tenant_id IS NOT NULL;
CREATE INDEX ON node_relation (child_id);
The resolution algorithm
Every service that needed to know "where is this device?" called the same function from the shared data-access package. The whole resolver is small — fewer than 80 lines — because the schema does the heavy lifting.
type Chain = Node[]; // ordered: device-leaf → root

export function resolve(
  deviceId: string,
  tenantId: string,
  tree: Tree,
): { primary: Chain; alternates: Chain[] } {
  const leaf = tree.leafFor(deviceId);
  if (!leaf) return { primary: [], alternates: [] };

  // Walk every parent edge; multi-parent nodes branch.
  const chains = walkParents(leaf, tree);

  // Apply tenant overlay shadowing: overlay nodes replace their base nodes.
  const shadowed = chains.map((chain) =>
    chain.map((n) => tree.overlayFor(tenantId, n) ?? n),
  );

  // Filter out chains that pass through transient nodes above the leaf.
  const valid = shadowed.filter(
    (chain) => !chain.slice(1).some((n) => n.is_transient),
  );

  // Pick "physical" containment as primary; expose the rest.
  const primary =
    valid.find((c) => c[0]?.relation === "physical") ?? valid[0] ?? [];
  const alternates = valid.filter((c) => c !== primary);
  return { primary, alternates };
}

The tree itself was held in process memory and rebuilt at boot. On any write to a node, relation, or overlay, the writing service published a Redis message; every subscriber re-hydrated the affected sub-tree (not the whole tree). Steady-state CPU on the resolver was negligible. Steady-state memory was ~3 MB per tenant, dominated by the JSONB attributes blob.
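The invalidation flow can be illustrated with an in-process stand-in for the pub/sub channel. Nothing below models a real Redis client; `LocalChannel`, `ReplicaCache`, and the message shape are hypothetical names chosen for this sketch, and the "sub-tree" is reduced to a list of node ids.

```typescript
// Hypothetical message published after any write to a node, relation, or overlay.
interface InvalidationMessage {
  subtreeRootId: string;
}

type InvalidationListener = (msg: InvalidationMessage) => void;

// In-process stand-in for a pub/sub channel (Redis plays this role in the
// system described above; no Redis client API is modeled here).
class LocalChannel {
  private listeners: InvalidationListener[] = [];
  subscribe(listener: InvalidationListener): void {
    this.listeners.push(listener);
  }
  publish(msg: InvalidationMessage): void {
    for (const listener of this.listeners) listener(msg);
  }
}

// Per-replica cache: sub-tree root id -> ids of the nodes in that sub-tree.
class ReplicaCache {
  private subtrees = new Map<string, string[]>();
  private loadSubtree: (rootId: string) => string[];
  reloads = 0;

  constructor(loadSubtree: (rootId: string) => string[], channel: LocalChannel) {
    this.loadSubtree = loadSubtree;
    channel.subscribe((msg) => this.refresh(msg.subtreeRootId));
  }

  // Re-hydrate one sub-tree; the rest of the cache is left untouched.
  refresh(rootId: string): void {
    this.subtrees.set(rootId, this.loadSubtree(rootId));
    this.reloads += 1;
  }

  get(rootId: string): string[] {
    return this.subtrees.get(rootId) ?? [];
  }
}

// Usage: two replicas hydrate at boot, then converge after a published write.
const sourceOfTruth = new Map<string, string[]>([["wh-3", ["wh-3", "zone-a"]]]);
const loader = (rootId: string): string[] => [...(sourceOfTruth.get(rootId) ?? [])];
const demoChannel = new LocalChannel();
const replicaOne = new ReplicaCache(loader, demoChannel);
const replicaTwo = new ReplicaCache(loader, demoChannel);
replicaOne.refresh("wh-3"); // boot hydration
replicaTwo.refresh("wh-3");

sourceOfTruth.set("wh-3", ["wh-3", "zone-a", "zone-b"]); // a writer mutates the tree
demoChannel.publish({ subtreeRootId: "wh-3" });          // then announces it
```

The sketch delivers messages synchronously and never drops one. A real channel can do neither, which is exactly the seam where partial-failure handling matters.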
The 12-point validation suite
The dangerous part of consolidating fifteen separate resolvers into one is that the new one might be subtly different from the old ones — and the difference might only manifest as wrong dashboard data. No errors, no exceptions, just numbers that are quietly wrong.
Before flipping a single tenant over, I built a validation harness that replayed the last 30 days of telemetry through both the old and the new resolvers and asserted twelve invariants on every event. The full list lives in the codebase; the spirit of it:
- Resolution exists. Every device has a facility — no orphans.
- Single primary chain. Multi-root resolution always returns exactly one primary; the rest are alternates.
- Transient nodes don't leak. No chain that traverses an is_transient node is ever returned as primary.
- Tenant scope respected. No tenant ever sees a facility from another tenant's overlay.
- Old vs new agree. For every event, the new resolver's primary chain matches the old resolver's output, modulo the bugs we knew the old one had.
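A few of the invariants above can be sketched as a per-event check that returns the names of whatever it violates. This is an illustration under assumed shapes (`ResolvedNode`, `Resolution`, `ReplayEvent`), not the actual harness; the single-primary invariant is carried by the return type here, and known legacy bugs are modeled as a set of device ids to skip.

```typescript
// Hypothetical shapes; the real harness and its twelve invariants live in the codebase.
interface ResolvedNode {
  id: string;
  tenantId: string | null; // null for global / base nodes
  isTransient: boolean;
}
interface Resolution {
  primary: ResolvedNode[]; // leaf -> root; exactly one primary by construction
  alternates: ResolvedNode[][];
}
interface ReplayEvent {
  deviceId: string;
  tenantId: string;
}
type ResolverFn = (e: ReplayEvent) => Resolution;

// Check one replayed event and return the names of the violated invariants.
function checkEvent(
  e: ReplayEvent,
  oldResolver: ResolverFn,
  newResolver: ResolverFn,
  knownOldBugs: Set<string>, // device ids where the old resolver is known to be wrong
): string[] {
  const next = newResolver(e);
  const failures: string[] = [];

  if (next.primary.length === 0) failures.push("resolution-exists");

  // Transient check above the leaf, mirroring the resolver's own filter.
  if (next.primary.slice(1).some((n) => n.isTransient)) failures.push("transient-leak");

  // No node owned by a different tenant may appear in any returned chain.
  const allChains = [next.primary, ...next.alternates];
  const leaksTenant = allChains.some((chain) =>
    chain.some((n) => n.tenantId !== null && n.tenantId !== e.tenantId),
  );
  if (leaksTenant) failures.push("tenant-scope");

  const ids = (chain: ResolvedNode[]): string => chain.map((n) => n.id).join(">");
  if (!knownOldBugs.has(e.deviceId) && ids(oldResolver(e).primary) !== ids(next.primary)) {
    failures.push("old-new-agree");
  }
  return failures;
}

// Usage: one agreeing candidate and one that leaks another tenant's node.
const devNode: ResolvedNode = { id: "device-7", tenantId: null, isTransient: false };
const whNode: ResolvedNode = { id: "wh-3", tenantId: null, isTransient: false };
const foreignNode: ResolvedNode = { id: "wh-x", tenantId: "tenant-b", isTransient: false };
const legacyResolver: ResolverFn = () => ({ primary: [devNode, whNode], alternates: [] });
const candidateResolver: ResolverFn = () => ({ primary: [devNode, whNode], alternates: [] });
const leakyResolver: ResolverFn = () => ({ primary: [devNode, foreignNode], alternates: [] });
const sampleEvent: ReplayEvent = { deviceId: "device-7", tenantId: "tenant-a" };
const okFailures = checkEvent(sampleEvent, legacyResolver, candidateResolver, new Set<string>());
const badFailures = checkEvent(sampleEvent, legacyResolver, leakyResolver, new Set<string>());
```

Returning failure names instead of throwing lets a replay run to completion and report every violation in one pass.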
The suite caught three data-integrity issues that would have shipped silently. One was an overlay row pointing to a base node that had been deleted; the resolver would have returned the deleted node's name to the customer. The other two were about is_transient propagation in edge cases I hadn't anticipated.
The whole point of validation isn't the cases you imagined. It's the cases you didn't.
The migration: 954 lines, per-tenant cutover
A single 954-line PostgreSQL script created the new schema, backfilled from the legacy tables, and built the indexes. It was idempotent — safe to re-run — because zero-downtime migrations have to be. Halfway through the rollout I needed to fix a backfill bug and re-run on a subset of tenants without affecting the ones that had already cut over.
The cutover itself was per-tenant, gated on a feature flag that defaulted to off; flipping it for a tenant routed every read through the new resolver. The flag was per-tenant and per-environment, so I could validate against staging tenants for two weeks before a single production tenant moved.
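The gating can be pictured as a thin router in front of both resolvers. The flag store below is a hypothetical in-memory set keyed by environment and tenant; `CutoverFlags` and `makeRouter` are illustrative names, not the real feature-flag system.

```typescript
type Environment = "staging" | "production";

// Hypothetical flag store keyed by "environment:tenantId"; absent means off.
class CutoverFlags {
  private enabled = new Set<string>();
  enable(env: Environment, tenantId: string): void {
    this.enabled.add(`${env}:${tenantId}`);
  }
  disable(env: Environment, tenantId: string): void {
    this.enabled.delete(`${env}:${tenantId}`);
  }
  isOn(env: Environment, tenantId: string): boolean {
    return this.enabled.has(`${env}:${tenantId}`);
  }
}

type FacilityLookup = (deviceId: string, tenantId: string) => string;

// Route each read through the new resolver only when the tenant's flag is on.
function makeRouter(
  flags: CutoverFlags,
  env: Environment,
  legacyLookup: FacilityLookup,
  newLookup: FacilityLookup,
): FacilityLookup {
  return (deviceId, tenantId) =>
    flags.isOn(env, tenantId)
      ? newLookup(deviceId, tenantId)
      : legacyLookup(deviceId, tenantId);
}

// Usage: tenant-a is cut over in staging only; production still reads legacy.
const cutoverFlags = new CutoverFlags();
const legacyLookup: FacilityLookup = () => "legacy";
const newLookup: FacilityLookup = () => "new";
const stagingRouter = makeRouter(cutoverFlags, "staging", legacyLookup, newLookup);
const productionRouter = makeRouter(cutoverFlags, "production", legacyLookup, newLookup);
cutoverFlags.enable("staging", "tenant-a");
const stagingRead = stagingRouter("device-7", "tenant-a");
const productionRead = productionRouter("device-7", "tenant-a");
```

Because the flag is checked on every read, calling `disable` reverses the cutover for that tenant immediately, without a deploy.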
- Services unified: 15
- Lines, migration: 954
- Validation invariants: 12
- Production rollbacks: 0
What I'd do differently
The architecture stands. The thing I'd change is process: I should have invested in the validation harness first, before writing a single line of resolver code. I built it second, after the resolver was already most of the way done — and the cases I caught with it were cases I'd already missed in design. The harness wasn't a check on my work; it was the work, and I treated it as overhead.
Second: I'd expose the resolver's alternate chains explicitly through a typed interface from day one. I added them as a return value mid-implementation, after the second customer asked for access-zone resolution alongside physical. They should have been first-class from the start; their absence shaped six weeks of consumer code that I had to update afterward.
Third: more tests at the boundaries. The interior of the resolver is well-tested. The seams — the Redis pub/sub invalidation, the cache re-hydration on partial failure — relied on integration tests I didn't write until production showed me where they should have been.