The problem
The platform was running on MSSQL — fine in the early days when the codebase was a monolith. By the time we had a fleet of microservices, each with their own access patterns, the database choice was actively costing us. Hosting bills were scaling worse than linearly. The query patterns the product roadmap wanted (relational joins, window functions, JSONB) all worked better on PostgreSQL. Most importantly: the team had standardized on Postgres for newer services, leaving us with a polyglot persistence layer where half the codebase couldn't talk to the other half's data without going through HTTP.
A clean migration could fix all three. A messy one could take down production for a week.
Hosting bills weren't the real cost. Workarounds were.
The constraints
- Zero downtime. Production traffic was 24/7. Any cutover that required a maintenance window was disqualified.
- Six services, varying activity. Some were hot paths in the request flow; others ran nightly batches. Each had its own quirks, its own consumers, its own assumptions.
- Feature work continued.The team couldn't freeze. The migration ran alongside the ongoing roadmap.
- No big-bang. Cutting over six services simultaneously would have been a single point of failure for the whole platform. The cutover had to be incremental and reversible.
- Bundled tooling change. The monorepo also needed to migrate from npm to pnpm. We bundled both into the same change window — one breaking-change period, not two.
The approach in one paragraph
Two parallel branches — one for each stack — running side by side for ~3 months. A new shared data-access package, built from scratch, owns every Postgres connection in the new world; nothing else talks to the database directly. Service-by-service cutover: each service moves through dual-write, then read cutover, then write cutover. Zod validates every row crossing the package boundary. My own services migrate first as the reference implementation; the rest of the team uses them as templates.
Three decisions worth defending
A shared data-access package, built from scratch
The new world needed a single way to talk to PostgreSQL. The options were existing ORMs, each service owning its own access layer, or building something internal.
- Considered
- Prisma — generates from a schema, migrations baked in
- TypeORM / Drizzle — typed query builders, more flexible
- Each service owns its own data layer (zero shared package)
- Build a shared internal package on top of node-postgres
- Chose
Shared internal package on raw node-postgres + Zod
- Reason
Existing ORMs assumed a single-database posture and didn't cleanly handle the dual-stack window we needed. Letting each service roll its own would have produced six dialects of the same problem. The shared package gave us repository-pattern access, runtime Zod validation on every row leaving the database, dual ESM/CJS so legacy services could adopt it, and
golang-migratefor SQL migrations. ~600 lines of infrastructure that paid for itself within the first cutover.
Branch-level isolation for the dual-stack period
For three months, two stacks lived in parallel. The straightforward approach is to put them in the same branch behind feature flags. The right approach turned out to be putting them in separate branches.
- Considered
- Big-bang cutover (all services at once)
- Same-branch flag: both stacks in main, gated by env var
- Branch-level isolation: parallel branches, regular sync
- Chose
Branch-level isolation
- Reason
Same-branch flagging meant every commit had to compile against both stacks — npm + MSSQL + old patterns AND pnpm + Postgres + new patterns. Tests had to run against both. The CI matrix doubled. Worse, mixing the two paths made every diff hard to review. Two parallel branches let each stack be tested cleanly in isolation. We rebased the migration branch onto main weekly. Cutting a service over meant moving it from one branch to the other.
Reference implementation, then templates
Six services don't need six migration philosophies. The first one had to set the pattern.
- Considered
- Each team migrates their own services in parallel
- One engineer migrates everything sequentially
- Migrate one as reference, then others use it as a template
- Chose
Reference implementation
- Reason
Migrating my own services first established the conventions: schema layout, error handling, transaction boundaries, validation posture. The rest of the team could then port their services without re-litigating decisions. The first cutover took six weeks; subsequent cutovers took 3–5 days each.
The Zod-validated abstraction
Every row leaving the database goes through a Zod schema. This was the most useful single decision in the migration — it caught silent bugs the old code had been hiding for years.
const DeviceSchema = z.object({
id: z.string().uuid(),
tenant_id: z.string().uuid(),
name: z.string(),
status: z.enum(["online", "offline", "stale"]),
last_seen_at: z.string().datetime().nullable(),
attributes: z.record(z.unknown()),
});
type Device = z.infer;
export class DeviceRepository {
async findByTenant(tenant: string): Promise {
const rows = await this.db.query(
"SELECT * FROM device WHERE tenant_id = $1",
[tenant],
);
return rows.map((row) => DeviceSchema.parse(row));
}
}
Zod runs afterthe database returns. Any row that doesn't match the schema throws a typed error with the field path and the observed value. Wired into structured logging, those errors became the most useful migration signal we had.
The cutover, per service
Each of the six services moved through the same five phases:
- 1. Schema design.Postgres schema modeled on the access patterns we'd wished MSSQL had supported.
- 2. Backfill.Idempotent script copying historical data. Re-run safe — by the third service we'd have run them dozens of times.
- 3. Dual-write. Writes go to both databases; reads stay on MSSQL. Asymmetric divergence shows up in monitoring within minutes.
- 4. Read cutover. Reads switch to Postgres. If anything regresses, we revert — the old database still has every write.
- 5. Write cutover. Writes go to Postgres only. MSSQL frozen as a read-only fallback for ~2 weeks, then decommissioned per service.
The dual-write phase was the most informative. Anywhere the two databases diverged in real time, we had a bug — usually in the Zod schema or in a translation function. We held in dual-write for ~10 days per service before promoting to read cutover.
- Services migrated
- 6+
- Type bugs caught
- 12+
- Parallel branches
- 3 mo
- Rows lost
- 0
What I'd do differently
Three things stand out, none of which were obvious at the start.
Invest in observability before, not during.The first cutover was painful because we didn't have structured logging for Zod validation errors yet. Field-level error logging existed; aggregating and alerting on it didn't. By the third cutover we had per-service dashboards for validation failure rates. That should have been the first thing built, not the third.
Order services by data-flow dependency, not team readiness.We migrated services in the order their teams were ready. Twice this bit us — an upstream service's schema changed during a downstream service's cutover, forcing a re-do. A topological sort of data dependencies would have caught this in planning.
Write the rollback runbook before the first cutover, not after. By the third cutover I realized the rollback procedure had several untested assumptions — most importantly, that we could re-route reads back to MSSQL inside seconds. That actually took ~30s of pod restarts. Knowing this in advance would have changed how we sized the rollback budget for each cutover window.