chaitanyadeshpande.com / work
Case study · Zero-downtime migration

Six microservices, two databases, three months in parallel.

The cost of running on a database that no longer fits your access patterns isn't paid in latency — it's paid in workarounds. Here's what it took to move six production services off MSSQL onto PostgreSQL while traffic kept flowing.

Role
Architect & lead
Scope
6+ services migrated
Duration
~3 months parallel
Outcome
Zero data loss

The problem

The platform was running on MSSQL — fine in the early days when the codebase was a monolith. By the time we had a fleet of microservices, each with their own access patterns, the database choice was actively costing us. Hosting bills were scaling worse than linearly. The query patterns the product roadmap wanted (relational joins, window functions, JSONB) all worked better on PostgreSQL. Most importantly: the team had standardized on Postgres for newer services, leaving us with a polyglot persistence layer where half the codebase couldn't talk to the other half's data without going through HTTP.

A clean migration could fix all three. A messy one could take down production for a week.

Hosting bills weren't the real cost. Workarounds were.

The constraints

  • Zero downtime. Production traffic was 24/7. Any cutover that required a maintenance window was disqualified.
  • Six services, varying activity. Some were hot paths in the request flow; others ran nightly batches. Each had its own quirks, its own consumers, its own assumptions.
  • Feature work continued.The team couldn't freeze. The migration ran alongside the ongoing roadmap.
  • No big-bang. Cutting over six services simultaneously would have been a single point of failure for the whole platform. The cutover had to be incremental and reversible.
  • Bundled tooling change. The monorepo also needed to migrate from npm to pnpm. We bundled both into the same change window — one breaking-change period, not two.

The approach in one paragraph

Two parallel branches — one for each stack — running side by side for ~3 months. A new shared data-access package, built from scratch, owns every Postgres connection in the new world; nothing else talks to the database directly. Service-by-service cutover: each service moves through dual-write, then read cutover, then write cutover. Zod validates every row crossing the package boundary. My own services migrate first as the reference implementation; the rest of the team uses them as templates.

diagram
MSSQLlegacyPostgreSQLtargetshared data-storeZod validation12+ TYPE MISMATCHEScaught before silent-bugSERVICE-BY-SERVICE CUTOVER — 6 SERVICES, 3 MONTHS, ZERO DATA LOSS
Source → Zod-validated package → target. Invalid rows reject before they reach production.

Three decisions worth defending

A shared data-access package, built from scratch

The new world needed a single way to talk to PostgreSQL. The options were existing ORMs, each service owning its own access layer, or building something internal.

Decision
Considered
  • Prisma — generates from a schema, migrations baked in
  • TypeORM / Drizzle — typed query builders, more flexible
  • Each service owns its own data layer (zero shared package)
  • Build a shared internal package on top of node-postgres
Chose

Shared internal package on raw node-postgres + Zod

Reason

Existing ORMs assumed a single-database posture and didn't cleanly handle the dual-stack window we needed. Letting each service roll its own would have produced six dialects of the same problem. The shared package gave us repository-pattern access, runtime Zod validation on every row leaving the database, dual ESM/CJS so legacy services could adopt it, and golang-migrate for SQL migrations. ~600 lines of infrastructure that paid for itself within the first cutover.

Branch-level isolation for the dual-stack period

For three months, two stacks lived in parallel. The straightforward approach is to put them in the same branch behind feature flags. The right approach turned out to be putting them in separate branches.

Decision
Considered
  • Big-bang cutover (all services at once)
  • Same-branch flag: both stacks in main, gated by env var
  • Branch-level isolation: parallel branches, regular sync
Chose

Branch-level isolation

Reason

Same-branch flagging meant every commit had to compile against both stacks — npm + MSSQL + old patterns AND pnpm + Postgres + new patterns. Tests had to run against both. The CI matrix doubled. Worse, mixing the two paths made every diff hard to review. Two parallel branches let each stack be tested cleanly in isolation. We rebased the migration branch onto main weekly. Cutting a service over meant moving it from one branch to the other.

Reference implementation, then templates

Six services don't need six migration philosophies. The first one had to set the pattern.

Decision
Considered
  • Each team migrates their own services in parallel
  • One engineer migrates everything sequentially
  • Migrate one as reference, then others use it as a template
Chose

Reference implementation

Reason

Migrating my own services first established the conventions: schema layout, error handling, transaction boundaries, validation posture. The rest of the team could then port their services without re-litigating decisions. The first cutover took six weeks; subsequent cutovers took 3–5 days each.

The Zod-validated abstraction

Every row leaving the database goes through a Zod schema. This was the most useful single decision in the migration — it caught silent bugs the old code had been hiding for years.

tsrepository pattern with runtime validation — sketch
const DeviceSchema = z.object({
  id:           z.string().uuid(),
  tenant_id:    z.string().uuid(),
  name:         z.string(),
  status:       z.enum(["online", "offline", "stale"]),
  last_seen_at: z.string().datetime().nullable(),
  attributes:   z.record(z.unknown()),
});

type Device = z.infer;

export class DeviceRepository {
  async findByTenant(tenant: string): Promise {
    const rows = await this.db.query(
      "SELECT * FROM device WHERE tenant_id = $1",
      [tenant],
    );
    return rows.map((row) => DeviceSchema.parse(row));
  }
}

Zod runs afterthe database returns. Any row that doesn't match the schema throws a typed error with the field path and the observed value. Wired into structured logging, those errors became the most useful migration signal we had.

The cutover, per service

Each of the six services moved through the same five phases:

  • 1. Schema design.Postgres schema modeled on the access patterns we'd wished MSSQL had supported.
  • 2. Backfill.Idempotent script copying historical data. Re-run safe — by the third service we'd have run them dozens of times.
  • 3. Dual-write. Writes go to both databases; reads stay on MSSQL. Asymmetric divergence shows up in monitoring within minutes.
  • 4. Read cutover. Reads switch to Postgres. If anything regresses, we revert — the old database still has every write.
  • 5. Write cutover. Writes go to Postgres only. MSSQL frozen as a read-only fallback for ~2 weeks, then decommissioned per service.

The dual-write phase was the most informative. Anywhere the two databases diverged in real time, we had a bug — usually in the Zod schema or in a translation function. We held in dual-write for ~10 days per service before promoting to read cutover.

Services migrated
6+
Type bugs caught
12+
Parallel branches
3 mo
Rows lost
0

What I'd do differently

Three things stand out, none of which were obvious at the start.

Invest in observability before, not during.The first cutover was painful because we didn't have structured logging for Zod validation errors yet. Field-level error logging existed; aggregating and alerting on it didn't. By the third cutover we had per-service dashboards for validation failure rates. That should have been the first thing built, not the third.

Order services by data-flow dependency, not team readiness.We migrated services in the order their teams were ready. Twice this bit us — an upstream service's schema changed during a downstream service's cutover, forcing a re-do. A topological sort of data dependencies would have caught this in planning.

Write the rollback runbook before the first cutover, not after. By the third cutover I realized the rollback procedure had several untested assumptions — most importantly, that we could re-route reads back to MSSQL inside seconds. That actually took ~30s of pod restarts. Knowing this in advance would have changed how we sized the rollback budget for each cutover window.