CI/CD pipeline
The four boring changes — parallel tests, Docker layer caching, Terraform, live metrics — that cut our CI/CD pipeline from over an hour to six minutes.
Continuous Integration and Continuous Deployment pipelines are the backbone of modern software delivery. When your pipeline lags, everything else lags with it — feedback loops, bug fixes, release cadence, team morale. We hit that wall hard. Here's how we cut ours from over an hour to six minutes.
The first thing we did was stop guessing. We instrumented every stage and measured it, and the results surprised us.
If you don't measure, you end up optimizing the wrong stage. We almost spent a week on Docker layer caching before realizing it would have saved us ninety seconds on an hour-long pipeline.
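The arithmetic behind that near miss is worth spelling out. What follows is a minimal sketch, assuming the 62-minute total quoted at the end of this post and the ninety-second estimate above. `saving_share` is a hypothetical helper written for illustration, not part of our tooling.

```python
def saving_share(saved_seconds: float, total_seconds: float) -> float:
    """Return the fraction of total pipeline time that an optimization removes."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return saved_seconds / total_seconds


# Figures taken from this post: a 62-minute pipeline and a 90-second saving.
total = 62 * 60
share = saving_share(90, total)
print(f"{share:.1%} of the pipeline")  # prints "2.4% of the pipeline"
```

Running the same calculation for each stage before committing to any work is a cheap way to rank candidates by how much of the total they can actually move.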
Our tests ran serially on a single runner. Moving to parallel test sharding across four runners was the single biggest win: it cut testing time by 68%.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining shards finish even if one of them fails
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      # Run only this job's slice of the suite (shard N of 4)
      - run: bun test --shard=${{ matrix.shard }}/4
We also killed a dozen tests that were either flaky or exercising the framework's behavior rather than our own code. Fewer tests, faster signal.
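The matrix above hands each job a shard number, and something has to turn that number into a list of test files. Here is a minimal sketch of one deterministic way to do that split, assuming a simple round-robin over sorted file names. It is not necessarily how any particular test runner divides work, and the file names are hypothetical.

```python
def shard_files(files: list[str], index: int, total: int) -> list[str]:
    """Return the files that shard `index` (1-based) out of `total` should run."""
    if total < 1 or not 1 <= index <= total:
        raise ValueError("index must satisfy 1 <= index <= total")
    # Sort first so every runner sees the same order and the split is deterministic.
    ordered = sorted(files)
    # Take every `total`-th file, starting at this shard's offset.
    return ordered[index - 1::total]


# Hypothetical file names, for illustration only.
tests = ["auth.test.ts", "billing.test.ts", "cart.test.ts", "db.test.ts", "email.test.ts"]
shards = [shard_files(tests, i, 4) for i in range(1, 5)]
print(shards[0])  # ['auth.test.ts', 'email.test.ts']
```

Four perfectly balanced shards can cut wall-clock test time by at most 75% (1 - 1/4), so 68% is close to the ceiling. The remaining gap is plausibly per-job overhead such as checkout and install, plus uneven shard sizes.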
Our Docker builds were starting from scratch every time. Adding BuildKit layer caching to a remote registry dropped image builds from four minutes to forty seconds:
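# Registry cache export generally needs a Buildx builder; the default docker driver may not support it
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ${{ env.IMAGE }}:${{ github.sha }}
    # Reuse layers from the cache image written by earlier builds
    cache-from: type=registry,ref=${{ env.IMAGE }}:buildcache
    # mode=max also caches intermediate layers, not only those in the final image
    cache-to: type=registry,ref=${{ env.IMAGE }}:buildcache,mode=max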
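To see why layer caching helps so much, consider a toy model in which each layer's cache key depends on its own instruction, its inputs, and the key of the layer before it. The sketch below is a deliberate simplification and not BuildKit's actual keying scheme. The Dockerfile steps and digests in it are hypothetical.

```python
import hashlib


def layer_keys(steps: list[tuple[str, str]]) -> list[str]:
    """Compute a chained cache key for each (instruction, input_digest) step."""
    keys = []
    parent = ""
    for instruction, input_digest in steps:
        # A layer's key covers its parent, so a change invalidates everything after it.
        material = f"{parent}|{instruction}|{input_digest}".encode()
        parent = hashlib.sha256(material).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(previous: list[str], current: list[str]) -> int:
    """Count the leading layers whose keys match, i.e. layers served from cache."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Hypothetical build: only the application source changed between two commits.
common = [
    ("FROM oven/bun:1", ""),
    ("COPY package.json bun.lockb ./", "lockfile-v1"),
    ("RUN bun install --frozen-lockfile", ""),
]
before = layer_keys(common + [("COPY . .", "src-v1")])
after = layer_keys(common + [("COPY . .", "src-v2")])
print(reusable_layers(before, after))  # 3
```

In this model a source-only change reuses the base image and the dependency install, and only the final copy is rebuilt. That is the kind of saving a warm cache makes possible.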
Environment provisioning used to require a human in the loop. We moved to Terraform so the pipeline could spin up a preview environment for every pull request:
resource "fly_app" "staging" {
  name = "app-staging-${var.pr_number}"
  org  = "personal"
}

resource "fly_machine" "api" {
  app    = fly_app.staging.name
  region = "ams"
  image  = var.image
}
Every PR gets its own preview environment, destroyed automatically when the PR closes. Reviewers stop asking "can you deploy this somewhere I can click?"
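The create-on-open, destroy-on-close lifecycle comes down to mapping pull request events onto Terraform commands. The sketch below is one way to express that mapping, and it only builds the argument list rather than running anything. `terraform_command` is a hypothetical helper, and the image reference in the usage line is a placeholder. It also assumes each PR has its own isolated Terraform state, which is not shown here.

```python
def terraform_command(action: str, pr_number: int, image: str) -> list[str]:
    """Map a pull_request event action to a Terraform argument list."""
    # These match the pr_number and image variables used in the configuration above.
    variables = ["-var", f"pr_number={pr_number}", "-var", f"image={image}"]
    if action in {"opened", "reopened", "synchronize"}:
        # Create the preview environment, or update it after a new push.
        return ["terraform", "apply", "-auto-approve", "-input=false", *variables]
    if action == "closed":
        # Tear the environment down when the PR is merged or closed.
        return ["terraform", "destroy", "-auto-approve", "-input=false", *variables]
    raise ValueError(f"unhandled pull_request action: {action}")


# Placeholder image reference, for illustration only.
print(" ".join(terraform_command("closed", 42, "registry.example.com/app:abc123")))
```

Returning the argument list instead of executing it keeps the decision logic easy to test. A thin wrapper could pass the result to `subprocess.run`.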
Pipelines rot. We wired Prometheus to scrape pipeline metrics and built a Grafana dashboard that alerts us when any stage creeps past its budget. If tests drift back toward fifteen minutes, we know before it becomes a daily annoyance.
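A stage budget is just a threshold compared against the latest measurement. As a rough sketch, the snippet below flags stages that exceed their budget and renders the durations in the Prometheus text exposition format. The metric name `ci_stage_duration_seconds`, the stage names, and all numbers are assumptions made for illustration. In a real setup the comparison would more likely live in an alert rule than in a script.

```python
def over_budget(durations: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Return the stages whose latest duration exceeds their budget, sorted by name."""
    return sorted(
        stage
        for stage, seconds in durations.items()
        if stage in budgets and seconds > budgets[stage]
    )


def to_exposition(durations: dict[str, float]) -> str:
    """Render stage durations as a gauge in the Prometheus text exposition format."""
    lines = [
        "# HELP ci_stage_duration_seconds Duration of the most recent run of each stage.",
        "# TYPE ci_stage_duration_seconds gauge",
    ]
    for stage, seconds in sorted(durations.items()):
        lines.append(f'ci_stage_duration_seconds{{stage="{stage}"}} {seconds}')
    return "\n".join(lines) + "\n"


# Hypothetical numbers, for illustration only.
durations = {"test": 130.0, "build": 45.0}
budgets = {"test": 120.0, "build": 60.0}
print(over_budget(durations, budgets))  # ['test']
print(to_exposition(durations), end="")
```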
From 62 minutes to 6 minutes. A 10x improvement, achieved through four boring changes: measure, parallelize, cache, automate. No magic. No new framework. Just ruthless attention to where the time was actually going.
The lesson: faster pipelines aren't a one-time project. They're a habit. Measure weekly, cut ruthlessly, and don't let your CI slow to a crawl while you're not looking.