# Caching Your Way to Faster CI Runners
I maintained the CI platform at a company running a Go monorepo with trunk-based development: more than 100 services and over 100 engineers pushing code every day. CI runs on EKS with GitHub Actions Runner Controller. For a long time the feedback I kept hearing from engineers was “CI is slow,” and the feedback from finance was “the bill keeps going up.” Neither is actionable on its own.
This post covers the startup optimizations we shipped and why we made the decisions we did. I talked about the foundation of this work at KCD Indonesia 2024; this post extends that talk.
Stack: EKS, GitHub Actions Runner Controller (ARC), Go monorepo, trunk-based development.
## The runner pickup problem
The most common complaint I heard was runner pickup time. You push, CI queues, and nothing happens for a while. The runner just hasn’t started yet.
The obvious fix is standby runners. Keep a warm pool and pickup is instant. The problem is justifying the cost. Idle runners sitting at 0% utilization waiting for jobs are hard to defend, especially when the load pattern is spiky. You end up in a conversation about why you’re paying for compute that’s doing nothing.
So the real question isn’t “how many standby runners do we keep.” It’s “how do we make cold runner startup fast enough that we don’t need many standby runners.” That’s a different problem, and it’s one the platform team can actually own.
Our runner image sits at 4-7 GB. That’s not an accident or neglect. We run a monorepo with a lot of services, each with its own dependencies. Every CI job needs those deps available. You have two options: download them at job runtime and pay data transfer costs repeatedly, or bake them into the image and pay once. We chose to bake. The image is large because that’s the deliberate tradeoff.
The consequence is that every cold node meant a full 4-7 GB pull before the container could start. P95 startup on a cold node was 2-3 minutes. On top of that, when load spiked and multiple new nodes all pulled that image simultaneously, the IO pressure during scale-up was real even on nodes that already had the image cached. Fetching the image because load arrived is the wrong order. You want it there before load arrives.
We attacked this on two fronts simultaneously.
## SOCI: lazy image loading for cold nodes
SOCI (Seekable OCI) is AWS’s implementation of lazy image loading for ECR. The container starts immediately and layers are fetched in the background on demand. GKE has the same concept under the name Image Streaming.
SOCI caches per layer, so Dockerfile layer ordering matters. Stable layers (OS, runtime, tooling) stay warm across image updates. Only the volatile top layers need fresh fetches. If your layers are ordered by change frequency with stable at the bottom, SOCI’s cache hit rate climbs fast. If they’re not, you’re fetching more than you need to on every update.
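To make that concrete, here’s a hypothetical runner Dockerfile (not ours) ordered that way; the base image, package list, and Go version are all illustrative:

```dockerfile
# Layers ordered by change frequency: stable layers at the bottom stay warm
# in SOCI's per-layer cache across image updates.
FROM ubuntu:22.04

# OS packages and tooling: changes rarely
RUN apt-get update && apt-get install -y --no-install-recommends \
      ca-certificates git zstd \
    && rm -rf /var/lib/apt/lists/*

# Go toolchain: changes once per Go release
COPY --from=golang:1.22 /usr/local/go /usr/local/go
ENV PATH=/usr/local/go/bin:$PATH

# Volatile layers last: dependency churn means only these layers need
# fresh fetches after a rebuild.
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
```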
We don’t use the upstream standalone-soci-indexer directly. We forked it and added support for a custom registry and multi-arch images. The SOCI index gets created and pushed right after the image build, as part of the same pipeline step:
```yaml
- name: Create and Push SOCI Index
  env:
    IMAGE_REF: ${{ steps.login-ecr.outputs.registry }}/${{ env.DOCKER_IMAGE_NAME }}:${{ env.DOCKER_IMAGE_TAG }}-${{ matrix.architecture }}
  run: |
    set -euo pipefail
    docker pull "${IMAGE_REF}"
    soci convert "${IMAGE_REF}" "${IMAGE_REF}"
    soci push "${IMAGE_REF}"
```
```mermaid
flowchart LR
    A["Build image\ndocker buildx"] --> B["Push to registry\namd64 + arm64"]
    B --> C["SOCI Indexer\nsoci convert + push\n(forked, custom registry + multi-arch)"]
    C --> D["Registry\nimage layers + SOCI index"]
    D -.->|"lazy fetch on demand"| E["Runner pod\nstarts in seconds"]
```

The SOCI index is what makes lazy loading work. Without it, the container runtime has no map of which byte ranges in the blob correspond to which files, so it can’t fetch on demand. Converting and pushing the index is a one-time cost per image build. After that, every cold node gets sub-30-second startup instead of 2-3 minutes.
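None of this kicks in unless the node’s containerd is pointed at the SOCI snapshotter. The wiring, per the upstream soci-snapshotter docs, looks roughly like this; our exact node config differs in detail:

```toml
# /etc/containerd/config.toml
# Register the soci-snapshotter-grpc daemon as a proxy snapshotter...
[proxy_plugins.soci]
  type = "snapshot"
  address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"

# ...and tell the CRI plugin to use it. Image refs must be passed through
# as annotations so the snapshotter can resolve the SOCI index and fetch
# byte ranges lazily.
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "soci"
  disable_snapshot_annotations = false
```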
## NVMe hostPath: prebaking everything else
SOCI handles the image layer problem. But once the container is running, it still needs Go’s module and build caches to be warm, and it needs Docker images for test dependencies (Postgres, Vault, etc.) to already exist on the node. Without that, every job start meant network round-trips for Go modules and container pulls for integration test infrastructure.
I looked at EBS for this. EBS baseline gives you 3,000 IOPS. That’s fine for most workloads and not fine for CI, where you’re doing high-throughput reads from a build cache on every job. The c8id instance type we use for CI runners has local NVMe instance storage that’s an order of magnitude faster, and we’re already paying for it as part of the instance cost. Adding EBS on top just to get the snapshot restore feature didn’t make sense.
The solution is a hostPath cache on NVMe, seeded nightly by a CronJob and hydrated by an initContainer on every runner pod startup; a pod-level sketch follows the list. Here’s what lives in it:
GOMODCACHE + GOCACHE. Go’s module zips and compiled build artifacts. The seeder clones main, runs a full build and test cycle, packages the result (~4.6 GiB), and uploads to blob storage. Eliminates per-pod download and extraction on cache hits.
/var/lib/docker snapshot. Pre-pulled images for integration and e2e test dependencies. When the node boots, these are already present. Zero network cost, zero IO cost at runtime. The cost is paid once at seeder build time.
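Wired into a pod, it looks roughly like this. A sketch, not our exact manifests: ARC generates the real pod spec, and the bucket, image names, and /mnt/nvme paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ci-runner
spec:
  initContainers:
    - name: seed-cache
      image: cache-seeder:latest        # hypothetical image with aws-cli, GNU tar, zstd
      command: ["/bin/sh", "-c"]
      args:
        - |
          # Download + extract the nightly seed if this node doesn't have it yet.
          if [ ! -f /nvme/seed/.ready ]; then
            mkdir -p /nvme/seed \
              && aws s3 cp "s3://${CACHE_BUCKET}/latest.tar.zst" /nvme/latest.tar.zst \
              && tar -I zstd -xf /nvme/latest.tar.zst -C /nvme/seed \
              && touch /nvme/seed/.ready  # sentinel written last: proves extraction completed
          fi
          exit 0  # never fail the pod; a missing sentinel just means a cold build
      env:
        - name: CACHE_BUCKET
          value: ci-cache-bucket        # placeholder
      volumeMounts:
        - name: nvme
          mountPath: /nvme
  containers:
    - name: runner
      image: ci-runner:latest           # the 4-7 GB runner image
      volumeMounts:
        - name: nvme
          mountPath: /nvme
  volumes:
    - name: nvme
      hostPath:
        path: /mnt/nvme/ci-cache        # local NVMe instance store
        type: DirectoryOrCreate
```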
```mermaid
flowchart LR
    A["Nightly CronJob\nclone main\nfull build + test\n4.6 GiB artifact"] --> B["Blob storage\nlatest.tar.zst"]
    B --> C["initContainer\ndownload + extract\nwrite .ready sentinel\nexit 0 on any error"]
    C --> D["Runner Pod\nGOMODCACHE hard-link\nGOCACHE full copy\n/var/lib/docker"]
    subgraph node["NVMe hostPath (instance-store)"]
        C
        D
    end
```

Two decisions in the design aren’t obvious from the outside.
GOMODCACHE uses hard-links, GOCACHE uses a full copy. Module zips are content-addressed and immutable, so hard-linking is safe. GOCACHE is different: Go’s cache-trim mutates files in place. Hard-linking would let one pod’s trim operation corrupt the shared inode that another pod is reading. Full cp -a gives each pod an independent copy. On NVMe it costs milliseconds.
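In shell terms, using the hypothetical /nvme layout from the pod sketch above, per-pod hydration comes down to two copies:

```sh
# Per-pod hydration at runner startup (paths and POD_NAME wiring illustrative).
mkdir -p "/nvme/pods/${POD_NAME}"

# Module zips are content-addressed and immutable: hard-links are safe to share.
cp -al /nvme/seed/gomodcache "/nvme/pods/${POD_NAME}/gomodcache"

# The build cache gets trimmed in place: each pod needs an independent copy.
cp -a /nvme/seed/gocache "/nvme/pods/${POD_NAME}/gocache"

export GOMODCACHE="/nvme/pods/${POD_NAME}/gomodcache"
export GOCACHE="/nvme/pods/${POD_NAME}/gocache"
```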
The .ready sentinel exists because partial extractions are silent. If a node gets rebooted mid-extraction and the sentinel isn’t there, the next pod reads a half-extracted tree and builds fail in confusing ways. With the gate, any anomaly falls through to a cold build. Slow, but correct.
Opinion: I seed the full build every night regardless of what changed. The smarter thing would be selective invalidation based on what actually changed. I chose not to because “build everything nightly” is easy to reason about, easy to debug, and the storage cost is negligible. Incremental invalidation logic is a maintenance surface I don’t need.
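For reference, the seeder’s core fits in a dozen lines. A sketch: the repo URL, bucket, and staging paths are placeholders, and the /var/lib/docker snapshot side is omitted:

```sh
#!/usr/bin/env bash
set -euo pipefail

# Point Go at fresh staging dirs so the archive contains exactly one warm
# GOMODCACHE + GOCACHE and nothing else.
mkdir -p /stage/gomodcache /stage/gocache
export GOMODCACHE=/stage/gomodcache GOCACHE=/stage/gocache

git clone --depth 1 https://github.com/example/monorepo /work
cd /work
go build ./...            # warms module and build caches for every service
go test ./... -count=1    # warms test build artifacts too

tar -I 'zstd -T0' -cf /tmp/latest.tar.zst -C /stage .
aws s3 cp /tmp/latest.tar.zst "s3://ci-cache-bucket/latest.tar.zst"
```

Everything downstream, the initContainer, the sentinel, the hard-links, exists to move this one nightly artifact onto nodes cheaply.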