ECR Pull-Through Cache

The ECR Pull-Through Cache (PTC) is an AWS-managed feature that transparently mirrors container images from upstream public registries into your own Amazon ECR. Skyscrapers uses it to shield your EKS clusters from upstream registry outages and rate limits: most notably Docker Hub’s anonymous-pull throttling, but also full upstream outages such as past GHCR and Quay.io incidents.

When enabled, system component image references that previously resolved to quay.io, registry.k8s.io, docker.io, or ghcr.io are rewritten to your customer-specific ECR registry. The first pull populates the cache from upstream; subsequent pulls (across all your nodes and over time) are served from ECR. Cached images stay available even when the upstream is completely down.

Why we run it

Two recurring failure modes motivated this:

  • Docker Hub rate limits. Docker Hub limits unauthenticated pulls to 100 per 6 hours per IP. Velero, in particular, no longer has Verified Publisher status, so clusters with frequent node churn started hitting throttling during routine scaling events.
  • Upstream outages. When Quay.io went fully down, no new node could pull system images: boot loops, no recovery path, no mitigation we control.

The only alternatives would be to maintain our own mirror infrastructure or to bake credentials into customer accounts. The Pull-Through Cache solves both failure modes without either burden, at AWS-managed cost.

Where it runs

The cache is deployed per customer, in the SharedTooling AWS account, in the same region as the EKS clusters:

SharedTooling Account                   Cluster Account (dev/staging/prod)
┌─────────────────────────┐             ┌──────────────────────────────┐
│ ECR Pull-Through Cache  │             │ EKS Cluster                  │
│                         │  cross-acct │                              │
│ quay/*                  │◄────────────│ kubelet on workers/Fargate   │
│ k8s/*                   │  same-region│   pulls images that are      │
│ docker-hub/* (opt-in)   │  free       │   rewritten to ECR           │
│ ghcr/* (opt-in)         │             │                              │
│                         │             └──────────────────────────────┘
└─────────────────────────┘

The cache is shared across all of your cluster accounts in that region: one deployment serves dev, staging, and production. Cross-account ECR pulls within the same region are free (S3 gateway endpoint), so this adds no networking cost beyond the negligible ECR storage fee.

Tiers

The cache has two tiers, reflecting the AWS authentication requirements:

Default tier: always on when the cache is enabled

Registry          ECR prefix
quay.io           quay/
registry.k8s.io   k8s/

Both registries allow anonymous pulls; AWS does not require credentials to set up a Pull-Through Cache rule for them. This tier is enabled the moment you turn on the cache.

Opt-in tier: requires customer-provided credentials

Registry              ECR prefix    Credential needed
docker.io             docker-hub/   Docker Hub account + access token
ghcr.io               ghcr/         GitHub PAT with read:packages
registry.gitlab.com   gitlab/       GitLab PAT with read_registry

AWS requires credentials in Secrets Manager for these registries even if you only intend to pull public images. Skyscrapers does not deploy its own credentials into customer accounts, so you create and own these secrets. See the setup guide.
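As a rough illustration, AWS expects each opt-in tier's credentials in a Secrets Manager secret whose name starts with ecr-pullthroughcache/ and whose value is a JSON document with username and accessToken keys. The helper name ptc_secret and the name suffix below are illustrative assumptions, not a Skyscrapers convention:

```python
import json

def ptc_secret(registry: str, username: str, access_token: str):
    """Build the (name, value) pair for an ECR PTC credential secret.

    AWS requires the secret name to begin with "ecr-pullthroughcache/";
    the suffix chosen here is arbitrary.
    """
    name = f"ecr-pullthroughcache/{registry}"
    value = json.dumps({"username": username, "accessToken": access_token})
    return name, value

name, value = ptc_secret("docker-hub", "myuser", "example-token")
# The secret would then be created in the customer account, e.g. with boto3:
#   boto3.client("secretsmanager").create_secret(Name=name, SecretString=value)
```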

How image rewriting works

Image references in our Kubernetes-stack templates use centralized Terraform locals (registry_quay, registry_k8s, registry_docker_hub, registry_ghcr) instead of hardcoded hostnames. Each local resolves to one of two values:

  • When the cache is disabled (default): the upstream URL, e.g. quay.io. Image references are identical to what they would be without the cache.
  • When the cache is enabled: the ECR prefix, e.g. <account>.dkr.ecr.<region>.amazonaws.com/quay. The container runtime pulls from ECR instead of upstream.

This means turning the cache off is a straight reversion to upstream pulls: no leftover state, no special-case logic.
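The resolution logic above can be sketched in a few lines of Python. PREFIX_MAP and registry_host are illustrative names, not the actual Terraform locals, and the account ID is a placeholder:

```python
# Upstream registry host -> ECR repository prefix (mirrors the tier tables).
PREFIX_MAP = {
    "quay.io": "quay",
    "registry.k8s.io": "k8s",
    "docker.io": "docker-hub",
    "ghcr.io": "ghcr",
}

def registry_host(upstream: str, cache_enabled: bool,
                  account: str, region: str) -> str:
    """Resolve the registry a rendered image reference should point at."""
    if not cache_enabled:
        return upstream  # identical to what it would be without the cache
    prefix = PREFIX_MAP[upstream]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{prefix}"

# Example: a cert-manager image rendered with the cache enabled.
host = registry_host("quay.io", True, "123456789012", "eu-west-1")
image = f"{host}/jetstack/cert-manager-controller:v1.14.4"
```

With the cache disabled the same call returns the upstream host unchanged, which is why turning the cache off is a clean reversion.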

The rewriting happens at template render time inside the Terraform stacks; there is no admission webhook, no in-cluster sidecar, and nothing on the node to misconfigure. If your application workloads also need cache routing, you can reference the same ECR registry directly from your image manifests.

What’s covered

Roughly 50 system component images are routed through the cache:

  • Default tier (quay.io, registry.k8s.io): cert-manager, Prometheus stack (operator, alertmanager, prometheus, node-exporter, kube-state-metrics, etc.), oauth2-proxy, external-dns, secrets-store-csi-driver, vertical-pod-autoscaler, ingress-nginx, blackbox-exporter, node-local-dns, prometheus-cloudwatch-exporter, k8s-sidecar.
  • Opt-in docker.io tier: Velero, Grafana, Loki, Alloy, Tempo.
  • Opt-in ghcr.io tier: Flux controllers, KEDA, Dex, Fluent-bit, WireGuard, kube-green, GHA Runner controller and runners, Tailscale, Karpenter EIP-assigner.

What’s not covered

A few classes of images stay on their original source:

  • public.ecr.aws images (Karpenter, AWS Load Balancer Controller, EFS CSI driver, etc.): already on AWS-owned infrastructure. No cache benefit.

  • EKS-managed addons (CoreDNS, kube-proxy, VPC CNI, EBS CSI driver, etc.): pulled from 602401143452.dkr.ecr.<region>.amazonaws.com, AWS-owned and AWS-managed.

  • HNC (gcr.io/k8s-staging-multitenancy/hnc-manager): ECR Pull-Through Cache does not support gcr.io as an upstream, and the retired HNC project has no registry.k8s.io mirror.

  • Customer application workloads: image references in your own manifests are not rewritten by the platform. Reference the ECR cache directly from your manifests if you want them routed too.

    Note

    A future opt-in feature will transparently rewrite customer application workload images at admission time via a mutating webhook, so you won’t need to touch your manifests to route them through the cache. Until then, the manual reference is the only option.

Cost

Per customer, in steady state:

  • ECR storage: ~$1.50/month for ~15 GB of cached images (60 unique images, 2-3 cached versions each, lifecycle policy expiring after 30 days).
  • Cross-account data transfer: $0.00 within the same region (S3 gateway endpoint).
  • Pull-through cache feature + API calls: $0.00.
  • Secrets Manager: ~$0.40/month per opt-in tier secret (Docker Hub / ghcr.io / GitLab).

Total: ~$1.50/month for the default tier, ~$2.30-3/month with the opt-in tiers enabled.
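A back-of-envelope check of those totals, assuming ECR's standard $0.10 per GB-month storage price and $0.40 per Secrets Manager secret (both placeholders; check current AWS pricing for your region):

```python
ECR_GB_MONTH = 0.10   # assumed ECR storage price, $/GB-month
SECRET_MONTH = 0.40   # assumed Secrets Manager price, $/secret-month

cached_gb = 15
default_tier = cached_gb * ECR_GB_MONTH        # storage only: ~$1.50
with_optin = default_tier + 2 * SECRET_MONTH   # + Docker Hub and ghcr.io secrets: ~$2.30
```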

Lifecycle behaviour

Cached repositories are auto-created on first pull with a lifecycle policy that expires images 30 days after they were pushed. Expired images are simply re-cached on next pull, so this is a storage-bound, not availability-bound, knob.

Note

ECR’s sinceImagePushed rather than sinceImagePulled is used here because AWS only allows the sinceImagePulled count type with transition (cold-storage tiering) actions, not expire. Our 30-day default is conservative; if storage cost ever becomes meaningful, lower it. The only consequence is more frequent re-caches from upstream.
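The policy described above would look roughly like the following, built with ECR's lifecycle policy JSON syntax. Treat the exact rule (priority, description, tag selection) as illustrative; only the 30-day sinceImagePushed expiry comes from the text:

```python
import json

EXPIRE_AFTER_DAYS = 30  # the conservative default described above

lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": f"Expire cached images {EXPIRE_AFTER_DAYS} days after push",
            "selection": {
                "tagStatus": "any",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": EXPIRE_AFTER_DAYS,
            },
            "action": {"type": "expire"},
        }
    ]
}

policy_text = json.dumps(lifecycle_policy)
# Applied per auto-created cache repository, e.g. with boto3:
#   ecr.put_lifecycle_policy(repositoryName=..., lifecyclePolicyText=policy_text)
```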
