ECR Pull-Through Cache
The ECR Pull-Through Cache (PTC) is an AWS-managed feature that transparently mirrors container images from upstream public registries into your own Amazon ECR. Skyscrapers uses it to shield your EKS clusters from upstream registry outages and rate limits: most notably Docker Hub’s anonymous-pull throttling, but also full upstream outages such as past GHCR and Quay.io incidents.
When enabled, system component image references that previously resolved to quay.io, registry.k8s.io, docker.io, or ghcr.io are rewritten to your customer-specific ECR registry. The first pull populates the cache from upstream; subsequent pulls (across all your nodes and over time) are served from ECR. Cached images stay available even when the upstream is completely down.
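For example (the image and tag below are purely illustrative), a system component reference changes like this once the cache is on:
```
upstream:   quay.io/jetstack/cert-manager-controller:v1.14.4
via cache:  <account>.dkr.ecr.<region>.amazonaws.com/quay/jetstack/cert-manager-controller:v1.14.4
```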
Why we run it
Two recurring failure modes motivated this:
- Docker Hub rate limits. Docker Hub limits unauthenticated pulls to 100 per 6 hours per IP. Velero, in particular, no longer has Verified Publisher status, so its pulls count against the limit, and clusters with frequent node churn started hitting throttling on routine scaling events.
- Upstream outages. When Quay.io went fully down, no new node could pull system images: boot loops, no recovery path, and no mitigation under our control.
Every other practical mitigation would require us either to maintain our own mirror infrastructure or to bake credentials into customer accounts. The Pull-Through Cache avoids both and leaves the mirroring to AWS.
Where it runs
The cache is deployed per customer, in the SharedTooling AWS account, in the same region as the EKS clusters:
```
SharedTooling Account                  Cluster Account (dev/staging/prod)
┌─────────────────────────┐            ┌──────────────────────────────┐
│ ECR Pull-Through Cache  │            │ EKS Cluster                  │
│                         │ cross-acct │                              │
│ quay/*                  │◄───────────│ kubelet on workers/Fargate   │
│ k8s/*                   │ same-region│ pulls images that are        │
│ docker-hub/* (opt-in)   │ free       │ rewritten to ECR             │
│ ghcr/* (opt-in)         │            │                              │
│                         │            └──────────────────────────────┘
└─────────────────────────┘
```

The cache is shared across all of your cluster accounts in that region: one deployment serves dev, staging, and production. Cross-account ECR pulls within the same region are free (S3 gateway endpoint), so sharing the cache adds no data-transfer cost; the only ongoing cost is the negligible ECR storage fee.
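To make the cross-account arrow concrete, the sketch below shows the kind of registry permissions policy involved. It is illustrative only (placeholder account IDs and region, not the stack's actual code), and pull permissions on the cached repositories themselves are granted separately by the stack.

```hcl
# Illustrative sketch: lets a cluster account trigger cache population on a miss.
# Account IDs and region are placeholders; this is not the stack's actual code.
resource "aws_ecr_registry_policy" "pull_through_cache" {
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "AllowClusterAccountCacheMissImports"
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::111111111111:root" } # placeholder cluster account
        Action = [
          "ecr:CreateRepository",         # auto-create the cached repository on first pull
          "ecr:BatchImportUpstreamImage"  # fetch the image from upstream into the cache
        ]
        Resource = "arn:aws:ecr:eu-west-1:222222222222:repository/*" # placeholder SharedTooling registry
      }
    ]
  })
}
```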
Tiers
The cache has two tiers, reflecting the AWS authentication requirements:
Default tier: always on when the cache is enabled
| Registry | ECR prefix |
|---|---|
| quay.io | quay/ |
| registry.k8s.io | k8s/ |
Both registries allow anonymous pulls, and AWS does not require credentials to set up a Pull-Through Cache rule for them. This tier is enabled the moment you turn on the cache.
Opt-in tier: requires customer-provided credentials
| Registry | ECR prefix | Credential needed |
|---|---|---|
| docker.io | docker-hub/ | Docker Hub account + access token |
| ghcr.io | ghcr/ | GitHub PAT with read:packages |
| registry.gitlab.com | gitlab/ | GitLab PAT with read_registry |
AWS requires credentials in Secrets Manager for these registries even if you only intend to pull public images. Skyscrapers will not deploy our own credentials into customer accounts, so customers create and own these secrets. See the setup guide.
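As a reference point, here is a minimal Terraform sketch of the two kinds of rules. Resource names, the secret name, and the credential values are illustrative; the real wiring is what the setup guide walks through.

```hcl
# Default tier: anonymous upstream, no credentials needed.
resource "aws_ecr_pull_through_cache_rule" "quay" {
  ecr_repository_prefix = "quay"
  upstream_registry_url = "quay.io"
}

# Opt-in tier: AWS requires a Secrets Manager secret whose name starts with
# "ecr-pullthroughcache/", containing the upstream credentials.
resource "aws_secretsmanager_secret" "docker_hub" {
  name = "ecr-pullthroughcache/docker-hub" # illustrative name; the prefix is mandatory
}

resource "aws_secretsmanager_secret_version" "docker_hub" {
  secret_id = aws_secretsmanager_secret.docker_hub.id
  secret_string = jsonencode({
    username    = "your-docker-hub-user" # placeholder
    accessToken = "dckr_pat_xxxx"        # placeholder access token
  })
}

resource "aws_ecr_pull_through_cache_rule" "docker_hub" {
  ecr_repository_prefix = "docker-hub"
  upstream_registry_url = "registry-1.docker.io"
  credential_arn        = aws_secretsmanager_secret.docker_hub.arn
}
```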
How image rewriting works
Image references in our Kubernetes-stack templates use centralized Terraform locals (registry_quay, registry_k8s, registry_docker_hub, registry_ghcr) instead of hardcoded hostnames. Each local resolves to one of two values:
- When the cache is disabled (default): the upstream URL, e.g. quay.io. Image references are identical to what they would be without the cache.
- When the cache is enabled: the ECR prefix, e.g. <account>.dkr.ecr.<region>.amazonaws.com/quay. The container runtime pulls from ECR instead of upstream.
This means turning the cache off is a straight reversion to upstream pulls: no leftover state, no special-case logic.
The rewriting happens at template render time inside the Terraform stacks; there is no admission webhook, no in-cluster sidecar, and nothing on the node to misconfigure. If your application workloads also need cache routing, you can reference the same ECR registry directly from your image manifests.
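A minimal sketch of that locals pattern, assuming hypothetical input names (ecr_pull_through_cache_enabled, shared_tooling_account_id, region) rather than the stacks' real variables:

```hcl
variable "ecr_pull_through_cache_enabled" {
  type    = bool
  default = false
}

variable "shared_tooling_account_id" { type = string }
variable "region" { type = string }

locals {
  # ECR registry hostname of the SharedTooling account that hosts the cache.
  ecr_cache_registry = "${var.shared_tooling_account_id}.dkr.ecr.${var.region}.amazonaws.com"

  # Each local resolves to either the upstream hostname or the cache prefix.
  registry_quay       = var.ecr_pull_through_cache_enabled ? "${local.ecr_cache_registry}/quay" : "quay.io"
  registry_k8s        = var.ecr_pull_through_cache_enabled ? "${local.ecr_cache_registry}/k8s" : "registry.k8s.io"
  registry_docker_hub = var.ecr_pull_through_cache_enabled ? "${local.ecr_cache_registry}/docker-hub" : "docker.io"
  registry_ghcr       = var.ecr_pull_through_cache_enabled ? "${local.ecr_cache_registry}/ghcr" : "ghcr.io"
}
```

Templates then prepend these locals to the image repository paths, so a chart value rendered from registry_quay comes out as either the quay.io reference or its ECR equivalent.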
What’s covered
Roughly 50 system component images are routed through the cache:
- Default tier (quay.io, registry.k8s.io): cert-manager, Prometheus stack (operator, alertmanager, prometheus, node-exporter, kube-state-metrics, etc.), oauth2-proxy, external-dns, secrets-store-csi-driver, vertical-pod-autoscaler, ingress-nginx, blackbox-exporter, node-local-dns, prometheus-cloudwatch-exporter, k8s-sidecar.
- Opt-in docker.io tier: Velero, Grafana, Loki, Alloy, Tempo.
- Opt-in ghcr.io tier: Flux controllers, KEDA, Dex, Fluent-bit, WireGuard, kube-green, GHA Runner controller and runners, Tailscale, Karpenter EIP-assigner.
What’s not covered
A few classes of images stay on their original source:
- public.ecr.aws images (Karpenter, AWS Load Balancer Controller, EFS CSI driver, etc.): already on AWS-owned infrastructure, so there is no cache benefit.
- EKS-managed addons (CoreDNS, kube-proxy, VPC CNI, EBS CSI driver, etc.): pulled from 602401143452.dkr.ecr.<region>.amazonaws.com, which is AWS-owned and AWS-managed.
- HNC (gcr.io/k8s-staging-multitenancy/hnc-manager): ECR Pull-Through Cache does not support gcr.io as an upstream, and the retired HNC project has no registry.k8s.io mirror.
- Customer application workloads: image references in your own manifests are not rewritten by the platform. Reference the ECR cache directly from your manifests if you want them routed too.
Note
A future opt-in feature will transparently rewrite customer application workload images at admission time via a mutating webhook, so you won’t need to touch your manifests to route them through the cache. Until then, the manual reference is the only option.
Cost
Per customer, in steady state:
- ECR storage: ~$1.50/month for ~15 GB of cached images (60 unique images, 2-3 cached versions each, with the lifecycle policy expiring images after 30 days).
- Cross-account data transfer: $0.00 within the same region (S3 gateway endpoint).
- Pull-through cache feature + API calls: $0.00.
- Secrets Manager: ~$0.40/month per opt-in tier secret (Docker Hub / ghcr.io / GitLab).
Total: ~$1.50/month for the default tier, ~$2.30-3/month with the opt-in tiers enabled.
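(As a sanity check on the storage figure, assuming ECR's standard $0.10 per GB-month price: ~15 GB × $0.10 ≈ $1.50/month; each opt-in tier then adds its own ~$0.40 secret.)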
Lifecycle behaviour
Cached repositories are auto-created on first pull with a lifecycle policy that expires images 30 days after they were pushed. Expired images are simply re-cached on the next pull, so the expiry window is a storage knob, not an availability one.
Note
ECR’s sinceImagePushed count type is used here rather than sinceImagePulled, because AWS only allows the sinceImagePulled count type with transition (cold-storage tiering) actions, not expire. Our 30-day default is conservative; if storage cost ever becomes meaningful, it can be lowered. The only consequence is more frequent re-caches from upstream.
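For reference, the expiry rule has roughly this shape, sketched here as an ECR lifecycle policy document in Terraform; how the stack attaches it to the auto-created cache repositories is omitted.

```hcl
locals {
  # 30-day expiry for cached images, keyed on push time (see the note above on
  # why sinceImagePushed rather than sinceImagePulled).
  ptc_lifecycle_policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Expire cached images 30 days after they were pushed"
        selection = {
          tagStatus   = "any"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 30
        }
        action = { type = "expire" }
      }
    ]
  })
}
```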