Platform & Application Responsibility Definitions
This document defines the operational boundary between what Skyscrapers manages as platform and what customers own as application. It clarifies monitoring, alerting, and response responsibilities for both parties.
For a visual overview of how these responsibilities are distributed across service tiers, see the Shared Responsibility Model.
The three layers
We distinguish three layers in every environment we manage. Each layer has a clear owner responsible for its operation, monitoring, and incident response.
Infrastructure
The foundational cloud resources that the platform runs on.
Owner: Skyscrapers (provisioned and managed via IaC)
This includes the AWS account structure, VPCs, subnets, NAT gateways, IAM roles and policies, and managed technology components such as RDS instances, OpenSearch clusters, S3 buckets, and other resources provisioned through Skyscrapers’ infrastructure-as-code. These resources carry the maintainer=skyscrapers tag in AWS.
Infrastructure resources are managed at the cloud provider level and have different alerting characteristics than platform components running inside Kubernetes. We separate them because the monitoring approach, tooling, and response procedures differ — even though both layers are operated by Skyscrapers.
Platform
The Kubernetes cluster and all Skyscrapers-managed add-ons that provide a production-ready runtime for customer workloads.
Owner: Skyscrapers
This includes the K8s control plane, worker nodes and node pools (managed via Karpenter), and all components deployed in Skyscrapers-managed namespaces (identifiable by the maintainer=skyscrapers label). Concrete examples are Traefik, cert-manager, external-dns, the monitoring stack (Prometheus, Alertmanager, Grafana), the logging stack (Fluent Bit, Loki), Vault, Dex, VPA, Velero, CoreDNS, and any other cluster add-on Skyscrapers operates. The full list is available here: Skyscrapers Services.
Skyscrapers provides and maintains these components as a runtime platform. Customers consume the capabilities they expose (e.g. creating Ingress resources, deploying applications, requesting certificates, defining PodMonitors) but do not operate the underlying components.
Application
Everything the customer deploys onto the platform to serve their end users.
Owner: Customer
This includes all workloads deployed by the customer (e.g. Deployments, StatefulSets, Jobs, CronJobs); application-level Kubernetes resources such as Ingresses, Services, ConfigMaps, Secrets, HPA/VPA configurations, and NetworkPolicies; application endpoints and health checks; CI/CD pipelines and their configurations; and any cloud resources not provisioned or managed by Skyscrapers.
Monitoring & alerting responsibilities
How ownership is determined
The boundary between platform and application alerting is determined by namespace labels, not by the type of metric or resource. All Kubernetes namespaces carrying the maintainer=skyscrapers label are classified as platform. Alerts originating from these namespaces are routed to Skyscrapers’ internal alert pipeline (infra-alerts). Everything else is routed to the customer’s alert channel (app-alerts).
You can verify which namespaces are platform-managed on your cluster:
kubectl get ns -l maintainer=skyscrapers
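The routing rule described above can be sketched in a few lines of Python. This is an illustration only, not Skyscrapers' actual Alertmanager configuration: the helper name `route_for_namespace` and the example namespaces are hypothetical. Only the `maintainer=skyscrapers` label and the `infra-alerts` / `app-alerts` route names come from this document.

```python
from typing import Mapping

# Label that marks a namespace as platform-managed (taken from this document).
PLATFORM_LABEL = ("maintainer", "skyscrapers")


def route_for_namespace(labels: Mapping[str, str]) -> str:
    """Return the alert route for a namespace, given its labels.

    Hypothetical helper: it mirrors the rule described in the text,
    not the real Alertmanager configuration.
    """
    key, value = PLATFORM_LABEL
    if labels.get(key) == value:
        return "infra-alerts"  # platform namespace, handled by Skyscrapers
    return "app-alerts"  # everything else goes to the customer


# Made-up label sets for illustration.
print(route_for_namespace({"maintainer": "skyscrapers"}))
print(route_for_namespace({"team": "checkout"}))
print(route_for_namespace({}))
```

Note that the decision depends only on the namespace label: an alert about the same kind of resource can land in either route depending on where the resource runs.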
For managed technology components outside Kubernetes (such as RDS, OpenSearch, S3), ownership is determined by the maintainer=skyscrapers AWS resource tag.
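The same ownership check can be expressed for AWS resources. The sketch below assumes tags arrive in the `[{"Key": ..., "Value": ...}]` list shape that many AWS APIs return; the function name `is_skyscrapers_managed` and the sample resources are hypothetical, and the tag shape may differ depending on the tooling you use.

```python
from typing import Iterable, Mapping


def is_skyscrapers_managed(tags: Iterable[Mapping[str, str]]) -> bool:
    """True if a resource's tag list contains maintainer=skyscrapers.

    Assumes the [{"Key": ..., "Value": ...}] tag shape; adapt this if
    your tooling exposes tags as a plain dictionary instead.
    """
    return any(
        tag.get("Key") == "maintainer" and tag.get("Value") == "skyscrapers"
        for tag in tags
    )


# Made-up tag lists for two hypothetical resources.
managed_db = [
    {"Key": "Name", "Value": "orders-db"},
    {"Key": "maintainer", "Value": "skyscrapers"},
]
customer_bucket = [{"Key": "Name", "Value": "uploads"}]

print(is_skyscrapers_managed(managed_db))
print(is_skyscrapers_managed(customer_bucket))
```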
A full list of all active alerts and their current state can be found in the Prometheus UI available on each cluster. For details on specific alerts, including descriptions and recommended actions, see the Runbook.
What Skyscrapers monitors and responds to
Skyscrapers monitors and responds to alerts concerning the availability and health of platform components and managed infrastructure.
Concretely, Skyscrapers actively responds to:
- Component availability: platform add-ons being down or unhealthy (e.g. ingress controller, cert-manager, Prometheus, Loki, CoreDNS)
- Cluster and node availability: EKS control plane issues, nodes not ready, node pool scaling failures
- Disk space: node filesystem space filling up, persistent volume space for platform components
- Managed technology component health: RDS instance availability, OpenSearch cluster health, backup job failures
- Platform error rates: elevated error rates on platform components
For production environments, critical alerts go through Skyscrapers’ on-call escalation system (24/7) and are acted upon by a Skyscrapers engineer. Warning-level alerts are tracked in Skyscrapers’ internal monitoring channel during business hours, but are not guaranteed to be acted upon.
For non-production environments (staging, development), alerts are logged in Slack and handled on a best-effort basis during business hours. There is no 24/7 on-call coverage for non-production environments.
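The two paragraphs above amount to a small decision table for platform alerts. A minimal sketch, assuming that any environment name other than `production` counts as non-production; the function name and the returned wording are illustrative, not an SLA:

```python
def skyscrapers_response(environment: str, severity: str) -> str:
    """Describe how Skyscrapers handles a platform alert.

    Hypothetical helper. Any environment other than "production" is
    treated as non-production, which is an assumption of this sketch.
    """
    if severity not in ("warning", "critical"):
        raise ValueError(f"severity not covered by this sketch: {severity!r}")
    if environment != "production":
        # No 24/7 on-call coverage outside production.
        return "logged in Slack, best effort during business hours"
    if severity == "critical":
        return "24/7 on-call escalation"
    return "tracked during business hours, action not guaranteed"


print(skyscrapers_response("production", "critical"))
print(skyscrapers_response("production", "warning"))
print(skyscrapers_response("staging", "critical"))
```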
What Skyscrapers does NOT respond to
Skyscrapers does not monitor or respond to alerts that fall within the application layer. Examples include:
- Resource usage alerts: high CPU or memory consumption on application pods, K8s node CPU usage driven by application workloads, RDS CPU usage
- Application scaling limits: HPA reaching maximum replicas, pods pending due to insufficient requested resources
- Application-level failures: failing health checks on customer endpoints, failing service monitors, certificate errors on customer-deployed Ingresses, application crash loops
- Single OOMKill events: individual out-of-memory kills on application pods are not acted upon by Skyscrapers. Customers are responsible for tuning resource requests and limits for their workloads. For platform components, Skyscrapers has separate alerts in place (such as SystemPodCrashLooping and SystemTargetDown) to detect sustained service degradation.
- Queue and messaging backlogs: SQS queue depth growing, message processing delays
- Application performance: slow response times, elevated error rates on customer services
- Customer-managed cloud resources: any AWS/cloud resource not provisioned by Skyscrapers
The supporting role
While application monitoring is the customer’s responsibility, Skyscrapers plays a supporting role. Customers can escalate application issues to Skyscrapers through the normal support process when they need help with troubleshooting, root cause analysis, or platform-related investigation. See the Support Process for details.
Alert routing
Skyscrapers configures two default alert routes in Alertmanager:
| Route | Scope | Severity: Info | Severity: Warning | Severity: Critical |
|---|---|---|---|---|
| infra-alerts | Namespaces with maintainer=skyscrapers | Not routed | Slack (Skyscrapers internal; customer notified if potential impact) | On-call escalation + incident management |
| app-alerts | All other namespaces | Not routed | Customer alert channel | Customer alert channel |
Customers are expected to define an appropriate escalation path for their own alerts. During onboarding, Skyscrapers configures the customer’s preferred alert destination (the default is a shared Slack channel). Customers can request custom routes and endpoints for different severity levels. If you want to configure this further, get in touch with us.
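The default routing table can also be read as a simple lookup. The sketch below restates it as data; the names `DEFAULT_ROUTES` and `destination` are hypothetical, and the mapping reflects only the defaults listed in this document, not any custom routes a customer may have requested.

```python
from typing import Dict, Optional

# Default destinations per route and severity, as listed in the table.
# None means the alert is not routed anywhere.
DEFAULT_ROUTES: Dict[str, Dict[str, Optional[str]]] = {
    "infra-alerts": {
        "info": None,
        "warning": "Slack (Skyscrapers internal)",
        "critical": "On-call escalation + incident management",
    },
    "app-alerts": {
        "info": None,
        "warning": "Customer alert channel",
        "critical": "Customer alert channel",
    },
}


def destination(route: str, severity: str) -> Optional[str]:
    """Look up the default destination; raises KeyError on unknown input."""
    return DEFAULT_ROUTES[route][severity]


print(destination("infra-alerts", "critical"))
print(destination("app-alerts", "warning"))
print(destination("app-alerts", "info"))
```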
Summary
| | Platform (Skyscrapers) | Application (Customer) |
|---|---|---|
| Scope | K8s platform + add-ons in maintainer=skyscrapers namespaces, managed technology components with maintainer=skyscrapers AWS tags | All customer-deployed workloads, customer-managed cloud resources |
| Monitors | Component health, cluster/node availability, disk space, backup success, platform error rates | Application health, endpoint availability, queue depths, business metrics, application performance |
| Responds to | Platform outages, node failures, disk pressure, managed service unavailability | Application errors, scaling limits, misconfigurations, performance degradation |
| Does NOT respond to | Application workload issues, customer resource misconfigurations, single OOMKill events | Platform component failures (escalate to Skyscrapers) |
| Escalation | Internal on-call (24/7 for critical, production). Non-production: business hours, best-effort. | Customer-defined escalation path, with Skyscrapers as support escalation |