
The Operation

Once the Platform is live, we run it for you, 24/7 — this is Step 2, Operate. The goal is the offer’s Zero-Ops promise: within 90 days your engineers spend effectively zero hours on infra operations. Everything below is how we make that true.

We own incidents. We lead every response — your team never carries a pager.

Incidents are classified by severity, and our response is bound by an SLA tied to that severity. The most serious is Sev-1 (a critical outage). Our P1 Promise: a Sev-1 outage caused by us means that month is free.

| Severity | Meaning | Response |
|---|---|---|
| Sev-1 | Critical outage / major impact | Immediate, all-hands, SLA-bound |
| Sev-2 | Significant degradation | Urgent, SLA-bound |
| Sev-3 | Minor / contained issue | Handled in normal operations |

  1. Detect — an alert fires from the observability stack (or you raise it in the shared channel).

  2. Acknowledge — the on-call engineer picks it up within the SLA and opens a thread in Teams or Slack.

  3. Respond — we work the incident from runbooks, keeping you updated in the channel as it progresses. For the worst cases, recovery often means a GitOps revert to the last healthy state.

  4. Resolve — service is restored and verified against the dashboards.

  5. Post-mortem — for significant incidents we run a blameless post-mortem and feed the fixes back into the platform and the runbooks.
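The five steps above can be read as a small state machine. The following is a minimal Python sketch of that lifecycle, not our actual tooling: the names `Severity`, `Stage`, and `Incident` are hypothetical, and treating "significant" as Sev-1 or Sev-2 is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    SEV1 = "Sev-1"  # critical outage / major impact
    SEV2 = "Sev-2"  # significant degradation
    SEV3 = "Sev-3"  # minor / contained issue


class Stage(Enum):
    DETECTED = 1
    ACKNOWLEDGED = 2
    RESPONDING = 3
    RESOLVED = 4
    POST_MORTEM = 5


@dataclass
class Incident:
    title: str
    severity: Severity
    stage: Stage = Stage.DETECTED
    history: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Record the starting stage so the history is a full audit trail.
        self.history.append(self.stage)

    def needs_post_mortem(self) -> bool:
        # Assumption: "significant incidents" means Sev-1 or Sev-2.
        return self.severity in (Severity.SEV1, Severity.SEV2)

    def is_closed(self) -> bool:
        if self.stage is Stage.POST_MORTEM:
            return True
        # Minor incidents close at Resolve, with no post-mortem step.
        return self.stage is Stage.RESOLVED and not self.needs_post_mortem()

    def advance(self) -> Stage:
        """Move to the next stage; stages cannot be skipped or repeated."""
        if self.is_closed():
            raise ValueError("incident is already closed")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage
```

In this sketch a Sev-1 incident passes through all five stages, while a Sev-3 incident closes once it is resolved. Calling `advance()` on a closed incident raises an error, which mirrors the idea that every incident follows the same ordered path.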

On-call is a real rotation of senior engineers (24/7 on-call with an SLA is part of the Operate + Scale tier), not “whoever’s awake.”

You get a Direct-Line to an engineer, not a help desk — over the tool your team already lives in:

Microsoft Teams

A shared channel for day-to-day operations, incident threads, and questions.

Slack

Same shared-channel model — we work in the open, where you can see it.

  • One shared channel is the home for status updates, change notifications, and incident threads — everything in the open.
  • Real-time during incidents — you’re kept informed at every step, no chasing a ticket number.
  • Escalation paths are agreed up front, so the right person is reachable when it matters.

The day-to-day work that prevents incidents in the first place:

  • Patching on a regular cadence — OS, dependencies, and platform components kept current.
  • Backups & tested DR — backups are verified by periodic disaster-recovery drills, not assumed.
  • Change management through GitOps — every change is a reviewed, revertable commit, with no manual cluster edits.
  • Ongoing tech-debt paydown and continuous security scanning.
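Because every change is a commit, rolling back to the last healthy state can be expressed as reverting the commits made since then. The sketch below shows one way that might look, assuming the commit hash of the last healthy state is already known and that the GitOps controller applies whatever lands on the tracked branch. The function names and the repository path are hypothetical.

```python
import subprocess
from typing import List


def revert_command(last_healthy: str, branch_tip: str = "HEAD") -> List[str]:
    """Build a `git revert` call for every commit after `last_healthy`.

    `git revert` creates new commits that undo earlier ones, so history
    is preserved and the rollback itself stays reviewable.
    """
    if not last_healthy:
        raise ValueError("last_healthy commit is required")
    # --no-edit keeps the default revert messages instead of opening an editor.
    return ["git", "revert", "--no-edit", f"{last_healthy}..{branch_tip}"]


def revert_to_last_healthy(repo_dir: str, last_healthy: str) -> None:
    # Assumption: repo_dir is a clean checkout of the GitOps repository.
    # Merge commits in the range would additionally need git's -m option.
    subprocess.run(revert_command(last_healthy), cwd=repo_dir, check=True)
```

For example, `revert_command("abc123")` returns `["git", "revert", "--no-edit", "abc123..HEAD"]`. The resulting revert commits would still go through the normal review and push flow rather than being applied to the cluster by hand.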

Beyond keeping the lights on, we look ahead each quarter (part of Operate + Scale):

  • Architecture & scaling review — we grow the platform ahead of your demand.
  • Cloud cost optimization — frequently offsets part of the fee.
  • Board-ready ROI report — the value of the platform in a single slide.

| Guarantee | What it means here |
|---|---|
| Zero-Ops | Within 90 days your engineers spend ~zero hours on infra ops, or we keep working free until they do |
| P1 Promise | A Sev-1 outage caused by us means that month is free |
| No-Lock-In | Month-to-month; cancel anytime on 30 days’ notice |