# SRE Pipeline and Process: Material and Best Practices
#sre #devops #reliability #incident-management #slos #process
Site Reliability Engineering (SRE) applies software engineering practices to operations so systems are reliable, scalable, and efficient. An SRE pipeline is the path work follows from code to production and back into improvement; processes are the repeatable steps for incidents, changes, releases, and capacity. This document covers the core material, the SRE pipeline, and the key processes, each with a clear follow-the-process flow, plus best practices.
## What is SRE?
SRE is a discipline that:
- Treats operations as a software problem — Automate toil, use code (IaC, runbooks, tooling) to manage systems.
- Balances reliability and velocity — Use error budgets so teams can ship features while staying within agreed reliability targets.
- Focuses on outcomes — SLIs and SLOs define “reliable” in measurable terms; work is prioritized by user impact and budget consumption.
| Concept | Meaning |
|---|---|
| SLI | Service Level Indicator — a measurable signal of user-facing quality (e.g. availability, latency, throughput). |
| SLO | Service Level Objective — target for an SLI (e.g. 99.9% availability, p99 < 500 ms). |
| Error budget | Allowed unreliability (e.g. 0.1% = 43 min/month). When budget is exhausted, focus shifts from features to reliability. |
| Toil | Manual, repetitive work that doesn’t add long-term value. SRE aims to eliminate or automate it. |
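The budget arithmetic from the table can be sketched in a few lines (a minimal sketch assuming a 30-day window; the event counts are illustrative):

```python
# Error budget math for the 99.9% availability example above.

def error_budget_minutes(slo_target: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Allowed downtime (in minutes) for a given SLO over the window."""
    return (1.0 - slo_target) * window_minutes

def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures

print(round(error_budget_minutes(0.999), 1))  # 43.2 min per 30-day month
print(budget_remaining(0.999, good_events=999_500, total_events=1_000_000))
```

A burn-rate alert is usually built on the second function: page when the budget is being consumed much faster than the window allows.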
## SRE pipeline: from build to operate
The SRE pipeline is the path changes take from development to production and how feedback flows back. It aligns with a Continuous Delivery pipeline but is viewed through an SRE lens: reliability, observability, and operational readiness at every stage.
```
+----------+     +----------+     +----------+     +----------+     +----------+
|  Build   | --> |   Test   | --> | Release  | --> |  Deploy  | --> | Operate  |
|          |     | (incl.   |     | (artefact|     | (to      |     | & Monitor|
|          |     |  rel.)   |     |  ready)  |     |  prod)   |     |          |
+----------+     +----------+     +----------+     +----------+     +----------+
      |               |                |                |                 |
      v               v                v                v                 v
- Versioning    - Unit, integ.   - Sign-off,      - Canary,         - SLO
- Artefact      - Perf, chaos      gates,           blue/green      - Alerts
  identity      - Security         rollback         or rolling      - Incidents
- Reproducible  - SLO-related      plan             deploy          - Feedback
  build           tests                                               to dev
```
| Stage | SRE focus | Typical activities |
|---|---|---|
| Build | Reproducible, versioned artefacts; no secrets in code. | CI builds, versioning (semver or commit-based), artefact storage (registry). |
| Test | Reliability and performance tested before production. | Unit/integration tests, performance and chaos tests, SLO-related checks, security scans. |
| Release | Clear release criteria and rollback plan. | Release gates (tests, approvals), artefact promotion, rollback procedure documented. |
| Deploy | Safe, observable deployments; minimal blast radius. | Canary, blue/green, or rolling deploy; health checks; automated rollback on failure. |
| Operate & monitor | Systems meet SLOs; incidents are detected and resolved; feedback improves the pipeline. | Monitoring, alerting, on-call, incident response, postmortems, capacity planning. |
Feedback loop: Operate & monitor produces incidents, postmortems, and SLO/error-budget data. That feedback drives improvements in build, test, release, and deploy (e.g. better tests, safer rollouts, less toil).
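The "automated rollback on failure" in the Deploy row can be sketched as a stepwise traffic shift (a minimal sketch: `set_traffic`, `check_error_rate`, and `rollback` are hypothetical hooks into your deploy tooling, and the steps and threshold are illustrative):

```python
import time

CANARY_STEPS = [1, 10, 50, 100]   # percent of traffic per stage
ERROR_RATE_THRESHOLD = 0.01       # abort if canary error rate exceeds 1%

def canary_deploy(set_traffic, check_error_rate, rollback, bake_seconds=300):
    """Shift traffic stepwise; roll back automatically on a failed health check."""
    for percent in CANARY_STEPS:
        set_traffic(percent)
        time.sleep(bake_seconds)  # let metrics accumulate before judging
        if check_error_rate() > ERROR_RATE_THRESHOLD:
            rollback()
            return False          # deploy aborted, old version keeps serving
    return True                   # canary promoted to 100%
```

The same loop works for blue/green (two steps: 0% and 100%) or rolling deploys (one step per batch of instances); only `CANARY_STEPS` changes.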
## Core SRE processes and follow-the-process flow
### 1. Incident management process
Goal: Detect, triage, mitigate, and resolve incidents quickly; communicate clearly; learn afterward.
Follow-the-process flow:

```
1. DETECT            2. TRIAGE            3. MITIGATE           4. RESOLVE           5. LEARN
(Alert / report)     (Severity, owner)    (Contain impact)      (Root cause fix)     (Postmortem)
       |                    |                    |                     |                   |
       v                    v                    v                     v                   v
- Alert fires or     - Assign severity    - Run playbook        - Deploy fix or      - Blameless
  user reports         (P1–P4)            - Scale, restart,       config change        postmortem
- On-call paged      - Assign incident      disable feature     - Verify SLO         - Action items
- Create incident      commander          - Communicate           recovery           - Update runbooks
  channel/ticket     - Create comms         status              - Close incident       and alerts
                       channel                                  - Update status
```
| Step | Who | Actions |
|---|---|---|
| Detect | Monitoring / On-call / Users | Alert fires or report received; create incident record and comms channel. |
| Triage | Incident commander / On-call | Set severity (P1/P2/P3/P4); assign commander and roles; open status page / comms. |
| Mitigate | On-call + relevant engineers | Follow runbook; contain impact (scale, restart, feature flag off); keep stakeholders updated. |
| Resolve | Engineering | Fix root cause (or permanent workaround); verify recovery; close incident. |
| Learn | Team | Blameless postmortem; document cause, timeline, actions; update runbooks and alerts. |
Best practices:
- Runbooks for common failures so mitigation is repeatable.
- Single incident channel (e.g. Slack, PagerDuty) for coordination and status.
- Severity matrix (e.g. P1 = full outage, P2 = major degradation) so response matches impact.
- Blameless postmortem within a few days; focus on systems and process, not individuals.
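A severity matrix like the one above works well as data next to the alert router, so routing and SLA checks stay consistent (a sketch; the P1 ack target matches the on-call table's example, the other values are illustrative assumptions):

```python
# Severity matrix as data; only P1/P2 page a human, P3/P4 become tickets.
SEVERITY = {
    "P1": {"desc": "full outage",           "ack_sla_min": 5,   "page": True},
    "P2": {"desc": "major degradation",     "ack_sla_min": 15,  "page": True},
    "P3": {"desc": "minor degradation",     "ack_sla_min": 60,  "page": False},
    "P4": {"desc": "cosmetic / low impact", "ack_sla_min": 240, "page": False},
}

def route_alert(severity: str) -> str:
    """Return the action the on-call process takes for a given severity."""
    entry = SEVERITY[severity]
    return "page on-call" if entry["page"] else "create ticket"
```

Keeping the matrix in one place means the alerting config, the status page, and the postmortem template can all reference the same definitions.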
### 2. Change and release process
Goal: Ship changes safely and reversibly; avoid regressions and unplanned downtime.
Follow-the-process flow:

```
1. REQUEST / PLAN      2. REVIEW & APPROVE     3. EXECUTE (DEPLOY)    4. VERIFY & CLOSE
- Change ticket        - Peer / SRE review     - Automated pipeline   - Health checks
- Risk assessment      - Error budget check    - Canary / staged      - SLO check
- Rollback plan        - Calendar / window     - Rollback if needed   - Ticket closed
```
| Step | Actions |
|---|---|
| Request / plan | Create change (ticket or MR); describe what, why, rollback plan; check error budget and calendar. |
| Review & approve | Peer or SRE review; approve only if rollback is clear and risk is acceptable; schedule if required. |
| Execute | Run pipeline (deploy); use canary or staged rollout; auto-rollback on health/SLO failure. |
| Verify & close | Confirm health and SLOs; close change; if failed, roll back and document. |
Best practices:
- Standard change for low-risk, automated, repeatable changes (e.g. config from approved pipeline).
- Error budget gate: If budget is exhausted, only reliability-related changes (or emergency) until budget is restored.
- Rollback first: Design so rollback is one click or one command; test rollback in staging.
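The error budget gate can be expressed as a one-function policy check (a sketch; the change-type names and the budget source are assumptions, not a standard API):

```python
def change_allowed(change_type: str, budget_remaining: float) -> bool:
    """Gate a change request on the service's remaining error budget."""
    if change_type in ("reliability", "emergency"):
        return True                   # always allowed under the policy above
    return budget_remaining > 0.0     # feature work only while budget lasts

# Feature work is blocked once the budget is spent:
change_allowed("feature", budget_remaining=0.0)    # False
change_allowed("emergency", budget_remaining=0.0)  # True
```

Wiring this into the pipeline as a required check makes the gate automatic rather than a matter of discipline.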
### 3. On-call and escalation process
Goal: Someone capable is always available to respond; escalation is clear and timely.
Follow-the-process flow:

```
Alert fires --> On-call (L1) acknowledges --> Triage
     |                                          |
     |                                          v
     |                    Can resolve? --> Yes --> Mitigate/Resolve --> Close
     |                         |
     |                         No (or P1) --> Escalate to L2 / specialist /
     |                                        incident commander
     v
No ack within SLA? --> Escalate to next tier or manager
```
| Element | Practice |
|---|---|
| Rotation | Clear primary and secondary; rotation schedule and handover. |
| Alert quality | Only page for actionable, human-needed events; use severity and routing. |
| Escalation | Defined L1 → L2 → … and when to pull in incident commander or management. |
| SLA | Acknowledgement and response time targets (e.g. P1 ack in 5 min). |
| Fatigue | Limit consecutive shifts; no punishment for handing off when tired. |
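The escalation walk above can be sketched as a loop over the chain (hypothetical `notify` and `ack_received` hooks; the tier names and the 5-minute default SLA are illustrative):

```python
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "specialist", "manager"]

def escalate(notify, ack_received, ack_sla_seconds=300):
    """Notify each tier in turn; stop at the first acknowledgement within the SLA."""
    for responder in ESCALATION_CHAIN:
        notify(responder)                              # page this tier
        if ack_received(responder, timeout=ack_sla_seconds):
            return responder                           # incident now has an owner
    return None                                        # chain exhausted: management decides
```

Paging tools implement this loop natively; the value of spelling it out is agreeing on the chain and the ack SLA per severity before an incident, not during one.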
### 4. Postmortem process
Goal: Learn from incidents and near-misses; improve systems and process; avoid blame.
Follow-the-process flow:

```
1. SCHEDULE (within 24–72 h)
        |
2. GATHER FACTS (timeline, actions, metrics, logs)
        |
3. WRITE DRAFT (impact, cause, what went well, what didn’t)
        |
4. REVIEW (team + stakeholders; blameless language)
        |
5. PUBLISH & ACT (action items, owner, due date; update runbooks/alerts)
```
Postmortem content (template):
- Summary — One paragraph: what happened, impact, root cause.
- Impact — Users affected, duration, SLO impact.
- Timeline — Key events (detect, mitigate, resolve) with times.
- Root cause — Why it happened (systems, process, tooling), not “who.”
- What went well / what didn’t — So the team can reinforce or fix.
- Action items — Concrete improvements with owner and due date.
Best practice: Blameless — Focus on how the system allowed the failure and how to change the system, not on blaming people.
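The template above maps naturally to structured data, which makes tracking action items (owner, due date, status) easy to automate (a sketch; the field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str
    owner: str
    due_date: str                  # e.g. "2025-07-01"
    done: bool = False

@dataclass
class Postmortem:
    summary: str                   # what happened, impact, root cause
    impact: str                    # users affected, duration, SLO impact
    timeline: List[str]            # key events (detect, mitigate, resolve) with times
    root_cause: str                # systems/process/tooling, not "who"
    went_well: List[str]
    went_wrong: List[str]
    action_items: List[ActionItem] = field(default_factory=list)

    def open_actions(self) -> List[ActionItem]:
        """Outstanding action items: what weekly reviews should track."""
        return [a for a in self.action_items if not a.done]
```

Storing postmortems as data rather than free-form documents lets you report on overdue action items across the whole fleet.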
### 5. Capacity and demand process
Goal: Add capacity before it’s needed; avoid surprise overload and cost spikes.
Follow-the-process flow:

```
Forecast demand (traffic, growth) --> Model capacity (current headroom)
        |                                         |
        v                                         v
Plan scaling (horizontal / vertical) --> Approve budget / procurement
        |
        v
Implement (autoscaling, new nodes, quotas) --> Re-check SLO and headroom
```
| Activity | Frequency | Output |
|---|---|---|
| Trend review | Weekly / monthly | Usage vs capacity; growth rate. |
| Headroom | Per service | How much load before SLO is at risk. |
| Scaling triggers | Per release / quarter | When to add capacity (e.g. 70% utilization, or N weeks before campaign). |
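The 70% utilization trigger from the table can be combined with a growth rate to estimate headroom in weeks (a sketch assuming compounding weekly growth; the input numbers are illustrative):

```python
import math

def weeks_until_trigger(current_load, capacity, weekly_growth, trigger=0.70):
    """Weeks before utilization crosses the scaling trigger, given steady growth."""
    utilization = current_load / capacity
    if utilization >= trigger:
        return 0.0  # already past the trigger: scale now
    # Solve utilization * (1 + g)^w = trigger for w.
    return math.log(trigger / utilization) / math.log(1 + weekly_growth)

print(round(weeks_until_trigger(500, 1000, weekly_growth=0.05), 1))  # ≈ 6.9 weeks
```

Comparing this estimate against procurement lead time tells you whether "plan scaling" needs to start this week or can wait for the next trend review.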
## Best practices summary
| Area | Best practice |
|---|---|
| SLOs & error budget | Define a few user-centric SLOs per service; use error budget to decide when to pause features and fix reliability; review SLOs and targets periodically. |
| Incidents | Severity matrix; single channel; runbooks; blameless postmortems and action items; update runbooks and alerts after each incident. |
| Changes | Rollback plan for every change; use error budget as a gate; prefer small, frequent changes over big bang; automate where possible. |
| On-call | Only actionable alerts; clear escalation; ack/response SLAs; limit toil and fatigue. |
| Toil | Identify toil (manual, repetitive); automate or eliminate; spend saved time on reliability and automation. |
| Documentation | Runbooks for common failures; architecture and dependencies; postmortems and playbooks in a single place. |
| Culture | Blameless learning; shared ownership of reliability; SRE and product/development collaborate on error budget and priorities. |
## End-to-end: follow the process
When something happens, follow the process so response is consistent and improvable:
- Alert or request → Create incident or change record; assign owner.
- Triage → Severity and impact; notify the right people; open comms.
- Act → Follow runbook or change plan; mitigate or execute; document what you did.
- Close → Resolve and verify; close ticket; if incident, schedule postmortem.
- Learn → Postmortem and action items; update runbooks, alerts, and tests.
The pipeline (build → test → release → deploy → operate) ensures changes are reliable and observable; the processes (incident, change, on-call, postmortem, capacity) ensure the team responds predictably and improves over time. Together they form the core of an SRE practice that you can adopt and adapt to your organization.