quyennv.com

Senior DevOps Engineer · Healthcare, Singapore

SRE Pipeline and Process: Material and Best Practices

#sre#devops#reliability#incident-management#slos#process

Site Reliability Engineering (SRE) applies software engineering practices to operations so that systems are reliable, scalable, and efficient. An SRE pipeline is the path work follows from code to production and back into improvement; processes are the repeatable steps for incidents, changes, releases, and capacity. This document covers the core concepts, the SRE pipeline, the key processes, each with a clear follow-the-process flow, and best practices.


What is SRE?

SRE is a discipline that:

  • Treats operations as a software problem — Automate toil, use code (IaC, runbooks, tooling) to manage systems.
  • Balances reliability and velocity — Use error budgets so teams can ship features while staying within agreed reliability targets.
  • Focuses on outcomes — SLIs and SLOs define “reliable” in measurable terms; work is prioritized by user impact and budget consumption.
| Concept | Meaning |
| --- | --- |
| SLI | Service Level Indicator — a measurable signal of user-facing quality (e.g. availability, latency, throughput). |
| SLO | Service Level Objective — target for an SLI (e.g. 99.9% availability, p99 < 500 ms). |
| Error budget | Allowed unreliability (e.g. 0.1% = 43 min/month). When the budget is exhausted, focus shifts from features to reliability. |
| Toil | Manual, repetitive work that doesn't add long-term value. SRE aims to eliminate or automate it. |
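The error-budget arithmetic in the table can be sketched directly; the 30-day window below is an assumption for the "per month" figure:

```python
def error_budget_minutes(slo: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime per window for a given availability SLO."""
    return (1 - slo) * window_minutes

# 99.9% availability over a 30-day month leaves ~43 minutes of budget:
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

The same function works for any window: pass a week in minutes to get the weekly budget.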

SRE pipeline: from build to operate

The SRE pipeline is the path changes take from development to production and how feedback flows back. It aligns with a Continuous Delivery pipeline but is viewed through an SRE lens: reliability, observability, and operational readiness at every stage.

  +----------+     +----------+     +----------+     +----------+     +----------+
  |  Build   | --> |  Test    | --> | Release  | --> |  Deploy  | --> |  Operate |
  |          |     | (incl.   |     | (artefact|     | (to      |     | & Monitor|
  |          |     |  rel.)   |     |  ready)  |     |  prod)   |     |          |
  +----------+     +----------+     +----------+     +----------+     +----------+
       |                 |                 |                 |                 |
       v                 v                 v                 v                 v
  - Versioning      - Unit, integ.    - Sign-off,       - Canary,         - SLOs
  - Artefact        - Perf, chaos       gates,            blue/green,     - Alerts
    identity        - Security          rollback          or rolling      - Incidents
  - Reproducible    - SLO-related       plan              deploy          - Feedback
    build             tests                                                  to dev
| Stage | SRE focus | Typical activities |
| --- | --- | --- |
| Build | Reproducible, versioned artefacts; no secrets in code. | CI builds, versioning (semver or commit-based), artefact storage (registry). |
| Test | Reliability and performance tested before production. | Unit/integration tests, performance and chaos tests, SLO-related checks, security scans. |
| Release | Clear release criteria and rollback plan. | Release gates (tests, approvals), artefact promotion, rollback procedure documented. |
| Deploy | Safe, observable deployments; minimal blast radius. | Canary, blue/green, or rolling deploy; health checks; automated rollback on failure. |
| Operate & monitor | Systems meet SLOs; incidents are detected and resolved; feedback improves the pipeline. | Monitoring, alerting, on-call, incident response, postmortems, capacity planning. |
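The "automated rollback on failure" idea in the Deploy stage can be sketched as a simple canary gate; the metric, sample window, and the 1% error-rate threshold are illustrative assumptions, not a real deployment API:

```python
def canary_gate(error_rates: list[float], threshold: float = 0.01) -> str:
    """Promote the release only if every canary sample stays under the
    error-rate threshold; otherwise signal a rollback."""
    if any(rate > threshold for rate in error_rates):
        return "rollback"
    return "promote"

print(canary_gate([0.002, 0.004, 0.003]))  # promote
print(canary_gate([0.002, 0.050, 0.003]))  # rollback
```

In practice the samples would come from the monitoring system and the decision would trigger the deploy tool's rollback, but the gate itself stays this small.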

Feedback loop: Operate & monitor produces incidents, postmortems, and SLO/error-budget data. That feedback drives improvements in build, test, release, and deploy (e.g. better tests, safer rollouts, less toil).


Core SRE processes and follow-the-process flow

1. Incident management process

Goal: Detect, triage, mitigate, and resolve incidents quickly; communicate clearly; learn afterward.

Follow-the-process flow:

  1. DETECT          2. TRIAGE           3. MITIGATE          4. RESOLVE          5. LEARN
  (Alert / report)   (Severity, owner)   (Contain impact)     (Root-cause fix)    (Postmortem)
        |                  |                   |                    |                   |
        v                  v                   v                    v                   v
  - Alert fires or   - Assign severity   - Run playbook       - Deploy fix or     - Blameless
    user reports       (P1–P4)           - Scale, restart,      config change       postmortem
  - On-call paged    - Assign incident     disable feature    - Verify SLO        - Action items
  - Create incident    commander         - Communicate          recovery          - Update runbooks
    channel/ticket   - Create comms        status             - Close incident      and alerts
                       channel           - Update status
| Step | Who | Actions |
| --- | --- | --- |
| Detect | Monitoring / on-call / users | Alert fires or report received; create incident record and comms channel. |
| Triage | Incident commander / on-call | Set severity (P1–P4); assign commander and roles; open status page / comms. |
| Mitigate | On-call + relevant engineers | Follow runbook; contain impact (scale, restart, feature flag off); keep stakeholders updated. |
| Resolve | Engineering | Fix root cause (or permanent workaround); verify recovery; close incident. |
| Learn | Team | Blameless postmortem; document cause, timeline, actions; update runbooks and alerts. |

Best practices:

  • Runbooks for common failures so mitigation is repeatable.
  • Single incident channel (e.g. Slack, PagerDuty) for coordination and status.
  • Severity matrix (e.g. P1 = full outage, P2 = major degradation) so response matches impact.
  • Blameless postmortem within a few days; focus on systems and process, not individuals.
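A severity matrix like the one mentioned above can be encoded so that routing is mechanical; the tiers, descriptions, and ack targets here are assumptions to adapt to your own policy:

```python
# Illustrative severity matrix, not a standard taxonomy.
SEVERITY = {
    "P1": {"desc": "full outage",       "ack_minutes": 5,   "page": True},
    "P2": {"desc": "major degradation", "ack_minutes": 15,  "page": True},
    "P3": {"desc": "minor degradation", "ack_minutes": 60,  "page": False},
    "P4": {"desc": "low impact",        "ack_minutes": 240, "page": False},
}

def route(severity: str) -> str:
    """Map a severity to the response it should get."""
    policy = SEVERITY[severity]
    action = "page on-call" if policy["page"] else "create ticket"
    return f"{action}, ack within {policy['ack_minutes']} min"

print(route("P1"))  # page on-call, ack within 5 min
```

Keeping the matrix in code (or config) means alert routing, paging rules, and dashboards all read the same source of truth.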

2. Change and release process

Goal: Ship changes safely and reversibly; avoid regressions and unplanned downtime.

Follow-the-process flow:

  1. REQUEST / PLAN    2. REVIEW & APPROVE     3. EXECUTE (DEPLOY)    4. VERIFY & CLOSE
  - Change ticket      - Peer / SRE review     - Automated pipeline   - Health checks
  - Risk assessment    - Error budget check    - Canary / staged      - SLO check
  - Rollback plan      - Calendar / window     - Rollback if needed   - Ticket closed
| Step | Actions |
| --- | --- |
| Request / plan | Create change (ticket or MR); describe what, why, and the rollback plan; check error budget and calendar. |
| Review & approve | Peer or SRE review; approve only if rollback is clear and risk is acceptable; schedule if required. |
| Execute | Run the deploy pipeline; use canary or staged rollout; auto-rollback on health/SLO failure. |
| Verify & close | Confirm health and SLOs; close the change; if it failed, roll back and document. |

Best practices:

  • Standard change for low-risk, automated, repeatable changes (e.g. config from approved pipeline).
  • Error budget gate: If budget is exhausted, only reliability-related changes (or emergency) until budget is restored.
  • Rollback first: Design so rollback is one click or one command; test rollback in staging.
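The error budget gate above reduces to a small policy function; the change-type labels are illustrative assumptions, not a standard taxonomy:

```python
def change_allowed(budget_remaining: float, change_type: str) -> bool:
    """With budget left, any change may proceed; once the budget is
    spent, only reliability fixes and emergencies go through."""
    if budget_remaining > 0:
        return True
    return change_type in ("reliability", "emergency")

print(change_allowed(0.4, "feature"))      # True
print(change_allowed(0.0, "feature"))      # False
print(change_allowed(0.0, "reliability"))  # True
```

Wiring a check like this into the CI pipeline makes the gate automatic rather than a meeting-time argument.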

3. On-call and escalation process

Goal: Someone capable is always available to respond; escalation is clear and timely.

Follow-the-process flow:

  Alert fires --> On-call (L1) acknowledges --> Triage
       |                    |
       |                    v
       |              Can resolve? --> Yes --> Mitigate/Resolve --> Close
       |                    |
       |                    No (or P1) --> Escalate to L2 / specialist / incident commander
       |                                        |
       v                                        v
  No ack within SLA? --> Escalate to next tier or manager
| Element | Practice |
| --- | --- |
| Rotation | Clear primary and secondary; rotation schedule and handover. |
| Alert quality | Only page for actionable events that need a human; use severity and routing. |
| Escalation | Defined L1 → L2 → … and when to pull in an incident commander or management. |
| SLA | Acknowledgement and response-time targets (e.g. P1 ack in 5 min). |
| Fatigue | Limit consecutive shifts; no punishment for handing off when tired. |

4. Postmortem process

Goal: Learn from incidents and near-misses; improve systems and process; avoid blame.

Follow-the-process flow:

  1. SCHEDULE (within 24–72 h)
        |
  2. GATHER FACTS (timeline, actions, metrics, logs)
        |
  3. WRITE DRAFT (impact, cause, what went well, what didn’t)
        |
  4. REVIEW (team + stakeholders; blameless language)
        |
  5. PUBLISH & ACT (action items, owner, due date; update runbooks/alerts)

Postmortem content (template):

  • Summary — One paragraph: what happened, impact, root cause.
  • Impact — Users affected, duration, SLO impact.
  • Timeline — Key events (detect, mitigate, resolve) with times.
  • Root cause — Why it happened (systems, process, tooling), not “who.”
  • What went well / what didn’t — So the team can reinforce or fix.
  • Action items — Concrete improvements with owner and due date.
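A skeleton generator keeps every postmortem starting from the same template; the heading names below mirror this post's sections, not any standard tooling:

```python
def postmortem_skeleton(incident_id: str, title: str) -> str:
    """Emit a markdown postmortem skeleton following the template above."""
    sections = ["Summary", "Impact", "Timeline", "Root cause",
                "What went well / what didn't", "Action items"]
    lines = [f"Postmortem {incident_id}: {title}", ""]
    for section in sections:
        lines += [f"## {section}", "", "TBD", ""]
    return "\n".join(lines)

print(postmortem_skeleton("2024-001", "API outage").splitlines()[0])
# Postmortem 2024-001: API outage
```

Generating the skeleton at incident-close time (rather than hoping someone copies a wiki page) is a cheap way to make the "schedule within 24–72 h" step stick.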

Best practice: Blameless — Focus on how the system allowed the failure and how to change the system, not on blaming people.


5. Capacity and demand process

Goal: Add capacity before it’s needed; avoid surprise overload and cost spikes.

Follow-the-process flow:

  Forecast demand (traffic, growth) --> Model capacity (current headroom)
        |                                    |
        v                                    v
  Plan scaling (horizontal / vertical) --> Approve budget / procurement
        |
        v
  Implement (autoscaling, new nodes, quotas) --> Re-check SLO and headroom
| Activity | Frequency | Output |
| --- | --- | --- |
| Trend review | Weekly / monthly | Usage vs capacity; growth rate. |
| Headroom | Per service | How much load before the SLO is at risk. |
| Scaling triggers | Per release / quarter | When to add capacity (e.g. 70% utilization, or N weeks before a campaign). |
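The headroom and scaling-trigger checks in the table reduce to two one-liners; the 70% trigger mirrors the example above, and the function names are assumptions:

```python
def needs_scaling(utilization: float, trigger: float = 0.70) -> bool:
    """True once utilization reaches the scaling trigger (e.g. 70%)."""
    return utilization >= trigger

def headroom(current_load: float, max_sustainable_load: float) -> float:
    """Fractional extra load the service can absorb before SLO risk."""
    return max_sustainable_load / current_load - 1

print(needs_scaling(0.75))  # True
print(headroom(500, 1000))  # 1.0, i.e. the service can take 2x current load
```

Running these per service in the weekly trend review turns "do we have headroom?" into a number instead of a feeling.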

Best practices summary

| Area | Best practice |
| --- | --- |
| SLOs & error budget | Define a few user-centric SLOs per service; use the error budget to decide when to pause features and fix reliability; review SLOs and targets periodically. |
| Incidents | Severity matrix; single channel; runbooks; blameless postmortems and action items; update runbooks and alerts after each incident. |
| Changes | Rollback plan for every change; use the error budget as a gate; prefer small, frequent changes over big bang; automate where possible. |
| On-call | Only actionable alerts; clear escalation; ack/response SLAs; limit toil and fatigue. |
| Toil | Identify toil (manual, repetitive); automate or eliminate it; spend the saved time on reliability and automation. |
| Documentation | Runbooks for common failures; architecture and dependencies; postmortems and playbooks in a single place. |
| Culture | Blameless learning; shared ownership of reliability; SRE and product/development collaborate on error budget and priorities. |

End-to-end: follow the process

When something happens, follow the process so response is consistent and improvable:

  1. Alert or request → Create incident or change record; assign owner.
  2. Triage → Severity and impact; notify the right people; open comms.
  3. Act → Follow runbook or change plan; mitigate or execute; document what you did.
  4. Close → Resolve and verify; close ticket; if incident, schedule postmortem.
  5. Learn → Postmortem and action items; update runbooks, alerts, and tests.

The pipeline (build → test → release → deploy → operate) ensures changes are reliable and observable; the processes (incident, change, on-call, postmortem, capacity) ensure the team responds predictably and improves over time. Together they form the core of an SRE practice that you can adopt and adapt to your organization.
