SRE Requirements

  1. A way to get servers
  2. A way to get services
  3. A way to talk and communicate
  4. Technical documentation?

📋 Requirements for SRE / Technical Support / NOC Team

1. Team Structure & Communication

  • Shift Coverage: 24/7/365 or follow-the-sun model, with on-call rotations.
  • Escalation Paths: Define L1 (NOC/Support), L2 (SRE/DevOps), L3 (Engineering/Dev).
  • Communication Tools:
      • ChatOps (Slack, Mattermost, etc.) for alert integrations.
      • Ticketing system (Jira Service Management, Zendesk, Freshservice).
      • Incident management platform (PagerDuty, Opsgenie, Squadcast).
      • Runbooks and knowledge base (Confluence, Notion, internal wiki).
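
As a rough illustration of the follow-the-sun model mentioned above, the sketch below picks the region that owns the shift for a given moment. The three regions and their 8-hour UTC windows are assumptions made for the example, not a recommendation; a real rotation would normally live in the incident management platform.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical split of the UTC day into three 8-hour regional shifts.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def region_on_shift(now: datetime) -> str:
    """Return the region covering `now`; expects a timezone-aware datetime."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError("shift table does not cover the full day")


if __name__ == "__main__":
    print(region_on_shift(datetime.now(timezone.utc)))
```

Keeping the shift table as data makes it easy to review handover times without touching the lookup logic.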

2. Infrastructure & Servers

  • Monitoring & Observability:
      • Metrics: Prometheus + Grafana / Datadog / New Relic.
      • Logs: ELK / OpenSearch / Loki / Splunk.
      • Tracing: OpenTelemetry / Jaeger / Tempo.

  • Alerting:
      • Clear thresholds for CPU, memory, latency, error rates, queue backlogs, etc.
      • Alert routing (critical vs. warning, business vs. infra).

  • Access & Security:
      • Jump host / bastion server for SSH access (installation, troubleshooting, management).
      • Role-based access control (RBAC) with audit logging.

  • Server Requirements:
      • An initial server list (sheet) to start from, and a way to request additional servers.
      • NOC Screens/Wallboards: Large dashboards in ops center (or virtual dashboards).
      • Monitoring/Logging nodes: HA setup to avoid blind spots.
      • Secure VPN or zero-trust access for remote NOC members.
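
The alerting items in this section (clear thresholds, then routing by critical vs. warning and business vs. infra) could be sketched roughly as follows. All metric names, threshold values, and destinations are hypothetical placeholders; in practice these rules would usually be expressed in the monitoring and paging tools rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; real values depend on each service and its SLOs.
THRESHOLDS = {
    "cpu_percent": {"warning": 80.0, "critical": 95.0},
    "error_rate_percent": {"warning": 1.0, "critical": 5.0},
    "p99_latency_ms": {"warning": 500.0, "critical": 2000.0},
}

# Hypothetical routing table: (severity, domain) -> destination.
ROUTES = {
    ("critical", "infra"): "page:sre-oncall",
    ("critical", "business"): "page:sre-oncall+product-owner",
    ("warning", "infra"): "chat:#infra-alerts",
    ("warning", "business"): "chat:#business-alerts",
}


@dataclass(frozen=True)
class Sample:
    metric: str
    value: float
    domain: str  # "infra" or "business"


def classify(sample: Sample) -> Optional[str]:
    """Return "critical", "warning", or None when the value is within limits."""
    limits = THRESHOLDS[sample.metric]
    if sample.value >= limits["critical"]:
        return "critical"
    if sample.value >= limits["warning"]:
        return "warning"
    return None


def route(sample: Sample) -> Optional[str]:
    """Return the destination for a sample, or None when no alert is needed."""
    severity = classify(sample)
    if severity is None:
        return None
    return ROUTES[(severity, sample.domain)]
```

For example, `route(Sample("cpu_percent", 97.0, "infra"))` pages the on-call engineer, while a warning-level business metric only posts to chat.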

3. Processes & Documentation

  • Incident Management Process:
      • Severity levels (SEV1–SEV4).
      • On-call escalation rules.
      • Postmortem process (blameless retros).

  • Change Management:
      • Deployments tracked via CI/CD (GitOps/ArgoCD/Jenkins/GitHub Actions).
      • Approval flow for production changes.

  • Runbooks & Playbooks:
      • Step-by-step guides for common issues (e.g., restart service, failover DB).
      • Automated scripts wherever possible.

  • Knowledge Sharing:
      • Internal wiki, FAQs, troubleshooting database.
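
To make the severity levels and on-call escalation rules more concrete, here is a minimal sketch that combines SEV1–SEV4 with the L1/L2/L3 tiers from section 1. The escalation timeouts are assumptions chosen for illustration only; the actual values are a team decision.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # least severe


# Escalation tiers as named in section 1.
TIERS = ["L1 (NOC/Support)", "L2 (SRE/DevOps)", "L3 (Engineering/Dev)"]

# Hypothetical: minutes an incident may stay unacknowledged at one tier
# before it moves up to the next tier.
ESCALATE_AFTER_MIN = {
    Severity.SEV1: 5,
    Severity.SEV2: 15,
    Severity.SEV3: 60,
    Severity.SEV4: 240,
}


def current_tier(severity: Severity, minutes_unacknowledged: int) -> str:
    """Return the tier that should own the incident right now."""
    steps = minutes_unacknowledged // ESCALATE_AFTER_MIN[severity]
    return TIERS[min(steps, len(TIERS) - 1)]
```

With these placeholder values, an unacknowledged SEV1 reaches L2 after 5 minutes and L3 after 10, while a SEV4 stays with L1 for the first four hours.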

4. Tooling

  • Collaboration & Tracking:
      • Ticketing (Jira, ServiceNow, Freshdesk).
      • Incident timeline automation (e.g., PagerDuty incident commander).

  • Monitoring Dashboards:
      • Unified health dashboards for infra + apps + business KPIs.

  • CMDB / Asset Inventory:
      • Track servers, VMs, containers, cloud services.

  • Automation:
      • Auto-remediation scripts (restart pods, scale services).
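
An auto-remediation script of the kind listed above might look like the sketch below. It assumes the workloads run on Kubernetes and that `kubectl` is available on the host; the alert names and the shape of the alert dictionary are hypothetical. It defaults to a dry run so the command can be reviewed before anything is executed.

```python
import subprocess
from typing import Callable, Dict, List, Optional


def restart_cmd(namespace: str, deployment: str) -> List[str]:
    """Build a rolling-restart command for one deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def scale_cmd(namespace: str, deployment: str, replicas: int) -> List[str]:
    """Build a command that sets the replica count of one deployment."""
    return ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}", "-n", namespace]


# Hypothetical alert names mapped to the command each one should trigger.
REMEDIATIONS: Dict[str, Callable[[dict], List[str]]] = {
    "PodCrashLooping": lambda a: restart_cmd(a["namespace"], a["deployment"]),
    "QueueBacklogHigh": lambda a: scale_cmd(a["namespace"], a["deployment"], a["replicas"] + 1),
}


def remediate(alert: dict, dry_run: bool = True) -> Optional[List[str]]:
    """Return the remediation command for an alert and run it unless dry_run is set.

    Returns None when no automated action is defined, leaving the alert to a human.
    """
    build = REMEDIATIONS.get(alert["name"])
    if build is None:
        return None
    cmd = build(alert)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Alerts with no entry in the table fall through to the normal on-call path, which keeps automation limited to actions that have an agreed runbook.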

5. Reporting & Metrics

  • SLA / SLO / SLI definitions:
      • Availability, latency, error rate, throughput.

  • Operational Metrics:
      • MTTR (Mean Time to Recovery).
      • MTTA (Mean Time to Acknowledge).
      • Incident frequency & recurrence.

  • Business Dashboards:
      • Impacted users, revenue lost to downtime, uptime % reporting.
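
The operational metrics above can be derived from incident timestamps. The sketch below is one possible calculation: it assumes each incident records when it was opened, acknowledged, and resolved, measures MTTR from open to resolution, and treats every incident as full, non-overlapping downtime when computing uptime. The `Incident` record is a hypothetical shape, not the export format of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Average a non-empty collection of durations."""
    items = list(deltas)
    return sum(items, timedelta()) / len(items)


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean Time to Acknowledge: opened -> acknowledged."""
    return mean_delta(i.acknowledged - i.opened for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean Time to Recovery: opened -> resolved (an assumption of this sketch)."""
    return mean_delta(i.resolved - i.opened for i in incidents)


def uptime_percent(incidents: List[Incident], window: timedelta) -> float:
    """Uptime % over a reporting window, counting each incident as full downtime."""
    downtime = sum((i.resolved - i.opened for i in incidents), timedelta())
    return 100.0 * (1 - downtime / window)
```

Note that `mtta` and `mttr` raise `ZeroDivisionError` on an empty list, so a reporting job would need to handle periods with no incidents explicitly.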