Job Description
The Domain Lead - Site Reliability Management is a senior leadership role responsible for the end-to-end reliability, resilience, and operational excellence of all IT systems within T-Systems. This executive will lead a distributed team of 10 Site Reliability Engineers embedded throughout the company, setting the strategic direction for reliability engineering and ensuring the stability of critical business services while operating and developing our entire internal IT landscape. The role is pivotal in driving a culture of continuous improvement, proactive risk management, and blameless learning throughout the IT organization, bringing new technology and smart solutions to the forefront of the company's future.
Purpose of the role:
- Serve as the organization's chief stability and reliability authority, accountable for the availability, performance, and recoverability of all IT services.
- Lead the design and execution of a comprehensive reliability strategy, aligning with business objectives and regulatory requirements.
- Foster a company-wide culture of resilience, incident prevention, and operational transparency.
Key Responsibilities
- Strategic Leadership: Define and champion the company’s reliability vision, policies, and maturity roadmap. Set and monitor organizational SLOs, SLIs, and error budgets.
- Team Management: Direct and mentor a distributed team of SRMs, ensuring consistent standards, knowledge sharing, and professional growth across domains.
- Reliability Governance: Oversee domain-wide stability programs, coordinate cross-functional reliability initiatives, and ensure alignment with business impact priorities.
- Incident Command: Act as the executive escalation point during major incidents, ensuring effective incident response, root cause analysis, and implementation of systemic fixes.
- Observability & Monitoring: Ensure comprehensive observability across all platforms, driving adoption of modern monitoring tools and practices to enable proactive detection and resolution.
- Infrastructure & Deployment: Oversee the reliability of CI/CD pipelines, infrastructure as code practices, and deployment strategies (e.g., canary releases, blue-green deployments).
- Resilience Engineering: Lead organization-wide initiatives in chaos engineering, failure testing, and capacity planning to minimize blast radius and prevent outages.
- Change Management: Guide risk assessment and approval of major releases and configuration changes, potentially replacing legacy Change Challenger models.
- Stakeholder Collaboration: Partner with engineering, product, and business leaders to align reliability goals, communicate risk, and drive adoption of best practices.
- Culture & Learning: Promote a blameless postmortem culture, facilitate reliability workshops, and ensure continuous learning and improvement.
Qualifications
Key Qualifications:
- Proven executive experience in SRE, IT operations, or large-scale infrastructure leadership within complex, distributed environments.
- Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g., AWS, Azure, GCP).
- Demonstrated success in leading cross-functional teams, driving organization-wide stability programs, and managing high-stakes incidents.
- Strong familiarity with modern observability tools (Prometheus, Grafana, ELK, Datadog) and deployment frameworks (Kubernetes, Terraform, Ansible).
- Exceptional communication skills, with the ability to influence senior stakeholders and coach both technical and non-technical teams.
- Experience with ITIL, DevOps, and structured Change, Incident, and Problem Management frameworks.
Success Metrics:
- Reduction in critical incidents, IBIs, and Mean Time to Repair (MTTR).
- Measurable improvements in observability, monitoring coverage, and SLO adherence.
- Implementation and tracking of preventive actions and systemic fixes.
- Organization-wide visibility and mitigation of stability risks.
- Delivery and execution of a reliability roadmap, with clear progress metrics.
Core Knowledge Areas:
- SRE principles (error budgets, toil reduction, SLOs/SLIs)
- Incident lifecycle and blameless postmortems
- Observability and monitoring (metrics, logging, alerting)
- Infrastructure as code, CI/CD, deployment best practices
- Chaos engineering, load and failure testing
- Cloud and hybrid system design, geo-redundancy
- Governance, communication, and cross-domain collaboration