Description
Skills
- Site reliability, observability and monitoring experience on complex projects, in particular monitoring configuration as code in Azure with AKS, Grafana and Prometheus.
Desirable
- Prior experience establishing a Site Reliability Engineering function with 24/7 support.
- Coding experience, with the ability to demonstrate how to build, test, scan and deploy .NET and JavaScript applications.
- Hands-on experience with Azure cloud, IaC, JSON, Azure Bicep, Azure Policy, Azure DevOps, OpenTelemetry, Azure Monitor, Azure Sentinel, Azure Defender, Grafana, Kusto queries, Kubernetes (AKS), Azure Arc and Azure Function Apps.
- Excellent knowledge of DevOps, security and IT service management.
- Hands-on experience with Azure cloud and full-stack observability using tools such as Log Analytics and Application Insights.
- Deep knowledge of Kubernetes and Prometheus
- Experience with GitOps practices.
- An understanding of shift-right approaches and experience with chaos engineering.
- A proactive approach to spotting problems, areas for improvement and performance bottlenecks.
- Knowledge of automating IT request fulfilment processes through orchestration in ServiceNow.
- Knowledge of cloud-native microservices, including containerisation and API management.
- Effective communication and presentation skills.
What you will do
As a Site Reliability Engineer, you will lead the adoption of SRE practices as part of our SRE enablement team. You will work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You will track and reduce toil, define SLIs and SLOs, and define error budgets that help find the right balance between risk and reliability.
You will also provide structure and help to our release process, suggesting and making improvements where possible. You will scale systems sustainably through mechanisms like automation, and evolve them by pushing for changes that improve reliability and velocity. We will also look to you to coach and provide guidance to colleagues and the wider team.