Principal Site Reliability Engineer

Department: DevOps

Employment Type: Full time

Location: North East England

Amplience is looking to grow and expand its business to become one of the best headless content & media delivery platforms in the market. Helping to achieve this will be a core part of the Principal Site Reliability Engineer's role: ensuring that Amplience has a platform that is reliable and can scale to meet future demands in a rapidly expanding business. The SRE will be responsible for implementing SLOs and SLIs to meet current and future business SLAs, defining a monitoring and capacity planning strategy that allows Amplience to plan 6-12 months ahead, ensuring that agreed uptimes are met, and identifying and raising current and future risk areas with stakeholders.

The SRE will work with stakeholders to identify which services would benefit from having SLOs in place, then document and share the agreed SLOs, error budgets and error budget policies. The SRE will also define and agree the supporting SLIs used to measure and report against those SLOs. This will be a constantly evolving process as new services are introduced and challenges arise, and the SRE will be expected to actively look for potential problem areas. The SRE will be required to work closely with all areas of the business, including Operations, Engineering and Customer Success, to ensure a holistic view of expectations and performance.

Key Responsibilities

Taking ownership of defining, implementing and maintaining the SLIs and SLOs.
Managing a team of SREs in order to deliver the SLIs and SLOs.
Ensuring releases are scoped within the SLIs/SLOs and managing the change process where updates are necessary.
Working with Product, Engineering and other stakeholders to ensure objectives are aligned and correct.
Driving improvement processes across the teams.
Focusing on reducing MTTR and maintaining the error budget.
Maintaining and improving our observability tools.
Taking ownership of the Services.
Acting as gatekeeper of the production environments, ensuring changes are honest and don't impact reliability, scalability, performance, etc.

Skills, Knowledge and Expertise

Experience working with infrastructure and application monitoring tools: CloudWatch, Prometheus, Grafana, Kibana, Datadog, etc.
Experience with monitoring, instrumentation and metrics that clearly describe service behaviours.
Experience defining and implementing incident response management processes.
Thorough understanding of automation and orchestration principles.
Use of profilers, APM, tracing.
Expert knowledge of AWS, ideally with AWS Professional-level certification; at a minimum, able to demonstrate professional-level experience.
SRE/DevOps experience and comfortable operating software in a Linux-based environment.

Benefits

Competitive salary
Flexible working arrangements
Discretionary bonus scheme
Company pension scheme
Employee share options so that everyone can benefit from our success
Enhanced maternity & paternity policies
Extra holidays once you've been with us for a while
The option to purchase additional holidays
Charity / volunteer days
Life assurance policy
Ride to work scheme
Season ticket advance loans

About Amplience

Amplience is an API-first, headless CMS and DAM in one: a unified platform for commerce content that does everything you need it to. Organize, find and enrich all your assets from a central library. Optimize and automate your product media, images and videos.

Plan, schedule, produce and deliver customer experiences. Do it all from the same platform.

And do more of it. Better, and faster, than ever.

Our Hiring Process