Job Summary
We’re seeking a Site Reliability Engineer with experience in hybrid AWS environments, ideally within regulated or high-compliance settings. You’ll be responsible for improving system reliability, reducing failure domains, and implementing best practices across infrastructure automation, CI/CD, observability, and platform engineering, while enabling teams to deliver stable, scalable solutions.
Responsibilities
- Contribute actively to technical domains such as site reliability, cloud best practices, and DevOps.
- Identify and reduce failure domains, design and build resilient systems, and reduce the adverse effects of outages.
- Contribute strongly to the objectives of the cross-functional team (Architects, Engineers, Business Analysts, Scrum Master, Security, etc.).
- Build software to minimize effort and business impact during maintenance and failures.
- Synthesize business and technical requirements to deliver business value.
- Mentor engineers on the team to improve technology and apply best practices.
- Encourage and enhance inter-team execution and collaboration.
What you'll do:
- Contribute to site reliability, cloud engineering, and DevOps practices, improving platform stability and scalability.
- Work hands-on within a hybrid AWS environment, ideally in PCI-DSS / financial services settings.
- Design and build resilient systems, reducing failure domains and minimising downtime.
- Implement and manage CI/CD pipelines to support efficient and reliable delivery.
- Automate AWS workloads, including infrastructure provisioning and configuration deployments.
- Enhance observability across environments (monitoring, logging, alerting).
- Troubleshoot and optimise low-latency, Java-based applications within an SRE context.
- Apply SRE principles to improve reliability, performance, and incident response.
- Align solutions with the AWS Well-Architected Framework.
- Collaborate with cross-functional teams to deliver reliable, business-aligned solutions.
- Build automation to reduce manual effort and improve maintainability.
- Translate business and technical requirements into scalable solutions.
- Mentor team members and promote best practices and knowledge sharing.
- Drive collaboration and delivery excellence across teams.
Your expertise:
- At least 7 years of hands-on experience in a DevOps / SRE role
- Hands-on experience in a hybrid (AWS) cloud environment, preferably in the financial sector with PCI-DSS-compliant systems
- Implementing CI/CD pipeline automation across functional areas
- AWS workload automation (provisioning infrastructure and custom software configuration rollout)
- Implementing observability for hybrid and cloud-based environments
- Strong understanding of networking principles and protocols
- Troubleshooting low-latency, Java-based applications in an SRE (Site Reliability Engineering) context
- Proven experience with Site Reliability Engineering principles and practices
- Understanding of the AWS Well-Architected Framework and practices
Technology and Tooling:
Essential:
- AWS Services: EC2, RDS (Postgres & Oracle EE), EBS, EKS, ECS, Load Balancing, Secrets Manager, KMS, Service Catalog
- AWS CDK
- Observability tooling: Amazon CloudWatch, Splunk / AppDynamics, ELK
- CI/CD platforms (e.g., Harness, Azure DevOps, GitLab or equivalent)
- Operating Systems: Linux and Windows Server
Preferred:
- Python scripting
- AWS CloudFormation
- Ansible
- Bash scripting
Optional:
- Zabbix monitoring
- Keycloak
- ActiveMQ
- WebLogic 14c
- Nginx
Certifications (beneficial):
- AWS Certified SysOps Administrator (CloudOps Engineer)
- AWS Certified Solutions Architect – Associate