Manager, Site Reliability Engineering
Do you like solving problems and getting things done? Are you keen on optimization and large-scale systems development?
Smartsheet UK is looking for a Manager, Site Reliability Engineering to join our Site Reliability Engineering team. Our business is built on finding top-grade talent and getting out of their way while they build and improve our Converse service offering, which enables users to build conversational business workflows. Our software team is small, highly efficient, and results-oriented. We work in an agile, apolitical environment that stays focused on building great software. We are looking for the most highly motivated and intelligent individuals.
This position is based at our Edinburgh, Scotland site.
- Hire, develop and manage a team of Site Reliability Engineers and Administrators providing 24x7 production support
- Troubleshoot, investigate, and fix production issues in cloud and hosted environments, including both hardware and internal software issues
- Develop and improve automated system alerts, troubleshoot system errors effectively, and work incidents to return systems to normal operating conditions
- Manage customer support and development escalations, working directly with Sustaining Engineering
- Ensure production changes are documented, fully tested in non-production environments, and adhere to change control and audit requirements
- Lead incident management, deployment and change processes
- Identify and mitigate security, risk and compliance concerns, in accordance with company policies
- Special projects as assigned
- 8+ years of experience running a 24x7 mission-critical production service with 99.99% uptime requirements
- 6+ years of work experience with production Linux systems administration
- 4+ years of experience with at least one scripting language (e.g., Bash, Python, Ruby, Go)
- Highly motivated, critical thinker with proven ability to manage a diverse team in a production support environment
- Ability to successfully manage competing priorities in critical incident situations
- Proficient with networking and internet protocols (e.g., HTTP, DNS, TCP/IP)
- Proficient with configuration management, source control and containerization tools
- Experience with Agile, Scrum and ITIL service management methodologies
- Strong desire to learn, understand new technologies and mentor others
- Excellent verbal and written communication skills
- Bachelor’s degree in Computer Science or related discipline required
- Legally eligible to work in the UK on an ongoing basis