Manager, Site Reliability Engineering

At Smartsheet, we are building the next-generation workspace collaboration platform. Our Technical Operations team is committed to operational excellence and delivering a world-class customer experience. We're in an exciting high-growth stage, and now is the best time to join our team. Learn more about us with this short video overview of Smartsheet: Smartsheet Overview Video.

We are currently looking for a Site Reliability Engineering Manager to join our Site Reliability Engineering team. In this position, you will directly impact the reliability and performance of our critical production application systems, supporting 24x7 delivery to over 70,000 customers worldwide. We’re looking for a leader to manage and develop a high-performing SRE team. This position reports directly to the Site Reliability Director and is located at our Bellevue, WA headquarters.

Responsibilities:

  • Hire, develop and manage a team of Site Reliability Engineers and Administrators providing 24x7 production support
  • Troubleshoot, investigate, and fix production issues in cloud and hosted environments, including both hardware and internal software issues
  • Develop and improve automated system alerts, troubleshoot system errors effectively, and work incidents to return systems to normal operating conditions
  • Manage customer support and development escalations, working directly with Sustaining Engineering
  • Ensure production changes are documented, fully tested in non-production environments, and adhere to change control and audit requirements
  • Lead incident management, deployment and change processes
  • Identify and mitigate security, risk and compliance concerns, in accordance with company policies
  • Take on special projects as assigned

Qualifications:

  • 8+ years of experience running a 24x7 mission-critical production service with 99.99% uptime requirements
  • 6+ years of work experience with production Linux systems administration
  • 4+ years of experience with at least one scripting language (e.g., Bash, Python, Ruby, Go)
  • Highly motivated, critical thinker with proven ability to manage a diverse team in a production support environment
  • Ability to successfully manage competing priorities in critical incident situations
  • Proficient with networking and internet protocols (e.g., HTTP, DNS, TCP/IP)
  • Proficient with configuration management, source control and containerization tools
  • Experience with Agile, Scrum and ITIL service management methodologies
  • Strong desire to learn and understand new technologies and to mentor others
  • Excellent verbal and written communication skills
  • Ability to work in the U.S. on an ongoing basis
  • Bachelor’s degree in Computer Science or related discipline required
