Senior Site Reliability Engineer
at OnSolve
Boston, MA



Job Description

Job Title: Senior Site Reliability Engineer

Department: Technology, Production Administration

Location: Boston, MA


Company Description 

OnSolve: Always On. Solving Problems.

OnSolve is the market leader in real-time, mass notification and collaboration solutions used by the world’s largest brands and thousands of government agencies to deliver critical information in any situation. Mass notification and collaboration is an essential element of emergency response and business continuity planning, keeping teams on track and coordinating during critical events. The OnSolve suite of critical communication tools is a key component of the business continuity, emergency response, IT alerting, employee safety and security programs of every organization we serve. Visit us on the Web at onsolve.com.

OnSolve is an equal employment opportunity/affirmative action employer.  All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other status protected by law.


Job Summary

At OnSolve, the Senior Site Reliability Engineer is responsible for the reliability, performance, monitoring, and incident response of the platforms and services that run the OnSolve enterprise.

All complex systems can suffer from poor performance, spikes in resource usage, slow responses, uncaught exceptions, and hardware failures. The Senior Site Reliability Engineer is prepared to act quickly in response to problems, using a wide variety of off-the-shelf tools, DevOps patterns and practices, and custom solutions of their own.

This role ensures all production workloads are implemented against a comprehensive, repeatable set of Ops requirements, including diagrams, service dependency information, monitoring and logging plans, backups, and high-availability setups. The role is measured against MTTR (mean time to recovery) and MTTF (mean time to failure): its effectiveness is a function of how quickly services are up and running after a failure, and how infrequently failures arise.



Responsibilities

  • Focus on monitoring system uptime, capacity planning, change management, and emergency response
  • Develop tools & processes to improve the availability, scalability, latency, and efficiency of OnSolve's software stacks
  • Perform quantitative analysis to understand the events that disrupt functionality, and manage the cross-team efforts to resolve those events
  • Build & manage systems through the use of infrastructure-as-code and configuration management tools & best practices (e.g., Terraform, Ansible, Puppet, Git workflows)
  • Leverage development skills to build custom tools, plugins, exporters, and applications for the purpose of managing infrastructure services & backend automation, monitoring, and analysis
  • Deploy and manage virtualization (vSphere) and container (K8s) platforms
  • Log management & monitoring (ELK)
  • Manage monitoring/metrics systems (e.g., Nagios, Prometheus/Grafana), and develop custom plugins & exporters for these systems
  • Document important processes/procedures, and perform team training exercises as needed
  • Drive RCA & reporting efforts, while working with cross-functional teams to resolve complex issues
  • Assume responsibility for independently learning new technologies & practices with a DevOps focus 



Required Qualifications

  • Understanding of large-scale distributed systems, including n-tier architectures, application security, monitoring, virtualization, and containerization platforms
  • Experience with cloud providers, such as Azure/AWS
  • Aversion to "snowflake" systems, and a corresponding appreciation for establishing simple/repeatable patterns
  • Capable of developing advanced automation in multiple languages, such as Python and Bash
  • Competency in Git branching/merging and other common SCM operations/practices
  • Linux OS
  • Windows Server 2016
  • Understanding of layer 3-7 networking concepts & common protocols
  • Self-starter attitude, and capable of making decisions independently (but also willing to learn/follow established team practices)
  • Experience participating in Agile team workflows, using Kanban/Scrum processes


Preferred Qualifications

  • Understanding of object-oriented programming concepts and the SDLC
  • RHCE, CCNP, AZ-100+, and other relevant certifications
  • Prior experience designing/building CI/CD pipelines using tools like Jenkins/TeamCity/Artifactory
  • Experience developing Python code


Compensation & Benefits

  • Health, Dental, Vision, Life and additional supplemental insurance
  • 401(k)
  • Paid time off and personal days
  • Paid holidays

The above statements are intended to describe the general nature and level of work being performed by people assigned to this classification. They are not to be construed as an exhaustive list of all responsibilities, duties, and skills required of personnel so classified. All personnel may be required to perform duties outside of their normal responsibilities from time to time, as needed.




Company Description

Come be part of the team that has been credited with helping to locate more than 3,500 missing children and that provides real-time alerts to enhance personal safety via our state-of-the-art critical notification technology.

OnSolve is the largest global provider of SaaS-based critical communication solutions for enterprise, SMB, and government customers. The company’s cloud-based software communications platform provides seamless and easy-to-deploy solutions for the exchange of critical information among organizations, their people, devices and external entities, with use cases designed to save lives, enhance revenue, and reduce costs.

OnSolve solutions include MIR3®, the most comprehensive solution available for large enterprises and federal agencies seeking to manage critical events or natural disasters effectively via the transmission of critical information and instructions. The company’s CodeRED® solution provides high-speed notification services capable of reaching millions of people in minutes, applying its mission critical capabilities to government, utilities, healthcare and other markets. In addition, the company recently acquired Send Word Now’s emergency notification system, increasing its enterprise offerings across all industries. Other solutions offered include SmartNotice® and TelAlert® for specific use cases, or for companies with less complex notification requirements.
