StackPath Site Reliability Engineer in United States
StackPath is a platform of secure Internet services built at the cloud's edge. StackPath services enable developers to build protection and performance into any cloud-based solution (from apps to games, web sites, and beyond) without needing cloud security and delivery expertise of their own. More than 800,000 customers already use StackPath services, ranging from early-stage enterprises to Fortune 100 organizations. Headquartered in Dallas, Texas, StackPath has offices across the U.S. and around the world.
For more information, follow StackPath at www.fb.com/stackpathllc and www.twitter.com/stackpath.
About the Role
The StackPath Site Reliability Engineering (SRE) team combines software, systems and network engineering to deploy and run a portfolio of high-performance edge services including CDN, WAF and Compute. SRE's daily focus is on the availability, change velocity, performance and capacity of customer-facing services and supporting internal systems.
On the SRE team you will have the opportunity to apply your experience against systems at scale, where a single week can involve shifting terabits of traffic between sites, deploying configuration changes to shave milliseconds off billions of requests, or enabling a new software feature on thousands of systems using automated tooling you designed and built.
This role will report to our VP, Site Reliability Engineering.
Essential Duties and Responsibilities
Respond to incidents during on-call duty
Respond to complex customer escalations, which often cross system, network and software boundaries
Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations
Design, develop and maintain dashboards, tooling, alarms and playbooks in collaboration with operations teams to support service-level objectives
Design, develop and maintain reusable monitoring and canary infrastructure
Design, execute and evaluate performance experiments
Collaborate with development teams to complete production readiness checklists prior to major feature launches
Collaborate with operations and engineering teams in determining root cause of major incidents, performance anomalies, or other customer-impacting issues
Desired Skills and Experience
Experience with monitoring and alerting platforms (Prometheus and Alertmanager, Grafana, Zabbix, Nagios)
Experience with a Linux server environment
Experience with scripting languages (Python, Ruby, Perl)
Experience with systems programming languages (Go, C)
Experience with configuration management systems (Puppet, Ansible, Chef)
Expert-level proficiency in systems, network or software engineering
Excited about working on a remote-first engineering team
Proficient at troubleshooting complex systems
Production experience in a service provider environment
Comfortable with a software engineering workflow for collaboration and configuration management: branches, pull requests, merges, conflicts
Projects you might work on
Software and platform feature releases
Live streaming event planning and execution
Network reach and capability expansion
Network and system automation tooling development
Telemetry and monitoring system development
Defining service metrics (SLA, SLO, SLI) during new product development
This job description is not intended to be all-inclusive.
StackPath is an Equal Opportunity Employer. EOE/AA M/F/D/V
If your experience and qualifications match our current needs, a member of our human resources team will contact you. We look forward to hearing from you.