14539 – Site Reliability Engineer (Hybrid) – Austin, TX
Start Date: ASAP
Type: Temporary Project
Estimated Duration: 12+ months with possible extensions
Work Setting: Hybrid. Position will be 3 days remote and 2 days onsite each week (Mondays and Thursdays are the required onsite days).
Only candidates able to relocate as required should apply; applying without being able to relocate may result in removal from future consideration.
Required:
• Experience in systems engineering, DevOps, or site reliability engineering roles (8+ years);
• Experience with Linux/Unix systems and system internals (8+ years);
• Experience in one or more programming/scripting languages (Python, Go, Java, Bash) (8+ years);
• Experience designing and operating highly available, distributed systems (8+ years);
• Experience with cloud platforms (AWS or GCP) and cloud-native services (8+ years);
• Experience with containerization and orchestration (Docker, Kubernetes) (8+ years);
• Experience with monitoring, alerting, and logging concepts (8+ years);
• Experience defining and managing SLIs, SLOs, and error budgets (8+ years);
• Experience with incident management, root cause analysis (RCA), and postmortems (8+ years);
• Experience integrating security and compliance into operational workflows (8+ years).
Preferred:
• Experience with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk) (4+ years);
• Experience operating 24x7 production environments with on-call rotations (4+ years);
• Experience with chaos engineering and resiliency testing (4+ years);
• Experience with feature flags, canary deployments, and progressive delivery (4+ years);
• Experience documenting runbooks, dashboards, and operational standards (4+ years).
Responsibilities include but are not limited to the following:
• Ensure system reliability and performance by designing, implementing, and maintaining highly available and scalable production systems;
• Collaborate with development teams to build resilient, observable, and automated platforms that meet defined Service Level Objectives (SLOs);
• Develop automation tools and scripts (using Python, Go, Bash, etc.) to streamline operational tasks, deployments, and monitoring setups;
• Implement monitoring and alerting solutions using observability tools (e.g., Prometheus, Grafana, Datadog) to proactively detect and resolve issues;
• Conduct incident management and root cause analysis (RCA) to improve system resilience and prevent recurring outages;
• Integrate security and compliance requirements into infrastructure and operational workflows to ensure secure and compliant system performance;
• Continuously optimize infrastructure and processes by performing cost-benefit analyses, evaluating alternative solutions, and innovating with new technologies.