Senior DevOps Engineer
Software Engineering
Palo Alto, CA, USA
USD 170k-220k / year + Equity
Transform healthcare with us.
At Qualified Health, we’re redefining what’s possible with Generative AI in healthcare. Our infrastructure provides the guardrails for safe AI governance, healthcare-specific agent creation, and real-time algorithm monitoring—working alongside leading health systems to drive real change.
This is more than just a job. It’s an opportunity to build the future of AI in healthcare, solve complex challenges, and make a lasting impact on patient care. If you’re ambitious, innovative, and ready to move fast, we’d love to have you on board.
Join us in shaping the future of healthcare.
Job Summary
We're looking for a Senior DevOps Engineer / Site Reliability Engineer to ensure the reliability, performance, and operational excellence of our production environments powering AI solutions for major health systems. You'll partner closely with engineering teams to make services production-ready, own observability and incident response, and drive the practices that keep our platform stable as we scale. As a key member of our infrastructure team, you'll be the connective tissue between development and production, ensuring new features ship safely while maintaining the reliability standards required for healthcare workloads.
Key Responsibilities
Partner with engineering teams to ensure services are production-ready before release, including reviewing deployment patterns, failure modes, resource requirements, and rollback strategies
Design and maintain observability infrastructure including metrics, logging, distributed tracing, and dashboards across multi-cloud environments
Define and manage alerting policies, SLIs/SLOs, and on-call rotations to ensure timely response to production issues
Lead and support incident response for production issues, drive root cause analysis, and coordinate hotfix deployments when needed
Author and maintain release documentation, runbooks, incident postmortems, and operational playbooks
Provide day-to-day operational support to engineering teams, unblocking deployments, debugging production issues, and improving developer experience around shipping to production
Design and maintain zero trust network architectures, ensuring secure connectivity across multi-cloud environments and tenant boundaries
Build and improve CI/CD pipelines and release processes to make production deployments safer, faster, and more predictable
Develop automation in Python and Terraform to reduce toil and codify operational best practices
Manage Kubernetes-based workloads in production, including troubleshooting cluster issues, optimizing resource utilization, and maintaining workload reliability
Operate Temporal workflows in production, including monitoring, scaling, and troubleshooting long-running workflow executions
Collaborate with security and compliance teams to maintain HIPAA and HITRUST controls across production environments
Required Qualifications
6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, with at least 3 years directly managing production workloads
Strong proficiency with Terraform including module development, state management, and multi-environment architectures
Deep experience operating production Kubernetes environments, including troubleshooting, networking, workload management, and cluster operations
Hands-on experience with both Google Cloud Platform and Microsoft Azure services
Strong networking and security knowledge, including zero trust architectures, network segmentation, private connectivity, identity-based access controls, and secrets management
Production experience with Temporal or comparable workflow orchestration systems
Strong proficiency in Python for automation, tooling, and operational scripting
Demonstrated experience designing and operating observability stacks including metrics, logging, tracing, and alerting
Experience leading incident response, including on-call rotation management, runbook development, and postmortem processes
Track record of partnering with engineering teams to improve production readiness and release practices
Excellent written communication skills for authoring runbooks, postmortems, and release documentation
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
Desirable Skills
Experience in the healthcare industry, with an understanding of HIPAA compliance requirements
Familiarity with HITRUST or similar compliance frameworks
Experience operating LLM-based systems, agentic workflows, or RAG pipelines in production
Experience with GitOps workflows (Rancher Fleet, ArgoCD, or Flux)
Experience building and operating multi-tenant SaaS infrastructure
Familiarity with chaos engineering and reliability testing practices
Prior experience as a founding or early SRE/Platform hire at a startup
Technical Environment
Our infrastructure is built on modern cloud technologies including:
Google Cloud Platform (primary) and Microsoft Azure
Google Kubernetes Engine (GKE)
Terraform and Terragrunt
Temporal for workflow orchestration
Python, Go, Shell scripting
GitOps-based deployment workflows
Modern monitoring and observability tools
Why Join Qualified Health?
This is an opportunity to join a fast-growing company and a world-class team that is poised to change the healthcare industry. We are a passionate, mission-driven team building a category-defining product. We are backed by premier investors and are looking for founding team members who are excited to do the best work of their careers.
Our employees are integral to achieving our goals, so we are proud to offer competitive salaries with equity packages, robust medical/dental/vision insurance, flexible working hours, hybrid work options, and an inclusive environment that fosters creativity and innovation.
Our Commitment to Diversity
Qualified Health is an equal opportunity employer. We believe that a diverse and inclusive workplace is essential to our success, and we are committed to building a team that reflects the world we live in. We encourage applications from all qualified individuals, regardless of race, color, religion, gender, sexual orientation, gender identity or expression, age, national origin, marital status, disability, or veteran status.
Pay & Benefits: The pay range for this role is $170,000 to $220,000 per year. Where your offer falls within that range will depend on your skills, qualifications, experience, and location. This role is also eligible for equity and benefits.
Join our mission to revolutionize healthcare with AI. To apply, please submit your resume through the application form below.