You will drive migrations end-to-end and
collaborate with stakeholders. You will own Problem Management and drive root
cause analysis (RCA) discussions, with accountability for cost optimization
efforts, infrastructure re-design where needed, and product reliability and
scalability. You will partner with other team members to identify cost
optimization and automation opportunities, help define and maintain service
level indicators, objectives, and agreements (SLIs, SLOs, and SLAs) along with
error budgets, and troubleshoot incidents while helping users with issues and
requests. You will define
standards, guidelines, and best practices, bring in new tools, conduct proofs
of concept (POCs) with development teams, and define product readiness for
SREs. You will create documentation covering configuration, operations, and
troubleshooting procedures, identify new patterns, and participate in
architectural discussions to improve product scalability and stability. You
will also provide DevOps/SRE support for planned and unplanned work, including
team on-call support, and provide technical guidance and mentorship to the
team.
Your work will help maintain and improve the
Developer, Observability, and Logging Platforms by strengthening observability,
logging, monitoring, and alerting capabilities, as well as secrets management
and CI/CD foundations. By leading root cause analysis for escalated support
issues and supporting cloud-native and containerized platforms, you will
contribute directly to reliable, well-monitored, and well-operated systems
used by product and support teams.
You will be located in San Jose,
Costa Rica, as part of our Product Tooling Services Group within Secure
Foundations.