Cloud Reliability Engineer

Infios

Workday

Remote Brazil

Posted 2 hours ago

About

If you are looking for a meaningful career where people work and act with passion, rethink the existing and always strive to find the best solution - you have come to the right place. We develop future technologies to relentlessly make supply chains better. We are a leader in supply chain software solutions, helping organizations streamline operations, reduce costs, and improve efficiency.

Key Responsibilities

▶ Cloud Infrastructure Operations
o Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.
o Manage and optimize Kubernetes clusters: deployment, scaling, patching, and upgrades.
o Ensure system availability, scalability, and performance through proactive monitoring and optimization.
o Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.
▶ Automation & Continuous Improvement
o Identify opportunities for operational automation to eliminate manual processes (“reduce toil”).
o Build and maintain automated pipelines for deployments, configuration, and remediation.
o Develop self-healing mechanisms to automatically detect and resolve common service issues.
o Participate in continuous improvement initiatives around reliability, performance, and efficiency.
▶ Reliability Engineering
o Implement SRE principles: define and track SLIs, SLOs, and error budgets.
o Perform incident analysis and postmortems to identify root causes and prevent recurrence.
o Design proactive monitoring, alerting, and observability dashboards (Dynatrace, Datadog).
o Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.
▶ CI/CD and Release Operations
o Manage and optimize CI/CD pipelines to ensure reliable and consistent delivery.
o Support deployment strategies (blue/green, canary, rolling) to reduce downtime risk.
o Collaborate with Product and DevOps teams on release readiness and rollback automation.
▶ Incident Response & Troubleshooting
o Monitor, troubleshoot, and resolve infrastructure and application issues.
o Respond to production incidents and ensure rapid mitigation and resolution.
o Troubleshoot complex cloud, container, and networking issues across distributed systems.
o Drive a culture of proactive monitoring, data-driven analysis, and preventive action.

Required Qualifications

▶ Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
▶ 5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.
▶ Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).
▶ Strong knowledge of Kubernetes deployment, management, and troubleshooting.
▶ Solid understanding of observability and monitoring (e.g., Dynatrace, Datadog) and incident management platforms.
▶ Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).
▶ Strong troubleshooting and analytical skills across infrastructure and applications.
▶ Experience with incident response, RCA, and postmortem processes.
▶ A mindset of continuous improvement, reliability, and self-healing automation.
▶ Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.

Preferred Skills

▶ Experience in conducting resilience assessments and recovery drills.
▶ Familiarity with ServiceNow and Dynatrace or other observability and ITSM tools.
▶ Experience with chaos engineering or resiliency testing frameworks.
▶ Background in networking, load balancing, and performance tuning.
▶ Strong communication and stakeholder management skills.

Soft Skills & Mindset

▶ Strong collaboration skills; comfortable working with developers, ops, and management.
▶ Clear communicator; able to translate technical issues into business impact.
▶ Self-starter with a problem-solving and automation-first mentality.
▶ Resilient under pressure; thrives in a dynamic, fast-paced environment.
▶ Passionate about operational excellence and continuous learning.

Key Success Metrics

▶ SLA/SLO compliance for critical services
▶ Reduction in MTTR (Mean Time to Recover)
▶ Increase in automated incident resolution rates
▶ Reduction in customer-impacting incidents
▶ Frequency and outcomes of resilience testing exercises
▶ Service uptime / availability

Why join us?

At Infios, we're not just looking for employees; we're looking for partners in innovation, growth, and purpose. Meeting you where you are to create the future you need is at the core of who we are and what we do. Whether you're at the beginning of your career or a seasoned expert, we meet you on your journey, equipping you with the tools and opportunities to build the future you envision. Together, we will relentlessly work toward one common goal - making supply chains better. We believe the future is better when supply chains work better.

We are an equal-opportunity employer and committed to inclusion in the workplace. At Infios, we believe that inclusion is a fundamental cornerstone of our success. We are committed to creating a safe and welcoming environment where every individual’s unique experiences and perspectives are valued, whether they look, think, move, believe, or love differently. All qualified applicants will receive consideration for employment without regard to race, color, ethnicity, national origin, sex, sexual orientation, gender identity, marital status, pregnancy, religion, age, disability, veteran status, genetic information, or any other characteristic protected by law.

Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions of this role. If you require assistance or accommodation due to a disability during the recruiting process, please let us know at jobs@infios.com.

Disclaimer: This job advertisement is not designed to cover a comprehensive listing of all duties or responsibilities that are required for this job.
Please note that any salary information is a general guideline only. Individual compensation will be determined by various factors such as the scope and responsibilities of the position, experience, education, skills, location, and market and business considerations. Applications must be submitted via our career site.

Körber Supply Chain Software is now Infios. At Infios we believe that the future is better when supply chains work better. That’s why we’ve been pushing boundaries - driving purposeful innovation, thinking ahead and creating adaptable solutions to relentlessly make supply chains better. For everyone. Wherever you are on your journey, we’ll meet you where you are to create the future you need.