Senior Kubernetes Developer – OPS00016

Dev.Pro

Brazil

Posted 2 hours ago

About

At Dev.Pro, we partner with businesses worldwide, from startups to Fortune 500 companies, across fintech, retail, hospitality, and beyond.

With a remote-first mindset and a team in 55+ countries, we focus on aligning technical expertise with client needs, communicating clearly, and staying adaptable as priorities shift. This commitment to ownership and flexibility helps us create lasting partnerships, so you can focus on what you do best.

About this opportunity

We invite a skilled Kubernetes Developer to join our fully remote, international team. In this role, you'll build and optimize the Kubernetes orchestration platform and develop custom operators to run HPC/AI workloads efficiently on GPU clusters. You'll enhance infrastructure performance and reliability, create internal tools to improve the developer experience, and ensure multi-tenant HPC workloads remain secure and compliant.

What's in it for you

  • Work on cutting-edge GPU infrastructure and next-gen HPC/AI workloads
  • Build a Slurm-on-Kubernetes product from scratch and shape its architecture
  • Collaborate with a top-tier international team and grow through continuous learning and conference participation

Is that you?

  • 3+ years of hands-on Kubernetes experience in production
  • Experience with HPC schedulers (Slurm, PBS, LSF, Volcano)
  • Strong background in GPU resource management and distributed systems
  • Experience with cloud/hybrid cloud architectures (AWS, GCP, Azure, on-prem GPU clusters)
  • Knowledge of Kubernetes operators, CRDs, scheduling, networking, and storage
  • Deep knowledge of HPC job scheduling and workload orchestration
  • Expertise in IaC (Terraform, Helm, or GitOps: Argo CD/Flux) and monitoring & observability (Prometheus, Grafana, Jaeger, ELK)
  • Programming skills in Go, Python, and Bash/shell
  • Familiarity with PyTorch, TensorFlow, distributed training, and model serving
  • Skills in Linux administration, performance tuning, and advanced networking (RDMA, InfiniBand, TCP/IP, DNS, load balancing)
  • Experience in storage management and optimization for large datasets

Key responsibilities and your contribution

In this role, you'll design, develop, and manage Kubernetes platforms for GPU-intensive AI/HPC workloads.

  • Design and build a Slurm-like orchestration layer on Kubernetes for HPC/AI workloads
  • Develop custom operators and controllers for GPU job scheduling and execution
  • Integrate batch schedulers with Kubernetes to provide a hybrid HPC/Cloud product
  • Implement advanced GPU resource management
  • Build internal tools and a self-service platform to simplify AI/HPC job deployment and management
  • Build a cloud-native platform for AI training, inference, and HPC workloads
  • Optimize scheduling to improve GPU utilization and reduce queue times
  • Monitor GPU clusters, troubleshoot production issues, and ensure high availability, fault tolerance, and disaster recovery
  • Develop CI/CD pipelines for GPU-intensive workloads
  • Implement best practices for multi-tenant GPU clusters with AI/HPC workloads
  • Ensure compliance with data sovereignty and international regulations
  • Maintain secure container, runtime, and workload isolation policies