Senior DevOps Engineer for Video Platform Scaling

Upwork

Remote • 14 hours ago • No applications

About

We are seeking a senior DevOps engineer with extensive experience in scaling video-based platforms. The ideal candidate will have a deep understanding of cloud infrastructure, CI/CD pipelines, and container orchestration technologies. Your role will be critical in ensuring our platform can handle increased traffic and deliver high-quality streaming experiences. If you have a passion for optimizing system performance and possess strong problem-solving skills, we want to hear from you!

What We Need

Design and implement a production-ready baseline that keeps costs tiny at launch (sub-1k CCU), yet can scale with switches, not rewrites.

Scope (design + hands-on implementation)

Reference Architecture & Migration Plan
- Cloudflare: WAF, Load Balancer, Tiered Cache, Cache Reserve, geo-aware routing; (optionally) Workers/Durable Objects for chat/room affinity.
- Media: object storage (R2/S3), HLS/DASH pipeline (FFmpeg), signed URLs, versioned paths, near-zero origin egress.
- Geo-aware cache strategy: per-region cache keys/TTLs, origin shielding, regional failover, “local-first” rules.
- App: stateless services, managed Postgres (+ PgBouncer, partitioning path), Redis (rate limits, counters, feed fan-out), queues (SQS/Kafka/NATS), WebSocket gateways (sharding plan).
- Search: efficient player directory filters (Postgres indices/JSONB).
- Observability: OpenTelemetry + Prometheus/Grafana, logs, dashboards, SLOs & error budgets.
- Security/abuse/cost: WAF, Turnstile, quotas, auth/signing, and a cost model with dials (bitrates, ladder size, TTLs).

Infra-as-Code + CI/CD
- Terraform modules + GitHub Actions; blue/green or canary.

Implementation (MVP, low-cost)
- Stand up caching, media pipeline (start with 2-rung ladder: 360p/720p), signed URLs, long TTLs.
- App + chat on minimal compute (1–2 small nodes), PgBouncer, Redis token buckets.
- Load tests: k6 for HLS (TTFB, cache-hit%), WebSocket CCU/churn.
- Dashboards for CDN cache-hit, origin egress, media TTFB, DB CPU/conn/lag, WebSocket CCU & msgs/s, queue depth/age, plus a notification system that tells the admin when to turn on the scaling switches already in place.
- Runbooks: deploy/rollback, on-call, incident playbooks.

Fixed price with milestone payments.
- Low cost now: when you finish, the infra must comfortably handle a few hundred CCU with minimal ongoing spend (no 1M-scale bills).
- Scalable later: clear, documented switches (turn on replicas, add workers, enable Cache Reserve, shard chat) to jump from 1k → 100k+ CCU.
- Original thinking: your proposal must include specific, creative suggestions tailored to our context. We will prioritize unique, practical ideas over generic boilerplate.

Deliverables
- Architecture doc + diagram (incl. geo-caching topology) & Migration Plan from Contabo.
- Terraform repo (LB, WAF, cache rules, buckets, queues, DB/Redis) + CI/CD.
- MVP environment running (staging/prod) with media pipeline and chat.
- SLOs + load-test plan/scripts, with results (p95 targets, cache-hit over 94% after warmup).
- Cost forecast with clear dials (bitrates, ladder, TTLs, replicas, workers).
- Runbooks (deploy/rollback/incident) + handoff session/recording.

Milestones (suggested; propose your own if better)
- M1 - Design & Plan (fixed): Architecture + migration + cost model + SLOs.
- M2 - IaC + CI/CD (fixed): Terraform modules + Actions + basic staging.
- M3 - Media & Caching (fixed): HLS (360p/720p), signed URLs, cache rules, over 94% cache-hit in k6 from 2 regions, low origin egress.
- M4 - App/Chat + Observability (fixed): PgBouncer, Redis rate limits, WebSocket gateway, dashboards, alerts.
- M5 - Hardening (fixed): load-test tuning, runbooks, final cost report, go-live checklist.

Success Criteria (examples)
- Media TTFB p95 below 250 ms via CDN; origin egress negligible under cache.
- Feed API p95 below 150 ms at MVP load; chat stable at a few thousand CCU on small nodes.
- Cost at rest: object storage + minimal compute only; no idle enterprise bills.
- All components documented; flip-switches ready for scale on the master admin dashboard.

What to Include in Your Proposal (must have all 3)
1. Two short case studies with numbers (CCU/QPS, cache-hit%, Mbps/Tbps, p95/p99, cost impact).
2. A Terraform snippet (redacted OK) that provisions: bucket + Cloudflare LB/WAF + custom cache key rule + shielding.
3. Your plan to keep costs tiny now but scale later: the exact switches you’d leave in place (e.g., “enable Cache Reserve”, “add read replica”, “increase workers by queue depth”, “shard WebSockets by userId mod N”).

Avoid: AWS, as its cost is too unpredictable, and GCP/Azure.

Keep in mind: we can also paste this into ChatGPT and check its recommendation, so be aware that we're looking for someone who has ACTUALLY done this, not someone who COULD do it. Experience is key. We value original suggestions. If your proposal reads like generic boilerplate, we won’t proceed.
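
To make two of the example switches concrete ("shard WebSockets by userId mod N" and "increase workers by queue depth"), here is a minimal Python sketch. The function names, the `jobs_per_worker` parameter, and the worker limits are hypothetical and chosen only for illustration; they are not part of the posting's requirements.

```python
import math


def shard_for_user(user_id: int, shard_count: int) -> int:
    """Pick a WebSocket gateway shard with a plain 'userId mod N' rule."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return user_id % shard_count


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Scale workers with queue depth, clamped to a cost-safe range.

    jobs_per_worker, min_workers and max_workers are assumed dials,
    not values taken from the posting.
    """
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    print(shard_for_user(1042, 4))    # 1042 % 4 -> shard 2
    print(desired_workers(0, 50))     # empty queue -> floor of 1 worker
    print(desired_workers(230, 50))   # ceil(230 / 50) -> 5 workers
```

One design caveat with plain modulo sharding: changing N remaps most users to a different shard, so a proposal would probably need to say how reconnects are handled when the shard count changes.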
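
The cost dials mentioned above (bitrates, cache-hit ratio) can be related to origin egress with simple arithmetic. The sketch below assumes every viewer pulls one stream at a single average bitrate and that cache misses are the only source of origin traffic. The 2.0 Mbps bitrate is a hypothetical input; the 94% cache-hit figure is the target quoted in the deliverables.

```python
def origin_egress_mbps(ccu: int, avg_bitrate_mbps: float,
                       cache_hit_ratio: float) -> float:
    """Estimate steady-state origin egress in Mbps.

    Simplifying assumption: egress = CCU * bitrate * (1 - cache-hit ratio).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return ccu * avg_bitrate_mbps * (1.0 - cache_hit_ratio)


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 CCU at an assumed 2.0 Mbps average bitrate.
    for hit in (0.94, 0.99):
        mbps = origin_egress_mbps(1000, 2.0, hit)
        print(f"cache-hit {hit:.0%}: about {mbps:.0f} Mbps from origin")
```

Under these assumptions, moving from 94% to 99% cache-hit cuts origin egress by a factor of six, which is why cache-hit ratio is usually the most sensitive dial in this kind of model.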