What you'll do
- Operate and expand Dalang's GPU fleet (H100, H200, L40S, or A100, depending on what we procure) across HGX baseboards and NVLink topologies.
- Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
- Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
- Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
- Maintain the inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, SGLang, and Ollama-style endpoints where relevant.
- Support customers on GPU-related issues; capture playbooks for the common configurations.
What we need from you
- 4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
- Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
- Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
- Networking depth: InfiniBand or RoCE on Ethernet; can configure RDMA-aware workloads end-to-end.
- Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.
Nice to have
- Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
- Experience with Slurm, the NVIDIA device plugin for Kubernetes, NVIDIA Run:ai, or a similar GPU-orchestration system.
- LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, SGLang.
- Performance benchmarking (MLPerf, MosaicML LLM Foundry, custom training-throughput tests).
What success looks like in 90 days
- First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
- Documented bring-up playbook with pinned driver and firmware versions.
- 12-month GPU-fleet capacity plan shared with the DC Manager and company ownership.
How to apply
Send your CV with a short note (in English or Bahasa Indonesia) explaining the first two responsibilities you would take on and why. We read every application and reply within 7 days.
Apply → [email protected]