What you will do
- Operate and expand Dalang's GPU fleet — H100, H200, L40S, A100 (whichever we buy) — across HGX baseboards and NVLink topologies.
- Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
- Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
- Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
- Maintain the inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, SGLang, and Ollama-style endpoints where relevant.
- Support customers on GPU-related issues; capture playbooks for the common configurations.
What we need from you
- 4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
- Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
- Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
- Networking depth: InfiniBand or RoCE on Ethernet, with the ability to configure RDMA-aware workloads end to end.
- Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.
Nice to have
- Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
- Experience with Slurm, the NVIDIA device plugin for Kubernetes, NVIDIA Run:ai, or similar GPU orchestration.
- LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, SGLang.
- Performance benchmarking (MLPerf, MosaicLLM, custom training-throughput tests).
What success looks like in 90 days
- First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
- Documented bring-up playbook with pinned driver and firmware versions.
- 12-month GPU-fleet capacity plan shared with the DC Manager and company ownership.
How to apply
Send your CV plus a short note (English or Bahasa Indonesia) telling us which two responsibilities you would tackle first and why. We read every application and reply within 7 days.
Apply → [email protected]