What you will do

Operate and expand Dalang's GPU fleet — H100, H200, L40S, A100 (whichever we buy) — across HGX baseboards and NVLink topologies.
Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
Maintain the inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, SGLang, and Ollama-style endpoints where relevant.
Support customers on GPU-related issues; capture playbooks for the common configurations.

What we need from you

4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
Hands-on experience with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
Networking depth: InfiniBand or RoCE on Ethernet; can configure RDMA-aware workloads end-to-end.
Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.

Nice to have

Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
Experience with Slurm, the NVIDIA device plugin for Kubernetes, NVIDIA Run:ai, or similar GPU-orchestration tooling.
Experience running LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, SGLang.
Performance benchmarking (MLPerf, MosaicML LLM Foundry, custom training-throughput tests).

What success looks like in 90 days

First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
Documented bring-up playbook with pinned driver and firmware versions.
12-month GPU-fleet capacity plan shared with the DC Manager and company ownership.

How to apply

Send your CV plus a short note (English or Bahasa Indonesia) telling us which two responsibilities you would tackle first and why. We read every application and reply within 7 days.

Apply → [email protected]