Engineering · Full-time · Indonesia-remote (occasional Pandeglang and Jakarta DC visits)

AI / GPU Infrastructure Engineer

Build the GPU side of Dalang's infrastructure — GPU baremetal, GPU-enabled VMs, the NVIDIA software stack, and the East–West network fabric that makes distributed training fast. Pair tightly with the SRE on shared infra and with the AI Solutions Architect on customer-facing GPU offerings.

What you will do

  • Operate and expand Dalang's GPU fleet — H100, H200, L40S, A100 (whichever we buy) — across HGX baseboards and NVLink topologies.
  • Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
  • Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
  • Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
  • Maintain the inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, SGLang, and Ollama-style endpoints where relevant.
  • Support customers on GPU-related issues; capture playbooks for the common configurations.

What we need from you

  • 4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
  • Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
  • Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
  • Networking depth: InfiniBand or RoCE on Ethernet, with the ability to configure RDMA-aware workloads end-to-end.
  • Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.

Nice to have

  • Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
  • Slurm, Kubernetes Device Plugin, NVIDIA Run:ai, or similar GPU-orchestration experience.
  • LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, SGLang.
  • Performance benchmarking (MLPerf, MosaicLLM, custom training-throughput tests).

What success looks like in 90 days

  • First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
  • Documented bring-up playbook with pinned driver and firmware versions.
  • 12-month GPU-fleet capacity plan shared with the DC Manager and the company's owners.

How to apply

Send your CV plus a short note (English or Bahasa Indonesia) telling us which two responsibilities you would tackle first and why. We read every application and reply within 7 days.

Apply → [email protected]