Engineering · Full-time · Indonesia-remote (occasional Pandeglang and Jakarta DC visits)

AI / GPU Infrastructure Engineer

Build the GPU side of Dalang's infrastructure — GPU baremetal, GPU-enabled VMs, the NVIDIA software stack, and the East–West network fabric that makes distributed training fast. Pair tightly with the SRE on shared infra and with the AI Solutions Architect on customer-facing GPU offerings.

The job description is in English. You may apply in English or Bahasa Indonesia.

What you'll do

  • Operate and expand Dalang's GPU fleet — H100, H200, L40S, A100 (whichever we buy) — across HGX baseboards and NVLink topologies.
  • Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
  • Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
  • Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
  • Maintain inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, sglang, ollama-style endpoints when relevant.
  • Support customers on GPU-related issues; capture playbooks for the common configurations.

What we need from you

  • 4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
  • Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
  • Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
  • Networking depth: InfiniBand or RoCE on Ethernet; can configure RDMA-aware workloads end-to-end.
  • Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.

Nice to have

  • Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
  • Experience with Slurm, the NVIDIA device plugin for Kubernetes, NVIDIA Run:ai, or similar GPU orchestration.
  • LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, sglang.
  • Performance benchmarking (MLPerf, MosaicML LLM Foundry benchmarks, custom training-throughput tests).

What success looks like in 90 days

  • First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
  • Documented bring-up playbook with pinned driver and firmware versions.
  • 12-month GPU-fleet capacity plan shared with the DC Manager and ownership.

How to apply

Send your CV with a short note (in English or Bahasa Indonesia) explaining which two responsibilities you would take on first and why. We read every application and reply within 7 days.

Apply → [email protected]