Engineering · Full-time · Indonesia-remote (occasional Pandeglang and Jakarta DC visits)

AI / GPU Infrastructure Engineer

Build the GPU side of Dalang's infrastructure — GPU baremetal, GPU-enabled VMs, the NVIDIA software stack, and the East–West network fabric that makes distributed training fast. Pair tightly with the SRE on shared infra and with the AI Solutions Architect on customer-facing GPU offerings.

This job description is presented in English. You may apply in English or Indonesian.

Responsibilities

  • Operate and expand Dalang's GPU fleet — H100, H200, L40S, A100 (whichever we buy) — across HGX baseboards and NVLink topologies.
  • Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
  • Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
  • Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
  • Maintain inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, sglang, ollama-style endpoints when relevant.
  • Support customers on GPU-related issues; capture playbooks for the common configurations.

Requirements

  • 4+ years of infrastructure engineering, including at least 1 year operating GPU or HPC clusters in production.
  • Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
  • Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
  • Networking depth: InfiniBand or RoCE on Ethernet; can configure RDMA-aware workloads end-to-end.
  • Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.

Nice to have

  • Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
  • Slurm, the Kubernetes device plugin, NVIDIA Run:ai, or similar GPU-orchestration experience.
  • LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, sglang.
  • Performance benchmarking (MLPerf, MosaicLLM, custom training-throughput tests).

What success looks like in 90 days

  • First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
  • Documented bring-up playbook with pinned driver and firmware versions.
  • 12-month GPU-fleet capacity plan shared with the DC Manager and ownership.

How to apply

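Send us your CV and a short note (in English or Indonesian) telling us which two responsibilities you would tackle first and why. We read every application and reply within 7 days.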

Apply now → [email protected]