Responsibilities

Operate and expand Dalang's GPU fleet (H100, H200, L40S, or A100, depending on what we procure) across HGX baseboards and NVLink topologies.
Tune the NVIDIA software stack: drivers, CUDA toolkit, NCCL, NVIDIA Container Toolkit, AI Enterprise where licensed.
Build and operate the East–West fabric for distributed training — InfiniBand HDR/NDR or RoCE on Ethernet, with the right RDMA configuration for our workloads.
Integrate GPU resources into Incus and the customer dashboard: passthrough, MIG slicing, time-sliced sharing, vGPU where it makes sense.
Maintain inference-stack tooling we expose to customers: Triton, vLLM, TensorRT-LLM, sglang, ollama-style endpoints when relevant.
Support customers on GPU-related issues; capture playbooks for the common configurations.

Requirements

4+ years of infrastructure engineering, with at least 1 year operating GPU or HPC clusters in production.
Hands-on with the NVIDIA stack: CUDA, drivers, NCCL, Container Toolkit, dcgm-exporter.
Linux fluency at the kernel and driver level — lspci, nvidia-smi, dcgm, nvflash, IOMMU debugging.
Networking depth: InfiniBand or RoCE on Ethernet; can configure RDMA-aware workloads end-to-end.
Comfortable troubleshooting GPU thermals, power-draw spikes, and hardware failures.

Commissioned a multi-node training cluster (≥ 8 GPUs) end-to-end.
Experience with Slurm, Kubernetes Device Plugin, NVIDIA Run:ai, or similar GPU orchestration.
LLM serving stacks at scale: vLLM, Triton + TensorRT-LLM, sglang.
Performance benchmarking (MLPerf, MosaicLLM, custom training-throughput tests).

First production GPU node provisioned and live — for either a paying customer or an internal benchmark workload.
Documented bring-up playbook with pinned driver and firmware versions.
12-month GPU-fleet capacity plan shared with the DC Manager and ownership.

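Please send your résumé along with a short note (in English or Indonesian) telling us which two responsibilities you would prioritize and why. We read every application and reply within 7 days.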