
Location: Shenzhen/Guangzhou
Design, build, and operate scalable, reliable model hosting platforms for LLMs, embeddings, and STT/TTS across heterogeneous hardware.
Drive inference optimisation for latency, throughput, and cost (quantisation, KV-cache optimisation, dynamic/continuous batching).
Evaluate, integrate, and tailor inference frameworks (e.g., vLLM, TensorRT-LLM, SGLang) to maximise performance on target hardware.
Own inference health and performance monitoring: latency, throughput, TTFT, memory, availability; troubleshoot bottlenecks and deployment issues.
Partner with hardware teams to apply hardware-specific optimisations and improve resource utilisation.
Ensure hosting systems meet production standards for reliability, scalability, security, and high availability.
Build end-to-end, scalable fine-tuning pipelines to adapt foundation models using domain datasets.
Work with data scientists and domain experts to define objectives and metrics, validate results, and integrate fine-tuned models into the hosting/inference stack.
Bachelor’s degree or above in CS/Software Engineering (or equivalent experience) with 8+ years’ engineering experience, including 5+ years building and running distributed, cloud-native backend systems in production.
Strong hands-on Python (asyncio/FastAPI) and/or Go, with deep knowledge of non-blocking I/O, async programming, and high-concurrency service design.
Proven experience in runtime control and performance engineering: queueing/backpressure, rate limiting/overload protection, retries/timeouts/cancellation, and latency and throughput optimisation under load.
Experience with event-driven and streaming systems plus API platforms: WebSocket/SSE/message streaming, REST/gRPC, and API gateway/platform components; strong observability (metrics/logs/traces, OpenTelemetry) and production troubleshooting (profiling, RCA).
Cloud and platform operations: Kubernetes/Docker on AWS/GCP/Azure; operational controls such as RBAC/authN/authZ, configuration management, and audit logging; AI-native mindset and effective use of coding assistants.
