Machine Learning Engineer (Agent & Inference) - Mandarin Chinese Speaker

Bitus Labs • Full-time • Irvine, CA, US • $130k / year

About the Role

We're an online gaming company that uses AI to power and personalize player experiences. This role sits within the AI Engineering team, which is responsible for taking AI capabilities into production. The primary focus is agent systems; model deployment and inference engineering are secondary responsibilities.

We do not train foundation models from scratch. Our focus is on production AI systems, model adaptation, inference optimization, and agentic applications.

Responsibilities

Agent Systems — Primary

Design, build, and optimize LLM-powered agents, including planning, tool use, workflow orchestration, and multi-step reasoning
Architect memory systems, including short-term memory, long-term memory, context management, and session state
Build and optimize RAG pipelines for relevance, grounding, freshness, and retrieval quality
Design and operate vector-store infrastructure (e.g., pgvector, Milvus, Qdrant, Weaviate)
Define evaluation methodologies for agents, prompts, and workflows
Optimize end-to-end agent quality, latency, reliability, and operating cost

Model Deployment & Inference — Secondary

Build and operate production inference services that are low-latency, high-concurrency, and highly reliable
Serve online-learning models (e.g., contextual bandits and reinforcement learning policies) with real-time inference and online parameter or weight updates
Deploy and optimize AI inference systems for latency, throughput, reliability, and resource efficiency
Analyze and resolve inference-serving bottlenecks
Support deployment and serving of recommendation, ranking, and reinforcement learning models developed by research scientists
Apply lightweight model adaptation techniques (e.g., LoRA, QLoRA, PEFT) when appropriate for domain-specific requirements

MLOps — Supporting Both

Build and maintain deployment pipelines, observability systems, and tracing infrastructure for agents and serving endpoints
Monitor quality regression, performance degradation, and model drift
Maintain version control for models, prompts, datasets, and agent configurations
Contribute to automated validation, testing, and CI/CD workflows for AI systems

Collaboration

Partner with research scientists, backend engineers, and data scientists to integrate AI systems into production products
Document systems, best practices, and internal tooling
Contribute to engineering standards and operational excellence across AI initiatives

Required Qualifications

Bachelor's or Master's degree in Computer Science, Machine Learning, or a related field
3+ years of industry experience in Machine Learning Engineering or related roles
Strong software and systems engineering experience, including building low-latency, reliable production services in languages such as Go, Rust, C++, or equivalent
Experience building or supporting real-time inference systems for recommendation, ranking, contextual bandits, reinforcement learning, or similar adaptive machine learning applications
Strong experience with PyTorch and the Hugging Face ecosystem
Experience building production LLM or agent applications (e.g., LangGraph, LlamaIndex, or equivalent frameworks)
Hands-on experience with RAG systems, embeddings, and vector databases
Experience evaluating and monitoring LLM or agent systems in production
Experience deploying and optimizing production machine learning or LLM systems
Understanding of inference runtime behavior, resource utilization, latency optimization, and production serving performance
Experience with Docker and Kubernetes
Experience with cloud platforms such as AWS, GCP, or Azure
Fluent Mandarin Chinese

Preferred / Nice to Have

Experience fine-tuning open-weight LLMs using LoRA, QLoRA, PEFT, or related approaches
Familiarity with the underlying algorithms used in recommender systems, ranking systems, contextual bandits, or reinforcement learning
Experience with custom GPU kernel development using CUDA or OpenAI Triton
Experience with graph-level optimization and low-level inference performance tuning
Experience with large-scale distributed training (e.g., FSDP, DeepSpeed, multi-GPU workloads)
Experience deploying models to edge environments using TFLite, CoreML, or NPU accelerators
Strong understanding of CI/CD principles and deployment workflows
Background in gaming, gaming AI, or player personalization systems
Experience with distributed systems, Spark, Hadoop, or large-scale data infrastructure

Pay: From $130,000.00 per year

Benefits: