Vice President of AI Infrastructure & Engineering
Job Description
Reporting to: CEO
Position Overview
*Build a Large-Scale Training Platform:*
Design and maintain a high-performance training architecture supporting clusters of tens of thousands of GPUs.
*Key Focus:* Establish a unified task scheduling and model management system (MLOps). All model training, checkpoint storage, and code version control must be conducted exclusively through this platform.
*Engineering-Led Data Governance:*
Develop end-to-end data processing pipelines covering cleansing, annotation, versioning, and secure storage.
*Key Focus:* Implement rigorous data access audit logs and data lineage tracking to ensure the security and compliance of core data assets.
*Developer Experience as Control Mechanism:*
Maximize experimental efficiency through standardized toolchains, enabling researchers to focus solely on algorithm development without managing underlying configurations.
*Key Focus:* Codify "best practices" into reusable code templates.
*System Reliability and Disaster Recovery:*
Implement fault tolerance, automated model snapshot archiving, and continuity protocols to prevent loss of critical assets due to single points of failure (hardware or human).
Qualifications
*Background:* 10+ years in distributed systems, cloud computing, or high-performance computing (HPC), with prior experience in core infrastructure teams at leading firms such as Google, Meta, AWS, or NVIDIA.
*Mindset:* Exceptional engineering rigor with a focus on building stable, scalable systems rather than solely pursuing algorithmic innovation. Service-oriented attitude with a commitment to empowering top-tier scientists.
*Technical Skills:* Proficiency in orchestration systems such as Kubernetes, Ray, or Slurm; familiarity with PyTorch distributed training frameworks; deep understanding of data security and access control mechanisms.