Position: ML/AI Architect
Location: Gaithersburg, MD (Hybrid, 3 Days per Week)
This position requires a deep understanding of cloud-native ML/AI Ops methodologies and technologies, AWS infrastructure, state-of-the-art (SOTA) foundation models and AWS GenAI services, and the unique demands of regulated industries. It is a cornerstone of our success in delivering impactful solutions to the pharmaceutical industry.
Accountabilities:
Operational Excellence
- Lead by example in creating high-performance, mission-focused, and interdisciplinary teams and a culture founded on trust, mutual respect, a growth mindset, and an obsession for building extraordinary products with extraordinary people.
- Drive the creation of proactive capability and process enhancements that ensure enduring value creation and compounding analytic returns.
- Design and implement resilient cloud ML/AI operational capabilities to improve our system Abilities (Learnability, Flexibility, Extendibility, Interoperability, Scalability).
- Drive precision and systemic cost efficiency, optimized system performance, and risk mitigation with a data-driven strategy, comprehensive analytics, and predictive capabilities at the tree-and-forest level of our ML/AI systems, workloads and processes.
ML/AI Cloud Operations and Engineering
- Architect and implement scalable AWS ML/AI cloud infrastructure in a multi-tenant SaaS environment.
- Establish governance frameworks for ML/AI infrastructure management and ensure compliance with industry-standard processes.
- Establish principled, methodical validation pathways and a Well Architected Framework for Embryonic Research (WAFER), similar to and building on AWS's Well-Architected Framework (WAF), for all early-stage product and operational GenAI PoCs across the organization.
- Oversee ML/AI-related Kubernetes (k8s) cluster management and provide guidance on alternative ML/AI workflow orchestration options (such as Argo vs. Kubeflow) and on ML/AI data pipeline creation, management, and governance with tools like Airflow.
- Employ AWS CDK (TypeScript), Projen, and Argo CD to automate infrastructure deployment and management.
- Help set the strategy and manage the tactical balance between framework and platform experimentation and democratization on one side and standardization and centralized management and governance on the other.
- Conduct cost-benefit analyses and formal processes for selection and utilization of foundation models, evaluating their architectures, performance, and costs.
- Work with multiple teams to ensure that the platform meets organizational needs and scales effectively.
Personal Attributes:
- Customer-obsessed and passionate about building products that solve real-world problems.
- Highly organized and diligent, with the ability to manage multiple initiatives and deadlines.
- Collaborative and inclusive, fostering a positive team culture where creativity and innovation thrive.
Essential Skills/Experience:
- HS Diploma and 5 years of experience in Engineering/IT solutions, OR a BA/BS.
- Minimum of 5 years in cloud infrastructure design and management roles.
- Deep understanding of the Data Science Lifecycle (DSLC) and the ability to shepherd data science projects from inception to production within the platform architecture.
- Expert in TypeScript, AWS CDK, Projen, Argo CD, and other cloud infrastructure CI/CD tools.
- Extensive experience in managing Kubernetes clusters for ML workflows.
- Solid understanding of foundation models and their applications in ML/AI solutions.
- Strong background in AWS DevOps practices and cloud architecture.
- Deep knowledge of AWS services (Bedrock, SageMaker, EC2, S3, RDS, Lambda, etc.) and hands-on experience designing and implementing cloud systems, including microservices architecture, API design, and database management (SQL/NoSQL).
- Experience with monitoring and optimizing cloud infrastructure for scalability and cost-efficiency.
- Ability to collaborate effectively with engineering, design, product, science and security teams.
- Strong written and verbal communication skills for reporting and documentation.
- Demonstrated ability to manage large-scale, complex projects across an organization.
- Proven experience in conducting performance and cost analyses of AWS infrastructure and ML/AI models.