Responsibilities
- Platform Development: Design, build, and maintain scalable machine learning platforms to support model development, experimentation, and production workflows.
- Infrastructure Automation: Automate the deployment and scaling of ML infrastructure, including data pipelines, model training, validation, and deployment.
- Model Lifecycle Management: Manage the end-to-end lifecycle of machine learning models, including versioning, deployment, monitoring, and retraining.
- LLM Operations (LLM Ops): Implement systems and practices for managing large language models (LLMs), ensuring efficient fine-tuning, deployment, and monitoring of these models in production.
- Collaboration with Data Scientists and Engineers: Provide infrastructure and tools that enable seamless collaboration between data science and engineering teams on the development and deployment of machine learning models.
- Performance Optimization: Optimize model inference and training performance across a range of hardware architectures, including GPUs and cloud-based environments.
- Security and Compliance: Ensure the security of the ML platform and compliance with relevant regulations and standards, especially in environments dealing with sensitive data.
- Tooling and Frameworks: Evaluate and integrate MLOps tools, frameworks, and libraries to continuously improve platform capabilities and efficiency.
- Monitoring and Alerting: Implement robust monitoring and alerting systems for production models, ensuring reliability and timely detection of performance drift or anomalies.
- User-Centric Development: Prioritize user needs and experience in platform design and implementation.
- Adaptive Problem-Solving: Quickly adapt to changing requirements and technological landscapes in ML and AI.
- Product Focus: Maintain a strong product-oriented mindset, aligning technical solutions with business goals and user needs.
Skills and Experience Required
Experience:
- 3+ years of experience in software engineering or infrastructure roles, with a focus on machine learning platforms or MLOps.
- Proven experience in building, deploying, and maintaining ML platforms or systems at scale.
- Strong experience with cloud platforms such as AWS, GCP, or Azure, particularly for machine learning and data processing tasks.
- Experience with containerization technologies (Docker) and orchestration tools (Kubernetes) for ML workloads.
- Proficiency in Python or a similar programming language, and familiarity with ML libraries and frameworks (e.g., TensorFlow, PyTorch).
- Familiarity with CI/CD pipelines tailored for machine learning (e.g., model validation, deployment automation).
Technical Expertise:
As an ethical employer, Tag will never ask job applicants to provide private, sensitive information upfront or make offers of employment contingent on financial requests or responsibilities from any candidate.