Keeping with the tradition of naming software products after Greek words, we named this project Cynthus after Mount Cynthus. It so happens that Mount Cynthus is also where the name Cynthia got its origin. <3
Big thanks to Intel for supporting the project and guiding us throughout the process, and to Professor Ata Turk and the rest of the team for giving us this opportunity.
Existing products that seek to streamline AI application deployments, such as SageMaker, Vertex AI, and Azure ML, end up being clunky to use, particularly for a Data Scientist or Machine Learning Engineer with little cloud or computing background.
Data Scientists and Machine Learning Engineers don't care about, and don't need to know, the intricacies of their model deployment; all they want is to see it work. Existing solutions still require the user to know a lot about cloud infrastructure: the engineer has to imperatively tell the system to pull data from this bucket, compute on this instance, load balance like this, install these dependencies, and so on. ML Engineers and Data Scientists often don't want to deal with any of this when they think about deployment. They want to say, "here's my data, here's my code, now I want it to work." And that's what we set out to do.
*We were initially told not to use Kubernetes, as IDC did not yet have a managed Kubernetes solution and running our own cluster did not fit the budget.
My Role and Contributions
For this project, I was in charge of many of the cloud architectural design choices and their implementation, such as using serverless compute (GCP Cloud Run), using an orchestrator for worker heartbeats, user authentication and creation, API creation, IAM, and security.
All in all I was responsible for:
Containerization, Caching and Compute Environment Setup:
Dependency Installation
GPU compute compatibility (CUDA)
Containerization of User Source Code (Dockerfiles)
Caching of common Docker images with Google Artifact Registry
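As a rough illustration of how user code gets containerized, here is a minimal sketch of Dockerfile generation. The base image tags, file names, and entrypoint are illustrative assumptions, not the project's exact templates:

```python
# Hypothetical sketch of Dockerfile generation for user source code.
# Base image tags and the entrypoint are illustrative assumptions.

DOCKERFILE_TEMPLATE = """\
FROM {base_image}
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "{entrypoint}"]
"""

def render_dockerfile(needs_gpu: bool, entrypoint: str = "main.py") -> str:
    # A CUDA-enabled base (here a PyTorch runtime image, which ships with
    # Python and pip) for GPU workloads; a slim Python image otherwise.
    base = (
        "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
        if needs_gpu
        else "python:3.11-slim"
    )
    return DOCKERFILE_TEMPLATE.format(base_image=base, entrypoint=entrypoint)

print(render_dockerfile(needs_gpu=True))
```

Caching the common base layers in Artifact Registry means repeat builds only rebuild the thin user-code layers on top.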
Infrastructure Management
Management of Terraform state files
Provisioning of new resources (see the Terraform sketch after this list)
Scaling resources on demand
Kubernetes with Kubeflow for distributed ML workloads (*very limited access to GPUs)
Orchestrator heartbeats and re-run logic for worker VMs
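The provisioning path boiled down to driving Terraform programmatically. A minimal sketch of that pattern, with a hypothetical module directory and variable name:

```python
# Sketch: drive Terraform from Python via subprocess.
# The working directory and variable names are illustrative assumptions.
import subprocess

def terraform_apply(workdir: str, machine_type: str) -> None:
    # Initialize providers and the state backend (non-interactive).
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    # Apply with variables; -auto-approve skips the confirmation prompt.
    subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false",
         f"-var=machine_type={machine_type}"],
        cwd=workdir,
        check=True,
    )

terraform_apply("./infra/worker-vm", machine_type="n1-standard-4")
```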
Serverless Functions
Function for creating input and output data buckets (bucket-operations)
Function triggered on bucket creation (bucket-listener-vm)
Function to create a VM (create-vm)
Functions for incremental updates, i.e. changes to user code or user data (code-update, data-update)
Function to run workloads (run-container)
Function to destroy resources (destroy-resources)
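To give a feel for the serverless side, here is a minimal sketch in the style of bucket-listener-vm, using the Python Functions Framework; the handoff to VM creation is a hypothetical placeholder, not the project's actual code:

```python
# Sketch of a storage-triggered function (bucket-listener-vm style).
# The provisioning handoff is a hypothetical placeholder.
import functions_framework

@functions_framework.cloud_event
def bucket_listener_vm(cloud_event):
    data = cloud_event.data  # GCS event payload
    bucket, name = data["bucket"], data["name"]
    print(f"New object gs://{bucket}/{name}; triggering VM provisioning")
    # create_vm(bucket)  # hypothetical: hand off to the create-vm function
```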
IAM management, Authentication and Security
Creation of service accounts and granting them least-privilege permissions
Networking rules for client-server communication, plus a VPC connecting the serverless functions and compute instances so heartbeat messages can flow between them
Secret Management with GCP Secret Manager
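For secret management, the access pattern looks roughly like this (the project and secret IDs are placeholders):

```python
# Minimal sketch: read a secret from GCP Secret Manager.
from google.cloud import secretmanager

def access_secret(project_id: str, secret_id: str) -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```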
1. Vision and Goals Of The Project
Cynthus aims to simplify the deployment of AI applications on cloud platforms. While initially designed for Intel Developer Cloud (IDC), the project currently operates on Google Cloud Platform (GCP) due to accessibility considerations. The platform addresses the challenges developers face when deploying AI workloads by providing automated solutions for resource management, dependency handling, and deployment orchestration. Key goals of the project include:
Creating a simplified command-line interface for end-to-end AI application deployment
Automating resource allocation and dependency management through Terraform and Ansible
Providing seamless integration with public datasets and models from sources like HuggingFace and Kaggle
Implementing secure containerized deployments using Docker
Managing cloud infrastructure through automated scripts and serverless functions
Supporting scalable and maintainable AI workload deployments
2. Users/Personas Of The Project
The platform serves various users in the AI development ecosystem:
AI developers who need an efficient way to deploy models without managing complex infrastructure
Engineers requiring specific hardware configurations for AI model deployment
Newcomers to cloud computing who want to explore AI capabilities without deep cloud expertise
Teams needing secure and scalable infrastructure for AI workloads
Developers working with custom models who need flexible deployment options
Organizations requiring automated resource management and cost optimization
3. Scope and Features Of The Project
The AI Deployment Platform provides:
Command-line interface with:
User authentication via Firebase
Project initialization and configuration
Automated deployment to cloud storage
Resource management and monitoring
Cloud Infrastructure:
Serverless functions for VM provisioning
MySQL database for logging and state management
Cloud Storage buckets for project data and source code
Docker containerization for application deployment
Integration Features:
Support for HuggingFace and Kaggle datasets
Automated dependency management
Version control for containers and deployments
Security Features:
Firebase authentication
Resource tagging for access control
Secure secret management
Service account management
4. Solution Concept
Cloud Architecture Diagram
The solution architecture consists of several key components working together to provide end-to-end AI application deployment:
Client Layer
Primary user interaction point
Handles authentication through Firebase
Manages project initialization and configuration
Builds and uploads Docker containers
Monitors deployment status and results
Downloads results
Data Management Layer
Dataset Downloader
Integrates with Kaggle and HuggingFace (see the download sketch after this layer)
Manages dataset versioning and storage
Handles data preprocessing requirements
Bucket Builder
Creates and manages GCP storage buckets
Generates requirements.txt automatically
Handles input/output storage configuration
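As a sketch of the download step, pulling a public HuggingFace repo looks roughly like this (the repo ID and cache directory are illustrative):

```python
# Sketch: fetch a public HuggingFace repo to local storage before
# staging it into a GCP bucket. Repo ID and paths are placeholders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="distilbert-base-uncased",
    cache_dir="/tmp/cynthus-downloads",
)
print(f"Downloaded to {local_dir}")
```

Kaggle datasets follow the same staging pattern via the Kaggle CLI (kaggle datasets download -d <owner>/<dataset>).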
Storage Layer
Input Object Storage (GCP Bucket)
Stores user data, requirements.txt, and source code
Triggers deployment workflows
Manages access control through Firebase authentication
Output Object Storage (GCP Bucket)
Stores computation results
Maintains execution logs
Provides secure access to processed data
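Uploads into the input bucket use the standard google-cloud-storage client; a minimal sketch with placeholder bucket and object names:

```python
# Sketch: stage source code and requirements into an input bucket.
# Bucket and object names are illustrative placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("cynthus-input-example")
bucket.blob("src/main.py").upload_from_filename("main.py")
bucket.blob("requirements.txt").upload_from_filename("requirements.txt")
```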
Processing Layer
Cloud Run Functions
Handles VM provisioning and configuration
Manages container deployment
Coordinates with orchestrator for deployment status
Processes authentication and authorization
Management Layer
SQL Database
Tracks deployment metadata (see the schema sketch below):
Run ID and User ID
Resource paths and states
Deployment configurations
Maintains system state information
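A plausible shape for that metadata table, inferred from the fields above; the column names, types, and connection parameters are assumptions, not the project's exact schema:

```python
# Hypothetical deployment-metadata schema; names and types are assumptions.
import mysql.connector  # connection parameters below are placeholders

DDL = """
CREATE TABLE IF NOT EXISTS deployments (
    run_id      VARCHAR(64)  PRIMARY KEY,
    user_id     VARCHAR(64)  NOT NULL,
    input_path  VARCHAR(255),
    output_path VARCHAR(255),
    state       ENUM('pending','provisioning','running','done','failed'),
    config      JSON,
    updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
"""

conn = mysql.connector.connect(
    host="10.0.0.5", user="cynthus",
    password="<from Secret Manager>", database="cynthus",
)
conn.cursor().execute(DDL)
conn.commit()
```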
Orchestrator Server
Monitors VM health through heartbeats (sketched below)
Manages container lifecycle
Handles failure recovery
Updates deployment states
Coordinates between components
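The heartbeat bookkeeping is conceptually simple. A minimal sketch, where the deadline and in-memory data structure are illustrative choices (the real orchestrator persists state to SQL):

```python
# Sketch of heartbeat tracking: workers report in periodically; anything
# silent past the deadline is marked for failure recovery / re-run.
import time

HEARTBEAT_DEADLINE_S = 120  # illustrative threshold
last_seen: dict[str, float] = {}  # vm_name -> unix time of last heartbeat

def record_heartbeat(vm_name: str) -> None:
    last_seen[vm_name] = time.time()

def find_stale_vms() -> list[str]:
    now = time.time()
    return [vm for vm, t in last_seen.items() if now - t > HEARTBEAT_DEADLINE_S]

for vm in find_stale_vms():
    print(f"{vm} missed its heartbeat; scheduling re-run")
```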
Container Registry Layer
Artifact Registry
Stores Docker container images
Manages image versions
Provides secure container distribution
Integrates with VM deployment
Compute Layer
Bare-Metal VM
Executes containerized AI workloads
Reports health status to orchestrator
Manages data processing
Handles output generation
Key Workflows:
Authentication Flow:
User authenticates via Firebase
Access tokens manage resource permissions
Secure communication between components
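A minimal sketch of what the sign-in step looks like against the Firebase Auth REST API (the API key and credentials are placeholders):

```python
# Sketch: exchange email/password for a Firebase ID token, which the
# CLI then sends as a Bearer token on subsequent requests.
import requests

FIREBASE_API_KEY = "YOUR_WEB_API_KEY"  # placeholder
url = ("https://identitytoolkit.googleapis.com/v1/"
       f"accounts:signInWithPassword?key={FIREBASE_API_KEY}")
resp = requests.post(url, json={
    "email": "user@example.com",
    "password": "correct-horse-battery",
    "returnSecureToken": True,
})
resp.raise_for_status()
id_token = resp.json()["idToken"]
```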
Deployment Flow:
Container image built and pushed to Artifact Registry
Cloud Run Functions provision VM resources
Orchestrator manages deployment lifecycle
System state tracked in SQL database
Data Management Flow:
Dataset Downloader fetches external data
Bucket Builder creates storage infrastructure
Input/Output buckets manage data lifecycle
Execution Flow:
VM pulls container image from Artifact Registry
Workload processes data
Results stored in output bucket
Status updates maintained in database
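On the VM itself, the execution step amounts to a pull-and-run; a sketch with an illustrative image path and mount point:

```python
# Sketch of the worker's execution step: pull the image from Artifact
# Registry and run it with GPU access. Paths and flags are illustrative.
import subprocess

IMAGE = "us-central1-docker.pkg.dev/my-project/cynthus/run-1234:latest"

subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "-v", "/mnt/output:/app/output", IMAGE],
    check=True,
)
```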
Monitoring Flow:
Orchestrator tracks VM health
System handles failure recovery
Metrics and logs collected
State management maintained
5. Acceptance Criteria
Minimum acceptance criteria includes:
Functional CLI for end-to-end deployment:
User authentication and project management
Automated resource provisioning
Container deployment and monitoring
Cloud Infrastructure Setup:
Successful VM provisioning with Terraform
Automated configuration with Ansible
Docker container deployment
External Integrations:
Working connections to HuggingFace and Kaggle
Successful data and model management
Security Implementation:
User authentication
Resource access control
Secure deployment pipeline
"The essence of greatness is the perception that virtue is perception that virtue is enough." — Ralph Waldo Emerson