[Image: Mount Cynthus. Photo by Ser Amantio di Nicolao]

Cynthus: Orchestrating AI Applications on the Cloud


Keeping with the tradition of naming software products after Greek words, we named this project Cynthus after Mount Cynthus. It so happens that Mount Cynthus is also where the name Cynthia originated. <3

Big thanks to Intel for supporting the project and guiding us throughout the process. Thanks to Professor Ata Turk and the rest of the team for giving us this opportunity.

Introduction

See the GitHub repository here.

Existing products that seek to streamline AI application deployment, such as SageMaker, Vertex AI, and Azure ML, end up being clunky to use, particularly if the Data Scientist or Machine Learning Engineer has little cloud or computing knowledge.

Data Scientists and Machine Learning Engineers don't care about, and don't need to know about, the intricacies of their model deployment; all they want to do is see it work. Existing solutions still demand substantial cloud infrastructure knowledge: the engineer has to imperatively tell the system to pull data from this bucket, compute on that instance, load balance like this, install these dependencies, and so on. Often ML Engineers and Data Scientists don't want to deal with any of this when they think about deployment. They want to say, "here's my data, here's my code, now I want it to work." And that's what we set out to do.

*We were initially told not to use Kubernetes, as IDC did not yet have a managed Kubernetes solution and running our own cluster would have strained the budget.

My Role and Contributions

For this project, I was in charge of many of the cloud architecture design choices and their implementation, such as going serverless (GCP Cloud Run), using an orchestrator for worker heartbeats, User Authentication and Creation, API Creation, and IAM and Security.

All in all, I was responsible for:

Containerization, Caching and Compute Environment Setup

  • Dependency Installation
  • GPU compute compatibility (CUDA)
  • Containerization of User Source Code (Dockerfiles; see the sketch after this list)
  • Caching of common Docker images with Google Artifact Registry
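
To make the containerization step concrete, here is a minimal sketch of how a Dockerfile could be rendered for user source code. The base image tags, the main.py entry point, and the render_dockerfile helper are illustrative, not the project's actual implementation:

```python
from pathlib import Path

# Illustrative base images; the real pipeline may pin different tags.
CPU_BASE = "python:3.11-slim"
GPU_BASE = "nvidia/cuda:12.2.0-runtime-ubuntu22.04"  # CUDA runtime for GPU jobs

def render_dockerfile(use_gpu: bool = False) -> str:
    """Render a Dockerfile that installs the user's dependencies, then their code."""
    base = GPU_BASE if use_gpu else CPU_BASE
    lines = [
        f"FROM {base}",
        # Note: the CUDA image ships without Python, so a real GPU build
        # would also need a layer that installs python3 and pip.
        "WORKDIR /app",
        # Copying requirements.txt first lets Docker (and Artifact Registry)
        # cache the dependency layer across code-only changes.
        "COPY requirements.txt .",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",
        'CMD ["python", "main.py"]',  # assumed entry point
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    Path("Dockerfile").write_text(render_dockerfile())
```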

Infrastructure Management

  • Management of Terraform state files
  • Provisioning of new resources (sketch below)
  • Scaling resources on demand
  • Kubernetes with Kubeflow for distributed ML workloads (*with very limited access to GPUs)
  • Orchestrator heartbeats and re-run logic for worker VMs
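
Provisioning itself can be driven by a thin wrapper around the Terraform CLI. A minimal sketch, assuming the state file lives in a remote backend configured in the Terraform sources; the directory layout and variable file name are hypothetical:

```python
import subprocess

def provision(workdir: str, var_file: str = "terraform.tfvars") -> None:
    """Run terraform init/apply in a project directory.

    The state is assumed to sit in a remote GCS backend declared in the
    Terraform files themselves, so concurrent runs share one state.
    """
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    subprocess.run(
        ["terraform", "apply", "-input=false", "-auto-approve", f"-var-file={var_file}"],
        cwd=workdir,
        check=True,
    )

def destroy(workdir: str) -> None:
    """Tear down everything tracked in the state file."""
    subprocess.run(["terraform", "destroy", "-auto-approve"], cwd=workdir, check=True)
```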

Serverless Functions

  • Function to create the input and output data buckets (bucket-operations)
  • Function triggered on bucket creation (bucket-listener-vm; see the sketch after this list)
  • Function to create VMs (create-vm)
  • Functions for incremental updates, i.e., changes to user code or user data (code-update, data-update)
  • Function to run workloads (run-container)
  • Function to destroy resources (destroy-resources)
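
As an example of the trigger pattern, here is a minimal sketch of a storage-triggered function in the style of bucket-listener-vm, written with the functions_framework library; the handler body is illustrative:

```python
import functions_framework

# Fired by a storage "object finalized" event on the input bucket
# (the Eventarc trigger itself is wired up at deploy time).
@functions_framework.cloud_event
def bucket_listener_vm(cloud_event):
    data = cloud_event.data
    bucket = data["bucket"]
    name = data["name"]
    print(f"New object gs://{bucket}/{name}; requesting VM provisioning")
    # The real function would now invoke the create-vm logic, with project
    # metadata parsed from the object path.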

IAM management, Authentication and Security

  • Creation of service accounts and granting them least-privilege permissions
  • Networking rules for client-server communication, plus a VPC connecting the serverless functions and compute instances so heartbeat messages can flow
  • Secret Management with GCP Secret Manager (see the sketch after this list)
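
Secret access follows the standard Secret Manager client pattern. A minimal sketch; the project ID and secret names in the usage comment are placeholders:

```python
from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret value from GCP Secret Manager.

    The calling service account only needs roles/secretmanager.secretAccessor
    on this one secret, keeping to least privilege.
    """
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Hypothetical usage: a worker pulling an API key at startup.
# api_key = get_secret("cynthus-project", "kaggle-api-key")
```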

1. Vision and Goals Of The Project

Cynthus aims to simplify the deployment of AI applications on cloud platforms. While initially designed for Intel Developer Cloud (IDC), the project currently operates on Google Cloud Platform (GCP) due to accessibility considerations. The platform addresses the challenges developers face when deploying AI workloads by providing automated solutions for resource management, dependency handling, and deployment orchestration. Key goals of the project include:

  • Creating a simplified command-line interface for end-to-end AI application deployment
  • Automating resource allocation and dependency management through Terraform and Ansible
  • Providing seamless integration with public datasets and models from sources like HuggingFace and Kaggle
  • Implementing secure containerized deployments using Docker
  • Managing cloud infrastructure through automated scripts and serverless functions
  • Supporting scalable and maintainable AI workload deployments

2. Users/Personas Of The Project

The platform serves various users in the AI development ecosystem:

  • AI developers who need an efficient way to deploy models without managing complex infrastructure
  • Engineers requiring specific hardware configurations for AI model deployment
  • Newcomers to cloud computing who want to explore AI capabilities without deep cloud expertise
  • Teams needing secure and scalable infrastructure for AI workloads
  • Developers working with custom models who need flexible deployment options
  • Organizations requiring automated resource management and cost optimization

3. Scope and Features Of The Project

The AI Deployment Platform provides:

  • Command-line interface with:
    • User authentication via Firebase
    • Project initialization and configuration
    • Automated deployment to cloud storage
    • Resource management and monitoring
  • Cloud Infrastructure:
    • Serverless functions for VM provisioning
    • MySQL database for logging and state management
    • Cloud Storage buckets for project data and source code
    • Docker containerization for application deployment
  • Integration Features:
    • Support for HuggingFace and Kaggle datasets
    • Automated dependency management
    • Version control for containers and deployments
  • Security Features:
    • Firebase authentication
    • Resource tagging for access control
    • Secure secret management
    • Service account management

4. Solution Concept

Architecture
[Figure: Cloud Architecture Diagram]

The solution architecture consists of several key components working together to provide end-to-end AI application deployment:

Client Layer (Command Line Interface)

  • Primary user interaction point
  • Handles authentication through Firebase
  • Manages project initialization and configuration
  • Builds and uploads Docker containers
  • Monitors deployment status and results
  • Downloads results

Data Management Layer

  • Dataset Downloader
    • Integrates with Kaggle and HuggingFace
    • Manages dataset versioning and storage
    • Handles data preprocessing requirements
  • Bucket Builder (see the sketch after this list)
    • Creates and manages GCP storage buckets
    • Generates requirements.txt automatically
    • Handles input/output storage configuration
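
A minimal sketch of the bucket-creation half of the Bucket Builder, using the google-cloud-storage client; the naming scheme and location are assumptions:

```python
from google.cloud import storage

def build_buckets(client: storage.Client, run_id: str, location: str = "us-central1"):
    """Create the paired input/output buckets for one run.

    Bucket names are illustrative; the real naming scheme may differ.
    """
    names = (f"cynthus-input-{run_id}", f"cynthus-output-{run_id}")
    created = []
    for name in names:
        bucket = client.bucket(name)
        bucket.storage_class = "STANDARD"
        created.append(client.create_bucket(bucket, location=location))
    return created
```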

Storage Layer

  • Input Object Storage (GCP Bucket)
    • Stores user data, requirements.txt, and source code
    • Triggers deployment workflows
    • Manages access control through Firebase authentication
  • Output Object Storage (GCP Bucket)
    • Stores computation results
    • Maintains execution logs
    • Provides secure access to processed data

Processing Layer

  • Cloud Run Functions
    • Handles VM provisioning and configuration (see the sketch after this list)
    • Manages container deployment
    • Coordinates with orchestrator for deployment status
    • Processes authentication and authorization
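
For the VM-provisioning step, a minimal sketch using the google-cloud-compute client; the machine type, image family, and network are placeholder choices:

```python
from google.cloud import compute_v1

def create_vm(project: str, zone: str, name: str,
              machine_type: str = "e2-standard-4") -> None:
    """Create a worker VM that will later pull and run the user's container."""
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/{machine_type}",
        disks=[boot_disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default"),
        ],
    )
    client = compute_v1.InstancesClient()
    operation = client.insert(project=project, zone=zone, instance_resource=instance)
    operation.result()  # block until provisioning finishes
```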

Management Layer

  • SQL Database
    • Tracks deployment metadata:
      • Run ID and User ID
      • Resource paths and states
      • Deployment configurations
    • Maintains system state information
  • Orchestrator Server
    • Monitors VM health through heartbeats (see the sketch after this list)
    • Manages container lifecycle
    • Handles failure recovery
    • Updates deployment states
    • Coordinates between components
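
The heartbeat bookkeeping can be captured in a few lines. A minimal in-memory sketch; the timeout, retry budget, and restart_workload callback are assumptions, and the real orchestrator would persist this state in the SQL database:

```python
import time

HEARTBEAT_TIMEOUT_S = 120  # assumed threshold before a worker is presumed dead
MAX_RETRIES = 3            # assumed re-run budget per workload

# In-memory stand-ins; the real orchestrator keeps this state in SQL.
last_seen: dict[str, float] = {}
retries: dict[str, int] = {}

def record_heartbeat(vm_name: str) -> None:
    """Called whenever a worker VM's heartbeat message arrives."""
    last_seen[vm_name] = time.time()

def check_workers(restart_workload) -> None:
    """Re-run workloads on VMs whose heartbeats have gone quiet."""
    now = time.time()
    for vm_name, seen in list(last_seen.items()):
        if now - seen < HEARTBEAT_TIMEOUT_S:
            continue  # still healthy
        if retries.get(vm_name, 0) >= MAX_RETRIES:
            print(f"{vm_name}: giving up after {MAX_RETRIES} retries")
            continue
        retries[vm_name] = retries.get(vm_name, 0) + 1
        record_heartbeat(vm_name)   # reset the clock for the new attempt
        restart_workload(vm_name)   # e.g. re-invoke the run-container function
```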

Container Registry Layer

  • Artifact Registry
    • Stores Docker container images
    • Manages image versions
    • Provides secure container distribution
    • Integrates with VM deployment

Compute Layer

  • VM Bare Metal
    • Executes containerized AI workloads (see the sketch after this list)
    • Reports health status to orchestrator
    • Manages data processing
    • Handles output generation
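
On the VM side, the workload boils down to pulling the image and running it with the data paths mounted. A minimal sketch; the image URI and mount paths are illustrative:

```python
import subprocess

def run_workload(image_uri: str) -> int:
    """Pull the user's image from Artifact Registry and run it on the VM."""
    subprocess.run(["docker", "pull", image_uri], check=True)
    result = subprocess.run([
        "docker", "run", "--rm",
        "--gpus", "all",                  # only on GPU-backed instances
        "-v", "/mnt/data:/app/data",      # input data staged from the bucket
        "-v", "/mnt/output:/app/output",  # results synced back afterwards
        image_uri,
    ])
    return result.returncode
```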

Key Workflows:

  1. Authentication Flow:
    • User authenticates via Firebase (see the sketch after these workflows)
    • Access tokens manage resource permissions
    • Secure communication between components
  2. Deployment Flow:
    • Container image built and pushed to Artifact Registry
    • Cloud Run Functions provision VM resources
    • Orchestrator manages deployment lifecycle
    • System state tracked in SQL database
  3. Data Management Flow:
    • Dataset Downloader fetches external data
    • Bucket Builder creates storage infrastructure
    • Input/Output buckets manage data lifecycle
  4. Execution Flow:
    • VM pulls container from Artifact Registry
    • Workload processes data
    • Results stored in output bucket
    • Status updates maintained in database
  5. Monitoring Flow:
    • Orchestrator tracks VM health
    • System handles failure recovery
    • Metrics and logs collected
    • State management maintained
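
The authentication flow above maps onto the Firebase Auth REST API. A minimal sketch of the CLI's sign-in step; error handling and token refresh are omitted:

```python
import requests

FIREBASE_SIGNIN_URL = (
    "https://identitytoolkit.googleapis.com/v1/accounts:signInWithPassword"
)

def sign_in(api_key: str, email: str, password: str) -> str:
    """Exchange CLI credentials for a Firebase ID token.

    The returned token is then sent as a Bearer token on calls to the
    serverless functions, which verify it before provisioning anything.
    """
    resp = requests.post(
        FIREBASE_SIGNIN_URL,
        params={"key": api_key},
        json={"email": email, "password": password, "returnSecureToken": True},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["idToken"]
```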

5. Acceptance Criteria

Minimum acceptance criteria include:

  • Functional CLI for end-to-end deployment:
    • User authentication and project management
    • Automated resource provisioning
    • Container deployment and monitoring
  • Cloud Infrastructure Setup:
    • Successful VM provisioning with Terraform
    • Automated configuration with Ansible
    • Docker container deployment
  • External Integrations:
    • Working connections to HuggingFace and Kaggle
    • Successful data and model management
  • Security Implementation:
    • User authentication
    • Resource access control
    • Secure deployment pipeline

"The essence of greatness is the perception that virtue is perception that virtue is enough."
— Ralph Waldo Emerson