Keeping with the tradition of naming software products after Greek words, we named this project Cynthus after Mount Cynthus. It so happens that Mount Cynthus is also where the name Cynthia got its origin. <3
Big thanks to Intel for supporting the project and guiding us throughout the process, and to Professor Ata Turk and the rest of the team for giving us this opportunity.
Existing products that seek to streamline AI application deployments, such as SageMaker, Vertex AI, and Azure ML, end up being clunky to use, particularly for a Data Scientist or Machine Learning Engineer with little cloud or computing background.
Data Scientists and Machine Learning Engineers don't care about, and don't need to know, the intricacies of their model deployment; all they want is to see it work. Existing solutions still require the user to know a lot about cloud infrastructure: the engineer has to imperatively tell the system to pull data from this bucket, compute on this instance, load balance like this, install these dependencies, and so on. ML Engineers and Data Scientists often don't want to deal with any of this when they think about deployment. They want to say, "here's my data, here's my code, now I want it to work." And that's what we set out to do.
*We were initially told not to use Kubernetes, as IDC did not yet have a managed Kubernetes solution and running our own cluster did not fit the budget.
My Role and Contributions
For this project, I was in charge of many of the cloud architectural design choices and their implementation, such as using serverless compute (GCP Cloud Run), using an orchestrator for worker heartbeats, user authentication and creation, API creation, IAM, and security.
All in all I was responsible for:
Containerization, Caching and Compute Environment Setup:
Dependency Installation
GPU compute compatibility (CUDA)
Containerization of User Source Code (Dockerfiles)
Caching of common Docker images with Google Artifact Registry
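As a rough illustration of how user code gets containerized, here is a minimal sketch of Dockerfile generation. The base image tags, file names, and entrypoint are illustrative assumptions, not the project's exact templates:

```python
# Hypothetical sketch of Dockerfile generation for user source code.
# Base image tags and the entrypoint are illustrative assumptions.

DOCKERFILE_TEMPLATE = """\
FROM {base_image}
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "{entrypoint}"]
"""

def render_dockerfile(needs_gpu: bool, entrypoint: str = "main.py") -> str:
    # A CUDA-enabled base (here a PyTorch runtime image, which ships with
    # Python and pip) for GPU workloads; a slim Python image otherwise.
    base = (
        "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
        if needs_gpu
        else "python:3.11-slim"
    )
    return DOCKERFILE_TEMPLATE.format(base_image=base, entrypoint=entrypoint)

print(render_dockerfile(needs_gpu=True))
```

Caching the common base layers in Artifact Registry means repeat builds only rebuild the thin user-code layers on top.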
Infrastructure Management
Management of Terraform state files
Provisioning of new resources (see the Terraform sketch after this list)
Scaling resources on demand
Kubernetes with Kubeflow for distributed ML workloads (*very limited access to GPUs)
Orchestrator heartbeats and re-run logic for worker VMs
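The provisioning path boiled down to driving Terraform programmatically. A minimal sketch of that pattern, with a hypothetical module directory and variable name:

```python
# Sketch: drive Terraform from Python via subprocess.
# The working directory and variable names are illustrative assumptions.
import subprocess

def terraform_apply(workdir: str, machine_type: str) -> None:
    # Initialize providers and the state backend (non-interactive).
    subprocess.run(["terraform", "init", "-input=false"], cwd=workdir, check=True)
    # Apply with variables; -auto-approve skips the confirmation prompt.
    subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false",
         f"-var=machine_type={machine_type}"],
        cwd=workdir,
        check=True,
    )

terraform_apply("./infra/worker-vm", machine_type="n1-standard-4")
```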
Serverless Functions
Function for creating input and output data buckets (bucket-operations)
Function triggered on bucket creation (bucket-listener-vm)
Function to create a VM (create-vm)
Functions for incremental updates, i.e. changes to user code or user data (code-update, data-update)
Function to run workloads (run-container)
Function to destroy resources (destroy-resources)
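To give a feel for the serverless side, here is a minimal sketch in the style of bucket-listener-vm, using the Python Functions Framework; the handoff to VM creation is a hypothetical placeholder, not the project's actual code:

```python
# Sketch of a storage-triggered function (bucket-listener-vm style).
# The provisioning handoff is a hypothetical placeholder.
import functions_framework

@functions_framework.cloud_event
def bucket_listener_vm(cloud_event):
    data = cloud_event.data  # GCS event payload
    bucket, name = data["bucket"], data["name"]
    print(f"New object gs://{bucket}/{name}; triggering VM provisioning")
    # create_vm(bucket)  # hypothetical: hand off to the create-vm function
```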
IAM management, Authentication and Security
Creation of service accounts and granting them least-privilege permissions
Networking rules for client-server communication, plus a VPC connecting the serverless functions and compute instances so heartbeat messages can flow between them
Secret Management with GCP Secret Manager
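For secret management, the access pattern looks roughly like this (the project and secret IDs are placeholders):

```python
# Minimal sketch: read a secret from GCP Secret Manager.
from google.cloud import secretmanager

def access_secret(project_id: str, secret_id: str) -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```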
1. Vision and Goals Of The Project
Cynthus aims to simplify the deployment of AI applications on cloud platforms. While initially designed for Intel Developer Cloud (IDC), the project currently operates on Google Cloud Platform (GCP) due to accessibility considerations. The platform addresses the challenges developers face when deploying AI workloads by providing automated solutions for resource management, dependency handling, and deployment orchestration. Key goals of the project include:
Creating a simplified command-line interface for end-to-end AI application deployment
Automating resource allocation and dependency management through Terraform and Ansible
Providing seamless integration with public datasets and models from sources like HuggingFace and Kaggle
Implementing secure containerized deployments using Docker
Managing cloud infrastructure through automated scripts and serverless functions
Supporting scalable and maintainable AI workload deployments
2. Users/Personas Of The Project
The platform serves various users in the AI development ecosystem:
AI developers who need an efficient way to deploy models without managing complex infrastructure
Engineers requiring specific hardware configurations for AI model deployment
Newcomers to cloud computing who want to explore AI capabilities without deep cloud expertise
Teams needing secure and scalable infrastructure for AI workloads
Developers working with custom models who need flexible deployment options
Organizations requiring automated resource management and cost optimization
3. Scope and Features Of The Project
The AI Deployment Platform provides:
Command-line interface with:
User authentication via Firebase
Project initialization and configuration
Automated deployment to cloud storage
Resource management and monitoring
Cloud Infrastructure:
Serverless functions for VM provisioning
MySQL database for logging and state management
Cloud Storage buckets for project data and source code
Docker containerization for application deployment
Integration Features:
Support for HuggingFace and Kaggle datasets
Automated dependency management
Version control for containers and deployments
Security Features:
Firebase authentication
Resource tagging for access control
Secure secret management
Service account management
4. Solution Concept
Cloud Architecture Diagram
The solution architecture consists of several key components working together to provide end-to-end AI application deployment:
Client Layer
Primary user interaction point
Handles authentication through Firebase
Manages project initialization and configuration
Builds and uploads Docker containers
Monitors deployment status and results
Downloads results
Data Management Layer
Dataset Downloader
Integrates with Kaggle and HuggingFace (see the download sketch after this layer)
Manages dataset versioning and storage
Handles data preprocessing requirements
Bucket Builder
Creates and manages GCP storage buckets
Generates requirements.txt automatically
Handles input/output storage configuration
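As a sketch of the download step, pulling a public HuggingFace repo looks roughly like this (the repo ID and cache directory are illustrative):

```python
# Sketch: fetch a public HuggingFace repo to local storage before
# staging it into a GCP bucket. Repo ID and paths are placeholders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="distilbert-base-uncased",
    cache_dir="/tmp/cynthus-downloads",
)
print(f"Downloaded to {local_dir}")
```

Kaggle datasets follow the same staging pattern via the Kaggle CLI (kaggle datasets download -d <owner>/<dataset>).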
Storage Layer
Input Object Storage (GCP Bucket)
Stores user data, requirements.txt, and source code
Triggers deployment workflows
Manages access control through Firebase authentication
Output Object Storage (GCP Bucket)
Stores computation results
Maintains execution logs
Provides secure access to processed data
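Uploads into the input bucket use the standard google-cloud-storage client; a minimal sketch with placeholder bucket and object names:

```python
# Sketch: stage source code and requirements into an input bucket.
# Bucket and object names are illustrative placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("cynthus-input-example")
bucket.blob("src/main.py").upload_from_filename("main.py")
bucket.blob("requirements.txt").upload_from_filename("requirements.txt")
```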
Processing Layer
Cloud Run Functions
Handles VM provisioning and configuration
Manages container deployment
Coordinates with orchestrator for deployment status
Processes authentication and authorization
Management Layer
SQL Database
Tracks deployment metadata (see the schema sketch below):
Run ID and User ID
Resource paths and states
Deployment configurations
Maintains system state information
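A plausible shape for that metadata table, inferred from the fields above; the column names, types, and connection parameters are assumptions, not the project's exact schema:

```python
# Hypothetical deployment-metadata schema; names and types are assumptions.
import mysql.connector  # connection parameters below are placeholders

DDL = """
CREATE TABLE IF NOT EXISTS deployments (
    run_id      VARCHAR(64)  PRIMARY KEY,
    user_id     VARCHAR(64)  NOT NULL,
    input_path  VARCHAR(255),
    output_path VARCHAR(255),
    state       ENUM('pending','provisioning','running','done','failed'),
    config      JSON,
    updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
"""

conn = mysql.connector.connect(
    host="10.0.0.5", user="cynthus",
    password="<from Secret Manager>", database="cynthus",
)
conn.cursor().execute(DDL)
conn.commit()
```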
Orchestrator Server
Monitors VM health through heartbeats (sketched below)
Manages container lifecycle
Handles failure recovery
Updates deployment states
Coordinates between components
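The heartbeat bookkeeping is conceptually simple. A minimal sketch, where the deadline and in-memory data structure are illustrative choices (the real orchestrator persists state to SQL):

```python
# Sketch of heartbeat tracking: workers report in periodically; anything
# silent past the deadline is marked for failure recovery / re-run.
import time

HEARTBEAT_DEADLINE_S = 120  # illustrative threshold
last_seen: dict[str, float] = {}  # vm_name -> unix time of last heartbeat

def record_heartbeat(vm_name: str) -> None:
    last_seen[vm_name] = time.time()

def find_stale_vms() -> list[str]:
    now = time.time()
    return [vm for vm, t in last_seen.items() if now - t > HEARTBEAT_DEADLINE_S]

for vm in find_stale_vms():
    print(f"{vm} missed its heartbeat; scheduling re-run")
```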
Container Registry Layer
Artifact Registry
Stores Docker container images
Manages image versions
Provides secure container distribution
Integrates with VM deployment
Compute Layer
Bare-Metal VM
Executes containerized AI workloads
Reports health status to orchestrator
Manages data processing
Handles output generation
Key Workflows:
Authentication Flow:
User authenticates via Firebase
Access tokens manage resource permissions
Secure communication between components
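A minimal sketch of what the sign-in step looks like against the Firebase Auth REST API (the API key and credentials are placeholders):

```python
# Sketch: exchange email/password for a Firebase ID token, which the
# CLI then sends as a Bearer token on subsequent requests.
import requests

FIREBASE_API_KEY = "YOUR_WEB_API_KEY"  # placeholder
url = ("https://identitytoolkit.googleapis.com/v1/"
       f"accounts:signInWithPassword?key={FIREBASE_API_KEY}")
resp = requests.post(url, json={
    "email": "user@example.com",
    "password": "correct-horse-battery",
    "returnSecureToken": True,
})
resp.raise_for_status()
id_token = resp.json()["idToken"]
```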
Deployment Flow:
Container image built and pushed to Artifact Registry
Cloud Run Functions provision VM resources
Orchestrator manages deployment lifecycle
System state tracked in SQL database
Data Management Flow:
Dataset Downloader fetches external data
Bucket Builder creates storage infrastructure
Input/Output buckets manage data lifecycle
Execution Flow:
VM pulls container image from Artifact Registry
Workload processes data
Results stored in output bucket
Status updates maintained in database
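On the VM itself, the execution step amounts to a pull-and-run; a sketch with an illustrative image path and mount point:

```python
# Sketch of the worker's execution step: pull the image from Artifact
# Registry and run it with GPU access. Paths and flags are illustrative.
import subprocess

IMAGE = "us-central1-docker.pkg.dev/my-project/cynthus/run-1234:latest"

subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "-v", "/mnt/output:/app/output", IMAGE],
    check=True,
)
```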
Monitoring Flow:
Orchestrator tracks VM health
System handles failure recovery
Metrics and logs collected
State management maintained
5. Acceptance Criteria
Minimum acceptance criteria includes:
Functional CLI for end-to-end deployment:
User authentication and project management
Automated resource provisioning
Container deployment and monitoring
Cloud Infrastructure Setup:
Successful VM provisioning with Terraform
Automated configuration with Ansible
Docker container deployment
External Integrations:
Working connections to HuggingFace and Kaggle
Successful data and model management
Security Implementation:
User authentication
Resource access control
Secure deployment pipeline
"The essence of greatness is the perception that virtue is perception that virtue is enough." — Ralph Waldo Emerson