Core Services
Architectural Overview
The DanteGPU platform is engineered as a distributed system, with its backend services forming the operational core responsible for orchestrating GPU resources, managing AI workloads, ensuring secure access, and facilitating a seamless user experience within the Agora marketplace and the broader GPU-as-a-Service (GPUaaS) ecosystem. The architecture is predicated on modern principles of microservices, robust communication patterns, and a carefully selected technology stack to achieve high availability, scalability, and maintainability.
I. Architectural Paradigm: Decentralized Microservices
The DanteGPU backend eschews a monolithic structure in favor of a microservices architecture. This paradigm involves decomposing the application into a suite of small, independently deployable services, each organized around a specific business capability or domain. This choice is foundational to achieving the platform's goals of democratization and distributed operation.
A. Rationale and Benefits:
Enhanced Scalability & Elasticity: Each microservice can be scaled independently (e.g., horizontally by adding more instances or vertically by allocating more resources) based on its specific load profile. For instance, the Job Queue Service might require different scaling characteristics during peak submission times than the Authentication Service. This granular scalability ensures optimal resource utilization and cost-effectiveness.
Improved Fault Isolation & Resilience: The failure of a single microservice, if designed correctly with patterns like circuit breakers or bulkheads (though specific implementations depend on service logic), is less likely to cause a catastrophic failure of the entire platform. This resilience is paramount for a system managing real-time GPU access and financial transactions ($dGPU).
Technology Diversity & Specialization: While the primary languages are Go and Python (FastAPI), a microservices approach theoretically allows for selecting the best technology (language, database, etc.) for each service's specific needs without impacting others. This fosters innovation and allows developers to leverage optimal tools for particular problems. For example, Go's concurrency primitives and performance are ideal for network-intensive services like the API Gateway or resource orchestration, while Python's rich ecosystem and rapid development capabilities suit services like user authentication or business logic layers.
Independent Development & Deployment Cycles: Teams can develop, test, and deploy their respective microservices autonomously. This accelerates development velocity, simplifies continuous integration/continuous deployment (CI/CD) pipelines, and reduces the scope and risk of individual deployments.
Clearer Domain Boundaries (Domain-Driven Design - DDD): Microservices naturally align with DDD principles, where service boundaries are defined around specific business domains (e.g., User Management, GPU Resource Management, Job Lifecycle Management). This leads to services with high cohesion and loose coupling, making the system easier to understand, evolve, and maintain.
Alignment with Decentralization Ethos: A distributed network of services mirrors the decentralized nature of the GPU providers and consumers DanteGPU aims to connect.
B. Inherent Challenges and Mitigation Strategies:
Adopting microservices also introduces complexities that the DanteGPU architecture must address:
Operational Overhead: Managing a multitude of services requires robust automation for deployment, scaling, monitoring, and logging. Technologies like Docker, Docker Compose, and a future transition to Kubernetes are key to mitigating this.
Distributed System Complexity: Debugging and tracing requests across multiple service boundaries can be challenging. Implementing distributed tracing (e.g., using OpenTelemetry) and comprehensive, correlated logging are essential.
Inter-Service Communication: Network latency, reliability, and the need for robust communication patterns (discussed below) become critical concerns.
Data Consistency: Maintaining data consistency across services that own their respective databases requires careful design, often employing patterns like eventual consistency, sagas, or two-phase commits where strong consistency is indispensable.
Testing Complexity: End-to-end testing of workflows spanning multiple services requires more sophisticated strategies than testing a monolith.
II. Inter-Service Communication Strategy
Effective communication between microservices is vital. DanteGPU employs a hybrid approach, leveraging both synchronous and asynchronous patterns:
A. Synchronous Communication:
Used for request/response interactions where an immediate response is expected.
gRPC (Google Remote Procedure Call):
Rationale: Preferred for internal, high-throughput, low-latency communication between backend services.
Mechanism: Utilizes HTTP/2 for transport, offering multiplexing, header compression, and bidirectional streaming. Protocol Buffers (Protobufs) are used as the Interface Definition Language (IDL), enforcing contract-first design, ensuring type safety, and enabling efficient binary serialization/deserialization.
Benefits: High performance, efficient data encoding, strongly-typed contracts, support for streaming, and code generation in multiple languages.
Use Cases: Internal API calls between core services, such as the Scheduler/Orchestrator Service querying the Provider Registry Service for available GPUs (see the sketch below), or internal control plane operations.
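As a concrete illustration, here is a minimal Go sketch of such an internal gRPC call. The generated stub package, RPC name, and request fields are assumptions for illustration only; they are not taken from the platform's actual .proto contracts.

```go
// Hypothetical sketch: the Scheduler querying the Provider Registry over gRPC.
// Package "pb" stands in for stubs generated from an assumed provider_registry.proto.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/dantegpu/gen/providerregistry" // assumed generated package
)

func main() {
	// Plaintext transport here only for brevity; production traffic would use mTLS.
	conn, err := grpc.Dial("provider-registry:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial provider registry: %v", err)
	}
	defer conn.Close()

	client := pb.NewProviderRegistryClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Ask for idle providers with at least 24 GB of VRAM (illustrative request fields).
	resp, err := client.ListAvailableProviders(ctx, &pb.ListAvailableProvidersRequest{
		MinVramMb: 24576,
	})
	if err != nil {
		log.Fatalf("list providers: %v", err)
	}
	log.Printf("found %d candidate GPUs", len(resp.GetProviders()))
}
```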
RESTful APIs (HTTP/JSON):
Rationale: Employed for services that might be consumed by a wider range of clients (including potentially third-party developers in the future) or where the overhead of gRPC setup is not justified. The API Gateway also exposes RESTful endpoints to external clients.
Mechanism: Standard HTTP methods (GET, POST, PUT, DELETE) with JSON payloads. Adherence to REST principles (statelessness, resource-based URLs) is expected.
Benefits: Simplicity, ubiquity, human-readability (JSON), wide support across languages and tools, easier integration with web frontends.
Use Cases: External API Gateway endpoints, specific internal services where simplicity is prioritized over raw performance, or interaction with services like the Authentication Service from the API Gateway.
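A minimal Go sketch of such an HTTP/JSON interaction follows, assuming a hypothetical /api/v1/auth/login endpoint and payload shape rather than the service's documented contract.

```go
// Illustrative only: the API Gateway forwarding a login request to the
// Authentication Service over HTTP/JSON. Endpoint path and fields are assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type loginRequest struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

type loginResponse struct {
	AccessToken string `json:"access_token"`
	TokenType   string `json:"token_type"`
}

func login(baseURL, user, pass string) (*loginResponse, error) {
	body, err := json.Marshal(loginRequest{Username: user, Password: pass})
	if err != nil {
		return nil, err
	}

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(baseURL+"/api/v1/auth/login", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("auth service returned %s", resp.Status)
	}
	var out loginResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```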
B. Asynchronous Communication / Event-Driven Architecture:
Used for decoupling services, improving resilience, and handling long-running or background tasks. This is crucial for a system managing potentially time-consuming AI jobs.
NATS JetStream:
Rationale: Provides a persistent, reliable, and high-performance messaging and streaming platform for asynchronous operations.
Mechanism: NATS is a lightweight, high-performance messaging system. JetStream adds persistence, message replay, and various delivery semantics (at-least-once, at-most-once, and potentially exactly-once patterns depending on consumer logic). Services publish events/messages to named "subjects" (topics), and interested services subscribe to these subjects.
Streams: Persistent logs of messages.
Consumers: Allow services to read messages from streams, with options for push or pull delivery, acknowledgments, and durable subscriptions.
Benefits:
Decoupling: Producers and consumers are independent; they don't need to know about each other or be available simultaneously.
Resilience & Durability: Message persistence ensures that requests are not lost if a consuming service is temporarily unavailable.
Scalability: Allows for scaling consumer groups independently to process messages in parallel.
Load Leveling: Smooths out peak loads by queuing requests.
Use Cases:
Job Queuing: The Job Queue Service relies on NATS JetStream to persist AI job requests submitted by users. The Scheduler/Orchestrator Service consumes these jobs from the queue (see the sketch after this list).
Event Notification: Broadcasting events like "GPU available," "job status updated," or "new model published on Agora" to interested services without direct coupling.
Data Pipelines: Facilitating asynchronous data flows, e.g., logs or metrics forwarding before final aggregation.
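The following Go sketch shows the publish/consume pattern described above, assuming an illustrative stream name ("JOBS"), subject ("jobs.submitted"), and durable consumer name; the platform's actual naming scheme may differ.

```go
// Minimal sketch of the job-queue flow over NATS JetStream.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Ensure a persistent stream exists for job submissions (idempotent in practice).
	_, _ = js.AddStream(&nats.StreamConfig{
		Name:     "JOBS",
		Subjects: []string{"jobs.>"},
		Storage:  nats.FileStorage,
	})

	// Producer side: the Job Queue Service persists a submitted job.
	if _, err := js.Publish("jobs.submitted", []byte(`{"job_id":"123","model":"example"}`)); err != nil {
		log.Fatal(err)
	}

	// Consumer side: the Scheduler pulls jobs with a durable, at-least-once subscription.
	sub, err := js.PullSubscribe("jobs.submitted", "scheduler")
	if err != nil {
		log.Fatal(err)
	}
	msgs, err := sub.Fetch(1, nats.MaxWait(2*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range msgs {
		log.Printf("dispatching job: %s", string(m.Data))
		_ = m.Ack() // acknowledge so JetStream does not redeliver
	}
}
```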
III. Core Technology Stack Choices
The selection of technologies for the DanteGPU backend reflects a pragmatic approach, balancing performance, developer productivity, and ecosystem support.
Go (Golang):
Role: Primary language for high-performance, concurrent network services and infrastructure components.
Strengths: Excellent support for concurrency (goroutines, channels), compiled to native code for speed, static typing for reliability, efficient memory management, and a strong standard library for networking. Ideal for services like the API Gateway (Siger), Provider Registry Service, and Scheduler/Orchestrator Service.
Python (with FastAPI):
Role: Used for services where rapid development, a rich ecosystem of libraries (e.g., for machine learning, data science, web frameworks), or specific integrations are key.
Strengths (FastAPI): Modern, high-performance web framework for building APIs with Python 3.7+ based on standard Python type hints. Offers automatic data validation, serialization, interactive API documentation (Swagger UI, ReDoc), and leverages Starlette (for web parts) and Pydantic (for data parts). Ideal for the Authentication Service.
Docker & Docker Compose:
Role: Containerization technology for packaging applications and their dependencies. Docker Compose is used for defining and running multi-container Docker applications, especially in development and testing environments.
Benefits: Environment consistency, isolation, portability across machines, simplified dependency management, and a foundational step towards more advanced orchestration.
Consul (by HashiCorp):
Role: Service discovery, configuration management, and health checking.
Mechanism: Services register themselves with Consul, and other services can query Consul to find their network locations. Consul performs health checks to ensure only healthy service instances receive traffic. It can also serve as a distributed key-value store for dynamic configuration.
Benefits: Enables dynamic scaling and resilience, as services don't need hardcoded addresses. Simplifies the routing logic in the API Gateway and internal service communication.
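A minimal Go sketch of registration and discovery against the Consul agent API is shown below; the service name, address, and health-check endpoint are placeholder values, not the platform's actual configuration.

```go
// Sketch of service registration and discovery with the Consul API client.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance of the Provider Registry with an HTTP health check.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:      "provider-registry-1",
		Name:    "provider-registry",
		Address: "10.0.0.12",
		Port:    8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://10.0.0.12:8080/healthz",
			Interval: "10s",
			Timeout:  "2s",
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Discovery: query healthy instances only, so traffic never reaches failed nodes.
	entries, _, err := client.Health().Service("provider-registry", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		log.Printf("healthy instance at %s:%d", e.Service.Address, e.Service.Port)
	}
}
```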
PostgreSQL:
Role: Robust, open-source object-relational database system used for persistent storage by services requiring structured data and transactional integrity.
Strengths: ACID compliance, reliability, extensibility, rich feature set (JSONB support, full-text search, etc.). Suitable for the Provider Registry Service, Scheduler/Orchestrator Service (job store), and Authentication Service (user data).
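As an illustration of the transactional guarantees relied on here, the sketch below updates a job record atomically; the jobs table, column names, and connection string are assumptions for the example, not the actual schema.

```go
// Sketch of a transactional job-state update in the Scheduler's job store (lib/pq driver).
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func markJobRunning(dsn, jobID, providerID string) error {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	// A single transaction keeps the job record and its provider assignment consistent.
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(
		`UPDATE jobs SET status = 'RUNNING', provider_id = $1, started_at = now() WHERE id = $2`,
		providerID, jobID,
	); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}

func main() {
	dsn := "postgres://dante:secret@localhost:5432/jobs?sslmode=disable" // placeholder DSN
	if err := markJobRunning(dsn, "job-123", "provider-42"); err != nil {
		log.Fatal(err)
	}
}
```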
MinIO:
Role: High-performance, S3-compatible object storage service.
Strengths: Scalable, resilient storage for unstructured data like AI models, datasets, job results, and user uploads. Can be self-hosted, providing data sovereignty.
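A short Go sketch of an S3-style upload through the minio-go client follows; the endpoint, credentials, bucket, and object key are placeholders rather than the platform's actual values.

```go
// Sketch of the Storage Service uploading a job result to MinIO.
package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("minio:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false, // TLS would be enabled outside local development
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	result := strings.NewReader(`{"loss": 0.042}`)

	// Store the artifact under a per-job prefix so results stay addressable.
	info, err := client.PutObject(ctx, "job-results", "job-123/metrics.json",
		result, int64(result.Len()), minio.PutObjectOptions{ContentType: "application/json"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored %s (%d bytes)", info.Key, info.Size)
}
```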
Kubernetes (Future Consideration):
Role: Advanced container orchestration platform.
Benefits: Automated deployment, scaling, self-healing, service discovery, load balancing, and configuration management for containerized applications at scale. A natural evolution from Docker Compose for production environments demanding higher resilience and operational efficiency.
IV. Overview of Core Service Domains
The backend is logically segmented into several domains, each encompassing one or more microservices:
API Gateway (siger-api-gateway): The unified ingress point. Handles routing, authentication (JWT), rate limiting, CORS, and acts as a facade for the backend services. (To be detailed in Page 5.)
Authentication Service (auth-service): Manages user identities (providers, consumers), registration, credential verification (password hashing), JWT issuance (see the sketch after this list), and potentially profile management.
Provider Registry Service (provider-registry-service): Tracks connected GPU providers, their hardware specifications (GPU model, VRAM, drivers), real-time status (idle, busy), location, and utilization metrics. Critical for the scheduler to find suitable resources.
Job Queue Service (Integrated via NATS): Manages the intake and persistent queuing of AI job requests.
Scheduler/Orchestrator Service (scheduler-orchestrator-service): The "brain" of the system. Dequeues jobs, queries the Provider Registry for suitable GPUs, dispatches tasks (likely via NATS) to provider daemons (beatrice-core-services), and tracks job progress.
Storage Service (storage-service): Abstracts interactions with object storage (MinIO), handling uploads, downloads, and management of models, datasets, and results.
Monitoring & Logging Service (monitoring-logging-service): A stack (Prometheus, Grafana, Loki, Promtail, etc.) for aggregating metrics and logs from all services, enabling observability and debugging.
GPU Provider Daemon (beatrice-core-services): A client-side agent running on the GPU provider's machine. Responsible for receiving tasks from the Scheduler, executing them (e.g., in a containerized environment with GPU passthrough), monitoring execution, and reporting back results and status. (This component is critical for the GPUaaS functionality.)
Billing & Payment Service (Planned Post-MVP): Will integrate with payment gateways to track resource usage (GPU time, storage) and manage financial transactions (payouts to providers, charges to consumers).
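The password-verification and JWT-issuance steps mentioned above can be sketched as follows. It is shown in Go for consistency with the other sketches on this page even though the Authentication Service itself is Python/FastAPI-based; the claim set, HMAC signing scheme, and key handling are assumptions, not the service's actual implementation.

```go
// Hedged sketch: bcrypt verification followed by issuing an HMAC-signed JWT.
package main

import (
	"log"
	"time"

	"github.com/golang-jwt/jwt/v5"
	"golang.org/x/crypto/bcrypt"
)

var signingKey = []byte("dev-only-secret") // real deployments load this from secrets management

func issueToken(userID string, hashedPassword []byte, password string) (string, error) {
	// Verify the stored bcrypt hash before issuing anything.
	if err := bcrypt.CompareHashAndPassword(hashedPassword, []byte(password)); err != nil {
		return "", err
	}
	claims := jwt.MapClaims{
		"sub": userID,
		"exp": time.Now().Add(time.Hour).Unix(),
		"iat": time.Now().Unix(),
	}
	return jwt.NewWithClaims(jwt.SigningMethodHS256, claims).SignedString(signingKey)
}

func main() {
	hash, _ := bcrypt.GenerateFromPassword([]byte("correct horse battery staple"), bcrypt.DefaultCost)
	token, err := issueToken("user-42", hash, "correct horse battery staple")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("issued JWT:", token)
}
```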
V. Cross-Cutting Concerns
Several concerns span multiple services:
Security: Beyond authentication/authorization, includes secure inter-service communication (e.g., mTLS), input validation, protection against common vulnerabilities, and secure secrets management.
Observability:
Logging: Consistent, structured logging across all services, often correlated with request IDs for tracing.
Metrics: Collection of key performance indicators (KPIs) from each service for monitoring health and performance.
Distributed Tracing: Implementing mechanisms (e.g., OpenTelemetry) to trace requests as they flow through multiple services.
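A brief Go sketch of span creation and trace-correlated logging with the OpenTelemetry API is shown below; the tracer name, span name, and attribute keys are illustrative.

```go
// Sketch of request-scoped tracing plus trace-correlated logging.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func handleJobSubmission(ctx context.Context, jobID string) {
	// Start a span; the context carries trace/span IDs across service boundaries
	// when propagated via gRPC metadata or HTTP headers.
	ctx, span := otel.Tracer("job-queue-service").Start(ctx, "submit_job")
	defer span.End()

	span.SetAttributes(attribute.String("job.id", jobID))

	// Correlate logs with the active trace so one request can be followed
	// through the gateway, queue, and scheduler.
	sc := span.SpanContext()
	log.Printf("trace_id=%s span_id=%s accepted job %s",
		sc.TraceID().String(), sc.SpanID().String(), jobID)
}

func main() {
	// A real service would install an exporter and TracerProvider (e.g. OTLP) at
	// startup; with none installed, the global no-op tracer is used.
	handleJobSubmission(context.Background(), "job-123")
}
```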
Configuration Management: Centralized and dynamic configuration for services, potentially using Consul's KV store or environment variables managed by the orchestration platform.
Error Handling & Resilience: Consistent error reporting, retry mechanisms, and fault tolerance patterns within and between services.
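A minimal Go sketch of a retry-with-exponential-backoff helper of the kind referred to above; the attempt count, backoff curve, and jitter are illustrative defaults rather than platform policy.

```go
// Retry a transient inter-service call with exponential backoff and jitter.
package main

import (
	"errors"
	"fmt"
	"log"
	"math/rand"
	"time"
)

// retry runs op up to attempts times, sleeping with exponential backoff plus jitter
// between failures so retries do not synchronize across callers.
func retry(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		sleep := base*time.Duration(1<<i) + time.Duration(rand.Int63n(int64(base)))
		log.Printf("attempt %d failed: %v (retrying in %s)", i+1, err, sleep)
		time.Sleep(sleep)
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(4, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("provider registry unavailable") // simulated transient error
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("succeeded after %d calls", calls)
}
```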