Provider Registry Service - Statius

The Provider Registry Service is a core component of the Dante GPU Platform. It manages the registration, discovery, and status of GPU providers. It acts as a central directory: GPU providers announce their availability and specifications to it, and other services, particularly the Scheduler Orchestrator Service, query it to find suitable providers for incoming jobs.

1. Responsibilities

The Provider Registry Service has the following key responsibilities:

Provider Registration: Allows new GPU providers to register themselves with the platform. This involves storing details about the provider, such as their hardware specifications (GPU model, VRAM, drivers), unique ID, and location.
Status Tracking: Monitors and updates the status of registered providers (e.g., idle, busy, offline, maintenance, error). This is typically achieved through heartbeats sent by the provider daemons or explicit status updates.
Provider Discovery: Provides an API for other services (especially the Scheduler Orchestrator) to query and discover available and suitable GPU providers based on various criteria like GPU type, VRAM, status, and location.
Service Discovery Integration: Registers itself with a service discovery tool like Consul, allowing other components of the Dante platform to locate and communicate with it.
Data Persistence: Stores provider information persistently, typically in a PostgreSQL database.

2. Tech Stack

The service is built using the following technologies:

Programming Language: Go (version 1.22+)
HTTP Routing: Chi router
Logging: Zap for structured logging
Service Discovery: HashiCorp Consul (via Consul API Client)
Database: PostgreSQL (for persistent storage)
Unique Identifiers: UUID for provider IDs

3. Core Components and Logic

3.1. Configuration (configs/config.yaml)

The service's behavior is controlled by a config.yaml file. Key configuration parameters include:

port: The port on which the service listens (e.g., ":8002").
log_level: Logging verbosity (e.g., "info", "debug").
request_timeout: Default timeout for HTTP server requests.
database_url: Connection string for the PostgreSQL database.
secrets_file_path: Optional path to a JSON file for secrets like database credentials.
consul_address: Address of the Consul agent (e.g., "localhost:8500").
service_name: Name under which the service is registered in Consul (e.g., "provider-registry").
service_id_prefix: Prefix for generating a unique service ID in Consul.
service_tags: Tags for Consul registration (e.g., ["cacciaguida", "registry"]).
health_check_path: HTTP path for Consul health checks (e.g., "/health").
health_check_interval: Frequency of Consul health checks.
health_check_timeout: Timeout for Consul health checks.

The service takes the database connection string either directly from database_url or builds it from individual components supplied in a secrets file or in environment variables (prefixed with DANTE_).
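The fallback described above can be sketched as follows. This is a minimal illustration rather than the service's actual loader: the dbParts type, the helper names, and the specific variable names (DANTE_DB_USER, DANTE_DB_HOST, DANTE_DATABASE_URL, and so on) are assumptions; only the DANTE_ prefix comes from the description above.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
)

// dbParts holds the individual connection components. The field names and
// the environment variable names used in main are hypothetical.
type dbParts struct {
	User, Password, Host, Port, Name, SSLMode string
}

// resolveDatabaseURL prefers an explicit database_url and otherwise builds a
// PostgreSQL connection URL from the individual parts.
func resolveDatabaseURL(explicitURL string, p dbParts) string {
	if explicitURL != "" {
		return explicitURL
	}
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(p.User, p.Password), // escapes special characters
		Host:     net.JoinHostPort(p.Host, p.Port),
		Path:     "/" + p.Name,
		RawQuery: url.Values{"sslmode": {p.SSLMode}}.Encode(),
	}
	return u.String()
}

// envOr returns the value of DANTE_<key>, or fallback when it is unset or empty.
func envOr(key, fallback string) string {
	if v := os.Getenv("DANTE_" + key); v != "" {
		return v
	}
	return fallback
}

func main() {
	parts := dbParts{
		User:     envOr("DB_USER", "dante"),
		Password: envOr("DB_PASSWORD", "change-me"),
		Host:     envOr("DB_HOST", "localhost"),
		Port:     envOr("DB_PORT", "5432"),
		Name:     envOr("DB_NAME", "provider_registry"),
		SSLMode:  envOr("DB_SSLMODE", "disable"),
	}
	dsn := resolveDatabaseURL(os.Getenv("DANTE_DATABASE_URL"), parts)

	// Print the URL with the password masked so the secret stays out of logs.
	if u, err := url.Parse(dsn); err == nil {
		fmt.Println(u.Redacted())
	}
}
```

Building the URL with net/url instead of string concatenation keeps passwords containing characters such as "@" or "/" from corrupting the connection string.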

3.2. Data Models (internal/models/)

Provider: This is the central model representing a GPU provider. It includes:
- ID (UUID): Unique identifier for the provider.
- OwnerID (string): Identifier of the user who owns the provider.
- Name (string): User-friendly name for the provider.
- Hostname (string, optional): Hostname of the provider machine.
- IPAddress (string, optional): IP address of the provider.
- Status (ProviderStatus): Current status (idle, busy, offline, maintenance, error).
- GPUs ([]GPUDetail): A slice detailing each GPU available on the provider machine.
- Location (string, optional): Physical or logical location of the provider.
- RegisteredAt (time.Time): Timestamp of registration.
- LastSeenAt (time.Time): Timestamp of the last heartbeat or status update.
- Metadata (map[string]interface{}, optional): Additional key-value information.
GPUDetail: Describes an individual GPU's specifications and current metrics:
- ModelName (string): e.g., "NVIDIA GeForce RTX 5090".
- VRAM (uint64): Video RAM in Megabytes.
- DriverVersion (string): GPU driver version.
- Architecture (string, optional): e.g., "Ampere", "Ada Lovelace".
- ComputeCapability (string, optional): e.g., "8.9".
- CudaCores (uint32, optional): Number of CUDA cores.
- TensorCores (uint32, optional): Number of Tensor cores.
- MemoryBandwidth (uint64, optional): Memory bandwidth in GB/s.
- PowerConsumption (uint32, optional): Max power in Watts.
- UtilizationGPU (uint8, optional): Current GPU core utilization (0-100%).
- UtilizationMem (uint8, optional): Current GPU memory utilization (0-100%).
- Temperature (uint8, optional): Current GPU temperature in Celsius.
- PowerDraw (uint32, optional): Current power draw in Watts.
- IsHealthy (bool): Indicates if the GPU is functioning correctly.
Error Models: Defines custom error types like ErrProviderNotFound and ErrProviderAlreadyExists for clear error handling.
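For orientation, the models above might look roughly like this in Go. It is a trimmed sketch, not the contents of internal/models: the JSON tag names and the Valid helper are assumptions, several optional GPU fields are left out, and ID is a plain string here to avoid a third-party UUID dependency, whereas the service uses a UUID.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// ProviderStatus enumerates the states listed in the model description.
type ProviderStatus string

const (
	StatusIdle        ProviderStatus = "idle"
	StatusBusy        ProviderStatus = "busy"
	StatusOffline     ProviderStatus = "offline"
	StatusMaintenance ProviderStatus = "maintenance"
	StatusError       ProviderStatus = "error"
)

// Valid reports whether s is one of the known statuses (hypothetical helper).
func (s ProviderStatus) Valid() bool {
	switch s {
	case StatusIdle, StatusBusy, StatusOffline, StatusMaintenance, StatusError:
		return true
	}
	return false
}

// GPUDetail carries a subset of the documented fields; JSON names are assumed.
type GPUDetail struct {
	ModelName      string `json:"model_name"`
	VRAM           uint64 `json:"vram_mb"` // megabytes
	DriverVersion  string `json:"driver_version"`
	Architecture   string `json:"architecture,omitempty"`
	UtilizationGPU uint8  `json:"utilization_gpu,omitempty"` // 0-100 percent
	IsHealthy      bool   `json:"is_healthy"`
}

// Provider mirrors the documented model. ID is a string in this sketch only.
type Provider struct {
	ID           string                 `json:"id"`
	OwnerID      string                 `json:"owner_id"`
	Name         string                 `json:"name"`
	Hostname     string                 `json:"hostname,omitempty"`
	IPAddress    string                 `json:"ip_address,omitempty"`
	Status       ProviderStatus         `json:"status"`
	GPUs         []GPUDetail            `json:"gpus"`
	Location     string                 `json:"location,omitempty"`
	RegisteredAt time.Time              `json:"registered_at"`
	LastSeenAt   time.Time              `json:"last_seen_at"`
	Metadata     map[string]interface{} `json:"metadata,omitempty"`
}

// Sentinel errors that callers can compare against with errors.Is.
var (
	ErrProviderNotFound      = errors.New("provider not found")
	ErrProviderAlreadyExists = errors.New("provider already exists")
)

func main() {
	now := time.Now().UTC()
	p := Provider{
		ID:      "example-provider-id",
		OwnerID: "owner-1",
		Name:    "lab-node-1",
		Status:  StatusIdle,
		GPUs: []GPUDetail{{
			ModelName: "NVIDIA GeForce RTX 5090", VRAM: 32768,
			DriverVersion: "example-driver", IsHealthy: true,
		}},
		RegisteredAt: now,
		LastSeenAt:   now,
	}
	out, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```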

3.3. Storage (internal/store/)

The service uses a ProviderStore interface to abstract data storage operations.

PostgresProviderStore: An implementation of ProviderStore using PostgreSQL.
- Initialization: Creates providers and gpu_details tables if they don't exist. The gpu_details table has a foreign key relationship with the providers table (ON DELETE CASCADE).
- Operations: Includes methods to add, get, list, update, and delete providers and their GPU details within database transactions.
- Heartbeat Update: Updates the last_seen_at timestamp and GPU metrics for a provider. If the provider was offline or in error, its status is set to idle.
- Retry Logic: Database operations are wrapped with a retry mechanism (retryer.WithRetry) to handle transient errors.
InMemoryProviderStore: An in-memory implementation primarily for testing or simple deployments. It provides the same interface methods but stores data in a map.
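The store abstraction and the heartbeat rule can be illustrated with a small in-memory version. The interface below is a reduced, hypothetical one: the method names are guesses, the real ProviderStore also covers listing, updating, and deleting, and it likely takes a context. Only the behavior (refresh the last-seen time, and move offline or error providers back to idle) is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type ProviderStatus string

const (
	StatusIdle    ProviderStatus = "idle"
	StatusBusy    ProviderStatus = "busy"
	StatusOffline ProviderStatus = "offline"
	StatusError   ProviderStatus = "error"
)

// Provider is cut down to the fields the heartbeat logic touches.
type Provider struct {
	ID         string
	Status     ProviderStatus
	LastSeenAt time.Time
}

var (
	ErrProviderNotFound      = errors.New("provider not found")
	ErrProviderAlreadyExists = errors.New("provider already exists")
)

// ProviderStore is a reduced, hypothetical version of the storage interface.
type ProviderStore interface {
	AddProvider(p Provider) error
	GetProvider(id string) (Provider, error)
	UpdateHeartbeat(id string) error
}

// InMemoryProviderStore keeps providers in a map guarded by a mutex.
type InMemoryProviderStore struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

// Compile-time check that the in-memory store satisfies the interface.
var _ ProviderStore = (*InMemoryProviderStore)(nil)

func NewInMemoryProviderStore() *InMemoryProviderStore {
	return &InMemoryProviderStore{providers: make(map[string]Provider)}
}

func (s *InMemoryProviderStore) AddProvider(p Provider) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.providers[p.ID]; exists {
		return ErrProviderAlreadyExists
	}
	s.providers[p.ID] = p
	return nil
}

func (s *InMemoryProviderStore) GetProvider(id string) (Provider, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	p, ok := s.providers[id]
	if !ok {
		return Provider{}, ErrProviderNotFound
	}
	return p, nil
}

// UpdateHeartbeat refreshes LastSeenAt and revives offline or errored
// providers to idle; any other status is left unchanged.
func (s *InMemoryProviderStore) UpdateHeartbeat(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.providers[id]
	if !ok {
		return ErrProviderNotFound
	}
	p.LastSeenAt = time.Now().UTC()
	if p.Status == StatusOffline || p.Status == StatusError {
		p.Status = StatusIdle
	}
	s.providers[id] = p
	return nil
}

func main() {
	store := NewInMemoryProviderStore()
	_ = store.AddProvider(Provider{ID: "gpu-node-1", Status: StatusOffline})
	if err := store.UpdateHeartbeat("gpu-node-1"); err != nil {
		panic(err)
	}
	p, _ := store.GetProvider("gpu-node-1")
	fmt.Println(p.Status) // prints "idle"
}
```

Coding handlers against the interface is what lets the in-memory store stand in for PostgreSQL in tests.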

3.4. API Handlers (internal/handlers/)

HTTP request handling is managed by ProviderHandler.

Routes: Defines endpoints for provider operations:
- POST /providers: Register a new provider.
- GET /providers: List available providers (supports filtering by status, gpu_model, min_vram, architecture, healthy_only).
- GET /providers/{providerID}: Get details for a specific provider.
- PUT /providers/{providerID}: Fully update a provider's details.
- PATCH /providers/{providerID}/status: Update a provider's status.
- POST /providers/{providerID}/heartbeat: Signal a provider's active presence and update GPU metrics.
- DELETE /providers/{providerID}: Deregister a provider.
Request/Response: Uses structs like RegisterProviderRequest, UpdateProviderStatusRequest, and HeartbeatRequest for request payloads. Responses generally use the models.Provider structure or simple JSON messages.
Contextual Logging: Leverages logging.ContextLogger to include correlation and request IDs in logs.

3.5. Service Discovery (internal/consul/)

Connection: Establishes a connection with the Consul agent specified in the configuration.
Registration: Registers the Provider Registry Service itself with Consul, including its name, unique ID, address, port, tags, and an HTTP health check. The health check URL uses 127.0.0.1 if the service address is unspecified (e.g., "0.0.0.0").
Deregistration: The service is designed to deregister from Consul upon graceful shutdown.
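The health check address rule can be shown in isolation. The sketch below derives the URL that would be handed to Consul as the HTTP check target; the function names are hypothetical, and treating an empty host or "::" the same way as "0.0.0.0" is an assumption. The Consul client call itself is omitted so the example stays within the standard library.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// healthCheckURL builds the HTTP health check URL. When the service address
// is empty or unspecified (0.0.0.0 or ::), the check targets 127.0.0.1,
// because an unspecified address is not a usable destination.
func healthCheckURL(addr string, port int, path string) string {
	host := addr
	if ip := net.ParseIP(addr); addr == "" || (ip != nil && ip.IsUnspecified()) {
		host = "127.0.0.1"
	}
	if !strings.HasPrefix(path, "/") {
		path = "/" + path
	}
	return "http://" + net.JoinHostPort(host, strconv.Itoa(port)) + path
}

// splitListenAddr turns a listen address such as ":8002" into host and port.
func splitListenAddr(listen string) (string, int, error) {
	host, portStr, err := net.SplitHostPort(listen)
	if err != nil {
		return "", 0, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", 0, err
	}
	return host, port, nil
}

func main() {
	host, port, err := splitListenAddr(":8002")
	if err != nil {
		panic(err)
	}
	fmt.Println(healthCheckURL(host, port, "/health")) // http://127.0.0.1:8002/health
}
```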

3.6. HTTP Server (internal/server/)

A simple HTTP server is configured using chi.Router.
Middleware handles request IDs, real-IP resolution, structured logging (a custom implementation), panic recovery, and request timeouts.
A correlation ID middleware (customMiddleware.CorrelationID) is also added to inject or forward correlation IDs for distributed tracing.
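A correlation ID middleware of this kind could look like the sketch below. It is not the code behind customMiddleware.CorrelationID: the X-Correlation-ID header name, the context key, and the random hex ID format are assumptions. It is written against net/http handlers, which is also the middleware shape Chi accepts.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// correlationHeader is an assumed header name; the service may use another.
const correlationHeader = "X-Correlation-ID"

// ctxKey is an unexported key type so other packages cannot collide with it.
type ctxKey struct{}

// CorrelationID forwards an incoming correlation ID or creates a new one,
// stores it in the request context, and echoes it on the response.
func CorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			id = newID()
		}
		w.Header().Set(correlationHeader, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// newID returns 16 random bytes encoded as 32 hex characters.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// FromContext returns the correlation ID, or "" if none was set.
func FromContext(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// roundTrip runs one in-process request through the middleware and reports
// the ID seen by the handler and the ID echoed on the response.
func roundTrip(incoming string) (inHandler, inResponse string) {
	h := CorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inHandler = FromContext(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	if incoming != "" {
		req.Header.Set(correlationHeader, incoming)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return inHandler, rec.Header().Get(correlationHeader)
}

func main() {
	seen, echoed := roundTrip("req-from-scheduler")
	fmt.Println(seen, echoed)
}
```

Echoing the ID on the response lets a caller that did not supply one still tie its request to the service's log lines.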

4. API Endpoints

The service exposes the following RESTful API endpoints under the /providers base path, along with a separate /health endpoint:

POST /providers: Register a new GPU provider.
- Request Body: RegisterProviderRequest JSON object.
- Response: 201 Created with the created models.Provider object.
GET /providers: List available providers.
- Query Parameters (Filters):
  - status (string): Filter by provider status (e.g., "idle", "busy").
  - gpu_model (string): Filter by GPU model name (partial match).
  - min_vram (uint64): Filter by minimum VRAM in MB.
  - architecture (string): Filter by GPU architecture.
  - healthy_only (bool): If true, only return providers where all GPUs are healthy.
- Response: 200 OK with an array of models.Provider objects.
GET /providers/{providerID}: Get details for a specific provider.
- Path Parameter: providerID (UUID).
- Response: 200 OK with the models.Provider object or 404 Not Found.
PUT /providers/{providerID}: Update an existing provider's details (full update).
- Path Parameter: providerID (UUID).
- Request Body: models.Provider JSON object.
- Response: 200 OK with the updated models.Provider object or 404 Not Found.
PATCH /providers/{providerID}/status: Update the status of a specific provider.
- Path Parameter: providerID (UUID).
- Request Body: UpdateProviderStatusRequest JSON object (e.g., {"status": "busy"}).
- Response: 200 OK with a success message or 404 Not Found.
POST /providers/{providerID}/heartbeat: Update a provider's last seen time and optionally its GPU metrics.
- Path Parameter: providerID (UUID).
- Request Body (optional): HeartbeatRequest JSON object containing gpu_metrics.
- Response: 200 OK with a success message or 404 Not Found.
DELETE /providers/{providerID}: Deregister a provider.
- Path Parameter: providerID (UUID).
- Response: 200 OK with a success message or 404 Not Found.
GET /health: Health check endpoint for Consul and other monitoring tools.
- Response: 200 OK if healthy, 503 Service Unavailable if dependencies (like DB) are down.
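The GET /providers filters can be pictured as a small parse-then-match step. The sketch below is one plausible reading, with several assumptions: gpu_model is matched case-insensitively, the GPU-level filters (gpu_model, min_vram, architecture) must all be satisfied by at least one GPU on the provider, and the type and function names are hypothetical. Only the parameter names and the healthy_only rule (all GPUs healthy) come from the endpoint description.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

type GPU struct {
	ModelName    string
	VRAM         uint64 // megabytes
	Architecture string
	IsHealthy    bool
}

type Provider struct {
	Name   string
	Status string
	GPUs   []GPU
}

// Filter holds the parsed query parameters of GET /providers.
type Filter struct {
	Status, GPUModel, Architecture string
	MinVRAM                        uint64
	HealthyOnly                    bool
}

// parseQuery turns a raw query string into a Filter, rejecting bad numbers
// and booleans instead of silently ignoring them.
func parseQuery(raw string) (Filter, error) {
	q, err := url.ParseQuery(raw)
	if err != nil {
		return Filter{}, err
	}
	f := Filter{Status: q.Get("status"), GPUModel: q.Get("gpu_model"), Architecture: q.Get("architecture")}
	if v := q.Get("min_vram"); v != "" {
		if f.MinVRAM, err = strconv.ParseUint(v, 10, 64); err != nil {
			return Filter{}, fmt.Errorf("min_vram: %w", err)
		}
	}
	if v := q.Get("healthy_only"); v != "" {
		if f.HealthyOnly, err = strconv.ParseBool(v); err != nil {
			return Filter{}, fmt.Errorf("healthy_only: %w", err)
		}
	}
	return f, nil
}

// Matches reports whether p satisfies the filter.
func (f Filter) Matches(p Provider) bool {
	if f.Status != "" && p.Status != f.Status {
		return false
	}
	needsGPU := f.GPUModel != "" || f.Architecture != "" || f.MinVRAM > 0
	gpuMatch := false
	for _, g := range p.GPUs {
		if f.HealthyOnly && !g.IsHealthy {
			return false // every GPU must be healthy
		}
		modelOK := strings.Contains(strings.ToLower(g.ModelName), strings.ToLower(f.GPUModel))
		archOK := f.Architecture == "" || strings.EqualFold(g.Architecture, f.Architecture)
		if modelOK && archOK && g.VRAM >= f.MinVRAM {
			gpuMatch = true
		}
	}
	return gpuMatch || !needsGPU
}

func main() {
	f, err := parseQuery("status=idle&gpu_model=rtx&min_vram=16000&healthy_only=true")
	if err != nil {
		panic(err)
	}
	p := Provider{Name: "lab-node-1", Status: "idle", GPUs: []GPU{
		{ModelName: "NVIDIA GeForce RTX 4090", VRAM: 24576, Architecture: "Ada Lovelace", IsHealthy: true},
	}}
	fmt.Println(f.Matches(p)) // prints "true"
}
```

In the PostgreSQL-backed store these conditions would more naturally become WHERE clauses; the in-process version is shown only to make the semantics explicit.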

5. Workflow

Startup:
- The service loads its configuration.
- It initializes the logger and database connection (PostgreSQL).
- The database schema (providers and gpu_details tables) is created if it doesn't exist.
- It connects to Consul and registers itself as a service, providing a health check endpoint.
- The HTTP server starts listening for API requests.
Provider Registration:
- A GPU provider daemon (external service) sends a POST request to /providers with its details (owner, name, GPU specs, location, etc.).
- The handler validates the request and creates a new models.Provider object.
- The provider and its associated GPU details are stored in the PostgreSQL database.
- The service responds with the newly created provider object.
Provider Heartbeat/Status Update:
- Provider daemons periodically send POST requests to /providers/{providerID}/heartbeat to indicate they are still active. This request can optionally include updated GPU metrics.
- The handler updates the LastSeenAt timestamp and any provided GPU metrics in the database. If the provider was marked offline or in error, a heartbeat typically transitions it back to idle.
- Alternatively, a provider or an admin tool can send a PATCH request to /providers/{providerID}/status to explicitly change the provider's status (e.g., to maintenance, busy).
Provider Discovery:
- The Scheduler Orchestrator Service (or other authorized services) sends a GET request to /providers to find available providers.
- The request can include query parameters to filter providers based on status, GPU model, VRAM, etc.
- The Provider Registry Service queries its database, applies the filters, and returns a list of matching providers.
Shutdown:
- Upon receiving a termination signal (SIGINT, SIGTERM), the service initiates a graceful shutdown.
- It deregisters itself from Consul.
- The HTTP server is shut down gracefully within a timeout.
- The database connection pool is closed.
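The shutdown sequence above can be sketched with the standard library alone. The run function and its callbacks are hypothetical: the Consul deregistration and database close are passed in as stubs, the 10-second shutdown timeout is an assumed value, and the two-second limit in main exists only so the demo terminates without a signal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// run serves HTTP on ln until ctx is cancelled, then performs the shutdown
// steps in order: deregister, stop the HTTP server, close the database.
func run(ctx context.Context, ln net.Listener, deregister, closeDB func()) error {
	srv := &http.Server{Handler: http.NotFoundHandler()}
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	select {
	case err := <-serveErr:
		return err // the server failed before shutdown was requested
	case <-ctx.Done():
	}

	deregister()

	// Give in-flight requests a bounded amount of time to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	shutdownErr := srv.Shutdown(shutdownCtx)

	closeDB()

	if err := <-serveErr; err != nil && !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return shutdownErr
}

// shutdownOrder runs the sequence with an already-cancelled context and
// records the order in which the callbacks fire.
func shutdownOrder() ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a termination signal that has already arrived
	var steps []string
	err = run(ctx, ln,
		func() { steps = append(steps, "consul deregistered") },
		func() { steps = append(steps, "database closed") })
	return steps, err
}

func main() {
	// Cancel the context on SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()
	// Demo only: also stop after two seconds so the example exits on its own.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	err = run(ctx, ln,
		func() { fmt.Println("deregistered from Consul (stub)") },
		func() { fmt.Println("database pool closed (stub)") })
	fmt.Println("shutdown complete, err =", err)
}
```

Deregistering before stopping the HTTP server means Consul stops advertising the instance while it can still answer requests that are already in flight.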

6. Future Considerations

Advanced Filtering: Implement more complex filtering options for provider discovery (e.g., geographic proximity, price range, specific driver versions).
Provider Reputation/Rating: Store and manage provider reliability or performance ratings.
Security: Implement authentication and authorization for API endpoints (e.g., only provider daemons can register/heartbeat, only schedulers can list providers).
Scalability: Ensure the database and service can handle a large number of providers and frequent updates/queries. This might involve read replicas for the database or caching strategies.
Eventing: Publish events (e.g., via NATS) when provider status changes, allowing other services to react in real-time.

