Provider Registry Service - Statius
The Provider Registry Service is a crucial component in the Dante GPU Platform. Its primary role is to manage the registration, discovery, and status of GPU providers. It acts as a central directory: GPU providers announce their availability and specifications to it, and other services, particularly the Scheduler Orchestrator Service, query it to find suitable providers for incoming jobs.
1. Responsibilities
The Provider Registry Service has the following key responsibilities:
Provider Registration: Allows new GPU providers to register themselves with the platform. This involves storing details about the provider, such as their hardware specifications (GPU model, VRAM, drivers), unique ID, and location.
Status Tracking: Monitors and updates the status of registered providers (e.g., idle, busy, offline, maintenance, error). This is typically achieved through heartbeats sent by the provider daemons or explicit status updates.
Provider Discovery: Provides an API for other services (especially the Scheduler Orchestrator) to query and discover available and suitable GPU providers based on various criteria like GPU type, VRAM, status, and location.
Service Discovery Integration: Registers itself with a service discovery tool like Consul, allowing other components of the Dante platform to locate and communicate with it.
Data Persistence: Stores provider information persistently, typically in a PostgreSQL database.
2. Tech Stack
The service is built using the following technologies:
Programming Language: Go (version 1.22+)
HTTP Routing: Chi router
Logging: Zap for structured logging
Service Discovery: HashiCorp Consul (via Consul API Client)
Database: PostgreSQL (for persistent storage)
Unique Identifiers: UUID for provider IDs
3. Core Components and Logic
3.1. Configuration (configs/config.yaml)
The service's behavior is controlled by a config.yaml file. Key configuration parameters include:
port: The port on which the service listens (e.g., ":8002").
log_level: Logging verbosity (e.g., "info", "debug").
request_timeout: Default timeout for HTTP server requests.
database_url: Connection string for the PostgreSQL database.
secrets_file_path: Optional path to a JSON file for secrets like database credentials.
consul_address: Address of the Consul agent (e.g., "localhost:8500").
service_name: Name under which the service is registered in Consul (e.g., "provider-registry").
service_id_prefix: Prefix for generating a unique service ID in Consul.
service_tags: Tags for Consul registration (e.g., ["cacciaguida", "registry"]).
health_check_path: HTTP path for Consul health checks (e.g., "/health").
health_check_interval: Frequency of Consul health checks.
health_check_timeout: Timeout for Consul health checks.
The service takes the database connection string either directly from database_url or assembles it from individual components specified in a secrets file or in environment variables (prefixed with DANTE_).
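The fallback described above can be sketched as follows. This is a minimal illustration, not the service's actual loader: the dbParts type, the function names, and the specific variable names (DANTE_DB_USER and so on) are assumptions; the source only states that the variables carry the DANTE_ prefix.

```go
package main

import (
	"fmt"
	"net/url"
)

// dbParts holds the individual connection components. The field set is an
// assumption for illustration; the real secrets file may use other keys.
type dbParts struct {
	User, Password, Host, Port, Name, SSLMode string
}

// buildDatabaseURL assembles a PostgreSQL connection URL and escapes the
// credentials, so a password containing '@' or ':' stays valid.
func buildDatabaseURL(p dbParts) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(p.User, p.Password),
		Host:     p.Host + ":" + p.Port,
		Path:     "/" + p.Name,
		RawQuery: url.Values{"sslmode": {p.SSLMode}}.Encode(),
	}
	return u.String()
}

// resolveDatabaseURL prefers an explicit database_url and otherwise builds
// one from DANTE_-prefixed variables (hypothetical names).
func resolveDatabaseURL(configured string, getenv func(string) string) string {
	if configured != "" {
		return configured
	}
	return buildDatabaseURL(dbParts{
		User:     getenv("DANTE_DB_USER"),
		Password: getenv("DANTE_DB_PASSWORD"),
		Host:     getenv("DANTE_DB_HOST"),
		Port:     getenv("DANTE_DB_PORT"),
		Name:     getenv("DANTE_DB_NAME"),
		SSLMode:  getenv("DANTE_DB_SSLMODE"),
	})
}

func main() {
	// A map stands in for the process environment; the service would pass os.Getenv.
	env := map[string]string{
		"DANTE_DB_USER": "dante", "DANTE_DB_PASSWORD": "p@ss",
		"DANTE_DB_HOST": "db", "DANTE_DB_PORT": "5432",
		"DANTE_DB_NAME": "registry", "DANTE_DB_SSLMODE": "disable",
	}
	fmt.Println(resolveDatabaseURL("", func(k string) string { return env[k] }))
}
```

Building the URL through net/url rather than string concatenation keeps passwords with reserved characters from corrupting the connection string.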
3.2. Data Models (internal/models/)
Provider: This is the central model representing a GPU provider. It includes:
ID (UUID): Unique identifier for the provider.
OwnerID (string): Identifier of the user who owns the provider.
Name (string): User-friendly name for the provider.
Hostname (string, optional): Hostname of the provider machine.
IPAddress (string, optional): IP address of the provider.
Status (ProviderStatus): Current status (idle, busy, offline, maintenance, error).
GPUs ([]GPUDetail): A slice detailing each GPU available on the provider machine.
Location (string, optional): Physical or logical location of the provider.
RegisteredAt (time.Time): Timestamp of registration.
LastSeenAt (time.Time): Timestamp of the last heartbeat or status update.
Metadata (map[string]interface{}, optional): Additional key-value information.
GPUDetail: Describes an individual GPU's specifications and current metrics:
ModelName (string): e.g., "NVIDIA GeForce RTX 5090".
VRAM (uint64): Video RAM in Megabytes.
DriverVersion (string): GPU driver version.
Architecture (string, optional): e.g., "Ampere", "Ada Lovelace".
ComputeCapability (string, optional): e.g., "8.9".
CudaCores (uint32, optional).
TensorCores (uint32, optional).
MemoryBandwidth (uint64, optional): GB/s.
PowerConsumption (uint32, optional): Max power in Watts.
UtilizationGPU (uint8, optional): Current GPU core utilization (0-100%).
UtilizationMem (uint8, optional): Current GPU memory utilization (0-100%).
Temperature (uint8, optional): Current GPU temperature in Celsius.
PowerDraw (uint32, optional): Current power draw in Watts.
IsHealthy (bool): Indicates if the GPU is functioning correctly.
Error Models: Defines custom error types like ErrProviderNotFound and ErrProviderAlreadyExists for clear error handling.
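A trimmed-down sketch of how these models might look in Go is shown below. The type and field names follow the list above, but the JSON tags, the Valid helper, and the error messages are assumptions. The ID is a plain string here only to keep the example free of third-party packages (the service uses a UUID), and GPUDetail carries just a subset of the documented fields.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// ProviderStatus enumerates the statuses named in the documentation.
type ProviderStatus string

const (
	StatusIdle        ProviderStatus = "idle"
	StatusBusy        ProviderStatus = "busy"
	StatusOffline     ProviderStatus = "offline"
	StatusMaintenance ProviderStatus = "maintenance"
	StatusError       ProviderStatus = "error"
)

// Valid reports whether s is one of the known statuses (hypothetical helper).
func (s ProviderStatus) Valid() bool {
	switch s {
	case StatusIdle, StatusBusy, StatusOffline, StatusMaintenance, StatusError:
		return true
	}
	return false
}

// GPUDetail is a subset of the documented fields; JSON tags are assumed.
type GPUDetail struct {
	ModelName     string `json:"model_name"`
	VRAM          uint64 `json:"vram_mb"` // megabytes
	DriverVersion string `json:"driver_version"`
	Architecture  string `json:"architecture,omitempty"`
	IsHealthy     bool   `json:"is_healthy"`
}

// Provider mirrors the central model.
type Provider struct {
	ID           string         `json:"id"` // a UUID in the service
	OwnerID      string         `json:"owner_id"`
	Name         string         `json:"name"`
	Status       ProviderStatus `json:"status"`
	GPUs         []GPUDetail    `json:"gpus"`
	Location     string         `json:"location,omitempty"`
	RegisteredAt time.Time      `json:"registered_at"`
	LastSeenAt   time.Time      `json:"last_seen_at"`
}

// Sentinel errors let callers branch with errors.Is.
var (
	ErrProviderNotFound      = errors.New("provider not found")
	ErrProviderAlreadyExists = errors.New("provider already exists")
)

func main() {
	now := time.Now().UTC()
	p := Provider{
		ID: "example-id", OwnerID: "user-1", Name: "rig-01", Status: StatusIdle,
		GPUs:         []GPUDetail{{ModelName: "NVIDIA GeForce RTX 5090", VRAM: 32768, DriverVersion: "example", IsHealthy: true}},
		RegisteredAt: now, LastSeenAt: now,
	}
	out, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```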
3.3. Storage (internal/store/)
The service uses a ProviderStore interface to abstract data storage operations.
PostgresProviderStore: An implementation of ProviderStore using PostgreSQL.
Initialization: Creates providers and gpu_details tables if they don't exist. The gpu_details table has a foreign key relationship with the providers table (ON DELETE CASCADE).
Operations: Includes methods to add, get, list, update, and delete providers and their GPU details within database transactions.
Heartbeat Update: Updates the last_seen_at timestamp and GPU metrics for a provider. If the provider was offline or in error, its status is set to idle.
Retry Logic: Database operations are wrapped with a retry mechanism (retryer.WithRetry) to handle transient errors.
InMemoryProviderStore: An in-memory implementation primarily for testing or simple deployments. It provides the same interface methods but stores data in a map.
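To make the store abstraction concrete, here is a reduced sketch of the interface with a map-backed implementation. The method set and signatures are assumptions (the real interface also covers list, update, and delete, and handles GPU details); the heartbeat rule follows the behavior described above, where an offline or error provider returns to idle.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrProviderNotFound      = errors.New("provider not found")
	ErrProviderAlreadyExists = errors.New("provider already exists")
)

// Provider is cut down to the fields this sketch needs.
type Provider struct {
	ID         string
	Status     string
	LastSeenAt time.Time
}

// ProviderStore is a reduced, assumed version of the storage interface.
type ProviderStore interface {
	Add(p Provider) error
	Get(id string) (Provider, error)
	Heartbeat(id string) error
}

// InMemoryProviderStore keeps providers in a map guarded by a mutex.
type InMemoryProviderStore struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

// Compile-time check that the implementation satisfies the interface.
var _ ProviderStore = (*InMemoryProviderStore)(nil)

func NewInMemoryProviderStore() *InMemoryProviderStore {
	return &InMemoryProviderStore{providers: make(map[string]Provider)}
}

func (s *InMemoryProviderStore) Add(p Provider) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.providers[p.ID]; exists {
		return ErrProviderAlreadyExists
	}
	s.providers[p.ID] = p
	return nil
}

func (s *InMemoryProviderStore) Get(id string) (Provider, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	p, ok := s.providers[id]
	if !ok {
		return Provider{}, ErrProviderNotFound
	}
	return p, nil
}

// Heartbeat refreshes LastSeenAt and revives offline or errored providers.
func (s *InMemoryProviderStore) Heartbeat(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.providers[id]
	if !ok {
		return ErrProviderNotFound
	}
	p.LastSeenAt = time.Now()
	if p.Status == "offline" || p.Status == "error" {
		p.Status = "idle"
	}
	s.providers[id] = p
	return nil
}

func main() {
	store := NewInMemoryProviderStore()
	_ = store.Add(Provider{ID: "p1", Status: "offline"})
	_ = store.Heartbeat("p1")
	p, _ := store.Get("p1")
	fmt.Println(p.ID, p.Status)
}
```

Handlers that depend only on the interface can be tested against the in-memory store and run against PostgreSQL in production without changes.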
3.4. API Handlers (internal/handlers/)
HTTP request handling is managed by ProviderHandler.
Routes: Defines endpoints for provider operations:
POST /providers: Register a new provider.
GET /providers: List available providers (supports filtering by status, gpu_model, min_vram, architecture, healthy_only).
GET /providers/{providerID}: Get details for a specific provider.
PUT /providers/{providerID}: Fully update a provider's details.
PATCH /providers/{providerID}/status: Update a provider's status.
POST /providers/{providerID}/heartbeat: Signal a provider's active presence and update GPU metrics.
DELETE /providers/{providerID}: Deregister a provider.
Request/Response: Uses structs like RegisterProviderRequest, UpdateProviderStatusRequest, and HeartbeatRequest for request payloads. Responses generally use the models.Provider structure or simple JSON messages.
Contextual Logging: Leverages logging.ContextLogger to include correlation and request IDs in logs.
3.5. Service Discovery (internal/consul/)
Connection: Establishes a connection with the Consul agent specified in the configuration.
Registration: Registers the Provider Registry Service itself with Consul, including its name, unique ID, address, port, tags, and an HTTP health check. The health check URL uses 127.0.0.1 if the service address is unspecified (e.g., "0.0.0.0").
Deregistration: The service is designed to deregister from Consul upon graceful shutdown.
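The health check address rule can be illustrated with a small helper. This is a sketch under assumptions: healthCheckURL is a hypothetical name, and treating an empty host and "::" as unspecified alongside "0.0.0.0" is an extrapolation from the single example given above. The actual registration call through the Consul API client is left out.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// healthCheckURL builds the URL handed to Consul for the HTTP health check.
// An unspecified listen host is replaced with 127.0.0.1, because Consul
// cannot probe a wildcard address.
func healthCheckURL(listenAddr, path string) (string, error) {
	host, port, err := net.SplitHostPort(listenAddr)
	if err != nil {
		return "", fmt.Errorf("parse listen address %q: %w", listenAddr, err)
	}
	if host == "" || host == "0.0.0.0" || host == "::" {
		host = "127.0.0.1"
	}
	if !strings.HasPrefix(path, "/") {
		path = "/" + path
	}
	return "http://" + net.JoinHostPort(host, port) + path, nil
}

func main() {
	for _, addr := range []string{":8002", "0.0.0.0:8002", "10.0.0.5:8002"} {
		u, err := healthCheckURL(addr, "/health")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(addr, "->", u)
	}
}
```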
3.6. HTTP Server (internal/server/)
A simple HTTP server is configured using chi.Router.
Standard middleware is used for request ID, real IP, structured logging (custom), panic recovery, and request timeouts.
A correlation ID middleware (customMiddleware.CorrelationID) is also added to inject or forward correlation IDs for distributed tracing.
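A correlation ID middleware of this kind could look roughly like the sketch below. It is not the service's customMiddleware.CorrelationID: the X-Correlation-ID header name, the context key, and the random hex ID format are all assumptions. The func(http.Handler) http.Handler shape is the standard net/http middleware signature, which chi also accepts.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const correlationHeader = "X-Correlation-ID" // assumed header name

type ctxKey struct{}

// newID returns 16 random bytes as a 32-character hex string.
func newID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// CorrelationID forwards an incoming correlation ID or injects a new one,
// exposing it on the response header and in the request context.
func CorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			id = newID()
		}
		w.Header().Set(correlationHeader, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// echoCorrelation runs one request through the middleware and returns the
// ID the wrapped handler saw in its context.
func echoCorrelation(incoming string) string {
	var seen string
	h := CorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen, _ = r.Context().Value(ctxKey{}).(string)
	}))
	req := httptest.NewRequest(http.MethodGet, "/providers", nil)
	if incoming != "" {
		req.Header.Set(correlationHeader, incoming)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println("forwarded:", echoCorrelation("abc-123"))
	fmt.Println("generated:", echoCorrelation(""))
}
```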
4. API Endpoints
The service exposes the following RESTful API endpoints. Provider operations sit under the /providers base path; the health check is served at /health:
POST /providers: Register a new GPU provider.
Request Body: RegisterProviderRequest JSON object.
Response: 201 Created with the created models.Provider object.
GET /providers: List available providers.
Query Parameters (Filters):
status (string): Filter by provider status (e.g., "idle", "busy").
gpu_model (string): Filter by GPU model name (partial match).
min_vram (uint64): Filter by minimum VRAM in MB.
architecture (string): Filter by GPU architecture.
healthy_only (bool): If true, only return providers where all GPUs are healthy.
Response: 200 OK with an array of models.Provider objects.
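The filter semantics listed above can be sketched as a parse step plus a match predicate. The Filter type and function names are hypothetical, and several details are assumptions rather than documented behavior: the partial model match is case-insensitive, a provider matches when at least one of its GPUs satisfies the model, architecture, and VRAM criteria, and a provider with no GPUs never matches.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

type GPU struct {
	ModelName, Architecture string
	VRAM                    uint64 // megabytes
	IsHealthy               bool
}

type Provider struct {
	Name, Status string
	GPUs         []GPU
}

// Filter holds the parsed query parameters of GET /providers.
type Filter struct {
	Status, GPUModel, Architecture string
	MinVRAM                        uint64
	HealthyOnly                    bool
}

func parseFilter(q url.Values) (Filter, error) {
	f := Filter{Status: q.Get("status"), GPUModel: q.Get("gpu_model"), Architecture: q.Get("architecture")}
	if v := q.Get("min_vram"); v != "" {
		n, err := strconv.ParseUint(v, 10, 64)
		if err != nil {
			return Filter{}, fmt.Errorf("min_vram: %w", err)
		}
		f.MinVRAM = n
	}
	if v := q.Get("healthy_only"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return Filter{}, fmt.Errorf("healthy_only: %w", err)
		}
		f.HealthyOnly = b
	}
	return f, nil
}

// Matches applies the filter to one provider.
func (f Filter) Matches(p Provider) bool {
	if f.Status != "" && p.Status != f.Status {
		return false
	}
	gpuOK := false
	for _, g := range p.GPUs {
		if f.HealthyOnly && !g.IsHealthy {
			return false // healthy_only requires every GPU to be healthy
		}
		if f.GPUModel != "" && !strings.Contains(strings.ToLower(g.ModelName), strings.ToLower(f.GPUModel)) {
			continue
		}
		if f.Architecture != "" && !strings.EqualFold(g.Architecture, f.Architecture) {
			continue
		}
		if g.VRAM < f.MinVRAM {
			continue
		}
		gpuOK = true
	}
	return gpuOK
}

// matchQuery parses a raw query string and applies it to p.
func matchQuery(rawQuery string, p Provider) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	f, err := parseFilter(q)
	if err != nil {
		return false, err
	}
	return f.Matches(p), nil
}

func main() {
	p := Provider{Name: "rig-01", Status: "idle", GPUs: []GPU{{ModelName: "NVIDIA GeForce RTX 4090", Architecture: "Ada Lovelace", VRAM: 24576, IsHealthy: true}}}
	ok, err := matchQuery("status=idle&gpu_model=rtx&min_vram=16000", p)
	fmt.Println(ok, err)
}
```

In the PostgreSQL-backed store the same criteria would more likely be pushed into the SQL query; the in-process predicate is shown only because it makes the semantics easy to read.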
GET /providers/{providerID}: Get details for a specific provider.
Path Parameter: providerID (UUID).
Response: 200 OK with the models.Provider object or 404 Not Found.
PUT /providers/{providerID}: Update an existing provider's details (full update).
Path Parameter: providerID (UUID).
Request Body: models.Provider JSON object.
Response: 200 OK with the updated models.Provider object or 404 Not Found.
PATCH /providers/{providerID}/status: Update the status of a specific provider.
Path Parameter: providerID (UUID).
Request Body: UpdateProviderStatusRequest JSON object (e.g., {"status": "busy"}).
Response: 200 OK with a success message or 404 Not Found.
POST /providers/{providerID}/heartbeat: Update a provider's last seen time and optionally its GPU metrics.
Path Parameter: providerID (UUID).
Request Body (optional): HeartbeatRequest JSON object containing gpu_metrics.
Response: 200 OK with a success message or 404 Not Found.
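Because the heartbeat body is optional, the handler has to tell an empty body apart from a malformed one. One way to do that is sketched here. The gpu_metrics key comes from the description above; the per-GPU field names and the GPUMetrics type are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// GPUMetrics carries the live readings for one GPU (assumed field names).
type GPUMetrics struct {
	UtilizationGPU uint8  `json:"utilization_gpu"`
	UtilizationMem uint8  `json:"utilization_mem"`
	Temperature    uint8  `json:"temperature"`
	PowerDraw      uint32 `json:"power_draw"`
}

// HeartbeatRequest is the optional heartbeat payload.
type HeartbeatRequest struct {
	GPUMetrics []GPUMetrics `json:"gpu_metrics"`
}

// decodeHeartbeat treats an empty body as "no metrics" and reports any
// other decoding problem as an error.
func decodeHeartbeat(body io.Reader) (HeartbeatRequest, error) {
	var req HeartbeatRequest
	err := json.NewDecoder(body).Decode(&req)
	if errors.Is(err, io.EOF) {
		return HeartbeatRequest{}, nil // empty body: presence signal only
	}
	if err != nil {
		return HeartbeatRequest{}, fmt.Errorf("decode heartbeat: %w", err)
	}
	return req, nil
}

// decodeHeartbeatString is a convenience wrapper for string input.
func decodeHeartbeatString(s string) (HeartbeatRequest, error) {
	return decodeHeartbeat(strings.NewReader(s))
}

func main() {
	for _, body := range []string{"", `{"gpu_metrics":[{"utilization_gpu":87,"temperature":71}]}`, "{"} {
		req, err := decodeHeartbeatString(body)
		fmt.Printf("body=%q metrics=%d err=%v\n", body, len(req.GPUMetrics), err)
	}
}
```

A truncated body such as "{" yields io.ErrUnexpectedEOF rather than io.EOF, so it is still rejected instead of being mistaken for an empty heartbeat.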
DELETE /providers/{providerID}: Deregister a provider.
Path Parameter: providerID (UUID).
Response: 200 OK with a success message or 404 Not Found.
GET /health: Health check endpoint for Consul and other monitoring tools.
Response: 200 OK if healthy, 503 Service Unavailable if dependencies (like DB) are down.
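The health endpoint's 200/503 behavior can be sketched with a handler that pings its dependency. The pinger interface, the two-second timeout, and the response bodies are assumptions. The standard library's *sql.DB satisfies this interface through its PingContext method; the source does not say which database driver the service uses.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger abstracts the dependency check so the handler can be exercised
// without a real database.
type pinger interface {
	PingContext(ctx context.Context) error
}

// healthHandler answers 200 when the dependency responds and 503 otherwise.
func healthHandler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		w.Header().Set("Content-Type", "application/json")
		if err := db.PingContext(ctx); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			fmt.Fprintln(w, `{"status":"unavailable"}`)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, `{"status":"ok"}`)
	}
}

// fakeDB simulates a database that is either reachable or not.
type fakeDB struct{ err error }

func (f fakeDB) PingContext(context.Context) error { return f.err }

// healthStatus returns the status code the handler produces.
func healthStatus(dbUp bool) int {
	db := fakeDB{}
	if !dbUp {
		db.err = errors.New("connection refused")
	}
	rec := httptest.NewRecorder()
	healthHandler(db).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println("db up:", healthStatus(true))
	fmt.Println("db down:", healthStatus(false))
}
```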
5. Workflow
Startup:
The service loads its configuration.
It initializes the logger and database connection (PostgreSQL).
The database schema (providers and gpu_details tables) is created if it doesn't exist.
It connects to Consul and registers itself as a service, providing a health check endpoint.
The HTTP server starts listening for API requests.
Provider Registration:
A GPU provider daemon (external service) sends a POST request to /providers with its details (owner, name, GPU specs, location, etc.).
The handler validates the request and creates a new models.Provider object.
The provider and its associated GPU details are stored in the PostgreSQL database.
The service responds with the newly created provider object.
Provider Heartbeat/Status Update:
Provider daemons periodically send POST requests to /providers/{providerID}/heartbeat to indicate they are still active. This request can optionally include updated GPU metrics.
The handler updates the LastSeenAt timestamp and any provided GPU metrics in the database. If the provider was marked offline or error, a heartbeat typically transitions it to idle.
Alternatively, a provider or an admin tool can send a PATCH request to /providers/{providerID}/status to explicitly change the provider's status (e.g., to maintenance, busy).
Provider Discovery:
The Scheduler Orchestrator Service (or other authorized services) sends a GET request to /providers to find available providers.
The request can include query parameters to filter providers based on status, GPU model, VRAM, etc.
The Provider Registry Service queries its database, applies the filters, and returns a list of matching providers.
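From the caller's side, the discovery request is a GET with encoded query parameters. The helper below is a hypothetical sketch of how a scheduler might build that URL; the function name and the base address are assumptions, while the parameter names are the ones the service documents.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// discoveryURL builds the GET /providers URL for the given criteria.
// Empty or zero values are omitted so the registry applies no filter for them.
func discoveryURL(base, status, gpuModel string, minVRAM uint64, healthyOnly bool) string {
	q := url.Values{}
	if status != "" {
		q.Set("status", status)
	}
	if gpuModel != "" {
		q.Set("gpu_model", gpuModel)
	}
	if minVRAM > 0 {
		q.Set("min_vram", strconv.FormatUint(minVRAM, 10))
	}
	if healthyOnly {
		q.Set("healthy_only", "true")
	}
	u := strings.TrimRight(base, "/") + "/providers"
	if encoded := q.Encode(); encoded != "" {
		u += "?" + encoded
	}
	return u
}

func main() {
	// The base address is an example; in the platform it would be resolved via Consul.
	fmt.Println(discoveryURL("http://localhost:8002", "idle", "RTX 4090", 24000, true))
}
```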
Shutdown:
Upon receiving a termination signal (SIGINT, SIGTERM), the service initiates a graceful shutdown.
It deregisters itself from Consul.
The HTTP server is shut down gracefully within a timeout.
The database connection pool is closed.
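The shutdown order above can be sketched as an ordered list of steps run after a termination signal. The step type and runShutdown are hypothetical, the Consul and database steps are placeholders, and the five-second timeout is an assumed value. The short timer in main exists only so the example exits on its own; the service would simply wait for the signal.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// step is one named shutdown action.
type step struct {
	name string
	run  func() error
}

// runShutdown executes the steps in order. A failing step is logged and
// skipped so later cleanup still happens. It returns the steps that succeeded.
func runShutdown(steps []step) []string {
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			fmt.Printf("shutdown step %q failed: %v\n", s.name, err)
			continue
		}
		done = append(done, s.name)
	}
	return done
}

func main() {
	srv := &http.Server{Addr: "127.0.0.1:0", Handler: http.NewServeMux()}
	go func() { _ = srv.ListenAndServe() }()

	// ctx is cancelled when SIGINT or SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()
	select {
	case <-ctx.Done():
	case <-time.After(200 * time.Millisecond): // demo only, so the sketch terminates
	}

	done := runShutdown([]step{
		{"consul-deregister", func() error { return nil }}, // placeholder for the Consul call
		{"http-server", func() error {
			sctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			return srv.Shutdown(sctx) // waits for in-flight requests up to the timeout
		}},
		{"db-pool", func() error { return nil }}, // placeholder for closing the pool
	})
	fmt.Println("completed:", done)
}
```

Deregistering from Consul first stops new traffic from being routed to the instance while in-flight requests drain.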
6. Future Considerations
Advanced Filtering: Implement more complex filtering options for provider discovery (e.g., geographic proximity, price range, specific driver versions).
Provider Reputation/Rating: Store and manage provider reliability or performance ratings.
Security: Implement authentication and authorization for API endpoints (e.g., only provider daemons can register/heartbeat, only schedulers can list providers).
Scalability: Ensure the database and service can handle a large number of providers and frequent updates/queries. This might involve read replicas for the database or caching strategies.
Eventing: Publish events (e.g., via NATS) when provider status changes, allowing other services to react in real-time.