Provider Daemon - Leah
The Provider Daemon is a software component that runs on the machine of a GPU provider. Its main purpose is to make the provider's GPU resources available to the Dante GPU Platform, execute tasks assigned by the Scheduler Orchestrator, and report status and metrics.
1. Responsibilities
The Provider Daemon has several key responsibilities:
GPU Detection: Automatically detects available GPUs on the host machine, including NVIDIA, AMD, and Apple Silicon (on macOS). It gathers details like model name, VRAM, and driver version.
Provider Registration: Registers the host machine and its GPUs with the Provider Registry Service. Registration is initiated by cmd/provider/main.go, which interacts with the registry service; the direct registration call is not shown in the daemon code itself, but it is implied by the architecture.
Task Subscription: Listens for incoming task assignments from the Scheduler Orchestrator Service via NATS.
Task Execution: Executes assigned tasks. It supports different execution types:
Docker Containers: Runs tasks within Docker containers, pulling necessary images and managing container lifecycle. This includes handling GPU access for containers (e.g., using NVIDIA Container Toolkit).
Scripts: Executes scripts (e.g., shell scripts, Python scripts) directly on the host or within a specified interpreter.
Workspace Management: Creates and manages temporary workspaces for each task to store scripts, input/output files, and logs.
Status Reporting: Publishes task status updates (e.g., preparing, in_progress, completed, failed) to the Scheduler Orchestrator Service via NATS.
Metrics Collection & Reporting: Periodically collects GPU metrics (utilization, temperature, power draw) and system metrics. These are sent as heartbeats to the Provider Registry Service and can also be used for billing updates.
Billing Integration: Sends real-time usage updates (GPU utilization, power draw) to the Billing Payment Service for active sessions.
Graceful Shutdown: Handles termination signals to stop ongoing tasks cleanly (if possible) and deregister from the platform.
2. Tech Stack
The Provider Daemon utilizes the following main technologies:
Programming Language: Go
Messaging Queue: NATS (including JetStream for task reception)
Containerization (for tasks): Docker (interacts via Docker SDK)
GPU Interaction:
NVIDIA: nvidia-smi command-line tool
Apple Silicon (macOS): system_profiler command-line tool
AMD: rocm-smi (basic support planned)
Intel: intel_gpu_top (basic support planned)
System Metrics: gopsutil library (for CPU, memory)
Logging: Zap for structured logging
Configuration: YAML files
3. Core Components and Logic
3.1. Configuration (configs/config.yaml)
The daemon's behavior is controlled by configs/config.yaml. Key parameters include:
instance_id: A unique identifier for this daemon instance (defaults to "provider-" + hostname).
log_level: Logging verbosity (e.g., "info", "debug").
request_timeout: Default timeout for HTTP calls to other services (e.g., Provider Registry).
nats_address: URL of the NATS server.
nats_task_subscription_subject_pattern: NATS subject pattern for receiving tasks (e.g., "tasks.dispatch.%s.*", where %s is the instance_id).
nats_job_status_update_subject_prefix: NATS subject prefix for publishing task status updates (e.g., "jobs.status").
nats_command_timeout: Timeout for NATS operations.
provider_registry_service_name: Name of the Provider Registry service in Consul (used if provider_registry_url is not set).
provider_registry_url: Direct URL to the Provider Registry Service (optional).
provider_heartbeat_interval: Frequency at which heartbeats are sent to the Provider Registry.
workspace_dir: Base directory for creating temporary task workspaces (defaults to a temporary directory).
docker_endpoint: Endpoint for the Docker daemon (e.g., "unix:///var/run/docker.sock").
managed_gpu_ids: (Placeholder) Specific GPU IDs this daemon manages.
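The daemon's actual configuration struct isn't reproduced on this page; as a rough Go sketch, the keys above might map onto something like the following (field names, types, and the time.Duration choices are assumptions, and duration parsing depends on the YAML library used):

```go
package config

import "time"

// Config mirrors configs/config.yaml. Field names, types, and yaml tags are
// illustrative; the real struct in the daemon may differ.
type Config struct {
	InstanceID                         string        `yaml:"instance_id"`
	LogLevel                           string        `yaml:"log_level"`
	RequestTimeout                     time.Duration `yaml:"request_timeout"`
	NATSAddress                        string        `yaml:"nats_address"`
	NATSTaskSubscriptionSubjectPattern string        `yaml:"nats_task_subscription_subject_pattern"`
	NATSJobStatusUpdateSubjectPrefix   string        `yaml:"nats_job_status_update_subject_prefix"`
	NATSCommandTimeout                 time.Duration `yaml:"nats_command_timeout"`
	ProviderRegistryServiceName        string        `yaml:"provider_registry_service_name"`
	ProviderRegistryURL                string        `yaml:"provider_registry_url"`
	ProviderHeartbeatInterval          time.Duration `yaml:"provider_heartbeat_interval"`
	WorkspaceDir                       string        `yaml:"workspace_dir"`
	DockerEndpoint                     string        `yaml:"docker_endpoint"`
	ManagedGPUIDs                      []string      `yaml:"managed_gpu_ids"`
}
```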
3.2. Data Models (internal/models/)
Task: Represents a unit of work received from the Scheduler. It includes:
JobID, UserID, JobType, JobName.
JobParams: A map containing task-specific parameters such as script content, Docker image name, commands, and environment variables.
ExecutionType: Specifies how the task should be run (e.g., "script", "docker").
Resource requirements (GPUTypeNeeded, GPUCountNeeded).
AssignedProviderID: Should match the daemon's InstanceID.
DispatchedAt: Timestamp of when the task was dispatched by the scheduler.
TaskStatusUpdate: Used to report the status of task execution back to the Scheduler. It includes:
JobID, ProviderID (the daemon's instance ID).
Status: Current execution status (e.g., preparing, in_progress, completed, failed).
Timestamp, Message, Progress, ExitCode, ExecutionLog.
JobStatus constants: Define the possible states of a task being executed (e.g., StatusPreparing, StatusInProgress).
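For orientation, here is a minimal Go sketch of these models, with field types and JSON tags assumed from the descriptions above (the actual definitions in internal/models/ may differ):

```go
package models

import "time"

// Task is a unit of work dispatched by the Scheduler Orchestrator.
type Task struct {
	JobID              string                 `json:"job_id"`
	UserID             string                 `json:"user_id"`
	JobType            string                 `json:"job_type"`
	JobName            string                 `json:"job_name"`
	JobParams          map[string]interface{} `json:"job_params"`
	ExecutionType      string                 `json:"execution_type"` // "script" or "docker"
	GPUTypeNeeded      string                 `json:"gpu_type_needed"`
	GPUCountNeeded     int                    `json:"gpu_count_needed"`
	AssignedProviderID string                 `json:"assigned_provider_id"`
	DispatchedAt       time.Time              `json:"dispatched_at"`
}

// TaskStatusUpdate reports execution progress back to the scheduler.
type TaskStatusUpdate struct {
	JobID        string    `json:"job_id"`
	ProviderID   string    `json:"provider_id"`
	Status       string    `json:"status"` // e.g., "preparing", "in_progress"
	Timestamp    time.Time `json:"timestamp"`
	Message      string    `json:"message,omitempty"`
	Progress     float64   `json:"progress,omitempty"`
	ExitCode     int       `json:"exit_code,omitempty"`
	ExecutionLog string    `json:"execution_log,omitempty"`
}
```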
3.3. GPU Detection (internal/gpu/detector.go)
Detector: Manages GPU detection and monitoring.
DetectGPUs(): Orchestrates detection across the different GPU vendors (NVIDIA, AMD, Apple, Intel).
detectNVIDIAGPUs(): Uses nvidia-smi to query GPU details.
detectAMDGPUs(): (Planned) Uses rocm-smi.
detectAppleGPUs(): Uses system_profiler on macOS.
detectIntelGPUs(): (Planned) Uses tools like intel_gpu_top.
GPUInfo: Struct holding detailed information about a detected GPU.
GPUMetrics: Struct for real-time GPU metrics.
GetMetrics(): Collects current metrics from all detected GPUs.
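As an illustration of the nvidia-smi approach, detection can shell out to the tool's CSV query mode and parse the result roughly as follows; this is a sketch, not the daemon's actual parsing code, and the GPUInfo fields shown are simplified stand-ins:

```go
package gpu

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// GPUInfo is a minimal stand-in for the detector's real struct.
type GPUInfo struct {
	Vendor        string
	Model         string
	VRAMMB        int
	DriverVersion string
}

// detectNVIDIAGPUs sketches how nvidia-smi's CSV query mode can be parsed.
func detectNVIDIAGPUs(ctx context.Context) ([]GPUInfo, error) {
	out, err := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=name,memory.total,driver_version",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, fmt.Errorf("nvidia-smi query failed: %w", err)
	}
	var gpus []GPUInfo
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ", ")
		if len(fields) < 3 {
			continue // skip malformed lines
		}
		vramMB, _ := strconv.Atoi(strings.TrimSpace(fields[1]))
		gpus = append(gpus, GPUInfo{
			Vendor:        "nvidia",
			Model:         strings.TrimSpace(fields[0]),
			VRAMMB:        vramMB,
			DriverVersion: strings.TrimSpace(fields[2]),
		})
	}
	return gpus, nil
}
```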
3.4. Task Execution (internal/executor/)
Executor interface: Defines the Execute method that all task executors must implement.
ExecutionResult: Struct holding the outcome of a task (stdout, stderr, exit code, error).
ScriptExecutor:
Executes scripts specified in task.JobParams["script_content"] using an interpreter defined in task.JobParams["script_interpreter"] (e.g., /bin/bash, python3).
Writes the script to a temporary file within the task's workspace.
Supports task execution timeout via context.
DockerExecutor:
Initializes a Docker client.
Pulls the Docker image specified in task.JobParams["docker_image"].
Creates and starts a Docker container with:
Command from task.JobParams["docker_command"].
Environment variables from task.JobParams["docker_env_vars"].
The workspace mounted at /workspace in the container.
GPU access configured via container.DeviceRequest if task.JobParams["docker_gpus"] is set (e.g., "all", or specific device IDs); see the sketch after this list.
Waits for container completion and retrieves logs (stdout, stderr) and the exit code.
Supports task execution timeout via context.
Manages container removal after execution.
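To illustrate the GPU-access piece, a host configuration built with the Docker SDK might look like the following sketch, assuming the NVIDIA Container Toolkit is installed; buildHostConfig and its parameters are illustrative names, not the daemon's real API:

```go
package executor

import (
	"strings"

	"github.com/docker/docker/api/types/container"
)

// buildHostConfig sketches how the task workspace mount and GPU access could
// be assembled. gpus mirrors task.JobParams["docker_gpus"] ("all" or a
// comma-separated list of device IDs).
func buildHostConfig(workspaceDir, gpus string) *container.HostConfig {
	hc := &container.HostConfig{
		// Mount the task workspace at /workspace inside the container.
		Binds: []string{workspaceDir + ":/workspace"},
	}
	if gpus != "" {
		req := container.DeviceRequest{
			Driver:       "nvidia",
			Capabilities: [][]string{{"gpu"}}, // requires the NVIDIA Container Toolkit
		}
		if gpus == "all" {
			req.Count = -1 // expose every GPU visible to the host
		} else {
			req.DeviceIDs = strings.Split(gpus, ",") // expose specific devices
		}
		hc.DeviceRequests = []container.DeviceRequest{req}
	}
	return hc
}
```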
3.5. Task Handling (internal/tasks/handler.go)
Handler: Orchestrates task processing.
HandleTask():
Receives a models.Task.
Publishes the StatusPreparing status.
Creates a temporary workspace for the task.
Selects the appropriate executor (Script or Docker) based on task.ExecutionType.
Publishes the StatusInProgress status.
Calls the selected executor's Execute method.
Publishes the final status (StatusCompleted or StatusFailed) with logs and exit code.
Cleans up the workspace.
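A condensed, self-contained sketch of that flow is shown below; the types are deliberately simplified stand-ins, and the real Handler, Executor, and status-reporting signatures in the daemon will differ:

```go
package tasks

import (
	"context"
	"os"
)

// Minimal stand-ins for the types described in sections 3.2 and 3.4.
type Task struct {
	JobID         string
	ExecutionType string // "script" or "docker"
}

type ExecutionResult struct {
	Stdout, Stderr string
	ExitCode       int
}

type Executor interface {
	Execute(ctx context.Context, task Task, workspaceDir string) (ExecutionResult, error)
}

// StatusReporter is the role the NATS client plays for the handler.
type StatusReporter interface {
	PublishStatus(ctx context.Context, jobID, status, message string)
}

// Handler wires the executors and a status reporter together.
type Handler struct {
	script, docker Executor
	reporter       StatusReporter
	workspaceBase  string
}

// HandleTask sketches the flow described above: preparing -> in_progress ->
// completed/failed, with workspace creation and cleanup around execution.
func (h *Handler) HandleTask(ctx context.Context, task Task) error {
	h.reporter.PublishStatus(ctx, task.JobID, "preparing", "")

	workspace, err := os.MkdirTemp(h.workspaceBase, "task-"+task.JobID+"-")
	if err != nil {
		h.reporter.PublishStatus(ctx, task.JobID, "failed", err.Error())
		return err
	}
	defer os.RemoveAll(workspace) // clean up the workspace when done

	exec := h.script
	if task.ExecutionType == "docker" {
		exec = h.docker
	}

	h.reporter.PublishStatus(ctx, task.JobID, "in_progress", "")
	res, err := exec.Execute(ctx, task, workspace)
	if err != nil || res.ExitCode != 0 {
		h.reporter.PublishStatus(ctx, task.JobID, "failed", res.Stderr)
		return err
	}
	h.reporter.PublishStatus(ctx, task.JobID, "completed", res.Stdout)
	return nil
}
```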
3.6. NATS Client (internal/nats/client.go)
Client: Manages the NATS connection and subscriptions. It establishes the connection with retry logic and obtains a JetStream context.
StartListening(): Creates a JetStream pull subscription on the task dispatch subject (e.g., tasks.dispatch.{instance_id}.*).
messageFetchLoop(): Continuously fetches messages from the pull subscription.
handleMessage():
Unmarshals the received models.Task.
Validates that task.AssignedProviderID matches the daemon's InstanceID.
Calls the registered TaskHandlerFunc (which is tasks.Handler.HandleTask).
Acknowledges (ACKs) or negatively acknowledges (NAKs) the NATS message based on the processing outcome.
PublishStatus(): Publishes models.TaskStatusUpdate messages to a NATS subject (e.g., jobs.status.{job_id}).
Stop(): Gracefully drains the NATS subscription and closes the connection.
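A minimal sketch of the pull-subscription fetch loop using the nats.go client is shown below; the batch size, wait time, and durable name are assumptions, and the real client wraps this in its own retry and logging logic:

```go
package natsclient

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// fetchLoop sketches a JetStream pull-subscription loop: fetch a small batch,
// hand each message to the task handler, then ACK or NAK based on the outcome.
func fetchLoop(js nats.JetStreamContext, subject, durable string,
	handle func(data []byte) error, stop <-chan struct{}) error {

	sub, err := js.PullSubscribe(subject, durable)
	if err != nil {
		return err
	}
	defer sub.Drain() // drain gracefully on exit

	for {
		select {
		case <-stop:
			return nil
		default:
		}

		msgs, err := sub.Fetch(1, nats.MaxWait(5*time.Second))
		if err != nil {
			if errors.Is(err, nats.ErrTimeout) {
				continue // no work available; keep polling
			}
			log.Printf("fetch error: %v", err)
			continue
		}
		for _, msg := range msgs {
			if err := handle(msg.Data); err != nil {
				_ = msg.Nak() // ask JetStream to redeliver later
				continue
			}
			_ = msg.Ack()
		}
	}
}
```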
3.7. Billing Client (internal/billing/client.go)
Client: Handles communication with the Billing Payment Service.
SendUsageUpdate(): Sends real-time usage data (GPU utilization, VRAM utilization, power draw, temperature) for an active session.
Monitor(): Periodically collects GPU metrics and sends usage updates for a given session; see the sketch below.
Placeholder methods exist for StartSession and EndSession notifications to the billing service.
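A rough sketch of what Monitor()'s collect-and-report loop could look like follows; the UsageUpdate fields, the 30-second interval, and the callback shape are assumptions rather than the client's actual API:

```go
package billing

import (
	"context"
	"time"
)

// UsageUpdate carries the real-time fields mentioned above; the actual struct
// in internal/billing/client.go may differ.
type UsageUpdate struct {
	SessionID       string  `json:"session_id"`
	GPUUtilization  float64 `json:"gpu_utilization"`
	VRAMUtilization float64 `json:"vram_utilization"`
	PowerDrawWatts  float64 `json:"power_draw_watts"`
	TemperatureC    float64 `json:"temperature_c"`
}

// Monitor sketches the periodic collect-and-report loop for one session.
// collect and send stand in for the GPU detector's GetMetrics() and the
// client's SendUsageUpdate(); the 30-second interval is an assumption.
func Monitor(ctx context.Context, sessionID string,
	collect func() UsageUpdate, send func(context.Context, UsageUpdate) error) {

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // session ended or daemon shutting down
		case <-ticker.C:
			u := collect()
			u.SessionID = sessionID
			if err := send(ctx, u); err != nil {
				continue // a failed update is non-fatal; the next tick retries
			}
		}
	}
}
```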
4. Workflow
Startup (cmd/daemon/main.go or cmd/provider/main.go):
Loads configuration.
Initializes the logger.
Initializes the GPU detector and detects available GPUs.
Initializes the Billing client.
Initializes the ScriptExecutor and DockerExecutor.
Initializes the Task Handler, passing it the executors.
Initializes the NATS client, passing it the task handler function (taskHandlerInstance.HandleTask).
Sets the NATS client as the status reporter for the Task Handler.
Starts the NATS listener for task messages.
(Implicit) cmd/provider/main.go also includes logic for provider registration with the registry and for periodic heartbeat/metrics updates.
Task Reception & Handling:
The nats.Client fetches a task message.
The message is unmarshalled into a models.Task.
If the AssignedProviderID in the task matches the daemon's InstanceID, tasks.Handler.HandleTask is called.
HandleTask creates a workspace, selects an executor, and runs the task.
During execution, status updates (Preparing, InProgress) are published via nats.Client.PublishStatus.
If the task involves billing (e.g., a session ID is present in JobParams), the DockerExecutor (or potentially other executors) starts billing monitoring via the billing.Client.
Task Completion/Failure:
The executor returns an ExecutionResult.
HandleTask publishes the final status (Completed or Failed) with logs and exit code.
The NATS message for the task is ACK'd or NAK'd.
The workspace is cleaned up.
Heartbeat & Metrics (from cmd/provider/main.go):
Periodically, the daemon collects system and GPU metrics.
It sends a heartbeat along with these metrics to the Provider Registry Service.
Shutdown:
Receives a termination signal (SIGINT, SIGTERM).
The NATS client gracefully drains its subscription and closes the connection.
Ongoing tasks might be cancelled via their contexts.
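A minimal sketch of that shutdown wiring, assuming a signal-aware context drives both the NATS listener and task cancellation (the natsClient stub stands in for the daemon's internal/nats client):

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
)

// natsClient is a stand-in for the daemon's internal/nats Client, shown only
// to illustrate the shutdown wiring.
type natsClient struct{}

func (c *natsClient) StartListening(ctx context.Context) { <-ctx.Done() }
func (c *natsClient) Stop()                              { /* drain subscription, close connection */ }

func main() {
	// A signal-aware context: cancelled on SIGINT/SIGTERM, which stops the
	// NATS fetch loop and cancels in-flight task contexts derived from it.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	client := &natsClient{}
	go client.StartListening(ctx)

	<-ctx.Done() // block until a termination signal arrives
	log.Println("shutting down: draining NATS subscription")
	client.Stop()
}
```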
5. Future Considerations
Resource Management: Implement more sophisticated local resource management to honor managed_gpu_ids and to prevent overloading the provider machine when it runs multiple tasks concurrently.
Advanced GPU Metrics: Expand GPU metric collection for AMD and Intel, and potentially gather more detailed NVIDIA metrics via NVIDIA Management Library (NVML) bindings if nvidia-smi parsing becomes a bottleneck or proves insufficient.
Task Sandboxing: Enhance security for script execution beyond temporary workspaces (e.g., using runsc (gVisor) or other sandboxing technologies).
Input/Output File Handling: Integrate with the Storage Service for downloading input data and uploading results, rather than relying on direct URLs in JobParams.
Direct Provider-Scheduler Communication: Consider gRPC for more direct communication for certain actions if NATS proves limiting for specific use cases.
Resilience: Improve error handling and retry mechanisms for task execution and reporting.
Plugin System for Executors: Allow new execution types to be added more easily.
Consul Integration: While cmd/provider/main.go implies registration with a provider registry, the daemon code itself (the provider-daemon directory) does not perform direct Consul registration. If the daemon is intended to be a standalone service discoverable via Consul, this would need to be added; for now, its primary discovery path is the Provider Registry.