Provider Daemon - Leah

The Provider Daemon is a software component that runs on a GPU provider's machine. It makes the provider's GPU resources available to the Dante GPU Platform, executes tasks assigned by the Scheduler Orchestrator, and reports status and metrics.


1. Responsibilities

The Provider Daemon has several key responsibilities:

  • GPU Detection: Automatically detects available GPUs on the host machine, including NVIDIA, AMD, and Apple Silicon (on macOS). It gathers details like model name, VRAM, and driver version.

  • Provider Registration: Registers the host machine and its GPUs with the Provider Registry Service. Registration is initiated from cmd/provider/main.go, which interacts with the registry service; the direct registration call is not shown in the daemon code itself, but it is implied by the architecture.

  • Task Subscription: Listens for incoming task assignments from the Scheduler Orchestrator Service via NATS.

  • Task Execution: Executes assigned tasks. It supports different execution types:

    • Docker Containers: Runs tasks within Docker containers, pulling necessary images and managing container lifecycle. This includes handling GPU access for containers (e.g., using NVIDIA Container Toolkit).

    • Scripts: Executes scripts (e.g., shell scripts, Python scripts) directly on the host or within a specified interpreter.

  • Workspace Management: Creates and manages temporary workspaces for each task to store scripts, input/output files, and logs.

  • Status Reporting: Publishes task status updates (e.g., preparing, in_progress, completed, failed) to the Scheduler Orchestrator Service via NATS.

  • Metrics Collection & Reporting: Periodically collects GPU metrics (utilization, temperature, power draw) and system metrics. These are sent as heartbeats to the Provider Registry Service and can also be used for billing updates.

  • Billing Integration: Sends real-time usage updates (GPU utilization, power draw) to the Billing Payment Service for active sessions.

  • Graceful Shutdown: Handles termination signals to stop ongoing tasks cleanly (if possible) and deregister from the platform.


2. Tech Stack

The Provider Daemon utilizes the following main technologies:

  • Programming Language: Go

  • Messaging Queue: NATS (including JetStream for task reception)

  • Containerization (for tasks): Docker (accessed via the Docker SDK for Go)

  • GPU Interaction:

    • NVIDIA: nvidia-smi command-line tool

    • Apple Silicon (macOS): system_profiler command-line tool

    • AMD: rocm-smi (basic support planned)

    • Intel: intel_gpu_top (basic support planned)

  • System Metrics: gopsutil library (for CPU, memory)

  • Logging: Zap for structured logging

  • Configuration: YAML files


3. Core Components and Logic

3.1. Configuration (configs/config.yaml)

The daemon's behavior is controlled by configs/config.yaml; a sketch of a matching Go config struct follows this list. Key parameters include:

  • instance_id: A unique identifier for this daemon instance (defaults to "provider-" + hostname).

  • log_level: Logging verbosity (e.g., "info", "debug").

  • request_timeout: Default timeout for HTTP calls to other services (e.g., Provider Registry).

  • nats_address: URL of the NATS server.

  • nats_task_subscription_subject_pattern: NATS subject pattern for receiving tasks (e.g., "tasks.dispatch.%s.*", where %s is the instance_id).

  • nats_job_status_update_subject_prefix: NATS subject prefix for publishing task status updates (e.g., "jobs.status").

  • nats_command_timeout: Timeout for NATS operations.

  • provider_registry_service_name: Name of the Provider Registry service in Consul (used if provider_registry_url is not set).

  • provider_registry_url: Direct URL to the Provider Registry Service (optional).

  • provider_heartbeat_interval: Frequency at which heartbeats are sent to the Provider Registry.

  • workspace_dir: Base directory for creating temporary task workspaces (defaults to a temporary directory).

  • docker_endpoint: Endpoint for the Docker daemon (e.g., "unix:///var/run/docker.sock").

  • managed_gpu_ids: (Placeholder) Specific GPU IDs this daemon manages.
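
To make the mapping concrete, below is a minimal sketch of a Go struct these keys could unmarshal into, assuming gopkg.in/yaml.v3. The field names, the Load helper, and the choice to keep durations as strings are illustrative assumptions, not the daemon's actual code.

```go
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Config is a hypothetical struct matching the documented keys. Duration
// fields are kept as strings (e.g. "30s") and would be parsed with
// time.ParseDuration at startup.
type Config struct {
	InstanceID                         string   `yaml:"instance_id"`
	LogLevel                           string   `yaml:"log_level"`
	RequestTimeout                     string   `yaml:"request_timeout"`
	NATSAddress                        string   `yaml:"nats_address"`
	NATSTaskSubscriptionSubjectPattern string   `yaml:"nats_task_subscription_subject_pattern"`
	NATSJobStatusUpdateSubjectPrefix   string   `yaml:"nats_job_status_update_subject_prefix"`
	NATSCommandTimeout                 string   `yaml:"nats_command_timeout"`
	ProviderRegistryServiceName        string   `yaml:"provider_registry_service_name"`
	ProviderRegistryURL                string   `yaml:"provider_registry_url"`
	ProviderHeartbeatInterval          string   `yaml:"provider_heartbeat_interval"`
	WorkspaceDir                       string   `yaml:"workspace_dir"`
	DockerEndpoint                     string   `yaml:"docker_endpoint"`
	ManagedGPUIDs                      []string `yaml:"managed_gpu_ids"`
}

// Load reads and parses the YAML config file at path.
func Load(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```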

3.2. Data Models (internal/models/)

  • Task: Represents a unit of work received from the Scheduler. It includes:

    • JobID, UserID, JobType, JobName

    • JobParams: A map containing task-specific parameters like script content, Docker image name, commands, environment variables.

    • ExecutionType: Specifies how the task should be run (e.g., "script", "docker").

    • Resource requirements (GPUTypeNeeded, GPUCountNeeded).

    • AssignedProviderID: Should match the daemon's InstanceID.

    • DispatchedAt: Timestamp of when the task was dispatched by the scheduler.

  • TaskStatusUpdate: Used to report the status of task execution back to the Scheduler. It includes:

    • JobID, ProviderID (daemon's instance ID).

    • Status: Current execution status (e.g., preparing, in_progress, completed, failed).

    • Timestamp, Message, Progress, ExitCode, ExecutionLog.

  • JobStatus Constants: Defines the possible states of a task being executed (e.g., StatusPreparing, StatusInProgress).
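
A sketch of these models as Go structs; field types and JSON tags are assumptions where the document does not state them, and the status strings follow the values quoted above.

```go
package models

import "time"

// JobStatus enumerates the execution states a task moves through.
type JobStatus string

const (
	StatusPreparing  JobStatus = "preparing"
	StatusInProgress JobStatus = "in_progress"
	StatusCompleted  JobStatus = "completed"
	StatusFailed     JobStatus = "failed"
)

// Task is a unit of work dispatched by the Scheduler Orchestrator.
type Task struct {
	JobID              string                 `json:"job_id"`
	UserID             string                 `json:"user_id"`
	JobType            string                 `json:"job_type"`
	JobName            string                 `json:"job_name"`
	JobParams          map[string]interface{} `json:"job_params"`
	ExecutionType      string                 `json:"execution_type"` // "script" or "docker"
	GPUTypeNeeded      string                 `json:"gpu_type_needed"`
	GPUCountNeeded     int                    `json:"gpu_count_needed"`
	AssignedProviderID string                 `json:"assigned_provider_id"`
	DispatchedAt       time.Time              `json:"dispatched_at"`
}

// TaskStatusUpdate reports execution progress back to the scheduler.
type TaskStatusUpdate struct {
	JobID        string    `json:"job_id"`
	ProviderID   string    `json:"provider_id"`
	Status       JobStatus `json:"status"`
	Timestamp    time.Time `json:"timestamp"`
	Message      string    `json:"message,omitempty"`
	Progress     float64   `json:"progress,omitempty"`
	ExitCode     int       `json:"exit_code,omitempty"`
	ExecutionLog string    `json:"execution_log,omitempty"`
}
```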

3.3. GPU Detection (internal/gpu/detector.go)

  • Detector: Manages GPU detection and monitoring.

  • DetectGPUs(): Orchestrates detection across different GPU vendors (NVIDIA, AMD, Apple, Intel).

    • detectNVIDIAGPUs(): Uses nvidia-smi to query GPU details.

    • detectAMDGPUs(): (Planned) Uses rocm-smi.

    • detectAppleGPUs(): Uses system_profiler on macOS.

    • detectIntelGPUs(): (Planned) Uses tools like intel_gpu_top.

  • GPUInfo: Struct to hold detailed information about a detected GPU.

  • GPUMetrics: Struct for real-time GPU metrics.

  • GetMetrics(): Collects current metrics from all detected GPUs.
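
As an illustration of the nvidia-smi path, the sketch below queries a few fields in CSV mode and parses one GPU per line. Which fields the daemon actually requests, and the GPUInfo field names shown here, are assumptions.

```go
package gpu

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// GPUInfo holds a subset of the fields the detector gathers; illustrative.
type GPUInfo struct {
	Index         string
	ModelName     string
	VRAMTotalMB   int
	DriverVersion string
}

// detectNVIDIAGPUs queries nvidia-smi in CSV mode and parses one GPU per line.
func detectNVIDIAGPUs(ctx context.Context) ([]GPUInfo, error) {
	out, err := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=index,name,memory.total,driver_version",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		// nvidia-smi missing or failing usually means no usable NVIDIA GPU.
		return nil, fmt.Errorf("nvidia-smi query failed: %w", err)
	}
	var gpus []GPUInfo
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ", ")
		if len(fields) < 4 {
			continue // skip malformed lines
		}
		vramMB, _ := strconv.Atoi(strings.TrimSpace(fields[2]))
		gpus = append(gpus, GPUInfo{
			Index:         strings.TrimSpace(fields[0]),
			ModelName:     strings.TrimSpace(fields[1]),
			VRAMTotalMB:   vramMB,
			DriverVersion: strings.TrimSpace(fields[3]),
		})
	}
	return gpus, nil
}
```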

3.4. Task Execution (internal/executor/)

  • Executor Interface: Defines the Execute method that all task executors must implement.

  • ExecutionResult: Struct to hold the outcome of a task (stdout, stderr, exit code, error).

  • ScriptExecutor:

    • Executes scripts specified in task.JobParams["script_content"] using an interpreter defined in task.JobParams["script_interpreter"] (e.g., /bin/bash, python3).

    • Writes the script to a temporary file within the task's workspace.

    • Supports task execution timeout via context.

  • DockerExecutor:

    • Initializes a Docker client.

    • Pulls the Docker image specified in task.JobParams["docker_image"].

    • Creates and starts a Docker container with:

      • Command from task.JobParams["docker_command"].

      • Environment variables from task.JobParams["docker_env_vars"].

      • Workspace mounted into /workspace in the container.

      • GPU access configured via container.DeviceRequest if task.JobParams["docker_gpus"] is set (e.g., "all", or specific device IDs).

    • Waits for container completion and retrieves logs (stdout, stderr) and exit code.

    • Supports task execution timeout via context.

    • Manages container removal after execution.
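
The following sketch distills the script path described above; the function shape and file name are assumptions, while the documented behavior (write script_content into the workspace, run it with the configured interpreter, honor the context timeout) is kept intact.

```go
package executor

import (
	"bytes"
	"context"
	"os"
	"os/exec"
	"path/filepath"
)

// ExecutionResult holds a task's outcome, as described above.
type ExecutionResult struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runScript is a hypothetical distillation of ScriptExecutor.Execute: write
// the script into the workspace, run it with the given interpreter, and let
// context cancellation enforce the task timeout.
func runScript(ctx context.Context, workspace, interpreter, scriptContent string) (*ExecutionResult, error) {
	scriptPath := filepath.Join(workspace, "task_script")
	if err := os.WriteFile(scriptPath, []byte(scriptContent), 0o700); err != nil {
		return nil, err
	}
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, interpreter, scriptPath) // e.g. /bin/bash, python3
	cmd.Dir = workspace
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	runErr := cmd.Run() // the process is killed if ctx expires
	exitCode := -1
	if cmd.ProcessState != nil {
		exitCode = cmd.ProcessState.ExitCode()
	}
	return &ExecutionResult{
		Stdout:   stdout.String(),
		Stderr:   stderr.String(),
		ExitCode: exitCode,
	}, runErr
}
```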
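
For the Docker path, GPU access boils down to a container.DeviceRequest in the container's HostConfig, equivalent to what docker run --gpus does. A hedged sketch of how docker_gpus might be translated (the helper itself is hypothetical):

```go
package executor

import (
	"strings"

	"github.com/docker/docker/api/types/container"
)

// gpuHostConfig is a hypothetical helper translating the docker_gpus job
// parameter into the HostConfig the container is created with.
func gpuHostConfig(dockerGPUs, workspace string) *container.HostConfig {
	hc := &container.HostConfig{
		// Mount the task workspace at /workspace inside the container.
		Binds: []string{workspace + ":/workspace"},
	}
	if dockerGPUs == "" {
		return hc // no GPU access requested
	}
	req := container.DeviceRequest{
		Driver:       "nvidia",
		Capabilities: [][]string{{"gpu"}},
	}
	if dockerGPUs == "all" {
		req.Count = -1 // -1 means "all available GPUs", as with --gpus all
	} else {
		req.DeviceIDs = strings.Split(dockerGPUs, ",") // specific device IDs
	}
	hc.DeviceRequests = []container.DeviceRequest{req}
	return hc
}
```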

3.5. Task Handling (internal/tasks/handler.go)

  • Handler: Orchestrates task processing.

  • HandleTask():

    1. Receives a models.Task.

    2. Publishes StatusPreparing status.

    3. Creates a temporary workspace for the task.

    4. Selects the appropriate executor (Script or Docker) based on task.ExecutionType.

    5. Publishes StatusInProgress status.

    6. Calls the selected executor's Execute method.

    7. Publishes final status (StatusCompleted or StatusFailed) with logs and exit code.

    8. Cleans up the workspace.
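
Condensed into code, the eight steps look roughly like the sketch below. The Handler fields, the report helper, and the StatusReporter interface are assumptions built on the sketches in sections 3.2 and 3.4; log and exit-code propagation is omitted for brevity.

```go
package tasks

import (
	"context"
	"fmt"
	"os"
	"time"
	// (imports of the internal models and executor packages omitted)
)

// StatusReporter is satisfied by the NATS client from section 3.6.
type StatusReporter interface {
	PublishStatus(update models.TaskStatusUpdate) error
}

// Handler orchestrates task processing; the field names are assumptions.
type Handler struct {
	providerID     string
	workspaceDir   string
	scriptExecutor executor.Executor
	dockerExecutor executor.Executor
	reporter       StatusReporter
}

// HandleTask walks the eight steps above in condensed form.
func (h *Handler) HandleTask(ctx context.Context, task *models.Task) error {
	h.report(task.JobID, models.StatusPreparing, "creating workspace") // step 2

	workspace, err := os.MkdirTemp(h.workspaceDir, "task-"+task.JobID+"-") // step 3
	if err != nil {
		h.report(task.JobID, models.StatusFailed, err.Error())
		return err
	}
	defer os.RemoveAll(workspace) // step 8: always clean up

	var exe executor.Executor
	switch task.ExecutionType { // step 4: pick an executor
	case "docker":
		exe = h.dockerExecutor
	case "script":
		exe = h.scriptExecutor
	default:
		err := fmt.Errorf("unknown execution type %q", task.ExecutionType)
		h.report(task.JobID, models.StatusFailed, err.Error())
		return err
	}

	h.report(task.JobID, models.StatusInProgress, "executing") // step 5
	if _, err := exe.Execute(ctx, task, workspace); err != nil { // step 6
		h.report(task.JobID, models.StatusFailed, err.Error()) // step 7
		return err
	}
	h.report(task.JobID, models.StatusCompleted, "done") // step 7
	return nil
}

func (h *Handler) report(jobID string, status models.JobStatus, msg string) {
	_ = h.reporter.PublishStatus(models.TaskStatusUpdate{
		JobID:      jobID,
		ProviderID: h.providerID,
		Status:     status,
		Timestamp:  time.Now(),
		Message:    msg,
	})
}
```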

3.6. NATS Client (internal/nats/client.go)

  • Client: Manages NATS connection and subscriptions.

  • Establishes a connection to NATS with retry logic and obtains a JetStream context.

  • StartListening(): Creates a JetStream Pull Subscription to the task dispatch subject (e.g., tasks.dispatch.{instance_id}.*).

  • messageFetchLoop(): Continuously fetches messages from the pull subscription.

  • handleMessage():

    • Unmarshals the received models.Task.

    • Verifies that task.AssignedProviderID matches the daemon's InstanceID.

    • Calls the registered TaskHandlerFunc (which is tasks.Handler.HandleTask).

    • Acknowledges (ACKs) or negatively acknowledges (NAKs) the NATS message based on processing outcome.

  • PublishStatus(): Publishes models.TaskStatusUpdate messages to a NATS subject (e.g., jobs.status.{job_id}).

  • Stop(): Gracefully drains the NATS subscription and closes the connection.
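
A sketch of the fetch/ack loop using the nats.go JetStream API; the batch size, wait time, and the Term-on-malformed choice are illustrative assumptions.

```go
package natsclient

import (
	"encoding/json"
	"errors"
	"time"

	"github.com/nats-io/nats.go"
	// (import of the internal models package omitted)
)

// messageFetchLoop pulls task messages and dispatches them to the handler,
// ACKing, NAKing, or terminating each message based on the outcome.
func messageFetchLoop(sub *nats.Subscription, instanceID string,
	handle func(models.Task) error, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		msgs, err := sub.Fetch(1, nats.MaxWait(5*time.Second))
		if err != nil {
			if errors.Is(err, nats.ErrTimeout) {
				continue // no task available yet; poll again
			}
			return // connection-level failure; caller handles reconnect
		}
		for _, msg := range msgs {
			var task models.Task
			if err := json.Unmarshal(msg.Data, &task); err != nil {
				_ = msg.Term() // malformed payload; never redeliver
				continue
			}
			if task.AssignedProviderID != instanceID {
				_ = msg.Nak() // not addressed to this daemon
				continue
			}
			if err := handle(task); err != nil {
				_ = msg.Nak() // processing failed; request redelivery
			} else {
				_ = msg.Ack()
			}
		}
	}
}
```

The subscription itself would be created beforehand with js.PullSubscribe on the tasks.dispatch.{instance_id}.* subject after obtaining the JetStream context from the connection; the durable consumer name used for it is another assumption.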

3.7. Billing Client (internal/billing/client.go)

  • Client: Handles communication with the Billing Payment Service.

  • SendUsageUpdate(): Sends real-time usage data (GPU utilization, VRAM utilization, power draw, temperature) for an active session.

  • Monitor(): Periodically collects GPU metrics and sends usage updates for a given session.

  • Placeholder methods for StartSession and EndSession notifications to the billing service.
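
A minimal sketch of what SendUsageUpdate could look like over HTTP; the JSON field names and the /usage endpoint path are assumptions, not the Billing Payment Service's actual API.

```go
package billing

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// UsageUpdate mirrors the metrics section 3.7 says are reported.
type UsageUpdate struct {
	SessionID       string    `json:"session_id"`
	GPUUtilization  float64   `json:"gpu_utilization"`
	VRAMUtilization float64   `json:"vram_utilization"`
	PowerDrawWatts  float64   `json:"power_draw_watts"`
	TemperatureC    float64   `json:"temperature_c"`
	Timestamp       time.Time `json:"timestamp"`
}

// Client talks to the Billing Payment Service over HTTP.
type Client struct {
	baseURL string
	http    *http.Client
}

// SendUsageUpdate posts one usage sample for an active session.
func (c *Client) SendUsageUpdate(ctx context.Context, u UsageUpdate) error {
	body, err := json.Marshal(u)
	if err != nil {
		return err
	}
	// Hypothetical endpoint path; the real route is defined by the service.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		c.baseURL+"/usage", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.http.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("billing service returned %s", resp.Status)
	}
	return nil
}
```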


4. Workflow

  1. Startup (cmd/daemon/main.go or cmd/provider/main.go; a wiring sketch follows this workflow list):

    • Loads configuration.

    • Initializes logger.

    • Initializes GPU detector and detects available GPUs.

    • Initializes Billing client.

    • Initializes ScriptExecutor and DockerExecutor.

    • Initializes Task Handler, passing the executors.

    • Initializes NATS client, passing the task handler function (taskHandlerInstance.HandleTask).

    • Sets the NATS client as the status reporter for the Task Handler.

    • Starts the NATS listener for task messages.

    • (Implicit) cmd/provider/main.go also includes logic for provider registration with the registry and periodic heartbeat/metrics updates.

  2. Task Reception & Handling:

    • The nats.Client fetches a task message.

    • The message is unmarshalled into a models.Task.

    • If the AssignedProviderID in the task matches the daemon's InstanceID, tasks.Handler.HandleTask is called.

    • HandleTask creates a workspace, selects an executor, and runs the task.

    • During execution, status updates (Preparing, InProgress) are published via nats.Client.PublishStatus.

    • If the task involves billing (e.g., a session ID is present in JobParams), the DockerExecutor (or potentially other executors) will start billing monitoring via the billing.Client.

  3. Task Completion/Failure:

    • The executor returns an ExecutionResult.

    • HandleTask publishes the final status (Completed or Failed) with logs and exit code.

    • The NATS message for the task is ACK'd or NAK'd.

    • The workspace is cleaned up.

  4. Heartbeat & Metrics (from cmd/provider/main.go):

    • Periodically, the daemon collects system and GPU metrics.

    • It sends a heartbeat along with these metrics to the Provider Registry Service.

  5. Shutdown:

    • Receives a termination signal (SIGINT, SIGTERM).

    • The NATS client gracefully drains its subscription and closes the connection.

    • Ongoing tasks might be cancelled via their contexts.
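
To tie the workflow together, here is a condensed sketch of the startup wiring in step 1. The constructor names and the internal import paths (omitted) are assumptions that follow the components described in section 3.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	"go.uber.org/zap"
	// (imports of the internal config, gpu, billing, executor, tasks, and
	// nats packages are omitted; their paths depend on the module)
)

func main() {
	cfg, err := config.Load("configs/config.yaml")
	if err != nil {
		log.Fatal(err)
	}
	logger, _ := zap.NewProduction() // level selection from cfg.LogLevel omitted
	defer logger.Sync()

	// Detect GPUs before advertising anything to the platform.
	detector := gpu.NewDetector(logger)
	gpus, err := detector.DetectGPUs()
	if err != nil {
		logger.Fatal("gpu detection failed", zap.Error(err))
	}
	logger.Info("detected GPUs", zap.Int("count", len(gpus)))

	// Wire executors, task handler, and NATS client together.
	billingClient := billing.NewClient(cfg, logger)
	scriptExec := executor.NewScriptExecutor(logger)
	dockerExec := executor.NewDockerExecutor(cfg.DockerEndpoint, billingClient, logger)
	handler := tasks.NewHandler(cfg, scriptExec, dockerExec, logger)

	natsClient, err := natsclient.NewClient(cfg, handler.HandleTask, logger)
	if err != nil {
		logger.Fatal("nats connect failed", zap.Error(err))
	}
	handler.SetStatusReporter(natsClient) // NATS client publishes status updates
	if err := natsClient.StartListening(); err != nil {
		logger.Fatal("task subscription failed", zap.Error(err))
	}

	// Step 5: block until SIGINT/SIGTERM, then drain NATS and exit.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig
	natsClient.Stop()
}
```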


5. Future Considerations

  • Resource Management: Implement more sophisticated local resource management to handle managed_gpu_ids and prevent overloading the provider machine if it runs multiple tasks concurrently.

  • Advanced GPU Metrics: Expand GPU metric collection for AMD and Intel, and potentially more detailed NVIDIA metrics using NVIDIA Management Library (NVML) bindings if nvidia-smi parsing becomes a bottleneck or insufficient.

  • Task Sandboxing: Enhance security for script execution beyond temporary workspaces (e.g., using runsc (gVisor) or other sandboxing technologies).

  • Input/Output File Handling: Integrate with the Storage Service for downloading input data and uploading results, rather than relying on direct URLs in JobParams.

  • Direct Provider-Scheduler Communication: Consider gRPC for certain actions if NATS proves limiting for specific use cases.

  • Resilience: Improve error handling and retry mechanisms for task execution and reporting.

  • Plugin System for Executors: Allow new execution types to be added more easily.

  • Consul Integration: cmd/provider/main.go implies registration with a provider registry, but the daemon code in the provider-daemon directory does not register directly with Consul. If the daemon is meant to be a standalone service discoverable via Consul, that registration would need to be added; today its primary discovery path is the Provider Registry.
