Monitoring Logging Service - Lucia
The Monitoring Logging Service is responsible for collecting, aggregating, and providing access to logs and metrics from all other services within the Dante GPU Platform. This centralized observability stack is crucial for debugging, performance analysis, and ensuring the overall health and reliability of the platform.
This service is primarily composed of a suite of well-known open-source tools configured to work together, rather than a custom-built Go application.
1. Responsibilities
The key responsibilities of the Monitoring Logging Service stack are:
Log Collection: Aggregating logs from all microservices in the Dante GPU Platform.
Log Storage and Indexing: Storing logs efficiently and indexing them for fast querying and analysis.
Metrics Collection: Scraping metrics exposed by various services (e.g., via Prometheus endpoints).
Metrics Storage: Storing time-series metrics data.
Visualization: Providing dashboards for visualizing logs and metrics to understand system behavior and performance.
Alerting: Defining and managing alerts based on specific log patterns or metric thresholds to notify administrators of potential issues.
Service Discovery Integration: Leveraging Consul to discover services that need to be monitored or from which logs need to be collected (e.g., Prometheus using Consul service discovery).
2. Tech Stack
The Monitoring Logging Service is composed of the following primary tools:
Log Collection Agent: Promtail
Collects logs from local files (e.g., container logs, service log files).
Discovers log sources, often via Docker labels or Kubernetes service discovery.
Forwards logs to Loki.
Log Aggregation & Storage: Loki
Stores logs and indexes their metadata (labels) rather than full text, making it cost-effective and horizontally scalable.
Metrics Collection & Storage: Prometheus
Scrapes metrics from HTTP endpoints exposed by services.
Stores metrics as time-series data.
Provides a powerful query language (PromQL).
Visualization: Grafana
Connects to Loki and Prometheus as data sources.
Allows creation of dashboards to visualize logs and metrics.
Alerting: Alertmanager (typically used with Prometheus)
Handles alerts fired by Prometheus.
Manages deduplication, grouping, and routing of alerts via various notification channels (e.g., email, Slack, PagerDuty).
Orchestration: Docker Compose is used to define and run this multi-container application stack.
3. Core Components and Configuration
3.1. Promtail (promtail/promtail-config.yml)
Server Configuration: Defines the HTTP and gRPC listen ports.
Positions: Specifies a file to store the read positions for log files, ensuring logs are not re-read after restarts.
Clients: Configures the Loki instance(s) to send logs to (URL).
Scrape Configs: Defines how Promtail discovers and scrapes logs. This often includes:
static_configs: For fixed log file paths.
docker_sd_configs: To discover logs from Docker containers based on labels. This is a common setup for microservices running in Docker. Promtail would look for containers managed by the local Docker daemon.
kubernetes_sd_configs: For discovering logs in a Kubernetes environment.
Each job specifies labels to apply to the logs (e.g., job, hostname).
relabel_configs: Allows manipulation of labels before logs are sent to Loki.
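To make this concrete, a minimal promtail-config.yml along these lines might look like the sketch below. The Loki URL, job name, and label scheme are illustrative assumptions, not values taken from the actual repository configuration.

```yaml
server:
  http_listen_port: 9080   # Promtail's own HTTP port
  grpc_listen_port: 0      # gRPC listener disabled

positions:
  filename: /tmp/positions.yaml   # remembers read offsets across restarts

clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki address on the Docker network

scrape_configs:
  - job_name: dante-containers        # hypothetical job name
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      # Derive a "service" label from the Docker container name
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'service'
```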
3.2. Loki (loki/loki-config.yml)
Target: Configures Loki to run as a single binary (all) for simplicity in smaller setups, or can be configured to run in a distributed microservices mode for larger scale.
Auth: auth_enabled: false indicates no authentication is set up for Loki itself in this configuration.
Server: Defines the HTTP listen port.
Storage Config (storage_config): Specifies the storage backend for logs. The example uses filesystem, which stores chunks and indexes locally. For production, object storage like S3, GCS, or MinIO is recommended for scalability and durability.
Schema Config (schema_config): Defines how logs are indexed and stored over time.
Limits Config (limits_config): Configures various operational limits, like ingestion rate, query duration, etc.
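A hedged single-binary sketch of loki-config.yml with local filesystem storage is shown below. Field names follow recent Loki releases and may differ from the repository's actual file; paths, schema dates, and limits are illustrative assumptions.

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
  storage:
    filesystem:
      chunks_directory: /loki/chunks   # local storage; use S3/GCS/MinIO in production
      rules_directory: /loki/rules

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  ingestion_rate_mb: 8       # illustrative ingestion limit
  max_query_length: 720h     # illustrative maximum query range
```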
3.3. Prometheus (prometheus/prometheus.yml)
Global Settings: Defines scrape interval, evaluation interval for rules.
Alerting: Configures Alertmanager instances.
Rule Files: Specifies paths to files containing recording and alerting rules.
Scrape Configs: Defines jobs that tell Prometheus what targets to scrape for metrics.
Example job for Prometheus itself.
consul_sd_configs: This section is crucial for integrating with Dante services. Prometheus will query Consul to find services tagged appropriately (e.g., with a "metrics" tag or a specific service name) and automatically scrape their /metrics (or configured) endpoint.
relabel_configs: Used to filter targets discovered via service discovery and to manipulate labels on scraped metrics.
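A sketch of the relevant parts of prometheus.yml is shown below. The Consul address, tag name, and relabeling rules are assumptions for illustration only.

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus scrapes its own metrics
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Dante services discovered dynamically through Consul
  - job_name: dante-services
    consul_sd_configs:
      - server: 'consul:8500'   # assumed Consul address
        tags: ['metrics']       # assumed tag marking scrapeable services
    relabel_configs:
      # Use the Consul service name as the "service" label on scraped metrics
      - source_labels: ['__meta_consul_service']
        target_label: 'service'
```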
3.4. Grafana (grafana/grafana.ini, grafana/provisioning/)
Configuration (grafana.ini): Main configuration file for Grafana, setting server options, database connections (for Grafana's own state), security, etc. Anonymous access might be enabled for ease of viewing in development.
Provisioning (grafana/provisioning/): Allows automatic setup of data sources and dashboards when Grafana starts.
Datasources (datasources/default.yaml): Defines connections to Prometheus and Loki, specifying their URLs.
Dashboards (dashboards/default.yaml, dashboards/dante-platform/): default.yaml points to directories where dashboard JSON files are located. Dashboard JSON files (e.g., service-overview.json) define the panels, queries, and layout for visualizing data. The example service overview dashboard likely includes panels for system metrics (CPU, memory, disk, network) and application-specific metrics from Dante services.
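For illustration, provisioning files in this style might look like the sketch below. The data source URLs assume the Docker Compose service names used elsewhere in this document, and the folder and path values are assumptions.

```yaml
# grafana/provisioning/datasources/default.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

```yaml
# grafana/provisioning/dashboards/default.yaml (sketch)
apiVersion: 1
providers:
  - name: dante-platform
    type: file
    folder: Dante Platform
    options:
      path: /etc/grafana/provisioning/dashboards/dante-platform
```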
3.5. Alertmanager (alertmanager/alertmanager.yml)
Global Settings: Defines default notification settings.
Route: Defines how alerts are routed to receivers. A root route typically catches all alerts and can have sub-routes for more specific alert handling based on labels.
Receivers: Configures notification channels (e.g., email, Slack, PagerDuty, webhook). The example shows a webhook_configs entry which can be used to send alerts to a custom HTTP endpoint.
Inhibit Rules: Allows suppression of certain alerts if other alerts are already firing (e.g., don't send "database down" if "host down" is active).
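A minimal alertmanager.yml along these lines might look like the following sketch. The webhook URL, grouping labels, and alert names are assumptions.

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: default-webhook              # root route catches all alerts
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://alert-handler:8080/alerts   # hypothetical custom HTTP endpoint

inhibit_rules:
  # Suppress lower-severity alerts while the host itself is reported down
  - source_matchers: ['alertname = "HostDown"']
    target_matchers: ['severity = "warning"']
    equal: ['instance']
```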
3.6. Docker Compose (docker-compose.yml)
Defines the services (Prometheus, Grafana, Loki, Promtail, Alertmanager) as containers.
Specifies image versions, ports, volumes (for configuration files and data persistence), and network connections between the containers.
Ensures that services like Grafana can reach Prometheus and Loki by using Docker's internal networking.
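As a rough illustration, such a docker-compose.yml could wire the components together as sketched below. Image tags, ports, bind-mount paths, and volume names are assumptions.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus            # persist the TSDB across restarts
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
      - /var/run/docker.sock:/var/run/docker.sock           # required for docker_sd_configs
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana          # persist Grafana state
    ports:
      - "3000:3000"

volumes:
  prometheus-data:
  loki-data:
  grafana-data:
```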
4. Workflow
Service Startup:
All monitoring components (Prometheus, Grafana, Loki, Promtail, Alertmanager) are started, typically via docker-compose up.
Grafana provisions its data sources (Prometheus, Loki) and dashboards.
Log Collection:
Dante microservices write logs to stdout/stderr (if containerized) or to log files.
Promtail, running on the same host or with access to Docker's log drivers, discovers and tails these logs based on its scrape_configs.
Promtail attaches labels (e.g., service name, instance ID, job ID) to the log streams.
Logs are pushed to Loki, which indexes the labels and stores the log content.
Metrics Collection:
Dante microservices expose a metrics endpoint (e.g., /metrics) in Prometheus format.
Prometheus, configured with consul_sd_configs, discovers these services via Consul.
Prometheus periodically scrapes the metrics endpoints of the discovered services.
Scraped metrics are stored in Prometheus's time-series database.
Visualization and Querying:
Users access Grafana via its web UI.
In Grafana, users can select dashboards (e.g., "Service Overview") to view metrics and logs.
Grafana queries Prometheus (using PromQL) for metrics and Loki (using LogQL) for logs based on the dashboard panels or ad-hoc queries in the "Explore" view.
Alerting:
Prometheus evaluates alerting rules defined in its configuration.
If an alert condition is met, Prometheus sends an alert to Alertmanager.
Alertmanager processes the alert (deduplication, grouping) and routes it to the configured receivers (e.g., webhook, email).
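For instance, an alerting rule file referenced via rule_files could contain a rule like the hypothetical one below; the alert name, expression, and threshold are assumptions.

```yaml
groups:
  - name: dante-platform-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0                  # target failed its most recent scrape
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} on {{ $labels.instance }} is down"
```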
5. Key Configuration Highlights
Prometheus Consul Integration: The consul_sd_configs in prometheus.yml is vital for dynamically discovering and scraping metrics from Dante services registered in Consul. This avoids static configuration of scrape targets.
Grafana Datasource Provisioning: The default.yaml under grafana/provisioning/datasources automatically sets up Prometheus and Loki as data sources in Grafana, simplifying setup.
Log Labeling: Effective use of labels in Promtail's scrape_configs (e.g., service, instance, job_id) is crucial for efficient log querying in Loki and correlating logs with metrics.
Data Persistence: The docker-compose.yml should define volumes for Prometheus, Loki, and Grafana to persist their data across container restarts. The current example configuration uses local filesystem storage within containers or bind mounts.
6. Future Considerations
Distributed Tracing: Integrate a distributed tracing system like Jaeger or Zipkin to trace requests across multiple microservices, providing deeper insights into request latency and dependencies.
Log Archival: Implement long-term log archival from Loki to cheaper storage (e.g., S3, GCS) if needed for compliance or extended historical analysis.
Metric Cardinality Management: Monitor and manage high-cardinality metrics in Prometheus to prevent performance issues.
Security: Secure the endpoints of Prometheus, Grafana, Loki, and Alertmanager, especially if exposed externally (e.g., using authentication, HTTPS).
Scalability: For larger deployments, configure Loki and Prometheus to run in their distributed/clustered modes and use scalable storage backends.
Custom Metrics: Ensure all Dante services expose comprehensive custom metrics relevant to their operations (e.g., job queue length for Scheduler, active sessions for Billing).
Advanced Alerting: Develop more sophisticated alerting rules based on application-specific Key Performance Indicators (KPIs).