Infrastructure Tools

Prometheus, Loki, Alertmanager, and Grafana are powerful tools commonly used for monitoring and visualizing metrics and logs in a distributed system. Here, we focus on integrating these tools to monitor metrics and logs from Polkadot relay chain and parachain nodes.

Metrics

Prometheus is an open source solution that can be used to collect metrics from applications. It scrapes metrics from configured target endpoints at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

Prometheus Configuration

Targets are the list of endpoints you want to scrape. The two main exporters we care about are 1) polkadot/substrate and 2) node-exporter; the example below also scrapes Promtail's own metrics endpoint (covered in the Logs section). An example of scraping these on the IP address 10.100.0.0 would be:

scrape_configs:
  - job_name: polkadot
    static_configs:
      - targets:
          - 10.100.0.0:9080 # promtail
          - 10.100.0.0:9100 # node exporter
          - 10.100.0.0:9615 # relaychain metrics
          - 10.100.0.0:9625 # parachain metrics
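Once Prometheus has reloaded this configuration, the built-in up metric is a quick sanity check: querying up{job="polkadot"} in the Prometheus UI should return 1 for each of the four targets, while 0 indicates an endpoint Prometheus could not scrape.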

Logs

Loki is an open source solution that can be used to aggregate logs from applications, allowing the operator to spot errors and patterns, and to search through the logs from all hosts easily. An agent such as Promtail or Grafana Alloy is used to push logs to the Loki server.

Example promtail.yaml configuration that collects the logs and creates Promtail metrics counting log lines per log level:

# promtail server config
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: info

positions:
  filename: /var/lib/promtail/positions.yaml

# loki servers
clients:
  - url: http://loki.example.com/loki/api/v1/push
    backoff_config:
      min_period: 1m
      max_period: 1h
      max_retries: 10000

scrape_configs:
  - job_name: journald
    journal:
      max_age: 1m
      path: /var/log/journal
      labels:
        job: journald
    pipeline_stages:
      - match:
          selector: '{job="journald"}'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - regex:
                expression: '(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>(TRACE|DEBUG|INFO|WARN|ERROR))\s+(?P<worker>([^\s]+))\s+(?P<target>[\w-]+):?:?(?P<subtarget>[\w-]+)?:[\s]?(?P<chaintype>\[[\w-]+\]+)?[\s]?(?P<message>.+)'
            - labels:
                level:
                target:
                subtarget:
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Logs"
                  prefix: "promtail_chain_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{job="journald", level="ERROR"}'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Error Logs"
                  prefix: "promtail_chain_error_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{job="journald", level=~".+"} |~ "(?i)(panic)"'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Panic Logs"
                  prefix: "promtail_chain_panic_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
      - source_labels: ["unit"]
        regex: "(.*.scope|user.*.service)"
        action: drop
      - source_labels: ["__journal__hostname"]
        target_label: "host"
      - action: replace
        source_labels:
          - __journal__cmdline
          - __journal__hostname
        separator: ";"
        regex: ".*--chain.*;(.*)"
        target_label: "node"
        replacement: $1

The above example also configures the following custom metrics, derived from the logs, to be exposed on the Promtail metrics endpoint (http://host:9080/metrics): promtail_chain_log_lines_total, promtail_chain_error_log_lines_total, and promtail_chain_panic_log_lines_total.
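Because Prometheus scrapes the Promtail endpoint (the 9080 target in the scrape configuration above), these derived counters can drive alerts like any other metric. A minimal sketch, assuming the node label set by the relabel_configs above, that fires when any ERROR lines are logged within five minutes:

- alert: ChainErrorLogs
  annotations:
    message: 'Node {{ $labels.node }} logged {{ $value }} ERROR lines in the last 5 minutes.'
  expr: increase(promtail_chain_error_log_lines_total[5m]) > 0
  labels:
    severity: warning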

Alerts

Alertmanager handles alerts sent by Prometheus, and is responsible for deduplicating them, grouping them, and routing them to the correct receiver, such as email, PagerDuty, or other mechanisms via the webhook receiver.
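As a minimal sketch, an alertmanager.yml that groups alerts and forwards everything to a single catch-all receiver might look like the following (the receiver name and webhook URL are illustrative):

route:
  receiver: 'ops-webhook'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'ops-webhook'
    webhook_configs:
      # Illustrative endpoint; point this at your paging or chat integration.
      - url: 'http://alert-handler.example.com/hook'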

A simple alert for slow block production would look like:

- alert: BlockProductionSlow
  annotations:
    message: 'Best block on instance {{ $labels.instance }} increases by less than 1 per minute for more than 5 minutes.'
  expr: increase(substrate_block_height{status="best"}[1m]) < 1
  for: 5m
  labels:
    severity: warning
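For Prometheus to evaluate this rule and forward firing alerts, the rule must live in a rule file (under a groups: entry) referenced from prometheus.yml, and an alerting block must point at Alertmanager. A sketch, assuming Alertmanager runs on the same host on its default port 9093 (the rule file path is illustrative):

rule_files:
  - /etc/prometheus/rules/polkadot.yml # contains the BlockProductionSlow rule above

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.100.0.0:9093 # alertmanager default port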

Visualization

Grafana is where you define dashboards to visualize the time series data that Prometheus collects. You just need to make sure you add the data sources:

datasources:
  - name: "Prometheus"
    type: prometheus
    access: proxy
    editable: false
    orgId: 1
    url: "http://prometheus.monitoring.svc.cluster.local"
    version: 1
    jsonData:
      timeInterval: 30s
  - name: Loki
    type: loki
    access: proxy
    orgId: 1
    url: http://loki:3100
    basicAuth: false
    version: 1
    editable: true
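Dashboards can be provisioned the same way. A minimal sketch of a dashboard provider that loads any dashboard JSON files found on disk (the folder name and path are illustrative):

apiVersion: 1
providers:
  - name: "polkadot-dashboards"
    orgId: 1
    folder: "Polkadot"
    type: file
    options:
      path: /var/lib/grafana/dashboards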

Polkadot-mixin

Polkadot-monitoring-mixin is a set of Polkadot monitoring dashboards, alerts, and rules, collected based on our experience operating Polkadot relay chain and parachain nodes. You can find it in this repo.

Docker Compose

You can install the following components with Docker Compose if you are evaluating or developing a monitoring stack for Polkadot. The configuration files below run each component of the monitoring stack in its simple, single-binary mode.

version: "3.8" networks: polkadot: services: prometheus: container_name: prometheus image: prom/prometheus:v2.53.0 command: - '--config.file=/etc/prometheus/prometheus.yml' ports: - 9090:9090 restart: unless-stopped configs: - source: prometheus_config target: /etc/prometheus/prometheus.yml networks: - polkadot loki: container_name: loki image: grafana/loki:3.1.0 ports: - "3100:3100" command: -config.file=/etc/loki/local-config.yaml networks: - polkadot promtail: container_name: promtail image: grafana/promtail:3.1.0 command: -config.file=/etc/promtail/config.yml user: root # Required to read container logs ports: - 9080:9080 volumes: - /var/lib/docker/containers:/var/lib/docker/containers - /var/run/docker.sock:/var/run/docker.sock configs: - source: promtail_config target: /etc/promtail/config.yml networks: - polkadot grafana: container_name: grafana image: grafana/grafana:latest environment: - GF_PATHS_PROVISIONING=/etc/grafana/provisioning - GF_AUTH_ANONYMOUS_ENABLED=true - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin entrypoint: - sh - -euc - | mkdir -p /etc/grafana/provisioning/datasources cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml apiVersion: 1 datasources: - name: Loki type: loki access: proxy orgId: 1 url: http://loki:3100 basicAuth: false version: 1 editable: true - name: Prometheus type: prometheus access: proxy orgId: 1 url: http://prometheus:9090 basicAuth: false isDefault: true version: 1 editable: true EOF /run.sh ports: - "3000:3000" networks: - polkadot polkadot_collator: container_name: polkadot_collator image: parity/polkadot-parachain:1.14.0 command: > --tmp --prometheus-external --prometheus-port 9625 -- --tmp --prometheus-external --prometheus-port 9615 ports: - "9615:9615" - "9625:9625" networks: - polkadot configs: prometheus_config: content: | global: scrape_interval: 15s # By default, scrape targets every 15 seconds. evaluation_interval: 15s # Evaluate rules every 15 seconds. scrape_configs: - job_name: polkadot static_configs: - targets: - polkadot_collator:9615 # relaychain metrics - polkadot_collator:9625 # parachain metrics promtail_config: content: | server: http_listen_port: 9080 positions: filename: /tmp/positions.yaml clients: - url: http://loki:3100/loki/api/v1/push scrape_configs: - job_name: containers docker_sd_configs: - host: unix:///var/run/docker.sock refresh_interval: 5s filters: - name: name values: [polkadot_collator] relabel_configs: - source_labels: ['__meta_docker_container_name'] regex: '/(.*)' target_label: 'container'