Infrastructure Tools
Prometheus, Loki, Alertmanager, and Grafana are powerful tools commonly used for monitoring and visualizing metrics and logs in a distributed system. Here, we focus on integrating these tools to effectively monitor Polkadot metrics across relay chains and parachains.
Metrics
Prometheus is an open source solution that can be used to collect metrics from applications. It collects metrics from configured target endpoints at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.
Prometheus Configuration
Targets are a list of endpoints you want to scrape. The two main exporters we care about are 1) polkadot/substrate and 2) node-exporter. An example of scraping these on the IP address 10.100.0.0 would be:
scrape_configs:
  - job_name: polkadot
    static_configs:
      - targets:
          - 10.100.0.0:9080 # promtail
          - 10.100.0.0:9100 # node exporter
          - 10.100.0.0:9615 # relaychain metrics
          - 10.100.0.0:9625 # parachain metrics
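If you are starting from scratch, the job above slots into a complete prometheus.yml. A minimal sketch, assuming 15s scrape and evaluation intervals and a rule file location of /etc/prometheus/rules/ (both illustrative):

global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate alerting/recording rules

rule_files:
  - /etc/prometheus/rules/*.yml # assumed location for alert/recording rules

scrape_configs:
  - job_name: polkadot
    static_configs:
      - targets:
          - 10.100.0.0:9080 # promtail
          - 10.100.0.0:9100 # node exporter
          - 10.100.0.0:9615 # relaychain metrics
          - 10.100.0.0:9625 # parachain metrics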
Logs
Loki is an open source solution that can be used to aggregate logs from applications, allowing the operator to spot errors and patterns and to search through the logs from all hosts easily. An agent such as Promtail or Grafana Alloy is used to push logs to the Loki server.
An example promtail.yaml configuration that collects the logs and creates Promtail metrics aggregating log lines by log level:
# promtail server config
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: info
positions:
  filename: /var/lib/promtail/positions.yaml
# loki servers
clients:
  - url: http://loki.example.com/loki/api/v1/push
    backoff_config:
      min_period: 1m
      max_period: 1h
      max_retries: 10000
scrape_configs:
  - job_name: journald
    journal:
      max_age: 1m
      path: /var/log/journal
      labels:
        job: journald
    pipeline_stages:
      - match:
          selector: '{job="journald"}'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - regex:
                expression: '(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>(TRACE|DEBUG|INFO|WARN|ERROR))\s+(?P<worker>([^\s]+))\s+(?P<target>[\w-]+):?:?(?P<subtarget>[\w-]+)?:[\s]?(?P<chaintype>\[[\w-]+\]+)?[\s]?(?P<message>.+)'
            - labels:
                level:
                target:
                subtarget:
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Logs"
                  prefix: "promtail_chain_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{job="journald", level="ERROR"}'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Error Logs"
                  prefix: "promtail_chain_error_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{job="journald", level=~".+"} |~ "(?i)(panic)"'
          stages:
            - multiline:
                firstline: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}'
                max_lines: 2500
            - metrics:
                log_lines_total:
                  type: Counter
                  description: "Total Number of Chain Panic Logs"
                  prefix: "promtail_chain_panic_"
                  max_idle_duration: 24h
                  config:
                    match_all: true
                    action: inc
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
      - source_labels: ["unit"]
        regex: "(.*.scope|user.*.service)"
        action: drop
      - source_labels: ["__journal__hostname"]
        target_label: "host"
      - action: replace
        source_labels:
          - __journal__cmdline
          - __journal__hostname
        separator: ";"
        regex: ".*--chain.*;(.*)"
        target_label: "node"
        replacement: $1
The above example also configures three custom metrics derived from the logs, promtail_chain_log_lines_total, promtail_chain_error_log_lines_total, and promtail_chain_panic_log_lines_total, which are exposed on the Promtail metrics endpoint (http://host:9080/metrics).
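Once Prometheus scrapes the Promtail endpoint, these counters can be queried like any other metric. As a sketch, a Prometheus recording rule that pre-computes per-host log rates by level might look like the following; the group and record names are illustrative:

groups:
  - name: chain-logs # illustrative group name
    rules:
      - record: chain:log_lines:rate5m # illustrative record name
        expr: sum by (host, level) (rate(promtail_chain_log_lines_total[5m]))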
Alerts
Alertmanager handles alerts sent by clients such as the Prometheus server, and is responsible for deduplicating, grouping, and routing them to the correct receiver, such as email, PagerDuty, or other mechanisms via its webhook receiver.
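As a minimal sketch, an alertmanager.yml that groups alerts and forwards everything to a single webhook receiver could look like this; the receiver name, timings, and URL below are placeholders:

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: ops-webhook
receivers:
  - name: ops-webhook
    webhook_configs:
      - url: http://alert-handler.example.com/hook # placeholder endpoint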
A simple alert for block production being slow would look like:
- alert: BlockProductionSlow
  annotations:
    message: 'Best block on instance {{ $labels.instance }} increases by
      less than 1 per minute for more than 5 minutes.'
  expr: increase(substrate_block_height{status="best"}[1m]) < 1
  for: 5m
  labels:
    severity: warning
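The log-derived counters from the Promtail configuration above can drive alerts too. A sketch, assuming promtail_chain_error_log_lines_total is being scraped; the alert name and threshold are illustrative:

- alert: ChainErrorLogs
  annotations:
    message: 'Node {{ $labels.host }} is logging ERROR lines.'
  expr: increase(promtail_chain_error_log_lines_total[5m]) > 0
  for: 5m
  labels:
    severity: warning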
Visualization
Grafana is where you can define dashboards to show the time series information that Prometheus is collecting. You just need to ensure you add a data source:
apiVersion: 1
datasources:
  - name: "Prometheus"
    type: prometheus
    access: proxy
    editable: false
    orgId: 1
    url: "http://prometheus.monitoring.svc.cluster.local"
    version: 1
    jsonData:
      timeInterval: 30s
  - name: Loki
    type: loki
    access: proxy
    orgId: 1
    url: http://loki:3100
    basicAuth: false
    version: 1
    editable: true
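Dashboards can be provisioned the same way as data sources. A minimal sketch of a dashboard provider file, which tells Grafana to load any dashboard JSON files from a directory; the provider name and path are illustrative:

apiVersion: 1
providers:
  - name: default # illustrative provider name
    orgId: 1
    type: file
    options:
      path: /var/lib/grafana/dashboards # illustrative dashboard directory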
Polkadot-mixin
Polkadot-monitoring-mixin is a set of Polkadot monitoring dashboards, alerts, and rules compiled from our experience operating Polkadot relay chain and parachain nodes. You can find it in this repo.
Docker Compose
You can install the following components with Docker Compose if you are evaluating or developing a monitoring stack for Polkadot. The configuration files associated with these installation instructions run each component of the monitoring stack as a single binary.
version: "3.8"
networks:
polkadot:
services:
prometheus:
container_name: prometheus
image: prom/prometheus:v2.53.0
command:
- '--config.file=/etc/prometheus/prometheus.yml'
ports:
- 9090:9090
restart: unless-stopped
configs:
- source: prometheus_config
target: /etc/prometheus/prometheus.yml
networks:
- polkadot
loki:
container_name: loki
image: grafana/loki:3.1.0
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
networks:
- polkadot
promtail:
container_name: promtail
image: grafana/promtail:3.1.0
command: -config.file=/etc/promtail/config.yml
user: root # Required to read container logs
ports:
- 9080:9080
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers
- /var/run/docker.sock:/var/run/docker.sock
configs:
- source: promtail_config
target: /etc/promtail/config.yml
networks:
- polkadot
grafana:
container_name: grafana
image: grafana/grafana:latest
environment:
- GF_PATHS_PROVISIONING=/etc/grafana/provisioning
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
entrypoint:
- sh
- -euc
- |
mkdir -p /etc/grafana/provisioning/datasources
cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
orgId: 1
url: http://loki:3100
basicAuth: false
version: 1
editable: true
- name: Prometheus
type: prometheus
access: proxy
orgId: 1
url: http://prometheus:9090
basicAuth: false
isDefault: true
version: 1
editable: true
EOF
/run.sh
ports:
- "3000:3000"
networks:
- polkadot
polkadot_collator:
container_name: polkadot_collator
image: parity/polkadot-parachain:1.14.0
command: >
--tmp --prometheus-external --prometheus-port 9625 -- --tmp --prometheus-external --prometheus-port 9615
ports:
- "9615:9615"
- "9625:9625"
networks:
- polkadot
configs:
prometheus_config:
content: |
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # Evaluate rules every 15 seconds.
scrape_configs:
- job_name: polkadot
static_configs:
- targets:
- polkadot_collator:9615 # relaychain metrics
- polkadot_collator:9625 # parachain metrics
promtail_config:
content: |
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: containers
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
filters:
- name: name
values: [polkadot_collator]
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
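Save the file as docker-compose.yml and start the stack with docker compose up -d. Prometheus is then reachable on port 9090, Grafana on port 3000, and Loki on port 3100, with the collator's relay chain and parachain metrics scraped automatically via the embedded prometheus_config.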