Thruk Clustering

new in release v2.24

Clustered setups provide high-availability and performance improvements at the price of higher complexity. All Thruk nodes in a cluster must use shared storage for their etc_path and var_path while the tmp_path should remain local.

Setup

  • Create shared storage for etc_path and var_path

  • Setup a load balancer

  • Double check the node url in cluster_nodes

  • Enable cluster with cluster_enabled=1

Static Cluster

Having a fixed number of cluster nodes will be set if you configure multiple cluster_nodes with fixed hostnames like:

cluster_enabled = 1
cluster_nodes = http://clusternode1/$url_prefix$/
cluster_nodes = http://clusternode2/$url_prefix$/

Autoscaling / Dynamic Cluster

When having no (the default) or only one generic cluster_node url defined, Thruk uses autoscaling / registering cluster mode and new nodes are registered automatically when they startup.

cluster_enabled = 1
# optionally set generic url if the default does not work
#cluster_nodes = http://$hostname$/$url_prefix$/

Features

Shared cronjobs

Cluster setups will share their cronjobs, so reports or scheduling recurring downtimes will only run once in a cluster.

High Availability

Having more than one Thruk node with a load-balancer in front will share the load over all nodes and will still work if at least one node is available.

Load Balancer Configuration

Loadbalancing is not included automatically, but you can put any loadbalancer you like in front. Use this url for heartbeats to detect which cluster nodes are online.

/thruk/cgi-bin/remote.cgi?lb_ping

This page will respond either:

  • 200 OK

  • 503 MAINTENANCE

You can put Thruk cluster nodes in maintenance mode with the thruk cluster maint command. Later reactivate the node with thruk cluster unmaint.

Maintenance

Cluster nodes can be put in maintenance mode with the cli command:

%> thruk cluster maint

This will change the output of the url /thruk/cgi-bin/remote.cgi?lb_ping to MAINTENANCE, so if your loadbalancer is configured to match the OK string, it should remove the node from the active nodes. Additionally it will return the HTTP code 503 Service Unavailable;

Remove the maintenance mode with the command:

%> thruk cluster unmaint

Troubleshooting

  • Try to get Thruk working without cluster mode first. If that works, enable clustering.

  • You might run into issues with relative symlinks when moving etc and var onto shares. Eventually you have to link some folders to the share to make relative symlinks work again.

  • Run the command thruk r /thruk/cluster and thruk r -d "" /thruk/cluster/heartbeat to see if cluster nodes are detected correctly.

Edit page on GitHub