Cluster API lets you configure Machine Health Checks (MHCs) with custom remediation strategies. This is helpful for our bare metal servers. If a health check finds that a server cannot be reached, the default strategy is to delete it, after which it has to be provisioned again. Provisioning takes considerably longer for bare metal servers than for virtual cloud servers, so we try to avoid it with the help of our HetznerBareMetalRemediationController and HCloudRemediationController. Instead of deleting and deprovisioning the machine, we first try to reboot it. If the reboot solves the problem, we save the time that re-provisioning would have required.
If the MHC is configured to use a HetznerBareMetalRemediationTemplate or an HCloudRemediationTemplate (also see the reference of each object), then a remediation object is created from that template every time the MHC finds an unhealthy machine.
The HetznerBareMetalRemediationController reconciles this object and then sets an annotation on the relevant HetznerBareMetalHost object that specifies the desired remediation strategy. At the moment, only "reboot" is supported. The HCloudRemediationController reboots the HCloudMachine directly via the HCloud API. For HCloud servers, "reboot" is likewise the only supported strategy.
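
For HCloud machines, the setup might look roughly like the following sketch. It assumes that HCloudRemediationTemplate accepts the same strategy fields (type, retryLimit, timeout) as the bare metal template. The object names, the label selector, and all timeout values are illustrative placeholders, not recommended settings. Check the object reference before using it.

```yaml
# Sketch only: names, selector, and timeouts are assumptions chosen for illustration.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${CLUSTER_NAME}-md-0-unhealthy-5m"  # hypothetical name
spec:
  clusterName: "${CLUSTER_NAME}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      nodepool: "${CLUSTER_NAME}-md-0"  # hypothetical label on the worker machines
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 180s
    - type: Ready
      status: "False"
      timeout: 180s
  remediationTemplate:  # points the MHC at the custom remediation template
    kind: HCloudRemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: worker-remediation-request
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudRemediationTemplate
metadata:
  name: worker-remediation-request  # must match remediationTemplate.name above
spec:
  template:
    spec:
      strategy:
        type: "Reboot"  # the only strategy described in this document
        retryLimit: 1   # assumed field: number of reboot attempts
        timeout: 180s   # assumed field: wait time per attempt
```
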
Here is an example of how to configure the Machine Health Check and the HetznerBareMetalRemediationTemplate: