At Spotflux we use Zabbix not just for monitoring our entire infrastructure but also for automating recovery from failures and managing high availability. As with most implementations of Zabbix, there are significant performance and management considerations in placing Zabbix reporting agents on a large number of virtualized instances across a large number of physical servers. While Zabbix Proxy is a good path for addressing some of the performance challenges, the current state of Zabbix did not allow us to easily control the state of virtualized instances from the hypervisor, meaning they would have to be controlled through a complicated set of customized scripts. In today’s post we are introducing a publicly available Zabbix patch that extends the Zabbix Proxy to allow the use of Zabbix Actions on KVM virtual hosts that exist on a completely isolated network segment inside the hypervisor. This provides a nice security benefit in certain environments, since we don’t need Zabbix to run inside the virtual hosts, and it also helps automate high availability by allowing KVM hosts to be controlled even if an instance becomes unresponsive. Let’s look at a simple example of how something like this might be deployed in production:
Let’s assume we have an environment with over 100 hypervisor hosts, each containing roughly 30 virtual hosts. Each hypervisor is configured with a Zabbix Proxy that is responsible for monitoring and performing actions on the 30 virtualized hosts in its operating environment. Each of the 30 hosts in this example runs an NGINX server responsible for serving some content back to the internet, and all the NGINX servers exchange session information with each other through a shared memcached layer on the backend. The NGINX servers are managed by a load balancing mechanism such as haproxy, which resides on the hypervisor. Using this example config, our goal is to maintain service stability, detect failures, and auto-recover from failures as much as possible.
Using our new patch, here’s how this could be done:
1- Detection: Our first level of detection happens at the load balancer, which we’ve configured with a simple HTTP request checking script.
2- First Response: Our script finds a failure with a particular virtual instance, the load balancer removes it from the list of available nodes, and an event is fired to Zabbix.
3- First Zabbix Action: Notify admins via IRC/Jabber/E-mail that we have an issue. Wait a set amount of time for the event to clear; if it does not, take a more aggressive action.
4- Third Action: Gently ask the VM to restart. This might sound violent at first, but in certain instances trying to stop a crashed service and start it again is more painful than just restarting the VM.
5- Fourth Action: If the soft restart did not work, force the VM to restart, then notify the admin on watch by IRC or Jabber + email.
6- Final Fall Back: The VM wasn’t able to start again or the service didn’t come back. Increase the alert level and send IRC or Jabber + email + SMS to the admin on watch.
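The escalation steps above can be sketched as a small ladder that runs one action, waits, re-checks the service, and only then moves on to the next, more aggressive action. This is a minimal illustration in Python, not how Zabbix or our patch implements Actions: the names `Step`, `run_escalation`, and the `is_healthy` callback are hypothetical, and in a real deployment the waiting and escalation are configured as Zabbix Action operations rather than written as a script.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One rung of the escalation ladder (hypothetical structure)."""
    name: str
    action: Callable[[str], None]  # called with the VM name
    wait_seconds: float            # time to wait for recovery before escalating


def run_escalation(
    vm: str,
    steps: List[Step],
    is_healthy: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, List[str]]:
    """Run steps in order until the VM's service is healthy again.

    Returns (recovered, names_of_steps_executed).
    """
    executed: List[str] = []
    for step in steps:
        # Stop escalating as soon as the event has cleared.
        if is_healthy(vm):
            return True, executed
        step.action(vm)
        executed.append(step.name)
        sleep(step.wait_seconds)
    # All steps used: report whether the last one brought the service back.
    return is_healthy(vm), executed
```

A caller would pass a list such as notify, soft restart, forced restart, and raise-alert-level, with `is_healthy` wrapping the same HTTP check the load balancer uses. Injecting `sleep` keeps the ladder easy to exercise without real delays.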
Without our patch we would have gotten stuck on Step #3 if, for whatever reason, the VM became unresponsive, which isn’t an ideal situation at 3am on a Sunday. As an added bonus, you can now also configure your Zabbix front end to run reboot commands or other actions on the VM without any additional customization.
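To make the difference between a soft and a forced restart concrete, here is a rough sketch of the kind of command a hypervisor-side action could run. It assumes libvirt’s `virsh` tool is available on the hypervisor, which this post does not state, and it is not the patch’s actual mechanism. The function names and action labels are hypothetical; `virsh reboot` asks the guest to restart, while `virsh reset` resets it immediately without a guest shutdown.

```python
import subprocess
from typing import Callable, List


def virsh_command(action: str, domain: str) -> List[str]:
    """Map an escalation action to a virsh command line (assumed setup)."""
    commands = {
        # Gentle: request a reboot from the guest.
        "soft_restart": ["virsh", "reboot", domain],
        # Forced: reset the domain as if the reset button were pressed.
        "hard_restart": ["virsh", "reset", domain],
    }
    if action not in commands:
        raise ValueError(f"unknown action: {action}")
    return commands[action]


def run_action(action: str, domain: str, runner: Callable = subprocess.run):
    """Execute the command; raises CalledProcessError if virsh fails."""
    return runner(virsh_command(action, domain), check=True)
```

Because the command runs on the hypervisor rather than inside the guest, the forced variant still works when the VM itself no longer responds.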