FreeBSD and hardware/software watchdogs

This evening I spent a little time researching FreeBSD’s watchdog support and enabling it on all of the servers in our datacenter. The documentation (relevant man pages: watchdog(4), watchdog(8), watchdogd(8), and watchdog(9)) feels a bit excessive and complex, but as it turns out, the whole thing is a heck of a lot simpler than you’d think.

Preface

The general concept of a watchdog (whether implemented in software or hardware) is to ensure that a system, or a piece of the system, is working and, if not, to reboot it. The methodology is simple: the watchdog timer has to be reset within N seconds, or else the system is rebooted.
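To make the "reset within N seconds" rule concrete, here is a minimal sketch of the decision a watchdog timer makes. The function name watchdog_state and its arguments are made up for illustration; a real watchdog does this in hardware or in the kernel, not in a shell script.

```shell
# Hypothetical illustration of the watchdog rule, not FreeBSD code.
# Usage: watchdog_state LAST_PAT NOW TIMEOUT  (all values in seconds)
watchdog_state() {
    last_pat=$1
    now=$2
    timeout=$3
    # If the timer was not reset within TIMEOUT seconds, the system is rebooted.
    if [ $((now - last_pat)) -ge "$timeout" ]; then
        echo "reboot"
    else
        echo "running"
    fi
}
```

With a 16-second timeout, a last reset at second 0 leaves the system running at second 10 and triggers a reboot at second 16.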

FreeBSD offers both software and hardware watchdog support. I’ll be discussing hardware watchdog support, since there are well-known problems (see line 33) with FreeBSD’s software watchdog when run on multiprocessor systems using the ULE scheduler (which is now the default). Hardware watchdogs are generally more reliable, especially when it comes to the system reset portion (it’s a bit hard for the kernel to reboot a system when the kernel itself is wedged). If you really want details about the software watchdog, Google for “FreeBSD SW_WATCHDOG”.

Implementation

The methodology in FreeBSD is simple: kernel-level device drivers are available which communicate with the respective system’s hardware watchdog. An ioctl(2) call is also provided so userland applications can periodically “pat” the watchdog to ensure the system doesn’t reboot. Yep, it’s that simple, and yep, the ioctl call really is named WDIOCPATPAT.

FreeBSD offers Intel ICHxx watchdog interrupt timer support via the ichwd(4) driver. This driver works on both i386 and amd64.

In Practise

Our servers are all Supermicro-based, driven by Intel chipsets. Supermicro boxes offer a couple of different watchdogs: one tied to a Winbond chip (which is what the watchdog settings in the BIOS control), and the Intel ICHxx watchdog, which the BIOS doesn’t mention. In our case, we chose to use the Intel ICHxx chipset watchdog, since it’s reliable. I’m more than familiar with Winbond’s chips, but I’d much rather use something that’s universally available (since we have different Supermicro boxes with different ICHxx chips).

To test in real-time, I did the following on our RELENG_7 and RELENG_8 amd64 boxes:

releng7# kldload ichwd
releng7# dmesg
[...]
ichwd module loaded
ichwd0: <Intel ICH7 watchdog timer> on isa0
ichwd0: Intel ICH7 watchdog timer (ICH7 or equivalent)

releng8# kldload ichwd
releng8# dmesg
[...]
ichwd module loaded
ichwd0: <Intel ICH9R watchdog timer> on isa0
ichwd0: Intel ICH9R watchdog timer (ICH9 or equivalent)

Next was to enable and configure watchdogd to utilise the driver. I added the following to /etc/rc.conf:

watchdogd_enable="yes"

And started it via /etc/rc.d/watchdogd start. Yep, that’s all!
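One thing worth noting: kldload only loads the driver until the next reboot. A sketch of a persistent setup could look like the following. The ichwd_load line is the loader.conf knob described in ichwd(4); the watchdogd_flags value is purely an illustrative assumption (a 30-second timeout), so leave it out if the defaults suit you.

```shell
# /boot/loader.conf: load the ichwd(4) driver at boot
ichwd_load="YES"

# /etc/rc.conf: start watchdogd at boot
watchdogd_enable="YES"
# Optional, and an assumption for illustration only: extra flags for
# watchdogd(8), here a 30-second timeout instead of the default
watchdogd_flags="-t 30"
```

The two groups of settings belong in two different files, as the comments indicate; they are only shown together here for brevity.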

How it works, and testing failure

By default, watchdogd repeatedly executes stat(2) on /etc and makes sure it gets back a successful result, using a watchdog timeout of 16 seconds. If the check succeeds, it issues the WDIOCPATPAT ioctl call, which resets the watchdog timer (and the machine continues to operate normally). Otherwise the ICHxx watchdog will eventually reset the machine. For what “eventually” means, see the ichwd(4) man page.
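If a stat(2) on /etc isn’t a strong enough test for your environment, watchdogd(8) can run a command of your choosing instead via its -e option, and only pats the watchdog when that command exits successfully. Below is a rough sketch of such a check; the function name check_health and the choice of what to check are my own assumptions, not something shipped with FreeBSD.

```shell
# Hypothetical health check, roughly mimicking watchdogd's default test.
# Usage: check_health [DIRECTORY]   (defaults to /etc)
check_health() {
    dir=${1:-/etc}
    # Succeed only if the directory can be looked up on the filesystem.
    if [ -d "$dir" ]; then
        return 0
    fi
    echo "health check failed: $dir" >&2
    return 1
}
```

To use something like this, you would put it in a script that ends by calling check_health, then point watchdogd at that script with -e. A non-zero exit status means the watchdog doesn’t get patted.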

Testing an actual failure is pretty simple: killall -9 watchdogd.

It’s important that you send SIGKILL (-9) and not SIGINT or SIGTERM, otherwise the daemon shuts down the watchdog framework and the machine will remain up. Using SIGKILL ensures the daemon’s SIGINT and SIGTERM handlers don’t get called, allowing the ICHxx watchdog to kick in and reset the box.
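The difference between the signals is easy to demonstrate without touching a real watchdog. The sketch below is not watchdogd itself: it starts a throwaway child process with a SIGTERM handler (standing in for the daemon’s cleanup code) and reports whether that handler got to run. The function name demo_signal is made up for this example.

```shell
# Hypothetical demo: does a process get to run its cleanup handler?
# Usage: demo_signal TERM|KILL
demo_signal() {
    sig=$1
    marker=$(mktemp)
    rm -f "$marker"
    # Child process: on SIGTERM, create the marker file and exit cleanly.
    sh -c 'trap "touch \"$0\"; exit 0" TERM; while :; do sleep 1; done' "$marker" &
    pid=$!
    sleep 1                      # give the child time to install its handler
    kill "-$sig" "$pid"
    wait "$pid" 2>/dev/null || true
    if [ -e "$marker" ]; then
        echo "cleanup ran"
    else
        echo "no cleanup"
    fi
    rm -f "$marker"
}
```

Running demo_signal TERM reports that the cleanup ran, while demo_signal KILL reports that it didn’t, since SIGKILL can’t be caught. That’s exactly why the watchdog stays armed when watchdogd is killed with -9.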

Q: I keep seeing these “watchdog timeout” messages coming from [some other driver, e.g. Ethernet], are these related to what your blog post is about?

Absolutely not. “Watchdog timeout” messages coming from a specific device indicate that communication between the driver and that device timed out, and that the hardware watchdog on the device (e.g. NIC) fired, resulting in the device itself being reset.

It’s the same concept as what’s described above, just implemented on a driver-to-device level. I hope this makes sense!

The only reason I’m mentioning this is because a certain forum post is completely wrong in its correlation between driver watchdog timeouts and actual hardware/system watchdogs.
