NOTE: The behaviour described below may have changed in FreeBSD 11.x and newer as of MFC commit r327920 (semi-documented in commit r327923). As best I understand it: as of r327920, if you use options SW_WATCHDOG in your kernel configuration, the watchdog will be hardware-based if you have a driver like ichwd(4) loaded, or purely software if no hardware watchdog driver is loaded. Regardless of which is used, the watchdog will now be suspended (i.e. won’t get triggered) during critical things like panics that result in kernel core dump generation.
This evening I spent a little bit of time researching FreeBSD’s watchdog support and enabling it on all of the servers in our datacenter. The documentation feels a bit overzealous (relevant man pages: watchdog(9)) and complex, but as it turns out it’s a heck of a lot simpler than you’d think.
The general concept of a watchdog (whether it be implemented in software or hardware) is to ensure that a system, or a piece of the system, is working — and if not, reboot it. The methodology is simple: the watchdog has to be reset within N number of seconds or else the system is rebooted.
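To make the "reset within N seconds or reboot" rule concrete, here is a tiny conceptual model in shell. It is only a sketch: the function name and its arguments are made up for illustration and have nothing to do with any real FreeBSD interface.

```shell
#!/bin/sh
# Conceptual model only: decide whether a watchdog with a given timeout
# would have fired, given the time of the last "pat".
# The function name and arguments are illustrative, not a FreeBSD API.

# watchdog_state LAST_PAT NOW TIMEOUT
# Prints "ok" if fewer than TIMEOUT seconds have passed since LAST_PAT,
# otherwise prints "reset".
watchdog_state() {
    last_pat=$1
    now=$2
    timeout=$3
    if [ $((now - last_pat)) -lt "$timeout" ]; then
        echo ok
    else
        echo reset
    fi
}
```

For example, watchdog_state 100 110 16 prints "ok" (10 seconds since the last pat), while watchdog_state 100 120 16 prints "reset" (20 seconds, past the 16 second timeout).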
FreeBSD offers both software and hardware watchdog support. I’ll be discussing hardware watchdog support, since there are well-known problems with FreeBSD’s software watchdog when run on multiprocessor systems using the ULE scheduler (which is now the default). Hardware watchdogs are generally more reliable, especially when it comes to the system reset portion (it’s a bit hard for the kernel to reboot a system when the kernel itself is wedged). If you really want details about the software watchdog, Google for “FreeBSD SW_WATCHDOG”.
The methodology in FreeBSD is simple: kernel-level device drivers are available which communicate with the respective system’s hardware watchdog. An
ioctl(2) call is also provided so userland applications can periodically “pat” the watchdog to ensure the system doesn’t reboot. Yep, it’s that simple — and yep, the
ioctl call really is named
FreeBSD offers Intel ICHxx watchdog interrupt timer support via the ichwd(4) driver. This driver works on both i386 and amd64.
Our servers are all Supermicro-based, driven by Intel chipsets. Supermicro boxes offer a couple different watchdogs: one tied to a Winbond chip (which is what the watchdog settings in the BIOS control), and the unmentioned Intel ICHxx watchdog. In our case, we chose to use the Intel ICHxx chipset watchdog, since it’s reliable. I’m more than familiar with Winbond’s chips, but I’d much rather use something that’s universally available (since we have different Supermicro boxes with different ICHxx chips).
To test in real-time, I did the following on our RELENG_7 and RELENG_8 amd64 boxes:
releng7# kldload ichwd
releng7# dmesg
[...]
ichwd module loaded
ichwd0: ... on isa0
ichwd0: Intel ICH7 watchdog timer (ICH7 or equivalent)

releng8# kldload ichwd
releng8# dmesg
[...]
ichwd module loaded
ichwd0: ... on isa0
ichwd0: Intel ICH9R watchdog timer (ICH9 or equivalent)
Next was to enable and configure
watchdogd to utilise the driver. I added the following to
And started it via
/etc/rc.d/watchdogd start. Yep, that’s all!
How it works, and testing failure
Every 16 seconds, watchdogd, by default, will run a stat(2) check against /etc and make sure it gets back a successful result. If so, it sends the
WDIOCPATPAT ioctl call, which resets the watchdog timer (the machine continues to operate normally). Otherwise the ICHxx watchdog will eventually reset the machine. For what “eventually” means, see the ichwd(4) man page.
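The check-then-pat cycle described above can be sketched in a few lines of shell. This is not how watchdogd is implemented (the real daemon is written in C and talks to the kernel via ioctl); it is a rough model of one iteration, and both the function name and the idea of passing the check and the pat as commands are my own assumptions for illustration.

```shell
#!/bin/sh
# Sketch of one iteration of a check-then-pat loop.
# CHECK and PAT are arbitrary commands supplied by the caller; they stand in
# for the daemon's health check and its ioctl call respectively.

# check_and_pat CHECK PAT
# Runs CHECK; only if it succeeds, runs PAT and returns PAT's exit status.
# If CHECK fails, PAT is never run and the function returns 1, which models
# leaving the timer to run down.
check_and_pat() {
    check=$1
    pat=$2
    if sh -c "$check" >/dev/null 2>&1; then
        sh -c "$pat"
    else
        return 1
    fi
}
```

For example, check_and_pat 'test -d /etc' 'echo pat' prints "pat" on a healthy system, while check_and_pat 'false' 'echo pat' prints nothing and returns non-zero. A real loop would call this repeatedly with a sleep in between.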
Testing an actual failure is pretty simple:
killall -9 watchdogd.
It’s important that you send SIGKILL (-9) and not SIGINT or SIGTERM, otherwise the daemon shuts down the watchdog framework and the machine will remain up. Using SIGKILL ensures the daemon’s SIGINT and SIGTERM handlers don’t get called, allowing the ICHxx watchdog to kick in and reset the box.
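If the difference between the signals isn’t obvious, the following shell sketch shows it with a throwaway child process standing in for the daemon. Nothing here touches watchdogd or the watchdog itself; the function name, the marker files and the "disarmed" text are all made up for the demonstration. The point is only that a TERM handler gets to run, whereas SIGKILL cannot be caught, so the handler never runs.

```shell
#!/bin/sh
# Illustration only: a throwaway child process stands in for the daemon.

# demo_signal SIGNAL
# Starts a child with a TERM handler, sends it SIGNAL, then reports whether
# the handler got a chance to run.
demo_signal() {
    sig=$1
    marker=$(mktemp)   # the handler writes here if it runs
    ready=$(mktemp)    # the child writes here once its handler is installed

    sh -c '
        trap "echo disarmed > \"$1\"; exit 0" TERM
        echo ready > "$2"
        while :; do sleep 1 & wait $!; done
    ' sh "$marker" "$ready" >/dev/null 2>&1 &
    pid=$!

    # Wait until the child has installed its handler before signalling it.
    while [ ! -s "$ready" ]; do sleep 0.1; done

    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null || true

    if [ -s "$marker" ]; then
        echo "handler ran"
    else
        echo "handler skipped"
    fi
    rm -f "$marker" "$ready"
}
```

Calling demo_signal TERM reports that the handler ran (the equivalent of the daemon shutting the watchdog down cleanly), while demo_signal KILL reports that it was skipped.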
Q: I keep seeing these “watchdog timeout” messages coming from [some other driver, e.g. Ethernet], are these related to what your blog post is about?
Absolutely not. “Watchdog timeout” messages coming from a specific device indicate that communication between the driver and that device timed out, and that the hardware watchdog on the device (e.g. NIC) fired, resulting in the device itself being reset.
It’s the same concept as what’s described above, just implemented on a driver-to-device level. I hope this makes sense!
The only reason I’m mentioning this is because a certain forum post is completely wrong in its correlation between driver watchdog timeouts and actual hardware/system watchdogs.