FreeBSD and ZFS — hot-swapping SATA disks with AHCI

There’s been a lot of discussion on freebsd-stable@ lately regarding how to hot-swap SATA hard disks that are part of a ZFS pool on FreeBSD. I wanted to take the opportunity to demonstrate how this works on real server-class hardware.

Goal

We have a system that contains 3 SATA drives: one SSD (holding the OS on standard UFS2; it isn’t part of this procedure), and two 1TB drives (in a ZFS mirror). The system contains only 4 hot-swap drive bays, 3 of which are populated.

Without powering the system off, we want to replace the two existing 1TB drives with newer 1TB drives (which have more cache and offer SATA 3.x support, although the controller we’re using only supports up to SATA 2.x).

Hardware

Software

  • FreeBSD 8.1-PRERELEASE (amd64, RELENG_8 tag)
  • World/kernel last built 2010/07/13
  • ahci.ko used to gain AHCI via CAM + NCQ capability (a quick verification check follows this list)
  • Disk ada0 is standalone UFS2 and is our boot/main OS drive
  • Disks ada1 and ada2 make up a ZFS mirror pool called data
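
None of this is in the checklist above, but a quick way to sanity-check that the CAM-based ahci(4) driver actually attached (rather than the older ata(4)/ataahci path) is to look at the boot messages, the loaded modules, and the device list:

dmesg | grep -i ahci
kldstat | grep ahci
camcontrol devlist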

ZFS and system tuning details

Here’s the system in question. First, pool status/info:

icarus# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0

errors: No known data errors

Disks themselves:

icarus# camcontrol devlist
<INTEL SSDSA2M040G2GC 2CV102HD>    at scbus2 target 0 lun 0 (pass0,ada0)
<WDC WD1001FALS-00J7B1 05.00K05>   at scbus3 target 0 lun 0 (pass1,ada1)
<WDC WD1001FALS-00J7B1 05.00K05>   at scbus5 target 0 lun 0 (pass2,ada2)

ZFS and related tunings:

icarus# cat /boot/loader.conf
# Use the AHCI-to-CAM translation layer, rather than ataahci(4).
# This driver trumps ataahci.ko if both are loaded.
ahci_load="yes"

# Increase vm.kmem_size to allow for ZFS ARC to utilise more memory.
vm.kmem_size="2048M"
vfs.zfs.arc_max="1536M"

# Disable ZFS prefetching
# http://southbrain.com/south/2008/04/the-nightmare-comes-slowly-zfs.html
# Increases overall speed of ZFS, but when disk flushing/writes occur,
# system is less responsive (due to extreme disk I/O).
# NOTE: 8.0-RC1 disables this by default on systems <= 4GB RAM anyway
vfs.zfs.prefetch_disable="1"

# Disable UMA (uma(9)) for ZFS; amd64 was moved to exclusively use UMA
# on 2010/05/24.
# http://lists.freebsd.org/pipermail/freebsd-stable/2010-June/057162.html
vfs.zfs.zio.use_uma="0"

# Decrease ZFS txg timeout value from 30 (default) to 5 seconds.  This
# should increase throughput and decrease the "bursty" stalls that
# happen during immense I/O with ZFS.
# http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007343.html
# http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007355.html
vfs.zfs.txg.timeout="5"
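
Not part of the original config, but once the machine is back up you can confirm the tunables took effect; a minimal check, assuming these are all exported as sysctls on this release (sizes are reported in bytes):

sysctl vm.kmem_size
sysctl vfs.zfs.arc_max
sysctl vfs.zfs.prefetch_disable
sysctl vfs.zfs.txg.timeout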

Preparation quirks/gotchas

There are well-known shortcomings in ZFS with regard to replacement disks that contain fewer sectors than their predecessors. ZFS reports this via the error message “device is too small”.

It’s incredibly common today for hard disk vendors to ship drives of the same advertised capacity with differing sector counts, so this problem is absolutely infuriating for ZFS users who aren’t aware of it in advance. Here’s another heated discussion of the problem. The ZFS folks are definitely aware of it and plan to implement vdev shrinking in the future.

So before we begin any effort to upgrade our pool, we should check the sector counts of the old and new disks. If the new disks’ LBA counts are identical or larger, we shouldn’t run into any problems. If they’re smaller, we’ll need to completely destroy and recreate our pool.

Determining sector counts and serial numbers

Quite simple:

icarus# camcontrol identify ada1 | egrep "serial|LBA48"
serial number         WD-WMATV1803888
LBA48 supported       1953525168 sectors
icarus# camcontrol identify ada2 | egrep "serial|LBA48"
serial number         WD-WMATV1736548
LBA48 supported       1953525168 sectors

Please note that the command used with SATA disks under CAM is identify, not inquiry. The latter is for SCSI disks.

Based on this we can tell the older disks have a sector count of 1953525168 in LBA48 mode, which is what FreeBSD will use when available (any disk/controller made circa 2002 or later should be using LBA48).
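
If you want the sector count by itself (handy for scripting the old-vs-new comparison), something like the following works; the awk field position is an assumption based on the camcontrol output shown above:

camcontrol identify ada1 | awk '/LBA48/ {print $3}'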

As for the new drives: most folks would consider unwrapping a new drive, plugging it into Bay 2, rescanning the bus using camcontrol rescan all, and then executing camcontrol identify ada3. But there’s an even easier way: examine the label printed on the drive! Manufacturers still print the number of LBAs on the disk label itself. This means you don’t have to open the anti-static bag the drive comes in (e.g. no restocking fee on a product return, since the bag is never opened).
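
For reference, the rescan-based approach would look roughly like this (ada3 is an assumption; the device name depends on which bay/port the new drive attaches to):

camcontrol rescan all
camcontrol identify ada3 | egrep "serial|LBA48"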

Physical examination of a WD1002FAEX drive indicates the LBA count is 1953525168. Excellent! No surprises await us when we attempt to do the upgrade. This also significantly simplifies the whole procedure.

As for the drive serial numbers: we want to be absolutely certain that the OS sees a new disk after the swap, and the serial number is a guaranteed way to determine that. We don’t go by the drive model string, because if you’re replacing a disk with the same version/model, that string won’t change.
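
Here’s a small sketch of that serial check, reusing the awk extraction from earlier; the variable names are mine, not part of the original procedure:

# Record the serial before pulling the disk...
old_serial=$(camcontrol identify ada2 | awk '/serial number/ {print $3}')
# ...then, after the swap, compare against the replacement
new_serial=$(camcontrol identify ada2 | awk '/serial number/ {print $3}')
[ "$old_serial" != "$new_serial" ] && echo "OS sees a new disk"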

Replacing the disks

The Solaris ZFS Administrator’s Guide documents the exact procedure to follow for disk replacements. We’ll be following that procedure, but using the equivalent FreeBSD commands (e.g. camcontrol instead of cfgadm).

The procedure below is quite simple given our circumstances. I tend to write out procedures before attempting them, so that if any mistakes or unexpected scenarios occur, I can determine exactly what went wrong and where. A consolidated shell sketch of the per-disk steps appears after the list.

  • Perform backup of all data on ZFS pool. I cannot stress this enough! If anything goes awry, we can fall back to restoration from backups
  • zpool offline data ada2
  • Physically remove drive ada2 from Bay 3. FreeBSD will notice physical removal of the device; see dmesg for indicators
  • Physically insert a new WD1002FAEX disk into Bay 3 and wait 15 seconds for it to spin up. FreeBSD will notice physical insertion of the device without manual intervention; see dmesg for indicators
  • camcontrol identify ada2 — check the drive serial number (and in this case the drive model number) to make sure things are different
  • zpool online data ada2
  • zpool replace data ada2
  • zpool status
  • The pool should be resilvering (rebuilding) at this point, and we need to wait until the resilvering completes before continuing on with the next disk. Once the resilvering is done…
  • zpool offline data ada1
  • Physically remove drive ada1 from Bay 1
  • Physically insert a new WD1002FAEX disk into Bay 1 and wait 15 seconds for it to spin up
  • camcontrol identify ada1
  • zpool online data ada1
  • zpool replace data ada1
  • zpool status
  • Wait for the resilvering to finish and voilĂ , we’re done!
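
Here’s the same per-disk sequence condensed into a rough shell sketch. This is my own consolidation, not part of the original procedure; the resilver-polling loop in particular is an assumption (it greps the zpool status output, so adjust the string match to taste), and you still pause at the physical swap step:

#!/bin/sh
# Rough sketch: replace one member of the "data" mirror.
# Pass the device name (e.g. ada2) as the first argument.
disk=$1

zpool offline data "$disk"

echo "Pull the old $disk from its bay, insert the new drive,"
echo "wait ~15 seconds for spin-up, then press Enter."
read dummy

# Confirm the OS sees a different drive (serial/model should have changed)
camcontrol identify "$disk" | egrep "serial|LBA48"

zpool online data "$disk"
zpool replace data "$disk"

# Poll until resilvering completes (status string is an assumption)
while zpool status data | grep -q "resilver in progress"; do
        sleep 60
done
zpool status data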

Performing the procedure

Below is a link to the full output during the procedure. This was done on an actual production machine. There are some extra commands (mainly to check the status of things, in addition to SMART statistics of the new disks) between steps, but it should give you an idea of what to expect.
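
For the curious, the SMART checks on the new disks were done along these lines; a minimal example, assuming sysutils/smartmontools is installed:

smartctl -a /dev/ada2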

Conclusion

ZFS disk replacement and hot-swapping via AHCI work great. Furthermore, the new ahci.ko module appears to be reliable; it’s somewhat new code (compared to the older ataahci.ko, which doesn’t use CAM), but it’s good to see that it has been tested and proven reliable (thanks Alexander!). Going forward, all of our RELENG_8 systems will be using ahci.ko.

I should also note that the system was usable (read: responsive) during the hot-swap and resilvering phases. As expected, pool I/O during the resilvering phase was quite slow.

Q: You can solve the issue of sector counts not being identical by using slices instead.

A: But is this a good idea? I don’t know. It’s been stated on the Solaris lists that you don’t want to do this “due to write caching being disabled”, which conflicts with what the ZFS Best Practices wiki states. Whether this applies to FreeBSD is also unknown to me, but given the inconsistencies in the information available for Solaris, I would tend to stick with using whole disks. The extra administrative complexity isn’t worth it.
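
For completeness, the slice approach under discussion would look roughly like this on FreeBSD: create a partition slightly smaller than the disk and hand that partition, rather than the whole disk, to ZFS. The sizes here are illustrative, not a recommendation:

# Label the new disk and create a ZFS partition deliberately rounded down
# (size given in 512-byte sectors)
gpart create -s GPT ada2
gpart add -t freebsd-zfs -s 1953000000 ada2
# Replace into the pool using the partition instead of the raw device
zpool replace data ada2 ada2p1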