Using FreeBSD graid (GEOM RAID)

Quite some time ago I moved my home system away from ZFS and over to gmirror (GEOM mirror). There were many reasons for this, all of which are outside the scope of this post, so I will not go into them. Things were good with gmirror until, after rebuilding world somewhat recently, I began seeing these messages:

GEOM: ada1: the secondary GPT header is not in the last LBA.
GEOM: ada2: the secondary GPT header is not in the last LBA.
GEOM_MIRROR: Device mirror/gm0 launched (2/2).

My initial reaction was: “um, no shit, and there’s no way to solve that, so why are you bitching?”

This problem is a well-known (but not as much as it should be) design mistake on the part of gmirror: the last physical LBA (sector) of the provider is used to store the metadata. The issue with this is that GPT also uses the exact same design choice, so you end up with a conflict of whose data ends up in the last LBA of the drive: either gmirror’s metadata or the backup GPT header. You can’t have both, and the FreeBSD Handbook even says so. I will point out to readers that even FreeBSD kernel developers have commented on this awful design choice (and were subsequently shot down). Be sure to note that the kernel developer discussion took place in July 2009 — over 3 years ago.
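
To make the collision concrete, here is a small arithmetic sketch. It assumes the 512-byte logical sector size and the 1000204886016-byte media size that graid list reports for ada1 later in this post; the variable names are mine, purely for illustration.

```shell
# Sizes taken from the graid list output for ada1 later in this post.
mediasize=1000204886016   # bytes
sectorsize=512            # bytes per logical sector

# LBAs are numbered from 0, so the last one is (sector count - 1).
last_lba=$(( mediasize / sectorsize - 1 ))

# Both consumers of the raw disk want this single sector.
echo "backup GPT header wants LBA: $last_lba"
echo "gmirror metadata wants LBA:  $last_lba"
```

One sector, two owners: whichever is written last wins, which is why GEOM complains about the secondary GPT header.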

This becomes a serious problem with the introduction of 4K-sector disks and SSDs, where the only way to achieve proper partition alignment is to use GPT (compared to classic MBR). If you think I’m kidding, poke around with gpart(8) and its -a and -b flags using the MBR scheme sometime. Good luck aligning a partition to a proper 1MByte boundary.
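
For readers who want to check alignment by hand, a minimal sketch follows. The is_aligned helper is hypothetical (not part of gpart), and the two start sectors are illustrative: 63 is the traditional first-partition offset on MBR disks, and 2048 sectors of 512 bytes is exactly 1MByte.

```shell
# is_aligned START_SECTOR ALIGN_BYTES [SECTOR_BYTES]
# Succeeds when the partition's starting byte offset is a multiple of
# ALIGN_BYTES. SECTOR_BYTES defaults to 512 (the logical sector size).
is_aligned() {
    start_bytes=$(( $1 * ${3:-512} ))
    [ $(( start_bytes % $2 )) -eq 0 ]
}

# Compare the classic MBR start sector with a 1MByte-aligned one.
for start in 63 2048; do
    if is_aligned "$start" 4096; then
        echo "start sector $start: 4K-aligned"
    else
        echo "start sector $start: NOT 4K-aligned"
    fi
done
```

On a real system the start sector to feed in would come from the second column of gpart show.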

The “socially propagated” workaround for this cat-and-mouse problem is to create partitions on each disk and mirror those. Meaning: instead of using gmirror to mirror ada1 and ada2, you would use gpart(8) to create GPT partitions on ada1 and ada2, respectively, then use gmirror to mirror ada1p1 and ada2p1.
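
A rough sketch of that workaround, for reference. The run and mirror_partitions helpers are hypothetical names of mine, the freebsd-ufs partition type and 1MByte alignment are assumptions, and gm0 is simply the mirror name from the kernel messages above. It only prints the commands unless DRY_RUN=0 is set, because running them destroys whatever is on the disks.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a FreeBSD box to really run them (destructive!).
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# mirror_partitions DISK1 DISK2 MIRROR_NAME
mirror_partitions() {
    for disk in "$1" "$2"; do
        # New GPT table, then one 1MByte-aligned partition spanning the disk.
        run gpart create -s gpt "$disk"
        run gpart add -t freebsd-ufs -a 1m "$disk"
    done
    # Mirror the partitions rather than the raw disks, so gmirror's
    # metadata lands in the last sector of p1 instead of the disk.
    run gmirror label "$3" "${1}p1" "${2}p1"
}

mirror_partitions ada1 ada2 gm0
```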

However, what nobody bothers to mention is that there are major ramifications to this workaround. The biggest problem, in my opinion, is when a disk fails. No longer do you yank the old disk + put in a new one and watch the mirror get rebuilt. Instead, upon insertion of the new disk, the administrator now has to manually fiddle with gpart(8) to re-create the partition table (and it must be 100% identical to that of the working disk) then “poke” gmirror to get it to re-read the partition table. Only then will the array rebuild. Sorry, this is just not acceptable. It almost smells of ensuring job security (“yes boss, when a disk fails I need to be there to do some magic”). Completely unacceptable.
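
To show what that manual dance can look like, here is one possible sequence, sketched under assumptions: ada1 is the surviving disk, ada2 is the replacement, the mirror is gm0, and the mirrored partition is p1. The run and replace_mirror_disk helpers are hypothetical. Other sequences exist; this is not the only way to do it. As before, it only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# replace_mirror_disk GOOD_DISK NEW_DISK MIRROR_NAME
replace_mirror_disk() {
    # Clone the partition table from the healthy disk onto the new one
    # (-F wipes any existing table on the target first).
    run sh -c "gpart backup $1 | gpart restore -F $2"
    # Drop the dead component from the mirror's metadata.
    run gmirror forget "$3"
    # Add the new partition; gmirror then starts the rebuild.
    run gmirror insert "$3" "${2}p1"
}

replace_mirror_disk ada1 ada2 gm0
```

Three commands is not a lot, but it is three more than "swap the disk and walk away".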

Given this problem, today I decided to take the plunge and play with graid(8). I mentioned wanting to do so last year but never got around to it until now; with ada1 and ada2 being 4KByte sector disks, I had a lot more reason to consider it.

First and foremost, I want to make something very clear: users considering graid should make sure they are running RELENG_9 from source dated 2012/05/25 UTC or newer. There was a bug in the graid code where if a write failure happened (disk pulled, disk failure, etc.) the entire system could panic. This was fixed in RELENG_9 source dated roughly 2012/05/24 UTC. When 9.1-RELEASE comes out, this fix will obviously be in there too.

Enabling this capability was quite simple on my Supermicro X7SBA system (Intel ICH8R-based): I changed the PC BIOS from AHCI mode to RAID mode. Once I did that, the AHCI option ROM loaded (RAID mode uses AHCI), followed immediately by the RAID option ROM being loaded. Pressing Ctrl-I allowed me to define two disks as part of a RAID-1 array. Stripe size was not selectable; the option ROM appeared to pick 128KBytes (131072 bytes) on its own. More on stripe sizes later, as the terminology used is very confusing and makes things very easy to mess up. Great, seemed easy enough.

My kernel config included the GEOM_RAID option so I did not need to manually load geom_raid.ko myself. I let the system boot and sure enough the RAID array was tasted and a new device called /dev/raid/r0 was created automatically.

I should note that the GEOM_RAID layer also emits messages about something called Intel-cfe2a69f, which appears to be the internal reference name which graid relies on when referring to the array itself. The hexadecimal digits which come after the “Intel-” string appear to be randomly generated every time an array is created.

However, there was one very disheartening flaw: upon the array being discovered, according to graid status, the array began rebuilding for no reason. The array was labelled DEGRADED; disk ada1 was in perfect shape but ada2 was being rebuilt. Huh? This was a brand new array.

Rather than let 1TB of zeros get mirrored (there’s no point to that), I began fooling around with graid. The first thing I did was attempt to get the rebuild to stop. I did a graid list to look at what the Intel RAID BIOS had created on its own; this is where I found “Strip: 131072” — note Strip is not the same thing as Stripesize.

I decided to take aggressive action and completely destroy the array I had just created via the native Intel RAID BIOS. This did the trick:

graid delete Intel-cfe2a69f

Next, I opted to create the array from within FreeBSD rather than via the RAID BIOS. These sorts of things need to work reliably from within the OS, so I figured I might as well test them out while I had the chance. I issued the following command:

graid label -s 131072 Intel data RAID1 ada1 ada2

I explicitly specified -s 131072 because graid defaults to a “strip” of 65536 (per the man page), and I wanted the value to match what the Intel RAID BIOS had picked.

On the console, messages from GEOM_RAID showed up, and I had a brand new array available — and this time, no array rebuilding. graid list showed everything identical, and graid status showed everything in an optimal state. The internal reference name was different (this time “Intel-1512a6e3”) but so what. Output:

# graid list
Geom name: Intel-1512a6e3
State: OPTIMAL
Metadata: Intel
Providers:
1. Name: raid/r0
   Mediasize: 1000204881920 (931G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   Subdisks: ada1 (ACTIVE), ada2 (ACTIVE)
   Dirty: Yes
   State: OPTIMAL
   Strip: 131072
   Components: 2
   Transformation: RAID1
   RAIDLevel: RAID1
   Label: data
Consumers:
1. Name: ada1
   Mediasize: 1000204886016 (931G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r0(data):0@0
   State: ACTIVE (ACTIVE)
2. Name: ada2
   Mediasize: 1000204886016 (931G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r0(data):1@0
   State: ACTIVE (ACTIVE)

# graid status
   Name   Status  Components
raid/r0  OPTIMAL  ada1 (ACTIVE (ACTIVE))
                  ada2 (ACTIVE (ACTIVE))
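
Since Strip and Stripesize sit a few lines apart in that listing and are easy to mix up, a small filter can pull out just the one set via -s. This is a sketch: strip_of is a hypothetical helper name, and the sample input is copied from the graid list output above so the snippet runs without a graid array present.

```shell
# Print the value of the "Strip:" field (the size given to graid label -s),
# ignoring the similarly named "Stripesize:" field.
strip_of() {
    awk '$1 == "Strip:" { print $2 }'
}

# Sample lines copied from the graid list output above; on a live
# system this would be: graid list | strip_of
strip_of <<'EOF'
   Stripesize: 4096
   Stripeoffset: 0
   Strip: 131072
EOF
```

Matching on the whole first field rather than grepping for "Strip" is the point here, since a plain grep would also match Stripesize and Stripeoffset.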

I then made a filesystem on /dev/raid/r0 by doing newfs -U /dev/raid/r0 and mounted it. I also added a relevant line to /etc/fstab.

I did some basic sequential I/O tests (reads and writes), to make sure there weren’t any performance problems since the ada1 and ada2 disks were 4KByte sector drives. Initial sequential write I/O was around 157MBytes/second, which is about right for these models of disks. Fantastic.

I then rebooted the system (always good to make sure things come up cleanly). Output from the kernel:

GEOM_RAID: Intel-1512a6e3: Array Intel-1512a6e3 created.
GEOM_RAID: Intel-1512a6e3: Disk ada1 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-1512a6e3: Subdisk data:0-ada1 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-1512a6e3: Disk ada2 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-1512a6e3: Subdisk data:1-ada2 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-1512a6e3: Array started.
GEOM_RAID: Intel-1512a6e3: Volume data state changed from STARTING to OPTIMAL.
GEOM_RAID: Intel-1512a6e3: Provider raid/r0 for volume data created.

Excellent.

I then began restoring data from my backup drive to the array. This is when things started to concern me.

Stability-wise everything was fine. Disk I/O performance was also fine (monitoring using gstat -I500ms). What concerned me was how the system “felt” during heavy I/O. During an rsync -avH /backups/mirror/ /array/ I noticed that any other disk I/O (reading or writing) from/to the /array filesystem was extremely slow. Things would block (wait) for very long periods of time before moving forward. Things like mutt -y (which loads 19 mbox mailboxes, totalling around 30MBytes on disk) took 8-9 full seconds. This behaviour is easily reproducible. But when the rsync wasn’t running things were speedy.

So basically interactivity/usability during heavy I/O seems worse than with gmirror, but the overall disk I/O speeds are identical. I’m not sure if this is the result of using BIOS-level RAID or not, nor do I have any good way to determine “where” the slowdown is happening within the kernel. I imagine there will be some who will claim this is due to the lack of a disk I/O scheduler in FreeBSD (and don’t mention gsched; that thing causes more problems than it solves, for example how it causes sysinstall to segfault due to what it does to kern.geom.conftxt). I do not believe the lack of a disk I/O scheduler is the cause; I believe there is a different explanation.

I have not tested things like yanking disks while I/O is occurring and things of that nature.

Anyway, that’s my experience with graid as of this writing. Big thanks to Alexander Motin, Warner Losh, and the iXsystems folks for graid, as it’s something FreeBSD has needed for a very, very long time.