Testing out FreeBSD 8.0-RC2

Those who haven’t read about my 8.0-RC1 experience (see “Testing out FreeBSD 8.0-RC1”) should do so first.

Basically, my experience with 8.0-RC2 was identical to that of RC1, except some of the bugs/issues I experienced are now gone (hooray!).

Fixes/improvements:

  • The issue I experienced with the Boot Manager selection phase of installation has been fixed. Also, Standard is now the default option (first choice).
  • The “geometry does not match label” problem has been addressed by fixing the FreeBSD slice editor in sysinstall/sade; see below.
  • libdisk has been modified to work properly with GEOM; the FreeBSD slice editor links to libdisk — thanks to Randi Harper for tracking this commit down! Commonly when installing FreeBSD on a box, people go into the slice editor and press “a” to use the entire disk. Previously, users would end up with a disk where the first 63 sectors were unused (probably for the PBR/MBR and overall alignment), then the FreeBSD slice, and a third “unused” portion of the disk (which, if I remember correctly, was done solely for alignment reasons). Example:
    Offset       Size(ST)        End     Name  PType       Desc  Subtype    Flags
    
             0         63         62        -     12     unused        0
            63  390716802  390716864    ad8s1      8    freebsd      165    A
     390716865       5103  390721967        -     12     unused        0
    

    Starting with RC2, this is what you’ll see:

    Offset       Size(ST)        End     Name  PType       Desc  Subtype    Flags
    
             0         63         62        -     12     unused        0
            63  390721905  390721967    ad8s1      8    freebsd      165    A
    

    Note the lack of the last “unused” section.

    Sadly, this also means people will need to reinstall FreeBSD (specifically, deleting the slice and re-creating it) to benefit from this. As far as I know, you can’t fix this without a full reinstallation.

  • The EOF issue for ttys (re: ^D being shown) has been fixed and committed to CURRENT (FreeBSD 9.0), but hasn’t been MFC’d to RELENG_8 yet. Yes, it’s scheduled to be (in about 2 weeks). Big thanks to Ed Schouten for fixing this!
  • There were some ZFS commits which happened between RC1 and RC2 which may indicate that the ARC exhausting all available kmem is no longer possible. I have not been able to confirm/deny whether this fix works, but looking at the code, it may be sufficient. I’d need to get in touch with Kip Macy to confirm/deny.

Issues that are still pending:

  • bsdlabel still behaves incorrectly (“Class not found”). Instead, users should use gpart to write the bootstraps as follows: gpart bootcode [disk], where [disk] is ad4 or similar. Note that you pick the disk itself now, not the slice like in bsdlabel (unless you were using dangerously dedicated disks :-) ).
  • The ZFS notice pertaining to vfs.zfs.prefetch_disable when the system has less than 4GB RAM available has been re-worded again, but still is vague/unclear. A little bit of ego here — the person committing these changes should really consider changing the message to what I proposed.
  • I still haven’t received a reply to my request for clarification on ZFS stabilisation. Is /boot/loader.conf tuning for kmem-related parameters still required? We still need an official statement on this matter.

I also want to take a moment to send a shout-out to John Baldwin, who has been working incredibly hard on the FreeBSD kernel (specifically VM and ACPI) over the past 4 weeks. John, I’ve seen/followed your commits, and I appreciate the improvements! Thank you!

UNIX mail format annoyances

For many years now I’ve been dealing with an ongoing issue which to this day has no real solution: mail clients reading classic UNIX mailboxes (called mbox) compare the file’s mtime to its atime to determine whether there’s new mail inside the mailbox. If the mtime is greater than the atime, there’s new mail; if the mtime is smaller than the atime, the mail has already been read (or there is no new mail). “Classic mail spools” (e.g. /var/mail or /var/spool/mail) are mbox.
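To make the check concrete, here is a minimal sketch of the comparison, written in Python purely for illustration. The function name is hypothetical, and real clients such as mutt implement this in C with more care; this only shows the mtime-versus-atime idea.

```python
import os


def mbox_has_new_mail(path):
    """Return True when the mbox file was modified after it was last read."""
    st = os.stat(path)  # stat() itself does not touch the atime
    return st.st_mtime > st.st_atime
```

Anything that reads the mailbox after the last delivery (a backup job, for example) pushes the atime past the mtime, at which point this check returns False even though nobody has actually looked at the mail.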

Why is this a problem? Because those of us who use mutt/alpine/etc. on our UNIX machines, who also do backups using things like tar/cp/rsync (more on rsync in a moment) end up with mailboxes with a lost/clobbered atime after the backup takes place. The end result: our mail clients no longer tell us there’s new mail in that mailbox, which can be detrimental in many respects.

The most common rebuttal is “shut up and use Maildir”. What Maildir advocates don’t care to acknowledge is that there are many problems with the Maildir concept, particularly when used on a filesystem like ZFS. With classic mbox, your multi-megabyte mailboxes load quickly. Maildir, however, stores each mail in its own file, so the mail client has to open and read every one of those files and ends up taking forever to load. ZFS does not perform well when it comes to massive numbers of small files.

UFS/UFS2, ext2fs/ext3fs, and other filesystems don’t have this problem, but let’s pull our heads out for a moment (since tunnel vision/ostrich syndrome is what got us here in the first place!) — we’re entering year 2010 and ZFS is already being used heavily by Solaris/OpenSolaris and FreeBSD users across the globe; ZFS is here to stay, end of discussion. There are some proposed solutions such as making use of ZFS’s semi-new L2ARC to add an additional layer of caching using dedicated low-latency devices (specifically SSDs), but there’s been no actual evidence this improves things with Maildir. And besides, who in their right mind is going to go out and drop hundreds of dollars on an Intel X25-M per machine just to solve this problem? Seriously.

And let’s not forget administrators who mount their filesystems with the noatime mount flag for added performance benefits, especially on a journalled filesystem.

One workaround proposed for mutt users involves recompiling mutt to use Oracle/SleepyCat DB, GDBM, or Tokyo Cabinet to maintain a cache of mailbox headers (using the header_cache directive), thus speeding up the process. Does this help? Yes, there’s a decent improvement, but anyone who uses this method (such as me) can tell you that it’s still nowhere near as fast as classic mbox, especially when you’ve got a mailbox with a couple hundred new mails in it.

Does the saga end here? Not even close.

There’s a new mailbox format, called MIX, which is being used within alpine. This format is more or less a combination of mbox and Maildir, and performs much better than Maildir. Sounds great, right? Except those of us who use mutt are out of luck — unsupported, and there’s been absolutely no discussion of it since February 2007. Even the author of mutt, Michael Elkins, had nothing useful to say other than snide comments. Oh, and MIX isn’t supported in procmail or Sieve either — double whammy. But MIX does sound like the way to go — too bad it isn’t getting the attention it should.

Some administrators using ZFS are using ZFS snapshots to do their backups instead of something like rsync, which is great except that they’re hit-or-miss (reliability-wise) on FreeBSD — or at least that’s what I last read 6-9 months ago — while rsync is filesystem-independent. Most folks I know who run into snapshot problems revert back to rsync.

So what now? With all the above in mind, I decided to poke at rsync, because there have been many discussions in the past on the mailing lists about getting rsync to preserve file atime. rsync out-of-the-box will preserve mtime when using the --times flag (which -a implies); it does not preserve ctime, as the tests below show. However, there’s a patch called atimes.diff which comes with the rsync-patches tarball that provides a --atimes flag that supposedly solves this. Sounds great… except there’s one problem…

The flag does cause the atimes of the source file to be copied to the destination, but the atimes of the source file are lost! And here’s a more recent confirmation.

If that’s not enough, here’s final confirmation. Note that I’m using non-zero-byte files intentionally; rsync behaves differently when the files are zero bytes.

rsync -a:

$ echo "hello" > source
$ stat -x source
Access: Wed Oct 28 06:27:05 2009
Modify: Wed Oct 28 06:27:05 2009
Change: Wed Oct 28 06:27:05 2009
$ rsync -a source dest
$ stat -x source
Access: Wed Oct 28 06:27:29 2009
Modify: Wed Oct 28 06:27:05 2009
Change: Wed Oct 28 06:27:05 2009
$ stat -x dest
Access: Wed Oct 28 06:27:29 2009
Modify: Wed Oct 28 06:27:05 2009
Change: Wed Oct 28 06:27:29 2009
$ rm source dest

Above, we see that after the rsync, the atime in the source file is lost, and the ctime in the destination file does not match that of the source — only the mtime is retained.

rsync -a --atimes:

$ echo "hello" > source
$ stat -x source
Access: Wed Oct 28 06:32:50 2009
Modify: Wed Oct 28 06:32:50 2009
Change: Wed Oct 28 06:32:50 2009
$ rsync -a --atimes source dest
$ stat -x source
Access: Wed Oct 28 06:34:06 2009
Modify: Wed Oct 28 06:32:50 2009
Change: Wed Oct 28 06:32:50 2009
$ stat -x dest
Access: Wed Oct 28 06:32:50 2009
Modify: Wed Oct 28 06:32:50 2009
Change: Wed Oct 28 06:34:06 2009

Above, we see the atime and the mtime of the source file are retained in the destination. However, again, the atime of the source file is lost and the ctime doesn’t match that of the source.

cp -p:

$ echo "hello" > source
$ stat -x source
Access: Wed Oct 28 06:37:56 2009
Modify: Wed Oct 28 06:37:56 2009
Change: Wed Oct 28 06:37:56 2009
$ cp -p source dest
$ stat -x source
Access: Wed Oct 28 06:38:27 2009
Modify: Wed Oct 28 06:37:56 2009
Change: Wed Oct 28 06:37:56 2009
$ stat -x dest
Access: Wed Oct 28 06:37:56 2009
Modify: Wed Oct 28 06:37:56 2009
Change: Wed Oct 28 06:38:27 2009

With cp -p, we see identical behaviour to that of rsync -a --atimes.

Some may be wondering: “is it even possible to solve this problem?” Of course it is. The logic flow should be pretty obvious at this point:

  1. stat(2) or fstat(2) the source file and save (in memory) the atime and mtime. Neither call modifies the atime
  2. Copy the source file to the destination file
  3. Set the atime and mtime of the destination file using utimes(2) with the previously-obtained values. The ctime cannot be set this way; the kernel updates it whenever the inode changes
  4. Set the atime and mtime of the source file using utimes(2) with the previously-obtained values
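The steps above can be sketched as follows. This is an illustration in Python rather than a patch for rsync or cp; the function name is hypothetical, and it restores only the atime and mtime, on the assumption that the ctime cannot be set from userland at all.

```python
import os
import shutil


def copy_preserving_times(src, dst):
    """Copy src to dst, then put the original atime/mtime on both files."""
    st = os.stat(src)                      # step 1: does not modify the atime
    times = (st.st_atime_ns, st.st_mtime_ns)
    shutil.copyfile(src, dst)              # step 2: reading src may bump its atime
    os.utime(dst, ns=times)                # step 3: destination gets the saved values
    os.utime(src, ns=times)                # step 4: undo the atime change on the source
    # ctime is deliberately absent: utime()/utimes() cannot set it, and the
    # calls above update the ctime of both files as a side effect.
```

One caveat with this approach: it is not atomic. If new mail is delivered between step 1 and step 4, resetting the source’s mtime would hide that delivery, so a real implementation would need locking or a re-check before step 4.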

You can accomplish the same thing with touch.

And let’s not forget that FreeBSD lacks the O_NOATIME GNU extension for open(2), which was proposed in 1998.

So is there a solution to all of this? As far as I’ve been able to tell, no, there isn’t. Using filesystem-level snapshots appears to be the only way to “solve” this issue. I’d be much happier if the --atimes patch for rsync did what it was supposed to… but it’s 23KB, and I’m not familiar with the rsync code (it’s not as black-and-white as one may think).

We UNIX folks should be ashamed of this whole debacle. There isn’t a better way to say it: what a clusterfuck.

Testing out FreeBSD 8.0-RC1

EDIT: Those interested in the upcoming release of FreeBSD 8.0 should read both the below, as well as my Testing out FreeBSD 8.0-RC2 post (which notes that many, but not all, of these problems have been fixed).

Yesterday I took the plunge and upgraded my home FreeBSD amd64 box from RELENG_7 to FreeBSD 8.0-RC1, which going forward I will refer to as RELENG_8 (yes, it has been tagged!). I did a complete reinstall, like I always do when migrating between major FreeBSD releases. Said box consists of a Supermicro X7SBA motherboard, Intel Core2Duo E8400 CPU, 4GB of RAM, and 3 SATA disks connected via the on-board Intel ICH9 + AHCI — one for the OS, and two in a ZFS mirror pool. Relevant dmesg information:

CPU: Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (2992.52-MHz K8-class CPU)
real memory  = 4294967296 (4096 MB)
avail memory = 4112478208 (3921 MB)
atapci0: <Intel ICH9 SATA300 controller> port 0x1c50-0x1c57,0x1c44-0x1c47,0x1c48-0x1c4f,0x1c40-0x1c43,0x18e0-0x18ff mem 0xdc000800-0xdc000fff irq 17 at device 31.2 on pci0
atapci0: [ITHREAD]
atapci0: AHCI called from vendor specific driver
atapci0: AHCI v1.20 controller with 6 3Gbps ports, PM supported
ata4: <ATA channel 2> on atapci0
ata4: [ITHREAD]
ata5: <ATA channel 3> on atapci0
ata5: [ITHREAD]
ata7: <ATA channel 5> on atapci0
ata7: [ITHREAD]
ad8: 190782MB <WDC WD2000JD-00HBB0 08.02D08> at ata4-master SATA150
ad10: 953869MB <WDC WD1001FALS-00J7B1 05.00K05> at ata5-master SATA300
ad14: 953869MB <WDC WD1001FALS-00J7B1 05.00K05> at ata7-master SATA300

And ZFS-related details:

  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad14    ONLINE       0     0     0

errors: No known data errors

The first thing worth noting is that I performed the installation entirely using the FreeBSD 8.0-RC1 amd64 memstick image on a SanDisk Cruzer Micro 2GB USB drive. If you’re going to try this, make sure you have a 2GB or larger USB drive, since the memstick image is larger than 1GB.

Writing the .img file to the USB drive required the use of another FreeBSD box (someone is going to have to address that fact eventually), and was achieved per Ken Smith’s original instructions:

dd if=8.0-RC1-amd64-memstick.img of=/dev/da0 bs=10240 conv=sync

The entire dd took about 4 minutes. I then booted it directly without any issues. Installation went as expected, with the exception of choosing a different installation medium than usual; there’s a new USB menu item near the bottom of the installation medium selection list. I should also note that I deleted the existing FreeBSD partition and re-created it during the sysinstall phase, and during the Boot Manager selection phase, I chose Standard like I always do. You’ll understand why I’ve noted these two things in a moment.

After rebooting + booting off the main OS hard disk, the first thing I saw which was different/anomalous was that I was being shown the F1/F5/F6 FreeBSD boot manager menu — as if I had selected BootMgr and not Standard. Options shown were F1 for FreeBSD, F5 for Floppy, and F6 for PXE (that’s a new one!). So there’s definitely a bug/regression somewhere with regards to the boot manager you choose; or maybe it was because I was installing from a USB drive? Not sure.

The FreeBSD box booted fine, but the following kernel message caught my eye:

GEOM: ad8s1: geometry does not match label (255h,63s != 16h,63s).

This indicated that what sysinstall wrote to the actual on-disk BSD label, as far as drive geometry went, didn’t match what GEOM expected — GEOM expecting 16 heads, the drive label containing 255 heads. I’m not using GPT (at least not knowingly; if sysinstall does it without telling you, then someone needs to work out where the actual problem lies). I first looked at gpart show ad8 and gpart show ad8s1 to see what it claimed:

# gpart show ad8
=>       63  390721905  ad8  MBR  (186G)
         63  390716802    1  freebsd  [active]  (186G)
  390716865       5103       - free -  (2.5M)

# gpart show ad8s1
=>        0  390716802  ad8s1  BSD  (186G)
          0    4194304      1  freebsd-ufs  (2.0G)
    4194304   16777216      2  freebsd-swap  (8.0G)
   20971520   33554432      4  freebsd-ufs  (16G)
   54525952   16777216      5  freebsd-ufs  (8.0G)
   71303168  319413634      6  freebsd-ufs  (152G)

Next, and more conclusive, I used bsdlabel -e -A ad8s1 and this is what I got:

# /dev/ad8s1:
type: ESDI
disk: ad8s1
label:
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 24321
sectors/unit: 390721968
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # milliseconds
track-to-track seek: 0  # milliseconds
drivedata: 0

8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:  4194304        0    4.2BSD        0     0     0
  b: 16777216  4194304      swap
  c: 390716802        0    unused        0     0         # "raw" part, don't edit
  d: 33554432 20971520    4.2BSD        0     0     0
  e: 16777216 54525952    4.2BSD        0     0     0
  f: 319413634 71303168    4.2BSD        0     0     0

So yes, sysinstall is definitely doing the Wrong Thing(tm) here. One of the biggest problems with this is that users cannot easily fix this — it requires booting a fixit or LiveFS image and editing the disk label, as well as users having to re-calculate sectors/cylinder and cylinders by hand. And no, I’m not talking out of my ass. Also note that the post to freebsd-current is dated March 2009 — 7 months ago. So this issue has existed for quite some time without proper attention.
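For anyone who does end up recalculating those label fields by hand, the arithmetic is simple enough to sketch. The helper below is hypothetical (it is not part of bsdlabel or gpart); the sector count and geometries are the ones shown in the label and the kernel message above.

```python
def chs_geometry(total_sectors, heads, sectors_per_track):
    """Return (sectors_per_cylinder, whole_cylinders, leftover_sectors)."""
    sectors_per_cylinder = heads * sectors_per_track
    cylinders, leftover = divmod(total_sectors, sectors_per_cylinder)
    return sectors_per_cylinder, cylinders, leftover


# Values taken from the label above: 390721968 sectors, 63 sectors/track.
print(chs_geometry(390721968, 255, 63))  # (16065, 24321, 5103)
print(chs_geometry(390721968, 16, 63))   # (1008, 387621, 0)
```

With 255 heads the disk does not divide evenly into cylinders, and the 5103 leftover sectors match the size of the trailing free region gpart reports. That suggests (my inference, not something I’ve confirmed in the code) that the slice was being rounded down to a cylinder boundary under the 255-head geometry.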

Upon exiting the editor session inside of bsdlabel (using :q!), I was given the following error:

bsdlabel: partition c doesn't cover the whole unit!
bsdlabel: An incorrect partition c may cause problems for standard system utilities
bsdlabel: Class not found
re-edit the label? [y]: n

Whoa whoa whoa, even more insanity going on here! Also, what’s with the Class not found error? Wow, this is pretty jacked, and apparently I’m not the only one who’s noticed. Note this post is dated July 2009 — 3 months ago. The easy solution seems to be to use gpart(8) to create all of the slices… except it’s obvious no one has fixed sysinstall to do this. There’s also a patch mentioned which fixes the problem, but that obviously wasn’t committed before the RELENG_8 tagging, nor backported since.

But the root of the problem does appear to be sysinstall not doing the Right Thing(tm) any longer.

Next up was my attempt to fix the Boot Manager oddities. Historically, re-writing the boot blocks on a disk consisted of doing the appropriate equivalent of bsdlabel -B ad8s1. However, I was greeted with the exact same “Class not found” error as above. Hmmm… This doesn’t bode well. Presumably I can use gpart(8) to re-write the boot blocks, but given that GEOM is complaining about disk geometry errors, I don’t dare mess with it.

All of this is a pretty major bug. The kernel message is going to catch a lot of user attention if it’s not fixed by the time 8.0-RELEASE is announced. I plan on sending Robert Watson and Ken Smith an Email about the issue after I get done writing this.

At this point I decided to make appropriate modifications to /etc/rc.conf, specifically the addition of zfs_enable="yes", so that I could get access to my ZFS filesystems. After doing so, and running /etc/rc.d/hostid start then /etc/rc.d/zfs start, I was greeted with the usual ZFS kernel messages — but with an unexpected surprise:

ZFS NOTICE: system has less than 4GB and prefetch enable is not set... disabling.

Given how familiar I am with FreeBSD and ZFS at this point, this message caught me off-guard.

First of all, grammatically this sentence is confusing as hell. There is no “prefetch enable” tunable; the tunable is actually called vfs.zfs.prefetch_disable, and it defaults to 0 (off, i.e. prefetching enabled), so why would I explicitly enable something which is enabled by default? And why’s it getting disabled? Secondly, my system has 4GB of RAM installed… so what’s going on here?!

I dug around in the relevant CVS commit logs and found numerous changes to this file, specifically the message that was printed. Apparently the existing message was reviewed by 5 separate people (revision 1.23) to “Improve wording”.

#ifdef _KERNEL
        if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable))
                prefetch_tunable_set = 1;

...
        if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) &&
            prefetch_tunable_set == 0) {
                printf("ZFS NOTICE: system has less than 4GB and prefetch enable is not set"
                    "... disabling.\n");
                zfs_prefetch_disable=1;
        }
#endif

Ahh, now we have a much better idea of what’s going on. There are two reasons why this message got printed on my machine:

1) I had not done any tuning of /boot/loader.conf at this point, so vfs.zfs.prefetch_disable hadn’t been set. The above code basically says “if someone has administratively set vfs.zfs.prefetch_disable to something in loader.conf, set prefetch_tunable_set to 1”. You can set the tunable to whatever you want (enabled or disabled) and the message won’t get printed. If you don’t set the tunable, the following applies:

2) physmem is actually the amount of memory, in pages, that’s available to the kernel when it loads. The multiplication is effectively hw.availpages * hw.pagesize. The 1ULL << 32 expression may look ugly, but it’s simply a bit shift that evaluates to 2^32, i.e. 4294967296.

Let’s work out the math:

# sysctl hw.pagesize hw.availpages
hw.pagesize: 4096
hw.availpages: 1046201
# expr 1046201 "*" 4096
4285239296

4285239296 is indeed less than 4294967296. Wait a minute… where’s that extra memory going?

Well, it’s going to two places on the X7SBA: 1) on-board video (which has an 8MB framebuffer), and 2) the AHCI BIOS which takes up an unknown amount of RAM, but I’d guess about 1-2MB. So let’s do the math:

# expr 4294967296 - 4285239296
9728000
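Putting the two conditions together, the decision the kernel makes can be modelled roughly like this. It is a Python paraphrase of the C excerpt above, not the kernel code itself; the function and constant names are made up for the example, and the page count and page size are the sysctl values from my machine.

```python
FOUR_GIB = 1 << 32  # same value as the kernel's (1ULL << 32): 4294967296


def prefetch_auto_disabled(avail_pages, page_size, tunable_set):
    """Mirror the quoted check: disable prefetch only when the admin has not
    set vfs.zfs.prefetch_disable and usable memory is below 4 GiB."""
    usable_bytes = avail_pages * page_size
    return (not tunable_set) and usable_bytes < FOUR_GIB


usable = 1046201 * 4096
print(usable)                                        # 4285239296
print(FOUR_GIB - usable)                             # 9728000
print(prefetch_auto_disabled(1046201, 4096, False))  # True
print(prefetch_auto_disabled(1046201, 4096, True))   # False
```

The last two lines show why setting the tunable to anything at all in loader.conf silences the notice: the memory comparison is only consulted when the tunable is unset.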

With all of this information kept in mind, the kernel message really should be re-worded to say the following:

ZFS NOTICE: System has less than 4294967296 bytes (4GB) of usable memory,
ZFS NOTICE: and vfs.zfs.prefetch_disable has not explicitly been defined
ZFS NOTICE: in loader.conf.  Setting vfs.zfs.prefetch_disable="1"...

I’m also questioning the logic behind why prefetching is disabled on systems with less than 4GB of available memory; I’d like to know what the reasoning is there. Is it in regards to stability? Performance? I don’t know. I can’t find an answer on the mailing lists either.

Finally, I found an unexpected oddity with the new tty/pty/pts code with regards to EOF. All other operating systems, including RELENG_7 and earlier, behave as follows when EOF is pressed on a terminal. This is regardless of shell, by the way:

bash$ cat
{press Control-D here}bash$

While on RELENG_8, the literal Control-D (^D) character is shown on-screen:

bash$ cat
{press Control-D here}^Dbash$

I’ve already mailed Ed Schouten about this, and he agrees it’s a bug which he’ll work on fixing, hopefully tonight.

Everything else past this point was peachy keen. No odd problems building ports, no system lock-ups or odd experiences, and so on. It’s all worked great so far. I’m looking forward to upgrading our production servers to RELENG_8 when it comes out.

Posted in FreeBSD, ZFS. 1 Comment »

FreeBSD and ZFS — is it truly stable?

There’s an “age old question” that has been floating around with regards to ZFS on FreeBSD — is it stable? “Stable” in this case means: do I risk losing my data, will it cause kernel panics or other oddities, and do I need to tune it?

The answer, still, may be yes.

I’ve taken the initiative to get an official response to these types of questions, specifically with regards to kernel panics. I’m incredibly surprised no one (not even the user community) has responded at this point. It’s not a trick question either; FreeBSD users really do need an answer to this.

People are continually comparing FreeBSD’s ZFS to that of Solaris 10 and OpenSolaris’ ZFS. Given that my day job involves heavy use of Solaris 10 on massive numbers of servers across the United States, I can safely say without a doubt ZFS on Solaris behaves better and won’t crash your system due to kernel memory exhaustion.

FreeBSD and ZFS — NFS bug fixed on RELENG_7 amd64

I just saw this commit come through for RELENG_7:

 Edit src/sys/nfsserver/nfs_serv.c
  Add delta 1.174.2.8 2009.07.01.12.44.23 avg

The CVS commit log indicates this fixes a bug where NFS is being used on ZFS v13 exported filesystems, and the system mounting the NFS share attempts to open(2) with flags O_CREAT and O_EXCL set. The file is created — 0 bytes in size, with mode 0000 — yet the operation returns EIO. This is pretty major:

src/sys/nfsserver/nfs_serv.c
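For reference, the client-side operation in question is just an exclusive create. A minimal sketch in Python follows; the helper name is hypothetical, and the failure mode described in the comments is taken from the commit log rather than something this snippet reproduces on a healthy filesystem.

```python
import os


def create_exclusive(path, mode=0o644):
    """Create path only if it does not already exist; return the open fd."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
```

On a working filesystem the first call returns a file descriptor and a second call for the same path raises FileExistsError. Per the commit log, on an affected NFS-exported ZFS v13 filesystem the call would instead fail with EIO (an OSError in Python) while still leaving a zero-byte, mode 0000 file behind.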

I also enjoyed reading the PR for this bug, where some developers made some amazing statements (my favourite being “use UFS2 instead of ZFS, use cp/rm instead of mv, don’t use NFS”):

http://www.freebsd.org/cgi/query-pr.cgi?pr=135412

This further validates my concern that there isn’t enough QA being done prior to code being committed to the STABLE branches. I guess no one tested ZFS v13 filesystems being exported via NFS prior to the v13 commit?

Believe me, I’m thankful that ZFS v13 is now part of STABLE — sincerely I am — but my concern isn’t limited to ZFS: it applies to FreeBSD as a whole.

FreeBSD and ZFS — more performance quirks

I’ve been trying to work out the source of “strange” performance-related issues when using ZFS on FreeBSD. The situation is reproducible, but not always, and it seems the complex inner nature of ZFS is probably contributing to the problem.

The problem I’m trying to track down is what causes fread() operations, with 4KB reads, to suddenly start taking an absurd amount of time (0.2 seconds, for example), on a ZFS raidz1 pool with 4 disks all capable of operating at 80-100MB/sec (read and write).
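A rough way to catch those stalls is to time every 4KB read individually and flag the outliers. The sketch below is in Python and uses unbuffered reads rather than stdio’s fread(), so it is only an approximation of what the affected programs do; the function names and the 0.2 second default threshold are assumptions chosen to match the numbers above.

```python
import time


def time_block_reads(path, block_size=4096):
    """Read path in block_size chunks; return the seconds each read took."""
    durations = []
    with open(path, "rb", buffering=0) as f:  # unbuffered: one read(2) per call
        while True:
            start = time.perf_counter()
            chunk = f.read(block_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            durations.append(elapsed)
    return durations


def slow_reads(durations, threshold=0.2):
    """Return (index, seconds) for every read at or above threshold."""
    return [(i, d) for i, d in enumerate(durations) if d >= threshold]
```

Running time_block_reads() over the same file twice and comparing the slow_reads() results would show whether the second pass is being served from cache or is stalling again.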

My understanding is that the slowness is expected initially as ZFS’s ZIL and ARC do not have any reference to this data the first time around. And that’s fine — not to mention, seems true. For example, today I performed the following test:

1) rm Maildir/header_cache.db
2) mutt -y
3) Ran through all of the folders, thus populating the Maildir header cache. A folder with 400 files in it would take maybe 10 full seconds. Then I exited mutt.
4) Again, rm header_cache.db
5) mutt -y
6) Once more, ran through all the folders. This time things were much faster — almost instantaneous. So populating the cache was very fast/quick given that ZFS now had some of the files/whatever cached in the ARC, or possibly references to the system calls in the ZIL.

This is an ideal situation, and makes sense. The ARC has a tendency to grow very large as more and more I/O happens, and that’s fine — that’s the nature of the ZFS beast.

However, where things get bizarre is when the above “logical” scenario stops occurring, acting as if the ZFS ARC or ZIL is “full” and doesn’t choose to cache anything more for some particular reason. Again, I can reproduce this quite easily using the above procedure, but it’s not a very good real-world test to post to a mailing list. It’s getting to the point where I’m probably going to have to write some C code that mimics all of the scenarios possible. There is definitely a problem somewhere, and I think that’s the only way I’m going to get people to track it down.

But something occurred to me today while reading the Cache Flushes section of the ZFS Evil Tuning Guide. I was left wondering if setting the FreeBSD equivalent of zfs_nocacheflush would improve things.

So today I decided to set vfs.zfs.cache_flush_disable=1 in /boot/loader.conf to see how things performed.

Something tells me I’m probably going to have to create a ZFS-related WordPress Category in my blog just for this kind of thing. :-)

ZFS support in loader(8) being continually added/removed

FreeBSD users should be aware of the massive rash of commits which have occurred over the past few weeks with regards to LOADER_ZFS_SUPPORT functionality. This functionality has been added, removed, tinkered with, re-added, removed, etc. numerous times. Proof is provided below. As of this writing, LOADER_ZFS_SUPPORT has been disabled entirely. Please see these commits:

This affects both i386 and amd64, despite the pathname implying otherwise.

FreeBSD users should be outraged by this, and be questioning why said changes are not being fully tested before being committed. I’ll take this as further proof that all administrators should be paying VERY close attention to commits to src-all in RELEASE and STABLE branches.

I consider this evidence further justification for keeping one’s root filesystem as UFS.

FreeBSD and ZFS — horrible raidz1 speed — finale

A follow-up to my two earlier posts on horrible raidz1 read speed (the original post and part 2).

The problem I described has not recurred since enabling prefetch. So it seems whatever performance-related problems we had with prefetch when ZFS was first committed to FreeBSD “back in the day” have since been addressed. I wish I could pinpoint where/when/how this was fixed, but the beast is complex…

I’ve since re-enabled prefetch on my co-located production servers (Intel ICH7-based) and I’m seeing great improvements there too. Those are single-disk systems (e.g. no raidz1 in use) too.

I’d recommend that users who have previously disabled the ZFS prefetch mechanism on FreeBSD re-enable it and reboot. :-)

FreeBSD and ZFS — horrible raidz1 speed — part 2

While debugging the aforementioned problem, I decided to try re-enabling ZFS’s prefetch capability as a test; “maybe things have greatly improved between v6 and v13”, I thought.

Based on what I can tell so far (it’s only been 10-15 minutes, and I’ve transferred hundreds of megabytes of data), that appears to have fixed the problem.

$ rm ~/Maildir/header_cache.db
$ time mutt -f ~/Maildir/system

real    0m0.151s
user    0m0.030s
sys     0m0.014s
$ time mutt -f ~/Maildir/system

real    0m0.038s
user    0m0.004s
sys     0m0.012s
$ time mutt -f ~/Maildir/system

real    0m0.493s
user    0m0.012s
sys     0m0.004s

There are occasional times where ZFS “lags”, e.g. the above times will jump up to 0.5 seconds or so, but I believe that’s an issue which has been mentioned many times in the past by users — occasionally ZFS appearing “bursty” in I/O. Not sure what the cause of that is, but I know I’m not the only one seeing that behaviour.

Anyway, I’ll keep an eye on things for a few days and see if things remain speedy or if they get worse. ZFS prefetch being enabled could still be a red herring; for all I know memory fragmentation could be the root of the problem. Who knows — too soon to tell…

FreeBSD and ZFS — horrible raidz1 read speed

I’ve been noticing what appears to be absolutely horrible speeds from a ZFS raidz1 pool, but only in some circumstances: specifically, when mutt and header caching (for Maildir) are used.

The setup is as follows:

ad4: 190782MB <WDC WD2000JD-00HBB0 08.02D08> at ata2-master SATA150
ad8: 715404MB <WDC WD7501AALS-00J7B0 05.00K05> at ata4-master SATA300
ad10: 715404MB <WDC WD7501AALS-00J7B0 05.00K05> at ata5-master SATA300
ad12: 715404MB <WDC WD7500AACS-00D6B0 01.01A01> at ata6-master SATA300
ad14: 715404MB <WDC WD7500AACS-00D6B0 01.01A01> at ata7-master SATA300

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0

Filesystem   1024-blocks      Used      Avail Capacity  Mounted on
storage/home    67108864     95232   67013632     0%    /home
storage       2149507840 227146240 1922361600    11%    /storage
/dev/ad4s1e      8122126     52450    7419906     1%    /tmp

This indicates that /home (e.g. /home/jdc/Maildir) lives on the ZFS pool storage, consisting of four (4) SATA300 disks. The UFS2-based stuff (OS disk, etc.) consists of a single SATA150 disk.

Now for relevant bits from my muttrc:

set mbox_type=Maildir
set folder="~/Maildir"
set mbox="~/Maildir"
set spoolfile="~/Maildir"
set header_cache="~/Maildir/header_cache.db"
set maildir_header_cache_verify=no

Now for the tests. First, we populate the Maildir header cache:

$ rm ~/Maildir/header_cache.db
$ mutt -f ~/Maildir/system
$ ls -l ~/Maildir/header_cache.db
-rw-------    1 jdc       users     671744  1 Jun 20:10 /home/jdc/Maildir/header_cache.db

Next, we copy the contents of ~/Maildir/system (which is on the ZFS pool) to /tmp/Maildir/system (which is UFS2):

$ rsync -a ~/Maildir/system /tmp/Maildir/

mutt will use ~/Maildir/header_cache.db no matter what we pass to the -f flag (see the above muttrc).

And now it’s time to prove my statement.

$ time mutt -f ~/Maildir/system

real    0m3.447s
user    0m0.030s
sys     0m0.022s
$ time mutt -f /tmp/Maildir/system

real    0m0.233s
user    0m0.013s
sys     0m0.022s

The only ZFS-related tunable I’ve set is vfs.zfs.prefetch_disable="1".

I’m really not sure what to make of this. We’re talking about a common operation that’s roughly 15 times slower (3.447 seconds vs. 0.233 seconds) when using ZFS vs. UFS2. ZFS’s ARC infrastructure should already have the contents of this stuff in memory, so I’m not sure where the delays are coming from. I don’t think it’s ZIL-related, since that’s supposed to be used for writes. Surely raidz1 isn’t *that* slow…?

And before anyone claims my disks are slow or responsible for the problem…

# dd if=/dev/ad4 of=/dev/null bs=64k
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    733    733  46883    1.4      0      0    0.0   99.2| ad4

# dd if=/dev/ad8 of=/dev/null bs=64k
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1   1659   1659 106156    0.6      0      0    0.0   97.0| ad8

# dd if=/dev/ad10 of=/dev/null bs=64k
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1   1739   1739 111267    0.6      0      0    0.0   96.8| ad10

# dd if=/dev/ad12 of=/dev/null bs=64k
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1   1461   1461  93510    0.7      0      0    0.0   98.4| ad12

# dd if=/dev/ad14 of=/dev/null bs=64k
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1   1375   1375  88017    0.7      0      0    0.0   98.3| ad14

They’re not, as can be seen from sequential I/O and gstat(8) output.

I’m willing to bet if I bust out ktrace(1) on this stuff, the delays seen will be on read(2) calls.