OpenPAM bugs (affecting Dovecot)

I woke up this afternoon to a series of user reports that our POP3/IMAP server, Dovecot v1.1, was having problems. Users were intermittently receiving authentication errors in their mail clients, e.g.:

[*] Connection established to XX.XX.XX.XX
>> 0020 +OK Dovecot ready.
<< 0014 USER user
>> 0005 +OK
<< 0017 PASS XXXXXXXXXX
>> 0040 -ERR Temporary authentication failure.

Our system log was filled with the following:

Mar 28 05:25:32 horus dovecot-auth: in openpam_load_module(): no pam_permit.so found
Mar 28 05:26:02 horus last message repeated 17 times
Mar 28 05:27:07 horus last message repeated 5 times
Mar 28 05:38:21 horus last message repeated 35 times
...
Mar 28 12:36:33 horus kernel: pid 97068 (dovecot-auth), uid 0: exited on signal 11

Our Dovecot logs indicated pam_start(3) was returning PAM_SYSTEM_ERR:

dovecot: Mar 28 05:25:35 Error: auth-worker(default): pam(user,XX.XX.XX.XX): pam_start() failed: system error

I did some digging on the Web, and I’m not the only one who has seen this problem. The following threads appear to be relevant; they are neither FreeBSD-specific (other OSes using OpenPAM are affected too) nor specific to Dovecot v1.1:

The common claim is that OpenPAM is leaking memory or file descriptors. FreeBSD uses OpenPAM version “Hydrangea”, which is the latest non-beta release. The OpenPAM source browser indicates there hasn’t been much work done on OpenPAM in nearly 15 months.

Our FreeBSD server (RELENG_7 amd64) was last rebuilt in October 2008, and we’re using dovecot-1.1.6_1. There have been no changes to OpenPAM in FreeBSD in 15 months, so it is safe to say this issue has not been fixed in FreeBSD between October 2008 and the present.

We don’t use OpenPAM for much on this system. The one exception is OpenSSH, and the machine generally does not get a lot of SSH traffic (I’ve been considering setting UsePAM no in /etc/ssh/sshd_config for quite some time now). That said:

Our Dovecot configuration file, prior to the issue, contained the following:

auth default {
  mechanisms = plain login
  user = root

  passdb passwd {
  }

  passdb pam {
  }

  userdb passwd {
  }
}

The Dovecot configuration Wiki is quite a mess (very hard to follow, especially with regard to this piece of the configuration), and the above configuration was based purely on the stock dovecot.conf example template. What wasn’t made clear in the Dovecot docs is that the passdb and userdb directives define which password and account (UID, home directory, spool location, etc.) lookup methods are used, and in what order. It’s not intuitive.

The above configuration snippet tells Dovecot to attempt to obtain an account password through /etc/passwd and then, regardless of success or failure, to obtain the account password again via OpenPAM. Account details (UID, home directory, etc.) are then obtained via /etc/passwd.

So, this means that OpenPAM was in fact being queried even though we don’t use it on our systems. To work around the problem, I removed the passdb pam section entirely, so everything should be using getpwent(3) and similar functions (pure use of /etc/passwd).
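
For reference, here is a sketch of what the auth block looks like after the workaround, assuming nothing else in the block was touched. It is the configuration shown earlier with the passdb pam section dropped; the comments are added here for illustration and were not in our file.

```
auth default {
  mechanisms = plain login
  user = root

  # Password lookups: /etc/passwd only.
  # The "passdb pam" section is gone, so OpenPAM is no longer consulted.
  passdb passwd {
  }

  # Account details (UID, home directory, etc.): /etc/passwd, unchanged.
  userdb passwd {
  }
}
```

The order of the passdb sections matters, so if PAM is ever reintroduced it should be added back deliberately rather than copied from the stock template.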

I’m going to spend a bit of time this evening digging through the OpenPAM source to determine all the situations where PAM_SYSTEM_ERR could be returned, and see if anything stands out.