I’m beginning to think that XVM, one of the SIPB services that I maintain, is cursed. For reasons unknown, it seems that at every turn, XVM has managed to run into critical bugs in core pieces of its architecture. Even its hardware is not above failing in perhaps the most obscure ways possible.
I suppose one small consolation is that I know XVM’s troubles predate my involvement with the project, so I can conclude that the problem is not my proximity. Some time before I became involved, XVM kept being bitten by a bug in LVM. I don’t know the full details of the problem, but the solution involved allocating many times more metadata for the LVs than one should reasonably have to allocate. After that, XVM had some problems with the clustering software, which would deadlock in kernelspace while trying to coordinate access to its RAID. But with careful debugging in the former case and a switch to a new clustering solution in the latter, XVM was functional once again.
Business was quite good for a time. So good, in fact, that XVM ran out of capacity. We had four hosts in the pool, and all of them were out of RAM. Every day, we would receive email indicating that people were trying to create new virtual machines, but we had nowhere to boot them. Our hosts were rock solid (and indeed, I think those four hosts have been up continuously for as long as I’ve been paying attention).
But one day, we received a grant to expand our capacity. We bought some new hosts, and after some trouble getting them to boot (our temporary workaround: turn off ECC), we put them in the pool. But try as we might, our clustering software kept spitting back an error, claiming it could not allocate memory to even list the logical volumes on the RAID. And then to make matters worse, we noticed that the new hosts would crash after a few days. With ECC on, they wouldn’t even boot.
When the semester wound down and we finally had time to debug, we sat down with our clustering software’s source and eventually traced the problem to a constant defining the maximum number of locks. We bumped that a few orders of magnitude, and we haven’t seen that error since. To make matters even more exciting, we couldn’t find a place in the source where that constant is actually used productively. (One would expect that it defines the size of some static buffer, but the only usages we found were to throw errors if too many locks were taken out at a time.)
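A minimal sketch of that pattern, written in Python rather than in the clustering software’s own language. Every name here (`MAX_LOCKS`, `LockTable`, `TooManyLocksError`) and the value 64 are made up for illustration; none of it is taken from the real source. The point is only that the limit guards an error path and sizes nothing, so raising it costs nothing:

```python
# Hypothetical names throughout; this is NOT the clustering software's code.
MAX_LOCKS = 64  # assumed value; the real constant and its value are not given here


class TooManyLocksError(Exception):
    """Raised when a caller tries to hold more locks than the limit allows."""


class LockTable:
    def __init__(self, max_locks=MAX_LOCKS):
        self.max_locks = max_locks
        # The set grows on demand; nothing is preallocated from max_locks.
        self.held = set()

    def acquire(self, name):
        # The limit appears only in this guard, never as a buffer size.
        if name not in self.held and len(self.held) >= self.max_locks:
            raise TooManyLocksError(
                f"cannot take lock {name!r}: limit of {self.max_locks} reached"
            )
        self.held.add(name)

    def release(self, name):
        self.held.discard(name)
```

With a limit like this, listing a large number of logical volumes would fail as soon as the count of held locks reached the cap, and bumping the constant would make the failure go away without changing anything else.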
And then this past weekend, we noticed that one of our four new hosts, the only one we hadn’t yet configured, was actually able to boot. We stuck in the drives from one of the failing hosts, and it still was able to boot. We compared BIOS settings, and we noticed that one interesting difference was that the failing machine had “8-DIMM drive strength” disabled, while the working one had it enabled. Don’t ask me what that option means; it was undocumented, and some preliminary Googling has not revealed any useful information. Turning it on allowed our other three hosts to boot.
And so at long last, everything has been resolved! The curse has been lifted! …Or has it? For alas, last night we realized that one of the new hosts had frozen. Tonight we reproduced the error with a serial console session active, and managed to obtain some interesting MCE output.
On the whole, I actually think I’m ok with how my involvement with XVM seems to be going. There’s been a lot of frustration and heartbreak, but it’s also been an excellent learning experience, a continual source of new challenges, and it’s really awesome when we finally make some misbehaving component work properly.