[insert title here]

June 28, 2010

The Curse of the Xtreme Virtual Machine Service

Filed under: Uncategorized — gdb @ 5:19 am

I’m beginning to think that XVM, one of the SIPB services that I maintain, is cursed.  For some unknown reason, it seems that at every turn, XVM has managed to run into critical bugs in core pieces of its architecture.  Even its hardware is not above failing in perhaps the most obscure ways possible.

I suppose one small consolation is that I know XVM’s troubles predate my involvement with the project, so I can conclude that the problem is not my proximity.  Some time before I became involved, XVM kept being bitten by a bug in LVM.  I don’t know the full details of the problem, but the solution involved allocating many times more metadata for the LVs than one should reasonably have to allocate.  After that, XVM had some problems with the clustering software, which would deadlock in kernelspace while trying to coordinate access to its RAID.  But with careful debugging in the former case and a switch to a new clustering solution in the latter, XVM was functional once again.

Business was quite good for a time.  So good, in fact, that XVM ran out of capacity.  We had four hosts in the pool, and all of them were out of RAM.  Every day, we would receive email indicating that people were trying to create new virtual machines, but we had nowhere to boot them.  Our hosts were rock solid (and indeed, I think those four hosts have been up continuously for as long as I’ve been paying attention).

But one day, we received a grant to expand our capacity.  We bought some new hosts, and after some trouble getting them to boot (our temporary workaround: turn off ECC), we put them in the pool.  But try as we might, our clustering software kept spitting back an error, claiming it could not allocate memory to even list the logical volumes on the RAID.  And then to make matters worse, we noticed that the new hosts would crash after a few days.  With ECC on, they wouldn’t even boot.

When the semester wound down and we finally had time to debug, we sat down with our clustering software’s source and eventually traced the problem to a constant defining the maximum number of locks.  We bumped that a few orders of magnitude, and we haven’t seen any troubles since.  To make matters even more exciting, we couldn’t find a place in the source where that constant is actually used productively.  (One would expect that it defines the size of some static buffer, but the only usages we found were to throw errors if too many locks were taken out at a time.)
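To illustrate the pattern described above, here is a minimal sketch in Python. It is not the clustering software's code, which this post does not reproduce; `MAX_LOCKS`, `LockTable`, and the choice of `MemoryError` are all hypothetical, picked only for illustration. The point is that the limit sizes nothing: the table grows dynamically, and the constant only decides when to refuse.

```python
# Hypothetical limit, standing in for the constant described in the post.
MAX_LOCKS = 4


class LockTable:
    """Illustrative lock table whose limit is only ever used as a guard."""

    def __init__(self, max_locks=MAX_LOCKS):
        self.max_locks = max_locks
        # The set grows on demand; max_locks does not size any buffer.
        self._held = set()

    def acquire(self, name):
        # The only use of the limit: refuse once too many locks are held.
        if len(self._held) >= self.max_locks:
            raise MemoryError("cannot allocate lock %r" % (name,))
        self._held.add(name)

    def release(self, name):
        # discard() is a no-op if the lock was never held.
        self._held.discard(name)

    def held(self):
        return len(self._held)
```

In a structure like this, raising the hypothetical `MAX_LOCKS` changes nothing except the point at which the error fires, which is consistent with the fix described above.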

And then this past weekend, we noticed that one of our four new hosts, the only one we hadn’t yet configured, was actually able to boot.  We stuck in the drives from one of the failing hosts, and it still was able to boot.  We compared BIOS settings, and we noticed that one interesting difference was that the failing machine had “8-DIMM drive strength” disabled, while the working one had it enabled.  Don’t ask me what that option means; it was undocumented, and some preliminary Googling has not revealed any useful information.  Turning this on allowed our other three hosts to boot.

And so at long last, everything has been resolved!  The curse has been lifted! …Or has it?  For alas, last night we realized that one of the new hosts had frozen.  Tonight we reproduced the error while having a serial console session active, and managed to obtain some interesting MCE output.

On the whole, I actually think I’m ok with how my involvement with XVM seems to be going.  There’s been a lot of frustration and heartbreak, but it’s also been an excellent learning experience, a continual source of new challenges, and it’s really awesome when we finally make some misbehaving component work properly.

June 20, 2010

Source diving

Filed under: Uncategorized — gdb @ 8:02 pm

A few days ago, I was confronted with a question about how Python’s multiprocessing library will behave when worker processes terminate unexpectedly.  Without thinking, I immediately started an interactive Python session and ran

>>> import multiprocessing
>>> multiprocessing.__file__
'/usr/lib/python2.6/multiprocessing/__init__.pyc'

(I can never remember where to look for system-wide Python packages, so I just let the system figure it out for me.)  I immediately began diving into the code I had located, searching the raw source of this unfamiliar system for details of its behavior.
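The same lookup can also be scripted. The sketch below is a minimal illustration, assuming Python 3 (the session above was Python 2.6); `source_path` is a hypothetical helper name, not something from the original session. It uses the standard library's `inspect.getsourcefile`, which reports the `.py` source file rather than a compiled `.pyc`.

```python
import inspect
import multiprocessing


def source_path(module):
    """Return the path of the Python source file for an imported module.

    Hypothetical helper for illustration. Returns None when there is no
    Python source to point at.
    """
    try:
        return inspect.getsourcefile(module)
    except TypeError:
        # Built-in modules (for example, sys) have no source file on disk.
        return None


if __name__ == "__main__":
    # Prints wherever this interpreter keeps the multiprocessing package.
    print(source_path(multiprocessing))
```

From there, the directory containing that file is the place to start reading.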

A few minutes into this, I paused.  Why had my first instinct been to read the code rather than the documentation?  Why on earth was I comfortable reading the code in the first place?

I remember for a long time being desperately afraid of other programmers’ code.  I was a newcomer to the world of programming, and here were these lines of source written by far more competent programmers, lines of code that would be too complicated for my puny brain to comprehend.  How could I, a mere mortal, hope to comprehend these gifts from the gods?  Instead, when I needed to learn more about a system, my first and only line of defense was a Google search, hoping to hit a section of the documentation or an online forum where my question had already been anticipated and answered.

But at some point in the recent past (probably on the order of months), something began to change.  For Python systems at least, I am more than comfortable immersing myself in others’ code.  When I want to learn how one works, I know where to begin and how to mentally construct and traverse the chain of dependencies.  I’m just beginning to do the same with some C systems, having source-dived Apache extensively and Git and openais/corosync very peripherally.  In the future, I expect that as I gain a better understanding of how C systems are typically structured, I’ll feel more comfortable source diving other ones as well.

In any case, I’m not sure I can accurately determine what has led to this change.  It could be simply that I have recently had to understand undocumented aspects of systems’ behavior, but I don’t think that’s it.  Rather, I think this marks a turning point in my maturity as a programmer, moving past the stage of being unable or unwilling to peel back the abstraction layers when those abstractions are getting in the way.

Or it could be that I’m just becoming too lazy to Google.
