Archive for the ‘software’ Category

Fast Searching Is Slow

September 15, 2008

Whenever I get a new Apple or Microsoft OS release, I spend a couple of days finding out how to turn off most of the new features, because each release has a usable RAM requirement about twice as big as the previous one.

A couple of weeks ago I upgraded some Macs in our studio from Tiger to Leopard. We have around 40 Firewire hard drives with audio and video files on them, and Leopard wants to re-index them all for Spotlight. I decided to let it go ahead and plugged one drive into each of several Macs. After 3-4 hours of using 100% CPU and showing progress bars that alternated between estimates like 30 hours or 60 hours remaining, and a barber pole progress bar saying it was estimating the time it would take (again), I’d had enough. I never need Spotlight on those drives anyway. What was it trying to do, full-text index my audio and video files?

I found out that I can stop this nonsense by creating an empty file named .metadata_never_index at the root of each drive (with the touch command). Doing that on all 40 drives took maybe 20 minutes, and now I can work again.
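If anyone else faces the same re-indexing storm, the per-drive fix is easy to script. A minimal sketch, assuming the drives all mount under /Volumes; here a temporary directory with made-up drive names stands in for /Volumes so the loop is safe to run anywhere:

```shell
# Stand-in for /Volumes with two hypothetical drive names (Audio01,
# Video01); on a real Mac you would set volroot=/Volumes instead.
volroot=$(mktemp -d)
mkdir "$volroot/Audio01" "$volroot/Video01"

# An empty .metadata_never_index file at a volume's root tells
# Spotlight not to index that volume.
for vol in "$volroot"/*; do
    touch "$vol/.metadata_never_index"
done

ls -A "$volroot/Audio01"    # the marker file is now in place
```

The same loop over the real /Volumes would have saved me the 20 minutes of plugging and typing.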

This reminds me of the early ’90s when I was maintaining GNU find. The “fast find” code from BSD find was in the public domain, and had been factored out by POSIX.2 into a separate locate command, so I added that to the GNU findutils distribution. In doing so, I refactored James Woods’ monolithic code into coherent functions with meaningful variable names. That allowed me to figure out what the code was doing and document the database file format (and change it to be 8-bit clean).

Richard Stallman (leader of the GNU Project) thought the separate locate program was an inelegant kludge, because it wasn’t guaranteed to produce correct (up-to-date) results: its answers depended on what had changed since the last time the updatedb command had walked the file system and refreshed the locate database. So he added an item to the GNU task list to integrate the locate code properly into find, making the database a transparent optimization: find would consult the database where it was fresh and fall back on brute-force file system traversal for any directory trees that lacked an up-to-date locate database (judged by time stamps).
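The correctness question comes down to a freshness test: the database is trustworthy for a tree only if nothing in that tree has changed since updatedb last ran. A rough sketch of that test in shell, using temporary files in place of a real locate database and file system (the names db and tree are made up for illustration):

```shell
db=$(mktemp)        # stands in for the locate database file
tree=$(mktemp -d)   # stands in for a directory tree being searched
touch "$db"         # pretend updatedb has just run

# The database is usable only if nothing under the tree is newer
# than the database itself.
check() {
    if [ -n "$(find "$tree" -newer "$db" -print -quit)" ]; then
        echo "stale: fall back to walking the file system"
    else
        echo "fresh: the locate database can answer this query"
    fi
}

check                   # nothing has changed since the database was built
sleep 1
touch "$tree/newfile"   # a change made after the database was built
check                   # now the database can no longer be trusted
```

Even this simple check runs into the edge cases I kept thinking of: mtimes on network-mounted file systems can lag or skew relative to the machine that ran updatedb, which is part of why proving the integration correct was so hard.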

After some months, a volunteer named Tim actually submitted a modified version of GNU find where he had done that. Unfortunately, in the meantime I had made some other major changes to GNU find, so Tim’s patches no longer applied. Also, I wasn’t sure I could prove his code to be correct. I kept thinking of edge cases and issues with network mounted file systems that complicated the problem. I integrated some of Tim’s optimizations, but the locate database integration into find was never finished. I always felt a little sad about that.

It was gratifying years later to see Apple, Microsoft, and Google bring indexed disk searching to the public finally, hopefully mostly correctly. (And now including search on file contents as well as attributes.) I just wish the indexing wasn’t such a pig.

Do What You’re Good At

August 8, 2008

In 1993, I unexpectedly did some good when I was hired to do something I wasn’t good at.

I was going to college part-time and working to pay my way. For almost five years, I had been writing and maintaining various utilities that were part of the GNU operating system (now used mainly with Linux). For a couple of those summers I’d actually been employed as a programmer by the Free Software Foundation, the nonprofit organization that coordinates the GNU Project.

I was looking for year-round part-time work (full-time in the summer), so I got in contact with Cygnus Support, a company founded to do for-profit work on GNU code. (It’s now part of Red Hat.) It seemed like a logical fit. After a telephone interview, I was hired. I spent the summer at the beautiful Cygnus offices in Mountain View, California; here is what the front entrance looked like:

Cygnus Support entrance

Cygnus was looking for someone to maintain the GNU linker/loader ld, part of the binutils compiler toolchain that their employees had written. The guy who wrote GNU ld, Steve Chamberlain, is a brilliant programmer several years older than I am. He had written it for a contract with an insane deadline, and then had to move immediately to another project with another insane deadline in order to keep bringing in enough income to keep the company afloat. In the meantime, he hadn’t had time to do much documenting of his code. And he didn’t have time to explain it to me, either.

Although I was working on a CS degree, I hadn’t taken a compilers class yet. When I started looking at the source code to ld, I was horrified to discover that it was all a hand-written parser for a complex linker language derived from System V Unix (but extended so it was even more complicated). I’d chase function calls and pointers from here to there but could never figure out what the control flow was. It was way more complicated than I’d expected. I never did understand most of the code.

I got so frustrated that every week I’d rush into my boss’s office in tears, saying I couldn’t do it. She’d reassure me and I’d go stare at the code some more. I did learn enough about the code to fix some isolated bugs, but the hard ones I’d always find ways to pass off onto more experienced coders on the team, even though the linker supposedly wasn’t their job.

While I was struggling to understand the code, I learned a significant amount about the BFD library that the Cygnus binutils are based on, and there I started to make a difference. Some things I am good at are making things more consistent and making things more user-friendly. Cleanup work. I’d done that a lot on other GNU utilities for the FSF. I started improving the BFD library; I added some missing functionality, documented it better, and wrote a new utility to make it easier to analyze files built with it.

While doing that, I also took a close look at the configuration scripts that the binutils used. They were designed to support many computer architectures, but a lot of the configuration had to be done by hand. For the FSF utilities, I had written a system called Autoconf to generate automatic configuration scripts, but those couldn’t handle CPU architecture selection. I decided to merge the two systems into a best-of system (Autoconf 2), and I spent much of the summer of ’93 doing that and converting all of Cygnus’s utilities to use it. It got an enthusiastic reception and I believe it’s still in use 15 years later.

While improving and documenting Autoconf, I examined the Texinfo documentation system that GNU and Cygnus used. There were some scripts for printing out manuals in various formats, which involved some tricks with indexing and Postscript. I made some improvements to the documentation tools that summer, but I’ve forgotten exactly what they were.

After a few more months, I left the job at Cygnus to work as a programmer and system administrator for the University of Maryland. Cygnus deserved someone more suitable to work on the linker. But though I was just treading water as a linker maintainer, Cygnus got some valuable improvements to the software surrounding the linker that probably no one else would have made. And I enjoyed doing that.

It seems like this happens at most of the jobs I’ve had. I end up redefining the job description to be things I’m good at, and everyone’s pretty happy with how it turns out.

When Your Naming Scheme Runs Dry

August 6, 2008

The group of system administrators I worked with for over a decade had a tradition of giving Unix computers host names that followed a different theme for each cluster of computers.

We started out as students running the computer labs for the College of Engineering at the University of Maryland, College Park in the late 1980s. Our first public computer lab was a dozen or so Sun 3/50 and 3/60 workstations named after characters and places from Tolkien’s The Hobbit. Host names included shire, bilbo, gloin, rivendell, etc.

When the Sun 3s became obsolete, we turned them into X terminals connecting to several Sparcstation servers, which got their own naming scheme: coke, pepsi, jolt, and mountain-dew. We could have kept adding soda names for a while, but we didn’t need many servers for that lab.

In the staff office (the “Hackers Pitt”, with spelling from Buckaroo Banzai), the Sparcstations we got for testing and software development were called tweak, twiddle, and frob. Good thing we didn’t need to come up with any more names in that series!

Here are a couple of pictures of the Hackers Pitt. Dave, Josh, and Chris:

Kurt, Dave, and Randall:

Later, we opened up another lab, consisting of Decstations, I think. We decided to go with names of computer languages as the naming scheme, so we had workstations called basic, cobol, lisp, perl, etc. There was a networked Postscript laser printer in the lab, and someone got the bright idea to give it a name in the same scheme, so naturally it had to be called postscript! A few months later, though, we had to rename it, because some software would get confused and malfunction when encountering a printer queue called postscript.

Aerospace Engineering named their computers after airplanes. Their Sun3 server was called hellcat, a WWII fighter.

There was one department in the College of Engineering that simply numbered their workstations: Chemical Engineering, whose computers were named cm##. The student sysadmins didn’t like that scheme much, because it was hard to keep those computers straight. Their names had no memorable personalities, so we had trouble remembering whether we were supposed to do something to cm18 or cm19, or cm23 or cm32.

Within a couple of years, many of us started work at UUNET and created one of the first commercial web hosting services. We took our penchant for naming schemes with us. Here’s a picture of Josh, Chris, and Kurt in Kurt’s office at UUNET assembling some servers:

The infrastructure servers (email, rdist, backups, etc.) had names of butlers from literature: jeeves, nestor, smithers, alfred. That was a clever but very restrictive scheme. It turns out there aren’t very many well-known butlers in literature. Now there’s a Wikipedia article listing them, but at the time we were beating our heads against the wall trying to think of more.

For Kerberos (secure login) servers, the list was the most limited. The first one was called keymaster (from Ghostbusters). When we added a second one as a backup, we had to make up a name. Would it be keyslave or keyminor? Hmm, maybe that wasn’t such a great idea.

For customer web servers, we decided on a larger class of cleverly appropriate names: spider names. The first few were easy: charlotte, blackwidow, brownrecluse, tarantula, trapdoor, funnelweb. After exhausting the well-known ones, we had to get more creative or obscure: peterparker, banana, garden, huntsman, crab. Customers had to log in or FTP to the machine’s name to administer their servers, and it felt a little silly to tell them their web server was hosted on, say, banana. Who actually knows there’s such a thing as a banana spider? As we got more customers, we wasted quite a bit of time researching and compiling lists of spider species so we’d have enough names for new servers we were bringing online.

It all started out as good fun, but these days I’d just call them all web001, web002, mail, etc. and be done with it. No creative naming scheme will scale to hundreds of computers.

I confess that at home, I adopted a naming scheme based on classical elements: fire, water, air, earth, and a few more inspired by that pattern. I haven’t changed it partly because the limitations of that scheme help motivate me to not keep too many computers at home.

