Fast Searching Is Slow

September 15, 2008

Whenever I get a new Apple or Microsoft OS release, I spend a couple of days finding out how to turn off most of the new features, because each release has a usable RAM requirement about twice as big as the previous one.

A couple of weeks ago I upgraded some Macs in our studio from Tiger to Leopard. We have around 40 Firewire hard drives with audio and video files on them, and Leopard wants to re-index them all for Spotlight. I decided to let it go ahead and plugged one drive into each of several Macs. After 3-4 hours of 100% CPU use, with progress bars that alternated between estimates like 30 hours or 60 hours remaining and a barber pole saying it was estimating the time it would take (again), I’d had enough. I never need Spotlight on those drives anyway. What was it trying to do, full-text index my audio and video files?

I found out that I can run the command “touch .metadata_never_index” in the root directory of each drive to stop this nonsense. Doing that on all 40 drives took maybe 20 minutes, and now I can work again.
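For anyone with a similar pile of drives, a loop saves some typing. This is only a minimal sketch: the function name mark_never_index is made up for illustration, and it assumes the drives are mounted under /Volumes, which is where macOS normally puts external disks.

```shell
#!/bin/sh
# mark_never_index is a hypothetical helper name, not a system command.
# It creates the .metadata_never_index marker at the root of every volume
# found under the given directory (default: /Volumes, an assumption).
mark_never_index() {
    root="${1:-/Volumes}"
    for vol in "$root"/*/; do
        # Skip symlinks; the boot disk usually appears as one under /Volumes.
        [ -L "${vol%/}" ] && continue
        # Skip an unmatched glob and volumes we cannot write to.
        [ -d "$vol" ] && [ -w "$vol" ] || continue
        touch "${vol}.metadata_never_index" && echo "marked ${vol}"
    done
}
```

Calling mark_never_index with no argument walks /Volumes; passing another directory lets you try it on a scratch folder first.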

This reminds me of the early ’90s when I was maintaining GNU find. The “fast find” code from BSD find was in the public domain, and had been factored out by POSIX.2 into a separate locate command, so I added that to the GNU findutils distribution. In doing so, I refactored James Woods’ monolithic code into coherent functions with meaningful variable names. That allowed me to figure out what the code was doing and document the database file format (and change it to be 8-bit clean).

Richard Stallman (leader of the GNU Project) thought the separate locate program was an inelegant kludge, because it wasn’t guaranteed to produce correct (up-to-date) results. What it reported depended on what had changed since the last time the updatedb command had been run to walk the file system and update the locate database. So he added an item to the GNU task list to integrate the locate code properly into find, making the database a transparent optimization. Find would fall back on brute-force file system traversal for whatever directory trees lacked an up-to-date locate database (judged by time stamps).
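To make the idea concrete, here is one naive reading of it as a shell sketch. Everything in it is an assumption for illustration: indexed_find is a made-up name, the “database” is just a plain text file with one path per line rather than a real locate database, and the freshness test (no directory under the tree modified more recently than the database) is my guess at what “based on time stamps” could mean, not what the task list or any submitted patch actually specified.

```shell
#!/bin/sh
# Hypothetical sketch: indexed_find DIR DB PATTERN
# DB is a plain text list of paths (a stand-in for a locate database).
indexed_find() {
    dir="$1"; db="$2"; pattern="$3"
    # Creating, deleting, or renaming a file updates its parent directory's
    # mtime, so any directory newer than the database means the list is stale.
    if [ -f "$db" ] &&
       [ -z "$(find "$dir" -type d -newer "$db" -print | head -n 1)" ]; then
        # Database looks current: answer from the list without visiting files.
        grep -F -- "$pattern" "$db"
    else
        # Database missing or stale: brute-force traversal.
        find "$dir" -path "*$pattern*" -print
    fi
}
```

Even this toy version shows the catch: the freshness check still has to walk every directory, and it says nothing about network mounts or clock skew between machines.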

After some months, a volunteer named Tim actually submitted a modified version of GNU find where he had done that. Unfortunately, in the meantime I had made some other major changes to GNU find, so Tim’s patches no longer applied. Also, I wasn’t sure I could prove his code to be correct. I kept thinking of edge cases and issues with network mounted file systems that complicated the problem. I integrated some of Tim’s optimizations, but the locate database integration into find was never finished. I always felt a little sad about that.

It was gratifying years later to see Apple, Microsoft, and Google finally bring indexed disk searching to the public, hopefully mostly correctly. (And now including search on file contents as well as attributes.) I just wish the indexing wasn’t such a pig.