Importing bugs into UDD, faster.

Importing bugs from the Debian Bug Tracking System into UDD (the Ultimate Debian Database) in a reasonable time is challenging. The BTS uses flat files to store bug information, and importing a bug typically requires reading a ‘summary’ file and a file containing the versioning information, each a few hundred bytes long. That looks easy, but when you multiply it by ~70000 unarchived bugs, it takes a lot of time (about 40 minutes) to read those ~87,000 files, because the import process blocks on every I/O. The problem is not the amount of data to read (19.8 MB for summary files, 7.4 MB for versioning files), but the number of files (69612 summary files, 17507 versioning files).

The obvious solution is to preload all the files into the page cache, so they are there when you need them. But you can’t simply do that with find /org/bugs.debian.org/versions/pkg -type f -exec cat {} \+ &>/dev/null, because that wouldn’t fix anything: you would still block on each file, and prevent the I/O scheduler from reordering the reads and optimizing them (it’s called elevator for a reason). So, how do I tell the kernel “I’m going to read that in the future, please preload it”? readahead(2) is blocking, so it’s not helpful. The right solution is to use posix_fadvise(2), which lets you declare an access pattern in advance. Using fadvise to preload all the files takes less than 5 minutes, and importing the bugs after that takes less than 8 minutes, so the whole thing drops from about 40 minutes to under 13. It’s really a big win.
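
To make the idea concrete, here is a minimal sketch of such a preloader in Python. It is not the script used for UDD, only an illustration: the names preload and walk_files are made up for this example, and it assumes a Unix system where os.posix_fadvise is available (Python 3.3 or later). It opens each file, passes POSIX_FADV_WILLNEED for the whole file, and closes it again without reading any data.

```python
import os
import sys


def preload(paths):
    """Hint the kernel that each file in `paths` will be read soon.

    Hypothetical helper for illustration. Returns the number of files
    for which the hint was issued successfully.
    """
    hinted = 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # file vanished or is unreadable: skip it
        try:
            # offset=0 and len=0 mean "from the start to the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            hinted += 1
        except OSError:
            pass  # the hint is only advisory, so a failure is not fatal
        finally:
            os.close(fd)
    return hinted


def walk_files(root):
    """Yield the path of every file found below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Usage: python preload.py DIR [DIR ...]
    for root in sys.argv[1:]:
        print(root, preload(walk_files(root)))
```

Note that the directory walk and the open() calls still have to read metadata from disk, so this sketch does not remove every blocking operation. It only lets the kernel queue the data reads without the caller waiting for each one.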

Does anyone know if there’s already an fadvise-based tool that can preload a list of files? That’s something I could use in other contexts as well.
