Importing bugs into UDD, faster.

Importing bugs from the Debian Bug Tracking System into UDD (the Ultimate Debian Database) in a reasonable time is challenging. The BTS uses flat files to store bug information, and importing a bug typically requires reading a ‘summary’ file, and a file containing the verisoning information, both of a few hundred bytes. That looks easy, but when you multiply it by ~70000 unarchived bugs, it takes a lot of time (about 40 minutes) to read those ~100k files, because the import process will block on every I/O. The problem is not the amount of data to read (19.8 MB for summary files, 7.4 MB for versioning files), but the number of files (69612 summary files, 17507 versioning files).

The obvious solution is to preload all the files into the page cache, so they are there when you need them. But you can’t simply do that with find /org/bugs.debian.org/versions/pkg -type f -exec cat {} \+ &>/dev/null, because that wouldn’t fix anything: you would still block on each file, and prevent the I/O scheduler to reorder the reads and optimize them (it’s called elevator for a reason). So, how do I tell the kernel “I’m going to read that in the future, please preload it?” readahead(2) is blocking, so it’s not helpful. The right solution is to use posix_fadvise(2), that allows to declare an access pattern. Using fadvise to preload all the files takes less than 5 minutes, and importing the bugs after that takes less than 8 minutes, so it’s really a big win.

Does someone know if there’s already an fadvise-based tool that allows to preload a list of files? That’s something that I could need in other contexts as well.

Cool stats about Debian bugs

Now that bug #500000 has been reported, let’s have a look at all our other bugs, using UDD.

Number of archived bugs:

select count(*) from archived_bugs;
 count  
--------
 402826

Number of unarchived bugs marked done:

select count(*) from bugs where status = 'done';
 count 
-------
  8267

Status of unarchived bugs (“pending” doesn’t mean “tagged pending” here):

select status, count(*) from bugs group by status;
    status     | count 
---------------+-------
 pending       | 53587
 pending-fixed |  1195
 forwarded     |  6778
 done          |  8267
 fixed         |   167

The sum isn’t even close to 500000. That’s because quite a lot of bugs disappeared:

select id from bugs union select id from archived_bugs order by id limit 10;
 id  
-----
 563
 660
 710
 725
 740
 773
 775
 783
 817
 819

Now, let’s look at our open bugs.
Oldest open bugs:

select id, package, title, arrival from bugs where status != 'done' order by id limit 10;
  id  |    package     |                                   title                                    |       arrival       
------+----------------+----------------------------------------------------------------------------+---------------------
  825 | trn            | trn warning messages corrupt thread selector display                       | 1995-04-22 18:33:01
 1555 | dselect        | dselect per-screen-half focus request                                      | 1995-10-06 15:48:04
 2297 | xterm          | xterm: xterm sometimes gets mouse-paste and RETURN keypress in wrong order | 1996-02-07 21:33:01
 2298 | trn            | trn bug with shell escaping                                                | 1996-02-07 21:48:01
 3175 | xonix          | xonix colors bad for colorblind                                            | 1996-05-31 23:18:04
 3180 | linuxdoc-tools | linuxdoc-sgml semantics and formatting problems                            | 1996-06-02 05:18:03
 3251 | acct           | accounting file corruption                                                 | 1996-06-12 17:44:10
 3773 | xless          | xless default window too thin and won't go away when asked nicely          | 1996-07-14 00:03:09
 4073 | make           | make pattern rules delete intermediate files                               | 1996-08-08 20:18:01
 4448 | dselect        | [PERF] dselect performance gripe (disk method doing dpkg -iGROEB)          | 1996-09-09 03:33:05

Breakdown by severity:

select severity, count(*) from bugs where status != 'done' group by severity;
 severity  | count 
-----------+-------
 normal    | 27680
 important |  7606
 minor     |  6921
 wishlist  | 18898
 critical  |    29
 grave     |   209
 serious   |   384

Top 10 submitters for open bugs:

select submitter, count(*) from bugs where status != 'done' group by submitter order by count desc limit 10;
submitter                      | count 
----------------------------------------------------+-------
 Dan Jacobson                  |  1455
 martin f krafft                |   667
 Raphael Geissert                |   422
 Joey Hess                        |   392
 Marc Haber            |   368
 Julien Danjou                     |   342
 Josh Triplett                |   331
 Vincent Lefevre                |   296
 jidanni@jidanni.org                                |   260
 Justin Pryzby  |   245

Top bugs reporters ever:

select submitter, count(*) from (select * from bugs union select * from archived_bugs) as all_bugs
group by submitter order by count desc limit 10;
                  submitter                   | count 
----------------------------------------------+-------
 Martin Michlmayr             |  4279
 Dan Jacobson            |  3652
 Daniel Schepler  |  3045
 Joey Hess                  |  2836
 Lucas Nussbaum     |  2701
 Andreas Jochens                |  2605
 Matthias Klose         |  2442
 Christian Perrier        |  2302
 James Troup                |  2198
 Matt Zimmerman               |  2027

You want more data? Connect to UDD (from master.d.o or alioth.d.o, more info here), run your own queries, and post them with the results in the comments!

Looking for cliques in the GPG signatures graph

The strongly connected set of the GPG keys graph contains a bit more than 40000 keys now (yes, that’s a lot of geeks!). I wondered what was the biggest clique (complete subgraph) in that graph, and also of course the biggest clique I was in.

It’s easy to grab the whole web of trust there. Finding the maximum clique in a graph is NP-complete, but there are algorithms that work quite well for small instances (and you don’t need to consider all 40000 keys: to be in a clique of n keys, a key must have at least n-1 signatures, so it’s easy to simplify the graph — if you find a clique with 20 keys, you can remove all keys that have less than 19 signatures).

My first googling result pointed to Ashay Dharwadker’s solver implementation (which also proves P=NP ;). Googling further allowed me to find the solver provided with the DIMACS benchmarks. It’s clearly not the state of the art, but it was enough in my case (allowed to find the result almost immediately).

The biggest clique contains 47 keys. However, it looks like someone had fun, and injected a lot of bogus keys in the keyring. See the clique. So I ignored those keys, and re-ran the solver. And guess what’s the size of the biggest “real” clique? Yes. 42. Here are the winners:

CF3401A9 Elmar Hoffmann 
AF260AB1 Florian Zumbiehl 
454C864C Moritz Lapp 
E6AB2957 Tilman Koschnick 
A0ED982D Christian Brueffer 
5A35FD42 Christoph Ulrich Scholler 
514B3E7C Florian Ernst 
AB0CB8C0 Frank Mohr 
797EBFAB Enrico Zini 
A521F8B5 Manuel Zeise 
57E19B02 Thomas Glanzmann 
3096372C Michael Fladerer 
E63CD6D6 Daniel Hess 
A244C858 Torsten Marek 
82FB4EAD Timo Weingärtner
1EEF26F4 Christoph Ulrich Scholler 
AAE6022E Karlheinz Geyer 
EA2D2C41 Mattia Dongili 
FCC5040F Stephan Beyer 
6B79D401 Giunchedi Filippo 
74B11360 Frank Mohr 
94C09C7F Peter Palfrader
2274C4DA Andreas Priesz 
3B443922 Mathias Rachor 
C54BD798 Helmut Grohne 
9DE1EEB1 Marc Brockschmidt 
41CF0322 Christoph Reeg 
218D18D7 Robert Schiele 
0DCB0431 Daniel Hess 
B84EF12A Mathias Rachor 
FD6A8D9D Andreas Madsack 
67007C30 Bernd Paysan 
9978AF86 Christoph Probst 
BD8B050D Roland Rosenfeld 
E3DB4EA7 Christian Barth 
E263FCD4 Kurt Gramlich 
0E6D09CE Mathias Rachor 
2A623F72 Christoph Probst 
E05C21AF Sebastian Inacker 
5D64F870 Martin Zobel-Helas 
248AEB73 Rene Engelhard 
9C67CD96 Torsten Veller

It’s likely that this happened thanks to a very successful key signing party somewhere in germany (looking at the email addresses). [Update: It was the LinuxTag 2005 KSP.] It might be a nice challenge to beat that clique during next Debconf ;)

And the biggest clique I’m in contains 23 keys. Not too bad.

tool to mirror a website locally?

Dear lazyweb,

I need a tool to mirror a website locally (so I can browse it offline). Requirements:
– not GUI-based (I want to run it in a script)
– support recursive retrieval and include/exclude lists (like wget)
– no output when everything is fine, but still output errors (not possible with wget, which still output “basic information” when running with –no-verbose, and doesn’t output errors when running with –quiet)
– understands timestamps, and retransfers files if timestamps or sizes don’t match
– not too over-engineered, not too badly maintained, etc…

Thank you.

New Debian Developers!

We got a lot of (>= 10) new Debian developers recently. I’m really happy to see that the bottlenecks in the New Maintainer process were (at least partially) solved. My first NM (actually my second, my first one is on hold) also became a DD today.

So, how long does it take to become a DD ? Let’s take 2 examples. Both are very active and skilled new contributors, that probably were quite close from being the faster you can be through NM:

Name Applied AM assigned Approved by AM Account created
Chris Lamb 2008-05-01 2008-06-12 2008-07-22 2008-09-16
Sandro Tosi 2008-03-24 2008-05-06 2008-06-22 2008-09-16

We have the proof: provided you have all the required skills, you can become a DD in less than 6 months!

Of course, some things are not perfect yet:

  • A lot of very good contributors are waiting for an AM, because not enough DDs volunteer to be AMs.
  • Some NMs still take too long to answer questions, using AMs that could probably mentor faster NMs. If your AM is waiting for you, feel guilty now!
  • Front Desk and DAM are still managed by a small set of very active (and very busy elsewhere) DDs. Many of the new DDs were FD-approved and DAM-approved by the same person, which is not so great if we want to keep this two-steps check.

Is Mozilla the new XFree86? Could Ubuntu actually help?

All the recent moves of Mozilla make me feel that they are really taking the XFree86 path. Reading the Launchpad bug log about the EULA shows that most of the posters agree on who is on the wrong side, and favor switching to IceWeasel or Epiphany+Webkit.

Even if Mozilla is apparently going to back off on the EULA story, it looks like the harm was done. If they want to fix that, they will have to start listening to other players in the Free Software community. Or just watch Webkit eat their market share.

Since Ubuntu leaders are apparently talking to Mozilla about that, I really hope that they are aiming for a solution that will help the Free Software community as a whole, and are not looking for a work-around that will “fix” the problem for Ubuntu.

There has been a lot of noise about the lack of “giving back” to the community by Ubuntu. Using Ubuntu user base to weight in and solve such issues in a way that benefit the whole community would probably be seen as a much more valuable contribution than another bunch of patches.

UDD and Buildstat

Ultimate Debian Database (UDD) is a GSoC project (now finished) which aimed at importing all different data sources that we have in Debian in a single SQL database, to make it easy to combine that data. Currently, we import information about source and binary packages, all bugs (both archived and unarchived), lintian, carnivore, popcon, history of uploads, history of migrations to testing, and orphaned packages. The goal is really to have that data, without really thinking of a specific use cases: there will be lots of use cases.

Buildstat is a project by Gonéri Le Bouder, that provides a framework for running QA tests (rebuilds and lintian currently, but buildstat is built in a very extensible way) on packages, using both packages in the archive, and packages in the VCS repositories of teams. This is pretty cool: it allows teams to get an overview of the status of their packages, not using the archive as reference, but using their VCS. Buildstat schedules and runs the tests, store the data in an SQL database, and allows to browse the data using a web interface. Buildstat also import some data from other sources (only the BTS currently, using the LDAP dump) to display it on the web interface.

Since both projects are using an SQL DB, people have been asking why we don’t simply merge them. The big advantage would be that the data is synchronized: buildstat would display more up-to-date info about the data sources it doesn’t generate locally (like bugs), and UDD would get fresh buildstat data. We have been talking a lot with Gonéri, considering the different possibilities. But I don’t think it’s a good idea.

I think that both projects should try to do one thing, but do it very well, instead of trying to fix the world. UDD focuses on importing data that exists elsewhere. Sometimes it means doing some complex processing. But data should be be generated by UDD. Merging the projects would mean having a very big piece of software that does everything (or tries to do everything).

Both databases were designed differently, with different goals. UDD tries to stay close to the data it imports. There might be some incoherences in the data sources, but that’s fine: one of the goal of UDD is to make it easy to find them (and fix them), so we need them in the DB. In buildstat, since the goal of importing data is to display it on the web interface, with a strict use case, you can freely “simplify” data if it helps. Another big difference is that UDD is designed to be easy to use (ie write and run queries) by a human user: UDD uses multi-column primary keys, while buildstat uses surrogate keys (integer “id” keys), that ORM tools usually require.

There are also more technical concerns: currently UDD makes a compromise, for each data source, on what it imports: it tries to import the data that is useful, not all the data available. Merging buildstat and UDD would mean increasing the DB size significantly, by adding all the “private” data that buildstat needs. Another problem is the stable API problem: if buildstat and UDD are merged, it means that buildstat cannot change its DB schema without making sure that it wouldn’t break what UDD users are doing.

So, what should we do, from my POV, instead of merging?

– Continue to talk, and get Gonéri into the UDD “team”. He gathered a lot of experience working on buildstat, and he probably would be able to help a lot.

– Data that is not generated by buildstat (bugs data) should be imported from UDD. Doing SQL->SQL will probably make things easier there.

– A summary of buildstat’s infos should be imported into UDD.

There’s also the issue of providing the data through a web interface, to the DDs, which buildstat tries to address partially. The Debian Developer Packages Overview (DDPO)’s main limitations are:

– lack of knowledge about VCS (what buildstat solves)

– lack of knowledge about complex organizations (you can’t get any list of packages, or list of packages maintained by teams not using a consistant Maintainer/Uploaders scheme, or list of packages from tasks, etc)

– poor handling of large amount of packages. In teams with lots of packages, it’s useful to have restricted views, such as “packages outdated compared to upstream”, “packages which have bugs/RC bugs”, “packages which are newer in the VCS than in the archive”. The perl team’s work on PET clearly shows the kind of things that are needed.

At this point, I think that DDPO would benefit from a full rewrite (using the existing code as a source of inspiration, and making sure that there are no regressions, of course).

Using UDD as the data source, it should be easy to get something done (even if UDD still lacks some of the data DDPO has currently). But it still requires web developer skills, which I don’t have. If you are interested, contact me!