Datamining Launchpad bugs

One thing that really is annoying with Launchpad is its lack of interfaces with the outside world. No SOAP interface (well, I think that work is being done on this), no easy way to export all bugs. The only way to get all the bug data in a machine-parseable format is to first fetch this URL, and then, for each bug number listed there, to make another request for https://launchpad.net/bugs/$bug/+text. I filed a bug a few weeks ago, asking for a simpler way to get all the data.

A Launchpad dev suggested doing what I just described (fetch all the numbers, then fetch the data for each bug). I originally dismissed the idea because it just sounded too dirty/aggressive/whatever, but since I needed to practice Python, I gave it a try. And actually, it works: I was able to get all the data in less than an hour (but that probably put some load on Launchpad ;-)).
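
The two-step approach can be sketched roughly as follows. This is not the script linked further down, just a minimal illustration: the only detail taken from the post is the https://launchpad.net/bugs/$bug/+text pattern. The listing format (whitespace-separated bug numbers), the function names and the `delay` parameter are all assumptions made for the example.

```python
import time
import urllib.request

# URL pattern quoted in the post; everything else here is hypothetical.
BUG_TEXT_URL = "https://launchpad.net/bugs/{bug}/+text"


def parse_bug_numbers(listing):
    """Return the bug numbers in a listing, assuming whitespace-separated numbers."""
    return [int(token) for token in listing.split() if token.isdecimal()]


def http_get(url):
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all_bugs(listing_url, get=http_get, delay=0.5):
    """Fetch the listing, then the +text page of every bug it names."""
    bugs = {}
    for bug in parse_bug_numbers(get(listing_url)):
        bugs[bug] = get(BUG_TEXT_URL.format(bug=bug))
        time.sleep(delay)  # pause between requests to limit the load on the server
    return bugs
```

Passing the fetch function in as `get` makes it possible to try the loop against canned pages before pointing it at the real site.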

That makes it possible to write cool SQL queries.

Bugs with the most subscribers:

select bugs.bug, title, count(*) as subscribers
from bugs, subscribers
where bugs.bug = subscribers.bug
group by bugs.bug, title
order by subscribers desc
limit 10;

bug | title | subscribers
188540 | firefox-3.0 crashed with SIGSEGV in g_slice_alloc() | 291
154697 | package update-manager 1:0.81 failed to install/upgrade: ErrorMessage: SystemError in cache.commit(): E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1) | 278
141613 | npviewer.bin crashed with SIGSEGV | 262
59695 | High frequency of load/unload cycles on some hard disks may shorten lifetime | 182
215005 | jockey-gtk crashed with AttributeError in enables_composite() | 171
216043 | compiz.real crashed with SIGSEGV | 168
121653 | [gutsy] fglrx breaks over suspend/resume | 144
1 | Microsoft has a majority market share | 142
145360 | compiz.real crashed with SIGSEGV | 134
23369 | firefox(-gnome-support) should get proxy from gconf | 126

Bugs where someone is subscribed twice:

select bug, subscriber_login as subscriber
from subscribers
group by bug, subscriber_login
having count(*) > 1;

bug | subscriber
33065 | mvo
48262 | mvo
144628 | skyguy
158126 | benekal
213741 | sandro-grundmann
216043 | jotacayul
221630 | kami911

(Yes, that forced me to change a primary key)
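
To illustrate why duplicates like these clash with a primary key, here is a small sketch using Python's sqlite3 module. The schema is an assumption (the real one is in the DB creation script linked below): it supposes the key was originally on (bug, subscriber_login). The table name `strict_subscribers` and the `example-user` row are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical original schema: one row per (bug, subscriber) pair.
    create table strict_subscribers (
        bug integer,
        subscriber_login text,
        primary key (bug, subscriber_login)
    );
    -- Relaxed schema without that key, so duplicates can be stored.
    create table subscribers (
        bug integer,
        subscriber_login text
    );
""")

# The first pair appears twice, as in the query result above.
rows = [(33065, "mvo"), (33065, "mvo"), (1, "example-user")]

try:
    conn.executemany("insert into strict_subscribers values (?, ?)", rows)
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the second (33065, 'mvo') row violates the primary key

conn.executemany("insert into subscribers values (?, ?)", rows)
duplicates = conn.execute("""
    select bug, subscriber_login
    from subscribers
    group by bug, subscriber_login
    having count(*) > 1
""").fetchall()
print(rejected, duplicates)
```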

Packages with the most bugs:

select package, count(distinct bug) as cnt
from tasks
group by package
order by cnt desc
limit 10;

package | number
ubuntu | 5392
linux | 1464
linux-source-2.6.20 | 1034
update-manager | 826
linux-source-2.6.22 | 724
firefox | 684
kdebase | 673
firefox-3.0 | 668
ubiquity | 590
openoffice.org | 566

Bugs with the shortest titles:

select bug, title, length(title) as len
from bugs
order by len asc
limit 5;

bug | title | length
190560 | | 1
160381 | uh | 2
224350 | css | 3
133621 | gnus | 4
138052 | pbe5 | 4

If you want to play too, you can fetch the SQLite3 DB (5.8M, lzma-compressed), the DB creation script, and the script that fetches the bugs and imports them into the DB. Comments about my code would be much appreciated (stuff like “oh, there’s a better way to do that in Python!”), as I’m not very confident about my Python skills. :-)

Update: apparently, I’m not really fetching all the bugs. I’m getting the same results as when you just press “Search” on https://launchpad.net/ubuntu/+bugs. But if you click on “Advanced search”, then select all the bug statuses, and click search, you get a lot more bugs (154066 vs 49031). If someone knows which bugs are excluded from the default search, I’m interested!

Update 2: Got it. Apparently the default search doesn’t list bugs that have all their “tasks” marked “Won’t fix”, “Fix Released”, or “Invalid”.
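
That rule could be checked against a complete dump with a query along these lines. This is only a sketch: it assumes the tasks table has a `status` column holding exactly those strings, which may not match the real schema, and the rows below are invented sample data.

```python
import sqlite3

# Statuses named in the update above; the exact spelling stored in the DB is an assumption.
CLOSED = ("Won't fix", "Fix Released", "Invalid")

conn = sqlite3.connect(":memory:")
conn.execute("create table tasks (bug integer, package text, status text)")
conn.executemany("insert into tasks values (?, ?, ?)", [
    (101, "ubuntu", "Fix Released"),   # bug 101: every task is closed
    (101, "firefox", "Invalid"),
    (102, "ubuntu", "Fix Released"),   # bug 102: one task is still open
    (102, "linux", "Confirmed"),
    (103, "kdebase", "New"),           # bug 103: open
])

# Bugs with no task outside the closed statuses, i.e. the ones
# the default search would leave out under the rule above.
hidden = conn.execute("""
    select bug
    from tasks
    group by bug
    having sum(status not in (?, ?, ?)) = 0
    order by bug
""", CLOSED).fetchall()
print(hidden)
```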

5 thoughts on “Datamining Launchpad bugs”

  1. “One thing that really is annoying with Launchpad is its lack of interfaces with the outside world. No SOAP interface (well, I think that work is being done on this), no easy way to export all bugs.”
    This is something that a lot of people have been asking for for a long time.

    You’ll be pleased to hear, then, that we most definitely are working hard on developing a comprehensive set of APIs (using JSON-RPC, IIRC) that will allow you to do anything that you can do through the Launchpad web interface programmatically.

    The APIs should be released in the not-too-distant future; keep an eye on http://news.launchpad.net or on launchpad-users@lists.ubuntu.com.

  2. “… I originally dismissed the idea because it just sounded too dirty/aggressive/whatever…”
    That reminds me of the “Research Friendly” initiative: http://libresoft.es/Activities/Research_activities/downloads/fosdem2008/slides/2008-researchfriendly.org.pdf

    Researchers need to acquire data from forges from time to time, and it’s sad if they break the platform because spidering it is too aggressive… so there are several initiatives (like FlossMole, FlossMetrics, etc.) which manage a proxy for the research community of the spidered (or otherwise extracted) artefacts.

    Just wanted to add this bit of loosely related info.

  3. You could have used python-launchpad-bugs (which has an HTML and a text ‘connector’):

    >>> import launchpadbugs.connector as Connector
    >>> Bug = Connector.ConnectBug(method="Text")
    >>> b = Bug(123456)
    >>> print b.description
    Binary package hint: amarok

    One of the podcasts at http://www.touchmusic.org.uk/TouchPod/podcast.xml
    crashes amarok. The podcast in question is Touch Radio 25. The other
    episodes I have tried seem to work fine. What happens with #25 is that I
    get a few hundred ms of sound and then amarok freezes for a little while
    before evaporating from the desktop altogether.

    I don’t know whether or not the audio file in question is corrupt, but
    obviously amarok shouldn’t crash even if it is.
    >>> print b.subscribers
    set([, , ])
    >>>

    etc… :-)

  4. Ah, I didn’t know that it had a text connector as well. I’m not sure it provides all the info in the text interface (like the various date fields for each part).
