One think that really is annoying with Launchpad is its lack of interfaces with the outside world. No SOAP interface (well, I think that work is being done on this), no easy way to export all bugs. The only way to get all the bug data in a machine-parseable is to first fetch this URL, and then, for each bug number listed there, to make another request for https://launchpad.net/bugs/$bug/+text. I filed a bug a few weeks ago, asking for a simpler way to get all the data.
A Launchpad dev suggested to do what I just described (fetch all the number, then fetch the data for each bug). I originally dismissed the idea because it just sounded too dirty/aggressive/whatever, but since I needed to practice python, I gave it a try. And actually, it works: I was able to get all the data in less than an hour (but that probably put some load on Launchpad ;-)).
That allows to write cool SQL queries.
Bugs with the most subscribers:
select bugs.bug, title, count(*) as subscribers from bugs, subscribers where bugs.bug = subscribers.bug group by bugs.bug, title order by subscribers desc limit 10;
bug | firefox | subscribers |
---|---|---|
188540 | firefox-3.0 crashed with SIGSEGV in g_slice_alloc() | 291 |
154697 | package update-manager 1:0.81 failed to install/upgrade: ErrorMessage: SystemError in cache.commit(): E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1), E:Sub-process /tmp/tmpjP6Bsx/backports/usr/bin/dpkg returned an error code (1) | 278 |
141613 | npviewer.bin crashed with SIGSEGV | 262 |
59695 | High frequency of load/unload cycles on some hard disks may shorten lifetime | 182 |
215005 | jockey-gtk crashed with AttributeError in enables_composite() | 171 |
216043 | compiz.real crashed with SIGSEGV | 168 |
121653 | [gutsy] fglrx breaks over suspend/resume | 144 |
1 | Microsoft has a majority market share | 142 |
145360 | compiz.real crashed with SIGSEGV | 134 |
23369 | firefox(-gnome-support) should get proxy from gconf | 126 |
Bugs where someone is subscribed twice:
select bug, subscriber_login as cnt from subscribers group by bug, subscriber_login having count(*) > 1;
bug | subscriber |
---|---|
33065 | mvo |
48262 | mvo |
144628 | skyguy |
158126 | benekal |
213741 | sandro-grundmann |
216043 | jotacayul |
221630 | kami911 |
(Yes, that forced me to change a primary key)
Packages with the most bugs:
select package, count(distinct bug) as cnt from tasks group by package order by cnt desc limit 10;
package | number |
---|---|
ubuntu | 5392 |
linux | 1464 |
linux-source-2.6.20 | 1034 |
update-manager | 826 |
linux-source-2.6.22 | 724 |
firefox | 684 |
kdebase | 673 |
firefox-3.0 | 668 |
ubiquity | 590 |
openoffice.org | 566 |
Bugs with the shortest titles:
select bug, title, length(title) as len from bugs order by len asc limit 5;
bug | title | length |
---|---|---|
190560 | – | 1 |
160381 | uh | 2 |
224350 | css | 3 |
133621 | gnus | 4 |
138052 | pbe5 | 4 |
If you want to play too, you can fetch the SQLite3 DB (5.8M, lzma-compressed), the DB creation script, and the script that fetches the bugs and import them into the DB. Comments about my code would be very appreciated (stuff like “oh, there’s a better way to do that in python!”), as I’m not very confident about my pythonic skills. :-)
Update: apparently, I’m not really fetching all the bugs. I’m getting the same results as when you just press “Search” on https://launchpad.net/ubuntu/+bugs. But if you click on “Advanced search”, then select all the bug statuses, and click search, you get a lot more bugs (154066 vs 49031). If someone know which bugs are excluded with the default search, I’m interested!
Update 2: Got it. Apparently the default search doesn’t list bugs that have all their “tasks” marked “Won’t fix”, “Fix Released”, or “Invalid”.
You’ll be pleased to hear, then, that we most definitely are working hard on developing a comprehensive set of APIs (using JSON-RPC, IIRC) that will allow you to do anything that you can do through the Launchpad web interface programmatically.
The APIs should be release in the not-too-distant future; keep an eye on http://news.launchpad.net or on launchpad-users@lists.ubuntu.com.
“… I originally dismissed the idea because it just sounded too dirty/aggressive/whatever…
That reminds me of the “Research Friendly” initiative :http://libresoft.es/Activities/Research_activities/downloads/fosdem2008/slides/2008-researchfriendly.org.pdf
Researchers need to acquire data from forges from time to time, and it’s sad if they break the platform because spidering it is too aggressive… so there are several initiatives (like FlossMole, FlossMetrics, etc.) which manage a proxy for the research community of the spidered (or otherwise extracted) artefacts.
Just wanted to add this bit of loosely related info.
You could have used python-launchpad-bugs (which has a HTML and a text ‘connector’):
>>> import launchpadbugs.connector as Connector
>>> Bug = Connector.ConnectBug(method=”Text”)
>>> b = Bug(123456)
>>> print b.description
Binary package hint: amarok
One of the podcasts at http://www.touchmusic.org.uk/TouchPod/podcast.xml
crashes amarok. The podcast in question is Touch Radio 25. The other
episodes I have tried seem to work fine. What happens with #25 is that I
get a few hundred ms of sound and then amarok freezes for a little while
before evaporating from the desktop altogether.
I don’t know whether or not the audio file in question is corrupt, but
obviously amarok shouldn’t crash even if it is.
>>> print b.subscribers
set([, , ])
>>>
etc… :-)
Ah, I didn’t know that it had a text connector as well. I’m not sure it provides all the info in the text interface (like the various date fields for each part).
It should provide like 95% of the data in the /+text interface and for the rest bugs are filed. :)