September 10th, 2008 by lucas
Ultimate Debian Database (UDD) is a GSoC project (now finished) which aimed at importing all different data sources that we have in Debian in a single SQL database, to make it easy to combine that data. Currently, we import information about source and binary packages, all bugs (both archived and unarchived), lintian, carnivore, popcon, history of uploads, history of migrations to testing, and orphaned packages. The goal is really to have that data, without really thinking of a specific use cases: there will be lots of use cases.
Buildstat is a project by Gonéri Le Bouder, that provides a framework for running QA tests (rebuilds and lintian currently, but buildstat is built in a very extensible way) on packages, using both packages in the archive, and packages in the VCS repositories of teams. This is pretty cool: it allows teams to get an overview of the status of their packages, not using the archive as reference, but using their VCS. Buildstat schedules and runs the tests, store the data in an SQL database, and allows to browse the data using a web interface. Buildstat also import some data from other sources (only the BTS currently, using the LDAP dump) to display it on the web interface.
Since both projects are using an SQL DB, people have been asking why we don’t simply merge them. The big advantage would be that the data is synchronized: buildstat would display more up-to-date info about the data sources it doesn’t generate locally (like bugs), and UDD would get fresh buildstat data. We have been talking a lot with Gonéri, considering the different possibilities. But I don’t think it’s a good idea.
I think that both projects should try to do one thing, but do it very well, instead of trying to fix the world. UDD focuses on importing data that exists elsewhere. Sometimes it means doing some complex processing. But data should be be generated by UDD. Merging the projects would mean having a very big piece of software that does everything (or tries to do everything).
Both databases were designed differently, with different goals. UDD tries to stay close to the data it imports. There might be some incoherences in the data sources, but that’s fine: one of the goal of UDD is to make it easy to find them (and fix them), so we need them in the DB. In buildstat, since the goal of importing data is to display it on the web interface, with a strict use case, you can freely “simplify” data if it helps. Another big difference is that UDD is designed to be easy to use (ie write and run queries) by a human user: UDD uses multi-column primary keys, while buildstat uses surrogate keys (integer “id” keys), that ORM tools usually require.
There are also more technical concerns: currently UDD makes a compromise, for each data source, on what it imports: it tries to import the data that is useful, not all the data available. Merging buildstat and UDD would mean increasing the DB size significantly, by adding all the “private” data that buildstat needs. Another problem is the stable API problem: if buildstat and UDD are merged, it means that buildstat cannot change its DB schema without making sure that it wouldn’t break what UDD users are doing.
So, what should we do, from my POV, instead of merging?
- Continue to talk, and get Gonéri into the UDD “team”. He gathered a lot of experience working on buildstat, and he probably would be able to help a lot.
- Data that is not generated by buildstat (bugs data) should be imported from UDD. Doing SQL->SQL will probably make things easier there.
- A summary of buildstat’s infos should be imported into UDD.
There’s also the issue of providing the data through a web interface, to the DDs, which buildstat tries to address partially. The Debian Developer Packages Overview (DDPO)’s main limitations are:
- lack of knowledge about VCS (what buildstat solves)
- lack of knowledge about complex organizations (you can’t get any list of packages, or list of packages maintained by teams not using a consistant Maintainer/Uploaders scheme, or list of packages from tasks, etc)
- poor handling of large amount of packages. In teams with lots of packages, it’s useful to have restricted views, such as “packages outdated compared to upstream”, “packages which have bugs/RC bugs”, “packages which are newer in the VCS than in the archive”. The perl team’s work on PET clearly shows the kind of things that are needed.
At this point, I think that DDPO would benefit from a full rewrite (using the existing code as a source of inspiration, and making sure that there are no regressions, of course).
Using UDD as the data source, it should be easy to get something done (even if UDD still lacks some of the data DDPO has currently). But it still requires web developer skills, which I don’t have. If you are interested, contact me!