Architecture Handling in Distributed Software Collections
=========================================================

Copyright 1999, 2000, 2002 Marcus Brinkmann

Distribution of verbatim copies allowed without restriction.  If you want
to modify this, contact me and I will put it under a better license.  I am
not unwilling, just lazy.

The way I think about architecture treatment stems from my experiences
with the Debian package tools and bootstrapping the Debian GNU/Hurd system.

The concept presented below has the following core ideas, which I think
are important enough to summarize in a short list:

* Package installation is orthogonal to distribution creation.
* Package installation requires only dependency verification.
* Distribution creation requires dependency verification and scoring.


Package Installation
====================

A package manager (the part of the packaging system that installs packages
on a host machine) should only be concerned with fulfilling dependencies.
It should determine whether a package is installable, and if it is, it
should perform the required action.

For this, (probably virtual) packages representing ABI capabilities work
best.  You can provide virtual architecture packages for any ABI feature:
which processor is required, the object format (a.out, elf), whether a
/proc filesystem is provided and to which standard it conforms, whether
Linux syscalls are available, whether Mach interfaces are provided, and so
on.

IMO, dependencies should be as weak as possible.  If a pentium optimized
package also runs on a 486, the dependency should be "486" (or even 386),
not "pentium".

Examples:

* "grub" only depends on "i386", regardless of the operating system
  running.  (Note: this is not true anymore since grub provides a grub
  shell that runs under the OS.  Another example which is still valid is
  "mbr").
* Most GNU/Linux binaries will depend only on a certain cpu, the elf
  object format, and certain libraries determined by the soname, and will
  not make use of syscalls or the proc filesystem directly.  Those will
  run on the GNU/Hurd without recompilation.
* A perl script evaluating the Linux proc fs (version 2.2) can depend on
  perl and the procfs virtual package of version 2.2.

That was the "depends on" side.  There is also the "provides" side.  A
usual pentium GNU/Linux box will provide "i386, i486, i586",
"linux-syscalls (= 2.2)", "elf" and "proc-fs", for example.  Note that you
can have version numbers here, too.

Which packages are provided is a feature of the "distribution".


Distributions
=============

The distribution is really a concept that should not be built directly
into the package installer, because it requires some overview of the
system.  The reason is that a distribution needs to be a consistent set of
packages.

Here config.guess may come in.  There is a "default" mapping from
canonicalized host type strings to virtual packages, for example:

  i586-gnu-linux --> "i386", "i486", "i586", "linux"
  i486-gnu       --> "i386", "i486", "gnu"

Note that concepts like "linux-syscalls", "elf" and "procfs" would be
provided by the Linux kernel package itself.  (Or by emulator packages on
the GNU/Hurd, for example).

It is not hard to write a utility that can create distributions on the fly
if all available packages are collected in a pool.  You just have to map
your host system type to a set of provided virtual packages, and start
adding packages whose dependencies can be fulfilled.  Of course, you will
get further possibilities through additional "provides:" in the packages
you added, so you can add more and more packages from your pool to your
distribution.

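The pool-based procedure described above could be sketched roughly as
follows.  This is only an illustration under simplifying assumptions:
version numbers and conflicts are ignored, and the names `Package`,
`installable`, `build_distribution` and all pool entries are hypothetical,
not part of any existing tool.  Only the host type mapping is taken from
the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """A pool entry; names and fields are hypothetical, versions are ignored."""
    name: str
    depends: frozenset = frozenset()
    provides: frozenset = frozenset()


# Default mapping from host type strings to virtual packages (from the text).
HOST_PROVIDES = {
    "i586-gnu-linux": {"i386", "i486", "i586", "linux"},
    "i486-gnu": {"i386", "i486", "gnu"},
}


def installable(pkg, available):
    """Dependency verification: every dependency must already be provided."""
    return pkg.depends <= available


def build_distribution(host_type, pool):
    """Keep adding pool packages until no further package becomes installable."""
    available = set(HOST_PROVIDES[host_type])
    chosen = []
    remaining = list(pool)
    progress = True
    while progress:
        progress = False
        for pkg in list(remaining):
            if installable(pkg, available):
                chosen.append(pkg)
                remaining.remove(pkg)
                # A package provides its own name plus its virtual packages,
                # which may make further pool packages installable.
                available.add(pkg.name)
                available.update(pkg.provides)
                progress = True
    return chosen


# Hypothetical pool, deliberately listed in an unhelpful order.
POOL = [
    Package("procinfo-script", frozenset({"perl", "procfs"})),
    Package("perl", frozenset({"i386", "elf"})),
    Package("kernel-image", frozenset({"i386", "linux"}),
            frozenset({"linux-syscalls", "elf", "procfs"})),
    Package("ppc-tool", frozenset({"powerpc"})),
]

if __name__ == "__main__":
    for host in HOST_PROVIDES:
        names = [p.name for p in build_distribution(host, POOL)]
        print(host, names)
```

The loop simply repeats until a full pass adds nothing new, so packages
that only become installable through the "provides:" of other packages are
picked up in later passes.  Handling conflicts would need more than this
sketch.
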
Using some standard algorithms from graph theory, you should be able to
take into account conflicting packages etc., to get a maximum
distribution, i.e., a distribution which only consists of packages which
are installable using packages from this distribution, and is not missing
any package from the pool with this property.


Hints or Scoring
================

Hints can be complex and I haven't thought them completely through.  But
here is a rather complete analysis of the different cases that you have to
treat differently when considering scoring:

Usually, we would recompile a package for another architecture only if it
is not yet available for this architecture, i.e., if it is not in this
architecture's distribution.  Furthermore, recompiling for another
architecture would often result in a binary package that is not useful on
any of the existing architectures, so there is no conflict.

Example: Package foo is available for all i386 arches, because it only
depends on the virtual package "i386".  Because we want to support
powerpc, we recompile it for this architecture.  The resulting binary
depends on powerpc only.  Currently, no platform can run i386 and powerpc
binaries at the same time.  Therefore, all distributions we create with
the procedure above will contain only one of these packages.

If two binary versions of the same package happen to match the same
distribution, two cases are possible:

1) "Scoring": Both packages have different dependencies ("i386" vs.
   "powerpc") or do not carry hints.

2) "Hints": Both packages have the same dependencies ("i386" native and
   "i386" with pentium optimization [but without pentium specific
   instructions]).  Then we need hints to decide.

I won't go into detail about how cases 1) and 2) could be treated, only
this much:

For case 1), it would be sufficient to order the virtual dependency
packages that should be considered in a priority list.
Or, more abstractly: you have a function "int score(list of
dependencies)" which calculates a value for each package, and the highest
value wins.  Equal values would mean both packages are equivalent and it
doesn't matter which one is picked.  The function can be very simple and
only catch the cases that are known to occur.

If both packages have the same dependencies, but are not equal, they have
to tell us more about themselves in "hints" metadata.  This could be the
config.guess of the architecture they were compiled on, for example, or
some preferred target architecture, or something similar.  The scoring
would then be a function of the dependencies and the hints metadata.  This
can actually be a very simple function (like hint "i586" is better than
"i486" is better than "i386" for the i586 distribution).

The _important_ thing is that scoring is part of the distribution
creation, not part of package installation.  I should always be allowed to
install a package with a bad score, even if it is a worse score than that
of the default package it would replace.  For example, I could fetch such
a package from the package pool to work around a bug in the optimization
or something similar.


Complexity
==========

I think my concept is rather complete; at least it is not obvious to me
which weird combination of architecture dependencies is not covered by it.
But is it actually too complex?  Is it overkill?  I think not, and this
has two reasons:

1. The main complexity is in the distribution creation, which is
   completely hidden from the user.  It will happen at the main repository
   of the software distribution.  Users will continue to get an i386
   GNU/Linux distribution, or a powerpc distribution of the GNU/Hurd.  In
   short: there will be distributions created for the config.guess system
   types as we do now.  What we gain is a very simple package installation
   tool which is no longer concerned with scoring or similar, but is
   merely a dependency checker.

2. The most difficult part seems to be the scoring.
   But this is only complicated if you want to implement a general
   catch-all solution.  I think the number of cases where scoring actually
   happens is very small, and can be covered by some simple rules specific
   to the distribution you are creating.

The rest is really only organization of the packages that are uploaded and
installed in the distribution.


Determining if a package needs to be recompiled
===============================================

Consider the following situation: System B emulates system A completely,
but has additional features, which software C can use.  If C is compiled
for A, the resulting binary can run on B, but when C is recompiled for B,
it will make use of the special features provided by system B.

In reality, most packages adapt themselves at run time or are conflicting,
so the case above should occur rather seldom.

In this case, a feature list can be added to the source package, which
contains a hint for the builder that if compiled for system B, the package
will offer further features.

I have not worked out the details for this, mostly because this case
occurs very rarely and any agreement on the feature tag that offers the
above distinction will fit.


An example
==========

How could this scheme be applied to Debian?  The easiest approach would be
to encode the architecture field into the dependencies and drop it from
the control data.

Old version:

  Package: bash
  Architecture: i386
  Depends: libc6 (>= 2.2.5)

New version:

  Package: bash
  Depends: i386, libc6 (>= 2.2.5), ...

For distribution creation, the distribution for the i386 architecture
would provide the virtual package `i386' by default.  Likewise for other
architectures.  All architectures would provide the virtual package `all'.

For example, if you encoded the base dependencies in a dummy package
`base-arch', the packages file might look like this:

  Package: base-arch
  Essential: yes
  Version: 1.0
  Provides: i386, all

  Package: bash
  Depends: i386, libc6 (>= 2.2.5), ...

  Package: makedev
  Depends: all, base-passwd (>= 3.0.4)

Nothing would be gained by this alone; semantically, it would be
completely backward compatible with the current way architectures are
handled.  However, with such a solution in place, you can easily fix some
of the current problems, like the problem that you can not have packages
that are installable on all Linux systems:

  Package: base-arch
  Essential: yes
  Version: 1.1
  Provides: i386, linux, all

  Package: makedev
  Depends: all, linux, base-passwd (>= 3.0.4)

Or, you could easily define a package that is installable on all i386
systems:

  Package: oskit
  Depends: i386

Or you could define a package that runs on all i386 systems with glibc:

  Package: freesweep
  Depends: i386, libc.so.6

(This assumes that libc.so.6 is provided by all libraries exposing the
current GNU/Linux glibc ABI, and not by others.  Sorry, BSD folks, this is
just a simple example :)


TODO
====

Specify what to do with the source package's control information in the
above cases.