Architecture Handling in Distributed Software Collections
=========================================================

Copyright 1999, 2000, 2002 Marcus Brinkmann <brinkmd@debian.org>
 distribution of verbatim copies allowed without restriction. 
 If you want to modify this, contact me and I will put it under a better
 license. I am not unwilling, just lazy.

The way I think about architecture treatment stems from my experiences
with the Debian package tools and bootstrapping the Debian GNU/Hurd
system.

The concept presented below has the following core ideas, which I
think are important enough to summarize in a short list:

* Package installation is orthogonal to distribution creation.
* Package installation requires only dependency verification.
* Distribution creation requires dependency verification and scoring.


Package Installation
====================

A package manager (the part of the packaging system that installs
packages on a host machine) should only be concerned with fulfilling
dependencies.  It should determine if a package is installable, and if
it is, it should perform the required action.

For this, packages (most likely virtual ones) that represent ABI
capabilities work best.  You can provide virtual architecture packages
for any ABI feature, such as which processor is required, the object
format (a.out, elf), whether a /proc filesystem is provided and to
which standard it conforms, whether Linux syscalls are available,
whether Mach interfaces are provided, and so on.  IMO, dependencies
should be as weak as possible.  If a pentium-optimized package also
runs on a 486, the dependency should be "486" (or even "386"), not
"pentium".

Examples:

* "grub" only depends on "i386", regardless of the operating system
  running.  (Note: this is not true anymore since grub provides a grub
  shell that runs under the OS.  Another example which still is valid
  is "mbr").

* Most GNU/Linux binaries depend only on a certain cpu, the elf
  object format, and certain libraries identified by their soname,
  and do not make use of syscalls or the proc filesystem directly.
  Those will run on the GNU/Hurd without recompilation.

* A perl script evaluating the Linux proc fs (version 2.2) can depend
  on perl and the procfs virtual package of version 2.2.

That was the "depends on" side.  There is also the "provides" side.  A
typical pentium GNU/Linux box will provide "i386, i486, i586",
"linux-syscalls (= 2.2)", "elf" and "proc-fs", for example.  Note that
you can have version numbers here, too.  Which packages are provided
is a feature of the "distribution".
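
As a minimal sketch of the dependency check described above, the
following Python fragment treats what a host "provides" as a mapping
from virtual package names to optional versions.  The function name
is_installable, the data layout, the exact-match version comparison
and the version attached to "proc-fs" are assumptions made for
illustration; they are not taken from any existing tool.

```python
def is_installable(depends, provided):
    """Return True if every dependency is satisfied by `provided`.

    depends:  list of (name, version) pairs; version None means "any".
    provided: dict mapping a provided name to its version, or None.
    Only exact version matches are checked (a simplifying assumption).
    """
    for name, version in depends:
        if name not in provided:
            return False
        if version is not None and provided[name] != version:
            return False
    return True


# A pentium GNU/Linux box as in the text.  The "proc-fs" version and
# the installed "perl" entry are assumed for the sake of the example.
pentium_box = {
    "i386": None, "i486": None, "i586": None,
    "linux-syscalls": "2.2", "elf": None, "proc-fs": "2.2",
    "perl": None,
}

mbr = [("i386", None)]
proc_script = [("perl", None), ("proc-fs", "2.2")]
powerpc_tool = [("powerpc", None)]  # hypothetical package

print(is_installable(mbr, pentium_box))
print(is_installable(proc_script, pentium_box))
print(is_installable(powerpc_tool, pentium_box))
```

With this data the first two checks succeed and the third fails,
because nothing on the box provides "powerpc".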


Distributions
=============

The distribution is really a concept that should not be built directly
into the package installer, because it requires an overview of the
whole system.  The reason is that a distribution needs to be a
consistent set of packages.

Here config.guess may come in.  There is a "default" mapping from
canonicalized host type strings to virtual packages, for example:

i586-gnu-linux   -->   "i386", "i486", "i586", "linux"
i486-gnu         -->   "i386", "i486", "gnu"

Note that concepts like "linux-syscalls", "elf" and "procfs" would be
provided by the Linux kernel package itself.  (Or by emulator packages
on the GNU/Hurd, for example).
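
The default mapping could be sketched roughly as follows.  This is
only an illustration using the host type strings exactly as written
above; the function name default_provides, the X86_CHAIN list and the
rule "the last component names the system" are my assumptions, not
something config.guess itself defines.

```python
# Processors ordered so that each one also runs code for those before it.
X86_CHAIN = ["i386", "i486", "i586"]


def default_provides(host_type):
    """Map a host type string to its default virtual packages.

    Assumes the form "<cpu>-<...>-<system>", as in the examples above.
    """
    cpu, _, rest = host_type.partition("-")
    if not rest:
        raise ValueError("host type needs a cpu and a system part")
    if cpu in X86_CHAIN:
        # An i586 also provides i386 and i486, and so on.
        cpus = X86_CHAIN[: X86_CHAIN.index(cpu) + 1]
    else:
        cpus = [cpu]
    system = rest.split("-")[-1]
    return cpus + [system]


print(default_provides("i586-gnu-linux"))
print(default_provides("i486-gnu"))
```

The two calls reproduce the two rows of the mapping shown above.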

It is not hard to write a utility that can create distributions on the
fly if all available packages are collected in a pool.  You just have
to map your host system type to a set of provided virtual packages,
and start adding packages whose dependencies can be fulfilled.  Of
course, the packages you add open up further possibilities through
their own "provides:", so you can add more and more packages from
your pool to your distribution.  Using some standard algorithms from
graph theory, you should be able to take conflicting packages etc.
into account and arrive at a maximum distribution, i.e., a
distribution which consists only of packages that are installable
using packages from this distribution, and which is not missing any
package from the pool with this property.
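
The procedure can be sketched as a simple fixed-point iteration.
This is a deliberately reduced illustration: it ignores versions and
conflicts, and the function name build_distribution, the pool layout
and the package names "kernel-image", "procinfo-script" and
"foo-powerpc" are hypothetical.

```python
def build_distribution(base_provides, pool):
    """Return the set of pool packages installable from base_provides.

    pool maps a package name to {"depends": [...], "provides": [...]}.
    Packages are added until a full pass adds nothing new.
    """
    provided = set(base_provides)
    selected = set()
    changed = True
    while changed:
        changed = False
        for name, pkg in pool.items():
            if name in selected:
                continue
            if all(dep in provided for dep in pkg["depends"]):
                selected.add(name)
                # A package provides itself plus its virtual packages.
                provided.add(name)
                provided.update(pkg.get("provides", []))
                changed = True
    return selected


pool = {
    "procinfo-script": {"depends": ["perl", "procfs"]},
    "perl": {"depends": ["i386", "elf"]},
    "kernel-image": {
        "depends": ["i386", "linux"],
        "provides": ["linux-syscalls", "elf", "procfs"],
    },
    "foo-powerpc": {"depends": ["powerpc"]},
}

dist = build_distribution(["i386", "i486", "i586", "linux"], pool)
print(sorted(dist))
```

Here "perl" only becomes installable after "kernel-image" has added
"elf", and "procinfo-script" only after "perl"; "foo-powerpc" never
does, so it stays out of this distribution.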


Hints or Scoring
================

Hints can be complex and I haven't thought them completely through.
But here is a rather complete analysis of the different cases that you
have to treat differently when considering scoring:

Usually, we would recompile a package for another architecture only if
it is not yet available for that architecture, i.e., if it is not in
that architecture's distribution.  Furthermore, recompiling for another
architecture would often result in a binary package that is not useful
on any of the existing architectures, so there is no conflict.

Example:
 Package foo is available for all i386 arches, because it only
 depends on the virtual package "i386".

 Because we want to support powerpc, we recompile it for that
 architecture.  The resulting binary depends on powerpc only.
 Currently, no platform can run i386 and powerpc binaries at the same
 time.  Therefore, all distributions we create with the procedure
 above will contain only one of these packages.

If two binary versions of the same package happen to match the same
distribution, two cases are possible:

1) "Scoring": Both packages have different dependencies ("i386" vs.
   "powerpc") or do not carry hints.
2) "Hints": Both packages have the same dependencies ("i386" native and
   "i386" with pentium optimization [but without pentium specific
   instructions]).  Then we need hints to decide.

I won't go into detail on how cases 1) and 2) could be treated, only
this much:

For 1), it would be sufficient to order the available virtual
dependency packages that should be considered in a priority list.  Or,
more abstractly: you have a function "int score(list of dependencies)"
which calculates a value for each package, and the highest value wins.
Equal values would mean both packages are equivalent and it doesn't
matter which one is picked.  The function can be very simple and only
catch the cases that are known to occur.
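
One possible shape for such a function, as a sketch only: the
priority list is written as a table from virtual package name to
rank, and the best-ranked dependency decides the score.  The table
PRIORITY_I586, its values and the candidate names are assumptions for
an imagined i586 distribution.

```python
def score(dependencies, priority):
    """Score a package by its best-ranked architecture dependency.

    Dependencies missing from the priority table count as 0.
    """
    return max((priority.get(dep, 0) for dep in dependencies), default=0)


# Hypothetical priority list for an i586 distribution.
PRIORITY_I586 = {"i386": 1, "i486": 2, "i586": 3}

candidates = {
    "foo (generic build)": ["i386", "libc6"],
    "foo (pentium build)": ["i586", "libc6"],
}

best = max(candidates, key=lambda name: score(candidates[name], PRIORITY_I586))
print(best)
```

For the i586 distribution the pentium build wins, since "i586" ranks
above "i386" in the table.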

If both packages have the same dependencies, but are not equal, they
have to tell us more about themselves in a "hints" metadata field.
This could be the config.guess of the architecture they were compiled
for, for example, or some preferred target architecture, or something
similar.  The scoring would then be a function of the dependencies and
the hints metadata.  This can actually be a very simple function (like
hint "i586" is better than "i486" is better than "i386" for the i586
distribution).
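
A sketch of how hints might enter the score, under the assumption
that a hint is a single processor name stored with each binary
package.  The names score_with_hints, DEP_PRIORITY and HINT_RANK, and
the "hint" field itself, are hypothetical.  The score is a pair, so
dependencies are compared first and the hint only breaks ties.

```python
# Hypothetical tables for an i586 distribution.
DEP_PRIORITY = {"i386": 1, "i486": 2, "i586": 3}
HINT_RANK = {"i386": 1, "i486": 2, "i586": 3}


def score_with_hints(dependencies, hint):
    """Return (dependency score, hint rank); tuples compare in order."""
    dep_score = max((DEP_PRIORITY.get(d, 0) for d in dependencies), default=0)
    return (dep_score, HINT_RANK.get(hint, 0))


# Two builds of the same package with identical dependencies.
candidates = [
    {"name": "foo", "depends": ["i386"], "hint": "i386"},
    {"name": "foo", "depends": ["i386"], "hint": "i586"},
]

best = max(candidates, key=lambda c: score_with_hints(c["depends"], c["hint"]))
print(best["hint"])
```

Both builds depend only on "i386", so the first component is equal
and the pentium-optimized build is chosen on its hint alone.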

The _important_ thing is that scoring is part of the distribution
creation, not part of package installation.  I should always be
allowed to install a package with a bad score, even if its score is
worse than that of the default package it would replace.  For example,
I could fetch such a package from the package pool to work around a
bug in the optimization or something similar.


Complexity
==========

I think my concept is rather complete; at least it is not obvious to
me which weird combination of architecture dependencies is not covered
by it.  But is it actually too complex?  Is it overkill?

I think not, for two reasons:

1. The main complexity is in the distribution creation, which is
completely hidden from the user.  It will happen at the main
repository of the software distribution.  Users will continue to get
an i386 GNU/Linux distribution, or a powerpc distribution of the
GNU/Hurd; in short, there will be distributions created for the
config.guess system types as we do now.  What we gain is a very
simple package installation tool which is no longer concerned with
scoring or the like, but is merely a dependency checker.

2. The most difficult part seems to be the scoring.  But this is only
complicated if you want to implement a general catch-all solution.  I
think the number of cases where scoring actually happens is very small,
and can be covered by some simple rules specific to the distribution
you are creating.

The rest is really only organization of the packages that are uploaded
and installed in the distribution.


Determining if a package needs to be recompiled
===============================================

Consider the following situation: System B emulates system A
completely, but has additional features, which software C can use.

If C is compiled for A, the resulting binary can run on B, but when C
is recompiled for B, it will make use of the special features provided
by system B.

In reality, most packages adapt themselves at run time or are
conflicting, so the case above should occur rather seldom.  When it
does occur, a feature list can be added to the source package, which
contains a hint for the builder that if compiled for system B, the
package will offer further features.

I have not worked out the details for this, mostly because this case
occurs very rarely and any agreed-upon feature tag that expresses
the above distinction will do.


An example
==========

How could this scheme be applied to Debian?  The easiest approach
would be to encode the architecture field into the dependencies and
drop it from the control data.

Old version:
Package: bash
Architecture: i386
Depends: libc6 (>= 2.2.5)

New version:
Package: bash
Depends: i386, libc6 (>= 2.2.5), ...
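
The rewrite from the old to the new form is mechanical.  As a rough
sketch, with a control stanza represented as a plain dictionary (the
function name fold_architecture and this representation are
assumptions, not part of any Debian tool):

```python
def fold_architecture(stanza):
    """Move the Architecture field to the front of Depends.

    stanza is a dict of control fields; a new dict is returned and
    the input is left untouched.
    """
    fields = dict(stanza)
    arch = fields.pop("Architecture", None)
    if arch is not None:
        deps = fields.get("Depends", "")
        fields["Depends"] = arch + (", " + deps if deps else "")
    return fields


old = {
    "Package": "bash",
    "Architecture": "i386",
    "Depends": "libc6 (>= 2.2.5)",
}

new = fold_architecture(old)
for key, value in new.items():
    print(key + ": " + value)
```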

For distribution creation, the distribution for the i386 architecture
would provide the virtual package `i386' by default.  Likewise for
other architectures.  All architectures would provide the virtual
package `all'.  For example, if you would encode the base dependencies
in a dummy package `base-arch', the packages file might look like
this:

Package: base-arch
Essential: yes
Version: 1.0
Provides: i386, all

Package: bash
Depends: i386, libc6 (>= 2.2.5), ...

Package: makedev
Depends: all, base-passwd (>= 3.0.4)

Nothing would be gained by this alone; semantically, it would be
completely backward compatible with the current way architectures are
handled.  However, with such a solution in place, you can easily fix
some existing problems, such as the fact that you cannot have packages
that are installable on all Linux systems:

Package: base-arch
Essential: yes
Version: 1.1
Provides: i386, linux, all

Package: makedev
Depends: all, linux, base-passwd (>= 3.0.4)

Or, you could easily define a package that is installable on all i386
systems:

Package: oskit
Depends: i386

Or you could define a package that runs on all i386 systems with glibc:

Package: freesweep
Depends: i386, libc.so.6

(this assumes that libc.so.6 is provided by all libraries exposing the
current GNU/Linux glibc ABI, and not by others.  Sorry, BSD folks,
this is just a simple example :)


TODO
====

Specify what to do with source package's control information in the
above cases.