Issues with BIND 4.9.x resolver code and SunOS 4.1.x shared libraries
=====================================================================

$Id: ISSUES,v 1.3 1996/11/26 10:11:24 vixie Exp $

Changes to the shared library setup have lots of little pitfalls and
mines.  This is an attempt to map the minefield, for those who feel
they've noticed something that they think should be done another way.

by Chris Davis <ckd@kei.com>

heavily based on a document by Dave Morrison <drmorris@mit.edu>, 2/3/94

=======================================================================
The following four items should be read by everyone; they expand upon
installation techniques and issues discussed in shres/sunos/INSTALL, as well
as items that may need to be addressed after installing BIND.
=======================================================================

* Differences between Sun's resolver and BIND's resolver

Sun's name resolver, in the default setup, is reached via NIS.  If a host
is not found in the NIS map, the NIS server program (ypserv) looks for a
special "cookie" value (which is generated by uncommenting the "B=-b" line
in /var/yp/Makefile) and, if that value is found, does a DNS lookup and
returns the found value (if any).  This means that locally-defined names
are found in the NIS host map, and others are (usually) found indirectly
by consulting the DNS.

Programs compiled with Sun's libresolv.a (such as /usr/lib/sendmail.mx)
will use DNS directly, but due to deficiencies in the shipped library are
vulnerable to certain types of DNS spoofing (see RFC 1535).  (Some
versions of Sun's libresolv.a also fail to "fall over" to a second
nameserver if the first is not responding.)

BIND's resolver does not use NIS either, but it includes fixes for the
problem behavior described in RFC 1535.  Because of these fixes, BIND no
longer has the old domain search behavior; you may need to put a "search"
directive in resolv.conf.  See the BOG for more details on "search".
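
For sites that relied on the old implicit search, here is a rough sketch
(not part of the BIND distribution) that prints a "search" line
approximating it: the local domain followed by each parent that still has
at least two labels.  The function name and the example domain are made
up, and it assumes a POSIX-style shell rather than the stock SunOS
/bin/sh.  Listing parent domains brings back some of the RFC 1535
exposure, so trim the result to domains you actually control.

```shell
# old_search_line DOMAIN
# Print a resolv.conf "search" line listing DOMAIN and each parent
# domain that still contains a dot (i.e. has at least two labels).
# Hypothetical helper, not shipped with BIND.
old_search_line() {
    domain=$1
    line="search"
    while :; do
        case $domain in
            *.*) line="$line $domain"
                 domain=${domain#*.} ;;   # drop the leftmost label
            *)   break ;;
        esac
    done
    printf '%s\n' "$line"
}

# Example with a made-up domain:
old_search_line cs.eng.example.edu
```

Review the resulting line by hand before pasting it into resolv.conf.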

Because BIND's resolver does not consult /etc/hosts or NIS, you may find
that "localhost" and "loghost" do not resolve.  You should put the name
"localhost", with the address 127.0.0.1, in every domain that contains
hosts.  (See the BOG for more details.)  If your syslog.conf files use
references to "loghost", you will need to either add "loghost" to your
zone file (probably as a CNAME) or change the syslog.conf files.
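
As an illustration only, the two names can be added to a zone with records
along these lines.  The helper below just prints them; its name and the
log host "logger.example.com." are placeholders, and the exact layout
should be matched to your existing zone file.

```shell
# local_records LOGHOST_TARGET
# Print zone-file records for "localhost" and a "loghost" alias.
# LOGHOST_TARGET stands in for your real log host; give it as a fully
# qualified name with a trailing dot.  Hypothetical helper.
local_records() {
    printf 'localhost\tIN\tA\t127.0.0.1\n'
    printf 'loghost\t\tIN\tCNAME\t%s\n' "$1"
}

# Example with a made-up log host:
local_records logger.example.com.
```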

* UDP checksums

Since DNS queries and responses use UDP, it is extremely useful to have
UDP checksums enabled in order to allow detection of errors.  SunOS, by
default, has UDP checksums off, ostensibly to speed performance of NFS by
depending on the Ethernet layer to do checksumming.  (Of course, this
performance improvement was done in the days when a Sun3/60 was a pretty
hot box to have on your desk, so it's a bit pointless now in these days of
SPARC.)  Since not all DNS queries stay on your Ethernet, use of the UDP
checksum is *highly* recommended.  (It's also not a bad idea if you do
NFS, since not all NFS stays on one Ethernet, and the Ethernet layer won't
catch every error.)

To turn it on permanently, edit /sys/netinet/in_proto.c, change the line
"int udp_cksum = 0" to "int udp_cksum = 1", and rebuild your kernel.

To turn it on "on the fly":
   echo "udp_cksum/W1" | /usr/bin/adb -wk /vmunix /dev/mem

* Modifying the static libc

The 4.9.x libresolv uses an external routine (strerror) that is not in
Sun's shipped libc.  The BIND 4.9.x shared library install procedure
merges the compatibility code for strerror into libc.so, but it does not
modify libc.a, nor does it add the code to the static libresolv.a.
This means code that is statically linked (cc -Bstatic or gcc -static)
that uses -lresolv will fail to link unless also linked with the
compatibility library (lib44bsd.a).  Typical culprits are emacs and
Berkeley sendmail 8.x, since they're among the few things that are often
linked statically.

It also means that programs that use auto-configuration utilities may
detect strerror in the shared C library, but compile against the
non-shared C library, resulting in problems.

Solutions for this dilemma include (but are probably not limited to) the
following:

  - use 'ar' to put the needed compatibility code in libc.a

  - use 'ar' to put the needed compatibility code in libresolv.a

  - use 'ar' to put the compatibility code in *both* libc and libresolv

  - always link programs dynamically, even when using -lresolv

  - link with -l44bsd when statically linking and using -lresolv

The needed compatibility module is compat/lib/strerror.o.

My personal solution was to link it into libc.a, in order to have as few
differences between statically and dynamically linked programs as
possible.  Besides, strerror is a Good Thing, and Sun should have included
it in libc in the first place.

Michael Helm pointed out, however, that there is a potential problem; if
code is linked with -lresolv and dynamically linked against a libc that
contains strerror, then moved to a machine that doesn't have strerror in
libc.so, it can fail (possibly silently, and often at the worst possible
time).  Accordingly, I now recommend putting the compatibility code into
libc.a (to protect against programs detecting it in libc.so and linking
with libc.a) and libresolv.a (to prevent the unresolved dependency issue).

To do this, first make a copy of libc.a (call it libcnew.a).

Then, from the top of the BIND build tree (i.e. $BINDSRC/sun4.b if you did
"make links"):

  ar rv /usr/lib/libcnew.a compat/lib/strerror.o

(you should see something like
  a - compat/lib/strerror.o
as the output from ar)

  ranlib /usr/lib/libcnew.a

Make sure nobody is using the static libc for anything, then

  cd /usr/lib
  mv libc.a libc.a.old && mv libcnew.a libc.a

Now, repeat the process with libresolv.a:

  cd $BINDSRC/sun4.b
  cp /usr/lib/libresolv.a /usr/lib/libresolvnew.a
  ar rv /usr/lib/libresolvnew.a compat/lib/strerror.o
  ranlib /usr/lib/libresolvnew.a
  cd /usr/lib
  mv libresolv.a libresolv.a.old && mv libresolvnew.a libresolv.a
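
The two passes above follow the same pattern, so they can be folded into
one small helper.  This is only a sketch: the function name, the ".new"
scratch suffix and the RUN variable are my own inventions, not part of
the BIND install procedure.  With RUN=echo it prints the commands instead
of running them, which is a sensible first step before touching /usr/lib.

```shell
# add_compat LIBDIR LIB OBJECT
# Add OBJECT to a copy of LIBDIR/LIB, then swap the copy into place,
# keeping the original as LIB.old.  Set RUN=echo for a dry run that
# only prints the commands.  Hypothetical helper.
add_compat() {
    libdir=$1 lib=$2 obj=$3
    new=$libdir/$lib.new          # scratch copy; the name is arbitrary
    ${RUN:-} cp "$libdir/$lib" "$new" &&
    ${RUN:-} ar rv "$new" "$obj" &&
    ${RUN:-} ranlib "$new" &&
    ${RUN:-} mv "$libdir/$lib" "$libdir/$lib.old" &&
    ${RUN:-} mv "$new" "$libdir/$lib"
}

# Dry run for both libraries, from the top of the BIND build tree:
RUN=echo add_compat /usr/lib libc.a compat/lib/strerror.o
RUN=echo add_compat /usr/lib libresolv.a compat/lib/strerror.o
```

Each step is chained with &&, so a failed ar or ranlib leaves the
original library in place.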

* RFC 1101 network names vs. /etc/networks

If you install BIND's getnetby* routines in your shared library, you will
find that programs using the shared library will no longer consult
/etc/networks.  Instead, they will use the DNS to resolve network names as
well as host names.

I consider this to be a Good Thing.  Just as /etc/hosts has been
deprecated and effectively replaced by the DNS, so should /etc/networks be
replaced by a dynamic, scalable, and centrally-updatable system (the DNS).

You will probably want to put DNS entries in for your networks.  RFC 1101,
included in the BIND distribution (doc/rfc/rfc1101) has the full details;
you basically just add forward and reverse mapping entries for "host-zero"
addresses.  (The class C network 192.88.144 is 0.144.88.192.in-addr.arpa,
for example.)
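
To make the "host-zero" naming concrete, here is a small sketch that
turns the network part of an address into its in-addr.arpa name by
padding with zeros and reversing the octets.  The function name is made
up, and it only builds the name; see RFC 1101 for the records that go
with it.

```shell
# net_to_arpa NETWORK
# NETWORK is the network part of a dotted-quad address (1 to 3 octets,
# e.g. 192.88.144).  Pad it with zeros to four octets and reverse it.
# The body runs in a subshell so the IFS change does not leak out.
# Hypothetical helper.
net_to_arpa() (
    set -f            # no filename expansion on the unquoted $1
    IFS=.
    set -- $1         # split the argument on dots
    printf '%s.%s.%s.%s.in-addr.arpa\n' "${4:-0}" "${3:-0}" "${2:-0}" "$1"
)

# The class C network from the text:
net_to_arpa 192.88.144
```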

Once you do this, netstat -r will display network names, rather than
numbers; netstat -rn will display addresses numerically.

=======================================================================
The following items are more nitty-gritty, "why it was done this way"
issues, and can be safely ignored if all you care about is getting your
system to look up names in the DNS.
=======================================================================

* What's shared, what's static

The purpose of these modifications to Sun's libc.so is to provide DNS
lookup for gethostby* and, if you desire, getnetby* (the latter requires
installing RFC 1101 network entries in the DNS).  This involves replacing
the following SunOS libc routines.

	gethostbyname			getnetbyname
	gethostbyaddr			getnetbyaddr
	gethostent			getnetent
	sethostent			setnetent
	endhostent			endnetent

The routines use the res_* routines from the resolv library to get their
information from DNS.  Because it is most convenient, all these objects
are linked into the shared library, meaning they are linkable without
using -lresolv.  Full details are given below, and unless you want to get
into the nitty gritty, obey the following rule.

Anything which uses -lresolv routines other than the stock OS routines
above should link using -lresolv.

The symptom of not obeying this rule is finding that _res is unresolved at
link time.

* global variable collision

The global variable _res is particularly troublesome.  Any executable
that was linked with -lresolv before the shared library was installed
has _res statically compiled in as a global data structure.
Unfortunately, the resolv library in BIND 4.9.x also has a global variable
_res, and it is defined slightly differently.  At run time, when the
shared libraries are loaded, some linking is done by ld.so.  The
runtime linker notices that _res is statically defined and does not link
in the dynamic version.  This means that if the shared libc resolver code
ever gets called from this executable, it would write its (larger) idea
of _res on top of the static version.  Since the static version is a
smaller data structure, this could overwrite adjacent bits of memory.  Not
good.  It turns out the worst case is not a likely scenario, but I'd
rather be safe than sorry.

This is why shres/sunos/Makefile does -D_res=_res_shlib, which removes the
collision.  This means that _res is not accessible as a global variable in
the shared libc library.  To compile a program which accesses _res
directly, libresolv must be linked in statically.

This would not be a problem if you could recompile any code which used
_res.  This would mean recompiling some of SunOS and perhaps other vendor
code if you've obtained additional software.  Since people don't generally
have the source to everything on the machine, this isn't a viable option
except for Sun and miscellaneous wizards.

Note that because of this workaround, you cannot use libresolv_pic.a as
/usr/lib/libresolv.a, which would make things much simpler.  (If you try,
programs linked with -lresolv won't find _res, as it will be named
_res_shlib.)

* Having named and tools linked with a shared libc.

It is very tempting (and almost doable) to compile the entire BIND
distribution with a resolv in a shared libc.  There are dangers associated
with doing this.  First, there's the global variable collision problem
mentioned above.  Second, there's the problem of maintaining the shared
library version control.

People have a tendency to copy tools like dig or the named server from
machine to machine.  If the new shared library (the one with *this*
distribution's resolv) is not present on the machines to which these
goodies are copied, the user will be getting SUN'S copy of resolv.
This could cause you to lose most heinously, and you will spend DAYS if
not WEEKS trying to figure out what the problem is.  It's debatable whether
there's even a performance improvement from doing the sharing.  Compare
that to the debugging and frustration time you are going to spend.

You will also need to replace libc everywhere whenever a new release comes
out.  This isn't as big an issue for a production release of BIND, but for
the alpha test team, linking statically means fewer things to worry about
when there is already plenty to worry about.

Again, if you could recompile everything, there wouldn't be a problem.
Vendors should release the tools and server shared, as they already have
the assurance that there is a standard libc, and users may want to handle
some problem routines by relinking the shared library.

* shared archives

In addition to a shared object (the libc.so files), which contains the
executable libc code, there is also a shared archive (the libc.sa files).
The shared archive contains global initialized data.  When a program is
linked, if it accesses any of this global initialized data, that data is
included from the shared archive in the final executable.  Some examples
include errno (initialized to zero), the ctype.h tables, sys_errlist, and
_iob for stdio.

If this data is not accessible from a shared archive, but is accessible
from the shared object (e.g. no libc.sa.x.y.z exists for libc.so.x.y.z),
the shared object copy will be used, but not linked into the executable.
This results in a performance hit for executables which use that data.
Sun's documentation claims this to be possibly degrading to the system as
a whole on a heavily used library.  I have yet to observe anything besides
a slight (max 10%) performance hit.

This is why it is important to copy+ranlib the old libc.sa.a.b.c, when
creating a new libc.so.x.y.z.  Sun's instructions in building a new shared
libc (shlib.etc package or patch) neglect to mention this.

There are 5 instances of global initialized data in -lresolv.  They are
_res (renamed to _res_shlib), _res_resultcodes, _res_opcodes, h_errlist,
and h_nerr.  In principle, they should be added to libc.sa.x.y.z.
However, as long as they are never referenced, it does not matter that
they are not there.  Programs which use these variables should link with
-lresolv to get the static version, and the problem is solved.

The reason for not including them in the shared archive is that if this
global data ever changed, as it might in a future BIND release, the MAJOR
version of the library would have to change.  By using the static versions
with -lresolv, you allow yourself the option to upgrade the -lresolv code
without major fuss.

Update: as of BIND 4.9.3, the resolver library no longer uses initialized
static data, so this should never be a problem again.  (You should still
copy and re-ranlib the Sun-supplied libc.sa, however.)

* shared library revision numbers

Technically, the shared library changes are sufficient to warrant a
minor revision change.  On SunOS 4.1.3, this would mean the shared library
should be numbered libc.so.1.9.  However, Sun has already used this number
for 4.1.3_U1 and 4.1.4.  If you upgrade, you will suddenly have two
libc.so.1.9's.  Programs would be compiled to use "libc.so.1.9", and there
would be no distinction between those which want to use the SunOS
libc.so.1.9 and those which want the locally compiled libc.so.1.9.  At this
point, the locally compiled libc.so.1.9 should really be 1.10, and you have
to recompile everything you originally compiled, anyway.

So, for 4.1.3, stick with libc.so.1.8.x++; for 4.1.3_U1 and 4.1.4,
libc.so.1.9.x++.  Just be aware that if you compile on a machine with this
new shared library, and you use the res_ routines directly without
-lresolv (uncool, see above) you will not be able to take it to a previous
stock SunOS without a few problems.

For SunOS 4.1.1 (and 4.1.1_U1) [on sun3] the library numbers already
include a "CUSTOM" number, so the best thing to do is just continue to
increment this number as the awk script will do for you.
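
The "x++" numbering can be pictured with a few lines of awk.  This is a
sketch of the idea only, not the awk script from the distribution; the
function name is made up, and starting the custom number at 1 for a stock
two-part version is my assumption.

```shell
# next_custom VERSION
# VERSION is the numeric suffix of a libc.so file, e.g. 1.8 or 1.8.2.
# A two-part version gets a first custom number (assumed to be 1);
# a three-part version has its last part incremented.
# Hypothetical helper.
next_custom() {
    echo "$1" | awk -F. '
        NF == 2 { print $1 "." $2 ".1" }
        NF == 3 { print $1 "." $2 "." ($3 + 1) }'
}

# Examples:
next_custom 1.8
next_custom 1.9.2
```

The major and minor numbers are left alone, which is the point: the
result still satisfies programs that ask for the stock minor revision.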

* Compiling with gcc

Compiling resolv with gcc is highly preferable as it understands the
concept of making read only data shared.  Sun's 4.1.3 cc doesn't (simply
to make read-only strings shared takes some nasty effort).

Currently (BIND 4.9.3 or later resolver library and gcc 2.5.8 or later),
the resolver library does not use any special gcc references.
Specifically, there are no unresolved references in libresolv.a objects
that are brought in from libgcc.a.  This means that even if you compile
with gcc, the objects created may be linked with any compiler.  All is
cool, use gcc.

SHOULD THIS CHANGE (in a new release of gcc or BIND - not likely to
change, but possible), you can still use gcc and create objects usable by
any compiler.  You will need to add libgcc.a to the shared library link
line (before -ldl).
