          WWWOFFLE - World Wide Web Offline Explorer - Version 2.4a
          =========================================================


WHAT?
-----

The format of the cache that WWWOFFLE uses to store the web pages has changed in
version 2.x compared to the previous versions.  If you have used WWWOFFLE
version 1.x then you *MUST* upgrade the existing cache before you can use the
new version of the program.


HOW?
----

         *** READ ALL THIS SECTION BEFORE DOING ANYTHING ELSE ***


When you compile WWWOFFLE, another program called 'upgrade-cache' is also
built.  You need to run this program to convert the cache from the old format
to the new one.


There are several options that you can take for this upgrade; the following
applies to all of them.

In each of the options the basic step is the same: run upgrade-cache with a
single argument, the name of the cache directory that is used (usually
/var/spool/wwwoffle).  While it runs, the program prints out informational and
warning messages; these may be useful.


Option 1 - Be reckless

Run 'upgrade-cache /var/spool/wwwoffle', watch the messages go flashing by and
hope that it works.

Option 2 - Be brave

With sh/bash run 'upgrade-cache /var/spool/wwwoffle > upgrade.log 2>&1',
or with csh/tcsh run 'upgrade-cache /var/spool/wwwoffle >& upgrade.log',
then read the messages and check the warnings.

Option 3 - Be safe

Back up the cache first, then follow option 2.
With GNU tar I suggest that you use the --atime-preserve option so that the
access times of the files in the cache are not modified by performing the
backup.  The index and purge functions in WWWOFFLE use these access times, so
preserving them is important.
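The backup step might look like this (a sketch only; the archive name is my
own choice, and --atime-preserve requires GNU tar):

```shell
# Sketch: back up the whole cache while leaving file access times alone.
# --atime-preserve requires GNU tar; the archive name is an assumption.
backup_cache() {
    spool=$1                                   # e.g. /var/spool/wwwoffle
    tar --atime-preserve -cf "$2" \
        -C "$(dirname "$spool")" "$(basename "$spool")"
}

# Usage: backup_cache /var/spool/wwwoffle wwwoffle-cache-backup.tar
```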


When it finishes, the individual host-named directories in /var/spool/wwwoffle
are gone, moved into a new sub-directory called http.  The outgoing directory
and this new http directory are the only directories that should be left.
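One way to check the result is a loop like the following (a hypothetical
helper, not part of WWWOFFLE; pass it the same directory you gave
upgrade-cache):

```shell
# Hypothetical helper: report anything left in the spool directory
# other than the expected 'http' and 'outgoing' directories.
check_spool() {
    for entry in "$1"/*; do
        case "$(basename "$entry")" in
            http|outgoing) ;;              # expected after the upgrade
            *) echo "leftover: $entry" ;;
        esac
    done
}

# Usage: check_spool /var/spool/wwwoffle
```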

If there are warning messages then you should decide what needs to be done.
They could be caused by any of the following:

That upgrade-cache was run by a user without write permissions.
That one or more files were changed while the program was running.
That there is a spare file in one of the host directories that needs deleting.
That there is a symbolic link that does not point anywhere.


If the upgrade-cache program crashes then that is a bug - tell me.

If you are left with many files or directories and the warnings are unclear then
this may be a bug - tell me.

If there are only a small number of spare files or directories, then just delete
them, you probably won't notice that they are missing.


WHY?
----

The old scheme for naming the files in the cache had some problems; the new
one is better.

0) It was designed for my personal use, which did not involve storing many
   web pages or visiting any pages with unusual names.
   You could say that the hacks that I implemented to get it working were not
   well enough thought out.  But at the time I wrote it I wanted to get it
   working as soon as possible and did not design it with future growth in
   mind.  The scheme as implemented has never caused any problems for me
   personally.

1) It was possible for a web page that has several possible arguments to be
   stored incorrectly.
   For each page that has arguments, a hash value is computed from the
   arguments to provide a unique filename.  The failing was that I used a
   hash function that I made up on the spot, giving a 32-bit hash value.
   This seemed to be sufficient for 4 billion sub-pages for each host and
   path combination.  As it turned out the hash function was not strong
   enough, and the number of distinct values it actually produced was much
   smaller, so collisions occurred.

2) There was no provision for any protocol other than http.
   Very quickly the idea of handling ftp as well came to mind, but it could
   not be implemented easily or cleanly with the old scheme.

3) The outgoing directory was inefficient for large numbers of files.
   An increasing sequence of numbers was used, resulting in slow access; this
   was fixed in version 1.2x, but there could still be many requests for the
   same URL in the directory.  Now a unique name based on a hash of the URL
   is used, so only one request for each page is stored.
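The idea can be sketched with a stand-in hash (md5sum here; this is not
WWWOFFLE's actual hash function or filename format):

```shell
# Sketch: derive one filename per URL, so a repeated request for the
# same URL maps to the same file instead of piling up new entries.
# md5sum and the 'O' prefix are stand-ins, not WWWOFFLE's real format.
request_name() {
    printf 'O%s' "$(printf '%s' "$1" | md5sum | cut -c1-16)"
}
```

Requesting the same URL twice produces the same name, so the second request
simply overwrites the first rather than adding another file.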

4) Bad characters and url-encoded URLs caused problems.
   Some URLs that had unusual characters, including URL-encoded sequences,
   caused problems.  The URLs http://www.foo.com/~bar and
   http://www.foo.com/%7Ebar refer to the same page but could be stored in
   different files.
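To see the equivalence, the %XX escapes can be decoded so both spellings
compare equal (a bash-specific sketch; WWWOFFLE normalizes URLs internally
in its own way, and a real decoder must handle more cases than this):

```shell
# Sketch (bash-specific): decode %XX escapes by rewriting '%' to '\x'
# and letting printf's %b expand the resulting \xNN sequences.
urldecode() { printf '%b' "${1//%/\\x}"; }
```

With this, urldecode 'http://www.foo.com/%7Ebar' yields the same string as
the plain http://www.foo.com/~bar spelling.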

5) It is now a neater design with no special cases.
   Previously only files with arguments needed hashing; now all of them use
   a hash, which simplifies the logic.  The format of the outgoing directory
   is the same as that of the other directories.

6) There are more possibilities for future expansion.
   It is now possible to consider adding more files to the cache to store
   extra information about a URL, for example a password.  With the new
   scheme this would simply be another file with the same hash value but a
   different prefix.
