Copyright (C) 2002 Python Software Foundation; All Rights Reserved

The Python Software Foundation (PSF) holds copyright on all material
in this project.  You may use it under the terms of the PSF license;
see LICENSE.txt.


Assorted clues.


What's Here?
============
Lots of mondo cool partially documented code.  What else could there be <wink>?

The focus of this project so far has not been to produce the fastest or
smallest filters, but to set up a flexible pure-Python implementation
for doing algorithm research.  Lots of people are making fast/small
implementations, and it takes an entirely different kind of effort to
make genuine algorithm improvements.  I think we've done quite well at
that so far.  The focus of this codebase may change to small/fast
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, and the f-n rate
has also gotten too small to measure reliably across that much training data.

The code in this project requires Python 2.2 (or later).

You should definitely check out the FAQ:
http://spambayes.org/faq.html


Primary Core Files
==================
Options.py
    Uses ConfigParser to allow fiddling various aspects of the classifier,
    tokenizer, and test drivers.  Create a file named bayescustomize.ini to
    alter the defaults.  Modules wishing to control aspects of their
    operation merely do

        from Options import options

    near the start, and consult attributes of options.  To see what options
    are available, import Options.py and do

        print Options.options.display_full()

    This will print out a detailed description of each option, the allowed
    values, and so on.  (You can pass in a section or section and option
    name to display_full if you don't want the whole list).

    As an alternative to bayescustomize.ini, you can set the environment
    variable BAYESCUSTOMIZE to a list of one or more .ini files.  These will
    be read in, in order, and applied to the options.  This allows you to
    tweak individual runs by combining fragments of .ini files.  The
    character used to separate different .ini files is platform-dependent.
    On Unix, Linux and Mac OS X systems it is ':'.  On Windows it is ';'.
    On Mac OS 9 and earlier systems it is a NL character.

    *NOTE* The separator character changed after the second alpha version of
    the first release.  Previously, if multiple files were specified in
    BAYESCUSTOMIZE they were space-separated.

classifier.py
    The classifier, which is the soul of the method.

tokenizer.py
    An implementation of tokenize() that Tim can't seem to help but keep
    working on <wink>.  Generates a token stream from a message, which
    the classifier trains on or predicts against.

chi2.py
    A collection of statistics functions.
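
As a rough illustration of the BAYESCUSTOMIZE handling described under
Options.py above, here is a minimal sketch that splits the variable on the
platform's path separator and reads each .ini file in order.  It is not the
code in Options.py:  the helper names are made up for the example, it uses
the Python 3 configparser module (ConfigParser in the Python 2 this project
targets), it assumes "applied in order" means a later file overrides the same
setting from an earlier one, and it ignores the Mac OS 9 newline separator.

```python
import configparser
import os


def custom_ini_files(environ=os.environ, default="bayescustomize.ini"):
    """Return the .ini files to read, in order (hypothetical helper)."""
    spec = environ.get("BAYESCUSTOMIZE", default)
    # os.pathsep is ':' on Unix, Linux and Mac OS X, and ';' on Windows.
    return [name for name in spec.split(os.pathsep) if name]


def load_custom_options(environ=os.environ):
    """Read each fragment in turn; later files override earlier ones."""
    parser = configparser.ConfigParser()
    # read() skips files that don't exist and returns those it parsed.
    loaded = parser.read(custom_ini_files(environ))
    return parser, loaded
```

On Unix, BAYESCUSTOMIZE=base.ini:tweak.ini would then read base.ini first
and tweak.ini second (both file names are just examples).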

IMPORTANT NOTE
==============

The applications have all been renamed in preparation for 1.0 - the
following section refers to old application names.

Apps
====
hammie.py
    A spamassassin-like filter which uses tokenizer and classifier (above).

hammiefilter.py
    A simpler hammie front-end that doesn't print anything.  Useful for
    procmail filtering and scoring from your MUA.

mboxtrain.py
    Trainer for Maildir, MH, or mbox mailboxes.  Remembers which
    messages it saw the last time you ran it, and will only train on new
    messages or messages which should be retrained.  

    The idea is to run this automatically every night on your Inbox and
    Spam folders, and then sort misclassified messages by hand.  This
    will work with any IMAP4 mail client, or any client running on the
    server.

pop3proxy.py
    A spam-classifying POP3 proxy.  It adds a spam-judgment header to
    each mail as it's retrieved, so you can use your email client's
    filters to deal with them without needing to fiddle with your email
    delivery system.

    Also acts as a web server providing a user interface that allows you
    to train the classifier, classify messages interactively, and query
    the token database.  This piece will at some point be split out into
    a separate module.

smtpproxy.py
   A message training SMTP proxy.  It sits between your email client and
   your SMTP server and intercepts mail to set ham and spam addresses.
   All other mail is simply passed through to the SMTP server.

mailsort.py
    A delivery agent that uses a CDB of word probabilities and delivers
    a message to one of two Maildir message folders, depending on the
    classifier score.  Note that both Maildirs must be on the same
    device.

hammiesrv.py
    A stab at making hammie into a client/server model, using XML-RPC.

hammiecli.py
    A client for hammiesrv.

imapfilter.py
    A spam-classifying and training application for use with IMAP servers.
    You can specify folders that contain mail to train as ham/spam, and
    folders that contain mail to classify, and the filter will do so.
    Note that this is currently in very early development and not
    recommended for production use.


Test Driver Core
================
Tester.py
    A test-driver class that feeds streams of msgs to a classifier
    instance, and keeps track of right/wrong percentages and lists
    of false positives and false negatives.

TestDriver.py
    A flexible higher layer of test helpers, building on Tester above.
    For example, it's usable for building simple test drivers, NxN test
    grids, and N-fold cross-validation drivers.  See also rates.py,
    cmp.py, and table.py below.

msgs.py
    Some simple classes to wrap raw msgs, and to produce streams of
    msgs.  The test drivers use these.


Concrete Test Drivers
=====================
mboxtest.py
    A concrete test driver like timtest.py, but working with a pair of
    mailbox files rather than the specialized timtest setup.

timcv.py
    An N-fold cross-validating test driver.  Assumes "a standard" data
        directory setup (see below) rather than the specialized mboxtest
        setup.
    N classifiers are built.
    1 run is done with each classifier.
    Each classifier is trained on N-1 sets, and predicts against the sole
        remaining set (the set not used to train the classifier).
    mboxtest does the same.
    This (or mboxtest) is the preferred way to test when possible:  it
        makes best use of limited data, and interpreting results is
        straightforward.

timtest.py
    A concrete test driver like mboxtest.py, but working with "a standard"
        test data setup (see below).  This runs an NxN test grid, skipping
        the diagonal.
    N classifiers are built.
    N-1 runs are done with each classifier.
    Each classifier is trained on 1 set, and predicts against each of
        the N-1 remaining sets (those not used to train the classifier).
    This is a much harder test than timcv, because it trains on N-1 times
        less data, and makes each classifier predict against N-1 times
        more data than it's been taught about.
    It's harder to interpret the results of timtest (than timcv) correctly,
        because each msg is predicted against N-1 times overall.  So, e.g.,
        one terribly difficult spam or ham can count against you N-1 times.
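
To make the difference between the two drivers concrete, the sketch below
lists which sets each scheme trains on and predicts against.  It only
generates the schedule from the descriptions above; it isn't taken from
timcv.py or timtest.py, and the function names are invented for the example.

```python
def cv_plan(n):
    """N-fold cross-validation: one run per classifier.

    Yields (training_sets, test_set): train on the N-1 other sets,
    predict against the single set left out.
    """
    sets = list(range(1, n + 1))
    for held_out in sets:
        yield [s for s in sets if s != held_out], held_out


def grid_plan(n):
    """NxN grid, skipping the diagonal: N-1 runs per classifier.

    Yields (training_set, test_set): train on one set, predict against
    each of the other N-1 sets.
    """
    sets = list(range(1, n + 1))
    for trained_on in sets:
        for tested_on in sets:
            if tested_on != trained_on:
                yield trained_on, tested_on


if __name__ == "__main__":
    for train, test in cv_plan(3):
        print("c-v:  train on", train, "predict against", test)
    for train, test in grid_plan(3):
        print("grid: train on", train, "predict against", test)
```

With N = 10 the cross-validation plan has 10 runs and predicts against each
set once, while the grid plan has 90 runs and predicts against each set 9
times, which is why one hard message can count against you N-1 times.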


Test Utilities
==============
rates.py
    Scans the output (so far) produced by TestDriver.Drive(), and captures
    summary statistics.

cmp.py
    Given two summary files produced by rates.py, displays an account
    of all the f-p and f-n rates side-by-side, along with who won which
    (etc), the change in total # of unique false positives and negatives,
    and the change in average f-p and f-n rates.

table.py
    Summarizes the high-order bits from any number of summary files,
    in a compact table.

fpfn.py
    Given one or more TestDriver output files, prints list of false
    positive and false negative filenames, one per line.


Test Data Utilities
===================
cleanarch
    A script to repair mbox archives by finding "Unix From" lines that
    should have been escaped, and escaping them.

unheader.py
    A script to remove unwanted headers from an mbox file.  This is mostly
    useful to delete headers which incorrectly might bias the results.
    In default mode, this is similar to 'spamassassin -d', but much, much
    faster.

loosecksum.py
    A script to calculate a "loose" checksum for a message.  See the text of
    the script for an operational definition of "loose".

rebal.py
    Evens out the number of messages in "standard" test data folders (see
    below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).

mboxcount.py
    Count the number of messages (both parseable and unparseable) in
    mbox archives.

split.py
splitn.py
    Split an mbox into random pieces in various ways.  Tim recommends
    using "the standard" test data set up instead (see below).

splitndirs.py
    Like splitn.py (above), but splits an mbox into one message per file in
    "the standard" directory structure (see below).  This does an
    approximate split; rebal.py (above) can be used afterwards to even out
    the number of messages per folder.

runtest.sh
    A Bourne shell script (for Unix) which will run some test or other.
    I (Neale) will try to keep this updated to test whatever Tim is
    currently asking for.  The idea is, if you have a standard directory
    structure (below), you can run this thing, go have some tea while it
    works, then paste the output to the SpamBayes list for good karma.


Standard Test Data Setup
========================
Barry gave Tim mboxes, but the spam corpus he got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
per file is enormously easier to work with when testing:  you want to split
these at random into random collections, you may need to replace some at
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
per file (and the test drivers make it easy to print a msg's file path).

The directory structure under my spambayes directory looks like so:

Data/
    Spam/
        Set1/ (contains 1375 spam .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
        Set6/            ""
        Set7/            ""
        Set8/            ""
        Set9/            ""
        Set10/           ""
        reservoir/ (contains "backup spam")
    Ham/
        Set1/ (contains 2000 ham .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
        Set6/            ""
        Set7/            ""
        Set8/            ""
        Set9/            ""
        Set10/           ""
        reservoir/ (contains "backup ham")

Every file at the deepest level is used (not just files with .txt
extensions).  The files don't need to have a "Unix From"
header before the RFC-822 message (i.e. a line of the form "From
<address> <date>").
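
For anyone writing their own tools against this layout, a minimal sketch of
walking it might look like the following.  The function names are made up
for the example (the real test drivers use msgs.py), and it relies on
pathlib, so it needs a much newer Python than the 2.2 this project targets.

```python
from pathlib import Path


def set_files(data_dir, kind, set_number):
    """Return every regular file in <data_dir>/<kind>/Set<set_number>/.

    kind is "Ham" or "Spam".  All files count, not only *.txt ones.
    """
    set_dir = Path(data_dir) / kind / ("Set%d" % set_number)
    return sorted(p for p in set_dir.iterdir() if p.is_file())


def count_messages(data_dir, n):
    """Map (kind, set_number) to the number of messages in that Set."""
    return {
        (kind, i): len(set_files(data_dir, kind, i))
        for kind in ("Ham", "Spam")
        for i in range(1, n + 1)
    }
```

For the tree shown above, count_messages("Data", 10) should report 2000 for
every Ham Set and 1375 for every Spam Set.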

If you use the same names and structure, huge mounds of the tedious testing
code will work as-is.  The more Set directories the merrier, although you
want at least a few hundred messages in each one.  The "reservoir"
directories contain a few thousand other random hams and spams.  When a ham
is found that's really spam, move it into a spam directory, then use the
rebal.py utility to rebalance the Set directories moving random message(s)
into and/or out of the reservoir directories.  The reverse works as well
(finding ham in your spam directories).
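
The rebalancing idea can be sketched on plain lists of message names.  This
is only an illustration of moving random messages into and out of a
reservoir; it is not rebal.py's code, it does not touch any files, and the
function and argument names are assumptions made for the example.

```python
import random


def rebalance(sets, reservoir, per_set, rng=random):
    """Bring every Set to per_set messages, using the reservoir as a buffer.

    sets maps a Set name to a list of message names and reservoir is a
    list of spare message names; both are modified in place.
    """
    # First pass: move randomly chosen surplus messages into the reservoir.
    for names in sets.values():
        while len(names) > per_set:
            reservoir.append(names.pop(rng.randrange(len(names))))
    # Second pass: top up short Sets with random reservoir messages.
    for names in sets.values():
        while len(names) < per_set and reservoir:
            names.append(reservoir.pop(rng.randrange(len(reservoir))))
```

Trimming every Set before topping any up means a surplus in one Set can
cover a shortfall in another even when the reservoir starts out empty.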

The hams are 20,000 msgs selected at random from a python-list archive.
The spams are essentially all of Bruce Guenter's 2002 spam archive:

    <http://www.em.ca/~bruceg/spam/>

The sets are grouped into pairs in the obvious way:  Spam/Set1 with
Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other pairs.  In effect,
it's a NxN test grid, skipping the diagonal.  There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.

Later, support for N-fold cross validation testing was added, which allows
more accurate measurement of error rates with smaller amounts of training
data.  That's recommended now.  timcv.py is to cross-validation testing
as the older timtest.py is to grid testing.  timcv.py has grown additional
arguments to allow using only a random subset of messages in each Set.

CAUTION:  The partitioning of your corpora across directories should
be random.  If it isn't, bias creeps in to the test results.  This is
usually screamingly obvious under the NxN grid method (rates vary by a
factor of 10 or more across training sets, and even within runs against
a single training set), but harder to spot using N-fold c-v.
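
One simple way to get a random partition is to shuffle the whole corpus and
then deal it out round-robin, as in the sketch below.  This is a generic
illustration, not what splitndirs.py does internally, and the function name
is invented for the example.

```python
import random


def random_partition(messages, n, seed=None):
    """Shuffle messages and deal them into n sets of near-equal size."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    # Dealing round-robin from a shuffled list gives every message the
    # same chance of landing in any set, whatever order it arrived in.
    return [shuffled[i::n] for i in range(n)]
```

Splitting an mbox in its original (usually chronological) order instead
would put the oldest mail in Set1 and the newest in the last Set, which is
exactly the kind of non-random partitioning the caution is about.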

Testing a change and posting the results
========================================

(Adapted from clues Tim posted on the spambayes and spambayes-dev lists)

Firstly, set up your data as above; it's really not worth the hassle to
come up with a different scheme.  If you use the Outlook plug-in, the
export.py script in the Outlook2000 directory will export all the spam
and ham in your 'training' folders for you into this format (or close
enough).

Basically the idea is that you should have 10 sets of data, each with
200 to 500 messages in them.  Obviously if you're testing something to
do with the size of a corpus, you'll want to change that.  You then want
to run
    timcv.py -n 10 > std.txt
(call std.txt whatever you like), and then
    rates.py std.txt
You end up with two files, std.txt, which has the raw results, and stds.txt,
which has more of a summary of the results.

Now make the change to the code or options, and repeat the process,
giving the files different names (note that rates.py will automatically
choose the name for the output file, based on the input one).

You've now got the data you need, but you have to interpret it.  The
simplest way of all is just to post it to spambayes-dev@python.org and let
someone else do it for you <wink>.  The data you should post is the output of
    cmp.py stds.txt alts.txt
along with the output of
    table.py stds.txt alts.txt
(note that these just print to stdout).
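
If you script this, the only fiddly part is keeping the file names straight.
The sketch below assumes the naming rule implied by the example above
(std.txt becomes stds.txt, i.e. an "s" goes in before the extension) and
assumes the changed run was saved as alt.txt; check rates.py itself before
relying on either.  The helper names are made up, and the function only
builds command lines, it doesn't run anything.

```python
import os


def summary_name(raw_name):
    """Guess the summary file rates.py writes for a raw results file.

    Assumes the rule implied by the example: std.txt -> stds.txt.
    """
    base, ext = os.path.splitext(raw_name)
    return base + "s" + ext


def comparison_commands(before_raw, after_raw):
    """Return the command lines to run once both raw files exist."""
    before, after = summary_name(before_raw), summary_name(after_raw)
    return [
        "rates.py %s" % before_raw,
        "rates.py %s" % after_raw,
        "cmp.py %s %s" % (before, after),
        "table.py %s %s" % (before, after),
    ]


if __name__ == "__main__":
    for command in comparison_commands("std.txt", "alt.txt"):
        print(command)
```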

Other information you can find in the 'raw' output (std.txt, above) are
histograms of the ham/spam spread, and a copy of the options settings.

Interpreting cmp.py output
--------------------------

(Using an example from Tim on spambayes-dev)

> cv_octs.txt -> cv_oct_subjs.txt
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams 
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams 
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams 
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams 
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams 
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams 
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams 
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams 
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams 
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
>
> false positive percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.219  0.219  tied
>
> won   0 times
> tied  5 times
> lost  0 times

So all 5 runs tied on FP.  That tells us much more than that the *net*
effect across 5 runs was nil on FP:  it tells us that there are no
glitches hiding behind that "net nothing" -- it was no change across the board.

> total unique fp went from 1 to 1 tied
> mean fp % went from 0.0437636761488 to 0.0437636761488 tied
>
> false negative percentages
>     2.007  2.007  tied
>     1.390  1.390  tied
>     1.622  1.622  tied
>     2.029  1.917  won     -5.52%
>     2.703  2.477  won     -8.36%
>
> won   2 times
> tied  3 times
> lost  0 times

When evaluating a small change, I'm heartened to see that in no run did it lose.
At worst it tied, and twice it helped a little.  That's encouraging.

What the histograms would tell us that we can't tell from this is whether you
could have done just as well without the change by raising your ham cutoff a little.
That would also tie on FP, and *may* also get rid of the same number (or even
more) of FN.
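
The won/tied/lost labels and the percentages in the false negative table
above are just relative changes in the error rate, where lower is better.
Here is a small sketch that reproduces them from the printed numbers.  It is
written from the output shown, not from cmp.py's source, and the function
names are invented; what cmp.py prints when the "before" rate is zero isn't
shown above, so this version simply leaves the percentage off.

```python
def compare_rates(before, after):
    """Label one before/after pair of error rates the way the table does.

    Lower error rates are better, so a decrease counts as "won".
    """
    if after == before:
        return "tied"
    verdict = "won" if after < before else "lost"
    if before == 0:
        # No meaningful relative change from a zero baseline.
        return verdict
    change = 100.0 * (after - before) / before
    return "%s %+.2f%%" % (verdict, change)


def tally(pairs):
    """Count how many runs were won, tied and lost."""
    counts = {"won": 0, "tied": 0, "lost": 0}
    for before, after in pairs:
        counts[compare_rates(before, after).split()[0]] += 1
    return counts


if __name__ == "__main__":
    fn_rates = [(2.007, 2.007), (1.390, 1.390), (1.622, 1.622),
                (2.029, 1.917), (2.703, 2.477)]
    for before, after in fn_rates:
        print(before, after, compare_rates(before, after))
    print(tally(fn_rates))
```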

> total unique fn went from 86 to 83 won     -3.49%
> mean fn % went from 1.95029003772 to 1.88269707836 won     -3.47%
>
> ham mean                     ham sdev
>    0.57    0.58   +1.75%        4.63    4.77   +3.02%
>    0.08    0.07  -12.50%        1.20    1.01  -15.83%
>    0.36    0.29  -19.44%        3.61    3.23  -10.53%
>    0.08    0.11  +37.50%        0.89    1.18  +32.58%
>    0.72    0.76   +5.56%        6.80    7.06   +3.82%
>
> ham mean and sdev for all runs
>    0.37    0.37   +0.00%        4.10    4.16   +1.46%

That's a good example of grand averages hiding the truth:  the averaged change
in the mean ham score was 0 across all 5 runs, but *within* the 5 runs it slobbered
around wildly, from decreasing 20% to increasing 40%(!).
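
To see the spread for yourself, compute the per-run changes from the ham
mean columns above rather than looking only at the summary line.  The sketch
below does that with the five printed pairs.  Note that the unweighted
average of the per-run means it prints is only an approximation of the "all
runs" figure, which is presumably computed over all messages at once.

```python
ham_mean_before = [0.57, 0.08, 0.36, 0.08, 0.72]
ham_mean_after = [0.58, 0.07, 0.29, 0.11, 0.76]


def percent_changes(before, after):
    """Per-run relative change, in percent, for paired lists of means."""
    return [100.0 * (a - b) / b for b, a in zip(before, after)]


def average(values):
    """Plain unweighted average."""
    return sum(values) / len(values)


if __name__ == "__main__":
    changes = percent_changes(ham_mean_before, ham_mean_after)
    print("per-run changes: " + ", ".join("%+.2f%%" % c for c in changes))
    print("smallest %+.2f%%, largest %+.2f%%" % (min(changes), max(changes)))
    print("average before %.3f, after %.3f"
          % (average(ham_mean_before), average(ham_mean_after)))
```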

> spam mean                    spam sdev
>   96.43   96.44   +0.01%       15.89   15.89   +0.00%
>   97.01   97.07   +0.06%       13.79   13.70   -0.65%
>   97.14   97.16   +0.02%       14.05   14.02   -0.21%
>   96.52   96.56   +0.04%       15.65   15.52   -0.83%
>   95.53   95.63   +0.10%       17.47   17.31   -0.92%
>
> spam mean and sdev for all runs
>   96.52   96.57   +0.05%       15.46   15.37   -0.58%

That's good to see:  it's a consistent win for spam scores across runs,
although an almost imperceptible one.  It's good when the mean spam score rises,
and it's good when sdev (for ham or spam) decreases.

> ham/spam mean difference: 96.15 96.20 +0.05

This is a slight win for the change, although seeing the details gives cause
to worry some about the effect on ham:  the ham sdev increased overall, and
the effects on ham mean and ham sdev varied wildly across runs.  OTOH, the
"before" numbers for ham mean and ham sdev varied wildly across runs already.
That gives cause to worry some about the data <wink>.


Making a source release
=======================

Source releases are built with distutils.  Here's how I (Richie) have been
building them.  I do this on a Windows box, partly so that the zip release
can have Windows line endings without needing to run a conversion script.
I don't think that's actually necessary, because everything would work on
Windows even with Unix line endings, but you couldn't load the files into
Notepad and sometimes it's convenient to do so.  End users might not even
have any other text editor, so Unix line endings would make things like the
README unREADable.
8-)

Anthony would rather eat live worms than try to get a sane environment
on Windows, so his approach to building the zip file is at the end.

 o If any new file types have been added since last time (eg. 1.0a5 went
   out without the Windows .rc and .h files) then add them to MANIFEST.in.
   If there are any new scripts or packages, add them to setup.py.  Test
   these changes (by building source packages according to the instructions
   below) then commit your edits.
 o Check out the 'spambayes' module twice, once with Windows line endings
   and once with Unix line endings (I use WinCVS for this, using "Admin /
   Preferences / Globals / Checkout text files with the Unix LF".  If you
   use TortoiseCVS, like Tony, then the option is on the Options tab in
   the checkout dialog).
 o Change spambayes/__init__.py to contain the new version number but don't
   commit it yet, just in case something goes wrong.
 o In the Windows checkout, run "python setup.py sdist --formats zip"
 o In the Unix checkout, run "python setup.py sdist --formats gztar"
 o Take the resulting spambayes-1.0a5.zip and spambayes-1.0a5.tar.gz, and
   test the former on Windows (ideally in a freshly-installed Python
   environment; I keep a VMWare snapshot of a clean Windows installation
   for this, but that's probably overkill 8-) and test the latter on Unix
   (a Debian VMWare box in my case).
 o If you can, rename these with "rc" at the end, and make them available
   to the spambayes-dev crowd as release candidates.  If all is OK, then
   fix the names (or redo this) and keep going.
 o Dance the SourceForge release dance:
   http://sourceforge.net/docman/display_doc.php?docid=6445&group_id=1#filereleasesteps
   When it comes to the "what's new" and the ChangeLog, I cut'n'paste the
   relevant pieces of WHAT_IS_NEW.txt and CHANGELOG.txt into the form, and
   check the "Keep my preformatted text" checkbox.
 o Now commit spambayes/__init__.py and tag the whole checkout - see the
   existing tag names for the tag name format.
 o Update the website News, Download, Windows and Application sections.
 o Update reply.txt in the website repository as needed (it specifies the
   latest version).  Then let Tim, Barry, Tony, or Skip know that they need to
   update the autoresponder.

Then announce the release on the mailing lists and watch the bug reports
roll in.  8-)

Anthony's Alternate Approach to Building the Zipfile
----------------------------------------------------

 o Unpack the tarball somewhere, making a spambayes-1.0a7 directory
   (version number will obviously change in future releases)
 o Run the following two commands:

     find spambayes-1.0a7 -type f -name '*.txt' | xargs zip -l sb107.zip 
     find spambayes-1.0a7 -type f \! -name '*.txt' | xargs zip sb107.zip 

 o This makes a zip file in which the .txt files have their line endings
   converted to Windows style (zip's -l option translates LF to CR LF),
   but everything else is left alone.
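
If find, xargs and zip aren't to hand, roughly the same result can be had
from Python's zipfile module.  This is a sketch of an equivalent, not part
of the release procedure anyone actually uses:  the function name is made
up, it needs a newer Python than 2.2, and it doesn't preserve file
permissions or timestamps.

```python
import os
import zipfile


def build_windows_zip(source_dir, zip_path):
    """Zip source_dir, giving *.txt files CR LF line endings.

    Every other file is stored byte-for-byte.
    """
    parent = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in sorted(files):
                path = os.path.join(folder, name)
                with open(path, "rb") as handle:
                    data = handle.read()
                if name.endswith(".txt"):
                    # Normalise first so existing CR LF is not doubled.
                    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
                # Archive names are relative to the parent directory, so
                # they start with e.g. "spambayes-1.0a7/".
                arcname = os.path.relpath(os.path.abspath(path), parent)
                archive.writestr(arcname.replace(os.sep, "/"), data)
```

Write the zip file somewhere outside the directory being archived, e.g.
build_windows_zip("spambayes-1.0a7", "sb107.zip") from the parent directory.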

Making a binary release
=======================

The binary release includes both sb_server and the Outlook plug-in and
is an installer for Windows (98 and above) systems.  In order to have
COM typelibs that work with Outlook 2000, 2002 and 2003, you need to
build the installer on a system that has Outlook 2000 (not a more recent
version).  You also need to have InnoSetup, resourcepackage and py2exe
installed.

 o Get hold of a fresh copy of the source (Windows line endings,
   presumably).
 o Run sb_server and open the web interface.  This gets resourcepackage
   to generate the needed files.
 o Replace the __init__.py file in spambayes/spambayes/resources with
   a blank file to disable resourcepackage.
 o Ensure that the version numbers in spambayes/spambayes/__init__.py
   and spambayes/spambayes/Version.py are up-to-date.
 o Ensure that you don't have any other copies of spambayes in your
   PYTHONPATH, or py2exe will pick these up!  If in doubt, run
   setup.py install.
 o Run the "setup_all.py" script in the spambayes/windows/py2exe/
   directory. This uses py2exe to create the files that Inno will install.
 o Open (in InnoSetup) the spambayes.iss file in the spambayes/windows/
   directory.  Change the version number in the AppVerName and
   OutputBaseFilename lines to the new number.
 o Compile the spambayes.iss script to get the executable.
 o You can now follow the steps in the source release description above,
   from the testing step.
