NAME
    sitescooper - download news from web sites and convert it
    automatically into one of several formats suitable for viewing
    on a Palm handheld.

SYNOPSIS
    sitescooper [options] [ [-site sitename] ...]

    sitescooper [options] [-sites sitename ...]

    sitescooper [options] [-name name] [-levels n] [-storyurl regexp]
    [-set sitefileparam value] url [...]

    Options: [-debug] [-refresh] [-config file] [-install dir]
    [-instapp app] [-dump] [-dumpprc] [-nowrite] [-nodates]
    [-quiet] [-admin cmd] [-nolinkrewrite] [-stdout-to file]
    [-keep-tmps] [-noheaders] [-nofooters] [-fromcache]
    [-filename template] [-prctitle template] [-parallel] [-disc]
    [-limit numkbytes] [-maxlinks numlinks]
    [-maxstories numstories] [-text | -html | -mhtml | -doc |
    -isilo | -misilo | -richreader | -pipe fmt command]

DESCRIPTION
    This script, in conjunction with its configuration file and its
    set of site files, will download news stories from several top
    news sites into text format and/or onto your Palm handheld (with
    the aid of the makedoc/MakeDocW or iSilo utilities).

    Alternatively URLs can be supplied on the command line, in which
    case those URLs will be downloaded and converted using a
    reasonable set of default settings.

    Both HTTP URLs and local files (using the `file:///' protocol)
    are supported.

    Multiple types of sites are supported:

        1-level sites, where the text to be converted is all present
        on one page (such as Slashdot, Linux Weekly News, BluesNews,
        NTKnow, Ars Technica);

        2-level sites, where the text to be converted is linked to
        from a Table of Contents page (such as Wired News, BBC News,
        and I, Cringely);

        3-level sites, where the text to be converted is linked to
        from a Table of Contents page, which in turn is linked to
        from a list-of-issues page (such as PalmPower).

    In addition, sites that post news as items on one big page,
    such as Slashdot, Ars Technica, and BluesNews, are supported
    using diff.
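
    To illustrate the idea of diff-based scooping, here is a
    minimal sketch in Python's difflib (the page contents are
    hypothetical, and sitescooper itself does this in Perl,
    optionally via Algorithm::Diff): only lines added since the
    previous snapshot are treated as new items.

```python
import difflib

# Previous and current snapshots of a one-big-page news site
# (hypothetical content, for illustration only).
old_page = [
    "Story A: kernel 2.2 released",
    "Story B: new Palm model announced",
]
new_page = [
    "Story C: browser wars continue",
    "Story A: kernel 2.2 released",
    "Story B: new Palm model announced",
]

# Keep only the lines that were added since the last run.
delta = difflib.ndiff(old_page, new_page)
new_items = [line[2:] for line in delta if line.startswith("+ ")]
print(new_items)
```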

    Note that the URLs-on-the-command-line invocation format does
    not currently support 2- or 3-level sites.

    The script is portable to most UNIX variants that support perl,
    as well as the Win32 platform (tested with ActivePerl 5.00502
    build 509).

    Currently the configuration is stored as a string inside the
    script itself, but an alternative configuration file can be
    specified with the -config switch.

    The sites downloaded will be the ones listed in the site files
    you keep in your sites directory.

    sitescooper maintains a cache in its temporary directory; files
    are kept in this cache for a week at most. Ditto for the text
    output directory (set with TextSaveDir in the built-in
    configuration).

    If a password is required for the site, and the current
    sitescooper session is interactive, the user will be prompted
    for the username and password. This authentication token will be
    saved for later use. This way a site that requires login can be
    set up as a .site -- just log in once, and your password is
    saved for future non-interactive runs.

    Note, however, that the encryption used to hide the password in
    the sitescooper configuration is pretty transparent; I
    recommend that, rather than using your own username and
    password to log in to passworded sites, a dedicated sitescooper
    account be used instead.

OPTIONS
    -refresh
        Refresh all links -- ignore the already_seen file, do not
        diff pages, and always fetch links, even if they are
        available in the cache.

    -config file
        Read the configuration from file instead of using the
        built-in one.

    -limit numkbytes
        Set the limit for output file size to numkbytes kilobytes,
        instead of the default 200K.

    -maxlinks numlinks
        Stop retrieving web pages after numlinks have been
        traversed. This is not used to specify how "deep" a site
        should be scooped -- it is the number of links followed in
        total.

    -maxstories numstories
        Stop retrieving web pages after numstories stories have been
        retrieved.

    -install dir
        The directory to save PRC files to once they've been
        converted, in order to have them installed to your Palm
        handheld.

    -instapp app
        The application to run to install PRC files onto your Palm,
        once they've been converted.

    -site sitename
        Limit the run to the site named in the sitename argument.
        Normally all available sites will be downloaded. To limit
        the run to 2 or more sites, provide multiple -site arguments
        like so:

                -site ntk.site -site tbtf.site

    -sites sitename [...]
        Limit the run to multiple sites; an easier way to specify
        multiple sites than using the -site argument for each file.

    -name name
        When specifying a URL on the command-line, this provides the
        name that should be used when installing the site to the
        Pilot. It acts exactly the same way as the Name: field in a
        site file.

    -levels n
        When specifying a URL on the command-line, this indicates
        how many levels a site has. Not needed when using .site
        files.

    -storyurl regexp
        When specifying a URL on the command-line, this indicates
        the regular expression which links to stories should conform
        to. Not needed when using .site files.
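
        To show how a -storyurl pattern filters candidate links,
        here is a minimal sketch in Python (the URLs and the
        pattern are hypothetical; sitescooper applies the regular
        expression in Perl): only links matching the pattern are
        treated as stories.

```python
import re

# Hypothetical pattern, as might be passed via -storyurl: only
# links under /stories/ with a numeric ID count as stories.
story_pat = re.compile(r"^http://www\.example\.com/stories/\d+\.html$")

links = [
    "http://www.example.com/stories/1234.html",
    "http://www.example.com/about.html",
    "http://www.example.com/stories/5678.html",
]

stories = [url for url in links if story_pat.match(url)]
print(stories)
```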

    -doc
        Convert the page(s) downloaded into DOC format, with all the
        articles listed in full, one after the other.

    -text
        Convert the page(s) downloaded into plain text format, with
        all the articles listed in full, one after the other.

    -html
        Convert the page(s) downloaded into HTML format, on one big
        page, with a table of contents (taken from the site if
        possible), followed by all the articles one after another.

    -mhtml
        Convert the page(s) downloaded into HTML format, but retain
        the multiple-page format. This will create the output in a
        directory called site_name; in conjunction with the -dump
        argument, it will output the path of this directory on
        standard output before exiting.

    -isilo
        Convert the page(s) downloaded into iSilo format (see
        http://www.isilo.com/ ), on one big page. This is the
        default. The page(s) will be displayed with a table of
        contents (taken from the site if possible), followed by all
        the articles one after another.

    -misilo
        Convert the page(s) downloaded into iSilo format (see
        http://www.isilo.com/ ), with one iSilo document per site.
        The iSilo document will have a table-of-contents page,
        taken from the site if possible, with each article on a
        separate page.

    -richreader
        Convert the page(s) downloaded into RichReader format using
        HTML2Doc.exe (see
        http://users.erols.com/arenakm/palm/RichReader.html ). The
        page(s) will be displayed with a table of contents (taken
        from the site if possible), followed by all the articles one
        after another.

    -pipe fmt command
        Convert the page(s) downloaded into an arbitrary format,
        using the command provided. Sitescooper will still rewrite
        the page(s) according to the fmt argument, which should be
        one of:

    text    Plain text format.

    html    HTML in one big page.

    mhtml   HTML in multiple pages.

        The command argument can contain `__SCOOPFILE__', which will
        be replaced with the filename of the file containing the
        rewritten pages in the above format, `__SYNCFILE__', which
        will be replaced with a suitable filename in the Palm
        synchronization folder, and `__TITLE__', which will be
        replaced by the title of the file (generally a string
        containing the date and site name).

        Note that for the -mhtml switch, `__SCOOPFILE__' will be
        replaced with the name of the file containing the
        table-of-contents page. It's up to the conversion utility
        to follow the href links to the other files in that
        directory.
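
        The keyword substitution described above can be sketched
        as follows, in Python (the command, paths, and title are
        hypothetical examples; the real substitution happens
        inside sitescooper's Perl code):

```python
# Hypothetical -pipe command line using the placeholders the
# manual describes.
command = 'htmlconv __SCOOPFILE__ -o __SYNCFILE__ -t "__TITLE__"'

# Example values for the three placeholders (illustrative only).
substitutions = {
    "__SCOOPFILE__": "/tmp/sitescooper/1999_01_01_Foo_Bar.html",
    "__SYNCFILE__": "/home/jm/pilot/install/Foo_Bar.pdb",
    "__TITLE__": "1999-Jan-01: Foo Bar",
}

for keyword, value in substitutions.items():
    command = command.replace(keyword, value)
print(command)
```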

    -dump
        Output the page(s) downloaded directly to stdout in text or
        HTML format, instead of writing them to files and converting
        each one. This option implies -text; to dump HTML, use -dump
        -html.

    -dumpprc
        Output the page(s) downloaded directly to stdout, in
        converted format as a PRC file, suitable for installation to
        a Palm handheld.

    -nowrite
        Test mode -- do not write to the cache or already_seen file,
        instead write what would be written normally to a directory
        called new_cache and a new_already_seen file. This is very
        handy when writing a new site file.

    -debug
        Enable debugging output. This output is in addition to the
        usual progress messages.

    -quiet
        Process sites quietly, without printing the usual progress
        messages to STDERR. Warnings about incorrect site files and
        system errors will still be output, however.

    -admin cmd
        Perform an administrative command. This is intended to ease
        the task of writing scripts which use sitescooper output.
        The following admin commands are available:

    dump-sites
            List the sites which would be scooped on a scooping run,
            and their URLs. Instead of scooping any sites,
            sitescooper will exit after performing this task. The
            format is one site per line, with the site file name
            first, a tab, the site's URL, a tab, the site name, a
            tab, and the output filename that would be generated
            without path or extension. For example:

            foobar.site http://www.foobar.com/ Foo Bar
            1999_01_01_Foo_Bar
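
            A script consuming this output can split each line on
            tabs; a minimal sketch in Python (the sample line is
            the example above, written with explicit tab
            characters):

```python
# One line of `-admin dump-sites' output: four tab-separated
# fields, per the format described above.
line = ("foobar.site\thttp://www.foobar.com/\tFoo Bar\t"
        "1999_01_01_Foo_Bar\n")

site_file, url, site_name, out_base = line.rstrip("\n").split("\t")
print(site_file, url, site_name, out_base)
```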

    journal Write a journal with dumps of the documents as they pass
            through the formatting and stripping steps of the
            scooping process. This is written to a file called
            journal in the sitescooper temporary directory.

    import-cookies file
            Import a Netscape cookies file into sitescooper, so
            that sites which require cookies can use them. For
            example, the site economist_full.site requires this.
            Here's how to import cookies on a UNIX machine:

            sitescooper.pl -admin import-cookies ~/.netscape/cookies

            and on Windows:

            perl sitescooper.pl -admin import-cookies "C:\Program
            Files\Netscape\Users\Default\cookies.txt"

            Unfortunately, MS Internet Explorer cookies are
            currently unsupported. If you wish to write a patch to
            support them, that'd be great.

    -nolinkrewrite
        Do not rewrite links on scooped documents -- leave them
        exactly as they are.

    -noheaders
        Do not attach the sitescooper header (URL, site name, and
        navigation links) to each page.

    -nofooters
        Do not attach the sitescooper footer ("copyright retained by
        original authors" blurb) to each page.

    -fromcache
        Do not perform any network access; retrieve everything from
        the cache or the shared cache.

    -filename template
        Change the format of output filenames. template contains the
        following keyword strings, which are substituted as follows:

    YYYY    The current year, in 4-digit format.

    MM      The current month number (from 01 to 12), in 2-digit format.

    Mon     The current month name (from Jan to Dec), in 3-letter
            format.

    DD      The current day of the month (from 01 to 31), in 2-digit
            format.

    Day     The current day of the week (from Sun to Sat), in 3-letter
            format.

    hh      The current hour (from 00 to 23), in 2-digit format.

    mm      The current minute (from 00 to 59), in 2-digit format.

    Site    The current site's name.

    Section The section of the current site (now obsolete).

        The default filename template is YYYY_MM_DD_Site.
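
        The substitution can be sketched as follows, in Python
        (the real engine is Perl, "Foo Bar" is a hypothetical site
        name, and the assumption that spaces in the site name
        become underscores follows the example filename
        1999_01_01_Foo_Bar shown under -admin dump-sites):

```python
import time

# Sketch of the -filename / -prctitle keyword substitution.
def expand_template(template, site, now=None):
    t = time.localtime(now)  # None means "current time"
    subs = {
        "YYYY": time.strftime("%Y", t),
        "Mon": time.strftime("%b", t),
        "MM": time.strftime("%m", t),
        "DD": time.strftime("%d", t),
        "Day": time.strftime("%a", t),
        "hh": time.strftime("%H", t),
        "mm": time.strftime("%M", t),
        # Assumption: spaces become underscores in filenames.
        "Site": site.replace(" ", "_"),
    }
    for keyword, value in subs.items():
        template = template.replace(keyword, value)
    return template

print(expand_template("YYYY_MM_DD_Site", "Foo Bar"))
```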

    -prctitle template
        Change the format of the titles of the resulting PRC files.
        template may contain the same keyword strings as -filename.

        The default PRC title template is YYYY-Mon-DD: Site.

    -nodates
        Do not put the date in the installable file's filename.
        This allows you to automatically overwrite old files with
        new ones when you HotSync. It's a compatibility shortcut
        for -filename Site -prctitle "Site".

    -parallel
        Use the LWP::Parallel perl module, if available, to preload
        certain pages before the single-threaded sitescooper engine
        gets to them. This can speed up the scooping of several
        sites at once, but is more prone to crashes as the
        LWP::Parallel code is not as resilient as the traditional
        LWP code. This is off by default.

    -disc
        Disconnect a PPP connection once the scooping has finished.
        Currently this code is experimental, and will probably only
        work on Macintoshes. This is off by default.

    -stdout-to file
        Redirect the output of sitescooper into the named file. This
        is needed on Windows NT and 95, where certain combinations
        of perl and Windows do not seem to support the `>'
        operator.

    -keep-tmps
        Keep temporary files after conversion. Normally the .txt or
        .html rendition of a site is deleted after conversion; this
        option keeps it around.

INSTALLATION
    To install, edit the script and change the #! line. You may also
    need to (a) change the Pilot install dir if you plan to use the
    pilot installation functionality, and (b) edit the other
    parameters marked with CUSTOMISE in case they need to be
    customised for your site. They should be set to acceptable
    defaults (unless I forgot to comment out the proxy server lines
    I use ;).

EXAMPLES
            sitescooper.pl http://www.ntk.net/

    To snarf the ever-cutting NTKnow newsletter.

            sitescooper.pl -refresh -html http://www.ntk.net/

    To snarf NTKnow, ignoring any previously-read text, and
    producing HTML output.

ENVIRONMENT
    sitescooper makes use of the `$http_proxy' environment variable,
    if it is set.

AUTHOR
    Justin Mason <jm /at/ jmason.org>

COPYRIGHT
    Copyright (C) 1999-2000 Justin Mason

    This program is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License as
    published by the Free Software Foundation; either version 2 of
    the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
    General Public License for more details.

    You should have received a copy of the GNU General Public
    License along with this program; if not, write to the Free
    Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
    MA 02111-1307, USA, or read it on the web at
    http://www.gnu.org/copyleft/gpl.html .

SCRIPT CATEGORIES
    The CPAN script category for this script is `Web'. See
    http://www.cpan.org/scripts/ .

PREREQUISITES
    `File::Find' `File::Copy' `File::Path' `FindBin' `Carp' `Cwd'
    `URI::URL' `LWP::UserAgent' `HTTP::Request::Common' `HTTP::Date'
    `HTML::Entities'

    All these can be picked up from CPAN at http://www.cpan.org/ .
    Note that `HTML::Entities' is actually included in one of the
    previous packages, so you do not need to install it separately.

COREQUISITES
    `LWP::Parallel' will be used if available. `Win32::TieRegistry',
    if running on a Win32 platform, to find the Pilot Desktop
    software's installation directory. `Algorithm::Diff' to support
    diffing sites without running an external diff application (this
    is required on Mac systems).

README
    Sitescooper downloads news stories from the web and converts
    them to Palm handheld iSilo, DOC or text format for later
    reading on-the-move. Site files and full documentation can be
    found at http://sitescooper.cx/ .

