***********************************************************************
*                            STePP Tagger                             *
*        a Simple Trainable Probabilistc Part-of-speech tagger        *
***********************************************************************

Overview
--------

  This is a general-purpose part-of-speech tagger based on log-linear 
  probabilistic models [1]. 
  The main features of this tagger are:
     - state-of-the-art accuracy (97.3% on the WSJ corpus)
     - fast tagging (<1ms/sentence in the "fast" mode)
     - can output tag probabilities
     - trainable using your own tagged corpus
     - can build compact models using L1-regularization


How to Build
------------

  1.  Type "make".


How to Use
----------

  The default package contains a compact model which should work well for 
  ordinary English sentences (a slightly more accurate (+0.05%) but heavy 
  model is available on the download page). You can perform part-of-speech 
  tagging using this model with the following command:

    % ./stepp -m ./models_wsj02-21c < samples/test.txt > tmp

  Note that by default the input must be one-sentence-per-line, and the 
  words have to be tokenized with white spaces. 

  If you want the tagger to perform tokenization, try -t option. In this
  case, you might find --standoff option useful because it allows you to
  easily map the output with the original input.

  If necessary, you can get tag probabilities for each word,

    % ./stepp -p -m ./models_wsj02-21c < samples/test.txt > tmp

  The tagger has a fast-tagging mode, which is enabled by -f option. The 
  tagging accuracy of the fast mode is slightly lower than that of the 
  normal mode (about -0.1% on WSJ), but the tagging speed is significantly
  faster.

    % ./stepp -f -m ./models_wsj02-21c < samples/test.txt > tmp

  You can display help messages by -h option.

    % ./stepp -h


How to Train the Tagger
-----------------------

  You can build a tagging model (a collection of probabilistic models)
  using your own annotated corpus. Use the "stepp-learn" command:

    % ./stepp-learn -m ./models samples/train.pos 

  Once you train the model, you can use it by specifying the directory 
  that contains the model files generated.

    % ./stepp -m ./models < samples/test.txt > tmp


How to Evaluate the Tagger
--------------------------

  You can evaluate tagging accuracy with the "stepp-eval" command.

    % ./stepp-eval samples/test.pos tmp samples/train.pos


Tips
----

  o The memory and the time required for training vary depending on the 
    tagset and the size of the corpus. The training using sections 0-18
    of the WSJ corpus used 1.3GB memory and took 8 hours on an AMD Opteron 
    server.

  o You can build compact models by using L1 regularization (a kind of
    feature selection). Try -c option when building the models. Training 
    will take much longer, though.

  o Although the normal (CRF + ME) mode usually gives better accuracy than
    the fast (CRF-only) mode, there are some cases where the latter 
    performs better. It may be worth trying the CRF-only mode especially 
    when the size of the training data is small.


References
----------

  [1] Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou,
      Part-of-speech tagging with log-linear models: conditional random 
      fields and two-phase maximum entropy tagging, 
      (unpublished manuscript).

  [2] Steven J. Benson and Jorge J. More, A Limited-Memory Variable-Metric
      Method for Bound-Constrained Minimization, Preprint ANL/MCS-P909-0901
      http://www-unix.mcs.anl.gov/~benson/blmvm/


Change history
--------------

  o  7 June  2007 version 0.5
    - add n-best output

  o  14 May  2007 version 0.4
    - add standoff output

  o  22 Apr. 2007 version 0.3
    - add L1 regularization

  o  21 Apr. 2007 version 0.2
    - fixed learn.cpp so that it uses appropriate values for regularization

  o  19 Apr. 2007 version 0.1


--------------------------------------------------------
Yoshimasa Tsuruoka (yoshimasa.tsuruoka@manchester.ac.uk)
