.\" ====================================================================
.\"  @Troff-man-file{
.\"     author          = "Nelson H. F. Beebe",
.\"     version         = "1.06",
.\"     date            = "23 September 2004",
.\"     time            = "15:02:52 MDT",
.\"     filename        = "bibparse.man",
.\"     address         = "University of Utah
.\"                        Department of Mathematics, 110 LCB
.\"                        155 S 1400 E RM 233
.\"                        Salt Lake City, UT 84112-0090
.\"                        USA",
.\"     telephone       = "+1 801 581 5254",
.\"     FAX             = "+1 801 581 4148",
.\"     URL             = "http://www.math.utah.edu/~beebe",
.\"     checksum        = "15027 450 1781 15536",
.\"     email           = "beebe@math.utah.edu, beebe@acm.org,
.\"                        beebe@computer.org  (Internet)",
.\"     codetable       = "ISO/ASCII",
.\"     keywords        = "bibliography, BibTeX, lexical analysis",
.\"     supported       = "yes",
.\"     docstring       = "This file is the UNIX nroff/troff manual
.\"                        page documentation for bibparse, a tool for
.\"                        parsing the lexical analysis output of
.\"                        bibclean or biblex from BibTeX and Scribe
.\"                        bibliography data base files, or BibTeX and
.\"                        Scribe files directly, to verify that they
.\"                        conform to a proposed grammar for BibTeX.
.\"
.\"                        The checksum field above contains a CRC-16
.\"                        checksum as the first value, followed by the
.\"                        equivalent of the standard UNIX wc (word
.\"                        count) utility output of lines, words, and
.\"                        characters.  This is produced by Robert
.\"                        Solovay's checksum utility.",
.\"  }
.\"=====================================================================
.\"
.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Bi BibTeX
.\"
.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Te TeX
.\"
.\"=====================================================================
.TH BIBPARSE 1 "23 September 2004" "Version 1.06"
.\"=====================================================================
.SH NAME
bibparse \- verify a bibclean or biblex lexical token stream, or BibTeX files
.\"=====================================================================
.SH SYNOPSIS
.B bibparse
[
.B \-d
]
.I "<infile"
.nf
or
.fi
.B bibparse
[
.B \-d
]
.I "file1 file2 file3 .\|.\|."
.\"=====================================================================
.SH DESCRIPTION
Compilation of a computer language is
traditionally divided into three steps:
.TP \w'\(bu'u+2n
\(bu
Lexical analysis is the grouping of consecutive
characters into units, called
.IR tokens ,
that are meaningful in a particular language.
.BR bibclean (1)
and
.BR biblex (1)
are two programs that do this job for \*(Bi\&
data.
.TP
\(bu
Parsing is the processing of the lexical analysis
token sequence to verify that tokens appear in an
order permitted by the language rules, called the
.IR grammar .
.B bibparse
does this for \*(Bi\& data.
.TP
\(bu
Semantic analysis, or code generation, is the
interpretation of a grammar-conformant token
stream to perform an intended task.  For example,
.BR bibtex (1)
transforms \*(Bi\& data according to rules in a
user-specified style file into formatted
bibliographic data suitable for a typesetting
system.
.IP
Although
.BR bibtex (1)
includes internal implementations of lexical
analysis and parsing, it does not make them
available to the user.
.PP
.B bibparse
takes a lexical token stream from
.BR bibclean (1)
or from
.BR biblex (1),
or \*(Bi\& files directly, and verifies their
conformance to a proposed grammar for \*(Bi,
published in the articles
.RS
Nelson H. F. Beebe,
.IR "Bibliography prettyprinting and syntax checking" ,
TUGboat (ISSN 0896-3207)
.BR 14 (3)
222, October 1993, and
TUGboat
.BR 14 (4)
395--419, December 1993.
.RE
The text of the latter is included with the
.BR bibclean (1)
distribution.
.PP
The only output normally produced by
.B bibparse
is on the standard error unit,
.IR stderr ,
and then only if grammatical errors are detected.
Silent execution means a successful parse.
.PP
The program exit code is zero on a successful parse,
and non-zero otherwise.
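.PP
A calling program can rely on that exit code
instead of inspecting the diagnostics.  The
following Python fragment is only a sketch: the
function name \fCparse_ok\fP and its
\fCcommand\fP parameter are invented for
illustration, and it assumes that
.B bibparse
is on the search path.
.nf
```python
import subprocess
import sys


def parse_ok(files, command=("bibparse",)):
    """Return True when the checker exits with status zero.

    `command` defaults to the bibparse program named in this manual
    page; it is a parameter (an assumption of this sketch) so that the
    function can be tried with any other program.
    """
    result = subprocess.run(
        [*command, *files],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # diagnostics, if any, arrive on stderr
        text=True,
    )
    if result.returncode != 0:
        # Pass the diagnostics through unchanged.
        sys.stderr.write(result.stderr)
    return result.returncode == 0
```
.fi
A call such as \fCparse_ok(["refs.bib"])\fP, where
\fCrefs.bib\fP is a hypothetical file name, then
returns true only for a successful parse.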
.PP
For example, you can check the syntax of a bibliography
collection with any of these three UNIX command lines:
.RS
.nf
\fCbibclean -no-prettyprint \fI*.bib\fP | bibparse\fP
\fCbiblex \fI*.bib\fP | bibparse\fP
\fCbibparse \fI*.bib\fP\fP
.fi
.RE
.PP
.B bibparse
distinguishes between lexical token streams and
\*(Bi\& files by examination of the
.I first
character of each input file: if it is a sharp
sign, `#', then it is assumed to be the start of a
line-number directive in a lexical token stream.
Otherwise, it is assumed to be a \*(Bi\& file.
.B bibparse
then selects one of two internal lexical
analyzers: a simple one that reads a lexical token
stream from a file, or the complex one from
.BR biblex (1)
linked into the
.B bibparse
executable.
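.PP
The first-character test described above can be
sketched in Python as follows.  This is an
illustration, not the code used by
.BR bibparse ;
the function name and the two strings that it
returns are invented labels.
.nf
```python
def select_analyzer(data):
    """Pick a lexical analyzer label from the first byte of the input.

    The two labels returned here are invented for this sketch.
    """
    if data[:1] == b"#":
        # A sharp sign starts a line-number directive, so the input is
        # taken to be a lexical token stream.
        return "token-stream reader"
    # Anything else, including empty input in this sketch, is treated
    # as a BibTeX file.
    return "built-in biblex analyzer"
```
.fi
Because only the first character is examined, this
sketch classifies input that has any space before
the sharp sign as a \*(Bi\& file.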
.\"=====================================================================
.SH OPTIONS
.TP
.B \-d
Write debug output to the standard output stream,
.IR stdout .
This output is extremely verbose: it includes a
record of each lexical token found, and how it is
parsed according to the \*(Bi\& grammar.
.IP
If you are puzzled by an error message from
.BR bibparse ,
extract the \*(Bi\& entry at the line number given
in the diagnostic, and possibly the entry
immediately preceding it, save them in a
temporary file, and run
.B "bibparse \-d"
on that small file, so as not to be overwhelmed by
the output.
.\"=====================================================================
.SH BIBTEX GRAMMAR
Here is a slightly reformatted listing of the
\*(Bi\& grammar, defined in detail in the articles
cited above, and taken directly from the
.B bibparse
source code.  A
.I "parser generator"
such as UNIX
.BR yacc (1)
or GNU
.BR bison (1)
transforms that grammar into a C-language program,
which can be compiled by either a C or a C++
compiler and then linked to produce the
.B bibparse
executable program.
.PP
The tokens, also called
.I terminals
in a grammar, that are recognized by
.BR bibclean (1)
and
.BR biblex (1)
are spelled in \fCUPPERCASE\fP letters.
.PP
Nonterminals, which are intermediate stages in the
grammar processing, are spelled in \fClowercase\fP
letters.  Each nonterminal referred to in the
grammar is eventually defined by a grammar rule,
which takes the form of a nonterminal, a colon,
and one or more alternative expansions separated
by vertical bars, and ends with a semicolon.
.PP
Interspersed in the rule expansions are braced
.I actions
which are to be invoked when the input token
stream matches that rule.  Here, they are simply
calls to a function \fCRECOGNIZE()\fP which, when
debug output is requested, prints its argument,
followed by a newline, and then returns silently.
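.PP
The behavior of such an action can be sketched in
Python as follows.  The real \fCRECOGNIZE()\fP is
part of the C-language
.B bibparse
source; the factory function
\fCmake_recognize\fP and its parameters are
assumptions made for this illustration.
.nf
```python
import sys


def make_recognize(debug, out=sys.stdout):
    """Return a RECOGNIZE-like function for this sketch."""

    def recognize(rule_name):
        if debug:
            # Print the argument, followed by a newline.
            print(rule_name, file=out)
        # Without debug output, return silently.

    return recognize
```
.fi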
.PP
Internally, the parser does not deal with
character strings at all: both terminals and
nonterminals are simply small integer values that
it manipulates on stacks using highly-efficient
pattern matching to determine whether they match
grammar rules.
.PP
The first three lines of the grammar below declare
the precedence and associativity of five tokens,
so as to disambiguate cases where two rules would
match the current token sequence.
.PP
The first rule, also called the
.IR "start symbol" ,
says that a \fCfile\fP is either optional space,
or an \fCobject_list\fP optionally preceded and
followed by space.  Thus, an empty file, or one
consisting only of space, is a valid \*(Bi\& file.
.PP
The remaining rules are read similarly.
.PP
Most programming language grammars omit
specification of rules for comments and spacing,
assuming merely that they are permitted anywhere
between tokens; this assumption simplifies the
grammar significantly.
.PP
However, grammars for prettyprinters need to
include rules for spacing because there may be
circumstances where such spacing is significant
for program layout and human readers.  Space
information is also required by unlexers, like
.BR bibunlex (1),
which take a possibly-modified lexical token
stream, and reconstruct a source program from it.
Thus, this grammar includes precise rules for
where spaces are permitted.
.nf
\fC\s-1%nonassoc EQUALS
%left SPACE INLINE NEWLINE
%left SHARP

%%
file:             opt_space                       {RECOGNIZE("file-1");}
                | opt_space object_list opt_space {RECOGNIZE("file-2");}
                ;

object_list:      object                          {RECOGNIZE("object-1");}
                | object_list opt_space object    {RECOGNIZE("object-2");}
                ;

object:           AT opt_space at_object          {RECOGNIZE("object");}
                ;

at_object:        comment                         {RECOGNIZE("comment");}
                | entry                           {RECOGNIZE("entry");}
                | include                         {RECOGNIZE("include");}
                | preamble                        {RECOGNIZE("preamble");}
                | string                          {RECOGNIZE("string");}
                | error RBRACE                    {RECOGNIZE("error");}
                ;

comment:          COMMENT opt_space LITERAL       {RECOGNIZE("comment");}
                ;

entry:            entry_head assignment_list
                        RBRACE                    {RECOGNIZE("entry-1");}
                | entry_head assignment_list
                        COMMA opt_space RBRACE    {RECOGNIZE("entry-2");}
                | entry_head RBRACE               {RECOGNIZE("entry-3");}
                ;

entry_head:       ENTRY opt_space
                        LBRACE opt_space
                        key_name opt_space
                        COMMA opt_space           {RECOGNIZE("entry_head");}
                ;

key_name:         KEY                             {RECOGNIZE("key_name-1");}
                | ABBREV                          {RECOGNIZE("key_name-2");}
                ;

include:          INCLUDE opt_space LITERAL       {RECOGNIZE("include");}
                ;

preamble:         PREAMBLE opt_space
                        LBRACE opt_space
                        value opt_space
                        RBRACE                    {RECOGNIZE("preamble");}
                ;

string:           STRING opt_space
                        LBRACE opt_space
                        assignment
                        opt_space RBRACE          {RECOGNIZE("string");}
                ;

value:            simple_value                    {RECOGNIZE("value-1");}
                | value opt_space                 {RECOGNIZE("value-1-1");}
                        SHARP                     {RECOGNIZE("value-1-2");}
                        opt_space simple_value    {RECOGNIZE("value-2");}
                ;

simple_value:     VALUE                           {RECOGNIZE("simple_value-1");}
                | ABBREV                          {RECOGNIZE("simple_value-2");}
                ;

assignment_list:  assignment                      {RECOGNIZE("single assignment");}
                | assignment_list COMMA opt_space
                        assignment                {RECOGNIZE("assignment-list");}
                ;

assignment:       assignment_lhs opt_space
                        EQUALS opt_space          {RECOGNIZE("assignment-0");}
                        value opt_space           {RECOGNIZE("assignment");}
                ;

assignment_lhs:   FIELD                           {RECOGNIZE("assignment_lhs-1");}
                | ABBREV                          {RECOGNIZE("assignment_lhs-2");}
                ;

opt_space:      /* empty */                       {RECOGNIZE("opt_space-1");}
                | space                           {RECOGNIZE("opt_space-2");}
                ;

space:            single_space                    {RECOGNIZE("single space");}
                | space single_space              {RECOGNIZE("multiple spaces");}
                ;

single_space:     SPACE
                | INLINE
                | NEWLINE
                ;\s0\fP
.fi
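.PP
As a reading aid, the \fCvalue\fP rule above can
be checked by hand with a few lines of Python.
This sketch is a simple hand-written matcher over
token names, not the table-driven parser that
.BR yacc (1)
or
.BR bison (1)
generates; the function names are invented for
illustration.
.nf
```python
SPACE_TOKENS = {"SPACE", "INLINE", "NEWLINE"}
SIMPLE_VALUES = {"VALUE", "ABBREV"}


def skip_space(tokens, i):
    """Advance past an opt_space: zero or more spacing tokens."""
    while i < len(tokens) and tokens[i] in SPACE_TOKENS:
        i += 1
    return i


def match_value(tokens, i=0):
    """Return the index just past a `value` starting at i, or None.

    value: simple_value
         | value opt_space SHARP opt_space simple_value
    """
    if i >= len(tokens) or tokens[i] not in SIMPLE_VALUES:
        return None
    end = i + 1
    while True:
        j = skip_space(tokens, end)
        if j >= len(tokens) or tokens[j] != "SHARP":
            return end  # no further concatenation follows
        j = skip_space(tokens, j + 1)
        if j >= len(tokens) or tokens[j] not in SIMPLE_VALUES:
            return None  # SHARP must be followed by a simple_value
        end = j + 1
```
.fi
Spacing after the final \fCsimple_value\fP is
left unconsumed, because in the grammar it belongs
to the enclosing rule rather than to \fCvalue\fP.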
.\"=====================================================================
.SH "PERFORMANCE"
As a demonstration of the efficiency of parsing,
tests were carried out on a Sun 336MHz UltraSPARC
system.  All programs were compiled at the highest
optimization level and run from the current
directory.  The 4MB test file (the largest from
the \*(Te\& User Group bibliography archive) was
stored in the memory-mapped
.I /tmp
directory for fast access.  Each test was run ten
times inside a shell script to amortize the script
startup time, and the total wall-clock time (from the
UNIX
.BR time (1)
command) for each script's execution was then
divided by ten to produce these results:
.nf
.\" =========== time ./time-bibclean-bibparse.sh /tmp/ibmjrd.bib
.\" 45.56u 1.52s 0:37.76 124.6%
.\" =========== time ./time-biblex-bibparse.sh /tmp/ibmjrd.bib
.\" 31.72u 1.72s 0:24.03 139.1%
.\" =========== time ./time-bibparse.sh /tmp/ibmjrd.bib
.\" 10.00u 0.21s 0:10.30 99.1%
.\" =========== time ./time-bibtex.sh /tmp/ibmjrd
.\" 32.00u 0.77s 0:33.13 98.9%
.ce 9
\fC---------------------------------------------------------------
Program pipeline                                 Time  Relative
                                                         time
---------------------------------------------------------------
bibclean -no-prettyprint -no-warnings | bibparse 3.776s  3.67
bibtex                                           3.313s  3.22
biblex | bibparse                                2.403s  2.33
bibparse                                         1.030s  1.00
---------------------------------------------------------------\fP
.fi
The \*(Bi\& run used the \*(Te\& \fC\enocite{*}\fP
command to generate citations in the
\fCis-alpha\fP style of every entry in the
bibliography.
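.PP
The arithmetic behind the table can be reproduced
from the total wall-clock times measured for the
four ten-run scripts (37.76s, 33.13s, 24.03s, and
10.30s).  The Python names below are chosen for
this sketch only.
.nf
```python
RUNS = 10

# Total wall-clock seconds for ten runs of each pipeline.
WALL_CLOCK_SECONDS = {
    "bibclean | bibparse": 37.76,
    "bibtex": 33.13,
    "biblex | bibparse": 24.03,
    "bibparse": 10.30,
}


def per_run_times(totals, runs=RUNS):
    """Divide each script total by the number of runs."""
    return {name: total / runs for name, total in totals.items()}


def relative_times(times):
    """Scale each time by the fastest one, which becomes 1.0."""
    fastest = min(times.values())
    return {name: t / fastest for name, t in times.items()}


times = per_run_times(WALL_CLOCK_SECONDS)
relative = relative_times(times)
```
.fi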
.PP
The addition of support in
.B bibparse
version 1.04 for direct processing of \*(Bi\&
files via an internal copy of the
.BR biblex (1)
lexical analyzer has thus produced a 2.3-times
speedup over previous versions, which required a
separate
.BR biblex (1)
process.  At data rates of about 4MB/s, the
programs are fast enough on 1999-vintage desktop
computers to process a typical \*(Bi\& bibliography
in a small fraction of a second, so they can be
used routinely to validate such files.
.\"=====================================================================
.SH "SEE ALSO"
.BR bibcheck (1),
.BR bibclean (1),
.BR bibdup (1),
.BR bibextract (1),
.BR bibjoin (1),
.BR biblabel (1),
.BR biblex (1),
.BR biborder (1),
.BR bibsearch (1),
.BR bibsort (1),
.BR bibtex (1),
.BR bibunlex (1),
.BR citefind (1),
.BR citesub (1),
.BR citetags (1),
.BR latex (1),
.BR scribe (1),
.BR tex (1).
.\"=====================================================================
.SH AUTHOR
.nf
Nelson H. F. Beebe
University of Utah
Department of Mathematics, 110 LCB
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Email: \fCbeebe@math.utah.edu\fP, \fCbeebe@acm.org\fP, \fCbeebe@computer.org\fP (Internet)
WWW URL: \fChttp://www.math.utah.edu/~beebe\fP
Telephone: +1 801 581 5254
FAX: +1 801 581 4148
.fi
.\"=====================================================================
.\" This is for GNU Emacs file-specific customization:
.\" Local Variables:
.\" fill-column: 50
.\" End:
