====================
BeautifulSoup Parser
====================

BeautifulSoup_ is a Python package that parses broken HTML.  While libxml2
(and thus lxml) can also parse broken HTML, BeautifulSoup is much more
forgiving and has superior `support for encoding detection`_.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
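
As an aside, BeautifulSoup's encoding detection can also be used on its
own.  In BeautifulSoup 4 it is exposed as ``UnicodeDammit``; the
following sketch assumes the ``bs4`` package is installed and uses a
UTF-8 byte string of our own making:

```python
from bs4 import UnicodeDammit

# A UTF-8 byte string starting with a byte order mark, so that the
# encoding can be detected deterministically from the BOM.
data = b'\xef\xbb\xbfSe\xc3\xb1or caf\xc3\xa9'

dammit = UnicodeDammit(data)
print(dammit.original_encoding)   # the detected encoding
print(dammit.unicode_markup)      # the decoded text, BOM stripped
```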

lxml can benefit from the parsing capabilities of BeautifulSoup
through the ``lxml.html.soupparser`` module.  It provides three main
functions: ``fromstring()`` and ``parse()`` to parse a string or file
using BeautifulSoup, and ``convert_tree()`` to convert an existing
BeautifulSoup tree into a list of top-level Elements.

The functions ``fromstring()`` and ``parse()`` behave as they do in
ElementTree: the former returns a root Element, the latter an
ElementTree.
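
As a quick sketch (assuming ``lxml`` and BeautifulSoup are installed),
both functions accept the same tag soup; ``parse()`` reads from a file
object:

```python
from io import StringIO
from lxml.html import soupparser

root = soupparser.fromstring('<p>Hello</p>')       # an Element
tree = soupparser.parse(StringIO('<p>Hello</p>'))  # an ElementTree

# Both wrap the parsed content in an html root element.
print(root.tag)
print(tree.getroot().tag)
```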

Here is a document full of tag soup, similar to, but not quite like, HTML::

    >>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

All you need to do is pass it to the ``fromstring()`` function::

    >>> from lxml.html.soupparser import fromstring
    >>> root = fromstring(tag_soup)

To see what we have here, you can serialise it::

    >>> from lxml.etree import tostring
    >>> print tostring(root, pretty_print=True),
    <html>
      <meta/>
      <head>
        <title>Hello</title>
      </head>
      <body onload="crash()">Hi all<p/></body>
    </html>

Not quite what you'd expect from an HTML page, but, well, it was broken
already, right?  BeautifulSoup did its best, and so now it's a tree.
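
Since the result is an ordinary lxml tree, the usual ElementTree API
applies to it.  A small sketch, using a slightly less broken example
document of our own so that the repaired structure is predictable:

```python
from lxml.html.soupparser import fromstring

# Explicit head and body tags keep the repaired tree predictable.
root = fromstring('<head><title>Hello</title></head>'
                  '<body>Hi all<p></p></body>')

# Navigate the repaired tree with the standard ElementTree API.
print(root.find('.//title').text)
print(root.find('.//body').text)
```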

To control which Element implementation is used, you can pass a
``makeelement`` factory function to ``parse()`` and ``fromstring()``.
By default, this is based on the HTML parser defined in ``lxml.html``.
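
With the default factory, the resulting elements therefore support the
``lxml.html`` element API (e.g. ``text_content()``).  A minimal sketch,
assuming ``lxml`` and BeautifulSoup are installed:

```python
from lxml.html import HtmlElement
from lxml.html.soupparser import fromstring

root = fromstring('<p>Hi <b>there</b></p>')

# The default element factory builds lxml.html elements.
print(isinstance(root, HtmlElement))
print(root.text_content())
```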

By default, the BeautifulSoup parser also replaces the entities it
finds with their character equivalents::

    >>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;<p>'
    >>> body = fromstring(tag_soup).find('.//body')
    >>> body.text
    u'\xa9\u20ac-\xf5\u01bd'

If you want them back on the way out, you can serialise with the
``html`` method, which always uses escaping for safety reasons::

    >>> tostring(body, method="html")
    '<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

    >>> tostring(body, method="html", encoding="utf-8")
    '<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

    >>> tostring(body, method="html", encoding=unicode)
    u'<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

Otherwise, when serialising to XML, non-ASCII characters are escaped
only in the default plain ASCII encoding::

    >>> tostring(body)
    '<body>&#169;&#8364;-&#245;&#445;<p/></body>'

    >>> tostring(body, encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd<p/></body>'

    >>> tostring(body, encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd<p/></body>'

There is also a legacy module called ``lxml.html.ElementSoup``, which
mimics the interface provided by ElementTree's own ElementSoup_
module.

.. _ElementSoup: http://effbot.org/zone/element-soup.htm
