License: GPL (GNU General Public License)
File size: 49K
Developer: John Cowan
TagSoup is a SAX2 parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. It is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly.
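
Because the parser is exposed through the standard SAX XMLReader interface, handler code written for XML can be reused for HTML unchanged. The sketch below is illustrative only: LinkCollector, collect and jdkReader are hypothetical names, and the JDK's built-in XML parser stands in for TagSoup so the example runs without extra jars. With TagSoup on the classpath, passing new org.ccil.cowan.tagsoup.Parser() as the reader should let the same handler cope with malformed HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the href of every <a> element seen in a SAX event stream. */
public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is set when the reader is namespace-aware; otherwise fall back to qName.
        String name = localName.isEmpty() ? qName : localName;
        if (name.equalsIgnoreCase("a")) {
            // Attributes are looked up in no namespace.
            String href = atts.getValue("", "href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    /** Runs any SAX2 XMLReader over the markup and returns the links found. */
    public static List<String> collect(XMLReader reader, String markup) throws Exception {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    /** Stand-in reader: the JDK's namespace-aware XML parser (accepts well-formed input only). */
    public static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(collect(jdkReader(), page)); // prints [a.html, b.html]
    }
}
```

The handler never refers to a concrete parser class, so swapping the stand-in reader for TagSoup's is a one-line change at the call site.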

The following options are understood:

Output into individual files, with html extensions changed to xhtml. Otherwise, all output is sent to the standard output.
Output is in clean HTML: the XML declaration is suppressed, as are end-tags for the known empty elements.
The XML declaration is suppressed.
End-tags for the known empty HTML elements are suppressed.
Output is in PYX format.
Input is in PYXoid format (need not be well-formed).
Namespaces are suppressed. Normally, all elements are in the XHTML 1.x namespace, and all attributes are in no namespace.
Bogons (unknown elements) are suppressed. Normally, they are treated as empty.
Suppress default attribute values.
Change explicit colons in element and attribute names to underscores.
Don't restart any normally restartable elements.
Bogons are given a content model of ANY rather than EMPTY.
Pass through HTML comments. Has no effect when output is in PYX format.
Reuse a single instance of the TagSoup parser throughout. Normally, a new one is instantiated for each input file.
Change the content models of the script and style elements to treat them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than with the special CDATA content model.
Specify the input encoding. The default is the Java platform default.
Print help.
Print the version number.
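
For readers unfamiliar with PYX, it is a line-oriented rendering of parse events: "(" opens an element, "A" carries an attribute, "-" carries text, and ")" closes an element. The following is a simplified sketch of how SAX events map onto those lines, not TagSoup's own writer. PyxWriter, toPyx and jdkReader are hypothetical names, the escaping rules shown are an assumption covering only backslash, newline and tab, and the JDK's XML parser again stands in for TagSoup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Renders SAX events as PYX-style lines (simplified sketch). */
public class PyxWriter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        out.append('(').append(qName).append('\n');
        for (int i = 0; i < atts.getLength(); i++) {
            out.append('A').append(atts.getQName(i)).append(' ')
               .append(escape(atts.getValue(i))).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        out.append(')').append(qName).append('\n');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may split one text run across several calls, so buffer until the next tag.
        text.append(ch, start, length);
    }

    private void flushText() {
        if (text.length() > 0) {
            out.append('-').append(escape(text.toString())).append('\n');
            text.setLength(0);
        }
    }

    // Assumed escaping: keep every event on a single physical line.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
    }

    /** Runs any SAX2 XMLReader over the markup and returns the PYX-style text. */
    public static String toPyx(XMLReader reader, String markup) throws Exception {
        PyxWriter handler = new PyxWriter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.out.toString();
    }

    /** Stand-in reader: the JDK's default XML parser (accepts well-formed input only). */
    public static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toPyx(jdkReader(), "<p class=\"x\">hi</p>"));
    }
}
```

For the input in main, the sketch emits four lines: (p, Aclass x, -hi, and )p. Having one event per line is what makes the format convenient for line-based text tools.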

Requirements:
Java 1.4.2 or later

What's New in This Release:
All known bugs are fixed and all features considered appropriate have been added.
This release is ready for full production use.

