Tag Soup 1.0 review

by rbytes.net

License: GPL (GNU General Public License)
File size: 49K
Developer: John Cowan

TagSoup is a SAX2 parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. It is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly.
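
Because the parser is exposed through the standard SAX2 XMLReader interface, handler code written against org.xml.sax does not need to know which parser is feeding it. The following is a minimal sketch of that idea, not code from the TagSoup distribution: the class name ElementCounter and its count method are hypothetical, and the code assumes a modern JDK (8 or later) rather than the 1.4.2 baseline. With the TagSoup jar on the classpath, an instance of org.ccil.cowan.tagsoup.Parser could be passed as the reader; the JDK's own strict XML parser works too, but only on well-formed input.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts how often each element name occurs.
final class ElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Parsers that are not namespace-aware leave localName empty.
        String name = localName.isEmpty() ? qName : localName;
        counts.merge(name, 1, Integer::sum);
    }

    // Works with any SAX2 XMLReader supplied by the caller.
    static Map<String, Integer> count(XMLReader reader, String markup) throws Exception {
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.counts;
    }
}
```

A call such as ElementCounter.count(reader, html) returns a map from element name to occurrence count. Swapping one XMLReader implementation for another requires no change to the handler.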

The following options are understood:

--files
Output into individual files, with html extensions changed to xhtml. Otherwise, all output is sent to the standard output.
--html
Output is in clean HTML: the XML declaration is suppressed, as are end-tags for the known empty elements.
--omit-xml-declaration
The XML declaration is suppressed.
--method=html
End-tags for the known empty HTML elements are suppressed.
--pyx
Output is in PYX format.
--pyxin
Input is in PYXoid format (need not be well-formed).
--nons
Namespaces are suppressed. Normally, all elements are in the XHTML 1.x namespace, and all attributes are in no namespace.
--nobogons
Bogons (unknown elements) are suppressed. Normally, they are treated as empty.
--nodefaults
Default attribute values are suppressed.
--nocolons
Explicit colons in element and attribute names are changed to underscores.
--norestart
Normally restartable elements are not restarted.
--any
Bogons are given a content model of ANY rather than EMPTY.
--lexical
Pass through HTML comments. Has no effect when output is in PYX format.
--reuse
Reuse a single instance of TagSoup parser throughout. Normally, a new one is instantiated for each input file.
--nocdata
Change the content models of the script and style elements to treat them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than with the special CDATA content model.
--encoding=encoding
Specify the input encoding. The default is the Java platform default.
--help
Print help.
--version
Print the version number.
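
To give a rough idea of what the --pyx option refers to, PYX is a line-oriented notation in which each parse event occupies one line, prefixed by a single character: ( for a start-tag, A for an attribute, - for character data, ) for an end-tag, and ? for a processing instruction. The sketch below renders SAX events in that notation. It is an illustration only, not TagSoup's own PYX writer: the class name PyxWriter and the method toPyx are hypothetical, the escaping rules are assumed from the commonly described PYX convention, and a modern JDK is assumed.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that renders SAX events as PYX-style lines.
final class PyxWriter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('(').append(name(localName, qName)).append('\n');
        // Attribute lines follow the start-tag line they belong to.
        for (int i = 0; i < atts.getLength(); i++) {
            out.append('A').append(name(atts.getLocalName(i), atts.getQName(i)))
               .append(' ').append(escape(atts.getValue(i))).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append(')').append(name(localName, qName)).append('\n');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may split one text run into several calls; each becomes its own line.
        out.append('-').append(escape(new String(ch, start, length))).append('\n');
    }

    @Override
    public void processingInstruction(String target, String data) {
        out.append('?').append(target).append(' ')
           .append(data == null ? "" : escape(data)).append('\n');
    }

    // Prefer the qualified name; fall back to the local name if it is absent.
    private static String name(String localName, String qName) {
        return (qName == null || qName.isEmpty()) ? localName : qName;
    }

    // Assumed convention: backslash, newline and tab are backslash-escaped.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
    }

    static String toPyx(XMLReader reader, String markup) throws Exception {
        PyxWriter handler = new PyxWriter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.out.toString();
    }
}
```

Fed the fragment <p class="x">hi</p> through the JDK's built-in parser, this handler produces the four lines (p, Aclass x, -hi and )p. One event per line is what makes the notation convenient for line-oriented text tools.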

Requirements:
Java 1.4.2 or later

What's New in This Release:
All known bugs are fixed and all features considered appropriate have been added.
This release is ready for full production use.