CyberNeko HTML Parser 0.9.5 review

License: The Apache License
File size: 386K
Developer: Andy Clark

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.

The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make when writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
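
To make the idea of tag balancing concrete, here is a toy sketch in Java. It is not NekoHTML's algorithm or API: the class name TagBalancerSketch, the AUTO_CLOSE set, and the simplified tokenizer (no attributes, comments, or void elements) are all assumptions made for illustration. It covers only two of the fix-ups described above, closing elements with optional end tags and repairing mismatched inline tags, and does not add missing parent elements.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalancerSketch {
    // Hypothetical subset of elements with optional end tags: opening a new
    // one while the same element is innermost closes the old one first.
    private static final Set<String> AUTO_CLOSE = Set.of("p", "li");

    // Simplified tokenizer: attribute-free tags, or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>|[^<]+");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of open element names
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String name = m.group(2);
            if (name == null) {                  // plain text: copy through
                out.append(m.group());
                continue;
            }
            name = name.toLowerCase();
            if (m.group(1).isEmpty()) {          // start tag
                if (AUTO_CLOSE.contains(name) && name.equals(open.peek())) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {    // end tag with a matching start
                String top;
                do {                             // close inner elements first
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // An end tag with no matching start tag is silently dropped.
        }
        while (!open.isEmpty()) {                // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<ul><li>one<li>two</ul>"));
        System.out.println(balance("<b><i>x</b>"));
    }
}
```

With this sketch, "<ul><li>one<li>two</ul>" becomes "<ul><li>one</li><li>two</li></ul>", and the mismatched "<b><i>x</b>" becomes "<b><i>x</i></b>". A real balancer also has to consult the HTML content model to decide which parents to insert, which this toy version leaves out.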

NekoHTML is written using the Xerces Native Interface (XNI), which is the foundation of the Xerces2 implementation. This lets you use the NekoHTML parser with existing XNI tools without modifying or rewriting code.

Limitations:
There are HTML documents for which NekoHTML cannot properly generate a well-formed XML document event stream. For example, documents with multiple <html> tags are inherently ill-formed because XML documents may only have a single root element.
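
The single-root rule can be checked with the XML parser that ships with Java. The sketch below does not use NekoHTML; the class name SingleRootCheck and the method isWellFormed are hypothetical names chosen for illustration. It simply asks a standard DocumentBuilder whether a string parses as well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SingleRootCheck {
    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                        // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);  // environment problem, not input
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body/></html>"));
        System.out.println(isWellFormed("<html><body/></html><html><body/></html>"));
    }
}
```

The first call reports true and the second false, because the second root element appears after the document's root has already been closed.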

Code added to the core DOM implementation in Xerces-J 2.0.1 introduced a bug in the HTML DOM implementation based on it.

The bug causes the element nodes in the resultant HTML document object to be of type org.apache.xerces.dom.ElementNSImpl instead of the appropriate HTML DOM element objects.

The problem affects NekoHTML users who use the parser with Xerces-J 2.0.1 and anyone using the HTML DOM implementation in Xerces-J 2.0.1.

There are no other known major limitations with this release. However, additional work can always be done to improve performance, fix bugs, and add functionality.

Requirements:
Java 1.1 (or higher)
Xerces 2.0.0 (or higher)

What's New in This Release:
added feature, submitted by Asgeir Asgeirsson, that allows the scanner to fix character entity references for Microsoft Windows characters
stopped building nekohtmlXni.jar file by default
fixed handling of to better match browser behavior
fixed tag-balancing bug for unknown elements
fixed mapping of encoding name in <meta> element
changed tag-balancing to allow headers inside of links
applied attribute namespace patch from Joseph Walton
fixed namespace bug for "xml" prefixes
fixed namespace bug for "xmlns" prefixes
fixed no-such-method exception bug when using augmentations feature with older versions of Xerces2
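
As background on the first item in the list above: pages authored on Windows often contain numeric character references in the range 128 to 159 (for example &#146; for a right single quotation mark), which are Windows-1252 byte values rather than the intended Unicode code points. The following Java sketch shows one way such references could be repaired. It is an assumption about the general technique, not NekoHTML's implementation; the class name WindowsRefFixSketch and the method fix are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WindowsRefFixSketch {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decimal numeric character references; digit count capped to avoid overflow.
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Replaces references 128..159 with the character that Windows-1252
    // assigns to that byte value; every other reference is left untouched.
    static String fix(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            String replacement = m.group();      // default: keep the reference
            if (code >= 128 && code <= 159) {
                replacement = new String(new byte[] {(byte) code}, CP1252);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("It&#146;s a &#147;quote&#148; &#169;"));
    }
}
```

Here &#146; is turned into U+2019 and &#147; and &#148; into U+201C and U+201D, while &#169; is outside the affected range and is kept as written. A few byte values in the range have no Windows-1252 assignment, and how those should be handled is left open in this sketch.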
