DomSax 1.0.0 review
DomSax is an XML parser based on the standard Document Object Model (DOM) principle and Sun's implementation of it. It combines DOM with the flexibility and low memory consumption of the SAX parser (also Sun's implementation).
Most XML documents contain repeating blocks, that is, the same structure of elements repeated over and over. For each repeating block the parser creates a complete document, with the start element of the block as the document root. This lets the programmer keep the code clean and the memory consumption within bounds.
The parser has been tested on Java 1.5.1.
There are currently two options for parsing XML files: SAX and DOM. SAX gives you the flexibility to load specific elements from a stream, which minimizes memory consumption and load time but complicates searches. DOM gives you a convenient interface for searching elements in the fully loaded document, but that interface comes at a high cost in memory consumption and speed.
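To make the DOM side of that trade-off concrete, here is a minimal sketch using only the standard JAXP classes. The class name `DomSearchExample`, the method `expensiveNames`, and the `catalog`/`record`/`name`/`price` element names are assumptions made up for illustration. Searching is a one-liner, but the whole document has to be in memory first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: the element names are hypothetical sample data.
class DomSearchExample {

    /** Returns the names of all records whose price is greater than 1. */
    static List<String> expensiveNames(String xml) throws Exception {
        // The complete document is loaded into memory before any search can run.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Once loaded, a single XPath expression does the searching.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/catalog/record[price > 1]/name", doc, XPathConstants.NODESET);

        List<String> names = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog>"
                + "<record><name>a</name><price>1</price></record>"
                + "<record><name>b</name><price>2</price></record>"
                + "</catalog>";
        System.out.println(expensiveNames(xml));
    }
}
```

For a small file this is pleasant to work with. For a file of 100+ MB the in-memory tree is where the cost shows up.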
When I started this project, one of the requirements was the ability to process XML files of 100+ MB. That effectively left me with SAX as the only choice, since it parses the file element by element and so keeps the memory consumption within bounds. However, I didn't like the implications for the project's code. Anyone who has ever written a parser with SAX will agree that you are left with a mess, because the open tag, the data and the close tag are received separately.
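The following sketch shows what that separation looks like in practice with the standard `org.xml.sax` API. The class name `PlainSaxExample` and the `record`/`name`/`price` element names are hypothetical sample data, not taken from the project. Even for two fields per record, the handler needs a text buffer and several pieces of state that are spread over three callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: the element names are hypothetical sample data.
class PlainSaxExample {

    /** Collects one "name=price" string per <record> element. */
    static class RecordHandler extends DefaultHandler {
        final List<String> records = new ArrayList<String>();
        private final StringBuilder text = new StringBuilder();
        private boolean inRecord;
        private String name;
        private String price;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
            if ("record".equals(qName)) {
                inRecord = true;
                name = null;
                price = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it has to be buffered.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!inRecord) {
                return;
            }
            if ("name".equals(qName)) {
                name = text.toString().trim();
            } else if ("price".equals(qName)) {
                price = text.toString().trim();
            } else if ("record".equals(qName)) {
                records.add(name + "=" + price);
                inRecord = false;
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        RecordHandler handler = new RecordHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><header><title>demo</title></header>"
                + "<record><name>a</name><price>1</price></record>"
                + "<record><name>b</name><price>2</price></record></catalog>";
        System.out.println(parse(xml));
    }
}
```

Every additional field or nesting level adds more flags and branches to the handler, which is the mess referred to above.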
So what I wanted was the flexibility of the SAX parser combined with the ease of use of the DOM approach. The underlying principle of DomSax is the repeating block, which can be indicated with existing XPath technology. Most XML files store records that are always described in the same manner (i.e. repeating blocks).
In the example diagram there is a single header, which is always the first element within the document-root tag (blue box). The header is followed by the record elements (orange boxes). For each box that is indicated to the parser with an XPath expression, a complete document is created that contains only the data within that box. Once the document is complete, it is passed to the registered listeners.
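The idea can be sketched with the standard SAX and DOM classes. This is not DomSax's actual API or implementation: the class `BlockSplitter`, the nested `Listener` interface and the method `blockCompleted` are hypothetical names, and a plain path string such as `/catalog/record` stands in for real XPath matching. The sketch streams the input with SAX, builds a small DOM document for each matching block, and hands it to a listener when the block's end tag arrives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the repeating-block idea; not the DomSax API.
class BlockSplitter extends DefaultHandler {

    /** Hypothetical callback that receives one document per block. */
    interface Listener {
        void blockCompleted(Document block);
    }

    private final String blockPath;   // e.g. "/catalog/record" (assumed path syntax)
    private final Listener listener;
    private final DocumentBuilder builder;
    private final StringBuilder path = new StringBuilder();
    private Document doc;             // non-null only while inside a block
    private Node current;             // insertion point within doc

    BlockSplitter(String blockPath, Listener listener) throws Exception {
        this.blockPath = blockPath;
        this.listener = listener;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.append('/').append(qName);
        if (doc == null) {
            if (!path.toString().equals(blockPath)) {
                return; // outside any block: nothing is kept in memory
            }
            doc = builder.newDocument(); // the block's start element becomes the root
            current = doc;
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (doc != null) {
            current = current.getParentNode();
            if (current == doc) { // the block's root element was just closed
                listener.blockCompleted(doc);
                doc = null;       // release the block so it can be garbage collected
                current = null;
            }
        }
        path.setLength(path.length() - qName.length() - 1);
    }

    /** Usage example: collects the <name> of every /catalog/record block. */
    static List<String> recordNames(String xml) throws Exception {
        final List<String> names = new ArrayList<String>();
        BlockSplitter splitter = new BlockSplitter("/catalog/record", new Listener() {
            public void blockCompleted(Document block) {
                // Inside the listener the block is an ordinary DOM document.
                Element root = block.getDocumentElement();
                names.add(root.getElementsByTagName("name").item(0).getTextContent());
            }
        });
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), splitter);
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><header><title>demo</title></header>"
                + "<record id=\"1\"><name>a</name></record>"
                + "<record id=\"2\"><name>b</name></record></catalog>";
        System.out.println(recordNames(xml));
    }
}
```

Only one block is held in memory at a time, while the listener code gets the DOM interface for searching within that block. A header block could presumably be handled the same way by registering a second path, though this sketch supports only one.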