PDFTextStream 2.0.2 review
PDFTextStream is a PDF text and metadata extraction library available for Java, Python, and .NET.
It supports all versions of the PDF document specification (including v1.6, used by Acrobat 7), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of 40-bit and 128-bit encrypted documents, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations).
Easy integration with Jakarta Lucene is included.
Requirements:
Apache Lucene (optional)
What's New in This Release:
This release adds a com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction.
It adds the ability to derive the encoding and spatial metrics of Type3 fonts.
It also adds a pdfts.type3.derive system property to disable that derivation if necessary.
A problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined, has been fixed.
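As a rough illustration of the pdfts.type3.derive system property mentioned above, the sketch below sets it from Java using the standard System.setProperty call. The property name comes from the release notes; the value "false" is an assumption, since the notes do not say which value disables derivation, and the class name Type3DeriveConfig is hypothetical. The property would presumably need to be set before the library reads it.

```java
public class Type3DeriveConfig {
    // Property name taken from the release notes.
    static final String PROPERTY = "pdfts.type3.derive";

    // Stores the ASSUMED disabling value "false" and returns what was stored.
    // The release notes do not document the accepted values.
    static String disableDerivation() {
        System.setProperty(PROPERTY, "false");
        return System.getProperty(PROPERTY);
    }

    public static void main(String[] args) {
        // Set the property before any PDF processing starts.
        System.out.println(PROPERTY + "=" + disableDerivation());
    }
}
```

Under the same assumption about the value, the property could also be passed on the JVM command line with the standard -D flag, for example -Dpdfts.type3.derive=false.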