JBootCat 0.2 review
JBootCat is a Java implementation of the BootCat scripts, written by Marco Baroni et al., for generating corpora from the Internet. JBootCat's main goal is to encapsulate the BootCat functionality within a user-friendly desktop application.
The advantage of using the Java platform is that JBootCat can be run easily on most major operating systems.
Here are some key features of "JBootCat":
Step-by-step "wizard" interface - review each step of the process
Enter "seeds" directly or load them from a file (and save them to a file for future use).
Generate "tuples" directly or load them from a file (and save them to a file for future use).
Queries Google's massive online index to obtain relevant web pages (only HTML pages supported at the moment).
HTML cleanser and advanced tokeniser (courtesy of jTokeniser).
URL review
Selected URLs are downloaded to a text file (using BootCat's "Raw" format) and saved as UTF-8.
Multi-platform - runs on any computer with Java installed.
Free and Open Source (LGPL)
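To give a feel for the "seeds" and "tuples" steps listed above, here is a minimal Java sketch of how search tuples might be built from a seed list. It assumes that a tuple is simply a random combination of distinct seed terms, which is how the BootCat approach is usually described. The class name TupleSketch, the method generateTuples, and all of its parameters are hypothetical and are not taken from JBootCat's source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only: not JBootCat's actual code.
class TupleSketch {

    /**
     * Builds up to `count` distinct tuples, each made of `tupleSize`
     * different seed terms picked at random.
     */
    static List<List<String>> generateTuples(List<String> seeds, int tupleSize,
                                             int count, long randomSeed) {
        // Drop repeated seeds so a tuple never contains the same term twice.
        List<String> distinctSeeds = new ArrayList<>(new LinkedHashSet<>(seeds));
        if (tupleSize < 1 || tupleSize > distinctSeeds.size()) {
            throw new IllegalArgumentException("tupleSize must be between 1 and the number of distinct seeds");
        }
        Random random = new Random(randomSeed);
        Set<List<String>> tuples = new LinkedHashSet<>();
        // Cap the attempts in case more tuples are requested than can exist.
        int maxAttempts = Math.max(1, count) * 100;
        for (int attempt = 0; attempt < maxAttempts && tuples.size() < count; attempt++) {
            List<String> shuffled = new ArrayList<>(distinctSeeds);
            Collections.shuffle(shuffled, random);
            List<String> tuple = new ArrayList<>(shuffled.subList(0, tupleSize));
            // Sort so that the same combination in a different order counts once.
            Collections.sort(tuple);
            tuples.add(tuple);
        }
        return new ArrayList<>(tuples);
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList("corpus", "linguistics", "tokeniser", "concordance", "lemma");
        // Each printed line could serve as one search query.
        for (List<String> tuple : generateTuples(seeds, 3, 4, 42L)) {
            System.out.println(String.join(" ", tuple));
        }
    }
}
```

Each resulting tuple would then be sent to the search engine as one query. Fixing the random seed makes a run repeatable, which can be handy when comparing corpora built from the same seed list.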
What's New in This Release:
This version contains the core functionality for searching Google for relevant pages and then downloading, filtering, and tokenising.