jTokeniser 2.0 review
The jTokeniser project is a Java library for tokenising strings into a list of tokens.
Here are some key features of "jTokeniser":
WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace, including spaces, newlines, tabs and linefeeds.
StringTokeniser - this is basically the same as java.util.StringTokenizer with some extra methods (and extends from Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser; however, you can specify a set of characters to be used as word delimiters.
RegexTokeniser - this tokeniser is much more flexible, as you can use regular expressions to define what a token is. So, "\w+" means that whenever it matches one or more word characters (letters, digits or underscores), it will consider that a word. By default, it uses a regular expression equivalent to a whitespace tokeniser.
RegexSeparatorTokeniser - this can be thought of as an advanced StringTokeniser. Whereas StringTokeniser is limited to defining delimiters as a set of individual characters, RegexSeparatorTokeniser can utilise regular expressions for a richer and more flexible approach.
BreakIteratorTokeniser - one of the most sophisticated tokenisers in the library, although it should only be used on natural language strings to isolate words. It comes with built-in rules about how to find words, knowing how to disregard punctuation, etc.
SentenceTokeniser - this also uses a BreakIterator, like BreakIteratorTokeniser, but is tuned towards finding sentence boundaries. The "tokens" in this tokeniser are in fact individual sentences.
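The differences between these approaches can be illustrated with standard JDK classes alone. The sketch below does not use the jTokeniser API, whose exact class and method signatures are not given here; the class name TokeniserSketch and its helper methods are hypothetical and only approximate what each tokeniser is described as doing.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of jTokeniser.
public class TokeniserSketch {

    // Delimiter-set splitting: each character in 'delimiters' ends a token.
    static List<String> byDelimiters(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // The regex describes the token itself: every match becomes a token.
    static List<String> byTokenPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // The regex describes the separator: tokens are the text between matches.
    static List<String> bySeparatorPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Word boundaries from BreakIterator; segments that start with
    // punctuation or whitespace are skipped.
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Sentence boundaries from BreakIterator; each token is a whole sentence.
    static List<String> sentences(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                tokens.add(sentence);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(byDelimiters("a,b;c", ",;"));
        System.out.println(byTokenPattern("Hello, world 42!", "\\w+"));
        System.out.println(bySeparatorPattern("one, two;three", "[,;]\\s*"));
        System.out.println(words("Hello, world!"));
        System.out.println(sentences("It works. Does it? Yes."));
    }
}
```

Note how the same regular expression plays opposite roles in byTokenPattern and bySeparatorPattern: in the first it says what a token looks like, in the second it says what sits between tokens. That is the distinction the list draws between RegexTokeniser and RegexSeparatorTokeniser.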
What's New in This Release:
This release includes an easy-to-use GUI front-end for trying out the tokenisers interactively, out of the box.
This is especially useful for experimenting with tokenisers, perhaps within a teaching environment.
It is also handy for those who lack the Java experience needed to use the library API directly.