NGramJ 1.0 review
Ngrams are a classical tool in Natural Language Processing (NLP) applications.
NGramJ is an ngram library for NLP with Java. Its major focus is to provide robust, state-of-the-art language recognition. It comes in two types, NGramJ and CNgram, and both are meant to be embedded into larger applications.
Language recognition is not the only NLP application of ngrams, and NGramJ can be used as a building block in many different kinds of applications. However, language recognition was my major application, and NGramJ is therefore somewhat streamlined for it.
NGramJ
This uses ngrams of bytes to determine both the language and the encoding of a byte sequence. In symbols: NGramJ : byte[] --> (Language, Encoding)
CNgram
This uses ngrams of characters to determine the language of a character sequence. In symbols: CNgram : char[] --> Language
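To make the difference between the two inputs concrete, here is a minimal sketch of what "ngrams of bytes" and "ngrams of characters" look like for the same word. The class and method names (NgramSketch, charNgrams, byteNgrams) are hypothetical illustrations and are not part of the NGramJ API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramSketch {

    // All contiguous n-grams of a character sequence (hypothetical helper).
    static List<String> charNgrams(CharSequence text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.subSequence(i, i + n).toString());
        }
        return grams;
    }

    // All contiguous n-grams of a byte sequence (hypothetical helper).
    static List<byte[]> byteNgrams(byte[] data, int n) {
        List<byte[]> grams = new ArrayList<>();
        for (int i = 0; i + n <= data.length; i++) {
            grams.add(Arrays.copyOfRange(data, i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String word = "f\u00fcr"; // the German word "fuer" written with u-umlaut
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);

        // The character bigrams are the same whatever the file encoding was.
        System.out.println(charNgrams(word, 2).size());
        // The byte bigrams differ, because the umlaut is two bytes in UTF-8
        // and one byte in ISO-8859-1.
        System.out.println(byteNgrams(utf8, 2).size());
        System.out.println(byteNgrams(latin1, 2).size());
    }
}
```

The same text yields different byte ngrams under different encodings, which is why byte ngrams can carry information about the encoding as well as the language, while character ngrams only see the decoded text.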
If you think of applying this to files: NGramJ is the right choice if you do not know which encoding the files use. If, on the other hand, you know the encoding, it is better to use that encoding explicitly to read the file and to apply CNgram afterward.
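For the known-encoding case, the decoding step might look like the sketch below. It only uses standard Java I/O; the class and method names are hypothetical, and the call into CNgram itself is deliberately left out because its API is not described here.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class KnownEncodingSketch {

    // Decode a whole file with an encoding the caller already knows.
    static String readWithKnownEncoding(Path file, Charset encoding) throws IOException {
        return new String(Files.readAllBytes(file), encoding);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ngram-sketch", ".txt");
        try {
            // Write a small ISO-8859-1 sample so the example is self-contained.
            Files.write(tmp, "Gr\u00fc\u00dfe".getBytes(StandardCharsets.ISO_8859_1));
            String text = readWithKnownEncoding(tmp, StandardCharsets.ISO_8859_1);
            // 'text' is now a character sequence that a character-based
            // recognizer could consume directly.
            System.out.println(text.length());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```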
Once you are inside a program and dealing with Strings and other kinds of character sequences, CNgram is the only reasonable way to go.
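As an illustration of character-based recognition on in-memory text, the sketch below ranks character ngrams by frequency and picks the reference profile with the smallest "out-of-place" rank distance, a well-known textbook approach. This is an assumption made for illustration only: it is not claimed to be the algorithm CNgram uses, the names are hypothetical, and the tiny reference texts stand in for real language profiles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CharProfileSketch {

    // Rank the character n-grams (n = 1..3) of a text by descending frequency.
    static List<String> profile(CharSequence text, int maxRanks) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toString().toLowerCase(Locale.ROOT);
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically so results are deterministic.
        grams.sort((a, b) -> {
            int byCount = counts.get(b) - counts.get(a);
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(maxRanks, grams.size())));
    }

    // Out-of-place distance: sum of rank differences, with a fixed penalty
    // (the reference size) for n-grams missing from the reference profile.
    static int distance(List<String> doc, List<String> reference) {
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = reference.indexOf(doc.get(i));
            total += (j < 0) ? reference.size() : Math.abs(i - j);
        }
        return total;
    }

    // Return the key of the reference profile closest to the text.
    static String guess(CharSequence text, Map<String, List<String>> references) {
        List<String> doc = profile(text, 300);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : references.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference profiles; real ones would be built from large corpora.
        Map<String, List<String>> refs = new LinkedHashMap<>();
        refs.put("en", profile("the quick brown fox jumps over the lazy dog and the cat", 300));
        refs.put("de", profile("der schnelle braune fuchs springt \u00fcber den faulen hund und die katze", 300));
        // Any CharSequence works as input, not only String.
        System.out.println(guess(new StringBuilder("the dog and the fox"), refs));
    }
}
```

Because the input type is CharSequence, the same code accepts String, StringBuilder, or CharBuffer without any encoding step.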
Caution: For historical reasons, the name NGramJ sometimes refers only to the (older) byte-based ngrams, excluding the newer CNgram addition. I'm sorry about the confusion.
What's New in This Release:
First public release of CNgram: Character based language recognition!
NGramJ wakes from stasis after more than 4 years.
Some optimizations of NGramJ memory performance.
Added ant based building.
Moved ngram to the de.spieleck.app.ngramj Package.
Corrected the misspelled class name "Cathegory".
Provide self executing archives for both CNgram and NGramJ.
CNgramj (2 prerelease): straightened NGramProfiles
CNgramj (2 prerelease): added new Nutch profiles