A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).
The library and its front-ends were designed and written by Andrew McCallum, with contributions from several graduate and undergraduate students.
The library provides facilities for:
- Recursively descending directories, finding text files.
- Finding 'document' boundaries when there are multiple documents per file.
- Tokenizing a text file, according to several different methods.
- Including N-grams among the tokens.
- Mapping strings to integers and back again, very efficiently.
- Building a sparse matrix of document/token counts.
- Pruning vocabulary by word counts or by information gain.
- Building and manipulating word vectors.
- Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
- Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
- Scoring queries for retrieval or classification.
- Writing all data structures to disk in a compact format.
- Reading the document/token matrix from disk in an efficient, sparse fashion.
- Performing test/train splits, and automatic classification tests.
- Operating in server mode, receiving and answering queries over a socket.
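
Two of the facilities listed above, mapping strings to integers and back, and Laplace smoothing, can be illustrated with a minimal, self-contained C sketch. It does not use libbow's actual API: every name here (`vocab_t`, `vocab_init`, `vocab_id`, `vocab_word`, `laplace_prob`) and the fixed capacity are hypothetical, chosen only for illustration, and the hashing scheme (FNV-1a with linear probing) is an assumption rather than what the library does internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical limits for this sketch; not taken from libbow. */
#define VOCAB_CAPACITY 1024
#define TABLE_SLOTS (2 * VOCAB_CAPACITY)

typedef struct {
    char *words[VOCAB_CAPACITY]; /* integer id -> string */
    int slots[TABLE_SLOTS];      /* hash slot -> id + 1, 0 means empty */
    int size;                    /* number of distinct words stored */
} vocab_t;

/* 32-bit FNV-1a constants; any reasonable string hash would do. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 2166136261UL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619UL;
    }
    return h;
}

void vocab_init(vocab_t *v)
{
    memset(v, 0, sizeof *v);
}

/* Return the id of `word`, adding it if unseen. Returns -1 when the
   vocabulary is full or memory runs out. */
int vocab_id(vocab_t *v, const char *word)
{
    unsigned long i = hash_str(word) % TABLE_SLOTS;
    /* Linear probing; terminates because the table is never more
       than half full. */
    while (v->slots[i] != 0) {
        if (strcmp(v->words[v->slots[i] - 1], word) == 0)
            return v->slots[i] - 1;
        i = (i + 1) % TABLE_SLOTS;
    }
    if (v->size == VOCAB_CAPACITY)
        return -1;
    char *copy = malloc(strlen(word) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, word);
    v->words[v->size] = copy;
    v->slots[i] = v->size + 1;
    return v->size++;
}

/* Map an id back to its string, or NULL if the id is unknown. */
const char *vocab_word(const vocab_t *v, int id)
{
    return (id >= 0 && id < v->size) ? v->words[id] : NULL;
}

/* Laplace (add-one) estimate: (count + 1) / (total + vocab_size). */
double laplace_prob(long count, long total, int vocab_size)
{
    assert(vocab_size > 0);
    return (count + 1.0) / ((double)total + vocab_size);
}
```

Under these assumptions, calling `vocab_id` twice with the same word yields the same integer, and `laplace_prob(0, 0, 4)` evaluates to 0.25, so a word never seen in training still receives a nonzero probability in a four-word vocabulary.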
[by Martin]