Traumwind - The `Bow' Toolkit

A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

The library and its front-ends were designed and written by Andrew McCallum, with contributions from several graduate and undergraduate students.

The library provides facilities for:

Recursively descending directories, finding text files.

Finding `document' boundaries when there are multiple documents per file.

Tokenizing a text file, according to several different methods.

Including N-grams among the tokens.

Mapping strings to integers and back again, very efficiently.

Building a sparse matrix of document/token counts.

Pruning vocabulary by word counts or by information gain.

Building and manipulating word vectors.

Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.

Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.

Scoring queries for retrieval or classification.

Writing all data structures to disk in a compact format.

Reading the document/token matrix from disk in an efficient, sparse fashion.

Performing test/train splits, and automatic classification tests.

Operating in server mode, receiving and answering queries over a socket.
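
To make two of the facilities above more concrete (mapping strings to integers, and Laplace smoothing of word probabilities), here is a minimal sketch in C. It is not libbow's API: the names `struct vocab`, `vocab_index`, `laplace_prob` and the `VOCAB_MAX` capacity are hypothetical, chosen only for illustration, and the lookup uses a plain linear search where an efficient implementation would presumably use a hash table.

```c
#include <stddef.h>
#include <string.h>

#define VOCAB_MAX 64  /* hypothetical fixed capacity for this sketch */

/* Hypothetical vocabulary: maps each distinct token to a small integer. */
struct vocab {
    const char *words[VOCAB_MAX];
    size_t size;
};

/* Return the index of `word`, adding it if it has not been seen before.
   Returns -1 if the table is full. Only the pointer is stored, so the
   caller must keep the string alive. Linear search keeps the sketch
   short; it is not meant to be efficient. */
int vocab_index(struct vocab *v, const char *word)
{
    for (size_t i = 0; i < v->size; i++) {
        if (strcmp(v->words[i], word) == 0)
            return (int)i;
    }
    if (v->size == VOCAB_MAX)
        return -1;
    v->words[v->size] = word;
    return (int)v->size++;
}

/* Laplace (add-one) estimate of P(word | class):
   (count + 1) / (total + vocab_size).
   `count` is how often the word occurs in the class, `total` is the
   number of tokens in the class. Assumes vocab_size > 0. */
double laplace_prob(unsigned count, unsigned total, size_t vocab_size)
{
    return ((double)count + 1.0) / ((double)total + (double)vocab_size);
}
```

With a two-word vocabulary and 8 tokens observed in a class, a word never seen in that class gets probability (0 + 1) / (8 + 2) = 0.1 rather than zero, which is the point of the smoothing.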


Martin Spernau
© 1994-2003
