Show server:search / libtextcat - j0ke.net Open Build Service

Library for text classification

Libtextcat is a library with functions that implement the classification
technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization"
[1]. It was primarily developed for language guessing, a task on which it is
known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this with the
fingerprints of a number of documents of which the categories are known. The
categories of the closest matches are output as the classification. A
fingerprint is a list of the most frequent n-grams occurring in a document,
ordered by frequency. Fingerprints are compared with a simple out-of-place
metric. See the article for more details.
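
The fingerprint and out-of-place comparison described above can be sketched
roughly as follows. This is a minimal illustration in Python, not libtextcat's
own C code or API. The function names, the "_" word padding, the n-gram lengths
1 to 5, the fingerprint size of 400, and the maximum penalty for missing
n-grams are all assumptions made for the example. It also returns only the
single closest category, rather than several close matches.

```python
from collections import Counter

def fingerprint(text, max_n=5, top=400):
    """Return the `top` most frequent n-grams (lengths 1..max_n), most frequent first.

    max_n, top and the "_" padding are illustrative assumptions.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency; ties broken alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each rank is from its rank
    in the category fingerprint. Missing n-grams pay an assumed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )

def classify(text, category_fps):
    """Return the category whose fingerprint is closest to the text's fingerprint."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda cat: out_of_place(doc_fp, category_fps[cat]))
```

Here category_fps would be a dictionary mapping each known category (for
example a language name) to the fingerprint of a sample document in that
category. A lower out-of-place distance means a closer match.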

Considerable effort went into making this implementation fast and efficient.
The language guesser processes over 100 documents/second on a simple PC, which
makes it practical for many uses. It was developed for use in our webcrawler
and search engine software, in which it handles millions of documents a day.

Authors:
--------
    Frank Scheelen

Source Files

Filename	Size	Changed
libtextcat-2.2.tar.gz	528 KB	about 16 years ago
libtextcat.changes	296 Bytes	about 16 years ago

Revision 2 (latest revision is 6)

unknown committed about 16 years ago (revision 2)
