Show server:search / libtextcat - j0ke.net Open Build Service

Library for text classification

Libtextcat is a library with functions that implement the classification
technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization"
[1]. It was primarily developed for language guessing, a task on which it is
known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this with the
fingerprints of a number of documents of which the categories are known. The
categories of the closest matches are output as the classification. A
fingerprint is a list of the most frequent n-grams occurring in a document,
ordered by frequency. Fingerprints are compared with a simple out-of-place
metric. See the article for more details.
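
The fingerprint and out-of-place comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not libtextcat's actual C implementation: the function names, the underscore word padding, the n-gram range (1 to 5), and the fingerprint length (400) are all assumptions chosen for the example.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from libtextcat.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(unknown, known):
    """Sum, over the n-grams of `unknown`, of how far each rank is from its
    rank in `known`; n-grams absent from `known` cost len(known)."""
    rank = {gram: i for i, gram in enumerate(known)}
    max_penalty = len(known)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(unknown)
    )


def classify(text, profiles):
    """Return the category whose fingerprint is closest to that of `text`.

    `profiles` maps a category name to a fingerprint built from known text.
    """
    fp = fingerprint(text)
    return min(profiles, key=lambda name: out_of_place(fp, profiles[name]))
```

A caller would build one fingerprint per known category from sample text, store them in a dictionary, and pass that dictionary to `classify` together with the unknown document. A lower out-of-place score means a closer match, and a score of zero means the two fingerprints are identical.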

Considerable effort went into making this implementation fast and efficient.
The language guesser processes over 100 documents/second on a simple PC, which
makes it practical for many uses. It was developed for use in our webcrawler
and search engine software, in which it handles millions of documents a day.

Authors:
--------
    Frank Scheelen

Source Files

Filename	Size	Changed
libtextcat-2.2.tar.gz	528 KB	about 16 years ago
libtextcat.changes	296 Bytes	about 16 years ago
libtextcat.spec	4.04 KB	almost 15 years ago

Latest Revision

hostmaster committed almost 15 years ago (revision 6)

update
