PyTextCat

PyTextCat guesses the language of an input text, choosing from over 70 languages.

It implements the classification technique described by William B. Cavnar and John M. Trenkle (1994) in "N-Gram-Based Text Categorization", and is based on TextCat, Gertjan van Noord's Perl implementation.
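
The ranking technique from the paper can be sketched as follows. This is a minimal illustration under stated assumptions, not PyTextCat's actual code or API: the function names `ngram_profile`, `out_of_place` and `guess_language`, the `_` padding character, the 300-entry profile cutoff and the toy training sentences are all chosen for this example. Each language is represented by its most frequent character n-grams in rank order, and a text is assigned to the language whose ranking differs least from its own.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries (assumed padding char)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); real profiles come from much larger corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(guess_language("the lazy dog jumps", profiles))
```

With profiles this small the guess is only indicative; the method relies on profiles built from enough text for the frequent n-grams to stabilise.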

Both a Python library and a command-line interface are provided.

PyTextCat is released under the GPLv2 (see COPYING). The lm files and test texts come from TextCat and are therefore licensed under the LGPLv2.1 (see COPYING.LGPL). The source code is available on GitHub.

Demo

Note that this online demo probably does not yet handle text containing Unicode characters properly.


Page last updated: 2009-11-04
