LIBS
Language Identifier on Byte String
LIBS is a trainable software module that identifies the language, script, and encoding scheme of written text. It views each text at a finer granularity, as a string of bytes.
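To illustrate what viewing text as a string of bytes means, here is a minimal Python sketch. It is not taken from LIBS; the helper name `byte_view` is hypothetical. It shows that the same character produces different byte strings under different encoding schemes, which is why a byte-level view carries information about the encoding as well as the language.

```python
def byte_view(text: str, encoding: str) -> list:
    """Return the text as a list of byte values under the given encoding.

    Hypothetical helper for illustration only; not part of LIBS.
    """
    return list(text.encode(encoding))


# The same character yields different byte strings in different encodings.
utf8_bytes = byte_view("é", "utf-8")      # two bytes in UTF-8
latin1_bytes = byte_view("é", "latin-1")  # one byte in Latin-1

print(utf8_bytes, latin1_bytes)
```

A classifier that works on such byte sequences does not need to decode the text first, so it can be applied before the encoding is known.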
Features
- Trains easily with small amounts of data
- Uses learning algorithms based on string kernels, which efficiently compute the similarity between two texts
- Accelerates the kernel computation with a suffix tree data structure
- Identifies more than 85 languages
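As a rough illustration of a string kernel on byte strings, the Python sketch below computes a simple n-gram (spectrum) kernel and uses it to pick the most similar training text. It is a sketch under assumptions, not the LIBS implementation: the page does not say which string kernel or learning algorithm LIBS uses, all function names are hypothetical, and plain n-gram counting stands in for the suffix tree acceleration.

```python
import math
from collections import Counter
from typing import Mapping


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every contiguous n-byte substring of data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def spectrum_kernel(a: bytes, b: bytes, n: int = 3) -> float:
    """Sum, over n-grams shared by a and b, of the product of their counts."""
    ca, cb = byte_ngrams(a, n), byte_ngrams(b, n)
    if len(ca) > len(cb):  # iterate over the smaller table
        ca, cb = cb, ca
    return float(sum(count * cb[gram] for gram, count in ca.items()))


def normalized_kernel(a: bytes, b: bytes, n: int = 3) -> float:
    """Scale the kernel to [0, 1] so that text length does not dominate."""
    k_ab = spectrum_kernel(a, b, n)
    if k_ab == 0.0:
        return 0.0
    return k_ab / math.sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))


def identify(sample: bytes, training: Mapping[str, bytes], n: int = 3) -> str:
    """Return the label whose training text is most similar to the sample.

    Assumption: a nearest-neighbour rule stands in for the trained learner.
    """
    return max(training, key=lambda label: normalized_kernel(sample, training[label], n))


# Tiny made-up training set, for illustration only.
training = {
    "en": b"the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par-dessus le chien paresseux et le chat".encode("utf-8"),
}
print(identify(b"the dog and the fox", training))
```

Counting n-grams this way takes time proportional to the text length for a fixed n. A suffix tree, as named in the feature list, is one way to make computations over substrings of all lengths efficient.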
Applications
- Monitoring the use of languages on the World Wide Web (part of the WLE project)
- Classifying web pages for better indexing and searching in a multilingual search engine
- Verifying the source language for a machine translation system
Demo
Source Code
Canasai Kruengkrai