Language Identification Based on String Kernels
Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara
Abstract
In this paper, we propose a novel approach to automatically identifying the language of a given text, based on the concept of string kernels. Our approach identifies the language directly from the text, regardless of its character encoding. In particular, we view the text at a finer granularity, as a string of bytes rather than characters. The similarity between two strings can then be computed implicitly through an efficient dynamic alignment using suffix trees. We provide empirical evidence that string kernels perform well on the language identification problem with two different kernel classifiers: a kernelized version of the centroid-based method and support vector machines. Our experiments use a data set of reasonable scale in terms of the number of languages to be identified, covering 17 different languages.
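To make the idea concrete, the following is a minimal sketch of byte-level kernel classification, not the implementation from the paper. It assumes a fixed-length byte n-gram (spectrum) kernel computed by explicit counting in place of the suffix-tree computation mentioned above, and it pairs that kernel with a kernelized centroid rule. All function names, the n-gram length of 3, and the toy training sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def byte_ngrams(text: str, n: int = 3, encoding: str = "utf-8") -> Counter:
    """Count the byte n-grams of a text (the text is viewed as raw bytes)."""
    data = text.encode(encoding)
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def spectrum_kernel(a: Counter, b: Counter) -> float:
    """Inner product of two n-gram count vectors (assumed kernel, see lead-in)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller profile
    return float(sum(count * b[gram] for gram, count in a.items()))


def normalized_kernel(a: Counter, b: Counter) -> float:
    """Cosine-normalized kernel, so that text length does not dominate."""
    denom = sqrt(spectrum_kernel(a, a) * spectrum_kernel(b, b))
    return spectrum_kernel(a, b) / denom if denom else 0.0


def centroid_distance_sq(x: Counter, members: list) -> float:
    """Squared feature-space distance from x to the mean of a class.

    Expanding ||phi(x) - mu||^2 needs only kernel evaluations:
    k(x, x) - (2/m) * sum_i k(x, s_i) + (1/m^2) * sum_ij k(s_i, s_j).
    """
    m = len(members)
    cross = sum(normalized_kernel(x, s) for s in members) / m
    within = sum(normalized_kernel(s, t) for s in members for t in members) / (m * m)
    return normalized_kernel(x, x) - 2.0 * cross + within


def identify(text: str, training: dict) -> str:
    """Return the label whose class centroid is closest to the text."""
    x = byte_ngrams(text)
    return min(training, key=lambda lang: centroid_distance_sq(x, training[lang]))


# Toy training data, invented for illustration only.
TRAINING = {
    "en": [
        byte_ngrams("the quick brown fox jumps over the lazy dog"),
        byte_ngrams("this is a short english sentence"),
    ],
    "de": [
        byte_ngrams("der schnelle braune Fuchs springt über den faulen Hund"),
        byte_ngrams("das ist ein kurzer deutscher Satz"),
    ],
}

if __name__ == "__main__":
    for sample in ("the quick brown fox", "der schnelle braune Fuchs"):
        print(sample, "->", identify(sample, TRAINING))
```

Because the profiles are built from encoded bytes, a multi-byte character such as "ü" simply contributes several bytes to the n-grams, and no decoding step or encoding detection is required. Explicit counting is adequate for short texts; a suffix-tree-based computation, as the abstract describes, would be the choice when matching substrings of all lengths efficiently.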
Download: pdf, ps
Demo: LIBS (Language Identifier on Byte String)