Mining Word Senses from Text for Corpus-Based Lexicography
Canasai Kruengkrai, Thatsanee Charoenporn, Virach Sornlertlamvanich, and Hitoshi Isahara
Abstract
This paper discusses the problem of automated lexicography. In the corpus-based approach, a lexicographer has to manually group the contexts of a target word into clusters in order to identify its word senses. When a large number of contexts are given, this process becomes tedious and time-consuming. To overcome this problem, we propose an efficient technique based on unsupervised clustering. We present a spherical Gaussian EM algorithm that is enhanced with a robust initialization method based on Principal Component Analysis. The resulting clusters provide a structure for analyzing the underlying senses of the target word found in a text corpus. Experimental results on two different data sets of polysemous words indicate that our proposed algorithm is a promising technique for corpus-based lexicography.
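As a rough illustration of the kind of method the abstract names, the sketch below fits a mixture of spherical Gaussians with EM and seeds the cluster means using the first principal component. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not describe the initialization in detail, so the scheme used here (sort the context vectors along the first principal component, cut them into k equal-size groups, and take the group means) is an assumption, as are the function names `pca_init` and `spherical_gaussian_em` and the parameters `n_iter` and `eps`. Each row of `X` is assumed to be a feature vector for one context of the target word.

```python
import numpy as np

def pca_init(X, k):
    # Assumed initialization (not taken from the paper): order points along
    # the first principal component and average k equal-size groups.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]                      # coordinate on first principal axis
    groups = np.array_split(np.argsort(proj), k)
    return np.vstack([X[g].mean(axis=0) for g in groups])

def spherical_gaussian_em(X, k, n_iter=50, eps=1e-6):
    n, d = X.shape
    means = pca_init(X, k)
    variances = np.full(k, X.var() + eps)  # one scalar variance per cluster
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities from log densities of spherical Gaussians.
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = (np.log(weights)
                 - 0.5 * d * np.log(2 * np.pi * variances)
                 - 0.5 * sq / variances)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and per-cluster variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq).sum(axis=0) / (d * nk) + eps
    # Hard assignment: each context goes to its most probable cluster.
    return resp.argmax(axis=1), means
```

In this sketch, the returned labels would play the role of candidate sense groupings that a lexicographer could then inspect; the number of clusters `k` has to be supplied by the caller.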