Thai Text Document Clustering Using Parallel Spherical K-means Algorithm on PIRUN Linux Cluster
Canasai Kruengkrai and Chuleerat Jaruskulchai
Abstract
Document clustering is the process of grouping similar or related documents into classes. Many sequential algorithms have been developed to deal with document clustering problems. However, these algorithms are too slow when applied to large document collections. In this paper, we propose a parallel algorithm for clustering text documents based on spherical k-means. We implemented our algorithm on the PIRUN Linux Cluster, a parallel computer built with cluster computing technology. The data set consists of 4,800 articles taken from a Thai newspaper. Experimental results show that the use of a parallel algorithm can significantly improve clustering performance. Furthermore, we find that our algorithm remains effective when the problem size is scaled up.
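For readers unfamiliar with spherical k-means, the sequential core of the technique can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dense list-of-lists input format, the random initialization, and the fixed iteration cap are all assumptions made for the example. Documents are normalized to unit length, assigned to the centroid with the highest cosine similarity, and each centroid is recomputed as the normalized sum of its members.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Cluster term-weight vectors by cosine similarity.

    docs: list of equal-length numeric vectors (hypothetical input format).
    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    unit_docs = [normalize(d) for d in docs]
    # Assumption: initialize centroids from k randomly chosen documents.
    centroids = [list(d) for d in rng.sample(unit_docs, k)]
    labels = [-1] * len(unit_docs)
    for _ in range(iterations):
        # Assignment step: choose the centroid with the highest cosine similarity.
        new_labels = [
            max(range(k), key=lambda j: dot(d, centroids[j])) for d in unit_docs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not change
        labels = new_labels
        # Update step: sum each cluster's members, then renormalize to unit length.
        for j in range(k):
            members = [d for d, lab in zip(unit_docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids
```

Calling `spherical_kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], k=2)` groups the first two vectors together and the last two together. The assignment step treats each document independently, which is what makes the method a natural candidate for data-parallel execution; how the work is actually divided on the PIRUN cluster is not described in this abstract.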