Index of /canasai/software/omniclusterer/omniclusterer-beta-0.2
Name Last modified Size Description
COPYING 26-Jan-2006 15:35 18K
EM.conf 24-Mar-2006 20:16 1.5K
IB.conf 24-Mar-2006 20:25 1.4K
KM.conf 24-Mar-2006 20:20 1.5K
PDDP.conf 24-Mar-2006 20:12 2.0K
build.xml 09-May-2007 16:58 2.1K
build/ 04-Jan-2006 14:25 -
doc/ 09-May-2007 18:09 -
lib/ 09-May-2007 17:00 -
src/ 04-Jan-2006 14:25 -
test-problems/ 04-Jan-2006 15:22 -
OMNICLUSTERER
-------------
OMNICLUSTERER is an integrated collection of Java source code for data
clustering. It implements several standard clustering algorithms,
including:
- the PDDP algorithm
- the k-means algorithm
- the spherical Gaussian EM algorithm
- the sequential information bottleneck algorithm
- and many combinations of these algorithms to improve clustering
performance.
It is designed to read input matrices in several file formats.
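As an illustration of the kind of algorithm listed above, the following is a
minimal k-means sketch on a small dense matrix whose rows are documents. It is
not the omniclusterer implementation: the class name `KMeansSketch', the
seeding rule (first k rows), and the sample data are assumptions made for this
example, and it targets a current JDK rather than JDK 1.4.

```java
import java.util.Arrays;

class KMeansSketch {
    /** Runs Lloyd's k-means on the rows of data; returns a cluster index per row. */
    static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        // Assumption: seed the centroids with the first k rows (simple, not robust).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no row changed cluster
            // Update step: each centroid becomes the mean of its members.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < d; j++) sums[assign[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        // Hypothetical 4-document, 3-term count matrix.
        double[][] docs = { {5, 0, 1}, {0, 4, 0}, {4, 1, 0}, {0, 5, 1} };
        System.out.println(Arrays.toString(cluster(docs, 2, 100))); // prints [0, 1, 0, 1]
    }
}
```

Seeding with the first k rows keeps the sketch deterministic; a real run would
normally use a better initialization.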
About the package
-----------------
See the `doc' directory. Short descriptions of the packages follow.
problem - read and parse an input file in several formats, perform
matrix operations on the input data such as matrix
multiplication
utilities - perform common vector operations, and show clustering results
lanczos - calculate the first left and right singular vectors of a
matrix using the Lanczos algorithm
statistics - calculate statistics such as BIC and AD
pddp - the implementation of the Principal Direction Divisive
Partitioning (PDDP) algorithm
kmeans - the implementation of the k-means algorithm
em - the implementation of the EM algorithm
ib - the implementation of the sIB algorithm
comb - several combinations of the standard algorithms
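To show how the `lanczos' and `pddp' pieces relate, here is a rough sketch of
a single PDDP split: center the documents, find the leading principal
direction, and divide the documents by the sign of their projection onto it.
It is only an illustration under stated assumptions. Plain power iteration
stands in for the Lanczos solver, documents are taken to be matrix rows, and
the names `PddpSplitSketch', `split', and `project' are hypothetical, not part
of the package.

```java
import java.util.Arrays;

class PddpSplitSketch {
    /** Stores dot(a[i], v) in out[i] for every row i. */
    static void project(double[][] a, double[] v, double[] out) {
        for (int i = 0; i < a.length; i++) {
            double s = 0;
            for (int j = 0; j < v.length; j++) s += a[i][j] * v[j];
            out[i] = s;
        }
    }

    /** Returns, per row, whether its projection on the principal direction is positive. */
    static boolean[] split(double[][] data, int iterations) {
        int n = data.length, d = data[0].length;
        // Centroid of all rows.
        double[] mean = new double[d];
        for (double[] row : data)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        // Centered copy A; its leading right singular vector is the principal direction.
        double[][] a = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) a[i][j] = data[i][j] - mean[j];
        // Power iteration on A^T A (assumption: used here instead of Lanczos).
        double[] v = new double[d];
        for (int j = 0; j < d; j++) v[j] = 1.0 / (j + 1); // arbitrary non-zero start
        double[] proj = new double[n];
        for (int it = 0; it < iterations; it++) {
            project(a, v, proj);
            double[] next = new double[d];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < d; j++) next[j] += a[i][j] * proj[i];
            double norm = 0;
            for (double x : next) norm += x * x;
            norm = Math.sqrt(norm);
            if (norm == 0) break; // all rows identical: nothing to split
            for (int j = 0; j < d; j++) v[j] = next[j] / norm;
        }
        // Split by the sign of each row's projection onto the direction.
        project(a, v, proj);
        boolean[] positiveSide = new boolean[n];
        for (int i = 0; i < n; i++) positiveSide[i] = proj[i] > 0;
        return positiveSide;
    }

    public static void main(String[] args) {
        // Hypothetical 4-document, 3-term count matrix.
        double[][] docs = { {5, 0, 1}, {0, 4, 0}, {4, 1, 0}, {0, 5, 1} };
        System.out.println(Arrays.toString(split(docs, 50)));
    }
}
```

The sign of a singular vector is arbitrary, so only the grouping is
meaningful: on the sample data, documents 0 and 2 land on one side and
documents 1 and 3 on the other.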
INSTALLATION
------------
(1) You need JDK 1.4 or later and Apache Ant to compile.
See http://java.sun.com/j2se/ and http://ant.apache.org/ for
download details.
(2) In the omniclusterer base directory, type:
ant
(3) To test that the program can read the problem file properly, type:
java -cp lib/omniclusterer.jar:. org.tcllab.clustering.problem.TestProblem
(4) To test that the SVD solver works well, type:
java -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.lanczos.TestLanczos
QUICK START
-----------
(1) Preprocessing
You may use the BOW toolkit to build the doc-term matrix from raw text
files. See more details at: http://www.cs.cmu.edu/~mccallum/bow/rainbow/.
Suppose you have also downloaded the `20_newsgroups' dataset for experiments.
Now you have three directories:
+-- your parent dir
+-- omniclusterer
+-- bow-20020213
+-- 20_newsgroups
To build a doc-term matrix, first type (in the bow base directory):
./rainbow -d ./model -h --index ../20_newsgroups/talk.politics.*
This creates the document model for every newsgroup directory matching
`talk.politics.*' and stores that model in the directory `model'. The option
`-h' means `skip the header of the document before tokenization'.
Then, to extract the matrix and keep it in a file, type:
./rainbow -d ./model --print-matrix=n > ../omniclusterer/test-problems/talk-politics
You may reduce the dimension of the matrix by removing infrequent words
with rainbow's pruning options. For example, type:
./rainbow -d ./model -h -D 1 -O 4 --index ../20_newsgroups/talk.politics.*
This replaces the old model. Then type:
./rainbow -d ./model --print-matrix=n > ../omniclusterer/test-problems/talk-politics-reduced
The option `-D' means `Remove words that occur in N or fewer documents',
and `-O' means `Remove words that occur less than N times'. For more
details about the options, see Section 4.5 of the `rainbow' documentation.
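To make the two pruning rules concrete, the sketch below applies them to a
small in-memory count matrix (rows are documents, columns are words). It does
not call rainbow or read its output format. The class and method names are
hypothetical, and it assumes that `occur less than N times' refers to the
total count over all documents.

```java
import java.util.ArrayList;
import java.util.List;

class PruneVocabularySketch {
    /** Returns the indices of the columns (words) that survive both thresholds. */
    static List<Integer> keptColumns(int[][] counts, int maxDocs, int minOccurrences) {
        List<Integer> kept = new ArrayList<>();
        if (counts.length == 0) return kept;
        int words = counts[0].length;
        for (int j = 0; j < words; j++) {
            int docFreq = 0; // number of documents containing word j
            int total = 0;   // occurrences of word j over all documents
            for (int[] doc : counts) {
                if (doc[j] > 0) docFreq++;
                total += doc[j];
            }
            // -D rule: drop words that occur in maxDocs or fewer documents.
            // -O rule: drop words that occur fewer than minOccurrences times
            //          (assumed to mean the total over the whole collection).
            if (docFreq > maxDocs && total >= minOccurrences) kept.add(j);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 3 documents, 5 words.
        int[][] counts = {
            {3, 1, 0, 2, 1},
            {2, 0, 0, 1, 2},
            {0, 0, 7, 1, 0},
        };
        // Thresholds matching `-D 1 -O 4'.
        System.out.println(keptColumns(counts, 1, 4)); // prints [0, 3]
    }
}
```

In the sample, words 1 and 2 each appear in a single document and fail the
`-D 1' rule, while word 4 appears in two documents but only three times and
fails the `-O 4' rule.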
If you need the word list, copy `./model/vocabulary' to the target directory:
cp model/vocabulary ../omniclusterer/test-problems/talk-politics-reduced.feat
(2) Once you have the doc-term matrix, you can find its hidden
cluster structure with omniclusterer.
To test PDDP, first edit the settings in the file `PDDP.conf'. Then
type (in the omniclusterer base directory):
java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.pddp.PDDP PDDP.conf
(3) To test the other clustering algorithms, edit their configuration
files to suit your purpose.
For EM, type:
java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.em.EM EM.conf
For k-means, type:
java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.kmeans.KM KM.conf
For IB (extremely slow, but gives good results), type:
java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.ib.IB IB.conf
For PDDP+EM, type:
java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.comb.PDDPEM PDDP.conf 0 1
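The commands above use `:' as the classpath separator, which is the Unix
convention; Windows uses `;'. As a small sketch, the same command line can be
assembled portably from Java with `File.pathSeparator'. The class name
`LaunchCommandSketch' and its methods are hypothetical helpers for this
example; the jar names, heap options, and main class names are the ones
given above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LaunchCommandSketch {
    /** Builds the java command line for one of the clustering main classes. */
    static List<String> command(String mainClass, String... args) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        String classpath = String.join(File.pathSeparator,
                "lib/omniclusterer.jar", "lib/Jama-1.0.1.jar", ".");
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-Xms256m", "-Xmx256m", "-cp", classpath, mainClass));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    /** Runs the command in the current directory and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) from the omniclusterer
        // base directory to actually start the clustering job.
        System.out.println(String.join(" ",
                command("org.tcllab.clustering.kmeans.KM", "KM.conf")));
    }
}
```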