Index of /canasai/software/omniclusterer/omniclusterer-beta-0.2

Icon   Name              Last modified      Size  Description
[DIR]  Parent Directory                     -
[TXT]  COPYING           26-Jan-2006 15:35  18K
[TXT]  EM.conf           24-Mar-2006 20:16  1.5K
[TXT]  IB.conf           24-Mar-2006 20:25  1.4K
[TXT]  KM.conf           24-Mar-2006 20:20  1.5K
[TXT]  PDDP.conf         24-Mar-2006 20:12  2.0K
[TXT]  build.xml         09-May-2007 16:58  2.1K
[DIR]  build/            04-Jan-2006 14:25  -
[DIR]  doc/              09-May-2007 18:09  -
[DIR]  lib/              09-May-2007 17:00  -
[DIR]  src/              04-Jan-2006 14:25  -
[DIR]  test-problems/    04-Jan-2006 15:22  -

OMNICLUSTERER
-------------

OMNICLUSTERER is an integrated collection of Java source code for data
clustering. It implements several standard clustering algorithms,
including:
  - the PDDP algorithm
  - the k-means algorithm
  - the spherical Gaussian EM algorithm
  - the sequential information bottleneck algorithm
  - and many combinations of these algorithms to improve clustering
    performance.
It is designed to work with several matrix-based input formats.
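
As a rough illustration of one of the algorithms listed above, here is a
minimal k-means (Lloyd's algorithm) sketch. It is not taken from the
omniclusterer sources: the class `KMeansSketch' and its methods are
hypothetical names, it uses dense arrays for brevity, and it is written
in current Java syntax.

```java
import java.util.Arrays;

/** Hypothetical k-means (Lloyd's algorithm) sketch; not the omniclusterer code. */
final class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to x. */
    static int nearest(double[] x, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(x, centroids[c]) < squaredDistance(x, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Iterates until assignments stop changing or maxIter is reached. */
    static int[] cluster(double[][] data, double[][] initialCentroids, int maxIter) {
        int k = initialCentroids.length;
        int dim = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = initialCentroids[c].clone();
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: move each row to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int c = nearest(data[i], centroids);
                if (c != assign[i]) { assign[i] = c; changed = true; }
            }
            if (!changed) break;

            // Update step: each centroid becomes the mean of its rows.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < dim; j++) sums[assign[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster: keep old centroid
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        double[][] init = { {0, 0}, {10, 10} };
        System.out.println(Arrays.toString(cluster(data, init, 100)));
    }
}
```

The initial centroids are passed in explicitly so that the result is
deterministic; how omniclusterer chooses its starting points is set in
its configuration files and is not modelled here.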

About the package
-----------------

See the `doc' directory. Short descriptions of the packages follow.
problem    - read and parse an input file in several formats, perform
             matrix operations on the input data such as matrix
             multiplication
utilities  - perform common vector operations, and show clustering results
lanczos    - calculate the first left and right singular vectors of a
             matrix using the Lanczos algorithm
statistics - calculate statistics such as BIC and AD
pddp       - the implementation of the Principal Direction Divisive
             Partitioning (PDDP) algorithm
kmeans     - the implementation of the k-means algorithm
em         - the implementation of the EM algorithm
ib         - the implementation of the sIB algorithm
comb       - several combinations of the standard algorithms
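
To make the roles of the `lanczos' and `pddp' packages more concrete,
the sketch below performs a single PDDP-style split: it centers the rows
of a matrix, estimates the principal direction, and separates the rows
by the sign of their projection onto it. This is an assumption-laden
illustration, not the package's code. In particular it substitutes plain
power iteration for the Lanczos algorithm, treats rows as documents, and
uses the hypothetical class name `PddpSplitSketch'.

```java
/** Hypothetical sketch of one PDDP split; not the omniclusterer implementation. */
final class PddpSplitSketch {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Subtracts the column means so that the rows are centered at the origin. */
    static double[][] center(double[][] rows) {
        int n = rows.length, d = rows[0].length;
        double[] mean = new double[d];
        for (double[] row : rows)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        double[][] centered = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) centered[i][j] = rows[i][j] - mean[j];
        return centered;
    }

    /**
     * Estimates the leading right singular vector of the centered matrix A
     * by power iteration on A^T A (a stand-in for the Lanczos algorithm).
     */
    static double[] principalDirection(double[][] centered, int iterations) {
        int d = centered[0].length;
        double[] v = new double[d];
        for (int j = 0; j < d; j++) v[j] = 1.0 / (j + 1); // arbitrary nonzero start
        for (int it = 0; it < iterations; it++) {
            double[] w = new double[d];
            for (double[] row : centered) {
                double p = dot(row, v);                         // i-th entry of A v
                for (int j = 0; j < d; j++) w[j] += p * row[j]; // accumulate A^T (A v)
            }
            double norm = Math.sqrt(dot(w, w));
            if (norm == 0.0) break; // degenerate data: keep the current vector
            for (int j = 0; j < d; j++) v[j] = w[j] / norm;
        }
        return v;
    }

    /** Marks each row true if its projection on the principal direction is positive. */
    static boolean[] split(double[][] rows, int iterations) {
        double[][] centered = center(rows);
        double[] v = principalDirection(centered, iterations);
        boolean[] positiveSide = new boolean[rows.length];
        for (int i = 0; i < rows.length; i++) {
            positiveSide[i] = dot(centered[i], v) > 0.0;
        }
        return positiveSide;
    }

    public static void main(String[] args) {
        double[][] docs = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(java.util.Arrays.toString(split(docs, 50)));
    }
}
```

The sign of a singular vector is arbitrary, so only the grouping of the
rows is meaningful, not which group is labelled true.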


INSTALLATION
------------
(1) You need JDK 1.4 or above and ANT to compile.
    See http://java.sun.com/j2se/ and http://ant.apache.org/ for
    download details.

(2) In the omniclusterer base directory, type:

ant

(3) To test that the program can read the problem file properly, type:

java -cp lib/omniclusterer.jar:. org.tcllab.clustering.problem.TestProblem

(4) To test that the SVD solver works well, type:

java -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.lanczos.TestLanczos


QUICK START
-----------

(1) Preprocessing

You may use the BOW toolkit to build the doc-term matrix from raw text
files. See more details at: http://www.cs.cmu.edu/~mccallum/bow/rainbow/.
Suppose you have also downloaded the `20_newsgroups' dataset for
experiments. You now have three directories:
+-- your parent dir
      +-- omniclusterer
      +-- bow-20020213
      +-- 20_newsgroups

To build a doc-term matrix, first type (in the bow base directory):

./rainbow -d ./model -h --index ../20_newsgroups/talk.politics.*

This creates a document model from all directories matching
`talk.politics.*' and stores it in the directory `model'. The option
`-h' means `skip the header of the document before tokenization'.
Then, to extract the matrix and save it to a file, type:

./rainbow -d ./model --print-matrix=n > ../omniclusterer/test-problems/talk-politics
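
For readers who want to inspect the extracted file, the following sketch
parses one line of such a matrix. It rests on an unverified assumption
about the layout: each line holds a document name, a class label, and
then pairs of integer word index and count, all separated by whitespace.
Check this against your actual rainbow output before relying on it. The
class `SparseRowSketch' is hypothetical, uses current Java syntax, and is
unrelated to the parser in the `problem' package.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for one sparse document row; layout is an assumption. */
final class SparseRowSketch {
    final String name;
    final String label;
    final Map<Integer, Integer> counts = new LinkedHashMap<>();

    SparseRowSketch(String name, String label) {
        this.name = name;
        this.label = label;
    }

    /** Parses "name label index count index count ..." (assumed layout). */
    static SparseRowSketch parse(String line) {
        String[] tok = line.trim().split("\\s+");
        if (tok.length < 2 || (tok.length - 2) % 2 != 0) {
            throw new IllegalArgumentException("Malformed row: " + line);
        }
        SparseRowSketch row = new SparseRowSketch(tok[0], tok[1]);
        for (int i = 2; i < tok.length; i += 2) {
            // Each pair is (word index, occurrence count).
            row.counts.put(Integer.parseInt(tok[i]), Integer.parseInt(tok[i + 1]));
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up example line, not real rainbow output.
        SparseRowSketch row = parse("doc-0001 talk.politics.misc  12 1  45 3");
        System.out.println(row.name + " " + row.label + " " + row.counts);
    }
}
```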

You may reduce the dimension of the matrix by removing infrequent
words with rainbow's pruning options. For example, type:

./rainbow -d ./model -h -D 1 -O 4 --index ../20_newsgroups/talk.politics.*

This replaces the old model. Then type:

./rainbow -d ./model --print-matrix=n > ../omniclusterer/test-problems/talk-politics-reduced

The option `-D' means `Remove words that occur in N or fewer documents',
and `-O' means `Remove words that occur less than N times'. For more
details about these options, see Section 4.5 of the `rainbow'
documentation.
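
The two pruning rules quoted above can be sketched as follows. This is
only an illustration of the stated rules, not rainbow's code: it assumes
that a word is kept only if it passes both tests, and the class
`VocabPruneSketch' and its parameter names are hypothetical. It is
written in current Java syntax.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of pruning by document count and occurrence count. */
final class VocabPruneSketch {

    /**
     * docs holds one word-to-count map per document.
     * A word is dropped if it occurs in maxDocCount or fewer documents (like -D),
     * or if it occurs fewer than minOccurrences times in total (like -O).
     */
    static Set<String> keptWords(List<Map<String, Integer>> docs,
                                 int maxDocCount, int minOccurrences) {
        Map<String, Integer> docFreq = new HashMap<>();
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                docFreq.merge(e.getKey(), 1, Integer::sum);          // documents containing the word
                total.merge(e.getKey(), e.getValue(), Integer::sum); // occurrences over all documents
            }
        }
        Set<String> kept = new TreeSet<>();
        for (String word : total.keySet()) {
            if (docFreq.get(word) > maxDocCount && total.get(word) >= minOccurrences) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("gun", 3); d1.put("law", 1); d1.put("rare", 9);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("gun", 2); d2.put("tax", 1);
        Map<String, Integer> d3 = new HashMap<>();
        d3.put("law", 1); d3.put("tax", 5);
        // Same thresholds as the example command: -D 1 -O 4
        System.out.println(keptWords(List.of(d1, d2, d3), 1, 4));
    }
}
```

With the thresholds from the example command, `rare' is dropped because
it appears in only one document, and `law' is dropped because it occurs
only twice in total.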

If you also need the word list, copy `./model/vocabulary' to the target directory:

cp model/vocabulary ../omniclusterer/test-problems/talk-politics-reduced.feat


(2) Once you have the doc-term matrix, you can find its hidden
    cluster structure with omniclusterer.

To test PDDP, first set the options you need in the file `PDDP.conf'. Then
type (in the omniclusterer base directory):

java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.pddp.PDDP PDDP.conf

(3) To test the other clustering algorithms, edit their configuration
    files as needed.

For EM, type:

java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.em.EM EM.conf

For k-means, type:

java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.kmeans.KM KM.conf

For IB (extremely slow, but gives good results), type:

java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.ib.IB IB.conf

For PDDP+EM, type:

java -Xms256m -Xmx256m -cp lib/omniclusterer.jar:lib/Jama-1.0.1.jar:. org.tcllab.clustering.comb.PDDPEM PDDP.conf 0 1