Clustering Algorithms' Referee Package: CARP (version 3.2)

Description:

The C-package CARP is a convenient and easy tool for evaluating performance of clustering algorithms. The underlying methodology is based on first simulating Gaussian mixture models according to prespecified levels of average and maximum pairwise overlaps. The concept of overlap is defined as the sum of two misclassification probabilities (Maitra and Melnykov, 2010). Datasets are then simulated from the realized Gaussian mixtures. The software implementing this phase is called C-MixSim and can be invoked standalone. This concludes the first phase of the procedure. In the second phase, the clustering algorithm being evaluated is run on the generated datasets. We provide an example here using an agglomerative hierarchical clustering algorithm hierclust which is included. The third phase compares obtained and true groupings. By default, the comparison measure is the Adjusted Rand index of Hubert and Arabie (1985) but the user can also provide some other measure in executable form. Upon conclusion, CARP provides a distribution of the desired performance measure for the clustering method being evaluated at the preferred setting. This provides for a detailed understanding of the performance of the clustering algorithm being evaluated.

CARP is released under the GNU GPL license.

Dependencies:

gcc compiler, glibc library

Installation:

CARP can be installed in two easy steps:

- extract files from "CARP_v3.2.tar.gz":

> tar -xzvf CARP_v3.2.tar.gz

- compile files running the makefile:

> make CARP

To check the integrity of the package, run the command:

> make check

To remove the installed files, use the command:

> make clean

Usage:

> ./CARP <set of parameters>

Parameters:

-b : average overlap (no default value)

-m : maximum overlap (no default value)

-g : generalized overlap (no default value)

-p : number of dimensions (2 by default)

-G : number of mixing components at simulation stage (2 by default)

-s : spherical covariance matrix structure (non-spherical by default if option is unspecified)

-H : homogeneous covariance matrices (heterogeneous by default if option is unspecified)

-e : maximum eccentricity (0.90 by default)

-z : smallest mixing proportion (equal mixing proportions 1/K by default)

-k : minimum number of mixing components (no default value)

-K : maximum number of mixing components (no default value)

-L : lower bound for Uniform(<lower bound>, <upper bound> ) distribution from which mean vectors are generated

-U : upper bound for Uniform(<lower bound>, <upper bound> ) distribution from which mean vectors are generated

-r : maximum number of resimulations (100 by default)

-a : accuracy of estimation (1e-06 by default)

-l : maximum number of integration terms (1e06 by default)

-P : file containing mixing proportions ("Pi.dat" by default)

-M : file containing mean vectors ("Mu.dat" by default)

-S : file containing covariance matrices in triangular form ("LTSigma.dat" by default)

-D : working directory ("DATA" by default)

-N : file containing cluster sizes ("Nk.dat" by default)

-I : file containing true classifications ("idTrue.dat" by default)

-i : file containing estimated classifications ("idEst.dat" by default)

-X : file containing simulated datasets ("x.dat" by default)

-W : file containing maps of pairwise overlaps ("overMap.dat" by default)

-O : file containing characteristics of simulated mixtures ("overBarMax.dat" by default)

-R : file containing index values (Adjusted Rand index and "AR.dat" by default)

-n : number of observations generated from every mixture (0 by default)

-# : number of simulated mixtures (1 by default)

-V : name of the file (if file does not exist random transformation is applied)

-w : number of noise variables (0 by default)

-o : number of outliers for generated datasets (0 by default)

-A : name of clustering program (if this option is not specified, only the simulation stage is run)

-c : activate the command line interface with option -A (FALSE by default)

-E : name of partition analyzing program (Adjusted Rand index is used by default; program "AdjRand")

-C : activate the command line interface with option -E (FALSE by default)

-h : help

Details:

Upon launching CARP, three stages of the program are processed. The first stage is reponsible for the simulation of Gaussian mixtures with prespecified level of complexity expressed in terms of pairwise overlap and for the generation of datasets from these mixtures. The following options are related to the first stage: -b, -m, -g, -p, -G, -s, -H, -e, -z, -L, -U, -r, -a, -l, -P, -M, -S, -D, -N, -I, -X, -W, -O, -n, -#, -V, -w, -o. The option -h provides help. The output will be directed to a working folder specified by the option -D. The obtained parameters - mixing proportions, mean vectors and covariance matrices - will be saved to the files specified by options -P, -M, and -S correspondingly while samples drawn from the mixtures will be saved into the file specified by the option -X. The file provided with the option -I contains true classifications for all observations while option -N provides the name for the file with cluster sizes. In addition, the map of misclassification probabilities will be saved into the file specified by the option -W while average and maximum overlaps as well as the row and column numbers of the components that produced maximum overlap are stored in the file given by the option -C. The element with the index (i, j) in the misclassification map represents the probability that X simulated from the ith component is classified to the jth component.

At the second stage, a user-specified clustering method has to be run for the simulated datasets. For illustration, several examples involving algorithms such as hierarchical clustering with Ward's linkage (Ward, 1963) (hierclust), expectation-maximization (Dempster et al., 1977; Fraley and Raftery, 2006) (Mclust), k-means (Forgy, 1965; MacQueen, 1967) (Kmeans), and partitioning around medoids (Kaufman and Rousseuw, 1990) (PAM) are provided. Other methods can be run with the option -A. User's clustering program has to be able to accept the following options: -p, -n, -#, -D, -i, -X, -K, and also -k if the best clustering solution has to be searched between the minimum and maximum numbers of clusters specified by options -k and -K respectively. The program will obtain partitionings and write them into the file specified by the option -i. Two options are provided, one for the expert user and the other for those that are not so comfortable with writing code. For the expert user, it is assumed that (s)he is comfortable writing a small piece of code responsible for accepting the above parameters coming from the first stage. For the less experienced user not comfortable with this writing a few lines of C code, the command-line interface provides a ready alternative; this is readily done with the option -c inluded into the call of CARP. In this case, CARP will not attempt to pass the parameters between stages. The user is then reponsible for specifying these parameters within a clustering program or command line.

At the third stage, classifications derived by the clustering algorithm under investigation are compared with the true to assess performance. The adjusted Rand index (AdjRand) is incorporated as a default measure of similarity between the two classifications. An user, however, can specify other program for comparing the obtained and true partitionings. Examples of other methods that are included with this manual are through call to R packages which calculate Rand index (Rand, 1971) (Rand), bilogical homogeneity index (Datta and Datta, 2006) (BHI), and the proprotion of correctly classified observations (Diag). If an option -E is specified, the program should comply with the following options: -n, -#, -D, -I, -i, -R. Calculated values of the provided similarity measure have to be written to the file specified by the option -R. Alternatively, less experienced users can specify an option -C with similar meaning as for the option -c from the second stage: a user can provide a command line invoking a partition analyzing program.

If options -b and -m are both specified, CARP produces a mixture satisfying both characteristics, average and maximum overlap. If one option, -b, -m, or -g, is specified, a mixture satisfying the prespecified value is generated.

Simple Examples:

% simulate a 3-dimensional 4-component mixture with spherical covariance matrices, equal mixing proportions, maximum overlap 0.1 and average overlap 0.05; generate a sample of size 100 and analyze using "hierclust"

> ./CARP -m0.1 -b0.05 -p3 -G4 -s -n100 -Ahierclust

% simulate a 2-dimensional 4-component mixture with homogeneous covariance matrices, maximum overlap 0.1 and mixing proportions greater or equal than 0.10; generate a sample of size 100

> ./CARP -m0.1 -p2 -G4 -H -z0.10 -n100

% compute the overlap for the 2-dimensional 4-component mixture with parameters specified in "DATA/Pi.dat", "DATA/Mu.dat" and "DATA/LTSigma.dat"

> ./CARP -p2 -G4 -n100

% get help

> ./CARP -h

See the manual for more comprehensive examples.

Authors:

Melnykov, V. and Maitra, R.

References:

Maitra, R. and Melnykov, V. (2010) "Simulating data to study performance of finite mixture modeling and clustering algorithms", Journal of Computational and Graphical Statistics, 19(2), 354-376, 2010; doi: 10.1198/jcgs.2009.08054.

Davies, R. (1980) "The distribution of a linear combination of chi-square random variables", Applied Statistics, 29, 323-333.

Download:

CARP (from MLOSS site)

The CARP manual