Clustering Algorithms' Referee Package: CARP
is a convenient and easy tool for evaluating performance of clustering algorithms.
The underlying methodology is based on first simulating Gaussian mixture models
according to prespecified levels of average and maximum pairwise overlaps.
The concept of overlap is defined as the sum of two misclassification probabilities
Datasets are then simulated from the realized Gaussian mixtures. The software
implementing this phase is called C-MixSim and can be invoked standalone.
This concludes the first phase of the procedure. In the second phase, the clustering
algorithm being evaluated is run on the generated datasets. We provide an example
here using an agglomerative hierarchical clustering algorithm hierclust which is included.
The third phase compares obtained and true groupings. By default, the comparison
measure is the Adjusted Rand index of Hubert and Arabie (1985) but the user
can also provide some other measure in executable form. Upon conclusion,
CARP provides a distribution of the desired
performance measure for the clustering method being evaluated at the preferred setting. This provides for a detailed
understanding of the performance of the clustering algorithm being evaluated.
released under the GNU GPL license.
gcc compiler, glibc library
be installed in two easy steps:
- extract files from "CARP_v3.2.tar.gz":
> tar -xzvf CARP_v3.2.tar.gz
- compile files running the makefile:
> make CARP
To check the integrity of the package, run the command:
> make check
To remove the installed files, use the command:
> make clean
> ./CARP <set of parameters>
-b : average overlap (no default value)
-m : maximum overlap (no default value)
-g : generalized overlap (no default value)
-p : number of dimensions (2 by default)
-G : number of mixing components at simulation stage (2 by default)
-s : spherical covariance matrix structure (non-spherical by default if option is unspecified)
-H : homogeneous covariance matrices (heterogeneous by default if option is unspecified)
-e : maximum eccentricity (0.90 by default)
-z : smallest mixing proportion (equal mixing proportions 1/K by default)
-k : minimum number of mixing components (no default value)
-K : maximum number of mixing components (no default value)
-L : lower bound for Uniform(<lower bound>, <upper bound> ) distribution from which mean vectors are generated
-U : upper bound for Uniform(<lower bound>, <upper bound> ) distribution from which mean vectors are generated
-r : maximum number of resimulations (100 by default)
-a : accuracy of estimation (1e-06 by default)
-l : maximum number of integration terms (1e06 by default)
-P : file containing mixing proportions ("Pi.dat" by default)
-M : file containing mean vectors ("Mu.dat" by default)
-S : file containing covariance matrices in triangular form ("LTSigma.dat" by default)
-D : working directory ("DATA" by default)
-N : file containing cluster sizes ("Nk.dat" by default)
-I : file containing true classifications ("idTrue.dat" by default)
-i : file containing estimated classifications ("idEst.dat" by default)
-X : file containing simulated datasets ("x.dat" by default)
-W : file containing maps of pairwise overlaps ("overMap.dat" by default)
-O : file containing characteristics of simulated mixtures ("overBarMax.dat" by default)
-R : file containing index values (Adjusted Rand index and "AR.dat" by default)
-n : number of observations generated from every mixture (0 by default)
-# : number of simulated mixtures (1 by default)
-V : name of the file (if file does not exist random transformation is applied)
-w : number of noise variables (0 by default)
-o : number of outliers for generated datasets (0 by default)
-A : name of clustering program (if this option is not specified, only the simulation stage is run)
-c : activate the command line interface with option -A (FALSE by default)
-E : name of partition analyzing program (Adjusted Rand index is used by default; program "AdjRand")
-C : activate the command line interface with option -E (FALSE by default)
-h : help
Upon launching CARP, three stages of the program are processed. The first stage is reponsible
for the simulation of Gaussian mixtures with prespecified level of complexity expressed in
terms of pairwise overlap and for the generation of datasets from these mixtures. The following
options are related to the first stage: -b, -m, -g, -p, -G, -s, -H, -e, -z, -L, -U, -r, -a,
-l, -P, -M, -S, -D, -N, -I, -X, -W, -O, -n, -#, -V, -w, -o. The option -h provides help.
The output will be directed to a working folder specified by the option -D. The obtained
parameters - mixing proportions, mean vectors and covariance matrices - will be saved to the files
specified by options -P, -M, and -S correspondingly while samples drawn from the mixtures will be
saved into the file specified by the option -X. The file provided with the option -I contains
true classifications for all observations while option -N provides the name for the file with
cluster sizes. In addition, the map of misclassification probabilities will be saved
into the file specified by the option -W while average and maximum overlaps as well as the row
and column numbers of the components that produced maximum overlap are stored in the file given
by the option -C. The element with the index (i, j) in the misclassification map represents
the probability that X simulated from the ith component is classified to the jth component.
At the second stage, a user-specified clustering method has to be run for the simulated datasets.
For illustration, several examples involving algorithms such as hierarchical clustering with
Ward's linkage (Ward, 1963) (hierclust), expectation-maximization (Dempster et al., 1977;
Fraley and Raftery, 2006) (Mclust), k-means (Forgy, 1965; MacQueen, 1967) (Kmeans), and
partitioning around medoids (Kaufman and Rousseuw, 1990) (PAM) are provided. Other methods
can be run with the option -A. User's clustering program has to be able to accept the following
options: -p, -n, -#, -D, -i, -X, -K, and also -k if the best clustering solution has to be
searched between the minimum and maximum numbers of clusters specified by options -k and -K
respectively. The program will obtain partitionings and write them into the file specified
by the option -i. Two options are provided, one for the expert user and the other for
those that are not so comfortable with writing code. For the expert user, it is assumed that
(s)he is comfortable writing a small piece of code responsible for accepting the above parameters
coming from the first stage. For the less experienced user not comfortable with this writing a
few lines of C code, the command-line interface provides a ready alternative; this is readily
done with the option -c inluded into the call of CARP. In this case, CARP will not attempt to pass
the parameters between stages. The user is then reponsible for specifying these parameters within
a clustering program or command line.
At the third stage, classifications derived by the clustering algorithm under investigation
are compared with the true to assess performance. The adjusted Rand index (AdjRand) is
incorporated as a default measure of similarity between the two classifications. An user, however,
can specify other program for comparing the obtained and true partitionings. Examples of other
methods that are included with this manual are through call to R packages which calculate Rand
index (Rand, 1971) (Rand), bilogical homogeneity index (Datta and Datta, 2006) (BHI), and the
proprotion of correctly classified observations (Diag). If an option -E is specified, the program
should comply with the following options: -n, -#, -D, -I, -i, -R. Calculated values of the provided
similarity measure have to be written to the file specified by the option -R. Alternatively,
less experienced users can specify an option -C with similar meaning as for the option -c from
the second stage: a user can provide a command line invoking a partition analyzing program.
If options -b and -m are both specified, CARP produces a mixture satisfying both characteristics,
average and maximum overlap. If one option, -b, -m, or -g, is specified, a mixture satisfying
the prespecified value is generated.
simulate a 3-dimensional 4-component mixture with spherical covariance
matrices, equal mixing proportions, maximum overlap 0.1 and average
overlap 0.05; generate a sample of size 100 and analyze using "hierclust"
> ./CARP -m0.1 -b0.05 -p3 -G4 -s -n100 -Ahierclust
% simulate a 2-dimensional 4-component mixture with homogeneous
covariance matrices, maximum overlap 0.1 and mixing proportions
greater or equal than 0.10; generate a sample of size 100
> ./CARP -m0.1 -p2 -G4 -H -z0.10 -n100
compute the overlap for the 2-dimensional 4-component mixture with
parameters specified in "DATA/Pi.dat", "DATA/Mu.dat" and
> ./CARP -p2 -G4 -n100
% get help
> ./CARP -h
See the manual for more comprehensive examples.
Melnykov, V. and Maitra, R.
Maitra, R. and Melnykov, V. (2010) "Simulating
data to study performance of
finite mixture modeling and clustering algorithms", Journal of Computational and Graphical Statistics, 19(2), 354-376, 2010; doi: 10.1198/jcgs.2009.08054.
Davies, R. (1980) "The distribution of a linear combination of
chi-square random variables", Applied Statistics, 29, 323-333.
CARP (from MLOSS site)
The CARP manual