Cluster generators

Data clustering is an unsupervised classification technique. Its aim is to identify groups of similar data items within large data sets. The output of a clustering algorithm is typically a partitioning of a data set, such that data items within the same cluster are similar and those within different clusters are dissimilar. Clustering problems are encountered in many different disciplines, and the performance of a given clustering algorithm usually varies for different types of data.

Synthetic data sets, in which the properties of the clusters and the correct cluster assignments are known a priori, can be helpful in the development and evaluation of new clustering algorithms. I have developed two generators for clustered data, both of which can be downloaded below. Two example data sets produced by these generators (projected to two dimensions) are shown to the right.

Downloads:

Description of the two generators (.pdf file)

160 sample data sets. These data sets are in space-separated row-column format, with the last column containing the class label.

Ellipsoid generator (C source code)

Gaussian generator (C++ source code)

Hand-crafted data sets

USM dissimilarity matrix for random computer files.
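
Reading the sample data sets: the sample files are described above as space-separated rows with the class label in the last column. The following is a minimal sketch of how such a file could be read in Python, not code shipped with the generators. The function name `parse_labelled_rows` and the example rows are made up for illustration, and the labels are kept as strings because their encoding is not specified here.

```python
from typing import List, Tuple


def parse_labelled_rows(text: str) -> Tuple[List[List[float]], List[str]]:
    """Split whitespace-separated rows into feature vectors and class labels.

    Assumes every non-blank row holds numeric features followed by one
    label in the last column (hypothetical helper, not part of the downloads).
    """
    features: List[List[float]] = []
    labels: List[str] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *values, label = fields  # last column is the class label
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels


# Made-up rows for illustration only, not taken from the sample data sets.
example = "0.5 1.25 1\n0.75 1.0 1\n4.0 3.5 2\n"
points, classes = parse_labelled_rows(example)
print(len(points), "points;", "labels:", sorted(set(classes)))
```

To read a downloaded file instead, its contents could be passed in as a string, for instance via `open(path).read()`, where `path` points to wherever the file was saved.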