Quick Example of Plotting Dendrograms with scipy-cluster

Author: Damian Eads
Authored: November 17, 2007
Revised: May 24, 2008

This tutorial uses the Iris data set, which first appeared in a paper on optimal linear discriminants for Gaussian data [1]. The data set contains four measurements (sepal length, sepal width, petal length, and petal width) for each of 150 collected flower specimens. Three species (or classes) are represented: Iris Setosa, Iris Versicolour, and Iris Virginica. Rather than using Fisher's Discriminant Analysis to classify the data, we will use hcluster to analyze it. Ideally, flowers of the same species should cluster closely together, and flowers of different species should lie farther apart. Attaining a good clustering requires careful consideration of the distance metric. Since the purpose of this document is to demonstrate the library in action, we give only cursory consideration to the choice of distance metric.

What to do

First, import the hcluster module. We then load the flower data set using matplotlib's load command. The pdist command computes the standardized Euclidean distance between each pair of flower specimens. Next, we use the single linkage algorithm to build the agglomerative clustering and plot it with dendrogram:

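from pylab import *      # matplotlib's load, title, xlabel, ylabel, and legend
from hcluster import *   # pdist, linkage, and dendrogram
X = load('iris.txt')             # one row per flower specimen
Y = pdist(X, 'seuclidean')       # condensed pairwise distance matrix
Z = linkage(Y, 'single')         # single-linkage agglomerative clustering
dendrogram(Z, color_threshold=0)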

This yields the following dendrogram plot:
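
As a side note on the distance step above, the following is a minimal sketch of what the standardized Euclidean metric computes. It assumes scipy.spatial.distance.pdist as a stand-in for hcluster's pdist, and it uses a small array of illustrative values (X_demo, not the contents of iris.txt). Each squared feature difference is divided by that feature's sample variance before summing, so features measured on larger scales do not dominate the distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative stand-in for the flower measurements: 5 specimens, 4 features.
# These values are for demonstration only and are not read from iris.txt.
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
])

# Condensed distance vector: one entry per unordered pair, n*(n-1)/2 in total.
Y_demo = pdist(X_demo, 'seuclidean')

# The same distances by hand: scale each squared difference by the
# sample variance (ddof=1) of the corresponding feature.
V = X_demo.var(axis=0, ddof=1)
n = X_demo.shape[0]
manual = np.array([
    np.sqrt((((X_demo[i] - X_demo[j]) ** 2) / V).sum())
    for i in range(n) for j in range(i + 1, n)
])

# squareform expands the condensed vector into a symmetric n-by-n matrix.
D_demo = squareform(Y_demo)
```

The condensed vector Y_demo is the form that linkage accepts for methods such as single and complete linkage.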

Color Thresholds

Now that we have a dendrogram, let's find a suitable color threshold. By cutting the tree at 1.8, three clusters are formed. Let's plot another dendrogram using this color threshold, this time with centroid linkage. Some linkage methods (centroid, ward, and median) do not take condensed distance matrices as arguments but instead require the raw observations as input, so X rather than Y is passed to linkage. The legend shows the membership of each of the flat clusters formed by the cut.

Z=linkage(X, 'centroid')
dendrogram(Z, color_threshold=1.8)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram plot is shown below.
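
The color threshold only affects how the plot is drawn. To obtain the flat cluster memberships themselves, the tree can be cut at a chosen height. The sketch below assumes scipy.cluster.hierarchy (linkage and fcluster) as a stand-in for hcluster, and uses synthetic, well-separated points (X_demo) with a hypothetical cut height of 5.0 rather than the Iris data and the 1.8 threshold used above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data (an assumption for illustration): three tight groups of
# 20 points each, centered far apart in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Centroid linkage is given the raw observations, not a condensed matrix.
Z_demo = linkage(X_demo, 'centroid')

# Cut the tree at height 5.0: observations whose cophenetic distance is at
# most 5.0 end up in the same flat cluster.
labels = fcluster(Z_demo, t=5.0, criterion='distance')
n_clusters = len(np.unique(labels))
```

Here labels holds one cluster number per observation, which can be compared against known class labels when they are available.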

Using Complete Linkage

For comparison, we show how the dendrogram for complete linkage differs from the one derived from single linkage. The color threshold of 2.3 was chosen by visual inspection of the tree.

Z=linkage(Y, 'complete')
dendrogram(Z, color_threshold=2.3)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram that results is shown below:
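
The two linkage methods can also be compared numerically. Single linkage measures the distance between two clusters by their closest pair of members, while complete linkage uses the farthest pair, so the final merge under complete linkage is never lower than under single linkage. The sketch below assumes scipy (pdist, linkage, and cophenet) as a stand-in for hcluster and uses random data (X_demo) rather than the Iris measurements. The cophenetic correlation is one possible summary of how faithfully each tree preserves the original pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# Random stand-in data (an assumption): 30 observations with 4 features.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 4))
Y_demo = pdist(X_demo, 'seuclidean')

# Build both trees from the same condensed distance matrix.
Z_single = linkage(Y_demo, 'single')
Z_complete = linkage(Y_demo, 'complete')

# Column 2 of the linkage matrix holds merge heights; the last row is the
# final merge that joins everything into one cluster.
top_single = Z_single[-1, 2]
top_complete = Z_complete[-1, 2]

# Cophenetic correlation: correlation between the original distances and
# the heights at which each pair of observations is first merged.
c_single, _ = cophenet(Z_single, Y_demo)
c_complete, _ = cophenet(Z_complete, Y_demo)
```

Because the merge heights differ between methods, a color threshold that works for one linkage method generally needs to be chosen again for another.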

Using Level Truncation

The number of specimens in the data set is large enough that the dendrogram looks cluttered. Truncation condenses the dendrogram:

dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True)

The truncate_mode parameter tells the dendrogram plotting routine which type of truncation to perform. When it is set to 'level' along with p, no more than p levels of the dendrogram tree are displayed. If a non-leaf node lies beyond this level threshold, it and its descendants are contracted into a single node. Specifying show_contracted=True plots a marker for each contracted non-singleton cluster on the link that it belongs to. The height of the marker is the cophenetic distance of the contracted node.

The resulting contracted dendrogram, with contraction markers, is shown below.

Contracted leaf nodes are labeled with a number in parentheses, which gives the total number of leaf nodes belonging to the non-singleton cluster represented by the contracted link.
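
The parenthesized labels can be inspected without drawing anything. The sketch below assumes scipy.cluster.hierarchy as a stand-in for hcluster, random data (X_demo) rather than the Iris measurements, and scipy's no_plot=True option, which returns the plotting data as a dictionary. The leaf_count helper is hypothetical and exists only for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Random stand-in data (an assumption): 40 observations with 4 features.
rng = np.random.default_rng(2)
X_demo = rng.standard_normal((40, 4))
Z_demo = linkage(X_demo, 'complete')

# no_plot=True skips drawing and only returns the dendrogram layout.
R = dendrogram(Z_demo, truncate_mode='level', p=3, no_plot=True)
leaf_labels = R['ivl']  # labels of the displayed leaves, left to right

def leaf_count(label):
    """Hypothetical helper: number of original observations behind a label."""
    if label.startswith('('):
        # Contracted node, labeled like '(12)'.
        return int(label.strip('()'))
    # Uncontracted leaf, labeled with its observation index.
    return 1

# Every original observation is accounted for by exactly one displayed leaf.
total = sum(leaf_count(s) for s in leaf_labels)
```

Summing the counts recovers the number of observations, which is a quick check that truncation has only hidden detail and not dropped any specimens.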

The orientation parameter rotates the dendrogram:

dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True, orientation='left')

The result is shown below:

Download

See my software page for more information on downloading the package used for this example.

Documentation

See the API documentation for reference on how to use each function in the scipy-cluster package.

References

[1] Fisher, R. A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2): 179-188, 1936.