This tutorial uses the Iris data set, which first appeared in a
paper on optimal linear discriminants for Gaussian data [1].
This data set contains four measurements (sepal length, sepal width, petal
length, and petal width) for each of 150 collected flower specimens.
Three species (or classes) are represented here: *Iris Setosa*,
*Iris Versicolour*, and *Iris Virginica*. Rather than using
Fisher's Discriminant Analysis to classify the data, we will use
`hcluster` to analyze the data.
Ideally, pairs of flowers in the same species should cluster more closely,
and flowers from different species should be farther apart from
one another.
Attaining good clustering requires careful consideration of the distance
metric. Since the purpose of this document is to demonstrate this
library in action, we give only cursory consideration to the choice of
distance metric.
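To illustrate why the metric matters, here is a minimal plain-Python sketch (not `hcluster`'s implementation) comparing plain Euclidean distance with a standardized variant that divides each squared difference by the feature's variance. The data values, the helper names `euclidean` and `seuclidean`, and the use of the sample variance are assumptions made for illustration only.

```python
from math import sqrt
from statistics import variance


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def seuclidean(u, v, variances):
    """Euclidean distance with each squared difference divided by
    the variance of the corresponding feature."""
    return sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))


# Made-up measurements (not taken from the Iris data): the first
# feature varies much more across rows than the second.
X = [
    [5.0, 0.2],
    [5.2, 0.4],
    [7.0, 0.3],
]

# Assumption: per-feature sample variance, computed column by column.
variances = [variance(column) for column in zip(*X)]

# Print both distances for two pairs of rows.
for i, j in [(0, 1), (0, 2)]:
    print(i, j,
          round(euclidean(X[i], X[j]), 3),
          round(seuclidean(X[i], X[j], variances), 3))
```

Under the plain metric, row 0 is far closer to row 1 than to row 2, because the first feature dominates. After standardization the two distances are nearly equal, since small differences in the low-variance second feature now carry comparable weight.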

First, import the `hcluster` module. We then load the
flower data set using `matplotlib`'s `load` command.
The `pdist` command computes the standardized Euclidean distance
between each pair of flower specimens.
Next, we use the *single* linkage algorithm to build the
agglomerative clustering:

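    from pylab import *     # matplotlib's pylab interface: load, title, xlabel, ...
    from hcluster import *

    X = load('iris.txt')                  # one row per flower specimen
    Y = pdist(X, 'seuclidean')            # pairwise standardized Euclidean distances
    Z = linkage(Y, 'single')              # single-linkage agglomerative clustering
    dendrogram(Z, color_threshold=0)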

This yields the following dendrogram plot:

Next, we use *centroid* linkage and add a title, axis labels, and a legend:

    Z = linkage(X, 'centroid')
    dendrogram(Z, color_threshold=1.8)
    title('Sir Ronald Fisher\'s Iris Data Set')
    xlabel('Flower Specimen Number')
    ylabel('Distance')
    legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram plot is shown below.

The same plot with *complete* linkage:

    Z = linkage(Y, 'complete')
    dendrogram(Z, color_threshold=2.3)
    title('Sir Ronald Fisher\'s Iris Data Set')
    xlabel('Flower Specimen Number')
    ylabel('Distance')
    legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram that results is shown below:

The dendrogram can also be truncated when plotting:

    dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True)

The `truncate_mode` parameter tells the dendrogram plotting routine
the type of truncation to perform. When set to `'level'` along with `p`,
no more than `p` levels of the dendrogram tree are displayed.
The `show_contracted=True` parameter specification
plots a marker for each non-singleton cluster contracted on the link
that it belongs to. The height of the marker is the cophenetic distance
of the contracted node.
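The cophenetic distance mentioned above is the height at which two observations are first merged into the same cluster. The following is a minimal sketch of that idea using a naive single-linkage procedure in plain Python. It is not `hcluster`'s implementation, and the function names `single_linkage` and `cophenetic`, along with the one-dimensional toy data, are hypothetical.

```python
from itertools import combinations


def single_linkage(points, dist):
    """Naive agglomerative clustering with single linkage.

    Returns the merges in order as (members_a, members_b, height).
    """
    def cluster_distance(a, b):
        # Single linkage: smallest distance over all cross-cluster pairs.
        return min(dist(points[i], points[j]) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def cophenetic(merges, i, j):
    """Height of the merge at which observations i and j first share a cluster."""
    for a, b, height in merges:
        if (i in a and j in b) or (i in b and j in a):
            return height
    raise ValueError("observations were never merged")


# Toy one-dimensional data (made up for illustration).
points = [0.0, 1.0, 3.0, 7.0]
merges = single_linkage(points, lambda u, v: abs(u - v))

for a, b, height in merges:
    print(sorted(a), sorted(b), height)
print(cophenetic(merges, 0, 2))
```

Here observations 0 and 2 lie 3.0 apart, yet their cophenetic distance is 2.0, the height of the link that first joins their clusters.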
The contracted dendrogram with contraction markers that results:

Contracted leaf nodes are labeled with a number in parentheses, which represents the total number of leaf nodes belonging to the non-singleton cluster represented by the contracted link.
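The labeling of contracted leaf nodes can be sketched in plain Python on a small hand-built tree. This is a simplified model rather than the library's plotting code: the nested-tuple tree, the function names, and the convention that the root sits at depth 0 and clusters deeper than `p` are contracted are all assumptions of this sketch.

```python
def count_leaves(node):
    """Number of original observations under a node."""
    if isinstance(node, int):
        return 1
    left, right = node
    return count_leaves(left) + count_leaves(right)


def truncated_labels(node, p, depth=0):
    """Leaf labels after contracting every non-singleton cluster deeper than p.

    Assumption of this sketch: the root is at depth 0.
    """
    if isinstance(node, int):
        # Singleton: keep the observation's own label.
        return [str(node)]
    if depth > p:
        # Contracted cluster: show its leaf count in parentheses.
        return ["(%d)" % count_leaves(node)]
    left, right = node
    return (truncated_labels(left, p, depth + 1)
            + truncated_labels(right, p, depth + 1))


# Hand-built tree over eight observations: an int is a leaf,
# a 2-tuple is a merge of its two children.
tree = (((0, 1), 2), ((3, 4), (5, (6, 7))))

print(truncated_labels(tree, 1))
print(truncated_labels(tree, 0))
```

With `p=1` the cluster containing observations 0 and 1 collapses to `(2)` while the singleton 2 keeps its own label. The numbers in parentheses always sum, together with the remaining singletons, to the total number of observations.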

The `orientation` parameter rotates the dendrogram,

    dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True, orientation='left')

and the result is:

See my software page for more information on downloading the package used for this example.

See the API documentation for reference on
how to use each function in the *scipy-cluster* package.

[1] Fisher, R.A. "The use of multiple measurements in taxonomic problems."
*Annals of Eugenics*, 7(2): 179-188. 1936.