rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource


Performing hierarchical clustering on a distance matrix

VedAustin opened this issue · comments

I understood the concept of complete linkage; however, in the example you provided I did not understand the values in the table with columns 'row label 1', 'row label 2', etc.

  1. For example, what do the numbers (0-7) under the first two columns ('row label _') represent?
  2. On the first step of creating clusters, when you just have points, how do you go about creating clusters? You attempted to explain that via the example, but if you could expand on your example, I would really appreciate it.

Oh, I agree, I really should make this clearer (noting it down for a potential 2nd edition).

I think you are referring to this one, right?

[screenshot: the linkage matrix table]

We can read this as follows ...

Let's say we have n samples with indices i in {0, ..., n-1},
where n=5 in this case.

Sorry for being overly complicated with the "letters" here, but I hope it helps generalize the explanation :). If we have n samples, the algorithm will eventually perform n-1 merges. So, in this case our linkage matrix has 5-1=4 rows: one row per newly created cluster.

Now, the numbers 0-7 are the indices of the samples/clusters being merged. The numbers 0-4 are the indices of the singleton clusters (our initial n sample indices). The indices 5-7 refer to the non-singleton clusters that were created upon merging. Let's walk through it step by step:

  1. Find the two most similar samples (the pair with the smallest distance) and merge them into one cluster. Here, we merge sample i=0 and sample i=4 into the first cluster, cluster 1. This cluster gets the index i=5, because the numbers 0-4 are already used as sample indices.
  2. Now, we do the distance comparison again and merge sample 1 with sample 2 to create cluster 2, which gets the index i=6.
  3. In row 3, we merge sample 3 with the cluster i=5 (the one that we created in step 1) to create cluster i=7.
  4. Finally, we merge cluster i=6 (from step 2) with cluster i=7 (from step 3).
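These four merge steps can be reproduced as a quick sketch with SciPy's `linkage` function using complete linkage. The five 1-D sample values below are made up for illustration and chosen so that the merge order matches the steps above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up 1-D samples; indices 0-4 are the singleton clusters.
# Values chosen so that samples 0 & 4 are closest, then 1 & 2, etc.
X = np.array([[0.0], [10.0], [10.5], [1.5], [0.1]])

# Each row of Z records one merge:
# [cluster index 1, cluster index 2, distance, no. of samples in new cluster]
Z = linkage(X, method='complete')
print(Z)
# Row 0 merges samples 0 and 4  -> new cluster index 5
# Row 1 merges samples 1 and 2  -> new cluster index 6
# Row 2 merges sample 3 with cluster 5 -> new cluster index 7
# Row 3 merges clusters 6 and 7 -> the final cluster
```

Note that each new cluster gets the next free index (5, 6, 7, ...), exactly as in the walkthrough.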

I hope this doesn't sound too complicated and makes sense? Let me know :)

Yep, thank you very much, that makes sense! A quick question: when you are using 'complete' linkage, do the most dissimilar items have the lowest distance in the distance matrix?
Also, when you create a cluster with a new index, i=5, how did you decide to merge it with i=3? In other words, what is the output of creating a cluster?

Glad to hear that it helped!

A quick question: when you are using 'complete' linkage, do the most dissimilar items have the lowest distance in the distance matrix?

Yes :).

  1. For each pair of clusters a and b, you compute the distance between every element in cluster a and every element in cluster b. Then, you keep the distance (or "link") that is largest (the link between the 2 most dissimilar members).
  2. You repeat this process for all pairs of clusters. If you have k clusters, you end up with k(k-1)/2 of these links, one per pair.
  3. Then, you compare these links to each other and find the one that has the smallest distance. Based on this, you merge the 2 clusters.
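As a rough sketch (not the book's code), one round of these three steps might look like this in plain NumPy; the sample values and the cluster state (after the first merge) are made up for illustration:

```python
import numpy as np

def complete_link_distance(points, cluster_a, cluster_b):
    """Step 1: the largest pairwise distance between two clusters."""
    return max(
        np.linalg.norm(points[i] - points[j])
        for i in cluster_a
        for j in cluster_b
    )

def find_merge_pair(points, clusters):
    """Steps 2-3: compute the link for every pair of clusters and
    return the pair whose link distance is smallest."""
    pairs = [
        (a, b) for idx, a in enumerate(clusters) for b in clusters[idx + 1:]
    ]
    return min(
        pairs, key=lambda p: complete_link_distance(points, p[0], p[1])
    )

# Illustrative 1-D data; cluster state after samples 0 and 4 were merged
points = np.array([[0.0], [10.0], [10.5], [1.5], [0.1]])
clusters = [[0, 4], [1], [2], [3]]
print(find_merge_pair(points, clusters))  # -> ([1], [2])
```

Here the next merge found is samples 1 and 2, matching step 2 of the walkthrough above.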

Also, when you create a cluster with a new index, i=5, how did you decide to merge it with i=3?

You basically follow the three steps above. Also, note that a cluster can consist of a single sample: at the very beginning you have n clusters, where n is the number of samples in your dataset.
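To make the "output of creating a cluster" concrete: once all the merges are recorded in the linkage matrix, you can cut the hierarchy at any desired number of flat clusters with SciPy's `fcluster`. The sample values below are the same made-up illustration as before:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D samples; each starts as its own singleton cluster
X = np.array([[0.0], [10.0], [10.5], [1.5], [0.1]])
Z = linkage(X, method='complete')

# Cut the hierarchy so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # samples 0, 3, 4 share one label; samples 1 and 2 the other
```

So the linkage matrix itself is the "output": a full record of the merge hierarchy that you can cut at any level afterwards.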

Great! Thank you Sebastian! All clear now.