In my particular case, I don't think I could derive any sort of coordinates. For example, say I have 3 nodes, A, B, and C. The distance between A and B is 10, between B and C is 1, and between C and A is 100. I can't represent those distances in 2D space, as no triangle can have those side lengths. Any method which tried to find a node that represented the centre of those 3 points might produce something meaningless. Of course, this example shows some particularly bad data, but the method I use has to be able to accommodate this.
For what it's worth, I implemented hierarchical clustering last night and had a reasonable degree of success using the median distance between cluster members. Using the minimum distance (maximum similarity in my case) made it too quick to form what could be considered long and thin clusters, whereas using the maximum distance/minimum similarity fails when there is no correlation between 2 nodes, as it comes to zero. So the median method tended to give reasonably accurate clusters while not forcing the special cases into clusters they really didn't belong in.
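Roughly, the merging step looks something like this sketch (illustrative Python only, not my actual code; the median_linkage name, the frozenset-keyed distance map, and the threshold cutoff are made up for the example):

import statistics

def median_linkage(dist, threshold):
    # dist maps frozenset({i, j}) -> distance between nodes i and j.
    # Start with every node in its own cluster.
    clusters = [{n} for n in {node for pair in dist for node in pair}]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Linkage = median of all member-to-member distances between the two clusters.
                d = statistics.median(dist[frozenset({i, j})]
                                      for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:  # stop once even the closest pair of clusters is too far apart
            break
        _, x, y = best
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters

# With the bad-triangle example above and a cutoff of, say, 5:
# dist = {frozenset({'A', 'B'}): 10, frozenset({'B', 'C'}): 1, frozenset({'A', 'C'}): 100}
# median_linkage(dist, 5) groups B and C together and leaves A on its own.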
Finding clusters in a graph
quote:
Original post by MikeD
Predictor: I was always under the impression that k-means clustering assigned each node in the space to the nearest cluster point by some (possibly abstracted) distance metric. Why would it need coordinates per se? Perhaps in some state space the process could have non-point attractors, where a change in the cluster positions changes the clustering of node n, which in turn changes the cluster positions, which changes the clustering of node n again, in a continuous cycle. I imagine this cannot happen in Euclidean geometry but could happen in other state spaces; however, I don't think this precludes use with _any_ distance metric, it just might be problematic.
Am I misunderstanding your use of the term coordinates?
The "means" in k-means clustering are means of the features (coordinates in some space, really) in each cluster. In other words, the centroids of the clusters have the same variables as the original data, with each cluster's centroid having the means of each variable for the items in that cluster. It's true that items in k-means are assigned to the nearest cluster, and this could probably be accomplished using a distance matrix, but what value would the cluster center have (if we only have the distances between items, and not actual features)?
It may help to read my article, "Modeling Methodology 4: Localizing Global Models":
http://will.dwinnell.com/will/Modeling%20Methodology%204%20-%20Localizing%20Global%20Models.DOC
...which actually uses fuzzy c-means, not k-means, but the same general idea applies.
[edited by - Predictor on October 1, 2003 9:01:27 AM]
In my post above, I was thinking of the problem as using the distance matrix. The cluster center wouldn't necessarily have an explicit meaning, and trying to interpret it in terms of features would not likely give meaningful results.
It would serve for the purpose of identifying which observations are similar, however, wouldn't it? Since the goal is to find groups of similar observations and not to elucidate a typical observation (the cluster center), it seems it would still be useful.
I was considering a potentially different data representation, which I suppose I should explain. It may be totally inappropriate for the discussion at hand. Kylotan said:
quote:
I have a set of (between 100-1000) nodes, each of which has a 2-way connection to a number (maybe 10%) of the other nodes. Each connection has a weight, which I'll call a distance. The weights are entirely arbitrary so there is no guarantee that the graph could exist in 2D or 3D space. In fact, these weights are based on how many humans judge the 2 nodes to be related. There should be a high level of transitivity in that the distance between A and C should correlate highly with the total of the distances between A and B and between B and C.
Suppose the data is represented by a connectivity matrix: a square matrix with each node ID across the rows and columns, where any particular cell holds the weight between the two nodes. From this we can derive a distance matrix between nodes, and this matrix would be of the same form as the previous one. Each cell would represent the "distance" between rows in the connectivity matrix. You could use Euclidean distance, cosine, or whatever. Sticking with Kylotan's example, I assigned some normalized weights such that the self-similarity of a node is 1. Similarity between a and c is 0 (nobody judged them as similar), between a and b is .5 (half the people judged them as similar), and between b and c is .2.
connectivity matrix:
      a     b     c
a    1.0   0.5   0.0
b    0.5   1.0   0.2
c    0.0   0.2   1.0
We can then create the distance matrix by taking the Euclidean distance between rows (or columns in this symmetric case).
      a     b     c
a    0     .73   1.44
b    .73   0     1.24
c    1.44  1.24  0
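To spell out one entry: the a-b distance is the Euclidean distance between rows a and b of the connectivity matrix, sqrt((1 - .5)^2 + (.5 - 1)^2 + (0 - .2)^2) = sqrt(.54), which is about .73.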
If we then cluster by rows or columns (they're the same in this case), we should be able to group similar nodes. This in fact works fairly well in chemistry where the connectivity matrix represents atoms and the types of bonds between them. Clustering in this way results in molecules of similar structure grouping together.
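If it helps, here is a small illustrative sketch of that pipeline with the toy numbers above (the 0.8 cutoff and the naive threshold grouping are stand-ins; in practice you would hand the distance matrix to a proper clustering routine):

from math import sqrt

conn = {'a': [1.0, 0.5, 0.0],
        'b': [0.5, 1.0, 0.2],
        'c': [0.0, 0.2, 1.0]}   # the connectivity matrix above, row by row

def row_distance(u, v):
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Distance matrix: Euclidean distance between rows of the connectivity matrix.
dist = {(i, j): row_distance(conn[i], conn[j]) for i in conn for j in conn}
# dist[('a', 'b')] ~ .73, dist[('a', 'c')] ~ 1.44, dist[('b', 'c')] ~ 1.24

# Naive grouping: a node joins a cluster if its row is within the cutoff of every member.
clusters = []
for node in conn:
    for cluster in clusters:
        if all(dist[(node, member)] < 0.8 for member in cluster):
            cluster.add(node)
            break
    else:
        clusters.append({node})

print(clusters)   # with this cutoff, a and b group together and c stays on its own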
I apologize for the LONG post, but please let me know if this helps.
edit - Improved matrices.
[edited by - KirkD on October 1, 2003 9:39:34 AM]
quote:
Original post by Predictor
The "means" in k-means clustering are means of the features (coordinates in some space, really) in each cluster. In other words, the centroids of the clusters have the same variables as the original data, with each cluster''s centroid having the means of each variable for the items in that cluster. It''s true that items in k-means are assigned to the nearest cluster, and this could probably be accomplished using a distance matrix, but what value would the cluster center have (if we only have the distances between items, and not actual features)?
It may help to read my article, "Modeling Methodology 4: Localizing Global Models":
http://will.dwinnell.com/will/Modeling%20Methodology%204%20-%20Localizing%20Global%20Models.DOC
...which actually uses fuzzy c-means, not k-means, but the same general idea applies.
Nice article. As to the question of finding the mean of a cluster, surely, whilst slow and laborious, the distance from a node to a cluster could be the mean distance of that node to every member of that cluster (not including the node, should it already be a member). Thus a node is still clustered with those other nodes it is, on average, most similar to, despite the lack of a centroid in any kind of coordinate space. However, I agree that if the concept of clustering is a mechanism most appropriate to coordinate space, perhaps non-coordinate state spaces should find a more appropriate method.
Mike
quote:
Original post by MikeD
As to the question of finding the mean of a cluster, surely, whilst slow and laborious, the distance from a node to a cluster could be the mean distance of that node to every member of that cluster (not including the node, should it already be a member). Thus a node is still clustered with those other nodes it is, on average, most similar to, despite the lack of a centroid in any kind of coordinate space.
There is nothing wrong with what you describe, but it is essentially the definition of a hierarchical clustering process.
-Predictor
http://will.dwinnell.com
quote:
Original post by Predictor
There is nothing wrong with what you describe, but it is essentially the definition of a hierarchical clustering process.
So, essentially, k-means clustering is a special case of hierarchical clustering for a Euclidean coordinate space?
Doh! I hate it when I miss a good discussion! Kylotan, have you solved your problem yet?
Much of what I would have said has been said; however, I'm not as enthusiastic about methods such as split and merge (hierarchical clustering) or k-means/fuzzy k-means. The important point that I hope you picked up from this thread is that such methods perform clustering based on attributes, which can be represented as spatial dimensions. However, quite obviously, such mappings from attributes to dimensions may not produce a Euclidean space.
The sort of problem you present has been solved before, and you should take a look at the literature on clustering gene expression data. You might want to look at such methods as density estimation or non-parametric clustering. Stick 'clustering gene expression data' into Google for a start on your literature search.
Good luck,
Timkin
How one describes the data may vary, but yes, in such problems a connectivity matrix depicting arc weights is typical. This is applicable to gene expression data as well as problems like the one you mentioned, Kirk (molecular structure).
Cheers,
Timkin
quote:
Original post by Predictor
There is nothing wrong with what you describe, but it is essentially the definition of a hierarchical clustering process.
quote:
Original post by MikeD
So, essentially, k-means clustering is a special case of hierarchical clustering for a Euclidean coordinate space?
No, because what you described was not k-means clustering. K-means represents each cluster prototype as the vector means of the items within that cluster. See, for instance:
http://www.mathworks.com/access/helpdesk/help/toolbox/stats/multiv16.shtml
http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
http://www.ece.neu.edu/groups/rpl/projects/kmeans/
And I was speaking a little abstractly by describing your algorithm as "hierarchical" since your method does not construct the full hierarchy. But this part:
quote:
Original post by MikeD
...the distance from a node to a cluster could be the mean distance of that node to every member of that cluster (not including the node, should it already be a member). Thus a node is still clustered with those other nodes it is, on average, most similar to, despite the lack of a centroid in any kind of coordinate space.
...is exactly like an average distance linkage which is often used in hierarchical clustering.
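In code terms, that average-distance linkage amounts to something like the following rough sketch (illustrative names only; dist is assumed to map frozenset node pairs to distances):

def distance_to_cluster(node, cluster, dist):
    # Average linkage: mean distance from the node to the cluster's members,
    # excluding the node itself if it is already a member.
    members = [m for m in cluster if m != node]
    return sum(dist[frozenset({node, m})] for m in members) / len(members)

def nearest_cluster(node, clusters, dist):
    # Assign the node to whichever cluster it is closest to on average.
    return min(clusters, key=lambda c: distance_to_cluster(node, c, dist))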