Finding clusters in a graph

I'm not sure if this is really an AI question, but since it's an activity that humans can perform without difficulty, I figured it might fit in here.
I have a set of (between 100 and 1000) nodes, each of which has a two-way connection to a number (maybe 10%) of the other nodes. Each connection has a weight, which I'll call a distance. The weights are entirely arbitrary, so there is no guarantee that the graph could exist in 2D or 3D space. In fact, these weights are based on how many humans judge the two nodes to be related. There should be a high level of transitivity, in that the distance between A and C should correlate highly with the sum of the distances between A and B and between B and C.
Given this description, is there any way of finding areas of the graph that are tightly clustered, for the purposes of dividing it up into arbitrary sections? One system I've looked at is fuzzy clustering, but this appears to have some limitations.
quote:
Original post by Kylotan
I have a set of (between 100 and 1000) nodes, each of which has a two-way connection to a number (maybe 10%) of the other nodes. Each connection has a weight, which I'll call a distance. The weights are entirely arbitrary, so there is no guarantee that the graph could exist in 2D or 3D space. In fact, these weights are based on how many humans judge the two nodes to be related. There should be a high level of transitivity, in that the distance between A and C should correlate highly with the sum of the distances between A and B and between B and C.
Given this description, is there any way of finding areas of the graph that are tightly clustered, for the purposes of dividing it up into arbitrary sections?
This is a very conventional clustering problem. Try hierarchical clustering, which can accept distances between items as opposed to their features (coordinates). Any good statistical software should provide this functionality.
-Predictor
http://will.dwinnell.com
I don't have access to any statistical software, and need to code something like this myself. Any good references? I found this:
http://www.analytictech.com/networks/hiclus.htm
but everything else is way over my head, not being a mathematician. Ideally I'd like to see an algorithm described.
The basic idea behind hierarchical clustering is to either: 1. start with every item as its own cluster, and repeatedly merge, or 2. start with one cluster including all items and repeatedly split. The complication comes in the details. There are many ways to perform hierarchical clustering (for instance, does one consider the distance between two clusters by their closest points? their furthest points? something else?), resulting in many (!) variations.
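To make the merging variant concrete, here is a rough single-linkage sketch (untested, just to show the shape of the algorithm). It assumes you have already built a full symmetric distance matrix; node pairs with no connection could be given a large sentinel distance. The DistMatrix and Clusters typedefs are just for this sketch:

#include <vector>
#include <limits>
#include <cstddef>

typedef std::vector<std::vector<double> > DistMatrix;
typedef std::vector<std::vector<std::size_t> > Clusters;

// Single-linkage distance between clusters a and b: the distance
// between their closest pair of members.
double linkage(const Clusters& cs, std::size_t a, std::size_t b,
               const DistMatrix& dist)
{
    double best = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < cs[a].size(); ++i)
        for (std::size_t j = 0; j < cs[b].size(); ++j)
            if (dist[cs[a][i]][cs[b][j]] < best)
                best = dist[cs[a][i]][cs[b][j]];
    return best;
}

// Start with every node in its own cluster and repeatedly merge the
// two closest clusters until only targetCount remain.
Clusters agglomerate(const DistMatrix& dist, std::size_t targetCount)
{
    Clusters cs(dist.size());
    for (std::size_t i = 0; i < dist.size(); ++i)
        cs[i].push_back(i);

    while (cs.size() > targetCount)
    {
        std::size_t bestA = 0, bestB = 1;
        double best = std::numeric_limits<double>::max();
        for (std::size_t a = 0; a < cs.size(); ++a)
            for (std::size_t b = a + 1; b < cs.size(); ++b)
            {
                double d = linkage(cs, a, b, dist);
                if (d < best) { best = d; bestA = a; bestB = b; }
            }
        // Merge the closest pair: move b's members into a, drop b.
        cs[bestA].insert(cs[bestA].end(),
                         cs[bestB].begin(), cs[bestB].end());
        cs.erase(cs.begin() + bestB);
    }
    return cs;
}

For the furthest-point (complete-linkage) variant you would take the max instead of the min inside linkage; with a few hundred nodes the naive loops above should be fast enough.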
This type of clustering is used in many fields, and is very easy to program. You should be able to find a lot of material online. Here is a beginning:
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
http://www.crmportals.com/hierarchical_cluster_analysis.pdf
http://149.170.199.144/multivar/hc.htm
http://www.analytictech.com/networks/hiclus.htm
Good luck!
-Predictor
http://will.dwinnell.com
Yeah, I took another look at the stuff and it's actually really easy, so sorry for asking for more details when they weren't really necessary.
Do any of those links discuss the benefits and downsides of this method, or mention how to decide when to stop merging clusters?
Another method which has gained favor over hierarchical clustering in some domains is K-means clustering. Again, very simple to implement but requires that you provide the number of clusters you expect a priori.
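A rough, untested sketch of the standard algorithm (the centroid update is the part that assumes each item is a point with coordinates):

#include <vector>
#include <limits>
#include <cstddef>

typedef std::vector<double> Point; // one coordinate per dimension

// Squared Euclidean distance; assumes equal dimensionality.
double sqDist(const Point& a, const Point& b)
{
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d)
        s += (a[d] - b[d]) * (a[d] - b[d]);
    return s;
}

// points: the data; centres: k initial guesses (e.g. k random points).
// Returns a cluster index for every point.
std::vector<std::size_t> kMeans(const std::vector<Point>& points,
                                std::vector<Point>& centres,
                                int maxIters)
{
    std::vector<std::size_t> label(points.size(), 0);
    for (int iter = 0; iter < maxIters; ++iter)
    {
        // Assignment step: each point joins its nearest centre.
        for (std::size_t i = 0; i < points.size(); ++i)
        {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centres.size(); ++c)
            {
                double d = sqDist(points[i], centres[c]);
                if (d < best) { best = d; label[i] = c; }
            }
        }
        // Update step: move each centre to the mean of its members.
        for (std::size_t c = 0; c < centres.size(); ++c)
        {
            Point mean(centres[c].size(), 0.0);
            std::size_t count = 0;
            for (std::size_t i = 0; i < points.size(); ++i)
                if (label[i] == c)
                {
                    for (std::size_t d = 0; d < mean.size(); ++d)
                        mean[d] += points[i][d];
                    ++count;
                }
            if (count > 0)
            {
                for (std::size_t d = 0; d < mean.size(); ++d)
                    mean[d] /= count;
                centres[c] = mean;
            }
        }
    }
    return label;
}

In practice you would seed the centres with k randomly chosen data points and stop early once no label changes between iterations.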
You could also look at partitioning, which is effectively clustering in reverse.
And if you want to get really fancy, there's the SOM (self-organising map) option.
Just my $0.02.
-Kirk
I suppose you could iteratively train K-means by picking a random number of clusters, measuring the 'quality' of the clustering, and adjusting the number of clusters accordingly.
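Something like this, perhaps, as an untested stab at a 'quality' score (reusing the DistMatrix and Clusters types from the hierarchical sketch above): the average distance between pairs of nodes that land in the same cluster, where lower is better.

// Average distance between pairs of nodes placed in the same
// cluster (lower is better). Untested.
double meanIntraClusterDistance(const Clusters& cs, const DistMatrix& dist)
{
    double total = 0.0;
    std::size_t pairs = 0;
    for (std::size_t c = 0; c < cs.size(); ++c)
        for (std::size_t i = 0; i < cs[c].size(); ++i)
            for (std::size_t j = i + 1; j < cs[c].size(); ++j)
            {
                total += dist[cs[c][i]][cs[c][j]];
                ++pairs;
            }
    return pairs > 0 ? total / pairs : 0.0;
}

Though I realise a raw score like this always improves as the number of clusters grows, so it would need some penalty on k.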
However, does the standard K-means algorithm require that all the points fit in Euclidean space? I could have a situation where A is 1 point away from B and B is 1 point away from C, yet A is 40 points away from C. Obviously this couldn't exist in 2D or 3D space, and any algorithm that assumes such properties may fail. On the face of it, K-means would appear to expect the distance between A and C to be 'reasonable', which I can't guarantee.
I suppose I could run some sort of pre-processing on the graph to attempt to compensate for this, however.
I don't think any clustering method is totally dependent upon Euclidean conditions. As long as you've defined a distance measure which behaves properly, i.e., as a metric (non-negative, symmetric, and satisfying the triangle inequality), I don't think it will matter. Usually Euclidean distances are the basis due to their ease of understanding.
I could be wrong, however. It happened once before. 8^)
-Kirk
quote:
Original post by kirkd
Another method which has gained favor over hierarchical clustering in some domains is K-means clustering. Again, very simple to implement but requires that you provide the number of clusters you expect a priori.
You could also look at partitioning, which is effectively clustering in reverse.
And if you want to get really fancy, there's the SOM (self-organising map) option.
Yes, but both of these methods require coordinates, as opposed to distances. Synthetic coordinates may be derived from distances, but may require many dimensions for an accurate mapping. Consider that any N points may theoretically require as many as N-1 dimensions for accurate coordinates. For example, in the worst case, 3 points (at the vertices of a triangle, i.e. not collinear) will require 2 dimensions. With hundreds or thousands of points, one may very well require hundreds or thousands of dimensions for an accurate representation. Converting to coordinates (by multidimensional scaling, etc.) adds another step and makes this solution clumsy.
While I tend to reach for centroid-based clustering algorithms such as you mention, I think they'd be an awkward fit to this problem.
Predictor: I was always under the impression that K-means assigned each node in the space to the nearest cluster point by some (possibly abstract) distance metric. Why would it need coordinates per se? Perhaps in some state spaces the process could have non-point attractors, where a change in the cluster positions changes the assignment of node n, which changes the cluster positions, which changes the assignment of node n again, in a continuous cycle. I imagine that this cannot happen in Euclidean geometry but could happen in other state spaces; however, I don't think this precludes use with _any_ distance metric, it just might be problematic.
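For instance, I'd have thought you could swap the centroid update for a 'medoid' update, picking the actual cluster member that minimises the total distance to the other members, so that only the distance matrix is ever needed. A rough, untested fragment (reusing the DistMatrix typedef from the hierarchical sketch above):

// Pick the member of a cluster that minimises the total distance
// to the other members; this needs no coordinates at all.
std::size_t medoid(const std::vector<std::size_t>& members,
                   const DistMatrix& dist)
{
    std::size_t best = members[0];
    double bestTotal = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < members.size(); ++i)
    {
        double total = 0.0;
        for (std::size_t j = 0; j < members.size(); ++j)
            total += dist[members[i]][members[j]];
        if (total < bestTotal) { bestTotal = total; best = members[i]; }
    }
    return best;
}

(I believe this variant goes by the name k-medoids.)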
Am I misunderstanding your use of the term coordinates? I look forward to elucidation on this point.
Mike