
Recommend an algorithm for this?

Started by May 21, 2008 06:07 AM
47 comments, last by Timkin 16 years, 5 months ago
...sigh...

I can see that I'm going to have to spell it out, lest certain people continue to disparage me.

Let's formalise the OP's problem algebraically.

We have N objects, each having K attributes defined on the space A^K = A1 × A2 × ... × AK. We require M subsets (groups) of objects, where M < N and each object belongs to one and only one group.

Form the M×K matrix of means:

| μ1(G1)  μ2(G1)  ...  μK(G1) |
| μ1(G2)  μ2(G2)  ...  μK(G2) |
|   .       .            .    |
|   .       .            .    |
|   .       .            .    |
| μ1(GM)  μ2(GM)  ...  μK(GM) |

Those claiming that I have misread the question, or that I don't understand clustering, have falsely assumed that the rows of this matrix define the possible clusters for this problem (i.e., a one-to-one mapping of groups and clusters in the original space A^K in which the objects are defined). That obviously doesn't work in this problem since it would give M clusters all having the same label (i.e., M copies of the same cluster). Given such a glaring absurdity, why would anyone immediately assume that this is what I meant/intended? I think I've shown over the years that I'm not an idiot... but perhaps I'm mistaken (either I am an idiot and don't know it, or I haven't proven otherwise).

Clearly, to solve this problem one must not seek M clusters based on a K-dimensional label, but rather K clusters in the space of means U^M, each having an M-dimensional label. I.e., cluster the matrix by columns, rather than rows, under the constraint of uniqueness of the assignment of objects to groups.
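
For concreteness, computing this matrix of means for a given candidate assignment might look like the following sketch (the names and container choices are illustrative assumptions, not anything from the posts above):

#include <cstddef>
#include <vector>

// Sketch: build the M x K matrix of means for one candidate assignment
// of N objects to M groups.
std::vector<std::vector<double> > matrix_of_means(
    const std::vector<std::vector<double> >& attributes, // N x K
    const std::vector<int>& assignment,                  // object -> group
    int M) {
  const int K = int(attributes[0].size());
  std::vector<std::vector<double> > mu(M, std::vector<double>(K, 0.0));
  std::vector<int> count(M, 0);
  for(std::size_t i = 0; i < attributes.size(); ++i) {
    ++count[assignment[i]];
    for(int j = 0; j < K; ++j)
      mu[assignment[i]][j] += attributes[i][j];
  }
  for(int g = 0; g < M; ++g)
    for(int j = 0; j < K; ++j)
      if(count[g] > 0)
        mu[g][j] /= count[g]; // entry (g, j) is the mean of attribute j in group Gg
  return mu;
}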

To borrow from one of my detractors...

Quote: Does it make more sense now?
Quote: Original post by Timkin
Those claiming that I have misread the question, or that I don't understand clustering, have falsely assumed that the rows of this matrix define the possible clusters for this problem (i.e., a one-to-one mapping of groups and clusters in the original space A^K in which the objects are defined).

No, I didn't falsely assume anything. You may be very familiar with clustering and understand that you can push the definition into solving a larger class of optimization problems, but for the rest of us clustering refers more or less to what Wikipedia says it is:
Quote: From Wikipedia
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.


Nobody here thought that you are an idiot; misreading a problem wouldn't make you one.

I am still not convinced that you can use a clustering algorithm for this, but I am willing to accept that I just don't know enough about clustering. And I still think this is an optimization problem.

I also thought of a solution using dynamic programming that would do the original case (20 people into 5 groups) in a fraction of a second, but it's messy to implement so I haven't done it.
Quote: Original post by Timkin
Those claiming that I have misread the question, or that I don't understand clustering

I think I've shown over the years that I'm not an idiot... but perhaps I'm mistaken (either I am an idiot and don't know it, or I haven't proven otherwise).




Nobody here claimed that you didn't understand clustering, nor did anyone claim that you're an idiot. Perhaps a little overly defensive, but not an idiot. Relax, Timkin. A little diplomacy goes a long way.

I can see that your matrix of means representation gives a step toward the solution, but I'm failing to see how you are assigning membership of any object to any particular group. Based on what I'm reading (perhaps I'm misreading it or I'm just an idiot), you're proposing to cluster the matrix of means by column, but doesn't this matrix require that each object is assigned to a cluster?

EDIT: I believe that Sneftel is asking the same question, below.

[Edited by - kirkd on May 27, 2008 10:05:12 AM]
Quote: Original post by Timkin
Clearly, to solve this problem one must not seek M clusters based on a K-dimensional label, but rather K clusters in the space of means U^M, each having an M-dimensional label. I.e., cluster the matrix by columns, rather than rows, under the constraint of uniqueness of the assignment of objects to groups.

In order to cluster, however, you'll need actual values for those rows, which will need actual values for Gi. Where do those come from?
Quote: Original post by alvaro
You may be very familiar with clustering and understand that you can push the definition into solving a larger class of optimization problems, but for the rest of us clustering refers more or less to what Wikipedia says it is:

Quote: From Wikipedia...



Sorry, but that one made me laugh my arse right off my chair. To claim that I'm 'pushing a definition' is quite funny and, nothing personal, but quite absurd. Transformation of data to an alternate space is a common mathematical technique applied in many areas of computer science. As for Wikipedia, it is the bane of modern scholarship. It is NOT authoritative, and almost never accurate.

Quote: And I still think this is an optimization problem.


If you go back over my posts you might notice that I didn't say that you couldn't solve it as an optimisation problem, but simply that from its definition it wasn't an optimisation problem. An optimisation problem is one in which you have a function to optimise. To treat this problem as an optimisation problem you need to arbitrarily propose a function that encodes the problem parameters and then optimise that, hoping that you get the encoding right and find a solution to your original problem. That's bending the problem to fit your tool, and I was most amused to see InnocuousFox among those proposing this approach, since he is such a staunch advocate (and rightly so) of letting the problem dictate the best tool for the job.

kirkd & sneftel:

There is certainly more than one way to skin this transformed cat... but at heart you're trying to find one and only one cluster for each column of this matrix (simultaneously). Transforming this problem to an optimisation one would be equivalent to bounding your M clusters by a hyper-ellipsoid and attempting to minimise the volume of this bounding object, which should highlight the expected multi-modal nature of the solution space.

As to how you assign objects to groups... if you choose to solve the problem by searching the space of assignments, i.e., iteratively, then you need to start with an initial guess at the solution (as you would do if you were using an optimisation method).

Ultimately there are a range of clustering/classification algorithms that could be applied to the problem. If I had to take a stab at which I'd apply, I'd stick with my previous experience and use either MML or an iterative Bayesian method.

Regards,

Timkin
Here's a simple C++ implementation of simulated annealing to minimize the global function that I suggested. It took me about an hour to write.

It would be very constructive if advocates of other solutions would post similarly self-contained code that illustrated their approach, so we could all play around with it and learn about the pros and cons.
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <ctime>

// returns a sample from a uniform(0,1) distribution
double U() {
  return double(std::rand())/RAND_MAX;
}

double simulated_annealing(double attributes[20][5], int result[20]) {
  // Compute desired group sums
  double desired_group_sums[5] = {0,0,0,0,0};
  for(int i=0;i<20;++i) {
    for(int j=0;j<5;++j)
      desired_group_sums[j] += attributes[i][j];
  }
  for(int j=0;j<5;++j)
    desired_group_sums[j] /= 5.0;

  // We can start with any assignment we want
  int group_assignment[20] = {0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4};

  // Compute initial group deviations from desired sums
  double group_deviations[5][5]; // [group][attribute]
  for(int i=0;i<5;++i) {
    for(int j=0;j<5;++j)
      group_deviations[i][j] = -desired_group_sums[j];
  }
  for(int i=0;i<20;++i) {
    for(int j=0;j<5;++j)
      group_deviations[group_assignment[i]][j] += attributes[i][j];
  }

  // Compute initial total error
  double error = 0.0;
  for(int i=0;i<5;++i) {
    for(int j=0;j<5;++j) {
      double delta = group_deviations[i][j];
      error += delta*delta;
    }
  }

  // Save this first guess as the result
  double min_error = error;
  for(int i=0;i<20;++i)
    result[i] = group_assignment[i];

  // If you think you are getting stuck in local minima, or if you
  // want the algorithm to run faster, try playing around with these
  // parameters
  const int num_iterations = 500000;
  const double initial_temperature = .15;

  // This is the main loop of the simulated annealing
  for(int k=0;k<num_iterations;++k) {
    // Pick two random elements to swap groups
    int e1 = std::rand()%20;
    int e2 = std::rand()%20;
    int g1 = group_assignment[e1];
    int g2 = group_assignment[e2];
    if(g1 == g2) { // In the same group, so not really a change
      --k; // Don't count this as an iteration
      continue;
    }

    // Compute how much this swap would change the error
    double error_diff = 0;
    for(int j=0;j<5;++j) {
      double diff = attributes[e2][j] - attributes[e1][j];
      error_diff += (+2*group_deviations[g1][j]+diff)*diff;
      error_diff += (-2*group_deviations[g2][j]+diff)*diff;
    }

    // Now accept the swap if the error goes down, or if it goes up
    // with some probability dictated by the temperature and the size
    // of the increase
    double temperature = initial_temperature*(1.0-double(k)/num_iterations);
    if(error_diff < 0 || U() < std::exp(-error_diff/temperature)) {
      int tmp = group_assignment[e1];
      group_assignment[e1] = group_assignment[e2];
      group_assignment[e2] = tmp;
      for(int j=0;j<5;++j) {
        double diff = attributes[e2][j] - attributes[e1][j];
        group_deviations[g1][j] += diff;
        group_deviations[g2][j] -= diff;
      }
      error += error_diff;

      // Keep track of the minimum error we've seen
      if(error < min_error) {
        min_error = error;
        for(int i=0;i<20;++i)
          result[i] = group_assignment[i];
      }
    }
  }

  return min_error;
}

int main() {
  std::srand(unsigned(std::time(0)));

  // Create a random case. Notice that each attribute is in [0,1). I
  // suggest you normalize your attributes the same way, or you'll
  // have to adjust initial_temperature above.
  double attributes[20][5];
  for(int i=0;i<20;++i)
    for(int j=0;j<5;++j)
      attributes[i][j] = U();

  // This is how you call the function
  int assignment[20];
  double error = simulated_annealing(attributes, assignment);

  // Print out the result
  std::cout << error << ' ';
  for(int i=0;i<20;++i)
    std::cout << assignment[i] << ' ';
  std::cout << '\n';
}
Quote: Original post by Timkin
There is certainly more than one way to skin this transformed cat... but at heart you're trying to find one and only one cluster for each column of this matrix (simultaneously). Transforming this problem to an optimisation one would be equivalent to bounding your M clusters by a hyper-ellipsoid and attempting to minimise the volume of this bounding object, which should highlight the expected multi-modal nature of the solution space.


I'll just have to take your word on that one. 8^)

Quote: Original post by Timkin
As to how you assign objects to groups... if you choose to solve the problem by searching the space of assignments, i.e., iteratively, then you need to start with an initial guess at the solution (as you would do if you were using an optimisation method).

Ultimately there are a range of clustering/classification algorithms that could be applied to the problem. If I had to take a stab at which I'd apply, I'd stick with my previous experience and use either MML or an iterative Bayesian method.


The description here sounds a lot like k-means clustering, to me. I'm sure there are fundamental differences that I don't fully appreciate, but, and please correct me if I'm wrong, you start with an initial guess and make small modifications to the group assignments to either minimize the MML or maximize the Bayesian probability. The modifications would be driven by their effect on MML or Bayesian prob.


-KirkD


EDIT: It is interesting to me that Sneftel and I keep asking similar questions at almost the exact same time. Are we the same person in parallel existences?? 8^)

[Edited by - kirkd on May 28, 2008 11:58:25 AM]
Quote: Original post by Timkin
If you go back over my posts you might notice that I didn't say that you couldn't solve it as an optimisation problem, but simply that from its definition it wasn't an optimisation problem. An optimisation problem is one in which you have a function to optimise. To treat this problem as an optimisation problem you need to arbitrarily propose a function that encodes the problem parameters and then optimise that, hoping that you get the encoding right and find a solution to your original problem. That's bending the problem to fit your tool, and I was most amused to see InnocuousFox among those proposing this approach, since he is such a staunch advocate (and rightly so) of letting the problem dictate the best tool for the job.

But if you're iteratively searching over all possible groupings as you're proposing, then isn't that just an optimization method, with your cluster analysis as the objective function? I'm not sure I understand the basis of your approach here. And on a different tack, how is the objective I posted not "the one true objective function"? It's the weighted sum over all attributes of the distance of each group's average for that attribute from the global average for that attribute. I can't think of any other objective that would make sense for the problem, nor can I think of a way in which this objective makes any unfounded assumptions (other than those arising from vagueness in the original description, of course).
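
For reference, that objective might be written as the following sketch (the use of squared distances, equal weights, and non-empty groups are assumptions here, since the post above leaves those details open):

// Sketch of the objective described above: for each attribute, sum the
// squared distance of each group's mean from the global mean for that
// attribute. Equal weights and non-empty groups are assumed.
double objective(const double attributes[20][5], const int assignment[20]) {
  double global_mean[5] = {0};
  double group_sum[5][5] = {{0}};
  int count[5] = {0};
  for(int i=0;i<20;++i) {
    ++count[assignment[i]];
    for(int j=0;j<5;++j) {
      global_mean[j] += attributes[i][j]/20.0;
      group_sum[assignment[i]][j] += attributes[i][j];
    }
  }
  double total = 0.0;
  for(int g=0;g<5;++g)
    for(int j=0;j<5;++j) {
      double d = group_sum[g][j]/count[g] - global_mean[j]; // group mean vs global mean
      total += d*d;
    }
  return total;
}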
kirkd & sneftel: *IF* you choose a clustering algorithm that requires clusters defined in terms of a distance metric about a cluster label (such as k-means), then you have mapped the original problem directly into an optimisation one in the 'cluster space' (and gained little, in my opinion). I don't advocate the use of k-means, unless you're happy with a direct optimisation approach. It's a rather weak classifier and it will still suffer from the same problems as local optimisation of an arbitrary error function (as the distance metric won't unfold the multi-modality or the correlation problem in this small data set). k-means is best suited to simple segmentation problems, rather than clustering/classification problems.

An MML (Minimum Message Length) approach to clustering/classification creates an information theoretic measure over the description of a cluster and the probability that the given data was generated by that description, P(H)P(D|H) (H: hypothesis, D: data). Many of the Bayesian techniques can, iirc, be shown to map back to MML or MDL (Minimum Description Length), an approximation to the strict MML solution. Solving the problem in an information space avoids the nastiness of using local optimisation in the problem space, because you're finding a globally optimal solution.
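
As a rough illustration of that two-part measure, a score for one candidate grouping might look like the sketch below (a sketch only: the uniform prior over assignments and the fixed-variance Gaussian likelihood are assumptions made for illustration, not a full MML treatment):

#include <cmath>

// Illustrative two-part score: total message length ~ -log P(H) - log P(D|H).
// Lower is better. The prior and likelihood models here are assumed.
double two_part_score(const double attributes[20][5],
                      const int assignment[20], double sigma) {
  // -log P(H): uniform prior over the 5^20 possible assignments
  double len_hypothesis = 20.0 * std::log(5.0);

  // Group means under the current assignment
  double mean[5][5] = {{0}};
  int count[5] = {0};
  for(int i=0;i<20;++i) {
    ++count[assignment[i]];
    for(int j=0;j<5;++j)
      mean[assignment[i]][j] += attributes[i][j];
  }
  for(int g=0;g<5;++g)
    for(int j=0;j<5;++j)
      if(count[g] > 0) mean[g][j] /= count[g];

  // -log P(D|H): each value modelled as Gaussian around its group's mean
  const double PI = 3.14159265358979323846;
  double len_data = 0.0;
  for(int i=0;i<20;++i)
    for(int j=0;j<5;++j) {
      double d = attributes[i][j] - mean[assignment[i]][j];
      len_data += 0.5*std::log(2.0*PI*sigma*sigma) + d*d/(2.0*sigma*sigma);
    }
  return len_hypothesis + len_data;
}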

If you're happy with direct optimisation and the difficulties it brings, then Sneftel's objective function is a reasonable cluster definition. While it is certainly the case that many algorithms for finding optimal solutions to a problem involve, at some stage, an optimisation step, that doesn't mean the original problem is an optimisation one that you can automatically throw an optimisation algorithm at... some might call this nit-picking of terminology, although it is not. If you took that stand, you'd be saying that every single problem is an optimisation problem, which would deflect you from more efficient approaches. Certainly, throwing SA or a GA at this problem is going to take a lot longer to solve (computationally) because they offer no guarantees of optimality (as both are local optimisers - hill climbers - only). Indeed, one can expect that it will take you an asymptotically longer time to find any improvements over a current solution... and you'll still never know that your current solution is the best one... or even how bad it is compared to the best.

alvaro: the speed with which you can produce code is in no way a measure of the appropriateness of the algorithm to the problem. There's plenty of free code available online for all of the algorithms discussed here if anyone wishes to compare solutions empirically. Throwing your code at the problem won't answer any of the questions as to whether the problem is solvable using direct optimisation. How do you know that the answer it gives you is correct?

Ultimately this discussion seems to boil down to one salient issue: whether one should throw an algorithm at a problem in the belief that it will work, or whether one should look for a provably optimal algorithm. If you're a software engineer, the first approach is often the appropriate one because you want to get an answer as quickly as possible so you can move on to the next problem... optimisation and proof of correctness is something you can do later if you have time (or your bosses require it). But you often have no guarantees about the quality of your answer. If, however, you're concerned with accuracy at the start, the latter approach is usually more efficient in the long run.

If the OP just wants a quick and dirty solution then all of this discussion is moot. There are far quicker and dirtier methods than any proposed so far.

Regards,

Timkin

[Edited by - Timkin on May 28, 2008 8:38:44 PM]
Quote: Original post by Timkin
alvaro: the speed with which you can produce code is in no way a measure of the appropriateness of the algorithm to the problem.

The comment about how long it took me to write the code wasn't meant to prove anything. I just wanted to encourage other people to share the code to their solutions, since it doesn't take that much effort to put together something that will work in this very concrete example.

Quote: There's plenty of free code available online for all of the algorithms discussed here if anyone wishes to compare solutions empirically.

Except nobody has been explicit enough about the details of the other possible solutions: If we use GA, what are we using as genomes? How do we mutate or recombine genomes? If we use MML or a Bayesian approach (which in my view are optimization problems where the function to maximize is likelihood), what priors do we use and what probability distributions? I don't know enough about those methods to make good choices, so no amount of code available online will help me understand how I would apply those methods to this problem. By putting it in code, you have an opportunity to show us what a reasonable application of those methods to this case would look like. I am honestly interested in seeing that code.

Quote: Throwing your code at the problem won't answer any of the questions as to whether the problem is solvable using direct optimisation. How do you know that the answer it gives you is correct?

That depends on your definition of "correct". I inspected the results, and it looks like indeed it produces groupings where the averages are all close. I believe there is some theorem that can be used to prove that the probability of SA finding the global optimum tends to 1 as the number of iterations tends to infinity. Other than that, it's hard to guarantee anything.

The problem of finding a guaranteed-optimal solution is probably NP-hard. Consider that the 1-attribute 2-groups version of this problem looks very similar to the knapsack problem. That's why I think you are very unlikely to do better than SA.

Quote: Ultimately this discussion seems to boil down to one salient issue: whether one should throw an algorithm at a problem because they think it will work, or whether one should look for a provably optimal algorithm. If you're a software engineer, the first approach is often the appropriate one because you want to get an answer as quickly as possible so you can move on to the next problem... optimisation and proof of correctness is something you can do later if you have time (or your bosses require it). But you often have no guarantees about the quality of your answer. If, however, you're concerned with accuracy at the start, the latter approach is usually more efficient in the long run.

...unless you are presented with a problem that is NP-hard. In this case I did consider a couple of brute force approaches too. For this particular size of the problem, I think one could use dynamic programming and find the global optimum in the order of seconds (I thought it would be faster earlier, but I am not so sure anymore).
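
For this size, such a subset DP might look like the following sketch (my reconstruction, not code from the thread: it assumes groups of exactly 4 people and the same squared-error objective as the simulated annealing code above; __builtin_ctz is a GCC/Clang builtin):

#include <limits>
#include <vector>

// Sketch: exact DP over subsets for 20 people into 5 groups of 4,
// minimising the sum over groups of squared deviations of group sums
// from the per-group target. The equal-group-size constraint is an
// assumption; the original problem statement leaves sizes vague.
double dp_optimal(const double attributes[20][5]) {
  double target[5] = {0};
  for(int i=0;i<20;++i)
    for(int j=0;j<5;++j)
      target[j] += attributes[i][j]/5.0; // desired per-group sum

  const int FULL = (1<<20) - 1;
  const double INF = std::numeric_limits<double>::infinity();
  std::vector<double> best(1<<20, INF); // best[s]: min error covering set s
  best[0] = 0.0;

  for(int s=0;s<FULL;++s) {
    if(best[s] == INF) continue; // not reachable in canonical build order
    // The next group must contain the lowest-numbered unassigned person,
    // so each partition is built exactly once rather than 5! times.
    int first = __builtin_ctz(~s);
    for(int a=first+1;a<20;++a) { if(s>>a&1) continue;
      for(int b=a+1;b<20;++b)   { if(s>>b&1) continue;
        for(int c=b+1;c<20;++c) { if(s>>c&1) continue;
          double cost = 0.0;
          for(int j=0;j<5;++j) {
            double d = attributes[first][j] + attributes[a][j]
                     + attributes[b][j] + attributes[c][j] - target[j];
            cost += d*d;
          }
          int t = s | 1<<first | 1<<a | 1<<b | 1<<c;
          if(best[s] + cost < best[t]) best[t] = best[s] + cost;
        }
      }
    }
  }
  return best[FULL]; // minimum total error over all groupings
}

Recovering the actual assignment would need a second array of predecessor states alongside best[]; that bookkeeping is omitted to keep the sketch short.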

