
Matching algorithms

Started by November 24, 2005 08:02 AM
20 comments, last by Timkin 19 years ago
I am interested in finding out about any matching algorithms which can pair up items in one list with items in a second list, given a function that can return a degree of similarity between two items, and given that the lists may be of different sizes and some items may not match at all. To try every single permutation would run in factorial time and therefore be prohibitively slow on anything but short lists, so I'm hoping there is a good heuristic approach available. I suppose that ranking all possible matches and then using a stable matching algorithm might be one good approach?
You want to maximize the sum of the similarity(a,b) for all matched items?
Just an idea:
If the similarity function f(a, b) can be written as something like g(a) - g(b) + smallfactor(a, b), then you could sort each list by g and look only at close neighbours in the sorted order.
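A minimal sketch of that idea in Python. The greedy pairing, the threshold cutoff, and the function names are my own assumptions, not anything from the posts above:

```python
def match_by_key(list_a, list_b, g, threshold):
    """Pair items whose g-values differ by at most `threshold`.

    Both lists are sorted by g, then walked in parallel; items with no
    close partner simply stay unmatched. This only works if f(a, b)
    really is dominated by the difference of a single key function g.
    """
    a_sorted = sorted(list_a, key=g)
    b_sorted = sorted(list_b, key=g)
    pairs = []
    j = 0
    for a in a_sorted:
        # Skip b-items whose key is already too far below g(a).
        while j < len(b_sorted) and g(b_sorted[j]) < g(a) - threshold:
            j += 1
        if j < len(b_sorted) and abs(g(b_sorted[j]) - g(a)) <= threshold:
            pairs.append((a, b_sorted[j]))
            j += 1  # each b-item is used at most once
    return pairs
```

This runs in O(n log n) for the sorts plus a linear merge-style pass, instead of comparing every pair.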
Depending on what you are using it for you could also take a look at sequence alignment algorithms.
Yeah, sequence alignment is a better name for what I'm looking for, although there are not going to be any exact matches, just a lot of very close matches. Yes, that will be maximising the sum of the similarity, or minimising the sum of the difference, but I don't think that alone is sufficient, due to the presence of some data in one or both lists that may need to remain unmatched. The lists are already roughly sorted but the similarity function could be quite complex.
Surely there is an existing matching/permutation algorithm that minimizes the total least squares (TLS) of the differences. Try looking it up with that keyword. Do you need the absolute best match, or will a good local minimum do?
It may not be possible to score the differences, only to rank them. So I'm interested in more algorithms than just one that will numerically reduce differences, and also in how they will handle sequences of different lengths - what is the square of the difference between an item and a non-existent item, for example?
You can solve this problem exactly (with respect to some metric) in O(n^6) with a minimum-cost flow algorithm, where n = |a| + |b|. Maybe it's only O(n^5); I didn't try to find a tight upper bound.

Both lists can be of different length and arbitrary similarity relations are allowed.
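For what it's worth, the same "optimal matching where items may stay unmatched" problem can be sketched with the Hungarian method instead of a hand-rolled flow network, e.g. via SciPy's `linear_sum_assignment`, by padding the cost matrix with a dummy partner per item. The function name and the fixed per-item unmatched penalty are my own assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_unmatched(cost, unmatched_cost):
    """Minimum-cost matching between the rows and columns of `cost`
    (shape n_a x n_b), where any item may instead stay unmatched at a
    fixed penalty `unmatched_cost`.

    Builds an (n_a + n_b) square matrix: real costs top-left, one dummy
    partner per item on the diagonals, and free dummy-dummy pairings.
    """
    n_a, n_b = cost.shape
    big = float(np.max(cost)) + unmatched_cost + 1.0  # effectively forbidden
    m = np.full((n_a + n_b, n_a + n_b), big)
    m[:n_a, :n_b] = cost
    # Each a-item may only use its own dummy column, each b-item its own
    # dummy row; leftover dummies pair with each other for free.
    m[:n_a, n_b:] = np.where(np.eye(n_a, dtype=bool), unmatched_cost, big)
    m[n_a:, :n_b] = np.where(np.eye(n_b, dtype=bool), unmatched_cost, big)
    m[n_a:, n_b:] = 0.0
    rows, cols = linear_sum_assignment(m)
    # Keep only real-to-real pairings.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if r < n_a and c < n_b]
```

This gives the exact optimum for the chosen costs; the cubic-time Hungarian solver is much cheaper than a generic min-cost flow bound.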
Quote: Original post by Kylotan
what is the square of the difference between an item and a non-existent item, for example?


Usually it is a normalization constant in the minimization algorithm.

This sounds like a classic nearest neighbours search problem, which can be solved efficiently using a k-d tree representation of the data. Create one tree per list (where data is distributed based on intra-list similarity). Then, to find a match for each item: for each item not yet matched, use it as the query to find the nearest neighbour in the other list. You can choose to find the n nearest neighbours at minimal additional runtime cost.
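A minimal sketch of that query pattern in Python, assuming each item can be embedded as a point so that Euclidean distance stands in for dissimilarity (the function name and the `max_dist` cutoff for "no acceptable match" are my own assumptions):

```python
from scipy.spatial import cKDTree

def nearest_matches(points_a, points_b, max_dist):
    """For each point in `points_a`, return the index of its nearest
    neighbour in `points_b`, or None if the nearest one is farther than
    `max_dist` (the item is then considered unmatched)."""
    tree = cKDTree(points_b)           # build once, query many times
    dists, idxs = tree.query(points_a) # one nearest neighbour per query
    return [int(j) if d <= max_dist else None for d, j in zip(dists, idxs)]
```

Passing `k=n` to `query` would return the n nearest neighbours per item, as described above.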

Your problem sounds very much like a data set cross-correlation problem I solved a couple of years ago. It becomes more efficient to use nearest neighbour search than brute force when the number of items for comparison reaches something like 10^5 (if I recall the literature correctly).

If you get desperate for an implementation, drop me an email and I'll forward you some code you can play with.

Cheers,

Timkin
Would cross-correlation work for what you want?

I've used the Zero-mean normalized cross-correlation algorithm successfully to find similar chunks of data in two images.
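For reference, a minimal sketch of the zero-mean normalized cross-correlation (ZNCC) score for two equal-length 1-D signals; the image version applies the same formula to pixel patches. The function name is my own:

```python
import numpy as np

def zncc(x, y):
    """Zero-mean normalized cross-correlation of two equal-length
    signals: +1 for perfectly correlated, -1 for anti-correlated,
    near 0 for unrelated data."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 0.0  # a constant signal correlates with nothing
    return float(np.dot(x, y) / denom)
```

Because the means are subtracted and the result is normalized, the score is insensitive to brightness and contrast changes, which is why it works well for comparing image chunks.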

Will

------------------http://www.nentari.com

