Matching algorithms
Hmm, one more thing, which I didn't stress enough before: the two lists are roughly in order - so if the data was perfect, x[1] would match y[1], x[2] would match y[2], etc. Unfortunately, sometimes items are missing from one or the other, and more generally, the values in one list are biased an unknown amount relative to the other list.
A good analogy (in 2D rather than 1D) would be matching fingerprints: for a good match, the lines of the input and of the model it's being matched against will be largely superimposed, but with a degree of error either side. Also, the input sample is not going to be in exactly the same position in the image data as the training model - perhaps one is off-centre, for example. (This corresponds to my bias factor.)
Does that make sense? :)
November 29, 2005 11:54 AM
Could you approach it in a similar way to a diff algorithm? Sounds very similar to me.
Look up "dynamic time warping". Useful for roughly linear correlations between two sequences.
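Dynamic time warping can be sketched with the classic O(n·m) dynamic programming recurrence. This is a minimal, unoptimised version (my own names; it assumes numeric sequences and scores alignments by absolute difference):

```python
# Sketch of dynamic time warping (DTW) between two numeric sequences.
# dtw[i][j] holds the minimum cumulative cost of aligning x[:i] with y[:j],
# where the cost of pairing two items is their absolute difference.
def dtw_distance(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j],      # item in x has no partner
                                   dtw[i][j - 1],      # item in y has no partner
                                   dtw[i - 1][j - 1])  # pair the two items
    return dtw[n][m]
```

Because the warping path may skip or repeat items on either side, this tolerates missing entries in one list, which sounds like exactly the "roughly in order, but items sometimes missing" situation described above.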
Quote: Original post by Kylotan
Hmm, one more thing, which I didn't stress enough before: the two lists are roughly in order - so if the data was perfect, x[1] would match y[1], x[2] would match y[2], etc. Unfortunately, sometimes items are missing from one or the other, and more generally, the values in one list are biased an unknown amount relative to the other list.
Okay, that's a slightly different problem to the way I interpreted your initial post. Correlation algorithms don't typically care where in the list the information appears, only what information the data set contains (actually, that one data set contains information about the other... Mutual Information is a better way of understanding correlation, IMHO). So, my question is this: consider lists A and B, such that A = { x1, x2, x3, x4, ... } and B = { x2, x3, x4, ... }. What do you want to say about these lists? Do you want to say they are completely mismatched (because A(1) != B(1), A(2) != B(2), etc.), or that they are offset (B(i) = A(i+1)), or that they have a distribution of offsets, or something else entirely?
Cheers,
Timkin
I don't really want to say anything about the lists as a whole, just about the data items within them. In this case, I want to report that there are good matches for x2, x3, and x4, and no good match for x1. I'm not interested in the offset except insofar as it aids matching. Also, the offset I referred to is not necessarily an offset in the ordinal position of list members - although that can and does sometimes exist - but an offset between the actual data values being matched. e.g. I would want the list [1,2,100,101] to match [5,6,104,105]; here the offset looks like +4 to get the second set of values from the first. However, there are error values as well, so I'd want it to match [5,6,105,106] too, for example.
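One way to sketch this in code: estimate the unknown bias, then pair items within an error tolerance. This is only an illustration under my own assumptions - the function name, the median-based bias estimate, the greedy pairing, and the tolerance value are all mine, and it assumes numeric lists of similar length:

```python
# Hypothetical sketch: estimate the bias as the median of pairwise
# differences between same-rank items, then greedily pair each x with
# the nearest unused y within a tolerance. Unpaired items are reported
# as having no good match simply by their absence from the result.
def match_with_bias(xs, ys, tol=1.5):
    diffs = sorted(y - x for x, y in zip(xs, ys))
    bias = diffs[len(diffs) // 2]  # crude median estimate of the offset
    matches, used = [], set()
    for x in xs:
        best, best_err = None, tol
        for j, y in enumerate(ys):
            err = abs((x + bias) - y)
            if j not in used and err <= best_err:
                best, best_err = j, err
        if best is not None:
            used.add(best)
            matches.append((x, ys[best]))
    return bias, matches
```

For the example above, [1,2,100,101] against [5,6,104,105], this recovers a bias of +4 and pairs all four items; the tolerance is what lets noisy variants like [5,6,105,106] still produce matches.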
Assuming you're comparing strings, look up the Levenshtein Distance and Single/Double Metaphone functions.
I'm not comparing strings. :) The items are largely judged on two floating point values, one of which dictates the ordering, the other of which just has to 'match'.
This may not be the best solution, but...
From what you have described, I think you could use a cross-correlation algorithm. Cross-correlation will return a degree of similarity for any offset you try. For example, say you want to find [1, 2, 3, 4, 5, 6] in [1, 1, 1, 1, 4, 3, 3, 5, 6, 7, 1, 1, 1, 1].
Start at offset 0, then 1, 2, 3, 4, 5, 6, and so on; if currScore > bestScore then bestOffset = currOffset. Cross-correlation should tell you that the first list appears at offset 4 in the second list.
Noise in the data shouldn't be a major problem, as the correct match (even if not perfect) will have a higher similarity score than any other.
If list 1 is bigger than list 2, then search list 1 for an occurrence of list 2; otherwise, search list 2 for list 1.
It might help to normalize the blocks of data (as opposed to normalizing the entire list).
Will
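The offset search described above can be sketched as a sliding dot product, which is the discrete form of cross-correlation (a minimal sketch; the function name and the raw dot-product scoring are my own):

```python
# Sketch of a brute-force cross-correlation offset search: slide the
# shorter list across the longer one, score each offset by the dot
# product of the overlapping items, and keep the highest-scoring offset.
def best_offset(needle, haystack):
    best, best_score = 0, float("-inf")
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset:offset + len(needle)]
        score = sum(a * b for a, b in zip(needle, window))  # sliding dot product
        if score > best_score:
            best, best_score = offset, score
    return best
```

For the example in the post, this reports offset 4. Note that a raw dot product is biased towards windows with large values, which is one reason the suggestion to normalize each block of data (rather than the entire list) matters in practice.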