Kohonen Neural Network Problem, ideas please
Hello,
I am currently using a Kohonen neural network for a character recognition problem. I have a decaying neighborhood function and a decaying learn rate function, and I use Euclidean distance as my distance function. Both decay functions are linear.
The problem comes when I try to recognize 26 different patterns (each a 7 x 5 matrix): I get a result such as
A B C D E F G H I K L M N P Q R S T T U V V W X Y Z
I am missing J and O but have two V's and two T's, yet neither J nor O looks anything like V or T. (Edit: actually, the J and T do look very similar.)
So my question is whether anyone has an idea where the problem lies. I know that is a vague question, but this may be a common occurrence with Kohonen networks.
Also, my neighbor function for 1 neighbor just looks at the node to the left and right of the winner. One change I am considering is to find the 2 closest nodes (by Euclidean distance) and use those as the neighbors instead of the nodes that are physically adjacent in the array.
Also, both my neighbor and learn rate functions are linear; does anyone have suggestions for non-linear functions that might work better?
Any help, suggestions or anything would be greatly appreciated.
I'm afraid that the information you give isn't adequate to formulate a response. Could you give more detail regarding how you trained the network, what you're trying to accomplish, and how you're trying to get a prediction? Any details on the nature of the data set - representation and distance metric - would be helpful as well.
-Kirk
Quote: Original post by kirkd
I'm afraid that the information you give isn't adequate to formulate a response. Could you give more detail regarding how you trained the network, what you're trying to accomplish, and how you're trying to get a prediction? Any details on the nature of the data set - representation and distance metric - would be helpful as well.
-Kirk
I had assumed so.
First, right now I am just trying to get my output weights to "almost" match my test set. If it works correctly, I should get back the entire alphabet when I give it one example of each character (26 inputs, 26 outputs). I am not trying to do any drawing or anything else.
I train the network with an input set of 26 [5 x 7] matrices. The values in the test patterns are just 0s and 1s. Two examples are:
00100 10001
01110 10001
01010 10001
11111 11111
11011 10001
10001 10001
10001 10001
Each output node's weight vector holds double values from 0 to 1 and is the same size as the input matrices.
Training runs through anywhere between 800 and 4000 iterations (where an iteration means presenting a single test input once; in my code an actual iteration presents every test input once).
The output weights are completely randomized at the start of the training process.
The neighborhood radius starts at the length of my test set / 2, so that initially every single neuron gets adjusted:
original neighbor = test size / 2
The radius then shrinks with the iteration count (I called it linear above, but the formula is actually an exponential decay):
neighbor(t) = original neighbor * exp(-t / (total iterations / log(test size)))
My learn rate usually starts at some value between 0.5 and 1.00 (depending on what I want) and decays over the iterations in exactly the same way as the neighborhood radius. In addition, the adjustment is scaled down according to how far the neuron being adjusted is from the winning neuron.
That scaling is: degrade = exp(-(distance from winner)^2 / (2 * neighbor function))
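Written out as code, the decay functions look roughly like this (just a sketch of the formulas above in Python, not my actual code; the names are placeholders, and rate0 stands for whatever starting learn rate I pick):

import math

def neighborhood_radius(t, total_iters, test_size):
    # starts at test_size / 2 and decays exponentially with the iteration count t
    sigma0 = test_size / 2.0
    lam = total_iters / math.log(test_size)
    return sigma0 * math.exp(-t / lam)

def learning_rate(t, total_iters, test_size, rate0=0.7):
    # starts at a user-chosen value and decays the same way as the radius
    lam = total_iters / math.log(test_size)
    return rate0 * math.exp(-t / lam)

def degrade(dist_from_winner, radius):
    # falls off with distance from the winning neuron
    return math.exp(-(dist_from_winner ** 2) / (2.0 * radius))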
So my final rule for adjusting each weight of some neuron is:
change = degrade * learn rate * (value at this position in the test input - value at this position in the winning neuron)
If any weight goes above 1 or below 0, it is simply clamped to 1 or 0.
Hopefully this is enough information. A neighbor is defined as just the neuron to the left or right of the winning neuron (in an array). If the winner is at an endpoint, the opposite endpoint is considered its neighbor, so the array wraps around.
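To make that concrete, here is a rough sketch of one training step (again just an illustration of the idea, not my actual code; the radius and rate would come from the decay functions above). Note that this sketch uses the textbook Kohonen update, where the difference is taken against the weights of the neuron currently being adjusted rather than the winner's:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_unit(weights, pattern):
    # the node whose weight vector has the smallest Euclidean distance to the input
    return min(range(len(weights)), key=lambda i: euclidean(weights[i], pattern))

def train_step(weights, pattern, radius, rate):
    n = len(weights)
    winner = best_matching_unit(weights, pattern)
    for j in range(n):
        # ring topology: distance to the winner is the shorter way around the array
        d = min(abs(j - winner), n - abs(j - winner))
        if d > radius:
            continue
        degrade = math.exp(-(d ** 2) / (2.0 * radius))
        for k in range(len(weights[j])):
            # pull this neuron's weights toward the input, then clamp to [0, 1]
            weights[j][k] += degrade * rate * (pattern[k] - weights[j][k])
            weights[j][k] = min(1.0, max(0.0, weights[j][k]))
    return winner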
I based how my program works on my understanding of this demo: http://www.psychology.mcmaster.ca/4i03/demos/competitive-demo.html
If you need any other information I can easily provide it.
Have you gotten this code to work on a smaller problem (one with fewer classes)?
Let me verify some of your information:
The topology of your network is one-dimensional. Each node in the map has two neighbors and the map loops at the ends. How many nodes do you have? 26?
Your input representation is the same as the node representations (as would be expected) and consists of a 5x7 bit map. Each letter is represented by a unique bitmap.
Your learning, neighborhood, and decay functions seem appropriate. There are so many possible variations of these that you're probably better off just tweaking these as necessary.
The only question I have remaining is how do you determine distances between an input pattern and a node? I presume you have to find a best matching unit, but how do you define that distance in order to find the best?
I assume that you're assigning an identity to each node after training based upon which input pattern maps to that particular node. Then when a test pattern is input you take the label of the node as the prediction.
Given what you've said, I'm not too surprised that you lose the J and get two T's as J and T are so close they are likely neighbors, but I'm surprised that you get two V's. It would be helpful to know which letter maps to which prediction - what is O predicted to be by the map?
-Kirk
Predictor, I got it working with A through G 100% of the time; however, A through I works maybe 40% of the time, and anything beyond that is near 0%.
Quote: Let me verify some of your information:
Quote: The topology of your network is one-dimensional. Each node in the map has two neighbors and the map loops at the ends. How many nodes do you have? 26?
This is all correct.
Quote: Your learning, neighborhood, and decay functions seem appropriate. There are so many possible variations of these that you're probably better off just tweaking these as necessary.
That was one of the areas I figured could use some tweaking, but I am not sure what to change, except perhaps making the slope steeper through the middle of the neighborhood range.
Quote: The only question I have remaining is how do you determine distances between an input pattern and a node? I presume you have to find a best matching unit, but how do you define that distance in order to find the best?
I use Euclidean distance, so whichever node has the smallest distance (closest to 0) is considered the best match. I have read that since I have an equal number of inputs and outputs it should take almost no iterations. Do I need some kind of error check or threshold for an "acceptable" best match? Or should I change my range from 0 to 1 to -1 to 1?
Quote: I assume that you're assigning an identity to each node after training based upon which input pattern maps to that particular node. Then when a test pattern is input you take the label of the node as the prediction.
That is correct. I give each test pattern a character label, and that is how I decide which character each node is considered to represent.
Quote: Given what you've said, I'm not too surprised that you lose the J and get two T's as J and T are so close they are likely neighbors, but I'm surprised that you get two V's. It would be helpful to know which letter maps to which prediction - what is O predicted to be by the map?
I do see how J and T are similar, since J is pretty much a T with a few extra cells filled in.
My O is generally recognized as a Q, which makes it the closest match. My D and B are considered neighbors of Q, which also makes sense.
On this test I got "A B C D E F F G H I J K M N Q R S T U V W X Y Y Y Z", which seemed rather weird; it is missing V and O but has three Y's.
It seems to me that two letters that are "very" similar end up modifying the same node instead of neighboring nodes (which is what they should eventually do).
Here are two results; the matrices shown are the duplicates (except for L). A completely filled-in cell means the value is between .9 and 1.0, and anything partially filled in is not.
http://img19.imageshack.us/my.php?image=imagec.png
EDIT:
I decided to add a method that finds out which test inputs are not actually winning any output node and which neurons are never firing. Based on those, I work out which unused neuron would be the best fit for some test input, and whichever input is closest to an unused neuron gets that neuron pulled toward it automatically. Hopefully this makes sense. In the end it actually worked; the program runs this step once every 20 iterations. However, I don't know if this is "cheating", since it seems to me that this is now more supervised learning than unsupervised. It also doesn't fix the overall problem, since I would like to have a test input set 4x larger than my output size (4 different writing styles of each character). If anyone can give input on any of this it would be great.
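Roughly, that step looks like this (a simplified sketch of the idea, not my actual code; pull is just an arbitrary factor for how hard an unused neuron gets dragged toward an orphaned input):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reassign_unused_neurons(weights, patterns, pull=0.5):
    # which neuron each test input currently maps to
    winners = [min(range(len(weights)), key=lambda j: euclidean(weights[j], p))
               for p in patterns]
    unused = [j for j in range(len(weights)) if j not in set(winners)]
    # inputs that share a winner with another input have no neuron of their own
    orphans = [p for p, w in zip(patterns, winners) if winners.count(w) > 1]
    for p in orphans:
        if not unused:
            break
        # pull the closest unused neuron toward the orphaned input
        j = min(unused, key=lambda j: euclidean(weights[j], p))
        unused.remove(j)
        weights[j] = [wk + pull * (pk - wk) for wk, pk in zip(weights[j], p)]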
[Edited by - Lothia on March 12, 2009 4:14:31 PM]
Regarding your training functions, I would suspect that steeper is not necessarily better. I suggest making them less steep and see how that works. What I've found for 2D maps (rectangular or hexagonal topologies) is that best results are obtained when the neighborhood is initially set to half the width of the map at its widest point. This decays to zero (much as your equations do) over a time period that allows each input to be presented to the network at least 100 times.
Now that I look at your equations more closely, do both the learning rate and neighborhood function degrade over time?
My guess is that what you are finding here are dead nodes, or nodes that are never found to be a best matching unit. The problem arises when you have two neighbors in the input space that are actually quite far apart. A sequence of three nodes in your case is likely modeling the two inputs - one node on each end for the inputs and one in the middle which is pulled between the two equally and thus never migrates to one region of the input space. This node in the middle makes a bridge between the other two but is never found as a best matching unit as it occupies an empty region of your input space that is too far away from any input.
Another possible problem is that you may be getting topological distortions in your map. In a 2D map, this looks like a twist in the middle of the map rather than a nice 2D grid. In your case, you may have nodes out of order - difficult to explain. Suppose that your map has neighbors like this:
10 <- 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 -> 1
But once the modeling takes place and you put the nodes in the same order as the input data occur in the input space you get this:
10 <- 1 - 2 - 3 * 6 - 5 - 4 * 7 - 8 - 9 - 10 -> 1
At the *s you have the same neighbor connections as above, but in the actual data space you end up with a twist. I hope that makes sense. The only thing I've found to prevent this situation is a much less steep decay of the learning rate and neighbor function and more iterations.
I know that it's popular to map rasterized images to rasterized images in SOMs, the way you are, but is there any reason not to extract features from the input cases, instead? Reducing the size of the input space, if done carefully, can simplify recognition problems like this one. Right now, you have 35 input variables (the 5x7 raster array). Using the horizontal and vertical projection profiles (row and column averages), though, would get you down to 12 input variables (5 vertical averages + 7 horizontal ones). Even throwing in a few more features, you could probably cut the number of input variables in half and still retain the most important information in the images.
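As a rough sketch of what I mean (Python here; bitmap is assumed to be a 7-row by 5-column list of 0s and 1s):

def projection_profiles(bitmap):
    # bitmap: 7 rows x 5 columns of 0/1 values
    rows, cols = len(bitmap), len(bitmap[0])
    # one average per row (7 horizontal values) and one per column (5 vertical values)
    horizontal = [sum(row) / cols for row in bitmap]
    vertical = [sum(bitmap[r][c] for r in range(rows)) / rows for c in range(cols)]
    return horizontal + vertical  # 12 features instead of 35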
-Will Dwinnell
Data Mining in MATLAB
Quote: Original post by kirkd
Regarding your training functions, I would suspect that steeper is not necessarily better. I suggest making them less steep and see how that works. What I've found for 2D maps (rectangular or hexagonal topologies) is that best results are obtained when the neighborhood is initially set to half the width of the map at its widest point. This decays to zero (much as your equations do) over a time period that allows each input to be presented to the network at least 100 times.
I will try a less steep approach, but I need to figure out how to do that. Should the learn rate still be close to 0 when training is almost finished? Or should one just put in more iterations?
Quote: Now that I look at your equations more closely, do both the learning rate and neighborhood function degrade over time?
Yes, they both decay at the same speed, except that the neighborhood starts at the length of the input set / 2 while the learn rate starts at whatever the user specifies.
Quote: My guess is that what you are finding here are dead nodes, or nodes that are never found to be a best matching unit. The problem arises when you have two neighbors in the input space that are actually quite far apart. A sequence of three nodes in your case is likely modeling the two inputs - one node on each end for the inputs and one in the middle which is pulled between the two equally and thus never migrates to one region of the input space. This node in the middle makes a bridge between the other two but is never found as a best matching unit as it occupies an empty region of your input space that is too far away from any input.
Another possible problem is that you may be getting topological distortions in your map. In a 2D map, this looks like a twist in the middle of the map rather than a nice 2D grid. In your case, you may have nodes out of order - difficult to explain. Suppose that your map has neighbors like this:
10 <- 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 -> 1
But once the modeling takes place and you put the nodes in the same order as the input data occur in the input space you get this:
10 <- 1 - 2 - 3 * 6 - 5 - 4 * 7 - 8 - 9 - 10 -> 1
At the *s you have the same neighbor connections as above, but in the actual data space you end up with a twist. I hope that makes sense. The only thing I've found to prevent this situation is a much less steep decay of the learning rate and neighbor function and more iterations.
I actually understand this quite well, as I think it is exactly the problem I am having. I found that on average, at the end, I have 2 or 3 nodes that map to the same output and only 1 or so neurons that are never chosen (not sure how those numbers can differ, but they have).
At least my learn rate / neighborhood decay functions look correct (or at least I haven't been told otherwise).
Quote: Right now, you have 35 input variables (the 5x7 raster array). Using the horizontal and vertical projection profiles (row and column averages), though, would get you down to 12 input variables (5 vertical averages + 7 horizontal ones). Even throwing in a few more features, you could probably cut the number of input variables in half and still retain the most important information in the images.
I had not thought of doing something like this. I am wondering how I could keep each section distinct. How would I add up the values so that a horizontal line of 01001 looks different from 10010? Also, my actual goal is to have 112 or 138 inputs with 28 outputs; would what you are suggesting still work there?
Quote: Should the learn rate still be close to 0 when training is almost finished? Or should one just put in more iterations?
More iterations should do the trick. That has been my experience in the past.
Quote: I actually understand this quite well, as I think it is exactly the problem I am having. I found that on average, at the end, I have 2 or 3 nodes that map to the same output and only 1 or so neurons that are never chosen (not sure how those numbers can differ, but they have).
The solution here is to prevent that dead node from being pulled between two distant classes/clusters and ending up in the middle, too far from any input patterns to be used. You can do this by severing connections between nodes. One method is to assign a freshness value to every link and decrement the values for all links during each iteration. For each training pattern, find the best matching unit and the second best matching unit, and reset (refresh) the freshness value of the link that connects those two units. After a set number of iterations, remove all links whose freshness falls below a certain value. This takes some tweaking to prevent all the links from disappearing, or to get anything to happen at all, but it can alleviate some of the problems of dead nodes and twisted maps.
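In rough pseudocode (Python here; this is essentially the edge-aging idea from the growing methods linked below, and the decay, reset, and threshold values are arbitrary placeholders that would need the tweaking I mentioned):

def age_and_refresh(freshness, bmu1, bmu2, decay=1, reset=50):
    # freshness maps a pair of node indices (a link) to a counter
    for link in freshness:
        freshness[link] -= decay          # age every existing link
    key = (min(bmu1, bmu2), max(bmu1, bmu2))
    freshness[key] = reset                # refresh the link between the two best matching units

def prune_stale_links(freshness, threshold=0):
    # run after a set number of iterations: drop links that were never refreshed
    for link in [l for l, f in freshness.items() if f <= threshold]:
        del freshness[link]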
Below are links to a demo and a report showing some of these ideas. The growing SOM methods are particularly interesting, I think.
Fritzke Demo
Some Competitive Learning Methods report
More Fritzke info