
Linear Regression/Gradient Descent Issue

Started by August 08, 2010 08:02 PM
8 comments, last by slack 14 years, 3 months ago
I'm getting interested in machine learning and felt great about everything I read. But when I construct a dummy dataset and test my implementation, everything explodes.

The dummy data points take the form:
{x,y}

I have a simple linear heuristic/hypothesis function, h, which takes a guess at y, given x:
h(x) = w0 + w1x
where w0 and w1 are the weights I'm trying to hone.

The cost function is:
C(w) = 0.5 * (h(x) - y)^2 (summed over the dataset)

The update scheme for w in gradient descent, as I understand it, is:
wi += rate * (d/dwi)C(w)
Where (d/dwi)C(w) is the partial derivative of C with respect to the weight wi.

(d/dwi)C(w) = (h(x) - y) xi (again summed over the dataset)

So, what I'm trying, unsuccessfully, is updating each weight component (some pseudocode):

for i = 0 to numWeights:
    sum = 0
    for j = 0 to numDataPoints:
        sum += ( h( dataset[ j ].x ) - dataset[ j ].y ) * dataset[ j ].x
    weights[ i ] += rate * sum

Obviously I'm missing something huge, as the weights explode to infinity very quickly, even with relatively small values of rate.

Any help is appreciated!
It looks like you got a sign wrong somewhere. I think it should be
wi -= rate * (d/dwi)C(w)
Thanks for the reply. Doesn't seem to really make a difference, though. Here's a test program; I'm building a perfectly linear set of data points:
#include <iostream>
#include <vector>

using namespace std;

struct DataPoint {
	float x;
	float y;
};

int main (int argc, char * const argv[]) {

	vector<DataPoint> dataset;

	int i;

	for( i = 0; i < 40; i++ )
	{
		DataPoint point;

		float scale = (float)( arc4random() % RAND_MAX ) / RAND_MAX;

		point.x = 600.0f + 4000.0f * scale;
		point.y = 60000.0f + 400000.0f * scale;

		dataset.push_back( point );
	}

	float w[ 2 ] = { 1.0f, 1.0f };

	float rate = 0.001f;

	//blindly run over 10 iterations
	for( i = 0; i < 10; i++ )
	{
		//over each weight
		for( int j = 0; j < 2; j++ )
		{
			//over each data point
			for( int k = 0; k < dataset.size(); k++ )
			{
				float h = w[ 0 ] + w[ 1 ] * dataset[ k ].x;
				float x = k == 0 ? 1.0f : dataset[ k ].x;

				w[ j ] -= rate * ( h - dataset[ i ].y ) * x;

				cout << "w" << j << ":  " << w[ j ] << endl;
			}
		}
	}

	return 0;
}

It makes sense when looking at it that it explodes, as the rather large x is a scaling term. I feel really dense considering how simple this is, but it's not working! Thoughts?
Given the size of the values you are handling, your rate is way too large: try 1e-8. Besides that, you have two bugs that keep your program from working:

				float x = k == 0 ? 1.0f : dataset[ k ].x;
				w[ j ] -= rate * ( h - dataset[ i ].y ) * x;


The `k' in the first line should be `j', and the `i' in the second line should be `k'.

Also, you don't want to be wondering if you are having precision issues, so I recommend using doubles instead of floats.
Thank you, alvaro! The size of the learning rate was the issue. Sorry for making you point out errors in the code I posted; I had written a whole framework for this and just wrote this quick, buggy, brass-tacks version for demonstrating my problem. Thanks again.

Is there a rule for determining a good learning rate?
Quote: Original post by newtomaths
Is there a rule for determining a good learning rate?


Instead of updating the weights as you see data, you could compute the true partial derivative of the error function with respect to each weight, and then update the weights at once. Then you would add a verification that the error was in fact reduced. If it was not, reduce the learning rate, say, to half its value.
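
A rough sketch of that idea in C++ (the struct and function names below are made up for illustration, not from any library or from the code in this thread):

#include <vector>

struct DataPoint { double x; double y; };

// Total error, as defined above: C(w) = 0.5 * sum of ( h(x) - y )^2
double cost( const std::vector<DataPoint>& data, const double w[ 2 ] )
{
	double c = 0.0;
	for( size_t k = 0; k < data.size(); k++ )
	{
		double r = w[ 0 ] + w[ 1 ] * data[ k ].x - data[ k ].y;
		c += 0.5 * r * r;
	}
	return c;
}

// One batch step: accumulate the full gradient over the dataset, try the
// update, and only keep it if the error actually went down; otherwise
// halve the learning rate and leave the weights alone this time.
void batchStep( const std::vector<DataPoint>& data, double w[ 2 ], double& rate )
{
	double grad[ 2 ] = { 0.0, 0.0 };
	for( size_t k = 0; k < data.size(); k++ )
	{
		double r = w[ 0 ] + w[ 1 ] * data[ k ].x - data[ k ].y;
		grad[ 0 ] += r;                // d/dw0
		grad[ 1 ] += r * data[ k ].x;  // d/dw1
	}

	double oldCost = cost( data, w );
	double trial[ 2 ] = { w[ 0 ] - rate * grad[ 0 ], w[ 1 ] - rate * grad[ 1 ] };

	if( cost( data, trial ) < oldCost )
	{
		w[ 0 ] = trial[ 0 ];
		w[ 1 ] = trial[ 1 ];
	}
	else
	{
		rate *= 0.5;  // the step overshot, so shrink it
	}
}

You'd call batchStep in a loop until the gradient (or the change in cost) gets small.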

Another thing that helps is normalizing the input (subtract the mean from each column and divide by its standard deviation). This would make the appropriate learning rate not depend on the scale of your data. You can always scale things back if you need to.
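
For example, something like this (again only a sketch; normalizeX is an invented name, and x is treated as the single input column):

#include <cmath>
#include <vector>

struct DataPoint { double x; double y; };

// Shift x to zero mean and unit standard deviation so one learning rate
// works regardless of the original units of the data.
void normalizeX( std::vector<DataPoint>& data, double& mean, double& stddev )
{
	mean = 0.0;
	for( size_t k = 0; k < data.size(); k++ )
		mean += data[ k ].x;
	mean /= data.size();

	double var = 0.0;
	for( size_t k = 0; k < data.size(); k++ )
		var += ( data[ k ].x - mean ) * ( data[ k ].x - mean );
	stddev = std::sqrt( var / data.size() );

	for( size_t k = 0; k < data.size(); k++ )
		data[ k ].x = ( data[ k ].x - mean ) / stddev;
}

// To express the fitted line on the original scale afterwards:
//   slope     = w1 / stddev
//   intercept = w0 - w1 * mean / stddev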

Quote: Original post by alvaro
Instead of updating the weights as you see data, you could compute the true partial derivative of the error function with respect to each weight, and then update the weights at once. Then you would add a verification that the error was in fact reduced. If it was not, reduce the learning rate, say, to half its value.


This would be a kind of line search. There's a nice list of various line search methods here in eqns 7-14.
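
As a concrete example, a backtracking (Armijo-style) search along the negative gradient tightens up the halve-the-rate idea from the earlier reply. This is only a sketch with invented names, reusing the same cost definition as the linear fit above:

#include <vector>

struct DataPoint { double x; double y; };

// C(w) = 0.5 * sum of ( h(x) - y )^2, as in the original post
double cost( const std::vector<DataPoint>& data, const double w[ 2 ] )
{
	double c = 0.0;
	for( size_t k = 0; k < data.size(); k++ )
	{
		double r = w[ 0 ] + w[ 1 ] * data[ k ].x - data[ k ].y;
		c += 0.5 * r * r;
	}
	return c;
}

// Start with a full step along the negative gradient and halve it until the
// cost drops by at least a small fraction of what the gradient predicts.
double backtrackingStep( const std::vector<DataPoint>& data, double w[ 2 ] )
{
	double grad[ 2 ] = { 0.0, 0.0 };
	for( size_t k = 0; k < data.size(); k++ )
	{
		double r = w[ 0 ] + w[ 1 ] * data[ k ].x - data[ k ].y;
		grad[ 0 ] += r;
		grad[ 1 ] += r * data[ k ].x;
	}

	const double c      = 1e-4;  // sufficient-decrease constant
	double       step   = 1.0;
	double       base   = cost( data, w );
	double       gradSq = grad[ 0 ] * grad[ 0 ] + grad[ 1 ] * grad[ 1 ];

	while( step > 1e-20 )
	{
		double trial[ 2 ] = { w[ 0 ] - step * grad[ 0 ], w[ 1 ] - step * grad[ 1 ] };
		if( cost( data, trial ) <= base - c * step * gradSq )
		{
			w[ 0 ] = trial[ 0 ];
			w[ 1 ] = trial[ 1 ];
			return step;  // accepted step length
		}
		step *= 0.5;
	}
	return 0.0;  // gradient is (numerically) zero, nothing to do
}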
A practical algorithm worth mentioning is the BFGS method, which builds an approximation to the Hessian matrix and, instead of simply following the gradient, performs a quasi-Newton iteration that converges much, much faster. Also check out the limited-memory version, L-BFGS.

Has anyone here used a library implementation of these algorithms?
awesome, lots of stuff to digest. thank you both.
ITK has implementations of some of these taken from earlier (Fortran?) code for use in image registration. Depending on what you're doing, learning ITK may be overkill, but you might glance at the code as a reference. There is a class called LBFGSBOptimizer.

This topic is closed to new replies.
