Neural Network: calculating errors during backpropagation - artificial-intelligence

I'm using this article to implement a neural network with backpropagation, but having trouble calculating errors. In a nutshell, my sigmoid function is squashing all my node outputs to 1.0, which then causes the error calculation to return 0:
error = (expected - actual) * (1 - actual) * actual
^^ the (1 - actual) factor makes this a multiply by 0 when actual is 1.0
And so my error is always 0.
I suspect that the problem lies with my sigmoid implementation, which is returning 1.0, rather than asymptotically bounding below 1.0:
# ruby
def sigmoid(x)
1/(1+Math.exp(-x))
end
Am I correct that sigmoid should never actually reach 1.0, or have I got something else wrong?

In a mathematical context you are correct that sigmoid should never reach 1.0. However, in a practical programming context, Math.exp(-x) will eventually get so small that the difference between it and 0 is negligible, and you will get the 1.0 result. Depending on the range of x, this would not be a surprising result.
In order to use the sigmoid approach, you should make the sum of the incoming weights at each node approximately one. This will keep the output of the sigmoid in a reasonable range and allow your weights to converge more quickly.
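A small C sketch of the saturation described above (the test inputs are arbitrary values chosen only for illustration): in IEEE double precision, exp(-x) drops below machine epsilon somewhere around x ≈ 37, at which point 1/(1 + exp(-x)) rounds to exactly 1.0 and the (1 - actual) * actual factor collapses to 0.

/* Demonstrates sigmoid saturation in double precision. */
#include <math.h>
#include <stdio.h>

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

int main(void)
{
    double inputs[] = { 1.0, 10.0, 20.0, 30.0, 40.0 };   /* arbitrary test values */
    for (int i = 0; i < 5; i++) {
        double actual = sigmoid(inputs[i]);
        printf("x = %5.1f  sigmoid = %.17g  (1 - a) * a = %.3g\n",
               inputs[i], actual, (1.0 - actual) * actual);
    }
    return 0;
}

Keeping the weighted sum feeding each node small (for example by normalizing the incoming weights as suggested above) keeps x out of that flat region.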

Related

How to scale DFT output to 0.0 through 1.0

I'm trying to make a simple music visualization application. I understand I need to take my audio samples and perform a Fast Fourier Transform, and I'm trying to find out how to determine what the scale of the magnitude is, so I can normalize it to be between 0.0 and 1.0 for plotting purposes.
My application is set up to allow reading audio in 16-bit and 24-bit format, so I scale all incoming audio samples to [-1.0, 1.0), then I use a real-to-complex 1-dimensional transform for N samples.
From there, I think I need to take the absolute value of each bin (using the cabs function) between 0 and N/2, but I'm not sure what these numbers really represent or what I'm supposed to do with them.
I've figured out how to calculate the frequency of each bin. I'm not interested in finding the actual magnitude or amplitude in decibels; I really just want to get a value between 0.0 and 1.0.
Most explanations for fftw involve a lot of math that is honestly way above my head.
[Per comments, OP seeks to know the maximum possible magnitude of any output bin given inputs in [−1, 1]. This answer gives a way to determine that.]
DFT routines vary in how they handle scaling. Some normalize their output to keep the scale the same, and some let the arithmetic operations grow the scale for better performance or implementation convenience. So the possible scale of the output is not determined solely by mathematics; it depends on the routine used. The documentation of the routine ought to state what scaling it uses.
In the absence of clear documentation, you can determine the maximum output by writing a sine wave with amplitude one to the input (with a frequency matching one of the output bins), performing the transform, and then examining the output to see which bin has the largest magnitude (it should be the one whose frequency you used, of course). It will likely be 1 or N (the number of inputs), with some slop due to floating-point rounding effects.
(When plotting, be sure to allow a little leeway for floating-point rounding effects—the actual numbers could be slightly greater than the maximum, so avoid overflowing or clipping where you do not want that.)
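If the transform routine is FFTW's real-to-complex interface (the question mentions fftw), a sketch of that probe might look like the following; the sample count, bin index, and planner flag are arbitrary choices for illustration.

/* Probe the DFT scaling: transform a unit-amplitude sine and report the
 * peak bin magnitude. Compile with -lfftw3 -lm. */
#include <fftw3.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int N = 1024;                 /* number of input samples (arbitrary) */
    const int k = 5;                    /* bin whose frequency we feed in (arbitrary) */
    const double pi = acos(-1.0);

    double *in = fftw_malloc(sizeof(double) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (N / 2 + 1));
    fftw_plan plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);

    for (int n = 0; n < N; n++)         /* amplitude-1 sine at bin k */
        in[n] = sin(2.0 * pi * k * n / N);

    fftw_execute(plan);

    double max_mag = 0.0;
    for (int b = 0; b <= N / 2; b++) {  /* find the largest magnitude */
        double mag = hypot(out[b][0], out[b][1]);
        if (mag > max_mag) max_mag = mag;
    }
    printf("peak magnitude = %g for N = %d\n", max_mag, N);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

Dividing later magnitudes by the peak value found this way (leaving the small amount of headroom mentioned above) maps them into [0.0, 1.0].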

What is the best way to find an input for a function if you already know the output?

I'm working on a fairly complicated program here and unfortunately I've painted myself into a bit of a corner.
I have a function (let's call it f(x) for simplicity) that I know the output value of, and I need to find the input value that generates that output value (to within a certain threshold).
Unfortunately the equations behind f(x) are fairly complicated, and I don't have all the information I need to simply run them in reverse, so I'm forced to perform some sort of brute-force search to find the right input value instead.
The outputs for f(x) are guaranteed to be ordered, in such a way that f(x - 1) < f(x) < f(x + 1) is always true.
What is the most efficient way to find the value of x? I'm not entirely sure if this is a "root finding" problem- it seems awfully close, but not quite. I figure there's gotta be some official name for this sort of algorithm, but I haven't been able to find anything on Google.
I'm assuming that x is an integer, in which case the condition f(x - 1) < f(x) < f(x + 1) means that the function is strictly monotonic (increasing).
I'll also assume your function is not pathological, such as
f(x) = x * cos(2 * pi * x)
which satisfies your property but has all sorts of nasties between integer values of x.
A linear bisection algorithm is appropriate and tractable here (and you could adapt it to functions which are badly behaved for non-integral x), though Brent might recover the solution faster. Such algorithms may well return a non-integral value of x, but you can always check the integers on either side and return the best one (that will work if the function is monotonic over all real values of x). Furthermore, if you have an analytic first derivative of f(x), then an adaptation of Newton-Raphson might work well, constraining x to be integral (which might not make much sense, depending on your function; it would be disastrous to apply it to the pathological example above!). Newton-Raphson is cute since you only need one starting point, unlike linear bisection and Brent, which both require the root to be bracketed.
Do Google the terms that I've italicised.
Reference: Brent's Method - Wikipedia
For a general function, I would do the following:
Evaluate f(0) and compare it with the target output to determine whether x is positive or negative.
(Assuming positive) evaluate f at powers of 2 (1, 2, 4, 8, ...) until you bracket the target value, as in the sketch below.
Once you have bounds, do repeated bisection until you reach the precision you are looking for.
If this is being called multiple times, I would cache the values along the way to reduce the time needed for subsequent operations.
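Putting those steps together, here is a minimal C sketch for an increasing integer-valued search; f() and the target below are stand-ins for the real function and the known output.

#include <stdio.h>

static double f(long x) { return 3.0 * x - 7.0; }   /* stand-in for the real function */

/* Assumes f is monotonically increasing and that the answer is >= 0;
 * a negative answer would be bracketed by doubling downward instead. */
static long find_input(double target)
{
    long lo = 0, hi = 1;
    while (f(hi) < target) {            /* grow the bracket: 1, 2, 4, 8, ... */
        lo = hi;
        hi *= 2;
    }
    while (hi - lo > 1) {               /* repeated bisection on integers */
        long mid = lo + (hi - lo) / 2;
        if (f(mid) < target)
            lo = mid;
        else
            hi = mid;
    }
    /* return whichever endpoint is closer to the target */
    return (target - f(lo) <= f(hi) - target) ? lo : hi;
}

int main(void)
{
    printf("x = %ld\n", find_input(290.0));         /* the stand-in f gives x = 99 */
    return 0;
}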

Non-converging Neural Network in C

I wrote my first feed-forward neural network in C, using the sigmoid 1.0 / (1.0 + exp(-x)) as activation function and gradient descent to adjust the weights. I tried to approximate sin(x) to make sure my network works. However, the output of the neuron on the output layer seems to always oscillate between the extreme values 0 and 1 and the weights of the neurons grow to absurd sizes, no matter how many hidden layers there are, how many neurons are in the hidden layer(s), how many training samples I provide, or even what the target outputs are.
1) Are there any standard 'tried and tested' data sets used to proof-test neural networks for errors? If yes, what structures work best (e.g. numbers of neuron(s) in the hidden layer) to converge to the desired output?
2) Are there any common errors that generate the same symptoms? I found this thread, but the issue was because of faulty data, which I believe is not my case.
3) Is there any preferred way of training the network? In my implementation I cycle through the training sets and adjust the weights each time, then rinse and repeat ~1000 times. Is there any other order that works better?
So, to sum up:
Assuming that your gradient propagation works properly, the values of parameters like topology, learning rate, batch size, or the weight-penalty constant (L1 and L2 decay) are usually chosen using techniques called grid search or random search. It has been shown empirically that random search performs better at this task.
The most common cause of weight divergence is a wrong learning rate. A large value can make learning very unstable, while a learning rate that is too small can make the learning process take a really long time. Usually you should babysit the learning phase. Specific instructions can be found e.g. here.
In your learning phase you used a technique called SGD (stochastic gradient descent). It can achieve good results, but it is vulnerable to the variance of the data set and to large learning rates. What I advise is to use batch learning and treat the batch size as an additional parameter learnt during grid or random search (see the sketch after this answer). You can read about it e.g. here.
Another thing you might consider is changing your activation function to tanh or ReLU. The saturation regions of the sigmoid cause a lot of problems, and it usually needs proper initialization. You can read about it here.
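To make the batch-learning and activation suggestions concrete, here is a minimal C sketch, assuming a squared-error loss and a single layer of weights purely so the update pattern stays visible; the sizes and names are illustrative, not taken from the question.

#include <math.h>
#include <stddef.h>

#define N_IN       4      /* inputs per sample (illustrative) */
#define BATCH_SIZE 16     /* mini-batch size, itself a tunable hyperparameter */

static double activate(double x)       { return tanh(x); }
static double activate_deriv(double y) { return 1.0 - y * y; }  /* derivative in terms of y = tanh(x) */

/* One mini-batch update: accumulate per-sample gradients, then apply a
 * single averaged step with the learning rate. */
void train_batch(double w[N_IN],
                 const double x[BATCH_SIZE][N_IN],
                 const double target[BATCH_SIZE],
                 double learning_rate)
{
    double grad[N_IN] = { 0.0 };

    for (size_t s = 0; s < BATCH_SIZE; s++) {
        double sum = 0.0;
        for (size_t i = 0; i < N_IN; i++)
            sum += w[i] * x[s][i];
        double out = activate(sum);
        double delta = (out - target[s]) * activate_deriv(out);  /* squared-error gradient */
        for (size_t i = 0; i < N_IN; i++)
            grad[i] += delta * x[s][i];
    }

    for (size_t i = 0; i < N_IN; i++)   /* one averaged step per batch */
        w[i] -= learning_rate * grad[i] / BATCH_SIZE;
}

Averaging over the batch damps the sample-to-sample variance that single-sample SGD is exposed to, and tanh keeps the outputs zero-centered, which the sigmoid does not.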

Gaussian Distribution little help in C

I am trying to generate the Gaussian Distribution using Method 2 and Method 3 described here:
http://c-faq.com/lib/gaussian.html
The problem is that I am a little confused: I have a sigma and a mean, and 100 numbers in the range 0 to 1, but these methods just return values for the interval 0 to 1, and the sigma and mean values are not used anywhere.
Can anyone help me how can I generate a Gaussian distribution using these methods?
The routines you linked to give random numbers (selected from a Gaussian (normal) distribution) with mean 0 and standard deviation 1. That's the usual way routines like these work.
It's quite easy to change that to any other mean and sd: you just multiply by the sd and then add the mean.
So, for example, if x was generated by one of the routines above, then
y = 0.5 * x + 0.1
will have a standard deviation of 0.5 and a mean of 0.1.
So you don't need a separate routine for each combination of mean and sd. You just use the routines as given and then do the extra conversion.
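For example, here is a self-contained C sketch using the polar (Marsaglia) form of the Box-Muller transform, which is in the same spirit as the linked methods, followed by the scaling described above; rand() is used only to keep the example self-contained, so substitute whatever uniform generator you prefer.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double uniform01(void) { return rand() / (double)RAND_MAX; }

/* Standard normal deviate (mean 0, standard deviation 1). */
static double gaussian(void)
{
    double u, v, s;
    do {
        u = 2.0 * uniform01() - 1.0;
        v = 2.0 * uniform01() - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    return u * sqrt(-2.0 * log(s) / s);
}

int main(void)
{
    double mean = 0.1, sd = 0.5;                  /* the values from the example above */
    for (int i = 0; i < 5; i++)
        printf("%f\n", sd * gaussian() + mean);   /* y = sd * x + mean */
    return 0;
}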

Is there a version of backward error propogation where the output is a probability?

I want to create a feed-forward neural network where the outputs in the training data are boolean, 0 or 1.
In use, however, I want the output to be the expected probability that that input would have produced a 0 or a 1.
Given that the most common forms of backprop employ a sigmoid activation function, 1 / (1 + exp(-x)), it seems unlikely that this would result in actual probabilities as output (the sigmoid curve doesn't really seem like a "probabilistic" function - sorry, I know this is hand-wavey).
Or perhaps I'm wrong. Can the outputs of a feed-forward neural net, using sigmoid activation functions and trained using backprop, be safely treated as the actual probability of getting a 1 as opposed to a 0?
Yes, this will work in the way that you want it if you use a standard sigmoid activation function.
The maths that proves this is a little complicated, but it effectively boils down to the fact that you are training the sigmoid function to generate the average value of the output training set (this is a consequence of using the squared error function in normal backprop). Since the two possible values are 0 and 1, the average value is therefore exactly equal to the probability of getting a 1.
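For a fixed input, that argument fits in two lines. If p = P(target = 1), the expected squared error of an output value y is E[(y - t)^2] = p * (y - 1)^2 + (1 - p) * y^2; setting the derivative with respect to y to zero gives 2p(y - 1) + 2(1 - p)y = 0, i.e. y = p. So the output that minimizes the usual backprop criterion is exactly the probability of getting a 1 (assuming the network is flexible enough to reach that minimum).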
