To calculate the error in backpropagation you would use (target_out - actual_out) * actual_out * (1 - actual_out).
So what does actual_out * (1 - actual_out) solve?
Wouldn't (target_out - actual_out) alone be the amount the output is incorrect by?
It is the derivative of the neuron's output with respect to its current activation level. If you are using the logistic sigmoid as the activation function, and f(x) is the sigmoid output for activation x, then the derivative df/dx is equal to f(x)(1 - f(x)).
In the backpropagation equation, to determine by how much you should change a weight, you need an estimate of how sensitive the output is to a change in activation. That is what this term provides.
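As a minimal sketch (assuming a single logistic output unit; the variable names here are illustrative, not from the original question):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

activation = 0.3                           # net input to the output neuron
out = sigmoid(activation)                  # actual output (actual_out)
target = 1.0                               # desired output (target_out)
# the error term combines "how wrong" with "how sensitive":
delta = (target - out) * out * (1 - out)
print(delta)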
What is the proper way to implement the following:
Task: query documents where voltage (float) equals 100.0 with a tolerance of -0% / +20%.
q=+voltage_f:[100.0 TO 120.0]
I want documents with a voltage near the lower bound (100.0) to get more points than documents with a voltage near the upper bound (120.0).
Or vice versa, with a tolerance of -20% / +0%:
q=+voltage_f:[80.0 TO 100.0]
I want documents with a voltage near the upper bound (100.0) to get more points than documents with a voltage near the lower bound (80.0).
You can use the recip function to get a value that starts at 1 and then tapers off as the value increases:
Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve. These properties can make this an ideal function for boosting more recent documents when x is rord(datefield).
For your first case, that could be implemented by using recip directly, and adjusting the a and b values to suit your needs.
For the second case you can use abs(sub(100, x)) as x, as that becomes larger as the value is further away from 100.
The recip call can be added in bf (for dismax, edismax) or in boost (edismax).
Your frontend will have to decide the proper values to use for a, b or m - you can use debugQuery=true to see how much the boost contributes and adjust accordingly.
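Concretely (untested sketches; the constants m=1, a=20, b=20 are assumptions you would tune), the first case could be boosted with

bf=recip(sub(voltage_f,100),1,20,20)

and the second case with

bf=recip(abs(sub(100,voltage_f)),1,20,20)

With a = b = 20, the boost is 1.0 when voltage_f is exactly 100.0 and falls to 0.5 when it is 20 away from it.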
I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the training set is also around 60% (whereas over-fitting typically means a very low training error and a high test error),
and I have a very large number of data examples (don't look at the figure, that's only a toy figure I uploaded).
Can anybody help me understand why adding an extra hidden layer gives me this drop in performance on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid function taking values between 0 and 1: L(s) = 1 / (1 + exp(-s))
I am using early stopping (after 40000 iterations of backprop) as the criterion to stop the learning. I know it is not the best way to stop, but I thought it would be ok for such a simple classification task; if you believe this is the main reason I'm not converging, I might implement some better criteria.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x).
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This derivative has a nonzero value for x near zero, but quickly goes to zero as |x| gets large.
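To see how quickly it vanishes (a quick numeric check, nothing more):

import math

def sigmoid_deriv(x):
    fx = 1.0 / (1.0 + math.exp(-x))
    return fx * (1 - fx)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_deriv(x))
# 0.0  0.25
# 2.0  ~0.105
# 5.0  ~0.0066
# 10.0 ~0.000045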
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest solution to approximate using some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using:
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term:
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually do even better than this by using "Nesterov's Accelerated Gradient":
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, you compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
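A minimal numpy sketch of both updates (assuming grad(w) computes delta_w(w); mu and learning_rate are hyperparameters you would have to tune):

import numpy as np

def momentum_step(w, direction, grad, learning_rate=0.01, mu=0.9):
    # classical momentum: blend the previous direction with the new gradient
    direction = mu * direction - learning_rate * grad(w)
    return w + direction, direction

def nesterov_step(w, direction, grad, learning_rate=0.01, mu=0.9):
    # Nesterov: evaluate the gradient at the look-ahead point w + mu * direction
    direction = mu * direction - learning_rate * grad(w + mu * direction)
    return w + direction, direction

# toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
grad = lambda w: 2 * w
w = np.array([1.0, -2.0])
direction = np.zeros_like(w)
for _ in range(100):
    w, direction = nesterov_step(w, direction, grad)
print(w)  # approaches [0, 0]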
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the matrix of second-order derivatives, called the Hessian, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
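The key trick is that a Hessian-vector product H·v can be computed without ever forming H. One simple way (a finite-difference sketch, not the exact method from the paper) is Hv ~ (grad(w + eps*v) - grad(w)) / eps:

import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-5):
    # finite-difference approximation of H*v using two gradient evaluations
    return (grad(w + eps * v) - grad(w)) / eps

# toy check on f(w) = w0^2 + 3*w1^2, whose Hessian is diag(2, 6)
grad = lambda w: np.array([2 * w[0], 6 * w[1]])
print(hessian_vector_product(grad, np.array([1.0, 1.0]), np.array([1.0, 1.0])))
# approximately [2. 6.]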
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.
I have read some other questions (and related answers) about this, but I still have doubts: will adding a bias to a threshold activation function change the threshold? As far as I know, adding a bias should move the activation function along the x-axis, so it should also change the threshold.
Let's say that we have only one input node and one output node, and the input node has a threshold activation function with the threshold set to 0. Now if we give 1 as input, the neuron will activate and return 1 * weight to the output node. But if we add a bias node a_0 = -1 with a weight of 2, connected to the input node, and give the previous input 1, the neuron won't activate anymore, because now we have to reach at least 2 to activate it. Can this be considered as "changing" the threshold or not?
Have you read these very good explanations about bias: bias explanation and bias explanation 2?
As is said in the first link, the bias shifts the curve, so the results of the calculation can be more varied. I think if you already use a bias, you don't need a threshold (set the threshold to 0), because both bias and threshold do the same thing: they shift the activation function along the x-axis.
But I think bias is much more efficient than threshold, because bias values are just weights and can be trained exactly like any other weight in the neural network, whereas threshold values require separate calculation apart from the weights. There is an interesting comparison of bias and threshold in the Encog forum.
This is a neural network calculated with a bias:
and this is one calculated with a threshold:
Both give the same result. If you are interested in the full calculation, you can read the Encog wiki linked above.
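The equivalence is easy to check directly (a sketch with made-up numbers): a step neuron with threshold t behaves exactly like a zero-threshold step neuron with an extra bias weight b = -t:

def step_with_threshold(x, w, t):
    # fires when the weighted input reaches the threshold
    return 1 if x * w >= t else 0

def step_with_bias(x, w, b):
    # same neuron with the threshold folded into a bias weight (threshold fixed at 0)
    return 1 if x * w + b >= 0 else 0

w, t = 1.0, 2.0
for x in [0.0, 1.0, 2.0, 3.0]:
    assert step_with_threshold(x, w, t) == step_with_bias(x, w, b=-t)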
So I think the answer to your question "does a bias change the threshold of an activation function" is yes. In my thesis on hybrid GA and NN I tried both and ended up only using a bias, with the threshold set to 0.
I hope my answer helps, but if you have any other questions about it, feel free to ask in the comments :)
I'm a tad confused here. I just started on the subject of neural networks, and the first one I constructed used step activation with thresholds on each neuron. Now I want to implement sigmoid activation, but it seems that this type of activation doesn't use thresholds, only weights between the neurons. Yet the information I find about this still mentions thresholds, and I can't find where they should be in the activation function.
Are thresholds used in a sigmoid activation function in neural networks?
There is no discrete jump as in step activation. The threshold could be considered to be the point where the sigmoid function is 0.5. Some sigmoid functions will have this at 0, while some will have it set to a different 'threshold'.
The step function may be thought of as a version of the sigmoid function that has the steepness set to infinity. There is an obvious threshold in this case, and for less steep sigmoid functions, the threshold could be considered to be where the function's value is 0.5, or the point of maximum steepness.
The sigmoid function's value is in the range [0, 1]; 0.5 is taken as the threshold: if h(theta) < 0.5 we assume its value is 0, and if h(theta) >= 0.5 then it's 1.
Thresholds are used only on the output layer of the network, and only when classifying. So, if you're trying to classify between 4 classes, the output layer has 4 nodes y = [y1, y2, y3, y4], and you'll use this threshold to assign each y[i] a 1 or a 0.
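For example (a sketch, assuming independent sigmoid outputs and made-up activations):

import math

# raw net inputs of the 4 output nodes (made-up numbers)
activations = [-1.2, 0.3, 2.0, -0.5]
outputs = [1 / (1 + math.exp(-s)) for s in activations]   # sigmoid values in [0, 1]
labels = [1 if y >= 0.5 else 0 for y in outputs]          # threshold at 0.5
print(labels)  # [0, 1, 1, 0]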
It doesn't need one. The sigmoid curve itself can partially act as a threshold.
Let's say you have 3 inputs: A, B, C. Can an artificial neural network (not necessarily feed forward) learn this pattern?
if C > k
    output is A
else
    output is B
Are there certain types of networks which can, or which are well suited for, this type of problem?
Yes, that's a relatively easy pattern for a feedforward neural network to learn.
I think you will need at least 3 layers, assuming sigmoid functions:
1st layer can test C>k (and possibly also scale A and B down into the linear range of the sigmoid function)
2nd layer can calculate A/0 and 0/B conditional on the 1st layer
3rd (output) layer can perform a weighted sum to give A/B (you may need to make this layer linear rather than sigmoid depending on the scale of values you want)
Having said that, if you genuinely know the structure of your problem and what kind of calculation you want to perform, then neural networks are unlikely to be the most effective solution: they are better in situations where you don't know much about the exact calculations required to model the functions / relationships.
If the inputs can be only zeros and ones, then this is the network:
Each neuron has a Heaviside step function as an activation function. The neurons y0 and z have bias = 0.5; the neuron y1 has a bias = 1.5. The weights are shown above the corresponding connections. When s = 0, the output z = d0. When s = 1, the output z = d1.
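The diagram isn't reproduced here, but one set of weights consistent with the described biases is sketched below (the exact weights are an assumption; the original figure showed them). The step fires when the weighted sum reaches the neuron's bias:

def step(x):
    # Heaviside step activation
    return 1 if x >= 0 else 0

def network(s, d0, d1):
    y0 = step(1.0 * d0 - 1.0 * s - 0.5)   # bias 0.5: passes d0 when s = 0
    y1 = step(1.0 * d1 + 1.0 * s - 1.5)   # bias 1.5: passes d1 when s = 1
    z = step(1.0 * y0 + 1.0 * y1 - 0.5)   # bias 0.5: OR of the two gates
    return z

# verify the multiplexer behaviour over all binary inputs
for s in (0, 1):
    for d0 in (0, 1):
        for d1 in (0, 1):
            assert network(s, d0, d1) == (d1 if s == 1 else d0)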
If the inputs are continuous, then Sigmoid, tanh or ReLU can be used as the activation functions of the neurons, and the network can be trained with the back-propagation algorithm.