Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
There are some related questions that I've come across (like this, this, this, and this) but they all deal with fitting data to a known curve. Is there a way to fit given data to an unknown curve? By which I mean, given some data the algorithm will give me a fit which is one function or a sum of functions. I'm programming in C, but I'm at a complete loss on how to use the gsl package to do this. I'm open to using anything that can (ideally) be piped through C. But any help on what direction I should look will be greatly appreciated.
EDIT: This is basically experimental (physics) data that I've collected, so the data will have some trend modified by additive gaussian distributed noise. In general the trend will be non-linear, so I guess that a linear regression fitting method will be unsuitable. As for the ordering, the data is time-ordered, so the curve necessarily has to be fit in that order.
You might be looking for polynomial interpolation, in the field of numerical analysis.
In polynomial interpolation, given a set of points (x,y), you look for the polynomial that passes through those points. One way to do it is Newton interpolation, which is fairly easy to program.
Numerical analysis, and interpolation in particular, is widely studied, and you might be able to get a useful upper bound on the error of the polynomial.
Note, however, that because you are fitting a polynomial and the underlying function is not really a polynomial, the error grows rapidly as you move away from your initial training set.
Also note that your data set is finite, and there are infinitely many (in fact uncountably many) functions that fit the data, exactly or approximately, so which of them is best depends on what you are actually trying to achieve.
If you are looking for a model to fit your data, note that linear regression and polynomial interpolation are at opposite ends of the scale: polynomial interpolation may overfit, while linear regression may underfit. Which one to use is case specific and varies from one application to another.
Simple polynomial interpolation example:
Let's say we have (0,1),(1,2),(3,10) as our data.
The table (1) we get using Newton's method is:
x | y  | 1st differences | 2nd differences
0 | 1  |                 |
1 | 2  | (2-1)/(1-0)=1   |
3 | 10 | (10-2)/(3-1)=4  | (4-1)/(3-0)=1
Now, the polynomial we get is the "diagonal" that ends with the last element:
1 + 1*(x-0) + 1*(x-0)(x-1) = 1 + x + x^2 - x = x^2 + 1
(and that is a perfect fit indeed to the data we used)
(1) The table is created recursively: the first 2 columns are the x,y values, and each following column is computed from the previous one. It is really easy to implement once you get it; the full explanation is on the Wikipedia page for Newton interpolation.
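To make the table construction concrete, here is a minimal Python sketch of Newton's divided differences, run on the three points from the example above. The function names divided_differences and newton_eval are my own, not from any library, and the code assumes all x values are distinct.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients (the table's diagonal)."""
    coeffs = list(ys)
    n = len(xs)
    for level in range(1, n):
        # Work bottom-up so the lower-order entries are still available.
        for i in range(n - 1, level - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - level])
    return coeffs


def newton_eval(xs, coeffs, x):
    """Evaluate the Newton-form polynomial at x by nested multiplication."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result


xs, ys = [0, 1, 3], [1, 2, 10]
coeffs = divided_differences(xs, ys)   # the diagonal 1, 1, 1 from the table
value_at_2 = newton_eval(xs, coeffs, 2)  # x^2 + 1 at x = 2
```

The coefficients are exactly the diagonal of the table, so adding a new data point only requires one more row rather than a full recomputation.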
Another alternative is using linear regression, but multi-dimensional.
The trick here is to artificially generate extra dimensions. You can do so by applying some functions to the original data set. A common use is to generate polynomials that match the data, in which case the functions you apply are f(x) = x^i for all 1 < i <= k (where k is the degree of the polynomial you want to get).
For example, take the data set (0,2),(2,3), written here as (y,x) pairs. With k = 3 you get 2 extra dimensions (x^2 and x^3), and your data set becomes (0,2,4,8),(2,3,9,27), i.e. (y, x, x^2, x^3).
The linear-regression algorithm will find the values a_0,a_1,...,a_k for the polynomial p(x) = a_0 + a_1*x + ... + a_k * x^k that minimize the error between each data point and the model's prediction (the value of p(x)).
Now, the problem is that as you increase the dimension you move from underfitting (1-dimensional linear regression) to overfitting (once k reaches n-1 for n data points, you are effectively doing polynomial interpolation).
To choose the best k value you can use cross-validation, and pick the k that minimizes the cross-validation error.
Note that this process can be fully automated: all you need is to iteratively check all k values in the desired range (1), and choose the model with the k that minimizes the error according to the cross-validation.
(1) The range could be [1,n] - though it will probably be way too time consuming, I'd go for [1,sqrt(n)] or even [1,log(n)] - but it is just a hunch.
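Here is a rough pure-Python sketch of that automated loop. It is only an illustration under my own assumptions: the fit uses the normal equations with a small Gaussian elimination (fine for low degrees, numerically poor for high ones), the cross-validation is leave-one-out, and all function names and the sample data are made up for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def fit_poly(xs, ys, k):
    """Least-squares coefficients a_0..a_k via the normal equations."""
    rows = [[x ** i for i in range(k + 1)] for x in xs]  # the extra dimensions
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)]
           for i in range(k + 1)]
    Aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k + 1)]
    return solve(AtA, Aty)


def predict(coeffs, x):
    return sum(a * x ** i for i, a in enumerate(coeffs))


def loo_cv_error(xs, ys, k):
    """Mean squared leave-one-out prediction error for degree k."""
    total = 0.0
    for i in range(len(xs)):
        coeffs = fit_poly(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], k)
        total += (predict(coeffs, xs[i]) - ys[i]) ** 2
    return total / len(xs)


def choose_degree(xs, ys, degrees):
    """Pick the degree with the smallest cross-validation error."""
    return min(degrees, key=lambda k: loo_cv_error(xs, ys, k))


# Made-up data: x^2 + 1 with a small alternating perturbation.
xs = [float(x) for x in range(8)]
ys = [x * x + 1 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
best_k = choose_degree(xs, ys, [1, 2, 3])
```

For real data you would probably hand the fit to a proper least-squares routine (gsl has one) and keep only the cross-validation loop.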
You might want to use (Fast) Fourier Transforms to convert data to frequency domain.
With the result of the transform (a set of amplitudes and phases and frequencies) even the most twisted set of data can be represented by several functions (harmonics) of the form:
r * cos(f * t - p)
where r is the harmonic amplitude, f is the frequency and p the phase.
Finally, the unknown data curve is the sum of all harmonics.
I have done this in R (you have some examples of it) but I believe C has enough tools to manage it. It is also possible to pipe C and R, but I don't know much about it. This might be of help.
This method is really good for large chunks of data because it has complexities of:
1) decompose the data with Fast Fourier Transforms (FFT) = O(n log n)
2) build the function from the resulting components = O(n)
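As a small illustration of the decomposition, here is a Python sketch that turns samples into (r, f, p) triples and rebuilds the data as a sum of r * cos(f * t - p) terms. Note that it uses a naive O(n^2) discrete Fourier transform rather than an FFT to stay self-contained; a real FFT routine gives the same coefficients faster. The function names and the sample data are made up.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def harmonics(samples):
    """Return (r, f, p) triples such that x[t] = sum of r * cos(f * t - p)."""
    n = len(samples)
    return [(abs(X) / n, 2 * math.pi * k / n, -cmath.phase(X))
            for k, X in enumerate(dft(samples))]


def reconstruct(components, t):
    """Evaluate the sum of all harmonics at time index t."""
    return sum(r * math.cos(f * t - p) for r, f, p in components)


data = [1.0, 3.0, 2.0, 5.0, 4.0, 0.5, 2.5, 3.5]
components = harmonics(data)
rebuilt = [reconstruct(components, t) for t in range(len(data))]
```

Keeping only the harmonics with the largest r gives a smoother approximation, which is usually what you want when the data carries noise.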
I'm studying the Ising model, and I'm trying to efficiently compute a function H(σ) where σ is the current state of an LxL lattice (that is, σ_ij ∈ {+1, -1} for i,j ∈ {1,2,...,L}). To compute H for a particular σ, I need to perform the following calculation:
where ⟨i j⟩ indicates that sites σ_i and σ_j are nearest neighbors and (suppose) J is a constant.
A couple of questions:
Should I store my state σ as an LxL matrix or as an L^2 list? Is one better than the other for memory accessing in RAM (which I guess depends on the way I'm accessing elements...)?
In either case, how can I best compute H?
Really I think this boils down to how can I access (and manipulate) the neighbors of every state most efficiently.
Some thoughts:
I see that if I loop through each element in the list or matrix that I'll be double counting, so is there a "best" way to return the unique neighbors?
Is there a better data structure that I'm not thinking of?
Your question is a bit broad and a bit confusing for me, so excuse me if my answer is not the one you are looking for, but I hope it will help (a bit).
An array is faster than a list when it comes to indexing. A matrix is a 2D array, like this for example (where N and M are both L for you):
That means that you first access a[i] and then a[i][j].
However, you can avoid this double access, by emulating a 2D array with a 1D array. In that case, if you want to access element a[i][j] in your matrix, you would now do, a[i * L + j].
That way you do a single load at the cost of a multiplication and an addition, which may still be faster in some cases.
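A minimal Python sketch of the flat-array layout, applied to the energy sum from the question. Two things here are my assumptions rather than something stated in the question: the energy is taken to be the usual H = -J * (sum of σ_i σ_j over nearest-neighbour pairs), and the lattice has periodic boundaries. Each pair is counted once by looking only at the right and lower neighbour of every site, which also answers the double-counting worry.

```python
def ising_energy(spins, L, J=1.0):
    """Nearest-neighbour energy, each pair counted exactly once.

    spins is a flat list of length L*L holding +1/-1; site (i, j) is stored
    at index i * L + j. Periodic boundaries are assumed (use L >= 3 so the
    right and left neighbours of a site are distinct).
    """
    total = 0
    for i in range(L):
        for j in range(L):
            s = spins[i * L + j]
            right = spins[i * L + (j + 1) % L]
            down = spins[((i + 1) % L) * L + j]
            total += s * (right + down)
    return -J * total


L = 4
all_up = [1] * (L * L)
checkerboard = [1 if (i + j) % 2 == 0 else -1
                for i in range(L) for j in range(L)]
```

With 2 * L * L bonds on a periodic lattice, the aligned state gives the lowest energy and the checkerboard the highest.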
Now as for the Nearest Neighbor question, it seems that you are using a square-lattice Ising model, which means that you are working in 2 dimensions.
A very efficient data structure for Nearest Neighbor Search in low dimensions is the kd-tree. The construction of that tree takes O(nlogn), where n is the size of your dataset.
Now you should think if it's worth it to build such a data structure.
PS: There is a plethora of libraries implementing the kd-tree, such as CGAL.
I encountered this problem during one of my school assignments and I think the solution depends on which programming language you are using.
In terms of efficiency, there is no better way than to write a for loop to sum the neighbours, which are the 4 points (i±1, j) and (i, j±1) for a given (i,j). However, when SIMD instructions (SSE etc.) are available, you can re-express this as a convolution with the 2D kernel {0 1 0; 1 0 1; 0 1 0}, so if you use a numerical library that exploits SIMD you can obtain a significant performance increase. You can see an example implementation of this here (https://github.com/zawlin/cs5340/blob/master/a1_code/denoiseIsingGibbs.py).
Note that in this case the performance improvement is huge because the alternative in Python is an expensive for loop.
In terms of work, there is in fact some waste from the unnecessary multiplications and additions with the zeros at the corners and centre of the kernel. So whether you see a performance improvement depends quite a bit on your programming environment (if you are already in C/C++, it can be difficult, and you need to use MKL etc. to obtain a good improvement).
I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the Training set is also around 60% (whereas over-fitting typically means you have a very low training error and high test error),
and I have a very large amount of data examples (don't look at the figure, that's only a toy figure I uploaded).
Can anybody help me understand why adding an extra hidden layer gives me this drop in performance on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid function taking values between 0 and 1, L(s) = 1 / (1 + exp(-s))
I am using early stopping (after 40000 iterations of backprop) as the criterion to stop the learning. I know it is not the best way to stop, but I thought that it would be ok for such a simple classification task. If you believe this is the main reason I'm not converging, I might implement a better criterion.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x) :
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). The derivative has a nonzero value for x near zero, but quickly goes to zero as |x| gets large :
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
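A quick numeric sketch of why this happens, in Python. The derivative f'(x) = f(x)(1 - f(x)) never exceeds 0.25, so a chain of such factors across layers shrinks at least geometrically. The helper max_gradient_factor is my own simplification: it only bounds the product of the derivative terms and ignores the weights, which can make things better or worse.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative written in terms of the function value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


def max_gradient_factor(n_layers):
    """Upper bound on the product of sigmoid derivatives over n_layers."""
    return 0.25 ** n_layers


peak = sigmoid_prime(0.0)    # the largest value the derivative can take
tail = sigmoid_prime(10.0)   # already tiny for a saturated unit
three_layers = max_gradient_factor(3)
```

So even in the best case three sigmoid layers scale the learning signal by less than 2%, and saturated units make it far smaller.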
However, there are some solutions that can help ! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest solution to approximate using some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using :
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually get even better than this by using "Nesterov's Accelerated Gradient" :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
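To see the two update rules side by side, here is a toy Python sketch on a one-parameter objective (w - 3)^2. The objective, the function names and the default values of mu and learning_rate are all made up for illustration; grad plays the role of delta_w in the formulas above.

```python
def grad(w):
    """Gradient of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)


def momentum_descent(w, mu=0.9, learning_rate=0.1, steps=300):
    """Classical momentum: gradient evaluated at the current w."""
    w_dir = 0.0
    for _ in range(steps):
        w_dir = mu * w_dir - learning_rate * grad(w)
        w = w + w_dir
    return w


def nesterov_descent(w, mu=0.9, learning_rate=0.1, steps=300):
    """Nesterov: gradient evaluated at the look-ahead point w + mu * w_dir."""
    w_dir = 0.0
    for _ in range(steps):
        w_dir = mu * w_dir - learning_rate * grad(w + mu * w_dir)
        w = w + w_dir
    return w


w_momentum = momentum_descent(0.0)
w_nesterov = nesterov_descent(0.0)
```

Both runs end up near the minimum at w = 3; the only difference between the two functions is where the gradient is evaluated.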
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.
I'm studying programming on my own and I would like to have an idea how to solve this problem.
I have been given a set of resistors with given resistances and a target value restot. I can pick a given number of those resistors. How can I build a circuit whose resistance is as near as possible to restot? A programmer told me that one can use genetic algorithms, but I'm not limited to those.
I guess I have to build a linear system of equations using Kirchhoff's laws, but I don't have much experience with electrical problems or with numerical algorithms for linear systems, so I would like some guidance on how to generate those equations automatically in the computer's memory as the system changes all the time. And how can I make sure that the algorithm converges to better solutions?
The problem is from a Finnish discussion forum.
Resistors can either exist in series or in parallel, and their resistances add up differently (add values for series, add reciprocals for parallel).
You can also have networks of resistors in series and parallel.
This sounds to me like a classic case of a recursive data structure, and you could probably represent it as a tree, in a similar way to a binary expression tree: http://en.wikipedia.org/wiki/Binary_expression_tree
Combine that with some exploratory tree building (you should look into the way Prolog does this) and you can find the best combination of resistors that gets close to your total.
No genetic algorithms in this approach, although you could take a genetic approach to building and refining the tree.
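A brute-force Python sketch of that idea, under some assumptions of mine: it only explores series-parallel networks (no bridge circuits), it assumes strictly positive resistances, and it returns the best resistance value rather than the tree itself. It is exponential, so it is only usable for a handful of resistors. The function names are made up.

```python
from itertools import combinations


def achievable(resistors):
    """All resistances of series-parallel networks using every given resistor."""
    if len(resistors) == 1:
        return {resistors[0]}
    results = set()
    n = len(resistors)
    # Split into two non-empty groups; keeping resistor 0 in the left group
    # avoids generating each split twice.
    for size in range(0, n - 1):
        for rest in combinations(range(1, n), size):
            left = (resistors[0],) + tuple(resistors[i] for i in rest)
            right = tuple(resistors[i] for i in range(1, n) if i not in rest)
            for a in achievable(left):
                for b in achievable(right):
                    results.add(a + b)            # series: add values
                    results.add(a * b / (a + b))  # parallel: add reciprocals
    return results


def closest_resistance(available, count, target):
    """Best resistance reachable with exactly `count` of the resistors."""
    best = None
    for chosen in combinations(available, count):
        for value in achievable(chosen):
            if best is None or abs(value - target) < abs(best - target):
                best = value
    return best


best = closest_resistance([100.0, 220.0, 470.0], 2, 150.0)
```

Here the closest match to 150 ohms is 220 and 470 in parallel (about 149.9 ohms). To report the circuit as well, you could store a small tree node next to each value instead of the bare number.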
To apply a genetic algorithm you would need to find a way to represent, mutate and combine the "DNA" of a resistor network.
One way would be to:
Add some number of 0 ohm resistors to your resistor set (representing wires).
Number the resistors from 1 to N
For some M, imagine a set of M junctions including the source (1) and sink (M).
You could define which junctions the two endpoints of each resistor are connected to as the unique identifier of a network. This is just an N-tuple of integer pairs in the range 1..M. This tuple can be the "DNA".
Then:
1. Generate a bunch of random networks from random tuples.
2. Calculate each network's resistance.
3. Discard some amount of the population farthest from the target resistance.
4. Combine random pairs of them to form new networks (perhaps by randomly selecting each resistor endpoint from either parent A or parent B with 50% probability).
5. Randomly change a few endpoints (mutation).
6. Goto 2
Not sure if it would actually work exactly like this, but you get the general idea.
There is undoubtedly a better non-genetic algorithm, but you specifically asked for a genetic one so there you go.
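The "calculate each network's resistance" step is where Kirchhoff's laws come in. Below is one possible Python sketch using nodal analysis: hold the sink at 0 V, inject 1 A at the source, solve the linear system for the junction voltages, and read the resistance off the source voltage. The assumptions are mine: each network is a list of (junction_a, junction_b, ohms) triples, every junction is connected to the sink somehow (otherwise the system is singular), and resistances are strictly positive, so the 0 ohm wires suggested above would need to be handled separately, for example by merging junctions.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def equivalent_resistance(edges, n_nodes, source=1, sink=None):
    """Resistance between source and sink for junctions numbered 1..n_nodes."""
    sink = n_nodes if sink is None else sink
    # Unknowns: the voltage at every junction except the sink (held at 0 V).
    nodes = [v for v in range(1, n_nodes + 1) if v != sink]
    index = {v: i for i, v in enumerate(nodes)}
    G = [[0.0] * len(nodes) for _ in nodes]  # conductance matrix
    current = [0.0] * len(nodes)
    current[index[source]] = 1.0             # inject 1 A at the source
    for a, b, ohms in edges:
        if a == b:
            continue  # both ends on one junction: carries no current
        g = 1.0 / ohms
        for u, v in ((a, b), (b, a)):
            if u != sink:
                G[index[u]][index[u]] += g
                if v != sink:
                    G[index[u]][index[v]] -= g
    voltages = solve(G, current)
    return voltages[index[source]]           # V = I * R with I = 1 A


series = equivalent_resistance([(1, 2, 100.0), (2, 3, 100.0)], 3)
parallel = equivalent_resistance([(1, 2, 100.0), (1, 2, 100.0)], 2)
```

The fitness of a network is then just the distance between this value and the target resistance.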
If you are not limited to a genetic algorithm, then I think you can also solve this problem with the help of linear programming. You can encode the problem as below and ask a solver to give you the answer.
Required Resistance Of Circuit = x ohms
// We want to have 33 resistors in total.
selected_in_series_1 + selected_in_series_2 + ... + selected_in_series_211 + selected_in_parallel_1 + selected_in_parallel_2 + ... + selected_in_parallel_211 = 33
// Resistors in series
(selected_in_series_1 * Resistor_1) + (selected_in_series_2 * Resistor_2) + ... + (selected_in_series_211 * Resistor_211) = total_resistance_in_series
// Similarly write the formula for parallel
(selected_in_parallel_1 * 1/Resistor_1) + (selected_in_parallel_2 * 1/Resistor_2) + ... + (selected_in_parallel_211 * 1/Resistor_211) = 1/total_resistance_in_parallel
total_resistance_in_series + total_resistance_in_parallel = Required Resistance Of Circuit
One thing I don't quite understand as I begin learning about neural networks is what to initially set a "bias" to.
I understand the Perceptron calculates its output based on:
P * W + b > 0
and then you could calculate a learning pattern based on b = b + [ G - O ] where G is the correct output and O is the actual output (1 or 0) to calculate a new bias, but what about the initial bias? I don't really understand how this is calculated, or what initial value should be used besides just "guessing". Is there any type of formula for this?
Pardon me if I'm mistaken on anything, I'm still learning the whole neural network idea before I implement my own (crappy) one.
The same goes for the learning rate. I mean, most books and such just kinda "pick one" for μ.
The short answer is, it depends...
In most cases (I believe) you can just treat the bias just like any other weight (so it might get initialised to some small random value), and it will get updated as you train your network. The idea is that all the biases and weights will end up converging on some useful set of values.
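As a small illustration that the starting value matters little, here is a Python sketch that trains a perceptron on the AND function using the b = b + [ G - O ] rule from the question, with the same rule applied to the weights. Starting everything at zero instead of small random values is my choice to keep the run deterministic; the names and the fixed learning rate of 1.0 are also just for the example.

```python
def predict(weights, bias, inputs):
    """Perceptron output: 1 if P * W + b > 0, else 0."""
    total = sum(w * p for w, p in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=1.0):
    weights, bias = [0.0, 0.0], 0.0  # the initial guess; training corrects it
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # G - O
            weights = [w + learning_rate * error * p
                       for w, p in zip(weights, inputs)]
            bias += learning_rate * error  # bias updated like any other weight
    return weights, bias


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)
```

After a few passes over the four samples the bias has drifted to a negative value on its own, which is exactly the job it has in an AND gate.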
However, you can also set the weights manually (with no training) to get some special behaviours: for example, you can use the bias to make a perceptron behave like a logic gate (assume binary inputs X1 and X2 are either 0 or 1, and the activation function is scaled to give an output of 0 or 1).
OR gate: W1=1, W2=1, Bias=0
AND gate: W1=1, W2=1, Bias=-1
You can solve the classic XOR problem by using AND and OR as the first layer in a multilayer network, and feed them into a third perceptron with W1=3 (from the OR gate), W2=-2 (from the AND gate) and Bias=-2, like this:
(Note: these values will be different if your activation function is scaled to -1/+1, ie a SGN function)
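The gate values above can be checked with a few lines of Python. This assumes the step activation from the question (output 1 when P * W + b > 0, otherwise 0); the function names are mine.

```python
def perceptron(inputs, weights, bias):
    """Step-activation perceptron: 1 if the weighted sum plus bias is > 0."""
    total = sum(p * w for p, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def or_gate(x1, x2):
    return perceptron((x1, x2), (1, 1), 0)


def and_gate(x1, x2):
    return perceptron((x1, x2), (1, 1), -1)


def xor_gate(x1, x2):
    # Second layer: W1=3 on the OR output, W2=-2 on the AND output, Bias=-2.
    return perceptron((or_gate(x1, x2), and_gate(x1, x2)), (3, -2), -2)


truth_table = [(a, b, xor_gate(a, b)) for a in (0, 1) for b in (0, 1)]
```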
As to how to set the learning rate, that also depends(!) but I think usually something like 0.01 is recommended. Basically you want the system to learn as quickly as possible, but not so quickly that the weights fail to converge properly.
Since @Richard has already answered the greater part of the question I'll only elaborate on the learning rate. From what I've read (and it's working) there is a very simple formula that you can use in order to update the learning rate for each iteration k and it is:
learningRate_k = constant/k
Here obviously the 0th iteration is excluded since you'll be dividing by zero. The constant can be whatever you want it to be (except 0 of course since it will not be making any sense :D) but the easiest is naturally 1 so you get
learningRate_k = 1/k
The resulting series obeys two basic rules:
lim_(t->inf) SUM from k=1 to t (learningRate_k) = inf
lim_(t->inf) SUM from k=1 to t (learningRate_k^2) < inf
Note that the convergence of your perceptron is directly connected to the learning rate series. It starts big (for k=1 you get 1/1=1) and gets smaller and smaller with each and every update of your perceptron since - as in real life - when you encounter something new at the beginning you learn a lot but later on you learn less and less.
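The two rules can be checked numerically with a short Python sketch. The cut-off of 100000 iterations is arbitrary, and the function names are made up.

```python
def learning_rate(k, constant=1.0):
    """Learning rate for iteration k (k starts at 1)."""
    return constant / k


def partial_sums(t, constant=1.0):
    """Sum of the rates and sum of the squared rates for k = 1..t."""
    rates = [learning_rate(k, constant) for k in range(1, t + 1)]
    return sum(rates), sum(r * r for r in rates)


sum_rates, sum_squares = partial_sums(100000)
```

The first sum keeps growing (roughly like the logarithm of t), while the second stays below pi^2 / 6, about 1.645, however long you run.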
I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a Tetris game state may be described by the list of occupied cells. A string of bits describing this information would be a suitable input to that stage of the learning algorithm. Actually training on that is still challenging: how do you know whether those are useful results? I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed with the successive states of play and the output would just be the block placements, with higher scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources, such as recorded plays from human players or a hand-crafted AI, and select the algorithms whose outputs bear a strong correlation to some interesting fact or another from the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you choose M selected features there are 2^M subsets, so there is a lot to look at.
I would do the following:
For each subset S
run your code to optimize the weights W
save S and the corresponding W
Then for each pair S-W, you can run G games for each pair and save the score L for each one. Now you have a table like this:
feature1  feature2  feature3  featureM  subset_code  game_number  scoreL
1         0         1         1         S1           1            10500
1         0         1         1         S1           2            6230
...
0         1         1         0         S2           G + 1        30120
0         1         1         0         S2           G + 2        25900
Now you can run some component selection algorithm (PCA for example) and decide which features are worth keeping to explain scoreL.
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
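A small Python sketch of the outer loop and of the seeding tip. Everything here is hypothetical scaffolding: the feature names beyond the two mentioned in the question, the piece letters, and the function names are made up, and the actual weight optimisation is left out because it is your existing code.

```python
import random
from itertools import combinations


def feature_subsets(features):
    """Yield every non-empty subset S of the candidate features."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def piece_sequence(seed, length, pieces="IJLOSTZ"):
    """Reproducible piece sequence: the same seed gives the same game."""
    rng = random.Random(seed)
    return [rng.choice(pieces) for _ in range(length)]


features = ["gaps", "average_column_height", "max_height"]
subsets = list(feature_subsets(features))
game = piece_sequence(seed=42, length=50)
```

Using a private random.Random instance per game keeps the piece sequence independent of any other random numbers your optimiser draws.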
I hope it helps in something!