Is it possible to get an output from a neural network that is arbitrarily large? I know that the activation function doesn't need to be a sigmoid one, but whenever I try to use a linear one (i.e. not have an activation function), my outputs rapidly drop to near zero and everything falls apart.
As an example, is it possible to have a network where the output is double the input, even if the output is a non-integer larger than 1?
Sorry if this is a repeated question (it seems like it would be), but I couldn't find a thread that dealt with this exact problem. I will post code if needed, but there is a lot of it and this seems like a general problem...
There is no limit on the output values as long as you use an unbounded activation function in the output layer and do not constrain your weights "too much" (regularization methods such as weight decay force the network to keep its weights small).
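As a minimal sketch of the "output is double the input" case (Python, standard library only; the function names, learning rate and epoch count are illustrative assumptions, not taken from the question), a single neuron with no activation function can be trained by plain gradient descent and then asked for an output far above 1:

```python
def predict(w, b, x):
    """Linear output unit: no squashing function is applied."""
    return w * x + b


def train_linear_neuron(samples, lr=0.05, epochs=2000):
    """Fit w and b with per-sample gradient descent on the squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            err = predict(w, b, x) - target
            w -= lr * err * x   # d(err^2 / 2) / dw
            b -= lr * err       # d(err^2 / 2) / db
    return w, b


# Training data: target is exactly twice the input, including non-integers > 1.
samples = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
w, b = train_linear_neuron(samples)
print(w, b, predict(w, b, 10.0))
```

Because the output unit is linear, nothing squashes the prediction into (0, 1); the learned weight should end up close to 2 and the bias close to 0, so an input of 10 maps to roughly 20 even though 10 was never in the training set.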
I am inverting a matrix via a Cholesky factorization, in a distributed environment, as it was discussed here. My code works fine, but in order to test that my distributed project produces correct results, I had to compare it with the serial version. The results are not exactly the same!
For example, the last five cells of the result matrix are:
serial gives:
-250207683.634793 -1353198687.861288 2816966067.598196 -144344843844.616425 323890119928.788757
distributed gives:
-250207683.634692 -1353198687.861386 2816966067.598891 -144344843844.617096 323890119928.788757
I posted about this on the Intel forum, but the answer I got was about getting the same results across all executions of the distributed version, which I already have. They seem (in another thread) to be unable to answer this:
How to get same results, between serial and distributed execution? Is this possible? This would result in fixing the arithmetic error.
I have tried setting mkl_cbwr_set(MKL_CBWR_AVX); and using mkl_malloc() to align memory, but nothing changed. I get the same results only when I spawn a single process for the distributed version (which makes it almost serial)!
The distributed routines I am calling: pdpotrf() and pdpotri().
The serial routines I am calling: dpotrf() and dpotri().
Your differences seem to appear at about the 12th s.f. Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, generally, give a deterministic order of the application of operations, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed you may observe the same order of difference when running on a different number of processors, 4 vs 8, say.
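The non-associativity point can be reproduced in a few lines. This is only an illustrative Python sketch (the helper names and the hand-picked values are assumptions chosen to make the effect visible, not anything from MKL or ScaLAPACK): the same numbers are added once left to right and once as two partial sums, the way two processes might do it.

```python
def naive_sum(values):
    """Add strictly left to right so the order of operations is explicit."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum each chunk separately, then add the partial sums (a toy 'reduction')."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 1.0
right = a + (b + c)   # 0.0, because b + c rounds back to -1e16

values = [1e16, 1.0, -1e16, 1.0]
serial = naive_sum(values)        # 1.0
split = chunked_sum(values, 2)    # 0.0
print(left, right, serial, split)
```

The inputs are identical in both cases; only the grouping changes, which is the same thing that happens when the work is split across a different number of processes.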
Unfortunately the easy way to get deterministic results is to stick to serial execution. To get deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations right down to the last + or * which almost certainly rules out the use of most numeric libraries and leads you to painstaking manual coding of large numeric routines.
In most cases that I've encountered the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later s.f. I don't know what your numbers represent but for many scientists and engineers equality to the 4th or 5th sf is enough equality for all practical purposes. It's a different matter for mathematicians ...
As the other answer mentions, getting exactly the same results from serial and distributed execution is not guaranteed. One common technique with HPC/distributed workloads is to validate the solution. There are a number of techniques, from calculating the relative error to more complex validation schemes like the one used by HPL. Here is a simple C++ program that calculates the relative error of each value (multiply by 100 for percent error). As @HighPerformanceMark notes in his post, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of information available online about the topic.
#include <cmath>
#include <iostream>

// Relative error of x with respect to the reference value a (a must be non-zero).
double calc_error(double a, double x)
{
    return std::abs(x - a) / std::abs(a);
}

int main()
{
    // Last five cells of the result matrix: serial (sans) and distributed (pans).
    double sans[] = {-250207683.634793, -1353198687.861288, 2816966067.598196, -144344843844.616425, 323890119928.788757};
    double pans[] = {-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};
    double err[5];
    std::cout << "Serial Answer,Distributed Answer, Error" << std::endl;
    for (int it = 0; it < 5; it++) {
        err[it] = calc_error(sans[it], pans[it]);
        std::cout << sans[it] << "," << pans[it] << "," << err[it] << "\n";
    }
    return 0;
}
Which produces this output:
Serial Answer,Distributed Answer, Error
-2.50208e+08,-2.50208e+08,4.03665e-13
-1.3532e+09,-1.3532e+09,7.24136e-14
2.81697e+09,2.81697e+09,2.46631e-13
-1.44345e+11,-1.44345e+11,4.65127e-15
3.2389e+11,3.2389e+11,0
As you can see the order of magnitude of the error in every case is on the order of 10^-13 or less and in one case non-existent. Depending on the problem you are trying to solve error on this order of magnitude could be considered acceptable. Hopefully this helps to illustrate one way of validating a distributed solution against a serial one, or at least gives one way to show how far apart the parallel and serial algorithm are.
When validating answers for big problems and parallel algorithms it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then look to see if the result and/or error varies with the parallel algorithm run or if it settles over time.
Showing that a parallel algorithm produces error within acceptable thresholds over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result.
In the past when I have performed benchmark testing I have noticed wildly varying behavior for the first several runs before the servers have "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.
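The "error within acceptable thresholds" check can be turned into a pass/fail test. The sketch below is in Python and reuses the five values from the question; the tolerance of 1e-12 and the function names are assumptions picked for illustration, and the right threshold depends entirely on the problem.

```python
serial = [-250207683.634793, -1353198687.861288, 2816966067.598196,
          -144344843844.616425, 323890119928.788757]
distributed = [-250207683.634692, -1353198687.861386, 2816966067.598891,
               -144344843844.617096, 323890119928.788757]


def relative_error(reference, value):
    """Relative error of value with respect to a non-zero reference."""
    return abs(value - reference) / abs(reference)


def validate(reference, candidate, tol):
    """Return the worst relative error and whether it is within tol."""
    worst = max(relative_error(r, c) for r, c in zip(reference, candidate))
    return worst, worst <= tol


worst, ok = validate(serial, distributed, tol=1e-12)
print(worst, ok)
```

For repeated runs, the same `validate` call could be applied to each saved result, keeping the list of worst-case errors to see whether it drifts or stays flat.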
I am trying to profile a C++ function using gprof; I am interested in the % time taken. I did more than one run, and for some reason I got a large difference in the results. I don't know what is causing this; I assume it is the sampling rate, or, as I read in other posts, that I/O has something to do with it. So is there a way to make it more accurate and generate almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a format similar to gprof's (function time %, function name). I tried Valgrind, but it gave me a massive file. Maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
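The propagation scheme described above can be written down in a few lines. This Python sketch is a simplification based only on the description in this answer, not gprof's actual implementation; the routine names, self times and call counts are made up, and the call graph is assumed to have no cycles.

```python
# Hypothetical measurements: self time per routine (seconds) and call counts per call site.
self_time = {"main": 0.1, "parse": 0.4, "helper": 0.5}
calls = {("main", "parse"): 2, ("main", "helper"): 1, ("parse", "helper"): 3}


def total_calls(callee):
    """How many times a routine was called, summed over all call sites."""
    return sum(n for (_, c), n in calls.items() if c == callee)


def total_time(routine):
    """Self time plus each callee's total time, shared out by call count."""
    t = self_time[routine]
    for (caller, callee), n in calls.items():
        if caller == routine:
            # Assumes every call to callee costs the same average time.
            t += total_time(callee) * n / total_calls(callee)
    return t


for name in self_time:
    print(name, total_time(name))
```

Here `parse` is charged three quarters of `helper`'s time purely because it made three of the four calls. If the single call from `main` were actually the expensive one, the attribution would be wrong, which is exactly the average-time assumption the paper warns about.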
All of this technology, even if it weren't problematic, raises the question: what is its purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.
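To get a feel for how large that statistical error is, one can treat each sample as an independent coin flip that lands in a given function with probability p. That independence is an idealisation, and the function names below are illustrative, but under it the standard error of the measured fraction is sqrt(p(1-p)/n):

```python
import math
import random


def sampling_std_error(p, n_samples):
    """Standard error of a measured time fraction p estimated from n_samples."""
    return math.sqrt(p * (1.0 - p) / n_samples)


def simulate_profile(p, n_samples, seed):
    """Draw n_samples 'PC samples' and return the observed fraction in the function."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.random() < p)
    return hits / n_samples


# A function that really takes 10% of the run time.
for n in (100, 10_000):
    observed = [simulate_profile(0.10, n, seed) for seed in range(5)]
    print(n, sampling_std_error(0.10, n), observed)
```

With 100 samples (about one second at a 100 Hz sampling clock, assuming that rate) the 10% figure carries a standard error of about 3 percentage points; it takes 100 times as many samples to cut that by a factor of 10, which is why short runs give such different percentages.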
I wrote my first feed-forward neural network in C, using the sigmoid 1.0 / (1.0 + exp(-x)) as activation function and gradient descent to adjust the weights. I tried to approximate sin(x) to make sure my network works. However, the output of the neuron on the output layer seems to always oscillate between the extreme values 0 and 1 and the weights of the neurons grow to absurd sizes, no matter how many hidden layers there are, how many neurons are in the hidden layer(s), how many training samples I provide, or even what the target outputs are.
1) Are there any standard 'tried and tested' data sets used to proof-test neural networks for errors? If yes, what structures work best (e.g. numbers of neuron(s) in the hidden layer) to converge to the desired output?
2) Are there any common errors that generate the same symptoms? I found this thread, but the issue was because of faulty data, which I believe is not my case.
3) Is there any preferred way of training the network? In my implementation I cycle through the training sets and adjust the weights each time, then rinse and repeat ~1000 times. Is there any other order that works better?
So, to sum up:
Assuming that your gradient propagation works properly, the values of hyperparameters such as topology, learning rate, batch size, or the constant attached to the weight penalty (L1 and L2 decay) are usually chosen using techniques called grid search or random search. Empirically, random search has been shown to perform better at this task.
The most common cause of weight divergence is a wrong learning rate. A value that is too big can make learning really hard, but when the learning rate is too small the learning process can take a really long time. Usually you should babysit the learning phase. Detailed instructions can be found e.g. here.
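The effect of the learning rate can be seen even on a one-parameter problem. This is a toy Python sketch, not the asker's network: it minimises f(w) = w^2 with plain gradient descent, and the two learning rates are arbitrary values chosen to show both regimes.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return every visited weight."""
    w = w0
    history = [w]
    for _ in range(steps):
        w -= lr * 2.0 * w   # each step multiplies w by (1 - 2*lr)
        history.append(w)
    return history


small = gradient_descent(lr=0.1)   # factor 0.8 per step: shrinks towards 0
large = gradient_descent(lr=1.1)   # factor -1.2 per step: flips sign and grows

print(small[-1], large[:4], large[-1])
```

With the large rate the weight overshoots the minimum on every step, alternating in sign while its magnitude grows, which looks a lot like the "oscillates between extremes while the weights grow to absurd sizes" symptom from the question.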
In your learning phase you used a technique called SGD. It can usually achieve good results, but it is vulnerable to variance in the data set and to large learning rates. I advise you to use batch learning and to treat the batch size as an additional hyperparameter tuned during grid or random search. You can read about it e.g. here.
Another thing you might consider is changing your activation function to tanh or ReLU. There are a lot of problems with the saturation regions of the sigmoid, and it usually needs proper initialization. You can read about it here.
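The saturation problem is easy to see by evaluating the derivatives directly. A small Python sketch (function names are illustrative) comparing the gradient that flows back through each activation at a small and a large pre-activation value:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


for x in (0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

The sigmoid's gradient is at most 0.25 and is nearly zero once the input reaches 10, so large weights push a unit into a region where it almost stops learning. Separately from saturation, a sigmoid output unit can only produce values in (0, 1), so it cannot reach the negative half of sin(x) unless the targets are rescaled.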
This question came to mind while working on two projects in AI and ML. What if I'm building a model (e.g. a classification neural network, k-NN, etc.) and this model uses some function that includes randomness? If I don't fix the seed, then I'm going to get different accuracy results every time I run the algorithm on the same training data. However, if I fix it, then some other setting might give better results.
Is averaging a set of accuracies enough to say that the accuracy of this model is xx % ?
I'm not sure if this is the right place to ask such a question or open such a discussion.
Simple answer, yes, you randomize it and use statistics to show the accuracy. However, it's not sufficient to just average a handful of runs. You need, at a minimum, some notion of the variability as well. It's important to know whether "70%" accurate means "70% accurate for each of 100 runs" or "100% accurate once and 40% accurate once".
If you're just trying to play around a bit and convince yourself that some algorithm works, then you can just run it 30 or so times and look at the mean and standard deviation and call it a day. If you're going to convince anyone else that it works, you need to look into how to do more formal hypothesis testing.
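To make the "70%" example concrete, here is a short Python sketch using the standard `statistics` module. The two lists are the hypothetical scenarios from the paragraph above, not measured results, and `summarise` is just an illustrative helper.

```python
import statistics


def summarise(accuracies):
    """Mean, sample standard deviation and worst case of a list of run accuracies."""
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "min": min(accuracies),
    }


steady = summarise([0.70] * 100)     # 70% accurate on each of 100 runs
erratic = summarise([1.00, 0.40])    # 100% once and 40% once

print(steady)
print(erratic)
```

Both cases report the same mean, and only the standard deviation and the minimum tell them apart, which is why the average alone is not enough.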
There are models which are naturally dependent on randomness (e.g., random forests) and models which only use randomness as part of exploring the space (e.g., initialisation of values for neural networks), but actually have a well-defined, deterministic, objective function.
For the first case, you will want to use multiple seeds and report average accuracy, std. deviation, and the minimum you obtained. It is often good if you have a way to reproduce this, so just use multiple fixed seeds.
For the second case, you can always tell, just on the training data, which run is best (although it might actually not be the one which gives you the best test accuracy!). Thus, if you have the time, it is good to do, say, 10 runs and then evaluate the one with the best training error (or validation error; just never use the test set for this decision). You can go a level up, repeat this whole procedure several times, and get a standard deviation too. However, if you find that this is significant, it probably means you weren't trying enough initialisations or that you are not using the right model for your data.
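A toy version of the second case, sketched in Python: the objective below is deterministic, and randomness enters only through the starting point. The function, the learning rate and the seed range are assumptions picked so that different initialisations end up in different local minima; a real model would use its training or validation error in place of `loss`.

```python
import random


def loss(w):
    """Deterministic objective with two local minima."""
    return w ** 4 - 3.0 * w ** 2 + w


def grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def train(seed, lr=0.01, steps=500):
    """Gradient descent from a seed-dependent random start; returns (w, loss)."""
    rng = random.Random(seed)
    w = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        w -= lr * grad(w)
    return w, loss(w)


# Fixed seeds keep the experiment reproducible.
runs = {seed: train(seed) for seed in range(10)}
best_seed = min(runs, key=lambda s: runs[s][1])
best_w, best_loss = runs[best_seed]
print(best_seed, best_w, best_loss)
```

Some seeds settle in the shallower minimum and some in the deeper one; picking the run with the lowest loss recovers the deeper one without ever touching test data.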
Stochastic techniques are typically used to search very large solution spaces where exhaustive search is not feasible. So it's almost inevitable that you will be trying to iterate over a large number of sample points with as even a distribution as possible. As mentioned elsewhere, basic statistical techniques will help you determine when your sample is big enough to be representative of the space as a whole.
To test accuracy, it is a good idea to set aside a portion of your input patterns and avoid training against those patterns (assuming you are learning from a data set). Then you can use the set to test whether your algorithm is learning the underlying pattern correctly, or whether it's simply memorizing the examples.
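Setting a portion of the patterns aside might look like the following Python sketch. The 20% fraction, the fixed seed and the function name are assumptions for illustration.

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, held_out)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_set, held_out = holdout_split(range(10))
print(train_set, held_out)
```

The held-out patterns are never shown to the learner, so a large gap between training accuracy and held-out accuracy suggests the examples are being memorised.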
Another thing to think about is the randomness of your random number generator. Standard random number generators (such as rand from <stdlib.h>) may not make the grade in many cases so look around for a more robust algorithm.
I am generalizing the answer from what I understand of your question.
I suppose accuracy is always reported as the average accuracy over multiple runs together with the standard deviation. So if you consider the accuracy you get using different seeds for the random generator, are you not actually considering a greater range of input (which should be a good thing)? But you have to take the standard deviation into account when judging the accuracy. Or did I get your question totally wrong?
I believe cross-validation may give you what you ask about: an averaged, and therefore more reliable, estimate of classification performance. It contains no randomness, except in permuting the data set initially. The variation comes from choosing different train/test splits.
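A bare-bones k-fold splitter, as a Python sketch. It assigns samples to folds in an interleaved pattern without shuffling so that it stays deterministic; in practice the data set would normally be permuted once beforehand, as noted above. The function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(test_idx, train_idx)
```

Each sample is used for testing exactly once, and the k accuracies obtained this way can then be averaged and reported with their standard deviation.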
Possible Duplicate:
True random number generator
I was talking to a friend the other day and we were trying to figure out whether it is possible to generate completely random numbers without the help of a random function. In C, for example, "rand" generates pseudo-random numbers. Or we can use something like "srand( time( NULL ) );", which lets the computer read its clock for a seed value. So if I understand everything I have read so far correctly, I am pretty sure that no random function actually produces truly random numbers. How would one write a program that generates numbers that are completely random, and what would the code look like?
Check out this question:
True random number generator
Also, from Wikipedia's entry on pseudorandom numbers:
As John von Neumann joked, "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
The excellent random.org website provides hardware-based random numbers as well as a number of software interfaces to retrieve these.
This can be used e.g. for genuinely unpredictable seeds or for 'true' random numbers. Being a web service, there are limits on the number of draws you can make, so don't try to use this for your graduate school Monte Carlo simulation.
FWIW, I wrapped one of those interfaces in the R package random.
It would look like:
int random = CallHardwareRandomGenerator();
Even with hardware, randomness is tricky. There are things which are physically random (atomic decay is random, but with predictable average amounts, so that can be used as a source of random information), and there are things that are physically random enough to make prediction impractical (this is how casinos make money).
There are things that are largely indeterminate (mix up information from key-stroke rate, mouse-movements, and a few things like that), which are a good-enough source of "randomness" for many uses.
Mathematically, we cannot produce randomness, but we can improve distribution and make something harder to predict. Cryptographic PRNGs do a stronger job at this than most, but are more expensive in terms of resources.
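The difference between an ordinary seeded PRNG and a cryptographic source can be shown with Python's standard library. This is only a sketch: `seeded_sequence` is an illustrative helper, and `secrets` draws from whatever randomness source the operating system provides, so its quality depends on the platform.

```python
import random
import secrets


def seeded_sequence(seed, n):
    """A general-purpose PRNG: the seed fully determines the output."""
    rng = random.Random(seed)
    return [rng.randrange(100) for _ in range(n)]


first = seeded_sequence(1234, 5)
second = seeded_sequence(1234, 5)   # same seed, same numbers

# Bytes from the operating system's cryptographic generator; there is no seed to replay.
token = secrets.token_bytes(16)

print(first, second, token.hex())
```

The seeded sequence is reproducible, which is what you want for simulations and exactly what you do not want for keys; the `secrets` output is meant to be unpredictable but costs more to produce.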
This is more of a physics question, I think. If you think about it, nothing is random; it just occurs due to events whose complexity makes them unpredictable to us. A computer is a subsystem just like any other in the universe, and by giving it unpredictable external inputs (RTC, I/O garbage) we can get the same kind of randomness that a roulette wheel gets from varying friction, air resistance, initial impulse and millions of factors that I can't wrap my head around.
There's room for a fair amount of philosophical debate about what "truly random" really even means. From a practical viewpoint, even sources we know aren't truly random can be used in ways that produce what are probably close enough for almost any practical purpose though (in particular, that at least with current technology, full knowledge of the previously produced bitstream appears to be insufficient to predict the next bit accurately). Most of those do involve a bit of extra hardware though -- for example, it's pretty easy to put a source together from a little bit of Americium out of a smoke detector.
There are quite a few more sources as well, though they're mostly pretty low bandwidth (e.g., collect one bit for each keystroke, based on whether the interval between keystrokes was an even or odd number of CPU clocks -- assuming the CPU clock and keyboard clock are derived from separate crystals). OTOH, you have to be really careful with this -- a fair number of security holes (e.g., in Netscape around v. 4.0 or so) have stemmed from people believing that such sources were a lot more random than they really were.
While there are a number of web sites that produce random numbers from hardware sources, most of them are useless from a viewpoint of encryption. Even at best, you're just trusting your SSL (or TLS) connection to be secure so nobody captured the data you got from the site.