Effects of randomizing the order of inputs to a neural network - artificial-intelligence

For my Advanced Algorithms and Data Structures class, my professor asked us to pick any topic that interested us. He also told us to research it and to try and implement a solution in it. I chose Neural Networks because it's something that I've wanted to learn for a long time.
I've been able to implement an AND, OR, and XOR using a neural network whose neurons use a step function for the activator. After that I tried to implement a back-propagating neural network that learns to recognize the XOR operator (using a sigmoid function as the activator). I was able to get this to work 90% of the time by using a 3-3-1 network (1 bias at the input and hidden layer, with weights initialized randomly). At other times it seems to get stuck in what I think is a local minimum, but I am not sure (I've asked questions on this before and people have told me that there shouldn't be a local minimum).
The 90% of the time it was working, I was consistently presenting my inputs in this order: [0, 0], [0, 1], [1, 0], [1, 1] with the expected output set to [0, 1, 1, 0]. When I present the values in the same order consistently, the network eventually learns the pattern. It actually doesn't matter in what order I send it in, as long as it is the exact same order for each epoch.
I then implemented a randomization of the training set, and so this time the order of inputs is sufficiently randomized. I've noticed now that my neural network gets stuck and the errors are decreasing, but at a very small rate (which is getting smaller at each epoch). After a while, the errors start oscillating around a value (so the error stops decreasing).
I'm a novice at this topic and everything I know so far is self-taught (reading tutorials, papers, etc.). Why does the order of presentation of inputs change the behavior of my network? Is it because the change in error is consistent from one input to the next (because the ordering is consistent), which makes it easy for the network to learn?
What can I do to fix this? I'm going over my backpropagation algorithm to make sure I've implemented it right; currently it is implemented with a learning rate and a momentum. I'm considering looking at other enhancements like an adaptive learning-rate. However, the XOR network is often portrayed as a very simple network and so I'm thinking that I shouldn't need to use a sophisticated backpropagation algorithm.

The order in which you present the observations (input vectors) comprising your training set to the network only matters in one respect--randomized arrangement of the observations according to the response variable is strongly preferred versus ordered arrangement.
For instance, suppose you have 150 observations comprising your training set, and for each the response variable is one of three class labels (class I, II, or III), such that observations 1-50 are in class I, 51-100 in class II, and 101-150 in class III. What you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 observations in class I, then all 50 in class II, then all 50 in class III.
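To make the point concrete, here is a minimal sketch of reshuffling a class-sorted training set at the start of every epoch. The function name `shuffled_epochs`, the seed, and the toy data are my own assumptions for illustration, not anything from the question's code.

```python
import random

def shuffled_epochs(samples, labels, n_epochs, seed=0):
    """Yield (samples, labels) in a fresh random order for each epoch.

    Hypothetical helper: each sample stays paired with its own label.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # new order every epoch
        yield [samples[i] for i in indices], [labels[i] for i in indices]

# 150 observations sorted by class, as in the example above.
labels = ["I"] * 50 + ["II"] * 50 + ["III"] * 50
samples = list(range(150))  # stand-ins for the real input vectors

first_x, first_y = next(shuffled_epochs(samples, labels, n_epochs=1))
```

Shuffling an index list rather than the data itself keeps inputs and responses aligned, which is the part that is easy to get wrong.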
What happened while training your classifier? Well, initially you were presenting the four observations to your network with the responses unordered: [0, 1, 1, 0].
I wonder what the ordering of the input vectors was in those instances in which your network failed to converge? If the responses were ordered [1, 1, 0, 0] or [0, 0, 1, 1], this is consistent with the well-documented empirical rule I mentioned above.
On the other hand, I have to wonder whether this rule even applies in your case. The reason is that you have so few training instances that even if the order is [1, 1, 0, 0], training over multiple epochs (which I am sure you must be doing) will mean that this ordering looks more "randomized" than the exemplar I mentioned above (i.e., [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0] is how the network would be presented with the training data over three epochs).
Some suggestions to diagnose the problem:
As I mentioned above, look at the ordering of your input vectors in the non-convergence cases--are they sorted by response variable?
In the non-convergence cases, look at your weight matrices (I assume you have two of them). Look for any values that are very large (e.g., 100x the others, or 100x the value it was initialized with). Large weights can cause overflow.
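A rough sketch of the second check, assuming the weight matrices are stored as lists of lists. The helper name `find_large_weights` is made up, and comparing against the largest initial magnitude is a simplification I chose; comparing each weight with its own initial value would be equally reasonable.

```python
def find_large_weights(weights, initial_weights, factor=100.0):
    """Return (row, col, value) for weights that look suspiciously large.

    Assumption: "large" means at least `factor` times the biggest
    magnitude found in the initial weight matrix.
    """
    reference = max(abs(w0) for row in initial_weights for w0 in row)
    threshold = factor * reference
    flagged = []
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            if abs(w) >= threshold:
                flagged.append((r, c, w))
    return flagged

# Toy matrices, not real training output.
initial = [[0.1, -0.2], [0.3, 0.05]]
trained = [[0.5, -45.0], [0.2, 31.0]]
suspects = find_large_weights(trained, initial)
```

Running this on both weight matrices after a failed run shows quickly whether any weight has blown up.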

Related

How to Implement (if Possible) an Artificial Neural Network with an Output with Multiple (>2) Possibilities?

For example, let's say that we can classify all planets into water, earth, and air. Each of these can be identified by a number of quantitative characteristics, such as albedo, size, and temperature, which range in values from 1-10 and are distinct for each type of planet. If I have inputs for these characteristics, how do I format the neural network's output to output a result as water, earth, or air?
From my (limited) knowledge, my experience tells me that there are at max only two outputs to an artificial neural network that will, at the end, only result true or false (or indeterminate). With one output, there are step functions where the output is 1 if the threshold is crossed, and 0 if the threshold is not crossed, or linear/sigmoidal that can also determine indeterminate. With two outputs, if one output is larger than the other, then the overall output is 1 or 0.
How would I implement a neural network with more than two overall outputs? My scope is only a true/false output, although I feel that the solution may be quite simple (and something that I overlooked). Furthermore, are there any resources to help me with this? The queries I've made haven't been the most successful.
You don't need the step function on the output; once you remove this you have a real-valued output that you can treat in several different ways:
Set ranges of values that are interpreted as each different output. So, 0...0.3 is output 1, 0.3...0.6 is output 2 and 0.6...1.0 is output 3. You would then train for outputs 0, 0.5 and 1.0 for the three possible outputs.
Use three independent networks or three output nodes to predict each of the outputs. Then, the output is considered to be the network that gives the highest result.
Artificial Neural Networks (ANNs) are not limited to one or two outputs. The number of outputs is only limited by your available computing resources.
A commonly used convention for multi-class classification (more than two classes) with multilayer perceptrons is to have as many outputs as there are classes and to have the desired network outputs be all zeros except for a unity output in the output node corresponding to the target class. For example, if there are 5 classes, the desired network output for class 2 would be (0, 1, 0, 0, 0) and the desired output for class 5 would be (0, 0, 0, 0, 1). This is the case where the classes are considered mutually exclusive.
But you could also define your target outputs to have more than one unity value. For example, if output 1 corresponds to "mammal" and output 4 corresponds to "dog", then you could specify the output for a Beagle (a kind of dog) to be (1, 0, 0, 1, 0). How you map the outputs to your target classes is up to you. The trick is setting up the network architecture (number & sizes of layers) so that your classes are learnable.
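For the mutually exclusive case, a small sketch of building the target vectors and reading a class back out of the network's outputs by taking the largest one. The function names are hypothetical, and the indices are 0-based, so class 2 in the example above is index 1.

```python
def one_hot(class_index, n_classes):
    """Target vector: all zeros except a 1 at class_index (0-based)."""
    return [1 if i == class_index else 0 for i in range(n_classes)]

def predicted_class(outputs):
    """0-based index of the largest output value."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

target_class_2 = one_hot(1, 5)   # class 2 of 5
target_class_5 = one_hot(4, 5)   # class 5 of 5

# Made-up network outputs; the third unit is the largest.
winner = predicted_class([0.1, 0.2, 0.6, 0.05, 0.05])
```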
In classification cases like this, the best performance is reached using three discrete output units in the form (a, b, c), where a, b and c can each take the value 0 or 1. Prepare your training set for a network with three output units, setting the right value for each record.
Generally, the "winner takes all" rule is used (the highest value wins and gives you the final category), but I prefer to use ROC curves to analyze the results.
Be careful with the number of hidden units and layers. Multiple outputs are possible without problems (not limited to 2), but for a fixed number of hidden units and intermediate layers, more outputs mean more training data is needed to reach an acceptable result (the curse of dimensionality problem).
Suppose you have n classes. Then you can implement the output layer as a Softmax Regression Layer of n units instead of a regular Logistic Regression Layer.

Does Convolution Neural Network need normalized input?

I have trained a convolutional neural network. After comparing two normalizations, I found that simply subtracting the mean and dividing by the standard deviation works better than scaling into [0, 1]. It seems that the input values do not need to lie in [0, 1] when using a sigmoid function.
Could anybody explain this?
If you're talking about a NN using logistic regression, then you are correct that a suitable sigmoid function (or logistic function in this context) will give you a [0, 1] range from your original inputs.
However, the logistic function works best when the inputs are in a small range on either side of zero - so, for example, your input to the logistic function might be [-3, +3].
By rescaling your data to [0, 1] first, you would flatten out any underlying distribution and move all of your data to the positive side of zero, which is not what the logistic function expects. So you will get a worse result than by normalising (i.e. subtract mean and divide by standard deviation, as you said) because that normalisation step takes account of the variance in the original distribution and makes sure that the mean is zero so you get both positive and negative data input to the logistic function.
In your question, you said "comparing two normalisations" - I think you are misunderstanding what "normalisation" means and actually comparing normalisation with rescaling, which is different.
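To see the difference between the two transformations, here is a minimal sketch on a toy list of values. The function names and the sample data are assumptions for illustration, and both functions assume the input is not constant (otherwise they divide by zero).

```python
import statistics

def rescale_01(values):
    """Min-max rescaling: maps the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

pixels = [0, 50, 100, 150, 200]   # made-up input values
scaled = rescale_01(pixels)       # all values are non-negative
z = standardize(pixels)           # values fall on both sides of zero
```

After rescaling everything sits at or above zero, while after standardizing the mean is zero and the values are spread over positive and negative numbers, which is the property the answer above is pointing at.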

Multiple Output Neural Network

I have built my first neural network in Python, and I've been playing around with a few datasets; it's going well so far!
I have a quick question regarding modelling events with multiple outcomes:
Say I wish to train a network to tell me the probability of each runner winning a 100m sprint. I would give the network all of the relevant data regarding each runner, and the number of outputs would be equal to the number of runners in the race.
My question is, using a sigmoid function, how can I ensure the sum of the outputs will be equal to 1.0? Will the network naturally learn to do this, or will I have to somehow make this happen explicitly? If so, how would I go about doing this?
Many Thanks.
The output from your neural network will approach 1. I don't think it will actually get to 1.
You actually don't need to see which output is equal to 1. Once you've trained your network up to a specific error level, when you present the inputs, just look for the maximum output in your output layer. For example, let's say your output layer presents the following output: [0.0001, 0.00023, 0.0041, 0.99999412, 0.0012, 0.0002], then the runner that won the race is runner number 4.
So yes, your network will "learn" to produce 1, but it won't exactly be 1. This is why you train to within a certain error rate. I recently created a neural network to recognize handwritten digits, and this is the method that I used. In my output layer, I have a vector with 10 components. The first component represents 0, and the last component represents 9. So when I present a 4 to the network, I expect the output vector to look like [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]. Of course, it's not what I get exactly, but it's what I train the network to provide. So to find which digit it is, I simply check to see which component has the highest output or score.
Now in your second question, I believe you're asking how the network would learn to provide the correct answer? To do this, you need to provide your network with some training data and train it until the output is under a certain error threshold. So what you need is a set of data that contains the inputs and the correct output. Initially your neural network will be set up with random weights (there are some algorithms that help you select better weights to minimize training time, but that's a little more advanced). Next you need a way to tell the neural network to learn from the data provided. So basically you give the data to the neural network and it provides an output, which is highly likely to be wrong. Then you compare that data with the expected (correct) output and you tell the neural network to update its weights so that it gets closer to the correct answer. You do this over and over again until the error is below a certain threshold.
The easiest way to do this is to implement the stochastic backpropagation algorithm. In this algorithm, you calculate the error between the actual output of the neural network and the expected output. Then you backpropagate the error from the output layer all the way up to the weights to the hidden layer, adjusting the weights as you go. Then you repeat this process until the error that you calculate is below a certain threshold. So during each step, you're getting closer and closer towards your solution.
You can use the algorithm described here. There is a decent amount of math involved, so be prepared for that! If you want to see an example of an implementation of this algorithm, you can take a look at this Java code that I have on github. The code uses momentum and a simple form of simulated annealing as well, but the standard backpropagation algorithm should be easily discernible. The Wikipedia article on backpropagation has a link to an implementation of the backpropagation algorithm in Python.
You're probably not going to understand the algorithm immediately; expect to spend some time understanding it and working through some of the math. I sat down with a pencil and paper as I was coding, and that's how I eventually understood what was going on.
Here are a few resources that should help you understand backpropagation a little better:
The learning process: backpropagation
Error backpropagation
If you want some more resources, you can also take a look at my answer here.
Basically you want a function of multiple real numbers that converts those real numbers into probabilities (each between 0 and 1, summing to 1). You can do this easily by post-processing the output of your network.
Your network gives you real numbers r1, r2, ..., rn, each of which increases with the probability that the corresponding runner wins the race.
Then compute exp(r1), exp(r2), ..., exp(rn) and sum them: ers = exp(r1) + exp(r2) + ... + exp(rn). The probability that the first runner wins is then exp(r1) / ers.
This is one use of the Boltzmann distribution. http://en.wikipedia.org/wiki/Boltzmann_distribution
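The exp-and-normalize step is usually called softmax. A minimal sketch follows; subtracting the maximum before exponentiating is my addition to avoid overflow and does not change the result, and the scores are made up.

```python
import math

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)                          # "ers" in the text, after the shift
    return [e / total for e in exps]

# Hypothetical raw network outputs for three runners.
probs = softmax([2.0, 1.0, 0.1])
```

The runner with the largest raw score still gets the largest probability, so taking the maximum gives the same winner before and after this step.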
Your network should work around that and learn it naturally eventually.
To make the network learn that a little faster, here's what springs to mind first:
add an additional output called 'sum' (summing all the other output neurons). If you want all the output neurons to be in a separate layer, just add a layer of outputs: each of the first numRunners outputs connects to the corresponding neuron in the previous layer, and the last (numRunners+1-th) neuron connects to all the neurons in the previous layer, with those weights fixed to 1
the training set would contain 0-1 vectors for each runner (did or did not run), and the "expected" result would be a 0-1 vector 00..00001000..01, the first 1 marking the runner that won the race and the last 1 marking the "sum" of "probabilities"
for the unknown races, the network would try to predict which runner would win. Since the outputs have continuous values (more-or-less :D) they can be read as "the certainty of the network that the runner would win the race" -- which is what you're looking for
Even without the additional sum neuron, this is the rough description of the way the training data should be arranged.

Artificial neural networks

I want to know whether Artificial Neural Networks can be applied to discrete-valued inputs. I know they can be applied to continuous-valued inputs, but can they be applied to discrete-valued ones? Also, will they perform well for discrete-valued inputs?
Yes, artificial neural networks may be applied to data featuring discrete-valued input variables. In the most commonly used neural network architectures (which are numeric), discrete inputs are typically represented by a series of dummy variables, just as in statistical regression. Also, as with regression, the number of dummy variables needed is one less than the number of distinct values. There are other methods, but this is the most straightforward.
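A small sketch of that dummy-variable coding, assuming the first listed category is the baseline that gets all zeros. The function name and the colour categories are invented for the example.

```python
def dummy_encode(value, categories):
    """Encode `value` as len(categories) - 1 dummy variables.

    Assumption: categories[0] is the baseline, encoded as all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[1:]]

colours = ["red", "green", "blue"]        # three distinct values
encoded = [dummy_encode(c, colours) for c in colours]
```

Three distinct values need only two inputs here, because the baseline is identified by both dummies being zero.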
Well, good question let me say!
First of all let me answer directly yes to your question!
The answer requires considering a few aspects of the use and implementation of the network itself.
Then let me explain why:
The easiest way is to normalize the input as usual (this is the first rule of thumb with NNs),
then let the neural network compute the task. Once you have your output, invert the normalization to get the output back in the original range, but still continuous; to get back discrete values, just take the integer part of your output. It is easy, it works and is fine, DONE! A good result just depends on the topology you design for your network.
As a plus you could consider using a "step" transfer function instead of "tan-sigmoid" between layers, just to strengthen and mimic a sort of digitization, forcing the output to be just 0 or 1. But you should also reconsider the starting normalization, as well as the use of well-tuned thresholds.
NB: this latter trick is not really necessary but could give some secondary benefits; maybe test it in a second stage of your development and look at the differences.
PS: Just let me suggest something that should apply to your issue; if you want to be smart, consider using some fuzzy logic in your learning algorithm ;-)
Cheers!
I'm late on this question, but this may help someone.
Say you have a categorical output variable, for example 3 different categories (0, 1 and 2),
outputs
0
2
1
2
1
0
then becomes
1, 0, 0
0, 0, 1
0, 1, 0
0, 0, 1
0, 1, 0
1, 0, 0
A possible NN output result is
0.2, 0.3, 0.5 (winner is categ 2)
0.05, 0.9, 0.05 (winner is categ 1)
...
Your NN will then have 3 output nodes in this case, so take the max value.
To improve this, use cross-entropy as the error measure and a softmax activation on the output layer, so that the outputs sum up to 1.
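As a rough illustration of the error measure, here is a sketch of cross-entropy between a one-hot target and the network's outputs, using the two example rows above. The function name and the small `eps` guard against log(0) are my own assumptions.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# Category 2 predicted with probability 0.5, category 1 with probability 0.9.
loss_unsure = cross_entropy([0, 0, 1], [0.2, 0.3, 0.5])
loss_confident = cross_entropy([0, 1, 0], [0.05, 0.9, 0.05])
```

With a one-hot target only the probability assigned to the correct category contributes, so a confident correct prediction gives a smaller error than a hesitant one.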
The purpose of a neural network is to approximate complicated functions by interpolating samples. As such, they tend to be a poor fit for discrete data, unless that data can be expressed by thresholding a continuous function. Depending on your problem, there are likely to be much more effective learning methods.

Is there a supervised learning algorithm that takes tags as input, and produces a probability as output?

Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up-to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database", however "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
You can try several methods (linear regression, SVM, neural networks). The input vector should consist of all possible tags, where each tag represents one dimension.
Then each record in a training set has to be transformed to the input vector according to the tags. For example, let's say you have different combinations of 4 tags in your training set (php, ruby, ms, sql) and you define an unweighted input vector [php, ruby, ms, sql]. Let's say you have the following 3 records, which are transformed to weighted input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
If you use linear regression, you use the following formula
y = k * X
where y represents the answer (upvote/downvote) in your case, X is the weighted input vector, and the weights k are found by inserting the known values.
How to calculate the weights for linear regression you can read here, but the point is to create binary input vectors whose size is equal to (or larger than, if you take some other variables into account) the number of all tags, and then for each record to set the weight of each tag (1 if it is included, 0 otherwise).
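The transformation from tags to binary input vectors can be sketched like this, using the four-tag vocabulary from the example. The function name `tags_to_vector` is hypothetical, and with tens of thousands of tags a sparse representation would probably be more practical than a dense list.

```python
def tags_to_vector(tags, vocabulary):
    """Return a 0/1 vector with one position per tag in `vocabulary`."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]

vocabulary = ["php", "ruby", "ms", "sql"]
records = [["php", "sql"], ["ruby"], ["ms", "sql"]]
vectors = [tags_to_vector(r, vocabulary) for r in records]
```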
