Nominal-value inputs for a Neural Network

I have a set of training data; each item in this set consists of 4 numerical values and 1 nominal value, which is the name of the method these values were calculated with. (There are 8 methods.)
I'm training a Neural Network with these. To get rid of the nominal value I simply assigned a value from 1 to 8 to each method, used one input to pass it to the Neural Network, and used 4 other inputs for the numerical values. It sort of works, but the result is not as good as I want.
So my question is: could it be because of this simple assignment of numbers to nominal values? Or maybe it is because of mixing two different categories of inputs which are not really at the same level (numbers and method types)?

As a general note, a better way of encoding nominal values is a binary vector. In your case, in addition to the 4 continuous-valued inputs, you'd have 8 binary input neurons, of which only one is active (1) while the other 7 are inactive.
The way you did it implies an artificial relationship between the computation methods, which is almost certainly an artifact. For example, 1 and 2 are numerically (and, from your network's point of view!) nearer than 1 and 8. But are methods 1 and 2 really more similar, or related, than methods 1 and 8?
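For concreteness, here is a minimal Python sketch of that encoding; the method names are invented placeholders for your eight methods.

    import numpy as np

    METHODS = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]  # placeholder names

    def encode_sample(values, method):
        """Concatenate the 4 numeric inputs with a one-hot vector for the method."""
        one_hot = np.zeros(len(METHODS))
        one_hot[METHODS.index(method)] = 1.0
        return np.concatenate([np.asarray(values, dtype=float), one_hot])

    # 4 numeric values computed by method "m3" -> 12 network inputs
    x = encode_sample([0.7, 1.2, -0.3, 4.1], "m3")
    print(x)  # [ 0.7  1.2 -0.3  4.1  0.   0.   1.   0.   0.   0.   0.   0. ]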

Since you don't provide much detail, my answer can't be very specific.
Generally speaking, neural networks tend to perform worse when nominal values are coded as numeric values, since the transformation imposes a (probably false) ordering on the variable. Mixing inputs with very different scales also tends to hurt performance.
However, given the little information provided here, there is no way of telling whether this is the reason the network's performance is "not as amazing" as you want. It could just as well be that you don't have enough training data, or that your training data contains a lot of noise. Perhaps you need to pre-scale your data, perhaps there is an error in your network code, perhaps you have chosen ill-suited constants for your learning algorithm...
The reasons a neural network doesn't perform as expected are many and diverse (one of them being unreasonably high expectations). Without much more information there is no way of knowing what the problem is in your case.

Mapping categories to numerical values is not good practice in statistics, especially in the case of neural networks. Bear in mind that neural networks tend to map similar inputs to similar outputs. If you map category A to 1 and category B to 2 (both as inputs), the NN will try to output similar values for both categories, even if they have nothing to do with each other.
A sparser representation is preferred. If you have 4 categories, map them like this:
A -> 0001
B -> 0010
etc
Take a look at the "Subject: How should categories be encoded?" in this link:
ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat

The previous answers are right: do not map nominal values to arbitrary numeric ones. However, if the attribute has an ordinal nature ("Low", "Medium", "High", for example), you can replace the nominal values with ascending numeric values. Note that this may not be the optimal solution, since there is no guarantee that, say, "High" = 3 by the nature of your data. Otherwise, use one-hot encoding as suggested.
The reason for this is that a neural network is very similar to regression in the sense that multiple numeric values go through some kind of aggregating function, except that this happens multiple times, and each input is also multiplied by a weight.
So when you enter a numeric value, it undergoes a series of mathematical manipulations involving the network's weights. If you use numeric values for nominal data, nominal values that were mapped to close numeric values will be treated roughly the same in the best case; in the worst case, it can harm your model.
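As a small illustration of the two options, here is a sketch in Python; the levels ("Low"/"Medium"/"High") are just the example from above.

    # Ordinal encoding: only valid if the order is genuinely meaningful.
    ORDINAL = {"Low": 1, "Medium": 2, "High": 3}

    def ordinal_encode(level):
        return [ORDINAL[level]]  # 1 input neuron

    # One-hot encoding: safe for any nominal attribute.
    LEVELS = ["Low", "Medium", "High"]

    def one_hot_encode(level):
        return [1.0 if level == l else 0.0 for l in LEVELS]  # 3 input neurons

    print(ordinal_encode("High"))  # [3]
    print(one_hot_encode("High"))  # [0.0, 0.0, 1.0]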

Related

The order of features affects the results of a neural network

Well, I am really confused.
I have a simple ordering of features, i.e. all the letters and a few symbols, counting how many times each is contained in a string.
My selected features, in order, are as follows:
numberOf_a
numberOf_b
...
numberOf_Z
numberOf_.
numberOf_,
I have a test sample of 65 values, and the MLP can get 46 correct.
Now, if I shuffle the features into a random order, train with the same data, and evaluate the same values, I get a different number of correct predictions, e.g. 49.
Results are consistent (the same order will always yield the same accuracy), but the accuracy changes between random orders.
The question is: is this supposed to happen? I cannot see how this is backed up by the theory. Am I missing something large here?
PS. I am using WEKA's implementation of the MLP
I'm not familiar with the WEKA implementation of the MLP, but that doesn't seem like something that should happen with a neural network algorithm.
It almost seems like it's getting stuck in some sort of local minimum. The algorithm may be initializing the weights of the individual neurons the same way every time. Changing the feature order would then pair those fixed initial weights with different features, causing the algorithm to arrive at a different (but repeatable) answer for each feature order. The "local minimum" might be determined by the algorithm only running a certain number of iterations each time.
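If you want to test this hypothesis outside WEKA, here is a rough sketch using scikit-learn's MLP, on the assumption that its fixed-seed weight initialization behaves analogously; only the column order differs between the two runs.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)
    perm = np.random.RandomState(1).permutation(X.shape[1])

    for name, data in [("original order", X), ("permuted order", X[:, perm])]:
        clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
        clf.fit(data, y)
        # Same data and same seed: only the column order differs, yet the fixed
        # initial weights now pair with different features, so training can
        # converge to a different local minimum and a different accuracy.
        print(name, clf.score(data, y))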

How do I handle uncertainty/missing data in an Artificial Neural Network?

The context:
I'm experimenting with using a feed-forward artificial neural network to create AI for a video game, and I've run into the problem that some of my input features are dependent upon the existence or value of other input features.
The most basic, simplified example I can think of is this:
feature 1 is the number of players (range 2...5)
feature 2 to ? is the score of each player (range >=0)
The number of features needed to inform the ANN of the scores is dependent on the number of players.
The question: How can I represent this dynamic knowledge input to an ANN?
Things I've already considered:
Simply not using such features, or consolidating them into a static input,
i.e. using the sum of the players' scores instead. I seriously doubt this is applicable to my problem; it would result in the loss of too much information and the ANN would fail to perform well.
Passing in an error value (e.g. -1) or a default value (e.g. 0) for non-existent inputs.
I'm not sure how well this would work; in theory the ANN could easily learn from this input and model the function appropriately. In practice I'm worried about the sheer number of non-existent inputs causing problems for the ANN. For example, if the range of players were 2-10 and there were only 2 players, 80% of the input data would be non-existent and would introduce a weird bias into the ANN, resulting in poor performance.
Passing in the mean value over the training set in place of non-existent inputs.
Again, the amount of non-existent input would be a problem, and I'm worried this would introduce weird problems for discrete-valued inputs.
So I'm asking: does anybody have any other solutions I could think about? And is there a standard or commonly used method for handling this problem?
I know it's a rather niche and complicated question for SO, but I was getting bored of the "how do I fix this code?" and "how do I do this in PHP/Javascript?" questions :P, thanks guys.
It sounds like you have multiple data sets (one for each number of players) that aren't really compatible with each other. Would lessons learned from a 5-player game really apply to a 2-player game? Try simplifying the problem, such as option 1 above, and see how the program performs. In AI, absurd simplifications can sometimes give you a lot of traction, like bag-of-words in spam filters.
Try thinking about some model like the following:
Say xi (e.g. x1) is one of the inputs, of which a variable number can exist. You can have n of these (x1 to xn). Let y be the rest of the inputs.
On your first hidden layer, pass x1 and y to the first c nodes, x1,x2 and y to the next c nodes, x1,x2,x3 and y to the next c nodes, and so on. This assumes x1 and x3 can't both be active without x2. The model will have to change appropriately if this needs to be possible.
The rest of the network is a standard feed-forward network with all nodes connected to all nodes of the next layer, or however you choose.
Whenever you have w active inputs, disable all but the w-th set of c nodes (completely exclude them from training for that input set: don't include them when calculating the values of the nodes they output to, and don't update the weights for their inputs or outputs). This allows most of the network to train, while in the first hidden layer only the parts applicable to that number of inputs are trained.
I suggest choosing c such that c*n (the number of nodes in the first hidden layer) is greater than or equal to the number of nodes in the second hidden layer, with c at the very least 10 for a moderately sized network (into the 100s is also fine). I also suggest the network have at least 2 other hidden layers (so 3 in total, excluding input and output). This is not from experience, just what my intuition tells me.
This working is dependent on a certain (possibly undefinable) similarity between the different numbers of inputs, and might not work well, if at all, if this similarity doesn't exist. This also probably requires quite a bit of training data for each number of inputs.
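For illustration only, here is a rough NumPy sketch of the forward pass of such a gated first hidden layer; the sizes n and c, the tanh activation, and the random weights are arbitrary assumptions.

    import numpy as np

    n, c, y_dim = 5, 10, 3  # up to n score inputs, c nodes per group
    rng = np.random.default_rng(0)
    # Group k (k = 1..n) sees inputs x1..xk plus y; one weight matrix per group.
    W = [rng.normal(size=(c, k + y_dim)) for k in range(1, n + 1)]

    def first_hidden_layer(x_active, y):
        """x_active holds the w currently-present score inputs."""
        w = len(x_active)
        out = np.zeros(n * c)  # inactive groups stay at 0 and are not trained
        inp = np.concatenate([x_active, y])
        out[(w - 1) * c : w * c] = np.tanh(W[w - 1] @ inp)  # only the w-th group fires
        return out

    h = first_hidden_layer([0.4, 1.1], np.array([0.2, -0.5, 0.9]))  # 2 players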
If you try it, let me / us know if it works.
If you're interested in Artificial Intelligence discussions, I suggest joining some LinkedIn group dedicated to it; there are some that are quite active and have interesting discussions. There doesn't seem to be much happening on Stack Overflow when it comes to Artificial Intelligence, or maybe we should just work to change that, or both.
UPDATE:
Here is a list of the names of a few decent Artificial Intelligence LinkedIn groups (unless they changed their policies recently, it should be easy enough to join):
'Artificial Intelligence Researchers, Faculty + Professionals'
'Artificial Intelligence Applications'
'Artificial Neural Networks'
'AGI — Artificial General Intelligence'
'Applied Artificial Intelligence' (not too much going on at the moment, and still dealing with some spam, but it is getting better)
'Text Analytics' (if you're interested in that)

Properly Setting Up a Neural Network for Location-to-Location Analysis

I am attempting to train a neural network for a system that can be thought of as a macro-level postal network. My inputs are two locations (each one of the 50 US states) along with 1 to 3 other variables, and I want a numeric result out.
My first inclination was to represent the states as numeric values from 0-49 and then have a network with only 3 or so inputs. What I've found, however, is that my training never converges on a useful value. I am assuming that this is because the values for the states are wholly arbitrary: a value of 39 for MA has no relation to a value of 38 for CA, especially when 37 represents a jump back to CT.
Is there a better way for me to do this? Should I be creating a network with over 100 inputs, representing boolean values for origin and destination states?
I think that your intuition about the difficulty of representing different states as consecutive integers is correct -- that representation compresses a lot of information into each input. That means that your network might have to learn a lot about how to decode that information into facts that are actually useful in solving your problem.
One state per input, with boolean inputs, could help. It would make it easier for the network to figure out which two states you're talking about. Of course, that approach doesn't necessarily make it easy for the network to learn useful facts like which states are adjacent to each other.
It might be useful to try to determine if there are any kinds of information out there that are both easy for you to provide and that also might make learning easier. For example, if the physical layout of the states is important to solving your problem (i.e. CT is adjacent to NY, which is adjacent to PA) then perhaps you could break the country into regions (e.g. northwest, southeast, midwest) and provide boolean inputs for each region.
Feeding a few input schemes like that into a single network could allow you to specify a single state using a (potentially) more useful representation: instead of saying "it's state #39", you could say (for example) "it's the northernmost state that touches more than five neighboring states in the eastern region".
If the network finds it useful to determine whether two states are near each other, this kind of representation might make learning go a bit faster: the network could get a rough idea of whether two states are close by simply comparing the two "region" inputs for the states. Checking whether two region inputs are equal is a lot easier than memorizing the fact that state #39 is near states #38, #21, #7, and #42.
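As an illustration, such an encoding might look like the following sketch in Python; the state list is truncated and the region assignments are invented for the example.

    STATES = ["CT", "NY", "PA", "MA", "CA"]  # truncated for brevity
    REGIONS = {"CT": "northeast", "NY": "northeast", "PA": "northeast",
               "MA": "northeast", "CA": "west"}
    REGION_NAMES = ["northeast", "southeast", "midwest", "west"]

    def encode_state(state):
        state_bits = [1.0 if state == s else 0.0 for s in STATES]
        region_bits = [1.0 if REGIONS[state] == r else 0.0 for r in REGION_NAMES]
        return state_bits + region_bits  # nearby states share their region bits

    print(encode_state("NY"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]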

Continuous output in Neural Networks

How can I set up Neural Networks so that they accept and output a continuous range of values instead of discrete ones?
From what I recall from doing a Neural Network class a couple of years ago, the activation function would be a sigmoid, which yields a value between 0 and 1. If I want my neural network to yield a real valued scalar, what should I do? I thought maybe if I wanted a value between 0 and 10 I could just multiply the value by 10? What if I have negative values? Is this what people usually do or is there any other way? What about the input?
Thanks
Much of the work in the field of neuroevolution involves using neural networks with continuous inputs and outputs.
There are several common approaches:
One node per value
Linear activation functions - as others have noted, you can use non-sigmoid activation functions on output nodes if you are concerned about the limited range of sigmoid functions. However, this can cause your output to become arbitrarily large, which can cause problems during training.
Sigmoid activation functions - simply scaling sigmoid output (or shifting and scaling, if you want negative values) is a common approach in neuroevolution. However, it is worth making sure that your sigmoid function isn't too steep: a steep activation function means that the "useful" range of values is small, which forces network weights to be small. (This is mainly an issue with genetic algorithms, which use a fixed weight modification strategy that doesn't work well when small weights are desired.)
(Two illustrative figures omitted; source: natekohl.net)
Multiple nodes per value - spreading a single continuous value over multiple nodes is a common strategy for representing continuous inputs. It has the benefit of providing more "features" for a network to play with, at the cost of increasing network size.
Binning - spread a single input over multiple nodes (e.g. RBF networks, where each node is a basis function with a different center that will be partially activated by the input). You get some of the benefits of discrete inputs without losing a smooth representation.
Binary representation - divide a single continuous value into 2^N chunks, then feed that value into the network as a binary pattern to N nodes. This approach is compact, but kind of brittle, and results in input that changes in a non-continuous manner.
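As a small sketch of the scaled-sigmoid option above (the range and steepness values here are arbitrary):

    import math

    def scaled_sigmoid(x, lo=-10.0, hi=10.0, steepness=1.0):
        s = 1.0 / (1.0 + math.exp(-steepness * x))  # logistic output in (0, 1)
        return lo + (hi - lo) * s  # shifted and scaled into (lo, hi)

    print(scaled_sigmoid(0.0))  # 0.0 (midpoint of the range)
    print(scaled_sigmoid(3.0))  # ~9.05; larger steepness narrows the useful input range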
There are no rules which require the output (*) to be any particular function. In fact, we typically need to add some arithmetic operations at the end of the function implemented in a given node, in order to scale or otherwise coerce the output to a particular form.
The advantage of working with all-or-nothing outputs and/or outputs normalized to the 0.0-1.0 range is that it makes things more easily tractable, and also avoids issues such as overflow.
(*) "Output" can be understood here as either the output of a given node (neuron) within the network or that of the network as a whole.
As indicated by Mark Bessey, the input [to the network as a whole] and the output [of the network] typically receive some filtering/conversion. As hinted in this response and in Mark's comment, it may be preferable to have normalized/standard nodes in the "hidden" layers of the network, and to apply some normalization/conversion/discretization as required to the input and/or output of the network. Such practice is, however, only a matter of practicality rather than an imperative requirement of Neural Networks in general.
You will typically need to do some filtering (level conversion, etc) on both the input and the output. Obviously, filtering the input will change the internal state, so some consideration needs to be given to not losing the signal you're trying to train on.

How to convert the output of an artificial neural network into probabilities?

I read about neural networks a little while ago, and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways:
1) You get one output neuron. If its value is > 0.5, the event is likely true; if its value is <= 0.5, the event is likely false.
2) You get two output neurons; if the value of the first is greater than the value of the second, the event is likely true, and vice versa.
In either case, the ANN tells you whether an event is likely true or likely false. It does not tell you how likely it is.
Is there a way to convert this value to some odds, or to directly get odds out of the ANN? I'd like to get an output like "The event has an 84% probability of being true".
Once a NN has been trained, e.g. using backpropagation as mentioned in the question (whereby the backpropagation logic has "nudged" the weights in ways that minimize the error function), the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classification purposes.
Whereas the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightforward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of input x weight for that neuron. This value is then fed to an activation function whose purpose is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated; in particular it needn't be linear, but whatever its shape, typically sigmoid, it operates in the same fashion: figuring out where the activation fits on the curve and, if applicable, whether it lies above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses, plural) with a percentage value finds an easy answer: you bet it can. Its output(s) is real-valued (if anything, in need of normalizing) before we convert it to a discrete value (a boolean, or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... how and where do I get "my percentages"? It all depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values into the 0-1 range in such a fashion that all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the values, and the weights of the inputs to the output layer can be used as factors to ensure that they "add up" to 1 (provided that these weights are indeed so normalized themselves).
Et voilà!
Clarification (following Mathieu's note):
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of the output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior to its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as before; neither its training nor its recognition logic is altered. The inputs to the NN stay the same, as do the connections between the various layers, etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' outputs, etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more) output neurons, as with classifiers for example: if all output neurons have the same activation function, the percentage for a given neuron is its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case-by-case situation, because the distinct activation functions may indicate a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
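A minimal sketch of case 2, assuming all output neurons share the same activation function and produce non-negative values:

    def to_percentages(activations):
        """Normalize raw output activations so they add up to 100%."""
        total = sum(activations)
        return [100.0 * a / total for a in activations]

    # e.g. three class neurons after a logistic sigmoid (values in (0, 1)):
    print(to_percentages([0.9, 0.3, 0.1]))  # ~[69.2, 23.1, 7.7]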
What you can do is use a sigmoid transfer function on the output-layer nodes, e.g. the bipolar sigmoid (tanh), which accepts inputs in (-inf, inf) and outputs a value in [-1, 1].
Then, by using 1-of-n output encoding (one node for each class), you can map the range [-1, 1] to [0, 1] and use the result as the probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs in the range 0 to 1. This would tend to be the case if the transfer function (or output function), both in the preceding stage and at the final output, is in the 0-to-1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will be, although repairs are possible. Moreover, unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally, a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range both positive and negative (due to the symmetry of this model).
Another factor is the prevalence of the class: if it is 50%, then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropagation) and to constrain them from going out of the range (in feedforward).
The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally, the bias of the predictor towards positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs. bear markets, or the creditworthiness of people applying for loans vs. people who fail to make loan payments); calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and so can the halved difference between them. It is like flipping the negative-class neuron and averaging. The differences can also give rise to a probability-of-significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guesstimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0 but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve: as the threshold changes for the output of a single neuron (or the difference between two competing neurons), we plot the true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes), sample by sample, and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those represented in the calibration set, and any bad points in the ROC curve, represented by deconvexities (dents), can be smoothed over by the convex hull, probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round, and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
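For what it's worth, here is a rough Python sketch of the underlying idea: estimate the probability of the positive class as the proportion of positives at each score level in a held-out calibration set. The quantile binning is a simplification of the point-by-point ROC scan described above, and the data is synthetic.

    import numpy as np

    def calibrate(scores, labels, n_bins=10):
        """Return a function mapping a raw score to an estimated P(positive)."""
        edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
        centres, probs = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (scores >= lo) & (scores <= hi)
            if mask.any():
                centres.append(scores[mask].mean())
                probs.append(labels[mask].mean())  # fraction of positives in bin
        return lambda s: np.interp(s, centres, probs)  # linear interpolation

    rng = np.random.default_rng(0)
    raw = rng.uniform(-1, 1, 1000)  # stand-in for raw network outputs
    y = (rng.uniform(-1, 1, 1000) < raw).astype(float)  # noisy labels
    p = calibrate(raw, y)
    print(p(0.5))  # estimated probability of the positive class at score 0.5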
I would be very prudent in interpreting the outputs of a neural network (in fact, of any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data; we have to infer it. From my experience, I wouldn't advise anyone to interpret the outputs directly as probabilities.
Did you try Prof. Hinton's suggestion of training the network with a softmax activation function and cross-entropy error?
As an example, create a three-layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
Then train it with cross-entropy error and softmax transfer using your favourite optimizer (stochastic gradient descent / iRprop+ / gradient descent). After training, the outputs should be normalized to sum to 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. The Shark Machine Learning framework provides softmax through combining two models, and Prof. Hinton has an excellent online course at http://coursera.com covering the details.
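For reference, a minimal NumPy sketch of the softmax output plus cross-entropy error combination (forward pass only; the logits are made up):

    import numpy as np

    def softmax(z):
        z = z - z.max()  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()  # outputs are positive and sum to 1

    def cross_entropy(probs, target_class):
        return -np.log(probs[target_class])

    logits = np.array([2.0, 0.5, -1.0])  # raw outputs of the final linear layer
    probs = softmax(logits)
    print(probs, probs.sum())  # ~[0.79 0.18 0.04] 1.0
    print(cross_entropy(probs, 0))  # loss when the true class is index 0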
I remember seeing an example of a neural network trained with backpropagation to approximate the probability of an outcome in the book Introduction to the Theory of Neural Computation (Hertz, Krogh, Palmer). I think the key to the example was a special learning rule, so that you didn't have to convert the output of a unit to a probability; instead, you automatically got the probability as output.
If you have the opportunity, try to check that book.
(By the way, "Boltzmann machines", although less famous, are neural networks designed specifically to learn probability distributions; you may want to check them as well.)
When using an ANN for 2-class classification with a logistic sigmoid activation function in the output layer, the output values can be interpreted as probabilities.
So if you are choosing between 2 classes, you train using 1-of-C encoding, where the 2 ANN outputs have training values (1,0) and (0,1) for the two classes respectively.
To get the probability of the first class in percent, just multiply the first ANN output by 100. To get the probability of the other class, use the second output.
This can be generalized to multi-class classification using the softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.
