How to add random variables in TensorFlow Probability - tensorflow-probability

Given N independent binomial variables (not identically distributed), how does one compute the probability distribution of the sum of those random variables, in the form of a TensorFlow Probability Distribution object?

Convolution is not analytic in general. For binomials, if the p parameter is the same and only the counts differ, you can sum the counts.
If all you want to do is sample, you can write tf.reduce_sum(tfd.Binomial(total_count=[1, 2, 3], probs=[.2, .3, .4]).sample()).
Computing log_prob(x) would require evaluating all legal allocations of x across all the underlying distributions, or using MCMC, ABC, or other schemes.
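If you do need an actual Distribution object with an exact log_prob, and the total counts are small enough to enumerate, one option (a sketch, not the only way) is to convolve the component PMFs and wrap the result in tfd.FiniteDiscrete. The parameters below are the illustrative values from above, and scipy is assumed for the component PMFs:

    import numpy as np
    import tensorflow_probability as tfp
    from scipy.stats import binom

    tfd = tfp.distributions

    counts = [1, 2, 3]        # illustrative, non-identical binomials
    probs = [0.2, 0.3, 0.4]

    # Convolve the component PMFs to get the exact PMF of the sum.
    pmf = np.array([1.0])
    for n, p in zip(counts, probs):
        pmf = np.convolve(pmf, binom.pmf(np.arange(n + 1), n, p))

    # Wrap it as a TFP Distribution over the support {0, ..., sum(counts)}.
    outcomes = np.arange(sum(counts) + 1, dtype=np.float32)
    sum_dist = tfd.FiniteDiscrete(outcomes, probs=pmf.astype(np.float32))

    sum_dist.log_prob(2.0)    # exact log-probability that the sum equals 2

This gives the exact distribution whenever the total support is small enough to enumerate; otherwise the schemes mentioned above (MCMC, ABC, etc.) apply.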

Related

LASSO and sparse solutions

in a text I have found the following:
"The LASSO regerssion method offers a sparse solution and as such the interpretability of the model can be improved".
Can someone help me to understand what is meant by this? As far as I know, a sparse decomposition of a solution to a system of equation is that vector of dimension l with minimum pseudo-l norm such that the system is still satisfied. How would a sparse solution, which is setting some regression coefficients to zero, be of help in the interpretation?
A sparse matrix/array is, by definition, one where most entries are zero and only a few are non-zero. A dense matrix/array, on the other hand, has few zeros.
When you apply LASSO regression, the sparsity of the learned coefficients depends on the amount of the penalty (lambda). The higher the penalty, the sparser the coefficients you get, i.e. the fewer non-zero coefficients (selected variables). For example, if you have 100 independent variables in your regression, the LASSO may return only 10 non-zero coefficients. That means 10 non-zero variables and 90 zero variables, which is exactly what sparsity means.
Having few selected (non-zero) variables means an interpretable model, as you can explain it with those few variables (10 in the example above) instead of all 100.
LASSO's penalty is different from Ridge's: LASSO uses L1 regularization (Ridge uses L2). There is an "alpha" parameter you can set in scikit-learn; for high values of "alpha", many coefficients become exactly zero.
The L1 penalty uses the sum of the absolute values of the coefficients (|w|). For instance, if there are highly correlated features in your dataset, LASSO gives one of the correlated predictors the largest coefficient while the rest are set to 0.
If there are two or more highly collinear variables, LASSO regression selects one of them essentially at random, which is not good for the interpretation of the data.
You can find more details here: https://www.geeksforgeeks.org/lasso-vs-ridge-vs-elastic-net-ml/
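To see the sparsity concretely, here is a small scikit-learn sketch (the data and the alpha value are invented for illustration):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))   # 100 candidate predictors
    beta = np.zeros(100)
    beta[:10] = 5.0                   # only 10 of them actually matter
    y = X @ beta + rng.normal(size=200)

    lasso = Lasso(alpha=0.5).fit(X, y)
    print((lasso.coef_ != 0).sum())   # only a handful of non-zero coefficients

With a large enough alpha, most coefficients are exactly zero, and the model can be read off from the few variables that survive.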

How to Implement (if Possible) an Artificial Neural Network with an Output with Multiple (>2) Possibilities?

For example, let's say that we can classify all planets into water, earth, and air. Each of these can be identified by a number of quantitative characteristics, such as albedo, size, and temperature, which range in values from 1-10 and are distinct for each type of planet. If I have inputs for these characteristics, how do I format the neural network's output to output a result as water, earth, or air?
From my (limited) knowledge and experience, there are at most two outputs to an artificial neural network that will, at the end, only result in true or false (or indeterminate). With one output, there are step functions where the output is 1 if the threshold is crossed and 0 if it is not, or linear/sigmoidal functions that can also yield indeterminate. With two outputs, if one output is larger than the other, then the overall output is 1 or 0.
How would I implement a neural network with more than two overall outputs? My scope is only a true/false output, although I feel that the solution may be quite simple (and something that I overlooked). Furthermore, are there any resources to help me with this? The queries I've made haven't been the most successful.
You don't need the step function on the output; once you remove this you have a real-valued output that you can treat in several different ways:
Set ranges of values that are interpreted as each different output. So, 0...0.3 is output 1, 0.3...0.6 is output 2 and 0.6...1.0 is output 3. You would then train for outputs 0, 0.5 and 1.0 for the three possible outputs.
Use three independent networks or three output nodes to predict each of the outputs. Then, the output is considered to be the network that gives the highest result.
Artificial Neural Networks (ANNs) are not limited to one or two outputs. The number of outputs is only limited by your available computing resources.
A commonly used convention for multi-class classification (more than two classes) with multilayer perceptrons is to have as many outputs as there are classes and to have the desired network outputs be all zeros except for a unity output in the output node corresponding to the target class. For example, if there are 5 classes, the desired network output for class 2 would be (0, 1, 0, 0, 0) and the desired output for class 5 would be (0, 0, 0, 0, 1). This is the case where the classes are considered mutually exclusive.
But you could also define your target outputs to have more than one unity value. For example, if output 1 corresponds to "mammal" and output 4 corresponds to "dog", then you could specify the output for a Beagle (a kind of dog) to be (1, 0, 0, 1, 0). How you map the outputs to your target classes is up to you. The trick is setting up the network architecture (number & sizes of layers) so that your classes are learnable.
In classification cases like this, the best performance is reached using three discrete output units of the form (a, b, c), where a, b and c can each take the value 0 or 1. Prepare your training set for a network with three output units, setting the right target for each record.
Generally the "winner takes all" rule is used (the highest value wins and gives you the final category), but I prefer to use ROC curves to analyze the results.
Be careful with the number of hidden units and layers. Multiple outputs are possible without problems (you are not limited to 2), but for a fixed number of hidden units and intermediate layers, more outputs mean more training data is needed to reach an acceptable result (the curse of dimensionality).
Suppose you have n classes. Then you can implement the output layer as a Softmax Regression Layer of n units instead of a regular Logistic Regression Layer.
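For instance, with Keras (the layer sizes here are arbitrary, and the three inputs are the planet features from the question), a minimal sketch of such a network could look like:

    import tensorflow as tf

    # Three inputs (albedo, size, temperature); three mutually exclusive
    # classes (water, earth, air) encoded 1-of-n via a softmax output.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Train with model.fit(X, y) where y is in {0, 1, 2}; the predicted
    # class is the arg-max of the three outputs ("winner takes all").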

Does a Convolutional Neural Network need normalized input?

I have trained a Convolutional Neural Network and compared two normalizations: I found that simply subtracting the mean and dividing by the standard deviation works better than scaling into [0, 1]. It seems the input interval does not need to be [0, 1] with a sigmoid function. Could anybody explain this?
If you're talking about a NN using logistic regression, then you are correct that a suitable sigmoid function (or logistic function in this context) will give you a [0, 1] range from your original inputs.
However, the logistic function works best when the inputs are in a small range on either side of zero - so, for example, your input to the logistic function might be [-3, +3].
By rescaling your data to [0, 1] first, you would flatten out any underlying distribution and move all of your data to the positive side of zero, which is not what the logistic function expects. So you will get a worse result than by normalising (i.e. subtract mean and divide by standard deviation, as you said) because that normalisation step takes account of the variance in the original distribution and makes sure that the mean is zero so you get both positive and negative data input to the logistic function.
In your question, you said "comparing two normalisations" - I think you are misunderstanding what "normalisation" means and actually comparing normalisation with rescaling, which is different.
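In code, the two preprocessing schemes being compared look like this (x is a hypothetical array of raw pixel values in [0, 255]):

    import numpy as np

    x = np.random.uniform(0, 255, size=(1000, 32, 32, 3))  # fake image batch

    # Rescaling: squeezes everything into [0, 1], all on one side of zero.
    x_rescaled = x / 255.0

    # Normalisation (standardisation): zero mean, unit variance, so the
    # logistic function sees both positive and negative inputs.
    x_normalised = (x - x.mean()) / x.std()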

Generate pseudo sample of population given probabilities

I would like to generate pseudo data that conforms to the distribution of actual sampled data. I am looking for an efficient and accurate method in C/Obj-C for iPhone development. Currently, the occurrence of each of 60 different categories in 1000 sampled events has been assigned a probability (0-1). I want to generate 1000 new events which conform to the same probabilities.
Clarification {
I have a categorical distribution of set {1,2,...,60}. I understand that samples from this distribution will conform to the probabilities of each category. Therefore I need to take 1000 samples from this distribution. I have determined (thanks to answers so far) that I need to:
Normalize this distribution by summing the values and dividing each by the sum.
Order them.
Create a CDF by replacing each value with the sum of all previous values.
Then I can generate a uniform random number between 0 and 1, find the first value in the CDF that is greater than or equal to the number just chosen, and return the category corresponding to that CDF entry.
}
Q1. Is this the correct way to solve the problem?
Q2. The caveat still holds that I'm using NSDecimals to store the category probabilities. Are there any libraries available or functions in Cocoa or Math.h, etc. that I can use to do this simply? I'm open to trying new libraries, currently only have Core-Plot and the standard Cocoa libraries in this project. Thanks.
Your problem description is unclear. But it sounds like you're looking for inverse transform sampling.
Basically, you first need to generate a cumulative distribution function (CDF) corresponding to your original data; call it F(x). You then generate uniform random data in the range 0 to 1 and transform it using the inverse CDF, i.e. F⁻¹(x).
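A compact sketch of that (in Python for readability; the logic ports directly to C or Objective-C, and the probabilities are placeholders):

    import bisect
    import random

    probs = [1.0 / 60] * 60          # placeholder: your 60 category probabilities

    # Build the CDF: cdf[i] = P(category <= i).
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)

    def sample():
        # Invert the CDF with a binary search on a uniform draw.
        return bisect.bisect_left(cdf, random.random())

    events = [sample() for _ in range(1000)]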
Here's my suggestion. This assumes that when you say "normalized probability" you mean that the probabilities of all types sum to 1. (If not, you'll need to rescale so that's the case.)
Make up some order for your 60 types. (Say, alphabetic.)
Generate a random number between 0 and 1. (Call it your "target".)
Create an accumulator, initially at 0.
Loop through your 60 types. For each type:
Add the probability of that type of event to your accumulator.
If your accumulator is >= your target, generate an event of that type and stop.
If you do that 1000 times, I believe you'll get the distribution you're looking for.
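The accumulator loop, written out (again a Python sketch; translating it to C is mechanical):

    import random

    def sample_event(probs):
        target = random.random()     # uniform number between 0 and 1
        accumulator = 0.0
        for category, p in enumerate(probs):
            accumulator += p
            if accumulator >= target:
                return category
        return len(probs) - 1        # guard against floating-point round-off

Note that this is the same inverse-CDF idea as above, just computed with a linear scan instead of a precomputed table.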

How to convert the output of an artificial neural network into probabilities?

I read about neural networks a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways:
1) You get one output neuron. If its value is > 0.5 the event is likely true; if its value is <= 0.5 the event is likely false.
2) You get two output neurons; if the value of the first is greater than the value of the second, the event is likely true, and vice versa.
In these cases, the ANN tells you if an event is likely true or likely false. It does not tell you how likely it is.
Is there a way to convert this value to odds, or to get probabilities directly out of the ANN? I'd like to get an output like "the event has an 84% probability of being true".
Once a NN has been trained, e.g. using backpropagation as mentioned in the question (whereby the backpropagation logic has "nudged" the weights in ways that minimize the error function), the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
While the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightforward when operating as a classifier. The main algorithm is to compute an activation value for each neuron as the sum of input × weight for that neuron. This value is then fed to an activation function whose purpose is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated; in particular it needn't be linear, but whatever its shape, typically sigmoid, it operates in the same fashion: figuring out where the activation fits on the curve and, if applicable, whether it is above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses, plural) with a percentage value finds an easy answer: you bet it can. Its output(s) are real-valued (if anything, in need of normalizing) before we convert them to a discrete value (a boolean, or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... how and where do I get "my percentages"? It all depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values into the 0-1 range in such a fashion that all the percentages add up to 1. In its simplest form, the activation function can be used to normalize the values, and the weights of the inputs to the output layer can be used as factors to ensure they add up to 1 (provided that these weights are themselves normalized accordingly).
Et voilà!
Clarification (following Mathieu's note):
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of the output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior to its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as before: neither its training nor its recognition logic is altered, the inputs to the NN stay the same, and so do the connections between the various layers. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' outputs, etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
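As a toy illustration of case 2 (the activation values are invented, and it assumes all output neurons share the same activation function):

    # Hypothetical activation-function outputs of three output neurons.
    activations = [0.8, 0.15, 0.05]

    total = sum(activations)
    percentages = [100.0 * a / total for a in activations]
    # [80.0, 15.0, 5.0] -> "the first class has an 80% probability"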
What you can do is use a sigmoidal transfer function such as tanh on the output layer nodes (one that accepts inputs in (-inf, inf) and outputs a value in [-1, 1]).
Then, by using the 1-of-n output encoding (one node for each class), you can map the range [-1, 1] to [0, 1] and use each node's value as a probability for its class (note that this works naturally for more than just two classes).
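A sketch of that mapping (the raw outputs are invented; renormalising so the scores sum to 1 is one common choice, not the only one):

    import numpy as np

    y = np.array([0.8, -0.4, 0.1])   # hypothetical tanh outputs, one per class

    p = (y + 1.0) / 2.0              # map [-1, 1] onto [0, 1]
    p = p / p.sum()                  # renormalise so the class scores sum to 1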
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs in the range 0 to 1. This tends to be the case if the transfer function (or output function) of both the preceding stage and of the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will be, although repairs are possible; moreover, unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid, with weights and activations that range positive and negative (due to the symmetry of this model).
Another factor is the prevalence of the class: if it is 50%, then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (in backpropagation) and to constrain them from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as the probability that the neuron is making real predictions rather than guessing. Ideally, the bias of the predictor towards positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs. bear markets, or the creditworthiness of people applying for loans vs. people who fail to make loan payments); calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and so can half the difference between them: it is like flipping the negative-class neuron and averaging. The differences can also give rise to a probability-of-significance estimate (using the t-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guesstimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) against probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0 but the discrimination component remaining the same; the ROC AUC and Informedness should also stay the same (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve: as the threshold changes for the output of a single neuron (or for the difference between two competing neurons), we plot the resulting true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, since what isn't really a positive is a negative). You then scan the ROC curve (a polyline) point by point (each point where the gradient changes), and the proportion of positive samples at each point gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those represented in the calibration set; in fact, any bad points in the ROC curve, represented by deconvexities (dents), can be smoothed over by the convex hull, probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round, and although it could be applied repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
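If you want an off-the-shelf version of this idea, isotonic regression (pair-adjacent violators) is the standard realization of ROC-convex-hull calibration; a scikit-learn sketch with made-up scores and labels:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    scores = np.array([0.1, 0.35, 0.4, 0.7, 0.8])  # raw outputs, calibration set
    labels = np.array([0, 1, 0, 1, 1])             # true classes

    iso = IsotonicRegression(out_of_bounds="clip")
    calibrated = iso.fit_transform(scores, labels)  # monotone map into [0, 1]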
I would be very prudent in interpreting the outputs of a neural network (in fact, of any machine learning classifier) as probabilities. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data; we have to infer it. From my experience, I don't advise anyone to interpret the outputs directly as probabilities.
Did you try Prof. Hinton's suggestion of training the network with a softmax activation function and cross-entropy error?
As an example, create a three-layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
Then train it with cross-entropy error and a softmax transfer, using your favourite optimizer (stochastic gradient descent, iRprop+, gradient descent, ...). After training, the output neurons should be normalized to sum to 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. The Shark Machine Learning framework provides a softmax feature by combining two models, and Prof. Hinton teaches an excellent online course at http://coursera.com covering the details.
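For reference, the softmax itself is just a normalized exponential; a small sketch:

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()           # outputs are positive and sum to 1

    softmax(np.array([2.0, 1.0, 0.1]))  # -> approximately [0.659, 0.242, 0.099]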
I remember seeing an example of a neural network trained with backpropagation to approximate the probability of an outcome in the book Introduction to the Theory of Neural Computation (Hertz, Krogh & Palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to a probability; instead you automatically got the probability as output.
If you have the opportunity, try to check that book.
(By the way, "Boltzmann machines", although less famous, are neural networks designed specifically to learn probability distributions; you may want to check them as well.)
When using an ANN for 2-class classification with a logistic sigmoid activation function in the output layer, the output values can be interpreted as probabilities.
So if you are choosing between 2 classes, you train using 1-of-C encoding, where the 2 ANN outputs have training values of (1,0) and (0,1) for each of the classes respectively.
To get the probability of the first class in percent, just multiply the first ANN output by 100. To get the probability of the other class, use the second output.
This can be generalized for multi-class classification using the softmax activation function.
You can read more, including proofs of the probabilistic interpretation, here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.
