Markov chain Monte Carlo sampling using CDFs instead of PDFs - sampling

I wonder if there is any MCMC sampling method which uses the definition of the multivariate target CDF instead of the target PDF; however, I may use a proposal PDF.
I would like to use Metropolis-Hastings but it is not possible because the calculation of the acceptance ratio is defined in terms of the target PDF.
I say this because it is impossible for me to obtain the PDF associated to certain CDF without doing some kind of numerical approximation which may bias my simulation; also, my CDF might be not continuous, and therefore I cannot differenciate it to obtain the PDF
Regards!

Related

VAE input data scaling

Variational Autoencoders (VAE) are quite a heaving concept themselves. Non-surprisingly most post, comments and tutorials focus on the theory and architecture, but most also fail to address the topic of data scaling. While experimenting with VAEs I have come across a (to me) surprising read flag that the way the data is scaled into an VAE is very important and I could not put my head around it what is the explanation.
To visualize the following issue descripting access the Notebook here: https://github.com/mobias17/VAE-Input-Scaling/blob/master/VAE%20Input%20Scaling.ipynb
Let’s assume the goal is to reconstruct a sine wave (e.g. a sound wave) by a VAE. When I feed the standardized data through the model, it is only able to approximate values between -1 and 1. Obviously, the quick answer would be to normalize the data. Still, this leads to the following questions:
1) What is the rational that the VAE can only approximate values between -1 and 1? (is it the gaussian reparameterization, vanishing gradients?)
2) Is there a way to overcome this boundary (model changes)?
3) What is the best practice to scale data for a VAE? Should the data be normalized over the std dev?
Results showing Sutputs are between -1 & 1
Variational autoencoders can approximate values in any range. The problem here is with this particular model's architecture.
The decoder of this VAE uses as the last layer keras.layers.LSTM.
This layer's default activation function is tanh, and the tanh function outputs values in the range (-1,1). This is why the model cannot generate values outside that range.
However, if we change the activation function to, say, linear, replacing
decoder_mean = LSTM(input_dim, return_sequences=True)
with
decoder_mean = LSTM(input_dim, return_sequences=True, activation=None)
the VAE now can approximate the data. This is the result I got after training for 100 epochs.
The general recommendation is to ensure that the data you are trying to approximate lies in the range of the function you are using to approximate it, either by scaling the data or choosing a more expressive function.

fit a skewed t-distribution or normal distribution in Matlab

I have a dataset that I know for sure that has some sort of skewness (and potentially excess kurtosis). I would like to fit this dataset to some sort of distribution, and I thought the most simplistic is to have a skewed student's t-distribution or skewed normal distribution. What sort of distribution in Matlab can I fit the data to?
Thanks!
L.
There may be no pearspdf function in Matlab, because the seven distribution types of the Pearson distribution mostly correspond to or are based on extant functions for other distributions:
Type 0: Normal distribution, normpdf
Type I: Beta distribution, betapdf
Type II: Student's t-distribution, tpdf
Type III: Gamma distribution, gampdf
Type IV: Not related to any standard distribution
Type V: Inverse gamma distribution, Calculated via gampdf
Type VI: F-distribution, fpdf
Type VII: Student's t-distribution/t location scale distribution, tpdf/prob.tLocationScaleDistribution
The summary above simplifies a lot of course and it would be useful to have one function that calculates your PDF according to the system, like pearsrnd does for random variate generation. Luckily someone has already done that and posted it on the MathWorks File Exchange: pearspdf.
You can also use the second argument of the pearsrnd function, which returns the type of the distribution in the Pearson system (see this page for examples). If, for example, it suggests that your data is Type III, you could attempt to fit it directly using gamfit to estimate the parameter values. gamfit, and other similarly-names functions, are based on robust maximum-likelihood estimation (MLE).

Testing a low pass filter

What is a simple way to see if my low-pass filter is working? I'm in the process of designing a low-pass filter and would like to run tests on it in a relatively straight forward manner.
Presently I open up a WAV file and stick all the samples in a array of ints. I then run the array through the low-pass filter to create a new array. What would an easy way to check if the low-pass filter worked?
All of this is done in C.
You can use a broadband signal such as white noise to measure the frequency response:
generate white noise input signal
pass white noise signal through filter
take FFT of output from filter
compute log magnitude of FFT
plot log magnitude
Rather than coding this all up you can just dump the output from the filter to a text file and then do the analysis in e.g. MATLAB or Octave (hint: use periodogram).
Depends on what you want to test. I'm not a DSP expert, but I know there are different things one could measure about your filter (if that's what you mean by testing).
If the filter is linear then all information of the filter can be found in the impulse response. Read about it here: http://en.wikipedia.org/wiki/Linear_filter
E.g. if you take the Fourier transform of the impulse response, you'll get the frequency response. The frequency response easily tells you if the low-pass filter is worth it's name.
Maybe I underestimate your knowledge about DSP, but I recommend you to read the book on this website: http://www.dspguide.com. It's a very accessible book without difficult math. It's available as a real book, but you can also read it online for free.
EDIT: After reading it I'm convinced that every programmer that ever touches an ADC should definitely have read this book first. I discovered that I did a lot of things the difficult way in past projects that I could have done a thousand times better when I had a little bit more knowledge about DSP. Most of the times an unexperienced programmer is doing DSP without knowing it.
Create two monotone signals, one of a low frequency and one of a high frequency. Then run your filter on the two. If it works, then the low frequency signal should be unmodified whereas the high frequency signal will be filtered out.
Like Bart above mentioned.
If it's LTI system, I would insert impulse and record the samples and perform FFT using matlab and plot magnitude.
You ask why?
In time domain, you have to convolute the input x(t) with the impulse response d(t) to get the transfer function which is tedious.
y(t) = x(t) * d(t)
In frequency domain, convolution becomes simple multiplication.
y(s) = x(s) x d(s)
So, transfer function is y(s)/x(s) = d(s).
That's the reason you take FFT of impulse response to see the behavior of the filter.
You should be able to programmatically generate tones (sine waves) of various frequencies, stuff them into the input array, and then compare the signal energy by summing the squared values of the arrays (and dividing by the length, though that's mathematically not necessary here because the signals should be the same length). The ratio of the output energy to the input energy gives you the filter gain. If your LPF is working correctly, the gain should be close to 1 for low frequencies, close to 0.5 at the bandwidth frequency, and close to zero for high frequencies.
A note: There are various (but essentially the same in spirit) definitions of "bandwidth" and "gain". The method I've suggested should be relatively insensitive to the transient response of the filter because it's essentially averaging the intensity of the signal, though you could improve it by ignoring the first T samples of the input, where T is related to the filter bandwidth. Either way, make sure that the signals are long compared to the inverse of the filter bandwidth.
When I check a digital filter, I compute the magnitude response graph for the filter and plot it. I then generate a linear sweeping sine wave in code or using Audacity, and pass the sweeping sine wave through the filter (taking into account that things might get louder, so the sine wave is quiet enough not to clip) . A visual check is usually enough to assert that the filter is doing what I think it should. If you don't know how to compute the magnitude response I suspect there are tools out there that will compute it for you.
Depending on how certain you want to be, you don't even have to do that. You can just process the linear sweep and see that it attenuated the higher frequencies.

Algorithm for voice comparison

Given two recorded voices in digital format, is there an algorithm to compare the two and return a coefficient of similarity?
I recommend to take a look into the HTK toolkit for speech recognition http://htk.eng.cam.ac.uk/, especially the part on feature extraction.
Features that I would assume to be good indicators:
Mel-Cepstrum coefficients (general timbre)
LPC (for the harmonics)
Given your clarification I think what you are looking for falls under speech recognition algorithms.
Even though you are only looking for the measure of similarity and not trying to turn speech into text, still the concepts are the same and I would not be surprised if a large part of the algorithms would be quite useful.
However, you will have to define this coefficient of similarity more formally and precisely to get anywhere.
EDIT:
I believe speech recognition algorithms would be useful because they do abstraction of the sound and comparison to some known forms. Conceptually this might not be that different from taking two recordings, abstracting them and comparing them.
From wikipedia article on HMM
"In speech recognition, the hidden
Markov model would output a sequence
of n-dimensional real-valued vectors
(with n being a small integer, such as
10), outputting one of these every 10
milliseconds. The vectors would
consist of cepstral coefficients,
which are obtained by taking a Fourier
transform of a short time window of
speech and decorrelating the spectrum
using a cosine transform, then taking
the first (most significant)
coefficients."
So if you run such an algorithm on both recordings you would end up with coefficients that represent the recordings and it might be far easier to measure and establish similarities between the two.
But again now you come to the question of defining the 'similarity coefficient' and introducing dogs and horses did not really help.
(Well it does a bit, but in terms of evaluating algorithms and choosing one over another, you will have to do better).
There are many different algorithms - the general name for this task is Speaker Identification - start with this Wikipedia page and work from there: http://en.wikipedia.org/wiki/Speaker_recognition
I'm not sure this will work for soundfiles, but it gives you an idea how to proceed i hope. That is a basic way how to find a pattern (image) in another image.
You first have to calculate the fft of both the soundfiles and then do a correlation. In formular it would look like (pseudocode):
fftSoundFile1 = fft(soundFile1);
fftConjSoundFile2 = conj(fft(soundFile2));
result_corr = real(ifft(soundFile1.*soundFile2));
Where fft= fast Fourier transform, ifft = inverse, conj = conjugate complex.
The fft is performed on the sample values of the soundfiles.
The peaks in the result_corr vector will then give you the positions of high correlation.
Note that both soundfiles must in this case be of the same size-otherwise you have to place the shorter one into a file of max(soundFileLength) vector.
Regards
Edit: .* means (in matlab style) a component wise mult, you must not do a vector mult!
Next Edit: Note that you have to operate with complex numbers - but there are several Complex classes out there so I think you don't have to bother about this.

How to convert the output of an artificial neural network into probabilities?

I've read about neural network a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways :
1) You get one output neuron. It it's value is > 0.5 the events is likely true, if it's value is <=0.5 the event is likely to be false.
2) You get two output neurons, if the value of the first is > than the value of the second the event is likely true and vice versa.
In these case, the ANN tells you if an event is likely true or likely false. It does not tell how likely it is.
Is there a way to convert this value to some odds or to directly get odds out of the ANN. I'd like to get an output like "The event has a 84% probability to be true"
Once a NN has been trained, for eg. using backprogation as mentioned in the question (whereby the backprogation logic has "nudged" the weights in ways that minimize the error function) the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
Whereby the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightfoward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of the input x weight for that neuron. This value is then fed to an activation function which purpose's is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated, in particular it needn't be linear, but whatever its shape, typically sigmoid, it operate in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses - plural) with a percentage value, finds an easy answer: you bet it can, its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... How and Where do I get "my percentages"?... All depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values in the 0-1 range and in a fashion that the sum of all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the value and the weights of the input to the output layer can be used as factors to ensure the "add up" to 1 question (provided that these weights are indeed so normalized themselves).
Et voilà!
Claritication: (following Mathieu's note)
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
What you can do is to use a sigmoid transfer function on the output layer nodes (that accepts data ranges (-inf,inf) and outputs a value in [-1,1]).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1,1] to [0,1] and use it as probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs a range from 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will but repairs are possible. Moreover unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropogation) and constrain it from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally the bias of the predictor to positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, e.g. credit worthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the results true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes) sample by sample and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points in the ROC curve, represented by deconvexities (dents) can be smoothed over by the convex hull - probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I will be very prudent in interpreting the outputs of a neural networks (in fact any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data, we have to infer it. For my experience I din't advice anyone to interpret directly the outputs as probabilities.
did you try prof. Hinton's suggestion of training the network with softmax activation function and cross entropy error?
as an example create a three layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
then train them with cross entropy error softmax transfer with your favourite optimizer stochastic descent/iprop plus/ grad descent. After training the output neurons should be normalized to sum of 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. Shark Machine Learning framework does provide Softmax feature through combining two models. And prof. Hinton an excellent online course # http://coursera.com regarding the details.
I can remember I saw an example of Neural network trained with back propagation to approximate the probability of an outcome in the book Introduction to the theory of neural computation (hertz krogh palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to probability, but instead you got automatically the probability as output.
If you have the opportunity, try to check that book.
(by the way, "boltzman machines", although less famous, are neural networks designed specifically to learn probability distributions, you may want to check them as well)
When using ANN for 2-class classification and logistic sigmoid activation function is used in the output layer, the output values could be interpreted as probabilities.
So if you choosing between 2 classes, you train using 1-of-C encoding, where 2 ANN outputs will have training values (1,0) and (0,1) for each of classes respectively.
To get probability of first class in percent, just multiply first ANN output to 100. To get probability of other class use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

Resources