How to predict with multiple measurements datasets - database

I have a collection of measurements based on 10 samples.
Each sample was measured three times - in water, air and alcohol, each file contains two-dimensional - signal wavelength, signal amplitude (amplitude is in the same range in each measurement).
I want to predict characteristics of device in water and alcohol based on measured characteristics in air.
I don't know how to start. Should I merge all these databases in one? or is there any other way?
Can I join them like that:
amplitude
wavelength-air
1.3
sample1
1.3
sample2
...
...
1.3
sample10
1.4
sample1
...
...
Does it make any sens?

Related

What is the difference between Static HDR and dynamic HDR?

HDR is a high dynamic range which is widely used in video devices to have better viewing experience.
What is the difference between static HDR and dynamic HDR?
Dynamic HDR can achieve higher HDR media quality across a variety of
displays.
The following presentation: SMPTE ST 2094 and Dynamic Metadata summarizes the subject of Dynamic Metadata:
Dynamic Metadata for Color Volume Transforms (DMCVT)
- Can preserve the creative intent in HDR media across a variety of displays
- Carried in files, video streams, packaged media
- Standardized in SMPTE ST 2094
It all starts with digital Quantization.
Assume you need to approximate the numbers between 0 and 1,000,000 using only 1000 possible values.
Your first option is using uniform quantification:
Values in range [0, 999] are mapped to 0, range [1000, 1999] are mapped to 1, [2000, 2999] are mapped to 2, and so on...
When you need to restore the original data, you can't restore it accurately, so you need to get the value with minimal average error.
0 is mapped to 500 (to the center of the range [0, 999]).
1 is mapped to 1500 (to the center of the range [1000, 1999]).
When you restore the quntized data, you are loosing lots of information.
The information you loose is called "Quantization error".
The common HDR video applies 10 bits per color component (10 bits for Y component, 10 bits for U and 10 bits for V). Or 10 bits for red, 10 for green and 10 for blue in RGB color space.
10 bits can store 1024 possible values (values in range [0, 1023]).
Assume you have a very good monitor that can display 1,000,001 different brightness levels (0 is darkest and 1000000 is the brightest).
Now you need to quantize the 1,000,001 levels to 1024 values.
Since the response of the human visual system to brightness level is not linear, the uniform quantization illustrated above, is sub-optimal.
The quantization to 10 bits is performed after applying a gamma function.
Example for gamma function: divide each value by 1000000 (new range is [0,1]), compute square root of each value, and multiply the result by 1000000.
Apply the quantization after the gamma function.
The result is: keeping more accuracy on the darker values, on expanse of the brighter values.
The monitor do the opposite operation (de-quantization, and inverse gamma).
Preforming the quantization after applying gamma function results a better quality for the human visual system.
In reality, square root is not the best gamma function.
There are three types of standard HDR static gamma functions:
HLG - Hybrid Log Gamma
PQ - Perceptual Quantizer
HDR10 - Static Metadata
Can we do better?
What if we could select the optimal "gamma functions" for each video frame?
Example for Dynamic Metadata:
Consider the case where all the brightness levels in the image are in range [500000, 501000]:
Now we can map all the levels to 10 bits, without any quantization.
All we need to do is send 500000 as minimum level, and 501000 as minimum level in the image metadata.
Instead of quantization, we can just subtract 500000 from each value.
The monitor that receives the image, reads the metadata, and knows to add 500000 to each value - so there is a perfect data reconstruction (no quantization errors).
Assume the levels of the next image is in range 400000 to 401000, so we need to adjust the metadata (dynamically).
DMCVT - Dynamic Metadata for Color Volume Transform
The true math of DMCVT is much more complicated than the example above (and much more than quantization), but it's based on the same principles - adjusting the metadata dynamically according to the scene and display, can achieve better quality compared to static gamma (or static metadata).
In case you are still reading...
I am really not sure that the main advantage of DMCVT is reducing the quantization errors.
(It was just simpler to give an example of reducing the quantization errors).
Reducing the conversion errors:
Accurate conversion from the digital representation of the input (e.g BT.2100 to the optimal pixel value of the display (like the RGB voltage of the pixel) requires "heavy math".
The conversion process is called Color Volume Transformation.
Displays replaces the heavy computation with mathematical approximations (using look up tables and interpolations [I suppose]).
Another advantage of DMCVT, is moving the "heavy math" from the display to the video post-production process.
The computational resources in the video post-production stage are in order of magnitudes higher than the display resources.
In the post-production stage, the computers can calculate metadata that helps the display performing much more accurate Color Volume Transformation (with less computational resources), and reduce the conversion errors considerably.
Example from the presentation:
Why does "HDR static gamma functions" called static?
Opposed to DMCVT, the static gamma functions are fixed across the entire movie, or fixed (pre-defined) across the entire "system".
For example: Most PC systems (PC and monitors) are using sRGB color space (not HDR).
The sRGB standard uses the following fixed gamma function:
.
Both the PC system and the display knows from advance, that they are working in sRGB standard, and knows that this is the gamma function that is used (without adding any metadata, or adding one byte of metadata that marks the video data as sRGB).

What is the correct method to upsample?

I have an array of samples at 75 Hz, and I want to store them at 128 Hz. If it was 64 Hz and 128 Hz it was very simple, I would just double all samples. But what is the correct way if the samplerates are not a fraction of eachother?
When you want to avoid Filtering then you can:
handle signal as set of joined interpolation cubics curves
but this point is the same as if you use linear interpolation. Without knowing something more about your signal and purpose you can not construct valid coefficients (without damaging signal accuracy) for example of how to construct such cubic look here:
my interpolation cubic
in bullet #3 inside that link are coefficients I use. I think there are sufficient even for your purpose so you can try them. If you want to do custom interpolation look here:
how to construct custom interpolation curve
create function that can return point in your signal given time from start of sampling
so do something like
double signal(double time);
where time is time in [s] from start of sampling. Inside this function compute which 4 samples you need to access.
ix = floor(time*75.0);
gives you curve start point sample index. Cubic need 4 points one before curve and one after ... so for interpolation cubic points p0,p1,p2,p3 use samples ix-1,ix,ix+1,ix+2. Compute cubic coefficients a0,a1,a2,a3 and compute cubic curve parameter t (I use range <0,1>) so
t=(time*75.0); t-=floor(t);
green - actual curve segment
aqua - actual curve segment control points = 75.0 Hz samples
red - curve parametric interpolation parameter t
gray - actual time
sorry I forgot to draw the actual output signal point it should be the intersection of green and gray
simply do for loop through sampled data with time step 1/128 s
something like this:
double time,duration=samples*75.0,dt=1.0/128.0;
double signal128[???];
for (time=0.0,i=0;time<duration;i++,time+=dt)
signal128[i]=signal(time);
samples are the input signal array size in samples sampled by 75.0 Hz
[notes]
for/duration can be done on integers ...
change signal data type to what you need
inside signal(time) you need to handle edge cases (start and end of signal)
because you have no defined points in signal before first sample and after last sample. You can duplicate them or mirror next point (mirror is better).
this whole thing can be changed to process continuously without buffers just need to remember 4 last points in signal so you can do this in RT. Of coarse you will be delayed by 2-3 75.0 Hz samples ... and when you put all this together you will see that this is a FIR filter anyway :)
if you need to preserve more then first derivation add more points ...
You do not need to upsample and then downsample.
Instead, one can interpolate all the new sample points, at the desired spacing in time, using a wide enough low-pass interpolation kernel, such as a windowed Sinc function. This is often done by using a pre-calculated polyphase filter bank, either directly, or with an addition linear interpolation of the filter table. But if performance is not critical, then one can directly calculate each coefficient for each interpolated point.
The easiest way is to upsample to a sample rate which is the LCM of your two sample rates and then downsample - that way you get integer upsample/downsample ratios. In your case there are no common factors in the two sample rates so you would need to upsample by a factor of 128 to 9.6 kHz and then downsample by a factor of 75 to 128 Hz. For the upsampling you insert 127 0 samples in between each sample, then apply a suitable filter (37 Hz LPF, Fs = 9.6 kHz), and then downsample by taking every 75th sample. The filter design is the only tricky part, but there are online tools for taking the hard work out of this.
Alternatively look at third-party libraries which handle resampling, e.g. sox.
You need to upsample and downsample with an intermediate sampling frequency, as #Paul mentioned. In addition, it is needed to filter the signal after each transformation, which can be achieved by linear interpolation as:
% Parameters
F = 2;
Fs1 = 75;
Fs3 = 128;
Fs2 = lcm(Fs1,Fs3);
% Original signal
t1 = 0:1/Fs1:1;
y1 = sin(2*pi*F*t1);
% Up-sampled signal
t2 = 0:1/Fs2:1;
y2 = interp1(t1,y1,t2);
% Down-sampled signal
t3 = 0:1/Fs3:1;
y3 = interp1(t2,y2,t3);
figure;
subplot(3,1,1);
plot(t1,y1,'b*-');
title(['Signal with sampling frequency of ', num2str(Fs1), 'Hz']);
subplot(3,1,2);
plot(t2,y2,'b*-');
title(['Signal with sampling frequency of ', num2str(Fs2), 'Hz']);
subplot(3,1,3);
plot(t3,y3,'b*-');
title(['Signal with sampling frequency of ', num2str(Fs3), 'Hz']);

Correctness of logistic regression in Vowpal Wabbit?

I have started using Vowpal Wabbit for logistic regression, however I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression?
For example, with the simple data below, we aim to model the way age predicts label. It is obvious there is a strong relationship as when age increases the probability of observing 1 increases.
As a simple unit test, I used the 12 rows of data below:
age label
20 0
25 0
30 0
35 0
40 0
50 0
60 1
65 0
70 1
75 1
77 1
80 1
Now, performing a logistic regression on this dataset, using R , SPSS or even by hand, produces a model which looks like L = 0.2294*age - 14.08. So if I substitude the age, and use the logit transform prob=1/(1+EXP(-L)) I can obtain the predicted probabilities which range from 0.0001 for the first row, to 0.9864 for the last row, as reasonably expected.
If I plug in the same data in Vowpal Wabbit,
-1 'P1 |f age:20
-1 'P2 |f age:25
-1 'P3 |f age:30
-1 'P4 |f age:35
-1 'P5 |f age:40
-1 'P6 |f age:50
1 'P7 |f age:60
-1 'P8 |f age:65
1 'P9 |f age:70
1 'P10 |f age:75
1 'P11 |f age:77
1 'P12 |f age:80
And then perform a logistic regression using
vw -d data.txt -f demo_model.vw --loss_function logistic --invert_hash aaa
(command line consistent with How to perform logistic regression using vowpal wabbit on very imbalanced dataset ) , I obtain a model L= -0.00094*age - 0.03857 , which is very different.
The predicted values obtained using -r or -p further confirm this. The resulting probabilities end up nearly all the same, for example 0.4857 for age=20, and 0.4716 for age=80, which is extremely off.
I have noticed this inconsistency with larger datasets too. In what sense is Vowpal Wabbit carrying out the logistic regression differently, and how are the results to be interpreted?
This is a common misunderstanding of vowpal wabbit.
One cannot compare batch learning with online learning.
vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.
There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.
There are several upsides though:
Online learners don't need to load the full data into memory (they work by examining one example at a time and adjusting the model based on the real-time observed per-example loss) so they can scale easily to billions of examples. A 2011 paper by 4 Yahoo! researchers describes how vowpal wabbit was used to learn from a tera (10^12) feature data-set in 1 hour on 1k nodes. Users regularly use vw to learn from billions of examples data-sets on their desktops and laptops.
Online learning is adaptive and can track changes in conditions over time, so it can learn from non-stationary data, like learning against an adaptive adversary.
Learning introspection: one can observe loss convergence rates while training and identify specific issues, and even gain significant insights from specific data-set examples or features.
Online learners can learn in an incremental fashion so users can intermix labeled and unlabeled examples to keep learning while predicting at the same time.
The estimated error, even during training, is always "out-of-sample" which is a good estimate of the test error. There's no need to split the data into train and test subsets or perform N-way cross-validation. The next (yet unseen) example is always used as a hold-out. This is a tremendous advantage over batch methods from the operational aspect. It greatly simplifies the typical machine-learning process. In addition, as long as you don't run multiple-passes over the data, it serves as a great over-fitting avoidance mechanism.
Online learners are very sensitive to example order. The worst possible order for an online learner is when classes are clustered together (all, or almost all, -1s appear first, followed by all 1s) like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit, is to uniformly shuffle the 1s and -1s (or simply order by time, as the examples typically appear in real-life).
OK now what?
Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?
A: Yes, there is!
You can emulate what a batch learner does more closely, by taking two simple steps:
Uniformly shuffle 1 and -1 examples.
Run multiple passes over the data to give the learner a chance to converge
Caveat: if you run multiple passes until error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.
The second issue here is that the predictions vw gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic (in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0 instead of logistic to map them to a [0, 1] range.
So given the above, here's a recipe that should give you more expected results:
# Train:
vw train.vw -c --passes 1000 -f model.vw --loss_function logistic --holdout_off
# Predict on train set (just as a sanity check) using the just generated model:
vw -t -i model.vw train.vw -p /dev/stdout | logistic | sort -tP -n -k 2
Giving this more expected result on your data-set:
-0.95674145247658 P1
-0.930208359811439 P2
-0.888329575506748 P3
-0.823617739247262 P4
-0.726830630992614 P5
-0.405323815830325 P6
0.0618902961794472 P7
0.298575998150221 P8
0.503468453150847 P9
0.663996516371277 P10
0.715480084449868 P11
0.780212725426778 P12
You could make the results more/less polarized (closer to 1 on the older ages and closer to -1 on the younger) by increasing/decreasing the number of passes. You may also be interested in the following options for training:
--max_prediction <arg> sets the max prediction to <arg>
--min_prediction <arg> sets the min prediction to <arg>
-l <arg> set learning rate to <arg>
For example, by increasing the learning rate from the default 0.5 to a large number (e.g. 10) you can force vw to converge much faster when training on small data-sets thus requiring less passes to get there.
Update
As of mid 2014, vw no longer requires the external logistic utility to map predictions back to [0,1] range. A new --link logistic option maps predictions to the logistic function [0, 1] range. Similarly --link glf1 maps predictions to a generalized logistic function [-1, 1] range.

Using genetic programming to estimate probability

I would like to use a genetic program (gp) to estimate the probability of an 'outcome' from an 'event'. To train the nn I am using a genetic algorithm.
So, in my database I have many events, with each event containing many possible outcomes.
I will give the gp a set of input variables that relate to each outcome in each event.
My questions is - what should the fitness function be in the gp be ????
For instance, right now I am giving the gp a set of input data (outcome input variables), and a set of target data (1 if outcome DID occur, 0 if outcome DIDN'T occur, with the fitness function being the mean squared error of the outputs and targets). I then take the sum of each output for each outcome, and divide each output by the sum (to give the probability). However, I know for sure that this is not the right way to be doing this.
For clarity, this is how I am CURRENTLY doing this:
I would like to estimate the probability of 5 different outcomes occurring in an event:
Outcome 1 - inputs = [0.1, 0.2, 0.1, 0.4]
Outcome 1 - inputs = [0.1, 0.3, 0.1, 0.3]
Outcome 1 - inputs = [0.5, 0.6, 0.2, 0.1]
Outcome 1 - inputs = [0.9, 0.2, 0.1, 0.3]
Outcome 1 - inputs = [0.9, 0.2, 0.9, 0.2]
I will then calculate the gp output for each input:
Outcome 1 - output = 0.1
Outcome 1 - output = 0.7
Outcome 1 - output = 0.2
Outcome 1 - output = 0.4
Outcome 1 - output = 0.4
The sum of the outputs for each outcome in this event would be: 1.80. I would then calculate the 'probability' of each outcome by dividing the output by the sum:
Outcome 1 - p = 0.055
Outcome 1 - p = 0.388
Outcome 1 - p = 0.111
Outcome 1 - p = 0.222
Outcome 1 - p = 0.222
Before you start - I know that these aren't real probabilities, and that this approach does not work !! I just put this here to help you understand what I am trying to achieve.
Can anyone give me some pointers on how I can estimate the probability of each outcome ? (also, please note my maths is not great)
Many thanks
I understand the first part of your question: What you described is a classification problem. You're learning if your inputs relate to whether an outcome was observed (1) or not (0).
There are difficulties with the second part though. If I understand you correctly you take the raw GP output for a certain row of inputs (e.g. 0.7) and treat it as a probability. You said this doesn't work, obviously. In GP you can do classification by introducing a threshold value that splits your classes. If it's bigger than say 0.3 the outcome should be 1 if it's smaller it should be 0. This threshold isn't necessarily 0.5 (again it's just a number, not a probability).
I think if you want to obtain a probability you should attempt to learn multiple models that all explain your classification problem well. I don't expect you have a perfect model that explains your data perfectly, respectively if you have you wouldn't want a probability anyway. You can bag these models together (create an ensemble) and for each outcome you can observe how many models predicted 1 and how many models predicted 0. The amount of models that predicted 1 divided by the number of models could then be interpreted as a probability that this outcome will be observed. If the models are all equally good then you can forget weighing between them, if they're different in quality of course you could factor these into your decision. Models with less quality on their training set are less likely to contribute to a good estimate.
So in summary you should attempt to apply GP e.g. 10 times and then use all 10 models on the training set to calculate their estimate (0 or 1). However, don't force yourself to GP only, there are many classification algorithms that can give good results.
As a sidenote, I'm part of the development team of a software called HeuristicLab which runs under Windows and with which you can run GP and create such ensembles. The software is open source.
AI is all about complex algorithms. Think about it, the downside is very often, that these algorithms become black boxes. So the counterside to algoritms, such as NN and GA, are they are inherently opaque. That is what you want if you want to have a car driving itself. On the other hand this means, that you need tools to look into the black box.
What I'm saying is that GA is probably not what you want to solve your problem. If you want to solve AI types of problems, you first have to know how to use standard techniques, such as regression, LDA etc.
So, combining NN and GA is usually a bad sign, because you are stacking one black box on another. I believe this is bad design. An NN and GA are nothing else than non-linear optimizers. I would suggest to you to look at principal component analysis (PDA), SVD and linear classifiers first (see wikipedia). If you figure out to solve simple statistical problems move on to more complex ones. Check out the great textbook by Russell/Norvig, read some of their source code.
To answer the questions one really has to look at the dataset extensively. If you are working on a small problem, define the probabilities etc., and you might get an answer here. Perhaps check out Bayesian statistics as well. This will get you started I believe.

Increasing the pitch of audio using a varied value

Okay, this a bit of maths and DSP question.
Let us say I have 20,000 samples which I want to resample at a different pitch. Twice the normal rate for example. Using an Interpolate cubic method found here I would set my new array index values by multiplying the i variable in an iteration by the new pitch (in this case 2.0). This would also set my new array of samples to total 10,000. As the interpolation is going double the speed it only needs half the amount of time to finish.
But what if I want my pitch to vary throughout the recording? Basically I would like it to slowly increase from a normal rate to 8 times faster (at the 10,000 sample mark) and then back to 1.0. It would be an arc. My questions are this:
How do I calculate how many samples would the final audio track be?
How to create an array of pitch values that would represent this increase from 1.0 to 8.0 back to 1.0
Mind you this is not for live audio output, but for transforming recorded sound. I mainly work in C, but I don't know if that is relevant.
I know this probably is complicated, so please feel free to ask for clarifications.
To represent an increase from 1.0 to 8.0 and back, you could use a function of this form:
f(x) = 1 + 7/2*(1 - cos(2*pi*x/y))
Where y is the number of samples in the resulting track.
It will start at 1 for x=0, increase to 8 for x=y/2, then decrease back to 1 for x=y.
Here's what it looks like for y=10:
Now we need to find the value of y depending on z, the original number of samples (20,000 in this case but let's be general). For this we solve integral 1+7/2 (1-cos(2 pi x/y)) dx from 0 to y = z. The solution is y = 2*z/9 = z/4.5, nice and simple :)
Therefore, for an input with 20,000 samples, you'll get 4,444 samples in the output.
Finally, instead of multiplying the output index by the pitch value, you can access the original samples like this: output[i] = input[g(i)], where g is the integral of the above function f:
g(x) = (9*x)/2-(7*y*sin((2*pi*x)/y))/(4*pi)
For y=4444, it looks like this:
In order not to end up with aliasing in the result, you will also need to low pass filter before or during interpolation using either a filter with a variable transition frequency lower than half the local sample rate, or with a fixed cutoff frequency more than 16X lower than the current sample rate (for an 8X peak pitch increase). This will require a more sophisticated interpolator than a cubic spline. For best results, you might want to try a variable width windowed sinc kernel interpolator.

Resources