I am attempting to correlate the time series from 4 separate tilt monitors that sample every 5 minutes. The time series have slightly different start and end times, and the resulting arrays are slightly different lengths, though they span almost the same period of time (differing by ~3 mins). My goal is to correlate each of these time series with a single "wind speed" time series that covers the same period and also samples every 5 minutes, but has a slightly different array length, origin time, and end time.
The different array lengths in the tilt measurements are due to instrument error. There are some times within each of the arrays where the instrument missed a measurement and so the sample interval is 10 minutes.
My array sizes look something like this:
Tilt_a = 6236x2
Tilt_b = 6310x2
Tilt_c = 6304x2
Tilt_d = 6309x2
Wind_speed = 6283x2
I am using MATLAB to do the correlation. I imagine that I will need to re-sample the data using something like interp1, but I do not know how to reconcile the origin and end times. Is there a method that comes to mind for handling a situation such as this one? Or a function that allows correlating arrays of differing lengths?
So for the different time windows you're analyzing, you could either trim them all so that they start and end at the same time, or you could just review them over their raw intervals, and make your comparisons over the windows that overlap.
As far as the sampling interval, you can use the resample command to make your comparisons more uniform.
https://www.mathworks.com/help/signal/ref/resample.html
Extending the first concept, you could use resample to define new vectors with the start time and end time and interval all synchronized, then continue with your analysis.
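A minimal sketch of the synchronization idea, using interp1 (which the question mentions) rather than resample. The data here is made up: column 1 is assumed to hold time in minutes and column 2 the measurement, and the variable names are illustrative only.

```matlab
% Hypothetical data: column 1 = time (minutes), column 2 = value.
tA   = [0 5 10 20 25 30]';          % tilt monitor with one missed sample (no point at 15)
tilt = [tA, sin(tA/10)];
tW   = (2:5:32)';                   % wind series starting 2 minutes later
wind = [tW, sin(tW/10) + 0.1];

% Common window: latest start time to earliest end time.
tStart  = max(tilt(1,1),   wind(1,1));
tEnd    = min(tilt(end,1), wind(end,1));
tCommon = (tStart:5:tEnd)';         % shared 5-minute grid

% Linearly interpolate both series onto the shared grid.
tiltI = interp1(tilt(:,1), tilt(:,2), tCommon, 'linear');
windI = interp1(wind(:,1), wind(:,2), tCommon, 'linear');

% Both vectors now have the same length, so they can be correlated directly.
R = corrcoef(tiltI, windI);
r = R(1,2);
```

Because the grid stays inside the overlap of both series, interp1 never has to extrapolate and no NaN values appear. The same grid can be reused for each of the four tilt arrays.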
Given an array of values,
arr = [8,10,4,5,3,7,6,0,1,9,13,2]
X is an array of values chosen from arr, where X.length != 0 and X.length < arr.length
The chosen values are then fed into a function, score(), which will return a score based on the array of select values.
Example 1:
X = [8]
score(X) = 71
Example 2:
X = [4]
score(X) = 36
Example 3:
X = [8,10,7]
score(X) = 51
Example 4:
X = [5,9,0]
score(X) = 4
The function score() here is a blackbox and we can't modify how the function works, we just provide an input and the function will return the score output.
My problem: How to get the lowest score for each set of numbers?
Meaning, if X is an array that has only 1 value, and I feed all the different values in arr, each value will return me a different score value, and I find which arr value provides the lowest score.
If X is an array of 3 values, I feed a combination of all the different possible values in arr, with each different set of 3 values returning a different score and finding the lowest score.
This is simple enough to do if my arr is small. However, if I have an array of 50 or even 100 values, how can I create an algorithm that would provide the lowest score based on the number of input values?
tl;dr: If you don't know anything about score, then you can't speed it up.
In order to optimize score itself, you would have to know how it works. After all, "optimizing" simply means "does the same thing more efficiently", but how can you know if it really does "the same thing" if you don't know what "the same thing" is? Plus, speeding up score will not help you with the combinatorial explosion anyway. The number of combinations grows so fast that any speedups to score will be quickly eaten up by slightly larger inputs.
In order to optimize how you apply score, you would again need to know something about it. If you knew something about score, you could, for example, only generate combinations that you know will yield different values, or combinations that you know will only yield larger values. In other words, you could exploit some structure in the output of score in order to reduce the input size. However, we don't know the structure of the output of score, in fact, we don't even know if there is some structure at all! So we can't exploit it. Plus, there would have to be some extreme redundancy and regularity in the structure, in order for a significant reduction in input size.
In his comment, @ndn suggested applying some form of machine learning to discover structure in the output. How well this works depends on what kind of structure the output has. And of course, this again assumes that there even is some structure to discover, which we don't know. And again, even if there were some structure, it would have to be very redundant and regular to make up for the combinatorial explosion of the input space.
Really, brute force is the only way. Our last straw is going to be parallelization. Maybe, if we distribute the problem across enough CPU cores, we can tackle it? Unfortunately, the combinatorial explosion in the input space is still really going to hurt you:
If we assume that we have a 10THz CPU (i.e. a thousand times faster than the fastest currently available CPU), and we assume that we can compute score in a single clock cycle, and we assume that we have a computer with 10 million cores (again, that's a thousand times larger than the largest supercomputers), it's still going to take over 400 years to find the optimal selection for an input array as small as 100 numbers. And even if we make our CPU a billion times faster and the computer a billion times bigger, simply doubling the size of the array to 200 items will increase the runtime to 500 trillion years.
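The arithmetic behind these figures can be checked with a few lines. This sketch assumes every subset of the array counts as one candidate selection (2^n in total) and that each score call takes one clock cycle, as stated above; the variable names are made up for illustration.

```matlab
clockHz        = 10e12;                 % assumed 10 THz CPU
coreCount      = 10e6;                  % assumed 10 million cores
opsPerSecond   = clockHz * coreCount;   % one score() evaluation per cycle per core
secondsPerYear = 365.25 * 24 * 3600;

% 100 items: 2^100 candidate selections.
years100 = 2^100 / opsPerSecond / secondsPerYear;

% 200 items, with a CPU a billion times faster and a billion times more cores.
years200 = 2^200 / (opsPerSecond * 1e9 * 1e9) / secondsPerYear;

fprintf('100 items: about %.0f years\n', years100);
fprintf('200 items: about %.2g years\n', years200);
```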
There is a reason why we call combinatorial explosion "combinatorial explosion", after all.
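For small inputs and a fixed selection size, the brute-force search itself is short. In this sketch the real score() is a black box, so the anonymous function below is only a hypothetical stand-in, and k is an assumed selection size.

```matlab
arr = [8 10 4 5 3 7 6 0 1 9 13 2];
k   = 3;                                  % assumed number of values to select

% Hypothetical stand-in for the black-box score(); replace with the real one.
score = @(X) abs(sum(X) - 20);

combos  = nchoosek(arr, k);               % every k-element selection, one per row
nCombos = size(combos, 1);
scores  = zeros(nCombos, 1);
for i = 1:nCombos
    scores(i) = score(combos(i, :));      % one black-box call per selection
end

[bestScore, bestIdx] = min(scores);
bestX = combos(bestIdx, :);               % selection with the lowest score
```

With 12 values and k = 3 this makes 220 calls. nchoosek materializes every combination in memory, so this only stays practical while the number of combinations is small.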
I have the following structure:
T = struct('Time',{20, 40, 50, 80, 120, 150, 190, 210, 250, 260, 270, 320, 350, 380, 385, 390, 395},...
'Trial',{'correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','incorrect','incorrect'});
I would like to perform the following two tasks:
I want to get the probability of having an 'incorrect' per each 100 ms time window (interval).
For example, for the first time window, the first 100 ms, there are 4 trials and 2 of them are 'incorrect', so it would be 2/4 = 0.5
I want to plot a bar graph of the probabilities for each 100 ms time window. The x axis would be time and each bar's width would be 100 ms and its height is the probability for that specific window.
I really appreciate any help.
This goes against my policy of not answering questions where the poser has shown no effort, but this seems like an interesting question, so I'll make an exception.
First, split up each of the Time and Trial fields so that they're in separate arrays. For the Trial fields, I'm going to convert them into labels 1 and 2 to denote correct and incorrect for ease of implementation:
time = [T.Time].';
trial = {T.Trial}.';
[~,~,trial_ID] = unique(trial);
Next, what you can do is take each entry in the time array and divide by 100 while taking the floor. Values that belong to the same ID mean that they belong to a group of 100 ms. Note that we also need to add 1 for the next step... you'll see why:
groups = floor(time/100) + 1;
Now, here's probably one of the most beautiful functions you can ever use in MATLAB: accumarray. accumarray groups portions of an array based on an ID and you apply a function to all of the values per group. In our case, we want to group all of the correct and incorrect IDs based on the groups array, then from there we determine the total fraction of values that are incorrect per group.
Specifically, what we're going to do is for each group of values specified in groups, we will take a look at the correct and incorrect numeric labels and determine how many were incorrect by summing over how many were equal to 2 for each group, then dividing by how many there were per group. The groups need to start at index 1, which is why we had to add 1 to groups. Without it, the first group would actually start at 0, and MATLAB starts indexing at 1, hence the offset:
per = accumarray(groups, trial_ID, [], @(x) sum(x == 2) / numel(x));
per contains the fraction that were incorrect per group, and we get:
>> per
per =
0.5000
0.3333
0.2500
0.6667
Very nice! Doing a quick hand calculation will demonstrate that you get the correct results.
Now the last part is to plot the probabilities on a bar graph. That's very simple:
bar(100*(1:numel(per)), per);
xlabel('Time (ms)');
ylabel('Probability');
I create a vector that starts from 100 and goes up in multiples of 100 up until as many groups as we have. In our case, we have 4 as the time goes up to 395 ms.
As such, we get a bar graph with one bar per 100 ms window (figure not included here).
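The question asked for each bar to be 100 ms wide. One possible variation, sketched here as an assumption rather than as part of the answer above, is to centre each bar on the midpoint of its window and pass a relative width of 1 to bar so that adjacent bars touch. The per values are copied from the output shown earlier.

```matlab
per      = [0.5; 1/3; 0.25; 2/3];                         % probabilities from the accumarray step
binWidth = 100;                                           % window length in ms
centers  = binWidth * (1:numel(per)) - binWidth / 2;      % midpoints: 50, 150, 250, 350

bar(centers, per, 1);                  % relative width 1: each bar fills its whole window
xlim([0, binWidth * numel(per)]);
xlabel('Time (ms)');
ylabel('Probability of ''incorrect''');
```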
I'm looking for a hint towards a solution of the problem:
Suppose there's an array with some numbers in ascending order and some in descending, for example [1,2,5,9,6,3,2,4,7,8] has sequences asc [1,2,5,9], desc [(9),6,3,2], asc [(2),4,7,8].
Now this isn't a problem, I could simply loop through an array and add them to some data structure, and when the direction changes, I store this structure somewhere and start filling the next one.
What I've found tricky is if I want to have threshold of some sort. For example: [0,50,100,99,98,97,105,160]
So the sequence in descending order [(100), 99, 98, 97] could be neglected, because overall change is -3, whereas the sequence was increasing much more dramatically (+100) and as a result, the algorithm identifies only one sequence in ascending order.
I have tried the same method as above, simply adding all sequences to a data structure and then comparing the change in values of two consecutive items: (100 vs -3 means -3 can be neglected). But then the problem is if I have, say, this situation:
(example only shows the change in values from start to end of each sequence)
[+100, -3, +1, -50]
in this situation I cannot neglect descending movement, because the numbers start to descend, then slightly ascend and again go down pretty significantly.
and it gets really confusing with stuff like that:
[+100, -3, +3, -3, +3, -50]
This is a quick sketch of what I am trying to achieve:
black lines represent the initial data in an array, red thin lines are the desired resulting output
Could somebody point me in the right direction? How would I approach this situation? Compare multiple sequences at a time, slowly combining sequences together? Maybe I would need to go through the sequences multiple times?
I'm not sure if I've come across a problem like this before, and I don't know of a working algorithm. This is a problem I've faced myself while trying to analyse some data.
If I understand correctly, you expect your curve to be a succession of alternatively increasing and decreasing sequences, with a bit of added noise.
The usual way to get rid of noise is to filter data. There are millions of ways to do that, most of them requiring frequency analysis, but in your case you could probably get good enough results with something simple.
The main point is that the relevant variable is not the values in the array, but their variations.
Given N values, consider the array of N-1 elements holding the differences between two consecutive values.
[0,50,100,99,98,97,105,160] -> 50,50,-1,-1,-1,8,55
Now eliminate all values whose absolute value is below a given threshold (say 10 for instance)
-> 50,50,0,0,0,0,55
you can then detect a rising sequence by looking at streaks of all positive or null values (and the same for decreasing sequences, considering zero or negative values).
As for all filtering processes, you will have to find a sweet spot for your threshold. Too low and it will fail to eliminate insignificant variations, too high and it will wipe out significant slope inversions.
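A minimal sketch of the difference-and-threshold idea, assuming a threshold of 10 and assuming that a suppressed (zero) step simply continues the direction of the last significant step. Variable names are made up for illustration.

```matlab
v         = [0 50 100 99 98 97 105 160];
threshold = 10;                          % assumed; needs tuning for real data

d = diff(v);                             % N-1 consecutive differences
d(abs(d) < threshold) = 0;               % wipe out insignificant variations

s = sign(d);                             % +1 rising, -1 falling, 0 neutral
for k = 2:numel(s)                       % neutral steps inherit the previous direction
    if s(k) == 0
        s(k) = s(k-1);
    end
end
firstSig = find(s ~= 0, 1);              % leading neutral steps take the first real direction
if ~isempty(firstSig)
    s(1:firstSig-1) = s(firstSig);
end

% A new run starts wherever the direction flips.
stepStart = [1, find(diff(s) ~= 0) + 1];
stepEnd   = [stepStart(2:end) - 1, numel(s)];
runStart  = stepStart;                   % index into v where each run begins
runEnd    = stepEnd + 1;                 % index into v where each run ends
runDir    = s(stepStart);                % +1 ascending, -1 descending
```

For this input the result is a single ascending run covering indices 1 to 8. Since each step is judged on its own, a long drift made of many small steps would be suppressed as well, which is one reason the threshold needs tuning.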
I don't know if I understand your problem correctly, but I had to do this kind of dimensionality reduction many times before, so I wrote a small javascript library to do so. It uses the Perceptually Important Points algorithm.
In the algorithm you can define a custom metric of the distance between three consecutive points (to measure how much a single point adds in entropy).
Here is a demonstration (in JS). It works kind of like a heap, where you remove points that do not contribute much to the overall entropy:
for (var i = 0; i < data.length; i++)
    heap.add(data[i]);
while (heap.minValue() < threshold)
    heap.removeMin();
And here is the library.
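The removal loop can be sketched without the library. This is not the library's code: it assumes the importance of a point is its vertical distance from the straight line joining its two current neighbours, and it rescans all points on every pass instead of using a heap. The threshold of 10 is also an assumption.

```matlab
y         = [0 50 100 99 98 97 105 160];
x         = 1:numel(y);
threshold = 10;                          % assumed importance cut-off

keep = 1:numel(y);                       % indices of points still retained
while numel(keep) > 2
    xi = x(keep);
    yi = y(keep);
    % Height of the chord between each interior point's neighbours, at that point's x.
    xl = xi(1:end-2);  xr = xi(3:end);
    yl = yi(1:end-2);  yr = yi(3:end);
    yChord = yl + (yr - yl) .* (xi(2:end-1) - xl) ./ (xr - xl);
    dist   = abs(yi(2:end-1) - yChord);  % importance of each interior point
    [minDist, idx] = min(dist);
    if minDist >= threshold
        break;                           % every remaining point is important enough
    end
    keep(idx + 1) = [];                  % drop the least important interior point
end
reduced = y(keep);
```

On this input the small dip through 99, 98, 97 is removed and the retained values are 0, 100, 105, 160, which form one ascending sequence.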
All the references to this error I could find searching online were completely inapplicable to my situation, they were dealing with some kind of variables involving dots, like a.b (structures in other words), whereas I am strictly using arrays. Nothing involves a dot, nor does my code ask about it.
Ok, I have this GINORMOUS array called tier2comparatorconnectionpoints. It is a 4-D array of size 400×10×20×10. Consider tier2comparatorconnectionpoints(counter,counter2,counter3,counter4).
counter is a number 1 to 400,
counter2 is a number 1 to numchromosomes(counter), and numchromosomes(counter) is bounded to 10,
counter3 is a number 1 to tier2numcomparators(counter,counter2), which is in turn bounded to 20.
counter4 is a number 1 to tier2inputspercomparator(counter,counter2,counter3), which is bounded to 10.
Now, so that I don't run out of RAM, I have tier2comparatorconnectionpoints as type int8, and UNFORTUNATELY at some point in my horrendous amount of code, I forgot to cast it to a double when I'm doing math with it, and a rounding error involved with multiplying it with a rand ends up with tier2comparatorconnectionpoints for some values of its 4 inputs exceeding what it's allowed to be.
The values it's allowed to have fall into three ranges:
1 through tier1numcomparators(counter,counter2), which is bounded to 40,
41 through 40+tier2numcomparators(counter,counter2), with tier2numcomparators(counter,counter2) being bounded to 20,
61 through 60+tier2numcomparators(counter,counter2).
So it's never allowed to be more than 80, it's not allowed to be more than tier1numcomparators(counter,counter2) but less than 40, and it's not allowed to be more than 40+tier2numcomparators(counter,counter2) but less than 60. I became aware of the problem because it was being set to 81 somewhere.
This is an evolutionary simulation by the way, it's natural selection on simulated organisms. I need to hunt down the part of the code that is allowing the values of tier2comparatorconnectionpoints to exceed what it's allowed to be. But that is a separate problem.
A temporary fix of my data, just so that it at least is made to conform to its allowed values, is to set anything that is greater than tier1numcomparators(counter,counter2) but less than 40 to tier1numcomparators(counter,counter2), to set anything that is greater than 40+tier2numcomparators(counter,counter2) but less than 60 to 40+tier2numcomparators(counter,counter2), and to set anything that is greater than 60+tier2numcomparators(counter,counter2) to 60+tier2numcomparators(counter,counter2). I first found this problem because it was being set to 81, so it didn't just exceed 60+tier2numcomparators(counter,counter2), it exceeded 60+20, with tier2numcomparators being bounded to 20.
I hope this isn't all too-much-information, but I felt it might be necessary to get you to understand just what sort of variables these are.
So in my attempts to at least turn the data into valid data, I did the following:
for counter=1:size(tier2comparatorconnectionpoints,1)
    for counter2=1:size(tier2comparatorconnectionpoints,2)
        for counter3=1:size(tier2comparatorconnectionpoints,3)
            for counter4=1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)>60+tier2numcomparators(counter,counter2)
                    tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)=60+tier2numcomparators(counter,counter2);
                end
            end
        end
    end
end
And that worked just fine. And then:
for counter=1:size(tier2comparatorconnectionpoints,1)
    for counter2=1:size(tier2comparatorconnectionpoints,2)
        for counter3=1:size(tier2comparatorconnectionpoints,3)
            for counter4=1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)>40+tier2numcomparators(counter,counter2)
                    if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)<60
                        tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)=40+tier2numcomparators(counter,counter2);
                    end
                end
            end
        end
    end
end
And that's where it said "Attempt to reference field of non-structure array".
TBH it sounds like maybe you've made a typo and put a . in somewhere? Otherwise please post the entire error as maybe it's happening in a different function or something.
Either way you don't need all those for loops, it's simpler and usually quicker to do this (and should bypass your error):
First replicate your tier2numcomparators matrix so that it has the same dimension sizes as tier2comparatorconnectionpoints
T = repmat(tier2numcomparators + 40, 1, 1, size(tier2comparatorconnectionpoints, 3), size(tier2comparatorconnectionpoints, 4));
Now in one shot you can create a logical matrix of which elements meet your criteria:
ind = tier2comparatorconnectionpoints > T & tier2comparatorconnectionpoints < 60;
Finally employ logical indexing to set your desired elements:
tier2comparatorconnectionpoints(ind) = T(ind);
You can play around with bsxfun instead of repmat if this is slow or takes too much memory.
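Putting the three clamping rules from the question together, a vectorized version might look like the sketch below. It uses tiny synthetic arrays (2 x 2 x 1 x 3 instead of 400 x 10 x 20 x 10), the name pts stands in for tier2comparatorconnectionpoints, and the strict inequalities are copied from the question as written, so the boundary values 40 and 60 are left untouched.

```matlab
% Small synthetic stand-ins (assumptions, not the real simulation data).
tier1numcomparators = [10 40; 25 30];
tier2numcomparators = [ 5 20; 12  8];
pts = int8(cat(4, [3 41; 26 61], [15 47; 58 70], [81 55; 30 68]));

% Expand the per-(counter,counter2) limits to the full 4-D size of pts.
reps   = [1 1 size(pts, 3) size(pts, 4)];
limit1 = repmat(tier1numcomparators,      reps);
limit2 = repmat(tier2numcomparators + 40, reps);
limit3 = repmat(tier2numcomparators + 60, reps);

% Rule 1: anything above 60+tier2numcomparators is pulled down to that limit.
mask = double(pts) > limit3;
pts(mask) = limit3(mask);

% Rule 2: above 40+tier2numcomparators but below 60.
mask = double(pts) > limit2 & double(pts) < 60;
pts(mask) = limit2(mask);

% Rule 3: above tier1numcomparators but below 40.
mask = double(pts) > limit1 & double(pts) < 40;
pts(mask) = limit1(mask);
```

Comparing in double keeps the int8 storage out of the comparison, and the assignments convert the limits back to int8 automatically.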