I have a very large dataset array with over a million values that looks like this:
Month Day Year Hour Min Second Line1 Line2 Power Dt
7 8 2013 0 1 54 1.91 4.98 826.8 0
7 8 2013 0 0 9 1.93 3.71 676.8 0
7 8 2013 0 1 15 1.92 5.02 832.8 0
7 8 2013 0 1 21 1.91 5.01 830.4 0
and so on.
When the seconds count reached 60 it started over at 0, which is why the first number is bigger. I need to fill the delta t column (Dt) by taking the current row's seconds value and subtracting the previous row's seconds value, correcting for negative values. I cannot perform this operation in a loop, as it would take ages to complete; it needs to be done in a simple, one-shot, vector subtraction operation.
You can try the diff command to generate such results. It's very fast and should work without any for loop.
HTH
Dt=diff(datenum(A(:,1:6)))*60*60*24;
This gives the delta in seconds, but I'm not sure what you want your correction for negative differences to be. Could you give an example of the expected output?
Note that Dt will be one entry shorter than A, so you may have to pad it.
You can remove the negative values (I think) with the command
Dt(Dt<0)=Dt(Dt<0)+60;
If you need to pad the Dt vector so that it is the same length as the data set, try
Dt=[Dt;0];
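The wrap-around correction can also be written as a single modulo step. The following is a minimal sketch in Python rather than MATLAB; the function name delta_seconds is made up for illustration, and it assumes consecutive readings are never more than 60 seconds apart. Unlike the padding shown above, it puts the 0 in the first position, since the first row has no previous row.

```python
def delta_seconds(seconds):
    """Differences between consecutive second readings, wrapping at 60."""
    # (cur - prev) % 60 turns a negative difference such as 9 - 54 = -45
    # into 15, which is the elapsed time across the minute boundary.
    diffs = [(cur - prev) % 60 for prev, cur in zip(seconds, seconds[1:])]
    # Pad with a leading 0 so the result has one entry per input row.
    return [0] + diffs


print(delta_seconds([54, 9, 15, 21]))  # [0, 15, 6, 6]
```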
I want to apply feature selection on a dataset (lung.mat)
After loading the data, I computed the mean distance between each feature and the others using the Jaccard measure. Then I sorted the distances in descending order into B1. Then I selected, for example, the top 25 features and saved the matrix in databs1.
I want to select the features that have distance values greater than the mean of the array (B1).
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
for i=1:m-1
    for j=i+1:m
        t1(i,j)=fjaccard(data(:,i),data(:,j));
        b1=sum(t1)/(m-1);
    end
end
[B1,indB1]=sort(b1,'descend');
databs1=data(:,indB1(1:25));
databs1=[databs1,data(:,m)]; %jaccard
save('databs1.mat');
I'd be grateful for your opinions on how to do this with B1: selecting the values of B1 that are greater than the mean of the array B1, that is, cutting out the values smaller than the mean of B1.
I used this line,
B1(B1>mean(B1(:)))
After running it, B1 still has as many features (columns) as the full dataset. For example, lung.mat has 57 features, and after this line B1 still has 57 columns.
I expected this line to cut B1 down to the features whose values are greater than the mean of B1.
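The general answer to your question is here (this seems clear to you based on your code). Note that the indexing expression returns a new array and leaves the original unchanged, so to shorten B1 itself you have to assign the result back, e.g. B1 = B1(B1>mean(B1(:)));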
a=randi(10,1,10) %example data
a>mean(a) %get binary matrix of which elements are larger than mean
a(a>mean(a)) %select elements from a that are larger than mean
a =
1 9 10 7 8 8 4 7 2 8
ans =
1×10 logical array
0 1 1 1 1 1 0 1 0 1
ans =
9 10 7 8 8 7 8
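For comparison, here is the same selection as a minimal Python sketch (not MATLAB; the helper name above_mean is hypothetical). It uses the example values printed above and shows the step that matters for the question: the filtered result only replaces the original once it is assigned back.

```python
from statistics import mean


def above_mean(values):
    """Return the elements that are strictly greater than the mean."""
    threshold = mean(values)
    return [v for v in values if v > threshold]


b1 = [1, 9, 10, 7, 8, 8, 4, 7, 2, 8]
above_mean(b1)       # result is discarded, b1 still has 10 elements
b1 = above_mean(b1)  # assigning back is what actually shortens b1
print(b1)            # [9, 10, 7, 8, 8, 7, 8]
```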
I have a sorted array. Let's assume
{4,7,9,12,23,34,56,78}. Given min and max, I want to find the elements in the array between min and max in an efficient way.
Cases:
min=23, max=78, output: {23,34,56,78}
min=10, max=65, output: {12,23,34,56}
min=0, max=100, output: {4,7,9,12,23,34,56,78}
min=30, max=300, output: {34,56,78}
min=100, max=300, output: {} //empty
Is there an efficient way to do this? I am not asking for code, just any algorithm I can use here, like DP or exponential search.
Since it's sorted, you can easily find the lowest element greater than or equal to the minimum desired, by using a binary search over the entire array.
A binary search basically reduces the search space by half with each iteration. Given your example with a minimum of 10, you start as follows with the midpoint on the 12:
 0   1   2   3   4   5   6   7    <- index
 4   7   9  12  23  34  56  78
            ^^
Since the element you're looking at is higher than 10 and the one just below it (9) is lower, you've found it.
Then, you can use a similar binary search but only over that section from the element you just found to the end. This time you're looking for the highest element less than or equal to the maximum desired.
On the same example as previously mentioned, you start with:
 3   4   5   6   7    <- index
12  23  34  56  78
        ^^
Since that's less than 65 and the following one is also, you need to increase the pointer to the halfway point of 34..78:
 3   4   5   6   7    <- index
12  23  34  56  78
            ^^
And there you have it, because that number is less than 65 and the following number is more.
Then you have the start and stop indexes (3 and 6) for extracting the values.
 0   1   2   3   4   5   6   7    <- index
 4   7   9 ((12  23  34  56)) 78
             --------------
The time complexity of the algorithm is O(log N). Though keep in mind that this really only becomes important when dealing with larger data sets. If your data sets do consist of only about eight elements, you may as well use a linear search since (1) it'll be easier to write; and (2) the time differential will be irrelevant.
I tend not to worry about time complexity unless the operations are really expensive, the data set size gets into the thousands, or I'm having to do it thousands of times a second.
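The two binary searches described above can be sketched in Python with the standard bisect module. This is only an illustration under the assumption that the input is already sorted in ascending order; the function name values_in_range is made up.

```python
from bisect import bisect_left, bisect_right


def values_in_range(sorted_values, lo, hi):
    """Return the elements x of a sorted list with lo <= x <= hi."""
    # First search: index of the lowest element >= lo.
    start = bisect_left(sorted_values, lo)
    # Second search: only over the part from start to the end,
    # giving the index just past the highest element <= hi.
    stop = bisect_right(sorted_values, hi, start)
    return sorted_values[start:stop]


data = [4, 7, 9, 12, 23, 34, 56, 78]
print(values_in_range(data, 10, 65))    # [12, 23, 34, 56]
print(values_in_range(data, 100, 300))  # []
```

Both searches are logarithmic; copying out the slice still costs time proportional to the number of elements returned.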
Since it is sorted, this should do:
List<Integer> subarray = new ArrayList<Integer>();
for (int n : numbers) {
    if (n >= MIN && n <= MAX) subarray.add(n);
}
It's O(n) as you only look at every number once.
I have a MATLAB/Octave for loop which gives me Inf error messages along with incorrect data.
I'm trying to get 240, 120, 60, 30, 15, ... where every number is divided by two, and then that result is also divided by two,
but the code below gives me the wrong value when the number hits 30 and 5, and at a couple of other points it doesn't divide by two.
ang=240;
for aa=2:2:10
    ang=[ang;ang/aa];
end
240
120
60
30
40
20
10
5
30
15
7.5
3.75
5
2.5
1.25
0.625
24
12
6
3
4
2
1
0.5
3
1.5
0.75
0.375
0.5
0.25
0.125
0.0625
PS: I will be accessing these values from different arrays, that's why I used a for loop so I can access the values using their indexes
In addition to the divide-by-zero error you were starting with (fixed in the edit), the approach you're taking isn't actually doing what you think it is. If you print out each step, you'll see why.
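To make that step-by-step view concrete, here is a small Python re-creation of the loop from the question (a sketch, not Octave; the function name trace_loop is made up). On every pass the whole vector so far is divided by aa and appended, so the length doubles each time and the divisor is 2, 4, 6, 8, 10 rather than always 2.

```python
def trace_loop(start):
    """Mimic ang=[ang;ang/aa] for aa=2:2:10 and print each step."""
    ang = [float(start)]
    for aa in range(2, 11, 2):  # aa = 2, 4, 6, 8, 10
        # Every existing element is divided by aa and appended.
        ang = ang + [x / aa for x in ang]
        print(f"aa={aa}: {len(ang)} values, first eight: {ang[:8]}")
    return ang


result = trace_loop(240)
```

The third pass (aa=6) is where 40, 20, 10 and 5 come from: they are 240, 120, 60 and 30 each divided by 6.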
Instead of that approach, I suggest taking more of a "matlab way": avoid the loop by making use of vectorized operations.
orig = 240;
divisor = 2.^(0:5); % vector of 2 to the power of [0 1 2 3 4 5]
ans = orig./divisor;
output:
ans = [240 120 60 30 15 7.5]
Try the following:
ang=240;
for aa=1:5
    % sz=size(ang,1);
    % ang=[ang;ang(sz)/2];
    ang=[ang;ang(end)/2];
end
You should be getting warning: division by zero if you're running it in Octave. That says pretty much everything.
When you divide by zero, you get Inf. Because of your recursion... you see the problem.
You can simultaneously generalise and vectorise by using logic:
ang=240; %Replace 240 with any positive integer you like
ang=ang*2.^-(0:log2(ang));
ang=ang(1:sum(ang==floor(ang)));
This will work for any positive integer (to make it work for negatives as well, replace log2(ang) with log2(abs(ang))). It produces the halving sequence down to the first odd value, where the vector ends. It's also faster than jitendra's solution:
octave:26> tic; for i=1:100000 ang=240; ang=ang*2.^-(0:log2(ang)); ang=ang(1:sum(ang==floor(ang))); end; toc;
Elapsed time is 3.308 seconds.
octave:27> tic; for i=1:100000 ang=240; for aa=1:5 ang=[ang;ang(end)/2]; end; end; toc;
Elapsed time is 5.818 seconds.
I am having difficulties with making an array formula work the way I want it to work.
Out of an unsorted column of dates, I want it to extract values into a new column. The formula below identifies the required cells for a given month and year, but they appear in their original rows rather than at the top of the output range. Moreover, I want all ""/FALSE cells to be excluded from the output array.
=IF((MONTH($I$15:$I$1346)=1)*(YEAR($I$15:$I$1346)=2008),$I$15:$I$1346,"")
In fact, $I$15:$I$1346 should be dynamic and extend to the last filled cell (I could make a named range for that).
Part two is to expand on that formula so that it calculates on the data that sits at a two-column offset from the data described above.
Is the above possible to build into one cell probably with a combination of IF, INDEX, SMALL and maybe others?
I'm not looking for a filter solution. Hope the above is clear enough and that you can help!
Here's a shortened sample layout:
A B C
1 Date Series_A Series_B
2 03/01/2011 45 20
3 04/01/2011 73 30
4 06/01/2011 95 40
5 08/01/2011 72 50
6 06/02/2011 5 13
7 09/02/2011 12 #N/A
8 05/02/2011 23 65
9 07/03/2011 12 65
Then I want three input cells for the year, the month and the series name (index/match, as there are many more columns with data). If they were 2011, Feb and Series_A, I want it to calculate the average for that month. In this case it would be (5+12+23)/3. If they were Feb-2011 and Series_B instead, which has an error, it should show (13+65)/2 rather than an error.
Aside from that, I want a separate formula which will output an array with the data, without 'holes' in between and with the right 'length'. Example for Feb-2011 in Column C:
   A           B         C               D
1  Date        Series_A  Desired Output  Output based on formula above
2  03/01/2011  45        5
3  04/01/2011  73        12
4  06/01/2011  95        23
5  08/01/2011  72
6  06/02/2011  5                         5
7  09/02/2011  12                        12
8  05/02/2011  23                        23
9  07/03/2011  12
If I then run a =ISBLANK(C5) it should be true, rather than =""=C5
Hope the edit clarifies
I reached out to various platforms to get an answer, and here is one which is OK. It still doesn't fully answer part 1, but it works nonetheless.
http://www.excelforum.com/excel-formulas-and-functions/905356-exclude-blank-false-cells-in-in-excel-array-if-formula-output.html
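Outside Excel, the behaviour asked for in the question (pick one month of one series, drop the error cells, return the values without holes, then average them) can be sketched in Python as below. This is an illustration only, not the formula from the linked thread; the function names and the dictionary layout are assumptions, and the dates are read as dd/mm/yyyy as in the sample layout.

```python
from datetime import datetime

# Sample layout from the question; "#N/A" stands for the error cell.
ROWS = [
    {"Date": "03/01/2011", "Series_A": 45, "Series_B": 20},
    {"Date": "04/01/2011", "Series_A": 73, "Series_B": 30},
    {"Date": "06/01/2011", "Series_A": 95, "Series_B": 40},
    {"Date": "08/01/2011", "Series_A": 72, "Series_B": 50},
    {"Date": "06/02/2011", "Series_A": 5, "Series_B": 13},
    {"Date": "09/02/2011", "Series_A": 12, "Series_B": "#N/A"},
    {"Date": "05/02/2011", "Series_A": 23, "Series_B": 65},
    {"Date": "07/03/2011", "Series_A": 12, "Series_B": 65},
]


def monthly_values(rows, year, month, series):
    """Values of one series for one month, packed with no gaps."""
    picked = []
    for row in rows:
        date = datetime.strptime(row["Date"], "%d/%m/%Y")
        value = row[series]
        # Skip rows from other months and non-numeric (error) cells.
        if (date.year, date.month) == (year, month) and isinstance(value, (int, float)):
            picked.append(value)
    return picked


def monthly_average(rows, year, month, series):
    """Average of the packed values, or None when nothing matches."""
    values = monthly_values(rows, year, month, series)
    return sum(values) / len(values) if values else None


print(monthly_values(ROWS, 2011, 2, "Series_A"))   # [5, 12, 23]
print(monthly_average(ROWS, 2011, 2, "Series_B"))  # 39.0
```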
I have data in column A, and would like to put the averages in column B like this:
a b
1 10 10
2 7 8.5
3 8 8.333
4 19 11
5 13 11.4
where b1 =average(a1), b2 =average(a1:a2), b3 =average(a1:a3)....
Using average() is alright for small amounts of data, but I have over 1500 data entries. I would like to find a more efficient way of doing this.
Make your initial range reference absolute, while the other is relative, i.e.:
b4 = average($a$1:a4)
You can paste that 1500 times and it will always increment the end of the range while keeping the beginning pinned to A1 due to the dollar signs in that reference.
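If re-summing the growing range in every row is the concern, the same column can be computed in one pass from a running total. A minimal Python sketch of that idea (not an Excel formula; the function name running_average is made up), using the five values from the question:

```python
from itertools import accumulate


def running_average(values):
    """Average of values[0..i] for every i, using one running total."""
    # accumulate yields the cumulative sums; dividing by the 1-based
    # position gives the average of everything seen so far.
    return [total / n for n, total in enumerate(accumulate(values), start=1)]


averages = running_average([10, 7, 8, 19, 13])
print([round(a, 3) for a in averages])  # [10.0, 8.5, 8.333, 11.0, 11.4]
```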