Rounding problems when creating date vectors - arrays

I want to create a vector containing dates in MATLAB. For that I specified the start time and the stop time:
WHM01_start = datenum('01-JAN-2005 00:00')
WHM01_stop = datenum('01-SEP-2014 00:00')
and then I created the vector with
WHM01_timevec = WHM01_start:datenum('01-JAN-2014 00:20') - datenum('01-JAN-2014 00:00'):WHM01_stop;
because I want time steps of 20 minutes each. Unfortunately, I get a rounding error after some thousands of values, which gives
>> datestr(WHM01_timevec(254160))
ans =
31-Aug-2014 23:39:59
and not as expected, 31-Aug-2014 23:40:00
How can I correct these incorrect values?
Edit: I also saw this thread, but the approach there gives me a date vector per date, and not a single number as desired.

You can give year, month, day, ... in numeric format to the function datenum. Datenum accepts vectors for one or several of its arguments, and if the numbers are too big (for example, 120 minutes), datenum knows what to do with it.
So by supplying the minutes vector in 20-minute increments, you can avoid rounding errors (at least on a 1-second level):
WHM01_start = datenum('01-JAN-2005 00:00');
WHM01_stop = datenum('01-SEP-2014 00:00');
time_diff = WHM01_stop - WHM01_start;
WHM01_timevec = datenum(2005,01,01,00,[00:20:time_diff*24*60],00);
datestr(WHM01_timevec(254160))
To answer your comment:
The reason you saw rounding errors was that you used the difference of two big numbers for your time-increments. The difference of large numbers has a (relatively) large rounding error.
Matlab time is counted in days since the (fictional) date 0.0.0000. Your time-increment is 1/3 hour, or 1/(24*3) days. Modifying your original code so that it reads
WHM01_timevec = WHM01_start:1/(24*3):WHM01_stop;
is an alternative way to reduce the rounding error, but for absurdly large time spans the first solution is the more robust approach.
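The size of the effect can be checked numerically. The sketch below uses illustrative variable names that are not from the original post. It compares the step obtained by subtracting two large date numbers with the exact value 1/72, and then builds the vector from integer minute offsets, so that the only rounding happens in the final division and addition.

```matlab
% Step obtained as the difference of two large serial date numbers
stepFromDifference = datenum(2014,1,1,0,20,0) - datenum(2014,1,1,0,0,0);
stepError = abs(stepFromDifference - 1/72);   % tiny, but generally not zero

% Build the vector from integer minute offsets instead (illustrative names)
WHM01_start = datenum(2005,1,1,0,0,0);
WHM01_stop  = datenum(2014,9,1,0,0,0);
totalMinutes  = round((WHM01_stop - WHM01_start) * 24 * 60);
minuteOffsets = 0:20:totalMinutes;            % exact integers
WHM01_timevec = WHM01_start + minuteOffsets / (24*60);

% Deviation of element 254160 from 31-Aug-2014 23:40:00, in days
deviation = abs(WHM01_timevec(254160) - datenum(2014,8,31,23,40,0));
```

Multiplying stepError by the number of steps gives a rough estimate of how far the last elements of the original vector can drift.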

Related answer: use linspace instead of the colon operator :.
%// given
WHM01_start = datenum('01-JAN-2005 00:00')
WHM01_stop = datenum('01-SEP-2014 00:00')
%// number of elements
n = numel(WHM01_start: datenum('01-JAN-2014 00:20') - ...
datenum('01-JAN-2014 00:00') : WHM01_stop);
%// creating vector using linspace
WHM01_timevec = linspace(WHM01_start, WHM01_stop, n);
%// proof
datestr(WHM01_timevec(254160))
ans =
31-Aug-2014 23:40:00
Drawback of this solution: to determine the number of elements of the output vector I use the original vector created with :, which is not the best option probably.
Important quote from the linked answer:
Using linspace can reduce the probability of occurrence of these issues, but it is not a guarantee.
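The drawback mentioned above (building the colon vector just to count its elements) can be avoided by computing the count arithmetically. This is a minimal sketch, assuming the time span is a whole number of 20-minute steps; the name stepsPerDay is illustrative, and the count is rounded to absorb floating-point noise.

```matlab
WHM01_start = datenum(2005,1,1,0,0,0);
WHM01_stop  = datenum(2014,9,1,0,0,0);
stepsPerDay = 24 * 3;                          % 20-minute steps per day
% Number of grid points, including both end points
n = round((WHM01_stop - WHM01_start) * stepsPerDay) + 1;
WHM01_timevec = linspace(WHM01_start, WHM01_stop, n);
```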

Related

Find specific date in a date array

I am working with a datetime array s constructed as follows:
ds = datetime(2010,01,01,'TimeZone','Europe/Berlin');
de = datetime(2030,01,01,'TimeZone','Europe/Berlin');
s = ds:hours(1):de;
I am using ismember function to find the first occurrence of a specific date in that array.
ind = ismember(s,specificDate);
startPlace = find(ind,1);
The two lines from above are called many times in my application and consume quite some time. It is clear to me that Matlab compares ALL dates from s with specificDate, even though I need only the first occurrence of specificDate in s. So to speed up the application it would be good if Matlab would stop comparing specificDate to s once the first match is found.
One solution would be to use a while loop, but with the while loop the application becomes even slower (I tried it).
Any idea how to work around this problem?
I'm not sure what your specific use-case is here, but with the step size between elements of s being one hour, your index is simply going to be the difference in hours between your specific date and the start date, plus one. No need to create or search through s in the first place:
startPlace = hours(specificDate-ds)+1;
And an example to test each solution:
specificDate = datetime(2017, 1, 1, 'TimeZone', 'Europe/Berlin'); % Sample date
ind = ismember(s, specificDate); % Compare to the whole vector
startPlace = find(ind, 1); % Find the index
isequal(startPlace, hours(specificDate-ds)+1) % Check equality of solutions
ans =
logical
1 % Same!
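The index arithmetic assumes that specificDate lies exactly on the hourly grid and inside the range of s. A hedged sketch of a guarded version follows; the helper variable names and the error identifier are made up for illustration.

```matlab
ds = datetime(2010,1,1,'TimeZone','Europe/Berlin');
de = datetime(2030,1,1,'TimeZone','Europe/Berlin');
specificDate = datetime(2017,1,1,'TimeZone','Europe/Berlin');  % sample date

offsetHours = hours(specificDate - ds);       % elapsed hours as a double
numElements = floor(hours(de - ds)) + 1;      % length of ds:hours(1):de
startPlace  = round(offsetHours) + 1;

% Reject dates that fall between grid points or outside the vector
onGrid  = abs(offsetHours - round(offsetHours)) < 1e-9;
inRange = startPlace >= 1 && startPlace <= numElements;
if ~(onGrid && inRange)
    error('demo:dateNotInVector', 'specificDate is not an element of s.');
end
```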
To save some time, you can convert the datetime array to datenum values, so that ismember compares plain numbers rather than datetime objects, which can speed up processing considerably:
s_new = datenum(s);
ind = ismember(s_new,datenum(specificDate));
startPlace = find(ind,1);

Matlab: average each element in 2D array based on neighbors [duplicate]

I've written code to smooth an image using a 3x3 averaging filter, however the output is strange, it is almost all black. Here's my code.
function [filtered_img] = average_filter(noisy_img)
[m,n] = size(noisy_img);
filtered_img = zeros(m,n);
for i = 1:m-2
for j = 1:n-2
sum = 0;
for k = i:i+2
for l = j:j+2
sum = sum+noisy_img(k,l);
end
end
filtered_img(i+1,j+1) = sum/9.0;
end
end
end
I call the function as follows:
img=imread('img.bmp');
filtered = average_filter(img);
imshow(uint8(filtered));
I can't see anything wrong in the code logic so far, I'd appreciate it if someone can spot the problem.
Assuming you're working with grayscale images, you should replace the inner two for loops with:
filtered_img(i+1,j+1) = mean2(noisy_img(i:i+2,j:j+2));
Does it change anything?
EDIT: don't forget to reconvert it to uint8!!
filtered_img = uint8(filtered_img);
Edit 2: the reason your code does not work is that sum saturates at 255, the upper limit of uint8. mean2 avoids this because it computes the mean in double precision.
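The saturation is easy to reproduce in isolation. A small sketch with made-up pixel values:

```matlab
pixels = uint8([200 100 50]);          % hypothetical neighbourhood values

% Integer arithmetic in MATLAB saturates instead of overflowing
saturatedTotal = pixels(1) + pixels(2) + pixels(3);   % stays uint8, capped at 255

% Converting to double first keeps the true total
exactTotal = sum(double(pixels));

saturatedMean = double(saturatedTotal) / 3;   % too small
exactMean     = exactTotal / 3;
```

With a 3x3 window, nine bright pixels capped at 255 and divided by 9 give a mean of about 28, which is why the filtered image looks almost black.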
another option:
f = @(x) mean(x(:));
filtered_img = nlfilter(noisy_img,[3 3],f);
img = imread('img.bmp');
filtered = imfilter(double(img), ones(3) / 9, 'replicate');
imshow(uint8(filtered));
Implement a neighborhood operation (sum of products) between an image and a 3x3 filter; the filter should be an averaging filter.
Then use the same function/code to compute the Laplacian (second-order derivative) and the Prewitt and Sobel operators (first-order derivatives).
Use a simple 10x10 matrix to perform these operations.
I need MATLAB code.
Tangentially to the question:
Especially for a 5x5 or larger window, you can consider averaging first in one direction and then in the other, which saves some operations. So, the point at 3 would be (P1+P2+P3+P4+P5). The point at 4 would be (P2+P3+P4+P5+P6). Divide by 5 in the end. So, the point at 4 could be calculated as P3new + P6 - P1, and likewise for point 5 and so on. Repeat the same procedure in the other direction.
Make sure to divide first, then sum.
I would need to time this, but I believe it could work a bit faster for larger windows. It is sequential per line which might not seem the best, but you have many lines where you can work in parallel, so it shouldn't be a problem.
This first divide, then sum also prevents saturation if you have integers, so you might use the approach even in 3x3 case, as it is less wrong (though slower) to divide twice by 3 than once by 9. But note that you will always underestimate final value with that, so you might as well add a bit of bias (say all values +1 between the steps).
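The running-sum idea can be sketched for a single row as follows. The sample values and variable names are hypothetical, and the sketch works in double precision, so it divides once at the end rather than dividing first.

```matlab
x = [4 8 15 16 23 42 7 9];     % hypothetical sample row (double)
w = 5;                         % window length
nOut = numel(x) - w + 1;       % number of full windows

runSum = zeros(1, nOut);
runSum(1) = sum(x(1:w));       % first window is summed directly
for k = 2:nOut
    % Slide the window: add the entering sample, drop the leaving one
    runSum(k) = runSum(k-1) + x(k+w-1) - x(k-1);
end
runMean = runSum / w;

% Reference result from a direct convolution, for comparison
reference = conv(x, ones(1, w) / w, 'valid');
```

Each output costs two additions regardless of w, whereas the direct sum costs w - 1 additions per output.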
img = imread('cameraman.tif');
nsy_img = imnoise(img, 'salt & pepper', 0.2);
imshow(nsy_img);
h = ones(3,3)/9;
avg = conv2(double(nsy_img), h, 'same'); % conv2 needs floating-point input
figure;
imshow(uint8(avg));

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically fewer than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The Statistics Toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the Statistics Toolbox (I am not certain the user of the function will have access to that toolbox).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i=1:numA
for j=1:numB
differences = abs(A(i,:) - B(j,:)).^r;
differences = differences.*wts;
differences = differences.^(1/r);
D(i,j) = sum(differences,2);
end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But I also know that calls to repmat often slow things down. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing Euclidean distance [link], so I use that method in the Euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and it easily accepts weightings.
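The linked Euclidean method is not reproduced here. One common vectorized formulation is sketched below, assuming the weighted Euclidean distance is defined as sqrt(sum_k w_k*(a_k - b_k)^2), which differs from the per-feature root taken in pairdist1. It expands the square so that the cross term becomes a single matrix product. The sample data are hypothetical.

```matlab
A = [0 0 0; 1 2 3];            % hypothetical points, one per row
B = [1 1 1; 0 2 0; 3 3 3];
wts = [0.1 0.5 0.4];

Aw  = bsxfun(@times, A, wts);                 % weight each feature of A
sqA = sum(Aw .* A, 2);                        % numA-by-1: sum_k w_k*a_k^2
sqB = sum(bsxfun(@times, B, wts) .* B, 2)';   % 1-by-numB: sum_k w_k*b_k^2
D2  = bsxfun(@plus, sqA, sqB) - 2 * (Aw * B'); % squared weighted distances
D   = sqrt(max(D2, 0));                       % clamp tiny negative round-off
```

The expansion trades a little accuracy for speed: when two points nearly coincide, cancellation in D2 can leave small errors, which is why the result is clamped at zero before the square root.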
You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else
error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(@times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 ("cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
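As a quick sanity check, the matrix-multiplication formulation can be compared with a plain double loop on a small example. The sample data and the names D_fast and D_loop are hypothetical.

```matlab
A = [0 0 0; 1 2 3];            % hypothetical points, one per row
B = [1 1 1; 0 2 0; 3 3 3];
wts = [0.1 0.5 0.4];

% Vectorized weighted city-block distance (numA-by-numB)
absm   = abs(bsxfun(@minus, permute(A, [1 3 2]), permute(B, [3 1 2])));
D_fast = reshape(reshape(absm, [], size(A, 2)) * wts(:), size(A, 1), []);

% Reference: explicit loops over all pairs
D_loop = zeros(size(A, 1), size(B, 1));
for i = 1:size(A, 1)
    for j = 1:size(B, 1)
        D_loop(i, j) = sum(wts .* abs(A(i, :) - B(j, :)));
    end
end
maxDifference = max(abs(D_fast(:) - D_loop(:)));
```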

Conditional Sum in Array

I have 2 arrays, A and B. I want to form a new array C with same dimension as B where each element will show SUM(A) for A > B
Below is my working code
A = [1:1:1000]
B=[1:1:100]
for n = 1:numel(B)
C(n) = sum(A(A>B(n)));
end
However, when A has millions of rows and B has thousands, and I have to do similar calculations for 20 pairs of arrays, it takes an insane amount of time.
Is there any faster way?
For example, histcounts is pretty fast, but it counts, rather than summing.
Thanks
Depending on the size of your arrays (and your memory limitations), the following code might be slightly faster:
C = A*bsxfun(@gt,A',B);
Although it is vectorized, it seems to be bottlenecked (perhaps) by memory allocation, since it builds a numel(A)-by-numel(B) logical matrix. I'm looking to see if I can get a further speedup. Depending on your input vector size, I've seen up to a factor of 2 speedup for large vectors.
Here's a method that is a bit quicker, but I'm sure there is a better way to solve this problem.
a=sort(A); %// If A and B are already sorted then this isn't necessary!
b=sort(B);
c(numel(B))=0; %// Initialise c
s=cumsum(a,2,'reverse'); %// Get the partial sums of a
for n=1:numel(B)
%// Pull out the sum for elements in a larger than b(n)
c(n)=s(find(a>b(n),1,'first'));
end
According to some very rough tests, this seems to run a bit better than twice as fast as the original method.
You had the right idea with histcounts, as you are basically "accumulating" certain A elements based on binning. This binning operation could be done with histc. Listed in this post is a solution that starts off with similar steps as listed in @David's answer and then uses histc to bin and sum up selective elements from A to get us the desired output and all of it in a vectorized manner. Here's the implementation -
%// Sort A and B and also get sorted B indices
sA = sort(A);
[sB,sortedB_idx] = sort(B);
[~,bin] = histc(sB,sA); %// Bin sorted B onto sorted A
C_out = zeros(1,numel(B)); %// Setup output array
%// Take care of the case when all elements in A are greater than all elements in B
if sA(1) > sB(end)
C_out(:) = sum(A);
end
%// Only do further processing if at least one element in B falls within the range of A
if any(bin)
csA = cumsum(sA,'reverse'); %// Reverse cumsum on sorted A
%// Get sum(A(A>B(n))) for every n, but for sorted versions
valid_mask = cummax(bin) - bin ==0;
valid_mask2 = bin(valid_mask)+1 <= numel(A);
valid_mask(1:numel(valid_mask2)) = valid_mask2;
C_out(valid_mask) = csA(bin(valid_mask)+1);
%// Rearrange C_out to get back in original unsorted version
[~,idx] = sort(sortedB_idx);
C_out = C_out(idx);
end
Also, please remember when comparing the result from this method with the one from the original for-loop version that there would be slight variations in output. This vectorized solution uses cumsum, which computes a running summation and as such adds large cumulatively summed numbers to individual elements that are comparatively very small, whereas the for-loop version would sum only selective elements. So, floating-point precision issues would come up there.
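The sort-and-accumulate idea can also be written compactly. The following is only a sketch on hypothetical data, not the code from the answers above: it pads the sorted A with -inf and inf as histc edges, so that every B(n) lands in a bin, and it assumes A and B are row vectors without NaN values.

```matlab
A = [5 1 3 3 8];       % hypothetical sample data
B = [3 0 9 4];

sA = sort(A);
% suffixSum(k) = sum(sA(k:end)); the trailing 0 covers "no element is larger"
suffixSum = [fliplr(cumsum(fliplr(sA))), 0];
% With the padded edges, bin - 1 is the number of elements of A that are <= B(n)
[~, bin] = histc(B, [-inf, sA, inf]);
C = suffixSum(bin);    % C(n) = sum(A(A > B(n))), in the original order of B
```

Because B is binned directly, no re-ordering step is needed at the end. The floating-point caveat about cumulative sums applies here as well.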

Randomize matrix elements between two values while keeping row and column sums fixed (MATLAB)

I have a bit of a technical issue, but I feel like it should be possible with MATLAB's powerful toolset.
What I have is a random n by n matrix of 0's and w's, say generated with
A=w*(rand(n,n)<p);
A typical value of w would be 3000, but that should not matter too much.
Now, this matrix has two important quantities, the vectors
c = sum(A,1);
r = sum(A,2)';
These are two row vectors, the first denotes the sum of each column and the second the sum of each row.
What I want to do next is randomize each value of w, for example between 0.5 and 2. This I would do as
rand_M = (2-0.5).*rand(n,n) + 0.5;
A_rand = rand_M.*A;
However, I don't want to just pick these random numbers: I want them to be such that for every column and row, the sums are still equal to the elements of c and r. So to clean up the notation a bit, say we define
A_rand_c = sum(A_rand,1);
A_rand_r = sum(A_rand,2)';
I want that for all j = 1:n, A_rand_c(j) = c(j) and A_rand_r(j) = r(j).
What I'm looking for is a way to redraw the elements of rand_M in a sort of algorithmic fashion I suppose, so that these demands are finally satisfied.
Now of course, unless I have infinite amounts of time this might not really happen. I therefore accept these quantities to fall into a specific range: A_rand_c(j) has to be an element of [(1-e)*c(j),(1+e)*c(j)] and A_rand_r(j) of [(1-e)*r(j),(1+e)*r(j)]. This e I define beforehand, say like 0.001 or something.
Would anyone be able to help me in the process of finding a way to do this? I've tried an approach where I just randomly repick the numbers, but this really isn't getting me anywhere. It does not have to be crazy efficient either, I just need it to work in finite time for networks of size, say, n = 50.
To be clear, the final output is the matrix A_rand that satisfies these constraints.
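The acceptance criterion can be written as a small check. In this sketch a fixed 3x3 pattern stands in for w*(rand(n,n)<p), and the helper name withinTol is made up for illustration.

```matlab
w = 3000;
e = 0.001;                                % relative tolerance from the question
A = w * [1 0 1; 0 1 1; 1 1 0];            % hypothetical stand-in for the random matrix
c = sum(A, 1);
r = sum(A, 2)';

% True if every column and row sum of M is within (1 +/- e) of c and r
withinTol = @(M) all(abs(sum(M, 1) - c) <= e * c) && ...
                 all(abs(sum(M, 2)' - r) <= e * r);

okSmallChange = withinTol(1.0005 * A);    % all sums off by 0.05%
okLargeChange = withinTol(1.5 * A);       % all sums off by 50%
```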
Edit:
Alright, so after thinking a bit I suppose it might be doable with some while statement that goes through every element of the matrix. The difficult part is that there are four possibilities: if you are in a specific element A_rand(i,j), it could be that A_rand_c(j) and A_rand_r(i) are both too small, both too large, or opposite. The first two cases are good, because then you can just redraw the random number until it moves in the right direction and improves the situation. But the other two cases are problematic, as you will improve one situation but not the other. I guess it would have to look at which criterion is less satisfied, so that it tries to fix the one that is worse. But this is not trivial, I would say.
You can take advantage of the fact that rows/columns with a single non-zero entry in A automatically give you results for that same entry in A_rand. If A(2,5) = w and it is the only non-zero entry in its column, then A_rand(2,5) = w as well. What else could it be?
You can alternate between finding these single-entry rows/cols, and assigning random numbers to entries where the value doesn't matter.
Here's a skeleton for the process:
A_rand=zeros(size(A)) is the matrix you are going to fill
entries_left = A>0 is a binary matrix showing which entries in A_rand you still need to fill
col_totals=sum(A,1) is the amount you still need to add in every column of A_rand
row_totals=sum(A,2) is the amount you still need to add in every row of A_rand
while sum( entries_left(:) ) > 0
% STEP 1:
% function to fill entries in A_rand if entries_left has rows/cols with one nonzero entry
% you will need to keep looping over this function until nothing changes
% update() A_rand, entries_left, row_totals, col_totals every time you loop
% STEP 2:
% let (i,j) be the indices of the next non-zero entry in entries_left
% assign a random number to A_rand(i,j) <= col_totals(j) and <= row_totals(i)
% update() A_rand, entries_left, row_totals, col_totals
end
update()
A_rand(i,j) = random_value;
entries_left(i,j) = 0;
col_totals(j) = col_totals(j) - random_value;
row_totals(i) = row_totals(i) - random_value;
end
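One pass of STEP 1 over the rows could look like the sketch below. The 3x3 pattern and the names forcedRows and value are hypothetical; the column pass would be analogous, and both would be repeated until nothing changes.

```matlab
w = 3000;                                   % value used in the question
A = w * [1 0 0; 1 1 0; 0 1 1];              % hypothetical small example
A_rand = zeros(size(A));
entries_left = A > 0;
col_totals = sum(A, 1);
row_totals = sum(A, 2);

% Rows with exactly one unfilled entry: that entry must take the whole
% remaining row total.
forcedRows = find(sum(entries_left, 2) == 1);
for i = forcedRows'
    j = find(entries_left(i, :));
    value = row_totals(i);
    A_rand(i, j) = value;
    entries_left(i, j) = false;
    col_totals(j) = col_totals(j) - value;
    row_totals(i) = row_totals(i) - value;
end
```

In this example only row 1 is forced, so A_rand(1,1) receives the full row total and the remaining total of column 1 shrinks accordingly.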
Picking the range for random_value might be a little tricky. The best I can think of is to draw it from a relatively narrow distribution centered around n*w*p, where p is the probability of an entry in A being nonzero (this would be the average value of row/column totals).
This doesn't scale well to large matrices as it will grow with n^2 complexity. I tested it for a 200 by 200 matrix and it worked in about 20 seconds.

Resources