MATLAB sort function yields tampered results - arrays

I have a vector of 126 elements which is usually correctly sorted; however, I always sort it to make sure everything is okay.
The problem is that when the array is already sorted, sorting it again appears to change the original values of the array.
I attached the array in a csv file and executed the script below, where I insert the vector in the first column of 'a' then sort it in the second then check for any differences in the third column.
a = csvread('a.csv')
a(:,2)=sort(a(:,1))
a(:,3)=a(:,2)-a(:,1)
result=sum(a(:,3).^2)
You can easily see that the first two columns aren't identical and that the third column has some nonzero values.
Syntax for array
a = [17.4800
18.6800
19.8800
21.0800
22.2800
23.4800
24.6800
25.8800
27.0800
28.2800
29.4800
30.6800
46.1600
47.3600
48.5600
49.7600
50.9600
52.1600
53.3600
54.5600
55.7600
56.9600
58.1600
59.3600
74.8400
76.0400
77.2400
78.4400
79.6400
80.8400
103.5200
104.7200
105.9200
107.1200
108.3200
109.5200
110.7200
111.9200
113.1200
114.3200
115.5200
116.7200
132.2000
133.4000
134.6000
135.8000
137.0000
138.2000
139.4000
140.6000
141.8000
143.0000
144.2000
145.4000
165.4200
166.6200
167.8200
169.0200
170.2200
171.4200
172.6200
173.8200
175.0200
176.2200
177.4200
178.6200
179.9300
181.1300
182.3300
183.5300
184.7300
185.9300
187.1300
188.3300
189.5300
201.3700
202.5700
203.7700
204.9700
206.1700
207.3700
236.1100
237.3100
238.5100
239.7100
240.9100
242.1100
243.3100
244.5100
245.7100
246.9100
248.1100
249.3100
239.8400
241.0400
242.2400
276.9900
278.1900
279.3900
280.5900
281.7900
282.9900
284.1900
285.3900
286.5900
287.7900
288.9900
290.1900
277.8200
279.0200
280.2200
281.4200
282.6200
283.8200
285.0200
286.2200
287.4200
288.6200
289.8200
291.0200
291.0700
292.2700
293.4700
295.6900
296.8900
298.0900];

Unfortunately, your original vector is not sorted. Sorting it therefore cannot reproduce the original vector, because the values that were out of order get moved into order.
You can check this by applying diff to the vector read in from the CSV file and looking for negative differences. diff takes the difference between the (i+1)th value and the ith value, so if your values are monotonically increasing, every difference should be positive. We can see which locations are affected by finding the negative values in the difference:
a = csvread('a.csv');
ind = find(diff(a) < 0);
We get:
>> ind
ind =
93
108
This says that locations 93 and 108 are the last in-order values before each drop. Locations 94 and 109 are where the out-of-order values actually start. Let's look at elements 90 to 110 of your vector to be sure:
>> a(90:110)
ans =
245.7100 % 90
246.9100 % 91
248.1100 % 92
249.3100 % 93
239.8400 %<-------
241.0400
242.2400
276.9900
278.1900
279.3900
280.5900
281.7900
282.9900
284.1900
285.3900
286.5900
287.7900 % 106
288.9900 % 107
290.1900 % 108
277.8200 % <------
279.0200
As you can see, the values dip right after locations 93 and 108. If you sort the vector and then take the difference against the original, the difference is 0 only up to the point where the misplaced values belong: 239.84 sorts in right after 239.71 at location 85, so the two columns start to disagree from location 86 onwards.
I'm frankly surprised you didn't see that they're out of order because your snapshot clearly shows there's a decrease in value on the left column towards the top of the snapshot.
Therefore, either check your data to see whether you have input it correctly, or modify whatever process you're working on to ensure that it can handle unsorted data.
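The check described above can be tried on a small vector. The sketch below is only an illustration: it uses a shortened excerpt of the posted values around the first dip (some rows are left out), and the variable names are made up for this example.

```matlab
% Shortened excerpt around the first dip (values copied from the posted
% vector, with some rows left out to keep the example small)
a = [238.51; 239.71; 240.91; 242.11; 239.84; 241.04; 276.99];

alreadySorted = issorted(a);       % false: the excerpt is not in ascending order
dipAfter      = find(diff(a) < 0); % index of the last value before each dip
s             = sort(a);           % sorting only reorders, it never changes values
firstMismatch = find(s ~= a, 1);   % first row where sorted and original differ
```

Here dipAfter is 4, but firstMismatch is 3: the misplaced value 239.84 belongs before 240.91, so the sorted and original columns already disagree ahead of the dip itself.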

Related

Summation based on unique entries of two arrays | Speed Issue

I have 3 arrays of size 803500*1 with the following details:
Rid: It can contain any number
RidID: It contains elements from 1 to 184 in random order. Each element appears multiple times.
r: It contains elements 0,1,2,...12. All elements (except zero) appear nearly 3400 to 3700 times at random indices in this array.
Following may be useful for generating sample data:
Rid = rand(803500,1);
RidID = randi(184,803500,1);
r = randi(13,803500,1)-1; %This may not be a good sample for r as per previously mentioned details?
What I want to do?
I want to calculate the sum of those entries of Rid which correspond to each positive unique entry of r and each unique entry of RidID.
This may be clearer with the code which I wrote for this problem:
RNum = numel(unique(RidID));
RSum = ones(RNum,12); %Preallocating for better speed
for i=1:12
RperM = r ==i;
for j = 1:RNum
RSum(j,i) = sum(Rid(RperM & (RidID==j)));
end
end
Issue:
My code works, but it takes 5 seconds on average on my computer and I have to do this calculation nearly a thousand times. If this time could be reduced from 5 seconds to at least half of that, I'd be very happy. But how do I optimize this? I don't mind whether it is improved with vectorization or with a better written loop.
I am using MATLAB R2017b.
You can use accumarray :
u = unique(RidID);
A = accumarray([RidID r+1], Rid);
RSum = A(u, 2:13);
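To see that accumarray produces the same sums as the double loop, here is a minimal sketch on tiny made-up data (the values are hypothetical, not the asker's). It uses 2:end instead of 2:13 because r only reaches 2 in this toy example.

```matlab
% Tiny made-up data; the variable names follow the question
Rid   = [10; 20; 30; 40; 50; 60];
RidID = [ 1;  2;  1;  3;  2;  1];
r     = [ 1;  0;  1;  2;  2;  0];

% accumarray: row subscript = RidID, column subscript = r+1 (r starts at 0)
A = accumarray([RidID, r+1], Rid);
RSumAcc = A(:, 2:end);             % drop the first column, which holds r == 0

% Reference result from the original double loop
nID = max(RidID);
nR  = max(r);
RSumLoop = zeros(nID, nR);
for i = 1:nR
    for j = 1:nID
        RSumLoop(j, i) = sum(Rid(r == i & RidID == j));
    end
end

same = isequal(RSumAcc, RSumLoop); % true for this data
```

Both approaches give [40 0; 0 50; 0 40] here; combinations of RidID and r that never occur are simply left at zero.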
This is slower than accumarray as suggested by rahnema, but using findgroups and splitapply may save memory.
In your example, there may be thousands of zero-valued elements in the resulting matrix, where a combination of RidID and r does not occur. In this case a stacked result would be more memory efficient, like so:
RidID | r | Rid_sum
-------------------------
1 | 1 | 100
2 | 1 | 200
4 | 2 | 85
...
This can be achieved with the following code:
[ID, rn, RidIDn] = findgroups(r,RidID); % Get unique combo ID for 'r' and 'RidID'
RSum = splitapply( @sum, Rid, ID ); % Sum for each ID
output = table( RidIDn, rn, RSum ); % Nicely formatted table output
% Get rid of elements where r == 0
output( output.rn == 0, : ) = [];
You could convert this to the same output as the accumarray method, but it's already a slower method...
% Convert to 'unstacked' 2D matrix (optional)
% rows = RidID, columns = r (the r == 0 rows were removed from 'output' above)
RSum = full( sparse( output.RidIDn, output.rn, output.RSum ) );

SAS: set statement point = _N_

I'm trying to understand a friend's code to see if I can find some inspiration for my dissertation. He runs a section where he creates a dataset and reads in 3 datasets. However, what I don't understand is that he uses 3 set statements and the latter two use point = _N_.
What is the use of the following code?
data Other;
set One;
set Two point = _N_;
set Three point = _N_;
array Rating[*] Unrated;
array Amortising[*] '1'n;
array Rating_old[*] old_Unrated;
AM = 0;
do i = 1 to dim(Rating);
Rating[i] = Rating[i] + Rating_old[i] * Amortising[i];
end;
run;
The input datasets look like this
data one;
input Segment count weight ;
datalines;
1 0 0.1
99 1 0.2
;
run;
data two;
input block $ type '0'n '1'n '99'n;
datalines;
50 A 100% 10% 0%
50 S 100% 10% 0%
51 S 100% 10% 0%
52 S 100% 10% 0%
132 S 100% 12% 0%
;
run;
data three;
input DPD $ block type $ segment count weight;
datalines;
AM 50 S 1 0 0.1
Unrated 51 S 99 0.2
NPE 132 S 1 0.5
;
run;
Just looking to see what the point = _N_ would be used for!
In this program it does nothing. The program would run exactly the same without the point= option on the last two set statements.
The POINT= option lets you access observations directly. The _N_ automatic variable is incremented once for each iteration of the data step. So on the first iteration the step will read the first observation from each of the three inputs, which is exactly what would happen without the point= option.
Note that this program will stop when the first SET statement reads past the end of the file. Without POINT=, it would stop when ANY of the three set statements attempted to read past the end of its input file. You could do the same and avoid the ERRORs in the SAS log by using and testing the NOBS= option.
set One;
if _n_ <= nobs2 then set Two nobs=nobs2;
if _n_ <= nobs3 then set Three nobs=nobs3;
Given the datasets shown, it doesn't do anything.
However, if the ONE dataset had more rows than one or both of the other two datasets, it would avoid the data step stopping when it ran out of rows from the shortest dataset. For example, run this:
data Other;
set Two;
set One point = _N_;
set Three point = _N_;
array Rating[*] Unrated;
array Amortising[*] '1'n;
array Rating_old[*] old_Unrated;
AM = 0;
do i = 1 to dim(Rating);
Rating[i] = Rating[i] + Rating_old[i] * Amortising[i];
end;
run;
Just swapping TWO and ONE. Now you get 5 rows, while if you took off the point=_n_, you'd only get two still. So the program is likely being written to ensure all of ONE's rows are represented (similar to a left join in SQL except you're not joining to anything here). This would probably be more clearly written as a merge, even without a by statement if it's just a one-to-one merge. Usually, though, there's a valid merge key to merge on.

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps when an event happened) that I have
e.g.
BodyData{}=
Column 1 2 3
'10:15:15.332' 'BASE05' ...
...
'10:17:33:230' 'BASE05' ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array for all the data between the two string values plus or minus a small threshold of say .100ms?
Also can I also add another condition to say that all str values in column2 must also be the same, otherwise ignore? For example, only return the timestamps between A and B only if 'BASE02'
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In Matlab this can be done quite painlessly with datenum.
For the second part you can just use logical indexing: this is where you put a condition (i.e. that the second column is BASE02) within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.
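The question also asked for a small tolerance around the two timestamps, which the example above does not show. One possible sketch, assuming a tolerance of 100 ms and made-up start/end times: datenum values are in days, so the tolerance is converted to days before widening the window. The names tol, t1, t2 and keep are introduced here for illustration only.

```matlab
% Same example data as above
BodyData = {'10:15:15.332', 'BASE05', 'foo'; ...
            '10:15:16.332', 'BASE02', 'bar'; ...
            '10:15:17.332', 'BASE05', 'foo'; ...
            '10:15:18.332', 'BASE02', 'foo'; ...
            '10:15:19.332', 'BASE05', 'bar'};

t   = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
tol = 0.100 / 86400;                        % assumed tolerance: 100 ms, in days

% Hypothetical event times that do not exactly match any row
t1 = datenum('10:15:16.400', 'HH:MM:SS.FFF');
t2 = datenum('10:15:18.300', 'HH:MM:SS.FFF');

% Widen the window by the tolerance, then add the column-2 condition
keep   = t >= t1 - tol & t <= t2 + tol & strcmp(BodyData(:, 2), 'BASE02');
result = BodyData(keep, :);
```

With these values the rows stamped 10:15:16.332 and 10:15:18.332 are kept even though they fall slightly outside the exact window, because both are within 100 ms of it.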

How can I sum values with multiple conditions including different dates

I have some data as follow (column A:D contain data, column E is the sum I created):
NO
SE
Date Country ID Value Sum
30-01-2014 SE B-08888 10 10
05-02-2014 SE B-08888 23
06-02-2014 SE B-08888 20
13-05-2014 SE B-08888 17 27
14-05-2014 SE B-08888 10
13-05-2014 NO A-07777 15 35
14-05-2014 NO A-07777 20
I would like to sum all values that are having same country and same ID when: 1) the date is greater than 1/5; and 2) when date is less than 1/5.
I am using the SUMIFS. But the SUMIFS doesn't give correct results when I included the date argument which is less than 1/5.
=SUMIFS($D$5:$D$11;$A$5:$A$11;"<="&DATE(2014;5;1);$B$5:$B$11;A2;$C$5:$C$11;C5) ==> gives incorrect result (=10)
=SUMIFS($D$5:$D$11;$A$5:$A$11;">="&DATE(2014;5;1);$B$5:$B$11;A2;$C$5:$C$11;C8) ==> gives correct result (=27)
Is there a way I can take into account both date conditions (i.e. date greater than and less than 1/5) and make the formula general so I don't have to go through every cell to change reference?
Thank you.
Using your data, the second formula returns 27 for me, so I assume the cell references you have not mentioned are as I have guessed. The first formula returns 53 for me, which I suspect is the result you want, though you have not said so.
Something is wrong with your data (not the formulae). The most likely cause is that there is a trailing space in C6 and C7 that is not in C5. Copying C5 down to C9 should fix that. There might however be a data issue in other cells in those two rows.
It might make things easier for you if the formulae were in separate columns.

How do I make this specific code run faster in Matlab?

I have an array with a set of chronological serial numbers and another source array with random serial numbers associated with a numeric value. The code creates a new cell array in MATLAB with the perfectly chronological serial numbers in one column and in the next column it inserts the associated numeric value if the serial numbers match in both original source arrays. If they don't the code simply copies the previous associated value until there is a new match.
j = 1;
A = {random{1:end,1}};
B = cell2mat(A);
value = random{1,2};
data = cell(length(serial), 1);
data(:,1) = serial(:,1);
h = waitbar(0,'Please Wait...');
steps = length(serial);
for k = 1:length(serial)
[row1, col1, vec1] = find(B == serial{k,1});
tf1 = isempty(vec1);
if (tf1 == 0)
prices = random{col1,2};
data(j,2) = num2cell(value);
j = j + 1;
else
data(j,2) = num2cell(value);
j = j + 1;
end
waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end
close(h);
Right now, the run-time for the code is approximately 4 hours. I would like to make this code run faster. Please suggest any methods to do so.
UPDATE
source input (serial)
1
2
3
4
5
6
7
source input (random)
1 100
2 105
4 106
7 107
desired output (data)
SR No Value
1 100
2 105
3 105
4 106
5 106
6 106
7 107
Firstly, run the MATLAB profiler (see 'doc profile') and see where the bulk of the execution time is occurring.
Secondly, don't update the waitbar on every iteration, particularly if serial contains a large (> 100) number of elements.
Do something like:
if (mod(k, 100)==0) % update on every 100th iteration
waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end
Some points:
Firstly it would help a lot if you gave us some sample input and output data.
Why do you initialize data as one column and then fill its second column in the loop? Rather initialize it as 2 columns upfront: data = cell(length(serial), 2);
Is j ever different from k? They look identical to me, and you could just drop both the j = j + 1 lines and use k.
tf1 = isempty(vec1); if (tf1 == 0)... is the same as the single line if (~isempty(vec1)), or even better if (isempty(vec1)) with the code from your if and else branches swapped.
But I think you can probably find a fast vectorized solution if you provide some (short) sample input and output data.
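For the sample input in the question's update, a loop-free version might look like the sketch below. It assumes the data has been converted to plain numeric arrays (for example with cell2mat) and that the first serial number has a match in random; both are assumptions for this illustration, not something the question states.

```matlab
% Sample input from the question, as numeric arrays (an assumption; the
% original code holds them in cell arrays)
serial = (1:7)';
random = [1 100; 2 105; 4 106; 7 107];

% tf(k) is true where serial(k) occurs in random; loc(k) is the matching row
[tf, loc] = ismember(serial, random(:, 1));

% Forward fill: for every row, find the most recent row that had a match.
% This assumes tf(1) is true, otherwise cumsum would start at 0.
pos       = find(tf);                 % rows of serial that have a match
lastMatch = pos(cumsum(double(tf)));  % most recent matching row for each k

data = [serial, random(loc(lastMatch), 2)];
```

This reproduces the desired output from the question (1 100, 2 105, 3 105, 4 106, 5 106, 6 106, 7 107) with a single ismember call instead of one find per serial number.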
