Summation based on unique entries of two arrays | Speed Issue - arrays

I have 3 arrays of size 803500*1 with the following details:
Rid: It can contain any number
RidID: It contains elements from 1 to 184 in random order. Each element appears multiple times.
r: It contains elements 0,1,2,...12. All elements (except zero) appear nearly 3400 to 3700 times at random indices in this array.
Following may be useful for generating sample data:
Rid = rand(803500,1);
RidID = randi(184,803500,1);
r = randi(13,803500,1)-1; %This may not be a good sample for r as per previously mentioned details?
What I want to do?
I want to calculate the sum of those entries of Rid which correspond to each positive unique entry of r and each unique entry of RidID.
This may be clearer with the code which I wrote for this problem:
RNum = numel(unique(RidID));
RSum = ones(RNum,12); %Preallocating for better speed
for i=1:12
RperM = r ==i;
for j = 1:RNum
RSum(j,i) = sum(Rid(RperM & (RidID==j)));
end
end
Issue:
My code works, but it takes 5 seconds on average on my computer, and I have to do this calculation nearly a thousand times. If this time could be reduced from 5 seconds to half of that or less, I'd be very happy. But how do I optimize this? I don't mind whether it is improved through vectorization or a better-written loop.
I am using MATLAB R2017b.

You can use accumarray :
u = unique(RidID);
A = accumarray([RidID r+1], Rid);
RSum = A(u, 2:13);
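To see what the two-column subscript does, here is a small sketch with made-up data (the values below are illustrative only, not from the question). Each row of [RidID r+1] is a (row, column) position, and accumarray sums all Rid values that share a position. Passing the optional size argument fixes the output size even if some RidID or r value never occurs:

```matlab
% Toy data: 3 IDs, r in 0..2
Rid   = [10; 20; 30; 40; 50; 60];
RidID = [ 1;  2;  1;  2;  1;  3];
r     = [ 1;  1;  1;  2;  0;  2];

% Row index = RidID, column index = r+1 (shifted because r starts at 0)
A = accumarray([RidID, r+1], Rid, [3 3]);

% Drop the first column, which holds the sums for r == 0
RSum = A(:, 2:end);

% Same result with the double loop from the question, for comparison
RLoop = zeros(3, 2);
for i = 1:2
    for j = 1:3
        RLoop(j, i) = sum(Rid(r == i & RidID == j));
    end
end
```

For the full problem the size argument would be [184 13].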

This is slower than accumarray as suggested by rahnema, but using findgroups and splitapply may save memory.
In your example, there may be thousands of zero-valued elements in the resulting matrix, where a combination of RidID and r does not occur. In this case a stacked result would be more memory efficient, like so:
RidID | r | Rid_sum
-------------------------
1 | 1 | 100
2 | 1 | 200
4 | 2 | 85
...
This can be achieved with the following code:
[ID, rn, RidIDn] = findgroups(r,RidID); % Get unique combo ID for 'r' and 'RidID'
RSum = splitapply( @sum, Rid, ID ); % Sum for each ID
output = table( RidIDn, rn, RSum ); % Nicely formatted table output
% Get rid of elements where r == 0
output( output.rn == 0, : ) = [];
You could convert this to the same output as the accumarray method, but it's already a slower method...
% Convert to 'unstacked' 2D matrix (optional)
RSum = full( sparse( output.RidIDn, output.rn, output.RSum ) ); % rows = RidID, columns = r

Solve system of equations with data loaded, loop through group IDs and different observations

I have data for a large amount of Group IDs, and each group ID has anywhere from 4 to 30 observations. I would like to solve a (linear or nonlinear, depending on approach) system of equations using data in Matlab. I want to solve a system of three equations and three unknowns, but also load in data for known variables. I need observations 2 through 4 in order to solve this, but would also like to move to the next set of 3 observations (if it exists) to see how the solutions change. I would like to record these calculations as well.
What is the best way to accomplish this? I have a standard idea of how to solve the system using fsolve, but what is the best way to loop through group IDs with varying amounts of observations?
Here is some sample code I have written when thinking about this issue:
%%Load Data
Data = readtable('dataset.csv'); % Full Dataset
%Define Variables
%Main Data
groupID = Data{:,1};
Known1 = Data{:,7};
Known2 = Data{:,8};
Known3 = Data{:,9};
%%%%%%Function %%%%%
f = [A,B,C];
% Define the function handle for the system of equations
fun = @(f) [A^2 + B*Known3 - 2*C*Known1 + 1/Known2 - D2;
A + (B^2)*Known3 - C*Known1 + 1/Known2 - D3;
A - B*Known3 + C^2*Known1 + 1/Known2 - D4];
% Define the initial guess for the solution
f0 = [0; 0; 0];
% Solve the nonlinear system of equations
f = fsolve(fun, f0)
%%%% Create Loop %%%%%%
% Set the number of observations to load at a time
numObservations = 3;
% Set the initial group ID
groupID = 1;
% Set the maximum number of groups
maxGroups = 100;
% Loop through the groups of data
while groupID <= maxGroups
% Load the data for the current group
data = loadData(groupID, numObservations);
% Update the solution using the new data
x = fsolve(fun, x);
% Print the updated solution
disp(x);
% Move on to the next group of data
groupID = groupID + 1;
end
What are the pitfalls with writing the code like this, and how can I improve it?
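One pitfall in the code above is that an anonymous function captures the values of workspace variables (here the full Known1, Known2 and Known3 columns) at the moment it is created, so fun would have to be rebuilt for each window of data. Below is a minimal sketch of the looping part only, using made-up data. The names known, winLen, firstObs and results are hypothetical, and the fsolve call is left as a comment because the actual equations depend on the model:

```matlab
% Made-up data: group 1 has 7 observations, group 2 has 4
groupID = [1;1;1;1;1;1;1; 2;2;2;2];
known   = (1:numel(groupID)).';   % stand-in for one known variable

winLen   = 3;   % observations per system of equations
firstObs = 2;   % the first window starts at observation 2

ids = unique(groupID);
results = cell(0, 3);   % columns: group ID, window start, data in window

for g = 1:numel(ids)
    rows = find(groupID == ids(g));   % rows belonging to this group
    % Window starts 2, 5, 8, ... as long as a full window still fits
    starts = firstObs:winLen:(numel(rows) - winLen + 1);
    for s = starts
        win = rows(s:s + winLen - 1);
        % Here one would build fun from the data in rows 'win'
        % and call fsolve(fun, f0); omitted in this sketch.
        results(end + 1, :) = {ids(g), s, known(win).'};
    end
end
```

Groups with fewer than four observations simply produce no window, so no special case is needed for varying group sizes.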

How to add additional zero arrays

I have the following problem in my simulation.
A is a 24 x 2 array. I am going to split it into 4 or 12 arrays, which means that I group every 6 or 2 rows together. This is fine if I use an even "split" coefficient. If it is odd, I can't split A: I can't group by 5 or 7, because 24 = 4*5 + 4 (or 5*5 - 1) and 24 = 3*7 + 3.
That's why I am going to do the following:
If I have 24 x 2 and need group every 5 together:
block 1 : A(1,:), A(2,:),A(3,:),A(4,:),A(5,:)
block 2 : A(6,:), A(7,:),A(8,:),A(9,:),A(10,:)
block 3 : A(11,:), A(12,:),A(13,:),A(14,:),A(15,:)
block 4 : A(16,:), A(17,:),A(18,:),A(19,:),A(20,:)
block 5 : A(21,:), A(22,:),A(23,:),A(24,:), ?
As you can see, the 5th block is not full, so MATLAB gives me an error. My idea is to create A(25,:)=0. For my simulation that will be OK.
I am going to implement it as a function:
A=rand(m,n)
w = 5; % number of row vectors that I would like to group together (5 in the example)
if mod(w,2)==0
if mod(m,2)==0
% do....
else
% remainder = 0
end
else
if mod(m,2)==0
% remainder = 0
else
%do...
end
end
I was going to implement it as above, but then I noticed that it doesn't work, because 24 = 2*10 + 4. So I should write something else.
I can find the remainder as r = rem(24,5). For the example above, MATLAB gives me r = 4. Then I can find the difference c = w - r = 1, but after that I don't know how to proceed.
Could you suggest how to implement such a calculation?
Determine the number of blocks needed, calculate the number of rows required to fill these blocks completely, and append as many zero rows to A as the difference between that required number and the actual number of rows. Since you didn't mention what the actual output should look like (array, cell array, ...), I chose a reshaped array.
Here's the code:
m = 24;
n = 2;
w = 5;
A = rand(m, n)
% Determine number of blocks
n_blocks = ceil(m / w);
% Add zero rows to A
A(m+1:w*n_blocks, :) = 0
% Reshape A into desired format
A = reshape(A.', size(A, 1) / n_blocks * n, n_blocks).'
The output (shortened):
A =
0.9164959 0.1373036
0.5588065 0.1303052
0.4913387 0.6540321
0.5711623 0.1937039
0.7231415 0.8142444
0.9348675 0.8623844
[...]
0.8372621 0.4571067
0.5531564 0.9138423
A =
0.91650 0.13730
0.55881 0.13031
0.49134 0.65403
0.57116 0.19370
0.72314 0.81424
0.93487 0.86238
[...]
0.83726 0.45711
0.55316 0.91384
0.00000 0.00000
A =
0.91650 0.13730 0.55881 0.13031 0.49134 0.65403 0.57116 0.19370 0.72314 0.81424
0.93487 0.86238 0.61128 0.15006 0.43861 0.07667 0.94387 0.85875 0.43247 0.03105
0.48887 0.67998 0.42381 0.77707 0.93337 0.96875 0.88552 0.43617 0.06198 0.80826
0.08087 0.48928 0.46514 0.69252 0.84122 0.77548 0.90480 0.16924 0.82599 0.82780
0.49048 0.00514 0.99615 0.42366 0.83726 0.45711 0.55316 0.91384 0.00000 0.00000
Hope that helps!
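If a cell array of blocks is preferred over the reshaped matrix, the same padding idea might look like the sketch below. It uses deterministic example data instead of rand so the result is easy to check; Apad, padRows and blocks are names introduced here for illustration:

```matlab
A = reshape(1:48, 24, 2);   % example data, 24 x 2
w = 5;                      % rows per block

[m, n]  = size(A);
nBlocks = ceil(m / w);      % number of blocks needed
padRows = nBlocks * w - m;  % zero rows required to fill the last block

% Append the zero rows, then cut into nBlocks pieces of w rows each
Apad   = [A; zeros(padRows, n)];
blocks = mat2cell(Apad, repmat(w, nBlocks, 1), n);
```

Here blocks{5} holds rows 21 to 24 of A followed by one zero row. When m is already a multiple of w, padRows is 0 and nothing is appended.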

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 only = 1 if Drug.1 is the case that I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that scenario is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA1_B2.1 = DateA1 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA1_B2.3 = DateA1 - DateB2.3
DateDiffA1_B3.3 = DateA1 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually there is no need to add the vector and the loop, all you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax first puts a zero in the count variable N2W. Then it loops through all the dates and, only if the matching IV is 1, compares each date to DateA and adds 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IV variables is 1, start the syntax with the following line instead of compute N2W=0.:
If any(1, IV.1 to IV.3) N2W=0.

How do I make this specific code run faster in Matlab?

I have an array with a set of chronological serial numbers and another source array with random serial numbers associated with a numeric value. The code creates a new cell array in MATLAB with the perfectly chronological serial numbers in one column and in the next column it inserts the associated numeric value if the serial numbers match in both original source arrays. If they don't the code simply copies the previous associated value until there is a new match.
j = 1;
A = {random{1:end,1}};
B = cell2mat(A);
value = random{1,2};
data = cell(length(serial), 1);
data(:,1) = serial(:,1);
h = waitbar(0,'Please Wait...');
steps = length(serial);
for k = 1:length(serial)
[row1, col1, vec1] = find(B == serial{k,1});
tf1 = isempty(vec1);
if (tf1 == 0)
prices = random{col1,2};
data(j,2) = num2cell(value);
j = j + 1;
else
data(j,2) = num2cell(value);
j = j + 1;
end
waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end
close(h);
Right now, the run-time for the code is approximately 4 hours. I would like to make this code run faster. Please suggest any methods to do so.
UPDATE
source input (serial)
1
2
3
4
5
6
7
source input (random)
1 100
2 105
4 106
7 107
desired output (data)
SR No Value
1 100
2 105
3 105
4 106
5 106
6 106
7 107
Firstly, run the MATLAB profiler (see 'doc profile') and see where the bulk of the execution time is occurring.
Secondly, don't update the waitbar on every iteration, particularly if serial contains a large (> 100) number of elements.
Do something like:
if (mod(k, 100)==0) % update on every 100th iteration
waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end
Some points:
Firstly it would help a lot if you gave us some sample input and output data.
Why do you initialize data with one column and then fill its second column in the loop? Rather initialize it with 2 columns upfront: data = cell(length(serial), 2);
Is j ever different from k? They look identical to me, so you could drop j and both of the j = j + 1 lines.
tf1 = isempty(vec1); if (tf1 == 0)... is the same as the single line: if (~isempty(vec1)) or even better if(isempty(vec1)) and then swap the code from your else and your if.
But I think you can probably find a fast vectorized solution if you provide some (short) sample input and output data.
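Using the sample data from the UPDATE in the question, one possible vectorized sketch is shown below. It assumes the serial numbers and values are stored as plain numeric arrays rather than cell arrays, which is a change from the question's code; tf, loc, vals, k and filled are names introduced here:

```matlab
serial = (1:7).';                          % chronological serial numbers
random = [1 100; 2 105; 4 106; 7 107];     % [serial number, value]

% tf marks serials that occur in random; loc gives the matching row
[tf, loc] = ismember(serial, random(:, 1));

vals = random(loc(tf), 2);   % values of the matches, in serial order
k = cumsum(double(tf));      % index of the most recent match so far

% Copy the most recent matched value forward; serials before the
% first match (if any) stay NaN in this sketch
filled = NaN(size(serial));
filled(k > 0) = vals(k(k > 0));

data = [serial, filled];
```

This replaces the find call inside the loop with a single ismember lookup, so the cost no longer grows with the product of the two array lengths.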

Changing indices and order in arrays

I have a struct mpc with the following structure:
num type col3 col4 ...
mpc.bus = 1 2 ... ...
2 2 ... ...
3 1 ... ...
4 3 ... ...
5 1 ... ...
10 2 ... ...
99 1 ... ...
to from col3 col4 ...
mpc.branch = 1 2 ... ...
1 3 ... ...
2 4 ... ...
10 5 ... ...
10 99 ... ...
What I need to do is:
1: Re-order the rows of mpc.bus, such that all rows of type 1 are first, followed by 2 and at last, 3. There is only one element of type 3, and no other types (4 / 5 etc.).
2: Make the numbering (column 1 of mpc.bus) consecutive, starting at 1.
3: Change the numbers in the to-from columns of mpc.branch, to correspond to the new numbering in mpc.bus.
4: After running simulations, reverse the steps above to end up with the same order and numbering as above.
It is easy to update mpc.bus using find.
type_1 = find(mpc.bus(:,2) == 1);
type_2 = find(mpc.bus(:,2) == 2);
type_3 = find(mpc.bus(:,2) == 3);
mpc.bus(:,:) = mpc.bus([type_1; type_2; type_3],:);
mpc.bus(:,1) = 1:nb % Where nb is the number of rows of mpc.bus
The numbers in the to/from columns in mpc.branch correspond to the numbers in column 1 in mpc.bus.
It's OK to update the numbers on the to, from columns of mpc.branch as well.
However, I'm not able to find a non-messy way of retracing my steps. Can I update the numbering using some simple commands?
For the record: I have deliberately not included my code for re-numbering mpc.branch, since I'm sure someone has a smarter, simpler solution (that will make it easier to redo when the simulations are finished).
Edit: It might be easier to create normal arrays (to avoid working with structs):
bus = mpc.bus;
branch = mpc.branch;
Edit #2: The order of things:
Re-order and re-number.
Columns (3:end) of bus and branch are changed. (Not part of this question)
Restore original order and indices.
Thanks!
I'm proposing this solution. It generates an n x 2 matrix (PAIRS), where n corresponds to the number of rows in mpc.bus, and a temporary copy of mpc.branch:
function [mpc_1, mpc_2, mpc_3] = minimal_example
mpc.bus = [ 1 2;...
2 2;...
3 1;...
4 3;...
5 1;...
10 2;...
99 1];
mpc.branch = [ 1 2;...
1 3;...
2 4;...
10 5;...
10 99];
mpc.bus = sortrows(mpc.bus,2);
mpc_1 = mpc;
mpc_tmp = mpc.branch;
for I=1:size(mpc.bus,1)
PAIRS(I,1) = I;
PAIRS(I,2) = mpc.bus(I,1);
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = I;
mpc.bus(I,1) = I;
end
mpc_2 = mpc;
% (a) the following mpc_tmp is only needed if you want to truly reverse the operation
mpc_tmp = mpc.branch;
%
% do some stuff
%
for I=1:size(mpc.bus,1)
% (b) you can decide not to use the following line, then comment the line below (a)
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = PAIRS(I,2);
mpc.bus(I,1) = PAIRS(I,2);
end
% uncomment the following line, if you commented (a) and (b) above:
% mpc.branch = mpc_tmp;
mpc.bus = sortrows(mpc.bus,1);
mpc_3 = mpc;
The minimal example above can be executed as is. The three outputs (mpc_1, mpc_2 & mpc_3) are just in place to demonstrate the workings of the code but are otherwise not necessary.
1.) mpc.bus is ordered using sortrows, simplifying the approach and not using find three times. It targets the second column of mpc.bus and sorts the remaining matrix accordingly.
2.) The original contents of mpc.branch are stored.
3.) A loop is used to replace the entries in the first column of mpc.bus with ascending numbers while at the same time replacing them correspondingly in mpc.branch. Here, the reference to mpc_tmp is necessary to ensure a correct replacement of the elements.
4.) Afterwards, mpc.branch can be reverted analogously to (3.) - here, one might argue, that if the original mpc.branch was stored earlier on, one could just copy the matrix. Also, the original values of mpc.bus are re-assigned.
5.) Now, sortrows is applied to mpc.bus again, this time with the first column as reference to restore the original format.
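As an alternative sketch (not the loop-based code above), the row permutation returned by sortrows and the location output of ismember can serve as the lookup in both directions. It uses plain arrays as suggested in the question's first edit and assumes every number in the first two columns of branch also appears in column 1 of bus; order, oldNum and the *New / *Restored names are introduced here:

```matlab
bus    = [1 2; 2 2; 3 1; 4 3; 5 1; 10 2; 99 1];
branch = [1 2; 1 3; 2 4; 10 5; 10 99];

% Forward: sort by type; order(i) is the original row now at position i
[busNew, order] = sortrows(bus, 2);
oldNum = busNew(:, 1);                 % original bus numbers, new order
busNew(:, 1) = (1:size(bus, 1)).';     % consecutive numbering

% Position of each old number in oldNum is its new number
[~, newNum] = ismember(branch(:, 1:2), oldNum);
branchNew = branch;
branchNew(:, 1:2) = newNum;

% ... run simulations that change columns 3:end ...

% Reverse: look the old numbers up again and undo the permutation
branchRestored = branchNew;
branchRestored(:, 1:2) = oldNum(branchNew(:, 1:2));
busRestored = busNew;
busRestored(:, 1) = oldNum;
busRestored(order, :) = busRestored;   % row i goes back to row order(i)
```

Only order and oldNum need to be kept between the forward and reverse steps.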
