Parallelize data processing - arrays

I have a large matrix data that I want to "organize" in a certain way. The matrix has 5 columns and about 2 million rows. The first 4 columns are characteristics of each observation (these are integers) and the last column is the outcome variable I'm interested in (this contains real numbers). I want to organize this matrix in an Array of Arrays. Since data is very large, I'm trying to parallelize this operation:
addprocs(3)
@everywhere data = readcsv("datalocation", Int)
@everywhere const Z = 65
@everywhere const z = 16
@everywhere const Y = 16
@everywhere const y = 10
@everywhere const arr = Array{Vector}(Z-z+1,Y-y+1,Z-z+1,Y-y+1)
@parallel (vcat) for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
    arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = data[(data[:,1].==a1) & (data[:,2].==e1) & (data[:,3].==a2) & (data[:,4].==e2), end]
end
However I get an error when I try to run the for loop:
Error: syntax: invalid assignment location
After the loop is finished, I would like to have arr available to all processors. What am I doing wrong?
EDIT:
The input matrix data looks like this (rows in no particular order):
16 10 16 10 100
16 10 16 11 200
20 12 21 13 500
16 10 16 10 300
20 12 21 13 500
Notice that some rows can be repeated, and some others will have the same "key" but a different fifth column.
The output I want looks like this (notice how I'm using the dimensions of arr as "keys" for a "dictionary"):
arr[16-z+1, 10-y+1, 16-z+1, 10-y+1] = [100, 300]
arr[16-z+1, 10-y+1, 16-z+1, 11-y+1] = [200]
arr[20-z+1, 12-y+1, 21-z+1, 13-y+1] = [500, 500]
That is, the element of arr at index (16-z+1, 10-y+1, 16-z+1, 10-y+1) is the vector [100, 300]. I don't care about the ordering of the rows or the ordering of the last column of vectors.

Does this work for you? I tried to simulate your data by repeating the snippet you gave 1000 times. It's not as elegant as I would have liked; in particular, I couldn't get remotecall_fetch() working the way I wanted (even when wrapping it with @async), so I had to split the calling and the fetching into two steps. Let me know how this seems.
addprocs(n)
@everywhere begin
    if myid() != 1
        multiplier = 10^3;
        Data = readdlm("/path/to/Input.txt")
        # repeat every row `multiplier` times to simulate a large data set
        global data = kron(Data,ones(multiplier));
        println(size(data))
    end
end
@everywhere begin
    function Select_Data(a1, e1, a2, e2, data=data)
        return data[(data[:,1].==a1) & (data[:,2].==e1) & (data[:,3].==a2) & (data[:,4].==e2), end]
    end
end
n_workers = nworkers()
# cycle through the worker ids 2, 3, ..., n_workers + 1
function next_pid(pid, n_workers)
    if pid <= n_workers
        return pid + 1
    else
        return 2
    end
end
const arr = Array{Any}(Z-z+1,Y-y+1,Z-z+1,Y-y+1);
println("Beginning Processing Work")
@sync begin
    pid = 2
    for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
        pid = next_pid(pid, n_workers)
        arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = remotecall(pid, Select_Data, a1, e1, a2, e2)
    end
end
println("Retrieving Completed Jobs")
@sync begin
    pid = 2
    for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
        arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = fetch(arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1])
    end
end
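For comparison, here is a minimal single-process sketch of the same grouping done in one pass over the rows. It is written in Python rather than Julia and is not the approach used above; the name group_rows is hypothetical, and it keys a dictionary by the four-column tuple instead of using array offsets. The point it illustrates is that each row is visited once, instead of rescanning all rows for every key combination.

```python
from collections import defaultdict

def group_rows(rows):
    """Group the fifth column of each row by its first four columns."""
    groups = defaultdict(list)
    for a1, e1, a2, e2, outcome in rows:
        groups[(a1, e1, a2, e2)].append(outcome)
    return dict(groups)

# Sample rows taken from the question.
rows = [
    (16, 10, 16, 10, 100),
    (16, 10, 16, 11, 200),
    (20, 12, 21, 13, 500),
    (16, 10, 16, 10, 300),
    (20, 12, 21, 13, 500),
]
groups = group_rows(rows)
print(groups[(16, 10, 16, 10)])  # [100, 300]
```

If one pass like this is fast enough for the real data, the parallel dispatch may not be needed at all, though that would have to be measured.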

Note: I initially misinterpreted your question. I had thought that you were trying to split the data amongst your workers, but I now see that isn't quite what you were after. I wrote up some simplified examples of ways that can be accomplished. I'll leave them up as a response in case anyone in the future finds them useful.
Get started:
writedlm("path/to/data.csv", rand(100,10), ',')
addprocs(4)
Option 1:
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
Data = readcsv("/path/to/data.csv")
for (idx, pid) in enumerate(workers())
    Start = (idx-1)*25 + 1
    End = Start + 24
    sendto(pid, Data = Data[Start:End,:])
end
Option 2:
@everywhere begin
    if myid() != 1
        Start = (myid()-2)*25 + 1
        End = Start + 24
        println(Start)
        println(End)
        Data = readcsv("path/to/data.csv")[Start:End,:]
    end
end
# verify everything looks right for what got sent
@everywhere if myid()!= 1 println(typeof(Data)) end
@everywhere if myid()!= 1 println(size(Data)) end
Option 3:
for (idx, pid) in enumerate(workers())
    Start = (idx-1)*25 + 1
    End = Start + 24
    sendto(pid, Start = Start, End = End)
end
@everywhere if myid()!= 1 Data = readcsv("path/to/data.csv")[Start:End,:] end
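The three options above hard-code 25 rows per worker, which fits the 100-row, 4-worker example. As a rough sketch of how the Start/End ranges could be computed for other sizes, here is a small Python helper; the name chunk_bounds is hypothetical, and it spreads any leftover rows over the first workers.

```python
def chunk_bounds(n_rows, n_workers):
    """Return 1-based inclusive (start, end) row ranges, one per worker."""
    base, extra = divmod(n_rows, n_workers)
    bounds = []
    start = 1
    for i in range(n_workers):
        # The first `extra` workers each take one additional row.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

print(chunk_bounds(100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```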

Related

Divide data array in more efficient way Matlab

I have a big array of data (it can contain thousands or tens of thousands of values). This data is the result of several experiments collected in one array:
data = [2.204000000000000
2.202000000000000
2.206000000000000
2.201000000000000
...
]
I also have a time array t of the same size:
t = [1 2 3 ... 65 66 1 2 3 4 ... 72 73 1 2 3 ... 75]';
Here t is the time at which each data value was collected. So t = 1:66 is the first experiment; when the t values start again from 1, that is the data of the second experiment, and so on.
What I want to do: divide data by the specific time intervals:
t<=1
1<t<=4
4<t<=6
t>6
This is my current approach:
part1 = []; part2 = []; part3 = []; part4 = [];
for ii = 1: size(data,1)
    if (t(ii)) <=1 % collect all data corresponding to t<=1
        part1 = [part1; ii];
    elseif (t(ii) >1 && t(ii) <=4 )
        part2 = [part2; ii];
    elseif (t(ii) >4 && t(ii) <=6 )
        part3 = [part3; ii];
    else
        part4 = [part4; ii];
    end
end
data1 = data(part1);
data2 = data(part2);
data3 = data(part3);
data4 = data(part4);
That works perfectly, but it's slow because:
I can't preallocate part1, part2, part3 and part4, since I don't know their sizes;
it uses a for loop.
Can this be done in a more elegant and faster way?
One idea I have now is to use a single cell array instead of four separate arrays (part{1}, part{2}, ..., part{4}), so that I can preallocate it as part = cell(4,1);
You can improve your code using logical indexing.
I strongly encourage you to read the following references:
Mathworks documentation page: Find Array Elements That Meet a Condition
Loren on the Art of MATLAB blog entry: Logical Indexing – Multiple Conditions
The following code uses logical indexing to do what you want without any loop, and thus without the need to preallocate any arrays:
data1 = data(t <= 1);
data2 = data((t > 1) & (t <= 4));
data3 = data((t > 4) & (t <= 6));
data4 = data(t > 6);
Logical indexing is like a traffic light: It allows the elements of an array that have a value of 1 to continue while stopping those elements that have a value of 0.
MATLAB is very powerful at this kind of task.
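For readers outside MATLAB, the same mask-and-select idea can be sketched with list comprehensions in plain Python. The function name split_by_time and the sample values below are illustrative assumptions, not data from the question.

```python
def split_by_time(data, t):
    """Partition data into four lists according to the time intervals."""
    part1 = [d for d, ti in zip(data, t) if ti <= 1]
    part2 = [d for d, ti in zip(data, t) if 1 < ti <= 4]
    part3 = [d for d, ti in zip(data, t) if 4 < ti <= 6]
    part4 = [d for d, ti in zip(data, t) if ti > 6]
    return part1, part2, part3, part4

# Illustrative values only.
data = [2.204, 2.202, 2.206, 2.201, 2.205, 2.203, 2.207, 2.200]
t = [1, 2, 5, 7, 1, 4, 6, 8]
data1, data2, data3, data4 = split_by_time(data, t)
print(data1)  # [2.204, 2.205]
```

Each condition plays the role of one logical mask: an element is kept when its matching time value satisfies the condition.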

Reverse lookup with non-unique values

What I'm trying to do
I have an array of numbers:
>> A = [2 2 2 2 1 3 4 4];
And I want to find the array indices where each number can be found:
>> B = arrayfun(@(x) {find(A==x)}, 1:4);
In other words, this B should tell me:
>> for ii=1:4, fprintf('Item %d in location %s\n',ii,num2str(B{ii})); end
Item 1 in location 5
Item 2 in location 1 2 3 4
Item 3 in location 6
Item 4 in location 7 8
It's like the 2nd output argument of unique, but instead of the first (or last) occurrence, I want all the occurrences. I think this is called a reverse lookup (where the original key is the array index), but please correct me if I'm wrong.
How can I do it faster?
What I have above gives the correct answer, but it scales terribly with the number of unique values. For a real problem (where A has 10M elements with 100k unique values), even this stupid for loop is 100x faster:
>> B = cell(max(A),1);
>> for ii=1:numel(A), B{A(ii)}(end+1)=ii; end
But I feel like this can't possibly be the best way to do it.
We can assume that A contains only integers from 1 to the max (because if it doesn't, I can always pass it through unique to make it so).
That's a simple task for accumarray:
out = accumarray(A(:),(1:numel(A)).',[],@(x) {x}) %'
out{1} = 5
out{2} = 3 4 2 1
out{3} = 6
out{4} = 8 7
However accumarray suffers from not being stable (in the sense of unique's feature), so you might want to have a look here for a stable version of accumarray, if that's a problem.
The above solution also assumes that A is filled with integers, preferably with no gaps in between. If that is not the case, there is no way around calling unique first:
A = [2.1 2.1 2.1 2.1 1.1 3.1 4.1 4.1];
[~,~,subs] = unique(A)
out = accumarray(subs(:),(1:numel(A)).',[],@(x) {x})
To sum up, the most generic solution, which works with floats and returns sorted output, could be:
[~,~,subs] = unique(A)
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1)); %// optional
vals = 1:numel(A);
vals = vals(I); %// optional
out = accumarray(subs, vals , [],@(x) {x});
out{1} = 5
out{2} = 1 2 3 4
out{3} = 6
out{4} = 7 8
Benchmark
function [t] = bench()
    %// data
    a = rand(100);
    b = repmat(a,100);
    A = b(randperm(10000));
    %// functions to compare
    fcns = {
        @() thewaywewalk(A(:).');
        @() cst(A(:).');
    };
    % timeit
    t = zeros(2,1);
    for ii = 1:100;
        t = t + cellfun(@timeit, fcns);
    end
    format long
end
function out = thewaywewalk(A)
    [~,~,subs] = unique(A);
    [subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
    idx = 1:numel(A);
    out = accumarray(subs, idx(I), [],@(x) {x});
end
function out = cst(A)
    [B, IX] = sort(A);
    out = mat2cell(IX, 1, diff(find(diff([-Inf,B,Inf])~=0)));
end
0.444075509687511 %// thewaywewalk
0.221888202987325 %// CST-Link
Surprisingly the version with stable accumarray is faster than the unstable one, due to the fact that Matlab prefers sorted arrays to work on.
This solution should run in O(N*log(N)) due to the sorting, but it is quite memory intensive (it requires 3x the amount of input memory):
[U, X] = sort(A);
B = mat2cell(X, 1, diff(find(diff([Inf,U,-Inf])~=0)));
I am curious about the performance though.
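As a language-neutral illustration of the sort-then-group idea, here is a minimal Python sketch. The name reverse_lookup is hypothetical, and a dictionary stands in for the cell array. Python's sort is stable, so the positions inside each group come out in ascending order.

```python
def reverse_lookup(values):
    """Map each distinct value to the list of 1-based positions where it occurs."""
    # Stable sort of the indices by value, like [U, X] = sort(A).
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = {}
    for i in order:
        groups.setdefault(values[i], []).append(i + 1)
    return groups

A = [2, 2, 2, 2, 1, 3, 4, 4]
lookup = reverse_lookup(A)
print(lookup[2])  # [1, 2, 3, 4]
```

With a dictionary the sort is not strictly required; it is kept here only to mirror the structure of the sort-based answer.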

Array search/compare is slow, compare to Excel VBA

I just switched from VBA (Excel) to VB (Visual Studio Express 2013).
Now I have copied parts of my code from VBA to VB.
And now I'm wondering why VB is so slow...
I'm creating an array (IFS_BV_Assy) with 4 columns and about 4000 rows.
There are some identical entries in it, so I compare every entry with every other entry and overwrite the duplicate with an empty string.
The code looks like this:
For i = 1 To counter
    For y = 1 To counter
        If IFS_BV_Assy(1, y) = IFS_BV_Assy(1, i) And i <> y Then
            If IFS_BV_Assy(2, i) < IFS_BV_Assy(2, y) Then
                IFS_BV_Assy(1, i) = ""
            Else
                IFS_BV_Assy(1, y) = ""
            End If
            Exit For
        End If
    Next
Next
counter is the length of the array.
In VBA it takes about 1 second. In VB it takes about 30 seconds to go through the loop. Does anybody know why? (I'm recording a timestamp between every step to be sure what is slow, and that loop is the culprit.)
The Array looks like this:
(1,1) = 12.3.015 / (2,1) = 02
(1,2) = 12.3.016 / (2,2) = 01 <-- delete
(1,3) = 12.3.016 / (2,3) = 02 <-- keep, because 02 is newer than 01
(1,4) = 12.3.017 / (2,4) = 01
(1,5) = 12.3.018 / (2,5) = 01
Thanks in advance
Andy
Edit: I create the array like this:
strStartPath_BV_Assy = "\\xxx\xx\xx\"
myFile = Dir(strStartPath_BV_Assy & "*.*")
counter = 1
ReDim IFS_BV_Assy(0 To 2, 0 To 0)
IFS_BV_Assy(0, 0) = "Pfad"
IFS_BV_Assy(1, 0) = "Zg."
IFS_BV_Assy(2, 0) = "Rev"
Do While myFile <> ""
    If UCase(Right(myFile, 3)) = "DWG" Or UCase(Right(myFile, 3)) = "PDF" Then
        ReDim Preserve IFS_BV_Assy(0 To 2, 0 To counter)
        IFS_BV_Assy(0, counter) = strStartPath_BV_Assy + myFile
        IFS_BV_Assy(1, counter) = Left(Mid(myFile, 12), InStr(1, Mid(myFile, 12), "-") - 1)
        IFS_BV_Assy(2, counter) = Mid(myFile, Len(myFile) - 8, 2)
        counter = counter + 1
    End If
    myFile = Dir()
Loop
Maybe the data was a best case (around 4000 iterations) when it ran in VBA.
30 sec seems a reasonable time for 4000x4000=16.000.000 iterations. 1 sec is too low for this number of iterations.
Stokke suggested creating the array as String instead of Object type.
Dim IFS_BV_Assy(,) As String
I had created the module with Option Explicit Off, because I never saw any difference in VBA on that point. Now I declare every variable with Dim ... As ....
And now it's as fast as VBA is =)
Learning = making mistakes.. =)
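Separately from the typing fix, the pairwise comparison itself can be replaced by a single pass with a dictionary. This is a different technique from the loop in the question, sketched here in Python with the hypothetical name newest_revisions. It assumes revisions are zero-padded two-character strings, so that string comparison orders them correctly, as the original code also assumes.

```python
def newest_revisions(entries):
    """Keep, for each drawing number, only the highest revision seen."""
    newest = {}
    for drawing, revision in entries:
        # Assumption: revisions are zero-padded strings such as "01", "02".
        if drawing not in newest or revision > newest[drawing]:
            newest[drawing] = revision
    return newest

# Sample entries taken from the question.
entries = [
    ("12.3.015", "02"),
    ("12.3.016", "01"),
    ("12.3.016", "02"),
    ("12.3.017", "01"),
    ("12.3.018", "01"),
]
result = newest_revisions(entries)
print(result["12.3.016"])  # 02
```

This visits each entry once instead of comparing every pair, so the work grows linearly with the number of rows.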

Signal created with matlab function block in simulink

I want to generate an arbitrary linear signal from a MATLAB Function block in Simulink. I have to use this block because I want to control when the signal is generated through a sequence in Stateflow. I first tried making the function output a structure with a values field and a time field, as in the following code:
function y = sig(u)
if u == 1
    t = ([0 10 20 30 40 50 60]);
    T = ([20 20 40 40 60 60 0]);
    S.time = [t'];
    S.signals(1).values = [T'];
    S.signals(1).dimensions = 1;
else
    N = ([0 0 0 0 0 0 0]);
    S.signals(1).values = [N'];
end
y = S.signals(1).values
end
The idea is that u == 1 generates the signal and u == 0 generates a zero output.
I also tried making the output a two-column array (one column for time and one for the function value) with the following code:
function y = sig(u)
if u == 1
    S = ([0, 0]);
    cant = input('Number of points');
    for n = Drange(1:cant)
        S(n, 1) = input('time');
        S(n, 2) = input('temperature');
    end
    y = [S]
else
    y = [0]
end
end
In both cases I cannot generate the signal.
In the first case I get errors like:
This structure does not have a field 'signals'; new fields cannot be added when structure has been read or used
or
Error in port widths or dimensions. Output port 1 of 'tempstrcutsf2/MATLAB Function/u' is a one dimensional vector with 1 elements.
or
Undefined function or variable 'y'. The first assignment to a local variable determines its class.
And in the second case:
Try and catch are not supported for code generation,
Errors occurred during parsing of MATLAB function 'MATLAB Function' (#23)
Error in port widths or dimensions. Output port 1 of 'tempstrcutsf2/MATLAB Function/u' is a one dimensional vector with 1 elements.
I'll be very grateful for any contribution.
PS: sorry for my English xD
There are several mistakes in your code; it is worth reading more about arrays and structs in MATLAB.
With S = ([0, 0]); you declare an array with only two elements, so its size is fixed.
There is no function called Drange in MATLAB, unless it is one of your own.
See this example of how to create a struct for your function:
function y = sig(u)
if u == 1
    %%getting fields size
    cant = input ('Number of points\r\n');
    %%create a structure of two fields
    S = struct('time',zeros(1,cant),'temperature',zeros(1,cant));
    for n =1:cant
        S.time(n) = input (strcat([strcat(['time' num2str(n)]) '= ']));
        S.temperature(n) = input (strcat([strcat(['temperature'...
            num2str(n)]) '= ']));
    end
    %%assign output as cell of struct
    y = {S} ;
else
    y = 0 ;
end
end
To get the result from y, just use
s = y{1};
s.time;
s.temperature;
To convert the result into a 2-D array (a variable name cannot start with a digit, so it is called array2d here):
array2d = [y{1}.time; y{1}.temperature];
Work with arrays
function y = sig(u)
if u == 1
    cant = input ('Number of points\r\n');
    S = [];
    for n =1:cant
        S(1,n) = input (strcat([strcat(['time' num2str(n)]) '= ']));
        S(2,n) = input (strcat([strcat(['temperature'...
            num2str(n)]) '= ']));
    end
    y = S;
else
    y = 0 ;
end
end
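As a language-neutral sketch of what "an arbitrary linear signal gated by u" could mean, the Python function below interpolates linearly between the breakpoints from the question and returns zero when u is not 1. Passing the current simulation time as a second argument (called clock here) is an assumption made for this illustration; nothing in the thread establishes that the block receives it, and the name signal is hypothetical.

```python
def signal(u, clock, times, values):
    """Return the piecewise-linear signal value at `clock` when u == 1, else 0."""
    if u != 1:
        return 0.0
    if clock <= times[0]:
        return float(values[0])
    points = list(zip(times, values))
    # Walk over consecutive breakpoint pairs and interpolate inside the first match.
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if clock <= t1:
            return v0 + (v1 - v0) * (clock - t0) / (t1 - t0)
    # Past the last breakpoint: hold the final value.
    return float(values[-1])

# Breakpoints taken from the question.
t = [0, 10, 20, 30, 40, 50, 60]
T = [20, 20, 40, 40, 60, 60, 0]
print(signal(1, 15, t, T))  # 30.0
```

Because the function returns one scalar per call, the output has a fixed size, which is the kind of output the port-dimension errors in the question point towards.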

Removing four nested loops in Matlab

I have the following four nested loops in Matlab:
timesteps = 5;
inputsize = 10;
additionalinputsize = 3;
outputsize = 7;
input = randn(timesteps, inputsize);
additionalinput = randn(timesteps, additionalinputsize);
factor = randn(inputsize, additionalinputsize, outputsize);
output = zeros(timesteps,outputsize);
for t=1:timesteps
    for i=1:inputsize
        for o=1:outputsize
            for a=1:additionalinputsize
                output(t,o) = output(t,o) + factor(i,a,o) * input(t,i) * additionalinput(t,a);
            end
        end
    end
end
There are three vectors: an input vector, an additional input vector and an output vector. They are all connected by factors. Every vector has values at given timesteps. I need the sum of all combined inputs, additional inputs and factors at every given timestep. Later, I need to calculate back from the output to the input:
result2 = zeros(timesteps,inputsize);
for t=1:timesteps
    for i=1:inputsize
        for o=1:outputsize
            for a=1:additionalinputsize
                result2(t,i) = result2(t,i) + factor(i,a,o) * output(t,o) * additionalinput(t,a);
            end
        end
    end
end
In a third case, I need the product of all three vectors summed over every timestep:
product = zeros(inputsize,additionalinputsize,outputsize);
for t=1:timesteps
    for i=1:inputsize
        for o=1:outputsize
            for a=1:additionalinputsize
                product(i,a,o) = product(i,a,o) + input(t,i) * output(t,o) * additionalinput(t,a);
            end
        end
    end
end
The code snippets work but are incredibly slow. How can I remove the nested loops?
Edit: Added values and changed minor things so the snippets are executable
Edit2: Added other use case
First Part
One approach -
t1 = bsxfun(@times,additionalinput,permute(input,[1 3 2]));
t2 = bsxfun(@times,t1,permute(factor,[4 2 1 3]));
t3 = permute(t2,[2 3 1 4]);
output = squeeze(sum(sum(t3)));
Or a slight variant to avoid squeeze -
t1 = bsxfun(@times,additionalinput,permute(input,[1 3 2]));
t2 = bsxfun(@times,t1,permute(factor,[4 2 1 3]));
t3 = permute(t2,[1 4 2 3]);
output = sum(sum(t3,3),4);
Second Part
t11 = bsxfun(@times,additionalinput,permute(output,[1 3 2]));
t22 = bsxfun(@times,permute(t11,[1 4 2 3]),permute(factor,[4 1 2 3]));
result2=sum(sum(t22,3),4);
Third Part
t11 = bsxfun(@times,permute(output,[4 3 2 1]),permute(additionalinput,[4 2 3 1]));
t22 = bsxfun(@times,permute(input,[2 4 3 1]),t11);
product = sum(t22,4);
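As a cross-check of the first contraction, here is a plain Python sketch that takes a different route from the bsxfun/permute code above. For each timestep it forms the outer product of the input row and the additional-input row, flattens it, and multiplies it by factor flattened to (inputsize*additionalinputsize) rows of length outputsize. The names contract_naive and contract_matmul are hypothetical; the sizes are the ones from the question.

```python
import random

def contract_naive(x, z, f):
    """output[t][o] = sum over i, a of f[i][a][o] * x[t][i] * z[t][a]."""
    T, I, A, O = len(x), len(x[0]), len(z[0]), len(f[0][0])
    out = [[0.0] * O for _ in range(T)]
    for t in range(T):
        for i in range(I):
            for o in range(O):
                for a in range(A):
                    out[t][o] += f[i][a][o] * x[t][i] * z[t][a]
    return out

def contract_matmul(x, z, f):
    """Same result via a flattened outer product times a flattened factor."""
    I, A = len(f), len(f[0])
    # Flatten f to I*A rows of length O, in (i, a) order.
    f_flat = [f[i][a] for i in range(I) for a in range(A)]
    n_out = len(f_flat[0])
    out = []
    for x_row, z_row in zip(x, z):
        # Outer product of the two rows, flattened in the same (i, a) order.
        p = [xi * za for xi in x_row for za in z_row]
        out.append([sum(pk * row[o] for pk, row in zip(p, f_flat))
                    for o in range(n_out)])
    return out

random.seed(0)
T, I, A, O = 5, 10, 3, 7
x = [[random.gauss(0, 1) for _ in range(I)] for _ in range(T)]
z = [[random.gauss(0, 1) for _ in range(A)] for _ in range(T)]
f = [[[random.gauss(0, 1) for _ in range(O)] for _ in range(A)] for _ in range(I)]
naive = contract_naive(x, z, f)
fast = contract_matmul(x, z, f)
max_diff = max(abs(n - m) for rn, rm in zip(naive, fast) for n, m in zip(rn, rm))
print(max_diff < 1e-9)  # True
```

The pure Python version is not fast in itself; it only shows that the four-loop sum is equivalent to one matrix product, which is the form that vectorised array libraries handle efficiently.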

Resources