I am working with big arrays (~6x40 million) and my code is hitting serious bottlenecks. I am experienced in programming with MATLAB, but don't know much about its inner workings (memory management and such).
My code looks as follows (just the essentials; of course all variables are initialized, especially the arrays in loops, I just don't want to bomb you all with code):
First I read the file,
disp('Point cloud import and subsampling')
tic
fid=fopen(strcat(Name,'.dat'));
C=textscan(fid, '%d%d%f%f%f%d'); %<= Big!
fclose(fid);
then create arrays out of the contents,
y=C{1}(1:Subsampling:end)/Subsampling;
x=C{2}(1:Subsampling:end)/Subsampling;
%... and so on for the other rows
clear C %No one wants 400+ million doubles just lying around.
and clear the cell array (1), then create some images and arrays from the new values:
for i=1:length(x)
    PCImage(y(i)+SubSize(1)-maxy+1,x(i)+1-minx)=Reflectanse(i);
    PixelCoordinates(y(i)+SubSize(1)-maxy+1,x(i)+1-minx,:)=Coordinates(i,:);
end
toc
Everything runs more or less smoothly up to here, but then I manipulate some arrays:
disp('Overlap alignment')
tic
PCImage=PCImage(:,[1:maxx/2-Overlap,maxx/2:end-Overlap]); %-30 overlap?
PixelCoordinates=PixelCoordinates(:,[1:maxx/2-Overlap,maxx/2:end-Overlap],:);
Sphere=Sphere(:,[1:maxx/2-Overlap,maxx/2:end-Overlap],:);
toc
and this is a big bottleneck, but it gets worse at the next step
disp('Planar view and point cloud matching')
tic
CompImage=zeros(max(SubSize(1),PCSize(1)),max(SubSize(2),PCSize(2)),3);
CompImage(1:SubSize(1),1:SubSize(2),2)=Subimage; %ExportImage Cyan
CompImage(1:SubSize(1),1:SubSize(2),3)=Subimage;
CompImage(1:PCSize(1),1:PCSize(2),1)=PCImage; %PointCloudImage Red
toc
Output
Point cloud import and subsampling
Elapsed time is 181.157182 seconds.
Overlap alignment
Elapsed time is 408.750932 seconds.
Planar view and point cloud matching
Elapsed time is 719.383807 seconds.
My questions are: will clearing unused objects like C in (1) have any effect? (It doesn't seem like it.)
Am I overlooking any other important mechanisms or rules of thumb, or is the whole thing just too much and supposed to happen like this?
When subsref is used, MATLAB makes a copy of the subreferenced elements. This may be costly for large arrays. Often it will be faster to concatenate vectors, like
res = [a,b,c];
This is not possible with the current code as written above, but if the code could be modified to make this work, it may save some time.
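Along the same lines, the per-pixel assignment loop from the question can often be replaced by a single indexed assignment. A minimal sketch, assuming the index expressions from the question yield valid subscripts and that Reflectanse and Coordinates are already subsampled to the same length as x and y:
rows = double(y) + SubSize(1) - maxy + 1;
cols = double(x) + 1 - minx;
idx  = sub2ind(size(PCImage), rows, cols);   % linear indices of the target pixels
PCImage(idx) = Reflectanse;                  % one assignment instead of length(x) of them
for k = 1:size(Coordinates,2)                % one slice per coordinate component
    plane = PixelCoordinates(:,:,k);
    plane(idx) = Coordinates(:,k);
    PixelCoordinates(:,:,k) = plane;
end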
EDIT
For multi-dimensional arrays you need to use cat
CompImage = cat(dim,Subimage,Subimage,PCImage);
where dim is 3 for this example.
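If Subimage and PCImage have different sizes, as in the question, one way to make cat applicable is to zero-pad each plane to the common size first. A minimal sketch, assuming SubSize and PCSize as defined in the question:
H = max(SubSize(1), PCSize(1));
W = max(SubSize(2), PCSize(2));
padTo = @(A) [double(A), zeros(size(A,1), W - size(A,2)); zeros(H - size(A,1), W)];
CompImage = cat(3, padTo(PCImage), padTo(Subimage), padTo(Subimage)); % red = PCImage, green/blue = Subimage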
Related
I've got a LabVIEW program which reads the wavelength and intensity of a spectrum as a function of time. The hardware reading this data uses a CCD chip, so sometimes I run into bad pixels. The program outputs a 2D array of the intensities in a text file. I want to write a separate program which will read this file, then find and eliminate the bad pixel points. The bad pixels should be obvious, as their intensities are up to 10x bigger than the points around them. As those of you familiar with LabVIEW know, you can insert a formula node and code in a language that is basically C. So I've tagged this with C as well as LabVIEW.
Try using a median or percentile filter. Since you don't want to actually change data unless it's way out there, you could do something like this:
for every point, collect *rank* points around it in every direction
compute statistics on the subset of points
if point is an outlier, replace with median value
This way, you don't actually replace the point's value unless it's far out there. A point would be an outlier if it is greater than Q3 + 1.5 IQR or if it is less than Q1 - 1.5 IQR.
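To make the logic concrete, here is a rough sketch in MATLAB (the actual answer implements this as a LabVIEW VI; the function name, the window handling, and the sort-based quartile estimates are illustrative assumptions only):
function out = removeBadPixels(I, halfWidth)
% Replace outlier pixels with the local median (sketch of the filter described above).
% I is the 2-D intensity array; halfWidth plays the role of "rank".
[nr, nc] = size(I);
out = I;
for r = 1:nr
    for c = 1:nc
        win = I(max(1,r-halfWidth):min(nr,r+halfWidth), ...
                max(1,c-halfWidth):min(nc,c+halfWidth));
        v  = sort(win(:));
        q1 = v(max(1, round(0.25*numel(v))));        % approximate Q1
        q3 = v(min(numel(v), round(0.75*numel(v)))); % approximate Q3
        iqrVal = q3 - q1;
        if I(r,c) > q3 + 1.5*iqrVal || I(r,c) < q1 - 1.5*iqrVal
            out(r,c) = median(v);                    % only outliers are replaced
        end
    end
end
end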
Here is a VI Snippet performing the filter I've described: (VI Snippet image)
If you want only more extreme outliers to get changed, then increase the IQR multiplier.
I was trying to collect statistics of a 6D vector and plot a 1D histogram for each coordinate. I get 729000000 different copies of this vector (each 6-dimensional). For this I create an array of zeros of size 729000000x6 before I get any of the actual W's, and this seems to be a problem in MATLAB, since it says:
Error using zeros
Requested 729000000x6 (32.6GB) array exceeds maximum array size preference. Creation of arrays
greater than this limit may take a long time and cause MATLAB to become unresponsive. See array
size limit or preference panel for more information.
The reason I did this at first was because it was easy to fill W_history and then just feed it to the histogram plotter:
histogram(W_history(:,d),nbins,'Normalization','probability')
However, filling W_history seemed impossible for a high number of copies of W. Is there a way to do this in MATLAB automatically? It feels like there should be, and I didn't want to re-invent the wheel.
I am sure I could create, for each coordinate, an array of counters that records how many times each value of that coordinate of W falls into a specific bin. However, implementing that, along with the checks for which bin each value falls into, seemed inefficient or even unnecessary. Is this really the only solution, or what do MATLAB experts recommend? Is this re-inventing the wheel? It also seems inefficient if I implement it myself.
Also, I thought I could manually have MATLAB page things out to disk and bring them back (store W_history on disk as it fills, keep appending, and eventually somehow plug it into the histogram plotter), but that seemed like overkill. I hope I can avoid a solution like that. It feels wrong, since using MATLAB should be "easy" and high level, and going down to disk and memory management doesn't seem to be what MATLAB is intended for.
Currently, following the comment that was given, the best solution I have so far is using histcounts as follows:
W_hist_counts = zeros(1, numel(edges)-1); % running bin counts; edges is defined beforehand
for i = 2:iter+1
    W = get_new_W(W); % produce the next copy of W
    [W_hist_counts_current, ~] = histcounts(W, edges);
    W_hist_counts = W_hist_counts + W_hist_counts_current;
end
However, after this it seems difficult to convert W_hist_counts to pdf/probability or other normalized values, since they apparently have to be computed manually. Is there no official way to do this processing without the user having to implement the normalizations again?
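For the normalization part, once the counts have been accumulated the usual normalizations are one line each, and histogram can plot precomputed counts directly via its 'BinEdges'/'BinCounts' syntax. A minimal sketch, assuming W_hist_counts and edges come from the loop above:
binWidths   = diff(edges);
probability = W_hist_counts / sum(W_hist_counts);          % same as 'Normalization','probability'
pdfValues   = probability ./ binWidths;                    % same as 'Normalization','pdf'
histogram('BinEdges', edges, 'BinCounts', W_hist_counts);  % plot the raw accumulated counts
bar(edges(1:end-1) + binWidths/2, probability, 1);         % plot a normalized version, bars touching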
Well, this is something I've been struggling with since I started working on the actual code I'm using right now.
My advisor has been writing this code for the past ten years and, at some point, had to store values that we would usually keep in matrices or tensors.
Actually, we work with a matrix of six independent components calculated from the virial theorem (from a molecular dynamics simulation), and he has the habit of storing six 1D arrays, one for each component, at each recorded step, i.e. xy(n), xz(n), yz(n)..., n being the number of records.
I assume that a single array s(n,3,3) could be more efficient, as the values would be stored closer to one another (xy(n) and xz(n) have no reason to be stored side by side in memory) and would raise fewer errors from corrupted memory or wrong memory accesses. I tried to discuss it in the lab, but in the end no one cares, and again, this is just an assumption.
This would not have bugged me if everything in the code weren't stored like that. Every 3D quantity is stored in 3 different arrays instead of 1, and this feels odd to me with regard to the performance of the code.
Is there any comparable effect for long calculations and large data sizes? I decided to post here after resolving an error caused by a wrong memory access to one of these arrays, and because I find the code more readable and the data easier to work with this way (s = s + ... instead of six lines of xy = xy + ..., for example).
The fact that the columns are close to each other is not very important, especially if the leading dimension n is large. Your CPU has multiple prefetch streams and can prefetch simultaneously from different arrays or from different columns of the same array.
If you make some random access in an array A(n,3,3) where A is allocatable, the dimensions are not known at compile time. Therefore, the address of a random element A(i,j,k) will be address_of(A(1,1,1)) + (i-1) + (j-1)*n + (k-1)*3*n (in units of the element size), and it will have to be calculated at execution time every time you make a random access to the array. The calculation of the address involves 3 integer multiplications (3 CPU cycles each) and at least 3 additions (1 cycle each). But regular (predictable) accesses can be optimized by the compiler using relative addresses.
If you have separate 1-index arrays, the calculation of the address involves only one integer addition (1 cycle), so you get a performance penalty of at least 11 cycles for each access when using a single 3-index array.
Moreover, if you have 9 different arrays, each one of them can be aligned on a cache-line boundary, whereas you would be forced to use padding at the end of lines to ensure this behavior with a single array.
So I would say that in the particular case of A(n,3,3), as the two last indices are small and known at compile time, you can safely do the transformation into 9 different arrays to potentially gain some performance.
Note that if you often use the data of the 9 arrays at the same index i in a random order, re-organizing the data into A(3,3,n) will give you a clear performance increase. If A is in double precision, A(4,4,n) could be even better if A is aligned on a 64-byte boundary, as every A(1,1,i) will then be located at the first position of a cache line.
Assuming that you always loop along n and inside each iteration need to access all the components of the matrix, storing the array as s(6,n) or s(3,3,n) will benefit from cache optimization.
do i=1,n
! do some calculation with s(:,i)
enddo
However, if your inner loop looks like this,
resultarray(i)=xx(i)+yy(i)+zz(i)+2*(xy(i)+yz(i)+xz(i))
don't bother to change the array layout, because you may break the SIMD optimization.
I'd like to have a MATLAB array fill a column with numbers in increments of 0.001. I am working with arrays of around 200,000,000 rows and so would like to use the most efficient method possible. I had considered using the following code:
for i = 1 : size(array,1)
    array(i,1) = i * 0.001;
end
There must be a more efficient way of doing this...?
Well, the accepted answer is pretty close to being fast, but not fast enough. You should use:
s=size(array,1);
step=0.0001;
array(:,1)=[step:step:s*step];
There are two issues with the accepted answer
you don't need to transpose
you should include the step inside the vector, instead of multiplying
and here is a comparison (sorry, I am running 32-bit MATLAB):
array=rand(10000);
s=size(array,1);
step=0.0001;
tic
for i=1:100000
    array(:,1)=[step:step:s*step];
end
toc
and
tic
for i=1:100000
    array(:,1)=[1:s]'*step;
end
toc
the results are:
Elapsed time is 3.469108 seconds.
Elapsed time is 5.304436 seconds.
and without transposing in the second example
Elapsed time is 3.524345 seconds.
I suppose in your case things would be worse.
array(:,1) = [1:size(array,1)]' * 0.001;
MATLAB is more efficient when loops are vectorized; see also the performance tips from MathWorks.
If such vectorization is infeasible due to space limitations, you might want to consider rewriting your for-loop in C as a MEX function.
You can also try this:
n=20000000; % number of rows ("size" is avoided as a variable name so the built-in function is not shadowed)
array(1:n,1)=(1:n)*0.001;
I am writing a game-playing AI (aichallenge.org - Ants), which requires a lot of updating of, and referring to, data structures. I have tried both Arrays and Maps, but the basic problem seems to be that every update creates a new value, which makes it slow. The game boots you out if you take more than one second to make your move, so the application counts as "hard real-time". Is it possible to get the performance of mutable data structures in Haskell, or should I learn Python, or rewrite my code in OCaml?
I have completely rewritten the Ants "starter-pack". Changed from Arrays to Maps because my tests showed that Maps update much faster.
I ran the Maps version with profiling on, which showed that about 20% of the time is being taken by Map updates alone.
Here is a simple demonstration of how slow Array updates are.
import Data.Array

slow_array =
  let arr = listArray (0,9999) (repeat 0)
      upd i ar = ar // [(i,i)]
  in  foldr upd arr [0..9999]
Now evaluating slow_array!9999 takes almost 10 seconds! Although it would be faster to apply all the updates at once, the example models the real problem where the array must be updated each turn, and preferably each time you choose a move when planning your next turn.
Thanks to nponeccop and Tener for the reference to the vector modules. The following code is equivalent to my original example, but runs in 0.06 seconds instead of 10.
import qualified Data.Vector.Unboxed.Mutable as V

fast_vector :: IO (V.IOVector Int)
fast_vector = do
  vec <- V.new 10000
  V.set vec 0
  mapM_ (\i -> V.write vec i i) [0..9999]
  return vec

fv_read :: IO Int
fv_read = do
  v <- fast_vector
  V.read v 9999
Now, to incorporate this into my Ants code...
First of all, think if you can improve your algorithm. Also note that the default Ants.hs is not optimal and you need to roll your own.
Second, you should use a profiler to find where the performance problem is instead of relying on hand-waving. Haskell code is usually much faster than Python (10-30 times faster, you can look at Language Shootout for example comparison) even with functional data structures, so probably you do something wrong.
Haskell supports mutable data pretty well. See ST (state thread) and libraries for mutable arrays for the ST. Also take a look at vectors package. Finally, you can use data-parallel haskell, haskell-mpi or other ways of parallelization to load all available CPU cores, or even distribute work over several computers.
Are you using compiled code (e.g. cabal build or ghc --make), or are you using runhaskell or ghci? The latter are bytecode interpreters and produce much slower code than the native code compiler. See the Cabal reference - it is the preferred way to build applications.
Also make sure you have optimization turned on (-O2 and other flags). Note that -O vs -O2 can make a difference, and try different backends including the new LLVM backend (-fllvm).
Updating arrays one element at a time is incredibly inefficient, because each update involves making a copy of the whole array. Other data structures such as Map are implemented as trees and thus allow logarithmic-time updates. However, in general, updating functional data structures one element at a time is often sub-optimal, so you should try to take a step back and think about how you can implement something as a transformation of the whole structure at once instead of a single element at a time.
For example, your slow_array example can be written much more efficiently by doing all the updates in one step, which only requires the array to be copied once.
faster_array =
  let arr = listArray (0,9999) (repeat 0)
  in  arr // [(i,i) | i <- [0..9999]]
If you cannot think of an alternative to the imperative one-element-at-a-time algorithm, mutable data structures have been mentioned as another option.
You are basically asking for a mutable data structure. Apart from the standard libraries, I would recommend you look up this:
vector: http://hackage.haskell.org/package/vector
That said, I'm not so sure that you need them. There are neat algorithms for persistent data structures as well. A fast replacement for Data.Map is the hash table from this package:
unordered-containers: http://hackage.haskell.org/package/unordered-containers