Optimized copying from an OpenCV Mat to a 2D float array

I want to copy an OpenCV Mat into a 2D float array. I used the code below for this purpose, but because I'm developing code where speed is a very important metric, this way of copying is not optimized enough. Is there a more optimized way to do it?
float *ImgSrc_f;
ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float));
for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)
    {
        ImgSrc_f[i * 512 + j] = ImgSrc.at<float>(i, j);
    }

Really,
//first method
Mat img_cropped = ImgSrc(Rect(0, 0, 512, 512)).clone();
float *ImgSrc_f = (float *)img_cropped.data; // clone() guarantees a continuous buffer
should be no more than a couple of percent less efficient than the best method. I suggest this one as long as it doesn't lose more than 1% to the second method.
Also try the following, which is very close to the absolute best method unless you can use some kind of extended CPU instruction set (and only if one is available). You'll probably see minimal difference between method 2 and method 1.
//method 2
//preallocated memory
//largest possible rows with the minimum number of direct copy calls
//caching pointer arithmetic can speed things up extremely minimally
//removing the Mat header with a malloc can speed things up minimally
//inputs: startx, starty and ROI (a cv::Rect) describing the region to copy
Mat img_copied(ROI.height, ROI.width, CV_32FC1);
for (int y = 0; y < ROI.height; ++y)
{
    const unsigned char* srcrow = ImgSrc.data + (starty + y) * ImgSrc.step + startx * sizeof(float);
    unsigned char* dstrow = img_copied.data + y * img_copied.step;
    memcpy(dstrow, srcrow, ROI.width * sizeof(float));
}
Basically, if you really, really care about these details of performance, you should stay away from overloaded operators in general. The more levels of abstraction in code, the higher the penalty cost. Of course that makes your code more dangerous, harder to read, and bug-prone.

Can you use a std::vector structure, too? If so, try:
std::vector<float> container;
container.assign((const float*)matrix.datastart, (const float*)matrix.dataend);
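One caveat worth adding (my note, not part of the original answer): the datastart/dataend assignment only grabs the pixels you expect when the Mat owns one continuous buffer; a ROI/submatrix is not continuous and needs a row-wise copy. A minimal sketch, assuming a CV_32FC1 Mat named matrix:
std::vector<float> container;
if (matrix.isContinuous()) {
    // whole buffer in one shot
    container.assign((const float*)matrix.datastart, (const float*)matrix.dataend);
} else {
    // submatrices are not continuous; copy row by row
    container.reserve((size_t)matrix.rows * matrix.cols);
    for (int i = 0; i < matrix.rows; ++i) {
        const float* row = matrix.ptr<float>(i);
        container.insert(container.end(), row, row + matrix.cols);
    }
}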

Related

Copying a 2D array of type (double **2darray) to the GPU using CUDA [duplicate]

I am looking into how to copy a 2D array of variable width for each row into the GPU.
int rows = 1000;
int cols;
int** host_matrix = (int**)malloc(sizeof(int*)*rows);
int *d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, this is not a clever way of doing it:
cudaMalloc(d_array, rows*sizeof(int*));
for(int i = 0 ; i < rows ; i++) {
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}
But I cannot think of any other method. Is there a smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows*sizeof(int*));
cudaMalloc((void **)&d_array, nrows*sizeof(int*));
for(int i = 0 ; i < nrows ; i++) {
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and adopting one of the sparse matrix formats for your data instead. Using the classic compressed sparse row format as a model, you could do something like this:
int *data, *rows, *lengths;
cudaMalloc((void **)&rows, nrows*sizeof(int));
cudaMalloc((void **)&lengths, nrows*sizeof(int));
cudaMalloc((void **)&data, N*sizeof(int));
In this scheme, you store all the data in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]], and each row has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows in your current scheme, i.e. it reduces the overhead from O(N) to O(1).
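To make the bookkeeping concrete, here is a sketch of the host-side setup (my illustration, written in the same untested spirit as the disclaimer above; upload_jagged and its parameters are invented names): compute prefix-sum offsets, flatten the jagged host rows, then issue the three allocations and three copies.
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* sketch: flatten a jagged host array into {offsets, lengths, data} and
   transfer it with a fixed number of copies, independent of nrows */
void upload_jagged(int **host_matrix, int *length, int nrows,
                   int **d_rows, int **d_lengths, int **d_data)
{
    /* prefix sum gives each row's starting offset in the flat array */
    int *offsets = (int *)malloc(nrows * sizeof(int));
    int total = 0;
    for (int i = 0; i < nrows; i++) { offsets[i] = total; total += length[i]; }

    /* flatten on the host */
    int *flat = (int *)malloc(total * sizeof(int));
    for (int i = 0; i < nrows; i++)
        memcpy(flat + offsets[i], host_matrix[i], length[i] * sizeof(int));

    /* three allocations and three copies, regardless of nrows */
    cudaMalloc((void **)d_rows,    nrows * sizeof(int));
    cudaMalloc((void **)d_lengths, nrows * sizeof(int));
    cudaMalloc((void **)d_data,    total * sizeof(int));
    cudaMemcpy(*d_rows,    offsets, nrows * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_lengths, length,  nrows * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_data,    flat,    total * sizeof(int), cudaMemcpyHostToDevice);

    free(offsets);
    free(flat);
}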
I would put all the data into one array. Then compose another array with the row lengths, so that A[i] holds the length of row i, i.e. A[i] = length[i].
Then you just need to allocate two arrays on the card and call cudaMemcpy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card).

Benefits of contiguous memory allocation

In terms of performance, what are the benefits of allocating a contiguous memory block versus separate memory blocks for a matrix? I.e., instead of writing code like this:
char **matrix = malloc(sizeof(char *) * 50);
for(i = 0; i < 50; i++)
    matrix[i] = malloc(50);
giving me 50 disparate blocks of 50 bytes each and one block of 50 pointers, if I were to instead write:
char **matrix = malloc(sizeof(char *) * 50 + 50 * 50);
char *data = (char *)(matrix + 50); /* the data area starts right after the 50 row pointers */
for(i = 0; i < 50; i++) {
    matrix[i] = data;
    data += 50;
}
giving me one contiguous block of data, what would the benefits be? Avoiding cache misses is the only thing I can think of, and even that's only for small amounts of data (small enough to fit in the cache), right? I've tested this in a small application, noticed a small speed-up, and was wondering why.
It's complicated - you need to measure.
Using an intermediate pointer instead of calculating addresses in a two-dimensional array is most likely a loss on current processors, and both of your examples do that.
Next, everything fitting into L1 cache is a big win. malloc() most likely rounds allocations up to multiples of 64 bytes. One contiguous allocation of 180 x 180 = 32,400 bytes might fit into L1 cache, while 180 individual mallocs, each rounded up to 192 bytes (180 x 192 = 34,560 bytes), might not, especially if you add another 180 pointers.
One contiguous array means you know how the data fits into cache lines, and you know you'll have the minimum number of page table lookups in the hardware. With hundreds of mallocs, no guarantee.
Watch Scott Meyers' "CPU Caches and Why You Care" presentation on YouTube. The performance gains can be entire orders of magnitude.
https://www.youtube.com/watch?v=WDIkqP4JbkE
As for the discussion above, the intermediate pointer argument died a long time ago. Compilers optimize it away. An N-dimensional array is allocated as a flat 1D vector, ALWAYS. If you use std::vector<std::vector<T>>, then you might get the equivalent of an ordered forward list of vectors, but raw arrays are always allocated as one long, contiguous strip in a flat manner, and multi-dimensional access reduces to pointer arithmetic the same way one-dimensional access does.
To access array[i][j][k] (assume width, height, depth of {A, B, C}), you add i*(B*C) + j*C + k to the address at the front of the array. You'd have to do this math manually in a 1-D representation anyway.
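A tiny self-contained demonstration of that equivalence (my example, with arbitrary sizes):
#include <cassert>

int main()
{
    // the compiler lowers arr[i][j][k] to base + i*(B*C) + j*C + k,
    // the same arithmetic you would write for a flat 1-D buffer
    const int A = 4, B = 5, C = 6;
    static int arr[A][B][C];
    int *flat = &arr[0][0][0];

    int i = 1, j = 2, k = 3;
    assert(&arr[i][j][k] == &flat[i * (B * C) + j * C + k]);
    return 0;
}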

How to profile / identify the slow steps in a tight processing loop?

I have some proprietary image processing code. It walks over an image and computes some statistics on the image. An example of the kind of code I'm talking about can be seen below, although this is not the algorithm that needs optimizing.
My question is: what tools exist for profiling these kinds of tight loops to determine where things are slow? Sleepy and Windows Performance Analyzer focus more on identifying which methods/functions are slow. I already know which function is slow; I just need to figure out how to optimize it.
void BGR2YUV(IplImage* bgrImg, IplImage* yuvImg)
{
    const int height = bgrImg->height;
    const int width = bgrImg->width;
    const int step = bgrImg->widthStep;
    const int channels = bgrImg->nChannels;
    assert(channels == 3);
    assert(bgrImg->height == yuvImg->height);
    assert(bgrImg->width == yuvImg->width);
    // for reasons that are not clear to me, these are not the same.
    // Code below has been modified to reflect this fact, but if they
    // could be the same, the code below gets sped up a bit.
    // assert(bgrImg->widthStep == yuvImg->widthStep);
    assert(bgrImg->nChannels == yuvImg->nChannels);
    const uchar* bgr = (uchar*) bgrImg->imageData;
    uchar* yuv = (uchar*) yuvImg->imageData;
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            const int ixBGR = i*step + j*channels;
            const int b = (int) bgr[ixBGR+0];
            const int g = (int) bgr[ixBGR+1];
            const int r = (int) bgr[ixBGR+2];
            const int y = (int) (0.299 * r + 0.587 * g + 0.114 * b);
            const double di = 0.596 * r - 0.274 * g - 0.322 * b;
            const double dq = 0.211 * r - 0.523 * g + 0.312 * b;
            // Do some shifting and trimming to get i & q to fit into uchars.
            const int iv = (int) (128 + max(-128.0, min(127.0, di)));
            const int q = (int) (128 + max(-128.0, min(127.0, dq)));
            const int ixYUV = i*yuvImg->widthStep + j*channels;
            yuv[ixYUV+0] = (uchar)y;
            yuv[ixYUV+1] = (uchar)iv;
            yuv[ixYUV+2] = (uchar)q;
        }
    }
}
Since you can't share the code, I have some general suggestions. First, remember that profilers tell you which part of the code is taking more time, and more advanced ones can suggest some modifications to improve the speed. But in general, algorithmic optimizations gain much more speed-up than tweaking the code. For the sample code you're sharing, if you google efficient or fast RGB to YUV conversion, you will find loads of methods (from lookup tables to SSE2 and GPU utilization) that improve the speed drastically, and I'm sure none of the profilers could suggest any of them.
So once you know what part of the method is slow, you can follow these two steps:
algorithmic optimization: understand what the algorithm is doing and try to come up with a more optimized algorithm. Google is your friend; it's likely someone has already thought about optimizing that algorithm and shared the idea/code with the world. Though, often you should consider the constraints you have. For example, the simplest but most effective image processing change to speed up code is to reduce the size of the image to the smallest possible. A good rule of thumb is to question every single assumption made in the code/algorithm. E.g., is processing an 800x600 image necessary, or could you reduce the size to 320x240 without compromising accuracy? Is processing a three-channel image necessary, or could the same be achieved with a grayscale image? I think you get the idea.
implementation optimization: some advanced profiling tools can suggest how to tweak the code; you can try to find one that's affordable. Some might not agree, but I don't think it's necessary to use such tools. Often in image processing exact values are not necessary; a rough approximation of the filter response, for example using integers instead of exact computation with double floats, can be used instead. SIMD instructions and, more recently, GPUs have been shown to be perfectly suitable for optimizing image processing methods. You should consider those if possible. You can always google how to optimize loops or specific operations. And after all that is done, one possibility is to break your code down into smaller logical pieces and change it such that the algorithm or method is not revealed by sharing the pieces. Then you can share each piece on SO and ask others' opinions on how to optimize it. A sketch of the lookup-table idea follows.
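As one concrete illustration of the integer/lookup-table idea (my sketch, not taken from any of these answers; it reuses the Y coefficients from the question's code, and the I and Q tables would be built the same way):
#include <stdint.h>

/* sketch: 16.16 fixed-point lookup tables for the Y component; one
   256-entry table per input channel replaces the per-pixel float math */
static int32_t yr[256], yg[256], yb[256];

void initYTables(void)
{
    for (int v = 0; v < 256; ++v)
    {
        /* scale the float coefficients by 2^16 once, up front */
        yr[v] = (int32_t)(0.299 * 65536) * v;
        yg[v] = (int32_t)(0.587 * 65536) * v;
        yb[v] = (int32_t)(0.114 * 65536) * v;
    }
}

static inline uint8_t bgrToY(uint8_t b, uint8_t g, uint8_t r)
{
    /* three table reads, one add, one shift per pixel */
    return (uint8_t)((yr[r] + yg[g] + yb[b]) >> 16);
}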

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data,J,y);
    /* Use J */
}
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter=0; iter<151; iter++) {
        for (jter = 0; jter<151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
    for(jter = 1; jter<151; jter++){
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
    for (jter=1; jter<151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the position in P and the position in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter<151; iter++)
{
    /* point at the start of row iter-1 of P and at column 1 of row iter of J */
    double* pP = &P[iter-1][0];
    double* pJ = &data->J[iter][1];
    for(jter = 1; jter<151; jter++, pP++, pJ++ )
    {
        *pP = - gamma * *pJ;
    }
}
This way you move some of the array index calculations out of the inner loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
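Applied to the question's third loop, that advice looks something like this (my sketch; it keeps the question's Ith macro and 1-based ranges, and assumes J, v and Jv are declared as shown there):
int iter, jter;
for (iter = 1; iter < 151; iter++) {
    const double *row = J[iter];   /* row address computed once per row */
    double acc = 0.0;              /* local accumulator can live in a register */
    for (jter = 1; jter < 151; jter++) {
        acc += row[jter] * Ith(v, jter);
    }
    Ith(Jv, iter) = acc;
}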
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which can cause thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, i.e.:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it is wise to allocate 151*151*sizeof(double) bytes on the stack, no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimizations such as counting down towards zero rather than up, or using ++i rather than i++, etc. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really need high accuracy, I'd consider using float or int to speed up the algorithm.

Performance of memory operations on iPhone

Here's the code that I use to create a differently ordered array:
const unsigned int height = 1536;
const unsigned int width = 2048;
uint32_t* buffer1 = (uint32_t*)malloc(width * height * BPP);
uint32_t* buffer2 = (uint32_t*)malloc(width * height * BPP);
int i = 0;
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++)
        buffer1[x+y*width] = buffer2[i++];
Can anyone explain why using the following assignment:
buffer1[i++] = buffer2[x+y*width];
instead of the one in my code takes twice as much time?
It's likely down to CPU cache behaviour (at 12MB, your images far exceed the 256KB L2 cache of the ARM Cortex-A8 inside an iPhone 3GS).
The first example accesses the reading array in sequential order, which is fast, but has to access the writing array out of order, which is slow.
The second example is the opposite - the writing array is written in fast, sequential order and the reading array is accessed in a slower fashion. Write misses are evidently less costly under this workload than read misses.
Ulrich Drepper's article What Every Programmer Should Know About Memory is recommended reading if you want to know more about this kind of thing.
Note that if you have this operation wrapped up into a function, then you will help the optimiser to generate better code if you use the restrict qualifier on your pointer arguments, like this:
void reorder(uint32_t * restrict buffer1, uint32_t * restrict buffer2)
{
    int i = 0;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            buffer1[x+y*width] = buffer2[i++];
}
(The restrict qualifier promises the compiler that the data pointed to by the two pointers doesn't overlap - which in this case is necessary for the function to make sense anyway).
Each pixel read in the first version has linear locality of reference; the second blows your cache on every read, having to go to main memory each time.
The processor can handle writes with bad locality much more efficiently than reads. If a write has to go to main memory, it can happen in parallel with other read/arithmetic operations, whereas if a read misses the cache it can completely stall the processor while it waits for the data to filter through the cache hierarchy.
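If you control the loop, a common further remedy (my addition, not from the answers above) is to tile it so that both buffers are touched in cache-sized blocks instead of one of them being strided across the whole image:
/* sketch: process the image in 32x32 tiles; the tile size is a guess
   that needs tuning per device. The indexing matches the question:
   buffer2 is read at x*height + y, i.e. in transposed order. */
const unsigned int TILE = 32;
for (unsigned int ty = 0; ty < height; ty += TILE)
    for (unsigned int tx = 0; tx < width; tx += TILE)
        for (unsigned int y = ty; y < ty + TILE && y < height; y++)
            for (unsigned int x = tx; x < tx + TILE && x < width; x++)
                buffer1[x + y * width] = buffer2[x * height + y];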
