How to profile / identify the slow steps in a tight processing loop? - c

I have some proprietary image processing code. It walks over an image and computes some statistics on it. An example of the kind of code I'm talking about can be seen below, although this is not the algorithm that needs optimizing.
My question is: what tools exist for profiling these kinds of tight loops to determine where things are slow? Sleepy and Windows Performance Analyzer both focus on identifying which methods/functions are slow. I already know which function is slow; I just need to figure out how to optimize it.
void BGR2YUV(IplImage* bgrImg, IplImage* yuvImg)
{
    const int height = bgrImg->height;
    const int width = bgrImg->width;
    const int step = bgrImg->widthStep;
    const int channels = bgrImg->nChannels;
    assert(channels == 3);
    assert(bgrImg->height == yuvImg->height);
    assert(bgrImg->width == yuvImg->width);
    // for reasons that are not clear to me, these are not the same.
    // Code below has been modified to reflect this fact, but if they
    // could be the same, the code below gets sped up a bit.
    // assert(bgrImg->widthStep == yuvImg->widthStep);
    assert(bgrImg->nChannels == yuvImg->nChannels);
    const uchar* bgr = (uchar*) bgrImg->imageData;
    uchar* yuv = (uchar*) yuvImg->imageData;
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            const int ixBGR = i*step + j*channels;
            const int b = (int) bgr[ixBGR+0];
            const int g = (int) bgr[ixBGR+1];
            const int r = (int) bgr[ixBGR+2];
            const int y = (int) (0.299 * r + 0.587 * g + 0.114 * b);
            const double di = 0.596 * r - 0.274 * g - 0.322 * b;
            const double dq = 0.211 * r - 0.523 * g + 0.312 * b;
            // Do some shifting and trimming to get i & q to fit into uchars.
            const int iv = (int) (128 + max(-128.0, min(127.0, di)));
            const int q = (int) (128 + max(-128.0, min(127.0, dq)));
            const int ixYUV = i*yuvImg->widthStep + j*channels;
            yuv[ixYUV+0] = (uchar) y;
            yuv[ixYUV+1] = (uchar) iv;
            yuv[ixYUV+2] = (uchar) q;
        }
    }
}

Since you can't share the code, I have some general suggestions. First, remember that profilers tell you which part of the code is taking the most time, and more advanced ones can suggest some modifications to improve the speed. But in general, algorithmic optimizations gain much more speedup than tweaking the code. For the sample code you're sharing, if you google efficient or fast RGB to YUV conversion, you will find loads of methods (from lookup tables to SSE2 and GPU utilization) that improve the speed drastically, and I'm sure none of the profilers could suggest any of them.
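To give a flavour of the lookup-table approach (a minimal sketch for the Y channel only, using 16.16 fixed point; the I and Q tables would be built the same way, and the table and function names here are hypothetical):

#include <stdint.h>

static int32_t yR[256], yG[256], yB[256];

/* Build the tables once; each entry is coefficient * value in 16.16 fixed point. */
static void init_y_tables(void)
{
    for (int v = 0; v < 256; v++)
    {
        yR[v] = (int32_t)(0.299 * v * 65536.0 + 0.5);
        yG[v] = (int32_t)(0.587 * v * 65536.0 + 0.5);
        yB[v] = (int32_t)(0.114 * v * 65536.0 + 0.5);
    }
}

/* In the inner loop, three double multiplies and a double-to-int
   conversion collapse into three loads, two adds and a shift. */
static inline uint8_t y_from_bgr(uint8_t b, uint8_t g, uint8_t r)
{
    return (uint8_t)((yR[r] + yG[g] + yB[b]) >> 16);
}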
So once you know what part of the method is slow, you can follow these two steps:
algorithmic optimization: understand what the algorithm is doing and try to come up with a more optimized algorithm. Google is your friend; it's likely someone has already thought about optimizing that algorithm and shared the idea/code with the world. That said, you often have to consider the constraints you have. For example, the simplest but most effective way to speed up image processing code is to reduce the image to the smallest size possible (a resize sketch follows these two steps). A good rule of thumb is to question every single assumption made in the code/algorithm. E.g., is processing an 800x600 image necessary, or could the size be reduced to 320x240 without compromising accuracy? Is processing a three-channel image necessary, or could the same be achieved with a grayscale image? I think you get the idea.
implementation optimization: some advanced profiling tools can suggest how to tweak the code, and you can try to find one that's affordable. Some might not agree, but I don't think such tools are necessary. Often in image processing exact values are not needed; a rough approximation of the filter response, computed with integers instead of double floats for example, can be used instead (see the fixed-point sketch below). SIMD instructions, and more recently GPUs, have been shown to be perfectly suitable for optimizing image processing methods, and you should consider them if possible. You can always google how to optimize loops or some specific operation. And after all that is done, one possibility is to break your code down into smaller logical pieces and change them such that the algorithm or method is not revealed by sharing the pieces. Then you can share each piece on SO and ask for others' opinions on how to optimize it.
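As a concrete instance of the algorithmic step, downscaling before computing statistics is a one-liner with the OpenCV C API already used in the question (a sketch; the 320x240 target is just the example size from above, and CV_INTER_AREA is a reasonable interpolation choice when shrinking):

IplImage* scaled = cvCreateImage(cvSize(320, 240), bgrImg->depth, bgrImg->nChannels);
cvResize(bgrImg, scaled, CV_INTER_AREA);
/* ...run the statistics on scaled instead of bgrImg... */
cvReleaseImage(&scaled);

And as an instance of the implementation step, the floating-point Y computation from the question can be approximated entirely in integers (a sketch; the coefficients are scaled by 256, and 77 + 150 + 29 = 256, so the result stays in range):

/* Integer approximation of y = 0.299*r + 0.587*g + 0.114*b,
   accurate to within roughly one least significant bit. */
static inline unsigned char y_fixed(int b, int g, int r)
{
    return (unsigned char)((77 * r + 150 * g + 29 * b) >> 8);
}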

Related

optimized copying from an openCV mat to a 2D float array

I want to copy an OpenCV Mat variable into a 2D float array. I used the code below for this purpose, but because I'm developing code where speed is a very important metric, this way of copying is not optimized enough. Is there another, more optimized way?
float *ImgSrc_f;
ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float));
for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)
    {
        ImgSrc_f[i * 512 + j] = ImgSrc.at<float>(i, j);
    }
Really,
//first method
Mat img_cropped = ImgSrc(Rect(0, 0, 512, 512)).clone();
float *ImgSrc_f = (float *)img_cropped.data;
should be no more than a couple of percent less efficient than the best method. I suggest this one as long as it doesn't lose more than 1% to the second method.
Also try this, which is very close to the absolute best method unless you can use some kind of extended CPU instruction set (and if such ones are available). You'll probably see minimal difference between method 2 and method 1.
//method 2
//preallocated memory
//largest possible copies with minimum number of direct copy calls
//caching pointer arithmetic can speed up things extremely minimally
//removing the Mat header with a malloc can speed up things minimally
//ROI (a cv::Rect) is assumed to come from elsewhere
Mat img_roi = ImgSrc(ROI);
Mat img_copied(ROI.height, ROI.width, CV_32FC1);
for (int y = 0; y < ROI.height; ++y)
{
    unsigned char* rowptr = img_roi.data + y * img_roi.step;   //ROI offset is already baked into img_roi.data
    unsigned char* rowptr2 = img_copied.data + y * img_copied.step;
    memcpy(rowptr2, rowptr, ROI.width * sizeof(float));
}
Basically, if you really, really care about these details of performance, you should stay away from overloaded operators in general. The more levels of abstraction in code, the higher the penalty cost. Of course that makes your code more dangerous, harder to read, and bug-prone.
Can you use a std::vector instead? If so, try
std::vector<float> container;
container.assign((float*)matrix.datastart, (float*)matrix.dataend);
(datastart and dataend are valid bounds only when the Mat is continuous.)

Do people really pad their arrays for efficient cache access in high-performance computing?

I'm reading a book on parallel processing called Intel Xeon Phi Coprocessor High Performance Programming. The book recommends that high-performance programs pad array data in order to align heavily used memory addresses for efficient cache line access.
For statically defined arrays, this can be done like so:
size_t size = 1024 * 1024;
float buff[size] __attribute__((aligned(64)));
The book also demonstrates how to align dynamically created arrays. Here's an example for a 2D matrix:
const size_t kPaddingSize = 64;
int height = 10000;
int width = ((5900 * sizeof(real) + kPaddingSize - 1) / kPaddingSize)
            * (kPaddingSize / sizeof(real));
size_t size = sizeof(real) * width * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);
real *fout = (real *)_mm_malloc(size, kPaddingSize);
/* ...use the arrays... */
_mm_free(fin);
_mm_free(fout);
Aligning the arrays has real performance benefits. However, coming from my background as a user-facing software developer, the approach seems undesirable for several reasons:
The size of the cache line, kPaddingSize, is hard-coded in the program (a run-time alternative is sketched after this list).
Padding arrays adds a large amount of complexity to the program. Simply creating the arrays is hard enough, but accessing their members is even worse.
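For illustration, here is a minimal sketch of how the hard-coded constant could be avoided: query the line size at run time (the sysconf key is Linux/glibc-specific) and use C11's aligned_alloc in place of the compiler-specific _mm_malloc:

#include <stdlib.h>
#include <unistd.h>

static float *alloc_padded_rows(size_t height, size_t width_elems)
{
    long queried = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* Linux/glibc only */
    size_t line = (queried > 0) ? (size_t)queried : 64; /* fall back to an assumed 64 */

    /* Pad each row up to a whole number of cache lines; the caller
       must index with the padded stride (row_bytes / sizeof(float)). */
    size_t row_bytes = ((width_elems * sizeof(float) + line - 1) / line) * line;

    /* aligned_alloc (C11) wants the total size to be a multiple of the alignment. */
    return (float *)aligned_alloc(line, height * row_bytes);
}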
I have two questions:
How do high-performance application developers mitigate these problems?
Is padding arrays a commonplace practice?

Fastest way to traverse columns in a multidimensional array in C

I'm currently working on a program to solve the red/blue computation; the program is written in C.
A description of the problem is here: http://www.cs.utah.edu/~mhall/cs4961f10/CS4961-L9.pdf
tl;dr you have a grid of colors (red/blue/white); first red cells move to the right according to certain rules, then blue cells move down according to other rules.
I've got my program working and giving correct output, and I'm now trying to see if I can't speed it up at all.
Using Intel's VTune Amplifier (this is for a parallel programming course, and we're doing pthreads in Visual Studio with Parallel Studio integrated), I've identified that the biggest hotspot in my code is when moving blue cells.
Implementation details: grid is stored as a dynamically allocated int **, set up this way:
globalBoard = malloc(sizeof(int *) * size);
for (i = 0; i < size; i++)
{
    globalBoard[i] = malloc(sizeof(int) * size);
    for (j = 0; j < size; j++)
        globalBoard[i][j] = rand() % 3;
}
After some research, I believe the cause of the hotspot (almost 4 times as much CPU time as moving red cells) is cache misses when traversing column by column.
I understand that under the hood, this grid will be stored as a 1d array, so when I move red cells to the right and go row by row, I'm most often checking contiguous values, so the CPU doesn't need to load new values into the cache as often, whereas going column by column results in jumping around through the array by amounts that only increase as the size of the board does.
All that being said, I want this particular section to go faster. Here's the code as it stands now:
void blueStep(int col)
{
    int i;
    int local[size];
    for (i = 0; i < size; i++)
        local[i] = globalBoard[i][col];
    for (i = 0; i < size; i++)
    {
        if (i < size - 1)
        {
            if (globalBoard[i][col] == 2 && globalBoard[i + 1][col] == 0)
            {
                local[i++] = 0;
                local[i] = 2;
            }
        }
        else
        {
            if (globalBoard[i][col] == 2 && globalBoard[0][col] == 0)
            {
                local[i++] = 0;
                local[0] = 2;
            }
        }
    }
    for (i = 0; i < size; i++)
        globalBoard[i][col] = local[i];
}
Here, col is which column to work on and size is how big the grid is (it's always square).
I was thinking that I might be able to do some kind of fancy pointer arithmetic to speed this up, and was reading this: http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/pointer.html.
Looking at that, I feel like I might need to change how I declare the grid in order to take advantage of 2d array pointer arithmetic, but I'm still not sure how I would go about traversing columns using that method.
Any help with that, or any other suggestions of fast ways to go through a column are welcome.
UPDATE: After a bit more research and discussion, it would seem my assumptions were incorrect. It turns out it's actually taking almost twice as long to write the results back to the global array as it is to loop over columns, due to false sharing. That said, I'm still somewhat curious to see if there are any better ways of doing column traversal.
I think the answer is to process the grid in tiles. You can do a very quick tile move, either down or right, in a 16x16 or 32x32 tile. The two moves will be effectively the same, and run at the same speed: read all values into XMM registers, process, write. You may want to investigate the MASKMOVDQU instruction here. If I understand the nature of the problem, you can overlap tiles by one row/column, and this will work okay if you process them in the usual (scan) order. If not, you have to handle stitching the tiles separately.
There is no truly fast way to do this in plain C code. However, you can try (1) changing your board type to uint8_t, (2) replacing all if statements with arithmetic, like this: value = (mask & value) | (~mask & newvalue), and (3) turning on maximum loop unrolling and auto-vectorization in the compiler options. This will give you a nice speedup, especially from avoiding the conditionals.
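For illustration, a minimal sketch of idea (2) applied to the blue move (a hypothetical helper: it assumes the column has already been gathered into a contiguous uint8_t buffer, works in place, and ignores the wrap-around of the last cell):

#include <stddef.h>
#include <stdint.h>

/* Branchless blue move on one contiguous column: a blue cell (2)
   slides into a white cell (0) directly below it. */
static void blue_step_branchless(uint8_t *col, size_t size)
{
    for (size_t i = 0; i + 1 < size; i++)
    {
        uint8_t cur  = col[i];
        uint8_t next = col[i + 1];
        /* mask is 0xFF when the cell moves, 0x00 otherwise */
        uint8_t mask = (uint8_t)-(uint8_t)((cur == 2) & (next == 0));
        col[i]     = (uint8_t)(~mask & cur);                 /* cleared on a move */
        col[i + 1] = (uint8_t)((mask & 2) | (~mask & next)); /* becomes 2 on a move */
        i += mask & 1; /* skip the cell just moved into, like the i++ in the question */
    }
}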
EDIT In addition to tiles that can fit in registers, you can also do a second level of tiles sized to fit in your cache. I think the combination will run at roughly your memory bandwidth.
EDIT Or, make your board type be two bits: pack four cells to a byte. Goes nicely with the replacing if statements with arithmetic idea :)
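A sketch of that packing (hypothetical accessors; 0 = white, 1 = red, 2 = blue, four cells per byte, so a whole column touches a quarter of the memory):

#include <stddef.h>
#include <stdint.h>

static inline int get_cell(const uint8_t *board, size_t idx)
{
    return (board[idx >> 2] >> ((idx & 3) * 2)) & 3;
}

static inline void set_cell(uint8_t *board, size_t idx, int value)
{
    unsigned shift = (unsigned)(idx & 3) * 2;
    board[idx >> 2] = (uint8_t)((board[idx >> 2] & ~(3u << shift))
                                | ((unsigned)value << shift));
}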

Performance difference in accessing an item in an array vs pointer reference?

I'm fresh to C - used to scripting languages like PHP, JS, Ruby, etc. I've got a query in regard to performance. I know one should not micro-optimize too early - however, I'm writing a Ruby C extension for Google SketchUp where I'm doing lots of 3D calculations, so performance is a concern. (And this question is also for learning how C works.)
Often many iterations are done to process all the 3D data, so I'm trying to work out what might be faster.
I'm wondering if accessing an array entry many times is faster if I make a pointer reference to that array entry. What would common practice be?
struct FooBar arr[10];
int i;
for ( i = 0; i < 10; i++ ) {
    arr[i].foo = 10;
    arr[i].bar = 20;
    arr[i].biz = 30;
    arr[i].baz = 40;
}
Would this be faster or slower? Why?
struct FooBar arr[10], *item;
int i;
for ( i = 0; i < 10; i++ ) {
    item = &arr[i];
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
I looked around and found discussions about variables vs pointers - where it was generally said that pointers require an extra step, since they have to look up the address and then the value - but that in general there wasn't a big hit.
But what I was wondering was if accessing an array entry in C has much of a performance hit. In Ruby it is faster to make a reference to the entry if you need to access it many times - but that's Ruby...
There's unlikely to be a significant difference. Possibly the emitted code will be identical. This is assuming a vaguely competent compiler, with optimization enabled. You might like to look at the disassembled code, just to get a feel for some of the things a C optimizer gets up to. You may well conclude, "my code is mangled beyond all recognition, there's no point worrying about this kind of thing at this stage", which is a good instinct.
Conceivably the first code could even be faster, if introducing the item pointer were to somehow interfere with any loop unrolling or other optimization that your compiler performs on the first. Or it could be that the optimizer can figure out that arr[i].foo is equal to stack_pointer + sizeof(FooBar) * i, but fail to figure that out once you use the pointer, and end up using an extra register, spilling something else, with performance implications. But I'm speculating wildly on that point: there is usually little to no difference between accessing an array by pointer or by index. My point is just that any difference there is can arise for surprising reasons.
If I were worried, and felt like micro-optimizing it (or was just in a pointer-oriented mood), I'd skip the integer index and just use pointers all over:
struct FooBar arr[10], *item, *end = arr + sizeof arr / sizeof *arr;
for (item = arr; item < end; item++) {
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
But please note: I haven't compiled this (or your code) and counted the instructions, which is what you'd need to do. As well as running it and measuring of course, since some combinations of multiple instructions might be faster than shorter sequences of other instructions, and so on.
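If you do measure, a minimal harness along these lines is enough to compare the two loops (a sketch; clock_gettime with CLOCK_MONOTONIC is POSIX, fill_structs is a stand-in for whichever version you are testing, and the final return merely discourages the compiler from deleting the work):

#include <stdio.h>
#include <time.h>

struct FooBar { int foo, bar, biz, baz; };

/* The version under test; swap the body for the pointer variant to compare. */
static void fill_structs(struct FooBar *arr, int n)
{
    for (int i = 0; i < n; i++) {
        arr[i].foo = 10;
        arr[i].bar = 20;
        arr[i].biz = 30;
        arr[i].baz = 40;
    }
}

int main(void)
{
    struct FooBar arr[10];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 1000000; rep++)
        fill_structs(arr, 10);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%.2f ms for 1e6 iterations\n", ms);
    return arr[0].foo; /* keep the stores observable */
}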

Optimizing C loops

I'm new to C, coming from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid, as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact that they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data, J, y);
    /* Use J */
}
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter = 0; iter < 151; iter++) {
        for (jter = 0; jter < 151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving, I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). In particular, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the positions in P and J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter < 151; iter++)
{
    double* pP = &P[iter - 1][0];
    double* pJ = &data->J[iter][1];
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = - gamma * *pJ;
    }
}
This way you move much of the array index calculation outside of the loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
Variables on the stack are not initialized for you, but there are faster ways to initialize them than an element-by-element loop. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset(J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which can cause thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
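For the third part specifically, if a BLAS implementation is available, the J*v product maps directly onto cblas_dgemv (a sketch; it assumes v and Jv have been copied into plain double arrays, since BLAS knows nothing about Sundials' N_Vector):

#include <cblas.h>

/* Jv = J * v for a dense, row-major 151x151 Jacobian. */
static void jac_times_vec(const double J[151][151],
                          const double v[151], double Jv[151])
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 151, 151,
                1.0, &J[0][0], 151, v, 1,
                0.0, Jv, 1);
}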
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it would be wise to allocate 151*151*sizeof(double) bytes on the stack, no matter which system you are using. You will likely have to allocate the array dynamically, and then none of the above matters: you must then use memset() to set all bytes to zero.
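For instance, a sketch of the dynamic route:

#include <stdlib.h>
#include <string.h>

double (*J)[151] = malloc(151 * sizeof *J); /* 151x151 doubles on the heap */
if (J != NULL)
{
    memset(J, 0, 151 * sizeof *J); /* all-bytes-zero is 0.0 for doubles */
    /* ...use J[i][j] as before... */
    free(J);
}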
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimizations such as counting down towards zero rather than up, or using ++i rather than i++, etc. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really need high accuracy, I'd consider using float or int to speed up the algorithm.
