Performance difference in looping

Will there be a huge performance difference between:
if (this.chkSelectAll.Checked)
    for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
        this.listBoxColumns.SetSelected(i, true);
else
    for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
        this.listBoxColumns.SetSelected(i, false);

vs.

for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
    this.listBoxColumns.SetSelected(i, this.chkSelectAll.Checked);
Which one is advisable: concise code, or the performance gain?

I wouldn't expect to see much performance difference, and I'd certainly go with the latter as it's more readable. (I'd put braces round it though.)
It's quite easy to imagine a situation where you might need to change the loop, and with the first example you might accidentally only change one of them instead of both. If you really want to avoid calling the Checked property in every iteration, you could always do:
bool selectAll = this.chkSelectAll.Checked; // "checked" is a reserved word in C#
for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
{
    this.listBoxColumns.SetSelected(i, selectAll);
}
As ever, write the most readable code first, and measure/profile any performance differences before bending your design/code out of shape for the sake of performance.

I suppose the performance difference will be barely noticeable. However here's a variation that is both efficient and highly readable:
bool isChecked = this.chkSelectAll.Checked;
for (int i = 0; i < this.listBoxColumns.Items.Count; i++) {
    this.listBoxColumns.SetSelected(i, isChecked);
}
If you're after some real optimization you will also want to pay attention to whether the overhead of accessing "this.listBoxColumns" twice on each iteration is present in the first place and is worth paying attention to. That's what profiling is for.

You have an extra boolean check in the first example. But having said that, I can't imagine that the performance difference will be anything other than negligible. Have you tried measuring this in your particular scenario?
The second example is preferable since you're not repeating the loop code.

I can't see there being a significant performance difference between the two. The way to confirm it would be to set up a benchmark and time the different algorithms over 1000s of iterations.
However as it's UI code any performance gain is pretty meaningless as you are going to be waiting for the user to read the dialog and decide what to do next.
Personally I'd go for the second approach every time. You've only got one loop to maintain, and the code is clearer.

Any performance difference will be negligible.
Your primary concern should be code readability and maintainability.
Micro-optimisations such as this are, more often than not, misplaced. Always profile before being concerned with performance.

It's most likely to be negligible. More importantly, however, I feel the need to quote the following:
"Premature optimisation is the root of all evil"
The second is easily the more readable, so simply go with that, unless you later find a need to optimise (which is quite unlikely in my opinion).

Why not use System.Diagnostics.Stopwatch and compare the two yourself? However, I don't believe there's going to be any real performance difference. The first example might be faster because you're only accessing chkSelectAll.Checked once. Both are easily readable though.
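The same measurement can be sketched in C, the language several of the later questions use. The functions and arrays below are made-up stand-ins for the ListBox (there is no real UI here); clock() from <time.h> plays the role of Stopwatch:

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define N 1000

/* Hypothetical stand-ins for the ListBox: arrays of selection flags. */
static bool selected_a[N];
static bool selected_b[N];

/* Variant 1: branch once, then run one of two loops. */
void select_all_branched(bool *sel, int n, bool check_state) {
    if (check_state)
        for (int i = 0; i < n; i++) sel[i] = true;
    else
        for (int i = 0; i < n; i++) sel[i] = false;
}

/* Variant 2: one loop; the flag is read once and passed in. */
void select_all_flat(bool *sel, int n, bool check_state) {
    for (int i = 0; i < n; i++) sel[i] = check_state;
}

/* Time a variant over many repetitions, Stopwatch-style. */
double time_variant(void (*f)(bool *, int, bool), bool *sel, int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) f(sel, N, r & 1);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

On any real workload the two timings will almost certainly be indistinguishable, which is the point the answers above are making.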

Why do most/all languages not have a multi-loop break function?

I haven't seen any programming language that has a simple function to break out of nested loops. I'm wondering:
Is there something at a low level making this impossible?
Here is an example of what I mean (C#):
while (true)
{
    foreach (String s in myStrings)
    {
        break 2; // This would break out of both the foreach and while
    }
}
Just break; would be like break 1;, and take you out of only the innermost loop.
Some languages have messy alternatives like goto, which work but are not as elegant as the above proposed solution.
No, there's nothing at a low level that would prevent creating something like what you propose. There are also plenty of languages implemented in a high-level language that would definitely allow this kind of behaviour.
There are, however, several considerations when designing a language that speak against this sort of construct, at least from my point of view.
The main argument is that it opens the door to very complex structure in something that is probably already overly complex. Nested loops are inherently hard to follow and to figure out what is happening.
Adding your construct could make them even more complex.
Unless you consider a return statement as a sort of break of course :)
Perl does have something very similar to this feature. Instead of the number of nestings, you use a label much like goto.
FOREVER: while (1) {
    for my $string (@strings) {
        last FOREVER;
    }
}
The feature is intended to remove ambiguities when using loop controls in deeply nested loops. Using a label improves readability and it protects you should your level of nesting change. It reduces the amount of knowledge the inner loops have about the outer loops, though they still have knowledge.
The nesting is also non-lexical, it will cross subroutine boundaries. This is where it gets weird and more goto-like.
FOREVER: while (1) {
    for my $string (@strings) {
        do_something();
    }
}

sub do_something {
    last FOREVER;
}
This is not considered a good feature, for all the reasons @Codor lays out in their answer. This sort of feature encourages very complex logic. You're nearly always better off restructuring deeply nested loops into multiple subroutine calls.
sub forever {
    while (1) {
        for my $string (@strings) {
            return;
        }
    }
}
What you're asking for is, essentially, a restricted goto. It carries with it most of the same arguments for and against.
while (1) {
    for my $string (@strings) {
        goto FOREVER;
    }
}
FOREVER: print "I have escaped!\n";
The idea of saying "break out of the nth nested loop" is worse than a goto from a structural perspective. It means inner code has intimate knowledge of its surroundings. Should the nesting change, all of the inner loop controls may break. It creates a barrier to refactoring. For example, should you want to perform an extract method refactoring on the inner loop...
while (1) {
    twiddle_strings(@strings);
}

sub twiddle_strings {
    for my $string (@strings) {
        last 2;    # the hypothetical numbered break
    }
}
Now the code controlling the outer loop is in a completely different function from the inner loop. What if the outer loop changes?
while (1) {
    while (wait_for_lock) {
        twiddle_strings(@strings);
    }
}

sub twiddle_strings {
    for my $string (@strings) {
        last 2;    # now breaks out of the wrong loop
    }
}
PHP has had it since version 4. And IMHO it's not very good: it's quite easy to abuse. Especially when you add levels to iterations, remove some, or change the code logic inside the iterations. Code refactoring and optimization usually begin with reviewing the iterations and trying to reduce cycles to conserve CPU usage. During that kind of optimization it's easy to miss a continue level, and finding that kind of introduced bug is not an easy task if multiple people are working on a project.
I'd prefer goto any time, since it's (usually) much safer, unless you know exactly what you are doing. Also, goto reminds me of BASIC.
Although this is more speculation, there are some arguments both for and against such a possibility.
Pro:
Might be very elegant in some cases.
Con:
Might result in a temptation to write deeply nested loops, which is seen as undesirable by some.
The desired behaviour can be implemented with goto.
The desired behaviour can be implemented with auxiliary variables.
Nested loops can in most cases be refactored to use just one loop.
The desired behaviour can be implemented using exception handling (although, on the other hand, controlling the expected flow of execution is not the primary task of exception handling).
That being said, I consider responsible usage of goto to be legitimate.
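For completeness, here is what the goto alternative mentioned above looks like in C, which has neither labeled loops nor a numbered break. The label sits just past both loops, so one jump leaves them together (find_in_grid is a made-up example function):

```c
#include <stddef.h>

/* Search a rows*cols grid for target; return its flat index, or -1.
 * goto is the usual C way to leave several nested loops at once. */
int find_in_grid(const int *grid, size_t rows, size_t cols, int target) {
    int found = -1;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            if (grid[r * cols + c] == target) {
                found = (int)(r * cols + c);
                goto done;   /* breaks out of both loops */
            }
        }
    }
done:
    return found;
}
```

Wrapping the loops in a function and using return, as the Perl answer shows, avoids the label entirely and is often the cleaner refactoring.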

How would you avoid False Sharing in a scenario like this?

In the code below I have parallelised using OpenMP's standard parallel for clause.
#pragma omp parallel for private(i, j, k, d_equ) shared(cells, tmp_cells, params)
for (i = 0; i < some_large_value; i++)
{
    for (j = 0; j < some_large_value; j++)
    {
        ...
        // Some operations performed over here which are using private variables
        ...
        // Accessing a shared array is causing False Sharing
        for (k = 0; k < 10; k++)
        {
            cells[i * width + j].speeds[k] = some_calculation(i, j, k, cells);
        }
    }
}
This has given me a significant improvement to runtime (~140s to ~40s) but there is still one area I have noticed really lags behind - the innermost loop I marked above.
I know for certain the array above is causing False Sharing because if I make the change below, I see another huge leap in performance (~40s to ~13s).
for (k = 0; k < 10; k++)
{
    double somevalue = some_calculation(i, j);
}
In other words, as soon as I changed the memory location to write to a private variable, there was a huge speed up improvement.
Is there any way I can improve my runtime by avoiding False Sharing in the scenario I have just explained? I cannot seem to find many resources online that seem to help with this problem even though the problem itself is mentioned a lot.
I had an idea to create an overly large array (10x what is needed) so that enough margin space is kept between each element to make sure when it enters the cache line, no other thread will pick it up. However this failed to create the desired effect.
Is there any easy (or even hard if needs be) way of reducing or removing the False Sharing found in that loop?
Any form of insight or help will be greatly appreciated!
EDIT: Assume some_calculation() does the following:
(tmp_cells[ii*params.nx + jj].speeds[kk] + params.omega * (d_equ[kk] - tmp_cells[ii*params.nx + jj].speeds[kk]));
I cannot move this calculation out of my for loop because I rely on d_equ which is calculated for each iteration.
Before answering your question, I have to ask: is it really a false sharing situation when you pass the whole cells array as the input to some_calculation()? It seems you are actually sharing the whole array. You may want to provide more info about this function.
If yes, go on with the following.
You've already shown that the private variable double somevalue will improve the performance. Why not just use this approach?
Instead of using a single double variable, you could define a private array private_speed[10] just before the for k loop, calculate into it in the loop, and copy it back to cells after the loop with something like
memcpy(cells[i * width + j].speeds, private_speed, sizeof(...));
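That private-buffer pattern, plus the other standard remedy (padding each element out to a cache-line boundary so two threads never write to the same line), can be sketched as below. Everything here is an assumption standing in for the real code: the 64-byte line size, the cell layout, and the stand-in calculation; the `#pragma omp parallel for` would go on the outer loop:

```c
#include <string.h>

#define CACHE_LINE 64  /* assumed line size; check your CPU */

/* A cell aligned so consecutive elements never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) double speeds[10];
    /* 10 doubles = 80 bytes; the alignment rounds sizeof up to 128. */
} padded_cell;

/* The private-buffer pattern, shown serially: compute into a local
 * array, then publish the result with a single memcpy per cell. */
void update_cells(padded_cell *cells, int n) {
    for (int i = 0; i < n; i++) {       /* omp pragma would go here */
        double private_speed[10];
        for (int k = 0; k < 10; k++)
            private_speed[k] = i * 10.0 + k;   /* stand-in calculation */
        memcpy(cells[i].speeds, private_speed, sizeof private_speed);
    }
}
```

Padding trades memory for isolation, which is why the "10x oversized array" idea in the question was on the right track; the key detail is that the gap between elements written by different threads must be at least one full cache line.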

Is it best practice to use array[array.length - 1] or roll your own method?

For example (in JavaScript):
//Not that I would ever add a method to a base JavaScript prototype...
//(*wink nudge*)...
Array.prototype.lastIndex = function() {
    return this.length - 1;
};
console.log(array[array.lastIndex()]);
vs
console.log(array[array.length - 1]);
Technically speaking, the latter method uses one less character, but also utilizes a magic number. Granted, the readability may not really be affected in this case, but magic numbers suck. Which is better practice to use?
I'm of the opinion that 1 and 0 don't really count as "magic numbers" in many cases. When you're referring to the index of the last item (i.e. length - 1), that would definitely be one time where I would not consider 1 a magic number.
Different languages have their own idiomatic ways of accessing the last element of an array, and that should be used. For example, in Groovy that would be:
myArray.last()
While in C, one would very likely do:
my_array[len - 1]
and in Common Lisp, something like:
(first (last my_list))
I agree with @DragoonWraith that 1 is not a magic number. However, it's not about magic numbers but about readability. If you need the last index, use myArray.lastIndex(); if you need the last element, use myArray.last() or myArray.lastElement(). It's way easier to read and understand than myArray[myArray.length - 1].
My take is that we should be looking for the style which the most programmers will be familiar with. Given that anyone who's been programming for more than a couple weeks in a language with this sort of array syntax (i.e., C-influenced imperative languages) will be comfortable with the idea that arrays use 0-based indexing, I suspect that anyone reading your code will understand what array[array.length-1] means.
The method calls are a bit less standard, and are language-specific, so you'll spend a bit more time understanding that if you're not totally familiar with the language. This alone makes me prefer the length-1 style.
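In C, the language of the `my_array[len - 1]` idiom above, there is no method to attach, but the same intent can be captured once in small helpers (last_index and last_int are made-up names for illustration):

```c
#include <stddef.h>

/* Index of the last element of an array of length len (len must be > 0). */
static inline size_t last_index(size_t len) {
    return len - 1;
}

/* The idiom the helper wraps: fetch the last element directly. */
static inline int last_int(const int *arr, size_t len) {
    return arr[len - 1];
}
```

Whether the extra name is worth it is exactly the trade-off debated above: the helper reads well, but `arr[len - 1]` is what every C programmer already recognizes.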

Is SIMD Worth It? Is there a better option?

I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer one, but I'm wondering if there is a way of optimizing something like:
void collide(particle particles[], box boxes[],
             double boxShiftX, double boxShiftY) {/*{{{*/
    int i;
    double nX;
    double nY;
    int boxnum;
    for (i = 0; i < PART_COUNT; i++) {
        boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
                  BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
        //copied and pasted the macro which is why it's kinda odd looking
        particles[i].vX -= boxes[boxnum].mX;
        particles[i].vY -= boxes[boxnum].mY;
        if (boxes[boxnum].rotDir == 1) {
            nX = particles[i].vX*Wxx+particles[i].vY*Wxy;
            nY = particles[i].vX*Wyx+particles[i].vY*Wyy;
        } else { //to make it randomly pick a rot. direction
            nX = particles[i].vX*Wxx-particles[i].vY*Wxy;
            nY = -particles[i].vX*Wyx+particles[i].vY*Wyy;
        }
        particles[i].vX = nX + boxes[boxnum].mX;
        particles[i].vY = nY + boxes[boxnum].mY;
    }
}/*}}}*/
I've looked at SIMD, though I can't find much about it, and I'm not entirely sure that the processing required to properly extract and pack the data would be worth the gain of doing half as many instructions, since apparently only two doubles can be used at a time.
I tried breaking it up into multiple threads with shm and pthread_barrier (to synchronize the different stages, of which the above code is one), but it just made it slower.
My current code does go pretty quickly; it's on the order of one second per 10M particle*iterations, and from what I can tell from gprof, 30% of my time is spent in that function alone (5000 calls; PART_COUNT=8192 particles took 1.8 seconds). I'm not worried about small, constant time things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week last time.
I guess my question is if there is any way of dealing with these long vectors that is more efficient than just looping through them. I feel like there should be, but I can't find it.
I'm not sure how much SIMD would benefit; the inner loop is pretty small and simple, so I'd guess (just by looking) that you're probably more memory-bound than anything else. With that in mind, I'd try rewriting the main part of the loop to not touch the particles array more than needed:
const double temp_vX = particles[i].vX - boxes[boxnum].mX;
const double temp_vY = particles[i].vY - boxes[boxnum].mY;

if (boxes[boxnum].rotDir == 1)
{
    nX = temp_vX*Wxx + temp_vY*Wxy;
    nY = temp_vX*Wyx + temp_vY*Wyy;
}
else
{
    // to make it randomly pick a rot. direction
    nX = temp_vX*Wxx - temp_vY*Wxy;
    nY = -temp_vX*Wyx + temp_vY*Wyy;
}

particles[i].vX = nX;
particles[i].vY = nY;
This has the small potential side effect of not doing the extra addition at the end.
Another potential speedup would be to use __restrict on the particle array, so that the compiler can better optimize the writes to the velocities. Also, if Wxx etc. are global variables, they may have to get reloaded each time through the loop instead of possibly stored in registers; using __restrict would help with that too.
Since you're accessing the particles in order, you can try prefetching (e.g. __builtin_prefetch on GCC) a few particles ahead to reduce cache misses. Prefetching on the boxes is a bit tougher since you're accessing them in an unpredictable order; you could try something like
int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc...
// prefetch boxes[nextBoxnum]
One last one that I just noticed - if box::rotDir is always +/- 1.0, then you can eliminate the comparison and branch in the inner loop like this:
const double rot = boxes[boxnum].rotDir; // always +/- 1.0
nX = particles[i].vX*Wxx + rot*particles[i].vY*Wxy;
nY = rot*particles[i].vX*Wyx + particles[i].vY*Wyy;
Naturally, the usual caveats of profiling before and after apply. But I think all of these might help, and can be done regardless of whether or not you switch to SIMD.
Just for the record, there's also libSIMDx86!
http://simdx86.sourceforge.net/Modules.html
(On compiling you may also try: gcc -O3 -msse2 or similar).
((int)(particles[i].sX+boxShiftX))/BOX_SIZE
That's expensive if sX is an int (can't tell). Truncate boxShiftX/Y to an int before entering the loop.
Do you have sufficient profiling to tell you where the time is spent within that function?
For instance, are you sure it's not your divs and mods in the boxnum calculation where the time is being spent? Sometimes compilers fail to spot possible shift/AND alternatives, even where a human (or at least, one who knew BOX_SIZE and BWIDTH/BHEIGHT, which I don't) might be able to.
It would be a pity to spend lots of time on SIMDifying the wrong bit of the code...
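To illustrate the shift/AND point: when BOX_SIZE and BWIDTH are powers of two (an assumption; the question doesn't give their values), the division and modulo reduce to a shift and a mask, which is worth checking before reaching for SIMD:

```c
#define BOX_SIZE 16   /* assumed power of two */
#define BWIDTH   64   /* assumed power of two */

/* Straightforward version, as in the question's macro. */
static int box_col_div(int x) {
    return (x / BOX_SIZE) % BWIDTH;
}

/* Shift/AND version: valid for non-negative x when both
 * constants are powers of two. */
static int box_col_shift(int x) {
    return (x >> 4) & (BWIDTH - 1);   /* 4 = log2(BOX_SIZE) */
}
```

Note the restriction to non-negative x: for negative values, C's division truncates toward zero while the shift rounds toward negative infinity, so the two are not equivalent there, which is one reason compilers can't always make this substitution themselves.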
The other thing which might be worth looking into is if the work can be coerced into something which could work with a library like IPP, which will make well-informed decisions about how best to use the processor.
Your algorithm has too many memory, integer and branch instructions to have enough independent flops to profit from SIMD. The pipeline will be constantly stalled.
Finding a more effective way to randomize would be top of the list. Then, try to work either in float or int, but not both. Recast conditionals as arithmetic, or at least as a select operation. Only then does SIMD become a realistic proposition

C functions overusing parameters?

I have legacy C code base at work and I find a lot of function implementations in the style below.
char *DoStuff(char *inPtr, char *outPtr, char *error, long *amount)
{
    *error = 0;
    *amount = 0;
    // Read bytes from inPtr and decode them as a long, storing in amount
    // before returning as a formatted string in outPtr.
    return (outPtr);
}
Using DoStuff:
myOutPtr = DoStuff(myInPtr, myOutPtr, myError, &myAmount);
I find that pretty obtuse and when I need to implement a similar function I end up doing:
long NewDoStuff(char *inPtr, char *error)
{
    long amount = 0;
    *error = 0;
    // Read bytes from inPtr and decode them as a long, storing in amount.
    return amount;
}
Using NewDoStuff:
myAmount = NewDoStuff(myInPtr, myError);
myOutPtr += sprintf(myOutPtr, "%ld", myAmount);
I can't help but wondering if there is something I'm missing with the top example, is there a good reason to use that type of approach?
One advantage is that if you have many, many calls to these functions in your code, it will quickly become tedious to have to repeat the sprintf calls over and over again.
Also, returning the out pointer makes it possible for you to do things like:
DoOtherStuff(DoStuff(myInPtr, myOutPtr, myError, &myAmount), &myOther);
With your new approach, the equivalent code is quite a lot more verbose:
myAmount = NewDoStuff(myInPtr, myError);
myOutPtr += sprintf(myOutPtr, "%ld", myAmount);
myOther = DoOtherStuff(myInPtr, myError);
myOutPtr += sprintf(myOutPtr, "%ld", myOther);
It is the C standard library style. The return value is there to aid chaining of function calls.
Also, DoStuff is cleaner IMO. And you really should be using snprintf. A change in the internals of buffer management does not affect your code with DoStuff; that is no longer true with NewDoStuff.
The code you presented is a little unclear (for example, why are you adding myOutPtr to the result of the sprintf?).
However, in general what you're describing is the breakdown of one function that does two things into a function that does one thing plus code that does something else (the formatting and concatenation).
Separating responsibilities into two functions is a good idea. However, whether you would want a separate function for this concatenation and formatting is really not clear.
In addition, every time you break a function call into multiple calls, you are creating code replication. Code replication is never a good idea, so you would need a function to do that, and you will end up (this being C) with something that looks like your original DoStuff.
So I am not sure that there is much you can do about this. One of the limitations of non-OOP languages is that you have to send huge amounts of parameters (unless you used structs). You might not be able to avoid the giant interface.
If you wind up having to do the sprintf call after every call to NewDoStuff, then you are repeating yourself (and therefore violating the DRY principle). When you realize that you need to format it differently you will need to change it in every location instead of just the one.
As a rule of thumb, if the interface to one of my functions exceeds 110 columns, I look strongly at using a structure (and if I'm taking the best approach). What I don't (ever) want to do is take a function that does 5 things and break it into 5 functions, unless some functionality within the function is not only useful, but needed on its own.
I would favor the first function, but I'm also quite accustomed to the standard C style.
