Once upon a time, when > was faster than < ... Wait, what? - c

I am reading an awesome OpenGL tutorial. It's really great, trust me. The topic I am currently on is the Z-buffer. Aside from explaining what it's all about, the author mentions that we can perform custom depth tests, such as GL_LESS, GL_ALWAYS, etc. He also explains that the actual meaning of depth values (which end is closest and which isn't) can also be customized. I understand so far. And then the author says something unbelievable:
The range zNear can be greater than the range zFar; if it is, then the
window-space values will be reversed, in terms of what constitutes
closest or farthest from the viewer.
Earlier, it was said that the window-space Z value of 0 is closest and
1 is farthest. However, if our clip-space Z values were negated, the
depth of 1 would be closest to the view and the depth of 0 would be
farthest. Yet, if we flip the direction of the depth test (GL_LESS to
GL_GREATER, etc), we get the exact same result. So it's really just a
convention. Indeed, flipping the sign of Z and the depth test was once
a vital performance optimization for many games.
If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.
Is the author making things up, am I misunderstanding something, or is it indeed the case that once < was slower (vitally, as the author says) than >?
Thanks for clarifying this quite curious matter!
Disclaimer: I am fully aware that algorithm complexity is the primary source for optimizations. Furthermore, I suspect that nowadays it definitely wouldn't make any difference and I am not asking this to optimize anything. I am just extremely, painfully, maybe prohibitively curious.

If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.
I didn't explain that particularly well, because it wasn't important. I just felt it was an interesting bit of trivia to add. I didn't intend to go over the algorithm specifically.
However, context is key. I never said that a < comparison was faster than a > comparison. Remember: we're talking about graphics hardware depth tests, not your CPU. Not operator<.
What I was referring to was a specific old optimization where one frame you would use GL_LESS with a range of [0, 0.5]. Next frame, you render with GL_GREATER with a range of [1.0, 0.5]. You go back and forth, literally "flipping the sign of Z and the depth test" every frame.
This loses one bit of depth precision, but you didn't have to clear the depth buffer, which once upon a time was a rather slow operation. Since depth clearing is not only free these days but actually faster than this technique, people don't do it anymore.
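For concreteness, here is a minimal sketch of that alternating scheme in C using the standard GL calls (drawScene and renderFrame are hypothetical names standing in for the application's own code, and all context/window setup is omitted):
#include <GL/gl.h>

/* drawScene() is a hypothetical stand-in for whatever issues the draw calls. */
void drawScene(void);

void renderFrame(int frameIndex)
{
    glEnable(GL_DEPTH_TEST);

    if (frameIndex % 2 == 0) {
        /* Even frames: conventional direction, using half the depth range. */
        glDepthFunc(GL_LESS);
        glDepthRange(0.0, 0.5);
    } else {
        /* Odd frames: flip both the test and the range, so this frame's values
           can never pass against whatever the previous frame left behind. */
        glDepthFunc(GL_GREATER);
        glDepthRange(1.0, 0.5);
    }

    /* No glClear(GL_DEPTH_BUFFER_BIT) here -- skipping that clear was the point. */
    glClear(GL_COLOR_BUFFER_BIT);
    drawScene();
}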

The answer is almost certainly that for whatever incarnation of chip+driver was used, the Hierarchical Z only worked in the one direction - this was a fairly common issue back in the day. Low level assembly/branching has nothing to do with it - Z-buffering is done in fixed function hardware, and is pipelined - there is no speculation and hence, no branch prediction.

It has to do with flag bits in highly tuned assembly.
x86 has both jl and jg instructions, but most RISC processors only have jl and jz (no jg).

Related

What is "DN" in the specification of ARMv6M ADD(register) T2 Encoding?

I'm reading the ARMv6-M specification, and the instruction's T2 encoding is as below.
0 1 0 0 0 1 | 0 0 | DN | Rm | Rdn
DN is always used as a prefix for the Rdn register number, and I can't understand why it isn't simply part of Rdn.
Many of the Thumb instructions, especially the ALU operations, have their three-bit register specifiers encoded in the lower 6 bits, three bits each, indicating r0-r7. This specific ADD instruction allows operations on both low and high registers, r0-r15, so the two extra bits need a home: one went into bit 6, joining bits 5:3, and the other went above that.
So perhaps they were thinking of saving a few gates, or of readability, or some other reason that can't be answered here at SO. Instead of using 7:4 and 3:0 like they would in a full-sized ARM instruction, for this one special one-off instruction they put the upper bits in 7:6. An even better question is why they didn't order them the same way for both registers: why is it [6,5,4,3] and [7,2,1,0] instead of [7,5,4,3] and [6,2,1,0]? IMO that would have helped readability, especially if you started off on the ARM Thumb docs that were originally only in print (paper), where H1/H2 seemed swapped.
In the pseudocode they show (DN:Rdn) and talk about it as a four-bit number, and then Rm as a four-bit number, so that expresses what the older docs did, just in a different way.
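To make the bit layout concrete, here is a small C sketch (the variable names and the example halfword are mine) that pulls DN, Rm and Rdn out of a 16-bit encoding of this form and reassembles the destination register number as DN:Rdn:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t insn = 0x4440;            /* example: ADD r0, r8  (0100 0100 0100 0000) */

    unsigned rdn = insn & 0x7;         /* bits [2:0] */
    unsigned rm  = (insn >> 3) & 0xF;  /* bits [6:3] */
    unsigned dn  = (insn >> 7) & 0x1;  /* bit  [7]   */

    unsigned rd  = (dn << 3) | rdn;    /* DN:Rdn forms the 4-bit register number */

    printf("ADD r%u, r%u\n", rd, rm);
    return 0;
}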
I suspect they used the lowercase n at the end for the lower bits and the capital N for the upper bit, and yes, it would have read better as RdN instead of DN; Rd3 would have been even better than that.
Instruction sets can, to some extent, do whatever they want, and ARM is no different here; the designers, way back when Thumb started, chose what they chose. It was an arbitrary decision, and you will be lucky to find anyone here who was in the room at the time. I have seen these things be mistakes too: in the meeting one thing is decided, someone's implementation is backward, but by the time that gets back around to the room too much investment has been made in testing, etc., so it is just left as it is. It is possible that those engineers are already retired, or got golden parachutes a while ago; maybe you will get lucky.
Also understand that the documentation folks are often a separate department, so this could have been a game-time decision (or typo) by an individual technical writer, with a later decision not to change the docs down the road.
Don't read anything magical into this or treat it as some industry nomenclature; that is a bad habit to have anyway. What matters is that you understand what the bits do, not how they are labelled.

Does the Expectimax AI still work as expected if not all branches are explored to the same depth?

I am making an expectimax AI, and the branching factor of this game is unpredictable, ranging from 6 to 20. I'm currently exploring the game tree for 1 second every turn, then making sure the whole game tree is explored to the same depth, but occasionally this results in a very large slowdown if the branching factor for a particular turn jumps up radically. Is it OK if I cut off exploration when parts of the game tree are not explored as deeply? Will this affect the mathematical properties of expectimax at all?
Short answer: I'm pretty sure you lose the mathematical guarantees, but the extent to which this affects your program's performance will probably depend on the game and your board evaluation function.
Here's an abstract scenario to give you some intuition for where having different branch lengths might create the most problems: say that, for player 1, the best move is something that takes a few turns to set up, and that this setup is not something your board evaluation function can pick up on. In this case, regardless of what player 2 does in the meantime, there will be a point a few moves in the future where the score of the board swings in the direction that favors player 1. If one branch gets far enough to see that move and another doesn't, it will look like the first is a worse option for player 2, despite the fact that the same thing will happen on the other branch. If the move that player 2 made in the first branch was actually better than the move it made in the second branch, this would lead to a suboptimal choice.
On the other hand, a perfect board evaluator would make this impossible (it would recognize player 1 setting up their move). There are also games in which setting up moves in advance like this is not possible. But the existence of this case is a red flag.
Fundamentally, branches that didn't get evaluated as far have greater uncertainty in their estimations of how good a move is. This will sometimes result in them being chosen when they shouldn't be and other times it will result in them not being chosen when they should be. As a result, I would strongly suspect that you lose mathematical guarantees by doing this. That said, the practical impact of this problem on performance may or may not be substantial.
There might be some way around this if you incorporate the current turn number into the board evaluation function and adjust for it accordingly. Minimally, that would allow you to explicitly account for the added uncertainty in shorter branches.
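If it helps to see where that adjustment would plug in, here is a minimal depth-limited expectimax sketch in C. The game hooks (moves, apply_move, is_chance, prob, evaluate) are toy stand-ins I made up, not your game's interface, and the node budget is a crude stand-in for your 1-second limit; the only point is that the evaluator is handed the remaining depth, so it can discount estimates from branches that were cut off early:
#include <stdio.h>

/* Toy stand-ins for a real game's interface. */
static int    moves(int s)             { return (s % 3) + 2; }    /* 2-4 legal moves/outcomes */
static int    apply_move(int s, int m) { return s * 5 + m + 1; }  /* successor "state" */
static int    is_chance(int s)         { return s % 2 == 0; }     /* alternate node types */
static double prob(int s, int m)       { (void)m; return 1.0 / moves(s); } /* uniform outcomes */

static long budget = 200;  /* stand-in for the time budget: stop expanding when exhausted */

/* Depth-aware evaluation: when depth_left > 0 the branch was cut short,
   so the estimate is (crudely) discounted for its extra uncertainty. */
static double evaluate(int s, int depth_left)
{
    return (double)(s % 7) - 0.1 * depth_left;
}

static double expectimax(int s, int depth_left)
{
    if (depth_left == 0 || --budget <= 0)
        return evaluate(s, depth_left);

    int n = moves(s);
    if (is_chance(s)) {
        double expected = 0.0;
        for (int m = 0; m < n; m++)
            expected += prob(s, m) * expectimax(apply_move(s, m), depth_left - 1);
        return expected;
    }

    double best = -1e300;   /* maximizing player's turn */
    for (int m = 0; m < n; m++) {
        double v = expectimax(apply_move(s, m), depth_left - 1);
        if (v > best)
            best = v;
    }
    return best;
}

int main(void)
{
    printf("expectimax value: %f\n", expectimax(1, 4));
    return 0;
}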

Techniques for static code analysis in detecting integer overflows

I'm trying to find some effective techniques on which I can base my integer-overflow detection tool. I know there are many ready-made detection tools out there, but I'm trying to implement a simple one on my own, both out of personal interest in this area and to further my knowledge.
I know techniques like Pattern Matching and Type Inference, but I read that more complicated code analysis techniques are required to detect integer overflows. There's also Taint Analysis, which can "flag" untrusted sources of data.
Is there some other technique, which I might not be aware of, which is capable of detecting integer overflows?
It may be worth trying the cppcheck static analysis tool, which claims to detect signed integer overflow as of version 1.67:
New checks:
- Detect shift by too many bits, signed integer overflow and dangerous sign conversion
Notice that it supports both C and C++ languages.
There is no overflow check for unsigned integers, since, per the Standard, unsigned types never overflow.
Here is some basic example:
#include <stdio.h>
int main(void)
{
    int a = 2147483647;
    a = a + 1;
    printf("%d\n", a);
    return 0;
}
With such code it gets:
$ ./cppcheck --platform=unix64 simple.c
Checking simple.c...
[simple.c:6]: (error) Signed integer overflow for expression 'a+1'
However, I wouldn't expect too much from it (at least with the current version), as a slightly different program:
int a = 2147483647;
a++;
passes without noticing overflow.
It seems you are looking for some sort of Value Range Analysis, and to detect when that range would exceed the set bounds. This is something that on the face of it seems simple, but is actually hard. There will be lots of false positives, and that's even without counting bugs in the implementation.
To ignore the details for a moment, you associate a pair [lower bound, upper bound] with every variable, and do some math to figure out the new bounds for every operator. For example if the code adds two variables, in your analysis you add the upper bounds together to form the new upper bound, and you add the lower bounds together to get the new lower bound.
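As a tiny illustration of that step, here is one way the transfer function for addition might look in C (the type and names are mine, not taken from any particular tool); the bounds are held in a wider type so the analysis itself cannot overflow:
#include <limits.h>
#include <stdio.h>

/* One abstract value: an inclusive range of possible values of an int. */
typedef struct {
    long long lo;
    long long hi;
} Range;

/* Abstract transfer function for "a + b": add the bounds pairwise,
   and flag the result if it can leave the representable int range. */
static Range range_add(Range a, Range b, int *may_overflow)
{
    Range r = { a.lo + b.lo, a.hi + b.hi };
    *may_overflow = (r.lo < INT_MIN || r.hi > INT_MAX);
    return r;
}

int main(void)
{
    Range x = { 0, 10 };
    Range y = { INT_MAX - 5, INT_MAX };
    int warn = 0;
    Range sum = range_add(x, y, &warn);
    printf("sum in [%lld, %lld], possible overflow: %d\n", sum.lo, sum.hi, warn);
    return 0;
}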
But of course it's not that simple. Firstly, what if there is non-straight-line code? Ifs are not too bad: you can evaluate both sides and then take the union of the ranges afterwards (which can lose information! If two ranges have a gap in between, their union will span the gap). Loops require tricks; a naive implementation may run billions of iterations of analysis on a loop, or never terminate at all. Even if you use an abstract domain that has no infinite ascending chains, you can still get into trouble. The keywords to solve this are "widening operator" and (optionally, but probably a good idea) "narrowing operator".
It's even worse than that, because what's a variable? Your regular local variable of scalar type that never has its address taken isn't too bad. But what about arrays? Now you don't even know for sure which entry is being affected - the index itself may be a range! And then there's aliasing. That's far from a solved problem and causes many real-world tools to make really pessimistic assumptions.
Also, function calls. You're going to call functions from some context, hopefully a known one (if not, then it's simple: you know nothing). That makes it hard: not only is there suddenly a lot more state to keep track of at the same time, there may be several places a function could be called from, including itself. The usual response to that is to re-evaluate the function when the range of one of its arguments has been expanded; once again, this could take billions of steps if not done carefully. There are also algorithms that analyze a function differently for different contexts, which can give more accurate results, but it's easy to spend a lot of time analyzing contexts that aren't different enough to matter.
Anyway if you've made it this far, you could read Accurate Static Branch Prediction by Value Range Propagation and related papers to get a good idea of how to actually do this.
And that's not all. Considering only the ranges of individual variables, without caring about the relationships between them (keyword: non-relational abstract domain), does badly on really simple (for a human reader) things such as subtracting two variables that are always close together in value, for which it will produce a large range, under the assumption that they may be as far apart as their bounds allow. Even for something trivial such as
// assume x in [0 .. 10]
int y = x + 2;
int diff = y - x;
For a human reader, it's pretty obvious that diff = 2. In the analysis described so far, the conclusions would be that y in [2 .. 12] and diff in [-8, 12]. Now suppose the code continues with
int foo = diff + 2;
int bar = foo - diff;
Now we get foo in [-6, 14] and bar in [-18, 22], even though bar is obviously 2 again; the range has doubled again. This was a simple example, and you could make up some ad-hoc hacks to detect it, but it's a more general problem. This effect tends to blow up the ranges of variables quickly and generate lots of unnecessary warnings. A partial solution is assigning ranges to differences between variables; then you get what's called a difference-bound matrix (unsurprisingly, this is an example of a relational abstract domain). They can get big and slow for interprocedural analysis, or if you want to throw non-scalar variables at them too, and the algorithms start to get more complicated. And they only get you so far: if you throw a multiplication into the mix (that includes x + x and variants), things still go bad very fast.
So you can throw something else into the mix that can handle multiplication by a constant; see for example Abstract Domains of Affine Relations. This is very different from ranges, and won't by itself tell you much about the ranges of your variables, but you could use it to get more accurate ranges.
The story doesn't end there, but this answer is getting long. I hope this does not discourage you from researching this topic, it's a topic that lends itself well to starting out simple and adding more and more interesting things to your analysis tool.
Checking integer overflows in C:
When you add two 32-bit numbers and get a 33-bit result, the lower 32 bits are written to the destination and the highest bit is signaled out as a carry flag. Many languages, including C, don't provide a way to access this carry, so you can use the limits in <limits.h> to check before you perform an arithmetic operation. Consider unsigned ints a and b:
If MAX - b < a, we know for sure that a + b would cause an overflow. An example is given in this C FAQ.
Watch out: As chux pointed out, this example is problematic with signed integers, because it won't handle MAX - b or MIN + b if b < 0. The example solution in the second link (below) covers all cases.
Multiplying numbers can cause an overflow, too. A solution is to double the length of the first number, then do the multiplication. Something like:
(typecast)a*b
Watch out: (typecast)(a*b) would be incorrect because it truncates first then typecasts.
A detailed technique for C can be found HERE. Using macros seems to be an easy and elegant solution.
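For reference, here is a small sketch of those pre-checks for unsigned operands, assuming unsigned int is no wider than 32 bits (the function names are made up for illustration):
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if a + b would wrap past UINT_MAX. */
static int uadd_would_overflow(unsigned a, unsigned b)
{
    return b > UINT_MAX - a;            /* equivalent to the MAX - b < a test above */
}

/* Returns 1 if a * b would wrap, by doing the multiply in a wider type. */
static int umul_would_overflow(unsigned a, unsigned b)
{
    return (uint64_t)a * b > UINT_MAX;  /* note: cast an operand, not the product */
}

int main(void)
{
    unsigned a = 4000000000u, b = 400000000u;
    printf("add overflows: %d, mul overflows: %d\n",
           uadd_would_overflow(a, b), umul_would_overflow(a, b));
    return 0;
}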
I'd expect Frama-C to provide such a capability. Frama-C is focused on C source code, but I don't know if it is dialect-sensitive or specific. I believe it uses abstract interpretation to model values. I don't know if it specifically checks for overflows.
Our DMS Software Reengineering Toolkit has a variety of language front ends, including most major dialects of C. It provides control and data flow analysis, and also abstract interpretation for computing ranges, as foundations on which you can build an answer. My Google Tech Talk on DMS, at about 0:28:30, specifically talks about how one can use DMS's abstract interpretation on value ranges to detect overflow (of an index on a buffer). A variation on checking the upper bound on array sizes is simply to check for values not exceeding 2^N. However, off-the-shelf DMS does not provide any specific overflow analysis for C code. There's room for the OP to do interesting work :=}

Concerning efficiency, logical compares vs redundant memory manipulation

Which is more taxing? Is enclosing an array element exchange in a conditional if statement, to prevent redundant exchanges such as exchanging an element with itself, more efficient?
Or is having to check a merely probabilistic condition every time more inefficient? Say the chance of the special condition increases with every invocation.
Say you're developing an algorithm and are trying to weigh efficiency: compares versus exchanges (as in insertion sort).
if(condition)
exchange two elements
This very much depends on your processor architecture, how often this would be done, the throughput it is required to handle and the cost of doing said exchanges, in which case the only viable, real-world answer is: "profile, profile and profile some more".
Basically, if your CPU suffers badly from branch misprediction and the swapping of elements is trivial, then it makes sense to leave out the conditional.
However, if your target CPU architecture can absorb a fair amount of branch mispredictions without too much stalling, or the cost of swapping elements is not trivial, then keeping the conditional might gain you performance, depending on the size of said array. You may also benefit from instructions like CMOVcc/CMPXCHG, or their non-x86 counterparts (though in this situation you'd still need a read + compare, but it removes the branching).
With so many variable inputs, it makes sense to profile your code and find where it's really bottlenecking; tools like VTune or CodeAnalyst will also give you stats on branch misprediction, so you can see how much it affects your algorithm as a whole.
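To make the trade-off concrete, here is the conditional exchange next to the unconditional one in plain C; which of the two wins is exactly what the profiling above is meant to tell you, and a compiler may well turn either into conditional-move style code on targets that have it:
#include <stdio.h>

/* Branchy version: skip the exchange when it would be redundant. */
static void swap_if_different(int *a, int *b)
{
    if (*a != *b) {
        int t = *a;
        *a = *b;
        *b = t;
    }
}

/* Unconditional version: always do the work. A redundant self-exchange is a
   harmless no-op, and the unpredictable branch disappears entirely. */
static void swap_always(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

int main(void)
{
    int x = 3, y = 7;
    swap_if_different(&x, &y);
    swap_always(&x, &y);
    printf("%d %d\n", x, y);  /* swapped twice, so back to 3 7 */
    return 0;
}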
A useful way to look at any condition-evaluation code is to ask, "What is the probability of each outcome?"
For example, if there's a test expression test, and its probability of being true is 1/100, then on average it is telling you very little for your investment in processor cycles.
In fact you can quantify that.
If it's true, then the amount of information you've learned is pretty good.
It is log2(100/1) = 6.6 bits, roughly, but that only happens 1 out of 100 times.
The other 99 times, the amount of information you learn is log2(100/99) = .014 bits.
Practically nothing.
So a condition like that is telling you very little, on average. It's not "working" very hard.
A good way to finish quantifying it is to multiply what you learn from each outcome by the probability of that outcome, and add those up.
That tells you what you learn on average.
That is 6.6 * 1/100 + .014 * 99/100 = .066 + .014 = .08 bits, which is very poor.
(This number is called the entropy of the decision.)
On the other hand, if you have a decision point where each outcome is equally likely, you learn a full 1 bit on average.
In fact that's the most work a binary decision can possibly do.
So if you're worried about the performance of a conditional test (you may not be) try to make it earn its cycles.
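The arithmetic above is easy to reproduce; here is a short C sketch of the entropy calculation (link with -lm):
#include <math.h>
#include <stdio.h>

/* Entropy of a yes/no decision with probability p of being true, in bits. */
static double entropy(double p)
{
    return p * log2(1.0 / p) + (1.0 - p) * log2(1.0 / (1.0 - p));
}

int main(void)
{
    printf("p = 1/100: %.3f bits\n", entropy(0.01));  /* ~0.08, the rare test above */
    printf("p = 1/2:   %.3f bits\n", entropy(0.5));   /* 1.0, the best a binary test can do */
    return 0;
}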

Micro-optimizations in C, which ones are there? Is there anyone really useful? [closed]

I understand most of the micro-optimizations out there but are they really useful?
Exempli gratia: does doing ++i instead of i++, or using while(1) versus for(;;), really result in performance improvements (either in memory footprint or CPU cycles)?
So the question is, what micro-optimizations can be done in C? Are they really useful?
You should rely on your compiler to optimise this stuff. Concentrate on using appropriate algorithms and writing reliable, readable and maintainable code.
The day tclhttpd, a webserver written in Tcl, one of the slowest scripting languages, managed to outperform Apache, a webserver written in C, one of the supposedly fastest compiled languages, was the day I was convinced that micro-optimizations pale in comparison to using a faster algorithm/technique*.
Never worry about micro-optimizations until you can prove in a debugger that they are the problem. Even then, I would recommend first coming here to SO and asking whether it is a good idea, in the hope that someone will convince you not to do it.
It is counter-intuitive, but very often code, especially tight nested loops or recursion, is optimized by adding code rather than removing it. The gaming industry has come up with countless tricks to speed up nested loops, using filters to avoid unnecessary processing. Those filters add significantly more instructions than the difference between i++ and ++i.
*note: We have learned a lot since then. The realization that a slow scripting language can outperform compiled machine code, because spawning threads is expensive, led to the development of lighttpd, NginX and Apache2.
There's a difference, I think, between a micro-optimization, a trick, and alternative means of doing something. It can be a micro-optimization to use ++i instead of i++, though I would think of it as merely avoiding a pessimization, because when you pre-increment (or decrement) the compiler need not insert code to keep track of the current value of the variable for use in the expression. If using pre-increment/decrement doesn't change the semantics of the expression, then you should use it and avoid the overhead.
A trick, on the other hand, is code that uses a non-obvious mechanism to achieve a result faster than a straightforward mechanism would. Tricks should be avoided unless absolutely needed. Gaining a small percentage of speed-up is generally not worth the damage to code readability unless that small percentage reflects a meaningful amount of time. Extremely long-running programs, especially calculation-heavy ones, or real-time programs are often candidates for tricks, because the amount of time saved may be necessary to meet the system's performance goals. Tricks should be clearly documented if used.
Alternatives, are just that. There may be no performance gain or little; they just represent two different ways of expressing the same intent. The compiler may even produce the same code. In this case, choose the most readable expression. I would say to do so even if it results in some performance loss (though see the preceding paragraph).
I think you do not need to think about these micro-optimizations, because most of them are done by the compiler. These things can only make code more difficult to read.
Remember, premature optimization is evil.
To be honest, that question, while valid, is not relevant today - why?
Compiler writers are a lot smarter than they were 20 years ago. Rewind back in time, and these optimizations would have been very relevant; we were all working with old 80286/386 processors, and coders would often resort to tricks to squeeze even more bytes out of the compiled code.
Today, processors are too fast, and compiler writers know the intimate details of the instruction set well enough to make everything work, considering that there is pipelining, multiple cores, and acres of RAM. Remember, with an 80386 processor there would be 4MB of RAM, and if you were lucky, 8MB was considered superior!!
The paradigm has shifted: it used to be about squeezing every byte out of compiled code; now it is more about programmer productivity and getting the release out the door much sooner.
In the above I was describing the nature of the processors and compilers I was talking about: the Intel 80x86 processor family and Borland/Microsoft compilers.
Hope this helps,
Best regards,
Tom.
If you can easily see that two different code sequences produce identical results, without making assumptions about the data other than what's present in the code, then the compiler can too, and generally will.
It's only when the transformation from one to the other is highly non-obvious or requires assuming something that you may know to be true but the compiler has no way to infer (eg. that an operation cannot overflow or that two pointers will never alias, even though they aren't declared with the restrict keyword) that you should spend time thinking about these things. Even then, the best thing to do is usually to find a way to inform the compiler about the assumptions that it can make.
If you do find specific cases where the compiler misses simple transformations, 99% of the time you should just file a bug against the compiler and get on with working on more important things.
Keeping the fact that memory is the new disk in mind will likely improve your performance far more than applying any of those micro-optimizations.
For a slightly more pragmatic take on the question of ++i vs. i++ (at least in a C++ context) see http://llvm.org/docs/CodingStandards.html#micro_preincrement.
If Chris Lattner says it, I've got to pay attention. ;-)
You would do better to consider every program you write primarily as a language in which you communicate your ideas, intentions and reasoning to other human beings who will have to bug-fix, reuse and understand it. They will spend more time on decoding garbled code than any compiler or runtime system will do executing it.
To summarise, say what you mean in the clearest way, using the common idioms of the language in question.
For these specific examples in C, for(;;) is the idiom for an infinite loop and "i++" is the usual idiom for "add one to i" unless you use the value in an expression, in which case it depends whether the value with the clearest meaning is the one before or after the increment.
Here's real optimization, in my experience.
Someone on SO once remarked that micro-optimization was like "getting a haircut to lose weight". On American TV there is a show called "The Biggest Loser" where obese people compete to lose weight. If they were able to get their body weight down to a few grams, then getting a haircut would help.
Maybe that's overstating the analogy to micro-optimization, because I have seen (and written) code where micro-optimization actually did make a difference, but when starting off there is a lot more to be gained by simply not solving problems you don't have.
x ^= y;   /* XOR swap: exchange x and y without a temporary */
y ^= x;
x ^= y;
++i should be preferred over i++ for situations where you don't use the return value, because it better represents the semantics of what you are trying to do (increment i) rather than any possible optimisation (it might be slightly faster, and is probably not worse).
Generally, loops that count towards zero are faster than loops that count towards some other number. I can imagine a situation where the compiler can't make this optimization for you, but you can make it yourself.
Say that you have an array of length x, where x is some very big number, and that you need to perform some operation on each element of the array. Further, let's say that you don't care what order these operations occur in. You might do this...
int i;
for (i = 0; i < x; i++)
    doStuff(array[i]);
But, you could get a little optimization by doing it this way instead -
int i;
for (i = x-1; i != 0; i--)
{
    doStuff(array[i]);
}
doStuff(array[0]);
The compiler doesn't do it for you because it can't assume that order is unimportant.
MaR's example code is better. Consider this, assuming doStuff() returns an int:
int i = x;
while (i != 0)
{
    --i;
    printf("%d\n", doStuff(array[i]));
}
This is ok as long as printing the array contents in reverse order is acceptable, but the compiler can't decide that for you.
This being an optimization is hardware dependent. From what I remember about writing assembler (many, many years ago), counting up rather than counting down to zero requires an extra machine instruction each time you go through the loop.
If your test is something like (x < y), then evaluation of the test goes something like this:
subtract y from x, storing the result in some register r1
test r1, to set the n and z flags
branch based on the values of the n and z flags
If your test is ( x != 0), you can do this:
test x, to set the z flag
branch based on the value of the z flag
You get to skip a subtract instruction for each iteration.
There are architectures where you can have the subtract instruction set the flags based on the result of the subtraction, but I'm pretty sure x86 isn't one of them, so most of us aren't using compilers that have access to such a machine instruction.

Resources