I'm working on the specification of ARMV6M and the instruction's T2 encoding is as below.
0 1 0 0 0 1 | 0 0 | DN | Rm | Rdn
DN is always used as a prefix for Rdn register file position and I couldn't understand why it's not just put in the Rdn.
Many of the thumb instructions, esp ones that are ALU operations have the three bit register specifications encoded in the lower 6 bits, three bits per indicating r0-r7. this specific add instruction allows for operations on low and high registers r0-r15, so the other two bits need a home they happened to put one in bit 6 that goes with 5:3 so the other went above that.
So perhaps they were thinking of saving a few gates or readability or some other reason that cant be answered here at SO. so instead of using 7:4 and 3:0 like they would in a full sized arm instruction for this one special one off instruction, they put the upper bits in 7:6 an even better question is why didn't they put the left right the same as the other two why isn't it [7,5,4,3] [6,2,1,0] instead of [6,5,4,3] and [7,2,1,0]? IMO that would have helped readability esp if you started off on the arm thumb docs that were only in print (paper) originally where H1/H2 seemed swapped.
In the pseudo code they show (DN:Rdn) and talk about a four bit number and then Rm being a 4 bit number, so that indicates what the older docs did in a different way.
I suspect they used the n at the end as lower bits and the capital N as larger bits and yes it would have read better as RdN instead of DN. or Rd3 would have been even better than that.
Instruction sets can to some extent do whatever they want, ARM is no different here the designers way back when thumb started chose what they chose. Arbitrary decision, you will be lucky to find anyone here that was in the room at the time, I have seen these things be mistakes too, in the meeting one thing is decided, the implementation by someone is backward, but by the time that gets around to the room to much investment has been made in testing, etc, to just leave it. Possible that/those engineers are already retired, or got golden parachutes a while ago, maybe you will get lucky.
Also understand that it is not uncommon that the documentation folks are often a separate department, so this could have been a game time decision (or typo) by an individual technical writer, and later determined not to change the docs down the road.
Don't read anything magical into this or some industry nomenclature, that is a bad habit to have anyway, what matters is that you understand what the bits do not how they are labelled.
Related
I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer to be absolutely fascinating and it has completely changed my outlook on programming when dealing with branches that are based on Data.
I currently have a fairly basic, but fully functioning interpreted Intel 8080 Emulator written in C, the heart of the operation is a 256 long switch-case table for handling each opcode. My initial thought was this would obviously be the fastest method of working as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency and one-off cases. A switch-case table full of pre-processor macros is a very neat and easy to maintain.
Unfortunately, after reading the aforementioned post it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus every time the switch-case is navigated the pipeline would have to be completely wiped, resulting in a several cycle delay in what should otherwise be an incredibly quick program (There's not even so much as multiplication in my code).
I'm sure most of you are thinking "Oh, the solution here is simple, move to dynamic recompilation". Yes, this does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately my primary interest is emulating older 8-bit and 16-bit era consoles (the intel 8080 here is only an example as it's my simplest piece of emulated code) where cycle and timing keeping to the exact instruction is important as the Video and Sound must be processed based on these exact timings.
When dealing with this level of accuracy performance becomes an issue, even for older consoles (Look at bSnes for example). Is there any recourse or is this simply a matter-of-fact when dealing with processors with long pipelines?
On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking), and a single jump. The ifs shouldn't cause a problem with branch prediction because it is unlikely you will have a bad op-code. The jump is not so friendly with the pipeline, but in the end, it's only one for the whole switch statement..
I don't believe you can convert a long switch statement of op-codes into any other form that would result in better performance. This is of course, if your compiler is smart enough to convert it to a jump table. If not, you can do so manually.
If in doubt, implement other methods and measure performance.
Edit
First of all, make sure you don't confuse branch prediction and branch target prediction.
Branch prediction solely works on branch statements. It decides whether a branch condition would fail or succeed. They have nothing to do with the jump statement.
Branch target prediction on the other hand tries to guess where the jump will end up in.
So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".
In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISC, the cost of that computation is much higher than the penalty of a single jump.
As the branches on your 256-way switch statement are densely packed the compiler will implement this as a jump table, so you're correct in that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 in "Optimizing software in C++" is a good place to start.
http://www.agner.org/optimize/?e=0,34
The only way you could avoid this penalty is by ensuring that the same instructions are execution regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency so are slower than a predictable branch) or otherwise looking for symmetry in your code paths. Considering what you're trying to do this is probably not going to be possible, and if it was then it would almost certainly add a overhead greater than the 15-25 clock cycles for the mispredict.
In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.
The indirect jump is probably the best thing to do for instruction decoding.
On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.
On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.
But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:
Instead of generating something that looks like
loop:
load reg := next_instruction_bits // or byte or word
load reg2 := instruction_table[reg]
jmp [reg]
label_instruction_00h_ADD: ...
jmp loop
label_instruction_01h_SUB: ...
jmp loop
...
generate the code as
loop:
load reg := next_instruction_bits // or byte or word
load reg2 := instruction_table[reg]
jmp [reg]
label_instruction_00h_ADD: ...
load reg := next_instruction_bits // or byte or word
load reg2 := instruction_table[reg]
jmp [reg]
label_instruction_01h_SUB: ...
load reg := next_instruction_bits // or byte or word
load reg2 := instruction_table[reg]
jmp [reg]
...
i.e. replace the jump to the top of the instruction fetch/decode/execute loop
by the code at the top of the loop at each place.
It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single target, PC indexed BTB will be quite a lot better in this latter, threaded, code, than on the original with only a single copy of the indirect jump.
Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.
Good luck and have fun!
(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing to other nodes, or to a function to evaluate.
Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.
A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).
Data driven, i.e. a tree of nodes pointing to other nodes because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions - instead of having a branch misprediction when jumping to the case per instruction, then executing common code, then switching again, and getting a second mispredict, you do the common code, with slightly different parameters (like, how many bits of the instruction stream do you consume, and where the next set of bits to branch on is (are).
I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.
Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz
You may be able to see the tables used, if you care.
(Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)
Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)
I thought I'd add something since no one mentioned it.
Granted, the indirect jump is likely to be the best option.
However, should you go with the N-compare way, there are two things that come to my mind:
First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or test the number bit by bit if the value space is near to full) .This is a bit like a hashtable, you implement a static tree to find the final element.
Second, you could run an analysis on the binary code you want to execute.
You could even do that per binary, before execution, and runtime-patch your emulator.
This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.
But I cant see this being faster than a medium 15 cycles penalty, unless you have 99% of MOV and you put an equality for the MOV opcode before the other tests.
I am working with the registers of an ARM Cortex M3. In the documentation, some of the bits may be "reserved". It is unclear to me how I should deal with these reserved bits when writing on the registers.
Are these reserved bits even writeable? Should I be cautious to not touch them? Will something bad happen if I touch them?
This is a classic embedded world problem as to what to do with reserved bits! First, you should NOT write randomly into it lest your code becomes un-portable. What happens when the architecture assigns a new meaning to the reserved bits in future? Your code will break. So the best mantra when dealing with registers having reserved bits is Read-Modify-Write. i.e read the register contents, modify only the bits you want and then write back the value so that reserved bits are untouched ( untouched, does not mean we dont write into them, but in the sense, that we wrote that which was there before )
For example, say there is a register in which only the LSBit has meaning and all others are reserved. I would do this
ldr r0,=memoryAddress
ldr r1,[r0]
orr r1,r1,#1
str r1,[r0]
If there is no other clue in the documentation, write a zero. You cannot avoid writing to a few reserved bits spread around in a 32-bit register.
Read-Modify-Write should work most of the time, however there are cases where reserved bits are undefined on read but must be written with a specific value. See this post from the LPC2000 group (the whole thread is quite interesting too). So, always check the docs carefully, and also any errata that's available. When in doubt or docs are unclear, don't hesitate to write to the manufacturer.
Ideally you should read-modify-write, no guarantee for success, when you change to a newer chip with different bits, you are changing your code anyway. I have seen vendors where writing zeros to the reserved bits failed when they revved the chip and the code had to be touched. So there are no guarantees. The biggest clue is if in the vendors code you see a register or set that are clearly read-modify-write or clearly just a write. This could be different developers writing different sections of the example or there is a register in that peripheral that is sensitive, has an undocumented bit, and needs the read-modify-write.
On the chips that I work on I make sure that undocumented (to the customer), but not unused bits are marked in some way to stand out from other unused bits. We normally mark unused/reserved bits as zero, and these other bits get a name, and a must write this value marking. Not all vendors do this.
The bottom line is there is no guarantee, assume all documentation and example programs have bugs and you have to hack your way through to figure out what is right and what is wrong. No matter what path you take (read-modify-write, write zeros, etc) you will be wrong from time to time and have to re-do the code to match a hardware change. I strongly suggest that if a vendor has a chip id of some sort, that your software reads that ID and if it is an id that you have not tested your code against, declare a failure and not program that part. In production testing long before a customer sees the product, the part change will get detected and software will be involved in understanding the reason for the part change, the resolution being the alternate part is not compatible and rejected or the software changes, etc.
Reserved most of the time mean that they aren't used in this chip, but they might be used on feature devices (other product line). (Most chip manufacturers produce one peripheral driver and they use it for all there chips. This way it's mostly copy past work and there is less change for errors) Most of the time it doesn't matter if you write to reserved bits in peripheral registers, this because there isn't any logic attached to it.
It is possible that if you write something to it, it won't be stored and next time you attempt to read the register / bits it seams unchanged.
I am reading an awesome OpenGL tutorial. It's really great, trust me. The topic I am currently at is Z-buffer. Aside from explaining what's it all about, the author mentions that we can perform custom depth tests, such as GL_LESS, GL_ALWAYS, etc. He also explains that the actual meaning of depth values (which is top and which isn't) can also be customized. I understand so far. And then the author says something unbelievable:
The range zNear can be greater than the range zFar; if it is, then the
window-space values will be reversed, in terms of what constitutes
closest or farthest from the viewer.
Earlier, it was said that the window-space Z value of 0 is closest and
1 is farthest. However, if our clip-space Z values were negated, the
depth of 1 would be closest to the view and the depth of 0 would be
farthest. Yet, if we flip the direction of the depth test (GL_LESS to
GL_GREATER, etc), we get the exact same result. So it's really just a
convention. Indeed, flipping the sign of Z and the depth test was once
a vital performance optimization for many games.
If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.
Is the author making things up, am I misunderstanding something, or is it indeed the case that once < was slower (vitally, as the author says) than >?
Thanks for clarifying this quite curious matter!
Disclaimer: I am fully aware that algorithm complexity is the primary source for optimizations. Furthermore, I suspect that nowadays it definitely wouldn't make any difference and I am not asking this to optimize anything. I am just extremely, painfully, maybe prohibitively curious.
If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.
I didn't explain that particularly well, because it wasn't important. I just felt it was an interesting bit of trivia to add. I didn't intend to go over the algorithm specifically.
However, context is key. I never said that a < comparison was faster than a > comparison. Remember: we're talking about graphics hardware depth tests, not your CPU. Not operator<.
What I was referring to was a specific old optimization where one frame you would use GL_LESS with a range of [0, 0.5]. Next frame, you render with GL_GREATER with a range of [1.0, 0.5]. You go back and forth, literally "flipping the sign of Z and the depth test" every frame.
This loses one bit of depth precision, but you didn't have to clear the depth buffer, which once upon a time was a rather slow operation. Since depth clearing is not only free these days but actually faster than this technique, people don't do it anymore.
The answer is almost certainly that for whatever incarnation of chip+driver was used, the Hierarchical Z only worked in the one direction - this was a fairly common issue back in the day. Low level assembly/branching has nothing to do with it - Z-buffering is done in fixed function hardware, and is pipelined - there is no speculation and hence, no branch prediction.
It has to do with flag bits in highly tuned assembly.
x86 has both jl and jg instructions, but most RISC processors only have jl and jz (no jg).
From a long time ago I have a memory which has stuck with me that says comparisons against zero are faster than any other value (ahem Z80).
In some C code I'm writing I want to skip values which have all their bits set. Currently the type of these values is char but may change. I have two different alternatives to perform the test:
if (!~b)
/* skip */
and
if (b == 0xff)
/* skip */
Apart from the latter making the assumption that b is an 8bit char whereas the former does not, would the former ever be faster due to the old compare to zero optimization trick, or are the CPUs of today way beyond this kind of thing?
If it is faster, the compiler will substitute it for you.
In general, you can't write C better than the compiler can optimize it. And it is architecture specific anyway.
In short, don't worry about it unless that sub-micro-nano-second is ultra important
From what I recall in my architecture classes, I believe they should be equally fast. Both have 2 instructions.
First example
1. Negate b into a temp register
2. Compare temp register equal 0
Second example
1. Subtract 0xff from b into a temp register
2. Compare temp register equal to 0
These are basically identical, and besides, even if your particular architecture requires more or less than this, is it really worth the fraction of a nanosecond? Several minutes have been spent just answering this question.
I would say it's not so much the that CPUs are beyond these kind of tricks as it is the compilers.
The CPUs of today are, however, beyond simple tricks which pull an extra clock-tick or two of speed. Even if you do this 100,000 times a second, we are still only talking about an increase in speed of 0.00003 seconds on a single-core 3Ghz computer - it is simply not worth your time to worry about things like this.
Go with the one that will be easier for the person who is maintaining your code to understand. If you have a successful product, most of the expense in software is in maintenance. If you write cryptic code you add to that expense. If you don't have a successful product, it doesn't matter because no one will have to maintain it. I have been in situations where I had to save every byte I could, and had to resort to tricks like the one you gave, but I only do it as the very very very last resort.
1) On a 32-bit CPU is it faster to acccess an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .
Even on non x86 platforms the use of bits can be prohibitive as at least one PPC platform out there uses microcoded instructions to perform a variable shift which can do nasty things with other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
Can't figure out how to put a loop around it that doesn't get optimized away, I'll vote bool[].
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.