How does the program counter in register 15 expose the pipeline? (ARM)

I have a question assigned to me that states:
"The ARM puts the program counter in register r15, making it visible to the programmer. Someone writing about the ARM stated that this exposed the ARM's pipeline. What did he mean, and why?"
We haven't talked about pipelines in class yet, so I don't know what that is, and I'm having a hard time understanding the material online. Can someone help me out, either by answering the question or by helping me understand so I can form my own answer?
Thanks!

An exposed pipeline means one that the programmer needs to take into account, and I would argue that the r15 offset is nothing more than an encoding constant.
By making the PC visible to the programmer, yes, some fragment of the early implementation detail has been 'exposed' as an architectural oddity which needs to be maintained by future implementations.
This wouldn't have been worthy of comment if the offset designed into the architecture had been zero - there would have been no optimisation possible for simple 3-stage pipelines, and everyone would have been none the wiser.
There is nothing 'exported' from the pipeline, not in the way that trace or debug allow you to snoop on timing behaviour as code is running - this feature is just part of the illusion that the processor hardware presents to the programmer (much like the illusion that each instruction executes in program order).
A problem with novel tricks like this is that people like to write questions about them, and those questions can easily be poorly phrased. Such questions also neglect the fact that even if the pipeline is 3-stage, it only takes a single 'special case' to require the gates for an offset calculation (even if those gates don't consume power during typical operation).
Having PC-relative instructions is quite common, and having implementation-optimised encodings for how the offset is calculated is also common - the IBM 650, for example.
Retrocomputing.SE is an interesting place to learn about some of the things that relate to the evolution of modern computers.

It doesn't really; what they are probably talking about is that the program counter is two instructions ahead of the instruction being executed. But that doesn't mean it's a two- or three-deep pipe, if it ever was. It exposes nothing at this point, in the same way that the branch shadow in MIPS exposes nothing. There is the textbook MIPS, and there is reality.
There is nothing magical about a pipeline; it is a computer version of an assembly line. You can build a car in place, bringing the engine, the doors, the wheels, etc. to the car. Or you can move the car through an assembly line, with a station for doors, a station for wheels, and so on. Many cars are being built at once, and a finished car comes out of the building every few minutes - but that doesn't mean a new car takes only a few minutes to build. It means the slowest step takes a few minutes; front to back, each car still takes roughly the same amount of time.
An instruction has several fairly obvious steps: an add requires that you fetch the instruction, decode it, gather up the operands, feed them to an adder (ALU), and store the result. Other instructions have similar steps, and a similar number of them.
Your textbook will use terms like fetch, decode, execute. Say you fetch an instruction at 0x1000, then one at 0x1004, then one at 0x1008, hoping the code runs linearly with no branches. While 0x1004 is being fetched, 0x1000 is being decoded; while 0x1008 is being fetched, 0x1004 is being decoded and 0x1000 may be up to execution - it depends. So one might think: when 0x1000 is executing, the fetch address is 0x1008, so that tells me how the pipeline works. It doesn't. I could have a 10000-deep pipeline and have the program counter that an instruction sees be any address I like relative to that instruction's own address - I could make it 0x1000 for the instruction at 0x1000 and still have a 12345-deep pipeline. It's just a definition. It might at some point in history have been put in place because of a real design with a real pipe, or it could always have simply been defined that way.
What does matter is that the definition is stated and honoured by the instruction set: if they say the PC is the instruction's address plus some offset, then it needs to always be that, or the exceptions need to be documented, and implementations need to match those definitions. Done. Programmers can then program, compilers can be written, and so on.
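As a minimal sketch of that point in C (hypothetical code, not any real emulator's): however deep the implementation's pipeline, the value an instruction observes when it reads r15 is simply defined as its own address plus 8.

#include <stdint.h>

/* Hypothetical ARM-style interpreter fragment. Whatever the real
   hardware's pipeline looks like, the value returned when an
   instruction reads r15 is defined as its own address + 8. */
uint32_t read_register(uint32_t regs[16], uint32_t current_pc, int n)
{
    if (n == 15)
        return current_pc + 8;  /* architectural definition, not a pipeline probe */
    return regs[n];
}

Nothing about this function constrains how many pipeline stages the machine underneath actually has.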
A textbook problem with pipelines (not saying it isn't real): say we run out of V8 engines while we have 12 trucks on the assembly line. We had been building trucks; now we need to build cars with V6s. The truck engines are coming by slow boat, but we have car parts ready now, so we move the trucks off the line and start the line over. For N steps of the assembly line, no cars come out the other end of the building; once the first car makes it to the end, a new car follows every few minutes. We had to flush the assembly line. The same applies when you are running instructions 0x1000, 0x1004, 0x1008, etc. Keeping the pipeline moving is efficient, but what if 0x1004 is a branch to 0x1100? We might already have the instructions for 0x1008 and 0x100c in the pipe, so we have to flush the pipe, start fetching from 0x1100, and wait some number of clock cycles before we are completing instructions again; after that we ideally complete one per clock until the next branch. If you read the classic textbook on the subject, where MIPS or an educational predecessor is used, there is a notion of a branch shadow (or delay slot, or some similar term): the instruction after the branch is always executed.
So instead of flushing N instructions out of the pipe, you flush N-1, and you get an extra clock to bring the next instruction after the branch into the pipeline. That's how MIPS works by default, but when you buy a real core you can turn this off and have it not execute the instruction after the branch. It's a great textbook illustration, and it probably was real - and probably still is for "let's build a MIPS" classes in computer engineering. But the pipelines in use today don't wait that long to notice that the pipe is about to be drained; they can start fetching early and sometimes gain more than one clock rather than flushing the whole pipe. Whatever advantage the delay slot once gave, it does not currently give MIPS any edge over other designs, nor does it give us any real exposure to their pipeline.

Related

Benchmarking microcontrollers

Currently I am working on setting up a benchmark between microcontrollers (based on PowerPC). I would like to know if anyone can provide me some documentation showing, in detail, which factors are most important to consider for benchmarking.
In other words, I am looking for documentation which provides detailed information about the factors that should be considered for enhancing the performance of:
Core
Peripherals
Memory banks
Also, if someone could provide algorithms, that would be a lot of help.
There is only one useful way, and that is to write your application for both and time your application. Benchmarks are for the most part bogus: there are too many factors, and it is quite trivial to craft a benchmark that takes advantage of the differences, or even takes advantage of the common features, in a way that makes two things look different.
I perform this stunt on a regular basis; most recently with this code:
.globl ASMDELAY
ASMDELAY:
    subs r0,r0,#1   @ decrement the loop counter and set the flags
    bne ASMDELAY    @ branch back until the counter hits zero
    bx lr           @ return to the caller
Run on a Raspberry Pi (bare metal) - the same Raspberry Pi, not comparing two, just comparing it to itself - and clearly assembly, so not even taking into account the compiler features/tricks you can encode in a benchmark intentionally or accidentally. Two of those three instructions matter for benchmarking purposes; have the loop run many tens of thousands of times (I think I used 0x100000). The punchline: those two instructions in a loop ran as fast as 93662 timer ticks and as slow as 4063837 timer ticks for 0x10000 loops. Certainly the i-cache and branch prediction were turned on and off for the various tests, but even with both branch prediction and the i-cache on, these two instructions will vary in speed depending on where they lie within the fetch line and the cache line.
A microcontroller makes this considerably worse, depending on what you are comparing. Some have flash that can use the same wait state count across a wide range of clock speeds; some are speed limited, and for every N MHz you have to add another wait state. So where you set your clock affects performance across that range, and especially just below and just above the boundary where a wait state is added: compare 24 MHz minus a smidge against 24 MHz with an extra wait state - if that took you from 2 to 3 wait states, fetching just got 50% slower - while 36 MHz minus a smidge may still be at 3 wait states, and 3 wait states at 36-minus-a-smidge is faster than 3 wait states at 24 MHz. If you run the same code from SRAM instead of flash on those platforms, there usually isn't a wait state issue - the SRAM can usually match the CPU clock - so that code may be faster at any speed than the same code run from flash.
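To make that arithmetic concrete, here is a small sketch in C (the clock speeds and wait state counts are assumed for illustration, not taken from any particular part's datasheet):

#include <stdio.h>

/* Effective time per flash fetch = (1 + wait_states) / clock.
   Shows why 36 MHz with 3 wait states beats 24 MHz with the same
   3 wait states, and why crossing a wait-state boundary makes
   fetches abruptly slower. */
int main(void)
{
    struct { double mhz; int ws; } cfg[] = {
        { 24.0, 2 }, { 24.0, 3 }, { 36.0, 3 },
    };
    for (int i = 0; i < 3; i++) {
        double ns = (1 + cfg[i].ws) / cfg[i].mhz * 1000.0;
        printf("%5.1f MHz, %d wait states: %6.1f ns per fetch\n",
               cfg[i].mhz, cfg[i].ws, ns);
    }
    return 0;
}

With these numbers: 125.0 ns, 166.7 ns, and 111.1 ns per fetch respectively - the boundary crossing costs you a third, and the higher clock at the same wait state count wins.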
If you are comparing two microcontrollers from the same vendor and family, it is usually pointless: the internals are the same, and they usually vary only in how many of everything they have - how many flash banks, how many SRAM banks, how many UARTs, how many timers, how many pins, etc.
One of my points being: if you don't know the nuances of the overall architecture, you can possibly make the same code you are running now, on the same board, a few percent to tens of times faster simply by understanding how things work - enabling features you didn't know were there, or properly aligning the code that is exercised often (simply rearranging your functions within a C file can and will affect performance; adding one or more NOPs in the bootstrap to change the alignment of the whole program can and will change performance).
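As one concrete illustration of the alignment point (this is a GCC/Clang extension, and the function name is made up): you can pin a hot function to a boundary instead of leaving its placement to wherever the linker happens to put it.

/* GCC/Clang: force a frequently-executed function onto a 16-byte
   boundary rather than whatever alignment it lands on by default. */
__attribute__((aligned(16)))
void hot_loop_body(void)
{
    /* ... code that is exercised often ... */
}

Moving such a function across a fetch-line boundary, by this attribute or by inserting NOPs earlier in the image, is exactly the kind of change that shifts the measured numbers.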
Then you get into compiler differences and compiler options; playing with those can also gain (or lose) you anywhere from a few percent to several times to dozens of times.
So at the end of the day the only thing that matters is: I have an application, it is the final binary, and how fast does it run on A; then I ported that application, the final binary for B is done, and how fast does it run there. Everything else can be manipulated, and the results can't be trusted.

How do I determine the start and end of instructions in an object file?

So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
For most CPUs - and I believe Z80 falls in this category - the length of an instruction is implicit.
That is, you must decode the instruction in order to figure out how long it is.
If you're writing an emulator, you don't really ever need to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB-page opcode, or whatever, and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive descent parser.
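As a minimal illustration in C (a toy, not an accurate GB CPU - the three opcodes shown are real GB ones, but everything around them is simplified): the decode switch itself determines how many operand bytes get consumed, so instruction length is never stored anywhere.

#include <stdint.h>

/* One step of a toy fetch/decode loop: the opcode tells you how many
   operand bytes follow, so the 'length' of an instruction is implicit
   in the decode. */
uint16_t step(const uint8_t *mem, uint16_t pc)
{
    uint8_t op = mem[pc++];
    switch (op) {
    case 0x00:                      /* NOP: no operands */
        break;
    case 0x3E: {                    /* LD A, n: one operand byte */
        uint8_t n = mem[pc++];
        (void)n;                    /* ...execute... */
        break;
    }
    case 0xC3: {                    /* JP nn: two operand bytes */
        uint16_t lo = mem[pc++];
        uint16_t hi = mem[pc++];
        pc = (uint16_t)(hi << 8 | lo);
        break;
    }
    default:
        break;                      /* all other opcodes elided */
    }
    return pc;                      /* the next instruction starts here */
}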
To get to a full disassembler, most people implement some mild simulation, recursively tracking flow; instructions are found, and what remains is deduced to be data.
Not so much on the GB, where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space, or to produce disassembly-proof code, by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same effect was achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even reused some of the loader text - regular ASCII - as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it, and a disassembler usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.

Could the Wikipedia "Reconfigurable computing" code example be solved by advanced compilers like Haskell's?

Article here
http://en.wikipedia.org/wiki/Reconfigurable_computing#Example_of_a_streaming_model_of_computation
Example of a streaming model of computation
Problem: We are given 2 character arrays of length 256: A[] and B[]. We need to compute the array C[] such that C[i]=B[B[B[B[B[B[B[B[A[i]]]]]]]]]. Though this problem is hypothetical, similar problems exist which have some applications.
Consider a software solution (C code) for the above problem:
for (int i = 0; i < 256; i++) {
    char a = A[i];
    for (int j = 0; j < 8; j++)
        a = B[a];
    C[i] = a;
}
This program will take about 256*10*CPI cycles for the CPU, where CPI is the number of cycles per instruction.
Could this problem be optimized by an advanced compiler like Haskell's GHC?
This wiki page doesn't make much sense (and I think that's also noted on its talk page).
The example machine is fairly meaningless, as it ignores the fact that to pipeline the accesses you would need a memory that can not only sustain 8 simultaneous requests (coming from the different pipeline stages), but also complete each of them in a single cycle. Banking or splitting the memory wouldn't really help, as all the requests access the same addresses in B.
You could stretch it a bit and say that you've cloned B into 8 different memory units, but then you'd need some more complicated controller to keep them coherent; otherwise you'd only be able to use them for reading.
On the other hand, if you had this kind of memory, then the "CPU" they're competing against should be allowed to use it too. Given such banked memory, a modern CPU with out-of-order execution would be able to issue instructions as follows, under the same assumption of 1 cycle per load:
1st cycle: load A[i], calculate i+1
2nd cycle: load A[i+1], load B[A[i]], calculate (i+1)+1
3rd cycle: load A[i+2], load B[A[i+1]], load B[B[A[i]]], calculate ((i+1)+1)+1
...
So it would essentially do just as well as the special pipeline they show, even with a basic compiler. Note that a modern CPU can look far ahead in its execution window to find independent operations, but if the compiler does loop unrolling (a basic feature supported by most compilers) it can reorder the operations in a way that makes it even easier for the CPU to issue them.
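A sketch of that unrolling in C (same toy problem; I've assumed unsigned char so the values index cleanly): two interleaved dependency chains that an out-of-order CPU can overlap.

/* Unroll by two: the a0 and a1 chains are independent, so the CPU
   can overlap their B[] loads even though each chain is serial
   within itself. */
void compute(const unsigned char A[256], const unsigned char B[256],
             unsigned char C[256])
{
    for (int i = 0; i < 256; i += 2) {
        unsigned char a0 = A[i], a1 = A[i + 1];
        for (int j = 0; j < 8; j++) {
            a0 = B[a0];
            a1 = B[a1];
        }
        C[i]     = a0;
        C[i + 1] = a1;
    }
}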
As for your question about compilers - you didn't specify which feature you think could solve this. Generally speaking, these problems are very hard for a compiler to optimize, since you can't mitigate the latency of the memory dependencies. In other words, you first have to access A[i]; only then does the CPU have the address for B[A[i]]; only then does it have the address for B[B[A[i]]]; and so on. There's not much a compiler can do to guess the contents of memory not yet accessed (and even if it did speculate, it wouldn't be wise to use the guess for anything practical, as the memory may change by the time the actual load occurs in program order).
This is similar to the problem of "pointer chasing" when traversing a linked list: the required addresses are not only unknown at compile time, but are also hard to predict at runtime, and may change.
I'm not saying this can't be optimized, but it would usually require a dedicated HW solution (such as the memory banking), or some fancy speculative algorithm that would be quite limited in its use. There are papers on the topic (mostly HW prefetching), e.g. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=765944

How to deal with branch prediction when using a switch case in CPU emulation

I recently read the question "Why is it faster to process a sorted array than an unsorted array?" and found the answer absolutely fascinating; it has completely changed my outlook on programming when dealing with branches that are based on data.
I currently have a fairly basic but fully functioning interpreted Intel 8080 emulator written in C; the heart of the operation is a 256-entry switch-case table for handling each opcode. My initial thought was that this would obviously be the fastest method, as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency, and one-off cases. A switch-case table full of pre-processor macros is very neat and easy to maintain.
Unfortunately, after reading the aforementioned post, it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus, every time the switch-case is navigated, the pipeline has to be completely wiped, resulting in a several-cycle delay in what should otherwise be an incredibly quick program (there's not even so much as multiplication in my code).
I'm sure most of you are thinking "Oh, the solution here is simple: move to dynamic recompilation." Yes, that does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately, my primary interest is emulating older 8-bit and 16-bit era consoles (the Intel 8080 here is only an example, as it's my simplest piece of emulated code), where keeping cycles and timing exact to the instruction is important, because the video and sound must be processed based on those exact timings.
When dealing with this level of accuracy, performance becomes an issue, even for older consoles (look at bsnes, for example). Is there any recourse, or is this simply a matter of fact when dealing with processors with long pipelines?
On the contrary: switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking) and a single jump. The ifs shouldn't cause a problem with branch prediction, because it is unlikely you will have a bad opcode. The jump is not so friendly to the pipeline, but in the end it's only one jump for the whole switch statement.
I don't believe you can convert a long switch statement of opcodes into any other form that would result in better performance. This assumes, of course, that your compiler is smart enough to convert it to a jump table. If not, you can do so manually - as sketched below.
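The manual version is just an array of function pointers indexed by opcode. A sketch with made-up names (cpu_t, op_nop, op_mov are illustrative, not your emulator's actual API):

typedef struct cpu cpu_t;            /* your emulator state, whatever it is */
typedef void (*op_handler)(cpu_t *);

static void op_nop(cpu_t *c) { (void)c; /* ... */ }
static void op_mov(cpu_t *c) { (void)c; /* ... */ }
/* ...one handler per opcode... */

static op_handler dispatch[256] = {
    [0x00] = op_nop,
    [0x40] = op_mov,
    /* ...a real table fills in all 256 entries... */
};

void step(cpu_t *c, unsigned char opcode)
{
    if (dispatch[opcode])            /* guard only because entries are elided here */
        dispatch[opcode](c);         /* one indirect call, no compare chain */
}

This has the same branch-target-prediction behaviour as the compiler's jump table: a single indirect transfer per dispatched opcode.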
If in doubt, implement other methods and measure performance.
Edit
First of all, make sure you don't confuse branch prediction and branch target prediction.
Branch prediction works solely on conditional branches: it decides whether a branch condition will fail or succeed. It has nothing to do with where an indirect jump lands.
Branch target prediction, on the other hand, tries to guess where the jump will end up.
So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".
In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula covering all of them, like those built in logic circuits. But with an instruction set as big as a CPU's, even a RISC one, the cost of that computation is much higher than the penalty of a single jump.
As the branches in your 256-way switch statement are densely packed, the compiler will implement it as a jump table, so you're correct that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org; page 43 of "Optimizing software in C++" is a good place to start.
http://www.agner.org/optimize/?e=0,34
The only way you could avoid this penalty is by ensuring that the same instructions are executed regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency, so are slower than a predictable branch) or by otherwise looking for symmetry in your code paths. Considering what you're trying to do, this is probably not going to be possible, and if it were, it would almost certainly add an overhead greater than the 15-25 clock cycles for the mispredict.
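To make the conditional-move idea concrete in miniature (generic C, nothing emulator-specific): a branch replaced by a data dependency.

/* Branchy: the predictor has to guess 'cond'. */
int select_branchy(int cond, int a, int b)
{
    if (cond) return a;
    return b;
}

/* Branchless: cond01 must be 0 or 1. No prediction needed, but both
   a and b must be computed, and the result carries a data dependency
   on all three inputs. */
int select_branchless(int cond01, int a, int b)
{
    return b ^ ((a ^ b) & -cond01);
}

With cond01 = 1 the mask is all ones and the result is a; with cond01 = 0 the mask is zero and the result is b. Scaling this trick up to 256 arbitrary opcode behaviours is what makes it impractical here.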
In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.
The indirect jump is probably the best thing to do for instruction decoding.
On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.
On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.
But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:
Instead of generating something that looks like
loop:
    load reg := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp reg2                            // jump to the handler address just loaded
label_instruction_00h_ADD: ...
    jmp loop
label_instruction_01h_SUB: ...
    jmp loop
...
generate the code as
loop:
    load reg := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp reg2
label_instruction_00h_ADD: ...
    load reg := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp reg2
label_instruction_01h_SUB: ...
    load reg := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp reg2
...
i.e. replace the jump back to the top of the instruction fetch/decode/execute loop with a copy of the code from the top of the loop, at each place.
It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single-target, PC-indexed BTB will do quite a lot better on this latter, threaded, code than on the original with only a single copy of the indirect jump.
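In C you can get exactly this shape with the GCC/Clang "labels as values" extension - replicating the dispatch at the end of every handler so each indirect jump gets its own BTB entry. A sketch with invented opcode values:

/* Threaded dispatch using computed goto (GCC/Clang extension:
   '&&label' takes the address of a label). Each handler ends with
   its own copy of the fetch+jump, so a PC-indexed BTB can learn a
   separate prediction per handler. */
void run(const unsigned char *code)
{
    static void *table[256] = {
        [0x00] = &&op_add,
        [0x01] = &&op_sub,
        [0xFF] = &&op_halt,
        /* ...a real interpreter fills in all 256 entries... */
    };
    const unsigned char *pc = code;

    goto *table[*pc++];              /* initial dispatch */

op_add:
    /* ...execute ADD... */
    goto *table[*pc++];              /* per-handler dispatch copy */

op_sub:
    /* ...execute SUB... */
    goto *table[*pc++];

op_halt:
    return;
}

This is the same technique interpreters like CPython's bytecode loop use when built with computed-goto support.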
Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.
Good luck and have fun!
(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual: navigate a tree of N-way tables, with each entry in the tree pointing to other nodes or to a function to evaluate.
Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.
A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large. In the tool mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M entries in size (24 bits).
Data-driven, i.e. a tree of nodes pointing to other nodes, because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions: instead of taking a branch misprediction when jumping to the per-instruction case, then executing the common code, then switching again and getting a second mispredict, you execute the common code with slightly different parameters (like: how many bits of the instruction stream do you consume, and where the next set of bits to branch on is).
I was very aggressive in mkIrecog, as I say, allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16- or 18-bit switch (64K-256K entries), with all the other decodes much smaller, no bigger than 10 bits.
Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz
You may be able to see the tables used, if you care.
(Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)
Most of the other guys I know who do this sort of thing do it a byte at a time - i.e. an 8-bit, 256-way branch or table lookup.)
I thought I'd add something since no one mentioned it.
Granted, the indirect jump is likely to be the best option.
However, should you go with the N-compare way, there are two things that come to my mind:
First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions by dichotomy based on their numerical opcode (or testing the number bit by bit if the value space is nearly full). This is a bit like a hash table: you implement a static tree to find the final element - see the sketch just below.
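A tiny sketch of that dichotomy in C (the opcode ranges are invented for illustration): compare against the midpoint first, so 256 cases cost at most 8 tests instead of up to 256.

/* log2(N) dispatch by bisecting the opcode space. */
void dispatch(unsigned char op)
{
    if (op < 0x80) {
        if (op < 0x40) { /* handle the 0x00-0x3F group */ }
        else           { /* handle the 0x40-0x7F group */ }
    } else {
        if (op < 0xC0) { /* handle the 0x80-0xBF group */ }
        else           { /* handle the 0xC0-0xFF group */ }
    }
    /* ...each group bisects further until a single opcode remains... */
}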
Second, you could run an analysis on the binary code you want to execute.
You could even do that per binary, before execution, and runtime-patch your emulator.
This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.
But I can't see this being faster than the typical 15-cycle penalty, unless you have 99% MOVs and you put the equality test for the MOV opcode before the other tests.

Measure time to execute a single instruction

Is there a way, using C or assembler or maybe even C#, to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually - depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless, instead of completely meaningless, isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can itself be executed out of order as described above. To get meaningful results, you need to surround it with an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same thing, but with the sequence under test removed. That's leaving out quite a few details, of course - at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from the result of the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
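For reference, the same sequence can be written with compiler intrinsics instead of raw assembly (GCC/Clang on x86; a sketch, and still subject to every caveat above about what the number means):

#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>          /* __get_cpuid (GCC/Clang) */
#include <x86intrin.h>      /* __rdtsc */

static inline uint64_t serialized_rdtsc(void)
{
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);  /* CPUID serializes execution */
    return __rdtsc();
}

int main(void)
{
    /* warm up CPUID, as Intel recommends */
    serialized_rdtsc();
    serialized_rdtsc();
    serialized_rdtsc();

    uint64_t t0 = serialized_rdtsc();
    /* sequence under test goes here */
    uint64_t t1 = serialized_rdtsc();

    printf("delta: %llu ticks (includes measurement overhead)\n",
           (unsigned long long)(t1 - t0));
    return 0;
}

As in the assembly version, you'd run it once with an empty test sequence to measure the overhead and subtract that from the real measurement.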
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
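A sketch of that harness in C (clock_gettime is POSIX; the iteration count and the loop body are placeholders to tune, and the counters are volatile so the optimizer can't delete the loops outright):

#include <stdio.h>
#include <time.h>

#define ITERS 10000000UL

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t0, t1, t2;
    volatile unsigned long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (volatile unsigned long i = 0; i < ITERS; i++)
        ;                            /* empty loop: measures the overhead */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (volatile unsigned long i = 0; i < ITERS; i++)
        sink += i;                   /* loop with the code under test */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    double overhead = seconds(t0, t1);
    double total    = seconds(t1, t2);
    printf("per-iteration cost: %.3g ns\n",
           (total - overhead) / ITERS * 1e9);
    return 0;
}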
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C if you are conversant with inline assembly. Others have posted more elegant solutions for getting a measurement without the repetition, but the repetition technique is always available - for example, on an embedded processor that doesn't have the fancy timing instructions mentioned by others.
Note, however, that on modern pipelined processors, instruction-level parallelism may confound your results: because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS, or any of the others is that there are lots of processes already running on your machine in the background, which will impact performance. The only real way of determining the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and those results are available to the public - all you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD) with the results. You can do it yourself, but be warned that Intel processors especially contain many redundant instructions that are no longer desirable, let alone necessary, so this will take up a lot of your time - though I can absolutely see the fun in doing it. P.S. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it from the number of clock cycles the ADD instruction requires, divided by the clock rate of the CPU (for example, a 1-cycle ADD on a 100 MHz core takes 10 ns). Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
