x86 assembly instruction execution count - c

Hello everyone
I have a program and I want to find the number of times each assembly instruction is executed. I don't care whether it's done through profiling or emulation, but I want high-precision results. I once came across a forum post with some scripting code to do this, but I lost the link. Can anyone help me brainstorm some ways to do so?
Regards
Edit:
Okay, I think I am halfway there. I have done some research on the Branch Trace Store (BTS) facility described in the Intel manual, volume 3A, section 16.4.5, as mentioned in one of the answers. This feature provides branch history. So now I need your help finding open-source scripts or tools that use it. Looking forward to your feedback.
cheers=)!

If your processor supports it, you can enable Branch Trace Store (BTS). BTS stores a log of all of the taken branches in a predefined area in memory. Each entry contains the branch source and destination. Using that, you can count how many times you were in each code segment.
Look at volume 3A of the Intel Software Developer's Manual, section 16.4.5 (in the current edition) for details on how to enable it.
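As a sketch of what "enabling it" involves, here are the relevant DEBUGCTL bits; the bit positions are taken from the SDM section cited above, so double-check them against your edition, and note that the DS save area describing the BTS buffer (set via IA32_DS_AREA) must be prepared first - that setup is omitted here.
/* hedged sketch: turning on BTS from ring 0 (e.g., a kernel module).
   Assumes the DS save area pointed to by ds_area already describes a
   valid BTS buffer. Bit positions per Intel SDM vol. 3A. */
#define MSR_IA32_DEBUGCTL  0x1d9
#define MSR_IA32_DS_AREA   0x600
#define DEBUGCTL_TR        (1ULL << 6)  /* generate branch trace records */
#define DEBUGCTL_BTS       (1ULL << 7)  /* store them in the BTS buffer  */
#define DEBUGCTL_BTINT     (1ULL << 8)  /* interrupt when buffer fills   */

static inline void wrmsr64(unsigned int msr, unsigned long long v)
{
    __asm__ volatile("wrmsr" :: "c"(msr),
                     "a"((unsigned int)v), "d"((unsigned int)(v >> 32)));
}

static void enable_bts(void *ds_area)
{
    wrmsr64(MSR_IA32_DS_AREA, (unsigned long long)(unsigned long)ds_area);
    wrmsr64(MSR_IA32_DEBUGCTL, DEBUGCTL_TR | DEBUGCTL_BTS | DEBUGCTL_BTINT);
}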

If you do not care about performance, there is a small trick to count that: set the trap flag so that a single-step exception is raised after each instruction, and upon entering your custom SEH handler, count the instruction, re-arm the flag, and step on to the next one.
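A sketch of that trick on Windows, using a vectored handler rather than frame-based SEH for simplicity. Names are illustrative, and a real tracer would also have to avoid counting its own handler and runtime code:
/* hedged sketch: count executed instructions by re-arming the trap
   flag (TF, bit 8 of EFlags) from a vectored exception handler. */
#include <windows.h>
#include <intrin.h>

static volatile LONG64 g_instructions;

static LONG CALLBACK step_handler(PEXCEPTION_POINTERS info)
{
    if (info->ExceptionRecord->ExceptionCode == EXCEPTION_SINGLE_STEP) {
        g_instructions++;                      /* one instruction retired */
        info->ContextRecord->EFlags |= 0x100;  /* re-arm TF: trap on next */
        return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

static void start_single_stepping(void)
{
    AddVectoredExceptionHandler(1, step_handler);
    __writeeflags(__readeflags() | 0x100);     /* set TF once to begin */
}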
Profiling tools like Pin or Valgrind may do this for you in an easier manner; I would suggest you take a look.

One (albeit slow) method would be to write your own debugger. It would set a breakpoint on the entry point of your program, and when that was hit it would set the trap flag (TF) in the EFlags of the thread context, so it would break into the debugger on the next instruction as well. You could then use a hash table keyed by EIP to count the number of times each instruction is hit.
The only problem is that the overhead would be extreme and the application would run very slowly.
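On Linux, the same idea can be sketched with ptrace. This is a hedged sketch: error handling is minimal, the per-instruction-pointer hash table is left as a comment, and the count includes the dynamic loader's instructions too.
/* hedged sketch (x86_64): single-step a child and count steps per RIP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s prog [args]\n", argv[0]); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execv(argv[1], &argv[1]);          /* child stops at exec */
        _exit(127);
    }

    int status;
    unsigned long long steps = 0;
    waitpid(pid, &status, 0);
    while (WIFSTOPPED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, 0, &regs);
        /* here: increment a hash-table counter keyed by regs.rip */
        steps++;
        ptrace(PTRACE_SINGLESTEP, pid, 0, 0);
        waitpid(pid, &status, 0);
    }
    printf("executed %llu instructions\n", steps);
    return 0;
}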

Related

How to benchmark some algorithms for Cortex-M architecture

For my current project, I have to investigate the runtime behavior (cycles used) of different algorithms on a Cortex-M4. The algorithms are pure computation in C, with no I/O and no interrupts. Any hints or ideas on how to do this?
My current idea is to create a minimal application and use Renode (https://renode.io/) for cycle counting:
Create a file test.c with one function with fixed signature that runs my algorithm
Compile and link it to form a minimal application
Load the application and the needed input data into renode
Run the application
Extract the output data from renode
Use the profiling data from renode to rate the algorithms
And now the questions:
Has anyone used renode or QEMU for similar purposes?
How to create a true minimal application? (crt0, ld flags)
Any other ideas for my problem?
How to configure a minimal system in Renode? Which components are a minimal subset needed to successfully run a C program?
Regards
Jan
FYI: I work at Antmicro and am one of the authors of Renode
There are really many ways to perform such profiling. Note that Renode is not cycle-accurate, but you can track virtual time progression.
One of the possible approaches would be to use Renode's metrics analyzer. You can read the docs here: https://renode.readthedocs.io/en/latest/basic/metrics.html
It allows you to capture data and analyze it in Python or generate some graphs straight away:
# in Renode
(monitor) machine EnableProfiler "path_to_dump_file"
# in Bash
python3 tools/metrics_analyzer/metrics_visualizer/metrics-visualizer.py path_to_dump_file
You can also analyze the virtual time passed until a specific string appears on UART. This can be done with a Robot test. An example of timestamp extraction can be found here: https://github.com/renode/renode/blob/master/tests/platforms/QuarkC1000/QuarkC1000.robot#L44
${r} Wait For Line On Uart My String
Do Something With Time ${r.timestamp}
Another option would be to instrument your code and dump binary data from memory, if needed.
You can also add hooks to be called at a specific program counter value; there you can dump a timestamp to the log.
There are possibly many other options to move forward, but it would depend on your specific use case.
Minimal system in Renode: depending on your software, it would require
a core
an NVIC interrupt controller, if it's a Cortex-M
memory
a UART, if you want output.
UPDATE:
We have added some tracing features that allow you to use https://www.speedscope.app/ or https://ui.perfetto.dev/ to display traces of execution, very useful in profiling.
The quick way to enable it for speedscope is:
cpu EnableProfilerCollapsedStack #path/to/trace true
For more details please see this chapter in the docs: https://renode.readthedocs.io/en/latest/advanced/execution-tracing.html
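On the "true minimal application (crt0, ld flags)" part of the question: about the smallest bare-metal Cortex-M program is a vector table plus a reset handler. A hedged sketch follows; the symbol names (_estack, _sidata, _sdata, _edata, _sbss, _ebss) and the .isr_vector section are conventional linker-script assumptions, not mandated, and a link along the lines of arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -nostartfiles -T app.ld is assumed.
/* minimal Cortex-M startup sketch, no libc */
#include <stdint.h>

extern uint32_t _estack, _sidata, _sdata, _edata, _sbss, _ebss;
extern int main(void);

void Reset_Handler(void)
{
    /* copy .data from flash to RAM */
    uint32_t *src = &_sidata, *dst = &_sdata;
    while (dst < &_edata)
        *dst++ = *src++;
    /* zero .bss */
    for (dst = &_sbss; dst < &_ebss; dst++)
        *dst = 0;
    main();
    for (;;) ;                      /* trap if main returns */
}

/* vector table: initial SP and reset vector are enough to boot */
__attribute__((section(".isr_vector"), used))
static void (* const vectors[])(void) = {
    (void (*)(void))&_estack,       /* initial stack pointer */
    Reset_Handler,                  /* reset */
};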

How does the program counter in register 15 expose the pipeline?

I have a question assigned to me that states:
"The ARM puts the program counter in register r15, making it visible to the programmer. Someone writing about the ARM stated that this exposed the ARM's pipeline. What did he mean, and why?"
We haven't talked about pipelines in class yet so I don't know what that is and I'm having a hard time understanding the material online. Can someone help me out with either answering the question or helping me understand so I can form my own answer?
Thanks!
An exposed pipeline means one where the programmer needs to take the pipeline into account, and I would argue that the r15 offset is nothing more than an encoding constant.
By making the PC visible to the programmer, yes, some fragment of the early implementation detail has been 'exposed' as an architectural oddity which needs to be maintained by future implementations.
This wouldn't have been worthy of comment if the offset designed into the architecture had been zero; there would then have been no optimisation possible for simple 3-stage pipelines, and everyone would have been none the wiser.
Nothing is 'exported' from the pipeline, not in the way that trace or debug facilities let you snoop on timing behaviour as code is running; this feature is just part of the illusion that the processor hardware presents to the programmer (similar to each instruction appearing to execute in program order).
A problem with novel tricks like this is that people like to write questions about them, and those questions can easily be poorly phrased. They also neglect the fact that even if the pipeline is 3-stage, it takes only a single special case to require the gates for an offset calculation (even if those gates don't consume power in typical operation).
Having PC-relative instructions is rather common. Having implementation-optimised encodings for how the offset is calculated is also common - for example, the IBM 650.
Retrocomputing.SE is an interesting place to learn about some of the things that relate to the evolution of modern computers.
It doesn't, really; what they are probably talking about is that the program counter reads as two instructions ahead of the instruction being executed (in ARM state, an instruction at 0x1000 that reads pc sees 0x1008). But that doesn't mean it's a two- or three-deep pipe, if it ever was. It exposes nothing at this point, in the same way that the branch shadow in MIPS exposes nothing. There is the textbook MIPS and there is reality.
There is nothing magical about a pipeline; it is a computer version of an assembly line. You can build a car in place and bring the engine, the doors, the wheels, etc. to the car. Or you can move the car through an assembly line and have a station for the doors, a station for the wheels, and so on. You have many cars being built at once, and a car comes out of the building every few minutes, but that doesn't mean a new car takes only a few minutes to build. It just means the slowest station takes a few minutes; front to back, each car takes roughly the same amount of time.
An instruction has several fairly obvious steps: an add requires that you fetch the instruction, decode it, gather up the operands, feed them to an adder (ALU), and store the result. Other instructions have similar steps and a similar number of them.
Your textbook will use terms like fetch, decode, execute. So suppose you fetch an instruction at 0x1000, then one at 0x1004, then one at 0x1008, hoping the code runs linearly with no branches. While 0x1004 is being fetched, 0x1000 is being decoded; while 0x1008 is being fetched, 0x1004 is being decoded and 0x1000 might be up to execution, it depends. One might then think: when 0x1000 is executing, the program counter is fetching 0x1008, so that tells me how the pipeline works. Well, it doesn't. I could have a 10000-deep pipeline and have the program counter that an instruction sees be any address I like relative to that instruction's address; I could have it be 0x1000 for the instruction at 0x1000 and still have a 12345-deep pipeline. It's just a definition. It might at some point in history have been put in place because of a real design with a real pipe, or it could always have just been defined that way.
What does matter is that the definition is stated and honored by the instruction set: if they say the PC is the instruction's address plus some offset, then it needs to always be that, or the exceptions need to be documented, and implementations need to match those definitions. Done. Programmers can then program, compilers can be written, etc.
A textbook problem with pipelines (not saying it isn't real): say we run out of V8 engines while we have 12 trucks in the assembly line. We have been building trucks on that line for some period, and now we will build cars with V6s. The V8 engines are coming by slow boat, but we have car parts ready now, so let's move the trucks off the line and start the line over. For N steps of the assembly line, no cars come out the other end of the building; once the first car makes it to the end, a new car arrives every few minutes. We had to flush the assembly line. The same goes for instructions: while running 0x1000, 0x1004, 0x1008, etc., if you can keep the pipeline moving it's more efficient. But what if 0x1004 is a branch to 0x1100? We might have the instructions for 0x1008 and 0x100c in the pipe, so we have to flush the pipe and start fetching from 0x1100, and it takes some number of clock cycles before we are completing instructions again; ideally we then complete one per clock until the next branch. So if you read the classic textbook on the subject, where MIPS or an educational predecessor is used, they have this notion of a branch shadow (a branch delay slot, or some other similar term): the instruction after the branch is always executed.
So instead of flushing N instructions out of the pipe, you flush N-1 instructions, and you gain an extra clock to get the next instruction after the branch into the pipeline. That's how MIPS works by default, but when you buy a real core you can turn that off and have it not execute the instruction after the branch. It's a great textbook illustration, and it was probably real and probably still is real for "let's build a MIPS" classes in computer engineering. But the pipelines in use today don't wait that long to see that the pipe is going to be empty; they can start fetching early and sometimes gain more than one clock rather than flushing the whole pipe. If it ever did, this does not currently give MIPS any kind of advantage over other designs, nor does it give us exposure to their pipeline.

How do I benchmark or trace a specific function in the Linux Kernel?

How do I use ftrace (or anything else) to trace a specific, user-defined function in the Linux kernel? I'm trying to create and run some microbenchmarks, so I'd like to know how long certain functions take to run. I've read through the documentation (at least as much as I can), but a step in the right direction would be awesome.
I'm leaning towards ftrace, but I'm having issues getting it to work on Ubuntu 14.04.
Here are a couple of options you may have depending on the version of the kernel you are on:
SystemTap - this is the ideal way. Check the examples that come with stap; you may find something ready to use with minimal modification.
OProfile - an option on older kernels, although stap gives better precision compared to OProfile.
debugfs with the stack tracer option - good for stack-overrun debugging. To use it, turn on the depth-checking functions by mounting debugfs and then echo 1 > /proc/sys/kernel/stack_tracer_enabled.
strace - if you want to identify the system calls made by a user-space program, along with some performance numbers: use strace -fc <program name>
Hope this helps!
Ftrace is a good option and has good documentation.
Use WARN_ON(); it will print a stack trace showing how the function was called.
For time tracing, I think you should use the timestamps shown in the kernel log, or the jiffies counter.
SystemTap will also be useful in your situation. SystemTap is a tool for which you write code in its own scripting language. It is very powerful, but if you only want to know the execution time of a particular function, ftrace is better; if you need a very advanced tool to analyze, e.g., performance problems in kernel space, it can be very helpful.
Please read more; what you want to do is described in the section "5.2 Timing function execution times".
If the function's execution time is interesting because it makes subsidiary calls to slow/blocking functions, then statement-by-statement tracing could work for you, without too much distortion due to the "probe effect" overheads of the instrumentation itself.
probe kernel.statement("function_name@dir/file.c:*") { println(tid(), " ", gettimeofday_us(), " ", pn()) }
will give you a trace of each separate statement in the function_name. Deltas between adjacent statements are easily computed by hand or by a larger script. See also https://sourceware.org/systemtap/examples/#profiling/linetimes.stp
To get the precision that I needed (CPU cycles), I ended up using get_cycles(), which is essentially a wrapper for RDTSC (but portable). ftrace may still be beneficial in the future, but for now I'm just taking the difference between the start and end CPU cycle counts and using that as a benchmark.
Update: to keep out-of-order execution from skewing the measurement, I actually ended up wrapping RDTSCP instead. I couldn't use RDTSC + CPUID because the CPUID caused a lot of delay from hypercalls (I'm working in a VM).
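For reference, a sketch of the RDTSCP wrapper I describe above (GCC/Clang inline assembly, x86_64; the function name is just mine). RDTSCP waits for prior instructions to retire, and the trailing LFENCE keeps later instructions from starting before the read:
#include <stdint.h>

static inline uint64_t rdtscp_now(void)
{
    uint32_t lo, hi, aux;  /* aux receives the processor ID, unused here */
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    __asm__ volatile("lfence" ::: "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* usage: uint64_t t0 = rdtscp_now(); work(); uint64_t dt = rdtscp_now() - t0; */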
Use systemtap and try this script:
https://github.com/openresty/stapxx#func-latency-distr

Reverse engineering a firmware - what's up with every fourth byte?

So I decided to grab my tools and analyze a router firmware. It went pretty okay up to the point where I had to find segments manually. I won't bother you with that, and I really don't want to ask about hacking anything or for anyone to do the work for me. But there is a pattern I'm sure someone can explain to me.
Looking at the hexdump, every fourth byte is almost always the same (hexdump screenshot omitted). There are strings that break the pattern, but it goes on almost to the end of the file.
What on earth can cause this pattern?
(If anyone's willing to help but needs more info: VxWorks 5.5.1, probably an ARM9E CPU.)
It is an ARM; go look at the ARM documentation and you will see that for the 32-bit (non-Thumb) ARM instructions, the top four bits are the condition code. The code 0b1110 is "always" (AL). Most of the time you don't use conditional execution, so most ARM instructions start with 0xE, which makes it very easy to pick out an ARM binary. The 16-bit Thumb instructions also have a similar pattern, but for different reasons; adding Thumb-2 changes that somewhat...
That's just due to how ARM's opcodes are mapped, and it actually helps me "eyeball" a dump to see if it's ARM code.
I would suggest you go through part of the ARM Architecture Reference Manual to see how opcodes are generated, particularly the conditionals. The E appears when you want something to happen unconditionally.
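That eyeballing heuristic is easy to automate. A hedged sketch (the file name is just an example): it counts little-endian 32-bit words whose top nibble is 0xE, i.e. condition code "always":
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "firmware.bin", "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t w[4];
    unsigned long total = 0, cond_always = 0;
    while (fread(w, 1, 4, f) == 4) {   /* read little-endian words */
        total++;
        if ((w[3] & 0xF0) == 0xE0)     /* bits 31:28 == 0b1110 (AL) */
            cond_always++;
    }
    fclose(f);

    printf("%lu of %lu words start with 0xE (%.0f%%)\n",
           cond_always, total, total ? 100.0 * cond_always / total : 0.0);
    return 0;
}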

Self-modifying code for trace hooks?

I'm looking for the lowest-overhead way of inserting trace/logging hooks into some very performance-sensitive driver code. The logging has to always be compiled in, but most of the time it should do nothing (and do that nothing very fast).
There isn't anything much simpler than having a global on/off word and doing if (enabled) { log(); }. However, if possible, I'd like to avoid even the cost of loading that word every time I hit one of my hooks. It occurs to me that I could potentially use self-modifying code for this -- i.e., everywhere I have a call to my trace function, I overwrite the jump with a NOP when I want to disable the hooks, and put the jump back when I want to enable them.
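For concreteness, the baseline I'm comparing against is just this (names are illustrative):
#include <stdbool.h>
#include <stdio.h>

/* one global on/off word; volatile so a toggle is seen across calls */
static volatile bool trace_enabled;

static void trace_log(const char *msg) { fprintf(stderr, "%s\n", msg); }

#define TRACE(msg)                               \
    do {                                         \
        if (__builtin_expect(trace_enabled, 0))  \
            trace_log(msg);                      \
    } while (0)

void hot_path(void)
{
    /* disabled case costs one well-predicted load-and-branch */
    TRACE("entered hot_path");
}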
A quick Google search doesn't turn up any prior art on this -- has anyone done it? Is it feasible, and are there any major stumbling blocks I'm not foreseeing?
(Linux, x86_64)
Yes, this technique has been implemented within the Linux kernel, for exactly the same purpose (tracing hooks).
See the LWN article on Jump Labels for a starting point.
There aren't really any major stumbling blocks, just a few minor ones: multithreaded processes (you will have to stop all other threads while you're enabling or disabling the code), and instruction-cache coherence (you'll need to ensure the I-cache is flushed, on every core).
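In current kernels the jump-label machinery is exposed as static keys. A usage sketch (kernel code; the key and function names are illustrative, and the enable/disable calls must not be made from atomic context):
#include <linux/jump_label.h>

/* a static key defaulting to "off": the hook site compiles to a NOP */
static DEFINE_STATIC_KEY_FALSE(my_trace_key);

static void my_trace_log(void) { /* ... */ }

void hot_path(void)
{
        /* this site is patched between NOP and JMP when the key toggles */
        if (static_branch_unlikely(&my_trace_key))
                my_trace_log();
}

void enable_tracing(void)  { static_branch_enable(&my_trace_key); }
void disable_tracing(void) { static_branch_disable(&my_trace_key); }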
Does it matter if your compiled driver is suddenly twice as large?
Build two code paths -- one with logging, one without. Use global function pointers to jump into the performance-sensitive sections, and overwrite them as appropriate.
If there were a way to declare a register global (GCC's global register variables are such an extension), you could load that register with the value of your word at every entry point into your driver from the outside and then just check the register. Of course, you'd then be denying the optimizer the use of that register, which might have unpleasant performance consequences.
I'm writing not so much on whether this is possible, but on whether you gain anything significant.
On the one hand, you don't want to test "logging enabled" every time a logging opportunity presents itself; on the other, you need to test "logging enabled" anyway in order to overwrite the code with either the yes-case or the no-case version. Or does your driver "remember" that the answer was no last time, so that when no is requested again nothing needs to be done?
The logic necessary does not appear to be trivial compared to testing every time.
