ARM Cortex-A9 second execution unit

I am trying to understand how the execution stage of the ARM Cortex-A9 works and which types of instructions are dispatched to the second execution unit (ALU). So far I have only been able to find limited references, and they were not very helpful. If any of you know anything about execution in the second execution unit of the Cortex-A9, or can point me to related references, please let me know! We can also discuss it further here in the forum directly. Looking forward to your input.
Thanks & regards.

This is probably not the right place for queries like this; this community expects you to ask a specific technical question and get an answer (if you're lucky). Your "question" is more of an attempt to start a discussion.
Have a look here:
https://community.arm.com/content?query=cortex-a9
And, of course, there are tons of documents on www.arm.com

Related

Reverse engineering a firmware - what's up with every fourth byte?

So I decided to grab my tools and analyze a router firmware. It went pretty well up to the point where I had to find segments manually. I won't bother you with that, and I really don't want to ask about hacking anything or ask anyone to do me a favor. But there is a pattern I'm sure someone could explain to me.
Looking at the hexdump, there is an obvious pattern in every fourth byte.
There are strings that break the pattern but it goes all the way down almost to the end of the file.
What on earth can cause this pattern?
(if anyone's willing to help but needs more info: VxWorks 5.5.1 / probably ARM-9E CPU)
It is an ARM. Go look at the ARM documentation and you will see that, for the 32-bit (non-Thumb) ARM instructions, the first four bits are the condition code. The code 0b1110 means "always"; most of the time you don't use conditional execution, so most ARM instructions start with 0xE. That makes it very easy to pick out an ARM binary. The 16-bit Thumb instructions show a similar pattern, though for different reasons, and mixing in Thumb-2 changes it somewhat.
That's just due to how ARM's opcodes are mapped, and it actually helps me "eyeball" a dump to see if it's ARM code.
I would suggest you go through part of the ARM Architecture Reference Manual to see how opcodes are encoded, particularly the conditionals. The E appears whenever you want something to happen unconditionally.
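As a rough illustration of that "eyeballing" trick, here is a minimal C sketch (assuming a little-endian A32 image whose path is passed on the command line; the heuristic is ours, not from the thread) that counts how many 32-bit words carry the AL (0b1110) condition code:

    /* Heuristic: count 32-bit words whose condition field is 0b1110 (AL,
     * "always").  A high ratio suggests little-endian, non-Thumb ARM code. */
    #include <stdio.h>
    #include <stdint.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <dump>\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        uint8_t w[4];
        unsigned long total = 0, always = 0;
        while (fread(w, 1, 4, f) == 4) {
            total++;
            /* little-endian word: w[3] is the most significant byte,
             * and its top nibble holds the condition code */
            if ((w[3] >> 4) == 0xE)
                always++;
        }
        fclose(f);

        if (total)
            printf("%lu of %lu words (%.0f%%) have condition AL (0xE)\n",
                   always, total, 100.0 * always / total);
        return 0;
    }

On a dump that is mostly A32 code the reported percentage tends to be high; on random data it hovers around 6% (1 in 16).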

How much faster is C than R in practice?

I wrote a Gibbs sampler in R and decided to port it to C to see whether it would be faster. A lot of pages I have looked at claim that C will be up to 50 times faster, but every time I have used it, it's only about five or six times faster than R. My question is: is this to be expected, or are there tricks which I am not using which would make my C code significantly faster than this (like how using vectorization speeds up code in R)? I basically took the code and rewrote it in C, replacing matrix operations with for loops and making all the variables pointers.
Also, does anyone know of good resources for C from the point of view of an R programmer? There's an excellent book called The Art of R Programming by Matloff, but it seems to be written from the perspective of someone who already knows C.
Also, the screen tends to freeze when my C code is running in the standard R GUI for Windows. It doesn't crash; it unfreezes once the code has finished running, but it stops me from doing anything else in the GUI. Does anybody know how I could avoid this? I am calling the function using .C().
Many of the existing posts have explicit examples you can run; for example, Darren Wilkinson has several posts on his blog analyzing this in different languages, and later even on different hardware (e.g. comparing his high-end laptop to his netbook and to a Raspberry Pi). Some of his posts are
the initial (then revised) post
another later post
and there are many more on his site -- these often compare C, Java, Python and more.
Now, I also turned this into a version using Rcpp -- see this blog post. We also used the same example in a comparison between Julia, Python and R/C++ at useR this summer, so you should find plenty of other examples and references. MCMC is widely used, and "easy pickings" for speedups.
Given these examples, allow me to add that I disagree with the two earlier comments your question received. The speed will not be the same; it is easy to do better in an example such as this, and your C/C++ skills will largely determine how much better.
Finally, an often overlooked aspect is that the speed of the RNG matters a lot. Running down loops and adding things up is cheap -- doing "good" draws is not, and a lot of inter-system variation comes from that too.
As for the GUI freezing, you might want to call R_CheckUserInterrupt() and perhaps R_ProcessEvents() every now and then.
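To make that concrete, here is a minimal sketch of what the C side of a .C() call could look like; gibbs_c and its arguments are invented for illustration, the interrupt check is the only point being made:

    /* A long-running routine called from R via .C().  Calling
     * R_CheckUserInterrupt() periodically lets R handle a pending
     * interrupt (Esc / Ctrl-C) instead of appearing frozen until the
     * loop finishes. */
    #include <R.h>
    #include <R_ext/Utils.h>   /* R_CheckUserInterrupt */

    void gibbs_c(int *n_iter, double *out)
    {
        for (int i = 0; i < *n_iter; i++) {
            /* ... one sweep of the sampler would go here ... */
            out[i] = (double) i;

            if (i % 1000 == 0)          /* don't check on every iteration */
                R_CheckUserInterrupt();
        }
    }

One caveat: R_CheckUserInterrupt() does not return if the user has interrupted (it jumps back to R's top level), so memory obtained with malloc() inside the routine would leak; R_alloc() avoids that.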
I would say C, done properly, is much faster than R.
Some easy gains you could try:
Set the compiler to optimize for speed (e.g. -O2 or -O3 with GCC).
Compile with the -march flag (e.g. -march=native) so the compiler can use your CPU's instruction set extensions.
Also, if you're using Visual Studio, make sure you're compiling with the Release configuration, not Debug.
Your observed performance difference will depend on a number of things: the type of operations you are doing, how you write the C code, what kind of compiler-level optimizations you use, your target CPU architecture, and so on.
You can write basic, sloppy C and get something that works and runs with decent efficiency. You can also fine-tune your code for the unique characteristics of your target CPU - perhaps invoking specialized assembly instructions - and squeeze every last drop of performance out of it. You could even write code that runs significantly slower than the R version. C gives you a lot of flexibility. The limiting factor here is how much time you want to put into writing and optimizing the C code.
The reverse is also true (duplicate the previous paragraph here, but swap "C" and "R").
I'm not trying to sound facetious, but there's really not a straightforward answer to your question. The only way to tell how much faster your C version would be is to write the code both ways and benchmark them.
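If it helps, a minimal POSIX timing harness for the C side could look like the sketch below; run_sampler() is a placeholder for whatever kernel is being compared, and on the R side system.time() plays the same role:

    /* Minimal wall-clock timing sketch for benchmarking a C kernel. */
    #include <stdio.h>
    #include <time.h>

    static void run_sampler(void)
    {
        /* ... the code under test; a dummy loop stands in here ... */
        volatile double x = 0.0;
        for (long i = 0; i < 10000000L; i++)
            x += 1.0 / (double)(i + 1);
    }

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_sampler();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("elapsed: %.3f s\n", secs);
        return 0;
    }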

Call (dependency) graph at the level of instruction

I was wondering: is there a tool I can use (on a C program) that would generate a call graph at the level of individual instructions, taking into account the dependencies of each instruction on other instructions? Something like a "dependency graph", but at the level of the instructions in a program. I took the idea from chapter 27 of the new Cormen book (see for example p. 778), but I won't even try to hack anything together if there's a tool already available. (If you want, Chapter 27 is online here.) Thanks for any help.
Any optimizing compiler for C should be doing this kind of control-flow and data-dependence analysis internally.
On the other hand, I have no idea how easy it is to get the graph out of it (in the standalone-tool sense).
If you're taking inspiration from Figure 27.2 on page 778 of the Cormen/Rivest book, note that it is not a call graph in the usual sense.
It is a call tree, in which the nodes are execution instances of a function, not the function itself.
It's the call tree of a particular execution of the program, elaborated with information about the variables in each instance, and information about the parallelism.
To get such a complete call tree you're going to have to basically trace the entire execution. With different arguments, you will get a different trace.
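A toy C sketch of the difference (fib and the node ids are ours, echoing the P-FIB example in that chapter): every node below is a separate execution instance of the same function, and running it with a different argument produces a different tree.

    /* Trace the call tree of one particular execution: each execution
     * instance of fib() gets its own id, and caller->callee edges are
     * printed as they happen. */
    #include <stdio.h>

    static int next_id = 0;

    static long fib(int n, int parent_id)
    {
        int id = next_id++;                 /* this execution instance */
        printf("node %d: fib(%d), called by node %d\n", id, n, parent_id);
        if (n < 2)
            return n;
        return fib(n - 1, id) + fib(n - 2, id);
    }

    int main(void)
    {
        fib(4, -1);       /* -1 marks the root of the tree */
        return 0;
    }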
It might be easier to help if your overall goal were more clear.

x86 assembly instruction execution count

Hello everyone
I have some code and I want to find out how many times each assembly instruction is executed. I don't care whether it's done through profiling or emulation, but I want high-precision results. I once came across a forum post that gave some scripting code to do this, but I lost the link. Can anyone help me brainstorm some ways to do it?
Regards
Edit:
Okay, I think I am halfway there. I have done some research on Branch Trace Store (BTS), described in the Intel manual, volume 3A, section 16.4.5, as mentioned in one of the answers. This feature provides a branch history. So now I need your help finding open-source scripts or tools that use it. I'm waiting for your feedback.
cheers=)!
If your processor supports it, you can enable Branch Trace Store (BTS). BTS stores a log of all of the taken branches in a predefined area in memory. Each entry contains the branch source and destination. Using that, you can count how many times you were in each code segment.
Look at volume 3A of the Intel Software Developer's Manual, section 16.4.5 (in the current edition) for details on how to enable it.
If you do not care about performance, there is a small trick you can use to count this. Raise a single-step exception, and upon entering your custom SEH handler, raise another one and step to the next instruction.
Maybe profiling tools like Pin or Valgrind do that for you in an easier manner. I would suggest you take a look.
One (although slow) method would be to write your own debugger. It would set a breakpoint on the entry point of your program, and when it was hit it would set the trace flag (TF) in EFLAGS in the thread context, so it would break into the debugger on the next instruction as well. You could then use a hash table keyed on EIP to count the number of times each instruction was hit.
Only problem is that the overhead would be extreme and the application would run very slowly.
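For what it's worth, a rough Linux/x86-64 sketch of the same idea using ptrace() single-stepping (rather than a Windows debugger loop) could look like this; only the total is counted here, and the hash table keyed on regs.rip is left as a comment:

    /* Count instructions executed by a child process by single-stepping
     * it with ptrace.  Expect it to be very slow, as noted above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <program> [args]\n", argv[0]); return 1; }

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], &argv[1]);
            _exit(127);                       /* exec failed */
        }

        int status;
        waitpid(child, &status, 0);           /* child stops at exec */

        unsigned long long steps = 0;
        struct user_regs_struct regs;
        while (1) {
            if (ptrace(PTRACE_SINGLESTEP, child, NULL, NULL) == -1)
                break;
            waitpid(child, &status, 0);
            if (WIFEXITED(status))
                break;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            /* regs.rip is the address of the next instruction; bump its
             * counter in a hash table here for per-instruction counts */
            steps++;
        }

        printf("instructions executed: %llu\n", steps);
        return 0;
    }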

Any references on Dynamic Code Analysis?

Yesterday I was reading about debugging techniques and found Valgrind to be really interesting. It seems to use techniques from dynamic code analysis. I also followed a link from the original reference to something else called path profiling.
I tried Googling but I guess I am using the wrong terms to search for a good reference on these concepts. Can someone suggest a good resource taking into account that I do not have a background in compilers and programming languages?
Path profiling is interesting as a theoretical problem. gprof is also interesting, because it deals in call graphs, cyclical subgraphs, and such. There are nice algorithms for manipulating this information and propagating measurements throughout a structure.
All of which might tempt you to think it works for finding general performance problems (though they never say it does).
However, suppose your program hangs. How do you find the problem?
What I do is get it into the infinite loop, and then interrupt (pause) it to see what it's doing. I look at the code on each level of the call stack, because I know the loop is somewhere on the stack. If it's not obvious, I just step it along until I see it repeating itself, and then I know where the problem is. I suspect almost anyone would do that.
In fact, if you stop the program while it's taking too long and examine its state several times, you can not only find infinite loops, but almost any problem where the program runs longer than you would like.
There are profiler tools based on this concept, such as Zoom and LTProf, but for my money nothing gives as much insight as thoroughly understanding representative snapshots.
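Not the manual method described above, but a rough Linux/glibc sketch of automating the same kind of snapshot: a SIGPROF timer fires periodically and the handler dumps the current call stack with backtrace(). The interval and helper names are ours; compile with -rdynamic to get readable symbol names.

    /* Crude self-sampling: dump the call stack every 250 ms of CPU time. */
    #include <execinfo.h>
    #include <signal.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void sample_handler(int sig)
    {
        (void)sig;
        void *frames[64];
        int n = backtrace(frames, 64);
        write(STDERR_FILENO, "---- snapshot ----\n", 19);
        backtrace_symbols_fd(frames, n, STDERR_FILENO);   /* avoids malloc */
    }

    static void start_sampling(void)
    {
        /* prime backtrace() once outside the handler; its first call may
         * allocate while loading libgcc, which is unsafe in a handler */
        void *warmup[1];
        backtrace(warmup, 1);

        signal(SIGPROF, sample_handler);
        struct itimerval tv = { {0, 250000}, {0, 250000} };
        setitimer(ITIMER_PROF, &tv, NULL);
    }

    /* something worth sampling */
    static double busy(void)
    {
        double x = 0.0;
        for (long i = 1; i < 200000000L; i++)
            x += 1.0 / (double)i;
        return x;
    }

    int main(void)
    {
        start_sampling();
        return busy() > 0.0 ? 0 : 1;
    }

Reading a handful of these snapshots and seeing which frames keep recurring is the programmatic analogue of pausing the program in a debugger several times.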
You won't find good references on this technique because (oddly) not many people are aware of it, and it's too simple to publish.
There's considerably more to say on the subject.
Actually, FWIW, I "published" an article on it, but it was only reviewed by an editor, and I don't think anyone's actually read it: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.
