I've created a listing file of my asm code using commands
cd c:\masm32\bin\
ml.exe /c /Fl"c:\path\file.lst" /Sc "c:\path\file.asm"
The lst file contains three columns: the first one is the hex address of a specific line, the third one is the opcode, but I don't understand the meaning of the values in the second column. I think it's called "timing" and the values are something like 2, 10m, or even 7m,3. What is the meaning of these numbers, what do they represent?
With the /Sc command-line switch, which generates instruction timings, each line has this syntax:
offset [[timing]] [[code]]
The offset is the offset from the beginning of the current code segment. The timing shows the number of cycles the processor needs to execute the instruction. The value of timing reflects the CPU type; for example, specifying the .386 directive produces instruction timings for the 80386 processor. If the statement generates code or data, code shows the numeric value in hexadecimal notation if the value is known at assembly time. If the value is calculated at run time, the assembler indicates what action is necessary to compute the value.
When assembling under the default .8086 directive, timing includes an effective address value if the instruction accesses memory. The 80186–80486 processors do not use effective address values. For more information on effective address timing, see the "Processor" section in the Reference book.
(source)
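For illustration, a couple of lines from a /Sc listing might look like this (hypothetical instructions and cycle counts; the exact numbers depend on the CPU directive in effect):

0005  2            8B D8           mov bx, ax
0007  7m,3         75 02           jne short skip

As I read the documentation, a plain number such as 2 is the cycle count, a suffix m flags a variable timing component (such as the effective-address calculation mentioned above), and a pair such as 7m,3 appears to give two timings for one instruction, e.g. the taken and not-taken cases of a conditional jump.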
I'm not sure how much I'd trust those timing values unless you're actually going to execute the code on an 80486 or earlier processor.
May I get an explanation about what opcode sequences are and how to find them in PE32 files?
I am trying to extract them from PE32 files.
what opcode sequences are
A CPU instruction is composed of one to multiple bytes, and each of those bytes has a different meaning.
An opcode (operation code) is the part of an instruction that defines the behavior of the instruction itself (as in, this instruction is an 'ADD', or an 'XOR', a NOP, etc.).
For x86 / x64 CPUs (IA-32; IA-32e in Intel lingo) an instruction is composed of at least an opcode (1 to 3 bytes), but can come with multiple other bytes (various prefixes, ModR/M, SIB, Disp. and Imm.) depending on its encoding.
"Opcode" is often used as a synonym for "instruction" (since the opcode defines the behavior of the instruction); therefore when you have multiple instructions you have an opcode sequence (which is a bit of a misnomer, since it's really an instruction sequence, unless all instructions in the sequence are composed only of opcodes).
how to find them in PE32 files?
As instructions can be multiple bytes long, you can't just start at a random location in the .text section (which, for a PE file, contains the executable code of the program). There's a specific location in the PE file - called the "entry point" - which defines the start of the program.
The entry point for a PE file is given by the AddressOfEntryPoint member of the IMAGE_OPTIONAL_HEADER structure (part of the PE header structures). Note that this member is an RVA, not a "full" VA.
From there you know you are at the start of an instruction. You can start disassembling / counting instructions from this point, following the encoding rules for instructions (these rules are explained to great length in the Intel and AMD manuals).
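If you want to find that location programmatically, here is a minimal C sketch (error handling trimmed, and it assumes a little-endian host, which matches the PE format's own byte order):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.exe\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t e_lfanew;
    fseek(f, 0x3C, SEEK_SET);                /* IMAGE_DOS_HEADER.e_lfanew */
    fread(&e_lfanew, 4, 1, f);

    unsigned char sig[4];
    fseek(f, e_lfanew, SEEK_SET);
    fread(sig, 4, 1, f);                     /* must be "PE\0\0" */
    if (sig[0] != 'P' || sig[1] != 'E' || sig[2] != 0 || sig[3] != 0) {
        fprintf(stderr, "not a PE file\n");
        return 1;
    }

    /* The optional header starts after the 4-byte signature and the
       20-byte IMAGE_FILE_HEADER; AddressOfEntryPoint sits at offset 16
       within the optional header. */
    uint32_t entry_rva;
    fseek(f, e_lfanew + 4 + 20 + 16, SEEK_SET);
    fread(&entry_rva, 4, 1, f);
    printf("AddressOfEntryPoint (RVA): 0x%08X\n", entry_rva);

    fclose(f);
    return 0;
}

Remember that this prints an RVA: to know where those bytes live in the file itself you still have to walk the section headers and map the RVA into the section that contains it.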
Most instructions are "fall-through", which means that once an instruction has executed, the next one to execute is the one immediately following it (this seems obvious, but!). The trick is that when there's a non-fall-through instruction, you must know what that instruction does to continue your disassembly (e.g. it might jump somewhere, go to a specific handler, etc.).
Use the radare2 library; it can extract opcode sequences very quickly.
I'm programming an STM8S microcontroller using STVD IDE. It uses the COSMIC compiler.
I found that there is a variable that is increased unexpectedly. When debugging, I found that there is a line in the assembly code that causes this variable to increase its value. It belongs to a function named c_lgadc. Sometimes this assembly line is called while no ADC-related function is shown in the call stack.
I don't understand where this code comes from and what this c_lgadc is. I have no function in my C code named c_lgadc.
Here is a screenshot of the disassembly.
UPDATE:
I don't know which C code I should examine, as the call stack is different every time this disassembly line is called.
I've noticed that when I step over or step into in the debugger, it comes to the last line of a specific timer ISR.
I've also noticed that the line with the second breakpoint is the one that causes the addition to my variable.
The line with the first breakpoint is always called 5 times, then the line with the second breakpoint is called once, and so on.
I'd like to know how I should debug this further to prevent the unexpected addition to my variable.
UPDATE2:
I found the following in the map file:
c_lgadc 0000f39c defined in (C:\Users\xxxxxxxx\CXSTM8\Lib\libm0.sm8)lgadc.o section .text
used in Debug\stm8s_it.o
I'm not sure whether this helps in clarifying the problem.
I've noticed that when I step over or step into in the debugger, it comes to the last line of a specific timer ISR.
So, this timer ISR increments a 4-byte integer variable, and this variable overlaps with your variable. How such overlapping occurs might be revealed by inspecting that ISR or the link map, or it may be that the index register X is not correctly set in the ISR.
The function c_lgadc looks like part of a runtime library. Judging from the context, it is probably an add-carry function, because it sits between the compare and unsigned right shift functions.
The c_l and c_lg prefixes for these functions are probably part of a scheme to indicate the types of the operands or their result.
As to your question, adc occurs in the instruction sets of several CPU architectures, namely the Intel x86 and Motorola 680x families. It means:
If the carry flag (set by unsigned arithmetic overflow or by a shift through the carry flag) is zero, the result is the operand unchanged.
If the carry flag is set, the result is the operand plus one.
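To illustrate the idea (a hypothetical sketch in C, not the actual Cosmic library source), adding two multi-byte integers on an 8-bit CPU is done byte by byte, with adc folding the carry from each byte into the next:

#include <stdint.h>

/* Add two 32-bit values stored as 4 bytes each, most significant byte
   first (the real helper's byte order depends on the compiler's ABI). */
void add32(uint8_t result[4], const uint8_t a[4], const uint8_t b[4])
{
    unsigned carry = 0;
    for (int i = 3; i >= 0; i--) {
        unsigned sum = a[i] + b[i] + carry;   /* this step is one adc */
        result[i] = (uint8_t)sum;             /* keep the low 8 bits */
        carry = sum >> 8;                     /* carry into the next byte */
    }
}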
I'm working in an embedded system and have "mapped" some defines to an array for inputs.
volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]
// section 2
if ( INPUT01 && INPUT02 ) {
writepin(outputpin, value);
}
If I want to read from Input 1, I can simply say newvariable = INPUT01, or I can compare data with Input 1, as in section 2 of my code. I'm not sure whether this is a normal way of mapping the name INPUT01 to an array position, or of representing an input pin in the first place. Each array value represents a binary pin; the values are read into the array by decoding a 16-bit port value. Question: Is using the defines and array like this reasonably efficient?
Yes, your solution is efficient.
Before the C compiler even sees your code, the C preprocessor substitutes INPUT_ARRAY[0] for INPUT01 and, similarly, INPUT_ARRAY[1] for INPUT02; so this substitution uses zero time and zero power at run time.
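For example, after preprocessing, section 2 of your code is literally:

if ( INPUT_ARRAY[0] && INPUT_ARRAY[1] ) {
    writepin(outputpin, value);
}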
Moreover, when the C compiler sees INPUT_ARRAY[1] in the preprocessed code, it folds the element offset (1 × sizeof(int)) into the base address of INPUT_ARRAY at compile time. Therefore, you get maximal efficiency at run time.
Admittedly, were you manually to turn your C compiler's optimizer off, as with the -O0 option of GCC, then it is conceivable that the compiler would emit assembly code to add the 1 at run time. So don't do that.
The only likely exception to the foregoing would be the case where the base address of INPUT_ARRAY is unknown to the compiler at compile time: not likely because INPUT_ARRAY were dynamically allocated on the heap (which would make little sense for hardware device addressing), but more plausibly because the base address were configurable during boot via device configuration registers. Some hardware does this; but if yours does, why, that is exactly the reason your MCU (or MPU) possesses an index-offset indirect addressing mode in the first place. Though this mode engages the MCU's integer arithmetic unit, [a] the mode does not multiply (multiplication being a power-hungry operation); and [b] anyway, the mode is such a normal, often-used mode that MCUs are invariably designed to support it efficiently: not perhaps as efficiently as precomputed direct addressing, but as efficiently as one can reasonably expect for such a use.

The MCU's manufacturer knows that device pins are things you need to address, and the engineer who designed your MCU will have given priority to making the index-offset indirect mode as efficient as possible for this and other reasons. (You could maybe still cheat the matter to save a few millijoules via self-modifying code, if your MCU even allowed that; but, as an engineer, you'd regret the cheat, I suspect, unless security and maintainability were non-issues to you. The problem probably is not much of a real problem: index-offset indirect addressing is the normal technique when the base address remains unknown until run time. If you really need to save that last millijoule, then you might not be using a C compiler for your code's inner loop anyway, but handcrafting assembly code.)
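For instance (hypothetical names and address), the usual C idiom for a base address that is only known at boot is simply a pointer, and the compiler then emits index-offset indirect addressing for you:

volatile int *input_array;        /* base set once during boot */

#define INPUT01 input_array[0]    /* compiles to a load via base + offset */

void boot_init(void)
{
    /* illustrative address; in reality read from configuration registers */
    input_array = (volatile int *)0x40001000;
}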
I suspect that you would find it instructive to tell your compiler to emit assembly code for your inspection. I do not know which compiler you are using but, if you were using GCC, then gcc -S myfile.c.
I would like to ask you how to determine in which ISA (ARM/Thumb/Thumb-2) an instruction is encoded?
First of all, I tried to do it following the instructions here (section 4.5.5).
However, when I use readelf -s ./arm_binary, and arm_binary was built in release mode, it appears that there is no .symtab in the binary. And anyway, I don't understand how to use this command to find the type for the instructions.
Secondly, I know the other way to differentiate is to look at the address used to reach the ARM/Thumb code: if bit 0 of the target address is set, it is Thumb code; if clear, ARM. But how can I do this without loading the file into memory? When I parse the sections of the file and find the executable section, all that I have is the start (offset) location in the file, and that file offset will always be even because instructions are 2 or 4 bytes in size...
Finally, the last way to check is to detect BX Rm instructions, extract the value in Rm, and then check whether bit 0 of that address is set. But this may be difficult, because for this I would need to emulate the whole program.
So what is the correct way to identify the ISA for disassembly?
Thank you for your attention and I hope you will help me.
I don't believe it's possible to tell, in a mixed mode binary, without inspecting the instructions as you describe.
If the whole file will be one ISA or the other, then you can determine the ISA of the entry point by running this:
readelf -h ./arm_binary
And checking whether the entry point address is even or odd; an odd address conventionally indicates a Thumb entry point.
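If you would rather check it programmatically, here is a minimal C sketch for a 32-bit little-endian ELF (it reads the same value readelf -h prints as "Entry point address"):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t e_entry;
    fseek(f, 24, SEEK_SET);      /* offset of e_entry in the ELF32 header */
    fread(&e_entry, 4, 1, f);
    fclose(f);

    /* Bit 0 set on the entry address conventionally means Thumb. */
    printf("entry = 0x%08X -> %s\n", e_entry,
           (e_entry & 1) ? "Thumb" : "ARM");
    return 0;
}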
However, what I would do is simply disassemble it both ways, and see what looks right. As long as you start the disassembly at the start of a function (or any 4-byte boundary), then this will work fine. Most code will produce nonsense when disassembled in the wrong ISA.
Is there a way, using C or assembler or maybe even C#, to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs.

That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from that of the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
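If you'd rather stay in C, here is a rough equivalent of the sequence above using GCC/Clang inline assembly on x86-64 (a sketch only; it keeps the CPUID warm-up and the serialize-measure-serialize pattern, and as discussed the resulting number is still of limited meaning):

#include <stdio.h>
#include <stdint.h>

/* CPUID serializes execution, then RDTSC reads the time-stamp counter. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\trdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "rbx", "rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* Execute CPUID a few times first, as Intel recommends. */
    serialized_rdtsc();
    serialized_rdtsc();
    serialized_rdtsc();

    uint64_t overhead = UINT64_MAX, with_add = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {
        /* empty measurement: just the two serialized reads */
        uint64_t t0 = serialized_rdtsc();
        uint64_t t1 = serialized_rdtsc();
        if (t1 - t0 < overhead) overhead = t1 - t0;

        /* same thing with the sequence under test in between */
        int a = i, b = 1;
        t0 = serialized_rdtsc();
        __asm__ __volatile__("add %1, %0" : "+r"(a) : "r"(b));
        t1 = serialized_rdtsc();
        if (t1 - t0 < with_add) with_add = t1 - t0;
    }
    printf("apparent cost of add: %llu cycles\n",
           (unsigned long long)(with_add > overhead ? with_add - overhead : 0));
    return 0;
}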
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
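A bare-bones C sketch of this idea (the volatile qualifiers keep the compiler from deleting the loops; names and iteration count are illustrative):

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000L

volatile int sink;   /* volatile so the loop body isn't optimized away */

int main(void)
{
    volatile long i; /* volatile so the empty loop isn't optimized away */

    clock_t t0 = clock();
    for (i = 0; i < ITERATIONS; i++) { }          /* overhead only */
    clock_t t1 = clock();
    for (i = 0; i < ITERATIONS; i++) { sink++; }  /* code under test */
    clock_t t2 = clock();

    double seconds = (double)((t2 - t1) - (t1 - t0)) / CLOCKS_PER_SEC;
    printf("approx. %g s per iteration\n", seconds / ITERATIONS);
    return 0;
}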
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for getting a measurement without the repetition, but the repetition technique is always available, for example on an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however, that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS or any of the others is that there are lots of processes already running on your machine in the background, which will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and publish the results. All you need to do is ask them and they'll send you a free CD-ROM (or DVD) with the results. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing it.

PS. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the ADD instruction requires multiplied by the clock period of the CPU (that is, divided by the clock rate). Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
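For example, if a register-to-register ADD is documented as taking 1 clock cycle and the CPU is clocked at 100 MHz, the instruction takes 1 / 100,000,000 s = 10 ns.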
That said, why do you care?