GCC/objdump: Generating compilable/buildable assembly (interspersed with C/C++) source? - c

This is close to Using GCC to produce readable assembly?, but my context here is avr-gcc (and correspondingly, avr-objdump) for Atmel (though, I guess it would apply across the GCC board).
The thing is, I have a project of multiple .c and .cpp files, which ultimately get compiled into an executable with the same name as the 'master' .cpp file. In this process, I can obtain an assembly listing in two ways:
I can instruct gcc to emit assembly listing source (see Linux Assembly and Disassembly an Introduction) using the -S switch; in this case, I get a file, with contents like:
...
loop:
push r14
push r15
push r16
push r17
push r28
push r29
/* prologue: function */
/* frame size = 0 */
ldi r24,lo8(13)
ldi r22,lo8(1)
call digitalWrite
rjmp .L2
.L3:
ldi r24,lo8(MyObj)
ldi r25,hi8(MyObj)
call _ZN16MYOBJ7connectEv
.L2:
ldi r24,lo8(MyObj)
ldi r25,hi8(MyObj)
call _ZN16MYOBJ11isConnectedEv
...
(Haven't tried it yet; but I guess this code is compilable/buildable....)
I can inspect the final executable with objdump, instructing it to emit an assembly listing using the -S switch; in this case, I get a file with contents like:
...
0000066a <init>:
void init()
{
// this needs to be called before setup() or some functions won't
// work there
sei();
66a: 78 94 sei
66c: 83 b7 in r24, 0x33 ; 51
66e: 84 60 ori r24, 0x04 ; 4
670: 83 bf out 0x33, r24 ; 51
...
000006be <loop>:
6be: ef 92 push r14
6c0: ff 92 push r15
6c2: 0f 93 push r16
6c4: 1f 93 push r17
6c6: cf 93 push r28
6c8: df 93 push r29
6ca: 8d e0 ldi r24, 0x0D ; 13
6cc: 61 e0 ldi r22, 0x01 ; 1
6ce: 0e 94 23 02 call 0x446 ; 0x446
6d2: 04 c0 rjmp .+8 ; 0x6dc
6d4: 8d ef ldi r24, 0xFD ; 253
6d6: 94 e0 ldi r25, 0x04 ; 4
6d8: 0e 94 25 06 call 0xc4a ; 0xc4a <_ZN16MYOBJ7connectEv>
6dc: 8d ef ldi r24, 0xFD ; 253
6de: 94 e0 ldi r25, 0x04 ; 4
6e0: 0e 94 21 06 call 0xc42 ; 0xc42 <_ZN16MYOBJ11isConnectedEv>
...
(I did try to build this code, and it did fail - it reads the 'line numbers' as labels)
Obviously, both listings (for the loop function, at least) represent the same assembly code; except:
The gcc one (should) compile -- the objdump one does not
The objdump one contains listings of all referred functions, which could be defined in files other than the 'master' (e.g., digitalWrite) -- the gcc one does not
The objdump one contains original C/C++ source lines 'interspersed' with assembly (but only occasionally, and seemingly only for C files?) -- the gcc one does not
So, is there a way to obtain an assembly listing that would be 'compilable', however with all in-linked functions, and where source C/C++ code is (possibly, where appropriate) interspersed as comments (so they don't interfere with compilation of assembly file)? (short of writing a parser for the output of objdump, that is :))

Add the option -fverbose-asm to your gcc command line up there. (This is in the gcc manual, but it's documented under 'Code Gen Options')

The "dependencies" that you talk about often come from libraries or separate object files, so they don't have a source - they're just linked as binary code into the final executable. Since such code is not passed through the assembler, you will have to extract it using other ways - e.g. using objdump.
Maybe you should tell us what you really want to achieve because I don't see much point in such an exercise by itself.

The best I have been able to get is to use -Wa,-ahl=outfile.s instead of -S. It isn't compilable code though, but a listing file for diagnostic purposes; the compiled object file is emitted as usual.

Synchronous baud rate (RFC2217) encode/decode

I'm trying to implement RFC2217 in my code but I can't understand how the last parity bit (46H and 28H) is generated.
I'm using RS485 to Ethernet device.
What will be the code, if I'm using 2400,E,8,1?
Is it: 55 AA 55 09 60 1B XX?
Is 1B right?
What will be XX?
User manual: page 42 in https://www.sarcitalia.it/file_upload/prodotti//USR-N520-Manual-EN-V1.0.4.pdf
In the field for the baud rate you missed the MSByte. This field shall be 00 09 60.
Yes, 1B for "E,8,1" is correct. BTW, the table lists 2 bits for the 1-bit fields of "stop bit" and "parity enable", which is quite irritating.
The field "parity" is actually just a checksum: the sum of the bytes that follow the header, with the MSBit (bit 7) cleared. (I don't grasp the text of the explanation, but the document seems to be low quality anyway.)
01 C2 00 03: 0x01 + 0xC2 + 0x00 + 0x03 = 0xC6; without bit 7 = 0x46.
00 25 80 03: 0x00 + 0x25 + 0x80 + 0x03 = 0xA8; without bit 7 = 0x28.
Your telegram 00 09 60 1B: 0x00 + 0x09 + 0x60 + 0x1B = 0x84; without bit 7 = 0x04. So XX is 04.
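The three sums above can be reproduced with a few lines of C. This is only a sketch of the rule as described in this answer (sum the bytes after the 55 AA 55 header, then clear bit 7); the function name usr_sum_check is made up for illustration and does not come from the device manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the payload bytes (everything after the 55 AA 55 header) and
 * clear bit 7. usr_sum_check is a hypothetical name, not a vendor API. */
uint8_t usr_sum_check(const uint8_t *payload, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += payload[i];
    return (uint8_t)(sum & 0x7F);   /* keep bits 0..6 only */
}
```

For the payload 00 09 60 1B (2400 baud, E,8,1) this gives 0x04, matching the value worked out by hand.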

Manual vectorization using AVX vector intrinsics only runs about the same speed as 4 scalar FP adds on Ryzen?

So I decided to take a look at how to use SSE, AVX, ... in C via Intel® Intrinsics. Not because of any actual interest to use it for something, but out of pure curiosity. Trying to check if code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <immintrin.h>
/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
_mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));
/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
dst_vec[0] = src_vec1[0] + src_vec2[0];\
dst_vec[1] = src_vec1[1] + src_vec2[1];\
dst_vec[2] = src_vec1[2] + src_vec2[2];\
dst_vec[3] = src_vec1[3] + src_vec2[3];
int main (int argc, char *argv[]) {
unsigned long i;
double dvec1[4] = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
double dvec2[4] = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])};
#if 1
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d(dvec1, dvec2, dvec2);
}
#endif
#if 0
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d_d64(dvec1, dvec2, dvec2);
}
#endif
printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}
I simply switch #if 1 to #if 0 and the other way around to switch between "modes" (AVX and non-AVX).
My expectation would be that the loop using AVX would be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these flags: -O2 --std=gnu99 -lm -mavx2.
> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.150s
user 0m2.147s
sys 0m0.000s
> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.168s
user 0m2.165s
sys 0m0.000s
As you can see, they run at practically the same speed. I also tried increasing the number of iterations by a factor of ten, but the results simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think it is safe to say that both perform the same calculations. Digging deeper, I took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
In my understanding the second one should be way slower since besides decrementing the counter and the conditional jump there are four times as many instructions in it. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?
If this is relevant, my system runs on an AMD Ryzen 5 2600X Six-Core Processor which supports AVX.
With AVX
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
This loop is using ymm0 as accumulator. In other words it is doing ymm0 += ymm1 (this is a vector operation; adding 4 double values at once). Therefore it has a loop-carried dependency on ymm0 (every new addition has to wait for the previous addition to finish and uses the result to start the next addition). vaddpd has latency=3, throughput=1 for Zen+ (according to https://www.uops.info/table.html). The loop-carried dependency makes this loop bottleneck on the latency of vaddpd, so your loop can get at best 3 cycles/iteration. Only one vaddpd addition is in flight in the CPU at a time, which under-utilizes its capability by a lot.
To make this faster add more accumulators (have more vectors to sum). It can (in theory) get 3 times faster due to pipelining (3 full ymm additions in-flight), as long as it does not get limited by something else.
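A minimal sketch of the "more accumulators" idea in portable C, without intrinsics. The function name add_repeated and the choice of three partial sums are assumptions made for illustration (three matches the 3-cycle latency quoted above); whether the compiler turns each partial sum into one vaddpd depends on the compiler and flags.

```c
/* Adds step[] to acc[] n times, like the benchmark loop, but spreads the
 * work over three independent partial sums so that consecutive additions
 * do not have to wait on each other. Names are illustrative only. */
void add_repeated(double acc[4], const double step[4], unsigned long n)
{
    double p0[4] = {0.0, 0.0, 0.0, 0.0};
    double p1[4] = {0.0, 0.0, 0.0, 0.0};
    double p2[4] = {0.0, 0.0, 0.0, 0.0};
    unsigned long i = 0;

    /* Three independent dependency chains per pass. */
    for (; i + 3 <= n; i += 3) {
        for (int k = 0; k < 4; k++) {
            p0[k] += step[k];
            p1[k] += step[k];
            p2[k] += step[k];
        }
    }
    /* Leftover iterations when n is not a multiple of 3. */
    for (; i < n; i++)
        for (int k = 0; k < 4; k++)
            p0[k] += step[k];

    /* Combine the partial sums once, outside the hot loop. */
    for (int k = 0; k < 4; k++)
        acc[k] += p0[k] + p1[k] + p2[k];
}
```

Splitting the sum this way changes the order of the floating-point additions, so in general the result can differ in the last bits from the single-accumulator loop.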
Without AVX
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
This loop accumulates results into 4 different accumulators. Basically it is doing:
xmm0 += xmm4
xmm1 += xmm5
xmm2 += xmm7
xmm3 += xmm6
All of these additions are independent from each other (and they are scalar additions, so each only operates on a single 64-bit floating point value). vaddsd has latency=3, throughput=0.5 (Cycles Per Instruction). This means the CPU can start the first 2 additions in one cycle and the second pair on the next cycle. Therefore it is possible to achieve 2 cycles/iteration for this loop based on throughput. But the latency, as you recall, is 3 cycles, so this loop is also bottlenecked on latency. Unroll once (with 4 additional accumulators; alternatively break the loop-carried dependency chain within the loop by adding xmm4-7 between each other before adding them to the main accumulator) to get rid of that bottleneck (it may get ~50% faster).
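The reasoning for both loops can be condensed into a small model: an iteration cannot finish faster than the latency of its dependency chain, nor faster than the time needed to issue all of its additions. The sketch below encodes just that; cycles_per_iteration is a made-up helper, it uses the latency and throughput figures quoted in this answer, and it deliberately ignores the sub/jne pair and any front-end limits.

```c
/* Lower bound on cycles per loop iteration for a body of n_adds additions
 * that are independent of each other but each feed their own result into
 * the next iteration.
 *   latency    - cycles until an addition's result can be used
 *   recip_tput - cycles per instruction when additions are independent
 * Hypothetical helper; a simplification, not a full pipeline model. */
double cycles_per_iteration(double latency, double recip_tput, int n_adds)
{
    double throughput_bound = n_adds * recip_tput;
    return latency > throughput_bound ? latency : throughput_bound;
}
```

With the numbers above, one vaddpd gives cycles_per_iteration(3, 1.0, 1) = 3 and four vaddsd give cycles_per_iteration(3, 0.5, 4) = 3, which is why the two builds take the same time. Eight independent vaddsd (the unrolled variant) give 4 cycles for two original iterations, i.e. 2 cycles each, which is where the "~50% faster" estimate comes from.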
Note that this ("without AVX") disassembly is still using VEX encoding, so technically still requires AVX-capable CPU.
On Benchmarking
Note that your disassembly does not have any loads or stores, so this may or may not be representative of performance comparison for adding 2 arrays of 4-double vectors.
You are dealing with a latency issue. Depending on the CPU you have to wait 3 or 4 cycles until you can use the result of a vaddpd or vaddsd instruction. But within 1 cycle up to 2 vaddpd or vaddsd instructions can be executed (if the CPU does not have to wait for source registers).
Since in your loop
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
each vaddsd depends on its own result from the previous iteration, it has to wait 3 or 4 cycles before it can be executed. But all four vaddsd, as well as the sub and jne, can execute during that time. Therefore, for this simple loop it makes no difference whether you execute one vaddpd or four vaddsd.
To fully exhaust the vaddpd instruction, you need to execute 6 or 8 of them which do not depend on the result of each other (or have other instructions which do some independent work).

Reuse symbols in disassembling/reassembling a C++ program

It's me again. I am working on a tool that can disassemble/reassemble stripped binaries, and now I am stuck on an (external) symbol reuse issue.
The test is on 32-bit Linux x86 platform.
Suppose I am working on a C++ program, in the GCC compiler produced assembly code, there exists some instructions like this:
call _ZNSt8ios_baseC2Ev
movl _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE+4, %ebx
movb $0, 312(%esp)
movl _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE+8, %ecx
....
Please pay special attention to symbol _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE.
After the compilation, suppose I get an unstripped binary, and I checked this symbol like this:
readelf -s a.out | grep "_ZTTSt14basic"
69: 080a7390 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ifstreamIcS@GLIBCXX_3.4 (3)
72: 080a7220 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ofstreamIcS@GLIBCXX_3.4 (3)
705: 080a7220 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ofstreamIcS
1033: 080a7390 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ifstreamIcS
See, this is my first question: why is the name of symbol _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE changed to _ZTTSt14basic_ifstreamIcS and _ZTTSt14basic_ifstreamIcS@GLIBCXX_3.4 (3)?
What is _ZTTSt14basic_ifstreamIcS@GLIBCXX_3.4 (3) though?
Then I stripped the binary like this:
strip a.out
readelf -s a.out | grep "_ZTTSt14basic"
69: 080a7390 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ifstreamIcS@GLIBCXX_3.4 (3)
72: 080a7220 16 OBJECT WEAK DEFAULT 27 _ZTTSt14basic_ofstreamIcS@GLIBCXX_3.4 (3)
Then after I disassemble the binary, and the corresponding disassembled assembly instructions are :
8063ee7: e8 84 54 fe ff call 8049370 <_ZNSt8ios_baseC2Ev@plt>
8063eec: 8b 1d 94 73 0a 08 mov 0x80a7394,%ebx
8063ef2: c6 84 24 38 01 00 00 movb $0x0,0x138(%esp)
8063ef9: 00
8063efa: 8b 0d 98 73 0a 08 mov 0x80a7398,%ecx
At this point we can figure out that 0x80a7394 equals to _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE+4.
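The mapping from a raw address back to "symbol + offset" can be sketched as a lookup over the (address, size) pairs that readelf printed above. The struct and function names here are hypothetical and only illustrate the idea; a real tool would read the table from the ELF dynamic symbol section instead of hard-coding it.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a symbol table, as listed by readelf -s. */
struct sym {
    const char *name;
    uint32_t    addr;
    uint32_t    size;
};

/* Returns the symbol whose [addr, addr + size) range contains target and
 * stores the byte offset in *offset; returns NULL when nothing matches.
 * Hypothetical helper for illustration. */
const struct sym *resolve(const struct sym *tab, size_t n,
                          uint32_t target, uint32_t *offset)
{
    for (size_t i = 0; i < n; i++) {
        if (target >= tab[i].addr && target - tab[i].addr < tab[i].size) {
            *offset = target - tab[i].addr;
            return &tab[i];
        }
    }
    return NULL;
}
```

With the two 16-byte entries at 0x080a7390 and 0x080a7220, the address 0x80a7394 resolves to the ifstream symbol with offset 4, and 0x80a7398 to the same symbol with offset 8.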
In order to reuse these instructions, I modified the code:
call _ZNSt8ios_baseC2Ev
mov _ZTTSt14basic_ifstreamIcS+4,%ebx
movb $0x0,0x138(%esp)
mov _ZTTSt14basic_ifstreamIcS+8,%ecx
And did some update like these (please see this question for reference):
echo ""_ZTTSt14basic_ifstreamIcS@GLIBCXX_3.4 (3)" = 0x080a7390;" > symbolfile
g++ -Wl,--just-symbols=symbolfile final.s
readelf -s a.out | grep "_ZTTSt14basic"
3001: 080a7390 0 NOTYPE LOCAL DEFAULT 27 _ZTTSt14basic_ifstreamIcS
8412: 080a7390 0 NOTYPE GLOBAL DEFAULT ABS _ZTTSt14basic_ifstreamIcS
I debugged the newly produced binary, and to my surprise, in the new binary the symbol _ZTTSt14basic_ifstreamIcS does not get any value after the call to _ZNSt8ios_baseC2Ev, while in the original binary, after the function call, _ZTTSt14basic_ifstreamIcS does get some memory address referring to a library section. Which means:
call _ZNSt8ios_baseC2Ev
mov _ZTTSt14basic_ifstreamIcS+4,%ebx <--- %ebx gets zero!
movb $0x0,0x138(%esp)
mov _ZTTSt14basic_ifstreamIcS+8,%ecx <--- %ecx gets zero!
I must state that in these lines of the original binary, registers %ebx and %ecx both get some addresses referring to the libc section.
This is my second question: why didn't symbol _ZTTSt14basic_ifstreamIcS get any value after the function call _ZNSt8ios_baseC2Ev? I also tried with symbol name _ZTTSt14basic_ifstreamIcSt11char_traitsIcEE, but that does not work either.
Am I clear enough? Could anyone save my ass? Thank you!

gadgets in ROP on AVR architecture

For ROP, please refer to this paper.
I'm building the gadget catalog for AVR-8bit but I have some doubts.
I'll ask my question using the following example.
In order to have v1=v1+v2; (v1 and v2 are variables)
the corresponding assembly is:
ldi r17, #value
ldi r18, #value
add r18,r17;
or
ldi r17, #value
mov r1, r17;
ldi r18, #value
add r18,r1;
or
ldi r17, #value
ldi r18, #value
mov r1, r18;
add r1,r17;
or
ldi r17, #value
mov r1, r17;
ldi r18, #value
mov r2, r18;
add r2,r1;
Will the gadget be the following?
ldi r#, #value;
ldi r#, #value;
add r#, r#;
ret
or just the following combined with ldi r#,r#; ret and with the combination with mov?
add r#,r#;
ret
ldi is loading a constant and there is not much point in adding two constants at runtime. As such, your gadget will be the add; ret only, and you'll want to ensure the two operands are in the appropriate registers by using other gadgets.
It might make sense to have a gadget for adding a constant to a register, though.

MSVS 2010 C: memory detection not working as expected

I am working on a C project in MSVS 2010 (meaning I am using malloc, calloc, and free, not the C++ new and delete operators). I need to find a memory leak(s?), so I've followed the steps on http://msdn.microsoft.com/en-us/library/x98tx3cf.aspx to get the program to dump the memory state at the end of the run.
I include the libraries like so:
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>
I also specify that every exit should display the debug info like so:
_CrtSetDbgFlag ( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF );
But my debug output looks like this:
Detected memory leaks!
Dumping objects ->
{80181} normal block at 0x016B1D38, 12 bytes long.
Data: < 7 7 8 7 > 0C D5 37 00 14 A9 37 00 38 99 37 00
{80168} normal block at 0x016ACC20, 16 bytes long.
Data: < 7 H 7 X 7 \ 7 > A8 FB 37 00 48 E9 37 00 58 C2 37 00 5C AC 37 00
...
According to the article, I should be getting file name and line number output indicating where the leaked memory is allocated. Why is this not happening, and how can I fix it?
Adrian McCarthy commented that I should ensure that the definition _CRTDBG_MAP_ALLOC existed in every compilation unit. While I could not figure out how to define that as a compiler option, I did create a sparse header file that I ensured every compiled file included. This made the debugging functionality work as expected.
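A sparse header of the kind described might look roughly like the sketch below. The file name leakcheck.h, the LEAKCHECK_ENABLED macro and the leakcheck_init helper are all invented for illustration; only _CRTDBG_MAP_ALLOC, crtdbg.h and _CrtSetDbgFlag come from the question. The guards are an addition so the header also compiles outside MSVC debug builds.

```c
/* leakcheck.h (hypothetical name): include this first in every .c file so
 * that _CRTDBG_MAP_ALLOC is defined in every compilation unit. */
#ifndef LEAKCHECK_H
#define LEAKCHECK_H

#if defined(_MSC_VER) && defined(_DEBUG)
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>
#define LEAKCHECK_ENABLED 1
#else
#include <stdlib.h>
#define LEAKCHECK_ENABLED 0
#endif

/* Call once at program start; returns 1 if leak reporting was switched on,
 * 0 when the CRT debug heap is not available in this build. */
static int leakcheck_init(void)
{
#if LEAKCHECK_ENABLED
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
#endif
    return LEAKCHECK_ENABLED;
}

#endif /* LEAKCHECK_H */
```

Defining the macro on the compiler command line (for cl, through the /D option) should be another way to reach every compilation unit, though that is not what was done here.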
