Understanding Clang's optimization when a pointer is zero

In short: try switching the foos pointer from 0 to 1 here:
godbolt - Compiler Explorer link - what is happening?
I was surprised at how many instructions came out of clang when I compiled the following C code, and I noticed that it only happens when the pointer foos is zero (x86-64 clang 12.0.1 with -O2 or -O3).
#include <stdint.h>

typedef uint8_t u8;
typedef uint32_t u32;

typedef struct {
    u32 x;
    u32 y;
} Foo;

u32 count = 500;

int main()
{
    u8 *foos = (u8 *)0;
    u32 element_size = 8;
    u32 offset = 0;
    for (u32 i = 0; i < count; i++)
    {
        u32 *p = (u32 *)(foos + element_size*i);
        *p = i;
    }
    return 0;
}
This is the output when the pointer is zero.
main: # #main
mov r8d, dword ptr [rip + count]
test r8, r8
je .LBB0_6
lea rcx, [r8 - 1]
mov eax, r8d
and eax, 3
cmp rcx, 3
jae .LBB0_7
xor ecx, ecx
jmp .LBB0_3
.LBB0_7:
and r8d, -4
mov esi, 16
xor ecx, ecx
.LBB0_8: # =>This Inner Loop Header: Depth=1
lea edi, [rsi - 16]
and edi, -32
mov dword ptr [rdi], ecx
lea edi, [rsi - 8]
and edi, -24
lea edx, [rcx + 1]
mov dword ptr [rdi], edx
mov edx, esi
and edx, -16
lea edi, [rcx + 2]
mov dword ptr [rdx], edi
lea edx, [rsi + 8]
and edx, -8
lea edi, [rcx + 3]
mov dword ptr [rdx], edi
add rcx, 4
add rsi, 32
cmp r8, rcx
jne .LBB0_8
.LBB0_3:
test rax, rax
je .LBB0_6
lea rdx, [8*rcx]
.LBB0_5: # =>This Inner Loop Header: Depth=1
mov esi, edx
and esi, -8
mov dword ptr [rsi], ecx
add rdx, 8
add ecx, 1
add rax, -1
jne .LBB0_5
.LBB0_6:
xor eax, eax
ret
count:
.long 500 # 0x1f4
Can you please help me understand what is happening here? I don't know assembly very well. The AND with 3 suggests to me that there's some alignment branching. The top part of LBB0_8 looks very strange to me...

This is loop unrolling.
The code first checks if count is greater than 3, and if so, branches to LBB0_7, which sets up loop variables and drops into the loop at LBB0_8. This loop does 4 steps per iteration, as long as there are still 4 or more to do. Afterwards it falls through to the "slow path" at LBB0_3/LBB0_5 that just does one step per iteration.
That slow path is also very similar to what you get when you compile the code with a non-zero value for that pointer.
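In C terms the structure is roughly the following - a hand-written sketch of the 4x unrolling with a scalar remainder loop, reusing the question's variables, not the compiler's literal output:

u32 i = 0;
/* main unrolled loop (LBB0_8): 4 stores per iteration while at least 4 remain;
   count & ~3u is count rounded down to a multiple of 4 (the "and r8d, -4") */
for (; i < (count & ~3u); i += 4) {
    *(u32 *)(foos + 8*(i+0)) = i+0;
    *(u32 *)(foos + 8*(i+1)) = i+1;
    *(u32 *)(foos + 8*(i+2)) = i+2;
    *(u32 *)(foos + 8*(i+3)) = i+3;
}
/* remainder loop (LBB0_3/LBB0_5): the leftover count & 3 stores (the "and eax, 3") */
for (; i < count; i++)
    *(u32 *)(foos + 8*i) = i;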
As for why this happens, I don't know. Initially I was thinking that the compiler proves that a NULL dereference will happen inside the loop and optimises based on that, but usually that's akin to replacing the loop contents with __builtin_unreachable();, which causes it to throw out the loop entirely. I still can't rule it out, but I've seen the compiler throw out a lot of code many times, so it seems at least unlikely that UB causes this.
Then I was thinking maybe it's the fact that 0 requires no additional calculation, but all it would have to change is mov esi, 16 to mov esi, 17, so it would have the same number of instructions.
What's also interesting is that on x86_64, it generates a loop with 4 steps per iteration, whereas on arm64 it generates one with 2 steps per iteration.

Related

How do I write Rust code which compiles to assembly which resembles that produced by GCC from C?

I have these two source files:
const ARR_LEN: usize = 128 * 1024;

pub fn plain_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
    for i in 0..ARR_LEN {
        result[i] = x[i] % m;
    }
}
and
#include <stdint.h>

#define ARR_LEN (128 * 1024)

void plain_mod_test(uint64_t *x, uint64_t m, uint64_t *result) {
    for (int i = 0; i < ARR_LEN; ++i) {
        result[i] = x[i] % m;
    }
}
Is my C code a good approximation to the Rust code?
When I compile the C code on godbolt.org x86_64 gcc12.2 -O3 I get the sensible:
plain_mod_test:
mov r8, rdx
xor ecx, ecx
.L2:
mov rax, QWORD PTR [rdi+rcx]
xor edx, edx
div rsi
mov QWORD PTR [r8+rcx], rdx
add rcx, 8
cmp rcx, 1048576
jne .L2
ret
But when I do the same for rustc 1.66 -C opt-level=3 I get this verbose output:
example::plain_mod_test:
push rax
test rsi, rsi
je .LBB0_10
mov r8, rdx
xor ecx, ecx
jmp .LBB0_2
.LBB0_7:
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
mov rcx, r9
cmp r9, 131072
je .LBB0_9
.LBB0_2:
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_3
xor edx, edx
div rsi
jmp .LBB0_5
.LBB0_3:
xor edx, edx
div esi
.LBB0_5:
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
lea r9, [rcx + 2]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_7
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
mov rcx, r9
cmp r9, 131072
jne .LBB0_2
.LBB0_9:
pop rax
ret
.LBB0_10:
lea rdi, [rip + str.0]
lea rdx, [rip + .L__unnamed_1]
mov esi, 57
call qword ptr [rip + core::panicking::panic#GOTPCREL]
ud2
How do I write Rust code which compiles to assembly similar to that produced by gcc for C?
Update: When I compile the C code with clang 12.0.0 -O3 I get output which looks far more like the Rust assembly than the GCC/C assembly.
i.e. This looks like a GCC vs clang issue, rather than a C vs Rust difference.
plain_mod_test: # #plain_mod_test
mov r8, rdx
xor ecx, ecx
jmp .LBB0_1
.LBB0_6: # in Loop: Header=BB0_1 Depth=1
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
je .LBB0_8
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_2
xor edx, edx
div rsi
jmp .LBB0_4
.LBB0_2: # in Loop: Header=BB0_1 Depth=1
xor edx, edx
div esi
.LBB0_4: # in Loop: Header=BB0_1 Depth=1
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_6
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
jne .LBB0_1
.LBB0_8:
ret
Don’t compare apples to orange crabs.
Most of the difference between the assembly outputs is due to loop unrolling, which the LLVM code generator used by rustc does much more aggressively than GCC's, and to working around a CPU performance pitfall, as explained in Peter Cordes' answer. When you compile the same C code with Clang 15, it outputs:
mov r8, rdx
xor ecx, ecx
jmp .LBB0_1
.LBB0_6:
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
je .LBB0_8
.LBB0_1:
mov rax, qword ptr [rdi + 8*rcx]
mov rdx, rax
or rdx, rsi
shr rdx, 32
je .LBB0_2
xor edx, edx
div rsi
jmp .LBB0_4
.LBB0_2:
xor edx, edx
div esi
.LBB0_4:
mov qword ptr [r8 + 8*rcx], rdx
mov rax, qword ptr [rdi + 8*rcx + 8]
mov rdx, rax
or rdx, rsi
shr rdx, 32
jne .LBB0_6
xor edx, edx
div esi
mov qword ptr [r8 + 8*rcx + 8], rdx
add rcx, 2
cmp rcx, 131072
jne .LBB0_1
.LBB0_8:
ret
which is pretty much the same as the Rust version.
Using Clang with -Os results in assembly much closer to that of GCC:
mov r8, rdx
xor ecx, ecx
.LBB0_1:
mov rax, qword ptr [rdi + 8*rcx]
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx], rdx
inc rcx
cmp rcx, 131072
jne .LBB0_1
ret
Likewise, rustc with -C opt-level=s:
push rax
test rsi, rsi
je .LBB6_4
mov r8, rdx
xor ecx, ecx
.LBB6_2:
mov rax, qword ptr [rdi + 8*rcx]
xor edx, edx
div rsi
mov qword ptr [r8 + 8*rcx], rdx
lea rax, [rcx + 1]
mov rcx, rax
cmp rax, 131072
jne .LBB6_2
pop rax
ret
.LBB6_4:
lea rdi, [rip + str.0]
lea rdx, [rip + .L__unnamed_1]
mov esi, 57
call qword ptr [rip + core::panicking::panic#GOTPCREL]
ud2
Of course, there is still the check if m is zero, leading to a panicking branch. You can eliminate that branch by narrowing down the type of the argument to exclude zero:
const ARR_LEN: usize = 128 * 1024;

pub fn plain_mod_test(x: &[u64; ARR_LEN], m: std::num::NonZeroU64, result: &mut [u64; ARR_LEN]) {
    for i in 0..ARR_LEN {
        result[i] = x[i] % m;
    }
}
Now the function compiles to assembly identical to Clang's.
rustc uses the LLVM back-end optimizer, so compare against clang. LLVM unrolls small loops by default.
Recent LLVM is also tuning for Intel CPUs before Ice Lake, where div r64 is much slower than div r32, so much slower that it's worth branching on.
It's checking whether a uint64_t value actually fits in a uint32_t, and using 32-bit operand-size for div if so. The shr/je pair implements if (((dividend | divisor) >> 32) == 0) use 32-bit div, checking the high halves of both operands for being all-zero in one step. If it checked the high half of m once and made two versions of the loop, the test would be simpler. But this code will bottleneck on division throughput anyway.
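In C, the branch amounts to something like this (a conceptual sketch with invented variable names, not literal compiler output):

uint64_t rem;
if (((dividend | divisor) >> 32) == 0) {
    /* high halves of both operands are zero: the cheap div r32 is safe */
    rem = (uint32_t)dividend % (uint32_t)divisor;
} else {
    rem = dividend % divisor;  /* full-width div r64 */
}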
This opportunistic div r32 code-gen will eventually become obsolete: Ice Lake's integer divider is wide enough that 64-bit division doesn't need way more micro-ops, so performance just depends on the actual values, regardless of whether there are an extra 32 bits of zeros above them. AMD has been like that for a while.
But Intel sold a lot of CPUs based on re-spins of Skylake (including Cascade Lake servers, and client CPUs up to Comet Lake). While those are still in widespread use, LLVM's -mtune=generic should probably keep doing this.
For more details:
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux (a case where we know at compile-time the 64-bit integers will only hold small values, and rewriting the binary to use 32-bit operand-size with zero other changes to alignment or machine code makes it 3x faster on my Skylake CPU.)
uops for integer DIV instruction
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

How can I store abc(x, y), which is a pointer function, into an array, as per the following code sample?

Function 1.
It is a function that returns a pointer.
char *abc(unsigned int a, unsigned int b)
{
    // do something here ...
}
Function 2.
It leverages function 1.
I am trying to store the abc function call into an array; however, I am getting the error: error: assignment to expression with array type.
fun2()
{
    unsigned int x, y;
    x = 5, y = 6;
    char *array1;
    char array2;
    for (i = 0; i < 3; i++)
    {
        array2[i] = abc(x, y);
    }
}
You can't store the invocation of a function in C, since that would defeat many popular optimizations involving passing parameters in registers. Normally, parameters are assigned their argument values immediately before the flow of execution is transferred to the call site. Compilers may use registers to hold those values, but those registers are volatile, so if we were to delay the actual call, they could be overwritten in the meantime, possibly even by another call to some function that also has its arguments passed in registers.
A solution, which I've personally implemented, is to have a function simulate the call for you by re-assigning the proper registers, and any further arguments, to the stack. In this case you store the argument values in flat memory. But this must be done in assembly written exclusively for this purpose and specific to your target architecture. If your architecture does not use any such optimizations, it could be quite a bit easier, but hand-written assembly would still be required.
In any case, this is not a feature that standard C (or even pre-standard C, as far as I know) has ever implemented.
For example, this is an implementation for x86-64 I wrote some time ago (for the MSVC MASM assembler):
PUBLIC makeuniquecall
.data
makeuniquecall_jmp_table dq zero_zero, one_zero, two_zero, three_zero ; ordinary
makeuniquecall_jmp_table_one dq zero_one, one_one, two_one, three_one ; single precision
makeuniquecall_jmp_table_two dq zero_two, one_two, two_two, three_two ; double precision
.code
makeuniquecall PROC
;rcx - function pointer
;rdx - raw argument data
;r8 - a byte array specifying each register parameter if it's float and the last qword is the size of the rest
push r12
push r13
push r14
mov r12, rcx
mov r13, rdx
mov r14, r8
; first store the stack vars
mov rax, [r14 + 4] ; retrieve size of stack
sub rsp, rax
mov rdi, rsp
xor rdx, rdx
mov r8, 8
div r8
mov rcx, rax
mov rsi, r13
;add rsi, 32
rep movs qword ptr [rdi], qword ptr [rsi]
xor r10,r10
cycle:
mov rax, r14
add rax, r10
movzx rax, byte ptr [rax]
test rax, rax
jnz jmp_one
lea rax, makeuniquecall_jmp_table
jmp qword ptr[rax + r10 * 8]
jmp_one:
cmp rax, 1
jnz jmp_two
lea rax, makeuniquecall_jmp_table_one
jmp qword ptr[rax + r10 * 8]
jmp_two:
lea rax, makeuniquecall_jmp_table_two
jmp qword ptr[rax + r10 * 8]
zero_zero::
mov rcx, qword ptr[r13+r10*8]
jmp continue
one_zero::
mov rdx, qword ptr[r13+r10*8]
jmp continue
two_zero::
mov r8, qword ptr[r13+r10*8]
jmp continue
three_zero::
mov r9, qword ptr[r13+r10*8]
jmp continue
zero_one::
movss xmm0, dword ptr[r13+r10*8]
jmp continue
one_one::
movss xmm1, dword ptr[r13+r10*8]
jmp continue
two_one::
movss xmm2, dword ptr[r13+r10*8]
jmp continue
three_one::
movss xmm3, dword ptr[r13+r10*8]
jmp continue
zero_two::
movsd xmm0, qword ptr[r13+r10*8]
jmp continue
one_two::
movsd xmm1, qword ptr[r13+r10*8]
jmp continue
two_two::
movsd xmm2, qword ptr[r13+r10*8]
jmp continue
three_two::
movsd xmm3, qword ptr[r13+r10*8]
continue:
inc r10
cmp r10, 4
jb cycle
mov r14, [r14 + 4] ; retrieve size of stack
call r12
add rsp, r14
pop r14
pop r13
pop r12
ret
makeuniquecall ENDP
END
And your code will look something like this:
#include <stdio.h>
#include <string.h>

char *abc(unsigned int a, unsigned int b)
{
    printf("a - %d, b - %d\n", a, b);
    return "return abc str\n";
}

extern void makeuniquecall();

int main()
{
    unsigned int x, y;
    x = 5, y = 6;
#pragma pack(push, 4)
    struct {
        struct { char maskargs[4]; unsigned long long szargs; } invok;
        char *(*pfunc)();
        unsigned long long args[2], shadow[2];
    } array2[3];
#pragma pack(pop)
    for (int i = 0; i < 3; i++)
    {
        memset(array2[i].invok.maskargs, 0, sizeof array2[i].invok.maskargs); // standard - no floats passed
        array2[i].invok.szargs = 8 * 4; // consider shadow space
        array2[i].pfunc = abc;
        array2[i].args[0] = x;
        array2[i].args[1] = y;
    }
    // now do the calls
    for (int i = 0; i < 3; i++)
        printf("%s\n", ((char *(*)())makeuniquecall)(array2[i].pfunc, array2[i].args, &array2[i].invok));
}
You probably won't need that; for your specific case you can get away with simply storing each argument and calling the function directly, i.e. (plus this method won't be x86-64 specific):
// now do the calls
for (int i = 0; i < 3; i++)
    printf("%s\n", array2[i].pfunc(array2[i].args[0], array2[i].args[1]));
But my implementation gives you the flexibility to store a different number of arguments for each call.
Note: consider this guide for running the above examples on MSVC (since it requires adding an .asm file for the assembly code).
I love such noob questions, since they make you think about why feature X or Y doesn't actually exist in the language.
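For completeness, here's what that simple, portable route looks like as a self-contained program - a sketch with a made-up struct name (deferred_call), not part of the original answer:

#include <stdio.h>

/* hypothetical holder for one deferred call to a char *(unsigned, unsigned) function */
typedef struct {
    char *(*fn)(unsigned int, unsigned int);
    unsigned int a, b;
} deferred_call;

char *abc(unsigned int a, unsigned int b)
{
    printf("a - %u, b - %u\n", a, b);
    return "return abc str";
}

int main(void)
{
    deferred_call calls[3];
    for (int i = 0; i < 3; i++)
        calls[i] = (deferred_call){ abc, 5, 6 };   /* store, don't call yet */
    for (int i = 0; i < 3; i++)
        puts(calls[i].fn(calls[i].a, calls[i].b)); /* invoke later */
    return 0;
}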

volatile keyword in C, are all variables marked as volatile?

Sorry if I am asking a stupid question, but I can't find the answer, due to clumsy search terms I guess.
If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Or should I really not declare multiple variables in a row but instead do:
volatile uint16_t a;
volatile uint16_t b;
volatile uint16_t c;
If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Yes, all 3 variables will be volatile.
Or should I really not declare multiple variables in a row but instead do:
That is related to code style and personal preference. Usually declaring variables one per line is preferred: it is more readable, easier to refactor, and results in cleaner changes when browsing the diff output of files.
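One related pitfall worth knowing (a quick illustration of standard C declaration semantics, not from the question): a qualifier on the base type applies to every declarator in the list, but a * binds only to the single declarator it precedes:

volatile uint16_t a, b, c;  /* a, b and c are all volatile uint16_t */
uint16_t *p, q;             /* pitfall: p is uint16_t *, but q is a plain uint16_t */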
We can check the assembly generated by the compiler to see if it optimizes the variables out or not.
When I check this simple program:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t a = 1, b = 1, c = 1;
    printf("%hu", a);
    printf("%hu", b);
    printf("%hu", c);
}
The generated assembly at -O3 (link) is:
.LC0:
.string "%hu"
main:
sub rsp, 8
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
It's obvious here that the variables have been optimized out and 1 is being used as a parameter instead of the variables.
When I replace uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1, b = 1, c = 1;, the generated assembly (link) is:
main:
sub rsp, 24
mov edx, 1
mov ecx, 1
mov eax, 1
mov WORD PTR [rsp+10], ax
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov WORD PTR [rsp+12], dx
mov WORD PTR [rsp+14], cx
movzx esi, WORD PTR [rsp+10]
call printf
movzx esi, WORD PTR [rsp+12]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
movzx esi, WORD PTR [rsp+14]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret
Here, volatile is working like it should for all variables. The variables are created and are not optimized out.
In comparison, if we replace volatile uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1; uint16_t b = 1, c = 1; we see that only a is not optimized out (link):
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov WORD PTR [rsp+14], ax
movzx esi, WORD PTR [rsp+14]
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret

Efficiency difference between an if-statement and mod(SIZE)

While studying, I came across the use of (i + 1) % SIZE to cycle through an array of elements.
So I wondered whether this method is more efficient than an if-statement...
For example:
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i = (i + 1) % SIZE)
        items[i] += 1;
    return 0;
}
Is it more efficient than this(?):
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i++) {
        if (i == SIZE) i = 0;
        items[i] += 1;
    }
    return 0;
}
Thanks for the answers and your time.
You can check the assembly online (e.g. here). The result depends on the architecture and the optimization level, but without optimization, for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific case on x86, the solution without the modulo is much shorter.
Although you are only asking about mod vs branch, there are probably more like five cases depending on the actual implementation of the mod and branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and like this and will be very efficient in performance and code size. The and is still part of the loop increment dependency chain though, putting a speed limit on the performance of 2 cycles per iteration unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
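For instance, if SIZE were 16 instead of 15, the wrap-around could be a single mask - a sketch of the transformation the compiler performs on its own, not something you would need to write by hand:

#define SIZE 16                /* power of two */
i = (i + 1) & (SIZE - 1);      /* same as (i + 1) % SIZE, compiles to one and */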
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
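The reciprocal-multiplication trick looks roughly like this for an unsigned 32-bit i and SIZE of 15 (a sketch; the magic constant 0x88888889 with a total shift of 35 is the standard divide-by-15 pair tabulated in Hacker's Delight):

uint32_t q = (uint32_t)(((uint64_t)i * 0x88888889u) >> 35); /* q = i / 15 */
uint32_t r = i - q * 15;                                    /* r = i % 15 */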
Unknown
In your example SIZE is a known constant[1], but in the more general case where it is not known at compile time, you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance because now you have a slow division instruction as part of the carried dependency for the loop. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You put a branch in your code, but the compiler may decide to implement it either as a conditional jump or as a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a carried data flow dependency of 3-4 cycles.
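The conditional-move form corresponds to writing the wrap-around branchlessly, something like this sketch, which compilers readily lower to a cmov:

i++;
i = (i == SIZE) ? 0 : i;  /* the next i is a data dependency of the compare */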
[1] Either actually a constant like your example, or effectively a constant due to inlining and constant propagation.

declaring a string in assembly

I have this assembly code that computes some prime numbers:
#include <stdio.h>

int main() {
    char format[] = "%d\t";
    _asm {
        mov ebx, 1000
        mov ecx, 1
        jmp start_while1
    incrementare1:
        add ecx, 1
    start_while1:
        cmp ecx, ebx
        jge end_while1
        mov edi, 2
        mov esi, 0
        jmp start_while2
    incrementare2:
        add edi, 1
    start_while2:
        cmp edi, ecx
        jge end_while2
        mov eax, ecx
        xor edx, edx
        div edi
        test edx, edx
        jnz incrementare2
        mov esi, 1
    end_while2:
        test esi, esi
        jnz incrementare1
        push ecx
        lea ecx, format
        push ecx
        call printf
        pop ecx
        pop ecx
        jmp incrementare1
    end_while1:
        nop
    }
    return 0;
}
It works fine but I would like to also declare the 'format' string in asm, not in C code. I have tried adding something like format db "%d\t", 0 but it didn't work.
If all else fails, there's always the ugly way:
format_minus_1:
    mov ecx, 0x00096425         ; '%', 'd', '\t', '\0' in little-endian format
    lea ecx, format_minus_1 + 1 ; skip past the "mov ecx" opcode byte
    push ecx
    call printf
You cannot define objects inside the _asm block with those directives. The C declaration is allocating space on the stack for you, so if you want to do something like that inside the _asm block, you need to manipulate the stack pointer and initialize the memory yourself:
sub esp, 4
mov byte ptr [esp], '%'
mov byte ptr [esp + 1], 'd'
mov byte ptr [esp + 2], 9   ; '\t'
mov byte ptr [esp + 3], 0   ; '\0'
...
push ecx                    ; the value to print
lea eax, [esp + 4]          ; address of the string built above (esp moved by the push)
push eax
call printf
Note this is one way, not necessarily the best way - the best way being to let C do your memory management for you.
