scanf assembly x86-x64 always get an error [duplicate] - c

When compiling below code:
global main
extern printf, scanf
section .data
msg: db "Enter a number: ",10,0
format:db "%d",0
section .bss
number resb 4
section .text
main:
mov rdi, msg
mov al, 0
call printf
mov rsi, number
mov rdi, format
mov al, 0
call scanf
mov rdi,format
mov rsi,[number]
inc rsi
mov rax,0
call printf
ret
using:
nasm -f elf64 example.asm -o example.o
gcc -no-pie -m64 example.o -o example
and then run
./example
it runs, print: enter a number:
but then crashes and prints:
Segmentation fault (core dumped)
So printf works fine but scanf not.
What am I doing wrong with scanf so?

Use sub rsp, 8 / add rsp, 8 at the start/end of your function to re-align the stack to 16 bytes before your function does a call.
Or better push/pop a dummy register, e.g. push rdx / pop rcx, or a call-preserved register like RBP you actually wanted to save anyway. You need the total change to RSP to be an odd multiple of 8 counting all pushes and sub rsp, from function entry to any call.
i.e. 8 + 16*n bytes for whole number n.
On function entry, RSP is 8 bytes away from 16-byte alignment because the call pushed an 8-byte return address. See Printing floating point numbers from x86-64 seems to require %rbp to be saved,
main and stack alignment, and Calling printf in x86_64 using GNU assembler. This is an ABI requirement which you used to be able to get away with violating when there weren't any FP args for printf. But not any more.
See also Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
To put it another way, RSP % 16 == 8 on function entry, and you need to ensure RSP % 16 == 0 before you call a function. How you do this doesn't matter. (Not all functions will actually crash if you don't, but the ABI does require/guarantee it.)
gcc's code-gen for glibc scanf now depends on 16-byte stack alignment
even when AL == 0.
It seems to have auto-vectorized copying 16 bytes somewhere in __GI__IO_vfscanf, which regular scanf calls after spilling its register args to the stack1. (The many similar ways to call scanf share one big implementation as a back end to the various libc entry points like scanf, fscanf, etc.)
I downloaded Ubuntu 18.04's libc6 binary package: https://packages.ubuntu.com/bionic/amd64/libc6/download and extracted the files (with 7z x blah.deb and tar xf data.tar, because 7z knows how to extract a lot of file formats).
I can repro your bug with LD_LIBRARY_PATH=/tmp/bionic-libc/lib/x86_64-linux-gnu ./bad-printf, and also it turns out with the system glibc 2.27-3 on my Arch Linux desktop.
With GDB, I ran it on your program and did set env LD_LIBRARY_PATH /tmp/bionic-libc/lib/x86_64-linux-gnu then run. With layout reg, the disassembly window looks like this at the point where it received SIGSEGV:
│0x7ffff786b49a <_IO_vfscanf+602> cmp r12b,0x25 │
│0x7ffff786b49e <_IO_vfscanf+606> jne 0x7ffff786b3ff <_IO_vfscanf+447> │
│0x7ffff786b4a4 <_IO_vfscanf+612> mov rax,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ab <_IO_vfscanf+619> add rax,QWORD PTR [rbp-0x458] │
│0x7ffff786b4b2 <_IO_vfscanf+626> movq xmm0,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ba <_IO_vfscanf+634> mov DWORD PTR [rbp-0x678],0x0 │
│0x7ffff786b4c4 <_IO_vfscanf+644> mov QWORD PTR [rbp-0x608],rax │
│0x7ffff786b4cb <_IO_vfscanf+651> movzx eax,BYTE PTR [rbx+0x1] │
│0x7ffff786b4cf <_IO_vfscanf+655> movhps xmm0,QWORD PTR [rbp-0x608] │
>│0x7ffff786b4d6 <_IO_vfscanf+662> movaps XMMWORD PTR [rbp-0x470],xmm0 │
So it copied two 8-byte objects to the stack with movq + movhps to load and movaps to store. But with the stack misaligned, movaps [rbp-0x470],xmm0 faults.
I didn't grab a debug build to find out exactly which part of the C source turned into this, but the function is written in C and compiled by GCC with optimization enabled. GCC has always been allowed to do this, but only recently did it get smart enough to take better advantage of SSE2 this way.
Footnote 1: printf / scanf with AL != 0 has always required 16-byte alignment because gcc's code-gen for variadic functions uses test al,al / je to spill the full 16-byte XMM regs xmm0..7 with aligned stores in that case. __m128i can be an argument to a variadic function, not just double, and gcc doesn't check whether the function ever actually reads any 16-byte FP args.

Related

Can ASLR randomization be different per function?

I have the following code snippet:
#include <inttypes.h>
#include <stdio.h>
uint64_t
esp_func(void)
{
__asm__("movl %esp, %eax");
}
int
main()
{
uint32_t esp = 0;
__asm__("\t movl %%esp,%0" : "=r"(esp));
printf("esp: 0x%08x\n", esp);
printf("esp: 0x%08lx\n", esp_func());
return 0;
}
Which prints the following upon multiple executions:
❯ clang -g esp.c && ./a.out
esp: 0xbd3b7670
esp: 0x7f8c1c2c5140
❯ clang -g esp.c && ./a.out
esp: 0x403c9040
esp: 0x7f9ee8bd8140
❯ clang -g esp.c && ./a.out
esp: 0xb59b70f0
esp: 0x7fe301f8c140
❯ clang -g esp.c && ./a.out
esp: 0x6efa4110
esp: 0x7fd95941f140
❯ clang -g esp.c && ./a.out
esp: 0x144e72b0
esp: 0x7f246d4ef140
esp_func shows that ASLR is active with 28 bits of entropy, which makes sense on my modern Linux kernel.
What doesn't make sense is the first value: why is it drastically different?
I took a look at the assembly and it looks weird...
// From main
0x00001150 55 push rbp
0x00001151 4889e5 mov rbp, rsp
0x00001154 4883ec10 sub rsp, 0x10
0x00001158 c745fc000000. mov dword [rbp-0x4], 0
0x0000115f c745f8000000. mov dword [rbp-0x8], 0
0x00001166 89e0 mov eax, esp ; Move esp to eax
0x00001168 8945f8 mov dword [rbp-0x8], eax ; Assign eax to my variable `esp`
0x0000116b 8b75f8 mov esi, dword [rbp-0x8]
0x0000116e 488d3d8f0e00. lea rdi, [0x00002004]
0x00001175 b000 mov al, 0
0x00001177 e8b4feffff call sym.imp.printf ; For whatever reason, the value in [rbp-0x8]
; is assigned here. Why?
// From esp_func
0x00001140 55 push rbp
0x00001141 4889e5 mov rbp, rsp
0x00001144 89e0 mov eax, esp ; Move esp to eax (same instruction as above)
0x00001146 488b45f8 mov rax, qword [rbp-0x8] ; This changes everything. What is this?
0x0000114a 5d pop rbp
0x0000114b c3 ret
0x0000114c 0f1f4000 nop dword [rax]
So my question is, what is in [rbp-0x8], how did it get there, and why are the two values different?
No, stack ASLR happens once at program startup. Relative adjustments to RSP between functions are fixed at compile time, and are just the small constants to make space for a function's local vars. (C99 variable-length arrays and alloca do runtime-variable adjustments to RSP, but not random.)
Your program contains Undefined Behaviour and isn't actually printing RSP; instead some stack address left in a register by the previous printf call (which appears to be a stack address, so its high bits do vary with ASLR). It tells you nothing about stack-pointer differences between functions, just how not to use GNU C inline asm.
The first value is printing the current ESP correctly, but that's only the low 32 bits of the 64-bit RSP.
Falling off the end of a non-void function is not safe, and using the return value is Undefined Behaviour. Any caller that uses the return value of esp_func() necessarily would trigger UB, so the compiler is free to leave whatever it wants in RAX.
If you want to write mov %rsp, %rax / ret, then write that function in pure asm, or mov to an "=r"(tmp) local variable. Using GNU C inline asm to modify RAX without telling the compiler about it doesn't change anything; the compiler still sees this as a function with no return value.
MSVC inline asm is different: it is apparently supported to use _asm{ mov eax, 123 } or something and then fall off the end of a non-void function, and MSVC will respect that as the function return value even when inlining. GNU C inline asm doesn't need silly hacks like that: if you want your asm to interact with C values, use Extended asm with an output constraint like you're doing in main. Remember that GNU C inline asm is not parsed by the compiler, just emit the template string as part of the compiler's asm output to be assembled.
I don't know exactly why clang is reloading a return value from the stack, but that's just an artifact of clang internals and how it does code-gen with optimization disabled. But it's allowed to do this because of the undefined behaviour. It is a non-void function, so it needs to have a return value. The simplest thing would be to just emit a ret, and is what some compilers happen to do with optimization enabled, but even that doesn't fix the problem because of inter-procedural optimization.
It's actually Undefined Behaviour in C to use the return value of a function that didn't return one. This applies at the C level; using inline asm that modifies a register without telling the compiler about it doesn't change anything as far as the compiler is concerned. Therefore your program as a whole contains UB, because it passes the result to printf. That's why the compiler is allowed to compile this way: your code was already broken. In practice it's just returning some garbage from stack memory.
TL:DR: this is not a valid way to emit mov %rsp, %rax / ret as the asm definition for a function.
(C++ strengthens this to it being UB to fall off the end in the first place, but in C it's legal as long as the caller doesn't use the return value. If you compile the same source as C++ with optimization, g++ doesn't even emit a ret instruction after your inline asm template. Probably this is to support C's default-int return type if you declare a function without a return type.)
This UB is also why your modified version from comments (with the printf format strings fixed), compiled with optimization enabled (https://godbolt.org/z/sE7e84) prints "surprisingly" different "RSP" values: the 2nd one isn't using RSP at all.
#include <inttypes.h>
#include <stdio.h>
uint64_t __attribute__((noinline)) rsp_func(void)
{
__asm__("movq %rsp, %rax");
} // UB if return value used
int main()
{
uint64_t rsp = 0;
__asm__("\t movq %%rsp,%0" : "=r"(rsp));
printf("rsp: 0x%08lx\n", rsp);
printf("rsp: 0x%08lx\n", rsp_func()); // UB here
return 0;
}
Output example:
Compiler stderr
<source>:7:1: warning: non-void function does not return a value [-Wreturn-type]
}
^
1 warning generated.
Program returned: 0
Program stdout
rsp: 0x7fff5c472f30
rsp: 0x7f4b811b7170
clang -O3 asm output shows that the compiler-visible UB was a problem. Even though you used noinline, the compiler can still see the function body and try to do inter-procedural optimization. In this case, the UB led it to just give up and not emit a mov %rsp, %rsi between call rsp_func and call printf, so it's printing whatever value the previous printf happened to leave in RSI
# from the Godbolt link
rsp_func: # #rsp_func
mov rax, rsp
ret
main: # #main
push rax
mov rsi, rsp
mov edi, offset .L.str
xor eax, eax
call printf
call rsp_func # return value ignored because of UB.
mov edi, offset .L.str
xor eax, eax
call printf # printf("0x%08lx\n", garbage in RSI left from last printf)
xor eax, eax
pop rcx
ret
.L.str:
.asciz "rsp: 0x%08lx\n"
GNU C Basic asm (without constraints) is not useful for anything (except the body of a __attribute__((naked)) function).
Don't assume the compiler will do what you expect when there is UB visible to it at compile time. (When UB isn't visible at compile time, the compiler has to make code that would work for some callers or callees, and you get the asm you expected. But compile-time-visible UB means all bets are off.)

Assembly: Purpose of loading the effective address before a call to a function?

Source C Code:
int main()
{
int i;
for(i=0, i < 10; i++)
{
printf("Hello World!\n");
}
}
Dump of Intel syntax x86 assembler code for function main:
1. 0x000055555555463a <+0>: push rbp
2. 0x000055555555463b <+1>: mov rbp,rsp
3. 0x000055555555463e <+4>: sub rsp,0x10
4. 0x0000555555554642 <+8>: mov DWORD PTR [rbp-0x4],0x0
5. 0x0000555555554649 <+15>: jmp 0x55555555465b <main+33>
6. 0x000055555555464b <+17>: lea rdi,[rip+0xa2] # 0x5555555546f4
7. 0x0000555555554652 <+24>: call 0x555555554510 <puts#plt>
8. 0x0000555555554657 <+29>: add DWORD PTR [rbp-0x4],0x1
9. 0x000055555555465b <+33>: cmp DWORD PTR [rbp-0x4],0x9
10. 0x000055555555465f <+37>: jle 0x55555555464b <main+17>
11. 0x0000555555554661 <+39>: mov eax,0x0
12. 0x0000555555554666 <+44>: leave
13. 0x0000555555554667 <+45>: ret
I'm currently working through "Hacking, The Art of Exploitation 2nd Edition by Jon Erickson", and I'm just starting to tackle assembly.
I have a few questions about the translation of the provided C code to Assembly, but I am mainly wondering about my first question.
1st Question: What is the purpose of line 6? (lea rdi,[rip+0xa2]).
My current working theory, is that this is used to save where the next instructions will jump to in order to track what is going on. I believe this line correlates with the printf function in the source C code.
So essentially, its loading the effective address of rip+0xa2 (0x5555555546f4) into the register rdi, to simply track where it will jump to for the printf function?
2nd Question: What is the purpose of line 11? (mov eax,0x0?)
I do not see a prior use of the register, EAX and am not sure why it needs to be set to 0.
The LEA puts a pointer to the string literal into a register, as the first arg for puts. The search term you're looking for is "calling convention" and/or ABI. (And also RIP-relative addressing). Why is the address of static variables relative to the Instruction Pointer?
The small offset between code and data (only +0xa2) is because the .rodata section gets linked into the same ELF segment as .text, and your program is tiny. (Newer gcc + ld versions will put it in a separate page so it can be non-executable.)
The compiler can't use a shorter more efficient mov edi, address in position-independent code in your Linux PIE executable. It would do that with gcc -fno-pie -no-pie
mov eax,0 implements the implicit return 0 at the end of main that C99 and C++ guarantee. EAX is the return-value register in all calling conventions.
If you don't use gcc -O2 or higher, you won't get peephole optimizations like xor-zeroing (xor eax,eax).
This:
lea rdi,[rip+0xa2]
Is a typical position independent LEA, putting the string address into a register (instead of loading from that memory address).
Your executable is position independent, meaning that it can be loaded at runtime at any address. Therefore, the real address of the argument to be passed to puts() needs to be calculated at runtime every single time, since the base address of the program could be different each time. Also, puts() is used instead of printf() because the compiler optimized the call since there is no need to format anything.
In this case, the binary was most probably loaded with the base address 0x555555554000. The string to use is stored in your binary at offset 0x6f4. Since the next instruction is at offset 0x652, you know that, no matter where the binary is loaded in memory, the address you want will be rip + (0x6f4 - 0x652) = rip + 0xa2, which is what you see above. See this answer of mine for another example.
The purpose of:
mov eax,0x0
Is to set the return value of main(). In Intel x86, the calling convention is to return values in the rax register (eax if the value is 32 bits, which is true in this case since main returns an int). See the table entry for x86-64 at the end of this page.
Even if you don't add an explicit return statement, main() is a special function, and the compiler will add a default return 0 for you.
If you add some debug data and symbols to the assembly everything will be easier. It is also easier to read the code if you add some optimizations.
There is a very useful tool godbolt and your example https://godbolt.org/z/9sRFmU
On the asm listing there you can clearly see that that lines loads the address of the string literal which will be then printed by the function.
EAX is considered volatile and main by default returns zero and thats the reason why it is zeroed.
The calling convention is explained here: https://en.wikipedia.org/wiki/X86_calling_conventions
Here you have more interesting cases https://godbolt.org/z/M4MeGk

Is the stack frame required for all functions in C on x86-64?

I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.
_string_length1:
push rbp
mov rbp, rsp
xor rax, rax
.body:
cmp byte [rdi], 0
je .exit
inc rdi
inc rax
jmp .body
.exit:
pop rbp
ret
This is the C function I'm trying to beat:
size_t string_length2(const char *str) {
size_t ret = 0;
while (str[ret]) {
ret++;
}
return ret;
}
And it disassembles to this:
string_length2:
push rbp
mov rbp, rsp
mov rax, -1
LBB0_1:
cmp byte ptr [rdi + rax + 1], 0
lea rax, [rax + 1]
jne LBB0_1
pop rbp
ret
Every C function sets up a stack frame using push rbp and mov rbp, rsp, and breaks it using pop rbp. But I'm not using the stack in any way here, I'm only using processor registers. It worked without using a stack frame (when I tested on x86-64), but is it necessary?
No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack. Notably when it is able to inline a called function (in some specific call site), or when the compiler successfully detects a tail call (which reuses the caller's frame).
Read the ABI of your platform to understand requirements related to the stack.
You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.
In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.
BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.
Every C function sets up a stack frame using...
This is true for your compiler, not in general. It is possible to compile a C program without using the stack at all—see, for example, the method CPS, continuation passing style. Probably no C compiler on the market does so, but it is important to know that there are other ways to execute programs, in addition to stack-evaluation.
The ISO 9899 standard says nothing about the stack. It leaves compiler implementations free to choose whichever method of evaluation they consider to be the best.

How is this string in an array represented in assembly when a C program is compiled using the gcc -S option?

This is a C program, which has been compiled to assembly using gcc -S. How is string "Hello, world" represented in this program?
This is the C-code:
1. #include <stdio.h>
2.
3. int main(void) {
4.
5. char st[] = "Hello, wolrd";
6. printf("%s\n", st);
7.
8. return 0;
9. }
Heres the assembly code:
1. .intel_syntax noprefix
2. .text
3. .globl main
4.
5. main:
6. push rbp
7. mov rbp, rsp
8. sub rsp, 32
9. mov rax, QWORD PTR fs:40
10 mov QWORD PTR [rbp-8], rax
11. xor eax, eax
12. movabs rax, 8583909746840200520
15. mov QWORD PTR [rbp-32], rax
14. mov DWORD PTR [rbp-24], 1684828783
15. mov BYTE PTR [rbp-20], 0
16. lea rax, [rbp-32]
17. mov rdi, rax
18. call puts
19. mov eax, 0
20. mov rdx, QWORD PTR [rbp-8]
21. xor rdx, QWORD PTR fs:40
22 je .L3
22. call __stack_chk_fail
23. .L3:
24. leave
25. ret
You are using a local buffer in function main, initialized from a string literal. The compiler compiles this initialization as setting the 16 bytes at [rbp-32] with 3 mov instructions. The first one via rax, the second immediate as the value is 32 bits, the third for a single byte.
8583909746840200520 in decimal is 0x77202c6f6c6c6548 in hex, corresponding to the bytes "Hello, W" in little endian order, 1684828783 is 0x646c726f, the bytes "orld". The third mov sets the final '\0' byte. Hence the buffer contains "Hello, World".
This string is then passed to puts for output to stdout.
Note that gcc optimized the call printf("%s\n", "Hello, World"); to puts("Hello, World");! By the way, clang performs the same optimization.
Interesting.
If you'd written const char *str="...", gcc would have passed puts a pointer to the string sitting there in the .rodata section, like in this godbolt link. (Well-spotted by chqrlie that gcc is optimizing printf to puts).
Your code forces the compiler to make a writeable copy of the string literal, by assigning it to a non-const char[]. (Actually, even with const char str[], gcc still generates it on the fly from mov-immediates. clang-3.7 spots the chance to optimize, though.)
Interestingly, it encodes it into immediate data, rather than copying into the buffer. If the array had been global, it would have just been sitting there in the regular .data section, not .rodata.
Also, in general avoid using main() to see compiler optimization. gcc on purpose marks it as "cold", and optimizes it less. This is an advantage for real programs that do their real work in other functions. No difference in this case, renaming main. But usually if you're looking at how gcc optimizes something, it's best to write a function that takes some args, and use those. Then you don't have to worry about gcc seeing that the inputs or loop-bounds are compile-time constants, either.

C allocated space size on stack for an array

I have a simple program called demo.c which allocates space for a char array with the length of 8 on the stack
#include<stdio.h>
main()
{
char buffer[8];
return 0;
}
I thought that 8 bytes will be allocated from stack for the eight chars but if I check this in gdb there are 10 bytes subtracted from the stack.
I compile the the program with this command on my Ubuntu 32 bit machine:
$ gcc -ggdb -o demo demo.c
Then I analyze the program with:
$ gdb demo
$ disassemble main
(gdb) disassemble main
Dump of assembler code for function main:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
0x0804840d <+9>: mov %gs:0x14,%eax
0x08048413 <+15>: mov %eax,0xc(%esp)
0x08048417 <+19>: xor %eax,%eax
0x08048419 <+21>: mov $0x0,%eax
0x0804841e <+26>: mov 0xc(%esp),%edx
0x08048422 <+30>: xor %gs:0x14,%edx
0x08048429 <+37>: je 0x8048430 <main+44>
0x0804842b <+39>: call 0x8048340 <__stack_chk_fail#plt>
0x08048430 <+44>: leave
0x08048431 <+45>: ret
End of assembler dump.
0x0804840a <+6>: sub $0x10,%esp says, that there are 10 bytes allocated from the stack right?
Why are there 10 bytes allocated and not 8?
No, 0x10 means it's hexadecimal, i.e. 1016, which is 1610 bytes in decimal.
Probably due to alignment requirements for the stack.
Please note that the constant $0x10 is in hexadecimal this is equal to 16 byte.
Take a look at the machine code:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
...
0x08048430 <+44>: leave
0x08048431 <+45>: ret
As you can see before we subtract 16 from the esp we ensure to make esp pointing to a 16 byte aligned address first (take a look at the and $0xfffffff0,%esp instruction).
I guess the compiler try to respect the alignment so he simply reserves 16 byte as well. It does not matter anyway because 8 byte fit into 16 byte very well.
sub $0x10, %esp is saying that there are 16 bytes on the stack, not 10 since 0x is hexadecimal notation.
The amount of space for the stack is completely dependent on the compiler. In this case it's most like an alignment issue where the alignment is 16 bytes and you've requested 8, so it gets increased to 16.
If you requested 17 bytes, it would most likely have been sub $0x20, %esp or 32 bytes instead of 17.
(I skipped over some things the other answers explain in more detail).
You compiled with -O0, so gcc is operating in a super-simple way that tells you something about compiler internals, but little about how to make good code from C.
gcc is keeping the stack 16B-aligned at all times. The 32bit SysV ABI only guarantees 4B stack alignment, but GNU/Linux systems actually assume and maintain gcc's default -mpreferred-stack-boundary=4 (16B-aligned).
Your version of gcc also defaults to using -fstack-protector, so it checks for stack-smashing in functions with local char arrays with 4 or more elements:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to
functions with
vulnerable objects. This includes functions that call "alloca", and functions with buffers larger than 8 bytes. The guards
are
initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is
printed and
the program exits.
For some reason, this is actually kicking in with char arrays >= 4B, but not with integer arrays. (At least, not when they're unused!). char pointers can alias anything, which may have something to do with it.
See the code on godbolt, with asm output. Note how main is special: it uses andl $-16, %esp to align the stack on entry to main, but other functions assume the stack was 16B-aligned before the call instruction that called them. So they'll typically sub $24, %esp, after pushing %ebp. (%ebp and the return address are 8B total, so the stack is 8B away from being 16B-aligned). This leaves room for the stack-protector canary.
The 32bit SysV ABI only requires arrays to be aligned to the natural alignment of their elements, so this 16B alignment for the char array is just what the compiler decided to do in this case, not something you can count on.
The 64bit ABI is different:
An array uses the same alignment as its elements, except that a local
or global array variable of length at least 16 bytes or a C99
variable-length array variable always has alignment of at least 16
bytes
(links from the x86 tag wiki)
So you can count on char buf[1024] being 16B-aligned on SysV, allowing you to use SSE aligned loads/stores on it.

Resources