I wrote my own malloc, new, and realloc for my C++ project. Some of these pages are >= 4K. I was wondering: when I call my malloc, is there a way I can zero out the 4K+ page without reading the data into cache? I vaguely remember reading about something like this in either the Intel or AMD x86-64 documentation, but I can't remember what it's called.
Does gcc (or clang) have an intrinsic I can use? If not, what assembly instructions should I look up? I have 3 common use cases after a malloc: zeroing the memory, memcpy-ing a buffer, and mixing both (64 or 512 bytes of memcpy, then the rest as zeros). I'm not sure what the minimum architecture I'll support will be, but it's no less than Haswell. Likely it'll be Intel Skylake/AMD Zen and up.
-Edit- I rolled back the C++ tag to C because generally intrinsic is in C
Under Unix systems you can mmap /dev/zero to get zero-filled pages. That would give you zeroed pages for sure. Depending on the kernel, MAP_ANONYMOUS might also give you zero-filled pages. Neither way should poison the caches.
You can also use MAP_POPULATE (Linux) to allocate physical pages from the start instead of faulting them in on first access. Hopefully this wouldn't poison the caches either but I never verified that in the Linux source.
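As a rough sketch of the mmap approach on Linux (the helper names alloc_zero_pages and free_zero_pages are made up for illustration, and I have not measured the cache behaviour): the Linux mmap(2) man page documents that anonymous mappings are initialized to zero, and MAP_POPULATE is only added when the headers define it.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS / MAP_POPULATE on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns len bytes of zero-filled memory,
   or NULL on failure. */
void *alloc_zero_pages(size_t len) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE; /* Linux only: prefault the pages up front */
#endif
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: releases a mapping made by alloc_zero_pages. */
void free_zero_pages(void *p, size_t len) {
    if (p != NULL)
        munmap(p, len);
}
```

Note that this only helps for fresh mappings; memory you recycle inside your own allocator still has to be cleared by hand.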
But I have to wonder: Why would you zero out the pages on malloc/realloc/new? Only calloc zeroes out pages and for everything else the compiler or source code will zero out the memory. Unless you change the compiler to know about you already zeroing out the pages there won't be any benefit.
Note: For many types in C++ the memory will not be zeroed out at all but initialized using the constructors.
I think rep stosb meets your needs. Even though it does 1-byte writes, it uses write combining internally, so it will fill a full cache line before issuing a write. Then since an entire cache line is being written, it doesn't need to read the dead contents of memory before writing the line to L1.
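A minimal sketch of issuing rep stosb from C, assuming GNU-style inline asm on x86-64 (the function name zero_rep_stosb is just for illustration). It relies on the direction flag being clear, which the usual calling conventions guarantee on function entry, and falls back to memset on other targets:

```c
#include <stddef.h>
#include <string.h>

/* Zero n bytes starting at dst. */
static inline void zero_rep_stosb(void *dst, size_t n) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n) /* RDI = destination, RCX = count */
                     : "a"(0)             /* AL = byte value to store */
                     : "memory");
#else
    memset(dst, 0, n); /* portable fallback */
#endif
}
```

Whether this beats your libc's memset for 4K+ blocks depends on the CPU, so measure before committing to it.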
Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer[1]. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
int processBytes(uint8_t *input, size_t size) {
    uint64_t *input64 = (uint64_t *)input;
    uint64_t *end64 = (uint64_t *)(input + size);
    int res;
    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }
    // check the first 8 (possibly unaligned) bytes
    if ((res = match(*input64)) >= 0) {
        return res;
    }
    // align pointer to the next 8-byte boundary
    input64 = (uint64_t *)(((uintptr_t)input + 8) & ~(uintptr_t)0x7);
    for (; input64 < end64; input64++) {
        if ((res = match(*input64)) >= 0) {
            // byte position of the match relative to the start of input
            size_t pos = (size_t)((uint8_t *)input64 - input) + (size_t)res;
            // discard spurious matches in the bytes past the end of the buffer
            return pos < size ? (int)pos : -1;
        }
    }
    return -1;
}
The inner function int match(uint64_t bytes) isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned) bytes. Then a loop is done for the remaining floor((size - 7) / 8) chunks of 8 bytes[2]. This loop may read up to 7 bytes past the end of the buffer (the 7-byte case occurs when ((input + size) & 0x7) == 1, i.e. when only the first byte of the last aligned word belongs to the buffer). However, the return statement has a check which excludes any spurious matches which occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)[3].
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed
in the standard. I appreciate language lawyer answers, and even occasionally write
them myself, and I'll even be happy when someone digs up the chapter
and verse which shows the code above is undefined behavior and hence
not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what
I'm after. As a practical matter, many common idioms involving pointer
conversion, structure access through such pointers and so on are
technically undefined, but are widespread in high quality and high
performance code. Often there is no alternative, or the alternative
runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e.,
the compiler hasn't used a provable partially out-of-bounds access to
do something really clever),
is executing the compiled program safe?
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the
standard showing this is UB, I am very interested in any details of
actual implementations that can use this particular UB to produce
unexpected code. Now I don't think this can happen without some
pretty deep cross-procedure analysis, but the gcc overflow stuff
surprised a lot of people too...
[1] Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
[2] Note that for this overlapping to work, this function and the match() function must behave in a specific idempotent way; in particular, the return value must support overlapping checks. So a "find first byte matching pattern" works since all the match() calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions such as a "return the minimum byte" call would work even without the in-order restriction, but need to examine all bytes.
[3] It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok, which controls whether such reads are in fact reported as an error. The default is yes, which means that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match(), such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
Yes, it's safe in x86 asm, and existing libc strlen(3) implementations take advantage of this in hand-written asm. Even glibc's fallback C does this, but it is compiled without LTO, so it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB, see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias)) typedef instead of plain unsigned long as your wider type, like __m128i etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries.
Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if ((p & 4095) > (4096 - 16)) do_special_case_fallback. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t is zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
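To make the bithack mentioned above concrete, here is a small sketch (the function names are mine, not from any library). It works on a value that has already been loaded, so it says nothing about how you obtained the word; the point is that garbage bytes above the first zero byte never change the answer:

```c
#include <stdint.h>

/* Nonzero if any of the 8 bytes in v is 0x00 (classic "haszero" bithack). */
static int has_zero_byte(uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Index (0-7) of the lowest-order zero byte, or -1 if there is none.
   On little-endian x86 the lowest-order byte is the lowest address. */
static int first_zero_byte(uint64_t v) {
    if (!has_zero_byte(v))
        return -1;
    for (int i = 0; i < 8; i++) {
        if (((v >> (8 * i)) & 0xFF) == 0)
            return i; /* bytes above this one are never inspected */
    }
    return -1;
}
```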
Unusual situations where this wouldn't be safe in x86 asm
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers DR0-DR3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16 or 32-bit OS that uses segmentation: a segment limit can use 4k or 1-byte granularity, so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance.) All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*. This sort of situation doesn't happen for normal code.
strlen is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0 byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub (min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb against a vector of all-zero, and pmovmskb / bsf to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
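Here's a scalar sketch of that overlapping-tail idea, using 8-byte chunks and memcpy for the unaligned accesses instead of SIMD loads (the function name is made up, and it assumes n >= 8). The final chunk is placed so that it ends exactly at the last byte, re-copying a few bytes that were already handled:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (n >= 8) in 8-byte chunks without touching any byte
   outside [src, src + n) or [dst, dst + n). */
static void copy_with_overlapping_tail(uint8_t *dst, const uint8_t *src, size_t n) {
    uint64_t chunk;
    size_t i = 0;
    /* full chunks from the front */
    for (; i + 8 <= n; i += 8) {
        memcpy(&chunk, src + i, 8);
        memcpy(dst + i, &chunk, 8);
    }
    /* leftover bytes: one chunk that ends at the last byte and
       overlaps the previous chunk */
    if (i < n) {
        memcpy(&chunk, src + n - 8, 8);
        memcpy(dst + n - 8, &chunk, 8);
    }
}
```

This is fine for a copy because storing the same bytes twice is harmless; it would not be fine for something like an in-place increment.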
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int * was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias)), or memcpy, to make that safe. The may_alias part isn't needed if you only access it via signed/unsigned int* and/or char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
The set of things ISO C leaves undefined but that Intel intrinsics require compilers to define does include creating unaligned pointers (at least with types like __m128i*), but not dereferencing them directly. See: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
Checking if a pointer is far enough from the end of a 4k page
This is useful for the first vector of strlen; after this you can p = (p+16) & -16 to go to the next aligned vector. This will partially overlap if p was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check (((p + 15) ^ p) & 0xFFF...F000) == 0 (LEA / XOR / TEST), which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= (p|0xFFF) (LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, (p & 4095) > (4096 - 16) (MOV / AND / CMP) detects a load that would cross into the next page; conversely, (p & (pgsize-1)) <= (pgsize - vecwidth) checks that the offset-within-page is far enough from the end of a page.
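Written out as C with explicit parentheses (plain & and | bind more loosely than the comparison operators, so the one-liners above need them), the three checks look roughly like this. The names and the fixed 4096/16 constants are just for illustration. A 16-byte load at page offset 4080 ends exactly at byte 4095, so the offset form uses <=:

```c
#include <stdint.h>

#define PAGE_BYTES 4096u
#define VEC_WIDTH  16u

/* Each function returns nonzero when a VEC_WIDTH-byte load starting at
   address p stays within the page that contains p. */

static int safe_xor(uintptr_t p) {
    /* first and last byte of the load share the same page-address bits */
    return (((p + (VEC_WIDTH - 1)) ^ p) & ~(uintptr_t)(PAGE_BYTES - 1)) == 0;
}

static int safe_or(uintptr_t p) {
    /* last byte of the load is <= the last byte of p's page */
    return p + (VEC_WIDTH - 1) <= (p | (PAGE_BYTES - 1));
}

static int safe_offset(uintptr_t p) {
    /* offset within the page leaves room for a full vector */
    return (p & (PAGE_BYTES - 1)) <= PAGE_BYTES - VEC_WIDTH;
}
```

All three agree for every address, so the choice between them is only about which instruction sequence you prefer.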
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int instead of uintptr_t, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20) (MOV / SHL / CMP), because shl reg, 20 is 3 bytes, vs. and eax, imm32 being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff.)
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias)); to make it safe to do unaligned accesses.
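For illustration, a sketch of how that typedef would be used, assuming GCC or clang (the two load helpers are hypothetical names). The memcpy version is the portable way to express the same load and should return the same value:

```c
#include <string.h>

/* GNU C extension: a type that may alias anything and has alignment 1. */
typedef unsigned long aliasing_unaligned_ulong
    __attribute__((aligned(1), may_alias));

/* Load an unsigned long from any address through the typedef. */
static unsigned long load_ulong(const void *p) {
    return *(const aliasing_unaligned_ulong *)p;
}

/* Portable equivalent using memcpy. */
static unsigned long load_ulong_memcpy(const void *p) {
    unsigned long v;
    memcpy(&v, p, sizeof v);
    return v;
}
```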
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800 may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.
I'm working on an embedded project which involves I/O on memory-mapped FPGA registers. Pointers to these memory regions need to be marked volatile so the compiler does not "optimize out" reads and writes to the FPGA by caching values in CPU registers.
In a few cases, we want to copy a series of FPGA registers into a buffer for further use. Since the registers are mapped to contiguous addresses, memcpy seems appropriate, but passing our volatile pointer as the source argument gives a warning about discarding the volatile qualifier.
Is it safe (and sane) to cast away the volatile-ness of the pointer to suppress this warning? Unless the compiler does something magical, I can't imagine a scenario where calling memcpy would fail to perform an actual copy. The alternative is to just use a for loop and copy byte by byte, but memcpy implementations can (and do) optimize the copy based on size of the copy, alignment, etc.
As a developer of both FPGA and embedded software, there is just one clear answer: do not use memcpy et al. for this.
Some reasons:
There is no guarantee memcpy will work in any specific order.
The compiler might very well replace the call with inline code.
Such accesses often require a certain word-size. memcpy does not guarantee that.
Gaps in the register map might result in undefined behaviour.
You can, however, use a simple for loop and copy yourself. This is safe, if the registers are volatile (see below).
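A minimal sketch of such a loop, assuming 32-bit wide registers (the function and parameter names are made up; adjust the type to whatever access width your FPGA requires). Because the source is volatile-qualified, the compiler has to perform exactly one read per register, in index order, at the declared width:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy count 32-bit registers from a memory-mapped block into a normal
   buffer, one volatile read per register, in ascending address order. */
static void copy_regs(uint32_t *dst, const volatile uint32_t *regs, size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i] = regs[i];
    }
}
```

This only covers what the compiler does; whether the bus preserves the order is a separate question.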
Depending on your platform, volatile alone might not be sufficient. The memory area also has to be non-cacheable and strictly ordered (and, possibly, non-shared). Otherwise the system buses might (and will for some platforms) reorder accesses.
Furthermore, you might need barriers/fences for your CPU not to reorder accesses. Please read your hardware-specs very carefully about this.
If you need to transfer larger blocks more often, think about using DMA. If the FPGA uses PCI(e), you could use busmaster DMA with scatter/gather for instance (however, this is not easily implemented; did that myself, but might be worth the effort).
The best (and most sane) approach actually depends on multiple factors, like platform, required speed, etc. Of all possible approaches, I would deem using memcpy() one of the less sane ones at best. (Not sure if that is correct grammar, but I hope you got my point.)
Absolutely not safe. There is no guarantee whatsoever in which order memcpy will copy the data, and how many bytes are copied at a time.
After having successfully implemented the Karatsuba algorithm, I decided to compare the time needed with the school algorithm. The program needs to test up to 32768 digits. Unfortunately, it stops at 8192 digits (the digits are stored in an array). When running it with gdb I get the output: Programme terminated with SIGKILL, Killed. So obviously I searched through the web and found out that (since I'm on Linux) the kernel automatically killed the program because it consumed too many resources.
So my question is: Is there a way to keep it running?
Thanks in advance for any response
The most probable cause is memory exhaustion. You can roughly test this hypothesis by running top on the terminal.
If this is the case, valgrind is your friend. Look very carefully at every place you call malloc in your program and ensure that you call free for each array afterwards.
I see a number of things you should do before forcing Linux to keep your program running (if you could do that anyway).
Watch out for memory leaks (see answer of jons34yp)
Once all memory leaks are resolved, check the declarations of your
variables: every allocated but unused bit is one too many. If a
byte is enough (unsigned char), don't use a short. If a short is
enough, don't use a long. Same for floats and doubles. Also check
any structs and unions for unused data.
Also check your algorithm and the way you implement it, e.g. a
sparse matrix can be represented in other ways than wasting entire
arrays.
Keep in mind that C compilers usually align data fields. This means
that after, for instance, an array of 13 bytes, compilers tend to
align the next bytes on a 32-bit or 64-bit boundary, leaving you with
unused bytes in between. The same thing can happen within structs.
So check your compiler's alignment settings.
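To see the padding for yourself, something like the following sketch can help (the struct and helper are made up for illustration). On common ABIs the 4-byte member is placed at offset 16, so 3 bytes after the 13-byte array go unused, but the exact numbers depend on your compiler and target:

```c
#include <stddef.h>
#include <stdint.h>

/* A 13-byte array followed by a member that wants 4-byte alignment. */
struct padded {
    uint8_t  bytes[13];
    uint32_t word;
};

/* Number of unused bytes the compiler inserted between the two members. */
static size_t padding_before_word(void) {
    return offsetof(struct padded, word) - sizeof(((struct padded *)0)->bytes);
}
```

Printing sizeof(struct padded) next to the sum of the member sizes shows the total overhead, including any padding at the end of the struct.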
I hope this helps to find a solution.
Kind regards,
PB
I'm on Linux, which has address space layout randomization. Is it necessary to declare a buffer on the stack, leave it uninitialized, and use it for entropy, or can I just take the address of something already on the stack, cast it to an integer and (knowing that it is somewhat random due to address space layout randomization) use that integer for entropy instead?
The pointer approach has an advantage in that it generates no compiler warnings, as the uninitialized buffer does when you attempt to manipulate it, but in my tests it seemed only the lower-order part of the address (maybe the last byte or two) would change from invocation to invocation. The entropy buffer seemed to perform even worse, often containing nothing at all.
If you need weak entropy on Linux, why not read /dev/urandom? It's a non-blocking variant of /dev/random that's less...random (but, again, non-blocking).
Fundamentally, if you need entropy for anything, you need to take it from some external source, not some quirk of the compiler, or expectations of memory allocation layout. There's no guarantee that the compiler will give you different pointers. You might write code that works okay on one system, but completely fails to give entropy on another.
As others have suggested, using /dev/random is a good idea. If that's not available, you might be able to get away with calling the time() function (time.h) if you just need a little entropy.
However, I get very worried when people ask for just a little entropy because that suggests some kind of reliance on a random value. Having only a little entropy means that it will frequently return the same value, possibly causing the system to fail in unexpected ways. My strong recommendation is to always get good entropy from sources like /dev/random.
What's wrong with /dev/random?
Do not use uninitialized memory for entropy.
Especially the stack. It has a tendency to look very similar in consecutive runs.
And it's quite predictable and not very random.
What exactly do you mean by weak? The canonical source of entropy in C (for non-cryptographic purposes) is time from <time.h>.
Accessing an uninitialized variable is undefined behavior and may have unpredictable consequences on some platforms. Don't do it.
Why don't you read a few bytes from /dev/random or /dev/urandom?
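For completeness, reading a few bytes from /dev/urandom takes only a handful of lines. This is a minimal sketch (the function name read_urandom is made up) that assumes the device node exists, as it does on Linux:

```c
#include <stddef.h>
#include <stdio.h>

/* Fill buf with n bytes from /dev/urandom.
   Returns 0 on success, -1 on failure. */
static int read_urandom(void *buf, size_t n) {
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(buf, 1, n, f);
    fclose(f);
    return got == n ? 0 : -1;
}
```

Check the return value; if the read fails, don't silently fall back to something like an uninitialized buffer.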