Zero out memory without reading into cache? - c

I wrote my own malloc, new, and realloc for my C++ project. Some of these allocations are >= 4K. I was wondering: when I call my malloc, is there a way I can zero out the 4K+ page without reading the data into cache? I vaguely remember reading about something like this in either the Intel or AMD x86-64 documentation but I can't remember what it's called.
Does gcc (or clang) have an intrinsic I can use? If not, what assembly instructions should I look up? I have 3 common use cases after a malloc: zeroing the memory, memcpy-ing a buffer, and mixing both (64 or 512 bytes of memcpy, then the rest as zeros). I'm not sure what the minimum architecture I'll support will be, but it's no less than Haswell. Likely it'll be Intel Skylake/AMD Zen and up.
-Edit- I rolled back the C++ tag to C because intrinsics are generally a C thing

Under Unix systems you can mmap /dev/zero to get zero-filled pages. That would give you zeroed pages for sure. Depending on the kernel, MAP_ANONYMOUS might also give you zero-filled pages. Neither way should poison the caches.
You can also use MAP_POPULATE (Linux) to allocate physical pages from the start instead of faulting them in on first access. Hopefully this wouldn't poison the caches either, but I never verified that in the Linux source.
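As a minimal sketch of the MAP_ANONYMOUS route (Linux/Unix only; error handling trimmed, and `alloc_zeroed_pages` is a made-up name for illustration):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate zero-filled pages straight from the kernel. MAP_ANONYMOUS
   pages are guaranteed to read as zero; the kernel typically maps them
   copy-on-write from a shared zero page, so no explicit clearing loop
   (and no cache pollution from one) is needed. */
static void *alloc_zeroed_pages(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Pages come back zeroed without the allocator ever touching them, which is exactly the no-cache-pollution property being asked about (until you write to them, of course).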
But I have to wonder: why would you zero out the pages on malloc/realloc/new? Only calloc zeroes out pages, and for everything else the compiler or source code will zero out the memory. Unless you teach the compiler that the pages are already zeroed, there won't be any benefit.
Note: For many types in C++ the memory will not be zeroed out at all but initialized by the constructors.

I think rep stosb meets your needs. Even though it does 1-byte writes, it uses write combining internally, so it will fill a full cache line before issuing a write. Then since an entire cache line is being written, it doesn't need to read the dead contents of memory before writing the line to L1.

Related

Processor heap read and prefetch

So I'm trying to figure out how heap reads can interact with processor prefetching and slow a program down. It's only a theoretical question, so in the example I use some C-like wonder-language.
So let's suppose we have some heap of 120 bytes, and some of its memory is in use by a program:
[0...19, /* FREE */, 40...79, /* FREE TILL THE END (119) */]
and I have some structs magically aligned to 20 bytes:
#include <stdlib.h>

struct magic_struct {
    long long int foo[3];
    short int bar;
};

typedef struct magic_struct MagicStruct;

void read_magic_struct(MagicStruct *buzz) {
    // Some code to read struct
}

int main(void) {
    MagicStruct *str1 = malloc(sizeof(MagicStruct));
    MagicStruct *str2 = malloc(sizeof(MagicStruct));
    read_magic_struct(str1);
    read_magic_struct(str2);
    free(str1);
    free(str2);
}
So let's suppose that our processor fetches cache lines of 40 bytes.
Does that mean that, with our current memory representation, the processor can't prefetch str2 while reading str1, so program execution will be slowed down? How do structs get allocated if there were an empty memory buffer, or if the first empty memory chunk were 40 bytes along? Would the processor take cache misses if the structs' size were 50 bytes? Does some mechanism decide where and when to allocate memory on the heap?
Does some mechanism decide where and when to allocate memory on the heap?
The mechanism that decides where memory is allocated on the heap is a piece of code called "memory allocator" - a part of the C runtime library - and it doesn't care much about prefetching or anything like that. Most memory allocators do their best to keep "related" allocations "close together", but it's only on a best-effort basis. So you can't really assume anything about how far away those two allocations are.
How do structs get allocated if there were an empty memory buffer, or if the first empty memory chunk were 40 bytes along?
In any way that is possible: you can't know until you read the source code of the memory allocator used by the C runtime library on your system (e.g. glibc on most Linux systems). It's essentially arbitrary, and in general you can't predict how those structs will get allocated. Also, in real life the heap doesn't have to be contiguous, i.e. it doesn't have to be a single big memory block; it will usually be many memory blocks with gaps between them.
So let's suppose that our processor fetches cache lines of 40 bytes; does it mean that with our current memory representation the processor can't prefetch str2 while reading str1, so program execution will be slowed down?
Assuming the heap layout you proposed, the only valid statement is as follows: the cacheline containing str1 does not contain str2. That's all you can say, and no more. It tells you nothing about prefetch, because prefetch has to do with fetching other cachelines ahead of time, before they are needed.
I think you're misusing the term "prefetch", because the processor only exchanges data with memory in terms of complete cachelines. Prefetch does not mean that when fetching a single cacheline some other useful data also happened to be inside that cacheline. That property is called spatial locality, and it's a function of how you lay out data in your program. If you need such fine control, you need to write your own memory allocator (even if it's a simple one).
Prefetch means that the processor has a system that monitors the cachelines being fetched, and anticipates the need for another cacheline, and fetches it before you use it. Prefetch can be triggered by explicit prefetch machine instructions, if you want to control it to such an extent, or it can be triggered by the prefetch algorithms implemented in the processor. Modern processors are extremely good at detecting sequential access that's contiguous or even separated by repeating gaps.
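The explicit-prefetch route can be sketched with GCC/Clang's __builtin_prefetch (the prefetch distance of 16 elements here is an arbitrary assumption; the right value depends on the CPU, memory latency, and the loop body):

```c
#include <stddef.h>

/* Sum an array while software-prefetching a fixed distance ahead.
   __builtin_prefetch(addr, rw, locality): rw=0 means the data will be
   read, locality=3 means keep it in all cache levels. On hardware with
   good stride prefetchers this sequential case usually needs no hint;
   the builtin pays off for irregular access patterns. */
static long long sum_with_prefetch(const long long *a, size_t n) {
    long long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);
        s += a[i];
    }
    return s;
}
```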
It's certainly good to ponder such things, but you have to understand that processors are pretty darn good, and if you think that prefetch may be your problem, you need a very good argument why that is so, and normally you'd need actual measurements. Talking about optimizing prefetch without measurements that show the processor is actually waiting for cachelines to come in from memory is a waste of time.
So, in practical terms, if you want to explore this using real prefetch on real processors (as opposed to imaginary stuff), you'll have to install e.g. Intel VTune Profiler, or use Valgrind's cachegrind (on Linux), run your program under those tools, then be able to interpret the results, pinpoint problems, address them by changing data layout or using explicit prefetch machine instructions (or compiler intrinsics), and then validate your solution by seeing improved cache hit rates.
Ideally this should be automated so you won't have performance regressions, i.e. you'd run all this instrumentation and result interpretation as scripts under the continuous integration system, so that you know you won't accidentally break things. But this takes lots of work, and if you're starting from scratch (a new project), assuming you were minimally fluent in scripting, CI systems, and C, it could still take hundreds of hours to fully set up and shake down to be trustworthy. The guiding principle is:
Lack of measurement is prima-facie evidence that you don't care about the result
In other words, if you actually care about some aspect of performance, then the only way to show such care is to measure it, and to have those measurements be part of the normal workflow: once they're set up, you don't even worry about them - you check in some new code, and if you broke performance, the performance tests will fail, and you'll have to fix it before you're allowed to merge. That's how it's normally done in an environment where performance matters. Otherwise, the only interpretation is that you don't really care, since without automated measurements it is almost certain that some "minor" code change will affect performance and slip by unnoticed. Once you're told this, there's no going back, sorry :)

Is it safe to read past the end of a buffer within the same page on x86 and x64?

Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer1. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
int processBytes(uint8_t *input, size_t size) {
    uint64_t *input64 = (uint64_t *)input, *end64 = (uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 (possibly unaligned) bytes
    if ((res = match(*(uint64_t *)input)) >= 0) {
        return res;
    }

    // align the pointer to the next 8-byte boundary
    input64 = (uint64_t *)(((uintptr_t)input64 + 8) & ~(uintptr_t)0x7);
    for (; input64 < end64; input64++) {
        if ((res = match(*input64)) >= 0) {
            // exclude spurious matches from bytes read past the end
            size_t pos = (size_t)((uint8_t *)input64 - input) + (size_t)res;
            return pos < size ? (int)pos : -1;
        }
    }
    return -1;
}
The inner function int match(uint64_t bytes) isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned) bytes. Then a loop is done for the remaining floor((size - 7) / 8) chunks of 8 bytes2. This loop may read up to 7 bytes past the end of the buffer (the 7-byte case occurs when (input + size) % 8 == 1). However, the return has a check which excludes any spurious matches which occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)3.
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed
in the standard. I appreciate language lawyer answers, and even occasionally write
them myself, and I'll even be happy when someone digs up the chapter
and verse which shows the code above is undefined behavior and hence
not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what
I'm after. As a practical matter, many common idioms involving pointer
conversion, structure access through such pointers and so on are
technically undefined, but are widespread in high quality and high
performance code. Often there is no alternative, or the alternative
runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e.,
the compiler hasn't used a provable partially out-of-bounds access to
do something really
clever),
is executing the compiled program safe?
In that respect, this question is both a C question and an x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the
standard showing this is UD, I am very interested in any details of
actual implementations that can use this particular UD to produce
unexpected code. Now I don't think this can happen without some
pretty deep cross-procedure analysis, but the gcc overflow stuff
surprised a lot of people too...
1 Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
2 Note that for this overlapping to work, this function and the match() function must behave in a specific idempotent way - in particular, the return value must support overlapping checks. So a "find first byte matching pattern" works, since all the match() calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions, such as a "return the minimum byte" call, would work even without the in-order restriction, but need to examine all bytes.
3 It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok, which controls whether such reads are in fact reported as an error. The default is yes, meaning that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match(), such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
Yes, it's safe in x86 asm, and existing libc strlen(3) implementations take advantage of this in hand-written asm. And even glibc's fallback C does, but it compiles without LTO so it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB; see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias)) typedef instead of plain unsigned long as your wider type, like __m128i etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens in aligned pages, so at least 4k boundaries.
Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check whether you're far enough from the next page boundary to do a 16-byte load, like if ((p & 4095) > (4096 - 16)) do_special_case_fallback. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find whether one of the bytes of a uint64_t is zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
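The zero-byte bithack mentioned here can be written out as follows (the classic "haszero" trick; the result is nonzero if and only if some byte of v is zero):

```c
#include <stdint.h>

/* For each byte: (v - 0x01) borrows into bit 7 only when the byte was 0
   (or had a borrow propagate through 0x00); ANDing with ~v rejects the
   false positives, and masking with 0x80 per byte keeps just the flags.
   Nonzero result <=> at least one byte of v is 0x00. */
static uint64_t haszero64(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
```

This is the core of the word-at-a-time strlen loops discussed below: the garbage bytes past the terminator can set spurious flags only in positions after the real zero, which the subsequent byte loop ignores.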
Unusual situations where this wouldn't be safe in x86 asm
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers DR0-DR3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16- or 32-bit OS that uses segmentation: a segment limit can use 4k or 1-byte granularity, so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance.) All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits in 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*. This sort of situation doesn't happen for normal code.
strlen is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0 byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub (min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb against a vector of all-zero, and pmovmskb / bsf to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int * was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias)), or memcpy, to make that safe. The may_alias part isn't needed if you only access it via signed/unsigned int* and/or char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
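The memcpy route looks like this (a sketch; on x86, GCC and Clang compile it to a single unaligned mov with no actual function call):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit load: memcpy has no alignment or aliasing
   requirements, and the fixed 4-byte size lets the compiler turn the
   call into one unaligned load instruction. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

The same pattern works for stores and for other widths, and unlike the may_alias typedef it is strictly conforming ISO C.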
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
Checking if a pointer is far enough from the end of a 4k page
This is useful for the first vector of strlen; after this you can p = (p+16) & -16 to go to the next aligned vector. This will partially overlap if p was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check ((p + 15) ^ p) & 0xFFF...F000 == 0 (LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= p|0xFFF (LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, (p & 4095) <= (4096 - 16) (MOV / AND / CMP), i.e. (p & (pgsize-1)) <= (pgsize - vecwidth), checks that the offset-within-page is far enough from the end of a page.
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int instead of uintptr_t, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20) (MOV / SHL / CMP), because shl reg, 20 is 3 bytes, vs. and eax, imm32 being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff.)
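These checks can be written out as plain C predicates (a sketch assuming 4k pages and a 16-byte vector load; the function names are made up, and all three forms agree):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a 16-byte load starting at address p stays within p's 4k page. */

/* Last and first byte of the load share the same page-number bits. */
static bool same_page_xor(uintptr_t p)  { return (((p + 15) ^ p) & ~(uintptr_t)0xFFF) == 0; }

/* Last byte of the load <= last byte of the page containing p. */
static bool same_page_or(uintptr_t p)   { return p + 15 <= (p | 0xFFF); }

/* Offset within the page is far enough from the end of the page. */
static bool same_page_mask(uintptr_t p) { return (p & 4095) <= (uintptr_t)(4096 - 16); }
```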
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias)); to make it safe to do unaligned accesses.
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800 may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.

When is it more appropriate to use valloc() as opposed to malloc()?

C (and C++) include a family of dynamic memory allocation functions, most of which are intuitively named and easy to explain to a programmer with a basic understanding of memory. malloc() simply allocates memory, while calloc() allocates some memory and clears it eagerly. There are also realloc() and free(), which are pretty self-explanatory.
The manpage for malloc() also mentions valloc(), which allocates (size) bytes aligned to the page border.
Unfortunately, my background isn't thorough enough in low-level intricacies; what are the implications of allocating and using page border-aligned memory, and when is this appropriate as opposed to regular malloc() or calloc()?
The manpage for valloc contains an important note:
The function valloc() appeared in 3.0BSD. It is documented as being obsolete in 4.3BSD, and as legacy in SUSv2. It does not appear in POSIX.1-2001.
valloc is obsolete and nonstandard - to answer your question, it would never be appropriate to use in new code.
While there are some reasons to want to allocate aligned memory - this question lists a few good ones - it is usually better to let the memory allocator figure out which bit of memory to give you. If you are certain that you need your freshly-allocated memory aligned to something, use aligned_alloc (C11) or posix_memalign (POSIX) instead.
Allocations with page alignment usually are not done for speed - they're because you want to take advantage of some feature of your processor's MMU, which typically works with page granularity.
One example is if you want to use mprotect(2) to change the access rights on that memory. Suppose, for instance, that you want to store some data in a chunk of memory, and then make it read only, so that any buggy part of your program that tries to write there will trigger a segfault. Since mprotect(2) can only change permissions page by page (since this is what the underlying CPU hardware can enforce), the block where you store your data had better be page aligned, and its size had better be a multiple of the page size. Otherwise the area you set read-only might include other, unrelated data that still needs to be written.
Or, perhaps you are going to generate some executable code in memory and then want to execute it later. Memory you allocate by default probably isn't set to allow code execution, so you'll have to use mprotect to give it execute permission. Again, this has to be done with page granularity.
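The read-only-data idea above can be sketched like this (POSIX; `make_readonly_block` is a made-up name, and real code would want more error handling):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy data into a freshly mapped, page-aligned region, then drop write
   permission so any stray write faults. mprotect works page by page,
   which is why the block must be page-aligned and page-sized. */
static void *make_readonly_block(const char *data, size_t len) {
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t sz = ((len + pagesz - 1) / pagesz) * pagesz;  // round up to whole pages
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, data, len);
    if (mprotect(p, sz, PROT_READ) != 0) {   // page-granular permission change
        munmap(p, sz);
        return NULL;
    }
    return p;
}
```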
Another example is if you want to allocate memory now, but might want to mmap something on top of it later.
So in general, a need for page-aligned memory would relate to some fairly low-level application, often involving something system-specific. If you needed it, you'd know. (And as mentioned, you should allocate it not with valloc, but using posix_memalign, or perhaps an anonymous mmap.)
First of all, valloc is obsolete, and memalign should be used instead.
Second, it's not part of the C (or C++) standard at all.
It's a special allocation which is aligned to _SC_PAGESIZE boundary.
When is it useful to use it? I guess never, unless you have some specific low level requirement. If you would need it, you would know to need it, since it's rarely useful (maybe just when trying some micro-optimizations or creating shared memory between processes).
The self-evident answer is that it is appropriate to use valloc when malloc is unsuitable (less efficient) for the application (virtual) memory usage pattern and valloc is better suited (more efficient). This will depend on the OS and libraries and architecture and application...
malloc traditionally allocated real memory from freed memory if available and by increasing the brk point if not, in which case it is cleared by the OS for security reasons.
calloc in a dumb implementation does a malloc and then (re)clears the memory, while a smart implementation would avoid reclearing newly allocated memory that is automatically cleared by the operating system.
valloc relates to virtual memory. In a virtual memory system using the file system, you can allocate a large amount of memory or filespace/swapspace, even more than physical memory, and it will be swapped in by pages, so alignment is a factor. In Unix, creating a file of a specified size and adding/deleting pages is done using inodes to define the file, but doesn't deal with actual disk blocks till needed, in which case it creates them cleared. So I would expect a valloc system to increase the size of the data segment swap without actually allocating physical or swap pages, or running a loop to clear it all - as the file and paging system does that as needed. Thus valloc should be a heck of a lot faster than malloc. But as with calloc, how particular idiosyncratic *x/C flavours do it is up to them, and the valloc man page is totally unhelpful about these expectations.
Traditionally this was implemented with brk/sbrk. Of course in a virtual memory system, whether a paged or a segmented system, there is no real need for any of this brk/sbrk stuff and it is enough to simply write the last location in a file or address space to extend up to that point.
Re the allocation to page boundaries, that is not usually something the user wants or needs, but rather is usually something the system wants or needs.
A (probably more expensive) way to simulate valloc is to determine the page boundary and then call aligned_alloc or posix_memalign with this alignment spec.
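That simulation is only a few lines (a sketch; `valloc_compat` is a made-up name, and the page size is queried at run time with sysconf):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Page-aligned allocation via posix_memalign; free() the result as usual.
   The page size is always a power of two and a multiple of sizeof(void *),
   so it is a valid alignment argument. */
static void *valloc_compat(size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), size) != 0)
        return NULL;
    return p;
}
```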
The fact that valloc is deprecated or has been removed or is not required in some OS' doesn't mean that it isn't still useful and required for best efficiency in others. If it has been deprecated or removed, one would hope that there are replacements that are as efficient (but I wouldn't bet on it, and might, indeed have, written my own malloc replacement).
Over the last 40 years the tradeoffs of real and (once invented) virtual memory have changed periodically, and mainstream OS has tended to go for frills rather than efficiency, with programmers who don't have (time or space) efficiency as a major imperative. In the embedded systems, efficiency is more critical, but even there efficiency is often not well supported by the standard OS and/or tools. But when in doubt, you can roll your own malloc replacement for your application that does what you need, rather than depend on what someone else woke up and decided to do/implement, or to undo/deprecate.
So the real answer is you don't necessarily want to use valloc or malloc or calloc or any of the replacements your current subversion of an OS provides.

How Dangerous is This Faster `strlen`?

strlen is a fairly simple function, and it is obviously O(n) to compute. However, I have seen a few approaches that operate on more than one character at a time. See example 5 here or this approach here. The basic way these work is by reinterpret-casting the char const* buffer to a uint32_t const* buffer and then checking four bytes at a time.
Personally, my gut reaction is that this is a segfault-waiting-to-happen, since I might dereference up to three bytes outside valid memory. However, this solution seems to hang around, and it seems curious to me that something so obviously broken has stood the test of time.
I think this comprises UB for two reasons:
Potential dereference outside valid memory
Potential dereference of unaligned pointer
(Note that there is not an aliasing issue; one might think the uint32_t is aliased as an incompatible type, and code after the strlen (such as code that might change the string) could run out of order to the strlen, but it turns out that char is an explicit exception to strict aliasing).
But, how likely is it to fail in practice? At minimum, I think there needs to be 3 bytes padding after the string literal data section, malloc needs to be 4-byte or larger aligned (actually the case on most systems), and malloc needs to allocate 3 extra bytes. There are other criteria related to aliasing. This is all fine for compiler implementations, which create their own environments, but how frequently are these conditions met on modern hardware for user code?
The technique is valid, and you will not avoid it if you call your C library's strlen. If that library is, for instance, a recent version of the GNU C library (at least on certain targets), it does the same thing.
The key to make it work is to ensure that the pointer is aligned properly. If the pointer is aligned, the operation will read beyond the end of the string surely enough, but not into an adjacent page. If the null terminating byte is within one word of the end of a page, then that last word will be accessed without touching the subsequent page.
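As a minimal C sketch of the aligned word-at-a-time idea (`fast_strlen` is a hypothetical name, and the word loads are not strictly conforming C; the point is that once the pointer is word-aligned, a word load can never cross into the next page):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: reads past the terminator, but never past an
 * aligned word boundary, hence never into an adjacent page. */
size_t fast_strlen(const char *s)
{
    const char *p = s;

    /* Walk byte-by-byte until p is word-aligned, so the word
     * loads below never straddle a page boundary. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const uint64_t *w = (const uint64_t *)p;
    for (;;) {
        uint64_t v = *w;
        /* Exact "zero byte in word" test: the expression is
         * nonzero iff some byte of v is zero. */
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
            break;
        w++;
    }

    /* Locate the exact zero byte within the final word. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Production versions (like glibc's) do the same thing in assembly or with SIMD, but this is the shape of the trick.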
It certainly isn't well-defined behavior in C, and so it carries the burden of careful validation when ported from one compiler to another. It also triggers false positives from out-of-bounds access detectors like Valgrind.
Valgrind had to be patched to work around Glibc doing this. Without the patches, you get nuisance errors such as this:
==13669== Invalid read of size 8
==13669== at 0x411D6D7: __wcslen_sse2 (wcslen-sse2.S:59)
==13669== by 0x806923F: length_str (lib.c:2410)
==13669== by 0x807E61A: string_out_put_string (stream.c:997)
==13669== by 0x8075853: obj_pprint (lib.c:7103)
==13669== by 0x8084318: vformat (stream.c:2033)
==13669== by 0x8081599: format (stream.c:2100)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
==13669== Address 0x43bcaf8 is 56 bytes inside a block of size 60 alloc'd
==13669== at 0x402BE68: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==13669== by 0x8063C4F: chk_malloc (lib.c:1763)
==13669== by 0x806CD79: sub_str (lib.c:2653)
==13669== by 0x804A7E2: sysroot_helper (txr.c:233)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
Glibc is using SSE instructions to calculate wcslen eight bytes at a time (instead of four, the width of wchar_t). In doing so, it is accessing offset 56 in a block that is 60 bytes in size. However, note that this access could never straddle a page boundary: the address is divisible by 8.
If you're working in assembly language, you don't have to think twice about the technique.
In fact, the technique is used quite a bit in some optimized audio codecs that I work with (targeting ARM), which feature a lot of hand-written assembly language in the Neon instruction set.
I noticed it when running Valgrind on code which integrated these codecs, and contacted the vendor. They explained that it was just a harmless loop optimization technique; I went through the assembly language and convinced myself they were right.
(1) can definitely happen. There's nothing preventing you from taking the strlen of a string near the end of an allocated page, which could result in an access past the end of allocated memory and a nice big crash. As you note, this could be mitigated by padding all your allocations, but then you have to have all your libraries do the same. Worse, you have to arrange for the linker and OS to always add this padding (remember the OS passes argv[] in a static memory buffer somewhere). The overhead of doing this isn't worth it.
(2) also definitely happens. Earlier versions of ARM processors generate data aborts on unaligned accesses, which either cause your program to die with a bus error (or halt the CPU if you're running bare-metal), or force a very expensive trap through the kernel to handle the unaligned access. These earlier ARM chips are still in wide use in older cellphones and embedded devices. Later ARM processors synthesize multiple word accesses to deal with unaligned accesses, but this will result in overall slower performance since you basically double the number of memory loads you need to do.
Many current ("modern") PICs and embedded microprocessors lack the logic to handle unaligned accesses, and may behave unpredictably or even nonsensically when given unaligned addresses (I've personally seen chips that will just mask off the bottom bits, which would give incorrect answers, and others that will just give garbage results with unaligned accesses).
So, this is ridiculously dangerous to use in anything that should be remotely portable. Please, please, please do not use this code; use the libc strlen. It will usually be faster (optimized for your platform properly) and will make your code portable. The last thing you want is for your code to subtly and unexpectedly break in some situation (string near the end of an allocation) or on some new processor.
Donald Knuth, a person who wrote 3+ volumes on clever algorithms, said: "Premature optimization is the root of all evil".
strlen() is used a lot, so it really should be fast. Riffing on wildplasser's remark, "I would trust the library function", what makes you think that the library function works byte at a time? Or is slow?
The title may give folks the impression that the code you suggest is faster than the standard system library strlen(), but I think what you mean is that it is faster than a naive strlen() which probably doesn't get used, anyway.
I compiled a simple C program and looked at the strlen on my 64-bit system, which uses GNU's glibc implementation. The code I saw was pretty sophisticated and looks fast, working a register width at a time rather than a byte at a time. The code I saw for strlen() is written in assembly language, so there probably aren't junk instructions as you might get if this were compiled C code. What I saw was rtld-strlen.S. This code also unrolls loops to reduce the overhead of looping.
Before you think you can do better on strlen, you should look at that code, or the corresponding code for your particular architecture, and register size.
And if you do write your own strlen, benchmark it against the existing implementation.
And obviously, if you use the system strlen, then it is probably correct and you don't have to worry about invalid memory references as a result of an optimization in the code.
I agree it's a bletcherous technique, but I suspect it's likely to work most of the time. It's only a segfault if the string happens to be right up against the end of your data (or stack) segment. The vast majority of strings (whether statically or dynamically allocated) won't be.
But you're right, to guarantee it working you'd need some guarantee that all strings were padded somehow, and your list of shims looks about right.
If alignment is a problem, you could take care of that in the fast strlen implementation; you wouldn't have to run around trying to align all strings.
(But of course, if your problem is that you're spending too much time scanning strings, the right fix is not to desperately try to make string scanning faster, but to rig things up so that you don't have to scan so many strings in the first place...)

Why is malloc really non-deterministic? (Linux/Unix)

malloc is not guaranteed to return 0'ed memory. The conventional wisdom is not only that, but that the contents of the memory malloc returns are actually non-deterministic; e.g., OpenSSL used them for extra randomness.
However, as far as I know, malloc is built on top of brk/sbrk, which do "return" 0'ed memory. I can see why the contents of what malloc returns may be non-0, e.g. from previously free'd memory, but why would they be non-deterministic in "normal" single-threaded software?
Is the conventional wisdom really true (assuming the same binary and libraries)
If so, Why?
Edit Several people answered explaining why the memory can be non-0, which I already explained in the question above. What I'm asking is why the program using the contents of what malloc returns may be non-deterministic, i.e. why it could have different behavior every time it's run (assuming the same binary and libraries). Non-deterministic behavior is not implied by non-0's. To put it differently: why it could have different contents every time the binary is run.
Malloc does not guarantee unpredictability... it just doesn't guarantee predictability.
For example,
return 0;
is a valid implementation of malloc.
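Spelled out (a sketch; `my_malloc` and `my_free` are hypothetical names), such an allocator simply fails every request, which the C standard permits since callers must already handle allocation failure:

```c
#include <stddef.h>

/* A conforming (if useless) allocator: every request "fails". */
void *my_malloc(size_t size)
{
    (void)size;
    return NULL;   /* callers are required to handle NULL anyway */
}

void my_free(void *p)
{
    (void)p;       /* free(NULL) is specified to be a no-op */
}
```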
The initial values of memory returned by malloc are unspecified, which means that the specifications of the C and C++ languages put no restrictions on what values can be handed back. This makes the language easier to implement on a variety of platforms. While it might be true that in Linux malloc is implemented with brk and sbrk and the memory should be zeroed (I'm not even sure that this is necessarily true, by the way), on other platforms, perhaps an embedded platform, there's no reason that this would have to be the case. For example, an embedded device might not want to zero the memory, since doing so costs CPU cycles and thus power and time. Also, in the interest of efficiency, for example, the memory allocator could recycle blocks that had previously been freed without zeroing them out first. This means that even if the memory from the OS is initially zeroed out, the memory from malloc needn't be.
The conventional wisdom that the values are nondeterministic is probably a good one because it forces you to realize that any memory you get back might have garbage data in it that could crash your program. That said, you should not assume that the values are truly random. You should, however, realize that the values handed back are not magically going to be what you want. You are responsible for setting them up correctly. Assuming the values are truly random is a Really Bad Idea, since there is nothing at all to suggest that they would be.
If you want memory that is guaranteed to be zeroed out, use calloc instead.
Hope this helps!
malloc is defined on many systems that can be programmed in C/C++, including many non-UNIX systems and many systems that lack an operating system altogether. Requiring malloc to zero out the memory goes against C's philosophy of saving CPU cycles as much as possible.
The standard provides a zeroing call, calloc, that can be used if you need to zero out the memory. But in cases where you are planning to initialize the memory yourself as soon as you get it, the CPU cycles spent making sure the block is zeroed out are a waste; the C standard aims to avoid this waste as much as possible, often at the expense of predictability.
Memory returned by malloc is not zeroed (or rather, is not guaranteed to be zeroed) because it does not need to be. There is no security risk in reusing uninitialized memory pulled from your own process's address space or page pool. You already know it's there, and you already know the contents. There is also no issue with the contents in a practical sense, because you're going to overwrite them anyway.
Incidentally, the memory returned by malloc is zeroed upon first allocation, because an operating system kernel cannot afford the risk of giving one process data that another process owned previously. Therefore, when the OS faults in a new page, it only ever provides one that has been zeroed. However, this is totally unrelated to malloc.
(Slightly off-topic: the Debian security incident you mentioned had a few more implications than using uninitialized memory for randomness. A packager who was not familiar with the inner workings of the code, and did not know the precise implications, patched out a couple of places that Valgrind had reported, presumably with good intent but to disastrous effect. Among these was the "random from uninitialized memory", but it was by far not the most severe one.)
I think that the assumption that it is non-deterministic is plain wrong, particularly as you ask about a non-threaded context. (In a threaded context, scheduling vagaries could introduce some non-determinism.)
Just try it out. Create a sequential, deterministic application that
does a whole bunch of allocations
fills the memory with some pattern, eg fill it with the value of a counter
free every second of these allocations
newly allocate the same amount
run through these new allocations and register the value of the first byte in a file (as textual numbers one per line)
run this program twice and register the result in two different files. My idea is that these files will be identical.
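The steps above can be sketched like this (hypothetical `run_experiment`; it collects the first bytes into an array instead of a file, but printing them and diffing two runs is the same test):

```c
#include <stdlib.h>

enum { NBLOCKS = 1000, BLKSZ = 64 };

/* Sketch of the experiment: returns the number of samples written
 * to out[] (the first byte of each re-allocated block), or -1 on
 * allocation failure. Identical output across runs would support
 * the determinism claim. */
int run_experiment(unsigned char *out)
{
    char *blocks[NBLOCKS];
    int n = 0;

    for (int i = 0; i < NBLOCKS; i++) {       /* a whole bunch of allocations */
        blocks[i] = malloc(BLKSZ);
        if (!blocks[i]) return -1;
        for (int j = 0; j < BLKSZ; j++)
            blocks[i][j] = (char)i;           /* fill with a counter pattern */
    }

    for (int i = 0; i < NBLOCKS; i += 2) {    /* free every second allocation */
        free(blocks[i]);
        blocks[i] = NULL;
    }

    for (int i = 0; i < NBLOCKS; i += 2) {    /* newly allocate the same amount */
        blocks[i] = malloc(BLKSZ);
        if (!blocks[i]) return -1;
        out[n++] = (unsigned char)blocks[i][0];  /* record the first byte */
    }

    for (int i = 0; i < NBLOCKS; i++)
        free(blocks[i]);
    return n;
}
```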
Even in "normal" single-threaded programs, memory is freed and reallocated many times. Malloc will return to you memory that you had used before.
Even single-threaded code may do malloc then free then malloc and get back previously used, non-zero memory.
There is no guarantee that brk/sbrk return 0ed-out data; this is an implementation detail. It is generally a good idea for an OS to do that to reduce the possibility that sensitive information from one process finds its way into another process, but nothing in the specification says that it will be the case.
Also, the fact that malloc is implemented on top of brk/sbrk is also implementation-dependent, and can even vary based on the size of the allocation; for example, large allocations on Linux have traditionally used mmap on /dev/zero instead.
Basically, you can neither rely on malloc()ed regions containing garbage nor on it being all-0, and no program should assume one way or the other about it.
The simplest way I can think of putting the answer is like this:
If I am looking for wall space to paint a mural, I don't care whether it is white or covered with old graffiti, since I'm going to prime it and paint over it. I only care whether I have enough square footage to accommodate the picture, and I care that I'm not painting over an area that belongs to someone else.
That is how malloc thinks. Zeroing memory every time a process ends would be wasted computational effort. It would be like re-priming the wall every time you finish painting.
There is a whole ecosystem of programs living inside a computer's memory, and you cannot control the order in which mallocs and frees happen.
Imagine that the first time you run your application and malloc() something, it gives you an address with some garbage. Then your program shuts down, and your OS marks that area as free. Another program takes it with another malloc(), writes a lot of stuff, and then leaves. You run your program again; it might happen that malloc() gives you the same address, but now there's different garbage there that the previous program might have written.
I don't actually know the implementation of malloc() in any system, and I don't know whether it implements any kind of security measure (like randomizing the returned address), but I don't think so.
It is very deterministic.

Resources