Is there memset() that accepts integers larger than char? - c

Is there a version of memset() which sets a value that is larger than 1 byte (char)? For example, let's say we have a memset32() function, so using it we can do the following:
int32_t array[10];
memset32(array, 0xDEADBEEF, sizeof(array));
This will set the value 0xDEADBEEF in all the elements of array. Currently it seems to me this can only be done with a loop.
Specifically, I am interested in a 64-bit version of memset(). Do you know of anything like that?

#include <stdint.h>
#include <string.h>

void memset64( void * dest, uint64_t value, uintptr_t size )
{
    uintptr_t i;

    /* Fill whole 8-byte chunks; memcpy makes no alignment assumptions. */
    for( i = 0; i < (size & (~7)); i += 8 )
    {
        memcpy( ((char*)dest) + i, &value, 8 );
    }

    /* Handle any trailing bytes if size is not a multiple of 8. */
    for( ; i < size; i++ )
    {
        ((char*)dest)[i] = ((char*)&value)[i&7];
    }
}
(Explanation, as requested in the comments: when you assign to a pointer, the compiler assumes that the pointer is aligned to the type's natural alignment; for uint64_t, that is 8 bytes. memcpy() makes no such assumption. On some hardware unaligned accesses are impossible, so assignment is not a suitable solution unless you know unaligned accesses work on the hardware with little or no penalty, or know that they will never occur, or both. The compiler will replace small memcpy()s and memset()s with more suitable code, so it is not as horrible as it looks; but if you do know enough to guarantee that assignment will always work and your profiler tells you it is faster, you can replace the memcpy with an assignment. The second for() loop is present in case the amount of memory to be filled is not a multiple of 64 bits. If you know it always will be, you can simply drop that loop.)

There's no standard library function afaik. So if you're writing portable code, you're looking at a loop.
If you're writing non-portable code then check your compiler/platform documentation, but don't hold your breath because it's rare to get much help here. Maybe someone else will chip in with examples of platforms which do provide something.
The way you'd write your own depends on whether you can define in the API that the caller guarantees the dst pointer will be sufficiently aligned for 64-bit writes on your platform (or platforms if portable). On any platform that has a 64-bit integer type at all, malloc at least will return suitably-aligned pointers.
If you have to cope with non-alignment, then you need something like moonshadow's answer. The compiler may inline/unroll that memcpy with a size of 8 (and use 32- or 64-bit unaligned write ops if they exist), so the code should be pretty nippy, but my guess is it probably won't special-case the whole function for the destination being aligned. I'd love to be corrected, but fear I won't be.
So if you know that the caller will always give you a dst with sufficient alignment for your architecture, and a length which is a multiple of 8 bytes, then do a simple loop writing a uint64_t (or whatever the 64-bit int is in your compiler) and you'll probably (no promises) end up with faster code. You'll certainly have shorter code.
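A minimal sketch of that aligned, multiple-of-8 case might look like this (memset64_aligned is my own name, not a standard function; it assumes dest is suitably aligned and size is a byte count that is a multiple of 8):
#include <stdint.h>
#include <stddef.h>

void memset64_aligned( void *dest, uint64_t value, size_t size )
{
    uint64_t *p = dest;
    size_t n = size / 8;              /* number of 64-bit elements */
    for( size_t i = 0; i < n; i++ )
        p[i] = value;                 /* plain assignment: caller guarantees alignment */
}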
Whatever the case, if you do care about performance then profile it. If it's not fast enough try again with more optimisation. If it's still not fast enough, ask a question about an asm version for the CPU(s) on which it's not fast enough. memcpy/memset can get massive performance increases from per-platform optimisation.

Just for the record, the following uses memcpy(..) in a doubling pattern. Suppose we want to fill an array with 20 integers:
--------------------
First copy one:
N-------------------
Then copy it to the neighbour:
NN------------------
Then copy them to make four:
NNNN----------------
And so on:
NNNNNNNN------------
NNNNNNNNNNNNNNNN----
Then copy enough to fill the array:
NNNNNNNNNNNNNNNNNNNN
This takes O(lg(num)) applications of memcpy(..).
#include <string.h>

int *memset_int(int *ptr, int value, size_t num) {
    if (num < 1) return ptr;
    memcpy(ptr, &value, sizeof(int));                           /* seed the first element */
    size_t start = 1, step = 1;
    for ( ; start + step <= num; start += step, step *= 2)
        memcpy(ptr + start, ptr, sizeof(int) * step);           /* double the filled region */
    if (start < num)
        memcpy(ptr + start, ptr, sizeof(int) * (num - start));  /* copy whatever is left */
    return ptr;
}
I thought it might be faster than a loop if memcpy(..) was optimised to use some hardware block-memory-copy functionality, but it turns out that a simple loop is faster than the above with -O2 and -O3. (At least using MinGW GCC on Windows with my particular hardware.) Without the -O switch, on a 400 MB array the code above is about twice as fast as an equivalent loop and takes 417 ms on my machine, while with optimisation they both go to about 300 ms. That works out to roughly a byte per nanosecond, and a clock cycle is about a nanosecond. So either there is no hardware block-memory-copy functionality on my machine, or the memcpy(..) implementation does not take advantage of it.

Check your OS documentation for a local version, then consider just using the loop.
The compiler probably knows more about optimizing memory access on any particular architecture than you do, so let it do the work.
Wrap it up as a library and compile it with all the speed improving optimizations the compiler allows.

wmemset(3) is the wide-character (wchar_t) version of memset; note that wchar_t is 16 bits on some platforms (e.g. Windows) and 32 bits on others (e.g. Linux/glibc). I think that's the closest you're going to get in C, without a loop.
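For example (a minimal sketch; note the count is in wchar_t elements, not bytes):
#include <wchar.h>

wchar_t wbuf[10];
wmemset(wbuf, L'x', 10);   /* fills 10 wchar_t elements */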

If you're just targeting an x86 compiler you could try something like (VC++ example):
inline void memset32(void *buf, uint32_t n, int32_t c)
{
    __asm {
        mov ecx, n
        mov eax, c
        mov edi, buf
        rep stosd
    }
}
Otherwise just make a simple loop and trust the optimizer to know what it's doing, something like:
for (uint32_t i = 0; i < n; i++)
{
    ((int32_t *)buf)[i] = c;
}
If you make it complicated, chances are it will end up slower than simpler, easier-to-optimize code, not to mention harder to maintain.

You should really let the compiler optimize this for you as someone else suggested. In most cases that loop will be negligible.
But if this some special situation and you don't mind being platform specific, and really need to get rid of the loop, you can do this in an assembly block.
//pseudo code
asm
{
rep stosq ...
}
You can probably google the stosq assembly instruction for the specifics. It shouldn't be more than a few lines of code.
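For instance, a minimal GCC/Clang extended-asm sketch (x86-64 only; the function name and the choice to count in 64-bit elements rather than bytes are my own, not anything standard):
#include <stddef.h>
#include <stdint.h>

/* Fill count 64-bit words at dest with value using rep stosq. */
static inline void memset64_stosq(uint64_t *dest, uint64_t value, size_t count)
{
    __asm__ volatile ("rep stosq"
                      : "+D" (dest), "+c" (count)   /* RDI and RCX are consumed */
                      : "a" (value)                 /* RAX supplies the value   */
                      : "memory");
}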

Write your own; it's trivial even in asm.

Related

String length function is unstable

So I made this strlen a while ago and everything seemed fine. But I started noticing bugs in my codebase, and after a while I tracked them down to this strlen function. I used SIMD instructions to write it, and I am new to writing intrinsics, so the code probably isn't the best it could be either.
Here is the function:
inline size_t strlen(const char* data) {
    const __m256i terminationCharacters = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)&data) & 31;
    const __m256i* pointer = (const __m256i*) (data - shiftAmount);
    size_t length = 0;
    for (;; length += 32, ++pointer) {
        const __m256i comparingData = _mm256_load_si256(pointer);
        const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);
        if (!_mm256_testc_si256(terminationCharacters, comparison)) {
            const auto mask = _mm256_movemask_epi8(comparison);
            return length + _tzcnt_u32(mask >> shiftAmount);
        }
    }
}
Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:
You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in comments.) You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256, just movemask and check.)
_tzcnt_u32(mask >> shiftAmount); is buggy for any vectors after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of its bits. Instead, you want _tzcnt_u32(mask) - shiftAmount, I think.
Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector. Also make test cases with the final 0 in different places relative to a vector boundary, with non-zero bytes before it, and test your version against libc strlen. (Maybe even for some randomized 0-positions within the first 32 bytes, and then within the first 64 bytes after that.)
Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).
Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.
(You also don't really want the compiler to be wasting instructions inside the loop incrementing a pointer and a separate length, so check the resulting asm. A pointer subtraction after the loop should do the trick; you can even cast to uintptr_t.
Also, you can subtract the final zero-position from the initial function arg, instead of from the aligned pointer, so instead of subtracting shiftAmount twice, you're just not using it at all except for the initial alignment.)
Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop when you should be checking all the bytes; it's not better for _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction. But vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is another one macro-fused uop. And even better, you actually need the integer movemask result after the loop, so you already have it.
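To make that concrete, here is a minimal sketch of the separate-startup strategy (my own illustration, not the asker's code and not glibc's; it assumes AVX2 + BMI1 and relies on an aligned 32-byte load never crossing into another page):
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static inline size_t strlen_avx2_sketch(const char *data) {
    const __m256i zero = _mm256_setzero_si256();
    uintptr_t misalign = (uintptr_t)data & 31;
    const __m256i *p = (const __m256i *)(data - misalign);

    /* First, possibly partial, vector: shift out matches from before the string. */
    __m256i v = _mm256_load_si256(p);
    uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
    mask >>= misalign;
    if (mask)
        return (size_t)_tzcnt_u32(mask);

    /* Main loop: every load is aligned and lies entirely past the start. */
    for (;;) {
        ++p;
        v = _mm256_load_si256(p);
        mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)((const char *)p - data) + _tzcnt_u32(mask);
    }
}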
Related
Is it safe to read past the end of a buffer within the same page on x86 and x64?
Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, normally you should use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned-startup for strlen is somewhat tricky, and glibc I think makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache-line of data). It might be too aggressive in ramping up to that for medium-length strings though.
Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.
https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)
I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.
No offense, but
size_t strlen(const char *p)
{
    size_t ret_val = 0;
    while (*p++) ret_val++;
    return ret_val;
}
has done its job very well since long ago. Also, today's optimizing compilers generate very tight code for it, and your code is impossible to read.

Why it's not recommended to use pointers for array access in C

I'm learning C programming and I came across this tutorial online, which states that you should always prefer using the [] operator over pointer arithmetic as much as possible.
https://www.cs.swarthmore.edu/~newhall/unixhelp/C_arrays.html#dynamic
you can use pointer arithmetic (but in general don't)
consider the following code in C
int *p_array;
p_array = (int *)malloc(sizeof(int)*50);
for (i = 0; i < 50; i++) {
    p_array[i] = 0;
}
What is the difference when doing it using pointer arithmetic, like the following code (and why is it not recommended)?
int *p_array;
p_array = (int *)malloc(sizeof(int)*50); // allocate 50 ints
int *dptr = p_array;
for (i = 0; i < 50; i++) {
    *dptr = 0;
    dptr++;
}
What are the cases where using pointer arithmetic can cause issues in the software? Is it bad practice, or is it just that an inexperienced engineer might get it wrong when not paying attention?
Since there seems to be all-out confusion on this:
In the old days, we had 16-bit CPUs: think 8088, 286, etc.
To formulate an address you had to load your segment register (a 16-bit register) and your address register. If accessing an array, you could load your array base into the segment register, and the address register would be the index.
C compilers for these platforms did exist, but pointer arithmetic involved checking the address for overruns and bumping the segment register if necessary (inefficient). Flat-addressed pointers simply weren't possible in hardware.
Fast forward to the 80386. Now we have a full 32-bit space, so hardware flat pointers are possible. Index + base addressing incurs a 1-clock-cycle penalty. The segments are also 32-bit, though, so arrays can be addressed through segments, avoiding this penalty even if you are running in 32-bit mode. The 386 also increases the number of segment registers by 2. (No idea why Intel thought that was a good idea.) There was still a lot of 16-bit code around, though.
These days, segment registers are essentially disabled in 64-bit mode, and base+index addressing is free.
Is there any platform where a flat pointer can outperform array addressing in hardware? Well, yes: the Motorola 68000, released in 1979, has a flat 32-bit address space, no segments, and its base + index addressing mode incurs an 8-clock-cycle penalty over immediate addressing. So if you're programming an early-80s Sun workstation, an Apple Lisa, etc., this might be relevant.
In short: if you want an array, use an array. If you want a pointer, use a pointer. Don't try to outsmart your compiler. Convoluted code to turn arrays into pointers is exceedingly unlikely to provide any benefit, and may be slower.
This code is not recommended:
int *p_array;
p_array = (int *)malloc(sizeof(int)*50); // allocate 50 ints
int *dptr = p_array;
for (i = 0; i < 50; i++) {
    *dptr = 0;
    dptr++;
}
because 1) for no reason you have two different pointers that point to the same place, 2) you don't check the result of malloc() -- it's known to return NULL occasionally, 3) the code is not easy to read, and 4) it's easy to make a silly mistake that's very hard to spot later on.
All in all, I'd recommend to use this instead:
int array[50] = { 0 }; // make sure it's zero-initialized
int* p_array = array; // if you must =)
In your example, without compiler optimizations, pointer arithmetic may be more efficient, because it is easier to just increment a pointer than to calculate a new offset in every single loop iteration. However, most modern CPUs are optimized in such a way that accessing memory with an offset does not incur a (significant) performance penalty.
Even if you happen to be programming on a platform in which pointer arithmetic is faster, then it is likely that, if you activate compiler optimizations ("-O3" on most compilers), the compiler will use whatever method is fastest.
Therefore, it is mostly a matter of personal preference whether you use pointer arithmetic or not.
Code using array indexing instead of pointer arithmetic is generally easier to understand and less prone to errors.
Another advantage of not using pointer arithmetic is that pointer aliasing may be less of an issue (because you are using fewer pointers). That way, the compiler may have more freedom in optimizing your code (making your code faster).
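Putting the advice above together, a minimal sketch (the function name is my own) might be:
#include <stdlib.h>

int *make_zeroed_ints(size_t count)
{
    int *p_array = malloc(sizeof *p_array * count);
    if (p_array == NULL)
        return NULL;                     /* malloc can fail */
    for (size_t i = 0; i < count; i++)
        p_array[i] = 0;                  /* index form rather than *dptr++ */
    return p_array;
}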

Can I cast pointers like this?

Code:
unsigned char array_add[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
...
if ((*((uint32_t*)array_add)!=0)||(*((uint32_t*)array_add+1)!=0))
{
...
}
I want to check if the array is all zero. So naturally I thought of casting the address of the array, which also happens to be the address of its first member, to an unsigned 32-bit int type, so I'll only need to do this twice, since it's a 64-bit, 8-byte array. The problem is, it compiled successfully, but the program crashes every time around here.
I'm running my program on an 8bit microcontroller, cortex-M0.
How wrong am I?
In theory this could work but in practice there is a thing you aren't considering: aligned memory accesses.
If a uint32_t requires aligned memory access (eg to 4 bytes), then casting an array of unsigned char which has 1 byte alignment requirement to an uint32_t* produces a pointer to an unaligned array of uint32_t.
According to documentation:
There is no support for unaligned accesses on the Cortex-M0 processor. Any attempt to perform an unaligned memory access operation results in a HardFault exception.
In practice this is just dangerous and fragile code which invokes undefined behavior in certain circumstances, as pointed out by Olaf and better explained here.
To test multiple bytes at once, code could use memcmp().
How speedy this is depends more on the compiler, as an optimizing compiler may simply emit code that does a quick 8-byte (or two 4-byte) compare. Even the memcmp() might not be too slow on an 8-bit processor. Profiling the code helps.
Take care with micro-optimizations, as they are too often not an efficient use of a coder's time compared to more significant optimizations.
unsigned char array_add[8] = ...
const unsigned char array_zero[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
if (memcmp(array_zero, array_add, 8) == 0) ...
Another method uses a union. Be careful not to assume whether add.array8[0] is the most or least significant byte.
union {
    uint8_t array8[8];
    uint64_t array64;
} add;
// the test below checks whether all 8 bytes of add.array8[] are zero
if (add.array64 == 0)
In general, focus on writing clear code and reserve such small optimizations to very select cases.
I am not sure, but if your array has 8 bytes, then just read it as a long long and compare that to 0. That should solve your problem of checking whether the array is all 0.
Edit 1: After Olaf's comment I would say: replace long long with int64_t. However, why not use a simple loop to iterate over the array and check? 8 chars is all you need to compare.
Edit 2: The other approach could be to OR all elements of the array and then compare the result with 0. If all are 0, then the OR will be zero. I do not know whether CMP or OR will be faster; please refer to the Cortex-M0 docs for exact CPU cycle requirements, but I would expect CMP to be slower.
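A minimal sketch of that OR-accumulate idea (the helper name is mine; it uses no casts, so there are no alignment concerns on the Cortex-M0):
#include <stddef.h>

static int all_zero(const unsigned char *p, size_t n)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= p[i];        /* any nonzero byte makes acc nonzero */
    return acc == 0;
}

/* usage: if (all_zero(array_add, 8)) { ... } */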

How do I force the program to use unaligned addresses?

I've heard reads and writes of aligned ints are atomic and safe; I wonder when the system makes non-malloc'd globals unaligned, other than packed structures and casting/pointer-arithmetic byte buffers?
[x86-64 Linux] In all of my normal cases, the system always chooses integer locations that don't get word-torn, for example two bytes on one word and the other two bytes on the other word. Can anyone post a program/snippet (C or assembly) that forces a global variable to an unaligned address, such that the integer gets torn and the system has to use two reads to load one integer value?
When I print with the program below, the addresses are close to each other, such that multiple variables sit within 64 bits, but never once is word tearing seen (smartness in the system or the compiler?)
#include <stdio.h>

int a;
char b;
char c;
int d;
int e = 0;

int isaligned(void *p, int N)
{
    if (((int)p % N) == 0)
        return 1;
    else
        return 0;
}

int main()
{
    printf("processor is %d byte mode \n", sizeof(int *));
    printf("a=%p/b=%p/c=%p/d=%p/f=%p\n", &a, &b, &c, &d, &e);
    printf(" check for 64bit alignment of test result of 0x80 = %d \n", isaligned( 0x80, 64 ));
    printf(" check for 64bit alignment of a result = %d \n", isaligned( &a, 64 ));
    printf(" check for 64bit alignment of d result = %d \n", isaligned( &e, 64 ));
    return 0;
}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/f=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case ? Does it read from 8 byte aligned boundary (in my case 0x601030 ) and then go to 0x60103c ?
Memory access granularity is always word size isn't it ?
Thx.
1) Yes, there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes - for example if you cross over a memory page boundary [I'm not talking about 4 KB virtual-memory pages, I'm talking about DDR2/3/4 pages, which are some fraction of the total memory size, typically 16 Kbits times whatever the width of the actual memory chip is - which will vary depending on the memory stick itself]. Equally, on processors other than x86, you may get a trap for reading unaligned memory, which would either cause the program to abort, or the read to be emulated in software as multiple reads to "fix" the unaligned read.
2) You could always make an unaligned memory region by something like this:
char *ptr = malloc(sizeof(long long) * (number + 1));   /* one element of slack */
long long *unaligned = (long long *)&ptr[2];            /* 2-byte offset: misaligned for long long */
for (i = 0; i < number; i++)
    temp = unaligned[i];
By the way, your alignment check checks if the address is aligned to 64 bytes, not 64 bits. You'll have to divide by 8 to check that it's aligned to 64 bits.
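For example, reusing the question's isaligned() helper, an 8-byte (64-bit) check would be:
isaligned(&a, 8)   /* 8 bytes == 64 bits */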
3) A char is a single byte read, and the address will be on the actual address of the byte itself. The actual memory read performed is probably for a full cache-line, starting at the target address, and then cycling around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting at the 64-bit word we want: 0x601038 (and as soon as that's completed, the processor goes on to the next instruction - meanwhile the remaining reads are performed to fill the cache line), then the cache line is filled from 0x601020, 0x601028, 0x601030. But should we turn the cache off [if you want your 3 GHz latest x86 processor to be slightly slower than a 66 MHz 486, disabling the cache is a good way to achieve that], the processor would just read one byte at 0x60103d.
4) Not on x86 processors; they have byte addressing - but for normal memory, reads are done on a cache-line basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic" - so you'll probably have a hard time making it go wrong on purpose. You really need to get the timing of two different threads just right, and straddle cache lines, straddle memory page boundaries, and so on to make it go wrong - this will happen when you don't want it to happen, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct __attribute__ ((packed)) unaligned
{
    char buffer[2];
    int unaligned;
    char buffer2[2];
} sample_struct;
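A quick way to see the effect (my own check, not part of the original answer) is to print the packed member's offset:
#include <stddef.h>
#include <stdio.h>

printf("offset of the int member: %zu\n",
       offsetof(struct unaligned, unaligned));   /* 2 with packing, so misaligned */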
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standards guarantee that malloc(3) returns a memory area that complies with the strictest alignment requirements, so this just can't happen in that case. If there are unaligned data, they are probably read/written in pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed, on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed the compiler must take the pieces and splice them together, and that is even much slower.
Characters (really bytes) are normally allowed to have any byte address. The instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory access is in terms of cache lines (read up on CPU cache for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2*sizeof(int)+1];
int *p1 = (int *)buf, *p2 = (int *)(buf+sizeof(int)+1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.

How to use 32-bit pointers in 64-bit application?

Our school's project only allows us to compile the C program into a 64-bit application, and they test our program for speed and memory usage. However, if I were able to use 32-bit pointers, then my program would consume much less memory than in 64-bit mode, and maybe it would also run faster (faster malloc?).
I am wondering if I can use 32-bit pointers in 64-bit applications?
Thanks for the help
Using GCC?
The -mx32 option sets int, long, and pointer types to 32 bits, and generates code for the x86-64 architecture. (Intel 386 and AMD x86-64 Options):
i386-and-x86_64-Options
Other targets, GCC
Then benchmark :)
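For example (assuming your toolchain and distribution actually ship the x32 runtime libraries, which not all do):
gcc -mx32 -O2 program.c -o program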
You could "roll you own". The following may reduce memory usage -- marginally -- but it may not improve speed as you'd have to translate your short pointer to absolute pointer, and that adds overhead, also you lose most of the benefits of typechecking.
It would look something like this:
typedef unsigned short ptr;
...
// pre-allocate all memory you'd ever need
char* offset = malloc(256); // make sure this size fits in the ptr type (USHRT_MAX here)
// these "pointers" are 16-bit short integers, assuming sizeof(int) == 4
ptr var1 = 0, var2 = 4, var3 = 8;
// how to read and write through those "pointers"; you can hide these with macros
*((int*) &offset[var1]) = ((int) 1) << 16;
printf("%i", *((int*) &offset[var1]));
With a few more tricks, you can invent your own brk() to help allocate memory from the offset base.
Is it worth it? IMO no.
