Write big blocks to a file with fwrite() (e.g. 1000000000) - C

I am attempting to write blocks with fwrite(). So far the largest block I could write was 100000000 bytes (the limit is probably a bit higher than that; I did not try). I cannot write a block of size 1000000000: the output file is 0 bytes.
Is there any possibility to write blocks of e.g. 1000000000 bytes and greater?
I am using uint64_t to store these large numbers.
Thank you in advance!
Code from pastebin in comment:
char *pEnd;
uint64_t uintBlockSize = strtoull(chBlockSize, &pEnd, 10);
uint64_t uintBlockCount = strtoull(chBlockCount, &pEnd, 10);
char *content = (char *) malloc(uintBlockSize * uintBlockCount);
/*
  Create vfs.structure
*/
FILE *storeFile;
storeFile = fopen(chStoreFile, "w");
if (storeFile != NULL)
{
    uint64_t i = uintBlockCount;
    size_t check;
    /*
      Fill storeFile with empty blocks
    */
    while (i != 0)
    {
        fwrite(content, uintBlockSize, 1, storeFile);
        i--;
    }

You're assuming that the type your C library uses to represent the size of objects and index memory (size_t) can hold the same range of values as uint64_t. This may not be the case!
fwrite's man page indicates that the block size you can write is limited by the size_t type. If you're on a 32-bit system, the block size value passed to fwrite will be converted from uint64_t to whatever the library's size_t is (uint32_t, for example, in which case a very large value will have its most significant bits discarded).

I have had fwrite fail with a block >64MB when compiled with gcc 4.1.2 on CentOS 5.3; I had to chop it up into smaller pieces. I also had fread() fail for >64MB blocks on the same setup.
This seems to have been fixed in later Linux environments, e.g. Ubuntu 12.04.

Related

Vectorize random init and print for BigInt with decimal digit array, with AVX2?

How could I port my code to AVX2 and get the same result as before?
Is it possible to use __m256i in the LongNumInit and LongNumPrint functions instead of uint8_t *L, or some similar type of variable?
My knowledge of AVX is quite limited; I investigated quite a bit, but I do not understand very well how to transform my code. Any suggestion or explanation is welcome.
I'm really interested in getting this code working with AVX2.
void LongNumInit(uint8_t *L, size_t N)
{
    for (size_t i = 0; i < N; ++i) {
        L[i] = myRandom() % 10;
    }
}

void LongNumPrint(uint8_t *L, size_t N, uint8_t *Name)
{
    printf("%s:", Name);
    for (size_t i = N; i > 0; --i)
    {
        printf("%d", L[i-1]);
    }
    printf("\n");
}

int main(int argc, char **argv)
{
    int i, sum1, sum2, sum3, N = 10000, Rep = 50;
    seed = 12345;
    // obtain parameters at run time
    if (argc > 1) { N = atoi(argv[1]); }
    if (argc > 2) { Rep = atoi(argv[2]); }
    // Create long numbers
    unsigned char *V1 = (unsigned char *) malloc(N);
    unsigned char *V2 = (unsigned char *) malloc(N);
    unsigned char *V3 = (unsigned char *) malloc(N);
    unsigned char *V4 = (unsigned char *) malloc(N);
    LongNumInit(V1, N); LongNumInit(V2, N); LongNumInit(V3, N);
    // Print last 32 digits of the long numbers
    LongNumPrint(V1, 32, "V1");
    LongNumPrint(V2, 32, "V2");
    LongNumPrint(V3, 32, "V3");
    LongNumPrint(V4, 32, "V4");
    free(V1); free(V2); free(V3); free(V4);
    return 0;
}
The result that I obtain in my initial code is this:
V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000
This is a terrible format for BigInteger in general; see https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could/should do instead.
And see Can long integer routines benefit from SSE? for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.
But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.
We can vectorize both of these quite well. My LongNumPrintName() is a drop-in replacement for yours.
For LongNumInit I'm just showing a building block that stores two 32-byte chunks and returns an incremented pointer. Call it in a loop. (It naturally produces 2 vectors per call, so for small N you might want an alternate version.)
LongNumInit
What's the fastest way to generate a 1 GB text file containing random digits? generates space-separated random ASCII decimal digits at about 33 GB/s on 4GHz Skylake, including overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)
It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i) like AVX/SSE version of xorshift128+. That should be similar quality of randomness to your rand() % 10.
It breaks that up into decimal digits via a multiplicative inverse to divide and modulo by 10 with shifts and vpmulhuw, using Why does GCC use multiplication by a strange number in implementing integer division?. (Actually using GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts for convenient syntax like v16u dig1 = v % ten; and v /= ten;)
You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant, see below)
Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics).
See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.
#include <immintrin.h>

// GNU C native vectors let us get the compiler to do stuff like %10 on each element
typedef unsigned short v16u __attribute__((vector_size(32)));

// Returns p + size of stores.  Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digits(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);
    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;  // Basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.
    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    // dig4 would overlap much of the randomness that div6554 gets

    // __m256i assignment is an aligned store
    __m256i *vecbuf = (__m256i*)p;
    // pack 16->8 bits
    vecbuf[0] = _mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
    vecbuf[1] = _mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
    return p + 2;  // always a constant number of full vectors
}
The logic in random_decimal_fill_buffer that inserts newlines can be totally removed because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.
Handling small sizes (less than a full vector):
It would be convenient to pad your malloc up to the next multiple of 32 bytes so it's always safe to do a 32-byte load without checking for maybe crossing into an unmapped page.
And use C11 aligned_alloc to get 32-byte aligned storage. So for example, aligned_alloc(32, (size+31) & -32). This lets us just do full 32-byte stores even if N is odd. Logically only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over to avoid any extra conditional checks for N being less than 32, or not a multiple of 32.
Unfortunately ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those: Why is there no 'aligned_realloc' on most platforms?, allowing you to sometimes allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++, which might need to run copy constructors if non-trivially-copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.
LongNumPrint
Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.
Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.
Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.
You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially-huge buffer. You probably still want to go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass that on to write() directly instead of doing a memcpy into the stdio buffer. You might want to play around with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache will mean it's still hot in cache when it's re-read after you wrote it.
Cache blocking makes the loop bounds a lot more complex but it's worth it for very large N.
Byte-reversing 32 bytes at a time:
You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition it would have to loop from the end backwards.
Your function could be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.
Or better, use vmovdqu / vinserti128 16-byte loads to feed _mm256_shuffle_epi8 to byte-reverse within lanes, setting up for 32-byte stores.
On Intel CPUs, vinserti128 decodes to a load + ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if data was hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or in Ice Lake, 2 loads + 2 stores). We'll probably bottleneck on the front end if there are no memory bottlenecks, so in practice we're not saturating load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)
This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.
#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
    __m128i hi = _mm_loadu_si128(1 + (const __m128i*)src);  // unaligned loads
    __m128i lo = _mm_loadu_si128((const __m128i*)src);
    __m256i v  = _mm256_set_m128i(lo, hi);                  // reverse 128-bit hi/lo halves

    // compilers will hoist constants out of inline functions
    __m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m256i byterev = _mm256_broadcastsi128_si256(byterev_lane);  // same in each lane

    v = _mm256_shuffle_epi8(v, byterev);            // in-lane reverse
    v = _mm256_or_si256(v, _mm256_set1_epi8('0'));  // digits to ASCII
    _mm256_storeu_si256((__m256i*)dst, v);          // Will usually be aligned in practice.
}
// Tested for N=32; could be bugs in the loop bounds for other N
// returns bytes written, like fwrite: N means no error, 0 means error in all fwrites
size_t LongNumPrint(uint8_t *num, size_t N)
{
    // caller can print a name if it wants
    const int revbufsize = 8192;  // 8kiB on the stack should be fine
    alignas(32) char revbuf[revbufsize];

    if (N < 32) {
        // TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
        ASCIIrev32B(revbuf, num);  // the data we want is at the *end* of a 32-byte reverse
        return fwrite(revbuf + 32 - N, 1, N, stdout);
    }

    size_t bytes_written = 0;
    const uint8_t *inp = num + N;  // start with last 32 bytes of num[]
    do {
        size_t chunksize = (inp - num >= revbufsize) ? revbufsize : inp - num;
        const uint8_t *inp_stop = inp - chunksize + 32;  // leave one full vector for the end
        uint8_t *outp = (uint8_t *)revbuf;
        while (inp > inp_stop) {  // may run 0 times
            inp -= 32;
            ASCIIrev32B(outp, inp);
            outp += 32;
        }
        // reverse first (lowest address) 32 bytes of this chunk of num
        // into last 32 bytes of this chunk of revbuf
        // if chunksize%32 != 0 this will overlap, which is fine.
        ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
        bytes_written += fwrite(revbuf, 1, chunksize, stdout);
        inp = inp_stop - 32;
    } while (inp > num);

    return bytes_written;
    // caller can putchar('\n') if it wants
}
// wrapper that prints name and newline
void LongNumPrintName(uint8_t *num, size_t N, const char *name)
{
    printf("%s:", name);
    //LongNumPrint_scalar(num, N);
    LongNumPrint(num, N);
    putchar('\n');
}

// main() included on Godbolt link that runs successfully
// main() included on Godbolt link that runs successfully
This compiles and runs (on Godbolt) with gcc -O3 -march=haswell and produces identical output to your scalar loop for the N=32 that main passes. (I used rand() instead of MyRandom(), so we could test with the same seed and get the same numbers, using your init function.)
Untested for larger N, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid.
We could load (not just store) aligned vectors if we converted the first N%32 bytes and passed that to fwrite before starting the main loop. But that probably either leads to an extra write() system call, or to clunky copying inside stdio. (Unless there was already buffered text not printed yet, like Name:, in which case we already have that penalty.)
Note that it's technically C UB to decrement inp past the start of num. So inp -= 32 instead of inp = inp_stop - 32 would have that UB for the iteration that leaves the outer loop. I actually avoid that in this version, but it generally works anyway because I think GCC assumes a flat memory model and de facto defines the behaviour of pointer compares enough. And normal OSes reserve the zero page, so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address). This paragraph is mostly left over from the first totally untested attempt that I thought was decrementing the pointer farther in the inner loop than it actually was.

C Programming - Size of 2U and 1024U

I know that the U suffix in C means that the value is an unsigned integer, and an unsigned integer's size is 4 bytes.
But how big are 2U or 1024U? Does this simply mean 2 * 4 bytes = 8 bytes, for example, or does the notation just mean that 2 (or 1024) are unsigned integers?
My goal is to figure out how much memory will be allocated if I call malloc like this:
int *allocated_mem = malloc(2U * 1024U);
and to verify my answer in a short program. What I tried:
printf("Size of 2U: %zu\n", sizeof(2U));
printf("Size of 1024U: %zu\n", sizeof(1024U));
I would have expected for the first line a size of 2 * 4 bytes = 8 and for the second 1024 * 4 bytes = 4096, but the output is always "4".
I would really appreciate knowing what 2U and 1024U mean exactly and how I can check their size in C.
My goal is to figure out how much memory will be allocated if I call malloc like this: int *allocated_mem = malloc(2U * 1024U);
What is difficult about 2 * 1024 == 2048? The fact that they are unsigned constants does not change their value.
An unsigned integer's size is 4 bytes.
On your platform, yes. So 2U takes up 4 bytes, and 1024U takes up 4 bytes, because they are both unsigned integers.
I would have expected for the first line a size of 2 * 4 bytes = 8 and for the second 1024 * 4 bytes = 4096, but the output is always "4".
Why would the value change the size? The size depends only on the type. 2U is of type unsigned int, so it takes up 4 bytes; same as 50U, same as 1024U. They all take 4 bytes.
You are trying to multiply the value (2) by the size of the type. That makes no sense.
How big?
2U and 1024U are the same size: the size of an unsigned int, commonly 32 bits or 4 "bytes". The size of a type is the same throughout a given platform; it does not change with the value.
"I know that the U literal means in C that the value is an unsigned integer." --> OK, close enough so far.
"An unsigned integer's size is 4 bytes." Reasonable guess, yet C only requires that unsigned int be at least 16 bits. Further, the U makes the constant unsigned, yet that could be unsigned, unsigned long, or unsigned long long, depending on the value and platform.
Detail: in C, 2U is not a literal but a constant. C has string literals and compound literals; literals can have their address taken, but &2U is not valid C. Other languages call 2U a literal and have their own rules on how it can be used.
My goal is to figure out how much memory will be allocated if I call malloc like this: int *allocated_mem = malloc(2U * 1024U);
Instead, it is better to use size_t than unsigned for sizes, and to check the allocation:
size_t sz = 2U * 1024U;
int *allocated_mem = malloc(sz);
if (allocated_mem != NULL) {
    printf("Allocation size %zu\n", sz);
}
(Aside) Be careful with computed sizes. Do your size math using size_t types. 4U * 1024U * 1024U * 1024U can overflow unsigned math, yet may compute as desired with size_t:
size_t sz = (size_t)4 * 1024 * 1024 * 1024;
The following attempts print the size of the constants, which is likely 32 bits or 4 "bytes", not their values:
printf("Size of 1024U: %zu\n", sizeof(1024U));
printf("Size of 2U: %zu\n", sizeof(2U));

fseek - fails skipping a large amount of bytes?

I'm trying to skip a large amount of bytes before using fread to read the next bytes.
When size is small, #define size 6404168, it works:
long int x = ((long int)size)*sizeof(int);
fseek(fincache, x, SEEK_CUR);
When size is huge, #define size 649218227, it doesn't :( The next fread reads garbage; I can't really tell which offset it is reading from.
Using fread instead as a workaround works in both cases, but it's really slow:
temp = (int *) calloc(size, sizeof(int));
fread(temp,1, size*sizeof(int), fincache);
free(temp);
Assuming sizeof(int) is 4 and you are on a 32-bit system (where sizeof(long) is 4):
649218227 * 4 would overflow what a long can hold, and signed integer overflow is undefined behaviour. That is why it works for smaller values (less than LONG_MAX).
You can use a loop instead to fseek() the necessary bytes:
intmax_t len = (intmax_t)size * sizeof(int);
while (len > 0) {
    long x = (long)(len > LONG_MAX ? LONG_MAX : len);
    fseek(fincache, x, SEEK_CUR);
    len -= x;
}
The offset argument of fseek is required to be a long, not a long long. So x must fit into a long; otherwise, don't use fseek.
Since your platform's int is most likely 32-bit, multiplying 649,218,227 by sizeof(int) results in a number that exceeds INT_MAX and LONG_MAX, which are both 2**31 - 1 on 32-bit platforms. Since fseek accepts a long int, the resulting overflow causes your program to read garbage.
You should consult your compiler's documentation to find out if it provides an extension for 64-bit seeking. On POSIX systems, for example, you can use fseeko, which accepts an offset of type off_t (on 32-bit glibc, compile with -D_FILE_OFFSET_BITS=64 so that off_t is 64 bits).
Be careful not to introduce overflow before even calling the 64-bit seeking function. Careful code could look like this:
off_t offset = (off_t)size * (off_t)sizeof(int);
fseeko(fincache, offset, SEEK_CUR);
Input guidance for fseek:
http://www.tutorialspoint.com/c_standard_library/c_function_fseek.htm
int fseek(FILE *stream, long int offset, int whence)
offset − This is the number of bytes to offset from whence.
You are invoking undefined behavior by passing a long long (whose value is bigger than the max of long int) to fseek rather than the required long.
As is known, UB can do anything, including appearing to work.
Try this; you may have to read the data out if it's such a large number:
size_t toseek = 6404168;  /* change the number to increase it */
while (toseek > 0)
{
    char buffer[4096];
    size_t toread = toseek < sizeof(buffer) ? toseek : sizeof(buffer);
    size_t got = fread(buffer, 1, toread, fincache);
    if (got == 0)
        break;  /* EOF or read error */
    toseek -= got;
}

How to check a buffer in C?

I have a buffer of size 1500. In that buffer I need to check whether 15 bytes are all zeros or not (from 100 to 115). How can I do this without using a loop? The data is an unsigned char array.
Platform: Linux, C, gcc compiler.
Would using memcmp() be correct or not? I am reading some data from a smart card and storing it in a buffer, and now I need to check whether those 15 bytes are consecutively zero or not.
I mention memcmp() here because I need an efficient approach; the smart card reading already takes some time.
Or would a bitwise comparison be correct? Please suggest.
unsigned char buffer[1500];
...
bool allZeros = true;
for (int i = 100; i < 115; ++i)
{
    if (buffer[i] != 0)
    {
        allZeros = false;
        break;
    }
}
Or, without an explicit loop, with memcmp():
static const unsigned char zeros[15] = {0};
...
unsigned char buffer[1500];
...
bool allZeros = (memcmp(&buffer[100], zeros, 15) == 0);
Use a loop. It's the clearest, most accurate way to express your intent. The compiler will optimize it as much as possible. By "optimizing" it yourself, you can actually make things worse.
True story, happened to me a few days ago: I was 'optimizing' a comparison function between two 256-bit integers. The old version used a for loop to compare the 8 32-bit integers that comprised the 256-bit integers, I changed it to a memcmp. It was slower. Turns out that my 'optimization' blinded the compiler to the fact that both buffers were 32-bit aligned, causing it to use a less efficient comparison routine. It had already optimized out my loop anyway.
"From 100 to 115", counted inclusively, is not 15 bytes; it is 16 bytes.
If you only want to test the first sizeof(unsigned int) bytes (typically 4) of that range, you could do:
if (0 == *((unsigned int*)(buffer + 100))) {
    // those bytes are all zero
}
(Beware that this cast can violate alignment and strict-aliasing rules; memcmp or a loop is safer.)
I implemented it like this:
int is_empty_buffer(unsigned char *buff, size_t size)
{
    return *buff || memcmp(buff, buff + 1, size - 1);
}
If the return value is zero, the buffer is empty (all zeros). Note the comparison length is size - 1: each byte is compared with its successor, so all bytes must equal buff[0], which must be 0.

Faster way to zero memory than with memset?

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?
I assumed that memset uses mov; however, when zeroing, most compilers use xor as it's faster, correct? edit1: Wrong, as GregS pointed out: that only works with registers. What was I thinking?
Also I asked a person who knew more assembler than me to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32-bit-wide registers. However, at that time I was very tired, so I'm not quite sure I understood it correctly.
edit2:
I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

#define SIZE 0x1000000

void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}

/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;
    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}

int main()
{
    char* buffer = malloc(SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );
    free(buffer);
    return 0;
}
results:
zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest: at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me; no surprise there.
For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing needed. I'll try assembler next :)
edit3: fixed bug in test code, test results are not affected
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has a SSE optimized routine for zero. It will be hard to beat this.
x86 is a rather broad range of devices.
For a totally generic x86 target, an assembly block with rep stosd could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.
For chips with MMX, an assembly loop with movq could hit 64 bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or _m64. The target must be 8-byte aligned for the best performance.
For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use byte stores until aligned, and then complete your clear with a loop of movaps.
Win32 has ZeroMemory(), but I forget whether that's a macro for memset or an actual 'good' implementation.
memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).
That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)
Nowadays your compiler should do all the work for you. At least of what I know gcc is very efficient in optimizing calls to memset away (better check the assembler, though).
Then also, avoid memset if you don't have to:
use calloc for heap memory
use proper initialization (... = { 0 }) for stack memory
And for really large chunks use mmap if you have it. This just gets zero initialized memory from the system "for free".
If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.
The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.
For what it is worth, I hope it helps.
Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because it can have intrinsic support for it.
For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there are further possible optimizations when the size parameter can be evaluated at compile time.
That said, if you have specific needs (where size will always be tiny *or* huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations for zeroing huge chunks of memory without polluting your L2 cache.
But it all depends - and for normal stuff, please stick to memset/memcpy :)
The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.
The simplest customization is to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start storing a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.
For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections equals your number of cores/hardware threads).
Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard memset as well (their implementation might be more efficient than your compiler's).
There is one fatal flaw in this otherwise great and helpful test:
As memset is the first instruction, there seems to be some "memory overhead" or so which makes it extremely slow.
Moving the timing of memset to second place and something else to first place or simply timing memset twice makes memset the fastest with all compile switches!!!
That's an interesting question. I made this implementation that is just slightly faster (but hardly measurably so) when compiling a 32-bit release build on VC++ 2012. It probably can be improved on a lot. Wrapping this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.
// MemsetSpeedTest.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>
#pragma comment(lib, "Winmm.lib")
using namespace std;
/** a signed 64-bit integer value type */
#define _INT64 __int64
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 16-bit integer value type */
#define _INT16 __int16
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64
/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32
/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;
/** Use to start the performance timer */
#define TIMER_START start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
void *MemSet(void *dest, _UINT8 c, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    _UINT64 cUll =
        c
        | (((_UINT64)c) << 8)
        | (((_UINT64)c) << 16)
        | (((_UINT64)c) << 24)
        | (((_UINT64)c) << 32)
        | (((_UINT64)c) << 40)
        | (((_UINT64)c) << 48)
        | (((_UINT64)c) << 56);

    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;
    if (!bytesLeft) return dest;

    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;
    return dest;
}
int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT

    const size_t n = 10000000;
    const _UINT64 m = _UINT64_MAX;
    const _UINT64 o = 1;
    char test[n];
    {
        cout << "memset() took:" << endl;
        TIMER_START;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < o; j++)
                memset((void*)test, 0, n);
        TIMER_STOP;
    }
    {
        cout << "MemSet() took:" << endl;
        TIMER_START;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < o; j++)
                MemSet((void*)test, 0, n);
        TIMER_STOP;
    }
    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}
Output is as follows when release compiling for 32-bit systems:
memset() took:
5.569000
MemSet() took:
5.544000
Done
Output is as follows when release compiling for 64-bit systems:
memset() took:
2.781000
MemSet() took:
2.765000
Done
Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.
memset can be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.
What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you initialize a large number of small objects.
Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you're really into it and want to squeeze out the last bit of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march.
