c - What is the most efficient way to copy a string?

What is the most efficient way (for the CPU, in benchmark terms) to copy a string?
I am new to C and I am currently copying a string like this:
char a[] = "copy me";
char b[sizeof(a)];

for (size_t i = 0; i < sizeof(a); i++) {
    b[i] = a[i];
}
printf("%s", b); // copy me
Here is another alternative; a while loop is supposedly a little faster than a for loop (from what I have heard):
char a[] = "copy me";
char b[sizeof(a)];

void copyAString(char *s, char *t)
{
    while ((*s++ = *t++) != '\0')
        ;
}

copyAString(b, a);
printf("%s", b); // copy me

Don't write your own copy loops when you can use a standard function like memcpy (when the length is known) or strcpy (when it isn't).
Modern compilers treat these as "builtin" functions, so for constant sizes they can expand them to a few asm instructions instead of actually setting up a call to the library implementation, which would have to branch on the size and so on. So if you're avoiding memcpy because of the overhead of a library-function call for a short copy, don't worry: there won't be one if the length is a compile-time constant.
But even in the unknown / runtime-variable length cases, the library functions will usually be an optimized version hand-written in asm that's much faster (especially for medium to large strings) than anything you can do in pure C, especially for strcpy without undefined behaviour from reading past the end of a buffer.
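For the question's first example, that boils down to something like this (a minimal sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[] = "copy me";
    char b[sizeof(a)];

    memcpy(b, a, sizeof(a));   /* length known at compile time: copies the '\0' too */
    printf("%s\n", b);         /* copy me */

    /* if the length weren't known, strcpy(b, a) would scan for the '\0' itself */
    return 0;
}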
Your first block of code has a compile-time-constant size (you were able to use sizeof instead of strlen). Your copy loop will actually get recognized by modern compilers as a fixed-size copy, and (if large) turned into an actual call to memcpy, otherwise usually optimized similarly.
It doesn't matter how you do the array indexing; optimizing compilers can see through size_t indices or pointers and make good asm for the target platform.
See this and this Q&A for examples of how code actually compiles.
Remember that CPUs run asm, not C directly.
This example is too small and too simplistic to actually be usable as a benchmark, though. See Idiomatic way of performance evaluation?
Your 2nd way is equivalent to strcpy for an implicit-length string. That's slower because it has to search for the terminating 0 byte, if it wasn't known at compile time after inlining and unrolling the loop.
Especially if you do it by hand like this for non-constant strings: modern gcc/clang are unable to auto-vectorize loops where the program can't calculate the trip count ahead of the first iteration, i.e. they fail at search loops like strlen and strcpy.
If you actually just call strcpy(dst, src), the compiler will either expand it inline in some efficient way, or emit an actual call to the library function. The libc function uses hand-written asm to do it efficiently as it goes, especially on ISAs like x86 where SIMD can help. For example for x86-64, glibc's AVX2 version (https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcpy-avx2.S.html) should be able to copy 32 bytes per clock cycle for medium-sized copies with source and destination hot in cache, on mainstream CPUs like Zen2 and Skylake.
It seems modern GCC/clang do not recognize this pattern as strcpy the way they recognize memcpy-equivalent loops, so if you want efficient copying for unknown-size C strings, you need to use actual strcpy. (Or better, stpcpy to get a pointer to the end, so you know the string length afterwards, allowing you to use explicit-length stuff instead of the next function also having to scan the string for length.)
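For example, a minimal sketch of the stpcpy approach (stpcpy is POSIX, and was also added in C23):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[64];
    char *end = stpcpy(dst, "copy me");  /* returns a pointer to the '\0' it wrote */
    size_t len = end - dst;              /* length recovered without a second scan */
    printf("%zu\n", len);                /* 7 */
    return 0;
}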
Writing it yourself with one char at a time will end up using byte load/store instructions, so can go at most 1 byte per clock cycle. (Or close to 2 on Ice Lake, probably bottlenecked on the 5-wide front-end for the load / macro-fused test/jz / store.) So it's a disaster for medium to large copies with runtime-variable source where the compiler can't remove the loop.
(https://agner.org/optimize/ for performance of x86 CPUs. Other architectures are broadly similar, except for how useful SIMD is for strcpy. ISAs without x86's efficient SIMD->integer ability to branch on SIMD compare results may need to use general-purpose integer bithacks like in Why does glibc's strlen need to be so complicated to run quickly? - but note that's glibc's portable C fallback, only used on a few platforms where nobody's written hand-tuned asm.)
#0___________ claims their unrolled char-at-a-time loop is faster than glibc strcpy for strings of 1024 chars, but that's implausible and probably the result of faulty benchmark methodology. (Like compiler optimization defeating the benchmark, or page fault overhead or lazy dynamic linking for libc strcpy.)
Related Q&As:
Is memcpy() usually faster than strcpy()? - Yes, although for large copies on x86 strcpy can pretty much keep up; x86 SIMD can efficiently check whole chunks for any zero byte.
faster way than memcpy to copy 0-terminated string
Idiomatic way of performance evaluation? - microbenchmarking is hard: you need the compiler to optimize the parts that should be optimized, but still repeat the work in your benchmark loop instead of just doing it once.
Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, and all other ISAs where memory protection works in aligned pages. (It's still technically C UB, but safe in asm, so hand-written asm for library functions can 100% safely do this.)
Efficiency: arrays vs pointers
In C, accessing my array index is faster or accessing by pointer is faster?

This probably won't fit your use-case, but I found this code to be VASTLY faster than memcpy when copying an image array (and I'm talking more than 10-fold). There are probably a lot of people out there who will benefit from this, so I'm posting it here:
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

void fastMemcpy(void *Dest, void *Source, unsigned int nBytes)
{
    assert(nBytes % 32 == 0);
    assert(((intptr_t)Dest & 31) == 0);
    assert(((intptr_t)Source & 31) == 0);
    const __m256i *pSrc = (const __m256i *)Source;
    __m256i *pDest = (__m256i *)Dest;
    int64_t nVects = nBytes / sizeof(*pSrc);
    for (; nVects > 0; nVects--, pSrc++, pDest++)
    {
        const __m256i loaded = _mm256_stream_load_si256(pSrc);
        _mm256_stream_si256(pDest, loaded);  /* non-temporal store */
    }
    _mm_sfence();  /* make the non-temporal stores globally visible */
}
This makes use of AVX intrinsics, so include <immintrin.h>. The stream commands bypass the CPU's cache (the non-temporal stores do, at least; on ordinary write-back memory the stream load behaves like a normal load) and seem to make a big difference in speed. For bigger arrays you can also use multiple threads, which improves performance further.

Generally the most efficient way of copying the string is to manually unroll the loop to minimize the number of operations needed.
Example:
char *mystrcpy(char *restrict dest, const char *restrict src)
{
    char *saveddest = dest;
    while (1)
    {
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
        if (!(*dest++ = *src++)) break;
    }
    return saveddest;
}
https://godbolt.org/z/q3vYeWzab
A very similar approach is used by the glibc implementation.

Related

Place function instructions successively in program memory

Say I have a program which controls some Christmas lights (this isn't the actual application, only an example). These lights have a few different calculations to determine whether a light, i, will be lit in a given frame, t. Each of i and t is a uint8_t, so it can be assumed that there are 256 lights and t will loop each 256 frames. Some light patterns could be the following:
int flash(uint8_t t, uint8_t i) {
    return t & 1;
}
int alternate(uint8_t t, uint8_t i) {
    return (i & 1) == (t & 1);
}
int loop(uint8_t t, uint8_t i) {
    return i == t;
}
If I then wanted to implement a mode-changing system that would loop through these modes, I could use a function pointer array, int (*modes[3])(uint8_t, uint8_t). But, since these are all such short functions, is there any way I could instead force the compiler to place the functions directly after one another in program memory, sort of like an inline array?
The idea would be that accessing one of these functions wouldn't require evaluating the pointer; you could instead tell the processor the correct function is at modes + pitch*mode, where pitch is the spacing between functions (at least the length of the longest).
I ask more out of curiosity than requirement, because I doubt this would actually cause much of a speed improvement.
What you are asking for is not directly available in C. But such logic can be possible in assembler, and C compilers might use different assembler tricks depending on the CPU, optimization level, etc. Try to keep the logic small and compact, mark the different functions as static, use a switch() block in C, and look at the assembler the compiler generates.
You could use a switch statement, like:
#define FLASH     1
#define ALTERNATE 2
#define LOOP      3

int patternexecute(uint8_t t, uint8_t i, int pattern)
{
    switch (pattern) {
    case FLASH:     return t & 1;
    case ALTERNATE: return (i & 1) == (t & 1);
    case LOOP:      return i == t;
    }
    return 0;
}

Most Efficient Way To Count How Many Times A Character Occurs within a String

I am writing a very simple function that counts how many times a certain character occurs within a given string. I have a working function but was wondering if there was a more efficient or preferred method of doing this.
Here is the function:
size_t strchroc(const char *str, const char ch)
{
    int c = 0, i = 0;
    while (str[i])
        if (str[i++] == ch)
            c++;
    return c;
}
I personally cannot think of any way to make this code more efficient, and was wondering (just for the sake of learning) if anybody knew of one (efficient in the sense of speed and minimal resource use).
First of all, unless your function is really time-sensitive, do not try to over-optimize. Just use the one you provided, as it is easy to verify for correctness and it doesn't try to be smart for just the heck of it.
If the function needs really to be fast then there are many ways to optimize it more. Many, really many ways. Some of them either expect or assume specific memory layout of the strings you have (for example, that they are allocated on word boundaries and the allocation is also always padded up to word boundary). So you'd need to be careful, as the algorithm might work on some combination of processor, compiler and memory allocator and fail miserably on others.
Just for the heck of it, I'll list some possible ways to speed up the character counter:
Read the string a word (32 or 64 bit integer) at a time. Not necessarily much of a help thanks to L1 caching and speculative/out-of-order execution. This needs end-of-loop adjustment for the last word (miscounting bytes after NUL terminator). Use only with word-aligned and padded memory allocators.
Remove the conditional, and instead calculate counts for all 256 byte values (into an array) and return the count for the wanted character; see the sketch after this list. (This removes one point of conditional branching, and if you know the string length in advance it also makes for excellent loop unrolling.)
If you know the length of the string beforehand (calculated somewhere else) you can use that to unroll the loop. Or better, write it as a for-loop and apply a suitable #pragma and compiler options to make the compiler do loop unrolling for you.
Write the routine in assembler. Before going this way, crank up all compiler optimizations and disassemble the routine first -- you are likely to find out that the compiler already used all potential tricks you knew and several you didn't.
If your string is potentially very large (megabytes) -- and here I am speculating -- using a graphics card via OpenCL/CUDA might offer some potential.
And so on.
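For illustration, a minimal sketch of idea 2 from the list above (the function name is mine, not a standard API):

#include <stddef.h>

size_t strchroc_table(const char *str, char ch)
{
    size_t counts[256] = {0};
    const unsigned char *p = (const unsigned char *)str;

    while (*p)
        counts[*p++]++;        /* no branch on the character value */
    return counts[(unsigned char)ch];
}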
But I really, really suggest you stick with the one you have if you have a real-world problem. If this is a toy problem and you are optimizing for the fun of it, go ahead.
Cycle-shaving is a fun way to learn CPUs and instructions sets, but for 99.999999...% of programming tasks it is not worth the effort.
You can use the pointer to iterate the string, and with a little effort use the * only once per character:
size_t strchroc(const char *str, const char ch)
{
    size_t c = 0;
    char n;
    while ((n = *str++), ((n == ch) ? ++c : 0), n)
        ;
    return c;
}
Not that the compiler couldn't optimize yours to exactly the same code, but just for fun.
You should use strchr() (or memchr() if you know the length) to find the first occurrence before running your counting loop; if there is a match, start counting from that position.
This should be much faster unless your strings are very short, or it matches very early.
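A sketch of that suggestion (hypothetical helper name), letting the library's optimized scan do the searching:

#include <stddef.h>
#include <string.h>

size_t strchroc_strchr(const char *str, char ch)
{
    size_t c = 0;

    if (ch == '\0')                       /* strchr would match the terminator */
        return 0;
    while ((str = strchr(str, ch)) != NULL) {
        c++;
        str++;                            /* continue just past the match */
    }
    return c;
}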
You can get rid of the variable i:
size_t strchroc(const char *str, const char ch)
{
    size_t c = 0;
    while (*str != '\0') {
        if (*str == ch)
            c++;
        str++;
    }
    return c;
}

size_t count_the_string(const char *str, const char ch)
{
    size_t cnt;
    for (cnt = 0; *str; ) {
        cnt += *str++ == ch;
    }
    return cnt;
}
For the equivalent do { ... } while(); variant, GCC generates code without the conditional jump (except for the loop's own jump, of course), comparable to #hakattack's solution.
size_t count_the_string2(const char *str, const char ch)
{
    size_t cnt = 0;
    do {
        cnt += *str == ch;
    } while (*str++);
    return cnt;
}
After a quick low-quality benchmark I ended up with this for strings of arbitrary length.
On huge strings (100M+) it did not show much of a difference, but on shorter strings (sentences, normal text files, etc.) the improvement was about 25%.
unsigned int countc_r(char *buf, char c)
{
    unsigned int k = 0;
    for (;;) {
        if (!buf[0]) break;
        if ( buf[0] == c) ++k;
        if (!buf[1]) break;
        if ( buf[1] == c) ++k;
        if (!buf[2]) break;
        if ( buf[2] == c) ++k;
        if (!buf[3]) break;
        if ( buf[3] == c) ++k;
        buf += 4;
    }
    return k;
}

Is memset() more efficient than a for loop in C?

Is memset() more efficient than a for loop?
Considering this code:
char x[500];
memset(x,0,sizeof(x));
And this one:
char x[500];
for(int i = 0 ; i < 500 ; i ++) x[i] = 0;
Which one is more efficient and why? Is there any special instruction in hardware to do block level initialization.
Most certainly, memset will be much faster than that loop. Note how you treat one character at a time, while those functions are so optimized that they set several bytes at a time, even using MMX and SSE instructions when available.
I think the paradigmatic example of these optimizations, which usually go unnoticed, is the GNU C library's strlen function. One would think it has at least O(n) performance, but it actually does roughly n/4 or n/8 iterations depending on the architecture (yes, I know in big-O terms that is the same, but you actually get an eighth of the time). How? Tricky, but nicely: strlen.
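The heart of that trick is a word-at-a-time zero-byte test; a sketch of the classic "haszero" bit-hack (alignment and tail handling omitted):

#include <stdint.h>

/* Nonzero exactly when some byte of w is zero: the subtraction sets a
   byte's top bit when that byte was 0x00 or >= 0x81, and the '& ~w'
   term filters out the bytes whose top bit was already set. */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}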
Well, why don't we take a look at the generated assembly code, with full optimization, under VS 2010.
char x[500];
char y[500];
int i;
memset(x, 0, sizeof(x) );
003A1014 push 1F4h
003A1019 lea eax,[ebp-1F8h]
003A101F push 0
003A1021 push eax
003A1022 call memset (3A1844h)
And your loop...
char x[500];
char y[500];
int i;
for( i = 0; i < 500; ++i )
{
x[i] = 0;
00E81014 push 1F4h
00E81019 lea eax,[ebp-1F8h]
00E8101F push 0
00E81021 push eax
00E81022 call memset (0E81844h)
/* note that this is *replacing* the loop,
not being called once for each iteration. */
}
So, under this compiler, the generated code is exactly the same. memset is fast, and the compiler is smart enough to know that you are doing the same thing as calling memset once anyway, so it does it for you.
If the compiler actually left the loop as-is then it would likely be slower, as you can set more than one byte-sized block at a time (i.e., you could at a minimum unroll your loop a bit). You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced.
That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.
It really depends on the compiler and library. For older compilers or simple compilers, memset may be implemented in a library and would not perform better than a custom loop.
For nearly all compilers that are worth using, memset is an intrinsic function and the compiler will generate optimized, inline code for it.
Others have suggested profiling and comparing, but I wouldn't bother. Just use memset. Code is simple and easy to understand. Don't worry about it until your benchmarks tell you this part of code is a performance hotspot.
The answer is 'it depends'. memset MAY be more efficient, or it may internally use a for loop. I can't think of a case where memset will be less efficient. In this case, it may turn into a more efficient loop: your loop iterates 500 times, setting one byte of the array to 0 each time. On a 64-bit machine you could loop through setting 8 bytes (a long long) at a time, which would be almost 8 times quicker, and deal with the remaining 4 bytes (500 % 8) at the end.
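A rough sketch of that 8-bytes-at-a-time idea (my own illustration, not what any particular memset does; using memcpy for the word store sidesteps alignment concerns, and compilers turn it into a single 8-byte store):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void zero_by_words(char *buf, size_t n)
{
    const uint64_t zero = 0;
    size_t i = 0;

    for (; i + sizeof zero <= n; i += sizeof zero)
        memcpy(buf + i, &zero, sizeof zero);  /* 8 bytes per iteration */
    for (; i < n; i++)                        /* e.g. 500 % 8 == 4 leftover bytes */
        buf[i] = 0;
}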
EDIT:
in fact, this is what memset does in glibc:
http://repo.or.cz/w/glibc.git/blob/HEAD:/string/memset.c
As Michael pointed out, in certain cases (where the array length is known at compile time), the C compiler can inline memset, getting rid of the overhead of the function call. Glibc also has assembly optimized versions of memset for most major platforms, like amd64:
http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/memset.S
Good compilers will recognize the for loop and replace it with either an optimal inline sequence or a call to memset. They will also replace memset with an optimal inline sequence when the buffer size is small.
In practice, with an optimizing compiler the generated code (and therefore performance) will be identical.
Agree with the above. It depends. But for sure memset is faster than or equal to the for loop. If you are uncertain of your environment or too lazy to test, take the safe route and go with memset.
Other techniques, like loop unrolling, which reduce the number of loop iterations, can also be used. The code of memset() can mimic the famous Duff's device:
void *duff_memset(char *to, int c, size_t count)
{
    size_t n;
    char *p = to;

    if (count == 0)              /* the unrolled loop below assumes count > 0 */
        return to;
    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *p++ = c;
    case 7:      *p++ = c;
    case 6:      *p++ = c;
    case 5:      *p++ = c;
    case 4:      *p++ = c;
    case 3:      *p++ = c;
    case 2:      *p++ = c;
    case 1:      *p++ = c;
            } while (--n > 0);
    }
    return to;
}
Tricks like these used to enhance execution speed in the past, but on modern architectures they tend to increase code size and cache misses.
So it is practically impossible to say which implementation is faster, as it depends on the quality of the compiler optimizations, the ability of the C library to take advantage of special hardware instructions, the amount of data you are operating on, and the features of the underlying operating system (page fault management, TLB misses, copy-on-write).
For example, in the glibc, the implementation of memset() as well as various other "copy/set" functions like bzero() or strcpy() are architecture dependent to take advantage of various optimized hardware instructions like SSE or AVX.

Best way to convert whole file to lowercase in C

I was wondering if there is a really good (performant) way to convert a whole file to lowercase in C.
I use fgetc, convert the char to lowercase, and write it into another temp file with fputc. At the end I remove the original and rename the temp file to the original's old name. But I think there must be a better solution.
This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

int fast_lowercase(FILE *in, FILE *out)
{
    char buffer[65536];
    size_t readlen, wrotelen;
    char *p, *e;
    char conversion_table[256];
    int i;

    for (i = 0; i < 256; i++)
        conversion_table[i] = tolower(i);
    for (;;) {
        readlen = fread(buffer, 1, sizeof(buffer), in);
        if (readlen == 0) {
            if (ferror(in))
                return 1;
            assert(feof(in));
            return 0;
        }
        for (p = buffer, e = buffer + readlen; p < e; p++)
            *p = conversion_table[(unsigned char) *p];
        wrotelen = fwrite(buffer, 1, readlen, out);
        if (wrotelen != readlen)
            return 1;
    }
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
I'd say you've hit the nail on the head. A temp file means that you don't delete the original until you're sure you're done processing it, which means that upon error the original remains. I'd say that's the correct way of doing it.
As suggested by another answer, (if the file size permits) you can memory-map the file via the mmap function and have it readily available in memory (there is no real performance difference if the file is less than the size of a page, as it is probably going to be read into memory on the first read anyway).
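A minimal POSIX sketch of the mmap approach (hypothetical function name; error handling abbreviated, and it assumes a non-empty, writable file that fits in the address space):

#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int lowercase_inplace(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);             /* writes go back to the file */
    if (p == MAP_FAILED) { close(fd); return -1; }
    for (off_t i = 0; i < st.st_size; i++)
        p[i] = tolower((unsigned char)p[i]);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}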
You can usually get a bit faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. You should also probably read a bigger chunk (the whole file if possible) into memory and then write it all at once.
edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.
I.e. if you were to use SSE2, you could code the parallel case conversion like this (pseudocode):
for (cur_parallel_word = begin_of_block;
     cur_parallel_word < end_of_block;
     cur_parallel_word += parallel_word_width) {
    /*
     * in SSE2, parallel compares are either about 'greater' or 'equal',
     * so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
     * The 'ALL' macro is supposed to replicate a byte into all parallel bytes.
     */
    mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
    mask2 = parallel_compare_greater_than(ALL('Z' + 1), *cur_parallel_word);
    /*
     * vector op - AND all bytes of two vectors, 'PAND'
     */
    mask = mask1 & mask2;
    /*
     * vector op - add a vector of bytes. Would use 'PADDB'.
     */
    new = parallel_add(*cur_parallel_word, ALL('a' - 'A'));
    /*
     * vector op - zero the bytes in the original vector that will be replaced
     */
    *cur_parallel_word &= ~mask;        // that'd become 'PANDN'
    /*
     * vector op - extract the characters from new that replace old, then OR in.
     */
    *cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the lowercased version (one with the mask, the other with its inverse) before ORing them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to uppercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
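For the curious, here is a concrete SSE2-intrinsics version of the idea sketched above, converting ASCII 'A'..'Z' to lowercase 16 bytes at a time (my own sketch, not from the original answer; it assumes len is a multiple of 16 and leaves the tail to a scalar loop):

#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

void tolower_sse2(unsigned char *buf, size_t len)
{
    const __m128i a_minus1 = _mm_set1_epi8('A' - 1);
    const __m128i z_plus1  = _mm_set1_epi8('Z' + 1);
    const __m128i delta    = _mm_set1_epi8('a' - 'A');

    for (size_t i = 0; i < len; i += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i ge_a = _mm_cmpgt_epi8(v, a_minus1);      /* v >= 'A' (PCMPGTB) */
        __m128i le_z = _mm_cmpgt_epi8(z_plus1, v);       /* v <= 'Z' */
        __m128i mask = _mm_and_si128(ge_a, le_z);        /* 0xFF where uppercase */
        v = _mm_add_epi8(v, _mm_and_si128(mask, delta)); /* +32 only there */
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
}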
Well, you can definitely speed this up a lot if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So what we can do is ignore the safety and locale checks of the tolower() function and do something like this:
(pseudo code)
foreach (int) char c in the file:
    if (c > 64 && c < 91)  // the upper-case ASCII range
        c += 32            // lower case is exactly 32 above upper case
then write it out to the file.
Also, batch writes are faster, so I would suggest first writing to an array, then writing the contents of the array to the file all at once.
This should be considerably faster.

Fastest way to get the null char in a copied string in C

I need to get the pointer to the terminating null char of a string.
Currently I'm using this simple way: MyString + strlen(MyString) which is probably quite good out of context.
However I'm uncomfortable with this solution, as I have to do that after a string copy:
char MyString[32];
char* EndOfString;
strcpy(MyString, "Foo");
EndOfString = MyString + strlen(MyString);
So I'm looping over the string twice: the first time in strcpy and the second time in strlen.
I would like to avoid this overhead with a custom function that returns the number of copied characters:
size_t strcpylen(char *strDestination, const char *strSource)
{
    size_t len = 0;
    while (*strDestination++ = *strSource++)
        len++;
    return len;
}
EndOfString = MyString + strcpylen(MyString, "Foobar");
However, I fear that my implementation may be slower than the compiler provided CRT function (that may use some assembly optimization or other trick instead of a simple char-by-char loop). Or maybe I'm not aware of some standard builtin function that already does that?
I've done some poor man's benchmarking, iterating 0x1FFFFFFF times over three algorithms (strcpy+strlen, my version of strcpylen, and user434507's version). The results are:
1) strcpy+strlen is the winner with just 967 milliseconds;
2) my version takes much more: 57 seconds!
3) the edited version takes 53 seconds.
So using two CRT functions instead of a custom "optimized" version is, in my environment, more than 50 times faster!
size_t strcpylen(char *strDestination, const char *strSource)
{
    char *dest = strDestination;
    while (*dest++ = *strSource++)
        ;
    return dest - strDestination;
}
This is almost exactly what the CRT version of strcpy does, except that the CRT version will also do some checking e.g. to make sure that both arguments are non-null.
Edit: I'm looking at the CRT source for VC++ 2005. pmg is correct, there's no checking. There are two versions of strcpy. One is written in assembly, the other in C. Here's the C version:
char * __cdecl strcpy(char * dst, const char * src)
{
    char * cp = dst;

    while (*cp++ = *src++)
        ;               /* Copy src over dst */

    return dst;
}
Hacker's Delight has a nice section on finding the first null byte in a C string (see chapter 6 section 1). I found (parts of) it in Google Books, and the code seems to be here. I always go back to this book. Hope it's helpful.
Use strlcpy(), which returns the length of what it copied (assuming your size parameter is large enough). Note that it is a BSD extension rather than ISO C, so it is not available everywhere.
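A minimal sketch, assuming strlcpy is available on your platform:

#include <string.h>

void example(void)
{
    char MyString[32];
    size_t len = strlcpy(MyString, "Foo", sizeof(MyString));
    char *EndOfString = MyString + len;  /* valid as long as len < sizeof(MyString),
                                            i.e. the copy was not truncated */
    (void)EndOfString;
}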
You can try this:
size_t len = strlen(new_str);
memcpy(MyString, new_str, len + 1);
EndOfString = MyString + len;
It makes sense only if new_str is large, because memcpy is much faster than the standard while (*dest++ = *strSource++); approach, but it has extra startup costs.
Just a couple of remarks: if your function is not called very often, then it may run faster from your code than from the C library, because your code is already in the CPU caches.
What your benchmark is doing is making sure the library call is in the cache, and that is not necessarily the case in a real-world application.
Further, being inline could save even more cycles: compilers and CPUs prefer leaf function calls (one level of encapsulation rather than several call levels) for branch prediction and data prefetching.
It all depends on your code style, your application, and where you need to save cycles.
As you see, the picture is a bit more complex than what was previously exposed.
I think you may be worrying unnecessarily here. It's likely that any possible gain you can make here would be more than offset by better improvements you can make elsewhere. My advice would be not to worry about this, get your code finished and see whether you are so short of processing cycles that the benefit of this optimisation outweighs the additional work and future maintenance effort to speed it up.
In short: don't do it.
Try memccpy() (or _memccpy() in VC 2005+). I ran some tests of it against strcpy + strlen and your custom algorithm, and in my environment it beat both. I don't know how well it will work in yours, though, since for me your algorithm runs much faster than you saw, and strcpy + strlen much slower (14.4s for the former vs. 7.3s for the latter, using your number of iterations). I clocked the code below at about 5s.
#include <string.h>

int main(int argc, char *argv[])
{
    char test_string[] = "Foo";
    char new_string[64];
    char *null_character = NULL;
    int i;
    int iterations = 0x1FFFFFFF;

    for (i = 0; i < iterations; i++)
    {
        null_character = memccpy(new_string, test_string, 0, 64);
        --null_character;  /* memccpy returns a pointer just past the copied '\0' */
    }
    return 0;
}
Check out sprintf.
http://www.cplusplus.com/reference/clibrary/cstdio/sprintf/
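Its return value is the number of characters written, so the end pointer falls out for free (a minimal sketch):

#include <stdio.h>

void example(void)
{
    char MyString[32];
    char *EndOfString = MyString + sprintf(MyString, "%s", "Foo");
    (void)EndOfString;  /* points at the terminating '\0' */
}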
