Why is memcpy() faster? [duplicate] - c

This question already has answers here:
Why are memcpy() and memmove() faster than pointer increments?
(10 answers)
Copying array of structs with memcpy vs direct approach [duplicate]
(3 answers)
Closed 9 years ago.
I'm curious why the memcpy() function is faster than the simple manual copy.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main()
{
    clock_t begin, end;
    double time_spent;
    int i, j;
    char source[65536] = {0}, destination[65536];

    begin = clock();
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 65536; i++) destination[i] = source[i];
        // slower than memcpy(destination, source, 65536);
    end = clock();

    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
    system("pause");
}
Doesn't the implementation of memcpy() do the same thing?
Thanks in advance.

memcpy() can incorporate various other optimizations, for example SIMD. See this answer for more information.
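For a concrete picture, here is a minimal sketch of the SIMD idea using SSE2 intrinsics. This is an illustration only, not any libc's actual implementation: it assumes n is a multiple of 16 and skips the alignment dispatch, tail handling, and size thresholds a real memcpy performs.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Copy 16 bytes per iteration with unaligned SSE2 loads/stores. */
    void copy_sse2(void *dst, const void *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)((const char *)src + i));
            _mm_storeu_si128((__m128i *)((char *)dst + i), v);
        }
    }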

A good optimizing compiler should identify that your loop is, in fact, memmove() or memcpy() and replace it with a call to that function. That still leaves the question: why is it smart to do that?
It turns out that there's a great deal of room for hand-optimization of the compiled code for copying memory, and compilers aren't nearly smart enough to do it all yet (it's also very CPU-specific, so OSs will have specialized versions for each family of CPUs they support, and swap them in at runtime).
Here's OSX's x86_64 SSE 4.2 copy implementation: http://www.opensource.apple.com/source/Libc/Libc-825.25/x86_64/string/bcopy_sse42.s

Doesn't the implementation of memcpy() do the same thing?
Not necessarily.
It's a standard library function, and as such:
it may be highly optimized, using platform-specific fast assembly instructions, or it may simply copy more than one byte per iteration, which is faster if the processor has large enough registers (see the sketch after this list);
it may be recognized by the compiler as a builtin, so the compiler can perform even more optimization steps: inlining it to remove the function-call overhead, deducing from the context what you are trying to do and doing it by another method, etc.
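As a sketch of the first point, a word-at-a-time copy might look like the following. This is a hypothetical illustration, not any particular library's code, and it assumes both buffers are suitably aligned for 8-byte accesses.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy 8 bytes per iteration, then finish the tail byte by byte. */
    void *word_copy(void *dst, const void *src, size_t n)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;

        while (n >= sizeof *d) {
            *d++ = *s++;
            n -= sizeof *d;
        }

        unsigned char *db = (unsigned char *)d;
        const unsigned char *sb = (const unsigned char *)s;
        while (n--)
            *db++ = *sb++;      /* remaining bytes */
        return dst;
    }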

Because the for loop copies the items one by one, while memcpy() copies them block by block. You can read the source code of memcpy() here: https://www.student.cs.uwaterloo.ca/~cs350/common/os161-src-html/memcpy_8c-source.html or here: http://research.microsoft.com/en-us/um/redmond/projects/invisible/src/crt/memcpy.c.htm

memcpy() will try to copy words at once, i.e. 4 bytes per iteration on 32-bit systems and 8 bytes per iteration on 64-bit systems.

memcpy is not a vanilla loop. There are a number of optimizations in place.
Things like alignment and word-size allow memcpy to copy memory in bigger chunks, at a steady pace.

You can just step into memcpy to find out that it's not a simple loop.

Related

What are the various ways code may be 'vectorized'?

I usually hear the term vectorized functions in one of two ways:
In a very high-level language, when the data is passed all at once (or at least in bulk chunks) to a lower-level library that does the calculations in a faster way. An example of this would be python's use of numpy for array/LA-related stuff.
At the lowest level, when using a specific machine instruction or procedure that makes heavy use of them (such as YMM, ZMM, XMM register instructions).
However, it seems like the term is passed around quite generally, and I wanted to know whether there's a third (or more) way in which it's used. One candidate would be simply passing multiple values to a function rather than one (usually done via an array), for example:
// non-'vectorized'
#include <stdio.h>

int squared(int num) {
    return num * num;
}

int main(void) {
    int nums[] = {1, 2, 3, 4, 5};
    for (int i = 0; i < sizeof(nums) / sizeof(*nums); i++) {
        int n_squared = squared(nums[i]);
        printf("%d^2 = %d\n", nums[i], n_squared);
    }
}
// 'vectorized'
#include <stdio.h>

void squared(int num[], int size) {
    for (int i = 0; i < size; i++) {
        num[i] = num[i] * num[i];
    }
}

int main(void) {
    int nums[] = {1, 2, 3, 4, 5};
    squared(nums, sizeof(nums) / sizeof(*nums));
    for (int i = 0; i < sizeof(nums) / sizeof(*nums); i++) {
        printf("Squared=%d\n", nums[i]);
    }
}
Is the above considered 'vectorized code'? Is there a more formal/better definition of what makes something vectorized or not?
Vectorized code, in the context you seem to be referring to, normally means "an implementation that happens to make use of Single Instruction Multiple Data (SIMD) hardware instructions".
This can sometimes mean that someone manually wrote a version of a function that is equivalent to the canonical one, but happens to make use of SIMD. More often than not, it's something that the compiler does under the hood as part of its optimization passes.
In a very high-level language, when the data is passed all at once (or at least in bulk chunks) to a lower-level library that does the calculations in a faster way. An example of this would be python's use of numpy for array/LA-related stuff.
That's simply not correct. The process of handing off a big chunk of data to some block of code that goes through it quickly is not vectorization in and of itself.
You could say "Now that my code uses numpy, it's vectorized" and be sort of correct, but only transitively. A better way to put it would be "Now that my code uses numpy, it runs a lot faster because numpy is vectorized under the hood." Importantly though, not all fast libraries to which big chunks of data are passed at once are vectorized.
...Code examples...
Since there is no SIMD instruction in sight in either example, neither is vectorized yet. It might be true that the second version is more likely to lead to a vectorized program; if so, we'd say it is more vectorizable than the first. However, the program is not vectorized until the compiler makes it so.
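To make that concrete, here is a sketch of what an explicitly vectorized squared() could look like using SSE4.1 intrinsics. The name squared_simd is made up for illustration; it assumes size is a multiple of 4 and a compiler/CPU supporting SSE4.1 (e.g. gcc -msse4.1).

    #include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32 */
    #include <stdio.h>

    /* Square four ints per instruction. */
    void squared_simd(int num[], int size)
    {
        for (int i = 0; i < size; i += 4) {
            __m128i v = _mm_loadu_si128((__m128i *)(num + i));
            v = _mm_mullo_epi32(v, v);
            _mm_storeu_si128((__m128i *)(num + i), v);
        }
    }

    int main(void)
    {
        int nums[] = {1, 2, 3, 4, 5, 6, 7, 8};
        squared_simd(nums, 8);
        for (int i = 0; i < 8; i++)
            printf("%d ", nums[i]);
        printf("\n");
    }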

Macros for 3D loops in C

I'm developing a C (C99) program that loops heavily over 3-D arrays in many places. So naturally, the following access pattern is ubiquitous in the code:
for (int i = 0; i < i_size; i++) {
    for (int j = 0; j < j_size; j++) {
        for (int k = 0; k < k_size; k++) {
            ...
        }
    }
}
Naturally, this fills many lines of code with clutter and requires extensive copypasting. So I was wondering whether it would make sense to use macros to make it more compact, like this:
#define BEGIN_LOOP_3D(i,j,k,i_size,j_size,k_size) \
    for (int i = 0; i < (i_size); i++) { \
        for (int j = 0; j < (j_size); j++) { \
            for (int k = 0; k < (k_size); k++) {
and
#define END_LOOP_3D }}}
On one hand, from a DRY principle standpoint, this seems great: it makes the code a lot more compact, and allows you to indent the contents of the loop by just one block instead of three. On the other hand, the practice of introducing new language constructs seems hideously ugly and, even though I can't think of any obvious problems with it right now, seems alarmingly prone to creating bugs that are a nightmare to debug.
So what do you think: do the compactness and reduced repetition justify this despite the ugliness and the potential drawbacks?
Never put opening or closing braces {} inside macros. C programmers are not used to this, so the code gets difficult to read.
In your case the braces are even completely superfluous; you just don't need them. If you do such a thing, do
#define FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
    for (size_t I = 0; I < (ISIZE); I++) \
        for (size_t J = 0; J < (JSIZE); J++) \
            for (size_t K = 0; K < (KSIZE); K++)
no need for a terminating macro. The programmer can place the {} directly.
Also, above I have used size_t as the correct type in C for loop indices. 3D matrices easily get large, and int arithmetic overflows when you don't think of it.
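A hypothetical use of the macro, with the braces supplied at the call site:

    #include <stddef.h>
    #include <stdio.h>

    #define FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
        for (size_t I = 0; I < (ISIZE); I++) \
            for (size_t J = 0; J < (JSIZE); J++) \
                for (size_t K = 0; K < (KSIZE); K++)

    int main(void)
    {
        int arr[2][3][4];
        FOR3D(i, j, k, 2, 3, 4) {       /* braces belong to the caller */
            arr[i][j][k] = (int)(i + j + k);
        }
        printf("%d\n", arr[1][2][3]);   /* prints 6 */
        return 0;
    }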
If these 3D arrays are “small”, you can ignore me. If your 3D arrays are large, but you don't much care about performance, you can ignore me. If you subscribe to the (common but false) doctrine that compilers are quasi-magical tools that can poop out optimal code almost irrespective of the input, you can ignore me.
You are probably aware of the general caveats regarding macros, how they can frustrate debugging, etc., but if your 3D arrays are “large” (whatever that means), and your algorithms are performance-oriented, there may be drawbacks of your strategy that you may not have considered.
First: if you are doing linear algebra, you almost certainly want to use dedicated linear algebra libraries, such as BLAS, LAPACK, etc., rather than “rolling your own”. OpenBLAS (from GotoBLAS) will totally smoke any equivalent you write, probably by at least an order of magnitude. This is doubly true if your matrices are sparse and triply true if your matrices are sparse and structured (such as tridiagonal).
Second: if your 3D arrays represent Cartesian grids for some kind of simulation (like a finite-difference method), and/or are intended to be fed to any numerical library, you absolutely do not want to represent them as C 3D arrays. You will want, instead, to use a 1D C array and use library functions where possible and perform index computations yourself (see this answer for details) where necessary.
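As a sketch of that flat-storage idea, assuming row-major order (the IDX3 helper is hypothetical, named here only for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Row-major index into a conceptual [isize][jsize][ksize] grid. */
    #define IDX3(i, j, k, JSIZE, KSIZE) (((i) * (JSIZE) + (j)) * (KSIZE) + (k))

    int main(void)
    {
        size_t isize = 2, jsize = 3, ksize = 4;
        double *grid = malloc(isize * jsize * ksize * sizeof *grid);
        if (!grid) return 1;

        grid[IDX3(1, 2, 3, jsize, ksize)] = 42.0;   /* conceptual grid[1][2][3] */
        printf("%f\n", grid[IDX3(1, 2, 3, jsize, ksize)]);
        free(grid);
        return 0;
    }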
Third: if you really do have to write your own triple-nested loops, the nesting order of the loops is a serious performance consideration. It might well be that the data-access pattern for ijk order (rather than ikj or kji) yields poor cache behavior for your algorithm, as is the case for dense matrix-matrix multiplication, for example. Your compiler might be able to do some limited loop interchange (last time I checked, icc would produce reasonably fast code for naive xGEMM, but gcc wouldn't). As you implement more and more triple-nested loops, and your proposed solution becomes more and more attractive, it becomes less and less likely that a "one loop-order fits all" strategy will give reasonable performance in all cases.
Fourth: any “one loop-order fits all” strategy that iterates over the full range of every dimension will not be tiled, and may exhibit poor performance.
Fifth (and with reference to another answer with which I disagree): I believe, in general, that the “best” data type for any object is the set with the smallest size and the least algebraic structure, but if you decide to indulge your inner pedant and use size_t or another unsigned integer type for matrix indices, you will regret it. I wrote my first naive linear algebra library in C++ in 1994. I've written maybe a half dozen in C over the last 8 years and, every time, I've started off trying to use unsigned integers and, every time, I've regretted it. I've finally decided that size_t is for sizes of things and a matrix index is not the size of anything.
Sixth (and with reference to another answer with which I disagree): a cardinal rule of HPC for deeply nested loops is to avoid function calls and branches in the innermost loop. This is particularly important where the op-count in the innermost loop is small. If you're doing a handful of operations, as is the case more often than not, you don't want to add a function call overhead in there. If you're doing hundreds or thousands of operations in there, you probably don't care about a handful of instructions for a function call/return and, therefore, they're OK.
Finally, if none of the above are considerations that jibe with what you're trying to implement, then there's nothing wrong with what you're proposing, but I would carefully consider what Jens said about braces.
The best way is to use a function. Let the compiler worry about performance and optimization, though if you are concerned you can always declare functions as inline.
Here's a simple example:
#include <stdio.h>
#include <stdint.h>

typedef void (*func_t)(int *item_ptr);

void traverse_3D (size_t x,
                  size_t y,
                  size_t z,
                  int array[x][y][z],
                  func_t function)
{
    for (size_t ix = 0; ix < x; ix++)
    {
        for (size_t iy = 0; iy < y; iy++)
        {
            for (size_t iz = 0; iz < z; iz++)
            {
                function(&array[ix][iy][iz]);
            }
        }
    }
}

void fill_up (int *item_ptr) // fill array with some random numbers
{
    static uint8_t counter = 0;
    *item_ptr = counter;
    counter++;
}

void print (int *item_ptr)
{
    printf("%d ", *item_ptr);
}

int main()
{
    int arr [2][3][4];
    traverse_3D(2, 3, 4, arr, fill_up);
    traverse_3D(2, 3, 4, arr, print);
}
EDIT
To put the speculation to rest, here are some benchmarking results from Windows.
Tests were done with a matrix of size [20][30][40]. The fill_up function was called either from traverse_3D or from a 3-level nested loop directly in main(). Benchmarking was done with QueryPerformanceCounter().
Case 1: gcc -std=c99 -pedantic-errors -Wall
With function, time in us: 255.371402
Without function, time in us: 254.465830
Case 2: gcc -std=c99 -pedantic-errors -Wall -O2
With function, time in us: 115.913261
Without function, time in us: 48.599049
Case 3: gcc -std=c99 -pedantic-errors -Wall -O2, traverse_3D function inlined
With function, time in us: 37.732181
Without function, time in us: 37.430324
Why the "without function" case performs somewhat better with the function inlined, I have no idea. I can comment out the call to it and still get the same benchmarking results for the "without function" case.
The conclusion however, is that with proper optimization, performance is most likely a non-issue.

How to tell the compiler to unroll this loop [duplicate]

This question already has answers here:
Tell gcc to specifically unroll a loop
(3 answers)
Closed 9 years ago.
I have the following loop that I am running on an ARM processor.
// pin here is pointer to some part of an array
for (i = 0; i < v->numelements; i++)
{
    pe = pptr[i];
    peParent = pe->parent;
    SPHERE *ps = (SPHERE *)(pe->data);
    pin[0] = FLOAT2FIX(ps->rad2);
    pin[1] = *peParent->procs->pe_intersect == &SphPeIntersect;
    fixifyVector( &pin[2], ps->center ); // Is an inline function
    pin = pin + 5;
}
By the slow performance of the loop, I can judge that the compiler was unable to unroll it, because when I do the unrolling manually, it becomes quite fast. I think the compiler is getting confused by the pin pointer. Can we use the restrict keyword to help the compiler here, or is restrict reserved only for function parameters? In general, how can we tell the compiler to unroll this loop and not worry about the pin pointer?
To tell gcc to unroll all loops you can use the optimization flag -funroll-loops.
To unroll only a specific loop you can use:
__attribute__((optimize("unroll-loops")))
see this answer for more details.
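For instance, a minimal sketch of the per-function form (GCC-specific; the function and parameter names here are made up for illustration):

    /* Ask GCC to unroll the loops in just this function. */
    __attribute__((optimize("unroll-loops")))
    void scale(int *dst, const int *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2;
    }

For a whole translation unit, compiling with e.g. gcc -O2 -funroll-loops has the same effect globally.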
Edit
If the compiler cannot determine the number of iterations of the loop upon entry you will need to use -funroll-all-loops. Note that from the documentation: "Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly."
If you extend pptr's size by one, you can use the pld instruction.
__asm__ __volatile__("pld\t[%0]" :: "r" (pptr[i+1]));
Or alternatively you may need to pre-load the next peParent and SPHERE *ps. The loop overhead on an ARM is very small, so it is unlikely that unrolling the loop will by itself be a significant benefit; there are no loop-variable constants to fold away. It is more likely that, once you unroll the loop, the compiler's scheduler can fetch data in advance of its use.
You have not presented all of the code, so we cannot see the data dependencies. There may be other variables that would benefit from being pre-loaded. Giving a complete example would probably help everyone answer your question.

Is memset() more efficient than for loop in C?

Is memset() more efficient than a for loop?
Considering this code:
char x[500];
memset(x,0,sizeof(x));
And this one:
char x[500];
for (int i = 0; i < 500; i++) x[i] = 0;
Which one is more efficient, and why? Is there any special hardware instruction to do block-level initialization?
Most certainly, memset will be much faster than that loop. You treat one character at a time, while those functions are so optimized that they set several bytes at a time, even using, when available, MMX and SSE instructions.
I think the paradigmatic example of these optimizations, which usually go unnoticed, is the GNU C library strlen function. One would think that it has at least O(n) performance, but it actually takes n/4 or n/8 of the steps, depending on the architecture (yes, I know, in big-O terms that's the same, but you actually get an eighth of the time). How? Tricky, but nicely: strlen.
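The trick amounts to scanning a word at a time and detecting a zero byte with bit arithmetic. A rough sketch of the idea (not glibc's actual code; the aligned 8-byte reads can touch a few bytes past the terminator, which is fine in practice on paged systems but formally outside ISO C):

    #include <stddef.h>
    #include <stdint.h>

    size_t fast_strlen(const char *s)
    {
        const char *p = s;

        /* advance byte by byte until p is 8-byte aligned */
        while ((uintptr_t)p % 8 != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* scan one 64-bit word per iteration */
        const uint64_t *w = (const uint64_t *)p;
        for (;;) {
            uint64_t v = *w;
            /* nonzero iff some byte of v is zero */
            if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
                break;
            w++;
        }

        /* pinpoint the exact terminating byte */
        p = (const char *)w;
        while (*p)
            p++;
        return (size_t)(p - s);
    }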
Well, why don't we take a look at the generated assembly code, with full optimization, under VS 2010.
char x[500];
char y[500];
int i;
memset(x, 0, sizeof(x) );
003A1014 push 1F4h
003A1019 lea eax,[ebp-1F8h]
003A101F push 0
003A1021 push eax
003A1022 call memset (3A1844h)
And your loop...
char x[500];
char y[500];
int i;
for( i = 0; i < 500; ++i )
{
x[i] = 0;
00E81014 push 1F4h
00E81019 lea eax,[ebp-1F8h]
00E8101F push 0
00E81021 push eax
00E81022 call memset (0E81844h)
/* note that this is *replacing* the loop,
not being called once for each iteration. */
}
So, under this compiler, the generated code is exactly the same. memset is fast, and the compiler is smart enough to know that you are doing the same thing as calling memset once anyway, so it does it for you.
If the compiler actually left the loop as-is, then it would likely be slower, since you can set more than one byte-sized block at a time (i.e., you could unroll your loop a bit at a minimum). You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced.
That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.
It really depends on the compiler and library. For older compilers or simple compilers, memset may be implemented in a library and would not perform better than a custom loop.
For nearly all compilers that are worth using, memset is an intrinsic function and the compiler will generate optimized, inline code for it.
Others have suggested profiling and comparing, but I wouldn't bother. Just use memset. Code is simple and easy to understand. Don't worry about it until your benchmarks tell you this part of code is a performance hotspot.
The answer is 'it depends'. memset MAY be more efficient, or it may internally use a for loop. I can't think of a case where memset will be less efficient. In this case, it may turn into a more efficient loop: your loop iterates 500 times, setting a byte's worth of the array to 0 every time. On a 64-bit machine, you could loop through setting 8 bytes (a long long) at a time, which would be almost 8 times quicker, and just deal with the remaining 4 bytes (500 % 8) at the end.
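A sketch of exactly that idea (hypothetical; assumes the buffer is 8-byte aligned):

    #include <stddef.h>
    #include <stdint.h>

    /* Zero 8 bytes per iteration, then finish the tail byte by byte. */
    void zero_words(char *p, size_t n)
    {
        uint64_t *w = (uint64_t *)p;
        while (n >= sizeof *w) {
            *w++ = 0;
            n -= sizeof *w;
        }
        char *b = (char *)w;
        while (n--)
            *b++ = 0;
    }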
EDIT:
in fact, this is what memset does in glibc:
http://repo.or.cz/w/glibc.git/blob/HEAD:/string/memset.c
As Michael pointed out, in certain cases (where the array length is known at compile time), the C compiler can inline memset, getting rid of the overhead of the function call. Glibc also has assembly optimized versions of memset for most major platforms, like amd64:
http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/memset.S
Good compilers will recognize the for loop and replace it with either an optimal inline sequence or a call to memset. They will also replace memset with an optimal inline sequence when the buffer size is small.
In practice, with an optimizing compiler the generated code (and therefore performance) will be identical.
Agree with the above: it depends. But memset is certainly faster than or equal to the for loop. If you are uncertain of your environment or too lazy to test, take the safe route and go with memset.
Other techniques like loop unrolling, which reduce the number of loop iterations, can also be used. The code of memset() can mimic the famous Duff's device:
void *duff_memset(char *to, int c, size_t count)
{
    size_t n;
    char *p = to;

    if (count == 0)          /* the jump table below assumes count > 0 */
        return to;

    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *p++ = c;
    case 7:      *p++ = c;
    case 6:      *p++ = c;
    case 5:      *p++ = c;
    case 4:      *p++ = c;
    case 3:      *p++ = c;
    case 2:      *p++ = c;
    case 1:      *p++ = c;
            } while (--n > 0);
    }

    return to;
}
Those tricks used to enhance execution speed in the past, but on modern architectures they tend to increase code size and cache misses.
So, it is quite impossible to say which implementation is faster, as it depends on the quality of the compiler optimizations, the ability of the C library to take advantage of special hardware instructions, the amount of data you are operating on, and the features of the underlying operating system (page-fault management, TLB misses, copy-on-write).
For example, in the glibc, the implementation of memset() as well as various other "copy/set" functions like bzero() or strcpy() are architecture dependent to take advantage of various optimized hardware instructions like SSE or AVX.

Fastest way to get the null char in a copied string in C

I need to get the pointer to the terminating null char of a string.
Currently I'm using this simple way: MyString + strlen(MyString) which is probably quite good out of context.
However I'm uncomfortable with this solution, as I have to do that after a string copy:
char MyString[32];
char* EndOfString;
strcpy(MyString, "Foo");
EndOfString = MyString + strlen(MyString);
So I'm looping over the string twice: the first time in strcpy and the second time in strlen.
I would like to avoid this overhead with a custom function that returns the number of copied characters:
size_t strcpylen(char *strDestination, const char *strSource)
{
    size_t len = 0;
    while( *strDestination++ = *strSource++ )
        len++;
    return len;
}
EndOfString = MyString + strcpylen(MyString, "Foobar");
However, I fear that my implementation may be slower than the compiler provided CRT function (that may use some assembly optimization or other trick instead of a simple char-by-char loop). Or maybe I'm not aware of some standard builtin function that already does that?
I've done some poor man's benchmarking, iterating 0x1FFFFFFF times over three algorithms (strcpy+strlen, my version of strcpylen, and the version by user434507). The results are:
1) strcpy+strlen is the winner with just 967 milliseconds;
2) my version takes much more: 57 seconds!
3) the edited version takes 53 seconds.
So using two CRT functions instead of a custom "optimized" version in my environment is more than 50 times faster!
size_t strcpylen(char *strDestination, const char *strSource)
{
    char* dest = strDestination;
    while( *dest++ = *strSource++ )
        ;
    return dest - strDestination - 1;  /* dest ends one past the copied null */
}
This is almost exactly what the CRT version of strcpy does, except that the CRT version will also do some checking e.g. to make sure that both arguments are non-null.
Edit: I'm looking at the CRT source for VC++ 2005. pmg is correct, there's no checking. There are two versions of strcpy. One is written in assembly, the other in C. Here's the C version:
char * __cdecl strcpy(char * dst, const char * src)
{
    char * cp = dst;

    while( *cp++ = *src++ )
        ;               /* Copy src over dst */

    return( dst );
}
Hacker's Delight has a nice section on finding the first null byte in a C string (see chapter 6 section 1). I found (parts of) it in Google Books, and the code seems to be here. I always go back to this book. Hope it's helpful.
Use strlcpy(), which will return the length of what it copied (assuming your size parameter is large enough).
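A minimal sketch of that approach; note strlcpy is a BSD extension (available on the BSDs, macOS, and recent glibc) rather than ISO C:

    #include <stdio.h>
    #include <string.h>   /* strlcpy: BSD extension, not ISO C */

    int main(void)
    {
        char MyString[32];
        /* strlcpy returns strlen(source); usable as an offset only if it fit */
        size_t len = strlcpy(MyString, "Foo", sizeof MyString);
        char *EndOfString = MyString + len;
        printf("null at offset %zu\n", (size_t)(EndOfString - MyString));
        return 0;
    }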
You can try this:
int len = strlen(new_str);
memcpy(MyString, new_str, len + 1);
EndOfString = MyString + len;
It makes sense only if new_str is large, because memcpy is much faster than the standard while( *dest++ = *strSource++ ); approach, but it has the extra up-front cost of the strlen pass.
Just a couple of remarks: if your function is not called very often, then it may run faster from your code than from the C library, because your code is already in the CPU caches.
What your benchmark is doing is making sure that the library call is in the cache, and this is not necessarily the case in a real-world application.
Further, being inline could save even more cycles: compilers and CPUs prefer leaf function calls (one level of encapsulation rather than several call levels) for branch prediction and data prefetching.
It all depends on your code style, your application, and where you need to save cycles.
As you see, the picture is a bit more complex than what was previously presented.
I think you may be worrying unnecessarily here. It's likely that any possible gain you can make here would be more than offset by better improvements you can make elsewhere. My advice would be not to worry about this, get your code finished and see whether you are so short of processing cycles that the benefit of this optimisation outweighs the additional work and future maintenance effort to speed it up.
In short: don't do it.
Try memccpy() (or _memccpy() in VC 2005+). I ran some tests of it against strcpy + strlen and your custom algorithm, and in my environment it beat both. I don't know how well it will work in yours, though, since for me your algorithm runs much faster than you saw, and strcpy + strlen much slower (14.4s for the former vs. 7.3s for the latter, using your number of iterations). I clocked the code below at about 5s.
#include <string.h>     /* memccpy */

int main(int argc, char *argv[])
{
    char test_string[] = "Foo";
    char new_string[64];
    char *null_character = NULL;
    int i;
    int iterations = 0x1FFFFFFF;

    for(i = 0; i < iterations; i++)
    {
        null_character = memccpy(new_string, test_string, 0, 64);
        --null_character;       /* memccpy returns one past the copied 0 */
    }

    return 0;
}
Check out sprintf.
http://www.cplusplus.com/reference/clibrary/cstdio/sprintf/
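The hint here is that sprintf returns the number of characters written (excluding the terminating null), so it yields the end pointer in a single pass. A minimal sketch:

    #include <stdio.h>

    int main(void)
    {
        char MyString[32];
        char *EndOfString;

        int len = sprintf(MyString, "%s", "Foo");  /* returns chars written */
        EndOfString = MyString + len;
        printf("null at offset %d\n", (int)(EndOfString - MyString));
        return 0;
    }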
