Very quick question for you. When I store some automatic variable in C, the asm output is something like this: MOV ESP+4,#25h, and I just want to know why the compiler can't calculate that ESP+4 address itself.
I thought this through, and I really can't find a reason for it. I mean, isn't the compiler aware of the esp value? It should be. And when using another object file, this should not be a problem either, since variables could just be represented by addresses and linked later, when all automatic variables are known and proper addresses could therefore be assigned. Thanks.
No, it cannot know the value of esp in advance.
Take for example a recursive function, i.e. a function that calls itself. Assume such a function has several parameters that are passed in via the stack. This means that each argument takes some space on the stack, thereby changing the value of the esp register.
Now, when the function is entered, the exact value of esp will depend on how many times the function has called itself previously, and there is no way the compiler could know this at compile time. If you doubt this, take a function such as this:
#include <stdlib.h>

void foobar(int n)
{
    if (rand() % n != 17)
        foobar(n + 1);
}
There's no way the compiler would be smart enough in advance to figure out if the function will call itself once more.
If the compiler wanted to determine esp in advance, it would effectively have to create a version of the function for each possible value for esp.
The above explanation only takes into account one function. In a real-world scenario, a program has many functions which interdepend on one another, which results in fairly complex "call graphs". This, together with (among other things) unpredictable program logic, means the compiler would have to create a huge array of versions of each function, just to optimise on esp -- which clearly doesn't make sense.
P.S.: Now something else. You don't actually need to optimise [esp+N] at all, because it should not take any more CPU time than the simpler [esp]... at least not on Intel Pentium CPUs. You could say that they already contain optimizations for exactly this and even more complicated scenarios. If you're interested in the Intel CPUs, I suggest you look up the documentation for the ModR/M and SIB bytes of a machine instruction, e.g. in Intel's official CPU developer documentation.
No, the compiler is not aware of the value of ESP at runtime - it's the stack pointer. It is potentially different every time the function is called. Perhaps the simplest example to think about is a recursive function - every time it calls itself, the stack gets a little bit deeper to accommodate the local variables for the new call. Every stack frame has its own copy of the local variable, every stack frame is at a different position on the stack, and therefore each copy has its own address (relative to ESP, normally).
The Stack Pointer cannot be calculated at compile time. For a simple example why this is not possible, just think of a recursive function: The same variable has a different address for each call, but it's always the same code that is run.
No, the compiler doesn't know the value ahead of time. In a few extremely basic programs (where there's only one possible "route" from main to any other particular function being called) it could, but I don't know of a compiler that attempts to compute this. If you have any recursion, or a function is called from more than one place, then the stack pointer will have different values depending on where it was called from.
There's not much point to doing so in any case -- since the stack pointer is so heavily used, most CPUs are designed to make indirect addressing from the stack pointer extremely efficient. In fact, it's often more efficient than supplying an absolute address would be.
This is really rather fundamental to the way the stack works. To reason it out for yourself, imagine how you'd implement a recursive function.
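A minimal, runnable sketch of that thought experiment: the same local variable gets a different address in each active call, so no fixed address could be assigned at compile time.

#include <stdio.h>

/* Each active recursive call has its own copy of 'local', at a
 * different stack address. */
static void countdown(int n)
{
    int local = n;
    printf("n=%d  &local=%p\n", n, (void *)&local);
    if (n > 0)
        countdown(n - 1);
}

int main(void)
{
    countdown(3);
    return 0;
}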
I have a function, and I'm accessing a struct's members many times in it.
What I was wondering is: what's the good practice here?
For example:
struct s
{
    int x;
    int y;
};
and I have allocated memory for 10 objects of that struct using malloc.
So, whenever I need to use only one of the objects in a function, I usually create a pointer (or one is passed as an argument) and point it to the required object. (My superior told me to avoid array indexing because it adds a calculation when accessing any member of the struct.)
But is this the right way? I understand that dereferencing is not as expensive as creating a copy, but what if I'm dereferencing many times (like 20 to 30) in the function?
Would it be better if I created temporary variables for the struct members (only the ones I need; I certainly don't use all of them), copied over the values, and then set the actual struct's values before returning? (Both approaches are sketched just below.)
Also, is this unnecessary micro-optimization? Please note that this is for embedded devices.
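For concreteness, the two approaches under discussion might look like this (a sketch; the process_* function names are made up for illustration):

#include <stdlib.h>

struct s { int x; int y; };

/* (a) work through a pointer to the chosen object */
static void process_ptr(struct s *p)
{
    p->x += 1;
    p->y += p->x;
}

/* (b) copy the needed members to locals, write back at the end */
static void process_temps(struct s *p)
{
    int x = p->x, y = p->y;
    x += 1;
    y += x;
    p->x = x;
    p->y = y;
}

int main(void)
{
    struct s *arr = malloc(10 * sizeof *arr);
    if (!arr)
        return 1;
    arr[3] = (struct s){ 1, 2 };
    process_ptr(&arr[3]);      /* pass a pointer to one object */
    process_temps(&arr[3]);
    free(arr);
    return 0;
}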
This is for an embedded system. So, I can't make any assumptions about what the compiler will do. I can't make any assumptions about word size, or the number of registers, or the cost of accessing off the stack, because you didn't tell me what the architecture is. I used to do embedded code on 8080s when they were new...
OK, so what to do?
Pick a real section of code and code it up. Code it up in each of the different ways you have listed above. Compile it. Find the compiler option that forces it to print out the assembly code that is produced. Compile each piece of code with every different set of optimization options. Grab the reference manual for the processor and count the cycles used by each case. A minimal test case is sketched below.
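For example (a sketch only; the flag for assembly output varies by compiler, e.g. -S for gcc/cc):

/* access.c - compare the generated assembly for the two access styles:
 *   cc -S -O2 access.c    (then repeat with -O0, -O1, -Os, ...) */
struct s { int x; int y; };

int sum_via_pointer(const struct s *p)
{
    return p->x + p->y;             /* pointer dereference */
}

int sum_via_index(const struct s *arr, int i)
{
    return arr[i].x + arr[i].y;     /* array indexing      */
}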
Now you will have real data on which to base a decision. Real data is much better than the opinions of a million highly experienced expert programmers. Sit down with your lead programmer and show him the code and the data. He may well show you better ways to code it. If so, recode it his way, compile it, and count the cycles used by his code. Show him how his way worked out.
At the very worst you will have spent a weekend learning something very important about the way your compiler works. You will have examined N ways to code things times M different sets of optimization options. You will have learned a lot about the instruction set of the machine. You will have learned how good, or bad, the compiler is. You will have had a chance to get to know your lead programmer better. And, you will have real data.
Real data is the kind of data that you must have to answer this question. Without that data, nothing anyone tells you is anything but an ego-based guess. Data answers the question.
Bob Pendleton
First of all, indexing an array is not very expensive (at most one operation more expensive than a pointer dereference, and sometimes no more expensive at all, depending on the situation).
Secondly, most compilers will perform what is called RVO or return value optimisation when returning structs by value. This is where the caller allocates space for the return value of the function it calls, and secretly passes the address of that memory to the function for it to use, and the effect is that no copies are made. It does this automatically, so
struct mystruct blah = func();
constructs only one object, which func fills in transparently to the programmer, and no copying need be done.
What I do not know is: if you assign the return value of the function to an array element, like this:
someArray[0] = func();
will the compiler pass the address of someArray[0] and do RVO that way, or will it just not do that optimisation? You'll have to get a more experienced programmer to answer that. I would guess that the compiler is smart enough to do it though, but it's just a guess.
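To picture the mechanism, here is a hand-written equivalent of what RVO effectively does (just a sketch; the real transformation happens inside the compiler):

struct mystruct { int a[16]; };

/* Hypothetical hand-written equivalent of RVO: the caller passes the
 * address of its own variable, and the function constructs the result
 * directly in that memory, so no temporary and no copy are needed. */
static void func_rvo(struct mystruct *ret)
{
    for (int i = 0; i < 16; ++i)
        ret->a[i] = i;
}

int main(void)
{
    struct mystruct blah;    /* caller allocates the space */
    func_rvo(&blah);         /* func fills it in place     */
    return blah.a[5];
}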
And yes, I would call it micro optimisation. But we're C programmers. And that's how we roll.
Generally, the case in which you want to make a copy of a passed struct in C is when you want to manipulate the data locally. That is to say, you want your changes not to be reflected in the struct itself, but rather only in the return value. As for which is more expensive, it depends on a lot of things, many of which change from implementation to implementation, so I would need more specific information to be more helpful. I would expect, though, that in an embedded environment your memory is at a greater premium than your processing power. Really, this reads like needless micro-optimization; your compiler should handle it.
In this case creating a temp variable on the stack will be faster. But if your structure is much bigger, then you might be better off with dereferencing.
For a long time, I have been thinking about and studying the output of the C language compiler in assembler form, as well as CPU architecture. I know this may seem silly to you, but it seems to me that something is very ineffective. Please don't be angry if I am wrong and there is some reason I do not see for all these principles. I will be very glad if you tell me why it is designed this way. I actually truly believe I am wrong; I know the genius minds of the people who put PCs together had a reason to do so. What exactly, you ask? I'll tell you right away, using C as an example:
1: Stack local scope memory allocation:
So, typical local memory allocation uses the stack: just copy esp to ebp and then allocate all the memory via ebp. OK, I would understand this if you explicitly needed to allocate RAM at default stack addresses, but if I understand it correctly, modern OSes use paging as a translation layer between the application and physical RAM, where the address you request is translated before reaching the actual RAM byte. So why not just say 0x00000000 is int a, 0x00000004 is int b, and so on? And access them just by mov 0x00000000,#10? Because you won't actually access memory blocks 0x00000000 and 0x00000004, but those your OS points the paging tables to. Actually, since memory allocation via ebp and esp uses indirect addressing, "my" way would even be faster.
2: Variable allocation duplicity:
When you run an application, the loader loads its code into RAM. When you create a variable, or a string, the compiler generates code that pushes these values onto the top of the stack when they are created in main. So there is an actual instruction for doing so, and that actual number in memory. So there are two entries of the same value in RAM: one in the form of an instruction, the second in the form of actual bytes in RAM. But why? Why not just, when declaring the variable, count at which memory block it would be, and then, when it is used, just insert that memory location?
How would you implement recursive functions? What you are describing is equivalent to using global variables everywhere.
That's just one problem. How can you link to a precompiled object file and be sure it won't corrupt the memory of your procedures?
Because C (and most other languages) support recursion, so a function can call itself, and each call of the function needs separate copies of any local variables. Also, on most current processors, your way would actually be slower -- indirect addressing is so common that processors are optimized for it.
You seem to want the behavior that C gives (or at least allows) for string literals. There are good and bad points to this, such as the fact that even though you've defined a "variable", you can't actually modify its contents (without affecting other variables that are pointing at the same location).
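For example (a sketch; whether the two literals actually share storage is implementation-defined):

#include <stdio.h>

int main(void)
{
    const char *p = "hello";
    const char *q = "hello";   /* may point at the very same bytes */
    printf("p=%p q=%p\n", (void *)p, (void *)q);
    /* ((char *)p)[0] = 'H';  -- undefined behaviour: the literal is
     * effectively read-only and possibly shared with q */
    return 0;
}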
The answers to your questions are mostly wrapped up in the different semantics of different storage classes:
Google "data segment".
Think about the difference in behavior between global and local variables.
Think about how constant and non-constant variables have different requirements when functions are called repeatedly (or, as Mehrdad says, recursively).
Think about the difference between static and non-static automatic variables, again in the context of multiple or recursive calls (a short sketch follows).
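A sketch of that last distinction: each call gets a fresh automatic variable, while a static one is shared across all calls (the names below are made up for illustration).

#include <stdio.h>

/* 'calls' has static storage: one instance for the whole program.
 * 'depth_copy' is automatic: a fresh instance per call, on the stack. */
static void recurse(int depth)
{
    static int calls = 0;
    int depth_copy = depth;
    ++calls;
    if (depth < 3)
        recurse(depth + 1);
    printf("depth_copy=%d calls=%d &depth_copy=%p\n",
           depth_copy, calls, (void *)&depth_copy);
}

int main(void)
{
    recurse(0);
    return 0;
}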
Since you are comparing assembler and C (which are very close together from an architectural standpoint), I'm inclined to say that you're describing micro-optimization, which is meaningless unless you profile the code to see if it performs better.
In general, programming languages are evolving towards a more declarative style (i.e. telling the computer what you want done, rather than how you want it done). When you program in an imperative language (like assembly or C), you specify in extreme detail how you want the problem solved. This gives the compiler little room to make optimization decisions on your behalf.
However, as the languages become more declarative, the compilers are getting smarter, because we are giving them the room they need to make more intelligent performance optimizations.
If every function put its first variable at offset 0 and so on, then you would have to change the memory mapping each time you enter a function (you could not allocate all variables to unique addresses if you want recursion). This is doable, but with current hardware it's very slow. Furthermore, the address translation performed by the virtual memory is not free either; it's actually quite complicated to implement efficiently.
Addressing off ebp (or any other register) costs having a mux (to select the register) and an adder (to add the offset to the register). The time taken for this can often be overlapped with other operations.
If you want to be able to modify the static value, you have to copy it to the stack. If you don't (saying it's 'const'), then a good C compiler will not copy it to the stack.
I know this is a more "heavy" question, but I think it's interesting too. It was part of my previous questions about compiler functions, but back then I explained it very badly, and many answered just my first question, so here it is:
So, if my knowledge is correct, modern Windows systems use paging as a way to switch tasks and ensure that each task has its appropriate place in memory. So, every process gets its own space starting from 0.
When multitasking takes effect, the kernel has to save all important registers to the task's stack, I believe, then save the current stack pointer, change the page entry to switch to another process's physical address space, load the new process's stack pointer, pop the saved registers, and continue by jumping to the popped instruction pointer address.
Because of this nice feature (paging), every process thinks it has nice flat memory within reach. So there are no far jumps, far pointers, memory segments or data segments. All is nice and linear.
But when there is no more segmentation for the process, why do compilers still create variables on the stack, or, when global, directly in another memory space, rather than directly in the program code?
Let me give an example. I have the C code:
int a=10;
which gets translated into (Intel syntax):
mov [position of a],#10
But then you actually occupy more bytes in RAM than needed, because the first few bytes are taken by the actual instruction, and after that instruction has executed, there is a new byte containing the value 10.
Why, instead of this, when there is no need to switch any segment (which would slow the process down), isn't the value of 10 just coded directly into the program, like this:
xor eax,eax //just some instruction
10 //the value inserted into the program
call end //just some instruction
Because the compiler knows the exact position of every instruction, when operating with that variable, it would just use its address.
I know that const variables do this, but they are not really variables, since you cannot change them.
I hope I explained my question well, but I am still learning English, so forgive my syntactical and even semantic errors.
EDIT:
I have read your answers, and it seems that based on those I can modify my question:
So, someone here said that a global variable is actually a piece of data attached directly to the program. I mean, when a variable is global, is it attached to the end of the program, or is it just created like a local one at execution time, but directly on the heap instead of on the stack?
If the first case - attached to the program itself - why do local variables even exist? I know you will tell me it's because of recursion, but that is not the case: when you call a function, you can push any memory space onto the stack, so there is no program code there.
I hope you understand me: there is always an inefficient use of memory when some value (even 0) is created on the stack by an instruction, because you need space in the program for that instruction and then for the actual variable. Like so:
push #5 //instruction that says to create local variable with integer 5
And then this instruction just makes the number 5 appear on the stack. Please help me, I really want to know why it's this way. Thanks.
Consider:
local variables may have more than one simultaneous existence if a routine is called recursively (even indirectly in, say, a recursive descent parser) or from more than one thread, and these cases occur in the same memory context
marking the program memory non-writable and the stack+heap as non-executable is a small but useful defense against certain classes of attacks (stack smashing...) and is used by some OSes (I don't know if Windows does this, however)
Your proposal doesn't allow for either of these cases.
So there are no far jumps, far pointers, memory segments or data segments. All is nice and linear.
Yes and no. Different program segments have different purposes - despite the fact that they reside within flat virtual memory. E.g. the data segment is readable and writable, but you can't execute data. The code segment is readable and executable, but you can't write into it.
why do compilers still create variables on the stack, [...] rather than directly in program code?
Simple.
The code segment isn't writable - for safety reasons first, and second because most CPUs do not like having the code segment written into, as it breaks many existing optimizations used to accelerate execution.
The state of the function has to be private to the function, due to things like recursion and multi-threading.
why isn't the value of 10 just coded directly into the program like this
Modern CPUs prefetch instructions to allow things like parallel execution and out-of-order execution. Putting garbage (to the CPU, that is garbage) into the code segment would simply diminish (or flat-out cancel) the effect of these techniques, and they are responsible for the lion's share of the performance gains CPUs have shown in the past decade.
when there is no need to switch any segment
So if there is no overhead in switching segments, why then put that into the code segment? There is no problem keeping it in the data segment.
Especially in case of read-only data segment, it makes sense to put all read-only data of the program into one place - since it can be shared by all instances of the running application, saving physical RAM.
Because the compiler knows the exact position of every instruction, when operating with that variable, it would just use its address.
No, not really. Most code is relocatable or position-independent. The code is patched with real memory addresses when the OS loads it into memory. Actually, special techniques are used to avoid patching the code at all, so that the code segment, too, can be shared by all running application instances.
The ABI is responsible for defining how and what the compiler and linker are supposed to do for a program to be executable on the complying OS. I haven't seen the Windows ABI, but the ABIs used by Linux are easy to find: search for "AMD64 ABI". Even reading the Linux ABI might answer some of your questions.
What you are talking about is optimization, and that is the compiler's business. If nothing ever changes that value, and the compiler can figure that out, then the compiler is perfectly free to do just what you say (unless a is declared volatile).
Now if you are saying that you are seeing that the compiler isn't doing that, and you think it should, you'd have to talk to your compiler writer. If you are using VisualStudio, their address is One Microsoft Way, Redmond WA. Good luck knocking on doors there. :-)
Why isn't the value of 10 just coded directly into the program like this:
xor eax,eax //just some instruction
10 //the value inserted into the program
call end //just some instruction
That is how global variables are stored. However, instead of being stuck in the middle of executable code (which is messy, and not even possible nowadays), they are stored just after the program code in memory (in Windows and Linux, at least), in what's called the .data section.
When it can, the compiler will move variables to the .data section to optimize performance. However, there are several reasons it might not:
Some variables cannot be made global, including instance variables for a class, parameters passed into a function (obviously), and variables used in recursive functions.
The variable still exists in memory somewhere, and still must have code to access it. Thus, memory usage will not change. In fact, on the x86 ("Intel"), according to this page the instruction to reference a local variable:
mov eax, [esp+8]
and the instruction to reference a global variable:
mov eax, [0xb3a7135]
both take 1 (one!) clock cycle.
The only advantage, then, is that if every local variable is global, you wouldn't have to make room on the stack for local variables.
Adding a variable to the .data segment may actually increase the size of the executable, since the variable is actually contained in the file itself.
As caf mentions in the comments, stack-based variables only exist while the function is running - global variables take up memory during the entire execution of the program.
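To make the placement concrete, a small sketch (the section names in the comments are the typical ELF ones; other platforms differ):

int global_var = 10;      /* .data section, baked into the binary        */
const int ro_var = 20;    /* typically .rodata: read-only, shareable     */

int f(void)
{
    int local_var = 30;   /* stack: exists only while f() is running     */
    return global_var + ro_var + local_var;
}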
Not quite sure what your confusion is?
int a = 10; means: make a spot in memory, and put the value 10 at that memory address
if you want a to be 10
#define a 10
though more typically
#define TEN 10
Variables have storage space and can be modified. It makes no sense to stick them in the code segment, where they cannot be modified.
If you have code with int a=10 or even const int a=10, the compiler cannot convert code which references 'a' to use the constant 10 directly, because it has no way of knowing whether 'a' may be changed behind its back (even const variables can be changed, albeit via undefined behaviour). For example, one way 'a' can be changed without the compiler knowing is if you have a pointer which points to 'a'. Pointer values are not fixed until runtime, so the compiler cannot determine at compile time whether there will be a pointer which will point to and modify 'a'.
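A short sketch of that aliasing problem (mystery() is hypothetical, standing in for any function whose body the compiler can't see):

void mystery(int *p);    /* hypothetical: defined in another translation unit */

int f(void)
{
    int a = 10;
    mystery(&a);         /* 'a' may have been modified through the pointer */
    return a;            /* so the compiler must reload 'a', not return 10 */
}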
I read about function pointers in C.
And everyone said that they will make my program run slow.
Is it true?
I made a program to check it.
And I got the same results in both cases (I measured the time).
So, is it bad to use function pointers?
Thanks in advance.
To respond to some of you:
I said 'run slow' with respect to the time that I compared in a loop,
like this:
int end = 1000;
int i = 0;
while (i < end) {
    fp = func;
    fp();
    i++;    /* without this the loop would never terminate */
}
When I execute this, I get the same time as with this:
while (i < end) {
    func();
    i++;
}
So I think that the function pointer makes no difference in time,
and it doesn't make a program run slow, as many people said.
You see, in situations that actually matter from the performance point of view, like calling the function repeatedly many times in a cycle, the performance might not be different at all.
This might sound strange to people who are used to thinking of C code as something executed by an abstract C machine whose "machine language" closely mirrors the C language itself. In such a context, "by default" an indirect call to a function is indeed slower than a direct one, because it formally involves an extra memory access in order to determine the target of the call.
However, in real life the code is executed by a real machine and compiled by an optimizing compiler that has a pretty good knowledge of the underlying machine architecture, which helps it to generate the most optimal code for that specific machine. And on many platforms it might turn out that the most efficient way to perform a function call from a cycle actually results in identical code for both direct and indirect call, leading to the identical performance of the two.
Consider, for example, the x86 platform. If we "literally" translate a direct and indirect call into machine code, we might end up with something like this
// Direct call
do-it-many-times
call 0x12345678
// Indirect call
do-it-many-times
call dword ptr [0x67890ABC]
The former uses an immediate operand in the machine instruction and is indeed normally faster than the latter, which has to read the data from some independent memory location.
At this point let's remember that x86 architecture actually has one more way to supply an operand to the call instruction. It is supplying the target address in a register. And a very important thing about this format is that it is normally faster than both of the above. What does this mean for us? This means that a good optimizing compiler must and will take advantage of that fact. In order to implement the above cycle, the compiler will try to use a call through a register in both cases. If it succeeds, the final code might look as follows
// Direct call
mov eax, 0x12345678
do-it-many-times
call eax
// Indirect call
mov eax, dword ptr [0x67890ABC]
do-it-many-times
call eax
Note, that now the part that matters - the actual call in the cycle body - is exactly and precisely the same in both cases. Needless to say, the performance is going to be virtually identical.
One might even say, however strange it might sound, that on this platform a direct call (a call with an immediate operand in call) is slower than an indirect call as long as the operand of the indirect call is supplied in a register (as opposed to being stored in memory).
Of course, the whole thing is not as easy in the general case. The compiler has to deal with limited availability of registers, aliasing issues, etc. But in such simplistic cases as the one in your example (and even in much more complicated ones) the above optimization will be carried out by a good compiler and will completely eliminate any difference in performance between a cyclic direct call and a cyclic indirect call. The optimization works especially well in C++ when calling a virtual function, since in a typical implementation the pointers involved are fully controlled by the compiler, giving it full knowledge of the aliasing picture and other relevant stuff.
Of course, there's always a question of whether your compiler is smart enough to optimize things like that...
I think when people say this they're referring to the fact that using function pointers may prevent compiler optimizations (inlining) and processor optimizations (branch prediction). However, if function pointers are an effective way to accomplish something that you're trying to do, chances are that any other method of doing it would have the same drawbacks.
And unless your function pointers are being used in tight loops in a performance critical application or on a very slow embedded system, chances are the difference is negligible anyway.
And everyone said that will make my program run slow. Is it true?
Most likely this claim is false. For one, if the alternative to using function pointers is something like
if (condition1) {
    func1();
} else if (condition2) {
    func2();
} else if (condition3) {
    func3();
} else {
    func4();
}
this is most likely relatively much slower than just using a single function pointer. While calling a function through a pointer does have some (typically negligible) overhead, it is normally not the direct-function-call versus through-pointer-call difference that is relevant to compare.
And secondly, never optimize for performance without measurements. Knowing where the bottlenecks are is very difficult (read: impossible) without measuring, and the findings can be quite non-intuitive (for instance, the Linux kernel developers have started removing the inline keyword from functions because it actually hurt performance).
A lot of people have put in some good answers, but I still think there's a point being missed. Function pointers do add an extra dereference which makes them several cycles slower, and that number can increase based on poor branch prediction (which, incidentally, has almost nothing to do with the function pointer itself). Additionally, functions called via a pointer cannot be inlined. But what people are missing is that most people use function pointers as an optimization.
The most common place you will find function pointers in C/C++ APIs is as callback functions. The reason so many APIs do this is that writing a system that invokes a function pointer whenever events occur is much more efficient than other methods like message passing. Personally, I've also used function pointers as part of a more complex input-processing system, where each key on the keyboard has a function pointer mapped to it via a jump table (sketched below). This allowed me to remove any branching or logic from the input system and merely handle the key press coming in.
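A minimal sketch of such a keyboard jump table (all names are made up for illustration):

#include <stdio.h>

typedef void (*key_handler)(void);

static void on_up(void)    { puts("up");   }
static void on_down(void)  { puts("down"); }
static void on_other(void) { /* ignore */  }

static key_handler handlers[256];

int main(void)
{
    for (int i = 0; i < 256; ++i)
        handlers[i] = on_other;       /* default: do nothing */
    handlers['w'] = on_up;
    handlers['s'] = on_down;

    int key = 'w';                    /* stand-in for a real key event */
    handlers[(unsigned char)key]();   /* dispatch without any if/else  */
    return 0;
}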
Calling a function via a function pointer is somewhat slower than a static function call, since the former includes an extra pointer dereference. But AFAIK this difference is negligible on most modern machines (except maybe some special platforms with very limited resources).
Function pointers are used because they can make the program much simpler, cleaner and easier to maintain (when used properly, of course). This more than makes up for the possible very minor speed difference.
A lot of good points in earlier replies.
However, take a look at the C qsort() comparison function. Because the comparison function cannot be inlined and needs to follow standard stack-based calling conventions, the total running time for the sort can be an order of magnitude (more exactly, 3-10x) slower for integer keys than otherwise identical code with a direct, inlineable call.
A typical inlined comparison would be a sequence of simple CMP and possibly CMOV/SET instructions. A function call also incurs the overhead of a CALL, setting up the stack frame, doing the comparison, tearing down the stack frame and returning the result. Note that the stack operations can cause pipeline stalls due to CPU pipeline length and virtual registers: for example, if the value of eax is needed before the instruction that last modified eax has finished executing (which typically takes about 12 clock cycles on the newest processors), the CPU must either execute other instructions out of order or stall the pipeline.
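For reference, the standard qsort() usage being described - the comparator is invoked through a function pointer, so it normally can't be inlined into the sort loop:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids the overflow of x - y */
}

int main(void)
{
    int v[] = { 3, 1, 2 };
    /* cmp_int is passed as a function pointer: one out-of-line,
     * non-inlinable call per comparison. */
    qsort(v, 3, sizeof v[0], cmp_int);
    printf("%d %d %d\n", v[0], v[1], v[2]);
    return 0;
}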
Using a function pointer is slower than just calling a function, as it is another layer of indirection (the pointer needs to be dereferenced to get the memory address of the function). While it is slower, compared to everything else your program may do (reading a file, writing to the console) it is negligible.
If you need to use function pointers, use them, because anything that tries to do the same thing while avoiding them will be slower and less maintainable than using function pointers.
Possibly.
The answer depends on what the function pointer is being used for and hence what the alternatives are. Comparing function pointer calls to direct function calls is misleading if a function pointer is being used to implement a choice that's part of our program logic and which can't simply be removed. I'll go ahead and nonetheless show that comparison and come back to this thought afterwards.
Function pointer calls have the most opportunity to degrade performance compared to direct function calls when they inhibit inlining. Because inlining is a gateway optimization, we can craft wildly pathological cases where function pointers are made arbitrarily slower than the equivalent direct function call:
void foo(int* x) {
    *x = 0;
}

void (*foo_ptr)(int*) = foo;

int call_foo(int *p, int size) {
    int r = 0;
    for (int i = 0; i != size; ++i)
        r += p[i];
    foo(&r);
    return r;
}

int call_foo_ptr(int *p, int size) {
    int r = 0;
    for (int i = 0; i != size; ++i)
        r += p[i];
    foo_ptr(&r);
    return r;
}
Code generated for call_foo():
call_foo(int*, int):
    xor eax, eax
    ret
Nice. foo() has not only been inlined, but doing so has allowed the compiler to eliminate the entire preceding loop! The generated code simply zeroes out the return register by XORing the register with itself and then returns. On the other hand, compilers will have to generate code for the loop in call_foo_ptr() (100+ lines with gcc 7.3) and most of that code effectively does nothing (so long as foo_ptr still points to foo()). (In more typical scenarios, you can expect that inlining a small function into a hot inner loop might reduce execution time by up to about an order of magnitude.)
So in a worst case scenario, a function pointer call is arbitrarily slower than a direct function call, but this is misleading. It turns out that if foo_ptr had been const, then call_foo() and call_foo_ptr() would have generated the same code. However, this would require us to give up the opportunity for indirection provided by foo_ptr. Is it "fair" for foo_ptr to be const? If we're interested in the indirection provided by foo_ptr, then no, but if that's the case, then a direct function call is not a valid option either.
If a function pointer is being used to provide useful indirection, then we can move the indirection around or in some cases swap out function pointers for conditionals or even macros, but we can't simply remove it. If we've decided that function pointers are a good approach but performance is a concern, then we typically want to pull indirection up the call stack so that we pay the cost of indirection in an outer loop. For example, in the common case where a function takes a callback and calls it in a loop, we might try moving the innermost loop into the callback (and changing the responsibility of each callback invocation accordingly).
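A sketch of pulling the indirection up (the names are made up for illustration): instead of one indirect call per element, the indirect call happens once and the callback owns the inner loop.

typedef void (*item_fn)(int item);
typedef void (*batch_fn)(const int *items, int n);

/* Indirection inside the loop: one indirect call per element. */
void for_each(const int *items, int n, item_fn f)
{
    for (int i = 0; i < n; ++i)
        f(items[i]);
}

/* Indirection pulled out of the loop: a single indirect call; the
 * callback now runs the inner loop and can be optimized freely. */
void process_batch(const int *items, int n, batch_fn f)
{
    f(items, n);
}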
Is there any method to calculate the size of a function? I have a pointer to a function, and I have to copy the entire function using memcpy. I have to malloc some space and know the 3rd parameter of memcpy - the size. I know that sizeof(function) doesn't work. Do you have any suggestions?
Functions are not first-class objects in C, which means they can't be passed to another function, they can't be returned from a function, and they can't be copied into another part of memory.
A function pointer, though, can satisfy all of this, and is a first-class object. A function pointer is just a memory address, and it usually has the same size as any other pointer on your machine.
It doesn't directly answer your question, but you should not implement call-backs from kernel code to user-space.
Injecting code into kernel-space is not a great work-around either.
It's better to treat the user/kernel barrier like an inter-process barrier: pass data, not code, back and forth via a well-defined protocol through a char device. If you really need to pass code, just wrap it up in a kernel module. You can then dynamically load/unload it, just like a .so-based plugin system.
On a side note, at first I misread that you wanted to pass memcpy() to the kernel. You have to remember that it is a very special function: it is defined in the C standard, quite simple, and of quite broad scope, so it is a perfect target to be provided as a built-in by the compiler.
Just like strlen(), strcmp() and others in GCC.
That said, the fact that it is a built-in does not impede your ability to take a pointer to it.
Even if there were a way to get the sizeof() a function, it might still fail when you try to call a version that has been copied to another area of memory. What if the compiler emitted local or long jumps to specific memory locations? You can't just move a function in memory and expect it to run. The OS can do that, but it has all the information it takes to do it.
I was going to ask how operating systems do this but, now that I think of it, when the OS moves stuff around it usually moves a whole page and handles memory such that addresses translate to a page/offset. I'm not sure even the OS ever moves a single function around in memory.
Even in the case of the OS moving a function around in memory, the function itself must be declared or otherwise compiled/assembled to permit such action, usually through a pragma that indicates the code is relocatable. All the memory references need to be relative to its own stack frame (aka local variables) or include some sort of segment+offset structure such that the CPU, either directly or at the behest of the OS, can pick the appropriate segment value. If there was a linker involved in creating the app, the app may have to be re-linked to account for the new function address.
There are operating systems which can give each application its own 32-bit address space but it applies to the entire process and any child threads, not to an individual function.
As mentioned elsewhere, you really need a language where functions are first class objects, otherwise you're out of luck.
You want to copy a function? I do not think that this is possible in C generally.
Assume you have a Harvard-architecture microcontroller, where code (in other words, "functions") is located in ROM. In this case you cannot do that at all.
Also, I know several compilers and linkers which optimize at the file level (not only the function level). This results in opcode where parts of C functions are interleaved with each other.
The only way I consider possible is:
1. Generate the opcode of your function (e.g. by compiling/assembling it on its own).
2. Copy that opcode into a C array.
3. Use a proper function pointer, pointing to that array, to call this function.
Now you can perform all operations common to typical "data" on that array.
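On a desktop OS there is an extra wrinkle: an ordinary array is not executable, so the idea additionally needs executable memory. A minimal POSIX sketch of the idea - the machine-code bytes are a hypothetical x86-64 "return 42", calling data as code is undefined behaviour by the C standard, and hardened systems may refuse writable+executable mappings:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical opcode array: x86-64 for 'mov eax, 42; ret'. */
static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

int main(void)
{
    /* Obtain a page that is both writable and executable. */
    void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, code, sizeof code);

    /* Casting data to a function pointer is not sanctioned by ISO C,
     * but works on typical x86-64 ABIs. */
    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());

    munmap(buf, sizeof code);
    return 0;
}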
But apart from this: did you consider a redesign of your software, so that you do not need to copy a function's contents?
I don't quite understand what you are trying to accomplish, but assuming you compile with -fPIC and the function doesn't do anything fancy - no calls to other functions, no access to data outside the function - you might even get away with it. I'd say the safest possibility is to limit the maximum size of a supported function to, say, 1 kilobyte, just transfer that, and disregard the trailing junk.
If you really needed to know the exact size of a function, figure out your compiler's epilogue and prologue. This should look something like this on x86:
:your_func_epilogue
mov esp, ebp
pop ebp
ret
:end_of_func
;expect a varying length run of NOPs here
:next_func_prologue
push ebp
mov ebp, esp
Disassemble your compiler's output to check, and take the corresponding assembled sequences to search for. The epilogue alone might be enough, but all of this can bomb if the searched-for sequence pops up too early, e.g. in data embedded in the function. Searching for the next prologue might also get you into trouble, I think.
Now please ignore everything that I wrote, since you apparently are trying to approach the problem in the wrong and inherently unsafe way. Paint us a larger picture please - WHY are you trying to do that? - and let's see whether we can figure out an entirely different approach.
A similar discussion took place here:
http://www.motherboardpoint.com/getting-code-size-function-c-t95049.html
They propose creating a dummy function after your function-to-be-copied, and then getting the memory pointers to both. But you need to switch off compiler optimizations for it to work.
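Under those assumptions, the dummy-function technique might look like this (a sketch only: the C standard does not guarantee that functions are laid out adjacently and in order, which is why optimizations must be disabled):

#include <stddef.h>

void func_to_copy(void)
{
    /* ... the function to be measured/copied ... */
}

/* Dummy function used purely as an end marker. */
void func_end_marker(void) { }

size_t func_size(void)
{
    /* Only meaningful if the compiler/linker kept the two functions
     * adjacent and in source order; the pointer casts themselves are
     * a common extension, not strictly portable C. */
    return (size_t)((char *)func_end_marker - (char *)func_to_copy);
}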
If you have GCC >= 4.4, you could try switching off the optimizations for your function in particular using #pragma:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas
Another proposed solution was not to copy the function at all, but to define the function in the place where you would want to copy it to.
Good luck!
If your linker doesn't do global optimizations, then just calculate the difference between the function pointer and the address of the next function.
Note that copying the function will produce something which can't be invoked unless your code is compiled to be relocatable (i.e. all addresses in the code must be relative, for example branches; globals work, though, since they don't move).
It sounds like you want to have a callback from your kernel driver to userspace, so that it can inform userspace when some asynchronous job has finished.
That might sound sensible, because it's the way a regular userspace library would probably do things - but for the kernel/userspace interface, it's quite wrong. Even if you manage to get your function code copied into the kernel, and even if you make it suitably position-independent, it's still wrong, because the kernel and userspace code execute in fundamentally different contexts. For just one example of the differences that might cause problems: if a page fault happens in kernel context due to a swapped-out page, that will cause a kernel oops rather than swapping the page in.
The correct approach is for the kernel to make some file descriptor readable when the asynchronous job has finished (in your case, this file descriptor would almost certainly be the character device your driver provides). The userspace process can then wait for this event with select / poll, or with read - it can set the file descriptor non-blocking if it wants, and basically just use all the standard UNIX tools for dealing with this case. This, after all, is how the asynchronous nature of network sockets (and pretty much every other asynchronous case) is handled.
If you need to provide additional information about the event that occurred, that can be made available to the userspace process when it calls read on the readable file descriptor.
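A sketch of the userspace side, assuming a hypothetical /dev/mydriver char device and a driver that returns event data from read():

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydriver", O_RDONLY);   /* hypothetical device */
    if (fd < 0)
        return 1;

    /* Block until the driver marks the descriptor readable. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
        char event[64];
        ssize_t n = read(fd, event, sizeof event);  /* event format is driver-defined */
        if (n > 0)
            printf("kernel reported %zd bytes of event data\n", n);
    }

    close(fd);
    return 0;
}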
A function isn't just an object you can copy. What about cross-references/symbols and so on? Of course you can take something like the standard Linux "binutils" package and torture your binaries, but is that what you want?
By the way, if you are simply trying to replace the memcpy() implementation, look into the LD_PRELOAD mechanism.
I can think of a way to accomplish what you want, but I won't tell you because it's a horrific abuse of the language.
A cleaner method than disabling optimizations and relying on the compiler to maintain the order of functions is to arrange for that function (or a group of functions that need copying) to be in its own section. This is compiler- and linker-dependent, and you'll also need to use relative addressing if you call between the functions that are copied. For those asking why you would do this: it's a common requirement in embedded systems that need to update the running code.
My suggestion is: don't.
Injecting code into kernel space is such an enormous security hole that most modern OSes forbid self-modifying code altogether.
As near as I can tell, the original poster wants to do something that is implementation-specific, and so not portable; this is going off what the C++ standard says on the subject of casting pointers-to-functions, rather than the C standard, but that should be good enough here.
In some environments, with some compilers, it might be possible to do what the poster seems to want to do (that is, copy a block of memory that is pointed to by the pointer-to-function to some other location, perhaps allocated with malloc, cast that block to a pointer-to-function, and call it directly). But it won't be portable, which may not be an issue. Finding the size required for that block of memory is itself dependent on the environment, and compiler, and may very well require some pretty arcane stuff (e.g., scanning the memory for a return opcode, or running the memory through a disassembler). Again, implementation-specific, and highly non-portable. And again, may not matter for the original poster.
The links to potential solutions all appear to make use of implementation-specific behaviour, and I'm not even sure that they do what they purport to do, but they may be suitable for the OP.
Having beaten this horse to death, I am curious to know why the OP wants to do this. It would be pretty fragile even if it worked in the target environment (e.g., it could break with changes to compiler options, compiler version, code refactoring, etc.). I'm glad that I don't do work where this sort of magic is necessary (assuming that it is)...
I have done this on a Nintendo GBA, where I copied some low-level render functions from flash (16-bit access, slowish memory) to the high-speed working RAM (32-bit access, at least twice as fast). This was done by taking the address of the function immediately after the function I wanted to copy: size = (int)(NextFuncPtr - SourceFuncPtr). This worked well, but obviously can't be guaranteed on all platforms (it does not work on Windows, for sure).
I think one solution could be as below.
For example: if you want to know the size of func() in program a.c, place indicators before and after the function.
Try writing a Perl script which compiles this file into object format (cc -c); make sure the indicator statements are not removed, since you need them later to calculate the size from the object file.
Now search for your two indicators and find the code size in between.