I am starting to figure out some basic ideas about memory reading and writing (let's assume that the data we read or write have not been cached yet).
For the following code:
int a = 1;
It's definitely a write, since we write the value '1' to the memory place of variable 'a'.
But for the following code:
int a, b;
a = 1;
b = a;
When we execute the statement "b = a;", do we actually perform one read and one write?
To my understanding, it's one read and one write, since we have to load the value of 'a' first and then write that value to 'b'.
I'm not sure if my understanding is correct. Please help me clarify these basic ideas.
Many thanks for the help.
Let's assume that the data we read or write have not been cached yet
I don't see how cache is pertinent to this.
When we execute the statement "b = a;", do we actually perform one read and one write?
Correct.
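At the C level, that copy can be pictured as a load followed by a store. Here is a minimal sketch (the helper name and the volatile qualifiers are mine, used only to keep the compiler from eliding the accesses):

```c
#include <assert.h>

/* A copy like "b = a;" is conceptually one read and one write.
   volatile keeps the compiler from optimizing the accesses away. */
static int copy_value(const volatile int *src, volatile int *dst)
{
    *dst = *src;   /* read *src (one load), then write *dst (one store) */
    return *dst;
}
```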
However, C is not like assembly language. C statements don't map 1-to-1 to machine instructions. There is the as-if rule: basically, the compiler can generate whatever machine code it likes, as long as the observable behavior of the program is preserved.
For instance:
auto foo()
{
    int a = 24;
    int b = 11;
    int c = a + b;
    return c;
}
The compiler is free to compile the above to
foo():
    mov eax, 35
    ret
And compilers actually do this (with optimizations enabled). As you can see, there is no memory read/write, just a write to the eax register (where the return value of the function must be put), and the value is an immediate (35).
Related
Say I have a tight loop in C, within which I use the value of a global variable to do some arithmetics, e.g.
double c;
// ... initialize c somehow ...
double f(double *a, int n) {
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        sum += a[i] * c;
    }
    return sum;
}
with c the global variable. Is c "read anew from global scope" in each loop iteration? After all, it could've been changed by some other thread executing some other function, right? Hence would the code be faster by taking a local (function stack) copy of c prior to the loop and only use this copy?
double f(double *a, int n) {
    double sum = 0.0;
    int i;
    double c_cp = c;
    for (i = 0; i < n; i++) {
        sum += a[i] * c_cp;
    }
    return sum;
}
Though I haven't specified how c is initialized, let's assume it's done in some way such that the value is unknown at compile time. Also, c really is a constant throughout runtime, i.e. I as the programmer know that its value won't change. Can I let the compiler in on this information, e.g. using static double c in the global scope? Does this change the a[i]*c vs. a[i]*c_cp question?
My own research
Reading e.g. the "Global variables" section of this, it seems clear that taking a local copy of the global variable is the way to go. However, they want to update the value of the global variable, whereas I only ever want to read its value.
Using godbolt I fail to notice any real difference in the assembly for both c vs. c_cp and double c vs. static double c.
Any decently smart compiler will optimize your code so that it behaves like your second code snippet. Using static won't change much, but if you want to ensure a read on each iteration, then use volatile.
Great point there about changes from a different thread. The compiler will maintain the integrity of your code as far as single-threaded execution goes. That means it can reorder your code, skip something, or add something, as long as the end result is still the same.
With multiple threads it is your job to ensure that things still happen in a specific order, not just that the end result is right. The way to ensure that are memory barriers. It's a fun topic to read, but one that is best avoided unless you're an expert.
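A sketch contrasting the two approaches discussed above (the names here are my own; the volatile qualifier is what forces a fresh load of the global on every iteration, while the local-copy version lets the compiler hoist the read out of the loop):

```c
double c = 2.0;             /* plain global: the read may be hoisted out of the loop */
volatile double vc = 2.0;   /* volatile global: must be re-read on every iteration   */

double f_hoisted(const double *a, int n)
{
    double c_cp = c;        /* explicit local copy: c is read exactly once */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * c_cp;
    return sum;
}

double f_volatile(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * vc;   /* guaranteed one load of vc per iteration */
    return sum;
}
```

Both compute the same result in single-threaded code; they differ only in how many loads of the global the compiler is allowed to emit.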
Once everything is translated to machine code, you will see no difference whatsoever. If c is global, any access to c will reference the address of c; most probably, in a tight loop, c will be kept in a register, or in the worst case in the L1 cache.
On a Linux machine you can easily generate the assembly and examine the resultant code.
You can also run benchmarks.
I am trying to compare a C function to its equivalent assembly, and I'm kind of confused about the conditional jumps.
I looked up the jl instruction and it says "jump if less", but the answer to the question was >=. Can someone explain why that is?
To my understanding, the condition is inverted, but the logic is the same; the C source defines
if the condition is satisfied, execute the following block
whereas the assembly source defines
if the condition is violated, skip the following block
which means that the flow of execution will be the same in both implementations.
In essence, what this assembly is doing, is executing your condition as you set it, but using negative logic.
Your condition says:
If a is smaller than b, return x. Otherwise, return y.
What the assembly code says (simplified):
Move y into the buffer for returning. Move b into a different buffer.
If a is not smaller than b (the condition is violated), jump ahead to the return step; then y is returned. If a is smaller than b, continue in the program: the next step assigns x to the return buffer, and the step after that returns as normal.
The outcome is the same, but the process is slightly different.
Here is what the assembly does, line by line (code not included, because you posted it as an image):
foo:
    return_value (eax) = y;  // !!!
    temporary_edx = b;       // x86 can't compare memory with memory, so "b" goes to a register
    set_flags_by(a - b);     // cmp does subtraction and discards the result, except flags
    "jump less to return"    // so when a < b => return y (see first line)
    return_value (eax) = x;
    return
so to make that C code do the same thing, you need:
if (a >= b) { return x; } else { return y; }
BTW, see how easy it is to flip:
if (a < b) { return y; } else { return x; }
So there's no point in translating jl into "less" in C. You have to track down each branch, work out what really happens, find the correct C-side calculation for each branch, and then construct the condition in C that produces the same calculation on both sides. This task is not about "translating" the assembly, but about deciphering the asm logic and rewriting it back in C. It looks like you expected to get away with a simple pattern-matching translation, but you have to work it out fully.
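To convince yourself the two forms compute the same thing, you can compare them over the three possible orderings of a and b (the x and y values are hypothetical):

```c
/* The condition as written in the C source... */
static int pick_lt(int a, int b, int x, int y)
{
    if (a < b) { return y; } else { return x; }
}

/* ...and the inverted form the assembly corresponds to. */
static int pick_ge(int a, int b, int x, int y)
{
    if (a >= b) { return x; } else { return y; }
}
```

Both functions return the same value for every input, which is exactly the point: the assembly uses negative logic, but the flow of execution is equivalent.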
I have the following piece of C code:
#include <stdint.h>

typedef union {
    uint8_t  c[4];
    uint16_t s[2];
    uint32_t l;
} U4;

uint32_t cborder32(uint32_t l)
{
    U4 mask, res;
    unsigned char *p = (unsigned char *)&l;
    mask.l = 0x00010203;
    res.c[(uint8_t)(mask.c[0])] = (uint8_t)p[0]; // <-- this line gives C6386
    res.c[(uint8_t)(mask.c[1])] = (uint8_t)p[1];
    res.c[(uint8_t)(mask.c[2])] = (uint8_t)p[2];
    res.c[(uint8_t)(mask.c[3])] = (uint8_t)p[3];
    return res.l;
}
And it triggers a Write overrun warning when running code analysis on it. http://msdn.microsoft.com/query/dev11.query?appId=Dev11IDEF1&l=EN-US&k=k%28C6386%29&rd=true
The error is:
C6386 Write overrun Buffer overrun while writing to 'res.c': the writable size is '4' bytes, but '66052' bytes might be written.
Invalid write to 'res.c[66051]', (writable range is 0 to 3)
And I just don't understand why. Can anyone explain it to me?
I'd put this down as a potential bug in the Microsoft product. It appears to be using the full value of mask.l (0x00010203 being decimal 66051) when figuring out the array index, despite the fact that you clearly want mask.c[0] forced to a uint8_t value.
So the first step is to notify Microsoft. They may come back and tell you you're wrong, and hopefully give you the C++ standard section that states why what you're doing is wrong. Or they may just state the code analysis tool is "best effort only". Since it's not actually preventing you from compiling (and it's not generating errors or warnings during compilation), they could still claim VC++ is compliant.
I would hope, of course, they wouldn't take that tack since they have a lot of interest in ensuring their tools are the best around.
The second step you should take is question why you want to do what you're doing in that way in the first place. What you have seems to be a simple byte-ordering switcher based on a mask. The statement:
res.c[(uint8_t)(mask.c[0])] = (uint8_t)p[0];
is problematic anyway since (uint8_t)(mask.c[0]) may well evaluate out to something greater than 3, and you're going to write beyond the end of your union in that case.
You may think that ensuring mask has no bytes greater than 3 prevents this, but the analyser may not know that. In any case, there are already many ways to switch byte order, such as the htons family of functions or, since your stuff is hard-coded anyway, just use one of:
res.c[0] = p[0]; res.c[1] = p[1]; res.c[2] = p[2]; res.c[3] = p[3];
or:
res.c[0] = p[3]; res.c[1] = p[2]; res.c[2] = p[1]; res.c[3] = p[0];
or something else, for stranger byte ordering requirements. Using this method doesn't cause any complaints from the analyser at all.
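For completeness, the byte reversal that the original function performs on a little-endian machine can also be written with shifts alone, with no union and no mask; this is endianness-independent and gives the analyser nothing to complain about (a sketch, not a drop-in for the original code):

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value using shifts and masks only. */
static uint32_t swap32(uint32_t l)
{
    return ((l & 0x000000FFu) << 24) |
           ((l & 0x0000FF00u) <<  8) |
           ((l & 0x00FF0000u) >>  8) |
           ((l & 0xFF000000u) >> 24);
}
```

Applying the swap twice gives back the original value, which makes it easy to sanity-check.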
If you really want to do it with the current mask method, you can remove the analyser warning (at least in VS2013, which is what I'm using) by temporarily suppressing it (for one line):
#pragma warning(suppress : 6386)
res.c[mask.c[0]] = p[0];
res.c[mask.c[1]] = p[1];
res.c[mask.c[2]] = p[2];
res.c[mask.c[3]] = p[3];
(with casts removed since the types are already correct).
I am using gcc version 4.7.2 on Ubuntu 12.10 x86_64.
First of all these are the sizes of data types on my terminal:
sizeof(char)        = 1
sizeof(short)       = 2
sizeof(int)         = 4
sizeof(long)        = 8
sizeof(long long)   = 8
sizeof(float)       = 4
sizeof(double)      = 8
sizeof(long double) = 16
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
    char c = 'a';
    printf("&c = %p\n", (void *)&c);
    return 0;
}
If I am not wrong, we can't predict anything about the address of c. But each time, this program gives some random hex address ending in f. So the next available location will be some hex value ending in 0.
I observed this pattern in case of other data types too. For an int value the address was some hex value ending in c. For double it was some random hex value ending in 8 and so on.
So I have 2 questions here.
1) Who governs this kind of memory allocation? Is it gcc or the C standard?
2) Whoever it is, why is it so? Why is the variable stored in such a way that the next available memory location starts at a hex value ending in 0? Any specific benefit?
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
    double a = 10.2;
    int b = 20;
    char c = 30;
    short d = 40;
    printf("&a = %p\n", (void *)&a);
    printf("&b = %p\n", (void *)&b);
    printf("&c = %p\n", (void *)&c);
    printf("&d = %p\n", (void *)&d);
    return 0;
}
Now, what I observed here was completely new to me. I thought the variables would get stored in the same order they are declared. But no! That's not the case. Here is the sample output of one random run:
&a = 0x7fff8686a698
&b = 0x7fff8686a694
&c = 0x7fff8686a691
&d = 0x7fff8686a692
It seems that the variables get sorted in increasing order of their sizes and then stored in that sorted order, while still maintaining observation 1, i.e. the last (largest) variable gets stored in such a way that the next available memory location is a hex value ending in 0.
Here are my questions:
3) Who is behind this? Is it gcc or the C standard?
4) Why waste time sorting the variables first and then allocating the memory, instead of allocating it directly on a 'first come, first served' basis? Any specific benefit to sorting before allocating?
Now please have a look at this code snippet:
#include <stdio.h>

int main(void)
{
    char array1[] = {1, 2};
    int array2[] = {1, 2, 3};
    printf("&array1[0] = %p\n", (void *)&array1[0]);
    printf("&array1[1] = %p\n\n", (void *)&array1[1]);
    printf("&array2[0] = %p\n", (void *)&array2[0]);
    printf("&array2[1] = %p\n", (void *)&array2[1]);
    printf("&array2[2] = %p\n", (void *)&array2[2]);
    return 0;
}
Now this was also surprising to me. What I observed is that an array is always stored at some random hex value ending in '0' if it has 2 or more elements; with fewer elements, it gets a memory location following observation 1.
So here are my questions:
5) Who is behind storing an array at some random hex value ending in 0? Is it gcc or the C standard?
6) And why waste the memory? array2 could have been stored immediately after array1 (and hence would have a memory location ending in 2). Instead, array2 is stored at the next hex value ending in 0, leaving 14 memory locations unused in between. Any specific benefit?
The address at which the stack and the heap start is given to the process by the operating system. Everything else is decided by the compiler, using offsets that are known at compile time. Some of these things may follow an existing convention followed in your target architecture and some of these do not.
The C standard does not mandate anything regarding the order of the local variables inside the stack frame (as pointed out in a comment, it doesn't even mandate the use of a stack at all). The standard only bothers to define order when it comes to structs and, even then, it does not define specific offsets, only the fact that these offsets must be in increasing order. Usually, compilers try to align the variables in such a way that access to them takes as few CPU instructions as possible - and the standard permits that, without mandating it.
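The alignment rule for struct members described above can be checked directly with offsetof and C11's alignof (the struct below is my own example):

```c
#include <stddef.h>    /* offsetof */
#include <stdalign.h>  /* alignof  */

struct demo {
    char   c;   /* 1 byte, then padding so that d is suitably aligned */
    double d;   /* must sit at an offset that is a multiple of alignof(double) */
};
```

On a typical x86-64 ABI, offsetof(struct demo, d) comes out as 8 rather than 1, because the compiler inserts padding after c; only the fact that offsets increase and are suitably aligned is guaranteed, not the exact values.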
Part of the reason is mandated by the application binary interface (ABI) specification for your system & processor.
See the x86 calling conventions and the SVR4 x86-64 ABI supplement (I'm giving the URL of a recent copy; the latest original is surprisingly hard to find on the Web).
Within a given call frame, the compiler could place variables in arbitrary stack slots. It may try (when optimizing) to reorganize the stack at will, e.g. by decreasing alignment constraints. You should not worry about that.
A compiler tries to put local variables on stack locations with suitable alignment. See the alignof extension of GCC. Where exactly the compiler puts these variables is not important; see my answer here. (If it is important to your code, you really should pack the variables in a single common local struct, since each compiler, version, and optimization flag could do different things; so don't depend on that precise behavior of your particular compiler.)
Let's suppose I have a function:
int f1(int x) {
    // some more or less complicated operations on x
    return x;
}
And that I have another function
int f2(int x) {
    // we simply return x
    return x;
}
I would like to be able to do something like the following:
char *_f1 = (char *)f1;
char *_f2 = (char *)f2;
int i;
for (i = 0; i < FUN_LENGTH; ++i) {
    _f1[i] = _f2[i];
}
I.e. I would like to interpret f1 and f2 as raw byte arrays and "overwrite f1 byte by byte" and thus, replace it by f2.
I know that callable code is usually write-protected; however, in my particular situation the memory location where f1 resides can simply be overwritten. That is, I can copy the bytes over onto f1, but afterwards, if I call f1, the whole thing crashes.
So, is my approach possible in principle? Or are there some machine/implementation/whatsoever-dependent issues I have to take into consideration?
It would be easier to replace the first few bytes of f1 with a machine jump instruction to the beginning of f2. That way, you won't have to deal with any possible code relocation issues.
Also, the information about how many bytes a function occupies (FUN_LENGTH in your question) is normally not available at runtime. Using a jump would avoid that problem too.
For x86, the relative jump instruction opcode you need is E9 (according to here). This is a 32-bit relative jump, which means you need to calculate the relative offset between f2 and f1. This code might do it:
int offset = (int)f2 - ((int)f1 + 5); // 5 bytes for size of instruction
char *pf1 = (char *)f1;
pf1[0] = 0xe9;
pf1[1] = offset & 0xff;
pf1[2] = (offset >> 8) & 0xff;
pf1[3] = (offset >> 16) & 0xff;
pf1[4] = (offset >> 24) & 0xff;
The offset is taken from the end of the JMP instruction, so that's why there is 5 added to the address of f1 in the offset calculation.
It's a good idea to step through the result with an assembly level debugger to make sure you're poking the correct bytes. Of course, this is all not standards compliant so if it breaks you get to keep both pieces.
Your approach is undefined behavior for the C standard.
And on many operating systems (e.g. Linux), your example will crash: the function code is inside the read only .text segment (and section) of the ELF executable, and that segment is (sort-of) mmap-ed read-only by execve (or by dlopen or by the dynamic linker), so you cannot write inside it.
Instead of trying to overwrite the function (which you've already found is fragile at best), I'd consider using a pointer to a function:
int complex_implementation(int x) {
    // do complex stuff with x
    return x;
}

int simple_implementation(int x) {
    return x;
}
int (*f1)(int) = complex_implementation;
You'd use this something like:
for (int i = 0; i < limit; i++) {
    a = f1(a);
    if (whatever_condition)
        f1 = simple_implementation;
}
...and after the assignment, calling f1 would just return the input value.
Calling a function via a pointer does impose some overhead, but (thanks to that being common in OO languages) most compilers and CPUs do a pretty good job of minimizing that overhead.
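The pieces above can be put together into a small runnable sketch (the doubling body and the swap condition are hypothetical stand-ins for the "complicated" work and for whatever_condition):

```c
static int complex_implementation(int x)
{
    return x * 2;   /* stand-in for the more complicated operations */
}

static int simple_implementation(int x)
{
    return x;       /* identity */
}

static int run(int a)
{
    int (*f1)(int) = complex_implementation;
    for (int i = 0; i < 4; i++) {
        a = f1(a);
        if (i == 1)                       /* stand-in for whatever_condition */
            f1 = simple_implementation;   /* from here on, calls through f1 are the identity */
    }
    return a;
}
```

After the pointer is reassigned, the remaining iterations leave a unchanged, without any self-modifying code.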
Most memory architectures will stop you from writing over the function code; it will crash. But on some embedded devices you can do this kind of thing, though it is dangerous unless you know there's enough space, the calling convention will be okay, the stack will be okay, etc.
Most likely there is a WAY better way to solve the problem.