CRC checksum of ELF file - c

I need an opinion from somebody who has some experience with assuring file integrity.
I am trying to protect the integrity of my file with a CRC checksum. My primary goal is to make it harder to bypass a licence file check (where bypassing consists of disassembling the executable and removing a conditional jump).
I came up with the following idea:
unsigned long crc_stored = 4294967295;        /* placeholder, patched in the binary */
char* text_begin = (char*)0xffffffffffffffff; /* placeholder, patched in the binary */
char* text_end   = (char*)0xffffffffffffffff; /* placeholder, patched in the binary */

int main() {
    unsigned long crc = calc_checksum(text_begin, text_end);
    if (crc == crc_stored) {
        /* file is ok */
    }
}
I edit the .data section of the ELF binary in the following way: text_begin and text_end will contain the start and end addresses of the .text section, and crc_stored the CRC checksum of the .text section.
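For reference, calc_checksum would just be an ordinary CRC-32 loop over the byte range; a minimal sketch (assuming the standard reflected polynomial 0xEDB88320):

/* Minimal bitwise CRC-32 (IEEE); a sketch of what calc_checksum
 * could look like, not code from the original question. */
unsigned long calc_checksum(const char *begin, const char *end)
{
    unsigned long crc = 0xFFFFFFFFUL;
    for (const char *p = begin; p != end; ++p) {
        crc ^= (unsigned char)*p;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320UL & -(crc & 1UL));
    }
    return crc ^ 0xFFFFFFFFUL;
}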
I would like to know whether this is a proper way of doing this, or whether there are better methods.
Edit: Karoly Horvath is right. Let's say I use the CRC check to decrypt some code. I would like to know the best way to checksum-protect the executable.
Olaf is also right: I could use a SHA algorithm instead. The question is the same.
Edit2: please stop saying that any protection can be bypassed. I know, and I just want to make it harder. Please answer the question if you can.

Let me see. You have code that does this:
int main() {
    if (!license_ok()) { exit(1); }
    // do something useful
}
You are worried that someone will disassemble your code, and patch out the conditional jump, so you are proposing to change the code this way instead:
int main() {
    if (calc_checksum() != stored_crc) { exit(1); }
    if (!license_ok()) { exit(1); }
    // do something useful
}
I hope you see that this "solution" is not really a solution at all (if someone is capable of patching out one conditional jump, surely he is just as capable of patching out two such jumps).
You can find ideas for a more plausible / robust solution in one of the many books on the subject.

Do not stop the program from running.
If the license is wrong at start-up, let some strange behaviour kick in after 1 to 5 minutes: segfaults, wrong calculations, whatever. But do it in some indirect way, like a second thread that modifies calculations, or flips a random bit in the stack of another thread, if the license is wrong.
Also get a map of yourself at runtime via /proc/self/maps and run a checksum over your .text section at runtime. That way you can also catch some runtime modifications.
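A rough sketch of that idea, assuming Linux (the function names here are illustrative, and calc_checksum stands in for whatever CRC or SHA routine you actually use):

#include <stdio.h>
#include <string.h>

unsigned long calc_checksum(const char *begin, const char *end);

unsigned long checksum_own_text(void)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    unsigned long start = 0, stop = 0;

    if (!maps)
        return 0;
    while (fgets(line, sizeof line, maps)) {
        /* the first executable mapping is normally the .text segment */
        if (strstr(line, " r-xp ")) {
            sscanf(line, "%lx-%lx", &start, &stop);
            break;
        }
    }
    fclose(maps);
    return start ? calc_checksum((char *)start, (char *)stop) : 0;
}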
But the bitter truth is: if it is runnable, then it is just a question of how much effort the attacker needs to get an unlicensed copy running. It is not important to make it unrunnable; the effort of getting it cracked just has to be bigger than the effort of buying a licence.

C: fprintf does not work

I have a long C program. At the beginning I open two files and write something to them:
ffitness_data = fopen("fitness_data.txt","w");
if (ffitness_data == NULL) {
    printf("Impossible to open the fitness data file\n");
    exit(1);
} else {
    fprintf(ffitness_data,"#This file contains all the data that are function of fitness.\n");
    fprintf(ffitness_data,"#Columns: f,<p>(f),<l>(f).\n\n");
}
fmeme_data = fopen("meme_data.txt","w");
if (fmeme_data == NULL) {
    printf("Impossible to open the meme data file\n");
    exit(1);
} else {
    fprintf(fmeme_data,"#This file contains all the data relative to memes.\n");
    fprintf(fmeme_data,"#Columns: fitness, popularity, lifetime.\n\n");
}
Everything is fine at this step: the files are opened and two lines are written to each.
Then I have a long simulation of a stochastic process, whose code is not relevant to the question: the files and their pointers are never used there. At the end of the process I have:
for (i = 0; i < data; i++) {
    fprintf(fmeme_data,"%f\t%d\t%f\n",meme[i].fitness,meme[i].popularity,meme[i].lifetime);
}
for (i = 0; i < 40; i++) {
    fprintf(ffitness_data,"%f\t%f\t%f\n",(1.0/40)*(i+0.5),popularity_histo[i],lifetime_histo[i]);
}
Then I fflush() and fclose() both files.
If I run the code on my laptop, both files are filled. If the code runs on a remote server, the file fitness_data.txt contains only the first print, i.e. the lines starting with #, but none of the data. Note that:
The other file never gives me problems.
I use this server regularly; something like this has never happened before.
Given all this information, the question is:
Why does a certain command, always used in the same way and in the same code, always work on one machine, while on a different server it sometimes works and sometimes doesn't?
Admins: I don't think this question is a duplicate. All similar questions were solved by adjusting the code (here) or adding fflush() (here) and similar things. Here the problem is not in the code (in my modest opinion), because it works on my laptop. I bet it works on most machines.
We can't say for certain what's going on here, because we don't have your full program nor do we have access to the server where the problem happens. But, we can give you some debugging advice.
When a C program behaves differently on one computer than another, the very first thing you should suspect is memory corruption. The best available tool for finding memory corruption is valgrind. Fix the first invalid operation it reports and repeat until it reports no more invalid operations. There are excellent odds that the problem will have then gone away.
Turn up the warning levels as high as they can go and fix all of the complaints, even the ones that look silly.
You say you are calling fflush and fclose, but are you checking whether they failed? Check thoroughly, like this:
if (ferror(ffitness_data) || fflush(ffitness_data) || fclose(ffitness_data)) {
    perror("write error on fitness_data.txt");
    exit(1);
}
Does the problem go away if you change the optimization level you are compiling with? If so, you may have a bug that causes "undefined behavior". Unfortunately there are a lot of possible ways to do that and I can't easily explain how to look for them.
Use a tool like C-Reduce to cut your program down to a smaller program that still doesn't work correctly but is short enough to post here in its entirety.
Read and follow the instructions in the article "How to Debug Small Programs".

Can I programmatically detect changes in a sketch?

At work we have an Arduino sketch that gets changed periodically. In a nutshell, it communicates back and forth on a Serial port. For the most part our software development team controls the code; however, there are some other teams at our company that periodically make last minute changes to the sketch in order to accommodate specific client needs.
This has obviously been quite problematic because it means we might have different versions of our sketch deployed in different places without realizing it. Our software developers are very good at using source control but the other teams are not quite so disciplined.
One idea that was proposed was hard-coding a version number, so that a certain serial command would respond by reporting back the predefined version number. The trouble however is that our other teams might likewise fail to have the discipline to update the version number if they decide to make other changes.
Obviously the best solution involves cutting off the other team from making updates, but assuming that isn't possible for office politics reasons, I was wondering if there's any way to programmatically "reflect" on an Arduino sketch. Obviously a sketch is going to take up a certain number of bytes, and that sketch file is going to have a unique file hash. I was thinking if there was some way to either get the byte count, the file hash, or the last modified time as a preprocessor directive that can be injected into code that would be ideal. Something like this:
// pseudocode
const String SKETCH_FILE_HASH = #filehash;
const int SKETCH_FILE_SIZE = #filesize;
const int SKETCH_LAST_UPDATED = #modified;
But that's about as far as my knowledge goes with this. Is there any way to write custom preprocessor directives, or macros, for Arduino code? Specifically ones that can examine the sketch file itself? Is that even possible? Or is there some way that already exists to programmatically track changes in one way or another?
Risking an answer.
SKETCH_FILE_HASH: you would have to precompute it externally and pass it in as a compiler flag. I guess you're using the Arduino IDE, so this is not doable.
SKETCH_FILE_SIZE: same answer
SKETCH_LAST_UPDATED: you can use __TIME__ to get a string containing the compilation time.
What I would do, taking into account the political parts:
embed a keyword linked to your version control (e.g. svn:id for Subversion; almost all VCSs provide this)
embed the compilation time
change the official build (the one the SW team controls) to use the actual toolchain instead of the IDE and put it on a Jenkins server: you'll be able to use compilation flags!
embed code like
#ifndef BUILD_TYPE
#define BUILD_TYPE "Unsupported"
#endif
On your continuous build, use -DBUILD_TYPE="HEAD" or "Release".
I'm sorry, I don't see a magic wand solving your problem. I'd invest a lot into training on why version control can save you (it seems you already have the war stories).
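To make that concrete, a minimal sketch of how those pieces could fit together in a sketch that reports its build metadata over serial (assuming a toolchain/Jenkins build that can pass -DBUILD_TYPE; everything else is standard Arduino):

#ifndef BUILD_TYPE
#define BUILD_TYPE "Unsupported"
#endif

void setup() {
  Serial.begin(9600);
  // __DATE__ and __TIME__ are filled in by the compiler at build time
  Serial.print(F("Build: " __DATE__ " " __TIME__ " ("));
  Serial.print(F(BUILD_TYPE));
  Serial.println(F(")"));
}

void loop() {}

In a real sketch you would print this in response to a serial version-query command rather than unconditionally in setup().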
I was looking at this issue myself, and found this:
https://gist.github.com/jcw/1985789#file-bootcheck-ino
This is to look up the bootloader; but I'm thinking that something like this could be used for determining a signature of some sort for the code as a whole.
I did a quick experiment, where I added something like:
Serial.print("Other...");
Serial.println(CalculateChecksum(0, 2048));
in void setup(), and was able to get different values for the CRC, based on changing a tiny bit of code (a string).
This is not an explicit solution; I tried CalculateChecksum(0, 32767) and so on, and if I defined an integer like int a=101; and changed it to int a=102;, the checksum stayed the same. Only when I changed a string (i.e. added a space) did the value change.
I'm not crystal clear on the way memory is allocated in the Arduino; I do know there is program memory (32,256 bytes) and global variable memory (2048 bytes), so I'm sure there is some way of doing this.
In another experiment, I used the pgm_read_byte() function, and if I create a simple memory dump function:
void MemoryDump(word addr, word size) {
    word dataval = ~0;
    // prog_uint8_t* p = (prog_uint8_t*) addr;
    uint8_t* p = (uint8_t*) addr;
    for (word i = 0; i < size; ++i) {
        dataval = pgm_read_byte(p++);
        Serial.print(i);
        Serial.print(" ->");
        Serial.print(dataval, HEX);
        Serial.print(" ");
        Serial.print(dataval);
        Serial.print(" ");
        if (dataval > 32) {
            Serial.print(char(dataval));
        } else {
            Serial.print("***");
        }
        Serial.print("\n");
    }
}
... and I put in a line like:
Serial.println(F("12345fghijklmnopqrstuvwxyz"));
because the F() puts the string in program memory, you will see it there.
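Building on that, a hedged sketch of a CRC over program memory using the avr-libc helpers (my code, not the gist's CalculateChecksum; assumes an AVR target):

#include <avr/pgmspace.h>
#include <util/crc16.h>

// CRC-16 over a range of program memory (flash)
uint16_t flash_crc(uint16_t addr, uint16_t size) {
  uint16_t crc = 0xFFFF;
  for (uint16_t i = 0; i < size; i++)
    crc = _crc16_update(crc, pgm_read_byte(addr + i));
  return crc;
}

Printing flash_crc(0, 32256) in setup() should then change whenever the program image changes.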
Reading the SRAM is a bit of an issue, as noted here:
http://forum.arduino.cc/index.php?topic=220125.0
I'm not a compiler god, so I don't know what stuff like a=101; looks like to the compiler/IDE, or why it doesn't show up differently in the program memory area.
One last note:
http://playground.arduino.cc/Code/AvailableMemory
Those functions access SRAM, so perhaps, with a bit of tweaking, you could do a CRC on that memory, but it seems a bit of an issue, since you would be doing the computation with a variable... in SRAM! But if the code was identical, even while doing a computation like that, it might be possible. Again, I'm in deep water here, so if an AVR god takes issue with this, please destroy this theory with an ugly fact!

what the author of nedtries means by "in-place"?

I just implemented a kind of bitwise trie (based on nedtries), but my code does a lot of memory allocation (one for each node).
Contrary to my implementation, nedtries are claimed to be fast, among other things, because of their small number of memory allocations (if any).
The author claims his implementation is "in-place", but what does that really mean in this context? And how does nedtries achieve such a small number of dynamic memory allocations?
PS: I know that the sources are available, but the code is pretty hard to follow and I cannot figure out how it works.
I'm the author, so this is for the benefit of the many people who, according to Google, are similarly having difficulties in using nedtries. I would like to thank the people here on Stack Overflow for not making unpleasant comments about me personally, which some other discussions about nedtries do.
I am afraid I don't understand the difficulties with knowing how to use it. Usage is exceptionally easy - simply copy the example in the Readme.html file:
#include <assert.h>
#include <stdio.h>
#include "nedtrie.h"

typedef struct foo_s foo_t;
struct foo_s {
    NEDTRIE_ENTRY(foo_t) link;
    size_t key;
};
typedef struct foo_tree_s foo_tree_t;
NEDTRIE_HEAD(foo_tree_s, foo_t);
static foo_tree_t footree;

static size_t fookeyfunct(const foo_t *RESTRICT r)
{
    return r->key;
}

NEDTRIE_GENERATE(static, foo_tree_s, foo_s, link, fookeyfunct, NEDTRIE_NOBBLEZEROS(foo_tree_s));

int main(void)
{
    foo_t a, b, c, *r;
    NEDTRIE_INIT(&footree);
    a.key = 2;
    NEDTRIE_INSERT(foo_tree_s, &footree, &a);
    b.key = 6;
    NEDTRIE_INSERT(foo_tree_s, &footree, &b);
    r = NEDTRIE_FIND(foo_tree_s, &footree, &b);
    assert(r == &b);
    c.key = 5;
    r = NEDTRIE_NFIND(foo_tree_s, &footree, &c);
    assert(r == &b); /* NFIND finds next largest. Invert the key function to invert this */
    NEDTRIE_REMOVE(foo_tree_s, &footree, &a);
    NEDTRIE_FOREACH(r, foo_tree_s, &footree)
    {
        printf("%p, %zu\n", (void *) r, r->key);
    }
    NEDTRIE_PREV(foo_tree_s, &footree, &a);
    return 0;
}
You declare your item type - here it's struct foo_s. You need the NEDTRIE_ENTRY() inside it; otherwise it can contain whatever you like. You also need a key-generating function. Other than that, it's pretty boilerplate.
I wouldn't have chosen this system of macro-based initialisation myself! But it's for compatibility with the BSD rbtree.h, so nedtries is very easy to swap into anything using BSD rbtree.h.
Regarding my usage of "in place" algorithms, well, I guess my lack of computer science training shows here. What I would call "in place" is when you only use the memory passed into a piece of code, so if you hand 64 bytes to an in-place algorithm it will only touch those 64 bytes, i.e. it won't make use of extra metadata, or allocate some extra memory, or indeed write to global state. A good example is an "in place" sort implementation where only the collection being sorted (and, I suppose, the thread stack) gets touched.

Hence no, nedtries doesn't need a memory allocator. It stores all the data it needs in the NEDTRIE_ENTRY and NEDTRIE_HEAD macro expansions. In other words, when you allocate your struct foo_s, you do all the memory allocation for nedtries.

Regarding understanding the "macro goodness", it's far easier to understand the logic if you compile it as C++ and then debug it :). The C++ build uses templates, and the debugger will cleanly show you the state at any given time. In fact, all debugging from my end happens in a C++ build, and I meticulously transcribe the C++ changes into macroised C.

Lastly, before a new release, I search Google for people having problems with my software to see if I can fix things, and I am typically amazed at what some people say about me and my free software. Firstly, why didn't those people having difficulties ask me directly for help? If I know that there is something wrong with the docs, then I can fix them - equally, asking on Stack Overflow doesn't let me know immediately that there is a docs problem but rather relies on me finding it before the next release. So all I would say is that if anyone finds a problem with my docs, please do email me and say so, even if there is a discussion like this one here on Stack Overflow.
Niall
I took a look at the nedtrie.h source code.
It seems that the reason it is "in-place" is that you have to add the trie bookkeeping data to the items that you want to store.
You use the NEDTRIE_ENTRY macro to add parent/child/next/prev links to your data structure, and you can then pass that data structure to the various trie routines, which will extract and use those added members.
So it is "in-place" in the sense that you augment your existing data structures and the trie code piggybacks on that.
At least that's what it looks like. There's lots of macro goodness in that code, so I could have gotten myself confused :)
In-place means you operate on the original (input) data, so the input data becomes the output data. Not-in-place means that you have separate input and output data, and the input data is not modified. In-place operations have a number of advantages - smaller cache/memory footprint, lower memory bandwidth, hence typically better performance, etc, but they have the disadvantage that they are destructive, i.e. you lose the original input data (which may or may not matter, depending on the use case).
In-place means to operate on the input data and (possibly) update it. The implication is that there is no copying and/or moving of the input data. This may result in losing the input data's original values, which you will need to consider if it is relevant for your particular case.
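To make the contrast concrete, a tiny illustrative pair (my example, not from nedtries): reversing an array in place versus into a copy:

#include <stddef.h>

/* In place: only the caller's buffer is touched; the original order is destroyed. */
void reverse_in_place(int *a, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        int tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}

/* Not in place: a second buffer is needed, but the input survives. */
void reverse_copy(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[n - 1 - i];
}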

Microcontroller serial command interpreter in C/C++: ways to do it

I'd like to interpret a command string, received by a microcontroller (PIC16F877A, if that makes any difference) via serial.
The strings have a pretty simple and straightforward format:
$AABBCCDDEE (five "blocks" of 2 characters, plus '$', for 11 characters in total) where:
$AA = the actual name of the command (could be letters, numbers, or both; mandatory);
BB-EE = parameters (numbers; optional);
I'd like to write the code in C/C++.
I figure I could just grab the string via serial, hack it up into blocks, switch () {case} and memcmp the command block ($AA). Then I could have a binary decision tree to make use of the BB, CC, DD and EE blocks.
I'd like to know if that's the right way to do it (It kinda seems ugly to me, surely there must be a less tedious way to do this!).
Don't over-design it! That does not mean to go blindly coding, but once you have designed something that looks like it can do the job, you can start to implement it. Implementation will give you feedback about your architecture.
For example, when writing your switch case, you might see yourself rewriting code very similar to the one you just wrote for the preceding case. Actually writing down an algorithm will help you see some problems you did not think of, or some simplification you did not see.
Don't aim for the best code on the first try. Aim for
easy to read
easy to debug
Take little steps. You do not have to implement the whole thing in one go.
Grab the string from the serial port. Looks easy, right ? Well, let's do that first, just printing out the commands.
Separate the command from the parameters.
Extract the parameters. Will the extraction be the same for each command ? Can you design a data structure valid for every command ?
Once you have done it right, you can start to think of a better solution.
ASCII interfaces are ugly by definition. Ideally you have some sort of frame structure, which maybe you have: the $ indicates the division between frames, and you say they are 11 characters in length. If always 11, that is good; if only sometimes, that is harder. Hopefully there is a $ at the start and 0x0A and/or 0x0D 0x0A (CR/LF) at the end.

Normally I have one module of code that simply extracts bytes from the serial port and puts them into a (circular) buffer. The buffering dates back to the days when serial ports had very little or no buffer on board, but even today, especially with microcontrollers, that is still the case. Then another module of code monitors the buffer, searching for frames. Ideally this buffer is big enough to leave the frame there and still have room for the next frame, so no second buffer is needed for keeping copies of the frames received. Using the circular buffer, this second module can move the head pointer (discarding as it goes if necessary) to the beginning-of-frame marker and wait for a full frame's worth of data. Once a full frame appears to be there, it calls another function that processes that frame. (A sketch of this two-module layout follows below.)

That function may be the one you are asking about, and "just code it" may be the answer. You are on a microcontroller, so you can't use lazy high-level desktop-application-on-an-operating-system solutions. You will need some sort of strcmp function, either written yourself or available through a library, or not, depending on your solution. The brute force if(strncmp(&frame[1],"bob",3)==0) then, else if(strncmp(&frame[1],"ted",3)==0) then, else if... certainly works, but you may chew up your ROM with that kind of thing, or not. And the buffering required for this kind of approach can chew up a lot of RAM. This approach is very readable, maintainable and portable, though. It may not be fast (maintainability normally conflicts with reliability and/or performance), but that may not be a concern, as long as you can process this frame before the next one comes along, and/or before unprocessed data falls out of the circular buffer.

Depending on the task, the frame-checker routine may simply check that the frame is good. I normally put in start and end markers, a length, and some sort of arithmetic checksum, and if it is a bad frame it is discarded; this saves a lot of code checking for bad/corrupt data. When the frame-processing routine returns to the search-for-frame routine, the head pointer is moved to purge the frame, as it is no longer needed, good frame or bad. The frame checker may only validate a frame and hand it off to yet another function that does the parsing. Each lego block in this arrangement has a very simple task and operates on the assumption that the lego block below it has performed its task properly. Modular, object oriented, whatever term you want to use, this makes the design, coding, maintenance and debugging much easier (at the cost of performance and resources). This approach works well for any serial-type stream, be it a serial port on a microcontroller (with enough resources) or a desktop application looking at serial data from a serial port or TCP data, which is also serial and NOT frame oriented.
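Here is a minimal sketch of that layout (names and sizes are mine; process_frame() is a hypothetical handler, and the "head pointer" above corresponds to the rd index here):

#define BUF_SIZE 64  /* power of two keeps the index wrap cheap */

static volatile unsigned char buf[BUF_SIZE];
static volatile unsigned char wr, rd;  /* write and read indices */

/* Module 1: called for each received byte (e.g. from the UART ISR).
   No overflow check here - this is a sketch only. */
void rx_byte(unsigned char b)
{
    buf[wr++ & (BUF_SIZE - 1)] = b;
}

void process_frame(const char frame[11]);  /* hypothetical handler */

/* Module 2: called from the main loop; scans for '$'-led 11-byte frames. */
void poll_frames(void)
{
    while (rd != wr && buf[rd & (BUF_SIZE - 1)] != '$')
        rd++;                              /* discard up to the frame marker */
    if ((unsigned char)(wr - rd) >= 11) {  /* a full frame is buffered */
        char frame[11];
        int i;
        for (i = 0; i < 11; i++)
            frame[i] = buf[rd++ & (BUF_SIZE - 1)];
        process_frame(frame);
    }
}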
If your micro doesn't have the resources for all that, then the state machine approach also works quite well: each byte that arrives ticks the state machine one state. Start in idle, waiting for the first byte. Is the first byte a $? No: discard it and go back to idle. If the first byte is a $, go to the next state. If you were looking for, say, the commands "and", "add", "or" and "xor", then the second state would compare against "a", "o" and "x"; if none of these, go back to idle. On an a, go to a state that compares for n and d; on an o, go to a state that looks for the r. If the look-for-the-r-in-or state does not see the r, go back to idle; if it does, process the command and then go back to idle. (A minimal sketch of such a machine follows below.)

The code is readable in the sense that you can look at the state machine and see the words a,n,d, a,d,d, o,r, x,o,r and where they ultimately lead, but it is generally not considered readable code. This approach uses very little RAM and leans on the ROM a bit more, but overall it could use the least amount of ROM compared to other parsing approaches. And here again it is very portable, beyond microcontrollers, though outside a microcontroller folks might think you are insane with this kind of code (well, not if this were Verilog or VHDL, of course). This approach is harder to maintain and harder to read, but it is very fast and reliable and uses the least amount of resources.
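And a hedged sketch of that state machine for the hypothetical "and"/"add"/"or"/"xor" command set (do_and() and friends are placeholders for the real command actions):

enum state { WAIT_DOLLAR, CMD, A_SEEN, AN_SEEN, AD_SEEN, O_SEEN, X_SEEN, XO_SEEN };

void do_and(void); void do_add(void); void do_or(void); void do_xor(void);

/* Feed one received byte in; every byte ticks the machine one state. */
void feed(char c)
{
    static enum state s = WAIT_DOLLAR;
    switch (s) {
    case WAIT_DOLLAR: s = (c == '$') ? CMD : WAIT_DOLLAR; break;
    case CMD:     s = (c == 'a') ? A_SEEN
                    : (c == 'o') ? O_SEEN
                    : (c == 'x') ? X_SEEN : WAIT_DOLLAR; break;
    case A_SEEN:  s = (c == 'n') ? AN_SEEN
                    : (c == 'd') ? AD_SEEN : WAIT_DOLLAR; break;
    case AN_SEEN: if (c == 'd') do_and(); s = WAIT_DOLLAR; break;
    case AD_SEEN: if (c == 'd') do_add(); s = WAIT_DOLLAR; break;
    case O_SEEN:  if (c == 'r') do_or();  s = WAIT_DOLLAR; break;
    case X_SEEN:  s = (c == 'o') ? XO_SEEN : WAIT_DOLLAR; break;
    case XO_SEEN: if (c == 'r') do_xor(); s = WAIT_DOLLAR; break;
    }
}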
No matter what approach you take, once the command is interpreted you have to ensure you can perform the command without losing any bytes on the serial port, whether through deterministic performance of the code, interrupts, or whatever.
Bottom line: ASCII interfaces are always ugly. The code for them, no matter how many layers of libraries you use to make the job easier, and the resulting instructions that get executed, are ugly. And one size fits no-one, by definition. Just start coding: try a state machine, try the if-then-else-strncmp, and optimizations in between. You should quickly see which one works best with your coding style, the tools/processor, and the problem being solved.
It depends on how fancy you want to get, how many different commands there are, and whether new commands are likely to be frequently added.
You could create a data structure that associates each valid command string with a corresponding function pointer - a sorted list accessed with bsearch() is probably fine, although a hash table is an alternative which may have better performance (since the set of valid commands is known beforehand, you could construct a perfect hash with a tool like gperf).
The bsearch() approach might look something like this:
#include <stdlib.h>   /* bsearch */
#include <string.h>   /* memcmp */

void func_aa(char args[11]);
void func_cc(char args[11]);
void func_xy(char args[11]);

/* must stay sorted by name: bsearch() requires sorted input */
struct command {
    char *name;
    void (*cmd_func)(char args[11]);
} command_tbl[] = {
    { "AA", func_aa },
    { "CC", func_cc },
    { "XY", func_xy }
};

#define N_CMDS (sizeof command_tbl / sizeof command_tbl[0])

static int comp_cmd(const void *c1, const void *c2)
{
    const struct command *cmd1 = c1, *cmd2 = c2;
    return memcmp(cmd1->name, cmd2->name, 2);
}

static struct command *get_cmd(char *name)
{
    struct command target = { name, NULL };
    return bsearch(&target, command_tbl, N_CMDS, sizeof command_tbl[0], comp_cmd);
}
Then if you have command_str pointing to a string from the serial port, you'd do this to dispatch the right function:
struct command *cmd = get_cmd(command_str + 1);
if (cmd)
    cmd->cmd_func(command_str);
Don't know if you're still working on this, but I'm working on a similar project and found an embedded command line interpreter: http://sourceforge.net/projects/ecli/?source=recommended. That's right, they had embedded applications in mind.
The cli_engine function really helps in taking the inputs from your command line.
Warning: there is no documentation besides a readme file. I'm still working through some bugs integrating the framework, but this definitely gave me a head start. You'll have to deal with comparing the strings (i.e. using strcmp) yourself.

Bug fixed with four nops in an if(0), world no longer makes sense

I was writing a function to figure out if a given system of linear inequalities has a solution, when all of a sudden it started giving the wrong answers after a seemingly innocuous change.
I undid some changes, re-did them, and then proceeded to fiddle for the next two hours, until I had reduced it to absurdity.
The following, inserted anywhere into the function body, but nowhere else in the program, fixes it:
if (0) {
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
}
It's for a school assignment, so I probably shouldn't post the function on the web, but this is so ridiculous that I don't think any context is going to help you. And all the function does is a bunch of math and looping. It doesn't even touch memory that isn't allocated on the stack.
Please help me make sense of the world! I'm loath to chalk it up to GCC, since the first rule of debugging is not to blame the compiler. But heck, I'm about to. I'm running Mac OS X 10.5 on a G5 tower, and the compiler in question identifies itself as 'powerpc-apple-darwin9-gcc-4.0.1', but I'm thinking it could be an impostor...
UPDATE: Curiouser and curiouser... I diffed the .s files with nops and without. Not only are there too many differences to check, but with no nops the .s file is 196,620 bytes, and with it's 156,719 bytes. (!)
UPDATE 2: Wow, should have posted the code! I came back to the code today, with fresh eyes, and immediately saw the error. See my sheepish self-answer below.
Most times when you modify the code inconsequentially and it fixes your problem, it's a memory corruption problem of some sort. We may need to see the actual code to do proper analysis, but that would be my first guess, based on the available information.
It's faulty pointer arithmetic, either directly (through a pointer) or indirectly (by going past the end of an array). Check all your arrays. Don't forget that if your array is
int a[4];
then a[4] doesn't exist.
What you're doing is accidentally overwriting something on the stack. The stack contains locals, parameters, and the return address from your function. You might be damaging the return address in a way that the extra no-ops cure.
For example, if you have some code that is adding something to the return address, inserting those extra 16 bytes of no-ops would cure the problem, because instead of returning past the next line of code, you return into the middle of some no-ops.
One way you might be adding something to the return address is by going past the end of a local array or a parameter, for example
int a[4];
a[4]++;
I came back to this after a few days busy with other things, and figured it out right away. Sorry I didn't post the code sooner, but it was hard coming up with minimal example that displayed the problem.
The root problem was that I left out the return statements in the recursive function. I had:
bool function() {
    /* lots of code */
    function();
}
When it should have been:
bool function() {
    /* lots of code */
    return function();
}
This worked because, through the magic of optimization, the right value happened to be in the right register at the right time, and made it to the right place.
The bug was originally introduced when I broke the first call out into its own special-cased function. And at that point, the extra nops were what made the difference in whether this first case got inlined directly into the general recursive function or not.
Then, for reasons that I don't fully understand, inlining this first case led to the right value not being in the right place at the right time, and the function returning junk.
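(Worth adding: compiling with gcc's -Wall enables -Wreturn-type, which flags exactly this case with a "control reaches end of non-void function" warning.)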
Does it happen in both debug and release builds (with symbols and without)? Does it behave the same way under a debugger? Is the code multithreaded? Are you compiling with optimizations? Can you try another machine?
Can you confirm that you are indeed getting different executables when you add the if(0) {nops}? I don't see nops on my system.
$ gcc --version
powerpc-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
$ cat nop.c
void foo()
{
if (0) {
__asm__("nop");
__asm__("nop");
__asm__("nop");
__asm__("nop");
}
}
$ gcc nop.c -S -O0 -o -
.
.
_foo:
stmw r30,-8(r1)
stwu r1,-48(r1)
mr r30,r1
lwz r1,0(r1)
lmw r30,-8(r1)
blr
$ gcc nop.c -S -O3 -o -
.
.
_foo:
blr
My guess is stack corruption -- though gcc should optimize anything inside an if(0) out, I would have thought.
You could try sticking a big array on the stack in your function and see if that also fixes it -- that would also implicate stack corruption.
Are you sure you're running what you think you're running? (dumb question, but it happens.)
Looks like you will need to put in some hard work and elbow grease.
Your problem sounds similar to something I debugged in the past, where my app was running normally... when out of nowhere it jumped to a different part of the app and the call stack got completely messed up (however, that was embedded programming)!
It sounds like you are spending your time "thinking" about "what should be happening" when you should be "looking" at "what is actually happening". A lot of the time, the hardest bugs are things that you would never think "should happen".
I would approach the problem like so:
Break out your favorite debugger
Start stepping through your code and watch the call stack and local variables and look for suspicious activity
Make the system fail
Focus in to where the system is failing
Focus on iterating your code changes:
making code changes that will "make the system fail"
running/debugging and watching
If it runs fine you are looking/trying the wrong thing and you need to try something else. If you make it fail then you have made progress towards finding the bug.
If you don't know where or how the system fails you will not be able to solve the problem.
This will be a good opportunity to build your debugging skills. For more help on building your debugging skills, check out the book "9 rules for debugging".
Concrete suggestions:
If you think it is the compiler, then run a different platform/OS/compiler.
Once you have ruled out the platform/OS/compiler, then try restructuring the code. Look for the "clever" code parts and see if they are actually doing what the code meant to do... maybe the clever solution wasn't actually clever and is doing something else.
I am the author of "Debugging", so kindly referenced above by Trevor Boyd Smith. He has it right - the key rules here are #2, Make It Fail (which you seem to be doing okay), and #3, Quit Thinking and Look. The conjectures above are very good (demonstrating mastery of rule #1, Understand the System - in this case, the way code size can change a bug). But actually watching it fail with a debugger will show you what's actually happening, without guesswork.
Break out that one function into a separate .c file (or .cpp or whatever). Compile just that one file with the nops and without them, to .s files and compare them.
Try an old version of gcc. Go back 5 or 10 years and see if things get stranger.
