Obfuscating C++ Shared Library

I've been asked to help obfuscate a library (written in C++) which will be distributed to clients. I've already discussed why obfuscation is not necessarily a good idea, and seeing as licensing will be integrated into the software, many concerns regarding copy protection are moot.
Regardless, I've been asked to research methods anyway. I've looked into header mangling (and the like) as well as HARES, but I fail to find much that I can use for a library (naturally, these things would destroy any form of API, rendering the library useless).
What techniques can I apply that would work for libraries? While I would appreciate recommendations for tools (or compiler flags, etc.) that might be helpful I would like to stress that this is not a tool-focused (i.e. closable) question, but rather one focused on applicable techniques.

I wouldn't put much energy into doing it very thoroughly, because the reverse engineer is going to win this round.
https://softwareengineering.stackexchange.com/questions/155131/is-it-important-to-obfuscate-c-application-code
Obfuscating C++ binaries is a bit of a losing battle. It depends on who you are dealing with, but if your reverse engineer is smart enough to use IDA Pro, a couple of plugins, and a good debugger, then it will all be for naught.
Obfuscation Priorities
Strip or mangle symbols, so you give the reverse engineer useless function names.
Honestly, this doesn't help that much, since ultimately your code will have to call some kind of non-obfuscated shared library to get anything done. At some point you will use the standard library, or the STL, or even make a system call.
Add false pathways to confound static analysis (a minimal sketch appears at the end of this answer)
So that the reverse engineer can have fun with a debugger. Anti-analysis techniques are well known to the reverse engineer, and they can almost always be circumvented with a debugger like OllyDbg.
Write debugger-foiling code
That reverse engineers love to play with. Again, this is an expected move, and the response is just to step around the offending code, or to modify away the traps. Anyone with any formal training in RE will blast past this.
Pack most of your binary into an encrypted region which is decrypted by a stub just before execution.
Same answer as above. Reverse engineers train for this from day one.
Keep in mind the reverse engineers are looking for targeted morsels of information - very rarely are they trying to recreate the entire application. Security-intensive code, code for license validation, code for home-base communication, networking code: these are all prime targets - put your energy into making these thorny places to live.
Keep in mind that binaries from the largest corporations on the earth are routinely reverse engineered by people in their early 20's.
Don't leave your debugging symbols in the final binary, as those will definitely help with analysis.
If you are dedicated to doing this right, also focus on wasting the engineer's time - time is always against the reverse engineer.
Remember that any meaningful obfuscation might also cost you the performance gains that justified working in C++ in the first place. There are many zones in the C world (and for that matter the Java world) where meaningful obfuscation just isn't possible. Games, for instance, cannot conceal their calls to the OpenGL APIs, nor can they truly prevent engineers from harvesting their shader code.
Also remember that the reverse engineer is watching your code at the assembly level most of the time. He'd rather have your function names, but he can live without them if need be. He can see what your program is doing at the finest level of detail possible. It is only a matter of time before he finds the critical routines.
For your purposes, find a program to mangle function names, make your boss happy, and call it a day. At least at that point, reverse engineering the software will not be trivial.
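As a concrete (and deliberately tiny) illustration of the "false pathways" item above, here is a hedged sketch of an opaque predicate - a branch that never executes but is not obviously dead to static analysis. The function name and constants are invented for illustration only:

```c
#include <stdio.h>
#include <stdlib.h>

static int licensed_checksum(int seed)
{
    volatile unsigned x = (unsigned)seed;   /* volatile discourages the optimizer
                                               from proving the branch dead */
    if ((x * (x + 1u)) % 2u != 0u) {
        /* Never taken at run time: x*(x+1) is always even.  The branch exists
           only to add misleading control flow for static analysis to trace. */
        puts("fallback validation path");
        return seed ^ 0x5A5A;
    }
    return (int)(x * 31u + 7u);             /* the real (trivial) computation */
}

int main(void)
{
    printf("%d\n", licensed_checksum(rand()));
    return 0;
}
```

A real obfuscator would generate hundreds of these mechanically and weave them through the control flow; by hand they are only worth sprinkling around the routines you most want to hide.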

Well, really you have two primary vectors that you have to guard against:
Disassembly
Debugging
My favourite method for preventing the first issue is in-memory decryption: take parts of your executable code and encrypt them, and have them self-decrypt in memory while your library is running. You can also checksum parts of the code and compare the checksum against what is loaded in RAM (have the encrypted portions check the decrypter and vice versa).
Another neat trick is to statically link the libraries that you use into your executable so they cannot be easily swapped out to try to see what your code is doing.
Now for debugging: checking interrupt vectors helps. Another trick is to check the timing between various portions of code (for example, if more than a couple of milliseconds' worth of delay occurs in code that should execute significantly faster than that, then it can be assumed that the code is being debugged).
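A rough sketch of that timing check, assuming a POSIX system with clock_gettime(); the 2 ms threshold and the function names are arbitrary examples:

```c
#include <stdio.h>
#include <time.h>

static long long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

static void sensitive_routine(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... work that should normally finish in well under 2 ms ... */

    clock_gettime(CLOCK_MONOTONIC, &end);
    if (elapsed_ns(&start, &end) > 2LL * 1000000LL) {
        /* Probably being single-stepped or paused under a debugger:
         * degrade output, bail out, or take some other evasive action. */
        fprintf(stderr, "timing anomaly detected\n");
    }
}

int main(void)
{
    sensitive_routine();
    return 0;
}
```

On a bare-metal target you would read a cycle counter or hardware timer instead of the OS clock.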


Can Smart Assembly 7+ strings be deobfuscated?

I am planning on using Smart Assembly 7+ for obfuscating my .NET C# library.
But when I looked through some forums I came across the fact that there are even programs to deobfuscate DLLs protected with Smart Assembly, particularly programs like de4dot.
So I tried to deobfuscate my program using de4dot, and to my surprise most of my logic was decompiled successfully. But thankfully the strings were not decompiled.
They were in the form of Class24.getString_0(5050)
If the strings cannot be decompiled properly by any deobfuscator, then that is enough to protect my core logic. But I am paranoid that maybe I did not use the deobfuscator properly and that there are ways to deobfuscate even the strings (though I did run the deobfuscator's commands for strings, as stated in the repo wiki).
Basically my question is: can I be certain that strings obfuscated by Smart Assembly cannot be decompiled by any deobfuscator program on the market?
Also, any good suggestions for obfuscating .NET libraries are welcome.
Thank You All!
In order for your code to run, the computer must understand it. There is no way around that. If the CLR can understand your code, there is no reason that a de-obfuscator cannot understand your code either.
Plus, computers are much stupider than humans. If a computer can understand your code, then a human definitely can.
The typical approaches to protecting your code, are:
Don't give the customers your code. Run it on your own computer and give them access to it. (That's the "Google approach".)
Give the customers a computer that you control 100% with your code pre-installed. (That's the "PlayStation approach".)
Don't do business with criminals. Copying your code is illegal pretty much everywhere. Circumventing protections in your code is illegal in several countries, including some of the biggest markets (e.g. the US). Reverse engineering your code may be legal, but only under very strict circumstances. (E.g. in the EU, reverse engineering is only legal for purposes of interoperability, and only if you refuse to make the information required for interoperability available under reasonable and non-discriminatory terms.)
Offer your customers extra services that your competitors, even if they were stealing your code, don't or cannot offer. For a lot of companies, the mere fact of "having someone they can sue" is already reason enough to buy the original software from the original vendor. Criminals are lazy, that's why they are criminals. They will never understand the problem domain as deeply as you do, simply because they are too lazy to put in the work, so they will never be able to provide enhancements, consulting, support, or bug fixes as well, as fast, and as precisely as you can.

Question to any embedded systems engineers out there using STM32 NUCLEO

I have recently bought an STM32 NUCLEO dev kit and wondered if this is what an actual embedded systems engineer would use in industry when developing a product?
I'm using Keil uVision 5, STM32CubeMX and the STM32 ST-LINK Utility to develop certain projects. As I am used to PIC and using registers like PORTA, OSCCON, TIMER0 etc., I see that Keil uVision 5 uses ready-made functions like HAL_GPIO_TogglePin(.........) etc. Is this the usual way they do it in industry, or do they work more directly with the registers?
It's heavily opinion-based and I wouldn't be surprised if this question got closed for that reason. This answer only touches on a few aspects of what you're asking about. It's a very broad topic and it's going to be hard - if not impossible - to include everything in one post which wouldn't end up being several pages long. However, to give you my perspective on the topic, while trying to remain unbiased, the short answer is: it depends.
If you're asking about what is used in most common cases, it's likely going to be the HAL (previously StdPeriph) functions you've mentioned. The reason is - they get the job done in most common cases. After all it always comes down to what the cost of creating a product is going to be. If HAL functions are "good enough" for the purpose, they're going to be used simply because they're faster to develop with. The higher the development cost, the more you'll want to cut it (or move it elsewhere) and using abstractions is one way of doing so.
However, even though I think it's safe to assume that HAL / Std Periph / any other (including proprietary) abstraction layer is generally used, it's not always the case for at least two reasons I can think of:
Existing functions may not be suitable for your purpose. Taking HAL as an example, it works pretty well for most common cases, but sometimes your needs may be so specific that you'll have to go and mess "under the hood", often ending up either writing your own variation of the functions or building something new on top of HAL. Personally I can think of at least a few examples where HAL functions weren't exactly what I needed. It doesn't necessarily mean that the library is bad; it's just that sometimes the requirements are very specific.
Messing with registers directly may sometimes be required for performance reasons. HAL and similar are an abstraction layer and as any abstraction, they take more time to execute than using the registers directly. If you're trying to squeeze absolute maximum out of given peripheral, you'll sometimes have to go down to register level.
Now to a more biased portion of my answer. I can see why you ask this question. Coming from the PIC world, where flash or CPU clocks are more precious, it does make sense to use registers directly there. In the case of the STM32, it's not as critical anymore. Having said that, you'll sometimes stumble upon opinions that "using registers is the only true way", but personally I find such discussions ending up being purely academic. I see registers, and any abstraction built on top of them, as tools, and you should use the right tools for the right job. Two examples of NOT using the right tools:
You use only registers as "the only right way", either because you believe it yourself or because you've been told so. Your products take twice as long (if not more) to develop, and your code takes less space in flash (so now you use 46% of 1 MB flash instead of 48%). Code that is performance-critical meets its goals. Code that has relaxed execution time constraints is also super efficient, but it doesn't affect the end customer much, if at all. Your code is also less reusable - you find yourself rewriting the same portions of code over and over every time you release a new product for a new MCU family.
You only use HAL / any other similar abstraction because "you didn't pick such a powerful MCU only to have to go down to register level", or because you're told you should never ever touch registers. You develop much faster and you're able to release two products instead of just one written with registers. However, when there are execution time constraints / transmission speeds you have to hit, you find yourself picking MCUs more powerful than should theoretically be needed. Sometimes you find yourself writing wrappers around HAL because it doesn't give you exactly the functionality you need - it feels like making things more complicated than they should be.
So after all, if there is one thing to take away from what I'm trying to say, it is that you should use what is suitable for the job on a case-by-case basis. In the case of the STM32, you nowadays have three options: HAL (top abstraction level), HAL LL (low-level abstraction - often simple wrapper functions around register accesses) or using registers directly. Which one you choose should come from what your requirements are.
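To make those levels concrete, here is a hedged sketch of toggling the Nucleo user LED (assumed to be on PA5) at the HAL level and at the register level; the family header name below is just an example, and the LL variant (LL_GPIO_TogglePin) sits between the two:

```c
#include "stm32f4xx_hal.h"   /* assumed family header; adjust per device */

void toggle_led_hal(void)
{
    /* HAL: portable across STM32 families, a few function calls deep */
    HAL_GPIO_TogglePin(GPIOA, GPIO_PIN_5);
}

void toggle_led_registers(void)
{
    /* Direct register access via the CMSIS device header:
     * a single read-modify-write on the output data register */
    GPIOA->ODR ^= (1U << 5);
}
```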
I use Nucleo boards all the time. They allow me to start writing software before I have the actual hardware ready.
Choosing between HAL and registers is more a programmer's choice than an "industry standard". I personally use the HAL drivers when programming more complicated peripherals like USB and Ethernet, to avoid writing the full stacks from scratch.
I have just finished my degree and we heavily used the STM32 platform with CMSIS and HAL.
It is a matter of preference. The HAL libraries offer higher abstraction but come with some quirks that are not very intuitive. The HAL was (and is) buggy sometimes. We have encountered SPI transfer bugs which made the higher transfer rates of SPI unusable because of a delay in the per-byte transfer.
CMSIS offers lower-level access but still abstracts away from simple bit manipulation. I don't think that direct register access is a great way to program anymore, and at least CMSIS should be used. But it is still a matter of opinion, preference and what is right for the job at hand. If you need something quick: HAL. If you need really fine control: CMSIS.
(Side note: I believe that CMSIS has been phased out in favor of HAL, but it is still usable at this time.)
All of the other answers are valid.
Just wanted to chime in and say I have used STM32 regularly at my job (at two different companies). We used the HAL drivers at both companies. There are obviously some issues with them, but they are used regularly enough by others that you can easily find support online. Additionally, CubeMX does a decent job with them, at least enough to get you started on using the peripherals, so the zero-to-something step feels smaller. But to get really optimized code and design, you may want to dive deeper to really understand what each of the HAL drivers is doing and decide which method works for your project, goals, and requirements.

Mixing OCaml and C: is it worth the pain?

I am faced with the task of building a new component to be integrated into a large existing C codebase. The component is essentially a kind of compiler, and will be complicated enough that I would like to write it in OCaml (for reasons along the lines of those given here). I know that OCaml-C interaction is possible (as per the manual and this tutorial), but it looks somewhat painful.
What I'd like to know is whether others here have attempted large-scale integration of OCaml and C code, what were some of the unexpected gotchas they found, and whether at the end of the day they concluded that they would have been better off just writing the new code in C.
Note, I'm not trying to start a debate about the merits of functional versus imperative programming: let's just say we assume that OCaml happens to be the right tool for the job I have in mind, and the potential difficulty in integration is the only issue. I also don't have the option of rewriting the rest of the codebase.
To give a little more detail about the task: the component I need to implement is a certain kind of query optimizer that incorporates some research ideas my group at UC Davis is working on, and will be integrated into PostgreSQL so that we can run experiments. (A query optimizer is, essentially, a compiler.) The component would be invoked from C code, would function mostly independently but would make a certain number of calls to other PostgreSQL components to retrieve things like system catalog information, and would construct a complex C data structure (representing a physical query plan) as output.
Apologies for the somewhat open-ended question, but I'm hoping the community might be able to save me a little trouble :)
Thanks,
TJ
Great question. You should be using the better tool for the job.
If in fact your intention is to use the better tool for the job (and you are sure lex and yacc are going to be a pain) then I have something to share with you: it's not painful at all to call OCaml from C, and vice versa. Most of the time I've been writing OCaml calling C, but I have written a few calls the other way. They've mostly been debug functions that don't return a result. The calling back and forth is really about packing and unpacking the OCaml value type on the C side. That tutorial you mention covers all of that, and very well.
I'm opposed to Ron Savage's remark that you have to be an expert in the language. I recall starting out where I work and, within a few months, without knowing what a "functor" was, being able to call C and writing a thousand lines of C for numerical recipes and abstract data types. There were some hiccups (not with unpacking types, but with garbage collection of abstract data types), but it wasn't bad at all. Most of the inner loops in the project are written in C, taking advantage of SSE, external libraries (LAPACK), tighter optimized loops, and some inlined hand-optimized assembly.
I think you might need to be experienced with designing a large project and demarcating functional and imperative sections. I would really assess how much OCaml you are going to be writing, and what kind of values you want to pass to C. I'm saying this because I'd be wary of recommending that someone pass a recursive data structure from OCaml to C; in practice it would mean a lot of unpacking of tuples and their contents, and thus a lot of opportunity for confusion and bugs.
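For what it's worth, here is a hedged sketch of the C side of that boundary when C is the caller, using the standard OCaml runtime headers. The callback name "optimize_query" and the int-in/int-out signature are invented for illustration; a real query plan would mean far more packing and unpacking of values:

```c
#include <stdio.h>
#include <caml/mlvalues.h>
#include <caml/callback.h>

/* Call an OCaml function that the OCaml side exposed with
 *   Callback.register "optimize_query" optimize_query
 * argv is the usual argv from main(); the runtime needs it once. */
int call_ocaml_optimizer(int query_id, char **argv)
{
    static int started = 0;
    if (!started) {
        caml_startup(argv);   /* initialize the OCaml runtime exactly once */
        started = 1;
    }

    const value *optimize = caml_named_value("optimize_query");
    if (optimize == NULL) {
        fprintf(stderr, "OCaml callback not registered\n");
        return -1;
    }

    /* Plain int in, plain int out keeps the packing trivial; richer data
     * structures are where the unpacking pain described above begins. */
    return Int_val(caml_callback(*optimize, Val_int(query_id)));
}
```

The matching OCaml side just has to register the function under that name with Callback.register before control returns to C.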
I once wrote a reasonably complex OCaml-C hybrid program. I was frustrated by what I found to be inadequate documentation, and I ended up spending too much time dealing with garbage collection issues. However, the resulting program worked and was fast.
I think there is a place for OCaml-C integration, but make sure it is worth the hassle. It might be simpler to have the programs communicate over a socket (assuming such IO operations won't eliminate the performance you want). It might also be more sane to just write the whole thing in C.
Interoperability is the Achilles heel of standalone implementations of statically typed languages, particularly those without JIT compilation like OCaml. My own experience, having used OCaml for over 5 years, is that the only reliable bindings are across simple APIs that do little more than pass large arrays, e.g. LAPACK. Even slightly more complicated bindings like those to FFTW took years to stabilize, and others, like OpenGL and GLU, remain an unsolved problem. In particular, I found major bugs in binding code written by two of the authors of the OCaml compiler. If they cannot get it right, there is little hope for the rest of us...
However, all is not lost. The solution is simply to use looser bindings. Rather than handling interoperability at the C level with a low-level type-unsafe interface, use a high-level interface like XML-RPC with string passing or even over sockets. This is much easier to get right and, as you say, will let you leverage the enormous practical benefits offered by OCaml for this application.
My rule of thumb is to stick with the language / model / style used in the existing code-base, so that future maintenance developers inherit a consistent and understandable set of application code.
The only way I could justify something like what you are suggesting would be if:
You are an Expert at OCaml AND a Novice at C (so you'll be 20x as productive)
You have successfully integrated it with a C library before (apparently not)
If you are at all more familiar with C than OCaml, you've just lost any "theoretical" gain from OCaml being easier to use when writing a compiler - plus it seems as though you will have more peers familiar with C around you than OCaml.
That's my "grumpy old coder" 2 cents (which used to only cost a penny!).

What type of programs are best written in C [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 7 years ago.
Joel and company expound on the virtues of learning C and how the best way to learn a language is to actually write programs using it. To that end, which types of applications are most suitable to the C programming language?
Edit:
I am looking for what C is good at. This likely does not coincide with the best way of learning C.
Code where you need absolute control over memory management. Code where you need to be utterly in control of speed versus memory trade-offs. Very low-level file manipulation (such as access to the raw devices).
Examples include OS kernels and embedded apps.
In the late 1980s, I was head of the maintenance team on a C system that was more than a million lines of code. It did database access (Oracle), X Windows graphics, interprocess communications, all sorts of good stuff. It ran on VMS and several varieties of Unix. But if I were to recreate that system today, I sure wouldn't use C, I'd use Java. Others would probably use C#.
Low level functions such as OS kernel and drivers. For those, C is unbeatable.
You can use C to write anything. It is a very general-purpose language. After doing so for a little while you will understand why there are other "higher level" languages available.
"Learn C", by all means, but don't stick with it for too long. Other languages are far more productive.
I think the gist of the people who say you need to learn C is so that you understand the relationship between high-level code and the machine internals and what exactly happens with bits, bytes, program execution, etc.
Learn that, and then move on.
Those 100 lines of Python code that were accounting for 80% of your execution time.
Small apps that don't have a UI, especially when you're trying to learn.
Edit: After thinking a little more on this, I'd add the following: if you already know a higher-level language and you're trying to learn more about C, a good route may be to not create a whole new C app, but instead create a C DLL and add functions to it that you can call from the higher-level language. In this way you can replace simple functions that your high-level language already has, to ensure that your C function does what it should (gives you pre-built testing), lets you code mostly in what you're familiar with, uses the language on a problem you're already familiar with, and teaches you about interop.
Anything where you think of using assembly.
Number crunching (for example, libraries to be used at a higher level from some other language like Python).
Embedded systems programming.
A lot of people are saying OS kernel and device drivers which are, of course, good applications for C. But C is also useful for writing any performance critical applications that need to use every bit of performance the hardware is capable of.
I'm thinking of applications like database management systems (MySQL, Oracle, SQL Server), web servers (Apache, IIS), or even web browsers (look at the way Chrome was written).
You can do so many optimizations in C that are just impossible in languages that run in virtual machines like Java or .NET. For instance, databases and servers support many simultaneous users and need to scale very well. A database may need to share data structures between multiple users (threads/processes), but do so in a way that efficiently uses CPU caches. In C, you can use an operating system call to determine the size of the cache, and then align a data structure appropriately to the cache line so that the line does not "ping pong" between caches when multiple threads access adjacent, but unrelated, data (so-called "false sharing"). This is one example. There are many others.
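As a rough sketch of that cache-line trick (the 64-byte line size is an assumption; on Linux you could query it at run time with sysconf(_SC_LEVEL1_DCACHE_LINESIZE)):

```c
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; query the OS/CPU in real code */

/* Each worker thread gets its own counter, aligned to the start of a cache
 * line and padded to fill it, so updates by one thread do not invalidate
 * the line holding another thread's counter ("false sharing"). */
struct per_thread_counter {
    _Alignas(CACHE_LINE) uint64_t hits;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct per_thread_counter counters[16];   /* one slot per worker thread */
```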
A bootloader. Some assembly is also required, which is actually very nice.
Where you feel the need for 100% control over your program.
This is often the case in lower-layer OS stuff like device drivers, or real embedded devices based on MCUs, etc. (all of this and more is already mentioned above).
But please note that C is a mature language that has been around for many years and will be around for many more. It has many really good debugging tools and still a huge number of developers that use it. (It has probably lost a lot of ground to more trendy languages, but it is still huge.)
All its strengths and weaknesses are well known, and the language will probably not change any more, so there is not much room for surprises...
This also means that it would probably be a good choice if you have an application with a long expected life cycle.
/Johan
Anything where you need a minimum of "magic" and need the computer to do exactly what you tell it to, no more and no less. Anything where you don't trust the "magic" of garbage collection to handle your memory because it might not be as efficient as what you can hand-code. Anything where you don't trust the "magic" of built-in arrays, strings, etc. not to waste too much space. Anything where you want to be able to reason about exactly what ASM instructions the compiler will emit for a given piece of code.
In other words, not too much in the real world. Most things would benefit more from higher level abstraction than from this kind of control. However, OS code, device drivers, and a few things that have to be near optimal in both space and speed might make sense to write in C. Higher level languages can do pretty well competing with C on speed, but not necessarily on space.
Embedded stuff, where memory usage and CPU speed matter.
The interrupt handler part of an OS (and maybe two or three more functions in it).
Even if some of you will now start to bash heavily on me:
I don't think that any decent app should be written in C - it is way too error-prone.
(And yes, I do know what I am talking about, having written an awful lot of code in C myself: OSes, compilers, VMs, network protocols, RT-control stuff, etc.)
Using a high level language makes you so much more productive. Speed is typically gained by keeping the 10-90 rule in mind: 90% of CPU time is spent in 10% of your code (which you should optimize first).
Also, using the right algorithm might give more performance than optimizing bits in C. And doing things right in C is so much more trouble.
PS: I do really mean what I wrote in the second sentence above; you can write a complete high-performance OS in a high-level language like Lisp or Smalltalk, with only a few minor parts in C. Think how the 70's Lisp machines would fly on today's hardware...
Garbage collectors!
Also, simple programs whose primary job is to make operating-system calls. For example, I need to write a short C program called timeout that:
Takes a command line as argument, with a number of seconds for that command to run
Forks two child processes, one to run the command and one to sleep for N seconds
When the first of the child processes exits, kills the other, then exits
The effect will be to run a command with a limit on wall-clock time.
I and others on this forum have tried several different solutions using shells and/or perl. All are convoluted and none quite do the right thing. In C the solution will be easy, because all the OS facilities are right where you can get at them.
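A minimal sketch of that timeout program, assuming POSIX fork/exec and with error handling mostly trimmed; the exit code 124 for the timed-out case is just a convention chosen here:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s seconds command [args...]\n", argv[0]);
        return 2;
    }

    pid_t worker = fork();          /* child 1: runs the command */
    if (worker == 0) {
        execvp(argv[2], &argv[2]);
        perror("execvp");
        _exit(127);
    }

    pid_t sleeper = fork();         /* child 2: just sleeps for N seconds */
    if (sleeper == 0) {
        sleep((unsigned)atoi(argv[1]));
        _exit(0);
    }

    /* Whichever child exits first decides the outcome; kill the other. */
    int status;
    pid_t first = wait(&status);
    kill(first == worker ? sleeper : worker, SIGKILL);
    wait(NULL);                     /* reap the second child */

    return first == worker ? WEXITSTATUS(status) : 124;
}
```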
A few kinds that I can think of:
Systems programming that directly uses Unix/Linux or Win32 system calls
Performance-critical code that doesn't have much string manipulation in it (e.g., number crunching)
Embedded programming or other applications that are severely resource-constrained
Basically, C gives you portable, efficient access to the native capabilities of the machine; everything else is your responsibility. In particular, string manipulation in C is tedious, error-prone, and generally nasty; the most effective way to do extensive string operations with C may be to use it to implement a language that handles strings better...
Examples are: embedded apps, kernel code, drivers, raw sockets.
However if you find yourself more productive in C then go ahead and build whatever you wish. Use the language as a tool to get your problem solved.
A C compiler.
Research in maths and physics. There are probably two alternatives: C and C++, but such features of the latter as encapsulation and inheritance are not very useful there. One could prefer to use C++ "as a better C" or just stay with C.
Well, most people are suggesting systems-programming-related things like OS kernels, device drivers, etc. These are difficult and time consuming. Maybe the most fun thing to do with C is console programming. Have you heard of the HAM SDK? It is a complete software development kit for the Nintendo GBA, and making games for it is fun. There is also the cc65 compiler, which supports NES programming (although not completely). You can also make good emulators. Trust me, C is pretty helpful. I was originally a Python fan and hated C because it was complex, but after you get used to it, you can do anything with C. Now I use CPython to embed Python in my C programs (if needed) and code mostly in C.
C is also great for portability: there is a C compiler for just about every OS and almost every console and mobile device. I've even seen one that supports some calculators!
Well, if you want to learn C and have some fun at the same time, might I suggest obtaining NXC and a Lego Mindstorms set? NXC is a C compiler for the Lego Mindstorms.
Another advantage of this approach is that you can compare the effort to program the Mindstorms "brick" with C and then try LEJOS and see how it goes with Java.
All great fun.
Implicit in your question is the assumption that a 'high-level' language like Python or Perl (or Java or ...) is fast enough, small enough, ... enough for most applications. This is of course true for most apps and some choice X of language. Given that, your language of choice almost certainly has a foreign function interface (FFI). Since you're looking to learn C, create a module in the FFI built in C.
For example, let's assume that your tool of choice is Python. Reimplement a subset of Numpy in C. Since C is a pretty fast language, and has, in C99, a clear numeric library interface, you'll get the opportunity to experience the power of C in an appropriate setting.
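For example, if Python is the high-level language, the C side of such an extension module might look roughly like this; the module name cnum and function csum are invented, and error handling is kept minimal:

```c
#include <Python.h>

/* csum(sequence_of_numbers) -> float, summed in C */
static PyObject *csum(PyObject *self, PyObject *args)
{
    PyObject *seq;
    if (!PyArg_ParseTuple(args, "O", &seq))
        return NULL;

    double total = 0.0;
    Py_ssize_t n = PySequence_Length(seq);
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *item = PySequence_GetItem(seq, i);
        if (item == NULL)
            return NULL;
        total += PyFloat_AsDouble(item);   /* works for ints and floats */
        Py_DECREF(item);
    }
    return PyFloat_FromDouble(total);
}

static PyMethodDef methods[] = {
    {"csum", csum, METH_VARARGS, "Sum a sequence of numbers in C."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT, "cnum", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_cnum(void)
{
    return PyModule_Create(&moduledef);
}
```

You would compile it against the Python development headers and then import it from Python like any other module, comparing its results with the pure-Python version you already trust.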
ISAPI filters for Internet Information Server.
Before actually writing C code, I would suggest first reading good C code.
Choose a subject you want to concentrate on. Basically any application can be written in C, but I assume a GUI application will not be your first choice, so find a few open source projects to look into.
Not every open source project is the best code to look at. I assume that after you select a subject there is room for another question: ask for the best open source project in that field.
Play with it, understand how it works, modify some functionality...
The best way to learn is to learn from somebody good.
Photoshop plugin filters. There is lots and lots of interesting graphical manipulation you can do with those, and you'll need pure C to do it in.
For instance, that's a gateway to fractal generation, Fourier transforms, fuzzy algorithms, etc. Really, anything that can manipulate image/color data and give interesting results.
Don't treat C as a beefed-up assembler. I've done some serious apps in it when it was the natural language (e.g., the target machine was a Solaris box with X11).
Write something with some meat on it. Write a client-server chess program, where the AI is on a server and the UI is displayed in X11; once you've done that you will really know C.
I wonder why nobody stated the obvious:
Web applications.
Any place where the underlying libraries are entirely in C is a good candidate for staying in C - OpenGL, Lua extensions, PHP extensions, old-school windows.h, etc.
I prefer C for anything like parsing or code generation - anything that doesn't need a lot of data structure (read OOP). Its library footprint is very light, because class libraries are non-existent. I realize I'm unusual in this, but just as lots of things are "considered harmful", I try to have/use as little data structure as possible.
Following on from what someone else said. C seems a good language to implement the language in which you write the rest of your software.
And (mutatis mutandis) the virtual machine which runs the rest of your software.
I'd say that with the minuscule runtime environment and its self-contained nature, you might start by creating some CLI utilities such as grep or tail (or half the commands in Unix). Anything that uses only STDOUT, STDIN and file manipulation is a good candidate.
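For instance, a bare-bones line counter in the spirit of wc -l stays entirely inside that STDIN/STDOUT/file territory (a toy sketch, not a replacement for the real utility):

```c
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* Read from a named file if given, otherwise from standard input. */
    FILE *in = (argc > 1) ? fopen(argv[1], "r") : stdin;
    if (in == NULL) {
        perror(argv[1]);
        return 1;
    }

    long lines = 0;
    int c;
    while ((c = fgetc(in)) != EOF)
        if (c == '\n')
            lines++;

    printf("%ld\n", lines);
    if (in != stdin)
        fclose(in);
    return 0;
}
```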
This isn't exactly answering your question because I wouldn't actually CHOOSE to use C in such an app, but I hope it's answering the question you meant to ask--"what would be a good type of app to use learn C on?"
C isn't actually that bad a language--it's pretty easy to understand your code at an assembly-language level, which is quite useful, and the language constructs are few, leaving a very sparse language.
To answer your actual question, the very best use of C is the one for which it was created--porting itself (and UNIX) to other CPU architectures. This explains the lack of language features very well; it also explains the existence of pointers, which are really just a construct to make the compiler work less--any code created with pointers could be created without them (just as well optimized), but it becomes much harder to create a compiler that does so.
Digital signal processing - all Pure Data extensions are written in C. This can be done in other languages also, but it has had good success in C.

C (or any) compilers deterministic performance

Whilst working on a recent project, I was visited by a customer QA representative, who asked me a question that I hadn't really considered before:
How do you know that the compiler you are using generates machine code that matches the c code's functionality exactly and that the compiler is fully deterministic?
To this question I had absolutely no reply, as I have always taken the compiler for granted. It takes in code and spews out machine code. How can I go about testing that the compiler isn't actually adding functionality that I haven't asked for? Or, even more dangerously, implementing code in a slightly different manner to that which I expect?
I am aware that this is perhaps not really an issue for everyone, and indeed the answer might just be... "you're over a barrel and deal with it". However, when working in an embedded environment, you trust your compiler implicitly. How can I prove to myself and QA that I am right in doing so?
You can apply that argument at any level: do you trust the third party libraries? do you trust the OS? do you trust the processor?
A good example of why this may be a valid concern of course, is how Ken Thompson put a backdoor into the original 'login' program ... and modified the C compiler so that even if you recompiled login you still got the backdoor. See this posting for more details.
Similar questions have been raised about encryption algorithms -- how do we know there isn't a backdoor in DES for the NSA to snoop through?
At the end of the day you have to decide if you trust the infrastructure you are building on enough not to worry about it; otherwise you have to start developing your own silicon chips!
For safety-critical embedded applications, certifying agencies require you to satisfy the "proven-in-use" requirement for the compiler. There are typically certain requirements (kind of like "hours of operation") that need to be met and proven by detailed documentation. However, most people either cannot or don't want to meet these requirements because it can be very difficult, especially on your first project with a new target/compiler.
One other approach is basically NOT to trust the compiler's output at all. Any compiler deficiencies, and even language-dependent ones (Appendix G of the C90 standard, anyone?), need to be covered by a strict set of static analysis, unit and coverage testing, in addition to the later functional testing.
A standard like MISRA-C can help restrict the input to the compiler to a "safe" subset of the C language. Another approach is to restrict the input to the compiler to a subset of the language and test what the output for that entire subset is. If your application is built only of components from the subset, it is assumed to be known what the output of the compiler will be. This usually goes by "qualification of the compiler".
The goal of all of this is to be able to answer the QA representative's question with "We don't just rely on determinism of the compiler but this is the way we prove it...".
You know by testing. When you test, you're testing both your code and the compiler.
You will find that the odds that you or the compiler writer have made an error are much smaller than the odds that you would make an error if you wrote the program in question in some assembly language.
There are compiler validation suites available.
The one I remember is "Perennial".
When I worked on a C compiler for an embedded SoC processor we had to validate the compiler against this and two other validation suites (whose names I forget). Validating the compiler to a certain level of conformance to these test suites was part of the contract.
It all boils down to trust. Does your customer trust any compiler? Use that, or at least compare output code between yours and theirs.
If they don't trust any, is there a reference implementation for the language? Could you convince them to trust it? Then compare yours against the reference or use the reference.
This all assuming you actually verify the actual code you get from the vendor/provider and that you check the compiler has not been tampered with, which should be the first step.
Anyhow, this still leaves the question of how you would verify a compiler from scratch, without references. That certainly looks like a ton of work, and it requires a definition of the language, which is not always available - sometimes the definition is the compiler.
How do you know that the compiler you are using generates machine code that matches the c code's functionality exactly and that the compiler is fully deterministic?
You don't, that's why you test the resultant binary, and why you make sure to ship the same binary you tested with. And why when you make 'minor' software changes, you regression test to make sure none of the old functionality broke.
The only software I've certified is avionics. FAA certification isn't rigorous enough to prove the software works correctly, while at the same time it does force you to jump through a certain amount of hoops. The trick is to structure your 'process' so it improves quality as much as possible, with as little extraneous hoop-jumping as you can get away with. So anything that you know is worthless and won't actually find bugs, you can probably weasel out of. And anything you know you should do because it will find bugs that isn't explicitly asked for by the FAA, your best bet is to twist words until it sounds like you're giving the FAA/your QA people what they asked for.
This actually isn't as dishonest as I've made it sound, in general the FAA cares more about you being conscientious and confident that you're trying to do a good job, than about what exactly you do.
Some intellectual ammunition might be found in Crosstalk, a magazine for defense software engineers. This question is the kind of thing they spend many waking hours on. http://www.stsc.hill.af.mil/crosstalk/2006/08/index.html (If i can find my old notes from an old project, i'll be back here...)
You can never fully trust the compiler, even highly recommended ones. They could release an update that has a bug, and your code compiles the same. This problem is compounded when updating old code with the buggy compiler, doing testing and shipping out the goods only to have the customer ring you 3 months later with a problem.
It all comes back to testing, and if there is one thing I have learnt it is to thoroughly test after any non-trivial change. If the problem seems impossible to find, have a look at the compiled assembler and check it's doing what it should be doing.
On several occasions I have found bugs in the compiler. One time there was a bug where 16-bit variables would get incremented but without carry, and only if the 16-bit variable was part of an extern struct defined in a header file.
...you trust your compiler implicitly
You'll stop doing that the first time you come across a compiler bug. ;-)
But ultimately this is what testing is for. It doesn't matter to your test regime how the bug got in to your product in the first place, all that matters is that it didn't pass your extensive testing regime.
Well.. you can't simply say that you trust your compiler's output - particularly if you work with embedded code. It is not hard to find discrepancies between the code generated when compiling the very same code with different compilers. This is the case because the C standard itself is too loose. Many details can be implemented differently by different compilers without breaking the standard. How do we deal with this stuff? We avoid compiler dependent constructs whenever possible. We may deal with it by choosing a safer subset of C like Misra-C as previously mentioned by the user cschol. I seldom have to inspect the code generated by the compiler but that has also happened to me at times. But, ultimately, you are relying on your tests in order to make sure that the code behaves as intended.
Is there a better option out there? Some people claim that there is. The other option is to write your code in SPARK/Ada. I have never written code in SPARK but my understanding is that you would still have to link it against routines written in C that would deal with the "bare metal" stuff. The beauty of SPARK/Ada is that you are absolutely guaranteed that the code generated by any compiler is always going to be the same. No ambiguity whatsoever. On top of that, the language allows you to annotate the code with explanations as to how the code is intended to behave. The SPARK toolset will use these annotations to formally prove that the code written does indeed do what the annotations have described. So I have been told that for critical systems, SPARK/Ada is a pretty good bet. I have never tried it myself though.
You don't know for sure that the compiler will do exactly what you expect. The reason is, of course, that a compiler is a piece of software, and is therefore susceptible to bugs.
Compiler writers have the advantage of working from a high quality spec, while the rest of us have to figure out what we're making as we go along. However, compiler specs also have bugs, and complex parts with subtle interactions. So, it's not exactly trivial to figure out what the compiler should be doing.
Still, once you decide what you think the language spec means, you can write a good, fast, automated test for every nuance. This is where compiler writing has a huge advantage over writing other kinds of software: in testing. Every bug becomes an automated test case, and the test suite can be very thorough. Compiler vendors have a lot more budget to invest in verifying the correctness of the compiler than you do (you already have a day job, right?).
What does this mean for you? It means that you need to be open to the possibilities of bugs in your compiler, but chances are you won't find any yourself.
I would pick a compiler vendor that is not likely to go out of business any time soon, that has a history of high quality in their compilers, and that has demonstrated their ability to service (patch) their products. Compilers seem to get more correct over time, so I'd choose one that's been around a decade or two.
Focus your attention on getting your code right. If it's clear and simple, then when you do hit a compiler bug, you won't have to think really hard to decide where the problem lies. Write good unit tests, which will ensure that your code does what you expect it to do.
Try unit testing.
If that's not enough, use different compilers and compare the results of your unit tests. Compare strace outputs, run your tests in a VM, keep a log of disk and network I/O, then compare those.
Or propose to write your own compiler and tell them what it's going to cost.
The most you can easily certify is that you are using an untampered compiler from provider X. If they do not trust provider X, it's their problem (if X is reasonably trustworthy). If they do not trust any compiler provider, then they are totally unreasonable.
Answering their question: I make sure I'm using an untampered compiler from X through these means. X is well reputed, plus I have a nice set of tests that show our application behaves as expected.
Everything else is starting to open the can of worms. You have to stop somewhere, as Rob says.
Sometimes you do get behavioural changes when you request aggressive levels of optimisation.
And optimisation and floating point numbers? Forget it!
For most software development (think desktop applications) the answer is probably that you don't know and don't care.
In safety-critical systems (think nuclear power plants and commercial avionics) you do care and regulatory agencies will require you to prove it. In my experience, you can do this one of two ways:
Use a qualified compiler, where "qualified" means that it has been verified according to the standards set out by the regulatory agency.
Perform object code analysis. Essentially, you compile a piece of reference code and then manually analyze the output to demonstrate that the compiler has not inserted any instructions that can't be traced back to your source code.
You get the one Dijkstra wrote.
Select a formally verified compiler, like Compcert C compiler.
Changing the optimization level of the compiler will change the output.
Slight changes to a function may make the compiler inline or no longer inline a function.
Changes to the compiler (gcc versions for example) may change the output
Certain library functions may be intrinsic (i.e., emit optimized assembly inline) while most others are not.
The good news is that for most things it really doesn't matter that much. Where it does, you may want to consider assembly if it really matters (e.g., in an ISR).
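A small illustration of the first points above; what follows describes typical gcc/clang behaviour rather than a guarantee:

```c
/* The same source, different machine code: with optimizations off, most
 * compilers emit an actual loop here; with optimizations on, gcc and clang
 * typically fold the whole function to a constant (499500) and may inline
 * it into the caller.  Diffing the two assembly listings (e.g. built with
 * -S at -O0 and -O2) makes the point concrete. */
static int sum_to_1000(void)
{
    int total = 0;
    for (int i = 0; i < 1000; i++)
        total += i;
    return total;
}

int main(void)
{
    return sum_to_1000() & 0xff;   /* keep the call alive */
}
```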
If you are concerned about unexpected machine code which doesn't produce visible results, the only way is probably to contact compiler vendor for certification of some sort which will satisfy your customer.
Otherwise you'll know it the same you know about bugs in your code - testing.
Machine code from modern compilers can be vastly different and totally incomprehensible for puny humans.
I think it's possible to reduce this problem to the Halting Problem somehow.
The most obvious problem is that if you use some kind of program to analyze the compiler and its determinism, how do you know that your program gets compiled correctly, and produces the correct result?
If you're using another, "safe" compiler, though, I'm not sure. What I am sure of is that writing a compiler from scratch would probably be an easier job.
Even a qualified or certified compiler can produce undesirable results. Keep your code simple and test, test, test. That, or walk through the machine code by hand while not allowing any human error. Plus there's the operating system, or whatever environment you are running on (preferably no operating system, just your program).
This problem has been solved in mission critical environments since software and compilers began. As many of the others who have responded also know. Each industry has its own rules from certified compilers to programming style (you must always program this way, never use this or that or the other), lots of testing and peer review. Verifying every execution path, etc.
If you are not in one of those industries, then you get what you get. A commercial program on a COTS operating system on COTS hardware. It will fail, that is a guarantee.
If you're worried about malicious bugs in the compiler, one recommendation (IIRC, an NSA requirement for some projects) is that the compiler binary predate the writing of the code. At least then you know that no one has added bugs targeted at your program.
