I've been asked to help obfuscate a library (written in C++) which will be distributed to clients. I've already discussed why obfuscation is not necessarily a good idea, and seeing as licensing will be integrated into the software many concerns regarding copy protection are moot.
Regardless, I've been asked to research methods anyway. I've looked into header mangling (and the like) as well as HARES, but I fail to find much that I can use for a library (naturally, these things would destroy any form of API rendering the library useless).
What techniques can I apply that would work for libraries? While I would appreciate recommendations for tools (or compiler flags, etc.) that might be helpful I would like to stress that this is not a tool-focused (i.e. closable) question, but rather one focused on applicable techniques.
I wouldn't put much energy to doing it very thoroughly, because the reverse engineer is going to win this round.
https://softwareengineering.stackexchange.com/questions/155131/is-it-important-to-obfuscate-c-application-code
Obfuscating C++ binaries is a bit of a losing battle. It depends on who you are dealing with, but if your reverse engineer is smart enough to use IDA Pro and a couple of plugins, and a good debugger then it shall all be for naught.
Obfuscation Priorities
Where you can give the reverse engineer useless function names.
Honestly, this doesn't help that much, since ultimately your code will have to call some kind of non obfuscated shared library to get anything done. At some point you will use the standard libary, or the STL, or even make a system call.
Add false pathways to confound static analysis
So that the reverse engineer can have fun with a debugger. Anti-analysis techniques are well known to the reverse engineer, and they can almost be circumvented with a debugger like ollydbg.
Write debugger foiling code
That reverse engineers love to play with. Again, this is an expected move, and the response is to just to step around the offending code, or to modify away the traps. Anyone with any formal training in RE will blast past this.
Pack most of my binary into an encryped region which is decryped by a stub just before execution.
Same answer as above. Reverse engineers train for this from day one.
Keep in mind the reverse engineers are looking for targeted morsals of information - very rarely are they trying to recreate the entire application. Security intensive code, code for license validation, code for home base communication, networking code. These are all prime targets - put your energy into making these thorny places to live.
Keep in mind that binaries from the largest corporations on the earth are routinely reverse engineered by people in their early 20's.
Don't leave your debugging symbols in the final binary, as those will definitely help with analysis.
If you are dedicated to doing this right, also focus on wasting the engineers time - time is always against the reverse engineer.
Remember, that any meaningful obfuscation might also cost you the performance gains that justified working in C++ in the first place. There are many zones in the C world (and for that matter the Java world) where meaningful obfuscation just isn't possible. Games for instance, cannot conceal their calls to the OpenGl APIs, nor can they truly prevent engineers from harvesting their shader code.
Also remember that the reverse engineer is watching your code at the assembly level most of the time. He'd rather have your function names, but he can live without it if need be. He can see what your program is doing at the most finite level possible. It is only a matter of time before he finds the critical routines.
For your purposes, find a program to mangle function names, make your boss happy, and call it a day. At least at that point, reverse engineering the software will not be trivial.
Well really you have 2 primary vectors that you have to guard against
Disassembley
Debugging
My favourite method for preventing the first issue is in memory decryption, take parts of your executable code and encrypt it, have it self decrypt in memory while your library is running, you can also checksum parts of the code and compare the checksum against what is loaded in ram ( have the encrypted portions check the decrypter and vice versa )
Another neat trick is to statically link libraries that you use into your executible so they cannot be easily swapped out to try to see what your code is doing.
Now debugging checking interrupt vectors helps, another trick is to check the 'timing' between various portions of code ( for example if more than a couple of milliseconds worth of delay occurs in code that should execute significantly faster than that then it can be assumed that the code is being debugged
a previous relevant question from me is here Reverse Engineering old paint programs
I have set up my base of operations here: http://animatorpro.org
wiki coming soon.
Okay, so now I have a 300,000 line legacy MSDOS codebase. It's sort of a "be careful what you wish for" situation. I am not an experienced C programmer. I'm not entirely inexperienced either, but for all intents and purposes I'm a noob to the language and in particular the intricacies of its libraries. I am especially ignorant of the vagaries of the differences between C programs written specifically for MSDOS and programs that are cross platform. However I have been studying this code base for over a year now, and this is what I know about Animator Pro:
Compilers and tools used:
Watcom C compiler
tcmake (make program from Turbo C)
386asm, a specialised assembler for the Phar Lap dos extender
and of course, the Phar Lap dos extender itself.
a selection of obscure dos utilities
Much of the compilation seems to be driven by batch files. Though I have obtained copies of all these tools, I have not yet succeeded at compiling it. (though I have compiled its older brother, autodesk animator original.
It's got a plugin system that replicates DLL before DLL's were available, based on REX. The plugin system handles:
Video Drivers (with a plethora of included VESA drivers)
Input drivers (including wacom tablets, and keyboards)
Drawing Tools
Inks (Like photoshop's filters, or blending modes)
Scripting Addons (essentially compiled scripts)
File formats
It's got its own script interpreter named POCO, based on the C language- The scripting language has enough power to do virtually all the things the plugin system can do- Just slower.
Given this information, this is my development plan. Please criticise this. The source code is available in the link above, so you can easily, if you are so inclined, assess the situation yourself.
Compile with its original tools.
Switch to using DJGPP, and make the necessary changes to get it to compile with that, plus the original assembler.
Include the Allegro.cc "Game" library, and switch over as much functionality to that library as possible- Perhaps by simply writing new video and input drivers that use the Allegro API. I'm thinking allegro rather than SDL because: there is a DOS version of Allegro, and fascinatingly, one of its core functions is the ability to play Animator Pro's native format FLIC.
Hopefully after 3, I will have eliminated most or all of the Assembler in the project. I say hopefully, because it's in an obscure dialect that doesn't assemble in any modern free assembler without significant modification. I have tried them all. Whatever is left gets converted to assemble in NASM, or to C code if I can define the assembler's actual function.
Switch the dos extender from Phar Lap to HX Dos http://www.japheth.de/HX.html, Which promises to replicate as much of the WIN32 api as possible. Then make all the necessary code changes for that to work.
Switch to the win32 version of Allegro.cc, assuming that the win32 version can run on top of HXDos. Make any further necessary changes
Modify the plugin system to use some kind of standard cross platform plugin library. What this would be, I have no idea. Maybe you can offer some suggestions? I talked to the developer who originally wrote the plugin system, and he said some of the things it does aren't possible on modern OS's because of segmentation restrictions. I'm not sure what this means, but I'm guessing it means all the plugins will need to be rewritten almost from scratch.
Magically, I got all the above done, and we can try and make it run in windows, osx, and linux, whilst dealing with other cross platform niggles like long file names, and things I haven't thought of.
Anyone got a problem with any of this? Is allegro a good choice? if not, why? what would you do about this plugin system? What would you do different? Is this whole thing foolish, and should I just rewrite it from scratch, using the original as inpiration? (it would apparently take the original developer "About a month" to do that)
One thing I haven't covered above is the text/font system. Not sure what to do about that, but Animator Pro has its own custom font format, but also is able to use Postscript Type 1 fonts, and some other formats.
My biggest concern with your plan, in a nutshell: Your approach seems to be to attempt to keep the whole enormous thing working at all times, tweaking the environment ever-further away from DOS. During each tweak to the environment, that means you will have approximately a billion subtle assumptions that might have broken at once, none of which you necessarily understand yet. Untangling them all at once will be incredibly painful.
If I were doing the port, my approach would be to disable as much code as possible to get SOMETHING running in a modern environment, and bring the parts back online, one piece at a time. Write a simple test harness program that loads a display driver and draws some stuff, and compile it for DOS to make sure you understand the interface. Then write some C code that implements the same interface, but with Allegro (or SDL or SFML), and make that program work under Windows or Linux. When the output differs, you have a simple test case to work from.
Your entire job on this port is swapping out implementations of various interfaces and functions with completely new ones. This is a job that unit testing excels at. Don't write any new code without a test of some kind that runs on the old code under DOS! Make your potential problems as small and simple as you possibly can. Port assembly code instead of rewriting it only if you're reasonably confident that it will actually make your job easier (ie, algorithmic stuff that compiles fine with few tweaks under NASM). Don't bite off a bigger piece than you can comfortably fit in your brain at once.
I, for one, look forward to seeing your progress! I think what you're attempting to do is great. Thanks for doing it.
Hmmm - I might approach it by writing an OpenGL video "driver" for it. and todays machines are fast enough with tons of ram that you could do all the pixel specific algorithms on main CPU into a back buffer and it would work. As the "generic" VGA driver just mapped the video buffer to a pointer this would be a place to start. There was a zoom mode in the UI so you can look at the pixels on a high res display.
It is often very difficult to take an existing non-trivial code base that wasn't written with portability in mind - you mention a few - and then try to make it portable. There will be a lot of problems on the way. It is probably a better idea to start from scratch and rewrite the code using the existing code as reference only. If you start from scratch you can leverage existing portable UI solution in your new project like Qt.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
Joel and company expound on the virtues of learning C and how the best way to learn a language is to actually write programs using that use it. To that effect, which types of applications are most suitable to the C programming language?
Edit:
I am looking for what C is good at. This likely does not coincide with the best way of learning C.
Code where you need absolute control over memory management. Code where you need to be utterly in control of speed versus memory trade-offs. Very low-level file manipulation (such as access to the raw devices).
Examples include OS kernel, and embedded apps.
In the late 1980s, I was head of the maintenance team on a C system that was more than a million lines of code. It did database access (Oracle), X Windows graphics, interprocess communications, all sorts of good stuff. It ran on VMS and several varieties of Unix. But if I were to recreate that system today, I sure wouldn't use C, I'd use Java. Others would probably use C#.
Low level functions such as OS kernel and drivers. For those, C is unbeatable.
You can use C to write anything. It is a very general purpose language. After doing so for a little while you will understand why there are other "higher level" languages available.
"Learn C", by all means, but don't don't stick with it for too long. Other languages are far more productive.
I think the gist of the people who say you need to learn C is so that you understand the relationship between high level code and the machine internals and what exaclty happens with bits, bytes, program execution, etc.
Learn that, and then move on.
Those 100 lines of python code that were accounting for 80% of your execution time.
Small apps that don't have a UI, especially when you're trying to learn.
Edit: After thinking a little more on this, I'd add the following: if you already know a higher-level language and you're trying to learn more about C, a good route may be to not create a whole new C app, but instead create a C DLL and add functions to it that you can call from the higher language. In this way you can replace simple functions that your high language already has to ensure that you C function does what it should (gives you pre-built testing), lets you code mostly in what you're familiar with, uses the language in a problem you're already familiar with, and teaches you about interop.
Anything where you think of using assembly.
Number crunching (for example, libraries to be used at a higher level from some other language like Python).
Embedded systems programming.
A lot of people are saying OS kernel and device drivers which are, of course, good applications for C. But C is also useful for writing any performance critical applications that need to use every bit of performance the hardware is capable of.
I'm thinking of applications like database management systems (mySQL, Oracle, SQL Server), web servers (apache, IIS), or even we browsers (look at the way chrome was written).
You can do so many optimizations in C that are just impossible in languages that run in virtual machines like Java or .NET. For instance, databases and servers support many simultaneous users and need to scale very well. A database may need to share data structures between multiple users (threads/processes), but do so in a way that efficiently uses CPU caches. In C, you can use an operating system call to determine the size of the cache, and then align a data structure appropriately to the cache line so that the line does not "ping pong" between caches when multiple threads access adjacent, but unrelated data (so called "false sharing). This is one example. There are many others.
A bootloader. Some assembly also required, which is actually very nice..
Where you feel the need for 100% control over your program.
This is often the case in lower layer OS stuff like device drivers,
or real embedded devices based on MCU:s etc etc (all this and other is already mentioned above)
But please note that C is a mature language that has been around for many years
and will be around for many more years,
it has many really good debugging tools and still a huge number off developers that use it.
(It probably has lost a lot to more trendy languages, but it is still huge)
All its strengths and weaknesses are well know, the language will probably not change any more.
So there are not much room for surprises...
This also means that it would probably be a good choice if you have a application with a long expected life cycle.
/Johan
Anything where you need a minimum of "magic" and need the computer to do exactly what you tell it to, no more and no less. Anything where you don't trust the "magic" of garbage collection to handle your memory because it might not be as efficient as what you can hand-code. Anything where you don't trust the "magic" of built-in arrays, strings, etc. not to waste too much space. Anything where you want to be able to reason about exactly what ASM instructions the compiler will emit for a given piece of code.
In other words, not too much in the real world. Most things would benefit more from higher level abstraction than from this kind of control. However, OS code, device drivers, and a few things that have to be near optimal in both space and speed might make sense to write in C. Higher level languages can do pretty well competing with C on speed, but not necessarily on space.
Embedded stuff, where memory-usage and cpu-speed matters.
The interrupt handler part of an OS (and maybe two or three more functions in it).
Even if some of you will now start to bash heavily on me now:
I dont think that any decent app should be written in C - it is way too error prone.
(and yes, I do know what I am talking about, having written an awful lot of code in C myself (OSes, compilers, VMs, network protocols, RT-control stuff etc.).
Using a high level language makes you so much more productive. Speed is typically gained by keeping the 10-90 rule in mind: 90% of CPU time is spent in 10% of your code (which you should optimize first).
Also, using the right algorithm might give more performance than optimizing bits in C. And doing things right in C is so much more trouble.
PS: I do really mean what I wrote in the second sentence above; you can write a complete high performance OS in a high level language like Lisp or Smalltalk, with only a few minor parts in C. Think how the 70's Lisp machines would fly on todays hardware...
Garbage collectors!
Also, simply programs whose primary job is to make operating-system calls. For example, I need to write a short C program called timeout that
Takes a command line as argument, with a number of seconds for that command to run
Forks two child processes, one to run the command and one to sleep for N seconds
When the first of the child processes exits, kills the other, then exits
The effect will be to run a command with a limit on wall-clock time.
I and others on this forum have tried several different solutions using shells and/or perl. All are convoluted and none quite do the right thing. In C the solution will be easy, because all the OS facilities are right where you can get at them.
A few kinds that I can think of:
Systems programming that directly uses Unix/Linux or Win32 system calls
Performance-critical code that doesn't have much string manipulation in it (e.g., number crunching)
Embedded programming or other applications that are severely resource-constrained
Basically, C gives you portable, efficient access to the native capabilities of the machine; everything else is your responsibility. In particular, string manipulation in C is tedious, error-prone, and generally nasty; the most effective way to do extensive string operations with C may be to use it to implement a language that handles strings better...
examples are: embedded apps, kernel code, drivers, raw sockets.
However if you find yourself more productive in C then go ahead and build whatever you wish. Use the language as a tool to get your problem solved.
c compiler
Researches in maths and physics. There are probably two alternatives: C and C++, but such features of the latter as encapsulation and inheritance are not very useful there. One could prefer to use C++ "as a better C" or just stay with C.
Well most people are suggesting system programming related things like OS Kernels , Device Drivers etc. These are difficult and Time consuming. Maybe the most fun thing to with C is console programming. Have you heard of the HAM SDK? It is a complete software development kit for the Nintendo GBA , and making games for it is fun. There is also the CC65 Compiler which supports NES Programming (Althought Not Completely). You can also make good Emulators. Trust Me , C is pretty helpful. I was originally a Python fan, and hated C because it was complex. But after yuoget used to it, you can do anything with C. Now I use CPython to embed Python in my C Programs(if needed) and code mostly in C.
C is also great for portability , There is a C Compiler for like every OS and Almost Every Console And Mobile Device. Ive even seen one that supports some calculators!
Well, if you want to learn C and have some fun at the same time, might I suggest obtaining NXC and a Lego Mindstorms set? NXC is a C compiler for the Lego Mindstorms.
Another advantage of this approach is that you can compare the effort to program the Mindstorms "brick" with C and then try LEJOS and see how it goes with Java.
All great fun.
Implicit in your question is the assumption that a 'high-level' language like Python or Perl (or Java or ...) is fast enough, small enough, ... enough for most applications. This is of course true for most apps and some choice X of language. Given that, your language of choice almost certainly has a foreign function interface (FFI). Since you're looking to learn C, create a module in the FFI built in C.
For example, let's assume that your tool of choice is Python. Reimplement a subset of Numpy in C. Since C is a pretty fast language, and has, in C99, a clear numeric library interface, you'll get the opportunity to experience the power of C in an appropriate setting.
ISAPI filters for Internet Information Server.
Before actually write C code, i would suggest first read good C code.
Choose subject you want to concentrate on, basically any application can be written in C, but i assume GUI application will be not your first choice, and find few open source projects to look into.
Not any open source project is best code to look. I assume that after you will select a subject there is a place for another question, ask for best open source project in the field.
Play with it, understand how it's working modify some functionality...
Best way to learn is learn from somebody good.
Photoshop plugin filters. Lots and lots of interesting graphical manipulation you can do with those and you'll need pure C to do it in.
For instance that's a gateway to fractal generation, fourier transforms, fuzzy algorithms etc etc. Really anything that can manipulate image/color data and give interesting results
Don't treat C as a beefed up assembler. I've done some serious app's in it when it was the natural language (e.g., the target machine was a Solaris box with X11).
Write something with some meat on it. Write a client server chess program, where the AI is on a server and the UI is displaying in X11; once you've done that you will really know C.
I wonder why nobody stated the obvious:
Web applications.
Any place where the underlying libraries are entirely in C is a good candidate for staying in C - openGL, Lua extensions, PHP extensions, old-school windows.h, etc.
I prefer C for anything like parsing, code generation - anything that doesn't need a lot of data structure (read OOP). It's library footprint is very light, because class libraries are non-existent. I realize I'm unusual in this, but just as lots of things are "considered harmful", I try to have/use as little data structure as possible.
Following on from what someone else said. C seems a good language to implement the language in which you write the rest of your software.
And (mutatis mutandis) the virtual machine which runs the rest of your software.
I'd say that with the minuscule runtime environment and it's self-contained nature, you might start by creating some CLI utilities such as grep or tail (or half the commands in Unix). Anything that uses only STDOUT, STDIN and file manipulation is a good candidate.
This isn't exactly answering your question because I wouldn't actually CHOOSE to use C in such an app, but I hope it's answering the question you meant to ask--"what would be a good type of app to use learn C on?"
C isn't actually that bad a language--it's pretty easily to understand your code at an assembly language level which is quite useful, and the language constructs are few, leaving a very sparse language.
To answer your actual question, the very best use of C is the one for which it was created--porting itself (and UNIX) to other CPU architectures. This explains the lack of language features very well; it also explains the existence of Pointers which are really just a construct to make the compiler work less--any code created with pointers could be created without it (just as well optimized), but it becomes much harder to create a compiler to do so.
digital signal processing, all Pure Data extensions are written in C, this can be done in other languages also but has had good success in C
I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually [using log statements with timestamps and another script to parse the log and output to excel. phew...)
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first, Its plain C on a proprietary mobile platform
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themself in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's Vtune. It is a slightly different mindset to a traditional profiler that instruments the code. It works by sampling the processor to see where instruction pointer is 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .net or java since there isn't a way for the Vtune to map instruction pointer to symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us a more about your platform.
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use devpartner with MSVC 6 and XP
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now