Does obfuscation ever change variable/function types?

Are there certain obfuscation techniques that might change variable types? I.e., C gives you more than enough string to hang yourself with, and you can add a char and an int. Will code obfuscation ever change a char to an int, or convert any other variable types?

I don't have an example for type changes, but the entry about software obfuscation on Wikipedia says:
In addition, tools known as obfuscators can provide automated obfuscation to compiled applications that make reverse engineering more difficult for people and machines but do not alter the behavior of the obfuscated application.
According to this, the output of an obfuscator must have the same behavior as the original application, so nothing stops it from changing a variable's type as long as that behavior is preserved.
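As a purely hypothetical illustration (not the output of any real tool), an obfuscator could widen a char counter to an int and still be a legal obfuscation, because the observable behavior stays the same for any count the char could have held:

/* Hypothetical sketch: a char counter widened to int.  Behavior is
   unchanged as long as the count stays within the range a char held. */

/* original */
int count_spaces(const char *s)
{
    char n = 0;
    for (; *s != '\0'; ++s)
        if (*s == ' ')
            n++;
    return n;
}

/* possible obfuscated equivalent */
int count_spaces_obf(const char *s)
{
    int n = 0;
    for (; *s != '\0'; ++s)
        n += (*s == ' ');
    return n;
}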

Related

Make tolower as static

I need to make the standard library function tolower static instead of "public" scope.
I am compiling with MISRA C:2004, using IAR Embedded Workbench compiler. The compiler is declaring tolower as inline:
inline
int tolower(int _C)
{
    return isupper(_C) ? (_C + ('A' - 'a')) : _C;
}
I am getting the following error from the compiler:
Error[Li013]: the symbol "tolower" in RS232_Server.o is public but
is only needed by code in the same module - all declarations at
file scope should be static where possible (MISRA C 2004 Rule 8.10)
Here are my suggested solutions:
Use tolower in another module, in a dummy circumstance, so that it
is needed by more than one module.
Implement the functionality without using tolower. This is an embedded system.
Add a "STATIC" macro, defined as empty by default, but can be
defined to static before the ctype.h header file is included.
I'm looking for a solution to the MISRA linker error. I would prefer to make the tolower function static only for the RS232_Server translation unit (If I make tolower static in the standard header file, it may affect other future projects.)
Edit 1:
Compiler is IAR Embedded Workbench 6.30 for ARM processor.
I'm using an ARM7TDMI processor in 32-bit mode (not Thumb mode).
The tolower function is used with a debug port.
Edit 2:
I'm also getting the error for _LocaleC_isupper and _LocaleC_tolower
Solution:
I notified the vendor of the issue, as recommended by Michael Burr.
I decided not to rewrite the library routine because of localization
issues.
I implemented a function pointer in the main.c file as suggested by
gbulmer (a sketch follows below); however, this will be heavily commented
because it should be removed after IAR resolves the issue.
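A minimal sketch of that workaround, assuming the project layout described above (the variable name is mine and purely illustrative):

/* main.c -- workaround for MISRA rule 8.10 being raised against tolower.
   Taking the function's address at file scope makes the symbol look like
   it is needed outside RS232_Server.o, which silences the linker check.
   Remove once IAR resolves the underlying issue. */
#include <ctype.h>

/* never called; exists only to reference the symbol */
int (*misra_8_10_tolower_workaround)(int) = tolower;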
I'd suggest that you disable this particular MISRA check (you should be able to do that just for the RS232_Server translation unit) rather than use the one of the workarounds you suggest. In my opinion the utility of rule 8.10 is pretty minimal, and jumping through the kinds of hoops in the suggested workarounds seems more likely to introduce a higher risk than just disabling the rule. Keep in mind that the point of MISRA is to make C code less likely to have bugs.
Note that MISRA recognizes that "in some instances it may be necessary to deviate from the rules" and there is a documented "Deviation procedure" (section 4.3.2 in MISRA-C 2004).
If you won't or can't disable the rule for whatever reason, in my opinion you should probably just reimplement tolower()'s functionality in your own function, especially if you don't have to deal with locale support. It may also be worthwhile to open a support incident with IAR. You can prod them with rule 3.6, which says that "All libraries used in production code shall be written to comply with [MISRA-C]".
Who sells the MISRA linker? It seems to have an insane bug.
Can you work around it by taking its address, e.g. int (*foo)(int) = tolower; at file scope?
Edit: My rationale is:
a. that is the stupidest thing I've seen this decade, so it may be a bug, but
b. pushing it towards an edge case (a symbol having its name exported via a global) might shut it up.
For that to be correct behaviour, i.e. not a bug, it would have to be a MISRA error to include any library function once, like initialise_system_timer, initialise_watchdog_timer, ... , which just hurts my head.
Edit: Another thought. Again based on an assumption that this is an edge-case error.
Theory: Maybe the compiler is doing both the inline, and generating an implementation of the function. Then the function is (of course) not being called. So the linker rules are seeing that un-called function.
GNU has options to prevent in-lining.
Can you do the same for that use of tolower? Does that change the error? You could do a test by
#define inline /* nothing */
before including ctype.h, then undef the macro.
The other test is to define:
#define inline static inline
before including ctype.h, which is the version of inline I expect.
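A sketch of that first test, assuming the IAR header really does declare tolower with a plain inline keyword (strictly speaking, defining a keyword as a macro before a standard header is not conforming, but this is only a diagnostic experiment):

#define inline /* nothing */   /* or: #define inline static inline  for the second test */
#include <ctype.h>
#undef inline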
EDIT2:
I think there is a bug which should be reported. I would imagine IAR have a workaround. I'd take their advice.
After a night's sleep, I strongly suspect the problem is inline int tolower() rather than static inline int tolower or int tolower, because that makes the most sense. Having a public function which is never called does seem to be the symptom of a bug in a program.
Even with documentation, all the coding approaches have downsides.
I strongly support the OP: I would not change a standard header file. I think there are several reasons. For example, a future upgrade of the tool chain (which comes with a new set of headers) breaks an old app if it ever gets maintained. Or simply building the application on a different machine gives an error; merging two apparently correct applications may give an error; ... . Nothing good is likely to come of that. -100
Use #define ... to make the error go away. I do not like using macros to change standard libraries. This seems like the seeds of a long-term bad idea. If in future another part of the program uses another function with a similar problem, the application of the 'fix' gets worse. Some inexperienced developer might get the job of maintaining the code, and 'learns' wrapping strange pieces of #define trickery around #include is 'normal' practice. It becomes company 'practice' to wrap #include <ctype.h> in weird macro workarounds, which remains years after it was fixed. -20
Use a command line compiler option to switch off inlining. If this works, and the semantics are correct, i.e. including the header and using the function in two source files does not lead to a multiply defined function, then okay. I would expect it leads to an error, but it is worth confirming as part of the bug report. It lays a frustrating trap for someone else, who comes along in the future. If a different inline standard library function is used, and for some reason a person has to look at the generated code, it won't be inlined either. They might go a bit crazy wondering why inline is not honoured. I conjecture the reason they are looking at generated code is because performance is critical. In my experience, people would spend a lot of time looking at code, baffled, before looking at the build script for a program which works. If suppressing inline is used as a fix, I do feel it is better to do it everywhere than for one file. At least the semantics are consistent and the high-level comment or documentation might get noticed. Anyone using the build scripts as a 'quick check' will get consistent behaviour, which might cause them to look at the documentation. -1 (no-inline everywhere) -3 (no-inline on one file)
Use tolower in a bogus way in a second source file. This has the small benefit that the build system is not 'infected' with extra compiler options. Also it is a small file which will give the opportunity to explain the error being worked around. I do not like this much but I like it more than fiddling with standard headers. My current concern is it might not work; it might produce two copies which the linker can't resolve. I do like it better than the weird 'take its address and see if the linker shuts up' approach (which I do think is an interesting way to test the edge cases). -2
Code your own tolower. I don't understand the wider context of the application. My reaction is not to code replacements for library functions because I am concerned about testing (and the unit tests which introduce even more code) and long term maintenance. I am even more nervous with character I/O because the application might need to become capable of handling a wider character set, for example UTF-8 encoding, and a user-defined tolower will tend to get out of synch. It does sound like a very specific application (debugging to a port), so I could cope. I don't like writing extra code to work around things which look like bugs, especially if it is commercial software. -5
Can you convert the error to a warning without losing all of the other checking? I still feel it is a bug in the toolchain, so I'd prefer it to be a warning, rather than silence so that there is a hook for the incident and some documentation, and there is less chance of another error creeping in. Otherwise go with switching off the error. +1 (Warning) 0 (Error)
Switching the error off seems to lose the 'corporate awareness' that (IMHO) IAR owes you an explanation, and longer term fix, but it seems much better than messing with the build system, writing macro-nastiness to futz with standard libraries, or writing your own code which increases your costs.
It may just be me, but I dislike writing code to work around a deficiency in a commercial product. It feels like this is where the vendor should be pleased to have the opportunity to justify its license cost. I remember Microsoft charged us for incidents, but if the problem was proven to be theirs, the incident and fix were free. The right thing to do seems to be give them a chance to earn the money. Products will have bugs, so silently working around it without giving them a chance to fix it seems less helpful too.
First of all, MISRA-C:2004 allows neither inline nor C99. The upcoming MISRA 2012 will allow them. If you try to run C99 or non-standard code through a MISRA-C:2004 static analyser, all bets are off. The MISRA checker should give you an error for the inline keyword.
I believe a MISRA-C compliant version of the code would look like:
static uint8_t tolower(uint8_t ch)
{
    return (uint8_t)(isupper(ch) ? (uint8_t)((uint8_t)(ch - 'A') + 'a') : ch);
}
Some comments on this: MISRA encourages char type for character literals, but at the same time warns against using the plain char type, as it has implementation-defined signedness. Therefore I use uint8_t instead. I believe it is plain dumb to assume that there exist ASCII tables with negative indices.
(_C + ('A' - 'a')) is most certainly not MISRA-compliant; as MISRA regards it, it contains two implicit type promotions. MISRA regards character literals as char, rather than int as the C standard does.
Also, you have to typecast to underlying type after each expression. And because the ?: operator contains implicit type promotions, you must typecast the result of it to underlying type as well.
Since this turned out to be quite an unreadable mess, the best idea is to forget all about ?: entirely and rewrite the function. At the same time we can get rid of the unnecessary reliance on signed calculations.
static uint8_t tolower (uint8_t ch)
{
    if (isupper(ch))
    {
        ch = (uint8_t)(ch - 'A');
        ch = (uint8_t)(ch + 'a');
    }
    return ch;
}

Why does the Win32-API have so many custom types?

I'm new to the Win32 API and the many new types begin to confuse me.
Some functions take 1-2 ints and 3 UINTS as arguments.
Why can't they just use ints? What are UINTs?
Then, there are those other types:
DWORD LPCWSTR LPBOOL
Again, I think the "primitive" C types would be enough - why introduce 100 new types?
This one was a pain: WCHAR*
I had to iterate through it and push_back every character to an std::string as there wasn't another way to convert it to one. Horrible.
Why WCHAR? Why reinvent the wheel? Couldn't they have just used char* instead?
The Windows API was first created back in the 1980's, and has had to support several different CPU architectures and compilers over the years. They've gone from single-user single-process standalone systems to networked multi-user multi-core security-conscious systems. They had to work around issues with 16-bit vs. 32-bit processors, and now 64-bit processors. They had to work around issues with pre-ANSI C compilers. They had to support C++ compilers in the early unstandardized times. They had to deal with segmented memory. They had to support internationalization before Unicode existed. They had to support some source-level compatibility with MS-DOS, with OS/2, and with Mac OS. They've had to run on several generations of Intel chips, and PowerPC, and MIPS, and Alpha, and ARM. The same basic API is used for desktop, server, mobile, and embedded systems.
Back in the 1980's, C was considered to be a high-level language (yes, really!) and many people considered it good form to use abstract types rather than just specifying everything as a primitive int, char, or void *. Back when we didn't have IntelliSense and infotips and code browsers and online documentation and the like, such usage hints were helpful, and it made it easier to port code between different compilers and different programming languages.
Yes, it looks like a horrible mess now, but that doesn't mean anybody did anything wrong.
Win32 actually has very few primitive types. What you're looking at is decades of built-up #defines and typedefs and Hungarian notation. Because there were so few types and little or no IntelliSense, developers gave themselves "clues" as to what a particular type was actually supposed to do.
For example, there is no boolean type but there is an "aliased" representation of an integer that tells you that a particular variable is supposed to be treated as a boolean. Take a look at the contents of WinDef.h to see what I mean.
You can take a look here: http://msdn.microsoft.com/en-us/library/aa383751(VS.85).aspx
for a peek at the veritable tip of the iceberg. For example, notice how HANDLE is the base typedef for every other object that is a "handle" to a windows object. Of course, HANDLE is defined somewhere else as a primitive type.
UINT is an unsigned integer. If a parameter value will not / cannot be negative, it makes sense to specify unsigned. LPCWSTR is a pointer to const wide char array, while WCHAR* is non-const.
You should probably compile your app for UNICODE when working with wide chars, or use a conversion routine to convert from narrow to wide.
http://msdn.microsoft.com/en-us/library/dd319072%28VS.85%29.aspx
http://msdn.microsoft.com/en-us/library/dd374083%28v=VS.85%29.aspx
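For the WCHAR* pain mentioned in the question, the usual route is WideCharToMultiByte rather than pushing characters one at a time. A rough sketch (the helper name is mine, and error handling is minimal); the resulting buffer can then be handed to a std::string:

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated wide string to a newly allocated UTF-8 string.
   The caller frees the result. */
char *wide_to_utf8(const WCHAR *ws)
{
    int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (n <= 0)
        return NULL;
    char *s = (char *)malloc((size_t)n);
    if (s != NULL)
        WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, n, NULL, NULL);
    return s;
}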
A coworker of mine would say, "There is no problem that can't be solved (obfuscated?) by a level of indirection." In Win32, you'll be dealing with WCHAR, UINT, etc., and you'll get used to it. You won't have to worry when you deploy that DLL which basic type a WCHAR or UINT compiles to—it will "just work".
It is best to read through some of the documentation to get used to it. Especially on the "wide char" support (WCHAR, etc.). There's a nice definition on MSDN for WCHAR.

Is there a reason that C99 doesn't support function overloading?

Apparently (at least according to gcc -std=c99) C99 doesn't support function overloading. The reason for not supporting some new feature in C is usually backward compatibility, but in this case I can't think of a single case in which function overloading would break backward compatibility. What is the reasoning behind not including this basic feature?
To understand why you aren't likely to see overloading in C, it might help to better learn how overloading is handled by C++.
After compiling code, but before it is ready to run, the intermediate object code must be linked. This transforms a rough database of compiled functions and other objects into a ready-to-load/run binary file. This extra step is important because it is the principal mechanism of modularity available to compiled programs. This step allows you to take code from existing libraries and mix it with your own application logic.
At this stage, the object code may have been written in any language, with any combination of features. To make this possible, it's necessary to have some sort of convention so that the linker is able to pick the right object when another object refers to it. If you're coding in assembly language, when you define a label, that label is used exactly, because it is assumed you know what you're doing.
In C, functions become the symbol names for the linker, so when you write
int main(int argc, char **argv) { return 1; }
the compiler produces object code which contains an object called main.
This works well, but it means that you cannot have two objects with the same name, because the linker would be unable to decide which name it should use. The linker doesn't know anything about argument types, and very little about code in general.
C++ resolves this by encoding additional information into the symbol name directly. The return type, the number and type of arguments, the reference type of the arguments, whether it's const or not, etc., are added to the symbol name, and are referred to that way at the point of a function call. The linker doesn't have to know this is even happening, since as far as it can tell, the function call is unambiguous.
The downside of this is that the symbol names don't look anything like the original function names. In particular, it's almost impossible to predict what the name of an overloaded function will be so that you can link to it. To link to foreign code, you can use extern "C", which causes those functions to follow the C style of symbol names, but of course you cannot overload such a function.
These differences are related to the design goals of each language. C is oriented toward portability and interoperability. C goes out of its way to do predictable and compatible things. C++ is more strongly oriented toward building rich and powerful systems, and not terribly focused on interacting with other languages.
I think it is unlikely for C to ever pursue any feature that would produce code that is as difficult to interact with as C++.
Edit: Imagist asks:
Would it really be less portable or more difficult to interact with a function if you resolved int main(int argc, char** argv) to something like main-int-int-char** instead of to main (and this were part of the standard)? I don't see a problem here. In fact, it seems to me that this gives you more information (which could be used for optimization and the like).
To answer this, I will turn again to C++ and the way it deals with overloads. C++ uses this mechanism, almost exactly as described, but with one caveat. C++ does not standardize how certain parts of itself should be implemented, and then goes on to suggest how to deal with some of the consequences of that omission. In particular, C++ has a rich type system that includes virtual class members. How this feature should be implemented is left to the compiler writers, and the details of vtable resolution have a strong effect on function signatures. For this reason, C++ deliberately suggests that compiler writers make name mangling mutually incompatible across compilers, or across the same compiler with different implementations of these key features.
This is just a symptom of the deeper issue that while higher-level languages like C++ and C have detailed type systems, the lower-level machine code is totally typeless. Arbitrarily rich type systems are built on top of the untyped binary provided at the machine level. Linkers do not have access to the rich type information available to the higher-level languages. The linker is completely dependent on the compiler to handle all of the type abstractions and produce properly type-free object code.
C++ does this by encoding all of the necessary type information in the mangled object names. C, however, has a significantly different focus, aiming to be a sort of portable assembly language. C thus prefers a strict one-to-one correspondence between the declared name and the resulting object's symbol name. If C mangled its names, even in a standardized and predictable way, you would have to go to great efforts to match the altered names to the desired symbol names, or else you would have to turn it off as you do in C++. This extra effort comes at almost no benefit, because unlike C++, C's type system is fairly small and simple.
At the same time, it's practically standard practice to define several similarly named C functions that vary only by the types they take as arguments. For a lengthy example of just this, have a look at the OpenGL namespace.
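That OpenGL-style convention looks roughly like this in C (hypothetical function names, shown only to illustrate encoding the argument types into the name by hand):

/* Manual "overloading" by naming convention, as OpenGL does with
   glVertex2i / glVertex2f / glVertex2d: one function per type. */
void draw_point_2i(int x, int y);
void draw_point_2f(float x, float y);
void draw_point_2d(double x, double y);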
When you compile a C source, symbol names will remain intact. If you introduce function overloading, you should provide a name mangling technique to prevent name clashes. Consequently, like C++, you'll have machine generated symbol names in the compiled binary.
Also, C does not feature strict typing. Many things are implicitly convertible to each other in C. The complexity of overload resolution rules could introduce confusion in such a language.
Lots of language designers, including me, think that the combination of function overloading with C's implicit promotions can result in code that is heinously difficult to understand. For evidence, look at the body of knowledge accumulated about C++.
In general, C99 was intended to be a modest revision largely compatible with existing practice. Overloading would have been a pretty big departure.

Reflection support in C

I know it is not supported, but I am wondering if there are any tricks around it. Any tips?
Reflection in general is a means for a program to analyze the structure of some code.
This analysis is used to change the effective behavior of the code.
Reflection as analysis is generally very weak; usually it can only provide access to function and field names. This weakness comes from the language implementers essentially not wanting to make the full source code available at runtime, along with the appropriate analysis routines to extract what one wants from the source code.
Another approach is to tackle program analysis head on, by using a strong program analysis tool, e.g., one that can parse the source text exactly the way the compiler does it.
(Often people propose to abuse the compiler itself to do this, but that usually doesn't work; the compiler machinery wants to be a compiler and it is darn hard to bend it to other purposes).
What is needed is a tool that:
Parses language source text
Builds abstract syntax trees representing every detail of the program (it is helpful if the ASTs retain comments and other details of the source code layout, such as column numbers, literal radix values, etc.)
Builds symbol tables showing the scope and meaning of every identifier
Can extract control flows from functions
Can extract data flow from the code
Can construct a call graph for the system
Can determine what each pointer points to
Enables the construction of custom analyzers using the above facts
Can transform the code according to such custom analyses (usually by revising the ASTs that represent the parsed code)
Can regenerate source text (including layout and comments) from the revised ASTs.
Using such machinery, one implements analysis at whatever level of detail is needed, and then transforms the code to achieve the effect that runtime reflection would accomplish.
There are several major benefits:
The detail level or amount of analysis is a matter of ambition (e.g., it isn't limited to what runtime reflection alone can do)
There isn't any runtime overhead to achieve the reflected change in behavior
The machinery involved can be general and applied across many languages, rather than being limited to what a specific language implementation provides.
This is compatible with the C/C++ idea that you don't pay for what you don't use. If you don't need reflection, you don't need this machinery, and your language doesn't need to have the intellectual baggage of weak reflection built in.
See our DMS Software Reengineering Toolkit for a system that can do all of the above for C, Java, and COBOL, and most of it for C++.
[EDIT August 2017: Now handles C11 and C++2017]
Tips and tricks always exist. Take a look at the Metaresc library: https://github.com/alexanderchuranov/Metaresc
It provides an interface for type declaration that will also generate metadata for the type. Based on the metadata you can easily serialize/deserialize objects of any complexity. Out of the box you can serialize/deserialize XML, JSON, XDR, Lisp-like notation, and C-init notation.
Here is a simple example:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "metaresc.h"

TYPEDEF_STRUCT (point_t,
                double x,
                double y
                );

int main (int argc, char * argv[])
{
    point_t point = {
        .x = M_PI,
        .y = M_E,
    };
    char * str = MR_SAVE_XML (point_t, &point);
    if (str)
    {
        printf ("%s\n", str);
        free (str);
    }
    return (EXIT_SUCCESS);
}
This program will output
$ ./point
<?xml version="1.0"?>
<point>
<x>3.1415926535897931</x>
<y>2.7182818284590451</y>
</point>
The library works fine with the latest gcc and clang on Linux, macOS, FreeBSD and Windows.
A custom macro language is one of the options. Users can write declarations as usual and generate type descriptors from DWARF debug info. This moves complexity to the build process, but makes adoption much easier.
any tricks around it? Any tips?
The compiler will probably optionally generate a 'debug symbol file', which a debugger can use to help debug the code. The linker may also generate a 'map file'.
A trick/tip might be to generate and then read these files.
Based on the responses to How can I add reflection to a C++ application? (Stack Overflow) and the fact that C++ is considered a "superset" of C, I would say you're out of luck.
There's also a nice long answer about why C++ doesn't have reflection (Stack Overflow).
I needed reflection in a bunch of structs in a C++ project.
I created an XML file with the description of all those structs - fortunately the field types were primitive types.
I used a template (not a C++ template) to auto-generate a class for each struct along with setter/getter methods.
In each class I used a map to associate string names and class members (pointers to members).
I didn't regret using reflection because it opened new ways to design my core functionality that I couldn't even imagine without reflection.
(BTW, it was an external report generator for a program that uses a raw database)
So, I used code generation, function pointers and maps to simulate reflection.
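The same idea can be approximated in plain C with offsetof and a hand-written (or generated) table of field descriptors; a minimal sketch of the concept, not the poster's actual generated code:

#include <stddef.h>
#include <stdio.h>

/* Per-field descriptors play the role of the generated reflection data:
   each entry records a field's name, offset and type. */
struct point { double x; double y; };

enum field_type { FT_DOUBLE };

struct field_desc {
    const char     *name;
    size_t          offset;
    enum field_type type;
};

static const struct field_desc point_fields[] = {
    { "x", offsetof(struct point, x), FT_DOUBLE },
    { "y", offsetof(struct point, y), FT_DOUBLE },
};

static void print_fields(const struct point *p)
{
    size_t i;
    for (i = 0; i < sizeof point_fields / sizeof point_fields[0]; i++) {
        const struct field_desc *f = &point_fields[i];
        if (f->type == FT_DOUBLE)
            printf("%s = %f\n", f->name,
                   *(const double *)((const char *)p + f->offset));
    }
}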
I know of the following options, but all come at cost and a lot of limitations:
Use libdl (#include <dlfcn.h>)
Call a tool like objdump or nm
Parse the object files yourself (using a corresponding library)
Involve a parser and generate the necessary information at compile time.
"Abuse" the linker to generate symbol arrays.
I'll use a bit of unit test frameworks as examples further down, because automatic test discovery for unit test frameworks is a typical example where reflection comes in very handy, and it's something that most unit test frameworks for C fall short of.
Using libdl (#include <dlfcn.h>) (POSIX)
If you're on a POSIX environment, a little bit of reflection can be done using libdl. Plugins are developed that way.
Use
#include <dlfcn.h>
in your source code and link with -ldl.
Then you have access to functions dlopen(), dlerror(), dlsym() and dlclose() with which you could load and access / run shared objects at runtime. However, it does not give you easy access to the symbol table.
Another disadvantage of this approach is that you basically restrict reflection to objects loaded as dynamic library (shared object loaded at runtime via dlopen()).
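A minimal sketch of that approach (the symbol name looked up here is purely illustrative; on Linux you typically need to build with -rdynamic for the program's own symbols to be visible this way):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* NULL gives a handle to the main program itself; link with -ldl. */
    void *self = dlopen(NULL, RTLD_NOW);
    if (self == NULL) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }

    /* Look up a function by name at runtime ("some_function" is hypothetical). */
    void (*fn)(void) = (void (*)(void))dlsym(self, "some_function");
    if (fn != NULL)
        fn();

    dlclose(self);
    return 0;
}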
Running nm or objdump
You could run nm or objdump to show the symbol table and parse the output.
For me, nm -P --defined-only -g xyz.o gives good results, and parsing the output is trivial.
You'd be interested in the first word of each line only, which is the symbol name, and maybe the second one, which is the section type.
If you do not know the object name in some static way, i.e. the object is actually a shared object, at least on Linux you then might want to skip symbol names starting with '_'.
objdump, nm or similar tools are also often available outside POSIX environments.
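A rough sketch of driving nm from C and taking the first word of each line (the object file name is just a placeholder; popen is POSIX, not standard C):

#include <stdio.h>

int main(void)
{
    /* Run nm and read its output line by line. */
    FILE *p = popen("nm -P --defined-only -g xyz.o", "r");
    if (p == NULL)
        return 1;

    char line[512];
    while (fgets(line, sizeof line, p) != NULL) {
        char name[256];
        /* The first whitespace-delimited word is the symbol name. */
        if (sscanf(line, "%255s", name) == 1)
            printf("symbol: %s\n", name);
    }
    pclose(p);
    return 0;
}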
Parsing the object files yourself
You could parse the object files yourself. You probably don't want to implement that from scratch but use an existing library for that. This is how nm, objdump and even libdl are implemented. You could peek at the source code of nm, objdump and libdl and the libraries they use in order to find out how they do what they do.
Involving a Parser
You could write a parser and code generator which generates the necessary reflective information at compile time and stores it in the object file. Then you have a lot of freedom and could even implement primitive forms of annotations. That's what some unit test frameworks like AceUnit do.
I found that writing a parser which covers straightforward C syntax is fairly trivial. Writing a parser which really understands C and could deal with all cases is NOT trivial.
So, this has limitations which depend on how exotic the C syntax is that you want to reflect upon.
"Abusing" the linker to generate symbol arrays
You could put references to symbols which you want to reflect upon in a special section and use a linker configuration to emit the section boundaries so you can access them in C.
I've described how this works here: Dependency injection in C - better way than linker-defined arrays?
But beware, this is depending on a lot of things and not very portable. I have only tried this with GCC/ld, and I know it doesn't work with all compilers / linkers. Also, it's almost guaranteed that dead code elimination will not detect how you call this stuff, so if you use dead code elimination, you will have to add all the reflected symbols as entry points.
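A sketch of the trick as it works with GCC and GNU ld (per the caveat above, this is not portable, and the section and macro names here are only examples):

#include <stdio.h>

typedef void (*test_fn)(void);

/* Each registered function pointer goes into a dedicated section; GNU ld
   automatically provides __start_/__stop_ symbols bracketing it. */
#define REGISTER_TEST(fn) \
    static test_fn fn##_entry __attribute__((section("test_fns"), used)) = fn

static void test_foo(void) { puts("test_foo"); }
REGISTER_TEST(test_foo);

extern test_fn __start_test_fns[];
extern test_fn __stop_test_fns[];

int main(void)
{
    /* Walk the linker-assembled array and call every registered function. */
    for (test_fn *t = __start_test_fns; t < __stop_test_fns; t++)
        (*t)();
    return 0;
}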
Pitfalls
For some of the mechanisms, dead code elimination can be a problem, in particular when you "abuse" the linker to generate symbol arrays. It can be worked around by declaring the reflected symbols as entry points to the linker, and depending on the number of symbols this might be neither nice nor convenient.
Conclusion
Combining nm and libdl can actually give quite good results. The combination can be almost as powerful as the level of Reflection used by JUnit 3.x in Java. The level of reflection given is sufficient to implement a JUnit 3.x-style unit test framework for C, including test-case discovery by naming convention.
Involving a parser is more work and limited to objects that you compile yourself, but gives you most power and freedom. The level of reflection given can be sufficient to implement a JUnit 4.x-style unit test framework for C, including test-case discovery by annotations. AceUnit is a unit test framework for C that does exactly this.
Combining parsing and the linker to generate symbol arrays can give very nice results - if your environment is so much under your control that you can ensure that working with the linker that way works for you.
And of course you can combine all approaches to stitch together the bits and pieces until they fit your needs.
You would need to implement it yourself from the ground up. In straight C, there is no runtime information whatsoever kept on structure and composite types. Metadata simply does not exist in the standard.
Implementing reflection for C would be much simpler... because C is a simple language.
There are some basic options for analyzing a program, like detecting whether a function exists by calling dlopen/dlsym -- it depends on your needs.
There are tools for creating code that can modify/extend itself using tcc.
You may use the above tools in order to create your own code analyzers.
For similar reasons to the author of the question, I have been working on a C-type-reflection-API along with a C reflection graph database format and a clang plug-in that writes reflection metadata.
The intent is to use the C reflection API for writing serialization and deserialization routines, such as mappers for ASN.1, function argument printers, function proxies, fuzzers, etc. Clang and GCC both have plugin APIs that allow access to the AST but there currently is no standard graph format for C reflection metadata.
The proposed C reflection API is called Crefl:
https://github.com/michaeljclark/crefl
The Crefl API provides runtime access to reflection metadata for C structure declarations, with support for arbitrarily nested combinations of: intrinsic, set, enum, struct, union, field (member), array, constant, variable.
The Crefl reflection graph database format stores portable reflection metadata.
The Crefl clang plug-in outputs the C reflection metadata used by the library.
The Crefl API provides task-oriented query access to that C reflection metadata.
The Crefl C reflection data model is essentially a transcription of the C data types in ISO/IEC 9899:
C intrinsic data types.
integer types.
floating-point types.
complex number types.
boolean type.
nested struct, union, field, and bitfield
arrays and pointers
typedef type aliases
enum and enum constants
functions and function parameters
const, volatile and restrict qualifiers
GNU-C style attributes using (__attribute__).
The library is still a work in progress. The hope is to find others who are interested in reflection support in C.
Parsers and Debug Symbols are great ideas. However, the gotcha is that C does not really have arrays. Just pointers to stuff.
For example, there is no way by reading the source code to know whether a char * points to a character, a string, or a fixed array of bytes based on some "nearby" length field. This is a problem for human readers let alone any automated tool.
Why not use a modern language, like Java or .Net? Can be faster than C as well.

When should I use type abstraction in embedded systems

I've worked on a number of different embedded systems. They have all used typedefs (or #defines) for types such as UINT32.
This is a good technique as it drives home the size of the type to the programmer and makes you more conscious of chances for overflow etc.
But on some systems you know that the compiler and processor won't change for the life of the project.
So what should influence your decision to create and enforce project-specific types?
EDIT
I think I managed to lose the gist of my question, and maybe it's really two.
With embedded programming you may need types of specific size for interfaces and also to cope with restricted resources such as RAM. This can't be avoided, but you can choose to use the basic types from the compiler.
For everything else the types have less importance.
You need to be careful not to cause overflow and may need to watch out for register and stack usage. Which may lead you to UINT16, UCHAR.
Using types such as UCHAR can add compiler 'fluff' however. Because registers are typically larger, some compilers may add code to force the result into the type.
i++;
can become
ADD REG,1
AND REG, 0xFF
which is unnecessary.
So I think my question should have been :-
Given the constraints of embedded software, what is the best policy to set for a project which will have many people working on it - not all of whom will be of the same level of experience?
I use type abstraction very rarely. Here are my arguments, sorted in increasing order of subjectivity:
Local variables are different from struct members and arrays in the sense that you want them to fit in a register. On a 32b/64b target, a local int16_t can make code slower compared to a local int, since the compiler will have to add operations to force overflow according to the semantics of int16_t. While C99 defines int_fast16_t, AFAIK a plain int will fit in a register just as well, and it sure is a shorter name.
Organizations which like these typedefs almost invariably end up with several of them (INT32, int32_t, INT32_T, ad infinitum). Organizations using built-in types are thus better off, in a way, having just one set of names. I wish people used the typedefs from stdint.h or windows.h or anything existing; and when a target doesn't have that .h file, how hard is it to add one?
The typedefs can theoretically aid portability, but I, for one, never gained a thing from them. Is there a useful system you can port from a 32b target to a 16b one? Is there a 16b system that isn't trivial to port to a 32b target? Moreover, if most vars are ints, you'll actually gain something from the 32 bits on the new target, but if they are int16_t, you won't. And the places which are hard to port tend to require manual inspection anyway; before you try a port, you don't know where they are. Now, if someone thinks it's so easy to port things if you have typedefs all over the place - when time comes to port, which happens to few systems, write a script converting all names in the code base. This should work according to the "no manual inspection required" logic, and it postpones the effort to the point in time where it actually gives benefit.
Now if portability may be a theoretical benefit of the typedefs, readability sure goes down the drain. Just look at stdint.h: {int,uint}{max,fast,least}{8,16,32,64}_t. Lots of types. A program has lots of variables; is it really that easy to understand which need to be int_fast16_t and which need to be uint_least32_t? How many times are we silently converting between them, making them entirely pointless? (I particularly like BOOL/Bool/eBool/boolean/bool/int conversions. Every program written by an orderly organization mandating typedefs is littered with that).
Of course in C++ we could make the type system more strict, by wrapping numbers in template class instantiations with overloaded operators and stuff. This means that you'll now get error messages of the form "class Number<int,Least,32> has no operator+ overload for argument of type class Number<unsigned long long,Fast,64>, candidates are..." I don't call this "readability", either. Your chances of implementing these wrapper classes correctly are microscopic, and most of the time you'll wait for the innumerable template instantiations to compile.
The C99 standard has a number of standard sized-integer types. If you can use a compiler that supports C99 (gcc does), you'll find these in <stdint.h> and you can just use them in your projects.
Also, it can be especially important in embedded projects to use types as a sort of "safety net" for things like unit conversions. If you can use C++, I understand that there are some "unit" libraries out there that let you work in physical units that are defined by the C++ type system (via templates) that are compiled as operations on the underlying scalar types. For example, these libraries won't let you add a distance_t to a mass_t because the units don't line up; you'll actually get a compiler error.
Even if you can't work in C++ or another language that lets you write code that way, you can at least use the C type system to help you catch errors like that by eye. (That was actually the original intent of Simonyi's Hungarian notation.) Just because the compiler won't yell at you for adding a meter_t to a gram_t doesn't mean you shouldn't use types like that. Code reviews will be much more productive at discovering unit errors then.
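In plain C you can get part of that safety net by wrapping each unit in a single-member struct, so that mixing units refuses to compile; a small illustrative sketch (the type and function names here are made up):

/* Distinct single-member struct types per physical unit: the compiler
   rejects mixing them, even though both just hold a double. */
typedef struct { double value; } meter_t;
typedef struct { double value; } gram_t;

static meter_t add_meters(meter_t a, meter_t b)
{
    meter_t r = { a.value + b.value };
    return r;
}

/* add_meters(distance, mass) fails to compile if mass is a gram_t. */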
My opinion is if you are depending on a minimum/maximum/specific size, don't just assume that (say) an unsigned int is 32 bits - use uint32_t instead (assuming your compiler supports C99).
I like using stdint.h types for defining system APIs specifically because they explicitly say how large items are. Back in the old days of Palm OS, the system APIs were defined using a bunch of wishy-washy types like "Word" and "SWord" that were inherited from very classic Mac OS. They did a cleanup to instead say Int16 and it made the API easier for newcomers to understand, especially with the weird 16-bit pointer issues on that system. When they were designing Palm OS Cobalt, they changed those names again to match stdint.h's names, making it even more clear and reducing the amount of typedefs they had to manage.
I believe that MISRA standards suggest (require?) the use of typedefs.
From a personal perspective, using typedefs leaves no confusion as to the size (in bits / bytes) of certain types. I have seen lead developers attempt both ways of developing by using standard types e.g. int and using custom types e.g. UINT32.
If the code isn't portable there is little real benefit in using typedefs; however, if like me you work on both types of software (portable and fixed environment) then it can be useful to keep a standard and use the customised types. At the very least, like you say, the programmer is then very much aware of how much memory they are using. Another factor to consider is how 'sure' are you that the code will not be ported to another environment? I've seen processor-specific code have to be translated because a hardware engineer suddenly had to change a board; this is not a nice situation to be in, but without the custom typedefs it could have been a lot worse!
Consistency, convenience and readability. "UINT32" is much more readable and writeable than "unsigned long", which is the equivalent on some systems.
Also, the compiler and processor may be fixed for the life of a project, but the code from that project may find new life in another project. In this case, having consistent data types is very convenient.
If your embedded systems is somehow a safety critical system (or similar), it's strongly advised (if not required) to use typedefs over plain types.
As TK. has said before, MISRA-C has an (advisory) rule to do so:
Rule 6.3 (advisory): typedefs that indicate size and signedness should be used in place of the basic numerical types.
(from MISRA-C 2004; it's Rule #13 (adv) of MISRA-C 1998)
Same also applies to C++ in this area; eg. JSF C++ coding standards:
AV Rule 209: A UniversalTypes file will be created to define all standard types for developers to use. The types include: [uint16, int16, uint32_t etc.]
Using <stdint.h> makes your code more portable for unit testing on a pc.
It can bite you pretty hard when you have tests for everything but it still breaks on your target system because an int is suddenly only 16 bits long.
Maybe I'm weird, but I use ub, ui, ul, sb, si, and sl for my integer types. Perhaps the "i" for 16 bits seems a bit dated, but I like the look of ui/si better than uw/sw.
