Apparently (at least according to gcc -std=c99) C99 doesn't support function overloading. The reason for not supporting some new feature in C is usually backward compatibility, but in this case I can't think of a single case in which function overloading would break backward compatibility. What is the reasoning behind not including this basic feature?
To understand why you aren't likely to see overloading in C, it might help to better learn how overloading is handled by C++.
After compiling code, but before it is ready to run, the intermediate object code must be linked. This transforms a rough database of compiled functions and other objects into a ready to load/run binary file. This extra step is important because it is the principle mechanism of modularity available to compiled programs. This step allows you to take code from existing libraries and mix it with your own application logic.
At this stage, the object code may have been written in any language, with any combination of features. To make this possible, it's necessary to have some sort of convention so that the linker is able to pick the right object when another object refers to it. If you're coding in assembly language, when you define a label, that label is used exactly, because it is assumed you know what you're doing.
In C, functions become the symbol names for the linker, so when you write
int main(int argc, char **argv) { return 1; }
the compiler provides an archive of object code, which contains an object called main.
This works well, but it means that you cannot have two objects with the same name, because the linker would be unable to decide which name it should use. The linker doesn't know anything about argument types, and very little about code in general.
C++ resolves this by encoding additional information into the symbol name directly. The return type, the number and type of arguments, the reference type of the arguments, whether its const or not, etc., are added to the symbol name, and are referred to that way at the point of a function call. The linker doesn't have to know this is even happening, since as far as it can tell, the function call is unambiguous.
The downside of this is that the symbol names don't look anything like the original function names. In particular, it's almost impossible to predict what the name of an overloaded function will be so that you can link to it. To link to foriegn code, you can use extern "C", which causes those functions to follow the C style of symbol names, but of course you cannot overload such a function.
These differences are related to the design goals of each language. C is oriented toward portability and interoperability. C goes out of its way to do predictable and compatible things. C++ is more strongly oriented toward building rich and powerful systems, and not terribly focused on interacting with other languages.
I think it is unlikely for C to ever pursue any feature that would produce code that is as difficult to interact with as C++.
Edit: Imagist asks:
Would it really be less portable or
more difficult to interact with a
function if you resolved int main(int
argc, char** argv) to something like
main-int-int-char** instead of to main
(and this were part of the standard)?
I don't see a problem here. In fact,
it seems to me that this gives you
more information (which could be used
for optimization and the like)
To answer this, I will turn again to C++ and the way it deals with overloads. C++ uses this mechanism, almost exactly as described, but with one caveat. C++ does not standardize how certain parts of itself should be implemented, and then goes on to suggest how some of the consequences of that omission. In particular, C++ has a rich type system, that includes virtual class members. How this feature should be implemented is left to the compiler writers, and the details of vtable resolution has a strong effect on function signatures. For this reason, C++ deliberately suggests compiler writers make name mangling be mutually incompatible across compilers or same compilers with different implementations of these key features.
This is just a symptom of the deeper issue that while higher level languages like C++ and C have detailed type systems, the lower level machine code is totally typeless. arbitrarily rich type systems are built on top of the untyped binary provided at the machine level. Linkers do not have access to the rich type information available to the higher level languages. The linker is completely dependent on the compiler to handle all of the type abstractions and produce properly type-free object code.
C++ does this by encoding all of the necessary type information in the mangled object names. C, however, has a significantly different focus, aiming to be a sort of portable assembly language. C prefers thus to have a strict one to one correspondence between the declared name and the resulting objects' symbol name. If C Mangled it's names, even in a standardized and predictable way, you would have to go to great efforts to match the altered names to the desired symbol names, or else you would have to turn it off as you do in c++. This extra effort comes at almost no benefit, because unlike C++, C's type system is fairly small and simple.
At the same time, it's practically a standard practice to define several similarly named C functions that vary only by the types they take as arguments. for a lengthy example of just this, have a look at the OpenGL namespace.
When you compile a C source, symbol names will remain intact. If you introduce function overloading, you should provide a name mangling technique to prevent name clashes. Consequently, like C++, you'll have machine generated symbol names in the compiled binary.
Also, C does not feature strict typing. Many things are implicitly convertible to each other in C. The complexity of overload resolution rules could introduce confusion in such kind of language.
Lots of language designers, including me, think that the combination of function overloading with C's implicit promotions can result in code that is heinously difficult to understand. For evidence, look at the body of knowledge accumulated about C++.
In general, C99 was intended to be a modest revision largely compatible with existing practice. Overloading would have been a pretty big departure.
Related
A swift app, will convert its dynamic frameworks into binaries. And once something is a binary, then it's no longer Swift/Ruby/Python, etc. It's machine code.
Same thing happens for a Python binary. So why aren't the machine codes compatible with each other out of the box?
Is it just that a simple mapping is required to bridge one language to the other?
Like if I needed to use a binary created from the Swift language — into a Python based app, then do I need to expose the Swift Headers to Python for it to work? Or something else is required?
I assume you're talking about making calls in one language to a library compiled in a different language.
At the assembly language level, there are standards (ABI, for Application Binary Interface) that define how function parameters are passed in registers, how values are returned, the behavior of the stack, etc. ABIs are architecture and operating-system-dependent. Usually any function that is exported in a library will follow the ABI.
It is plain that ABIs basically expect a C language model for functions: a single return value, a well-defined data type for each function parameter as well as the return value, the possibility of using pointers, etc.
Problems start to arise once you move to a higher level language. C++ already introduces complications: whereas the name of a C function is the same in assembly (often a _ character is prepended), in C++ function names must encode data types due to the possibility of overloaded functions with the same name but different parameters. Thus, names must be mangled and demangled -- this is why a prototype for a C function must be declared as extern "C" in C++. Then there are issues of the classes (this pointer, vtables), namespaces and so on, which complicate matters further.
Then you have dynamically typed languages like Python. In truth, there is no such thing as dynamic typing at the assembly language levels: the instruction encodings in machine language (i.e. binary codes as they're read by the CPU when executed) implicitly determine whether you're using an integer or floating-point or SIMD instruction (and the width of operands), which also determines which of the different register banks are accessed. Although the language makes dynamic typing transparent to you, at the assembly code level, the interpreter/JIT/compiler must resolve them somehow, because ultimately the CPU must be told exactly what data type to operate on.
This is why you can't directly call a C function (or in general any library function) from Python -- unlike a pure Python function which can disregard the types of its parameters, library functions must know the exact types of each parameter and the return type. Thus, you must use something like ctypes for Python, explicitly specifying the types in question for each function that needs to be called -- in a way, this is similar to function prototypes usually found in C headers. It is possible to write functions in C that are directly callable from Python (and, in that case, essentially from Python alone), but you'll have to jump through a few hoops.
As for the particular language pairing you're interested in (Python/Swift), a cursory search came up with this thread in the Swift forums (this one, linked from there, may also be interesting. Reading the thread, there appears to be two feasible solutions at this time: first, use the #_cdecl attribute (which isn't officially supported) to make a C function, and then call it from Python using ctypes. But the second and apparently more promising one is to use the #objc attribute in Swift, and use PyObjC in Python. I assume this will allow using some of the higher-level features of Swift, at least those that intersect with what Objective-C offers.
I'm generating machine code to call functions from existing system libraries. Most system libraries were written in C, so I'll take C as an example, but the question probably applies to any other language.
If I understand this answer correctly, C compilers are free to choose the ABI/calling convention of a function as long as they preserve the semantics. For instance they can choose to pass a pointer for the returned value as an argument to obtain copy-elision.
Does this mean that no one can ever truly know what's the right way to call a function from a library, even if its C signature is known?
Is this a real concern in practice? Or is it safe to assume that all the functions with non-mangled names from system libraries always use the system's default calling convention?
What other assumptions or considerations can I make about the ABI/calling convention of functions with non-mangled names in system libraries?
C compilers are free to choose the ABI/calling convention of a function as long as they preserve the semantics.
Well, yes and no. The ABI is often defined by the target system, in which case the compiler has to fall in line. In case there exists no ABI for the target system (often the case in microcontroller programming), the compiler is free to do as it pleases, essentially inventing the ABI.
Does this mean that no one can ever truly know what's the right way to call a function from a library, even if its C signature is known?
No you can't unless you know the target system and calling convention. Some systems have several "de facto" standards such as x86 Windows __cdecl vs __stdcall see https://en.wikipedia.org/wiki/X86_calling_conventions
Is this a real concern in practice?
Not within a program written entirely in C. But it becomes a big problem in case the program links external libs such as Windows DLLs, possibly written in other languages. Then you have to use the right calling convention or the program will soon crash.
It's also a very real concern whenever you attempt to mix assembler and C for the given system - the C compiler will handle stacking according to the calling convention, but in the assembler part you have to write this manually. This can also affect the C code, if it is written with care to suit assembler. You'd then pick parameter and return types that are convenient to use.
If I understand this answer correctly, C compilers are free to choose
the ABI/calling convention of a function as long as they preserve the
semantics. For instance they can choose to pass a pointer for the
returned value as an argument to obtain copy-elision.
I don't see how you conclude that from the answer you referenced. Calling conventions are a characteristic of the function, as it appears in compiled form. The compiler can do all manner of tricks at the point of call, but changing or ignoring the calling conventions of the function implementation is not one of them. Where it is possible, copy elision for returned structure values (the subject of that answer) does not rely on any such thing.
Does this mean that no one can ever truly know what's the right way to
call a function from a library, even if its C signature is known?
Yes and no. The function signature alone does not convey anything about calling convention (with some caveats; see below), but libraries simply could not work if there were no way to know calling conventions. In practice, it is usually the case that calling convention (and ABI overall) is standardized on a per-platform basis.
Thus, for example, Linux implementations for x86_64 substantially all follow the same conventions. All the toolchains targeting that platform both use that convention for function calls and provide for functions to be called according to it. Compilers for Win64 likewise follow the appropriate (different) conventions.
Windows is in fact an interesting case, however, because historically, it has supported multiple calling conventions. In its case, there is a default convention, and different conventions can be specified in function declarations via extension keywords. The compiler knows which convention to use based on the function declaration.
Additionally, where it is not concerned about interoperability, compilers can do anything within their power. So, for example, when compiling a function with internal linkage, it could, in principle, use whatever calling convention it wants, as it is in full control of both the function and all callers (ignoring the possible effect of function pointers). This is not different in kind from compilers' ability to inline functions. As a practical matter, however, I would not expect compilers to use variant calling conventions under such circumstances, and I am not aware of any that do.
Is this a real concern in practice? Or is it safe to assume that all
the functions with non-mangled names from system libraries always use
the system's default calling convention?
Name mangling has nothing to do with it. That's part of a higher-level mapping of C++ (usually) semantics onto system-level, source-language-independent object-file formats.
Generally speaking, it is safe to assume that where the appropriate function declarations are in scope (from the library's header files, typically), the compiler will generate correct calls. This is an essential interoperability characteristic that is rarely violated in practice. It cannot be construed as a universal guarantee, but in practice, it is not something that you should worry about.
What other assumptions or considerations can I make about the
ABI/calling convention of functions with non-mangled names in system
libraries?
I'm unsure what kinds of assumptions you have in mind, and I suspect you're overcomplicating things. You make sure to include the header(s) from the relevant library that declare the functions you want to call. Having done so, you rely on your compiler to generate correct calls.
The premise: I'm writing a plug-in DLL which conforms to an industry standard interface / function signature. This will be used in at least two different software packages used internally at my company, both of which have some example skeleton code or empty shells of this particular interface. One vendor authors their example in C/C++, the other in Fortran.
Ideally I'd like to just have to write and maintain this library code in one language and not duplicate it (especially as I'm only just now getting some comfort level in various flavors of C, but haven't touched Fortran).
I've emailed off to both our vendors to see if there's anything specific their solvers need when they import this DLL, but this has made me curious at a more fundamental level. If I compile a DLL with an exposed method void foo(int bar) in both C and Fortran... by the time it's down to x86 machine instructions - does it make any difference in how that method is called by program "X"? I've gathered so far that if I were to do C++ I'd need the extern "C" bit to avoid "mangling" - there anything else I should be aware of?
It matters. The exported function must use a specific calling convention, there are several incompatible ones in common use in 32-bit code. The calling convention dictates where the function arguments are stored, in what order they are passed and how they are removed again. As well as how the function return value is passed back.
And the name of the function matters, exported function names are often decorated with extra characters. Which is what extern "C" is all about, it suppresses the name mangling that a C++ compiler uses to prevent overloaded functions from having the same exported name. So the name is one that the linker for a C compiler can recognize.
The way a C compiler makes function calls is pretty much the standard if you interop with code written in other languages. Any modern Fortran compiler will support declarations to make them compatible with a C program. And surely this is something that's already used by whatever software vendor you are working with that provides an add-on that was written in Fortran. And the other way around, as long as you provide functions that can be used by a C compiler then the Fortran programmer has a good chance at being able to call it.
Yes it has been discussed here many many times. Study answers and questions in this tag https://stackoverflow.com/questions/tagged/fortran-iso-c-binding .
The equivalent of extern "C" in fortran is bind(C). The equivalency of the datatypes is done using the intrinsic module iso_c_binding.
Also be sure to use the same calling conventions. If you do not specify anything manually, the default is usually the same for both. On Linux this is non-issue.
extern "C" is used in C++ code. So if you DLL is written in C++, you mustn't pass any C++ objects (classes).
If you stick with C types, you need to make sure the function passes parameters in a single way e.g. use C's default of _cdecl. Not sure what Fortran uses.
I am not clear with use of __attribute__ keyword in C.I had read the relevant docs of gcc but still I am not able to understand this.Can some one help to understand.
__attribute__ is not part of C, but is an extension in GCC that is used to convey special information to the compiler. The syntax of __attribute__ was chosen to be something that the C preprocessor would accept and not alter (by default, anyway), so it looks a lot like a function call. It is not a function call, though.
Like much of the information that a compiler can learn about C code (by reading it), the compiler can make use of the information it learns through __attribute__ data in many different ways -- even using the same piece of data in multiple ways, sometimes.
The pure attribute tells the compiler that a function is actually a mathematical function -- using only its arguments and the rules of the language to arrive at its answer with no other side effects. Knowing this the compiler may be able to optimize better when calling a pure function, but it may also be used when compiling the pure function to warn you if the function does do something that makes it impure.
If you can keep in mind that (even though a few other compilers support them) attributes are a GCC extension and not part of C and their syntax does not fit into C in an elegant way (only enough to fool the preprocessor) then you should be able to understand them better.
You should try playing around with them. Take the ones that are more easily understood for functions and try them out. Do the same thing with data (it may help to look at the assembly output of GCC for this, but sizeof and checking the alignment will often help).
Think of it as a way to inject syntax into the source code, which is not standard C, but rather meant for consumption of the GCC compiler only. But, of course, you inject this syntax not for the fun of it, but rather to give the compiler additional information about the elements to which it is attached.
You may want to instruct the compiler to align a certain variable in memory at a certain alignment. Or you may want to declare a function deprecated so that the compiler will automatically generate a deprecated warning when others try to use it in their programs (useful in libraries). Or you may want to declare a symbol as a weak symbol, so that it will be linked in only as a last resort, if any other definitions are not found (useful in providing default definitions).
All of this (and more) can be achieved by attaching the right attributes to elements in your program. You can attach them to variables and functions.
Take a look at this whole bunch of other GCC extensions to C. The attribute mechanism is a part of these extensions.
There are too many attributes for there to be a single answer, but examples help.
For example __attribute__((aligned(16))) makes the compiler align that struct/function on a 16-bit stack boundary.
__attribute__((noreturn)) tells the compiler this function never reaches the end (e.g. standard functions like exit(int) )
__attribute__((always_inline)) makes the compiler inline that function even if it wouldn't normally choose to (using the inline keyword suggests to the compiler that you'd like it inlining, but it's free to ignore you - this attribute forces it).
Essentially they're mostly about telling the compiler you know better than it does, or for overriding default compiler behaviour on a function by function basis.
One of the best (but little known) features of GNU C is the attribute mechanism, which allows a developer to attach characteristics to function declarations to allow the compiler to perform more error checking. It was designed in a way to be compatible with non-GNU implementations, and we've been using this for years in highly portable code with very good results.
Note that attribute spelled with two underscores before and two after, and there are always two sets of parentheses surrounding the contents. There is a good reason for this - see below. Gnu CC needs to use the -Wall compiler directive to enable this (yes, there is a finer degree of warnings control available, but we are very big fans of max warnings anyway).
For more information please go to http://unixwiz.net/techtips/gnu-c-attributes.html
Lokesh Venkateshiah
The #encode directive returns a const char * which is a coded type descriptor of the various elements of the datatype that was passed in. Example follows:
struct test
{ int ti ;
char tc ;
} ;
printf( "%s", #encode(struct test) ) ;
// returns "{test=ic}"
I could see using sizeof() to determine primitive types - and if it was a full object, I could use the class methods to do introspection.
However, How does it determine each element of an opaque struct?
#Lothars answer might be "cynical", but it's pretty close to the mark, unfortunately. In order to implement something like #encode(), you need a full blown parser in order to extract the the type information. Well, at least for anything other than "trivial" #encode() statements (i.e., #encode(char *)). Modern compilers generally have either two or three main components:
The front end.
The intermediate end (for some compilers).
The back end.
The front end must parse all the source code and basically converts the source code text in to an internal, "machine useable" form.
The back end translates the internal, "machine useable" form in to executable code.
Compilers that have an "intermediate end" typically do so because of some need: they support multiple "front ends", possibly made up of completely different languages. Another reason is to simplify optimization: all the optimization passes work on the same intermediate representation. The gcc compiler suite is an example of a "three stage" compiler. llvm could be considered an "intermediate and back end" stage compiler: The "low level virtual machine" is the intermediate representation, and all the optimization takes place in this form. llvm also able to keep it in this intermediate representation right up until the last second- this allows for "link time optimization". The clang compiler is really a "front end" that (effectively) outputs llvm intermediate representation.
So, if you want to add #encode() functionality to an 'existing' compiler, you'd probably have to do it as a "source to source" 'compiler / preprocessor'. This was how the original Objective-C and C++ compilers were written- they parsed the input source text and converted it to "plain C" which was then fed in to the standard C compiler. There's a few ways to do this:
Roll your own
Use yacc and lex to put together a ANSI-C parser. You'll need a grammar- ANSI C grammar (Yacc) is a good start. Actually, to be clear, when I say yacc, I really mean bison and flex. And also, loosely, the other various yacc and lex like C-based tools: lemon, dparser, etc...
Use perl with Yapp or EYapp, which are pseudo-yacc clones in perl. Probably better for quickly prototyping an idea compared to C-based yacc and lex- it's perl after all: Regular expressions, associative arrays, no memory management, etc.
Build your parser with Antlr. I don't have any experience with this tool chain, but it's another "compiler compiler" tool that (seems) to be geared more towards java developers. There appears to be freely available C and Objective-C grammars available.
Hack another tool
Note: I have no personal experience using any of these tools to do anything like adding #encode(), but I suspect they would be a big help.
CIL - No personal experience with this tool, but designed for parsing C source code and then "doing stuff" with it. From what I can glean from the docs, this tool should allow you to extract the type information you'd need.
Sparse - Worth looking at, but not sure.
clang - Haven't used it for this purpose, but allegedly one of the goals was to make it "easily hackable" for just this sort of stuff. Particularly (and again, no personal experience) in doing the "heavy lifting" of all the parsing, letting you concentrate on the "interesting" part, which in this case would be extracting context and syntax sensitive type information, and then convert that in to a plain C string.
gcc Plugins - Plugins are a gcc 4.5 (which is the current alpha/beta version of the compiler) feature and "might" allow you to easily hook in to the compiler to extract the type information you'd need. No idea if the plugin architecture allows for this kind of thing.
Others
Coccinelle - Bookmarked this recently to "look at later". This "might" be able to do what you want, and "might" be able to do it with out much effort.
MetaC - Bookmarked this one recently too. No idea how useful this would be.
mygcc - "Might" do what you want. It's an interesting idea, but it's not directly applicable to what you want. From the web page: "Mygcc allows programmers to add their own checks that take into account syntax, control flow, and data flow information."
Links.
CocoaDev Objective-C Parsing - Worth looking at. Has some links to lexers and grammars.
Edit #1, the bonus links.
#Lothar makes a good point in his comment. I had actually intended to include lcc, but it looks like it got lost along the way.
lcc - The lcc C compiler. This is a C compiler that is particularly small, at least in terms of source code size. It also has a book, which I highly recommend.
tcc - The tcc C compiler. Not quite as pedagogical as lcc, but definitely still worth looking at.
poc - The poc Objective-C compiler. This is a "source to source" Objective-C compiler. It parses the Objective-C source code and emits C source code, which it then passes to gcc (well, usually gcc). Has a number of Objective-C extensions / features that aren't available in gcc. Definitely worth looking at.
You would implement this by implementing the ANSI C compiler first and then add some implementation specific pragmas and functions to it.
Yes i know this is cynical answer and i accept the downvotes.
One way to do it would be to write a preprocessor, which reads the source code for the type definitions and also replaces #encode... with the corresponding string literal.
Another approach, if your program is compiled with -g, would be to write a function that reads the type definition from the program's debug information at run-time, or use gdb or another program to read it for you and then reformat it as desired. The gdb ptype command can be used to print the definition of a particular type (or if that is insufficient there is also maint print type, which is sure to print far more information than you could possibly want).
If you are using a compiler that supports plugins (e.g. GCC 4.5), it may also be possible to write a compiler plugin for this. Your plugin could then take advantage of the type information that the compiler has already parsed. Obviously this approach would be very compiler-specific.