Why aren't regular expressions part of ISO C99?

Everyone knows how awesome the C language is and how much it sucks at text processing tasks. Given these facts, regex definitely ought to be part of ISO C. But it isn't, and I don't understand why. Are there people who think it's not essential?

Regular expressions don't belong in the C language proper any more than a sound library, a graphics library, or an encryption library does. Including them would reduce the general-purpose nature of the language and greatly inhibit its use as a small and efficient embedded language.
The philosophy of C was to have a very small and efficient language keyword set with standardized libraries for the next layer of functionality. Since things like regex, graphics, sound, encryption, etc. don't have a single platform or standard, they don't fit in with the standard C library.
They fit best as user libraries, which they currently are.

Regex is defined as part of IEEE Std 1003.1-2001 (POSIX).
Here's a handy list of which headers are in which standard:
http://www.schweikhardt.net/identifiers.html
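For readers who have not used the POSIX interface, here is a minimal sketch of matching with `<regex.h>`. It assumes a POSIX system (the header is not part of ISO C), and the helper name `regex_matches` is made up for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches the POSIX extended
   regular expression `pattern`, 0 if it does not, and -1 if the
   pattern fails to compile. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;

    /* REG_EXTENDED selects ERE syntax; REG_NOSUB says we only need a
       yes/no answer, not the positions of submatches. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A call such as `regex_matches("^[a-z]+[0-9]+$", "abc123")` would be expected to report a match on a conforming system.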

Because it is a library feature that would require standardizing on one of the regex languages. Standards bodies are committee-driven, so that is not an easy task.
This document explains the rationale of the standard: http://www.open-std.org/jtc1/sc22/wg14/www/docs/C99RationaleV5.10.pdf which might clarify why.
Another reason explained in the document is to keep the language simple.
There are quite a few regex libraries available for download; just use one.

Because regexes are not essential to a programming language. Handy? Yes, very much so, when you need them. Essential? No way.
Web developers will naturally consider regexes to be an essential feature of a language because they have to validate all that HTML form data. Developers whose experience is always with one of a few big-name relational database servers will consider SQL support to be essential. Those working in the scientific domain will require support for "big numbers" or tensors. GUI developers think a built-in GUI toolkit is essential. Some folks deal with XML all day and consider XML support to be essential.... etc. you get the idea. This list of "essentials" can get pretty big, and languages like Java have certainly taken the "kitchen sink" approach to their massive standard libraries. I appreciate that C is not a kitchen sink language in that sense.
Be careful not to assume that your favorite language feature is an essential feature for everyone else.

The point of C is to be small yet powerful. Since regular expressions are typically a large and complex topic, they belong in a library. It is too bad though that the C committee doesn't "sponsor" some well-written, standard C algorithm and data structure libraries. There is a plethora of them out there. I tend to stick with GNU "sponsored" libs whenever I can since they are available for most platforms, even if they aren't necessarily the easiest or most efficient to use. They do strike a nice balance.

Related

In what languages besides C can I write a C library?

I want to write a library that is dynamically loadable and callable from C code, but I really don't want to write it in C - the code is security critical, so I want a language that makes it easier to have confidence that my code is correct. What are my options?
To be more specific, I want C programmers to be able to #include this, and -l that, and start using my library just as if I had written it in C. I'd like programmers in other languages to be able to use their favourite tools for linking to C libraries to link to it. Ideally I'd like that to be possible on every platform that supports C, but I'll settle for Linux, Windows and MacOS.
Anything that compiles to native code. So you might Google for that - "languages that compile to native code." See, e.g., Programming languages that compile to native code and have the batteries included
C++ is often the choice for this. It compiles to native code and, provided you keep your interfaces simple, it is easy to write an adapter layer.
Objective C and Fortran are also possible.
It sounds like you are looking for a language with ABI compatibility or which can be described as resulting in native code. So long as it can be compiled to a valid object file (typically an .obj or .o file) which is accepted by the linker, that should be the main criterion. You also then want to write a header file as a convenience for any client code which is written in C (or a closely related language/variant thereof).
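A rough sketch of what such a header could look like follows. Everything named `safelib_*` is hypothetical, and the small C implementation underneath is only a stand-in so the example links; in practice those symbols would be exported by the object code produced from whichever language you choose.

```c
/* safelib.h (hypothetical): the C-facing interface of a library whose
   implementation may be written in another language. */
#ifndef SAFELIB_H
#define SAFELIB_H

#include <stddef.h>

#ifdef __cplusplus
extern "C" {                    /* keep C linkage when included from C++ */
#endif

typedef struct safelib_ctx safelib_ctx;    /* opaque handle */

safelib_ctx *safelib_open(void);
int safelib_checksum(safelib_ctx *ctx, const unsigned char *data,
                     size_t len, unsigned long *out);
void safelib_close(safelib_ctx *ctx);

#ifdef __cplusplus
}
#endif
#endif /* SAFELIB_H */

/* Stand-in implementation, written in C purely for illustration. */
#include <stdlib.h>

struct safelib_ctx { unsigned long calls; };

safelib_ctx *safelib_open(void)
{
    return calloc(1, sizeof(safelib_ctx));
}

/* Sums the bytes of `data` into *out; returns 0 on success, -1 on bad input. */
int safelib_checksum(safelib_ctx *ctx, const unsigned char *data,
                     size_t len, unsigned long *out)
{
    if (ctx == NULL || data == NULL || out == NULL)
        return -1;
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    ctx->calls++;
    *out = sum;
    return 0;
}

void safelib_close(safelib_ctx *ctx)
{
    free(ctx);
}
```

Keeping the interface to plain C types, an opaque handle, and integer error codes is one way to keep it easy to bind from other languages; it is a design choice, not a requirement.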
As mentioned by others, you need a pretty good reason for choosing a language other than C, as it is the lingua franca of low-level/systems software. Assembler is an option, although harder to port between platforms. D is a more portable - but less widespread - alternative which is designed to produce secure, efficient native code with a minimum of fuss. There are many others.
Almost every security-critical application I know of is written in C. I don't believe that there is any other language with a stronger real-world track record in producing secure applications.
C is said to be a poor language for security by people who don't understand it.
If you want C programmers to use your library, use C. Doing anything else is tying one hand behind your back whilst trying to walk on a balance beam (the gymnastics equipment). Sure, there are dozens of other languages that are CAPABLE of interfacing to C, but it typically involves using a C layer and then stuffing the C data types into a language specific data type (Java Objects, Python Objects, etc, etc), and when the call is finished, you use the same conversion back to a C data type. Just makes it harder to work with, and potentially slower if you don't get all the design decisions right. And people won't understand the source code, so won't like to use it (see more about this below).
If you want security, then write very good code, wearing your "security aspects" hat firmly on at all times, find a security mailing list or website and post it there for review, take the review comments on board, understand the comments, and fix any comments that are meaningful to fix. Distribute the source code to the users, so people can see what your code does. Those that understand security will know what to look for and understand that you have done a good job (or a bad job, whichever is applicable) - and those who don't will hopefully trust the right people. If it's good, people will use it. If it's "hidden", and not easy to access, you won't get many customers, no matter what language you use.
Don't worry, you won't reveal anything more from releasing source. If there is a flaw in the code, and it is popular (or important) enough, someone will find the flaw, even if you publish only binaries. For those skilled in reverse engineering, not having source code is only a small obstacle.
Security doesn't stem from using a specific language or a specific tool, it stems from good design and good basic understanding of the problems with security.
And remember security by obscurity (whether that means "hidden source code" or "unusual language" or something else obscure) is false security.
You might be interested in ATS, http://ats-lang.sourceforge.net/. ATS compiles via C, can be as efficient as C, and can be used in a way that is ABI-compatible with C. From the project website:
ATS is a statically typed programming language that unifies implementation with formal specification. It is equipped with a highly expressive type system rooted in the framework Applied Type System, which gives the language its name. In particular, both dependent types and linear types are available in ATS. The current implementation of ATS (ATS/Anairiats) is written in ATS itself. It can be as efficient as C/C++ (see The Computer Language Benchmarks Game for concrete evidence) and supports a variety of programming paradigms
ATS's dependent and linear type system helps produce static guarantees about your code, including various aspects of resource management safety.
Chris Double has been writing a series of articles exploring the power of ATS's type system for systems programming here: http://bluishcoder.co.nz/tags/ats/. Of particular note is this article: http://bluishcoder.co.nz/2012/08/30/safer-handling-of-c-memory-in-ats.html
This document covers aspects of calling back and forth between ATS and C code: https://docs.google.com/document/d/1W6DYQApEqKgyBzMbvpCI87DBfLdNAQ3E60u1hUiMoU0
The main downside is that dependently-typed programming is still a daunting prospect, even for non-systems programming. The syntax of the language is also a bit weird: consider lexical quirks such as the use of abst@ype as a keyword. Finally, ATS is to some degree a research project, and I personally don't know whether it would be sensible to adopt for a commercial endeavour.
Theoretically, it's going to be Fortran: less indirection (as in: my array is [here], not just a pointer to here, and this is true of most but not all of your data structures and variables).
However... There are many gotchas and quirks in Fortran: not, perhaps, as many as in C but you probably know your way around C rather better than Fortran. Which is the point behind most of the comments saying 'Know your code' - but do you really know what your compiler is doing?
Knowing you, I'm prepared to take it on trust that you do, for C. Most programmers don't. You do not know and cannot know what a local JVM or JIT compiler does, and that's a black hole in your security model if you're using Java or C# or scripting languages.
Ignore anyone who tells you that the hairy-chested he-men of secure computing write their own assembler: they probably don't even know the security errors they're making in any and all nontrivial projects they release. Know your compiler, indeed.
You could write it in Lua - providing a C API to a Lua library is relatively straightforward. C++ is also an option, though of course you'd have to write C wrappers and make sure no exceptions can escape your functions. But honestly, if it's security critical the minor inconveniences of the C language shouldn't be that much of a big deal. What you really should be doing is proving the correctness of your program where feasible, and testing extensively where it's not.
You can write a library in Java. JNI is normally used to call C from Java, but it can be used the other way around.
There is finally a decent answer to this question: Rust.

What higher-level language is most like C?

I've been learning C: it's a beautiful, well-thought-out language. However, it is so low-level that writing any sort of major project becomes tedious.
What higher-level language has the most C-like syntax—but without all the clutter that you find in something like C++. Does one exist?
What higher-level language has the most C-like syntax—but without all the clutter that you find in something like C++?
I'm going to answer a slightly different question:
What is a language that is like C in that it is well designed and beautifully thought out, is like C in that it is good for systems programming, allows people to program at a higher level than C, and is relatively uncluttered?
I don't think this question has a single right answer, but here are three worthy candidates (in alphabetical order):
D. The D language is designed essentially as a better, cleaner C++. Like C++, D is explicitly designed to incorporate a lot of features, but one hopes in a cleaner, more harmonious way than C++. A major difference that enables programmers to work at a higher level is that memory is managed automatically by the language and safety is guaranteed by the compiler (and run-time system) through garbage collection.
Go. Go scores very high on being well designed and beautifully thought out: Rob Pike is a master designer and has been practicing this particular craft for 25 years. Its explicit goal is to be uncluttered and to make systems programming "fun again". Go is still a new language, and Rob has learned much from Squeak, Newsqueak, Alef, and Limbo. Because Rob understands that a great design is one with no unnecessary parts, Go is clean and uncluttered. Its primary features that are higher-level than C are type safety, garbage collection, and an excellent concurrency model.
Java. Java has a well-designed core (see Jim Waldo's book Java: The Good Parts) but unfortunately suffers from the clutter that any mature, successful language accumulates. The features of Java that make it most suitable for higher-level programming are interfaces, garbage collection, and exceptions.
The common thread here is using garbage collection to relieve the programmer of the burden of memory management. This is a major boost to productivity.
Each of these languages has much to recommend it. My own taste is for languages that are small and simple, and I admire Rob Pike's body of work very highly, so if I had to pick one for myself, it would be Go, despite the fact that it is new and unproven.
In C++ you can write C code and have it compile successfully as C++ (mostly). Therefore, although I suggest that your term "clutter" is both derogatory and ambiguous, the only clutter you will have is what you choose to write yourself. You can use C++ as a bigger tool-bag without using all the tools (or clutter if you prefer).
The answer therefore is C++ whether you like it or not. Most other C-like languages add OO features, which is perhaps what you regard as clutter, but you do not get something for nothing and you need to have syntax to support the additional features. Such languages include:
Java
C#
Objective-C
D
Of these, Objective-C is probably the most C-like since it is a superset of C in the way that C++ is not quite. It is also the preferred language for OS X and iPhone/iPod Touch development, which may be attractive.
Java is ubiquitous but probably best described as superficially C-like. C# has limited cross-platform support but is the path of least resistance for Windows GUI development with excellent free development tools. C# also has a simpler but more restrictive OO implementation than C++ so may meet your requirements, but its resemblance to C/C++ can be misleading; like Java, it is fundamentally different in how it works. D is somewhat of a niche, being developed by a single author (albeit the author of the once renowned Zortech/Symantec C++ compiler).
Regarding it being "low level" and "tedious": when embarking on a "major project", you would seldom start from scratch with only the standard library and OS API available; you would make use of third-party and in-house developed libraries to quickly develop higher-level functionality. That said, an OO approach is generally much more amenable to this 'code-reuse' approach, and of course C++'s standard library and third-party libraries are more extensive (not least because it can use C libraries as well as C++ libraries). In fact I would suggest that apart from support for OO, the only thing that makes C++ higher-level is its extensibility via classes as first-class objects. It remains suitable as a systems-level language nonetheless.
Google's Go language has a similar syntax (though different enough I suppose) and semantics, though with garbage collection, polymorphism, etc., built into the language.
The D programming language is an attempt to be what C++ should have been (not bashing C++ at all; it is my primary language), and I quote from the website: "D is a systems programming language. Its focus is on combining the power and high performance of C and C++ with the programmer productivity of modern languages like Ruby and Python. Special attention is given to the needs of quality assurance, documentation, management, portability and reliability." The issue with D is that it is relatively new compared to a lot of languages, but luckily it can still use C libraries, which allows it to access a large pre-existing code base. Certainly worth checking out.
Java is another option; however, it is notably slower than C. Syntactically it is very similar and offers a nice object-oriented environment for writing code. It is also considered by most to be a safer language than C and C++. It is widely used in enterprise.
Python, while syntactically not like C, is a high-level object-oriented programming language that is very popular and can import C modules, which may be very useful down the track.
This is too broad a question and is best made Community wiki.
However, in my mind, the main distinguishing feature of C is its compactness. The whole language can be described in a small book like K&R. One can remember all the syntactic details without much effort (since there are so few of them) and it doesn't try to protect its users from themselves.
Languages like C++ are much more baroque. It's quite hard to remember all the rules and exceptions. I feel the same way about Perl and Ruby. There are lots of things to remember and lots of things to watch out for.
I feel the same sense of compactness with Python (although perhaps not as much as C). There's very little "special syntax" and all libraries and modules are operated upon in a similar fashion.
This (probably like most other comments on this question) is a personal evaluation and is by no means a final word.
Probably Java and C#... Java a little more so I think.
And it's not the language - it's all about the libraries. Try out Qt (http://qt.nokia.com/). It's for C++ and I know you said C but I'm just making a point that you'll find yourself writing just as little (and perhaps even less!) code than you'd write for applications in Java or C#. Plus they're native and cross-platform.
All about the libraries.
I've been learning C: it's a beautiful, well-thought-out language. However, it is so low-level that writing any sort of major project becomes tedious.
Some people would say that the second sentence proves that the assertion of the first sentence is false.
Another point is that this is pretty much unanswerable. What is a "high level" language? what are your criteria for "closeness"? Syntax, computational model, performance? And what kind of applications are you wanting to build with this hypothetical language?
And if you just want to confine yourself to languages that "look like" C, why? As someone who has lost count of the number of programming languages he has used, I can tell you that differences in programming language syntax are generally pretty unimportant. You can get used to pretty much any syntax, given time.
This comparison of basic instructions gives you a good idea of what languages are similar to each other.
I would say PHP is most like C except for the $variables, if you can distinguish php the language from php the platform. Java tries in some ways, but is too strongly object oriented to be similar to C.
Javascript has a reasonably C-like syntax, and it's a very popular language. Javascript has a lot of quirks, but it has one powerful similarity to C - it's simple. The complete Javascript specification is very short, and the language is very powerful and high-level. It would be great to clean it up from some of its ugly cruft, though.
I'll just point out that Pascal is semantically (though not so much syntactically) very similar to C, so there are options like Object Pascal, Modula 2, Ada and Oberon out there where you will be re-using most of the non-trivial part of what you already know, the trivial part being the spelling.
You're probably better off sticking with C# or Java in terms of job prospects, though.
EDIT
I'll also add that on the clutter issue, it is important to sort out which clutter is important. C has less "clutter" in its language definition, true, but the relevant clutter is in source code. Consider the following...
// C
struct mystruct *myvar;
myvar = (struct mystruct *) malloc (sizeof (struct mystruct));
myvar->a = 1;
myvar->b = 2;
myvar->c = 3;
call_something (myvar);
free (myvar);
// C++
auto_ptr<mystruct> myvar (new mystruct (1, 2, 3));
call_something (myvar);
The point is that the "clutter" in the language definition is there for a reason. With a little up-front work when writing libraries, a lot of work (and clutter) is avoided down the line. And even when you're writing a library, you benefit from the up-front work done by other library writers.
I'd vote C#. I don't know what you mean by "clutter," but from a usability standpoint, C# is nice because it avoids some of the tedious things of C++, like having to essentially "declare" each of your class's methods twice (prototyping it in the header file, then essentially duplicating the same thing in your class's implementation). Ditching header files was nice in other ways too, like doing away with dependency conflicts in big projects or avoiding circular references. In C#, the compiler takes care of all that (although you still have to set references to other files or assemblies).
I've been doing C# for 10 years and I still miss pointers, which believe it or not, in my opinion, actually made debugging easier!
If you're going to be programming often, it's good to know languages that are explicitly not like each other. It's especially useful to know high level scripting languages like python or ruby. If you can think like a programmer in C you should be fine learning either of these two.
Many big projects take advantage of the rapid prototyping of higher level languages like python or ruby, but also take advantage of low overhead (fast) compiled languages like C/C++.
If you think that C++ is cluttered, then you just don't know how to write effective C++, because nobody forces you to use any of the advanced tools available. You could write a C++ program entirely in C plus your favourite C++ feature (like the AWESOME standard library). That's the definition of uncluttered. A cluttered language would be Java/C#, where you HAVE to put every function in a class. That's clutter.
How about ActionScript 3? It's a lot like Java.

Embeddable language with good string manipulation support

I've been working on a C program which does quite a lot of string manipulation, and very often needs to be tweaked and recompiled for some sort of special case processing. I've been thinking that embedding some scripting language with good string manipulation support might make sense for the project.
What language would provide the best string manipulation support while being easy to embed in a C program?
For some extra background...
Performance is pretty important (especially startup time)
Needs to be easily compiled on multiple platforms (Linux, Solaris, Win32 (ideally with MinGW), Darwin)
Needs to be a language which will still be around in 5 years time
I've looked a little at Python (perhaps too heavy weight?) and Lua (perhaps not focused on string manipulation?) but don't really know enough about them or what other choices might be out there.
I've never regretted using Lua.
It's very easy to embed in your application. In fact, now I usually don't write C applications, I just write C libraries and control them from Lua.
Text manipulation isn't its best feature, but it's certainly far better than C alone. And the LPEG library makes building parsers almost trivially easy, putting any regex to shame (but still has a couple of regex-like syntaxes if you prefer them).
Lua stands head and shoulders above other choices.
... best string manipulation support while being easy to embed?
Lua is designed to be embedded in C; the API is clear and easy to use; the documentation is terrific.
Some other responses have denigrated Lua's string capabilities. I think they're underestimating Lua.
Lua's string capabilities actually find a sweet spot between "just concatenation" and the full complexity of regular expressions. String formatting capability is very strong, and accumulating strings through "buffers" or tables is simple and efficient.
String scanning is, in my opinion, one of the best parts of the design. It doesn't have "or" patterns but otherwise gives you a large fraction of what you get from regular expressions, including a very powerful and elegant "capture" function. For example, I can convert a string to hex by capturing every single character and applying a function to it:
s:gsub('.', function(c) return string.format("%02x", string.byte(c)) end)
Or I can escape non-alphanumeric, non-space characters into octal:
s:gsub('[^%w%s]', function(c) return string.format([[\%03o]], string.byte(c)) end)
Some of the features on display here:
The escape character for string scanning is %, which is different from the escape character for string quoting, which is \. This decision is brilliant and should win an award by itself :-)
There are multiple mechanisms for quoting literal strings, including [[...]] in which no characters have to be escaped. If you want to generate or match strings with backslashes in them (like LaTeX for example), this is a godsend.
If you want the full power of a context-free parser, you can always use LPEG, a library written by one of Lua's designers.
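For contrast with the first one-liner above, here is roughly what the same string-to-hex conversion looks like in plain C. This is only an illustrative sketch; the function name `to_hex` is made up, and the caller is assumed to free the result.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding each
   byte of `s` as two lowercase hex digits, or NULL if allocation
   fails. The caller must free() the result. */
char *to_hex(const char *s)
{
    size_t n = strlen(s);
    char *out = malloc(2 * n + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        /* Each byte becomes exactly two characters plus a terminator,
           which the next iteration overwrites. */
        snprintf(out + 2 * i, 3, "%02x", (unsigned int)(unsigned char)s[i]);
    }
    out[2 * n] = '\0';
    return out;
}
```

The buffer sizing, the allocation check, and the ownership rule are all things the Lua version handles for you.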
Performance is pretty important (especially startup time)
Lua consistently wins performance awards. Startup is lightning fast: the whole system (including compiler, library, garbage collector, and runtime system) fits in 150KB. To avoid pause times, Lua provides incremental garbage collection. See also SO question Why is Lua faster than other scripting languages?
You can make startup even faster by precompiling your scripts, but I've never found it necessary to do this—and because compiled code (as opposed to source code) is not portable, precompilation usually creates more headache than it solves.
Needs to be easily compiled on multiple platforms
Lua compiles using pure ANSI C and does not even require POSIX. I have a version running on my PalmOS PDA.
Needs to be a language which will still be around in 5 years time.
Lua has been around since 1993. Moreover, the two members of the team who provide the most support are tenured professors at PUC-Rio. Lua is their livelihood. Finally, the whole system is only 17,000 lines of code. If Rio fell off the map tomorrow, anybody with a good undergraduate compiler course could pick the system up and maintain it. There would be plenty of volunteers.
I've looked a little at Python and Lua but don't really know enough about them
See SO question Which game scripting language is better to use: Lua or Python?.
People have been embedding tcl in larger projects for what seems like ages. It's been a while since I've had to use tcl for anything...
One of the things that sets tcl apart from other programming languages is that everything is a string.
And for your reference, here's the tcl documentation on string functions.
tcl might be easier to embed than perl, but I do have to agree with @Matthew Scharley's reasoning. Also, tcl isn't exactly known for its performance, but maybe that's changed in recent years.
Anyway, here is the tcl wiki link on embedding tcl in C applications, and a relevant quote from the page:
"How do I embed a Tcl interpreter in my existing C (or C++) application?" is a very frequently-asked question. It's straightforward, certainly far easier than doing the same with Perl or, in general, Python; moreover, this sort of "embeddability" was one of the original goals for Tcl, and many, many projects do it. There are no complete discussions of the topic available, but we can give an overview here. (RWT 14-Oct-2002)
Another alternative might be to go with Lua, as you mentioned, while extending it with another C string library of your choice (Google turns up The Better String Library, for instance).
Once you've compiled Lua into your application, you can expose your own C functions to Lua's interpreter, "extending" it. Or maybe the built-in string functions are adequate for you.
You certainly have a few options.
We looked at both Python and Lua for scripting for a .NET product. The goal was to provide some scriptability for end users. The decision came down to Python because the powers-that-be preferred anything with Microsoft support to everything else. My choice was for Lua.
There's a good survey paper on the relative merits of the embedding APIs of various scripting languages:
H. Muhammad and R. Ierusalimschy. C APIs in extension and extensible languages. Journal of Universal Computer Science, 13(6):839–853, 2007.
Looking at combining both excellent string manipulation and an excellent embed API, I would suggest, in order:
Ruby: Excellent string support, including syntax support for regex. Well-designed embed API, very easy to use.
Lua: I'm not sure how its string support is, but it's supposed to be a great language for embedding.
Python: Less easy to embed, slightly harder to use string features than Ruby. But it has Pyrex, so that might be an easier way to embed it.
PHP: Nasty API, nasty language. The embed SAPI is really a second-class citizen, but it does work. There are a lot of string manipulation functions. Still, I wouldn't recommend it.
Perl: Nasty to embed (so far as I've heard), string support could be better.
I can't comment about TCL, but I hear it's designed for embedding.
Python is not heavyweight at all! It's quite simple to embed (here's the official guide, but you can find many tutorials as well), very powerful, great for string processing, and a pleasant and easy language to use overall. It has a huge user community and support base, which is a bonus.
Python has also been embedded into a large number of real-life applications. One cool example I can think of immediately is the Civilization IV game, most of which runs on Python scripts on top of a C++ API.
Some people may disagree, but Sara Golemon has published a great book on extending and embedding PHP, which is becoming one of the most widely used languages around... :)
PHP String support isn't as great as say Perl, but it's very usable.
Did I mention it's written in C?
</my2cents>
Perl. Its (original) reason for being is string manipulation.

Best C/C++ Libraries for maintaining session state in CGI application?

I have heard of Boost and ACE as two of the well known C++ libraries. What are the other good C/C++ libraries available?
Do Boost and ACE support session management for web applications written in C/C++?
EDIT: Ok I will try to be domain specific. I am looking for a C/C++ library which could help me maintain session state for a C++ based CGI web application.
When you're trying to build a web application in C++ I'd recommend Wt, a Qt-like framework for creating web applications in C++.
It handles sessions either in one process per session (when security matters) or multiple sessions per process.
You can either use the built-in webserver or use it with any webserver that supports FastCGI.
(Also, I'd recommend it over Boost.CGI as it seems to be maintained and feature-complete).
Depends if you are talking about general purpose or domain specific libraries. For general purpose Boost is best of breed (and don't forget about the good old STL), so I don't see the point of looking for something else that will cover much of the same ground, but is not as polished. As for domain specific you'd have to specify the domain :-)
If you're interested in C (not C++) as well, glib (the Gnome project's utility library) provides a number of useful data structures and constructs.
C++ has libraries for anything you could imagine, so the scope of your question is rather undefined. What interests you? Web applications, scientific programs, GUIs? Specify what you need exactly if you want a good answer.
Boost is a general-purpose library for relatively low-level things. It's rather complex and advanced though, so you should have a good grasp of C++ before you start with it. ACE is mainly for synchronization and communication between threads/processes/applications.
If web applications are what you need, I recommend you carefully consider the language you're picking. C++ may not be the best direction to go here, unless you have very specific constraints that force your hand.
There is also GTK, which is good if you need a GUI or Unicode support (although C++0x should have better native Unicode support when the standard is complete).
Boost doesn't yet support sessions, but a CGI library has been proposed which should have sessions.
If you want to use C++ for web applications, consider using CGICC.
CppCMS?
Example of session management: http://cppcms.sourceforge.net/wikipp/en/page/tut_sessions
Reference: http://cppcms.sourceforge.net/wikipp/en/page/ref_cppcms_session_iface
Configuration: http://cppcms.sourceforge.net/wikipp/en/page/ref_config#sessions
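Whichever library you pick, the underlying idea is the same: each CGI request runs in a fresh process, so the only thing that ties requests together is a session id carried in a cookie, with the actual session data stored outside the process. Here is a rough sketch of that idea using only the standard library. The cookie name `sid`, the 32-hex-digit id format, and all function names are my own assumptions for illustration, not the API of any library mentioned in this thread.

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>

// Extract the value of cookie `name` from a raw Cookie header such as
// "a=1; sid=abc". Returns an empty string when the cookie is absent.
std::string cookie_value(const std::string& header, const std::string& name) {
    std::istringstream in(header);
    std::string pair;
    while (std::getline(in, pair, ';')) {
        const auto start = pair.find_first_not_of(' ');
        if (start == std::string::npos) continue;
        const auto eq = pair.find('=', start);
        if (eq == std::string::npos) continue;
        if (pair.substr(start, eq - start) == name) return pair.substr(eq + 1);
    }
    return "";
}

// Accept only ids in the format we generate (32 lowercase hex digits), so a
// client-supplied value can never be abused as, say, a file path.
bool is_valid_sid(const std::string& sid) {
    if (sid.size() != 32) return false;
    for (char c : sid)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) return false;
    return true;
}

// Generate a fresh id. Assumption: std::random_device is adequate for a
// sketch; a real application should use a cryptographically secure source.
std::string new_session_id() {
    static const char hex[] = "0123456789abcdef";
    std::random_device rd;
    std::string id;
    for (int i = 0; i < 32; ++i) id += hex[rd() % 16];
    return id;
}

// Return the session id for this request. `http_cookie` is the value of the
// HTTP_COOKIE environment variable (may be null). Sets `is_new` when no
// valid id was presented and a new one had to be created.
std::string current_session_id(const char* http_cookie, bool& is_new) {
    const std::string sid = http_cookie ? cookie_value(http_cookie, "sid") : "";
    is_new = !is_valid_sid(sid);
    return is_new ? new_session_id() : sid;
}

// Build the CGI header block; the cookie is only sent for new sessions.
std::string response_headers(const std::string& sid, bool is_new) {
    assert(!sid.empty());
    std::string h = "Content-Type: text/html\r\n";
    if (is_new) h += "Set-Cookie: sid=" + sid + "; Path=/; HttpOnly\r\n";
    return h + "\r\n";
}
```

In a real CGI program, `main` would call `current_session_id(std::getenv("HTTP_COOKIE"), is_new)`, load and save the session data keyed by that id (in a file or a database, for example), and write `response_headers(...)` to standard output before the page body.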
Poco is an excellent C++ library with data access, XML, networking, compression and crypto all wrapped up in one nice little package.
Boost, evidently; Qt for GUIs (it's not clearly a library, I know); the Electronic Arts Standard Template Library; and Blitz++ if you want to do scientific computation:
Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran's performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival -- and in some cases even exceed -- the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
Recent benchmarks show C++ encroaching steadily on Fortran's high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders.

Best Practice for Multi-programming-language Projects

Does anyone have any experience with doing this? I'm working on a Java decompiler right now in C++, but would like a higher-level language to do the actual transformations of the internal trees. I'm curious whether the overhead of marshaling data between languages is worth the benefit of a more expressive language (like Haskell) that better articulates what I'm trying to accomplish. Is this actually done in the "real world", or do people usually pick a language at the beginning of a project and stick with it? Any tips from those who have attempted it?
I'm a big advocate of always choosing the right programming language for each challenge. If there is another language which handles some otherwise tricky task easily, I'd say go for it.
Does it happen in the real world? Yes. I am currently working on a project which is made up of both PHP and Objective-C code.
The trick is, as you pointed out, the communication between the two languages. If at all possible, let each language stick to its own domain, and have the two sections communicate in the simplest way possible. In my case, it was XML documents sent via HTTP. In your case, some kind of formatted text file might be the answer.
Marshalling costs depend on the languages and architecture you're working with. For example, if you're on the CLR or JVM, there are low-cost interop solutions available, though I gather you are probably working with unmanaged C++.
Another avenue is an embedded domain-specific language. Tree transformations are often expressible via pattern matching and application of a relatively small number of functions. You could consider writing a simple tree pattern-matcher - e.g. something that looks like Lisp s-exprs but uses placeholders to capture variables - with associated actions that are functions that transform the matched subtree.
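To make the pattern-matcher idea a bit more concrete, here is a small sketch in C++. Everything in it is made up for illustration: the `Node` type, the convention that labels starting with `?` are placeholders, and the single bottom-up rewriting pass are all assumptions, not anything taken from a real decompiler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A minimal immutable tree: a label plus child nodes.
struct Node;
using NodePtr = std::shared_ptr<const Node>;
struct Node {
    std::string label;
    std::vector<NodePtr> kids;
};

NodePtr mk(std::string label, std::vector<NodePtr> kids = {}) {
    return std::make_shared<const Node>(Node{std::move(label), std::move(kids)});
}

// Structural equality, used when a placeholder occurs more than once.
bool same(const NodePtr& a, const NodePtr& b) {
    if (a->label != b->label || a->kids.size() != b->kids.size()) return false;
    for (std::size_t i = 0; i < a->kids.size(); ++i)
        if (!same(a->kids[i], b->kids[i])) return false;
    return true;
}

using Bindings = std::map<std::string, NodePtr>;

// Match `tree` against `pat`. A pattern label starting with '?' is a
// placeholder that captures the whole subtree at that position.
bool match(const NodePtr& pat, const NodePtr& tree, Bindings& out) {
    assert(pat && tree);
    if (!pat->label.empty() && pat->label[0] == '?') {
        auto it = out.find(pat->label);
        if (it == out.end()) { out[pat->label] = tree; return true; }
        return same(it->second, tree);
    }
    if (pat->label != tree->label || pat->kids.size() != tree->kids.size()) return false;
    for (std::size_t i = 0; i < pat->kids.size(); ++i)
        if (!match(pat->kids[i], tree->kids[i], out)) return false;
    return true;
}

// A rule pairs a pattern with an action that builds the replacement subtree.
struct Rule {
    NodePtr pattern;
    std::function<NodePtr(const Bindings&)> action;
};

// One bottom-up pass: rewrite the children first, then apply the first rule
// that matches the rebuilt node (if any).
NodePtr rewrite(const NodePtr& tree, const std::vector<Rule>& rules) {
    std::vector<NodePtr> kids;
    for (const auto& k : tree->kids) kids.push_back(rewrite(k, rules));
    NodePtr cur = mk(tree->label, std::move(kids));
    for (const auto& r : rules) {
        Bindings b;
        if (match(r.pattern, cur, b)) return r.action(b);
    }
    return cur;
}

// Print a tree as an s-expression, e.g. (mul y 2).
std::string show(const NodePtr& n) {
    if (n->kids.empty()) return n->label;
    std::string s = "(" + n->label;
    for (const auto& k : n->kids) s += " " + show(k);
    return s + ")";
}
```

With a rule whose pattern is `mk("add", {mk("?x"), mk("0")})` and whose action returns `b.at("?x")`, rewriting `(mul (add y 0) 2)` should give `(mul y 2)`. Whether something like this is expressive enough, compared to handing the trees to Haskell, depends on how elaborate your transformations get.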
John Ousterhout, the inventor of Tcl/Tk, was a strong advocate of multi-language programming and wrote quite extensively about it. In order to do it, you need a clean interface mechanism between the languages you are using. There are quite a few such mechanisms. Examples include:
SWIG (Simplified Wrapper and Interface Generator) can take a C or C++ (or several other languages) header file and generate an interface for a high-level language such as Perl or Python that allows you to access the API. There are other systems that use this approach.
Java's JNI and various other systems, such as Python's ctypes and VisualWorks DLL/C Connect, are native mechanisms that allow you to explicitly construct the call to the lower-level subsystem.
Tcl/Tk was designed explicitly to be embedded, and has a native API for a C library to add hooks into the language. The constructs for this resemble argv[] structures in C, and were designed to make it relatively easy to interface a command-line based C program into Tcl. This is similar to the above example, but coming from the opposite direction. Many scripting languages such as Python, Lua and Tcl support this type of mechanism.
Explicit glue mechanisms such as Pyrex are similar to a wrapper generator, but have their own language for defining the interface. Pyrex is actually a complete programming language.
Middleware such as COM or CORBA allows a generic interface definition to be built externally to the application in an interface definition language, with language bindings for the languages concerned using the common interface mechanism.
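Most of the mechanisms above end up calling functions with C linkage, so for a C++ code base the "clean interface" usually means a thin C-callable facade over the C++ classes. A minimal sketch of what that might look like follows. `SymbolTable` and the `symtab_*` functions are hypothetical names invented for this example, not part of any of the tools listed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical C++ class that another language should be able to drive.
class SymbolTable {
public:
    int add(const std::string& name) {
        names_.push_back(name);
        return static_cast<int>(names_.size()) - 1;
    }
    const std::string* lookup(int id) const {
        if (id < 0 || static_cast<std::size_t>(id) >= names_.size()) return nullptr;
        return &names_[static_cast<std::size_t>(id)];
    }
private:
    std::vector<std::string> names_;
};

// C-callable facade: only plain C types cross the boundary, the object is
// passed around as an opaque pointer, and no exception is allowed to escape.
extern "C" {

SymbolTable* symtab_create() {
    try { return new SymbolTable(); } catch (...) { return nullptr; }
}

void symtab_destroy(SymbolTable* t) { delete t; }

// Returns the id of the new entry, or -1 on failure.
int symtab_add(SymbolTable* t, const char* name) {
    assert(t != nullptr && name != nullptr);
    try { return t->add(name); } catch (...) { return -1; }
}

// Copies the NUL-terminated name into `buf`. Returns 0 on success, or -1 if
// the id is unknown or the buffer is too small.
int symtab_lookup(const SymbolTable* t, int id, char* buf, std::size_t buf_len) {
    assert(t != nullptr && buf != nullptr);
    const std::string* s = t->lookup(id);
    if (s == nullptr || s->size() + 1 > buf_len) return -1;
    std::memcpy(buf, s->c_str(), s->size() + 1);
    return 0;
}

}  // extern "C"
```

Built as a shared library, functions like these can be called by name through something like Python's ctypes, or described to a wrapper generator such as SWIG. The caller owns the handle and must pass it back to `symtab_destroy` when done.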

Resources