Do processors have optimizations and architecture preferences targeted firstly or mainly to C/C++ languages?


I have read the article "C Is Not a Low-level Language", which contains this paragraph:
Unfortunately, simple translation providing fast code is not true for
C. In spite of the heroic efforts that processor architects invest in
trying to design chips that can run C code fast, the levels of
performance expected by C programmers are achieved only as a result of
incredibly complex compiler transforms. The Clang compiler, including
the relevant parts of LLVM, is around 2 million lines of code. Even
just counting the analysis and transform passes required to make C run
quickly adds up to almost 200,000 lines (excluding comments and blank
lines).
What does the bold sentence mean? Does it mean that manufacturers design processors with some optimizations and architecture decisions targeted firstly or even specifically at C (C++) code? Or does it just mean that they are trying to design processors that execute any code faster, including code written in C?
If some preferences for C exist, what are they?
A couple of thoughts of mine:
a branch prediction algorithm tuned to patterns that occur mainly in C code.
instructions which are useful and used in C but aren't useful in other languages. Otherwise other languages (compilers) would use them too.
I know about language-specific processors like Jazelle or the Lisp machine, for Java and Lisp respectively, but similar technologies can't be applied to C, because there is no bytecode.

Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
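To make the shared-global point concrete, here is a minimal sketch, assuming a POSIX system with pthreads and a C11 compiler. The names (shared_counter, worker, run_two_writers) are made up for illustration. Two threads update one global, and the hardware's coherency machinery is what lets both of them see a single consistent value.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define INCREMENTS 100000

/* One global, visible to both threads (hypothetical name). */
static atomic_long shared_counter;

/* Each worker bumps the same global INCREMENTS times. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&shared_counter, 1);
    return NULL;
}

/* Runs two writers to completion and returns the final counter value. */
long run_two_writers(void)
{
    pthread_t a, b;
    atomic_store(&shared_counter, 0);
    int rc_a = pthread_create(&a, NULL, worker, NULL);
    int rc_b = pthread_create(&b, NULL, worker, NULL);
    assert(rc_a == 0 && rc_b == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&shared_counter);
}
```

The atomic type is there so that the increments themselves don't race. The fact that both cores agree on the contents of that one address at all is the part the hardware has to provide.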
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.

The "heroic efforts" of the processor architects refers largely to the design of the cache and memory subsystems on the CPUs.
For a very long time now, the instruction execution circuits inside CPUs have been far, far quicker than the electronics that look after fetching and writing data from and to memory, largely because the technology we have for RAM chips hasn't really got faster at the same rate. Where the cores have sped up, the memory hasn't, and so the cache and memory subsystem has to get ever more elaborate in order to be able to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; a signal travels down the track at about 8 inches per nanosecond. For signals clocked in the GHz range (one cycle is 1 ns or less), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in the home-grown M1 silicon.
Back to caches - the likes of Intel (and AMD, ARM) have strived to make CPUs that have good, general purpose performance, so that they run pretty much any code well. Modern compilers help a lot - if they know what the cache in the CPU is likely to do in any particular circumstance, the compilers can arrange code to fit in with what the hardware is likely to do.
A reasonable question then is, is that effective? Well, yes and no. Yes, because compiled code does run quite well, but no for a couple of reasons. The first is that the ultimate performance for any given algorithm is rarely reached by the compiler / CPU, and the second is that all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions where the programmer can give the cache system a hint that the program will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address it's already in cache.
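From C, a hint of this kind is usually reached through a compiler builtin rather than raw assembly. The sketch below assumes GCC or Clang, whose `__builtin_prefetch` is a real builtin. The function name and the prefetch distance are assumptions picked for illustration and would need tuning by measurement on a real target.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; not a recommended value. */
#define PREFETCH_AHEAD 16

/* Sums an array while hinting that data a little further on will be read soon. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3); /* 0 = read, 3 = keep in cache */
#endif
        total += data[i];
    }
    return total;
}
```

For a simple linear walk like this the hardware prefetcher will probably do just as well on its own. Hints tend to pay off for access patterns the hardware can't guess, such as pointer chasing.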
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256Gbyte/sec at the time, which was very very quick). The developer was completely on their own; they had to write code to load code and data into a core, get that executed, and then write more code to get the results out to wherever. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to execution cores, get in the way or (worse) just hide inefficiencies from you, one had the freedom to break down an algorithm into core-sized lumps knowing that if it fitted, it'd be very quick, or knowing for sure it didn't fit.
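The "core-sized lumps" style can be imitated in ordinary C. This is only a rough sketch of the programming model, not Cell SDK code: a static buffer stands in for the local store, memcpy stands in for the DMA transfer, and the chunk size and function name are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256 * 1024) /* local store size quoted above */
#define CHUNK_ELEMS 4096               /* assumed transfer size: 16 KiB of floats */

/* Stands in for the core's private static RAM. */
static float local_buf[CHUNK_ELEMS];

/* Sums n floats from "main memory", touching them only after an explicit copy in. */
double sum_via_local_store(const float *main_mem, size_t n)
{
    assert(sizeof local_buf <= LOCAL_STORE_BYTES);
    double total = 0.0;
    for (size_t off = 0; off < n; off += CHUNK_ELEMS) {
        size_t len = (n - off < CHUNK_ELEMS) ? n - off : CHUNK_ELEMS;
        memcpy(local_buf, main_mem + off, len * sizeof(float)); /* explicit "DMA in" */
        for (size_t i = 0; i < len; i++)                        /* compute on local data only */
            total += local_buf[i];
    }
    return total;
}
```

On the real hardware the transfer of the next chunk would normally be overlapped with the computation on the current one (double buffering), which this sketch leaves out.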
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with cache coherency and a Symmetric Multi-Processing (SMP) environment. It's even harder for the cache designer to get it right if there are multiple cores that might very well start tampering with a value. The difficulties involved have led to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from a single thread to multiple threads. It's an attractive thought - in the way that a single-threaded program can see all of its data merely by addressing it, why not extend the same visibility of data to all threads in the program?
Turns out that this is what makes it very difficult to design modern CPUs. However the reasons why the industry went this way are plain enough - the smallest possible delta between single and multicore CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos from the 1980s and early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sound familiar? Yes - the Cell processor was a bit like that.
Interestingly, languages such as Rust, Go, Erlang have all implemented Communicating Sequential Processes as a parallel processing paradigm. The irony is that, these days, CSP has to be implemented on top of a SMP environment, which is itself an artificial construct brought about by the interconnect between CPUs, cores and memory (e.g. QPI, Hypertransport). Basically, if the whole software world got fully comfortable with CSP then CPU designers wouldn't have to design cache-coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).

The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it I felt triggered by, but I don't want to go addressing each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer, with a particular interest in writing high performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully, this might be of interest to others in the industry with or without a programming background.
From my point of view, the strengths of C are mainly as follows....
C allows you to do things you just can't do in 'higher level' languages.
A well-written C program (see weakness no. 1) is hard to beat on performance by a program written in another language running on the same hardware.
C is comfortable handling binary data allowing for memory conservation.
C is well established with lots of libraries and programmers.
Objects in memory can be made easy to process from anywhere in the program by using pointers so the data itself doesn't need to be passed around.
Multi-threaded and multi-process programs are quite easy to implement.
It has Read-Write shared memory between threads (and processes with some fancy low-level stuff?)
Assembly can be inlined where needed (though it's not C then I know!).
... and main weaknesses...
Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
It takes a lot of code to do simple things for which there are no library functions.
Buffer overflow potential is easily missed, even for experienced programmers.
C pointers can be confusing.
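To illustrate the SIMD weakness in the list above: the intrinsics themselves are real (SSE's `_mm_add_ps`, NEON's `vaddq_f32`), but each instruction set needs its own branch plus a scalar fallback, which is where the portability pain comes from. The wrapper name add_floats is made up, and this is a sketch rather than a tuned routine.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n), four floats at a time where possible. */
void add_floats(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4)
        vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    for (; i < n; i++) /* scalar tail, and the whole loop on other targets */
        out[i] = a[i] + b[i];
}
```

Three code paths for one addition loop, and that is before AVX, AVX-512 or SVE enter the picture.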
The C programming language has a special place in the evolution of programming languages and I for one, would welcome a replacement that is a better fit to what is possible with modern hardware if it doesn't tie the hands of the programmer and offers better security and performance. From the article,...
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things exist already: GPUs! Modern CPUs are much more like GPUs than they used to be, now that core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance, but they can't do everything well. Some applications cannot be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is so much easier to exhaust your memory bandwidth and fast cache when running many threads that it might be judged not worth the added complexity over a good single-threaded implementation.
In OpenCL C, the programmer has somewhat more control over where data is stored in memory, which can definitely aid performance. Maybe it's a costly mistake to try to make programming languages too hardware independent. Might it be better to review some (LLVM-like) intermediate standard, like in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects? Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into, along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
Hardware dependent source code where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
Hardware specific source code.
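The third level in the list above (source whose execution path depends on a hardware query) might look roughly like this. The sketch assumes GCC or Clang on x86, where `__builtin_cpu_init` and `__builtin_cpu_supports` are real builtins. The function names are hypothetical, and the "wide" variant is just a placeholder for code that would really use AVX2.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline path that runs anywhere. */
static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Placeholder for a path that would use wider vector instructions. */
static long sum_wide(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];
    return s0 + s1;
}

typedef long (*sum_fn)(const int *, size_t);

/* Queries the CPU once per call and picks an implementation. */
sum_fn select_sum(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_wide;
#endif
    return sum_generic;
}

long sum_dispatch(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    return select_sum()(v, n);
}
```

The same binary then runs on old and new parts, at the cost of carrying every variant around and having to test them all.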
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portability to aim for. I think that C was supposed to be hardware independent, but it isn't really if you want to get the best performance out of your hardware. OpenCL C tries also, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues though as any other. I don't think there are any 'Level 1' portable languages currently.
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential but are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent space. In another 10 or 20 years time, C will probably still be going strong. How many other modern languages will still be commonplace then?


Is code easily portable between Cortex A5 and Cortex A9 made by two different companies?

Can code written for a Cortex-A5 built by one company be ported to a Cortex-A9 made by another company without too much difficulty?
I want to write some bare metal C code that runs on Atmel's SAMA5D4 (Cortex A5) that takes video from a CMOS camera with a parallel interface and encodes it to H.264. That chip can hardware encode at 720p.
Later, I may want to build a similar setup that can encode at 1080p, so I would want to upgrade to a more expensive chip, NXP i.MX 6Solo (Cortex A9).
So I want to know if I would encounter major headaches or if it would be rather easy to port later. My gut tells me it should be easy but I thought I'd better ask the experts first. If it's a huge headache though I may start with the more expensive chip first.
I'm new to this and not at all experienced with ARM chips or even much C but am willing to learn :-)

As captured in the comments, this task can be made easier if the code is written from the start to clearly separate the platform-specific detail from the application code. This is not as simple as just replacing boot.s, and it isn't something that you can really claim to have done until you've tested the porting.
Much of the architectural behaviour between the two processors will be unchanged, and the C-compiler ought to be able to take advantage of micro-architectural optimisations. This optimisation may not be the best that you could achieve with some manual effort.
Where you are likely to see hard problems is any points in your code that are sensitive to memory ordering or potentially interactions between code and exceptions. The Cortex-A9 is significantly more out-of-order than the Cortex-A5, and the migration may expose bugs in your code. Libraries ought to be stable now, but there is still a risk to be aware of. Anticipating this sort of problem is quite hard and if you are writing the majority of the code yourself you probably need to build in some contingency for the porting task. Once the code is stable on A9, issues of this sort are less likely to show up on either A5 (to give a lower cost production option), or more recent high performance cores.

If I cut and paste a chapter of my math textbook into my biology textbook, will that make sense? They are both written in English.
No, that makes no sense. Assuming you stick to common ARM instructions for the code (English), the code isn't going to work from one chip (math book) to another (biology). The majority of the difference is in the vendors' logic, which is outside the ARM core; there is no reason whatsoever to assume that two vendors have the same peripherals at the same addresses that work exactly the same bit for bit, gate for gate.
So in general bare metal will NOT work and does NOT work like this. A very high-level printf-this-or-that C program, sure, because you have many layers of abstraction including the target; it doesn't even have to be ARM to ARM. Having said that, it is certainly possible for you to make, or maybe if you are very lucky find, a hardware abstraction layer that hides the differences between the chips; above that layer you can ideally write that portion of the project once and port it. As far as ARM vs ARM goes, the differences should be handled by the compiler, and again it doesn't even have to be ARM to ARM; it could be ARM to MIPS. Any assembly language you may have, or any core-specific accesses/instructions, would need to be checked against the two technical reference manuals to ensure they are compatible. Probably not at the Cortex-A level, but for Cortex-Ms there are some core-specific address space items that can affect high-level language code; for something like this to work you would have to hide that in the abstraction layer.
Generally NO. ARM is the underlying core, and the chip differences have nothing to do with ARM, so it's like cutting and pasting a chapter from a mystery novel you are writing in English into a biography you are also writing in English and hoping that chapter makes sense in the latter book.
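One common shape for such a hardware abstraction layer is a table of function pointers that each board fills in. The sketch below is only an illustration: platform_ops, pump_words and the fake host-side platform are hypothetical names, and a real SAMA5D4 or i.MX 6 port would supply its own functions written against that vendor's registers.

```c
#include <assert.h>
#include <stdint.h>

/* Everything the application needs from the board, and nothing more (hypothetical). */
typedef struct {
    void     (*camera_start)(void);
    uint32_t (*camera_read_word)(void);
    void     (*encoder_push)(uint32_t word);
} platform_ops;

/* Portable application code: it never touches a vendor register directly. */
uint32_t pump_words(const platform_ops *hw, uint32_t count)
{
    assert(hw && hw->camera_start && hw->camera_read_word && hw->encoder_push);
    hw->camera_start();
    for (uint32_t i = 0; i < count; i++)
        hw->encoder_push(hw->camera_read_word());
    return count;
}

/* A fake platform for running the application logic on a PC. */
static uint32_t fake_next, fake_sum;
static void     fake_start(void)           { fake_next = 1; fake_sum = 0; }
static uint32_t fake_read(void)            { return fake_next++; }
static void     fake_push(uint32_t word)   { fake_sum += word; }

const platform_ops fake_platform = { fake_start, fake_read, fake_push };

uint32_t fake_platform_sum(void) { return fake_sum; }
```

Porting then means writing a second platform_ops table for the new chip, and the fake platform lets you test the application logic without any hardware.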

Pure C OpenCL vs Python OpenCL performance

I am looking for performance measurements comparing a Python wrapper for OpenCL with pure C OpenCL. The measurements can cover time, memory, etc.
- Are there any benchmarks available?
- What should be the expectation about the time performance differences?
- What kind of tasks (parallel of course...) should make a difference?

It is likely that PyOpenCL is your best choice. I would choose to use C only in very specific situations (a super-critical need for speed/low-latency on the host). For most casual parallel programs, it is fine for the host side to have plenty of slack, because all the real work gets done on the device.
You can consider PyOpenCL and OpenCL to have identical performance on the device.
Maybe use C if you are, like... designing a self-driving car, and every millisecond/amp matters. But even in that situation, it is likely that Python could be used effectively.
The best way to figure out if your specific program is slowed down is to time your code. For PyOpenCL that means host-side wall-clock timing with:
import time
and device-side event timing by creating the command queue with:
cl.command_queue_properties.PROFILING_ENABLE
Many smart companies and individuals choose to code first in Python, because they can build a flexible, working prototype quickly. If they end up needing more host performance later, it is relatively easy to port Python to C.
Hope that helps!

OpenCL uses precompiled programs, so-called "kernels", that are later sent to the device for execution. This means the main cost to measure is the I/O overhead of the OpenCL implementation's API. You therefore can't rely on memory/CPU measurements, because the actual OpenCL part uses the same resources either way.
AFAIK, no benchmarks are available, but it is not hard to write one if you need it (matrix multiplication is the usual hello-world example).
OpenCL is not the kind of tool that does I/O on every CPU cycle. Its field of use is really big data processing: one big input, a lot of processing operations, and one output (whether small or big). Nothing says that OpenCL can't be used with lots of I/O and minimal calculation, but then the implementation API overhead is not worth it.
The expectation should be that I/O speed is roughly the same in both cases relative to overall application performance.

There is a benchmark here: https://github.com/bennylp/saxpy-benchmark, comparing PyOpenCL against OpenCL as well as other frameworks/methods such as CUDA, plain C++, Numpy, R, Octave, and even TensorFlow (disclaimer: I'm the author)
According to the benchmark results, the performance difference between OpenCL and PyOpenCL varies wildly. The PyOpenCL GPU target is almost 7x slower than OpenCL, but for the CPU target PyOpenCL is actually more than 2x faster than OpenCL!

Footprint of Lua on a PPC Micro

We're developing some code on Freescale PPC micros (5517 and 5668 at the moment), and I was wondering if we could put Lua on them.
The devices need to be easily programmed/reconfigured in the field, and the current product uses a proprietary interpreted logic language that can be loaded in, and our software (written in C) runs an interpreter. I would like to move to a better language (the implementation is a bit buggy and slow), so I'm considering Lua, but the memory footprint must be very low. For the 5517 (which we may not use), the maximum RAM is 80K. Things are better on the 5668, with 592K of RAM.
So does anyone know if I can put Lua on bare metal? We're effectively not running an OS. If so, are there any estimates on what kind of memory footprint we might see? How much effort it would take?
Failing this, does anyone know of any kind of interpreter that might be better in a memory-constrained environment without an OS? Or are we better just rolling our own?

See the eLua project.

What language to learn for microcontroller programming? [closed]

I'm getting into microcontroller programming and have been hearing contrasting views. What language is most used in the industry for microcontroller programming? Is this what you use in your own work? If not, why not?
P.S.: I'm hoping the answer is not assembly language.

In my experience, you absolutely must know C, and assembly language helps too.

Unless you are dealing with very bare-bones microcontrollers (like the RS08 series), C is by far the language of choice. Get to know C, understand functionality like volatile and const. Also understand the architecture - what is efficient, what isn't, what can the CPU do. These will differ wildly from a "desktop" environment. Learn to love stdint.h.
You will encounter C++ (or a restricted subset) as projects scale up.
However, you need to understand the CPU and how to read basic assembly as a debugging tool. You can't become an excellent embedded developer without this skillset.

What 'contrasting' views have you heard? To some extent it will depend on the microcontroller and the application. However C is available for almost all architectures (I hesitate to say all, but probably all that you will ever encounter); so on that point alone, learning C would give you the greatest coverage.
For all architectures, the availability of an assembler and a C compiler are pretty much a given. For 32-bit and most 16-bit architectures C++ will also be available. Notable exceptions I have encountered are Microchip's PIC24/dsPIC parts for which C++ is not supported by Microchip's own GNU based compiler (although 3rd party compilers may do so).
While there are C++ compilers for 8-bit microcontrollers, C++ is not ubiquitous on such platforms, and often the compilers support only subsets of the full language. For the types (or more specifically the size) of application for which 8-bit is usually employed, C++ may be useful but not to the extent that it is on much larger applications, so C is generally adequate.
There are a lot of myths about C++ in embedded systems; while the language is larger than C and has constructs that may compromise the performance or capacity of your system, you only pay for what you use with C++. But of course if what you use is just the C subset, then C would be adequate in any case.
The point about C (and C++) is that it is a systems level language; it will run on your microprocessor with no additional support save a very simple runtime start-up to initialise the processor (and possibly external SDRAM), initialise static data, establish a stack, and in the case of C++ invoke static constructors. This is why along with target specific assembler, it is used to build operating systems and kernels - it needs no operating system or kernel itself to run.
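The "very simple runtime start-up" mostly boils down to two loops, sketched here as plain functions so they can be read and tested anywhere. In a real start-up file the addresses and sizes come from linker-script symbols, whose names vary by toolchain and are deliberately not assumed here, and the stack pointer is set up before any of this runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies the initial values of initialised statics from flash to RAM. */
void copy_data_section(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert((dst != NULL && src != NULL) || words == 0);
    while (words--)
        *dst++ = *src++;
}

/* Clears zero-initialised statics, as the C standard promises the program. */
void zero_bss_section(uint32_t *start, size_t words)
{
    assert(start != NULL || words == 0);
    while (words--)
        *start++ = 0;
}
```

After these two steps (plus static constructors for C++) the start-up code can call main, and everything beyond that is ordinary C.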
One of the reasons I suggested that it may depend on the microcontroller is that if for example it is an ARM9 with a few Mb of external SDRAM, and at least say 4Mb Flash (also usually external - memory takes up a lot of die space), then you could run a 'heavyweight' OS on it such as Linux, WinCE, or Symbian, or even a large RTOS such as QNX or VxWorks. Then your choice of language (once you got the OS working), would be influenced by the OS, though for real-time applications C and C++ would still dominate, (or often Ada in military, avionics, and some transport applications).
For mid-size applications - a few hundred KBytes of code and data space - C# running on the .NET-Micro platform is possible. However, I sat in a presentation of this at the Embedded Systems Show in the UK a few years ago, just after it was launched; when I asked the question "but is it real-time?" and was told "no, you need WinCE for that", there was a gasp and a groan from much of the audience, and some stopped wasting their time and left the presentation there and then (including me).
So I am still interested in the 'contrasting' opinions you have heard, because although it is possible to use other languages, for your question:
What language is most used in the industry for microcontroller programming?
the definitive answer is C, for the reasons I have given. For anyone who might choose to contest this assertion here are the statistics (note the different survey method after 2004 explained in the text). However, just to add to the collection of alternatives, I once spent two years programming in Forth on embedded systems, and I know of people still using it, but it is a bit of a niche.

I've successfully used both C and C++ but in almost any microcontroller project you will need to be familiar with the assembly language of the target micro. If only for debugging low level hardware issues assembly will be indispensable, even if it is a cursory familiarity.
I think the hardest thing for me when moving from a desktop environment to a micro was that almost everything needs to be allocated statically. You won't often use malloc/new in a micro unless maybe it has external RAM.
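A typical replacement for malloc on a small micro is a fixed pool whose storage is reserved at link time. This is a minimal sketch with made-up sizes and names (pool_alloc, pool_free). It is not thread- or interrupt-safe as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 4  /* assumed sizes, chosen only for the example */
#define BLOCK_BYTES 32

/* All memory is reserved statically, so the worst case is known at build time. */
static _Alignas(max_align_t) uint8_t pool_storage[POOL_BLOCKS][BLOCK_BYTES];
static uint8_t pool_used[POOL_BLOCKS];

/* Returns a free block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_used[i]) {
            pool_used[i] = 1;
            return pool_storage[i];
        }
    }
    return NULL;
}

/* Returns a block to the pool; the pointer must have come from pool_alloc. */
void pool_free(void *p)
{
    uint8_t *base = &pool_storage[0][0];
    uint8_t *block = p;
    assert(block >= base && block < base + sizeof pool_storage);
    assert((size_t)(block - base) % BLOCK_BYTES == 0);
    pool_used[(size_t)(block - base) / BLOCK_BYTES] = 0;
}
```

There is no fragmentation and allocation time is bounded. The price is that you have to pick the block count and size up front.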
I notice that you also tagged your question with FPGA and Verilog, take a look at Altium, they have a C to Hardware compiler that works really well with their integrated environment.

Regarding assembler:
Prefer C/C++ over assembler as much as possible. You'll get better productivity by writing as much as possible in C or C++. That includes being able to run some of your code on a PC, which can help developing the higher-level code (application-layer functions).
On many embedded platforms, it's good to have someone on the project who is comfortable with a little assembler. Mostly to get start-up code and interrupts going nicely, and perhaps functions for interrupt enable/disable. That's not the same as knowing it really thoroughly--just a basic working knowledge will be sufficient.
If you're porting an RTOS (e.g. µC/OS-II) to a new platform, then you'll have to know your assembler more. But hopefully your RTOS supports your platform well already.
If you're pushing up against CPU performance limits, you probably need to know assembler more thoroughly. But hopefully you're not pushing performance limits much, because that can be a drag on a project's viability.
If you're writing for a DSP, you probably need to know the DSP's assembler fairly thoroughly.

Microcontrollers were originally programmed only in assembly language, but various high-level programming languages are now also in common use to target microcontrollers. These languages are either designed specially for the purpose, or versions of general purpose languages such as the C programming language. Compilers for general purpose languages will typically have some restrictions as well as enhancements to better support the unique characteristics of microcontrollers. Some microcontrollers have environments to aid developing certain types of applications. Microcontroller vendors often make tools freely available to make it easier to adopt their hardware.
Many microcontrollers are so quirky that they effectively require their own non-standard dialects of C, such as SDCC for the 8051, which prevent using standard tools (such as code libraries or static analysis tools) even for code unrelated to hardware features. Interpreters are often used to hide such low level quirks.
Interpreter firmware is also available for some microcontrollers. For example, BASIC on the early microcontrollers Intel 8052[4]; BASIC and FORTH on the Zilog Z8[5] as well as some modern devices. Typically these interpreters support interactive programming.
Simulators are available for some microcontrollers, such as in Microchip's MPLAB environment. These allow a developer to analyze what the behavior of the microcontroller and their program should be if they were using the actual part. A simulator will show the internal processor state and also that of the outputs, as well as allowing input signals to be generated. While most simulators are limited in that they cannot simulate much of the other hardware in a system, they can exercise conditions that may otherwise be hard to reproduce at will in the physical implementation, and can be the quickest way to debug and analyze problems.

You need to know assembly language programming. You also need good knowledge of C and C++. So work hard on these things to build better expertise in microcontroller programming.

And don't forget about VHDL.

For microcontrollers assembler comes before C. Before the ARMs started pushing into this market the compilers were horrible and the memory and ROM really tiny. There were not enough resources or commonality to port your code, so writing in C for portability made no sense.
Some microcontrollers' assembler is less than desirable, and ARM is taking over that market. For less money, less power, and less footprint you can have a 32-bit processor with more resources. It just makes sense. Much of your code will still not port, but you can probably get by with C.
Bottom line, assembler and C. If they advertise BASIC or Java or something like that, mark that company on your blacklist and move on. Been there, done that, have the scars to prove it.

First assembly, then C.
I think that someone who knows both assembly and C is better off than someone who knows only C.

We have to use C "for performance reasons" [closed]

In this age of many languages, there seems to be a great language for just about every task and I find myself professionally struggling against a mantra of "nothing but C is fast", where fast is really intended to mean "fast enough". I work with very rational open-minded people, who like to compare numbers, and all I have are thoughts and opinions. Could you help me find my way past subjective opinions and into the "real world"?
Would you help me find research as to what if any other languages could be used for embedded and (Linux) systems programming? I very well could be pushing a false hypothesis and would greatly appreciate research to show me this. Could you please link or include good numbers so as to help keep the "that's just his/her opinion" comments to a minimum.
So these are my particular requirements
memory is not a serious constraint
portability is not a serious concern
this is not a real time system

In my experience, using C for embedded and systems programming isn't necessarily a performance issue - it's often a portability issue. C tends to be the most portable, well supported language on just about every platform, especially on embedded systems platforms.
If you wish to use something else in an embedded system, it's often a matter of figuring out what options are available, then determining whether the performance, memory consumption, library support, etc, are "good enough" for your situation.

"Nothing but C is fast [enough]" is an early optimisation and wrong for all the reasons that early optimisations are wrong. If your system has enough complexity that something other than C is desirable, then there will be parts of the system that must be "fast enough" and parts with lighter constraints. If writing your code, for example, in Python will get the project finished faster, with fewer bugs, then you can follow up with some C or assembly code to speed up the time-critical parts.
Even if it turns out that the entire code must be written in C or assembly to meet the performance requirements, prototyping in a language like Python can have real benefits. You can take your working Python prototype and gradually replace parts with C code until you reach the necessary performance.
So, use the tools that let you get the development work done most correctly and most quickly, then use real data to determine where you need to optimize. It could be that C is the most appropriate tool to start with sometimes, but certainly not always, even in embedded systems.
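As a rough illustration of "use real data", here is a minimal C sketch that times a candidate routine before anyone decides to rewrite it. The names `seconds_per_call` and `candidate_hot_path` are made up for this example, and `clock()` measures CPU time at coarse resolution, so treat the numbers as a first estimate rather than a proper profile.

```c
#include <assert.h>
#include <time.h>

/* Returns the average CPU time, in seconds, of one call to fn.
   Helper name and signature are hypothetical, not from any library. */
double seconds_per_call(void (*fn)(void), int repeats)
{
    assert(fn != NULL && repeats > 0);
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn();
    clock_t end = clock();
    return ((double)(end - start) / CLOCKS_PER_SEC) / repeats;
}

/* Stand-in for whatever part of the system is suspected to be slow. */
volatile unsigned long sink;

void candidate_hot_path(void)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        sum += i * i;      /* unsigned arithmetic wraps, so this is well defined */
    sink = sum;            /* volatile store keeps the result observable */
}
```

Calling `seconds_per_call(candidate_hot_path, 100)` gives a per-call figure you can compare against your actual time budget. Only the parts that miss the budget are candidates for a rewrite in C or assembly.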

There are some very good reasons for using C in embedded systems, and "performance" is only one of the minor ones. Embedded code is very close to the hardware, and you need manual memory addressing to communicate with that hardware. Most APIs and SDKs are available for C.
There are only a few platforms that can run a VM for Java or Mono which is partially due to the performance implications but also due to expensive implementation costs.
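To make the "manual memory addressing" point concrete, here is a small sketch of how C code typically talks to a memory-mapped peripheral. The register layout, the names `gpio_regs_t`, `gpio_set_output` and `gpio_write`, and the address in the comment are all invented for illustration. So that the sketch can run on a desktop, the "registers" are an ordinary variable rather than a real hardware address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block of an imaginary GPIO peripheral. */
typedef struct {
    volatile uint32_t direction; /* one bit per pin: 1 = output */
    volatile uint32_t output;    /* one bit per pin: driven level */
} gpio_regs_t;

/* On real hardware the base would be a fixed address from the datasheet,
   for example: #define GPIO ((gpio_regs_t *)0x40020000u)  (address invented).
   Here it points at a plain variable so the code runs on a host machine. */
gpio_regs_t fake_gpio;
#define GPIO (&fake_gpio)

/* Configure a pin as an output by setting its direction bit. */
void gpio_set_output(unsigned pin)
{
    assert(pin < 32);
    GPIO->direction |= (UINT32_C(1) << pin);
}

/* Drive a pin high (level != 0) or low (level == 0). */
void gpio_write(unsigned pin, int level)
{
    assert(pin < 32);
    if (level)
        GPIO->output |= (UINT32_C(1) << pin);
    else
        GPIO->output &= ~(UINT32_C(1) << pin);
}
```

The `volatile` qualifier is the important part: it tells the compiler that every read and write must actually happen, which is what you want when the "memory" is really a device register.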

Apart from performance, there is another consideration: you'll most likely be dealing with low-level APIs that were designed to be used in C or C++.
If you cannot use some SDK, you'll only get yourself in trouble instead of saving time with developing using a higher level language. At the very least, you'll end up redoing a bunch of function declarations and constant definitions.

For C:
C is often the only language that is supported by compilers for a given processor.
Most of the libraries and example code are probably also in C.
Most embedded developers have years of C experience but very little experience in anything else.
Allows direct hardware interfacing and manual memory management.
Easy integration with assembly language.
C is going to be around for many years to come. In embedded development it's a monopoly that smothers any attempt at change. A language that needs a VM, like Java or Lua, is never going to go mainstream in the embedded environment. A compiled language might stand a chance if it provides compelling new features over C.

There are several benchmarks on the web that compare different languages. In most of them you will find a C or C++ implementation at the top, as these languages give you more control to really optimize things.
Example: The Computer Language Benchmarks Game.

It's hard to argue against C (or other procedural languages like Pascal, Modula-2, Ada) and assembly for embedded. There is a long history of success with those languages. Generally, you want to remove the risk of the unknown. Trying to use anything other than C or assembly, in my opinion, is an unknown. Having said that, there's nothing wrong with a mixed model where you use one of the Schemes that compile to C, or Python or Lua or JavaScript as a scripting language.
What you want is the ability to quickly and easily go to C when you have to.
If you convince the team to go with something that is unproven to them, the project is your cookie. If it crumbles, it'll likely be seen as your fault.

This article (by Michael Barr) talks about the use of C, C++, assembler and other languages in embedded systems, and includes a graph showing the relative usage of each.
And here's another article, fittingly entitled, Poor reasons for rejecting C++.

Ada is a high-level programming language that was designed for embedded systems and mission critical systems.
It is a fast, secure language that has data checking built in everywhere. It is what many aircraft autopilots are programmed in.
At this link you have a comparison between Ada and C.

There are situations where you need real-time performance, especially in embedded systems. You also have severe memory constraints. A language like C gives you greater control over execution time and execution space.
So, depending on what you are doing, C may very well be "better" or more appropriate.
Check out the following articles
http://theunixgeek.blogspot.com/2008/09/c-vs-python-speed.html
http://wiki.python.org/moin/PythonSpeed/PerformanceTips (especially see Python is not C section)
http://scienceblogs.com/goodmath/2006/11/the_c_is_efficient_language_fa.php

C is ubiquitous, available for almost any architecture, usually from day-one of a processor's availability. C++ is a close second. If your system can support C++ and you have the necessary expertise, use it in preference to C - it is all that C is, and more, so there are few reasons for not using it.
C++ is a larger language, and it supports constructs and techniques that may consume resources or behave in unacceptable ways in an embedded system. That is not a reason to avoid the language, only a reason to use it appropriately.
Java and C# (on Micro.Net or WinCE) may be viable alternatives for non-real-time.

You may want to look at the D programming language. It could use some performance tuning, as there are some areas where Python can outperform it. I can't really point you to benchmarking comparisons since I haven't been keeping a list, but as pointed out by Peter Olsson, Benchmarks & Language Implementations has D Digital Mars.
You will probably also want to look at these lovely questions:
Getting Embedded with D (the programming language)
How would you approach using D in a embedded real-time environment?

I'm not really a systems/embedded programmer, but it seems to me that embedded programs generally need deterministic performance - that immediately rules out many garbage collected languages, because they are not deterministic in general. However, there has been work on deterministic garbage collection (for example, Metronome for Java: http://www.ibm.com/developerworks/java/library/j-rtj4/index.html)
The issue is one of constraints: do the languages/runtimes meet the requirements for determinism, memory usage, and so on?
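For comparison, this is roughly what deterministic allocation tends to look like in C on an embedded target: a fixed pool of equally sized blocks, where allocating and freeing are each a couple of pointer operations with no searching and no collector. This is only a sketch. The names (`pool_init`, `pool_alloc`, `pool_free`) and the sizes are assumptions for the example, and it is not thread-safe or interrupt-safe as written.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 8   /* number of blocks, chosen arbitrarily */
#define BLOCK_SIZE  32  /* bytes per block, chosen arbitrarily */

/* A free block reuses its own storage to hold the free-list link. */
typedef union block {
    union block *next;
    unsigned char payload[BLOCK_SIZE];
} block_t;

static block_t pool[POOL_BLOCKS];
static block_t *free_list;

/* Thread every block onto the free list. Call once at startup. */
void pool_init(void)
{
    for (size_t i = 0; i + 1 < POOL_BLOCKS; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_BLOCKS - 1].next = NULL;
    free_list = &pool[0];
}

/* Constant-time allocation: pop the head of the free list.
   Returns NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Constant-time release: push the block back onto the free list.
   p must be a pointer previously returned by pool_alloc. */
void pool_free(void *p)
{
    block_t *b = p;
    assert(b >= &pool[0] && b <= &pool[POOL_BLOCKS - 1]);
    b->next = free_list;
    free_list = b;
}
```

The worst-case time of each call is fixed and the total memory is known at link time. That is the kind of guarantee a general-purpose garbage collector usually cannot give.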

C really is your best choice.
There is a difference between writing portable C code and getting too deep into the gee-whiz features of a specific compiler or corner cases of the language (all of which should be avoided). Stick to portable C and you get portability across compilers and compiler versions, and more employees will be capable of developing or maintaining the code. The compilers will also have an easier time with it and produce better, cleaner, and more reliable code.
C is not going anywhere, with all the new languages being designed to fix the flaws in all the prior languages. C, with all the flaws these new languages are trying to fix, still stands strong.

Here are a couple articles that compare C# to C++ :
http://systematicgaming.wordpress.com/2009/01/03/performance-c-vs-c/
http://journal.stuffwithstuff.com/2009/01/03/debunking-c-vs-c-performance/
Not exactly what you asked for as it doesn't have a focus on embedded C programming. But it's interesting nonetheless. The first one demonstrates the performance of C++ and the benefits of using "unsafe" code for processor intensive tasks. The second one somewhat debunks the first one and shows that if you write the C# code a little differently then the performance is almost the same.
So I will say that C or C++ can be the clear winner in terms of performance in many cases. But often times the margin is slim. Whether to use C or not is another topic altogether. In my opinion it really should depend on the task at hand. But in embedded systems you often don't have much of a choice.

A couple of people have mentioned Lua. People I know who have worked with embedded systems have said Lua is useful, but it's not really its own language per se; it is more of a library that can be embedded in C. It is targeted towards use in embedded systems, and generally you'll want to call Lua code from C. But pure C makes for simpler (though not necessarily easier) maintenance, since everyone knows it.

Depending on the embedded platform, if memory constraints are an issue, you'll most likely need to use a non-garbage collected programming language.
C in this respect is likely the most well-known by the team and the most widely supported with available libraries and tools.

The truth is - not always.
It seems the .NET runtime (though any other runtime could be taken as an example) imposes several MB of runtime overhead. If that is all the RAM you have, then you are out of luck. JavaME seems to be more compact, but it still all depends on the resources you have at your disposal.

C compiles much faster than C++, even on desktop systems, because C has far fewer language features, so I'd imagine the difference is non-trivial on embedded systems. This translates to faster iteration times, although OTOH you don't have the conveniences of C++ (such as collections), which may slow you down in the long run.
