Does modeling digital circuits in C have any practical benefits as opposed using the language's standard operations?

Does modeling digital circuits in C have any practical benefits as opposed using the language's standard operations? - c

So I've start looking into digital circuit designs and enlightened to find that almost every operation (that I'm aware of), all derive from 3 logical operations: AND, OR, and NOT. As an analogy, these are sort of like subatomic particles to atoms that make up everything else. Subatomic particles are to logic gate as atoms are to processor instructions. If programming in Assembly is like putting atoms together, then programming in C is like putting molecules (and atoms) together. Someone PLEASE tell me if I'm off base here.
With that said, I'm aware that GCC and most other compilers do a pretty good job optimizing from C to machine code. Lets assume we are looking at an x386 instruction set. If I built an 32-bit full-adder using only $$, ||, and ~, is the compiler smart enough to use existing instructions provided by the processor, or will my full-adder end up being a more bloated, less efficient versions of what's already on the processor.
Disclaimer:
I started looking into digital circuits in an attempt to start learning assembly, and I'm fair in C. I want to model some of these circuits in C to further my understand of the digital circuits because those are in terms I understand. But I don't want to lure myself into the illusion that it will also be efficient code (or any other practical benefits other than learning) when using a simple + will do. And yes, I'm aware that the horrible maintainability of the code would far outweigh any benefit's of this coding "style" may provide.

A software simulation of a full adder will never be anywhere near as efficient than a full adder built out of logic gates of the same technology as the silicon which runs the software simulation. The number of gates involved in running the simulation will far outstrip the gate count in the hardware adder, and the propagation delays will be significantly longer, especially if you include the I/O processing needed to make the simulation function as a real adder which accepts and produces electronic signals, which means that the code has to read and write actual CPU or peripheral I/O pins.
The above is true even if a crack team of the world's best x86 assembly language coders get together in a conference to design the ideal piece of code to implement a full adder in machine language; in other words, it doesn't reflect the inability of compilers to optimize C sufficiently well.
A software simulation of a logic circuit on a some given computer, however, can be more efficient than a circuit built with some different technology from that computer: specifically, older technology. For instance, a program running on a modern, fast microcontroller chip with integrated I/O can likely express a faster adder on its GPIO pins, than something cobbed together out of discrete, through-hole transistors, and it will almost certainly take up less space, and possibly require less current.

I've always been of mind that building digital circuits is no different than programming in assembly. I've said before that electronics are like physical opcodes.
To you question of C smart-compiling a logic adder to an ADD instruction.. No, it will not. There are a few exceptions in embedded development such as AVR and avr-gcc with the right optimization flags will turn REGISTER|=1<<bit or REGISTER&=~(1<<bit) and turn those into bit set and clear instructions instead of doing a verbatim Load, Logic, Store.
Beyond assemble/opcodes I can't really think of an analogy for higher-level languages to electronics.

Although you may use C to describe a digital circuit (in fact, the Verilog HDL has some resembling to C), there's one fundamental thing that you cannot model in plain C: parallelism.
If there's anything that belong to digital circuit description is its inherent parallelism in the description itself: functions (modules actually), and even code blocks "execute" in parallel.
Any ordinary C program describes a sequence of operations over time, and therefore, any attempt to describe a digital operation, unless it is a very trivial one, will be compiled as a sequence of steps, not exploiting the parallelism that digital circuits have by nature.
That said, there does exist some HDL (hardware description languages) that are fair close to C. One of them is Handel-C. Handel-C uses syntax borrowed from C, plus some additions to better handle the inherent parallelism present in digital design.
For example: imagine you have to exchange the value of two variables. The classical solution (besides solutions based in bitwise operations and the like) is:
temp = a;
a = b;
b = temp;
However, when someone is learning computer programming, it's a common mistake to code the above sequence as this:
a = b;
b = a;
Because we think in a variable interchange as a parallel operation: "the value of b is copied to a, meanwhile the value of a is copied to b".
The funny thing about this approach is that it actually works... if we manage to execute these two assignments in parallel. Something that is not possible in plain C, but it is in Handel-C:
par
{
a = b;
b = a;
}
The par statement indicates that each code line is to be "executed" in parallel with respect to the others.
In Verilog, the same interchange would be written as this:
a <= b;
b <= a;
<= is the nonblocking assignment in Verilog: the second line is not "executed" after the first one is finished, but both start at the same time. This sequence is normally found inside a clocked always (sort of a loop that is "executed" every time the clock signal in the sensitivity list changes from 0 to 1 -posedge- or from 1 to 0 -negedge- ).
always #(posedge clk) begin
a <= b;
b <= a;
end
This means: everytime the clock goes from 0 to 1, interchange the values between a and b.
Note that I always quote "executed" when I speak about languages for digital design. The code doesn't actually translate into a sequence of operations to be executed by a processor, but the code IS a circuit. Think of it as a 1D rendering of a 2D schematic, with sentences and operators instead of electronic symbols, and assignments, arguments and "function calls" instead of wires.
If you are familiarized with digital circuits, you will realize that the "always" loop look-alike, is actually translated into this:
Which is something you couldn't do with just translating the same high level description into assembly (unless the ISA of the target processor has some sort of XCHG instruction, which is actually not uncommon, and the code keeps the two variables to be interchanged into CPU registers).

Related

Why using Low-level-Languages or close to it ( C ) for embedded system and not a high level language, when all will be compiled to machine code?

I have searched but I couldn't find a clear answer. If we are compiling the code in a computer(powerful) then we are only sending a machine instruction to the memory in the embedded device. This, for my understandings, will make no difference if we use any sort of language because, in the end, we will be sending only a machine code to the embedded device, the code compilation which is the expensive phase is already done by a powerful machine!
Why using language like C ? Why not Java? we are sending a machine code at the end.

The answer partly lies in the runtime requirements and platform-provided expectations of a language: The size of the runtime for C is minimal - it needs a stack and that is about it to be able to start running code. For a compliant implementation static data initialisation is required, but you can run code without it - the initialisation itself could even be written in C, and even heap and standard library initialisation are optional, as is the presence of a library at all. It need have no OS dependencies, no interpreter and no virtual machine.
Most other languages require a great deal more runtime support and this is usually provided by an OS, runtime-library, or virtual machine. To operate "stand-alone" these languages would require that support to be "built-in" and would consequently be much larger - so much so that you may as well in many cases deploy a system with an OS and/or JVM for example in any case.
There are of course other reasons why particular languages are suited to embedded systems, such as hardware level access, performance and deterministic behaviour.
While the issue of a runtime environment and/or OS is a primary reason you do not often see higher-level languages in small embedded systems, it is by no means unheard of. The .Net Micro Framework for example allows C# to be used in embedded systems, and there are a number of embedded JVM implementations, and of course Linux distributions are widely embedded making language choice virtually unlimited. .Net Micro runs on a limited number of processor architectures, and requires a reasonably large memory (>256kb), and JVM implementations probably have similar requirements. Linux will not boot on less than about 16Mb ROM/4Mb RAM. Neither are particularly suited to hard real-time applications with deadlines in the microsecond domain.
C is more-or-less ubiquitous across 8, 16, 32 and 64 bit platforms and normally available for any architecture from day one, while support for other languages (other than perhaps C++ on 32 bit platforms at least) may be variable and patchy, and perhaps only available on more mature or widely used platforms.
From a developer point of view, one important consideration is also the availability of cross-compilation tools for the target platform and language. It is therefore a virtuous circle where developers choose C (or increasingly also C++) because that is the most widely available tool, and tool/chip vendors provide C and C++ tool-chains because that is what developers demand. Add to that the third-party support in the form of libraries, open-source code, debuggers, RTOS etc., and it would be a brave (or foolish) developer to select a language with barely any support. It is not just high level languages that suffer in this way. I once worked on a project programmed in Forth - a language even lower-level than C - it was a lonely experience, and while there were the enthusiastic advocates of the language, they were frankly a bit nuts favouring language evangelism over commercial success. C has in short reached critical mass acceptance and is hard to dislodge. C++ benefits from broad interoperability with C and similarly minimal runtime requirements, and by tool-chains that normally support both languages. So the only barrier to adoption of C++ is largely developer inertia, and to some extent availability on 8 and 16 bit platforms.

You're misunderstanding things a bit. Let's start by explaining the foundation of how computers work internally. I'll use simple and practical concepts here. For the underlying theories, read about Turing machines. So, what's your machine made up of? All computers have two basic components: a processor and a memory.
The memory is a sequential group of "cells" that works sort of like a table. If you "write" a value into the Nth cell, you can then retrieve that same value by "reading" from the Nth cell. This allows computers to "remember" things. If a computer is to perform a calculation, it needs to retrieve input data for it from somewhere, and to output data from it into somewhere. That place is the memory. In practice, the memory is what we call RAM, short for random access memory.
Then we have the processor. Its job is to perform the actual calculations on memory. The actual operations that are to be performed are mandated by a program, that is, a series of instructions that the processor is able to understand and execute. The processor decodes and executes an instruction, then the next one, and so on until the program halts (stops) the machine. If the program is add cell #1 and cell #2 and store result in cell #3, the processor will grab the values at cells 1 and 2, add their values together, and store the result into cell 3.
Now, there's some sort of an intrinsic question. Where is the program stored, if at all? First of all, a program can't be hardcoded into the wires. Otherwise, the system is not more of a computer than your microwave. To these problems are two distinct approaches/solutions: the Harvard architecture and the Von Neumann Architecture.
Basically, in the Harvard architecture, the data (as always has been) is stored in the memory. The code (or program) is stored somewhere else, usually in read-only memory. In the Von Neumann architecture, code is stored in memory, and is just another form of data. As a result, code is data, and data is code. It's worth noting that most modern systems use the Von Neumann architecture for several reasons, including the fact that this is the only way to implement just-in-time compilation, an essential part of runtime systems for modern bytecode-based programming languages, such as Java.
We now know what the machine does, and how it does that. However, how are both data and code stored? What's the "underlying format", and how shall it be interpreted? You've probably heard of this thing called the binary numeral system. In our usual decimal numeral system, we have ten digits, zero through nine. However, why exactly ten digits? Couldn't they be eight, or sixteen, or sixty, or even two? Be aware that it's impossible to create an unary based computational system.
Have you heard that computers are "logical and cold". Both of them are true... unless your machine has an AMD processor or a special kind of Pentium. The theory states that every logical predicate can be reduced to either "true" or "false". That is to say that "treu" and "false" are the basis of logic. Plus, computers are made up of electrical cruft, no? A light switch is either on or off, no? So, at the electrical level we can easily recognize two voltage levels, right? And we want to handle logic stuff, such as numbers, in computers, right? So zero and one may be, as the only feasible solution they are.
Now, taking all the theory into account, let's talk about programming languages and assembly languages. Assembly languages are a way to express binary instructions in a (supposedly) readable way to human programmers. For instance, something like this...
ADD 0, 1 # Add cells 0 and 1 together and store the result in cell 0
Could be translated by an assembler into something like...
110101110000000000000001
Both are equivalent, but humans will only understand the former, and processors will only understand the later.
A compiler is a program that translates input data that is expected to conform to the rules of a given programming language into another, usually lower-level form. For instance, a C compiler may take this code...
x = some_function(y + z);
And translate it into assembly code such as (of course this is not real assembly, BTW!)...
# Assume x is at cell 1, y at cell 2, and z at cell 3.
# Assuem that, when calling a function, the first argument
# is at cell 16, and the result is stored in cell 0.
MOVE 16, 2
ADD 16, 3
CALL some_function
MOVE 1, 0
And the assembler will spit (this is not random)...
11101001000100000000001001101110000100000000001110111011101101111010101111101111110110100111010010000000100000000
Now, let's talk about another language, namely Java. Java's compiler does not give you assembly/raw binary code, but bytecode. Bytecode is... like a generic, higher-level form of assembly language that the CPU can't understand (there are exceptions), but another program that directly runs on the CPU does. This means that the lie that some badly educated people spread around, that "both interpreted and compiled programs ultimately boil down to machine code" is false. If, for example, the interpreter is written in C, and has this line of code...
Bytecode some_bytecode;
/* ... */
execute_bytecode(&some_bytecode);
(Note: I won't translate that into assembly/binary again!) The processor executes the interpreter, and the interpreter's code executes the bytecode, by performing the actions specified by the bytecode. Although, if not optimized correctly, this can severely degrade performance, this is not the problem per se, but the fact that things such as reflection, garbage collection, and exceptions can add quite some overhead. For embedded systems, whose memories are small and whose processors are slow, this is something you want. You're wasting precious system resources on things you don't need. If C programs are slow on your Arduino, image a full blown Java/Python program with all sorts of bells and whistles! Even if you translated bytecode into machine code before inserting it into the system, support must be there for all that extra stuff, and results in basically the same unwanted overhead/waste. You would still need support for reflection, exceptions, garbage collection, etc... It's basically the same thing.
On most other environments, this is not a big deal, as memory is cheap and abundant, and processors are fast and powerful. Embedded systems have special needs, they're special by themselves, and things are not free in that land.

Why using language like C ? why not Java ? we are sending a machine
code at the end.
No, Java code does not compile to machine code, it needs a virtual machine (the JVM) on the target system.
You're partly right about the compilation, however, but still "higher-level" languages can result in less efficient machine code. For instance, the language can include garbage collection, run-time correctness checks, can't use all the "native" numeric types, etc.

In general it depends on the target. On small targets (i.e. microcontrollers like AVR) you don't have that complex programs running. Additionally, you need to access the hardware directly (f.e. a UART). High level languages like Java don't support accessing the hardware directly, so you usually end up with C.
In the case of C versus Java there's a major difference:
With C you compile the code and get a binary that runs on the target. It directly runs on the target.
Java instead creates Java Bytecode. The target CPU cannot process that. Instead it requires running another program: the Java runtime environment. That translates the Java Bytecode to actual machine code. Obviously this is more work and thus requires more processing power. While this isn't much of a concern for standard PCs it is for small embedded devices. (Note: some CPUs do actually have support for running Java bytecode directly. Those are exceptions though.)
Generally speaking, the compile step isn't the issue -- the limited resources and special requirements of the target device are.

you misunderstand something , 'compiling' java gives a different output then compiling a low level language , it is true that both are machine codes , but in c case the machine code is directly executable by the processor , whereas with java the output will be in an intermediate stage , a bytecode , and it can't be executed by the processor , it needs some extra work , a translation to a machine code , that is the only directly executable format , while that takes a extra time , c will be an attractive choice , because of its speed , with low level language you write you code then you compile to a target machine ( you need to specify the target to the compiler since each processor have his own machine code ) , then your code is understandable by the processor .
in the other hand c allows direct hardware access , that is not allowed in java-like languages even via an api

It's an industry thing.
There are three kinds of high level languages. Interpreted (lua, python, javascript), compiled to bytecode (java, c#), and compiled to machinne code (c, c++, fortran, cobol, pascal)
Yes, C is a high level language, and closer to java than to assembly.
High level languages are popular for two reasons.
Memory management, and a wide standard library.
Managed memory comes with a cost,
somebody must manage it. That's an issue not only for java and c#, where somebody must implement a VM, but also to baremetal c/c++ where someone must implement the memory allocation functions.
A wide standard library can't be supported by all targets because there aren't enough resources. ie, avr arduino doesn't support the full c++ standard library.
C gained popularity, because it can easily be converted to equivalent assembly code. Most statements can be converted, without optimization, to a bunch of fixed assembly instructions, so compilers are easy to program. And its standard is compact and easy to implement. C prevailed because it became the defacto standard for the lowest high level language of any arch.
So in the end, besides special snowflakes like cython, go, rust, haskell etc, industry decided that machinne code is compiled from C, C++ and most optimization efforts went that way
Languages, like java, decided to hide memory from the progarammer, so good luck trying to interface with low level stuff there. As by design they do that, almost nobody bothers trying to bring them to compete with C. Realistically, java without GC would be C++ with different syntax.
Finally, if all the industry money goes to one language, the cheapest/easyest thing to do is choosig that language.

You are right in that you can use any language that generates machine code. But JAVA is not one of them. JAVA, Python and even some languages that compile to machine code may have heavy system requirements. You could and some folks use Pascal, but C won the C vs Pascal war many years ago. There are some other languages that fell by the wayside that if you had a compiler for you could use. there are some new languages you can use, but the tools are not as mature and not as many targets as one would like. But it is very unlikely that they will unseat C. C is just the right amount of power/freedom, low enough and high enough.

Java is an interpreted language and (like all interpreted languages) produces an intermediate code that is not directly executable by the processor. So what you send to the embedded device would be the Bytecode and you should have a JVM running on it and interpreting your code. Clearly not feasible. For what concern the compiled languages (C, C++...) you are right to say that at the end you send machine code to the device. However consider that using high level features of a language will produce much more machine code that you would expect. If you use polymorphism for example, you have just a function call, but when you compile the machine code explodes. Consider also that very often the use of dynamic memory (malloc, new...) is not feasible on an embedded device.

How to write ISO-C compliant code while allowing multiple instructions between sequencing points?

I have a subtle question; I would like to write code that is portable (that's why I am sticking to any of the last three ISO-C standard definitions) and machine-independent (thus, assembler is out of the question), but that let the compiler pack several (independent) instructions within one CPU cycle.
I thought that using the comma operator would do the trick, but the standard says that each coma is a sequencing point, so it would not do.
I would like to take advantage of multiple independent assignments, additions, etc. (just as a register variable is an indication to the compiler of possible optimizations and of independentness of the operations).
Does anyone have any idea?

Let the compiler do optimizations.
The compiler can optimize across sequence points when it recognizes that they are independent and without interactions.
For example, in code:
a = x+y;
b = y+z;
A compiler can recognize that the assignment of a and b are fully independent of each other, and can do both at the same time, despite the sequence point.
As a general rule, you cannot do a better job than the compiler.
Let the compiler do its job of creating fast, efficient code, and you should focus on your job:writing clear, unambiguous instructions for bug-free algorithms.

The compiler generates code. The processor executes it. It is up to the processor to perform more than one instruction per cycle, and modern processors are quite good at this. If operations are independent, the processor will figure it out.
The processor will also rearrange instructions, and often perform multiple instructions that are nowhere near together in your source code. There is nothing that you can do to help in source code.

Your question is deeply misguided, as other answerers have pointed out. (Compilers usually reorder things and do all sorts of horrible stuff even when things are separated by sequence points; conversely, doing two things that can "interfere" with one another that aren't separated by a sequence point is undefined behaviour.) However, you can do what you're asking in a bit of a silly way.
The evaluation of different arguments to a function call are not sequenced with respect to one another, so you can make up a dummy function like this:
void dont_sequence(int, int) {}
and use it like this:
dont_sequence(i += 2, j += 4);
Again, I don't believe there is any purpose to this. This won't help any compiler I've ever used. The compiler doesn't have to follow your instructions; it's only required to generate code that behaves as if it followed your instructions, and that's what modern compilers do.

TL;DR there is no such trick available. Choose different language
1. C language was designed to be portable and machine-independent but does not have language constructs to clearly express data flow independence or other hints that might be utilized by compilers when targeting processors with different parallel granularity (see e.g. article Threading and Parallel Programming Constructs used in multicore systems development: Part 2 for discussion of such constructs).
2. Abstract machine-independent compiler target (which reflects current processor architectures) was described by computer scientist Donald Knuth as MMIX. There is also GCC compiler available that can target this processor. So you might check your C code against this output
3. For more detail explanation of how compilers and processors derive their hints (you call them sequencing points) see e.g. book Processor Architecture: From Dataflow to Superscalar and Beyond ; with 34 Tables - Jurij Silc, Borut Robic, Theo Ungerer
3. For list of portable and machine-independent languages that explicitly support paralelism see e.g. Wikipedia: List of concurrent and parallel programming languages
4. For some discussion about how to use C for parallel programming see e.g. Which is the best parallel programming language for initiating undergraduate students in the world of multicore/parallel computing?

Hardware/Software Implementation

What does it mean to say that a function (e.g. modular multiplication,sine) is implemented in hardware as opposed to software?

Implemented in hardware means the electrical circuit (through logical gates and so) can perform the operation.
For example, in the ALU the processor is physically able to add one byte to another.
Implemented in software are operations that usually are very complex combinations of basic implemented in hardware functions.

Typically, "software" is a list of instructions from a small set of precise formal instructions supported by the hardware in question. The hardware (the cpu) runs in an infinite loop executing your instruction stream stored in "memory".
When we talk about a software implementation of an algorithm, we mean that we achieve the final answer by having the CPU carry out some set of these instructions in the order put together by an outside programmer.
When we talk about a hardware implementation, we mean that the final answer is carried out with intermediate steps that don't come from a formal (inefficient) software stream coded up by a programmer but instead is carried out with intermediate steps that are not exposed to the outside world. Hardware implementations thus are likely to be faster because (a) they can be very particular to the algorithm being implemented, with no need to reach well-defined states that the outside would will see, and (b) don't have to sync up with the outside world.
Note, that I am calling things like sine(x), "algorithms".
To be more specific about efficiency, the software instructions, being a part of a formal interface, have predefined start/stop points as they await for the next clock cycle. These sync points are needed to some extent to allow other software instructions and other hardware to cleanly and unambiguously access these well defined calculations. In contrast, a hardware implementation is more likely to have a larger amount of its internal implementations be asynchronous, meaning that they run to completion instead of stopping at many intermediate points to await a clock tick.
For example, most processors have an instruction that carries out an integer addition. The entire process of calculating the final bit positions is likely done asynchronously. The stop/sync point occurs only after the added result is achieved. In turn, a more complex algorithm than "add", and which is done in software that contains many such additions, necessarily is partly carried out asynchronously (eg, in between each addition) but with many sync points (after each addition, jump, test, etc, result is known). If that more complex algorithm were done entirely in hardware, it's possible it would run to completion from beginning to end entirely independent of the timing clock. No outside program instructions would be consulted during the hardware calculation of that algorithm.

It means that the logic behind it is in the hardware (ie, using gates AND/OR/XOR, etc) rather than a software recreation of said hardware logic.

Hardware implementation means typically that a circuit was created to perform the refered operation. There is no need for a CPU nor virtual calculations. You can literally see the algorithm being performed through the lines and architecture of the circuit itself.

Why is C so fast, and why aren't other languages as fast or faster? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
In listening to the Stack Overflow podcast, the jab keeps coming up that "real programmers" write in C, and that C is so much faster because it's "close to the machine." Leaving the former assertion for another post, what is special about C that allows it to be faster than other languages?
Or put another way: what's to stop other languages from being able to compile down to binary that runs every bit as fast as C?

There isn't much that's special about C. That's one of the reasons why it's fast.
Newer languages which have support for garbage collection, dynamic typing and other facilities which make it easier for the programmer to write programs.
The catch is, there is additional processing overhead which will degrade the performance of the application. C doesn't have any of that, which means that there is no overhead, but that means that the programmer needs to be able to allocate memory and free them to prevent memory leaks, and must deal with static typing of variables.
That said, many languages and platforms, such as Java (with its Java Virtual Machine) and .NET (with its Common Language Runtime) have improved performance over the years with advents such as just-in-time compilation which produces native machine code from bytecode to achieve higher performance.

There is a trade-off the C designers have made. That's to say, they made the decision to put speed above safety. C won't
Check array index bounds
Check for uninitialized variable values
Check for memory leaks
Check for null pointer dereference
When you index into an array, in Java it takes some method call in the virtual machine, bound checking and other sanity checks. That is valid and absolutely fine, because it adds safety where it's due. But in C, even pretty trivial things are not put in safety. For example, C doesn't require memcpy to check whether the regions to copy overlap. It's not designed as a language to program a big business application.
But these design decisions are not bugs in the C language. They are by design, as it allows compilers and library writers to get every bit of performance out of the computer. Here is the spirit of C how the C Rationale document explains it:
C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the Committee did not want to force programmers into writing portably, to preclude the use of C as a ``high-level assembler'': the ability to write machine-specific code is one of the strengths of C.
Keep the spirit of C. The Committee kept as a major goal to preserve the traditional spirit of C. There are many facets of the spirit of C, but the essence is a community sentiment of the underlying principles upon which the C language is based. Some of the facets of the spirit of C can be summarized in phrases like
Trust the programmer.
Don't prevent the programmer from doing what needs to be done.
Keep the language small and simple.
Provide only one way to do an operation.
Make it fast, even if it is not guaranteed to be portable.
The last proverb needs a little explanation. The potential for efficient code generation is one of the most important strengths of C. To help ensure that no code explosion occurs for what appears to be a very simple operation, many operations are defined to be how the target machine's hardware does it rather than by a general abstract rule. An example of this willingness to live with what the machine does can be seen in the rules that govern the widening of char objects for use in expressions: whether the values of char objects widen to signed or unsigned quantities typically depends on which byte operation is more efficient on the target machine.

If you spend a month to build something in C that runs in 0.05 seconds, and I spend a day writing the same thing in Java, and it runs in 0.10 seconds, then is C really faster?
But to answer your question, well-written C code will generally run faster than well-written code in other languages because part of writing C code "well" includes doing manual optimizations at a near-machine level.
Although compilers are very clever indeed, they are not yet able to creatively come up with code that competes with hand-massaged algorithms (assuming the "hands" belong to a good C programmer).
Edit:
A lot of comments are along the lines of "I write in C and I don't think about optimizations."
But to take a specific example from this post:
In Delphi I could write this:
function RemoveAllAFromB(a, b: string): string;
var
before, after :string;
begin
Result := b;
if 0 < Pos(a,b) then begin
before := Copy(b,1,Pos(a,b)-Length(a));
after := Copy(b,Pos(a,b)+Length(a),Length(b));
Result := before + after;
Result := RemoveAllAFromB(a,Result); //recursive
end;
end;
and in C I write this:
char *s1, *s2, *result; /* original strings and the result string */
int len1, len2; /* lengths of the strings */
for (i = 0; i < len1; i++) {
for (j = 0; j < len2; j++) {
if (s1[i] == s2[j]) {
break;
}
}
if (j == len2) { /* s1[i] is not found in s2 */
*result = s1[i];
result++; /* assuming your result array is long enough */
}
}
But how many optimizations are there in the C version? We make lots of decisions about implementation that I don't think about in the Delphi version. How is a string implemented? In Delphi I don't see it. In C, I've decided it will be a pointer to an array of ASCII integers, which we call chars. In C, we test for character existence one at a time. In Delphi, I use Pos.
And this is just a small example. In a large program, a C programmer has to make these kinds of low-level decisions with every few lines of code. It adds up to a hand-crafted, hand-optimized executable.

I didn't see it already, so I'll say it: C tends to be faster because almost everything else is written in C.
Java is built on C, Python is built on C (or Java, or .NET, etc.), Perl is, etc. The OS is written in C, the virtual machines are written in C, the compilers are written in C, the interpreters are written in C. Some things are still written in Assembly language, which tends to be even faster. More and more things are being written in something else, which is itself written in C.
Each statement that you write in other languages (not Assembly) is typically implemented underneath as several statements in C, which are compiled down to native machine code. Since those other languages tend to exist in order to obtain a higher level of abstraction than C, those extra statements required in C tend to be focused on adding safety, adding complexity, and providing error handling. Those are often good things, but they have a cost, and its names are speed and size.
Personally, I have written in literally dozens of languages spanning most of the available spectrum, and I personally have sought the magic that you hint at:
How can I have my cake and eat it, too? How can I play with high-level abstractions in my favorite language, then drop down to the nitty gritty of C for speed?
After a couple of years of research, my answer is Python (on C). You might want to give it a look. By the way, you can also drop down to Assembly from Python, too (with some minor help from a special library).
On the other hand, bad code can be written in any language. Therefore, C (or Assembly) code is not automatically faster. Likewise, some optimization tricks can bring portions of higher-level language code close to the performance level of raw C. But, for most applications, your program spends most of its time waiting on people or hardware, so the difference really does not matter.
Enjoy.

There are a lot of questions in there - mostly ones I am not qualified to answer. But for this last one:
what's to stop other languages from being able to compile down to binary that runs every bit as fast as C?
In a word, abstraction.
C is only one or two levels of abstraction away from machine language. Java and the .NET languages are at a minimum three levels of abstraction away from assembler. I'm not sure about Python and Ruby.
Typically, the more programmer toys (complex data types, etc.), the further you are from machine language and the more translation has to be done.
I'm off here and there, but that's the basic gist.
There are some good comments on this post with more details.

It is not so much that C is fast as that C's cost model is transparent. If a C program is slow, it is slow in an obvious way: by executing a lot of statements. Compared with the cost of operations in C, high-level operations on objects (especially reflection) or strings can have costs that are not obvious.
Two languages that generally compile to binaries which are just as fast as C are Standard ML (using the MLton compiler) and Objective Caml. If you check out the benchmarks game you'll find that for some benchmarks, like binary trees, the OCaml version is faster than C. (I didn't find any MLton entries.) But don't take the shootout too seriously; it is, as it says, a game, the the results often reflect how much effort people have put in tuning the code.

C is not always faster.
C is slower than, for example, Modern Fortran.
C is often slower than Java for some things (especially after the JIT compiler has had a go at your code).
C lets pointer aliasing happen, which means some good optimizations are not possible. Particularly when you have multiple execution units, this causes data fetch stalls. Ow.
The assumption that pointer arithmetic works really causes slow bloated performance on some CPU families (PIC particularly!) It used to suck the big one on segmented x86.
Basically, when you get a vector unit, or a parallelizing compiler, C stinks and modern Fortran runs faster.
C programmer tricks, like thunking (modifying the executable on the fly), cause CPU prefetch stalls.
Do you get the drift?
And our good friend, the x86, executes an instruction set that these days bears little relationship to the actual CPU architecture. Shadow registers, load-store optimizers, all in the CPU. So C is then close to the virtual metal. The real metal, Intel don't let you see. (Historically VLIW CPU's were a bit of a bust so, maybe that's no so bad.)
If you program in C on a high-performance DSP (maybe a TI DSP?), the compiler has to do some tricky stuff to unroll the C across the multiple parallel execution units. So in that case, C isn't close to the metal, but it is close to the compiler, which will do whole program optimization. Weird.
And finally, some CPUs (www.ajile.com) run Java bytecodes in hardware. C would a PITA to use on that CPU.

what's to stop other languages from
being able to compile down to binary
that runs every bit as fast as C?
Nothing. Modern languages like Java or .NET languages are oriented more toward programmer productivity rather than performance. Hardware is cheap nowadays. Also compilation to intermediate representation gives a lot of bonuses such as security, portability, etc. The .NET CLR can take advantage of different hardware. For example, you don't need to manually optimize/recompile program to use the SSE instructions set.

I guess you forgot that Assembly language is also a language :)
But seriously, C programs are faster only when the programmer knows what he's doing. You can easily write a C program that runs slower than programs written in other languages that do the same job.
The reason why C is faster is because it is designed in this way. It lets you do a lot of "lower level" stuff that helps the compiler to optimize the code. Or, shall we say, you the programmer are responsible for optimizing the code. But it's often quite tricky and error prone.
Other languages, like others already mentioned, focus more on productivity of the programmer. It is commonly believed that programmer time is much more expensive than machine time (even in the old days). So it makes a lot of sense to minimize the time programmers spend on writing and debugging programs instead of the running time of the programs. To do that, you will sacrifice a bit on what you can do to make the program faster because a lot of things are automated.

The main factors are that it's a statically-typed language and that's compiled to machine code. Also, since it's a low-level language, it generally doesn't do anything you don't tell it to.
These are some other factors that come to mind.
Variables are not automatically initialized
No bounds checking on arrays
Unchecked pointer manipulation
No integer overflow checking
Statically-typed variables
Function calls are static (unless you use function pointers)
Compiler writers have had lots of time to improve the optimizing code. Also, people program in C for the purpose of getting the best performance, so there's pressure to optimize the code.
Parts of the language specification are implementation-defined, so compilers are free to do things in the most optimal way
Most static-typed languages could be compiled just as fast or faster than C though, especially if they can make assumptions that C can't because of pointer aliasing, etc.

C++ is faster on average (as it was initially, largely a superset of C, though there are some differences). However, for specific benchmarks, there is often another language which is faster.
From The Computer Language Benchmarks Game:
fannjuch-redux was fastest in Scala
n-body and fasta were faster in Ada.
spectral-norm was fastest in Fortran.
reverse-complement, mandelbrot and pidigits were fastest in ATS.
regex-dna was fastest in JavaScript.
chameneou-redux was fastest is Java 7.
thread-ring was fastest in Haskell.
The rest of the benchmarks were fastest in C or C++.

For the most part, every C instruction corresponds to a very few assembler instructions. You are essentially writing higher level machine code, so you have control over almost everything the processor does. Many other compiled languages, such as C++, have a lot of simple looking instructions that can turn into much more code than you think it does (virtual functions, copy constructors, etc..) And interpreted languages like Java or Ruby have another layer of instructions that you never see - the Virtual Machine or Interpreter.

I know plenty of people have said it in a long winded way, but:
C is faster because it does less (for you).

Many of these answers give valid reasons for why C is, or is not, faster (either in general or in specific scenarios). It's undeniable that:
Many other languages provide automatic features that we take for granted. Bounds checking, run-time type checking, and automatic memory management, for example, don't come for free. There is at least some cost associated with these features, which we may not think about—or even realize—while writing code that uses these features.
The step from source to machine is often not as direct in other languages as it is in C.
OTOH, to say that compiled C code executes faster than other code written in other languages is a generalization that isn't always true. Counter-examples are easy to find (or contrive).
All of this notwithstanding, there is something else I have noticed that, I think, affects the comparative performance of C vs. many other languages more greatly than any other factor. To wit:
Other languages often make it easier to write code that executes more slowly. Often, it's even encouraged by the design philosophies of the language. Corollary: a C programmer is more likely to write code that doesn't perform unnecessary operations.
As an example, consider a simple Windows program in which a single main window is created. A C version would populate a WNDCLASS[EX] structure which would be passed to RegisterClass[Ex], then call CreateWindow[Ex] and enter a message loop. Highly simplified and abbreviated code follows:
WNDCLASS wc;
MSG msg;
wc.style = 0;
wc.lpfnWndProc = &WndProc;
wc.cbClsExtra = 0;
wc.cbWndExtra = 0;
wc.hInstance = hInstance;
wc.hIcon = NULL;
wc.hCursor = LoadCursor(NULL, IDC_ARROW);
wc.hbrBackground = (HBRUSH)(COLOR_BTNFACE + 1);
wc.lpszMenuName = NULL;
wc.lpszClassName = "MainWndCls";
RegisterClass(&wc);
CreateWindow("MainWndCls", "", WS_OVERLAPPEDWINDOW | WS_VISIBLE,
CW_USEDEFAULT, 0, CW_USEDEFAULT, 0, NULL, NULL, hInstance, NULL);
while(GetMessage(&msg, NULL, 0, 0)){
TranslateMessage(&msg);
DispatchMessage(&msg);
}
An equivalent program in C# could be just one line of code:
Application.Run(new Form());
This one line of code provides all of the functionality that nearly 20 lines of C code did, and adds some things we left out, such as error checking. The richer, fuller library (compared to those used in a typical C project) did a lot of work for us, freeing our time to write many more snippets of code that look short to us but involve many steps behind the scenes.
But a rich library enabling easy and quick code bloat isn't really my point. My point is more apparent when you start examining what actually happens when our little one-liner actually executes. For fun sometime, enable .NET source access in Visual Studio 2008 or higher, and step into the simple one-linef above. One of the fun little gems you'll come across is this comment in the getter for Control.CreateParams:
// In a typical control this is accessed ten times to create and show a control.
// It is a net memory savings, then, to maintain a copy on control.
//
if (createParams == null) {
createParams = new CreateParams();
}
Ten times. The information roughly equivalent to the sum of what's stored in a WNDCLASSEX structure and what's passed to CreateWindowEx is retrieved from the Control class ten times before it's stored in a WNDCLASSEX structure and passed on to RegisterClassEx and CreateWindowEx.
All in all, the number of instructions executed to perform this very basic task is 2–3 orders of magnitude more in C# than in C. Part of this is due to the use of a feature-rich library, which is necessarily generalized, versus our simple C code which does exactly what we need and nothing more. But part of it is due to the fact that the modularized, object-oriented nature of .NET framework, lends itself to a lot of repetition of execution that often is avoided by a procedural approach.
I'm not trying to pick on C# or the .NET framework. Nor am I saying that modularization, generalization, library/language features, OOP, etc. are bad things. I used to do most of my development in C, later in C++, and most lately in C#. Similarly, before C, I used mostly assembly. And with each step "higher" my language goes, I write better, more maintainable, more robust programs in less time. They do, however, tend to execute a little more slowly.

I don't think anyone has mentioned the fact that much more effort has been put into C compilers than any other compiler, with perhaps the exception of Java.
C is extremely optimizable for many of the reasons already stated - more than almost any other language. So if the same amount of effort is put into other language compilers, C will probably still come out on top.
I think there is at least one candidate language that, with effort, could be optimized better than C and thus we could see implementations that produce faster binaries. I'm thinking of Digital Mars' D, because the creator took care to build a language that could potentially be better optimized than C. There may be other languages that have this possibility. However, I cannot imagine that any language will have compilers more than just a few percent faster than the best C compilers. I would love to be wrong.
I think the real "low hanging fruit" will be in languages that are designed to be easy for humans to optimize. A skilled programmer can make any language go faster, but sometimes you have to do ridiculous things or use unnatural constructs to make this happen. Although it will always take effort, a good language should produce relatively fast code without having to obsess over exactly how the program is written.
It's also important (at least to me) that the worst case code tends to be fast. There are numerous "proofs" on the web that Java is as fast or faster than C, but that is based on cherry picking examples.
I'm not big fan of C, but I know that anything I write in C is going to run well. With Java, it will "probably" run within 15% of the speed, usually within 25%, but in some cases it can be far worse. Any cases where it's just as fast or within a couple of percent are usually due to most of the time being spent in the library code which is heavily optimized C code anyway.

This is actually a bit of a perpetuated falsehood. While it is true that C programs are frequently faster, this is not always the case, especially if the C programmer isn't very good at it.
One big glaring hole that people tend to forget about is when the program has to block for some sort of I/O, such as user input in any GUI program. In these cases, it doesn't really matter what language you use since you are limited by the rate at which data can come in rather than how fast you can process it. In this case, it doesn't matter much if you are using C, Java, C# or even Perl; you just cannot go any faster than the data can come in.
The other major thing is that using garbage collection (GC) and not using proper pointers allows the virtual machine to make a number of optimizations not available in other languages. For instance, the JVM is capable of moving objects around on the heap to defragment it. This makes future allocations much faster since the next index can simply be used rather than looking it up in a table. Modern JVMs also don't have to actually deallocate memory; instead, they just move the live objects around when they GC and the spent memory from the dead objects is recovered essentially for free.
This also brings up an interesting point about C and even more so in C++. There is something of a design philosophy of "If you don't need it, you don't pay for it." The problem is that if you do want it, you end up paying through the nose for it. For instance, the vtable implementation in Java tends to be a lot better than C++ implementations, so virtual function calls are a lot faster. On the other hand, you have no choice but to use virtual functions in Java and they still cost something, but in programs that use a lot of virtual functions, the reduced cost adds up.

It's not so much about the language as the tools and libraries. The available libraries and compilers for C are much older than for newer languages. You might think this would make them slower, but au contraire.
These libraries were written at a time when processing power and memory were at a premium. They had to be written very efficiently in order to work at all. Developers of C compilers have also had a long time to work in all sorts of clever optimizations for different processors. C's maturity and wide adoption makes for a signficant advantage over other languages of the same age. It also gives C a speed advantage over newer tools that don't emphasize raw performance as much as C had to.

Amazing to see the old "C/C++ must be faster than Java because Java is interpreted" myth is still alive and kicking. There are articles going back a few years, as well as more recent ones, that explain with concepts or measurements why this simply isn't always the case.
Current virtual machine implementations (and not just the JVM, by the way) can take advantage of information gathered during program execution to dynamically tune the code as it runs, using a variety of techniques:
rendering frequent methods to machine code,
inlining small methods,
adjustment of locking
and a variety of other adjustments based on knowing what the code is actually doing, and on the actual characteristics of the environment in which it's running.

The lack of abstraction is what makes C faster. If you write an output statement you know exactly what is happening. If you write an output statement in Java it is getting compiled to a class file which then gets run on a virtual machine, introducing a layer of abstraction.
The lack of object-oriented features as a part of the language also increases its speed do to less code being generated. If you use C as an object-oriented language, then you are doing all the coding for things such as classes, inheritance, etc. This means rather than make something generalized enough for everyone with the amount of code and the performance penalty that requires you only write what you need to get the job done.

The fastest running code would be carefully handcrafted machine code. Assembler will be almost as good. Both are very low level and it takes a lot of writing code to do things. C is a little above assembler. You still have the ability to control things at a very low level in the actual machine, but there is enough abstraction, make writing it faster and easier then assembler.
Other languages, such as C# and Java, are even more abstract. While Assembler and machine code are called low-level languages, C# and JAVA (and many others) are called high-level languages. C is sometimes called a midlevel language.

Don't take someone’s word for it; look at the disassembly for both C and your language-of-choice in any performance critical part of your code. I think you can just look in the disassembly window at runtime in Visual Studio to see disassembled .NET code. It should be possible, if tricky, for Java using WinDbg, though if you do it with .NET, many of the issues would be the same.
I don't like to write in C if I don't need to, but I think many of the claims made in these answers that tout the speed of languages other than C can be put aside by simply disassembling the same routine in C and in your higher level language of choice, especially if lots of data is involved as is common in performance critical applications. Fortran may be an exception in its area of expertise; I don't know. Is it higher level than C?
The first time I did compare JITed code with native code resolved any and all questions whether .NET code could run comparably to C code. The extra level of abstraction and all the safety checks come with a significant cost. The same costs would probably apply to Java, but don't take my word for it; try it on something where performance is critical. (Does anyone know enough about JITed Java to locate a compiled procedure in memory? It should certainly be possible.)

Setting aside advanced optimization techniques such as hot-spot optimization, pre-compiled meta-algorithms, and various forms of parallelism, the fundamental speed of a language correlates strongly with the implicit behind-the-scenes complexity required to support the operations that would commonly be specified within inner loops.
Perhaps the most obvious is validity checking on indirect memory references—such as checking pointers for null and checking indexes against array boundaries. Most high-level languages perform these checks implicitly, but C does not. However, this is not necessarily a fundamental limitation of these other languages—a sufficiently clever compiler may be capable of removing these checks from the inner loops of an algorithm through some form of loop-invariant code motion.
The more fundamental advantage of C (and to a similar extent the closely related C++) is a heavy reliance on stack-based memory allocation, which is inherently fast for allocation, deallocation, and access. In C (and C++) the primary call stack can be used for allocation of primitives, arrays, and aggregates (struct/class).
While C does offer the capability to dynamically allocate memory of arbitrary size and lifetime (using the so called 'heap'), doing so is avoided by default (the stack is used instead).
Tantalizingly, it is sometimes possible to replicate the C memory allocation strategy within the runtime environments of other programming languages. This has been demonstrated by asm.js, which allows code written in C or C++ to be translated into a subset of JavaScript and run safely in a web browser environment—with near-native speed.
As somewhat of an aside, another area where C and C++ outshine most other languages for speed is the ability to seamlessly integrate with native machine instruction sets. A notable example of this is the (compiler and platform dependent) availability of SIMD intrinsics which support the construction of custom algorithms that take advantage of the now nearly ubiquitous parallel processing hardware—while still utilizing the data allocation abstractions provided by the language (lower-level register allocation is managed by the compiler).

1) As others have said, C does less for you. No initializing variables, no array bounds checking, no memory management, etc. Those features in other languages cost memory and CPU cycles that C doesn't spend.
2) Answers saying that C is less abstracted and therefore faster are only half correct I think. Technically speaking, if you had a "sufficiently advanced compiler" for language X, then language X could approach or equal the speed of C. The difference with C is that since it maps so obviously (if you've taken an architecture course) and directly to assembly language that even a naive compiler can do a decent job. For something like Python, you need a very advanced compiler to predict the probable types of objects and generate machine code on the fly -- C's semantics are simple enough that a simple compiler can do well.

Back in the good ole days, there were just two types of languages: compiled and interpreted.
Compiled languages utilized a "compiler" to read the language syntax and convert it into identical assembly language code, which could than just directly on the CPU. Interpreted languages used a couple of different schemes, but essentially the language syntax was converted into an intermediate form, and then run in a "interpreter", an environment for executing the code.
Thus, in a sense, there was another "layer" -- the interpreter -- between the code and the machine. And, as always the case in a computer, more means more resources get used. Interpreters were slower, because they had to perform more operations.
More recently, we've seen more hybrid languages like Java, that employ both a compiler and an interpreter to make them work. It's complicated, but a JVM is faster, more sophisticated and way more optimized than the old interpreters, so it stands a much better change of performing (over time) closer to just straight compiled code. Of course, the newer compilers also have more fancy optimizing tricks so they tend to generate way better code than they used to as well. But most optimizations, most often (although not always) make some type of trade-off such that they are not always faster in all circumstances. Like everything else, nothing comes for free, so the optimizers must get their boast from somewhere (although often times it using compile-time CPU to save runtime CPU).
Getting back to C, it is a simple language, that can be compiled into fairly optimized assembly and then run directly on the target machine. In C, if you increment an integer, it's more than likely that it is only one assembler step in the CPU, in Java however, it could end up being a lot more than that (and could include a bit of garbage collection as well :-) C offers you an abstraction that is way closer to the machine (assembler is the closest), but you end up having to do way more work to get it going and it is not as protected, easy to use or error friendly. Most other languages give you a higher abstraction and take care of more of the underlying details for you, but in exchange for their advanced functionality they require more resources to run. As you generalize some solutions, you have to handle a broader range of computing, which often requires more resources.

I have found an answer on a link about why some languages are faster and some are slower, I hope this will clear more about why C or C++ is faster than others, There are some other languages also that is faster than C, but we can not use all of them. Some explanation -
One of the big reasons that Fortran remains important is because it's fast: number crunching routines written in Fortran tend to be quicker than equivalent routines written in most other languages. The languages that are competing with Fortran in this space—C and C++—are used because they're competitive with this performance.
This raises the question: why? What is it about C++ and Fortran that make them fast, and why do they outperform other popular languages, such as Java or Python?
Interpreting versus compiling
There are many ways to categorize and define programming languages, according to the style of programming they encourage and features they offer. When looking at performance, the biggest single distinction is between interpreted languages and compiled ones.
The divide is not hard; rather, there's a spectrum. At one end, we have traditional compiled languages, a group that includes Fortran, C, and C++. In these languages, there is a discrete compilation stage that translates the source code of a program into an executable form that the processor can use.
This compilation process has several steps. The source code is analyzed and parsed. Basic coding mistakes such as typos and spelling errors can be detected at this point. The parsed code is used to generate an in-memory representation, which too can be used to detect mistakes—this time, semantic mistakes, such as calling functions that don't exist, or trying to perform arithmetic operations on strings of text.
This in-memory representation is then used to drive a code generator, the part that produces executable code. Code optimization, to improve the performance of the generated code, is performed at various times within this process: high-level optimizations can be performed on the code representation, and lower-level optimizations are used on the output of the code generator.
Actually executing the code happens later. The entire compilation process is simply used to create something that can be executed.
At the opposite end, we have interpreters. The interpreters will include a parsing stage similar to that of the compiler, but this is then used to drive direct execution, with the program being run immediately.
The simplest interpreter has within it executable code corresponding to the various features the language supports—so it will have functions for adding numbers, joining strings, whatever else a given language has. As it parses the code, it will look up the corresponding function and execute it. Variables created in the program will be kept in some kind of lookup table that maps their names to their data.
The most extreme example of the interpreter style is something like a batch file or shell script. In these languages, the executable code is often not even built into the interpreter itself, but rather separate, standalone programs.
So why does this make a difference to performance? In general, each layer of indirection reduces performance. For example, the fastest way to add two numbers is to have both of those numbers in registers in the processor, and to use the processor's add instruction. That's what compiled programs can do; they can put variables into registers and take advantage of processor instructions. But in interpreted programs, that same addition might require two lookups in a table of variables to fetch the values to add, then calling a function to perform the addition. That function may very well use the same processor instruction as the compiled program uses to perform the actual addition, but all the extra work before the instruction can actually be used makes things slower.
If you want to know more please check the source.

Some C++ algorithms are faster than C, and some implementations of algorithms or design patterns in other languages can be faster than C.
When people say that C is fast, and then move on to talking about some other language, they are generally using C's performance as a benchmark.

Just step through the machine code in your IDE, and you'll see why it's faster (if it's faster). It leaves out a lot of hand-holding. Chances are your Cxx can also be told to leave it out too, in which case it should be about the same.
Compiler optimizations are overrated, as are almost all perceptions about language speed.
Optimization of generated code only makes a difference in hotspot code, that is, tight algorithms devoid of function calls (explicit or implicit). Anywhere else, it achieves very little.

With modern optimizing compilers, it's highly unlikely that a pure C program is going to be all that much faster than compiled .NET code, if at all. With the productivity enhancement that frameworks like .NET provide the developer, you can do things in a day that used to take weeks or months in regular C. Coupled with the cheap cost of hardware compared to a developer's salary, it's just way cheaper to write the stuff in a high-level language and throw hardware at any slowness.
The reason Jeff and Joel talk about C being the "real programmer" language is because there isn't any hand-holding in C. You must allocate your own memory, deallocate that memory, do your own bounds-checking, etc. There isn't any such thing as new object(); There isn't any garbage collection, classes, OOP, entity frameworks, LINQ, properties, attributes, fields, or anything like that.
You have to know things like pointer arithmetic and how to dereference a pointer. And, for that matter, know and understand what a pointer is. You have to know what a stack frame is and what the instruction pointer is. You have to know the memory model of the CPU architecture you're working on. There is a lot of implicit understanding of the architecture of a microcomputer (usually the microcomputer you're working on) when programming in C that simply is not present nor necessary when programming in something like C# or Java. All of that information has been off-loaded to the compiler (or VM) programmer.

It's the difference between automatic and manual. Higher-level languages are abstractions, thus automated. C/C++ are manually controlled and handled; even error checking code is sometimes a manual labor.
C and C++ are also compiled languages which means none of that run-everywhere business. These languages have to be fine-tuned for the hardware you work with, thus adding an extra layer of gotcha. Though this is slightly phasing out now as C/C++ compilers are becoming more common across all platforms. You can do cross compilations between platforms. It's still not a run everywhere situation, and you’re basically instructing compiler A to compile against compiler B the same code on a different architecture.
Bottom line, C languages are not meant to be easy to understand or reason. This is also why they’re referred to as systems languages. They came out before all this high-level abstraction nonsense. This is also why they are not used for front end web programming. They’re just not suited to the task; they’re meant to solve complex problems that can't be resolved with conventional language tooling.
This is why you get crazy stuff, like micro-architectures, drivers, quantum physics, AAA games, and operating systems. There are things C and C++ are just well suited for. Speed and number crunching being the chief areas.

C is fast because it is natively compiled, low-level language. But C is not the fastest. The Recursive Fibonacci Benchmark shows that Rust, Crystal, and Nim can be faster.

When is assembly faster than C? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.
This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.
Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Here is a real world example: Fixed point multiplies on old compilers.
These don't only come handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bit and it's harder to predict precision loss). i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).
Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see
Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:
// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
long long a_long = a; // cast to 64 bit.
long long product = a_long * b; // perform multiplication
return (int) (product >> 16); // shift by the fixed point bias
}
The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.
x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).
So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.
If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.
Using intrinsics you can rewrite the function in a way that the C-compiler has a chance to understand what's going on. This allows the code to be inlined, register allocated, common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.
For reference: The end-result for the fixed-point mul for the VS.NET compiler is:
int inline FixedPointMul (int a, int b)
{
return (int) __ll_rshift(__emul(a,b),16);
}
The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.
Using Visual C++ 2013 gives the same assembly code for both ways.
gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)
See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).
Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcnt in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.
I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.
I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.

Pretty much anytime the compiler sees floating point code, a hand written version will be quicker if you're using an old bad compiler. (2019 update: This is not true in general for modern compilers. Especially when compiling for anything other than x87; compilers have an easier time with SSE2 or AVX for scalar math, or any non-x86 with a flat FP register set, unlike x87's register stack.)
The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed as the C version (compiled with VS2K5):
#include "stdafx.h"
#include <windows.h>
float KahanSum(const float *data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;
for (int i = 0 ; i < n ; ++i) {
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}
return sum;
}
float AsmSum(const float *data, int n)
{
float result = 0.0f;
_asm
{
mov esi,data
mov ecx,n
fldz
fldz
l1:
fsubr [esi]
add esi,4
fld st(0)
fadd st(0),st(2)
fld st(0)
fsub st(0),st(3)
fsub st(0),st(2)
fstp st(2)
fstp st(2)
loop l1
fstp result
fstp result
}
return result;
}
int main (int, char **)
{
int count = 1000000;
float *source = new float [count];
for (int i = 0 ; i < count ; ++i) {
source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
}
LARGE_INTEGER start, mid, end;
float sum1 = 0.0f, sum2 = 0.0f;
QueryPerformanceCounter (&start);
sum1 = KahanSum (source, count);
QueryPerformanceCounter (&mid);
sum2 = AsmSum (source, count);
QueryPerformanceCounter (&end);
cout << " C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;
return 0;
}
And some numbers from my PC running a default release build*:
C code: 500137 in 103884668
asm code: 500137 in 52129147
Out of interest, I swapped the loop with a dec/jnz and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency bottleneck is enough to hide the extra cost of loop. Doing two Kahan summations in parallel for the odd/even elements, and adding those at the end, could maybe speed this up by a factor of 2.)
Whoops, I was running a slightly different version of the code and it outputted the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.

Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.
In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on - it's basically the same as a computer being as good as or better than the best human player for boardgames, etc. simply because it can make searches within the problem space faster than most humans. Although you theoretically can perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).
On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could do.
Others have mentioned special purpose instructions, which is what I'm talking in the paragraph above - instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.

In my job, there are three reasons for me to know and use assembly. In order of importance:
Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.
Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:
for (int y=0; y < imageHeight; y++) {
for (int x=0; x < imageWidth; x++) {
// do something
}
}
the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.
doing something the language won't let me. These include - getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.

Only when using some special purpose instruction sets the compiler doesn't support.
To maximize the computing power of a modern CPU with multiple pipelines and predictive branching you need to structure the assembly program in a way that makes it a) almost impossible for a human to write b) even more impossible to maintain.
Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.

Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, 64-bit data, there are a few mathematical operations not supported by C which can often be performed elegantly in certain assembly instruction sets:
Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:
int16_t x, y;
// int16_t is a typedef for "short"
// set x and y to something
int16_t prod = (int16_t)(((int32_t)x*y)>>16);`
In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.
Certain bitshifting operations (rotation/carries):
// 256-bit array shifted right in its entirety:
uint8_t x[32];
for (int i = 32; --i > 0; )
{
x[i] = (x[i] >> 1) | (x[i-1] << 7);
}
x[0] >>= 1;
This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.
For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.
Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.
edit: 3. Overflow detection is possible in assembly (can't really do it in C), this makes some algorithms much easier.

Short answer? Sometimes.
Technically every abstraction has a cost and a programming language is an abstraction for how the CPU works. C however is very close. Years ago I remember laughing out loud when I logged onto my UNIX account and got the following fortune message (when such things were popular):
The C Programming Language -- A
language which combines the
flexibility of assembly language with
the power of assembly language.
It's funny because it's true: C is like portable assembly language.
It's worth noting that assembly language just runs however you write it. There is however a compiler in between C and the assembly language it generates and that is extremely important because how fast your C code is has an awful lot to do with how good your compiler is.
When gcc came on the scene one of the things that made it so popular was that it was often so much better than the C compilers that shipped with many commercial UNIX flavours. Not only was it ANSI C (none of this K&R C rubbish), was more robust and typically produced better (faster) code. Not always but often.
I tell you all this because there is no blanket rule about the speed of C and assembler because there is no objective standard for C.
Likewise, assembler varies a lot depending on what processor you're running, your system spec, what instruction set you're using and so on. Historically there have been two CPU architecture families: CISC and RISC. The biggest player in CISC was and still is the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc and so on). CISC won the battle for the hearts and minds.
Anyway, the popular wisdom when I was a younger developer was that hand-written x86 could often be much faster than C because the way the architecture worked, it had a complexity that benefitted from a human doing it. RISC on the other hand seemed designed for compilers so noone (I knew) wrote say Sparc assembler. I'm sure such people existed but no doubt they've both gone insane and been institutionalized by now.
Instruction sets are an important point even in the same family of processors. Certain Intel processors have extensions like SSE through SSE4. AMD had their own SIMD instructions. The benefit of a programming language like C was someone could write their library so it was optimized for whichever processor you were running on. That was hard work in assembler.
There are still optimizations you can make in assembler that no compiler could make and a well written assembler algoirthm will be as fast or faster than it's C equivalent. The bigger question is: is it worth it?
Ultimately though assembler was a product of its time and was more popular at a time when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low level things like some parts of an operating system (even so the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too) and so on. Or just for kicks (which is somewhat masochistic).

I'm surprised no one said this. The strlen() function is much faster if written in assembly! In C, the best thing you can do is
int c;
for(c = 0; str[c] != '\0'; c++) {}
while in assembly you can speed it up considerably:
mov esi, offset string
mov edi, esi
xor ecx, ecx
lp:
mov ax, byte ptr [esi]
cmp al, cl
je end_1
cmp ah, cl
je end_2
mov bx, byte ptr [esi + 2]
cmp bl, cl
je end_3
cmp bh, cl
je end_4
add esi, 4
jmp lp
end_4:
inc esi
end_3:
inc esi
end_2:
inc esi
end_1:
inc esi
mov ecx, esi
sub ecx, edi
the length is in ecx. This compares 4 characters at time, so it's 4 times faster. And think using the high order word of eax and ebx, it will become 8 times faster that the previous C routine!

A use case which might not apply anymore but for your nerd pleasure: On the Amiga, the CPU and the graphics/audio chips would fight for accessing a certain area of RAM (the first 2MB of RAM to be specific). So when you had only 2MB RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.
In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions, clever use of the CPU cache, the bus timing, you could achieve some effects which were simply not possible using any higher level language because you had to time every command, even insert NOPs here and there to keep the various chips out of each others radar.
Which is another reason why the NOP (No Operation - do nothing) instruction of the CPU can actually make your whole application run faster.
[EDIT] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: The timing of the instructions was off.

Point one which is not the answer.
Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmers never-ending quest to know more and therefore be better. Also useful when stepping into frameworks you don't have the source code to and having at least a rough idea what is going on. It also helps you to understand JavaByteCode and .Net IL as they are both similar to assembler.
To answer the question when you have a small amount of code or a large amount of time. Most useful for use in embedded chips, where low chip complexity and poor competition in compilers targeting these chips can tip the balance in favour of humans. Also for restricted devices you are often trading off code size/memory size/performance in a way that would be hard to instruct a compiler to do. e.g. I know this user action is not called often so I will have small code size and poor performance, but this other function that look similar is used every second so I will have a larger code size and faster performance. That is the sort of trade off a skilled assembly programmer can use.
I would also like to add there is a lot of middle ground where you can code in C compile and examine the Assembly produced, then either change you C code or tweak and maintain as assembly.
My friend works on micro controllers, currently chips for controlling small electric motors. He works in a combination of low level c and Assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like the code has grown to fill the 256k chip and the business is wanting a new feature, do you
Remove an existing feature
Reduce the size of some or all of the existing features maybe at the cost of performance.
Advocate moving to a larger chip with a higher cost, higher power consumption and larger form factor.
I would like to add as a commercial developer with quite a portfolio or languages, platforms, types of applications I have never once felt the need to dive into writing assembly. I have how ever always appreciated the knowledge I gained about it. And sometimes debugged into it.
I know I have far more answered the question "why should I learn assembler" but I feel it is a more important question then when is it faster.
so lets try once more
You should be thinking about assembly
working on low level operating system function
Working on a compiler.
Working on an extremely limited chip, embedded system etc
Remember to compare your assembly to compiler generated to see which is faster/smaller/better.
David.

Matrix operations using SIMD instructions is probably faster than compiler generated code.

A few examples from my experience:
Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.
Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).
SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

I can't give the specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could out-perform any compiler. Reasons why:
You could deviate from calling conventions, passing arguments in registers.
You could carefully consider how to use registers, and avoid storing variables in memory.
For things like jump tables, you could avoid having to bounds-check the index.
Basically, compilers do a pretty good job of optimizing, and that is nearly always "good enough", but in some situations (like graphics rendering) where you're paying dearly for every single cycle, you can take shortcuts because you know the code, where a compiler could not because it has to be on the safe side.
In fact, I have heard of some graphics rendering code where a routine, like a line-draw or polygon-fill routine, actually generated a small block of machine code on the stack and executed it there, so as to avoid continual decision-making about line style, width, pattern, etc.
That said, what I want a compiler to do is generate good assembly code for me but not be too clever, and they mostly do that. In fact, one of the things I hate about Fortran is its scrambling the code in an attempt to "optimize" it, usually to no significant purpose.
Usually, when apps have performance problems, it is due to wasteful design. These days, I would never recommend assembler for performance unless the overall app had already been tuned within an inch of its life, still was not fast enough, and was spending all its time in tight inner loops.
Added: I've seen plenty of apps written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was because the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day, regardless of language, and in a compiler language that's going to equal 3 or 400 instructions.

More often than you think, C needs to do things that seem to be unneccessary from an Assembly coder's point of view just because the C standards say so.
Integer promotion, for example. If you want to shift a char variable in C, one would usually expect that the code would do in fact just that, a single bit shift.
The standards, however, enforce the compiler to do a sign extend to int before the shift and truncate the result to char afterwards which might complicate code depending on the target processor's architecture.

You don't actually know whether your well-written C code is really fast if you haven't looked at the disassembly of what compiler produces. Many times you look at it and see that "well-written" was subjective.
So it's not necessary to write in assembler to get fastest code ever, but it's certainly worth to know assembler for the very same reason.

Tight loops, like when playing with images, since an image may cosist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real life sample:
http://danbystrom.se/2008/12/22/optimizing-away-ii/
Then often processors have some esoteric instructions which are too specialized for a compiler to bother with, but on occasion an assembler programmer can make good use of them. Take the XLAT instruction for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!
Updated: Oh, just come to think of what's most crucial when we speak of loops in general: the compiler has often no clue on how many iterations that will be the common case! Only the programmer know that a loop will be iterated MANY times and that it therefore will be beneficial to prepare for the loop with some extra work, or if it will be iterated so few times that the set-up actually will take longer than the iterations expected.

I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual, so the reason why assembly may be slower is that people who write such slower assembly didn't read the Optimization Manual.
In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.
Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instruction execute in parallel or in a different order. Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead or inc eax to remove dependency on the previous state of the flags.
You can read more on Out-of-Order Execution & Register Renaming if time permits. There is plenty of information available on the Internet.
There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution.
Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.
There are also lots of optimization tips and tricks besides the Out-of-Order Execution. Just read the Optimization Manual above mentioned :-)
However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly. In that case, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. At the same time, a C compiler makes its job a lot simpler—and inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.

I think the general case when assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this to be more efficient" and then that person tweaks that assembler or rewrites it from scratch.

It all depends on your workload.
For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transforms involving video (compression, decompression, image effects, etc)) that pretty much require assembly to be performant.
They also usually involve using CPU specific chipset extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operation.

It might be worth looking at Optimizing Immutable and Purity by Walter Bright it's not a profiled test but shows you one good example of a difference between handwritten and compiler generated ASM. Walter Bright writes optimising compilers so it might be worth looking at his other blog posts.

LInux assembly howto, asks this question and gives the pros and cons of using assembly.

I have an operation of transposition of bits that needs to be done, on 192 or 256 bits every interrupt, that happens every 50 microseconds.
It happens by a fixed map(hardware constraints). Using C, it took around 10 microseconds to make. When I translated this to Assembler, taking into account the specific features of this map, specific register caching, and using bit oriented operations; it took less than 3.5 microsecond to perform.

The simple answer... One who knows assembly well (aka has the reference beside him, and is taking advantage of every little processor cache and pipeline feature etc) is guaranteed to be capable of producing much faster code than any compiler.
However the difference these days just doesn't matter in the typical application.

http://cr.yp.to/qhasm.html has many examples.

One of the posibilities to the CP/M-86 version of PolyPascal (sibling to Turbo Pascal) was to replace the "use-bios-to-output-characters-to-the-screen" facility with a machine language routine which in essense was given the x, and y, and the string to put there.
This allowed to update the screen much, much faster than before!
There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze as much as possible.
It turnes out that since the screen was 80x25 both coordinates could fit in a byte each, so both could fit in a two-byte word. This allowed to do the calculations needed in fewer bytes since a single add could manipulate both values simultaneously.
To my knowledge there is no C compilers which can merge multiple values in a register, do SIMD instructions on them and split them out again later (and I don't think the machine instructions will be shorter anyway).

One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (expained in detail here):
add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps
Nowadays most compilers express advanced CPU specific instructions as intrinsics, i.e., functions that get compiled down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform specific instructions. Visual C++ can also take advantage of the actual architecture you are targetting with the appropriate /ARCH setting.

Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.

gcc has become a widely used compiler. Its optimizations in general are not that good. Far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So as a general answer there are going to be many places where you can go into the output of the compiler and tweak the assembler for performance, and/or simply re-write the routine from scratch.

Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to code and spend your time allocating registers, optimize few spills away and what not, the compiler will win every single time. You do your modification to the code, recompile and measure. Repeat if necessary.
Also, you can do a lot in the high-level side. Also, inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it will run faster than what you think would be quicker. Example:
int y = data[i];
// do some stuff here..
call_function(y, ...);
The compiler will read the data, push it to stack (spill) and later read from stack and pass as argument. Sounds shite? It might actually be very effective latency compensation and result in faster runtime.
// optimized version
call_function(data[i], ...); // not so optimized after all..
The idea with the optimized version was, that we have reduced register pressure and avoid spilling. But in truth, the "shitty" version was faster!
Looking at the assembly code, just looking at the instructions and concluding: more instructions, slower, would be a misjudgment.
The thing here to pay attention is: many assembly experts think they know a lot, but know very little. The rules change from architecture to next, too. There is no silver-bullet x86 code, for example, which is always the fastest. These days is better to go by rules-of-thumb:
memory is slow
cache is fast
try to use cached better
how often you going to miss? do you have latency compensation strategy?
you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
application architecture is important..
.. but it does't help when the problem isn't in the architecture
Also, trusting too much into compiler magically transforming poorly-thought-out C/C++ code into "theoretically optimum" code is wishful thinking. You have to know the compiler and tool chain you use if you care about "performance" at this low-level.
Compilers in C/C++ are generally not very good at re-ordering sub-expressions because the functions have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options to allow relaxed precision rules which allow order of operations to be changed by the compiler/linker/code generator.
This topic is a bit of a dead-end; for most it's not relevant, and the rest, they know what they are doing already anyway.
It all boils down to this: "to understand what you are doing", it's a bit different from knowing what you are doing.