Practical use of automatic vectorization?


Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?

I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given the code of algorithms that can (and were, after I manually rewrote them using SSE intrinsics) be vectorized.
Part of this is being conservative - especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer to not optimize code rather than risking miscompiling it. This is one area where higher level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
Another part of it is simply analytical limitations: most research in vectorization, as I understand it, targets classical numerical problems (fluid dynamics, say), which were the bread and butter of most vector machines until a few years ago (when, between CUDA/OpenCL, AltiVec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many of the things you can do to make code easier for a compiler to vectorize, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize it.
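To make the aliasing point concrete, here is a minimal sketch (the function name scale_add is made up for illustration). With C99 restrict the programmer promises that the arrays never overlap, which removes one of the things the compiler would otherwise have to prove. Whether a particular compiler then vectorizes the loop still depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical example: out[i] = a * x[i] + y[i].
 * `restrict` promises that out, x and y never overlap, so the compiler
 * does not have to assume that a store to out[i] changes x[] or y[]. */
void scale_add(float *restrict out, const float *restrict x,
               const float *restrict y, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
}
```

Calling it with overlapping arrays breaks the promise and is undefined behaviour, so this only helps when you really do know the buffers are distinct.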

It is hard to use in business logic, but it gives speedups when you are processing volumes of data in the same way.
A good example is sound/video processing, where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the results after compiling to see whether vectorization was really used where it should be.

Vector instructions are not limited to Cell processors: most modern workstation-class CPUs have them (PPC, x86 since the Pentium III, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very compute-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.

You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search the web for people's comments, it always comes down to the same thing: GCC lets you enable automatic vectorization, but it extremely rarely makes actual use of it, so if you want SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), you basically have to figure out how to write it using compiler intrinsics or assembly language code.
But the problem with intrinsics is that you effectively need to understand the assembly language side of it and then also learn the intrinsics way of describing what you want, which is likely to result in much less efficient code than if you wrote it in assembly (such as by a factor of 10x), because the compiler is still going to have trouble making good use of your intrinsics!
For example, you might be using SIMD intrinsics so that many operations can be performed in parallel, but your compiler will probably generate assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed to (or even slower than) normal code!
So basically:
If you want up to 100% speedups (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
If you want 1000% speedups (10x speed), then write it in assembly code using SIMD instructions by hand. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide similar speedups on the GPU as SIMD does on the CPU.
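For anyone who hasn't seen intrinsics, this is roughly what "converting some of your code to SIMD intrinsics" looks like. It is only a sketch, assuming an x86 target with SSE; the function name add_floats is made up, and on other targets it falls back to the plain loop. It says nothing about the speedup figures above.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical example: dst[i] = a[i] + b[i], four floats per step. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++)                     /* scalar tail (or whole loop) */
        dst[i] = a[i] + b[i];
}
```

The unaligned load/store variants are used so the sketch works on any pointers; aligned variants exist but require 16-byte aligned data.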


(Edited) When should one use inline assembly in c (outside of optimization)? [closed]

Note: Edited to make the question non-opinion-based
Assumptions
We are in user mode (not in the kernel)
The OS being used is either a modern version of Linux or a modern version of Windows running on an x86 CPU.
Other than optimization, is there a specific example where using inline assembly in a C program is needed? (If applicable, provide the inline assembly.)
To be clear, by inline assembly I mean injecting assembly language code through the use of the keywords __asm__ (in the case of GCC) or __asm (in the case of VC++).

(Most of this was written for the original version of the question. It was edited after).
You mean purely for performance reasons, so excluding using special instructions in an OS kernel?
What you really ultimately want is machine code that executes efficiently. And the ability to modify some text files and recompile to get different machine code. You can usually get both of those things without needing inline asm, therefore:
https://gcc.gnu.org/wiki/DontUseInlineAsm
GNU C inline assembly is hard to use correctly, but if you do use it correctly it has very low overhead. Still, it blocks many important optimizations like constant-propagation.
See https://stackoverflow.com/tags/inline-assembly/info for guides on how to use it efficiently / safely. (e.g. use constraints instead of stupid mov instructions as the first or last instruction in the asm template.)
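A small sketch of the "constraints instead of mov" advice, assuming GCC or Clang on x86 with the default AT&T syntax (the function name add_asm is made up, and other targets get a plain C fallback). It is purely an illustration of operand constraints; for an addition, plain `a + b` is the better choice for exactly the constant-propagation reason given above.

```c
#include <stdint.h>

/* Hypothetical example: add b to a with one `add` instruction.
 * "+r"(a) makes a an input/output operand in a register the compiler
 * picks, and "r"(b) a register input, so the template needs no mov. */
static inline uint32_t add_asm(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__("addl %1, %0"      /* AT&T syntax: a += b */
            : "+r"(a)
            : "r"(b)
            : "cc");           /* add writes the flags */
    return a;
#else
    return a + b;              /* portable fallback on other targets */
#endif
}
```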
Pretty much always inappropriate, unless you know exactly what you're doing and can't hand-hold the compiler to make asm that's quite as good with pure C or intrinsics. Manual vectorization with intrinsics certainly still has its place; compilers are still terrible at some things, like auto-vectorizing complex shuffles. GCC/Clang won't auto-vectorize at all for search loops like a pure C implementation of memchr, or any loop where the trip-count isn't known before the first iteration.
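To illustrate the difference between the two loop shapes mentioned above, here is a sketch with made-up function names. Compiler behaviour changes between versions, so check the output of your own compiler rather than taking either comment as a guarantee.

```c
#include <stddef.h>

/* Counted loop: the trip count n is known before the first iteration,
 * which is the shape auto-vectorizers are built to handle. */
size_t count_byte(const unsigned char *p, size_t n, unsigned char c)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (p[i] == c);
    return count;
}

/* Search loop (memchr-like): the early return makes the trip count
 * depend on the data, which is the case described above. */
const unsigned char *find_byte(const unsigned char *p, size_t n,
                               unsigned char c)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] == c)
            return p + i;
    return NULL;
}
```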
And of course performance on current microarchitectures has to trump maintainability and optimizing differently for future CPUs. If it's ever appropriate, only for small hot loops where your program spends a lot of time, and typically CPU-bound. If memory-bound then there's usually not much to gain.
Over large scales, compilers are excellent (especially with link-time optimization). Humans can't compete on that scale, not while keeping code maintainable. The only place humans can still compete is in the small scale where you can afford the time to think about every single instruction in a loop that will run many iterations over the course of a program.
The more widely-used and performance-sensitive your code is (e.g. a video encoder like x264 or x265), the more reason there is to consider hand-tuned asm for anything. Saving a few cycles over millions of computers running your code every day starts to add up to being worth considering the maintenance / testing / portability downsides.
The one notable exception is ARM SIMD (NEON), where compilers are often still bad. I think this is especially true for 32-bit ARM, where each 128-bit q0..15 register is aliased by 2x 64-bit d0..31 registers, so you can avoid shuffling by accessing the 2 halves as separate registers. Compilers don't model this well, and can easily shoot themselves in the foot when compiling intrinsics that you'd expect to be able to compile efficiently. Compilers are good at producing efficient asm from SIMD intrinsics for x86 (SSE/AVX) and PowerPC (AltiVec), but for some unknown reason are bad at optimizing ARM NEON intrinsics and often make sub-optimal asm.
Some compilers are not bad, e.g. apparently Apple clang/LLVM for AArch64 does ok more often than it used to. But still, see Arm Neon Intrinsics vs hand assembly - Jake Lee found the intrinsics version of his 4x4 float matmul was 3x slower than his hand-written version using clang, in Dec 2017. Jake is an ARM optimization expert so I'm inclined to believe that's fairly realistic.
or __asm (in the case of VC++)
MSVC-style asm is usually only useful for writing whole loops because having to take inputs via memory operands destroys (some of) the benefit. So amortizing that overhead over a whole loop helps.
For wrapping single instructions, introducing extra store-forwarding latency is just dumb, and there are MSVC intrinsics for almost everything you can't easily express in pure C. See What is the difference between 'asm', '__asm' and '__asm__'? for examples with a single instruction: you get much worse asm from using MSVC inline asm than you would for pure C or an intrinsic if you look at the big picture (including compiler-generated asm outside your asm block).
C++ code for testing the Collatz conjecture faster than hand-written assembly - why? shows a concrete example where hand-written asm is faster on current CPUs than anything I was able to get GCC or clang to emit by tweaking C source. They apparently don't know how to optimize for lower-latency LEA when it's part of a loop-carried dependency chain.
(The original question there was a great example of why you shouldn't write by hand in asm unless you know exactly what you're doing and use optimized compiler output as a starting point. But my answer shows that for a long-running hot tight loop, there are significant gains that compilers are missing with just micro-optimizations, even leaving aside algorithmic improvements.)
If you're considering asm, always benchmark it against the best you can get the compiler to emit. Working on a hand-written asm version may give you ideas that you can apply to your C to hand-hold compilers into making better asm. Then you can get the benefit without actually including any non-portable inline asm in your code.

SSE optimized code performs similar to plain version

I wanted to take my first steps with Intel's SSE so I followed the guide published here, with the difference that instead of developing for Windows and C++ I make it for Linux and C (therefore I don't use any _aligned_malloc but posix_memalign).
I also implemented one compute-intensive method without the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, with the SSE version usually slightly slower.
Is that normal? Could it be that GCC already optimizes with SSE (even with the -O0 option)? I also tried the -mfpmath=387 option, but it made no difference.

For floating point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs so double precision may only be about the same speed for SIMD vs scalar, and single precision might give you 2x for SIMD over scalar on a good day. For integer operations though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits with SSE.

GCC has a very good built-in vectorizer, but it does not run at -O0: it is enabled by -ftree-vectorize, which is on by default at -O3 (and at -O2 since GCC 12). At those levels it will use SIMD where it can to speed up scalar code (it will also optimize SIMD code a bit, if possible).
It's pretty easy to check whether this is what's happening here: just disassemble the output (or have gcc emit commented asm files).
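A sketch of how that check could look with GCC. The file name sum.c and the function are made up; the commands are in the comment at the top, and the loop is a simple one that the vectorizer would normally be expected to report on.

```c
/* Hypothetical file name: sum.c
 *
 * Ask GCC which loops it vectorized:
 *     gcc -O3 -fopt-info-vec -c sum.c
 * or read the generated, commented assembly:
 *     gcc -O3 -S -fverbose-asm sum.c
 */
#include <stddef.h>

void add_arrays(int *restrict dst, const int *restrict a,
                const int *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

On x86, packed instructions operating on xmm/ymm registers in the loop body are the sign that it was vectorized.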

What is the limit of optimization using SIMD?

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thanks

The best optimization comes from rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points which all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
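A minimal sketch of that idea, under assumptions the answer does not state: the transform is a 2-D affine map and the points are evenly spaced along the segment. All type and function names here are made up for illustration.

```c
#include <stddef.h>

typedef struct { float x, y; } Point;                 /* hypothetical */
typedef struct { float a, b, c, d, tx, ty; } Affine;  /* hypothetical */

/* x' = a*x + b*y + tx,  y' = c*x + d*y + ty */
static Point apply(const Affine *m, Point p)
{
    Point r = { m->a * p.x + m->b * p.y + m->tx,
                m->c * p.x + m->d * p.y + m->ty };
    return r;
}

/* Produce n evenly spaced transformed points on the segment p0..p1
 * (n >= 2). Only the two end points go through the matrix; the rest
 * are linearly interpolated between the transformed end points. */
void transform_segment(const Affine *m, Point p0, Point p1,
                       Point *out, size_t n)
{
    Point q0 = apply(m, p0), q1 = apply(m, p1);
    for (size_t i = 0; i < n; i++) {
        float t = (float)i / (float)(n - 1);
        out[i].x = q0.x + t * (q1.x - q0.x);
        out[i].y = q0.y + t * (q1.y - q0.y);
    }
}
```

This works because an affine map sends straight lines to straight lines and preserves ratios along them; it would not be valid for a perspective or other non-affine transform.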

It CAN give better than a 4x speedup over straight floating point, as the SIMD instructions could be less exact (not so much as to cause too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than 4x improvements. You may find out it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.

On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...

This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the level of parallelism (16 pixels at a time).
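As an illustration of the PAVG point, here is a sketch that averages two arrays of 8-bit pixels. It assumes an x86 target with SSE2 and falls back to scalar code elsewhere; the function name average_u8 is made up, and this is not the convolution filter mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical example: dst[i] = (a[i] + b[i] + 1) / 2 for 8-bit pixels. */
void average_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PAVGB: rounded average of 16 unsigned bytes in one instruction */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    for (; i < n; i++)   /* scalar tail: the sum is computed in int */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

The scalar version has to widen each byte before adding to avoid overflow, which is part of why a single instruction doing 16 rounded averages can beat it by more than the raw lane count would suggest.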

It depends on the architecture. For the moment I'll assume x86 (i.e. SSE).
You can get a factor of four on tight loops easily. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because if you use SSE you do the math in registers which are usually not used by the compiler. This frees up the general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option to hint to the memory controller how you want to access memory, e.g. whether you want a store to bypass the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.

Is C inefficient compared to Assembly? [duplicate]

This is purely a theory question, so, given an "infinite" time to make a trivial program, and an advanced knowledge of C and Assembly, is it really better to do something in Assembly? is "performance" lost when compiling C into Assembly (to machine code)?
By performance I mean, do modern C compilers do a bad job at certain tasks that programming directly in Assembly speeds up?

Modern C can do a better job than assembly in many cases, because keeping track of which operations can overlap and which will block others is so complex that it can only reasonably be tracked by a computer.

C is not inefficient compared to anything. C is a language, and we don't describe languages in terms of efficiency. We compare programs in terms of efficiency. C doesn't write programs; programmers write programs.
Assembly gives you immense flexibility when comparing with C, and that is at the cost of time programming. If you are a guru C programmer and a guru Assembly programmer, then chances are you might be able to squeeze some more juice with Assembly for writing any given program, but the price for that is virtually certain to be prohibitive.
Most of us aren't gurus in either of these languages. For most of us, giving the responsibility of performance tuning to a C compiler is a double win: you get the wisdom of a number of Assembly gurus, the people who wrote the C compiler, along with an immense amount of time in your hands to further correct and enhance your C program. You also get portability as a bonus.

This question seems to stem from the misconception that higher performance is automatically better. There is too much to be gained from a higher level perspective to make assembly better in the general case. Even if performance is your primary concern, compilers usually do a better job creating efficient assembly than you could write yourself. They have a much broader "understanding" of all of your source code than you could possibly hold in your head. Many optimizations can be had from NOT using well-structured assembly.
Obviously there are exceptions. If you need to access hardware directly, including special processing features of CPUs (e.g. SSE), then assembly is the way to go. However, in that case, you're probably better off using a library that addresses your general problem more directly (e.g. numerics packages).
But you should only worry about things like this if you have a concrete, specific need for the increased performance and you can show that your assembly actually IS faster. Concrete specific needs include: noticed and measured performance problems, embedded systems where performance is a fundamental design concern, etc.

Unless you are an assembly expert and(/or) taking advantage of advanced opcodes not utilized by the compiler, the C compiler will likely win.
Try it for fun ;-)
A more realistic approach is often to let the C compiler do its bit, then profile and, if needed, tweak specific sections -- many compilers can dump some sort of low-level IL (or even "assembly").

Use C for most tasks, and write inline assembly code for specific ones (for example, to take advantage of SSE, MMX, ...).

It depends. C compilers for Intel do a pretty good job nowadays. I wasn't so impressed by compilers for ARM - I could easily write an assembly version of an inner loop that performed twice as fast.
You typically don't need assembly on x86 machines. If you want to gain direct access to SSE instructions, look into compiler intrinsics!

Ignoring how much time it would take to write the code, and assuming you have all the knowledge that is required to do any task most efficiently in both situations, assembly code will, by definition, always be able to either meet or outperform the code generated by a C compiler, because the C compiler has to create the assembly code to do the same task and it cannot optimize everything; and anything the C compiler writes, you could also write (in theory), and unlike the compiler, you can sometimes take a shortcut because you know more about the situation than can be expressed in C code.
However, that doesn't mean they do a bad job and that the code is too slow; just that it's slower than it could be. It may not be by more than a few microseconds, but it can still be slower.
What you have to remember is that some optimizations performed by a compiler are very complex: aggressive optimization tends to lead to very unreadable assembly code, and the code becomes harder to reason about if you were to do those optimizations manually. That's why you'd normally write it in C (or some other language) first, then profile it to find problem areas, and then go on to hand-optimize that piece of code until it reaches an acceptable speed - because the cost of writing everything in assembly is much higher, while often providing little or no benefit.

Actually, C might be faster than assembly in many cases, since compilers apply optimizations to your code. Even so, the performance difference (if any) is negligible.
I would focus more on readability & maintainability of the code base, as well as whether what you are trying to do is supported in C. In many cases, assembly will allow you to do more low-level things that C simply cannot do. For example, with assembly you can take advantage of MMX or SSE instructions directly.
So in the end, focus on what you want to accomplish. Remember - assembly language code is terrible to maintain. Use it only when you have no other choice.

No, compilers do not do a bad job at all. The amount of optimization that can be squeezed out by using assembly is insignificant for most programs.
That amount depends on how you define 'modern C compiler'. A brand new compiler (For a chip that has just reached market) may have a large number of inefficiencies that will get ironed out over time. Just compile some simple programs (the string.h functions, for example), and analyze what each line of code does. You may be surprised at some of the wasteful things an untested C compiler does, and recognize the error with a simple read-through of the code. A mature, well-tested, thoroughly optimized compiler (Think x86) will do a great job of generating assembly, though a new one will still do a decent job.
In no case can C do a better job than assembly. You could just benchmark the two, and if your assembly was slower, compile with -S and submit the resulting assembly, and you're guaranteed a tie. C is compiled to assembly, which has a 1:1 correspondence with the machine code. The computer can't do anything that assembly can't do, assuming that the complete instruction set is published.
In some cases, C is not expressive enough to be fully optimized. A programmer may know something about the nature of the data that simply cannot be expressed in C in such a way that the compiler can take advantage of this knowledge. Certainly, C is expressive and close to the metal, and is very good for optimization, but complete optimization is not always possible.
A compiler can't define 'performance' like a human can. I understand that you said trivial programs, but even in the simplest (useful) algorithms, there will be a tradeoff between size and speed. The compiler can't do this at a more fine grained scale than the -Os/-O[1-3] flags, but a human can know what 'best' means in the context of the purpose of a program.
Some architecture-dependent assembly instructions can't be expressed in C. This is where ASM() statements come in. Sometimes, these are not for optimization at all, but simply because there is no way to express in C that this line must use, say, the atomic test-and-set operation, or that we want to issue an SVC interrupt with the encoded parameter X.
The above points notwithstanding, C is orders of magnitude more efficient to program in and to master. If performance is important, analysis of the assembly will be necessary, and optimizations will probably be found, but the tradeoff in developer time and effort is rarely worth the effort for complex programs on a PC. For very simple programs which must be as fast as absolutely possible (like an RTOS), or which have severe memory constraints (like an ATTiny with 1KB of Flash (non-writable) memory and 64Bytes of RAM), assembly may be the only way to go.
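A side note on the atomic test-and-set example above: since C11 that particular operation can be requested from standard C through `<stdatomic.h>`, without an asm statement. A minimal sketch, assuming a C11 compiler with atomics support; the spinlock type and function names are made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag busy; } spinlock;   /* hypothetical type */

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the lock was free and is now held by the caller. */
static bool spin_trylock(spinlock *l)
{
    /* test-and-set returns the previous value: false means it was free */
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

static void spin_lock(spinlock *l)
{
    while (!spin_trylock(l))
        ;   /* busy-wait until the flag was observed clear */
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

Things like issuing an SVC with an encoded parameter still have no standard C spelling, so the general point about asm statements stands.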

Given an infinite time and an extremely deep understanding on how a modern CPU works you can actually write the "perfect" program (i.e. the best performance possible on that machine), but you will have to consider, for any instruction in your program, how CPU behaves in that context, pipelining and caching related optimizations, and many many other things.
A compiler is built to generate the best assembly code possible. You will rarely understand the assembly code generated by a modern compiler because it tends to be really extreme.
At times compilers fail in this task because they can't always foresee what's happening.
Generally they do a great job but they sometimes fail...
To sum up: knowing C and assembly is absolutely not enough to do a better job than a compiler in 99.99% of cases, and considering that programming something in C can be 10000 times faster than programming the same thing in assembly, a nicer way to spend your time is optimizing what the compiler did wrong in the remaining 0.01%, not reinventing the wheel.

This depends on the compiler you use? This is no property of C or any language. Theoretically it's possible to load a compiler with such a sophisticated AI that you can compile prolog to more efficient machine language than GCC can do with C.
This depends 100% on the compiler and 0% on C.
What does matter is that C is written as a language for which it is easy to write an optimizing compiler from C -> assembly, and with assembly this means the instructions of a Von Neumann machine. It depends on the target, some languages like prolog will probably be easier to map on hypothetical 'reduction machines'.
But, given that assembly is your target language for your C compiler (you can technically compile C to brainfuck or to Haskell, there is no theoretical difference) then:
It is possible to write the optimally fast program in that assembly itself (duh)
It is possible to write a C compiler which in every instance produces the optimal assembly. That is to say, there exists a function from every C program to the optimal way to get the same I/O in assembly, and this function is computable, albeit perhaps not deterministically.
This is also possible with every other programming language in the world.

Why do you program in assembly? [closed]

I have a question for all the hardcore low level hackers out there. I ran across this sentence in a blog. I don't really think the source matters (it's Haack if you really care) because it seems to be a common statement.
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
As far as the assembly goes - is the code written in assembly because you don't want a compiler emitting extra instructions or using excessive bytes, or are you using better algorithms that you can't express in C (or can't express without the compiler mussing them up)?
I completely get that it's important to understand the low-level stuff. I just want to understand the why program in assembly after you do understand it.

I think you're misreading this statement:
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
Games (and most programs these days) aren't "written in assembly" the same way they're "written in C++". That blog isn't saying that a significant fraction of the game is designed in assembly, or that a team of programmers sit around and develop in assembly as their primary language.
What this really means is that developers first write the game and get it working in C++. Then they profile it, figure out what the bottlenecks are, and if it's worthwhile they optimize the heck out of them in assembly. Or, if they're already experienced, they know which parts are going to be bottlenecks, and they've got optimized pieces sitting around from other games they've built.
The point of programming in assembly is the same as it always has been: speed. It would be ridiculous to write a lot of code in assembler, but there are some optimizations the compiler isn't aware of, and for a small enough window of code, a human is going to do better.
For example, for floating point, compilers tend to be pretty conservative and may not be aware of some of the more advanced features of your architecture. If you're willing to accept some error, you can usually do better than the compiler, and it's worth writing that little bit of code in assembly if you find that lots of time is spent on it.
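One common instance of "accepting some error", as a sketch with made-up function names: floating-point addition is not associative, so without flags such as -ffast-math a compiler has to keep the additions of a sum in source order. Splitting the sum into independent partial sums by hand changes the rounding slightly but removes the single long dependency chain.

```c
#include <stddef.h>

/* Straightforward sum: every add depends on the previous one, and the
 * compiler must preserve this order unless told it may reassociate. */
float sum_simple(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent partial sums: rounding can differ slightly, but the
 * four chains can overlap or map onto one 4-wide SIMD register. */
float sum_reassociated(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```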
Here are some more relevant examples:
Examples from Games
Article from Intel about optimizing a game engine using SSE intrinsics. The final code uses intrinsics (not inline assembler), so the amount of pure assembly is very small. But they look at the assembler output by the compiler to figure out exactly what to optimize.
Quake's fast inverse square root. Again, the routine doesn't have assembler in it, but you need to know something about architecture to do this kind of optimization. The authors know which operations are fast (multiply, shift) and which are slow (divide, sqrt). So they come up with a very tricky implementation of inverse square root that avoids the slow operations entirely.
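For reference, a sketch of the fast inverse square root trick in the style of the Quake III routine. It assumes 32-bit IEEE-754 floats and a positive, finite input, and it uses memcpy for the bit reinterpretation rather than the pointer cast of the widely circulated source, to stay within C's aliasing rules. The function name fast_rsqrt is made up.

```c
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) using only a shift, a subtract and multiplies.
 * Assumes 32-bit IEEE-754 float and x > 0. */
float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);        /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);      /* magic constant minus a shift */
    memcpy(&y, &bits, sizeof y);           /* back to float: rough estimate */

    y = y * (1.5f - 0.5f * x * y * y);     /* one Newton-Raphson step */
    return y;
}
```

The result is only an approximation (within a fraction of a percent after the single refinement step), which is the kind of trade-off between accuracy and speed described above.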
High-Performance Computing
Outside the domain of games, people in scientific computing frequently optimize the crap out of things to get them to run fast on the latest hardware. Think of this as games where you can't cheat on the physics.
A great recent example of this is Lattice Quantum Chromodynamics (Lattice QCD). This paper describes how the problem pretty much boils down to one very small computational kernel, which was optimized heavily for PowerPC 440's on an IBM Blue Gene/L. Each 440 has two FPUs, and they support some special ternary operations that are tricky for compilers to exploit. Without these optimizations, Lattice QCD would've run much slower, which is costly when your problem requires millions of CPU hours on expensive machines.
If you are wondering why this is important, check out the article in Science that came out of this work. Using Lattice QCD, these guys calculated the mass of a proton from first principles, and showed last year that 90% of the mass comes from strong force binding energy, and the rest from quarks. That's E=mc^2 in action. Here's a summary.
For all of the above, the applications are not designed or written 100% in assembly -- not even close. But when people really need speed, they focus on writing the key parts of their code to fly on specific hardware.

I have not coded in assembly language for many years, but I can give several reasons that I frequently saw:
Not all compilers can make use of certain CPU optimizations and instruction set (e.g., the new instruction sets that Intel adds once in a while). Waiting for compiler writers to catch up means losing a competitive advantage.
Easier to match actual code to known CPU architecture and optimization. For example, things you know about the fetching mechanism, caching, etc. This is supposed to be transparent to the developer, but the fact is that it is not, that's why compiler writers can optimize.
Certain hardware level accesses are only possible/practical via assembly language (e.g., when writing device driver).
Formal reasoning is sometimes actually easier for the assembly language than for the high-level language since you already know what the final or almost final layout of the code is.
Programming certain 3D graphic cards (circa late 1990s) in the absence of APIs was often more practical and efficient in assembly language, and sometimes not possible in other languages. But again, this involved really expert-level games based on the accelerator architecture like manually moving data in and out in certain order.
I doubt many people use assembly language when a higher-level language would do, especially when that language is C. Hand-optimizing large amounts of general-purpose code is impractical.

There is one aspect of assembler programming which others have not mentioned - the feeling of satisfaction you get knowing that every single byte in an application is the result of your own effort, not the compiler's. I wouldn't for a second want to go back to writing whole apps in assembler as I used to do in the early 80s, but I do miss that feeling sometimes...

Usually, a layman's assembly is slower than C (due to C's optimization) but many games (I distinctly remember Doom) had to have specific sections of the game in Assembly so it would run smoothly on normal machines.
Here's the example to which I am referring.

I started professional programming in assembly language in my very first job (80's). For embedded systems the memory demands - RAM and EPROM - were low. You could write tight code that was easy on resources.
By the late 80's I had switched to C. The code was easier to write, debug and maintain. Very small snippets of code were written in assembler - for me it was when I was writing the context switching in a roll-your-own RTOS. (Something you shouldn't do anymore unless it is a "science project".)
You will see assembler snippets in some Linux kernel code. Most recently I've browsed it in spinlocks and other synchronization code. These pieces of code need to gain access to atomic test-and-set operations, manipulating caches, etc.
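To make the test-and-set idea concrete, here is a minimal sketch of a spinlock built on C11's atomic_flag. This is not the Linux kernel's implementation (which uses architecture-specific assembly); the names demo_spinlock, demo_lock, demo_trylock and demo_unlock are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical minimal spinlock; an illustration, not kernel code. */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once: atomically set the flag and report whether it was clear before.
 * Returns 1 if the lock was acquired, 0 if someone else already holds it. */
static int demo_trylock(demo_spinlock *l) {
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Spin until the atomic test-and-set observes the flag as previously clear. */
static void demo_lock(demo_spinlock *l) {
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* busy-wait */
    }
}

/* Release the lock by clearing the flag. */
static void demo_unlock(demo_spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

On a compiler without C11 atomics, this test-and-set step is exactly the kind of thing that would have to be written as an assembler snippet.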
I think you would be hard pressed to out-optimize modern C compilers for most general programming.
I agree with #altCognito that your time is probably better spent thinking harder about the problem and doing things better. For some reason programmers often focus on micro-efficiencies and neglect the macro-efficiencies. Assembly language to improve performance is a micro-efficiency. Stepping back for a wider view of the system can expose the macro problems in a system. Solving the macro problems can often yield better performance gains.
Once the macro problems are solved then collapse to the micro level.
I guess micro problems are within the control of a single programmer and in a smaller domain. Altering behavior at the macro level requires communication with more people - a thing some programmers avoid. That whole cowboy vs the team thing.

"Yes". But understand that, for the most part, the benefits of writing code in assembler are not worth the effort. The return from writing it in assembly tends to be smaller than the return from simply thinking harder about the problem and spending your time finding a better way of doing things.
John Carmack and Michael Abrash, who were largely responsible for writing Quake and all of the high-performance code that went into id Software's game engines, go into this in great detail in this book.
I would also agree with Ólafur Waage that today, compilers are pretty smart and often employ many techniques which take advantage of hidden architectural boosts.

These days, for sequential codes at least, a decent compiler almost always beats even a highly seasoned assembly-language programmer. But for vector codes it's another story. Widely deployed compilers don't do such a great job exploiting the vector-parallel capabilities of the x86 SSE unit, for example. I'm a compiler writer, and exploiting SSE tops my list of reasons to go on your own instead of trusting the compiler.
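As a rough illustration of what "going on your own" with SSE looks like, here is a sketch that adds two float arrays four elements at a time using the standard SSE intrinsics from xmmintrin.h. The function name add_f32 is hypothetical, and the code falls back to a plain scalar loop when the compiler does not define __SSE__, which is an assumption made here to keep the example portable.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration with unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole loop when SSE is not available). */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

A loop this simple is one that compilers often can vectorize by themselves; the hand-written form tends to pay off on more complicated kernels.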

SSE code works better in assembly than with compiler intrinsics, at least in MSVC (i.e. it does not create extra copies of data).

I've three or four assembler routines (in about 20 MB source) in my sources at work. All of them are SSE(2), and are related to operations on (fairly large - think 2400x2048 and bigger) images.
As a hobby, I work on a compiler, and there you have more assembler. Runtime libraries are quite often full of it, mostly for things that defy the normal procedural regime (like helpers for exceptions etc.).
I don't have any assembler for my microcontroller. Most modern microcontrollers have so much peripheral hardware (interrupt-controlled counters, even entire quadrature encoders and serial building blocks) that using assembler to optimize the loops is often not needed anymore. With current flash prices, the same goes for code memory. Also there are often ranges of pin-compatible devices, so upscaling if you systematically run out of CPU power or flash space is often not a problem.
Unless you really ship 100000 devices and programming in assembler makes it possible to achieve major savings by fitting in a flash chip one category smaller. But I'm not in that category.
A lot of people think embedded is an excuse for assembler, but their controllers have more CPU power than the machines Unix was developed on. (Microchip is coming out with 40 and 60 MIPS microcontrollers for under USD 10.)
However, a lot of people are stuck with legacy code, since changing microcontroller architecture is not easy. Also the HLL code is very architecture dependent (because it uses the hardware periphery, registers to control I/O, etc). So there are sometimes good reasons to keep maintaining a project in assembler (I was lucky to be able to set up affairs on a new architecture from scratch). But often people kid themselves that they really need the assembler.
I still like the answer a professor gave when we asked if we could use GOTO (but you could read that as ASSEMBLER too): "if you think it is worth writing a 3 page essay on why you need the feature, you can use it. Please submit the essay with your results. "
I've used that as a guiding principle for low-level features. Don't be too uptight about using them, but make sure you motivate the choice properly. Even throw up an artificial barrier or two (like the essay) to avoid convoluted reasoning as justification.

Some instructions/flags/control simply aren't there at the C level.
For example, checking for overflow on x86 is as simple as testing the overflow flag. C gives you no direct access to that flag.
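For comparison, here is a sketch of what a signed-overflow check looks like when written in portable C, where the test has to happen before the addition because signed overflow is undefined behavior. The helper name checked_add is hypothetical. GCC and Clang also offer the __builtin_add_overflow extension, which typically compiles down to an add followed by a flag test, but that is a compiler extension rather than standard C.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores a + b in *out and returns true,
 * or returns false (leaving *out untouched) if the sum would overflow. */
static bool checked_add(int a, int b, int *out) {
    if (b > 0) {
        /* a + b > INT_MAX  <=>  a > INT_MAX - b */
        if (a > INT_MAX - b) return false;
    } else {
        /* a + b < INT_MIN  <=>  a < INT_MIN - b */
        if (a < INT_MIN - b) return false;
    }
    *out = a + b;
    return true;
}
```

In assembly the same check is one add plus one conditional jump on the overflow flag.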

Defects tend to run per-line (statement, code point, etc.); while it's true that for most problems, assembly would use far more lines than higher level languages, there are occasionally cases where it's the best (most concise, fewest lines) map to the problem at hand. Most of these cases involve the usual suspects, such as drivers and bit-banging in embedded systems.

If you were around for all the Y2K remediation efforts, you could have made a lot of money if you knew Assembly. There's still plenty of legacy code around that was written in it, and that code occasionally needs maintenance.

Another reason could be when the available compiler just isn't good enough for an architecture, and the amount of code needed in the program is not so long or complex that the programmer gets lost in it. Try programming a microcontroller for an embedded system; usually assembly will be much easier.

Besides the other things mentioned, all higher-level languages have certain limitations. That's why some people choose to program in ASM: to have full control over their code.
Others enjoy very small executables, in the range of 20-60 KB. For instance, check out HiEditor, which is implemented by the author of the HiEdit control (a superb, powerful edit control for Windows with syntax highlighting and tabs in only ~50 KB). In my collection I have more than 20 such gold controls, from Excel-like spreadsheets to HTML renderers.

I think a lot of game developers would be surprised at this bit of information.
Most games I know of use as little assembly as at all possible. In some cases none at all, and at worst, one or two loops or functions.
That quote is over-generalized, and nowhere near as true as it was a decade ago.
But hey, mere facts shouldn't hinder a true hacker's crusade in favor of assembly. ;)

If you are programming a low end 8 bit microcontroller with 128 bytes of RAM and 4K of program memory you don't have much choice about using assembly. Sometimes though when using a more powerful microcontroller you need a certain action to take place at an exact time. Assembly language comes in useful then as you can count the instructions and so measure the clock cycles used by your code.

Games are pretty performance hungry, and although optimizers have become pretty good, a "master programmer" is still able to squeeze out some more performance by hand coding the right parts in assembly.
Never ever start optimizing your program without profiling it first. After profiling you should be able to identify bottlenecks, and if finding better algorithms and the like doesn't cut it anymore, you can try to hand code some stuff in assembly.

Aside from very small projects on very small CPUs, I would not set out to ever program an entire project in assembly. However, it is common to find that a performance bottleneck can be relieved with the strategic hand coding of some inner loops.
In some cases, all that is really required is to replace some language construct with an instruction that the optimizer cannot be expected to figure out how to use. A typical example is in DSP applications where vector operations and multiply-accumulate operations are difficult for an optimizer to discover, but easy to hand code.
For example certain models of the SH4 contain 4x4 matrix and 4 vector instructions. I saw a huge performance improvement in a color correction algorithm by replacing equivalent C operations on a 3x3 matrix with the appropriate instructions, at the tiny cost of enlarging the correction matrix to 4x4 to match the hardware assumption. That was achieved by writing no more than a dozen lines of assembly, and carrying matching adjustments to the related data types and storage into a handful of places in the surrounding C code.
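The data-layout side of that change can be sketched in plain C. This is only an illustration of padding a 3x3 correction matrix to 4x4 so that it matches a 4-wide matrix-vector operation; it does not use the SH4 instructions themselves, and the type and function names (mat3, mat4, vec4, mat4_from_mat3, mat4_mul_vec4) are hypothetical.

```c
#include <stddef.h>

typedef struct { float m[3][3]; } mat3;
typedef struct { float m[4][4]; } mat4;
typedef struct { float v[4]; } vec4;

/* Embed a 3x3 matrix in the top-left corner of a 4x4 matrix.
 * The extra row and column are zero except for a 1 on the diagonal,
 * so the fourth component of a vector passes through unchanged. */
static mat4 mat4_from_mat3(const mat3 *src) {
    mat4 out = {{{0.0f}}};
    for (size_t r = 0; r < 3; ++r) {
        for (size_t c = 0; c < 3; ++c) {
            out.m[r][c] = src->m[r][c];
        }
    }
    out.m[3][3] = 1.0f;
    return out;
}

/* 4x4 matrix times 4-vector: the operation a hardware matrix/vector
 * instruction would replace. Each row is a multiply-accumulate chain. */
static vec4 mat4_mul_vec4(const mat4 *m, vec4 x) {
    vec4 y;
    for (size_t r = 0; r < 4; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < 4; ++c) {
            acc += m->m[r][c] * x.v[c];
        }
        y.v[r] = acc;
    }
    return y;
}
```

With an RGB pixel stored as a vec4 whose fourth component is 0, the first three outputs are the same as the original 3x3 product, which is why the padding costs only a little extra storage.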

It doesn't seem to be mentioned, so I thought I'd add it: in modern games development, I think at least some of the assembly being written isn't for the CPU at all. It's for the GPU, in the form of shader programs.
This might be needed for all sorts of reasons, sometimes simply because whatever higher-level shading language used doesn't allow the exact operation to be expressed in the exact number of instructions wanted, to fit some size-constraint, speed, or any combination. Just as usual with assembly-language programming, I guess.

Almost every medium-to-large game engine or library I've seen to date has some hand-optimized assembly versions available for matrix operations like 4x4 matrix concatenation. It seems that compilers inevitably miss some of the clever optimizations (reusing registers, unrolling loops in a maximally efficient way, taking advantage of machine-specific instructions, etc) when working with large matrices. These matrix manipulation functions are almost always "hotspots" on the profile, too.
I've also seen hand-coded assembly used a lot for custom dispatch -- things like FastDelegate, but compiler and machine specific.
Finally, if you have Interrupt Service Routines, asm can make all the difference in the world -- there are certain operations you just don't want occurring under interrupt, and you want your interrupt handlers to "get in and get out fast"... you know almost exactly what's going to happen in your ISR if it's in asm, and it encourages you to keep the bloody things short (which is good practice anyway).

I have only personally talked to one developer about his use of assembly.
He was working on the firmware that dealt with the controls for a portable mp3 player.
Doing the work in assembly had 2 purposes:
Speed: delays needed to be minimal.
Cost: by being minimal with the code, the hardware needed to run it could be slightly less powerful. When mass-producing millions of units, this can add up.

The only assembler coding I continue to do is for embedded hardware with scant resources. As leander mentions, assembly is still well suited to ISRs where the code needs to be fast and well understood.
A secondary reason for me is to keep my knowledge of assembly functional. Being able to examine and understand the steps which the CPU is taking to do my bidding just feels good.

Last time I wrote in assembler was when I could not convince the compiler to generate libc-free, position independent code.
Next time will probably be for the same reason.
Of course, I used to have other reasons.

A lot of people love to denigrate assembly language because they've never learned to code with it and have only vaguely encountered it, and it has left them either aghast or somewhat intimidated. Truly talented programmers will understand that it is senseless to bash C or assembly because they are complementary. In fact the advantage of one is the disadvantage of the other. The organized syntactic rules of C improve clarity but at the same time give up all the power assembly has from being free of any structural rules! C's statements are designed to produce block-structured code, which arguably forces clarity of programming intent, but this is a loss of power. C's structured control flow discourages jumping into the middle of an if/else if/else block. You cannot write two for loops on different variables that overlap each other, you cannot write self-modifying code (or not in a seamless, easy way), etc. Conventional programmers are spooked by the above, and would have no idea how to even use the power of these approaches, as they have been raised to follow conventional rules.
Here is the truth: today we have machines with the computing power to do much more than the applications we use them for, but the human brain is not capable of coding them in a rule-free environment (= assembly) and needs restrictive rules that greatly reduce the spectrum and simplify coding.
I have myself written code that cannot be written in C without becoming hugely inefficient because of the above-mentioned limitations. And I have not yet talked about speed, which most people think is the main reason for writing in assembly. Well, it is if your mind is limited to thinking in C; then you are the slave of your compiler forever. I always thought chess masters would be ideal assembly programmers, while the C programmers just play checkers ("Dames").

No longer speed, but Control. Speed will sometimes come from control, but it is the only reason to code in assembly. Every other reason boils down to control (i.e. SSE and other hand optimization, device drivers and device dependent code, etc.).

If I am able to outperform GCC and Visual C++ 2008 (known also as Visual C++ 9.0) then people will be interested in interviewing me about how it is possible.
This is why for the moment I just read things in assembly and just write __asm int 3 when required.
I hope this helps...

I've not written in assembly for a few years, but the two reasons I used to were:
The challenge of the thing! I went through a several-month period years ago when I'd write everything in x86 assembly (the days of DOS and Windows 3.1). It basically taught me a chunk of low level operations, hardware I/O, etc.
For some things it kept size small (again DOS and Windows 3.1 when writing TSRs).
I keep looking at coding assembly again, and it's nothing more than the challenge and joy of the thing. I have no other reason to do so :-)

I once took over a DSP project which the previous programmer had written mostly in assembly code, except for the tone-detection logic which had been written in C, using floating-point (on a fixed-point DSP!). The tone detection logic ran at about 1/20 of real time.
I ended up rewriting almost everything from scratch. Almost everything was in C except for some small interrupt handlers and a few dozen lines of code related to interrupt handling and low-level frequency detection, which runs more than 100x as fast as the old code.
An important thing to bear in mind, I think, is that in many cases, there will be much greater opportunities for speed enhancement with small routines than large ones, especially if hand-written assembler can fit everything in registers but a compiler wouldn't quite manage. If a loop is large enough that it can't keep everything in registers anyway, there's far less opportunity for improvement.

The Dalvik VM that interprets the bytecode for Java applications on Android phones uses assembler for the dispatcher. This movie (about 31 minutes in, but it's worth watching the whole movie!) explains how "there are still cases where a human can do better than a compiler".

I don't, but I've made it a point to at least try, and try hard, at some point in the future (soon, hopefully). It can't be a bad thing to get to know more of the low-level stuff and how things work behind the scenes when I'm programming in a high-level language. Unfortunately, time is hard to come by with a full-time job as a developer/consultant and a parent. But I will give it a go in due time, that's for sure.
