Reverse engineering a firmware - what's up with every fourth byte?

Reverse engineering a firmware - what's up with every fourth byte? - arm

So I decided to grab my tools and analyze a router firmware. It went pretty okay up to the point where I had to find segments manually. I wouldn't bother you with it and i really don't want to ask about hacking anything or to do a favor for me. There is a pattern I'm sure someone could explain to me.
Looking at the hexdump, all i see is this:
There are strings that break the pattern but it goes all the way down almost to the end of the file.
what on earth can cause this pattern?
(if anyone's willing to help but needs more info: VxWorks 5.5.1 / probably ARM-9E CPU)

it is an arm, go look at the arm documentation you will see that for the 32 bit (non-thumb) arm instructions the first four bits are the condition code. The code 0b1110 is "ALWAYS" most of the time you dont do conditional execution so most arm instructions start with 0xE. makes it very easy to pick out an arm binary. the 16 bit thumb instructions also have a similar pattern but for different reasons, then if you add in thumb2 it changes that some...

Thats just due to how ARMs op codes are mapped and is actually helps me "eyeball" a dump to see if its ARM code.
I would suggest you go through part of the ARM Architecture Manual to see how op codes are generated. particularly conditionals. the E is created when you always want something to happen

Related

carry-less multiplication optimization for ECC over GF(2^m) in MIRACL

Link to MIRACL crypto library by CertiVox
Following the instructions in fastgf2m.txt, I've been able to get everything to compile. However, after execution, the benchmark (bmark.exe) program halts when evaluating curves over GF(2^m) with error, "This is not a point on the curve!"
I am able to get everything to work without the optimization but I'm unsure where the problem exists. I haven't modified any curve parameters and followed instructions in the distribution. I'm compiling on 64-bit Windows 8.1, on an Intel i7-3520M.
If anyone has any advice on how to correct this, it would be greatly appreciated.
Thanks!!

The method outlined in fastgf2m.txt is for generating unrolled code associated with a fixed m value determined at compile time. The bmark program changes m at runtime, and so the unrolled code will often not be correct in this case. The documentation could be clearer on this point.
Also make sure your processor does support the PCLMULQDQ instruction - many older processors will not.
It might be better to test the method on the ecsgen2/ecssign2/ecsver2 programs to implement ECDSA over GF(2^283) for example.

I want to create a simple assembler in C. Where should I begin? [duplicate]

This question already has answers here:
Building an assembler
(4 answers)
How Do You Make An Assembler? [closed]
(4 answers)
Closed 9 years ago.
I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?

I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set you can pick up and learn the syntax for another instruction set in short order (an evening/afternoon), at least the lions share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers for that instruction set who have not examined the microcode at that level. msp430, pdp11, and thumb (not thumb2 extensions) (or mips or openrisc) are all good places to start, not a lot of instructions, not overly complicated, etc.
I recommend a disassembler first, and with that a fixed length instruction set like arm or thumb or mips or openrisc, etc. If not then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and with pencil and paper understand the relationship between the machine code and the assembly, in particular the branches, they usually have one or more quirks like the program counter is an instruction or two ahead when the offset is added, to gain another bit they sometimes measure in whole instructions not bytes.
It is pretty easy to brute force parse the text with a C program to read the instructions. A harder task but perhaps as educational, would be to use bison/flex and learn that programming language to allow those tools to create (an even more extreme brute force) parser which then interfaces to your code to tell you what was found where.
The assembler itself is pretty straight forward, just read the ascii and set the bits in the machine code. Branches and other pc relative instructions are a little more painful as they can take multiple passes through the source/tables to completely resolve.
mov r0,r1
mov r2 ,#1
the assembler begins parsing the text for a line (being defined as the bytes that follow a carriage return 0xD or line feed 0xA), discard the white space (spaces and tabs) until you get to something non white space, then strncmp that with the known mnemonics. if you hit one then parse the possible combinations of that instruction, in the simple case above after the mov skip over the white space to non-white space, perhaps the first thing you find must be a register, then optional white space, then a comma. remove the whitespace and comma and compare that against a table of strings or just parse through it. Once that register is done then go past where the comma is found and lets say it is either another register or an immediate. If immediate lets say it has to have a # sign, if register lets say it has to start with a lower or upper case 'r'. after parsing that register or immediate, then make sure there is nothing else on the line that shouldnt be on the line. build the machine code for this instruciton or at least as much as you can, and move on to the next line. It may be tedious but it is not difficult to parse ascii...
at a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as being incomplete, the pc-relative instructions to be completed on a future pass. you will also want a table/array that collects the labels you find and the address/offset in the machine code table where found. As well as the labels used in the instruction as a destination/source and the offset in the table/array holding the partially complete instruction they go with. after the first pass, then go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition address/offset to compute the distance to the instruction in question and then finish creating the machine code for that instruction. (some disassembly may be required and/or use some other method for remembering what kind of encoding it was when you come back to it later to finish building the machine code).
The next step is allowing for multiple source files, if that is something you want to allow. Now you have to have labels that dont get resolved by the assembler so you have to leave placeholders in the output and make some flavor of the longest jump/branch instruction because you dont know how far away the destination will be, expect the worse. Then there is the output file format you choose to create/use, then there is the linker which is mostly simple, but you have to remember to fill in the machine code for the final pc relative instructions, no harder than it was in the assembler itself.
Note, writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it, separate thing, different problems. Actually if you want to make a new programming language just use an existing assembler for an existing instruction set. Not required of course, but most teachings and tutorials are going to use the bison/flex approach for programming languages, and there are many college course lecture notes/resources out there for beginning compiler classes that you can just use to get you started then modify the script to add the features of your language. The middle and back ends are the bigger challenge than the front end. there are many books on this topic and many online resources as well. As mentioned in another answer llvm is not a bad place to create a new programming language the middle and backends are done for you, you only need to focus on the programming language itself, the front end.

You should look at LLVM, llvm is a modular compiler back end, the most popular front end is Clang for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and just focus on that, ignoring all of the others. You want to create your own language, write a parser that generates the LLVM internal representation code, and for free you get all of the middle layer target independent optimisations and compiling to many different targets. Interesting in a compiler for some exotic CPU, write a compiler backend that takes the LLVM intermediated code and generates your assemble. Have some ideas about optimisation technics, automatic threading perhaps, write a middle layer which processes LLVM intermediate code. LLVM is a collection of libraries not a standalone binary like GCC, and so it is very easy to use in you own projects.

What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch and there's no need to. Decide what your goal is and use the best tools available to achieve it.

x86 assembly instruction execution count

Hello everyone
I have a code and I want to find the number of times each assembly line executed. I dont care whether through profiling or emulation, yet I want high precision results. I came across a forum once that gave some scripting code to do so, yet I lost the link. Can anyone help me brainstorm some ways to do so?
Regards
Edit:
Okey I think I am halfway there. I have done some research on the BTS (Branch Trace Store) provided by Intel Manual 3A section 16.4.5 according to one the posts. This feature provides branch history. So now I need your help to find if there are any open source scripts or tools to do this. Waiting to check your feedback
cheers=)!

If your processor supports it, you can enable Branch Trace Store (BTS). BTS stores a log of all of the taken branches in a predefined area in memory. Each entry contains the branch source and destination. Using that, you can count how many times you were in each code segment.
Look at volume 3A of the Intel Software Developer's Manual, section 16.4.5 (in the current edition) for details on how to enable it.

If you do not care about performance, you can do a small trick to count that. Raise a single step exception and upon entering your custom seh handler, raise another one and step over to the next command.
Maybe some profiler tools like pin or valgrind do that for you in an easier manner. I would suggest that you take a look.

One (although slow) method would be to write your own debugger. It would then breakpoint the entry point of your program, and when it was hit it would set the trace flag on the EFlags in the context, so it would break to the debugger on the next instruction as well. You could then use a hash table with the EIP to count the number of times hit.
Only problem is that the overhead would be extreme and the application would run very slowly.

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.

simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years, for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARMs compiler produced significantly better code than gcc. As much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc by in terms of performance and overall use for cross compiling embedded code. Try different versions of gcc, 3.x and 4.x if you are using gcc. Metaware's compiler and arms adt ran circles around gcc3.x, gcc3.x will give gcc4.x a run for its money with arm code, for thumb code gcc4.x is better and for thumb2 (which doesnt apply to you) gcc4.x also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full program optimization in addition to infinitely more tuning knobs than gcc. Despite that the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didnt try the n factoral number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles, my theory is do no optimization on the C to bc steps, link all the bc together then do a single optimization pass on the whole program, the allow the default optimization when llc takes it to the target).
By the same token simply knowing your compiler and the optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11 arr you compiling for arm11 or generic arm? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family (armv6 for example) over the generic armv4 (ARM7) that is often chosen as the default. Knowing to use -O2 or -O3 if you are brave.
It is often not the case but switching to thumb mode can improve performance for specific platforms. Doesnt apply to you but the gameboy advance is a perfect example, loaded with non-zero wait state 16 bit busses. Thumb has a handful of a percent overhead because it takes more instructions to do the same thing, but by increasing the fetch times, and taking advantage of some of the sequential read features of the gba thumb code can run significantly faster than arm code for the same source code.
having an arm11 you probably have an L1 and maybe L2 cache, are they on? Are they configured? Do you have an mmu and is your heavy use memory cached? or are you running zero wait state memory and dont need a cache and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often dont realize that when you use a cache simply adding a single up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, a few words) you can change your codes execution speed by as much as 10 to 20 percent. Where those cache line reads hit in heavily used functions/loops makes a big difference. Even saving one cache line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or 2 to 1 for example).
Knowing your architecture, both the processor and your memory environment is where the tuning if any would start. Most C libraries if you are high level enough to use one (I often dont use a C library as I run without an operating system and with very limited resources) both in their C code and sometimes add some assembler to make bottleneck routines like memcpy, much faster. If your programs are operating on aligned 32 or even better 64 bit addresses, and you adjust even if it means using a handful of bytes more memory for every structure/array/memcpy to be an integral multiple of 32 bits or 64 bits you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them, I certainly dont with embedded code) size aligned, even if you waste memory, getting elements aligned, consider using 32 bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too btw). As with the GBA example above looking at specific functions that either by profiling or intuition you know are not being implemented in a manner that takes advantage of your processor or platform or libraries you may want to turn to assembler either from scratch or compiling from C initially then disassembling and hand tuning. Memcpy is a good example you may know your systems memory performance and may chose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
Likewise mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isnt necessarily true, depends on how deeply embedded and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive, but at the same time, depending on the code within a function that has now grown in size by avoiding functions, you may create the problem you were trying to avoid, evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fits in registers to having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain compile and disassemble (and look at register usage, how often it fetches memory or writes to memory). You can and will find places where you need to take a well used loop and make it its own function even though the function call has a penalty because by doing that the compiler can better optimize the loop and not evict/reuse registers and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug, turn all of the compile for debug options off. You may already know that code compile for debug that runs without bugs doesnt mean it is debugged, compiling for debug and using debuggers hide bugs leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version both for performance and finding bugs in your code.
Most instruction sets do not have a divide function. Avoid using divides or modulo in your code as much as humanly possible they are performance killers. Naturally this is not the case for powers of two, to save the compiler and to mentally avoid divides and modulos try to use shifts and ands. Multplies are easier and more often found in instruction sets, but are still costly. This is a good case to write assembler to do your multiplies instead of letting the C copiler do it. The arm multiply is a 32bit * 32bit = 32 bit so to do accurate math without overflowing there has to be extra C code wrapped around the multiply, if you already know you wont overflow, burn the registers for a function call and do the multiply in assembler (for the arm).
Likewise most instruction sets do not have a floating point unit, with yours you might, even so avoid float if at all possible. If you have to use float that is a whole other pandora's box of performance issues. Most folks dont see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating point accuracy and how good or bad the C libraries are just trying to get your constants into floating point form. Again float is a whole other long discussion on performance problems.
I am a product of Michael Abrash (I actually have a print copy of zen of assembly language) and the bottom line is time your code. Come up with an accurate way to time the code, you may think you know where the bottlenecks are and you may think you know your architecture but trying different things even if you think they are wrong, and timing them you may find and eventually have to figure out the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this, all the other work you have done for performance can be instantly erased by not having a good alignment with the cache, this also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache line alignments.

Code Review:
What are good code review techniques ?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klockworks
Tools for dynamic analysis : Valgrind, purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
Steps are same
Apart from what is listed is point 1, there are tools like memcheck etc.
There is a big list here based on platform

Phew!! Quite a big question!
What are generally the steps to
analyze and improve performance of C
applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. it doesn't do what you want it to do! These are best done by code reviews first, then testing. Performance is often improved by by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), use of fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units but once run can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all to the required, imagination to think up of scenarios that may impair performance and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance ananlysis can be different on embedded systems to applications systems; with the very broad brush that "embedded" now covers it depends how hardware-centric you are. It can be done using profilers, if you want a more cheap and chearful method then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
What tools are out there which can
help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!

My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).

It is a difficult question to be answered shortly since various techniques have been proposed such as flowchart and state diagram,so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 --Joseph Yiu
C Programming for Embedded Systems --Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ --Michael Barr
An Embedded Software Primer --David E, Simon
Embedded Microprocessor Systems 3rd Edition --Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components --G. Nicolescu & A.A Jerraya
Embedded Systems: Modeling, Technology and Applications --Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture --Graham Wilson
Designing Embedded Hardware --John Catsoulis

You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No they don't.
Depending on the platform you're developing onto :
Windows : AMD Code Analyst, VTune, Sleepy
Linux : valgrind / callgrind / cachegrind
Mac : the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.

Why do you program in assembly? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have a question for all the hardcore low level hackers out there. I ran across this sentence in a blog. I don't really think the source matters (it's Haack if you really care) because it seems to be a common statement.
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
As far as the assembly goes - is the code written in assembly because you don't want a compiler emitting extra instructions or using excessive bytes, or are you using better algorithms that you can't express in C (or can't express without the compiler mussing them up)?
I completely get that it's important to understand the low-level stuff. I just want to understand the why program in assembly after you do understand it.

I think you're misreading this statement:
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
Games (and most programs these days) aren't "written in assembly" the same way they're "written in C++". That blog isn't saying that a significant fraction of the game is designed in assembly, or that a team of programmers sit around and develop in assembly as their primary language.
What this really means is that developers first write the game and get it working in C++. Then they profile it, figure out what the bottlenecks are, and if it's worthwhile they optimize the heck out of them in assembly. Or, if they're already experienced, they know which parts are going to be bottlenecks, and they've got optimized pieces sitting around from other games they've built.
The point of programming in assembly is the same as it always has been: speed. It would be ridiculous to write a lot of code in assembler, but there are some optimizations the compiler isn't aware of, and for a small enough window of code, a human is going to do better.
For example, for floating point, compilers tend to be pretty conservative and may not be aware of some of the more advanced features of your architecture. If you're willing to accept some error, you can usually do better than the compiler, and it's worth writing that little bit of code in assembly if you find that lots of time is spent on it.
Here are some more relevant examples:
Examples from Games
Article from Intel about optimizing a game engine using SSE intrinsics. The final code uses intrinsics (not inline assembler), so the amount of pure assembly is very small. But they look at the assembler output by the compiler to figure out exactly what to optimize.
Quake's fast inverse square root. Again, the routine doesn't have assembler in it, but you need to know something about architecture to do this kind of optimization. The authors know what operations are fast (multiply, shift) and which are slow (divide, sqrt). So they come up with a very tricky implementation of square root that avoids the slow operations entirely.
High-Performance Computing
Outside the domain of games, people in scientific computing frequently optimize the crap out of things to get them to run fast on the latest hardware. Think of this as games where you can't cheat on the physics.
A great recent example of this is Lattice Quantum Chromodynamics (Lattice QCD). This paper describes how the problem pretty much boils down to one very small computational kernel, which was optimized heavily for PowerPC 440's on an IBM Blue Gene/L. Each 440 has two FPUs, and they support some special ternary operations that are tricky for compilers to exploit. Without these optimizations, Lattice QCD would've run much slower, which is costly when your problem requires millions of CPU hours on expensive machines.
If you are wondering why this is important, check out the article in Science that came out of this work. Using Lattice QCD, these guys calculated the mass of a proton from first principles, and showed last year that 90% of the mass comes from strong force binding energy, and the rest from quarks. That's E=mc2 in action. Here's a summary.
For all of the above, the applications are not designed or written 100% in assembly -- not even close. But when people really need speed, they focus on writing the key parts of their code to fly on specific hardware.

I have not coded in assembly language for many years, but I can give several reasons that I frequently saw:
Not all compilers can make use of certain CPU optimizations and instruction set (e.g., the new instruction sets that Intel adds once in a while). Waiting for compiler writers to catch up means losing a competitive advantage.
Easier to match actual code to known CPU architecture and optimization. For example, things you know about the fetching mechanism, caching, etc. This is supposed to be transparent to the developer, but the fact is that it is not, that's why compiler writers can optimize.
Certain hardware level accesses are only possible/practical via assembly language (e.g., when writing device driver).
Formal reasoning is sometimes actually easier for the assembly language than for the high-level language since you already know what the final or almost final layout of the code is.
Programming certain 3D graphic cards (circa late 1990s) in the absence of APIs was often more practical and efficient in assembly language, and sometimes not possible in other languages. But again, this involved really expert-level games based on the accelerator architecture like manually moving data in and out in certain order.
I doubt many people use assembly language when a higher-level language would do, especially when that language is C. Hand-optimizing large amounts of general-purpose code is impractical.

There is one aspect of assembler programming which others have not mentioned - the feeling of satisfaction you get knowing that every single byte in an application is the result of your own effort, not the compiler's. I wouldn't for a second want to go back to writing whole apps in assembler as I used to do in the early 80s, but I do miss that feeling sometimes...

Usually, a layman's assembly is slower than C (due to C's optimization) but many games (I distinctly remember Doom) had to have specific sections of the game in Assembly so it would run smoothly on normal machines.
Here's the example to which I am referring.

I started professional programming in assembly language in my very first job (80's). For embedded systems the memory demands - RAM and EPROM - were low. You could write tight code that was easy on resources.
By the late 80's I had switched to C. The code was easier to write, debug and maintain. Very small snippets of code were written in assembler - for me it was when I was writing the context switching in an roll-your-own RTOS. (Something you shouldn't do anymore unless it is a "science project".)
You will see assembler snippets in some Linux kernel code. Most recently I've browsed it in spinlocks and other synchronization code. These pieces of code need to gain access to atomic test-and-set operations, manipulating caches, etc.
I think you would be hard pressed to out-optimize modern C compilers for most general programming.
I agree with #altCognito that your time is probably better spent thinking harder about the problem and doing things better. For some reason programmers often focus on micro-efficiencies and neglect the macro-efficiencies. Assembly language to improve performance is a micro-efficiency. Stepping back for a wider view of the system can expose the macro problems in a system. Solving the macro problems can often yield better performance gains.
Once the macro problems are solved then collapse to the micro level.
I guess micro problems are within the control of a single programmer and in a smaller domain. Altering behavior at the macro level requires communication with more people - a thing some programmers avoid. That whole cowboy vs the team thing.

"Yes". But, understand that for the most part the benefits of writing code in assembler are not worth the effort. The return received for writing it in assembly tends to be smaller than the simply focusing on thinking harder about the problem and spending your time thinking of a better way of doing thigns.
John Carmack and Michael Abrash who were largely responsible for writing Quake and all of the high performance code that went into IDs gaming engines go into this in length detail in this book.
I would also agree with Ólafur Waage that today, compilers are pretty smart and often employ many techniques which take advantage of hidden architectural boosts.

These days, for sequential codes at least, a decent compiler almost always beats even a highly seasoned assembly-language programmer. But for vector codes it's another story. Widely deployed compilers don't do such a great job exploiting the vector-parallel capabilities of the x86 SSE unit, for example. I'm a compiler writer, and exploiting SSE tops my list of reasons to go on your own instead of trusting the compiler.

SSE code works better in assembly than compiler intrinsics, at least in MSVC. (i.e. does not create extra copies of data )

I've three or four assembler routines (in about 20 MB source) in my sources at work. All of them are SSE(2), and are related to operations on (fairly large - think 2400x2048 and bigger) images.
For hobby, I work on a compiler, and there you have more assembler. Runtime libraries are quite often full of them, most of them have to do with stuff that defies the normal procedural regime (like helpers for exceptions etc.)
I don't have any assembler for my microcontroller. Most modern microcontrollers have so much peripheral hardware (interrupt controled counters, even entire quadrature encoders and serial building blocks) that using assembler to optimize the loops is often not needed anymore. With current flash prices, the same goes for code memory. Also there are often ranges of pin-compatible devices, so upscaling if you systematically run out of cpu power or flash space is often not a problem
Unless you really ship 100000 devices and programming assembler makes it possible to really make major savings by just fitting in a flash chip a category smaller. But I'm not in that category.
A lot of people think embedded is an excuse for assembler, but their controllers have more CPU power than the machines Unix was developed on. (Microchip coming
with 40 and 60 MIPS microcontrollers for under USD 10).
However a lot people are stuck with legacy, since changing microchip architecture is not easy. Also the HLL code is very architecture dependent (because it uses the hardware periphery, registers to control I/O, etc). So there are sometimes good reasons to keep maintaining a project in assembler (I was lucky to be able to setup affairs on a new architecture from scratch). But often people kid themselves that they really need the assembler.
I still like the answer a professor gave when we asked if we could use GOTO (but you could read that as ASSEMBLER too): "if you think it is worth writing a 3 page essay on why you need the feature, you can use it. Please submit the essay with your results. "
I've used that as a guiding principle for lowlevel features. Don't be too cramped to use it, but make sure you motivate it properly. Even throw up an artificial barrier or two (like the essay) to avoid convoluted reasoning as justification.

Some instructions/flags/control simply aren't there at the C level.
For example, checking for overflow on x86 is the simple overflow flag. This option is not available in C.

Defects tend to run per-line (statement, code point, etc.); while it's true that for most problems, assembly would use far more lines than higher level languages, there are occasionally cases where it's the best (most concise, fewest lines) map to the problem at hand. Most of these cases involve the usual suspects, such as drivers and bit-banging in embedded systems.

If you were around for all the Y2K remediation efforts, you could have made a lot of money if you knew Assembly. There's still plenty of legacy code around that was written in it, and that code occasionally needs maintenance.

Another reason could be when the available compiler just isn't good enough for an architecture and the amount of code needed in the program is not that long or complex as for the programmer to get lost in it. Try programming a microcontroller for an embedded system, usually assembly will be much easier.

Beside other mentioned things, all higher languages have certain limitations. Thats why some people choose to programm in ASM, to have full control over their code.
Others enjoy very small executables, in the range of 20-60KB, for instance check HiEditor, which is implemented by author of the HiEdit control, superb powerfull edit control for Windows with syntax highlighting and tabs in only ~50kb). In my collection I have more then 20 such gold controls from Excell like ssheets to html renders.

I think a lot of game developers would be surprised at this bit of information.
Most games I know of use as little assembly as at all possible. In some cases none at all, and at worst, one or two loops or functions.
That quote is over-generalized, and nowhere near as true as it was a decade ago.
But hey, mere facts shouldn't hinder a true hacker's crusade in favor of assembly. ;)

If you are programming a low end 8 bit microcontroller with 128 bytes of RAM and 4K of program memory you don't have much choice about using assembly. Sometimes though when using a more powerful microcontroller you need a certain action to take place at an exact time. Assembly language comes in useful then as you can count the instructions and so measure the clock cycles used by your code.

Games are pretty performance hungry and although in the meantime the optimizers are pretty good a "master programmer" is still able to squeeze out some more performance by hand coding the right parts in assembly.
Never ever start optimizing your program without profiling it first. After profiling should be able to identify bottlenecks and if finding better algorithms and the like don't cut it anymore you can try to hand code some stuff in assembly.

Aside from very small projects on very small CPUs, I would not set out to ever program an entire project in assembly. However, it is common to find that a performance bottleneck can be relieved with the strategic hand coding of some inner loops.
In some cases, all that is really required is to replace some language construct with an instruction that the optimizer cannot be expected to figure out how to use. A typical example is in DSP applications where vector operations and multiply-accumulate operations are difficult for an optimizer to discover, but easy to hand code.
For example certain models of the SH4 contain 4x4 matrix and 4 vector instructions. I saw a huge performance improvement in a color correction algorithm by replacing equivalent C operations on a 3x3 matrix with the appropriate instructions, at the tiny cost of enlarging the correction matrix to 4x4 to match the hardware assumption. That was achieved by writing no more than a dozen lines of assembly, and carrying matching adjustments to the related data types and storage into a handful of places in the surrounding C code.

It doesn't seem to be mentioned, so I thought I'd add it: in modern games development, I think at least some of the assembly being written isn't for the CPU at all. It's for the GPU, in the form of shader programs.
This might be needed for all sorts of reasons, sometimes simply because whatever higher-level shading language used doesn't allow the exact operation to be expressed in the exact number of instructions wanted, to fit some size-constraint, speed, or any combination. Just as usual with assembly-language programming, I guess.

Almost every medium-to-large game engine or library I've seen to date has some hand-optimized assembly versions available for matrix operations like 4x4 matrix concatenation. It seems that compilers inevitably miss some of the clever optimizations (reusing registers, unrolling loops in a maximally efficient way, taking advantage of machine-specific instructions, etc) when working with large matrices. These matrix manipulation functions are almost always "hotspots" on the profile, too.
I've also seen hand-coded assembly used a lot for custom dispatch -- things like FastDelegate, but compiler and machine specific.
Finally, if you have Interrupt Service Routines, asm can make all the difference in the world -- there are certain operations you just don't want occurring under interrupt, and you want your interrupt handlers to "get in and get out fast"... you know almost exactly what's going to happen in your ISR if it's in asm, and it encourages you to keep the bloody things short (which is good practice anyway).

I have only personally talked to one developer about his use of assembly.
He was working on the firmware that dealt with the controls for a portable mp3 player.
Doing the work in assembly had 2 purposes:
Speed: delays needed to be minimal.
Cost: by being minimal with the code, the hardware needed to run it could be slightly less powerful. When mass-producing millions of units, this can add up.

The only assembler coding I continue to do is for embedded hardware with scant resources. As leander mentions, assembly is still well suited to ISRs where the code needs to be fast and well understood.
A secondary reason for me is to keep my knowledge of assembly functional. Being able to examine and understand the steps which the CPU is taking to do my bidding just feels good.

Last time I wrote in assembler was when I could not convince the compiler to generate libc-free, position independent code.
Next time will probably be for the same reason.
Of course, I used to have other reasons.

A lot of people love to denigrate assembly language because they've never learned to code with it and have only vaguely encountered it and it has left them either aghast or somewhat intimidated. True talented programmers will understand that it is senseless to bash C or Assembly because they are complimentary. in fact the advantage of one is the disadvantage of the other. The organized syntaxic rules of C improves clarity but at the same gives up all the power assembly has from being free of any structural rules ! C code instruction are made to create non-blocking code which could be argued forces clarity of programming intent but this is a power loss. In C the compiler will not allow a jump inside an if/elseif/else/end. Or you are not allowed to write two for/end loops on diferent variables that overlap each other, you cannot write self modifying code (or cannot in an seamless easy way), etc.. conventional programmers are spooked by the above, and would have no idea how to even use the power of these approaches as they have been raised to follow conventional rules.
Here is the truth : Today we have machine with the computing power to do much more that the application we use them for but the human brain is too incapable to code them in a rule free coding environment (= assembly) and needs restrictive rules that greatly reduce the spectrum and simplifies coding.
I have myself written code that cannot be written in C code without becoming hugely inefficient because of the above mentionned limitations. And i have not yet talked about speed which most people think is the main reason for writting in assembly, well it is if you mind is limited to thinking in C then you are the slave of you compiler forever. I always thought chess players masters would be ideal assembly programmers while the C programmers just play "Dames".

No longer speed, but Control. Speed will sometimes come from control, but it is the only reason to code in assembly. Every other reason boils down to control (i.e. SSE and other hand optimization, device drivers and device dependent code, etc.).

If I am able to outperform GCC and Visual C++ 2008 (known also as Visual C++ 9.0) then people will be interested in interviewing me about how it is possible.
This is why for the moment I just read things in assembly and just write __asm int 3 when required.
I hope this help...

I've not written in assembly for a few years, but the two reasons I used to were:
The challenge of the thing! I went through a several-month period years
ago when I'd write everything in x86 assembly (the days of DOS and Windows
3.1). It basically taught me a chunk of low level operations, hardware I/O, etc.
For some things it kept size small (again DOS and Windows 3.1 when writing TSRs)
I keep looking at coding assembly again, and it's nothing more than the challenge and joy of the thing. I have no other reason to do so :-)

I once took over a DSP project which the previous programmer had written mostly in assembly code, except for the tone-detection logic which had been written in C, using floating-point (on a fixed-point DSP!). The tone detection logic ran at about 1/20 of real time.
I ended up rewriting almost everything from scratch. Almost everything was in C except for some small interrupt handlers and a few dozen lines of code related to interrupt handling and low-level frequency detection, which runs more than 100x as fast as the old code.
An important thing to bear in mind, I think, is that in many cases, there will be much greater opportunities for speed enhancement with small routines than large ones, especially if hand-written assembler can fit everything in registers but a compiler wouldn't quite manage. If a loop is large enough that it can't keep everything in registers anyway, there's far less opportunity for improvement.

The Dalvik VM that interprets the bytecode for Java applications on Android phones uses assembler for the dispatcher. This movie (about 31 minutes in, but its worth watching the whole movie!) explains how
"there are still cases where a human can do better than a compiler".

I don't, but I've made it a point to at least try, and try hard at some point in the furture (soon hopefully). It can't be a bad thing to get to know more of the low level stuff and how things work behind the scenes when I'm programming in a high level language. Unfortunately time is hard to come by with a full time job as a developer/consultant and a parent. But I will give at go in due time, that's for sure.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight