Difference between ARM Simulator and Verilog Simulator? - arm

Basically , I want to know What an ARM Simulator is ? IS it something like an Assembly language simulator ? If so what are the differences in comparison with Verilog Simulators?

Your question is quite broad/vague. but here goes.
An arm simulator, from the context of what you are asking, is likely an instruction set simulator. Software that just like a processor decodes the instructions, keeps track of the registers, and simulates the execution (if an instruction says add 1 to r1 then you have a variable in software that represents r1 and you add one to it).
A verilog simulator, is not really any different just a different language. verilog is a hardware design language, before you can simulate it you need to compile it. Just like any other high level language it needs to be compiled down to something related to the target. simulators will have their own target logic blocks. The verilog is compiled down to these blocks then that logic is simulated, not unlike the arm simulator. For each clock cycle you update the inputs to each logic element based on the output of the connected block from the prior cycle, then evaluate each logic element and determine the outputs. repeat forever. for each verilog simulator you have a different target at its core, partly why you (can) get different results from each simulator for the same code. Likewise when you compile for the actual target, fpga, asic, etc, it is compiled differently than the simulator (or can be, depends on the environment, simulator, etc).
There is no magic at all to any of these simulators, an instruction set simulator is generally easy to write, a worthwhile task for anyone wanting to get a good strong knowledge of an instruction set or how computers work (start with something small like lc-3, should take less than half an hour). A FAST simulator, that is another story, but a FUNCTIONAL simulator is fairly easy to write. Once compiled to a netlist of simple logic components a verilog simulator is probaby easy as well the biggest task though is the volume of signals and items to evaluate and parsing the code to get at the list of signals and logic functions and who is tied to what. Not as easy as an instruction set simulator, but quite understandable how it works and what the task would be...Verilator is pretty cool as it turns it into lines of C++ code, MANY lines, and a good sized project can take many hours to days to compile even on a screaming machine. (hint turn off waveforms to cut the compile time way down). But the task is understandable when you look at what is going on.

Related

Low level languages and their dependencies

I am trying to understand exactly what it means that low-level languages are machine-dependent.
Let's take for example C, well if it is machine-dependent does it mean that if it was compiled on one computer it might not be able to run on another?
In the end processors executes machine code which is basicly a collection of binary numbers. The processor decode each binary number to figure out what it is supposed to do. One binary number could mean "Add register X to register Y and store the result in register Z". Another binary number could mean "Store the content of register X into the memory address held by register Y". And so on...
The complete description of these decoding rules (i.e. binary number into operation) represents the processors instruction set (aka ISA).
A low level language is a language where the code you can write maps very closely to the specific processors instruction set. Assembly is one obvious example. Since different processor may have different instruction sets, it's clear that an assembly program written for one processors ISA can't be used on a processor with a different ISA.
Let's take for example C, well if it is machine-dependent does it mean that if it was compiled on one computer it might not be able to run on another?
Correct. A program compiled for one processor (family) can't run on another processor with (completely) different ISA. The program needs to be recompiled.
Also notice that the target OS also plays a role. If you use the same processor but use different OS you'll also need to recompile.
There are at least 3 different kind of languages.
A languages that is so close to the target systems ISA that the source code can only be used on that specific target. Example: Assembly
A language that allows you to write code that can be used on many different targets using a target specific compilation. Example: C
A language that allows you to write code that can be used on many different targets without a target specific compilation. These still require some kind of target specific runtime environment to be installed. Example: Java.
High-Level languages are portable, meaning every architecture can run high-level programs but, compared to low-level programs (like written in Assembly or even machine code), they are less efficient and consume more memory.
Low-level programs are known as "closer to the hardware" and so they are optimized for a certain type of hardware architecture/processor, being faster programs, but relatively machine-dependant or not-very-portable.
So, a program compiled for a type of processor it's not valid for other types; it needs to be recompiled.
In the before
When the first processors came out, there was no programming language whatsoever, you had a very long and very complicated documentation with a list of "opcodes": the code you had to put into memory for a given operation to be executed in your processor. To create a program, you had to put a long string of number in memory, and hope everything worked as documented.
Later came Assembly languages. The point wasn't really to make algorithms easier to implement or to make the program readable by any human without any experience on the specific processor model you were working with, it was created to save you from spending days and days looking up things in a documentation. For this reason, there isn't "an assembly language" but thousands of them, one per instruction set (which, at the time, basically meant one per CPU model)
At this point in time, all languages were platform-dependent. If you decided to switch CPUs, you'd have to rewrite a significant portion (if not all) of your code. Recognizing that as a bit of a problem, someone created a the first platform-independent language (according to this SE question it was FORTRAN in 1954) that could be compiled to run on any CPU architecture as long as someone made a compiler for it.
Fast forward a bit and C was invented. C is a platform-independent programming language, in the sense that any C program (as long as it conforms with the standard) can be compiled to run on any CPU (as long as this CPU has a C compiler). Once a C program has been compiled, the resulting file is a platform-dependent binary and will only be able to run on the architecture it was compiled for.
C is platform-dependent
There's an issue though: a processor is more than just a list of opcodes. Most processors have hardware control devices like watchdogs or timers that can be completely different from one architecture to another, even the way to talk to other devices can change completely. As such, if you want to actually run a program on a CPU, you have to include things that make it platform-dependent.
A real life example of this is the Linux kernel. The majority of the kernel is written in C but there's still around 1% written in different kinds of assembly. This assembly is required to do things such as initialize the CPU or use timers. Using this hack means Linux can run on your desktop x86_64 CPU, your ARM Android phone or a RISCV SoC but adding any new architecture isn't as simple as just "compile it with your architecture's compiler".
So... Did I just say the only way to run a platform-independent on an actual processor is to use platform-dependent code? Yes, for most architectures, you have to.
Or is it?
But there's a catch! That's only true if you want to run you code on bare metal (meaning: without an OS). One of the great things of using an OS is how abstracted everything is: you don't need to know how the kernel initializes the CPU, nor do you need to know how it gets its clock, you just need to know how to access those abstracted resources.
But the way of accessing resources dependent on the OS, aren't we back to square one? We could be, if not for the standard library! This library is used to access functions like printf in a defined way. It doesn't matter if you're working on a Linux running on PowerPC or on an ARM Windows, printf will always print things on the standard output the same way.
If you write standard C using only the standard library (and intend for your program to run in an OS) C is completely platform-independent!
EDIT: As said in the comments below, even that is not enough. It doesn't really have anything to do with specific CPUs but some things such as the system function or the size of some types are documented as implementation-defined. To make C really platform independent you need to make sure to only use well defined functions of the STL and learn some best practice (never rely on sizeof(int)==4 for instance).
Thinking about 'what's a program' might help you understand your question. Is a program a collection of text (that you've typed in or otherwise manufactured) or is it something you run? Is it both?
In the case of a 'low-level' language like C I'd say that the text is the program source, and that this is turned into a program (aka executable) by a compiler. A program is something you can run. You need a C compiler for a system to be able to make the program source into a program for that system. Once built the program can only be run on systems close to the one it was compiled for. However there is a more interesting, if more difficult question: can you at least keep the program source the same, so that all you need to do is recompile? The answer to this is 'sort-of No' I sort-of think. For example you can't, in pure C, read the state of the shift key. Of course operating systems provide such facilities and you can interface to those in C, but then such code depends on the OS. There might be libraries (eg the curses library) that provide such facilities for many OS and that can help to reduce the dependency, but no library can clain to portably cover all OS.
In the case of a 'higher-level' language like python I'd say the text is both the program and the program source. There is no separate compilation stage with such languages, but you do need an interpreter on a system to be able to run your python program on that system. However that this is happening may not be clear to the user as you may well seem to be able to run your python 'program' just by naming it like you run your C programs. But this, most likely comes down to the shell (the part of the OS that deals with commands) knowing about python programs and invoking the interpreter for you. It can appear then that you can run your python program anywhere but in fact what you can do is pass the program to any python interpreter.
In the zoo of programming there are not only many, very varied beasts, but new kinds of beasts arise all the time, and old beasts metamorphose. Terms like 'program', 'script' and even 'executable' are often used loosely.

Is assembly strictly required to make the "lowest" part of an operating system?

Im a mid-level(abstraction) programmer, and some months ago i started to think if i should reduce or increase abstraction(i've chosen to reduce).
Now, i think i've done most of the "research" about what i need, but still are a few questions remaining.
Right now while im "doing effectively nothing", im just reinforcing my C skills (bought "K&R C Programing Lang"), and im thinking to (after feel comfortable) start studying operating systems(like minix) just for learning purposes, but i have an idea stuck in my mind, and i don't really know if i should care.
In theory(i think, not sure), the higher level languages cannot refer to the hardware directly (like registers, memory locations, etc...) so the "perfect language" for the base would be assembly.
I already studied assembly(some time ago) just to see how it was (and i stopped in the middle of the book due to the outdated debugger that the book used(Assembly Language Step By Step, for Linux!)) but from what i have read, i din't liked the language a lot.
So the question is simple: Can an operating system(bootloader/kernel) be programmed without touching in a single line of assembly, and still be effective?
Even if it can, it will not be "cross-architecture", will it? (i386/arm/mips etc...)
Thanks for your support
You can do a significant amount of the work without assembly. Linux or NetBSD doesnt have to be completely re-written or patched for each of the many targets it runs on. Most of the code is portable and then there are abstraction layers and below the abstraction layer you find a target specific layer. Even within the target specific layers most of the code is not asm. I want to dispell this mistaken idea that in order to program registers or memory for a device driver for example that you need asm, you do not use asm for such things. You use asm for 1) instructions that a processor has that you cannot produce using a high level language. or 2) where high level language generated code is too slow.
For example in the ARM to enable or disable interrupts there is a specific instruction for accessing the processor state registers that you must use, so asm is required. but programming the interrupt controller is all done in the high level language. An example of the second point is you often find in C libraries that memcpy and other similar heavily used library functions are hand coded asm because it is dramatically faster.
Although you certainly CAN write and do anything you want in ASM, but you typically find that a high level language is used to access the "hardware directly (like registers, memory locations, etc...)". You should continue to re-inforce your C skills not just with the K&R book but also wander through the various C standards, you might find it disturbing how many "implementation defined" items there are, like bitfields, how variable sizes are promoted, etc. Just because a program you wrote 10 years ago keeps compiling and working using a/one specific brand of compiler (msvc, gcc, etc) doesnt mean the code is clean and portable and will keep working. Unfortunately gcc has taught many very bad programming habits that shock the user when the find out they didnt know the language a decade or so down the road and have to redo how they solve problems using that language.
You have answered your question yourself in "the higher level languages cannot refer to the hardware directly".
Whether you want it or not, at some point you will have to deal with assembly/machine code if you want to make an OS.
Interrupt and exception handlers will have to have some assembly code in them. So will need the scheduler (if not directly, indirectly). And the system call mechanism. And the bootloader.
What I've learned in the past reading websites and books is that:
a) many programmers dislikes assembly language because of the reasons we all know.
b) the main programming language for OS's seems to be C and even C++
c) assembly language can be used to 'speed up code' after profiling your source code in C or C++ (language doesn't matter in fact)
So, the combination of a mid level language and a low level language is in some cases inevitable. For example there is no use to speed up code for waiting on user input.
If it matters to build the shortest and fastest code for one specific range of computers (AMD, INTEL, ARM, DIGITAL-ALPHA, ...) then you should use assembler. My opinion...

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.
simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years, for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARMs compiler produced significantly better code than gcc. As much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc by in terms of performance and overall use for cross compiling embedded code. Try different versions of gcc, 3.x and 4.x if you are using gcc. Metaware's compiler and arms adt ran circles around gcc3.x, gcc3.x will give gcc4.x a run for its money with arm code, for thumb code gcc4.x is better and for thumb2 (which doesnt apply to you) gcc4.x also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full program optimization in addition to infinitely more tuning knobs than gcc. Despite that the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didnt try the n factoral number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles, my theory is do no optimization on the C to bc steps, link all the bc together then do a single optimization pass on the whole program, the allow the default optimization when llc takes it to the target).
By the same token simply knowing your compiler and the optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11 arr you compiling for arm11 or generic arm? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family (armv6 for example) over the generic armv4 (ARM7) that is often chosen as the default. Knowing to use -O2 or -O3 if you are brave.
It is often not the case but switching to thumb mode can improve performance for specific platforms. Doesnt apply to you but the gameboy advance is a perfect example, loaded with non-zero wait state 16 bit busses. Thumb has a handful of a percent overhead because it takes more instructions to do the same thing, but by increasing the fetch times, and taking advantage of some of the sequential read features of the gba thumb code can run significantly faster than arm code for the same source code.
having an arm11 you probably have an L1 and maybe L2 cache, are they on? Are they configured? Do you have an mmu and is your heavy use memory cached? or are you running zero wait state memory and dont need a cache and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often dont realize that when you use a cache simply adding a single up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, a few words) you can change your codes execution speed by as much as 10 to 20 percent. Where those cache line reads hit in heavily used functions/loops makes a big difference. Even saving one cache line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or 2 to 1 for example).
Knowing your architecture, both the processor and your memory environment is where the tuning if any would start. Most C libraries if you are high level enough to use one (I often dont use a C library as I run without an operating system and with very limited resources) both in their C code and sometimes add some assembler to make bottleneck routines like memcpy, much faster. If your programs are operating on aligned 32 or even better 64 bit addresses, and you adjust even if it means using a handful of bytes more memory for every structure/array/memcpy to be an integral multiple of 32 bits or 64 bits you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them, I certainly dont with embedded code) size aligned, even if you waste memory, getting elements aligned, consider using 32 bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too btw). As with the GBA example above looking at specific functions that either by profiling or intuition you know are not being implemented in a manner that takes advantage of your processor or platform or libraries you may want to turn to assembler either from scratch or compiling from C initially then disassembling and hand tuning. Memcpy is a good example you may know your systems memory performance and may chose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
Likewise mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isnt necessarily true, depends on how deeply embedded and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive, but at the same time, depending on the code within a function that has now grown in size by avoiding functions, you may create the problem you were trying to avoid, evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fits in registers to having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain compile and disassemble (and look at register usage, how often it fetches memory or writes to memory). You can and will find places where you need to take a well used loop and make it its own function even though the function call has a penalty because by doing that the compiler can better optimize the loop and not evict/reuse registers and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug, turn all of the compile for debug options off. You may already know that code compile for debug that runs without bugs doesnt mean it is debugged, compiling for debug and using debuggers hide bugs leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version both for performance and finding bugs in your code.
Most instruction sets do not have a divide function. Avoid using divides or modulo in your code as much as humanly possible they are performance killers. Naturally this is not the case for powers of two, to save the compiler and to mentally avoid divides and modulos try to use shifts and ands. Multplies are easier and more often found in instruction sets, but are still costly. This is a good case to write assembler to do your multiplies instead of letting the C copiler do it. The arm multiply is a 32bit * 32bit = 32 bit so to do accurate math without overflowing there has to be extra C code wrapped around the multiply, if you already know you wont overflow, burn the registers for a function call and do the multiply in assembler (for the arm).
Likewise most instruction sets do not have a floating point unit, with yours you might, even so avoid float if at all possible. If you have to use float that is a whole other pandora's box of performance issues. Most folks dont see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating point accuracy and how good or bad the C libraries are just trying to get your constants into floating point form. Again float is a whole other long discussion on performance problems.
I am a product of Michael Abrash (I actually have a print copy of zen of assembly language) and the bottom line is time your code. Come up with an accurate way to time the code, you may think you know where the bottlenecks are and you may think you know your architecture but trying different things even if you think they are wrong, and timing them you may find and eventually have to figure out the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this, all the other work you have done for performance can be instantly erased by not having a good alignment with the cache, this also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache line alignments.
Code Review:
What are good code review techniques ?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klockworks
Tools for dynamic analysis : Valgrind, purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
Steps are same
Apart from what is listed is point 1, there are tools like memcheck etc.
There is a big list here based on platform
Phew!! Quite a big question!
What are generally the steps to
analyze and improve performance of C
applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. it doesn't do what you want it to do! These are best done by code reviews first, then testing. Performance is often improved by by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), use of fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units but once run can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all to the required, imagination to think up of scenarios that may impair performance and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance ananlysis can be different on embedded systems to applications systems; with the very broad brush that "embedded" now covers it depends how hardware-centric you are. It can be done using profilers, if you want a more cheap and chearful method then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
What tools are out there which can
help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!
My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).
It is a difficult question to be answered shortly since various techniques have been proposed such as flowchart and state diagram,so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 --Joseph Yiu
C Programming for Embedded Systems --Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ --Michael Barr
An Embedded Software Primer --David E, Simon
Embedded Microprocessor Systems 3rd Edition --Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components --G. Nicolescu & A.A Jerraya
Embedded Systems: Modeling, Technology and Applications --Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture --Graham Wilson
Designing Embedded Hardware --John Catsoulis
You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No they don't.
Depending on the platform you're developing onto :
Windows : AMD Code Analyst, VTune, Sleepy
Linux : valgrind / callgrind / cachegrind
Mac : the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.

Getting into Embedded [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I'm trying to familiarize myself with the embedded field, but also have limited resources in terms of time and equipment to buy.
What's a good language to wrap my head around embedded, without investing too much time leaning an embedded-specific language? I'm most familiar with PHP, Java, Actionscript, but unfortunately know very little C. I remember reading somewhere that someone used PERL to program embedded systems, but not sure if that's really possible.
Can learning be done without needing to buy chips, etc. via simulators or such?
Can someone recommend a simplified roadmap to show how one would get sarted? I'm a little unsure where to even start.
You need to know C (but every programmer needs to know C !)
Most of these platforms have a simulator/emualtor, but since the point is to learn real applications and real problems ( which are all to do with real world timing issues) then you want a real board.
You probably also want an oscilloscope (a very cheap n'th hand slow analogue scope will do) and have some idea how to use it.
Easiest way in is probably Arduino, perhaps more professional but a little harder is launchpad MSP430
There are a few embedded programming lessons that carry over from one platform and style to another, but it is really a broad field. Different processors can require very different tactics, and different applications can dictate both different firmware design tactics and different microcontrollers. Here's some stuff to get you started....
msp430
Texas Instruments has several very inexpensive USB development kits which they call EZ430 and are based on their MSP430 family of micro-controllers. The simplest one has an msp430 f2013, which has 2K of flash program space, 3x128 bytes of usable user flash (another 128 byte page exists, but it's special), 128 bytes of RAM (yes, 128 bytes but it's enough for lots of things), and 16 CPU registers (some of these are special purpose like the Stack Pointer, Instruction Pointer, Status Register, and maybe one or two more). MSP430s also have several memory mapped special function registers which are used for configuring and controlling the built in peripherals. MSP430s are von Newman processors, so everything lives within one address space. These cost about $20us for both the programmer and a removable tab (pc board) containing the msp430 f2013. For about $10us you can get 3 replacement tabs with msp430 2012, which is pin compatible with the 2013 (mostly) and has a few different peripherals. These tabs have an LED, a button, and several large vias (holes in the pc board) which are connected to the pin of the processor. These vias are easy to solder wires into even if you have never soldered before -- due to capillary action the vias just suck the molten solder up and while it's hot you can just jab the end of your wire in there.
They also have a couple more similar kits with 802.15.4 radios. Even if you aren't interested in the radio you may still be interested in these because their programmer also has a UART pulled over from the removable tab and are compatible with the tabs used on the other kits mentioned above. These kits also contain at least one extra programmable board and a battery pack for it. (one of these kits may contain more, but I don't have mine with me right now, and not going to look it up).
They also have a kit that has a programmable watch as the target platform. I've never had one of these, but they have a display, accelerometers, and several other cool things, but this may overwhelm you for your first project. I'd suggest one of the previous kits to get you started with MSP430s.
You can get free C compilers and development environments for MSP430s in the form of IAR's Embedded Workbench kickstart (4 kb program space limited ) IDE, Code Composer Studio (also limited program size, but higher limit, I think), and gcc/gdb for the MSP430. IAR's kickstart is pretty easy to get started with quickly, though it's not perfect. You may find that you have to shut it down, unplug your USB EZ430, restart IAR, and plug back in to get it going again. Or maybe some different order will work better for you.
TI also provides many examples in badly named files (all of their downloadable files go out of their way to be badly named). Be warned -- similar MSP430s may have different device control register interfaces for similar peripherals, which can be confusing. Make sure that any document or example you are reading really does apply to the microcontroller you are using.
other small systems
There are many many other processors families and kits that you can go with, and you should probably at least know a little bit about them.
AVR -- Atmel's 8/16 bit Harvard architecture. Harvard refers to separate address spaces for code and working memory. It has 32 8 bit registers, some of which may be used in pairs as 16 bit registers. It's a very popular and pretty cool processor. Some of the smallest ones only have registers with no extra RAM, which is scary. Atmel also has an AVR32 which isn't at all the same as the AVR. Unless you make use of an existing bootloader capable of loading your new code you will need to get a JTAG unit for these.
8051 -- This is old as the hills and a pain in the butt to use until you finally understand it. It is an 8/16 bit processor, with many more limits on how you go about doing 16 bit math and only has 1 pair of registers which can act as a pointer. It has 3 separate address spaces (stack, global memory, and code) and lots of odd (compared to other architectures) features. The low level stuff might not mean much to you if your are programming in C except that very simple C operations can turn into much more code than you thought they would. You don't want to start on one of thise, most likely.
propeller -- Parallax's very interesting multi-core processor which is very unlike other processors. It has several cores which act mostly independently and can be used to simulate peripherals or do more traditional computational tasks. I've never used one of these, though I'd like to. Just never had a task that seemed to fit it. They have their own high level language to program them as well as the processor's assembly language.
larger systems
After you get out of the 8/16/24 bit processors you start to blur the lines between embedded and desktop level programming, even if it is technically embedded.
AVR32 -- There are 2 main versions of these. One is a Harvard architecture and the other is von Newman. The von Newman version is essentially a better ARM than ARM, but it's not as popular as ARM. As near as I can tell it was designed with "run Linux" in mind, though not tied to it in any crazy way. You used to be able to get cheap development boards for these and code is often almost as easy to load as copying files from one PC to another, though you will probably make use of uboot and tftp to do some work. JTAG is only needed when you mess up the boot loader. I think all of these have support for native JAVA acceleration. www.AVR32.org
ARM -- The most popular embedded processor. There's many versions of these. Some don't have an MMU (memory management unit) and some do. There's too much to say about them. Some version have native JAVA acceleration, though I think that the ARM lords don't freely tell all of the details of how to use it, so you have to find a JVM which knows how to use it. Many vendors make them, including Atmel, Freescale, Intel, and many others.
MIPS -- A very RISC processor. The RISCiest.
There are many others.
Programming styles
I could write 3 books on this but the general rule is make things as simple as the application can let you. An exception to this is that if you can easily make use of an operating system you might want to make use of it if it simplifies your task.
The first thing you need to know when answering this question is "WHAT IS" an embedded system? A GENERAL definition would be a computer system which is dedicated to a single specific purpose. This doesn't limit the type of hardware you can use, as matter of fact "Embedded PCs" have been used for years. The QNX realtime OS has existed since the early 80s and been used in industrial PCs for embedded applications for years. I've personally used in control systems for steel mills XRAY thickness gages. On the other hand I currently use TI DSPs without any OS support and only using 256K of Ram. Another example would be the key fob for your car. Old ones used to use a PIC microcontroller from Microchip. (That's actually the company's name.)
Some people call the IPhone an embedded system, but due to the fact you can load applications to do just about anything I tend to say it's a palm top computer with phone capabilities. An OLD DUMB cell phone that is just a phone, not PDA is an embedded system. That's just a bit of philosophy.
As a general rule there are a handful of concepts you need to grasp for embedded systems programming, and most of them can be explored on a PC.
EDIT:
The REASON why C or C++ is recommended is C itself was designed to do systems programming. C++ maintains all it's advantages but adds capabilities for OOP programming. Some ASM maybe required in some systems. However a lot of chip vendors, such as TI, provide tools basically make it possible to do your entire system in C++.
:END EDIT
ALOT of simple embedded systems look more or less like this:
While(true) // LOOP FOREVER... There is no command prompt
{
// Typically you want I/O to occur on fixed "timebase."
wait(timerTick);
readDigitalIO(&dioStruct);
readAnalogIO(&aioStruct);
// Combine current system state with input values
// and do some useful calculations. (i.e. Analog input to temperature calc)
Process(dioStruct,aioStruct,&CurrentState);
// This can be a serial output/audio buzzer/leds/motor controller
// or Whatever the system REQUIREMENT call for.
driveOutputs(CurrentState);
// The watchdog timer resets your system if it gets stuck.
petWatchDogTimer();
}
There is nothing here that you can't do using the PC. (Well a PC that still has a parallel port anyway. Which is a more or less just a DIO port.) On a simple system without a os this might be all there is. On a RTOS based system you may have several task that all look somewhat simalar to this, but pass data back and forth between the tasks.
The interesting parts come when you have to interface to the hardware on you're own, the first job I had out of college was writing a device driver for a data acquisition board under QNX.
Basic concepts of dealing with hardware, or device drivers (Which you can experiement with by hacking Linux device drive code which is freely avaliable.), most hardware looks to the programmer like just another memory address. This is called "Memory mapped I/O." What does this mean? Lets use a serial port as an example:
// Serial port registers definition:
typedef struct
{
unsigned int control; // Control bits for the port.
unsigned int baudDiv; // Baud rate divider.
unsigned int status; // READ Status bits/ Write resets fifos;
char TXdata; // The head of the hardware TX fifo.
char RXdata; // The tail of the hardware RX filo.
} serRegs;
// Using the volatile keyword to indicate the hardware can change the value
// independantly from the software.
volatile serRegs *Ser1 = (serRegs *)0x8000; // Hardware exists at a specific location in memory.
volatile serRegs *Ser2 = (serRegs *)0x8010; // Hardware exists at a specific location in memory.
// Bits bits 15-12 enable interupts and select interupt vector,
// bits 11-8 enable,bits 7-4 parity,bits 3-0 stop bits.
Ser1->status = 1; // Reset fifos.
Ser1->baudDiv = CLOCKVALUE / 9600; // Set the baudrate 9600;
Ser1->control = 0x1801; // Enable, 8 data, no parity, 1 stop bit.
// Write out a "OK\r\n" message; (Normally this would be a loop.)
Ser1->Txdata = 'O'; // First byte in fifo Transmission starts.
Ser1->Txdata = 'K'; // Second byte in fifo still transmitting first byte
Ser1->Txdata = '\r'; // Third byte in fifo still transmitting first byte
Ser1->Txdata = '\n'; // Fouth byte in fifo still transmitting first byte
Normally you would have a function or an interrupt handler to handle TXing the data, but for example I wanted to point out that the hardware is working while the software keeps going. Basically hardware works like, I write a value to an address and "STUFF" happens independantly of the software. This is perhaps one of the key concepts for embedded programming, how to make the computer effect a change in the real world.
EDIT:
If you actually want to get a cheap board, the current trend from Micro developers is to put a dev kit on a usb thumb stick. This page has info on several, ranging from 8 bits upto ARM architectures: http://dev.emcelettronica.com/microcontrollers-usb-stick-tool
The Cypress PSOC was one of the first to do this with the "FirstTouch Starter Kit." The PSOC is a very unique part in that it has a micro controller and "Configurable analog and digital blocks" that allow you to plop down a ADC, serial port, or digital I/O using a gui and automatically configures you're C app to use it. The PSOCs also are avaliable in DIP packages which makes them easy to use on a prototyper's breadboard.
Picture your embedded controller sitting in a switched-off circuit...
The Vcc power is applied and the reset circuit asserts reset signal.
Clocks have reached running speeds and voltages stabilized, so reset is de-asserted.
Now your controller sets its instruction pointer to the "reset vector," which is physical address 0xE0000000 on this particular chip. The controller fetches the instruction at that location.
Interrupts are disabled, and the first order of business is to initialize registers such as the stack pointer. On some chips, there are flags bits (e.g., x86 direction flag) which need to be cleared or set.
Once the registers and flag bits are set up correctly, it becomes possible for interrupt service routines to run. By now, we must have run code to about location 0xE0000072 when we get to the code which enables interrupts by first toggling some GPIO pins to the external interrupt controller, then enables the CPU interrupts mask.
At this point, the equivalent of "device drivers" are running in the form of interrupt service routines. Assuming the C environment has a library which matches the interfaces of these routines' data structures, by now our boot-loader code can jump to the main() function of some C object code.
In other words, the code which brought us from power-on to main(), and which handles the low-level I/O, is written in the assembler peculiar to the chip you choose. This means that if you want to be versatile at embedded programming, you must know how to implement assembly code starting at the reset vector.
The reality is that hobbyist embedded programming doesn't allow time for implementing all the ISRs and the boot-loader code. For this reason, many people use standard software frameworks available for specific chips. Others use custom-language chips such as the BASICstamp. The BASICstamp is an embedded chip which hosts a BASIC language interpreter on-board. The interpreter and all the ISRs are pre-written for you. The BASIC environment gives you the ability to control I/O pins, read voltages, everything you could do from assembly with an embedded controller, but a bit slower.
As for the language, C is probably the most important language to know. From Java you should be able to adapt, but just remember that a lot of high level Java will not be available to you. Loads of textbooks out there but I'd recommend the original C programming book by Kernighan and Ritchie http://en.wikipedia.org/wiki/The_C_Programming_Language_(book)
For a good introduction to embedded C you could try a book by Michael J Pont:
http://www.amazon.com/Embedded-C-Michael-J-Pont/dp/020179523X
As for the embedded side of things you could start with Microchip, the IDE is OK to develop in with a reasonable simulator, and the c compilers are free for the slightly limited student editions c18 and c30 compilers, the IDE installer will also ask if you want to install a 3rd party HI-TECH C compiler which you could use. As for the processor I'd recommend selecting a standard 18 series PIC such as the PIC18F4520.
http://www.microchip.com/stellent/idcplg?IdcService=SS_GET_PAGE&nodeId=1406&dDocName=en019469&part=SW007002
Whatever the chip manufacturer, you have to get to know the datasheets. You don't have to learn it all at once but will need it to hand!
Embedded, like most programming, tends to revolve around:
1) initialising a resource, in this case rather than data being from a data store it is from computer registers. Just include the processor header file (.h) and it will allow you to access these as ports (usually bytes) or pins (bits). Also micro-processors come with useful resources on the chip such as timers, analogue to digital converters (ADCs) and serial communications systems (UARTs). Remember that the chip itself is a resource and needs initialising before anything else.
2) using the resource. C will allow you to make data as global as possible and everything can access everything at every time! Avoid this temptation and keep it modular like Java will have encouraged you to (though for speed you may need to be a little looser on these rules).
But they do have an extra weapon called interrupts which can be used to provide real-time behaviour. These can be thought of a bit like OnClick() events. Interrupts can be generated by external events (e.g. buttons or receiving a byte from another device) and internal (timers, transmissions completed, ADC conversions completed). Keep interrupt service routines (ISRs) short and sweet, use them to handle real-time events (e.g. take a byte received and store it in a buffer then raise a flag) but allow background code to deal with it (e.g. check a received byte flag, if set then read the byte received). And remember the all important volatile for variables used by ISR routines and background routines!
Anyway, read around, I recommend www.ganssle.com for advice in general.
Good luck!
The scope of embedded computing has grown very broad, so the answers somewhat depend on what kind of device you're aiming at. On one end, there are 8-bit controllers with only a few KB of memory, usually programmed entirely in assembly or C. On the other end, processors such as those in your router are fairly powerful (200 MHz and a few MB of RAM is not uncommon) and often run an OS like Linux, which means you can use pretty much any language, though C and Java are the most common.
It's best to buy a real chip and experiment. Most of the work involved is usually in getting to know a device and how to interface with it, so using a simulator kind of defeats the purpose.
What's a good language to wrap my head around embedded, without investing too much time leaning an embedded-specific language?
As everyone will suggest: C. Now depending on how deep are you going to dig into your platform of choice, you may also need some assembly, but don't be scared about that: typically you'll use just a little.
If you are learning C, my personal suggestion is: work as you would do in assembly; the programming language won't give you much abstractions, so think in terms of memory management. When you've learnt how to do it move up towards abstractions and live happy.
C++ is also popular on embedded platforms, but IMHO is difficult unless you know well how to program in C you can also understand what's under the hood of its abstraction.
When you feel confident with C/C++ you can start to mess with embedded operating systems. You'll notice that they can be totally different from your OS of choice (By example not all operating systems have a C standard library, processes and splitting between userspace and kernelspace).
You'll learn how to build a cross compiler, how to mess with linker scripts, the tricks of binary formats and a lot of cool stuff.
For the theoretical point of view there's a lot of stuff as well: if you study Computer Science you can get a master's degree in embedded systems.
Can learning be done without needing to buy chips, etc. via simulators or such?
Yes: many operating systems can be run on simulators like qemu.
Can someone recommend a simplified roadmap to show how one would get sarted? I'm a little unsure where to even start.
Try to get a simple operating system which can be run on emulators, hack it and follow your curiosity. Don't be scared of messing with knotty code.
1) Most of the time for most and usually the lower end embedded system, you need to know C.
And I will still recommend you to get a vanilla development board to get yourself familiar with the work flow and tricky part of working with embedded system like debugging and cross compiling. You will run into trouble if you only rely on emulator.
You can try out The Linux Stamp, it is not expensive and is good for beginner but you do need some prior knowledge on Linux.
2) For high end embedded system, a good example is Smartphone from HTC (CPU speed can reach 1Ghz)or some other Android phone it run fast and you can even code Java on it.
C and the assembly specific to your chip.
No, you really need a real chip. Simulators aren't the real thing. You need to be able to deal with keypress jitter, voltage funkyness, etc.
The Arduino is the current fad for embedded hobbyists. I'm not a big fan of Harvard architecture, personally. But you will find oodles of help out there for it. I use an XCore for my thesis work and I have found it super easy to program multicore stuff. I would suggest starting with an AVR32 and going from there.
As everyone else is saying, you need to know C.
Have a look at AVR butterfly for a cheap development board.
Smileymicros have a simple kit with dev board and book:
http://www.smileymicros.com/index.php?module=pagemaster&PAGE_user_op=view_page&PAGE_id=41
The soon to ship Raspberry Pi board looks like an incredibly cheap way to get into this field.
Yup, Arduino would be the way to go. Agreed.. Cheap (about $20 to start) and has great API to get started with high level functions. C is a must though, can't avoid it. But if you can program in other languages you'll be all good.
My recommendation is to start shopping at http://www.sparkfun.com lots of examples to work from and helpful hints of what devices to buy.

Why do you program in assembly? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have a question for all the hardcore low level hackers out there. I ran across this sentence in a blog. I don't really think the source matters (it's Haack if you really care) because it seems to be a common statement.
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
As far as the assembly goes - is the code written in assembly because you don't want a compiler emitting extra instructions or using excessive bytes, or are you using better algorithms that you can't express in C (or can't express without the compiler mussing them up)?
I completely get that it's important to understand the low-level stuff. I just want to understand the why program in assembly after you do understand it.
I think you're misreading this statement:
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
Games (and most programs these days) aren't "written in assembly" the same way they're "written in C++". That blog isn't saying that a significant fraction of the game is designed in assembly, or that a team of programmers sit around and develop in assembly as their primary language.
What this really means is that developers first write the game and get it working in C++. Then they profile it, figure out what the bottlenecks are, and if it's worthwhile they optimize the heck out of them in assembly. Or, if they're already experienced, they know which parts are going to be bottlenecks, and they've got optimized pieces sitting around from other games they've built.
The point of programming in assembly is the same as it always has been: speed. It would be ridiculous to write a lot of code in assembler, but there are some optimizations the compiler isn't aware of, and for a small enough window of code, a human is going to do better.
For example, for floating point, compilers tend to be pretty conservative and may not be aware of some of the more advanced features of your architecture. If you're willing to accept some error, you can usually do better than the compiler, and it's worth writing that little bit of code in assembly if you find that lots of time is spent on it.
Here are some more relevant examples:
Examples from Games
Article from Intel about optimizing a game engine using SSE intrinsics. The final code uses intrinsics (not inline assembler), so the amount of pure assembly is very small. But they look at the assembler output by the compiler to figure out exactly what to optimize.
Quake's fast inverse square root. Again, the routine doesn't have assembler in it, but you need to know something about architecture to do this kind of optimization. The authors know what operations are fast (multiply, shift) and which are slow (divide, sqrt). So they come up with a very tricky implementation of square root that avoids the slow operations entirely.
High-Performance Computing
Outside the domain of games, people in scientific computing frequently optimize the crap out of things to get them to run fast on the latest hardware. Think of this as games where you can't cheat on the physics.
A great recent example of this is Lattice Quantum Chromodynamics (Lattice QCD). This paper describes how the problem pretty much boils down to one very small computational kernel, which was optimized heavily for PowerPC 440's on an IBM Blue Gene/L. Each 440 has two FPUs, and they support some special ternary operations that are tricky for compilers to exploit. Without these optimizations, Lattice QCD would've run much slower, which is costly when your problem requires millions of CPU hours on expensive machines.
If you are wondering why this is important, check out the article in Science that came out of this work. Using Lattice QCD, these guys calculated the mass of a proton from first principles, and showed last year that 90% of the mass comes from strong force binding energy, and the rest from quarks. That's E = mc² in action. Here's a summary.
For all of the above, the applications are not designed or written 100% in assembly -- not even close. But when people really need speed, they focus on writing the key parts of their code to fly on specific hardware.
I have not coded in assembly language for many years, but I can give several reasons that I frequently saw:
Not all compilers can make use of certain CPU optimizations and instruction-set extensions (e.g., the new instruction sets that Intel adds once in a while). Waiting for compiler writers to catch up means losing a competitive advantage.
It is easier to match the actual code to the known CPU architecture and its optimizations - for example, things you know about the fetch mechanism, caching, etc. This is supposed to be transparent to the developer, but the fact is that it is not; that's why compiler writers can optimize.
Certain hardware-level accesses are only possible/practical via assembly language (e.g., when writing a device driver); see the inline-assembly sketch at the end of this answer.
Formal reasoning is sometimes actually easier for the assembly language than for the high-level language since you already know what the final or almost final layout of the code is.
Programming certain 3D graphics cards (circa late 1990s) in the absence of APIs was often more practical and efficient in assembly language, and sometimes not possible in other languages. But again, this involved really expert-level tricks tied to the accelerator architecture, like manually moving data in and out in a certain order.
I doubt many people use assembly language when a higher-level language would do, especially when that language is C. Hand-optimizing large amounts of general-purpose code is impractical.
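To make the "hardware-level access" and "instructions the compiler won't give you" points concrete, here is a small, hypothetical sketch (GCC/Clang on x86) that reads the CPU's time-stamp counter; there is no portable C expression for this - it is literally one rdtsc instruction:

    /* Minimal sketch: read the x86 time-stamp counter via inline assembly.
       Newer compilers also offer an __rdtsc() intrinsic that does the same. */
    static inline unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }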
There is one aspect of assembler programming which others have not mentioned - the feeling of satisfaction you get knowing that every single byte in an application is the result of your own effort, not the compiler's. I wouldn't for a second want to go back to writing whole apps in assembler as I used to do in the early 80s, but I do miss that feeling sometimes...
Usually, a layman's assembly is slower than C (due to the C compiler's optimizations), but many games (I distinctly remember Doom) had to have specific sections written in assembly so they would run smoothly on normal machines.
Here's the example to which I am referring.
I started professional programming in assembly language in my very first job (80's). For embedded systems the memory demands - RAM and EPROM - were low. You could write tight code that was easy on resources.
By the late 80's I had switched to C. The code was easier to write, debug and maintain. Very small snippets of code were written in assembler - for me it was when I was writing the context switching in a roll-your-own RTOS. (Something you shouldn't do anymore unless it is a "science project".)
You will see assembler snippets in some Linux kernel code. Most recently I've browsed it in spinlocks and other synchronization code. These pieces of code need access to atomic test-and-set operations, cache manipulation, etc.
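To make the test-and-set idea concrete, here is a minimal sketch of a spinlock built on the x86 xchg instruction (which is atomic with respect to other CPUs); this is a toy illustration in the spirit of that code, not the kernel's actual implementation:

    /* Toy spinlock around atomic xchg (GCC/Clang, x86). The real kernel code
       adds fairness, pause hints, and debugging support on top of this idea. */
    typedef volatile int spinlock_t;

    static inline void spin_lock(spinlock_t *lock)
    {
        int old = 1;
        do {
            /* atomically store 1 into *lock and get the previous value back */
            __asm__ __volatile__("xchgl %0, %1"
                                 : "+r"(old), "+m"(*lock)
                                 :
                                 : "memory");
        } while (old != 0);   /* previous value was 1: someone else holds it */
    }

    static inline void spin_unlock(spinlock_t *lock)
    {
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
        *lock = 0;                               /* plain store suffices on x86 */
    }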
I think you would be hard pressed to out-optimize modern C compilers for most general programming.
I agree with #altCognito that your time is probably better spent thinking harder about the problem and doing things better. For some reason programmers often focus on micro-efficiencies and neglect the macro-efficiencies. Assembly language to improve performance is a micro-efficiency. Stepping back for a wider view of the system can expose the macro problems in a system. Solving the macro problems can often yield better performance gains.
Once the macro problems are solved then collapse to the micro level.
I guess micro problems are within the control of a single programmer and in a smaller domain. Altering behavior at the macro level requires communication with more people - a thing some programmers avoid. That whole cowboy vs the team thing.
"Yes". But, understand that for the most part the benefits of writing code in assembler are not worth the effort. The return received for writing it in assembly tends to be smaller than the simply focusing on thinking harder about the problem and spending your time thinking of a better way of doing thigns.
John Carmack and Michael Abrash, who were largely responsible for writing Quake and all of the high-performance code that went into id's game engines, go into this at length in this book.
I would also agree with Ólafur Waage that today, compilers are pretty smart and often employ many techniques which take advantage of hidden architectural boosts.
These days, for sequential codes at least, a decent compiler almost always beats even a highly seasoned assembly-language programmer. But for vector codes it's another story. Widely deployed compilers don't do such a great job exploiting the vector-parallel capabilities of the x86 SSE unit, for example. I'm a compiler writer, and exploiting SSE tops my list of reasons to go on your own instead of trusting the compiler.
SSE code works better in assembly than with compiler intrinsics, at least in MSVC (i.e., it does not create extra copies of data).
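For comparison, the intrinsics route looks roughly like this (a generic, hypothetical kernel, not taken from any particular codebase); with MSVC you would then inspect the generated assembly to see whether the extra copies mentioned above show up:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Scale n floats by k, four at a time. Assumes n is a multiple of 4 and
       that src/dst are 16-byte aligned (hence the aligned load/store). */
    void scale_sse(float *dst, const float *src, float k, int n)
    {
        __m128 vk = _mm_set1_ps(k);
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(src + i);
            _mm_store_ps(dst + i, _mm_mul_ps(v, vk));
        }
    }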
I have three or four assembler routines (in about 20 MB of source) in my code at work. All of them are SSE(2), and relate to operations on (fairly large - think 2400x2048 and bigger) images.
As a hobby, I work on a compiler, and there you have more assembler. Runtime libraries are quite often full of it; most of it has to do with things that defy the normal procedural regime (like helpers for exceptions, etc.).
I don't have any assembler for my microcontroller work. Most modern microcontrollers have so much peripheral hardware (interrupt-controlled counters, even entire quadrature encoders and serial building blocks) that using assembler to optimize the loops is often no longer needed. With current flash prices, the same goes for code memory. Also, there are often ranges of pin-compatible devices, so upscaling if you systematically run out of CPU power or flash space is often not a problem.
Unless you really ship 100,000 devices and programming in assembler makes major savings possible by just fitting into a flash chip one category smaller. But I'm not in that category.
A lot of people think embedded is an excuse for assembler, but their controllers have more CPU power than the machines Unix was developed on. (Microchip is coming out with 40 and 60 MIPS microcontrollers for under USD 10.)
However, a lot of people are stuck with legacy code, since changing chip architecture is not easy. Also, the HLL code is very architecture-dependent (because it uses the hardware peripherals, registers to control I/O, etc.). So there are sometimes good reasons to keep maintaining a project in assembler (I was lucky enough to be able to set things up on a new architecture from scratch). But often people kid themselves that they really need the assembler.
I still like the answer a professor gave when we asked if we could use GOTO (but you could read that as ASSEMBLER too): "if you think it is worth writing a 3 page essay on why you need the feature, you can use it. Please submit the essay with your results. "
I've used that as a guiding principle for low-level features. Don't be too hesitant to use them, but make sure you motivate the choice properly. Even throw up an artificial barrier or two (like the essay) to avoid convoluted reasoning as justification.
Some instructions/flags/control simply aren't there at the C level.
For example, checking for overflow on x86 is a simple test of the overflow flag. That flag is not directly accessible from C.
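In standard C you end up pre-checking the operands instead; with GCC or Clang you can get at the flag through a builtin, which typically compiles down to an add followed by a single instruction testing OF:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if a + b overflowed; the (wrapped) sum is written to *out
       either way. On x86 this usually becomes an add plus a jo/seto on OF. */
    bool add_checked(int32_t a, int32_t b, int32_t *out)
    {
        return __builtin_add_overflow(a, b, out);
    }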
Defects tend to run per-line (statement, code point, etc.); while it's true that for most problems, assembly would use far more lines than higher level languages, there are occasionally cases where it's the best (most concise, fewest lines) map to the problem at hand. Most of these cases involve the usual suspects, such as drivers and bit-banging in embedded systems.
If you were around for all the Y2K remediation efforts, you could have made a lot of money if you knew Assembly. There's still plenty of legacy code around that was written in it, and that code occasionally needs maintenance.
Another reason could be when the available compiler just isn't good enough for an architecture, and the amount of code needed is not so long or complex that the programmer would get lost in it. Try programming a microcontroller for an embedded system; usually assembly will be much easier.
Besides the other things mentioned, all higher-level languages have certain limitations. That's why some people choose to program in ASM, to have full control over their code.
Others enjoy very small executables, in the range of 20-60 KB. For instance, check HiEditor, which is implemented by the author of the HiEdit control, a superb, powerful edit control for Windows with syntax highlighting and tabs, in only ~50 KB. In my collection I have more than 20 such gold controls, from Excel-like spreadsheets to HTML renderers.
I think a lot of game developers would be surprised at this bit of information.
Most games I know of use as little assembly as possible. In some cases none at all, and at worst, one or two loops or functions.
That quote is over-generalized, and nowhere near as true as it was a decade ago.
But hey, mere facts shouldn't hinder a true hacker's crusade in favor of assembly. ;)
If you are programming a low end 8 bit microcontroller with 128 bytes of RAM and 4K of program memory you don't have much choice about using assembly. Sometimes though when using a more powerful microcontroller you need a certain action to take place at an exact time. Assembly language comes in useful then as you can count the instructions and so measure the clock cycles used by your code.
Games are pretty performance-hungry, and although optimizers are pretty good these days, a "master programmer" is still able to squeeze out some more performance by hand-coding the right parts in assembly.
Never ever start optimizing your program without profiling it first. After profiling, you should be able to identify bottlenecks, and if finding better algorithms and the like doesn't cut it anymore, you can try to hand-code some stuff in assembly.
Aside from very small projects on very small CPUs, I would not set out to ever program an entire project in assembly. However, it is common to find that a performance bottleneck can be relieved with the strategic hand coding of some inner loops.
In some cases, all that is really required is to replace some language construct with an instruction that the optimizer cannot be expected to figure out how to use. A typical example is in DSP applications where vector operations and multiply-accumulate operations are difficult for an optimizer to discover, but easy to hand code.
For example certain models of the SH4 contain 4x4 matrix and 4 vector instructions. I saw a huge performance improvement in a color correction algorithm by replacing equivalent C operations on a 3x3 matrix with the appropriate instructions, at the tiny cost of enlarging the correction matrix to 4x4 to match the hardware assumption. That was achieved by writing no more than a dozen lines of assembly, and carrying matching adjustments to the related data types and storage into a handful of places in the surrounding C code.
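The multiply-accumulate case mentioned above typically boils down to an inner loop like this generic sketch (not the SH4 code from this example); on a DSP each iteration maps to a single MAC instruction, which is exactly the mapping that is easy to hand-code and surprisingly hard for many optimizers to find:

    #include <stdint.h>

    /* Fixed-point dot product: the classic multiply-accumulate inner loop. */
    int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];   /* one MAC per iteration on a DSP */
        return acc;
    }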
It doesn't seem to be mentioned, so I thought I'd add it: in modern games development, I think at least some of the assembly being written isn't for the CPU at all. It's for the GPU, in the form of shader programs.
This might be needed for all sorts of reasons, sometimes simply because whatever higher-level shading language is used doesn't allow the exact operation to be expressed in the exact number of instructions wanted, to fit some size constraint, speed target, or any combination. Just as usual with assembly-language programming, I guess.
Almost every medium-to-large game engine or library I've seen to date has some hand-optimized assembly versions available for matrix operations like 4x4 matrix concatenation. It seems that compilers inevitably miss some of the clever optimizations (reusing registers, unrolling loops in a maximally efficient way, taking advantage of machine-specific instructions, etc) when working with large matrices. These matrix manipulation functions are almost always "hotspots" on the profile, too.
I've also seen hand-coded assembly used a lot for custom dispatch -- things like FastDelegate, but compiler and machine specific.
Finally, if you have Interrupt Service Routines, asm can make all the difference in the world -- there are certain operations you just don't want occurring under interrupt, and you want your interrupt handlers to "get in and get out fast"... you know almost exactly what's going to happen in your ISR if it's in asm, and it encourages you to keep the bloody things short (which is good practice anyway).
I have only personally talked to one developer about his use of assembly.
He was working on the firmware that dealt with the controls for a portable mp3 player.
Doing the work in assembly had 2 purposes:
Speed: delays needed to be minimal.
Cost: by being minimal with the code, the hardware needed to run it could be slightly less powerful. When mass-producing millions of units, this can add up.
The only assembler coding I continue to do is for embedded hardware with scant resources. As leander mentions, assembly is still well suited to ISRs where the code needs to be fast and well understood.
A secondary reason for me is to keep my knowledge of assembly functional. Being able to examine and understand the steps which the CPU is taking to do my bidding just feels good.
Last time I wrote in assembler was when I could not convince the compiler to generate libc-free, position independent code.
Next time will probably be for the same reason.
Of course, I used to have other reasons.
A lot of people love to denigrate assembly language because they've never learned to code with it; they have only vaguely encountered it, and it has left them either aghast or somewhat intimidated. Truly talented programmers will understand that it is senseless to bash C or assembly because they are complementary: the advantage of one is the disadvantage of the other. The organized syntactic rules of C improve clarity but at the same time give up all the power assembly has from being free of any structural rules. C's constructs enforce block-structured code, which could be argued to force clarity of programming intent, but this is a loss of power. In C the compiler will not allow a jump into the middle of an if/elseif/else/end. You are not allowed to write two for/end loops on different variables that overlap each other, and you cannot write self-modifying code (or cannot in a seamless, easy way), etc. Conventional programmers are spooked by the above and would have no idea how to even use the power of these approaches, as they have been raised to follow conventional rules.
Here is the truth: today we have machines with the computing power to do much more than the applications we use them for, but the human brain is too limited to code them in a rule-free coding environment (i.e., assembly) and needs restrictive rules that greatly reduce the spectrum and simplify coding.
I have myself written code that cannot be written in C without becoming hugely inefficient, because of the above-mentioned limitations. And I have not yet talked about speed, which most people think is the main reason for writing in assembly. If your mind is limited to thinking in C, then you are the slave of your compiler forever. I always thought chess masters would be ideal assembly programmers, while the C programmers just play checkers.
Not speed any longer, but control. Speed will sometimes come from control, but control is the only real reason to code in assembly. Every other reason boils down to control (i.e., SSE and other hand optimization, device drivers and device-dependent code, etc.).
If I were able to outperform GCC and Visual C++ 2008 (also known as Visual C++ 9.0), then people would be interested in interviewing me about how it is possible.
This is why for the moment I just read things in assembly and just write __asm int 3 when required.
I hope this helps...
I've not written in assembly for a few years, but the two reasons I used to were:
The challenge of the thing! I went through a several-month period years ago when I'd write everything in x86 assembly (the days of DOS and Windows 3.1). It basically taught me a chunk of low-level operations, hardware I/O, etc.
For some things it kept size small (again DOS and Windows 3.1, when writing TSRs).
I keep looking at coding assembly again, and it's nothing more than the challenge and joy of the thing. I have no other reason to do so :-)
I once took over a DSP project which the previous programmer had written mostly in assembly code, except for the tone-detection logic which had been written in C, using floating-point (on a fixed-point DSP!). The tone detection logic ran at about 1/20 of real time.
I ended up rewriting almost everything from scratch. Almost everything was in C except for some small interrupt handlers and a few dozen lines of code related to interrupt handling and low-level frequency detection, which runs more than 100x as fast as the old code.
An important thing to bear in mind, I think, is that in many cases, there will be much greater opportunities for speed enhancement with small routines than large ones, especially if hand-written assembler can fit everything in registers but a compiler wouldn't quite manage. If a loop is large enough that it can't keep everything in registers anyway, there's far less opportunity for improvement.
The Dalvik VM that interprets the bytecode for Java applications on Android phones uses assembler for the dispatcher. This movie (about 31 minutes in, but it's worth watching the whole movie!) explains how
"there are still cases where a human can do better than a compiler".
I don't, but I've made it a point to at least try, and to try hard at some point in the future (soon, hopefully). It can't be a bad thing to get to know more of the low-level stuff and how things work behind the scenes when I'm programming in a high-level language. Unfortunately time is hard to come by with a full-time job as a developer/consultant and a parent. But I will give it a go in due time, that's for sure.

Resources