Embedded Development Board

Embedded Development Board - arm

I'm new to the embedded development world and am looking to get my very first board.
After some research, I realize that there aren't many choices with FPUs. This is important in my project as I'm going to be doing quite a bit of floating point computations.
I found the Mini2440 which seems to run on the ARM920T core. This particular unit is perfect for my needs (decent price, all the right I/O ports, and a touch screen to boot) but it seems that it doesn't have an FPU. I don't know how big of a penalty I'd be paying for FP emulation, so I'm unsure of whether to pull the trigger on this one.
That said:
Can someone please confirm whether this product (Mini2440) has an FPU or not?
My project will do image capture and analysis. Does anyone have any experience with running things like OpenMP on such platforms?
Please suggest any other similar boards in the ≤ $200 price range that have an FPU.
This world is new to me. Any other advice or things I should be aware of is much appreciated.

Fixed Point math can do nearly everything Floating point can and ARM processors with their shift optimization love fixed point. I haven't had a FPU in so long that coding Fixed Point is second nature to me. And even better, Fixed math quite often is more accurate.
In short, don't write off a board because it doesn't have an FPU. :)

Did you look at the BeagleBoard ? Its ARM CPU has VFP for floating point and also NEON for SIMD floating point. Cost is around $200.

I can't give you 100% confirmation, but I'm 99% sure that board's processor doesn't have an FPU; in that target market, it would be mentioned explicitly in the processor datasheets if it were present.
As an answer to a side-question: We were recently doing a bit of benchmarking that ended up comparing performance with an FPU to performance with compiler floating-point emulation without the FPU. Ended up with about a 100x difference in speed.
So, yes, it works -- but no, you don't want to do that for more than very occasional computations. As Michael notes, using fixed-point math is a much more attractive option for computations on embedded processors that don't have FPUs.

No touchscreen, not sure why that matters, the beagleboards serial port is screwy but you still get a terminal, or go with a hawkboard which is also omap based, half the price and designed a little better, has ethernet so you can vnc in and get a full gui without double or tripling the price on a lcd touchscreen.
Instead of going with floating point arm, use the on chip (omap) dsp for that. TI float is superior to IEEE in so many ways.

Try the Samsung S3C6410 with FPU. And the Witech OK6410 board with Samsung S3C6410 cpu and 4.3" LCD, $139 only

Related

Fixed Point FFT Algorithm Needed

I am developing an algorithm for an audio application for mobile platforms. It appears to me that currently the float point calculation support on many mobile processors is not ubiquitous and developing in fixed point would be a safer bet.
I have written FFT routines in float point form for some time now to a degree of success, however writing one in fixed point turned out to be rather difficult. Namely, I would be happy to improve the precision, as well as to find a way to handle potential overflows. The problem is, unlike float point FFTs, descriptions of fixed point FFT algorithms are hard to come by on the Internet.
Has anyone had some experience developing such algorithms?

Your first choice should probably be to use a native-optimized FFT. There are processing requirement for fixed point FFTs that are difficult to express efficiently in portable C (or any language probably): saturation arithmetic is probably the biggest obstacle. Assembly libraries will tend to take advantage of processor-specific instructions for these .
If you still want a portable ANSI C fixed point FFT, I only know of one choice: kissfft. (Disclaimer : I wrote it)

I have read great things about http://anthonix.com/ffts/index.html - this works well on mobile platforms - The site contains benchmarks

I have been working on an automated tool that converts floating-point C code to fixed-point, with a variety of options for tradeoffs between accuracy and execution time. I have had good results with a number of algorithms, including a 2D 8x8 discrete cosine transform. My target platform is typically an ARM Cortex-M processor but similar results should be achievable on other platforms. Would you be interested in letting me take a crack at your FFT?

Would a C6000 DSP be outperformed by a Cortex A9 for FP

I'm using an OMAP L138 processor at the moment which does not have a hardware FPU. We will be processing spectral data using algorithms that are FP intensive thus the ARM side won't be adequate. I'm not the algorithm person but one is "Dynamic Time Warping" (I don't know what it means, no). The initial performance numbers are:
Core i7 Laptop# 2.9GHz: 1 second
Raspberry Pi ARM1176 # 700MHz: 12 seconds
OMAP L138 ARM926 # 300MHz: 193 seconds
Worse, the Pi is about 30% of the price of the board I'm using!
I do have a TI C674x which is the other processor in the OMAP L138. The question is would I be best served by spending many weeks trying to:
learn the DSPLINK, interop libraries and toolchain not to mention forking out for the large cost of Code Composer or
throwing the L138 out and moving to a Dual Cortex A9 like the Pandaboard, possibly suffering power penalties in the process.
(When I look at FPU performance on the A8, it isn't an improvement over the Rasp Pi but Cortex A9 seems to be).
I understand the answer is "it depends". Others here have said that "you unlock an incredible fast DSP that can easily outperform the Cortex-A8 if assigned the right job" but for a defined job set would I be better off skipping to the A9, even if I had to buy an external DSP later?

That question can't be answered without knowing the clock-rates of DSP and the ARM.
Here is some background:
I just checked the cycles of a floating point multiplication on the c674x DSP:
It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).
You can however start two multiplications each cycle because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
That only uses two of the available eight functional units of the DSP, so while you do the two multiplications you can per cycle also do:
two load/stores (64 bit wide)
six floating point add/subtract instructions (or integer instructions)
Loop control and branching is free and does not cost you anything on the DSP.
That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.
ARM-NEON on the other hand can, in floating point mode:
Issue two multiplications per cycle. Latency is comparable, and the instructions are also pipeline-able like on the DSP. Loading/storing takes extra time as does add/subtract stuff. Loop control and branching will very likely go for free in well written code.
So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.
Now you can check the clock-rates of DSP and the ARM and see what is faster for your job.
Oh, one thing: With well-written DSP code you will almost never see a cache miss during loads because you move the data from RAM to the cache using DMA before you access the data. This gives impressive speed advantages for the DSP as well.

It does depend on the application but, generally speaking, it is rare these days for special purpose processors to beat general-purpose processors. General purpose processors now have have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.

Profiling floating point usage in C

Is there an easy way to count the number of multiplications actually executed by a piece of standard C code? The code I have in mind basically just does additions and multiplications, and it's the multiplications that are of primary interest, but it wouldn't hurt to get counts of the other operations as well.
If it were an option, I suppose I could go around replacing 'a * b' with 'multiply(a, b)' and write a cover function for the native * operator, b/c I really don't care about time performance during this test, but the primary objection to doing that is having to re-work a pile of source code just to run the test.
I have no objection to re-compiling the source, perhaps against some library or with obscure (afaik) options. Valgrind came to mind, but if I understand valgrind's purpose, that's more about tracing values than counting operations.

Compile the source code into assembly language and then search for the multiply instructions.
Note that the optimization level can greatly affect the number that appear. For loops, you would have to determine the scope of multiplies within a loop and factor that into the result, but if the code is fairly constrained or limited in extent, that should be straightforward.

Note: a shameless extrapolation of my comment for as much rep as I can skim.
PAPI has two high-level API functions called PAPI_flips and PAPI_flops which can be used to record the FLOPS as well as the number of floating point operations. Additionally, PAPI offers lots of other performance counter monitoring capability, depending on your processor architecture... cache, bus, memory, branches, etc. I think there is support or support is emerging for graphics accelerators and CUDA/GPGPU.
PAPI will need to be installed on your system, but I think it's widespread enough that installation wouldn't be too painful, if you know what you're doing.
The nice thing about PAPI is that you don't need to know anything about the code; just instrument it (the interface is the same as a stopwatch for FLOPS) and run it. It's based on the actual dynamic execution of your program, so it takes into account things that are hard to account for analytically, such as (pseudo-)random behavior, user/variable input, and related branches.

If your compiler supports soft-float (i.e. using functions with integer implementations to emulate floating-point), you could compiler your program in that mode (-msoft-float in GCC), and use your favorite profiling tool to measure how many times they are invoked.
Many processors also have performance counters that can count the number of floating-point operations that have been retired. Depending on the hardware and OS, you may or may not need some amount of kernel support to take advantage of them.

The best that I can think of is (assuming you're running gdb):
If you could identify the points were multiplications are occurring, you could then set tracepoints just prior to the multiplication (or perhaps just after them depending on the details), then run the program and count the number of tracepoint dumps.
Yes, it is very crude. Certainly there are other solutions; however, I would hesitate to trash my stack for something as simple as a count.

Why do you program in assembly? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have a question for all the hardcore low level hackers out there. I ran across this sentence in a blog. I don't really think the source matters (it's Haack if you really care) because it seems to be a common statement.
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
As far as the assembly goes - is the code written in assembly because you don't want a compiler emitting extra instructions or using excessive bytes, or are you using better algorithms that you can't express in C (or can't express without the compiler mussing them up)?
I completely get that it's important to understand the low-level stuff. I just want to understand the why program in assembly after you do understand it.

I think you're misreading this statement:
For example, many modern 3-D Games have their high performance core engine written in C++ and Assembly.
Games (and most programs these days) aren't "written in assembly" the same way they're "written in C++". That blog isn't saying that a significant fraction of the game is designed in assembly, or that a team of programmers sit around and develop in assembly as their primary language.
What this really means is that developers first write the game and get it working in C++. Then they profile it, figure out what the bottlenecks are, and if it's worthwhile they optimize the heck out of them in assembly. Or, if they're already experienced, they know which parts are going to be bottlenecks, and they've got optimized pieces sitting around from other games they've built.
The point of programming in assembly is the same as it always has been: speed. It would be ridiculous to write a lot of code in assembler, but there are some optimizations the compiler isn't aware of, and for a small enough window of code, a human is going to do better.
For example, for floating point, compilers tend to be pretty conservative and may not be aware of some of the more advanced features of your architecture. If you're willing to accept some error, you can usually do better than the compiler, and it's worth writing that little bit of code in assembly if you find that lots of time is spent on it.
Here are some more relevant examples:
Examples from Games
Article from Intel about optimizing a game engine using SSE intrinsics. The final code uses intrinsics (not inline assembler), so the amount of pure assembly is very small. But they look at the assembler output by the compiler to figure out exactly what to optimize.
Quake's fast inverse square root. Again, the routine doesn't have assembler in it, but you need to know something about architecture to do this kind of optimization. The authors know what operations are fast (multiply, shift) and which are slow (divide, sqrt). So they come up with a very tricky implementation of square root that avoids the slow operations entirely.
High-Performance Computing
Outside the domain of games, people in scientific computing frequently optimize the crap out of things to get them to run fast on the latest hardware. Think of this as games where you can't cheat on the physics.
A great recent example of this is Lattice Quantum Chromodynamics (Lattice QCD). This paper describes how the problem pretty much boils down to one very small computational kernel, which was optimized heavily for PowerPC 440's on an IBM Blue Gene/L. Each 440 has two FPUs, and they support some special ternary operations that are tricky for compilers to exploit. Without these optimizations, Lattice QCD would've run much slower, which is costly when your problem requires millions of CPU hours on expensive machines.
If you are wondering why this is important, check out the article in Science that came out of this work. Using Lattice QCD, these guys calculated the mass of a proton from first principles, and showed last year that 90% of the mass comes from strong force binding energy, and the rest from quarks. That's E=mc2 in action. Here's a summary.
For all of the above, the applications are not designed or written 100% in assembly -- not even close. But when people really need speed, they focus on writing the key parts of their code to fly on specific hardware.

I have not coded in assembly language for many years, but I can give several reasons that I frequently saw:
Not all compilers can make use of certain CPU optimizations and instruction set (e.g., the new instruction sets that Intel adds once in a while). Waiting for compiler writers to catch up means losing a competitive advantage.
Easier to match actual code to known CPU architecture and optimization. For example, things you know about the fetching mechanism, caching, etc. This is supposed to be transparent to the developer, but the fact is that it is not, that's why compiler writers can optimize.
Certain hardware level accesses are only possible/practical via assembly language (e.g., when writing device driver).
Formal reasoning is sometimes actually easier for the assembly language than for the high-level language since you already know what the final or almost final layout of the code is.
Programming certain 3D graphic cards (circa late 1990s) in the absence of APIs was often more practical and efficient in assembly language, and sometimes not possible in other languages. But again, this involved really expert-level games based on the accelerator architecture like manually moving data in and out in certain order.
I doubt many people use assembly language when a higher-level language would do, especially when that language is C. Hand-optimizing large amounts of general-purpose code is impractical.

There is one aspect of assembler programming which others have not mentioned - the feeling of satisfaction you get knowing that every single byte in an application is the result of your own effort, not the compiler's. I wouldn't for a second want to go back to writing whole apps in assembler as I used to do in the early 80s, but I do miss that feeling sometimes...

Usually, a layman's assembly is slower than C (due to C's optimization) but many games (I distinctly remember Doom) had to have specific sections of the game in Assembly so it would run smoothly on normal machines.
Here's the example to which I am referring.

I started professional programming in assembly language in my very first job (80's). For embedded systems the memory demands - RAM and EPROM - were low. You could write tight code that was easy on resources.
By the late 80's I had switched to C. The code was easier to write, debug and maintain. Very small snippets of code were written in assembler - for me it was when I was writing the context switching in an roll-your-own RTOS. (Something you shouldn't do anymore unless it is a "science project".)
You will see assembler snippets in some Linux kernel code. Most recently I've browsed it in spinlocks and other synchronization code. These pieces of code need to gain access to atomic test-and-set operations, manipulating caches, etc.
I think you would be hard pressed to out-optimize modern C compilers for most general programming.
I agree with #altCognito that your time is probably better spent thinking harder about the problem and doing things better. For some reason programmers often focus on micro-efficiencies and neglect the macro-efficiencies. Assembly language to improve performance is a micro-efficiency. Stepping back for a wider view of the system can expose the macro problems in a system. Solving the macro problems can often yield better performance gains.
Once the macro problems are solved then collapse to the micro level.
I guess micro problems are within the control of a single programmer and in a smaller domain. Altering behavior at the macro level requires communication with more people - a thing some programmers avoid. That whole cowboy vs the team thing.

"Yes". But, understand that for the most part the benefits of writing code in assembler are not worth the effort. The return received for writing it in assembly tends to be smaller than the simply focusing on thinking harder about the problem and spending your time thinking of a better way of doing thigns.
John Carmack and Michael Abrash who were largely responsible for writing Quake and all of the high performance code that went into IDs gaming engines go into this in length detail in this book.
I would also agree with Ólafur Waage that today, compilers are pretty smart and often employ many techniques which take advantage of hidden architectural boosts.

These days, for sequential codes at least, a decent compiler almost always beats even a highly seasoned assembly-language programmer. But for vector codes it's another story. Widely deployed compilers don't do such a great job exploiting the vector-parallel capabilities of the x86 SSE unit, for example. I'm a compiler writer, and exploiting SSE tops my list of reasons to go on your own instead of trusting the compiler.

SSE code works better in assembly than compiler intrinsics, at least in MSVC. (i.e. does not create extra copies of data )

I've three or four assembler routines (in about 20 MB source) in my sources at work. All of them are SSE(2), and are related to operations on (fairly large - think 2400x2048 and bigger) images.
For hobby, I work on a compiler, and there you have more assembler. Runtime libraries are quite often full of them, most of them have to do with stuff that defies the normal procedural regime (like helpers for exceptions etc.)
I don't have any assembler for my microcontroller. Most modern microcontrollers have so much peripheral hardware (interrupt controled counters, even entire quadrature encoders and serial building blocks) that using assembler to optimize the loops is often not needed anymore. With current flash prices, the same goes for code memory. Also there are often ranges of pin-compatible devices, so upscaling if you systematically run out of cpu power or flash space is often not a problem
Unless you really ship 100000 devices and programming assembler makes it possible to really make major savings by just fitting in a flash chip a category smaller. But I'm not in that category.
A lot of people think embedded is an excuse for assembler, but their controllers have more CPU power than the machines Unix was developed on. (Microchip coming
with 40 and 60 MIPS microcontrollers for under USD 10).
However a lot people are stuck with legacy, since changing microchip architecture is not easy. Also the HLL code is very architecture dependent (because it uses the hardware periphery, registers to control I/O, etc). So there are sometimes good reasons to keep maintaining a project in assembler (I was lucky to be able to setup affairs on a new architecture from scratch). But often people kid themselves that they really need the assembler.
I still like the answer a professor gave when we asked if we could use GOTO (but you could read that as ASSEMBLER too): "if you think it is worth writing a 3 page essay on why you need the feature, you can use it. Please submit the essay with your results. "
I've used that as a guiding principle for lowlevel features. Don't be too cramped to use it, but make sure you motivate it properly. Even throw up an artificial barrier or two (like the essay) to avoid convoluted reasoning as justification.

Some instructions/flags/control simply aren't there at the C level.
For example, checking for overflow on x86 is the simple overflow flag. This option is not available in C.

Defects tend to run per-line (statement, code point, etc.); while it's true that for most problems, assembly would use far more lines than higher level languages, there are occasionally cases where it's the best (most concise, fewest lines) map to the problem at hand. Most of these cases involve the usual suspects, such as drivers and bit-banging in embedded systems.

If you were around for all the Y2K remediation efforts, you could have made a lot of money if you knew Assembly. There's still plenty of legacy code around that was written in it, and that code occasionally needs maintenance.

Another reason could be when the available compiler just isn't good enough for an architecture and the amount of code needed in the program is not that long or complex as for the programmer to get lost in it. Try programming a microcontroller for an embedded system, usually assembly will be much easier.

Beside other mentioned things, all higher languages have certain limitations. Thats why some people choose to programm in ASM, to have full control over their code.
Others enjoy very small executables, in the range of 20-60KB, for instance check HiEditor, which is implemented by author of the HiEdit control, superb powerfull edit control for Windows with syntax highlighting and tabs in only ~50kb). In my collection I have more then 20 such gold controls from Excell like ssheets to html renders.

I think a lot of game developers would be surprised at this bit of information.
Most games I know of use as little assembly as at all possible. In some cases none at all, and at worst, one or two loops or functions.
That quote is over-generalized, and nowhere near as true as it was a decade ago.
But hey, mere facts shouldn't hinder a true hacker's crusade in favor of assembly. ;)

If you are programming a low end 8 bit microcontroller with 128 bytes of RAM and 4K of program memory you don't have much choice about using assembly. Sometimes though when using a more powerful microcontroller you need a certain action to take place at an exact time. Assembly language comes in useful then as you can count the instructions and so measure the clock cycles used by your code.

Games are pretty performance hungry and although in the meantime the optimizers are pretty good a "master programmer" is still able to squeeze out some more performance by hand coding the right parts in assembly.
Never ever start optimizing your program without profiling it first. After profiling should be able to identify bottlenecks and if finding better algorithms and the like don't cut it anymore you can try to hand code some stuff in assembly.

Aside from very small projects on very small CPUs, I would not set out to ever program an entire project in assembly. However, it is common to find that a performance bottleneck can be relieved with the strategic hand coding of some inner loops.
In some cases, all that is really required is to replace some language construct with an instruction that the optimizer cannot be expected to figure out how to use. A typical example is in DSP applications where vector operations and multiply-accumulate operations are difficult for an optimizer to discover, but easy to hand code.
For example certain models of the SH4 contain 4x4 matrix and 4 vector instructions. I saw a huge performance improvement in a color correction algorithm by replacing equivalent C operations on a 3x3 matrix with the appropriate instructions, at the tiny cost of enlarging the correction matrix to 4x4 to match the hardware assumption. That was achieved by writing no more than a dozen lines of assembly, and carrying matching adjustments to the related data types and storage into a handful of places in the surrounding C code.

It doesn't seem to be mentioned, so I thought I'd add it: in modern games development, I think at least some of the assembly being written isn't for the CPU at all. It's for the GPU, in the form of shader programs.
This might be needed for all sorts of reasons, sometimes simply because whatever higher-level shading language used doesn't allow the exact operation to be expressed in the exact number of instructions wanted, to fit some size-constraint, speed, or any combination. Just as usual with assembly-language programming, I guess.

Almost every medium-to-large game engine or library I've seen to date has some hand-optimized assembly versions available for matrix operations like 4x4 matrix concatenation. It seems that compilers inevitably miss some of the clever optimizations (reusing registers, unrolling loops in a maximally efficient way, taking advantage of machine-specific instructions, etc) when working with large matrices. These matrix manipulation functions are almost always "hotspots" on the profile, too.
I've also seen hand-coded assembly used a lot for custom dispatch -- things like FastDelegate, but compiler and machine specific.
Finally, if you have Interrupt Service Routines, asm can make all the difference in the world -- there are certain operations you just don't want occurring under interrupt, and you want your interrupt handlers to "get in and get out fast"... you know almost exactly what's going to happen in your ISR if it's in asm, and it encourages you to keep the bloody things short (which is good practice anyway).

I have only personally talked to one developer about his use of assembly.
He was working on the firmware that dealt with the controls for a portable mp3 player.
Doing the work in assembly had 2 purposes:
Speed: delays needed to be minimal.
Cost: by being minimal with the code, the hardware needed to run it could be slightly less powerful. When mass-producing millions of units, this can add up.

The only assembler coding I continue to do is for embedded hardware with scant resources. As leander mentions, assembly is still well suited to ISRs where the code needs to be fast and well understood.
A secondary reason for me is to keep my knowledge of assembly functional. Being able to examine and understand the steps which the CPU is taking to do my bidding just feels good.

Last time I wrote in assembler was when I could not convince the compiler to generate libc-free, position independent code.
Next time will probably be for the same reason.
Of course, I used to have other reasons.

A lot of people love to denigrate assembly language because they've never learned to code with it and have only vaguely encountered it and it has left them either aghast or somewhat intimidated. True talented programmers will understand that it is senseless to bash C or Assembly because they are complimentary. in fact the advantage of one is the disadvantage of the other. The organized syntaxic rules of C improves clarity but at the same gives up all the power assembly has from being free of any structural rules ! C code instruction are made to create non-blocking code which could be argued forces clarity of programming intent but this is a power loss. In C the compiler will not allow a jump inside an if/elseif/else/end. Or you are not allowed to write two for/end loops on diferent variables that overlap each other, you cannot write self modifying code (or cannot in an seamless easy way), etc.. conventional programmers are spooked by the above, and would have no idea how to even use the power of these approaches as they have been raised to follow conventional rules.
Here is the truth : Today we have machine with the computing power to do much more that the application we use them for but the human brain is too incapable to code them in a rule free coding environment (= assembly) and needs restrictive rules that greatly reduce the spectrum and simplifies coding.
I have myself written code that cannot be written in C code without becoming hugely inefficient because of the above mentionned limitations. And i have not yet talked about speed which most people think is the main reason for writting in assembly, well it is if you mind is limited to thinking in C then you are the slave of you compiler forever. I always thought chess players masters would be ideal assembly programmers while the C programmers just play "Dames".

No longer speed, but Control. Speed will sometimes come from control, but it is the only reason to code in assembly. Every other reason boils down to control (i.e. SSE and other hand optimization, device drivers and device dependent code, etc.).

If I am able to outperform GCC and Visual C++ 2008 (known also as Visual C++ 9.0) then people will be interested in interviewing me about how it is possible.
This is why for the moment I just read things in assembly and just write __asm int 3 when required.
I hope this help...

I've not written in assembly for a few years, but the two reasons I used to were:
The challenge of the thing! I went through a several-month period years
ago when I'd write everything in x86 assembly (the days of DOS and Windows
3.1). It basically taught me a chunk of low level operations, hardware I/O, etc.
For some things it kept size small (again DOS and Windows 3.1 when writing TSRs)
I keep looking at coding assembly again, and it's nothing more than the challenge and joy of the thing. I have no other reason to do so :-)

I once took over a DSP project which the previous programmer had written mostly in assembly code, except for the tone-detection logic which had been written in C, using floating-point (on a fixed-point DSP!). The tone detection logic ran at about 1/20 of real time.
I ended up rewriting almost everything from scratch. Almost everything was in C except for some small interrupt handlers and a few dozen lines of code related to interrupt handling and low-level frequency detection, which runs more than 100x as fast as the old code.
An important thing to bear in mind, I think, is that in many cases, there will be much greater opportunities for speed enhancement with small routines than large ones, especially if hand-written assembler can fit everything in registers but a compiler wouldn't quite manage. If a loop is large enough that it can't keep everything in registers anyway, there's far less opportunity for improvement.

The Dalvik VM that interprets the bytecode for Java applications on Android phones uses assembler for the dispatcher. This movie (about 31 minutes in, but its worth watching the whole movie!) explains how
"there are still cases where a human can do better than a compiler".

I don't, but I've made it a point to at least try, and try hard at some point in the furture (soon hopefully). It can't be a bad thing to get to know more of the low level stuff and how things work behind the scenes when I'm programming in a high level language. Unfortunately time is hard to come by with a full time job as a developer/consultant and a parent. But I will give at go in due time, that's for sure.

How to calculate MIPS for an algorithm for ARM processor

I have been asked recently to produced the MIPS (million of instructions per second) for an algorithm we have developed. The algorithm is exposed by a set of C-style functions. We have exercise the code on a Dell Axim to benchmark the performance under different input.
This question came from our hardware vendor, but I am mostly a HL software developer so I am not sure how to respond to the request. Maybe someone with similar HW/SW background can help...
Since our algorithm is not real time, I don't think we need to quantify it as MIPS. Is it possible to simply quote the total number of assembly instructions?
If 1 is true, how do you do this (ie. how to measure the number of assembly instructions) either in general or specifically for ARM/XScale?
Can 2 be performed on a WM device or via the Device Emulator provided in VS2005?
Can 3 be automated?
Thanks a lot for your help.
Charles
Thanks for all your help. I think S.Lott hit the nail. And as a follow up, I now have more questions.
5 Any suggestion on how to go about measuring MIPS? I heard some one suggest running our algorithm and comparing it against Dhrystone/Whetstone benchmark to calculate MIS.
6 Since the algorithm does not need to be run in real time, is MIPS really a useful measure? (eg. factorial(N)) What are other ways to quantity the processing requirements? (I have already measured the runtime performance but it was not a satisfactory answer.)
7 Finally, I assume MIPS is a crude estimate and would be dep. on compiler, optimization settings, etc?

I'll bet that your hardware vendor is asking how many MIPS you need.
As in "Do you need a 1,000 MIPS processor or a 2,000 MIPS processor?"
Which gets translated by management into "How many MIPS?"
Hardware offers MIPS. Software consumes MIPS.
You have two degrees of freedom.
The processor's inherent MIPS offering.
The number of seconds during which you consume that many MIPS.
If the processor doesn't have enough MIPS, your algorithm will be "slow".
if the processor has enough MIPS, your algorithm will be "fast".
I put "fast" and "slow" in quotes because you need to have a performance requirement to determine "fast enough to meet the performance requirement" or "too slow to meet the performance requirement."
On a 2,000 MIPS processor, you might take an acceptable 2 seconds. But on a 1,000 MIPS processor this explodes to an unacceptable 4 seconds.
How many MIPS do you need?
Get the official MIPS for your processor. See http://en.wikipedia.org/wiki/Instructions_per_second
Run your algorithm on some data.
Measure the exact run time. Average a bunch of samples to reduce uncertainty.
Report. 3 seconds on a 750 MIPS processor is -- well -- 3 seconds at 750 MIPS. MIPS is a rate. Time is time. Distance is the product of rate * time. 3 seconds at 750 MIPS is 750*3 million instructions.
Remember Rate (in Instructions per second) * Time (in seconds) gives you Instructions.
Don't say that it's 3*750 MIPS. It isn't; it's 2250 Million Instructions.

Some notes:
MIPS is often used as a general "capacity" measure for processors, especially in the soft real-time/embedded field where you do want to ensure that you do not overload a processor with work. Note that this IS instructions per second, as the time is very important!
MIPS used in this fashion is quite unscientific.
MIPS used in this fashion is still often the best approximation there is for sizing a system and determining the speed of the processor. It might well be off by 25%, but never mind...
Counting MIPS requires a processor that is close to what you are using. The right instruction set is obviously crucial, to capture the actual instruction stream from the actual compiler in use.
You cannot in any way approximate this on a PC. You need to bring out one of a few tools to do this right:
Use an instruction-set simulator for the target archicture such as Qemu, ARM's own tools, Synopsys, CoWare, Virtutech, or VaST. These are fast but can count instructions pretty well, and will support the right instruction set. Barring extensive use of expensive instructions like integer divide (and please no floating point), these numbers tend to be usefully close.
Find a clock-cycle accurate simulator for your target processor (or something close), which will give pretty good estimate of pipeline effects etc. Once again, get it from ARM or from Carbon SoCDesigner.
Get a development board for the processor family you are targeting, or an ARM close to it design, and profile the application there. You don't use an ARM9 to profile for an ARM11, but an ARM11 might be a good approximation for an ARM Cortex-A8/A9 for example.

MIPS is generally used to measure the capability of a processor.
Algorithms usually take either:
a certain amount of time (when running on a certain processor)
a certain number of instructions (depending on the architecture)
Describing an algorithm in terms of instructions per second would seem like a strange measure, but of course I don't know what your algorithm does.
To come up with a meaningful measure, I would suggest that you set up a test which allows you to measure the average time taken for your algorithm to complete. Number of assembly instructions would be a reasonable measure, but it can be difficult to count them! Your best bet is something like this (pseudo-code):
const num_trials = 1000000
start_time = timer()
for (i = 1 to num_trials)
{
runAlgorithm(randomData)
}
time_taken = timer() - start_time
average_time = time_taken / num_trials

MIPS are a measure of CPU speed, not algorithm performance. I can only assume the somewhere along the line, someone is slightly confused. What are they trying to find out? The only likely scenario I can think of is they're trying to help you determine how fast a processor they need to give you to run your program satisfactorily.
Since you can measure an algorithm in number of instructions (which is no doubt going to depend on the input data, so this is non-trivial), you then need some measure of time in order to get MIPS -- for instance, say "I need to invoke it 1000 times per second". If your algorithm is 1000 instructions for that particular case, you'll end up with:
1000 instructions / (1/1000) seconds = 1000000 instructions per second = 1 MIPS.
I still think that's a really odd way to try to do things, so you may want to ask for clarification. As for your specific questions, I'll leave that to someone more familiar with Visual Studio.

Also remember that different compilers and compiler options make a HUGE difference. The same source code can run at many different speeds. So instead of buying the 2mips processor you may be able to use the 1/2mips processor and use a compiler option. Or spend the money on a better compiler and use the cheaper processor.
Benchmarking is flawed at best. As a hobby I used to compile the same dhrystone (and whetstone) code on various compilers from various vendors for the same hardware and the numbers were all over the place, orders of magnitude. Same source code same processor, dhrystone didnt mean a thing, not useful as a baseline. What matters in benchmarking is how fast does YOUR algorithm run, it had better be as fast or faster than it needs to. Depending on how close to the finish line you are allow for plenty of slop. Early on on probably want to be running 5 or 10 or 100 times faster than you need to so that by the end of the project you are at least slightly faster than you need to be.
I agree with what I think S. Lott is saying, this is all sales and marketing and management talk. Being the one that management has put between a rock and the hard place then what you need to do is get them to buy the fastest processor and best tools that they are willing to spend based on the colorful pie charts and graphs that you are going to generate from thin air as justification. If near the end of the road it doesnt quite meet performance, then you could return to stackoverflow, but at the same time management will be forced to buy a different toolchain at almost any price or swap processors and respin the board. By then you should know how close to the target you are, we need 1.0 and we are at 1.25 if we buy the processor that is twice as fast as the one we bought we should make it.
Whether or not you can automate these kinds of things or simulate them depends on the tools, sometimes yes, sometimes no. I am not familiar with the tools you are talking about so I cant speak to them directly.

This response is not intended to answer the question directly, but to provide additional context around why this question gets asked.
MIPS for an algorithm is only relevant for algorithms that need to respond to an event within the required time.
For example, consider a controller designed to detect the wind speed and move the actuator within a second when the wind speed crosses over 25 miles / hour. Let us say it takes 1000 instructions to calculate and compare the wind speed against the threshold. The MIPS requirement for this algorithm is 1 Kilo Instructions Per Second (KIPs). If the controller is based on 1 MIPS processor, we can comfortably say that there is more juice in the controller to add other functions.
What other functions could be added on the controller? That depends on the MIPS of the function/algorithm to be added. If there is another function that needs 100,000 instructions to be performed within a second (i.e. 100 KIPs), we can still accommodate this new function and still have some room for other functions to add.

For a first estimate a benchmark on the PC may be useful.
However, before you commit to a specific device and clock frequency you should get a developer board (or some PDA?) for the ARM target architecture and benchmark it there.
There are a lot of factors influencing the speed on today's machines (caching, pipelines, different instruction sets, ...) so your benchmarks on a PC may be way off w.r.t. the ARM.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight