Usage of getc with a file

To print the contents of a file one can use getc:
int ch;
FILE *file = fopen("file.txt", "r");
while ((ch = getc(file)) != EOF) {
    // do something
}
How efficient is the getc function? That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time? For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?

That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time?
You could look into the source code of GNU libc or of musl-libc to study the implementation of getc. You should also study the implementation of cat(1) and wc(1). Both are open source. And GNU as (part of GNU binutils) is free software (used internally by most compilations by GCC) which in practice runs very quickly and does textual manipulation (transforming textual assembler input into binary object files). You could take inspiration from its source code.
You could change the buffer size with setvbuf(3).
You may want to read several bytes at once using fread(3) or fgets(3), probably in chunks of several kilobytes.
You can also use the debugger gdb(1) or the strace(1) utility to find out when syscalls(2) are used and which ones.
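To make the buffering point concrete, here is a minimal sketch comparing a getc loop with chunked fread, plus a setvbuf helper. The function names and the 64 KiB chunk size are my own choices for illustration, not anything prescribed by the C library. On typical implementations getc only touches the operating system when the stdio buffer runs empty, so both loops end up issuing a similar number of read calls.

```c
#include <stdio.h>

/* Count bytes one at a time with getc(). On typical implementations
 * stdio refills its internal buffer once per buffer-full, not once
 * per character. */
long count_with_getc(FILE *f)
{
    long n = 0;
    int ch;                       /* int, so EOF is distinguishable */
    while ((ch = getc(f)) != EOF)
        n++;
    return n;
}

/* Count bytes by pulling 64 KiB chunks with fread(). */
long count_with_fread(FILE *f)
{
    static char chunk[64 * 1024]; /* chunk size is an arbitrary choice */
    long n = 0;
    size_t got;
    while ((got = fread(chunk, 1, sizeof chunk, f)) > 0)
        n += (long)got;
    return n;
}

/* Ask stdio for a larger, fully buffered stream buffer.
 * Call it right after opening, before any other operation on the stream. */
int use_big_buffer(FILE *f, size_t size)
{
    return setvbuf(f, NULL, _IOFBF, size);
}
```

Running either loop under strace(1) on a large file should show how many read calls are really made; that is the number to compare, not the number of getc calls.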
For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?
Very probably not: getc reads from a stdio buffer that is refilled only once per buffer-full, and behind that sits the kernel's page cache.
You should profile and benchmark your program to find out its bottleneck.
Most of the time it won't be getc. See time(7) and gprof(1) (and compile all your code with GCC invoked as gcc -O3 -pg -Wall)
If raw input performance is critical in your program, consider also using open(2), read(2), mmap(2), madvise(2), readahead(2), posix_fadvise(2), close(2) directly and wisely. Most of these syscalls can fail, see errno(3).
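As a rough sketch of the direct-syscall route, the function below counts newline bytes using read(2) with a large buffer and a sequential-access hint. The function name and the 64 KiB buffer size are assumptions made for illustration; the right buffer size is something to find by benchmarking.

```c
#define _POSIX_C_SOURCE 200809L  /* for posix_fadvise() */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Count '\n' bytes in an already-open file descriptor using read(2).
 * Returns -1 on a read error (errno is set by read). */
long count_newlines_fd(int fd)
{
    static char buf[1 << 16];         /* 64 KiB; tune by benchmarking */
    long lines = 0;

    /* Hint that access is sequential; failure is harmless, so ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    for (;;) {
        ssize_t got = read(fd, buf, sizeof buf);
        if (got == 0)
            break;                    /* end of file */
        if (got < 0) {
            if (errno == EINTR)
                continue;             /* interrupted: retry */
            return -1;
        }
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] == '\n')
                lines++;
    }
    return lines;
}
```

Whether this beats a plain getc loop depends on the workload; measure both before committing to the lower-level version.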
You may also change your file system (e.g. from Ext4 to XFS, see ext4(5) and xfs(5)), buy faster SSDs or more physical RAM, or play with mount(2) options, to improve performance.
See also the /proc pseudo-file system (so proc(5)...); and this answer.
You may want to use databases like SQLite or PostgreSQL.
Your program could generate C code at runtime (like manydl.c does), try various approaches (compiling the generated C code /tmp/generated-c-1234.c as a plugin using gcc -O3 -fPIC /tmp/generated-c-1234.c -shared -o /tmp/generated-plugin-1234.so, then dlopen(3)-ing and dlsym(3)-ing that /tmp/generated-plugin-1234.so generated plugin), and use machine learning techniques to find a very good one (specific to the current hardware and computer). It could also generate machine code more directly using asmjit or libgccjit, try several approaches, and choose the best one for the particular situation.
Pitrat's book Artificial Beings and blog (still here) explain this approach in more detail. The conceptual framework is called partial evaluation. See also this.
You could also use existing parser generators like GNU bison or ANTLR. They generate parser source code (C, in the case of bison).
Ian Taylor's libbacktrace could also be useful in such a dynamic metaprogramming approach (generating various forms of C code, and choosing the best ones according to the call stack inspected with dladdr(3)).
Very probably your problem is a parsing problem. So read the first half of the Dragon book.
Before attempting any experimentation, discuss with your manager/boss/client the opportunity to spend months of full time work to gain a few percent of performance. Take into account that the same gain can be obtained by upgrading the hardware.
If your terabyte textual input file does not change often (e.g. is given every week, e.g. in bioinformatics software), it may be worthwhile to preprocess it and transform it, in batch mode, into a binary file, or some SQLite database, or some GDBM indexed file, or some Redis store. Documenting the format of that binary file or database (using EBNF notation, taking inspiration from elf(5)) is then very important.

How to check if `closefrom` can be used for closing file descriptors at runtime?

I'm looking to write some C code that will close all the currently open file descriptors, suitable to be used as part of a fork/exec to make a new process.
I know from the answer here that there are various platform-specific functions to do this efficiently, such as closefrom on Linux, but they're not available in all C libs (like musl) or kernels.
Most solutions I've seen check for the availability of these functions using compile time macros, but I'd like to try to do so at runtime. The idea is to have some code suitable for compiling statically and using on various platforms. The requirements are thus:
If the underlying kernel supports it, make use of closefrom functionality.
Otherwise, fall back to traditional means like looping through all possible file descriptors or reading from /proc.
One idea I had for Linux systems was just to call the syscall number, but I'd love to hear from C experts if this is a good idea:
#include <sys/syscall.h>
#include <unistd.h>

#define CLOSE_RANGE_SYSCALL_NUMBER 436
void my_closefrom(int lowfd, int to) {
    long ret = syscall(CLOSE_RANGE_SYSCALL_NUMBER, lowfd, to, 0);
    if (ret != -1) {
        // Success!
    } else {
        // Otherwise, fall back to another approach
        // ...
    }
}

From comments on the question:
[I'm looking for] the ability to statically link once and then run it on systems different from the one I compiled it on.
I should be able to run this on a Linux system with kernel >= 5.9 and get the performance boost, or run it on an earlier kernel and not. AFAIK the only "ABI" compatibility that would be relevant would be whether the kernel provides the desired syscall
Well no. There is also the question of what the syscall number is. And for some syscalls, what the argument format is expected to be. And since you intend to link statically, you don't get to look only at the closefrom syscall, you need to consider all attributes of all of them.
In other words, I don't want to depend on glibc's closefrom at all. (It's just a shallow wrapper around the syscall after all.) I just want to make use of kernel functionality when it's available and I'm interested in advice on whether this is a reasonable idea.
No, it is not a reasonable idea.
The purpose of the system call wrapper functions is to abstract kernel details from userspace programs. This protects you from issues such as the Linux system call numbers being different for different architectures, including x86 vs. x86_64, or the occasional change in system call numbers for a given arch. It also gives you a measure of source compatibility with other systems, such as macOS, the BSDs, and Solaris. Overall, the wrapper functions are the stable kernel interface for userspace programs.
I cannot imagine making a direct syscall without being confident that the syscall number I requested was associated with the system function I wanted. That is exactly the kind of thing that might test successfully enough to release, and then fail mysteriously and / or devastatingly in the field, probably a couple of years later, after I've forgotten all about my nasty hack.
Better solutions include:
lowest common denominator approaches. That is, things that work on all supported machines. That you thereby forgo faster alternatives available on a subset of supported machines is a cost of broad portability. If it's fast enough on machines that don't have (e.g.) closefrom(), then why do you need to make it faster on systems that do have that function? And how much speed do you really gain?
compile-time selection. You said you want to avoid this, but there's a reason that it is the usual approach for tuning programs to the capabilities of host machines. With static linking you don't need to worry about the runtime host's C standard library, but you do need to worry about kernel version. A common approach is to provide two (or more) binaries targeting different, possibly overlapping, ranges of kernel versions.
working around your need for the feature in the first place. For the particular case of closing files at fork / exec, you could
set files as close-on-exec when you open them, OR
register fork handlers (pthread_atfork()) as needed to close the files when the program forks. This should work even if the program's initial thread is its only one.
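The close-on-exec option can be sketched as follows. The helper names are invented for this example; O_CLOEXEC and FD_CLOEXEC themselves are standard POSIX.1-2008 flags. Setting the flag atomically at open time avoids a window in which another thread could fork and exec between the open and a later fcntl call.

```c
#define _POSIX_C_SOURCE 200809L  /* for O_CLOEXEC */
#include <fcntl.h>

/* Open a file read-only with close-on-exec set atomically at open time. */
int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Mark an existing descriptor close-on-exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Report whether close-on-exec is set: 1 yes, 0 no, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    return flags == -1 ? -1 : (flags & FD_CLOEXEC) != 0;
}
```

With every descriptor opened this way, the kernel closes them on exec and no closefrom-style sweep is needed for descriptors your own code created. Descriptors opened by third-party libraries are outside this scheme.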

Low level languages and their dependencies

I am trying to understand exactly what it means that low-level languages are machine-dependent.
Let's take for example C, well if it is machine-dependent does it mean that if it was compiled on one computer it might not be able to run on another?

In the end, processors execute machine code, which is basically a collection of binary numbers. The processor decodes each binary number to figure out what it is supposed to do. One binary number could mean "Add register X to register Y and store the result in register Z". Another binary number could mean "Store the content of register X into the memory address held by register Y". And so on...
The complete description of these decoding rules (i.e. binary number into operation) represents the processor's instruction set (aka ISA).
A low level language is a language where the code you can write maps very closely to the specific processor's instruction set. Assembly is one obvious example. Since different processors may have different instruction sets, it's clear that an assembly program written for one processor's ISA can't be used on a processor with a different ISA.
Let's take for example C, well if it is machine-dependent does it mean that if it was compiled on one computer it might not be able to run on another?
Correct. A program compiled for one processor (family) can't run on another processor with (completely) different ISA. The program needs to be recompiled.
Also notice that the target OS plays a role. If you use the same processor but a different OS, you'll also need to recompile.
There are at least 3 different kinds of languages.
A language that is so close to the target system's ISA that the source code can only be used on that specific target. Example: Assembly
A language that allows you to write code that can be used on many different targets using a target specific compilation. Example: C
A language that allows you to write code that can be used on many different targets without a target specific compilation. These still require some kind of target specific runtime environment to be installed. Example: Java.

High-level languages are portable, meaning every architecture can run high-level programs but, compared to low-level programs (like those written in Assembly or even machine code), they are less efficient and consume more memory.
Low-level programs are known as "closer to the hardware" and so they are optimized for a certain type of hardware architecture/processor, being faster programs, but relatively machine-dependent or not very portable.
So, a program compiled for one type of processor is not valid for other types; it needs to be recompiled.

In the before
When the first processors came out, there was no programming language whatsoever. You had a very long and very complicated documentation with a list of "opcodes": the code you had to put into memory for a given operation to be executed in your processor. To create a program, you had to put a long string of numbers in memory, and hope everything worked as documented.
Later came Assembly languages. The point wasn't really to make algorithms easier to implement or to make the program readable by any human without any experience on the specific processor model you were working with, it was created to save you from spending days and days looking up things in a documentation. For this reason, there isn't "an assembly language" but thousands of them, one per instruction set (which, at the time, basically meant one per CPU model)
At this point in time, all languages were platform-dependent. If you decided to switch CPUs, you'd have to rewrite a significant portion (if not all) of your code. Recognizing that as a bit of a problem, someone created the first platform-independent language (according to this SE question it was FORTRAN in 1954) that could be compiled to run on any CPU architecture as long as someone made a compiler for it.
Fast forward a bit and C was invented. C is a platform-independent programming language, in the sense that any C program (as long as it conforms with the standard) can be compiled to run on any CPU (as long as this CPU has a C compiler). Once a C program has been compiled, the resulting file is a platform-dependent binary and will only be able to run on the architecture it was compiled for.
C is platform-dependent
There's an issue though: a processor is more than just a list of opcodes. Most processors have hardware control devices like watchdogs or timers that can be completely different from one architecture to another, even the way to talk to other devices can change completely. As such, if you want to actually run a program on a CPU, you have to include things that make it platform-dependent.
A real life example of this is the Linux kernel. The majority of the kernel is written in C but there's still around 1% written in different kinds of assembly. This assembly is required to do things such as initialize the CPU or use timers. Using this hack means Linux can run on your desktop x86_64 CPU, your ARM Android phone or a RISCV SoC but adding any new architecture isn't as simple as just "compile it with your architecture's compiler".
So... Did I just say the only way to run a platform-independent program on an actual processor is to use platform-dependent code? Yes, for most architectures, you have to.
Or is it?
But there's a catch! That's only true if you want to run your code on bare metal (meaning: without an OS). One of the great things about using an OS is how abstracted everything is: you don't need to know how the kernel initializes the CPU, nor do you need to know how it gets its clock, you just need to know how to access those abstracted resources.
But the way of accessing resources depends on the OS, so aren't we back to square one? We could be, if not for the standard library! This library is used to access functions like printf in a defined way. It doesn't matter if you're working on a Linux running on PowerPC or on an ARM Windows, printf will always print things on the standard output the same way.
If you write standard C using only the standard library (and intend for your program to run in an OS) C is completely platform-independent!
EDIT: As said in the comments below, even that is not enough. It doesn't really have anything to do with specific CPUs, but some things such as the system function or the size of some types are documented as implementation-defined. To make C really platform-independent you need to make sure to only use well-defined functions of the standard library and learn some best practices (never rely on sizeof(int)==4 for instance).
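As a small illustration of not relying on sizeof(int), here is a sketch that stores and loads a 32-bit value using the fixed-width types from <stdint.h>. The function names are made up for the example. Because the code shifts values instead of copying raw memory, it behaves the same whatever the host's int size or byte order.

```c
#include <stdint.h>

/* Store a 32-bit value as 4 bytes, most significant first,
 * regardless of the host's int size or byte order. */
void put_u32_be(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Load the value back from the same 4-byte layout. */
uint32_t get_u32_be(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```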

Thinking about 'what's a program' might help you understand your question. Is a program a collection of text (that you've typed in or otherwise manufactured) or is it something you run? Is it both?
In the case of a 'low-level' language like C I'd say that the text is the program source, and that this is turned into a program (aka executable) by a compiler. A program is something you can run. You need a C compiler for a system to be able to make the program source into a program for that system. Once built, the program can only be run on systems close to the one it was compiled for. However there is a more interesting, if more difficult, question: can you at least keep the program source the same, so that all you need to do is recompile? The answer to this is 'sort-of No', I sort-of think. For example you can't, in pure C, read the state of the shift key. Of course operating systems provide such facilities and you can interface to those in C, but then such code depends on the OS. There might be libraries (e.g. the curses library) that provide such facilities for many OSes and that can help to reduce the dependency, but no library can claim to portably cover all OSes.
In the case of a 'higher-level' language like python I'd say the text is both the program and the program source. There is no separate compilation stage with such languages, but you do need an interpreter on a system to be able to run your python program on that system. That this is happening may not be clear to the user, as you may well seem to be able to run your python 'program' just by naming it, like you run your C programs. But this most likely comes down to the shell (the part of the OS that deals with commands) knowing about python programs and invoking the interpreter for you. It can appear then that you can run your python program anywhere, but in fact what you can do is pass the program to any python interpreter.
In the zoo of programming there are not only many, very varied beasts, but new kinds of beasts arise all the time, and old beasts metamorphose. Terms like 'program', 'script' and even 'executable' are often used loosely.

What is __latent_entropy used for in C

I would like to understand in which cases we use the keyword __latent_entropy in a C function signature.
I saw some Google results talking about a GCC plugin, but I still don't understand what its impact is.
Thanks

You can have a look at the Kconfig's description of what enabling latent_entropy GCC plugin does (it also has a mention of its impact in Linux' performance):
config GCC_PLUGIN_LATENT_ENTROPY
    bool "Generate some entropy during boot and runtime"
    help
      By saying Y here the kernel will instrument some kernel code to
      extract some entropy from both original and artificially created
      program state. This will help especially embedded systems where
      there is little 'natural' source of entropy normally. The cost
      is some slowdown of the boot process (about 0.5%) and fork and
      irq processing.
      Note that entropy extracted this way is not cryptographically
      secure!
      This plugin was ported from grsecurity/PaX. More information at:
       * https://grsecurity.net/
       * https://pax.grsecurity.net/
Here you'll find a more detailed description of the latent_entropy GCC plugin. Some content taken from the link:
...
this is where the new gcc plugin comes in: we can instrument the kernel's
boot code to do some hash-like computation and extract some entropy from
whatever program state we decide to mix into that computation. a similar
idea has in fact been implemented by Larry Highsmith of Subreption fame
in http://www.phrack.org/issues.html?issue=66&id=15 where he (manually)
instrumented the kernel's boot code to extract entropy from a few kernel
variables such as time (jiffies) and context switch counts.
the latent entropy plugin takes this extraction to a whole new level. first,
we define a new global variable that we mix into the kernel's entropy pools
on each initcall. second, each initcall function (and all other boot-only
functions they call) gets instrumented to compute a 'random' number that
gets mixed into this global variable at the end of the function (you can
think of it as an artificially created return value that each instrumented
function computes for our purposes). the computation is a mix of add/xor/rol
(the happy recovery Halvar mix :) with compile-time chosen random constants
and the sequence of these operations follows the instrumented functions's
control flow graph. for the rest of the gory details see the source code ;).
...
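To picture what the quoted description means, here is a hand-written user-space sketch of the kind of instrumentation the plugin is described as inserting. This is not the plugin's actual output: the variable name, the function, the constants and the rotation count are all invented for illustration, and the real plugin picks its constants randomly at compile time and works on the compiler's internal representation.

```c
#include <stdint.h>

/* Stand-in for the global that gets mixed into the entropy pool. */
static uint64_t latent_entropy_sketch;

/* Rotate left; n must be in 1..63. */
static uint64_t rol64(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}

/* A function "instrumented" by hand. The constants are arbitrary
 * placeholders standing in for compile-time random values. */
int instrumented_example(int arg)
{
    uint64_t local = 0x9e3779b97f4a7c15ULL;   /* made-up start value */
    int result;

    if (arg > 0) {
        local += 0x2545f4914f6cdd1dULL;       /* mixed in on this branch */
        result = arg / 2;
    } else {
        local ^= 0xd6e8feb86659fd93ULL;       /* mixed in on the other one */
        result = 0;
    }
    local = rol64(local, 13);

    /* At function exit, fold the path-dependent value into the global. */
    latent_entropy_sketch ^= local;
    return result;
}
```

The point of the sketch is only that the value folded into the global depends on which path through the function was taken, which is what "follows the instrumented function's control flow graph" refers to. As the Kconfig help says, entropy gathered this way is not cryptographically secure.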

when should we care about cache missing?

I want to explain my question through a practical problem I met in my project.
I am writing a C library (which behaves like a programmable vi editor), and I plan to provide a series of APIs (more than 20 in total):
void vi_dw(struct vi *vi);
void vi_de(struct vi *vi);
void vi_d0(struct vi *vi);
void vi_d$(struct vi *vi);
...
void vi_df(struct vi *, char target);
void vi_dd(struct vi *vi);
These APIs do not perform core operations, they are just wrappers. For example, I can implement vi_de() like this:
void vi_de(struct vi *vi){
    vi_v(vi); //enter visual mode
    vi_e(vi); //press key 'e'
    vi_d(vi); //press key 'd'
}
However, if the wrapper is as simple as such, I have to write more than 20 similar wrapper functions.
So, I consider implementing more complex wrappers to reduce the amount:
typedef void (*vi_move_func_t)(struct vi *vi); // a move such as vi_w or vi_e
void vi_d_move(struct vi *vi, vi_move_func_t move){
    vi_v(vi);
    move(vi);
    vi_d(vi);
}
static inline void vi_dw(struct vi *vi){
    vi_d_move(vi, vi_w);
}
static inline void vi_de(struct vi *vi){
    vi_d_move(vi, vi_e);
}
...
The function vi_d_move() is a better wrapper: it can turn a whole group of similar move operations into APIs, but not all of them. vi_f(), for example, needs another wrapper with a third argument, char target.
That is the end of the example from my project. The pseudocode above is simpler than the real case, but it is enough to show that:
The more complex a wrapper is, the fewer wrappers we need, and the slower each will be (it becomes more indirect or has to consider more conditions).
There are two extremes:
use only one wrapper, complex enough to handle all move operations and convert them into the corresponding APIs.
use more than twenty small and simple wrappers, one wrapper per API.
For case 1, the wrapper itself is slow, but it is more likely to stay resident in cache, because it is executed often (all APIs share it). It's a slow but hot path.
For case 2, the wrappers are simple and fast, but each is less likely to be resident in cache. At the least, a cache miss will happen the first time any API is called (the CPU needs to fetch the instructions from memory rather than from L1 or L2).
Currently, I have implemented five wrappers, each of them relatively simple and fast. This seems to be a balance, but it only seems so. I chose five just because I felt the move operations divide naturally into five groups. I have no idea how to evaluate it. I don't mean with a profiler; I mean, in theory, what main factors should be considered in such a case?
At the end of the post, I want to add more detail about these APIs:
1. These APIs need to be fast, because this library is designed as a high-performance virtual editor. The delete/copy/paste operations are designed to approach the speed of bare C code.
2. A user program based on this library seldom calls all of these APIs, only some of them, and usually no more than 10 times each.
3. In the real case, these simple wrappers are about 80 bytes each, and would be no more than 160 bytes even if merged into a single complex one (but that would introduce more if-else branches).
4. As for the situation in which the library is used, I will take lua-shell as an example (a little off-topic, but some friends want to know why I care so much about its performance):
lua-shell is a *nix shell which uses Lua as its scripting language. Its command execution unit (which forks, executes, ...) is just a C module registered into the Lua state machine.
lua-shell treats everything as Lua.
So, when the user inputs:
local files = `ls -la`
and presses Enter, the input string is first sent to lua-shell's preprocessor, which converts the mixed syntax to pure Lua code:
local files = run_command("ls -la")
run_command() is the entry point of lua-shell's command execution unit, which, as I said before, is a C module.
We can talk about libvi now. lua-shell's preprocessor is the first user of the library I am writing. Here is the relevant code (pseudo):
#include"vi.h"
vi_loadstr("local files = `ls -la`");
vi_f(vi, '`');
vi_x(vi);
vi_i(vi, "run_command(\"");
vi_f(vi, '`');
vi_x(vi);
vi_a(" \") ");
The code above is part of lua-shell's preprocessor implementation.
After generating the pure Lua code, it feeds it to the Lua state machine and runs it.
The shell user is sensitive to the time interval between pressing Enter and seeing a new prompt, and in most cases lua-shell needs to preprocess larger scripts with more complicated mixed syntax.
This is a typical situation where libvi is used.

I won't care that much about cache misses (especially in your case), unless your benchmarks (with compiler optimizations enabled, i.e. compile with gcc -O2 -mtune=native if using GCC....) indicate that they matter.
If performance matters that much, enable more optimizations (perhaps compiling and linking your entire application or library with gcc -flto -O2 -mtune=native, that is, with link-time optimization), and hand-optimize only what is critical. You should trust your optimizing compiler.
If you are in the design phase, consider perhaps making your application multi-threaded or somehow concurrent and parallel. With care, this could speed it up more than cache optimizations.
It is unclear what your library is about and what your design goals are. A possibility to add flexibility might be to embed some interpreter (like Lua or Guile or Python, etc.) in your application, hence configuring it through scripts. In many cases, such an embedding could be fast enough (especially when the application-specific primitives are of high enough level). Another (more complex) possibility is to provide metaprogramming abilities, perhaps through some JIT compiling library like libjit or libgccjit (so you would sort-of "compile" user scripts into dynamically produced machine code).
BTW, your question seems to focus on instruction cache misses. I would believe that data cache misses are more important (and less optimizable by the compiler), and that is why you would prefer e.g. vectors to linked lists (and more generally care about low-level data structures, focusing on using sequential -or cache-friendly- accesses)
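To illustrate the vectors-versus-linked-lists remark, here is a minimal sketch of the two traversal patterns. The struct and function names are only for illustration. Both functions compute the same sum; the difference lies in the memory access pattern, and how much that matters on your machine is something to benchmark rather than assume.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Contiguous traversal: consecutive elements share cache lines,
 * and the hardware prefetcher can follow the access pattern. */
long sum_array(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Pointer chasing: each step depends on loading the previous node,
 * and nodes may be scattered across the heap. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}
```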
(you could find a good video by Herb Sutter which explains that last point; I forgot the reference)
In some very specific cases, with recent GCC or Clang, adding a few __builtin_prefetch might slightly improve performance (by decreasing cache misses), but it could also harm it significantly, so I don't recommend using it in general, but see this.

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.

Simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years; for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARM's compiler produced significantly better code than gcc, as much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc by in terms of performance and overall use for cross compiling embedded code. Try different versions of gcc, 3.x and 4.x, if you are using gcc. Metaware's compiler and ARM's ADT ran circles around gcc 3.x; gcc 3.x will give gcc 4.x a run for its money with ARM code; for Thumb code gcc 4.x is better, and for Thumb-2 (which doesn't apply to you) gcc 4.x is also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full program optimization, in addition to having many more tuning knobs than gcc. Despite that, the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didn't try the n factorial number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles). My theory is: do no optimization on the C to bc steps, link all the bc together, then do a single optimization pass on the whole program, then allow the default optimization when llc takes it to the target.
By the same token, simply knowing your compiler and its optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11: are you compiling for ARM11 or generic ARM? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family (armv6 for example) to target, rather than the generic armv4 (ARM7) that is often chosen as the default. Know to use -O2, or -O3 if you are brave.
It is often not the case, but switching to Thumb mode can improve performance on specific platforms. It doesn't apply to you, but the Game Boy Advance is a perfect example, loaded with non-zero wait state 16 bit busses. Thumb has a few percent overhead because it takes more instructions to do the same thing, but by speeding up instruction fetches, and taking advantage of some of the sequential read features of the GBA, Thumb code can run significantly faster than ARM code for the same source code.
Having an ARM11 you probably have an L1 and maybe an L2 cache: are they on? Are they configured? Do you have an MMU, and is your heavily used memory cached? Or are you running zero wait state memory, don't need a cache, and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often don't realize that when you use a cache, simply adding one to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, a few words) can change your code's execution speed by as much as 10 to 20 percent. Where those cache line reads hit in heavily used functions/loops makes a big difference. Even saving one cache line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or 2 to 1 for example).
Knowing your architecture, both the processor and your memory environment, is where the tuning, if any, would start. Most C libraries, if you are high level enough to use one (I often don't use a C library as I run without an operating system and with very limited resources), tune bottleneck routines like memcpy both in their C code and sometimes with added assembler to make them much faster. If your programs are operating on aligned 32 or, even better, 64 bit addresses, and you adjust every structure/array/memcpy to be an integral multiple of 32 bits or 64 bits, even if it means using a handful of bytes more memory, you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting the size of your structures aligned (if you use them, I certainly don't with embedded code), even if you waste memory, get the elements aligned: consider using 32 bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too, btw). As with the GBA example above, look at specific functions that, either by profiling or intuition, you know are not being implemented in a manner that takes advantage of your processor or platform or libraries; you may want to turn to assembler, either from scratch or by compiling from C initially, then disassembling and hand tuning. Memcpy is a good example: you may know your system's memory performance and may choose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
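A minimal sketch of the "own memcpy for aligned data" idea, written in plain C rather than assembler. The function names are invented here, and the 32-bit word size is an assumption; a hand-tuned version would likely use the target's load/store-multiple instructions. A library memcpy is often already well tuned, so time this against it before adopting it.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words. Both pointers must be 4-byte aligned
 * and the regions must not overlap. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords--)
        *dst++ = *src++;
}

/* Round a byte count up to a multiple of 4 so that buffers can be
 * sized for word-wise copying. */
size_t round_up_to_word(size_t nbytes)
{
    return (nbytes + 3u) & ~(size_t)3u;
}
```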
Likewise mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isnt necessarily true, depends on how deeply embedded and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive, but at the same time, depending on the code within a function that has now grown in size by avoiding functions, you may create the problem you were trying to avoid, evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fits in registers to having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain compile and disassemble (and look at register usage, how often it fetches memory or writes to memory). You can and will find places where you need to take a well used loop and make it its own function even though the function call has a penalty because by doing that the compiler can better optimize the loop and not evict/reuse registers and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug; turn all of the compile for debug options off. You may already know that code compiled for debug that runs without bugs isn't necessarily debugged: compiling for debug and using debuggers can hide bugs, leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version, both for performance and for finding bugs in your code.
Most instruction sets do not have a divide instruction. Avoid using divides or modulo in your code as much as humanly possible; they are performance killers. Naturally this is not the case for powers of two: to save the compiler the work and to mentally avoid divides and modulos, try to use shifts and ands. Multiplies are easier and more often found in instruction sets, but are still costly. This is a good case to write assembler to do your multiplies instead of letting the C compiler do it. The ARM multiply is 32 bit * 32 bit = 32 bit, so to do accurate math without overflowing there has to be extra C code wrapped around the multiply. If you already know you won't overflow, burn the registers for a function call and do the multiply in assembler (for the ARM).
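The shifts-and-ands advice for powers of two could look like the sketch below. The helper names are invented for illustration, and the identities only hold for unsigned values (for negative signed values a shift does not round the way C division does). It assumes shift is less than 32 and that size_pow2 really is a power of two.

```c
#include <stdint.h>

/* x / 2^shift for unsigned x; assumes shift < 32. */
static inline uint32_t div_pow2(uint32_t x, unsigned shift)
{
    return x >> shift;
}

/* x % 2^shift for unsigned x; assumes shift < 32. */
static inline uint32_t mod_pow2(uint32_t x, unsigned shift)
{
    return x & ((UINT32_C(1) << shift) - 1u);
}

/* Advance a ring buffer index without a modulo.
 * Assumes size_pow2 is a power of two (8, 16, 32, ...). */
static inline uint32_t ring_next(uint32_t index, uint32_t size_pow2)
{
    return (index + 1u) & (size_pow2 - 1u);
}
```

Sizing buffers and tables as powers of two up front is what makes these replacements available; a compiler will often do the same substitution itself when the divisor is a compile-time constant, so check the disassembly before and after.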
Likewise most instruction sets do not have a floating point unit. Yours might, but even so, avoid float if at all possible. If you have to use float, that is a whole other Pandora's box of performance issues. Most folks don't see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating point accuracy, and how good or bad the C libraries are at just getting your constants into floating point form. Again, float is a whole other long discussion on performance problems.
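One cost hiding in the snippet above is a property of the C language itself, stated here as background rather than as something this answer spells out: an unsuffixed constant such as 7.0 has type double, so b is converted to double, the multiply is done in double precision, and the result is converted back to float. On a target with no hardware double support this typically means calls into software floating point routines. The sketch below contrasts the two spellings; the function names are made up for the example.

```c
/* The constant 7.0 is a double, so this multiply is
 * carried out in double precision and then narrowed. */
float scale_double_constant(float b)
{
    return b * 7.0;
}

/* The f suffix makes the constant a float, so the
 * multiply can stay in single precision. */
float scale_float_constant(float b)
{
    return b * 7.0f;
}
```

Compiling both and comparing the disassembly is a quick way to see whether your toolchain emits extra conversions or library calls for the first version.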
I am a product of Michael Abrash (I actually have a print copy of Zen of Assembly Language) and the bottom line is: time your code. Come up with an accurate way to time the code. You may think you know where the bottlenecks are and you may think you know your architecture, but by trying different things, even ones you think are wrong, and timing them, you may find, and eventually have to figure out, the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this: all the other work you have done for performance can be instantly erased by not having a good alignment with the cache. This also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache line alignments.
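A minimal sketch of "time your code" on a hosted system, using the standard clock() function. The names time_per_call and sample_work are invented for this example, and sample_work is only a stand-in for whatever routine you are tuning. On bare metal there may be no usable clock(), in which case a hardware timer or cycle counter would take its place.

```c
#include <time.h>

/* Stand-in workload; volatile keeps the loop from being optimized away. */
static volatile unsigned long sink;

void sample_work(void)
{
    for (unsigned i = 0; i < 1000u; i++) {
        sink += i;
    }
}

/* Hypothetical helper: run fn() `iterations` times and return the
 * average CPU time per call in seconds, or -1.0 if it cannot be measured. */
double time_per_call(void (*fn)(void), unsigned long iterations)
{
    if (iterations == 0) {
        return -1.0;
    }
    clock_t start = clock();
    if (start == (clock_t)-1) {
        return -1.0;
    }
    for (unsigned long i = 0; i < iterations; i++) {
        fn();
    }
    clock_t end = clock();
    if (end == (clock_t)-1) {
        return -1.0;
    }
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)iterations;
}
```

Calling time_per_call(sample_work, 100000) before and after each change gives a number to compare. Many iterations are needed because the resolution of clock() is coarse relative to a single short call.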

Code Review:
What are good code review techniques?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klocwork
Tools for dynamic analysis: Valgrind, Purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
The steps are the same.
Apart from what is listed in point 1, there are tools like memcheck etc.
There is a big list here based on platform

Phew!! Quite a big question!
What are generally the steps to
analyze and improve performance of C
applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. the code doesn't do what you want it to do! These are best caught by code reviews first, then testing. Performance is often improved by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), and using fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other people's eyes looking at the code. Don't get too many people; try to get 3 other people if possible. Sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units, but once written the tests can be run again and again very quickly. Manual testing requires planning, the self-discipline to perform all the tests to the required standard, and the imagination to think up scenarios that may impair performance, and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance analysis can be different on embedded systems compared to application systems; with the very broad brush that "embedded" now covers, it depends how hardware-centric you are. It can be done using profilers. If you want a more cheap and cheerful method, use test output pins to measure sections of code, or measure them with breakpoints on the simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impinging on other tasks so that your scheduled tasks are not completed in time.
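Tracking both the typical and the worst-case duration could be sketched as below. The task_timing type and its functions are hypothetical names for this example, and the elapsed_ticks values are assumed to come from whatever timer, pin measurement, or simulator cycle count you already have.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for one task's measured durations. */
typedef struct {
    uint32_t count;        /* number of samples recorded     */
    uint32_t max_ticks;    /* longest duration seen so far   */
    uint64_t total_ticks;  /* sum of all durations           */
} task_timing;

/* Record one measured duration, in timer ticks. */
void task_timing_record(task_timing *t, uint32_t elapsed_ticks)
{
    t->count++;
    t->total_ticks += elapsed_ticks;
    if (elapsed_ticks > t->max_ticks) {
        t->max_ticks = elapsed_ticks;
    }
}

/* Mean duration in ticks, or 0 if nothing has been recorded.
 * Uses a divide, so call it when reporting, not in the hot path. */
uint32_t task_timing_mean(const task_timing *t)
{
    return t->count ? (uint32_t)(t->total_ticks / t->count) : 0u;
}
```

Comparing max_ticks, rather than the mean, against the task's time slot is what shows whether one task can overrun into the next.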
What tools are out there which can
help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!

My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).
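A small sketch of the three suggestions above: a macro, an inlined function, and a section marked for heavier optimization. All names are invented for the example. The pragma is guarded so it only applies when building with GCC itself, and the push/pop pair limits it to the one function.

```c
/* Macro version: no call overhead, but the argument is
 * evaluated more than once, so avoid side effects in x. */
#define CLAMP_U8_MACRO(x) ((x) < 0 ? 0 : ((x) > 255 ? 255 : (x)))

/* Inline version: type checked, argument evaluated once.
 * The compiler decides whether it is actually inlined. */
static inline int clamp_u8(int x)
{
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("O3")
#endif

/* Hot loop compiled with O3 under GCC, whatever the file default is. */
long sum_clamped(const int *values, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += clamp_u8(values[i]);
    }
    return total;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```

As the answer says, look at the disassembler listing to confirm that clamp_u8 really was inlined and to compare code size against the version without the pragma.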

It is a difficult question to answer briefly, since various techniques have been proposed, such as flowcharts and state diagrams, so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 -- Joseph Yiu
C Programming for Embedded Systems -- Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ -- Michael Barr
An Embedded Software Primer -- David E. Simon
Embedded Microprocessor Systems, 3rd Edition -- Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components -- G. Nicolescu & A. A. Jerraya
Embedded Systems: Modeling, Technology and Applications -- Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture -- Graham Wilson
Designing Embedded Hardware -- John Catsoulis

You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No, they don't.
Depending on the platform you're developing on:
Windows: AMD CodeAnalyst, VTune, Sleepy
Linux: valgrind / callgrind / cachegrind
Mac: the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.
