VHDL multidimensional arrays: advices and good design practices - arrays

I'm working with Xilinx ISE on a Spartan-6, which is driving a complex board with multiple functions. As you can imagine the VHDL project is becoming pretty complex and as a C++ programmer I feel the need of using arrays to compact the code.
I've already tried using those in the past but I had many problems. At the time I wasn't very experienced and a lot of other errors where present, which I solved after ditching the array structures.
Another problem I encountered was the impossibility to simulate the post-translate (again with arrays), but as I discovered that simulation is bugged because it doesn't initialize the LUTs created.
So here are the questions: what precautions do I have to keep in mind when using arrays? What are the most important design practices with arrays? Will I have problems in simulating sub-modules with the post-map or the post-PAR simulation?

The only complication that I can see with using an array vs using a "flat" signal by name is related to optimization that happens during the build process and how the names of things can be modified so that they are not easily recognizable. As you may have noticed, this is particularly true for multi-dimensional arrays and it is also true with records. Other than that, if you use a flat structure, or an array, they should implement the same, assuming you are describing the same structure in two different ways.
It may be difficult to determine the "new name" for your array due to the optimizations/renaming that took place, but this should not be confused with different design implementation on the hardware level.
When using arrays the syntax can be particularly troublesome when dealing with initialization. Be sure to consult your synthesis tool reference manual or user guide for support syntax and constructs.
In my opinion, the "cost" of using an array (or record) in terms of additional complication in post-synthesis activities is greatly outweighed by the simplification and clarity that can be gained in the code.

Arrays and records, and combinations of these, are one of the great strengths of VHDL, and makes it possible to create functional, readable, scalable and maintainable code. These types have always been in VHDL, and I have not encountered any problems in tools due to these types. VHDL tools are generally so mature that it is in more subtle features that you may encounter bugs.
If you encounter interface changes in post-map/PAR netlist, then I think it should be addresses by a wrapper round the design in order to fit into an existing test bench.
Since the majority of the simulation verification is (should be) made at VHDL level (any post-map/PAR simulation is only for sanity check of STA) it is much more important to have a high level and abstract view of the design during such simulation and debugging, instead of designing at bit level just to make the design match at post-map/PAR simulation.

I use arrays for two things.
First, to make code concise and simple. For example:
signal cnt1 : unsigned(10 downto 0);
signal cnt2 : unsigned(10 downto 0);
...
signal cnt9 : unsigned(10 downto 0);
Can be replaced by:
type cnt_arr_t is array(1 to 9) of unsigned(10 downto 0);
signal cnt : cnt_arr_t;
I never had any "problem" in Xilinx using arrays like this, they behave the same as defining multiples signals, they can be multi-dimensionnal, you can pass them to function, have them on entity ports, etc. The XST user guide specifies that multi-dimensional arrays can't be used as an entity port, but Molten Zilmer uses them without having issues.
The second, more common use of arrays is to define memories (RAM, ROM, dual-port RAM, etc). In that case, you have to be careful how you use the array or the design won't map to memory resources. A 512x32 memory use 1 blockRAM or lots of LUTs/register. Memories are smaller, faster and less power intensive for large depth values.
Look at the XST synthesis guide for how to use array to define memories. As a general rule, your array indices should start at 0 (1 to 2048 is not equivalent to 0 to 2047). Also, multi-dimensionnal arrays do not map to memory resources, at least when I tried it with an older version of XST. Otherwise, arrays of std_logic_vector/unsigned and arrays of records are fine.
You should always look at the synthesis report to make sure XST understand your code as you intend. Every memory mapping is reported with it's mode, depth, width and more. It's the best way to spot misinterpretation errors.
I must admit I never have to do post-map/PAR simulations, so I can't tell you if you will have problems.

Related

C- Why is for loop pointer indexing faster? [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.
First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event work out if there is an opportunity to change the order of execution to minimise the time critical areas. For example when trying to improve the response to some external interrupt can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform if the only operation that at you're doing on a 32 bit value is an increment you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. This is wasted cpu cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
As everybody else has said: profile, profile profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
int ID;
int AccountNumber;
char Name[128];
char Address[256];
};
Customer customers[1000];
Now, lets assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is to split it into two:
struct CustomerAccount
{
int ID;
int AccountNumber;
CustomerData *pData;
};
struct CustomerData
{
char Name[128];
char Address[256];
};
CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes and so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.
My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.
most common techniques I encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them are already sounded:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost
For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).
First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression
Avoid using the heap. Use obstacks or pool-allocator for identical sized objects. Put small things with short lifetime onto the stack. alloca still exists.
Pre-mature optimization is the root of all evil!
;)
As my applications usually don't need much CPU time by design, I focus on the size my binaries on disk and in memory. What I do mostly is looking out for statically sized arrays and replacing them with dynamically allocated memory where it's worth the additional effort of free'ing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and put the initializiation to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This will remove the 1024 zero-bytes from the binaries .DATA section and will instead create the buffer on the stack at runtime and the fill it with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)
If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }
Another thing that was not mentioned:
Know your requirements: don't optimize for situations that will unlikely or never happen, concentrate on the most bang for the buck
basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free
Difficult to summarize ...
Data structures:
Splitting of a data structure depending on case of usage is extremely important. It is common to see a structure that holds data that is accessed based on a flow control. This situation can lower significantly the cache usage.
To take into account cache line size and prefetch rules.
To reorder the members of the structure to obtain a sequential access to them from your code
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
As for the latest processors it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit whole section of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce as much as possible the local working set of your application as the processors are leaning to smaller caches per cores (C2D enjoyed a 3MB per core max where iCore7 will provide a max of 256KB per core + 8MB shared to all cores for a quad core die.).
The most important of all: Measure early, Measure often and never ever makes assumptions, base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.
These days, the most important things in optimzation are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictabilty - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There's a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.
Collecting profiles of code execution get you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or VisualC++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.
Inline functions! Inspired by the profiling fans here I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my applcation, so I made it inline and voila - the program now uses half of the CPU time it did before.
On most of embedded system i worked there was no profiling tools, so it's nice to say use profiler but not very practical.
First rule in speed optimization is - find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in generic way how to optimize this it's depend on what are you doing and what is in your power to do. For example you want usually avoid memcpy on critical path, so ever you need to use DMA or optimize, but what if you hw does not have DMA ? check if memcpy implementation is a best one if not rewrite it.
Do not use dynamic allocation at all in embedded but if you do for some reason don't do it in critical path.
Organize your thread priorities correctly, what is correctly is real question and it's clearly system specific.
We use very simple tools to analyze the bottle-necks, simple macro that store the time-stamp and index. Few (2-3) runs in 90% of cases will find where you spend your time.
And the last one is code review a very important one. In most case we avoid performance problem during code review very effective way :)
Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.
Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of you space, you pack structures e.g. #pragma pack(1) and use bit fields in structures. For more speed you pack to align with the processors preference and avoid bitfields.
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.
Great lists. I will just add one tip I didn't saw in the above lists that in some case can yield huge optimisation for minimal cost.
bypass linker
if you have some application divided in two files, say main.c and lib.c, in many cases you can just add a \#include "lib.c" in your main.c That will completely bypass linker and allow for much more efficient optimisation for compiler.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.
Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.
I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up, I add items to a hashtable and then use the hashtable to lookup the results.
For example you have a string to lookup and then 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number ( you only have to do this once ). Then you lookup the target string in the hashtable and have one large switch with all 50 cases ( or have functions pointers ).
When looking up things with common sets of input ( like css rules ), I use fast code to keep track of the only possible solitions and then iterate thought those to find a match. Once I have a match I save the results into a hashtable ( as a cache ) and then use the cache results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Using a giant workspace array for a Fortran subroutine

Quite often when I look at legacy Fortran code for linear algebra subroutines, I notice this:
Instead of working with a bunch of separate arrays, they will concatenate all their arrays into one big workspace array and use pointers to demarcate where a variable begins.
They even concatenate independent non-array variables into arrays. Are there benefits to doing this, and should I be doing this if I want to write optimized code?
No, don't do that if you want to keep sane mind. This is a practise from 1960s-1980s when there was no dynamic allocation possible and they wanted only small number of working arrays in the argument list.
In old subroutines you had a long list of arguments and then one or two working arrays:
call SUB(N1, N2, N3, M1, M2, M3, A, B, C, WRK, IWRK)
if you needed to pass 10 working arrays instead of one it would be too difficult to call it.
But in 21st century the most important thing is to keep your code readable and clear and only after that optimize it.
BTW having some quantities to close in memory can be even detrimental due to false sharing.
That does not mean you should fragment your memory too much, but it makes sense to keep stuff together when you will indeed access it sequentially. That's why structure of arrays are used instead of arrays of structures.
In general (independent of the programming language that is used): having "consecutive" blocks of well, anything is often helpful.
The operating system, or even the hardware might be able to benefit from having a single huge section in memory to deal with; compared to look at 50 or 100 different locations.
A good starter for such discussions would be this question for example.
But I agree 100% with the other answer: unless you get massive performance gains out of using such techniques, you should always prefer to write "clean" (aka readable) code. And that translates to avoiding such practices.

Manually translating code from one language to another

I often write codes in MATLAB/Python to test whether my algorithm is feasible (& actually works). I then need to convert the entire code into C and sometimes, in FORTRAN90.
What would be a good way to manually convert a medium sized code from one language to another?
I have tried :
Converting the entire code from one into another and then testing it.
(Sometimes, there are errors and bugs which just won't go away and the finding the source of the error becomes a problem)
Go line by line and check for consistency of outputs every few lines.
(Too time consuming)
Use converters like f2c.
(In my experience, they are extremely horrible. I link to a lot of libraries which have different function calls for C and Fortran)
Also,:
I am fairly conversant with the programming languages I deal with so I don't need manuals or reference guides for my work (i.e. I know the syntax).
I am not asking this question specifically about MATLAB and C but rather as a translation paradigm.
Regarding the size, the codes are less than 100 lines long.
I dont want to call the code of one language to another. Please don't suggest that.
Different languages call for different paradigms. You definitely don't write and design code the same way in eg. Matlab, Python, C# or C++. Even object hierarchies will change a lot depending on the language.
That said, if your code consists in a few interconnected procedures, then you may go away with a direct line by line translation (every language allow you to write two or three interconnected functions while remaining idiomatic). But this is the case only for the simplest programs.
Prototyping in a high level language and then implementing the same idea in a robust and clean way in a "production" language is a very good practice, but involves two very different things :
Prototype in whatever language you want. Test, experiment, and convince yourself that the idea works. Pay attention to the big picture, don't focus on performance but on the high level ideas. Pay also attention to difficulties that you encounter when implementing, as you'll face them again in step 2.
Implement from scratch the idea in the production environment in language X. It will be quicker than if you did not do the prototyping stage, since most of the difficulties have been met in stage 1. Use idiomatic X, and focus on correctness. Pay attention to corner cases, general robustness, and once it works correctly, performance. You'll notice that roughly half of your code is made of new things which did not appear in 1. (eg. error checking, corner case handling, input/output, unit testing, etc).
You can see that line by line translation is obviously not a good idea, since you don't translate into the same program.
Also, when not prototyping, I find myself throwing away the first version and making another one that I like better, ie. I find myself prototyping ! Implementing the same thing twice is not a loss of time, it is normal development flow.
You may want to consider using a higher level domain specific language with multiple backends (e.g., Matlab, C, Fortran), producing clean and idiomatic code for each target language, probably with some optimisations. If your problem domain is narrow and every piece of code is more or less typical, it should be fairly trivial to design and implement such a DSL.
Break the source down into psuedo-code with input/process/output and then write your new code base to fit that spec.

general question about solving problems with parallelisation

i have a general question about programming of parallel algorithms in C. Lets assume that our task is to implement some matrix algorithms with MPI and/or OpenMP. There are some situations, like false sharing in OpenMP or in MPI where the communications arise in dependence of the matrix dimension (columns cyclic distrubuted among processes), which cause some problems . Would it be a good and a common attempt to solve this situations by, for example, transposing the matrix, because this would reduce the necessary communications or even avoiding the false sharing problem? After that you would undo the transposition. Of course, assuming that this would lead to a much better speed up.
I dont think that this would be very cunning and more of a lazy way to do this. But im curious to read some opions about this.
Let's start with the first question first: can it make sense to transpose? The answer is, it depends, and you can estimate whether it will improve things or not.
The transposition/retransposition with impose a one-time memory bandwidth cost of 2*(going through memory the fast way) + 2*(going through memory the slow way) where those memory operations are literally memory operations in the multicore case, or network communications in the distributed memory case. You're going to be reading a matrix in the fast way and putting it into memory the slow way. (You can make this, essentially, 4*(going through memory the fast way) by reading the matrix in one cache-sized block at a time, transposing in cache, and writing out in order).
Whether or not that's a win or not depends on how many times you'll be accessing the array. If you would have been hitting the entire non-transposed array 4 times with memory access in the "wrong" direction, then you will clearly win by doing the two transposes. If you'd only be going through the non-transposed array once in the wrong direction, then you almost certainly won't win by doing the transposition.
As to the larger question, #AlexandreC is absolutely right here -- trying to implement your own linear algebra routines is madness. Take a look at, eg, How To Write Fast Numerical Code, figure 3; there can be factors of 40 in performance between naive and highly-tuned (say) GEMM operations. These things are highly memory-bandwidth limited, and in parallel that means network limited. By far, best is to use existing tools.
For multicore linear algebra, existing libraries include
Atlas
Plasma
Flame
For MPI implementations, there are
BLACS
Scalapack
or complete solver environments like
PETSc
Trilinos
I don't know that you'd throw the transpose away the second that you completed the operation, but yes this is a valid mechanism to increase parallelism.
I am not an expert; I've only read a little bit about this topic, and even that was for SIMD architectures, so please take my opinion lightly... but I think the usual mechanism is to lay your structures out in memory to match the machine (so you'd transpose a large matrix to line up better with your vectors and increase the dependency distance in your loops), and then you also build an indexing structure of pointers around that so that you can quickly access individual elements in the transpose differently. This gets more difficult to do as your input changes more dynamically.
I dont think that this would be very cunning and more of a lazy way to do this.
Lazy solutions are usually better than "cunning" ones, because they tend to be more simple and straightforward. They're therefore easier to implement, document, understand and maintain. Indeed, laziness is arguably one of the greatest virtues a programmer can have. As long as the program produces correct results at acceptable speeds, nobody should care how elegantly you solved the problem (including you).

What specific examples are there of knowing C making you a better high level programmer?

I know about the existance of question such as this one and this one. Let me explain.
Afet reading Joel's article Back to Basics and seeing many similar questions on SO, I've begun to wonder what are specific examples of situations where knowing stuff like C can make you a better high level programmer.
What I want to know is if there are many examples of this. Many times, the answer to this question is something like "Knowing C gives you a better feel of what's happening under the covers" or "You need a solid foundation for your program", and these answers don't have much meaning. I want to understand the different specific ways in which you will benefit from knowing low level concepts,
Joel gave a couple of examples: Binary databases vs XML, and strings. But two examples don't really justify learning C and/or Assembly. So my question is this: What specific examples are there of knowing C making you a better high level programmer?
My experience with teaching students and working with people who only studied high-level languages is that they tend to think at a certain high level of abstraction, and they assume that "everything comes for free". They can become very competent programmers, but eventually they have to deal with some code that has performance issues and then it comes to bite them.
When you work a lot with C, you do think about memory allocation. You often think about memory layout (and cache locality if that's an issue). You understand how and why certain graphics operations just cost a lot. How efficient or inefficient certain socket behaviors are. How buffers work, etc. I feel that using the abstractions in a higher level language when you do know how it is implemented below the covers sometimes gives you "that extra secret sauce" when thinking about performance.
For example, Java has a garbage collector and you can't directly assign things to memory directly. And yet, you can make certain design choices (e.g., with custom data structures) that affect performance because of the same reasons this would be an issue in C.
Also, and more generally, I feel that it is important for a power programmer to not only know big-O notation (which most schools teach), but that in real-life applications the constant is also important (which schools try to ignore). My anecdotal experience is that people with skills in both language levels tend to have a better understanding of the constant, perhaps because of what I described above.
In addition, many higher level systems that I have seen interface with lower level libraries and infrastructures. For instance, some communications, databases or graphics libraries. Some drivers for certain devices, etc. If you are a power programmer, you may eventially have to venture out there and it helps to at least have an idea of what is going on.
Knowing low level stuff can help a lot.
To become a racing driver, you have to learn and understand the basic physics of how tyres grip the road. Anyone can learn to drive pretty fast, but you need a good understanding of the "low level" stuff (forces and friction, racing lines, fine throttle and brake control, etc) to get those last few percent of performance that will allow you to win the race.
For example, if you understand how the CPU architecture works in your computer, you can write code that works better with it (e.g. if you know you have a certain CPU cache size or a certain number of bytes in each CPU cache line, you can arrange your data structures and the way that you access them to make the best use of the cache - for example, processing many elements of an array in order is often faster than processing random elements, due to the CPU cache). If you have a multi-core computer, then understanding how low level techniques like threading work can gave huge benefits (just as not understanding the low level can lead to disaster in threading).
If you understand how Disk I/O and caching works, you can modify file operations to work well with it (e.g. if you read from one file and write to another, working on large batches of data in RAM can help reduce I/O contention between the reading and writing phases of your code, and vastly improve throughput)
If you understand how virtual functions work, you can design high-level code that uses virtual functions well. If used incorrectly they can severely hamper performance.
If you understand how drawing is handled, you can use clever tricks to improve drawing speed. e.g. You can draw a chessboard by alternately drawing 64 white and black squares. But it is often faster to draw 32 white sqares and then 32 black ones (because you only have to change the drawing colour twice instead of 64 times). But you can actually draw the whole board black, then XOR 4 stripes across the board and 4 stripes down the board in white, and this can be much faster still (2 colour changes, and only 9 rectangles to draw instead of 64). This chessboard trick teaches you a very important programming skill: Lateral thinking. By designing your algorithm well, you can often make a big difference to how well your program operates.
Understanding C, or for that matter, any low level programming language, gives you an opportunity to understand things like memory usage (i.e. why is it a bad thing to create several million heavy objects), how pointers/object references work, etc.
The problem is that as we've created ever increasing levels of abstraction, we find ourselves doing a lot of 'lego block' programming, without understanding how the legos actually function. And by having almost infinite resources, we start treating memory and resources like water, and tend to solve problems by throwing more iron at the situation.
While not limited to C, there's a tremendous benefit to working at a low level with much smaller, memory constrained systems like the Arduino or old-school 8-bit processors. It lets you experience close to the metal coding in a much more approachable package, and after spending time squeezing apps into 512K, you will find yourself applying these skills at a larger level within your day to day programming.
So the language itself is not important, but having a deeper appreciation for how all of the bits come together, and how to work effectively at a level closer to the hardware is a set of skills beneficial to any software developer.
For one, knowing C helps you understand how memory works in the OS and in other high level languages. When your C# or Java program balloons on memory usage, understanding that references (which are basically just pointers) take memory too, and understand how many of the data structures are implemented (which you get from making your own in C) helps you understand that your dictionary is reserving huge amounts of memory that aren't actually used.
For another, knowing C can help you understand how to make use of lower level operating system features. You don't need this often, but sometimes you may need memory mapped files, or to use marshalling in C#, and C will greatly help understand what you're doing when that happens.
I think C has also helped my understanding of network protocols, but I can't put my finger on specific examples. I was reading another SO question the other day where someone was complaining about how C's bit-fields are 'basically useless' and I was thinking how elegantly C bit fields represent low-level network protocols. High level languages dealing with structures of bits always end up a mess!
In general, the more you know, the better programmer you will be.
However, sometimes knowing another language, such as C, can make you do the wrong thing, because there might be an assumption that is not true in a higher-level language (such as Python, or PHP). For example, one might assume that finding the length of a list might be O(N) where N is the length of the list. However, this is probably not the case in many high-level language instances. In Python, for most list-like things the cost is O(1).
Knowing more about the specifics of a language will help, but knowing more in general might lead one to make incorrect assumptions.
Just "knowing" C would not make you better.
But, if you understand the whole thing, how native binaries work, how does CPU work with it, what are architecture limitations, you may write a code which is easier for CPU.
For example, how L1/L2 caches affect your work, and how should you write your code to have more hits in L1/L2 caches. When working with C/C++ and doing heavy optimizations, you will have to go down to that kind of things.
It isn't so much knowing C as it is that C is closer to the bare metal than many other languages. You need to be more aware of how to allocate/deallocate memory because you have to do it yourself. Doing it yourself helps you understand the implications of many decisions that you make.
To me any language is acceptable as long as you understand how the compiler/interpreter (basically) maps your code onto the machine. It's a bit easier to do in a language that exposes this directly, but you should be able to, with a bit of reading, figure out how memory is allocated and organized, what sort of indexing patterns are more optimal than others, what constructs are more efficient for particular applications, etc.
More important, I think, is a good understanding of operating systems, memory architectures, and algorithms. If you understand how your algorithm works, why it would be better to choose one algorithm or data structure over another (e.g., HashSet vs. List), and how your code maps onto the machine, it shouldn't matter what language you are using.
This is my experience of how I learnt and taught myself programming, specifically, understanding C, this is going back to early 1990's so may be a bit antique, but the passion and the drive is important:
Learn to understand the low level principles of the computer, such as EGA/VGA programming, here's a link to the Simtel archive on the C programmer's guide to the PC.
Understanding how TSR's work
Download the whole archive of Bob Stout's snippets which is a big collection of C code that does one thing only - study them and understand it, not alone that, the collection of snippets strives to be portable.
Browse at the International Obfuscated C Code Contest (IOCCC) online, and see how the C code can be abused and understand the intracies of the language. The worst code abuse is the winner! Download the archives and study them.
Like myself, I loved the infamous Ponzo's C Tutorial which helped me immensely, unfortunately, the archive is very hard to find. If anyone knows of where to obtain them, please leave a comment and I will amend this answer to include the link. There is another one that I can remember - Coronado's [Generic?] C Tutorial, again, my memory on this one is hazy...
Look at Dr. Dobb's journal and C User Journal here - I do not know if you can still get them in print but they were a classic, can remember the feeling of holding a printed copy in my hand and tearing off home to type in the code to see what happens!
Grab an ancient copy of Turbo C v2 which I believe you can get from borland.com and just play with 16bit C programming to get a feel and mess with the pointers...sure it is ancient and old but playing with pointers on it is fine.
Understand and learn Pointers, link here to the legacy Simtel.net - a crucial link to achieving C Guru'ship for want of a better word, also you will find a host of downloads pertaining to the C programming language - I remember actually ordering the Simtel CD Archive and looking for the C stuff...
A couple of things that you have to deal directly with in C that other languages abstract away from you include explicit memory management (malloc) and dealing directly with pointers.
My girlfriend is one semester from graduating MIT (where they mainly use Java, Scheme, and Python) with a Computer Science degree, and she is currently working at a company whose codebase is in C++. For the first few days she had a difficult time understanding all the pointers/references/etc.
On the other hand, I found moving from C++ to Java very easy, because I was never confused about pass-references-by-value vs pass-by-reference.
Similarly, in C/C++ it is much more apparent that primitives are just the compiler treating the same sets of bits in different ways, as opposed to a language like Python or Ruby where everything is an object with its own distinct properties.
A simple (not entirely realistic) example to illustrate some of the advice above. Consider the seemingly harmless
while(true)
for(Iterator iter = foo.iterator(); iter.hasNext();)
bar.doSomething( iter.next() )
or the even higher level
while(true)
for(Baz b: foo)
bar.doSomething(b)
A possible problem here is that each time round the while loop a new object (the iterator) is created. If all you care about is programmer convenience, then the latter is definitely better. But if the loop has to be efficient or the machine is resource constrained then you are pretty much at the mercy of the designers of your high level language.
For example, a typical complaint for doing high-performance Java is having execution stop while garbage (such as all those allocated Iterator objects) is reclaimed. Not very good if your software is charged with tracking incoming missiles, auto-piloting a passenger jet, or just not leaving the user wondering why the GUI has stopped responding.
One possible solution (still in the higher-level language) would be to weaken the convenience of the iterator to something like
Iterator iter = new Iterator();
while(true)
for(foo.initAlreadyAllocatedIterator(iter); iter.hasNext();)
bar.doSomething(iter.next())
But this would only make sense if you had some idea about memory allocation...otherwise it just looks like a nasty API. Convenience always costs somewhere, and knowing lower-level stuff can help you identify and mitigate those costs.

Resources