Run time Data and Code memory size estimate - c

I am working on a project, C programing language, to develop an application, that can be ported on to a number of different microcontroller platforms, such as ARM\Freescale\PIC microcontroller. I am developing this application on Linux now and then I will have to port it to the above said platforms.
I would like to know, are there any tools (open source preferably), using which I can determine the "code" and the data memory footprint\size, before porting it to the new platform.
I have been searching on "Google" for it and have not found anything so far, not even for Linux as well.
any help from you will greatly help me.
-Vikas

For a small program, much of the size is determined by the libraries/DLL your program depends on. Since you refer to ARM/Freescale/Pic I assume you're dealing with compact, embedded applications where data size is measured in bytes rather than MBytes.
For your own code, size differences will determined by:
word size (i.e. 32bit programs tend to be a bit larger /more data than 8 bit)
architecture (i.e. Intel code versus ARM, freescale, PIC)
In your case, I expect that PIC is the most critical part (for RAM/ROM constraints). So propably monitoring the PIC compile size during PC development is sufficient. The linker output will contain info on TEXT/DATA/BSS size, which you can monitor.
I generally work on embedded systems. In my work much of the data size is known at design time (i.e. number of buffers * buffer size). For code size, I have rules of thumb on different architectures which help me to do a sanity check at design time. For instance, I define a suite of some exising-code libraries, for which I know performance and size numbers for each architecture. This way I know what kind of ratio I can expect at design time. If the PC program has 1 MBytes of data, it won't fit in an 8-bit PIC.....

Nothing can tell you how much memory your application will need. You'll have to make some assumptions about how it will be used and try your application under different scenarios.
As you're testing, you can monitor the memory usage stats in the /proc file system or use the ps command to do the same.

The size of your text/code segment will depend on optimization level and back-end. GCC can be configured to generate that information for you.
Run-time is a little more difficult as Jeremy said. Besides his suggestion, you also might want to try gcov and/or gprof in order to analyse your program in the context of your most common use scenarios. This kind of instrumentation is focussed on complexity rather than size but at least you'll know better where to focus your memory analysis.

Your compiler can/will generate a map file. The map file will, generally speaking, have code and data size (or location ranges). There may be differences between different compilers for the different targets. And as pointed out in other posts here, your dependencies on supplied libraries will also impact overall memory usage.

Related

Is dividing a program into multiple files heavier than using one big file?

I am currently developping an embedded application, and the problem is that I have reached a point where the whole app is actually too heavy for the RAM.
So I am asking myself this question: Would my compiled program be lighter if I refractored some the files into one big file?
Thanks.
No, not in general.
The source-code organisation doesn't impact the memory use.
It might be possible to refactor the program to use less memory, if you for instance have pieces of functionality that both use significant amounts but never run in parallel, but that's equally doable in a single file. It just requires making the sharing explicit.
If your code would be 100% identical in the "big file" and the "many files" approach, the main difference is that addresses of variables and fuctions would be resolved (i.e. padded into the binary) by the linker rather than the compiler. This should normally result in the very same binary.
There are, however, architectures that have "short relative addressing modes" (Motorola 68k is an example) for calls and references, where the "big file approach" could actually result in (slightly) smaller code and data. The compiler cannot normally insert such short addressing modes into modules intended for linking because at compile time it is not known how far away the reference would be in the resulting program.
So, depending on the CPU you are using, you might actually gain a few bytes.

Bottom Level Graphics Programming/ Working with Screen Hardware

It is said that a 'low level' programming language is one that works 'close' to the hardware. I work with C as well as Python and Java and it is generally excepted that C is a 'lower level' language or works 'closer' to the hardware than these other two languages. This of course makes sense because python and java are C derived languages.
So what I want to know is what is the 'lowest level' or 'closest to hardware' way of working with the computer screen to display pixel graphics on a monitor?
Basically, how does the screen work on the programming 'back end'?
Just to be perfectly clear I'll give an analogy. When I first learned about python lists, I figured they were multiple numbers stored together. What I later learned was that they are actually pointers to locations in memory reserved for the length of the array in units of the data type.
So what I know about the screen is that its a grid of pixels x wide, y long. When you refer to this grid with functions that accept some form of vector, a pixel will light up a specified colour at that pixel's coordinance (0,0 starts top left of course.) What is really going on in those functions?
sMachine code is lowest you can go. And this even still gets interpreted by the CPU. (Dependent on CPU of course)
Generally most people will go only as low as assembler though.
(I didn't mention GPU programming as you mentioned Python, Java, C and made an assumption you are trying to get to grips with the old way of rendering).
EDIT
Missed your other question
Pointers are used when a block of memory needs referencing. Simple data types do not require a pointer (but can still have one). This is dependent on the language and how it allows data type allocation.
In the case of a list/array of data you can not allocate this on the heap as it is a varying length so a block of memory is request from the operating system which will return a pointer to this section of memory that can be used. The operating system does not care how this is used and it's up to the application to know how much to allocate and how it should use it.
This is why languages use data types so the compiler knows how to manipulate that block of memory. Now nothing stops you manipulating this block of memory directly but you must follow the rules the compile uses since any further code executed could cause errors.
EDIT
Additional based on comment.
When modifying a a buffer that if linked to hardware you need to understand the specs for the buffer. Usually you have already requested the device to use a certain MODE of operation and you know how the hardware will behave. This is why varying standards arose to help programmer use these. I am going to reference to a very good old school programming guide for hardware it should help you a lot PC Game Programmer's Encyclopedia
These concepts still apply to operating systems now in windows you would use the GDI to allocate these screens and the GDI interfaces to a driver to present the required image. The guide I linked is about using hardware directly from old operating systems which have no graphics support (MS-DOS for example)
The word you are looking for is frame buffer. For an introduction see http://en.wikipedia.org/wiki/Framebuffer

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.
simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years, for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARMs compiler produced significantly better code than gcc. As much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc by in terms of performance and overall use for cross compiling embedded code. Try different versions of gcc, 3.x and 4.x if you are using gcc. Metaware's compiler and arms adt ran circles around gcc3.x, gcc3.x will give gcc4.x a run for its money with arm code, for thumb code gcc4.x is better and for thumb2 (which doesnt apply to you) gcc4.x also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full program optimization in addition to infinitely more tuning knobs than gcc. Despite that the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didnt try the n factoral number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles, my theory is do no optimization on the C to bc steps, link all the bc together then do a single optimization pass on the whole program, the allow the default optimization when llc takes it to the target).
By the same token simply knowing your compiler and the optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11 arr you compiling for arm11 or generic arm? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family (armv6 for example) over the generic armv4 (ARM7) that is often chosen as the default. Knowing to use -O2 or -O3 if you are brave.
It is often not the case but switching to thumb mode can improve performance for specific platforms. Doesnt apply to you but the gameboy advance is a perfect example, loaded with non-zero wait state 16 bit busses. Thumb has a handful of a percent overhead because it takes more instructions to do the same thing, but by increasing the fetch times, and taking advantage of some of the sequential read features of the gba thumb code can run significantly faster than arm code for the same source code.
having an arm11 you probably have an L1 and maybe L2 cache, are they on? Are they configured? Do you have an mmu and is your heavy use memory cached? or are you running zero wait state memory and dont need a cache and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often dont realize that when you use a cache simply adding a single up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, a few words) you can change your codes execution speed by as much as 10 to 20 percent. Where those cache line reads hit in heavily used functions/loops makes a big difference. Even saving one cache line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or 2 to 1 for example).
Knowing your architecture, both the processor and your memory environment is where the tuning if any would start. Most C libraries if you are high level enough to use one (I often dont use a C library as I run without an operating system and with very limited resources) both in their C code and sometimes add some assembler to make bottleneck routines like memcpy, much faster. If your programs are operating on aligned 32 or even better 64 bit addresses, and you adjust even if it means using a handful of bytes more memory for every structure/array/memcpy to be an integral multiple of 32 bits or 64 bits you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them, I certainly dont with embedded code) size aligned, even if you waste memory, getting elements aligned, consider using 32 bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too btw). As with the GBA example above looking at specific functions that either by profiling or intuition you know are not being implemented in a manner that takes advantage of your processor or platform or libraries you may want to turn to assembler either from scratch or compiling from C initially then disassembling and hand tuning. Memcpy is a good example you may know your systems memory performance and may chose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
Likewise mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isnt necessarily true, depends on how deeply embedded and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive, but at the same time, depending on the code within a function that has now grown in size by avoiding functions, you may create the problem you were trying to avoid, evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fits in registers to having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain compile and disassemble (and look at register usage, how often it fetches memory or writes to memory). You can and will find places where you need to take a well used loop and make it its own function even though the function call has a penalty because by doing that the compiler can better optimize the loop and not evict/reuse registers and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug, turn all of the compile for debug options off. You may already know that code compile for debug that runs without bugs doesnt mean it is debugged, compiling for debug and using debuggers hide bugs leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version both for performance and finding bugs in your code.
Most instruction sets do not have a divide function. Avoid using divides or modulo in your code as much as humanly possible they are performance killers. Naturally this is not the case for powers of two, to save the compiler and to mentally avoid divides and modulos try to use shifts and ands. Multplies are easier and more often found in instruction sets, but are still costly. This is a good case to write assembler to do your multiplies instead of letting the C copiler do it. The arm multiply is a 32bit * 32bit = 32 bit so to do accurate math without overflowing there has to be extra C code wrapped around the multiply, if you already know you wont overflow, burn the registers for a function call and do the multiply in assembler (for the arm).
Likewise most instruction sets do not have a floating point unit, with yours you might, even so avoid float if at all possible. If you have to use float that is a whole other pandora's box of performance issues. Most folks dont see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating point accuracy and how good or bad the C libraries are just trying to get your constants into floating point form. Again float is a whole other long discussion on performance problems.
I am a product of Michael Abrash (I actually have a print copy of zen of assembly language) and the bottom line is time your code. Come up with an accurate way to time the code, you may think you know where the bottlenecks are and you may think you know your architecture but trying different things even if you think they are wrong, and timing them you may find and eventually have to figure out the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this, all the other work you have done for performance can be instantly erased by not having a good alignment with the cache, this also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache line alignments.
Code Review:
What are good code review techniques ?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klockworks
Tools for dynamic analysis : Valgrind, purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
Steps are same
Apart from what is listed is point 1, there are tools like memcheck etc.
There is a big list here based on platform
Phew!! Quite a big question!
What are generally the steps to
analyze and improve performance of C
applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. it doesn't do what you want it to do! These are best done by code reviews first, then testing. Performance is often improved by by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), use of fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units but once run can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all to the required, imagination to think up of scenarios that may impair performance and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance ananlysis can be different on embedded systems to applications systems; with the very broad brush that "embedded" now covers it depends how hardware-centric you are. It can be done using profilers, if you want a more cheap and chearful method then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
What tools are out there which can
help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!
My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).
It is a difficult question to be answered shortly since various techniques have been proposed such as flowchart and state diagram,so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 --Joseph Yiu
C Programming for Embedded Systems --Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ --Michael Barr
An Embedded Software Primer --David E, Simon
Embedded Microprocessor Systems 3rd Edition --Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components --G. Nicolescu & A.A Jerraya
Embedded Systems: Modeling, Technology and Applications --Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture --Graham Wilson
Designing Embedded Hardware --John Catsoulis
You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No they don't.
Depending on the platform you're developing onto :
Windows : AMD Code Analyst, VTune, Sleepy
Linux : valgrind / callgrind / cachegrind
Mac : the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.

File descriptor limits and default stack sizes

Where I work we build and distribute a library and a couple complex programs built on that library. All code is written in C and is available on most 'standard' systems like Windows, Linux, Aix, Solaris, Darwin.
I started in the QA department and while running tests recently I have been reminded several times that I need to remember to set the file descriptor limits and default stack sizes higher or bad things will happen. This is particularly the case with Solaris and now Darwin.
Now this is very strange to me because I am a believer in 0 required environment fiddling to make a product work. So I am wondering if there are times where this sort of requirement is a necessary evil, or if we are doing something wrong.
Edit:
Great comments that describe the problem and a little background. However I do not believe I worded the question well enough. Currently, we require customers, and hence, us the testers, to set these limits before running our code. We do not do this programatically. And this is not a situation where they MIGHT run out, under normal load our programs WILL run out and seg fault.
So rewording the question, is requiring the customer to change these ulimit values to run our software to be expected on some platforms, ie, Solaris, Aix, or are we as a company making it to difficult for these users to get going?
Bounty:
I added a bounty to hopefully get a little more information on what other companies are doing to manage these limits. Can you set these pragmatically? Should we? Should our programs even be hitting these limits or could this be a sign that things might be a bit messy under the covers? That is really what I want to know, as a perfectionist a seemingly dirty program really bugs me.
If you need to change these values in order to get your QA tests to run, then that is not too much of a problem. However, requiring a customer to do this in order for the program to run should (IMHO) be avoided. If nothing else, create a wrapper script that sets these values and launches the application so that users will still have a one-click application launch. Setting these from within the program would be the preferable method, however. At the very least, have the program check the limits when it is launched and (cleanly) error out early if the limits are too low.
If a software developer told me that I had to mess with my stack and descriptor limits to get their program to run, it would change my perception of the software. It would make me wonder "why do they need to exceed the system limits that are apparently acceptable for every other piece of software I have?". This may or may not be a valid concern, but being asked to do something that (to many) can seem hackish doesn't have the same professional edge as an program that you just launch and go.
This problem seems even worse when you say "this is not a situation where they MIGHT run out, under normal load our programs WILL run out and seg fault". A program exceeding these limits is one thing, but a program that doesn't gracefully handle the error conditions resulting from exceeding these limits is quite another. If you hit the file handle limit and attempt to open a file, you should get an error indicating that you have too many files open. This shouldn't cause a program crash in a well-designed program. It may be more difficult to detect stack usage issues, but running out of file descriptors should never cause a crash.
You don't give much details about what type of program this is, but I would argue that it's not safe to assume that users of your program will necessarily have adequate permissions to change these values. In any case, it's probably also unsafe to assume that nothing else might change these values while your program is running without the user's knowledge.
While there are always exceptions, I would say that in general a program that exceeds these limits needs to have its code re-examined. The limits are there for a reason, and pretty much every other piece of software on your system works within those limits with no problems. Do you really need that many files open at the same time, or would it be cleaner to open a few files, process them, close them, and open a few more? Is your library/program trying to do too much in one big bundle, or would it be better to break it into smaller, independent parts that work together? Are you exceeding your stack limits because you are using a deeply-recursive algorithm that could be re-written in a non-recursive manner? There are likely many ways in which the library and program in question can be improved in order to ease the need to alter the system resource limits.
The short answer is: it's normal, but not inflexible. Of course, limits are in place to prevent rogue processes or users from starving the system of resources. Desktop systems will be less restrictive than server systems but still have certain limits (e.g. filehandles.)
This is not to say that limits cannot be altered in persistent/reproduceable manners, either by the user at the user's discretion (e.g. by adding the relevant ulimit calls in .profile) or programatically from within programs/libraries which know with certitude that they will require large amounts of filehandles (e.g. setsysinfo(SSI_FD_NEWMAX,...)), stack (provided at pthread creation time), etc.
On Darwin, the default soft limit on the number of open files is 256; the default hard limit is unlimited.
AFAICR, on Solaris, the default soft limit on the number of open files is 16384 and the hard limit is 32768.
For stack sizes, Darwin has soft/hard limits of 8192/65536 KB. I forget what the limit is on Solaris (and my Solaris machine is unavailable - power outages in Poughkeepsie, NY mean I can't get to the VPN to access the machine in Kansas from my home in California), but it is substantial.
I would not worry about the hard limits. If I thought the library might run out of 256 file descriptors, I'd increase the soft limit on Darwin; I would probably not bother on Solaris.
Similar limits apply on Linux and AIX. I can't answer for Windows.
Sad story: a few years ago now, I removed the code that changed the maximum file size limit in a program - because it had not been changed from the days when 2 MB was a big file (and some systems had a soft limit of just 0.5 MB). Once upon a decade and some ago, it actually increased the limit; when it was removed, it was annoying because it reduced the limit. Tempus fugit and all that.
On SuSE Linux (SLES 10), the open files limits are 4096/4096, and the stack limits are 8192/unlimited.
As you have to support a large number of different systems i would consider it wise to setup certain known to be good values for system limits/resources because the default values can differ wildly between systems.
The default size for pthread stacks is for example such a case. I recently had to find out that the default on HPUX11.31 is 256KB(!) which isn't very reasonable at least for our applications.
Setting up well defined values increases the portability of an application as you can be sure that there are X file descriptors, a stack size of Y, ... on every platform and that things are not just working by good luck.
I have the tendency to setup such limits from within the program itself as the user has less things to screw up (someone always tries to run the binary without the wrapper script). To optionally allow for runtime customization environment variables could be used to override the defaults (still enforcing the minimum limits).
Lets look at it this way. It is not very customer friendly to require customers to set these limits. As detailed in the other answers, you are most likely to hit soft limits and these can be changed. So change them automatically, if necessary in a script that starts the actual application (you can even write it so that it fails if the hard limits are too low and produce a nice error message instead of a segfault).
That's the practical part of it. Without knowing what the application does I'm a bit at a guess, but in most cases you should not be anywhere close to hitting any of the default limits of (even less progressive) operating systems. Assuming the system is not a server that is bombarded with requests (hence the large amount of file/socket handles used) it is probably a sign of sloppy programming. Based on experience with programmers, I would guess that file descriptors are left open for files that are only read/written once, or that the system keeps open a file descriptor on a file that is only sporadically changed/read.
Concerning stack sizes, that can mean two things. The standard cause of a program running out of stack is excessive recursion (or unbounded recursion), which is an error condition that the limits actually are designed to address. The second thing is that some big (probably configuration) structures are allocated on the stack that should be allocated in heap memory. It might even be worse and those huge structures are being passed around by value (instead of reference) and that would mean a big hit on available (wasted) stack space as well as a big performance penalty.
A small tip : If you plan to run the application over 64 bit processor, then please be careful about setting stacksize unlimited. Which in 64 Bit Linux system give -1 as stacksize.
Thanks
Shyam
Perhaps you could add whatever is appropriate to the start script, like 'ulimit -n -S 4096'.
But having worked with Solaris since 2.6, its not unusual to modify rlim_fd_cur and rlim_fd_max in /etc/system permanently. In older versions of Solaris, they're just too low for some workloads, like running webservers.

When to worry about endianness?

I have seen countless references about endianness and what it means. I got no problems about that...
However, my coding project is a simple game to run on linux and windows, on standard "gamer" hardware.
Do I need to worry about endianness in this case? When should I need to worry about it?
My code is simple C and SDL+GL, the only complex data are basic media files (png+wav+xm) and the game data is mostly strings, integer booleans (for flags and such) and static-sized arrays. So far no user has had issues, so I am wondering if adding checks is necessary (will be done later, but there are more urgent issues IMO).
The times when you need to worry about endianess:
you are sending binary data between machines or processes (using a network or file). If the machines may have different byte order or the protocol used specifies a particular byte order (which it should), you'll need to deal with endianess.
you have code that access memory though pointers of different types (say you access a unsigned int variable through a char*).
If you do these things you're dealing with byte order whether you know it or not - it might be that you're dealing with it by assuming it's one way or the other, which may work fine as long as your code doesn't have to deal with a different platform.
In a similar vein, you generally need to deal with alignment issues in those same cases and for similar reasons. Once again, you might be dealing with it by doing nothing and having everything work fine because you don't have to cross platform boundaries (which may come back to bite you down the road if that does become a requirement).
If you mean a PC by "standard gamer hardware", then you don't have to worry about endianness as it will always be little endian on x86/x64. But if you want to port the project to other architectures, then you should design it endianness-independently.
Whenever you recieve/transmit data from a network, remeber to convert to/from network and host byte order. The C functions htons, htonl etc, or equivalients in your language, should be used here.
Whenever you read multi-byte values (like UTF-16 characters or 32 bit ints) from a file, since that file might have originated on a system with different endianness. If the file is UTF 16 or 32 it probably has a BOM (byte-order mark). Otherwise, the file format will have to specify endianness in some way.
You only need to worry about it if your game needs to run on different hardware architectures. If you are positive that it will always run on Intel hardware then you can forget about it. If it will run on Linux though many people use different architectures than Intel and you may end up having to think about it.
Are you distributing you game in source code form?
Because if you are distributing you game as a binary only, then you know exactly which processor families your game will run on. Also, the media files, are they user generated (possibly via a level editor) or are they really only ment to be supplied by yourself?
If this is a truly closed environment (your distribute binaries and the game assets are not intended to be customized) then you know your own risks to endians and I personally wouldn't fool with it.
However, if you are either distributing source and/or hoping people will customize their game, then you have a potential for concern. However, with most of the desktop/laptop computers around these days moving to x86 I would think this is a diminishing concern.
The problem occurs with networking and how the data is sent and when you are doing bit fiddling on different processors since different processors may store the data differently in memory.
I believe Power PC has the opposite endianness of the Intel boards. Might be able to have a routine that sets the endianness dependant on the architecture? I'm not sure if you can actually tell what the hardware architecture is in code...maybe someone smarter then me does know the answer to that question.
Now in reference to your statement "standard" Gamer H/W, I would say typically you're going to look at Consumer Off the Shelf solutions are really what most any Standard Gamer is using, so you're almost going to for sure get the same endian across the board. I'm sure someone will disagree with me but that's my $.02
Ha...I just noticed on the right there is a link that is showing up related to the suggestion I had above.
Find Endianness through a c program

Resources