I'm a high-school student doing some C things where I'd like to profile my code to see where the actual performance bottlenecks are. I don't have much money, so I'd prefer free tools.
I like to use the MinGW/GCC compiler toolchain. This is not something I'm stuck with, but I'd prefer tools that are capable of working with this.
Features I need:
See how much total time is spent in a certain function.
Features I'd like:
See how much time a line of code takes.
Cross-platform (being able to use the same software on Linux & Mac)
See how often a function gets called (and how long each call takes on average).
See what causes the time spent (cache misses, branch mispredictions, etc).
I've tried using gprof, but I couldn't get it to work (it only shows main in the profile), and I've heard bad things about it, so what are my options?

If you want a free profiler for Windows and Linux that does time-based profiling (TBP), along with event-based and some other metric-based forms of profiling, then AMD's CodeAnalyst should do the job nicely. It works even on Intel CPUs, though I'm not sure of the quality/reliability of the branching and cache analysis there. It also has a nice UI built in Qt that does the source + assembly line time breakdowns, and an API to embed events for the profiler to catch for more targeted profiling.


CPU operations windows profiler tools

Do you know any profiler tool that tells you the number of total CPU operations a C/C++ program does? I need something like valgrind callgrind on linux...
Intel has some tools such as VTune. They also provide a performance counter library which you can use to instrument your code manually, by reading the hardware perf counter registers before and after a piece of code.
Visual Studio has an instrumented profiler but I don't know if it gets down to the "instructions retired" level of detail.
You should ask yourself what information you really want: do you want to count the number of cycles spent on a function, or do you really want to know how much wall-clock time your app is spending on each function generally? The latter is more useful in most cases, and you can get it more easily by sampling. (see also Mike Dunlavey's simple do-it-by-hand method which works for big hotspots.)
Counting actual instructions retired and branch mispredicts and so on is only useful if you really understand the details of the CPU pipeline and how to optimize around it. Microseconds-per-function is typically what you really want to optimize instead.

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However, there is something strange, and I don't know if it's an anomaly or if that's normal: I get different readouts each time I run my CUDA app.
Could these readings be related to design, or is that something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well, it depends on how big the differences are. One source of anomalies is the kernel scheduler: it may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as more execution time. If the differences are very large, I would check your code, but if they are small, on the order of milliseconds, I wouldn't worry about it. ±10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

Porting Autodesk Animator Pro to be cross platform

A previous relevant question from me is here: Reverse Engineering old paint programs.
I have set up my base of operations here: http://animatorpro.org
wiki coming soon.
Okay, so now I have a 300,000 line legacy MSDOS codebase. It's sort of a "be careful what you wish for" situation. I am not an experienced C programmer. I'm not entirely inexperienced either, but for all intents and purposes I'm a noob to the language and in particular the intricacies of its libraries. I am especially ignorant of the vagaries of the differences between C programs written specifically for MSDOS and programs that are cross platform. However I have been studying this code base for over a year now, and this is what I know about Animator Pro:
Compilers and tools used:
Watcom C compiler
tcmake (make program from Turbo C)
386asm, a specialised assembler for the Phar Lap dos extender
and of course, the Phar Lap dos extender itself.
a selection of obscure dos utilities
Much of the compilation seems to be driven by batch files. Though I have obtained copies of all these tools, I have not yet succeeded at compiling it (though I have compiled its older brother, the original Autodesk Animator).
It's got a plugin system that replicates DLL before DLL's were available, based on REX. The plugin system handles:
Video Drivers (with a plethora of included VESA drivers)
Input drivers (including wacom tablets, and keyboards)
Drawing Tools
Inks (Like photoshop's filters, or blending modes)
Scripting Addons (essentially compiled scripts)
File formats
It's got its own script interpreter named POCO, based on the C language. The scripting language has enough power to do virtually all the things the plugin system can do, just slower.
Given this information, this is my development plan. Please criticise this. The source code is available in the link above, so you can easily, if you are so inclined, assess the situation yourself.
1. Compile with its original tools.
2. Switch to using DJGPP, and make the necessary changes to get it to compile with that, plus the original assembler.
3. Include the Allegro.cc "Game" library, and switch over as much functionality to that library as possible, perhaps by simply writing new video and input drivers that use the Allegro API. I'm thinking Allegro rather than SDL because there is a DOS version of Allegro and, fascinatingly, one of its core functions is the ability to play Animator Pro's native format, FLIC.
4. Hopefully after step 3, I will have eliminated most or all of the assembler in the project. I say hopefully, because it's in an obscure dialect that doesn't assemble in any modern free assembler without significant modification. I have tried them all. Whatever is left gets converted to assemble in NASM, or to C code if I can work out what the assembly actually does.
5. Switch the DOS extender from Phar Lap to HX DOS (http://www.japheth.de/HX.html), which promises to replicate as much of the Win32 API as possible. Then make all the necessary code changes for that to work.
6. Switch to the Win32 version of Allegro.cc, assuming that the Win32 version can run on top of HX DOS. Make any further necessary changes.
7. Modify the plugin system to use some kind of standard cross-platform plugin library. What this would be, I have no idea. Maybe you can offer some suggestions? I talked to the developer who originally wrote the plugin system, and he said some of the things it does aren't possible on modern OSes because of segmentation restrictions. I'm not sure what this means, but I'm guessing it means all the plugins will need to be rewritten almost from scratch.
8. Magically, I got all the above done, and we can try to make it run on Windows, OS X, and Linux, whilst dealing with other cross-platform niggles like long file names, and things I haven't thought of.
Anyone got a problem with any of this? Is Allegro a good choice? If not, why? What would you do about this plugin system? What would you do differently? Is this whole thing foolish, and should I just rewrite it from scratch, using the original as inspiration? (It would apparently take the original developer "about a month" to do that.)
One thing I haven't covered above is the text/font system. Not sure what to do about that, but Animator Pro has its own custom font format, but also is able to use Postscript Type 1 fonts, and some other formats.
My biggest concern with your plan, in a nutshell: Your approach seems to be to attempt to keep the whole enormous thing working at all times, tweaking the environment ever-further away from DOS. During each tweak to the environment, that means you will have approximately a billion subtle assumptions that might have broken at once, none of which you necessarily understand yet. Untangling them all at once will be incredibly painful.
If I were doing the port, my approach would be to disable as much code as possible to get SOMETHING running in a modern environment, and bring the parts back online, one piece at a time. Write a simple test harness program that loads a display driver and draws some stuff, and compile it for DOS to make sure you understand the interface. Then write some C code that implements the same interface, but with Allegro (or SDL or SFML), and make that program work under Windows or Linux. When the output differs, you have a simple test case to work from.
Your entire job on this port is swapping out implementations of various interfaces and functions with completely new ones. This is a job that unit testing excels at. Don't write any new code without a test of some kind that runs on the old code under DOS! Make your potential problems as small and simple as you possibly can. Port assembly code instead of rewriting it only if you're reasonably confident that it will actually make your job easier (ie, algorithmic stuff that compiles fine with few tweaks under NASM). Don't bite off a bigger piece than you can comfortably fit in your brain at once.
I, for one, look forward to seeing your progress! I think what you're attempting to do is great. Thanks for doing it.
Hmmm, I might approach it by writing an OpenGL video "driver" for it. Today's machines are fast enough, with tons of RAM, that you could do all the pixel-specific algorithms on the main CPU into a back buffer and it would work. As the "generic" VGA driver just mapped the video buffer to a pointer, this would be a place to start. There was a zoom mode in the UI, so you can look at the pixels on a high-res display.
It is often very difficult to take an existing non-trivial code base that wasn't written with portability in mind - you mention a few - and then try to make it portable. There will be a lot of problems on the way. It is probably a better idea to start from scratch and rewrite the code using the existing code as reference only. If you start from scratch you can leverage existing portable UI solution in your new project like Qt.

How would I go about creating my own VM?

I'm wondering how to create a minimal virtual machine that'll be modeled after the Intel 16 bit system. This would be my first actual C project, most of my code is 100 lines or less, but I have the core fundamentals down, read K&R, and understand how things ought to work, so this pretty much is a test of wits.
Could anyone guide me in as far as documentation, tools, tutorials, or plain old tips/pointers on how to go about this, thus far I understand that I require somewhere to store data, a CPU of sorts and some sort of mechanism that functions as an interrupt controller.
I'm doing this to learn: Systems internals, ASM internals and C - three facets of computing that I want to learn in a singular project.
Please be kind enough not to tell me to do something simpler - that would only be annoying. :)
Thanks for reading, and hopefully writing!
Virtual machines fall into two categories: those that interpret the code one instruction at a time and those that compile the code to native instructions (e.g. "JIT").
The interpretation category is usually built around an instruction execution loop, using a switch statement, computed gotos or function pointers to determine how to execute each instruction.
There is a fun platform that is worth studying for its simplicity and fun: Corewars.
Corewars is a programming challenge game where programs written in "Redcode" run on a MARS VM. There are many MARS VMs, typically written in C.
It has also inspired 8086-based versions, where programs written in 8086 assembler battle.
Well, for starters I would pick up a reference book for assembly language for the processor you intend to virtualize, like 80286 or similar.
For a JIT, you might want to dynamically generate and execute x86 code.
If you want to write a virtual machine using the x86 hardware virtualization (VMM) technology, you will need quite a few things.
There are a few critical instructions for entering and leaving the guest (VMLAUNCH/VMRESUME on Intel VT-x, VMRUN on AMD-V; the names differ between vendors but the functionality is similar). Those instructions are privileged, and therefore you will need to write a kernel module to use them.
The first step for your VM to start is to boot it and therefore, you will need a 'BIOS' which will be loaded. Then you need to emulate devices, etc. You could even run an old version of MSDOS in such a VM if you wanted to.
All in all, it clearly isn't trivial and requires a lot of time and effort.
Now, you could do something similar to what VMWare used to do before the Virtualization ready CPUs appeared.

How do you profile your code?

I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually (using log statements with timestamps and another script to parse the log and output to Excel. Phew...)
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great.
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows.
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It is a slightly different mindset to a traditional profiler that instruments the code. It works by sampling the processor to see where the instruction pointer is 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol like there is with traditional code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us more about your platform.
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use devpartner with MSVC 6 and XP
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now.
