I've been looking for a C profiler for Windows that will let me inspect time spent at the level of source-code lines, as opposed to just at the level of functions. This is in order to find hotspots in the program that can be optimized.
Very Sleepy looks great for this purpose. However, in the Source view, the time spent per line of code doesn't seem to add up to 100% of the Exclusive time for the function.
For example, Very Sleepy says we spent 18.50s of Exclusive time in the function. But the time durations shown in the Source view for that function only add up to about 10s.
This is how I compile the program:
gcc -IC:/msys64_new/mingw64/include *.c -o plane.exe -g -gdwarf-2 -fno-omit-frame-pointer -O2 -Wall -Wno-unused -LC:/msys64_new/mingw64/lib -lShlwapi
I then open Very Sleepy through the GUI and sample the running process for exactly 100 seconds.
I'm using Very Sleepy CS 0.90. I'm running Windows 7 and using the Mingw-w64 subsystem of MSYS2.
EDIT:
I've also noticed two additional weird things. First of all, Very Sleepy displays some functions without their name, but does recognize them as part of the profiled module.
Secondly, Very Sleepy seems to think a few variables are actually functions. For example:
extension_module_file_suffix is not a function, it's a variable. What's going on?
I have a program called parser that I compiled with the -g flag. This is my makefile:
parser: header.h parser.c
gcc -g header.h parser.c -o parser
clean:
rm -f parser a.out
The code for one function in parser.c is:
int _find(char *html , struct html_tag **obj)
{
    char temp[strlen("<end")+1];
    memcpy(temp,"<end",strlen("<end")+1);
    ...
    ...
    .
    return 0;
}
What I would like, when I debug parser, is the ability to change lines of code after hitting a breakpoint, while stepping through the function above with n. If that is not GDB's job, is there an open-source tool that lets me change the code, and possibly save it, so that when I step to the next statement the changed statement (for example, one using a different array index) is what executes? Can this be done in GDB, and do I need any particular compile options?
I know I can assign values to variables at runtime in GDB, but is that all? Is there any way to actually change the source as well?
Most C implementations are compiled. The source code is analyzed and translated to processor instructions. This translation would be difficult to do on a piecewise basis. That is, given some small change in the source code, it would be practically impossible to update the executable file to represent those changes. As part of the translation, the compiler transforms and intertwines statements, assigns processor registers to be used for computing parts of expressions, designates places in memory to hold data, and more. When source code is changed slightly, this may result in a new compilation happening to use a different register in one place or needing more or less memory in a particular function, which results in data moving back or forth. Merging these changes into the running program would require figuring out all the differences, moving things in memory, rearranging what is in what processor register, and so on. For practical purposes, these changes are impossible.
GDB does not support this.
(Apple's developer tools may have some feature like this. I saw it demonstrated for the Swift programming language but have not used it.)
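What GDB can do at a breakpoint is change data rather than code, as the question already notes. A minimal sketch of that, assuming a hypothetical function `pick` that is not part of the original parser; the GDB commands in the trailing comment are one possible session, not output from a real run:

```c
#include <stddef.h>

/* Hypothetical example function: returns the element at position idx,
 * or -1 if idx is out of range. */
int pick(const int *values, size_t count, size_t idx)
{
    if (idx >= count)
        return -1;
    return values[idx];   /* uses whatever value idx holds at this point */
}

/*
 * Possible GDB session (build with -g, preferably without optimization):
 *   (gdb) break pick
 *   (gdb) run
 *   (gdb) set var idx = 2     <- changes the variable, not the source
 *   (gdb) next
 */
```

The source file and the compiled instructions stay exactly as they were; only the value that the existing statements operate on is different.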
I've gotten a piece of software working, and am now trying to tune it up so it runs faster. I discovered something that struck me as, well, just bizarre. It's no longer relevant, because I switched to using a pointer instead of indexing an array (it's faster with the pointers), but I'd still like to know what is going on.
Here's the code:
short mask_num_vals(short mask)
{
    short count = 0;
    for(short val=0;val<NUM_VALS;val++)
        if(mask & val_masks[val])
            count++;
    return count;
}
This small piece of code is called many many times. What really surprised me is that this code runs significantly faster than its predecessor, which simply had the two arguments to the "&" operation reversed.
Now, I would have thought the two versions would be, for all practical purposes, identical, and they do produce the same result. But the version above is faster - noticeably faster. It makes about a 5% difference in the running time of the overall code that uses it. My attempt to measure the amount of time spent in the function above failed completely - measuring the time used up far more time than actually executing the rest of the code. (A version of Heisenberg's principle for software, I guess.)
So my picture here is, the compiled code evaluates the two arguments, and then does a bitwise "and" on them. Who cares which order the arguments are in? Apparently the compiler or the computer does.
My completely unsupported conjecture is that the compiled code must be evaluating "val_masks[val]" for each bit. If "val_masks[val]" comes first, it evaluates it for every bit, if "mask" comes first, it doesn't bother with "val_masks[val]" if that particular bit in "mask" is zero. I have no evidence whatsoever to support this conjecture; I just can't think of anything else that might cause this behaviour.
Does this seem likely? This behaviour just seemed weird to me, and I think points to some difference in my picture of how the compiled code works, and how it actually works. Again, not all that relevant any more, as I've evolved the code further (using pointers instead of arrays). But I'd still be interested in knowing what is causing this.
Hardware is an Apple MacBook Pro 15-inch 2018, MacOS 10.15.5. Software is gcc compiler, and "gcc --version" produces the following output.
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Compiled with the command "gcc -c -Wall 'C filename'", linked with "gcc -o -Wall 'object filenames'".
Code optimizers are often unpredictable. Their output can change after small meaningless tweaks in code, or after changing command-line options, or after upgrading the compiler. You cannot always explain why the compiler does some optimization in one case but not in another; you can guess all you want, but only experience can show.
One powerful technique in determining what is going on: convert your two versions of code to assembly language and compare.
GCC can be invoked with the command-line switch -S for that:
gcc -S -Wall -O -fverbose-asm your-c-source.c
which produces a textual assembler file your-c-source.s (you could glance into it using a pager like less or a source code editor like GNU emacs) from the C file your-c-source.c
The Clang compiler has similar options.
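To make that comparison concrete, here is a small sketch that puts the two variants from the question side by side in one file. The value of NUM_VALS and the contents of val_masks are assumptions for illustration only, since the question does not show them, and the names mask_num_vals_a and mask_num_vals_b are made up here.

```c
#define NUM_VALS 9   /* assumed; the question does not give the real value */

/* Assumed table contents: one bit per value. */
static const short val_masks[NUM_VALS] = {
    0x001, 0x002, 0x004, 0x008, 0x010, 0x020, 0x040, 0x080, 0x100
};

/* Version shown in the question: mask on the left of '&'. */
short mask_num_vals_a(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}

/* Predecessor as described in the question: operands of '&' reversed. */
short mask_num_vals_b(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (val_masks[val] & mask)
            count++;
    return count;
}
```

Compiling this file with -S and reading the assembly for the two functions next to each other should show whether the compiler really emits different code for them. Since the build command in the question has no optimization flag, it may be worth looking at the output both with and without -O.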
I'm running OS X 10.12 and I'm developing a basic text-based operating system. I have developed a boot loader and that seems to be running fine. My only problem is that when I attempt to compile my kernel into pure binary, the linker won't work. I have done some research and I think that this is because of the fact OS X runs the Darwin linker and not the GNU linker. Because of this, I have downloaded and installed the GNU binutils. However, it still won't work...
Here is my kernel:
void main() {
    // Create pointer to a character and point it to the first cell of video
    // memory (i.e. the top-left)
    char* video_memory = (char*) 0xb8000;
    // At that address, put an x
    *video_memory = 'x';
}
And this is when I attempt to compile it:
Hazims-MacBook-Pro:32 bit root# gcc -ffreestanding -c kernel.c -o kernel.o
Hazims-MacBook-Pro:32 bit root# ld -o kernel.bin -T text 0x1000 kernel.o --oformat binary
ld: unknown option: -T
Hazims-MacBook-Pro:32 bit root#
I would love to know how to solve this issue. Thank you for your time.
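As a side note on the kernel code itself, separate from the linker problem: writes to memory-mapped video memory are usually done through a volatile pointer so the compiler does not optimize them away, and each VGA text-mode cell is two bytes (character, then attribute). A minimal sketch, where put_cell and VGA_TEXT_BASE are hypothetical names introduced here:

```c
#include <stdint.h>

#define VGA_TEXT_BASE 0xb8000u   /* the address used in the kernel above */

/* Writes character c with attribute attr into cell number `cell` of a
 * text buffer. Each cell is two bytes: character first, attribute second. */
void put_cell(volatile uint8_t *buf, unsigned cell, char c, uint8_t attr)
{
    buf[cell * 2]     = (uint8_t)c;
    buf[cell * 2 + 1] = attr;
}
```

In the kernel this might be called as put_cell((volatile uint8_t *)VGA_TEXT_BASE, 0, 'x', 0x0f); to put a white-on-black x in the top-left cell. Taking the buffer as a parameter also means the function can be exercised against an ordinary array on the host.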
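-T is not an option that the OS X (Darwin) linker understands, which is why ld rejects it. It is a GNU linker option, and gcc also accepts it and passes it through to the linker. Have a look at this: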
With these components you can now actually build the final kernel. We use the compiler as the linker as it allows it greater control over the link process. Note that if your kernel is written in C++, you should use the C++ compiler instead.
You can then link your kernel using:
i686-elf-gcc -T linker.ld -o myos.bin -ffreestanding -O2 -nostdlib boot.o kernel.o -lgcc
Note: Some tutorials suggest linking with i686-elf-ld rather than the compiler, however this prevents the compiler from performing various tasks during linking.
The file myos.bin is now your kernel (all other files are no longer needed). Note that we are linking against libgcc, which implements various runtime routines that your cross-compiler depends on. Leaving it out will give you problems in the future. If you did not build and install libgcc as part of your cross-compiler, you should go back now and build a cross-compiler with libgcc. The compiler depends on this library and will use it regardless of whether you provide it or not.
This is all taken directly from OSDev, which documents the entire process, including a bare-bones kernel, very clearly.
You're correct that you probably want binutils for this, especially if you're coding bare metal. While clang as-is purports to be a cross compiler, it's far from optimal or usable here, for various reasons. I infer that you're developing on ARM, in which case you want this:
https://developer.arm.com/open-source/gnu-toolchain/gnu-rm
Aside from the fact that gcc does this markedly better than clang, there's also the issue that ld from the binutils package does not build on OS X. In some configurations it fails silently, so you may never have actually installed it despite watching libiberty etc. build; sometimes it will even go through the motions of compiling the source for that target and then just refuse to link it. To the fellow with the lousy tone blaming the OP: if you had relevant experience, i.e. had ever built this under these conditions, you would know that is patently obnoxious. It'd be nice if you'd refrain from discouraging people from asking legitimate questions.
In the CXXfilt package they mumble about apple-darwin not being a target; try changing FAKE_TARGET from mn10003000-whatever, or whatever they used, to apple-rhapsody some time.
You're still in way better shape just building them from current if you, say, need to strip relocations from something or want to work on restoring static linkage to the system, which is missing by default from that clang installation as well. Anyhow, it's not really that ld couldn't work with Mach-O; it's all there codewise, in fact. That I am sure of.
Regarding locating things in memory, you may want to refer to a linker script
http://svn.screwjackllc.com/?p=noid.git;a=blob_plain;f=new_mbed_bs.link_script.ld
I have some code in there that places things directly in memory. Rather than doing it on the command line, it is more reproducible to go with the linker script. It's a little complex, but what it does is set up a couple of regions of memory to be used with my memory allocators. You can use malloc, but you should prefer not to use actual malloc; dynamic memory is fine when it isn't dynamic... heh...
The script also sets flags for the stack and heap locations. They are just markers, not loaded until go time; the stack and heap actually get placed by the startup code, which is in assembly and rather readable and well commented (hard to believe, I know). A neat trick: you have some persistence in volatile memory, so I set aside a very tiny bit to flip, and you can do things like have it control which bootloader runs on the next power cycle. Again, you are 100% correct regarding the linker; it seems you are headed in the right direction.
Incidentally, there is another way to modify objects prior to loading them and to preload things in memory (well, there are a ton of ways): check out objcopy and objdump. You can use gdb to dump srecs of structures in memory, note the address, and then, after assembly but before linking, use dd to insert the records you extracted with gdb back into the extracted sections. It's one of my favorite ways, just because it is the smartass route :D Also, if you are ever tight on memory and need to precalculate constants, it's one way to optimize things. That way is actually closer to what ld is doing, just done by hand. The path of least resistance here, though, is probably the linker script.
I am using an ARM Cortex-A9 platform to measure the performance of some algorithms. More specifically, I measure the execution time of one algorithm using the clock() function (time.h). I call clock() right before calling my algorithm and right after the algorithm returns.
....
....
....
start=clock();
alg();
end=clock();
...
...
...
Then I compile the code with exactly the same options and produce two different object files. The first is named n and the second nn. On the ARM platform I run my code on one core; the affinity of all other tasks is set to the other cores. Object file n reports 0.12 sec while object file nn reports 0.1 sec. I compared the two binary files and they show no differences. I noticed that if I give the object file a name longer than one letter, I always get a shorter execution time for my algorithm. Moreover, if I run the n.c file, then rename it and run it again, I also get different performance numbers.
Could you please give me some ideas why something like this happens? Thanks in advance
P.S.1: I am using gcc 4.8.1 cross compiler.
P.S.2: I compile my code with
arm-none-linux-gnueabi-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=softfp -mfpu=neon -Ofast code.c -o n
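One way to check whether a difference of about 0.02 sec is larger than the run-to-run variation is to repeat the clock() measurement described above several times inside the same process and compare the spread. A minimal sketch under that assumption; min_cpu_seconds is a hypothetical helper, and the alg() body here is only a stand-in for the real algorithm:

```c
#include <time.h>

/* Stand-in workload; the real alg() from the question is not shown. */
static volatile unsigned long sink;
void alg(void)
{
    for (unsigned long i = 0; i < 100000UL; i++)
        sink += i;
}

/* Runs fn() `repeats` times and returns the smallest CPU time observed,
 * in seconds. Returns a negative value if repeats is not positive. */
double min_cpu_seconds(void (*fn)(void), int repeats)
{
    double best = -1.0;
    for (int i = 0; i < repeats; i++) {
        clock_t start = clock();
        fn();
        clock_t end = clock();
        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

Calling min_cpu_seconds(alg, 20) in both the n and nn builds and comparing the minimum (and ideally the full range) would show whether the two names really behave differently or whether single measurements are simply noisy.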
In order to force a function that was consuming 46% of the runtime to not be inlined, I used __attribute__((noinline)) on it and compiled the code with gcc -Wall -Winline -O2 using gcc 4.5.2 (these flags plus -g are what the Makefile uses; I see roughly the same effect with -g as well). I found that the program with the non-inlined function is more than 20% faster than the original. Does anyone know why this might be?
Let me provide some more details. The program that this occurred in is the latest version of the compression utility bzip2 for Linux. The key function ( generateMTFValues found in compress.c) in the program is the one that does the Move To Front transform. This function is only called by one function in the program.
Does anyone have any idea why the program runs faster in this case by forcing the compiler not to inline this function? The function only takes one parameter - a pointer to a struct that contains all of the block and compression info. Also, it only calls one other function which doesn't really consume any substantial processing time.
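For readers unfamiliar with the attribute, the setup described above looks roughly like the sketch below. The functions sum_bytes and process_block are hypothetical stand-ins, not the bzip2 code; the attribute syntax is the GCC one named in the question.

```c
/* Hypothetical stand-in for a hot function with a single caller.
 * The attribute asks GCC not to inline it, even at -O2. */
__attribute__((noinline))
static unsigned sum_bytes(const unsigned char *buf, unsigned len)
{
    unsigned total = 0;
    for (unsigned i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* The only caller; without the attribute the compiler would be free
 * to inline sum_bytes here. */
unsigned process_block(const unsigned char *buf, unsigned len)
{
    return sum_bytes(buf, len);
}
```

Building this with and without the attribute and comparing the generated assembly (for example with gcc -S -O2) is one way to see what the inlining decision changes in the caller.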
Inlining can slow down the program, because the resulting code is larger and can lead to more misses in the CPU's instruction cache.
This is a complete WAG (Wild Ass Guess) based on near-perfect ignorance.
Could it be that for the inline version the optimizer is really busy juggling which values are in which registers and when? If that's the case, the procedure call version may give it room to devote more registers to what is happening in the loop.
As I said, just a WAG.