How do I write and run a program that shows CPU usage starting from OS boot time?
I'd like to measure how much of the processor, hard disk, and RAM is used during system start-up.
OS: Ubuntu 16.04, Windows 7, or any other on which this is easier to implement.
Windows: you can call GetTickCount64 to get the number of milliseconds since the system was started, and QueryIdleProcessorCycleTime to get the idle time each logical processor has accumulated. Comparing the two gives you the time spent doing something useful (note that the idle figure is reported in CPU cycles, so it must be converted before subtracting).
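A minimal sketch of that approach in C, assuming Windows Vista or later. QueryIdleProcessorCycleTime reports idle cycles per logical processor, and the fixed-size array is an assumption for brevity:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONGLONG uptime_ms = GetTickCount64();     /* milliseconds since boot */

    SYSTEM_INFO si;
    GetSystemInfo(&si);                         /* number of logical CPUs */

    ULONG64 idle_cycles[256] = { 0 };           /* assumes <= 256 logical CPUs */
    ULONG len = si.dwNumberOfProcessors * sizeof(ULONG64);
    if (!QueryIdleProcessorCycleTime(&len, idle_cycles)) {
        fprintf(stderr, "QueryIdleProcessorCycleTime failed\n");
        return 1;
    }

    ULONG64 total_idle = 0;
    for (DWORD i = 0; i < si.dwNumberOfProcessors; i++)
        total_idle += idle_cycles[i];

    /* Converting idle cycles to milliseconds needs the CPU frequency,
       which is omitted here. */
    printf("Uptime: %llu ms, idle cycles across all CPUs: %llu\n",
           (unsigned long long)uptime_ms, (unsigned long long)total_idle);
    return 0;
}
```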
As for HDD and RAM, it depends on what you mean by 'used'... Number of reads? Writes? Allocated pages? Committed pages? Virtual or physical memory? Do page-file requests count as HDD or RAM usage?
Related
If the machine's memory is far larger than the cache configured for a storage system, the file system ends up caching far more data than the storage system's configured cache holds. So how can you run reproducible benchmarks across machines with different amounts of memory but the same configured storage cache?
Maybe try running a program that allocates and locks a bunch of memory (i.e. pin it so it can't be paged out), then sleeps. Kill it when you want to release the memory.
Specifically, I'm thinking of the mlock(2) POSIX system call, or the Linux-specific MAP_LOCKED flag for mmap(2). This requires root, since the default ulimit -l is only 64 KiB for non-root users, at least on my Ubuntu desktop.
On an otherwise-idle system with nothing using much memory, it should be easy to detect the total present and lock all but 2GB of it, for example. It's probably less easy to choose a reasonable size to lock on systems with other processes running and using varying amounts of RAM.
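A sketch of that idea, assuming Linux (MAP_LOCKED is Linux-specific, and the 1 GiB size is a placeholder you would compute from the detected total):

```c
#define _GNU_SOURCE              /* for MAP_LOCKED and MAP_ANONYMOUS */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE_TO_LOCK (1UL << 30)     /* 1 GiB, illustrative only */

int main(void)
{
    /* Allocate and lock in one step; pages are faulted in and pinned. */
    void *p = mmap(NULL, SIZE_TO_LOCK, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_LOCKED)");  /* expect ENOMEM if ulimit -l is low */
        return EXIT_FAILURE;
    }
    printf("Locked %lu MiB; kill this process to release it.\n",
           SIZE_TO_LOCK >> 20);
    pause();                         /* sleep until killed */
    return 0;
}
```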
My C program, which sorts data, runs 10x slower the first time than on subsequent runs. It sorts a file of integers, and even if I change the numbers, the program still runs fast on later runs. When I restart the PC, the very first run is again 10x slower. I use time to measure the duration.
The operating system holds the data in RAM even if it's not needed anymore (this is called "caching"), so when the program runs again, it gets all the data from there and there's no disk I/O. Even when you change the data, that change happens in RAM first, and it stays there even after it's written to the file.
It doesn't stay in RAM forever though, mind you. If the memory is needed for something else, the cache is deleted. At that point, a disk access is needed (and it's cached in RAM again at that point.)
This is why first access after a reboot is always slow; the data hasn't been cached yet since it was never read from the file.
You have to make hypotheses and confront them with reality. The first you can reasonably make is that this smells a lot like a caching issue!
Ask yourself these questions:
Does my data fit in free RAM (i.e. is my file cached by the OS file-system cache)?
Does my data fit in the CPU data cache?
Does my data fit in the HDD's internal cache?
The easiest hypothesis to discard is the FS cache. Under Linux, just issue sync; echo 3 > /proc/sys/vm/drop_caches between each run of your program. The first command makes sure the cached data reaches the physical medium (the hard drive); the second drops the contents of the file-system cache from memory.
The 'physical medium' might be the HDD's own cache, so beware... Under Linux you can disable this write-back cache with the command hdparm -W 0 <device>; for instance, if you are working with drive sda, hdparm -W 0 /dev/sda will do the job. You might want to re-enable it after you are finished with your tests :)
Another hypothesis is the CPU cache; have a look at How can I do a CPU cache flush in x86 Windows? and How to clear CPU L1 and L2 cache.
Well, it may or may not be one of those, but it doesn't hurt trying :)
If your program does network access, then that could be the reason for the initial delay. Many network protocols need time to set things up. Some examples:
DNS: if your program does any network access, chances are it needs to resolve a hostname to an IP address. The first lookup needs at least one network round trip to populate a local cache; subsequent requests are shorter (see the timing sketch after this list).
Networked filesystems (NFS, CIFS and others): opening files can happen through the network.
Even some seemingly innocuous library functions can require network access: the host's user list can live on a remote directory server.
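A sketch of how to check the DNS hypothesis: time two consecutive lookups of the same name and see whether the second is cheaper. Whether anything is actually cached depends on the system (nscd, systemd-resolved, the router), and example.com is just an illustrative hostname:

```c
#include <netdb.h>
#include <stdio.h>
#include <time.h>

/* Time a single name resolution in seconds. */
static double lookup_seconds(const char *host)
{
    struct timespec s, e;
    struct addrinfo *res = NULL;

    clock_gettime(CLOCK_MONOTONIC, &s);
    if (getaddrinfo(host, NULL, NULL, &res) == 0)
        freeaddrinfo(res);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
    printf("first : %.4f s\n", lookup_seconds("example.com"));
    printf("second: %.4f s\n", lookup_seconds("example.com"));
    return 0;
}
```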
Apart from this, you could use a low-level tracing tool to see where the time is spent. On Linux a basic tool is strace -r; there are probably similar tools on other systems. Your toolchain likely also includes a profiler (e.g. gprof for GCC), or you can use Valgrind.
I had a very similar issue but I wasn't loading in a large file - so I was baffled at the long first execution time (caching couldn't have been the issue).
This answer pointed me in the right direction - it was my real-time anti-virus protection. Every time I recompiled the program it would re-scan it as being potentially malicious. I added my project path as an "Exception" to Avira's (in my case) real-time virus protection.
The first execution is now lightning quick!
This is nothing new; it's not just your program, many popular commercial software packages face this problem.
To start with, check this MATLAB article about slow first-time execution.
For languages that run on a virtual machine, like C# or Java, this is quite common:
http://en.wikipedia.org/wiki/Just-in-time_compilation#Startup_delay_and_optimizations
Caching is a plausible reason for this to happen in C, but 10x is still quite a large gap. It is also possible that your system was still loading other resources after the restart.
You should run the program, say, 10 minutes after the restart for better results; all the startup applications will have loaded by then. (Whether 10 minutes is enough depends on the number of startup applications and how long each takes to start.)
This is because of compiler optimization: the result is cached for temporal locality and the activation record is saved; time is also saved because the binding objects do not have to be reloaded during the linking stage.
There are two components to the measured time. If you are reading a file from disk, loading it into memory, and sorting:
1) Time to read the file and store it in an array
2) Time to sort
Were these measured separately?
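A sketch of how to measure the two components separately, so cache effects on file I/O can be told apart from CPU-bound sort time. qsort, the element count, and "numbers.bin" are placeholders for the asker's actual code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static double seconds(struct timespec s, struct timespec e)
{
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1000000 };                      /* element count: a guess */
    int *a = malloc(N * sizeof *a);
    struct timespec t0, t1, t2;

    if (!a)
        return EXIT_FAILURE;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    FILE *f = fopen("numbers.bin", "rb");      /* hypothetical input file */
    if (!f || fread(a, sizeof *a, N, f) != N) {
        perror("read");
        return EXIT_FAILURE;
    }
    fclose(f);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    qsort(a, N, sizeof *a, cmp_int);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("read: %.3f s   sort: %.3f s\n", seconds(t0, t1), seconds(t1, t2));
    free(a);
    return 0;
}
```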
Can you check this out?
Invalidating Linux Buffer Cache
Instead of rebooting, repeat the experiment after clearing the cache; if that reproduces the slow first run, you can infer that file buffer caching effects were the cause.
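For a single test file, one way to do the linked technique without root is posix_fadvise(POSIX_FADV_DONTNEED); a sketch, with "numbers.bin" standing in for the asker's input file:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("numbers.bin", O_RDONLY);   /* hypothetical input file */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }
    fdatasync(fd);        /* flush dirty pages first, or they stay cached */

    /* Ask the kernel to drop this file's cached pages; it is only a hint,
       and posix_fadvise returns the error number directly on failure. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: error %d\n", rc);
    close(fd);
    return 0;
}
```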
I have an evaluation board (Olimex STM32-P103) that has an SD-card connector. I want to put my program on an SD card instead of in the microcontroller's internal flash, and run it from there.
I don't know whether that is possible, given bootloader constraints!
P.S. My goal is to run Linux on this board and then port my application to it.
To run programs from an SD card, you should know that in general you can't run them "right away". You have to load them into executable memory somewhere in your address space, which is done by a (more or less) simple bootloader. In the simplest case, the bootloader reads a specific binary from the SD card and copies it into memory.
That being said, consider that you only have 20 KB of RAM and 128 KB of flash on this board. So where would your program go? Or better: why not flash the program into the 128 KB of flash from the very beginning? You should also know that Linux is rather "hungry" in terms of memory.
If your goal is to run a "normal" Linux on this board, I'm afraid you're out of luck: as far as I know, Linux needs an MMU to run, and the chip on this board does not provide one (as far as can be determined without access to datasheets from ST).
If you're lucky you can go with uClinux. I'm not sure a finished port exists for the STM32, but a short Google search for "STM32 uClinux" turns up some resources. Even if you manage to run uClinux, I'm afraid there won't be much left in your system for your application, so the result might be a bit disappointing.
Depending on why you want Linux on this MCU, there may be other solutions, such as FreeRTOS in combination with an lwIP stack (if networking is needed), or a FAT library like FullFAT if you want to read SD cards and the like.
Edit: I'd like to add that booting from an SD card is typically something you do on (slightly) bigger systems, where you have enough RAM to hold the whole image you want to run and still have some space left for the data you want to process.
You're going to have to have some code in the STM's onboard flash (typically called a "boot loader") that implements this, since the "bare metal" very likely can't boot from an SD card.
You're going to have to build that code, which figures out how to use the STM's onboard peripherals to talk to the SD card, finds the file you want to run in the file system (which you also have to implement), and loads it.
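As a very rough illustration of the final step of such a loader on a Cortex-M, here is a sketch: the SD and FAT code is omitted, and APP_BASE is a hypothetical RAM address where the image was already copied. It assumes the image starts with a standard Cortex-M vector table:

```c
#include <stdint.h>

#define APP_BASE  0x20000000u   /* hypothetical load address in RAM */
#define SCB_VTOR  (*(volatile uint32_t *)0xE000ED08u)

/* Word 0 of the image = initial stack pointer, word 1 = reset handler. */
static void jump_to_app(void)
{
    const uint32_t *vectors = (const uint32_t *)APP_BASE;
    uint32_t sp    = vectors[0];
    uint32_t entry = vectors[1];

    SCB_VTOR = APP_BASE;                           /* retarget vector table */
    __asm__ volatile ("msr msp, %0" : : "r" (sp)); /* set the main stack */
    ((void (*)(void))entry)();                     /* jump; never returns */
}

int main(void)
{
    /* ... init SD card, mount FAT, copy the binary to APP_BASE ... */
    jump_to_app();
    return 0;
}
```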
I wanted to include a link to the STM standard peripheral library, but it seems to be down (being moved). :/
The data on the SD card is not memory mapped, so cannot be executed directly.
It is possible to dynamically load the data from the card into RAM for execution. Wind River's VxWorks RTOS supports loading and linking object modules dynamically; I know of no other OS that scales down to a Cortex-M and directly supports that, but it would be possible to write your own.
However, I would suggest that for the microcontroller you are using the idea is ill-advised. Optimal performance on a Cortex-M is achieved when code is in on-chip flash and data is in RAM, allowing data and instruction fetches to occur simultaneously on separate buses (Harvard architecture). If you execute code from RAM, performance takes a severe hit, since data and instructions must then be fetched sequentially over the same bus.
The board is entirely unsuited to running Linux: with only 128 KB of program flash and 20 KB of RAM, it is not at all feasible. Even the smallest Linux distribution requires about 600 KB of RAM plus whatever the application code needs. uClinux can just about run on a higher-end STM32 with external RAM and flash, but that would suffer the same bus-contention performance hit, and Linux without an MMU misses the one major benefit of using Linux at all. The part on your board lacks an external memory interface, so it cannot be expanded to support Linux.
If you need an OS, consider an RTOS such as uC/OS-II, FreeRTOS, or embOS, for example.
As others have said, you cannot execute your code directly from the SD card.
But, like those "Linux boards", you can load the stored kernel/program into an external SDRAM that can be memory-mapped, and execute it from there.
You'll still need to write that "bootloader" and store it in the internal flash.
That's a lot of work, in my opinion, for a limited application.
If you want to write your application in a Linux environment and then port it to such a small target, I would rather design the application using dependency injection, or even use an emulator.
How can you limit the physical memory consumption of a C program from within the source code on a Linux 2.6.32 machine?
I need to determine the type of page replacement algorithm the system is using.
The problem is that without limiting the number of pages a process can have in memory, it becomes difficult to analyze the pattern of page faults to determine the page replacement algorithm.
Also, I don't have root access on the machine.
setrlimit(RLIMIT_MEMLOCK, ...).
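A minimal sketch of that call. Note that the answer names RLIMIT_MEMLOCK, which bounds locked memory rather than the resident set as a whole, so treat the choice of limit as something to verify for the experiment; the 16 MiB figure is an arbitrary placeholder:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    rl.rlim_cur = 16UL << 20;   /* 16 MiB soft limit (placeholder) */
    rl.rlim_max = 16UL << 20;   /* 16 MiB hard limit */

    /* Lowering your own limits requires no root privileges;
       only raising the hard limit does. */
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("setrlimit");
        return EXIT_FAILURE;
    }

    /* ... run the page-fault experiment under the new limit ... */
    return 0;
}
```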