I'm implementing some form of internal profiler. Is there a way to know when and for how long a thread is context switched out? I know Windows has this via the Event Tracing API, and I know perf logs how many context switches happen. Is there a way to do it on Linux? Needing root privileges is not an issue, since it will be an internal tool.
Sort of.
See http://man7.org/linux/man-pages/man2/getrusage.2.html about the getrusage() function.
Notice that the structure it returns has voluntary and involuntary context switch numbers. Also, you have user and system time. Other APIs return wall-clock time.
Any wall-clock time greater than your user and system time is time you weren't running.
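For example, a minimal sketch of that idea (Linux-only, since RUSAGE_THREAD needs _GNU_SOURCE; region() stands for whatever code you want to profile):

    /* Sketch: estimate time spent switched out by comparing wall-clock
     * time against user+system CPU time, and report the context-switch
     * counters getrusage() keeps. */
    #define _GNU_SOURCE          /* for RUSAGE_THREAD (Linux-specific) */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double tv_to_sec(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    void report_region(void (*region)(void))
    {
        struct rusage start, end;
        struct timeval wall_start, wall_end;

        getrusage(RUSAGE_THREAD, &start);   /* RUSAGE_SELF for the whole process */
        gettimeofday(&wall_start, NULL);

        region();                           /* the code being profiled */

        gettimeofday(&wall_end, NULL);
        getrusage(RUSAGE_THREAD, &end);

        double wall = tv_to_sec(wall_end) - tv_to_sec(wall_start);
        double cpu  = (tv_to_sec(end.ru_utime) - tv_to_sec(start.ru_utime))
                    + (tv_to_sec(end.ru_stime) - tv_to_sec(start.ru_stime));

        printf("switched out for ~%.6f s\n", wall - cpu);
        printf("voluntary ctx switches:   %ld\n", end.ru_nvcsw  - start.ru_nvcsw);
        printf("involuntary ctx switches: %ld\n", end.ru_nivcsw - start.ru_nivcsw);
    }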
Other than that, you could probably use the kernel's ftrace facility. See https://www.kernel.org/doc/Documentation/trace/ftrace.txt
Read http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html for even more options.
I am building a custom VMM, and I am trying to implement a timeout without using signals (which are "sent" to the whole process) or threads (I'm not going to use threads).
Now, one idea is to implement the LAPIC and, just before executing the guest code, program the LAPIC timer to trigger after a certain time. It should be possible to get a fairly decent timeout with this. However, this solution is fairly painful to implement just for simple timeout behavior.
Is there no other, better way to get KVM to interrupt itself after a certain amount of time? I was really hoping for an argument to KVM_RUN or just about anything, really.
As should be plain from the title, the guest is executing in userspace most of the time. There is a razor thin kernel layer. I don't really want to install a LAPIC unless I absolutely have to. Ideas?
Using KVM_CREATE_IRQCHIP in combination with KVM_SET_LAPIC, we can use the emulated LVT timer to get per-thread execution timeouts on KVM without any trouble. Calling KVM_SET_LAPIC is expensive, but it is necessary in order to avoid exposing the MSRs and the device to the guest.
I also tried writing the MSRs through the KVM API, but even that is not possible (I'm guessing because the proper CPUID bits are not exposed). Either way, the LAPIC timer works no matter how many features you disable in the guest.
KVM_SET_LAPIC costs around 3 microseconds (which is extreme) on my machine, so I'm still looking for alternatives.
I speculate that, if you trust ring 0 in the kernel, writing just the x2APIC TIMER and INITCNT MSRs might be cheaper.
One thing to remember is to also set the CURRENTCNT register at the same time, because KVM_SET_LAPIC sets exactly the state you pass in, and if you end up with CURRENTCNT > INITCNT you will get a dmesg log entry, which can be expensive.
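For reference, the setup looks roughly like the sketch below. The register offsets are the standard APIC ones from the Intel SDM; arm_lapic_timeout(), the vector 0xEF and the divide value are arbitrary choices of mine, and it assumes a vCPU on a VM where KVM_CREATE_IRQCHIP has already been issued.

    /* Sketch: arm the emulated LAPIC timer before KVM_RUN. Assumes
     * vcpu_fd belongs to a VM created with KVM_CREATE_IRQCHIP, so the
     * in-kernel APIC is in use. */
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int arm_lapic_timeout(int vcpu_fd, uint32_t initial_count)
    {
        struct kvm_lapic_state lapic;

        if (ioctl(vcpu_fd, KVM_GET_LAPIC, &lapic) < 0)
            return -1;

        /* APIC register page offsets (Intel SDM): LVT Timer 0x320,
         * Initial Count 0x380, Current Count 0x390, Divide Config 0x3E0. */
        uint32_t lvt_timer = 0xEF;   /* one-shot mode, unmasked, vector 0xEF */
        uint32_t divide    = 0x3;    /* divide by 16 */

        memcpy(&lapic.regs[0x320], &lvt_timer, sizeof(lvt_timer));
        memcpy(&lapic.regs[0x3E0], &divide, sizeof(divide));
        memcpy(&lapic.regs[0x380], &initial_count, sizeof(initial_count));
        /* Keep CURRENTCNT consistent with INITCNT, as noted above. */
        memcpy(&lapic.regs[0x390], &initial_count, sizeof(initial_count));

        return ioctl(vcpu_fd, KVM_SET_LAPIC, &lapic);
    }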
For some context, I'm profiling the execution of Memcached, and I would like to monitor dTLB misses during the execution of a specific function. Since Memcached spawns multiple threads, each thread could potentially be executing the function in parallel. One solution I discovered, perf's toggle events feature (see "Using perf probe to monitor performance stats during a particular function"), should let me achieve this by setting probes on function entry and exit and toggling the event counter on/off at each probe respectively.
My question is:
(a) From my understanding, perf toggle events was included as part of a branch of the Linux 3.x kernel. Has this been incorporated into recent LTS releases of the 4.x kernel? If not, are there any other alternatives?
(b) Another workaround I found is described here: "performance monitoring for subset of process execution". However, I'm not sure this will work correctly for the problem at hand: since Memcached is multi-threaded, having each thread spawn a new child process may cause too much overhead.
Any suggestions?
I could only find the implementation of the toggle events feature in the /perf/core_toggle repo, which is maintained by the feature's developer. You can probably compile that code and experiment with the feature yourself; you can find examples of how to use it here. However, I don't think it has been accepted into the mainline Linux repo for any kernel version.
If you want to measure the counts of one or more events, there are alternatives that are easy to use but require adding a few lines of code to your codebase. You can use the perf interface programmatically, or use third-party tools that offer such APIs, such as PAPI and LIKWID.
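If you go the programmatic route without a library, a rough sketch using the raw perf_event_open(2) interface could look like this. instrumented_function is a stand-in for the Memcached function of interest; the counter is opened for the calling thread, so each worker thread would open its own descriptor:

    /* Sketch: count dTLB read misses around one call of a function
     * using perf_event_open(2) directly. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    uint64_t count_dtlb_read_misses(void (*instrumented_function)(void))
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_DTLB
                    | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                    | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* pid = 0, cpu = -1: count for the calling thread on any CPU. */
        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
            return 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        instrumented_function();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        close(fd);
        return count;
    }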
I have heard there are ways to modify Linux so that a particular application can obtain very low latency, so that whenever it asks for resources, the OS will try to give them to it as soon as possible, kind of overriding the default preemptive multitasking mechanism. I don't have a CS background, but the application I am working on is very latency-sensitive. Can anyone point me to any docs on this specific matter? Many thanks.
Guaranteed low-latency response is called real-time capability. It means that realistic timing goals are guaranteed to be met.
There is a project for it called RTLinux. See the Real-Time Linux Wiki: https://rt.wiki.kernel.org/index.php/Main_Page
There are two real-time models:
Soft real-time system: you get it by applying the RT-Preempt kernel patches. I think it guarantees a context switch within 10 ms. The goal of that project is to eventually conform to hard real-time requirements.
Hard real-time system: stricter guarantees (response within 1 ms). There are some libraries (like Xenomai) that claim to provide a hard real-time system.
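Even without those patches, you can ask a stock kernel to favour one process by giving it a real-time scheduling class and locking its memory. A minimal sketch (the priority value 80 is an arbitrary choice, and this needs root or a suitable rtprio limit):

    /* Sketch: request real-time scheduling and lock memory so the
     * process is neither preempted by normal tasks nor paged out. */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 };   /* 1..99 */

        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return EXIT_FAILURE;
        }
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return EXIT_FAILURE;
        }

        /* ... latency-sensitive work runs here ... */
        return EXIT_SUCCESS;
    }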
My application takes a checkpoint every few hundred milliseconds by using the fork system call. However, I notice that my application slows down significantly when checkpointing (forking) is enabled. I timed the fork call itself and it came out to be 1 to 2 ms. So why is fork slowing down my application so much? Note that I only keep one checkpoint (forked process) at a time and kill the previous checkpoint whenever I take a new one. Also, my computer has plenty of RAM.
Notice that my forked process just sleeps after creation. It is only woken when a rollback needs to be done, so it should not be scheduled by the OS. One thing that comes to mind is that, since fork is a copy-on-write mechanism, page faults occur whenever my application modifies a page. But should that slow the application down this much? Without checkpointing (forking), my application finishes in approximately 3.1 seconds, and with it, it takes around 3.7 seconds. Any idea what is slowing down my application?
You are probably observing the cost of the copy-on-write mechanism, as you hypothesize. That's actually quite expensive -- it is the reason vfork still exists. (The main cost is not the extra page faults themselves, but the memcpy of each page as it is touched, and the associated cache and TLB flushes.) It's not showing up as a cost of fork because the page faults don't happen inside the system call.
You can confirm the hypothesis by looking at the times reported by getrusage -- if this is correct, the extra time elapsed should be nearly all "system" time (CPU burnt inside the kernel). oprofile or perf will let you pin down the problem more specifically... if you can get them to work at all, which is nontrivial, alas.
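For instance, something along these lines (checkpoint_and_run() is a placeholder for one checkpoint interval of your application) would show whether the extra time appears as system time and minor page faults:

    /* Sketch: bracket one checkpoint interval with getrusage() to see
     * whether the slowdown shows up as system time and minor (COW) faults. */
    #include <stdio.h>
    #include <sys/resource.h>

    void measure_interval(void (*checkpoint_and_run)(void))
    {
        struct rusage a, b;

        getrusage(RUSAGE_SELF, &a);
        checkpoint_and_run();           /* fork + work until the next checkpoint */
        getrusage(RUSAGE_SELF, &b);

        printf("user        : %ld us\n",
               (b.ru_utime.tv_sec - a.ru_utime.tv_sec) * 1000000L
             + (b.ru_utime.tv_usec - a.ru_utime.tv_usec));
        printf("system      : %ld us\n",
               (b.ru_stime.tv_sec - a.ru_stime.tv_sec) * 1000000L
             + (b.ru_stime.tv_usec - a.ru_stime.tv_usec));
        printf("minor faults: %ld\n", b.ru_minflt - a.ru_minflt);
    }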
Unfortunately, copy-on-write is also the reason why your checkpoint mechanism works in the first place. Can you get away with taking checkpoints at longer intervals? That's the only quick fix I can think of.
I suggest using oprofile to find out.
oprofile can profile the whole system, not just a single process.
You could also compare with what other checkpointing packages do, e.g. BLCR.
Forking is by nature expensive, since you're creating a copy of the existing process as an entirely new process. If speed is important to you, you should consider using threads.
Additionally, you say that the forked process sleeps until a 'rollback' is needed. I'm not sure what you mean by rollback, but provided it's something you can put in a function, you could just place it in a function and then create a thread that runs that function and exits when you detect the need for the rollback. As an added bonus, with that method you only create the thread if you need it.
I am a beginning C programmer (though not a beginning programmer) looking to dive into a project to teach myself C. My project is music-based, and because of this I am curious whether there are any 'best practices', per se, when it comes to timing functions.
Just to clarify, my project is pretty much an attempt to build some barebones music notation/composition software (remember, emphasis on barebones). I was originally thinking about using OS X as my platform, but I want to do it in C, not Objective-C (though I know it would probably be easier... Core Audio looked like a pretty powerful tool for this kind of stuff). So even though I don't have to build OS X apps in Objective-C, I will probably end up building this on a Linux system (probably Debian...).
Thanks everyone, for your great answers.
There are two accurate methods for timing functions:
Single process execution.
Timer event handler / callback
Single Process Execution
Most modern computers execute more than one program simultaneously. Actually, they execute pieces of many programs, swapping them in and out based on priorities and other metrics so that it looks like more than one program is executing at the same time. This overhead affects timing in programs: either the program is delayed in reading the time, or the OS is delayed in updating its own time variables.
The solution in this case is to stop as many other tasks as possible from running. The ideal environment for best accuracy is to have your program be the sole program running. Some OSes provide APIs that let superuser applications block or kill all other programs.
Timer event handling / callback
Since the OS can't be trusted to execute your program with high precision, most OSes provide timer APIs. Many of these APIs include the ability to call one of your functions when the timer expires; this is known as a callback function. Other OSes may send a message or generate an event when the timer expires; these fall under the class of timer handlers. The callback approach has less overhead than the handlers and is thus more accurate.
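As a concrete illustration of the callback approach on a POSIX system, here is a sketch using timer_create() with SIGEV_THREAD. The 500 ms period and the on_tick() body are placeholders, and older glibc versions need -lrt when linking:

    /* Sketch: a periodic timer that invokes a callback in its own thread. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static void on_tick(union sigval sv)
    {
        (void)sv;
        /* Advance the playback position, schedule the next notes, etc. */
        printf("tick\n");
    }

    int main(void)
    {
        timer_t timerid;
        struct sigevent sev;
        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_THREAD;
        sev.sigev_notify_function = on_tick;

        /* 120 BPM -> one tick every 500 ms. */
        struct itimerspec its = {
            .it_value    = { .tv_sec = 0, .tv_nsec = 500000000 },
            .it_interval = { .tv_sec = 0, .tv_nsec = 500000000 },
        };

        if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) != 0 ||
            timer_settime(timerid, 0, &its, NULL) != 0) {
            perror("timer setup");
            return 1;
        }

        sleep(5);                 /* let the callback fire a few times */
        timer_delete(timerid);
        return 0;
    }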
Music Hardware
Although you can have your program send music to the speakers itself, many computers now have separate processors that play music. This frees up the main processor and produces more continuous notes, rather than sounds separated by silent gaps caused by the overhead of your program sending the next sounds to the speaker.
A quality music processor has at least these two functions:
Start Playing
End Music Notification
Start Playing
This is the function where you tell the music processor where your data is and the size of the data. The processor will start playing the music.
End Music Notification
You provide the processor with a pointer to a function that it will call when the music data has been processed. Nice processors will call the function early so there will be no gaps in the sounds while reloading.
All of this is platform dependent and may not be standard across platforms.
Hope this helps.
This is quite a vast area, and, depending on exactly what you want to do, potentially very difficult.
You don't give much away by saying your project is "music based".
Is it a musical score typesetting program?
Is it processing audio?
Is it filtering MIDI data?
Is it sequencing MIDI data?
Is it generating audio from MIDI data?
Does it only perform playback?
Does it need to operate in a real time environment?
Your question though hints at real time operation, so in that case...
The general rule when working in a real time environment is don't do anything which may block the real time thread. This includes:
Calling free/malloc/calloc/etc (dynamic memory allocation/deallocation).
File I/O of any kind.
Waiting on spinlocks/semaphores/mutexes that other threads may hold.
Calls to GUI code.
Calls to printf.
Bearing these considerations in mind for a real time music application, you're going to have to learn how to do multi-threading in C and how to pass data from the UI/GUI thread to the real time thread WITHOUT breaking ANY of the above restrictions.
For an open source real time audio (and MIDI) (routing) server take a look at http://jackaudio.org
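One common way to get data into the real-time thread without breaking those rules is a pre-allocated single-producer/single-consumer ring buffer. Here is a sketch using C11 atomics; the midi_msg type and the buffer size are illustrative choices:

    /* Sketch: single-producer/single-consumer ring buffer. The GUI
     * thread pushes, the real-time thread pops; neither side locks,
     * blocks, or allocates. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256            /* must be a power of two */

    typedef struct { uint8_t status, data1, data2; } midi_msg;

    typedef struct {
        midi_msg        buf[RING_SIZE];
        _Atomic size_t  head;        /* written only by the producer */
        _Atomic size_t  tail;        /* written only by the consumer */
    } spsc_ring;

    /* Called from the GUI thread. Returns false if the ring is full. */
    bool ring_push(spsc_ring *r, midi_msg m)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;
        r->buf[head & (RING_SIZE - 1)] = m;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Called from the real-time thread. Returns false if the ring is empty. */
    bool ring_pop(spsc_ring *r, midi_msg *out)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return false;
        *out = r->buf[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }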
gettimeofday() is the best for wall clock time. getrusage() is the best for CPU time, although it may not be portable. clock() is more portable for CPU timing, but it may have integer overflow.
This is pretty system-dependent. What OS are you using?
You can take a look at gettimeofday() for fairly high granularity. It should work fine if you just need to read the time once in a while.
SIGALRM/setitimer can be used to receive an interrupt periodically. Additionally, some systems have higher level libraries for dealing with time.
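A small sketch of the SIGALRM/setitimer route (the 10 ms period is an arbitrary choice); the handler only sets a flag, so it stays async-signal-safe:

    /* Sketch: periodic SIGALRM via setitimer(); the real work happens
     * in the main loop, not in the signal handler. */
    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t ticked = 0;

    static void on_alarm(int sig)
    {
        (void)sig;
        ticked = 1;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);

        /* Fire every 10 ms, starting 10 ms from now. */
        struct itimerval it = {
            .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
            .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
        };
        setitimer(ITIMER_REAL, &it, NULL);

        for (;;) {
            pause();              /* sleep until a signal arrives */
            if (ticked) {
                ticked = 0;
                /* do the periodic work here */
            }
        }
    }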