How can I measure the time taken for a heap dump using JConsole or VisualVM?

I am trying to analyze the performance impact of taking a heap dump of my application, which uses close to 3 GB of heap memory. I want to understand whether I should enable heap dumps as a proactive monitoring measure for memory leaks, rather than as a last-ditch reactive one. Has anyone looked into anything like this before? If so, could you please help me out? Thanks in advance.

The JVM is stopped during heap dumping, so depending on the I/O throughput and CPU speed of your hardware this can take from several seconds to tens of seconds. If you want just the live objects, you also have to add the time for a full GC. You can try running jmap -dump:live,format=b,file=heap.bin <pid> from the command line to see how long it takes in your case.
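If you want to automate that measurement rather than eyeballing it, here is a minimal C sketch (POSIX, not tied to JConsole or VisualVM) that runs the same jmap command against a given PID and reports the wall-clock time of the dump. It assumes jmap is on the PATH, and it measures the end-to-end command time, which is an upper bound on the pause the application actually sees:

    #define _POSIX_C_SOURCE 200809L
    /* Sketch: time a heap dump by running jmap as a child process and
     * measuring wall-clock time around it. Pass the target JVM's PID. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <jvm-pid>\n", argv[0]);
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        pid_t child = fork();
        if (child == 0) {
            /* same command the answer above suggests */
            execlp("jmap", "jmap", "-dump:live,format=b,file=heap.bin",
                   argv[1], (char *)NULL);
            perror("execlp");
            _exit(127);
        }
        int status = 0;
        waitpid(child, &status, 0);

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("jmap exited with %d; dump took %.1f s\n",
               WIFEXITED(status) ? WEXITSTATUS(status) : -1, secs);
        return 0;
    }

Running this a few times against the application under a realistic load should give a feel for what a 3 GB dump costs in practice.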

Related

Understanding Status.JVM.Memory.Direct.MemoryUsed in Flink

I have a Flink job that kept crashing. I asked a question about debugging it in this post.
The issue was solved by increasing memory for the task managers. I then checked the memory-usage metrics for all the containers at the time the crash happened, and I saw that 2 of them had abnormal values for Status.JVM.Memory.Direct.MemoryUsed. I have a chart for that:
[chart: jvm.memory.direct.memory_used]
From the official Flink docs: "The biggest driver of Direct memory is by far the number of Flink's network buffers, which can be configured." However, in the task logs I didn't see anything about running out of network buffers. In order to prevent this from happening in the future, I would like to understand in detail what this portion of memory does in Flink and what could have happened to the 2 outlier containers in the image. Thank you.
First, I've also seen the behavior of TMs quitting without any logging of the problem when it's an OutOfMemoryError.
Second, my experience with direct memory issues is that it didn't run out due to network buffers, but rather because I was using code that called through to compiled C code (Fasttext, in my case) which was allocating direct memory... are you sure you don't have a similar situation? I'm asking because Flink is usually good about not over-allocating memory; typically you get a failure like "Not enough memory for network buffers".

How to get CPU instruction count for a thread?

I know that getrusage() can provide per-thread CPU utilization, but only the time spent on the CPU. Is there any way to get the number of executed CPU instructions? Or the number of cycles spent on the CPU?
Basically, I need a reproducible measure of how much work the thread does on the CPU. Any suggestions on how to do this in C?
UPDATE (to respond to comments):
Ideally I'd need this in a platform independent way, but Linux would be the most useful.
Reproducibility is the most important for me, even if that means the actual runtime may be slightly different.
I know VTune (and have used it), but I'd like to have this info programmatically while my code is running, so VTune is out, as are the suggestions made in the post linked by Craig Estey.
I did look at the Intel Intrinsics Guide, but did not find anything useful...
Take a look at Google's Filament engine; they are doing exactly that. Look at their profiler:
https://github.com/google/filament/blob/master/libs/utils/src/Profiler.cpp
You can also get more info from this talk:
https://www.youtube.com/watch?v=Lcq_fzet9Iw
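For the Linux-only case, the mechanism behind such profilers (Filament's included, as far as I can tell) is the perf_event_open(2) syscall, which can count retired instructions, or cycles via PERF_COUNT_HW_CPU_CYCLES, for the calling thread. A minimal sketch, assuming a Linux kernel with perf events enabled and kernel.perf_event_paranoid permitting user access:

    /* Sketch: count retired user-space instructions for the calling thread
     * on Linux with perf_event_open(2). Illustration only, not production code. */
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* or PERF_COUNT_HW_CPU_CYCLES */
        attr.disabled = 1;
        attr.exclude_kernel = 1;   /* count user-space instructions only */
        attr.exclude_hv = 1;

        /* pid = 0, cpu = -1: measure the calling thread, on whatever CPU it runs */
        int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the work you want to measure ... */
        volatile long sum = 0;
        for (long i = 0; i < 1000000; i++) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t instructions = 0;
        if (read(fd, &instructions, sizeof(instructions)) != sizeof(instructions)) {
            perror("read");
            close(fd);
            return 1;
        }
        printf("instructions retired: %llu\n", (unsigned long long)instructions);
        close(fd);
        return 0;
    }

Because it counts instructions rather than elapsed time, the result is largely independent of frequency scaling and scheduler noise, which is what makes it more reproducible than wall-clock or CPU time.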

First call to C function slower than subsequent calls [duplicate]

My C program, which sorts a file of integers, runs 10x slower the first time than on subsequent runs. Even if I change the numbers in the file, the program still runs fast after the first run. When I restart the PC, the very first run is again 10x slower. I use time to measure the runtime.
The operating system keeps the data in RAM even if it's not needed anymore (this is called "caching"), so when the program runs again, it gets all the data from there and there's no disk I/O. Even when you change the data, that change happens in RAM first, and it stays there even after it's written to the file.
It doesn't stay in RAM forever though, mind you. If the memory is needed for something else, the cache is deleted. At that point, a disk access is needed (and it's cached in RAM again at that point.)
This is why first access after a reboot is always slow; the data hasn't been cached yet since it was never read from the file.
You have to make hypotheses and confront them with reality. The first one you can reasonably make is that this smells a lot like a caching issue!
Ask yourself these questions:
Does my data fit in free RAM (i.e. is my file cached by the OS FS cache)?
Does my data fit in the CPU data cache?
Does my data fit in the HDD's internal cache?
The easiest hypothesis to rule out is the FS cache. Under Linux, just issue sync; echo 3 > /proc/sys/vm/drop_caches between each call to your program. The first command makes sure the cached data reaches the physical medium (hard drive); the second drops the contents of the filesystem cache from memory.
The 'physical medium' might be the HDD cache itself, so beware... Under Linux you can disable this write-back cache with the command hdparm -W 0 <device>; for instance, if you are working with drive sda, hdparm -W 0 /dev/sda will do the job. You might want to re-enable it after you are finished with your tests :)
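If you only want to evict your own input file rather than the whole system's cache (drop_caches needs root and affects everything), a possible alternative on Linux is posix_fadvise with POSIX_FADV_DONTNEED. A minimal sketch that takes the file name as an argument (note that the kernel treats the advice as a hint, so it is not guaranteed to drop every page):

    #define _POSIX_C_SOURCE 200809L
    /* Sketch: ask the kernel to drop the page-cache pages of one file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        fdatasync(fd);  /* flush any dirty pages first; dirty pages can't be dropped */

        /* offset 0, len 0 means "the whole file" */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

        close(fd);
        return 0;
    }

Run it on the input file between timing runs; unlike drop_caches it doesn't need root and leaves the rest of the cache alone.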
Another hypothesis is the CPU cache; have a look at "How can I do a CPU cache flush in x86 Windows?" and "How to clear CPU L1 and L2 cache".
Well, it may or may not be one of those, but it doesn't hurt to try :)
If your program does network access, then that could be the reason for the initial delay. Many network protocols need time to set things up. Some examples:
DNS: if your program does any network access, chances are it needs to resolve a hostname to an IP address. The first time, it needs at least one network round trip to populate a local cache. Subsequent requests will be faster.
Networked filesystems (NFS, CIFS and others): opening files can happen through the network.
Even some seemingly innocuous library functions can require network access: the users list for the host can be on a remote directory server.
Apart from this, you could use some low-level tracing tool to see where the time is spent. On Linux a basic tool is strace -r. There is probably a similar tool for other systems. Your compiler should also come with a profiler (e.g. gprof for GCC), or you could use Valgrind.
I had a very similar issue but I wasn't loading in a large file - so I was baffled at the long first execution time (caching couldn't have been the issue).
This answer pointed me in the right direction - it was my real-time anti-virus protection. Every time I recompiled the program it would re-scan it as being potentially malicious. I added my project path as an "Exception" to Avira's (in my case) real-time virus protection.
Program execution on the first run is now lightning quick!
This is nothing new; it's not just your program, many popular commercial software packages face this problem.
To start with, check this MATLAB article about slow first-time execution.
For other programming languages that run on a virtual machine, like C# or Java, this is quite common:
http://en.wikipedia.org/wiki/Just-in-time_compilation#Startup_delay_and_optimizations
Caching is a good reason for this to happen in C, but 10x is still quite a long duration. It might also be possible that your system was loading other resources after the restart.
For better results, run the program, say, 10 minutes after the restart, so that all the startup applications have been loaded by then (the 10 minutes depends on the number of startup applications and the time each takes to start).
This is because of compiler optimization: it caches the result for temporal locality, the activation record is saved, and time is also saved because the binding objects do not have to be reloaded during the linking stage.
There are two components to the measured time if you are reading a file from disk, loading it into memory, and then sorting:
1) Time to read the file and store it in an array
2) Time of the sorting itself
Were these measured separately? (A sketch of timing them separately follows below.)
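If not, here is a minimal sketch of timing the two phases independently with clock_gettime (POSIX); the file name numbers.txt and the simple fscanf/qsort loading and sorting code are hypothetical stand-ins for whatever the program actually does:

    #define _POSIX_C_SOURCE 200809L
    /* Sketch: time the I/O phase and the sort phase separately. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* phase 1: read the file into an array */
        size_t n = 0, cap = 1024;
        int *data = malloc(cap * sizeof *data);
        FILE *f = fopen("numbers.txt", "r");   /* hypothetical input file */
        if (!f || !data) { perror("setup"); return 1; }
        while (fscanf(f, "%d", &data[n]) == 1) {
            if (++n == cap) {
                data = realloc(data, (cap *= 2) * sizeof *data);
                if (!data) { perror("realloc"); return 1; }
            }
        }
        fclose(f);

        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* phase 2: sort */
        qsort(data, n, sizeof *data, cmp_int);

        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("read: %.3f s, sort: %.3f s\n",
               elapsed_s(t0, t1), elapsed_s(t1, t2));
        free(data);
        return 0;
    }

If the read phase dominates on the first run and nearly vanishes afterwards, that points straight at the page cache.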
Can you check this out?
Invalidating Linux Buffer Cache
Instead of doing a restart, repeat the experiment after clearing the cache: if it gives the same (fast) result, you can infer that file buffer caching effects were not the factor.

Getting CPU time in OS X

I have an Objective-C application for OS X that compares two SQLite DBs and produces a diff in JSON format. The DBs are quite large (10,000 items with many fields). Sometimes this application runs in about 55 sec (using 95% of the CPU). Sometimes it takes around 8 min (using 12% of the CPU). This is with the same DBs. When it is only using a small portion of the CPU, the rest is available. There does not appear to be anything taking priority over the process. Adding "nice -20" to the command seems to ensure I get the CPU usage. My questions are:
1. If nothing else is using the CPU, why does my app not take advantage of it?
2. Is there something I can do programmatically to change this?
3. Is there something I can do to OS X to change this?
Question 1:
Since, I assume, you have to read in the databases from disk, you aren't making full use of the CPU because your code is blocking on disk reads. On Mac OS X there is a lot of stuff running in the background that doesn't use a lot of CPU time but does send out a lot of disk reads, like Spotlight.
Question 2:
Probably not, other than make the most efficient use of disk access possible.
Question 3:
Shut down any other processes that are accessing the disk. This includes many system processes that you really shouldn't shut down, so I don't think there's much you can do here other than try running it on Darwin without all the Mac OS X fanciness.
It sounds like you're IO bound in the long cases. Are you doing anything else on the machine? The CPU isn't throttling itself - it's definitely waiting for something.
You can use some of the developer tools to look at your app while it's running - perhaps most useful would be "Instruments", which is a GUI on top of dtrace. You should have this installed if you're running the most recent Xcode. You can also use Shark, which is somewhat easier to use at first glance, but less informative in the long run.
Usually you get all the performance that's available. If the CPU is not at 100% there's something blocking it. In case of databases it's often locking. Use Shark to find out what's going on in your application.
When your program uses little CPU, it is probably because it is waiting for the disk, especially when other processes are accessing the disk at the same time. Another possibility is that your program uses too much memory and the OS begins to use swap space.
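A quick way to tell whether the slow runs are spent waiting rather than computing is to compare the process's CPU time with the wall-clock time around the expensive section. Here is a minimal sketch using getrusage and gettimeofday (both available on OS X); do_compare_databases() is a hypothetical placeholder for the actual diff code:

    /* Sketch: compare CPU time vs. wall-clock time around the expensive work.
     * If wall time >> CPU time, the process is mostly waiting (disk, locks, swap). */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double tv_seconds(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void do_compare_databases(void)
    {
        /* hypothetical placeholder for the sqlite diff work */
    }

    int main(void)
    {
        struct timeval wall0, wall1;
        struct rusage ru0, ru1;

        gettimeofday(&wall0, NULL);
        getrusage(RUSAGE_SELF, &ru0);

        do_compare_databases();

        getrusage(RUSAGE_SELF, &ru1);
        gettimeofday(&wall1, NULL);

        double wall = tv_seconds(wall1) - tv_seconds(wall0);
        double cpu  = (tv_seconds(ru1.ru_utime) + tv_seconds(ru1.ru_stime))
                    - (tv_seconds(ru0.ru_utime) + tv_seconds(ru0.ru_stime));

        printf("wall: %.2f s, cpu: %.2f s (%.0f%% CPU)\n",
               wall, cpu, wall > 0 ? 100.0 * cpu / wall : 0.0);
        return 0;
    }

If the wall time is around 8 minutes while the CPU time stays close to the 55 seconds of the fast runs, the process is blocked on I/O, locks, or swapping rather than being throttled.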
