Understanding Status.JVM.Memory.Direct.MemoryUsed in Flink

I have a Flink job that kept crashing. I asked a question about debugging it in this post.
The issue was solved by increasing the memory for the task managers. I then checked the memory-usage metrics for all the containers at the time of the crash, and I saw that 2 of them had abnormal values for Status.JVM.Memory.Direct.MemoryUsed. I have a chart for that:
jvm.memory.direct.memory_used.png
The official Flink documentation says: "The biggest driver of Direct memory is by far the number of Flink's network buffers, which can be configured." However, in the task logs I didn't see anything about insufficient network buffers. To prevent this from happening in the future, I would like to understand in detail what this portion of memory is used for in Flink and what could have happened to the 2 outlier containers in the image. Thank you.

First, I've also seen the behavior of TMs quitting without logging the problem when the cause is an OutOfMemoryError.
Second, my experience with direct memory issues is that it didn't run out due to network buffers, but rather because I was using code that called through to compiled C code (Fasttext, in my case) which was allocating direct memory...are you sure you don't have a similar situation? Asking because usually Flink is good about not over-allocating memory - typically you get a failure like "Not enough memory for network buffers".

Related

What If There Is A Memory Leak In Virtual Environment? [closed]

What I understood from reading some web articles is that, just as with any other program, the host OS allocates X amount of memory to the virtual OS, and when I start a program on the virtual OS, the virtual OS fetches exactly the amount of memory the program needs.
When I shut the virtual OS down, it returns the allocated memory to the host OS.
But what happens if there is a memory leak in the virtual OS environment? I am starting to learn C, and my professor says that in dynamic memory allocation operations, permanent leakage can happen in the host OS.
But what if it happens in a virtual environment? I guess the program will give back ALL of the allocated memory to the host OS when I shut it down, right? What happens when I start the virtual host again the next time? Does the memory leak show up there permanently?
I'm just getting scared before I even start writing my first program in C.
P.S. If I use websites like Repl.it and allocate memory there, can it still damage my system?

A memory leak occurs when you allocate some memory (with malloc in C) and never free it; this can happen for a number of reasons.
The important thing to understand is that this allocated memory is released once the process has finished running.
When you set up your VM, you set the maximum amount of memory it can consume. When you shut down your VM, that memory is released as well.
A program you wrote cannot cause a "permanent" memory leak once it is no longer running. If the OS has some always-running service with a memory leak, then it will slow down when it runs out of memory, but when you restart, all the memory is released again.
So don't let this stop you: you can't damage your computer, and you can always recover by exiting the program (or, in a worst-case scenario, by restarting the PC).
EDIT:
As was mentioned in the comments, there is a special scenario when you leak shared memory. In that case exiting the program might not release the memory, but I consider this the worst-case scenario, and a reboot solves this problem as well (so it is still not permanent).
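To make the malloc/free point above concrete, here is a minimal C sketch. The helper names copy_string and copy_and_measure are made up for illustration; the comment marks the one line whose omission would turn this into a leak.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL on
 * allocation failure. The caller owns the result and must free() it. */
char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;          /* +1 for the terminating '\0' */
    char *p = malloc(n);
    if (p == NULL)
        return NULL;
    memcpy(p, s, n);
    return p;
}

/* Leak-free use: every successful malloc() is paired with exactly one free(). */
size_t copy_and_measure(const char *s)
{
    char *copy = copy_string(s);
    if (copy == NULL)
        return 0;
    size_t len = strlen(copy);
    free(copy);   /* without this line, each call would leak the copy */
    return len;
}
```

Even if the free() call were missing, the operating system would reclaim the memory when the process exits, which is why such a leak does not outlive the program.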

This answer is meant to provide a different view point, in addition to the good answer(s) and comments, which I agree with.
I am trying to see the worst case, i.e. a way you could get what you fear.
Your environment probably does not match the following assumptions; in that case my construct of course does not apply.
- your virtual OS supports "persistence" (i.e. you can shut it down in a "hibernate" way, and it can start again with the same running processes and their restored memory content)
- your virtualisation engine also supports persistence of the virtual OS
- shutting down for persistence in the virtual OS is possible while a process occupies a critical amount of memory (sanity checks could prevent this)
- the virtualisation engine also does not mind the depleted memory and allows persistence
- you choose to use persistent shutdown (rebooting the virtual OS normally would kill the evil process and reclaim the memory; this is discussed in other answers and comments, but thanks to MrBlaise for proposing the clarification here)
In these circumstances I imagine that you can have:
- a process which has taken all available memory (and run out of it)
- but has not crashed or otherwise triggered emergency measures
- this situation is then saved for persistence before shutting down, successfully
- then you restart the virtual OS
- it restores the previous situation, i.e. returns from hibernation
- the restored situation contains a still (or again) running process which has taken all memory
I think this will still only affect the virtual OS, not the host.
Please note that I intentionally made all necessary assumptions just to get the situation you are afraid of. Some of the assumptions are quite "daring".
I imagine for example that anything supporting persistence should have sanity checks, which at least detect the memory issue and ask how to handle.
(By the way, I do not know of any virtualisation engines which support persistence, nor whether any do not support it. I am thinking in generic, theoretical terms.
In case I have invented persistence for virtualisation engines (can't believe it), I claim this as prior art. ;-))

How can I disable the CPU cache for certain memory regions?

I read on Wikipedia that disabling the CPU cache can improve performance:
Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed.
When I googled how to do it in C on Linux, however, I didn't find anything. It's not that I really need this feature, but I'm interested anyway.
And do you know of any projects which use this optimization?
Edit: I'm programming for x86_64

That comment about non-caching doesn't mean what you think it means, and where it is used, it isn't usually a user-accessible feature. That is, CPU cache control is typically a privileged operation.
That said...
-- A normal user program can be built with functions whose attributes are "hot" or "cold", so that the compiler can tell the linker to group the functions in ways that use the cache most effectively.
-- A normal program can use the madvise() function on Linux to tell the paging subsystem various things, including whether the memory just used is likely to be used again soon.
-- The kernel itself uses the Memory Type Range Registers (MTRR), and in later kernels the Page Attribute Table (PAT) flags, to tell the hardware that particular ranges of memory (such as the memory-mapped display buffer and the various parts of the PCI bus) are not to be cached.
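As a rough sketch of the madvise() route from the list above (assuming Linux with glibc), the hypothetical helper below maps an anonymous scratch region, fills it once, and then advises the kernel that the pages will not be needed again. Note that this is a hint to the paging subsystem, not CPU cache control.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: use a scratch buffer once, then advise the kernel
 * that its pages will not be needed again.
 * Returns 0 on success, -1 on error. */
int scratch_then_discard(size_t len)
{
    /* mmap() returns page-aligned memory, which madvise() requires. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);    /* stand-in for real one-off work */

    /* Advice only: the mapping stays valid, but the kernel may drop
     * the backing pages right away. */
    int rc = madvise(buf, len, MADV_DONTNEED);

    munmap(buf, len);
    return rc;
}
```

Calling scratch_then_discard(1 << 20) exercises a 1 MiB region. Other advice values such as MADV_SEQUENTIAL or MADV_WILLNEED follow the same calling pattern.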
"Normal Data™" such as you are likely to use in any C program will essentially never benefit from marking any of its data not cache-worthy. The performance improvement that not-cached data enjoys is the subsequent absence of the various cache-flush and memory barrier operations that memory mapped devices and display buffers would need almost constantly. Laying a cache over a memory mapped device, for example, would require a cache invalidate command before every read and a cache forced write command after every single write to make sure that the reads and writes happen at the exact moment needed. This would "poison" the cache usage, using up and instantly discarding cache lines (a physically limited resource) in a most unfriendly and unhelpful way.
In the rare case that you write a program that gains access to one of these cache harmful regions -- such as if you wrote part of the X display server on a linux system -- the kernel would have already set the registers for the device and the non-cache behavior would be transparent to you.
There is effectively no time where your normal application grade program is going to benefit from any ability to mark a variable as harmful to cache beyond the various madvise() type of usage.
Even then, the cases where you could gain any benefit are so rare that if you ever actually ran into one, the problem set would have included the need and methodology as part of your research, and you'd have been told how and why so explicitly that you'd never have needed to ask this question.
To go back to the same example again, if you'd been writing the necessary driver, when you'd been reading up on the display adapter hardware or the PCI bus the various flags and techniques would have been documented and discussed in the hardware guide.
There are ways to pull off cache eviction and the like from user space, with things like the CLFLUSH instruction on an Intel platform. These techniques will not improve general performance.
Since it's a privileged operation on a Linux system, you could write a kernel driver that acquired and marked a region of memory as uncacheable and then let you map it into your application. But the need for such a region is so rare, and so likely to be misused, that there isn't a normal methodology for doing it in place.
So how do you do it? You don't, at least not the you that you are today. When you become a kernel driver writer with an intimate specialty knowledge of multi-threaded code and data synchronization issues, you'll know how you could do it, and at that point you'll know why you don't want to except as a last resort.
TL;DR :: because of the way Linux uses and manages data and code, there is never a benefit to marking any part of a normal application as uncacheable that doesn't cause more heartbreak than it saves. As such, there is no unprivileged API for doing this.
P.S. Also, that said, someone already pointed to things that lead to this article http://lwn.net/Articles/255364/ which covers ways to make your program very cache friendly and some of the ways that you can do some cache bypass operations very cheaply. For instance, use of memset() tends to go around the cache while setting memory, and some operations can "stream past" the cache. This isn't the same thing as what you ask, but once you understand all of that article you'll have a much better understanding of why marking a region of memory as uncacheable is usually, as the Jedi say, not the solution you are looking for.

Recently I needed to experiment with uncached memory in a cache-heavy multi-threaded application.
I came up with this kernel module, which allows mapping uncached memory into userspace.
A user process requests uncached memory by calling mmap() on the module's character device (see the test directory for a demo).
"What every programmer should know about memory" is indeed a must-read!

Could someone please tell me how to measure the time taken for heap dumps using either JConsole or VisualVM?

I am trying to analyze the performance impact of taking a heap dump of my application, which uses close to 3 GB of heap memory. The goal is to decide whether I should enable heap dumps as a proactive measure for monitoring memory leaks rather than a last-ditch reactive one. Has anyone looked into something like this before? If so, could you please help me out? Thanks in advance.

The JVM is stopped during a heap dump, so depending on the I/O throughput and CPU speed of your hardware this can take from several seconds to tens of seconds. If you want only live objects, you also have to add the time for a full GC. You can try running jmap -dump:live,format=b,file=heap.bin <pid> from the command line to see how long it takes in your case.

Getting CPU time in OS X

I have an Objective-C application for OS X that compares two SQLite DBs and produces a diff in JSON format. The DBs are quite large (10,000 items with many fields). Sometimes this application runs in about 55 sec (using 95% of the CPU). Sometimes it takes around 8 min (using 12% of the CPU). This is with the same DBs. When it is only using a small portion of the CPU, the rest is available. There does not appear to be anything taking priority over the process. Adding "nice -20" to the command seems to ensure I get the CPU usage. My questions are:
1. If nothing else is using the CPU, why does my app not take advantage of it?
2. Is there something I can do programmatically to change this?
3. Is there something I can do to OS X to change this?

Question 1:
Since, I assume, you have to read in the databases from disk, you aren't making full use of the CPU because your code is blocking on disk reads. On Mac OS X there is a lot of stuff running in the background that doesn't use a lot of CPU time but does send out a lot of disk reads, like Spotlight.
Question 2:
Probably not, other than making the most efficient use of disk access possible.
Question 3:
Shut down any other processes that are accessing the disk. This includes many system processes that you really shouldn't shut down, so I don't think there's much you can do here other than try running it on Darwin without all the Mac OS X fanciness.

It sounds like you're IO bound in the long cases. Are you doing anything else on the machine? The CPU isn't throttling itself - it's definitely waiting for something.
You can use some of the developer tools to look at your app while it's running - perhaps most useful would be "Instruments", which is a GUI on top of dtrace. You should have this installed if you're running the most recent Xcode. You can also use Shark, which is somewhat easier to use at first glance, but less informative in the long run.

Usually you get all the performance that's available. If the CPU is not at 100% there's something blocking it. In case of databases it's often locking. Use Shark to find out what's going on in your application.

When your program uses little CPU, it is probably waiting for the disk, especially when other processes are accessing the disk at the same time. Another possibility is that your program uses too much memory and the OS begins to use swap space.
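One way to check the "waiting rather than computing" explanation from inside the program is to compare CPU time with wall-clock time. The sketch below uses getrusage() and gettimeofday(), both available on OS X and Linux; the helper names are made up for illustration.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* User + system CPU time consumed by this process so far, or -1.0 on error. */
double cpu_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}

/* Wall-clock time in seconds since the epoch, or -1.0 on error. */
double wall_seconds(void)
{
    struct timeval now;
    if (gettimeofday(&now, NULL) != 0)
        return -1.0;
    return tv_seconds(now);
}

/* Fraction of the elapsed wall time that was spent on the CPU between
 * two samples; 0.0 if no wall time has elapsed. */
double cpu_fraction(double cpu0, double wall0, double cpu1, double wall1)
{
    double elapsed = wall1 - wall0;
    return elapsed > 0.0 ? (cpu1 - cpu0) / elapsed : 0.0;
}
```

Sample cpu_seconds() and wall_seconds() before and after the comparison step and pass the four values to cpu_fraction(). For a single-threaded program, a result far below 1 means the process spent most of its time blocked (on disk, locks, or swap) rather than computing.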

Handling more than 1024 file descriptors, in C on Linux

I am working on a threaded network server using edge-triggered epoll, and I'm using httperf to benchmark it.
So far it's performing really well, handling requests at almost exactly the rate they are being sent, until the 1024 barrier, where everything slows down to around 30 requests/second.
Running on Ubuntu 9.04 64-bit.
I've already tried:
Increasing the ulimit number of file descriptors, successfully. It just doesn't improve the performance above 1024 concurrent connections.
andri@filefridge:~/Dropbox/School/Group 452/Code/server$ ulimit -n
20000
I am pretty sure that this slow-down is happening in the operating system as it happens before the event is sent to epoll (and yes, I've also increased the limit in epoll).
I need to benchmark how many concurrent connections my program can handle until it starts to slow down (without the operating system interfering).
How do I get my program to run with more than 1024 file descriptors?
This limit is probably there for a reason, but for benchmarking purposes, I need it gone.
Update
Thanks for all your answers but I think I've found the culprit. After redefining __FD_SETSIZE in my program everything started to move a lot faster. Of course ulimit also needs to be raised, but without __FD_SETSIZE my program never takes advantage of it.
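For context: FD_SETSIZE (1024 with glibc) is a compile-time constant that bounds the fd_set type used by select(). It is independent of ulimit and does not limit epoll itself, which suggests some code path in the program was still using fd_set. POSIX leaves the behavior undefined when a descriptor greater than or equal to FD_SETSIZE is placed in an fd_set. A minimal sketch with hypothetical helper names shows both limits side by side:

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Returns 1 if fd can be stored in an fd_set without overflowing it,
 * 0 otherwise. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Programmatic counterpart of raising `ulimit -n`: lift the soft limit on
 * open descriptors to the current hard limit.
 * Returns 0 on success, -1 on error. */
int raise_soft_fd_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* allowed without extra privileges */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Guarding every FD_SET() call with fits_in_fd_set(), or moving the remaining select() users to poll() or epoll, avoids relying on a redefined __FD_SETSIZE, which is a glibc-internal macro.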

Please see the C10K problem page. It contains an in-depth discussion on how to achieve the '10000 simultaneous connections' goal, while maintaining high-performance and managing to serve each client.
It also contains information on how to increase the performance of your kernel when handling a large number of connections at once.

Just don't.
Yes, I mean that.
If you need to increase the file descriptors, there's a hidden bug in your code. Hunt it down instead of treating its symptoms. Remember to close file descriptors when you're done.
