I am running word2phrase.c with a very large (45 GB) training set. My PC has 16 GB of physical RAM and 4 GB of swap. I left it training overnight (for the second time, to be honest) and came back in the morning to see it had been "killed" without further explanation. I sat and watched it die when my RAM ran out.
In my /etc/sysctl.conf I have set:
vm.oom-kill = 0
vm.overcommit_memory = 2
The actual source code does not appear to write the data out to a file, but rather keeps it all in memory, which is what creates the issue.
Is the OOM kill decision based on total memory (RAM + swap)? For example, if I increase my swap to 32 GB, will this stop happening?
Can I force this process to use swap instead of physical RAM, at the expense of slower performance?
Q: Is the OOM kill decision based on total memory (RAM + swap)?
Yes.
Q: For example, if I increase my swap to 32 GB, will this stop happening?
Yes, if RAM and swap combined (48 GB) are enough for the process.
Q: Can I force this process to use swap instead of physical RAM, at the expense of slower performance?
This is managed automatically by the operating system; all you have to do is increase the swap space.
To answer the first question, yes.
Second question:
Can I force this process to use swap instead of physical RAM?
Linux decides how the process runs and allocates memory for it as needed; once physical memory fills up, it starts moving pages out to swap on its own.
Increasing swap space may well work in this case. Then again, I do not know how well Linux copes with such a large swap; bear in mind this could decrease performance dramatically.
The best alternative is to split the 45 GB training set into smaller chunks.
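For reference, a minimal sketch of how to read the RAM and swap totals the kernel is accounting against, using sysinfo(2) on Linux (not something the answers above require, just a way to sanity-check the numbers):

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;
    if (sysinfo(&si) != 0) {
        perror("sysinfo");
        return 1;
    }
    unsigned long long unit = si.mem_unit;      /* bytes per reported unit */
    printf("RAM:        %llu MB\n", (unsigned long long)si.totalram  * unit >> 20);
    printf("Swap:       %llu MB\n", (unsigned long long)si.totalswap * unit >> 20);
    printf("RAM + swap: %llu MB\n",
           ((unsigned long long)si.totalram + si.totalswap) * unit >> 20);
    return 0;
}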
Related
I am trying to run a benchmark on Mac with OS version 10.9.5, 2 GHz Intel Core i7 processor, and 8 GB 1600 MHz DDR3 memory. The workload reads in sections of a large file that was mmapped into the address space. I am interested in how performance is affected if the amount of buffer cache available to my process is limited. My colleague and I were not aware of straightforward ways of doing this (i.e. a direct system call, etc), so we created a memory hog program that runs in the background and constantly touches every page allocated to it, to force the kernel to allocate a certain chunk of physical memory to the hogger. Our basic problem is that we found the benchmark to run significantly faster with the hogger running in the background, and we did not know how to interpret that result.
The basic gist of the memhog code is (there is a bit of error checking in our actual code):
char *mem = malloc(size);                          // Allocate size bytes
for (size_t i = 0; ; i = (i + PAGE_SIZE) % size) { // Infinite loop over all pages
    mem[i]++;                                      // Touch one byte per page to keep it resident
}
This program was compiled with gcc with no optimizations enabled. By adding print statements into the loop, we verified that every iteration of the loop (except the first one, due to all the page faults that must occur the first time we touch the memory) happens quickly, which suggests that about size bytes of physical memory is actually held by this process.
The oddness occurs when we tried running our benchmark while running the memory hog. Without running the memory hog, the benchmark runs in about 16 seconds. When we do run the memory hog in the background, however, the benchmark speeds up significantly! Namely, we found that when we are hogging 1 MB of memory (basically hogging almost nothing at all, but just having a useless process spin in the background), the benchmark finishes in about 13 seconds consistently. Hogging more memory (4 GB), it takes 14-15 seconds to finish, and hogging 6 GB, the benchmark takes roughly 16 seconds to finish. The benchmark itself takes about 4.3 GB of memory (as measured by getrusage by subtracting values for ru_maxrss before loading in all data and after loading in all data).
We found this to be very puzzling; while performance might not deteriorate with memhog in the background, it is odd that performance could improve by so much when a useless process is spinning in the background. Do you have any insights into this behavior? Thanks!
(We also have vm_stat data that we can post if it helps with the question; we didn't include it here as it is rather long.)
context
I am doing some experiments with memory caching and have read a lot of papers.
The problem is not how to write cache-friendly code within a single process; I have mostly got that down.
My main concern is : how will the cache behave when, say, hundreds of running processes will hit the L1 cache?
Since L1 size is scarce, should I understand that there will be a lot of cache eviction that will slow other processes since all the processes will fight for L1 cache?
Assume a CPU with 64-byte cache lines, a 64 KB L1 cache, and a 64-bit word size.
This is the point I don't understand.
Edit: the hundreds of processes are per core.
First off, you'll likely use a multicore CPU, which means you have far fewer processes per core. Modern OSes also try to keep processes somewhat associated with particular cores.
That said, you do indeed lose the L1 cache when your process is switched out, and it doesn't even make sense to hold on to it: your address 0x04000000 doesn't have the same content as the same address in another process, because they're virtual addresses.
I have a program that processes a large dataset consisting of a large number (300+) of sizable (40 MB+) memory-mapped files. All the files are needed together, though they are accessed sequentially. At the moment I am memory mapping the files and then using madvise with MADV_SEQUENTIAL, since I don't want the thing to be any more of a memory hog than it needs to be (without any madvise the consumption becomes a problem). The problem is that the program runs much slower (like 50x slower) than the disk I/O of the system would indicate it should, and it gets worse faster than linearly as the number of files involved increases. Processing 100 files is more than 10x faster than processing 300 files, despite that being only 3x the data. I suspect that the memory-mapped files are generating a page fault every time a 4 KB page boundary is crossed, with the net result that disk seek time exceeds disk transfer time.
Can anyone think of a better way than calling madvise with MADV_WILLNEED and MADV_DONTNEED every so often? And if this is the best way, any ideas as to how far to look ahead?
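One possible shape of the windowed read-ahead the question hints at is sketched below: as the sequential scan advances through each mapping, MADV_WILLNEED is issued a fixed distance ahead and MADV_DONTNEED behind. The helper name, the 16 MB window, and the single-mapping interface are all made up for illustration; the right look-ahead distance would have to be tuned against the actual disk.

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: keep a sliding window of one mapping resident.
   base/length describe the mmap'ed file, pos is the current read offset. */
static void advise_window(char *base, size_t length, size_t pos)
{
    const size_t window = 16 * 1024 * 1024;            /* illustrative 16 MB window */
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = (pos / page) * page;                /* madvise wants page-aligned addresses */
    size_t end   = pos + window < length ? pos + window : length;

    madvise(base + start, end - start, MADV_WILLNEED); /* start read-ahead for the next chunk */

    if (pos > window) {
        size_t drop = ((pos - window) / page) * page;  /* everything this far back is done */
        madvise(base, drop, MADV_DONTNEED);            /* let the kernel reclaim those pages */
    }
}

Calling something like this every few megabytes of progress, rather than on every page, keeps the madvise overhead negligible.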
I am confused about what is meant by virtual address space. On a 32-bit machine a process can address 2^32 memory locations. Does that mean the virtual address space of every process is 2^32 bytes (4 GB)?
The following is a snapshot of the virtual address space of a process. Can this grow up to 4 GB? Is there any limit on the number of processes in such a system?
Can this grow up to 4 GB?
The size of the address space is capped by the number of unique pointer values. For a 32-bit processor, a 32-bit value can represent 2 ^ 32 distinct values. If you allow each such value to address a different byte of memory, you get 2 ^ 32 bytes, which equals four gigabytes.
So, yes, the virtual address space of a process can theoretically grow to 4 GB. However, in reality this may also depend on the system and processor, as the following quoted passage shows:
This theoretical maximum cannot be achieved on the Pentium class of processors, however. One reason is that the lower bits of the segment value encode information about the type of selector. As a result, of the 65536 possible selector values, only 8191 of them are usable to access user-mode data. This drops you to 32TB.
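To spell out the arithmetic behind that figure:
8191 usable selectors × 4 GB per segment ≈ 32 TB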
Note that there are two ways to get memory from the system: you can allocate memory for your process implicitly using C's malloc (your question is tagged c), or you can explicitly map file bytes into your address space.
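As a rough sketch of those two routes (the file name "data.bin" is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* 1. Implicit: ask the allocator for anonymous memory. */
    char *buf = malloc(1 << 20);                  /* 1 MB from the heap */
    if (buf == NULL) { perror("malloc"); return 1; }

    /* 2. Explicit: map a file's bytes into the address space. */
    int fd = open("data.bin", O_RDONLY);          /* made-up example file */
    if (fd < 0) { perror("open"); free(buf); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    char *file = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (file == MAP_FAILED) { perror("mmap"); return 1; }

    printf("heap buffer at %p, file mapping at %p\n", (void *)buf, (void *)file);

    munmap(file, st.st_size);
    close(fd);
    free(buf);
    return 0;
}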
Is there any limit on the number of processes in such a system?
a process includes one or more threads that actually execute the code in the process (technically, processes don’t run, threads do) and that are represented with kernel thread objects.
According to some tests carried out here, a 32-bit Windows XP system with the default 2 GB user address space could create approximately 2025 threads.
However, a 32-bit test process running on 64-bit Windows XP, with a full 4 GB of address space available to it, created close to 3204 threads.
However, the exact thread and process limits are extremely variable; they depend on a lot of factors: the stack size the threads specify, the minimum working set the processes specify, the amount of physical memory available, and the system commit limit. In any case, you don't usually have to worry about this on modern systems; if your application really exceeds the thread limit you should rethink your design, as there are almost always alternative ways to accomplish the same goals with a reasonable number of threads.
Yes, the virtual address space of every process is 4 GB on 32-bit systems (2^32 bytes). In reality, the small amount of virtual memory that is actually used corresponds to locations in the processor cache(s), physical memory, or the disk (or wherever else the computer decides to put stuff).
Theoretically (and this behavior is pretty common among mainstream operating systems), a process could actually use all of its virtual memory if the OS decided to put everything it couldn't fit in physical memory onto the disk, but this would make the program extremely slow, because every time it tried to access a memory location that wasn't cached, it would have to go fetch it from the disk.
You asked if the picture you gave could grow up to 4GB. Actually, the picture you gave takes up all 4GB already. It is a way of partitioning a process's 4GB of virtual memory into different sections. Also if you're thinking of the heap and the stack "growing", they don't really grow; they have a set amount of memory allocated for them in that partitioning layout, and they just utilise that memory however they want to (a stack moves a pointer around, a heap maintains a data structure of used and non-used memory, etc).
Did you read Wikipedia's pages on virtual memory, processes, and address spaces?
Which book did you read on advanced Unix programming, or on advanced Linux programming?
Usually, the address space is the set of segments which are valid (not in blue in your figure).
See also the mmap(2) and execve(2) man pages.
Try (on a Linux system)
cat /proc/self/maps
and
cat /proc/$$/maps
to understand a bit more.
See also this question and this. Read Operating Systems: Three Easy Pieces
Of course, the kernel is able to set some limits (see the setrlimit(2) syscall), and there are resource constraints as well (swap space, RAM, ...).
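For instance, a process can inspect or lower its own address-space limit through getrlimit/setrlimit; a minimal sketch (the 1 GB value is purely illustrative):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0) {           /* current virtual address space limit */
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 1UL << 30;                        /* lower the soft limit to 1 GB (illustrative) */
    if (setrlimit(RLIMIT_AS, &rl) != 0)
        perror("setrlimit");
    return 0;
}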
Answering the neglected part...
There's a limit on how many processes there can be. All per-process data structures that the kernel keeps in its portion of the virtual address space (which is shared; otherwise you wouldn't be able to access the kernel from every process) take some space. So, for example, if there's 1 GB available for this data and only a 4 KB page is needed per process in the kernel, then 1 GB / 4 KB = 262,144, and you arrive at roughly a quarter of a million processes at most. In practice, this number is usually much smaller, because things are more complex and there's physical memory reserved for various things for every process. See, for example, Mark Russinovich's article on process and thread limits in Windows for more details.
How much data can be malloc'd, and how is the limit determined? I am writing an algorithm in C that repeatedly uses some data stored in arrays. My idea is to keep this data in dynamically allocated arrays, but I am not sure whether it's possible to malloc such amounts.
I use 200 arrays of 2046 entries, each entry holding complex data of 8 bytes. I use these throughout the program, so I do not wish to recalculate them over and over.
What are your thoughts about feasibility of such an approach?
Thanks
Mir
How much memory malloc() can allocate depends on:
How much memory your program can address directly
How much physical memory is available
How much swap space is available
On a modern, flat-memory-model 32-bit system, your program can address 4 gigabytes, but some of the address space (usually 2 gigabytes, sometimes 1 gigabyte) is reserved for the kernel. So, as a rule of thumb, you should be able to allocate almost two gigabytes at once, assuming you have the physical memory and swap space to back it up.
On a 64-bit system, running a 64-bit OS and a 64-bit program, your addressable memory is essentially unlimited.
200 arrays of 2046 eight-byte entries is only about 3 MB, which fits easily in RAM and largely in cache (even on a smartphone).
A 32-bit OS has a limit of 4 GB of address space; typically some of it (up to half on Win32) is reserved for the operating system, for mapping the address space of graphics card memory, etc.
32-bit Linux supports up to 64 GB of physical memory (using Intel's 36-bit PAE).
EDIT: although each process is still limited to a 4 GB address space.
The main problem with allocating large amounts of memory is if you need it locked in RAM, in which case you obviously need a lot of RAM, or if you need it all to be contiguous: it's much easier to get four 1 GB chunks of memory than a single 4 GB chunk with nothing else in the way.
A common approach is to allocate all the memory you need at the start of the program, so that if the allocation isn't going to be possible the app fails instantly rather than after it's done 90% of the work.
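A sketch of that fail-fast pattern, using the array sizes from the question (the float complex element type is an assumption about what the 8-byte entries are):

#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ARRAYS 200
#define ENTRIES    2046                       /* sizes taken from the question above */

int main(void)
{
    float complex *arrays[NUM_ARRAYS];        /* float complex is typically 8 bytes */

    /* Allocate everything up front; if any allocation fails, stop
       immediately instead of finding out mid-computation. */
    for (int i = 0; i < NUM_ARRAYS; i++) {
        arrays[i] = malloc(ENTRIES * sizeof *arrays[i]);
        if (arrays[i] == NULL) {
            fprintf(stderr, "allocation %d of %d failed\n", i, NUM_ARRAYS);
            return 1;
        }
    }

    /* ... do the actual computation with the arrays here ... */

    for (int i = 0; i < NUM_ARRAYS; i++)
        free(arrays[i]);
    return 0;
}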
Don't run other memory intensive apps at the same time.
There are also a bunch of flags you can use to suggest to the kernel that this app should get priority in memory, or to keep memory locked in RAM; sorry, it's been too long since I did HPC on Linux and I'm probably out of date with modern kernels.
I think that on most modern (64-bit) systems you can allocate 4 GB at a time with a single malloc(size_t) call if that much memory is available. How big is each of those 'complex data' entries? If they are 256 bytes each, then you'll only need to allocate 100 MB:
256 bytes × 200 arrays × 2048 entries = 104,857,600 bytes
104,857,600 bytes / 1024 / 1024 = 100 MB
So even at 4096 bytes each, that's still only 1600 MB, or roughly 1.6 GB, so it is feasible on most systems today; my four-year-old laptop has 3 GB of internal memory. Sometimes I do image manipulation with GIMP and it takes up over 2 GB of memory.
With some implementations of malloc(), the regions are not actually backed by memory until they really get used, so in theory you can carry on forever (though in practice the list of allocated regions assigned to your process in the kernel takes up space, so you might find you can only call malloc() a few million times even if it never actually gives you any memory). This is called "optimistic allocation" and is the strategy used by Linux (which is why it then has the OOM killer, for when it has been over-optimistic).
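A small way to observe this behaviour on a 64-bit Linux box (the outcome depends on how much RAM and swap you have and on the vm.overcommit_memory setting, so treat it as an experiment rather than a guarantee):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t size = (size_t)32 << 30;          /* ask for 32 GB, probably more than RAM + swap */
    char *p = malloc(size);

    if (p == NULL) {
        puts("malloc refused the request up front");
        return 0;
    }
    puts("malloc succeeded; the pages are not backed until they are touched");

    /* Touching every page would force the kernel to find real memory and,
       under memory pressure, could eventually wake the OOM killer:
       for (size_t i = 0; i < size; i += 4096) p[i] = 1;   (deliberately not run) */

    free(p);
    return 0;
}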