Is an Object the smallest pageable unit in the Heap? - heap-memory

If I have 2 GB of RAM and two instances of an object that is 1.5 GB each, the operating system will help by paging parts of them to and from the hard disk.
What if I have one instance that is 3 GB? Can the same paging mechanism break this instance into multiple pages, or will I encounter an out-of-memory issue?
I would also like to apply the same question to other data structures besides objects: will paging treat them as a whole, or will it break them into smaller units?
Thanks.

The operating system has no concept of "objects", only memory pages. Your object will be made up of many memory pages, which the OS can swap in and out of real memory independently of each other. Page size varies between operating systems, but is typically 4 KB.
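If you want to see which page size your own system actually uses, a minimal C sketch (my own illustration, not part of the original answer) is to ask the kernel via sysconf():

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Ask the kernel for the page size it actually uses. */
        long page_size = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", page_size);
        return 0;
    }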

Related

Window control for mmapped large file (linux, mmap)

How can we control the window in RSS when mapping a large file? Let me explain what I mean.
For example, suppose we have a file several times larger than RAM and we mmap it as shared memory into several processes. If we access an object whose virtual address lies in this mapped region, we take a page fault and the page is read from disk. The sub-question is: will the opposite happen once we no longer use the given object? If this works like an LRU, what is the size of that LRU and how can we control it? How is the page cache involved in this case?
RSS graph
This is the RSS graph on a test instance (2 threads, 8 GB RAM) for an 80 GB tar file. Where does this value of 3800 MB come from, and why does it stay stable when I run through the file after it has been mapped? How can I control it (or advise the kernel to control it)?
As long as you're not taking explicit action to lock the pages in memory, they should eventually be swapped back out automatically. The kernel basically uses a memory pressure heuristic to decide how much of physical memory to devote to swapped-in pages, and frequently rebalances as needed.
If you want to take a more active role in controlling this process, have a look at the madvise() system call.
This allows you to tweak the paging algorithm for your mmap, with actions like:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. ...
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure.
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) ...
Issuing an madvise(MADV_SEQUENTIAL) after creating the mmap might be sufficient to get acceptable behavior. If not, you could also intersperse some MADV_WILLNEED/MADV_DONTNEED access hints (and/or MADV_FREE/MADV_COLD) during the traversal as you pass groups of pages.
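As a rough illustration (my own sketch, not the questioner's code; the 64 MiB window and the file path are arbitrary choices), this is how mapping the file once, declaring sequential access, and discarding pages behind the traversal could look in C:

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only and shared. */
        unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint: we will read front to back, so read ahead aggressively and
           feel free to drop pages soon after they have been used. */
        madvise(base, st.st_size, MADV_SEQUENTIAL);

        const size_t window = 64 * 1024 * 1024;  /* arbitrary 64 MiB window */
        unsigned long long sum = 0;
        for (off_t off = 0; off < st.st_size; off += window) {
            size_t len = (st.st_size - off) < (off_t)window
                             ? (size_t)(st.st_size - off) : window;
            for (size_t i = 0; i < len; i++)
                sum += base[off + i];
            /* Done with this window: tell the kernel it can reclaim it now. */
            madvise(base + off, len, MADV_DONTNEED);
        }
        printf("checksum: %llu\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

Whether the per-window MADV_DONTNEED hints are worth it depends on how much other memory pressure the system is under; MADV_SEQUENTIAL alone is often enough to keep RSS bounded.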

Is it okay to use mmap() for 4 KB blocks all over the place or is it better to mmap() my whole file in one go?

I want to work on a file which is composed of 4 KB blocks.
As things progress, I will write more data, map new parts, and unmap parts that I no longer need.
Is an mmap() of just 4 KB too small when the total amount of file data to map is around 4 GB (i.e. some 1,048,576 individually mapped blocks)?
I'm worried that making so many small mmap() calls will not be efficient in the end, even if they are very well targeted at the exact blocks I want to use. At the same time, it may still be better than reading and writing these blocks with read()/write() each time I change one byte.
As far as I understand it, even a single mmap() that covers several contiguous 4 KB pages will require the kernel (and the TLB, MMU...) to deal with as many virtual/physical associations as there are pages (this is the purpose of memory pages: contiguous virtual pages can be mapped to non-contiguous physical pages).
So, considering the usage of these mapped pages once they are set up, whether by a single or by many mmap() calls, there should not be any difference in performance.
But each single call to mmap() requires some overhead in order to choose the part of the virtual address space to use; a single mmap() call only has to choose a big enough virtual region once (which should not be too difficult on a 64-bit system, as stated in other answers), but repeated calls incur this overhead many times.
So, if I had to deal with this situation on a 64-bit system, I would mmap() the entire file at once, using huge-pages in order to reduce the pressure on TLB.
Note that mapping the entire file at once does not imply using the same amount of physical memory right at this moment; virtual/physical memory association will only occur for each single page when it is accessed for the first time.
There is no shortage of address space on 64-bit architectures. Unless your code has to work on 32-bit architectures too (rare these days), map the whole file once and avoid the overhead of multiple mmap calls and thousands of extra kernel objects. As for reading and writing changes, it depends on your desired semantics. See this answer.
On 64-bit systems you should pretty much map the entire file or at least the entire range in one go and let the operating system handle the paging in and out for you. The mmap calls do have some overhead themselves. In practice the user address space on x86-64 is something like 128 TiB so you should be able to map say 1 TiB files/ranges without any problems.
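For illustration, here is a minimal sketch of the "map it all once" approach (my own code; the block number 42 is a placeholder). The single mmap() only reserves virtual address space; physical pages are faulted in when a block is first touched:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096u

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* One mapping for the whole file. */
        unsigned char *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        size_t nblocks = (size_t)st.st_size / BLOCK_SIZE;

        /* Touch an arbitrary block (42 is a placeholder): no per-block mmap()
           call is needed, and the kernel writes the change back for us. */
        if (nblocks > 42)
            base[42 * (size_t)BLOCK_SIZE] ^= 1;

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }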

Size of page tables and available physical addresses

Let's assume we have 4 GiB of RAM, a page size of 4 KiB, and 32-bit addresses.
After my calculations, I got:
we can only address a maximum of 2^32 addresses;
each page table has a total of 2^20 entries;
in total we have 4 GiB / 4 KiB = 1,048,576 pages.
But what I can't understand is this: if a page table has 2^20 entries, we have already covered all the possible addresses with a single page table.
How is that possible if each process has its own page table? It should then be possible for the same physical address to appear in more than one page table, which could cause severe problems, or am I missing something?
Many thanks in advance for your help.
Each process can in theory map all of memory, but in practice, most of the pages in a process's address space are not mapped, leaving plenty of memory for other processes.
Furthermore, it doesn't necessarily cause a problem to have the same pages mapped into two different address spaces. It is done for shared libraries, kernel pages, and memory that is shared between processes for interprocess communication.
(Kernel pages may be mapped into every process so that the kernel can access its own pages during a system call from any process. These pages are protected so that they cannot be accessed by the application code.)
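To make the "shared between processes" case concrete, here is a small C sketch of my own: after fork(), the parent and the child each have their own page table, but both tables map the same physical page, so a write in the child is visible to the parent.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* One shared anonymous page; after fork() it appears in both
           processes' page tables, backed by the same physical frame. */
        char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) { perror("mmap"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            /* Child writes through its own page-table entry... */
            strcpy(page, "hello from the child");
            return 0;
        }

        waitpid(pid, NULL, 0);
        /* ...and the parent reads the same physical page through its entry. */
        printf("parent sees: %s\n", page);

        munmap(page, 4096);
        return 0;
    }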

Memory Layout of a C program (Physical vs Logical view)

As per my understanding, the logical view of the C program is divided into many segments such as
Code
Data
BSS
Heap
Stack (typical implementation: Heap and Stack growing in opposite directions).
How are these segments arranged in physical memory?
As per my understanding, physical memory uses frames of fixed size to store the pages of the process.
If that is the case, then how is this actually consistent with the user view? For example, the stack and heap areas might be distributed among many pages, and pages might be scattered throughout memory.
The virtual memory system maps "virtual" addresses onto physical addresses so that user code never knows or cares about where the memory it's using is located in physical memory. This is typically done using a hardware memory management unit (MMU), but it could also be done by the operating system without the MMU. As far as user code is concerned, there's just one nice big address space that's always available.
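A tiny C sketch (my own illustration) that prints where the different segments land in the process's virtual address space; none of these numbers say anything about which physical frames back them:

    #include <stdio.h>
    #include <stdlib.h>

    int initialized_global = 42;   /* data segment */
    int uninitialized_global;      /* BSS segment  */

    int main(void) {
        int local = 0;                       /* stack */
        int *dynamic = malloc(sizeof(int));  /* heap  */

        /* Every printed value is a virtual address; the MMU maps the pages
           behind them to whatever physical frames the OS chose. */
        printf("code  (main):                 %p\n", (void *)main);
        printf("data  (initialized_global):   %p\n", (void *)&initialized_global);
        printf("bss   (uninitialized_global): %p\n", (void *)&uninitialized_global);
        printf("heap  (malloc'd int):         %p\n", (void *)dynamic);
        printf("stack (local variable):       %p\n", (void *)&local);

        free(dynamic);
        return 0;
    }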

Do we need to take cache thrashing into account with CUDA?

I'm not familiar with the workings of GPU memory caching, so would like to know if the assumptions of temporal and spatial proximity of memory access associated with CPUs also applies with GPUs. That is, programming in CUDA C, do I need to take into account C's row-major array storage format to prevent cache thrashing?
Many thanks.
Yes, very much.
Say each thread is fetching a 4-byte integer.
Scenario one
Each thread fetches one integer at the index of its thread id. That means thread 0 fetches a[0], thread 1 fetches a[1], etc. The GPU fetches memory in cache lines of 128 bytes. Conveniently, a warp is 32 threads, so 32 * 4 = 128 bytes. This means that one warp makes only one fetch request to memory.
Scenario two
If the threads fetch in totally random order, with distances between the indexes greater than 128 bytes, the warp has to make 32 memory requests of 128 bytes each. This means that for each warp you fill the cache with 32 times more data, and if your problem is big, your cache will be evicted up to 32 times more often than in scenario one.
This means that a request which would normally be served from the cache in scenario one will, in scenario two, very likely have to be resolved with another memory request to global memory.
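To tie this back to C's row-major layout, here is a hedged CUDA C sketch (the kernel names are my own) of the two access patterns for a matrix stored as a[r * cols + c]:

    /* Row-major matrix, flattened as a[r * cols + c]. */

    /* Scenario one (coalesced): neighbouring threads read neighbouring
       addresses within a row, so a warp of 32 threads reading 4-byte floats
       is served by a single 128-byte line per loop iteration. */
    __global__ void column_sums_coalesced(const float *a, float *out,
                                          int rows, int cols) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;  /* one column per thread */
        if (c >= cols) return;
        float s = 0.0f;
        for (int r = 0; r < rows; ++r)
            s += a[r * cols + c];
        out[c] = s;
    }

    /* Scenario two (strided): neighbouring threads work on neighbouring rows,
       so in each loop iteration their addresses are cols * 4 bytes apart and
       the warp can touch up to 32 different 128-byte lines. */
    __global__ void row_sums_strided(const float *a, float *out,
                                     int rows, int cols) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;  /* one row per thread */
        if (r >= rows) return;
        float s = 0.0f;
        for (int c = 0; c < cols; ++c)
            s += a[r * cols + c];
        out[r] = s;
    }

If the strided pattern is the one you actually need, a common fix is to transpose the data or to stage tiles through shared memory rather than accept the uncoalesced accesses.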
No and yes. No, because the GPU does not provide the same kind of cache as the CPU.
But there are many other constraints that make the underlying C array layout, and how it is accessed by concurrent threads, very important for performance.
You may have a look at this page for basics about CUDA memory types, or here for more in-depth details about the cache on Fermi GPUs.
