Virtual memory management in Fortran under Mac OS X - arrays

I'm writing a Fortran 90 program (compiled using gfortran) to run under Mac OS X. I have 13 data arrays, each comprising about 0.6 GB of data. My machine is maxed out at 8 GB of real memory, and if I try to hold all 13 arrays in memory at once, I'm basically trying to use all 8 GB, which I know isn't possible in view of other system demands. So I know that the arrays would be subject to swapping. What I DON'T know is how this is managed by the operating system. In particular,
Does the OS swap out entire data structures (e.g., arrays) when it needs to make room for other data structures, or does it rather do it on a page-by-page basis? That is, does it swap out partial arrays, based on which portions of the array have been least-recently accessed?
The answer may determine how I organize the arrays. If partial arrays can get swapped out, then I could store everything in one giant array (with indexing to select which of the 13 subarrays I need) and trust the OS to manage everything efficiently. Otherwise, I might preserve separate and distinct arrays, each one individually fitting comfortably within the available physical memory.

Operating systems are not typically made aware of structures (like arrays) in user memory. Most operating systems I'm aware of, including Mac OS X, swap out memory on a page-by-page basis.

Although the process is often wrongly called swapping, on x86 as well as on many other modern architectures the OS performs paging to what is still called the swap device (mostly for historical reasons). The virtual memory space of each process is divided into pages, and a special table, called the process page table, holds the mapping between pages in virtual memory and frames in physical memory. Each page can be mapped or not mapped, and a mapped page can further be present or not present. Access to an unmapped page results in a segmentation fault; access to a non-present page results in a page fault, which is then handled by the OS - it takes the page from the swap device and installs it into a frame in physical memory (if one is available).

The standard page size is 4 KiB on x86 and on almost any other widespread architecture nowadays. Also, modern MMUs (Memory Management Units, often an integral part of the CPU) support huge pages (e.g. 2 MiB) that can be used to reduce the number of entries in the page tables and thus leave more memory for user processes.
So paging is really fine-grained in comparison with your data structures, and one often has little or no control over how the OS does it. Still, most Unices let you give instructions and hints to the memory manager through the C API available in the <sys/mman.h> header file. There are functions that allow you to lock a certain portion of memory and prevent the OS from paging it out to disk, and functions that allow you to hint to the OS that a certain memory access pattern is to be expected so that it can optimise the way it moves pages in and out. You may combine these with carefully designed data structures in order to gain some control over paging and get the best performance out of a given OS.
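For illustration, here is a minimal C sketch of two such calls, mlock(2) and madvise(2), assuming a POSIX system (Mac OS X included); a Fortran program could reach these through ISO_C_BINDING, which is not shown:

/* Allocate two big arrays page-aligned with mmap, lock the "hot" one in
 * RAM, and hint that the other one will be scanned sequentially. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t bytes = (size_t)75 * 1024 * 1024 * 8;   /* ~0.6 GB of doubles */

    double *hot = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
    double *scan = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANON, -1, 0);
    if (hot == MAP_FAILED || scan == MAP_FAILED) return 1;

    if (mlock(hot, bytes) != 0)       /* keep resident; may hit RLIMIT_MEMLOCK */
        perror("mlock");
    if (madvise(scan, bytes, MADV_SEQUENTIAL) != 0)   /* read-ahead hint */
        perror("madvise");

    /* ... number crunching on hot[] and scan[] ... */

    munlock(hot, bytes);
    munmap(hot, bytes);
    munmap(scan, bytes);
    return 0;
}

Locking 0.6 GB will only succeed if the process' RLIMIT_MEMLOCK resource limit allows it, so treat mlock as a request, not a guarantee.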

Related

How to share an existing dynamic array within Linux POSIX model in c language?

I have a very big, quickly growing (realloc, tcmalloc) dynamic array (about 2-4 billion doubles). After the growth ends I would like to share this array between two different applications. I know how to prepare a shared memory region and copy my full-grown array into it, but this is too wasteful of memory, because I would have to keep the source array and the shared destination array at the same moment. Is it possible to share an already existing dynamic array within the POSIX model without copying?
EDITED:
A little explanation.
I am able to use memory allocation within the POSIX model (shm_open() and others), but then I have to reallocate the already shared memory segment many times (reading rows one by one from the database into memory). That is much more overhead in comparison with a simple realloc().
I have one producer, who reads from database and writes into shared memory.
I can't know beforehand how many records are present in the database, and therefore I can't know the size of the shared array before the allocation. For this reason I have to reallocate the big array while the producer is reading row by row from the database. After the memory is shared and filled, other applications read data from the shared array. Sometimes the size of this big shared array changes as it is replenished with new data.
Is it possible to share an already existing dynamic array within the POSIX model without copying?
No, that is not how shared memory works. Read shm_overview(7) & mmap(2).
Copying two billion doubles might take a few seconds.
Perhaps you could use mremap(2).
BTW, for POSIX shared memory, most computers limit the size of the segment shared with shm_open(3) to a few megabytes (not gigabytes). Heuristically, the maximal shared size (on the whole computer) should be much less than half of the available RAM.
My feeling is that your design is inadequate and you should not use shared memory in your case. You did not explain what problem you are trying to solve, how the data is modified
(did you consider using some RDBMS?), or what the synchronization issues are.
Your question smells a lot like some XY problem, so you really should explain more, motivate it much more, and also give a broader, higher-level picture.
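For what it's worth, here is a minimal C sketch of the grow-in-place pattern the question is circling around: extend a POSIX shared memory object with ftruncate(2) and remap the view, so the payload itself is never copied. It assumes a system such as Linux that allows growing the object after creation (Mac OS X, for one, only lets you size it once), and the name "/bigarray" is made up:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    size_t size = 1u << 20;                     /* initial 1 MiB */
    int fd = shm_open("/bigarray", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("shm"); return 1; }

    double *a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }
    a[0] = 42.0;

    /* The "realloc" step: extend the object, then remap the view.
     * munmap+mmap is portable; mremap(2) would do it in one call on Linux. */
    size_t newsize = size * 2;
    if (ftruncate(fd, (off_t)newsize) < 0) { perror("grow"); return 1; }
    munmap(a, size);
    a = mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) { perror("remap"); return 1; }

    printf("%g\n", a[0]);                       /* prints 42: data survived */

    munmap(a, newsize);
    close(fd);
    shm_unlink("/bigarray");
    return 0;
}

A consumer process simply shm_open()s the same name and mmap()s the current size; agreeing on the current size between producer and consumers is exactly the synchronization problem the answer above warns about.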

Where exactly are the variables stored in a C program?

I am new to computer programming. I was studying variables and came across a definition on the internet:
Variables are the names you give to computer memory locations which are used to store values in a computer program.
What are these memory locations? Do these locations refer to the actual computer memory, or is this just a store inside the program itself from which it retrieves those variables later when we need them?
There are also other terms that I encountered here on Stack Overflow, like heap and stack, that I could not get my head around. Please help.
The way you've asked the question suggests you expect a single answer. That is simply not the case.
In a rough sense, all variables will exist in memory while your program is being executed. Which memory your variables exist in depends on several things.
Modern computer hardware often has quite a complex physical memory architecture - with multiple levels of cache (in both the CPU, and various peripheral devices), a number of CPU registers, shared memory, different types of RAM, storage devices, EEPROMs, etc. Different systems have these types of memory - and more types - in different proportions.
Operating systems may make memory available to your program in different ways. For example, it may provide virtual memory, using a combination of RAM and reserved hard drive space (and managing mappings, so your program can't tell the difference). This can allow your program to use more memory than is physically available as RAM, but also affects performance, since the operating system must swap memory usage of your program between RAM and the hard drive (which is typically orders of magnitude slower).
A lot of compilers and libraries are implemented to maximise your program's performance (by various measures) - compiler optimisation of your code (which can cause some variables in your code to not even exist when your program is run), library functions crafted for performance, etc. One consequence of this is that the compiler, or library, may use memory in different ways (e.g. some implementations may embed code in your executable to detect memory resources available when the program is run, while others simply assume a fixed amount of RAM), and the usage may even vary over time.
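As a concrete (if simplified) illustration, here is a small C program placing variables in the storage classes usually behind those terms; the printed addresses will differ from run to run and platform to platform:

#include <stdio.h>
#include <stdlib.h>

int global_counter;                 /* static storage: data/BSS segment */

int main(void)
{
    int local = 5;                  /* automatic storage: typically the stack */
    static int calls;               /* static storage despite block scope */
    double *buf = malloc(100 * sizeof *buf);   /* allocated storage: the heap */
    if (buf == NULL) return 1;

    printf("global %p, local %p, static %p, heap %p\n",
           (void *)&global_counter, (void *)&local,
           (void *)&calls, (void *)buf);

    free(buf);
    return 0;
}

Even this is only the abstract picture: as the answer says, optimisation may keep `local` entirely in a register, and the operating system decides which of those addresses are actually resident in RAM at any moment.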

How can a process with a 32-bit address space access large amounts of memory efficiently? [duplicate]

We have a process with a 32-bit address space that needs to access more memory than it can address directly. Most of the source code for this process cannot be changed. We can change the modules that are used to manage access to the data. The interfaces to those modules may include 64-bit pieces of data that identify the memory to be accessed.
We currently have an implementation in which the interface modules use interprocess communication with a 64-bit process to transfer data to and from the address space of that 64-bit process.
Is there a better way?
Very few platforms support mixing 32-bit and 64-bit code. If you need more than 2 or 3 GB of address space, your options are:
Recompile the whole application as 64-bit, or
Use memory-mapped files to page in and out large chunks of data.
Recompiling is easy. Accessing more than 2 or 3 GB of memory in a 32-bit program is hard.
Note that recompiling a 32-bit application as a 64-bit application requires no changes to your code or functionality, barring a few bugs that might turn up if your code has unportable constructs. Things like:
size_t round_to_16(size_t x)
{
    return x & ~15u; // BUG in 64-bit code: the 32-bit mask ~15u zero-extends
                     // and clears the upper 32 bits of x; should be ~(size_t) 15
}
As stated in various comments, the situation is:
There is a 32-bit process of which a small portion can be altered. The rest is essentially pre-compiled and cannot be changed.
The small portion currently communicates with a 64-bit process to transfer selected data between a 64-bit address space and the address space of the 32-bit process.
You seek alternatives, presumably with higher performance.
Interprocess communication is generally fast. There may not be a faster method unless:
your system has specialized hardware for accelerating memory transfers, or
your system has means of remapping memory (more below).
Unix has calls such as shmat and mmap that allow processes to attach to shared memory segments and to map portions of their address spaces to offsets within shared memory segments. It is possible that calls such as these can support mapping portions of a 32-bit address space into large shared memory segments that exist in a large physical address space.
For example, the call mmap takes a void * parameter for the address to map in the process’ address space and an off_t parameter for the offset into a shared memory segment. Conceivably, the off_t type may be a 64-bit type even though the void * is only a 32-bit pointer. I have not investigated this.
Remapping memory is conceivably faster than transferring memory by copy operations, since it can involve simply changing the virtual address map instead of physically moving data.
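As a hedged sketch of that windowing idea (the data file name is hypothetical, and it assumes a POSIX toolchain where defining _FILE_OFFSET_BITS makes off_t 64 bits wide even in a 32-bit build):

/* A 32-bit process slides a small mmap'd view across a data set far
 * larger than its address space, addressing it with 64-bit offsets. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

enum { WINDOW = 64 * 1024 * 1024 };    /* 64 MiB view fits in 32-bit space */

int main(void)
{
    int fd = open("bigdata.bin", O_RDONLY);    /* hypothetical data file */
    if (fd < 0) return 1;

    off_t offset = (off_t)5 << 30;             /* 5 GiB: beyond 32-bit range */
    char *view = mmap(NULL, WINDOW, PROT_READ, MAP_SHARED, fd, offset);
    if (view == MAP_FAILED) return 1;

    /* ... work on view[0 .. WINDOW-1], then munmap and remap elsewhere ... */
    munmap(view, WINDOW);
    close(fd);
    return 0;
}

The offset passed to mmap must be a multiple of the page size, and each remap is a system call, so the interface modules would want to batch accesses per window.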

Memory mapped database

I have 8 terabytes of data composed of ~5000 arrays of small elements (under a hundred bytes per element). I need to load sections of these arrays (a few dozen megs at a time) into memory to use in an algorithm as quickly as possible. Are memory mapped files right for this use, and if not, what else should I use?
Given your requirements I would definitely go with memory mapped files. It's almost exactly what they were made for. And since memory mapped files consume few physical resources, your extremely large files will have little impact on the system compared to other methods, especially since smaller views can be mapped into the address space just before performing I/O (e.g., those arrays of elements). The other big benefit is that they give you the simplest working environment possible. You can (mostly) just view your data as a large memory address space and let Windows worry about the I/O. Obviously, you'll need to build in locking mechanisms to handle multiple threads, but I'm sure you know that.
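Since the answer assumes Windows, here is a hedged Win32 C sketch of that "map a small view just before the I/O" pattern; the file name, offset, and view size are made up:

#include <windows.h>

int main(void)
{
    HANDLE file = CreateFileA("arrays.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) return 1;

    /* Map a 32 MiB view starting 1 TiB into the file; the API takes the
     * 64-bit offset split into high and low 32-bit halves. */
    ULONGLONG offset = 1ULL << 40;
    SIZE_T view_size = 32 * 1024 * 1024;
    const char *view = MapViewOfFile(mapping, FILE_MAP_READ,
                                     (DWORD)(offset >> 32),
                                     (DWORD)(offset & 0xFFFFFFFFu),
                                     view_size);
    if (view == NULL) return 1;

    /* ... run the algorithm over view[0 .. view_size-1] ... */

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}

View offsets must be multiples of the system allocation granularity (64 KiB on typical systems), which is easy to satisfy for fixed-size array sections.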

Process Page Tables

I'm interested in gaining a greater understanding of the virtual memory and paging mechanism, specifically for Windows x86 systems. From what I have gathered from various online resources (including other questions posted on SO),
1) The individual page tables for each process are located within the kernel address space of that same process.
2) There is only a single page table per process, containing the mapping of virtual pages onto physical pages (or frames).
3) The physical address corresponding to a given virtual address is calculated by the memory management unit (MMU) essentially by using the first 20 bits of the provided virtual address as the index of the page table, using that index to retrieve the beginning address of the physical frame and then applying some offset to that address according to the remaining 12 bits of the virtual address.
Are these three statements correct? Or am I misinterpreting the information?
So, first let's clarify some things:
In the case of the x86 architecture, it is not the operating system that determines the paging policy, it is the CPU (more specifically, its MMU). How the operating system views the paging system is independent of the way it is implemented. As a commenter rightly pointed out, there is an OS-specific component to paging models, but it is subordinate to the hardware's way of doing things.
32 bit and 64 bit x86 processors have different paging schemes so you can't really talk about the x86 paging model without also specifying the word size of the processor.
What follows is a massively condensed version of the 32-bit x86 paging model, using the simplest form of it. There are many additional tweaks that are possible, and I know that various OS's make use of them. I'm not going into those because I'm not really familiar with the internals of most OS's and because you really shouldn't go into that until you have a grasp on the simpler stuff. If you want to know all of the wonderful quirks of the x86 paging model, you can go to the Intel docs: Intel System Programming Guide
In the simplest paging model, the memory space is divided into 4KB blocks called pages. A contiguous chunk of 1024 of these is mapped to a page table (which is also 4KB in size). For a further level of indirection, all 1024 page tables are mapped to a 4KB page directory, and the base of this directory sits in a special register, %cr3, in the processor. This two-level structure is in place because most memory spaces in the OS are sparse, which means that most of the space is unused. You don't want to keep a bunch of page tables around for memory that isn't touched.
When you get a memory address, the most significant 10 bits index into the page directory, which gives you the base of the page table. The next 10 bits index into that page table to give you the base of the physical page (also called the physical frame). Finally, the last 12 bits index into the frame. The MMU does all of this for you, assuming you've set %cr3 to the correct value.
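A tiny C sketch of that 10/10/12 split (purely illustrative - user code never performs this walk itself, the MMU does):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vaddr = 0xDEADBEEF;                 /* arbitrary example address */

    uint32_t pde_index = vaddr >> 22;            /* top 10 bits: page directory */
    uint32_t pte_index = (vaddr >> 12) & 0x3FF;  /* next 10 bits: page table */
    uint32_t offset    = vaddr & 0xFFF;          /* low 12 bits: within frame */

    printf("PDE %u, PTE %u, offset 0x%03X\n", pde_index, pte_index, offset);
    return 0;
}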
64-bit systems have a 4-level paging system because their memory spaces are much more sparse. Also, it is possible to use page sizes that are not 4KB.
To actually get to your questions:
All of this paging information (tables, directories, etc.) sits in kernel memory. Note that kernel memory is one big chunk and there is no concept of having kernel memory for a single process.
There is only one page directory per process. This is because the page directory defines a memory space and each process has exactly one memory space.
The walkthrough of the address bits above (and the sketch after it) shows how an address is chopped up.
Overall that's pretty much correct.
If memory serves, a few details are a bit off though:
The paging for the kernel memory doesn't change per-process, so all the page tables are always visible to the kernel.
In theory, there's also a segment-based translation step. Most practical systems (e.g., *BSD, Linux, Windows, OS/X), however, use segments with their base set to 0 and limit set to the address space limit, so this step ends up as essentially a NOP.
