The architecture of on-disk data structures - C

The closest I've come to finally understanding the architecture of on-disk B-trees is this.
It's simple and very easy to read and understand. But I still feel confused. It doesn't seem like there is any in-memory data structure at all. Am I missing something? What makes this a B-tree? Is it just the array of longs that "point" to the keys of their child nodes? Is that efficient? Is that just how most databases and filesystems are designed?
Are there ways of implementing on-disk B-trees (or other data structures) in memory, where each node contains a file offset or something?

Node pointers are typically stored on disk as addresses (for example using long integers).
In general, an implementation chooses to use either physical or logical addresses:
Physical addresses specify the actual offset (within a file or similar) at which the node is stored.
In contrast, logical addresses require some kind of mechanism that resolves them to a physical address each time a pointer is navigated/traversed.
Physical addressing is faster (because no resolution mechanism is needed). However, logical addressing can allow nodes to be reorganized without having to rewrite pointers. The ability to reorganize nodes in this way can be used as the basis for implementing good clustering, space utilization and even low-level data distribution.
Some implementations use a combination of logical and physical addressing, such that each address is composed of a logical address that refers (dynamically) to a segment (blob) and a physical address within that segment.
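To make that combined scheme concrete, here is a minimal C sketch (all names and sizes are illustrative assumptions, not taken from any particular engine): a node address made of a logical segment id plus a physical offset, resolved through a segment table.

```c
#include <stdint.h>

struct node_addr {
    uint32_t segment_id;   /* logical part: resolved via the segment table    */
    uint32_t offset;       /* physical part: byte offset within that segment  */
};

/* The segment table maps logical segment ids to their current file offsets,
 * so segments can be relocated without rewriting every node pointer. */
struct segment_table {
    uint64_t file_offset[1024];   /* file position of each segment (blob) */
};

/* Resolve a node address to an absolute position in the file. */
static uint64_t resolve(const struct segment_table *t, struct node_addr a) {
    return t->file_offset[a.segment_id] + a.offset;
}
```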
It is important to note that node addresses are disk-based; they therefore cannot be used directly as in-memory pointers.
In some cases it is beneficial to convert disk-based pointers to memory pointers when data is loaded into memory (and then convert back to disk-based pointers when writing).
This conversion is sometimes called pointer swizzling, and it can be implemented in many ways. The fundamental idea is that data referenced by a pointer does not have to be loaded into memory until that pointer is actually navigated/traversed.
The common approaches to this are either to use a logical in-memory addressing scheme or to use memory-mapped files. Memory-mapped files use virtual memory addressing, in which memory pages are not loaded into memory before they are accessed. Memory-mapped files are provided by the OS. This approach is sometimes called page-faulted addressing, because accessing data on a memory page that is not yet mapped into memory causes a page fault interrupt.
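As an illustration of one possible swizzling convention (an assumption made for the example, not a standard API): a 64-bit reference whose low bit says whether it currently holds an on-disk offset or an in-memory pointer, swizzled lazily the first time it is followed.

```c
#include <stdint.h>

typedef uint64_t node_ref;   /* either a disk offset or an in-memory pointer */

#define REF_IS_SWIZZLED(r)   (((r) & 1u) == 0)        /* real pointers are even */
#define REF_FROM_PTR(p)      ((node_ref)(uintptr_t)(p))
#define REF_FROM_OFFSET(off) (((node_ref)(off) << 1) | 1u)
#define REF_TO_OFFSET(r)     ((r) >> 1)

/* Hypothetical page loader: a real engine would read the node's page from disk. */
static void *load_page(uint64_t disk_offset) { (void)disk_offset; return 0; }

/* Follow a reference: if it is still an unswizzled disk offset, load the node
 * now (this lazy loading is the whole point) and overwrite the slot in place. */
static void *follow(node_ref *slot) {
    if (!REF_IS_SWIZZLED(*slot)) {
        void *node = load_page(REF_TO_OFFSET(*slot));
        *slot = REF_FROM_PTR(node);                   /* swizzle in place */
    }
    return (void *)(uintptr_t)*slot;
}
```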

Related

How to share an existing dynamic array within the Linux POSIX model in C?

I have a very big, quickly growing (realloc, tcmalloc) dynamic array (about 2-4 billion doubles). After the growth ends I would like to share this array between two different applications. I know how to prepare a shared memory region and copy my full-grown array into it, but this is too wasteful of memory, because I would have to keep the source array and the shared destination array at the same time. Is it possible to share an already existing dynamic array within the POSIX model without copying?
EDITED:
A little bit of explanation.
I am able to use memory allocation within the POSIX model (shm_open() and others), but if I do, I have to reallocate the already-shared memory segment many times (while reading numbers row by row from the database into memory). That is much more overhead compared with a simple realloc().
I have one producer, which reads from the database and writes into the shared memory.
I can't know beforehand how many records are present in the database, and therefore I can't know the size of the shared array before the allocation. For this reason I have to reallocate the big array while the producer is reading row by row from the database. After the memory is shared and filled, other applications read data from the shared array. Sometimes the size of this big shared array may change and it may be replenished with new data.
Is it possible to share already existing dynamic array within POSIX model without copying?
No, that is not how shared memory works. Read shm_overview(7) & mmap(2).
Copying two billion doubles might take a few seconds.
Perhaps you could use mremap(2).
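For example, on Linux a shm_open()ed object can be grown roughly like this; a minimal sketch in which the object name and sizes are made up, and mremap(2) is Linux-specific:

```c
#define _GNU_SOURCE           /* for mremap() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t size = 1 << 20;                       /* initial capacity: 1 MiB */
    int fd = shm_open("/big_array", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, size) < 0) { perror("shm_open/ftruncate"); return 1; }

    double *arr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (arr == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... fill arr while reading rows; when it is full, grow it in place: */
    size_t new_size = size * 2;
    if (ftruncate(fd, new_size) < 0) { perror("ftruncate"); return 1; }
    arr = mremap(arr, size, new_size, MREMAP_MAYMOVE);
    if (arr == MAP_FAILED) { perror("mremap"); return 1; }
    size = new_size;

    munmap(arr, size);
    close(fd);
    shm_unlink("/big_array");
    return 0;
}
```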
BTW, for POSIX shared memory, most computers limit the size of the segment shared with shm_open(3) to a few megabytes (not gigabytes). Heuristically, the maximal shared size (on the whole computer) should be much less than half of the available RAM.
My feeling is that your design is inadequate and that you should not use shared memory in your case. You did not explain what problem you are trying to solve or how the data is modified (did you consider using some RDBMS?). What are the synchronization issues?
Your question smells a lot like some XY problem, so you really should explain more, motivate it much more, and also give a broader, higher-level picture.

Which data structure works best in shared memory scenario and fast lookup

I am still at the conceptual stage of a project; I have yet to start the code implementation. A subtask is this:
Two processes will request data from a commonly accessed DLL. This DLL would store the data in a buffer in memory. If I just instantiate a structure within the DLL and store data in it, then each process instance will have a separate structure and the data won't be common. So I need a shared memory implementation. Another requirement I have is fast lookup time within the data. I am not sure how an AVL tree can be stored within a shared memory space. Is there an implementation available on the internet for an AVL tree/hashmap that can be stored in shared memory space? Also, is this the right approach to the problem? Or should I be using something else altogether?
TIA!
Whether this is the right approach depends on various factors, such as how expensive the data is to produce, whether the processes need to communicate with each other concerning the data, and so on. The rest of this answer assumes that you really do need a lookup structure in shared memory.
You can use any data structure, provided that you can allocate storage for both your data and the data structure's internals in your shared memory space. This typically means that you won't be able to use malloc for it, since each process' heap usually remains private. You will need your own custom allocator.
Let's say you chose AVL trees. Here's a library that implements them: https://github.com/fbuihuu/libtree. It looks like in this library, the "internal" AVL node data is stored intrusively in your "objects." Intrusive means that you reserve fields to be used by the library when declaring your object struct. So, as long as you allocate space for your objects in shared memory, using your custom allocator, and also allocate space for the root tree struct there as well, the whole tree should be accessible to multiple processes. You just have to make sure that the shared memory itself is mapped to the same address range in each process.
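To illustrate what "intrusive" means here (a generic sketch with made-up field names, not libtree's exact declarations):

```c
/* Intrusive layout: the linkage fields the tree needs are reserved inside
 * your own object struct, so allocating the object in shared memory
 * automatically places the tree's bookkeeping there too. */
struct tree_link {                 /* what a tree library would embed per node */
    struct tree_link *left, *right, *parent;
    int balance;                   /* AVL balance factor */
};

struct my_object {                 /* your data, allocated by the custom allocator */
    long   key;
    double payload;
    struct tree_link link;         /* reserved for the tree implementation */
};
```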
If you used a non-intrusive AVL implementation, meaning that each node is represented by an internal struct which then points to a separate struct containing your data, the library or your implementation would have to allow you to specify the allocator for the internal struct somehow, so that you could make sure the space will be allocated in shared memory.
As for how to write the custom allocator, that really depends on your usage and the system. You need to consider if you will ever need to "resize" the shared memory region, whether the system allows you to do that, whether you will allocate only fixed-width blocks inside the region, or you need to support blocks with arbitrary length, whether it's acceptable to spread your data structures over multiple shared memory regions, how your processes can synchronize and communicate, and so on. If you go this route, you should ask a new question on the topic. Be sure to mention what system you are using (Windows?) and what your constraints are.
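As a very rough starting point, a custom allocator for a fixed-size shared region can be as simple as a bump allocator. The sketch below (names and alignment chosen arbitrarily, no locking, no free()) just hands out space from the region sequentially.

```c
#include <stddef.h>

/* Header placed at the very start of the shared memory region. */
struct shm_arena {
    size_t capacity;   /* bytes available after this header */
    size_t used;       /* bytes handed out so far           */
    /* object storage follows immediately after this struct */
};

/* Hand out n bytes from the region, 16-byte aligned; NULL when full. */
static void *arena_alloc(struct shm_arena *a, size_t n) {
    size_t aligned = (a->used + 15u) & ~(size_t)15u;
    if (aligned + n > a->capacity)
        return NULL;
    a->used = aligned + n;
    return (char *)(a + 1) + aligned;
}
```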
EDIT
Just to further discourage you from doing this unless it's necessary: if, for example, your data is expensive to produce but you don't care whether the processes build up their own independent lookup structures once the data is available to them, then you can, for example, have the DLL write the data to a simple ring buffer in shared memory, and the rest of the code take it from there. Building up two AVL trees isn't really a problem unless they are going to be very large.
Also, if you only care about concurrency, and it's not important for there to be two processes, you may be able to make them both threads of one process.
In the case of Windows, Microsoft's recommended functions may return different pointer values to the same shared memory in each process. This means that within the shared memory, offsets (from the start of the shared memory) have to be used instead of pointers. For example, in a linked list there is a next offset instead of a next pointer. You may want to create macros to convert offsets to pointers, and pointers to offsets.
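A small sketch of that offset idea (the base address is whatever MapViewOfFile() or mmap() returned in the current process; names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Convert between in-process pointers and offsets from the region's base. */
#define PTR_TO_OFF(base, p)   ((uint64_t)((char *)(p) - (char *)(base)))
#define OFF_TO_PTR(base, off) ((void *)((char *)(base) + (off)))

struct node {               /* a node of a linked list living in shared memory */
    int      value;
    uint64_t next_off;      /* offset of the next node; 0 plays the role of NULL */
};

/* Follow the "next" link given this process's base address for the region. */
static struct node *next_node(void *base, struct node *n) {
    return n->next_off ? (struct node *)OFF_TO_PTR(base, n->next_off) : NULL;
}
```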

How to use a mmap file mapping for variables

I'm currently experimenting with IPC via mmap on Unix.
So far, I'm able to map a sparse-file of 10MB into the RAM and access it reading and writing from two separated processes. Awesome :)
Now, I'm currently typecasting the address of the memory-segment returned by mmap to char*, so I can use it as a plain old cstring.
Now, my real question digs a bit further. I've quite a lot experience with higher levels of programming (ruby, java), but never did larger projects in C or ASM.
I want to use the mapped memory as an address space for variable allocation. I don't know whether this is possible or makes any sense at all. I'm thinking of some kind of hash-map-like data structure that lives purely in the shared segment. This would allow some interesting experiments with IPC, even with other languages like Ruby over FFI.
Now, a regular implementation of a hash map would quite often use something like malloc. But that would allocate memory outside of the shared space.
I hope you understand my thoughts, although my English is not the best!
Thank you in advance
Jakob
By and large, you can treat the memory returned by mmap like memory returned by malloc. However, since the memory may be shared between multiple "unrelated" processes, with independent calls to mmap, the starting address for each may be different. Thus, any data structure you build inside the shared memory should not use direct pointers.
Offsets from the initial map address should be used instead of pointers. The data structure would then compute the right pointer value by adding the offset to the starting address of the mmap region.
The data structure would be built entirely inside the region returned by the single call to mmap. If you need to grow the data structure, you have to extend the mmap region itself. This could be done with mremap, or by manually calling munmap and then mmap again after the backing file has been extended.
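For instance, a hash structure that avoids malloc entirely can be laid out directly in the mapped region as a fixed-capacity open-addressing table; the sketch below makes up the key/value types and the capacity.

```c
#include <stdint.h>

#define TABLE_CAP 4096u           /* power of two; chosen arbitrarily here */

/* The whole table lives at the start of the mapped region:
 *     struct shared_map *m = (struct shared_map *)mmap_base;             */
struct shared_map {
    uint64_t keys[TABLE_CAP];     /* 0 is reserved to mean "empty slot"   */
    uint64_t vals[TABLE_CAP];
};

/* Insert or update a key using linear probing (identity "hash" for brevity). */
static int map_put(struct shared_map *m, uint64_t key, uint64_t val) {
    for (uint64_t i = 0; i < TABLE_CAP; i++) {
        uint64_t slot = (key + i) & (TABLE_CAP - 1);
        if (m->keys[slot] == 0 || m->keys[slot] == key) {
            m->keys[slot] = key;
            m->vals[slot] = val;
            return 0;
        }
    }
    return -1;                    /* table is full */
}
```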

Virtual memory management in Fortran under Mac OS X

I'm writing a Fortran 90 program (compiled using gfortran) to run under Mac OS X. I have 13 data arrays, each comprising about 0.6 GB of data. My machine is maxed out at 8 GB of real memory, and if I try to hold all 13 arrays in memory at once, I'm basically trying to use all 8 GB, which I know isn't possible in view of other system demands. So I know that the arrays would be subject to swapping. What I DON'T know is how this is managed by the operating system. In particular,
Does the OS swap out entire data structures (e.g., arrays) when it needs to make room for other data structures, or does it rather do it on a page-by-page basis? That is, does it swap out partial arrays, based on which portions of the array have been least-recently accessed?
The answer may determine how I organize the arrays. If partial arrays can get swapped out, then I could store everything in one giant array (with indexing to select which of the 13 subarrays I need) and trust the OS to manage everything efficiently. Otherwise, I might preserve separate and distinct arrays, each one individually fitting comfortably within the available physical memory.
Operating systems are not typically made aware of structures (like arrays) in user memory. Most operating systems I'm aware of, including Mac OS X, swap out memory on a page-by-page basis.
Although the process is often wrongly called swapping, on x86 as well as on many modern architectures the OS performs paging to what is still called the swap device (mostly for historical reasons). The virtual memory space of each process is divided into pages, and a special table, called the process page table, holds the mapping between pages in virtual memory and frames in physical memory. Each page can be mapped or not mapped. Further, mapped pages can be present or not present. Access to an unmapped page results in a segmentation fault. Access to a non-present page results in a page fault, which is then handled by the OS: it takes the page from the swap device and installs it into a frame in physical memory (if any is available). The standard page size is 4 KiB on x86 and almost any other widespread architecture nowadays. Also, modern MMUs (Memory Management Units, often an integral part of the CPU) support huge pages (e.g. 2 MiB) that can be used to reduce the number of entries in the page tables and thus leave more memory for user processes.
So paging is really fine-grained in comparison with your data structures, and one often has loose or no control whatsoever over how the OS does it. Still, most Unices allow you to give instructions and hints to the memory manager using the C API available in the <sys/mman.h> header file. There are functions that allow you to lock a certain portion of memory and prevent the OS from paging it out to disk. There are functions that allow you to hint the OS that a certain memory access pattern is to be expected, so that it can optimise the way it moves pages in and out. You may combine these with carefully designed data structures in order to achieve some control over paging and get the best performance out of a given OS.
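From the C side (which Fortran code can call into via ISO_C_BINDING), those hints look roughly like this; the sizes and the choice of POSIX_MADV_SEQUENTIAL below are only an example:

```c
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t bytes = (size_t)600 << 20;    /* ~0.6 GB, like one of the arrays */

    /* An anonymous, page-aligned allocation (malloc would usually also work). */
    double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (a == MAP_FAILED) return 1;

    /* Hint: this range will be read sequentially, so read ahead aggressively. */
    posix_madvise(a, bytes, POSIX_MADV_SEQUENTIAL);

    /* Pin a small hot region so it is never paged out (subject to limits). */
    mlock(a, 16 * 4096);

    /* ... work with the array ... */

    munlock(a, 16 * 4096);
    munmap(a, bytes);
    return 0;
}
```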

Memory mapped database

I have 8 terabytes of data composed of ~5000 arrays of small sized elements (under a hundred bytes per element). I need to load sections of these arrays (a few dozen megs at a time) into memory to use in an algorithm as quickly as possible. Are memory mapped files right for this use, and if not what else should I use?
Given your requirements I would definitely go with memory-mapped files. It's almost exactly what they were made for. And since memory-mapped files consume few physical resources, your extremely large files will have little impact on the system compared to other methods, especially since smaller views can be mapped into the address space just before performing I/O (e.g., on those arrays of elements). The other big benefit is that they give you the simplest working environment possible. You can (mostly) just view your data as a large memory address space and let Windows worry about the I/O. Obviously, you'll need to build in locking mechanisms to handle multiple threads, but I'm sure you know that.
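To illustrate the "smaller views" point with POSIX mmap (the Windows equivalent would be CreateFileMapping/MapViewOfFile); the file name, the byte position and the window size below are made up:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("arrays.bin", O_RDONLY);
    if (fd < 0) return 1;

    long   page   = sysconf(_SC_PAGESIZE);
    off_t  want   = (off_t)123456789;          /* byte position of the section  */
    size_t length = 32u << 20;                 /* map a 32 MiB window            */
    off_t  start  = want - (want % page);      /* mmap offsets must be page-aligned */

    char *view = mmap(NULL, length + (size_t)(want - start), PROT_READ,
                      MAP_SHARED, fd, start);
    if (view == MAP_FAILED) return 1;

    const char *section = view + (want - start);   /* the data we asked for */
    (void)section;                                 /* ... run the algorithm ... */

    munmap(view, length + (size_t)(want - start));
    close(fd);
    return 0;
}
```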
