NUMA, mbind, segfault - C

I have allocated memory using valloc, let's say array A of 15*sizeof(double). Now I divide it into three pieces and I want to bind each piece (of length 5) to a different NUMA node (say 0, 1, and 2). Currently, I am doing the following:
double* A=(double*)valloc(15*sizeof(double));
piece=5;
nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=2;
mbind(&A[5],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=4;
mbind(&A[10],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
First question: am I doing this right? I.e., are there any problems with proper alignment to page size, for example? Currently, with a size of 15 for array A, it runs fine, but if I set the array size to something like 6156000 and piece=2052000, so that the three calls to mbind start at &A[0], &A[2052000], and &A[4104000], then I get a segmentation fault (and sometimes it just hangs). Why does it run fine for the small size but give a segfault for the larger one? Thanks.

For this to work, you need to deal with chunks of memory that are at least page-sized and page-aligned - that means 4KB on most systems. In your case, I suspect the same page gets moved twice (possibly three times), due to you calling mbind() three times over pieces that don't fall on page boundaries.
The way NUMA memory is laid out is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket 2 has 2X..3X-1, etc. Of course, if you stick a 4GB DIMM next to socket 0 and a 16GB one next to socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is assigned to each socket, according to where the memory is physically attached.
As a consequence of how the memory is laid out, the physical memory you are using has to be mapped into the linear (virtual) address space page by page.
So, for large "chunks" of memory it is fine to move things around, but for small chunks it won't work quite right - you certainly can't "split" a single page so that parts of it are affine to two different CPU sockets.
Edit:
To split an array, you first need to find the page-aligned size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);
// There should be a whole number of "objects" per page. This checks
// that no object straddles a page boundary.
ASSERT(page_size % sizeof(A[0]) == 0);
split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);
piece = aligned_size;
nodemask = 1;  // node 0
mbind(&A[0], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 2;  // node 1
mbind(&A[aligned_size], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 4;  // node 2 - this start is page-aligned, so the last piece also takes the remnant
mbind(&A[aligned_size*2], (piece + remnant)*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
Obviously, you will now need to split the work across your three threads similarly, using the aligned size and the remnant as needed.
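Putting the whole thing together, a complete sketch might look like this (assuming a 3-node Linux box with libnuma installed, compiled with -lnuma; SIZE and the node numbers are illustrative):
#include <numaif.h>     /* mbind(), MPOL_BIND - link with -lnuma */
#include <stdio.h>
#include <stdlib.h>     /* valloc() */
#include <unistd.h>     /* sysconf() */

#define SIZE 6156000

int main(void)
{
    double *A = valloc(SIZE * sizeof(double));   /* page-aligned start */
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t objs_per_page = page_size / sizeof(A[0]);
    /* round each third down to a whole number of pages */
    size_t aligned_size = (SIZE / 3 / objs_per_page) * objs_per_page;
    size_t remnant = SIZE - 3 * aligned_size;
    unsigned long nodemask;
    int node;

    for (node = 0; node < 3; node++) {
        size_t start = node * aligned_size;               /* page-aligned */
        size_t count = aligned_size + (node == 2 ? remnant : 0);
        nodemask = 1UL << node;          /* bit n selects NUMA node n */
        if (mbind(&A[start], count * sizeof(double), MPOL_BIND,
                  &nodemask, 64, MPOL_MF_MOVE) != 0)
            perror("mbind");
    }
    free(A);
    return 0;
}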

Related

Using Linux AIO, able to do IOs but writing garbage as well into the file

This might seem silly, but: I am using libaio (not POSIX aio), and I am able to write something into the file - but I am also writing extra stuff into the file that I never intended.
I read about the alignment requirements and the data type of the buffer field of the iocb.
Here is the code sample (only the relevant sections, for representation):
aio_context_t someContext;
struct iocb somecb;
struct io_event someevents[1];
struct iocb *somecbs[1];
int somefd = open("/tmp/someFile", O_RDWR | O_CREAT, 0644); // O_CREAT needs a mode argument
char someBuffer[4096];
... // error checks
someContext = 0; // this is necessary
io_setup(32, &someContext ); // no error checks pasted here
strcpy(someBuffer, "hello stack overflow");
memset(&somecb, 0, sizeof(somecb));
somecb.aio_fildes = somefd ;
somecb.aio_lio_opcode = IOCB_CMD_PWRITE;
somecb.aio_buf = (uint64_t)someBuffer;
somecb.aio_offset = 0;
somecb.aio_nbytes = 100;
// I am avoiding the memalign and sysconf page-size parts in this sample paste
somecbs[0] = &somecb; // address of the solid struct, avoiding heap
// avoiding error checks for this sample listing
io_submit(someContext, 1, somecbs);
// not checking for events count or errors
io_getevents(someContext, 1, 1, someevents, NULL);
The Output:
This code does create the file, and does write the intended string
hello stack overflow into the file /tmp/someFile.
The problem:
The file /tmp/someFile also contains, after the intended string, a series of
#^#^#^#^#^#^#^#^#^ and some sections of the program itself (code sections) - garbage, in short.
I am fairly certain that this is some pointer gone wrong in the data field, but I cannot crack it.
How do I use aio (not posix) to write exactly and only 'hello world' into a file?
I am aware that aio calls might not be supported on all file systems as of now; the one I am running against does support them.
Edit - If you want the starter pack for this attempt , you can get from here.
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
Edit 2: Carelessness - I was setting the number of bytes to write larger than the intended data, and the code was honoring it. Put simply, to write 'hw' exactly, one needs no more than 2 bytes in the nbytes field of the iocb.
There are a few things going on here. First up, the alignment requirement that you mentioned is either 512 bytes or 4096 bytes, depending on your underlying device. Try 512 bytes to start. It applies to:
The offset that you're writing in the file must be a multiple of 512 bytes. It can be 0, 512, 1024, etc. You can write at offset 0 like you're doing here, but you can't write at offset 100.
The length of data that you're writing to the file must be a multiple of 512 bytes. Again, you can write 512 bytes, 1024 bytes, or 2048 bytes, and so on - any multiple of 512. You can't write 100 bytes like you're trying to do here.
The address of the memory that contains the data you're writing must be a multiple of 512. (I typically use 4096, to be safe.) Here, you'll need (uintptr_t)someBuffer % 512 to be 0. (With the code the way it is, it most likely won't be.)
In my experience, failing to meet any of the above requirements doesn't actually give you an error back! Instead, it'll complete the I/O request using normal, regular old blocking I/O.
Unaligned I/O: If you really, really need to write a smaller amount of data or write at an unaligned offset, then things get tricky even above and beyond the io_submit interface. You'll need to do an aligned read to cover the range of data that you need to write, then modify the data in memory and write the aligned region back to disk.
For example, say you wanted to modify offset 768 through 1023 on the disk. You'd need to read 512 bytes at offset 512 into a buffer. Then, memcpy() the 256 bytes you wanted to write 256 bytes into that buffer. Finally, you issue a write of the 512 byte buffer at offset 512.
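In code, the read-modify-write looks something like this (a sketch; pread/pwrite stand in for the equivalent aligned AIO requests, and newdata is a hypothetical pointer to the 256 bytes being patched in):
char buf[512] __attribute__((aligned(512)));   /* aligned buffer */
pread(fd, buf, 512, 512);          /* aligned read of offsets 512..1023  */
memcpy(buf + 256, newdata, 256);   /* patch file offsets 768..1023       */
pwrite(fd, buf, 512, 512);         /* aligned write of the whole block   */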
Uninitialized Data: As others have pointed out, you haven't fully initialized the buffer that you're writing. Use memset() to initialize it to zero to avoid writing junk.
Allocating an Aligned Pointer: To meet the pointer requirements for the data buffer, you'll need to use posix_memalign(). For example, to allocate 4096 bytes with a 512 byte alignment restriction: posix_memalign(&ptr, 512, 4096);
Lastly, consider whether you need to do this at all. Even in the best of cases, io_submit still "blocks", albeit at the 10 to 100 microsecond level. Normal blocking I/O with pread and pwrite offers a great many benefits to your application. And, if it becomes onerous, you can relegate it to another thread. If you've got a latency-sensitive app, you'll need to do io_submit in another thread anyway!
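Pulling those rules together, a minimal aligned version of the program might look like this (a sketch: the raw syscall wrappers mirror the linux-aio.txt starter linked above, and writing the whole zero-padded 512-byte buffer is what keeps the garbage out of the file - you get trailing NUL bytes instead):
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
#include <linux/aio_abi.h>

/* Thin wrappers; the kernel AIO calls have no glibc entry points. */
static long io_setup(unsigned nr, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr, ctx);
}
static long io_submit(aio_context_t ctx, long n, struct iocb **iocbs) {
    return syscall(SYS_io_submit, ctx, n, iocbs);
}
static long io_getevents(aio_context_t ctx, long min_nr, long nr,
                         struct io_event *events, struct timespec *timeout) {
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

int main(void)
{
    aio_context_t ctx = 0;            /* must start as 0 */
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event events[1];
    void *buf;
    int fd = open("/tmp/someFile", O_RDWR | O_CREAT, 0644);

    if (fd < 0 || io_setup(32, &ctx) < 0)
        return 1;
    if (posix_memalign(&buf, 512, 512))  /* buffer address: multiple of 512 */
        return 1;
    memset(buf, 0, 512);                 /* no uninitialized bytes */
    strcpy(buf, "hello stack overflow");

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_offset = 0;                   /* offset: multiple of 512 */
    cb.aio_nbytes = 512;                 /* length: multiple of 512 */

    if (io_submit(ctx, 1, cbs) != 1)
        return 1;
    io_getevents(ctx, 1, 1, events, NULL);
    close(fd);
    return 0;
}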

c language 'order of memory allocation' and 'execution speed'

I'm making a program in C.
I have many arrays, and each array is not so small
(more than 10,000 elements per array).
Also, there are sets of arrays that are frequently accessed and computed together.
For example,
a_1[index] = constant * a_2[index];
b_1[index] = constant * b_2[index];
a_1 is computed from a_2, and b_1 is computed from b_2.
Suppose that I have arrays a_1~z_1 and a_2~z_2. In my case,
is there a significant difference in execution speed between the following two memory allocation orders:
allocating memory in order a_1~z_1 followed by a_2~z_2, or
allocating a_1,a_2 followed by b_1,b_2, then c_1,c_2, and so on?
1.
MALLOC(a_1);
MALLOC(b_1);
...
MALLOC(z_1);
MALLOC(a_2);
...
MALLOC(z_2);
2.
MALLOC(a_1);
MALLOC(a_2);
MALLOC(b_1);
MALLOC(b_2);
...
MALLOC(z_1);
MALLOC(z_2);
I think allocating memory the second way will be faster because of the cache hit rate.
Because arrays allocated at about the same time tend to sit at nearby addresses, those arrays will be loaded into the cache (or RAM) together, and therefore the computer does not need to load the arrays in several passes to compute one line of code.
For example, to compute
a_1[index] = constant * a_2[index];
, a_1 and a_2 would be loaded at the same time, not separately.
(Is this correct?)
However, for me, in terms of maintenance, allocating the first way is much easier.
I have AA_a~AA_z_1, AA_a~AA_z_2, BB_a~BB_z_1, CC_a~CC_z_1, and other arrays,
because I can efficiently use a macro to allocate the memory in the following way:
#define MALLOC_GROUP(GROUP1,GROUP2) \
    MALLOC(GROUP1##_a_##GROUP2); \
    MALLOC(GROUP1##_b_##GROUP2); \
    ... \
    MALLOC(GROUP1##_z_##GROUP2)
void allocate() {
    MALLOC_GROUP(AA,1);
    MALLOC_GROUP(AA,2);
    MALLOC_GROUP(BB,2);
}
To sum up: does allocating sets of arrays that are computed together at around the same time affect the execution speed of the program?
Thank you.
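Whether two back-to-back mallocs actually return adjacent blocks is up to the allocator, for what it's worth. If the pairing is what matters, one way to make it explicit is to carve each pair out of a single allocation (a sketch; N and the names are illustrative):
/* One backing block guarantees a_1 and a_2 are contiguous. */
double *block = malloc(2 * N * sizeof(double));
double *a_1 = block;        /* first N doubles */
double *a_2 = block + N;    /* next N doubles  */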

Is doubling the capacity of a dynamic array necessary?

When making automatically expanding arrays (like C++'s std::vector) in C, it is common (or at least common advice) to double the size of the array each time it fills up, in order to limit the number of calls to realloc and avoid copying the entire array as much as possible.
E.g. we start by allocating room for 8 elements, 8 elements are inserted, we then allocate room for 16 elements, 8 more elements are inserted, we allocate for 32, etc.
But realloc does not have to actually copy the data if it can expand the existing memory allocation. For example, the following code only does 1 copy (the initial NULL allocation, so it is not really a copy) on my system, even though it calls realloc 10000 times:
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int i;
    int copies = 0;
    void *data = NULL;
    void *ndata;

    for (i = 0; i < 10000; i++)
    {
        ndata = realloc(data, i * sizeof(int));
        if (data != ndata)
            copies++;
        data = ndata;
    }
    printf("%d\n", copies);
    return 0;
}
I realize that this example is very clinical - a real world application would probably have more memory fragmentation and would do more copies, but even if I make a bunch of random allocations before the realloc loop, it only does marginally worse with 2-4 copies instead.
So, is the "doubling method" really necessary? Would it not be better to just call realloc each time an element is added to the dynamic array?
You have to step back from your code for a minute and think abstractly. What is the cost of growing a dynamic container? Programmers and researchers don't think in terms of "this took 2ms", but rather in terms of asymptotic complexity: what is the cost of growing by one element given that I already have n elements, and how does this change as n increases?
If you only ever grew by a constant (or bounded) amount, then you would periodically have to move all the data, and so the cost of growing would depend on, and grow with, the size of the container. By contrast, when you grow the container geometrically, i.e. multiply its size by a fixed factor every time it is full, the expected cost of inserting is actually independent of the number of elements, i.e. constant.
It is of course not always constant, but it's amortized constant, meaning that if you keep inserting elements, then the average cost per element is constant. Every now and then you have to grow and move, but those events get rarer and rarer as you insert more and more elements.
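To put a number on that: doubling 8 -> 16 -> 32 -> ... -> n copies 8 + 16 + ... + n/2 < n elements in total across n insertions, i.e. fewer than one copied element per insertion on average, no matter how large n gets.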
I once asked whether it makes sense for C++ allocators to be able to grow, in the way that realloc does. The answers I got indicated that the non-moving growing behaviour of realloc is actually a bit of a red herring when you think asymptotically. Eventually you won't be able to grow any more, and you'll have to move, and so for the sake of studying the asymptotic cost it's actually irrelevant whether realloc can sometimes be a no-op or not. (Moreover, non-moving growth seems to upset modern, arena-based allocators, which expect all their allocations to be of a similar size.)
Compared to almost every other type of operation, malloc, calloc, and especially realloc are very expensive. I've personally benchmarked 10,000,000 reallocs, and it takes a HUGE amount of time to do that.
Even though I had other operations going on at the same time (in both benchmark tests), I found that I could literally cut HOURS off of the run time by using max_size *= 2 instead of max_size += 1.
Q: "Is doubling the capacity of a dynamic array necessary?"
A: No. One could grow only to the extent needed. But then you may truly copy data many times. It is a classic trade-off between memory and processor time. A good growth algorithm takes into account what is known about the program's data needs, and also does not over-think those needs. An exponential growth factor of 2x is a happy compromise.
But now to your claim: "the following code only does 1 copy".
The amount of copying with advanced memory allocators may not be what OP thinks. Getting the same address does not mean that the underlying memory mapping did not perform significant work. All sorts of activity go on under-the-hood.
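For reference, the doubling policy itself is only a few lines; a minimal sketch (the struct and names are illustrative, and the old block is kept if realloc fails):
#include <stdlib.h>

typedef struct {
    int   *data;
    size_t size, cap;
} vec;

static int vec_push(vec *v, int x)
{
    if (v->size == v->cap) {
        size_t ncap = v->cap ? v->cap * 2 : 8;   /* geometric growth */
        int *ndata = realloc(v->data, ncap * sizeof *ndata);
        if (!ndata)
            return -1;          /* old block is still valid */
        v->data = ndata;
        v->cap  = ncap;
    }
    v->data[v->size++] = x;
    return 0;
}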
For memory allocations that grow and shrink a lot over the life of the code, I like grow and shrink thresholds that are spaced geometrically apart from each other.
const size_t Grow[] = {1, 4, 16, 64, 256, 1024, 4096, ... };
const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048, ... };
By using the Grow thresholds while getting larger and the Shrink ones while contracting, one avoids thrashing near a boundary. Sometimes a factor of 1.5 is used instead.
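A sketch of how such tables might be consulted (illustrative, not the poster's actual code): the capacity only steps up past a Grow threshold, and only steps back down once usage falls below the previous level's Shrink entry, so it never bounces when the size hovers near a boundary.
#include <stddef.h>

size_t next_capacity(size_t cap, size_t needed)
{
    static const size_t Grow[]   = {1, 4, 16, 64, 256, 1024, 4096};
    static const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048};
    int i = 0, levels = sizeof Grow / sizeof Grow[0];

    while (i < levels - 1 && Grow[i] < cap) i++;      /* find current level */
    while (i < levels - 1 && Grow[i] < needed) i++;   /* grow as required   */
    while (i > 0 && needed <= Shrink[i - 1]) i--;     /* shrink, with slack */
    return Grow[i];
}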

Counting page transfers between disk and main memory

for I := 1 to 1024 do
for J := 1 to 1024 do
A[J,I] := A[J,I] * B[I,J]
For the given code, I want to count how many pages are transferred between disk and main memory given the following assumptions:
page size = 512 words
no more than 256 pages can be in main memory
LRU replacement strategy
all 2d arrays size (1:1024,1:1024)
each array element occupies 1 word
2d arrays are mapped in main memory in row-major order
I was given the solution, and my questions stems from that:
A[J,I] := A[J,I] * B[I,J]
writeA := readA * readB
Notice that there are 2 transfers changing every J loop and 1 transfer
that only changes every I loop.
1024 * (8 + 1024 * (1 + 1)) = 2105344 transfers
So the entire row of B is read every time we use it, therefore we
count the entire row as transferred (8 pages). But since we only read
a portion of each A row (1 value) when we transfer it, we only grab 1
page each time.
So what I'm trying to figure out is, how do we get that 8 pages are transferred every time we read B but only 1 transfer for each read and write of A?
I'm not surprised you're confused, because I certainly am.
Part of the confusion comes from labelling the arrays 1:1024. I couldn't think like that, I relabelled them 0:1023.
I take "row-major order" to mean that A[0,0] is in the same disk block as A[0,511]. The next block is A[0,512] to A[0,1023]. Then A[1,0] to A[1,511]... And the same arrangement for B.
As the inner loop first executes, the system will fetch the block containing A[0,0], then B[0,0]. As J increments, each element of A referenced will come from a separate disk block: A[1,0] is in a different block from A[0,0]. But only every 512th B element referenced will come from a different block; B[0,0] is in the same block as B[0,511].
So for one complete pass through the inner loop - 1024 calculations - there will be 1024 fetches of blocks from A, 1024 writes of dirty blocks from A, and 2 fetches of blocks from B: 2050 transfers overall.
I don't understand why the answer you have says there will be 8 fetches from B. If B were not aligned on a 512-word boundary, there would be 3 fetches from B per cycle; but not 8.
This same pattern happens for each value of I in the outer loop. That makes 2050*1024 = 2099200 total blocks read and written, assuming B is 512-word aligned.
I'm entirely prepared for someone to point out my obvious bloomer - they usually do - but the explanation you've been given seems wrong to me.
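If you want to check the arithmetic rather than trust either of us, a brute-force LRU simulation is short to write. This sketch assumes 0-based indexing, page-aligned row-major arrays (A in pages 0..2047, B in pages 2048..4095), 512-word pages and 256 frames, and counts a write-back whenever a dirty page is evicted (plus a final flush of dirty pages); under those assumptions it prints a total of 2099200, matching the count above:
#include <stdio.h>

#define FRAMES 256
#define N      1024
#define WORDS_PER_PAGE 512

static long page[FRAMES];    /* which page each frame holds (-1 = empty) */
static int  dirty[FRAMES];
static long used[FRAMES];    /* last-use tick, for LRU */
static long tick, fetches, writebacks;

static void touch(long p, int is_write)
{
    int i, victim = 0;
    tick++;
    for (i = 0; i < FRAMES; i++)
        if (page[i] == p) {                 /* hit: no transfer */
            used[i] = tick;
            dirty[i] |= is_write;
            return;
        }
    for (i = 1; i < FRAMES; i++)            /* miss: evict the LRU frame */
        if (used[i] < used[victim])
            victim = i;
    if (page[victim] != -1 && dirty[victim])
        writebacks++;                       /* dirty victim goes to disk */
    fetches++;
    page[victim] = p;
    dirty[victim] = is_write;
    used[victim] = tick;
}

int main(void)
{
    long I, J;
    int i;
    for (i = 0; i < FRAMES; i++) page[i] = -1;

    for (I = 0; I < N; I++)
        for (J = 0; J < N; J++) {
            touch(J * 2 + I / WORDS_PER_PAGE, 1);        /* read+write A[J,I] */
            touch(2048 + I * 2 + J / WORDS_PER_PAGE, 0); /* read B[I,J]       */
        }

    for (i = 0; i < FRAMES; i++)            /* final flush of dirty pages */
        if (page[i] != -1 && dirty[i])
            writebacks++;

    printf("fetches=%ld writebacks=%ld total=%ld\n",
           fetches, writebacks, fetches + writebacks);
    return 0;
}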

How to calculate fragmentation?

Imagine you have some memory containing a bunch of bytes:
++++ ++-- ---+ +++-
-++- ++++ ++++ ----
---- ++++ +
Let us say + means allocated and - means free.
I'm searching for a formula to calculate the percentage of fragmentation.
Background
I'm implementing a tiny dynamic memory management for an embedded device with static memory. My goal is to have something I can use for storing small amounts of data. Mostly incoming packets over a wireless connection, at about 128 Bytes each.
As R. says, it depends exactly what you mean by "percentage of fragmentation" - but one simple formula you could use would be:
(free - freemax)
---------------- x 100% (or 100% for free=0)
free
where
free = total number of bytes free
freemax = size of largest free block
That way, if all memory is in one big block, the fragmentation is 0%, and if memory is all carved up into hundreds of tiny blocks, it will be close to 100%.
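In C that could be as simple as (a sketch; the caller supplies both byte counts from its allocator's bookkeeping):
#include <stddef.h>

/* free_total = total bytes free, free_max = size of the largest free block.
   0% = one big free block; approaches 100% as memory shatters. */
double fragmentation_pct(size_t free_total, size_t free_max)
{
    if (free_total == 0)
        return 100.0;                  /* the free = 0 convention above */
    return 100.0 * (double)(free_total - free_max) / (double)free_total;
}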
Calculate how many 128-byte packets you could fit in the current memory layout;
call that number n.
Then calculate how many 128-byte packets you could fit in a memory layout with the same number of bytes allocated as the current one, but with no holes (that is, move all the + to the left, for example);
call that number N.
Your "fragmentation ratio" would then be alpha = n/N.
If your allocations are all roughly the same size, just split your memory up into TOTAL/MAXSIZE pieces each consisting of MAXSIZE bytes. Then fragmentation is irrelevant.
To answer your question in general, there is no magic number for "fragmentation". You have to evaluate the merits of different functions in reflecting how fragmented memory is. Here is one I would recommend, as a function of a size n:
fragmentation(n) = -log(n * number_of_free_slots_of_size_n / total_bytes_free)
Note that the log is just there to map things to a "0 to infinity" scale; you should not actually evaluate that in practice. Instead you might simply evaluate:
freespace_quality(n) = n * number_of_free_slots_of_size_n / total_bytes_free
with 1.0 being ideal (able to allocate the maximum possible number of objects of size n) and 0.0 being very bad (unable to allocate any).
If you had [++++++-----++++--++-++++++++--------+++++] and you wanted to measure the fragmentation of the free space (or any other allocation),
you could measure the average contiguous free-run length:
total free blocks / count of contiguous free runs.
In this case it would be
(5 + 2 + 1 + 8) / 4 = 4
Based on R.. GitHub STOP HELPING ICE's answer, I came up with the following way of computing fragmentation as a single percentage number:
fragmentation = 1 - (sum for i = 1..n of FreeSlots(i) / IdealFreeSlots(i)) / n
Where:
n is the total number of free blocks
FreeSlots(i) means how many i-sized slots you can fit in the available free memory space
IdealFreeSlots(i) means how many i-sized slots would fit in a perfectly unfragmented memory of size n. This is a simple calculation: IdealFreeSlots(i) = floor(n / i).
How I came up with this formula:
I was thinking about how I could combine all the freespace_quality(i) values to get a single fragmentation percentage, but I wasn't very happy with the result of this function. Even in an ideal scenario, you could have freespace_quality(i) != 1 if the free space size n is not divisible by i. For example, if n=10 and i=3, freespace_quality(3) = 9/10 = 0.9.
So, I created a derived function, freespace_relative_quality(i), which looks like this:
freespace_relative_quality(i) = FreeSlots(i) / IdealFreeSlots(i)
This always has the output 1 in the ideal "perfectly unfragmented" scenario.
After doing the math, all that's left to get the final fragmentation formula is to average the freespace quality over all values of i (from 1 to n), then invert the range by taking 1 minus the average quality, so that 0 means completely unfragmented (maximum quality) and 1 means most fragmented (minimum quality):
fragmentation = 1 - (sum for i = 1..n of freespace_relative_quality(i)) / n
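And a small self-contained sketch that evaluates the formula, using the bitmap from the earlier answer as its memory map ('-' = free; the run-scanning code and the 64-run cap are illustrative):
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *map = "++++++-----++++--++-++++++++--------+++++";
    size_t runs[64];                 /* lengths of contiguous free runs */
    size_t nruns = 0, n = 0;         /* n = total free blocks */
    size_t len = strlen(map), i, j, run = 0;

    for (i = 0; i <= len; i++) {
        if (i < len && map[i] == '-')
            run++;
        else if (run > 0) {          /* a free run just ended */
            runs[nruns++] = run;
            n += run;
            run = 0;
        }
    }

    double quality_sum = 0.0;
    for (i = 1; i <= n; i++) {       /* slot size i = 1..n */
        size_t slots = 0;
        for (j = 0; j < nruns; j++)
            slots += runs[j] / i;    /* FreeSlots(i) */
        /* IdealFreeSlots(i) = n / i (integer division) */
        quality_sum += (double)slots / (double)(n / i);
    }
    printf("fragmentation = %.1f%%\n",
           100.0 * (1.0 - quality_sum / (double)n));
    return 0;
}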
