Offset for mmap() must be page aligned [duplicate] - c

I came across the following algorithm that aligns a virtual address to the immediate next page boundary.
VirtualAddr = (VirtualAddr & ~(PageSize-1));
Also, given a length in bytes, it rounds the length up so that it falls on a page boundary:
len = ((PageSize-1)&len) ? ((len+PageSize) & ~(PageSize-1)):len;
I am finding it hard to decipher how this works.
Can someone help me out to break it down?

Those calculations assume that the page size is a power of 2 (which is the case for
all systems that I know of), for example
PageSize = 4096 = 2^12 = 1000000000000 (binary)
Then (written as binary numbers)
PageSize-1 = 00...00111111111111
~(PageSize-1) = 11...11000000000000
which means that
(VirtualAddr & ~(PageSize-1))
is VirtualAddr with the lower 12 bits set to zero or, in other words,
VirtualAddr rounded down to a multiple of 2^12 = PageSize (the start of the page that contains it).
Now you can (hopefully) see that in
len = ((PageSize-1)&len) ? ((len+PageSize) & ~(PageSize-1)):len;
the first expression
((PageSize-1)&len)
is zero exactly if len is a multiple of PageSize. In that case, len is left
unchanged. Otherwise (len + PageSize) is rounded down to the next multiple of
PageSize.
So in any case, len is rounded up to the next multiple of PageSize.
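For reference, here is a minimal sketch of both roundings as C functions. The names is_pow2, page_round_down and page_round_up are made up for illustration, and the round-up uses the common branch-free form (add PageSize-1, then mask), which gives the same result as the conditional expression from the question as long as the addition does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if x is a power of two (hypothetical helper). */
static int is_pow2(uintptr_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Clear the low bits: largest multiple of page_size that is <= addr. */
static uintptr_t page_round_down(uintptr_t addr, uintptr_t page_size)
{
    assert(is_pow2(page_size));
    return addr & ~(page_size - 1);
}

/* Smallest multiple of page_size that is >= len.
   Assumes len + page_size - 1 does not overflow. */
static uintptr_t page_round_up(uintptr_t len, uintptr_t page_size)
{
    assert(is_pow2(page_size));
    return (len + page_size - 1) & ~(page_size - 1);
}
```

With a 4096-byte page, page_round_down(4097, 4096) is 4096, while page_round_up(4097, 4096) is 8192 and page_round_up(4096, 4096) stays at 4096.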

I think the first one should be
VirtualAddr = (VirtualAddr & ~(PageSize-1)) + PageSize;

This one-liner will do it. If the address is already aligned, it will not skip to the next page boundary:
aligned = ((unsigned long) a & (getpagesize()-1)) ? (void *) (((unsigned long) a+getpagesize()) & ~(getpagesize()-1)) : a;
If you really do want to skip to the next page boundary even if it's already aligned, just do:
aligned = (void *) (((unsigned long) a+getpagesize()) & ~(getpagesize()-1))
This should avoid all compiler warnings, too.
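getpagesize() is a legacy BSD call that was dropped from POSIX.1-2001; the portable POSIX spelling is sysconf(_SC_PAGESIZE). Either way, #include <unistd.h> to avoid warnings.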
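A variant of the same idea, sketched with uintptr_t instead of unsigned long (so the cast stays correct on platforms where long is narrower than a pointer) and with sysconf(_SC_PAGESIZE) as the page size source. The function name page_align_up is hypothetical, and the code assumes the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Round a pointer up to the next page boundary; a pointer that is
   already page aligned is returned unchanged. */
static void *page_align_up(void *a)
{
    long ps = sysconf(_SC_PAGESIZE);          /* page size in bytes, -1 on error */
    assert(ps > 0 && (ps & (ps - 1)) == 0);   /* must be a power of two */
    uintptr_t mask = (uintptr_t)ps - 1;
    return (void *)(((uintptr_t)a + mask) & ~mask);
}
```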

Related

Why right-shifting an address by three bits as a hash function for a fixed-size hash table?

I'm following an article where I've got a hash table with a fixed number of 2048 buckets.
The hash function takes a pointer and the hash table itself, treats the address as a bit-pattern, shifts it right three bits and reduces it modulo the size of the hash table (2048):
(It's written as a macro in this case):
#define hash(p, t) (((unsigned long)(p) >> 3) & \
(sizeof(t) / sizeof((t)[0]) - 1))
The article, however, doesn't elaborate on why it's right-shifting the address by three bits (and it seems a bit arbitrary at first). My first guess was that the reason is to sort of group pointers with a similar address by cutting off the last three bits but I don't see how this would be useful given that most addresses allocated for one application have similar addresses anyway; take this as an example:
#include <stdio.h>
int main()
{
int i1 = 0, i2 = 0, i3 = 0;
printf("%p\n", &i1);
printf("%p\n", &i2);
printf("%p\n", &i3);
printf("%lu\n", ((unsigned long)(&i1) >> 3) & 2047); // Provided that the size of the hash table is 2048.
printf("%lu\n", ((unsigned long)(&i2) >> 3) & 2047);
printf("%lu", ((unsigned long)(&i3) >> 3) & 2047);
return 0;
}
Also, I'm wondering why it's choosing 2048 as a fixed size and if this is in relation to the three-bit shift.
For reference, the article is an extract from "C Interfaces and Implementations: Techniques for Creating Reusable Software" by David R. Hanson.
Memory allocations must be properly aligned. I.e. the hardware may specify that an int should be aligned to a 4-byte boundary, or that a double should be aligned to 8 bytes. This means that the last two address bits for an int must be zero, three bits for the double.
Now, C allows you to define complex structures which mix char, int, long, float, and double fields (and more). And while the compiler can add padding to align the offsets to the fields to the appropriate boundaries, the entire structure must also be properly aligned to the largest alignment that one of its members uses.
malloc() does not know what you are going to do with the memory, so it must return an allocation that's aligned for the worst case. This alignment is specific to the platform, but it's generally not less than 8-byte alignment. A more typical value today is 16-byte alignment.
So, the hash algorithm simply cuts off the three bits of the address which are virtually always zero, and thus less than worthless for a hash value. This easily reduces the number of hash collisions by a factor of 8. (The fact that it only cuts off 3-bits indicates that the function was written a while ago. Today it should be programmed to cut off four bits.)
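To make the factor of 8 concrete, here is a small sketch that counts how many distinct buckets a run of 8-byte-aligned addresses lands in, with and without the shift. The names NBUCKETS, bucket_of and distinct_buckets are invented for this illustration, and the addresses are synthetic rather than real malloc() results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 2048u   /* table size, must be a power of two */

/* Bucket index after discarding `shift` low address bits. */
static unsigned bucket_of(uintptr_t addr, unsigned shift)
{
    return (unsigned)((addr >> shift) & (NBUCKETS - 1));
}

/* Count the distinct buckets hit by n addresses spaced `stride`
   bytes apart, starting at `base`. */
static unsigned distinct_buckets(uintptr_t base, size_t stride,
                                 size_t n, unsigned shift)
{
    unsigned char seen[NBUCKETS] = {0};
    unsigned count = 0;

    assert(stride > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned b = bucket_of(base + i * stride, shift);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

For 2048 addresses spaced 8 bytes apart, a shift of 0 reaches only 256 of the 2048 buckets, because the low three bits of every index are zero. A shift of 3 reaches all 2048.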
This code assumes that the objects which are going to be hashed are aligned to 8 bytes (more precisely, to 2^(right_shift)). Otherwise this hash function (or macro) will return colliding results.
#define mylog2(x) (((x) & 1) ? 0 : ((x) & 2) ? 1 : ((x) & 4) ? 2 : ((x) & 8) ? 3 : ((x) & 16) ? 4 : ((x) & 32) ? 5 : -1)
#define hash(p, t) (((unsigned long)(p) >> mylog2(sizeof(p))) & \
(sizeof(t) / sizeof((t)[0]) - 1))
unsigned long h[2048];
int main()
{
int i1 = 0, i2 = 0, i3 = 0;
long l1,l2,l3;
printf("sizeof(ix) = %zu\n", sizeof(i1));
printf("sizeof(lx) = %zu\n", sizeof(l1));
printf("%lu\n", hash(&i1, h)); // Provided that the size of the hash table is 2048.
printf("%lu\n", hash(&i2, h));
printf("%lu\n", hash(&i3, h));
printf("\n%lu\n", hash(&l1, h)); // Provided that the size of the hash table is 2048.
printf("%lu\n", hash(&l2, h));
printf("%lu\n", hash(&l3, h));
return 0;
}
https://godbolt.org/z/zq1zfP
To make it more reliable you need to take into account the size of the object:
#define hash1(o, p, t) (((unsigned long)(p) >> mylog2(sizeof(o))) & \
(sizeof(t) / sizeof((t)[0]) - 1))
Then it will work with any size data https://godbolt.org/z/a7dYj9
Though it's not dictated by the C language standard, on most platforms (where platform = compiler + designated HW architecture), variable x is allocated at an address which is a multiple of (i.e., divisible by) sizeof(x).
This is because many platforms do not support unaligned load/store operations (e.g., writing a 4-byte value to an address which is not aligned to 4 bytes).
Knowing that sizeof(long) is at most 8 (again, on most platforms), we can further predict that the last 3 bits on the address of every long variable will always be zero.
When designing a hash-table solution, one would typically strive for as few collisions as possible.
Here, the hashing solution takes the last 11 bits of every address.
So in order to reduce the number of collisions, we shift every address right by 3 bits, thus replacing those 3 "predictable" zeros with something "more random".

Cache block tag size

I'm writing a cache simulation program in C on linux using gcc as the compiler and I'm done for the most part. Only a few test cases go wrong (a few things out of the thousands of fed addresses that should be hitting are missing). I specify the cache properties on the command line. I suspect the error within my code has to do with the tag (if things aren't hitting then their tags aren't matching up when they should be). So my question is: Am I calculating the tag right?
//setting sizes of bits
int offsetSize = log2(lineSize);
int indexSize = 0;
if (strcmp(associativity,"direct") == 0){//direct associativity
indexSize = log2(numLines);
}else if (assocNum == numLines){//fully associative
indexSize = 0;
}else{//set associative
indexSize = log2(assocNum);
}
address = (int) strtol(readAddress,&eptr,16);
unsigned long long int mask = 0;
//get the offset Bits
mask = (1 << offsetSize) - 1;
offsetBits = address & mask;
//get the index bits
mask = (1 << (indexSize)) - 1;
mask = mask << offsetSize;
indexBits = (address & mask) >> offsetSize;
//get tag bits
tagBits = address >> (offsetSize+indexSize);
The addresses that are being fed are usually 48 bits, so the variables address and mask are of type unsigned long long int. I think the problem I'm having is that I'm taking all the upper bits of the address, when I should only be taking a small set of bits from the large address.
For example: I have 32 cache lines in a 4-way set associative cache with a block size of 4.
offsetSize = log2(4) = 2
indexSize = log2(4) = 2
My code currently takes the upper bits of the address no matter the address size, minus the last 4 bits. Should I be taking only the upper 28 bits instead? (tagSize = (8*4)-3-2)
My code currently takes the upper bits of the address no matter the address size, minus the last 4 bits. Should I be taking only the upper 28 bits instead?
The tag has to contain all upper bits so that the tag can be used to determine if it is or isn't a cache hit.
If addresses are 48-bits and are split into 3 fields, you'd have a 2-bit "offset in cache line" field, a 2-bit "index in cache" field and a 44-bit "upper bits that have to be stored in the tag" field. If you only store 28 bits in the tag then you get cache hits when you should get cache misses (because the entry in the cache happens to contain data for a different address where the 28 bits happened to match).
Note that you can/should think of "associativity" as the number of sets of cache lines that happen to operate in parallel (where direct mapped is just "associativity = 1", and where fully associative is just "associativity = total_cache_size / cache_line_size"). The associativity has no direct effect on the index size (only the size of the sets of cache lines matters for index size), and the problem you're having is probably related to indexSize = log2(assocNum); (which doesn't make sense).
In other words:
if( direct_mapped ) {
associativity = 1;
} else {
max_associativity = total_cache_size / cache_line_size;
if( fully_associative || (associativity > max_associativity) ) {
associativity = max_associativity;
}
}
set_size = total_cache_size / associativity;
number_of_lines_in_set = set_size / cache_line_size;
offset_size = log2(cache_line_size);
index_size = log2(number_of_lines_in_set);
tag_size = address_size - index_size - offset_size;
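The pseudocode above could be turned into C roughly as follows. This is only a sketch: struct cache_fields, ilog2 and split_address are hypothetical names, line size and line counts are assumed to be powers of two, and everything is done in 64-bit unsigned arithmetic. The last point matters because, in the question's code, 1 << n is an int expression and (int) strtol(...) would truncate a 48-bit address before any field is extracted.

```c
#include <assert.h>
#include <stdint.h>

struct cache_fields { uint64_t offset, index, tag; };

/* log2 of a power of two, by counting shifts (no floating-point log2()). */
static unsigned ilog2(uint64_t x)
{
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x >>= 1)
        n++;
    return n;
}

/* num_lines: total lines in the cache; ways: associativity
   (1 = direct mapped, num_lines = fully associative). */
static struct cache_fields split_address(uint64_t addr, uint64_t line_size,
                                         uint64_t num_lines, uint64_t ways)
{
    struct cache_fields f;
    unsigned offset_bits, index_bits;

    assert(ways >= 1 && ways <= num_lines && num_lines % ways == 0);
    offset_bits = ilog2(line_size);
    index_bits  = ilog2(num_lines / ways);   /* lines per way */

    f.offset = addr & ((UINT64_C(1) << offset_bits) - 1);
    f.index  = (addr >> offset_bits) & ((UINT64_C(1) << index_bits) - 1);
    f.tag    = addr >> (offset_bits + index_bits);   /* all remaining upper bits */
    return f;
}
```

Following that pseudocode, 32 lines in 4 ways with a line size of 4 gives 8 lines per way, so the split is 2 offset bits, 3 index bits, and everything above that as the tag. An address string would be parsed with strtoull(readAddress, &eptr, 16) to keep all 48 bits.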

Bit Twiddling in C - Counting Bits

I want to count the bits that are set in an extremely large bit-vector (i.e. 100,000 bits).
What I am currently doing is using a pointer to char (i.e. char *cPtr) to point to the beginning of the array of bits. I then:
1. look at each element of the array (i.e. cPtr[x]),
2. convert it to an integer (i.e. (int) cPtr[x])
3. use a 256 element look-up table to see how many bits are set in the given byte (i.e. cPtr[x]).
It occurs to me that if I use a short int pointer instead (i.e. short int * sPtr), then I will only need half as many look-ups, but with a 65536 element look-up table, which will have its own cost in memory usage.
I'm wondering what is the optimal number of bits to examine each time. Also, if that number is not the size of some preset type, how can I walk down my bit-vector and set a pointer to be ANY arbitrary number of bits past the starting location of the bit array.
I know there are other ways to count bits, but for now I want to be certain I can optimize this method before making comparisons to other methods.
You can count them using bitwise operations:
char c = cPtr[x];
int num = ((c & 0x01) >> 0) +
((c & 0x02) >> 1) +
((c & 0x04) >> 2) +
((c & 0x08) >> 3) +
((c & 0x10) >> 4) +
((c & 0x20) >> 5) +
((c & 0x40) >> 6) +
((c & 0x80) >> 7);
It might seem a little long, but it doesn't require many memory accesses, so overall it seems pretty cheap to me.
You can even make it cheaper by reading an int every time, but then you will probably need to address an alignment issue.
I'm wondering what is the optimal number of bits to examine each time
The only way to find out is to test. See this question for a discussion of the fastest way to count 32 bits at a time.
Also, if that number is not the size of some preset type, how can I
walk down my bit-vector and set a pointer to be ANY arbitrary number
of bits past the starting location of the bit array.
You can't set a pointer to an arbitrary bit. Most machines have byte-addressing, some can only address words.
You can construct a word starting with an arbitrary bit like so:
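uint32_t wordAtBit(const uint32_t *array, size_t bit)
{
    size_t idx = bit >> 5;        /* index of the 32-bit word holding the bit */
    unsigned shift = bit & 31;    /* bit offset inside that word */
    uint32_t word = array[idx] >> shift;
    if (shift == 0)               /* aligned: a shift by 32 would be undefined */
        return word;
    /* Reads array[idx+1], so the array needs one word of padding at the end. */
    return word | (array[idx + 1] << (32 - shift));
}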
This should be quite fast (taken from Wikipedia):
static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount(uint32 i)
{
return (wordbits[i&0xFFFF] + wordbits[i>>16]);
}
In this way, you can check 32 bits per iteration.
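The table initializer is only described in words above, so here is one possible way to fill it at startup instead of writing out 65536 constants. This is a sketch, not the Wikipedia code: init_wordbits, wordbits_ready and popcount32 are names I made up, and the fill relies on the identity bits(i) = bits(i >> 1) + (i & 1).

```c
#include <assert.h>
#include <stdint.h>

static unsigned char wordbits[65536];   /* wordbits[0] is 0 by static init */
static int wordbits_ready = 0;

/* Fill the table once: the bit count of i is the bit count of i/2
   plus the lowest bit of i. */
static void init_wordbits(void)
{
    for (uint32_t i = 1; i < 65536; i++)
        wordbits[i] = (unsigned char)(wordbits[i >> 1] + (i & 1u));
    wordbits_ready = 1;
}

/* Two table look-ups per 32-bit word. */
static int popcount32(uint32_t v)
{
    assert(wordbits_ready);   /* init_wordbits() must have been called */
    return wordbits[v & 0xFFFF] + wordbits[v >> 16];
}
```

The table costs 64 KiB, which is larger than the L1 data cache on many CPUs, so whether it beats the 256-entry table is something to measure rather than assume.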
I am a bit late to the party, but there are much faster approaches than the ones that have been suggested so far. The reason is that many modern architectures offer hardware instructions to count the number of bits in various ways (leading zeroes, leading ones, trailing zeroes or ones, counting the number of bits set to 1, etc...). Counting the number of bits set to 1 is called the Hamming weight, also commonly called population count, or just popcount.
As a matter of fact, x86 CPUs have a POPCNT instruction, which was introduced alongside the SSE4.2 instruction set (it has its own CPUID feature bit). The very latest CPU architecture from Intel (nicknamed Haswell) offers even more hardware support for bit manipulation with the BMI1 and BMI2 extensions - maybe there is something else to use there!
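A minimal sketch of counting a whole bit-vector this way, assuming GCC or Clang: __builtin_popcountll and __builtin_popcount are compiler builtins, not standard C, and they only compile to the POPCNT instruction when the target allows it (for example with -mpopcnt or -march=native). The function name count_bits is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the set bits in a buffer of nbytes bytes, 64 bits at a time. */
static size_t count_bits(const unsigned char *buf, size_t nbytes)
{
    size_t total = 0, i = 0;

    assert(buf != NULL || nbytes == 0);
    for (; i + sizeof(uint64_t) <= nbytes; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* sidesteps alignment problems */
        total += (size_t)__builtin_popcountll(w);
    }
    for (; i < nbytes; i++)              /* leftover tail bytes */
        total += (size_t)__builtin_popcount(buf[i]);
    return total;
}
```

A 100,000-bit vector is 12,500 bytes, so this loop runs 1,562 times on full words plus 4 tail bytes.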

C: Memcpy vs Shifting: Whats more efficient?

I have a byte array containing 16 & 32bit data samples, and to cast them to Int16 and Int32 I currently just do a memcpy with 2 (or 4) bytes.
Because memcpy probably isn't optimized for lengths of just two bytes, I was wondering if it would be more efficient to convert the bytes to an Int32 using integer arithmetic (or a union).
I would like to know how the efficiency of calling memcpy compares with bit shifting, because the code runs on an embedded platform.
I would say that memcpy is not the way to do this. However, finding the best way depends heavily on how your data is stored in memory.
To start with, you don't want to take the address of your destination variable. If it is a local variable, you will force it to the stack rather than giving the compiler the option to place it in a processor register. This alone could be very expensive.
The most general solution is to read the data byte by byte and arithmetically combine the result. For example:
uint16_t res = ( (((uint16_t)char_array[high]) << 8)
| char_array[low]);
The expression in the 32 bit case is a bit more complex, as you have more alternatives. You might want to check the assembler output which is best.
Alt 1: Build pairs, and combine them:
uint16_t low16 = ... as example above ...;
uint16_t high16 = ... as example above ...;
uint32_t res = ( (((uint32_t)high16) << 16)
| low16);
Alt 2: Shift in 8 bits at a time:
uint32_t res = char_array[i0];
res = (res << 8) | char_array[i1];
res = (res << 8) | char_array[i2];
res = (res << 8) | char_array[i3];
All examples above are neutral to the endianness of the processor used, as the index values decide which part to read.
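Wrapped up as functions, the byte-by-byte approach might look like the sketch below. The names load_le16, load_le32 and load_be32 are hypothetical; "le" and "be" describe the byte order of the data in the array, not of the CPU. The casts to uint32_t before shifting matter, because an unsigned char promoted to int and shifted left by 24 can overflow.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit value stored least significant byte first. */
static uint16_t load_le16(const unsigned char *p)
{
    assert(p != 0);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* 32-bit value stored least significant byte first. */
static uint32_t load_le32(const unsigned char *p)
{
    assert(p != 0);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* 32-bit value stored most significant byte first. */
static uint32_t load_be32(const unsigned char *p)
{
    assert(p != 0);
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         |  (uint32_t)p[3];
}
```

For the bytes 0x78 0x56 0x34 0x12, load_le32 yields 0x12345678 and load_be32 yields 0x78563412, regardless of the machine the code runs on.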
The next kind of solution is possible if 1) the endianness (byte order) of the device matches the order in which the bytes are stored in the array, and 2) the array is known to be placed on an aligned memory address. The latter condition depends on the machine, but you are safe if the char array representing a 16-bit array starts on an even address, and in the 32-bit case it should start on an address divisible by four. In this case you could simply read the address, after some pointer tricks:
uint16_t res = *(uint16_t *)&char_array[xxx];
Where xxx is the array index corresponding to the first byte in memory. Note that this might not be the same as the index of the lowest value.
I would strongly suggest the first class of solutions, as it is endianness-neutral.
Anyway, both of them are way faster than your memcpy solution.
memcpy is not valid for "shifting" (moving data by an offset shorter than its length within the same array); attempting to use it for such invokes very dangerous undefined behavior. See http://lwn.net/Articles/414467/
You must either use memmove or your own shifting loop. For sizes above about 64 bytes, I would expect memmove to be a lot faster. For extremely short shifts, your own loop may win. Note that memmove has more overhead than memcpy because it has to determine which direction of copying is safe. Your own loop already knows (presumably) which direction is safe, so it can avoid an extra runtime check.
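As a small illustration of the overlapping case, here is a sketch that shifts the contents of a buffer toward its start with memmove. The function name shift_left is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Move the bytes of buf toward the start by `shift` positions, dropping
   the first `shift` bytes. The last `shift` bytes keep their old values.
   Source and destination overlap, which memmove (unlike memcpy) allows. */
static void shift_left(unsigned char *buf, size_t len, size_t shift)
{
    assert(shift <= len);
    memmove(buf, buf + shift, len - shift);
}
```

Shifting the six bytes "abcdef" by two leaves "cdefef" in the buffer.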

mprotect - how aligning to multiple of pagesize works?

I am not understanding the 'aligning allocated memory' part from the mprotect usage.
I am referring to the code example given on http://linux.die.net/man/2/mprotect
char *p;
char c;
/* Allocate a buffer; it will have the default
protection of PROT_READ|PROT_WRITE. */
p = malloc(1024+PAGESIZE-1);
if (!p) {
perror("Couldn't malloc(1024)");
exit(errno);
}
/* Align to a multiple of PAGESIZE, assumed to be a power of two */
p = (char *)(((int) p + PAGESIZE-1) & ~(PAGESIZE-1));
c = p[666]; /* Read; ok */
p[666] = 42; /* Write; ok */
/* Mark the buffer read-only. */
if (mprotect(p, 1024, PROT_READ)) {
perror("Couldn't mprotect");
exit(errno);
}
For my understanding, I tried using a PAGESIZE of 16, and 0010 as address of p.
I ended up getting 0001 as the result of (((int) p + PAGESIZE-1) & ~(PAGESIZE-1)).
Could you please clarify how this whole 'alignment' works?
Thanks,
Assuming that PAGESIZE is a power of 2 (a requirement), an integral value x can be rounded down to a multiple of PAGESIZE with (x & ~(PAGESIZE-1)). Similarly, ((x + PAGESIZE-1) & ~(PAGESIZE-1)) will result in x rounded up to a multiple of PAGESIZE.
For example, if PAGESIZE is 16, then in binary with a 32-bit word:
00000000000000000000000000010000 PAGESIZE
00000000000000000000000000001111 PAGESIZE-1
11111111111111111111111111110000 ~(PAGESIZE-1)
A bitwise-and (&) with the above value will clear the low 4 bits of the value, making it a multiple of 16.
That said, the code quoted in the description is from an old version of the manual page, and is not good because it wastes memory and does not work on 64-bit systems. It is better to use posix_memalign() or memalign() to obtain memory that is already properly aligned. The example on the current version of the mprotect() manual page uses memalign(). The advantage of posix_memalign() is that it is part of the POSIX standard, and does not have different behavior on different systems like the older non-standard memalign().
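A rough sketch of the posix_memalign() route, assuming a Linux-like system. POSIX only specifies mprotect() for memory obtained from mmap(), so applying it to heap memory is a platform assumption here. The helper names alloc_pages and set_writable are hypothetical. The allocation is a whole number of pages so that the protection change never touches neighbouring heap data.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate npages whole pages, page aligned. Returns NULL on failure;
   on success stores the page size in *page_size_out. */
static void *alloc_pages(size_t npages, size_t *page_size_out)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *p = NULL;

    if (ps <= 0 || posix_memalign(&p, (size_t)ps, npages * (size_t)ps) != 0)
        return NULL;
    assert((uintptr_t)p % (uintptr_t)ps == 0);   /* no manual alignment needed */
    *page_size_out = (size_t)ps;
    return p;
}

/* Make the region read-only (writable == 0) or read-write again.
   Returns 0 on success, -1 on error, like mprotect(). */
static int set_writable(void *p, size_t len, int writable)
{
    return mprotect(p, len, writable ? (PROT_READ | PROT_WRITE) : PROT_READ);
}
```

Restore PROT_READ|PROT_WRITE with set_writable(p, len, 1) before calling free(p), since the allocator may write to the block once it is released.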

Resources