Linux: buddy system free memory - c

Could anyone explain this code?
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
page_to_pfn() has already returned the page index, so what is the '&' for? Or does page_to_pfn() return something else?

You need to know that x & ((1 << n) - 1) is a trick meaning x % ((int) pow(2, n)) for non-negative x. It's often faster (but it's better to leave this kind of optimization to the compiler).
So in this case it does a modulo by pow(2, MAX_ORDER). This causes a wrap-around; once page_idx reaches pow(2, MAX_ORDER) it goes back to 0. Here is equivalent, but more readable code:
const int MAX_ORDER_N = (int) pow(2, MAX_ORDER);
page_idx = page_to_pfn(page);
/* wraparound */
while (page_idx >= MAX_ORDER_N) {
page_idx -= MAX_ORDER_N;
}

It's a bit mask that ensures that page_idx stays below a certain value (2^MAX_ORDER).
# define MAX_ORDER (8)
(1 << MAX_ORDER) /* 100000000 */
- 1 /* subtracting 1 sets every bit below that single 1 bit: 011111111 */
So you only have the eight least significant bits left:
1010010101001
& 0000011111111
= 0000010101001
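As a minimal sketch of the masking shown above (the helper names low_bits and low_bits_mod are made up for illustration and are not kernel functions), the mask and the remainder operator give the same result for unsigned values:

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `bits` bits of x (bits must be in 0..31). */
static uint32_t low_bits(uint32_t x, unsigned bits)
{
    assert(bits < 32);
    return x & ((UINT32_C(1) << bits) - 1);
}

/* Same result computed with the remainder operator, for comparison. */
static uint32_t low_bits_mod(uint32_t x, unsigned bits)
{
    assert(bits < 32);
    return x % (UINT32_C(1) << bits);
}
```

With the bit pattern from the example, low_bits(0x14A9, 8) keeps 0xA9, which is 10101001 in binary.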

Check this function and it will be clear:
static inline struct page *
__page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order)
{
unsigned long buddy_idx = page_idx ^ (1 << order);
return page + (buddy_idx - page_idx);
}
It just limits page_idx to a range of 8MB, maybe because the maximum block size is 4MB (1024 pages) and such a block cannot be merged again; only 2MB blocks can merge into 4MB, and the buddy block can be before or after the page, so is the whole range [page_idx - 2MB, page_idx + 2MB]?
The absolute value of page_idx is not important, but the offset (buddy_idx - page_idx) is; add it to page to get the real buddy address.
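A small sketch of the index arithmetic behind __page_find_buddy, working on plain indices rather than struct page pointers. The names buddy_index and merged_index are hypothetical helpers for illustration, not kernel functions:

```c
#include <assert.h>

/* Index of the buddy of the block that starts at page_idx and has the given order. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    return page_idx ^ (1UL << order);   /* flip the bit that distinguishes the two buddies */
}

/* Index of the merged block: the lower of the two buddies. */
static unsigned long merged_index(unsigned long page_idx, unsigned int order)
{
    return page_idx & buddy_index(page_idx, order);
}
```

For example, at order 2 the blocks starting at indices 8 and 12 are buddies, and the merged order-3 block starts at 8. The XOR only ever changes one bit, which is why the offset (buddy_idx - page_idx) is always plus or minus 2^order.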

Related

Using bit shifting with rand() to allow for a larger random range

I am reviewing a function to generate keys for a Radix Map and found the implementation of rand() to be novel to me.
Here is the function:
static int make_random(RadixMap *map)
{
size_t i = 0;
for (i = 0; i < map->max - 1; i++){
uint32_t key = (uint32_t) (rand() | (rand() << 16)); /* <-- this was interesting */
check(RadixMap_add(map, key, i) == 0, "Failed to add key %u", key);
}
return i;
error:
return 0;
}
----- Type definitions --------
typedef union RMElement {
uint64_t raw;
struct {
uint32_t key;
uint32_t value;
} data;
} RMElement;
typedef struct RadixMap {
size_t max;
size_t end;
uint32_t counter;
RMElement *contents;
RMElement *temp;
} RadixMap;
from ex35 Learn C the Hard Way by Zed Shaw
The specific part I found interesting was
uint32_t key = (uint32_t) (rand() | (rand() << 16)); /* <-- this was interesting */
It is interesting to me because it would have been possible to simply do ..
uint32_t key = rand();
As RAND_MAX (0x7FFFFFFF) is less than uint32_t MAX (0xFFFFFFFF)
The bit shifting implementation looks to have the following advantages.
Allows for a larger random value range, 0xFFFFFFFF vs 0x7FFFFFFF
Values (other than initial 0) are at least 5 digits decimal (65537) (0x10001)
Reduced probability of seeing "0".
And the following disadvantage
Increased code complexity?
Are there other reasons for using this bit shift implementation of rand()?
I've been trying to hash out the reason for using this implementation in my code review and wanted to make sure I was on the right track with my thinking.
The C standard only guarantees that RAND_MAX is at least 32767. This code accounts for that by calling rand twice and shifting to ensure it gets at least 30 bits of randomness.
However, this does not properly account for the case where RAND_MAX is larger.
The rand function returns an int which is signed. If RAND_MAX was the same as INT_MAX, rand() << 16 would most likely shift a "1" bit into the sign bit, triggering undefined behavior.
The proper way to implement this to handle both cases is:
uint32_t key = rand() | ((uint32_t)rand() << 16);
Since left shifting an unsigned number is well defined as long as the shift amount is less than the size of the type.
Or better yet:
uint32_t key = (((uint32_t)rand() & 0x7FFF) << 17) |
(((uint32_t)rand() & 0x7FFF) << 2) |
((uint32_t)rand() & 0x3);
To get a full 32 bits of randomness.
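To make the 15 + 15 + 2 bit layout above easier to check, here is a sketch that separates the packing from the calls to rand(). The names pack32, pack32_sanity and rand32_sketch are invented for this illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Pack three draws into 32 bits, using only the low 15, 15 and 2 bits of each. */
static uint32_t pack32(uint32_t a, uint32_t b, uint32_t c)
{
    return ((a & 0x7FFF) << 17) | ((b & 0x7FFF) << 2) | (c & 0x3);
}

/* The three fields do not overlap and together cover all 32 bits. */
static void pack32_sanity(void)
{
    assert(pack32(0x7FFF, 0, 0) == 0xFFFE0000u);
    assert(pack32(0, 0x7FFF, 0) == 0x0001FFFCu);
    assert(pack32(0, 0, 0x3) == 0x00000003u);
}

/* Three calls to rand(); the casts happen before any shifting. */
static uint32_t rand32_sketch(void)
{
    uint32_t a = (uint32_t)rand();
    uint32_t b = (uint32_t)rand();
    uint32_t c = (uint32_t)rand();
    return pack32(a, b, c);
}
```

Because every operand is converted to uint32_t before it is shifted, none of the shifts can touch a sign bit, whatever RAND_MAX is.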
uint32_t key = (uint32_t) (rand() | (rand() << 16)); has shortcomings.
Not uniform when RAND_MAX != 65535, which is the usual case.
Undefined behavior when int is 16 bit. Also UB in other cases due to signed integer overflow possibilities with rand() << 16
The cast is too late to protect against a narrow int. Effectively the same as uint32_t key = rand() | (rand() << 16); Something like uint32_t key = rand() + rand() * (RAND_MAX + (uint32_t)1); would make a bit more sense.
A key failing is using | to combine the values when the number of bits zeroed on the right is not the same as the bit-width of RAND_MAX.
2nd weakness is assuming shifting is better than multiplying by a power-of-2. A good compiler emits efficient code either way.
Instead, call your random function (1, 2 or 3 times) as needed based on its RAND_MAX. Below works well when RAND_MAX is a Mersenne number.
See Is there any way to compute the width of an integer type at compile-time?.
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
// Bit width of RAND_MAX, which is at least 15
#define RAND_MAX_BITS IMAX_BITS(RAND_MAX)
_Static_assert(((RAND_MAX + 1u) & RAND_MAX) == 0, "RAND_MAX is not a Mersenne number");
uint32_t rand32(void) {
uint32_t r = rand();
#if RAND_MAX_BITS < 32
r = (r << RAND_MAX_BITS) | rand();
#endif
#if RAND_MAX_BITS*2 < 32
r = (r << RAND_MAX_BITS) | rand();
#endif
return r;
}
(Bit shifting) Increased code complexity?
No.
Are there other reasons for using this bit shift implementation of rand()?
OP's code is not uniform as it generally favors one bits with its potential or-ing of bits past the 15th.
I've been trying to hash out the reason for using this implementation ...
Do not use it.
Or, you could just use a really fast random number generator.
Be careful not to seed it with values that have too many zero bytes.
uint64_t
xorshift128plus(uint64_t seed[2])
{
uint64_t x = seed[0];
uint64_t y = seed[1];
seed[0] = y;
x ^= x << 23;
seed[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
return seed[1] + y;
}
Convert the result to float, or just take it modulo your max int value...
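The two conversions mentioned above could look roughly like this. It is only a sketch: to_unit_double and to_range are hypothetical helper names, and the modulo version is slightly biased unless the bound divides 2^64:

```c
#include <assert.h>
#include <stdint.h>

/* Map a 64-bit random value to a double in [0, 1) using its top 53 bits. */
static double to_unit_double(uint64_t r)
{
    return (double)(r >> 11) * (1.0 / 9007199254740992.0);   /* divide by 2^53 */
}

/* Map a 64-bit random value to 0..bound-1 by taking the remainder. */
static uint64_t to_range(uint64_t r, uint64_t bound)
{
    assert(bound != 0);
    return r % bound;
}
```

A call such as to_range(xorshift128plus(seed), 100) would then give a value from 0 to 99.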

Most efficient way to set n consecutive bits to 1?

I want to get a function that will set the n last bits of a numerical type to 1. For example:
bitmask (5) = 0b11111 = 31
bitmask (0) = 0
I, first, had this implementation (mask_t is just a typedef around uint64_t):
mask_t bitmask (unsigned short n) {
return (((mask_t) 1) << n) - 1;
}
Everything is fine except when the function hit bitmask (64) (the size of mask_t), then I get bitmask (64) = 0 in place of 64 bits set to 1.
So, I have two questions:
Why do I have this behavior ? Pushing the 1 by 64 shifts on the left should clear the register and remain with 0, then applying the -1 should fill the register with 1s...
What is the proper way to achieve this function ?
Yes this is a well known problem. There are easy ways to implement this function over the range 0..63 and over the range 1..64 (one way has been mentioned in the comments), but 0..64 is more difficult.
Of course you can just take either the "left shifting" or "right shifting" mask generation and then special-case the "missing" n,
uint64_t bitmask (unsigned short n) {
if (n == 64) return -((uint64_t)1);
return (((uint64_t) 1) << n) - 1;
}
Or
uint64_t bitmask (unsigned short n) {
if (n == 0) return 0;
uint64_t full = ~(uint64_t)0;
return full >> (64 - n);
}
Either way tends to compile to a branch, though it technically doesn't have to.
You can do it without if (not tested)
uint64_t bitmask (unsigned int n) {
uint64_t x = (n ^ 64) >> 6;
return (x << (n & 63)) - 1;
}
The idea here is that we're going to either shift 1 left by some amount the same as in your original code, or 0 in the case that n = 64. Shifting 0 left by 0 is just going to be 0 again, subtracting 1 sets all 64 bits.
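Since the branch-free version above is marked as not tested, here is the same expression wrapped so that the edge cases can be checked. The name bitmask_branchless and the assertion on n are additions for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* n low bits set to 1, for n in 0..64, without a conditional. */
static uint64_t bitmask_branchless(unsigned int n)
{
    assert(n <= 64);
    uint64_t x = (n ^ 64u) >> 6;     /* 1 for n in 0..63, 0 for n == 64 */
    return (x << (n & 63u)) - 1;     /* the shift count is always below 64 */
}
```

For n = 64 the value being shifted is 0 and the shift count is 0, so the subtraction wraps around to all 64 bits set; every other n behaves like the original (1 << n) - 1.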
Alternatively if you're on a modern x64 platform and BZHI is available, a very fast (BZHI is fast on all CPUs that implement it) but limited-portability option is:
uint64_t bitmask (unsigned int n) {
return _bzhi_u64(~(uint64_t)0, n);
}
This is even well-defined for n > 64, the actual count of 1's will be min(n & 0xFF, 64) because BZHI saturates but it reads only the lowest byte of the index.
You cannot left shift by a value larger than or equal to the bit width of the type in question. Doing so invokes undefined behavior.
From section 6.5.7 of the C standard:
2 The integer promotions are performed on each of the operands. The
type of the result is that of the promoted left operand. If the value
of the right operand is negative or is greater than or equal to the
width of the promoted left operand, the behavior is undefined.
You'll need to add a check for this in your code:
mask_t bitmask (unsigned short n) {
if (n >= 64) {
return ~(mask_t)0;
} else {
return (((mask_t) 1) << n) - 1;
}
}
Finally, just for your information, I ended up by writing:
mask_t bitmask (unsigned short n) {
return (n < (sizeof (mask_t) * CHAR_BIT)) ? (((mask_t) 1) << n) - 1 : -1;
}
But, the answer of harold is so complete and well explained that I will select it as the answer.

How to check if a set of bits are 0 from a given position?

I'm writing a bitmap physical memory manager and I want to implement a function that checks whether n bits are free starting from a specific bit.
Right now I use this function, which checks if a single bit is free, and I call it n times to see if n bits are free, but I think it is not very efficient to do it this way:
inline static bool physical_memory_map_test(uint32_t bit)
{
return physical_memory.blocks[bit/32] & (1 << bit % 32);
}
So I want to implement something like this (the "" contains pseudo code):
static bool physical_memory_map_test(uint32_t starting_bit, uint32_t count)
{
int excess = (starting_bit%32 + count) -32;
if(excess < 0)
return (physical_memory.blocks[bit/32] & "-excess number of 1s" << bit % 32)) && (physical_memory.blocks[bit/32] & "count + excess number of 1s" << bit % 32));
return physical_memory.blocks[bit/32] & ("count number of ones, if count is 3, this should be 111" << bit % 32);
}
Or something better to check if all the bits are 0 (return true) or if at least one of them is a 1 (return false).
How could I do that?
Since you are checking a range of uint32_t words, you will end up with a loop. Your task is to make it loop by 32 bits instead of looping by 1 bit.
You need to check partial 32-bit words at both ends:
In order to do that you need to construct a mask with the k lower bits set to 1, and the upper (32-k) bits set to 0. You can do it like this (for k < 32, since shifting a 32-bit value by 32 is undefined):
uint32_t mask_K = ~(~0U << k);
Use
if (block & mask_K)
to test the lower k bits;
if (block & ~mask_K)
tests the upper (32-k) bits.
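Putting the pieces together, a word-at-a-time range check might look like the sketch below. It takes the bitmap as a parameter instead of using the global physical_memory.blocks, and the names range_mask and bits_all_clear are made up for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with bits [lo, lo + len) set; requires len >= 1 and lo + len <= 32. */
static uint32_t range_mask(unsigned lo, unsigned len)
{
    assert(len >= 1 && lo + len <= 32);
    return (uint32_t)((UINT32_MAX >> (32 - len)) << lo);
}

/* Return true if `count` bits starting at bit `start` are all 0 in `blocks`. */
static bool bits_all_clear(const uint32_t *blocks, uint32_t start, uint32_t count)
{
    uint32_t word = start / 32;
    unsigned offset = start % 32;

    while (count > 0) {
        /* Number of bits to examine in the current word. */
        unsigned len = 32 - offset;
        if (len > count)
            len = (unsigned)count;

        if (blocks[word] & range_mask(offset, len))
            return false;

        count -= len;
        offset = 0;   /* every later word is scanned from bit 0 */
        word++;
    }
    return true;
}
```

The loop runs once per 32-bit word touched by the range, so the partial first and last words and the full words in between all go through the same masked test.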

Insert bit into uint16_t

Is there any efficient algorithm that allows inserting the bit bit at position index when working with uint16_t? I've tried reading bit-by-bit after index, storing all such bits into an array of char, changing the bit at index, increasing index, and then looping again, inserting bits from the array, but could there be a better way? I know how to get, set, unset or toggle a specific bit, but I suppose there could be a better algorithm than processing bit-by-bit.
uint16_t bit_insert(uint16_t word, int bit, int index);
bit_insert(0b0000111111111110, 1, 1); /* must return 0b0100011111111111 */
P.S. The solution must be in pure ANSI-compatible C. I know that 0b prefix may be specific to gcc, but I've used it here to make things more obvious.
Use bitwise operators:
#define BIT_INSERT(word, bit, index) \
(((word) & (~(1U << (index)))) | ((bit) << (index)))
#include <errno.h>
#include <stdint.h>
/* Insert a bit `idx' positions from the right (lsb). */
uint16_t
bit_insert_lsb(uint16_t n, int bit, int idx)
{
uint16_t lower;
if (idx < 0 || idx > 15) {
errno = ERANGE;
return 0U;
}
/* Get bits 0 to `idx' inclusive. */
lower = n & ((1U << (idx + 1)) - 1);
return ((n & ~lower) | ((!!bit) << idx) | (lower >> 1));
}
/* Insert a bit `idx' positions from the left (msb). */
uint16_t
bit_insert_msb(uint16_t n, int bit, int idx)
{
uint16_t lower;
if (idx < 0 || idx > 15) {
errno = ERANGE;
return 0U;
}
/* Get bits 0 to `15 - idx' inclusive. */
lower = n & ((1U << (15 - idx + 1)) - 1);
return ((n & ~lower) | ((!!bit) << (15 - idx)) | (lower >> 1));
}
Bits are typically counted from the right, where the least significant bit (lsb) resides, to the left, where the most significant bit (msb) is located. I allowed for insertion from either side by creating two functions. The one expected, according to the question, is bit_insert_msb.
Both functions perform a sanity check, setting errno to ERANGE and returning 0 if the value of idx is too large. I also provided some of C99's _Bool behaviour for the bit parameter in the return statements: 0 is 0 and any other value is 1. If you use a C99 compiler, I'd recommend changing bit's type to _Bool. You can then replace (!!bit) with bit directly.
I'd love to say it could be optimised, but that could very well make it less comprehensible.
Happy coding!
If you're counting bits from the left (with index starting at 1 for the MSB):
mask = (1 << (16 - index + 1)) - 1; // all 1s from bit "index" to LSB
// MSB of word (from left to index) | insert bit at index | LSB of word from (index-1)
word = (word & ~mask) | (bit << (16 - index)) | ((word & mask) >> 1);
There may be many more efficient ways, but this one is easy to understand.

Emulating variable bit-shift using only constant shifts?

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
I'm not even a little bit worried about portability
PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(int a, int b, int c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
integer multiplication is also microcoded and even slower than sraw. :-(
On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site where he suggests to avoid the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for indirect shift, it takes way less memory for the code (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=(nVal&bMask1) + nVal; //nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
EDIT: Note on isel()
I saw your isel() code on your website.
// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
return x + ((y - x) & mask);
}
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
return (x & (~mask)) + (y & mask);
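Written out in full, the mask-and-complement variant could look like this. It is a sketch that assumes 32-bit operands and that right-shifting a negative value is an arithmetic shift (which C leaves implementation-defined); the name isel_masked is chosen here only to keep it apart from the original:

```c
#include <assert.h>
#include <stdint.h>

/* If a >= 0 return x, else y, using a mask and its complement. */
static int32_t isel_masked(int32_t a, int32_t x, int32_t y)
{
    int32_t mask = a >> 31;              /* all ones if a < 0, otherwise 0 */
    return (x & ~mask) + (y & mask);     /* exactly one of the two terms is non-zero-able */
}
```

One of the two masked terms is always 0, so the addition never carries and could equally be written with |.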
How about this:
if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;
will probably take longer yet to execute but easier to interleave if you have other code to go between.
Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2^N if the bit whose value is 2^N is set in y, otherwise keep x intact. Here is one way to do it:
#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x);
The macro assigns to x either x << 2ᴺ or x, depending on whether the Nth bit is set in y or not.
And then the driver:
SHIFT(0); SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4);
Note that N is a macro variable and becomes constant.
Don't know though if this is going to be actually faster than the variable shift. If it would be, one wonders why the microcode wouldn't run this instead...
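Here is the macro idea as a complete function, sketched with unsigned operands so that every shift is well defined. isel_u32 is a plain-C stand-in for the branchless select (on the real target it would be the intrinsic), and both it and shift_left_var are names invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the branchless select: returns b if a >= 0, else c. */
static uint32_t isel_u32(int32_t a, uint32_t b, uint32_t c)
{
    return a >= 0 ? b : c;
}

/* Shift x left by y (0..31) using only constant shifts and selects. */
static uint32_t shift_left_var(uint32_t x, unsigned y)
{
    assert(y < 32);
    /* Step N shifts by the constant 2^N when bit N of y is set. */
#define SHIFT_STEP(N) \
    x = isel_u32((int32_t)((y >> (N)) & 1u) - 1, x << (1u << (N)), x)
    SHIFT_STEP(0);
    SHIFT_STEP(1);
    SHIFT_STEP(2);
    SHIFT_STEP(3);
    SHIFT_STEP(4);
#undef SHIFT_STEP
    return x;
}
```

The five steps shift by 1, 2, 4, 8 and 16, so any count from 0 to 31 is reached by the steps whose bits are set in y.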
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 4, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
int multiplicands[] = { 1, 2, 4, 8, 16, 32, ... etc ...};
int ShiftByVar( int x, int y )
{
//return x << y;
return x * multiplicands[y];
}
If the shift count can be calculated far in advance then I have two ideas that might work
Using self-modifying code
Just modify the shift amount immediate in the instruction. Alternatively generate code dynamically for the functions with variable shift
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...
shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
int arr[MAX]; // all the values that need to be shifted with the same amount
shiftFunc shl = shiftLeft[3]; // when you want to shift by 3
for (int i = 0; i < MAX; i++)
arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, lookup table is another possible solution
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[8][256] = {
{ S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
{ S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
...
{ S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};
Now x << n is simply shl[n][x] with x being a uint8_t. The table costs 2KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 2MB table (16 × 64 K entries × 2 B), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
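If building the table with macros feels fragile, the same 8-bit lookup table can be filled in at start-up instead. This is a sketch with its own layout, one row per shift count; the names shl8, shl8_init and shl8_lookup are assumptions made for this example:

```c
#include <assert.h>
#include <stdint.h>

/* shl8[n][x] holds (uint8_t)(x << n); 8 rows of 256 bytes, 2KB in total. */
static uint8_t shl8[8][256];

/* Fill the table once, before the first lookup. */
static void shl8_init(void)
{
    for (unsigned n = 0; n < 8; n++)
        for (unsigned x = 0; x < 256; x++)
            shl8[n][x] = (uint8_t)(x << n);
}

/* Variable shift of an 8-bit value as a single table load. */
static uint8_t shl8_lookup(uint8_t x, unsigned n)
{
    assert(n < 8);
    return shl8[n][x];
}
```

The initialisation itself uses variable shifts, but it runs only once; the hot path afterwards is just an indexed load.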
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.
Here's something that is trivially unrollable:
int result= value;
for (int i= 0; i<5; ++i)
{
const int mask= -(k & 1); // all ones if this bit of k is set; replace with isel if appropriate
result= ((result << (1 << i)) & mask) | (result & ~mask); // 1 << i is a constant once unrolled
k >>= 1;
}

Resources