How to align a pointer in C - c

Is there a way to align a pointer in C? Suppose I'm writing data to an array stack (so the pointer goes downward) and I want the next data I write to be 4-aligned so the data is written at a memory location which is a multiple of 4, how would I do that?
I have
uint8_t ary[1024];
ary = ary+1024;
ary -= /* ... */
Now suppose that ary points at location 0x05. I want it to point to 0x04.
Now I could just do
ary -= (ary % 4);
but C doesn't allow modulo on pointers. Is there any solution that is architecture independent?

Arrays are NOT pointers, despite anything you may have read in misguided answers here (meaning this question in particular or Stack Overflow in general — or anywhere else).
You cannot alter the value represented by the name of an array as shown.
What is confusing, perhaps, is that if ary is a function parameter, it will appear that you can adjust the array:
void function(uint8_t ary[1024])
{
ary += 213; // No problem because ary is a uint8_t pointer, not an array
...
}
Arrays as parameters to functions are different from arrays defined either outside a function or inside a function.
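To illustrate the difference, here is a minimal sketch. The function names skip_bytes and offset_after_skip are made up for this example and are not part of the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Declared with array syntax, but the parameter is adjusted to uint8_t *. */
static uint8_t *skip_bytes(uint8_t ary[1024], size_t n)
{
    assert(ary != NULL);
    ary += n;               /* legal: ary is a pointer here */
    return ary;
}

static size_t offset_after_skip(void)
{
    uint8_t buf[1024];
    /* buf += 213; would not compile: an array is not a modifiable lvalue */
    uint8_t *p = skip_bytes(buf, 213);
    return (size_t)(p - buf);
}
```

Inside skip_bytes the name ary can be reassigned; in offset_after_skip a separate pointer variable is needed to walk through buf.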
You can do:
uint8_t ary[1024];
uint8_t *stack = ary + 510;
uintptr_t addr = (uintptr_t)stack;
if (addr % 8 != 0)
addr += 8 - addr % 8;
stack = (uint8_t *)addr;
This ensures that the value in stack is aligned on an 8-byte boundary, rounded up. Your question asks for rounding down to a 4-byte boundary, so the code changes to:
if (addr % 4 != 0)
addr -= addr % 4;
stack = (uint8_t *)addr;
Yes, you can do that with bit masks too. Either:
addr = (addr + (8 - 1)) & -8; // Round up to 8-byte boundary
or:
addr &= -4; // Round down to a 4-byte boundary
This only works correctly if the alignment (the value negated to form the mask) is a power of two, not for arbitrary values. The code with modulus operations will work correctly for any (positive) modulus.
See also: How to allocate aligned memory using only the standard library.
Demo code
Gnzlbg commented:
The code for a power of two breaks if I try to align e.g. uintptr_t(2) up to a 1 byte boundary (both are powers of 2: 2^1 and 2^0). The result is 1 but should be 2 since 2 is already aligned to a 1 byte boundary.
This code demonstrates that the alignment code is OK — as long as you interpret the comments just above correctly (now clarified by the 'either or' words separating the bit masking operations; I got caught when first checking the code).
The alignment functions could be written more compactly, especially without the assertions, but the compiler will optimize to produce the same code from what is written and what could be written. Some of the assertions could be made more stringent, too. And maybe the test function should print out the base address of the stack before doing anything else.
The code could, and maybe should, check that there won't be numeric overflow or underflow with the arithmetic. This would more likely be a problem if you aligned addresses to a multi-megabyte boundary; while you keep alignments under 1 KiB, you're unlikely to find a problem if you're not attempting to go out of bounds of the arrays you have access to. (Strictly, even if you do multi-megabyte alignments, you won't run into trouble if the result will be within the range of memory allocated to the array you're manipulating.)
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
/*
** Because the test code works with pointers to functions, the inline
** function qualifier is moot. In 'real' code using the functions, the
** inline might be useful.
*/
/* Align upwards - arithmetic mode (hence _a) */
static inline uint8_t *align_upwards_a(uint8_t *stack, uintptr_t align)
{
assert(align > 0 && (align & (align - 1)) == 0); /* Power of 2 */
assert(stack != 0);
uintptr_t addr = (uintptr_t)stack;
if (addr % align != 0)
addr += align - addr % align;
assert(addr >= (uintptr_t)stack);
return (uint8_t *)addr;
}
/* Align upwards - bit mask mode (hence _b) */
static inline uint8_t *align_upwards_b(uint8_t *stack, uintptr_t align)
{
assert(align > 0 && (align & (align - 1)) == 0); /* Power of 2 */
assert(stack != 0);
uintptr_t addr = (uintptr_t)stack;
addr = (addr + (align - 1)) & -align; // Round up to align-byte boundary
assert(addr >= (uintptr_t)stack);
return (uint8_t *)addr;
}
/* Align downwards - arithmetic mode (hence _a) */
static inline uint8_t *align_downwards_a(uint8_t *stack, uintptr_t align)
{
assert(align > 0 && (align & (align - 1)) == 0); /* Power of 2 */
assert(stack != 0);
uintptr_t addr = (uintptr_t)stack;
addr -= addr % align;
assert(addr <= (uintptr_t)stack);
return (uint8_t *)addr;
}
/* Align downwards - bit mask mode (hence _b) */
static inline uint8_t *align_downwards_b(uint8_t *stack, uintptr_t align)
{
assert(align > 0 && (align & (align - 1)) == 0); /* Power of 2 */
assert(stack != 0);
uintptr_t addr = (uintptr_t)stack;
addr &= -align; // Round down to align-byte boundary
assert(addr <= (uintptr_t)stack);
return (uint8_t *)addr;
}
static inline int inc_mod(int x, int n)
{
assert(x >= 0 && x < n);
if (++x >= n)
x = 0;
return x;
}
typedef uint8_t *(*Aligner)(uint8_t *addr, uintptr_t align);
static void test_aligners(const char *tag, Aligner align_a, Aligner align_b)
{
const int align[] = { 64, 32, 16, 8, 4, 2, 1 };
enum { NUM_ALIGN = sizeof(align) / sizeof(align[0]) };
uint8_t stack[1024];
uint8_t *sp = stack + sizeof(stack);
int dec = 1;
int a_idx = 0;
printf("%s\n", tag);
while (sp > stack)
{
sp -= dec++;
uint8_t *sp_a = (*align_a)(sp, align[a_idx]);
uint8_t *sp_b = (*align_b)(sp, align[a_idx]);
printf("old %p, adj %.2d, A %p, B %p\n",
(void *)sp, align[a_idx], (void *)sp_a, (void *)sp_b);
assert(sp_a == sp_b);
sp = sp_a;
a_idx = inc_mod(a_idx, NUM_ALIGN);
}
putchar('\n');
}
int main(void)
{
test_aligners("Align upwards", align_upwards_a, align_upwards_b);
test_aligners("Align downwards", align_downwards_a, align_downwards_b);
return 0;
}
Sample output (partially truncated):
Align upwards
old 0x7fff5ebcf4af, adj 64, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4be, adj 32, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4bd, adj 16, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4bc, adj 08, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4bb, adj 04, A 0x7fff5ebcf4bc, B 0x7fff5ebcf4bc
old 0x7fff5ebcf4b6, adj 02, A 0x7fff5ebcf4b6, B 0x7fff5ebcf4b6
old 0x7fff5ebcf4af, adj 01, A 0x7fff5ebcf4af, B 0x7fff5ebcf4af
old 0x7fff5ebcf4a7, adj 64, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4b7, adj 32, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4b6, adj 16, A 0x7fff5ebcf4c0, B 0x7fff5ebcf4c0
old 0x7fff5ebcf4b5, adj 08, A 0x7fff5ebcf4b8, B 0x7fff5ebcf4b8
old 0x7fff5ebcf4ac, adj 04, A 0x7fff5ebcf4ac, B 0x7fff5ebcf4ac
old 0x7fff5ebcf49f, adj 02, A 0x7fff5ebcf4a0, B 0x7fff5ebcf4a0
old 0x7fff5ebcf492, adj 01, A 0x7fff5ebcf492, B 0x7fff5ebcf492
…
old 0x7fff5ebcf0fb, adj 08, A 0x7fff5ebcf100, B 0x7fff5ebcf100
old 0x7fff5ebcf0ca, adj 04, A 0x7fff5ebcf0cc, B 0x7fff5ebcf0cc
old 0x7fff5ebcf095, adj 02, A 0x7fff5ebcf096, B 0x7fff5ebcf096
Align downwards
old 0x7fff5ebcf4af, adj 64, A 0x7fff5ebcf480, B 0x7fff5ebcf480
old 0x7fff5ebcf47e, adj 32, A 0x7fff5ebcf460, B 0x7fff5ebcf460
old 0x7fff5ebcf45d, adj 16, A 0x7fff5ebcf450, B 0x7fff5ebcf450
old 0x7fff5ebcf44c, adj 08, A 0x7fff5ebcf448, B 0x7fff5ebcf448
old 0x7fff5ebcf443, adj 04, A 0x7fff5ebcf440, B 0x7fff5ebcf440
old 0x7fff5ebcf43a, adj 02, A 0x7fff5ebcf43a, B 0x7fff5ebcf43a
old 0x7fff5ebcf433, adj 01, A 0x7fff5ebcf433, B 0x7fff5ebcf433
old 0x7fff5ebcf42b, adj 64, A 0x7fff5ebcf400, B 0x7fff5ebcf400
old 0x7fff5ebcf3f7, adj 32, A 0x7fff5ebcf3e0, B 0x7fff5ebcf3e0
old 0x7fff5ebcf3d6, adj 16, A 0x7fff5ebcf3d0, B 0x7fff5ebcf3d0
old 0x7fff5ebcf3c5, adj 08, A 0x7fff5ebcf3c0, B 0x7fff5ebcf3c0
old 0x7fff5ebcf3b4, adj 04, A 0x7fff5ebcf3b4, B 0x7fff5ebcf3b4
old 0x7fff5ebcf3a7, adj 02, A 0x7fff5ebcf3a6, B 0x7fff5ebcf3a6
old 0x7fff5ebcf398, adj 01, A 0x7fff5ebcf398, B 0x7fff5ebcf398
…
old 0x7fff5ebcf0f7, adj 01, A 0x7fff5ebcf0f7, B 0x7fff5ebcf0f7
old 0x7fff5ebcf0d3, adj 64, A 0x7fff5ebcf0c0, B 0x7fff5ebcf0c0
old 0x7fff5ebcf09b, adj 32, A 0x7fff5ebcf080, B 0x7fff5ebcf080
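The overflow check mentioned above could be sketched roughly as follows. This is an illustration, not part of the demo program; the name align_up_checked and its bool-plus-out-parameter interface are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr up to a multiple of align (a power of two).
** Returns false and leaves *out untouched if the result would wrap
** past UINTPTR_MAX. */
static bool align_up_checked(uintptr_t addr, uintptr_t align, uintptr_t *out)
{
    assert(align > 0 && (align & (align - 1)) == 0); /* Power of 2 */
    assert(out != 0);
    /* Bytes needed to reach the next boundary (0 if already aligned) */
    uintptr_t pad = ((uintptr_t)0 - addr) & (align - 1);
    if (pad > UINTPTR_MAX - addr)
        return false;
    *out = addr + pad;
    return true;
}
```

Rounding down cannot overflow, so only the upward direction needs the check.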

DO NOT USE MODULO!!! IT IS REALLY SLOW!!! Hands down the fastest way to align a pointer is to use 2's complement math. You need to invert the bits, add one, and mask off the 2 (for 32-bit) or 3 (for 64-bit) least significant bits. The result is an offset that you then add to the pointer value to align it. This works great for 32- and 64-bit numbers. For 16-bit alignment, just mask the pointer with 0x1 and add that value. The algorithm works identically in any language, but as you can see, Embedded C++ is vastly superior to C in every way, shape, and form.
#include <cstdint>
/** Returns the number to add to align the given pointer to a 8, 16, 32, or 64-bit
boundary.
@author Cale McCollough.
@param ptr The address to align.
@return The offset to add to the ptr to align it. */
template<typename T>
inline uintptr_t MemoryAlignOffset (const void* ptr) {
return ((~reinterpret_cast<uintptr_t> (ptr)) + 1) & (sizeof (T) - 1);
}
/** Word aligns the given byte pointer up in addresses.
@author Cale McCollough.
@param ptr Pointer to align.
@return Next word aligned up pointer. */
template<typename T>
inline T* MemoryAlign (T* ptr) {
uintptr_t offset = MemoryAlignOffset<uintptr_t> (ptr);
char* aligned_ptr = reinterpret_cast<char*> (ptr) + offset;
return reinterpret_cast<T*> (aligned_ptr);
}
For a detailed write-up and proofs, please @see https://github.com/kabuki-starship/kabuki-toolkit/wiki/Fastest-Method-to-Align-Pointers. If you would like to see proof of why you should never use modulo, I invented the world's fastest integer-to-string algorithm. The benchmark in the paper shows you the effect of optimizing away just one modulo instruction. Please @see https://github.com/kabuki-starship/kabuki-toolkit/wiki/Engineering-a-Faster-Integer-to-String-Algorithm.
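For readers working in plain C, the same invert, add one, and mask idea might look like the sketch below. The names align_offset and align_up are hypothetical, and unlike the template above the alignment is passed explicitly and assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes to add to ptr to reach the next align-byte boundary. */
static size_t align_offset(const void *ptr, size_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0); /* power of two */
    /* ~x + 1 is the two's complement negation of x */
    return (size_t)((~(uintptr_t)ptr + 1) & (align - 1));
}

/* Return ptr rounded up to the next align-byte boundary. */
static void *align_up(void *ptr, size_t align)
{
    return (char *)ptr + align_offset(ptr, align);
}
```

The caller is responsible for making sure the rounded-up pointer still lies inside the buffer it came from.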

For some reason I can't use modulo or bitwise operations. In this case:
void *alignAddress = (void*)((((intptr_t)address + align - 1) / align) * align) ;
For C++:
template <int align, typename T>
constexpr T padding(T value)
{
return ((value + align - 1) / align) * align;
}
...
char* alignAddress = reinterpret_cast<char*>(padding<8>(reinterpret_cast<uintptr_t>(address)));

I'm editing this answer because:
I had a bug in my original code (I forgot a typecast to intptr_t), and
I'm replying to Jonathan Leffler's criticism in order to clarify my intent.
The code below is not meant to imply you can change the value of an array (foo). But you can get an aligned pointer into that array, and this example illustrates one way to do it.
#define alignmentBytes ( 1 << 2 ) // == 4, but enforces the idea that alignmentBytes should be a power of two
#define alignmentBytesMinusOne ( alignmentBytes - 1 )
uint8_t foo[ 1024 + alignmentBytesMinusOne ];
uint8_t *fooAligned;
fooAligned = (uint8_t *)((intptr_t)( foo + alignmentBytesMinusOne ) & ~alignmentBytesMinusOne);

Based on tricks learned elsewhere, and one from reading @par's answer, apparently all I needed for my special case, a 32-bit-like machine, is ((size - 1) | 3) + 1. It rounds size up to the next multiple of 4, as the output below shows, and I thought it might be useful for others:
for (size_t size = 0; size < 20; ++size) printf("%zu\n", ((size - 1) | 3) + 1);
0
4
4
4
4
8
8
8
8
12
12
12
12
16
16
16
16
20
20
20
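The same expression generalizes to other power-of-two alignments by replacing the constant 3 with align - 1. A minimal sketch, where the function name round_up_or is invented for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of align, where align is a power of two.
 * size == 0 stays 0 because the unsigned subtraction wraps to SIZE_MAX
 * and the final + 1 wraps back to 0. */
static size_t round_up_or(size_t size, size_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return ((size - 1) | (align - 1)) + 1;
}
```

The trick relies on size being an unsigned type; with a signed type the size - 1 at zero would need separate handling.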

I'm using it to align pointers in C :
#include <inttypes.h>
static inline void * please_align(void * ptr){
char * res __attribute__((aligned(128))) ;
res = (char *)ptr + (128 - (uintptr_t) ptr) % 128;
return res ;
}

Related

What is the fastest way to initialize an array in C with only two bytes for all elements?

Assume that we have an array called:
uint8_t data_8_bit[2] = {color >> 8, color & 0xff};
The data is 16-bit color data. Our goal is to create an array called:
uint8_t data_16_bit[2*n];
Where n is actually the length of 16-bit data array. But the array data_16_bit cannot hold 16-bit values so therefore I have added a 2*n as array size.
Sure, I know that I can fill up the array data_16_bit by using a for-loop:
for(int i = 0; i < n; i++)
for(int j = 0; j < 2; j++)
data_16_bit[2*i + j] = data_8_bit[j];
But there must be a faster way than this?
memset or memcpy?
IMO the easiest one to optimize by the compiler (and very safe as well) is
#include <stddef.h>
#include <stdint.h>
#include <string.h>
void foo(uint16_t color, uint8_t *arr16, size_t n)
{
uint8_t data_8_bit[2] = {color >> 8, color & 0xff};
while(n--)
{
memcpy(arr16 + n * 2, data_8_bit, 2);
}
}
https://godbolt.org/z/8Wh5Pc3aP
It appears that what you are trying to do is ensure that each element of data_16_bit at an even index contains the same value as data_8_bit[0], and each element at an odd index contains the same value as data_8_bit[1].
Standard C does not provide a way to express such a thing via an initializer.
memset() does not, by itself, provide a solution better than plain iteration because you're trying to set the target bytes to alternating values instead of all to the same value.
memcpy() does not yield any simple approach that is much, if any, better than the simple iterative assignments because the source pattern is only two bytes. It would be possible to perform fewer than n calls to memcpy() in the general case, but the code to accomplish that would be fairly complex.
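For reference, one way to get by with fewer than n memcpy() calls is to seed the first two bytes and then repeatedly copy the already-filled prefix onto the unfilled remainder, doubling the filled region each time. This is only a sketch under the assumption that 2 * n does not overflow size_t; the name fill_pattern2 is made up for the example, and whether it actually beats the plain loop would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill dst[0 .. 2*n-1] with the two-byte pattern pat[0], pat[1]. */
static void fill_pattern2(uint8_t *dst, const uint8_t pat[2], size_t n)
{
    size_t total = 2 * n;           /* assumes no overflow */
    if (total == 0)
        return;
    assert(dst != NULL && pat != NULL);
    memcpy(dst, pat, 2);            /* seed the first pair */
    size_t filled = 2;
    while (filled < total) {
        /* Copy at most 'filled' bytes so source and destination never overlap */
        size_t chunk = total - filled;
        if (chunk > filled)
            chunk = filled;
        memcpy(dst + filled, dst, chunk);
        filled += chunk;
    }
}
```

Because the filled length is always even, every copied chunk starts on a pattern boundary, so the alternation is preserved.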
If n is a compile-time constant then the fastest approach is to just write out a full initializer:
uint8_t data_16_bit[2*8] = {
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff,
color >> 8, color & 0xff
};
If n is not a compile time constant then
you should consider using dynamically-allocated memory instead of a VLA, and
you cannot use an initializer.
In that case, something like your for loop is probably about as good as it gets. I would write it like this, though:
for(int i = 0; i < n * 2; i += 2) {
data_16_bit[i] = data_8_bit[0];
data_16_bit[i+1] = data_8_bit[1];
}
Although quite unknown to many, you can use wmemset for this if sizeof(wchar_t) is a multiple of 2 on your platform, for example when it's a 2-byte type:
_Static_assert(sizeof(wchar_t)*CHAR_BIT == 16, "wchar_t must be 16 bits");
wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n);
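If wchar_t is a 4-byte type, as on most *nix platforms, each wide character covers two of the 16-bit values, so only n / 2 of them are written:
_Static_assert(sizeof(wchar_t)*CHAR_BIT == 32, "wchar_t must be 32 bits");
wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
memcpy((char*)&pattern + 2, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n / 2);
if (n % 2) memcpy(data_16_bit + 2*(n - 1), data_8_bit, 2); // odd n: fill the last 16-bit value
If wchar_t is even bigger (extremely unlikely), then just repeat those first steps to create the filling pattern.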
wmemset should be hand-optimized with SIMD in assembly like memset so it'll be extremely fast compared to other solutions where the compiler isn't able to auto-vectorize. For example there are lots of optimized memset and wmemset versions for x86-64 in glibc including SSE2, AVX2 and even AVX-512
A few questions to consider.
Once initialized, will this data be read-only, or modifiable?
Is the number of elements fixed? configurable at build time? or varies at runtime?
Is there a reasonable limit to the maximum number elements?
How many times will you need to initialize your buffer?
Beginning with the third question, it looks like you will have a maximum of 65536 unique sets as you are dealing with 16 bits of data. If you are willing to sacrifice a bit of space for speed, you can create a 64 kB global table that contains all the permutations in the order that you expect. The end result is that you have the table loaded into memory automatically as it would reside in one of the data sections in your executable. How you create/populate this table is up to you. (For example, you could manually create the table, or you could have a dedicated step in your build process that both creates and compiles it into a linkable object file.)
Continuing on, assuming that your table is loaded into memory.
If your pre-populated table is at least as large as what you will ever need, you can either ...
Return a pointer to the table if the contents will never change (and best to have it reside in read-only data section for this case). The benefit is that no more data needs to be copied--you only need to move a pointer around.
Use a memory copying routine such as memcpy() (or a custom one if you don't want what the compiler generates) to copy from the pre-populated table to your desired buffer if the contents are going to change, or if your destination buffer is larger than your 64 kB pre-populated table.
We can store into the uint8_t array using a uint16_t pointer.
This is desirable because the original fill value is 16 bits.
We can even store 64 bits at a time. We replicate the 16 bit color four times to get a uint64_t value. We store into the array using a uint64_t pointer.
This method is what the builtin memset function tries to do [it would try to use XMM registers].
Here's what I came up with.
Note that we can start with byte stores if we need to align to an 8 byte boundary [if the arch requires this], then do wide stores in the middle and revert to char/short stores at the end.
#include <stddef.h>
#include <stdint.h>
void
fill8(uint8_t *data,uint16_t color,size_t len)
{
uint64_t *p64;
uint16_t *p16;
size_t count;
// NOTE: enhancement would be to align the data pointer to 8 byte boundary
// by transferring 16 bit data as below
// get 64 bit color value
uint64_t c64 = 0;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
// get pointer to 64 bit data and its count
p64 = (uint64_t *) data;
count = len / sizeof(c64);
// increment byte pointer and decrement byte length
data += count * sizeof(c64);
len -= count * sizeof(c64);
// transfer data 64 bits at a time
for (; count > 0; --count, ++p64)
*p64 = c64;
// get pointer to 16 bit data
p16 = (uint16_t *) data;
count = len / sizeof(color);
// increment byte pointer and decrement byte length
data += count * sizeof(color);
len -= count * sizeof(color);
// transfer data 16 bits at a time
for (; count > 0; --count, ++p16)
*p16 = color;
}
UPDATE:
It leads to UB. Strict aliasing. If the platform does not allow unaligned access and *data is unaligned, it will result in an exception. – 0___________
Unaligned access is not a "strict aliasing" violation. It is unaligned access. On certain architectures, this will cause an alignment exception [in the hardware]. I addressed this in comments above, but, to be fair, there is a reworked example below that includes the alignment code.
"Note that we can start with byte stores if we need to align to an 8 byte boundary" Alignment has nothing to do with strict aliasing. It doesn't matter where the data that uint8_t *data points to is located - in this function it's of type uint8_t, and referring to it with a uint64_t * pointer unequivocally violates strict aliasing. – Andrew Henle
No, I don't believe it violates "strict aliasing" because it doesn't apply here. You might have more luck with "violates type punning" but I doubt that as well.
The [updated] code [below] is very similar to Freebsd's memset.
And, even if the code violated the "rule", there are known workarounds and exceptions:
https://developers.redhat.com/blog/2020/06/02/the-joys-and-perils-of-c-and-c-aliasing-part-1
https://developers.redhat.com/blog/2020/06/03/the-joys-and-perils-of-aliasing-in-c-and-c-part-2
More detail in the links, but strict aliasing allows the compiler to optimize a function for access to a and b that is:
void func(int *a,long *b)
But, it really wants:
void func(int * restrict a,long * restrict b)
Without the restrict the compiler can't do [dubious] optimizations on the pointers because it can't determine if they overlap.
Here, the compiler can deduce the pointer relationships and generate correct code because of the way the pointers are set/incremented.
If we must, to be [completely] safe, compile with -fno-strict-aliasing [but I don't believe it's required in this instance].
It might be "type punning". But, because the data is unsigned, and the relationships are known (e.g. uint8_t and uint64_t), the generated code will still be correct.
As I mentioned, memset does similar pointer/data manipulations. If the code herein is "bad", then libc is also broken. See the Freebsd memset implementation below.
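Independently of how that disagreement is resolved, one way to sidestep both the alignment and the aliasing question is to do the wide stores through memcpy(), which is defined for objects of any type and any alignment. The following is a sketch, not the code from this answer; fill16_memcpy is a hypothetical name, and it writes each 16-bit value in the machine's native byte order. Compilers commonly turn such fixed-size memcpy() calls into plain store instructions, though that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill len bytes at data with repeated copies of color (native byte order).
 * An odd trailing byte, if any, is left untouched. */
static void fill16_memcpy(uint8_t *data, uint16_t color, size_t len)
{
    assert(data != NULL || len == 0);
    /* Replicate the 16-bit value into all four lanes of a 64-bit word */
    uint64_t c64 = (uint64_t)color * 0x0001000100010001ULL;

    /* Wide part: 8 bytes per iteration */
    for (; len >= sizeof c64; data += sizeof c64, len -= sizeof c64)
        memcpy(data, &c64, sizeof c64);
    /* Tail: 2 bytes per iteration */
    for (; len >= sizeof color; data += sizeof color, len -= sizeof color)
        memcpy(data, &color, sizeof color);
}
```

No alignment prologue is needed here because memcpy() does not require the destination to be aligned.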
The pointer alignment issue, which I addressed in comments, is valid. Here is some reworked code to address the alignment:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <byteswap.h>
#define OFFOF(_ptr) \
((uintptr_t) (_ptr))
int opt_b;
int opt_v;
#define sysfault(_fmt...) \
do { \
printf(_fmt); \
exit(1); \
} while (0)
uint8_t phys[1000000];
void
fill8(uint8_t *data,uint16_t color,size_t len)
{
uint64_t *p64;
uint16_t *p16;
size_t count;
if (opt_b)
color = bswap_16(color);
// align the data pointer to an 8 byte boundary
// by transferring the data one byte at a time
for (; (len > 0) && (OFFOF(data) & 0x07); ++data, --len) {
*data = color;
color = bswap_16(color);
}
// get 64 bit color value
uint64_t c64 = 0;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
c64 = (c64 << 16) | color;
// get pointer to 64 bit data and its count
p64 = (uint64_t *) data;
count = len / sizeof(c64);
// increment byte pointer and decrement byte length
data += count * sizeof(c64);
len -= count * sizeof(c64);
// transfer data 64 bits at a time
for (; count > 0; --count, ++p64)
*p64 = c64;
// transfer data 8 bits at a time
for (; len > 0; ++data, --len) {
*data = color;
color = bswap_16(color);
}
}
void
verify(size_t len,size_t align)
{
uint8_t *data;
size_t idx;
uint8_t val;
data = &phys[0];
for (idx = 0; idx < align; ++idx) {
val = data[idx];
if (val != 0)
sysfault("verify: BEF idx=%zu val=%2.2X\n",idx,val);
}
data = &phys[align];
for (idx = 0; idx < len; ++idx) {
val = data[idx];
if (opt_v) {
printf(" %2.2X",val);
if ((idx % 16) == 15)
printf("\n");
}
if (val == 0)
sysfault("verify: DAT idx=%zu val=%2.2X\n",idx,val);
}
if (opt_v)
printf("\n");
data = &phys[align + len];
for (idx = 0; idx < align; ++idx) {
val = data[idx];
if (val != 0)
sysfault("verify: AFT idx=%zu val=%2.2X\n",idx,val);
}
}
void
dotest(int tstno,size_t len,size_t align)
{
uint8_t *data = phys;
memset(phys,0,sizeof(phys));
while (1) {
uintptr_t off = data - phys;
if (off == align)
break;
++data;
}
if ((tstno > 1) && opt_v)
printf("\n");
printf("T:%d %p L:%zu A:%zu\n",tstno,(void *) data,len,align);
uint16_t color = 0x0102;
fill8(data,color,len);
verify(len,align);
}
int
main(int argc,char **argv)
{
int tstno = 1;
--argc;
++argv;
for (; argc > 0; --argc, ++argv) {
char *cp = *argv;
if (*cp != '-')
break;
cp += 2;
switch(cp[-1]) {
case 'b': // big endian
opt_b = ! opt_b;
break;
case 'v': // verbose check
opt_v = ! opt_v;
break;
}
}
for (size_t len = 1; len <= 128; ++len) {
for (size_t align = 0; align < 8; ++align, ++tstno)
dotest(tstno,len,align);
}
return 0;
}
Here is [one of] Freebsd's memset implementations:
/*-
* Copyright (c) 1990, 1993
* The Regents of the University of California. All rights reserved.
*
* This code is derived from software contributed to Berkeley by
* Mike Hibler and Chris Torek.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
#if defined(LIBC_SCCS) && !defined(lint)
static char sccsid[] = "@(#)memset.c 8.1 (Berkeley) 6/4/93";
#endif /* LIBC_SCCS and not lint */
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
#include <sys/types.h>
#include <limits.h>
#define wsize sizeof(u_int)
#define wmask (wsize - 1)
#ifdef BZERO
#include <strings.h>
#define RETURN return
#define VAL 0
#define WIDEVAL 0
void
bzero(void *dst0, size_t length)
#else
#include <string.h>
#define RETURN return (dst0)
#define VAL c0
#define WIDEVAL c
void *
memset(void *dst0, int c0, size_t length)
#endif
{
size_t t;
#ifndef BZERO
u_int c;
#endif
u_char *dst;
dst = dst0;
/*
* If not enough words, just fill bytes. A length >= 2 words
* guarantees that at least one of them is `complete' after
* any necessary alignment. For instance:
*
* |-----------|-----------|-----------|
* |00|01|02|03|04|05|06|07|08|09|0A|00|
* ^---------------------^
* dst dst+length-1
*
* but we use a minimum of 3 here since the overhead of the code
* to do word writes is substantial.
*/
if (length < 3 * wsize) {
while (length != 0) {
*dst++ = VAL;
--length;
}
RETURN;
}
#ifndef BZERO
if ((c = (u_char)c0) != 0) { /* Fill the word. */
c = (c << 8) | c; /* u_int is 16 bits. */
#if UINT_MAX > 0xffff
c = (c << 16) | c; /* u_int is 32 bits. */
#endif
#if UINT_MAX > 0xffffffff
c = (c << 32) | c; /* u_int is 64 bits. */
#endif
}
#endif
/* Align destination by filling in bytes. */
if ((t = (long)dst & wmask) != 0) {
t = wsize - t;
length -= t;
do {
*dst++ = VAL;
} while (--t != 0);
}
/* Fill words. Length was >= 2*words so we know t >= 1 here. */
t = length / wsize;
do {
*(u_int *)dst = WIDEVAL;
dst += wsize;
} while (--t != 0);
/* Mop up trailing bytes, if any. */
t = length & wmask;
if (t != 0)
do {
*dst++ = VAL;
} while (--t != 0);
RETURN;
}

Check and align buffer

I'm trying to understand how to check if a pointer is aligned or not and eventually align it.
To understand it I take this function:
#define PJ_POOL_ALIGNMENT 8
PJ_DEF(pj_pool_t*) pj_pool_create_on_buf(const char *name,
void *buf,
pj_size_t size)
{
#if PJ_HAS_POOL_ALT_API == 0
struct creation_param param;
pj_size_t align_diff;
PJ_ASSERT_RETURN(buf && size, NULL);
if (!is_initialized) {
if (pool_buf_initialize() != PJ_SUCCESS)
return NULL;
is_initialized = 1;
}
/* Check and align buffer */
align_diff = (pj_size_t)buf;
if (align_diff & (PJ_POOL_ALIGNMENT-1)) {
align_diff &= (PJ_POOL_ALIGNMENT-1);
buf = (void*) (((char*)buf) + align_diff);
size -= align_diff;
}
param.stack_buf = buf;
param.size = size;
pj_thread_local_set(tls, &param);
return pj_pool_create_int(&stack_based_factory, name, size, 0,
pj_pool_factory_default_policy.callback);
#else
PJ_UNUSED_ARG(buf);
return pj_pool_create(NULL, name, size, size, NULL);
#endif
}
Obviously the part that interests me is /* Check and align buffer */.
The only thing I think I understand is this:
Let's focus on the if.
It checks whether the buffer starts at an address that is a multiple of 8 bytes. If the address is not aligned, the condition evaluates to a non-zero value and the alignment is carried out; if it is aligned, the result is 0 and the if is skipped. To get this result they use PJ_POOL_ALIGNMENT-1, which is 7 (0111), as a mask and AND it with the address at which the buffer was allocated. The operation looks like this, given that I want a non-zero result when the address is not aligned:
0000.. . 0111 AND
xxxx. . . x100
0000.. . 0100 not aligned
If any of the last 3 bits is 1, the address is not aligned to an 8-byte block, the AND result is non-zero, and the if is true, so execution enters the correction block.
But the body of the if block is obscure to me.
Can someone confirm whether my reasoning is correct and help me understand the block?
The current alignment code is incorrect. It determines the alignment difference from the lower alignment boundary and is incorrectly adding that to the pointer value to reach the upper alignment boundary:
xxxxx000 + 000 = xxxxx000 (OK - no change)
xxxxx001 + 001 = xxxxx010 (WRONG)
xxxxx010 + 010 = xxxxx100 (WRONG)
xxxxx011 + 011 = xxxxx110 (WRONG)
xxxxx100 + 100 = xxxxy000 (OK - rounded up)
xxxxx101 + 101 = xxxxy010 (WRONG)
xxxxx110 + 110 = xxxxy100 (WRONG)
xxxxx111 + 111 = xxxxy110 (WRONG)
The difference to the upper alignment boundary is the 2's complement of the difference to the lower alignment boundary, modulo the alignment size:
xxxxx000 + 000 = xxxxx000 (OK - no change)
xxxxx001 + 111 = xxxxy000 (OK - rounded up)
xxxxx010 + 110 = xxxxy000 (OK - rounded up)
xxxxx011 + 101 = xxxxy000 (OK - rounded up)
xxxxx100 + 100 = xxxxy000 (OK - rounded up)
xxxxx101 + 011 = xxxxy000 (OK - rounded up)
xxxxx110 + 010 = xxxxy000 (OK - rounded up)
xxxxx111 + 001 = xxxxy000 (OK - rounded up)
The current alignment code can be corrected with the addition of a single line to convert the lower alignment difference to an upper alignment difference:
/* Check and align buffer */
align_diff = (pj_size_t)buf;
if (align_diff & (PJ_POOL_ALIGNMENT-1)) {
align_diff &= (PJ_POOL_ALIGNMENT-1);
align_diff = PJ_POOL_ALIGNMENT - align_diff; // upper alignment
buf = (void*) (((char*)buf) + align_diff);
size -= align_diff;
}
Alternatively, the upper alignment difference could be determined directly before the if:
/* Check and align buffer */
align_diff = (pj_size_t)-(pj_size_t)buf & (PJ_POOL_ALIGNMENT-1);
if (align_diff != 0) {
buf = (void*) (((char*)buf) + align_diff);
size -= align_diff;
}
It could be argued (and has been!) that this is less readable than the original version.
In fact, the if could be omitted, since adding zero makes no difference:
/* Check and align buffer */
align_diff = (pj_size_t)-(pj_size_t)buf & (PJ_POOL_ALIGNMENT-1);
buf = (void*) (((char*)buf) + align_diff);
size -= align_diff;
Regarding align_diff = (pj_size_t)-(pj_size_t)buf & (PJ_POOL_ALIGNMENT-1);, the (pj_size_t)buf converts the pointer to an unsigned integer type, the - negates the value, and the initial (pj_size_t) converts the negated value to an unsigned integer type using 2's complement arithmetic. The & (PJ_POOL_ALIGNMENT-1) converts modulo PJ_POOL_ALIGNMENT equivalently to % PJ_POOL_ALIGNMENT (since PJ_POOL_ALIGNMENT is a power of 2).
Strictly, to avoid undefined behavior, the above pointer to integer conversion should be done using uintptr_t (defined by #include <stdint.h>) instead of pj_size_t:
align_diff = (uintptr_t)-(uintptr_t)buf & (PJ_POOL_ALIGNMENT-1);
Regarding buf = (void*) (((char*)buf) + align_diff);, pointer arithmetic is not allowed on void * values (at least in standard C), so (char*)buf converts it to a pointer to char. Since sizeof(char) is 1 byte by definition, the + align_diff advances the pointer by align_diff bytes as required. The (void*) then converts this back to a void * before assigning that back to buf. This (void*) can be omitted in C (but not in C++), so the statement could be rewritten as:
buf = (char*)buf + align_diff;
which is arguably more readable.
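Pulling the pieces together, a standalone version of the corrected logic might look like the sketch below. The function name align_buffer and its interface are assumptions for illustration and are not part of PJSIP; the check that the padding fits in the buffer is an addition that the original code does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance buf to the next 'alignment'-byte boundary (a power of two) and
 * shrink *size by the number of bytes skipped. Returns NULL and leaves
 * *size unchanged if the buffer is too small to hold the padding. */
static void *align_buffer(void *buf, size_t *size, size_t alignment)
{
    assert(buf != NULL && size != NULL);
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    /* Distance to the upper boundary: negate, then reduce modulo alignment */
    size_t diff = (size_t)(((uintptr_t)0 - (uintptr_t)buf) & (alignment - 1));
    if (diff > *size)
        return NULL;
    *size -= diff;
    return (char *)buf + diff;
}
```

With an already aligned buffer, diff is 0 and both the pointer and the size come back unchanged, which is why no if is needed around the adjustment.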

Which is faster: logarithm or lookup table? [duplicate]

I have a byte I'm using for bitflags. I know that one and only one bit in the byte is set at any given time.
Ex: unsigned char b = 0x20; //(00100000) 6th bit set
I currently use the following loop to determine which bit is set:
int getSetBitLocation(unsigned char b) {
int i=0;
while( !((b >> i++) & 0x01) ) { ; }
return i;
}
How do I most efficiently determine the position of the set bit? Can I do this without iteration?
Can I do this without iteration?
It is indeed possible.
How do I most efficiently determine the position of the set bit?
You can try this algorithm. It splits the char in half to search for the top bit, shifting to the low half each time:
int getTopSetBit(unsigned char b) {
int res = 0;
if(b>15){
b = b >> 4;
res = res + 4;
}
if(b>3){
b = b >> 2;
res = res + 2;
}
// thanks @JasonD
return res + (b>>1);
}
It uses two comparisons (three for uint16s, four for uint32s...), and it might be faster than your loop. It is definitely not shorter.
Based on the idea by Anton Kovalenko (hashed lookup) and the comment by 6502 (division is slow), I also suggest this implementation (8-bit => 3-bit hash using a de-Bruijn sequence)
int lookup[] = {7, 0, 5, 1, 6, 4, 3, 2};
int getBitPosition(unsigned char b) {
// return lookup[(b | (b>>1) | (b>>2) | (b>>4)) & 0x7];
return lookup[((b * 0x1D) >> 4) & 0x7];
}
or (larger LUT, but uses just three terms instead of four)
int lookup[] = {0xFF, 0, 1, 4, 2, 0xFF, 5, 0xFF, 7, 3, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF};
int getBitPosition(unsigned char b) {
return lookup[(b | (b>>3) | (b>>4)) & 0xF];
}
Lookup table is simple enough, and you can reduce its size if the set of values is sparse. Let's try with 11 elements instead of 128:
unsigned char expt2mod11_bits[11]={0xFF,0,1,0xFF,2,4,0xFF,7,3,6,5};
unsigned char pos = expt2mod11_bits[b%11];
assert(pos < 8);
assert(1<<pos == b);
Of course, it's not necessarily more efficient, especially for 8 bits, but the same trick can be used for larger sizes, where a full lookup table would be awfully big. Let's see:
unsigned int w;
....
unsigned char expt2mod19_bits[19]={0xFF,0,1,13,2,0xFF,14,6,3,8,0xFF,12,15,5,7,11,4,10,9};
unsigned char pos = expt2mod19_bits[w%19];
assert(pos < 16);
assert(1<<pos == w);
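The same modulus trick extends to 32-bit words. The sketch below is an illustration, not part of the answer above: the names bitpos_mod37 and bit_position32 are made up, and the modulus 37 is chosen because the 32 powers of two 2^0 .. 2^31 all leave different remainders modulo 37. The table is filled at run time instead of being typed in by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: bitpos_mod37[(1u << k) % 37] == k for k in 0..31.
 * Slots that no power of two maps to hold 0xFF. */
static unsigned char bitpos_mod37[37];

static void bitpos_mod37_init(void)
{
    for (int i = 0; i < 37; i++)
        bitpos_mod37[i] = 0xFF;
    for (unsigned k = 0; k < 32; k++)
        bitpos_mod37[(UINT32_C(1) << k) % 37] = (unsigned char)k;
}

/* w must have exactly one bit set; call bitpos_mod37_init() first. */
static unsigned bit_position32(uint32_t w)
{
    assert(w != 0 && (w & (w - 1)) == 0);
    return bitpos_mod37[w % 37];
}
```

Whether the division by 37 beats a compare-and-shift sequence depends on the target, so it is worth measuring before adopting it.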
This is a quite common problem for chess programs that use 64 bits to represent positions (i.e. one 64-bit number to store where are all the white pawns, another for where are all the black ones and so on).
With this representation there is sometimes the need to find the index 0...63 of the first or last set bit and there are several possible approaches:
Just doing a loop like you did
Using a dichotomic search (i.e. if x & 0x00000000ffffffffULL is zero there's no need to check low 32 bits)
Using special instruction if available on the processor (e.g. bsf and bsr on x86)
Using lookup tables (of course not for the whole 64-bit value, but for 8 or 16 bits)
What is faster however really depends on your hardware and on real use cases.
For 8 bits only and a modern processor I think that probably a lookup table with 256 entries is the best choice...
But are you really sure this is the bottleneck of your algorithm?
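A 256-entry table as suggested above might look like the following sketch. The names bit_index, bit_index_init and bit_location are hypothetical, the code assumes CHAR_BIT == 8, and it returns a 0-based index (the loop in the question returns a 1-based one).

```c
#include <assert.h>

/* Hypothetical table: bit_index[b] is the index (0..7) of the highest set
 * bit of b; entry 0 is unused and left as 0. */
static unsigned char bit_index[256];

static void bit_index_init(void)
{
    for (unsigned b = 1; b < 256; b++) {
        /* The highest set bit of b is one above the highest set bit of b/2. */
        bit_index[b] = (b == 1) ? 0 : (unsigned char)(bit_index[b >> 1] + 1);
    }
}

/* b must be nonzero; call bit_index_init() once before the first lookup. */
static unsigned bit_location(unsigned char b)
{
    assert(b != 0);
    return bit_index[b];
}
```

Building the table at start-up avoids typing 256 constants by hand; a static initializer would work equally well.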
unsigned getSetBitLocation(unsigned char b) {
unsigned pos=0;
pos = (b & 0xf0) ? 4 : 0; b |= b >>4;
pos += (b & 0xc) ? 2 : 0; b |= b >>2;
pos += (b & 0x2) ? 1 : 0;
return pos;
}
It would be hard to do it jump-free. Maybe with De Bruijn sequences?
Based on log2 calculation in Find the log base 2 of an N-bit integer in O(lg(N)) operations:
int getSetBitLocation(unsigned char c) {
// c is in {1, 2, 4, 8, 16, 32, 64, 128}, returned values are {0, 1, ..., 7}
return (((c & 0xAA) != 0) |
(((c & 0xCC) != 0) << 1) |
(((c & 0xF0) != 0) << 2));
}
Easiest thing is to create a lookup table. The simplest one will be sparse (having 256 elements) but it would technically avoid iteration.
This comment here technically avoids iteration, but who are we kidding, it is still doing the same number of checks: How to write log base(2) in c/c++
A closed form would be log2(), as in log2(b) + 1, but I'm not sure how efficient that is. Possibly the CPU has an instruction for taking base-2 logarithms?
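If you define
const unsigned char bytes[]={1,2,4,8,16,32,64,128};
and use
struct byte{
    unsigned char data;
    int pos;
};
/* Takes a pointer so that the caller's struct is actually updated. */
void assign(struct byte *b,int i){
    b->data=bytes[i];
    b->pos=i;
}
you don't need to determine the position of the set bit, because it is stored alongside the value.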
A lookup table is fast and easy when CHAR_BIT == 8, but on some systems, CHAR_BIT == 16 or 32 and a lookup table becomes insanely bulky. If you're considering a lookup table, I'd suggest wrapping it; make it a "lookup table function", instead, so that you can swap the logic when you need to optimise.
Using divide and conquer, by performing a binary search on a sorted array, takes on the order of log2(CHAR_BIT) comparisons. That code is more complex, as it first has to initialise an array of unsigned char to use as a lookup table. Once you have the array initialised, you can use bsearch to search it, for example:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
/* Fill table with the single-bit values 1, 2, 4, ... in ascending order. */
void uchar_bit_init(unsigned char *table) {
    for (size_t x = 0; x < CHAR_BIT; x++) {
        table[x] = 1U << x;
    }
}
int uchar_compare(void const *x, void const *y) {
    unsigned char const *X = x, *Y = y;
    return (*X > *Y) - (*X < *Y);
}
/* Return the 1-based position of the set bit, or 0 if value is not in the table. */
size_t uchar_bit_lookup(unsigned char *table, unsigned char value) {
    unsigned char *position = bsearch(&value, table, CHAR_BIT, 1, uchar_compare);
    return position ? position - table + 1 : 0;
}
int main(void) {
unsigned char lookup[CHAR_BIT];
uchar_bit_init(lookup);
for (;;) {
int c = getchar();
if (c == EOF) { break; }
printf("Bit for %c found at %zu\n", c, uchar_bit_lookup(lookup, c));
}
}
P.S. This sounds like micro-optimisation. Get your solution done (abstracting the operations required into these functions), then worry about optimisations based on your profiling. Make sure your profiling targets the system that your solution will run on if you're going to focus on micro-optimisations, because the efficiency of micro-optimisations differs widely as hardware differs even slightly... It's usually a better idea to buy a faster PC ;)

Dividing a bit string into three parts in C

I currently have an integer value that was read from an input file as hexadecimal. I need to divide the 32-bit bitstream into three separate parts in order to manipulate it. The desired output is below:
desired output:
In this, V is my input value, left is the first X1 digits, next is the digits between X1 and X2, and last is the digits from X2 to the end. There is a constraint that each subsection must be greater than 0 in length.
What makes this difficult is that the location where I am splitting x varies (X1 and X2 can change)
Is there a good way to split these up?
The splitter() function here does the job you ask for. It takes quite a lot of arguments, unfortunately. There's the value to be split (value), the size of the chunk at the least significant end of the value (p1), the size of the middle chunk (p2), and then pointers to the high, medium and low values (hi_val, md_val, lo_val).
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
static void splitter(uint32_t value, unsigned p1, unsigned p2, uint32_t *hi_val, uint32_t *md_val, uint32_t *lo_val)
{
assert(p1 + p2 < 32);
*lo_val = value & ((1U << p1) - 1);
value >>= p1;
*md_val = value & ((1U << p2) - 1);
value >>= p2;
*hi_val = value;
}
static void test_splitter(uint32_t value, unsigned p1, unsigned p2)
{
uint32_t hi_val;
uint32_t md_val;
uint32_t lo_val;
splitter(value, p1, p2, &hi_val, &md_val, &lo_val);
printf("0x%.8" PRIX32 " (%2u,%2u,%2u) = 0x%.4" PRIX32 " : 0x%.4" PRIX32 " : 0x%.4" PRIX32 "\n",
value, (32 - p1 - p2), p2, p1, hi_val, md_val, lo_val);
}
int main(void)
{
uint32_t value;
value = 0xFFFFFFFF;
test_splitter(value, 9, 11);
value = 0xFFF001FF;
test_splitter(value, 9, 11);
value = 0x000FFE00;
test_splitter(value, 9, 11);
value = 0xABCDEF01;
test_splitter(value, 10, 6);
test_splitter(value, 8, 8);
test_splitter(value, 13, 9);
test_splitter(value, 10, 8);
return 0;
}
The test_splitter() function allows for simple testing of a single value plus the sections it is to be split in, and main() calls the test function a number of times.
The output is:
0xFFFFFFFF (12,11, 9) = 0x0FFF : 0x07FF : 0x01FF
0xFFF001FF (12,11, 9) = 0x0FFF : 0x0000 : 0x01FF
0x000FFE00 (12,11, 9) = 0x0000 : 0x07FF : 0x0000
0xABCDEF01 (16, 6,10) = 0xABCD : 0x003B : 0x0301
0xABCDEF01 (16, 8, 8) = 0xABCD : 0x00EF : 0x0001
0xABCDEF01 (10, 9,13) = 0x02AF : 0x006F : 0x0F01
0xABCDEF01 (14, 8,10) = 0x2AF3 : 0x007B : 0x0301
If any of the sections is wider than 16 bits, its value needs more than four hex digits and the columns no longer line up, but the code still works.
In theory, the 1U values could be a 16-bit quantity, but I'm assuming that the CPU is working with 32-bit int. There are ways (UINT32_C(1)) to ensure that it is a 32-bit value, but that's probably OTT. The code explicitly forces 32-bit unsigned integer values, and prints them as such.
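For completeness, here is a sketch of how the masks could be built with UINT32_C(1) so that they are 32 bits wide regardless of the width of int. The helper names low_mask32 and extract_bits32 are hypothetical and not part of the code above; the special case for a 32-bit field is needed because shifting a 32-bit value by 32 is undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a mask with the low n bits set, for n in 0..32.
 * n == 32 is handled separately because shifting a 32-bit value by 32
 * is undefined behaviour. */
static uint32_t low_mask32(unsigned n)
{
    assert(n <= 32);
    return (n == 32) ? UINT32_MAX : (UINT32_C(1) << n) - 1;
}

/* Extract `width` bits of value starting at bit `shift`
 * (bit 0 is the least significant). */
static uint32_t extract_bits32(uint32_t value, unsigned shift, unsigned width)
{
    assert(shift < 32 && width <= 32 - shift);
    return (value >> shift) & low_mask32(width);
}
```

With value 0xABCDEF01, extract_bits32(value, 0, 10), extract_bits32(value, 10, 6) and extract_bits32(value, 16, 16) give the same low, middle and high parts as the (16, 6, 10) row in the output above.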
If I understand your question, you want to allocate data. Look at the alloca and malloc functions.

Understanding the implementation of memcpy()

I was looking at the implementation of memcpy.c and found a memcpy that differs from the ones I had seen. I couldn't understand why they do (((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1)
#if !defined(__MACHDEP_MEMFUNC)
#ifdef _MSC_VER
#pragma function(memcpy)
#undef __MEMFUNC_ARE_INLINED
#endif
#if !defined(__MEMFUNC_ARE_INLINED)
/* Copy C bytes from S to D.
* Only works if non-overlapping, or if D < S.
*/
EXTERN_C void * __cdecl memcpy(void *d, const void *s, size_t c)
{
if ((((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1)) {
BYTE *pS = (BYTE *) s;
BYTE *pD = (BYTE *) d;
BYTE *pE = (BYTE *) (((ADDRESS) s) + c);
while (pS != pE)
*(pD++) = *(pS++);
}
else {
UINT *pS = (UINT *) s;
UINT *pD = (UINT *) d;
UINT *pE = (UINT *) (BYTE *) (((ADDRESS) s) + c);
while (pS != pE)
*(pD++) = *(pS++);
}
return d;
}
#endif /* ! __MEMFUNC_ARE_INLINED */
#endif /* ! __MACHDEP_MEMFUNC */
The code is testing whether the addresses are aligned suitably for a UINT. If so, the code copies using UINT objects. If not, the code copies using BYTE objects.
The test works by first performing a bitwise OR of the two addresses (and of the byte count c). Any bit that is on in any of the operands will be on in the result. Then the test performs a bitwise AND with sizeof(UINT) - 1. It is expected that the size of a UINT is some power of two. Then the size minus one has all lower bits on. E.g., if the size is 4 or 8, then one less than that is, in binary, 11 or 111. If either address is not a multiple of the size of a UINT, then it will have one of these bits on, and the test will indicate it. (Usually, the best alignment for an integer object is the same as its size. This is not necessarily true. A modern implementation of this code should use _Alignof(UINT) - 1 instead of the size.)
Copying with UINT objects is faster, because, at the hardware level, one load or store instruction loads or stores all the bytes of a UINT (likely four bytes). Processors will typically copy faster when using these instructions than when using four times as many single-byte load or store instructions.
This code is of course implementation dependent; it requires support from the C implementation that is not part of the base C standard, and it depends on specific features of the processor it executes on.
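The alignment test with _Alignof could be written roughly as follows. This is a sketch under assumptions: it uses unsigned int in place of the UINT typedef, the function names are invented for illustration, C11 is required for _Alignof, and it relies on uintptr_t values reflecting byte addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if no bit below `alignment` is set in bits.
 * alignment must be a power of two. */
static int is_aligned_to(uintptr_t bits, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bits & (alignment - 1)) == 0;
}

/* Hypothetical predicate: nonzero if both pointers are suitably aligned
 * for unsigned int and count is a whole number of unsigned ints. */
static int can_copy_by_words(const void *dst, const void *src, size_t count)
{
    /* ORing the addresses lets one test cover both pointers. */
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;
    return is_aligned_to(bits, _Alignof(unsigned int))
        && count % sizeof(unsigned int) == 0;
}
```

Separating the alignment check from the size check makes it explicit that the pointers need alignment while the count needs to be a multiple of the element size.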
A more advanced memcpy implementation could contain additional features, such as:
If one of the addresses is aligned but the other is not, use special load-unaligned instructions to load multiple bytes from one address, with regular store instructions to the other address.
If the processor has Single Instruction Multiple Data instructions, use those instructions to load or store many bytes (often 16, possibly more) in a single instruction.
The code
((((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1))
checks whether any of s, d, or c is not a multiple of the size of a UINT.
For example, if s = 0x7ff30b14, d = 0x7ffa81d8, c = 256, and sizeof(UINT) == 4, then:
s = 0b1111111111100110000101100010100
d = 0b1111111111110101000000111011000
c = 0b0000000000000000000000100000000
s | d | c = 0b1111111111110111000101111011100
(s | d | c) & 3 = 0b00
So both pointers are aligned and the count is a multiple of 4. It is easier to copy memory between pointers that are both aligned, and this check does it with only one branch.
On many architectures, *(UINT *) ptr is much faster if ptr is correctly aligned to the width of a UINT. On some architectures, *(UINT *) ptr will actually crash if ptr is not correctly aligned.

Resources