Understanding the implementation of memcpy() - c

I was looking at the implementation of memcpy.c and found a different memcpy code. I couldn't understand why they do (((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1)
#if !defined(__MACHDEP_MEMFUNC)
#ifdef _MSC_VER
#pragma function(memcpy)
#undef __MEMFUNC_ARE_INLINED
#endif
#if !defined(__MEMFUNC_ARE_INLINED)
/* Copy C bytes from S to D.
* Only works if non-overlapping, or if D < S.
*/
EXTERN_C void * __cdecl memcpy(void *d, const void *s, size_t c)
{
    if ((((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1)) {
        BYTE *pS = (BYTE *) s;
        BYTE *pD = (BYTE *) d;
        BYTE *pE = (BYTE *) (((ADDRESS) s) + c);
        while (pS != pE)
            *(pD++) = *(pS++);
    }
    else {
        UINT *pS = (UINT *) s;
        UINT *pD = (UINT *) d;
        UINT *pE = (UINT *) (BYTE *) (((ADDRESS) s) + c);
        while (pS != pE)
            *(pD++) = *(pS++);
    }
    return d;
}
#endif /* ! __MEMFUNC_ARE_INLINED */
#endif /* ! __MACHDEP_MEMFUNC */

The code is testing whether the addresses are aligned suitably for a UINT. If so, the code copies using UINT objects. If not, the code copies using BYTE objects.
The test works by first performing a bitwise OR of the two addresses (and the count). Any bit that is on in either address will be on in the result. Then the test performs a bitwise AND with sizeof(UINT) - 1. It is expected that the size of a UINT is some power of two, so the size minus one has all the lower bits on. E.g., if the size is 4 or 8, then one less than that is, in binary, 11 or 111. If either address is not a multiple of the size of a UINT, then it will have one of these bits on, and the test will indicate it. (Usually, the best alignment for an integer object is the same as its size. This is not necessarily true. A modern implementation of this code should use _Alignof(UINT) - 1 instead of the size.)
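As a hedged sketch of that last point (not the original code; the macro name is made up), an alignment test based on the type's alignment rather than its size could look like:
#include <stdint.h>
#include <stdalign.h>
/* Sketch: nonzero if p is not suitably aligned for an unsigned int. */
#define IS_MISALIGNED_FOR_UINT(p) \
    (((uintptr_t)(p) & (alignof(unsigned int) - 1)) != 0)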
Copying with UINT objects is faster, because, at the hardware level, one load or store instruction loads or stores all the bytes of a UINT (likely four bytes). Processors will typically copy faster when using these instructions than when using four times as many single-byte load or store instructions.
This code is of course implementation dependent; it requires support from the C implementation that is not part of the base C standard, and it depends on specific features of the processor it executes on.
A more advanced memcpy implementation could contain additional features, such as:
If one of the addresses is aligned but the other is not, use special load-unaligned instructions to load multiple bytes from one address, with regular store instructions to the other address.
If the processor has Single Instruction Multiple Data instructions, use those instructions to load or store many bytes (often 16, possibly more) in a single instruction.
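For example, a rough sketch of the SIMD idea using SSE2 intrinsics could look like the following (an illustration only, assuming an x86 target with <emmintrin.h>; it is not how any particular library implements memcpy):
#include <emmintrin.h>
#include <stddef.h>
/* Sketch: copy 16 bytes at a time with unaligned SSE2 loads/stores,
 * then finish the tail byte by byte. Assumes non-overlapping buffers. */
static void *memcpy_sse2_sketch(void *d, const void *s, size_t n)
{
    unsigned char *pd = d;
    const unsigned char *ps = s;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)ps);
        _mm_storeu_si128((__m128i *)pd, v);
        ps += 16;
        pd += 16;
        n -= 16;
    }
    while (n--)
        *pd++ = *ps++;
    return d;
}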

The code
((((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1))
checks whether any of s, d, or c is not aligned to the size of a UINT (for c, whether it is not a whole number of UINTs).
For example, if s = 0x7ff30b14, d = 0x7ffa81d8, c = 256, and sizeof(UINT) == 4, then:
s = 0b1111111111100110000101100010100
d = 0b1111111111110101000000111011000
c = 0b0000000000000000000000100000000
s | d | c = 0b1111111111110111000101111011100
(s | d | c) & 3 = 0b00
So both pointers (and the count) are suitably aligned. It is easier to copy memory between pointers that are both aligned, and this does it with only one branch.
On many architectures, *(UINT *) ptr is much faster if ptr is correctly aligned to the width of a UINT. On some architectures, *(UINT *) ptr will actually crash if ptr is not correctly aligned.
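As an aside (not part of the answer above): when a pointer might not be suitably aligned, a common portable way to read a word through it is to memcpy the bytes into an aligned local, which compilers typically turn into a single load where that is safe. A minimal sketch:
#include <string.h>
/* Sketch: alignment-safe load of an unsigned int from an arbitrary address. */
static unsigned int load_uint(const void *ptr)
{
    unsigned int v;
    memcpy(&v, ptr, sizeof v);
    return v;
}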

What is the fastest way to initialize an array in C with only two bytes for all elements?

Assume that we have an array called:
uint8_t data_8_bit[2] = {color >> 8, color & 0xff};
The data is 16-bit color data. Our goal is to create an array called:
uint8_t data_16_bit[2*n];
Where n is the length of the 16-bit data array. But the array data_16_bit cannot hold 16-bit values, so I have used 2*n as the array size.
Sure, I know that I can fill up the array data_16_bit by using a for-loop:
for(int i = 0; i < n; i++)
    for(int j = 0; j < 2; j++)
        data_16_bit[i*2 + j] = data_8_bit[j];
But there must be a faster way than this?
memset or memcpy?
IMO the easiest one for the compiler to optimize (and very safe as well) is:
void foo(uint16_t color, uint8_t *arr16, size_t n)
{
    uint8_t data_8_bit[2] = {color >> 8, color & 0xff};
    while(n--)
    {
        memcpy(arr16 + n * 2, data_8_bit, 2);
    }
}
https://godbolt.org/z/8Wh5Pc3aP
It appears that what you are trying to do is ensure that each element of data_16_bit at an even index contains the same value as data_8_bit[0], and each element at an odd index contains the same value as data_8_bit[1].
Standard C does not provide a way to express such a thing via an initializer.
memset() does not, by itself, provide a solution better than plain iteration because you're trying to set the target bytes to alternating values instead of all to the same value.
memcpy() does not yield any simple approach that is much, if any, better than the simple iterative assignments because the source pattern is only two bytes. It would be possible to perform fewer than n calls to memcpy() in the general case, but the code to accomplish that would be fairly complex.
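To make that last remark concrete, here is a hedged sketch (the function name is made up, and no claim is made that it is faster in practice) of how roughly log2(n) memcpy() calls could do the job, by repeatedly doubling the already-filled prefix:
#include <stddef.h>
#include <stdint.h>
#include <string.h>
/* Sketch: fill dst (2*n bytes) with a repeating 2-byte pattern using
 * O(log n) memcpy calls: copy the pattern once, then keep doubling. */
static void fill_pattern2(uint8_t *dst, const uint8_t pat[2], size_t n)
{
    size_t total = 2 * n;
    if (total == 0)
        return;
    memcpy(dst, pat, 2);
    for (size_t filled = 2; filled < total; ) {
        size_t chunk = (filled <= total - filled) ? filled : total - filled;
        memcpy(dst + filled, dst, chunk);   /* source and destination never overlap */
        filled += chunk;
    }
}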
If n is a compile-time constant then the fastest approach is to just write out a full initializer:
uint8_t data_16_bit[2*8] = {
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff,
    color >> 8, color & 0xff
};
If n is not a compile time constant then
you should consider using dynamically-allocated memory instead of a VLA, and
you cannot use an initializer.
In that case, something like your for loop is probably about as good as it gets. I would write it like this, though:
for(int i = 0; i < n * 2; i += 2) {
    data_16_bit[i] = data_8_bit[0];
    data_16_bit[i+1] = data_8_bit[1];
}
Although not widely known, you can use wmemset for this if sizeof(wchar_t) is a multiple of 2 on your platform, for example when it's a 2-byte type:
_Static_assert(sizeof(wchar_t)*CHAR_BIT == 16, "expected 16-bit wchar_t");
wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n);
If wchar_t is a 4-byte type, as on most *nix platforms:
_Static_assert(sizeof(wchar_t)*CHAR_BIT == 32, "expected 32-bit wchar_t");
wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
memcpy((char*)&pattern + 2, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n/2); /* each wchar_t covers two 16-bit elements, so n must be even */
If wchar_t is even bigger (extremely unlikely), then just repeat those first steps to create the fill pattern.
wmemset is typically hand-optimized with SIMD in assembly, like memset, so it'll be extremely fast compared to other solutions where the compiler isn't able to auto-vectorize. For example, there are lots of optimized memset and wmemset versions for x86-64 in glibc, including SSE2, AVX2 and even AVX-512.
A few questions to consider.
Once initialized, will this data be read-only, or modifiable?
Is the number of elements fixed? Configurable at build time? Or does it vary at runtime?
Is there a reasonable limit to the maximum number of elements?
How many times will you need to initialize your buffer?
Beginning with the third question, it looks like you will have a maximum of 65536 unique sets, as you are dealing with 16 bits of data. If you are willing to sacrifice a bit of space for speed, you can create a 64 kB global table that contains all the permutations in the order that you expect. The end result is that you have the table loaded into memory automatically, as it would reside in one of the data sections in your executable. How you create/populate this table is up to you. (For example, you could manually create the table, or you could have a dedicated step in your build process that both creates and compiles it into a linkable object file.)
Continuing on, assuming that your table is loaded into memory.
If your pre-populated table is at least as large as what you will ever need, you can either ...
Return a pointer to the table if the contents will never change (and best to have it reside in read-only data section for this case). The benefit is that no more data needs to be copied--you only need to move a pointer around.
Use a memory copying routine such as memcpy() (or a custom one if you don't want what the compiler generates) to copy from the pre-populated table to your desired buffer if the contents are going to change, or if your destination buffer is larger than your 64 kB pre-populated table.
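One possible reading of that idea, as a hedged sketch (all names here are made up): a 64 kB table pre-filled with the repeating two-byte pattern, which the caller can then either point into or memcpy from:
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#define TABLE_BYTES 65536u
static uint8_t pattern_table[TABLE_BYTES];
/* Fill the table once with the alternating high/low bytes of color. */
static void init_pattern_table(uint16_t color)
{
    for (size_t i = 0; i < TABLE_BYTES; i += 2) {
        pattern_table[i]     = (uint8_t)(color >> 8);
        pattern_table[i + 1] = (uint8_t)(color & 0xff);
    }
}
/* Copy the first 2*n bytes of the pattern into a destination buffer. */
static void fill_from_table(uint8_t *dst, size_t n)
{
    memcpy(dst, pattern_table, 2 * n);   /* requires 2*n <= TABLE_BYTES */
}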
We can store into the uint8_t array using a uint16_t pointer.
This is desirable because the original fill value is 16 bits.
We can even store 64 bits at a time. We replicate the 16 bit color four times to get a uint64_t value. We store into the array using a uint64_t pointer.
This method is what the builtin memset function tries to do [it would try to use XMM registers].
Here's what I came up with.
Note that we can start with byte stores if we need to align to an 8 byte boundary [if the arch requires this], then do wide stores in the middle and revert to char/short stores at the end.
#include <stddef.h>
#include <stdint.h>
void
fill8(uint8_t *data, uint16_t color, size_t len)
{
    uint64_t *p64;
    uint16_t *p16;
    size_t count;
    // NOTE: enhancement would be to align the data pointer to 8 byte boundary
    // by transferring 16 bit data as below
    // get 64 bit color value
    uint64_t c64 = 0;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    // get pointer to 64 bit data and its count
    p64 = (uint64_t *) data;
    count = len / sizeof(c64);
    // increment byte pointer and decrement byte length
    data += count * sizeof(c64);
    len -= count * sizeof(c64);
    // transfer data 64 bits at a time
    for (; count > 0; --count, ++p64)
        *p64 = c64;
    // get pointer to 16 bit data
    p16 = (uint16_t *) data;
    count = len / sizeof(color);
    // increment byte pointer and decrement byte length
    data += count * sizeof(color);
    len -= count * sizeof(color);
    // transfer data 16 bits at a time
    for (; count > 0; --count, ++p16)
        *p16 = color;
}
UPDATE:
It leads to UB. Strict aliasing. If the platform does not allow unaligned access and *data is unaligned it will result in an exception – 0___________
Unaligned access is not a "strict aliasing" violation. It is unaligned access. On certain architectures, this will cause an alignment exception [in the hardware]. I addressed this in comments above, but, to be fair, there is a reworked example below that includes the alignment code.
"Note that we can start with byte stores if we need to align to an 8 byte boundary" - Alignment has nothing to do with strict aliasing. It doesn't matter where the data that uint8_t *data points to came from - in this function it's of type uint8_t, and referring to it with a uint64_t * pointer unequivocally violates strict aliasing. – Andrew Henle
No, I don't believe it violates "strict aliasing" because it doesn't apply here. You might have more luck with "violates type punning" but I doubt that as well.
The [updated] code [below] is very similar to Freebsd's memset.
And, even if the code violated the "rule", there are known workarounds and exceptions:
https://developers.redhat.com/blog/2020/06/02/the-joys-and-perils-of-c-and-c-aliasing-part-1
https://developers.redhat.com/blog/2020/06/03/the-joys-and-perils-of-aliasing-in-c-and-c-part-2
More detail is in the links, but strict aliasing lets the compiler optimize accesses to a and b in a function like:
void func(int *a,long *b)
But, it really wants:
void func(int * restrict a,long * restrict b)
Without the restrict the compiler can't do [dubious] optimizations on the pointers because it can't determine if they overlap.
Here, the compiler can deduce the pointer relationships and generate correct code because of the way the pointers are set/incremented.
If we must, to be [completely] safe, compile with -fno-strict-aliasing [but I don't believe it's required in this instance].
It might be "type punning". But, because the data is unsigned, and the relationships are known (e.g. uint8_t and uint64_t), the generated code will still be correct.
As I mentioned, memset does similar pointer/data manipulations. If the code herein is "bad", then libc is also broken. See the Freebsd memset implementation below.
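As a side note (not part of this answer's argument): a reader who wants to sidestep the aliasing question entirely can issue the wide store through memcpy, which is always well-defined and which GCC and Clang typically compile to a single 8-byte store. A sketch:
#include <stdint.h>
#include <string.h>
/* Sketch: store a 64-bit value into a byte buffer without a uint64_t* dereference. */
static inline void store_u64(uint8_t *dst, uint64_t v)
{
    memcpy(dst, &v, sizeof v);
}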
The pointer alignment issue, which I addressed in comments, is valid. Here is some reworked code to address the alignment:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <byteswap.h>
#define OFFOF(_ptr) \
((uintptr_t) (_ptr))
int opt_b;
int opt_v;
#define sysfault(_fmt...) \
do { \
printf(_fmt); \
exit(1); \
} while (0);
uint8_t phys[1000000];
void
fill8(uint8_t *data, uint16_t color, size_t len)
{
    uint64_t *p64;
    uint16_t *p16;
    size_t count;
    if (opt_b)
        color = bswap_16(color);
    // NOTE: enhancement would be to align the data pointer to 8 byte boundary
    // by transferring 16 bit data as below
    for (; (len > 0) && (OFFOF(data) & 0x07); ++data, --len) {
        *data = color;
        color = bswap_16(color);
    }
    // get 64 bit color value
    uint64_t c64 = 0;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    // get pointer to 64 bit data and its count
    p64 = (uint64_t *) data;
    count = len / sizeof(c64);
    // increment byte pointer and decrement byte length
    data += count * sizeof(c64);
    len -= count * sizeof(c64);
    // transfer data 64 bits at a time
    for (; count > 0; --count, ++p64)
        *p64 = c64;
    // transfer data 8 bits at a time
    for (; len > 0; ++data, --len) {
        *data = color;
        color = bswap_16(color);
    }
}
void
verify(size_t len,size_t align)
{
uint8_t *data;
size_t idx;
uint8_t val;
data = &phys[0];
for (idx = 0; idx < align; ++idx) {
val = data[idx];
if (val != 0)
sysfault("verify: BEF idx=%zu val=%2.2X\n",idx,val);
}
data = &phys[align];
for (idx = 0; idx < len; ++idx) {
val = data[idx];
if (opt_v) {
printf(" %2.2X",val);
if ((idx % 16) == 15)
printf("\n");
}
if (val == 0)
sysfault("verify: DAT idx=%zu val=%2.2X\n",idx,val);
}
if (opt_v)
printf("\n");
data = &phys[align + len];
for (idx = 0; idx < align; ++idx) {
val = data[idx];
if (val != 0)
sysfault("verify: AFT idx=%zu val=%2.2X\n",idx,val);
}
}
void
dotest(int tstno,size_t len,size_t align)
{
uint8_t *data = phys;
memset(phys,0,sizeof(phys));
while (1) {
uintptr_t off = data - phys;
if (off == align)
break;
++data;
}
if ((tstno > 1) && opt_v)
printf("\n");
printf("T:%d %p L:%zu A:%zu\n",tstno,data,len,align);
uint16_t color = 0x0102;
fill8(data,color,len);
verify(len,align);
}
int
main(int argc,char **argv)
{
int tstno = 1;
--argc;
++argv;
for (; argc > 0; --argc, ++argv) {
char *cp = *argv;
if (*cp != '-')
break;
cp += 2;
switch(cp[-1]) {
case 'b': // big endian
opt_b = ! opt_b;
break;
case 'v': // verbose check
opt_v = ! opt_v;
break;
}
}
for (size_t len = 1; len <= 128; ++len) {
for (size_t align = 0; align < 8; ++align, ++tstno)
dotest(tstno,len,align);
}
return 0;
}
Here is [one of] Freebsd's memset implementations:
/*-
* Copyright (c) 1990, 1993
* The Regents of the University of California. All rights reserved.
*
* This code is derived from software contributed to Berkeley by
* Mike Hibler and Chris Torek.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
#if defined(LIBC_SCCS) && !defined(lint)
static char sccsid[] = "@(#)memset.c 8.1 (Berkeley) 6/4/93";
#endif /* LIBC_SCCS and not lint */
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
#include <sys/types.h>
#include <limits.h>
#define wsize sizeof(u_int)
#define wmask (wsize - 1)
#ifdef BZERO
#include <strings.h>
#define RETURN return
#define VAL 0
#define WIDEVAL 0
void
bzero(void *dst0, size_t length)
#else
#include <string.h>
#define RETURN return (dst0)
#define VAL c0
#define WIDEVAL c
void *
memset(void *dst0, int c0, size_t length)
#endif
{
size_t t;
#ifndef BZERO
u_int c;
#endif
u_char *dst;
dst = dst0;
/*
* If not enough words, just fill bytes. A length >= 2 words
* guarantees that at least one of them is `complete' after
* any necessary alignment. For instance:
*
* |-----------|-----------|-----------|
* |00|01|02|03|04|05|06|07|08|09|0A|00|
* ^---------------------^
* dst dst+length-1
*
* but we use a minimum of 3 here since the overhead of the code
* to do word writes is substantial.
*/
if (length < 3 * wsize) {
while (length != 0) {
*dst++ = VAL;
--length;
}
RETURN;
}
#ifndef BZERO
if ((c = (u_char)c0) != 0) { /* Fill the word. */
c = (c << 8) | c; /* u_int is 16 bits. */
#if UINT_MAX > 0xffff
c = (c << 16) | c; /* u_int is 32 bits. */
#endif
#if UINT_MAX > 0xffffffff
c = (c << 32) | c; /* u_int is 64 bits. */
#endif
}
#endif
/* Align destination by filling in bytes. */
if ((t = (long)dst & wmask) != 0) {
t = wsize - t;
length -= t;
do {
*dst++ = VAL;
} while (--t != 0);
}
/* Fill words. Length was >= 2*words so we know t >= 1 here. */
t = length / wsize;
do {
*(u_int *)dst = WIDEVAL;
dst += wsize;
} while (--t != 0);
/* Mop up trailing bytes, if any. */
t = length & wmask;
if (t != 0)
do {
*dst++ = VAL;
} while (--t != 0);
RETURN;
}

The right way to use function _mm_clflush to flush a large struct

I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb.
Say now that I have defined a struct named mystruct and its size is 256 bytes. My cache line size is 64 bytes. Now I want to flush the cache lines that contain the mystruct variable. Which of the following is the right way to do so?
_mm_clflush(&mystruct)
or
for (int i = 0; i < sizeof(mystruct)/64; i++) {
    _mm_clflush( ((char *)&mystruct) + i*64 );
}
The clflush CPU instruction doesn't know the size of your struct; it only flushes exactly one cache line, the one containing the byte pointed to by the pointer operand. (The C intrinsic exposes this as a const void*, but char* would also make sense, especially given the asm documentation which describes it as an 8-bit memory operand.)
You need 4 flushes 64 bytes apart, or maybe 5 if your struct isn't alignas(64) so it could have parts in 5 different lines. (You could unconditionally flush the last byte of the struct, instead of using more complex logic to check if it's in a cache line you haven't flushed yet, depending on relative cost of clflush vs. more logic and a possible branch mispredict.)
Your original loop did 4 flushes of 4 adjacent bytes at the start of your struct.
It's probably easiest to use pointer increments so the casting is not mixed up with the critical logic.
// first attempt, a bit clunky:
const int LINESIZE = 64;
const char *p = (const char *)&mystruct;
const char *lastbyte = (const char *)(&mystruct+1) - 1;
for ( ; p <= lastbyte ; p += LINESIZE) {
    _mm_clflush( p );
}
// if mystruct is guaranteed aligned by 64, you're done. Otherwise not:
// check if the next line to maybe flush contains the last byte of the struct; if not, then that line was already flushed.
if ( (((uintptr_t)p ^ (uintptr_t)lastbyte) & -LINESIZE) == 0 )
    _mm_clflush( lastbyte );
x^y is 1 in bit-positions where they differ. x & -LINESIZE discards the offset-within-line bits of the address, keeping only the line-number bits. So we can see if 2 addresses are in the same cache line or not with just XOR and TEST instructions. (Or clang optimizes that to a shorter cmp instruction).
Or rewrite that into a single loop, using that if logic as the termination condition:
I used a C++ struct foo &var reference so I could follow your &var syntax but still see how it compiles for a function taking a pointer arg. Adapting to C is straightforward.
Looping over every cache line of an arbitrary size unaligned struct
/* I think this version is best:
 * compact setup / small code-size
 * with no extra latency for the initial pointer
 * doesn't need to peel a final iteration
 */
inline
void flush_structfoo(struct foo &mystruct) {
    const int LINESIZE = 64;
    const char *p = (const char *)&mystruct;
    uintptr_t endline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) | (LINESIZE-1);
    // set the offset-within-line address bits to get the last byte
    // of the cacheline containing the end of the struct.
    do {   // flush while p is in a cache line that contains any of the struct
        _mm_clflush( p );
        p += LINESIZE;
    } while (p <= (const char*)endline);
}
With GCC10.2 -O3 for x86-64, this compiles nicely (Godbolt)
flush_v3(foo&):
        lea     rax, [rdi+255]
        or      rax, 63
.L11:
        clflush [rdi]
        add     rdi, 64
        cmp     rdi, rax
        jbe     .L11
        ret
GCC doesn't unroll, and doesn't optimize any better if you use alignas(64) struct foo{...}; unfortunately. You might use if (alignof(mystruct) >= 64) { ... } to check if special handling is needed to let GCC optimize better, otherwise just use end = p + sizeof(mystruct); or end = (const char*)(&mystruct+1) - 1; or similar.
(In C, #include <stdalign.h> to get the alignas() and alignof() macros like C++, instead of spelling out the ISO C11 _Alignas and _Alignof keywords.)
Another alternative is this, but it's clunkier and takes more setup work.
const int LINESIZE = 64;
uintptr_t line = (uintptr_t)&mystruct & -LINESIZE;
uintptr_t lastline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) & -LINESIZE;
do {   // always at least one flush; works on small structs
    _mm_clflush( (void*)line );
    line += LINESIZE;
} while (line <= lastline);   // <= so the line holding the last byte gets flushed too
A struct that was 257 bytes would always touch exactly 5 cache lines, no checking needed. Or a 260-byte struct that's known to be aligned by 4. IDK if we can get GCC to optimize away the checks based on that.
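Since the answer notes that adapting the C++ reference version to C is straightforward, here is one hedged sketch of what that adaptation could look like with a plain pointer-and-size interface (the names are illustrative, not from the original answer):
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>
/* Sketch: flush every cache line touched by the object at [obj, obj+size).
 * Assumes size > 0. */
static inline void flush_range(const void *obj, size_t size)
{
    const int LINESIZE = 64;
    const char *p = (const char *)obj;
    uintptr_t endline = ((uintptr_t)obj + size - 1) | (LINESIZE - 1);
    do {
        _mm_clflush(p);     // flush the line containing p
        p += LINESIZE;
    } while ((uintptr_t)p <= endline);
}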

Checking if two pointers are on the same page

I saw this interview question and wanted to know if my function is doing what it's supposed to or if there's a better way to do this.
Here's the exact quote of the question:
The operating system typically allocates memory in pages such that the base addresses of the pages are 0, 4K, 8K, etc. Given two addresses (pointers), write a function to find if two pointers are on the same page. Here's the function prototype: int AreOnSamePage (void * a, void * b);
Here's my implementation. I made it return 4 if it's between 4k and 8k. It returns 1 if it's between 0 and 4k and it returns -1 if it's over 8k away. Am I getting the right addresses? The interview question is worded vaguely. Is it correct to use long's since the addresses could be pretty big?
int AreOnSamePage(void* a, void* b){
    long difference = abs(&a - &b);
    printf("%ld %ld\n", (long)&a, (long)&b);
    if(difference > 8000)
        return -1;
    if(difference >= 4000)
        return 4;
    return 1;
}
a and b are pointers, so the distance between them is:
ptrdiff_t difference = (ptrdiff_t) abs((char *)a - (char *) b)
But you don't need it.
Two pointers are on the same page, if
(uintptr_t)a / 4096 == ( uintptr_t ) b / 4096
Else they are on different pages.
So:
int AreOnSamePage(void* a, void* b) {
    const size_t page_size = 4096;
    if ( (uintptr_t) a / page_size == (uintptr_t) b / page_size)
        return 1;
    else
        return 0;
}
There are many problems with your code.
You are comparing addresses of function parameters (they are side by side, on stack), not pointers
You compare the difference with 8000 for no reason
4K != 4000
Imagine one address is 3K and the other is 5K: according to your code they are on the same page, when in fact they are not.
Bad choice of return values
The name AreOnSamePage() implies that the function returns either 0 or 1; I'd find it odd to have it return -1, 4 or other values.
If a page is 4KB, then you need 12 bits to index each byte inside a page (because 2^12 = 4096), so as long as the N-12 most significant bits of both pointer values compare equal, you know they are on the same page (where N is the width of a pointer in bits).
So you can do this:
#include <stdint.h>
static const uintptr_t PAGE_SIZE = 4096;
static const uintptr_t PAGE_MASK = ~(PAGE_SIZE-1);
int AreOnSamePage(void *a, void *b) {
    return (((uintptr_t) a) & PAGE_MASK) == (((uintptr_t) b) & PAGE_MASK);
}
PAGE_MASK is a bit mask that has all N-12 most significant bits set to 1 and the 12 least significant bits set to 0. By doing the bitwise AND with an address, we effectively clear the least significant 12 bits (the offset into the page), so we can compare only the other bits that matter.
Note that uintptr_t is guaranteed to be wide enough to store pointer values, unlike long.
As already stated, you should use uintptr_t to process the pointers. Your code is, however, wrong, as you test the distance, not the page. Also, you forget that computers use powers of two: 8000 is not one; that would be 8192. Similarly for 4000.
The fastest approach for the test would be:
#include <stdbool.h>
#include <stdint.h>
// this should better be found in a system header:
#define PAGESIZE 4096U
bool samePage(void *a, void *b)
{
    return ((uintptr_t)a ^ (uintptr_t)b) < PAGESIZE;
}
or:
return !(((uintptr_t)a ^ (uintptr_t)b) / PAGESIZE);
Note the result of the division will be converted to bool. If this is used as an inline, it will just be tested for zero/not zero.
The XOR will zero all bits which are equal. So if any higher order bits differ, they will be set after XOR, and make the result >= PAGESIZE. This saves you one division or masking.
This requires PAGESIZE to be a power of two, of course.
Your attempt to solve the interview question is wrong.
You should be comparing a and b. Not &a and &b.
But even then it would still be wrong.
Consider that pointer a points to the last position of page 0 and pointer b points to the first position of page 1, where page 1 is the one right after page 0.
Their difference is 1, but they are on different pages.
In order to implement it correctly, you should consider that a page is 4 KiB long: 4 KiB = 2^12 = 4096. So all the bits of a pair of pointers except the last 12 will be equal if they are on the same page.
#include <stdint.h>
int AreOnSamePage(void* a, void* b){
    return ((intptr_t)a & ~(intptr_t)0xFFF) ==
           ((intptr_t)b & ~(intptr_t)0xFFF);
}
A more concise but equivalent implementation:
int AreOnSamePage(void* a, void* b){
    return ((intptr_t)a)>>12 == ((intptr_t)b)>>12;
}
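A quick hypothetical usage sketch exercising any of the implementations above (the addresses are made up purely for illustration and are never dereferenced):
#include <stdio.h>
#include <stdint.h>
int AreOnSamePage(void *a, void *b);   /* any of the versions above */
int main(void)
{
    void *a = (void *)(uintptr_t)0x7ffd1000;   /* page 0x7ffd1 */
    void *b = (void *)(uintptr_t)0x7ffd1ffc;   /* same page */
    void *c = (void *)(uintptr_t)0x7ffd2000;   /* next page */
    printf("%d %d\n", AreOnSamePage(a, b), AreOnSamePage(a, c));   /* expect: 1 0 */
    return 0;
}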

C code for alignment on Intel Core 2 Duo

I've been given the following C code for alignment:
struct s *p, *new_p;
p = (struct s*) malloc(sizeof(struct s) + BOUND - 1);
new_p = (struct s*) (((int) p + BOUND - 1) & ~(BOUND - 1));
where BOUND represents 32 bytes. A cache line is 32 bytes, like on the Pentium II and III, but I cannot figure out how p and new_p get aligned. Are both aligned, or only new_p?
Also, I have this code for a 64 B cache line, for a set-associative cache with 8 blocks in each set and a size of 32 KB:
int *tempA, *tempB;
...
pA = (int *) malloc (sizeof(int)*N + 63);
tempA = (int *)(((int)pA+63)&~(63));
tempB = (int *)((((int)pA+63)&~(63))+4096+64);
Accompanied by this remark: there will be a penalty if you access more than 8 addresses with a separation of 4 KB.
The whole doesn't make much sense to me. Any ideas of what's going on?
Why not use _Alignas() (since C11)?
Casting a pointer to int is an invitation to disaster (aka undefined behaviour). Just think about a 64-bit machine with a 32-bit int (standard for most x86-64 ABIs). If you need arithmetic on pointers, use uintptr_t (I would not recommend using intptr_t, though). However, even here, arithmetic on the value is still undefined (but very likely safe for platforms with a single, linear address space).
Standard note: do not cast void * as returned by malloc().
Update:
OK, let's take the code above and give it proper formatting and typing:
#include <stdint.h>
// align to this boundary (must be power of two!)
#define ALIGN_BOUNDARY 64U
Do not use magic numbers in your code! 2 months later you will wonder what that means.
int *tempA, *tempB;
How are those used?
int *pA = malloc (sizeof(int) * N + ALIGN_BOUNDARY - 1);
uintptr_t adjA = ((uintptr_t)pA + (ALIGN_BOUNDARY - 1)) & ~((uintptr_t) (ALIGN_BOUNDARY - 1));
This just rounds up the address to the next aligned boundary (here: 64 bytes).
tempA = (int *)adjA;
tempB = (int *)(adjA + 4096 + 64);
Not sure what the latter is good for, but with the malloc given, that will result in disaster due to accessing beyond the allocated block if used with the same indexes (0..N) as *pA.
In any case, I would be very, very careful with this code. Not only is it apparently badly written/documented, it also seems to contain errors.
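To make the rounding step concrete, here is a hedged worked example with a small helper (the address 0x1003 and the function name are made up): with a 64-byte boundary, (0x1003 + 63) & ~63 == 0x1042 & ~63 == 0x1040, the first multiple of 64 at or above 0x1003. So only the adjusted pointer (new_p / tempA) is guaranteed aligned; the pointer malloc returned keeps whatever alignment malloc gave it, and it is the one that must eventually be passed to free(). (C11's aligned_alloc is another option for getting aligned storage directly.)
#include <stdint.h>
/* Sketch: round a pointer up to the next multiple of a power-of-two boundary. */
static void *align_up(void *p, uintptr_t boundary)
{
    return (void *)(((uintptr_t)p + boundary - 1) & ~(boundary - 1));
}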

Safe, efficient way to access unaligned data in a network packet from C

I'm writing a program in C for Linux on an ARM9 processor. The program is to access network packets which include a sequence of tagged data like:
<fieldID><length><data><fieldID><length><data> ...
The fieldID and length fields are both uint16_t. The data can be 1 or more bytes (up to 64k if the full length was used, but it's not).
As long as <data> has an even number of bytes, I don't see a problem. But if I have a 1- or 3- or 5-byte <data> section, then the next 16-bit fieldID ends up not on a 16-bit boundary and I anticipate alignment issues. It's been a while since I've done anything like this from scratch, so I'm a little unsure of the details. Any feedback welcome. Thanks.
To avoid alignment issues in this case, access all data as an unsigned char *. So:
unsigned char *p;
//...
uint16_t id = p[0] | (p[1] << 8);
p += 2;
The above example assumes "little endian" data layout, where the least significant byte comes first in a multi-byte number.
You should have functions (inline and/or templated if the language you're using supports those features) that will read the potentially unaligned data and return the data type you're interested in. Something like:
uint16_t unaligned_uint16( void* p)
{
    // this assumes big-endian values in data stream
    // (which is common, but not universal in network
    // communications) - this may or may not be
    // appropriate in your case
    unsigned char* pByte = (unsigned char*) p;
    uint16_t val = (pByte[0] << 8) | pByte[1];
    return val;
}
The easy way is to manually rebuild the uint16_ts, at the expense of speed:
uint8_t *packet = ...;
uint16_t fieldID = (packet[0] << 8) | packet[1]; // assumes big-endian (network) byte order on the wire
uint16_t length = (packet[2] << 8) | packet[3];
uint8_t *data = packet + 4;
packet += 4 + length;
If your processor supports it, you can type-pun or use a union (but beware of strict aliasing).
uint16_t fieldID = htons(*(uint16_t *)packet);
uint16_t length = htons(*(uint16_t *)(packet + 2));
Note that unaligned accesses aren't always supported (e.g. they might generate a fault of some sort); on other architectures they're supported, but with a performance penalty.
If the packet isn't aligned, you could always copy it into a static buffer and then read it:
static char static_buffer[65540];
memcpy(static_buffer, packet, packet_size); // make sure packet_size <= 65540
uint16_t fieldId = htons(*(uint16_t *)static_buffer);
uint16_t length = htons(*(uint16_t *)(static_buffer + 2));
Personally, I'd just go for option #1, since it'll be the most portable.
Alignment is always going to be fine, although perhaps not super-efficient, if you go through a byte pointer.
Setting aside issues of endian-ness, you can memcpy from the 'real' byte pointer into whatever you want/need that is properly aligned and you will be fine.
(this works because the generated code will load/store the data as bytes, which is alignment safe. It's when the generated assembly has instructions loading and storing 16/32/64 bits of memory in a mis-aligned manner that it all falls apart).
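A hedged sketch of that memcpy approach for the packet format in the question (the function and parameter names are illustrative; the byte order is assumed to be network/big-endian):
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohs */
/* Sketch: read a possibly-unaligned <fieldID><length> header safely. */
static void read_tlv_header(const uint8_t *p, uint16_t *fieldID, uint16_t *length)
{
    uint16_t tmp;
    memcpy(&tmp, p, sizeof tmp);        /* byte-wise copy, alignment-safe */
    *fieldID = ntohs(tmp);
    memcpy(&tmp, p + 2, sizeof tmp);
    *length = ntohs(tmp);
}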
