ReadProcessMemory with __int64 address - c

Hey guys, I want to read some memory from a process at an address range I already found with Cheat Engine. I defined a region to scan (0x190D186FF -> 0x190D1870A), but the address is too big to store in a plain int, so I use an __int64. With that change, ReadProcessMemory doesn't seem to handle the address anymore.
When I compile I get three warnings for VirtualProtectEx and ReadProcessMemory: cast to pointer from integer of different size.
How can I read from such a large address?
int main( int argc, char *argv[] ) {
    HWND hWnd;
    DWORD PID;
    HANDLE hProc;
    __int64 address;
    char mem = 0;
    PDWORD oldProtect = 0;
    int valid = 0;
    char inputPID[4];

    printf( "What is the program PID ?\n" );
    fgets( inputPID, sizeof( inputPID ), stdin );
    PID = (DWORD)atoi( inputPID );

    hProc = OpenProcess( PROCESS_VM_READ, false, PID );
    if ( !hProc ) {
        printf( "Error: Couldn't open process '%i'\n", PID );
        return 0;
    }

    for ( address = 0x190D186FF; address <= 0x190D1870A; address++ ) {
        VirtualProtectEx( hProc, (PVOID)address, (SIZE_T)sizeof( address ), PAGE_READONLY, oldProtect );
        valid = ReadProcessMemory( hProc, (PCVOID)address, &mem, (DWORD)sizeof( char ), NULL );
        if ( valid ) {
            printf( "Memory value at 0x%I64x: '%c'\n", address, mem );
        }
        VirtualProtectEx( hProc, (PVOID)address, (SIZE_T)sizeof( address ), (DWORD)oldProtect, NULL );
    }
    system( "pause" );
}

Your problem is that you're trying to stuff 64 bits of data into 32-bit variables. Your compiler doesn't automatically target x64 just because you're on a 64-bit OS; you need to change your build configuration to compile for x64.
There are two ways you can go about making this easier on yourself.
1) Compile for the same architecture as the process you're going to be interacting with; this alleviates many problems. Use uintptr_t or UINT_PTR for all your addresses and offsets, which resolves to the correct pointer size (32-bit or 64-bit) depending on which target you compile for. A sketch of this approach follows below.
2) Make your own typedef like
#define TARGET_X64
#ifdef TARGET_X64
typedef unsigned __int64 addr_ptr;
#else
typedef unsigned int addr_ptr;
#endif
Then define TARGET_X64 when you're interacting with an x64 process. If you do it this way and compile as x32, certain APIs will have complications when accessing x64 processes, and vice versa.
I highly recommend using the first method.
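For the first method, here is a minimal sketch (my illustration, not from the original answer) of the question's loop rewritten with uintptr_t, assuming the program is built as x64 to match the target process; the address range and PROCESS_VM_READ access right are taken from the question:
#include <windows.h>
#include <stdio.h>
#include <stdint.h>
/* Build as x64 so uintptr_t is 8 bytes and the integer-to-pointer cast is lossless. */
void read_range( HANDLE hProc ) {
    char mem = 0;
    uintptr_t address;
    for ( address = 0x190D186FF; address <= 0x190D1870A; address++ ) {
        SIZE_T bytesRead = 0;
        /* same-size cast: no "different size" warning anymore */
        if ( ReadProcessMemory( hProc, (LPCVOID)address, &mem, sizeof( mem ), &bytesRead ) ) {
            printf( "Memory value at %p: '%c'\n", (void *)address, mem );
        }
    }
}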

Related

How to get "valid data length" of a file?

There is a function to set the "valid data length" value, SetFileValidData, but I didn't find a way to get the value.
I want to know, for a given file, whether the EOF differs from the VDL, because writing after the VDL when VDL < EOF causes a performance penalty, as described here.
I found this page, which claims that:
there is no mechanism to query the value of the VDL
So the answer is "you can't".
If you care about performance, you can set the VDL to the EOF, but note that this may expose old garbage on your disk: the region between those two pointers, which would read back as zeros if you accessed the file without setting the VDL to the EOF.
Looked into this. There is no way to get this information via any API, not even NtQueryInformationFile (FileEndOfFileInformation only worked with NtSetInformationFile). So in the end I read it by manually parsing NTFS records. If anyone has a better way, please tell! This obviously only works with full system access (and on NTFS), and it might be out of sync with the in-memory information Windows uses.
#include <windows.h>
#include <winioctl.h>
#include <string>
#include <vector>

// Assumption: the original code relied on project-local integer typedefs.
typedef unsigned long long uint64;
typedef long long int64;

#pragma pack(push)
#pragma pack(1)
struct NTFSFileRecord
{
    char magic[4];
    unsigned short sequence_offset;
    unsigned short sequence_size;
    uint64 lsn;
    unsigned short sequence_number;
    unsigned short hardlink_count;
    unsigned short attribute_offset;
    unsigned short flags;
    unsigned int real_size;
    unsigned int allocated_size;
    uint64 base_record;
    unsigned short next_id;
    //char padding[470];
};

struct MFTAttribute
{
    unsigned int type;
    unsigned int length;
    unsigned char nonresident;
    unsigned char name_length;
    unsigned short name_offset;
    unsigned short flags;
    unsigned short attribute_id;
    unsigned int attribute_length;
    unsigned short attribute_offset;
    unsigned char indexed_flag;
    unsigned char padding1;
    //char padding2[488];
};

struct MFTAttributeNonResident
{
    unsigned int type;
    unsigned int length;
    unsigned char nonresident;
    unsigned char name_length;
    unsigned short name_offset;
    unsigned short flags;
    unsigned short attribute_id;
    uint64 starting_vnc;
    uint64 last_vnc;
    unsigned short run_offset;
    unsigned short compression_size;
    unsigned int padding;
    uint64 allocated_size;
    uint64 real_size;
    uint64 initial_size;
};
#pragma pack(pop)
HANDLE GetVolumeData(const std::wstring& volfn, NTFS_VOLUME_DATA_BUFFER& vol_data)
{
    HANDLE vol = CreateFileW(volfn.c_str(), GENERIC_WRITE | GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (vol == INVALID_HANDLE_VALUE)
        return vol;

    DWORD ret_bytes;
    BOOL b = DeviceIoControl(vol, FSCTL_GET_NTFS_VOLUME_DATA,
        NULL, 0, &vol_data, sizeof(vol_data), &ret_bytes, NULL);
    if (!b)
    {
        CloseHandle(vol);
        return INVALID_HANDLE_VALUE;
    }
    return vol;
}
int64 GetFileValidData(HANDLE file, HANDLE vol, const NTFS_VOLUME_DATA_BUFFER& vol_data)
{
    BY_HANDLE_FILE_INFORMATION hfi;
    BOOL b = GetFileInformationByHandle(file, &hfi);
    if (!b)
        return -1;

    NTFS_FILE_RECORD_INPUT_BUFFER record_in;
    record_in.FileReferenceNumber.HighPart = hfi.nFileIndexHigh;
    record_in.FileReferenceNumber.LowPart = hfi.nFileIndexLow;

    std::vector<BYTE> buf;
    buf.resize(sizeof(NTFS_FILE_RECORD_OUTPUT_BUFFER) + vol_data.BytesPerFileRecordSegment - 1);
    NTFS_FILE_RECORD_OUTPUT_BUFFER* record_out = reinterpret_cast<NTFS_FILE_RECORD_OUTPUT_BUFFER*>(buf.data());
    DWORD bout;
    // pass the real buffer size (the original hard-coded 4096 here)
    b = DeviceIoControl(vol, FSCTL_GET_NTFS_FILE_RECORD, &record_in,
        sizeof(record_in), record_out, static_cast<DWORD>(buf.size()), &bout, NULL);
    if (!b)
        return -1;

    NTFSFileRecord* record = reinterpret_cast<NTFSFileRecord*>(record_out->FileRecordBuffer);
    unsigned int currpos = record->attribute_offset;
    MFTAttribute* attr = nullptr;
    // Walk the attribute list until the 0xFFFFFFFF end marker, staying inside the buffer
    while ((attr == nullptr || attr->type != 0xFFFFFFFF)
        && record_out->FileRecordBuffer + currpos + sizeof(MFTAttribute) < buf.data() + bout)
    {
        attr = reinterpret_cast<MFTAttribute*>(record_out->FileRecordBuffer + currpos);
        // 0x80 is the $DATA attribute
        if (attr->type == 0x80
            && record_out->FileRecordBuffer + currpos + attr->attribute_offset + sizeof(MFTAttributeNonResident)
               < buf.data() + bout)
        {
            if (attr->nonresident == 0)
                return -1;
            MFTAttributeNonResident* dataattr = reinterpret_cast<MFTAttributeNonResident*>(record_out->FileRecordBuffer
                + currpos + attr->attribute_offset);
            return dataattr->initial_size;
        }
        currpos += attr->length;
    }
    return -1;
}
[...]
NTFS_VOLUME_DATA_BUFFER vol_data;
HANDLE vol = GetVolumeData(L"\\??\\D:", vol_data);
if (vol != INVALID_HANDLE_VALUE)
{
    int64 vdl = GetFileValidData(alloc_test->getOsHandle(), vol, vol_data);
    if (vdl >= 0) { [...] }
    [...]
}
[...]
SetFileValidData (according to MSDN) can be used, for example, to create a large file without having to write to it. For a database this allocates a (contiguous) storage area.
As a result, the file size on disk changes without any data having been written to the file.
By implication, any GetValidData (which does not exist) would just return the size of the file, so you can use GetFileSize, which returns the "valid" file size.
I think you are confused as to what "valid data length" actually means. Check this answer.
Basically, while SetEndOfFile lets you increase the length of a file quickly and allocates the disk space, if you then seek to the (new) end of file to write there, all the additionally allocated disk space has to be overwritten with zeroes first, which is kind of slow.
SetFileValidData lets you skip that zeroing-out. You're telling the system, "I am OK with whatever is in those disk blocks, get on with it". (This is why you need the SE_MANAGE_VOLUME_NAME privilege: not overwriting the data could reveal privileged data to unprivileged users. Users with this privilege can access the raw drive data anyway.)
In either case, you have set the new effective size of the file (which you can read back). What, exactly, should a separate "read file valid data" report? SetFileValidData told the system that whatever is in those disk blocks is "valid"...
A different way to explain it:
The documentation mentions that the "valid data length" is being tracked; the purpose is for the system to know which range (from end-of-valid-data to end-of-file) it still needs to zero out on behalf of SetEndOfFile, when necessary (e.g. when you close the file). You don't need to read this value back, because the only way it could differ from the actual file size is that you, yourself, changed it via the aforementioned functions...
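For completeness, here is a minimal sketch (my addition, not from the answers above) of the extend-without-zeroing pattern described here: enable SE_MANAGE_VOLUME_NAME on the process token, grow the file with SetEndOfFile, then move the VDL up to the new EOF. Error handling is minimal, and the caller's account is assumed to actually hold the privilege:
#include <windows.h>
static BOOL EnableManageVolumePrivilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token))
        return FALSE;
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if (!LookupPrivilegeValue(NULL, SE_MANAGE_VOLUME_NAME, &tp.Privileges[0].Luid))
    {
        CloseHandle(token);
        return FALSE;
    }
    BOOL ok = AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL)
        && GetLastError() == ERROR_SUCCESS;
    CloseHandle(token);
    return ok;
}
BOOL ExtendWithoutZeroing(HANDLE file, LONGLONG newSize)
{
    LARGE_INTEGER li;
    li.QuadPart = newSize;
    if (!SetFilePointerEx(file, li, NULL, FILE_BEGIN) || !SetEndOfFile(file))
        return FALSE;
    /* Moving the VDL to EOF skips the zero-fill; old on-disk bytes become readable. */
    return SetFileValidData(file, newSize);
}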

User-threaded scheduling API on mac OSX using ucontext & signals

I'm designing a scheduling algorithm that has the following features:
Have 2 user-threads (contexts) in one process (I'm supposed to use 3 threads, but that didn't work on OS X yet, so I decided to make 2 work for now)
preemptive, using a SIGALRM signal that goes off every 1 sec and switches control from one context to another, saving the current state (registers and current position) of the context that was running before the switch.
What I have noticed is the following:
The ucontext.h library behaves strangely on Mac OS X, whereas on Linux it behaves exactly as it is supposed to (the example from this man page: http://man7.org/linux/man-pages/man3/makecontext.3.html works perfectly on Linux, whereas on Mac it fails with a segmentation fault before it does any swapping). Unfortunately, I have to make it run on OS X, not Linux.
I managed to work around the swapcontext error on OS X by using getcontext() and then setcontext() to do the swapping of contexts.
In my signal handler I use sa_sigaction( int sig, siginfo_t *s, void *cntxt ), since the 3rd argument, recast as a ucontext_t pointer, holds the information about the context that was interrupted (which is true on Linux, as I tested), but on Mac it doesn't point to the proper location; when I use it I get a segmentation fault yet again.
I designed my test functions for each context to loop inside a while loop, as I want to interrupt them and make sure they resume at the proper location within that function. I defined a static global count variable that helps me see whether I was in the proper user-thread or not.
One last note: I found out that calling getcontext() inside the while loop in the test functions constantly updates the position of the current context, since it is an empty while loop, so calling setcontext() when that context's turn comes makes it execute from the proper place. This solution is redundant, though, since these functions will be provided from outside the API.
#include <stdio.h>
#include <ucontext.h>   /* not <sys/ucontext.h> */
#include <signal.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <errno.h>
/*****************************************************************************/
/* time-utility */
/*****************************************************************************/
#include <sys/time.h> // struct timeval

void timeval_add_s( struct timeval *tv, uint64_t s ) {
    tv->tv_sec += s;
}

void timeval_diff( struct timeval *c, struct timeval *a, struct timeval *b ) {
    // use signed variables
    long aa;
    long bb;
    long cc;

    aa = a->tv_sec;
    bb = b->tv_sec;
    cc = aa - bb;
    cc = cc < 0 ? -cc : cc;
    c->tv_sec = cc;

    aa = a->tv_usec;
    bb = b->tv_usec;
    cc = aa - bb;
    cc = cc < 0 ? -cc : cc;
    c->tv_usec = cc;
}
/*****************************************************************************/
/* Variables */
/*****************************************************************************/
static int count;
/* For now only T1 & T2 are used */
static ucontext_t T1, T2, T3, Main, Main_2;
ucontext_t *ready_queue[ 4 ] = { &T1, &T2, &T3, &Main_2 };
static int thread_count;
static int current_thread;
/* timer struct */
static struct itimerval a;
static struct timeval now, then;
/* SIGALRM struct */
static struct sigaction sa;
#define USER_THREAD_SWITCH_TIME 1
static int check;
/*****************************************************************************/
/* signals */
/*****************************************************************************/
void handle_schedule( int sig, siginfo_t *s, void *cntxt ) {
    ucontext_t *temp_current = (ucontext_t *) cntxt;
    if( check == 0 ) {
        check = 1;
        printf("We were in main context user-thread\n");
    } else {
        ready_queue[ current_thread - 1 ] = temp_current;
        printf("We were in User-Thread # %d\n", count );
    }
    if( current_thread == thread_count ) {
        current_thread = 0;
    }
    printf("---------------------------X---------------------------\n");
    setcontext( ready_queue[ current_thread++ ] );
}

/* initializes the signal handler for SIGALRM and sets all the values for the alarm */
static void start_init( void ) {
    int r;
    sa.sa_sigaction = handle_schedule;
    sigemptyset( &sa.sa_mask );
    sa.sa_flags = SA_SIGINFO;
    r = sigaction( SIGALRM, &sa, NULL );
    if( r == -1 ) {
        printf("Error: cannot handle SIGALRM\n");
        goto out;
    }
    gettimeofday( &now, NULL );
    timeval_diff( &( a.it_value ), &now, &then );
    timeval_add_s( &( a.it_interval ), USER_THREAD_SWITCH_TIME );
    setitimer( ITIMER_REAL, &a, NULL );
out:
    return;
}
/*****************************************************************************/
/* Thread Init */
/*****************************************************************************/
static void thread_create( void (*task_func)(void), int arg_num, int task_arg ) {
    ucontext_t *thread_temp = ready_queue[ thread_count ];
    getcontext( thread_temp );
    thread_temp->uc_link = NULL;
    thread_temp->uc_stack.ss_size = SIGSTKSZ;
    thread_temp->uc_stack.ss_sp = malloc( SIGSTKSZ );
    thread_temp->uc_stack.ss_flags = 0;
    if( arg_num == 0 ) {
        makecontext( thread_temp, task_func, arg_num );
    } else {
        makecontext( thread_temp, task_func, arg_num, task_arg );
    }
    thread_count++;
}
/*****************************************************************************/
/* Testing Functions */
/*****************************************************************************/
void thread_funct( int i ) {
    printf( "---------------------------------This is User-Thread #%d--------------------------------\n", i );
    while(1) { count = i; } //getcontext( ready_queue[ 0 ] );
}

void thread_funct_2( int i ) {
    printf( "---------------------------------This is User-Thread #%d--------------------------------\n", i );
    while(1) { count = i; } //getcontext( ready_queue[ 1 ] );
}
/*****************************************************************************/
/* Main Functions */
/*****************************************************************************/
int main( void ) {
    gettimeofday( &then, NULL );
    thread_create( (void (*)(void))thread_funct, 1, 1 );
    thread_create( (void (*)(void))thread_funct_2, 1, 2 );
    start_init();
    while(1);
    printf( "completed\n" );
    return 0;
}
What am I doing wrong here? I have to change this around a bit to run it properly on Linux, and running the version that works on Linux on OS X causes a segmentation fault. Why would it work on one OS and not the other?
Is this related by any chance to the stack size I allocate for each context?
Am I supposed to allocate stack space for my signal handler? (The docs say that if I don't, a default stack is used, and if I do it doesn't seem to make a difference.)
If ucontext will never give predictable behavior on Mac OS X, what is the alternative for implementing user-threading on OS X? I tried using setjmp & longjmp, but I ran into the same issue: when a context is interrupted in the middle of executing a function, how can I get the exact position where it was interrupted, so I can continue where I left off next time?
So after days of testing and debugging I finally got this. I had to dig deep into the implementation of ucontext.h and found differences between the two OSes. It turns out that the OS X implementation of ucontext.h is different from Linux's. For instance, the mcontext_t struct within the ucontext_t struct, which usually holds the values of the context's registers (IP, SP, BP, general registers, ...), is declared as a pointer on OS X, whereas on Linux it is embedded directly.
A couple of other things needed to be set as well, especially the context's stack pointer (rsp), base pointer (rbp), instruction pointer (rip) and destination index (rdi) registers. All of these had to be set correctly at the creation of each context, as well as after it returns for the first time. I also had to create an mcontext struct to hold these registers and point my ucontext_t struct's uc_mcontext pointer at it.
After all that was done, I was able to use the ucontext_t pointer passed as an argument to the sa_sigaction signal handler (after recasting it to a ucontext_t pointer) to resume exactly where the context left off last time. Bottom line: it was a messy affair. Anyone interested in more details can msg me. JJ out.
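For anyone hitting the same segfault, here is a minimal sketch of a plain context swap on OS X. The _XOPEN_SOURCE requirement is my assumption based on the Darwin headers (which only embed the mcontext data inside ucontext_t when it is defined), not something stated in the answer above:
/* Define _XOPEN_SOURCE before ANY include so <ucontext.h> exposes the full
   ucontext_t, whose trailing mcontext data member gives uc_mcontext real
   storage to point at. */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t child, parent;

static void task(void) {
    puts("in child context");   /* returning here resumes parent via uc_link */
}

int main(void) {
    char *stack = malloc(SIGSTKSZ);
    getcontext(&child);              /* fills the registers and uc_mcontext */
    child.uc_link = &parent;         /* where to go when task() returns */
    child.uc_stack.ss_sp = stack;
    child.uc_stack.ss_size = SIGSTKSZ;
    makecontext(&child, task, 0);
    swapcontext(&parent, &child);    /* run task(), then come back here */
    puts("back in parent");
    free(stack);
    return 0;
}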

What's a good way to implement simple clone() based multithread library?

I'm trying to build a simple multithread library for Linux using clone() and other kernel facilities. I've come to a point where I'm not really sure what the correct way to do things is. I tried going through the original NPTL code, but it's a bit too much.
That's how for instance I imagine the create method:
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define FIBER_STACK (1024 * 64)   /* stack size; the define was missing in the original */

typedef int sk_thr_id;
typedef void *sk_thr_arg;
typedef int (*sk_thr_func)(sk_thr_arg);

sk_thr_id sk_thr_create(sk_thr_func f, sk_thr_arg a) {
    void *stack = malloc(FIBER_STACK);
    if (stack == 0) {
        perror("malloc: could not allocate stack");
        exit(1);
    }
    /* the stack grows down, so pass clone the top of the allocation */
    return clone(f, (char *)stack + FIBER_STACK,
                 SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM, a);
}
1: I'm not really sure what the correct clone() flags should be. I just found these being used in a simple example. Any general directions here will be welcome.
Here are parts of the mutex primitives built on futexes (not my own code for now):
#include <linux/futex.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define cpu_relax() asm volatile("pause\n": : :"memory")
#define barrier() asm volatile("": : :"memory")

static inline unsigned xchg_32(void *ptr, unsigned x)
{
    __asm__ __volatile__("xchgl %0,%1"
                         :"=r" ((unsigned) x)
                         :"m" (*(volatile unsigned *)ptr), "0" (x)
                         :"memory");
    return x;
}

static inline unsigned char xchg_8(void *ptr, char x)
{
    __asm__ __volatile__("xchgb %0,%1"
                         :"=r" ((char) x)
                         :"m" (*(volatile char *)ptr), "0" (x)
                         :"memory");
    return x;
}

int sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3)
{
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}
typedef union mutex mutex;

union mutex
{
    unsigned u;
    struct
    {
        unsigned char locked;
        unsigned char contended;
    } b;
};

int mutex_init(mutex *m, const pthread_mutexattr_t *a)
{
    (void) a;
    m->u = 0;
    return 0;
}

int mutex_lock(mutex *m)
{
    int i;
    /* Try to grab lock */
    for (i = 0; i < 100; i++)
    {
        if (!xchg_8(&m->b.locked, 1)) return 0;
        cpu_relax();
    }
    /* Have to sleep */
    while (xchg_32(&m->u, 257) & 1)
    {
        sys_futex(m, FUTEX_WAIT_PRIVATE, 257, NULL, NULL, 0);
    }
    return 0;
}

int mutex_unlock(mutex *m)
{
    int i;
    /* Locked and not contended */
    if ((m->u == 1) && (cmpxchg(&m->u, 1, 0) == 1)) return 0;
    /* Unlock */
    m->b.locked = 0;
    barrier();
    /* Spin and hope someone takes the lock */
    for (i = 0; i < 200; i++)
    {
        if (m->b.locked) return 0;
        cpu_relax();
    }
    /* We need to wake someone up */
    m->b.contended = 0;
    sys_futex(m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    return 0;
}
2: The main question for me is how to implement the "join" primitive. I know it's supposed to be based on futexes too, but I'm struggling to come up with something.
3: I need some way to clean things up (like the allocated stack) after a thread has finished. I can't really think of a good way to do this either.
Probably for these I'll need an additional per-thread structure in user space with some bookkeeping in it. Can someone point me in a good direction for solving these issues?
4: I'd like a way to tell how much time a thread has been running, how long it has been since it was last scheduled, and other stuff like that. Are there kernel calls providing such info?
Thanks in advance!
The idea that there can exist a "multithreading library" as a third-party library, separate from the rest of the standard library, is an outdated and flawed notion. If you want to do this, you'll have to first drop all use of the standard library; in particular, your call to malloc is completely unsafe if you're calling clone yourself, because:
malloc will have no idea that multiple threads exist, and therefore may fail to perform proper synchronization.
Even if it knew they existed, malloc would need to access an unspecified, implementation-specific structure located at the address given by the thread pointer. As this structure is implementation-specific, you have no way of creating such a structure that will be interpreted correctly by both the current and all future versions of your system's libc.
These issues don't apply just to malloc but to most of the standard library; even async-signal-safe functions may be unsafe to use, as they might dereference the thread pointer for cancellation-related purposes, optimized syscall mechanisms, etc.
If you really insist on making your own threads implementation, you'll have to abstain from using glibc or any modern libc that's integrated with threads, and instead opt for something much more naive like klibc. This could be an educational experiment, but it would not be appropriate for a deployed application.
1) You are using an example based on LinuxThreads. Rather than rewrite good references, I recommend "The Linux Programming Interface" by Michael Kerrisk, chapter 28. In about 25 pages it explains what you need.
2) If you set the CLONE_CHILD_CLEARTID flag, the ctid argument of clone is cleared when the child terminates, and the kernel wakes any futex waiter on it. If you treat that pointer as a futex, you can implement the join primitive (a sketch follows below). Good luck :-) If you don't want to use futexes, also have a look at wait3 and wait4.
3) I'm not sure exactly what you want to clean up, but you can use clone's tls argument, which is a thread-local storage buffer. When the thread has finished, you can clean that buffer.
4) See getrusage.
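Expanding on point 2, here is a minimal sketch of a futex-based join (my illustration, not the answerer's code): pass a ctid pointer to clone with CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID, then FUTEX_WAIT on it until the kernel zeroes it at thread exit. This is essentially what pthread_join does underneath:
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The kernel writes the tid here at start (CLONE_CHILD_SETTID) and
   zeroes it + futex-wakes any waiter at exit (CLONE_CHILD_CLEARTID). */
typedef struct { pid_t ctid; void *stack; } sk_thr;

int sk_thr_create2(sk_thr *t, int (*f)(void *), void *arg) {
    t->stack = malloc(64 * 1024);
    if (!t->stack) return -1;
    return clone(f, (char *)t->stack + 64 * 1024,
                 SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM |
                 CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID,
                 arg, NULL, NULL, &t->ctid);
}

void sk_thr_join(sk_thr *t) {
    pid_t tid;
    /* FUTEX_WAIT returns immediately if ctid no longer holds tid, so no wakeup is lost */
    while ((tid = __atomic_load_n(&t->ctid, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &t->ctid, FUTEX_WAIT, tid, NULL, NULL, 0);
    free(t->stack);   /* the thread is gone; its stack can be reclaimed */
}
This also answers question 3 for the stack: whoever calls join can free it, because the futex wake only happens once the child has exited.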

LVM_GETITEMTEXT for x32 and x64 in C

I've been trying to get the text of ListView items in another process. I found an awesome tutorial on CodeProject, and thanks to that article I was able to do this on x32. But when run against x64, it crashes the application I'm trying to access when SendMessage is called. In the article's comments, people had similar problems because of the different pointer sizes. Some suggested using an x64 compiler, which I can't use; I need my program to run on both x32 and x64. One commenter suggested:
I have the answer. The LVITEM structure is wrong under 64-bit systems. Pointers are 64-bit now, so the text pointer has to be followed by a dummy value, to offset the length member correctly.
I think this would be the best solution, as I could run one exe on both x32 and x64. I just have no idea how to do what he's talking about. I've included my code, which currently works on x32. If anyone can help me out, that would be awesome.
LVITEM lvi, *_lvi;
char item[512];
char *_item;
unsigned long pid;
HANDLE process;

GetWindowThreadProcessId(procList, &pid);
process = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);   /* 0x001f0fff */

/* Allocate the LVITEM and the text buffer inside the target process */
_lvi = (LVITEM*)VirtualAllocEx(process, NULL, sizeof(LVITEM), MEM_COMMIT, PAGE_READWRITE);
_item = (char*)VirtualAllocEx(process, NULL, 512, MEM_COMMIT, PAGE_READWRITE);
lvi.cchTextMax = 512;

int r, c;
for (r = 0; r < rowCount; r++)
{
    for (c = 0; c < columnCount; c++)
    {
        lvi.iSubItem = c;
        lvi.pszText = _item;
        // Insert lvi into the program's memory
        WriteProcessMemory(process, _lvi, &lvi, sizeof(LVITEM), NULL);
        // Have the program write the text into the buffer we gave it
        SendMessage(procList, LVM_GETITEMTEXT, (WPARAM)r, (LPARAM)_lvi);
        // Read the text back out of the program's memory
        ReadProcessMemory(process, _item, item, 512, NULL);
    }
}
// Clean up the mess we made
VirtualFreeEx(process, _lvi, 0, MEM_RELEASE);
VirtualFreeEx(process, _item, 0, MEM_RELEASE);
CloseHandle(process);
I don't think you'll be able to achieve this from a 32-bit process: your pointers will be too short. I believe VirtualAllocEx will fail when called from a 32-bit process with a 64-bit process handle as its first parameter. I think you would see this if you added error checking to your code.
Your only solution will be to ship two versions, x86 and x64. That should be no real trouble; usually it can be done from a single source.
Sending the LVM_GETITEMTEXT message from a 32-bit application to a 64-bit ListView is actually possible.
I was able to achieve it by using not the original LVITEM (60 bytes long) but a padded LVITEM structure (88 bytes long) with seven 4-byte placeholders inserted between members, so the field offsets match the 64-bit layout. It works on my Win7 Pro 64-bit, though I have not yet tested this approach on other machines.
Below is the structure. This is C++, but nothing prevents us from doing the same in .NET.
typedef struct {
    UINT mask;
    int iItem;
    int iSubItem;
    UINT state;
    UINT stateMask;
    int placeholder1;
    LPTSTR pszText;
    int placeholder11;
    int cchTextMax;
    int iImage;
    LPARAM lParam;
    int placeholder2;
#if (_WIN32_IE >= 0x0300)
    int iIndent;
#endif
#if (_WIN32_WINNT >= 0x0501)
    int iGroupId;
    UINT cColumns;
    int placeholder3;
    UINT puColumns;
    int placeholder4;
#endif
#if (_WIN32_WINNT >= 0x0600)
    int piColFmt;
    int placeholder5;
    int iGroup;
    int placeholder6;
#endif
} LVITEM64, *LPLVITEM64;
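If you need a single x32 executable that talks to both kinds of targets, one option (my suggestion, not part of the answer above) is to detect the target's bitness at runtime with the standard GetNativeSystemInfo / IsWow64Process APIs and pick LVITEM or LVITEM64 accordingly:
#include <windows.h>
/* Returns TRUE if the process behind `process` is a native 64-bit process.
   Assumes we are the 32-bit caller described in the question. */
BOOL TargetIs64Bit(HANDLE process)
{
    SYSTEM_INFO si;
    BOOL wow64 = FALSE;
    GetNativeSystemInfo(&si);
    if (si.wProcessorArchitecture == PROCESSOR_ARCHITECTURE_INTEL)
        return FALSE;                /* 32-bit OS: there are no 64-bit processes */
    IsWow64Process(process, &wow64); /* running under WOW64 => target is 32-bit */
    return !wow64;
}
Then use the padded LVITEM64 (and its sizeof) when the target is 64-bit, and the plain LVITEM otherwise.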

Misaligned Pointer Performance

Aren't misaligned pointers supposed to slow down performance (in the best possible case) and, in the worst case, crash your program (assuming the compiler was nice enough to compile your invalid C program)?
Well, the following code doesn't seem to show any performance difference between the aligned and misaligned versions. Why is that?
/* brutality.c */
#ifdef BRUTALITY
xs = (unsigned long *) ((unsigned char *) xs + 1);
#endif
...
/* main.c */
#include <stdio.h>
#include <stdlib.h>

#define size_t_max ((size_t)-1)
#define max_count(var) (size_t_max / (sizeof var))

int main(int argc, char *argv[]) {
    unsigned long sum, *xs, *itr, *xs_end;
    size_t element_count = max_count(*xs) >> 4;
    xs = malloc(element_count * (sizeof *xs));
    if (!xs) exit(1);
    xs_end = xs + element_count - 1;
    sum = 0;
    for (itr = xs; itr < xs_end; itr++)
        *itr = 0;

#include "brutality.c"

    itr = xs;
    while (itr < xs_end)
        sum += *itr++;
    printf("%lu\n", sum);
    /* we could free the malloc-ed memory here */
    /* but we are almost done */
    exit(0);
}
Compiled and tested on two separate machines using
gcc -pedantic -Wall -O0 -std=c99 main.c
for i in {0..9}; do time ./a.out; done
I tested this some time ago on Win32 machines and did not notice much of a penalty on 32-bit. On 64-bit, though, it was significantly slower. For example, I ran the following bit of code. On a 32-bit machine, the times printed were hardly changed. But on a 64-bit machine, the times for the misaligned accesses were nearly twice as long. The times follow the code.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN64 /* assumption: the opening #ifdef was lost in the post */
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char *argv[])
{
    LARGE_INTEGER startCount, endCount, freq;
    int i;
    int offset;
    int iters = atoi(argv[1]);
    char *p = (char*)malloc(16);
    double *d;

    for (offset = 0; offset < 9; offset++)
    {
        d = (double*)(p + offset);
        printf("Address alignment = %u\n", (unsigned int)d % 8);
        *d = 0;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&startCount);
        for (i = 0; i < iters; ++i)
            *d = *d + 1.234;
        QueryPerformanceCounter(&endCount);
        printf("Time: %lf\n",
            (double)(endCount.ENDPART - startCount.ENDPART) / freq.ENDPART);
    }
    return 0;
}
Here are the results on a 64-bit machine. I compiled the code as a 32-bit application.
[P:\t]pointeralignment.exe 100000000
Address alignment = 0
Time: 0.484156
Address alignment = 1
Time: 0.861444
Address alignment = 2
Time: 0.859656
Address alignment = 3
Time: 0.861639
Address alignment = 4
Time: 0.860234
Address alignment = 5
Time: 0.861539
Address alignment = 6
Time: 0.860555
Address alignment = 7
Time: 0.859800
Address alignment = 0
Time: 0.484898
The x86 architecture has always been able to handle misaligned accesses, so you'll never get a crash. Other processors might not be as lucky.
You're probably not seeing any time difference because the loop is memory-bound; it can only run as fast as data can be fetched from RAM. You might think that the misalignment will cause the RAM to be accessed twice, but the first access puts it into cache, and the second access can be overlapped with getting the next value from RAM.
You're assuming an x86 or x64 architecture. On MIPS, for example, your code may raise a SIGBUS (bus error) signal. On other architectures, unaligned accesses are typically slower than aligned ones, though it is very much architecture-dependent.
x86 or x64? Misaligned pointers were a killer on x86, whereas 64-bit architectures are not nearly as prone to crashes, or even to slow performance at all.
It is probably because malloc of that many bytes is returning NULL. At least that's what it does for me.
You never defined BRUTALITY in your posted code. Are you sure you were testing in 'brutal' mode?
Maybe in order to malloc such a huge buffer, the system is paging memory to and from disk. That could swamp small differences. Try a much smaller buffer with a large in-program loop count around it.
I made the modifications suggested here and in the comments and tested on my system (a tired, 4-year-old, 32-bit laptop). Code shown below. I do get a measurable difference, but only around 3%. I maintain my changes are a success, because your question indicates you get no difference at all, correct?
Sorry, I am using Windows and used the Windows-specific GetTickCount() API, which I'm familiar with because I often do timing tests and enjoy the simplicity of that misnamed API (it actually returns milliseconds since system start).
/* main.cpp */
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define BRUTALITY

int main(int argc, char *argv[]) {
    unsigned long i, begin, end;
    unsigned long sum, *xs, *itr, *xs_begin, *xs_end;
    size_t element_count = 100000;

    xs = (unsigned long *)malloc(element_count * (sizeof *xs));
    if (!xs) exit(1);
    xs_end = xs + element_count - 1;
#ifdef BRUTALITY
    xs_begin = (unsigned long *)((unsigned char *)xs + 1);
#else
    xs_begin = xs;
#endif

    begin = GetTickCount();
    for (i = 0; i < 50000; i++)
    {
        for (itr = xs_begin; itr < xs_end; itr++)
            *itr = 0;
        sum = 0;
        itr = xs_begin;
        while (itr < xs_end)
            sum += *itr++;
    }
    end = GetTickCount();

    printf("sum=%lu elapsed time=%lumS\n", sum, end - begin);
    free(xs);
    exit(0);
}
