How can return value be modified? - c

I have a routine calling gSoap API function soap_malloc. But the program gives me a segmentation fault whenever I try to access the memory allocated by soap_malloc. When I use gdb to debug it. I find that inside soap_malloc, return value stored in register %rax is 0x7fffec0018f0. But on return, %rax changed to 0xffffffffec0018f0. Only the lower 32-bit was retained, the higher 32-bit all changed to 1. And that lead to accessing an address which is quite high, therefore caused the routine stopped. Thanks for you all to give me any ideas on how can this be happening. I'm running my multi-thread program in Ubuntu12.04 x86-64.
This is how I call it:
void *temp = soap_malloc(soap, 96);
And this is the soap_malloc implementation(only that else part is executed, and macro SOAP_MALLOC is just passing the second argument to malloc, SOAP_CANARY is constant 0xC0DE):
#ifndef SOAP_MALLOC /* use libc malloc */
# define SOAP_MALLOC(soap, size) malloc(size)
#endif
#ifndef SOAP_CANARY
# define SOAP_CANARY (0xC0DE)
#endif
void* soap_malloc(struct soap *soap, size_t n)
{ register char *p;
if (!n)
return (void*)SOAP_NON_NULL;
if (!soap)
return SOAP_MALLOC(soap, n);
if (soap->fmalloc)
p = (char*)soap->fmalloc(soap, n);
else
{ n += sizeof(short);
n += (-(long)n) & (sizeof(void*)-1); /* align at 4-, 8- or 16-byte boundary */
if (!(p = (char*)SOAP_MALLOC(soap, n + sizeof(void*) + sizeof(size_t))))
{ soap->error = SOAP_EOM;
return NULL;
}
/* set the canary to detect corruption */
*(unsigned short*)(p + n - sizeof(unsigned short)) = (unsigned short)SOAP_CANARY;
/* keep chain of alloced cells for destruction */
*(void**)(p + n) = soap->alist;
*(size_t*)(p + n + sizeof(void*)) = n;
soap->alist = p + n;
}
soap->alloced = 1;
return p;
}
This is the definition of SOAP_NON_NULL:
static const char soap_padding[4] = "\0\0\0";
#define SOAP_NON_NULL (soap_padding)
Updates(2013-03-12)
I explicitly declared soap_malloc as returning void * and the problem solved. Previously, the returned value is truncated to int and the sign bit 1 was extended when assigning the result to void *temp.

Does the calling code have a proper prototype for the soap_malloc() function in scope?
It seems the void * is being converted to int, i.e. the default return type. This happens if the function hasn't been properly declared.

Related

How to create a process that runs a routine with variable number of parameters?

I know there are lots of questions here about functions that take a variable number of arguments. I also know there's lots of docs about stdarg.h and its macros. And I also know how printf-like functions take a variable number of arguments. I already tried each of those alternatives and they didn't help me. So, please, keep that in mind before marking this question as duplicate.
I'm working on the process management features of a little embedded operating system and I'm stuck on the design of a function that can create processes that run a function with a variable number of parameters. Here's a simplified version of how I want my API to looks like:
// create a new process
// * function is a pointer to the routine the process will run
// * nargs is the number of arguments the routine takes
void create(void* function, uint8_t nargs, ...);
void f1();
void f2(int i);
void f3(float f, int i, const char* str);
int main()
{
create(f1, 0);
create(f2, 1, 9);
create(f3, 3, 3.14f, 9, "string");
return 0;
}
And here is a pseudocode for the relevant part of the implementation of system call create:
void create(void* function, uint8_t nargs, ...)
{
process_stack = create_stack();
first_arg = &nargs + 1;
copy_args_list_to_process_stack(process_stack, first_arg);
}
Of course I'll need to know the calling convention in order to be able to copy from create's activation record to the new process stack, but that's not the problem. The problem is how many bytes do I need to copy. Even though I know how many arguments I need to copy, I don't know how much space each of those arguments occupy. So I don't know when to stop copying.
The Xinu Operating System does something very similar to what I want to do, but I tried hard to understand the code and didn't succeed. I'll transcript a very simplified version of the Xinu's create function here. Maybe someone understand and help me.
pid32 create(void* procaddr, uint32 ssize, pri16 priority, char *name, int32 nargs, ...)
{
int32 i;
uint32 *a; /* points to list of args */
uint32 *saddr; /* stack address */
saddr = (uint32 *)getstk(ssize); // return a pointer to the new process's stack
*saddr = STACKMAGIC; // STACKMAGIC is just a marker to detect stack overflow
// this is the cryptic part
/* push arguments */
a = (uint32 *)(&nargs + 1); /* start of args */
a += nargs -1; /* last argument */
for ( ; nargs > 4 ; nargs--) /* machine dependent; copy args */
*--saddr = *a--; /* onto created process's stack */
*--saddr = (long)procaddr;
for(i = 11; i >= 4; i--)
*--saddr = 0;
for(i = 4; i > 0; i--) {
if(i <= nargs)
*--saddr = *a--;
else
*--saddr = 0;
}
}
I got stuck on this line: a += nargs -1;. This should move the pointer a 4*(nargs - 1) ahead in memory, right? What if an argument's size is not 4 bytes? But that is just the first question. I also didn't understand the next lines of the code.
If you are writing an operating system, you also define the calling convention(s) right? Settle for argument sizes of sizeof(void*) and pad as necessary.

Distributed shared memory [duplicate]

This question already has answers here:
Distributed shared memory library for C++? [closed]
(3 answers)
Closed 8 years ago.
My current system runs on Linux, with the different tasks using shared memory to access the common data (which is defined as a C struct). The size of the shared data is about 100K.
Now, I want to run the main user interface on Windows, while keeping all the other tasks in Linux, and I’m looking for the best replacement for the shared memory. The refresh rate of the UI is about 10 times per second.
My doubt is if it’s better to do it by myself or to use a third party solution. If I do it on my own, my idea is to open a socket and then use some sort of data serialization between the client (Windows) and the server (Linux).
In the case of third party solutions, I’m a bit overwhelmed by the number of options. After some search, it seems to me that the right solution could be MPI, but I would like to consider other options before start working with it: XDR, OpenMP, JSON, DBus, RDMA, memcached, boost libraries … Has anyone got any experience with any of them? Which are the pros and cons of using such solutions for accessing the shared memory on Linux from Windows?
Maybe MPI or the other third party solutions are too elaborated for such a simple use as mine and I should use a do-it-yourself approach? Any advice if I take that solution? Am I missing something? Am I looking in the wrong direction? I wouldn’t like to reinvent the wheel.
For the file sharing, I’m considering Samba.
My development is done with C and C++, and, of course, the solution needs to be compiled in Linux and Windows.
AFAIK (but I don't know Windows) you cannot share memory (à la shm_overview(7)) on both Linux and Windows at once (you could share data, using some network protocol, this is what memcached does).
However, you can make a Linux process answering to network messages from a Windows machine (but that is not shared memory!).
Did you consider using a Web interface? You could add a web interface -with ajax techniques- on Linux (to your Linux software), e.g. using FastCGI or an HTTP server library like libonion or wt. Then your Windows users could use their browser to interact with your program (which would still run on some Linux compute server).
PS. Read the wikipage on distributed shared memory. It means something different than what you are asking! Read also about message passing!
I'd recommend using a plain TCP socket between the user interface and the service, with simple tagged message frames passed between the two in preferably architecture-independent manner (i.e. with specific byte order and integer and floating-point types).
On the server end, I'd use a helper process or thread, that maintains two local copies of the shared state struct per client. First one reflects the state the client knows about, and the second one is used to snapshot the shared state at regular intervals. When the two differ, the helper sends the differing fields to the client. Conversely, the client can send modification requests to the helper.
I would not send the 100k shared data structure as a single chunk; it'd be very inefficient. Also, assuming the state contains multiple fields, having the client send the entire new 100k state would overwrite fields not meant to by the client.
I'd use one message for each field in the state, or atomically manipulated set of fields. The message structure itself should be very simple. At minimum, each message should start with a fixed-size length and type fields, so that it is trivial to receive the messages. It is always a good idea to leave the possibility of later extending the capabilities/fields, without breaking backwards compatibility, too.
You don't need a lot of code to implement the above. You do need some accessor/manipulator code for each different type of field in the state (char, short, int, long, double, float, and any other type or struct you might use), and that (and the helper that compares the active state with the simulated user state) will probably be the bulk of the code.
Although I don't normally recommend reinventing the wheel, in this case I do recommend writing your own message passing code, directly on top of TCP, simply because all the libraries I know of that could be used for this, are rather unwieldy or would impose some design choices I'd rather leave to the application.
If you want, I can provide some example code that would hopefully illustrate this better. I don't have Windows, but Qt does have a QTcpSocket class you can use that works the same in both Linux and Windows, so there shouldn't be (m)any differences. In fact, if you design the message structure well, you could write the UI in some other language like Python, without any differences to the server itself.
Promised examples follow; apologies for maximum-length post.
Paired counters are useful when there is a single writer for a specific field, possibly many readers, and you wish to add minimal overhead to the write path. Reading never blocks, and a reader will see whether they got a valid copy or if an update was in progress and they should not rely on the value they saw. (This also allows reliable write semantics for a field in an async-signal-safe context, where pthread locking is not guaranteed to work.)
This implementation works best on GCC-4.7 and later, but should work with earlier C compilers, too. Legacy __sync built-ins also work with Intel, Pathscale, and PGI C compilers. counter.h:
#ifndef COUNTER_H
#define COUNTER_H
/*
* Atomic generation counter support.
*
* A generation counter allows a single writer thread to
* locklessly modify a value non-atomically, while letting
* any number of reader threads detect the change. Reader
* threads can also check if the value they observed is
* "atomic", or taken during a modification in progress.
*
* There is no protection against multiple concurrent writers.
*
* If the writer gets canceled or dies during an update,
* the counter will get garbled. Reinitialize before relying
* on such a counter.
*
* There is no guarantee that a reader will observe an
* "atomic" value, if the writer thread modifies the value
* more often than the reader thread can read it.
*
* Two typical use cases:
*
* A) A single writer requires minimum overhead/latencies,
* whereas readers can poll and retry if necessary
*
* B) Non-atomic value or structure modified in
* an interrupt context (async-signal-safe)
*
* Initialization:
*
* Counter counter = COUNTER_INITIALIZER;
*
* or
*
* Counter counter;
* counter_init(&counter);
*
* Write sequence:
*
* counter_acquire(&counter);
* (modify or write value)
* counter_release(&counter);
*
* Read sequence:
*
* unsigned int check;
* check = counter_before(&counter);
* (obtain a copy of the value)
* if (check == counter_after(&counter))
* (a copy of the value is atomic)
*
* Read sequence with spin-waiting,
* will spin forever if counter garbled:
*
* unsigned int check;
* do {
* check = counter_before(&counter);
* (obtain a copy of the value)
* } while (check != counter_after(&counter));
*
* All these are async-signal-safe, and will never block
* (except for the spinning loop just above, obviously).
*/
typedef struct {
volatile unsigned int value[2];
} Counter;
#define COUNTER_INITIALIZER {{0U,0U}}
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
/*
* GCC 4.7 and later provide __atomic built-ins.
* These are very efficient on x86 and x86-64.
*/
static inline void counter_init(Counter *const counter)
{
/* This is silly; assignments should suffice. */
do {
__atomic_store_n(&(counter->value[1]), 0U, __ATOMIC_SEQ_CST);
__atomic_store_n(&(counter->value[0]), 0U, __ATOMIC_SEQ_CST);
} while (__atomic_load_n(&(counter->value[0]), __ATOMIC_SEQ_CST) |
__atomic_load_n(&(counter->value[1]), __ATOMIC_SEQ_CST));
}
static inline unsigned int counter_before(Counter *const counter)
{
return __atomic_load_n(&(counter->value[1]), __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_after(Counter *const counter)
{
return __atomic_load_n(&(counter->value[0]), __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_acquire(Counter *const counter)
{
return __atomic_fetch_add(&(counter->value[0]), 1U, __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_release(Counter *const counter)
{
return __atomic_fetch_add(&(counter->value[1]), 1U, __ATOMIC_SEQ_CST);
}
#else
/*
* Rely on legacy __sync built-ins.
*
* Because there is no __sync_fetch(),
* counter_before() and counter_after() are safe,
* but not optimal (especially on x86 and x86-64).
*/
static inline void counter_init(Counter *const counter)
{
/* This is silly; assignments should suffice. */
do {
counter->value[0] = 0U;
counter->value[1] = 0U;
__sync_synchronize();
} while (__sync_fetch_and_add(&(counter->value[0]), 0U) |
__sync_fetch_and_add(&(counter->value[1]), 0U));
}
static inline unsigned int counter_before(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[1]), 0U);
}
static inline unsigned int counter_after(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[0]), 0U);
}
static inline unsigned int counter_acquire(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[0]), 1U);
}
static inline unsigned int counter_release(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[1]), 1U);
}
#endif
#endif /* COUNTER_H */
Each shared state field type needs their own structure defined. For our purposes, I'll just use a Value structure to describe one of them. It is up to you to define all the needed Field and Value types to match your needs. For example, a 3D vector of double precision floating-point components would be
typedef struct {
double x;
double y;
double z;
} Value;
typedef struct {
Counter counter;
Value value;
} Field;
where the modifying thread uses e.g.
void set_field(Field *const field, Value *const value)
{
counter_acquire(&field->counter);
field->value = value;
counter_release(&field->counter);
}
and each reader maintains their own local shapshot using e.g.
typedef enum {
UPDATED = 0,
UNCHANGED = 1,
BUSY = 2
} Updated;
Updated check_field(Field *const local, const Field *const shared)
{
Field cache;
cache.counter[0] = counter_before(&shared->counter);
/* Local counter check allows forcing an update by
* simply changing one half of the local counter.
* If you don't need that, omit the local-only part. */
if (local->counter[0] == local->counter[1] &&
cache.counter[0] == local->counter[0])
return UNCHANGED;
cache.value = shared->value;
cache.counter[1] = counter_after(&shared->counter);
if (cache.counter[0] != cache.counter[1])
return BUSY;
*local = cache;
return UPDATED;
}
Note that the cache above constitutes one local copy of the value, so each reader only needs to maintain one copy of the fields it is interested in. Also, the above accessor function only modifies local with a successful snapshot.
When using pthread locking, pthread_rwlock_t is a good option, as long as the programmer understands the priority issues with multiple readers and writers. (see man pthread_rwlock_rdlock and man pthread_rwlock_wrlock for details.)
Personally, I would just stick to mutexes with as short holding times as possible, to reduce overall complexity. The Field structure for a mutex-protected change-counted value I'd use would probably be something like
typedef struct {
pthread_mutex_t mutex;
Value value;
volatile unsigned int change;
} Field;
void set_field(Field *const field, const Value *const value)
{
pthread_mutex_lock(&field->mutex);
field->value = *value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
__atomic_fetch_add(&field->change, 1U, __ATOMIC_SEQ_CST);
#else
__sync_fetch_and_add(&field->change, 1U);
#endif
pthread_mutex_unlock(&field->mutex);
}
void get_field(Field *const local, Field *const field)
{
pthread_mutex_lock(&field->mutex);
local->value = field->value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
local->value = __atomic_fetch_n(&field->change, __ATOMIC_SEQ_CST);
#else
local->value = __sync_fetch_and_add(&field->change, 0U);
#endif
pthread_mutex_unlock(&field->mutex);
}
Updated check_field(Field *const local, Field *const shared)
{
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
if (local->change == __atomic_fetch_n(&field->change, __ATOMIC_SEQ_CST))
return UNCHANGED;
#else
if (local->change == __sync_fetch_and_add(&field->change, 0U))
return UNCHANGED;
#endif
pthread_mutex_lock(&shared->mutex);
local->value = shared->value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
local->change = __atomic_fetch_n(&field->change, __ATOMIC_SEQ_CST);
#else
local->change = __sync_fetch_and_add(&field->change, 0U);
#endif
pthread_mutex_unlock(&shared->mutex);
}
Note that the mutex field in the local copy is unused, and can be omitted from the local copy (defining a suitable Cache etc. structure type without it).
The semantics with respect to the change field are simple: It is always accessed atomically, and can therefore be read without holding the mutex. It is only modified while holding the mutex (but because readers do not hold the mutex, it must be modified atomically even then).
Note that if your server is restricted to x86 or x86-64 architectures, you can use ordinary accesses to change above (e.g. field->change++), as unsigned int access is atomic on these architectures; no need to use built-in functions. (Although increment is non-atomic, readers will only ever see the old or the new unsigned int, so that's not a problem either.)
On the server, you'll need some kind of "agent" process or thread, acting on behalf of the remote clients, sending field changes to each client, and optionally receiving field contents for update from each client.
Efficient implementation of such an agent is a big question. A properly thought-out recommendation would require more detailed information on the system. The simplest possible implementation, I guess, would be a low-priority thread simply scanning the shared state continuously, reporting any field changes not known to each client as it sees them. If change rate is high -- say, usually more than a dozen changes per second over the entire state -- then that may be quite acceptable. If change rate varies, adding a global change counter (modified whenever a single field is modified, i.e. in set_field()) and sleeping (or at least yielding) when no global change is observed, is better.
If there are multiple remote clients, I would use a single agent, which collects changes into a queue, with a cached copy of the state as seen at the head of the queue (with previous changes being those that lead up to it). Initially connected clients get the entire state, then each queued change following that. Note that the queue entries do not need to represent fields as such, but the messages sent to each client to update that field, and that if a latter update replaces an already queued value, the already queued value can be removed, reducing the amount of data that needs to be sent to clients that are catching up. This is an interesting topic on its own, and would warrant a question of its own, in my opinion.
Like I mentioned above, I would personally use a plain TCP socket between the user interface and the service.
There are four main issues with respect to the TCP messages:
Byte order and data formats (unless both server and remote clients are guaranteed to use the same hardware architecture)
My preferred solution is to use native byte order (but well-defined standard types) when sending, with recipient doing any byte order conversions. This requires that each end sends an initial message with predetermined "prototype values" for each integer and floating-point type used.
I assume integers use two's complement format with no padding bits, and double and float are double- and single-precision IEEE-754 types, respectively. Most current hardware architectures have these natively (although byte ordering does vary). Weird architectures can and do emulate these in software, typically by supplying suitable command-line switches or using some libraries.
Message framing (as TCP is a stream of bytes, not a stream of packets, from the application perspective)
Most robust option is to prefix each message with a fixed size type field and a fixed size length field. (Usually, the length field indicates how many additional bytes are part of this message, and will include any possible padding bytes the sender might have added.)
The recipient will receive TCP input until it has enough cached for the fixed type and length parts to examine the complete packet length. Then it receives more data, until it has a full TCP packet. A minimal receive buffer implementation is something like
Extensibility
Backwards compatibility is essential for making updates and upgrades easier. This means you'll probably need to include some kind of version number in the initial message from server to client, and have both server and client ignore messages they do not know. (This obviously requires a length field in each message, as recipients cannot infer the length of messages they do not recognize.)
In some cases you might need the ability to mark a message important, in the sense that if the recipient does not recognize it, further communications should be aborted. This is easiest to achieve by using some kind of identifier or bit in the type field. For example, PNG file format is very similar to the message format I've described here, with a four-byte (four ASCII character) type field, and a four-byte (32-bit) length field for each message ("chunk"). If the first type character is uppercase ASCII, it means it is a critical one.
Passing initial state
Passing the initial state as a single chunk fixes the entire shared state structure. Instead, I recommend passing each field or record separately, just as if they were updates. Optionally, when all the initial fields have been sent, you can send a message that informs the client they have the entire state; this should let the client first receive the complete state, then construct and render the user interface, without having to dynamically adjust to varying number of fields.
Whether each client requires the full state, or just a smaller subset of the state, is another question. When a remote client connects to a server, it probably should include some kind of identifier the server can use to make that determination.
Here is messages.h, a header file with inline implementation of integer and floating-point type handling and byte order detection:
#ifndef MESSAGES_H
#define MESSAGES_H
#include <stdint.h>
#include <string.h>
#include <errno.h>
#if defined(__GNUC__)
static const struct __attribute__((packed)) {
#else
#pragma pack(push,1)
#endif
const uint64_t u64; /* Offset +0 */
const int64_t i64; /* +8 */
const double dbl; /* +16 */
const uint32_t u32; /* +24 */
const int32_t i32; /* +28 */
const float flt; /* +32 */
const uint16_t u16; /* +36 */
const int16_t i16; /* +38 */
} prototypes = {
18364758544493064720ULL, /* u64 */
-1311768465156298103LL, /* i64 */
0.71948481353325643983254167324048466980457305908203125,
3735928559UL, /* u32 */
-195951326L, /* i32 */
1.06622731685638427734375f, /* flt */
51966U, /* u16 */
-7658 /* i16 */
};
#if !defined(__GNUC__)
#pragma pack(pop)
#endif
/* Add prototype section to a buffer.
*/
size_t add_prototypes(unsigned char *const data, const size_t size)
{
if (size < sizeof prototypes) {
errno = ENOSPC;
return 0;
} else {
memcpy(data, &prototypes, sizeof prototypes);
return sizeof prototypes;
}
}
/*
* Byte order manipulation functions.
*/
static void reorder64(void *const dst, const void *const src, const int order)
{
if (order) {
uint64_t value;
memcpy(&value, src, 8);
if (order & 1)
value = ((value >> 8U) & 0x00FF00FF00FF00FFULL)
| ((value & 0x00FF00FF00FF00FFULL) << 8U);
if (order & 2)
value = ((value >> 16U) & 0x0000FFFF0000FFFFULL)
| ((value & 0x0000FFFF0000FFFFULL) << 16U);
if (order & 4)
value = ((value >> 32U) & 0x00000000FFFFFFFFULL)
| ((value & 0x00000000FFFFFFFFULL) << 32U);
memcpy(dst, &value, 8);
} else
if (dst != src)
memmove(dst, src, 8);
}
static void reorder32(void *const dst, const void *const src, const int order)
{
if (order) {
uint32_t value;
memcpy(&value, src, 4);
if (order & 1)
value = ((value >> 8U) & 0x00FF00FFUL)
| ((value & 0x00FF00FFUL) << 8U);
if (order & 2)
value = ((value >> 16U) & 0x0000FFFFUL)
| ((value & 0x0000FFFFUL) << 16U);
memcpy(dst, &value, 4);
} else
if (dst != src)
memmove(dst, src, 4);
}
static void reorder16(void *const dst, const void *const src, const int order)
{
if (order & 1) {
const unsigned char c[2] = { ((const unsigned char *)src)[0],
((const unsigned char *)src)[1] };
((unsigned char *)dst)[0] = c[1];
((unsigned char *)dst)[1] = c[0];
} else
if (dst != src)
memmove(dst, src, 2);
}
/* Detect byte order conversions needed.
*
* If the prototypes cannot be supported, returns -1.
*
* If prototype record uses native types, returns 0.
*
* Otherwise, bits 0..2 are the integer order conversion,
* and bits 3..5 are the floating-point order conversion.
* If 'order' is the return value, use
* reorderXX(local, remote, order)
* for integers, and
* reorderXX(local, remote, order/8)
* for floating-point types.
*
* For adjusting records sent to server, just do the same,
* but with order obtained by calling this function with
* parameters swapped.
*/
int detect_order(const void *const native, const void *const other)
{
const unsigned char *const source = other;
const unsigned char *const target = native;
unsigned char temp[8];
int iorder = 0;
int forder = 0;
/* Verify the size of the types.
* C89/C99/C11 says sizeof (char) == 1, but check that too. */
if (sizeof (double) != 8 ||
sizeof (int64_t) != 8 ||
sizeof (uint64_t) != 8 ||
sizeof (float) != 4 ||
sizeof (int32_t) != 4 ||
sizeof (uint32_t) != 4 ||
sizeof (int16_t) != 2 ||
sizeof (uint16_t) != 2 ||
sizeof (unsigned char) != 1 ||
sizeof prototypes != 40)
return -1;
/* Find byte order for the largest floating-point type. */
while (1) {
reorder64(temp, source + 16, forder);
if (!memcmp(temp, target + 16, 8))
break;
if (++forder >= 8)
return -1;
}
/* Verify forder works for all other floating-point types. */
reorder32(temp, source + 32, forder);
if (memcmp(temp, target + 32, 4))
return -1;
/* Find byte order for the largest integer type. */
while (1) {
reorder64(temp, source + 0, iorder);
if (!memcmp(temp, target + 0, 8))
break;
if (++iorder >= 8)
return -1;
}
/* Verify iorder works for all other integer types. */
reorder64(temp, source + 8, iorder);
if (memcmp(temp, target + 8, 8))
return -1;
reorder32(temp, source + 24, iorder);
if (memcmp(temp, target + 24, 4))
return -1;
reorder32(temp, source + 28, iorder);
if (memcmp(temp, target + 28, 4))
return -1;
reorder16(temp, source + 36, iorder);
if (memcmp(temp, target + 36, 2))
return -1;
reorder16(temp, source + 38, iorder);
if (memcmp(temp, target + 38, 2))
return -1;
/* Everything works. */
return iorder + 8 * forder;
}
/* Verify current architecture is supportable.
* This could be a compile-time check.
*
* (The buffer contains prototypes for network byte order,
* and actually returns the order needed to convert from
* native to network byte order.)
*
* Returns -1 if architecture is not supported,
* a nonnegative (0 or positive) value if successful.
*/
static int verify_architecture(void)
{
static const unsigned char network_endian[40] = {
254U, 220U, 186U, 152U, 118U, 84U, 50U, 16U, /* u64 */
237U, 203U, 169U, 135U, 238U, 204U, 170U, 137U, /* i64 */
63U, 231U, 6U, 5U, 4U, 3U, 2U, 1U, /* dbl */
222U, 173U, 190U, 239U, /* u32 */
244U, 82U, 5U, 34U, /* i32 */
63U, 136U, 122U, 35U, /* flt */
202U, 254U, /* u16 */
226U, 22U, /* i16 */
};
return detect_order(&prototypes, network_endian);
}
#endif /* MESSAGES_H */
Here is an example function that a sender can use to pack a 3-component vector identified by a 32-bit unsigned integer, into a 36-byte message. This uses a message frame starting with four bytes ("Vec3"), followed by the 32-bit frame length, followed by the 32-bit identifier, followed by three doubles:
size_t add_vector(unsigned char *const data, const size_t size,
const uint32_t id,
const double x, const double y, const double z)
{
const uint32_t length = 4 + 4 + 4 + 8 + 8 + 8;
if (size < (size_t)bytes) {
errno = ENOSPC;
return 0;
}
/* Message identifier, four bytes */
buffer[0] = 'V';
buffer[1] = 'e';
buffer[2] = 'c';
buffer[3] = '3';
/* Length field, uint32_t */
memcpy(buffer + 4, &length, 4);
/* Identifier, uint32_t */
memcpy(buffer + 8, &id, 4);
/* Vector components */
memcpy(buffer + 12, &x, 8);
memcpy(buffer + 20, &y, 8);
memcpy(buffer + 28, &z, 8);
return length;
}
Personally, I prefer shorter, 16-bit identifiers and 16-bit length; that also limits maximum message length to 65536 bytes, making 65536 bytes a good read/write buffer size.
The recipient can handle the received TCP data stream for example thus:
static unsigned char *input_data; /* Data buffer */
static size_t input_size; /* Data buffer size */
static unsigned char *input_head; /* Next byte in buffer */
static unsigned char *input_tail; /* After last buffered byte */
static int input_order; /* From detect_order() */
/* Function to handle "Vec3" messages: */
static void handle_vec3(unsigned char *const msg)
{
uint32_t id;
double x, y, z;
reorder32(&id, msg+8, input_order);
reorder64(&x, msg+12, input_order/8);
reorder64(&y, msg+20, input_order/8);
reorder64(&z, msg+28, input_order/8);
/* Do something with vector id, x, y, z. */
}
/* Function that tries to consume thus far buffered
* input data -- typically run once after each
* successful TCP receive. */
static void consume(void)
{
while (input_head + 8 < input_tail) {
uint32_t length;
/* Get current packet length. */
reorder32(&length, input_head + 4, input_order);
if (input_head + length < input_tail)
break; /* Not all read, yet. */
/* We have a full packet! */
/* Handle "Vec3" packet: */
if (input_head[0] == 'V' &&
input_head[1] == 'e' &&
input_head[2] == 'c' &&
input_head[3] == '3')
handle_vec3(input_head);
/* Advance to next packet. */
input_head += length;
}
if (input_head < input_tail) {
/* Move partial message to start of buffer. */
if (input_head > input_data) {
const size_t have = input_head - input_data;
memmove(input_data, input_head, have);
input_head = input_data;
input_tail = input_data + have;
}
} else {
/* Buffer is empty. */
input_head = input_data;
input_tail = input_data;
}
}
Questions?

How to skip a line doing a buffer overflow in C

I want to skip a line in C, the line x=1; in the main section using bufferoverflow; however, I don't know why I can not skip the address from 4002f4 to the next address 4002fb in spite of the fact that I am counting 7 bytes form <main+35> to <main+42>.
I also have configured the options the randomniZation and execstack environment in a Debian and AMD environment, but I am still getting x=1;. What it's wrong with this procedure?
I have used dba to debug the stack and the memory addresses:
0x00000000004002ef <main+30>: callq 0x4002a4 **<function>**
**0x00000000004002f4** <main+35>: movl $0x1,-0x4(%rbp)
**0x00000000004002fb** <main+42>: mov -0x4(%rbp),%esi
0x00000000004002fe <main+45>: mov $0x4629c4,%edi
void function(int a, int b, int c)
{
char buffer[5];
int *ret;
ret = buffer + 12;
(*ret) += 8;
}
int main()
{
int x = 0;
function(1, 2, 3);
x = 1;
printf("x = %i \n", x);
return 0;
}
You must be reading Smashing the Stack for Fun and Profit article. I was reading the same article and have found the same problem it wasnt skipping that instruction. After a few hours debug session in IDA I have changed the code like below and it is printing x=0 and b=5.
#include <stdio.h>
void function(int a, int b) {
int c=0;
int* pointer;
pointer =&c+2;
(*pointer)+=8;
}
void main() {
int x =0;
function(1,2);
x = 3;
int b =5;
printf("x=%d\n, b=%d\n",x,b);
getch();
}
In order to alter the return address within function() to skip over the x = 1 in main(), you need two pieces of information.
1. The location of the return address in the stack frame.
I used gdb to determine this value. I set a breakpoint at function() (break function), execute the code up to the breakpoint (run), retrieve the location in memory of the current stack frame (p $rbp or info reg), and then retrieve the location in memory of buffer (p &buffer). Using the retrieved values, the location of the return address can be determined.
(compiled w/ GCC -g flag to include debug symbols and executed in a 64-bit environment)
(gdb) break function
...
(gdb) run
...
(gdb) p $rbp
$1 = (void *) 0x7fffffffe270
(gdb) p &buffer
$2 = (char (*)[5]) 0x7fffffffe260
(gdb) quit
(frame pointer address + size of word) - buffer address = number of bytes from local buffer variable to return address
(0x7fffffffe270 + 8) - 0x7fffffffe260 = 24
If you are having difficulties understanding how the call stack works, reading the call stack and function prologue Wikipedia articles may help. This shows the difficulty in making "buffer overflow" examples in C. The offset of 24 from buffer assumes a certain padding style and compile options. GCC will happily insert stack canaries nowadays unless you tell it not to.
2. The number of bytes to add to the return address to skip over x = 1.
In your case the saved instruction pointer will point to 0x00000000004002f4 (<main+35>), the first instruction after function returns. To skip the assignment you need to make the saved instruction pointer point to 0x00000000004002fb (<main+42>).
Your calculation that this is 7 bytes is correct (0x4002fb - 0x4002fb = 7).
I used gdb to disassemble the application (disas main) and verified the calculation for my case as well. This value is best resolved manually by inspecting the disassembly.
Note that I used a Ubuntu 10.10 64-bit environment to test the following code.
#include <stdio.h>
void function(int a, int b, int c)
{
char buffer[5];
int *ret;
ret = (int *)(buffer + 24);
(*ret) += 7;
}
int main()
{
int x = 0;
function(1, 2, 3);
x = 1;
printf("x = %i \n", x);
return 0;
}
output
x = 0
This is really just altering the return address of function() rather than an actual buffer overflow. In an actual buffer overflow, you would be overflowing buffer[5] to overwrite the return address. However, most modern implementations use techniques such as stack canaries to protect against this.
What you're doing here doesn't seem to have much todo with a classic bufferoverflow attack. The whole idea of a bufferoverflow attack is to modify the return adress of 'function'. Disassembling your program will show you where the ret instruction (assuming x86) takes its adress from. This is what you need to modify to point at main+42.
I assume you want to explicitly provoke the bufferoverflow here, normally you'd need to provoke it by manipulating the inputs of 'function'.
By just declaring a buffer[5] you're moving the stackpointer in the wrong direction (verify this by looking at the generated assembly), the return adress is somewhere deeper inside in the stack (it was put there by the call instruction). In x86 stacks grow downwards, that is towards lower adresses.
I'd approach this by declaring an int* and moving it upward until I'm at the specified adress where the return adress has been pushed, then modify that value to point at main+42 and let function ret.
You can't do that this way.
Here's a classic bufferoverflow code sample. See what happens once you feed it with 5 and then 6 characters from your keyboard. If you go for more (16 chars should do) you'll overwrite base pointer, then function return address and you'll get segmentation fault. What you want to do is to figure out which 4 chars overwrite the return addr. and make the program execute your code. Google around linux stack, memory structure.
void ff(){
int a=0; char b[5];
scanf("%s",b);
printf("b:%x a:%x\n" ,b ,&a);
printf("b:'%s' a:%d\n" ,b ,a);
}
int main() {
ff();
return 0;
}

Problem with setjmp/longjmp

The code below is just not working.
Can anybody point out why
#define STACK_SIZE 1524
static void mt_allocate_stack(struct thread_struct *mythrd)
{
unsigned int sp = 0;
void *stck;
stck = (void *)malloc(STACK_SIZE);
sp = (unsigned int)&((stck));
sp = sp + STACK_SIZE;
while((sp % 8) != 0)
sp--;
#ifdef linux
(mythrd->saved_state[0]).__jmpbuf[JB_BP] = (int)sp;
(mythrd->saved_state[0]).__jmpbuf[JB_SP] = (int)sp-500;
#endif
}
void mt_sched()
{
fprintf(stdout,"\n Inside the mt_sched");
fflush(stdout);
if ( current_thread->state == NEW )
{
if ( setjmp(current_thread->saved_state) == 0 )
{
mt_allocate_stack(current_thread);
fprintf(stdout,"\n Jumping to thread = %u",current_thread->thread_id);
fflush(stdout);
longjmp(current_thread->saved_state, 2);
}
else
{
new_fns();
}
}
}
All I am trying to do is to run the new_fns() on a new stack. But is is showing segmentation fault at new_fns().
Can anybody point me out what's wrong.
Apart all other considerations, you are using "&stck" instead ok "stck" as stack! &stck points to the cell containing the POINTER TO the allocated stack
Then, some observations:
1) setjmp is not intended for this purpose: this code may work only on some systems, and perhaps only with som runtime library versions.
2) I think that BP should be evaluated in some other way. I suggest to check how you compiled composes a stack frame. I.e., on x86 platforms EBP points to the base of the local context, and at *EBP you can find the address of the base of the calling context. ESP points to EBP-SIZE_OF_LOCAL_CONTEXT, different compilers usually compute that size in a different way.
As far as I can see, you are implementig some sort of "fibers". If you are working on Win32, there is aready a set of function that implements in a safe way this functionality (see "fibers"). On linux I suggest you to have a look to "libfiber".
Regards

Generators in C

I got this chunk of code I can't really comprehend.
I got stuck after it replaced pow2s's g to a map's gen structure. And from there, I can't see how it continues tracking the value, and how it is stored.
The code compiles and runs.
Can someone help me understand this code? Thanks!
PS: I'm learning C
It is translated from the follow Python code:
>>> def pow2s():
yield 1
for i in map((lambda x:2*x),pow2s()):
yield i
>>> def mymap(f,iter):
for i in iter:
yield f(i)
And the translated C code:
#include <stdio.h>
#include <stdlib.h>
struct gen { // generic structure, the base of all generators
int (*next)() ;
int continue_from ;
} ;
typedef int (*fptr)() ;
// Each iterator has 3 components: a structure, a constructor for the structure,
// and a next function
// map
struct mapgen { // structure for map
int (*next)() ;
int continue_from ; // not really required, provided for compatibility
fptr f ;
struct gen *g ;
} ;
int map_next(struct mapgen *p) { // next function for map
return p->f(p->g->next(p->g)) ;
}
struct gen *map(fptr f, struct gen *g) { // constructor for map iterator
struct mapgen *p = (struct mapgen *)malloc(sizeof(struct mapgen));
p->next = map_next;
p->continue_from = 0;
p->f = f;
p->g = g;
return (struct gen *)p ;
}
// powers of 2
struct pow2s { // structure
int (*next)() ;
int continue_from ;
struct gen *g ;
};
int times2(int x) { // anonymous lambda is translated into this
return 2*x ;
}
struct gen *pow2() ; // forward declaration of constructor
int pow2next(struct pow2s * p){ // next function for iterator
switch(p->continue_from) {
case 0:
p->continue_from = 1;
return 1;
case 1:
p->g = map(times2,pow2()) ;
p->continue_from = 2;
return p->g->next(p->g) ;
case 2:
p->continue_from = 2;
return p->g->next(p->g) ;
}
}
struct gen * pow2() { // constructor for pow2
struct pow2s * p = (struct pow2s *)malloc(sizeof(struct pow2s));
p->next = pow2next;
p->continue_from = 0;
return (struct gen *)p;
}
// in main, create an iterator and print some of its elements.
int main() {
int i ;
struct gen * p = pow2() ;
for(i=0;i<10;i++)
printf("%d ",p->next(p)) ;
printf("\n");
}
The code shows how you can generate an arbitrary sequence of numbers
by means of 'generators'.
generators are a popular tool in dynamic languages like python and enable one to
iterate over an arbitrary long sequence without allocating the whole sequence at once.
The tracking happens in the lines
p->next = pow2next;
p->continue_from = 0;
Which tells p that it should call pow2next to obtain the next item in the sequence
and continue_from = 0 to indicate that where at the start of the sequence.
When you call p->next(p) it will in fact just call pow2next with p as it's parameter.
For the first call this will simply return 1 and increment continue_from to 2.
switch(p->continue_from) {
case 0:
p->continue_from = 1;
return 1;
/* ... */
On the second call (continue_from = 2) it will create a new map_gen structure working
on a fresh struct pow2s and using the function times2:
case 1:
p->g = map(times2,pow2()) ;
p->continue_from = 2;
return p->g->next(p->g) ;
/* ... */
All further calls go through p->g->next(p->g) which uses times2 and map_gen to retrieve the next
value / create new map_gen structures as needed.
All value tracking is done using the struct-member continue_from or by using return codes.
While showing an interesting approach to generators in C I have to state that this code leaks memory!
As you can see it allocates new structures using malloc but it never free's them.
I hope this explanation is not to confusing even if you're just starting to learn C.
If you really want to understand generators you may like to have a little python ex-course ;)
UPDATE
As you stated in your comment none of the generators ever seems to return a value > 2.
The key to the increasing numbers lies in the function map_next:
int map_next(struct mapgen *p) {
return p->f(p->g->next(p->g));
}
What this does is, instead of returning a fix, number it applies p->f()
(in our case the function times2() to the result of p->g->next(p->g).
This is a recursive call.
It will continue to call map_next() on each map_gen in the list until it reaches the last one.
This last element will return a fix value (either 1 or 2).
Which is then passed back to the previous call which will apply times2() on it and return the result to it's caller, which in turn will apply times2() on it and return the result to it's caller.... (you get the idea).
All these recursive calls sum up and form the final value.
If you print out the result of each pow2next() call you will get this:
/* first call */
1
pow2next: returning 1
pow2next: returning 2 /* times2(1) */
2
pow2next: returning 1
pow2next: returning 2 /* times2(1) */
pow2next: returning 4 /* times2(2) */
4
pow2next: returning 1
pow2next: returning 2 /* times2(1) */
pow2next: returning 4 /* times2(2) */
pow2next: returning 8 /* times2(4) */
8
pow2next: returning 1
pow2next: returning 2 /* times2(1) */
pow2next: returning 4 /* times2(2) */
pow2next: returning 8 /* times2(4) */
pow2next: returning 16 /* times2(8) */
16
/* and so on */
You can see clearly how the result of the top most call is passed all the way back to the first call to form the result.
It tracks the value by growing a tail of struct mapgen instances mixed with times2 instances
Each call to pow2next adds another pair to the tail.
The only value of this example is as an illustration of how much high level languages do for us, and how naive implementation of high-level concepts can kill your performance.

Resources