Why are Lua's internal strings stored the way they are? - c

I wanted a simple string table that would store a bunch of constants, and I thought, "Hey! Lua does that, let me use some of their functions!"
This is mainly in the lstring.h/lstring.c files (I am using 5.2).
I will show the code I am curious about first. It's from lobject.h:
/*
** Header for string value; string bytes follow the end of this structure
*/
typedef union TString {
L_Umaxalign dummy; /* ensures maximum alignment for strings */
struct {
CommonHeader;
lu_byte reserved;
unsigned int hash;
size_t len; /* number of characters in string */
} tsv;
} TString;
/* get the actual string (array of bytes) from a TString */
#define getstr(ts) cast(const char *, (ts) + 1)
/* get the actual string (array of bytes) from a Lua value */
#define svalue(o) getstr(rawtsvalue(o))
As you can see, the data is stored outside of the structure. To get the byte stream, you take the TString pointer and add 1 (pointer arithmetic, so it skips sizeof(TString) bytes), and you have the char* pointer.
Isn't this bad coding, though? It's been DRILLED into me in my C classes to make clearly defined structures. I know I might be stirring up a nest here, but do you really lose that much speed/space defining a structure as a header for data rather than defining a pointer value for that data?

The idea is probably that you allocate the header and the data in one big chunk of data instead of two:
TString *str = (TString*)malloc(sizeof(TString) + <length_of_string>);
In addition to needing just one call to malloc/free, you also reduce memory fragmentation and improve memory locality.
But to answer your question: yes, these kinds of hacks are usually bad practice and should be done with extreme care. And if you do use them, you'll probably want to hide them under a layer of macros/inline functions.
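A minimal sketch of that single-allocation layout, hidden behind small helper functions as suggested. The StrHeader type and the str_new/str_chars names are made up for illustration; this is not Lua's actual code, and hashing is left out:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical header; the string bytes live directly after it. */
typedef union StrHeader {
    long double align_dummy;   /* crude stand-in for Lua's L_Umaxalign */
    struct {
        unsigned int hash;
        size_t len;
    } tsv;
} StrHeader;

/* Same idea as Lua's getstr(): step one whole header past the pointer. */
static char *str_chars(StrHeader *h) {
    return (char *)(h + 1);
}

/* One malloc covers header + bytes + terminating NUL. */
static StrHeader *str_new(const char *src) {
    size_t len = strlen(src);
    StrHeader *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->tsv.hash = 0;           /* hashing omitted in this sketch */
    h->tsv.len = len;
    memcpy(str_chars(h), src, len + 1);
    return h;
}
```

A single free(h) then releases both the header and the characters.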

As rodrigo says, the idea is to allocate the header and string data as a single chunk of memory. It's worth pointing out that you also see the non-standard hack
struct lenstring {
unsigned length;
char data[0];
};
but C99 added flexible array members so it can be done in a standard compliant way as
struct lenstring {
unsigned length;
char data[];
};
If Lua's string were done in this way it'd be something like
typedef union TString {
L_Umaxalign dummy;
struct {
CommonHeader;
lu_byte reserved;
unsigned int hash;
size_t len;
const char data[];
} tsv;
} TString;
#define getstr(ts) ((ts)->tsv.data)
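For completeness, here is roughly how a struct with a flexible array member might be allocated. lenstring_new is a hypothetical helper, not part of any library; the key point is that sizeof does not count data[], so the payload size is added by hand:

```c
#include <stdlib.h>
#include <string.h>

struct lenstring {
    unsigned length;
    char data[];               /* C99 flexible array member */
};

/* Hypothetical constructor: sizeof excludes data[], so add the payload. */
static struct lenstring *lenstring_new(const char *src) {
    size_t len = strlen(src);
    struct lenstring *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->length = (unsigned)len;
    memcpy(s->data, src, len + 1);   /* copies the NUL as well */
    return s;
}
```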

It relates to the complications arising from the more limited C language. In C++, you would just define a base class called GCObject which contains the garbage collection variables, then TString would be a subclass and, by using a virtual destructor, both the TString and its accompanying const char * block would be freed properly.
When it comes to writing the same kind of functionality in C, it's a bit more difficult as classes and virtual inheritance do not exist.
What Lua is doing is implementing garbage collection by inserting the header required to manage the garbage collection status of the part of memory following it. Remember that free(void *) does not need to know anything other than the address of the memory block.
#define CommonHeader GCObject *next; lu_byte tt; lu_byte marked
Lua keeps a linked list of these "collectable" blocks of memory, in this case an array of characters, so that it can then free the memory efficiently without knowing the type of object it is pointing to.
If your TString pointed to another block of memory where the character array was, then the garbage collector would have to determine the object's type and delve into its structure to free the string buffer as well.
The pseudo code for this kind of garbage collection would be something like this:
GCHeader *next, *prev = NULL;
GCHeader *current = firstObject;
while (current)
{
    next = current->next;
    if (/* current is ready for deletion */)
    {
        free(current);
        // relink previous to the next (singly-linked list)
        if (prev)
            prev->next = next;
        else
            firstObject = next; // the head itself was deleted
    }
    else
        prev = current; // store previous undeleted object
    current = next;
}
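The same sweep as a compilable sketch, under the assumption that every object is one malloc block beginning with the header. GCHeader, gc_new and gc_sweep are names invented for this example (Lua's real collector is considerably more involved):

```c
#include <stdlib.h>

/* Hypothetical header; each object is one malloc block starting with it. */
typedef struct GCHeader {
    struct GCHeader *next;
    unsigned char marked;      /* nonzero = still reachable */
} GCHeader;

/* Allocate header + payload in one block and push it onto the list. */
static GCHeader *gc_new(GCHeader **list, size_t payload_size) {
    GCHeader *h = malloc(sizeof *h + payload_size);
    if (h == NULL)
        return NULL;
    h->marked = 0;
    h->next = *list;
    *list = h;
    return h;
}

/* Free every unmarked object; returns the number of objects freed. */
static size_t gc_sweep(GCHeader **list) {
    size_t freed = 0;
    GCHeader **link = list;    /* the pointer that currently refers to the object */
    while (*link != NULL) {
        GCHeader *current = *link;
        if (!current->marked) {
            *link = current->next;   /* unlink, then free header + payload */
            free(current);
            freed++;
        } else {
            current->marked = 0;     /* reset for the next cycle */
            link = &current->next;
        }
    }
    return freed;
}
```

Note that gc_sweep never looks at the payload: because the data follows the header in the same block, one free() is enough regardless of the object's type.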

Related

allocate memory block and assign to a char array for send socket a memory block [duplicate]

I'm writing an application in C (as a beginner) and I'm struggling with getting corrupted data inside a struct that contains a variable length array. I found similar issues described in forum posts on cprogramming.com and also on cert.org/secure-coding. I thought I'd found the right solution, but it seems not.
The struct looks like this;
typedef struct {
int a;
int b;
} pair;
typedef struct {
CommandType name;
pair class;
pair instr;
pair p1;
pair p2;
pair p3;
CommandType expected_next;
char* desc;
int size;
pair sw1;
pair sw2;
pair* data;
} command;
With the problematic one being "command". For any given instance (or whatever the correct phrase would be) of "command" different fields would be set, although in most cases the same fields are set albeit in different instances.
The problem I have is when trying to set the expected_next, name, sw1, sw2, size and data fields. And it's the data field that's getting corrupt. I'm allocating memory for the struct like this;
void *command_malloc(int desc_size,int data_size)
{
return malloc(sizeof(command) +
desc_size*sizeof(char) +
data_size*sizeof(pair));
}
command *cmd;
cmd = command_malloc(0, file_size);
But when I (pretty) print the resulting cmd, the middle of the data field appears to be random garbage. I've stepped through with gdb and can see that the correct data is getting loaded into the field. It appears that it's only when the command gets passed to a different function that it gets corrupted. This code is called inside a function such as;
command* parse(char *line, command *context)
And the pretty-print happens in another function;
void pretty_print(char* line, command* cmd)
I had thought I was doing things correctly, but apparently not. As far as I can tell, I construct other instances of the struct okay (and I duplicated those approaches for this one), but they don't contain any variable length array and their pretty-prints look fine, which concerns me because they might also be broken, just less obviously.
What I'm writing is actually a parser, so a command gets passed into the parse function (which describes the current state, giving hints to the parser what to expect next) and the next command (derived from the input "line") is returned. "context" is free-d at the end of the parse function, with the new command being returned, which would then be passed back into "parse" with the next "line" of input.
Can anyone suggest anything as to why this might be happening?
Many thanks.
When you allocate memory for the structure, only pointer-sized storage is reserved for desc itself. You must also allocate the memory that desc points to (the array contents), as someone already pointed out. The purpose of my answer is to show a slightly different way of doing that.
Since having a pointer desc increases the structure size by a word (sizeof pointer), you can use the variable-length array hack in your structure to reduce the structure size.
Here's how your structure should look; notice that desc[] has been moved to the end of the structure:
typedef struct {
CommandType name;
pair class;
pair instr;
pair p1;
pair p2;
pair p3;
CommandType expected_next;
int size;
pair sw1;
pair sw2;
pair* data;
char desc[];
} command;
Now,
1. Allocate memory for command, including the array size:
command *cmd = malloc(sizeof(command) + desc_length);
2. Use desc:
cmd->desc[desc_length - 1] = '\0';
This hack works only if the member is at the end of the structure. It saves structure size, saves a pointer indirection, and can be used when the array length is specific to each structure instance.
You have to allocate desc and data separately.
When you allocate your struct command *cmd, memory is allocated only for the desc and data pointers themselves. The buffers that desc and data point to have to be malloc'ed separately.
So allocate your command
command *cmd = malloc(sizeof(command));
then allocate memory for data or desc
example for desc:
cmd->desc = malloc( sizeof(char )*100);
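Putting the separate allocations together, a sketch might look like the following. The struct here is a trimmed-down stand-in for the one in the question (most fields omitted), and command_new/command_free are hypothetical helper names. It also shows why the original command_malloc misbehaves: the extra bytes it reserves are never reachable through desc or data, because those pointers are never set.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { int a; int b; } pair;

/* Trimmed-down stand-in for the question's struct (most fields omitted). */
typedef struct {
    char *desc;
    int   size;    /* number of entries in data */
    pair *data;
} command;

/* Release the two buffers first, then the struct itself. */
static void command_free(command *cmd) {
    if (cmd == NULL)
        return;
    free(cmd->desc);
    free(cmd->data);
    free(cmd);
}

/* Hypothetical constructor: one allocation per pointed-to buffer. */
static command *command_new(const char *desc, int data_count) {
    if (data_count < 0)
        return NULL;
    command *cmd = calloc(1, sizeof *cmd);
    if (cmd == NULL)
        return NULL;
    cmd->desc = malloc(strlen(desc) + 1);
    cmd->data = calloc((size_t)data_count, sizeof *cmd->data);
    if (cmd->desc == NULL || (data_count > 0 && cmd->data == NULL)) {
        command_free(cmd);
        return NULL;
    }
    strcpy(cmd->desc, desc);
    cmd->size = data_count;
    return cmd;
}
```

With this layout, freeing "context" at the end of parse would go through command_free so that the buffers are released along with the struct.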

Make struct Array point to another struct Array

I have two structs in a library I cannot change. p.e:
struct{
uint8_t test;
uint8_t data[8];
}typedef aStruct;
struct{
uint8_t value;
uint8_t unimportant_stuff;
char data[8];
}typedef bStruct;
aStruct a;
bStruct b;
In my application there is a process that permanently refreshes my aStructs.
Now I have a buffer of bStructs that I want to keep updated as well.
The data[] array is the important field. I don't really care about the other values of the structs.
I have already made sure that on the specific system where the code runs, a "char" is 8 bits as well.
Now I'd like to make the "b.data" array point to exactly the same values as my "a.data" array, so that if the process refreshes my aStruct, the values in my bStruct are up to date as well.
Since in C an array is only a pointer to the first element, I thought something like this must be possible:
b.data = a.data
But unfortunately this gives me the compiler-error:
error: assignment to expression with array type
Is there a way to do what I intend to do?
Thanks in advance
Okay, according to the input I got from you guys, I think it might be the best thing to redesign my application.
So instead of a buffer of bStruct's I might use a buffer of aStruct*. This makes sure my buffer is always up to date. And then if I need to do something with an element of the buffer, I will write a short getter-function which copies the data from that aStruct* into a temporary bStruct and returns it.
Thanks for your responses and comments.
If you want b.data[] array to point to exactly the same values, then you can make data of b a char* and make it point to a's data.
Something like
struct{
uint8_t value;
uint8_t unimportant_stuff;
char* data;
}typedef bStruct;
and
b.data = a.data;
But, keep in mind, this means that b.data is pointing at the same memory location as a.data and hence, changing values of b.data would change values of a.data also.
There is another way of doing this: copy all the values of a.data into b.data. Then b.data would merely contain the same values as a.data, but in a different memory location.
This can be done either by copying the 8 elements one by one in a for loop, or by using memcpy().
NOTE
Arrays cannot be made to point to another memory location, as they are non-modifiable lvalues. If you cannot modify the structs, then you have to use the second method.
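A small sketch of the copying approach, written as the kind of getter the asker describes in the update. The struct definitions are repeated (in the more common typedef order) only to keep the example self-contained, and snapshot_from_a is a made-up name:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t test; uint8_t data[8]; } aStruct;
typedef struct { uint8_t value; uint8_t unimportant_stuff; char data[8]; } bStruct;

/* Hypothetical getter: builds a bStruct snapshot from the live aStruct.
   Assumes both data fields have the same size, as in the question. */
static bStruct snapshot_from_a(const aStruct *a) {
    bStruct b = {0};
    memcpy(b.data, a->data, sizeof b.data);
    return b;
}
```

The snapshot is a copy, so it goes stale as soon as the aStruct is refreshed; call the getter again whenever current values are needed.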
What you are asking is not possible when you can not modify the existing struct definitions. But you can still automate the functionality with a bit of OO style programming on your side. All of the following assumes that the data fields in the structs are of same length and contain elements of same size, as in your example.
Basically, you wrap the existing structs with your own container. You can put this in a header file:
/* Forward declaration of the wrapper type */
typedef struct s_wrapperStruct wrapperStruct;
/* Function pointer type for an updater function */
typedef void (*STRUCT_UPDATE_FPTR)(wrapperStruct* w, aStruct* src);
/* Definition of the wrapper type */
struct s_wrapperStruct
{
STRUCT_UPDATE_FPTR update;
aStruct* ap;
bStruct* bp;
};
Then you can create a factory style module that you use to create your synced struct pairs and avoid exposing your synchronization logic to uninterested parties. Implement a couple of simple functions.
/* Sync the data fields of the two separate structs.
   Defined before updateStructs() so it is declared at its point of use. */
static void sync(wrapperStruct* w)
{
    if (w != NULL)
    {
        memcpy(w->bp->data, w->ap->data, sizeof(w->bp->data));
    }
}
/* The updater function */
static void updateStructs(wrapperStruct* w, aStruct* src)
{
    if ( (w != NULL) && (src != NULL) )
    {
        /* Copy the source data to your aStruct (or just the data field) */
        memcpy(w->ap, src, sizeof(aStruct));
        /* Sync a's data field to b */
        sync(w); /* Keep this as a separate function so you can make it optional */
    }
}
Then in your factory function you can create the wrapped pairs.
/* Create a wrapper */
wrapperStruct syncedPair = { &updateStructs, &someA, &someB };
You can then pass the pair where you need it, e.g. the process that is updating your aStruct, and use it like this:
/* Pass new data to the synced pair */
syncedPair.update( &syncedPair, &newDataSource );
Because C is not designed as an OO language, it does not have a this pointer and you need to pass around the explicit wrapper pointer. Essentially this is what happens behind the scenes in C++ where the compiler saves you the extra trouble.
If you need to sync a single aStruct to multiple bStructs, it should be quite simple to change the bp pointer to a pointer-to-array and modify the rest accordingly.
This might look like an overly complicated solution, but when you implement the logic once, it will likely save you from some manual labor in maintenance.

Allocating a dynamic array in a dynamically allocated struct (struct of arrays)

This question is really about how to use variable-length types in the Python/C API (PyObject_NewVar, PyObject_VAR_HEAD, PyTypeObject.tp_basicsize and .tp_itemsize), but I can ask this question without bothering with the details of the API. Just assume I need to use an array inside a struct.
I can create a list data structure in one of two ways. (I'll just talk about char lists for now, but it doesn't matter.) The first uses a pointer and requires two allocations. Ignoring #includes and error handling:
struct listptr {
size_t elems;
char *data;
};
struct listptr *listptr_new(size_t elems) {
size_t basicsize = sizeof(struct listptr), itemsize = sizeof(char);
struct listptr *lp;
lp = malloc(basicsize);
lp->elems = elems;
lp->data = malloc(elems * itemsize);
return lp;
}
The second way to create a list uses array notation and one allocation. (I know this second implementation works because I've tested it pretty thoroughly.)
struct listarray {
size_t elems;
char data[1];
};
struct listarray *listarray_new(size_t elems) {
size_t basicsize = offsetof(struct listarray, data), itemsize = sizeof(char);
struct listarray *la;
la = malloc(basicsize + elems * itemsize);
la->elems = elems;
return la;
}
In both cases, you then use lp->data[index] to access the array.
My question is why does the second method work? Why do you declare char data[1] instead of any of char data[], char data[0], char *data, or char data? In particular, my intuitive understanding of how structs work is that the correct way to declare data is char data with no pointer or array notation at all. Finally, are my calculations of basicsize and itemsize correct in both implementations? In particular, is this use of offsetof guaranteed to be correct for all machines?
Update
Apparently this is called a struct hack: In C99, you can use a flexible array member:
struct listarray2 {
size_t elems;
char data[];
};
with the understanding that you'll malloc enough space for data at runtime. Before C99, the data[1] declaration was common. So my question now is why declare char data[1] or char data[] instead of char *data or char data?
The reason you'd declare char data[1] or char data[] instead of char *data or char data is to keep your structure directly serializable and deserializable. This is important in cases where you'll be writing these sorts of structures to disk or over a network socket, etc.
Take for example your first code snippet that requires two allocations. Your listptr type is not directly serializable. i.e. listptr.elems and the data pointed to by listptr.data are not in a contiguous piece of memory. There is no way to read/write this structure to/from disk with a generic function. You need a custom function that is specific to your struct listptr type to do it. i.e. On serialize you'd have to first write elems to disk, and then write the data pointed to by the data pointer. On deserialization you'd have to read elems, allocate the appropriate space to listptr.data and then read the data from disk.
Using a flexible array member solves this problem because listarray.elems and listarray.data reside in a contiguous memory space. So to serialize it you can simply write out the total allocated size for the structure and then the structure itself. On deserialize you then first read the allocated size, allocate the needed space and then read your listarray struct into that space.
You may wonder why you'd ever really need this, but it can be an invaluable feature. Consider a data stream of heterogeneous types. Provided you define a header that identifies which heterogeneous type you have and its size, and precede each item in the stream with this header, you can generically serialize and deserialize the data stream very elegantly and efficiently.
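As a rough illustration of that size-prefixed scheme, here is a sketch that serializes into a byte buffer instead of a file. The listarray_* function names are invented for the example, and the raw bytes are only meaningful to a reader with the same ABI (same size_t width, endianness and padding):

```c
#include <stdlib.h>
#include <string.h>

struct listarray {
    size_t elems;
    char data[];
};

/* Total bytes occupied by a listarray holding elems chars. */
static size_t listarray_size(size_t elems) {
    return sizeof(struct listarray) + elems;
}

static struct listarray *listarray_new(size_t elems) {
    struct listarray *la = calloc(1, listarray_size(elems));
    if (la != NULL)
        la->elems = elems;
    return la;
}

/* Writes [total size][raw struct bytes] to out and returns the byte count.
   out must have room for sizeof(size_t) + listarray_size(la->elems). */
static size_t listarray_write(const struct listarray *la, unsigned char *out) {
    size_t total = listarray_size(la->elems);
    memcpy(out, &total, sizeof total);
    memcpy(out + sizeof total, la, total);
    return sizeof total + total;
}

/* Reads the size prefix, allocates that much, and copies the struct back. */
static struct listarray *listarray_read(const unsigned char *in) {
    size_t total;
    memcpy(&total, in, sizeof total);
    if (total < sizeof(struct listarray))
        return NULL;               /* prefix too small to hold the header */
    struct listarray *la = malloc(total);
    if (la != NULL)
        memcpy(la, in + sizeof total, total);
    return la;
}
```

Neither function needs to know what the payload means, which is the generic read/write property described above.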
The only reason I know of for choosing char data[1] over char data[] is if you are defining an API that needs to be portable between C99 and C++ since C++ does not have support for flexible array members.
Also, wanted to point out that in the char data[1] you can do the following to get the total needed structure size:
size_t totalsize = offsetof(struct listarray, data[elems]);
You also ask why you wouldn't use char data instead of char data[1] or char data[]. While technically possible to use just plain old char data, it would be (IMHO) morally shunned. The two main issues with this approach are:
1. You wanted an array of chars, but now you can't access the data member directly as an array. You need to point a pointer to the address of data to access it as an array, i.e.
char *as_array = &listarray.data;
2. Your structure definition (and your code's use of the structure) would be totally misleading to anyone reading the code. Why declare a single char when you really meant an array of char?
Given these two things, I don't know why anyone would use char data in favor of char data[1]. It just doesn't benefit anyone given the alternatives.

Relative pointers in memory mapped file using C

Is it possible to use a structure with a pointer to another structure inside a memory mapped file instead of storing the offset in some integral type and calculate the pointer?
e.g. given following struct:
typedef struct _myStruct_t {
int number;
struct _myStruct_t *next;
} myStruct_t;
myStruct_t* first = (myStruct_t*)mapViewHandle;
myStruct_t* next = first->next;
instead of this:
typedef struct _myStruct_t {
int number;
int next;
} myStruct_t;
myStruct_t* first = (myStruct_t*)mappedFileHandle;
myStruct_t* next = (myStruct_t*)(mappedFileHandle+first->next);
I read about '__based' keyword, but this is Microsoft specific and therefore Windows-bound.
Looking for something working with GCC compiler.
I'm pretty sure there's nothing akin to the __based pointer from Visual Studio in GCC. The only time I'd seen anything like that built-in was on some pretty odd hardware. The Visual Studio extension provides an address translation layer around all operations involving the pointer.
So it sounds like you're into roll-your-own territory; although I'm willing to be told otherwise.
The last time I was dealing with something like this it was on the palm platform, where, unless you locked down memory, there was the possibility of it being moved around. You got memory handles from allocations and you had to MemHandleLock before you used it, and MemPtrUnlock it after you were finished using it so the block could be moved around by the OS (which seemed to happen on ARM based palm devices).
If you're insistent on storing pointer-esque values in a memory mapped structure the first recommendation would be to store the value in an intptr_t, which is an int size that can contain a pointer value. While your offsets are unlikely to exceed 4GB, it pays to stay safe.
That said, this is probably easy to implement in C++ using a template class, it's just that marking the question as C makes things a lot messier.
C++: It is very doable and portable (the code, but maybe not the data).
It was a while ago, but I created a template for a self-relative pointer classes.
I had tree structures inside blocks of memory that might move.
Internally, the class had a single intptr_t, but = * . -> operators were overloaded so it appeared like a regular pointer. Handling null took some attention.
I also did versions using int, short and not very useful char for space-saving pointers that were unable to point far away (outside memory block).
In C you could use macros to wrap get and set
// typedef struct OBJ { int p; } OBJ;
#define OBJPTR(P) ((OBJ*)((P)?(int)&(P)+(P):0))
#define SETOBJPTR(P,V) ((P)=(V)?(int)(V)-(int)&(P):0)
The above C macros are for self-relative pointers that can be slightly more efficient than based pointers.
Here is a working example of a tree in a small block of relocatable memory using 2-byte (short) pointers to save space. int is okay for casting from pointers since it is 32 bit code:
#include <stdio.h>
#include <stdlib.h> /* malloc */
typedef struct OBJ
{
int val;
short left;
short right;
#define OBJPTR(P) ((OBJ*)((P)?(int)&(P)+(P):0))
#define SETOBJPTR(P,V) ((P)=(V)?(int)(V)-(int)&(P):0)
} OBJ;
typedef struct HEAD
{
short top; // top of tree
short available; // index of next available place in data block
char data[0x7FFF]; // put whole tree here
} HEAD;
HEAD * blk;
OBJ * Add(int val)
{
short * where = &blk->top; // find pointer to "pointer" to place new node
OBJ * nd;
while ( ( nd = OBJPTR(*where) ) != 0 )
where = val < nd->val ? &nd->left : &nd->right;
nd = (OBJ*) ( blk->data + blk->available ); // allocate node
blk->available += sizeof(OBJ); // finish allocation
nd->val = val;
nd->left = nd->right = 0;
SETOBJPTR( *where, nd );
return nd;
}
void Dump(OBJ*top,int indent)
{
if ( ! top ) return;
Dump( OBJPTR(top->left), indent + 3 );
printf( "%*s %d\n", indent, "", top->val );
Dump( OBJPTR(top->right), indent + 3 );
}
int main(void)
{
blk = (HEAD*) malloc(sizeof(HEAD));
blk->available = (int) &blk->data - (int) blk;
blk->top = 0;
Add(23); Add(2); Add(45); Add(99); Add(0); Add(12);
Dump( OBJPTR(blk->top), 3 );
{ // PROOF a copy at a different address still has the tree:
HEAD blk2 = *blk;
Dump( OBJPTR(blk2.top), 3 );
}
}
A note about the based versus self-relative "*" operator.
Based can involve 2 addresses and 2 memory fetches.
Self-relative involves 1 address and 1 memory fetch.
Pseudo assembly:
; self-relative
load reg1,address of pointer
load reg2,fetch reg1
add reg3,reg2+reg1
; based
load reg1,address of pointer
load reg2,fetch reg1
load reg3,address of base
load reg4,fetch base
add reg5,reg2+reg4
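The macros above rely on int being as wide as a pointer. A sketch of the same self-relative idea that avoids that assumption by doing the arithmetic on char pointers; Node, rel_get and rel_set are illustrative names, and the scheme assumes both the field and its target live in the same block of memory:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Node {
    int val;
    int32_t next;   /* byte distance from &next to the target; 0 = null */
} Node;

/* Resolve a self-relative field back into an ordinary pointer. */
static Node *rel_get(const int32_t *field) {
    return *field ? (Node *)((char *)field + *field) : NULL;
}

/* Store target as a distance relative to the field's own address. */
static void rel_set(int32_t *field, const Node *target) {
    *field = target ? (int32_t)((const char *)target - (const char *)field) : 0;
}
```

Because only distances are stored, the whole block can be copied or mapped at a different address and the links still resolve.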
The first is extremely unlikely to work.
Remember that a pointer, such as struct _myStruct_t * is a pointer to a location in memory. Suppose that this structure was located at address 1000 in memory: that would mean that the next structure, located just after it, might be located at address 1008, and that's what's stored in ->next (the numbers don't matter; what matters is that they are memory addresses). Now you save that structure to a file (or un-map it). Then you map it again, but this time, it ends up starting at address 2000, but the ->next pointer is still 1008.
You have (generally) no control over where files are mapped in memory, so no control over the actual memory locations of the elements within the mapped structure. Therefore you can only depend on relative offsets.
Note that your second version may or may not work as you expect, depending on the declared type of mappedFileHandle. If it's a pointer to myStruct_t, then adding an integer n to it will produce a pointer to an address which is n*sizeof(myStruct_t) bytes higher in memory (as opposed to being n bytes higher).
If you declared mappedFileHandle as
myStruct_t* mappedFileHandle;
then you can subscript it like an array. If the mapped file is laid out as a sequence of myStruct_t blocks, and the next field refers to other blocks by index within that sequence, then (supposing myStruct_t* b is a block of interest)
mappedFileHandle[b->next].number
is the number field of the b->nextth block in the sequence.
(This is just a consequence of the way that arrays are defined in C: mappedFileHandle[b->next] is defined to be equivalent to *(mappedFileHandle + b->next), which is an object of type myStruct_t, which you can therefore get the number field of).
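If next holds a byte offset rather than an index, the addition has to be done on a char pointer. A minimal sketch, assuming offset 0 is reserved to mean "end of list" (the first node sits at offset 0, so nothing can link to it) and using made-up helper names node_at and sum_list:

```c
#include <stddef.h>

typedef struct {
    int number;
    int next;      /* byte offset of the next node from the mapping base; 0 = end */
} myStruct_t;

/* Convert a stored byte offset to a pointer; arithmetic is done on char*. */
static myStruct_t *node_at(void *base, int offset) {
    return offset ? (myStruct_t *)((char *)base + offset) : NULL;
}

/* Walk the list from the start of the mapping and sum the number fields. */
static int sum_list(void *base) {
    int sum = 0;
    for (myStruct_t *n = base; n != NULL; n = node_at(base, n->next))
        sum += n->number;
    return sum;
}
```

Here base stands in for whatever address the mapping call returned; the traversal gives the same result wherever the file ends up in memory.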
