Typecast array to struct in C

I have a structure like this
struct packet
{
int seqnum;
char type[1];
float time1;
float pri;
float time2;
unsigned char data[512];
};
I am receiving the packet in an array:
char buf[529];
I want to extract seqnum, data, and the other fields separately. Does the following typecast work? It gives me junk values.
struct packet *pkt;
pkt = (struct packet *)buf;
printf(" %d", pkt->seqnum);

No, that likely won't work and is generally a bad and broken way of doing this.
You must use compiler-specific extensions to make sure there's no invisible padding between your struct members, for something like that to work. With gcc, for instance, you do this using the __attribute__() syntax.
It is, thus, not a portable idea.
It's much better to be explicit about it, and unpack each field. This also gives you a chance to have a well-defined endianness in your network protocol, which is generally a good idea for interoperability's sake.
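A minimal sketch of that explicit unpacking, using the struct from the question. It assumes the sender writes the fields back to back with no padding (4 + 1 + 4 + 4 + 4 + 512 = 529 bytes), in big-endian order, with floats sent as 32-bit IEEE 754 bit patterns, and that int and float are 4 bytes on the receiver. The helper names get_be32, get_be_float and unpack_packet are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Same fields as the struct in the question. */
struct packet {
    int seqnum;
    char type[1];
    float time1;
    float pri;
    float time2;
    unsigned char data[512];
};

/* Read a 32-bit big-endian value from four bytes. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a float sent as a big-endian 32-bit bit pattern (assumed format). */
static float get_be_float(const unsigned char *p)
{
    uint32_t bits = get_be32(p);
    float f;
    memcpy(&f, &bits, sizeof f);   /* assumes sizeof(float) == 4 */
    return f;
}

/* Unpack the 529-byte wire buffer into the struct, field by field. */
static void unpack_packet(struct packet *pkt, const unsigned char *buf)
{
    pkt->seqnum  = (int)get_be32(buf + 0);
    pkt->type[0] = (char)buf[4];
    pkt->time1   = get_be_float(buf + 5);
    pkt->pri     = get_be_float(buf + 9);
    pkt->time2   = get_be_float(buf + 13);
    memcpy(pkt->data, buf + 17, sizeof pkt->data);
}
```

Because every offset is written out, the in-memory layout of struct packet no longer matters; the compiler is free to pad it however it likes.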

No, that isn't generally valid code. You should create the struct first and then memcpy the data into it:
struct packet p;
memcpy(&p.seqnum, buf + 0, 4);
memcpy(&p.type[0], buf + 4, 1);
memcpy(&p.time1, buf + 5, 4);
And so forth.
You must take great care to get the type sizes and endianness right.

First of all, you cannot know in advance where the compiler will insert padding bytes in your structure (for cache line alignment, integer alignment, etc.), since this is platform-dependent. The exception, of course, is if you only ever build the app on one platform.
Anyway, in your case it seems like you are getting data from somewhere (the network?), and it is highly probable that the data has been compacted (no padding bytes between fields).
If you really want to typecast your array to a struct pointer, you can still tell the compiler to remove the padding bytes it might add. Note that this depends on the compiler you use and is not part of standard C. With gcc, you might add this attribute at the end of your structure definition:
struct my_struct {
int blah;
/* Blah ... */
} __attribute__((packed));
Note that it will affect the performance of member access, copying, etc.
Unless you have a very good reason to do so, don't ever use the __attribute__((packed)) thing!
The other solution, which is much more advisable, is to do the parsing yourself. You allocate an appropriate structure and fill its fields by picking the right bytes out of your buffer. A sequence of memcpy calls is likely to do the trick here (see Kerrek's answer).

Related

Best practice for parsing data of mixed type?

I am wondering whether there is any known best practice/method for parsing mixed type of data packet.
For instance, let's say the data is 10 bytes, and it consists of:
Byte 0-1: manufacturer ID (int)
Byte 2: type (int)
Byte 3-4: device id (ascii char)
I could simply define each data type size and location as #define, and parse it using those defines. But I am wondering if there is any structure to organise this better.
Best practice is to assume all data from outside the program (e.g. from the user, from a file, from a network, from a different process) is potentially incorrect (and potentially unsafe/malicious).
Then, based on the assumption of "potential incorrectness", define types to distinguish between "unchecked, potentially incorrect data" and "checked, known correct data". For your example, you could use uint8_t packet[10]; as the data type for unchecked data and a normal structure (with padding and without __attribute__((packed));) for the checked data. This makes it extremely difficult for a programmer to accidentally use unsafe data when they think they're using safe/checked data.
Of course you will also need code to convert between these data types, which needs to do as many sanity checks as possible (and possibly also worry about things like endianness). For your example these checks could be:
are any of the bytes that are supposed to be ASCII characters >= 0x80, and are any of them invalid (e.g. maybe control characters like backspace are not permitted).
is the manufacturer ID valid (e.g. maybe there's an enumeration that it needs to match)
is the type valid (e.g. maybe there's an enumeration that it needs to match)
Note that this function should return some kind of status to indicate if the conversion was successful or not, and in most cases this status should also give an indication of what the problem was if the conversion wasn't successful (so that the caller can inform the user or log the problem or handle the problem in the most suitable way for the problem). For example, maybe "unknown manufacturer ID" means that the program needs to be updated to handle a new manufacturer and that the data was correct, and "invalid manufacturer ID" means that the data was definitely wrong.
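As a rough sketch of such a conversion function for the layout in the question: the status names, the struct device_info type and the two "known value" limits below are invented for illustration, and the manufacturer ID is assumed to be big-endian on the wire.

```c
#include <stdint.h>

/* Hypothetical status codes; the names are illustrative only. */
enum parse_status {
    PARSE_OK = 0,
    PARSE_UNKNOWN_MANUFACTURER,
    PARSE_INVALID_TYPE,
    PARSE_INVALID_DEVICE_ID
};

/* Checked representation: an ordinary struct, not packed. */
struct device_info {
    uint16_t manufacturer;
    uint8_t  type;
    char     device_id[3];   /* two ASCII characters plus a terminating NUL */
};

/* Assumed valid ranges, made up for this example. */
#define MAX_KNOWN_MANUFACTURER 0x00FFu
#define MAX_KNOWN_TYPE         3u

/* Convert unchecked bytes into a checked struct, reporting why it failed. */
static enum parse_status parse_device_info(const uint8_t raw[10],
                                           struct device_info *out)
{
    /* Bytes 0-1: manufacturer ID, assumed big-endian on the wire. */
    uint16_t mfg = (uint16_t)(((unsigned)raw[0] << 8) | raw[1]);
    if (mfg > MAX_KNOWN_MANUFACTURER)
        return PARSE_UNKNOWN_MANUFACTURER;

    /* Byte 2: type. */
    if (raw[2] > MAX_KNOWN_TYPE)
        return PARSE_INVALID_TYPE;

    /* Bytes 3-4: device ID, printable ASCII only (0x20..0x7E). */
    for (int i = 3; i <= 4; i++) {
        if (raw[i] < 0x20 || raw[i] > 0x7E)
            return PARSE_INVALID_DEVICE_ID;
    }

    out->manufacturer = mfg;
    out->type = raw[2];
    out->device_id[0] = (char)raw[3];
    out->device_id[1] = (char)raw[4];
    out->device_id[2] = '\0';
    return PARSE_OK;
}
```

The output struct is only written once every check has passed, so a caller that ignores the status never sees half-validated data.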
You can describe the layout with a packed struct, like this:
struct packet {
uint16_t mfg;
uint8_t type;
uint16_t devid;
} __attribute__((packed));
The packed attribute (or your platform's equivalent) is required to avoid implicit padding which doesn't exist in the protocol.
Once you have the above struct, you simply cast (part of) a char array which you received from wherever:
char buf[1000];
struct packet *p = (struct packet *)(buf + N);
For a fully portable version, I suggest you do the read in this fashion:
struct {
uint16_t e1;
uint8_t e2;
uint16_t e3;
} s;
uint8_t rbuf[5];
read(sock, rbuf, sizeof(rbuf)); /* check the return value in real code */
memcpy(&s.e1, &rbuf[0], sizeof(s.e1));
s.e2 = rbuf[2];
memcpy(&s.e3, &rbuf[3], sizeof(s.e3));
s.e1 = ntohs(s.e1);
s.e3 = ntohs(s.e3);
You may be tempted to do what other answers suggest, something like:
struct s {
uint16_t e1;
uint8_t e2;
uint16_t e3;
} __attribute__((packed));
struct s d;
read(sock, &d, sizeof(d));
d.e1 = ntohs(d.e1);
d.e3 = ntohs(d.e3);
However, this code is not fully portable and can cause problems, since you are accessing members (such as e3) at unaligned addresses, which is in itself undefined behavior. Under some circumstances this approach is OK and even desirable (less cache pollution, since more structs fit into each cache line, and maybe simpler code), but in other cases it can cause a bus error and make your code incompatible with some architectures.
Beyond that, you should follow other best practices, like reading as many structs as possible per read() call and writing cleaner code for the network-to-host byte-order translations, but I think that avoiding the non-standard attribute should be the first thing.
Note that if the protocol is laid out so that you never need unaligned access, all of this (even the packed attribute) is completely unnecessary, and you can read the structs like this:
struct {
uint16_t e1;
uint8_t e2;
uint8_t pad; /* explicit padding byte, so e3 starts on a 2-byte boundary */
uint16_t e3;
} d;
read(rsock, &d, sizeof(d));

Why is the offsetof macro necessary?

I am new to the C language and just learned about structs and pointers.
My question is related to the offsetof macro I recently saw. I know how it works and the logic behind that.
In the <stddef.h> file the definition is as follows:
#define offsetof(type,member) ((unsigned long) &(((type*)0)->member))
My question is, if I have a struct as shown below:
struct test {
int field1;
int field2;
};
struct test var;
Why can't I directly get the address of field2 as:
char * p = (char *)&var;
char *addressofField2 = p + sizeof(int);
Rather than writing something like this
field2Offset = offsetof (struct test, field2);
and then adding offset value to var's starting address?
Is there any difference? Is using offsetof more efficient?
The C compiler will often add extra padding bits or bytes between members of a struct in order to improve efficiency and keep integers word-aligned (which in some architectures is required to avoid bus errors and in some architectures is required to avoid efficiency problems). For example, in many compilers, if you have this struct:
struct ImLikelyPadded {
int x;
char y;
int z;
};
you might find that sizeof(struct ImLikelyPadded) is 12, not 9, because the compiler will insert three extra padding bytes between the end of the one-byte char y and the word-sized int z. This is why offsetof is so useful - it lets you determine where things really are even factoring in padding bytes and is highly portable.
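To make the difference concrete, here is a small sketch that compares the hand-computed offset of z with the one the compiler reports. The two helper function names are made up for this example; the actual numbers depend on the platform.

```c
#include <stddef.h>

struct ImLikelyPadded {
    int  x;
    char y;
    int  z;
};

/* Offset of z computed by hand, assuming no padding: wrong on most ABIs. */
static size_t naive_offset_of_z(void)
{
    return sizeof(int) + sizeof(char);
}

/* Offset of z as the compiler actually laid it out. */
static size_t real_offset_of_z(void)
{
    return offsetof(struct ImLikelyPadded, z);
}
```

On a typical platform with 4-byte int, the first function returns 5 and the second returns 8; only the second is guaranteed to match the real layout.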
Unlike arrays, the memory layout of a struct is not always contiguous. The compiler may add extra bytes in order to align the members. This is called padding.
Because of padding, it is difficult to find the location of a member manually. This is also why we always use sizeof to find the size of a struct.
The offsetof macro lets you find the distance (offset) of a member of a struct from the starting position of the struct.
One clever use of offsetof is seen in the Linux kernel's container_of macro. This macro lets you find the starting position of a node given the address of a member, in a generic intrusive doubly linked list.
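A simplified sketch of that idea follows. The kernel's real container_of adds type checking that is left out here, and struct job and job_from_link are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic list node that gets embedded inside other structs. */
struct list_node {
    struct list_node *prev, *next;
};

/* Hypothetical payload type embedding a list node. */
struct job {
    int id;
    struct list_node link;
};

/* Recover the job that owns a given list node. */
static struct job *job_from_link(struct list_node *node)
{
    return container_of(node, struct job, link);
}
```

The list code only ever sees struct list_node pointers, and offsetof supplies the distance needed to get back to the containing object, padding included.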
As already mentioned in the other answers, padding is one of the reasons. I won't repeat what was already said about it.
Another good reason to use the offsetof macro and not manually compute offsets is that you only have to write it once. Imagine what happens if you need to change the type of field1 or insert or remove one or more fields in front of field2. Using your hand-crafted calculation you have to find and change all its occurrences. Missing one of them will produce mysterious bugs that are difficult to find.
The code written using offsetof doesn't need any update in this situation. The compiler takes care of everything on the next compilation.
Even more, the code that uses offsetof is more clear. The macro is standard, its functionality is documented. A fellow programmer that reads the code understands it immediately. It's not that easy to understand what the hand-crafted code attempts.

struct xyz a[0]; What does this mean? [duplicate]

I am refactoring some old code and have found a few structs containing zero-length arrays (below). The warnings are suppressed by a pragma, of course, but I have failed to create, with "new", structures containing such structures (error 2233). The array 'byData' is used as a pointer, but why not use a pointer instead? Or an array of length 1? And of course, no comments were added to help me enjoy the process...
Is there any reason to use such a thing? Any advice on refactoring these?
struct someData
{
int nData;
BYTE byData[0];
};
NB It's C++, Windows XP, VS 2003
Yes this is a C-Hack.
To create an array of any length:
struct someData* mallocSomeData(int size)
{
struct someData* result = (struct someData*)malloc(sizeof(struct someData) + size * sizeof(BYTE));
if (result)
{
result->nData = size;
}
return result;
}
Now you have an object of someData with an array of a specified length.
There are, unfortunately, several reasons why you would declare a zero length array at the end of a structure. It essentially gives you the ability to have a variable length structure returned from an API.
Raymond Chen did an excellent blog post on the subject. I suggest you take a look at this post because it likely contains the answer you want.
Note in his post, it deals with arrays of size 1 instead of 0. This is the case because zero length arrays are a more recent entry into the standards. His post should still apply to your problem.
http://blogs.msdn.com/oldnewthing/archive/2004/08/26/220873.aspx
EDIT
Note: Even though Raymond's post says 0 length arrays are legal in C99, they are in fact still not legal in C99 (what C99 added is the flexible array member, written arr[]). Instead of a 0 length array here you should be using a length 1 array.
This is an old C hack to allow flexibly sized arrays.
In the C99 standard this is not necessary, as it supports the arr[] syntax.
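For reference, a minimal sketch of the C99 form, reusing the struct from the question with unsigned char in place of BYTE. Note that this is C99 and not standard C++, so it may not help directly with a VS 2003 C++ build; someData_create is a made-up helper name.

```c
#include <stdlib.h>
#include <string.h>

/* C99 flexible array member: byData has no declared length. */
struct someData {
    int nData;
    unsigned char byData[];
};

/* Allocate a someData holding a copy of n bytes from src. */
static struct someData *someData_create(const unsigned char *src, int n)
{
    struct someData *p;

    if (n < 0)
        return NULL;
    /* sizeof *p excludes the flexible member, so add the payload size. */
    p = malloc(sizeof *p + (size_t)n);
    if (p != NULL) {
        p->nData = n;
        if (n > 0)
            memcpy(p->byData, src, (size_t)n);
    }
    return p;
}
```

The caller releases the whole object, header and payload together, with a single free().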
Your intuition about "why not use an array of size 1" is spot on.
The code is doing the "C struct hack" wrong, because declarations of zero length arrays are a constraint violation. This means that a compiler can reject your hack right off the bat at compile time with a diagnostic message that stops the translation.
If we want to perpetrate a hack, we must sneak it past the compiler.
The right way to do the "C struct hack" (which is compatible with C dialects going back to 1989 ANSI C, and probably much earlier) is to use a perfectly valid array of size 1:
struct someData
{
int nData;
unsigned char byData[1];
};
Moreover, instead of sizeof(struct someData), the size of the part before byData is calculated using:
offsetof(struct someData, byData);
To allocate a struct someData with space for 42 bytes in byData, we would then use:
struct someData *psd = (struct someData *) malloc(offsetof(struct someData, byData) + 42);
Note that this offsetof calculation is in fact the correct calculation even in the case of the array size being zero. You see, sizeof the whole structure can include padding. For instance, if we have something like this:
struct hack {
unsigned long ul;
char c;
char foo[0]; /* assuming our compiler accepts this nonsense */
};
The size of struct hack is quite possibly padded for alignment because of the ul member. If unsigned long is four bytes wide, then quite possibly sizeof (struct hack) is 8, whereas offsetof(struct hack, foo) is almost certainly 5. The offsetof method is the way to get the accurate size of the preceding part of the struct just before the array.
So that would be the way to refactor the code: make it conform to the classic, highly portable struct hack.
Why not use a pointer? Because a pointer occupies extra space and has to be initialized.
There are other good reasons not to use a pointer, namely that a pointer requires an address space in order to be meaningful. The struct hack is externalizeable: that is to say, there are situations in which such a layout conforms to external storage such as areas of files, packets or shared memory, in which you do not want pointers because they are not meaningful.
Several years ago, I used the struct hack in a shared memory message passing interface between kernel and user space. I didn't want pointers there, because they would have been meaningful only to the original address space of the process generating a message. The kernel part of the software had a view to the memory using its own mapping at a different address, and so everything was based on offset calculations.
It's worth pointing out IMO the best way to do the size calculation, which is used in the Raymond Chen article linked above.
struct foo
{
size_t count;
int data[1];
};
size_t foo_size_from_count(size_t count)
{
return offsetof(struct foo, data[count]);
}
The offset of the first entry off the end of desired allocation, is also the size of the desired allocation. IMO it's an extremely elegant way of doing the size calculation. It does not matter what the element type of the variable size array is. The offsetof (or FIELD_OFFSET or UFIELD_OFFSET in Windows) is always written the same way. No sizeof() expressions to accidentally mess up.

Portable way to find size of a packed structure in C

I'm coding a network layer protocol and need to find the size of a packed structure defined in C. Compilers may add extra padding bytes, which makes sizeof useless in my case. I searched Google and found that we could use __attribute__((packed)) to prevent the compiler from adding padding bytes. But I believe this is not a portable approach, and my code needs to support both Windows and Linux environments.
Currently, I've defined a macro to map packed sizes of every structure defined in my code. Consider code below:
typedef struct {
...
} a_t;
typedef struct {
...
} b_t;
#define SIZE_a_t 8
#define SIZE_b_t 10
#define SIZEOF(XX) SIZE_##XX
and then in main function, I can use above macro definition as below:-
int size = SIZEOF(a_t);
This approach does work, but I believe it may not be best approach. Any suggestions or ideas on how to efficiently solve this problem in C?
Example
Consider the C structure below:-
typedef struct {
uint8_t a;
uint16_t b;
} e_t;
Under Linux, sizeof returns 4 bytes instead of 3 bytes. To prevent this I'm currently doing this:-
typedef struct {
uint8_t a;
uint16_t b;
} e_t;
#define SIZE_e_t 3
#define SIZEOF(XX) SIZE_##XX
Now, when I call SIZEOF(e_t) in my function, it should return 3, not 4.
sizeof is the portable way to find the size of a struct, or of any other C data type.
The problem you're facing is how to ensure that your struct has the size and layout that you need.
#pragma pack or __attribute__((packed)) may well do the job for you. Neither is 100% portable (there's no mention of packing in the C standard), but either may be portable enough for your current purposes; do consider whether your code might need to be ported to some other platform in the future. Packing is also potentially unsafe; see this question and this answer.
The only 100% portable approach is to use arrays of unsigned char and keep track of which fields occupy which ranges of bytes. This is a lot more cumbersome, of course.
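For the e_t example from the question, that could look roughly like the sketch below. The big-endian byte order for b and the names E_T_WIRE_SIZE, e_t_pack and e_t_unpack are assumptions made for this example.

```c
#include <stdint.h>

typedef struct {
    uint8_t  a;
    uint16_t b;
} e_t;

/* Size on the wire, independent of sizeof(e_t). */
enum { E_T_WIRE_SIZE = 3 };

/* Byte 0: a. Bytes 1-2: b, big-endian (an assumption of this sketch). */
static void e_t_pack(const e_t *e, unsigned char out[E_T_WIRE_SIZE])
{
    out[0] = e->a;
    out[1] = (unsigned char)(e->b >> 8);
    out[2] = (unsigned char)(e->b & 0xFF);
}

/* Inverse of e_t_pack: rebuild the struct from three wire bytes. */
static void e_t_unpack(e_t *e, const unsigned char in[E_T_WIRE_SIZE])
{
    e->a = in[0];
    e->b = (uint16_t)(((unsigned)in[1] << 8) | in[2]);
}
```

The wire size is then a property of the pack/unpack code rather than of the struct, so sizeof(e_t) being 4 stops being a problem.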
Your macro tells you the size that you think the struct should have, if it has been laid out as you intend.
If that's not equal to sizeof(a_t), then whatever code you write that thinks it is packed isn't going to work anyway. Assuming they're equal, you might as well just use sizeof(a_t) for all purposes. If they're not equal then you should be using it only for some kind of check that SIZEOF(a_t) == sizeof(a_t), which will fail and prevent your non-working code from compiling.
So it follows that you might as well just put the check in the header file that sizeof(a_t) == 8, and not bother defining SIZEOF.
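With a C11 compiler, that header check can be written as a static assertion. The members of a_t below are a hypothetical stand-in, since the question elides the real ones.

```c
#include <stdint.h>

/* Hypothetical stand-in for a_t; the question does not show its members. */
typedef struct {
    uint32_t id;
    uint16_t flags;
    uint16_t length;
} a_t;

/* C11: compilation fails if the layout is not the expected 8 bytes. */
_Static_assert(sizeof(a_t) == 8, "a_t must be exactly 8 bytes");
```

If a compiler ever lays the struct out differently, the build stops with that message instead of producing code that silently misreads packets.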
That's all aside from the fact that SIZEOF doesn't really behave like sizeof. For example consider typedef a_t foo; sizeof(foo);, which obviously won't work with SIZEOF.
I don't think that specifying the size manually is more portable than using sizeof.
If the struct changes, your hard-coded size will be wrong.
Packing is not part of standard C, but the major compilers support it: GCC uses __attribute__((packed)), and Visual Studio uses #pragma pack.
I would recommend against trying to read/write data by overlaying it on a struct. I would suggest instead writing a family of routines which are conceptually like printf/scanf, but which use format specifiers that specify binary data formats. Rather than using percent-sign-based tags, I would suggest simply using a binary encoding of the data format.
There are a few approaches one could take, involving trade-off between the size of the serialization/deserialization routines themselves, the size of the code necessary to use them, and the ability to handle a variety of deserialization formats. The simplest (and most easily portable) approach would be to have routines which, instead of using a format string, process items individually by taking a double-indirect pointer, read some data type from it, and increment it suitably. Thus:
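uint32_t read_uint32_bigendian(uint8_t const ** src)
{
uint8_t const *p;
uint32_t tmp;
p = *src;
/* Cast before shifting: uint8_t promotes to int, and << 24 could overflow it. */
tmp = (uint32_t)(*p++) << 24;
tmp |= (uint32_t)(*p++) << 16;
tmp |= (uint32_t)(*p++) << 8;
tmp |= (uint32_t)(*p++);
*src = p;
return tmp;
}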
...
uint8_t buff[256];
...
uint8_t const *buffptr = buff;
first_word = read_uint32_bigendian(&buffptr);
next_word = read_uint32_bigendian(&buffptr);
This approach is simple, but has the disadvantage of having lots of redundancy in the packing and unpacking code. Adding a format string could simplify it:
#define BIGEND_INT32 "\x43" // Or whatever the appropriate token would be
uint8_t const *buffptr = buff;
read_data(&buffptr, BIGEND_INT32 BIGEND_INT32, &first_word, &second_word);
This approach could read any number of data items with a single function call, passing buffptr only once rather than once per data item. On some systems, it might still be a bit slow. An alternative approach would be to pass in a string indicating what sort of data should be received from the source, and then also pass in a string or structure indicating where the data should go. This could allow any amount of data to be parsed by a single call giving a double-indirect pointer for the source, a string pointer indicating the format of data at the source, a pointer to a struct indicating how the data should be unpacked, and a pointer to a struct to hold the target data.
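One possible shape for the format-string variant, as a rough sketch only: the token values and the BIGEND_INT16 token are invented here, every destination is assumed to be a uint32_t pointer, and there is no bounds checking on the source buffer.

```c
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical one-byte format tokens; the values are arbitrary. */
#define BIGEND_INT16 "\x42"
#define BIGEND_INT32 "\x43"

/* Read the items described by fmt from *src into the uint32_t pointers
 * that follow, advancing *src. Returns the number of items read. */
static int read_data(const uint8_t **src, const char *fmt, ...)
{
    const uint8_t *p = *src;
    va_list ap;
    int count = 0;

    va_start(ap, fmt);
    for (; *fmt != '\0'; fmt++) {
        int width = (*fmt == '\x43') ? 4 : (*fmt == '\x42') ? 2 : 0;
        uint32_t *dst;
        uint32_t value = 0;

        if (width == 0)
            break;               /* unknown token: stop reading */
        dst = va_arg(ap, uint32_t *);
        for (int i = 0; i < width; i++)
            value = (value << 8) | *p++;   /* big-endian accumulate */
        *dst = value;
        count++;
    }
    va_end(ap);
    *src = p;
    return count;
}
```

Adjacent string literals concatenate, so a call such as read_data(&buffptr, BIGEND_INT32 BIGEND_INT16, &first_word, &second_word) reads a 32-bit item followed by a 16-bit one.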

How do I determine the memory layout of a structure?

Suppose I have the following structure (in C):
struct my_struct {
int foo;
float bar;
char *baz;
};
If I now have a variable, say
struct my_struct a_struct;
How can I find out how the fields of that structure are going to be laid out in memory? In other words, I need to know what the addresses of a_struct.foo, a_struct.bar and a_struct.baz are going to be. And I cannot do that programmatically, because I am actually cross-compiling to another platform.
CLARIFICATION
Thanks for the answers so far, but I cannot do this programmatically (i.e. with the offsetof macro, or with a small test program) because I am cross-compiling and I need to know how the fields are going to be aligned on the target platform. I know this is implementation-dependent; that's the whole point of my question. I am using GCC to compile, targeting an ARM architecture.
What I need in the end is to be able to dump the memory from the target platform and parse it with other tools, such as Python's struct library. But for that I need to know how the fields were laid out.
In general, this is implementation specific. It depends on things like the compiler, compiler settings, the platform you are compiling on, word-size, etc. Here's a previous SO thread on the topic: C struct memory layout?
If you are cross-compiling, I'd imagine the specific layout will be different depending on which platform you compile for. I'd consult references for your compiler and platform.
There's a program called pahole (Poke-A-Hole) in the dwarves package that will produce a report showing the structures of your program along with markers showing where and how large padding is.
I think you have two options.
The first one is to use __attribute__((packed)) after the struct declaration. This will ensure that each member is allocated exactly the amount of memory that its type requires, with no padding in between.
The other one is to examine your structure and use the alignment rules (an n-byte basic type variable has to be n-byte aligned) to figure out the layout.
In your example, assuming a 32-bit ARM target where int, float and pointers are all 4 bytes, in either case each member variable will take 4 bytes and the structure will occupy 12 bytes in memory.
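The alignment rule can be applied mechanically. The sketch below computes offsets on the host from sizes and alignments that you supply by hand for the target; it assumes the target ABI follows the plain natural-alignment rule described above, which you should confirm against the ABI documentation. The names member_desc and compute_layout are made up for this example.

```c
#include <stddef.h>

/* Size and alignment of one member on the target, supplied by hand. */
struct member_desc {
    size_t size;
    size_t align;
};

/* Fill offsets[] for n members and return the total struct size.
 * Each member starts at the next multiple of its alignment, and the
 * struct size is rounded up to the largest member alignment. */
static size_t compute_layout(const struct member_desc *m, size_t n,
                             size_t *offsets)
{
    size_t offset = 0, max_align = 1;

    for (size_t i = 0; i < n; i++) {
        offset = (offset + m[i].align - 1) / m[i].align * m[i].align;
        offsets[i] = offset;
        offset += m[i].size;
        if (m[i].align > max_align)
            max_align = m[i].align;
    }
    return (offset + max_align - 1) / max_align * max_align;
}
```

For the struct in the question on a target with 4-byte int, float and pointers, the members {4,4}, {4,4}, {4,4} give offsets 0, 4, 8 and a total size of 12, which are the numbers you would then feed to a tool such as Python's struct module.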
One hacky way to see the memory view of what's inside it would be to cast a struct pointer to a char pointer, then print out all the chars, something like:
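struct my_struct s;
s.foo = INT_MAX; /* from <limits.h> */
s.bar = 1.0f;
s.baz = "hello";
for (size_t i = 0; i < sizeof(s); i++) {
/* unsigned char avoids sign extension when a byte is >= 0x80 */
unsigned char *c = ((unsigned char *)&s) + i;
printf("byte %zu: 0x%02x\n", i, *c);
}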
That doesn't explicitly show you the boundaries, you'd have to infer that from the dump.
Comments made by others about packing still apply; you'll also want to use explicitly sized types like uint32_t instead of unsigned int.

Resources