I receive datagrams through a network and I would like to copy the data to a struct with the appropriate fields (corresponding to the format of the message). There are many different types of datagrams (with different fields and sizes). Here is a simplified version (in reality the fields are always arrays of chars):
struct dg_a
{
char id[2];
char time[4];
char flags;
char end;
};
struct dg_a data;
memcpy(&data, buffer, offsetof(struct dg_a, end));
Currently I add a dummy field called end to the end of the struct so that I can use offsetof to determine how many bytes to copy.
Is there a better and less error-prone way to do this? I was looking for something more portable than putting __attribute__((packed)) and using sizeof.
--
EDIT
Several people in the comments have stated that my approach is bad, but so far nobody has presented a reason why. Since all struct members are char, there are no trap representations and no padding between the members (guaranteed by the standard).
A central issue is the size of buffer (assumed to be a character array). The two calls below may copy different numbers of bytes, perhaps differing by only a few.
memcpy(&data, buffer, offsetof(struct dg_a, end)); // 7
// or
memcpy(&data, buffer, sizeof data); // 7, 8, 16 depends on alignment.
Consider avoiding those issues by using a buffer as wide as the largest data structure, zero-filled before being populated with incoming data.
struct dg_a {
char id[2];
char time[4];
char flags;
}; // no end field
union dg_all {
struct dg_a a;
struct dg_b b;
...
struct dg_z z;
} buffer = { 0 };
foo(&buffer, sizeof buffer); // get data
switch (bar(buffer)) {
case 'a': {
struct dg_a data = buffer.a; // Ditch the memcpy
// or maybe no need for copy, just use `buffer.a`
break;
}
// ... other cases
}
If the term "language" refers to a mapping between source text and behavior, the name C describes two families of languages:
The family of languages which mapped "C syntax" to the behaviors of commonplace microcomputer hardware in ways which were defined more by precedent than specification, but were essentially 100% consistent throughout the 1980s and most of the 1990s among implementations targeting commonplace hardware.
The family of all languages that meet the C Specification, including those processed by deliberately-capricious implementations.
Even though the authors of the C Standard recognized that it would not be practical to mandate that all implementations be suitable for all of the purposes served by C programs, a mentality has emerged in some fields that the only programs that should be considered "portable" are those which the Standard requires all implementations to support. A program which could be broken by a deliberately-capricious implementation should (given that mentality) be viewed as "non-portable" or "erroneous", even if it would benefit greatly from semantics which compilers for commonplace hardware had unanimously supported during the late 20th century, and for which the Standard defines no nice replacements.
Because compilers targeting certain fields like high-end number crunching can benefit from assuming that code won't rely upon certain hardware features, and because the authors of the Standard didn't want to get into details of deciding what implementations should be regarded as suitable for what purposes, some compiler writers really don't want to support code which attempts to overlay data onto structures. Such constructs may be more readable than code which tries to manually parse all the data, and compilers that endeavor to support such code may be able to process it more easily and efficiently than code which manually parses all the data, but since the Standard would allow compilers to assign struct layouts in silly ways if they chose to do so, compiler writers have a mentality that any code which tries to overlay data onto structures should be considered defective.
C has no standard mechanism for avoiding padding between structure elements or at the end of the structure. Many implementations provide such a thing as an extension, however, and inasmuch as you seem to want to match structure layout to network message payloads, your only alternative is to rely on such an extension.
Although using __attribute__((packed)) or a work-alike will enable you to use sizeof for your purpose, that's just a bonus. The main point of doing so is to match the structure layout to the network message structure for the benefit of your proposed memory copying. If the structure is laid out with internal padding where the protocol message has none, then a direct, whole-message copy such as you propose simply cannot work. That sizeof otherwise does not give you the correct size is only a symptom of the larger problem.
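The layout assumption described above can be checked at compile time rather than trusted. The following is a minimal sketch, assuming a C11 compiler and a 7-byte wire format for the example message; DG_A_WIRE_SIZE is a name introduced here for illustration only.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Hypothetical wire size of the example message: 2 + 4 + 1 bytes. */
#define DG_A_WIRE_SIZE 7

struct dg_a {
    char id[2];
    char time[4];
    char flags;
};

/* Fail the build if the compiler's layout ever differs from the wire layout. */
static_assert(offsetof(struct dg_a, time) == 2, "unexpected padding before time");
static_assert(offsetof(struct dg_a, flags) == 6, "unexpected padding before flags");
static_assert(sizeof(struct dg_a) == DG_A_WIRE_SIZE, "unexpected trailing padding");
```

If any assertion fires on some target, a whole-message copy is not safe there, and the build stops instead of silently misparsing data.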
Note also that you may face other issues with copying raw bytes, too. In particular, if you intend to exchange messages between machines with different architectures, and these messages contain integers larger than one byte, then you need to account for byte-order differences. If the protocol is well designed, then it in fact specifies byte order. Similarly, if you're passing around character data then you may need to deal with encoding issues (which may themselves have their own byte-ordering considerations).
Overall, you are unlikely to be able to build a robust, portable protocol implementation based on copying whole message payloads into corresponding structures, all at once. At minimum, you would likely need to perform message-type-specific fixup after the main copy. I recommend instead biting the bullet and writing appropriate marshalling functions for each message type into and out of the corresponding network representation. You'll more easily make this portable.
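A marshalling function for the dg_a message from the question might look roughly like this. It is only a sketch: the function names, the DG_A_WIRE_SIZE constant and the field offsets (0, 2 and 6) are assumptions derived from the simplified struct, not from a real protocol specification.

```c
#include <stddef.h>
#include <string.h>

struct dg_a {
    char id[2];
    char time[4];
    char flags;
};

/* Assumed size of the message on the wire: 2 + 4 + 1 bytes. */
enum { DG_A_WIRE_SIZE = 7 };

/* Copy each field from its wire offset. Returns 0 on success,
   -1 if the buffer is too short to hold a complete message. */
static int dg_a_unpack(struct dg_a *out, const unsigned char *buf, size_t len)
{
    if (len < DG_A_WIRE_SIZE)
        return -1;
    memcpy(out->id, buf + 0, sizeof out->id);
    memcpy(out->time, buf + 2, sizeof out->time);
    memcpy(&out->flags, buf + 6, sizeof out->flags);
    return 0;
}

/* The inverse operation: write each field to its wire offset. */
static int dg_a_pack(const struct dg_a *in, unsigned char *buf, size_t len)
{
    if (len < DG_A_WIRE_SIZE)
        return -1;
    memcpy(buf + 0, in->id, sizeof in->id);
    memcpy(buf + 2, in->time, sizeof in->time);
    memcpy(buf + 6, &in->flags, sizeof in->flags);
    return 0;
}
```

Because every field is copied to and from an explicit offset, the in-memory layout of struct dg_a no longer matters, so no dummy end member and no packing attribute are needed.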
Related
Imagine parsing a string and wanting to extract a sub-string. To represent this sub-string, I see two ways:
// 1. represent it using a start pointer and a length
struct { char *start; size_t length; };
// 2. represent it using two pointers, start and end
struct { char *start; char *end; };
// or it could as well be returned by a function:
char *find_substring(char *s, size_t s_len, size_t *substring_len);
char *find_substring(char *s, size_t s_len, char **substring_end);
Is there a reason to prefer one form over the other? Is it only down to preferences? I don't see it affecting performances as one can be translated into the other using a simple addition/subtraction but I might be wrong.
The context is an HTTP request parser if that changes anything. I used the first one but I'm curious if the second one brings anything to the table as I have seen it used in picohttpparser.
Is there a reason to prefer one form over the other?
One could choose optimization and speed of execution as the measure of preference of one over the other.
If you more often append data at the end, then *end++ would be faster than start[length++].
If you more often get the length of the string, then just length would be faster than end - start.
Remember about rules of optimization. The only real answer comes from profiling your code.
Is it only down to preferences?
I advise to prefer the more appropriate representation to the problem you are trying to model, based on how readable it is, how easy it is to use it and find bugs in it, which comes down to personal preference.
We could also inspect existing implementations. In C, most (all?) standard library functions, and POSIX interfaces such as strbuf, aiocb, XSI message queues and iovec, use pointer+integer to represent a memory region. I think all C++ implementations of std::vector, like libstdc++ (GCC) or libc++ (LLVM), use pointers internally, but one can expect them to be optimized for push_back() operations.
Generally I lean toward using pointers. When operating on size_t you have to handle overflow, underflow and negative or too-big values, or convert from the pointer difference type ptrdiff_t to size_t. Such problems mostly disappear with pointers: a pointer is either valid or not, and you only need to bounds-check with the < and > operators before incrementing or decrementing it. However, when writing an external API, I would use size_t, as C programmers are used to representing a memory region using that convention.
In most cases this would be down to personal preference. I guess most people choose the first representation. But depending what you plan to do with that substring the second implementation may be better performance-wise.
With the second implementation you have to be specific about where end points to: is it the last character still in substring or the first character beyond substring.
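To make the convention concrete, here is a small sketch of both representations with end defined as one past the last character (a half-open range). The struct and function names are made up for this illustration.

```c
#include <stddef.h>

/* Form 1: start pointer plus length. */
struct span_len {
    const char *start;
    size_t length;
};

/* Form 2: two pointers; end points one past the last character. */
struct span_ptr {
    const char *start;
    const char *end;
};

/* Converting between the two forms is a single addition or subtraction. */
static struct span_ptr span_to_ptr(struct span_len s)
{
    struct span_ptr r = { s.start, s.start + s.length };
    return r;
}

static struct span_len span_to_len(struct span_ptr s)
{
    struct span_len r = { s.start, (size_t)(s.end - s.start) };
    return r;
}
```

With the half-open convention, an empty substring is simply start == end, and the length is always end - start without a +1 correction.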
The first way is the preferred way. For example, consider that you have to deal with very large strings. Then the string will no longer be a simple contiguous allocation of bytes, and you will have to represent it in a more complicated manner.
The second way leaks the information about internal representation of string while the first does not.
I need to serialize a C struct to a file in a portable way, so that I can read the file on other machines and can be guaranteed that I will get the same thing that I put in.
The file format doesn't matter as long as it is reasonably compact (writing out the in-memory representation of a struct would be ideal if it wasn't for the portability issues.)
Is there a clean way to easily achieve this?
You are essentially designing a binary network protocol, so you may want to use an existing library (like Google's protocol buffers). If you still want to design your own, you can achieve reasonable portability of writing raw structs by doing this:
Pack your structs (GCC's __attribute__((packed)), MSVC's #pragma pack). This is compiler-specific.
Make sure your integer endianness is correct (htons, htonl). This is architecture-specific.
Do not use pointers for strings (use character buffers).
Use C99 exact integer sizes (uint32_t etc).
Ensure that the code only compiles where CHAR_BIT is 8, which is the most common, or otherwise handles transformation of character strings to a stream of 8-bit octets. There are some environments where CHAR_BIT != 8, but they tend to be special-purpose hardware.
With this you can be reasonably sure you will get the same result on the other end as long as you are using the same struct definition. I am not sure about floating point numbers representation, however, but I usually avoid sending those.
Another thing unrelated to portability you may want to address is backwards compatibility by introducing length as a first field, and/or using version tag.
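As an alternative to packing the struct, the fields can be written one at a time with an explicit byte order, which also makes it easy to prepend the length and version fields mentioned above. This is a minimal sketch: struct record, its wire layout and all function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record used only for this sketch. */
struct record {
    uint32_t id;
    uint16_t flags;
    char name[8];
};

/* Assumed wire layout: u16 payload length, u8 version, then the payload
   (u32 id, u16 flags, 8 name bytes). All integers are big-endian. */
enum {
    RECORD_VERSION = 1,
    RECORD_PAYLOAD_SIZE = 4 + 2 + 8,
    RECORD_WIRE_SIZE = 2 + 1 + RECORD_PAYLOAD_SIZE
};

static void put_u16_be(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

static void put_u32_be(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint16_t get_u16_be(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t get_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Writes RECORD_WIRE_SIZE bytes to out and returns that count. */
static size_t record_write(const struct record *r, uint8_t *out)
{
    put_u16_be(out, RECORD_PAYLOAD_SIZE);
    out[2] = RECORD_VERSION;
    put_u32_be(out + 3, r->id);
    put_u16_be(out + 7, r->flags);
    memcpy(out + 9, r->name, sizeof r->name);
    return RECORD_WIRE_SIZE;
}

/* Returns 0 on success, -1 on short input or unknown length/version. */
static int record_read(struct record *r, const uint8_t *in, size_t len)
{
    if (len < RECORD_WIRE_SIZE)
        return -1;
    if (get_u16_be(in) != RECORD_PAYLOAD_SIZE || in[2] != RECORD_VERSION)
        return -1;
    r->id = get_u32_be(in + 3);
    r->flags = get_u16_be(in + 7);
    memcpy(r->name, in + 9, sizeof r->name);
    return 0;
}
```

The shifts work the same on little- and big-endian hosts, so no htonl-style conversion or compiler-specific packing is involved.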
You could try using a library such as protocol buffers; rolling your own is probably not worth the effort.
Write one function for output. Use fprintf to print an ASCII representation of each field to the file, one field per line.
Write one function for input. Use fgets to load each line from the file, then sscanf to convert it to binary, directly into the field in your structure.
If you plan on doing this with a lot of different structures, consider adding a header to each file that identifies what kind of structure it represents.
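A rough sketch of this text-based approach follows. The struct point type and the "point" header line are made up for the example; a real format would pick its own fields and type tag.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical structure used only for this sketch. */
struct point {
    int x;
    int y;
    long id;
};

/* Write a type header, then one field per line. Returns 0 on success. */
static int point_write(FILE *f, const struct point *p)
{
    int n = fprintf(f, "point\n%d\n%d\n%ld\n", p->x, p->y, p->id);
    return n < 0 ? -1 : 0;
}

/* Read the header and each field back, one line at a time.
   Returns 0 on success, -1 on a missing or malformed line. */
static int point_read(FILE *f, struct point *p)
{
    char line[64];

    if (fgets(line, sizeof line, f) == NULL || strcmp(line, "point\n") != 0)
        return -1;
    if (fgets(line, sizeof line, f) == NULL || sscanf(line, "%d", &p->x) != 1)
        return -1;
    if (fgets(line, sizeof line, f) == NULL || sscanf(line, "%d", &p->y) != 1)
        return -1;
    if (fgets(line, sizeof line, f) == NULL || sscanf(line, "%ld", &p->id) != 1)
        return -1;
    return 0;
}
```

The file is larger than a binary dump, but it is independent of byte order and padding, and it can be inspected with any text editor.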
I'm working on a microcontroller-based software project.
A part of the project is a parser for a binary protocol.
The protocol is fixed and cannot be changed.
A PC is acting as a "master" and mainly transmits commands, which have to be executed by the "slave", the microcontroller board.
The protocol data is received by a hardware communication interface, e.g. UART, CAN or Ethernet.
That's not the problem.
After all bytes of a frame (4 - 10, depending on the command) are received, they are stored in a buffer of type "uint8_t cmdBuffer[10]" and a flag is set, indicating that the command can now be executed.
The first byte of a frame (cmdBuffer[0]) contains the command, the rest of the frame are parameters for the command, which may differ in number and size, depending on the command.
This means, the payload can be interpreted in many ways. For every possible command, the data bytes change their meaning.
I don't want to have many ugly bit operations, but self-documenting code.
So my approach is:
I create a "typedef struct" for each command
After determining the command, the pointer to the cmdBuffer is cast to a pointer to my new typedef
By doing so, I can access the command's parameters as structure members, avoiding magic numbers in array access and bit operations for parameters > 8 bit, and it is easier to read
Example:
typedef struct
{
uint8_t commandCode;
uint8_t parameter_1;
uint32_t anotherParameter;
uint16_t oneMoreParameter;
}payloadA_t;
//typedefs for payloadB_t and payloadC_t, which may have different parameters
void parseProtocolData(uint8_t *data, uint8_t length)
{
uint8_t commandToExecute;
//this, in fact, just returns data[0]
commandToExecute = getPayloadType(data, length);
if (commandToExecute == COMMAND_A)
{
executeCommand_A( (payloadA_t *) data);
}
else if (commandToExecute == COMMAND_B)
{
executeCommand_B( (payloadB_t *) data);
}
else if (commandToExecute == COMMAND_C)
{
executeCommand_C( (payloadC_t *) data);
}
else
{
//error, unknown command
}
}
I see two problems with this:
First, depending on the microcontroller architecture, the byteorder may be intel or motorola for 2 or 4- byte parameters.
This should not be much problem. The protocol itself uses network byte order. On the target controller, a macro can be used for correcting the order.
The major problem: there may be padding bytes in my typedef struct. I'm using gcc, so I can just add a "packed" attribute to my typedef. Other compilers provide pragmas for this. However, on 32-bit machines, packed structures will result in bigger (and slower) machine code. OK, this may also not be a problem. But I've heard there can be a hardware fault when accessing unaligned memory (on ARM architecture, for example).
There are many commands (around 50), so I don't want to access the cmdBuffer as an array
I think the "structure approach" increases code readability in contrast to the "array approach"
So my questions:
Is this approach OK, or is it just a dirty hack?
are there cases where the compiler can rely on the "strict aliasing rule" and make my approach not work?
Is there a better solution? How would you solve this problem?
Can this be kept, at least a little, portable?
Regards,
lugge
Generally, structs are dangerous for storing data protocols because of padding. For portable code, you probably want to avoid them. Keeping the raw array of data is therefore the best idea still. You only need a way to interpret it differently depending on the received command.
This scenario is a typical example where some kind of polymorphism is desired. Unfortunately, C has no built-in support for that OO feature, so you'll have to create it yourself.
The best way to do this depends on the nature of these different kinds of data. Since I don't know that, I can only suggest one such way; it may or may not be optimal for your specific case:
typedef enum
{
COMMAND_THIS,
COMMAND_THAT,
... // all 50 commands
COMMANDS_N // a constant which is equal to the number of commands
} cmd_index_t;
typedef struct
{
uint8_t command; // the original command, can be anything
cmd_index_t index; // a number 0 to 49
uint8_t data [MAX]; // the raw data
} cmd_t;
Step one would then be that upon receiving a command, you translate it to the corresponding index.
// ...receive data and place it in cmdBuffer[10], then:
cmd_t cmd;
cmd_create(&cmd, cmdBuffer[0], &cmdBuffer[1]);
...
void cmd_create (cmd_t* cmd, uint8_t command, uint8_t* data)
{
cmd->command = command;
memcpy(cmd->data, data, MAX);
switch(command)
{
case THIS: cmd->index = COMMAND_THIS; break;
case THAT: cmd->index = COMMAND_THAT; break;
...
}
}
Once you have an index from 0 to N-1, you can implement lookup tables. Each such lookup table can be an array of function pointers, which determine the specific interpretation of the data. For example:
typedef void (*interpreter_func_t)(uint8_t* data);
const interpreter_func_t interpret [COMMANDS_N] =
{
&interpret_this_command,
&interpret_that_command,
...
};
Use:
interpret[cmd->index] (cmd->data);
Then you can make similar lookup tables for different tasks.
initialize [cmd->index] (cmd->data);
interpret [cmd->index] (cmd->data);
repackage [cmd->index] (cmd->data);
do_stuff [cmd->index] (cmd->data);
...
Use different lookup tables for different architectures. Things like endianness can be handled inside the interpreter functions. And you can of course change the function prototypes; maybe you need to return something or pass more parameters, etc.
Note that the above example is most suitable when all commands result in the same kind of actions. If you need to do entirely different things depending on command, other approaches are more suitable.
IMHO it is a dirty hack. The code may break when ported to a system with different alignment requirements, different variable sizes, different type representations (e.g. big endian / little endian). Or even on the same system but different version of compiler / system headers / whatever.
I don't think it violates strict aliasing, so long as the relevant bytes form a valid representation.
I would just write code to read the data in a well-defined manner, e.g.
bool extract_A(PayloadA_t *out, uint8_t const *in)
{
out->foo = in[0];
out->bar = read_uint32(in + 1, 4);
// ...
return true;
}
This may run slightly slower than the "hack" version, it depends on your requirements whether you prefer maintenance headaches, or those extra microseconds.
Answering your questions in the same order:
This approach is quite common, but it's still called a dirty hack by any book I know that mentions this technique. You spelled the reasons out yourself: in essence it's highly unportable or requires a lot of preprocessor magic to make it portable.
strict aliasing rule: see the top voted answer for What is the strict aliasing rule?
The only alternative solution I know is to explicitly code the deserialization as you mentioned yourself. This can actually be made very readable like this:
const uint8_t *p = buffer;
struct my_struct s; // whichever struct holds the decoded fields
s.field1 = read_u32(&p);
s.field2 = read_u16(&p);
I.e., I would make the read functions move the pointer forward by the number of deserialized bytes.
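Such cursor-advancing readers could be written along these lines. This is only a sketch: it assumes the protocol is big-endian (network byte order, as stated in the question), it takes the cursor as a const uint8_t *, and it does no bounds checking, which the caller would have to do before reading.

```c
#include <stdint.h>

/* Read a big-endian 16-bit value and advance the cursor by 2 bytes. */
static uint16_t read_u16(const uint8_t **p)
{
    const uint8_t *b = *p;
    *p += 2;
    return (uint16_t)(((unsigned)b[0] << 8) | b[1]);
}

/* Read a big-endian 32-bit value and advance the cursor by 4 bytes. */
static uint32_t read_u32(const uint8_t **p)
{
    const uint8_t *b = *p;
    *p += 4;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

Since the bytes are fetched one at a time, this works at any buffer alignment and on hosts of either endianness.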
As said above, you can use the preprocessor to handle different endianness and struct packing.
It's a dirty hack. The biggest problem I see with this solution is memory alignment rather than endianness or struct packing.
The memory alignment issue is this. Some microcontrollers such as ARM require that multi-byte variables be aligned with certain memory offsets. That is, 2-byte half-words must be aligned on even memory addresses. And 4-byte words must be aligned on memory addresses that are multiples of 4. These alignment rules are not enforced by your serial protocol. So if you simply cast the serial data buffer into a packed structure then the individual structure members may not have the proper alignment. Then when your code tries to access a misaligned member it will result in an alignment fault or undefined behavior. (This is why the compiler creates an un-packed structure by default.)
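One common way to avoid the misaligned access is to copy the bytes into a properly aligned local variable with memcpy instead of dereferencing a cast pointer. A minimal sketch, with a function name chosen for this example:

```c
#include <stdint.h>
#include <string.h>

/* Copy 4 bytes from an arbitrarily aligned address into an aligned local.
   The result still has the byte order of the buffer, so any endianness
   conversion has to happen afterwards. */
static uint32_t load_u32_unaligned(const uint8_t *src)
{
    uint32_t v;
    memcpy(&v, src, sizeof v);
    return v;
}
```

Compilers typically turn this memcpy into a plain load on targets that allow unaligned access and into byte loads on targets that do not, so it usually costs nothing extra where the cast would have worked anyway.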
Regarding endianness, it sounds like you're proposing to correct the byte order when your code accesses the member in the packed structure. If your code accesses the packed structure member multiple times then it will have to correct the endianness every time. It would be more efficient to just correct the endianness once, when the data is first received from the serial port. And this is another reason not to simply cast the data buffer into a packed structure.
When you receive the command, you should parse out each field individually into an unpacked structure where each member is properly aligned and has the proper endianness. Then your microcontroller code can access each member most efficiently. This solution is also more portable if done correctly.
Yes, this is a memory alignment problem.
Which controller are you using?
Just declare the structure with the following syntax,
__attribute__((packed))
Maybe it will solve your problem.
Or you can try to access the variable by address (through a pointer) instead of by value.
I have a small client-server application in which I wish to send an entire structure over a TCP socket in C, not C++. Assume the struct to be the following:
struct something{
int a;
char b[64];
float c;
};
I have found many posts saying that I need to use pragma pack or to serialize the data before sending and receiving.
My question is, is it enough to use JUST pragma pack or just serialization? Or do I need to use both?
Also, since serialization is a processor-intensive process that can make performance fall drastically, what is the best way to serialize a struct WITHOUT using an external library (I would love sample code or an algorithm)?
You need the following to portably send structs over the network:
Pack the structure. For gcc and compatible compilers, do this with __attribute__((packed)).
Do not use any members other than unsigned integers of fixed size, other packed structures satisfying these requirements, or arrays of any of the former. Signed integers are OK too, unless your machine doesn't use a two's complement representation.
Decide whether your protocol will use little- or big-endian encoding of integers. Make conversions when reading and writing those integers.
Also, do not take pointers of members of a packed structure, except to those with size 1 or other nested packed structures. See this answer.
A simple example of encoding and decoding follows. It assumes that the byte order conversion functions hton8(), ntoh8(), hton32(), and ntoh32() are available (the former two are a no-op, but there for consistency).
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
// get byte order conversion functions
#include "byteorder.h"
struct packet {
uint8_t x;
uint32_t y;
} __attribute__((packed));
static void decode_packet (uint8_t *recv_data, size_t recv_len)
{
// check size
if (recv_len < sizeof(struct packet)) {
fprintf(stderr, "received too little!");
return;
}
// make pointer
struct packet *recv_packet = (struct packet *)recv_data;
// fix byte order
uint8_t x = ntoh8(recv_packet->x);
uint32_t y = ntoh32(recv_packet->y);
printf("Decoded: x=%"PRIu8" y=%"PRIu32"\n", x, y);
}
int main (int argc, char *argv[])
{
// build packet
struct packet p;
p.x = hton8(17);
p.y = hton32(2924);
// send packet over link....
// on the other end, get some data (recv_data, recv_len) to decode:
uint8_t *recv_data = (uint8_t *)&p;
size_t recv_len = sizeof(p);
// now decode
decode_packet(recv_data, recv_len);
return 0;
}
As far as byte order conversion functions are concerned, your system's htons()/ntohs() and htonl()/ntohl() can be used, for 16- and 32-bit integers, respectively, to convert to/from big-endian. However, I'm not aware of any standard function for 64-bit integers, or to convert to/from little endian. You can use my byte order conversion functions; if you do so, you have to tell it your machine's byte order by defining BADVPN_LITTLE_ENDIAN or BADVPN_BIG_ENDIAN.
As far as signed integers are concerned, the conversion functions can be implemented safely in the same way as the ones I wrote and linked (swapping bytes directly); just change unsigned to signed.
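For illustration, here is one way such 32-bit conversion functions could be written without knowing the machine's byte order at compile time. This is a sketch, not the linked implementation: it reuses the names hton32() and ntoh32() from the example above, and it assumes "network order" means big-endian.

```c
#include <stdint.h>
#include <string.h>

/* Return a value whose in-memory bytes are the big-endian encoding of host. */
static uint32_t hton32(uint32_t host)
{
    uint8_t b[4] = {
        (uint8_t)(host >> 24), (uint8_t)(host >> 16),
        (uint8_t)(host >> 8), (uint8_t)host
    };
    uint32_t net;
    memcpy(&net, b, sizeof net);
    return net;
}

/* Interpret the in-memory bytes of net as a big-endian integer. */
static uint32_t ntoh32(uint32_t net)
{
    uint8_t b[4];
    memcpy(b, &net, sizeof b);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

On a big-endian host both functions leave the value unchanged, and on a little-endian host they swap the bytes, without any preprocessor configuration.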
UPDATE: if you want an efficient binary protocol, but don't like fiddling with the bytes, you can try something like Protocol Buffers (C implementation). This allows you to describe the format of your messages in separate files, and generates source code that you use to encode and decode messages of the format you specify. I also implemented something similar myself, but greatly simplified; see my BProto generator and some examples (look in .bproto files, and addr.h for usage example).
Before you send any data over a TCP connection, work out a protocol specification. It doesn't have to be a multiple-page document filled with technical jargon. But it does have to specify who transmits what when and it must specify all messages at the byte level. It should specify how the ends of messages are established, whether there are any timeouts and who imposes them, and so on.
Without a specification, it's easy to ask questions that are simply impossible to answer. If something goes wrong, which end is at fault? With a specification, the end that didn't follow the specification is at fault. (And if both ends follow the specification and it still doesn't work, the specification is at fault.)
Once you have a specification, it's much easier to answer questions about how one end or the other should be designed.
I also strongly recommend not designing a network protocol around the specifics of your hardware. At least, not without a proven performance issue.
It depends on whether you can be sure that your systems on either end of the connection are homogeneous or not. If you are sure, for all time (which most of us cannot be), then you can take some shortcuts - but you must be aware that they are shortcuts.
struct something some;
...
if ((nbytes = write(sockfd, &some, sizeof(some))) != (ssize_t)sizeof(some))
...short write or erroneous write...
and the analogous read().
However, if there's any chance that the systems might be different, then you need to establish how the data will be transferred formally. You might well linearize (serialize) the data - possibly fancily with something like ASN.1 or probably more simply with a format that can be reread easily. For that, text is often beneficial - it is easier to debug when you can see what's going wrong. Failing that, you need to define the byte order in which an int is transferred and make sure that the transfer follows that order, and the string probably gets a byte count followed by the appropriate amount of data (consider whether to transfer a terminal null or not), and then some representation of the float. This is more fiddly. It is not all that hard to write serialization and deserialization functions to handle the formatting. The tricky part is designing (deciding on) the protocol.
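As a rough sketch of such serialization functions for the struct from the question: the wire format below is an assumption made up for this example (big-endian 32-bit integer, one-byte string length followed by the string without its terminating null, then the float's bit pattern). It further assumes int fits in 32 bits and float is an IEEE 754 binary32 value on both ends.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct something {
    int a;
    char b[64];
    float c;
};

/* Assumed wire format: 4 bytes a (big-endian two's complement),
   1 byte n (string length, 0..63), n string bytes, 4 bytes c (big-endian). */
enum { SOMETHING_MAX_WIRE = 4 + 1 + 63 + 4 };

_Static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

static void put_u32_be(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns the number of bytes written (at most SOMETHING_MAX_WIRE). */
static size_t something_encode(const struct something *s, uint8_t *out)
{
    const char *nul = memchr(s->b, '\0', sizeof s->b);
    size_t n = nul ? (size_t)(nul - s->b) : sizeof s->b - 1;
    uint32_t bits;
    size_t off = 0;

    put_u32_be(out + off, (uint32_t)s->a);
    off += 4;
    out[off++] = (uint8_t)n;
    memcpy(out + off, s->b, n);
    off += n;
    memcpy(&bits, &s->c, sizeof bits);   /* reuse the float's bit pattern */
    put_u32_be(out + off, bits);
    off += 4;
    return off;
}

/* Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t something_decode(struct something *s, const uint8_t *in, size_t len)
{
    size_t n;
    uint32_t ua, bits;

    if (len < 5)
        return 0;
    n = in[4];
    if (n >= sizeof s->b || len < 4 + 1 + n + 4)
        return 0;
    ua = get_u32_be(in);
    /* Convert from two's complement without implementation-defined casts. */
    s->a = ua <= INT32_MAX ? (int32_t)ua : -(int32_t)(UINT32_MAX - ua) - 1;
    memset(s->b, 0, sizeof s->b);
    memcpy(s->b, in + 5, n);
    bits = get_u32_be(in + 5 + n);
    memcpy(&s->c, &bits, sizeof bits);
    return 5 + n + 4;
}
```

Only the bytes actually used by the string are transmitted, and the receiver can reject truncated or oversized messages before touching the struct.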
You could use a union with the structure you want to send and an array:
union SendSomething {
char arr[sizeof(struct something)];
struct something smth;
};
This way you can send and receive just arr. Of course, you have to take care of endianness issues, and sizeof(struct something) might vary across machines (but you can easily overcome this with a #pragma pack).
Why would you do this when there are good and fast serialization libraries out there like Message Pack which do all the hard work for you, and as a bonus they provide you with cross-language compatibility of your socket protocol?
Use Message Pack or some other serialization library to do this.
Usually, serialization brings several benefits over e.g. sending the bits of the structure over the wire (with e.g. fwrite).
It happens individually for each non-aggregate atomic data (e.g. int).
It defines precisely the serial data format sent over the wire
So it deals with heterogeneous architectures: sending and receiving machines could have different word lengths and endianness.
It may be less brittle when the type changes a little bit. So if one machine has an old version of your code running, it might be able to talk with a machine with a more recent version, e.g. one having a char b[80]; instead of char b[64];
It may deal with more complex data structures -variable-sized vectors, or even hash-tables- with a logical way (for the hash-table, transmit the association, ..)
Very often, the serialization routines are generated. Even 20 years ago, RPC/XDR already existed for that purpose, and XDR serialization primitives are still in many libc implementations.
Pragma pack is used for the binary compatibility of your struct on the other end.
This matters because the server or the client to which you send the struct may be written in another language or built with another C compiler or with other C compiler options.
Serialization, as I understand it, is making a stream of bytes from your struct. When you write your struct to the socket, you are performing serialization.
Google Protocol Buffers offers a nifty solution to this problem. Refer here: Google Protocol Buffers - C Implementation
Create a .proto file based on the structure of your payload and save it as payload.proto
syntax = "proto3";
message Payload {
int32 age = 1;
string name = 2;
}
Compile the .proto file using
protoc --c_out=. payload.proto
This will create the header file payload.pb-c.h and its corresponding payload.pb-c.c in your directory.
Create your server.c file and include the protobuf-c header files
#include<stdio.h>
#include"payload.pb-c.h"
int main()
{
Payload pload = PAYLOAD__INIT;
pload.name = "Adam";
pload.age = 1300000;
int len = payload__get_packed_size(&pload);
uint8_t buffer[len];
payload__pack(&pload, buffer);
// Now send this buffer to the client via socket.
}
On your receiving side client.c
....
int main()
{
uint8_t buffer[MAX_SIZE]; // load this buffer with the socket data.
size_t buffer_len; // Length of the buffer obtain via read()
Payload *pload = payload__unpack(NULL, buffer_len, buffer);
printf("Age : %d Name : %s", pload->age, pload->name);
}
Make sure you compile your programs with -lprotobuf-c flag
gcc server.c payload.pb-c.c -lprotobuf-c -o server.out
gcc client.c payload.pb-c.c -lprotobuf-c -o client.out
Are there any libraries or guides for how to read and parse binary data in C?
I am looking at some functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a more useable form by the code.
Are there any libraries out there that do this, or even a primer on performing this type of thing?
I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:
Endianness: The architecture you're using right now might be big-endian, but your next target might be little-endian. Or vice-versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.
Alignment: The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, often the ethernet header pushes the IP header itself onto a 2-byte boundary.)
Padding: Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.
Bit Ordering: The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for the issues, then write a set of symmetric pack/parse routines that use the struct as a go-between.
typedef struct _MyProtocolData
{
Bool myBitA; // Using a "Bool" type wastes a lot of space, but it's fast.
Bool myBitB;
Word32 myWord; // You have a list of base types like Word32, right?
} MyProtocolData;
Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
// Somewhere, your code has to pick out the bits. Best to just do it one place.
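// Parenthesize the mask: >> binds tighter than &, so without the parentheses the mask would be shifted, not the result.
pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;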
// Endianness and Alignment issues go away when you fetch byte-at-a-time.
// Here, I'm assuming the protocol is big-endian.
// You could also write a library of "word fetchers" for different sizes and endiannesses.
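// Cast to Word32 before shifting: a Byte promotes to int, and shifting a set bit into the sign bit is undefined.
pData->myWord = (Word32)*(pProtocol + MY_WORD_OFFSET + 0) << 24;
pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 1) << 16;
pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 2) << 8;
pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 3);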
// You could return something useful, like the end of the protocol or an error code.
}
Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
// Exercise for the reader! :)
}
Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
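For what it's worth, here is one way the pack routine left as an exercise might look, next to a matching parse routine. This is only a sketch: the base types, the offsets, masks and shifts, and MY_PROTOCOL_SIZE are all invented for illustration (one flag byte followed by a big-endian 32-bit word), since the answer does not define them.

```c
#include <stdint.h>

/* Hypothetical base types and wire layout, invented for this sketch. */
typedef uint8_t  Byte;
typedef uint32_t Word32;
typedef int      Bool;

#define MY_BITS_OFFSET   0      /* assumed: one flag byte at offset 0 */
#define MY_BIT_A_MASK    0x01u
#define MY_BIT_A_SHIFT   0
#define MY_BIT_B_MASK    0x02u
#define MY_BIT_B_SHIFT   1
#define MY_WORD_OFFSET   1      /* assumed: big-endian 32-bit word at offsets 1..4 */
#define MY_PROTOCOL_SIZE 5      /* assumed total size of the wire format */

typedef struct MyProtocolData
{
    Bool   myBitA;
    Bool   myBitB;
    Word32 myWord;
} MyProtocolData;

/* Native struct -> wire bytes. pProtocol must have room for MY_PROTOCOL_SIZE bytes. */
void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    Byte bits = 0;

    /* Put each flag back at the bit position the parser reads it from. */
    if (pData->myBitA) bits = (Byte)(bits | MY_BIT_A_MASK);
    if (pData->myBitB) bits = (Byte)(bits | MY_BIT_B_MASK);
    pProtocol[MY_BITS_OFFSET] = bits;

    /* Store the word one byte at a time, most significant byte first. */
    pProtocol[MY_WORD_OFFSET + 0] = (Byte)(pData->myWord >> 24);
    pProtocol[MY_WORD_OFFSET + 1] = (Byte)(pData->myWord >> 16);
    pProtocol[MY_WORD_OFFSET + 2] = (Byte)(pData->myWord >> 8);
    pProtocol[MY_WORD_OFFSET + 3] = (Byte)(pData->myWord);
}

/* Wire bytes -> native struct, the mirror image of the pack routine. */
void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    pData->myBitA = (pProtocol[MY_BITS_OFFSET] & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (pProtocol[MY_BITS_OFFSET] & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    pData->myWord  = (Word32)pProtocol[MY_WORD_OFFSET + 0] << 24;
    pData->myWord |= (Word32)pProtocol[MY_WORD_OFFSET + 1] << 16;
    pData->myWord |= (Word32)pProtocol[MY_WORD_OFFSET + 2] << 8;
    pData->myWord |= (Word32)pProtocol[MY_WORD_OFFSET + 3];
}
```

Because both routines touch the buffer only one byte at a time, the same code should behave identically on big-endian and little-endian hosts, and the buffer does not need any particular alignment.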
The standard way to do this in C/C++ is really casting to structs as 'gwaredd' suggested
It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.
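A rough sketch of that "overlay, then validate" idea follows. Note that it swaps the pointer cast for a memcpy into a local struct, which sidesteps alignment and aliasing questions; the validation step is the same either way. The struct msg_header layout, the 'M','P' magic bytes and the type range 1..3 are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wire header. Every field is a single byte or byte array,
   so mainstream compilers lay it out without padding. */
struct msg_header {
    unsigned char magic[2];   /* assumed to be 'M','P' */
    unsigned char type;       /* assumed valid range: 1..3 */
    unsigned char length;     /* assumed: number of payload bytes that follow */
};

/* Refuse to compile if the compiler did add padding after all (C11). */
_Static_assert(sizeof(struct msg_header) == 4, "unexpected padding in msg_header");

/* Copies the header out of buf and checks it.
   Returns 0 if the header looks valid, -1 otherwise. */
int msg_header_read(const unsigned char *buf, size_t buflen, struct msg_header *out)
{
    if (buflen < sizeof *out)
        return -1;                              /* not even a full header */

    memcpy(out, buf, sizeof *out);

    if (out->magic[0] != 'M' || out->magic[1] != 'P')
        return -1;                              /* wrong magic bytes */
    if (out->type < 1 || out->type > 3)
        return -1;                              /* unknown message type */
    if ((size_t)out->length > buflen - sizeof *out)
        return -1;                              /* payload would run off the buffer */

    return 0;
}
```

Only after this returns 0 would the rest of the code trust the fields. This works so simply only because every field is a byte; multi-byte integer fields would still need endianness handling.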
Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it (the victim will understand, it's like stealing food or something...), but do read it.
After reading the Stevens, most of this will make a lot more sense.
Let me restate your question to see if I understood properly. You are
looking for software that will take a formal description of a packet
and then will produce a "decoder" to parse such packets?
If so, the reference in that field is PADS. A good article
introducing it is PADS: A Domain-Specific Language for Processing Ad
Hoc Data. PADS is very complete but unfortunately under a non-free licence.
There are possible alternatives (I did not mention non-C
solutions). Apparently, none can be regarded as completely production-ready:
binpac
PacketTypes
DataScript
If you read French, I summarized these issues in Génération de
décodeurs de formats binaires.
In my experience, the best way is to first write a set of primitives, to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness-issues: just make the functions do it right.
Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.
As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine; you could add a layer of custom types to ensure that too, if needed):
const void * read_uint8(const void *buffer, unsigned char *value)
{
const unsigned char *vptr = buffer;
*value = *vptr++; /* a void pointer cannot be dereferenced, so read through vptr */
return vptr;
}
Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste; you could of course return the value and update the pointer by reference instead. It is a crucial part of the design that the read function updates the pointer, to make these chainable.
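For comparison, the other style mentioned above (return the value, advance the pointer through a reference) might look roughly like this. The name read_uint8_v and the cursor parameter are hypothetical, chosen only for this sketch.

```c
/* Hypothetical variant: returns the byte and advances the caller's cursor. */
unsigned char read_uint8_v(const unsigned char **cursor)
{
    unsigned char value = **cursor;   /* fetch the byte under the cursor */
    *cursor += 1;                     /* move the caller's pointer past it */
    return value;
}
```

One thing to watch with this style: calling it twice inside a single expression, such as combining two reads with |, leaves the order of the two reads unspecified in C, so each read is safer as its own statement.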
Now, we can write a similar function to read a 16-bit unsigned quantity:
const void * read_uint16(const void *buffer, unsigned short *value)
{
unsigned char lo, hi;
buffer = read_uint8(buffer, &hi);
buffer = read_uint8(buffer, &lo);
*value = (hi << 8) | lo;
return buffer;
}
Here I assumed incoming data is big-endian, which is common in networking protocols (mainly for historical reasons). You could of course get clever and do some pointer arithmetic and remove the need for a temporary, but I find this way makes it clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
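The write side can mirror the read primitives. The following is a sketch only; write_uint8 and write_uint16 are names I am assuming to match the read functions, and the output is big-endian like the reads.

```c
/* Hypothetical write-side counterpart of read_uint8:
   stores one byte and returns the advanced pointer. */
void * write_uint8(void *buffer, unsigned char value)
{
    unsigned char *vptr = buffer;
    *vptr++ = value;
    return vptr;
}

/* Stores a 16-bit quantity big-endian: high byte first, then low byte. */
void * write_uint16(void *buffer, unsigned short value)
{
    buffer = write_uint8(buffer, (unsigned char)((value >> 8) & 0xFF));
    buffer = write_uint8(buffer, (unsigned char)(value & 0xFF));
    return buffer;
}
```

As with the read functions, each call returns the position just past what it wrote, so calls can be chained field after field.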
The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
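To make the message level concrete, here is a sketch of an unpack function built by chaining the primitives. The struct login_msg and its fields are invented for illustration; the two primitives are repeated (with the 8-bit read going through a typed pointer) so that the example stands alone.

```c
/* Read primitives in the style described above, repeated for self-containment. */
const void * read_uint8(const void *buffer, unsigned char *value)
{
    const unsigned char *vptr = buffer;
    *value = *vptr++;
    return vptr;
}

const void * read_uint16(const void *buffer, unsigned short *value)
{
    unsigned char lo, hi;
    buffer = read_uint8(buffer, &hi);   /* big-endian: high byte arrives first */
    buffer = read_uint8(buffer, &lo);
    *value = (unsigned short)((hi << 8) | lo);
    return buffer;
}

/* Hypothetical protocol message: 1-byte version, then two 16-bit fields. */
struct login_msg {
    unsigned char  version;
    unsigned short user_id;
    unsigned short session;
};

/* Unpacks one message field by field and returns the position just past it. */
const void * read_login_msg(const void *buffer, struct login_msg *msg)
{
    buffer = read_uint8(buffer, &msg->version);
    buffer = read_uint16(buffer, &msg->user_id);
    buffer = read_uint16(buffer, &msg->session);
    return buffer;
}
```

A function like read_login_msg is regular enough that it is a natural target for code generation: the field list is the only part that changes from message to message.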
You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)
You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.
struct SomeDataFormat
{
....
};
struct SomeDataFormat *pParsedData = (struct SomeDataFormat *) pBuffer;
Just be wary of endian issues, type sizes, alignment, reading off the end of buffers, etc.
Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.
I don't really understand what kind of library you are looking for. A generic library that will take any binary input and parse it into an unknown format?
I'm not sure such a library can exist in any language.
I think you need to elaborate on your question a little.
Edit:
OK, so after reading Jon's answer, it seems there is such a library, or rather a code generation tool. But as many have stated, you are fine just casting the data to the appropriate data structure with appropriate care, i.e. using packed structures and taking care of endian issues. Using such a tool with C is overkill.
The suggestions about casting to a struct basically work, but please be aware that numbers can be represented differently on different architectures.
Network byte order was introduced to deal with endian issues: common practice is to convert numbers from host byte order to network byte order before sending the data, and to convert back to host order on receipt. See the functions htonl, htons, ntohl and ntohs.
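A minimal sketch of that convention, assuming a POSIX system where htonl and ntohl are declared in <arpa/inet.h> (on Windows they come from the Winsock headers instead). The helper names put_u32 and get_u32 are made up for this example.

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes a host-order value into a byte buffer in network order. */
void put_u32(unsigned char *out, uint32_t host_value)
{
    uint32_t net = htonl(host_value);   /* host order -> network (big-endian) order */
    memcpy(out, &net, sizeof net);      /* memcpy avoids any alignment requirement on out */
}

/* Hypothetical helper: reads a network-order value from a byte buffer into host order. */
uint32_t get_u32(const unsigned char *in)
{
    uint32_t net;
    memcpy(&net, in, sizeof net);
    return ntohl(net);                  /* network order -> host order */
}
```

Whatever the host's byte order, the bytes on the wire come out most significant first, so both ends agree on the value.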
And really consider kervin's advice - read UNP. You won't regret it!