Best way to assert compatible architecture when reading and writing a file? - c

I have a program that reads and writes a binary file. A file is interchangeable between executions of the program on the same platform, but a file produced on one machine may not be valid on another platform due to differences in type sizes, endianness, etc.
I want a quick way to be able to assert that a given file is valid for reading on a given architecture. I am not interested in making a file cross-architecture (in fact the file is memory-mapped structs). I only want a way of checking that the file was created on an architecture with the same size types, etc before reading it.
One idea is to write a struct containing constant magic numbers into the start of the file; this can be read back and verified. Another would be to store the sizes of various types in single-byte integers.
This is for C but I suppose the question is language-agnostic for languages with the same kinds of issues.
What's the best way to do this?
I welcome amendments for the title of this question!

I like the magic number at the start of the file idea. You get to make these checks with a magic value:
If there are at least two magic bytes and you treat them as a single multi-byte integer, you can detect endianness changes. For instance, if you choose 0xABCD, and your code reads 0xCDAB, you're on a platform with different endianness than the one where the file was written.
If you use a 4- or 8-byte integer, you can detect 32- vs. 64-bit platforms, if you choose your data type so it's a different size on the two platforms.
If there is more to the header than just an integer, or you choose its value carefully, you can rule out, to a high degree of probability, the possibility of accidentally reading a file written by another program. See /etc/magic on any Unixy type system for a good list of values to avoid.
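A minimal sketch of the endianness check described above (my own illustration, reusing the 0xABCD example):

#include <stdint.h>
#include <stdio.h>

/* Returns 0 for same byte order, 1 for byte-swapped, -1 for not our file. */
int check_magic(FILE *fp)
{
    uint16_t magic;
    if (fread(&magic, sizeof magic, 1, fp) != 1) return -1;
    if (magic == 0xABCD) return 0;  /* written on a platform with our byte order */
    if (magic == 0xCDAB) return 1;  /* written on a byte-swapped platform */
    return -1;                      /* probably not one of our files */
}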

#include <stdint.h>

union header {
    uint8_t  a[8];
    uint64_t u;
};

/* Cast each sizeof to uint64_t: size_t may be 32 bits wide, and shifting
   it by 32 or more would be undefined behavior. */
const union header h = { .u = ((uint64_t)sizeof( short )       << 0 )
                            | ((uint64_t)sizeof( int )         << 8 )
                            | ((uint64_t)sizeof( long )        << 16)
                            | ((uint64_t)sizeof( long long )   << 24)
                            | ((uint64_t)sizeof( float )       << 32)
                            | ((uint64_t)sizeof( double )      << 40)
                            | ((uint64_t)sizeof( long double ) << 48)
                            | 0 };
This should be enough to verify the type sizes and endianness, except that floating point numbers are crazy difficult for this.
If you want to verify that your floating point numbers are stored in the same format on the writer and the reader then you might want to store a couple of constant floating point numbers (more interesting than 0, 1, and -1) in the different sizes after this header, and verify that they are what you think they should be.
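A minimal sketch of that idea (my own illustration; the constant is arbitrary): store a known double right after the header and compare it bit-for-bit on load.

#include <string.h>

static const double fp_probe = -123456.789e-5;  /* arbitrary, non-trivial constant */

/* stored points at sizeof(double) bytes read back from the file. */
int fp_format_matches(const unsigned char *stored)
{
    return memcmp(stored, &fp_probe, sizeof fp_probe) == 0;
}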
Storing an actual magic string with a version number would very likely also be good, as another check that this is the correct file format.
If you don't care about floats or something like that then feel free to delete them. I didn't include char because it is supposed to always be 1 byte.
It might be a good idea to also store the sizeof a struct like:

struct misaligned {
    char c;
    uint64_t u;
};
This should allow you to easily determine the alignment and padding of the compiler that generated the code that generated the file. On most 32-bit computers that care about alignment, sizeof this struct would be 12 because there would be 3 bytes of padding between c and u; on a 64-bit machine it may be 16, with 7 bytes of padding between c and u. On an AVR the sizeof would most likely be 9 because there would be no padding.
NOTE: this answer relied on the question stating that the files were being memory-mapped, with no need for portability beyond recognizing that a file is the wrong format. If the question were about general file storage and retrieval, I would have answered differently. The biggest difference would be packing the data structures.

Call the uname(2) function (or equivalent on non-POSIX platforms) and write the sysname and machine fields from the struct utsname into a header at the start of the file.
(There's more to it than just sizes and endianness - there's also floating point formats and structure padding standards that vary too. So it's really the machine ABI that you want to assert is the same).
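A minimal sketch of that approach (my own illustration; the fixed 65-byte fields are an assumption, since the sizes of the struct utsname arrays are implementation-defined):

#include <sys/utsname.h>
#include <stdio.h>
#include <string.h>

struct file_header {
    char sysname[65];   /* e.g. "Linux" */
    char machine[65];   /* e.g. "x86_64" */
};

int write_platform_header(FILE *fp)
{
    struct utsname un;
    struct file_header hdr;

    if (uname(&un) != 0) return -1;
    memset(&hdr, 0, sizeof hdr);
    strncpy(hdr.sysname, un.sysname, sizeof hdr.sysname - 1);
    strncpy(hdr.machine, un.machine, sizeof hdr.machine - 1);
    return fwrite(&hdr, sizeof hdr, 1, fp) == 1 ? 0 : -1;
}

On load, read the header back and memcmp it against the running machine's uname values before touching the rest of the file.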

First, I fully agree with the previous answer provided by Warren Young.
This is a metadata problem we're talking about.
On a filesystem with homogeneous content, I'd prefer a single metadata block, padded to the size of a structure, at the beginning of the binary file. This preserves data-structure alignment and simplifies append-writing.
If the content is heterogeneous, I'd prefer a Structure-Value or Structure-Length-Value (also known as Type-Length-Value) header in front of each datum or range of data.
On a stream that may be joined at a random point, you may wish to have some kind of structure synchronization, like HDLC framing (see Wikipedia), plus metadata repetition throughout the constant flow of binary data. If you're familiar with audio/video formats, think of the TAGs inside a data flow that is intrinsically composed of frames.
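A bare-bones sketch of such a Type-Length-Value record (illustrative only, not from the answer):

#include <stdint.h>

struct tlv_header {
    uint8_t  type;    /* what kind of data follows */
    uint32_t length;  /* number of payload bytes that follow */
};
/* File layout: [tlv_header][length bytes][tlv_header][length bytes]... */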
Nice subject!


Use casts to access a byte-array like a structure?

I'm working on a microcontroller-based software project.
A part of the project is a parser for a binary protocol.
The protocol is fixed and cannot be changed.
A PC is acting as a "master" and mainly transmits commands, which have to be executed by the "slave", the microcontroller board.
The protocol data is received by a hardware communication interface, e.g. UART, CAN or Ethernet.
That's not the problem.
After all bytes of a frame (4 - 10, depending on the command) are received, they are stored in a buffer of type "uint8_t cmdBuffer[10]" and a flag is set, indicating that the command can now be executed.
The first byte of a frame (cmdBuffer[0]) contains the command; the rest of the frame holds the parameters for the command, which may differ in number and size depending on the command.
This means the payload can be interpreted in many ways. For every possible command, the data bytes change their meaning.
I don't want lots of ugly bit operations, but self-documenting code.
So my approach is:
I create a "typedef struct" for each command
After determining the command, the pointer to cmdBuffer is cast to a pointer to my new typedef
By doing so, I can access the command's parameters as structure members, avoiding magic numbers in array accesses and bit operations for parameters > 8 bit, and it is easier to read
Example:
typedef struct
{
    uint8_t  commandCode;
    uint8_t  parameter_1;
    uint32_t anotherParameter;
    uint16_t oneMoreParameter;
} payloadA_t;

//typedefs for payloadB_t and payloadC_t, which may have different parameters

void parseProtocolData(uint8_t *data, uint8_t length)
{
    uint8_t commandToExecute;

    //this, in fact, just returns data[0]
    commandToExecute = getPayloadType(data, length);

    if (commandToExecute == COMMAND_A)
    {
        executeCommand_A( (payloadA_t *) data);
    }
    else if (commandToExecute == COMMAND_B)
    {
        executeCommand_B( (payloadB_t *) data);
    }
    else if (commandToExecute == COMMAND_C)
    {
        executeCommand_C( (payloadC_t *) data);
    }
    else
    {
        //error, unknown command
    }
}
I see two problems with this:
First, depending on the microcontroller architecture, the byte order may be little-endian (Intel) or big-endian (Motorola) for 2- or 4-byte parameters.
This should not be much of a problem. The protocol itself uses network byte order, and on the target controller a macro can be used to correct the order.
The major problem: there may be padding bytes in my typedef struct. I'm using gcc, so I can just add a "packed" attribute to my typedefs. Other compilers provide pragmas for this. However, on 32-bit machines, packed structures will result in bigger (and slower) machine code. OK, that may also not be a problem. But I've heard there can be a hardware fault when accessing unaligned memory (on the ARM architecture, for example).
There are many commands (around 50), so I don't want to access the cmdBuffer as an array
I think the "structure approach" increases code readability in contrast to the "array approach"
So my questions:
Is this approach OK, or is it just a dirty hack?
Are there cases where the compiler can rely on the "strict aliasing rule" and make my approach not work?
Is there a better solution? How would you solve this problem?
Can this be kept at least somewhat portable?
Regards,
lugge
Generally, structs are dangerous for storing data protocols because of padding; for portable code, you probably want to avoid them. Keeping the raw array of data is therefore still the best idea. You only need a way to interpret it differently depending on the received command.
This scenario is a typical example where some kind of polymorphism is desired. Unfortunately, C has no built-in support for that OO feature, so you'll have to create it yourself.
The best way to do this depends on the nature of these different kinds of data. Since I don't know that, I can only suggest one such way; it may or may not be optimal for your specific case:
typedef enum
{
    COMMAND_THIS,
    COMMAND_THAT,
    ...        // all 50 commands
    COMMANDS_N // a constant which is equal to the number of commands
} cmd_index_t;

typedef struct
{
    uint8_t     command;    // the original command, can be anything
    cmd_index_t index;      // a number 0 to 49
    uint8_t     data [MAX]; // the raw data
} cmd_t;
Step one would then be that upon receiving a command, you translate it to the corresponding index.
// ...receive data and place it in cmdBuffer[10], then:
cmd_t cmd;
cmd_create(&cmd, cmdBuffer[0], &cmdBuffer[1]);
...

void cmd_create (cmd_t* cmd, uint8_t command, uint8_t* data)
{
    cmd->command = command;
    memcpy(cmd->data, data, MAX);

    switch(command)
    {
        case THIS: cmd->index = COMMAND_THIS; break;
        case THAT: cmd->index = COMMAND_THAT; break;
        ...
    }
}
Once you have an index 0 to N, you can implement lookup tables. Each such lookup table can be an array of function pointers, which determine the specific interpretation of the data. For example:

typedef void (*interpreter_func_t)(uint8_t* data);

const interpreter_func_t interpret [COMMANDS_N] =
{
    &interpret_this_command,
    &interpret_that_command,
    ...
};
Use:
interpret[cmd->index] (cmd->data);
Then you can make similar lookup tables for different tasks.
initialize [cmd->index] (cmd->data);
interpret [cmd->index] (cmd->data);
repackage [cmd->index] (cmd->data);
do_stuff [cmd->index] (cmd->data);
...
Use different lookup tables for different architectures. Things like endianness can be handled inside the interpreter functions. And you can of course change the function prototypes; maybe you need to return something or pass more parameters, etc.
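As one hypothetical example of handling byte order inside such a function (my own sketch; the big-endian wire format is an assumption):

#include <stdint.h>

static void interpret_this_command(uint8_t* data)
{
    /* Reassemble a 16-bit parameter byte-by-byte, so host endianness
       and alignment never matter. */
    uint16_t param = (uint16_t)((data[0] << 8) | data[1]);
    /* ... act on param ... */
    (void)param;
}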
Note that the above example is most suitable when all commands result in the same kind of actions. If you need to do entirely different things depending on command, other approaches are more suitable.
IMHO it is a dirty hack. The code may break when ported to a system with different alignment requirements, different variable sizes, different type representations (e.g. big endian / little endian). Or even on the same system but different version of compiler / system headers / whatever.
I don't think it violates strict aliasing, so long as the relevant bytes form a valid representation.
I would just write code to read the data in a well-defined manner, e.g.
bool extract_A(PayloadA_t *out, uint8_t const *in)
{
    out->foo = in[0];
    out->bar = read_uint32(in + 1, 4);
    // ...
    return true;
}
This may run slightly slower than the "hack" version, it depends on your requirements whether you prefer maintenance headaches, or those extra microseconds.
Answering your questions in the same order:
This approach is quite common, but it's still called a dirty hack by any book I know that mentions this technique. You spelled the reasons out yourself: in essence it's highly unportable or requires a lot of preprocessor magic to make it portable.
strict aliasing rule: see the top voted answer for What is the strict aliasing rule?
The only alternative solution I know is to explicitly code the deserialization as you mentioned yourself. This can actually be made very readable like this:
uint8_t *p = buffer;
struct msg_s s;  /* some struct with field1, field2 (the name is illustrative) */
s.field1 = read_u32(&p);
s.field2 = read_u16(&p);
I.e., I would make the read functions move the pointer forward by the number of deserialized bytes.
As said above, you can use the preprocessor to handle different endianness and struct packing.
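A minimal sketch of one such pointer-advancing primitive (my own illustration; big-endian wire data is an assumption):

#include <stdint.h>

static uint16_t read_u16(uint8_t **p)
{
    uint16_t v = (uint16_t)(((*p)[0] << 8) | (*p)[1]);  /* big-endian bytes */
    *p += 2;  /* advance the caller's cursor past the consumed bytes */
    return v;
}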
It's a dirty hack. The biggest problem I see with this solution is memory alignment rather than endianness or struct packing.
The memory alignment issue is this. Some microcontrollers such as ARM require that multi-byte variables be aligned with certain memory offsets. That is, 2-byte half-words must be aligned on even memory addresses. And 4-byte words must be aligned on memory addresses that are multiples of 4. These alignment rules are not enforced by your serial protocol. So if you simply cast the serial data buffer into a packed structure then the individual structure members may not have the proper alignment. Then when your code tries to access a misaligned member it will result in an alignment fault or undefined behavior. (This is why the compiler creates an un-packed structure by default.)
Regarding endianness, it sounds like you're proposing to correct the byte order when your code accesses a member of the packed structure. If your code accesses a packed structure member multiple times, it will have to correct the endianness every time. It would be more efficient to correct the endianness once, when the data is first received from the serial port. And this is another reason not to simply cast the data buffer into a packed structure.
When you receive the command, you should parse out each field individually into an unpacked structure where each member is properly aligned and has the proper endianness. Then your microcontroller code can access each member most efficiently. This solution is also more portable if done correctly.
Yes, this is the memory alignment problem.
Which controller are you using?
Just declare the structure with the following attribute (note the double parentheses):
__attribute__((packed))
Maybe it will solve your problem.
Or you can try to access the variable by reference (by address) instead of by value.

embedding chars in int and vice versa

I have a smart card on which I can store bytes (in multiples of 16).
If I do Save(byteArray, length), then I can do Receive(byteArray, length) and I think I will get the byte array back in the same order I stored it.
Now I have an issue: I realized that if I store an integer on this card and some other machine (with different endianness) reads it, it may get wrong data.
So I thought maybe the solution is to always store data on this card in a little-endian way and always retrieve the data in a little-endian way (I will write the apps that read and write, so I am free to interpret numbers as I like). Is this possible?
Here is something I have come up with:
Embed integer in char array:
int x;
unsigned char buffer[250];
buffer[0] = LSB(x);
buffer[1] = LSB(x>>8);
buffer[2] = LSB(x>>16);
buffer[3] = LSB(x>>24);
The important thing, I think, is that the LSB function should return the least significant byte regardless of the endianness of the machine. What would such an LSB function look like?
Now, to reconstruct the integer (something like this):
int x = buffer[0] | (buffer[1]<<8) | (buffer[2]<<16) | (buffer[3]<<24);
As I said, I want this to work regardless of the endianness of the machine that reads and writes it. Will this work?
The 'LSB' function may be implemented via a macro as below:
#define LSB(x) ((x) & 0xFF)
Provided x is unsigned.
If your C library is POSIX compliant, then you have standard functions available to do exactly what you are trying to code: ntohl, ntohs, htonl, htons (network-to-host long, network-to-host short, ...). That way you don't have to change your code when you compile it for a big-endian or a little-endian architecture. The functions are defined in arpa/inet.h (see http://linux.die.net/man/3/ntohl).
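A brief sketch of fixing the byte order once at the card boundary (my own illustration; it uses this answer's network byte order rather than the questioner's little-endian choice):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

void store_u32(uint32_t host_value, unsigned char card_buf[4])
{
    uint32_t wire = htonl(host_value);  /* host -> network byte order */
    memcpy(card_buf, &wire, sizeof wire);
}

uint32_t load_u32(const unsigned char card_buf[4])
{
    uint32_t wire;
    memcpy(&wire, card_buf, sizeof wire);
    return ntohl(wire);                 /* network -> host byte order */
}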
I think the answer to your question is YES: you can write data on a smart card such that it is universally (and correctly) read by readers of both big-endian AND little-endian orientation. With one big caveat: it would be incumbent on the reader to do the interpretation, not on your smart card to interpret the reader, would it not? That is, as you know, there are many routines to determine endianness (1, 2, 3), but it is the readers that would have to contain code to test endianness, not your card.
Your code example works, but I am not sure it would be necessary given the nature of the problem as it is presented.
By the way, HERE is a related post.

Reading a binary file bit by bit

I know the function below:
size_t fread(void *ptr, size_t size_of_elements, size_t number_of_elements, FILE *a_file);
It only reads byte by byte; my goal is to be able to read 12 bits at a time and then put them into an array. Any help or pointers would be greatly appreciated!
Adding to the first comment, you can try reading one byte at a time (declare a char variable and write there), and then use the bitwise operators >> and << to read bit by bit. Read more here: http://www.cprogramming.com/tutorial/bitwise_operators.html
Many years ago, I wrote some I/O routines in C for a Huffman encoder. This needs to be able to read and write on the granularity of bits rather than bytes. I created functions analogous to read(2) and write(2) that could be asked to (say) read 13 bits from a stream. To encode, for example, bytes would be fed into the coder and variable numbers of bits would emerge the other side. I had a simple structure with a bit pointer into the current byte being read or written. Every time it went off the end, it flushed the completed byte out and reset the pointer to zero. Unfortunately that code is long gone, but it might be an idea to pull apart an open-source Huffman coder and see how the problem was solved there. Similarly, base64 coding takes 3 bytes of data and turns them into 4 (or vice versa).
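A rough sketch of that bit-pointer structure (my own reconstruction, not the original code; MSB-first bit order is an assumption):

#include <stdio.h>

struct bitwriter {
    FILE    *fp;
    unsigned cur;   /* byte currently being assembled */
    int      nbits; /* number of bits already in cur */
};

/* Append the `count` low bits of `value` to the stream, MSB first. */
void put_bits(struct bitwriter *w, unsigned value, int count)
{
    while (count--) {
        w->cur = (w->cur << 1) | ((value >> count) & 1u);
        if (++w->nbits == 8) {        /* byte complete: flush and reset */
            fputc((int)w->cur, w->fp);
            w->cur = 0;
            w->nbits = 0;
        }
    }
}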
I've implemented a couple of methods to read/write files bit by bit. Here they are. Whether it is viable or not for your use case, you have to decide for yourself. I've tried to make the most readable, optimized code I could, not being a seasoned C developer (for now).
Internally, it uses a "bitCursor" to store information about previous bits that don't yet fit a full byte. It has two data fields: d stores data and s stores size, the number of bits stored in the cursor.
You have four functions:
newBitCursor(), which returns a bitCursor object with default values {0,0}. Such a cursor is needed at the beginning of a sequence of read/write operations to or from a file.
fwriteb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which writes the sizeb rightmost bits of the value stored in ptr to fp.
fcloseb(FILE *fp, bitCursor *c), which writes a remaining byte if the previous writes did not exactly encapsulate all data needed to be written; that is probably almost always the case...
freadb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which reads sizeb bits and bitwise-ORs them into *ptr. (It is, therefore, your responsibility to initialize *ptr to 0.)
More info is provided in the comments. Have fun!
Edit: It has come to my knowledge today that when I made that, I assumed little endian! :P Oops! It's always nice to realize how much of a noob I still am ;D.
Edit: GNU's Binary File Descriptors.
Read the first two bytes from your a_file file pointer and check the bits in the least or greatest byte — depending on the endianness of your platform (x86 is little-endian) — using bitshift operators.
You can't really put bits into an array, as there isn't a datatype for bits. Rather than keeping 1's and 0's in an array, which is inefficient, it seems cheaper just to keep the two bytes in a two-element array (say, of type unsigned char *) and write functions to map those two bytes to one of 4096 (2^12) values-of-interest.
As a complication, on subsequent reads, if you want to fread through the pointer every 12 bits, you would read only one byte, using the left-over bits from the previous read to build a new 12-bit value. If there are no leftovers, you would need to read two bytes.
Your mapping functions would need to address the second case where bits were used from previous read, because the two bytes would need different mapping. To do this efficiently, a modulus on a read-counter could be used to swap between two mappings.
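A minimal sketch of that alternating read (my own illustration; MSB-first packing of the 12-bit values is an assumption):

#include <stdio.h>

/* Returns the next 12-bit value, or -1 at EOF. Reads 2 bytes when there are
   no leftover bits, 1 byte when 4 bits remain from the previous call. */
int read12(FILE *f, unsigned *leftover, int *have_leftover)
{
    int b0, b1;
    if (!*have_leftover) {
        if ((b0 = fgetc(f)) == EOF || (b1 = fgetc(f)) == EOF) return -1;
        *leftover = (unsigned)b1 & 0x0Fu;  /* save the low nibble */
        *have_leftover = 1;
        return (b0 << 4) | (b1 >> 4);      /* 8 bits + 4 bits */
    }
    if ((b0 = fgetc(f)) == EOF) return -1;
    *have_leftover = 0;
    return (int)(*leftover << 8) | b0;     /* 4 bits + 8 bits */
}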
Read 2 bytes and apply bitwise operations to extract the 12 bits; from the second read onwards, combine the leftover bits with the newly read bytes, again using bitwise operations, to get back the values you expect.
For your problem you can see this demo program, which reads 2 bytes even though the actual information is only 12 bits wide; this is the kind of thing bitwise access is used for.
fread() is a standard library function whose size argument counts whole bytes, so an exact 12-bit read is not possible. If you create the file yourself, create it as below and read it as below, and it will solve your problem.
If the file is a special file that was not written by you, then follow the standard provided for that file format when reading; I think they write it like this as well. Or post the exact format and I may be able to help.
#include <stdio.h>
#include <stdlib.h>

struct node
{
    int data:12;  /* 12-bit bit-field; note that sizeof(struct node) is still
                     a whole number of bytes (typically 4) */
} NODE;

int main()
{
    FILE *fp;

    fp = fopen("t", "w");
    NODE.data = 1024;
    printf("%d\n", NODE.data);
    fwrite(&NODE, sizeof(NODE), 1, fp);
    NODE.data = 0;
    NODE.data = 2047;  /* a signed 12-bit field holds -2048..2047, so the
                          original value of 2048 would overflow it */
    printf("%d\n", NODE.data);
    fwrite(&NODE, sizeof(NODE), 1, fp);
    fclose(fp);

    fp = fopen("t", "r");
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fclose(fp);
    return 0;
}

Manipulating binary C struct data off line

I have a practical problem to solve. In an embedded system I'm working on, I have a pretty large structure holding system parameters, which includes int, float, and other structures. The structure can be saved to external storage.
I'm thinking of writing a PC software to change certain values inside the binary structure without changing the overall layout of the file. The idea is that one can change 1 or 2 parameter and load it back into memory from external storage. So it's a special purpose hex editor.
The way I figure it: if I can construct a table that contains:
- parameter name
- offset of the parameter in memory
- type
I should be able to change anything I want.
My problem is that it doesn't look easy to figure out the offset of each parameter programmatically. One can always manually printf their addresses, but I'd like to avoid that.
Anybody know a tool that can be of help?
EDIT
The system in question is 32-bit ARM-based, running in little-endian mode. The compiler isn't asked to pack the structure, so for my purposes it's the same as an x86 PC.
The problem is the size of the structure: it contains no fewer than 1000 parameters in total, spread across multiple levels of nested structures. So it's not feasible (or rather, I'm not willing) to write a short piece of code to dump all the offsets by hand. I guess the key problem is parsing the C headers and automatically generating the offsets, or code to dump the offsets. Please suggest such a parsing tool.
Thanks
The way this is normally done is by just sharing the header file that defines the struct with the other (PC) application as well. You should then be able to basically load the bytes and cast them to that kind of struct and edit away.
Before doing this, you need to know about (and possibly handle):
Type sizes. If the two platforms are very different (16-bit vs. 32-bit, etc.), you might have differently sized ints/floats, etc.
Field alignment. If the compiler/target are different, the packing rules for the fields won't be necessarily the same. You can generally force these to be the same with compiler-specific #pragmas.
Endianness. This is something to be careful of in all cases. If your embedded device is a different endianness from the PC target, you'll have to byte swap each field when you load on the PC.
If there are too many differences here to bother with, and you're building something quicker/dirtier, I don't know of tools to generate struct offsets automatically, but as another poster says, you could write some short C code that uses offsetof or equivalent (or does, as you suggest, pointer arithmetic) to dump out a table of fields and offsets.
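A short sketch of that offsetof dump (my own illustration; the struct and field names are hypothetical, and the program must be built with the embedded toolchain's headers so the offsets match the target):

#include <stddef.h>
#include <stdio.h>
/* #include "settings.h"  -- the header that defines struct Settings */

#define DUMP(type, field) \
    printf("%-32s offset %3zu size %2zu\n", #type "." #field, \
           offsetof(type, field), sizeof ((type *)0)->field)

int main(void)
{
    DUMP(struct Settings, setting1);
    DUMP(struct Settings, setting2);
    DUMP(struct Settings, setting3.subSetting1);  /* nested members work too */
    return 0;
}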
I suspect you could parse the output of the pahole tool to get this. It reads the debugging information in an executable and prints out the type, name and offset of each struct member.
Given a fully-defined structure in your code, you can use the offsetof() macro to find out how the fields are laid out. This won't help if you're trying to programmatically determine the layout on the other machine, though.
Rather than writing your own tool, there are existing hex editors that'll take a struct definition and let you edit the values in a file. If you're using Windows, I believe that Hex Workshop can do this, for example.
To make sure I understand the question:
In your embedded system, you have a struct that contains system settings that looks something like this:
struct SubStruct{
    int subSetting1;
    float subSetting2;
};

/* ... some other substructure definitions. ... */

struct Settings{
    float setting1;
    int setting2;
    struct SubStruct setting3;
    /* ... lots of other settings ... */
};
And your problem is you want to take a binary file containing said structure and modify the values on another machine?
Given that you know the sizes of the integral types and the packing used by the embedded system, you may have two options. If the endianness of your embedded system and PC system is different, you will also have to deal with that, e.g. using hton when writing fields.
Option 1:
Declare the same structure in your PC program, with pragmas to force the same byte alignment as on the embedded system, and with any integral types that differ in size between the embedded and PC systems "retyped" to specific sizes, e.g. using int16_t and int32_t if you have stdint.h, or your own typedefs if not.
Option 2:
You may be able to simply make a list of types, e.g. a list that looks like:
Name                    Type
setting1                float
setting2                int
subStruct1SubSetting1   int
subStruct1SubSetting2   float
And then calculate the locations based on the sizes and packing used by the embedded system, e.g. if the embedded system uses 4-byte alignment, 4-byte floats, and 2-byte ints, the above table would calculate out:
Name                    Type    Calculated Offset
setting1                float   0
setting2                int     4
subStruct1SubSetting1   int     8
subStruct1SubSetting2   float   12
or if the embedded system packs the structure to 1-byte alignment (e.g. if it's an 8-bit micro)
Name                    Type    Calculated Offset
setting1                float   0
setting2                int     4
subStruct1SubSetting1   int     6
subStruct1SubSetting2   float   8
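A small helper for that offset calculation (my own illustration):

#include <stddef.h>

/* Pad a running offset up to the next multiple of `align`. */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) / align * align;
}

/* With 4-byte alignment, 4-byte floats and 2-byte ints:
     setting1:              align_up(0, 4)  = 0,  next free offset 4
     setting2:              align_up(4, 4)  = 4,  next free offset 6
     subStruct1SubSetting1: align_up(6, 4)  = 8,  next free offset 10
     subStruct1SubSetting2: align_up(10, 4) = 12                      */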
I'm unsure what debugging capabilities you have on your system. Under Windows I would use a simple debugging script with text conversion on top of it (the debugger can be run in batch mode to generate type information).

Parsing Binary Data in C?

Are there any libraries or guides for how to read and parse binary data in C?
I am looking at some functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a more useable form by the code.
Are there any libraries out there that do this, or even a primer on performing this type of thing?
I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:
Endianness: The architecture you're using right now might be big-endian, but your next target might be little-endian. Or vice-versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.
Alignment: The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, the ethernet header often pushes the IP header itself onto a 2-byte boundary.)
Padding: Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.
Bit Ordering: The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for the issues, then write a set of symmetric pack/parse routines that use the struct as a go-between.
typedef struct _MyProtocolData
{
    Bool   myBitA;  // Using a "Bool" type wastes a lot of space, but it's fast.
    Bool   myBitB;
    Word32 myWord;  // You have a list of base types like Word32, right?
} MyProtocolData;

Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    // Somewhere, your code has to pick out the bits. Best to just do it in one place.
    // (Note the parentheses: & binds more loosely than >>, so they are required.)
    pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    // Endianness and alignment issues go away when you fetch byte-at-a-time.
    // Here, I'm assuming the protocol is big-endian.
    // You could also write a library of "word fetchers" for different sizes and endiannesses.
    pData->myWord  = *(pProtocol + MY_WORD_OFFSET + 0) << 24;
    pData->myWord += *(pProtocol + MY_WORD_OFFSET + 1) << 16;
    pData->myWord += *(pProtocol + MY_WORD_OFFSET + 2) << 8;
    pData->myWord += *(pProtocol + MY_WORD_OFFSET + 3);

    // You could return something useful, like the end of the protocol or an error code.
}

Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    // Exercise for the reader! :)
}
Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
The standard way to do this in C/C++ is really casting to structs, as 'gwaredd' suggested.
It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.
Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it (the victim will understand, it's like stealing food or something...), but do read it.
After reading the Stevens, most of this will make a lot more sense.
Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then produce a "decoder" to parse such packets?
If so, the reference in that field is PADS. A good article introducing it is PADS: A Domain-Specific Language for Processing Ad Hoc Data. PADS is very complete but unfortunately under a non-free licence.
There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:
binpac
PacketTypes
DataScript
If you read French, I summarized these issues in Génération de décodeurs de formats binaires.
In my experience, the best way is to first write a set of primitives, to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness-issues: just make the functions do it right.
Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.
As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine, you could add a layer of custom types to ensure that too, if needed):
const void * read_uint8(const void *buffer, unsigned char *value)
{
    const unsigned char *vptr = buffer;
    *value = *vptr++;  /* read through the typed alias; a void pointer
                          cannot be dereferenced or incremented */
    return vptr;
}
Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste, you could of course return the value and update the pointer by reference. It is a crucial part of the design that the read-function updates the pointer, to make these chainable.
Now, we can write a similar function to read a 16-bit unsigned quantity:
const void * read_uint16(const void *buffer, unsigned short *value)
{
    unsigned char lo, hi;
    buffer = read_uint8(buffer, &hi);
    buffer = read_uint8(buffer, &lo);
    *value = (unsigned short)((hi << 8) | lo);
    return buffer;
}
Here I assumed the incoming data is big-endian; this is common in networking protocols (mainly for historical reasons). You could of course get clever, do some pointer arithmetic, and remove the need for a temporary, but I find this way clearer and easier to understand. Maximal transparency in this kind of primitive can be a good thing when debugging.
The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)
You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.
struct SomeDataFormat
{
    ....
};

SomeDataFormat* pParsedData = (SomeDataFormat*) pBuffer;
Just be wary of endian issues, type sizes, reading off the end of buffers, etc., etc.
Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.
I don't really understand what kind of library you are looking for. A generic library that will take any binary input and parse it into an unknown format? I'm not sure such a library could ever exist in any language. I think you need to elaborate your question a little bit.
Edit:
OK, so after reading Jon's answer it seems there is a library, well, kind of a library; it's more like a code-generation tool. But as many have stated, just casting the data to the appropriate data structure, with appropriate care (i.e. using packed structures and taking care of endian issues), is fine. Using such a tool with C is just overkill.
Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.
To deal with endian issues, network byte order was introduced: common practice is to convert numbers from host byte order to network byte order before sending the data, and to convert back to host order on receipt. See the functions htonl, htons, ntohl and ntohs.
And really consider kervin's advice - read UNP. You won't regret it!
