Serializing strings in C

I'm serializing structs into byte-streams. My method is simple:
pack all ints in little endian order and copy strings including the null terminator. The other side has to statically know how to unpack the byte-stream, there is no additional metadata.
My problem is that I do not know how to handle a NULL pointer.
I need to send something, because there is no additional metadata in the stream.
I considered the following three options:
Send a '\0' and make the receiving side interpret it as NULL in any case
Send a '\0' and make the receiving side interpret it as '\0' in any case (alloc a byte)
Send a special character representing char* str == NULL, e.g. ETX, EOT, EM ?
What do you think?

It looks like you are currently trying to tell the receiving end that the end of the serialized string has been reached by passing it a special character. There are a million cases that can screw you over with this:
What if your struct contains a byte that is equal to that special character? You escape it with another special character. What if your struct contains a byte sequence that is equal to your escape character followed by your special character? You have to check for that too.
Yeah, it's doable, but I don't think it's a very good solution. You'll have to write a parser to look for the escape character, and anyone who takes a look at the code later will spend two hours trying to figure out what's going on.
(tl;dr) Instead... just make the first 32 bits of the serialized string equal to the number of bytes in the string. This only costs 4 bytes per serialization, solves all your problems, you won't have to write a parser or worry about special characters, and will make it a lot easier on the next guy who gets to read through your code!
edit
Thanks to JeremyP I've just realized that I didn't really answer your question. Send one of these guys for every string:
struct s_str
{
bool is_null;
int size;
char* str;
};
If it's null, simply set is_null to true and you don't really have to worry about the other two.
If it's size zero, set is_null to false and size to zero.
If str contains just a '\0', set is_null to false, size to one, and str[0] to '\0'
In my opinion, this might not be the most memory efficient way (you could probably save a byte somewhere somehow) but is definitely quite clear in what you're doing, and again the next guy that comes along will like this a lot more.
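For what it's worth, here is a minimal sketch of how the is_null / size / str idea could look on the wire. The layout (one flag byte, then a 4-byte little-endian length, then the bytes without a terminator) and the names put_str and get_str are assumptions for illustration, not something the question prescribes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wire format: 1 flag byte (1 = NULL pointer, 0 = string present),
   then, only if present, a 4-byte little-endian length and that many bytes
   (no terminator is sent). */

/* Writes the encoding of s into out and returns the number of bytes written.
   The caller must provide at least 5 + strlen(s) bytes. */
size_t put_str(uint8_t *out, const char *s)
{
    if (s == NULL) {
        out[0] = 1;
        return 1;
    }
    uint32_t n = (uint32_t)strlen(s);
    out[0] = 0;
    out[1] = (uint8_t)(n & 0xff);
    out[2] = (uint8_t)((n >> 8) & 0xff);
    out[3] = (uint8_t)((n >> 16) & 0xff);
    out[4] = (uint8_t)((n >> 24) & 0xff);
    memcpy(out + 5, s, n);
    return 5 + (size_t)n;
}

/* Decodes one string from in (avail bytes). On success stores a malloc'ed,
   NUL-terminated copy (or NULL) in *s and returns the bytes consumed.
   Returns 0 on truncated input or allocation failure. */
size_t get_str(const uint8_t *in, size_t avail, char **s)
{
    if (avail < 1) return 0;
    if (in[0] != 0) { *s = NULL; return 1; }
    if (avail < 5) return 0;
    uint32_t n = (uint32_t)in[1] | ((uint32_t)in[2] << 8) |
                 ((uint32_t)in[3] << 16) | ((uint32_t)in[4] << 24);
    if (avail - 5 < n) return 0;
    char *p = malloc((size_t)n + 1);
    if (p == NULL) return 0;
    memcpy(p, in + 5, n);
    p[n] = '\0';
    *s = p;
    return 5 + (size_t)n;
}
```

With this layout a NULL pointer costs one byte, an empty string costs five, and the receiver never has to guess which of the two it got.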

Do not do this. Use a few extra bytes to store the length and put them in front of your string data. The receiving end can check the length to know how much it should read into its local buffer.

It depends on the significance of the pointer in your protocol.
If the pointer is significant, i.e. it is needed for the recipient to know how to rebuild the struct, then you need to send something. It could be either a byte with 0/non-zero to indicate existence, or an integer that indicates the number of bytes pointed to by the pointer.
Example:
struct Foo {
    int *arr;
    char *text;
};
Struct Foo could be serialized like this:
<arr length>< arr ><text length>< text >
4 bytes n bytes 4 bytes n bytes

I would advise you to use an existing library for serialization. I can think of two at the moment: tpl and gwlib's gwser.
About tpl:
You can use tpl to store and reload your C data quickly and easily.
Tpl works with files, memory buffers and file descriptors so it's
suitable for use as a file format, IPC message format or any scenario
where you need to store and retrieve your data.
About gwlib, see the link, it's not very verbose, but it provides a few usage examples.

Related

How to encode the length of the string at the front of the string

Imagine I want to design a protocol: I want to send a packet from the client side to the server side, so I need to encode my data.
I have a string, and I want to add the length of the string at the front of the string, for example:
string: "my"
whose length is 2
So what I expect is to create a char[] in C and store | 2 | my | in the buffer
In this way, after the server receives the packet, it will know how many bytes need to be read for this request. (by using C programming)
I tried to do it, but I don't know how to control the gap between the length and the string. I can create a buffer of size 10 and use sprintf() to convert the length of the string and add it to the buffer.
One poor way to do it is to encode the length in ASCII at the front of the string. The downside is that you'll need a variable number of char elements to store the length if you ever want to send anything longer than 9 chars.
A better way to encode the string's length, since you are designing your own protocol, is to reserve a fixed number of bytes at the beginning, say 8 bytes, and cast the address of the first array element to a pointer to uint64_t. Basically, use array[0] through array[7] to store an 8-byte unsigned integer. Align the address on an 8-byte boundary for (slightly) better performance.
If the sender and receiver machines have different endianness, you'll also have to include a multi-byte "magic number" at the head of the char array. This is necessary for both sides to correctly recover the string length from the multi-byte length field.
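An alternative to the cast plus magic number, sketched here under the assumption that the protocol simply declares the length field to be big-endian, is to write and read the 8 bytes one at a time with shifts. That way neither side depends on its own byte order or on alignment. The function names pack_with_length and unpack_length are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs an 8-byte big-endian length followed by the string bytes into buf.
   Returns the total number of bytes written, or 0 if buf (cap bytes) is too small. */
size_t pack_with_length(uint8_t *buf, size_t cap, const char *s)
{
    size_t n = strlen(s);
    if (cap < 8 || cap - 8 < n) return 0;
    uint64_t len = (uint64_t)n;
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t)(len >> (56 - 8 * i));   /* most significant byte first */
    memcpy(buf + 8, s, n);
    return 8 + n;
}

/* Reads the 8-byte big-endian length back out of a received buffer. */
uint64_t unpack_length(const uint8_t *buf)
{
    uint64_t len = 0;
    for (int i = 0; i < 8; i++)
        len = (len << 8) | buf[i];
    return len;
}
```

For "my" this produces seven zero bytes, then 2, then 'm' and 'y', which is the | 2 | my | layout from the question with a fixed-width length field.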
There are two standards used in C:
str*: char * which is terminated with a '\0'.
mem*, read/write: void * plus a length size_t. It's the same idea for readv() and writev(), but there the two variables are bundled into an array of struct iovec. Note that sizeof(size_t) may differ between sender and receiver.
If you use anything else it's automatically a learning curve for whoever needs to read or interact with your code. I wouldn't do that trade-off, but you do you.
You can, of course, encode the length into the char * but now you have to think about how you encode it (big vs little endian), fixed vs variable size.
You might be interested in SDS, which hides the length. This way you only have to reimplement the functions that change the length of the string instead of all string functions. Use an existing library.

Disadvantages of strlen() in Embedded Domain

What are the disadvantages of using strlen()?
If a NULL character sometimes arrives inside a string during TCP communication, we only get the length of the string up to that NULL character.
We can't find the actual length of the string.
If we write an alternative to strlen(), it also stops at the NULL character. So which method can I use to find the length of a string in C?
To read from a "TCP Communication" you are probably using read. The prototype for read is
ssize_t read(int fildes, void *buf, size_t nbyte);
and the return value is the number of bytes read (including any bytes whose value is 0).
So, let's say you're about to read 10 bytes, all of which are 0. You have an array that is more than large enough to hold all the data:
int fildes;
char data[1000];
ssize_t nbytes;
// fildes = TCPConnection
nbytes = read(fildes, data, sizeof data);
Now, by inspecting nbytes you know you have read 10 bytes. If you check data[0] through data[9] you will find they are all 0.
If the runtime library provides strcpy() and strcat(), then surely it provides strlen().
I suspect you are confusing NULL, an invalid pointer value, with NUL, the ASCII code for a zero character value, which marks the end of a string for many C runtime functions.
As such, there is no concern for inserting a NUL value in a string, nor in detecting it.
Response to updated question:
Since you seem to be processing binary data, the string functions are not a good fit—unless you can guarantee there are no NULs in the stream. However, for this reason, most TCP/IP messages use headers with fields containing the number of bytes which follow.
"Embedded" strikes me as a red herring here.
If you're processing binary data where an embedded NUL might be valid, then you can't expect meaningful results from strlen.
If you're processing strings (as that term is defined in C -- a block of non-NUL data terminated by a NUL) then you can use strlen just fine.
A system being "embedded" would affect this only to the degree that it might be less common to process strings and more common to process binary data.
It is safer to use strnlen instead of strlen to avoid the problems with strlen. The strlen problems are present everywhere, not just in embedded systems. Many of the string functions are dangerous because they run forever or until a zero is hit or, like scanf or strtok, until a pattern is hit.
Remember that TCP is a stream, not a packet protocol: you may have to wait for several packets and piece the data together before you can attempt to call it a string anyway. That assumes the payload is an ASCIIZ string; if it is raw data, don't use string functions, use some other solution.
Yes, strlen() uses a terminating character \0 aka NUL. Most str* functions do so. There could be a risk that data coming from files/command line/sockets would not contain this character (usually, they won't: they'll be \n-terminated), but their size will also be provided by the read()/recv() function you've used. If that's a concern, you can always use a buffer slightly larger than what you declared to those functions, e.g.
char mybuf[256+4];
mybuf[256]=0;
ssize_t reallen=read(STDIN_FILENO, mybuf, 256); // read() returns the byte count; fgets() would return a pointer
if (reallen>=0) mybuf[reallen]=0; // terminate right after the received bytes
// we've got a 0-terminated string in mybuf.
If your data must not contain \0, compare strlen(mybuf) with reallen and terminate the session with an error code if they differ.
If your data may contain 0, then it should be processed as a buffer and not as a string. Size must be kept aside, and memcpy / memcmp functions should be used instead of strcpy and strcmp.
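As a rough sketch of the "buffer, not string" handling described above: the two helpers below (has_embedded_nul and buf_equal are names invented for this example) only ever look at the byte count you got from read()/recv(), never at a terminator.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first len bytes of buf contain a NUL byte, else 0. */
int has_embedded_nul(const void *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}

/* Compares two length-tagged buffers for equality without relying on a terminator. */
int buf_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

A protocol that expects text could reject a message when has_embedded_nul() returns 1, while a binary protocol would skip that check and keep passing the length around.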
Also, your network protocol should be very explicit about whether strings or binary data are expected in different parts of the communication. HTTP is, for instance, and it provides several ways to tell the actual size of the transmitted payload.
This isn't specific to "embedded" programs, but it has become a major concern in every program to ensure that no remote code/script injection can occur. If by "embedded" you mean you're in a non-preemptive environment and have only limited time available to perform some action, then yes, you don't want to end up scanning 2GB of incoming bits for a (never-appearing) \0. Either the above trick or strnlen (mentioned in another answer) could be used to ensure this isn't the case.

Search and replace string and ignore null byte character

I'm working on a C program that uses NFQUEUE to filter traffic for another application. One of the things I need to do is replace a string contained within a packet, with another string.
The problem is, the packets seem to contain the null terminator byte randomly (in the middle of the string). This means that most solutions I see, using strstr(), don't work. I need to find something similar that doesn't stop upon reaching a null terminator byte, but rather allows for a length to be specified and uses that instead. (nfq_get_payload() returns a length.)
I've looked at replacing the null bytes with another byte before performing the replace, and then restoring the null bytes before the packet is sent off. The problem with that approach is there's a chance the packet could contain the character, so that wouldn't be the best approach. I suppose I could also find a random byte that is not contained within the packet, but I'd rather avoid doing all that.
edit: Both the original string and replacement string are the same length, which is 13 characters.
You might be satisfied with memchr if finding one character can work for you. Otherwise, you would have to make a memmem implementation yourself or find one online.
Be aware that string-searching algorithms (because that's what memmem is) can have a wide range of performance characteristics, so you want to find one based on a performant algorithm (e.g. this one looks acceptable, but your mileage may vary).
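Since the original and replacement strings have the same length here, a naive length-based scan may already be enough. The sketch below is one way to do it with memcmp and memcpy. It is a simple O(len * n) loop, not a tuned search algorithm, and the name replace_same_len is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Replaces every occurrence of `from` with `to` (both n bytes long) inside
   buf[0..len). NUL bytes are treated like any other byte.
   Returns the number of replacements made. */
size_t replace_same_len(unsigned char *buf, size_t len,
                        const void *from, const void *to, size_t n)
{
    size_t count = 0;
    if (n == 0 || len < n) return 0;
    for (size_t i = 0; i + n <= len; ) {
        if (memcmp(buf + i, from, n) == 0) {
            memcpy(buf + i, to, n);
            i += n;      /* skip past the replaced bytes */
            count++;
        } else {
            i++;
        }
    }
    return count;
}
```

You would call it with the payload pointer and the length that nfq_get_payload() gave you, so the packet size never changes.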

Socket data length questions

I have a couple of questions related to the following code:
char buffer[256];
memset(buffer,0,256);
read(socket_fd,buffer,255);
The questions:
Why do I read 255 and not 256?
Let's say I want to send the word "Cool" from the client to the server. How many bytes should I write in the client, and how many bytes should I read in the server?
I'm really confused.
You already have good answers here, but I think there's a concept we should explain.
When you send data through streams (that is, something that writes a number of bytes from one end, and those bytes can be read in the same order in the other end), you almost always want to know when to stop reading. This is mandatory if you'll send more than one thing: when does the first message stop, and the second begin? In the stream, things get mixed up.
So, how do we delimit messages? There are three simple ways (and many other not so simple ones, of course):
1 Fixed-length messages:
If you know beforehand that every message is, say, 10-bytes long, then you don't have a problem. You just read 10 bytes, and the 11th one will be part of another message. This is very simple, but also very rigid.
2 Delimiting characters, or strings:
If you are sending human-readable text, you might delimit your messages the same way you delimit strings in your char*: putting a 0 character at the end. That way, when you read a 0, you know the message ended and any remaining data in the stream belongs to another message.
This is okay for ascii text, but when it comes to arbitrary data it's also somewhat rigid: there's a character, or a sequence of characters, that your messages can't contain (or your program will get confused as to where a message ends).
3 Message headers:
This is the best approach for arbitrary-length, arbitrary-content messages. Before sending any actual message data, send a fixed-length header (or use technique number 2 to mark the end of the header) specifying metadata about your message, for example its length.
Say you want to send the message 'Cool', as you said. Well, first send a byte (or a 2-byte short, or a 4-byte integer) containing the value 4, the length of the message, and receive it on the other end. You know that before any message arrives, you must read 1 byte, store that somewhere and then read the remaining specified bytes.
A simplified example:
struct mheader {
    int length;
};
// (...)
struct mheader in_h;
read(fd, &in_h, sizeof(struct mheader));
if (in_h.length > 0) {
    read(fd, buffer, in_h.length);
}
In actual use, remember that read doesn't always read the exact amount of bytes you request. Check the return value to find out (which could be negative to indicate errors), and read again if necessary.
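To make the "read again if necessary" part concrete, here is a minimal sketch of a loop that keeps calling read() until it has the requested number of bytes. It assumes the 1-byte length header from the 'Cool' example; read_full and read_message are hypothetical helper names.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Reads exactly n bytes from fd unless EOF or an error occurs.
   Returns n on success, a smaller count on EOF, -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r < 0) {
            if (errno == EINTR) continue;   /* interrupted, just retry */
            return -1;
        }
        if (r == 0) break;                  /* peer closed the stream */
        got += (size_t)r;
    }
    return (ssize_t)got;
}

/* Assumed framing: 1 length byte followed by that many payload bytes.
   Returns the payload length, or -1 on error, short message or too-small buffer. */
int read_message(int fd, unsigned char *buf, size_t cap)
{
    uint8_t len;
    if (read_full(fd, &len, 1) != 1) return -1;
    if (len > cap) return -1;
    if (read_full(fd, buf, len) != (ssize_t)len) return -1;
    return len;
}
```

Note that when the buffer is too small this sketch gives up without draining the payload, so a real program would have to close the connection or skip those bytes to stay in sync.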
Hope this helps. Good luck!
So that the buffer retains the NUL at the end, as extra insurance against string overflows. Reading 256 would allow it to get overwritten.
You would write five bytes. Either write "Cool\0", or write 4 (the length) followed by the 4 characters in "Cool". Read all of it, and figure out the length after.
You look at the return value from read(); it tells you how many bytes were read.
You use the number of bytes read when you want to write the same data.
You don't have to use 255 in the read unless you definitely want to be able to put a NUL at the end - but since you know how many bytes were read, you won't go beyond that anyway. So, the 255 is an insurance policy against carelessness by the programmer.
The memset() is likewise mostly an insurance policy against carelessness by the programmer - it is not really necessary, unless you want to mask out previous sensitive data.

Passing variable-length structures between MPI processes

I need to MPI_Gatherv() a number of int/string pairs. Let's say each pair looks like this:
struct Pair {
int x;
unsigned s_len;
char s[1]; // variable-length string of s_len chars
};
How to define an appropriate MPI datatype for Pair?
In short, it's theoretically impossible to send one message of variable size and receive it into a buffer of the perfect size. You'll either have to send a first message with the sizes of each string and then a second message with the strings themselves, or encode that metainfo into the payload and use a static receiving buffer.
If you must send only one message, then I'd forgo defining a datatype for Pair: instead, I'd create a datatype for the entire payload and dump all the data into one contiguous, untyped package. Then at the receiving end you could iterate over it, allocating the exact amount of space necessary for each string and filling it up. Let me whip up an ASCII diagram to illustrate. This would be your payload:
|..x1..|..s_len1..|....string1....|..x2..|..s_len2..|.string2.|..x3..|..s_len3..|.......string3.......|...
You send the whole thing as one unit (e.g. an array of MPI_BYTE), then the receiver would unpack it something like this:
while (buffer is not empty)
{
read x;
read s_len;
allocate s_len characters;
move s_len characters from buffer to allocated space;
}
Note however that this solution only works if the data representation of integers and chars is the same on the sending and receiving systems.
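The unpacking loop above could be sketched in plain C roughly as follows. It leaves MPI out entirely and only shows how one record is pulled out of the received byte buffer. The struct Item and the function unpack_item are names invented for this example, and, as noted, it assumes both sides use the same representation for int and unsigned.

```c
#include <stdlib.h>
#include <string.h>

struct Item {
    int x;
    unsigned s_len;
    char *s;          /* malloc'ed, NUL-terminated copy of the string */
};

/* Unpacks one record (int x, unsigned s_len, s_len chars) starting at *pos in buf.
   Advances *pos past the record. Returns 0 on success, -1 if the buffer is
   truncated or malloc fails. */
int unpack_item(const unsigned char *buf, size_t len, size_t *pos, struct Item *out)
{
    size_t p = *pos;
    size_t hdr = sizeof(int) + sizeof(unsigned);
    if (p > len || len - p < hdr) return -1;
    memcpy(&out->x, buf + p, sizeof(int));            /* memcpy avoids misaligned reads */
    memcpy(&out->s_len, buf + p + sizeof(int), sizeof(unsigned));
    p += hdr;
    if (len - p < out->s_len) return -1;
    out->s = malloc((size_t)out->s_len + 1);
    if (out->s == NULL) return -1;
    memcpy(out->s, buf + p, out->s_len);
    out->s[out->s_len] = '\0';
    *pos = p + out->s_len;
    return 0;
}
```

The receiver would call unpack_item in a loop while *pos is smaller than the number of bytes received, freeing each s when it is done with it.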
I don't think you can do quite what you want with MPI. I'm a Fortran programmer, so bear with me if my understanding of C is a little shaky. You want, it seems, to pass a data structure consisting of 1 int and 1 string (which you pass by passing the location of the first character in the string) from one process to another? I think that what you are going to have to do is pass a fixed-length string, which would therefore have to be as long as the longest string you really want to pass. The reception area for the gathering of these strings will have to be large enough to receive all the strings together with their lengths.
You'll probably want to declare a new MPI datatype for your structs; you can then gather these and, since the gathered data includes the length of the string, recover the useful parts of the string at the receiver.
I'm not certain about this, but I've never come across truly variable message lengths as you seem to want to use, and it does sort of feel un-MPI-like. It may be something implemented in the latest version of MPI that I've just never stumbled across, though looking at the documentation online it doesn't seem so.
MPI implementations do not inspect or interpret the actual contents of a message. Provided that you know the size of the data structure, you can represent that size in some number of char's or int's. The MPI implementation will not know or care about the actual internal details of the data.
There are a few caveats...both the sender and receiver need to agree on the interpretation of the message contents, and the buffer that you provide on the sending and receiving side needs to fit into some definable number of char's or int's.

Resources