Passing variable-length structures between MPI processes - c

I need to MPI_Gatherv() a number of int/string pairs. Let's say each pair looks like this:
struct Pair {
int x;
unsigned s_len;
char s[1]; // variable-length string of s_len chars
};
How to define an appropriate MPI datatype for Pair?

In short, it's theoretically impossible to send one message of variable size and receive it into a buffer of the perfect size. You'll either have to send a first message with the sizes of each string and then a second message with the strings themselves, or encode that metainfo into the payload and use a static receiving buffer.
If you must send only one message, then I'd forgo defining a datatype for Pair: instead, I'd create a datatype for the entire payload and dump all the data into one contiguous, untyped package. Then at the receiving end you could iterate over it, allocating the exact amount of space necessary for each string and filling it up. Let me whip up an ASCII diagram to illustrate. This would be your payload:
|..x1..|..s_len1..|....string1....|..x2..|..s_len2..|.string2.|..x3..|..s_len3..|.......string3.......|...
You send the whole thing as one unit (e.g. an array of MPI_BYTE), then the receiver would unpack it something like this:
while (buffer is not empty)
{
read x;
read s_len;
allocate s_len characters;
move s_len characters from buffer to allocated space;
}
Note however that this solution only works if the data representation of integers and chars is the same on the sending and receiving systems.
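The unpacking loop above could look roughly like this in C. This is only a sketch under the same assumption as the note above (identical int and unsigned representation on both sides); the names UnpackedPair, pack_pair and unpack_pair are made up for illustration and are not part of MPI. The buffer itself would be what you send and receive as MPI_BYTE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One unpacked record; s is heap-allocated and NUL-terminated. */
struct UnpackedPair {
    int x;
    unsigned s_len;
    char *s;
};

/* Append |x|s_len|string| to buf and return the number of bytes written.
 * The caller must guarantee that buf has enough room. */
size_t pack_pair(unsigned char *buf, int x, const char *s)
{
    assert(buf != NULL && s != NULL);
    unsigned s_len = (unsigned)strlen(s);
    memcpy(buf, &x, sizeof x);
    memcpy(buf + sizeof x, &s_len, sizeof s_len);
    memcpy(buf + sizeof x + sizeof s_len, s, s_len);
    return sizeof x + sizeof s_len + s_len;
}

/* Read one record from the front of buf (len bytes available).
 * Returns the number of bytes consumed, or 0 if the buffer is
 * truncated or the allocation fails. */
size_t unpack_pair(const unsigned char *buf, size_t len, struct UnpackedPair *out)
{
    assert(buf != NULL && out != NULL);
    const size_t header = sizeof out->x + sizeof out->s_len;
    if (len < header)
        return 0;
    /* memcpy avoids unaligned reads from the byte buffer */
    memcpy(&out->x, buf, sizeof out->x);
    memcpy(&out->s_len, buf + sizeof out->x, sizeof out->s_len);
    if (len - header < out->s_len)
        return 0;
    out->s = malloc((size_t)out->s_len + 1);  /* exact size plus terminator */
    if (out->s == NULL)
        return 0;
    memcpy(out->s, buf + header, out->s_len);
    out->s[out->s_len] = '\0';
    return header + out->s_len;
}
```

The receiver would call unpack_pair repeatedly, advancing the buffer pointer by the returned count until the count is 0 or the buffer is exhausted, and free() each s when done.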

I don't think you can do quite what you want with MPI. I'm a Fortran programmer, so bear with me if my understanding of C is a little shaky. You want, it seems, to pass a data structure consisting of 1 int and 1 string (which you pass by passing the location of the first character in the string) from one process to another? I think that what you are going to have to do is pass a fixed-length string, which would therefore have to be as long as the longest string you really want to pass. The reception area for the gathering of these strings will have to be large enough to receive all the strings together with their lengths.
You'll probably want to declare a new MPI datatype for your structs; you can then gather these and, since the gathered data includes the length of the string, recover the useful parts of the string at the receiver.
I'm not certain about this, but I've never come across truly variable message lengths as you seem to want to use, and it does sort of feel un-MPI-like. It may be something implemented in the latest version of MPI that I've just never stumbled across, though looking at the documentation online it doesn't seem so.

MPI implementations do not inspect or interpret the actual contents of a message. Provided that you know the size of the data structure, you can represent that size in some number of chars or ints. The MPI implementation will not know or care about the actual internal details of the data.
There are a few caveats: both the sender and receiver need to agree on the interpretation of the message contents, and the buffer that you provide on the sending and receiving side needs to fit into some definable number of chars or ints.

Related

How to encode the length of the string at the front of the string

Imagine I want to design a protocol: I want to send a packet from the client side to the server side, so I need to encode my data.
I have a string, and I want to add the length of the string at the front of the string, for example:
string: "my"
whose length is 2
So what I expect is to create a char[] in C and store | 2 | my | in the buffer
In this way, after the server receives the packet, it will know how many bytes need to be read for this request. (by using C programming)
I tried to do it, but I don't know how to control the spacing between the length and the string. I can create a buffer of size 10 and use sprintf() to convert the length of the string and add it into the buffer.
One poor way to do it is to encode the length in ASCII at the front of the string. The downside is that you'll need a variable number of char elements to store the length if you ever want to send anything longer than 9 chars.
A better way to encode the string's length, since you are designing your own protocol, is to allocate a fixed number of bytes at the beginning, say 8 bytes, and treat them as a uint64_t (for example by casting &array[0] to a pointer to uint64_t). Basically, use array[0~7] to store an 8-byte unsigned integer. Align the address on an 8-byte boundary, both for (slightly) better performance and because a misaligned uint64_t access is not guaranteed to work on every platform.
If the sender and receiver machines have different endianness, you'll also have to include a multi-byte “magic number” at the head of the char array. This is necessary for both sides to correctly recover the string length from the multi-byte length field.
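As a rough sketch of the fixed-width length prefix, the following writes the 8 length bytes one at a time in a fixed (big-endian) order instead of casting the buffer to uint64_t *. That is a substitution for the cast-plus-magic-number approach described above, chosen here because it sidesteps both alignment and endianness; encode_string, decode_length and LEN_PREFIX_BYTES are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LEN_PREFIX_BYTES 8  /* assumed width of the length field */

/* Write an 8-byte big-endian length followed by the string bytes
 * (without the terminating '\0'). Returns the total number of bytes
 * written, or 0 if out_cap is too small. */
size_t encode_string(unsigned char *out, size_t out_cap, const char *s)
{
    assert(out != NULL && s != NULL);
    uint64_t len = strlen(s);
    if (out_cap < LEN_PREFIX_BYTES || out_cap - LEN_PREFIX_BYTES < len)
        return 0;
    for (int i = 0; i < LEN_PREFIX_BYTES; i++)
        out[i] = (unsigned char)(len >> (8 * (LEN_PREFIX_BYTES - 1 - i)));
    memcpy(out + LEN_PREFIX_BYTES, s, (size_t)len);
    return LEN_PREFIX_BYTES + (size_t)len;
}

/* Recover the length from the first 8 bytes of a received buffer. */
uint64_t decode_length(const unsigned char *in)
{
    assert(in != NULL);
    uint64_t len = 0;
    for (int i = 0; i < LEN_PREFIX_BYTES; i++)
        len = (len << 8) | in[i];
    return len;
}
```

For the string "my" this produces 10 bytes: seven zero bytes, a byte with value 2, then 'm' and 'y'. The server reads 8 bytes first, calls decode_length, and then knows exactly how many more bytes to read.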
There are two standards used in C:
str*: char * which is terminated with a '\0'.
mem*, read/write: void * plus a length size_t. It's the same idea for readv() and writev(), but there the two variables are bundled into an array of struct iovec. Note that sizeof(size_t) may differ between sender and receiver.
If you use anything else it's automatically a learning curve for whoever needs to read or interact with your code. I wouldn't do that trade-off, but you do you.
You can, of course, encode the length into the char * but now you have to think about how you encode it (big vs little endian), fixed vs variable size.
You might be interested in SDS, which hides the length. This way you only have to reimplement the functions that change the length of the string instead of all string functions. Use an existing library.

Difficulties understanding how to take elements from a file and store them in C

I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
"Every input item read from the file should be treated as a stream of characters (string), you can
use the function atof() to convert a string value into a floating point number (invalid data can be
set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest to an actual string datatype which you find in C is a sequence of chars which is terminated by a '\0' value. That is used for most things which you'd expect to do with strings.
Storing them just requires sufficient memory, as offered by a sufficiently large array of char or by malloc().
I think the requirements of your assignment would be met by making a char array as buffer, then reading in with fgets(), making sure to not read more than fits into your array and making sure that there is a '\0' at the end.
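Then you can use atof() on the content of the array and handle corrupted input yourself. Be aware that atof() cannot report failure: it returns 0.0 for input it cannot parse, and it silently accepts a valid prefix (the "38" in "38.a"). That is why I would prefer sscanf(), which gives better feedback via a separate return value.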
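A minimal sketch of the validation step follows. It uses strtod() rather than atof() or sscanf(), which is a swapped-in choice: strtod() reports where parsing stopped, so trailing garbage can be detected. The sentinel CORRUPT_TEMP and the function name parse_temperature are assumptions made for this example, not part of the assignment.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical sentinel: lower than any temperature expected in the file. */
#define CORRUPT_TEMP (-1000.0)

/* Convert one token (e.g. "37.8" or "38.a") to a double.
 * Returns CORRUPT_TEMP if the token is not entirely a number. */
double parse_temperature(const char *token)
{
    assert(token != NULL);
    char *end;
    double value = strtod(token, &end);
    if (end == token)              /* no digits were consumed at all */
        return CORRUPT_TEMP;
    while (isspace((unsigned char)*end))
        end++;                     /* allow trailing whitespace or newline */
    if (*end != '\0')              /* leftover characters: corrupt entry */
        return CORRUPT_TEMP;
    return value;
}
```

Each line read with fgets() could be split on commas (for instance with strtok()), and every token passed through parse_temperature() before being stored in an array of double for later queries.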

Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents to their whim; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
"Then, if the N-th character is '\0', N - 1 would signify the string length."
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use):
#include <string.h>

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
    char* delim = strchr(keyval, '=');
    if (delim) {
        /* strlen on an interior pointer: length of the value part only */
        if (plen) *plen = (int)strlen(delim + 1);
        *delim = 0;  /* truncate the original string so it holds just the key */
        return delim + 1;
    } else {
        return NULL;
    }
}
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
1. You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
2. You'd have to stop people, when writing for the system, deliberately or automatically treating the associated character arrays as ‘ordinary’ C strings, and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
size_t len;
char* buf;
} String;
size_t my_strlen(String*);
...
might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
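To make the idea concrete, here is a small sketch of such a type with a constant-time my_strlen and an append that never scans for a terminator. Only String and my_strlen come from the snippet above (with a const added to the parameter); string_init, my_strcat and string_free are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;   /* number of characters, not counting the terminator */
    char *buf;    /* heap buffer, kept NUL-terminated for convenience */
} String;

/* Build a String from an ordinary C string. Returns 0 on success. */
int string_init(String *s, const char *cstr)
{
    assert(s != NULL && cstr != NULL);
    s->len = strlen(cstr);               /* the only scan we ever do */
    s->buf = malloc(s->len + 1);
    if (s->buf == NULL) {
        s->len = 0;
        return -1;
    }
    memcpy(s->buf, cstr, s->len + 1);
    return 0;
}

/* O(1): the length is stored, not searched for. */
size_t my_strlen(const String *s)
{
    return s->len;
}

/* Append src to dst. dst and src must be distinct objects.
 * Returns 0 on success, -1 if the reallocation fails. */
int my_strcat(String *dst, const String *src)
{
    assert(dst != NULL && src != NULL && dst != src);
    char *grown = realloc(dst->buf, dst->len + src->len + 1);
    if (grown == NULL)
        return -1;
    memcpy(grown + dst->len, src->buf, src->len + 1);
    dst->buf = grown;
    dst->len += src->len;
    return 0;
}

void string_free(String *s)
{
    free(s->buf);
    s->buf = NULL;
    s->len = 0;
}
```

Because String is a distinct struct type, passing one to strlen() or strcpy() by mistake fails to compile, which is the type-checking benefit mentioned above.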
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
char *data;
size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

Serializing strings in C

I'm serializing structs into byte-streams. My method is simple:
pack all ints in little endian order and copy strings including the null terminator. The other side has to statically know how to unpack the byte-stream, there is no additional metadata.
My problem is that I do not know how to handle the NULL pointer.
I need to send something, because there is no additional metadata in the stream.
I considered the following options:
Send a '\0' and make the receiving side interpret it as NULL in any case
Send a '\0' and make the receiving side interpret it as '\0' in any case (alloc a byte)
Send a special character representing char* str == NULL, e.g. ETX, EOT, EM ?
What do you think?
It looks like you are currently trying to tell the receiving end that the end of the serialized string has been reached by passing it a special character. There are a million cases that can screw you over with this:
What if your struct contains a byte that is equal to that special character? Escape it with another special character. What if your struct contains a byte sequence that is equal to your escape character followed by your special character? Check for that too?
Yeah it's doable, but I think that's not a very good solution and you'll have to write a parser to look for the escape character and then anyone who takes a look at the code later will spend two hours trying to figure out what's going on.
(tl;dr) Instead... just make the first 32 bits of the serialized string equal to the number of bytes in the string. This only costs 4 bytes per serialization, solves all your problems, you won't have to write a parser or worry about special characters, and will make it a lot easier on the next guy who gets to read through your code!
edit
Thanks to JeremyP I've just realized that I didn't really answer your question. Send one of these guys for every string:
struct s_str
{
bool is_null;
int size;
char* str;
};
If it's null, simply set is_null to true and you don't really have to worry about the other two.
If it's size zero, set is_null to false and size to zero.
If str contains just a '\0', set is_null to false, size to one, and str[0] to '\0'
In my opinion, this might not be the most memory efficient way (you could probably save a byte somewhere somehow) but is definitely quite clear in what you're doing, and again the next guy that comes along will like this a lot more.
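One way the is_null/size/str idea could be laid out on the wire is sketched below. The exact layout is an assumption made for this example: one flag byte, then a 4-byte little-endian size (matching the little-endian ints mentioned in the question) that does not count the terminator, then the bytes. The size and data are simply omitted for a NULL pointer. write_nullable_str and read_nullable_str are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize str into out. Returns bytes written, or 0 if it does not fit. */
size_t write_nullable_str(unsigned char *out, size_t out_cap, const char *str)
{
    assert(out != NULL);
    if (str == NULL) {
        if (out_cap < 1)
            return 0;
        out[0] = 1;                        /* is_null = true, nothing follows */
        return 1;
    }
    size_t len = strlen(str);
    if (len > UINT32_MAX || out_cap < 5 || out_cap - 5 < len)
        return 0;
    out[0] = 0;                            /* is_null = false */
    for (int i = 0; i < 4; i++)            /* 32-bit size, little-endian */
        out[1 + i] = (unsigned char)((uint32_t)len >> (8 * i));
    memcpy(out + 5, str, len);
    return 5 + len;
}

/* Parse one field. Returns bytes consumed, or 0 if the input is truncated.
 * *data is NULL for a serialized NULL pointer; otherwise it points into
 * `in` at *size bytes of character data (not NUL-terminated). */
size_t read_nullable_str(const unsigned char *in, size_t in_len,
                         const unsigned char **data, uint32_t *size)
{
    assert(in != NULL && data != NULL && size != NULL);
    *data = NULL;
    *size = 0;
    if (in_len < 1)
        return 0;
    if (in[0] != 0)                        /* NULL pointer was sent */
        return 1;
    if (in_len < 5)
        return 0;
    uint32_t len = 0;
    for (int i = 0; i < 4; i++)
        len |= (uint32_t)in[1 + i] << (8 * i);
    if (in_len - 5 < len)
        return 0;
    *data = in + 5;
    *size = len;
    return 5 + (size_t)len;
}
```

With this layout a NULL pointer costs 1 byte and an empty string costs 5, so the receiver can always tell the two apart.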
Do not do this. Use some extra bytes to store the length and concatenate them with your data string. The receiving end can check the length to know how much it should read into its local buffer.
It depends on the significance of the pointer in your protocol.
If the pointer is significant, i.e. it is needed for the recipient to know how to rebuild the struct, then you need to send something. It could be either a byte with 0/non-zero to indicate existence, or an integer that indicates the number of bytes pointed to by the pointer.
Example:
struct Foo {
    int *arr;
    char *text;
};
Struct Foo could be serialized like this:
<arr length>< arr ><text length>< text >
  4 bytes   n bytes   4 bytes   n bytes
I would advise you to use an existing library for serialization. I can think of two at the moment: tpl and gwlib's gwser.
About tpl:
You can use tpl to store and reload your C data quickly and easily.
Tpl works with files, memory buffers and file descriptors so it's
suitable for use as a file format, IPC message format or any scenario
where you need to store and retrieve your data.
About gwlib, see the link, it's not very verbose, but it provides a few usage examples.

C sending multiple data types using sendto

In my program I have a few structs and a char array that I want to send as a single entity over UDP.
I am struggling to think of a good way to do this.
My first thought was to create a structure which contains everything I want to send but it would be of the wrong type for using sendto()
How would I store the two structs and a char array in another array so that it will be received in the way I intended?
Thanks
Since C allows you to cast to your heart's content, there's no such thing as a wrong type for sendto(). You simply cast the address of your struct to a void * and pass that as the argument to sendto().
However, a lot of people will impress on you that it's not advisable to send structs this way in the first place:
If the programs on either side of the connection are compiled by different compilers or in different environments, chances are your structs will not have the same packing.
If the two hosts involved in the transfer don't have the same endianness, part of your data will end up backwards.
If the host architectures differ (e.g. 32 bit vs. 64 bits) then sizes of structs may be off as well. Certainly there will be size discrepancies if the sizes of your basic data types (int, char, long, double, etc.) differ.
So... Please take the advice of the first paragraph only if you're sure your two hosts are identical twins, or close enough to it.
In other cases, consider converting your data to some kind of neutral text representation, which could be XML but doesn't need to be anything that complicated. Strings are sent as a sequence of bytes, and there's much less that can go wrong. Since you control the format, you should be able to parse that stuff with little trouble on the receiving side.
Update
You mention that you're transferring mostly bit fields. That means that your data essentially consists of a bunch of integers, all of them less than (I'm assuming) 32 bits.
My suggestion for a "clean" solution, therefore, would be to write a function to unpack all those bit fields, and to ship the whole works as an array of (perhaps unsigned) integers. Assuming that the integer width is the same across machines (a fixed-width type such as uint32_t guarantees this), htonl() will work on the elements (each individually!) of those arrays; note that htons() only handles 16-bit values. You can then convert back with ntohl() and wrap the values back into a structure on the other side.
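A short sketch of the per-element conversion, assuming the unpacked bit fields have been stored as uint32_t values and that the POSIX header <arpa/inet.h> is available; fields_to_network and fields_to_host are hypothetical helper names.

```c
#include <assert.h>
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stddef.h>
#include <stdint.h>

/* Convert every element to network (big-endian) byte order before sending. */
void fields_to_network(uint32_t *fields, size_t count)
{
    assert(fields != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        fields[i] = htonl(fields[i]);
}

/* Convert every element back to host byte order after receiving. */
void fields_to_host(uint32_t *fields, size_t count)
{
    assert(fields != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        fields[i] = ntohl(fields[i]);
}
```

The sender would call fields_to_network() on the array and pass it to sendto() with a length of count * sizeof(uint32_t); the receiver calls fields_to_host() on what recvfrom() delivered before repacking the bit fields.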
You can send multiple pieces of data as one with writev. Just create the array of struct iovec that it needs, with one element for each data structure you want to send.
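The following is a rough sketch of the writev approach. The structs Position and Status and all function names are invented for illustration, and roundtrip_demo uses a pipe as a stand-in for the socket so the example can be run on its own. The earlier caveats about padding, endianness and type sizes still apply, since the structs are sent as raw bytes.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>     /* writev, readv, struct iovec (POSIX) */
#include <unistd.h>      /* pipe, close */

/* Hypothetical payload pieces. */
struct Position { int x; int y; };
struct Status { unsigned flags; };

/* Send two structs and a char array in a single writev() call. */
ssize_t send_parts(int fd, const struct Position *pos, const struct Status *st,
                   const char *text, size_t text_len)
{
    assert(pos != NULL && st != NULL && text != NULL);
    struct iovec iov[3] = {
        { .iov_base = (void *)pos,  .iov_len = sizeof *pos },
        { .iov_base = (void *)st,   .iov_len = sizeof *st },
        { .iov_base = (void *)text, .iov_len = text_len },
    };
    return writev(fd, iov, 3);
}

/* Scatter the incoming bytes straight back into the three destinations. */
ssize_t recv_parts(int fd, struct Position *pos, struct Status *st,
                   char *text, size_t text_len)
{
    assert(pos != NULL && st != NULL && text != NULL);
    struct iovec iov[3] = {
        { .iov_base = pos,  .iov_len = sizeof *pos },
        { .iov_base = st,   .iov_len = sizeof *st },
        { .iov_base = text, .iov_len = text_len },
    };
    return readv(fd, iov, 3);
}

/* Self-check over a pipe; returns 0 if the data survives the round trip. */
int roundtrip_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    struct Position pos = { 3, 4 }, pos2;
    struct Status st = { 0x5u }, st2;
    const char text[8] = "payload";
    char text2[8];
    ssize_t total = (ssize_t)(sizeof pos + sizeof st + sizeof text);
    int ok = send_parts(fds[1], &pos, &st, text, sizeof text) == total
          && recv_parts(fds[0], &pos2, &st2, text2, sizeof text2) == total
          && pos2.x == 3 && pos2.y == 4 && st2.flags == 0x5u
          && strcmp(text2, "payload") == 0;
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

For UDP, the descriptor would be a socket that has been connect()ed to the peer, so that writev() knows the destination; sendmsg() takes the same struct iovec array if you need to supply an address per call.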
