How to send and receive a binary tree using MPI? - c

I want to send a binary tree from one core to another using a function
like MPI_Send(). Or is there a fast algorithm for doing this?
The data structure I use is:
typedef struct BiNode {
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;
This binary tree has more than 2000 leaves.

Read more about serialization. A 2000-node tree is, on current machines and networks, quite a small piece of data. If the average name length is a dozen bytes, you need to transmit a few dozen kilobytes (not a big deal today). Typical datacenter network bandwidth is around 100 Mbytes/sec, and inter-process communication (using e.g. some pipe(7) or unix(7) sockets between cores of the same processor) is usually at least ten times faster. See also http://norvig.com/21-days.html
Or is there a fast algorithm for doing this?
You probably need some depth-first traversal (and there is probably nothing faster).
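A minimal sketch of such a depth-first (pre-order) serialization, under some assumptions: every node has a non-NULL name, both processes run on the same architecture (lengths are written in host byte order), and the helper names ByteBuf, buf_put, tree_pack and tree_unpack are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct BiNode {
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;

/* Growable byte buffer (hypothetical helper, not part of MPI). */
typedef struct { unsigned char *data; size_t len, cap; } ByteBuf;

static void buf_put(ByteBuf *b, const void *src, size_t n) {
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n) cap *= 2;
        unsigned char *p = realloc(b->data, cap);
        assert(p != NULL);              /* sketch: abort on out-of-memory */
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* Pre-order: tag byte 0 for an absent child, else 1 + name length + name. */
void tree_pack(const BiNode *n, ByteBuf *b) {
    unsigned char tag = (n != NULL);
    buf_put(b, &tag, 1);
    if (n == NULL) return;
    uint32_t len = (uint32_t)strlen(n->name);
    buf_put(b, &len, sizeof len);
    buf_put(b, n->name, len);
    tree_pack(n->lchi, b);
    tree_pack(n->rchi, b);
}

/* Rebuilds the tree from buf; *pos advances as bytes are consumed.
 * Parent pointers are restored from the recursion, not from the bytes. */
BiNode *tree_unpack(const unsigned char *buf, size_t *pos, BiNode *parent) {
    if (buf[(*pos)++] == 0) return NULL;
    uint32_t len;
    memcpy(&len, buf + *pos, sizeof len);
    *pos += sizeof len;
    BiNode *n = malloc(sizeof *n);
    assert(n != NULL);
    n->name = malloc((size_t)len + 1);
    assert(n->name != NULL);
    memcpy(n->name, buf + *pos, len);
    n->name[len] = '\0';
    *pos += len;
    n->parent = parent;
    n->lchi = tree_unpack(buf, pos, n);
    n->rchi = tree_unpack(buf, pos, n);
    return n;
}
```

The packed bytes could then be passed to MPI_Send with MPI_BYTE as the datatype; the receiver can learn the size with MPI_Probe and MPI_Get_count before calling MPI_Recv and tree_unpack.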
You might consider writing your tree in some textual format -or some text-based protocol- such as (some customized variant using) JSON (or XML or YAML or S-expressions). Then take advantage of existing JSON libraries, such as Jansson. They are capable of encoding and decoding your data (in some JSON format) in a dynamically allocated string buffer.
If performance is critical, consider using some binary format, like XDR or ASN.1. Or simply compress the JSON (or other textual) encoding, using some existing compression library (perhaps zlib).
My guess is that in your case, it is not worth the trouble (using JSON is a lot simpler to code, and your development time has some cost and value). Your bottleneck is probably the network itself, not any software layers. But you need to benchmark.

MPI has a feature called datatypes. A full explanation would take a really long time, but you probably want to look at structs in there (though you might be able to get away with vectors depending on how your memory is laid out).
However, you probably can't just use MPI datatypes because you'd just be transmitting a bunch of pointers which won't mean anything to the process on the other end. Instead you have to decide which parts you actually need to send and serialize them in a way that makes sense.
So you have a few options, I think.
1. Change the way your tree is laid out in memory so it's an array of contiguous memory where all of the pointers you have above become indices in the array.
This might not actually make sense in the context of your application, but it makes the "tree" very easy to transmit. At that point, you can either just send a large array of bytes or you can construct MPI datatypes to describe each cell in the array and send an array of 2000 of those.
2. Re-create the tree on the other process from the source data (whether that's a file or something else).
This is probably not the answer you were looking for and doesn't help if you've generated this data from anything non-trivial in the middle of your application.
3. Use POSIX shared memory.
Since you say "core" in the description of your question, I'm assuming you want to transfer data between OS processes on the same physical machine. If that's the case, you can use shared memory and you don't need to do message passing at all. Just open a shared memory region, attach to it with the other process and "poof" all of the data is available on the other end. As long as you share all of the memory that those pointers are pointing to, and the region is mapped at the same address in both processes (otherwise the pointer values won't be valid on the other side), I think you'll be fine.
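The first option above (pointers replaced by array indices) can be sketched roughly as follows. FlatNode, flatten, NAME_MAX_LEN and NO_NODE are names invented for this example, and the fixed name width is an assumption: longer names would be truncated here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 32   /* assumed upper bound on name length, incl. NUL */
#define NO_NODE (-1)      /* index value standing in for a NULL pointer */

typedef struct BiNode {
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;

/* Pointer-free node: links are indices into the same array. */
typedef struct {
    int lchi, rchi, parent;
    char name[NAME_MAX_LEN];
} FlatNode;

/* Copies the subtree rooted at n into out[] in pre-order and returns the
 * index assigned to n (NO_NODE for an empty subtree). *count is the next
 * free slot and ends up as the total number of nodes written. */
int flatten(const BiNode *n, int parent, FlatNode *out, int cap, int *count) {
    if (n == NULL) return NO_NODE;
    assert(*count < cap);
    int me = (*count)++;
    out[me].parent = parent;
    /* strncpy zero-pads, so no uninitialised bytes end up in the array. */
    strncpy(out[me].name, n->name, NAME_MAX_LEN - 1);
    out[me].name[NAME_MAX_LEN - 1] = '\0';
    out[me].lchi = flatten(n->lchi, me, out, cap, count);
    out[me].rchi = flatten(n->rchi, me, out, cap, count);
    return me;
}
```

Between processes of the same architecture, the resulting array could be sent as count * sizeof(FlatNode) bytes with MPI_BYTE, or described per cell with MPI_Type_create_struct.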

Related

Write dynamically allocated structure to file

Suppose we have following structure:
struct Something {
int i;
};
If I want to write dynamically allocated data of this type to a file, I do this:
struct Something *object = malloc(sizeof(struct Something));
object->i = 0; // set member some value
FILE *file = fopen("output_file", "wb");
fwrite(object, sizeof(struct Something), 1, file);
fclose(file);
Now, my question:
How do we do this with a structure that contains pointers? I tested the same method and it worked fine, the data could be read back, but I want to know if there are any risks?
What you want is called serialization. See also XDR (a portable binary data format) & libs11n (a C++ binary serialization library); you often care about data portability: being able to read the data on some different computer.
"serialization" means to "convert" some complex data structure (e.g. a list, a tree, a vector or even your Something...) into a (serial) byte stream (e.g. a file, a network connection, etc...), and backwards. Dealing with circular data structures or shared sub-components may be tricky.
You don't want to write raw pointers inside a file (but you could), because the written address probably won't make any sense at the next execution of your program (e.g. because of ASLR), i.e. when you'll read the data again.
Read also about application checkpointing and persistence.
For pragmatic reasons (notably ease of debugging and resilience w.r.t. small software evolution) it is often better to use some textual data format (like e.g. JSON or YAML) to store such persistent data.
You might also be interested in databases. Look first into sqlite, and also into DBMS ("relational" -or SQL based- ones like PostgreSQL, NoSQL ones like e.g. MongoDB).
The issue is not writing a single dynamically allocated struct (since you want mostly to write the data content, not the pointer, so it is the same to fwrite a malloc-ed struct or a locally allocated one), it is to serialize complex data structures which use lots of weird internal pointers!
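As an illustration of writing the pointed-to data rather than the pointer, here is a small sketch for a struct holding a string. The struct Named and the functions named_write and named_read are hypothetical, and the format (host byte order, length prefix) is just one possible choice that is not portable across architectures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of Something that owns a heap-allocated string. */
struct Named {
    int i;
    char *name;
};

/* Writes the int, then the string length, then the string bytes.
 * The pointer value itself is never written. Returns 0 or -1. */
int named_write(const struct Named *obj, FILE *f) {
    assert(obj != NULL && obj->name != NULL && f != NULL);
    uint32_t len = (uint32_t)strlen(obj->name);
    if (fwrite(&obj->i, sizeof obj->i, 1, f) != 1) return -1;
    if (fwrite(&len, sizeof len, 1, f) != 1) return -1;
    if (len > 0 && fwrite(obj->name, 1, len, f) != len) return -1;
    return 0;
}

/* Reads the same layout back and allocates a fresh string for name.
 * Returns 0 on success, -1 on I/O or allocation failure. */
int named_read(struct Named *obj, FILE *f) {
    assert(obj != NULL && f != NULL);
    uint32_t len;
    if (fread(&obj->i, sizeof obj->i, 1, f) != 1) return -1;
    if (fread(&len, sizeof len, 1, f) != 1) return -1;
    obj->name = malloc((size_t)len + 1);
    if (obj->name == NULL) return -1;
    if (len > 0 && fread(obj->name, 1, len, f) != len) {
        free(obj->name);
        obj->name = NULL;
        return -1;
    }
    obj->name[len] = '\0';
    return 0;
}
```

After reading, name points to newly allocated memory at whatever address malloc returns in that run, which is why the old address never needs to be stored.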
Notice that copying garbage collectors use algorithms similar to serialization algorithms (since both need to scan a complex graph of references).
Also, on today's computers, disk -or network- IO is a lot (e.g. a million times) slower than the CPU, so it makes sense to do some significant computation before writing files.

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons: encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
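The binary search described above might look roughly like this, assuming a hypothetical fixed-length Record type and a file that contains nothing but such records sorted by key. Only one record is held in memory at a time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-length record; the file holds these sorted by key. */
typedef struct {
    int32_t key;
    int32_t value;
} Record;

/* Binary search over `count` sorted records stored back to back in f.
 * Returns 1 and fills *out if key is found, 0 if absent, -1 on I/O error. */
int find_record(FILE *f, long count, int32_t key, Record *out) {
    assert(f != NULL && out != NULL);
    long lo = 0, hi = count;            /* half-open range [lo, hi) */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        Record r;
        /* Fixed record size turns the offset into a plain multiplication. */
        if (fseek(f, mid * (long)sizeof(Record), SEEK_SET) != 0) return -1;
        if (fread(&r, sizeof r, 1, f) != 1) return -1;
        if (r.key == key) { *out = r; return 1; }
        if (r.key < key) lo = mid + 1;
        else hi = mid;
    }
    return 0;
}
```

With variable-length text lines, the fseek target could not be computed this way; the reader would have to scan for line boundaries or keep a separate index.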
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

send glib hashtable with MPI

I recently came across a problem with my parallel program. Each process has several glib hashtables that need to be exchanged with other processes, and these hashtables may be quite large. What is the best approach to achieve that?
1. create a derived datatype
2. use MPI pack and unpack
3. send keys & values as arrays (a problem, since the number of elements is not known at compile time)
I haven't used 1 & 2 before and don't even know if that's possible, that's why I am asking you guys.
Pack/unpack creates a copy of your data: if your maps are large, you'll want to avoid that. This also rules out your 3rd option.
You can indeed define a custom datatype, but it'll be a little tricky. See the end of this answer for an example (replacing "graph" with "map" and "node" with "pair" as you read). I suggest you read up on these topics to get a firm understanding of what you need to do.
That the number of elements is not known at compile time shouldn't be a real issue. You can just send a message containing the payload size before sending the map contents. This will let the receiving process allocate just enough memory for the receive buffer.
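The size-then-payload idea can be sketched as below. To keep the example runnable without an MPI installation, it is written against a POSIX file descriptor (for instance a pipe) as a stand-in for the MPI channel; this is an assumption of the sketch, not something the answer prescribes. With MPI, each write would become an MPI_Send and each read an MPI_Recv. The function names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Writes exactly n bytes to fd; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t n) {
    const unsigned char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0) return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Reads exactly n bytes from fd; returns 0 on success, -1 on error/EOF. */
static int read_all(int fd, void *buf, size_t n) {
    unsigned char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Sender: first the payload size, then the payload itself. */
int send_sized(int fd, const void *payload, uint32_t size) {
    assert(payload != NULL || size == 0);
    if (write_all(fd, &size, sizeof size) != 0) return -1;
    return write_all(fd, payload, size);
}

/* Receiver: reads the size, allocates exactly that much, reads the payload.
 * Returns a malloc'ed buffer (caller frees) or NULL on error. */
void *recv_sized(int fd, uint32_t *size) {
    assert(size != NULL);
    if (read_all(fd, size, sizeof *size) != 0) return NULL;
    void *buf = malloc(*size ? *size : 1);
    if (buf == NULL) return NULL;
    if (read_all(fd, buf, *size) != 0) { free(buf); return NULL; }
    return buf;
}
```

The receiver never needs a compile-time bound on the number of elements: the first message tells it how much to allocate for the second.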
You may also want to consider simply printing the contents of your maps to files, and then having the processes read each others' output. This is much more straightforward, but also less elegant and much slower than message passing.

C data structure to disk

How can I make a copy of a tree data structure in memory to disk in C programming language?
You need to serialize it, i.e. figure out a way to go through it serially that includes all nodes. These are often called traversal methods.
Then figure out a way to store the representation of each node, together with references to other nodes, so that it can all be loaded in again.
One way of representing the references is implicitly, by nesting like XML does.
The basic pieces here are:
The C file I/O routines are fopen, fwrite, fprintf, etc.
Copying pointers to disk is useless, since the next time you run all those pointer values will be crap. So you'll need some alternative to pointers that still somehow refers disk records to each other. One sensible alternative would be file indexes (the kind used by your C I/O routines like fseek and ftell).
That should be about all the info you need to do the job.
Alternatively, if you use an array-based tree (with array indexes instead of pointers, or with the links implied by their position in the array) you could just save and load the whole shebang without any further logic required.
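A sketch of that array-based approach, assuming a hypothetical ArrTree type with a fixed capacity. Because the nodes hold indexes rather than pointers, the whole array can be written and read back as raw bytes. The file is only readable on machines with the same endianness and struct layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 1024      /* assumed capacity */
#define NO_CHILD  (-1)      /* index standing in for a NULL link */

typedef struct {
    int32_t left, right;    /* array indexes instead of pointers */
    int32_t value;
} ArrNode;

typedef struct {
    int32_t count;          /* number of nodes in use; node 0 is the root */
    ArrNode nodes[MAX_NODES];
} ArrTree;

/* Writes the node count followed by the used part of the array. */
int tree_save(const ArrTree *t, FILE *f) {
    assert(t != NULL && f != NULL);
    assert(t->count >= 0 && t->count <= MAX_NODES);
    size_t n = (size_t)t->count;
    if (fwrite(&t->count, sizeof t->count, 1, f) != 1) return -1;
    if (fwrite(t->nodes, sizeof(ArrNode), n, f) != n) return -1;
    return 0;
}

/* Reads the same layout back; the indexes are valid again immediately. */
int tree_load(ArrTree *t, FILE *f) {
    assert(t != NULL && f != NULL);
    if (fread(&t->count, sizeof t->count, 1, f) != 1) return -1;
    if (t->count < 0 || t->count > MAX_NODES) return -1;
    size_t n = (size_t)t->count;
    if (fread(t->nodes, sizeof(ArrNode), n, f) != n) return -1;
    return 0;
}
```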
Come up with a serialization (and deserialization) function. Then run it and send the output to a file.

ANSI C hash table implementation with data in one memory block

I am looking for an open source C implementation of a hash table that keeps all the data in one memory block, so it can easily be sent over a network, for example.
I can only find ones that allocate small pieces of memory for every key-value pair added to it.
Thank you very much in advance for all the inputs.
EDIT: It doesn't necessarily need to be a hash table, whatever key-value pair table would probably do.
The number of times you would serialize such a data structure (and sending it over the network is serializing as well) compared with the number of times you would use it (in your program) is pretty low. So, most implementations focus more on speed than on the "maybe easier to serialize" side.
If all the data would be in one allocated memory block a lot of operations on that data structure would be a bit expensive because you would have to:
reallocate memory on add-operations
most likely compress / vacuum on delete-operations (so that the one block you like so much is dense and has no holes)
Most network operations are buffered anyway, just iterate over the keys and send keys + values.
On a unix system I'd probably utilise a shared memory buffer (see shm_open()), or if that's not available a memory-mapped file with the MAP_SHARED flag, see the OS-specific differences though http://en.wikipedia.org/wiki/Mmap
If neither shm_open nor mmap is available, you could still use a file on the disk (to some extent), but you'd have to take care of proper locking. I'd send an unlock signal to the next process, and maybe the offset of the updated portion of the file; that process then locks the file again, seeks to the interesting part and proceeds as usual (updates/deletes/etc.).
In any case, you could freely design the layout of the hashtable or whatever you want, like having fixed width key/seek pairs. That way you'd have the fast access to the keys of your hashtable and if necessary you seek to the data portion, then copy/delete/modify/etc.
Ideally this file should be on a ram disk, of course.
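One way to picture a table with a freely designed, fixed-width layout is the sketch below: an open-addressing table whose keys and values live inside a single struct with no pointers, so the whole block can be copied, placed in shared memory, or sent as bytes. All names and sizes are assumptions for illustration. Deletion is left out, and the table must start zero-filled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOTS   256    /* assumed fixed capacity, must be a power of two */
#define KEY_LEN 16     /* assumed fixed key width, including the NUL */

typedef struct {
    char     key[KEY_LEN];   /* empty string marks a free slot */
    uint32_t value;
} Slot;

typedef struct {
    uint32_t used;
    Slot     slots[SLOTS];
} FlatTable;

static uint32_t hash_str(const char *s) {      /* 32-bit FNV-1a */
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Linear probing: returns the index of the slot holding key, or of the
 * free slot where it would be stored, or -1 if the table is full. */
static int probe(const FlatTable *t, const char *key) {
    uint32_t i = hash_str(key) & (SLOTS - 1);
    for (int n = 0; n < SLOTS; n++) {
        const Slot *s = &t->slots[i];
        if (s->key[0] == '\0' || strcmp(s->key, key) == 0) return (int)i;
        i = (i + 1) & (SLOTS - 1);
    }
    return -1;
}

/* Inserts or overwrites; returns 0 on success, -1 if the table is full. */
int flat_put(FlatTable *t, const char *key, uint32_t value) {
    assert(key[0] != '\0' && strlen(key) < KEY_LEN);
    int i = probe(t, key);
    if (i < 0) return -1;
    if (t->slots[i].key[0] == '\0') {
        strcpy(t->slots[i].key, key);
        t->used++;
    }
    t->slots[i].value = value;
    return 0;
}

/* Returns 1 and stores the value if key is present, 0 otherwise. */
int flat_get(const FlatTable *t, const char *key, uint32_t *value) {
    int i = probe(t, key);
    if (i < 0 || t->slots[i].key[0] == '\0') return 0;
    *value = t->slots[i].value;
    return 1;
}
```

Since lookups depend only on the bytes inside the block, a memcpy of the struct (or a receive into a buffer of sizeof(FlatTable) bytes on the same architecture) yields a table that works without any rebuilding.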
I agree completely with akira (+1). Just one more comment on data locality. Once the table gets larger, or if the satellite data is large enough, there's most certainly cache pollution which slows down any operation on the table additionally, or in other words you can rely on the level-1/2/3 cache chain to serve the key data promptly whilst putting up with a cache miss when you have to access the satellite data (e.g. for serialisation).
Libraries providing hashtables tend to hide the details and make the thing work efficiently (that is normally what programmers want when they use a hashtable), so normally the way they handle the memory is hidden from the final programmer's eyes, and programmers shouldn't rely on the particular "memory layout", which may change in later versions of the library.
Write your own function to serialize (and unserialize) the hashtable in the most convenient way for your usage. You can keep the serialized content if you need it several times (of course, when the hashtable is changed, you need to update the serialized "version" kept in memory).

Resources