Write dynamically allocated structure to file - c

Suppose we have the following structure:
struct Something {
    int i;
};
If I want to write data of this type (dynamically allocated) to a file, I do this:
struct Something *object = malloc(sizeof(struct Something));
object->i = 0; // give the member some value
FILE *file = fopen("output_file", "wb");
fwrite(object, sizeof(struct Something), 1, file);
fclose(file);
Now, my question:
How do we do this with a structure that contains pointers? I tested it using the same method and it seemed to work fine (the data could be read back), but I want to know whether there are any risks.

What you want is called serialization. See also XDR (a portable binary data format) and libs11n (a C++ serialization library); you often care about data portability, i.e. being able to read the data back on a different computer.
"serialization" means to "convert" some complex data structure (e.g. a list, a tree, a vector or even your Something...) into a (serial) byte stream (e.g. a file, a network connection, etc...), and backwards. Dealing with circular data structures or shared sub-components may be tricky.
You don't want to write raw pointers into a file (though you could), because the written address probably won't make any sense at the next execution of your program (e.g. because of ASLR), i.e. when you read the data again.
Read also about application checkpointing and persistence.
For pragmatic reasons (notably ease of debugging and resilience w.r.t. small software evolution) it is often better to use some textual data format (like e.g. JSON or YAML) to store such persistent data.
You might also be interested in databases. Look first into sqlite, and also into DBMSs ("relational", SQL-based ones like PostgreSQL, and NoSQL ones like MongoDB).
The issue is not writing a single dynamically allocated struct (you mostly want to write the data content, not the pointer, so fwrite-ing a malloc-ed struct is the same as fwrite-ing a locally allocated one); it is serializing complex data structures that use lots of internal pointers!
Notice that copying garbage collectors use algorithms similar to serialization algorithms (since both need to scan a complex graph of references).
Also, on today's computers, disk or network I/O is a lot (e.g. a million times) slower than the CPU, so it makes sense to do some significant computation before writing files.
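For instance, here is a minimal sketch of hand-written binary serialization for a variant of Something that owns a string (the char *name member is an assumption added for illustration): the string is written length-prefixed, never as a raw pointer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Something {
    int i;
    char *name;   /* writing this pointer raw would store a meaningless address */
};

/* Write the int, then the string length-prefixed (no pointer hits the file). */
static int write_something(FILE *f, const struct Something *s)
{
    size_t len = strlen(s->name);
    if (fwrite(&s->i, sizeof s->i, 1, f) != 1) return -1;
    if (fwrite(&len, sizeof len, 1, f) != 1) return -1;
    if (fwrite(s->name, 1, len, f) != len) return -1;
    return 0;
}

/* Reading mirrors the writes; the length prefix says how much to malloc. */
static int read_something(FILE *f, struct Something *s)
{
    size_t len;
    if (fread(&s->i, sizeof s->i, 1, f) != 1) return -1;
    if (fread(&len, sizeof len, 1, f) != 1) return -1;
    s->name = malloc(len + 1);
    if (!s->name || fread(s->name, 1, len, f) != len) return -1;
    s->name[len] = '\0';
    return 0;
}

Note that sizeof and endianness still make this file format machine-dependent, which is exactly why the portability remark above points at XDR or textual formats.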

Related

How to send and receive a binary tree using MPI?

I want to send a binary tree from one core to another using a function
like MPI_Send(). Or is there a fast algorithm for doing this?
The data structure I use is
typedef struct BiNode {
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;
This binary tree has more than 2000 leaves.
Read more about serialization. A 2000-node tree is, on current machines and networks, quite a small piece of data. If the average name length is a dozen bytes, you need to transmit a few dozen kilobytes (not a big deal today). Typical datacenter network bandwidth is 100 Mbytes/sec, and inter-process communication (using e.g. some pipe(7) or unix(7) sockets between cores of the same processor) is usually at least ten times faster. See also http://norvig.com/21-days.html
Or is there a fast algorithm for doing this?
You probably need some depth-first traversal (and there is probably nothing faster).
You might consider writing your tree in some textual format -or some text-based protocol- such as (some customized variant using) JSON (or XML or YAML or S-expressions). Then take advantage of existing JSON libraries, such as Jansson. They are capable of encoding and decoding your data (in some JSON format) in a dynamically allocated string buffer.
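As a sketch of that approach with Jansson (the "name"/"lchi"/"rchi" field names are assumptions, and the parent pointer is deliberately not encoded, since it can be recomputed while decoding):

#include <jansson.h>

typedef struct BiNode {   /* the node type from the question */
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;

/* Recursively encode the tree; a missing child becomes JSON null. */
static json_t *tree_to_json(const BiNode *node)
{
    if (!node)
        return json_null();
    json_t *obj = json_object();
    json_object_set_new(obj, "name", json_string(node->name));
    json_object_set_new(obj, "lchi", tree_to_json(node->lchi));
    json_object_set_new(obj, "rchi", tree_to_json(node->rchi));
    return obj;
}

/* Produce a malloc-ed string; send strlen(s)+1 bytes with MPI_Send. */
char *tree_to_string(const BiNode *root)
{
    json_t *j = tree_to_json(root);
    char *s = json_dumps(j, JSON_COMPACT);
    json_decref(j);
    return s;
}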
If performance is critical, consider using some binary format, like XDR or ASN.1. Or simply compress the JSON (or other textual) encoding, using some existing compression library (perhaps zlib).
My guess is that in your case, it is not worth the trouble (using JSON is a lot simpler to code, and your development time has some cost and value). Your bottleneck is probably the network itself, not any software layers. But you need to benchmark.
MPI has a feature called datatypes. A full explanation would take a really long time, but you probably want to look at structs in there (though you might be able to get away with vectors depending on how your memory is laid out).
However, you probably can't just use MPI datatypes because you'd just be transmitting a bunch of pointers which won't mean anything to the process on the other end. Instead you have to decide which parts you actually need to send and serialize them in a way that makes sense.
So you have a few options I think.
Change the way your tree is laid out in memory so it's an array of contiguous memory where all of the pointers you have above become indices in the array.
This might not actually make sense in the context of your application, but it makes the "tree" very easy to transmit. At that point, you can either just send a large array of bytes or you can construct MPI datatypes to describe each cell in the array and send an array of 2000 of those (a sketch of this layout follows the options below).
Re-create the tree on the other process from the source data (whether that's a file or something else).
This is probably not the answer you were looking for and doesn't help if you've generated this data from anything non-trivial in the middle of your application.
Use POSIX shared memory.
Since you say "core" in the description of your question, I'm assuming you want to transfer data between OS processes on the same physical machine. If that's the case, you can use shared memory and you don't need to do message passing at all. Just open a shared memory region, attach to it with the other process and "poof" all of the data is available on the other end. As long as you share all of the memory that those pointers are pointing to, I think you'll be fine.

C - Save/Load Pointer Data to File

Firstly, apologies if this question has been asked before or has a glaringly obvious solution that I cannot see. I have found a similar question; however, I believe what I am asking goes a little further than what was previously asked.
I have a structure as follows:
typedef struct {
    int id;
    char *title;
    char *body;
} journal_entry;
Q: How do I write the contents these pointers refer to out to a file, and load them back, in C (not C++) without using fixed lengths?
Am I wrong in thinking that by writing title or body to a file I would end up with junk data (the pointer values) and not the information I had actually stored? I do not know what size the title or body of a journal entry will be, and the size may vary significantly from entry to entry.
My own reading suggests that I will need to dereference the pointers and fwrite each part of the struct separately. But I'm uncertain how to keep track of the data and the structs without things becoming confused, particularly for larger files. Furthermore, if these are not the only items I intend to store in the file (for example, I may wish to include small images later on), I'm uncertain how I would order the file structure for convenience.
The other (possibly perceived) problem is that I have used malloc to allocate memory for the title and body strings; when loading the data, how will I know how much memory to allocate for each string? Do I need to expand my struct to include int body_len and int title_len?
Guidance or suggestions would be very gratefully received.
(I am focusing on a Linux point of view, but it could be adapted to other systems)
Serialization
What you want to achieve is often called serialization (citing wikipedia) - or marshalling:
Serialization is the process of translating data structures or object state into a format that can be stored and reconstructed later in the same or another computer.
Pointer I/O
It is in principle possible to read and write pointers, e.g. with the %p conversion specification for fprintf(3) & fscanf(3) (and you might even write and read a pointer directly, which at the machine level is just some intptr_t integer). However, a given address (e.g. 0x1234F580...) is likely to be invalid, or to have a different meaning, when read again by a different process (e.g. because of ASLR).
Serialization of aggregate data
You might use some textual format like JSON (and I actually recommend doing so) or some other format like YAML (or perhaps invent your own, e.g. inspired by s-exprs). It is a well-established habit to prefer textual formats (and Unix has had that habit since before 1980) over binary ones (like XDR, ASN.1, ...). And many protocols (HTTP, SMTP, FTP, JSON-RPC, ...) are textual protocols.
Notice that on current systems, I/O is much slower than computation, so the relative cost of textual encoding and decoding is tiny w.r.t. network or disk I/O (see the table of answers here).
The encoding of some aggregate data (e.g. a struct in C) is generally compositional: by composing the encodings of elementary scalar data (numbers, strings, ...) you can encode some higher-level data type.
serialization libraries
Most formats (notably JSON) have several free software libraries to encode/decode them, e.g. Jansson, JsonCPP, etc.
Suggestion:
Use JSON and format your journal_entry perhaps into a JSON object like
{ "id": 1234,
"title": "Some Title Here",
"body": "Some body string goes here" }
Concretely, you'll use some JSON library: first convert your journal_entry into some JSON value (and vice versa), then use the library to encode/decode that JSON.
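A minimal sketch with Jansson, assuming the object layout shown above (entry_to_json is a name invented here):

#include <jansson.h>

typedef struct {   /* the struct from the question */
    int id;
    char *title;
    char *body;
} journal_entry;

/* Encode one entry; the caller frees the returned string with free(). */
char *entry_to_json(const journal_entry *e)
{
    json_t *obj = json_pack("{s:i, s:s, s:s}",
                            "id", e->id,
                            "title", e->title,
                            "body", e->body);
    char *text = json_dumps(obj, JSON_INDENT(2));
    json_decref(obj);
    return text;
}

Decoding goes the other way: json_loads to parse the text, then json_unpack with the same format string to fill in the struct fields.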
databases
You could also consider a database approach (e.g. sqlite, etc...)
PS. Serialization of closures (or anything containing pointers to code) may be challenging. You'll need to define what exactly that means.
PPS. Some languages provide built-in support for serialization and marshalling. For example, OCaml has a Marshal module, and Python has pickle.
You are correct that writing this structure to a file as-is is not a good idea: the file would contain the pointer values, and once the strings to which those pointers point are gone, there is no way to retrieve them. From a practical point of view, one way is to declare strings of finite length (if you know that your strings have a length limit):
typedef struct {
    int id;
    char title[MAX_TITLE_LENGTH];
    char body[MAX_BODY_LENGTH];
} journal_entry;
If you need to allocate title and body with malloc, you can write a "header" element that stores the length of the whole record. When you read your structure back from the file, you use this element to figure out how many bytes to read.
I.e. to write (assuming a populated journal_entry je):
FILE *fp = fopen(<your-file-name>, "wb");
size_t size = sizeof(je.id) + strlen(je.title) + 1 + strlen(je.body) + 1;
fwrite(&size, sizeof(size), 1, fp);             /* record length header */
fwrite(&je.id, sizeof(je.id), 1, fp);
fwrite(je.title, 1, strlen(je.title) + 1, fp);  /* include the '\0' */
fwrite(je.body, 1, strlen(je.body) + 1, fp);
fclose(fp);
To read (not particularly safe implementation, just to give the idea):
FILE* fp = fopen(<your-file-name>,"rb");
size_t size;
int read_bytes = 0;
struct journal_entry je;
fread(&size, sizeof(size), 1, fp);
void* buf = malloc(size);
fread(buf, size, 1, fp);
fclose(fp);
je.id = *((int*)buf); // might break if you wrote your file on OS with different endingness
read_bytes += sizeof(je.id)
je.title = (char*)(buf+read_bytes);
read_bytes += strlen(je.title)+1;
je.body = (char*)(buf+read_bytes);
// other way would be to malloc je.title and je.body and destroy the buf
In memory you can store strings as pointers to arrays, but in a file on disk you would typically store the data directly. One easy way is to store a uint32_t containing the size, then the actual bytes of the string. You could also store null-terminated strings in the file and simply scan for the null terminator when reading them. The first method makes it easier to preallocate the needed buffer space when reading, without needing to pass over the data twice.

sqlite increase size of returned blob

I was wondering whether it is possible to tell SQLite to return blobs in memory chunks whose sizes are multiples of 4, let's say.
For various reasons this would make other parts of the code simpler.
I'm using the C-API function
const void *sqlite3_column_blob(sqlite3_stmt*, int iCol);
There is no such function; pointers returned by SQLite point into buffers that may be part of larger data structures.
If you really want larger buffers, you have to create your own copies.
Alternatively, you can open a BLOB for incremental I/O and read portions of its data; finally, you should close the BLOB.
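A minimal sketch of that incremental approach, padding your own copy to a multiple of 4 (the "journal"/"body" table and column names are hypothetical; error handling is trimmed):

#include <sqlite3.h>
#include <stdlib.h>

/* Copy one blob into a calloc-ed buffer rounded up to a multiple of 4. */
unsigned char *read_padded_blob(sqlite3 *db, sqlite3_int64 rowid, int *padded_len)
{
    sqlite3_blob *blob;
    if (sqlite3_blob_open(db, "main", "journal", "body",
                          rowid, 0 /* read-only */, &blob) != SQLITE_OK)
        return NULL;
    int n = sqlite3_blob_bytes(blob);
    int padded = (n + 3) & ~3;              /* round up to a multiple of 4 */
    unsigned char *buf = calloc(1, padded ? padded : 4);
    if (buf)
        sqlite3_blob_read(blob, buf, n, 0); /* whole blob from offset 0 */
    sqlite3_blob_close(blob);
    *padded_len = padded;
    return buf;
}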

C data structure to disk

How can I copy a tree data structure from memory to disk in the C programming language?
You need to serialize it, i.e. figure out a way to go through it serially that visits every node. Such orderings are often called traversal methods.
Then figure out a way to store the representation of each node, together with references to other nodes, so that it can all be loaded in again.
One way of representing the references is implicitly, by nesting like XML does.
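For instance, a minimal sketch of a pre-order dump and reload (the Node type and the '#' marker for a missing child are assumptions):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    int value;
    struct Node *left, *right;
} Node;

/* Pre-order: one node per line, '#' marks a missing child, so the tree
   shape is implicit in the visit order and no pointer hits the disk. */
static void save_tree(const Node *n, FILE *fp)
{
    if (!n) { fputs("#\n", fp); return; }
    fprintf(fp, "%d\n", n->value);
    save_tree(n->left, fp);
    save_tree(n->right, fp);
}

/* Loading replays the same traversal, rebuilding fresh pointers. */
static Node *load_tree(FILE *fp)
{
    char line[64];
    if (!fgets(line, sizeof line, fp) || line[0] == '#')
        return NULL;
    Node *n = malloc(sizeof *n);
    n->value = atoi(line);
    n->left = load_tree(fp);
    n->right = load_tree(fp);
    return n;
}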
The basic pieces here are:
The C file I/O routines are fopen, fwrite, fprintf, etc.
Copying pointers to disk is useless, since the next time you run the program all those pointer values will be garbage. So you'll need some alternative to pointers that still lets disk records refer to each other. One sensible alternative is file offsets (the kind used by the C I/O routines fseek and ftell).
That should be about all the info you need to do the job.
Alternatively, if you use an array-based tree (with array indexes instead of pointers, or with the links implied by their position in the array) you could just save and load the whole shebang without any further logic required.
Come up with a serialization (and deserialization) function. Then run it and send the output to a file.

ANSI C hash table implementation with data in one memory block

I am looking for an open-source C implementation of a hash table that keeps all the data in one memory block, so it can easily be sent over a network, let's say.
I can only find ones that allocate small pieces of memory for every key-value pair added to them.
Thank you very much in advance for all the inputs.
EDIT: It doesn't necessarily need to be a hash table; any key-value table would probably do.
The number of times you would serialize such a data structure (and sending it over a network is serializing as well) versus the number of times you would use it in your program is pretty low. So most implementations focus on speed instead of the "maybe easier to serialize" side.
If all the data were in one allocated memory block, a lot of operations on that data structure would be somewhat expensive, because you would have to:
reallocate memory on add operations
most likely compress/vacuum on delete operations (so that the one block you like so much is dense and has no holes)
Most network operations are buffered anyway; just iterate over the keys and send keys + values.
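A minimal sketch of that idea, framing each pair with a length prefix (the framing itself is an assumption; any table's iteration API can drive it):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Emit one key/value pair as length-prefixed records; the receiver reads
   length, then bytes, and inserts the pair into its own freshly built table. */
static void send_pair(FILE *out, const char *key, const char *value)
{
    uint32_t klen = (uint32_t)strlen(key);
    uint32_t vlen = (uint32_t)strlen(value);
    fwrite(&klen, sizeof klen, 1, out);   /* both sides must agree on endianness */
    fwrite(key, 1, klen, out);
    fwrite(&vlen, sizeof vlen, 1, out);
    fwrite(value, 1, vlen, out);
}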
On a Unix system I'd probably use a shared memory buffer (see shm_open()), or, if that's not available, a memory-mapped file with the MAP_SHARED flag (note the OS-specific differences, though: http://en.wikipedia.org/wiki/Mmap).
If neither shm_open nor mmap is available, you could still use a file on disk (to some extent), but you'd have to take care of proper locking. For example, send an unlock signal to the next process, perhaps together with the offset of the updated portion of the file; that process then locks the file again, seeks to the interesting part, and proceeds as usual (updates/deletes/etc.).
In any case, you can freely design the layout of the hashtable, e.g. with fixed-width key/offset pairs. That way you get fast access to the keys of your hashtable, and when necessary you seek to the data portion, then copy/delete/modify/etc.
Ideally, this file should be on a RAM disk, of course.
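A minimal sketch of the shm_open() route mentioned above (the "/my_table" name and the caller-chosen size are assumptions; error handling is mostly omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Map a named shared region that a second process can open the same way. */
void *open_shared_region(size_t len)
{
    int fd = shm_open("/my_table", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, (off_t)len);   /* size the region */
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    return base == MAP_FAILED ? NULL : base;
}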
I agree completely with akira (+1). Just one more comment on data locality: once the table gets larger, or if the satellite data is large enough, there is almost certainly cache pollution that additionally slows down any operation on the table. In other words, you can rely on the level-1/2/3 cache chain to serve the key data promptly while putting up with a cache miss when you have to access the satellite data (e.g. for serialization).
Libraries providing hashtables tend to hide the details and make the thing work efficiently (that is normally what programmers want when they use a hashtable), so normally the way they handle memory is hidden from the final programmer's eyes, and programmers shouldn't rely on the particular "memory layout", which may change in a following version of the library.
Write your own functions to serialize (and deserialize) the hashtable in the most convenient way for your usage. You can keep the serialized content around if you need it several times (of course, when the hashtable changes, you need to update the serialized "version" kept in memory).
