C data structure to disk

How can I copy a tree data structure that is in memory to disk, in the C programming language?

You need to serialize it, i.e. figure out a way to go through it serially that includes all nodes. These are often called traversal methods.
Then figure out a way to store the representation of each node, together with references to other nodes, so that it can all be loaded in again.
One way of representing the references is implicitly, by nesting like XML does.
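
To make that concrete, here is a minimal sketch (with a hypothetical Node type; your real structure will differ) of a pre-order traversal that writes a nested, parenthesized text form, so the nesting itself encodes the links and no pointer values ever reach the disk:

#include <stdio.h>

/* Hypothetical node type used only for this sketch. */
typedef struct Node {
    int value;
    struct Node *left, *right;
} Node;

/* Pre-order traversal writing a nested form such as "(1 (2 () ()) (3 () ()))".
 * An empty subtree is written as "()", so the reader can rebuild the shape. */
static void save_node(FILE *out, const Node *n)
{
    if (n == NULL) {
        fputs("() ", out);
        return;
    }
    fprintf(out, "(%d ", n->value);
    save_node(out, n->left);
    save_node(out, n->right);
    fputs(") ", out);
}

Loading is the mirror image: a small recursive parser that reads one parenthesized group, allocates a node, and recurses for the two children.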

The basic pieces here are:
The C file I/O routines are fopen, fwrite, fprintf, etc.
Copying pointers to disk is useless, since the next time you run the program all those pointer values will be garbage. So you'll need some alternative to pointers that lets disk records refer to each other. One sensible alternative is file offsets (the kind of values your C I/O routines fseek and ftell work with).
That should be about all the info you need to do the job.
Alternatively, if you use an array-based tree (with array indexes instead of pointers, or with the links implied by their position in the array) you could just save and load the whole shebang without any further logic required.
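
A sketch of that array-based approach, assuming the nodes themselves contain no pointers (index links only; names or other variable-length data would need separate handling):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical array-based node: child links are array indexes, -1 means none. */
typedef struct {
    int value;
    int left, right;
} ANode;

/* Because the whole tree is one contiguous array, a single fwrite/fread
 * pair is enough to save and restore it. */
static int save_tree(const char *path, const ANode *nodes, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fwrite(&count, sizeof count, 1, f);
    fwrite(nodes, sizeof nodes[0], count, f);
    return fclose(f);
}

static ANode *load_tree(const char *path, size_t *count)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    ANode *nodes = NULL;
    if (fread(count, sizeof *count, 1, f) == 1) {
        nodes = malloc(*count * sizeof *nodes);
        if (nodes && fread(nodes, sizeof nodes[0], *count, f) != *count) {
            free(nodes);
            nodes = NULL;
        }
    }
    fclose(f);
    return nodes;
}

Note that such a dump is only readable by programs built with the same struct layout and byte order; for portability you would write the fields out explicitly.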

Come up with a serialization (and deserialization) function. Then run it and send the output to a file.

Related

How do deleting and inserting something in the middle of a file work in a file allocation system?

So I know that each file may use a bunch of clusters, with each cluster holding a pointer to the next cluster containing the rest of the file, but I don't understand what happens when we try to delete or insert something in the middle of a file in a certain sector. How is this issue resolved in FAT?
The first idea that came to me was shifting the data, but that doesn't seem to be a very efficient approach.
So I know that each file may use a bunch of clusters, with each cluster holding a pointer to the next cluster containing the rest of the file, but I don't understand what happens when we try to delete or insert something in the middle of a file in a certain sector. How is this issue resolved in FAT?
Generally speaking, you can't delete or insert in the middle of a file, by which I mean that these operations are not directly supported by filesystem drivers. You modify a file by writing a block of data starting at a particular offset from the original start of the file. You use such writes to implement insertions or deletions, and this is managed at the userspace level, not the filesystem level.
The first idea that came to me was shifting the data, but that doesn't seem to be a very efficient approach.
There are two basic options:
you overwrite the tail of the file in place, starting at the position of the insertion or deletion. This is effectively a shift, yes, and the program has to manage that itself.
you write a whole new file, then replace the original with it.
The latter is usually the preferred option for modifications anywhere other than at the end of the file, because the original file remains in a consistent state throughout the process, and because you don't need additional intermediate storage for the portion of the file contents that needs to be shifted.
None of this is specific to FAT. I can't rule out the possibility that there is some esoteric filesystem or storage medium out there somewhere to which different rules apply, but for the most part, this is the nature of persistent storage.
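
To illustrate the second option (write a whole new file, then replace the original), here is a rough userspace sketch; the temporary file name, buffer size and error handling are simplified for brevity:

#include <stdio.h>

/* Insert `len` bytes at `offset` by streaming the original file into a new
 * one and renaming it over the original.  Simplified: no error recovery. */
static int insert_into_file(const char *path, long offset,
                            const void *data, size_t len)
{
    FILE *in  = fopen(path, "rb");
    FILE *out = fopen("insert.tmp", "wb");   /* illustrative temp name */
    char buf[4096];
    long copied = 0;
    size_t n;

    if (!in || !out) return -1;

    while (copied < offset) {                /* copy the head unchanged */
        size_t want = (size_t)(offset - copied);
        if (want > sizeof buf) want = sizeof buf;
        n = fread(buf, 1, want, in);
        if (n == 0) break;
        fwrite(buf, 1, n, out);
        copied += (long)n;
    }
    fwrite(data, 1, len, out);               /* the inserted block */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);              /* copy the tail unchanged */

    fclose(in);
    fclose(out);
    return rename("insert.tmp", path);       /* replace the original */
}

Deletion works the same way: skip the unwanted bytes of the input instead of writing new ones.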

How to send and receive a binary tree using MPI?

I want to send a binary tree from one core to another using some function like MPI_Send(). Or is there a fast algorithm for doing this?
The data structure I use is
typedef struct BiNode {
    struct BiNode *lchi, *rchi;
    struct BiNode *parent;
    char *name;
} BiNode;
This binary tree has more than 2000 leaves.
Read more about serialization. A 2000-node tree is, on current machines and networks, quite a small piece of data. If the average name length is a dozen bytes, you need to transmit a few dozen kilobytes (not a big deal today). Typical datacenter network bandwidth is 100 Mbytes/sec, and inter-process communication (using e.g. some pipe(7) or unix(7) sockets between cores of the same processor) is usually at least ten times faster. See also http://norvig.com/21-days.html
Or is there a fast algorithm for doing this?
You probably need some depth-first traversal (and there is probably nothing faster).
You might consider writing your tree in some textual format - or some text-based protocol - such as (some customized variant of) JSON (or XML or YAML or S-expressions). Then take advantage of existing JSON libraries, such as Jansson. They are capable of encoding and decoding your data (in some JSON format) into a dynamically allocated string buffer.
If performance is critical, consider using some binary format, like XDR or ASN.1. Or simply compress the JSON (or other textual) encoding, using some existing compression library (perhaps zlib).
My guess is that in your case, it is not worth the trouble (using JSON is a lot simpler to code, and your development time has some cost and value). Your bottleneck is probably the network itself, not any software layers. But you need to benchmark.
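
If you do go the JSON route with Jansson, the encoder can simply mirror the recursive structure. A sketch (error handling omitted; the parent pointers are not encoded since they can be rebuilt while decoding):

#include <jansson.h>

/* Build a JSON object per node; a NULL child becomes JSON null, so the
 * nesting of objects encodes the shape of the tree. */
static json_t *node_to_json(const BiNode *n)
{
    if (n == NULL)
        return json_null();
    json_t *obj = json_object();
    json_object_set_new(obj, "name",  json_string(n->name ? n->name : ""));
    json_object_set_new(obj, "left",  node_to_json(n->lchi));
    json_object_set_new(obj, "right", node_to_json(n->rchi));
    return obj;
}

The string produced by json_dumps(node_to_json(root), JSON_COMPACT) is then an ordinary byte buffer that MPI_Send() can transmit, and json_loads() rebuilds the tree on the receiving side.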
MPI has a feature called datatypes. A full explanation would take a really long time, but you probably want to look at structs in there (though you might be able to get away with vectors depending on how your memory is laid out).
However, you probably can't just use MPI datatypes because you'd just be transmitting a bunch of pointers which won't mean anything to the process on the other end. Instead you have to decide which parts you actually need to send and serialize them in a way that makes sense.
So you have a few options I think.
Change the way your tree is laid out in memory so it's an array of contiguous memory where all of the pointers you have above become indices in the array.
This might not actually make sense in the context of your application, but it makes the "tree" very easy to transmit. At that point, you can either just send a large array of bytes or you can construct MPI datatypes to describe each cell in the array and send an array of 2000 of those (a sketch of this flattening appears after this list of options).
Re-create the tree on the other process from the source data (whether that's a file or something else).
This is probably not the answer you were looking for and doesn't help if you've generated this data from anything non-trivial in the middle of your application.
Use POSIX shared memory.
Since you say "core" in the description of your question, I'm assuming you want to transfer data between OS processes on the same physical machine. If that's the case, you can use shared memory and you don't need to do message passing at all. Just open a shared memory region, attach to it with the other process and "poof" all of the data is available on the other end. As long as you share all of the memory that those pointers are pointing to, I think you'll be fine.
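
Here is a rough sketch of the first option above: flattening the tree into an array of index-linked records and shipping it as raw bytes. Fixed-size name fields are an assumption made to keep the example short (variable-length names would be packed separately or described with an MPI struct datatype), and sending raw struct bytes assumes both ends share the same layout and byte order:

#include <mpi.h>
#include <string.h>

#define MAX_NODES 4096
#define NAME_LEN  32            /* assumption: names truncated/padded to 31 chars */

/* Pointer-free record: links are indexes into the same array, -1 means none. */
typedef struct {
    int  left, right, parent;
    char name[NAME_LEN];
} FlatNode;

/* Pre-order flattening; returns the index assigned to node `n`. */
static int flatten(const BiNode *n, FlatNode *out, int *count, int parent)
{
    if (n == NULL) return -1;
    int idx = (*count)++;
    out[idx].parent = parent;
    strncpy(out[idx].name, n->name ? n->name : "", NAME_LEN - 1);
    out[idx].name[NAME_LEN - 1] = '\0';
    out[idx].left  = flatten(n->lchi, out, count, idx);
    out[idx].right = flatten(n->rchi, out, count, idx);
    return idx;
}

static void send_tree(const BiNode *root, int dest)
{
    static FlatNode buf[MAX_NODES];
    int count = 0;
    flatten(root, buf, &count, -1);
    MPI_Send(&count, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    MPI_Send(buf, count * (int)sizeof(FlatNode), MPI_BYTE, dest, 1, MPI_COMM_WORLD);
}

The receiver performs the two matching MPI_Recv() calls and rebuilds real pointers by walking the array once.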

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that will hold the character data of the file. I would also make an array whose indexes are line numbers and whose contents are the addresses of the beginnings of the lines. This array would be reallocated on need.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, just get the file size, read it into memory with fread(), then maintain a dynamic array that maps line numbers (indexes) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
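
A minimal sketch of that second pass, assuming the whole file is already in memory as buf/len (names are illustrative; reallocation failure handling omitted):

#include <stdlib.h>

/* Record a pointer to the first character of every line in buf[0..len-1].
 * Returns the number of lines; *lines receives a malloc'd array. */
static size_t index_lines(char *buf, size_t len, char ***lines)
{
    size_t count = 0, cap = 128;
    char **idx = malloc(cap * sizeof *idx);
    for (size_t i = 0; i < len; i++) {
        if (i == 0 || buf[i - 1] == '\n') {          /* a new line starts here */
            if (count == cap)
                idx = realloc(idx, (cap *= 2) * sizeof *idx);
            idx[count++] = &buf[i];
        }
    }
    *lines = idx;
    return count;
}

After that, the address of line r, column c is simply lines[r] + c (both zero-based).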
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1).
If lines are sorted, then you can perform a binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to calculate which line it is in, and then add X to the offsets of all following lines. Similar with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines (char *lines[]). You then get the line with the first dereference and the character with the second.
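
For the fixed-width-line idea at the top of this answer, the O(1) lookup really is just arithmetic; a sketch with an assumed width of 80 padded characters plus the newline:

#include <stdio.h>

#define LINE_WIDTH 81   /* assumption: 80 padded characters + '\n' per line */

/* Seek straight to (line, column): no scanning, so the cost is O(1). */
static int seek_to(FILE *f, long line, long column)
{
    return fseek(f, line * LINE_WIDTH + column, SEEK_SET);
}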
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for allocation and deletion, you can make it so the structs can delete themselves either automatically or manually. Using the above combination won't get you O(1) read-in (which isn't possible unless the file has a static format), but it will get you good performance.
If you know the file's lines have a static maximum length (i.e. no bigger than 256 chars per line), then all you need is the DynamicListOfArrays: write directly into an array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file can range from 0 bytes to many thousands of bytes; like any text file, it depends on the number of characters it contains.

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former will allow me to have two functions, parse_from_file and parse_from_string; however, I believe this approach will take up more memory. The latter will not have that disadvantage of using more memory.
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually try to buffer data for you, but they are also general purpose, so they have to watch out for usage patterns that differ from reading a file from start to finish; that shouldn't be too much overhead, though.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using GNU flex (the fast lexical analyzer generator, not the Adobe Flash thing) or something similar can greatly ease all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do you should try to make it so that your code can easily be changed to use a different form of next character peek and consume functions so that if you change your mind you won't have to start over.
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
The most efficient option on a POSIX system would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap, and parse it from there. Modern systems are quite efficient with that in that they prefetch data when they detect streaming access, multiple instances of your program that parse the same file will share the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
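
A minimal sketch of that mmap approach (POSIX only; error handling abbreviated):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file read-only and hand the parser one contiguous view of it. */
static const char *map_source(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return NULL;
    *len = (size_t)st.st_size;
    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : (const char *)p;
}

The parser then walks the returned bytes like an in-memory string, remembering that there is no terminating '\0'; release the mapping with munmap() when done.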

ANSI C hash table implementation with data in one memory block

I am looking for an open source C implementation of a hash table that keeps all the data in one memory block, so it can easily be sent over a network, let's say.
I can only find ones that allocate small pieces of memory for every key-value pair added to them.
Thank you very much in advance for all the inputs.
EDIT: It doesn't necessarily need to be a hash table, whatever key-value pair table would probably do.
The number of times you would serialize such a data structure (and sending it over a network is serializing as well) versus the number of times you would use it (in your program) is pretty low. So, most implementations focus more on speed instead of the "maybe easier to serialize" side.
If all the data would be in one allocated memory block a lot of operations on that data structure would be a bit expensive because you would have to:
reallocate memory on add-operations
most likely compress / vacuum on delete-operations (so that the one block you like so much is dense and has no holes)
Most network operations are buffered anyway, just iterate over the keys and send keys + values.
On a unix system I'd probably utilise a shared memory buffer (see shm_open()), or if that's not available a memory-mapped file with the MAP_SHARED flag, see the OS-specific differences though http://en.wikipedia.org/wiki/Mmap
If both shm_open and mmap aren't available you could still use a file on disk (to some extent); you'd have to take care of proper locking. I'd send an unlock signal to the next process, and maybe the offset of the updated portion of the file; then that process locks the file again, seeks to the interesting part and proceeds as usual (updates/deletes/etc.).
In any case, you could freely design the layout of the hashtable or whatever you want, like having fixed width key/seek pairs. That way you'd have the fast access to the keys of your hashtable and if necessary you seek to the data portion, then copy/delete/modify/etc.
Ideally this file should be on a ram disk, of course.
I agree completely with akira (+1). Just one more comment on data locality. Once the table gets larger, or if the satellite data is large enough, there's most certainly cache pollution which slows down any operation on the table additionally, or in other words you can rely on the level-1/2/3 cache chain to serve the key data promptly whilst putting up with a cache miss when you have to access the satellite data (e.g. for serialisation).
Libraries providing hashtables tend to hide the details and make the thing work efficiently (that is normally what programmers want when they use a hashtable), so normally the way they handle the memory is hidden from the final programmer's eyes, and programmers shouldn't rely on the particular "memory layout", which may change in a following version of the library.
Write your own function to serialize (and unserialize) the hashtable in the most convenient way for your usage. You can keep the serialized content if you need it several times (of course, when the hashtable is changed, you need to update the serialized "version" kept in memory).
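
For example, a hand-rolled serializer only needs to walk the table with whatever iteration interface the library offers and append length-prefixed records to one contiguous buffer; the helper below is a sketch of that record format (names are illustrative):

#include <stdint.h>
#include <string.h>

/* Append one key/value pair as [u32 klen][key][u32 vlen][value].
 * Returns the new write offset, or 0 if the pair does not fit. */
static size_t pack_pair(unsigned char *buf, size_t cap, size_t off,
                        const void *key, uint32_t klen,
                        const void *val, uint32_t vlen)
{
    if (off + 8 + klen + vlen > cap)
        return 0;
    memcpy(buf + off, &klen, sizeof klen); off += sizeof klen;
    memcpy(buf + off, key, klen);          off += klen;
    memcpy(buf + off, &vlen, sizeof vlen); off += sizeof vlen;
    memcpy(buf + off, val, vlen);          off += vlen;
    return off;
}

The receiver reads records until the buffer is exhausted and inserts them into its own hashtable, so nothing about either table's in-memory layout matters.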
