File handling and Graph ADT in C

I created a Thesaurus program in C.
In my program, the user can insert a word and a synonym for it.
Another function searches for a word and then displays the synonyms for that word.
My question is how can I keep the words I have inserted and still retrieve them when I run the program again?
Is file handling a solution?
How would I do it?

You need to design a simple file format that can describe your data, then write code that writes in that format and code that reads it back, handling errors properly.
As a simple example, you could have a file which stored lines like:
happy:joyful
happy:exuberant
In this case you would also need to make sure that users can't enter blank lines or colons as word input, so that the syntax is unambiguous.
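For instance, a minimal sketch of appending and looking up entries in that format (the file path and function names here are just illustrative):

    #include <stdio.h>
    #include <string.h>

    /* Append one "word:synonym" line to the data file. */
    int save_pair(const char *path, const char *word, const char *synonym)
    {
        FILE *fp = fopen(path, "a");
        if (fp == NULL)
            return -1;
        fprintf(fp, "%s:%s\n", word, synonym);
        fclose(fp);
        return 0;
    }

    /* Print every synonym stored for the given word. */
    void print_synonyms(const char *path, const char *word)
    {
        char line[256];
        FILE *fp = fopen(path, "r");
        if (fp == NULL)
            return;
        while (fgets(line, sizeof line, fp) != NULL) {
            char *colon = strchr(line, ':');
            if (colon == NULL)
                continue;                    /* skip malformed lines */
            *colon = '\0';
            if (strcmp(line, word) == 0)
                printf("%s", colon + 1);     /* the synonym keeps its '\n' */
        }
        fclose(fp);
    }

You could call save_pair() each time the user inserts a word, or alternatively rewrite the whole file from your in-memory structures when the program exits; the important part is agreeing on the on-disk syntax.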

A program cannot reliably keep information in memory between runs*, so it has to store such information in a file. Files are designed to store information between runs of a program.
As to how you'll do it, that's your decision. Most likely, you'll choose a simple and readable format with, for example, the head word at the start of a line, followed by a colon, then a list of semi-colon separated synonyms:
head: skull; cranium; noggin; noodle
head: aptitude; faculty; talent; gift; capacity; ability; mind; brain
This is flexible and allows you to use phrases (even phrases containing commas) in the synonym lists. You can sort the data before you write it out for convenience when reading in, but it is generally best to validate that the data is still sorted when you read it back in (at the start of the next run) because someone may have hand-edited the file and not preserved sorted order.
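A rough sketch of reading such a line back (the splitting logic is one possible approach, not the only one): split at the first colon to get the head word, then walk the semicolon-separated synonyms:

    #include <stdio.h>
    #include <string.h>

    /* Parse one line of the form "head: syn; syn; syn" and print its parts.
     * The line is modified in place, so pass a writable buffer. */
    void parse_entry(char *line)
    {
        char *colon = strchr(line, ':');
        if (colon == NULL)
            return;                              /* not a valid entry */
        *colon = '\0';
        printf("head word: %s\n", line);

        for (char *syn = strtok(colon + 1, ";\n"); syn != NULL;
             syn = strtok(NULL, ";\n")) {
            while (*syn == ' ')                  /* trim leading spaces */
                syn++;
            printf("  synonym: %s\n", syn);
        }
    }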
* If the process uses System V shared memory IPC, then you could store the data in a shared memory segment that would exist between runs of the program. However, it is not a particularly sensible idea to try doing that. A file has better durability; it will (usually) survive reboots, and could be placed on a distributed file system whereas shared memory is restricted to a single machine.

Related

GloVe: training with a single text file. Does GloVe try to read it into memory, or is it streamed?

I need to train some GloVe models to compare them with word2vec and fastText output. It's implemented in C, and I can't read C code. The GitHub repository is here.
The training corpus needs to be formatted into a single text file. For me, this would be >>100G -- way too big for memory. Before I waste time constructing such a thing, I'd be grateful if someone could tell me whether the glove algo tries to read the thing into memory, or whether it streams it from disk.
If the former, then glove's current implementation wouldn't be compatible with my data (I think). If the latter, I'd have at it.
GloVe first constructs a word co-occurrence matrix and later works on that. While constructing this matrix, the linked implementation streams the input file on several threads; each thread reads one line at a time.
The required memory depends mainly on the number of unique words in your corpus, as long as lines are not excessively long.
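This is not GloVe's actual code, but a minimal sketch of the line-at-a-time streaming pattern it uses, which shows why memory use does not grow with the size of the corpus:

    #define _POSIX_C_SOURCE 200809L   /* for getline() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    void stream_corpus(const char *path)
    {
        FILE *fp = fopen(path, "r");
        if (fp == NULL)
            return;

        char *line = NULL;
        size_t cap = 0;
        ssize_t len;

        /* Only one line is held in memory at a time (per thread). */
        while ((len = getline(&line, &cap, fp)) != -1) {
            /* tokenize the line and update co-occurrence counts here */
            (void)len;
        }

        free(line);
        fclose(fp);
    }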

Temporary File in C

I am writing a program that outputs a file. The file's content has two parts, but the second part is computed before the first. I was thinking of creating a temporary file and writing the data to it, then creating a permanent file, dumping the temp file's content into the permanent one, and deleting the temp file. I saw some posts saying this does not work and might cause problems across different compilers or platforms.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?
A temporary file can be created; although some people do say they have problems with this, I personally have used them with no issues. Using the platform's functions to obtain a temporary file is the best option. Don't assume you can write to C:\ etc. on Windows, as this isn't always possible. Don't assume a filename, in case the file is already in use. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank, coming back to fill it in later? That would eliminate the need for the temporary file.
Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.
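If you do use a temporary file, the portable way in standard C is tmpfile(), which avoids filename clashes entirely and deletes the file automatically when it is closed. A rough sketch of the buffer-then-append idea (output file name and content are placeholders):

    #include <stdio.h>

    int write_output(const char *out_path)
    {
        FILE *tmp = tmpfile();          /* anonymous temp file, auto-deleted */
        FILE *out = fopen(out_path, "w");
        char buf[4096];
        size_t n;

        if (tmp == NULL || out == NULL)
            return -1;

        /* ... compute the second part and fwrite()/fprintf() it into tmp ... */

        /* Later, write the first part straight into the real file ... */
        fprintf(out, "first part goes here\n");

        /* ... then append the buffered second part. */
        rewind(tmp);
        while ((n = fread(buf, 1, sizeof buf, tmp)) > 0)
            fwrite(buf, 1, n, out);

        fclose(tmp);                    /* the temp file disappears here */
        fclose(out);
        return 0;
    }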

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that would hold the character data of the file. I would also make an array whose indexes are line numbers and whose contents are the addresses of the beginnings of those lines. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, get the file size, read the whole file into memory with fread(), and then maintain a dynamic array that maps line numbers (indexes) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
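A minimal sketch of that two-pass approach (error handling trimmed, names illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    static char   *buf;      /* whole file contents */
    static char  **lines;    /* lines[i] points at the first char of line i */
    static size_t  n_lines;

    void load_file(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);

        buf = malloc(size + 1);
        fread(buf, 1, size, fp);
        buf[size] = '\0';
        fclose(fp);

        /* Second pass: record where each line starts. */
        size_t cap = 128;
        lines = malloc(cap * sizeof *lines);
        lines[n_lines++] = buf;
        for (long i = 0; i < size; i++) {
            if (buf[i] == '\n' && i + 1 < size) {
                if (n_lines == cap)
                    lines = realloc(lines, (cap *= 2) * sizeof *lines);
                lines[n_lines++] = &buf[i + 1];
            }
        }
    }

After that, the address of line 3, column 15 is simply lines[2] + 14 (zero-based indexes).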
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your lookup O(1).
If lines are sorted, then you can perform a binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to work out which line that is and then add X to the offsets of all following lines. Similar with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *[]. You then get a line with the first dereference and a character with the second dereference.
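For the fixed-width case, a small sketch of the O(1) lookup (LINE_WIDTH is an assumed padding width, and buf must hold at least LINE_WIDTH + 1 bytes):

    #include <stdio.h>

    #define LINE_WIDTH 80        /* every line padded to this many chars */

    int read_line(FILE *fp, long line_no, char *buf)
    {
        /* line n starts at byte n * (LINE_WIDTH + 1), the +1 being '\n' */
        if (fseek(fp, line_no * (LINE_WIDTH + 1L), SEEK_SET) != 0)
            return -1;
        if (fread(buf, 1, LINE_WIDTH, fp) != LINE_WIDTH)
            return -1;
        buf[LINE_WIDTH] = '\0';
        return 0;
    }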
As an alternative suggestion (although I do not fully understand the question), you might want to consider a struct-based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use an OO-style design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
Build suitable helper functions for malloc and free, and make it so the structs can delete themselves either automatically or manually. Using the above combination won't get you O(1) reads (which isn't possible unless the files have a static format), but it will get you good performance.
If you know the file's line length is static, i.e. no bigger than 256 chars per line, then all you need is the DynamicListOfArrays: write directly to an array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is written for C++, it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on Stack Overflow, it can be used for any purpose, even commercially.
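As a rough illustration only (my guess at the intended design, not the answerer's actual code), the structs might look something like:

    #include <stddef.h>

    /* A growable buffer of characters (one line being assembled). */
    typedef struct {
        char   *data;
        size_t  size;
        size_t  capacity;
    } CharList;

    /* A finished line: a heap-allocated string of known size. */
    typedef struct {
        char   *chars;
        size_t  size;
    } DynamicArray;

    /* A growable list of lines; "freezing" it into a plain array of
     * DynamicArray gives direct index-to-line access. */
    typedef struct {
        DynamicArray *items;
        size_t        count;
        size_t        capacity;
    } DynamicListOfArrays;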
Average size of a source file? Does such a thing exist? A source file can range from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.

Determining string uniqueness in a large file

In C, I want to process a file that contains 10^8 16-digit alphanumeric strings and determine whether each one is unique in the file. How can I do that?
As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.
If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.
Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.
I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.
Do a bucket sort (hash function) into multiple files, one file for each bucket. Then process each bucket's file to determine whether all strings within the bucket are unique.
You'll need to sort the file.
Just load it into a single memory block, run qsort from the C runtime library on that block, and then finally run sequentially over all the strings to check for two consecutive strings that are the same.
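A sketch of that approach, assuming the strings are 16 characters each and have been loaded into one contiguous block (the record size and layout are assumptions):

    #include <stdlib.h>
    #include <string.h>

    #define REC_LEN 16    /* length of each string, without any newline */

    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, REC_LEN);
    }

    /* Returns the number of strings that duplicate an earlier one. */
    size_t count_duplicates(char *records, size_t n)
    {
        size_t dups = 0;
        qsort(records, n, REC_LEN, cmp_rec);
        for (size_t i = 1; i < n; i++)
            if (memcmp(records + (i - 1) * REC_LEN,
                       records + i * REC_LEN, REC_LEN) == 0)
                dups++;
        return dups;
    }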
Take a library with set/map functions, e.g. see link text

Lots of questions about file I/O (reading/writing message strings)

For this university project I'm doing (for which I've made a couple of posts in the past), which is some sort of social network, users are required to be able to exchange messages.
At first, I designed my data structures to hold ALL messages in a linked list, limiting the message size to 256 chars. However, I think my instructors would prefer that I save the messages on disk and read them only when I need them. Of course, they won't say what they prefer; I need to make a choice and justify it as best I can.
One thing to keep in mind is that I only need to save the latest 20 messages from each user, no more.
Right now I have a Hash Table that will act as the inbox; this will be inside the user profile. This Hash Table will be indexed by name (the user that sent the message). The value for each element will be a data structure holding an array of size_t with 20 elements (the 20 messages mentioned above). The idea is to keep track of the disk file offsets and bytes written. Then, when I need to read a message, I just use fseek() and read the necessary bytes.
I think this could work nicely... I could use just one single file to hold all messages from all users in the network. I'm saying one single file because a colleague asked an instructor about saving the messages from each user independently, and he replied that it might not be the best approach because the file system has its limits. That's why I'm thinking of going the single-file route.
However, this presents a problem... Since I only need to save the latest 20 messages, I need to discard the older ones when I reach this limit.
I have no idea how to do this... All I know about is fread() and fwrite() to read/write bytes from/to files. How can I go to a file offset and say "hey, delete the following X bytes"? Even if I could do that, there's another problem: all offsets after that one would be completely different, and I would have to process all the users' mailboxes to fix them. Which would be a pain...
So, any suggestions to solve my problems? What do you suggest?
You can't arbitrarily delete bytes from the middle of a file; the only way that works is to rewrite the entire file without them. Disregarding the question of whether doing things this way is a good idea, if you have fixed length fields, one solution would be to just overwrite the oldest message with the newest one; that way, the size / position of the message on disk doesn't change, so none of the other offsets are affected.
Edit: If you're allowed to use external libraries, making a simple SQLite db could be a good solution.
You're complicating your life way more than you need to.
If your messages are 256 characters, then use an array of 256 characters to hold each message.
Write it to disk with fwrite, read with fread, delete it by changing the first character of the string to \0 (or whatever else strikes your fancy) and write that to disk.
Keep an index of the messages in a simple structure (username/recno) and bounce around in the file with fseek. You can either brute-force the next free record when writing a new one (start reading at the beginning of the file and stop when you hit your \0) or keep an index of free records in an array and grab one of them when writing a new one (or if your array is empty then fseek to the end of the file and write a complete new record.)
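A minimal sketch of that fixed-length-record scheme (names are illustrative; the file is assumed to be opened with fopen(path, "r+b") so records can be overwritten in place):

    #include <stdio.h>

    #define MSG_LEN 256    /* fixed message size, as suggested above */

    /* Overwrite record number `recno` (0-based). */
    int write_record(FILE *fp, long recno, const char msg[MSG_LEN])
    {
        if (fseek(fp, recno * (long)MSG_LEN, SEEK_SET) != 0)
            return -1;
        return fwrite(msg, MSG_LEN, 1, fp) == 1 ? 0 : -1;
    }

    /* Read record number `recno` into msg. */
    int read_record(FILE *fp, long recno, char msg[MSG_LEN])
    {
        if (fseek(fp, recno * (long)MSG_LEN, SEEK_SET) != 0)
            return -1;
        return fread(msg, MSG_LEN, 1, fp) == 1 ? 0 : -1;
    }

    /* "Delete" a message by blanking its first byte, as described above. */
    int delete_record(FILE *fp, long recno)
    {
        char msg[MSG_LEN];
        if (read_record(fp, recno, msg) != 0)
            return -1;
        msg[0] = '\0';
        return write_record(fp, recno, msg);
    }

Because every record has the same size, overwriting the oldest of a user's 20 slots never shifts anyone else's offsets.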
I want to suggest another solution for completeness' sake:
Strings end with a null byte, "hello world\0", so you could read the raw binary data until you reach '\0'.
Other data types have fixed sizes; beware of byte order (endianness).
You could also put a length prefix before each message, so you know its string length:
"11hello world;2hi;15my name is loco"
This makes it possible to treat the raw snippets as data fields.
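A minimal sketch of that length-prefix idea, written as a fixed-size binary field rather than text so the boundary is unambiguous (this detail is my assumption on top of the answer):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Write one message as a 4-byte length prefix followed by its bytes.
     * The length is in host byte order; convert it (e.g. with htonl/ntohl)
     * if the file has to move between machines. */
    int write_message(FILE *fp, const char *msg)
    {
        uint32_t len = (uint32_t)strlen(msg);
        if (fwrite(&len, sizeof len, 1, fp) != 1)
            return -1;
        return fwrite(msg, 1, len, fp) == len ? 0 : -1;
    }

    /* Read the next message; the caller frees the returned buffer. */
    char *read_message(FILE *fp)
    {
        uint32_t len;
        if (fread(&len, sizeof len, 1, fp) != 1)
            return NULL;
        char *msg = malloc(len + 1);
        if (msg == NULL || fread(msg, 1, len, fp) != len) {
            free(msg);
            return NULL;
        }
        msg[len] = '\0';
        return msg;
    }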
