Lots of questions about file I/O (reading/writing message strings)


For a university project I'm doing (for which I've made a couple of posts in the past), which is some sort of social network, users must be able to exchange messages.
At first, I designed my data structures to hold ALL messages in a linked list, limiting the message size to 256 chars. However, I think my instructors will prefer if I save the messages on disk and read them only when I need them. Of course, they won't say what they prefer, I need to make a choice and justify the best I can why I went that route.
One thing to keep in mind is that I only need to save the latest 20 messages from each user, no more.
Right now I have a hash table that will act as the inbox; it will live inside the user profile. The hash table will be indexed by name (the user who sent the message). The value for each element will be a data structure holding an array of size_t with 20 elements (20 messages, as I said above). The idea is to keep track of the disk file offsets and bytes written. Then, when I need to read a message, I just need to use fseek() and read the necessary bytes.
I think this could work nicely... I could use one single file to hold all messages from all users in the network. I'm saying one single file because a colleague asked an instructor about saving the messages from each user independently, to which he replied that it might not be the best approach because the file system has its limits. That's why I'm thinking of going the single-file route.
However, this presents a problem... Since I only need to save the latest 20 messages, I need to discard the older ones when I reach this limit.
I have no idea how to do this... All I know is about fread() and fwrite() to read/write bytes from/to files. How can I go to a file offset and say "hey, delete the following X bytes"? Even if I could do that, there's another problem... All offsets after that point would change, and I would have to process all users' mailboxes to fix them. Which would be a pain...
So, any suggestions to solve my problems? What do you suggest?

You can't arbitrarily delete bytes from the middle of a file; the only way that works is to rewrite the entire file without them. Disregarding the question of whether doing things this way is a good idea, if you have fixed length fields, one solution would be to just overwrite the oldest message with the newest one; that way, the size / position of the message on disk doesn't change, so none of the other offsets are affected.
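A minimal sketch of the overwrite-the-oldest idea, assuming fixed 256-byte records and 20 slots per sender as in the question. The names ring_store and ring_load, and the idea of giving each sender a contiguous block of slots starting at byte offset base, are made up for illustration; how you assign base to each sender is up to your index.

```c
#include <stdio.h>
#include <string.h>

#define MSG_SIZE  256   /* fixed record length, taken from the question */
#define MSG_SLOTS 20    /* only the latest 20 messages are kept */

/* Hypothetical helper: store `text` in the ring of MSG_SLOTS records that
 * starts at byte offset `base`. `*count` is the number of messages stored
 * so far; once the ring is full the oldest record is overwritten.
 * Returns 0 on success, -1 on I/O error. */
int ring_store(FILE *fp, long base, unsigned long *count, const char *text)
{
    char record[MSG_SIZE] = {0};            /* zero padding keeps every record the same size */
    long slot = (long)(*count % MSG_SLOTS);

    strncpy(record, text, MSG_SIZE - 1);    /* over-long messages are truncated */
    if (fseek(fp, base + slot * MSG_SIZE, SEEK_SET) != 0)
        return -1;
    if (fwrite(record, MSG_SIZE, 1, fp) != 1)
        return -1;
    ++*count;
    return 0;
}

/* Hypothetical helper: read the record in `slot` into `out`. */
int ring_load(FILE *fp, long base, long slot, char out[MSG_SIZE])
{
    if (fseek(fp, base + slot * MSG_SIZE, SEEK_SET) != 0)
        return -1;
    return fread(out, MSG_SIZE, 1, fp) == 1 ? 0 : -1;
}
```

Because every slot has the same size, message 21 simply lands where message 1 used to be and no other offset moves.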
Edit: If you're allowed to use external libraries, making a simple SQLite db could be a good solution.

You're complicating your life way more than you need to.
If your messages are 256 characters, then use an array of 256 characters to hold each message.
Write it to disk with fwrite, read with fread, delete it by changing the first character of the string to \0 (or whatever else strikes your fancy) and write that to disk.
Keep an index of the messages in a simple structure (username/recno) and bounce around in the file with fseek. You can either brute-force the next free record when writing a new one (start reading at the beginning of the file and stop when you hit your \0) or keep an index of free records in an array and grab one of them when writing a new one (or if your array is empty then fseek to the end of the file and write a complete new record.)
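Roughly, the brute-force variant could look like the sketch below. The function names and the REC_SIZE constant are made up for illustration; a record counts as free when its first byte is \0, as described above.

```c
#include <stdio.h>
#include <string.h>

#define REC_SIZE 256   /* fixed record length, assumed from the question */

/* Scan from the start of the file for the first record whose first byte is
 * '\0'. Returns its record number, or the total record count if none is
 * free (meaning: append a new record). Returns -1 if the seek fails. */
long find_free_record(FILE *fp)
{
    char rec[REC_SIZE];
    long recno = 0;

    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    while (fread(rec, REC_SIZE, 1, fp) == 1) {
        if (rec[0] == '\0')
            return recno;
        recno++;
    }
    return recno;
}

/* Write `text` into record `recno`, zero-padded to REC_SIZE bytes. */
int write_record(FILE *fp, long recno, const char *text)
{
    char rec[REC_SIZE] = {0};

    strncpy(rec, text, REC_SIZE - 1);
    if (fseek(fp, recno * REC_SIZE, SEEK_SET) != 0)
        return -1;
    return fwrite(rec, REC_SIZE, 1, fp) == 1 ? 0 : -1;
}

/* Mark record `recno` as deleted by overwriting its first byte with '\0'. */
int delete_record(FILE *fp, long recno)
{
    if (fseek(fp, recno * REC_SIZE, SEEK_SET) != 0)
        return -1;
    return fputc('\0', fp) == EOF ? -1 : 0;
}
```

Keeping a free list in memory instead of scanning saves the linear pass, at the cost of having to rebuild that list when the program starts.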

I want to suggest another solution for completeness' sake:
Strings should end with a null byte, "hello world\0", so you can read the raw binary data until you reach "\0".
Other data types have fixed sizes; beware of byte order (endianness).
You could also put a length prefix before each message, so you know its string length:
"11hello world;2hi;15my name is loco"
Thus making it possible to treat raw snippets like data fields.
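A small sketch of reading that kind of length-prefixed data, assuming the exact layout shown above (decimal length, payload, optional ';' separator). The function name next_field is invented for the example. Note that this text format becomes ambiguous if a message itself starts with a digit; a fixed-width binary length avoids that.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser: reads one "<decimal length><payload>" field starting
 * at *pos, copies the payload into `out` (NUL-terminated), skips a trailing
 * ';' if present and advances *pos.
 * Returns the payload length, or -1 at end of input or on a malformed field. */
long next_field(const char *buf, size_t buflen, size_t *pos,
                char *out, size_t outsz)
{
    size_t p = *pos, len = 0;

    if (p >= buflen || !isdigit((unsigned char)buf[p]))
        return -1;
    while (p < buflen && isdigit((unsigned char)buf[p]))
        len = len * 10 + (size_t)(buf[p++] - '0');
    if (len > buflen - p || len + 1 > outsz)
        return -1;                       /* payload runs past the data or the output buffer */
    memcpy(out, buf + p, len);
    out[len] = '\0';
    p += len;
    if (p < buflen && buf[p] == ';')
        p++;
    *pos = p;
    return (long)len;
}
```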

Related

Replacing spaces with %20 in a file on hard disk

I have gone through all the answers to the similar question posted earlier, "Replacing spaces with %20 in C". However, I can't work out how to do this for a file on disk, where disk accesses can be expensive and the file is too long to load into memory at once. If the file did fit in memory, we could simply load it and write the result back over the existing one.
Further, for memory constraints one would like to replace the original file and not create a new one.

Horrible idea. Since the "%20" is longer than " " you can't just replace chars inside the file, you have to move whatever follows it further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could try to determine the total growth of the file on a first pass, then do the whole shifting from the back of the file taking blocksize into account and adjusting the shifting as you encounter " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates while doing the thing you won't end up with a half-converted file. That's actually the reason why many programs write to a new file even if they wouldn't need to, to make sure the file is "always" correct because the new file only replaces the old file after it has been written successfully. It's a simple transaction scheme that doesn't take system failures into account, but works well for application failures (including users forcibly terminating the program)
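The read, replace, write, rename sequence might look something like this. It is only a sketch: the function name and the idea of passing the temporary file name in are my own choices, and it relies on rename() replacing an existing file, which POSIX guarantees but the C standard leaves implementation-defined (on Windows it fails if the target exists).

```c
#include <stdio.h>

/* Copy `path` to `tmp_path` with every space replaced by "%20", then rename
 * the new file over the original. `tmp_path` should be on the same
 * filesystem as `path`. Returns 0 on success, -1 on failure (in which case
 * the original file is left untouched). */
int escape_spaces_in_file(const char *path, const char *tmp_path)
{
    FILE *in = fopen(path, "rb");
    if (!in)
        return -1;
    FILE *out = fopen(tmp_path, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    int c, ok = 1;
    while (ok && (c = getc(in)) != EOF) {
        if (c == ' ')
            ok = fputs("%20", out) != EOF;
        else
            ok = putc(c, out) != EOF;
    }
    ok = ok && !ferror(in);
    fclose(in);
    if (fclose(out) != 0)          /* fclose flushes; a failure here means lost data */
        ok = 0;

    if (!ok || rename(tmp_path, path) != 0) {
        remove(tmp_path);          /* clean up the partial file */
        return -1;
    }
    return 0;
}
```

stdio already buffers getc/putc, so the character-at-a-time loop does not translate into one disk access per byte.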

For the replacement part, you can have two buffers, one that you read into and one that you write the translated string to and which you write to disk. Depending on your memory constraints even a small input buffer (say 1KiB) is enough. However, to avoid repeating reallocations you can keep a fixed buffer for the output, and have it three times the size of the input buffer (worst case scenario, input is all spaces). Total that's 4KiB of memory, plus whatever buffers the OS uses. I would recommend to use a multiple of the disk block size as the input size.
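As a rough illustration of the two-buffer scheme, with a 1 KiB input buffer and an output buffer three times that size (the function names and buffer constants are placeholders, not anything standard):

```c
#include <stdio.h>

#define IN_SIZE  1024            /* input buffer; ideally a multiple of the disk block size */
#define OUT_SIZE (3 * IN_SIZE)   /* worst case: every input byte is a space */

/* Expand `n` bytes from `in` into `out`, replacing each space with "%20".
 * `out` must have room for 3 * n bytes. Returns the number of bytes written. */
size_t expand_spaces(const char *in, size_t n, char *out)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == ' ') {
            out[w++] = '%';
            out[w++] = '2';
            out[w++] = '0';
        } else {
            out[w++] = in[i];
        }
    }
    return w;
}

/* Translate everything readable from `in` and write it to `out`. */
int expand_stream(FILE *in, FILE *out)
{
    char ibuf[IN_SIZE], obuf[OUT_SIZE];
    size_t n;

    while ((n = fread(ibuf, 1, sizeof ibuf, in)) > 0) {
        size_t w = expand_spaces(ibuf, n, obuf);
        if (fwrite(obuf, 1, w, out) != w)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Since the output buffer is sized for the worst case, no reallocation is ever needed.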
The problem is your requirement of reading and writing to the same file. Unfortunately this does not work as a simple in-place pass. If you read char-by-char, think about what happens when you reach a space... You then have to write three characters and overwrite the next two characters in the file. Not exactly what you want.

Temporary File in C

I am writing a program which outputs a file. This file has two parts. The second part, however, is computed before the first. I was thinking of creating a temporary file and writing the data to it, then creating a permanent file, dumping the temp file's content into it, and deleting the temp file. I saw some posts saying that this does not work and that it might cause problems across different compilers or something.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?

A temporary file can be created. Although some people say they have problems with this, I personally have used them with no issues. Using the platform functions to obtain a temporary file is the best option. Don't assume you can write to c:\ etc. on Windows, as this isn't always possible. Don't assume a filename, in case the file is already in use. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank to come back and fill in later? This would eliminate the need for the temporary file.
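For the temporary-file route, standard C already has tmpfile(), which picks the name and location for you and removes the file when it is closed. A sketch, with placeholder strings standing in for the real data and function names I made up:

```c
#include <stdio.h>

/* Append the whole contents of `src` (from its beginning) to `dst`. */
int append_stream(FILE *src, FILE *dst)
{
    char buf[4096];
    size_t n;

    rewind(src);
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        if (fwrite(buf, 1, n, dst) != n)
            return -1;
    return ferror(src) ? -1 : 0;
}

/* Hypothetical driver: the second part is produced first and parked in a
 * temporary file, then the first part is written, then the parked data is
 * appended after it. */
int build_output(FILE *out)
{
    FILE *tmp = tmpfile();
    if (!tmp)
        return -1;

    fputs("second part\n", tmp);     /* computed first */
    fputs("first part\n", out);      /* computed later, but written first */

    int rc = append_stream(tmp, out);
    fclose(tmp);                      /* the temporary file is removed on close */
    return rc;
}
```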

Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that will hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses of the beginning of each line. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only, just get the file size, read it in memory with fread(), then you have to maintain a dynamic array which maps the line numbers (index) to pointer to the first character in the line. Someone below suggested to build this array lazily, which seems a good idea in many cases.

I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.

1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
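A sketch of step 2, assuming the file is already in memory as buf[0..len). The function names are mine, and lines and columns are taken to be 1-based as in the question's "line 3, column 15" example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Build an array holding the offset of the first character of every line in
 * buf[0..len). Returns a malloc'd array (caller frees) and stores the number
 * of lines in *nlines; returns NULL if allocation fails. */
size_t *build_line_index(const char *buf, size_t len, size_t *nlines)
{
    size_t cap = 16, n = 0;
    size_t *idx = malloc(cap * sizeof *idx);

    if (!idx)
        return NULL;
    if (len > 0)
        idx[n++] = 0;                       /* first line starts at offset 0 */
    for (size_t i = 0; i + 1 < len; i++) {  /* a '\n' as the very last byte starts no new line */
        if (buf[i] == '\n') {
            if (n == cap) {
                size_t *tmp = realloc(idx, 2 * cap * sizeof *idx);
                if (!tmp) {
                    free(idx);
                    return NULL;
                }
                idx = tmp;
                cap *= 2;
            }
            idx[n++] = i + 1;               /* line begins one past the newline */
        }
    }
    *nlines = n;
    return idx;
}

/* Address of (line, col), both 1-based; NULL if the line does not exist.
 * The column is not checked against the line length. */
const char *text_at(const char *buf, const size_t *idx, size_t nlines,
                    size_t line, size_t col)
{
    if (line == 0 || line > nlines || col == 0)
        return NULL;
    return buf + idx[line - 1] + (col - 1);
}
```

Storing offsets rather than pointers means the index stays valid if the buffer is later moved or the index is saved to disk.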

It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...

Solution: If the lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1).
If the lines are sorted, then you can perform a binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to calculate which line it is and then add this X to all the following lines. Similarly with deletion. You essentially get O(nLines). And the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get a line by the first dereference and a character by the second dereference.

As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
Build suitable helper functions for malloc and delete, and make it so the structs can delete themselves either automatically or manually. Using the above combinations won't get you O(1) reads (which isn't possible unless the files have a static format), but it will get you good time.
If you know the file's static length (at least per line), i.e. no more than 256 chars per line, then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.

Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes. Like any text file, it depends on the number of characters it contains.

How do you read a file until you hit a certain string in c?

I wanted to know how, in C, you can read a file until the reading hits a certain string, or character array. What I want to be able to do is, once the file hits that string, have the position set at that point. I am going to use fseek for that, and that's not a problem. It's just the reading until a certain string is hit that I am not able to do. I've been reading up on some of the functions, but there doesn't seem to be anything that helps with this. fgets is the closest thing, but I don't want to provide a certain number of characters to be read, as I don't know how many. Can you give me some tips on how to do this?
Thanks!

There are many efficient string searching algorithms, each of which can be implemented in C.
http://en.wikipedia.org/wiki/String_searching_algorithm
If you're looking for a string of length N, easiest is to keep a circular buffer of length N and read 1 byte at a time from the file adding it to the circular buffer. At each step you compare your buffer with the string you're searching for. It's highly inefficient but easy to code.
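The circular-buffer approach could be coded roughly like this. The function name is mine, and the returned offset is counted from wherever the stream was positioned when the function was called (so from the start of the file if you have just opened it).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read `fp` one byte at a time, keeping the last strlen(needle) bytes in a
 * circular buffer. Returns the offset (relative to where reading started) of
 * the first occurrence of `needle`, or -1 if it is not found. */
long find_in_stream(FILE *fp, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0)
        return -1;

    char *ring = malloc(n);
    if (!ring)
        return -1;

    size_t count = 0;      /* bytes read so far */
    long found = -1;
    int c;

    while ((c = getc(fp)) != EOF) {
        ring[count % n] = (char)c;
        count++;
        if (count >= n) {
            size_t start = count % n;          /* index of the oldest byte in the ring */
            size_t i = 0;
            while (i < n && ring[(start + i) % n] == needle[i])
                i++;
            if (i == n) {
                found = (long)(count - n);
                break;
            }
        }
    }
    free(ring);
    return found;
}
```

Once you have the offset, fseek(fp, offset, SEEK_SET) puts you at the start of the match.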

There's no built-in function to do exactly what you want, but there are a few options.
Option one: Read data in chunks. You don't know exactly where your data is, so read in a few KB of data at a time, and search within these chunks. Make sure you deal with the case where the string you're looking for straddles a chunk boundary! Once you've located the string, use fseek() to position yourself at the start of it.
Option two: Memory map the file and use memmem() on the entire file (as mapped into memory). This requires non-portable calls to set up the memory mapping, so you'll need to know your OS (or use a portability wrapper library like glib). On 32-bit machines, it will also limit the size of files you can search in to a few hundred megabytes. It is, however, a very simple and efficient approach when it's an option.
If you go with option one, the trickiest part will be dealing with the chunk-straddling case. One option is to always keep two chunks in memory, and restart the search so it begins (length of target string) - 1 bytes before the end of the previous block. The actual search could then be done using memmem() or any other string searching algorithm. You could also convert your search into a DFA (since it is a regular language) and keep the current state across blocks.
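Here is one way option one might be written. It is a sketch with made-up names; instead of keeping two full chunks it carries the last (needle length - 1) bytes over to the front of the buffer, which has the same effect. memmem() is a GNU extension, so a plain memcmp loop stands in for it here.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096   /* bytes read per fread call; an arbitrary choice */

/* Naive stand-in for memmem(). */
static const char *find_bytes(const char *hay, size_t hlen,
                              const char *needle, size_t nlen)
{
    if (nlen == 0 || hlen < nlen)
        return NULL;
    for (size_t i = 0; i + nlen <= hlen; i++)
        if (hay[i] == needle[0] && memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
    return NULL;
}

/* Search `fp` chunk by chunk. Returns the offset of the first match relative
 * to where reading started, or -1 if there is none. */
long find_chunked(FILE *fp, const char *needle)
{
    size_t nlen = strlen(needle);
    if (nlen == 0 || nlen > CHUNK)
        return -1;

    char buf[2 * CHUNK];
    size_t carry = 0;    /* bytes kept from the previous chunk */
    long base = 0;       /* stream offset corresponding to buf[0] */
    size_t got;

    while ((got = fread(buf + carry, 1, CHUNK, fp)) > 0) {
        size_t have = carry + got;
        const char *hit = find_bytes(buf, have, needle, nlen);
        if (hit)
            return base + (long)(hit - buf);

        /* Keep the tail so a match straddling the boundary is still seen. */
        size_t keep = have < nlen - 1 ? have : nlen - 1;
        memmove(buf, buf + have - keep, keep);
        base += (long)(have - keep);
        carry = keep;
    }
    return -1;
}
```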

Truncate file at front

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
1) As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
2) Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
3) Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don’t mind the possibility of things getting out of order in the queue. If you do care about order, there’s the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well.
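For reference, the "move everything up and truncate" workaround can be sketched with POSIX calls as follows. The function name is invented, and this is just one way to do the copy; it costs time proportional to the amount of data that remains in the file.

```c
#define _XOPEN_SOURCE 700     /* for pread(), pwrite() and ftruncate() */
#include <sys/types.h>
#include <unistd.h>

/* Remove the first `n` bytes of the file behind `fd` by shifting the
 * remaining data to the front and truncating. If `n` is at least the file
 * size, the file ends up empty. Returns 0 on success, -1 on error. */
int drop_front(int fd, off_t n)
{
    char buf[65536];
    off_t src = n, dst = 0;
    ssize_t got;

    while ((got = pread(fd, buf, sizeof buf, src)) > 0) {
        ssize_t done = 0;
        while (done < got) {                       /* pwrite may write less than asked */
            ssize_t w = pwrite(fd, buf + done, (size_t)(got - done), dst + done);
            if (w < 0)
                return -1;
            done += w;
        }
        src += got;
        dst += got;
    }
    if (got < 0)
        return -1;
    return ftruncate(fd, dst);                     /* cut off the now-duplicated tail */
}
```

Each chunk is read completely before it is written, and the destination is always behind the source, so the overlapping copy never clobbers unread data.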
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file systems implementation, and don't see any particular reason this would be difficult. It looks to me like all it would require is another word (dword, perhaps?) per allocation entry to say where the file starts within the block. With 1 terabyte drives under $100 US, it seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?

On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
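On Linux, lopping off the front then comes down to a single call. A minimal sketch (the wrapper name is mine): it only works on filesystems that support collapsing ranges, the length generally has to be a multiple of the filesystem block size, and the range may not extend to the end of the file, so the error path must be handled.

```c
#define _GNU_SOURCE           /* for fallocate() and the FALLOC_FL_* flags */
#include <fcntl.h>
#include <sys/types.h>

/* Remove the first `len` bytes of the file behind `fd` without copying data.
 * Returns 0 on success, or -1 with errno set (for example when the
 * filesystem does not support the operation or `len` is not block-aligned). */
int lop_front(int fd, off_t len)
{
    return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
}
```

A caller that needs this to work everywhere would fall back to copying the remaining data forward when the call fails.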

Truncating files at the front does not seem too hard to implement at the system level.
But there are issues.
The first one is at the programming level. When opening a file for random access, the current paradigm is to use offsets from the beginning of the file to point to different places in the file. If we truncate at the beginning of the file (or insert into or remove from the middle of the file), that is no longer a stable property. (Appending or truncating at the end is not a problem.)
In other words truncating the beginning would change the only reference point and that is bad.
At a system level uses exist as you pointed out, but are quite rare. I believe most uses of files are of the write once read many kind, so even truncate is not a critical feature and we could probably do without it (well some things would become more difficult, but nothing would become impossible).
If we want more complex accesses (and there are indeed needs) we open files in random-access mode and add some internal data structure. This information can also be shared between several files. This leads us to the last issue I see, probably the most important.
In a sense, when we are using random access files with some internal structure... we are still using files, but we are no longer using the file paradigm. Typical cases are databases, where we want to insert or remove records without caring at all about their physical place. Databases can use files as a low-level implementation, but for optimisation purposes some database vendors choose to bypass the filesystem completely (think about Oracle partitions).
I see no technical reason why we couldn't do everything that is currently done in an operating system with files using a database as the data storage layer. I have even heard that NTFS has many points in common with databases in its internals. An operating system can (and probably will in some not-so-distant future) use a paradigm other than files.
In summary, I believe that's not a technical problem at all, just a change of paradigm, and that removing the beginning is definitely not part of the current "file paradigm", but not a big and useful enough change to compel changing anything at all.

NTFS can do something like this with its sparse file support, but it's generally not that useful.

I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)

Actually there are record-based file systems: IBM has one and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting at random positions in a file.

There are also the Unix commands head and tail. Note that head -n1000 keeps the first 1000 lines; to drop the first 1000 lines and keep the rest you could do:
tail -n +1001 file > file_truncated

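You may be able to achieve this in two steps on Linux (sendfile() takes the source offset by pointer and writes at the output descriptor's current position):
long fileLength; //file total length
long reserveLength; //number of bytes to keep, counted back from the end of the file
int fd; //file open for read & write
off_t offset = fileLength - reserveLength; //start of the data to keep
lseek(fd, 0, SEEK_SET); //write the kept data at the front of the file
sendfile(fd, fd, &offset, reserveLength); //may copy fewer bytes than requested; check the return value
ftruncate(fd, reserveLength);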
