Retrieving gobs written to file by appending several times

I am trying to use encoding/gob to store data to a file and load it later. I want to be able to append new data to the file and load all saved data later, e.g. after restarting my application. Storing to the file with Encode() works without problems, but when reading I always seem to get only the item that was stored first, not the subsequently stored items.
Here is a minimal example: https://play.golang.org/p/patGkKDLhM
As you see, it works to write two times to an encoder and then read it back. But when closing the file and reopening it again in append mode, writing seems to work, but reading works only for the first two elements (which have been written previously). The two newly added structs cannot be retrieved, I get the error:
panic: extra data in buffer
I am aware of the question "Append to golang gob in a file on disk" and I also read https://groups.google.com/forum/#!topic/golang-nuts/bn6vjC5Abd8
Finally, I also found https://gist.github.com/kjk/8015952 which seems to demonstrate that what I am trying to do does not work. Why? What does this error mean?

I have not used the encoding/gob package yet (looks cool, I might have to find a project for it). But reading the godoc, it would seem to me that each encoding is a single record expected to be decoded from beginning to end. That is, once you Encode a stream, the resulting bytes are a complete set covering the entire stream from start to finish, and cannot be appended to later by encoding again.
The godoc states that an encoded gob is self-descriptive. At the beginning of the encoded stream, it describes the entire data set that will follow (struct, types, etc.), including the field names. What follows in the byte stream is the size and byte representation of the value of each exported field.
One could then assume what the docs omit: since the stream describes itself at the very beginning, including each field that is about to be passed, that description is all the Decoder cares about. The Decoder will not know of any bytes added after what was described, as it only sees what was described at the beginning. Therefore, the error message panic: extra data in buffer is accurate.
In your Playground example, you are encoding twice to the same encoder instance and then closing the file. Since you are passing exactly two records in, and encoding two records, that may work as the single instance of the encoder may see the two Encode calls as a single encoded stream. Then when you close the file io's stream, the gob is now complete - and the stream is treated as a single record (even though you sent in two types).
And the same in the decoding function, you are reading X number of times from the same stream. But, you are writing a single record when closing the file - that actually has two types in that one single record. Hence why it works when reading 2, and EXACTLY 2. But fails if reading more than 2.
A solution, if you want to store this in a single file, is to create your own index of each complete "write" or encoder instance/session: some form of your own block method that wraps each entry written to disk with a "begin" and "end" marker. That way, when reading back the file, you know exactly what buffer to allocate because of the begin/end markers. Once you have a single record in a buffer, you use gob's Decoder to decode it. And close the file after each write.
The pattern I use for such markers is something like:
uint64:uint64
uint64:uint64
...
The first being the beginning byte number, and the second entry separated by a colon being its length. I usually store this in another file though, called appropriately indexes. That way it can be quickly read into memory, and then I can stream the large file knowing exactly where each start and end address is in the byte stream.
Another option is just to store each gob in its own file, using the file system directory structure to organize as you see fit (or one could even use the directories to define types, for example). Then the existence of each file is a single record. This is how I use my rendered json from Event Sourcing techniques, storing millions of files organized in directories.
In summary, it would seem to me that a gob of data is a complete set of data from beginning to end: a single "record", if you will. If you want to store multiple encodings/multiple gobs, then you will need to create your own index to track the start and size/end of each gob's bytes as you store them. Then, you will want to Decode each entry separately.

Related

Why is in-place replacing so hard in files?

I have a very large CSV file that I want to import straight into Postgresql with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a textfile, and making changes to the first line of the file implies random access to the first "record" of the file. But textfiles don't really have "records", they have lines, of unequal length. So changing the first line implies reading the file up to the first carriage return, putting something in its place, and then moving all of the rest of the data in the file to the left, if the replacement is shorter, or to the right if it is longer. And to do that you have two choices. (1) Read the entire file into memory so you can do the left or right shift. (2) Read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.

What does a file pointer point to in C?

I am trying to understand input and output files in C. In the beginning, when we want to open a file to read, we declare a file pointer as follows:
FILE *fptr1 = fopen("filename", "r");
I understand that FILE is a data structure in the stdio.h library and that it contains information about the file. I also know that the fopen() function returns a FILE structure. But is that the purpose of the pointer? Does it just point to a bunch of information about the file? I've been reading into this and I have heard the term "file streams" floating around a bit. I understand that it is an interface of communication with the file (I find it vague, but I'll take it). Is that what the pointer points to in simple terms: a file stream? In the above code example, would the pointer be pointing to an input file stream?
Thank you!
The FILE structure is intended to be opaque. In other words, you are not supposed to look into it if you want your programs to remain portable.
Further, FILE is always used through a pointer, so you don't even need to know its size.
In a way, you can consider it a void * for all intents and purposes.
Now, if you are really interested in what the FILE type may hold, the C standard itself explains it quite well! See C11 7.21.1p2:
(...) FILE which is an object type capable of recording all the information needed to control a stream, including its file position indicator, a pointer to its associated buffer (if any), an error indicator that records whether a read/write error has occurred, and an end-of-file indicator that records whether the end of the file has been reached; (...)
So as you see, at least it contains stuff like:
The position inside the file
A pointer to a buffer
Error flags
EOF flag
It mentions (as you do) streams. You can find some more details about it in section 7.21.2 Streams:
Input and output, whether to or from physical devices such as terminals and tape drives, or whether to or from files supported on structured storage devices, are mapped into logical data streams, whose properties are more uniform than their various inputs and outputs. Two forms of mapping are supported, for text streams and for binary streams.
(...)
A binary stream is an ordered sequence of characters that can transparently record internal data. (...)
As we can read, a stream is an ordered sequence of characters. Note that it does not say whether this sequence is finite or not! (More on that later)
So, how do they relate to files? Let's see section 7.21.3 Files:
A stream is associated with an external file (which may be a physical device) by opening a file, which may involve creating a new file. Creating an existing file causes its former contents to be discarded, if necessary. If a file can support positioning requests (such as a disk file, as opposed to a terminal), then a file position indicator associated with the stream is positioned at the start (character number zero) of the file, unless the file is opened with append mode in which case it is implementation-defined whether the file position indicator is initially positioned at the beginning or the end of the file. The file position indicator is maintained by subsequent reads, writes, and positioning requests, to facilitate an orderly progression through the file.
(...)
See, when you open a "disk file" (the typical file in your computer), you are associating a "stream" (finite, in this case) which you can open/read/write/close/... through fread() and related functions; and the data structure that holds all the required information about it is FILE.
However, there are other kinds of files. Imagine a pseudo-random number generator. You can conceptualize it as an infinite read-only file: every time you read it gives you a different value and it never "ends". Therefore, this file would have an infinite stream associated with it. And some operations may not make sense with it (e.g. maybe you cannot seek it, i.e. move the file position indicator).
This only serves as a quick introduction, but as you can see, the FILE structure is an abstraction over the concept of a file. If you want to learn more about this kind of thing, the best you can do is reach for a good book on Operating Systems, e.g. Modern Operating Systems from Tanenbaum. This book also refers to C, so even better.
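As a small illustration of the state the standard says a FILE records, the sketch below reads a file to the end and then queries the position indicator, the end-of-file indicator and the error indicator through the public functions ftell(), feof() and ferror(). The helper name stream_state_demo and its return convention are made up for this example; the FILE itself is never inspected directly.

```c
#include <stdio.h>

/* Hypothetical helper: writes `text` to `path`, reads it back through a
   FILE stream, and returns the final file position, or -1 on error. */
long stream_state_demo(const char *path, const char *text)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    fputs(text, fp);
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");          /* binary mode so ftell() counts bytes */
    if (fp == NULL)
        return -1;

    /* Right after opening for reading, the position indicator is at 0. */
    printf("position after open: %ld\n", ftell(fp));

    int c;
    while ((c = fgetc(fp)) != EOF)
        ;                            /* consume the stream byte by byte */

    /* The indicators live inside FILE but are queried via functions. */
    long end = ftell(fp);
    int ok = feof(fp) && !ferror(fp);
    printf("position at end: %ld, eof=%d, error=%d\n",
           end, feof(fp) != 0, ferror(fp) != 0);

    fclose(fp);
    return ok ? end : -1;
}
```

Calling stream_state_demo("demo.tmp", "hello") would leave a five-byte file behind and return the position reached when the end-of-file indicator was set.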

Text files edit C

I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update a part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters in a file. You will have to write a program that reads the whole file and writes a new one: first the part before the change, then your edited text, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful: it positions the file at some byte offset (relative to the start or the end of the file) and does not know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) treat text files in a special manner (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines, notably getline (or perhaps fgets).
If you use getline, remember to free the line buffer when you are done with it.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible by comparison.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB, etc. You might also be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy the lines before it to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file to it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
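The temporary-file approach described in the answers above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: replace_line and its parameters are hypothetical names, the whole target line is replaced rather than only its first two words, and the final rename() is assumed to replace an existing file, which is the POSIX behaviour (on other systems the original may need to be removed first).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies `path` to `tmp_path`, writing `new_line`
   (given without a trailing newline) in place of line number `target`
   (1-based), then renames the temporary file over the original.
   Returns 0 on success, -1 on failure or if the line does not exist. */
int replace_line(const char *path, const char *tmp_path,
                 long target, const char *new_line)
{
    FILE *in = fopen(path, "r");
    if (in == NULL)
        return -1;
    FILE *out = fopen(tmp_path, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    long line_no = 1;
    int replaced = 0;
    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);
        /* fgets may return only part of a very long line */
        int ends_line = len > 0 && buf[len - 1] == '\n';
        if (line_no != target) {
            fputs(buf, out);                 /* copy unchanged */
        } else if (!replaced) {
            fprintf(out, "%s\n", new_line);  /* write the edited line once */
            replaced = 1;
        }
        if (ends_line)
            line_no++;
    }

    int ok = replaced && !ferror(in) && !ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;
    if (!ok || rename(tmp_path, path) != 0) {
        remove(tmp_path);                    /* leave the original untouched */
        return -1;
    }
    return 0;
}
```

For the question's case, the caller would build the new text of line 4 (for example with snprintf from the new int and float plus the unchanged remainder) and pass it as new_line.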

Replacing spaces with %20 in a file on hard disk

I have gone through all the answers for the similar question posted earlier Replacing spaces with %20 in C. However I'm unable to guess how can we do this in case of a file on hard disk, where disk accesses can be expensive and file is too long to load into memory at once. In case it is possible to fit, we can simply load the file and write onto the same existing one.
Further, for memory constraints one would like to replace the original file and not create a new one.
Horrible idea. Since the "%20" is longer than " " you can't just replace chars inside the file, you have to move whatever follows it further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could try to determine the total growth of the file on a first pass, then do the whole shifting from the back of the file taking blocksize into account and adjusting the shifting as you encounter " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates while doing the thing you won't end up with a half-converted file. That's actually the reason why many programs write to a new file even if they wouldn't need to, to make sure the file is "always" correct because the new file only replaces the old file after it has been written successfully. It's a simple transaction scheme that doesn't take system failures into account, but works well for application failures (including users forcibly terminating the program)
For the replacement part, you can have two buffers, one that you read into and one that you write the translated string to and which you write to disk. Depending on your memory constraints even a small input buffer (say 1KiB) is enough. However, to avoid repeating reallocations you can keep a fixed buffer for the output, and have it three times the size of the input buffer (worst case scenario, input is all spaces). Total that's 4KiB of memory, plus whatever buffers the OS uses. I would recommend to use a multiple of the disk block size as the input size.
The problem is your requirement of reading and writing to the same file. Unfortunately this is impossible. If you read char-by-char, think about what happens when you reach a space: you then have to write three characters, which overwrites the next two characters in the file. Not exactly what you want.
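The two-buffer scheme suggested above (a small input buffer, an output buffer three times its size, a new file that is renamed over the old one) might look like the sketch below. The function name encode_spaces and the temporary path parameter are assumptions made for the example, and rename() is assumed to replace an existing file as it does on POSIX systems.

```c
#include <stdio.h>

#define IN_SIZE  1024            /* input buffer size suggested above */
#define OUT_SIZE (3 * IN_SIZE)   /* worst case: every input byte is a space */

/* Hypothetical helper: streams `path` into `tmp_path`, expanding each
   space to "%20", then renames the new file over the old one.
   Returns 0 on success, -1 on failure (the original is left as it was). */
int encode_spaces(const char *path, const char *tmp_path)
{
    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(tmp_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    unsigned char inbuf[IN_SIZE];
    unsigned char outbuf[OUT_SIZE];
    size_t n;
    int ok = 1;
    while (ok && (n = fread(inbuf, 1, sizeof inbuf, in)) > 0) {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            if (inbuf[i] == ' ') {
                outbuf[o++] = '%';
                outbuf[o++] = '2';
                outbuf[o++] = '0';
            } else {
                outbuf[o++] = inbuf[i];
            }
        }
        if (fwrite(outbuf, 1, o, out) != o)
            ok = 0;
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    if (!ok || rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

Memory use stays at 4 KiB regardless of file size, and an interrupted run leaves only a stray temporary file rather than a half-converted original.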

Lots of questions about file I/O (reading/writing message strings)

For this university project I'm doing (for which I've made a couple of posts in the past), which is some sort of social network, users are required to be able to exchange messages.
At first, I designed my data structures to hold ALL messages in a linked list, limiting the message size to 256 chars. However, I think my instructors will prefer if I save the messages on disk and read them only when I need them. Of course, they won't say what they prefer, I need to make a choice and justify the best I can why I went that route.
One thing to keep in mind is that I only need to save the latest 20 messages from each user, no more.
Right now I have a hash table that will act as the inbox; it will be inside the user profile. This hash table will be indexed by name (the user that sent the message). The value for each element will be a data structure holding an array of size_t with 20 elements (20 messages like I said above). The idea is to keep track of the disk file offsets and bytes written. Then, when I need to read a message, I just need to use fseek() and read the necessary bytes.
I think this could work nicely... I could use just one single file to hold all messages from all users in the network. I'm saying one single file because a colleague asked an instructor about saving the messages from each user independently, to which he replied that it might not be the best approach because the file system has its limits. That's why I'm thinking of going the single file route.
However, this presents a problem... Since I only need to save the latest 20 messages, I need to discard the older ones when I reach this limit.
I have no idea how to do this... All I know is about fread() and fwrite() to read/write bytes from/to files. How can I go to a file offset and say "hey, delete the following X bytes"? Even if I could do that, there's another problem... All offsets below that one will be completely different and I would have to process all users mailboxes to fix the problem. Which would be a pain...
So, any suggestions to solve my problems? What do you suggest?
You can't arbitrarily delete bytes from the middle of a file; the only way that works is to rewrite the entire file without them. Disregarding the question of whether doing things this way is a good idea, if you have fixed length fields, one solution would be to just overwrite the oldest message with the newest one; that way, the size / position of the message on disk doesn't change, so none of the other offsets are affected.
Edit: If you're allowed to use external libraries, making a simple SQLite db could be a good solution.
You're complicating your life way more than you need to.
If your messages are 256 characters, then use an array of 256 characters to hold each message.
Write it to disk with fwrite, read with fread, delete it by changing the first character of the string to \0 (or whatever else strikes your fancy) and write that to disk.
Keep an index of the messages in a simple structure (username/recno) and bounce around in the file with fseek. You can either brute-force the next free record when writing a new one (start reading at the beginning of the file and stop when you hit your \0) or keep an index of free records in an array and grab one of them when writing a new one (or if your array is empty then fseek to the end of the file and write a complete new record.)
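A rough sketch of the fixed-size record idea from the answers above: every message occupies exactly 256 bytes, so record number recno always starts at byte recno * 256, and overwriting or deleting one message never shifts another. The function names are hypothetical, and the index that maps a user to record numbers is left out.

```c
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 256   /* fixed record size taken from the question */

/* Overwrites record `recno` (0-based) in a file opened for binary update.
   Text longer than MSG_SIZE - 1 bytes is truncated. Returns 0 or -1. */
int write_message(FILE *fp, long recno, const char *text)
{
    char rec[MSG_SIZE] = {0};               /* zero padding after the text */
    strncpy(rec, text, MSG_SIZE - 1);
    if (fseek(fp, recno * (long)MSG_SIZE, SEEK_SET) != 0)
        return -1;
    if (fwrite(rec, 1, MSG_SIZE, fp) != MSG_SIZE)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Reads record `recno` into `out`, which must hold MSG_SIZE bytes. */
int read_message(FILE *fp, long recno, char *out)
{
    if (fseek(fp, recno * (long)MSG_SIZE, SEEK_SET) != 0)
        return -1;
    return fread(out, 1, MSG_SIZE, fp) == MSG_SIZE ? 0 : -1;
}

/* "Deletes" a record by making its first byte '\0', as suggested above. */
int delete_message(FILE *fp, long recno)
{
    return write_message(fp, recno, "");
}
```

With the file opened in "r+b" (or "w+b" when creating it), keeping only the latest 20 messages per user amounts to reusing that user's oldest record number for each new message.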
I want to suggest another solution for completeness' sake:
Strings should end with a null byte, "hello world\0", so you can read the raw binary data until you reach "\0".
Other data types have a fixed width in bits; beware of byte order (endianness).
You could also put a length prefix before each message, so you know its string length:
"11hello world;2hi;15my name is loco"
Thus making it possible to treat raw snippets like data fields.
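One way the length-prefix idea could be realized is sketched below. It departs from the textual "11hello world;" form shown above and instead assumes a 4-byte binary length written in a fixed little-endian order, which sidesteps both the delimiter and the byte-order concern. The names write_framed and read_framed are hypothetical.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Writes a 4-byte little-endian length followed by the bytes of `text`. */
int write_framed(FILE *fp, const char *text)
{
    uint32_t len = (uint32_t)strlen(text);
    unsigned char hdr[4];
    hdr[0] = (unsigned char)(len & 0xFF);
    hdr[1] = (unsigned char)((len >> 8) & 0xFF);
    hdr[2] = (unsigned char)((len >> 16) & 0xFF);
    hdr[3] = (unsigned char)((len >> 24) & 0xFF);
    if (fwrite(hdr, 1, 4, fp) != 4)
        return -1;
    if (len > 0 && fwrite(text, 1, len, fp) != len)
        return -1;
    return 0;
}

/* Reads the next message into `buf` (capacity `cap`, including the
   terminating '\0'). Returns its length, or -1 at end of file, on a
   read error, or if the message does not fit. */
long read_framed(FILE *fp, char *buf, size_t cap)
{
    unsigned char hdr[4];
    if (fread(hdr, 1, 4, fp) != 4)
        return -1;
    uint32_t len = (uint32_t)hdr[0]
                 | ((uint32_t)hdr[1] << 8)
                 | ((uint32_t)hdr[2] << 16)
                 | ((uint32_t)hdr[3] << 24);
    if (len >= cap)
        return -1;
    if (len > 0 && fread(buf, 1, len, fp) != len)
        return -1;
    buf[len] = '\0';
    return (long)len;
}
```

Because each record states its own size, new records can be appended at any later time and read back one by one until read_framed returns -1.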

Resources