File Bytes Array Length Go - arrays

I have recently started to learn Go. To start with I decided that I would write some code to open a file and output its contents on the terminal window. So far I have been writing code like this:
file, err := os.Open("./blah.txt")
data := make([]byte, 100)
count, err := file.Read(data)
To obtain up to 100 bytes from a file. Is there any way to ascertain the byte count on a file, such that you could set the correct (or more sensible) byte array length just using the standard Go library?
I understand you could use a slice and grow it with append() once the extremities of the array have been reached, but I just wondered whether the file size/length/whatever could be accessed through file metadata or something similar, prior to instantiating an array.

While you could certainly get the file's size prior to reading
from it (see the other answer), doing this is usually futile
for a number of reasons:
A filesystem is an inherently racy medium: any number of processes
might update a given file simultaneously, and even remove it.
On a filesystem with POSIX semantics (most commodity OSes
excluding Windows) the only guarantee a successful opening of a file
gives you is that it's possible to read data from it,
and that's basically all. (Well, reading may fail due to an error
in the underlying media, but let's not digress further.)
What would you do if you did the equivalent of a fstat(2) call,
as suggested, and it told you the file contains 42 terabytes of data?
Would you try to allocate a sufficiently large array to hold its contents?
Would you implement some custom logic which classifies the file's
size into several ranges and performs custom processing based on that—like,
say, slurping files less than N megabytes in length and reading
bigger files piecemeal?
What if the file grew bigger (was appended to) after you obtained its size?
What if you later decide to be more Unix-way-ready and make it possible
to read the data from your program's standard input stream, like the cat
program on Unix (or its Windows cousin, type) does?
You can't know how much data will be piped through that stream;
and potentially it might be of indefinite length (consider being piped
the contents of some busy log file on a continuously running system).
Sure, in some applications you assume the contents of files do not
change under your feet; one example is archivers like zip or tar, which
record the file's metadata, including its size, along with the file.
(By the way, tar detects a file might have changed while the program
was reading its contents and warns the user in that case).
But what I'm leading you to is that for a task as simple as yours,
there's little point in doing it the way you've come up with.
Instead, just use a buffer of some "sensible" size and shuttle the data
between its source and destination through that buffer.
That is, you allocate the buffer, enter a loop, and on each iteration of
it you try to read as much data as fits in the buffer, process whatever
the Read function indicated it was able to read, then handle an
end-of-file condition or an error, if it was indicated.
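In code, that loop might look like this (a minimal sketch; it assumes the file f has already been opened as in your snippet, and uses the io, log and os packages):
buf := make([]byte, 4096) // some "sensible" fixed size
for {
    n, err := f.Read(buf)
    if n > 0 {
        // Process buf[:n]; for your task, just write it to the terminal.
        if _, werr := os.Stdout.Write(buf[:n]); werr != nil {
            log.Fatal(werr)
        }
    }
    if err == io.EOF {
        break // all data consumed
    }
    if err != nil {
        log.Fatal(err)
    }
}
Note that Read may return both data and an error on the same call, which is why the n > 0 check comes before the error handling.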
To round off this small crash course, I'd hint that the standard library
already has io.Copy which, in your
case, may be called like
_, err := io.Copy(os.Stdout, f)
and will shovel all the contents of f to the standard output of your
program until EOF or an error is detected.
Last time I checked, this function used an internal buffer of 32 KiB in size,
but you may always check the source code of your Go installation.
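For completeness, a full program for your original task might look like this (a sketch with minimal error handling):
package main

import (
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("./blah.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Shovel the file's contents to the terminal;
    // io.Copy does its own buffering internally.
    if _, err := io.Copy(os.Stdout, f); err != nil {
        log.Fatal(err)
    }
}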

I assume what you need is a way to get the file size in bytes so as to create a slice of the same size:
fi, err := f.Stat()
// handle error
// ...
size := fi.Size()
(see FileInfo for more)
Note that Stat returns a FileInfo, so it needs its own variable (fi here) rather than reusing f. You can then use this size to initialise a slice:
data := make([]byte, size)
You can also consider reading the whole file in one call using ioutil.ReadFile.
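For example (ioutil.ReadFile stats the file itself and allocates a slice of the right size, so it replaces all of the above):
data, err := ioutil.ReadFile("./blah.txt")
if err != nil {
    // handle error
}
// data is a []byte holding the whole file's contents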

Related

Retrieving gobs written to file by appending several times

I am trying to use encoding/gob to store data to a file and load it later. I want to be able to append new data to the file and load all saved data later, e.g. after restarting my application. While storing to the file using Encode() there are no problems, but when reading it seems I always get only the item which was stored first, not the subsequently stored items.
Here is a minimal example: https://play.golang.org/p/patGkKDLhM
As you see, it works to write two times to an encoder and then read it back. But when closing the file and reopening it in append mode, writing seems to work, but reading works only for the first two elements (which had been written previously). The two newly added structs cannot be retrieved; I get the error:
panic: extra data in buffer
I am aware of Append to golang gob in a file on disk and I also read https://groups.google.com/forum/#!topic/golang-nuts/bn6vjC5Abd8
Finally, I also found https://gist.github.com/kjk/8015952 which seems to demonstrate that what I am trying to do does not work. Why? What does this error mean?
I have not used the encoding/gob package yet (looks cool, I might have to find a project for it). But reading the godoc, it would seem to me that each encoding is a single record expected to be decoded from beginning to end. That is, once you Encode a stream, the resulting bytes form a complete set representing the entire stream from start to finish - not able to be appended to later by encoding again.
The godoc states that an encoded gob is self-descriptive. At the beginning of the encoded stream, it describes the entire data set: structs, types, etc. that will be following, including the field names. Then what follows in the byte stream is the size and byte representation of the value of those Exported fields.
One could then assume that what is omitted from the docs is this: since the stream describes itself at the very beginning, including each field that is about to be passed, that is all the Decoder will care about. The Decoder will not know of any successive bytes added after what has been described, as it only sees what was described at the beginning. Therefore, the error message panic: extra data in buffer is accurate.
In your Playground example, you are encoding twice to the same encoder instance and then closing the file. Since you are passing exactly two records in, and encoding two records, that may work as the single instance of the encoder may see the two Encode calls as a single encoded stream. Then when you close the file io's stream, the gob is now complete - and the stream is treated as a single record (even though you sent in two types).
And the same goes for the decoding function: you are reading X number of times from the same stream. But what you wrote when closing the file was a single record - one that actually has two types in that single record. Hence why it works when reading 2, and EXACTLY 2, but fails if reading more than 2.
A solution, if you want to store this in a single file, is to create your own index of each complete "write" or encoder instance/session: some form of your own block method that lets you wrap or delimit each entry written to disk with a "begin" and "end" marker. That way, when reading the file back, you know exactly what buffer to allocate thanks to the begin/end markers. Once you have a single record in a buffer, you use gob's Decoder to decode it. And close the file after each write.
The pattern I use for such markers is something like:
uint64:uint64
uint64:uint64
...
The first being the beginning byte number, and the second entry, separated by a colon, being its length. I usually store this in another file, called, appropriately, indexes. That way it can be quickly read into memory, and then I can stream the large file knowing exactly where each start and end address is in the byte stream.
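A rough sketch of that scheme in Go follows; the function name appendGob and the plain-text offset:length index format are only illustrations here, not anything standard:
import (
    "bytes"
    "encoding/gob"
    "fmt"
    "io"
    "os"
)

// appendGob encodes v as a standalone gob stream, appends it to the
// data file, and records "offset:length" for it in the index file.
func appendGob(dataPath, indexPath string, v interface{}) error {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(v); err != nil {
        return err
    }
    df, err := os.OpenFile(dataPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        return err
    }
    defer df.Close()
    off, err := df.Seek(0, io.SeekEnd) // current end == start of this record
    if err != nil {
        return err
    }
    if _, err := df.Write(buf.Bytes()); err != nil {
        return err
    }
    idx, err := os.OpenFile(indexPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        return err
    }
    defer idx.Close()
    _, err = fmt.Fprintf(idx, "%d:%d\n", off, buf.Len())
    return err
}
To read an entry back, seek to its recorded offset, read exactly its recorded length into a buffer, and hand that buffer to a fresh gob.NewDecoder; each record then decodes independently.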
Another option is just to store each gob in its own file, using the file system's directory structure to organize things as you see fit (one could even use the directories to define types, for example). The existence of each file is then a single record. This is how I store the rendered JSON from my Event Sourcing techniques: millions of files organized in directories.
In summary, it would seem to me that a gob of data is a complete set of data from beginning to end - a single "record", if you will. If you want to store multiple encodings/multiple gobs, then you will need to create your own index to track the start and size/end of each gob's bytes as you store them. Then, you will want to Decode each entry separately.

Copy sparse files

I'm trying to understand Linux (UNIX) low-level interfaces, and as an exercise I want to write code that will copy a file with holes into a new file (again with holes).
So my question is, how to read from the first file not till the first hole, but till the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking forward byte by byte and trying to read each byte, but then I would have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek to jump to the beginning or end of a hole, as described in great detail in the man page.
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
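To make that concrete, here is a sketch in Go (staying with the language used earlier on this page) that walks the source's data runs via SEEK_DATA/SEEK_HOLE from golang.org/x/sys/unix; holes are recreated in the destination simply by never writing over those ranges (Linux 3.1+ only, and the destination filesystem must support sparse files):
import (
    "io"
    "os"

    "golang.org/x/sys/unix"
)

func copySparse(src, dst *os.File) error {
    var off int64
    buf := make([]byte, 64*1024)
    for {
        // Find the next run of data at or after off.
        dataOff, err := unix.Seek(int(src.Fd()), off, unix.SEEK_DATA)
        if err == unix.ENXIO {
            break // nothing but a (possible) trailing hole remains
        }
        if err != nil {
            return err
        }
        // Find where that run ends, i.e. the next hole.
        holeOff, err := unix.Seek(int(src.Fd()), dataOff, unix.SEEK_HOLE)
        if err != nil {
            return err
        }
        // Copy [dataOff, holeOff) to the same offsets in dst;
        // the skipped region stays unwritten and so becomes a hole.
        if _, err := src.Seek(dataOff, io.SeekStart); err != nil {
            return err
        }
        if _, err := dst.Seek(dataOff, io.SeekStart); err != nil {
            return err
        }
        if _, err := io.CopyBuffer(dst, io.LimitReader(src, holeOff-dataOff), buf); err != nil {
            return err
        }
        off = holeOff
    }
    // Extend dst to the source's logical size to preserve a trailing hole.
    fi, err := src.Stat()
    if err != nil {
        return err
    }
    return dst.Truncate(fi.Size())
}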
For copies of sparse files, from the cp manual:
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (though the result still depends on the detection heuristic).
A file is not presented as if it has any gaps. If your intention is to say that the file has sections in one area of the disk, then more in another, and so on, you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk, seeking to sectors on your own.
If your meaning of "holes" in a file is as #ThiefMaster says, just areas of 0 bytes -- these are only "holes" according to your application's use of the data; to the file system they're just bytes in a file, no different from any other. In this case, you can copy it through a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).

Replacing spaces with %20 in a file on hard disk

I have gone through all the answers to the similar question posted earlier, Replacing spaces with %20 in C. However, I'm unable to work out how we can do this in the case of a file on a hard disk, where disk accesses can be expensive and the file is too long to load into memory at once. (If it did fit, we could simply load the file and write it back over the existing one.)
Further, because of space constraints, one would like to replace the original file rather than create a new one.
Horrible idea. Since "%20" is longer than " ", you can't just replace chars inside the file; you have to move whatever follows each occurrence further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could try to determine the total growth of the file on a first pass, then do the whole shifting from the back of the file taking blocksize into account and adjusting the shifting as you encounter " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates while doing the thing you won't end up with a half-converted file. That's actually the reason why many programs write to a new file even if they wouldn't need to, to make sure the file is "always" correct because the new file only replaces the old file after it has been written successfully. It's a simple transaction scheme that doesn't take system failures into account, but works well for application failures (including users forcibly terminating the program)
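In Go (to stay with the language used earlier on this page), that approach could be sketched like this; replaceSpaces is an illustrative name, and the imports are bufio, io, os and path/filepath. It streams through a buffered reader and writer, so memory use stays constant regardless of file size:
func replaceSpaces(path string) error {
    in, err := os.Open(path)
    if err != nil {
        return err
    }
    defer in.Close()

    // Create the temp file in the same directory so the final
    // rename stays on one filesystem.
    tmp, err := os.CreateTemp(filepath.Dir(path), "repl-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // cleans up on failure; harmless after rename

    r := bufio.NewReader(in)
    w := bufio.NewWriter(tmp)
    for {
        b, err := r.ReadByte()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        if b == ' ' {
            _, err = w.WriteString("%20")
        } else {
            err = w.WriteByte(b)
        }
        if err != nil {
            return err
        }
    }
    if err := w.Flush(); err != nil {
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}
Writing the temporary file into the same directory keeps os.Rename a cheap same-filesystem operation, which is what makes the final swap effectively atomic.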
For the replacement part, you can have two buffers: one that you read into, and one that you write the translated string to and whose contents you write to disk. Depending on your memory constraints, even a small input buffer (say 1 KiB) is enough. However, to avoid repeated reallocations you can keep a fixed buffer for the output and make it three times the size of the input buffer (the worst-case scenario, where the input is all spaces). In total that's 4 KiB of memory, plus whatever buffers the OS uses. I would recommend using a multiple of the disk block size as the input buffer size.
The problem is your requirement of reading from and writing to the same file. Unfortunately this is impossible. If you read char-by-char, think about what happens when you reach a space: you then have to write three characters, overwriting the next two characters in the file. Not exactly what you want.

What's a short way to prepend 3 bytes to the beginning of a binary file in C?

The straightforward way I know of is to create a new file, write the three bytes to it, and then read the original file into memory (in a loop) and write it out to the new file.
Is there a faster way that would either permit skipping the creation of a new file, or skip reading the original file into memory and writing back out again?
There is, unfortunately, no way (with POSIX or standard libc file APIs) to insert or delete a range of bytes in an existing file.
This isn't so much about C as about filesystems; there aren't many common filesystem APIs that provide shortcuts for prepending data, so in general the straightforward way is the only one.
You may be able to use some form of memory-mapped I/O appropriate to your platform, but this trades off one set of problems for another (such as, can you map the entire file into your address space or are you forced to break it up into chunks?).
You could open the file as read/write, read the first 4 KB, seek backward 4 KB, write your three bytes, then write (4 KB - 3) bytes of what you read, and repeat the process until you reach the end of the file; the 3 bytes displaced from each block become the first 3 bytes written into the next.
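Here is a sketch of that rolling-buffer scheme in Go (staying with the language used earlier on this page; prepend3 is an illustrative name). Note that, unlike the copy-and-rename approach, this is not crash-safe: an interruption leaves the file partially shifted.
import (
    "io"
    "os"
)

func prepend3(path string, prefix [3]byte) error {
    f, err := os.OpenFile(path, os.O_RDWR, 0)
    if err != nil {
        return err
    }
    defer f.Close()

    carry := append([]byte{}, prefix[:]...) // bytes waiting to be written (always 3)
    chunk := make([]byte, 4096)
    var off int64 // current block offset, shared by the read and the write
    for {
        // Save the block we are about to overwrite.
        n, rerr := f.ReadAt(chunk, off)
        if rerr != nil && rerr != io.EOF {
            return rerr
        }
        if n < len(chunk) {
            // Last (possibly empty) block: write the carried bytes plus
            // whatever was left; the file grows by exactly 3 bytes.
            _, err := f.WriteAt(append(append([]byte{}, carry...), chunk[:n]...), off)
            return err
        }
        // Full block: write carry + the first 4093 bytes just read;
        // the displaced last 3 bytes become the next carry.
        if _, err := f.WriteAt(append(append([]byte{}, carry...), chunk[:n-3]...), off); err != nil {
            return err
        }
        carry = append(carry[:0], chunk[n-3:]...)
        off += int64(n)
    }
}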

C File Input/Output for Unknown File Types: File Copying

I'm having some issues with a networking assignment. The end goal is to have a C program that grabs a file from a given URL via HTTP and writes it to a given filename. I've got it working fine for most text files, but I'm running into some issues, which I suspect all come from the same root cause.
Here's a quick version of the code I'm using to transfer the data from the network file descriptor to the output file descriptor:
unsigned long content_length; // extracted from HTTP header
unsigned long successfully_read = 0;
while(successfully_read != content_length)
{
    char buffer[2048];
    int extracted = read(connection,buffer,2048);
    fprintf(output_file,buffer);
    successfully_read += extracted;
}
As I said, this works fine for most text files (though the % symbol confuses fprintf, so it would be nice to have a way to deal with that). The problem is that it just hangs forever when I try to get non-text files (a .png is the basic test file I'm working with, but the program needs to be able to handle anything).
I've done some debugging and I know I'm not going over content_length, getting errors during read, or hitting some network bottleneck. I looked around online but all the C file i/o code I can find for binary files seems to be based on the idea that you know how the data inside the file is structured. I don't know how it's structured, and I don't really care; I just want to copy the contents of one file descriptor into another.
Can anyone point me towards some built-in file i/o functions that I can bludgeon into use for that purpose?
Edit: Alternately, is there a standard field in the HTTP header that would tell me how to handle whatever file I'm working with?
You are using the wrong tool for the job. fprintf takes a format string and extra arguments, like this:
fprintf(output_file, "hello %s, today is the %d", cstring, dayoftheweek);
If you pass the second argument from an unknown source (like the web, which you are doing) you can accidentally have %s or %d or other format specifiers in the string. Then fprintf will try to read more arguments than it was passed, and cause undefined behaviour.
Use fwrite for this:
fwrite(buffer, 1, extracted, output_file);
A couple of things with your code:
For fprintf - you are using the data as the second argument, when in fact it should be the format, and the data should be the third argument. This is why you are getting problems with the % character, and why it is struggling when presented with binary data, because it is expecting a format string.
You need to use a different function, such as fwrite, to output the file.
As a side note this is a bit of a security problem - if you fetch a specially crafted file from the server it is possible to expose random areas of your memory.
In addition to Seth's answer: unless you are using a third-party library for handling all the HTTP stuff, you need to deal with the Transfer-Encoding header and the possible compression, or at least detect them and throw an error if you don't know how to handle that case.
In general, it may (or may not) be a good idea to parse the HTTP response headers, and only if they contain exclusively stuff that you understand should you continue to interpret the data that follows the header.
I bet your program is hanging because it's expecting X bytes but receiving Y instead, with X < Y (most likely, sans compression - though PNGs don't compress well with gzip anyway). You'll get chunks [*] of data, with one of the chunks most likely spanning content_length, so your condition while(successfully_read != content_length) is always true.
You could try running your program under strace, or whatever its equivalent is for your OS, to see whether your program keeps trying to read data it will never get (because you've likely made an HTTP/1.1 request that holds the connection open and you haven't made a second request), or whether the transfer has ended (if the server closes the connection, your repeated calls to read(2) will just return 0, which leaves your still-true loop condition unchanged).
If you are sending your program's output to stdout, you may find that it produces no output; this can happen if the resource you are retrieving contains no newline or other flush-forcing control characters. Other stdio buffering regimes may apply when output goes to a file. (For example, the file will remain empty until the stdio buffers have accumulated at least 4096 bytes.)
[*] Then there's also Transfer-Encoding: chunked, as #roland-illig alludes to, which will ruin the exact equivalence between content_length (presumably derived from the eponymous HTTP header) and the actual number of bytes transferred over the socket.
You are opening the file as a text file. On platforms where this matters (notably Windows), that means the C library will translate each \n your program writes into \r\n. Try opening the file in binary mode, and those errors in size should go away.
