Copy sparse files - C

I'm trying to understand Linux (UNIX) low-level interfaces, and as an exercise I want to write code that copies a file with holes into a new file (again with holes).
So my question is: how do I read from the first file not just up to the first hole, but to the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking forward byte by byte and trying to read each byte, but then I'd have to know the number of holes in advance.

If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, lseek even supports SEEK_HOLE and SEEK_DATA to jump to the start of the next hole or the next data region, as described in great detail in the man page.
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
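For instance, a minimal sketch of such a copy, assuming Linux's SEEK_DATA/SEEK_HOLE and placeholder file names, might look like this (error handling abbreviated):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int in = open("in.dat", O_RDONLY);       /* placeholder names */
    int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    off_t end = lseek(in, 0, SEEK_END);
    off_t data = lseek(in, 0, SEEK_DATA);    /* start of the first data run */

    while (data >= 0 && data < end) {
        off_t hole = lseek(in, data, SEEK_HOLE);  /* end of this data run */
        lseek(in, data, SEEK_SET);
        lseek(out, data, SEEK_SET);          /* seeking past EOF leaves a hole */

        char buf[65536];
        off_t left = hole - data;
        while (left > 0) {
            size_t chunk = left < (off_t)sizeof buf ? (size_t)left : sizeof buf;
            ssize_t n = read(in, buf, chunk);
            if (n <= 0) { perror("read"); return 1; }
            if (write(out, buf, n) != n) { perror("write"); return 1; }
            left -= n;
        }
        data = lseek(in, hole, SEEK_DATA);   /* next data run; -1 (ENXIO) at EOF */
    }
    ftruncate(out, end);   /* preserve a trailing hole, if any */
    close(in);
    close(out);
    return 0;
}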

For copies of sparse files, from the cp manual:
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (hole placement is still decided by cp's zero-run detection).

A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk instead, seeking to sectors on your own.
If your meaning of "holes" in a file is as @ThiefMaster says, just areas of 0 bytes -- these are only "holes" according to your application's use of the data; to the file system they're just bytes in a file, no different from any other. In this case, you can copy it with a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).

Related

File Bytes Array Length Go

I have recently started to learn Go. To start with I decided that I would write some code to open a file and output its contents on the terminal window. So far I have been writing code like this:
file, err := os.Open("./blah.txt")
data := make([]byte, 100)
count, err := file.Read(data)
To obtain up to 100 bytes from a file. Is there any way to ascertain the byte count on a file, such that you could set the correct (or more sensible) byte array length just using the standard Go library?
I understand you could use a slice with something like Append() once the extremities of the array have been reached, but I just wondered whether the file size/length/whatever could be accessed prior to instantiating an array through file metadata or something similar.
While you could certainly get the file's size prior to reading
from it (see the other answer), doing this is usually futile
for a number of reasons:
A filesystem is an inherently racy medium: any number of processes
might update a given file simultaneously, and even remove it.
On a filesystem with POSIX semantics (most commodity OSes
excluding Windows) the only guarantee a successful opening of a file
gives you is that it's possible to read data from it,
and that's basically all. (Well, reading may fail due to the error
in the underlying media but let's not digress further).
What would you do if you did the equivalent of an fstat(2) call,
as suggested, and it told you the file contains 42 terabytes of data?
Would you try to allocate a sufficiently large array to hold its contents?
Would you implement some custom logic which classifies the file's
size into several ranges and performs custom processing based on that—like,
say, slurping files less than N megabytes in length and reading
bigger files piecemeal?
What if the file grew bigger (was appended to) after you obtained its size?
What if you later decide to be more Unix-way-ready and make it possible
to read the data from your program's standard input stream—like the cat
program on Unix (or its Windows cousin, type) does?
You can't know how much data will be piped through that stream;
and potentially it might be of indefinite length (consider being piped
the contents of some busy log file on a continuously running system).
Sure, in some applications you assume the contents of files do not
change under your feet; one example is archivers like zip or tar which
record the file's metadata, including its size, along with the file.
(By the way, tar detects a file might have changed while the program
was reading its contents and warns the user in that case).
But what I'm leading you to is that for a task as simple as yours,
there's little point in doing it the way you've come up with.
Instead, just use a buffer of some "sensible" size and shuttle the data
between its source and destination through that buffer.
That is, you allocate the buffer, enter a loop, and on each iteration of
it you try to read as much data as fits in the buffer, process whatever
the Read function indicated it was able to read, then handle an
end-of-file condition or an error, if it was indicated.
To round off this small crash course, I'd hint that the standard library
already has io.Copy which, in your
case, may be called like
_, err := io.Copy(os.Stdout, f)
and will shovel all the contents of f to the standard output of your
program until EOF or an error is detected.
Last time I checked, this function used an internal buffer of 32 KiB in size,
but you may always check the source code of your Go installation.
I assume what you need is a way to get the file size in bytes so you can create a slice of the same size:
fi, err := file.Stat()
// handle error
// ...
size := fi.Size()
(see FileInfo for more)
You can then use this size to initialise a slice.
data := make([]byte, size)
You can also consider reading the whole file in one call using ioutil.ReadFile.

Text files edit C

I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update a part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
From what I've found, fseek() and fputs() can be used, but I don't know exactly how to get to a specific line.
(Explained code will be appreciated, as I'm a beginner in C.)
You can't "insert" characters in a file. You will have to write a program which reads the whole file and copies to a new file the part before the edit, then your edited data, then the rest of the file.
You really need to read the whole file, and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and doesn't know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (other than the newline character) ended by a newline ('\n'). Some operating systems (Windows, Mac OS X) read text files in a special manner (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines notably getline (or perhaps fgets).
If you use getline, remember to free the line buffer when you are done.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you absolutely wanted to use fseek (which would be a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends), and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The overhead of buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
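A minimal sketch of that process, assuming POSIX getline and placeholder file names (and simply replacing the whole of line 4 rather than just its first two words):
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in = fopen("data.txt", "r");        /* placeholder names */
    FILE *out = fopen("data.txt.tmp", "w");
    if (!in || !out) { perror("fopen"); return 1; }

    char *line = NULL;
    size_t cap = 0;
    long lineno = 0;

    while (getline(&line, &cap, in) != -1) {
        if (++lineno == 4)
            fputs("42 3.14 rest of the line\n", out);  /* the new int and float */
        else
            fputs(line, out);
    }
    free(line);      /* getline's buffer must be freed */
    fclose(in);
    fclose(out);

    if (rename("data.txt.tmp", "data.txt") != 0) { perror("rename"); return 1; }
    return 0;
}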
BTW, you might perhaps be interested in indexed files like GDBM etc., or even in databases like SQLite, MariaDB, MongoDB etc., and you might be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy each line to a temporary file until you reach the one to change, write the changed line to the temporary file instead, skip the original line, then copy the rest of the original file to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
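To illustrate the random-access idea, here is a sketch with a hypothetical fixed-size record type; with equal-sized records, fseek can jump straight to record n and overwrite it in place:
#include <stdio.h>

struct record {     /* hypothetical fixed-size record */
    int id;
    float value;
};

int update_record(const char *path, long n, const struct record *r)
{
    FILE *f = fopen(path, "rb+");   /* read/write without truncating */
    if (!f) return -1;
    if (fseek(f, n * (long)sizeof *r, SEEK_SET) != 0 ||
        fwrite(r, sizeof *r, 1, f) != 1) {
        fclose(f);
        return -1;
    }
    return fclose(f);
}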

What's a short way to prepend 3 bytes to the beginning of a binary file in C?

The straightforward way I know of is to create a new file, write the three bytes to it, and then read the original file into memory (in a loop) and write it out to the new file.
Is there a faster way that would either permit skipping the creation of a new file, or skip reading the original file into memory and writing back out again?
There is, unfortunately, no way (with POSIX or standard libc file APIs) to insert or delete a range of bytes in an existing file.
This isn't so much about C as about filesystems; there aren't many common filesystem APIs that provide shortcuts for prepending data, so in general the straightforward way is the only one.
You may be able to use some form of memory-mapped I/O appropriate to your platform, but this trades off one set of problems for another (such as, can you map the entire file into your address space or are you forced to break it up into chunks?).
You could open the file as read/write, read the first 4KB, seek back 4KB, write your three bytes, write the first (4KB - 3) bytes of the block you just read, and repeat the process (carrying the displaced 3 bytes into the next block) until you reach the end of the file.
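Here is a sketch of that shuffling, with the displaced 3-byte carry made explicit (error handling trimmed; prepend3 is a made-up helper name):
#include <stdio.h>
#include <string.h>

int prepend3(const char *path, const unsigned char prefix[3])
{
    FILE *f = fopen(path, "rb+");
    if (!f) return -1;

    unsigned char carry[3];
    unsigned char buf[4096 + 3];
    memcpy(carry, prefix, 3);
    long pos = 0;
    size_t n;

    do {
        fseek(f, pos, SEEK_SET);
        n = fread(buf + 3, 1, sizeof buf - 3, f);
        memcpy(buf, carry, 3);       /* carry followed by the data just read */
        memcpy(carry, buf + n, 3);   /* the last 3 bytes roll forward */
        fseek(f, pos, SEEK_SET);
        fwrite(buf, 1, n, f);        /* overwrite only what we already read */
        pos += n;
    } while (n == sizeof buf - 3);

    fseek(f, pos, SEEK_SET);
    fwrite(carry, 1, 3, f);          /* the file grows by 3 bytes */
    return fclose(f);
}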

What is the best way to truncate the beginning of a file in C?

There are many similar questions, but nothing that answers this specifically after googling around quite a bit. Here goes:
Say we have a file (could be binary, and much bigger too):
abcdefghijklmnopqrstuvwxyz
what is the best way in C to "move" the rightmost portion of this file to the left, truncating the beginning of the file? So, for example, "front truncating" 7 bytes would change the file on disk to be:
hijklmnopqrstuvwxyz
I must avoid temporary files, and would prefer not to use a large buffer to read the whole file into memory. One possible method I thought of is to use fopen with the "rb+" flag and repeatedly fseek back and forth, reading and writing to copy bytes from the offset back to the beginning, then SetEndOfFile to truncate at the end. That seems to be a lot of seeking (possibly inefficient).
Another way would be to fopen the same file twice, and use fgetc and fputc with the respective file pointers. Is this even possible?
If there are other ways, I'd love to read all of them.
You could mmap the file into memory and then memmove the contents. You would have to truncate the file separately.
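A sketch of that idea (assuming a POSIX system and a non-empty file that fits in the address space; front_truncate is a made-up name):
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int front_truncate(const char *path, off_t cut)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || cut > st.st_size) { close(fd); return -1; }

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    memmove(p, p + cut, st.st_size - cut);   /* slide the tail left */
    munmap(p, st.st_size);
    ftruncate(fd, st.st_size - cut);         /* drop the now-duplicate tail */
    return close(fd);
}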
You don't have to use an enormous buffer size, and the kernel is going to be doing the hard work for you, but yes, reading a bufferful from further up the file and writing it nearer the beginning is the way to do it if you can't afford the simpler job of creating a new file, copying what you want into that file, and then copying the new (temporary) file over the old one. I wouldn't rule out the possibility that the approach of copying what you want to a new file, and then either moving the new file in place of the old or copying the new over the old, will be faster than the shuffling process you describe. If the number of bytes to be removed were a disk block size, rather than 7 bytes, the situation might be different, but probably not. The only disadvantage is that the copying approach requires more intermediate disk space.
Your outline approach will require the use of truncate() or ftruncate() to shorten the file to the proper length, assuming you are on a POSIX system. If you don't have truncate(), then you will need to do the copying.
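And a sketch of the buffer-shuffling variant with ftruncate(), again with error handling trimmed:
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

int front_truncate_copy(const char *path, long cut)
{
    FILE *f = fopen(path, "rb+");
    if (!f) return -1;

    char buf[65536];
    long rd = cut, wr = 0;
    size_t n;

    for (;;) {
        fseek(f, rd, SEEK_SET);
        n = fread(buf, 1, sizeof buf, f);    /* read from further up... */
        if (n == 0) break;
        fseek(f, wr, SEEK_SET);
        fwrite(buf, 1, n, f);                /* ...write nearer the start */
        rd += n;
        wr += n;
    }
    fflush(f);
    ftruncate(fileno(f), wr);   /* chop off the leftover tail */
    return fclose(f);
}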
Note that opening the file twice will work OK if you are careful not to clobber the file when opening for writing - using "r+b" mode with fopen(), or avoiding O_TRUNC with open().
If you are using Linux, since kernel 3.15 you can use
#include <fcntl.h>
int fallocate(int fd, int mode, off_t offset, off_t len);
with the FALLOC_FL_COLLAPSE_RANGE flag.
http://manpages.ubuntu.com/manpages/disco/en/man2/fallocate.2.html
Note that not all file systems support it but most modern ones such as ext4 and xfs do.
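A sketch of its use for this problem (note that both offset and len must be multiples of the filesystem block size, so this only suits block-aligned cuts, not an arbitrary 7 bytes):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int collapse_front(const char *path, off_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    /* remove len bytes at offset 0 without rewriting the rest of the file */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len) != 0) {
        perror("fallocate");   /* e.g. EOPNOTSUPP, or EINVAL on misalignment */
        close(fd);
        return -1;
    }
    return close(fd);
}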

Pack files using C so that they can be unpacked to the original files

I have to pack a few files in such a way that at some later stage I can unpack them again to the original files using a C program. Please suggest an approach.
I suppose the explanation for wanting to write your own implementation might be curiosity.
Whether you add compression or not, if you simply want to store files in an archive, similar to the tar command, then you have a few possible approaches.
One of the fundamental choices you have to make is: how to demarcate the boundaries of the packed files within the archive? It is not a great idea to use a special character, because the packed files could contain any character to begin with.
To keep track of the end of files, you can use the length of the file in bytes. For example, you could, for each file:
Write to the archive the '\0' terminated C-string which names the packed file.
Write to the archive an off64_t which gives the length, in bytes, of the packed file.
Write to the archive the actual bytes (if any) of the packed file.
(Optional) Write to the archive a checksum or CRC of the packed file.
Repeatedly perform this for each file, concatenating the results with no intervening characters.
Finally, when no files remain, write an empty C-string, a zero character.
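A sketch of the packing step described above (checksums omitted; the length is written in host byte order for simplicity, and pack_file is a made-up helper that appends one entry to an already-open archive):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int pack_file(FILE *archive, const char *name)
{
    FILE *in = fopen(name, "rb");
    if (!in) return -1;

    fseek(in, 0, SEEK_END);
    int64_t size = ftell(in);    /* ftello would be safer for huge files */
    rewind(in);

    fwrite(name, 1, strlen(name) + 1, archive);  /* '\0'-terminated name */
    fwrite(&size, sizeof size, 1, archive);      /* length of the file */

    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, archive);              /* the actual bytes */

    fclose(in);
    return 0;
}
/* after the last file: fputc('\0', archive) marks the end of the archive */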
The unpacking process is:
Read the '\0'-terminated C-string which names this packed file.
If the name is empty, assert that we have read the entire archive, then exit.
Read the off64_t which gives the length of the packed file.
Read as many bytes as the packed file length from the archive and write to the newly-created unpacked file.
Again, repeat these steps until step (2) concludes the program.
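And a matching unpacking sketch (the fixed-size name buffer and the minimal error handling are simplifications):
#include <stdint.h>
#include <stdio.h>

int unpack_all(FILE *archive)
{
    for (;;) {
        char name[4096];
        size_t i = 0;
        int c;
        while ((c = fgetc(archive)) > 0 && i < sizeof name - 1)
            name[i++] = (char)c;         /* read the '\0'-terminated name */
        name[i] = '\0';
        if (i == 0)
            return 0;                    /* empty name: end of the archive */

        int64_t size;
        if (fread(&size, sizeof size, 1, archive) != 1)
            return -1;

        FILE *out = fopen(name, "wb");   /* the newly-created unpacked file */
        if (!out) return -1;
        for (int64_t k = 0; k < size; k++) {
            int b = fgetc(archive);
            if (b == EOF) { fclose(out); return -1; }
            fputc(b, out);               /* copy exactly size bytes */
        }
        fclose(out);
    }
}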
This design, in which file names alternate with file data, is workable. It has some drawbacks. The essential problem is that the data structure isn't designed for random access. In order to get the information about a file in the "middle" of the archive, a program is required to process the preceding files. The program can call lseek64 to skip reading file data that isn't needed, but a processor needs to read at least each file name and each file length. The file length is needed to skip over the file data. The file name, as I arranged the data, must be read in order to locate the file length.
So this is inefficient. Even if the file names did not have to be read in order to access the file size, the fact that the file details are sprinkled throughout the archive means that reading the index data requires accessing several ranges of data on the disk.
A better approach might be to write a "block" of index data to the front of the file. This data structure might be something like:
The size of the first file in the archive.
The name of the first file in the archive.
The position, in bytes, within this archive, where the "first file" may be located as a contiguous block of bytes.
The size of the second file in the archive...
And the data in the index might repeat until, again, a file with empty name marks the end of the index.
Having an index like this is nice, but presents a difficulty: when the user wishes to append a file to the archive, the index might need to grow in size. This could change the locations of the packed files within the archive -- the archive program may need to move them around to make room for a bigger index.
The file structure can get more and more complex in order to serve all these different needs. For example, the index can be designed so that it is always allocated out of what the file system considers a "page" (the amount the OS reads or writes from the disk as a minimum-size granule), and if the index needs to grow, discontiguous "index pages" are chained together by file-position data leading from one index page to another. (Like a linked list, but on disk.) The complexity can go on and on.
A fast solution would be to take advantage of an external library like zlib (usage example: http://zlib.net/zlib_how.html ) and use it for compression.
If you want to dig deeper into the topic of compression, have a look at the different lossless compression algorithms and further hints at Wikipedia - Data compression.
I wrote a tar-like program a couple of days ago; here is my implementation (hope you can get some ideas from it):
Each file is stored in the archive with a "header", which looks like:
<file-type,file-path,file-size,file-mode>
In file-type I used 0 for files and 1 for directories (this way you can recreate the directory tree).
For example, the header of a file named foo.txt of size 245 bytes with mode 0755 (on Unix; see chmod) will look like:
<0,foo.txt,245,0755>
followed by the file contents.
This way, the first character of the archive is always a <. You then parse the comma-separated list (a first possible bug, see below) and extract the file type, the path, the size (which you will use to read the next size bytes from the archive, avoiding the "special character" bug pointed out by Heath Hunnicutt) and the mode of the file (say you have a binary file and you want it to be executable when extracted too: you need to chmod it with the original file mode).
About the first possible bug: a comma is not commonly used in a file name, but it's probably better to use another separator, or to "sanitize" the path by wrapping it in a pair of quotation marks; obviously the parser should be aware of this and ignore any comma inside the quotes.
For writing and reading files in C see fgetc and fputc from stdio.h
To get file info, change modes, and walk the directory tree, see stat and chmod from sys/stat.h and ftw from ftw.h (probably Linux/Unix only, because these are system calls).
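Putting some of those calls together, a sketch of a header writer for this format might be (commas in the path are left unescaped, as discussed above):
#include <stdio.h>
#include <sys/stat.h>

int write_header(FILE *archive, const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) return -1;

    /* <file-type,file-path,file-size,file-mode> */
    fprintf(archive, "<%d,%s,%lld,%04o>",
            S_ISDIR(st.st_mode) ? 1 : 0, path,
            (long long)st.st_size, (unsigned)(st.st_mode & 07777));
    return 0;
}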
Hope it helps! (If you need some code I can post some snippets; the header parsing is probably the hardest part.)
