Pack files using c so that can be unpacked to original files - c

I have to pack a few files in such a way that at some later stage I can unpack them back into the original files using a C program. Please suggest an approach.

I suppose the explanation for wanting to write your own implementation might be curiosity.
Whether you add compression or not, if you simply want to store files in an archive, similar to the tar command, then you have a few possible approaches.
One of the fundamental choices you have to make is: how to demarcate the boundaries of the packed files within the archive? It is not a great idea to use a special character, because the packed files could contain any character to begin with.
To keep track of the end of files, you can use the length of the file in bytes. For example, you could, for each file:
1. Write to the archive the '\0'-terminated C-string which names the packed file.
2. Write to the archive an off64_t which gives the length, in bytes, of the packed file.
3. Write to the archive the actual bytes (if any) of the packed file.
4. (Optional) Write to the archive a checksum or CRC of the packed file.
Repeatedly perform this for each file, concatenating the results with no intervening characters.
Finally, when no files remain, write an empty C-string, a zero character.
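The packing steps above could be sketched roughly as follows. This is only an illustration: the names pack_record and pack_end are made up here, the optional checksum is left out, and the length is written as a fixed 8-byte little-endian integer rather than a raw off64_t (an assumption, so that the archive does not depend on the machine's byte order). It also finds the length with fseek/ftell, which limits file size to what a long can hold.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Append one record to an open archive: name + '\0', 8-byte length, data.
 * Returns 0 on success, -1 on error. pack_record is a hypothetical name. */
int pack_record(FILE *archive, const char *name, FILE *in)
{
    /* The empty name is reserved for the end-of-archive marker. */
    assert(name != NULL && name[0] != '\0');

    /* Find the input length by seeking to its end (works for regular files). */
    if (fseek(in, 0, SEEK_END) != 0) return -1;
    long len = ftell(in);
    if (len < 0) return -1;
    rewind(in);

    /* Step 1: the name, including its terminating '\0'. */
    size_t nlen = strlen(name) + 1;
    if (fwrite(name, 1, nlen, archive) != nlen) return -1;

    /* Step 2: the length, as 8 little-endian bytes (assumed encoding). */
    unsigned char lenbuf[8];
    for (int i = 0; i < 8; i++)
        lenbuf[i] = (unsigned char)(((uint64_t)len >> (8 * i)) & 0xff);
    if (fwrite(lenbuf, 1, 8, archive) != 8) return -1;

    /* Step 3: the file's bytes. */
    char buf[4096];
    size_t n;
    uint64_t copied = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, archive) != n) return -1;
        copied += n;
    }
    return (ferror(in) || copied != (uint64_t)len) ? -1 : 0;
}

/* Write the end marker: an empty C-string, i.e. a single zero byte. */
int pack_end(FILE *archive)
{
    return fputc('\0', archive) == EOF ? -1 : 0;
}
```

A caller would open the archive with "wb", call pack_record once per input file (each opened with "rb"), and finish with pack_end.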
The unpacking process is:
1. Read the '\0'-terminated C-string which names this packed file.
2. If the name is empty, assert that we have read the entire archive, then exit.
3. Read the off64_t which gives the length of the packed file.
4. Read as many bytes as the packed file length from the archive and write to the newly-created unpacked file.
Again, repeat these steps until step (2) concludes the program.
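A minimal sketch of the unpacking loop might look like this. The function names are hypothetical, and the length field is assumed to be stored as 8 little-endian bytes instead of a raw off64_t; any real archive would have to agree with its writer on that encoding. Note that unpack_all trusts the stored names as-is, which a real tool should not do (a name could contain ".." or an absolute path).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read "name\0" and the 8-byte little-endian length (assumed encoding).
 * Returns 1 for a record, 0 for the end marker (empty name), -1 on error. */
int read_header(FILE *ar, char *name, size_t cap, uint64_t *len)
{
    size_t i = 0;
    int c;
    assert(cap > 0);
    while ((c = fgetc(ar)) != EOF && c != '\0') {
        if (i + 1 >= cap) return -1;      /* name too long for the buffer */
        name[i++] = (char)c;
    }
    if (c == EOF) return -1;              /* truncated archive */
    name[i] = '\0';
    if (i == 0) return 0;                 /* empty name: end of archive */

    unsigned char b[8];
    if (fread(b, 1, 8, ar) != 8) return -1;
    *len = 0;
    for (int k = 7; k >= 0; k--)
        *len = (*len << 8) | b[k];
    return 1;
}

/* Copy exactly len bytes from the archive to out. */
int copy_bytes(FILE *ar, FILE *out, uint64_t len)
{
    char buf[4096];
    while (len > 0) {
        size_t want = len < sizeof buf ? (size_t)len : sizeof buf;
        size_t n = fread(buf, 1, want, ar);
        if (n == 0) return -1;            /* archive ended early */
        if (fwrite(buf, 1, n, out) != n) return -1;
        len -= n;
    }
    return 0;
}

/* Unpack every record into the current directory. */
int unpack_all(FILE *ar)
{
    char name[4096];
    uint64_t len;
    int r;
    while ((r = read_header(ar, name, sizeof name, &len)) == 1) {
        FILE *out = fopen(name, "wb");
        if (!out) return -1;
        int rc = copy_bytes(ar, out, len);
        if (fclose(out) != 0 || rc != 0) return -1;
    }
    if (r < 0) return -1;
    return fgetc(ar) == EOF ? 0 : -1;     /* nothing may follow the end marker */
}
```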
This design, in which file names alternate with file data, is workable. It has some drawbacks. The essential problem is that the data structure isn't designed for random access. In order to get the information about a file in the "middle" of the archive, a program is required to process the preceding files. The program can call lseek64 to skip reading file data that isn't needed, but it still needs to read at least each file name and each file length. The file length is needed to skip over the file data. The file name, as I arranged the data, must be read in order to locate the file length.
So this is inefficient. Even if the file names did not have to be read in order to access the file size, the fact that the file details are sprinkled throughout the archive means that reading the index data requires accessing several ranges of data on the disk.
A better approach might be to write a "block" of index data to the front of the file. This data structure might be something like:
The size of the first file in the archive.
The name of the first file in the archive.
The position, in bytes, within this archive, where the "first file" may be located as a contiguous block of bytes.
The size of the second file in the archive...
And the data in the index might repeat until, again, a file with empty name marks the end of the index.
Having an index like this is nice, but presents a difficulty: when the user wishes to append a file to the archive, the index might need to grow in size. This could change the locations of the packed files within the archive -- the archive program may need to move them around to make room for a bigger index.
The file structure can get more and more complex in order to serve all these different needs. For example, the index can be designed so that it is always allocated out of what the file system considers a "page" (the amount the OS reads or writes from the disk as a minimum-size granule), and if the index needs to grow, discontiguous "index pages" are chained together by file-position data leading from one index page to another. (Like a linked list, but on disk.) The complexity can go on and on.

A fast solution would be to take advantage of an external library like zlib (usage example: http://zlib.net/zlib_how.html ) and use it for compression.
If you want to dig deeper into the topic of compression, have a look at the different lossless compression algorithms and further hints at Wikipedia - Data compression.

I wrote a tar-like program a couple of days ago; here is my implementation (hope you can get some ideas):
Each file is stored in the archive with a "header", which looks like:
<file-type,file-path,file-size,file-mode>
For file-type I used 0 for files and 1 for directories (this way you can recreate the directory tree).
For example, the header of a file named foo.txt of size 245 bytes with mode 0755 (on Unix, see chmod) will look like:
<0,foo.txt,245,0755>
here the file contents
in this way, the first character of the file archive is always a <, then you parse the list separated by commas (first possible bug) and extract the file type, the path, the size (which you will use to read the next size bytes from the archive - to avoid the "special character bug" pointed out by Heath Hunnicutt) and the mode of the file (let's say you have a binary file and you want to have it executable when extracted too, you need to chmod it with the original file mode).
About the first possible bug: a comma is not commonly used in a file name, but it's probably better to use another separator character or to wrap the path in a pair of double quotes. Obviously the parser has to be aware of this and ignore any comma inside the quotes.
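As an illustration of the header parsing, here is a minimal sketch using sscanf. The struct and function names are invented for this example, and it deliberately keeps the limitation discussed above: a path containing a comma will not parse correctly.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of a <type,path,size,mode> header. */
struct entry_header {
    int type;            /* 0 = file, 1 = directory */
    char path[256];
    long size;           /* number of content bytes following the header */
    unsigned int mode;   /* permission bits, written in octal in the archive */
};

/* Parse a header at the start of s.
 * Returns the number of characters consumed, or -1 if it is malformed. */
int parse_header(const char *s, struct entry_header *h)
{
    int end = -1;
    assert(s != NULL && h != NULL);

    /* %255[^,] reads the path up to the next comma; %o reads the mode as
     * octal; %n records how far we got, and is only reached if '>' matched. */
    int n = sscanf(s, "<%d,%255[^,],%ld,%o>%n",
                   &h->type, h->path, &h->size, &h->mode, &end);
    if (n != 4 || end < 0) return -1;
    if (h->type != 0 && h->type != 1) return -1;
    if (h->size < 0) return -1;
    return end;
}
```

For the example header `<0,foo.txt,245,0755>`, the returned offset is where the 245 bytes of file content begin.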
For writing and reading files in C see fgetc and fputc from stdio.h
To get file information, change permissions and walk the directory tree, see stat and chmod from sys/stat.h and ftw from ftw.h (probably Linux/Unix only, since these are POSIX functions).
Hope it helps! (if you need some code i can post some snippets, the header parsing is probably the hardest part).

Related

Why is in-place replacing so hard in files?

I have a very large CSV file that I want to import straight into Postgresql with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a text file, and making changes to the first line of the file implies random access to the first "record" of the file. But text files don't really have "records"; they have lines, of unequal length. So changing the first line implies reading the file up to the first newline, putting something in its place, and then moving all of the rest of the data in the file to the left if the replacement is shorter, or to the right if it is longer. To do that you have two choices: (1) read the entire file into memory so you can do the left or right shift, or (2) read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.

What is the real size of file?

How is it possible that a text file has a size equal to the number of characters inside it? For example, if file.txt contains the string "abc", its size is 3 bytes. That's fine, but what about the file icon, the filename and the other file information? Where is that data stored?
I checked this on Windows, but on Unix systems the situation is probably the same.
When the file is written to disk, it is by means of low level system call like write() and operating systems know exactly how many bytes they write in a given file on a disk. This information, as well as several others (creation and modification date, ownership, etc) is written with the file.
In Linux (and Unix generally), this is done by means of an inode that fully describes the file. The information stored in an inode includes:
* access mode
* ids of user and group that owns the file
* size in bytes
* dates of last status change, modification and access
* list of disk blocks containing file data
This is more or less the information displayed by ls -l
You can also see inode number of each file with ls -i
You can find here additional details on inodes.
Other information is stored differently. Names, for instance, are kept only in the special files describing a directory, not in the inode. A directory is in effect a list that associates a name with an inode.
Icons are generally defined system wide and the association of an icon with a file is done with either filename (and file extension) or with a file "type" that is written in the "inode" (or its equivalent in other OS).
Disks allocate space in blocks. Blocks historically were 512 bytes, but that has increased over the years so that 4K is now common. The space allocated to your file will always be a multiple of the block size, even though its reported length is not.
Most file systems (including those on Windows) allocate disk space in clusters. A cluster is a number of adjacent blocks. The space your file occupies will then always be a multiple of the block size times the cluster factor. This is usually what the operating system reports as the file's size on disk, as opposed to its length in bytes.
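On a POSIX system, the difference between the length in bytes and the allocated space can be observed with stat. The sketch below uses made-up helper names and assumes that st_blocks counts 512-byte units, which is the case on Linux but is not guaranteed everywhere.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Report a file's logical length and the space allocated for it.
 * Returns 0 on success, -1 if stat fails. file_sizes is a hypothetical name. */
int file_sizes(const char *path, long long *logical, long long *allocated)
{
    struct stat st;
    assert(path && logical && allocated);
    if (stat(path, &st) != 0) return -1;
    *logical = (long long)st.st_size;            /* what ls -l shows */
    *allocated = (long long)st.st_blocks * 512;  /* assumes 512-byte units */
    return 0;
}

/* Print both numbers for one file. */
int print_sizes(const char *path)
{
    long long logical, allocated;
    if (file_sizes(path, &logical, &allocated) != 0) {
        perror(path);
        return -1;
    }
    printf("%s: %lld bytes long, %lld bytes allocated\n",
           path, logical, allocated);
    return 0;
}
```

For a file containing just "abc", the length is 3, while the allocated figure depends on the file system: commonly one full block, though some file systems store very small files inline and report 0.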
This all depends upon the disk format and the operating system:
Everything fine, but what with file icon, filename and file informations? Where these data has been stored?
The file information (date, owner, etc.) is usually kept in some kind of master file table. Sometimes this table will have extensions where additional information can be stored. Security information is often stored in such overflows.
A rationally designed file system will have "a" filename stored in the header. File names are also stored in directories, and a file can have multiple names if it is linked into multiple directories. The header file name is used to restore the file in case of corruption.
The location of an icon is entirely system specific and can be done in many ways. In the case of executable files, they are often stored in the file itself. They can also be hidden files in the same directory.

Text files edit C

I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters into a file. You will have to write a program that reads the whole file and copies to a new file the part before the insertion point, then your edit, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and doesn't know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) treat text files specially (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines notably getline (or perhaps fgets).
If you use getline, you should take care to free the line buffer.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and the <stdio.h> functions, and partly by you when you deal with lines) is negligible by comparison.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB etc., and in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to write the lines before it to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records that are all the same size. If you have control over creating the original file, these might be better for your current purpose.
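The temporary-file approach could be sketched like this. The function name replace_line is invented for the example, and it replaces the whole target line rather than only its first two words; building the replacement text (for instance with snprintf from the new int and float plus the untouched remainder) is left to the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out`, writing `replacement` plus a newline in place of
 * line number `target` (1-based).
 * Returns 0 if the line was replaced, 1 if the input has fewer lines,
 * -1 on a write error. */
int replace_line(FILE *in, FILE *out, long target, const char *replacement)
{
    long line = 1;
    int c, replaced = 0;
    assert(target >= 1 && replacement != NULL);

    while ((c = fgetc(in)) != EOF) {
        if (line == target) {
            if (!replaced) {
                /* Emit the new text once, then silently drop the old line. */
                if (fputs(replacement, out) == EOF) return -1;
                if (fputc('\n', out) == EOF) return -1;
                replaced = 1;
            }
        } else if (fputc(c, out) == EOF) {
            return -1;
        }
        if (c == '\n') line++;   /* a newline ends the current line */
    }
    return replaced ? 0 : 1;
}
```

In a real program, `in` would be the original file opened with "r" and `out` a temporary file in the same directory; after closing both, rename(temp_path, original_path) puts the edited copy in place.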

Copy sparse files

I'm trying to understand Linux (UNIX) low-level interfaces and as an exercise want to write a code which will copy a file with holes into a new file (again with holes).
So my question is, how to read from the first file not till the first hole, but till the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking right byte by byte and trying to read this byte, but then I have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek to jump to the beginning or end of a hole, as described in great detail in the man page.
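A rough sketch of that approach, using the SEEK_DATA and SEEK_HOLE values of lseek documented in the Linux man page, is shown below. The function names are made up for this example. On file systems that do not track holes, lseek typically reports the whole file as one data region, so the result is still a correct (just not sparse) copy.

```c
#define _GNU_SOURCE            /* exposes SEEK_DATA / SEEK_HOLE on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy only the data regions of src into dst, leaving holes unwritten.
 * Returns 0 on success, -1 on error. */
int copy_sparse(int src, int dst)
{
    char buf[65536];
    off_t end = lseek(src, 0, SEEK_END);
    if (end < 0) return -1;

    off_t pos = 0;
    while (pos < end) {
        off_t data = lseek(src, pos, SEEK_DATA);   /* next byte of real data */
        if (data < 0) {
            if (errno == ENXIO) break;             /* only a hole remains */
            return -1;
        }
        off_t hole = lseek(src, data, SEEK_HOLE);  /* end of this data run */
        if (hole < 0) return -1;
        if (lseek(src, data, SEEK_SET) < 0) return -1;
        if (lseek(dst, data, SEEK_SET) < 0) return -1;  /* skip = make a hole */

        while (data < hole) {
            size_t want = (size_t)(hole - data) < sizeof buf
                        ? (size_t)(hole - data) : sizeof buf;
            ssize_t n = read(src, buf, want);
            if (n <= 0) return -1;
            for (ssize_t off = 0; off < n; ) {     /* handle short writes */
                ssize_t w = write(dst, buf + off, (size_t)(n - off));
                if (w < 0) return -1;
                off += w;
            }
            data += n;
        }
        pos = hole;
    }
    /* Set the final length, which also creates any trailing hole. */
    return ftruncate(dst, end);
}

/* Convenience wrapper that opens both files by name. */
int copy_sparse_file(const char *from, const char *to)
{
    assert(from && to);
    int src = open(from, O_RDONLY);
    if (src < 0) return -1;
    int dst = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dst < 0) { close(src); return -1; }
    int rc = copy_sparse(src, dst);
    close(src);
    if (close(dst) != 0) rc = -1;
    return rc;
}
```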
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
For copies of sparse files, from the cp manual;
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (although this still seems to depend on a heuristic).
A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk instead, seeking to sectors on your own.
If your meaning of "holes" in a file is as @ThiefMaster says, just areas of 0 bytes, then these are only "holes" according to your application's use of the data; through the normal read interface they're just bytes in a file, no different from any other. In this case, you can copy the file with a simple read from the source and write to the target, and you will get a full copy (along with what you're calling holes).

What's a short way to prepend 3 bytes to the beginning of a binary file in C?

The straightforward way I know of is to create a new file, write the three bytes to it, and then read the original file into memory (in a loop) and write it out to the new file.
Is there a faster way that would either permit skipping the creation of a new file, or skip reading the original file into memory and writing back out again?
There is, unfortunately, no way (with POSIX or standard libc file APIs) to insert or delete a range of bytes in an existing file.
This isn't so much about C as about filesystems; there aren't many common filesystem APIs that provide shortcuts for prepending data, so in general the straightforward way is the only one.
You may be able to use some form of memory-mapped I/O appropriate to your platform, but this trades off one set of problems for another (such as, can you map the entire file into your address space or are you forced to break it up into chunks?).
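You could open the file as read/write and shift the contents in place: read the first 4KB, seek back to the start, write your three bytes followed by the first (4KB - 3) bytes of what you just read, and keep the last three bytes to write in front of the next block. Repeat the process until you reach the end of the file. This avoids creating a second file, but it still reads and rewrites every byte.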
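A minimal sketch of that in-place shift is given below, with a hypothetical function name and a 4KB block size. Each pass writes the bytes carried over from the previous block, then all but the last k bytes of the current block, and carries those last k bytes forward. It is not crash-safe: if the program is interrupted midway, the file is left partly shifted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 4096

/* Insert k bytes at the start of a file opened with "r+b" (or "w+b"),
 * shifting the existing contents right one block at a time.
 * Requires 0 < k <= BLOCK. Returns 0 on success, -1 on error. */
int prepend_in_place(FILE *f, const unsigned char *prefix, size_t k)
{
    unsigned char carry[BLOCK], buf[BLOCK];
    long pos = 0;
    assert(f != NULL && prefix != NULL && k > 0 && k <= BLOCK);
    memcpy(carry, prefix, k);           /* bytes waiting to be written */

    for (;;) {
        /* Read the next block before overwriting it. */
        if (fseek(f, pos, SEEK_SET) != 0) return -1;
        size_t n = fread(buf, 1, BLOCK, f);
        if (ferror(f)) return -1;

        /* Go back and write the carried bytes over the start of the block. */
        if (fseek(f, pos, SEEK_SET) != 0) return -1;
        if (fwrite(carry, 1, k, f) != k) return -1;

        if (n < k) {
            /* Tail of the file: everything that is left fits after the carry. */
            if (fwrite(buf, 1, n, f) != n) return -1;
            return fflush(f) == 0 ? 0 : -1;
        }

        /* Write all but the last k bytes, and carry those k bytes forward. */
        if (fwrite(buf, 1, n - k, f) != n - k) return -1;
        memcpy(carry, buf + (n - k), k);
        pos += (long)n;
    }
}
```

For the three-byte case in the question, the call would be something like prepend_in_place(f, bytes, 3) on a file opened with fopen(path, "r+b").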

Resources