I'm writing a program in C that creates an archive file from a given list of file names, much like the ar command on Linux. This is what the archive file looks like:
!<arch>
file1.txt/ 1350248044 45503 13036 100660 28 `
hello
this is sample file 1
file2.txt/ 1350512270 45503 13036 100660 72 `
hello
this is sample file 2
this file is a little larger than file1.txt
But I'm having difficulty extracting a file from the archive. Let's say the user wants to extract file1.txt. The idea is that the program should get the index/location of the file name (in this case file1.txt), skip 58 characters to reach the content of the file, read the content, and write it to a new file. So here are my questions:
1) How can I get the index/location of the file name in the archive file? Note that duplicate file names are NOT allowed, so I don't have to worry about having two different indices.
2) How can I skip several characters (in this case 58) when reading a file?
3) How can I figure out when the content of a file ends? i.e. I need it to read the content and stop right before the file2.txt/ header.
My approach to solving this problem would be:
Keep header information that contains the size of each file, its name, and its location in the archive.
Then parse the header and use fseek() and ftell(), together with fgetc() or fread(), to read the bytes of the file you want, then create a new file and write that data to it. This is the simplest way I can think of.
http://en.wikipedia.org/wiki/Ar_(Unix)#File_header <- Header of ar archives.
EXAMPLE:
@programmer93 Suppose your header is 80 bytes long (the header contains the metadata of the archive file). You have two files, one of 112 bytes and the other of 182 bytes, laid out one after the other in a flat file (the archive file). The layout is then 80 (header), 112 (file1.txt), 182 (file2.txt), EOF. If you know the size of each file, you can navigate with fseek() to a particular file and extract only that file. To extract file2.txt, I would call fseek(fp, 80 + 112, SEEK_SET) and then call fgetc() 182 times.
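The seek-then-read step from the example can be sketched as below. This is only an illustration: copy_range is a made-up helper name, and it reads in chunks with fread() instead of calling fgetc() once per byte, which gives the same result.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy `size` bytes starting at byte `offset` of `in`
   to `out`. Returns 0 on success, -1 on a seek, read, or write error. */
int copy_range(FILE *in, long offset, size_t size, FILE *out)
{
    char buf[4096];

    assert(in != NULL && out != NULL);
    if (fseek(in, offset, SEEK_SET) != 0)
        return -1;

    while (size > 0) {
        size_t want = size < sizeof buf ? size : sizeof buf;
        size_t got = fread(buf, 1, want, in);

        if (got == 0)
            return -1;              /* archive ended early, or read error */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        size -= got;
    }
    return 0;
}
```

With the sizes from the example, extracting file2.txt would be copy_range(archive, 80 + 112, 182, out).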
If the format of the file cannot be changed by adding additional header information to help, you'll have to search through it and work things out as you go.
This should not be too hard. Just read the file, and when you read a header line such as
file1.txt/ 1350248044 45503 13036 100660 28 `
you can check the filename and size etc. (You know you'll have a header line at the start after the !<arch>). If this is the file you want, the ftell() function from stdio.h will tell you exactly where you are in the file. Since the file size in bytes is given in the header line, you can read the file by reading that particular number of bytes ahead in the normal manner. Similarly, if it is not the file you want, you can use fseek() to move forward the number of bytes in the file you are skipping and be ready to read in the header info for the next file and repeat the process.
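A rough sketch of that scan, assuming the common ar layout described on the Wikipedia page linked above: an 8-byte "!<arch>" line, then a 60-byte header per member (16-byte name, 12-byte timestamp, 6-byte owner, 6-byte group, 8-byte mode, 10-byte decimal size, 2-byte terminator), with member data padded to an even length. If your own archiver writes a different header length, adjust the constants. ar_find_member is a hypothetical name, and long-name extensions are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define AR_MAGIC     "!<arch>\n"
#define AR_MAGIC_LEN 8
#define AR_HDR_LEN   60

/* Hypothetical helper: position `ar` at the first data byte of member
   `name` and store its size. Returns 0 if found, -1 otherwise. */
int ar_find_member(FILE *ar, const char *name, long *size_out)
{
    char magic[AR_MAGIC_LEN];
    char hdr[AR_HDR_LEN];
    size_t name_len;

    assert(ar != NULL && name != NULL && size_out != NULL);
    name_len = strlen(name);

    if (fseek(ar, 0L, SEEK_SET) != 0 ||
        fread(magic, 1, AR_MAGIC_LEN, ar) != AR_MAGIC_LEN ||
        memcmp(magic, AR_MAGIC, AR_MAGIC_LEN) != 0)
        return -1;

    while (fread(hdr, 1, AR_HDR_LEN, ar) == AR_HDR_LEN) {
        char size_field[11];
        long size;

        if (hdr[58] != '`' || hdr[59] != '\n')
            return -1;                      /* not a member header */

        memcpy(size_field, hdr + 48, 10);   /* size: decimal, space padded */
        size_field[10] = '\0';
        size = strtol(size_field, NULL, 10);
        if (size < 0)
            return -1;

        /* The name sits at the start of the header, ended by '/' or a space. */
        if (name_len < 16 && memcmp(hdr, name, name_len) == 0 &&
            (hdr[name_len] == '/' || hdr[name_len] == ' ')) {
            *size_out = size;
            return 0;                       /* stream now sits on the data */
        }

        /* Not the member we want: skip its data plus the padding byte. */
        if (fseek(ar, size + (size & 1), SEEK_CUR) != 0)
            return -1;
    }
    return -1;
}
```

After a successful call, ftell() gives the position of the content, and reading exactly *size_out bytes stops right before the next header, which answers questions 1 to 3 together.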
I am reading .docx file in a buffer and writing it to a new file successfully. (Using fread and fwrite in C) However now I want to enhance the scope of this project for the purpose of encryption. For which I want to be able to manipulate the buffer, then write it in new file.
Now one question might be, what manipulation do I need?
It could be anything really, like I'd write character 's' in buffer's location 15. Like below, and then write this new buffer (having character 's' at location 15, but the rest of the buffer remains unchanged) in a new .docx file.
buffer[15] = 's';
When I did this, the file that was created was corrupt. Since I am not fully aware of the structure of .docx file, this byte number 15 could be some potential identifier, or header, or any important information of .docx file needed for creating a non-corrupt file.
However, the things I know about .docx internal structure are:
It consists of XML files, zipped together.
The content written in the .docx file is stored in those XML files. For example, if I have a file named test.docx that contains "Hello, how are you?", then the text "Hello, how are you?" is stored in one of the XML files.
Among the files that are zipped together there is a file with a .rels extension (not confirmed) that tells MS Word where the content is stored in the file, i.e. where to look for it.
Apart from these 3 points I don't know much about structure of .docx file. Now considering all this, I want to be able to extract the contents of .docx file, from the XML files zipped together, read it (in C) in a buffer, change the buffer as I need it, and create a new file, with the new content that is present in the buffer.
Can someone guide me through this?
Also kindly mention, if I need to provide code, or any other essential details. Thanks in advance.
EDIT
PURPOSE OF ALL THIS:
I want to do all this for encryption. Encrypting a file (using AES) makes the whole file unreadable, since every byte in it changes. When I decrypt that file, it cannot be opened. My guess is that the AES decryption algorithm does not know how to arrange the recovered contents into a new .docx file, so it is unable to put the contents/structure back in the proper place.
I have tried it. Original docx file was of 14 KB, encrypted docx file was of 14 KB as well as the decrypted docx file. But when I try to open the decrypted file, it says file is corrupt. Also I tried to check it in HEX editor. Decrypted file has just 00 bytes after exactly 30 Bytes.
DOCX files are based on OPC and OOXML. OPC is based on Zip. OOXML is based on XML. Therefore, you can use Zip and XML tools to operate on DOCX files. Beyond this, you'll have to be more specific about what you wish to do in order to receive better guidance.
Poking characters into random index locations in an XML file is operating at the wrong level of abstraction.
I have a project for school which implies making a c program that works like tar in unix system. I have some questions that I would like someone to explain to me:
The size of the archive. I understood (from browsing the internet) that an archive consists of a defined number of blocks of 512 bytes each. So the header takes 512 bytes, then it's followed by the content of the file (if there is only one file to archive) organized in blocks of 512 bytes, then 2 more blocks of 512 bytes.
For example: let's say that I have a txt file of 0 bytes to archive. This should mean 512*3 bytes are used. Why, when I create the archive with the tar command in unix and click properties, does it have 10240 bytes? I think it adds some 0 (NULL) bytes, but I don't know where, why, or how many...
The header checksum. As far as I know this should be the size of the archive. When I check it with hexdump -C it appears to be a number near the real size of the archive (as shown when clicking properties). For example 11200 or 11205 or something similar if I archive a 0 byte txt file. Is this size in octal or decimal? My bet is that it is in octal, because all the information you put in the header needs to be in octal. My second question at this point is: what is added on top of the original size of 10240 bytes?
Header mode. Let's say that I have a file with mode 664; the file format will be 0, so I should put 0664 in the header. Why, in an authentic archive, are 3 more 0s printed at the start (000064)?
There have been various versions of the tar format, and not all of the extensions to previous formats were always compatible with each other. So there's always a bit of guessing involved. For example, in very old unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended but there wasn't space, so the file name got split in 2 parts; even later, GNU tar introduced the ././@LongLink pseudo-entries that would make older tars at least restore the file to its original name.
1) Tar was originally a *T*ape *Ar*chiver. To achieve constant throughput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. The default was 20 blocks of 512 bytes (10240 bytes), and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware, and using a wrong blocking factor made the resulting tape unusable. This is why tar appends \0-filled blocks until the tar size is a multiple of the block size.
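The arithmetic behind the 10240 bytes can be sketched like this. It assumes a single archived file, the layout from the question (one header block, data padded to 512 bytes, two end-of-archive blocks), and the blocking factor described above; tar_archive_size is a hypothetical helper name.

```c
#include <assert.h>

#define TAR_BLOCK 512L

/* Round n up to the next multiple of m (m > 0). */
static long round_up(long n, long m)
{
    return (n + m - 1) / m * m;
}

/* Hypothetical helper: size in bytes of a tar archive holding one file of
   `file_size` bytes, written `blocking_factor` 512-byte blocks at a time. */
long tar_archive_size(long file_size, long blocking_factor)
{
    long payload;

    assert(file_size >= 0 && blocking_factor > 0);
    payload = TAR_BLOCK                         /* header block */
            + round_up(file_size, TAR_BLOCK)    /* data, padded to 512 */
            + 2 * TAR_BLOCK;                    /* end-of-archive blocks */

    /* Pad with zero blocks up to a whole record. */
    return round_up(payload, blocking_factor * TAR_BLOCK);
}
```

For an empty file the payload is 512*3 = 1536 bytes, and with the default factor of 20 that is rounded up to 10240.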
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but then stored in the header as well. So the act of storing the checksum would change the header, thus invalidate the checksum. That's why you store all other header fields first, set the checksum to spaces, then calculate the checksum, then replace the spaces with your calculated value.
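As a sketch of that procedure: the code below assumes the usual 512-byte header with the 8-byte checksum field at offset 148, and stores the value as six octal digits followed by a NUL and a space, which is a common convention but not the only one tars accept. tar_set_checksum is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TAR_BLOCK     512
#define CHKSUM_OFFSET 148   /* assumed position of the checksum field */
#define CHKSUM_LEN    8

/* Hypothetical helper: compute the checksum of a header whose other fields
   are already filled in, store it in the header, and return it. */
unsigned tar_set_checksum(unsigned char header[TAR_BLOCK])
{
    unsigned sum = 0;
    char field[CHKSUM_LEN];
    int i;

    assert(header != NULL);

    /* Step 1: treat the checksum field as eight spaces. */
    memset(header + CHKSUM_OFFSET, ' ', CHKSUM_LEN);

    /* Step 2: add up every byte of the header as an unsigned value. */
    for (i = 0; i < TAR_BLOCK; i++)
        sum += header[i];

    /* Step 3: replace the spaces with the octal value. */
    snprintf(field, sizeof field, "%06o", sum);
    memcpy(header + CHKSUM_OFFSET, field, 7);   /* six digits plus NUL */
    header[CHKSUM_OFFSET + 7] = ' ';
    return sum;
}
```

Because the sum is taken over the header bytes only, it has nothing to do with the archive size; it only looks like a size because it is printed as an octal number.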
Note that the header of a tarred file is pure ASCII. This way, in the old days, when a tar file (whose components were plain ASCII) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
3) Tar files can store block devices, character devices, directories and such stuff. Unix stores these file modes in the same place as the permission flags, and the header file mode contains the whole file mode, including file type bits. That's why the number is longer than the pure permission.
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.
How does one create a file header in C, so that the file type can be detected when the header is read?
What is the correct way to do this, are there any standards to follow?
I would like to add a small header to my file so the file type can be detected when reading the header.
Update (if you don't have the hat):
I want to add a header for my own file format (not a .c or .h file), using C, and I will be using C to read the file, identify it and process it.
You could just write some custom data at the beginning of your file just like you would store any other data.
For example, the PGM format specifies that the dimensions of the picture and the maximum value are stored in the first lines:
P2
# Shows the word "FEEP" (example from Netpbm main page on PGM)
24 7
15
... picture data continues from here
There are no standards that would specify making this kind of header since it is very rare to do such a thing. In the case of PGM pictures, you wouldn't know the dimensions of the picture without this header: you would read 12 bytes but you wouldn't know whether the picture is 3x4 or 6x2...
Note that this kind of custom data is something you have to expect at the beginning of the file when you are reading it. You can make up a custom header for your files, but then make sure that the people who are going to use your files know about it.
Many file formats start off with a small ASCII code or recognisable number to make it identifiable if it is opened by an editor or hex editor. These are also sometimes called “magic numbers”, or “file signatures”. For example:
The first three bytes of a GIF file are GIF, followed by a three-character version (87a or 89a).
The first two bytes of a zip file are PK (the initials of Phil Katz, the original author of the ZIP format)
The first six bytes of Apple's binary plist file format are bplist
There's a comprehensive list here. What usually follows is information about what the file contains, like a table of contents, and then after that your actual data.
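Writing and checking such a signature in C could look roughly like this. The four bytes "MYF1" and the function names are made up for illustration; pick any signature that is unlikely to collide with an existing format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAGIC     "MYF1"    /* hypothetical 4-byte signature */
#define MAGIC_LEN 4

/* Write the signature at the current position (normally offset 0).
   Returns 0 on success, -1 on a write error. */
int write_magic(FILE *fp)
{
    assert(fp != NULL);
    return fwrite(MAGIC, 1, MAGIC_LEN, fp) == MAGIC_LEN ? 0 : -1;
}

/* Return 1 if the file starts with the signature, 0 otherwise.
   On success the stream is left just after the signature. */
int has_magic(FILE *fp)
{
    char buf[MAGIC_LEN];

    assert(fp != NULL);
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    if (fread(buf, 1, MAGIC_LEN, fp) != MAGIC_LEN)
        return 0;                   /* file shorter than the signature */
    return memcmp(buf, MAGIC, MAGIC_LEN) == 0;
}
```

Open the file in binary mode ("rb" / "wb") so the signature bytes are not translated on platforms that distinguish text from binary streams.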
EDIT
It sounds like what you're after is a variable-length header. A variable length header usually starts with the number of items in the header, so for example, if you have 5 items in your file, your header may look like this:
HELIUM3
5
Item1 INDEX
Item2 INDEX
Item3 INDEX
Item4 INDEX
Item5 INDEX
< then all the data after that >
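Reading a header of that shape back could be sketched as follows. This assumes the header is plain text, that each INDEX is a byte offset written as a decimal number, and that item names contain no whitespace; struct item and read_header are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct item {
    char name[32];
    long offset;    /* assumed meaning of INDEX: byte offset of the data */
};

/* Hypothetical helper: parse the signature line, the item count, and then
   one "name offset" line per item. Returns the number of items read,
   or -1 if the header is malformed or has more than max_items entries. */
int read_header(FILE *fp, struct item items[], int max_items)
{
    char sig[16];
    int count, i;

    assert(fp != NULL && items != NULL);
    if (fscanf(fp, "%15s", sig) != 1 || strcmp(sig, "HELIUM3") != 0)
        return -1;                      /* not our file type */
    if (fscanf(fp, "%d", &count) != 1 || count < 0 || count > max_items)
        return -1;
    for (i = 0; i < count; i++)
        if (fscanf(fp, "%31s %ld", items[i].name, &items[i].offset) != 2)
            return -1;
    return count;
}
```

Once the table is in memory, fseek() to items[i].offset reaches the data for that item directly.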
I would like to know how I could use the C programming language to create a file archiver such as tar.
I'm stuck on the first part: how to copy a bunch of files into one file, and then extract them back out of that one file.
Any help would be appreciated, thanks.
It's a good idea to read up on the tar format for some inspiration.
http://en.wikipedia.org/wiki/Tar_%28file_format%29
http://www.gnu.org/software/automake/manual/tar/Standard.html
It's quite simple and shouldn't be too hard to implement yourself, if you got a good grasp of basic C I/O.
Assuming you don't want compression, which is pretty hard, and just want something REALLY simple, you will need to do the following:
Create a file to hold all the files you want.
Fetch one of the files you want to archive, and get its name, its name_size (the length of the name), and its size.
Write the name_size, the name, the size of the file, and then the size bytes of the file's content into the archive.
Repeat for all of the files you want to archive.
To get the files back out of the archive, read the name_size, read the next name_size bytes as the name and create a file with that name, then read the size of the file, read that many bytes, and write them to the file you just created.
You would have this:
File1:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
FileN:
yyyyyyyyyyyyyyyyyyyy
After the archiving you would have:
5File1size of File1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5FileNsizeof FileNyyyyyyyyyyyyyyyyyyyy
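A minimal sketch of that record layout in C. It is an illustration, not a finished archiver: the two length fields are stored as raw uint32_t values in the machine's own byte order (so the archive is not portable between machines), the data is passed in memory instead of being read from disk, and archive_append / archive_next are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Append one record: name_size, name, size, data. Returns 0 or -1. */
int archive_append(FILE *ar, const char *name, const void *data, uint32_t size)
{
    uint32_t name_size;

    assert(ar != NULL && name != NULL);
    name_size = (uint32_t)strlen(name);
    if (fwrite(&name_size, sizeof name_size, 1, ar) != 1) return -1;
    if (fwrite(name, 1, name_size, ar) != name_size) return -1;
    if (fwrite(&size, sizeof size, 1, ar) != 1) return -1;
    if (size > 0 && fwrite(data, 1, size, ar) != size) return -1;
    return 0;
}

/* Read the next record into the caller's buffers.
   Returns 1 on success, 0 at the end of the archive, -1 on error. */
int archive_next(FILE *ar, char *name, size_t name_cap,
                 void *data, size_t data_cap, uint32_t *size_out)
{
    uint32_t name_size;

    assert(ar != NULL && name != NULL && data != NULL && size_out != NULL);
    if (fread(&name_size, sizeof name_size, 1, ar) != 1)
        return feof(ar) ? 0 : -1;
    if (name_size >= name_cap) return -1;       /* name would not fit */
    if (fread(name, 1, name_size, ar) != name_size) return -1;
    name[name_size] = '\0';
    if (fread(size_out, sizeof *size_out, 1, ar) != 1) return -1;
    if (*size_out > data_cap) return -1;        /* data would not fit */
    if (*size_out > 0 && fread(data, 1, *size_out, ar) != *size_out)
        return -1;
    return 1;
}
```

To extract, call archive_next() in a loop and, for each record, fopen() the returned name in "wb" mode and fwrite() the data to it. For large files, copy the data in chunks instead of requiring one buffer that holds the whole file.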
When you open a .txt file with fopen,
is there any way to delete some strings in the file without rewriting it?
For example, this is the txt file that I will open with fopen():
-------------
1 some string
2 SOME string
3 some STRING
-------------
I want to delete the line whose first character is 2, so that the file becomes
-------------
1 some string
3 some STRING
-------------
My solution is;
First read all the data and keep it in string variables. Then fopen the same file in w mode and write the data again, except line 2. (But this does not seem logical to me; I am searching for an easier way in C ...)
(I hope my English wasn't a problem)
The easiest way might be to memory-map the whole file using mmap. With mmap you get access to the file as a long memory buffer that you can modify with changes being reflected on disk. Then you can find the offset of that line and move the whole tail of the file that many bytes back to overwrite the line.
You should not overwrite the file in place. It is better to open another (temp) file, write the contents into it, then delete the old file and rename the temp file. That way it is safer if a problem occurs.
I think the easiest way is to
read whole file
modify contents in memory
write back to a temp file
delete original file
rename temp file to original file
Sounds not too illogical to me..
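The steps above can be sketched in C as follows. Instead of holding the whole file in memory, this version streams it line by line into the temp file, which has the same effect. The function name and the caller-supplied tmp_path are hypothetical, and the temp file should live in the same directory as the original so that rename() can work.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `path` to `tmp_path` without the lines whose
   first character is `first`, then replace the original with the copy.
   Returns 0 on success, -1 on error. */
int delete_lines_starting_with(const char *path, const char *tmp_path, char first)
{
    char line[1024];
    int at_line_start = 1, skipping = 0, ok = 1;
    FILE *in, *out;

    assert(path != NULL && tmp_path != NULL);
    in = fopen(path, "r");
    if (in == NULL)
        return -1;
    out = fopen(tmp_path, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    while (ok && fgets(line, sizeof line, in) != NULL) {
        size_t len = strlen(line);

        if (at_line_start)              /* decide once per line */
            skipping = (line[0] == first);
        if (!skipping && fputs(line, out) == EOF)
            ok = 0;
        /* A chunk without '\n' means the line continues in the next read. */
        at_line_start = (len > 0 && line[len - 1] == '\n');
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    if (!ok) {
        remove(tmp_path);               /* the original stays untouched */
        return -1;
    }
    if (remove(path) != 0)
        return -1;
    return rename(tmp_path, path) == 0 ? 0 : -1;
}
```

If anything fails before the final two calls, the original file has not been modified, which is the safety argument for going through a temp file.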
For sequential files, no matter what technique you use to delete line 2, you still have to write the file back to disk.