File archiver in C

I would like to know how I could use the C programming language to create a file archiver such as tar.
I'm stuck on the first part: how to copy a bunch of files into one file, and then extract them back out of that one file.
Any help would be appreciated, thanks.

It's a good idea to read up on the tar format for some inspiration.
http://en.wikipedia.org/wiki/Tar_%28file_format%29
http://www.gnu.org/software/automake/manual/tar/Standard.html
It's quite simple and shouldn't be too hard to implement yourself if you have a good grasp of basic C I/O.

Assuming you don't want compression, which is pretty hard, and just want something REALLY simple, you are going to need to do the following:
Create a file to hold all the files you want.
Fetch one of the files you want to archive, and get its name, the length of its name (name_size) and its size.
Write name_size, the name, the size of the file, and then the size bytes of the file's contents into the archive.
Repeat for all of the files you want to archive.
To get the files back out of the archive, read name_size, read the next name_size bytes to get the name and create a file with that name, then read the file's size, read that many bytes from the archive, and write them to the file you just created. Repeat until you reach the end of the archive.
You would have this:
File1:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
FileN:
yyyyyyyyyyyyyyyyyyyy
After the archiving you would have:
5File1size of File1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5FileNsizeof FileNyyyyyyyyyyyyyyyyyyyy
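The steps above can be sketched in C roughly as follows. This is only a minimal sketch under some assumptions that are not part of the answer: the function names archive_add and archive_extract_next are made up for illustration, name_size is stored as a 32-bit integer and size as a 64-bit integer in the machine's own byte order (so the archive is not portable between different machines), and files are extracted under exactly the name that was stored.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy exactly n bytes from in to out; returns 0 on success, -1 on error. */
static int copy_bytes(FILE *in, FILE *out, uint64_t n)
{
    char buf[4096];
    while (n > 0) {
        size_t want = n < sizeof buf ? (size_t)n : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0 || fwrite(buf, 1, got, out) != got)
            return -1;
        n -= got;
    }
    return 0;
}

/* Append one file to the archive as: name_size, name, size, contents.
 * (Hypothetical helper; the layout is an assumption of this sketch.) */
int archive_add(FILE *archive, const char *name)
{
    FILE *in = fopen(name, "rb");
    if (in == NULL)
        return -1;

    /* Find the file size by seeking to the end, then go back to the start. */
    if (fseek(in, 0, SEEK_END) != 0) { fclose(in); return -1; }
    long end = ftell(in);
    if (end < 0) { fclose(in); return -1; }
    rewind(in);

    uint32_t name_size = (uint32_t)strlen(name);
    uint64_t size = (uint64_t)end;
    int rc = -1;
    if (fwrite(&name_size, sizeof name_size, 1, archive) == 1 &&
        fwrite(name, 1, name_size, archive) == name_size &&
        fwrite(&size, sizeof size, 1, archive) == 1)
        rc = copy_bytes(in, archive, size);
    fclose(in);
    return rc;
}

/* Extract the next entry from the archive.
 * Returns 1 on success, 0 at the end of the archive, -1 on error. */
int archive_extract_next(FILE *archive)
{
    uint32_t name_size;
    uint64_t size;
    if (fread(&name_size, sizeof name_size, 1, archive) != 1)
        return feof(archive) ? 0 : -1;
    if (name_size == 0 || name_size > 4096)   /* arbitrary sanity limit */
        return -1;

    char *name = malloc((size_t)name_size + 1);
    if (name == NULL)
        return -1;
    int rc = -1;
    if (fread(name, 1, name_size, archive) == name_size &&
        fread(&size, sizeof size, 1, archive) == 1) {
        name[name_size] = '\0';
        FILE *out = fopen(name, "wb");
        if (out != NULL) {
            rc = copy_bytes(archive, out, size) == 0 ? 1 : -1;
            if (fclose(out) != 0)
                rc = -1;
        }
    }
    free(name);
    return rc;
}
```

To build an archive you would open it with fopen(..., "wb") and call archive_add once per input file; to unpack, open it with "rb" and call archive_extract_next in a loop until it returns 0. Note that the sketch trusts the stored names completely, so a real tool would want to reject names containing things like "../" before creating files.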

Related

What is the difference between .bin and .dat file?

When writing to a binary file, when should I use .bin vs .dat? If I'm just trying to store information not meant to be read by humans, like item description/serial number pair, does it matter which one I pick if I'm just trying to make it unreadable from a text editor?

Let me give you some brief details about these files :
.BIN File : The BIN file type is primarily associated with 'Binary File'. Binary files are used for a wide variety of content and can be associated with a great many different programs. In general, a .BIN file will look like garbage when viewed in a file editor.
.DAT File : The DAT file type is primarily associated with 'Data'. Can be just about anything: text, graphic, or general binary data. Data file in special format or ASCII.
Reference:
Abhijit Banerjee answered that question on quora

.dat is a more frequently used suffix for binary data. It doesn't matter what extension you pick, as long as you are on Unix or Linux based systems.

Suffixes can mean whatever you want them to mean... Those rules are more like guidelines than actual rules...
However, BIN looks like it is short for binary, so a BIN file will likely hold data in binary form. DAT looks like it is short for data, so a DAT file will contain information in whatever format the developer of the program that reads that file sees fit (ASCII, binary, a mix of them, something else entirely).

On a UNIX system, there is no difference. The extensions are interchangeable.

If you do not use any extension, it is harder for someone who does not know what kind of file it is to open it. Additionally, on Unix or Linux, a file whose name begins with a dot (period) is hidden from default directory listings.

How to read, manipulate and write .docx file in c

I am reading .docx file in a buffer and writing it to a new file successfully. (Using fread and fwrite in C) However now I want to enhance the scope of this project for the purpose of encryption. For which I want to be able to manipulate the buffer, then write it in new file.
Now one question might be, what manipulation do I need?
It could be anything really, like I'd write character 's' in buffer's location 15. Like below, and then write this new buffer (having character 's' at location 15, but the rest of the buffer remains unchanged) in a new .docx file.
buffer[15] = 's';
When I did this, the file that was created was corrupt. Since I am not fully aware of the structure of .docx file, this byte number 15 could be some potential identifier, or header, or any important information of .docx file needed for creating a non-corrupt file.
However, the things I know about .docx internal structure are:
It consists of XML files, zipped together.
The content that is written in .docx file, (for e.g. I have a file named test.docx, and it contains "Hello, how are you?") then the contents "Hello, how are you?" are stored in XML files.
There is a file with a .rels extension (not confirmed) among the files that are zipped together, which tells MS Word where the content is stored, i.e. where to look for it.
Apart from these 3 points I don't know much about structure of .docx file. Now considering all this, I want to be able to extract the contents of .docx file, from the XML files zipped together, read it (in C) in a buffer, change the buffer as I need it, and create a new file, with the new content that is present in the buffer.
Can someone guide me through this?
Also kindly mention, if I need to provide code, or any other essential details. Thanks in advance.
EDIT
PURPOSE OF ALL THIS:
I want to do all this for encryption. Encrypting a file (using AES) makes the whole file unreadable, and everything inside it is changed. When I decrypt that file, it cannot be opened. My guess is that the AES decryption algorithm does not know how to put the contents recovered from the encrypted file back into a valid .docx structure, so it is unable to place the contents properly.
I have tried it. Original docx file was of 14 KB, encrypted docx file was of 14 KB as well as the decrypted docx file. But when I try to open the decrypted file, it says file is corrupt. Also I tried to check it in HEX editor. Decrypted file has just 00 bytes after exactly 30 Bytes.

DOCX files are based on OPC and OOXML. OPC is based on Zip. OOXML is based on XML. Therefore, you can use Zip and XML tools to operate on DOCX files. Beyond this, you'll have to be more specific about what you wish to do in order to receive better guidance.
Poking characters into random index locations in an XML file is operating at the wrong level of abstraction.
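Since a DOCX is a Zip container, one quick sanity check on a decrypted file is whether it still starts like a Zip archive. A minimal sketch in C, assuming only that the file begins with the standard Zip local file header signature (the bytes 'P', 'K', 0x03, 0x04); the function name looks_like_zip is made up for illustration, and passing this check does not prove the rest of the file is intact:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file starts with the Zip local file header signature,
 * 0 if it does not, and -1 if the file cannot be opened.
 * (Hypothetical helper; it only inspects the first four bytes.) */
int looks_like_zip(const char *path)
{
    static const unsigned char signature[4] = { 'P', 'K', 0x03, 0x04 };
    unsigned char head[4];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(head, 1, sizeof head, f);
    fclose(f);
    if (got != sizeof head)
        return 0;   /* too short to be a Zip archive with entries */
    return memcmp(head, signature, sizeof signature) == 0;
}
```

If the original test.docx passes this check and the decrypted copy does not, the encrypt/decrypt round trip is not returning the original bytes, and that is the thing to fix before looking at the XML inside.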

Unknown Master List .dat file, issues retrieving information

I come to you completely stumped. I do some side work for a company that uses an old DOS based program to input and retrieve data. This is a legacy piece of software, and they have since moved to either QuickBooks or Outlook for all of their address or billing related needs. However there have been some changes made, and they work with this database fairly regularly. Since the computer that this software is on, is running XP (and none of the other computers in the office can run it) they're looking to phase this software out for when the computer inevitably explodes.
TLDR; I have an old .csv file (roughly two years) that has a good chunk of information on it, but again it's two years old. I have another file called ml.dat (I'm assuming masterlist.dat) that's in the same folder as this legacy software. I open it with notepad and excel and am presented with information like this:
S;Û).;PÃS;*p(â'a,µ,
Much less of the above chunk of text is recognizable within Notepad or Excel; there it is mostly unrecognized squares.
Some of the information is actually readable however. I can for example read the occasional town name, or person's name but I'm unable to get all of the information since there's a lot missing. Perhaps the data isn't in unicode or something? I have no idea. Any suggestions? I'm ultimately trying to take this information and toss it into either quickbooks or outlook.
Please help!
Thanks
Edit: I'm guessing the file might be encrypted since .dat's are usually clear text? Any thoughts?

.DAT files can be anything, they are usually just application data. Since there is readable text, then it is very unlikely that this file is encrypted. Instead you are seeing ASCII representations of the bytes of other content. http://www.asciitable.com/ Assuming single byte values, the number 77 might appear in the file somewhere as M.
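To see the raw bytes next to their ASCII representation, a small hex dump helps. The following is just a sketch (hex_dump is a made-up name, and the 16-bytes-per-row layout is an arbitrary choice): it prints each byte in hex on the left and as a character on the right, with a '.' for anything that is not printable.

```c
#include <ctype.h>
#include <stdio.h>

/* Print n bytes as hex on the left and printable ASCII on the right,
 * 16 bytes per row; non-printable bytes are shown as '.'. */
void hex_dump(FILE *out, const unsigned char *data, size_t n)
{
    for (size_t row = 0; row < n; row += 16) {
        fprintf(out, "%08zx  ", row);            /* offset of this row */
        for (size_t i = 0; i < 16; i++) {
            if (row + i < n)
                fprintf(out, "%02x ", data[row + i]);
            else
                fputs("   ", out);               /* pad a short last row */
        }
        fputs(" ", out);
        for (size_t i = 0; i < 16 && row + i < n; i++) {
            unsigned char c = data[row + i];
            fputc(isprint(c) ? c : '.', out);
        }
        fputc('\n', out);
    }
}
```

With this, a byte holding the value 77 shows up as 4d in the hex column and as M in the text column, which makes it easier to tell stored numbers apart from real text such as names and towns.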
Your options:
1. Search for some utility to load and translate the dat file for that application.
2. Set up an appropriate dos emulator so you can run this application on another box, or even a virtual machine running freedos or something.
3. Figure out the file format and then write a program to translate the data.
For #3, you can attach a debugger to the application to trace how the file is read and written. Alternatively you can try to figure out record boundaries (if all the records are the same size, then things are a little bit easier.) Then you can use known values to try to find field boundaries. If you can find (or reverse compile) the source code, then that could also give you insight into the file format.
#1 is your best bet, and #2 will buy you some time so that you don't need that original machine anymore. #3 would likely be something to outsource.
If you can find the source or file format, then you just recreate whatever data structure was dumped to the file and read the file into it.
To find which exe opens it, you can do something like:
for %f in (*.exe) do find /c "ml.dat" "%f"
Assuming the original application was written in C then there would be code something like this to read the first record from the file:
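    #include <stdio.h>

    /* Hypothetical record layout; the real fields have to be worked out. */
    struct SecretData
    {
        int first;
        double money;
        char city[10];
    };

    int main(void)
    {
        struct SecretData secretdata;
        FILE *input = fopen("ml.dat", "rb");
        if (input == NULL)
            return 1;
        /* Read the first record straight into the structure. */
        fread(&secretdata, sizeof(secretdata), 1, input);
        fclose(input);
        return 0;
    }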
(The file would have been written with fwrite.) Basically you need to figure out the innards of the SecretData structure to be able to read the file.
There likely wasn't a separate utility used to make the file; dumping data to a file and reading it back is relatively easy in most languages.

Delete a character from a file in C

How can I delete a few characters from a file using a C program?
I could not find any predefined functions for it.
For context: I am trying to send a file through a socket, and once N bytes have been sent successfully, I want to delete those bytes from the file. At the end, the file will be empty.
Any other way to do this efficiently?
Thanks
Pradeep

If they're at the end, truncate the file at the appropriate length. If they're not then you'll need to rewrite the file.

Your way is pretty inefficient for large files, since you would have to copy "the rest of the file" some bytes further to the beginning, which costs much. I would rather record the "current sending position" somewhere outside of the file and update that information. That way, you don't have to copy the rest of the file so often.

There is no straightforward way to delete bytes from the beginning of a file. You will have to start from where you want to trim the file, and read from there to the end of the file, writing to the start of the file.
It might make more sense to just track how many bytes you have already written to the file in some other file.

You should use an index which points to the beginning of the data you haven't sent yet.
It is not necessary to delete what you have already sent; just skip past it, and delete the file once you have sent all of it.
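The "keep an index instead of deleting" idea could look something like this in C. It is only a sketch: send_fn, send_from_offset and demo_sink are names invented here, and the callback stands in for whatever actually pushes bytes to the socket (for example a wrapper around send()). The demo sink deliberately accepts at most 7 bytes per call to imitate a socket that takes only part of the data.

```c
#include <stdio.h>

/* Callback that transmits len bytes and returns how many were accepted,
 * or a negative value on error. A hypothetical stand-in for send(). */
typedef long (*send_fn)(const void *data, size_t len, void *ctx);

/* Send the file from *offset onward, advancing *offset by the number of
 * bytes actually sent. Returns 0 once everything has been sent, -1 on error.
 * On error *offset still marks the first unsent byte, so a later call resumes there. */
int send_from_offset(FILE *f, long *offset, send_fn send_bytes, void *ctx)
{
    char buf[4096];
    for (;;) {
        if (fseek(f, *offset, SEEK_SET) != 0)
            return -1;
        size_t got = fread(buf, 1, sizeof buf, f);
        if (got == 0)
            return ferror(f) ? -1 : 0;   /* nothing left: done */
        long sent = send_bytes(buf, got, ctx);
        if (sent <= 0)
            return -1;
        *offset += sent;                 /* only count bytes that really went out */
    }
}

/* Demo sink: copies at most 7 bytes per call into the FILE* passed as ctx,
 * imitating a socket that accepts only part of the data. */
long demo_sink(const void *data, size_t len, void *ctx)
{
    if (len > 7)
        len = 7;
    return fwrite(data, 1, len, (FILE *)ctx) == len ? (long)len : -1;
}
```

The file itself is never modified while sending. If the offset has to survive a restart, it can be written to a small side file, and the data file can simply be removed with remove() once send_from_offset returns 0.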

If the characters are one after the other, then why don't you give fseek() a try?

C - Reading multiple files

Just had a general question about how to approach a certain problem I'm facing. I'm fairly new to C, so bear with me here. Say I have a folder with 1000+ text files; the files are not named in any kind of numbered order, but they are alphabetical. For my problem I have files of stock data, and each file is named after the company's respective ticker. I want to write a program that will open each file, read the data, find the historical low, compare it to the current price, calculate the percent change, and then print it. Searching and calculating are not a problem; the problem is getting the program to go through and open each file. The only way I can see to attack this is to create a text file containing all of the ticker symbols, have the program read that into an array, and then run a loop that opens the first filename in the array, performs the calculations, prints the output, closes the file, and then loops back around, moving to the second element (the next ticker symbol) in the array. This would be fairly simple to set up (I think), but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this? Not really asking for code (unless there is some amazing function in C that will do this for me ;) ), just some advice from more experienced C programmers.
Thanks :)
Edit: This is on Linux, sorry I forgot to mention that!

Under Linux/Unix (BSD, OS X, POSIX, etc.) you can use opendir / readdir to go through the directory structure. There is no need to generate static files that need to be updated when the file system already has the information you want. If you only want a sub-set of stocks at a given time, then using glob would be quicker; there is also scandir.
I don't know what Win32 (Windows / Platform SDK) functions are called, if you are developing using Visual C++ as your C compiler. Searching MSDN Library should help you.
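A minimal sketch of the opendir / readdir loop, assuming a POSIX system and a directory that holds only the ticker files. The function name list_files is made up for illustration; here it just prints each path, and that is the spot where you would fopen the file and do the calculations instead.

```c
#include <dirent.h>
#include <stdio.h>

/* Write the path of every entry in dir to out, one per line, skipping
 * names that start with '.'. Returns the number of entries listed,
 * or -1 if the directory cannot be opened. */
int list_files(const char *dir, FILE *out)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;               /* skips ".", ".." and hidden files */
        fprintf(out, "%s/%s\n", dir, entry->d_name);
        count++;
    }
    closedir(d);
    return count;
}
```

Calling list_files("/path/to/text/files", stdout) would print one path per ticker file. Note that readdir returns entries in no particular order, so sort the names yourself if alphabetical order matters, and that this sketch does not check whether an entry is a regular file or a subdirectory.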

Assuming you're running on linux...
ls /path/to/text/files > names.txt
is exactly what you want.

opendir(); on linux.
http://linux.die.net/man/3/opendir
Example:
http://snippets.dzone.com/posts/show/5734

In pseudo code it would look like this, I cannot define the code as I'm not 100% sure if this is the correct approach...
for each directory entry
    scan the filename
    extract the ticker name from the filename
    open the file
    read the data
    create a record consisting of the filename, data.....
    close the file
    add the record to a list/array...
sort the list/array into alphabetical order based on the ticker name in the filename...
You could vary it slightly if you wish, scan the filenames in the directory entries and sort them first by building a record with the filenames first, then go back to the start of the list/array and open each one individually reading the data and putting it into the record then....
Hope this helps,
best regards,
Tom.

There are no functions in standard C that have any notion of a "directory". You will need to use some kind of platform-specific function to do this. For some examples, take a look at this post from Cprogramming.com.
Personally, I prefer using the opendir()/readdir() approach as shown in the second example. It works natively under Linux and also on Windows if you are using Cygwin.

Approach 1) I would just have a specific directory in which I have ONLY these files containing the ticker data and nothing else. I would then use the C readdir API to list all files in the directory and iterate over each one performing the data processing that you require. Which ticker the file applies to is determined only by the filename.
Pros: Easy to code
Cons: It really depends where the files are stored and where they come from.
Approach 2) Change the file format so the ticker files start with a magic code identifying that this is a ticker file, and a string containing the name. As before use readdir to iterate through all files in the folder and open each file, ensure that the magic number is set and read the ticker name from the file, and process the data as before
Pros: More flexible than before. Filename needn't reflect name of ticker
Cons: Harder to code, file format may be fixed.

but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this?
I have solved the exact same problem a while back, albeit for personal uses :)
What I did was to use the OS shell commands to generate a list of those files and redirected the output to a text file and had my program run through them.

On UNIX, there's the handy glob function:
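    #include <glob.h>
    #include <stdio.h>
    #include <string.h>

    glob_t results;
    size_t i;
    memset(&results, 0, sizeof(results));
    /* Expand the pattern; the matching paths end up in results.gl_pathv. */
    glob("*.txt", 0, NULL, &results);
    for (i = 0; i < results.gl_pathc; i++)
        printf("%s\n", results.gl_pathv[i]);
    globfree(&results);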

On Linux or a related system, you could use the fts library. It's designed for traversing file hierarchies (see man fts),
or even something as simple as readdir.
If on Windows, you can use the Directory Management APIs; more specifically, the FindFirstFile function, used with wildcards, in conjunction with FindNextFile.
