Write a string as binary data to file - C

I want to write a string as binary data to a file.
This is my code:
FILE *ptr;
ptr = fopen("test.dat","wb"); // w for write, b for binary
fprintf(ptr,"this is a test");
fclose(ptr);
After I run the program and open the file test.dat, I read "this is a test", not the binary data I wanted. Can anyone help me?

You seem to be somewhat confused; all data in typical computers is binary. Opening the file for binary access only means that no translations, such as end-of-line conversions, are applied to it; it doesn't change the interpretation of the data you write.
You're just looking at binary data whose representation is a bunch of human-readable characters. Not sure what you expected to find, that is after all what you put into the file.
The letter 't' is represented by the binary sequence 01110100 (assuming an ASCII-compatible encoding), but many programs will show that as 't' instead.

Notepad decodes the binary data and shows the ASCII character for each byte.
If you need to see the raw bytes of the stored data, open your file in a hex viewer, e.g. WinHex.

Related

Trying to store a string in a file using binary mode in C

I am trying to store a simple string in a file opened in wb mode, as shown in the code below. From what I understand, the content of the string should be stored as 0s and 1s because the file was opened in binary mode, but when I opened the file in Notepad, I saw the exact string and not some binary data. Out of curiosity I also tried to read this binary file in text mode. Again the string was shown perfectly on output, without any random characters. The code below illustrates my point:
#include<stdio.h>
int main()
{
char str[]="Testing 123";
FILE *fp;
fp = fopen("test.bin","wb");
fwrite(str,sizeof(str),1,fp);
fclose(fp);
return 0;
}
So I have three questions:
Why does Notepad show the exact string instead of random characters?
Can we read a file in text mode that was written in binary mode, and vice versa?
Not exactly related to the above, but can we use functions like fgetc, fputc, fgets, fputs, fprintf, fscanf in binary mode, and functions like fread, fwrite in text mode?
Edit: I forgot to mention that I am working on Windows.
In binary mode, the file API does not modify the data but just passes it along directly.
In text mode, some systems transform the data. For example, Windows changes \n to \r\n when writing in text mode.
On Linux there is no difference between binary and text modes.
Either way, the characters of your string are stored as their character codes, which is exactly what a text file contains, so Notepad shows the string. Notepad displays whatever is in the file, so even if you write 100% binary data there is a chance that you'll see some readable characters.

Identify files based on the ASCII characters in the binary

I have a file uploader and in order to validate the file is actually of the type expected, I am inspecting the binary and checking the ASCII identifying characters (see the PDF example here).
The majority of file types have an ASCII identifier; however, some don't (like XLS files).
How would I best identify these?
I see that they all have hex signatures, but as it stands I don't have a way to convert the binary data to hex.
Workaround...
I don't have a hex converter, but I can convert both binary and hex to Base64, so that is what I am doing: I compare the Base64 output.

What is the difference between .bin and .dat file?

When writing to a binary file, when should I use .bin vs .dat? If I'm just trying to store information not meant to be read by humans, like item description/serial number pair, does it matter which one I pick if I'm just trying to make it unreadable from a text editor?
Let me give you some brief details about these files :
.BIN File : The BIN file type is primarily associated with 'Binary File'. Binary files are used for a wide variety of content and can be associated with a great many different programs. In general, a .BIN file will look like garbage when viewed in a file editor.
.DAT File : The DAT file type is primarily associated with 'Data'. It can be just about anything: text, graphics, or general binary data, stored either in a special format or as ASCII.
Reference:
Abhijit Banerjee answered that question on quora
.dat is a more frequently used suffix for binary data. It doesn't matter what extension you pick, as long as you are on Unix or Linux based systems.
Suffixes can mean whatever you want them to mean... Those rules are more like guidelines than actual rules...
However, BIN looks like an abbreviation of binary, so a BIN file will likely hold data in binary form. DAT looks like an abbreviation of data, so a DAT file will contain information in whatever format the developer of the program that reads that file sees fit (ASCII, binary, a mix of them, something else entirely).
On a UNIX system, there is no difference. The extensions are interchangeable.
If you do not put any extension, it is harder for someone who doesn't know what kind of file it is to open it. Additionally, on Unix or Linux, if you place a dot (period) at the start of the file name, the file is hidden.

C program for reading doc, docx, pdf

I want to write a program in C (only C, not C++ or Java) that will read doc, docx and pdf files, and make it available on GitHub for anyone who needs that code. I started with the .doc format. I found that if I open a .doc file in plain Notepad, it shows all the text along with some extra content that can easily be trimmed. So I wrote a simple C program to read a .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters from the file, and those are not readable either. I don't know why this is happening. Any comment or discussion would be very helpful.
Here is the link for github Source code. Please help me to complete all three format.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the return type of fgetc is very purposefully int, not char.
fgetc returns each byte as a non-negative value (0 to 255 with 8-bit bytes), except when you reach the end of the file, in which case it returns EOF, which is often implemented as -1.
If you read a byte of value 255 and store it in an int, everything is OK: it's stored as the value 255 and your loop can continue. If you store it in a char, which is signed on most platforms, 255 becomes -1, which compares equal to EOF, and your loop stops.
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text but also a bunch of control characters that can confuse your terminal.
As a last note: don't try to read docx files directly using fread. They are compressed, so you likely won't recover the text unaltered. Take a look at libarchive. Also expect to have to read the document specifications. docx is Microsoft's Office Open XML format (standardized as ECMA-376), which is distinct from the OpenOffice (OpenDocument) format. See the Office Open XML and PDF specification documents (there are multiple versions).
Think of a .doc file as a txt file with extra non-printable characters before, in the middle of, and after your content. These non-printable characters define special formatting, metadata and other information.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files starts at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file, you can skip the first 2560 bytes (take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
I have not seen the implementation of a .pdf or a .docx file, but you can open both kinds of file in a hex editor and figure out what pattern you can use to isolate the important contents.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Do binary files have encoding? Confused

Suppose I write the following C program and save it in a text file called Hello.c
#include<stdio.h>
int main()
{
printf("Hello there");
return 0;
}
The Hello.c file will probably be saved in UTF-8 encoding.
Now, I compile this file to create a binary file called Hello.
This binary file should in some way store the text "Hello there". The question is: what encoding is used to store this text?
As far as I'm aware, vanilla C doesn't have much of a concept of encoding, although if you correctly keep track of multi-byte characters, you can probably use one. In practice, most compilers store plain string literals in an ASCII-compatible encoding, with one byte per ASCII character.
You are correct about the string "Hello there" being stored in the executable itself. The string literal is put into global memory and replaced with a pointer in the call to printf, so you can see the string literal in the data segment of the binary.
If you have access to a hex editor, try compiling your program and opening the binary in the editor. Here is a screenshot from when I did this. You can see that each character of the string literal is represented by a single byte, followed by a 0 byte (the NUL terminator). This is ASCII.
