Do binary files have encoding? Confused - c

Suppose I write the following C program and save it in a text file called Hello.c
#include<stdio.h>
int main()
{
printf("Hello there");
return 0;
}
The Hello.c file will probably be saved as UTF-8.
Now I compile this file to create a binary file called Hello.
This binary file must somehow store the text "Hello there". What encoding is used to store that text?

As far as I'm aware, vanilla C has very little notion of text encoding: a char is just a byte, and a string is a sequence of bytes that ends with a zero byte. You can keep a multi-byte encoding such as UTF-8 in a char array as long as you keep track of the multi-byte characters yourself. For plain English text like this, the compiler will almost always emit ASCII, one byte per character.
You are correct that the string "Hello there" is stored in the executable itself. The string literal is placed in static storage and the call to printf receives a pointer to it, so you can find the literal in the data segment of the binary.
If you have access to a hex editor, try compiling your program and opening the binary in the editor. You will see that each character of the string literal is represented by a single byte, and that the string is followed by a 0 byte (NUL). This is ASCII.
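If no hex editor is at hand, the same check can be done from inside a program. The following is only a minimal sketch: dump_bytes and greeting are names made up for this illustration, and the byte values mentioned below assume an ASCII-compatible execution character set, which is the common case but is not guaranteed by the C standard.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: print every byte of buf as two hex digits. */
static void dump_bytes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        printf("%02x ", buf[i]);
    printf("\n");
}

/* The same literal as in Hello.c. sizeof counts the terminating zero byte,
   so the array holds 11 characters plus one 0 byte. */
static const char greeting[] = "Hello there";
```

Calling dump_bytes((const unsigned char *)greeting, sizeof greeting) prints one hex value per character, starting with 48 for 'H' on an ASCII system and ending with 00 for the terminator.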

Related

Trying to store a string in a file using binary mode in C

I am trying to store a simple string in a file opened in wb mode, as shown in the code below. From what I understand, the content of the string should be stored as 0s and 1s because the file was opened in binary mode, but when I opened the file manually in Notepad I could see the exact string stored in the file and not some binary data. Out of curiosity I also tried to read this binary file in text mode. Again the string was shown perfectly in the output, without any random characters. The code below illustrates my point:
#include<stdio.h>
int main()
{
char str[]="Testing 123";
FILE *fp;
fp = fopen("test.bin","wb");
fwrite(str,sizeof(str),1,fp);
fclose(fp);
return 0;
}
So I have three questions about this:
Why does Notepad show the exact string instead of some random characters?
Can we read a file in text mode that was written in binary mode, and vice versa?
Not exactly related to the above, but can we use functions like fgetc, fputc, fgets, fputs, fprintf and fscanf in binary mode, and functions like fread and fwrite in text mode?
Edit: I forgot to mention that I am working on Windows.
In binary mode, the file API does not modify the data; it just passes the bytes along unchanged.
In text mode, some systems transform the data. For example, Windows changes \n to \r\n on output in text mode.
On Linux there is no difference between binary and text modes.
Notepad displays whatever is in the file, so even if you write 100% binary data there is a chance that you'll see some readable characters.
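The difference between the two modes can be measured directly. Below is a small sketch, not a definitive test: bytes_on_disk is a hypothetical helper written for this illustration, and the path passed to it is whatever scratch file name you choose. It writes the four characters "a\nb\n" with fputs in the requested mode, then counts the bytes that actually ended up in the file by reading it back in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: write "a\nb\n" to path using the given fopen mode
   ("w" or "wb"), then count the bytes on disk by rereading in binary mode.
   The scratch file is removed afterwards. Returns -1 on an I/O error. */
static long bytes_on_disk(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;
    if (fputs("a\nb\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long count = 0;
    while (fgetc(fp) != EOF)   /* binary read: no newline translation */
        count++;
    fclose(fp);
    remove(path);
    return count;
}
```

With mode "wb" the count is 4 on every platform. With mode "w" it is also 4 on Linux, while on Windows each \n is expanded to \r\n and the count becomes 6. The sketch also shows that a text-oriented function such as fputs works on a stream opened in binary mode.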

UTF-8 Japanese and Hangul script gets space inserted between characters in terminal emulator/tty

I have problems with printing Japanese and Hangul scripts encoded in UTF-8 in C. The program itself is trivial (I have omitted includes):
int main()
{
uint8_t valid_utf8_string1x[] = "宇宙に飛びたい"; //uint8_t is used on purpose
printf("%s", (const char *)valid_utf8_string1x); // pass the bytes as data, not as the format string
return 0;
}
When I run it (on st, kitty or a tty), every character is separated by a regular space (code 32 decimal). Copying the output and pasting it (even in the same terminal window), or redirecting the output to a file (with ./program > outfile), gets rid of the spaces. Yet echo "宇宙に飛びたい" (where the string is pasted without spaces) or cat outfile makes them appear again, even though they are not present in the input. Every other script I've tested (Greek, Russian, Polish and Latin, Arabic) worked fine. How can I get rid of the spaces being printed? They mess up my UI.
PS. I copied and pasted the space itself from between the characters to get its code.
It turns out it's a terminal feature.

Write a string as binary data to file - C

I want to write a string as binary data to a file.
This is my code:
FILE *ptr;
ptr = fopen("test.dat","wb"); // w for write, b for binary
fprintf(ptr,"this is a test");
fclose(ptr);
After I run the program and open the file test.dat, I read "this is a test" but not the binary data I want. Can anyone help me?
You seem to be somewhat confused; all data in typical computers is binary. The fact that you opened the file for binary access means it won't have e.g. end-of-line conversions done; it doesn't change the interpretation of the data you write.
You're just looking at binary data whose representation is a bunch of human-readable characters. Not sure what you expected to find; that is, after all, what you put into the file.
The letter 't' is represented by the binary sequence 01110100 (assuming an ASCII-compatible encoding), but many programs will show that as 't' instead.
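The mapping from a character to its bit pattern can be reproduced in a few lines. This is just an illustrative sketch: byte_to_bits is a made-up helper name, and the pattern quoted for 't' assumes an ASCII-compatible encoding, as the answer above does.

```c
/* Hypothetical helper: write the 8 bits of c, most significant bit first,
   into out as the characters '0' and '1', followed by a terminating zero. */
static void byte_to_bits(unsigned char c, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (c & (0x80u >> i)) ? '1' : '0';
    out[8] = '\0';
}
```

Calling byte_to_bits('t', bits) with a char bits[9] buffer fills it with "01110100" on an ASCII system. Those are the same eight bits that are stored in test.dat; a text editor simply chooses to draw them as the glyph 't'.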
Notepad decodes the bytes in the file and shows the character that corresponds to each one.
If you need to see the raw byte values of the stored data, open the file in a hex viewer, e.g. WinHex.

Edit String in library archive (filename.a)

I compiled a C library and have the resulting library file, for example filelib.a. I want to edit a string inside filelib.a because the C source code has been deleted from my PC. The file filelib.a contains the string "article seen".
If I grep:
$ grep -R "/etc/resolv.conf" *
Binary file filelib.a matches
Binary file filelib.so matches
So the string "/etc/resolv.conf" is present in both filelib.a and filelib.so.
How can I edit and replace a string in the binary files filelib.a and filelib.so? For example, I want to replace the string "/etc/resolv.conf" with "/system/etc/resolv.conf".
I edited the file with the hex editor Bless, but when I use the library I get this error:
could not read symbols: Malformed archive
collect2: error: ld returned 1 exit status
I'm using Ubuntu Linux.
Thanks.
If you really don't have the slightest chance to obtain or recover the source code, and the new string is as long as or shorter than the original one, you can open the archive in a hex editor, patch the string in place, and pad it with zeroes if it's shorter than before (there must always be at least one terminating zero byte).
If you want to change the string to something longer, that's not easy. Your best chance would perhaps be to extract the archive, disassemble the object file in which you want to make changes, change the assembly, reassemble it, and then use ar to update the modified object file in the library.
As long as the string you want to change it to is shorter than or equal in length to the one in the binary file, you can just use a hex editor to substitute the string and replace any remaining characters with \0.
I believe the Bless Hex Editor should do the job for you.
Just make sure that you do not change the length of the file. It may be possible to use a shorter string than the old one if you insert a '\0' terminator, but that depends on how the program uses the string, so I'd recommend against it.
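The equal-or-shorter rule from the answers above can be written down as a short routine. This is a hedged sketch rather than a ready-made tool: patch_string is a hypothetical helper that works on a buffer already loaded into memory, it only touches the first occurrence, and reading the file in and writing it back out is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overwrite the first occurrence of old_str in buf with
   new_str, padding with zero bytes so the buffer length never changes.
   Returns 1 on success, 0 if old_str is absent or new_str is longer. */
static int patch_string(unsigned char *buf, size_t len,
                        const char *old_str, const char *new_str)
{
    size_t old_len = strlen(old_str);
    size_t new_len = strlen(new_str);

    /* A longer replacement would shift every byte after it and corrupt
       the archive, so it is refused. */
    if (old_len == 0 || new_len > old_len || old_len > len)
        return 0;

    for (size_t i = 0; i + old_len <= len; i++) {
        if (memcmp(buf + i, old_str, old_len) == 0) {
            memcpy(buf + i, new_str, new_len);
            memset(buf + i + new_len, 0, old_len - new_len);
            return 1;
        }
    }
    return 0;
}
```

Note that the replacement asked for in the question, "/system/etc/resolv.conf", is longer than "/etc/resolv.conf", so this routine refuses it and returns 0. That case needs the extract, modify and ar route described above.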

Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
In the first place I set the locale to a utf-8 aware character set with
setlocale(LC_ALL, "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See the screenshot for the content of the text file: http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 bytes in order to write them to a file. For that purpose you can use a library such as ICU.
Edit after your edit of the question: UTF-8 and ISO-8859-1 encode the ASCII range (code points 0 to 127) identically. They differ only in the "special" symbols (umlauts, accents, etc.), which ISO-8859-1 stores as a single byte and UTF-8 stores as two bytes. So, for any text that doesn't contain these symbols, both are equivalent. However, if your program includes strings with those symbols, you have to make sure your text editor saves the source file as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv, which writes the converted text to standard output:
iconv -f latin1 -t utf8 file.c
This will convert all your Latin-1 strings to UTF-8, and when you print them they will definitely be in UTF-8. If the converted strings show up garbled (for example Ã¼ where you expected ü), then your strings were already in UTF-8.
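What a Latin-1 to UTF-8 conversion does at the byte level is simple enough to sketch by hand. The function below is an illustration only, not how iconv is implemented: latin1_to_utf8 is a made-up name, and the caller is assumed to supply an output buffer of at least 2 * len + 1 bytes.

```c
#include <stddef.h>

/* Hypothetical helper: convert len bytes of ISO-8859-1 text to UTF-8.
   out must have room for 2 * len + 1 bytes. Returns the number of bytes
   written, not counting the terminating zero byte. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            out[n++] = c;
        } else {
            /* Code points 0x80..0xFF become a two-byte UTF-8 sequence. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = 0;
    return n;
}
```

For the word "Grütze", the Latin-1 byte 0xFC for ü becomes the two bytes 0xC3 0xBC, while the other five letters are copied unchanged. Feeding text that is already UTF-8 through the same conversion turns 0xC3 0xBC into four bytes, which is the garbled output mentioned above.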
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also functions with the wc prefix, for converting between multibyte and wide characters. You'll have to set the locale to a UTF-8 one with setlocale() first, so that it chooses this implementation of multibyte support.
If you are starting from wide-character (Unicode) strings, I believe the function you are looking for is wcstombs().
Can you open the file in a hex editor and verify, with a simple input example, whether the written bytes differ from the values you passed to write()? Sometimes there is no way for a text editor to determine the character set, and your text editor may have assumed ISO-8859-1.
Once you have done this, could you edit your original post to add the pertinent information?
