I have a file that was written on Windows with the WINDOWS-1256 encoding, and I need to write a C program that reads bytes from this file and writes them back to a new file with UTF-8 encoding.
How do I read a file with a specific encoding in C?
Other than text mode and binary mode, there is no way to read specific encodings directly with the standard C API.
You would open the file in binary mode and read it in. Then you can use a library like libiconv to do the encoding/decoding between specific formats.
I am trying to store a simple string in a file opened in wb mode, as shown in the code below. From what I understand, the content of the string should be stored as raw binary data since the file was opened in binary mode, but when I opened the file manually in Notepad, I was able to see the exact string stored in the file and not some unreadable binary data. Just out of curiosity I tried to read this binary file in text mode. Again the string was shown perfectly on output, without any random characters. The code below illustrates my point:
#include <stdio.h>

int main(void)
{
    char str[] = "Testing 123";
    FILE *fp = fopen("test.bin", "wb");
    if (fp == NULL)
        return 1;
    fwrite(str, sizeof(str), 1, fp);
    fclose(fp);
    return 0;
}
So I have three questions arising from this:
Why, when viewing the file in Notepad, does it show the exact string rather than random characters?
Can we read a file in text mode that was written in binary mode, and vice versa?
Not exactly related to the above, but can we use functions like fgetc, fputc, fgets, fputs, fprintf, and fscanf in binary mode, and functions like fread and fwrite in text mode?
Edit: Forgot to mention that I am working on the Windows platform.
In binary mode, the file API does not modify the data; it just passes it through directly.
In text mode, some systems transform the data. For example, Windows translates \n to \r\n on output (and \r\n back to \n on input) in text mode.
On Linux there is no difference between binary and text mode.
Notepad displays whatever bytes are in the file, so even if you write 100% binary data there is a chance you'll see some readable characters. In your case, "Testing 123" consists entirely of printable ASCII bytes, so it looks the same either way.
The project I'm working on takes XML files and input streams and converts them to PDFs and text. In the unit tests I compare this generated text with a .txt file that contains the expected output.
I'm now facing the issue that these .txt files are not encoded in UTF-8 and were written without persisting this information (umlauts in particular are affected).
I have read a few articles on the topic of persisting and encoding .txt files, including correcting the encoding, saving and opening files in Visual Studio with a specific encoding, and more.
I was wondering if there is a text file format that supports meta information about its encoding, like XML or HTML do.
I'm looking for a solution that:
Is easily adoptable by any coworker on the same team
Is persistent and does not depend on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read with no or only little modification of C#'s File class and its input reading
Supports at least UTF-8 encoding
A Unicode byte order mark (BOM) is sometimes used for this purpose. Systems that process Unicode are expected to strip off this metadata when passing the text on; File.ReadAllText etc. do this. A BOM should exist only at the very beginning of a file or stream.
A BOM is sometimes conflated with the encoding itself because both affect the file format, but a BOM applies only to Unicode encodings. In Visual Studio, the UTF-8 variant with a BOM is called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7); // 3-byte BOM + 4 characters
Debug.Assert(File.ReadAllText(path).Length == 4);  // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Text editors almost universally adopt the principle that they should guess a file's character encoding first, and—for the most part—allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.
I need some help.
I'm writing a program that opens two source files that are UTF-8 encoded without a BOM. The first contains English text and some other information, including IDs. The second contains only string IDs and translations. The program transforms every string from the first file by replacing the English text with the Russian translation from the second file, and writes these strings to an output file. Everything seems to be OK, but a BOM appears in the destination file, and I want to create the file without a BOM, like the sources.
I open the files with the fopen function in text mode with ccs=UTF-8,
read strings with the fgetws function into a wchar_t buffer,
and write with the fputws function to the output file.
Don't use text mode, don't use the MS ccs= extension to fopen, and don't use fputws. Instead use fopen in binary mode and write the correct UTF-8 yourself.
I have to create an m3u8 playlist in my HTTP streaming C++ server code. m3u8 is nothing but a UTF-8 m3u file.
How do I create a UTF-8 file (i.e. how do I write UTF-8 characters to a file)? Maybe with the open() function or some other function in C++ on Linux?
int fd = open("myplaylist.m3u8", O_WRONLY | O_CREAT | O_APPEND, 0644);
The way you are planning to do the open() is fine.
What determines whether a file is in UTF-8 is the content you write to it. Provided you encode the relevant characters as UTF-8, everything will work as expected.
If you plan on converting from a given encoding (say ISO-8859-1) to UTF-8, a good way to achieve it is to use libiconv, which allows you to do exactly that.
Whether a file "is" UTF-8 depends on its content. As long as you write() the correct byte sequences into it, you will be fine.
I'm trying to implement the FTP commands GET and PUT through a UNIX socket for file transfer, using the usual functions like fread(), fwrite(), send() and recv().
It works fine for text files, but fails for binary files (diff says: "Binary files differ").
Any suggestions regarding the following would be appreciated:
Are there any specific functions for reading and writing binary data?
Can diff be used to compare binary files?
Is it possible to send the binary parts in chunks of memory?
The FTP protocol has two modes of operation: text and binary.
Try it in any FTP client; the client commands for switching between them are usually ascii and binary (at the protocol level, TYPE A and TYPE I). From what I recall, text mode only affects CR/LF pairs.
If you're reading from a file and then writing the file's data to the socket, make sure you open the file in binary mode.
Yes, diff can be used to compare binary files, typically with the -q option to suppress the actual printing of differences, which rarely makes sense for binary files. You can also use md5 or cmp if you have them.