Creating and writing to an UTF-8 file with C++ - file

I have to create a m3u8 playlist in my http streaming C++ server code. m3u8 is nothing but UTF-8 m3u file.
How do I create an UTF-8 file (i.e. how to write UTF-8 characters to a file)? Maybe with the open() function or some other function in C++ on Linux?
int fd = open("myplaylist.m3u8", O_WRONLY | O_APPEND);

The way you are planning to do the open() is fine.
What will determine if a file is in utf-8, is the characters you're going to write to it. Provided you encode the relevant characters as utf-8, everything will work as expected.
If you plan on converting a given encoding (say ISO-8859-1) to utf-8, a good way to achieve it is to use libiconv which allows to do exactly that.

Whether a file "is" utf-8 depends on the content. As long as you write() the correct byte sequences in there you will be fine.

Related

(Text-) File that supports encoding

The project I'm working on takes xml files and input streams and converts them to pdf's and text. In the unit tests I compare this generated text with a .txt file that has the expected output.
I'm now facing the issue of these .txt files not being encoded in UTF-8 and been written without persisting this information (namely umlauts).
I have read few articles on the topic of persisting and encoding .txt files. Including correcting the encoding, saving and opening files in Visual Studio with encoding, and some more.
I was wondering if there is a text file format that supports meta information about encoding like xml or html for example does.
I'm looking for a solution that is:
Easy adaptable to any coworker on the same team
It being persitant and not depending on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read without or only little modification of the File class and it's input reading of C#
Does at least support UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with encoding because both affect the file format and BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7);
Debug.Assert(File.ReadAllText(path).Length == 4); // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Test editors almost universally adopt the principle that they should guess a file's character encoding first, and—for the most part—allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.

Is it possible to prevent adding BOM to output UTF-8 file? (Visual Studio 2005)

I need some help.
I'm writing a program that opens 2 source files in UTF-8 encoding without BOM. The first contains English text and some other information, including ID. The second contains only string ID and translation. The program changes every string from the first file by replacing English chars to Russian translation from the second one and writes these strings to output file. Everything seems to be ok, but there is BOM appears in destination file. And i want to create file without BOM, like source.
I open files with fopen function in text mode with ccs=UTF-8
read string with fgetws function to wchar_t buffer
and write with fputws function to output file
Don't use text mode, don't use the MS ccs= extension to fopen, and don't use fputws. Instead use fopen in binary mode and write the correct UTF-8 yourself.

C program for reading doc, docx, pdf

I want to write a program in C(only c not c++ or java) that will read doc, docx, pdf and want to make it available on github to use for all who needs that code. So I started with .doc file I explored that if I open .doc file with simple notepad it will show you all text but just with some extra content which you can easily trim. So I did write a simple c program to read .doc wile in both 'r' and 'rb' mode but both time it gives me only 5-9 character in the file and those also not readable. I don't know why it's happening. Any comment or disccussion will be very helpful for me.
Here is the link for github Source code. Please help me to complete all three format.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the result of fgetc is very purposefully "int", not "char".
fgetc always returns a positive result in the range 0 to 255, except for when you reach the end of the file in which case it returns EOF, which is often implemented as a -1 value.
If you read a byte of value 255 and store it in an int, everything is OK, it's stored as the value 255 and your loop can continue. If you store the result in a char, it's going to be interpreted equal to EOF. And your loop stops.
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text but also a bunch of control characters that should screw with your terminal.
As a last note - don't try to read docx files directly using fread - they are compressed so you likely won't recover the text unaltered. Take a look at libarchive. Also - expect to have to read the document specifications. docx seems to be a microsoft extension to the openoffice format. See this and some PDF specification documents (there are multiple versions).
Look at the .doc file type as a txt file but with extra non-printable characters before, in the middle, and after your content. These non-printable characters are used for defining special formatting, metadata and other infos.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files start at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file initially, you can ignore the first 2560 bytes of the file (Take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
I have not seen the implementation of a .pdf or a .docx file, but you can take open up both files with a hex editor and figure out what pattern you can use the isolate the important contents of the files.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Read a file with a specific encoding in C?

I've a file that was written on windows with encoding WINDOWS-1256, and I need to write a C program that reads bytes from this file and write them back to a new file with UTF-8 encoding.
How to read a file with a specific encoding in C ??
Other than text mode and binary mode, there is no way to directly read specific encodings with C with standard API.
You would open the file as binary and read it in. Then you can use a library like libiconv to do the encoding / decoding of specific formats.

Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
In the first place I set the locale to a utf-8 aware character set with
setlocale("LC_ALL", "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See screenshot for content of the textfile:
alt text http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 characters to write them to a file. For that purpose you can use libraries as ICU.
Edit after your edit of the question: UTF-8 characters are only different from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So, for all the text that doesn't have any of this symbols, both are equivalent. However, if you include in your program strings with those symbols, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your latin-1 strings to utf8, and when you print them they will be definitely in UTF-8. If iconv encounters a strange character, or you see the output strings with strange characters, then your strings were in UTF-8 already.
Regards,
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also function with wc prefix, for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to UTF-8 so it chooses this implementation of multibyte support.
If you are coming from an Unicode file I believe the function you looking for is wcstombs().
Can you open up the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of Unicode characters that you passed to write(). Sometimes, there is no way for a text editor to determine character set and your text editor may have assumed an ISO8859-1 character set.
Once you have done this, could you edit your original post to add the pertinent information?

Resources