I am programming in C using the libxml2 library, I am converting a normal xml file into valid XMLTV format but the Spanish characters are not displaying correctly.
I am aware that the xml encoding for Spanish characters should be UTF-8.
I have this line at the top of my files:
<?xml version="1.0" encoding="utf-8"?>
And I also use the libxml function:
xmlSaveFormatFileEnc( (const xmlChar *) szFilename, doc, "utf-8", 1 );
which I believe is meant to save the file in utf-8 encoding.
When importing the xmltv file to the EPG the Spanish characters are displaying as different Spanish characters (e.g. é may display as Á). I have the feeling the file is still not encoding in UTF-8 but in ISO-8859-1
Related
I have a text file in Unix with two columns that contains strings in various languages (Chines, Korean, Japaneese, Arabic, English, French, German, Etc...) in the first column.
Current file's encoding is:
> file index.txt
index.txt: Non-ISO extended-ASCII English text, with LF, NEL line
terminators
I've been told that this file have a subset of entries (in column 1) that using a non-ASCII, non-UTF8 encoding and that I should switching data in this column preferably to ASCII. If not possible, to UTF8.
For example:
1. How user see it: 'Bibliothe<C3>que'.
2. Via vim: 'Bibliothèque'.
3. Via less: 'Bibliothèque'.
I already tried many conversion and method (for days) but non of them convert it as expected.
E.g
I tried to change the encoding to UTF8:
iconv -f CP1256 -t UTF-8 < index.txt > index.txt.2
770> file index.txt.2
index.txt.2: UTF-8 Unicode English text But the characters
seems to be corrupted in the new file.
But got: 1. Via vim: 'Bibliothﺃ¨que' 2. Via less: 'Bibliothأ¨que'.
I check how many non-ASCii rows this file contains and got output file with hundreds lines in the file 'index.txt.non_ascii':
pcregrep --color='auto' -n "[\x80-\xFF]" index.txt > index.txt.non_ascii
I also tried to write a short script (in Perl) that read the data and store it as utf8, but strings where corrupted again.
I will really appreciate if someone could assist me with this problem.
Thanks in advance!
Mike
The project I'm working on takes xml files and input streams and converts them to pdf's and text. In the unit tests I compare this generated text with a .txt file that has the expected output.
I'm now facing the issue of these .txt files not being encoded in UTF-8 and been written without persisting this information (namely umlauts).
I have read few articles on the topic of persisting and encoding .txt files. Including correcting the encoding, saving and opening files in Visual Studio with encoding, and some more.
I was wondering if there is a text file format that supports meta information about encoding like xml or html for example does.
I'm looking for a solution that is:
Easy adaptable to any coworker on the same team
It being persitant and not depending on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read without or only little modification of the File class and it's input reading of C#
Does at least support UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with encoding because both affect the file format and BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7);
Debug.Assert(File.ReadAllText(path).Length == 4); // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Test editors almost universally adopt the principle that they should guess a file's character encoding first, and—for the most part—allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.
I have an rtf file which contains special character as it is without decode to hexadecimal or unicode. I am using lib rtf in c language.
Below is the content from the rtf file. It has not mentioned encoded mode.
\b0\fi-380\li1800 · Going to home
While parsing i will get some junk value for the character. How to get proper character from the rtf file if it contains any character in specula mode only.
Regards,
Try this:
\b0\fi-380\li1800 \u183? Going to home
How to do it:
https://stackoverflow.com/a/20117501/1543816
For some reason, every file that I bake with CakePHP's console is regarded as ISO-8859-1 encoded by my IDE Dreamweaver. This works fine up to the point where I end up typing a special character, which will be wrongly displayed by the browser, since its encoding (by the editor) differs from the overall rendering.
How can I force the console to produce UTF-8 files, with a BOM if necessary?
I've already tried converting the template files that are used to bake the standard scaffolding pages, but with no luck.
I have the same problem - baked files are NOT UTF-8 but ASCII. (use notepad++ editor which allows easily convert, save files in another format).
Once bake generates files I have to convert them to UTF-8 one by one, to be able to work with Polish local characters.
I tried changing template files to UTF-8 but somehow this does not help. This may have something to do with the fact that the default file does not contain any non ascii character, therefore even if saved as UTF they stay ASCII.
The simplest way I found to overcome this is to modify template file eg.
cake\console\templates\default\classes\model.ctp
to include utf-8 character somewhere, e.g.:
//'message' => 'Your custom message here ł',
(notice last non ASCII character at the end of line.
then converting and saving as UTF-8 makes sure template file is utf-8.
now, model files are generated as UTF-8.
The baked files are UTF-8, or rather, they only contain basic ASCII characters which are identical to the basic UTF-8 range, so can be regarded as either. It's Dreamweaver's problem, not a problem with bake. Check the Dreamweaver settings (or code in a decent editor ;-P).
You do not want to include a BOM, it'll screw you over later.
Use the Bake_UTF8 plugin =]
http://www.github.com/pedroelsner/bake_utf8
I hope this be helpfull.
Pedro Elsner
Another way to achieve this is to open PHP files that are producing UTF-8 content (without BOM) and then saving them in format UTF-8 with BOM using Notepad++ (Encoding->Encode in UTF-8)
In my case I had Excel CSV file:
/patients/exportFirstReport/atskaite1-25-10-2013.csv
Then I had to convert encoding of PHP files down the stack:
\index.php
\app\Controller\PatientsController.php
\app\View\Patients\csv\export_first_report.ctp
\app\View\Layouts\csv\default.ctp
After conversion of encoding of these files it produces readable UTF8 excel files
I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
In the first place I set the locale to a utf-8 aware character set with
setlocale("LC_ALL", "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
See screenshot for content of the textfile:
alt text http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 characters to write them to a file. For that purpose you can use libraries as ICU.
Edit after your edit of the question: UTF-8 characters are only different from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So, for all the text that doesn't have any of this symbols, both are equivalent. However, if you include in your program strings with those symbols, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your latin-1 strings to utf8, and when you print them they will be definitely in UTF-8. If iconv encounters a strange character, or you see the output strings with strange characters, then your strings were in UTF-8 already.
Regards,
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also function with wc prefix, for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to UTF-8 so it chooses this implementation of multibyte support.
If you are coming from an Unicode file I believe the function you looking for is wcstombs().
Can you open up the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of Unicode characters that you passed to write(). Sometimes, there is no way for a text editor to determine character set and your text editor may have assumed an ISO8859-1 character set.
Once you have done this, could you edit your original post to add the pertinent information?