How to bake UTF-8 files with CakePHP's console?

For some reason, my IDE (Dreamweaver) treats every file that I bake with CakePHP's console as ISO-8859-1 encoded. This works fine until I type a special character, which the browser then displays wrongly, because the encoding the editor saved the file in differs from the encoding the page is rendered with.
How can I force the console to produce UTF-8 files, with a BOM if necessary?
I've already tried converting the template files that are used to bake the standard scaffolding pages, but with no luck.

I have the same problem: baked files are not UTF-8 but ASCII. (I use the Notepad++ editor, which makes it easy to convert and save files in another encoding.)
Once bake generates the files, I have to convert them to UTF-8 one by one to be able to work with Polish characters.
I tried changing the template files to UTF-8, but that did not help. This may be because the default template does not contain any non-ASCII character, so even when saved as UTF-8 it stays indistinguishable from ASCII.
The simplest way I found to overcome this is to modify a template file, e.g.
cake\console\templates\default\classes\model.ctp
to include a UTF-8 character somewhere, e.g.:
//'message' => 'Your custom message here ł',
(notice the non-ASCII character at the end of the line).
Converting and saving the template as UTF-8 then makes sure the template file really is UTF-8.
Now model files are generated as UTF-8.

The baked files are UTF-8, or rather, they only contain basic ASCII characters which are identical to the basic UTF-8 range, so can be regarded as either. It's Dreamweaver's problem, not a problem with bake. Check the Dreamweaver settings (or code in a decent editor ;-P).
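The claim that ASCII-only files are also valid UTF-8 can be checked directly. Below is a minimal Python sketch; the sample strings and the helper name is_pure_ascii are made up for illustration and are not part of CakePHP.

```python
# ASCII-only text produces exactly the same bytes in ASCII, UTF-8 and
# ISO-8859-1, so an editor has nothing to go on when guessing the encoding.
ascii_only = "<?php echo 'Hello'; ?>"
assert (ascii_only.encode("ascii")
        == ascii_only.encode("utf-8")
        == ascii_only.encode("iso-8859-1"))

# As soon as one non-ASCII character appears, the UTF-8 bytes stop being ASCII.
with_polish = "Your custom message here \u0142"  # ends with the letter "l with stroke"
utf8_bytes = with_polish.encode("utf-8")
print(len(with_polish), len(utf8_bytes))  # byte count exceeds character count


def is_pure_ascii(data):
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)
```

A file for which is_pure_ascii returns True can be labelled ASCII, ISO-8859-1 or UTF-8 by an editor without any of those labels being wrong.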
You do not want to include a BOM, it'll screw you over later.

Use the Bake_UTF8 plugin =]
http://www.github.com/pedroelsner/bake_utf8
I hope this is helpful.
Pedro Elsner

Another way to achieve this is to open the PHP files that produce UTF-8 content (without BOM) and save them as UTF-8 with BOM using Notepad++ (Encoding->Encode in UTF-8).
In my case I was generating an Excel CSV file:
/patients/exportFirstReport/atskaite1-25-10-2013.csv
I had to convert the encoding of the PHP files down the stack:
\index.php
\app\Controller\PatientsController.php
\app\View\Patients\csv\export_first_report.ctp
\app\View\Layouts\csv\default.ctp
After converting the encoding of these files, the export produces readable UTF-8 Excel files.

Related

(Text-) File that supports encoding

The project I'm working on takes xml files and input streams and converts them to pdf's and text. In the unit tests I compare this generated text with a .txt file that has the expected output.
I'm now facing the issue that these .txt files are not encoded in UTF-8 and have been written without persisting this information (namely umlauts).
I have read a few articles on persisting and encoding .txt files, including correcting the encoding, saving and opening files in Visual Studio with a specific encoding, and some more.
I was wondering if there is a text file format that supports meta information about encoding, as XML or HTML do, for example.
I'm looking for a solution that:
Is easy to adopt for any coworker on the same team
Is persistent and does not depend on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read with little or no modification of C#'s File class and its input reading
Supports at least UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with encoding because both affect the file format and BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7);
Debug.Assert(File.ReadAllText(path).Length == 4); // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Text editors almost universally adopt the principle that they should guess a file's character encoding first and, for the most part, allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.
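For readers who want to see the same BOM behaviour outside .NET, here is a rough Python sketch. It assumes Python 3 and its built-in "utf-8-sig" codec; the helper names write_with_bom and read_stripping_bom are invented for this example.

```python
import codecs
import os
import tempfile


def write_with_bom(path, text):
    # "utf-8-sig" prepends the 3-byte UTF-8 BOM (EF BB BF) when writing.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


def read_stripping_bom(path):
    # "utf-8-sig" also strips a leading BOM on read, if one is present.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()


fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_with_bom(path, "Test")
    with open(path, "rb") as f:
        raw = f.read()
    assert raw.startswith(codecs.BOM_UTF8)
    assert len(raw) == 7                       # 3 BOM bytes + 4 ASCII bytes
    assert read_stripping_bom(path) == "Test"  # the BOM is not part of the text
finally:
    os.remove(path)
```

As with the C# version, this only helps if every reader and writer on the team agrees to use the BOM-aware codec.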

How to output foreign characters in console?

How can I print foreign characters on the screen using C?
Here's my code, which doesn't work:
#include <stdio.h>
#include <stdlib.h>   /* for system() */
#include <locale.h>
int main(){
    setlocale(LC_ALL,"Turkish");
    printf("İ ş ğ ü ö ı");
    system("pause");
    return 0;
}
On my Windows installation there are no such characters in the 'Terminal' font, so I think you can't print them.
But I suggest you check this font yourself. Maybe you have a different version of it.
If you're using a narrow charset, you need to make sure that the terminal/console uses the same charset and that the source code file is saved in that encoding; otherwise the system will misinterpret the character codes.
To set the charset in the console, run chcp. For example, to use code page Windows-1254, run chcp 1254. You can also set the code page programmatically with SetConsoleOutputCP, e.g. SetConsoleOutputCP(1254).
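Why the code pages have to match can be illustrated without a Windows console. The following Python sketch only uses the standard codecs cp1254 (Turkish) and cp1252 (Western European); the helper name roundtrips is made up for this example.

```python
# The same byte means different characters in different legacy code pages.
byte = "\u011f".encode("cp1254")    # g-breve, encoded as Windows-1254
as_turkish = byte.decode("cp1254")  # read back with the matching code page
as_western = byte.decode("cp1252")  # read back with a mismatched code page
print(byte, ascii(as_turkish), ascii(as_western))


def roundtrips(text, write_cp, read_cp):
    """True if text written in write_cp reads back unchanged in read_cp."""
    return text.encode(write_cp).decode(read_cp, errors="replace") == text
```

For the Turkish sample string from the question, roundtrips(text, "cp1254", "cp1254") holds, while reading the same bytes as cp1252 silently yields different letters rather than an error.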
However, you should avoid the legacy ANSI code pages and use Unicode instead. The currently preferred way on Windows is to output Unicode characters as wide chars with wprintf. You may need to set the stream to wide mode first with
int result = _setmode(_fileno(stdout), _O_U16TEXT);
then
wprintf(L"İ ş ğ ü ö ı");
See also the wprintf manual for Windows, Linux or Mac. On POSIX systems, however, UTF-8 is preferred.
On older Windows versions UTF-8 support in the console is not very good, but it keeps getting better, and Windows 10 even supports UTF-8 as a locale, so you can just call SetConsoleOutputCP(CP_UTF8); or SetConsoleOutputCP(65001); (or run chcp 65001 in the console) and it'll work immediately, provided that you saved the source code as UTF-8. Remember to also set the font to one that supports those characters, like Lucida Console or Consolas. The default raster font contains a very limited number of characters and appears with a lot of aliasing. It also doesn't work well on modern HiDPI displays.
There are already lots of questions about outputting Unicode on this site like Output unicode strings in Windows console app or UTF-8 character in .NET Console Application. Please have a look and try to see which one fits you.
Edit
When you use
wchar_t c=L'ğ';
fputwc(c,ptr);
you're printing to a file and not to the console. In that case just the stream of bytes is saved into the file. When you open the file again, it's the editor's job to interpret the bytes in the correct charset and display them correctly. For example, the character "ğ" is stored as c4 9f in UTF-8, and when you open the file as UTF-8, the editor knows that those bytes represent the char "ğ".
Unfortunately there's no character encoding information embedded in a text file, so the editor must choose one. Remember There Ain't No Such Thing As Plain Text (must read). A simple editor may just open the file as ANSI in the current Windows code page; if the original encoding is a different one, the characters won't be displayed correctly and you'll just see garbage.
Some more advanced editors like Notepad++ or MS Word will try to guess the encoding of the file. But as with any guessing, it can be wrong, and the result is again a file full of garbage.
The simplest solution is to add a BOM to the beginning of the file so the editor can recognize the encoding easily. If your file doesn't contain a BOM and the editor guesses wrong, you need to tell it to read the file in the correct encoding (for wchar_t on Windows, as here, that's UTF-16LE). For example, in Notepad++ this is done through the Encoding menu.
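What "recognizing the encoding from a BOM" amounts to can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular editor works; sniff_bom and decode_with_bom are hypothetical helper names, and the UTF-8 fallback is an assumption.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback="utf-8"):
    """Decode bytes using the BOM if present, else the caller's fallback guess."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Without a BOM the function has to fall back to a guess, which is exactly the situation described above: the bytes c4 9f only become "ğ" if the guess happens to be UTF-8.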
Unfortunately the OP didn't edit the question to show what was tried, so there's nothing more I can explain.
Your code works: http://ideone.com/K9hrv5
setlocale(LC_ALL,"Turkish");
printf("İ ş ğ ü ö ı");
The only issue is that you have to set your terminal locale as well before running your C program's binary.
setlocale only affects the runtime locale. It doesn't make your compiler support extra source file characters.
You may need to specify the non-ASCII characters in your source file by using character constants (e.g. \xF1 for the character with code 241).

appengine and jinja2 templating, correctly handling encoding for latin charset

I'm building a simple hierarchical template with jinja.
I was splitting my monolithic HTML file into smaller chunks to reuse them.
In my HTML I had accented characters like èàù...
When I tried to run the app, I received errors complaining about my files not being properly encoded.
So I decided to adhere to UTF-8 standards and substituted those characters with things like &ugrave;
and so on...
This worked, but it seems a bit cumbersome as a workflow.
I'm wondering what the correct approach to this kind of problem is, both for template files and for content that has to be rendered in those templates.
I suppose that it is good practice to encode text in UTF-8 before storing it, but I'm not sure.
When you use UTF-8 in your (HTML) files, templates and Python code, it works fine and you can use: èàù...
Make sure your editors save files as UTF-8, and read this article about UTF-8: http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python. You can also use:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
at the start of your Python modules.
In templates, it is sufficient to save the file with the correct UTF-8 encoding.
As for strings that you want to store in source code and pass around in your program, if you only need a quick and dirty suggestion, here it is:
save your .py files as UTF-8
now you can use any characters you want in your strings
you only need to explicitly inform the rest of the system that you are using UTF-8 encoding
EXAMPLE
my_string = """I write àny character I want herè, just only ensurè that the encoding for the file you are writing is utf-8 àèàééàà"""
def utf8(s):
    return unicode(s, 'utf8')
pass_string_around = utf8(my_string)
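The snippets above are written for Python 2. For text that is read from or written to files (templates, stored content), one option is to state the encoding explicitly at the file boundary. The following is only a sketch: io.open exists in both Python 2.6+ and Python 3, while the helper names, the file name chunk.html and the sample markup are hypothetical.

```python
import io
import os
import tempfile


def save_text(path, text):
    # Encode explicitly on write instead of relying on the platform default.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    # Decode explicitly on read; the result is a Unicode string.
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


# Hypothetical template content, for illustration only.
template = u"<p>\u00e8\u00e0\u00f9</p>"  # the accented letters from the question
path = os.path.join(tempfile.mkdtemp(), "chunk.html")
save_text(path, template)
assert load_text(path) == template
```

Keeping all decoding and encoding at these two points means the rest of the program only ever handles Unicode strings.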

How to get cakePHP i18n .pot files encoded in UTF-8?

I'm using the cake i18n command to extract the content of my __() functions in my application.
However, the default.pot output file is not encoded in UTF-8 and thus does not correctly display accented characters, which is a problem since the main language is French (lots of 'é', 'à' ...).
I'm using wamp server on Windows 7.
I've tried changing the Windows console's encoding with chcp and converting the default.pot file to UTF-8 with Notepad++ or the PSPad editor, without success.
Do you know any way to get this default.pot file in UTF-8?
All the .php and .ctp files are edited with either Komodo or Geany, both on Windows and configured to use UTF-8.
I'm also using Subversion, if that helps.
Thank you for reading.
I had the same problem with CakePHP 1.3 (not sure if it is fixed in 2.x): all "special" characters which are not ANSI compliant (such as ä, ü, ö, ß) were extracted into the .pot file and interpreted there as ANSI (e.g. "Ã¼" instead of "ü").
The solution mentioned by Camille (manually changing the characters) was not very feasible: there were a lot of characters, it partly destroyed the .pot format and, even worse, an automated update of the .po file won't work.
The workaround I found was with the help of a comment in the PHP documentation for fwrite() (which is used in the console task): http://www.php.net/manual/en/function.fwrite.php#73764.
According to the description there, I extended the file /cake/console/libs/tasks/extract.php with two lines:
First line went into the function __buildFiles():
$string = utf8_decode($string);
I put it at line 351, but it just needs to be in the second foreach loop and, of course, before the variable is used by the function.
The second line went into the function __writeHeader():
replace the line $File->write($output); with
$File->write(utf8_encode($output));
That did it for me. Be aware that updating CakePHP will overwrite these changes.
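The garbling described above (UTF-8 bytes interpreted as a single-byte Latin encoding) is easy to reproduce. The Python sketch below is only an illustration of the mechanism, not of what CakePHP does internally; PHP's utf8_decode and utf8_encode convert between UTF-8 and ISO-8859-1, which is roughly what the two hypothetical helpers here mimic.

```python
def misread_utf8_as_latin1(text):
    """Simulate reading UTF-8 bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_mojibake(garbled):
    """Undo the misreading; only valid if that is really what happened."""
    return garbled.encode("iso-8859-1").decode("utf-8")


garbled = misread_utf8_as_latin1("\u00fc")  # u-umlaut
assert garbled == "\u00c3\u00bc"            # shows up as A-tilde plus "1/4" sign
assert repair_mojibake(garbled) == "\u00fc"
```

The repair only works when the text was garbled exactly once in exactly this way; applying it to correctly decoded text raises an error or produces new damage.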
I found a way to deal with this thanks to @Msalters. I changed the default encoding of my editor and overwrote the wrong characters.

How to read lines from a pdf file into a c program using ghostscript?

I am currently taking a course in C programming, and for our final project we need to read some text from a PDF into a string, so we can manipulate the string.
In essence, what I am looking for is something similar to this, only with a .pdf instead of a .txt file.
char *line;
fscanf(myfile.txt," %[^\n]", line);
I have no experience with Ghostscript, so I have no idea if this is even possible, although we were told that we should use Ghostscript.
The current version of Ghostscript includes the 'txtwrite' device, which will extract text from any supported input (PostScript, PDF, XPS, PCL) and will emit it in a variety of forms.
The UTF-8 output would probably be most useful to you.
Caveat! Many things which appear to be text in PDF files are not text, and no attempt is made to deal with these.
ps2ascii is deprecated with the release of the txtwrite device, but in any case it's perfectly capable (despite the name) of dealing with PDF as an input.
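A typical way to drive the txtwrite device is from the command line, and the extracted text file can then be read by the C program like any other text file. The sketch below builds such a command in Python. It assumes the Ghostscript executable is called gs and is on PATH (on Windows it is usually named differently), and the file names are placeholders.

```python
import subprocess


def txtwrite_command(pdf_path, txt_path, gs_binary="gs"):
    """Build a Ghostscript command line for plain-text extraction."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-sDEVICE=txtwrite",  # the text-extraction device
        "-o", txt_path,       # output file; -o also implies -dBATCH -dNOPAUSE
        pdf_path,
    ]


def extract_text(pdf_path, txt_path):
    # Requires Ghostscript to be installed; raises CalledProcessError on failure.
    subprocess.check_call(txtwrite_command(pdf_path, txt_path))


# Placeholder file names; this only prints the command, it does not run it.
print(" ".join(txtwrite_command("input.pdf", "output.txt")))
```

The caveat above still applies: whatever is not real text in the PDF will not appear in the output file.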
I can't think why anyone assigned you this project: PDF files are not text files and cannot be treated as such. In addition to the fact that PDF files are generally compressed, identifying the contents stream and all the other streams it relies on (which may themselves include text) is non-trivial. Plus, the text is often encoded in a way which can be difficult to understand (this is particularly true of CIDFonts and TrueType fonts).
Perhaps your tutor expected you to first become expert in the PDF format, but that seems excessive for a C course.
You can convert your PDF to Postscript using pdf2ps, and then to ASCII using ps2ascii. You already know how to read ASCII.
Both utilities mentioned are in the ghostscript package.
