I can't decode UCS-2 BE files (legacy stuff) under Python 3.3 using the built-in open() function. The stack trace shows a UnicodeDecodeError and points at my readLine() method, and I wasn't able to find an encoding name for this format.
I'm on Windows 8; the terminal is set to codepage 65001 and uses the 'Lucida Console' font.
A code snippet probably won't be of much help, but here it is:
def display_resource():
    f = open(r'D:\workspace\resources\JP.res', encoding=<??tried_several??>)
    while True:
        line = f.readline()
        if len(line) == 0:
            break
Appreciating any insight into this issue.
UCS-2 is effectively UTF-16, at least for every codepoint that was assigned while it was still called UCS-2.
Open it with encoding='utf16'. If there is no BOM (the Byte order mark, 2 bytes at the start, for BE that'd be \xfe\xff), then use encoding='utf_16_be' to force a byte order.
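As a minimal sketch of the two cases (the sample text is made up for illustration; with a real file you would pass the same encoding names to open()):

```python
import codecs

text = "JP\u65e5\u672c"  # illustrative sample text, not from the real file

# Big-endian UTF-16 without a BOM: the byte order must be given explicitly.
raw_no_bom = text.encode("utf_16_be")
decoded_no_bom = raw_no_bom.decode("utf_16_be")

# The same bytes preceded by the big-endian BOM (b"\xfe\xff"):
# the generic "utf_16" codec reads the BOM and picks the byte order itself.
raw_with_bom = codecs.BOM_UTF16_BE + raw_no_bom
decoded_with_bom = raw_with_bom.decode("utf_16")

print(decoded_no_bom == text, decoded_with_bom == text)
```

If the file has no BOM and you use the generic codec anyway, the decoder falls back to a platform-dependent byte order, which is why forcing 'utf_16_be' is the safer choice for BOM-less big-endian data.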
I am developing on an embedded system (STM32F4) and tried to send some data to a simple Windows Forms client program on the PC side. When I used a character-based string format everything worked fine, but when I changed to a binary packet format to increase performance I ran into a problem with escape characters.
I'm using nanopb to implement Google's Protocol Buffers for transmission, and I observed that in 5% of packets my client program throws exceptions telling me that the packets are corrupted.
I debugged in Wireshark and saw that each corrupted packet was 2-4 bytes smaller than the original packet size. On further inspection I found that the corrupted packets always included the binary value 27, and the other packets never included this value. I searched for it and saw that this value represents an escape character and that this might lead to problems.
The technical document of the Wi-Fi module I'm using (Gainspan GSM2100) mentions that commands are preceded by an escape character, so I think I need to get rid of these values in my packets.
I couldn't find a solution to my problem, so I would appreciate it if somebody more experienced could lead me to the right approach.
How are you sending the data? Are you using a library or sending raw bytes? According to the manual, your data commands should start with an escape sequence, but also have data length specified:
// Each escape sequence starts with the ASCII character 27 (0x1B),
// the equivalent to the ESC key. The contents of < > are a byte or byte stream.
// - Cid is connection id (udp, tcp, etc)
// - Data Length is 4 ASCII char represents decimal value
// i.e. 1400 bytes would be '1' '4' '0' '0' (0x31 0x34 0x30 0x30).
// - Data size must match with specified length.
// Ignore all command or esc sequence in between data pay load.
<Esc>Z<Cid><Data Length xxxx 4 ascii char><data>
Note the remark regarding data size: "Ignore all command or esc sequence in between data pay load".
For example, this is what the GSCore::writeData function in GSCore.cpp looks like:
// Including a trailing 0 that snprintf insists to write
uint8_t header[8];
// Prepare header: <esc> Z <cid> <ascii length>
snprintf((char*)header, sizeof(header), "\x1bZ%x%04d", cid, len);
// First, write the escape sequence up to the cid. After this, the
// module responds with <ESC>O or <ESC>F.
writeRaw(header, 3);
if (!readDataResponse()) {
    if (GS_LOG_ERRORS && this->error)
        this->error->println("Sending bulk data frame failed");
    return false;
}
// Then, write the rest of the escape sequence (-1 to not write the
// trailing 0)
writeRaw(header + 3, sizeof(header) - 1 - 3);
// And write the actual data
writeRaw(buf, len);
This should most likely work. Alternatively, a dirty hack might be to "escape the escape character" before sending, i.e. replace each 0x1B byte (decimal 27) with two bytes (0x1B 0x1B), but this is just a wild guess, and you should check the manual first.
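To make the frame layout from the manual excerpt concrete, here is a rough Python sketch that only assembles the bytes of <Esc>Z<Cid><4 ASCII length digits><data>. The helper name build_data_frame is hypothetical, and the single-hex-digit cid is an assumption taken from the %x format in the C++ code above. It does not model the <ESC>O / <ESC>F handshake that the library waits for after the first three bytes.

```python
ESC = b"\x1b"  # ASCII 27, the escape character from the manual excerpt


def build_data_frame(cid: int, payload: bytes) -> bytes:
    """Build <Esc>Z<Cid><4 ASCII length digits><data> (hypothetical helper)."""
    if not 0 <= cid <= 0xF:
        raise ValueError("cid is assumed to fit in one hex digit")
    if len(payload) > 9999:
        raise ValueError("the length field only has four decimal digits")
    header = ESC + b"Z" + format(cid, "x").encode("ascii")
    length = format(len(payload), "04d").encode("ascii")
    # The payload is appended unchanged, including any 0x1B bytes it contains.
    return header + length + payload


frame = build_data_frame(1, b"\x1bab")
print(frame)
```

Because the length is declared up front, the payload here is deliberately left unescaped; according to the quoted remark, the module should ignore escape sequences inside the declared data length.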
In my application, when I want to import a file, I use TStringList.
But when someone exports data from Excel, the file encoding is UCS-2 Little Endian, and TStringList can't read the data.
Is there any way to check for this situation, identify the text encoding, and warn the user that the text provided is not compatible?
Just to be clear, the user will provide only plain text (letters and numbers); for anything else, I must send the warning.
A Unicode file without a BOM is fine. (TStringList can read it!)
An ANSI file is fine too. (TStringList can read it!)
Even Unicode with a BOM would be fine, if there is a way to remove it. (TStringList can read it, but with "i", ">>" and "reverse ?" characters that belong to the BOM bytes.)
I used the following function in Delphi 6 to detect Unicode BOMs.
const
//standard byte order marks (BOMs)
UTF8BOM: array [0..2] of AnsiChar = #$EF#$BB#$BF;
UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
UTF16BigEndianBOM: array [0..1] of AnsiChar = #$FE#$FF;
UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
UTF32BigEndianBOM: array [0..3] of AnsiChar = #$00#$00#$FE#$FF;
function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite); // Allow other programs read access at the same time.
  Try
    FillChar(Buffer, SizeOf(Buffer), $AA);//fill with characters that we are not expecting then...
    Stream.Read(Buffer, SizeOf(Buffer)); //...read up to SizeOf(Buffer) bytes - there may not be enough
    //use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer
  Finally
    FreeAndNil(Stream);
  End;
  Result := CompareMem(@UTF8BOM, @Buffer, SizeOf(UTF8BOM)) or
    CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
    CompareMem(@UTF16BigEndianBOM, @Buffer, SizeOf(UTF16BigEndianBOM)) or
    CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
    CompareMem(@UTF32BigEndianBOM, @Buffer, SizeOf(UTF32BigEndianBOM));
end;
This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
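For comparison, the same BOM check can be sketched in Python. This is not part of the Delphi solution, and the function name detect_bom is made up for illustration. Unlike the Boolean function above, it reports which BOM was found, so the order of the checks matters: the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))
```

For a file, the first four bytes read in binary mode are enough input for detect_bom.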
You state that Delphi 6 TStringList can load 16 bit encoded files if they do not have a BOM. Whilst that may be the case, you will find that, for characters in the ASCII range, every other character is #0. Which I guess is not what you want.
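The "every other character is #0" effect is easy to see in a short Python sketch. The looks_like_utf16 helper and its 0.3 threshold are arbitrary assumptions for illustration only, and a naive heuristic like this can misfire in both directions.

```python
sample = "ABC".encode("utf-16-le")
# Each ASCII-range character becomes two bytes, the second of which is NUL.


def looks_like_utf16(data: bytes) -> bool:
    """Naive guess: a large share of NUL bytes suggests 16-bit text."""
    if not data:
        return False
    return data.count(b"\x00") / len(data) > 0.3  # arbitrary threshold


print(sample, looks_like_utf16(sample))
```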
If you want to detect that text is Unicode for files without BOMs then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.
I was following this tutorial.
https://developers.google.com/apps-script/articles/appengine
When I tried to follow Section 1-6
"Test this URL in your browser: http:// localhost:8080/rpc?action=Echo¶ms={"example":"blah"}&key=mySecretKey."
(I added a space between "http://" and "localhost" to avoid Stack Overflow's automatic error check.)
I could not follow because of this error.
<type 'exceptions.SyntaxError'>: 'ascii' codec can't decode byte 0xc2 in position 141: ordinal not in range(128) please see http://www.python.org/peps/pep-0263.html for details (backend.py)
args = ("'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)",)
filename = None
lineno = None
message = "'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)"
msg = "'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)"
offset = None
print_file_and_line = None
text = None
Before this tutorial, I've read a "Hello World" tutorial for Google App Engine. And it worked fine.
What should I do to remove the error?
P.S.
In the tutorial I found a typo "Section 1: Using the Script Editor" should be "Section 1: Creating and deploying an App Engine service". I think.
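The pilcrow sign ¶ (in "... action=Echo¶ms= ...") is not an ASCII character: in Latin-1 (ISO-8859-1, often labelled "Western") it is the single byte B6, but in UTF-8 it is represented as C2 B6. The C2 byte is the one the error message complains about.
Your browser or editor is probably (and quite reasonably) using UTF-8 as the encoding for the script. A workaround may be to change your encoding to Western, then paste the script again. Alternatively, as the PEP 263 link in the error message suggests, declare the source encoding at the top of backend.py.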
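The byte values are easy to check in a Python 3 shell. This is only an illustration of the encodings involved, not a fix for backend.py:

```python
pilcrow = "\u00b6"  # the pilcrow sign

latin1_bytes = pilcrow.encode("latin-1")  # a single byte
utf8_bytes = pilcrow.encode("utf-8")      # two bytes, the first is 0xC2

# Decoding the UTF-8 bytes as ASCII fails on the first byte.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_byte = exc.object[exc.start]

print(latin1_bytes, utf8_bytes, ascii_ok, hex(failing_byte))
```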
Unicode problems are very common in GAE python applications. This article from Nick Johnson will help you with your Python code:
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
I am getting the very familiar:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 24: ordinal not in range(128)
I have checked out multiple posts on SO and they recommend - variable.encode('ascii', 'ignore')
however, this is not working. Even after this I am getting the same error ...
The stack trace:
'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 513, in __call__
handler.post(*groups)
File "/base/data/home/apps/autominer1/1.343038273644030157/siteinfo.py", line 2160, in post
imageAltTags.append(str(image["alt"]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
The code responsible for the same:
siteUrl = urlfetch.fetch("http://www."+domainName, headers = { 'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5' } )
webPage = siteUrl.content.decode('utf-8', 'replace').encode('ascii', 'replace')
htmlDom = BeautifulSoup(webPage)
imageTags = htmlDom.findAll('img', { 'alt' : True } )
for image in imageTags :
    if len(image["alt"]) > 3 :
        imageAltTags.append(str(image["alt"]))
Any help would be greatly appreciated. thanks.
There are two different things that Python treats as strings - 'raw' strings and 'unicode' strings. Only the latter actually represent text. If you have a raw string, and you want to treat it as text, you first need to convert it to a unicode string. To do this, you need to know the encoding for the string - the way unicode codepoints are represented as bytes in the raw string - and call .decode(encoding) on the raw string.
When you call str() on a unicode string, the opposite transformation takes place - Python encodes the unicode string as bytes. If you don't specify a character set, it defaults to ascii, which is only capable of representing the first 128 codepoints.
Instead, you should do one of two things:
Represent 'imageAltTags' as a list of unicode strings, and thus dump the str() call - this is probably the best approach
Instead of str(x), call x.encode(encoding). The encoding to use will depend on what you're doing, but the most likely choice is utf-8 - eg, x.encode('utf-8').
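The two options can be sketched as follows. Note the assumption: this sketch is written for Python 3, where str is the text type and bytes is the 'raw' type, whereas the question's code runs on Python 2. The sample string and variable names are made up.

```python
alt_text = "caf\u00e8"  # contains U+00E8, like the first error message

# Option 1: keep the values as text and encode only at the output boundary.
image_alt_tags = [alt_text]

# Option 2: encode explicitly with a codec that can represent the text.
utf8_bytes = alt_text.encode("utf-8")

# Encoding to ASCII is what fails for non-ASCII characters.
try:
    alt_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, ascii_failed)
```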
I am using VC++ 2008 express edition for C. When I try to run this:
/* Demonstrates printer output. */
#include <stdio.h>
main()
{
    float f = 2.0134;
    fprintf(stdprn, "This message is printed.\n\n");
    fprintf(stdprn, "And now some numbers:\n\n");
    fprintf(stdprn, "The square of %f is %f.", f, f*f);
    /* Send a form feed */
    fprintf(stdprn, "\f");
}
I get four of these errors: error C2065: 'stdprn' : undeclared identifier.
On this forum, they wrote that it works to define the printer as follows:
FILE *printer;
printer = fopen("PRN", "w");
EDIT
It builds with a warning that fopen is unsafe. When it runs the error appears:
Debug Assertion fails.
File: f:\dd\vctools\crt_bld\self_x86\crt\src\fprintf.c
Line: 55
Expression: (str != NULL)
The stdprn stream was an extension provided by Borland compilers - as far as I know, MS have never supported it. Regarding the use of fopen to open the printer device, I don't think this will work with any recent versions of Windows, but a couple of things to try:
use PRN: as the name instead of PRN (note the colon)
try opening the specific device using (for example) LPT1: (once again, note the colon). This will of course not work if you don't have a printer attached.
don't depend on a printer dialog coming up - you are not really using the Windows printing system when you take this approach (and so it probably won't solve your problem, but is worth a try).
I do not have a printer attached, but I do have the Microsoft XPS Document Writer installed, so it should at least bring up the standard Windows Print dialog from which one can choose the printer.
No. It wouldn't bring up a dialogue. This is because you are flushing data out to a file. And not going through the circuitous Win32 API.
The print doesn't work because the data is not proper PDL (page description language), that is, something the printer can understand. For the print to work, you need to send a PDL file with language-specific constructs. This varies from printer to printer: a PS printer needs a PostScript snippet, a PCL printer needs a PCL command set, and for MXDW you will have to write XML-based page description markup and create a zip file (with all resources embedded in it), i.e. an XPS file, to get a proper printout.
The PDL constructs are important because otherwise the printer doesn't know where to put the data, which color to print it in, what orientation to use, how many copies to print and so on.
Edit: I am curious why you are doing this. I understand portability is probably something you are trying to address. But apart from that, I'd like to know, as there may be better alternatives available. The Win32 Print Subsystem APIs are something you ought to look up if you are trying to print programmatically on Windows with any degree of fidelity.
Edit#2:
EDIT It builds with a warning that fopen is unsafe.
This is because MS suggests you use the safer versions nowadays, such as fopen_s. See Security Enhancements in the CRT.
When it runs the error appears:
Debug Assertion fails. File: f:\dd\vctools\crt_bld\self_x86\crt\src\fprintf.c Line: 55
Expression: (str != NULL)
This is because fopen (whose return value you do not check) returns a NULL pointer: the file open failed. Also, if it did succeed, a matching fclose call would be needed.
There's no such thing as stdprn in ANSI C, it was a nonstandard extension provided by some compilers many years ago.
Today to print you have to use the specific APIs provided on your platform; to print on Windows you have to use the printing APIs to manage the printing of the document and obtain a DC to the printer and the GDI APIs to perform the actual drawing on the DC.
On UNIX-like OSes, instead, usually CUPS is used.
You can redirect LPT1 to a network printer with net use; see here in the MSDN KB:
NET USE LPT1 \\server_name\printer_name
There is an excellent chapter on printing in DOS using the BIOS; OK, it's a bit antiquated, but interesting to read purely for nostalgia's sake.
On to your problem: you may need to use CreateFile to open the LPT1 port. See here for an example; I have duplicated it here for your benefit.
HANDLE hFile;
hFile = CreateFile("LPT1", GENERIC_WRITE, 0,NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
if (hFile == INVALID_HANDLE_VALUE)
{
    // handle error
}
OVERLAPPED ov = {};
ov.hEvent = CreateEvent(0, false, false, 0);
char szData[] = "1234567890";
DWORD p;
if (!WriteFile(hFile,szData, 10, &p, &ov))
{
    if (GetLastError() != ERROR_IO_PENDING)
    {
        // handle error
    }
}
// Wait for write op to complete (maximum 3 second)
DWORD dwWait = WaitForSingleObject(ov.hEvent, 3000);
if (dwWait == WAIT_TIMEOUT)
{
    // it took more than 3 seconds
} else if (dwWait == WAIT_OBJECT_0)
{
    // the write op completed,
    // call GetOverlappedResult(...)
}
CloseHandle(ov.hEvent);
CloseHandle(hFile);
But if you insist on opening the LPT1 port directly (error checking is omitted):
FILE *prn = fopen("lpt1", "w");
fprintf(prn, "Hello World\n\f");
fclose(prn);
Hope this helps,
Best regards,
Tom.