I am getting the very familiar:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 24: ordinal not in range(128)
I have checked multiple posts on SO and they recommend variable.encode('ascii', 'ignore').
However, this is not working; even after this I am getting the same error ...
The stack trace:
'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 513, in __call__
handler.post(*groups)
File "/base/data/home/apps/autominer1/1.343038273644030157/siteinfo.py", line 2160, in post
imageAltTags.append(str(image["alt"]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
The code responsible:
siteUrl = urlfetch.fetch("http://www."+domainName, headers = { 'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5' } )
webPage = siteUrl.content.decode('utf-8', 'replace').encode('ascii', 'replace')
htmlDom = BeautifulSoup(webPage)
imageTags = htmlDom.findAll('img', { 'alt' : True } )
for image in imageTags:
    if len(image["alt"]) > 3:
        imageAltTags.append(str(image["alt"]))
Any help would be greatly appreciated. Thanks.
There are two different things that Python treats as strings - 'raw' strings and 'unicode' strings. Only the latter actually represent text. If you have a raw string and you want to treat it as text, you first need to convert it to a unicode string. To do this, you need to know the encoding for the string - the way unicode codepoints are represented as bytes in the raw string - and call .decode(encoding) on the raw string.
When you call str() on a unicode string, the opposite transformation takes place - Python encodes the unicode string as bytes. If you don't specify a character set, it defaults to ascii, which is only capable of representing the first 128 codepoints.
Instead, you should do one of two things:
Represent 'imageAltTags' as a list of unicode strings, and thus dump the str() call - this is probably the best approach
Instead of str(x), call x.encode(encoding). The encoding to use will depend on what you're doing, but the most likely choice is utf-8 - eg, x.encode('utf-8').
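For example, here is a minimal sketch of both options against the loop from the question (imageAltTags and image are the names from the code above):
# str(image["alt"]) implicitly encodes as ASCII and fails on u'\x92' etc.

# Option 1 (preferred): keep the alt text as a unicode string
imageAltTags.append(image["alt"])

# Option 2: encode explicitly, choosing a real encoding such as UTF-8
imageAltTags.append(image["alt"].encode('utf-8'))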
I am developing on an embedded system (STM32F4) and I tried to send some data to a simple Windows Forms client program on the PC side. When I used a character-based string format everything was working fine, but when I changed to a binary packet format to increase performance I ran into a problem with escape characters.
I'm using nanopb to implement Google's Protocol Buffers for transmission, and I observed that for about 5% of packets I receive exceptions in my client program telling me that my packets are corrupted.
I debugged in Wireshark and saw that in these corrupted packets the size was 2-4 bytes smaller than the original packet size. Upon further inspection I found that the corrupted packets always included the binary value 27 and other packets never included this value. I searched for it and saw that this value represents an escape character, which might lead to problems.
The technical document of the Wi-Fi module I'm using (Gainspan GSM2100) mentions that commands are preceded by an escape character, so I think I need to get rid of these values in my packets.
I couldn't find a solution to my problem, so I would appreciate it if somebody more experienced could lead me to the right approach to solving this problem.
How are you sending the data? Are you using a library or sending raw bytes? According to the manual, your data commands should start with an escape sequence, but also have data length specified:
// Each escape sequence starts with the ASCII character 27 (0x1B),
// the equivalent to the ESC key. The contents of < > are a byte or byte stream.
// - Cid is connection id (udp, tcp, etc)
// - Data Length is 4 ASCII char represents decimal value
// i.e. 1400 bytes would be '1' '4' '0' '0' (0x31 0x34 0x30 0x30).
// - Data size must match with specified length.
// Ignore all command or esc sequence in between data pay load.
<Esc>Z<Cid><Data Length xxxx 4 ascii char><data>
Note the remark regarding data size: "Ignore all command or esc sequence in between data pay load".
For example, this is what the GSCore::writeData function in GSCore.cpp looks like:
// Including a trailing 0 that snprintf insists to write
uint8_t header[8];
// Prepare header: <esc> Z <cid> <ascii length>
snprintf((char*)header, sizeof(header), "\x1bZ%x%04d", cid, len);
// First, write the escape sequence up to the cid. After this, the
// module responds with <ESC>O or <ESC>F.
writeRaw(header, 3);
if (!readDataResponse()) {
  if (GS_LOG_ERRORS && this->error)
    this->error->println("Sending bulk data frame failed");
  return false;
}
// Then, write the rest of the escape sequence (-1 to not write the
// trailing 0)
writeRaw(header + 3, sizeof(header) - 1 - 3);
// And write the actual data
writeRaw(buf, len);
This should most likely work. Alternatively, a dirty hack might be to "escape the escape character" before sending, i.e. replace each 0x1B byte (decimal 27) with two bytes (0x1B 0x1B) before sending - but this is just a wild guess and I am presuming you should just check the manual.
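To illustrate that "escape the escape character" idea, here is a minimal byte-stuffing sketch (in Python for brevity; the escape value 0x1B and the doubling scheme are assumptions, not the documented Gainspan protocol):
ESC = 0x1B  # ASCII 27, the escape character mentioned in the module's manual

def stuff_escapes(payload):
    # Double every escape byte so it cannot be mistaken for the start of a
    # command sequence (a guess - check the manual for the real rule).
    out = bytearray()
    for b in bytearray(payload):
        out.append(b)
        if b == ESC:
            out.append(ESC)
    return bytes(out)

print(repr(stuff_escapes(b'\x01\x1b\x02')))  # escape byte doubled: 01 1b 1b 02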
I am working on a program that outputs PDF documents. Given a sequence of UTF-8 encoded characters and the name of a font that shall be used to render it, I would like to show the appropriate glyphs that make up the actual content of the document. I would like to be able to display national characters such as č or ö. It would be great to support ligatures like ae or ffi.
The problem is, I do not know how the actual glyphs to be shown are specified (inside a content stream, for example).
If I, for example, want to display the string "Hello World", I need not worry about encoding; I simply write (Hello World)Tj. The PDF reader will then use the appropriate font to render this string.
But what if I wanted to show the string
It is difficult to read the PDF specification all day. Prostě dočista nemožné!
with the ligatures ffi, fi and ea and the Czech national symbols ě, č and é in a given font, how would I proceed?
I am trying to get through the PDF specification, but it is not easy.
How do I find out the "code of the glyph" that corresponds to a given character or ligature?
How is this code encoded within a PDF content stream?
Help is much appreciated.
Edit: I may have overestimated the problem. Counting the glyphs needed to display a "common European document", I cannot see how this number could exceed 256. If my assumptions are correct, I can remap the encoding of the font completely. This should be sufficient to cover all common symbols of the Latin alphabet, numbers, punctuation, common symbols like ( and [, and still leave plenty of room for national symbols, ligatures and other elements of high-quality typography. (I can implement a priority queue to select the most-used ligatures if the total number of glyphs should exceed 256.)
That being said, I do not think I need to use the CID-keyed fonts.
Still, I wonder how to map UTF-8 encoded characters onto glyphs of an arbitrary font. I have the AFM of the font available. For the DejaVu font, for example, the character information goes like this:
C 63 ; WX 536 ; N question ; B 67 -15 488 743 ;
C 64 ; WX 1000 ; N at ; B 65 -174 930 705 ;
C 65 ; WX 722 ; N A ; B -6 0 732 730 ;
But after the 256th character is mapped, the codes are -1:
C 255 ; WX 564 ; N ydieresis ; B -3 -223 563 767 ;
C -1 ; WX 722 ; N Amacron ; B -6 0 732 899 ;
C -1 ; WX 596 ; N amacron ; B 49 -15 568 746 ;
For example, if I had the sequence 11100010 10000010 10101100 (Euro sign) in my input, how would I know what glyph name it corresponds to so that I can map it in the /Encoding dictionary?
Encoding varies based on the font type. Typically, there is a font resource that is defined as the current font and within that font dictionary is a reference to a base font and a means of describing the encoding (via the /Encoding key). If that key doesn't exist, the encoding will be "standard", but you can use other simple encodings such as /MacRoman and /WinAnsi for the value of the encoding, or you can specify a standard encoding and an encoding delta to show the differences.
Easy so far - as long as you're working with 8-bit characters. Many early apps would create a couple of different fonts: one with, say, Roman encoding, and another that maps Roman character codes to glyphs that are normally unencoded. In order to do that, your encoding delta would include references to the ligatures and other typically non-encoded symbols. This works great for Type 1 fonts, but is specifically contraindicated by the spec in the section on TrueType fonts:
A nonsymbolic font should specify MacRomanEncoding or WinAnsiEncoding as the value of its Encoding entry, with no Differences array
This is vastly different when you want to use, say, Unicode. In which case you would be using a CID font (a font based on character IDs). In that case there is a procedure referenced by the font which is used to map from a character encoding in your string to a character ID in your font (and vice versa). I would strongly recommend that you read and fully understand section 9.7 in the PDF specification on Composite Fonts, which describes everything you need in order to encode UTF16BE into strings to get them to render properly in PDF. It is decidedly non-trivial in that there are a lot of details that if missed will result in a blank rendered page in Acrobat.
As a software engineer who professionally writes code that produces and consumes PDF, let me state that when I get tasked with having to put in special cases in my code to deal with non-spec-compliant PDF, a little piece of me dies inside. Please, please, don't even think of releasing any documents you produce into the wild until they pass Preflight at the least. This is not the same as "Acrobat renders it so it must be OK." Let me give you an example - I've seen a number of files in the wild that include fonts that are missing the key elements of the FontDescriptor dictionary, including /Ascent, /Descent, /CapHeight, etc. These render in Acrobat, but are in violation of the spec since each of those is required. I know how Acrobat handles that - it comes with an enormous database of font metrics and looks up the value if it can't find it in the file (heck, it might even ignore the metrics in the file). I don't have that luxury, so I have to take a number of (potentially expensive/invalid) stopgap measures.
You might want to consider using a library to do this work for you - maybe iText, which has a decent enough licensing scheme for education because, I get it, you're a student. There are some C-based libraries too. Maybe you can figure a way to make GhostScript do your bidding.
If you are unwilling or unable to follow my advice with regards to cleaving to the specification or to use a library which ostensibly does so, please do me the favor of at least filling out the /Creator and /Producer strings in the Document Information Dictionary referenced by the trailer (see sections 14.3.3 and section 7.5.5). That way, when I have to parse/consume/manipulate your documents, I will have a way to directly cast aspersions on your parentage.
Let's go top down and start with the page object - I'm using output from my own library and am stripping out what I think you don't need:
1 0 obj <<
/Type /Page
/Parent 18 0 R
/Resources <<
/Font <<
/U0 13 0 R
>>
/ProcSet [ /PDF /Text ]
>>
/MediaBox [ 0 0 612 792 ]
/Contents 19 0 R
/Dur -1
>>
endobj
U0 is a reference to a font that will be used for unicode text.
The content stream is intended to print the following text: Greek: Γειά σου κόσμος.
BT /U0 24 Tf 72 670 Td
(\000G\000r\000e\000e\000k\000:\000 \003\223\003\265\003\271\003\254\000 \003\303\003\277\003\305\000 \003\272\003\314\003\303\003\274\003\277\003\302)
Tj ET
The font dictionary referenced looks like this:
13 0 obj <<
/BaseFont /DejaVuSansCondensed
/DescendantFonts [ 4 0 R ]
/ToUnicode 14 0 R
/Type /Font
/Subtype /Type0
/Encoding /Identity-H
>>
endobj
Its /ToUnicode entry points to a stream containing the following PostScript code:
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1 beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap defineresource pop end end
which is defined by the CID font specification.
and the DescendantFonts array points to this object:
4 0 obj <<
/Subtype /CIDFontType2
/Type /Font
/BaseFont /DejaVuSansCondensed
/CIDSystemInfo 7 0 R
/FontDescriptor 8 0 R
/DW 1000
/W 9 0 R
/CIDToGIDMap 10 0 R
>>
The CIDToGIDMap is a compressed stream with the actual map, and the CIDSystemInfo is <</Registry (Adobe) /Ordering (UCS) /Supplement 0>> (it's a reference because I share it among all unicode fonts that I output). The FontDescriptor is straightforward boilerplate, and the W array is derived from the font metrics.
With all this detail, are you understanding why I don't say lightly, "walk away before you pollute my environment any further"?
I'm really beginning to question the nature of this assignment. Writing a simple PDF is one thing, but writing code that can handle full Unicode in any arbitrary OpenType/TrueType font requires you to understand the CID spec and the TrueType spec (hint: I have a full TrueType parser that can extract all the metrics for any glyph in a font so that I can output the /W array).
If, however, you are required to only output to Type 1 fonts, well my friend, your life just got a whole lot easier, because you would take your entire UTF-8 stream, read it as Unicode, and for every unique character that comes in, build a map from a Unicode character to a glyph name and an internal character number by using this table. The internal character number is essentially the unique index of the character as it arrived, mod 256. So for example, if you have fewer than 257 unique characters on the page, you will have exactly one font that is encoded to map to the characters in the order that they arrived. If you had "abcba" for input, the output string in PDF would be (\000\001\002\001\000) and would map to a font with an encoding dictionary whose Differences array would be [0/a/b/c]. If you have n unique characters where n > 256, you're going to have (n / 256) + 1 fonts, each with its own encoding.
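As a rough illustration, here is a minimal sketch of that remapping (in Python; glyph_name_for() is a hypothetical stand-in for the glyph-name table mentioned above, and the n > 256 case is omitted):
def glyph_name_for(ch):
    # Hypothetical lookup from a Unicode character to a PostScript glyph name;
    # a real implementation would consult the Adobe glyph list / the font's AFM.
    return {'a': 'a', 'b': 'b', 'c': 'c'}.get(ch, 'question')

def remap(text):
    codes = {}        # unicode char -> internal code (0..255)
    differences = []  # glyph names, in first-seen order
    out = []
    for ch in text:
        if ch not in codes:
            codes[ch] = len(differences)
            differences.append(glyph_name_for(ch))
        out.append(codes[ch])
    pdf_string = ''.join('\\%03o' % c for c in out)  # octal escapes for the content stream
    return pdf_string, differences

s, diffs = remap('abcba')
print('(' + s + ') Tj')                               # (\000\001\002\001\000) Tj
print('/Differences [ 0 /' + ' /'.join(diffs) + ' ]')  # /Differences [ 0 /a /b /c ]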
If your teacher/professor wants anything but Type 1 fonts in a short period of time, s/he has unrealistic expectations for the students and/or low expectations for the quality of output. You should ask whether you are required to handle CID fonts, and if you are, then your professor is at the very least a sadist. It took me, a seasoned professional, about 4 days to write a TrueType parser for extracting widths. I had the advantage of (1) using a managed language (C#), which cut down on concerns that will be biting your ass in C, and was also able to use reflection to automate parsing, and (2) when I don't have interruptions, I write solid code about 10-20 times faster than a typical student, so my 32 hours would translate into 320 student hours, more or less (then again, my code has different constraints than yours - it has to consume any crap font it gets gracefully), so let's call it 200 or less if you're allowed to steal something like stb. That's just for getting one particular element in the font descriptor.
In my application, when I want to import a file, I use TStringList.
But when someone exports data from Excel, the file encoding is UCS-2 Little Endian, and TStringList can't read the data.
Is there any way to validate this situation, identify the text encoding, and send a warning to the user that the text provided is not compatible?
Just to be clear, the user will provide only plain text (letters and numbers); if it is anything other than that, I must send the warning.
Unicode File without BOM is good. (TStringList can read it!)
ANSI file Too. (TStringList can read it!)
Even Unicode with BOM will be good, if there is a way to remove it. (TStringList can read it, but with "ï", "»" and "¿" characters, which belong to the BOM bytes.)
I used the following function in Delphi 6 to detect Unicode BOMs.
const
  //standard byte order marks (BOMs)
  UTF8BOM: array [0..2] of AnsiChar = #$EF#$BB#$BF;
  UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
  UTF16BigEndianBOM: array [0..1] of AnsiChar = #$FE#$FF;
  UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
  UTF32BigEndianBOM: array [0..3] of AnsiChar = #$00#$00#$FE#$FF;

function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite); // Allow other programs read access at the same time.
  Try
    FillChar(Buffer, SizeOf(Buffer), $AA); // fill with characters that we are not expecting, then...
    Stream.Read(Buffer, SizeOf(Buffer));   // ...read up to SizeOf(Buffer) bytes - there may not be enough
    // use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer
  Finally
    FreeAndNil(Stream);
  End;
  Result := CompareMem(@UTF8BOM, @Buffer, SizeOf(UTF8BOM)) or
    CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
    CompareMem(@UTF16BigEndianBOM, @Buffer, SizeOf(UTF16BigEndianBOM)) or
    CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
    CompareMem(@UTF32BigEndianBOM, @Buffer, SizeOf(UTF32BigEndianBOM));
end;
This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
You state that Delphi 6 TStringList can load 16 bit encoded files if they do not have a BOM. Whilst that may be the case, you will find that, for characters in the ASCII range, every other character is #0. Which I guess is not what you want.
If you want to detect that text is Unicode for files without BOMs then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.
I can't seem to decode UCS-2 BE files (legacy stuff) under Python 3.3 using the built-in open() function (the stack trace shows UnicodeDecodeError and contains my readLine() method) - in fact, I wasn't able to find a flag for specifying this encoding.
Using Windows 8, terminal is set to codepage 65001, using 'Lucida Console' fonts.
Code snippet won't be of too much help, I guess:
def display_resource():
    f = open(r'D:\workspace\resources\JP.res', encoding=<??tried_several??>)
    while True:
        line = f.readline()
        if len(line) == 0:
            break
Appreciating any insight into this issue.
UCS-2 is UTF-16, really - at least for any codepoint that was assigned back when it was still called UCS-2.
Open it with encoding='utf16'. If there is no BOM (the Byte order mark, 2 bytes at the start, for BE that'd be \xfe\xff), then use encoding='utf_16_be' to force a byte order.
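A minimal sketch using the path from the question (whether your files actually carry a BOM is something you would need to check):
# If the file starts with a BOM (FF FE or FE FF), 'utf-16' will honour it:
with open(r'D:\workspace\resources\JP.res', encoding='utf-16') as f:
    for line in f:
        print(line.rstrip())

# With no BOM and a known big-endian byte order, force it explicitly:
with open(r'D:\workspace\resources\JP.res', encoding='utf_16_be') as f:
    data = f.read()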
I was following this tutorial.
https://developers.google.com/apps-script/articles/appengine
When I tried to follow Section 1-6
"Test this URL in your browser: http:// localhost:8080/rpc?action=Echo¶ms={"example":"blah"}&key=mySecretKey."
(I added a space between "http://" and "localhost" to avoid Stack Overflow's automatic error check.)
I could not follow because of this error.
<type 'exceptions.SyntaxError'>: 'ascii' codec can't decode byte 0xc2 in position 141: ordinal not in range(128) please see http://www.python.org/peps/pep-0263.html for details (backend.py)
args = ("'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)",)
filename = None
lineno = None
message = "'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)"
msg = "'ascii' codec can't decode byte 0xc2 in position...n.org/peps/pep-0263.html for details (backend.py)"
offset = None
print_file_and_line = None
text = None
Before this tutorial, I had followed a "Hello World" tutorial for Google App Engine, and it worked fine.
What should I do to remove the error?
P.S.
In the tutorial I found a typo: "Section 1: Using the Script Editor" should be "Section 1: Creating and deploying an App Engine service", I think.
The pilcrow sign ¶ (in "... action=Echo¶ms= ...") is represented in Latin-1 as the single byte B6, but in UTF-8 it is represented as the two bytes C2 B6.
Your browser or editor is probably (and quite reasonably) using UTF-8 as the encoding for the script. A workaround may be to change your encoding to Western or ascii, then paste the script again.
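A quick way to see the bytes involved (just an illustration, not part of the tutorial's code):
print(repr(u'\u00b6'.encode('utf-8')))   # two bytes: \xc2 \xb6
print(repr(u'\u00b6'.encode('latin-1'))) # one byte:  \xb6
# A literal pilcrow in a Python 2 source file without an encoding declaration
# triggers exactly this SyntaxError; PEP 263 describes the fix:
# # -*- coding: utf-8 -*-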
Unicode problems are very common in GAE python applications. This article from Nick Johnson will help you with your Python code:
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python