TStringList behavior with non-ANSI files

In my application, when I want to import a file, I use TStringList.
But when someone exports data from Excel, the file encoding is UCS-2 Little Endian, and TStringList can't read the data.
Is there any way to validate this situation, identify the text encoding and warn the user that the text provided is not compatible?
Just to be clear, the user will provide only plain text (letters and numbers); anything else and I must show the warning.
A Unicode file without a BOM is fine. (TStringList can read it!)
An ANSI file too. (TStringList can read it!)
Even Unicode with a BOM would be fine, if there is a way to remove it. (TStringList can read it, but the text starts with "ï", "»" and "¿" characters - the BOM bytes displayed as ANSI.)

I used the following function in Delphi 6 to detect Unicode BOMs.
const
  // standard byte order marks (BOMs)
  UTF8BOM: array [0..2] of AnsiChar = #$EF#$BB#$BF;
  UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
  UTF16BigEndianBOM: array [0..1] of AnsiChar = #$FE#$FF;
  UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
  UTF32BigEndianBOM: array [0..3] of AnsiChar = #$00#$00#$FE#$FF;

function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite); // allow other programs read access at the same time
  try
    FillChar(Buffer, SizeOf(Buffer), $AA); // fill with characters that we are not expecting, then...
    Stream.Read(Buffer, SizeOf(Buffer));   // ...read up to SizeOf(Buffer) bytes - there may not be enough
    // use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer
  finally
    FreeAndNil(Stream);
  end;
  Result := CompareMem(@UTF8BOM, @Buffer, SizeOf(UTF8BOM)) or
    CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
    CompareMem(@UTF16BigEndianBOM, @Buffer, SizeOf(UTF16BigEndianBOM)) or
    CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
    CompareMem(@UTF32BigEndianBOM, @Buffer, SizeOf(UTF32BigEndianBOM));
end;
This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
You state that Delphi 6's TStringList can load 16-bit encoded files if they do not have a BOM. Whilst that may be the case, you will find that, for characters in the ASCII range, every other character is #0 - which I guess is not what you want.
If you want to detect that text is Unicode for files without BOMs then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.
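For reference, if you do decide to detect and block, here is a minimal sketch of the same BOM check in Python (the function name is mine; note the 2-byte UTF-16 marks are prefixes of the 4-byte UTF-32 marks, which doesn't matter for a yes/no answer like this one):
# Standard byte order marks, longest first.
BOMS = (
    b'\x00\x00\xfe\xff',  # UTF-32 BE
    b'\xff\xfe\x00\x00',  # UTF-32 LE
    b'\xef\xbb\xbf',      # UTF-8
    b'\xff\xfe',          # UTF-16 LE
    b'\xfe\xff',          # UTF-16 BE
)

def file_has_unicode_bom(filename):
    with open(filename, 'rb') as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    return any(head.startswith(bom) for bom in BOMS)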

How to read a file in Crystal?

I recently picked up Crystal after being a Rubyist for a while, and I can't seem to find anything about the File class. I want to open and read a file, but it gives me an error.
file = File.open("ditto.txt")
file = file.read
tequila#tequila-pc:~/code$ crystal fileopen.cr
Error in fileopen.cr:2: wrong number of arguments for 'File#read' (given 0, expected 1)
Overloads are:
- IO::Buffered#read(slice : Bytes)
- IO#read(slice : Bytes)
file = file.read
^~~~
You're probably looking for IO#gets_to_end, which reads the entire file as a String. But you might as well use File.read:
file_content = File.read("ditto.txt")
IO#read is a more low-level method which allows you to read pieces of an IO into a byte slice.
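For what it's worth, the same high-level versus low-level split exists in Python, if the analogy helps (file name taken from the question):
# High-level: read the whole file as a string (the analogue of Crystal's File.read).
content = open("ditto.txt").read()

# Low-level: read bytes into a preallocated buffer
# (the analogue of Crystal's IO#read(slice : Bytes)).
buf = bytearray(1024)
with open("ditto.txt", "rb") as f:
    n = f.readinto(buf)  # returns the number of bytes actually read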

Contiguous Hex file generation using GCC

I have a hex file for an STM32F427 that was built using GCC (gcc-arm-none-eabi) version 4.6 and had contiguous memory addresses. I wrote a boot loader for loading that hex file, and also added a checksum capability to make sure the hex file is correct before starting the application.
Snippet of Hex file:
:1005C80018460AF02FFE07F5A64202F1D00207F5F9
:1005D8008E4303F1A803104640F6C821C2F2000179
:1005E8001A460BF053F907F5A64303F1D003184652
:1005F8000BF068F907F5A64303F1E80340F6FC1091
:10060800C2F2000019463BF087FF07F5A64303F145
:10061800E80318464FF47A710EF092FC07F5A643EA
:1006280003F1E80318460EF03DFC034607F5A64221
:1006380002F1E0021046194601F0F2FC07F56A5390
As you can see, all the addresses are sequential. Then we changed the compiler to version 4.8 and I got the same type of hex file.
But now we have moved to compiler version 6.2, and the generated hex file is not contiguous. It looks somewhat like this:
:10016000B9BC0C08B9BC0C08B9BC0C08B9BC0C086B
:10017000B9BC0C08B9BC0C08B9BC0C08B9BC0C085B
:08018000B9BC0C08B9BC0C0865
:1001900081F0004102E000BF83F0004330B54FEA38
:1001A00041044FEA430594EA050F08BF90EA020FA5
As you can see, the record at 0x0180 holds only 8 bytes (ending at 0x0187) and the next record starts at 0x0190, which means the 8 bytes in between (0x0188 to 0x018F) are 0xFF, as they are not flashed.
Now, the boot loader is kind of dumb: we just pass it the starting address and the number of bytes over which to calculate the checksum.
Is there a way to make the hex file contiguous, as with compilers 4.6 and 4.8? The code is the same in all three cases.
If post-processing the hex file is an option, you can consider using the IntelHex Python library. This lets you manipulate the hex file as data (i.e. ignoring the 'markup': record type, address, checksum, etc.) rather than as lines, and will for instance create output with the correct line checksums.
A fast way to get this up and running could be to use the bundled convenience scripts hex2bin.py and bin2hex.py:
python hex2bin.py --pad=FF noncontiguous.hex tmp.bin
python bin2hex.py tmp.bin contiguous.hex
The first line converts the input file noncontiguous.hex to a binary file, padding it with FF where there is no data. The second line converts the binary file back to a hex file.
The result would be
:08018000B9BC0C08B9BC0C0865
becomes
:10018000B9BC0C08B9BC0C08FFFFFFFFFFFFFFFF65
As you can see, padding bytes are added where the input doesn't have any data, equivalent to writing the input file to the device and reading it back out. Bytes that are in the input file are kept the same - and at the same address.
The checksum is also still correct: each added 0xFF byte contributes -1 (mod 256) to the record sum, so the eight of them subtract 8, which is exactly offset by the length byte growing from 0x08 to 0x10 (+8) - hence the final checksum byte stays 0x65. If you padded with something else, IntelHex would output the correct checksum too.
You can skip the creation of a temporary file by piping these: omit tmp.bin in the first line and replace it with - in the second line:
python hex2bin.py --pad=FF noncontiguous.hex | python bin2hex.py - contiguous.hex
An alternative way could be to have a base file with all FF and use the hexmerge.py convenience script to merge gcc's output onto it with --overlap=replace
The longer, more flexible way, would be to implement your own tool using the IntelHex API. I've used this to good effect in situations similar to yours - tweak hex files to satisfy tools that are costly to change, but only handle hex files the way they were when the tool was written.
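As a rough sketch of that approach (file names are placeholders; IntelHex's padding attribute supplies the value returned for addresses that hold no data, and this assumes the used address range is one reasonably sized block):
from intelhex import IntelHex

ih = IntelHex('noncontiguous.hex')  # parse gcc's output
ih.padding = 0xFF                   # value reported for unset addresses

# Copy every address in the used range into a new image, so the
# gaps become real 0xFF bytes instead of missing data.
filled = IntelHex()
for addr in range(ih.minaddr(), ih.maxaddr() + 1):
    filled[addr] = ih[addr]

filled.write_hex_file('contiguous.hex')  # records come out with correct checksums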
One of many possible ways:
Make your hex file with v6.2, e.g., foo.hex.
Postprocess it with this Perl oneliner:
perl -pe 'if (m/^:(..)(....)00(.*)(..)\r?$/ && hex($1) < 16) { my $data = $3 . ("FF" x (16 - hex($1))); my $sum = 0x10; $sum += hex($_) for unpack "(A2)*", $2 . "00" . $data; $_ = sprintf ":10%s00%s%02X\n", $2, $data, (-$sum) & 0xFF; }' foo.hex > foo2.hex
Now foo2.hex will have 16-byte data records.
Note: all this does is FF-pad the data records to 0x10 bytes and recompute each record's checksum. It leaves other record types (EOF, extended address) alone, and it doesn't check that the addresses are actually contiguous.
Explanation
perl -pe '<some script>' <input file> runs <some script> for each line of <input file>, and prints the result. The script is:
if (m/^:(..)(....)00(.*)(..)\r?$/ # a data record: length, address, type 00, data, old checksum
    && hex($1) < 16) {            # only touch records shorter than 16 bytes
  my $data = $3 . ("FF" x (16 - hex($1)));                # pad the data field with 0xFF bytes
  my $sum = 0x10;                                         # start the sum with the new length byte
  $sum += hex($_) for unpack "(A2)*", $2 . "00" . $data;  # add the address, type and data bytes
  $_ = sprintf ":10%s00%s%02X\n", $2, $data, (-$sum) & 0xFF; # rebuild the line with a fresh checksum
}
Another solution is to change the linker script to ensure the preceding .isr_vector section ends on a 16-byte boundary, since the map file reveals that the following .text section is 16-byte aligned - for example, by putting ". = ALIGN(16);" as the last statement inside the .isr_vector output section of the .ld file.
This will ensure there are no unprogrammed flash bytes between the two sections.
You can use bincopy to fill all empty space with 0xff.
$ pip install bincopy
$ bincopy fill foo.hex
Use the --gap-fill option of objcopy, which fills the gaps between output sections with the given byte value, e.g.:
arm-none-eabi-objcopy --gap-fill 0xFF -O ihex firmware.elf firmware.hex

Escape characters (0x1B / decimal 27) in binary packets don't get sent through Wi-Fi and corrupt messages during transmission

I am developing on an embedded system (STM32F4) and I tried to send some data to a simple Windows Forms client program on the PC side. When I used a character-based string format everything worked fine, but when I changed to a binary packet format to increase performance, I ran into a problem with escape characters.
I'm using nanopb to implement Google's Protocol Buffers for transmission, and I observed that for 5% of packets my client program raises exceptions telling me that the packets are corrupted.
I debugged in Wireshark and saw that in these corrupted packets the size was 2-4 bytes smaller than the original packet size. Upon further inspection I found out that the corrupted packets always included the byte value 27 and other packets never included this value. I searched for it and saw that this value represents an escape character and that this might lead to problems.
The technical documentation of the Wi-Fi module I'm using (Gainspan GSM2100) mentions that commands are preceded by an escape character, so I think I need to get rid of these values in my packets.
I couldn't find a solution to my problem, so I would appreciate it if somebody more experienced could lead me to the right approach to solving this problem.
How are you sending the data? Are you using a library or sending raw bytes? According to the manual, your data commands should start with an escape sequence, but also have the data length specified:
// Each escape sequence starts with the ASCII character 27 (0x1B),
// the equivalent to the ESC key. The contents of < > are a byte or byte stream.
// - Cid is connection id (udp, tcp, etc)
// - Data Length is 4 ASCII char represents decimal value
// i.e. 1400 bytes would be '1' '4' '0' '0' (0x31 0x34 0x30 0x30).
// - Data size must match with specified length.
// Ignore all command or esc sequence in between data pay load.
<Esc>Z<Cid><Data Length xxxx 4 ascii char><data>
Note the remark regarding data size: "Ignore all command or esc sequence in between data pay load".
For example, this is what the GSCore::writeData function in GSCore.cpp looks like:
// Including a trailing 0 that snprintf insists to write
uint8_t header[8];
// Prepare header: <esc> Z <cid> <ascii length>
snprintf((char*)header, sizeof(header), "\x1bZ%x%04d", cid, len);
// First, write the escape sequence up to the cid. After this, the
// module responds with <ESC>O or <ESC>F.
writeRaw(header, 3);
if (!readDataResponse()) {
if (GS_LOG_ERRORS && this->error)
this->error->println("Sending bulk data frame failed");
return false;
}
// Then, write the rest of the escape sequence (-1 to not write the
// trailing 0)
writeRaw(header + 3, sizeof(header) - 1 - 3);
// And write the actual data
writeRaw(buf, len);
This should most likely work. Alternatively, a dirty hack might be to "escape the escape character" before sending, i.e. replace each 0x1B byte (decimal 27) with two bytes (0x1B 0x1B) - but this is just a wild guess, so do check the manual.
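To make the framing concrete, here is a hedged Python sketch of building such a bulk-data frame (the helper name is hypothetical; it just mirrors the "\x1bZ%x%04d" snprintf format above, and assumes cid is a single hex digit as in the manual excerpt):
def build_bulk_frame(cid, payload):
    # <Esc> 'Z' <cid as one hex digit> <payload length as 4 ASCII decimal digits> <data>
    header = (b'\x1bZ' + format(cid, 'x').encode('ascii')
              + format(len(payload), '04d').encode('ascii'))
    return header + payload

# A 0x1B byte inside the payload is fine: per the manual, the module
# counts the declared length instead of scanning for escape sequences.
frame = build_bulk_frame(0, b'\x01\x1b\x02\x03\x04')
# -> b'\x1bZ00005\x01\x1b\x02\x03\x04'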

Python 3: reading UCS-2 (BE) file

I can't seem to decode UCS-2 BE files (legacy stuff) under Python 3.3 using the built-in open() function (the stack trace shows a UnicodeDecodeError and contains my readLine() method) - in fact, I wasn't able to find a flag for specifying this encoding.
Using Windows 8, terminal is set to codepage 65001, using 'Lucida Console' fonts.
Code snippet won't be of too much help, I guess:
def display_resource():
f = open(r'D:\workspace\resources\JP.res', encoding=<??tried_several??>)
while True:
line = f.readline()
if len(line) == 0:
break
Appreciating any insight into this issue.
UCS-2 is UTF-16, really, for any codepoint that was assigned when it was still called UCS-2 in any case.
Open it with encoding='utf16'. If there is no BOM (the Byte order mark, 2 bytes at the start, for BE that'd be \xfe\xff), then use encoding='utf_16_be' to force a byte order.
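Applied to the snippet from the question (assuming the file really has no BOM), that would be something like:
def display_resource():
    # utf_16_be decodes big-endian UTF-16/UCS-2 without requiring a BOM
    with open(r'D:\workspace\resources\JP.res', encoding='utf_16_be') as f:
        for line in f:
            print(line, end='')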

UnicodeEncodeError Google App Engine

I am getting the very familiar:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 24: ordinal not in range(128)
I have checked out multiple posts on SO and they recommend variable.encode('ascii', 'ignore');
however, this is not working. Even after this I am getting the same error ...
The stack trace:
'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 513, in __call__
handler.post(*groups)
File "/base/data/home/apps/autominer1/1.343038273644030157/siteinfo.py", line 2160, in post
imageAltTags.append(str(image["alt"]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
The code responsible for the same:
siteUrl = urlfetch.fetch("http://www."+domainName, headers = { 'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5' } )
webPage = siteUrl.content.decode('utf-8', 'replace').encode('ascii', 'replace')
htmlDom = BeautifulSoup(webPage)
imageTags = htmlDom.findAll('img', { 'alt' : True } )
for image in imageTags :
if len(image["alt"]) > 3 :
imageAltTags.append(str(image["alt"]))
Any help would be greatly appreciated. Thanks.
There are two different things that Python treats as strings - 'raw' strings and 'unicode' strings. Only the latter actually represent text. If you have a raw string, and you want to treat it as text, you first need to convert it to a unicode string. To do this, you need to know the encoding for the string - the way unicode codepoints are represented as bytes in the raw string - and call .decode(encoding) on the raw string.
When you call str() on a unicode string, the opposite transformation takes place - Python encodes the unicode string as bytes. If you don't specify a character set, it defaults to ascii, which is only capable of representing the first 128 codepoints.
Instead, you should do one of two things (see the sketch below):
- Represent imageAltTags as a list of unicode strings, and thus dump the str() call - this is probably the best approach.
- Instead of str(x), call x.encode(encoding). The encoding to use will depend on what you're doing, but the most likely choice is utf-8 - e.g., x.encode('utf-8').
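A minimal sketch of the first option against the loop above (Python 2, as on App Engine; only the str() call changes):
imageAltTags = []
for image in imageTags:
    alt = image["alt"]  # BeautifulSoup returns a unicode string here
    if len(alt) > 3:
        imageAltTags.append(alt)  # keep it as unicode (no str() call)
        # or, for the second option: imageAltTags.append(alt.encode('utf-8'))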
