I'm wondering if there is a canonical way to read Unicode files in Rust. Most examples read a line at a time, because if the file is well-formed UTF-8, a line should consist of whole/complete 'characters' (Unicode Scalar Values).
Here's a simple example of reading a file as UTF-8, but it only works if 'one byte' == 'one character', which isn't guaranteed.
use std::fs::File;
use std::io::{BufReader, Read};

let f = File::open(filename).expect("File not found");
let mut rdr = BufReader::new(f);
loop {
    let mut x: [u8; 1] = [0];
    let n = rdr.read(&mut x).expect("read failed");
    if n == 0 { break; }
    let width = utf8_char_width(x[0]); // unstable feature
    let chr = x[0] as char; // only correct if the byte is ASCII
    // ...
}
I'm new to Rust, but the only thing I could find that would help me read a full character was the utf8_char_width, which is marked unstable.
Does Rust have a facility such that I can open a file as (Unicode) 'text' and it will read/respect the BOM (if available) and allow me to iterate over the contents of that file returning a Rust char type for each 'character' (Unicode Scalar Value) found?
Am I making something easy hard? Again, I'm new to Rust so everything is hard to me currently :-D
Update (in response to comments)
A "Unicode file" is a file containing only Unicode-encoded data. I'd like to be able to read Unicode-encoded files without worrying about the various details of character size or endianness. Since Rust uses a four-byte (u32) char, I'd like to be able to read the file one character at a time, not worrying about line length (or its allocation).
While UTF-8 is byte oriented, the Unicode standard does define a BOM for it, as well as saying that the default (no BOM) is UTF-8.
It is somewhat counter-intuitive (I'm new to Rust) that the char type is UTF-32 while a string is (effectively) a vector of u8. However, I can see the reasoning behind forcing the developer to be explicit regarding 'byte' or 'char', as I've seen a lot of bugs caused by people assuming that those are the same size. Clearly, there is an iterator to return chars from a string, so the code to handle the UTF-8 -> UTF-32 conversion is in place; it just needs to take its input from a file stream rather than a memory vector. Perhaps as I learn more a solution will present itself.
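Until a streaming decoder is available on stable, one workable sketch is to read the whole file into memory, strip a UTF-8 BOM if present, and iterate over chars(); the file name and BOM handling here are illustrative assumptions, not a canonical API:

```rust
use std::fs::File;
use std::io::{Read, Write};

fn read_chars(path: &str) -> std::io::Result<Vec<char>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    // Strip a UTF-8 BOM (EF BB BF) if present.
    let body = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        &bytes[3..]
    } else {
        &bytes[..]
    };
    // Validate as UTF-8, then iterate over Unicode Scalar Values.
    let text = String::from_utf8(body.to_vec())
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
    Ok(text.chars().collect())
}

fn main() -> std::io::Result<()> {
    let path = "demo_utf8.txt"; // hypothetical test file
    File::create(path)?.write_all("héllo".as_bytes())?; // 6 bytes, 5 chars
    let chars = read_chars(path)?;
    assert_eq!(chars, vec!['h', 'é', 'l', 'l', 'o']);
    Ok(())
}
```

For very large files you would want a buffered, incremental decoder instead of reading everything at once, but the decode-then-iterate shape stays the same.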
I want to know the size of a String/Blob in Apex.
All I found is the size() method, which returns the number of characters in a String/Blob.
What is the size of a single character in Salesforce?
Or is there any way to know the size in bytes directly?
I think the only real answer here is "it depends". Why do you need to know this?
The methods on String like charAt and codePointAt suggest that UTF-16 might be used internally; in that case, each character would be represented by 2 or 4 bytes, but this is hardly "proof".
Apex seems to be translated to Java and running on some form of JVM and Strings in Java are represented internally as UTF-16 so again that could indicate that characters are 2 or 4 bytes in Apex.
Any time Strings are sent over the wire (e.g. as responses from a @RestResource annotated class), UTF-8 seems to be used as the character encoding, which would mean 1 to 4 bytes per character are used, depending on what character it is. (See section 2.5 of the Unicode standard.)
But you should really ask yourself why you think your code needs to know this because it most likely doesn't matter.
You can estimate a string's size by doing the following:
String testString = 'test string';
Blob testBlob = Blob.valueOf(testString);
// convertToHex turns the blob into a hexadecimal representation:
// each group of four bits becomes one hex character
String hexString = EncodingUtil.convertToHex(testBlob);
// One byte is eight bits, i.e. two hex characters,
// so the number of bytes is
Integer numOfBytes = hexString.length() / 2;
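The bytes-vs-characters distinction that the hex trick measures can be sketched in Rust, where len() counts UTF-8 bytes while chars().count() counts scalar values (the sample string is an assumption for illustration):

```rust
fn main() {
    let test_string = "héllo"; // 'é' takes 2 bytes in UTF-8
    let num_chars = test_string.chars().count();
    let num_bytes = test_string.len(); // len() counts bytes, not characters
    assert_eq!(num_chars, 5);
    assert_eq!(num_bytes, 6);
    // Hex encoding doubles the length: two hex digits per byte,
    // so hex length divided by 2 recovers the byte count.
    let hex: String = test_string.bytes().map(|b| format!("{:02x}", b)).collect();
    assert_eq!(hex.len() / 2, num_bytes);
}
```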
Another option to estimate the size would be to get the heap size before and after assigning value to String variable:
String testString;
System.debug(Limits.getHeapSize());
testString = 'testString';
System.debug(Limits.getHeapSize());
The difference between two printed numbers would be the size a string takes on the heap.
Please note that the values obtained from those methods will be different. We don't know what type of encoding is used for storing strings in the Salesforce heap or when converting a string to a blob.
I've migrated my project from XE5 to 10 Seattle. I'm still using ANSI codes to communicate with devices. With my new build, the Seattle IDE is sending the † character instead of the space character (#32 in ANSI) in a Char array. I need to send space characters to a text file, but I can't.
I tried #32 (as I used before), #032 and #127, but it doesn't work. Any idea?
Here is how I use:
FillChar(X, 50, #32);
Method signature: procedure FillChar(var X; Count: Integer; Value: Ordinal);
Despite its name, FillChar() fills bytes, not characters.
Char is an alias for WideChar (2 bytes) in Delphi 2009+, in prior versions it is an alias for AnsiChar (1 byte) instead.
So, if you have a 50-element array of WideChar elements, the array is 100 bytes in size. When you call fillChar(X,50,#32), it fills in the first 50 bytes with a value of $20 each. Thus the first 25 WideChar elements will have a value of $2020 (aka Unicode codepoint U+2020 DAGGER, †) and the second 25 elements will not have any meaningful value.
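The effect is easy to reproduce outside Delphi; this hedged Rust sketch fills the first 50 bytes of a 100-byte buffer (50 two-byte elements) with 0x20 and shows the resulting 0x2020 code units:

```rust
fn main() {
    // FillChar(X, 50, #32) writes 50 bytes of 0x20 into a buffer of
    // 50 two-byte characters (100 bytes total).
    let mut raw = [0u8; 100];
    for b in raw.iter_mut().take(50) {
        *b = 0x20;
    }
    // Reinterpret pairs of bytes as little-endian UTF-16 code units.
    let units: Vec<u16> = raw
        .chunks(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    assert_eq!(units[0], 0x2020); // U+2020 DAGGER
    assert_eq!(char::from_u32(units[0] as u32), Some('†'));
    assert_eq!(units[25], 0); // second half of the buffer untouched
}
```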
This issue is explained in the FillChar() documentation:
Fills contiguous bytes with a specified value.
In Delphi, FillChar fills Count contiguous bytes (referenced by X) with the value specified by Value (Value can be of type Byte or AnsiChar)
Note that if X is a UnicodeString, this may not work as expected, because FillChar expects a byte count, which is not the same as the character count.
In addition, the filling character is a single-byte character. Therefore, when Buf is a UnicodeString, the code FillChar(Buf, Length(Buf), #9); fills Buf with the code point $0909, not $09. In such cases, you should use the StringOfChar routine.
This is also explained in Embarcadero's Unicode Migration Resources white papers, for instance on page 28 of Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines by Cary Jensen:
Actually, the complexity of this type of code is not related to pointers and buffers per se. The problem is due to Chars being used as pointers. So, now that the size of Strings and Chars in bytes has changed, one of the fundamental assumptions that much of this code embraces is no longer valid: That individual Chars are one byte in length.
Since this type of code is so problematic for Unicode conversion (and maintenance in general), and will require detailed examination, a good argument can be made for refactoring this code where possible. In short, remove the Char types from these operations, and switch to another, more appropriate data type. For example, Olaf Monien wrote, "I wouldn't recommend using byte oriented operations on Char (or String) types. If you need a byte-buffer, then use ‘Byte’ as [the] data type: buffer: array[0..255] of Byte;."
For example, in the past you might have done something like this:
var
Buffer: array[0..255] of AnsiChar;
begin
FillChar(Buffer, Length(Buffer), 0);
If you merely want to convert to Unicode, you might make the following changes:
var
Buffer: array[0..255] of Char;
begin
FillChar(Buffer, Length(Buffer) * SizeOf(Char), 0);
On the other hand, a good argument could be made for dropping the use of an array of Char as your buffer, and switch to an array of Byte, as Olaf suggests. This may look like this (which is similar to the first segment, but not identical to the second, due to the size of the buffer):
var
Buffer: array[0..255] of Byte;
begin
FillChar(Buffer, Length(Buffer), 0);
Better yet, use this form of the second argument to FillChar, which works regardless of the element type of the array:
var
Buffer: array[0..255] of Byte;
begin
FillChar(Buffer, Length(Buffer) * SizeOf(Buffer[0]), 0);
The advantage of these last two examples is that you have what you really wanted in the first place, a buffer that can hold byte-sized values. (And Delphi will not try to apply any form of implicit string conversion since it's working with bytes and not code units.) And, if you want to do pointer math, you can use PByte. PByte is a pointer to a Byte.
The one place where changes like these may not be possible is when you are interfacing with an external library that expects a pointer to a character or character array. In those cases, they really are asking for a buffer of characters, and these are normally AnsiChar types.
So, to address your issue, since you are interacting with an external device that expects Ansi data, you need to declare your array as using AnsiChar or Byte elements instead of (Wide)Char elements. Then your original FillChar() call will work correctly again.
If you want to use ANSI for communication with devices, you would define the array as
x: array[1..50] of AnsiChar;
In this case to fill it with space characters you use
FillChar(x, 50, #32);
Using an array of AnsiChar as communication buffer may become troublesome in a Unicode environment, so therefore I would suggest to use a byte array as communication buffer
x: array[1..50] of byte;
and initialize it with
FillChar(x, 50, 32);
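The byte-buffer fill can be sketched in Rust for illustration (a 50-byte buffer filled with 0x20, i.e. the ASCII space character):

```rust
fn main() {
    let mut buf = [0u8; 50];
    // Analogous to FillChar(x, 50, 32) on an array of bytes:
    buf.fill(0x20);
    assert!(buf.iter().all(|&b| b == b' '));
    // The buffer is valid ASCII, so it is also valid UTF-8 text.
    let text = std::str::from_utf8(&buf).unwrap();
    assert_eq!(text.trim(), "");
}
```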
I use the sizeof() operator to get the memory size of a char variable, i.e. I write code like this in Xcode:
char a = '1';
size_t size = sizeof(a);
I find the size is 1. So what is this character's encoding? UTF-8 or something else?
You'd have to know this before using the character, just as you are expected to know the encoding of a string before you can interpret it correctly.
You cannot determine the encoding without prior knowledge.
The sizeof operator is not enough to determine what the encoding of a character is.
In general, you must be told what encoding is being used before being able to interpret characters.
As stated in a comment, look here for some information on C string encoding: What is the default encoding for C strings?
For some character literals the C Standard defines fixed values (the digits, if I remember correctly); for others (letters) it leaves the character set/encoding to the implementation.
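A small Rust sketch makes the "size does not determine encoding" point concrete: the single byte 0xE9 is 'é' in Latin-1, but is not valid UTF-8 on its own:

```rust
fn main() {
    // The byte 0xE9 is 'é' in Latin-1, but as UTF-8 it is an
    // incomplete multi-byte sequence and fails validation.
    let byte = [0xE9u8];
    assert!(std::str::from_utf8(&byte).is_err());
    // In UTF-8 the same character needs two bytes:
    assert_eq!("é".as_bytes(), &[0xC3, 0xA9]);
    // So knowing a character occupies 1 byte tells you nothing
    // about which encoding is in use.
}
```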
I have a function which returns a dynamic array of byte
type
TMyEncrypt = Array of Byte;
TMyDecrypt = Array of Byte;
function decrypt(original: TMyEncrypt) : TMyDecrypt;
The content of the returned dynamic array TMyDecrypt is a standard text with CRLF.
How can I load this into a TStringList, with CRLF as separator, without saving it to a temporary file first?
EDIT: the returned array of bytes contains Unicode-encoded characters.
Decode the byte array to a string, and then assign to the Text property of the string list.
var
Bytes: TBytes;
StringList: TStringList;
....
StringList.Text := TEncoding.Unicode.GetString(Bytes);
Note the use of TBytes which is the standard type used to hold dynamic arrays of bytes. For compatibility reasons it makes sense to use TBytes. That way your data can be processed by other RTL and library code. A fact we immediately take advantage of by using TEncoding.
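For illustration, the decode-then-split step can be sketched in Rust, assuming the bytes are UTF-16LE (the layout TEncoding.Unicode expects); the sample data is an assumption:

```rust
fn main() {
    // "ab\r\ncd" encoded as UTF-16LE bytes, standing in for the TBytes input.
    let bytes: Vec<u8> = "ab\r\ncd"
        .encode_utf16()
        .flat_map(|u| u.to_le_bytes())
        .collect();
    // Decode the byte array back to a string...
    let units: Vec<u16> = bytes
        .chunks(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    let text = String::from_utf16(&units).unwrap();
    // ...then split on CRLF, like assigning to TStringList.Text.
    let lines: Vec<&str> = text.split("\r\n").collect();
    assert_eq!(lines, vec!["ab", "cd"]);
}
```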
You could use SetString, as my answer originally suggested:
var
Text: string;
Bytes: TBytes;
StringList: TStringList;
....
SetString(Text, PChar(Bytes), Length(Bytes) div SizeOf(Char));
StringList.Text := Text;
Personally I prefer to use TEncoding because it is very explicit about the encoding being used.
If your text was null terminated then you could use:
StringList.Text := PChar(Bytes);
Again, I'd prefer to be explicit about the encoding. And I might be a little paranoid about my data somehow not being null terminated.
You might find that UTF-8 is a more efficient representation than UTF-16.
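For mostly-ASCII text the difference is roughly a factor of two; a quick Rust check (the sample string is an assumption):

```rust
fn main() {
    let text = "Hello, world";
    let utf8_len = text.len(); // bytes in the UTF-8 encoding
    let utf16_len = text.encode_utf16().count() * 2; // bytes in UTF-16
    assert_eq!(utf8_len, 12);
    assert_eq!(utf16_len, 24); // twice the size for ASCII-only text
}
```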