Get size of Blob / String in bytes in apex ? - salesforce

I wants to know what the size of string/blob in apex.
What i found is just size() method, which return the number of characters in string/blob.
What the size of single character in Salesforce ?
Or there is any way to know the size in bytes directly ?

I think the only real answer here is "it depends". Why do you need to know this?
The methods on String like charAt and codePointAt suggest that UTF-16 might be used internally; in that case, each character would be represented by 2 or 4 bytes, but this is hardly "proof".
Apex seems to be translated to Java and running on some form of JVM and Strings in Java are represented internally as UTF-16 so again that could indicate that characters are 2 or 4 bytes in Apex.
Any time Strings are sent over the wire (e.g. as responses from a #RestResource annotated class), UTF-8 seems to be used as a character encoding, which would mean 1 to 4 bytes per character are used, depending on what character it is. (See section 2.5 of the Unicode standard.)
But you should really ask yourself why you think your code needs to know this because it most likely doesn't matter.

You can estimate string size doing the following:
String testString = 'test string';
Blob testBlob = Blob.valueOf(testString);
// below converts blob to hexadecimal representation; four ones and zeros
// from blob will get converted to single hexadecimal character
String hexString = EncodingUtil.convertToHex(testBlob);
// One byte consists of eight ones and zeros so, one byte consists of two
// hex characters, hence the number of bytes would be
Integer numOfBytes = hexString.length() / 2;
Another option to estimate the size would be to get the heap size before and after assigning value to String variable:
String testString;
System.debug(Limits.getHeapSize());
testString = 'testString';
System.debug(Limits.getHeapSize());
The difference between two printed numbers would be the size a string takes on the heap.
Please note that the values obtained from those methods will be different. We don't know what type of encoding is used for storing string in Salesforce heap or when converting string to blob.

Related

GNU memmem vs C strstr

Is there any use case that can be solved by memmem but not by strstr?
I was thinking of able to parse a string raw bytes (needle) inside a bigger string of raw bytes(haystack). Like trying to find a particular raw byte pattern inside a blob of raw bytes read from C's read function.
There is the case you mention: raw binary data.
This is because raw binary data may contain zeroes, which are interpreted and string terminator by strstr, making it ignore the rest of the haystack or needle.
Additionally, if the raw binary data contains no zero bytes, and you don't have a valid (inside the same array or buffer allocation) extra zero after the binary data, then strstr will happily go beyond the data and cause Undefined Behavior via buffer overflow.
Or, to the point: strstr can't be used if the data is not strings. memmem doesn't have this limitation.
In addition to searching in non-string data, memmem() can be used to look for substrings in just a portion of a longer string, something strstr() can't do:
char somestr[] = "a long string with the word apple";
// Look in just the 5th through 15 characters
// (Haystack must have at least 15 characters or else)
char *loc = memmem(somestr + 4, 10, "pp", 2);
and if you already know the lengths of the strings, it might be faster than strstr() when used on the entire haystack string, but that depends a lot on the implementation and should be benchmarked.

How to determine the size of a string in C, or at least ensuring that it doesn't exceed a maximum number of bytes?

Is it possible to determine the size in bytes of a string in C?
I'm trying to ensure that JSON strings built in C do not exceed a 1 MB size limit before passing them to the requesting application. I don't know the strings at compile time.
I've read that it is just strlen * sizeof( char ); but I don't understand that, because I read elsewhere that UTF-8 can have characters of size up to four bytes and sizeof( char ) is always one.
I am likely misunderstanding something basic.
If a character array is allocated as char JSON[1048576], does this allocate that many characters or bytes? If it is bytes, then as long as something like snprintf is used when writing to JSON array, would this guarantee that it can never exceed 1 MB in size, even if there were character in that array that exceed one byte?
Thank you.
Since you are after a size limit 1MB and not a string length limit per se, you can just use strlen(json_str). Provided that your json string is null terminated, '\0'.
If you allocate char JSON[1048576] that will give you an array with that many bytes. And snprintf(JSON, 1048576, "<json string>", ...) will guarantee that you never overfill your array.
It does not guarantee however that your string is a valid utf-8 string since the last character may be a multi byte character that is split in the middle.
A C char is not the same as a utf-8 character. In C char is by definition 1 Byte but in utf-8 the visual character that you want, like the heart in your comment, may be represented by several bytes of data.
One byte gives you 256 different values and since there are way more than 256 Unicode "characters" more than one byte is needed to encode many of them. The designers of utf-8 was clever though so the first 127 characters can be encoded using just one byte and if only those characters are used it will both valid utf-8 and ascii.

Can I store UTF8 in C-style char array

Follow up
Can UTF-8 contain zero byte?
Can I safely store UTF8 string in zero terminated char * ?
I understand strlen() will not return correct information, put "storing", printing and "transferring" the char array, seems to be safe.
Yes.
Just like with ASCII and similiar 8-bit encodings before Unicode, you can't store the NUL character in such a string (the value \u+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
In C a 0 byte is the string terminator. As long as the Unicode point 0, U+0000 is not in the Unicode string there is no problem.
To be able to store 0 bytes in Unicode, one may use modified UTF-8 that convert not only code points >= 128, but also 0 to a multi-byte sequence (every byte thereof having its high bit set, >= 128). This is done in java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
It formally is no longer UTF-8 as UTF-8 requires the shortest encoding. Also this is only possible when determining the length i.o. strlen when unpacking to non-UTF-8.
So the most feasible solution is not to accept U+0000 in strings.

Sending † character instead of Space character in Char array

I've migrated my project from XE5 to 10 Seattle. I'm still using ANSII codes to communicate with devices. With my new build, Seattle IDE is sending † character instead of space char (which is #32 in Ansii code) in Char array. I need to send space character data to text file but I can't.
I tried #32 (like before I used), #032 and #127 but it doesn't work. Any idea?
Here is how I use:
fillChar(X,50,#32);
Method signature (var X; count:Integer; Value:Ordinal)
Despite its name, FillChar() fills bytes, not characters.
Char is an alias for WideChar (2 bytes) in Delphi 2009+, in prior versions it is an alias for AnsiChar (1 byte) instead.
So, if you have a 50-element array of WideChar elements, the array is 100 bytes in size. When you call fillChar(X,50,#32), it fills in the first 50 bytes with a value of $20 each. Thus the first 25 WideChar elements will have a value of $2020 (aka Unicode codepoint U+2020 DAGGER, †) and the second 25 elements will not have any meaningful value.
This issue is explained in the FillChar() documentation:
Fills contiguous bytes with a specified value.
In Delphi, FillChar fills Count contiguous bytes (referenced by X) with the value specified by Value (Value can be of type Byte or AnsiChar)
Note that if X is a UnicodeString, this may not work as expected, because FillChar expects a byte count, which is not the same as the character count.
In addition, the filling character is a single-byte character. Therefore, when Buf is a UnicodeString, the code FillChar(Buf, Length(Buf), #9); fills Buf with the code point $0909, not $09. In such cases, you should use the StringOfChar routine.
This is also explained in Embarcadero's Unicode Migration Resources white papers, for instance on page 28 of Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines by Cary Jensen:
Actually, the complexity of this type of code is not related to pointers and buffers per se. The problem is due to Chars being used as pointers. So, now that the size of Strings and Chars in bytes has changed, one of the fundamental assumptions that much of this code embraces is no longer valid: That individual Chars are one byte in length.
Since this type of code is so problematic for Unicode conversion (and maintenance in general), and will require detailed examination, a good argument can be made for refactoring this code where possible. In short, remove the Char types from these operations, and switch to another, more appropriate data type. For example, Olaf Monien wrote, "I wouldn't recommend using byte oriented operations on Char (or String) types. If you need a byte-buffer, then use ‘Byte’ as [the] data type: buffer: array[0..255] of Byte;."
For example, in the past you might have done something like this:
var
Buffer: array[0..255] of AnsiChar;
begin
FillChar(Buffer, Length(Buffer), 0);
If you merely want to convert to Unicode, you might make the following changes:
var
Buffer: array[0..255] of Char;
begin
FillChar(Buffer, Length(buffer) * SizeOf(Char), 0);
On the other hand, a good argument could be made for dropping the use of an array of Char as your buffer, and switch to an array of Byte, as Olaf suggests. This may look like this (which is similar to the first segment, but not identical to the second, due to the size of the buffer):
var
Buffer: array[0..255] of Byte;
begin
FillChar(Buffer, Length(buffer), 0);
Better yet, use this second argument to FillChar which works regardless of the data type of the array:
var
Buffer: array[0..255] of Byte;
begin
FillChar(Buffer, Length(buffer) * SizeOf(Buffer[0]), 0);
The advantage of these last two examples is that you have what you really wanted in the first place, a buffer that can hold byte-sized values. (And Delphi will not try to apply any form of implicit string conversion since it's working with bytes and not code units.) And, if you want to do pointer math, you can use PByte. PByte is a pointer to a Byte.
The one place where changes like may not be possible is when you are interfacing with an external library that expects a pointer to a character or character array. In those cases, they really are asking for a buffer of characters, and these are normally AnsiChar types.
So, to address your issue, since you are interacting with an external device that expects Ansi data, you need to declare your array as using AnsiChar or Byte elements instead of (Wide)Char elements. Then your original FillChar() call will work correctly again.
If you want to use ANSI for communication with devices, you would define the array as
x: array[1..50] of AnsiChar;
In this case to fill it with space characters you use
FillChar(x, 50, #32);
Using an array of AnsiChar as communication buffer may become troublesome in a Unicode environment, so therefore I would suggest to use a byte array as communication buffer
x: array[1..50] of byte;
and intialize it with
FillChar(x, 50, 32);

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in unicode character by character, and do comparisons with those. That's won't work if the input stream is UTF8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated ASCII string (char*), which does not support Unicode. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.

Resources