Converting C strings to Pascal strings - c

When converting a C string into a Pascal string, why should the length of the original string be less than or equal to 127 instead of 255? I understand that an unsigned byte ranges from 0 to 255 and a signed one from -128 to 127, but isn't the first byte of a Pascal string unsigned?

The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). It can contain up to 255 single-byte characters (char in C), so there is no need to restrict it to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only hold 127 UTF-16 code units. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet; they were made to contain ASCII or "extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is no longer stored as the first element, but in a header close to the text data. And these days, most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is no single "Pascal string type" anymore).
Some Pascal dialects do have the ability to restrict the size of a ShortString, as so-called "counted" strings:
var
  s: string[18];
That string holds a maximum of 18 bytes of text data plus 1 byte of length data (at index 0). Such shorter strings can be used in, say, records, so the records don't grow too big.
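The counted-string layout described above is easy to mirror from C. Here is a minimal sketch (the function name is hypothetical) that builds such a string from a C string:

```c
#include <string.h>

/* Sketch (hypothetical helper): build a Pascal-style counted string
   from a C string. Byte 0 holds the length, the text follows with no
   terminating NUL. A single length byte caps the text at 255 bytes. */
unsigned char c_to_pascal(const char *src, unsigned char dst[256])
{
    size_t len = strlen(src);
    if (len > 255)
        len = 255;                 /* truncate: the prefix is one byte */
    dst[0] = (unsigned char)len;
    memcpy(dst + 1, src, len);
    return dst[0];
}
```

Note that the 255 cap falls out of the single length byte, not out of any signedness issue, which is the point of the answer above.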

FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString, which has a single-byte length prefix; however, the documentation states that prefix accepts 0-255.
I am aware of string types that use a variable-length-integer prefix. Such a prefix would restrict the length to 127 characters if you wanted the in-memory representation to stay binary-compatible with ShortString: at 128 characters or more, the MSB of the length byte would be set to 1, which in variable-length integers means the integer is at least 2 bytes long instead of 1.
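For illustration, here is a sketch of one common variable-length-integer scheme (LEB128-style; the answer does not name a specific format) showing why 128 is the threshold:

```c
#include <stddef.h>

/* Sketch of a LEB128-style variable-length integer (an assumed format;
   the answer does not name a specific one). Each byte carries 7 payload
   bits; MSB = 1 means "more bytes follow". Returns bytes written. */
size_t varint_encode(unsigned value, unsigned char *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (unsigned char)(value | 0x80); /* continuation byte  */
        value >>= 7;
    }
    out[n++] = (unsigned char)value;              /* final byte, MSB=0 */
    return n;
}
```

Lengths 0-127 fit in one byte, bit-for-bit identical to a ShortString length byte; 128 and up need two or more bytes, which is where the binary compatibility breaks.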

Related

Length of Greek character string is larger than it should be

I'm writing a program that takes a string of Greek characters as input, and when I print its length it outputs double the expected value. For example, if ch="ΑΒ" or ch="αβ" (Greek characters),
printf("%zu", strlen(ch)); outputs 4 instead of 2, while if ch="ab", it outputs 2. What's going on?
You can use the mbstowcs() function to convert a multibyte string to a wide-character string, and then use wcslen() to determine its length.
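A minimal sketch of that approach (the helper name and fixed buffer size are arbitrary); it assumes a UTF-8 locale has been selected with setlocale():

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Sketch (hypothetical helper): count the characters in a multibyte
   string by converting it to a wide string with mbstowcs() and
   measuring it with wcslen(). Assumes a UTF-8 locale was selected
   with setlocale() and that the input fits in the buffer. */
size_t char_count(const char *s)
{
    wchar_t wide[256];
    size_t n = mbstowcs(wide, s, 256);
    return (n == (size_t)-1) ? n : wcslen(wide);  /* -1 on bad input */
}
```

With this, "αβ" counts as 2 characters even though strlen() reports 4 bytes.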
Probably because your string is encoded using variable-width character encoding.
In the good old days, we only bothered with 128 different characters: a-z, A-Z, 0-9, and some commas and brackets and control things. Everything was taken care of in 7 bits, and we called it ASCII. Then that wasn't enough and we added some other things like letters with lines or dots on top, and we went to 8 bits (1 byte) and could do any of 256 characters in one byte. (Although people's ideas of what should go in those extra 128 slots varied widely, based on what was most useful in their language - see comment from usr2564301 - and you then had to say whose version you were using for what should be in those extra slots.)
If you had 2 characters in your string, it would be 2 bytes long (plus a null terminator perhaps), always.
But then people woke up to the fact that English isn't the only language in the world, and there were in fact thousands of letters in hundreds of languages around the globe. Now what to do?
Well, we could say there are only about 65,000 characters that interest us, and encode all letters in two bytes. There are some encoding formats that do this. A two-letter string will then always be 4 bytes (um, perhaps with some byte order mark at the front, and maybe a null terminator at the end). Two problems: a) not very backwards compatible with ASCII, and b) wasteful of bytes if most text is stuff that is in the good ol' ASCII character set anyway.
Enter UTF-8, which I'll wager is what your string is using for its encoding, or something similar. ASCII characters, like 'a' and 'b', are encoded with one byte, and more exotic characters (--blush-- from an English-speaking perspective) take up more than one byte, of which the first says "what follows is to be taken together with this byte to represent a letter". So you get variable-width encoding, and the length of a two-letter string will be at least two bytes, but if it includes non-ASCII characters, it'll be more.

How can I print a string with the same length with or without multicharacters?

I am trying to do exercise 1-22 in the K&R book. It asks you to fold long lines (i.e. continue them on a new line) after a predefined number of characters.
While testing the program it seemed to work well, but I saw that some lines were "folding" earlier than they should. I noticed it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?
What happens in your code?
K&R was written at a time when all characters were encoded in one single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding represents the thousands of Unicode characters on 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms in K&R do not handle variable-length encodings, so each of your special characters is counted as two and your algorithm is tricked.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, just ignore the characters c in memory that satisfy the condition (c & 0xC0) == 0x80.
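That trick can be sketched as follows (the function name is made up for illustration):

```c
#include <stddef.h>

/* Sketch of the trick above (hypothetical name): the display length of
   a UTF-8 string is the number of bytes that are NOT continuation
   bytes (bit pattern 10xxxxxx). This counts code points, so combining
   marks or double-width characters would still be off, but letters
   like ö, ş, ç, ğ are handled correctly. */
size_t display_len(const char *s)
{
    size_t len = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* skip 10xxxxxx bytes */
            len++;
    return len;
}
```

The parentheses matter: in C, == binds tighter than &, so writing `c & 0xC0 == 0x80` without them would not test what is intended.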
Another way would be to use wide characters wchar_t/wint_t (requires header wchar.h) instead of char/int, and use getwc()/putwc() instead of getc()/putc(). If on your environment sizeof(wchar_t) is 4, you will be able to handle all of Unicode just by using the wide characters and wide library functions instead of the narrow ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you could work correctly with a large subset of Unicode but could still encounter issues in some cases (characters outside that subset need surrogate pairs).
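A sketch of the wide-character idea, written here as a function over wide strings so it is self-contained (a real program would use getwc()/putwc() on streams; the name and fold column are arbitrary):

```c
#include <wchar.h>

/* Sketch: fold a wide-character string at maxcol characters by
   inserting L'\n'. Counting wchar_t units counts characters, not
   bytes, so ö or ş no longer trigger an early fold. */
void fold_wide(const wchar_t *src, wchar_t *dst, int maxcol)
{
    int col = 0;
    while (*src) {
        if (*src == L'\n')
            col = 0;                /* explicit newline resets column */
        else if (++col > maxcol) {
            *dst++ = L'\n';         /* fold: break before this char   */
            col = 1;
        }
        *dst++ = *src++;
    }
    *dst = L'\0';
}
```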
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.
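Still, a character-counting loop with mbrlen() can be sketched as follows (the helper name is hypothetical; a UTF-8 locale is assumed):

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Sketch (hypothetical helper): count characters by walking the string
   with mbrlen(), which returns the byte length of the next multibyte
   character. Assumes a UTF-8 locale was selected with setlocale(). */
size_t mb_count(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t count = 0, n;
    while ((n = mbrlen(s, MB_CUR_MAX, &st)) != 0) {
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        s += n;
        count++;
    }
    return count;
}
```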

Are there any Unicode/wide chars that encode to multiple encoded characters

Consider wctomb(), which takes a wide character and encodes to the currently selected character set. The glibc man page states that the output buffer should be MB_CUR_MAX, while the FreeBSD man page states the output buffer size should be MB_LEN_MAX. Which is correct here?
Are there any example wide char/encoding combinations where it takes multiple encoded characters to represent the wide char?
On a more general note, does MB_CUR_MAX refer to the max combined encoded char byte count to represent a wide char, or is it just representing the max byte count for any particular encoded char?
MB_CUR_MAX is correct, but both are big enough. You might want to use MB_LEN_MAX if you want to avoid variable-length array declarations.
MB_CUR_MAX is the maximum number of bytes in a multibyte character in the current locale. MB_LEN_MAX is the maximum number of bytes in a character for any supported locale. Unlike MB_CUR_MAX, MB_LEN_MAX is a macro so it can be used in an array declaration without creating a VLA.
Both constants refer to a single wide character. There is no simple definition of what a multibyte character is exactly, since multibyte encodings can include shift sequences; if the multibyte locale includes shift sequences, the number of bytes required for a particular call to wctomb with a particular wide character might vary from call to call depending on the shift state. (Also, the actual code might be different in different shift states.)
As far as I know, there is nothing which prevents a wide character from being translated to a multibyte sequence which might be decomposable into other multibyte sequences (as with Unicode composition); the definition of wctomb talks only about "representation". But I don't know of an implementation which does that, either; Unicode canonical decomposition must be done with separate APIs.
So it is possible that no installed locale requires a value as large as MB_LEN_MAX. But there is nothing stopping you from adding locales -- or even creating your own -- provided that they don't exceed the encoding limit (16 bytes on Linux).
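A small sketch of the buffer-sizing point (the wrapper name is made up):

```c
#include <limits.h>   /* MB_LEN_MAX: a macro, largest for any locale  */
#include <locale.h>
#include <stdlib.h>   /* MB_CUR_MAX: runtime value for current locale */

/* Sketch: a buffer declared with MB_LEN_MAX is a plain array and is
   always big enough for wctomb(), since MB_CUR_MAX <= MB_LEN_MAX;
   a buffer declared with MB_CUR_MAX would be a VLA. Returns the byte
   count, or -1 if the character is not representable. Assumes
   setlocale() was called beforehand. */
int encode_wide(wchar_t wc, char out[MB_LEN_MAX])
{
    return wctomb(out, wc);
}
```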

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing the LZW compression I don't know what values the new codes will take. Stemming from that problem, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
The problem statement is ill-formed.
In Unicode as we know it today, code points (the numbers that represent characters, composable parts of characters and other useful but sneakier things) cannot all be numbered from 0 to 65535 to fit into 16 bits. There are more than a hundred thousand Chinese, Japanese and Korean characters in Unicode; clearly, you'd need 17+ bits just for those. So Unicode cannot be the correct option here.
OTOH, there exists a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically represent at most 65536 characters. Characters with codes greater than 65535 are, well, unlucky: you can't have them in UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

Using narrow string manipulation functions on wide data

I'm parsing an XML file which can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language); the API of the XML parser returns all data within the XML via a char* which is UTF-8 encoded.
Some manipulation of the data is required after it's been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).
It would be convenient to use standard functions such as strlen, strcat etc. As the raw data I'm receiving from the XML parser is a char*, I can do all manipulation readily using these standard string-handling functions.
However, these all of course assume and require that the strings are NULL-terminated.
My question therefore is: if you have wide data represented as a char*, can a NULL terminator character occur within the data rather than at the end?
I.e. if a character in a certain language doesn't require 2 bytes to represent it and is represented in one byte, will/can the other byte be NULL?
UTF-8 is not "wide". UTF-8 is a multibyte encoding, in which a Unicode character can take 1 to 4 bytes. UTF-8 won't have zero bytes inside a valid character. Make sure you are not confused about what your parser is giving you: it could be UTF-16 or UCS-2 (or their 4-byte equivalents) placed in wide-character strings, in which case you have to treat them as wide strings.
C distinguishes between multibyte characters and wide characters:
Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.
Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.
When using UTF-8 you can continue to use the str* functions, but bear in mind that they don't provide a way to return the length of a string in characters; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is useful in other situations.
I can't stress enough that all elements of the execution character set need to be represented into a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters, the result is that the implementation can't be conforming to the C standard, and some wc* functions can't possibly work right.
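To illustrate the earlier point that the byte-oriented str* functions remain usable on UTF-8 data (the helper below is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Illustration: byte-oriented strstr() is safe on UTF-8, because
   UTF-8 never embeds a zero byte and is self-synchronizing (one
   character's encoding can never begin in the middle of another's).
   Returns the byte offset of needle in haystack, or -1. */
ptrdiff_t utf8_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? p - haystack : -1;
}
```

Only strlen() needs reinterpreting: for "naïve" it returns 6 (bytes), not 5 (characters).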
