LZW Compression with Entire unicode library - c

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what byte value the new codes will have. Stemming from that problem I also don't know the the EOD code will be.
Also, it seems to me that it will only take two integers the compressed data.

The problem statement is ill-formed.
In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot be all numbered from 0 to 65535 to fit into 16 bits. There are more than 100 thousand of Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.
OTOH, there exist a sort of "abridged" version of Unicode, Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically be used for at most 65536 characters and the like. Those characters with codes greater than 65535 are, well, unlucky, you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

Related

libiconv: Safe estimation of target bytes-length when converting between UTF-8/UTF-16/UTF-32?

There's tons of info about Unicode codeunits, codepoints, etc, but I'm still a bit fuzzy about converting combined characters, graphemes, etc using byte-streams (required by libiconv).
Currently I'm only interested in converting between UTF-8/UTF-16/UTF-32 using libconv's iconv(), which expects the byte-lengths of both source and destination buffers as arguments.
Question: Is there a safe way to calculate fast the maximum possible bytes-length of the target buffer, based on the already known bytes-length of the source buffer?
Let's say for example, converting from u16buf to u8buf with a known u16byteslen (excluding 0x0000-termination if any). In the worst-case scenario, there will be 1 two-byte unit per codepoint in the UTF-16 source buffer, corresponding to a 4 single-byte units per codepoint in the UTF-8 target buffer. Is that enough to safely assume that the UTF-8 target buffer can never be longer than 2 * u16lenbytes?
I've actually experimented with that and seems to work, but I'm not sure if I'm missing corner cases involving combined characters and grapheme clusters. My doubts come from my ignorance regarding how those things are converted across these 3 different encodings. I mean, is it possible for a grapheme to need say 3 UTF-16 codepoints but like 10 UTF-8 codepoints when converted?
In that case, doubling u16lenbytes wouldn't suffice, right? And if so, is there any other straight forward way to pre-calc the maximum length of the target buffer?
Question: Is there a safe way to calculate fast the maximum possible bytes-length of the target buffer, based on the already known bytes-length of the source buffer?
Yes.
to UTF-8
to UTF-16
to UTF-32
from UTF-8
×2
×4
from UTF-16
×1 ½
×1
from UTF-32
×1
×1
You can calculate this yourself by breaking it down by code-point ranges. Pick a source and destination column, and find the largest ratio.
Code Point
UTF-8 length
UTF-16 length
UTF-32 length
0000…007F
1
2
4
0080…07FF
2
2
4
0800…FFFF
3
2
4
10000…10FFFF
4
4
4
Combining characters and grapheme clusters do not affect anything. Encodings simply convert a sequence of Unicode scalar values to bytes, and they are very straightforward.
Note that you will need to add two extra bytes when converting to UTF-16, and four extra bytes when converting to UTF-32, since these encodings will add a BOM U+FEFF to the beginning of the text. (If you don’t want that, use one of the BOM-less encodings, like UTF-16BE or UTF-16LE.)
I mean, is it possible for a grapheme to need say 3 UTF-16 codepoints but like 10 UTF-8 codepoints when converted?
No. That would imply some other kind of conversion, like a decomposition. The number of scalar values input is equal to the number of scalar values output, with the possible addition of U+FEFF byte order mark at the beginning. (I say "scalar value" instead of "code point", because "scalar value" excludes surrogates. If you are transcoding text which might have errors or might be garbage data, it doesn’t change the size of the result.)
Unicode code points can be encoded:
UTF-8: 1, 2, 3, or 4 bytes
UTF-16: 2 or 4 bytes
UTF-32: 4 bytes
(obsolete): UCS-2: 2 bytes (but it requires two surrogates for some code points).
So, as first estimate, if you have the lenght of UTF-16 in byte, you can be safe by using such formula:
byte_len_utf8 = 4 * byte_len_utf16 / 2
But this is not a good way: we know better: UTF-8 is 4 byte length only if UTF-16 is 4 byte length. So we have two cases: 4 * len / 4 or 3 * len / 2.
So if on first formula we allocate the double of bytes (as you supposed), in the second formula, the maximum is just 1.5 time the number of byte. For the Chinese/Japanese/Korean, you are in such region of the codepoints.

Converting C strings to Pascal strings

When converting a C string into a Pascal string, why should the length of the original string be less or equal to 127 instead of 256? I understand that an unsigned int ranges from 0~256 and a signed one ranges from -128~127, but isn't the first character of a Pascal string unsigned?
The Pascal string you are referring to is probably the one used in older Pascals (called ShortString in e.g. Delphi and FreePascal, the most popular Pascal implementations these days). That can contain 255 single-byte characters (char in C). There is no need to restrict this to 127 characters.
Perhaps you were thinking of the fact that 255 bytes can only contain 127 UTF-16 code points. But these strings were popular in the old CP/M and DOS days, when no one knew anything about Unicode yet, and were made to contain ASCII or "Extended ASCII" (8 bit, using code pages).
But most modern Pascal implementations allow you to use strings up to 2 GB in size. There, the length indicator is not stored as the first element anymore, just close to the text data. And these days, most of these strings can contain Unicode too, either as UTF-16 or as UTF-8, depending on the string type you choose (modern Pascal implementations have several different string types for different purposes, so there is not one single "Pascal string type" anymore).
Some languages do have the ability to restrict the size of a ShortString, as so called "counted" strings:
var
s: string[18];
That string has a maximum of 18 bytes text data and 1 byte length data (at index 0). Such shorter strings can be used in, say, records, so they don't grow too big.
FreePascal's wiki has a great page showing all the types of strings that Pascal (at least that implementation) supports: http://wiki.freepascal.org/Character_and_string_types - it includes length-prefixed and null-terminated string types. None of the types on that page have a length restriction of 127.
The string type you're referring to would match ShortString which has a single byte prefix, however their documentation states it accepts 0-255.
I am aware of a string-type that has a variable-length-integer prefix, which would restrict the length of the string to 127 characters if you want the in-memory representation to be binary-compatible with ShortString, as being 128 characters or longer would set the MSB bit to 1 which in variable-length-integers means the integer is at least 2 bytes long instead of 1 byte.

Length of Greek character string is larger than it should be

I'm writing a program and I take a string of Greek characters as input and when I print its len, it outputs its double. For example, if ch="ΑΒ"(greek characters) or ch="αβ",
printf("%d",strlen(ch)); outputs 4 instead of 2. And if ch="ab", it outputs 2. What's going on?
You can use mbstowcs() function to convert multybite string to wide-character string. And then use wcslen() to determine it's length.
Probably because your string is encoded using variable-width character encoding.
In the good old days, we only bothered with 128 different characters: a-z, A-Z, 0-9, and some commas and brackets and control things. Everything was taken care of in 7 bits, and we called it ASCII. Then that wasn't enough and we added some other things like letters with lines or dots on top, and we went to 8 bits (1 byte) and could do any of 256 characters in one byte. (Although people's ideas of what should go in those extra 128 slots varied widely, based on what was most useful in their language - see comment from usr2564301 - and you then had to say whose version you were using for what should be in those extra slots.)
If you had 2 characters in your string, it would be 2 bytes long (plus a null terminator perhaps), always.
But then people woke up to the fact that English isn't the only language in the world, and there were in fact thousands of letters in hundreds of languages around the globe. Now what to do?
Well, we could say there are only about 65,000 characters that interest us, and encode all letters in two bytes. There are some encoding formats that do this. A two-letter string will then always be 4 bytes (um, perhaps with some byte order mark at the front, and maybe a null terminator at the end). Two problems: a) not very backwards compatible with ASCII, and b) wasteful of bytes if most text is stuff that is in the good ol' ASCII character set anyway.
Step in UTF-8, which I'll wager is what your string is using for its encoding, or something similar. ASCII characters, like 'a' and 'b', are encoded with one byte, and more exotic characters (--blush-- from an English-speaking perspective) take up more than one byte, of which the first byte is to say "what follows is to be taken along with this byte to represent a letter". So you get variable-width encoding. So the length of a two-letter string will be at least two bytes, but if it includes non-ASCII characters, it'll be more.

How can I print a string with the same length with or without multicharacters?

I am trying to do exercise 1-22 in K&R book. It asks to fold long lines (i.e.going into a new line) after a predefined number of characters in string.
As I was testing the program and it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that it was the lines on which special characters appeared, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?
What happens in your code ?
The K&R was written in a time where all characters were encoded on one single char. Example of such encoding standards are ASCII or ISO 8859.
Nowadays the leading encoding standard is UNICODE, which comes in several flavors. The UTF-8 encoding is used to represent the thousands of unicode characters on 8 bit bytes, using a variable length scheme:
the ascii characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not manage variable encoding. So each of your special char is counted as two so that your algorithm is tricked.
How to solve it ?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose you a trick that uses the properties of the encoding scheme: whenever you count the display length of a string, just ignore the characters c in memory that comply with the condition c&0xC0==0x80.
Another way would be to use wide chars wchar_t/win_t (requires header wchar.h) instead of char/int and use getwc()/putwc() instead of getc()/putc(). If on your environment sizeof(wchar_t) is 4 then you will be able to work with unicode just using the wide characters and wide library functions instead of the normal ones mentioned in K&R. If however
sizeof(wchar_t) is smaller (for example 2), you could work correctly with a larger subset of unicode but still could encounter alignement issues in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
This of course goes way beyond the scope of the K&R book. It was written before multibyte characters were used.

Detect UTF-16 file content

Is it possible to know if a file has Unicode (16-byte per char) or 8-bit ASCII content?
You may be able to read a byte-order-mark, if the file has this present.
UTF-16 characters are all at least 16-bits, with some being 32-bits with the right prefix (0xE000 to 0xFFFF). So simply scanning each char to see if less than 128 won't work. For example, the two bytes 0x20 0x20 would encode in ASCII and UTF-8 for two spaces, but encode in UTF-16 for a single character 0x2020 (dagger). If the text is known to be English with the occasional non-ASCII character, then most every other byte will be zero. But without some apriori knowledge about the text and/or it's encoding, there is no reliable way distinguish a general ASCII string from a general UTF-16 string.
Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.
You can also know if it is ASCII by scanning every byte in the file and seeing if they are all less than 128. If they are all less than 128, then it's just an ASCII file. If some of them are more than 128, there is some other encoding in there.
First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.
The various "common" character sets such as ISO-8859-x, Windows-1252, etc, are 8-bit, so if every other byte is 0, you know that you're dealing with Unicode that only uses the ISO-8859 characters.
You'll run into problems where you're trying to distinguish between Unicode and some encoding such as UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.
Edit in response to OP's comment:
I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).
My one caveat is that you carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, did not look any further).
If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in unicode or English text in ASCII), you can do a simple frequency analysis on the chars and see if the distribution looks like that of ASCII or of unicode.
Unicode is an alphabet, not a encoding. You probably meant UTF-16. There is lot of libraries around (python-chardet comes to mind instantly) to autodetect encoding of text, though they all use heuristics.
To programmatically discern the type of a file -- including, but not limited to the encoding -- the best bet is to use libmagic. BSD-licensed it is part of just about every Unix-system you are about to encounter, but for a lesser ones you can bundle it with your application.
Detecting the mime-type from C, for example, is as simple as:
Magic = magic_open(MAGIC_MIME|MAGIC_ERROR);
mimetype = magic_buffer(Magic, buf, bufsize);
Other languages have their own modules wrapping this library.
Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):
% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
For your specific use-case, it's very easy to tell. Just scan the file, if you find any NULL ("\0"), it must be UTF-16. JavaScript got to have ASCII chars and they are represented by a leading 0 in UTF-16.

Resources