I am building an AVR program with many wchar_t strings. At the moment, the way I store them on the chip is this:
const __flash wchar_t* const Greek_text[] =
{
    [ACCELERATION_TEXT] = (const __flash wchar_t[]) {L"Επιταγχυνση"},
    [ACCELERATION_SHORT_TEXT] = (const __flash wchar_t[]) {L"Επιταγχ."},
    [REDUCED_TEXT] = (const __flash wchar_t[]) {L"Μειωμενη"},
    [FULL_TEXT] = (const __flash wchar_t[]) {L"Μεγιστη"}
};
I am looking for a way to store them in a compressed form. One approach I can think of is dropping the high byte that is common to all characters of the given language, but the only way of doing that, to my knowledge, is storing the low byte of each character manually in unsigned chars. Is there a more practical way of doing something like this?
It looks like all your strings are Greek. You could assign each letter or character that appears in your strings to a number between 1 and 255 and then just have char strings where each char contains one of those encoded numbers. This would halve the space of each individual string, but you might need a lookup table in your program to convert those codes back to unicode, and that table could occupy up to 512 bytes in your program memory depending on how many different characters you use.
People have already developed encodings like this before, so maybe you could just use a standard one, like Windows-1253.
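If you go the single-byte route, the decode direction can also be handled on the AVR with a small helper instead of a full 512-byte table. This is only a sketch, under the assumption that you use Windows-1253: the main Greek letter range 0xB8-0xFE maps to Unicode by adding 0x2D0, with a handful of punctuation slots as exceptions that you would have to special-case (the function name is made up for illustration):

#include <wchar.h>

// Sketch: convert one Windows-1253 byte back to a Unicode code point.
// Covers ASCII plus the main Greek letter range only; bytes such as
// 0xBB or 0xBD inside that range would need explicit exceptions.
wchar_t cp1253_to_unicode(unsigned char b)
{
    if (b < 0x80)
        return (wchar_t)b;               // plain ASCII passes through
    if (b >= 0xB8 && b <= 0xFE)
        return (wchar_t)(b + 0x2D0);     // e.g. 0xC5 -> U+0395 ('Ε')
    return L'?';                         // not handled in this sketch
}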
Figuring out a good system to author these encoded strings and insert them into your program is another challenge. I'd suggest writing a simple Ruby program like the one below. (You'd save the program itself as a UTF-8 text file on your computer and edit it using a standard text editor.)
string_table = {
  ACCELERATION_TEXT: 'Επιταγχυνση',
  ACCELERATION_SHORT_TEXT: 'Επιταγχ.',
  REDUCED_TEXT: 'Μειωμενη',
  FULL_TEXT: 'Μεγιστη',
}

string_table.each do |key, value|
  hex = value.encode('Windows-1253').each_byte.map { |b| '\x%02x' % b }.join
  puts "[#{key}] = \"#{hex}\","
end
The program above outputs this:
[ACCELERATION_TEXT] = "\xc5\xf0\xe9\xf4\xe1\xe3\xf7\xf5\xed\xf3\xe7",
[ACCELERATION_SHORT_TEXT] = "\xc5\xf0\xe9\xf4\xe1\xe3\xf7\x2e",
[REDUCED_TEXT] = "\xcc\xe5\xe9\xf9\xec\xe5\xed\xe7",
[FULL_TEXT] = "\xcc\xe5\xe3\xe9\xf3\xf4\xe7",
Also, another way to maybe save space (while increasing CPU time used) would be to get rid of the array of pointers you are creating. This would save two bytes per string. Just store all the strings next to each other in memory. The strings would be separated by null characters and you can optionally have a double null character marking the end of the table. To look up a string in the table, you start reading bytes from the beginning of the table one at a time and count how many null characters you have seen.
const __flash char greek_text[] =
    "\xc5\xf0\xe9\xf4\xe1\xe3\xf7\xf5\xed\xf3\xe7\0"
    "\xc5\xf0\xe9\xf4\xe1\xe3\xf7\x2e\0"
    "\xcc\xe5\xe9\xf9\xec\xe5\xed\xe7\0"
    "\xcc\xe5\xe3\xe9\xf3\xf4\xe7\0";
Related
C language: how to iterate character by character over strings whose lengths are larger than INT_MAX or SIZE_MAX?
How can I find out whether the string length exceeds any maximum size applicable to the code below?
int len = strlen(item);
int i = 0;
while (i <= len) {
    //do smth
    i++;
}
You can access characters in a string (or elements in an array generally) without integer indices by using pointers:
for (char *p = item; *p; ++p)
{
// Do something.
// *p is the current character.
}
int len = strlen(item);
First, this is not an impediment to having a string longer than INT_MAX, but it will be if you have to deal with its length. If you think about the implementation of strlen(), you'll see that, given how strings are defined (a sequence of chars in memory bounded by the presence of a null char), the only possible implementation is to scan the string, incrementing the length as you traverse it, searching for the first null char in it. This makes your code very inefficient, because you first traverse the string searching for its end, and then you traverse it a second time to do useful work.
int i = 0;
while (i <= len) {
    //do smth
    i++;
}
It would be better to use a pointer directly, in a for loop like this one:
char *p;
for (p = item; *p; p++) {
    // do something, knowing that `*p` is the current char.
}
In this way, you navigate the string and stop when you find the null char, and you will not have to traverse it twice.
By the way, having strings longer than INT_MAX is quite difficult, because normally (and even more so with the new 64-bit architectures) you cannot create such a large contiguous memory object: if you try to create a static array of that size you will be fighting the compiler, and if you try to malloc() such a huge amount of memory you will end up fighting the operating system.
It is more usual for developers who have to deal with huge amounts of text to use a different structure to hold the characters. Just imagine that you need to insert one char and this forces you to move one gigabyte of memory by one position because you have no other way to make room for it; it is simply impractical to handle such an amount as one block. A simple approach is to use a structure similar to the one used for file data on disk in a Unix system: the data has a series of direct pointers that point to fixed-size blocks of memory holding characters, but at some point those pointers become double pointers, pointing to an array of simple pointers, then a triple pointer, and so on. This way you can handle strings from as short as one byte (with just one memory page) to more than INT_MAX bytes, by selecting an appropriate page size and number of pointers.
Another approach is the mbuf approach used by BSD software to handle networking packets. This is especially appropriate when you have to add to the front of the string (e.g. to add a new protocol header) or to the rear of the packet (to add payload and/or a checksum or trailing data).
One last thing... if you create an array of 5 GB, most probably every modern operating system will swap it out as soon as you stop using part of it. This will make your application start swapping as soon as you move around the array, and you probably will not be able to run it at all on a computer with a limited address space (as a 32-bit machine is today).
Rules:
Two strings, a and b, both consist of ASCII chars and non-ASCII chars (say, GBK-encoded Chinese characters).
If every non-ASCII char contained in b also shows up in a, at least as many times as it appears in b, then we say b is similar to a.
For example:
a = "ab中ef日jkl中本" //non-ASCII chars:'中'(twice), '日'(once), '本'(once)
b = "bej中中日" //non-ASCII chars:'中'(twice), '日'(once)
c = 'lk日日日' //non-ASCII chars:'日'(3 times, more than twice in a)
according to the rule, b is similar with a, but c is not.
Here is my question:
We don't know how many non-ASCII chars there are in a and b; probably many.
So to find out how many times a non-ASCII char appears in a and b, am I supposed to use a Hash-Table to store their appearing-times?
Take string a as an example:
[non-ASCII's hash-value]:[times]
中's hash-val : 2
日's hash-val : 1
本's hash-val : 1
Check string b: when we encounter a non-ASCII char in b, hash it and look it up in a's hash table; if the char is present in a's hash table, decrement its appearing-times by 1.
If the appearing-times becomes less than 0 (-1), then we say b is not similar to a.
Or is there any better way?
PS:
I read string a byte by byte; if the byte is less than 128, then I take it as an ASCII char, otherwise I take it as part of a non-ASCII char (multi-byte).
This is what I am doing to find out the non-ASCII chars.
Is it right?
You have asked two questions:
Can we count the non-ASCII characters using a hashtable? Answer: sure. As you read the characters (not the bytes), examine the codepoints. For any codepoint greater than 127, put it into a counting hashtable. That is, for a character c, add (c, 1) if c is not in the table, and update (c, x) to (c, x+1) if c is already there.
Is there a better way to solve this problem than your approach of incrementing counts for a and decrementing as you run through b? If your hashtable implementation gives nearly O(1) access, then I suspect not. You are looking at each character in the strings exactly once, and for each character you are doing either a hashtable insert or lookup, an addition or subtraction, and a check against 0. With unsorted strings, you have to look at all the characters in both strings anyway, so you've given, I think, the best solution.
The interviewer might be looking for you to say things like, "Hmmmmm, if these strings were actually massive files that could not fit in memory, what would I do?" Or for you to ask, "Well, are the strings sorted? Because if they are, I can do it faster...".
But now let's say the strings are massive. The only thing you are storing in memory is the hashtable. Unicode has only around 1 million codepoints and you are storing an integer count for each, so even if you are getting data from gigabyte sized files you only need around 4MB or so for your hash table (or a small multiple of this, as there will be overhead).
In the absence of any other conditions, your algorithm is nice. Sorting the strings beforehand isn't good; it takes up more memory and isn't a linear-time operation.
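As a rough illustration of the counting approach (a sketch, not the interviewer's expected code): instead of a real hash table, this uses a flat array of counts indexed by code point, roughly the few-megabyte table estimated above, and it assumes the inputs have already been decoded to wide strings of code points.

#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_CODEPOINT 0x110000

// Sketch: true if every non-ASCII character of b occurs in a at least
// as many times as it occurs in b. Uses a flat count array (~4 MB)
// instead of a hash table, for simplicity.
bool similar(const wchar_t *a, const wchar_t *b)
{
    int *count = calloc(MAX_CODEPOINT, sizeof *count);
    if (!count)
        return false;

    bool ok = true;
    for (; *a; ++a)
        if (*a > 127)
            count[*a]++;
    for (; *b; ++b)
        if (*b > 127 && --count[*b] < 0) {
            ok = false;
            break;
        }

    free(count);
    return ok;
}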
ADDENDUM
Since your original comments mentioned the type char as opposed to wchar_t, I thought I'd show an example of using wide strings. See http://codepad.org/B3MXOgqc
Hope that helps.
ADDENDUM 2
Okay here is a C program that shows exactly how to go through a widestring and work at the character level:
http://codepad.org/QVX3QPat
It is a very short program so I will also paste it here:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *s1 = "abd中日";
wchar_t *s2 = L"abd中日";
int main() {
    int i, n;
    printf("length of s1 is %zu\n", strlen(s1));
    printf("length of s2 using wcslen is %zu\n", wcslen(s2));
    printf("The codepoints of the characters of s2 are\n");
    for (i = 0, n = wcslen(s2); i < n; i++) {
        printf("%02x\n", (unsigned)s2[i]);
    }
    return 0;
}
Output:
length of s1 is 9
length of s2 using wcslen is 5
The codepoints of the characters of s2 are
61
62
64
4e2d
65e5
What can we learn from this? A couple things:
If you use plain old char for CJK characters then the string length will be wrong.
To use Unicode characters in C, use wchar_t
String literals have a leading L for wide strings
In this example I defined a string with CJK characters and used wchar_t and a for-loop with wcslen. Please note here that I am working with real characters, NOT BYTES, so I get the correct count of characters, which is 5. Now I print out each codepoint. In your interview question, you will be looking to see if the codepoint is >= 128. I showed them in Hex, as is the culture, so you can look for > 0x7F. :-)
ADDENDUM 3
A few notes in http://tldp.org/HOWTO/Unicode-HOWTO-6.html are worth reading. There is a lot more to character handling than the simple example above shows. In the comments below J.F. Sebastian gives a number of other important links.
One of the things that needs to be addressed is normalization. For example, does your interviewer care that, given two strings, one containing just a Ç and the other a C followed by a combining cedilla, they should compare as the same? They represent the same character, but one uses one codepoint and the other uses two.
Lets say I have a string:
char theString[] = "你们好āa";
Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the latin character with the macron is two bytes, and the 'a' is one byte):
strlen(theString) == 12
How can I count the number of characters? How can I do the equivalent of subscripting so that:
theString[3] == "好"
How can I slice and cat such strings?
You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).
That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or with a 11 bit sequence is the start of a UTF-8 code point; all others are continuation bytes.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
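As an illustration, here is a rough sketch of the first of those, assuming sz counts UTF-8 characters, destbuff is large enough, and srcbuff is valid UTF-8 (no error handling):

#include <stddef.h>

// Sketch: copy the first sz UTF-8 characters of srcbuff into destbuff.
// A byte starts a new character unless its top two bits are 10.
void utf8left (char *destbuff, char *srcbuff, size_t sz)
{
    size_t i = 0;
    while (srcbuff[i]) {
        if ((srcbuff[i] & 0xC0) != 0x80) {   // start of a new character
            if (sz == 0)
                break;
            sz--;
        }
        destbuff[i] = srcbuff[i];
        i++;
    }
    destbuff[i] = '\0';
}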
Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off-by-one error: theString[2] == "好"
The easiest way is to use a library like ICU.
Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.
Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with a circumflex accent can be expressed as two Unicode codepoints, or as a single precomposed codepoint â - both are valid, and both are required by the Unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into Unicode codepoints is a must; everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.
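If you take the iconv() route, a rough sketch looks like this. It assumes a platform where iconv_open accepts "UTF-32LE" as an encoding name and where the source file is saved as UTF-8; error handling is minimal.

#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char theString[] = "你们好āa";
    uint32_t codepoints[64];

    // Open a converter from UTF-8 to UTF-32 (little-endian, no BOM).
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char *in = theString;
    size_t inleft = strlen(theString);
    char *out = (char *)codepoints;
    size_t outleft = sizeof codepoints;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    size_t n = (sizeof codepoints - outleft) / sizeof codepoints[0];
    printf("number of codepoints: %zu\n", n);   // 5 for this string
    return 0;
}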
In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.
Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.
For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).
Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.
while (s[i]) {
    if ((s[i] & 0xC0) != 0x80)
        j++;
    i++;
}
return (j);
This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)
However I'm still stumped on slicing and concatenating?!?
In general we should use a different data type for unicode characters.
For example, you can use the wide char data type
wchar_t theString[] = L"你们好āa";
Note the L prefix, which indicates that the string is composed of wide chars.
The length of that string can be calculated using the wcslen function, which behaves like strlen.
One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).
This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.
I did a similar implementation years back, but I do not have the code with me.
For each Unicode character, the first byte describes how many bytes follow it to make up the character, so based on the first byte you can determine the length of each character.
I think it's a good UTF-8 library:
enter link description here
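As a sketch of that lead-byte rule (not code from the original answer, just an illustration of the bit patterns involved):

// Sketch: given the first byte of a UTF-8 sequence, return how many
// bytes the whole sequence occupies (0 for an invalid lead byte).
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;              // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;    // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;    // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;    // 11110xxx
    return 0;                               // continuation or invalid byte
}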
A sequence of code points may constitute a single syllable/letter/character in many non-Western-European languages (e.g., all Indic languages).
So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings, say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.
So the definition of a character/syllable, and where you actually break the string into "chunks of syllables", depends upon the nature of the language you are dealing with.
For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following
V (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the above patterns to break the string and to find the substrings.
I do not think it is possible to have a general-purpose method that can magically break strings in the above fashion for any Unicode string (or sequence of code points), as the pattern that works for one language may not be applicable to another.
I guess there may be some methods/libraries that can take some definition/configuration parameters as input to break Unicode strings into such syllable chunks. Not sure though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.
I'm trying to use Mac OS X's listxattr C function and turn it into something useful in Python. The man page tells me that the names in the returned buffer are "simple NULL-terminated UTF-8 strings and are returned in arbitrary order. No extra padding is provided between names in the buffer."
In my C file, I have it set up correctly it seems (I hope):
char buffer[size];
res = listxattr("/path/to/file", buffer, size, options);
But when I go to print it, I only get the FIRST attribute, which was two characters long, even though the returned size is 25. So then I manually set buffer[3] = 'z' and, lo and behold, when I print buffer again I get the first TWO attributes.
I think I understand what is going on. The buffer is a sequence of NULL-terminated strings, and printing stops as soon as it hits a NULL character. But then how am I supposed to unpack the entire sequence into ALL of the attributes?
I'm new to C and using it to figure out the mechanics of extending Python with C, and ran into this doozy.
1. char *p = buffer;
2. Get the length with strlen(p). If the length is 0, stop.
3. Process the first chunk.
4. p = p + length + 1;
5. Go back to step 2.
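A minimal sketch of that loop, assuming res holds the size returned by listxattr and the buffer is well-formed (the helper name is made up):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

// Sketch: walk the packed, null-terminated attribute names in buffer.
static void print_xattr_names(const char *buffer, ssize_t res)
{
    const char *p = buffer;
    while (p < buffer + res && *p != '\0') {
        printf("attribute: %s\n", p);   // process the current name
        p += strlen(p) + 1;             // skip the name and its terminator
    }
}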
So you guessed pretty much right.
The listxattr function returns a bunch of null-terminated strings packed in next to each other. Since strings (and arrays) in C are just blobs of memory, they don't carry around any extra information with them (such as their length). The convention in C is to use a null character ('\0') to represent the end of a string.
Here's one way to traverse the list, in this case changing it to a comma-separated list.
int i = 0;
for (; i < res; i++)
    if (buffer[i] == '\0' && i != res - 1) // we're in between strings
        buffer[i] = ',';
Of course, you'll want to make these into Python strings rather than just substituting in commas, but that should give you enough to get started.
It looks like listxattr returns the size of the buffer it has filled, so you can use that to help you. Here's an idea:
for (int i = 0; i < res - 1; i++)
{
    if (buffer[i] == 0)
        buffer[i] = ',';
}
Now, instead of being separated by null characters, the attributes are separated by commas.
Actually, since I'm going to send it to Python, I don't have to process it C-style after all. Just use Py_BuildValue, passing it the format character s#, which knows what to do with it. You'll also need the size.
return Py_BuildValue("s#", buffer, size);
You can process it into a list on Python's end using split('\x00'). I found this after trial and error, but I'm glad to have learned something about C.
I have written a program in C that works well, converting non-printable ASCII characters to their character names. I would appreciate it if a C master would show me a better way of doing it than what I currently have, mainly this section:
if (isascii(ch)) {
    switch (ch) {
        case 0:
            printControl("NUL");
            break;
        case 1:
            printControl("SOH");
            break;
        // .. etc (32 in total)
        default:
            putchar(ch);
            break;
    }
}
Is it normal to make a switch that big? Or should I be using some other method (input from an ascii table?)
If you're always doing the same operation (e.g., putchar), you can just statically initialize an array that maps each character to its name. You could then access the proper value by indexing the array with the incoming character.
For example (in pseudo-code -- it's been a while since I wrote in C), you would define:
const char *map[] = {"NUL", "SOH", ...};
and then index it via something like:
const char *val = map[(int)ch];
to get your value.
You would not be able to use this if your "from" values are not sequential; in that case, you would need to have some conditional blocks. But if you can leverage the sequentiality, you should.
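A small, self-contained sketch of that table approach, under the assumption that the values are the sequential control codes 0..31 (print_char and the <NAME> output format are made up for illustration; a real program would call the question's printControl instead of printf):

#include <stdio.h>

// Sketch: names for the 32 control characters 0x00..0x1F.
static const char *control_names[32] = {
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US"
};

// Stand-in for the question's printControl()/putchar() logic.
void print_char(int ch)
{
    if (ch >= 0 && ch < 32)
        printf("<%s>", control_names[ch]);
    else
        putchar(ch);
}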
Too many years ago when assembly languages for 8-bit micros were how I spent my time, I would have written something like
printf("%3.3s",
("NULSOHSTXETXEOTENQACKBELBS HT LF VT FF CR SO SI "
"DLEDC1DC2DC3DC4NAKSYNETBCANEM SUBESCFS GS RS US ")[3*ch]);
but not because its particularly better. And the multiply by three is annoying because 8-bit micros don't multiply so it would have required both a shift and an add, as well as a spare register.
A much more C-like result would be to use a table with four bytes per control, with the NUL bytes included. That allows each entry to be referred to as a string constant, but saves the extra storage for 32 pointers.
const char *charname(int ch) {
    if (ch >= 0 && ch <= 0x20)
        return ("NUL\0" "SOH\0" "STX\0" "ETX\0"     /* 00..03 */
                "EOT\0" "ENQ\0" "ACK\0" "BEL\0"     /* 04..07 */
                "BS\0\0" "HT\0\0" "LF\0\0" "VT\0\0" /* 08..0B */
                "FF\0\0" "CR\0\0" "SO\0\0" "SI\0\0" /* 0C..0F */
                "DLE\0" "DC1\0" "DC2\0" "DC3\0"     /* 10..13 */
                "DC4\0" "NAK\0" "SYN\0" "ETB\0"     /* 14..17 */
                "CAN\0" "EM\0\0" "SUB\0" "ESC\0"    /* 18..1B */
                "FS\0\0" "GS\0\0" "RS\0\0" "US\0\0" /* 1C..1F */
                "SP\0\0") + (ch << 2);              /* 20 */
    if (ch == 0x7f)
        return "DEL";
    if (ch == EOF)
        return "EOF";
    return NULL;
}
I've tried to format the main table so its organization is clear. The function returns NULL for characters that name themselves, or are not 7-bit ASCII. Otherwise, it returns a pointer to a NUL-terminated ASCII string containing the conventional abbreviation of that control character, or "EOF" for the non-character EOF returned by C standard IO routines on end of file.
Note the effort taken to pad each character name slot to exactly four bytes. This is a case where building this table with a scripting language or a separate program would be a good idea. In that case, the simple answer is to build a 129-entry table (or 257-entry) containing the names of all 7-bit ASCII (or 8-bit extended in your preferred code page) characters with an extra slot for EOF.
See the sources to the functions declared in <ctype.h> for a sample of handling the extra space for EOF.
You can make a switch this big but it does become a bit difficult to manage.
The way I would approach this is to build an array of entries, each holding a char c; and a char* ctrl; for the name. Then you could just loop through the array (see the sketch below). This would make it a little easier to maintain the data.
Note that if you use every character in a particular range (for example, character 0 through 32), then your array would only need the name and it wouldn't be necessary to store the character value.
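A sketch of that struct-array idea, searched with a loop so the values do not need to be sequential (the table shown is deliberately partial and the names are illustrative):

#include <stdio.h>

struct ctrl_name { char c; const char *name; };

// Sketch: a partial table; extend it with the remaining control codes.
static const struct ctrl_name ctrl_table[] = {
    { 0x00, "NUL" }, { 0x01, "SOH" }, { 0x02, "STX" }, { 0x7f, "DEL" },
};

void print_ctrl_or_char(int ch)
{
    for (size_t i = 0; i < sizeof ctrl_table / sizeof ctrl_table[0]; i++) {
        if (ctrl_table[i].c == ch) {
            printf("<%s>", ctrl_table[i].name);
            return;
        }
    }
    putchar(ch);
}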
I would say build a table with the vals (0-32) and their corresponding control string ("NUL", "SOH"). (In this case the table requires just an array)
Then you can just check whether it is in range and index into the table to get the string to pass to your printControl() function.