Find non-ascii characters from a UTF-8 string

Find non-ascii characters from a UTF-8 string - c

I need to find the non-ASCII characters from a UTF-8 string.
my understanding:
UTF-8 is a superset of character encoding in which 0-127 are ascii characters.
So if in a UTF-8 string , a characters value is Not between 0-127, then it is not a ascii character , right? Please correct me if i'm wrong here.
On the above understanding i have written following code in C :
Note:
I'm using the Ubuntu gcc compiler to run C code
utf-string is x√ab c
long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
char ch = arr[i];
if (isascii(ch))
printf("Ascii character %c\n", ch);
else
printf("Not ascii character %c\n", ch);
}
Which prints the output like:
length : 9
Ascii character x
Not ascii character
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character
Ascii character c
Ascii character
To naked eye length of x√ab c seems to be 6, but in code it is coming as 9 ?
Correct answer for the x√ab c is 1 ...i.e it has only 1 non-ascii character , but in above output it is coming as 3 (times Not ascii character).
How can i find the non-ascii character from UTF-8 string, correctly.
Please guide on the subject.

What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).
So to count the number of UTF-8 characters you have to do a partial decoding: count the number of UTF-8 start codepoints.
See the Wikipedia article on UTF-8 to find out how they are encoded.
Basically there are 3 categories:
single-byte codes 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
To count the number of unicode codepoint simply count all characters that are not continuation bytes.
However unicode codepoints don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).

The UTF-8 characters when taken in a character array occupies it in such a way that the first byte occupied by each UTF-8 character would contain the information regarding the number of bytes taken to represent the character. The number of consecutive 1's from the MSB of the first byte would represent the total bytes taken by the non-ascii character. In case of '√' the binary form would be: 11100010,10001000,10011010. Counting the number of 1's the in the first byte gives the number of bytes occupied as 3. Something like the code below would work for this:
int get_count(char non_ascii_char){
/*
The function returns the number of bytes occupied by the UTF-8 character
It takes the non ASCII character as the input and returns the length
to the calling function.
*/
int bit_counter=7,count=0;
/*
bit_counter - is the counter initialized to traverse through each bit of the
non ascii character
count - stores the number of bytes occupied by the character
*/
for(;bit_counter>=0;bit_counter--){
if((non_ascii_char>>bit_counter)&1){
count++;// increments on the number of consecutive 1s in the byte
}
else{
break;// breaks on encountering the first 0
}
}
return count;// returns the count to the calling function
}

Related

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error.The length of the two strings are the same.
#include<stdio.h>
#include<string.h>
int main()
{
printf("Enter your string to be encrypted ");
char a[50];
char b[50];
scanf("%s",a);
printf("Enter the key ");
scanf("%s",b);
char c[50];
int q=strlen(a);
int i=0;
for(i=0;i<q;i++)
{
c[i]=(char)(a[i]^b[i]);
}
printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.

Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the
printer and are based on telex technology. With these characters, you
can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters
nor numbers. These include punctuation or technical, mathematical
characters. ASCII also includes the space (a non-visible but printable
character), and, therefore, does not belong to the control characters
category, as one might suspect.
Numbers (30–39): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second
group containing the lowercase.
Using the following two strings and the following code:
char str1 = {"asdf"};
char str1 = {"jkl;"};
Following demonstrates XORing the elements of the strings:
int main(void)
{
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
for(int i=0;i<sizeof(str1)/sizeof(str1[i]);i++)
{
printf("%d ^ %d: %d\n", str1[i],str2[i], str1[i]^str2[i]);
}
getchar();
return 0;
}
While all of the input characters are printable (except the NULL character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.

Check for Special Characters in C

I got some help on how to see if a "string" in C contains a specific character. In short:
if(*s=='x') { //Where x is some character
//Do something
}
Now, as far as I can see, this works for letters in the English alphabet (a-z, A-Z).
However, how can I check if the current character equals a special character (such as æ, ø, or å)?

Just compare to the ASCII code of the character:
if(*s==10) { //Where 10 is the ASCII code of the special character
//Do something
}
You can find your ASCII code here: http://www.asciitable.com/

In most cases you should be using a function for this: strchr for looking for a single byte, or strstr for looking for more than a single byte.
Also, in general C does not know about characters, it only knows about bytes. A special character may be a single byte - æ in iso8859-1 encoding is \xe6, for example - or it may be more than one byte: the same character in utf-8 encoding is the 2-byte sequence \xc3\xa6.
To search a utf-8 encoded string for æ you could use
strstr(s, "\xc3\xa6")

Compare the character to the actual value (as long as the value is between 0 and 255).
See the æ wiki page for particular values. So
if (*s == 0xe6)
for lower case
if (*s == 0xc6)
for upper case

Confused about C string constants

When I came across this C language implementation of Porters Stemming algorithm I found a C-ism I was confused about.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void test( char *s )
{
int len = s[0];
printf("len= %i\n", len );
printf("s[len] = %c\n", s[len] );
}
int main()
{
test("\07" "abcdefg");
return 0;
}
and output:
len = 7
s[len] = g
However, when I input
test("\08" "abcdefgh");
or any string constant that is longer than 7 with the corresponding length in the first pair of parenthesis ( i.e. test("\09" "abcdefghi"); the output is
len = 0
s[len] =
But any input like test("\01" "abcdefgh"); prints out the character in that position ( if we call the first character position 1 and not 0 for the moment )
It appears if test( char *s ) reads the number in the first pair of parenthesis ( how it does this I am not sure since I thought s[0] would be able to only read a single char, i.e. the '\' ) and prints the last character at that index + 1 of the string constant in the second pair of parenthesis.
My question is this: It seems as if we are passing two string constants into test( char *s ). What exactly is happening here, meaning, how does the compiler seem to "split" up the string over two pairs of parenthesis? Another question one might have is, is a string of the form "blah" "abcdefg" one consecutive block of memory? It may be the case that I have overlooked something elementary, but even so I would like to know what I overlooked. I know this is a basic concept but I could not find a clear example or situation on the web that explains this and in all honesty I don't follow the output. Any helpful comments are welcomed.

There are at least three things going on here:
Literal strings juxtaposed against one another are concatenated by the compiler. "a" "b" is exactly the same as "ab".
The backslash is an escape character, which means it is not copied literally into the resulting string. The notation \01 means "the character with ASCII value 1".
The notation \0... means an octal character constant. Octal numbers are base 8, made up from digits that range from 0 through 7 inclusive. 8 is not a valid octal constant, so "\08" does not follow "\07".

The problem is not in the length of the string, but in the \o syntax for specifying non-printable values in string literals. \o, \oo, and \ooo denote octal constants, i.e. a single character whose value is written in base 8. Since 08 in \08 doesn't represent a valid base 8 number, it is interpreted as \0 followed by the ASCII character 8.
To fix the problem, represent 8 as \10 or \010:
test("\007" "abcdefg");
test("\010" "abcdefgh");
...or switch to hexadecimal, where the \x prefix makes the base more explicit to the casual reader:
test("\x07" "abcdefg");
test("\x08" "abcdefgh");
test("\x09" "abcdefghi");
test("\x0a" "abcdefghij");
...

\number in a character or string literal is means the character whose code is the value number. number is interpreted in octal, so the first non-octal digit terminates the number. So "\07" is a one-character string containing the character with code 7, but \08 is a two-character string containing the character with code 0 followed by the digit 8.
Additionally, code 0 the null terminator that's used in C to indicate the end of the string. So that second string ends at the beginning, because its first byte is the terminator. This why the length of the string in your second example is 0.

When two or more string literals are adjacent (separated only by white-space), the compiler will join them into a single string. Therefore "\07" "abcdefg" is equivalent to "\07abcdefg".
"\07" is an octal escape. An octal escape ends after three digits or with first non-octal character. So, when you enter "\08", 8 is a non octal character therefore escape ends and 0 is stored at s[0].
Now, len is 0 and printing s[len] will try to print the character at s[0] which has a non printable ASCII code (Only character above ASCII value above 32 are printable).

sizeof character and strlen string mismatch

As per my code, I assume each greek character is stored in 2bytes.
sizeof returns the size of each character as 4 (i.e the sizeof int)
How does strlen return 16 ? [Making me think each character occupies 2 bytes] (Shouldn't it be 4*8 = 32 ? Since it counts the number of bytes.)
Also, how does printf("%c",bigString[i]); print each character properly? Shouldn't it read 1 byte (a char) and then display because of %c, why is the greek character not split in this case.
strcpy(bigString,"ειδικούς");//greek
sLen = strlen(bigString);
printf("Size is %d\n ",sizeof('ε')); //printing for each character similarly
printf("%s is of length %d\n",bigString,sLen);
int k1 = 0 ,k2 = sLen - 2;
for(i=0;i<sLen;i++)
printf("%c",bigString[i]);
Output:
Size is 4
ειδικούς is of length 16
ειδικούς

Character literals in C have type int, so sizeof('ε') is the same as sizeof(int). You're playing with fire in this statement, a bit. 'ε' will be a multicharacter literal, which isn't standard, and might come back to bite you. Be careful with using extensions like this one. Clang, for example, won't accept this program with that literal in it. GCC gives a warning, but will still compile it.
strlen returns 16, since that's the number of bytes in your string before the null-terminator. Your greek characters are all 16 bits long in UTF-8, so your string looks something like:
c0c0 c1c1 c2c2 c3c3 c4c4 c5c5 c6c6 c7c7 0
in memory, where c0c0, for example, is the two bytes of the first character. There is a single null-termination byte in your string.
The printf appears to work because your terminal is UTF-8 aware. You are printing each byte separately, but the terminal is interpreting the first two prints as a single character, and so on. If you change that printf call to:
printf("%d: %02x\n", i, (unsigned char)bigString[i]);
You'll see the byte-by-byte behaviour you're expecting.

Is spacebar a character?

Consider this line of text:
First line of text.
If a character array string is used to load the first TEN characters in the array it will output as:
First lin'\0'
First contains 5 letters, lin contains 3 letters. Where are the other two characters being used?
Is \0 considered two characters?
Or is the space between the words considered a character, thus '\0` is one character?

Yes, space is a character. In ASCII encoding it has code number 32.

The space between the two words has ASCII code 0x20 (0408, or 3210); it occupies one byte.
The null at the end of the string, ASCII code 0x00 (0 in both octal and decimal) occupies the other byte.
Note that the space bar is simply the key on the keyboard that generates a space character when typed.

'\0' is the null-terminator, it is literally the value zero in all implementations.
'\0' is considered a single character because the backslash \ means to escape a character. '\0' and '0' thus are both single characters, but mean very different things.
Note that space is represented by a different ascii value.

space is represented in String as "\s", probably space is represented as '\s' as a character

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight