I got some help on how to see if a "string" in C contains a specific character. In short:
if(*s=='x') { //Where x is some character
//Do something
}
Now, as far as I can see, this works for letters in the English alphabet (a-z, A-Z).
However, how can I check if the current character equals a special character (such as æ, ø, or å)?
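Just compare against the numeric code of the character:
if(*s==10) { //Where 10 is the numeric code of the character you are looking for
//Do something
}
You can find the ASCII codes (0 to 127) here: http://www.asciitable.com/ (note that æ, ø, and å are not part of ASCII, so their codes depend on the encoding in use).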
In most cases you should be using a function for this: strchr for looking for a single byte, or strstr for looking for more than a single byte.
Also, in general C does not know about characters, it only knows about bytes. A special character may be a single byte - æ in iso8859-1 encoding is \xe6, for example - or it may be more than one byte: the same character in utf-8 encoding is the 2-byte sequence \xc3\xa6.
To search a utf-8 encoded string for æ you could use
strstr(s, "\xc3\xa6")
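As a rough sketch of the difference between the two functions, the helpers below search for æ under each encoding mentioned above. The function names are made up for illustration, and each one assumes the caller already knows which encoding the string uses.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the UTF-8 string s contains æ (U+00E6).
   In UTF-8 the letter is the two-byte sequence 0xC3 0xA6, so a substring
   search is needed. */
bool contains_ae_utf8(const char *s)
{
    return strstr(s, "\xc3\xa6") != NULL;
}

/* Hypothetical helper: true if the ISO 8859-1 string s contains æ.
   In that encoding the letter is the single byte 0xE6, so strchr is enough. */
bool contains_ae_latin1(const char *s)
{
    return strchr(s, '\xe6') != NULL;
}
```

Both helpers work on bytes only; neither can tell whether the input really is in the encoding it assumes.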
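Compare the character to its actual byte value (as long as the value is between 0 and 255). Cast to unsigned char first: plain char may be signed, in which case a byte such as 0xe6 reads back as a negative number and never compares equal.
See the æ wiki page for particular values. So, in ISO 8859-1:
if ((unsigned char)*s == 0xe6)
for lower case
if ((unsigned char)*s == 0xc6)
for upper case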
I want to index the characters in a utf8 string which does not necessarily contain
only ascii characters. I want the same kind of behavior I get in javascript:
> str = "lλך" // i.e. Latin ell, Greek lambda, Hebrew lamedh
'lλך'
> str[0]
'l'
> str[1]
'λ'
> str[2]
'ך'
Following the advice of UTF-8 Everywhere, I am representing my mixed character-length string just like any other string in C - and not using wchars.
The problem is that, in C, one cannot access the 16th character of a string: only the 16th byte. Because λ is encoded with two bytes in utf-8, I have to access the 16th and 17th bytes of the string in order to print out one λ.
For reference, the output of:
#include <stdio.h>
int main () {
char word_with_greek[] = "this is lambda:_λ";
printf("%s\n",word_with_greek);
printf("The 0th character is: %c\n", word_with_greek[0]);
printf("The 15th character is: %c\n",word_with_greek[15]);
printf("The 16th character is: %c%c\n",word_with_greek[16],word_with_greek[17]);
return 0;
}
is:
this is lambda:_λ
The 0th character is: t
The 15th character is: _
The 16th character is: λ
Is there an easy way to break up the string into characters? It does not seem too difficult to write a function which breaks a string into wchars, but I imagine that someone has already written this, and I cannot find it.
It depends on what your Unicode characters can be. Most strings are restricted to the Basic Multilingual Plane (BMP). If yours are (not by accident but because of their very nature: at least no risk of emoji...), you can use char16_t to represent any character. BTW, on common platforms wchar_t is at least as large as char16_t, so in that case it is safe to use it.
If your script can contain emoji characters or other characters not in the BMP, or simply if you are unsure, the only foolproof way is to convert everything to char32_t, because any Unicode character (at least in 2019...) has a code that fits in fewer than 32 bits.
Converting from UTF-8 to 32-bit (or 16-bit) Unicode is not that hard and can be coded by hand; Wikipedia contains enough information for it. But you will find tons of libraries where this is already coded and tested, mainly the excellent libiconv, and the C11 version of the C standard library also contains functions for UTF-8 conversions. Not as nice, but usable.
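To give an idea of what "coded by hand" could look like, here is a minimal sketch that indexes a UTF-8 string by code point, similar to str[1] in the JavaScript example. It assumes the input is valid UTF-8 and does no error checking; the names utf8_decode and utf8_char_at are invented for this example and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s into *cp and return the number of
   bytes consumed. Assumes s points at the first byte of a valid sequence. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {            /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6)
            | (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    /* 11110xxx followed by three continuation bytes */
    *cp = ((uint32_t)(p[0] & 0x07) << 18)
        | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6)
        | (uint32_t)(p[3] & 0x3F);
    return 4;
}

/* Return the code point at character index n, or 0 if s is too short. */
uint32_t utf8_char_at(const char *s, size_t n)
{
    uint32_t cp = 0;
    while (*s != '\0') {
        s += utf8_decode(s, &cp);
        if (n-- == 0)
            return cp;
    }
    return 0;
}
```

Each lookup walks the string from the start, so indexing is linear in n. If many lookups are needed, decoding once into a char32_t array is probably the better trade-off.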
How do I check in C if an array of uint8 contains only ASCII elements?
If possible please refer me to the condition that checks if an element is ASCII or not
Your array elements are uint8, so must be in the range 0-255
For standard ASCII character set, bytes 0-127 are used, so you can use a for loop to iterate through the array, checking if each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string
From your example comment, this could be implemented like this:
int checkAscii (uint8 *array) {
for (int i=0; i<LEN; i++) {
if (array[i] > 127) return 0;
}
return 1;
}
It breaks out early at the first element greater than 127.
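A variant of the same loop, sketched with the standard uint8_t type and an explicit length parameter instead of the LEN constant (the name is_all_ascii is only illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every one of the len bytes in data is in the ASCII range 0..127. */
bool is_all_ascii(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (data[i] > 127)
            return false;   /* stop at the first non-ASCII byte */
    }
    return true;
}
```

Passing the length explicitly means an embedded 0 byte is treated like any other ASCII value rather than as a terminator.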
All valid ASCII characters have value 0 to 127, so the test is simply a value check or 7-bit mask. For example given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly however you intended only printable ASCII characters (excluding control characters). In that case, given inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
(isprint(ch) || isspace(ch)) ;
Your intent may be slightly different in terms of what characters you intend to include in your set - in which case other functions in ctype.h may be applied, or you can simply test the values against a value or range to include/exclude.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" set uses locale specific codepages to define the glyphs associated with codes 128 to 255. That is to say the set changes depending on language/locale settings to accommodate different language characters, accents and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (of which there are several with either fixed or variable length codes). UTF-8 encoding is a variable width encoding where all single byte encodings are also ASCII codes. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set - and that is application specific.
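The per-character test can be wrapped in a loop over a whole string. The following is only a sketch: the name is_printable_ascii_text is made up, and it assumes the default "C" locale, since isprint and isspace are locale dependent.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if every byte of s is printable ASCII or ASCII whitespace. */
bool is_printable_ascii_text(const char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char ch = (unsigned char)*s;
        if ((ch & ~0x7f) != 0)
            return false;               /* outside the 7-bit range */
        if (!isprint(ch) && !isspace(ch))
            return false;               /* control character */
    }
    return true;
}
```

The conversion to unsigned char matters on platforms where plain char is signed, because the ctype.h functions are only defined for values representable as unsigned char (or EOF).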
You cannot check if something is "ASCII" with standard C.
This is because C does not specify which character set an implementation uses. Various other, more or less exotic, character sets exist or have existed.
UTF-8, for example, is a superset of ASCII. Older, dysfunctional 8-bit character sets have existed, such as EBCDIC and "Extended ASCII". Telling whether something is, say, ASCII or EBCDIC cannot be done trivially; it takes a long series of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check whether only the low 7 bits are set: if((ch & 0x7F)==ch).
In C programming, a character variable holds a numeric character code rather than the character itself; on ASCII-based systems that is the ASCII value, an integer between 0 and 127.
The ASCII values of the lowercase letters run from 97 to 122, and those of the uppercase letters from 65 to 90.
Instead of giving the actual code, I am giving you an example.
You can assign int to char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach.
An integer can be assigned directly to a character. A character is different mostly just because how it is interpreted and used.
char c = atoi("47");
Try to implement this once you understand the logic above properly.
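The letter ranges quoted above can be turned into a small check. This is a sketch that assumes an ASCII-based execution character set; the function names are made up for illustration, and in portable code islower and isupper from ctype.h are usually the better choice.

```c
#include <stdbool.h>

/* True if c is the ASCII code of a lowercase letter (97..122). */
bool is_ascii_lower(int c)
{
    return c >= 97 && c <= 122;
}

/* True if c is the ASCII code of an uppercase letter (65..90). */
bool is_ascii_upper(int c)
{
    return c >= 65 && c <= 90;
}
```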
I don't know the answers to the following questions for GCC. Can anyone help?
Can a valid UTF-8 character (other than code point 0) contain a zero byte? If so, I think functions such as strlen will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to '\n'? If so, I think functions such as "gets" will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to ' ' or '\t'? If so, I think functions such as scanf("%s%s") will break that UTF-8 character and interpret it as two or more words.
The answer to all your questions is the same: No.
It's one of the advantages of UTF-8: ASCII bytes never occur in the encoding of non-ASCII code points.
For example, you can safely use strlen on a UTF-8 string, only that its result is the number of bytes instead of UTF-8 code points.
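A small sketch that illustrates this property: every byte of a non-ASCII character has its high bit set, so none of them can be mistaken for '\0', '\n', ' ' or '\t'. The helper name all_bytes_high is invented for the example.

```c
#include <stdbool.h>
#include <string.h>

/* True if every byte of s has the high bit set, i.e. no byte is an ASCII value. */
bool all_bytes_high(const char *s)
{
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0x80) == 0)
            return false;
    }
    return true;
}
```

For example, strlen("\xce\xbb") is 2 for the single character λ, and all_bytes_high returns true for it because neither byte falls in the ASCII range.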
I need to find the non-ASCII characters from a UTF-8 string.
my understanding:
UTF-8 is a superset of ASCII in which the values 0-127 are the ASCII characters.
So if a character's value in a UTF-8 string is not between 0 and 127, then it is not an ASCII character, right? Please correct me if I'm wrong here.
Based on the above understanding I have written the following code in C:
Note:
I'm using the Ubuntu gcc compiler to run C code
utf-string is x√ab c
long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
char ch = arr[i];
if (isascii(ch))
printf("Ascii character %c\n", ch);
else
printf("Not ascii character %c\n", ch);
}
Which prints the output like:
length : 9
Ascii character x
Not ascii character
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character
Ascii character c
Ascii character
To the naked eye the length of x√ab c seems to be 6, but the code reports 9.
The correct answer for x√ab c is 1, i.e. it has only one non-ASCII character, but the output above reports 3 (three "Not ascii character" lines).
How can I correctly find the non-ASCII characters in a UTF-8 string?
Please guide on the subject.
What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).
So to count the number of UTF-8 characters you have to do a partial decoding: count the bytes that start a code point.
See the Wikipedia article on UTF-8 to find out how they are encoded.
Basically there are 3 categories:
single-byte codes 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
However, Unicode code points don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).
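The byte categories above translate into two short loops. This is a sketch that assumes valid UTF-8 input; the function names are illustrative only.

```c
#include <stddef.h>

/* Count code points: every byte that is not a continuation byte (10xxxxxx)
   starts a new code point. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Count non-ASCII code points: each one begins with a start byte, and all
   start bytes (110xxxxx, 1110xxxx, 11110xxx) have the top two bits set. */
size_t utf8_count_non_ascii(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0xC0)
            n++;
    }
    return n;
}
```

For the string x√ab c from the question, the first function gives 6 and the second gives 1, which matches what the asker expected.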
In a char array, the first byte of each UTF-8 character encodes the number of bytes used to represent that character. For a non-ASCII character, the number of consecutive 1s starting from the MSB of the first byte gives the total number of bytes taken by the character. In the case of '√' the binary form is: 11100010, 10001000, 10011010. Counting the leading 1s in the first byte gives the number of bytes occupied as 3. Something like the code below would work for this:
int get_count(char non_ascii_char){
/*
The function returns the number of bytes occupied by the UTF-8 character
It takes the non ASCII character as the input and returns the length
to the calling function.
*/
int bit_counter=7,count=0;
/*
bit_counter - is the counter initialized to traverse through each bit of the
non ascii character
count - stores the number of bytes occupied by the character
*/
for(;bit_counter>=0;bit_counter--){
if((non_ascii_char>>bit_counter)&1){
count++;// increments on the number of consecutive 1s in the byte
}
else{
break;// breaks on encountering the first 0
}
}
return count;// returns the count to the calling function
}
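Note that get_count returns 0 for an ASCII byte, because such a byte has no leading 1 bits. A possible alternative, sketched here with explicit bit masks and a made-up name, returns 1 for ASCII and 0 for bytes that cannot start a character:

```c
/* Length in bytes of the UTF-8 sequence that begins with lead, or 0 if lead
   is a continuation byte or otherwise not a valid start byte. */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                   /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;                   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;                   /* 11110xxx */
    return 0;                       /* 10xxxxxx or invalid */
}
```

Taking an unsigned char parameter avoids any dependence on whether plain char is signed.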
I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
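To see why normalization matters, note that a byte-wise comparison treats the three spellings as different strings. The sketch below hard-codes the three byte sequences listed above; it only illustrates the problem and is no substitute for a real normalization routine. The names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The three canonically equivalent UTF-8 spellings listed above. */
static const char *const ALPHA_TONOS_FORMS[] = {
    "\xce\xac",             /* U+03AC */
    "\xce\xb1\xcc\x81",     /* U+03B1 U+0301 */
    "\xe1\xbd\xb1"          /* U+1F71 */
};

/* True if s is exactly one of the three spellings. */
bool is_alpha_with_tonos(const char *s)
{
    size_t count = sizeof ALPHA_TONOS_FORMS / sizeof ALPHA_TONOS_FORMS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(s, ALPHA_TONOS_FORMS[i]) == 0)
            return true;
    }
    return false;
}
```

A plain strcmp between any two of the three forms returns non-zero, which is why the strings should be normalized before they are compared.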
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character and do comparisons with those. That won't work if the input stream is UTF-8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated string of char (char*), that is, a byte string rather than a string of wide characters. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.
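The decode, compare, re-encode approach described in these answers can be sketched without any platform API. This is not the poster's code: it hand-decodes UTF-8 instead of using wchar_t or MultiByteToWideChar, works on a string already held in memory, and assumes the input is valid UTF-8 in which ά appears as the single code point U+03AC. The names utf8_decode, utf8_encode and replace_greek are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point at p into *cp; return the number of bytes consumed. */
static size_t utf8_decode(const unsigned char *p, uint32_t *cp)
{
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
    return 4;
}

/* Encode cp as UTF-8 into out; return the number of bytes written. */
static size_t utf8_encode(uint32_t cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Copy src to dst, replacing U+03AC (ά) and U+03B5 (ε) with '_'.
   dst must have room for at least strlen(src) + 1 bytes. */
void replace_greek(const char *src, char *dst)
{
    const unsigned char *p = (const unsigned char *)src;
    while (*p != '\0') {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        if (cp == 0x03AC || cp == 0x03B5)
            cp = '_';               /* compare whole code points, not bytes */
        dst += utf8_encode(cp, dst);
    }
    *dst = '\0';
}
```

To process a file, one option is to read it into a buffer with fread, call replace_greek, and write the result with fwrite. The output is never longer than the input, because the replacement is a single byte.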