I have a query about the program below:
char ch;
ch = 'z';
while(ch >= 'a')
{
printf("char is %c and the value is %d\n", ch, ch);
ch = ch-1;
}
Why is printing the whole set of lowercase letters not guaranteed in the above program? If C doesn't make many guarantees about the ordering of characters in its internal form, then who actually decides the ordering, and how?
The compiler implementor chooses their underlying character set. About the only thing the standard has to say is that a certain minimal number of characters must be available and that the numeric characters are contiguous.
The required characters for a C99 execution environment are A through Z, a through z, 0 through 9 (which must be together and in order), any of !"#%&'()*+,-./:;<=>?[\]^_{|}~, space, horizontal tab, vertical tab, form-feed, alert, backspace, carriage return and new line. This remains unchanged in the current draft of C1x, the next iteration of that standard.
Everything else depends on the implementation.
For example, code like:
int isUpperAlpha(char c) {
return (c >= 'A') && (c <= 'Z');
}
will break on a mainframe that uses EBCDIC, which splits the uppercase letters into non-contiguous ranges (A-I, J-R and S-Z), so non-letter codes between those ranges would wrongly pass the test.
Truly portable code will take that into account. All other code should document its dependencies.
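As a sketch of what "taking that into account" can look like, the same check can delegate to the standard library, which knows the execution character set (this assumes a hosted implementation with ctype.h):

```c
#include <ctype.h>

/* isupper() consults the execution character set, so this works
   whether the letters are laid out as in ASCII or as in EBCDIC. */
int isUpperAlpha(char c)
{
    /* the cast avoids undefined behavior when char is signed
       and c holds a negative value */
    return isupper((unsigned char)c) != 0;
}
```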
A more portable implementation of your example would be something along the lines of:
static char chrs[] = "zyxwvutsrqponmlkjihgfedcba";
char *pCh = chrs;
while (*pCh != 0) {
printf ("char is %c and the value is %d\n", *pCh, *pCh);
pCh++;
}
If you want a real portable solution, you should probably use islower() since code that checks only the Latin characters won't be portable to (for example) Greek using Unicode for its underlying character set.
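For instance, a sketch using only standard ctype.h, which lets the library pick out the lowercase letters instead of hard-coding either a range or a string:

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Visit every possible unsigned char value and let islower() decide
   which ones count as lowercase in the current locale. */
int print_lowercase(void)
{
    int count = 0;
    for (int ch = 0; ch <= UCHAR_MAX; ch++) {
        if (islower(ch)) {
            printf("char is %c and the value is %d\n", ch, ch);
            count++;
        }
    }
    return count;   /* 26 in the default "C" locale */
}
```

In a locale covering, say, Greek or Turkish, the same loop would pick up additional letters without any change to the code.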
Why is the printing of the whole set of lowercase letters not guaranteed in the above program?
Because it's possible to use C with an EBCDIC character encoding, in which the letters aren't consecutive.
Obviously this is determined by the implementation of C you're using, but more than likely for you it's determined by the American Standard Code for Information Interchange (ASCII).
It is determined by whatever the execution character set is.
In most cases nowadays, that is the ASCII character set, but C has no requirement that a specific character set be used.
Note that there are some guarantees about the ordering of characters in the execution character set. For example, the digits '0' through '9' are guaranteed each to have a value one greater than the value of the previous digit.
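That digit guarantee is the one ordering you can rely on everywhere; for example, this conversion is portable to any conforming implementation, ASCII or EBCDIC alike:

```c
/* '0'..'9' are guaranteed contiguous and ascending by the C standard,
   so subtracting '0' yields the numeric value on every character set. */
int digit_value(char c)
{
    return c - '0';
}
```

No such guarantee exists for the letters, which is exactly why the countdown loop in the question isn't portable.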
These days, people going around calling your code non-portable are engaging in useless pedantry. Support for ASCII-incompatible encodings only remains in the C standard because of legacy EBCDIC mainframes that refuse to die. You will never encounter an ASCII-incompatible char encoding on any modern computer, now or in the future. Give it a few decades, and you'll never encounter anything but UTF-8.
To answer your question about who decides the character encoding: While it's nominally at the discretion of your implementation (the C compiler, library, and OS), it was ultimately decided by the internet, both existing practice and IETF standards. Presumably modern systems are intended to communicate and interoperate with one another, and it would be a huge headache to have to convert every protocol header, html file, javascript source, username, etc. back and forth between ASCII-compatible encodings and EBCDIC or some other local mess.
In recent times, it's become clear that a universal encoding not just for machine-parsed text but also for natural-language text is also highly desirable. (Natural language text interchange is not as fundamental as machine-parsed text, but still very common and important.) Unicode provided the character set, and as the only ASCII-compatible Unicode encoding, UTF-8 is pretty much the successor to ASCII as the universal character encoding.
Related
I want to use √ symbol in the program written below.
#include <stdio.h>
int main(){
char a='√';
if (a=='√'){
printf("Working");
}
else{
printf("Not working");
}
}
√ is not ASCII; that's why it's not working. But I want to know how to make it work.
Thanks in advance.
There are two different things going on here to be aware of:
The source C file itself may not be able to contain this character correctly.
The char type within the semantics of the actual program does not support this character, either.
As to the first issue, it depends on your platform (etc) but being conservative with C source is most portable, which means sticking to ASCII characters only within the code file. That means, e.g., in comments as well as within meaningful code. That said, lots of platforms will allow and support Unicode characters inside the source files.
Regarding the second, a char is the old-fashioned type for holding characters, and it is limited to an octet, which means arbitrary Unicode characters with values above 0xFF just don't fit inside one. (Some non-ASCII characters above 0x7F do fit, in a platform-dependent way, e.g. via Windows code pages.) In this case, I would treat the character as a string instead, using a Unicode escape sequence for it: "\u221A".
#include <string.h>  /* for strcmp */

const char *sqrt = "\u221A";
if (strcmp(sqrt, "\u221A") == 0) {
    printf("Working");
} else {
    printf("Not working");
}
Heads-up that C strings (char*) are not really designed around non-ASCII characters either, so in this case you end up embedding the UTF8 encoded representation of the character (which is three bytes long) inside the char string. This works, preserves the value, and the compare works, but if you're going to be working with unicode more generally...
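You can see that three-byte representation directly. This is a sketch; it assumes the compiler stores \u escapes in narrow string literals as UTF-8, which gcc and clang do by default, though the standard leaves the execution encoding implementation-defined:

```c
#include <string.h>

/* "\u221A" (the square root sign) becomes the UTF-8 bytes E2 88 9A,
   so the string holds three chars plus the terminating NUL. */
size_t sqrt_utf8_length(void)
{
    const char *s = "\u221A";
    return strlen(s);
}
```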
If your platform supports "wide characters" (wchar_t or unichar or similar) that can hold Unicode characters, then you can use those types to hold this character, and do direct equality comparisons like you were doing:
wchar_t sqrt = L'\u221A';
if (sqrt == L'\u221A') {
...
(FYI: be aware that these wide char types may not be wide enough for arbitrary Unicode code points on your platform, so they might work for the square root character, but not, say, an emoji.)
Finally, for the sake of completeness, I feel honor-bound to admit that given a contemporary development environment/toolchain and target platform, you could probably get away with using the explicit character in a widechar literal like so:
wchar_t sqrt = L'√';
if (sqrt == L'√') {
....
But I'm old-fashioned, this feels sketchy, and I don't recommend it. :)
How do I check in C if an array of uint8 contains only ASCII elements?
If possible please refer me to the condition that checks if an element is ASCII or not
Your array elements are uint8, so they must be in the range 0-255.
For the standard ASCII character set, only bytes 0-127 are used, so you can use a for loop to iterate through the array, checking whether each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string.
From your example comment, this could be implemented like this:
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint8_t */

int checkAscii (const uint8_t *array, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (array[i] > 127) return 0;  /* non-ASCII element found */
    }
    return 1;  /* every element is in the ASCII range */
}
It breaks out early at the first element greater than 127.
All valid ASCII characters have value 0 to 127, so the test is simply a value check or 7-bit mask. For example given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly however you intended only printable ASCII characters (excluding control characters). In that case, given inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch)) ;
Your intent may be slightly different in terms of which characters you intend to include in your set - in which case other functions in ctype.h may be applied, or you can simply test for specific values or ranges to include/exclude.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" set uses locale-specific codepages to define the glyphs associated with codes 128 to 255. That is to say, the set changes depending on language/locale settings to accommodate different language characters, accents and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (of which there are several, with either fixed or variable length codes). UTF-8 is a variable-width encoding in which all single-byte encodings are also ASCII codes. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set - and that is application specific.
You cannot check if something is "ASCII" with standard C.
Because C does not specify which symbol table is used by a compiler. Various other more or less exotic symbol tables exist or have existed.
UTF-8, for example, is a superset of ASCII. Older, dysfunctional 8-bit symbol tables have existed, such as EBCDIC and "extended ASCII". Telling whether something is, for example, ASCII or EBCDIC can't be done trivially without a long series of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check that only the low 7 bits are set: if ((ch & 0x7F) == ch).
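As a minimal sketch of that second check:

```c
/* True when the value uses no more than the low 7 bits, i.e. it lies
   in 0..127 -- the closest portable proxy for "is in the ASCII range". */
int fits_in_7_bits(int ch)
{
    return (ch & 0x7F) == ch;
}
```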
In C programming, a character variable holds a small integer character code rather than the character itself; on ASCII systems that code is a number between 0 and 127.
The ASCII values of the lowercase letters run from 97 to 122, and those of the uppercase letters from 65 to 90.
Instead of giving you the actual code, I am giving you an example.
You can assign int to char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach.
An integer can be assigned directly to a character. A character is different mostly just because of how it is interpreted and used.
char c = atoi("47");
Try to implement it yourself once you understand the logic above properly.
I'm trying to write a program that counts all the characters in a string of Turkish text. I can't see why this does not work. I included the locale header and added setlocale(LC_ALL, "turkish"), but it still doesn't work. Thank you. Here is my code:
My file's character encoding: UTF-8
int main(){
setlocale(LC_ALL,"turkish");
char string[9000];
int c = 0, count[30] = {0};
int bahar = 0;
...
if ( string[c] >= 'a' && string[c] <= 'z' ){
count[string[c]-'a']++;
bahar++;
}
my output:
a 0.085217
b 0.015272
c 0.022602
d 0.035736
e 0.110263
f 0.029933
g 0.015272
h 0.053146
i 0.071167
k 0.010996
l 0.047954
m 0.025046
n 0.095907
o 0.069334
p 0.013745
q 0.002443
r 0.053451
s 0.073916
t 0.095296
u 0.036958
v 0.004582
w 0.019243
x 0.001527
y 0.010996
This is the English alphabet, but I need these characters counted too: ğ, ü, ç, ı, ö.
setlocale(LC_ALL,"turkish");
First: "turkish" isn't a locale.
The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.
What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").
if ( string[c] >= 'a' && string[c] <= 'z' ){
Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.
To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.
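A sketch of that byte-to-character separation with mbtowc (it assumes setlocale() has already selected a UTF-8 locale when the input contains multi-byte sequences; plain ASCII works in any locale):

```c
#include <locale.h>
#include <stdlib.h>

/* Counts characters (not bytes) in a multibyte string by asking
   mbtowc how many bytes each character occupies.  Returns -1 on an
   invalid or truncated sequence. */
int count_mb_chars(const char *s)
{
    int count = 0;
    mbtowc(NULL, NULL, 0);                     /* reset conversion state */
    while (*s != '\0') {
        int len = mbtowc(NULL, s, MB_CUR_MAX); /* bytes in next character */
        if (len <= 0)
            return -1;
        s += len;
        count++;
    }
    return count;
}
```

The same stepping logic is what you would use to index into a count array for the Turkish letters, once each multi-byte sequence has been reduced to a single wide character.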
Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using
if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
abort();
}
abort() will stop the program from executing if the locale could not be set; reaching any code after that check means the locale was set correctly.
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). However, this isn't entirely portable to other platforms. It's a lot cleaner than the alternatives, however, which is likely more important in this particular case.
If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.
Once you're ready to read the file, you can use fgetws with a wchar_t array.
Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:
// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
count[(size_t)(lchptr-lower)]++;
bahar++;
} else if (uchptr) {
count[(size_t)(uchptr-upper)]++;
bahar++;
}
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[8]++), just like İ (\u0130) and i are considered the same (count[29]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
Edit
As @JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. That, however, would only tell you that the character is an alphabetic character; it wouldn't tell you which index of your count array to use. The truth is that there is no universal answer here, because some languages use only a few characters with diacritic marks rather than an entire group where you could just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even when you have read the data, you still need to convert it to fit your solution, and that requires knowledge of what you're working with in order to create a mapping from characters to integer values. My code does this by using strings of valid characters, where each character's index in the string is its index into the count array (i.e. lower[29] means count[29]++ is executed, and upper[18] means count[18]++ is executed).
The solution depends on the character encoding of your files.
If the file is in ISO 8859-9 (latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: You already have a distiction between upper case and lower case. Just add more branches for the special characters.
If the file is in UTF-8, or some other unicode encoding, you need a multi-byte capable string library.
I am trying to do my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I did a hex dump of a text file and found out that these characters occupy more than one byte, so they won't fit in a char. Is there any way I can read these characters from a file and handle them as single characters (in order to count the characters in a file) in C?
I've been googling a little bit and found the wchar_t type, but there were not any simple examples of how to use it with files.
I've been googling a little bit and found the wchar_t type, but there was not any simple example of how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)
If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment - eg. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF8 locale, the wide character functions will handle input and output in UTF8. (This is how the POSIX wc utility is specified to work).
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;
while ((c = getchar()) != EOF)
{
if (isspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;
while ((c = getwchar()) != WEOF)
{
if (iswspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}
Go have a look at ICU. That library is what you need to deal with all the issues.
Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually.
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
Hope this helps.
Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (i.e., in the range 0x80 to 0xBF).
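A sketch of that trail-byte approach (it assumes the input really is valid UTF-8):

```c
#include <stddef.h>

/* Every UTF-8 continuation (trail) byte matches the bit pattern
   10xxxxxx.  Counting the bytes that do NOT match gives the number
   of code points in a valid UTF-8 string. */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```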
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the total number of bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)