Check if array is ASCII - c

How do I check in C if an array of uint8 contains only ASCII elements?
If possible please refer me to the condition that checks if an element is ASCII or not

Your array elements are uint8, so must be in the range 0-255
For standard ASCII character set, bytes 0-127 are used, so you can use a for loop to iterate through the array, checking if each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string
From your example comment, this could be implemented like this:
int checkAscii (uint8 *array) {
for (int i=0; i<LEN; i++) {
if (array[i] > 127) return 0;
}
return 1;
}
It breaks out early at the first element greater than 127.

All valid ASCII characters have value 0 to 127, so the test is simply a value check or 7-bit mask. For example given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly however you intended only printable ASCII characters (excluding control characters). In that case, given inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
(isprint() || isspace()) ;
Your intent may be lightly different in terms of what characters you intend to include in your set - in which case other functions in ctype.h may be applied or simply test the values for value or range to include/exclude.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" set uses locale specific codepages to define the glyphs associated with codes 128 to 255. That is to say the set changes depending on language/locale settings to accommodate different language characters, accents and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (or which there are several with either fixed or variable length codes). UTF-8 encoding is a variable width encoding where all single byte encodings are also ASCII codes. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set - and that is application specific.

You cannot check if something is "ASCII" with standard C.
Because C does not specify which symbol table that is used by a compiler. Various other more or less exotic symbol tables exists/existed.
UTF8 for example, is a superset of ASCII. Older, dysfunctional 8 bit symbol tables have existed, such as EBCDIC and "Extended ASCII". To tell if something is for example ASCII or EBCDIC can't be done trivially, without a long line of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check if it only has up to 7 bits only set, if((ch & 0x7F)==ch).

In C programming, a character variable holds ASCII value (an integer number between 0 and 127) rather than that character itself.
The ASCII value of lowercase alphabets are from 97 to 122. And, the ASCII value of uppercase alphabets are from 65 to 90.
incase of giving the actual code , i am giving you example.
You can assign int to char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach.
An integer can be assigned directly to a character. A character is different mostly just because how it is interpreted and used.
char c = atoi("47");
Try to implement this after understand the following logic properly.

Related

What happens when we make an array defined using characters instead of integers in C?

This is a code I have used to define an array:
int characters[126];
following which I wanted to get a record of the frequencies of all the characters recorded for which I used the while loop in this format:
while((a=getchar())!=EOF){
characters[a]=characters[a]+1;
}
Then using a for loop I print the values of integers in the array.
How exactly is this working?
Does C assign a specific number for letters ie. a,b,c, etc in the array?
What happens when we make an array defined using characters instead of integers in C?
Let's be sure we are clear: you are using integer values returned by getchar() as indexes into your array. This is not defining the array, it is just accessing its elements.
Does C assign a specific number for letters ie. a,b,c, etc in the array?
There are no letters in the array. There are ints. However, yes, the characters read by getchar() are encoded as integer values, so they are, in principle, suitable array indexes. Thus, this line ...
characters[a]=characters[a]+1;
... reads the int value then stored at index a in array characters, adds 1 to it, and then assigns the result back to element a of the array, provided that the value of a is a valid index into the array.
More generally, it is important to understand that although one of its major uses is to represent characters, type char is an integer type. Its values are numbers. The mapping from characters to numbers is implementation and context dependent, but it is common enough for the mapping to be consistent with the ASCII code that you will often see programs that assume such a mapping.
Indeed, your code makes exactly such an assumption (and others) by allowing only for character codes less than 126.
You should also be aware that if your characters array is declared inside a function then it is not initialized. The code depends on all elements to be initially to zero. I would recommend this declaration instead:
int characters[UCHAR_MAX + 1] = {0};
That upper bound will be sufficient for all the non-EOF values returned by getchar(), and the explicit zero-initialization will ensure the needed initial values regardless of where the array is declared.
I have realized the charecter set that can function as an input for getchar() is part of the ASCII table and comes under an int. I used the code following to find that out:
#include <stdio.h>
int main(){
int a[128];
a['b']=4;
printf("%d",a[98]); //it is 98 as according to the table 'b' is assigned the value of 98
}
following which executing this code i get the output of 4.
I am really new to coding so feel free to correct me.
Character values are represented using some kind of integer encoding - ASCII (very common), EBCDIC (mostly IBM mainframes), UTF-8 (backward-compatible to ASCII), etc.
The character value 'a' maps to some integer value - 97 in ASCII and UTF-8, 129 in EBCDIC. So yes, you can use a character value to index into an array - arr['a']++ would be equivalent to arr[97]++ if you were using ASCII or UTF-8.
The C language does not dictate this - it's determined by the underlying platform.

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error.The length of the two strings are the same.
#include<stdio.h>
#include<string.h>
int main()
{
printf("Enter your string to be encrypted ");
char a[50];
char b[50];
scanf("%s",a);
printf("Enter the key ");
scanf("%s",b);
char c[50];
int q=strlen(a);
int i=0;
for(i=0;i<q;i++)
{
c[i]=(char)(a[i]^b[i]);
}
printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?
I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the
printer and are based on telex technology. With these characters, you
can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters
nor numbers. These include punctuation or technical, mathematical
characters. ASCII also includes the space (a non-visible but printable
character), and, therefore, does not belong to the control characters
category, as one might suspect.
Numbers (30–39): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second
group containing the lowercase.
Using the following two strings and the following code:
char str1 = {"asdf"};
char str1 = {"jkl;"};
Following demonstrates XORing the elements of the strings:
int main(void)
{
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
for(int i=0;i<sizeof(str1)/sizeof(str1[i]);i++)
{
printf("%d ^ %d: %d\n", str1[i],str2[i], str1[i]^str2[i]);
}
getchar();
return 0;
}
While all of the input characters are printable (except the NULL character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.

Counting Turkish character in C

I'm trying to write a program that counts all the characters in a string at Turkish language. I can't see why this does not work. i added library, setlocale(LC_ALL,"turkish") but still doesn't work. Thank you. Here is my code:
My file character encoding: utf_8
int main(){
setlocale(LC_ALL,"turkish");
char string[9000];
int c = 0, count[30] = {0};
int bahar = 0;
...
if ( string[c] >= 'a' && string[c] <= 'z' ){
count[string[c]-'a']++;
bahar++;
}
my output:
a 0.085217
b 0.015272
c 0.022602
d 0.035736
e 0.110263
f 0.029933
g 0.015272
h 0.053146
i 0.071167
k 0.010996
l 0.047954
m 0.025046
n 0.095907
o 0.069334
p 0.013745
q 0.002443
r 0.053451
s 0.073916
t 0.095296
u 0.036958
v 0.004582
w 0.019243
x 0.001527
y 0.010996
This is English alphabet but i need this characters calculate too: "ğ,ü,ç,ı,ö"
setlocale(LC_ALL,"turkish");
First: "turkish" isn't a locale.
The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.
What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").
if ( string[c] >= 'a' && string[c] <= 'z' ){
Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.
To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.
Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using
if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
abort();
}
That will stop the program from executing. Anything after that code means that the locale was set correctly.
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). However, this isn't entirely portable to other platforms. It's a lot cleaner than the alternatives, however, which is likely more important in this particular case.
If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.
Once you're ready to read the file, you can use fgetws with a wchar_t array.
Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:
// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
count[(size_t)(lchptr-lower)]++;
bahar++;
} else if (uchptr) {
count[(size_t)(uchptr-upper)]++;
bahar++;
}
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[8]++), just like İ (\u0130) and i are considered the same (count[29]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
Edit
As #JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. This, however, would only allow you to know that the character is an alphabetic character; it wouldn't tell you the index of your count array to use, and the truth is that there is no universal answer to do so because some languages use only a few characters with diacritic marks rather than an entire group where you can just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even when you have read the data, you still need to convert it to fit your solution, and that requires knowledge of what you're working with to create a mapping from characters to integer values, which my code does by using strings of valid characters and the indices of each character in the string as the indices of the count array (i.e. lower[29] will mean count[29]++ is executed, and upper[18] will mean count[18]++ is executed).
The solution depends on the character encoding of your files.
If the file is in ISO 8859-9 (latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: You already have a distiction between upper case and lower case. Just add more branches for the special characters.
If the file is in UTF-8, or some other unicode encoding, you need a multi-byte capable string library.

C programming - integers and characters

I ripped this from an ebook on C programming.
I understand that ASCII representations of the characters '0' and '9' are integers, so I understand the compatibility with the integer array. I am simply not sure how the shown output is computed? There input is the code itself.
What does this statement mean?
++ndigit[c-'0'];
So, is the program essentially checking if the input is one of the first 10 installments of of the ASCII code table?
ASCII CODE
No, it doesn't.
c - '0' subtracts the (not necessarily ASCII) character code of the character 0 from that of c. This will yield a number between 0 and 9 if c is a digit. Then, the resulting integer is used to index the zero-initialized ndigit array using the [] operator, and the prefix increment operator (++) is then used to increment the element at that particular index.
By the way, the code is erroneous at multiple places. I suggest you switch to another book because this one appears to be either outdated and/or encouraging the use of several types of bad programming practice.
First, main() doesn't have a return type, which is an error. It needs to be declared as int main() or int main(void) or int main(int, char **). Older compilers had the bad habit of assuming an implicit int return type if it was omitted, but this behavior is now deprecated.
Second, it would be better to initialize the ndigit array, like this:
int ndigit[10] = { 0 };
The for loop is superfluous because we can have initialization; it's also less readable than the initialization syntax, and it's also dangerous: the author doesn't calculate the count of the array using sizeof(ndigits) / sizeof(ndigits[0]), but he hardcodes its length, which may cause a buffer overrun when the length of the array is changed (decreased) and the hard-coded length value in the for loop is forgotten about.
The program computes the number of times a digit between 0 and 9 was introduced as input, how many white spaces and how many other characters were in the input.
++ndigit[c-'0'];
'0' - as integer is the ASCII code for 0.
c - is the read character (its ASCII code)
c - '0' = the actual digit (between 0 and 9) represented by the ASCII code c.
For example '3'(ASCII) would be 3(digit=integer) + '0'(ASCII)
So that's how you obtain the index in the array for your digit and you increment the number of times that digit showed up.

Who determines the ordering of characters

I have a query based on the below program -
char ch;
ch = 'z';
while(ch >= 'a')
{
printf("char is %c and the value is %d\n", ch, ch);
ch = ch-1;
}
Why is the printing of whole set of lowercase letters not guaranteed in the above program. If C doesn't make many guarantees about the ordering of characters in internal form, then who actually does it and how ?
The compiler implementor chooses their underlying character set. About the only thing the standard has to say is that a certain minimal number of characters must be available and that the numeric characters are contiguous.
The required characters for a C99 execution environment are A through Z, a through z, 0 through 9 (which must be together and in order), any of !"#%&'()*+,-./:;<=>?[\]^_{|}~, space, horizontal tab, vertical tab, form-feed, alert, backspace, carriage return and new line. This remains unchanged in the current draft of C1x, the next iteration of that standard.
Everything else depends on the implementation.
For example, code like:
int isUpperAlpha(char c) {
return (c >= 'A') && (c <= 'Z');
}
will break on the mainframe which uses EBCDIC, dividing the upper case characters into two regions.
Truly portable code will take that into account. All other code should document its dependencies.
A more portable implementation of your example would be something along the lines of:
static char chrs[] = "zyxwvutsrqponmlkjihgfedcba";
char *pCh = chrs;
while (*pCh != 0) {
printf ("char is %c and the value is %d\n", *pCh, *pCh);
pCh++;
}
If you want a real portable solution, you should probably use islower() since code that checks only the Latin characters won't be portable to (for example) Greek using Unicode for its underlying character set.
Why is the printing of whole set of
lowercase letters not guaranteed in
the above program.
Because it's possible to use C with an EBCDIC character encoding, in which the letters aren't consecutive.
Obviously determined by the implementation of C you're using, but more then likely for you it's determined by the American Standard Code for Information Interchange (ASCII).
It is determined by whatever the execution character set is.
In most cases nowadays, that is the ASCII character set, but C has no requirement that a specific character set be used.
Note that there are some guarantees about the ordering of characters in the execution character set. For example, the digits '0' through '9' are guaranteed each to have a value one greater than the value of the previous digit.
These days, people going around calling your code non-portable are engaging in useless pedantry. Support for ASCII-incompatible encodings only remains in the C standard because of legacy EBCDIC mainframes that refuse to die. You will never encounter an ASCII-incompatible char encoding on any modern computer, now or in the future. Give it a few decades, and you'll never encounter anything but UTF-8.
To answer your question about who decides the character encoding: While it's nominally at the discression of your implementation (the C compiler, library, and OS) it was ultimately decided by the internet, both existing practice and IETF standards. Presumably modern systems are intended to communicate and interoperate with one another, and it would be a huge headache to have to convert every protocol header, html file, javascript source, username, etc. back and forth between ASCII-compatible encodings and EBCDIC or some other local mess.
In recent times, it's become clear that a universal encoding not just for machine-parsed text but also for natural-language text is also highly desirable. (Natural language text interchange is not as fundamental as machine-parsed text, but still very common and important.) Unicode provided the character set, and as the only ASCII-compatible Unicode encoding, UTF-8 is pretty much the successor to ASCII as the universal character encoding.

Resources