Clear explanation of ndigit[c - '0'] - c

I've finally started reading K&R, and I've just reached the section on arrays.
In the example in this section there's a piece of code I don't completely understand, so I'd like to ask for a clear explanation of it. I want to understand every concept of C, since I know it's the foundation for learning C++ quickly.
By the way, I already have a decent knowledge of Java; I hope this helps you frame your explanation.
Question:
In this piece of code, ndigit[c - '0'], I don't understand what it's trying to do. I know from other Stack Overflow questions that '0' refers to the ASCII standard and should be 48, but I still don't understand how c and that '0' relate to each other.

Short answer:
With ndigit[c - '0'] you are not doing arithmetic on the digits you see, e.g. [5 + 5] = [10]. Rather, the characters are first represented by their character codes, and the arithmetic is done on those codes.
To put it simply: in English we represent five with the sign "5". Other languages have their own sign for five, e.g. 五 is five in Japanese. Likewise, the computer represents the character 5 with the code 53 (assuming ASCII; ASCII is like a language). So in ndigit[c - '0'], the characters are first converted to character codes, and then the arithmetic is performed.
Long answer:
Let's go through ndigit[c - '0'] first.
ndigit = name of the array
[ ] = subscript (index) operator, used here to select one element of the array. (In the declaration int ndigit[10], the brackets instead give the total number of elements.)
c - '0' = arithmetic operation. c is a variable and - is minus.
'0' is not 0 but 48 (in ASCII).
Now let me add some more code from the book:
int ndigit[10];
... // fill in the array with 0s
while ((c = getchar()) != EOF)
    if (c >= '0' && c <= '9')
        ++ndigit[c - '0']; // <== unable to understand this part
Something to note here is getchar(). getchar() returns a value of type int. That int is not the character as you typed it but its character code. Let me give an example:
#include <stdio.h>
int main(void)
{
    int c;
    c = getchar();
    printf("%d\n", c);
}
output:
#mix:~
$ cc test.c
#mix:~
$ ./a.out
5 ;wrote 5 in terminal
53 ; printed 53 instead of 5
5 is the character. And 53 is the character code of 5.
Now let's get back to our main topic, ndigit[c - '0']. c gets its value from getchar(), and getchar() reads from the input. Let's say the input is 5. Because of how getchar() behaves, c will contain 53 instead of 5. So
ndigit[c - '0'] == `ndigit[53 - '0']` != `ndigit[5 - '0']`
Also notice that we are not using 0 but '0'.
Using a plain 0 would mean the arithmetic
ndigit[53 - 0]
= ndigit[53] ;**wrong**
Using '0' (in single quotes) means we want the character code. As described above, '0' = 48 (according to ASCII), so
ndigit[53 - '0']
= ndigit[53 - 48]
= ndigit[5]
Now we have recovered the digit that was read by getchar(). But why do we need to recover it? Would ndigit[c] work instead of ndigit[c - '0']?
ndigit[c] won't work because at the start of our code we declared ndigit[10]. The array holds at most 10 elements, so the only valid indices are 0 through 9. ndigit[53] is invalid because the index 53 lies far beyond the end of the array. That's why we use ndigit[c - '0']: subtracting the character codes gives an index below 10.
If it's still unclear, search for and learn about the following:
character encoding
arrays in C

In this code c is a char, presumably representing a digit. In C a char is an integral type, so you can perform arithmetic operations on them.
Digits are encoded with numbers from a consecutive range: if the code for '0' is k, the code for '1' is k+1, the code for '2' is k+2, and so on. That is why by subtracting '0' from a character representing a digit you get the numeric value of that digit.
For example, by subtracting '5'-'0' you get a numeric 5 instead of character '5'.
If you make an array ndigit[10], then ndigit[c - '0'] lets you access an array element corresponding to the digit. This can be used, for example, to count the number of different digits in the input.

As you said, '0' is equal to 48 (assuming ASCII encoding). The other digits are equal to 49 through 57 respectively, so '1' is equal to 49, '2' to 50, etc. Thus '1' - '0' is equal to 49 - 48, which is 1, '2' - '0' is equal to 50 - 48, which is 2, and so on.
In other words c - '0' converts a digit like '5' to its integer equivalent (which would be 5 for '5').

Related

Comparing single quote numbers instead of regular numbers(numbers without quotes)? C Programming Language K&R

This is part of a code to count white spaces, numbers, or other from the K&R "C programming book." I am confused why it compares "int c" to digits using '0' and '9' instead of 0 and 9. I realize the code doesn't work if I use 0 and 9 without quotes. I am just trying to understand why. Does this have to do with c being equal to getchar()?
while ((c = getchar()) != EOF)
    if (c >= '0' && c <= '9')
        ++ndigit[c-'0'];
    else if (c == ' ' || c == '\n' || c == '\t')
        ++nwhite;
    else
        ++nother;
Looking at the man page for getchar, we see that it returns the character read as an unsigned char cast to an int. So the value stored is not the number the digit stands for but its character code (ASCII, on most systems), and it can be compared with character constants such as '0' and '9'.
A char is usually just an integer, whose meaning is given by some character set, for example ASCII.
So, for example, we could store "Hello" as the sequence 72, 101, 108, 108 and 111.
Using single quotes (as in '9') we say that we mean the number which represents the character '9'. Behind the scenes the computer only knows numbers, so this ends up as the code 57 in our example (the character '9' maps to code 57 in the ASCII table). For more examples, see an ASCII table.
Same counts for the chars in our input data. Also those are encoded into those numbers according to the charset we're using.
In contrast if we would just use a plain 9 we would ask for exactly the code 9. And not "the code which represents char 9". That's the difference.
BTW: There's another "trick" used in the code sample. it is c-'0' which asks to subtract "the code behind the character '0'" from our current character c. If we do this, we will end up with the digit not as the character, but as the number behind it. Example:
Assume c is the character '4'.
So in c it is stored as the code 52 (see ASCII table)
If we now want the numeric value 4 in place of the character '4' we just subtract the character '0' from it (code 48 in ASCII)
So 52 - 48 will end up as 4 (not a char but the number behind it)
getchar() returns an int rather than a char so that it can also return EOF, a negative value (commonly -1) that is distinct from every character code. If it returned a char, there would be no value left over to signal end of file or an error.
Moreover, '9' is a character constant whose value is the character set code for the digit character '9' and not the integer value 9. In C (but not C++) it has type int, so there is in any case no type mismatch in an expression such as c <= '9'; it is an int comparison.
Even if that were not the case, and character constants had type char (as in C++), there would be an implicit promotion to int before the comparison.
Also, you need to understand that a char is not specifically a character, but rather simply an integer type that is the:
Smallest addressable unit of the machine that can contain basic character set.

What's the logic behind using 'a' - 'A' instead of "32" or the space character?

This is a code from the book "The C Programming Language" which maps a single character to lower case for the ASCII character set and returns unchanged, if the character is not an upper case letter:
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}
I don't understand the logic behind return c + 'a' - 'A';.
Why didn't they simply put ' ' or the number 32 instead of 'a' - 'A'?
In the ASCII character set 'a' - 'A' just happens to have a value of 32. It's completely unrelated to the ASCII space character ' ' also having a value of 32, so it makes no sense to replace 'a' - 'A' with ' '.
Using 'a' - 'A' is much more meaningful and understandable than 32, and also doesn't tie the implementation to using a specific character set (though a-z and A-Z need to be contiguous for it to work, which isn't true for all character sets).
Why not 32? Because "magic numbers" are bad.
By using 'a'-'A' it makes it clear to the reader that the difference in character encoding between upper case and lower case is being added to the current character encoding.
Note that this also depends on the set of upper case characters being contiguous as well as the set of lower case characters. This is true for ASCII but not necessarily in general.
c - 'A': gives you the letter's position in the alphabet, not in the character set.
So, for example, if you pass 'A' to c - 'A' you get 0, because anything minus itself is zero; if you pass 'B' you get 1; if you pass 'C' you get 2, and so on. You get a number from 0 to 25 (the English alphabet has 26 letters, which are numbered from 0 here).
+ 'a': turns that position into a lower-case letter. It places your letter's position in the lower-case sequence of the character set.
So, for example, if you pass 'A', you get 0; then 0 + 'a' gives you the letter 'a'. If you pass 'B', you get 1; then 1 + 'a' gives you 'b', which comes right after 'a'. If you pass 'C', you get 2; then 2 + 'a' gives you 'c', which is two letters after 'a', and so on.
Also consider the following:
Take a look at ASCII table.
This function is designed to work with character sets whose order matches English alphabetical order: character sets in which the letters are contiguous (A, B, C, D...) and in which the lower-case and upper-case letters are a fixed distance apart.
It's the same reason as for writing things like
val = 10 * val + digitchar - '0';
when you're writing code to convert a string of digits to the corresponding integer.
The "obvious" way to write it would be
val = 10 * val + digitchar - 48;
But where did that magic number 48 come from? You had to look it up on the ASCII chart, and if I'm not familiar with it, I have to look it up on an ASCII chart to figure out how your program works. It saves both of us time if you write the constant '0' instead. (And, incidentally, using the constant '0' means that the program is portable to a machine using a character set other than ASCII, if anyone cares.)
Similarly, if I know that the codes for the upper- and the lower-case letters are in the same order but separated by some amount, using the computation 'a' - 'A' to represent that amount is again easier on both me and my reader than it would be if I went to my ASCII chart and worked out that the offset is actually 32.
In both cases, the principle here is Let the machine do the dirty work.
I agree, it's a little cryptic at first. If you're used to looking things up on the ASCII chart whenever you need to, it can be very disorienting to see those strange scraps of code like 'a' - 'A' and digitchar - '0'. Once you get used to the idioms, though, they're so much easier and less trouble.
Think about it from the authors' perspective: they tried to build on knowledge from earlier pages, where ASCII codes were introduced. They had also discussed type conversion in earlier code examples and paragraphs. The authors wanted beginners to understand that int and char can be used interchangeably. As others have said, it is also cleaner. Even for someone who already knows C programming, this question might arise.
We can also avoid the magic number, 32 in this case.

What is the size of a given char?

I've got some code in C that I cannot understand.
char c is a character that is supposed to be randomized.
Here is my question, however: 26 is supposed to be a range of values starting from 97. That is easy to understand for an integer, but in the case of a char I have no clue what it is supposed to mean.
char c = (char) rand() % 26 + 97;
That is meant to generate a random character. In ASCII, the lower-case letters start at 97. The idea is to take 97 and add a random number between 0 and 25 to it, which gives a random lower-case letter. Note that the cast binds more tightly than %, so as written it converts only the result of rand() to char; writing (char) (rand() % 26 + 97) expresses the intent more reliably.
97 = ascii 'a'
It generates a random character between 'a' and 'z' inclusive.
Ref: ASCII values
It's a bad, non-portable way to generate a random character by directly computing the ASCII code.
A better, more portable, way is to randomize the index into a table of characters. This pushes the responsibility for what code is used to represent each character into the compiler, where it belongs:
#include <stdlib.h>

char random_char(void)
{
    const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
    /* sizeof alpha counts the terminating '\0', so exclude it */
    return alpha[rand() % (sizeof alpha - 1)];
}
Any decent compiler will very likely inline the above.
NOTE Using % to range-limit the return value of rand() is generally frowned upon, but that's not the focus here.

C programming - integers and characters

I ripped this from an ebook on C programming.
I understand that ASCII representations of the characters '0' and '9' are integers, so I understand the compatibility with the integer array. I am simply not sure how the shown output is computed. The input is the code itself.
What does this statement mean?
++ndigit[c-'0'];
So, is the program essentially checking if the input is one of the first 10 entries of the ASCII code table?
No, it doesn't.
c - '0' subtracts the (not necessarily ASCII) character code of the character 0 from that of c. This will yield a number between 0 and 9 if c is a digit. Then, the resulting integer is used to index the zero-initialized ndigit array using the [] operator, and the prefix increment operator (++) is then used to increment the element at that particular index.
By the way, the code is erroneous at multiple places. I suggest you switch to another book because this one appears to be either outdated and/or encouraging the use of several types of bad programming practice.
First, main() doesn't have a return type, which is an error. It needs to be declared as int main() or int main(void) or int main(int, char **). Older compilers assumed an implicit int return type if it was omitted, but this implicit int rule was removed from the language in C99.
Second, it would be better to initialize the ndigit array, like this:
int ndigit[10] = { 0 };
The for loop is superfluous because an initializer does the job; it's also less readable than the initialization syntax, and it's dangerous: the author doesn't calculate the element count of the array using sizeof(ndigit) / sizeof(ndigit[0]) but hard-codes its length, which may cause a buffer overrun if the length of the array is later decreased and the hard-coded value in the for loop is forgotten.
The program computes the number of times a digit between 0 and 9 was introduced as input, how many white spaces and how many other characters were in the input.
++ndigit[c-'0'];
'0' - as integer is the ASCII code for 0.
c - is the read character (its ASCII code)
c - '0' = the actual digit (between 0 and 9) represented by the ASCII code c.
For example '3'(ASCII) would be 3(digit=integer) + '0'(ASCII)
So that's how you obtain the index in the array for your digit and you increment the number of times that digit showed up.

Char to int conversion in C

If I want to convert a single numeric char to its numeric value, for example, if:
char c = '5';
and I want c to hold 5 instead of '5', is it 100% portable doing it like this?
c = c - '0';
I heard that all character sets store the numbers in consecutive order so I assume so, but I'd like to know if there is an organized library function to do this conversion, and how it is done conventionally. I'm a real beginner :)
Yes, this is a safe conversion. C requires it to work. This guarantee is in section 5.2.1 paragraph 3 of the ISO C standard (C11), a late draft of which is N1570:
Both the basic source and basic execution character sets shall have the following
members:
[...]
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
[...]
In both the source and execution basic character sets, the
value of each character after 0 in the above list of decimal digits shall be one greater than
the value of the previous.
Both ASCII and EBCDIC, and character sets derived from them, satisfy this requirement, which is why the C standard was able to impose it. Note that letters are not contiguous in EBCDIC, and C doesn't require them to be.
There is no library function to do it for a single char, you would need to build a string first:
#include <stdlib.h>

int digit_to_int(char d)
{
    char str[2];
    str[0] = d;        /* the single digit character */
    str[1] = '\0';     /* terminate the one-character string */
    return (int) strtol(str, NULL, 10);
}
You could also use the atoi() function to do the conversion, once you have a string, but strtol() is better and safer.
As commenters have pointed out though, it is extreme overkill to call a function to do this conversion; your initial approach to subtract '0' is the proper way of doing this. I just wanted to show how the recommended standard approach of converting a number as a string to a "true" number would be used, here.
Try this:
char c = '5';
int i = c - '0';
You should be aware that this doesn't perform any validation of the character. For example, if the character was 'a' then you would get 97 - 48 = 49 (in ASCII). Especially if you are dealing with user or network input, you should probably perform validation to avoid bad behavior in your program. Just check the range:
if ('0' <= c && c <= '9') {
i = c - '0';
} else {
/* handle error */
}
Note that if you want your conversion to handle hex digits you can check the range and perform the appropriate calculation.
if ('0' <= c && c <= '9') {
i = c - '0';
} else if ('a' <= c && c <= 'f') {
i = 10 + c - 'a';
} else if ('A' <= c && c <= 'F') {
i = 10 + c - 'A';
} else {
/* handle error */
}
That will convert a single hex character, upper or lowercase independent, into an integer.
You can use atoi, which is part of the standard library.
Since you're only converting one character, the function atoi() is overkill. atoi() is useful if you are converting string representations of numbers. The other posts have given examples of this. If I read your post correctly, you are only converting one numeric character. So, you are only going to convert a character that is the range 0 to 9. In the case of only converting one numeric character, your suggestion to subtract '0' will give you the result you want. The reason why this works is because ASCII values are consecutive (like you said). So, subtracting the ASCII value of 0 (ASCII value 48 - see ASCII Table for values) from a numeric character will give the value of the number. So, your example of c = c - '0' where c = '5', what is really happening is 53 (the ASCII value of 5) - 48 (the ASCII value of 0) = 5.
When I first posted this answer, I didn't take into consideration your comment about being 100% portable between different character sets. I did some further looking around and it seems your approach is still mostly correct. The catch is that you are using a char, which is typically an 8-bit data type and cannot represent every character of every character set. Read the article by Joel Spolsky on Unicode for a lot more information. In that article, he says that he uses wchar_t for characters, which has worked well for him, and he publishes his web site in 29 languages. So you would need to change your char to a wchar_t. Other than that, he says that the characters with values of 127 and below are basically the same everywhere. This includes the characters that represent digits, which means the basic math you proposed should work for what you are trying to achieve.
Yes. This is safe as long as you are using standard ascii characters, like you are in this example.
Normally, if there's no guarantee that your input is in the '0'..'9' range, you'd have to perform a check like this:
if (c >= '0' && c <= '9') {
int v = c - '0';
// safely use v
}
An alternative is to use a lookup table. You get simple range checking and conversion with less (and possibly faster) code:
// one-time setup of an array of 256 integers;
// all slots set to -1 except for ones corresponding
// to the numeric characters
static const int CHAR_TO_NUMBER[] = {
-1, -1, -1, ...,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, // '0'..'9'
-1, -1, -1, ...
};
// Now, all you need is:
int v = CHAR_TO_NUMBER[c];
if (v != -1) {
// safely use v
}
P.S. I know that this is an overkill. I just wanted to present it as an alternative solution that may not be immediately evident.
As others have suggested, but wrapped in a function:
int char_to_digit(char c) {
return c - '0';
}
Now just use the function. If, down the line, you decide to use a different method, you just need to change the implementation (performance, charset differences, whatever), you wont need to change the callers.
This version assumes that c contains a char which represents a digit. You can check that before calling the function, using ctype.h's isdigit function.
Since the ASCII codes for '0', '1', '2', ... run from 48 to 57, they are contiguous. Arithmetic operations promote the char operands to int. Hence what you are basically doing is:
53 - 48, which gives the value 5, on which you can perform any integer operation. Note that when converting back from int to char the compiler gives no error; for an 8-bit unsigned char the value is reduced modulo 256 to fit the range, while for a signed char the result of an out-of-range conversion is implementation-defined.
You can simply use the atol() function:
#include <stdio.h>
#include <stdlib.h>
int main()
{
const char *c = "5";
int d = atol(c);
printf("%d\n", d);
}

Resources