I am learning C from the K&R book and I came across code that counts the occurrences of digits, white space characters (blank, tab, newline), and all other characters.
The code is like this:
#include <stdio.h>
/* count digits, white space, others */
main()
{
    int c, i, nwhite, nother;
    int ndigit[10];
    nwhite = nother = 0;
    for (i = 0; i < 10; ++i)
        ndigit[i] = 0;
    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c-'0'];
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;
    printf("digits =");
    for (i = 0; i < 10; ++i)
        printf(" %d", ndigit[i]);
    printf(", white space = %d, other = %d\n",
        nwhite, nother);
}
I need to ask 2 questions..
1st question:
if (c >= '0' && c <= '9')
++ndigit[c-'0'];
I know very well that '0' and '9' represent the ASCII values of 0 and 9 respectively. But what I don't seem to understand is why we even need to use the ASCII value and not the integer itself. Like why can't we simply use
if (c >= 0 && c <= 9)
to find if c lies between 0 and 9?
2nd question:
++ndigit[c-'0']
What does the above statement do?
Why aren't we taking the ASCII value of c here?
Because if we did, it should have been written as ['c'-'0'].
1.
c holds a character code, not the numeric value of the digit. Thus we need to compare it against the character codes '0' and '9'. In ASCII, the integers 0 and 9 correspond to NUL and Tab, which is not what we are looking for.
2.
By subtracting the ASCII value of '0', we get the index corresponding to the digit, and the element at that index is incremented. For example, if our character is '1', then '1' - '0' = 1, so the element at index 1 is incremented. It's a convenient way to keep track of characters. We don't write ['c' - '0'] because we care about the variable c, not the character 'c'.
This table shows how characters are represented; they are different from integers. The main takeaway is '1' != 1.
http://www.asciitable.com/
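As a minimal sketch of the point above: the helper names digit_value and show_digit are made up for illustration and are not from K&R. The comparison against '0' and '9' checks the character code, and the subtraction turns that code into the digit's value.

```c
#include <stdio.h>

/* Hypothetical helper (not from K&R): returns 0..9 for the characters
   '0'..'9', and -1 for any other character code. */
int digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';   /* distance from '0' is the digit's value */
    return -1;
}

/* Prints a character next to its code and its digit value. */
void show_digit(int c)
{
    printf("'%c' has code %d and digit value %d\n", c, c, digit_value(c));
}
```

Calling show_digit('1') prints the character, its code in your system's character set, and the value 1.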
With the current C standards, this would be a perfect exercise for localized wide input:
#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include "wdigit.h"
int main(void)
{
size_t num_space = 0; /* Spaces, tabs, newlines */
size_t num_letter = 0;
size_t num_punct = 0; /* Punctuation */
size_t num_digit[10] = { 0, }; /* Digits - all initialized to zero */
size_t num_other = 0; /* Other printable characters */
size_t total = 0;
wint_t wc;
int digit;
if (!setlocale(LC_ALL, "")) {
fprintf(stderr, "Current locale is not supported by the C library.\n");
return EXIT_FAILURE;
}
if (fwide(stdin, 1) < 1) {
fprintf(stderr, "The C library does not support wide input for this locale.\n");
return EXIT_FAILURE;
}
while ((wc = fgetwc(stdin)) != WEOF) {
total++;
digit = wdigit(wc);
if (digit >= 0 && digit <= 9)
num_digit[digit]++;
else
if (iswspace(wc))
num_space++;
else
if (iswpunct(wc))
num_punct++;
else
if (iswalpha(wc))
num_letter++;
else
if (iswprint(wc))
num_other++;
/* All nonprintable non-whitespace characters are ignored */
}
printf("Read %zu wide characters total.\n", total);
printf("%15zu letters\n", num_letter);
printf("%15zu zeros (equivalent to '0')\n", num_digit[0]);
printf("%15zu ones (equivalent to '1')\n", num_digit[1]);
printf("%15zu twos (equivalent to '2')\n", num_digit[2]);
printf("%15zu threes (equivalent to '3')\n", num_digit[3]);
printf("%15zu fours (equivalent to '4')\n", num_digit[4]);
printf("%15zu fives (equivalent to '5')\n", num_digit[5]);
printf("%15zu sixes (equivalent to '6')\n", num_digit[6]);
printf("%15zu sevens (equivalent to '7')\n", num_digit[7]);
printf("%15zu eights (equivalent to '8')\n", num_digit[8]);
printf("%15zu nines (equivalent to '9')\n", num_digit[9]);
printf("%15zu whitespaces (including newlines and tabs)\n", num_space);
printf("%15zu punctuation characters\n", num_punct);
printf("%15zu other printable characters\n", num_other);
return EXIT_SUCCESS;
}
You also need wdigit.h, a header file that provides wdigit(), a function that returns the decimal digit value (0 to 9, inclusive) if the given wide character is a decimal digit, and -1 otherwise. If this was an exercise, the header file would be provided.
The following "wdigit.h" should support all decimal digits defined in Unicode (which is the closest standard we have to a universal character set). I don't think it is copyrightable (because it is essentially just a listing from the Unicode standard), but if it is, I dedicate it to the public domain:
#ifndef WDIGIT_H
#define WDIGIT_H
#include <wchar.h>
/* wdigits[] are wide strings that contain all known versions of a decimal digit.
For example, wdigits[0] is a wide string that contains all known zero decimal digit
wide characters. You can use e.g.
wcschr(wdigits[0], wc)
to determine if wc is a zero decimal digit wide character.
*/
static const wchar_t *const wdigits[10] = {
L"0" L"\u0660\u06F0\u07C0\u0966\u09E6\u0A66\u0AE6\u0B66\u0BE6\u0C66"
L"\u0CE6\u0D66\u0DE6\u0E50\u0ED0\u0F20\u1040\u1090\u17E0\u1810"
L"\u1946\u19D0\u1A80\u1A90\u1B50\u1BB0\u1C40\u1C50\uA620\uA8D0"
L"\uA900\uA9D0\uA9F0\uAA50\uABF0\uFF10"
L"\U000104A0\U00011066\U000110F0\U00011136\U000111D0\U000112F0"
L"\U00011450\U000114D0\U00011650\U000116C0\U00011730\U000118E0"
L"\U00011C50\U00011D50\U00016A60\U00016B50\U0001D7CE\U0001D7D8"
L"\U0001D7E2\U0001D7EC\U0001D7F6\U0001E950",
L"1" L"\u0661\u06F1\u07C1\u0967\u09E7\u0A67\u0AE7\u0B67\u0BE7\u0C67"
L"\u0CE7\u0D67\u0DE7\u0E51\u0ED1\u0F21\u1041\u1091\u17E1\u1811"
L"\u1947\u19D1\u1A81\u1A91\u1B51\u1BB1\u1C41\u1C51\uA621\uA8D1"
L"\uA901\uA9D1\uA9F1\uAA51\uABF1\uFF11"
L"\U000104A1\U00011067\U000110F1\U00011137\U000111D1\U000112F1"
L"\U00011451\U000114D1\U00011651\U000116C1\U00011731\U000118E1"
L"\U00011C51\U00011D51\U00016A61\U00016B51\U0001D7CF\U0001D7D9"
L"\U0001D7E3\U0001D7ED\U0001D7F7\U0001E951",
L"2" L"\u0662\u06F2\u07C2\u0968\u09E8\u0A68\u0AE8\u0B68\u0BE8\u0C68"
L"\u0CE8\u0D68\u0DE8\u0E52\u0ED2\u0F22\u1042\u1092\u17E2\u1812"
L"\u1948\u19D2\u1A82\u1A92\u1B52\u1BB2\u1C42\u1C52\uA622\uA8D2"
L"\uA902\uA9D2\uA9F2\uAA52\uABF2\uFF12"
L"\U000104A2\U00011068\U000110F2\U00011138\U000111D2\U000112F2"
L"\U00011452\U000114D2\U00011652\U000116C2\U00011732\U000118E2"
L"\U00011C52\U00011D52\U00016A62\U00016B52\U0001D7D0\U0001D7DA"
L"\U0001D7E4\U0001D7EE\U0001D7F8\U0001E952",
L"3" L"\u0663\u06F3\u07C3\u0969\u09E9\u0A69\u0AE9\u0B69\u0BE9\u0C69"
L"\u0CE9\u0D69\u0DE9\u0E53\u0ED3\u0F23\u1043\u1093\u17E3\u1813"
L"\u1949\u19D3\u1A83\u1A93\u1B53\u1BB3\u1C43\u1C53\uA623\uA8D3"
L"\uA903\uA9D3\uA9F3\uAA53\uABF3\uFF13"
L"\U000104A3\U00011069\U000110F3\U00011139\U000111D3\U000112F3"
L"\U00011453\U000114D3\U00011653\U000116C3\U00011733\U000118E3"
L"\U00011C53\U00011D53\U00016A63\U00016B53\U0001D7D1\U0001D7DB"
L"\U0001D7E5\U0001D7EF\U0001D7F9\U0001E953",
L"4" L"\u0664\u06F4\u07C4\u096A\u09EA\u0A6A\u0AEA\u0B6A\u0BEA\u0C6A"
L"\u0CEA\u0D6A\u0DEA\u0E54\u0ED4\u0F24\u1044\u1094\u17E4\u1814"
L"\u194A\u19D4\u1A84\u1A94\u1B54\u1BB4\u1C44\u1C54\uA624\uA8D4"
L"\uA904\uA9D4\uA9F4\uAA54\uABF4\uFF14"
L"\U000104A4\U0001106A\U000110F4\U0001113A\U000111D4\U000112F4"
L"\U00011454\U000114D4\U00011654\U000116C4\U00011734\U000118E4"
L"\U00011C54\U00011D54\U00016A64\U00016B54\U0001D7D2\U0001D7DC"
L"\U0001D7E6\U0001D7F0\U0001D7FA\U0001E954",
L"5" L"\u0665\u06F5\u07C5\u096B\u09EB\u0A6B\u0AEB\u0B6B\u0BEB\u0C6B"
L"\u0CEB\u0D6B\u0DEB\u0E55\u0ED5\u0F25\u1045\u1095\u17E5\u1815"
L"\u194B\u19D5\u1A85\u1A95\u1B55\u1BB5\u1C45\u1C55\uA625\uA8D5"
L"\uA905\uA9D5\uA9F5\uAA55\uABF5\uFF15"
L"\U000104A5\U0001106B\U000110F5\U0001113B\U000111D5\U000112F5"
L"\U00011455\U000114D5\U00011655\U000116C5\U00011735\U000118E5"
L"\U00011C55\U00011D55\U00016A65\U00016B55\U0001D7D3\U0001D7DD"
L"\U0001D7E7\U0001D7F1\U0001D7FB\U0001E955",
L"6" L"\u0666\u06F6\u07C6\u096C\u09EC\u0A6C\u0AEC\u0B6C\u0BEC\u0C6C"
L"\u0CEC\u0D6C\u0DEC\u0E56\u0ED6\u0F26\u1046\u1096\u17E6\u1816"
L"\u194C\u19D6\u1A86\u1A96\u1B56\u1BB6\u1C46\u1C56\uA626\uA8D6"
L"\uA906\uA9D6\uA9F6\uAA56\uABF6\uFF16"
L"\U000104A6\U0001106C\U000110F6\U0001113C\U000111D6\U000112F6"
L"\U00011456\U000114D6\U00011656\U000116C6\U00011736\U000118E6"
L"\U00011C56\U00011D56\U00016A66\U00016B56\U0001D7D4\U0001D7DE"
L"\U0001D7E8\U0001D7F2\U0001D7FC\U0001E956",
L"7" L"\u0667\u06F7\u07C7\u096D\u09ED\u0A6D\u0AED\u0B6D\u0BED\u0C6D"
L"\u0CED\u0D6D\u0DED\u0E57\u0ED7\u0F27\u1047\u1097\u17E7\u1817"
L"\u194D\u19D7\u1A87\u1A97\u1B57\u1BB7\u1C47\u1C57\uA627\uA8D7"
L"\uA907\uA9D7\uA9F7\uAA57\uABF7\uFF17"
L"\U000104A7\U0001106D\U000110F7\U0001113D\U000111D7\U000112F7"
L"\U00011457\U000114D7\U00011657\U000116C7\U00011737\U000118E7"
L"\U00011C57\U00011D57\U00016A67\U00016B57\U0001D7D5\U0001D7DF"
L"\U0001D7E9\U0001D7F3\U0001D7FD\U0001E957",
L"8" L"\u0668\u06F8\u07C8\u096E\u09EE\u0A6E\u0AEE\u0B6E\u0BEE\u0C6E"
L"\u0CEE\u0D6E\u0DEE\u0E58\u0ED8\u0F28\u1048\u1098\u17E8\u1818"
L"\u194E\u19D8\u1A88\u1A98\u1B58\u1BB8\u1C48\u1C58\uA628\uA8D8"
L"\uA908\uA9D8\uA9F8\uAA58\uABF8\uFF18"
L"\U000104A8\U0001106E\U000110F8\U0001113E\U000111D8\U000112F8"
L"\U00011458\U000114D8\U00011658\U000116C8\U00011738\U000118E8"
L"\U00011C58\U00011D58\U00016A68\U00016B58\U0001D7D6\U0001D7E0"
L"\U0001D7EA\U0001D7F4\U0001D7FE\U0001E958",
L"9" L"\u0669\u06F9\u07C9\u096F\u09EF\u0A6F\u0AEF\u0B6F\u0BEF\u0C6F"
L"\u0CEF\u0D6F\u0DEF\u0E59\u0ED9\u0F29\u1049\u1099\u17E9\u1819"
L"\u194F\u19D9\u1A89\u1A99\u1B59\u1BB9\u1C49\u1C59\uA629\uA8D9"
L"\uA909\uA9D9\uA9F9\uAA59\uABF9\uFF19"
L"\U000104A9\U0001106F\U000110F9\U0001113F\U000111D9\U000112F9"
L"\U00011459\U000114D9\U00011659\U000116C9\U00011739\U000118E9"
L"\U00011C59\U00011D59\U00016A69\U00016B59\U0001D7D7\U0001D7E1"
L"\U0001D7EB\U0001D7F5\U0001D7FF\U0001E959",
};
static int wdigit(const wint_t wc)
{
int i;
for (i = 0; i < 10; i++)
if (wcschr(wdigits[i], wc))
return i;
return -1;
}
#endif /* WDIGIT_H */
On a Linux, *BSD, or Mac machine, you can compile the above using e.g.
gcc -std=c99 -Wall -Wextra -pedantic example.c -o example
or
clang -std=c99 -Wall -Wextra -pedantic example.c -o example
and test it using e.g.
printf 'Bengali decimal digit five is ৫.\n' | ./example
which outputs
Read 33 wide characters total.
25 letters
0 zeros (equivalent to '0')
0 ones (equivalent to '1')
0 twos (equivalent to '2')
0 threes (equivalent to '3')
0 fours (equivalent to '4')
1 fives (equivalent to '5')
0 sixes (equivalent to '6')
0 sevens (equivalent to '7')
0 eights (equivalent to '8')
0 nines (equivalent to '9')
6 whitespaces (including newlines and tabs)
1 punctuation characters
0 other printable characters
The above code is fully compliant to ISO C99 (and later versions of the ISO C standard), and should be completely portable.
However, note that not all C libraries fully support C99; the main one people have issues with is Microsoft C. I don't use Windows myself, but if you do, try using the UTF-8 codepage (chcp 65001). This is wholly and completely a Microsoft issue, as it apparently can support UTF-8 input with some nonstandard Windows extensions. They just don't want you to write portable code, it seems.
I need to ask 2 questions..
1st question: I know very well that '0' and '9' represent the ASCII values of 0 and 9 respectively. But what I don't seem to understand is why we even need to use the ASCII value and not the integer itself. Like why can't we simply use
if (c >= 0 && c <= 9)
Let's start with basics. All user input, file input, etc. is given in characters, so when you need to compare the character you have just read, it must be compared against another character. Within the character set, digits 0-9 are represented with ASCII values 48-57, so character '0' is represented by 48, and so on.
Your test above tests whether c is a digit, an ASCII value between 48-57, so you must use the characters themselves within the comparison, e.g. if ('0' <= c && c <= '9') you then know c is a digit. This brings us to:
2nd question:
++ndigit[c-'0']
In any classification problem you do, you will generally use an array initialized to all zero with at least enough elements for the set (of characters here). You can split them out as an array of ten elements to hold your digits, uppercase, lowercase, etc...
Your ndigit array begins initialized to all zeros, and the plan is to increment the proper element in the array each time a digit is encountered during your read. This is where you make use of the ASCII value for the bottom of the digits, '0' (48). Since your ndigit array is indexed 0-9, each time a digit is encountered it must be scaled (or mapped) into the correct index of ndigit (so that '0' is mapped to 0, '1' is mapped to 1, and so on).
Above through your test we determined, in this case, that c held a digit, so to classify that digit and have it map to the correct element of the ndigit array, we use c - '0'. If the digit in c is '3' (ASCII 51), then incrementing
++ndigit[c-'0'];
is actually indexing
++ndigit[51 - 48];
or
++ndigit[3]; /* since c was '3', we now increment ndigit[3], adding one more
occurrence of '3' to the data stored at ndigit[3] */
That way, when you are done, the ndigit array holds the exact count of each digit 0, 1, 2, 3, 4, ... found in your input. It takes a bit to wrap your head around the scheme, but all in all, you simply need somewhere to count from zero and store the totals for each class of character you see (digits, punctuation, and so on). An array sized for the character set holds these values exactly when you are done, because each character has been classified and the corresponding ndigit[] element incremented as you went along.
These, in a general sense, are called frequency arrays because they are used to store the frequency with which the individual members of a set appeared. There are many, many applications outside simply classifying characters.
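To illustrate the general frequency-array idea, here is a small sketch that counts every byte value in a string rather than only the digits. The function name count_bytes and the choice of one array element per possible unsigned char value are assumptions made for this example.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: counts how often each byte value occurs in s.
   freq must have UCHAR_MAX + 1 elements, one per possible byte value. */
void count_bytes(const char *s, size_t freq[UCHAR_MAX + 1])
{
    size_t i;

    for (i = 0; i <= UCHAR_MAX; i++)
        freq[i] = 0;                 /* start every counter at zero */
    for (; *s != '\0'; s++)
        freq[(unsigned char)*s]++;   /* the character code is the index */
}
```

Here no subtraction is needed because the array covers the whole character set; ndigit needs c - '0' only because it has just ten elements.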
Look all of the answers over and let me know if you are still confused and I'm more than happy to help further.
getchar() returns character codes and sentinel values (EOF). So, we know c holds a character code inside the loop.
c-'0' is the distance on the character code "number line" from the value of c (a character code) to the code for '0'. Per the C standard, character codes must have these digits in consecutive order '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'. So, the expression computes the integer value of the digit character.
Sorry if my title is a little misleading, I am still new to a lot of this but:
I recently worked on a small cipher project where the user can give the program an argument at the command line, but it must be alphabetical. (Ex: ./file abc)
This argument will then be used in a formula to encipher a message of plain text you provide. I got the code to work, thanks to my friend for helping, but I'm not 100% sure about a specific part of this formula.
#include <stdio.h>
#include <cs50.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
int main (int argc, string argv[])
{ //Clarify that the argument count is not larger than 2
if (argc != 2)
{
printf("Please Submit a Valid Argument.\n");
return 1;
}
//Store the given argument (our key) inside a string var 'k' and check if it is alpha
string k = (argv[1]);
//Store how long the key is
int kLen = strlen(k);
//Tell the user we are checking their key
printf("Checking key validation...\n");
//Pause the program for 2 seconds
sleep(2);
//Check to make sure the key submitted is alphabetical
for (int h = 0, strlk = strlen(k); h < strlk; h++)
{
if isalpha(k[h])
{
printf("Character %c is valid\n", k[h]);
sleep(1);
}
else
{ //Telling the user the key is invalid and returning them to the console
printf("Key is not alphabetical, please try again!\n");
return 0;
}
}
//Store the users soon to be enciphered text in a string var 'pt'
string pt = get_string("Please enter the text to be enciphered: ");
//A prompt that the encrypted text will display on
printf("Printing encrypted text: ");
sleep(2);
//Encipher Function
for(int i = 0, j = 0, strl = strlen(pt); i < strl; i++)
{
//Get the letter 'key'
int lk = tolower(k[j % kLen]) - 'a';
//If the char is uppercase, run the V formula and increment j by 1
if isupper(pt[i])
{
printf("%c", 'A' + (pt[i] - 'A' + lk) % 26);
j++;
}
//If the char is lowercase, run the V formula and increment j by 1
else if islower(pt[i])
{
printf("%c", 'a' + (pt[i] - 'a' + lk) % 26);
j++;
}
//If the char is a symbol just print said symbol
else
{
printf("%c", pt[i]);
}
}
printf("\n");
printf("Closing Script...\n");
return 0;
}
The Encipher Function:
Uses 'A' as a char for the placeholder but does 'A' hold a zero indexed value automatically? (B = 1, C = 2, ...)
In C, character literals like 'A' are of type int, and represent whatever integer value encodes the character A on your system. On the 99.999...% of systems that use ASCII character encoding, that's the number 65. If you have an old IBM mainframe from the 1970s using EBCDIC, it might be something else. You'll notice that the code is subtracting 'A' to make 0-based values.
This does make the assumption that the letters A-Z occupy 26 consecutive codes. This is true of ASCII (A=65, B=66, etc.), but not of all codes, and not guaranteed by the language.
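The formula from the question can be isolated in a small sketch. The name shift_upper is hypothetical, and the code relies on the same assumption discussed above, namely that 'A' to 'Z' occupy 26 consecutive codes (true in ASCII, not guaranteed by the language).

```c
/* Hypothetical helper: shifts an uppercase letter by `shift` positions
   (0..25), wrapping from 'Z' back to 'A'.
   Assumes 'A'..'Z' are consecutive codes, as in ASCII. */
int shift_upper(int ch, int shift)
{
    int index = ch - 'A';            /* 0-based position: A=0, B=1, ... */
    return 'A' + (index + shift) % 26;
}
```

Subtracting 'A' produces the 0-based index, the modulo keeps it in 0..25, and adding 'A' back turns the index into a letter again.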
does 'A' hold a zero indexed value automatically? (B = 1, C = 2, ...)
No. Strictly conforming C code cannot depend on any property of the character encoding other than the digits 0-9 being represented consecutively, even though the common ASCII character set does represent the letters consecutively as well.
The only guarantee regarding character sets is per 5.2.1 Character sets, paragraph 3 of the C standard:
... the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous...
Character sets such as EBCDIC don't represent letters consecutively.
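The digit guarantee quoted above also works in the other direction, from a number to its character. A minimal sketch, with digit_char as a made-up name and '?' as an arbitrary choice for out-of-range input:

```c
/* Hypothetical helper: returns the character for a digit value 0..9.
   Relies only on the C guarantee that '0'..'9' are consecutive.
   Returns '?' for values outside 0..9 (an arbitrary choice). */
char digit_char(int n)
{
    if (n < 0 || n > 9)
        return '?';
    return (char)('0' + n);
}
```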
char is a numeric type that happens to also often be used to represent visible characters (or special non-visible pseudo-characters). 'A' is a value (with actual type int) that can be converted to a char without overflow or underflow. That is, it's really some number, but you usually don't need to know what number, since you generally use a particular char value either as just a number or as just a character, not both.
But this program is using char values in both ways, so it somewhat does matter what the numeric values corresponding to visible characters are. One way it's very often done, but not always, is using the ASCII values which are numbered 0 to 127, or some other scheme which uses those values plus more values outside that range. So for example, if the computer uses one of those schemes, then 'A'==65, and 'A'+1==66, which is 'B'.
This program is assuming that all the lowercase Latin-alphabet letters have numeric values in consecutive order from 'a' to 'z', and all the uppercase Latin-alphabet letters have numeric values in consecutive order from 'A' to 'Z', without caring exactly what those values are. This is true of ASCII, so it will work on many kinds of machines. But there's no guarantee it will always be true!
C does guarantee the ten digit characters from '0' to '9' are in consecutive order, which means that if n is a digit number from zero to nine inclusive, then n + '0' is the character for displaying that digit, and if c is such a digit character, then c - '0' is the number from zero to nine it represents. But that's the only guarantee the C language makes about the values of characters.
For one counter-example, see EBCDIC, which is not in much use now, but was used on some older computers, and C supports it. Its alphabetic characters are arranged in clumps of consecutive letters, but not with all 26 letters of each case all together. So the program would give incorrect results running on such a computer.
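If portability to such character sets matters, one possible approach (a sketch, not something the program in the question does) is to look the letter up in an explicit alphabet string instead of subtracting 'a'. The name letter_index is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for a lowercase Latin letter and
   -1 for anything else. Does not assume the letters are consecutive. */
int letter_index(int c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')                   /* strchr would match the terminator */
        return -1;
    p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}
```

The position of the letter inside the string is its index, whatever numeric codes the character set assigns.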
Sequentiality is only one aspect of concern.
Proper use of isalpha(ch) is another, not quite implemented properly in OP's code.
isalpha(ch) expects a ch in the range of unsigned char or EOF. With k[h], a char, that value could be negative. Ensure a non-negative value with:
// if isalpha(k[h])
if (isalpha((unsigned char) k[h]))
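The same cast can be applied to a whole key. This is a minimal sketch with a made-up function name, is_alpha_string; treating an empty key as invalid is an assumption of the example.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s is non-empty and every character
   is alphabetic in the current locale, 0 otherwise. */
int is_alpha_string(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isalpha((unsigned char)*s))   /* cast avoids negative values */
            return 0;
    return 1;
}
```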
I wrote a C program for a lexical analyzer (a small piece of code) that will identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.
#include <stdio.h>
#include <conio.h>
#include <string.h>
char symTable[5][7] = { "int", "void", "float", "char", "string" };
int main() {
int i, j, k = 0, flag = 0;
char string[7];
char str[] = "int main(){printf(\"Hello\");return 0;}";
char *ptr;
printf("Splitting string \"%s\" into tokens:\n", str);
ptr = strtok(str, " (){};""");
printf("\n\n");
while (ptr != NULL) {
printf ("%s\n", ptr);
for (i = k; i < 5; i++) {
memset(&string[0], 0, sizeof(string));
for (j = 0; j < 7; j++) {
string[j] = symTable[i][j];
}
if (strcmp(ptr, string) == 0) {
printf("Keyword\n\n");
break;
} else
if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
string[j] == 3 || string[j] == 4 || string[j] == 5 ||
string[j] == 6 || string[j] == 7 || string[j] == 8 ||
string[j] == 9) {
printf("Constant\n\n");
break;
} else {
printf("Identifier\n\n");
break;
}
}
ptr = strtok(NULL, " (){};""");
k++;
}
_getch();
return 0;
}
With the above code, I am able to identify keywords and identifiers but I couldn't obtain the result for numbers. I've tried using strspn() but to no avail. I even replaced 0,1,2...,9 with '0','1',....,'9'.
Any help would be appreciated.
Here are some problems in your parser:
The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if you have a digit in the string indicating the start of a number.
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0': this will prevent matching operators such as (, )...
The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:
if you have white space, skip it
if you have //, it is a line comment: skip all characters up to the newline.
if you have /*, it is a block comment: skip all characters until you get the pair */.
if you have a ', you have a character constant: parse the characters, handling escape sequences until you get a closing '.
if you have a ", you have a string literal: do the same as for character constants.
if you have a digit, consume all subsequent digits, you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore: consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.
That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
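A few of the steps above can be sketched as follows. This is only an illustration: the names token_kind and next_token and the keyword list are made up, and comments, character constants, string literals and multi-character operators are left out.

```c
#include <ctype.h>
#include <string.h>

enum token_kind { TOK_END, TOK_NUMBER, TOK_KEYWORD, TOK_IDENTIFIER, TOK_OTHER };

/* Hypothetical helper: scans one token starting at *pp, advances *pp
   past it, and reports what kind of token it was. */
enum token_kind next_token(const char **pp)
{
    static const char *const keywords[] = { "int", "void", "float", "char", "return" };
    const char *p = *pp;
    const char *start;
    size_t len, i;

    while (isspace((unsigned char)*p))      /* white space: skip it */
        p++;
    start = p;
    if (*p == '\0') {
        *pp = p;
        return TOK_END;
    }
    if (isdigit((unsigned char)*p)) {       /* digit: consume the integer */
        while (isdigit((unsigned char)*p))
            p++;
        *pp = p;
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {   /* word */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        *pp = p;
        len = (size_t)(p - start);
        for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strlen(keywords[i]) == len && strncmp(start, keywords[i], len) == 0)
                return TOK_KEYWORD;
        return TOK_IDENTIFIER;
    }
    *pp = p + 1;                            /* anything else: one character */
    return TOK_OTHER;
}
```

Because the function walks the string without modifying it, separators such as ( and ; are still available as tokens, which strtok() would have overwritten.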
When you're writing a lexer, always create a dedicated function that finds your tokens (the name yylex is used by the Lex tool, which is why I used that name). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.
From your question it is not clear whether you just want to recognize number tokens, or whether you want the token plus the number's value. I will assume the first.
This is example code that finds whole numbers:
#include <stdio.h>
#include <ctype.h>
int yylex(){
    /* We read one char from standard input (int, so that EOF fits) */
    int c = getchar();
    /* If we read new line or reach end of file, we will return end of input token */
    if(c == '\n' || c == EOF)
        return EOI;
    /* If we see digit on input, we can not return number token at the moment.
       For example input could be 123a and that is lexical error */
    if(isdigit(c)){
        while(isdigit(c = getchar()))
            ;
        ungetc(c,stdin);
        return NUM;
    }
    /* Additional code for keywords, identifiers, errors, etc. */
}
Tokens EOI, NUM, etc. should be defined at the top. Later on, when you write the syntax analysis, you use these tokens to check whether the code conforms to the language syntax. In lexical analysis, single-character tokens usually get no definition of their own; your lexer function would simply return ')' itself, for example. For that reason, named tokens should be given values above 255. For example:
#define EOI 256
#define NUM 257
If you have any further questions, feel free to ask.
string[j]==1
This test is wrong(1) (on all C implementations I have heard of), since string[j] is some char encoded using e.g. ASCII (or UTF-8, or even the old EBCDIC used on IBM mainframes), and the encoding of the digit character 1 is not the number 1. On my Linux/x86-64 machine using UTF-8 (and on most machines using ASCII or UTF-8, i.e. almost all of them), the character 1 is encoded as the byte of code 49 (that is, (char)49 == '1').
You probably want
string[j]=='1'
and you should consider using the standard isdigit (and related) function.
Be aware that UTF-8 is practically used everywhere but is a multi-byte encoding (of displayable characters). See this answer.
Note (1): the string[j]==1 test is probably misplaced too! Perhaps you might test isdigit(*ptr) at some better place.
PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...)
and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.
This is code from C by Dennis Ritchie, chapter "Array":
#include <stdio.h>
/* count digits, white space, others */
main()
{
    int c, i, nwhite, nother;
    int ndigit[10];
    nwhite = nother = 0;
    for (i = 0; i < 10; ++i)
        ndigit[i] = 0;
    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c-'0'];
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;
    printf("digits =");
    for (i = 0; i < 10; ++i)
        printf(" %d", ndigit[i]);
    printf(", white space = %d, other = %d\n", nwhite, nother);
}
Why do we need -'0' in this line?
++ndigit[c-'0'];
If I change it to ++ndigit[c], the program doesn't work properly. Why can't we just write ++ndigit[c]?
I already read the explanation of the book, but I don't understand it.
This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets. By definition, chars are just small integers, so char variables and constants are identical to ints in arithmetic expressions. This is natural and convenient; for example c-'0' is an integer expression with a value between 0 and 9 corresponding to the character '0' to '9' stored in c, and thus a valid subscript for the array ndigit
To understand why we need "-'0'" you first need to understand the ASCII table - http://www.asciitable.com/
Now you need to understand that every character in C is represented by a number; in ASCII that number is between 0 and 127 (up to 255 for extended 8-bit character sets).
For example, if you print the numeric value of the character '0':
printf( "%d", '0' );
output: 48
Now you've declared an array of size 10 - ndigit[ 10 ], where cell n holds the number of times the digit n was given as input.
So if you receive '0' as input you'd want to do ndigit[ 0 ]++, so you need to convert from char to integer. You can do that by subtracting '0' (which is 48 in ASCII).
That's why we use the line ++ndigit[c-'0'];
if c = '5', we will get
++ndigit['5' - '0']
++ndigit[ 53 - 48 ]
++ndigit[ 5 ]
exactly like we wanted it to be
c = getchar() will store the character code that was read in c, and it differs from the integer that the character stands for.
Quote from N1256 5.2.1 Character sets
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As this shows, the character codes for decimal digits are continuous, so you can convert the character code of decimal digits to the integer that the characters stand for by subtracting '0', which is 0's character code, from the character code.
In conclusion, c-'0' yields the integer that the character in c stands for.
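The same subtraction is what turns a whole string of digits into a number. A minimal sketch, where parse_digits is a made-up name and overflow is deliberately not checked:

```c
/* Hypothetical helper: converts a leading run of decimal digits in s
   to an unsigned value. Stops at the first non-digit character.
   No overflow checking, to keep the sketch short. */
unsigned long parse_digits(const char *s)
{
    unsigned long n = 0;

    while (*s >= '0' && *s <= '9') {
        n = n * 10 + (unsigned long)(*s - '0');   /* append one digit */
        s++;
    }
    return n;
}
```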
I was going through the book "The C Programming language" by Kernighan and Ritchie and I am stuck at a topic.
Topic number 1.6 talks about Arrays. In the book, they have included a program that counts the digits, white spaces and other characters. The program goes like this:
#include <stdio.h>
main(){
    int c,i,nother,nwhite;
    int ndigit[10];
    nwhite=nother=0;
    for(i=0;i<10;++i)
        ndigit[i]=0;
    while((c=getchar())!=EOF)
        if (c>='0' && c<='9')
            ++ndigit[c-'0'];
        else if (c==' '|| c=='\t'||c=='\n')
            ++nwhite;
        else
            ++nother;
    printf("digits:");
    for(i=0; i<10;++i)
        printf(" %d",ndigit[i]);
    printf(", white space = %d, other = %d\n", nwhite, nother);
}
First, I don't understand the purpose of the first for loop that is :
for(i=0;i<10;++i)
ndigit[i]=0;
And secondly, I can't understand the logic behind this part of the while loop:
if (c>='0' && c<='9')
++ndigit[c-'0'];
I really need someone to explain me the logic behind the program so that I can move further with C programming.
Thanks for the help!
This loop
for(i=0;i<10;++i)
ndigit[i]=0;
is used to set all elements of the array ndigit to 0. The array will count the number of entered digits.
Instead of this loop you could initialize all elements of the array to 0 when it is declared.
int ndigit[10] = { 0 };
As for this statement
if (c>='0' && c<='9')
++ndigit[c-'0'];
then if the entered char is a digit (c>='0' && c<='9'), the expression c-'0' gives you the integer value of the digit. Characters that correspond to the character constants '0' - '9' are represented internally in computer memory by their codes in ASCII or some other coding scheme. For example, character '0' in ASCII has internal code 48, character '1' has 49, character '2' has 50, and so on. In EBCDIC, character '0' has a different code, 240, character '1' has 241, and so on.
The C Standard guarantees that all digits follow each other.
So if variable c holds some digit, then the expression c - '0' gives a number from 0 (if c holds '0') to 9 (if c holds character '9').
This value (from 0 to 9) is used as an index into the array ndigit.
For example, let's assume that c holds character '6'. Then c - '0' equals the integer 6. So ndigit[6] is increased
++ndigit[c-'0']
This element of the array with index 6 counts how many times character '6' was entered.
ndigit[i] holds the number of times digit i (0-9) was counted. E.g., ndigit[5] contains the number of times the digit 5 was counted. So the first loop just initializes all to 0, as nothing was seen thus far.
The if statement checks whether the current character c is a digit. If so, it determines which digit it is by subtracting '0' from it. This will give the desired index, for which the value contained is increased by one.
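Both steps described above (zeroing the counters, then incrementing ndigit[c-'0']) can be seen in isolation in this sketch, which applies the K&R loop to a string instead of standard input. The function name count_digits is made up for the example.

```c
/* Hypothetical helper: the same counting idea as the K&R program,
   applied to a string. ndigit must have 10 elements. */
void count_digits(const char *s, int ndigit[10])
{
    int i;

    for (i = 0; i < 10; ++i)         /* nothing seen yet: all counts are 0 */
        ndigit[i] = 0;
    for (; *s != '\0'; ++s)
        if (*s >= '0' && *s <= '9')
            ++ndigit[*s - '0'];      /* '0' -> index 0, ..., '9' -> index 9 */
}
```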
I'm currently reading 'The C Programming Language' by Kernighan & Ritchie and I'm struggling to work out what a line does. I think I'm just being a little stupid, and not quite understanding their explanation.
++ndigit[c-'0'];
I've had to change the program ever so slightly as ndigit was previously giving me garbage values, so I just instantiate the array to 0, instead of traversing it with a for loop, and changing the values that way.
#include <stdio.h>
main()
{
    int c, i, nwhite, nother;
    int ndigit[10]= {0};
    nwhite = nother = 0;
    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c-'0'];
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;
    printf("digits =");
    for (i = 0; i < 10; i++)
        printf (" %d", ndigit[i]);
    printf (", white space = %d, other = %d\n", nwhite, nother);
}
Using the program as its input, we get this printed to the console -
digits = 7 2 0 0 0 0 0 0 0 1, white space = 104, other = 291
I understand that 7 2 0 0 0 0 0 0 0 1 is a count of how many times the single digits appear in the input (0 appears 7 times, 1 appears twice etc.)
But, how does the ...[c-'0']; 'work'?
You asked how the below expression works
c-'0'
The character code of 0 is subtracted from the character code of the entered character, and the result gives the position in the array where the count is stored.
Suppose you enter 1 from the keyboard: the ASCII code for 1 is 49 and the ASCII code for 0 is 48.
Hence
49-48 = 1
and the count will be stored at array index 1.
In C, when you have a variable c of type char, it actually stores some integer encoding of the char (usually the ASCII code). So c-'0' means the difference between the code of the character contained in c and the code of the character 0. Since the digits are in natural order, it converts the digit into the associated number.
c-'0' is a technique to get the int value equal to the digit a char shows, e.g. 1 for '1', 5 for '5'.
The char symbols '0', '1', '2' ..... '9' are assigned consecutive encoding values, so the difference between a digit char constant and '0' gives the decimal number (for example, in ASCII they are assigned consecutive ASCII values).
So for example, if variable c is '7', then c - '0' == 7;
In your code array declared as:
int ndigit[10]= {0};
// default initialized with `0`
So the index can be from 0 to 9. So in your code:
++ndigit[c-'0']; // ndigit[c-'0'] = ndigit[c-'0'] + 1;
increments the frequency of a digit by 1 each time the corresponding digit char is read.
ASCII is an encoding which gives consecutive codes to consecutive digits. As Eric Postpischil pointed out, the standard demands that property, even though the underlying encoding need not be ASCII. ASCII is quite common though.
char c1 = '0';
char c2 = '1';
So whatever number '0' is mapped to, '1' will be that number + 1. In essence:
c2 == c1 + 1
Subtracting '0' from a character which is a digit, will return its numeric value:
'1' - '0' == 1
The C standard requires that the characters '0' to '9' have consecutive values.
'0' represents the digit zero as a character, and c is the character you enter, so the subtraction works out like this:
'1' - '0' == 1
'2' - '0' == 2
and so on ... i.e. equal to the numeric value of the digit in c, if c is a digit
((c = getchar()) != EOF)
if (c >= '0' && c <= '9')
++ndigit[c-'0'];
c gives you the current character. With the range check you confirm it is a digit by using its ASCII value.
Check the Chr column of an ASCII table for 0: it says 48, and for 8 it says 56. So '8' - '0' gives 56 - 48 = 8.
ndigit is used to keep track of how many times each digit occurs, with each element of the array holding the number of times its subscript has occurred.
ndigit[0] will give you the number of times 0 has occurred, and so on; ndigit[x] gives the number of times x appeared.
[c - '0']
Suppose c, i.e. the current character, is '8'; then '8' - '0' gives you 8. So you get ndigit[8] and you ++ that value (you initialised it to 0 at the start).
Look at the ASCII article on Wikipedia.
The American Standard Code for Information Interchange (ASCII /ˈæski/ ass-kee) is a character-encoding scheme originally based on the English alphabet that encodes 128 specified characters - the numbers 0-9, the letters a-z and A-Z, some basic punctuation symbols, some control codes that originated with Teletype machines, and a blank space - into the 7-bit binary integers.
So in the ASCII scheme the char '0' is the number 48, the char '1' is 49, and so on. So c - '0' is equivalent to c - 48. If c is '1', the expression becomes 49 - 48 = 1. In short, c - '0' converts a digit char ['0'-'9'] into an integer [0-9].
Edit 1
As suggested by @Eric Postpischil, ASCII is not part of ANSI C (nor C++). But it is very common, and all compilers I know use the ASCII set.
'0' is a char and it has the value 48 in ASCII. You can look it up in any ASCII table on Google.
It works this way because you read a char value from the input, not an int, so if you leave out the "-'0'" part, the character '0' would try to increment element 48 of the array, which is out of bounds for a 10-element array.
Instead of "-'0'" you can put "-48"; in my opinion it's more readable this way, although it only works for ASCII-compatible character sets.