Mixed UTF-16 and ASCII string


I have strings that mix ASCII and UTF-16 characters. The main problem is that I need to split such a string into its individual characters.
For example, assuming we're under Windows, where the default encoding is (in most cases) UTF-16:
const wchar_t msg[] = L"AД诶B";
I have defined total of 4 characters.
A = 2 bytes.
Д = 2 bytes.
诶 = 4 bytes.
B = 2 bytes.
I need to take the 4th character from the string (ASCII B), but if I do msg[4] it will split the Chinese character and return the wrong result. How can I solve that without any additional libraries?

As you've already discovered, UTF-16 is really a variable-width encoding. So, you will have to scan across the string to perform accurate character indexing.
Luckily, it is very easy to tell if a character is part of a multi-word sequence: the only multiword sequences in UTF-16 (as currently defined) are surrogate pairs: a word in the range [D800-DBFF] followed by a word in the range [DC00-DFFF]. So, when you encounter such a sequence, treat it as a single character.
This may work for your needs:
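#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
/* Returns the code point at character position index, or 0 if the string
   ends first. Assumes wchar_t holds UTF-16 code units, as on Windows. */
uint32_t utf16_char_at_index(const wchar_t *s, size_t index) {
    while (s[0] != L'\0') {
        if (s[0] >= 0xd800 && s[0] <= 0xdbff &&
            s[1] >= 0xdc00 && s[1] <= 0xdfff) {
            /* surrogate pair: skip or return */
            if (index == 0) {
                /* code points above the BMP start at 0x10000 */
                return 0x10000 + (((uint32_t)(s[0] - 0xd800) << 10) | (uint32_t)(s[1] - 0xdc00));
            }
            s += 2;
        } else {
            /* single unit; an unpaired surrogate is a decoding error
               that you may want to flag here */
            if (index == 0) {
                return (uint32_t)s[0];
            }
            s++;
        }
        index--;
    }
    return 0;
}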
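To make the surrogate-pair rule concrete without depending on the width of wchar_t, here is a minimal sketch that works on explicit 16-bit code units. The name utf16_code_point_at and the UTF16_NO_CHAR marker are made up for this example; treat it as an illustration under those assumptions, not a complete UTF-16 decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "index out of range" marker chosen for this sketch. */
#define UTF16_NO_CHAR 0xFFFFFFFFu

/* Return the code point at character position `index` in a
   zero-terminated sequence of UTF-16 code units. */
uint32_t utf16_code_point_at(const uint16_t *s, size_t index)
{
    while (*s != 0) {
        uint32_t cp = s[0];
        size_t units = 1;

        /* High surrogate followed by low surrogate: one character, two units. */
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            cp = 0x10000u + (((uint32_t)(s[0] - 0xD800) << 10) |
                             (uint32_t)(s[1] - 0xDC00));
            units = 2;
        }
        if (index == 0)
            return cp;
        s += units;
        index--;
    }
    return UTF16_NO_CHAR;
}
```

For the code units 0x0041, 0x0414, 0xD83D, 0xDE00, 0x0042 (A, Д, U+1F600, B), asking for index 3 skips the surrogate pair as a single character and returns 0x42.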

Related

How to extract values from a string in hexadecimal in a C program?

I have a hexadecimal string like:
char str[] = "40004A0060007A0034006600";
I want to extract individual values from it like 0x40, 0x00, 0x4A, 0x00 etc.
How to do it?

Copy the 2 bytes of interest into a temporary 3 byte array.
Null terminate the 3 byte array to turn it into a string.
Call strtoul on this array from stdlib.h.
Alternatively you could decode it manually, since it's a trivial thing to do. Mask out nibbles, subtract some ASCII values or do a lookup table check, then multiply the most significant nibble by 16.
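The three steps with the temporary array might look roughly like this. The function name hex_pair_at and the -1 error return are assumptions made for the sketch, not something from the question.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse the two hex digits at str[2*n] and str[2*n + 1].
   Returns the byte value (0-255), or -1 if the pair is missing
   or is not made of two hex digits. */
int hex_pair_at(const char *str, size_t n)
{
    char tmp[3];

    if (strlen(str) < 2 * n + 2)
        return -1;
    memcpy(tmp, str + 2 * n, 2);   /* copy the two bytes of interest */
    tmp[2] = '\0';                 /* null terminate to form a string */
    if (!isxdigit((unsigned char)tmp[0]) || !isxdigit((unsigned char)tmp[1]))
        return -1;                 /* reject signs, spaces and the like */
    return (int)strtoul(tmp, NULL, 16);
}
```

With the string from the question, hex_pair_at(str, 0) gives 0x40 and hex_pair_at(str, 2) gives 0x4A.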

NOTE: This answer applies to revision 3 of the question. Meanwhile, the question has been modified, thereby invalidating option #1 of my answer. As pointed out in the comments section of the question, this was not OP's fault, though.
You have two options:
1. Convert the string to an integer type, for example using the function strtoul or strtoull, and then use bit-shifting (>> operator) and bit-masking (& operator) to obtain the desired values. However, due to limitations in the range of values that the data types unsigned long and unsigned long long can represent, this option is only guaranteed to work with up to 8 hexadecimal digits with strtoul and 16 digits with strtoull. EDIT: Meanwhile, the question has been modified in such a way that the string is longer than 16 digits, so this solution is no longer viable.
2. Obtain the desired values by looking them up directly in the string. For example, if you are looking for the 3rd group of hexadecimal digits, then you will find them using str[4] and str[5]. This will give you two character values. If you want to convert these two hexadecimal characters to the number that they represent, then you can create a string from these two values and then use strtoul on that string.
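For strings short enough for the first option, the shifting and masking could be sketched as follows. The function name extract_byte and the fixed length of exactly 8 hex digits are assumptions made for the example.

```c
#include <stdlib.h>

/* Extract byte n (0 = leftmost pair, 3 = rightmost pair) from a string
   of exactly 8 hex digits. The caller must keep n in the range 0..3. */
unsigned extract_byte(const char *hex8, unsigned n)
{
    unsigned long v = strtoul(hex8, NULL, 16);

    /* Shift the wanted byte down to the low 8 bits, then mask the rest off. */
    return (unsigned)((v >> (8 * (3 - n))) & 0xFFu);
}
```

For "40004A00", byte 0 is 0x40 and byte 2 is 0x4A.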

Since you seem to be a beginner, I broke the task up into its constituent parts. This is a very simple hex dump facility where each step your code needs to take is its own routine. It is a quick and dirty and rather imperfect implementation, but understanding how to improve it will help you learn and write your own.
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
/* Value of one hex digit character, or -1 if it is not a hex digit */
int
nibble(uint8_t ch) {
    if ((ch >= '0') && (ch <= '9')) {
        return ch - '0';
    }
    if ((ch >= 'A') && (ch <= 'F')) {
        return 10 + (ch - 'A');
    }
    if ((ch >= 'a') && (ch <= 'f')) {
        return 10 + (ch - 'a');
    }
    /* should never get here if isxdigit was called first */
    return -1;
}
/* Combine the two hex digits at in[0] and in[1] into one byte value */
int
next_byte(const char *in)
{
    uint8_t hi = 16 * nibble(*in);
    uint8_t lo = nibble(*(in + 1));
    return hi + lo;
}
/* True if in[0] and in[1] are both hex digits */
int points_to_byte(const char *in) {
    /* the casts keep isxdigit defined for bytes above 127 */
    return ((*in) && isxdigit((unsigned char)*in))
        && (*(in + 1)) && isxdigit((unsigned char)*(in + 1));
}
/* Print each byte of a hex string as a decimal number, one per line */
void
dump(const char *in) {
    /* Decide what to do with input that is not a string of hex bytes */
    for (int i = 0; points_to_byte(in + i); i += 2) {
        printf("%d\n", next_byte(in + i));
    }
}
int
main(int argc, char *argv[]) {
    if (argc < 2) {
        puts("Need hex strings as arguments");
        return 1;
    }
    for (int i = 1; i < argc; ++i) {
        dump(argv[i]);
    }
    return 0;
}
When I compile this into an executable called t and run it with your input, this is the output I get:
$ ./t 400004a005b002000113efb29f73f57589343e70e5244162edf312e303030322e313420200043472d58585858000000000032303139303833585858585858000000505230474C5043343554334C3343
64
0
4
160
...
67
52
195
52
You say you want to convert it into another string like
char str1[] ="0x40,0x00,0x4A,0x00,0x60";
Since you do not control the input string, you are going to need to malloc the buffer for the output. That and storing the transformed output is left as an exercise.
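If you want to check your attempt at that exercise, one possible shape is sketched below. The function name hex_to_csv and the decision to reject odd-length or non-hex input by returning NULL are my own assumptions; the digits are copied as they appear in the input, so their case is preserved.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Build "0x40,0x00,..." from "4000...". Returns a malloc'd string that the
   caller must free, or NULL on bad input or allocation failure. */
char *hex_to_csv(const char *in)
{
    size_t len = strlen(in);
    size_t pairs = len / 2;
    char *out, *p;
    size_t i;

    if (len == 0 || len % 2 != 0)
        return NULL;
    for (i = 0; i < len; i++)
        if (!isxdigit((unsigned char)in[i]))
            return NULL;

    /* Each pair needs "0xHH" (4 chars) plus a comma or the final '\0'. */
    out = malloc(pairs * 5);
    if (out == NULL)
        return NULL;

    p = out;
    for (i = 0; i < pairs; i++) {
        if (i > 0)
            *p++ = ',';
        *p++ = '0';
        *p++ = 'x';
        *p++ = in[2 * i];
        *p++ = in[2 * i + 1];
    }
    *p = '\0';
    return out;
}
```

Calling hex_to_csv("40004A0060") produces the string 0x40,0x00,0x4A,0x00,0x60, which is the format asked for in the question.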

Syntax and different meanings of '<letter>'

I am learning C from the K&R book and I came across the code to count the number of occurrences of digits, of white space characters (blank, tab, newline) and of all other characters.
The code is like this:
#include <stdio.h>
/* count digits, white space, others */
main()
{
int c, i, nwhite, nother;
int ndigit[10];
nwhite = nother = 0;
for (i = 0; i < 10; ++i)
ndigit[i] = 0;
while ((c = getchar()) != EOF)
if (c >= '0' && c <= '9')
++ndigit[c-'0'];
else if (c == ' ' || c == '\n' || c == '\t')
++nwhite;
else
++nother;
printf("digits =");
for (i = 0; i < 10; ++i)
printf(" %d", ndigit[i]);
printf(", white space = %d, other = %d\n",
nwhite, nother);
}
I need to ask 2 questions..
1st question:
if (c >= '0' && c <= '9')
++ndigit[c-'0'];
I know very well that '0' and '9' represent the ASCII values of 0 and 9 respectively. But what I don't understand is why we need to use the ASCII value and not the integer itself. Why can't we simply use
if (c >= 0 && c <= 9)
to find if c lies between 0 and 9?
2nd question:
++ndigit[c-'0']
What does the above statement do?
Why aren't we taking the ASCII value of c here?
Because if we did, it should have been written as ['c'-'0'].

1.
c holds a character code, not the integer value of a digit, so we need to compare it with the character constants '0' and '9', whose values are the corresponding ASCII codes. The integers 0 and 9 correspond to NUL and Tab, which are not what we are looking for.
2.
Subtracting the ASCII value of '0' yields the index that corresponds to the digit, and the counter at that index is incremented. For example, if the character is '1', then '1' - '0' = 1, so the counter at index 1 is incremented. It's a convenient way to keep track of digit characters. We don't write ['c' - '0'] because we care about the variable c, not the character 'c'.
This table shows how characters are represented; they are different from integers. The main takeaway is '1' != 1
http://www.asciitable.com/

With the current C standards, this would be a perfect exercise for localized wide input:
#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include "wdigit.h"
int main(void)
{
size_t num_space = 0; /* Spaces, tabs, newlines */
size_t num_letter = 0;
size_t num_punct = 0; /* Punctuation */
size_t num_digit[10] = { 0, }; /* Digits - all initialized to zero */
size_t num_other = 0; /* Other printable characters */
size_t total = 0;
wint_t wc;
int digit;
if (!setlocale(LC_ALL, "")) {
fprintf(stderr, "Current locale is not supported by the C library.\n");
return EXIT_FAILURE;
}
if (fwide(stdin, 1) < 1) {
fprintf(stderr, "The C library does not support wide input for this locale.\n");
return EXIT_FAILURE;
}
while ((wc = fgetwc(stdin)) != WEOF) {
total++;
digit = wdigit(wc);
if (digit >= 0 && digit <= 9)
num_digit[digit]++;
else
if (iswspace(wc))
num_space++;
else
if (iswpunct(wc))
num_punct++;
else
if (iswalpha(wc))
num_letter++;
else
if (iswprint(wc))
num_other++;
/* All nonprintable non-whitespace characters are ignored */
}
printf("Read %zu wide characters total.\n", total);
printf("%15zu letters\n", num_letter);
printf("%15zu zeros (equivalent to '0')\n", num_digit[0]);
printf("%15zu ones (equivalent to '1')\n", num_digit[1]);
printf("%15zu twos (equivalent to '2')\n", num_digit[2]);
printf("%15zu threes (equivalent to '3')\n", num_digit[3]);
printf("%15zu fours (equivalent to '4')\n", num_digit[4]);
printf("%15zu fives (equivalent to '5')\n", num_digit[5]);
printf("%15zu sixes (equivalent to '6')\n", num_digit[6]);
printf("%15zu sevens (equivalent to '7')\n", num_digit[7]);
printf("%15zu eights (equivalent to '8')\n", num_digit[8]);
printf("%15zu nines (equivalent to '9')\n", num_digit[9]);
printf("%15zu whitespaces (including newlines and tabs)\n", num_space);
printf("%15zu punctuation characters\n", num_punct);
printf("%15zu other printable characters\n", num_other);
return EXIT_SUCCESS;
}
You also need wdigit.h, a header file that provides wdigit(), a function that returns the decimal digit value (0 to 9, inclusive) if the given wide character is a decimal digit, and -1 otherwise. If this were an exercise, the header file would be provided.
The following "wdigit.h" should support all decimal digits defined in Unicode (which is the closest standard we have to a universal character set). I don't think it is copyrightable (because it is essentially just a listing from the Unicode standard), but if it is, I dedicate it to the public domain:
#ifndef WDIGIT_H
#define WDIGIT_H
#include <wchar.h>
/* wdigits[] are wide strings that contain all known versions of a decimal digit.
For example, wdigits[0] is a wide string that contains all known zero decimal digit
wide characters. You can use e.g.
wcschr(wdigits[0], wc)
to determine if wc is a zero decimal digit wide character.
*/
static const wchar_t *const wdigits[10] = {
L"0" L"\u0660\u06F0\u07C0\u0966\u09E6\u0A66\u0AE6\u0B66\u0BE6\u0C66"
L"\u0CE6\u0D66\u0DE6\u0E50\u0ED0\u0F20\u1040\u1090\u17E0\u1810"
L"\u1946\u19D0\u1A80\u1A90\u1B50\u1BB0\u1C40\u1C50\uA620\uA8D0"
L"\uA900\uA9D0\uA9F0\uAA50\uABF0\uFF10"
L"\U000104A0\U00011066\U000110F0\U00011136\U000111D0\U000112F0"
L"\U00011450\U000114D0\U00011650\U000116C0\U00011730\U000118E0"
L"\U00011C50\U00011D50\U00016A60\U00016B50\U0001D7CE\U0001D7D8"
L"\U0001D7E2\U0001D7EC\U0001D7F6\U0001E950",
L"1" L"\u0661\u06F1\u07C1\u0967\u09E7\u0A67\u0AE7\u0B67\u0BE7\u0C67"
L"\u0CE7\u0D67\u0DE7\u0E51\u0ED1\u0F21\u1041\u1091\u17E1\u1811"
L"\u1947\u19D1\u1A81\u1A91\u1B51\u1BB1\u1C41\u1C51\uA621\uA8D1"
L"\uA901\uA9D1\uA9F1\uAA51\uABF1\uFF11"
L"\U000104A1\U00011067\U000110F1\U00011137\U000111D1\U000112F1"
L"\U00011451\U000114D1\U00011651\U000116C1\U00011731\U000118E1"
L"\U00011C51\U00011D51\U00016A61\U00016B51\U0001D7CF\U0001D7D9"
L"\U0001D7E3\U0001D7ED\U0001D7F7\U0001E951",
L"2" L"\u0662\u06F2\u07C2\u0968\u09E8\u0A68\u0AE8\u0B68\u0BE8\u0C68"
L"\u0CE8\u0D68\u0DE8\u0E52\u0ED2\u0F22\u1042\u1092\u17E2\u1812"
L"\u1948\u19D2\u1A82\u1A92\u1B52\u1BB2\u1C42\u1C52\uA622\uA8D2"
L"\uA902\uA9D2\uA9F2\uAA52\uABF2\uFF12"
L"\U000104A2\U00011068\U000110F2\U00011138\U000111D2\U000112F2"
L"\U00011452\U000114D2\U00011652\U000116C2\U00011732\U000118E2"
L"\U00011C52\U00011D52\U00016A62\U00016B52\U0001D7D0\U0001D7DA"
L"\U0001D7E4\U0001D7EE\U0001D7F8\U0001E952",
L"3" L"\u0663\u06F3\u07C3\u0969\u09E9\u0A69\u0AE9\u0B69\u0BE9\u0C69"
L"\u0CE9\u0D69\u0DE9\u0E53\u0ED3\u0F23\u1043\u1093\u17E3\u1813"
L"\u1949\u19D3\u1A83\u1A93\u1B53\u1BB3\u1C43\u1C53\uA623\uA8D3"
L"\uA903\uA9D3\uA9F3\uAA53\uABF3\uFF13"
L"\U000104A3\U00011069\U000110F3\U00011139\U000111D3\U000112F3"
L"\U00011453\U000114D3\U00011653\U000116C3\U00011733\U000118E3"
L"\U00011C53\U00011D53\U00016A63\U00016B53\U0001D7D1\U0001D7DB"
L"\U0001D7E5\U0001D7EF\U0001D7F9\U0001E953",
L"4" L"\u0664\u06F4\u07C4\u096A\u09EA\u0A6A\u0AEA\u0B6A\u0BEA\u0C6A"
L"\u0CEA\u0D6A\u0DEA\u0E54\u0ED4\u0F24\u1044\u1094\u17E4\u1814"
L"\u194A\u19D4\u1A84\u1A94\u1B54\u1BB4\u1C44\u1C54\uA624\uA8D4"
L"\uA904\uA9D4\uA9F4\uAA54\uABF4\uFF14"
L"\U000104A4\U0001106A\U000110F4\U0001113A\U000111D4\U000112F4"
L"\U00011454\U000114D4\U00011654\U000116C4\U00011734\U000118E4"
L"\U00011C54\U00011D54\U00016A64\U00016B54\U0001D7D2\U0001D7DC"
L"\U0001D7E6\U0001D7F0\U0001D7FA\U0001E954",
L"5" L"\u0665\u06F5\u07C5\u096B\u09EB\u0A6B\u0AEB\u0B6B\u0BEB\u0C6B"
L"\u0CEB\u0D6B\u0DEB\u0E55\u0ED5\u0F25\u1045\u1095\u17E5\u1815"
L"\u194B\u19D5\u1A85\u1A95\u1B55\u1BB5\u1C45\u1C55\uA625\uA8D5"
L"\uA905\uA9D5\uA9F5\uAA55\uABF5\uFF15"
L"\U000104A5\U0001106B\U000110F5\U0001113B\U000111D5\U000112F5"
L"\U00011455\U000114D5\U00011655\U000116C5\U00011735\U000118E5"
L"\U00011C55\U00011D55\U00016A65\U00016B55\U0001D7D3\U0001D7DD"
L"\U0001D7E7\U0001D7F1\U0001D7FB\U0001E955",
L"6" L"\u0666\u06F6\u07C6\u096C\u09EC\u0A6C\u0AEC\u0B6C\u0BEC\u0C6C"
L"\u0CEC\u0D6C\u0DEC\u0E56\u0ED6\u0F26\u1046\u1096\u17E6\u1816"
L"\u194C\u19D6\u1A86\u1A96\u1B56\u1BB6\u1C46\u1C56\uA626\uA8D6"
L"\uA906\uA9D6\uA9F6\uAA56\uABF6\uFF16"
L"\U000104A6\U0001106C\U000110F6\U0001113C\U000111D6\U000112F6"
L"\U00011456\U000114D6\U00011656\U000116C6\U00011736\U000118E6"
L"\U00011C56\U00011D56\U00016A66\U00016B56\U0001D7D4\U0001D7DE"
L"\U0001D7E8\U0001D7F2\U0001D7FC\U0001E956",
L"7" L"\u0667\u06F7\u07C7\u096D\u09ED\u0A6D\u0AED\u0B6D\u0BED\u0C6D"
L"\u0CED\u0D6D\u0DED\u0E57\u0ED7\u0F27\u1047\u1097\u17E7\u1817"
L"\u194D\u19D7\u1A87\u1A97\u1B57\u1BB7\u1C47\u1C57\uA627\uA8D7"
L"\uA907\uA9D7\uA9F7\uAA57\uABF7\uFF17"
L"\U000104A7\U0001106D\U000110F7\U0001113D\U000111D7\U000112F7"
L"\U00011457\U000114D7\U00011657\U000116C7\U00011737\U000118E7"
L"\U00011C57\U00011D57\U00016A67\U00016B57\U0001D7D5\U0001D7DF"
L"\U0001D7E9\U0001D7F3\U0001D7FD\U0001E957",
L"8" L"\u0668\u06F8\u07C8\u096E\u09EE\u0A6E\u0AEE\u0B6E\u0BEE\u0C6E"
L"\u0CEE\u0D6E\u0DEE\u0E58\u0ED8\u0F28\u1048\u1098\u17E8\u1818"
L"\u194E\u19D8\u1A88\u1A98\u1B58\u1BB8\u1C48\u1C58\uA628\uA8D8"
L"\uA908\uA9D8\uA9F8\uAA58\uABF8\uFF18"
L"\U000104A8\U0001106E\U000110F8\U0001113E\U000111D8\U000112F8"
L"\U00011458\U000114D8\U00011658\U000116C8\U00011738\U000118E8"
L"\U00011C58\U00011D58\U00016A68\U00016B58\U0001D7D6\U0001D7E0"
L"\U0001D7EA\U0001D7F4\U0001D7FE\U0001E958",
L"9" L"\u0669\u06F9\u07C9\u096F\u09EF\u0A6F\u0AEF\u0B6F\u0BEF\u0C6F"
L"\u0CEF\u0D6F\u0DEF\u0E59\u0ED9\u0F29\u1049\u1099\u17E9\u1819"
L"\u194F\u19D9\u1A89\u1A99\u1B59\u1BB9\u1C49\u1C59\uA629\uA8D9"
L"\uA909\uA9D9\uA9F9\uAA59\uABF9\uFF19"
L"\U000104A9\U0001106F\U000110F9\U0001113F\U000111D9\U000112F9"
L"\U00011459\U000114D9\U00011659\U000116C9\U00011739\U000118E9"
L"\U00011C59\U00011D59\U00016A69\U00016B59\U0001D7D7\U0001D7E1"
L"\U0001D7EB\U0001D7F5\U0001D7FF\U0001E959",
};
static int wdigit(const wint_t wc)
{
int i;
for (i = 0; i < 10; i++)
if (wcschr(wdigits[i], wc))
return i;
return -1;
}
#endif /* WDIGIT_H */
On a Linux, *BSD, or Mac machine, you can compile the above using e.g.
gcc -std=c99 -Wall -Wextra -pedantic example.c -o example
or
clang -std=c99 -Wall -Wextra -pedantic example.c -o example
and test it using e.g.
printf 'Bengali decimal digit five is ৫.\n' | ./example
which outputs
Read 33 wide characters total.
25 letters
0 zeros (equivalent to '0')
0 ones (equivalent to '1')
0 twos (equivalent to '2')
0 threes (equivalent to '3')
0 fours (equivalent to '4')
1 fives (equivalent to '5')
0 sixes (equivalent to '6')
0 sevens (equivalent to '7')
0 eights (equivalent to '8')
0 nines (equivalent to '9')
6 whitespaces (including newlines and tabs)
1 punctuation characters
0 other printable characters
The above code is fully compliant with ISO C99 (and later versions of the ISO C standard), and should be completely portable.
However, note that not all C libraries fully support C99; the main one people have issues with is Microsoft C. I don't use Windows myself, but if you do, try using the UTF-8 codepage (chcp 65001). This is wholly and completely a Microsoft issue, as it apparently can support UTF-8 input with some nonstandard Windows extensions. They just don't want you to write portable code, it seems.

I need to ask 2 questions..
1st question: I know very well that '0' and '9' represent the ASCII values of 0 and 9 respectively. But what I don't understand is why we need to use the ASCII value and not the integer itself. Why can't we simply use
if (c >= 0 && c <= 9)
Let's start with basics. All user input, file input, etc. is given in characters, so when you need to compare the character you have just read, it must be compared against another character. Within the character set, digits 0-9 are represented with ASCII values 48-57, so character '0' is represented by 48, and so on.
Your test above tests whether c is a digit, an ASCII value between 48-57, so you must use the characters themselves within the comparison, e.g. if ('0' <= c && c <= '9') you then know c is a digit. This brings us to:
2nd question:
++ndigit[c-'0']
In any classification problem you do, you will generally use an array initialized to all zero with at least enough elements for the set (of characters here). You can split them out as an array of ten elements to hold your digits, uppercase, lowercase, etc...
Your ndigit array begins initialized to all zeros. The plan is to increment the proper element in the array each time a digit is encountered during your read. This is where you make use of the ASCII value for the bottom of the digits, '0' (48). Since your ndigit array is indexed 0-9, each time a digit is encountered it must be scaled (or mapped) into the correct index of ndigit (so that '0' is mapped to 0, '1' is mapped to 1, and so on).
Above through your test we determined, in this case, that c held a digit, so to classify that digit and have it map to the correct element of the ndigit array, we use c - '0'. If the digit in c is '3' (ASCII 51), then incrementing
++ndigit[c-'0'];
is actually indexing
++ndigit[51 - 48];
or
++ndigit[3]; /* since c was '3', we now increment ndigit[3], adding one more
occurrence of '3' to the data stored at ndigit[3] */
That way, when you are done, the ndigit array will hold the exact number of 0, 1, 2, 3, 4, ... digits found in your input. It takes a bit to wrap your head around the scheme, but all in all, you simply need somewhere to begin counting from zero to store the totals for each class you have seen (characters, digits, punctuation). An array that is sized for the character set will hold these values exactly when you are done, because each character has been classified and the corresponding ++ndigit[] element incremented to capture the information as you went along.
These, in a general sense, are called frequency arrays because they are used to store the frequency with which the individual members of a set appeared. There are many, many applications outside simply classifying characters.
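As a compact illustration of the same frequency-array idea applied to a string instead of standard input, here is a small sketch. The function name count_digits is invented for the example.

```c
#include <stddef.h>

/* Count how often each decimal digit occurs in `text`.
   On return, counts[d] holds the number of occurrences of the digit d. */
void count_digits(const char *text, size_t counts[10])
{
    size_t i;

    for (i = 0; i < 10; i++)
        counts[i] = 0;
    for (; *text != '\0'; text++) {
        if (*text >= '0' && *text <= '9')
            counts[*text - '0']++;   /* '3' - '0' == 3, so '3' hits counts[3] */
    }
}
```

For the input "a1b22c333 0" this leaves 1 in counts[0], 1 in counts[1], 2 in counts[2] and 3 in counts[3]; every other element stays 0.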
Look over all of the answers and let me know if you are still confused; I'm more than happy to help further.

getchar() returns character codes and sentinel values (EOF). So, we know c holds a character code inside the loop.
c-'0' is the distance on the character code "number line" from the value of c (a character code) to the code for '0'. Per the C standard, character codes must have these digits in consecutive order '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'. So, the expression computes the integer value of the digit character.

Condition to limit between 2 characters

I'm writing code that needs to limit the user to entering only characters from A to H. Anything greater than H should not be accepted.
I saw that with numbers I can use that like:
if (input == 0 - 9) return 1;
But how do I do that for A to H (char)?

The C Standard does not specify that character encoding should be ASCII, though it is likely. Nonetheless, it is possible for the encoding to be other (EBCDIC, for example), and the characters of the Latin alphabet may not be encoded in a contiguous sequence. This would cause problems for solutions that compare char values directly.
One solution is to create a string that holds valid input characters, and to use strchr() to search for the input in this string in order to validate:
#include <stdio.h>
#include <string.h>
int main(void)
{
    const char *valid_input = "ABCDEFGH";
    char input;
    printf("Enter a letter from 'A' - 'H': ");
    if (scanf("%c", &input) == 1) {
        if (input == '\0' || strchr(valid_input, input) == NULL) {
            printf("Input '%c' is invalid\n", input);
        } else {
            puts("Valid input");
        }
    }
    return 0;
}
This approach is portable, though solutions which compare ASCII values are likely to work in practice. Note that in the original code that I posted, an edge case was missed, as pointed out by @chux. It is possible to enter a '\0' character from the keyboard (or to obtain one by other methods), and since strchr() treats the terminating '\0' as part of the string, this would be accepted as valid input. I have updated the validation code to check for this condition.
Yet there is another advantage to using the above solution. Consider the following comparison-style code:
if (input >= 'A' && input <= 'H') {
    puts("Valid input");
} else {
    puts("Invalid input");
}
Now, suppose that conditions for valid input change, and the program must be modified. It is simpler to modify a validation string, for example to change to:
char *valid_input = "ABCDEFGHIJ";
With the comparison code, which may occur in more than one location, each comparison must be found in the code. But with the validation string, only one line of code needs to be found and modified.
Further, the validation string is simpler for more complex requirements. For example, if valid input is a character in the range 'A' - 'I' or a character in the range '0' - '9', the validation string can simply be changed to:
char *valid_input = "ABCDEFGHI0123456789";
The comparison method begins to look unwieldy:
if ((input >= 'A' && input <= 'I') || (input >= '0' && input <= '9')) {
puts("Valid input");
} else {
puts("Invalid input");
}
Do note that one of the few requirements placed on character encoding by the C Standard is that the characters '0', ..., '9' be encoded in a contiguous sequence. This does allow for portable direct comparison of decimal digit characters, and also for reliably finding the integer value associated with a decimal digit character through subtraction:
char ch = '3';
int num;
if (ch >= '0' && ch <= '9') {
printf("'%c' is a decimal digit\n", ch);
num = ch - '0';
printf("'%c' represents integer value %d\n", ch, num);
}

The if statement you present here is equal to:
if (input == -9) return 1;
which will return 1 in the case of an input equal to -9, so there is no range checking at all.
To allow numbers from 0 to 9 you have to compare like:
if (input >= 0 && input <= 9) /* range valid */
or with the characters that you want (A to H) [1]:
if (input >= 'A' && input <= 'H') /* range valid */
If you want to return 1 if the input is not in a valid range just put the logical not operator (!) in front of the condition:
if (!(input >= 'A' && input <= 'H')) return 1; /* range invalid */
[1] Take care when a condition uses a character range, because it relies on an encoding that assigns the letters increasing values with no gaps within the range (e.g. ASCII: A = 65, B = 66, C = 67, ..., Z = 90).
There are encodings where this rule breaks. As the other answer by @DavidBowling states, EBCDIC is one example (A = 193, B = 194, ..., I = 201, J = 209, ..., Z = 233); it has gaps within the range from A to Z. Nevertheless, the condition (input >= 'A' && input <= 'H') works with both encodings, because A to H is contiguous in each.
I have not come across an implementation where it fails, and it is very unlikely that you will. Most implementations use ASCII, for which the condition works.
Nevertheless, his answer provides a solution that works in every case.

It's as simple as:
if(input >='A' && input<='H') return 1;
C doesn't let you specify ranges like 0 - 9.
In fact that's an arithmetic expression "zero minus nine" and evaluates to minus nine (of course).
Nerd Corner:
As others point out this is not guaranteed by the C standard because it doesn't specify a character encoding though in practice all modern platforms encode these characters the same as ASCII. So it's very unlikely you will come unstuck and if you're working in an environment where it won't work you'd have been told!
A truly portable implementation could be:
#include <string.h>//contains strchr()
const char* alpha="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const char* pos=strchr(alpha,input);
if(pos!=NULL&&(pos-alpha)<8) return 1;
This tries to find the character in an alphabet string then determines if the character (if any) pointed to is before 'I'.
This is total over engineering and not the answer you're looking for.

Lexical Analyzer C program for identifying tokens

I wrote a C program for a lexical analyzer (a small piece of code) that will identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.
#include <stdio.h>
#include <conio.h>
#include <string.h>
char symTable[5][7] = { "int", "void", "float", "char", "string" };
int main() {
int i, j, k = 0, flag = 0;
char string[7];
char str[] = "int main(){printf(\"Hello\");return 0;}";
char *ptr;
printf("Splitting string \"%s\" into tokens:\n", str);
ptr = strtok(str, " (){};""");
printf("\n\n");
while (ptr != NULL) {
printf ("%s\n", ptr);
for (i = k; i < 5; i++) {
memset(&string[0], 0, sizeof(string));
for (j = 0; j < 7; j++) {
string[j] = symTable[i][j];
}
if (strcmp(ptr, string) == 0) {
printf("Keyword\n\n");
break;
} else
if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
string[j] == 3 || string[j] == 4 || string[j] == 5 ||
string[j] == 6 || string[j] == 7 || string[j] == 8 ||
string[j] == 9) {
printf("Constant\n\n");
break;
} else {
printf("Identifier\n\n");
break;
}
}
ptr = strtok(NULL, " (){};""");
k++;
}
_getch();
return 0;
}
With the above code, I am able to identify keywords and identifiers, but I couldn't obtain the result for numbers. I've tried using strspn(), but to no avail. I even replaced 0,1,2...,9 with '0','1',....,'9'.
Any help would be appreciated.

Here are some problems in your parser:
The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if you have a digit in the string indicating the start of a number.
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0': this will prevent matching operators such as (, )...
The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:
if you have white space, skip it
if you have //, it is a line comment: skip all characters up to the newline.
if you have /*, it is a block comment: skip all characters until you get the pair */.
if you have a ', you have a character constant: parse the characters, handling escape sequences until you get a closing '.
if you have a ", you have a string literal: do the same as for character constants.
if you have a digit, consume all subsequent digits, you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore: consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.
That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
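The steps above can be sketched as a single scanning function. Everything named here (token_kind, next_token, the keyword subset) is invented for the illustration, and two simplifications are assumed: character constants are reported with the same kind as string literals, and operators are returned one character at a time rather than grouped into == or >>=.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_END, TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_STRING, TOK_OPERATOR };

/* Hypothetical keyword subset, for illustration only. */
static const char *const keywords[] = { "int", "void", "float", "char", "return" };

static int is_keyword(const char *s, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strlen(keywords[i]) == len && memcmp(keywords[i], s, len) == 0)
            return 1;
    return 0;
}

/* Scan one token starting at *pp. On return, *start and *len describe the
   token text and *pp points just past it. The input is not modified. */
enum token_kind next_token(const char **pp, const char **start, size_t *len)
{
    const char *p = *pp;
    enum token_kind kind;

    for (;;) {                          /* skip white space and comments */
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n')
                p++;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;
        } else {
            break;
        }
    }

    *start = p;
    if (*p == '\0') {
        kind = TOK_END;
    } else if (*p == '"' || *p == '\'') {
        char quote = *p++;
        while (*p != '\0' && *p != quote) {
            if (*p == '\\' && p[1] != '\0')
                p++;                    /* skip the escaped character */
            p++;
        }
        if (*p == quote)
            p++;
        kind = TOK_STRING;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        kind = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        kind = is_keyword(*start, (size_t)(p - *start)) ? TOK_KEYWORD : TOK_IDENT;
    } else {
        p++;                            /* single-character operator or punctuator */
        kind = TOK_OPERATOR;
    }

    *len = (size_t)(p - *start);
    *pp = p;
    return kind;
}
```

Because the function only advances a pointer and reports start and length, the source string stays intact and separators such as ( and ) come back as tokens of their own, which avoids the strtok() problem mentioned above.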

When you're writing a lexer, always create a specific function that finds your tokens (the name yylex is used by the system tool Lex, which is why I used that name). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.
From your question it is not clear whether you just want to figure out which tokens are numbers, or whether you want the token plus the number's value. I will assume the first.
This is example code that finds whole numbers:
#include <ctype.h>
#include <stdio.h>
int yylex(void) {
    /* Read one character from standard input; int (not char) so that EOF is representable */
    int c = getchar();
    /* On a newline or end of file, return the end-of-input token */
    if (c == '\n' || c == EOF)
        return EOI;
    /* If we see a digit, consume the rest of the digits before returning.
       Note that input such as 123a is a lexical error, which this example
       does not detect. */
    if (isdigit(c)) {
        while (isdigit(c = getchar()))
            ;
        ungetc(c, stdin);
        return NUM;
    }
    /* Additional code for keywords, identifiers, errors, etc. */
}
Tokens EOI, NUM, etc. should be defined at the top. Later on, when you want to write syntax analysis, you use these tokens to figure out whether the code conforms to the language syntax or not. In lexical analysis, single-character tokens are usually not given names at all; your lexer function would simply return ')' for example. Because of that, named tokens should be defined with values above 255. For example:
#define EOI 256
#define NUM 257
If you have any further questions, feel free to ask.

string[j]==1
This test is wrong (1) on every C implementation I have heard of, since string[j] is a char holding an encoded character (in ASCII, UTF-8, or even the old EBCDIC used on IBM mainframes), and the encoding of the digit character 1 is not the number 1. On my Linux/x86-64 machine using UTF-8 (and on almost all machines, since they use ASCII or UTF-8), the character 1 is encoded as the byte 49 (that is, (char)49 == '1').
You probably want
string[j]=='1'
and you should consider using the standard isdigit (and related) function.
Be aware that UTF-8 is practically used everywhere but is a multi-byte encoding (of displayable characters). See this answer.
Note (1): the string[j]==1 test is probably misplaced too! Perhaps you might test isdigit(*ptr) at some better place.
PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...)
and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.

Own strcmp function - non standard chars

I am currently writing a little sort function. I can only use the stdio library, so I wrote my 'own strcmp' function.
int ownstrcmp(char a[], char b[])
{
int i = 0;
while( a[i] == b[i] )
{
if( a[i] == '\0' )
return 0;
++i;
}
return ( a[i] < b[i]) ? 1 : -1;
}
This works great for me. But there is one little problem: what can I do about non-standard characters like ä, ü, ß? Their decimal character values are greater than those of the normal letters, so it sorts the string 'example' behind 'ääää'.
I have already read about locale, but the only library that i can use is stdio.h. Is there a 'simple' solution for this problem?

Your question is somewhat vague. First of all, how characters with umlaut are represented depends on your encoding. For example, my computer's locale is set to Greek, meaning that in place of those special Latin characters I have Greek characters. You can't assume anything like that, as far as I can tell.
Second, the answer to your question depends on your representation. Are you still using a "one char per character" representation? If that's so, the above code might still work.
If you're using multi char representation, for example two chars per character, you should change your code so that it exits when two consecutive chars are \0.
Generally, you may want to look into how wchar_t and its family of functions (specifically wcscmp) are implemented.

For German, the umlauts ä, ö, ü and the ß will be sorted as if they occurred in their 'expanded' form:
ä -> ae
ö -> oe
ü -> ue
ß -> ss
In order to get the collation according to the standard you could expand the strings before comparing.
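A rough sketch of such an expansion step, written without any library functions, could look like this. It assumes the strings are encoded in ISO 8859-1 (where ä, ö, ü and ß are the single bytes 0xE4, 0xF6, 0xFC and 0xDF) and handles only the lowercase forms; the function name expand_umlauts is made up for the example.

```c
#include <stddef.h>

/* Copy src into dst, expanding ISO 8859-1 encoded umlauts.
   cap is the size of dst including the terminator; the output is
   truncated if it does not fit. */
void expand_umlauts(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return;
    for (; *src != '\0'; src++) {
        const char *rep;
        char single[2];

        switch ((unsigned char)*src) {
        case 0xE4: rep = "ae"; break;   /* a with diaeresis */
        case 0xF6: rep = "oe"; break;   /* o with diaeresis */
        case 0xFC: rep = "ue"; break;   /* u with diaeresis */
        case 0xDF: rep = "ss"; break;   /* sharp s */
        default:                        /* any other byte is copied as is */
            single[0] = *src;
            single[1] = '\0';
            rep = single;
            break;
        }
        for (; *rep != '\0' && n + 1 < cap; rep++)
            dst[n++] = *rep;
    }
    dst[n] = '\0';
}
```

Both strings would be expanded into temporary buffers first and the expanded copies then passed to the comparison function.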

You need to know the encoding the characters are in, and make sure you treat the strings properly. If the encoding is multi-byte, you must start reading (and comparing) individual characters, not bytes.
Also, the way to compare characters internationally varies with the locale, there's no single solution. In some languages, 'ä' sorts after 'z', in some it sorts right next to 'a'.
One simple way of implementing this is of course to create a table which holds the relative order for each character, like so:
unsigned char character_order[256];
character_order[(unsigned char) 'a'] = 1;
character_order[(unsigned char) 'ä'] = character_order[(unsigned char) 'a'];
/* ... and so on ... */
Then instead of subtracting the character's encoded value (which no longer can be used as a "proxy" for the sorting order of the character), you compare the character_order values.
The above assumes single-byte encoding, i.e. Latin-1 or something, since the array size is only 256.
Also note casts to unsigned char when indexing with character literals.
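Put together, a comparison that uses such a table might be sketched like this. The names init_order_latin1 and compare_by_order are invented for the example, the byte values assume ISO 8859-1, and the chosen order (ä sorting like a, and so on) is only one of the possible conventions.

```c
/* Fill `order` so that every byte ranks by its encoded value, except that
   the ISO 8859-1 umlauts rank like their base letters (an assumption). */
void init_order_latin1(unsigned char order[256])
{
    int i;

    for (i = 0; i < 256; i++)
        order[i] = (unsigned char)i;
    order[0xE4] = order[(unsigned char)'a'];   /* a with diaeresis */
    order[0xF6] = order[(unsigned char)'o'];   /* o with diaeresis */
    order[0xFC] = order[(unsigned char)'u'];   /* u with diaeresis */
}

/* Compare two strings by table rank instead of raw encoded value.
   Returns a negative value, 0, or a positive value, like strcmp. */
int compare_by_order(const char *a, const char *b, const unsigned char order[256])
{
    while (*a != '\0' && *b != '\0') {
        unsigned char ra = order[(unsigned char)*a];
        unsigned char rb = order[(unsigned char)*b];

        if (ra != rb)
            return ra < rb ? -1 : 1;
        a++;
        b++;
    }
    /* Here at least one string has ended: the shorter one sorts first. */
    if (*a == '\0' && *b == '\0')
        return 0;
    return *a == '\0' ? -1 : 1;
}
```

With this table, "ääää" ranks like "aaaa" and therefore sorts before "example", which is the order the question was after.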

If you are using a single-byte encoding such as ISO/IEC 8859-1 or ISO/IEC 8859-15, which are common for the German language (the same applies to ISO/IEC 8859-16), it's enough to convert your char to unsigned char.
In this way chars can be represented in interval 0-255, suitable for this standard.

Under UTF8 this can help, following your code
if ((a[i] > 0) ^ (b[i] > 0))
return a[i] > 0 ? 1 : -1;
else
return a[i] < b[i] ? 1 : -1;
But you have to check cases like ownstrcmp("ab", "abc");
Furthermore your code doesn't work like strcmp() in <string.h>
A value greater than zero indicates that the first character that does not match has a greater value in str1 than in str2; And a value less than zero indicates the opposite.
I would do it like this:
int ownstrcmp(char a[], char b[])
{
int i = 0;
while(a[i] == b[i]) {
if (a[i] == 0) return 0;
++i;
}
if ((a[i] == 0) || (b[i] == 0))
return a[i] != 0 ? 1 : -1;
if ((a[i] > 0) ^ (b[i] > 0))
return a[i] < 0 ? 1 : -1;
else
return a[i] > b[i] ? 1 : -1;
}
