Easy way to differentiate between number and letter in C - c

I'm currently reading in a file, and storing that file into a cstring. I'm using strtok to parse it out for the first few strings i'm interested in. After that, the substrings could be numbers (500,150,30) or character combinations( P(4),(K(5)). Is there an easy method in the string library to differentiate between numbers and letters? \
Thanks for the answers guys!

If you are sure that there are no other symbols (##$%^%&*^) you can use the isalpha() function.
Usage:
isalpha(p);// returns true if its alphabetic and false otherwise.
Also note that you should include ctype.h.

You probably look for isalpha and isdigit library functions.

well, if you're reading a stream of bytes and want to differentiate between numbers and letters the following can be done:
// returns true if given char is a character, false otherwise
bool is_letter(char c) {
return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
which is easy enough to implement where it is needed. If you really want a library function, you can still use isalpha() or isdigit() from ctype.h, which basically should do the same thing.
N.B.: you might want to choose between bool or unsigned short. I won't enter into that debate.

Related

Counting Turkish character in C

I'm trying to write a program that counts all the characters in a string at Turkish language. I can't see why this does not work. i added library, setlocale(LC_ALL,"turkish") but still doesn't work. Thank you. Here is my code:
My file character encoding: utf_8
int main(){
setlocale(LC_ALL,"turkish");
char string[9000];
int c = 0, count[30] = {0};
int bahar = 0;
...
if ( string[c] >= 'a' && string[c] <= 'z' ){
count[string[c]-'a']++;
bahar++;
}
my output:
a 0.085217
b 0.015272
c 0.022602
d 0.035736
e 0.110263
f 0.029933
g 0.015272
h 0.053146
i 0.071167
k 0.010996
l 0.047954
m 0.025046
n 0.095907
o 0.069334
p 0.013745
q 0.002443
r 0.053451
s 0.073916
t 0.095296
u 0.036958
v 0.004582
w 0.019243
x 0.001527
y 0.010996
This is English alphabet but i need this characters calculate too: "ğ,ü,ç,ı,ö"
setlocale(LC_ALL,"turkish");
First: "turkish" isn't a locale.
The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.
What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").
if ( string[c] >= 'a' && string[c] <= 'z' ){
Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.
To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.
Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using
if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
abort();
}
That will stop the program from executing. Anything after that code means that the locale was set correctly.
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). However, this isn't entirely portable to other platforms. It's a lot cleaner than the alternatives, however, which is likely more important in this particular case.
If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.
Once you're ready to read the file, you can use fgetws with a wchar_t array.
Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:
// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
count[(size_t)(lchptr-lower)]++;
bahar++;
} else if (uchptr) {
count[(size_t)(uchptr-upper)]++;
bahar++;
}
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[8]++), just like İ (\u0130) and i are considered the same (count[29]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
Edit
As #JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. This, however, would only allow you to know that the character is an alphabetic character; it wouldn't tell you the index of your count array to use, and the truth is that there is no universal answer to do so because some languages use only a few characters with diacritic marks rather than an entire group where you can just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even when you have read the data, you still need to convert it to fit your solution, and that requires knowledge of what you're working with to create a mapping from characters to integer values, which my code does by using strings of valid characters and the indices of each character in the string as the indices of the count array (i.e. lower[29] will mean count[29]++ is executed, and upper[18] will mean count[18]++ is executed).
The solution depends on the character encoding of your files.
If the file is in ISO 8859-9 (latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: You already have a distiction between upper case and lower case. Just add more branches for the special characters.
If the file is in UTF-8, or some other unicode encoding, you need a multi-byte capable string library.

Program that converts character capitalization

I'm making a program that, if the user inputs a lowercase character, generates its character in uppercase, and the opposite too. I'm using a function in order convert the character into lowercase or uppercase based on the ASCII table. Lowercase to uppercase is being converted correctly, but uppercase to lowercase is not.
char changeCapitalization(char n)
{
//uppercase to lowercase
if(n>=65 && n<=90)
n=n+32;
//lowercase to uppercase
if(n>= 97 && n<=122)
n=n-32;
return n;
}
What the others are essentially saying is you want something like this ('else if' instead of 'if' on the lower to upper logic):
char changeCapitalization(char n)
{
if(n>=65 && n<=90) //uppercase to lowercase
n=n+32;
else if(n>= 97 && n<=122) //lowercase to uppercase
n=n-32;
return n;
}
Chang the line
if(n>= 97 && n<=122)
with
else if(n>= 97 && n<=122)
Because this condition is the opposite way like you said in the question
Two if statements in sequence are executed - well - in sequence. So if you have an uppercase character, it will first be converted to lowercase, and afterwards, the next if statement will convert it back to lowercase. When you want to check the second condition only if the first one wasn't true, put else in front of the second if.
Also, rather than using the ASCII codes directly, you can compare characters to each other: if (n >= 'A' && n <= 'Z').
Later, when you're more comfortable with programming and start doing bigger projects, you should use the language's built-in functions for working with strings and characters, such as islower() and isupper() - and if you need to support any non-English characters, you should read this great article on the intricacies of encoding international characters.

How to convert string with escape sequence to one char in C

just to give you background. We have a school project where we need to write our own compiler in C. My task is to write a lexical analysis. So far so good but I am having some difficulties with escape sequences.
When I find an escape sequence and the escape sequence is correct I have it saved in a string which looks like this \xAF otherwise it is lexical error.
My problem is how do I convert the string containing only escape sequence to one char? So I can add it to "buffer" containing the rest of the string.
I had an idea about a massive table containing only escape sequences and then comparing it one by one but it does not seem elegant.
This solution can be used for numerical escape sequences of all lengths and type, both octal, hexadecimal and others.
What you do when you see a '\' is to check the next character. If it's a 'x' (or 'X') then you read one character, if it's a hexadecimal digit (isxdigit) then you read another. If the last is not a hexadecimal digit then put it back into the stream (an "unget" operation), and use only the first digit you read.
Each digit you read you put into a string, and then you can use e.g. strtol to convert that string into a number. Put that number directly into the token value.
For octal sequences, just up to three characters instead.
For an example of a similar method see this old lexer I made many years ago. Search for the
lex_getescape function. Though this method uses direct arithmetic instead of strtoul to convert the escape code into a number, and not the standard isxdigit etc. functions either.
you can use the following code, call xString2char with your string.
char x2char(const char c)
{
if (c >= '0' && c <= '9')
return c - '0';
if (c >= 'a' && c <= 'f')
return c - 'a';
if (c >= 'A' && c <= 'F')
return c - 'A';
//if we got here it's an error - handle it as you like...
}
char xString2char(const char* buf)
{
char ans;
ans = x2char(buf[2]);
ans <<= 4;
ans += x2char(buf[3]);
return ans;
}
This should work, just add the error checking & handling (in case you didn't already validate them in your code)
flex has a start condition. This enables contextual analysis.
For instance, there is an example for C comment analysis(between /* and */) in flex manual:
<INITIAL>"/*" BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/" BEGIN(INITIAL);
[^*\n]+ /* eat comment in chunks */
"*" /* eat the lone star */
\n yylineno++;
}
The start condition also enables string literal analysis. There is an example of how to match C-style quoted strings using start conditions in the item Start Conditions, and there is also FAQ item titled "How do I expand backslash-escape sequences in C-style quoted strings?" in flex manual.
Probably this will answer your question.

Tool functions for chars

I want to handle some char variables and would like to get a list of some functions that can do these tasks when it comes to handling chars.
Getting first characters of a char (var_name[1] doesnt seem to work)
Getting last characters of a char
Checking for char1 matches with char2 ( eg if "unicorn" matches words with "bicycle"
I am pretty sure some of these methods exist in libraries such as stdio.h or so but google isnt my friend.
EDIT:My 3rd question means not direct match with strcmp but single character match(eg if "hey" and "hello") have e as common letter.
Use var_name[0] to get first character (array indexes run from 0 to N - 1, where N is the number of elements in the array).
Use var_name[strlen(var_name) - 1] to get the last character.
Use strcmp() to compare two char strings.
EDIT:
To search for character in a string you can use strchr():
if (strchr("hello", 'e') && strchr("hey", 'e'))
{
}
There is also strpbrk() function that would indicate if two strings have any common characters:
if (strpbrk("hello", "hey"))
{
}
Assuming you mean a char[], and not a char which is a single character.
C uses 0-based indexing, var_name[0] gives you the first char.
strlen() gives you the length of the string, which together with my answer to 1. means
char lastchar = var_name[strlen(var_name)-1]; http://www.cplusplus.com/reference/clibrary/cstring/strlen/
strcmp(var_name1, var_name2) == 0. http://www.cplusplus.com/reference/clibrary/cstring/strcmp/
I am pretty sure some of these methods exist in libraries such as
stdio.h or so but google isnt my friend.
The string functions in the C standard library (libc) are described in the header file . If you're on a unix-ish machine, try typing man 3 string at a command line. You can then use the man program again to get more information about specific functions, e.g. man 3 strlen. (The '3' just tells man to look in "section 3", which describes the C standard library functions.)
What you're looking for is the string functions in the C runtime library. These are defined in string.h, not stdio.h.
But your list of problems is simple:
var_name[0] works perfectly well for accessing the first char in an array. var_name[ 1] doesn't work because arrays in C are zero-based.
The last char in an array is:
char c;
c = var_name[strlen(var_name)-1];
Testing for equality is simple:
if (var_name[0] == var_name[1])
; // they match
C and C++ strings are zero indexed. The memory you need to hold a particular length string has to be at least the string length and one character for the string terminator \0. So, the first character is array[0].
As #Carey Gregory said, the basic string handling functions are in string.h. But these are only primitives for handling strings. C is a low level enough language, that you have an opportunity to build up your own string handling library based on the functions in string.h.
On example might be that you want to pass a string pointer to a function and also the length of the buffer holding that sane string, not just the string length itself.

Input Until Non-Word Character in C, Like Java's \W

I want to input words from a file which are delimited by anything that isn't a letter. Is there an easy way to do this in C similar to using the \W predefined character classes in Java?
How should I approach this?
Thanks.
Character classes in general and \W specifically really aren't related to Java at all. They're just part of regular expression support, which is available for many languages, including GNU C. See GNU C library Regular Expressions.
You can make use of the is_ family of functions defined in <ctype.h>.
For instance, isalpha() will tell you if something is an alphabetic character. isspace() will tell you if something is whitespace.
You can use fscanf (provided you're guaranteed a max length of word)
char temp_str[1000];
char temp_w[1000]
while(fscanf(file, "%[a-zA-Z]%[^a-zA-Z]", temp_str, temp_w) != EOF) {
printf("STR: %s\n", temp_str);
}
Edit: this should do the trick, temp_str will hold the sequence of letters and temp_w will hold the delimiting characters inbetween

Resources