I'm making a program that, if the user inputs a lowercase character, outputs the same character in uppercase, and vice versa. I'm using a function to convert the character to lowercase or uppercase based on the ASCII table. Lowercase to uppercase is converted correctly, but uppercase to lowercase is not.
char changeCapitalization(char n)
{
    //uppercase to lowercase
    if(n>=65 && n<=90)
        n=n+32;
    //lowercase to uppercase
    if(n>= 97 && n<=122)
        n=n-32;
    return n;
}
What the others are essentially saying is you want something like this ('else if' instead of 'if' on the lower to upper logic):
char changeCapitalization(char n)
{
    if(n>=65 && n<=90) //uppercase to lowercase
        n=n+32;
    else if(n>= 97 && n<=122) //lowercase to uppercase
        n=n-32;
    return n;
}
Change the line
if(n>= 97 && n<=122)
to
else if(n>= 97 && n<=122)
Otherwise an uppercase character that the first if has just converted to lowercase also satisfies this second condition and is converted straight back.
Two if statements in sequence are executed - well - in sequence. So if you have an uppercase character, it will first be converted to lowercase, and afterwards the next if statement will convert it back to uppercase. When you want to check the second condition only if the first one wasn't true, put else in front of the second if.
Also, rather than using the ASCII codes directly, you can compare characters to each other: if (n >= 'A' && n <= 'Z').
Later, when you're more comfortable with programming and start doing bigger projects, you should use the language's built-in functions for working with strings and characters, such as islower() and isupper() - and if you need to support any non-English characters, you should read this great article on the intricacies of encoding international characters.
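As a rough sketch of the built-in approach mentioned above, the same case swap can be written with the <ctype.h> functions. The name swap_case is made up for this illustration, and the behavior shown assumes the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical helper: swaps the case of a letter using <ctype.h>
   and returns any other character unchanged. */
char swap_case(char n)
{
    /* The ctype functions expect a value representable as unsigned char (or EOF). */
    unsigned char u = (unsigned char)n;

    if (isupper(u))
        return (char)tolower(u);   /* uppercase to lowercase */
    else if (islower(u))
        return (char)toupper(u);   /* lowercase to uppercase */
    return n;                      /* not a letter: leave as is */
}
```

The else if matters here for the same reason as in the hand-written version: without it, a letter that was just lowered would be raised again.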
This is part of a code to count white spaces, numbers, or other from the K&R "C programming book." I am confused why it compares "int c" to digits using '0' and '9' instead of 0 and 9. I realize the code doesn't work if I use 0 and 9 without quotes. I am just trying to understand why. Does this have to do with c being equal to getchar()?
while ((c = getchar()) != EOF)
    if (c >= '0' && c <= '9')
        ++ndigit[c-'0'];
    else if (c == ' ' || c == '\n' || c == '\t')
        ++nwhite;
    else
        ++nother;
Looking at the man page for getchar, we see that it returns the character read as an unsigned char cast to an int. So the value stored is not the numeric value of the digit but its character code (ASCII on most systems), and it can be compared with character constants such as '0' and '9'.
A char is usually just a small integer whose meaning is given by some character set, for example ASCII.
So for example we could store "Hello" as the sequence 72, 101, 108, 108 and 111.
Using single quotes (as in '9') we say that we mean the number that represents the character '9'. Behind the scenes the computer only knows numbers, so this ends up as the code 57 in our example (in the ASCII table, the character '9' maps to code 57).
The same goes for the characters in our input data: they are also encoded as numbers according to the character set we're using.
In contrast, if we used a plain 9 we would be asking for exactly the code 9, and not "the code which represents the character 9". That's the difference.
BTW: There's another "trick" used in the code sample. it is c-'0' which asks to subtract "the code behind the character '0'" from our current character c. If we do this, we will end up with the digit not as the character, but as the number behind it. Example:
Assume c is the character '4'.
So in c it is stored as the code 52 (see ASCII table)
If we now want the numeric value 4 in place of the character '4' we just subtract the character '0' from it (code 48 in ASCII)
So 52 - 48 will end up as 4 (not a char but the number behind it)
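The subtraction trick above can be sketched as a pair of small functions. The names digit_value and count_digits are invented for this example; the second one mirrors the ndigit counting from the question, but reads from a string instead of getchar().

```c
/* Hypothetical helper: returns 0..9 for the characters '0'..'9',
   or -1 for anything that is not a digit character. */
int digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';   /* e.g. '4' - '0' gives 4 */
    return -1;
}

/* Hypothetical helper: counts how often each digit occurs in s.
   The caller is expected to pass an array that starts out zeroed. */
void count_digits(const char *s, int ndigit[10])
{
    for (; *s != '\0'; ++s)
        if (*s >= '0' && *s <= '9')
            ++ndigit[*s - '0'];   /* the character's numeric value is the index */
}
```

Note that digit_value(4) returns -1: the plain number 4 is a control code, not the character '4'.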
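getchar() returns an int rather than a char so that it can return EOF (a negative value, typically -1) in addition to every possible character value. If it returned a char, there would be no value left over to signal end of file or an error.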
Moreover '9' is a literal character constant, whose value is the character set code for the digit character '9' and not the integer value 9, and in C (but not C++) has type int, so there is in any case no type mismatch in the expression c <= '9' for example, it is an int comparison.
Even if that were not the case, and a literal character constants had char type (as in C++), there would be an implicit type promotion to int before comparison.
Also, you need to understand that a char is not specifically a character, but rather simply an integer type that is the:
Smallest addressable unit of the machine that can contain basic character set.
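A small sketch of the type points made above, assuming the code is compiled as C rather than C++. Both function names are invented for this illustration.

```c
/* Returns 1 when a character constant occupies as many bytes as an int,
   which is the case in C (in C++ it would have type char instead). */
int char_constant_is_int_sized(void)
{
    return sizeof('9') == sizeof(int);
}

/* Returns 1 if c, an int such as the one getchar() returns,
   holds the code of a digit character, and 0 otherwise. */
int is_digit_char(int c)
{
    return c >= '0' && c <= '9';   /* int compared with int */
}
```

Passing a negative value such as a typical EOF to is_digit_char simply yields 0, which is why storing the result of getchar() in an int is safe.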
This is code from the book "The C Programming Language" which maps a single character to lower case for the ASCII character set and returns it unchanged if the character is not an upper case letter:
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}
I don't understand the logic behind return c + 'a' - 'A';.
Why didn't they simply put ' ' or the number 32 instead of 'a' - 'A'?
In the ASCII character set 'a' - 'A' just happens to have a value of 32. It's completely unrelated to the ASCII space character ' ' also having a value of 32, so it makes no sense to replace 'a' - 'A' with ' '.
Using 'a' - 'A' is much more meaningful and understandable than 32, and also doesn't tie the implementation to using a specific character set (though a-z and A-Z need to be contiguous for it to work, which isn't true for all character sets).
Why not 32? Because "magic numbers" are bad.
By using 'a'-'A' it makes it clear to the reader that the difference in character encoding between upper case and lower case is being added to the current character encoding.
Note that this also depends on the set of upper case characters being contiguous as well as the set of lower case characters. This is true for ASCII but not necessarily in general.
c - 'A': gives you the letter's position in the alphabet, not in the character set:
so for example if you pass 'A' to c - 'A' you get 0, because anything minus itself is zero; if you pass 'B' you get 1; if you pass 'C' you get 2, and so on. You get a number from 0 to 25 (the English alphabet has 26 letters, which are counted from 0 here).
+ 'a': turns your upper-case letter into a lower-case letter. It places your letter's position into the lower-case sequence in the character set.
so for example if you pass 'A', you get 0; then 0 + 'a' gives you the letter 'a'. If you pass 'B', you get 1; then 1 + 'a' gives you 'b', which comes right after 'a'. If you pass 'C', you get 2; then 2 + 'a' gives you 'c', which is two letters after 'a', and so on.
Also consider the following:
This function is designed to work with character sets whose letter order matches English alphabetical order: the letters are contiguous (A, B, C, D...) and each lower-case letter sits a fixed distance from its upper-case counterpart. ASCII is one such character set.
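To make the two steps explicit, here is a sketch that splits c + 'a' - 'A' into separate statements. The name lower_in_steps is invented for this example, and it assumes, like the original, that A-Z and a-z are each contiguous in the character set.

```c
/* Hypothetical helper: same result as the book's lower(), written in two steps. */
int lower_in_steps(int c)
{
    if (c >= 'A' && c <= 'Z') {
        int index = c - 'A';   /* position in the alphabet: 0 for 'A', 25 for 'Z' */
        return 'a' + index;    /* the letter at that position in the lower-case run */
    }
    return c;                  /* not an upper-case letter: unchanged */
}
```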
It's the same reason as for writing things like
val = 10 * val + digitchar - '0';
when you're writing code to convert a string of digits to the corresponding integer.
The "obvious" way to write it would be
val = 10 * val + digitchar - 48;
But where did that magic number 48 come from? You had to look it up on the ASCII chart, and if I'm not familiar with it, I have to look it up on an ASCII chart to figure out how your program works. It saves both of us time if you write the constant '0' instead. (And, incidentally, using the constant '0' means that the program is portable to a machine using a character set other than ASCII, if anyone cares.)
Similarly, if I know that the codes for the upper- and the lower-case letters are in the same order but separated by some amount, using the computation 'a' - 'A' to represent that amount is again easier on both me and my reader than it would be if I went to my ASCII chart and worked out that the offset is actually 32.
In both cases, the principle here is Let the machine do the dirty work.
I agree, it's a little cryptic at first. If you're used to looking things up on the ASCII chart whenever you need to, it can be very disorienting to see those strange scraps of code like 'a' - 'A' and digitchar - '0'. Once you get used to the idioms, though, they're so much easier and less trouble.
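For context, the digit idiom above usually sits inside a loop like the following sketch. The name digits_to_int is invented for this example; it stops at the first non-digit and deliberately ignores signs and overflow.

```c
/* Hypothetical helper: converts a leading run of decimal digits to an int.
   No sign handling and no overflow checking. */
int digits_to_int(const char *s)
{
    int val = 0;

    while (*s >= '0' && *s <= '9') {
        val = 10 * val + (*s - '0');   /* shift previous digits left, add the new one */
        ++s;
    }
    return val;
}
```

For real input, the standard strtol function is the more robust choice, since it reports overflow and where parsing stopped.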
Think about it from the authors' perspective: they had tried to build up the reader's knowledge on earlier pages, where ASCII codes were introduced. They had also discussed type conversion in an earlier code example and paragraph. For beginners, the authors were trying to show that int and char can be used interchangeably. As others have said, it is also cleaner. Even for someone who already knows C programming, this question might arise.
Writing 'a' - 'A' also lets us avoid the magic number 32.
I'm currently reading in a file and storing that file in a C string. I'm using strtok to parse out the first few strings I'm interested in. After that, the substrings could be numbers (500, 150, 30) or character combinations (P(4), K(5)). Is there an easy method in the string library to differentiate between numbers and letters?
Thanks for the answers guys!
If you are sure that there are no other symbols (##$%^%&*^) you can use the isalpha() function.
Usage:
isalpha(p); // returns nonzero if p is alphabetic and zero otherwise
Also note that you should include ctype.h.
You are probably looking for the isalpha and isdigit library functions.
Well, if you're reading a stream of bytes and want to differentiate between numbers and letters, the following can be done:
// returns true if the given char is an English letter, false otherwise
bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
which is easy enough to implement where it is needed. If you really want a library function, you can still use isalpha() or isdigit() from ctype.h, which basically should do the same thing.
N.B.: you might want to choose between bool or unsigned short. I won't enter into that debate.
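As one possible way to apply isdigit to whole tokens such as those produced by strtok, the sketch below checks every character of a token. The name is_number_token is made up for this illustration, and it only recognizes unsigned decimal integers.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s is non-empty and consists only of
   decimal digits, and 0 otherwise. */
int is_number_token(const char *s)
{
    if (*s == '\0')
        return 0;                           /* an empty token is not a number */
    for (; *s != '\0'; ++s)
        if (!isdigit((unsigned char)*s))    /* cast avoids passing negative values */
            return 0;
    return 1;
}
```

With the inputs from the question, "500" would count as a number, while "P(4)" would not because of the letter and the parentheses.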
Just to give you some background: we have a school project where we need to write our own compiler in C. My task is to write the lexical analyzer. So far so good, but I am having some difficulties with escape sequences.
When I find an escape sequence and it is correct, I have it saved in a string which looks like this: \xAF. Otherwise it is a lexical error.
My problem is how do I convert the string containing only escape sequence to one char? So I can add it to "buffer" containing the rest of the string.
I had an idea about a massive table containing only escape sequences and then comparing it one by one but it does not seem elegant.
This solution can be used for numerical escape sequences of all lengths and types: octal, hexadecimal and others.
What you do when you see a '\' is to check the next character. If it's a 'x' (or 'X') then you read one character, if it's a hexadecimal digit (isxdigit) then you read another. If the last is not a hexadecimal digit then put it back into the stream (an "unget" operation), and use only the first digit you read.
Each digit you read you put into a string, and then you can use e.g. strtol to convert that string into a number. Put that number directly into the token value.
For octal sequences, just read up to three octal digits instead.
For an example of a similar method see this old lexer I made many years ago; search for the lex_getescape function. Note, though, that it uses direct arithmetic instead of strtoul to convert the escape code into a number, and does not use the standard isxdigit etc. functions either.
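The hexadecimal case described above might look roughly like this when the escape is already held in a string rather than read from a stream. The function parse_hex_escape and its end parameter are assumptions made for this sketch; isxdigit and strtol are the standard library functions.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: s points at text such as "\xAF" (backslash, x, digits).
   Reads at most two hex digits and returns their value, or -1 if the text is
   not a valid hex escape. If end is not NULL, *end is set to the first
   character that was not consumed. */
int parse_hex_escape(const char *s, const char **end)
{
    char digits[3];   /* up to two digits plus the terminating NUL */
    int n = 0;

    if (s[0] != '\\' || (s[1] != 'x' && s[1] != 'X'))
        return -1;
    s += 2;

    while (n < 2 && isxdigit((unsigned char)*s))
        digits[n++] = *s++;   /* collect digits; stop at the first non-digit */
    if (n == 0)
        return -1;            /* "\x" with no digits is an error */
    digits[n] = '\0';

    if (end != NULL)
        *end = s;
    return (int)strtol(digits, NULL, 16);
}
```

Leaving the first non-digit unconsumed plays the role of the "unget" operation in the stream-based description. An octal variant would collect up to three digits in the range '0' to '7' and call strtol with base 8.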
You can use the following code; call xString2char with your string.
int x2char(const char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10; // 'a'..'f' stand for 10..15
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10; // 'A'..'F' stand for 10..15
    //if we got here it's an error - handle it as you like...
    return -1; // placeholder error value
}
char xString2char(const char* buf)
{
    char ans;
    ans = x2char(buf[2]);
    ans <<= 4;
    ans += x2char(buf[3]);
    return ans;
}
This should work, just add the error checking & handling (in case you didn't already validate them in your code)
flex has start conditions, which enable contextual analysis.
For instance, there is an example of C comment analysis (between /* and */) in the flex manual:
<INITIAL>"/*" BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/" BEGIN(INITIAL);
[^*\n]+ /* eat comment in chunks */
"*" /* eat the lone star */
\n yylineno++;
}
The start condition also enables string literal analysis. There is an example of how to match C-style quoted strings using start conditions in the item Start Conditions, and there is also FAQ item titled "How do I expand backslash-escape sequences in C-style quoted strings?" in flex manual.
Probably this will answer your question.
I am pulling information from a binary file in C and one of my strings is coming out as \\b\\3777\\375\\v\\177 in GDB. I want to be able to parse this sort of useless data out of my output in a non-specific way, i.e. anything that doesn't start with a number/character should be kicked out. How can this be achieved?
The data is being buffered into a struct n bytes at a time, and I am sure that this information is correct based on how data later in the file is being read correctly.
if( isalnum( (unsigned char)buf[ 0 ] ) ) {
    printf( "%s", buf );
}
It sounds a bit like you're reimplementing the Linux utility strings.
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.
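The core idea in that description, finding runs of printable characters of a minimum length, can be sketched as follows. This is not how GNU strings is implemented; count_printable_runs is an invented name, printable is taken to mean ASCII 0x20 to 0x7E, and unlike the description above a run that reaches the end of the buffer is counted too.

```c
#include <stddef.h>

/* Hypothetical helper: counts runs of printable ASCII bytes (0x20..0x7E)
   in buf that are at least min_len bytes long. Assumes min_len >= 1. */
size_t count_printable_runs(const unsigned char *buf, size_t len, size_t min_len)
{
    size_t runs = 0;
    size_t run = 0;   /* length of the run currently being scanned */

    for (size_t i = 0; i < len; ++i) {
        if (buf[i] >= 0x20 && buf[i] <= 0x7E) {
            ++run;
        } else {
            if (run >= min_len)
                ++runs;   /* an unprintable byte ends a long-enough run */
            run = 0;
        }
    }
    if (run >= min_len)
        ++runs;           /* a run that reaches the end of the buffer */
    return runs;
}
```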
As the printable ASCII characters are in the range 0x20 (' ', space) to 0x7E ('~', tilde), you can use this test:
if( (buf[0] >= 0x20) && ( buf[0] <= 0x7E ) )
{
    printf( "%s", buf );
}
This will accept any string that starts with a printable ASCII character.
Iterate over your bytes, and check the value of each one to see if it is one of the characters that you consider to be valid. I don't know what you consider to be "an integer or char" (i.e. valid values) but you can try comparing the characters to (for example) ensure that:
(c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
The above condition will ensure that the character's ASCII value is either a number (0 through 9) or a capital or lowercase English letter. Then you have to decide what to do when you encounter a character that you don't want. You can either replace the "bad" character with something "safe" (like a space) or you can build up a new string in a separate buffer, containing only the "good" characters.
Note that the above condition will only work for English, doesn't work for accented characters, and all punctuation and whitespace is also excluded. Another possible test would be to see if the character is a printable ASCII character ((c >= 0x20 && c <= 0x7e) || c == 0xa || c == 0xd which also includes punctuation, space and CR/LF). And this doesn't even get started trying to deal with encodings that aren't ASCII-compatible.
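The second option described above, building a new string that contains only the "good" characters, might be sketched like this. The name keep_alnum and its buffer-size parameter are assumptions for this example, and the test is the English-only alphanumeric condition shown earlier.

```c
#include <stddef.h>

/* Hypothetical helper: copies only ASCII digits and English letters from src
   into dst, which has room for dst_size bytes including the terminating NUL.
   Returns the number of characters copied. */
size_t keep_alnum(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return 0;                       /* no room even for the terminator */
    for (; *src != '\0' && n + 1 < dst_size; ++src) {
        char c = *src;
        if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            dst[n++] = c;               /* keep the "good" character */
    }
    dst[n] = '\0';
    return n;
}
```

Since the data in the question comes from a binary file and may contain NUL bytes, a variant that takes an explicit length instead of relying on the terminator would be the safer design.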