Finding bad string in C

I am pulling information from a binary file in C and one of my strings is coming out as \\b\\3777\\375\\v\\177 in GDB. I want to be able to filter this sort of useless data out of my output in a generic way, i.e. anything that doesn't start with a digit or letter should be discarded. How can this be achieved?
The data is being buffered into a struct n bytes at a time, and I am sure that this information is correct based on how data later in the file is being read correctly.

if( isalnum( (unsigned char)buf[ 0 ] ) ) {
printf( "%s", buf );
}

It sounds a bit like you're reimplementing the Linux utility strings.
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.
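As a rough sketch of that idea (not the actual GNU implementation), the function below scans a byte buffer and prints each run of printable characters that is at least min_len bytes long. The name print_printable_runs and its parameters are hypothetical, and unlike the description above it also flushes a run that ends at the end of the buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: print every run of printable bytes in data[0..len)
 * that is at least min_len bytes long, one run per line.
 * Assumes min_len >= 1. Returns the number of runs printed. */
size_t print_printable_runs(const unsigned char *data, size_t len, size_t min_len)
{
    size_t runs = 0;
    size_t start = 0;               /* index where the current run began */

    for (size_t i = 0; i <= len; i++) {
        /* Treat the end of the buffer like an unprintable byte so the
         * final run is flushed too. */
        int printable = (i < len) && isprint(data[i]);
        if (!printable) {
            if (i - start >= min_len) {
                printf("%.*s\n", (int)(i - start), (const char *)(data + start));
                runs++;
            }
            start = i + 1;          /* next run can begin after this byte */
        }
    }
    return runs;
}
```

Because isprint() depends on the current locale, bytes above 0x7E are rejected only in the default "C" locale.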

The printable ASCII characters are in the range 0x20 (' ', space) to 0x7E ('~', tilde), so you can use this test:
if( (buf[0] >= 0x20) && ( buf[0] <= 0x7E ) )
{
printf( "%s", buf );
}
This accepts any string whose first byte is a printable ASCII character; it does not check the rest of the string.

Iterate over your bytes, and check the value of each one to see if it is one of the characters that you consider to be valid. I don't know what you consider to be "an integer or char" (i.e. valid values) but you can try comparing the characters to (for example) ensure that:
(c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
The above condition will ensure that the character's ASCII value is either a number (0 through 9) or a capital or lowercase English letter. Then you have to decide what to do when you encounter a character that you don't want. You can either replace the "bad" character with something "safe" (like a space) or you can build up a new string in a separate buffer, containing only the "good" characters.
Note that the above condition will only work for English, doesn't work for accented characters, and all punctuation and whitespace is also excluded. Another possible test would be to see if the character is a printable ASCII character ((c >= 0x20 && c <= 0x7e) || c == 0xa || c == 0xd which also includes punctuation, space and CR/LF). And this doesn't even get started trying to deal with encodings that aren't ASCII-compatible.
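The two options described above (building a new string from the "good" bytes, or overwriting the "bad" ones) could be sketched roughly as follows. The helper names keep_printable and replace_unprintable are hypothetical, and the sketch assumes the printable-ASCII range test rather than the stricter alphanumeric one.

```c
#include <stddef.h>

/* Printable ASCII: space (0x20) through tilde (0x7e). */
static int is_printable_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7e;
}

/* Hypothetical helper: copy only the printable ASCII bytes of src[0..len)
 * into dst, which must have room for len + 1 bytes. Returns the number of
 * bytes kept; dst is always NUL-terminated. */
size_t keep_printable(char *dst, const unsigned char *src, size_t len)
{
    size_t kept = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_printable_ascii(src[i]))
            dst[kept++] = (char)src[i];
    }
    dst[kept] = '\0';
    return kept;
}

/* Hypothetical helper: overwrite every non-printable byte in place. */
void replace_unprintable(unsigned char *buf, size_t len, unsigned char fill)
{
    for (size_t i = 0; i < len; i++) {
        if (!is_printable_ascii(buf[i]))
            buf[i] = fill;
    }
}
```

Both functions take an explicit length instead of relying on a terminating 0 byte, since data read from a binary file may contain embedded zeros.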

Related

Comparing single quote numbers instead of regular numbers(numbers without quotes)? C Programming Language K&R

This is part of a code to count white spaces, numbers, or other from the K&R "C programming book." I am confused why it compares "int c" to digits using '0' and '9' instead of 0 and 9. I realize the code doesn't work if I use 0 and 9 without quotes. I am just trying to understand why. Does this have to do with c being equal to getchar()?
while ((c = getchar()) != EOF)
if (c >= '0' && c <= '9')
++ndigit[c-'0'];
else if (c == ' ' || c == '\n' || c == '\t')
++nwhite;
else
++nother;
Looking at the man page for getchar, we see that it returns the character read as an unsigned char cast to an int. So the value stored is the character's code (its ASCII code on most systems), not the numeric value of the digit, and it can be compared with character constants such as '0' and '9'.
A char is usually just an integer whose meaning is given by some character set, for example ASCII.
So for example we could store "Hello" as the sequence 72, 101, 108, 108 and 111.
Using single quotes (as in '9') we say that we mean the number which represents the character '9'. Behind the scenes the computer only knows numbers, so this ends up as the code 57 in ASCII.
The same goes for the chars in our input data: they too are encoded as numbers according to the character set we're using.
In contrast, if we used a plain 9 we would be asking for exactly the code 9, not "the code which represents the character 9". That's the difference.
BTW: There's another "trick" used in the code sample: c-'0', which subtracts "the code behind the character '0'" from our current character c. If we do this, we end up with the digit not as the character, but as the number behind it. Example:
Assume c is the character '4'.
So in c it is stored as the code 52 (see ASCII table)
If we now want the numeric value 4 in place of the character '4' we just subtract the character '0' from it (code 48 in ASCII)
So 52 - 48 will end up as 4 (not a char but the number behind it)
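The same subtraction can be used to turn a whole run of digit characters into a number. This is only a sketch: the name parse_digits is hypothetical, and overflow and signs are deliberately ignored.

```c
/* Hypothetical helper: convert a run of leading decimal digits to an int.
 * Stops at the first non-digit; overflow is not handled in this sketch. */
int parse_digits(const char *s)
{
    int value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (*s - '0');   /* character code -> digit value */
        s++;
    }
    return value;
}
```

This works in any character set C supports, because the standard guarantees that the codes for '0' through '9' are consecutive.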
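getchar() returns an int rather than a char so that it can return EOF (a negative value, typically -1) in addition to every valid character value. If it returned a char, there would be no distinct value left to signal end of file or an error.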
Moreover '9' is a literal character constant, whose value is the character set code for the digit character '9' and not the integer value 9, and in C (but not C++) has type int, so there is in any case no type mismatch in the expression c <= '9' for example, it is an int comparison.
Even if that were not the case, and character constants had char type (as in C++), there would be an implicit promotion to int before the comparison.
Also, you need to understand that a char is not specifically a character, but rather simply an integer type that is the:
Smallest addressable unit of the machine that can contain basic character set.

Check if array is ASCII

How do I check in C if an array of uint8 contains only ASCII elements?
If possible please refer me to the condition that checks if an element is ASCII or not
Your array elements are uint8, so must be in the range 0-255
For standard ASCII character set, bytes 0-127 are used, so you can use a for loop to iterate through the array, checking if each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string
From your example comment, this could be implemented like this:
int checkAscii (uint8 *array) {
for (int i=0; i<LEN; i++) {
if (array[i] > 127) return 0;
}
return 1;
}
It breaks out early at the first element greater than 127.
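A variant of the same loop that takes the length as a parameter, rather than relying on a LEN macro, might look like this. The name is_ascii_buffer is hypothetical, and uint8_t from stdint.h is assumed to be the intended element type.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every byte in array[0..len) is in the 7-bit ASCII range
 * (0 to 127), and 0 as soon as a larger value is found. */
int is_ascii_buffer(const uint8_t *array, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (array[i] > 127)
            return 0;
    }
    return 1;
}
```

An empty buffer is reported as ASCII here; whether that is the right answer depends on the application.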
All valid ASCII characters have value 0 to 127, so the test is simply a value check or 7-bit mask. For example given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly however you intended only printable ASCII characters (excluding control characters). In that case, given inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
(isprint(ch) || isspace(ch)) ;
Your intent may be slightly different in terms of what characters you intend to include in your set, in which case other functions in ctype.h may be applied, or you can simply test the values individually or by range to include or exclude them.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" set uses locale specific codepages to define the glyphs associated with codes 128 to 255. That is to say the set changes depending on language/locale settings to accommodate different language characters, accents and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (of which there are several with either fixed or variable length codes). UTF-8 encoding is a variable width encoding where all single byte encodings are also ASCII codes. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set, and that is application specific.
You cannot check if something is "ASCII" with standard C, because C does not specify which character set (symbol table) a compiler uses. Various other more or less exotic symbol tables exist or have existed.
UTF8 for example, is a superset of ASCII. Older, dysfunctional 8 bit symbol tables have existed, such as EBCDIC and "Extended ASCII". To tell if something is for example ASCII or EBCDIC can't be done trivially, without a long line of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check that only the low 7 bits are set, if((ch & 0x7F)==ch).
In C, a character variable holds the character's code (in ASCII, an integer between 0 and 127) rather than the character itself.
The ASCII values of the lowercase letters are 97 to 122, and those of the uppercase letters are 65 to 90.
Instead of giving the actual code, here is an example.
You can assign int to char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach.
An integer can be assigned directly to a character. A character is different mostly just because how it is interpreted and used.
char c = atoi("47");
Try to implement this once you properly understand this logic.

Program that converts character capitalization

I'm making a program that, if the user inputs a lowercase character, generates its character in uppercase, and the opposite too. I'm using a function in order convert the character into lowercase or uppercase based on the ASCII table. Lowercase to uppercase is being converted correctly, but uppercase to lowercase is not.
char changeCapitalization(char n)
{
//uppercase to lowercase
if(n>=65 && n<=90)
n=n+32;
//lowercase to uppercase
if(n>= 97 && n<=122)
n=n-32;
return n;
}
What the others are essentially saying is you want something like this ('else if' instead of 'if' on the lower to upper logic):
char changeCapitalization(char n)
{
if(n>=65 && n<=90) //uppercase to lowercase
n=n+32;
else if(n>= 97 && n<=122) //lowercase to uppercase
n=n-32;
return n;
}
Change the line
if(n>= 97 && n<=122)
to
else if(n>= 97 && n<=122)
because the two conversions are opposites, so only one of them should run for a given character.
Two if statements in sequence are executed, well, in sequence. So if you have an uppercase character, it will first be converted to lowercase, and afterwards the next if statement will convert it back to uppercase. When you want to check the second condition only if the first one wasn't true, put else in front of the second if.
Also, rather than using the ASCII codes directly, you can compare characters to each other: if (n >= 'A' && n <= 'Z').
Later, when you're more comfortable with programming and start doing bigger projects, you should use the language's built-in functions for working with strings and characters, such as islower() and isupper() - and if you need to support any non-English characters, you should read this great article on the intricacies of encoding international characters.
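For reference, a version built on the ctype.h functions might look like the sketch below. The name swap_case is hypothetical; the behavior for characters outside the basic alphabet depends on the current locale.

```c
#include <ctype.h>

/* Hypothetical variant of changeCapitalization built on <ctype.h>. */
char swap_case(char n)
{
    unsigned char c = (unsigned char)n;   /* ctype functions need a non-negative value */
    if (isupper(c))
        return (char)tolower(c);
    if (islower(c))
        return (char)toupper(c);
    return n;                             /* digits, punctuation, etc. are unchanged */
}
```

Returning from each branch has the same effect as the else if fix: at most one conversion is applied to a given character.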

How to convert string with escape sequence to one char in C

just to give you background. We have a school project where we need to write our own compiler in C. My task is to write a lexical analysis. So far so good but I am having some difficulties with escape sequences.
When I find an escape sequence and the escape sequence is correct I have it saved in a string which looks like this \xAF otherwise it is lexical error.
My problem is how do I convert the string containing only escape sequence to one char? So I can add it to "buffer" containing the rest of the string.
I had an idea about a massive table containing only escape sequences and then comparing it one by one but it does not seem elegant.
This solution can be used for numerical escape sequences of all lengths and type, both octal, hexadecimal and others.
What you do when you see a '\' is to check the next character. If it's a 'x' (or 'X') then you read one character, if it's a hexadecimal digit (isxdigit) then you read another. If the last is not a hexadecimal digit then put it back into the stream (an "unget" operation), and use only the first digit you read.
Each digit you read you put into a string, and then you can use e.g. strtol to convert that string into a number. Put that number directly into the token value.
For octal sequences, just read up to three octal digits instead.
For an example of a similar method see this old lexer I made many years ago. Search for the lex_getescape function. Note, though, that it uses direct arithmetic instead of strtoul to convert the escape code into a number, and does not use the standard isxdigit etc. functions either.
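The hexadecimal case described above could be sketched like this, working on a string rather than an input stream so that no "unget" is needed. The name parse_hex_escape and its return convention are hypothetical, and the two-digit limit follows the description above rather than the C standard's own rules for \x.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: s points at the backslash of a "\xH" or "\xHH"
 * escape. On success stores the byte value in *out and returns the number
 * of characters consumed (3 or 4); returns 0 if the escape is malformed. */
int parse_hex_escape(const char *s, unsigned char *out)
{
    char digits[3] = { 0 };            /* up to two digits plus terminator */
    int n = 0;

    if (s[0] != '\\' || (s[1] != 'x' && s[1] != 'X'))
        return 0;
    /* Collect at most two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)s[2 + n])) {
        digits[n] = s[2 + n];
        n++;
    }
    if (n == 0)
        return 0;                      /* "\x" with no digits is an error */
    *out = (unsigned char)strtol(digits, NULL, 16);
    return 2 + n;
}
```

The returned count tells the lexer how far to advance, which plays the role of putting back a character that turned out not to be a hex digit.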
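You can use the following code; call xString2char with your string.
/* Convert one hexadecimal digit to its value (0-15), or -1 if invalid. */
int x2char(const char c)
{
if (c >= '0' && c <= '9')
return c - '0';
if (c >= 'a' && c <= 'f')
return c - 'a' + 10; /* 'a' stands for 10, not 0 */
if (c >= 'A' && c <= 'F')
return c - 'A' + 10;
return -1; /* not a hex digit: handle the error as you like */
}
/* buf must hold an already validated "\xHH" sequence (two hex digits). */
char xString2char(const char* buf)
{
unsigned char ans; /* unsigned, so values above 0x7F are well defined */
ans = (unsigned char)x2char(buf[2]);
ans <<= 4;
ans += (unsigned char)x2char(buf[3]);
return (char)ans;
}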
This should work, just add the error checking & handling (in case you didn't already validate them in your code)
flex has start conditions, which enable contextual analysis.
For instance, there is an example for C comment analysis (between /* and */) in the flex manual:
<INITIAL>"/*" BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/" BEGIN(INITIAL);
[^*\n]+ /* eat comment in chunks */
"*" /* eat the lone star */
\n yylineno++;
}
The start condition also enables string literal analysis. There is an example of how to match C-style quoted strings using start conditions in the item Start Conditions, and there is also FAQ item titled "How do I expand backslash-escape sequences in C-style quoted strings?" in flex manual.
Probably this will answer your question.

Recognizing Spaces in text

I'm writing a program that deciphers sentences, syllables, and words given in a basic text file.
The program cycles through the file character by character.
It first looks if it is some kind of end-of-sentence marker, like ! ? : ; or ..
Then if the character is not a space or tab, it assumes it is a character.
Finally, it identifies that if it is a space or tab, and the last character before it was a valid letter/character (e.g. not an end-of-sentence marker), it is a word.
I was a bit light on the details, but here is the problem I have.
My word count is equal to my sentence count. What this means is that the program realizes a word stops when there is an end-of-sentence marker, but the real problem is that spaces are being treated as valid letters.
Here's my if statement, to decide if the character in question is a valid letter in a word:
else if(character != ' ' || character != '\t')
I've already ruled out end-of-sentence markers by that point in the program (in the original if, actually). From reading an ASCII table, 32 should be the space character.
However, when I output all of the characters that make it into that block of code, spaces are in there.
So what am I doing wrong? How can I stop spaces from getting through this if?
Thanks in advance, and I have a feeling the question may be a bit vague, or poorly worded. If you have any questions or need clarification, let me know.
You should not rely on actual numbers for characters: that depends upon the encoding your platform uses, and may not be ASCII. You can check for any particular character by simply testing against it. For example, to test if c is a space character:
if (c == ' ')
will work, is easier to read, and is portable.
If you want to skip all white-space, you should use #include <ctype.h> and then use isspace():
if (isspace((unsigned char)c))
Edit: As others said, your condition to check for "not a space" is wrong, but the above point still applies. So, your condition can be replaced by:
if (!isspace((unsigned char)c))
I note that
(character != 32 || character != 9)
is always true, because if the character is 32 it is not 9, and true OR false is true.
You probably mean
(character != ' ' && character != '\t')
It would probably be better to just compare against the specific characters you consider whitespace, and use && rather than ||:
if ((character != ' ') &&
(character != '\t'))
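Putting the whitespace test to work, a word counter can track whether it is currently inside a word. This is a minimal sketch with a hypothetical count_words function; it uses isspace() as suggested earlier and ignores the end-of-sentence markers that the question handles separately.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count whitespace-separated words in a C string. */
size_t count_words(const char *text)
{
    size_t words = 0;
    int in_word = 0;                       /* are we currently inside a word? */

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;                   /* whitespace ends the current word */
        } else if (!in_word) {
            in_word = 1;                   /* first character of a new word */
            words++;
        }
    }
    return words;
}
```

Counting at the start of each word, rather than at the space that follows it, means the last word is counted even when the text does not end in whitespace.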

Resources