Recognizing Spaces in text

I'm writing a program that deciphers sentences, syllables, and words given in a basic text file.
The program cycles through the file character by character.
It first checks whether the character is some kind of end-of-sentence marker, like ! ? : ; or a period.
Then, if the character is not a space or tab, it assumes it is part of a word.
Finally, if the character is a space or tab, and the character before it was a valid letter/character (e.g. not an end-of-sentence marker), it counts a word.
I was a bit light on the details, but here is the problem I have.
My word count is equal to my sentence count. What this means is that the program realizes a word stops when there is an end-of-sentence marker, BUT the real problem is that spaces are considered valid letters.
Here's my if statement, which decides whether the character in question is a valid letter in a word:
else if(character != ' ' || character != '\t')
I've already ruled out end-of-sentence markers by that point in the program (in the original if, actually). From reading an ASCII table, 32 should be the space character.
However, when I output all of the characters that make it into that block of code, spaces are in there.
So what am I doing wrong? How can I stop spaces from getting through this if?
Thanks in advance. I have a feeling the question may be a bit vague or poorly worded, so if you have any questions or need clarification, let me know.

You should not rely on actual numbers for characters: that depends upon the encoding your platform uses, and may not be ASCII. You can check for any particular character by simply testing against it. For example, to test if c is a space character:
if (c == ' ')
will work, is easier to read, and is portable.
If you want to skip all white-space, you should use #include <ctype.h> and then use isspace():
if (isspace((unsigned char)c))
Edit: As others said, your condition to check for "not a space" is wrong, but the above point still applies. So, your condition can be replaced by:
if (!isspace((unsigned char)c))
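As a rough illustration of how the isspace() test could drive the word count from the question, here is a minimal sketch. The name count_words and the in_word flag are assumptions made for this example, and it works on a string rather than a file. It also treats punctuation as part of a word, which is simpler than the end-of-sentence handling the question describes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts runs of non-whitespace characters in text. */
size_t count_words(const char *text)
{
    size_t words = 0;
    int in_word = 0;    /* 1 while we are inside a run of non-space characters */

    for (; *text != '\0'; ++text) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;            /* whitespace ends the current word */
        } else if (!in_word) {
            in_word = 1;            /* first character of a new word */
            ++words;
        }
    }
    return words;
}
```

Counting at the start of each word, rather than at the space that follows it, means the last word is counted even when the text does not end in whitespace.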

I note that
(character != 32 || character != 9)
is always true, because if the character is 32 it is not 9, and true OR false is true. Any other character is different from both, so the condition holds for it as well.
You probably mean
(character != ' ' && character != '\t')
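To make the difference concrete, the two conditions can be wrapped in small functions and compared. The function names here are made up for the illustration; only the two expressions come from the discussion above.

```c
/* The condition from the question: no character can equal both ' ' and '\t',
   so at least one of the two inequalities always holds. */
int passes_with_or(int character)
{
    return character != ' ' || character != '\t';
}

/* The corrected condition: true only when the character is neither a space nor a tab. */
int passes_with_and(int character)
{
    return character != ' ' && character != '\t';
}
```

With these, passes_with_or(' ') yields 1 (the space slips through), while passes_with_and(' ') yields 0.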

It would probably be better to just compare against the specific characters you consider whitespace, and to combine the tests with &&:
if ((character != ' ') &&
    (character != '\t'))

Related

K&R - section 1.9: understanding character arrays (and incidentally buffers)

Let's start with a very basic question about character arrays that I could not understand from the description in the book:
Does every character array end with '\0'?
Is the length of it always equal to the number of characters + 1 for '\0'?
Meaning that if I specify a character array length of 10, I would be able to store only 9 characters that are not '\0'?
Or does the '\0' come after the last array slot, so all 10 slots could be used for any character and an 11th non-reachable slot would contain the '\0' char?
Going further into the example in this section, it defines a getline() function that reads a string and counts the number of characters in it.
You can see the entire code here (in this example getline() was changed to gline(), since getline() is already defined in newer stdio.h libraries).
Here's the function:
int getline(char s[], int lim) {
    int c, i;
    for (i = 0; i < lim - 1 && (c = getchar()) != EOF && c != '\n'; ++i) {
        s[i] = c;
    }
    if (c == '\n') {
        s[i] = c;
        ++i;
    }
    s[i] = '\0';
    return i;
}
It is explained that the array stores the input in this manner:
[h][e][l][l][o][\n][\0]
and the function will return a count of 6, including the '\n' char,
but this is only true if the loop exits because of a '\n' char.
If the loop exits because it has reached its limit, it will return an array like this (as I understand it):
[s][n][a][z][z][y][\0]
Now the count will also be 6.
Comparing both strings will return that they're equal when clearly "snazzy" is a longer word than "hello",
and so this code has a bug (by my personal requirements, as I would like to not count '\n' as part of the string).
Trying to fix this I tried (among many other things) to remove adding the '\n' char to the array and not incrementing the counter,
and I found out incidentally that when entering more characters than the array could store, the extra characters wait in the input buffer,
and would later be passed to the getline() function, so if I would enter:
"snazzy lolz\n"
it would use it up like this:
first getline() call: [s][n][a][z][z][y][\0]
second getline() call: [ ][l][o][l][z][\n][\0]
This change also introduced an interesting bug: if I enter a string that is exactly 7 characters long (including '\n'), the program quits straight away, because it would pass a '\0' char to the next getline() call, which would return 0 and exit the while loop that calls getline() in main().
I am now confused as to what to do next.
How can I make it not count the '\n' char but also avoid the bug it created?
Many thanks

There is a convention in C that strings end with a null character. All of your questions rest on that convention. So
Does every character array end with '\0'?
No, it ends with \0 only because the programmer put it there.
Is the length of it always equal to the number of characters + 1 for '\0'?
Yes, but only because of this convention. That is why, for example, you allocate one more byte (char) than the length of the string: to accommodate this \0.
Strings are stored in character arrays such as char s[32]; or char *s = malloc(strlen(name) + 1);

Does every character array end with '\0'?
No; strings are a special case: they are character arrays with a nul (\0) terminator. This is more a convention than a feature of the language, although it is part of the language insofar as literal constant strings have a nul terminator. Moreover, in a character string, the nul appears at the end of the string, not the end of the array; the array holding the string may be longer than the string it holds.
So the nul merely indicates the end of a string in a character array. If the character array represents data other than a string, then it may contain zero elements anywhere.
Is the length of it always equal to the number of characters + 1 for '\0'?
Again you are conflating strings with character arrays. They are not the same. A string happens to use a character array as a container. A string requires an array that is at least the length of the string plus one.
meaning that if I specify a character array length of 10 I would be
able to store only 9 characters that are not '\0'?
You will be able to store 10 characters of any value. If, however, you choose to interpret the array as a string, the string comprises only those characters up to and including the first nul character.
or does the '\0' come after the last array slot, so all 10 slots could
be used for any character and an 11th non-reachable slot would contain
the '\0' char?
The nul is at the end of the string, not the end of the array, and certainly not after the end of the array.
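The "length of the string plus one" rule can be written down as a one-line check. This is only a sketch; string_fits is a name invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: can s, terminator included, be stored in an array of capacity chars? */
int string_fits(const char *s, size_t capacity)
{
    /* strlen() does not count the terminating nul, so one extra slot is needed. */
    return strlen(s) + 1 <= capacity;
}
```

For an array declared as char buf[10], a 9-character string fits and a 10-character string does not. If buf holds "hello", sizeof buf is still 10 while strlen(buf) is 5: the nul sits at buf[5], in the middle of the array.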
Comparing both strings will return that they're equal when clearly
"snazzy" is a longer word than "hello",
In what world are those strings equal? They have equal length, not equal content.
and so this code has a bug (by my personal requirements, as I would
like to not count '\n' as part of the string).
Someone else's code not doing what you require is hardly a bug; that implementation is by design and is identical to the behaviour of the standard library fgets() function. If you require different behaviour, then you are of course free to implement to your needs; just omit the part:
if (c == '\n') {
    s[i] = c;
    ++i;
}
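To explicitly discard any remaining characters on the line, the removed code above may be replaced with the following (the EOF test keeps the loop from spinning forever if the input ends before a newline arrives):
while (c != '\n' && c != EOF) {
    c = getchar();
}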
One reason why you might not do that is that the data may be coming from a file redirected to stdin.
One reason for retaining the '\n' is that it enables detection of incomplete input, which may be useful in some cases. For example, you may want all the data in the line, regardless of length and despite a necessarily finite buffer length. A string returned without a newline would indicate that there is more data to be read, so you could then write code to handle that situation.
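Putting the pieces together, one possible variant that drops the '\n', discards whatever does not fit, and still lets the caller tell an empty line from the end of input might look like the sketch below. The name read_line, the FILE * parameter and the -1 return value are choices made for this example, not part of K&R's version, and lim is assumed to be at least 1.

```c
#include <stdio.h>

/* Hypothetical variant of getline(): stores at most lim - 1 characters of the
   next line in s, without the '\n'. Returns the stored length, or -1 when the
   input is exhausted and nothing was read. */
int read_line(FILE *in, char s[], int lim)
{
    int c = 0;      /* any value other than EOF and '\n' */
    int i = 0;

    while (i < lim - 1 && (c = getc(in)) != EOF && c != '\n')
        s[i++] = (char)c;
    s[i] = '\0';

    /* The loop stopped because the buffer filled up: discard the rest of the line. */
    if (c != EOF && c != '\n')
        while ((c = getc(in)) != EOF && c != '\n')
            ;

    /* Nothing stored and no more input: report end of input. */
    if (i == 0 && c == EOF)
        return -1;
    return i;
}
```

A caller would loop with while ((len = read_line(stdin, line, MAXLINE)) != -1), so an empty line (length 0) no longer ends the loop. That is the situation behind the "exactly 7 characters" symptom described in the question.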

Scanning for a valid integer

Okay, total newbie here, but I need a little help/insight on how to start writing a specific program. I'm not asking for someone to do it for me, I'm just asking for an approach to this problem because I'm honestly not sure how to begin.
The program I am supposed to write is to detect valid integers. However, in this program, a valid integer is defined as the following:
0 or more leading white spaces followed by...
an optional '+' or '-' followed by...
1 or more digits, followed by a non-alphanumeric, but not a '.' followed by 1 or more digits.
Examples of valid integers: “1234”, “ 1234 ”, “1234.”, “ +1234 ”, "12+34", "1234.", "1234 x", and “ -1234 ” are all integers, and none of “1234e5”, “e1234”, “1234.56”, and “1234abc” are.
So far, all I can think of is using a bunch of if statements to check for valid integers, but I can't help but think there has to be a better and more robust approach than using a lot of if statements to check each character of the string. I can't think of any functions that would be useful to me other than isdigit() and maybe strtol(). Any advice would be appreciated.

You just need to examine each character in a loop and keep a little state machine as you're going, until you decide it's not valid or you reach the end.
Edit: Nothing wrong with if statements, or you could use a switch statement.

I'd probably use sscanf (or fscanf, etc.)
Although it doesn't support full regular expressions, scanf format strings do support scan set conversions, which are about like a character set in a regular expression (including inverted ones, so for example %1[^a-zA-Z0-9] matches a single non-alphanumeric character).
A single space in a format string matches an arbitrary amount of white space in the input.
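As a small sketch of the scan set idea, the function below reads an integer and then tries to read exactly one non-alphanumeric character after it. The name scan_int_and_separator is hypothetical, and this covers only part of the rule from the question: the caller would still have to reject a '.' that is followed by digits, and tell "end of string" apart from "alphanumeric follows" when the result is 1. Character ranges such as a-z inside a scan set are widely supported but formally implementation-defined.

```c
#include <stdio.h>

/*
 * Hypothetical helper. Returns the number of items sscanf() converted:
 *   0 - no integer at the start of s (after optional white space)
 *   1 - an integer, followed by the end of the string or an alphanumeric
 *   2 - an integer, followed by one non-alphanumeric character (stored in sep)
 */
int scan_int_and_separator(const char *s, int *value, char sep[2])
{
    /* %d skips leading white space and accepts an optional sign;
       %1[^a-zA-Z0-9] matches a single character that is not a letter or digit. */
    return sscanf(s, "%d%1[^a-zA-Z0-9]", value, sep);
}
```

For example, " +1234 " gives 2 with value 1234 and sep " ", while "1234e5" gives 1 because 'e' does not match the scan set.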

Put your words into code, one piece at a time. A C sketch follows (the function name is only illustrative; the end of the string is treated as an acceptable "non-alphanumeric" so that "1234" passes).
#include <ctype.h>
#include <stdbool.h>

// to detect valid integers.
bool detect_valid_integer(const char *s) {
    // 0 or more leading white spaces followed by...
    while (isspace((unsigned char)*s)) s++;
    // an optional '+' or '-' followed by...
    if (*s == '+' || *s == '-') s++;
    // 1 or more digits, ...
    bool digit_found = false;
    while (isdigit((unsigned char)*s)) { s++; digit_found = true; }
    if (!digit_found) return false;
    // followed by a non-alphanumeric, ...
    if (isalnum((unsigned char)*s)) return false;
    // but not a '.' followed by 1 or more digits.
    if (*s == '.' && isdigit((unsigned char)s[1])) return false;
    return true;
}

Have a look at strtol(): through its end-pointer argument it can tell you where the invalid part of the string begins.
And beware of enthusiastic example code; see the man page for comprehensive error handling.
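A minimal sketch of that usage is shown below. The wrapper name parse_long and its return convention are assumptions for the example; the strtol() call, the end pointer and the errno/ERANGE check are the standard parts.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns 1 and fills *out and *rest on success,
   0 if s does not start with an integer or the value is out of range. */
int parse_long(const char *s, long *out, const char **rest)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(s, &end, 10);    /* skips leading white space, accepts a sign */

    if (end == s)                   /* no digits were consumed */
        return 0;
    if (errno == ERANGE)            /* value does not fit in a long */
        return 0;

    *out = value;
    *rest = end;                    /* first character strtol() did not use */
    return 1;
}
```

After a successful call, *rest points at whatever follows the digits, so the caller can inspect that character (and the one after it) to apply the "non-alphanumeric, but not a '.' followed by digits" rule from the question.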

Clarifying K&R exercise 1-9

I have a few questions on this exercise. Here is the code I'm dealing with:
#include <stdio.h>
int main (void)
{
    int c;
    int inspace;
    inspace = 0;
    while((c = getchar()) != EOF)
    {
        if(c == ' ')
        {
            if(inspace == 0)
            {
                inspace = 1;
                putchar(c);
            }
        }
        if(c != ' ')
        {
            inspace = 0;
            putchar(c);
        }
    }
    return 0;
}
(Sorry I'm having a lot of trouble comprehending how these programs work because they're so simple and lack description on how they actually work)
First of all, how does putchar(c) not output the same exact data that came in. Despite it checking for a blank or != blank, it still says to output "c" which is just getchar(c) meaning whatever was inputted. I see no code that specifies to delete extra spaces and output just one space. Where does the code specify that that is what must take place? I'm having trouble understanding how getchar/putchar works it seems to me.
Also, what importance does inspace == 1 or 0 have? If inspace is == 1 then it just outputs the characters inputted back out. There's nothing saying that the extra blanks are deleted and inspace isn't defined as anything except 0 or 1, there's nothing defining it as a space so how can it possibly have any real meaning as to what the program is doing?
I'm really confused, where is the code that's replacing the spaces and how does it work? Is there a simpler book I should be learning from that explains the solutions?

First of all, how does putchar(c) not output the same exact data that came in. Despite it checking for a blank or != blank, it still says to output "c" which is just getchar(c) meaning whatever was inputted. I see no code that specifies to delete extra spaces and output just one space. Where does the code specify that that is what must take place? I'm having trouble understanding how getchar/putchar works it seems to me.
You are correct that if putchar is called, it just outputs the input character. The key to this program is that putchar isn't called on every input character. The various if statements control when it is called. At a high level, the program avoids calling putchar on the second, third, fourth, etc., spaces if there are multiple spaces in a row. It's only called on the first space.
Also, what importance does inspace == 1 or 0 have? If inspace is == 1 then it just outputs the characters inputted back out. There's nothing saying that the extra blanks are deleted and inspace isn't defined as anything except 0 or 1, there's nothing defining it as a space so how can it possibly have any real meaning as to what the program is doing?
Don't think of it as spaces being deleted. Think of it as them being omitted. Sometimes putchar is called, sometimes it isn't. Look at the loop and try to figure out what conditions would cause putchar not to be called.
Importantly, look at what happens if you start a loop iteration, inspace == 1, and c == ' '. What happens?
It might help to put together a table showing when putchar is and isn't called.
Is putchar(c) called?
=====================
             | c == ' ' | c != ' '
-------------+----------+---------
inspace == 0 |    Y     |    Y
inspace == 1 |    N     |    Y

Think about the logic in this block when there are two or more consecutive ' ' characters in the input.
if(c == ' ')
{
    if(inspace == 0)
    {
        inspace = 1;
        putchar(c);
    }
}
When the first space character is encountered, the code enters the nested if block and prints the character.
When the second space character is encountered, the code does not enter the nested if block and the character is not printed.
If you follow this logic, you'll notice that if there are two or more consecutive space characters in the input, only one is printed.

how does putchar(c) not output the same exact data that came in.
When the code reaches putchar(c), it outputs the same exact character that came in. However, the code may not be reaching putchar(c) on some of the iterations.
what importance does inspace == 1 or 0 have?
Once the program sets inspace to 1, it stops printing further space characters, because the code will not reach putchar(c) on second and subsequent iterations of the loop.
inspace is set to 1 after printing the first space in a sequence of one or more spaces. If inspace is set to zero coming into the first conditional, a space would be printed; otherwise, no space would be printed.
Here is a diagram that explains what is happening:
The program starts in the black circle, and proceeds to one of two states, depending on the input character:
If the character is a space, the state on the left is entered, where the first space is printed and the rest of the spaces are ignored (i.e. inspace is set to 1).
If the character is a non-space, the state on the right is entered, where each character is printed.
Each time a new character is read the program decides if it wants to switch the state, or to remain in the current state.
Note: the diagram is not showing the EOF to save some space. When EOF is reached, the program exits.

The first if block along with the inner one is saying, "If the input is a space and the inspace flag is zero, print it, and also set the flag to one". I.e. print space if it is the first one, and indicate the next one won't be the first. The second block is saying "If the input is not space, print it and reset the previous spaces flag so the next encountered space will be considered the first one.". That's all.
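The same two rules can be tried out on an in-memory string, which makes it easy to see which characters are skipped. The sketch below mirrors the program's inspace flag; the function name squeeze_spaces and the use of an output buffer instead of putchar() are assumptions made for the example.

```c
/* Hypothetical string version of the program's loop. out must have room for
   at least as many characters as in, including the terminating '\0'. */
void squeeze_spaces(const char *in, char *out)
{
    int inspace = 0;                /* 1 once a space of the current run was copied */

    for (; *in != '\0'; ++in) {
        if (*in == ' ') {
            if (inspace == 0) {     /* first space of a run: copy it */
                inspace = 1;
                *out++ = *in;
            }
            /* later spaces of the run: nothing is copied */
        } else {
            inspace = 0;            /* the next space will be a "first" one again */
            *out++ = *in;
        }
    }
    *out = '\0';
}
```

Called with "a   b  c", it leaves "a b c" in out: nothing is deleted, the extra spaces are simply never copied.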

Code to replace all the tabs and if there are more than 1 space by a single space?

This question is in K&R, exercise 1.9. I wrote the following code:
#include<stdio.h>
main()
{
    int c,i=0,n=0;
    while((c=getchar())!=EOF)
    {
        if(c!=' '||c!='\t')
        {
            i=0;
            putchar(c);
        }
        else if(c==' '||c=='\t')
        {
            i++;
        }
        if((c+1)!=' '||(c+1)!='\t')
            n=i;
        if(n!=0)
        {
            c=' ';
            putchar(c);
        }
    }
}
but I could not get the desired output. I am using gcc on Ubuntu. When I enter something like hello\t\ta as input, my output is hello__a (each _ standing for a space), i.e. each tab is replaced by a space, and when I enter hello__a my output is the same as the input.
Please help me with it or suggest something new to get the desired output.

Instead of giving you the full working program, I prefer to guide you in the right direction.
First of all, c+1 does not mean "next character in the input". It only adds 1 to the value of c, which effectively converts c to the next character in the ASCII table.
For example, if c is 'a', c+1 means 'b', which is the next character in the ASCII table, and if c is ' ' (a single space), which has a code of 32 in the table, c+1 is '!', which has a code of 33.
Well, to get the next character, you need to read it, in the same way you read the first character. The best way to achieve this is to always hold the previously read character, and check it against the currently read character.
So you need two variables, for example c and pc. You read the character and store it in c. At first, pc is '\0'. If the read character is not space or tab, you write it to the output. If it is tab, you change it to space. And if it is space, you check the previous character (pc). If it is not space, print c. At the end of the loop, you should store the value of c into pc, which means you are holding the previous character in pc.
I guess I told you the complete solution!
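The description above translates fairly directly into code. The sketch below applies it to a string instead of standard input so it is easy to try; the function name squeeze_blanks and the output buffer are assumptions, while c and pc are the two variables from the description.

```c
/* Hypothetical string version of the c / pc approach. out must have room for
   at least as many characters as in, including the terminating '\0'. */
void squeeze_blanks(const char *in, char *out)
{
    char pc = '\0';                 /* previous character, after tab conversion */

    for (; *in != '\0'; ++in) {
        char c = *in;

        if (c == '\t')
            c = ' ';                /* a tab is treated as a space */

        /* Copy everything except a space that directly follows a space. */
        if (c != ' ' || pc != ' ')
            *out++ = c;

        pc = c;                     /* remember this character for the next round */
    }
    *out = '\0';
}
```

With the input from the question, "hello\t\ta" becomes "hello a", and so does "hello  a" with two spaces.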

The problem is: you want to check the NEXT character, but you check the current character's value incremented by one.

The approach is slightly wrong. Here is a hint: keep the last character as state. If the newly entered character is a space and the last character was a space, then don't output; simply go back round the loop and wait for the next character.
If the current character is not a space, output it and update the state.

Finding bad string in C

I am pulling information from a binary file in C and one of my strings is coming out as \\b\\3777\\375\\v\\177 in GDB. I want to be able to filter this sort of useless data out of my output in a non-specific way, i.e. anything that doesn't start with a number/character should be kicked out. How can this be achieved?
The data is being buffered into a struct n bytes at a time, and I am sure that this information is correct based on how data later in the file is being read correctly.

if (isalnum((unsigned char)buf[0])) {
    printf("%s", buf);
}

It sounds a bit like you're reimplementing the Linux utility strings.
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.

As the printable ASCII characters are in the range 0x20 (' ', space) to 0x7E ('~', tilde), you can use this test:
if ((buf[0] >= 0x20) && (buf[0] <= 0x7E))
{
    printf("%s", buf);
}
This will accept any string that starts with a printable ASCII character.

Iterate over your bytes, and check the value of each one to see if it is one of the characters that you consider to be valid. I don't know what you consider to be "an integer or char" (i.e. valid values), but you can try comparing the characters to (for example) ensure that:
(c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
The above condition will ensure that the character's ASCII value is either a digit (0 through 9) or a capital or lowercase English letter. Then you have to decide what to do when you encounter a character that you don't want. You can either replace the "bad" character with something "safe" (like a space) or you can build up a new string in a separate buffer, containing only the "good" characters.
Note that the above condition only works for English: it doesn't accept accented characters, and all punctuation and whitespace is also excluded. Another possible test would be to see if the character is a printable ASCII character: (c >= 0x20 && c <= 0x7e) || c == 0xa || c == 0xd, which also includes punctuation, space and CR/LF. And this doesn't even get started on dealing with encodings that aren't ASCII-compatible.
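A sketch of the "check every byte" idea using the standard isprint() classification is shown below. The name looks_like_text and the explicit length parameter are assumptions for the example; the length is there because data read from a binary file is not guaranteed to be nul-terminated. In the default "C" locale isprint() accepts 0x20 through 0x7E, so CR and LF are rejected here and would need an extra test if they should count as valid.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf is non-empty and every byte up to the
   first '\0' (or up to n bytes, whichever comes first) is printable, else 0. */
int looks_like_text(const char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n && buf[i] != '\0'; ++i) {
        /* The cast avoids undefined behaviour for bytes above 0x7F
           on platforms where char is signed. */
        if (!isprint((unsigned char)buf[i]))
            return 0;
    }
    return i > 0;
}
```

A field could then be printed only when looks_like_text(field, sizeof field) is non-zero, which rejects data such as the \b and \377 bytes shown in the question.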
