Word count program - stdin - c

For below question,
Write a program to read English text to end-of-data (type control-D to indicate end of data at a terminal, see below for detecting it), and print a count of word lengths, i.e. the total number of words of length 1 which occurred, the number of length 2, and so on.
Define a word to be a sequence of alphabetic characters. You should allow for word lengths up to 25 letters.
Typical output should be like this:
length 1 : 10 occurrences
length 2 : 19 occurrences
length 3 : 127 occurrences
length 4 : 0 occurrences
length 5 : 18 occurrences
....
To read characters to end of data see above question.
Here is my working solution,
#include<stdio.h>
int main(void){
char ch;
short wordCount[20] = {0};
int count = 0;
while(ch = getchar(), ch >= 0){
if(ch == ' ' || ch == ',' || ch == ';'|| ch == ':'|| ch == '.'|| ch == '/'){
wordCount[count]++;
count=0;
}else{
count++;
}
}
wordCount[count]++; // Incrementing here looks weird to me
for(short i=1; i< sizeof(wordCount)/sizeof(short); i++){
printf("\nlength %d : %d occurences",i, wordCount[i]);
}
}
Question:
1)
From code elegance aspect, Can I avoid incrementing(++) wordCount outside while loop?
2)
Can I make wordCount array size more dynamic based on word size, rather than constant size 20?
Note: Learnt about struct but am yet to learn dynamic structures like Linkedlist

For the dynamic allocations you can start with space for 20 shorts (although the problem statement appears to ask for you to allow for words up to 25 characters):
short maxWord = 20;
short *wordCount = calloc(maxWord, sizeof(*wordCount)); /* calloc so every counter starts at 0 */
Then, when you increment count you can allocate more space if the current word is longer than can be counted in your dynamic array:
} else {
count++;
if (count >= maxWord) {
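maxWord++;
wordCount = realloc(wordCount, sizeof(*wordCount) * maxWord); /* realloc takes the old pointer first */
wordCount[maxWord - 1] = 0; /* the newly added counter is not zeroed by realloc */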
}
}
Don't forget to free(wordCount) when you are done.
Since you don't need to count zero-length words, you might consider modifying your code so that wordCount[0] stores the number of words of length 1, and so on.
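As a rough sketch of the growth step described above, the helper below doubles the capacity instead of growing by one, zeroes the new counters, and keeps the old block if realloc fails. The name ensure_capacity and the doubling policy are assumptions made for illustration, not part of the original answer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: make sure (*counts)[len] is a valid, zero-initialized slot.
 * Returns 0 on success, -1 if realloc fails (the old block then stays valid). */
int ensure_capacity(unsigned **counts, size_t *cap, size_t len)
{
    if (len < *cap)
        return 0;                          /* already large enough */

    size_t new_cap = *cap ? *cap : 1;
    while (new_cap <= len)
        new_cap *= 2;                      /* doubling keeps the number of reallocs small */

    unsigned *tmp = realloc(*counts, new_cap * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* realloc does not clear the added memory, so zero the new counters */
    memset(tmp + *cap, 0, (new_cap - *cap) * sizeof *tmp);
    *counts = tmp;
    *cap = new_cap;
    return 0;
}
```

A caller would start with `unsigned *counts = NULL; size_t cap = 0;`, call `ensure_capacity(&counts, &cap, count)` before `counts[count]++`, and free(counts) at the end.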

To 1):
Maybe scan from one delimiting character to the next, and increment wordCount only when you reach a delimiter. Make EOF a delimiting character as well.
To 2)
You can scan the file twice and then decide how much memory you need. Or you can realloc dynamically whenever more memory is needed. This is what the C++ std::vector class does internally, for example.
Also, you should think about what happens if there are two delimiting characters in a row. Right now you would count that as a word of length 0.
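The suggestions above (treat EOF as one more delimiter, and do not count empty runs between consecutive delimiters) could be sketched roughly as follows. The function names feed, count_stream and print_counts are hypothetical, and lumping over-long words into the last bucket is an assumption; the problem statement only asks for lengths up to 25.

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_LEN 25  /* longest word length tracked, per the problem statement */

/* Feed one character (or EOF) into the tally.
 * *len holds the length of the word currently being read. */
void feed(int c, int *len, unsigned counts[MAX_LEN + 1])
{
    if (c != EOF && isalpha((unsigned char)c)) {
        ++*len;                       /* still inside a word */
    } else if (*len > 0) {            /* a word just ended; empty runs are ignored */
        if (*len > MAX_LEN)
            *len = MAX_LEN;           /* assumption: clamp over-long words */
        counts[*len]++;
        *len = 0;
    }
}

/* Read the stream to end-of-data. EOF passes through feed() like any
 * other delimiter, so no extra increment is needed after the loop. */
void count_stream(FILE *in, unsigned counts[MAX_LEN + 1])
{
    int c, len = 0;
    do {
        c = getc(in);                 /* int, so EOF is distinguishable from any char */
        feed(c, &len, counts);
    } while (c != EOF);
}

void print_counts(const unsigned counts[MAX_LEN + 1])
{
    for (int i = 1; i <= MAX_LEN; i++)
        printf("length %d : %u occurrences\n", i, counts[i]);
}
```

A main function would declare `unsigned counts[MAX_LEN + 1] = {0};`, call `count_stream(stdin, counts)` and then `print_counts(counts)`. Because any non-alphabetic character ends a word, the explicit list of punctuation characters is no longer needed.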

Related

explain the logic behind the structure of code needed to account for extra spaces, in order to calculate the correct average word length

This is the question on my assignment: Write a program that prompts the user to enter a sentence (assume that a sentence can have a maximum of 50 characters). It then counts the vowels and consonants in it. It also calculates the average word length of the input sentence. Word length is the total number of alphabetic characters in the sentence divided by the total number of words in it. Words are separated by one or more spaces. All the results are displayed at the end.
So far I have been able to complete all aspects of the question but I am running into a logical error on my part. When the user inputs more than a normal amount of spaces, it messes up the answer given for average word length.
Here is my code calculating average word length:
for(i = 1; sent[i] != '\0'; i++){
if( sent[i] == ' '){
++spaceCount;
}
else if((sent[i] != ' ') && (sent[i] != '\n')){
++charCount;
}
}
avgWordLength = (charCount / (spaceCount+1)) ;
Could someone help explain the logic behind the structure of code needed to account for extra spaces, in order to calculate the correct average word length
Here is a link to a previously already answered question:
Average word length for a sentence
But my school has not taught the "getchar" function yet and I would not like to use it unless I have to. To be more clear, is there a way to complete the question without using the "getchar" function?
Here is an example of the problem when compiling and running
// Everything works good when
string: Thursday is ok
Average word length: 4.00 characters
// this is where my code fall apart
string: Thursday is ok
Average word length: 1.86 characters
Well, if you think about it, what you want to do is treat any uninterrupted series of whitespace characters as one for the purpose of computing the word count. You can include ctype.h and use the isspace function to test for all whitespace characters, or, if you are supposed to do it manually, at least check for space and tab characters (e.g. a mixed sequence of spaces and tabs such as " \t \t " should still be counted as a single separator).
To handle multiple whitespace characters and count the sequence as one, just set a flag (e.g. ws for whitespace) and only increment spaceCount when you encounter the first whitespace, and reset the flag if another non-whitespace character is encountered.
Putting those pieces together, you could do something like the following:
int ws = 0; /* flag to treat multiple whitespace as 1 */
for(i = 0; sent[i]; i++){
if (sent[i] == ' ' || sent[i] == '\t') {
if (!ws) {
spaceCount++;
ws = 1;
}
}
else {
charCount++; /* non-whitespace character count */
ws = 0;
}
}
(note: begin your check at i = 0 to protect against Undefined Behavior in the event sent is the empty-string.)
(note2: you can check charCount before setting your first spaceCount and check ws after leaving the loop to handle leading and trailing whitespace -- and adjust spaceCount as necessary. That is left as an exercise)
Look things over and let me know if you have any further questions.
Could someone help explain the logic behind the structure of code needed to account for extra spaces, in order to calculate the correct average word length
You could use a state machine. You have two states:
1) Looking for the end of a word.
2) Looking for the end of a space sequence.
Look at the first character in the sentence. It is either the beginning of a word or a space. This tells you if you are in state 1 or 2.
If in state 1, then look for a space or the end of the sentence. If you find a space, set your state to 2.
If in state 2, then look for a non-space or the end of the sentence. If you find a non-space then set your state to 1.
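A minimal sketch of that two-state machine, assuming the sentence is already in a NUL-terminated string (so no getchar is needed). The function name average_word_length is hypothetical. Starting in the "space" state is equivalent to inspecting the first character, and it also copes with leading spaces and the empty string.

```c
#include <stddef.h>

enum state { IN_WORD, IN_SPACE };

/* Returns the average word length of sent, or 0.0 if it contains no words. */
double average_word_length(const char *sent)
{
    enum state st = IN_SPACE;     /* as if a space preceded the sentence */
    size_t words = 0, chars = 0;

    for (; *sent != '\0'; sent++) {
        int is_space = (*sent == ' ' || *sent == '\t' || *sent == '\n');

        if (st == IN_SPACE && !is_space) {
            words++;              /* space -> word transition: a new word starts */
            st = IN_WORD;
        } else if (st == IN_WORD && is_space) {
            st = IN_SPACE;        /* word -> space transition: the word has ended */
        }
        if (!is_space)
            chars++;              /* counts every non-space character, as the question's code does */
    }
    return words ? (double)chars / (double)words : 0.0;
}
```

Counting word starts instead of spaces means extra, leading, or trailing spaces cannot change the result, and the cast to double avoids the integer division in the original expression.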
counts the vowels and consonants in it. It also calculates the average word length of the input sentence.
Could someone help explain the logic behind the structure of code needed to account for extra spaces
There really is no need to count spaces. Instead, all that is needed is to count the number of times a letter begins a word: it either follows a non-letter or is the first character.
// pseudo code
sentence_stats(const char *s) {
vowels = 0;
consonants = 0;
word_count = 0;
previous = 0;
while (*s) {
if (isletter(*s)) { // OP to make isletter(), isvowel()
if (!isletter(previous)) {
word_count++; // start of word
}
if (isvowel(*s)) vowels++;
else consonants++;
} else if (*s == ' ') {
; // nothing to do
} else {
TBD_CODE_Handle_non_letter_non_space();
}
previous = *s;
s++;
}
average = (vowels + consonants)/word_count
}
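One possible way to turn the pseudo code into compilable C is sketched below. It assumes isalpha from ctype.h can stand in for isletter(), treats only a, e, i, o, u as vowels, and simply skips every non-letter; the struct stats type and the function name compute_stats are hypothetical.

```c
#include <ctype.h>
#include <string.h>

struct stats {
    int vowels;
    int consonants;
    int words;
    double average;   /* letters per word, 0.0 when there are no words */
};

struct stats compute_stats(const char *s)
{
    struct stats st = {0, 0, 0, 0.0};
    int prev_is_letter = 0;               /* nothing precedes the first character */

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        int is_letter = isalpha(c) != 0;

        if (is_letter) {
            if (!prev_is_letter)
                st.words++;               /* a letter after a non-letter starts a word */
            if (strchr("aeiouAEIOU", c) != NULL)
                st.vowels++;
            else
                st.consonants++;
        }
        prev_is_letter = is_letter;
    }
    if (st.words > 0)                     /* guard against division by zero */
        st.average = (double)(st.vowels + st.consonants) / st.words;
    return st;
}
```

With this rule any non-letter separates words, so an input such as "don't" would count as two words; whether that is acceptable depends on the assignment.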

Format Specifier Q and unique bug in Mario Solution to Pyramid algorithm

Okay I have two problems with my solution to this problem, I was hoping I could get some help on. The problem itself is being able to print out #s in a specific format based on user input.
My questions are:
When I input 7, it outputs the correct solution, but when I input 8 (or higher), my buffer, for whatever reason, adds some garbage at the end, and I am unsure why it happens. I would add a picture but I don't have enough rep points for it :(
In my code, where I've inputted **HELPHERE**, I'm unsure why this gives me the correct solution. I'm confused because in the links I've read (on format specifiers) I thought that the 1 input (x in my case) specified how many spaces you wanted. I thought this would've made the solution x-n, as each consequent row, you'd need the space segment to decrease by 1 each time. Am I to understand that the array somehow reverses its input into the printf statement? I'm confused because does that mean since the array increases by 1, on each subsequent iteration of the loop, it eats into the space area?
int main(void){
printf("Height: ");
int x = GetInt();
int n = 1;
int k=0;
char buff[x]; /* creates buffer where hashes will go*/
while(n<=x){ /* stops when getint value is hit*/
while(k<n) /* fill buffer on each iteration of loop with 1 more hashtag*/
{
buff[k] = '#';
k++;
}
printf("%*s",x, buff); /*makes x number of spaces ****HELPHERE*****, then prints buffer*/
printf(" ");
printf("%s\n",buff); /*prints other side of triangle */
/*printf("%*c \n",x-n, '\0');*/
n++;
}
}
Allocate enough memory and make sure the string is null terminated:
char buff[x+1];//need +1 for End of the string('\0')
memset(buff, '\0', sizeof(buff));//Must be initialized by zero
Print as many blanks as requested by blank-padding an empty string:
printf("%*s", x, "");
※the second item was written by Jonathan Leffler.
In printf("%*s",x, buff);, buff is not null character terminated.
Present code "worked" sometimes as buff was not properly terminated and the result was UB - undefined behavior. What likely happened in OP's case was that the buffer up to size 7, fortunately had '\0' in subsequent bytes, but not so when size was 8.
1) As per @BLUEPIXY, allocate a buffer large enough to accommodate the '#' characters and the terminating '\0' with char buff[x+1];
2) Change while loop to append the needed '\0'.
while (k<n) {
buff[k] = '#';
k++;
}
buff[k] = '\0';
3) Minor: ensure x is valid.
if (x < 0) Handle_Error();
char buff[x + 1];
4) Minor: Return a value for int main() such as return 0;.
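To illustrate both points, here is a small sketch that formats one row of the pyramid into a caller-supplied buffer. In "%*s" the extra argument is a minimum field width, and the string is right-justified inside it, so printing buff with width x pads with x minus strlen(buff) spaces. That is why x rather than x-n worked in the question. The sketch pads an empty string instead, which makes the amount of padding explicit. The function name format_row and the MAX_HEIGHT limit are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_HEIGHT 23  /* hypothetical upper bound for this sketch */

/* Writes row n (1-based) of a pyramid of the given height into out.
 * Returns the number of characters snprintf would write, or -1 on bad arguments. */
int format_row(char *out, size_t size, int height, int n)
{
    char hashes[MAX_HEIGHT + 1];              /* +1 for the terminating '\0' */

    if (height < 1 || height > MAX_HEIGHT || n < 1 || n > height)
        return -1;

    memset(hashes, '#', (size_t)n);
    hashes[n] = '\0';                         /* terminate before passing to %s */

    /* "%*s" with an empty string emits exactly height - n spaces */
    return snprintf(out, size, "%*s%s %s", height - n, "", hashes, hashes);
}
```

Calling this for n from 1 to height and printing each result with puts would reproduce the two-sided pyramid from the question.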

C Resetting data counters in FOR loop

I've got a very large text file that I'm trying to do word analysis on. Besides the word count, I might be looking for other information as well, but I left that out for simplicity.
In this text file I have blocks of text separated by asterisks '*'. The code I have below scans the text file and prints out # of characters and words as it should, but I'd like to reset the counter after an asterisk is met, and store all information in a table of some sort. I'm not so worried on how I'll make the table as much as I am unsure of how to loop the same counting code for each text block between asterisks.
Maybe a for loop like
for (arr = strstr(arr, "*"); arr; arr = strstr(arr + strlen("*"), "*"))
Example text file:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
I have a sentence. I have two sentences now.
*
I have another sentence. And another.
*
I'd like to count the amount of words and characters from the asterisk above this
one until the next asterkisk, not including the count from the last one.
*
...
...
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
(EOF)
Desired output:
*# #words #alphaChar
----------------------------
1 9 34
-----------------------------
2 5 30
-----------------------------
3 28 124
...
...
I have tried
#include <ctype.h> // for isalpha
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
int characterCount=0;
int counterPosition, wordCount=0, alphaCount=0;
int i = 0;
//input file
FILE *file= fopen("test.txt", "r");
if (file== NULL) {
printf("Cannot find the file.\n");
return 1;
}
//Count total number of characters in file
while (1)
{
counterPosition = fgetc(file);
if (counterPosition == EOF)
break;
++characterCount;
}
rewind(file); // Sends the pointer to the beginning of the file
//Dynamically allocate since array size cant be variable
char *arr= ( char*) malloc(characterCount);
while(i < characterCount && fscanf(file, "%c", &arr[i]) != EOF ) //Scan until the end of file.
i++; //increment, storing each character in a unique position
for(i = 0; i <characterCount; i++)
{
if(arr[i] == ' ') //count words
wordCount++;
if(isalpha((unsigned char)arr[i])) //count letters only
alphaCount++;
}//end for loop
printf("word count is %d and alpha count is %d", wordCount,alphaCount);
}
Since you have the full text of the file in the array arr[], you need to split that string using * as the delimiter. You can use strtok() for this, and then perform the word count and character count on each token. Note that strtok() expects a NUL-terminated string, so allocate one extra byte and store '\0' after the last character of arr first. Read the documentation for strtok to learn more.
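A rough sketch of that approach, assuming the whole file is already in a writable, NUL-terminated buffer. The struct block_stats type and both function names are hypothetical. Words are counted as runs of non-whitespace characters rather than by counting spaces, which is a deliberate change from the code in the question.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct block_stats {
    int words;   /* runs of non-whitespace characters */
    int alpha;   /* alphabetic characters only */
};

/* Count words and letters in one block of text. */
struct block_stats count_block(const char *text)
{
    struct block_stats st = {0, 0};
    int in_word = 0;

    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            st.words++;                /* first character of a new word */
        }
        if (isalpha(c))
            st.alpha++;
    }
    return st;
}

/* Split buf at '*' (strtok overwrites the delimiters in buf) and store one
 * result per block in out[]. Returns the number of blocks found. */
size_t count_all_blocks(char *buf, struct block_stats out[], size_t max_blocks)
{
    size_t n = 0;
    for (char *tok = strtok(buf, "*"); tok != NULL && n < max_blocks;
         tok = strtok(NULL, "*"))
        out[n++] = count_block(tok);   /* counters start fresh for every block */
    return n;
}
```

Because each block gets its own struct, there is nothing to reset between asterisks. Be aware that strtok skips empty tokens, so two asterisks in a row do not produce an empty block, and a block containing only a newline is reported with zero counts.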

Converting Character Array to Integer Array in C for ISBN Validation

I really hope someone can give a well explained example. I've been searching everywhere but can't find a proper solution.
I am taking an introduction to C Programming class, and our last assignment is to write a program which validates a 10 digit ISBN with dashes... The ISBN is inputted as a string in a CHAR array. From there I need to separate each digit and convert them into an integer, so I can calculated the validity of the ISBN. On top of that, the dashes need to be ignored..
My thought process was to create an INT array and then use a loop to store each character into the array, and pass it through the atoi() function. I also tried using an IF statement to check each part of the CHAR array to see if it found a dash. If it did find one, it would skip to the next spot in the array. It looked something like this:
int num[12], i = 0, j = 0, count = 0;
char isbn[12];
printf ("Enter an ISBN to validate: ");
scanf ("%13[0-9Xx-]%*c", &isbn);
do {
if (isbn[i] == '-') {
i++;
j++;
}
else {
num[i]= atoi(isbn[j]);
i++;
j++;
}
count++;
} while (count != 10);
But that creates a segmentation fault, so I can't even tell if my IF statement has actually filtered the dashes....
If someone could try and solve this I'd really appreciate that. The Assignment was due Dec 4th, however I got an extension until Dec 7th, so I'm pressed for time.
Please write out the code in your explanation. I'm a visual learner, and need to see step by step.
There's obviously a lot more that needs to be coded, but I can't move ahead until I get over this obstacle.
Thanks in advance!
First of all, your definition of isbn is not sufficient to hold 13 characters; it should therefore be 14 chars long (to also store the terminating '\0').
Second, your loop is overly complicated; three loop variables that maintain the same value is redundant.
Third, the loop is not safe, because a string might be as short as one character, but your code happily loops 10 times.
Lastly, a char that holds the ASCII value of a digit can be converted to its numeric value by simply subtracting '0' from it.
This is the code after above improvements have been made.
#include <stdio.h>
int main(void)
{
int num[14], i;
char isbn[14], *p;
printf("Enter an ISBN to validate: ");
if (scanf("%13[0-9Xx-]%*c", isbn) != 1) return 1; // pass isbn, not &isbn; stop if nothing matched
// p iterates over each character of isbn
// *p evaluates the value of each character
// the loop stops when the end-of-string is reached, i.e. '\0'
for (p = isbn, i = 0; *p; ++p) {
if (*p == '-' || *p == 'X' || *p == 'x') {
continue;
}
// it's definitely a digit now
num[i++] = *p - '0';
}
// post: i holds number of digits in num
// post: num[x] is the digit value, for 0 <= x < i
return 0;
}
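Building on the digit conversion above, the validation itself might look like the sketch below. It uses the standard ISBN-10 rule: weight the ten digits 10 down to 1, sum them, and require the sum to be divisible by 11, with X standing for 10 in the last position only. Unlike the loop above, it does not skip X but treats it as the check digit. The function name isbn10_is_valid is hypothetical.

```c
#include <ctype.h>

/* Returns 1 if s is a valid ISBN-10 (dashes are ignored), 0 otherwise. */
int isbn10_is_valid(const char *s)
{
    int sum = 0;
    int n = 0;                              /* number of digits seen so far */

    for (; *s != '\0'; s++) {
        int value;

        if (*s == '-')
            continue;                       /* dashes carry no information */
        if (isdigit((unsigned char)*s))
            value = *s - '0';               /* character to numeric value */
        else if ((*s == 'X' || *s == 'x') && n == 9)
            value = 10;                     /* X is only allowed as the check digit */
        else
            return 0;                       /* any other character is invalid */

        if (n == 10)
            return 0;                       /* more than ten digits */
        sum += value * (10 - n);            /* weights 10, 9, ..., 1 */
        n++;
    }
    return n == 10 && sum % 11 == 0;
}
```

For example, 0-306-40615-2 gives a weighted sum of 132, which is 12 times 11, so it is accepted.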

Determining the length of an array for memory efficiency

Write a function in C language that:
Takes as its only parameter a sentence stored in a string (e.g., "This is a short sentence.").
Returns a string consisting of the number of characters in each word (including punctuation), with spaces separating the numbers. (e.g., "4 2 1 5 9").
I wrote the following program:
int main()
{
char* output;
char *input = "My name is Pranay Godha";
output = numChar(input);
printf("output : %s",output);
getch();
return 0;
}
char* numChar(char* str)
{
int len = strlen(str);
char* output = (char*)malloc(sizeof(char)*len);
char* out = output;
int count = 0;
while(*str != '\0')
{
if(*str != ' ' )
{
count++;
}
else
{
*output = count+'0';
output++;
*output = ' ';
output++;
count = 0;
}
str++;
}
*output = count+'0';
output++;
*output = '\0';
return out;
}
I was just wondering that I am allocating len amount of memory for output string which I feel is more than I should have allocated hence there is some wasting of memory. Can you please tell me what can I do to make it more memory efficient?
I see lots of little bugs. If I were your instructor, I'd grade your solution at "C-". Here's some hints on how to turn it into "A+".
char* output = (char*)malloc(sizeof(char)*len);
Two main issues with the above line. For starters, you are forgetting to "free" the memory you allocate. But that's easily forgiven.
Actual real bug. If your string was only 1 character long (e.g. "x"), you would only allocate one byte. But you would likely need to copy two bytes into the string buffer. a '1' followed by a null terminating '\0'. The last byte gets copied into invalid memory. :(
Another bug:
*output = count+'0';
What happens when "count" is larger than 9? If "count" was 10, then *output gets assigned a colon, not "10".
Start by writing a function that just counts the number of words in a string. Assign the result of this function to a variable called num_of_words.
You could very well have words longer than 9 characters, so some words will need two or more digits in the output. You also need to account for the space between each number. And don't forget the trailing null byte.
If you think about the case in which a 1-byte unsigned integer can have at most 3 chars in a string representation ('0'..'255') not including the null char or negative numbers, then sizeof(int)*3 is a reasonable estimate of the maximum string length for an integer representation (not including a null char). As such, the amount of memory you need to alloc is:
num_of_words = countWords(str);
num_of_spaces = (num_of_words > 0) ? (num_of_words - 1) : 0;
output = malloc(num_of_spaces + sizeof(int)*3*num_of_words + 1); // +1 for null char
That is a generous over-estimate, but it will definitely allocate enough memory in all scenarios.
I think you have a few other bugs in your program. For starters, if there are multiple spaces between each word e.g.
"my gosh"
I would expect your program to print "2 4". But your code prints something else. Likely other bugs exist if there are leading or trailing spaces in your string. And the memory allocation estimate doesn't account for the extra garbage chars you are inserting in those cases.
Update:
Given that you have persevered and attempted to make a better solution in your answer below, I'm going to give you a hint. I have written a function that PRINTs the length of all words in a string. It doesn't actually allocate a string. It just prints it - as if someone had called "printf" on the string that your function is to return. Your job is to extrapolate how this function works - and then modify it to return a new string (that contains the integer lengths of all the words) instead of just having it print. I would suggest you modify the main loop in this function to keep a running total of the word count. Then allocate a buffer of size = (word_count * 4 *sizeof(int) + 1). Then loop through the input string again to append the length of each word into the buffer you allocated. Good luck.
void PrintLengthOfWordsInString(const char* str)
{
if ((str == NULL) || (*str == '\0'))
{
return;
}
while (*str)
{
int count = 0;
// consume leading white space
while ((*str) && (*str == ' '))
{
str++;
}
// count the number of consecutive non-space chars
while ((*str) && (*str != ' '))
{
count++;
str++;
}
if (count > 0)
{
printf("%d ", count);
}
}
printf("\n");
}
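For readers who want to check their own attempt, one way the hint could be carried through is sketched here: a first pass counts the words, the buffer is sized from that count, and a second pass appends each length with snprintf. The function name word_lengths is hypothetical, and the size estimate (three digits per byte of the counter, plus one separator per word) follows the reasoning above rather than being the only valid choice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns a malloc'd string such as "2 4 2 6 5", or NULL if allocation fails.
 * The caller is responsible for calling free() on the result. */
char *word_lengths(const char *str)
{
    /* Pass 1: count the words (runs of non-space characters). */
    size_t words = 0;
    int in_word = 0;
    for (const char *p = str; *p != '\0'; p++) {
        if (*p == ' ') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }

    /* At most 3 digits per byte of the counter, one separator per word, one '\0'. */
    size_t cap = words * (3 * sizeof(size_t) + 1) + 1;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;
    out[0] = '\0';

    /* Pass 2: append the length of each word. */
    size_t used = 0;
    while (*str != '\0') {
        size_t count = 0;
        while (*str == ' ')
            str++;                               /* skip leading or repeated spaces */
        while (*str != '\0' && *str != ' ') {
            count++;
            str++;
        }
        if (count > 0)
            used += (size_t)snprintf(out + used, cap - used, "%s%zu",
                                     used ? " " : "", count);
    }
    return out;
}
```

Because snprintf writes the decimal digits itself, words of ten or more characters are handled without the count+'0' trick, and repeated, leading, or trailing spaces produce no stray output.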
The answer is: it depends. There are trade-offs.
Yes, it's possible to write some extra code that, before performing this action, counts the number of words in the original string and then allocates the new string based on the number of words rather than the number of characters.
But is it worth it? The extra code would make your program longer. That is, you would have more binary code, taking up more memory, which may be more than you gain. In addition, it will take more time to run.
By the way, you have a memory leak in your program, which is more of a problem.
As long as none of the words in the sentence is longer than 9 characters, the length of your output array needs only to be the number of words in the sentence multiplied by 2: one digit per word, one space between each pair of adjacent words, and one char for the null terminator.
So for the string
My name is Pranay Godha
...you need only an array of length 10.
If any of the words are ten characters or more, you'll need to calculate how many extra char your array will need by determining the number of digits required for each length. (e.g. a word of length 10 characters clearly requires two char to store the number 10.)
The real question is, is all of this worth it? Unless you're specifically required (homework?) to use the minimal space required in your output array, I'd be minded to allocate a suitably large array and perform some bounds checking when writing to it.

Resources