Trying to count the number of words in a file [closed] - c

Closed 6 years ago.
I have a file named myf which has a lot of text in it, and I am trying to use blank spaces as a way of counting the number of words. In the count function of my program, there is a variable int d which acts as a boolean flag, and a counter called count.
I have a for loop which traverses the array passed to count and checks whether the character *p is a space. If it is not a space and d == 0, d is set to 1 and count is incremented. This way, if the next character is also not a space, the else if branch does not run again because d == 1. The only way for d to reset to 0 is for a space to appear, after which the next non-space character increments count again. The function then returns count. Seems simple enough, but I keep getting wrong numbers.
#include <stdio.h>
#include<stdlib.h>
#include <string.h>
#include <ctype.h>
int count(char x[]) {
int d = 0;
int count = 0;
for (char *p = x; *p != EOF; *p++) {
// this will traverse file
printf("%c", *p);
// this is just to see the output of the file
if (*p == ' ' && d == 1) {
d = 0;
}
else if (*p != ' ' && d == 0) {
count++;
d = 1;
}
}
return count;
}
int main() {
char c;
int r = 0;
char l[1000];
FILE *fp = fopen("myf", "r");
while ((c = fgetc(fp)) != EOF) {
l[r] = c;
r++;
}
printf("\n %d", count(l));
}

To count the number of words, count the occurrences of a letter after a non-letter.
*p != EOF is the wrong test. EOF indicates that an input operation either 1) had no more input or 2) hit an input error. It does not signify the end of a string.
Use int to save the result from fgetc() as that returns an int in the range of unsigned char and EOF. Typically 257 different values. char is insufficient.
Small stuff: No need for an array. Let code consider ' as a letter. As the number of words could be very large, let code use a wide type like unsigned long long.
#include <ctype.h>
#include <stdio.h>
// A "letter" here is an alphabetic character or an apostrophe.
int isletter(int ch) {
    return isalpha(ch) || ch == '\'';
}
int main(void) {
    unsigned long long count = 0;
    FILE *fp = fopen("myf", "r");
    if (fp) {
        int ch;
        int previous = ' ';
        while ((ch = fgetc(fp)) != EOF) {
            // A word starts where a letter follows a non-letter.
            if (!isletter(previous) && isletter(ch)) count++;
            previous = ch;
        }
        fclose(fp);
    }
    printf("%llu\n", count);
}
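The same letter-after-non-letter rule can also be applied to text that is already in memory. The following is a minimal sketch, not part of the original answer; the names is_word_char and count_words are hypothetical, and it assumes the text is a properly '\0'-terminated string.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: letters and the apostrophe count as word characters. */
static int is_word_char(int ch)
{
    return isalpha(ch) || ch == '\'';
}

/* Counts words in a NUL-terminated string: a word starts wherever a
   word character follows a non-word character (or the string start). */
unsigned long long count_words(const char *s)
{
    unsigned long long count = 0;
    int previous = ' ';                 /* pretend a separator precedes the text */

    for (; *s != '\0'; s++) {
        int ch = (unsigned char)*s;     /* ctype functions expect unsigned char values */
        if (!is_word_char(previous) && is_word_char(ch))
            count++;
        previous = ch;
    }
    return count;
}
```

Under this rule "don't stop" is two words, while "a,b" is also two words because the comma acts as a separator.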

Don't do this
*p != EOF
EOF is actually a negative integer and you're comparing it against a char. You should pass in how many characters you want to iterate over, i.e.
int count(char x[], int max){
then use the for loop like
int m = 0;
for ( char *p = x; m < max; p++, m++)
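Note I also changed *p++ to p++. You also need to update your program to treat whitespace other than ' ' as a separator, i.e. this line
else if (*p != ' ' && d==0 )
When it encounters a \n it treats it as a word character, so a newline that follows a space is counted as an extra word, and two words separated only by a newline are counted as one.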
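Putting both suggestions together, a possible version of the counting function could look like the sketch below. The name count_words_n is hypothetical, and using isspace() to decide what separates words is an assumption about the desired behaviour, not something the question specifies.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts whitespace-separated words in the first len bytes of buf.
   The length is passed explicitly, so no sentinel such as EOF is needed. */
size_t count_words_n(const char *buf, size_t len)
{
    size_t count = 0;
    int in_word = 0;                    /* plays the role of d in the question */

    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)buf[i])) {
            in_word = 0;                /* space, tab, newline, ... end a word */
        } else if (!in_word) {
            in_word = 1;                /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

In the question's main, the call would then be count_words_n(l, r), where r is the number of bytes actually read from the file.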

Related

C, counting the number of blankspaces

I'm writing a function that replaces blank spaces into '-' (<- this character).
I ultimately want to return how many changes I made.
#include <stdio.h>
int replace(char c[])
{
int i, cnt;
cnt = 0;
for (i = 0; c[i] != EOF; i++)
if (c[i]==' ' || c[i] == '\t' || c[i] == '\n')
{
c[i] = '-';
++cnt;
}
return cnt;
}
main()
{
char cat[] = "The cat sat";
int n = replace(cat);
printf("%d\n", n);
}
The problem is, it correctly changes the string into "The-cat-sat" but for n, it returns the value 3, when it's supposed to return 2.
What have I done wrong?
@4386427 suggested this should be another answer. @wildplasser already provided the solution; this answer explains EOF and '\0'.
You would use EOF only when reading from a file or stream (EOF stands for End Of File). EOF is a macro that expands to a negative int, and its exact value is implementation-defined; it reports a condition (no more input, or a read error) rather than a character stored in the data. A C string held in a char array, or reached through a char pointer, is terminated by a '\0' character, so that is what you test for to break out of a loop over the string. Stopping at the '\0' ensures that you do not read memory beyond the end of the string.
#include <stdio.h>
int replc(void);
int main(void){
    int nc = replc();
    printf("replaced: %d times\n", nc);
    return 0;
}
/* Copies stdin to stdout, replacing each space with '-'.
   Here EOF is the right test, because the data comes from a stream. */
int replc(void){
    int c, nc = 0;
    while ((c = getchar()) != EOF)
        if (c == ' '){
            putchar('-');
            ++nc;
        } else putchar(c);
    return nc;
}
A string ends with a 0 (zero) value, not an EOF (so: the program in the question will scan the string beyond the terminating \0 until it happens to find a -1 somewhere beyond; but you are already in UB land, here)
[stylistic] the function argument could be a character pointer (an array parameter is adjusted to a pointer in C anyway)
[stylistic] a pointer version won't need the 'i' variable.
[stylistic] The count can never be negative: intuitively an unsigned counter is preferred. (it could even be a size_t, just like the other string functions)
[stylistic] a switch(){} can avoid the (IMO) ugly || list, it is also easier to add cases.
unsigned replace(char *cp){
unsigned cnt;
for(cnt = 0; *cp ; cp++) {
switch (*cp){
case ' ' : case '\t': case '\n':
*cp = '-';
cnt++;
default:
break;
}
}
return cnt;
}
Using EOF in the for loop's end condition is the problem, as you are not using it to check for the end of a file/stream.
for (i = 0; c[i] != EOF; i++)
EOF itself is not a character, but a signal that there are no more characters available in the stream.
If you are trying to check for the end of the string, use
for (i = 0; c[i] != '\0'; i++)
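For reference, here is a minimal sketch of the question's function with the loop stopping at the terminating '\0'. The name replace_blanks and the size_t return type are choices made for this illustration only.

```c
#include <stddef.h>

/* Replaces each space, tab, or newline in the string with '-' and
   returns how many replacements were made. The loop stops at the
   terminating '\0', never at EOF. */
size_t replace_blanks(char *s)
{
    size_t cnt = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == ' ' || s[i] == '\t' || s[i] == '\n') {
            s[i] = '-';
            cnt++;
        }
    }
    return cnt;
}
```

Called on a modifiable array holding "The cat sat", this makes 2 replacements, which is the count the question expected.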

Converting words from camelCase to snake_case in C

What I am trying to code is this: if I input camelcase, it should just print camelcase, but if the input contains an uppercase letter, for example camelCase, it should print camel_case.
Below is what I have so far. The problem is that if I input camelCase, it prints camel_ase.
Can someone please tell me the reason and how to fix it?
#include <stdio.h>
#include <ctype.h>
int main() {
char ch;
char input[100];
int i = 0;
while ((ch = getchar()) != EOF) {
input[i] = ch;
if (isupper(input[i])) {
input[i] = '_';
//input[i+1] = tolower(ch);
} else {
input[i] = ch;
}
printf("%c", input[i]);
i++;
}
}
First look at your code and think about what happens when someone enters a word longer than 100 characters -> undefined behavior. If you use a buffer for input, you always have to add checks so you don't overflow this buffer.
But then, as you directly print the characters, why do you need a buffer at all? It's completely unnecessary with the approach you show. Try this:
#include <stdio.h>
#include <ctype.h>
int main()
{
int ch;
int firstChar = 1; // needed to also accept PascalCase
while((ch = getchar())!= EOF)
{
if(isupper(ch))
{
if (!firstChar) putchar('_');
putchar(tolower(ch));
} else
{
putchar(ch);
}
firstChar = 0;
}
}
Side note: I changed the type of ch to int. This is because getchar() returns an int, and putchar(), isupper() and tolower() take an int; they all work with the value of an unsigned char, or EOF. As char is allowed to be signed, on a platform with signed char, you would get undefined behavior calling these functions with a negative char. I know, this is a bit complicated. Another way around this issue is to always cast your char to unsigned char when calling a function that takes the value of an unsigned char as an int.
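The cast mentioned in the side note can be sketched as follows. The function name lower_in_place is hypothetical; the point is only the (unsigned char) cast before the ctype call.

```c
#include <ctype.h>

/* Lowercases a NUL-terminated string in place. */
void lower_in_place(char *s)
{
    for (; *s != '\0'; s++) {
        /* Cast first: a plain char may be negative, which is not a
           valid argument for tolower(). */
        *s = (char)tolower((unsigned char)*s);
    }
}
```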
Your buffer is useless right now, but there is a possible solution that makes good use of one: read and write a whole line at a time. This is slightly more efficient than calling a function for every single character. Here's an example doing that:
#include <stdio.h>
static size_t toSnakeCase(char *out, size_t outSize, const char *in)
{
const char *inp = in;
size_t n = 0;
while (n < outSize - 1 && *inp)
{
if (*inp >= 'A' && *inp <= 'Z')
{
if (n > outSize - 3)
{
out[n++] = 0;
return n;
}
out[n++] = '_';
out[n++] = *inp + ('a' - 'A');
}
else
{
out[n++] = *inp;
}
++inp;
}
out[n++] = 0;
return n;
}
int main(void)
{
char inbuf[512];
char outbuf[1024]; // twice the length of the input is an upper bound
while (fgets(inbuf, 512, stdin))
{
toSnakeCase(outbuf, 1024, inbuf);
fputs(outbuf, stdout);
}
return 0;
}
This version also avoids isupper() and tolower(), but sacrifices portability. It only works if the character encoding has letters in sequence and has the uppercase letters before the lowercase letters. For ASCII, these assumptions hold. Be aware that what is considered an (uppercase) letter could also depend on the locale. The program above only works for letters A-Z as in the English language.
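If portability matters more than avoiding the ctype functions, a variant along the following lines may work. This is a sketch only: the name to_snake_case, the decision not to emit an underscore at the very start of the output, and the choice to return the length without the terminator are all assumptions made for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Writes the snake_case form of in to out (at most out_size bytes,
   always NUL-terminated when out_size > 0). Returns the number of
   characters written, not counting the terminator. */
size_t to_snake_case(char *out, size_t out_size, const char *in)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;

    for (; *in != '\0'; in++) {
        unsigned char ch = (unsigned char)*in;

        if (isupper(ch)) {
            /* Assumption: no underscore when nothing has been written yet. */
            size_t needed = (n > 0) ? 2 : 1;
            if (n + needed >= out_size)
                break;                  /* no room for the character(s) plus '\0' */
            if (n > 0)
                out[n++] = '_';
            out[n++] = (char)tolower(ch);
        } else {
            if (n + 1 >= out_size)
                break;                  /* no room for the character plus '\0' */
            out[n++] = (char)ch;
        }
    }
    out[n] = '\0';
    return n;
}
```

When the output buffer is too small, the result is truncated at a character boundary rather than in the middle of an underscore and letter pair.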
I don't know exactly how to code in C but I think you should do something like this.
if(isupper(input[i]))
{
input[i] = tolower(ch);
printf("_");
} else
{
input[i] = ch;
}
There are two problems in your code:
You insert one character in each branch of if, while one of them is supposed to insert two characters, and
You print characters as you go, but the first branch is supposed to print both _ and ch.
You can fix this by incrementing i on insertion with i++, and by printing the entire word at the end:
int ch; // <<== Has to be int, not char
char input[100];
size_t i = 0;
// Stop while there is still room for '_', one letter, and the terminating '\0'
while((ch = getchar())!= EOF && (i + 2 < sizeof(input))) {
    if(isupper(ch)) {
        if (i != 0) {
            input[i++] = '_';
        }
        ch = tolower(ch);
    }
    input[i++] = ch;
}
input[i] = '\0'; // Null-terminate the string
printf("%s\n", input);
There are multiple problems in your code:
ch is defined as a char: you cannot properly test for end of file if ch is not defined as an int. getchar() can return all values of type unsigned char plus the special value EOF, which is negative. Define ch as int.
You store the byte into the array input and use isupper(input[i]). isupper() is only defined for values returned by getc(), not for potentially negative values of the char type if this type is signed on the target system. Use isupper(ch) or isupper((unsigned char)input[i]).
You do not check if i is small enough before storing bytes to input[i], causing a potential buffer overflow. Note that it is not necessary to store the characters into an array for your problem.
You should insert the '_' in the array and the character converted to lowercase. This is your principal problem.
Whether you want Main to be converted to _main, main or left as Main is a question of specification.
Here is a simpler version:
#include <ctype.h>
#include <stdio.h>
int main(void) {
int c;
while ((c = getchar()) != EOF) {
if (isupper(c)) {
putchar('_');
putchar(tolower(c));
} else {
putchar(c);
}
}
return 0;
}
To output the entered characters in the form as you showed there is no need to use an array. The program can look the following way
#include <stdio.h>
#include <ctype.h>
int main( void )
{
int c;
while ((c = getchar()) != EOF && c != '\n')
{
if (isupper(c))
{
putchar('_');
c = tolower(c);
}
putchar(c);
}
putchar('\n');
return 0;
}
If you want to use a character array, you should reserve one of its elements for the terminating zero so that the array contains a string.
In this case the program can look like
#include <stdio.h>
#include <ctype.h>
int main( void )
{
char input[100];
const size_t N = sizeof(input) / sizeof(*input);
int c;
size_t i = 0;
while ( i + 1 < N && (c = getchar()) != EOF && c != '\n')
{
if (isupper(c))
{
input[i++] = '_';
c = tolower(c);
}
if ( i + 1 != N ) input[i++] = c;
}
input[i] = '\0';
puts(input);
return 0;
}

Standard Input - Counting chars/words/lines

I've written some code for finding the # of chars, lines and words in a standard input but I have a few questions.
On running the program - It doesn't grab any inputs from me. Am I able to use shell redirection for this?
My word count - only counts if getchar() is equal to the escape ' or a ' ' space. I want it so that it also counts if its outside of a decimal value range on the ASCII table. IE. if getchar() != in the range of a->z and A->Z or a ', wordcount += 1.
I was thinking about using a decimal value range here to represent the range - ie: getchar() != (65->90 || 97->122 || \' ) -> wordcount+1
https://en.wikipedia.org/wiki/ASCII for ref.
Would this be the best way of going about answering this? and if so, what is the best way to implement the method?
#include <stdio.h>
int main() {
unsigned long int charcount;
unsigned long int wordcount;
unsigned long int linecount;
int c = getchar();
while (c != EOF) {
//characters
charcount += 1;
//words separated by characters outside range of a->z, A->Z and ' characters.
if (c == '\'' || c == ' ')
wordcount += 1;
//line separated by \n
if (c == '\n')
linecount += 1;
}
printf("%lu %lu %lu\n", charcount, wordcount, linecount);
}
Your code has multiple problems:
You do not initialize charcount, wordcount or linecount. Local variables with automatic storage have indeterminate values until they are initialized; reading them before that invokes undefined behavior.
You only read a single byte from standard input. You should keep reading until you get EOF.
Your method for detecting words is incorrect: it is questionable whether ' is a delimiter, but you seem to want to specifically consider it to be. The standard wc utility considers only white space to separate words. Furthermore, multiple separators should only count for 1.
Here is a corrected version with your semantics, namely words are composed of letters, everything else counting as separators:
#include <ctype.h>
#include <stdio.h>
int main(void) {
unsigned long int charcount = 0;
unsigned long int wordcount = 0;
unsigned long int linecount = 0;
int c, lastc = '\n';
int inseparator = 1;
while ((c = getchar()) != EOF) {
charcount += 1; // characters
if (isalpha(c)) {
wordcount += inseparator;
inseparator = 0;
} else {
inseparator = 1;
if (c == '\n')
linecount += 1;
}
lastc = c;
}
if (lastc != '\n')
linecount += 1; // count the last line if not terminated with \n
printf("%lu %lu %lu\n", charcount, wordcount, linecount);
}
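For comparison, the whitespace-only rule that the answer attributes to wc can be sketched as below. This operates on a buffer rather than on stdin so that it is easy to try out; the struct counts type and the name count_text are hypothetical. Unlike the version above, it counts only '\n' characters as lines.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct counts {
    unsigned long chars;
    unsigned long words;
    unsigned long lines;
};

/* Counts bytes, whitespace-separated words, and newline characters
   in the first len bytes of buf. */
struct counts count_text(const char *buf, size_t len)
{
    struct counts r = {0, 0, 0};
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)buf[i];

        r.chars++;
        if (ch == '\n')
            r.lines++;
        if (isspace(ch)) {
            in_word = 0;                /* any run of whitespace is one separator */
        } else if (!in_word) {
            in_word = 1;                /* first character of a new word */
            r.words++;
        }
    }
    return r;
}
```

With this rule an apostrophe is part of a word, so "it's" counts once.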
You need:
while ((c = getchar()) != EOF)
as the head of your loop, with c declared beforehand as int c;. As you have it, getchar() reads one character and the while block then loops around with no further getchar() occurring!

Program runs too slowly with large input - C

The goal for this program is for it to count the number of instances that two consecutive letters are identical and print this number for every test case. The input can be up to 1,000,000 characters long (thus the size of the char array to hold the input). The website which has the coding challenge on it, however, states that the program times out at a 2s run-time. My question is, how can this program be optimized to process the data faster? Does the issue stem from the large char array?
Also: I get a compiler warning "assignment makes integer from pointer without a cast" for the line str[1000000] = "" What does this mean and how should it be handled instead?
Input:
number of test cases
strings of capital A's and B's
Output:
Number of duplicate letters next to each other for each test case, each on a new line.
Code:
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
int main() {
int n, c, a, results[10] = {};
char str[1000000];
scanf("%d", &n);
for (c = 0; c < n; c++) {
str[1000000] = "";
scanf("%s", str);
for (a = 0; a < (strlen(str)-1); a++) {
if (str[a] == str[a+1]) { results[c] += 1; }
}
}
for (c = 0; c < n; c++) {
printf("%d\n", results[c]);
}
return 0;
}
You don't need the line
str[1000000] = "";
scanf() adds a null terminator when it parses the input and writes it to str. This line is also writing beyond the end of the array, since the last element of the array is str[999999].
The reason you're getting the warning is that the type of str[1000000] is char, but a string literal is an array of char that decays to a char* here.
To speed up the program, take the call to strlen() out of the loop.
size_t len = strlen(str); // computed once, not on every iteration
for (a = 0; a + 1 < len; a++) { // a + 1 < len also copes with an empty string
...
}
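Going one step further, the length does not need to be computed at all, because the loop can stop at the terminating '\0'. The sketch below shows that idea; the function name count_adjacent_pairs is hypothetical, and it counts "AAAA" as 3 pairs.

```c
#include <stddef.h>

/* Counts positions where a character equals the one before it,
   in a single pass over a NUL-terminated string. */
size_t count_adjacent_pairs(const char *s)
{
    size_t pairs = 0;

    if (s[0] == '\0')
        return 0;                       /* empty string: nothing to compare */

    for (size_t i = 1; s[i] != '\0'; i++) {
        if (s[i] == s[i - 1])
            pairs++;
    }
    return pairs;
}
```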
str[1000000] = "";
This does not do what you think it does, and you're overflowing the buffer, which results in undefined behaviour. Valid indexes range from 0 up to, but not including, sizeof(str). So you either add one to the 1000000 in the array declaration or use 999999 as the index instead. To get rid of the compiler warning and produce cleaner code use:
str[1000000] = '\0';
Or
str[999999] = '\0';
Depending on what you did to fix it.
As to optimizing, you should look at the assembly and go from there.
count the number of instances that two consecutive letters are identical and print this number for every test case
For efficiency, the code needs a new approach, as suggested by @john bollinger & @molbdnilo
void ReportPairs(const char *str, size_t n) {
int previous = EOF;
unsigned long repeat = 0;
for (size_t i=0; i<n; i++) {
int ch = (unsigned char) str[i];
if (isalpha(ch) && ch == previous) {
repeat++;
}
previous = ch;
}
printf("Pair count %lu\n", repeat);
}
char *testcase1 = "test1122a33";
ReportPairs(testcase1, strlen(testcase1));
or directly from input and "each test case, each on a new line."
int ReportPairs2(FILE *inf) {
int previous = EOF;
unsigned long repeat = 0;
int ch;
while ((ch = fgetc(inf)) != '\n') {
if (ch == EOF) return ch;
if (isalpha(ch) && ch == previous) {
repeat++;
}
previous = ch;
}
printf("Pair count %lu\n", repeat);
return ch;
}
while (ReportPairs2(stdin) != EOF);
Unclear how OP wants to count "AAAA" as 2 or 3. This code counts it as 3.
One way to dramatically improve the run-time of your code is to limit the number of times you read from stdin (basically, process input in bigger chunks). You can do this in a number of ways, but probably one of the most efficient would be with fread. Even reading in 8-byte chunks can provide a big improvement over reading a character at a time. One example of such an implementation, considering capital letters [A-Z] only, would be:
#include <stdio.h>
#define RSIZE 8
int main (void) {
char qword[RSIZE] = {0};
char last = 0;
size_t i = 0;
size_t nchr = 0;
size_t dcount = 0;
/* read up to 8-bytes at a time */
while ((nchr = fread (qword, sizeof *qword, RSIZE, stdin)))
{ /* compare each byte to byte before */
for (i = 1; i < nchr && qword[i] && qword[i] != '\n'; i++)
{ /* if not [A-Z] continue, else compare */
if (qword[i-1] < 'A' || qword[i-1] > 'Z') continue;
if (i == 1 && last == qword[i-1]) dcount++;
if (qword[i-1] == qword[i]) dcount++;
}
last = qword[i-1]; /* save last for comparison w/next */
}
printf ("\n sequential duplicated characters [A-Z] : %zu\n\n",
dcount);
return 0;
}
Output/Time with 868789 chars
$ time ./bin/find_dup_digits <dat/d434839c-d-input-d4340a6.txt
sequential duplicated characters [A-Z] : 434893
real 0m0.024s
user 0m0.017s
sys 0m0.005s
Note: the string was actually a string of '0's and '1's run with a modified test of if (qword[i-1] < '0' || qword[i-1] > '9') continue; rather than the test for [A-Z]...continue, but your results with 'A's and 'B's should be virtually identical. 1000000 would still be significantly under .1 seconds. You can play with the RSIZE value to see if there is any benefit to reading a larger (suggested 'power of 2') size of characters. (note: this counts AAAA as 3) Hope this helps.
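A larger read size simplifies the bookkeeping, since only the last byte of the previous chunk has to be remembered. The following sketch illustrates that under some assumptions: the name count_pairs_stream and the 4096-byte CHUNK size are arbitrary choices for this example, it uses isupper() instead of an explicit A-Z range test, and it counts pairs over the whole stream rather than per test case.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK 4096                      /* arbitrary read size for this sketch */

/* Counts adjacent identical uppercase letters in a stream, reading it
   in CHUNK-sized blocks. A newline between two letters breaks a pair. */
unsigned long count_pairs_stream(FILE *in)
{
    char buf[CHUNK];
    size_t n;
    int last = EOF;                     /* no previous character yet */
    unsigned long pairs = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            int ch = (unsigned char)buf[i];
            if (isupper(ch) && ch == last)
                pairs++;
            last = ch;                  /* carried across chunk boundaries */
        }
    }
    return pairs;
}
```

Like the version above, this counts AAAA as 3.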

C program, Reversing an array

I am writing a C program that reads a line of characters from standard input and then outputs the line of characters in reverse order.
It doesn't print the reversed array; instead it prints the regular array.
Can anyone help me?
What am I doing wrong?
main()
{
int count;
int MAX_SIZE = 20;
char c;
char arr[MAX_SIZE];
char revArr[MAX_SIZE];
while(c != EOF)
{
count = 0;
c = getchar();
arr[count++] = c;
getReverse(revArr, arr);
printf("%s", revArr);
if (c == '\n')
{
printf("\n");
count = 0;
}
}
}
void getReverse(char dest[], char src[])
{
int i, j, n = sizeof(src);
for (i = n - 1, j = 0; i >= 0; i--)
{
j = 0;
dest[j] = src[i];
j++;
}
}
You have quite a few problems in there. The first is that there is no prototype in scope for getReverse() when you use it in main(). You should either provide a prototype or just move getReverse() to above main() so that main() knows about it.
The second is the fact that you're trying to reverse the string after every character being entered, and that your input method is not quite right (it checks an indeterminate c before ever getting a character). It would be better as something like this:
count = 0;
c = getchar();
while (c != EOF) {
arr[count++] = c;
c = getchar();
}
arr[count] = '\0';
That will get you a proper C string albeit one with a newline on the end, and even possibly a multi-line string, which doesn't match your specs ("reads input from the standard input a line of characters"). If you want a newline or file-end to terminate input, you can use this instead:
count = 0;
c = getchar();
while ((c != '\n') && (c != EOF)) {
arr[count++] = c;
c = getchar();
}
arr[count] = '\0';
And, on top of that, c should actually be an int, not a char, because it has to be able to store every possible character plus the EOF marker.
Your getReverse() function also has problems, mainly due to the fact it's not putting an end-string marker at the end of the array but also because it uses the wrong size (sizeof rather than strlen) and because it appears to re-initialise j every time through the loop. In any case, it can be greatly simplified:
void getReverse (char *dest, char *src) {
int i = strlen(src) - 1, j = 0;
while (i >= 0) {
dest[j] = src[i];
j++;
i--;
}
dest[j] = '\0';
}
or, once you're a proficient coder:
void getReverse (char *dest, char *src) {
int i = strlen(src) - 1, j = 0;
while (i >= 0)
dest[j++] = src[i--];
dest[j] = '\0';
}
If you need a main program which gives you reversed characters for each line, you can do that with something like this:
int main (void) {
int count;
int MAX_SIZE = 100; // large enough for the sample lines below; there is still no bounds check
int c;
char arr[MAX_SIZE];
char revArr[MAX_SIZE];
c = getchar();
count = 0;
while(c != EOF) {
if (c != '\n') {
arr[count++] = c;
c = getchar();
continue;
}
arr[count] = '\0';
getReverse(revArr, arr);
printf("'%s' => '%s'\n", arr, revArr);
count = 0;
c = getchar();
}
return 0;
}
which, on a sample run, shows:
pax> ./testprog
hello
'hello' => 'olleh'
goodbye
'goodbye' => 'eybdoog'
a man a plan a canal panama
'a man a plan a canal panama' => 'amanap lanac a nalp a nam a'
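When the original line does not need to be kept, the second buffer can be avoided by swapping characters from both ends. This is an alternative sketch, not the approach used above; the name reverse_in_place is hypothetical.

```c
#include <string.h>

/* Reverses a NUL-terminated string in place by swapping characters
   from both ends until the two indices meet. */
void reverse_in_place(char *s)
{
    size_t len = strlen(s);

    if (len < 2)
        return;                         /* nothing to swap */

    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

The argument must be a modifiable array; passing a string literal would be undefined behaviour.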
Your count variable is reset to 0 every time the while loop runs.
count is set to 0 on every iteration of the loop, so each character overwrites arr[0].
You are sending the array for reversal after every character, which is wasteful but won't cause problems by itself. Rather, first store all the characters in the array and send it once to the getReverse function after the array is complete.
sizeof(src) will not give the number of characters; inside the function src is a pointer, so it gives the size of a pointer. You could pass count to getReverse as a parameter after the loop in main has terminated. Of course there are many ways and various functions, but since it seems like you are in the initial stages, you can try strlen and other such functions.
You have initialised j to 0 in the for loop, but setting it again INSIDE the loop resets the value on every iteration, hence j never ends up incrementing. So remove the j = 0 from INSIDE the loop, since it only needs to be initialised once.
check this out
#include <stdio.h>
#include <ctype.h>
void getReverse(char dest[], char src[], int count);
int main()
{
// *always* initialize variables
int count = 0;
const int MaxLen = 20; // max length string, leave upper case names for MACROS
const int MaxSize = MaxLen + 1; // add one for ending \0
int c = '\0';
char arr[MaxSize] = {0};
char revArr[MaxSize] = {0};
// first collect characters to be reversed
// note that input is buffered so user could enter more than MAX_SIZE
do
{
c = fgetc(stdin);
if ( c != EOF && (isalpha(c) || isdigit(c))) // only consider "proper" characters
{
arr[count++] = (char)c;
}
}
while(c != EOF && c != '\n' && count < MaxLen); // EOF or Newline or MaxLen
getReverse( revArr, arr, count );
printf("%s\n", revArr);
return 0;
}
void getReverse(char dest[], char src[], int count)
{
int i = count - 1;
int j = 0;
while ( i > -1 )
{
dest[j++] = src[i--];
}
}
Dealing with strings is a rich source of bugs in C, because even simple operations like copying and modifying require thinking about issues of allocation and storage. This problem though can be simplified considerably by thinking of the input and output not as strings but as streams of characters, and relying on recursion and local storage to handle all allocation.
The following is a complete program that will read one line of standard input and print its reverse to standard output, with the length of the input limited only by the growth of the stack:
#include <stdio.h>
int florb (int c) { return c == '\n' ? c : putchar(florb(getchar())), c; }
int main(void) { florb('-'); return 0; }
..or check this
#include <stdio.h>
#include <stdlib.h>
#define MAX 100
void my_rev(const char *source);
int main(void)
{
    char *stringA;
    stringA = malloc(MAX); /* memory allocation for 100 characters */
    if(stringA == NULL) /* if malloc returns NULL error msg is printed and program exits */
    {
        fprintf(stderr, "Out of memory error\n");
        exit(1);
    }
    fprintf(stdout, "Type a string:\n");
    if(fgets(stringA, MAX, stdin) != NULL)
        my_rev(stringA);
    free(stringA);
    return 0;
}
void my_rev(const char *source) /* const makes sure that function does not modify the value pointed to by source pointer */
{
    size_t len = 0; /* first the function calculates the length of the line */
    while(source[len] != '\n' && source[len] != '\0') /* fgets keeps the newline if it fits; also stop at \0 in case it did not */
    {
        len++;
    }
    while(len > 0) /* print the characters from last to first */
    {
        len--;
        fputc(source[len], stdout);
    }
    fputc('\n', stdout);
}
Output looks like this:
Type a string:
writing about C programming
gnimmargorp C tuoba gnitirw

Resources