Printf() prints string arguments out of order - c

I have some C-code that reads in a text file line by line, hashes the strings in each line, and keeps a running count of the string with the biggest hash values.
It seems to be doing the right thing but when I issue the print statement:
printf("Found Bigger Hash:%s\tSize:%d\n", textFile.biggestHash, textFile.maxASCIIHash);
my print returns this in the output:
Preprocessing: dict1
Found BiSize:110h:a
Found BiSize:857h:aardvark
Found BiSize:861h:aardwolf
Found BiSize:937h:abandoned
Found BiSize:951h:abandoner
Found BiSize:1172:abandonment
Found BiSize:1283:abbreviation
Found BiSize:1364:abiogenetical
Found BiSize:1593:abiogenetically
Found BiSize:1716:absentmindedness
Found BiSize:1726:acanthopterygian
Found BiSize:1826:accommodativeness
Found BiSize:1932:adenocarcinomatous
Found BiSize:2162:adrenocorticotrophic
Found BiSize:2173:chemoautotrophically
Found BiSize:2224:counterrevolutionary
Found BiSize:2228:counterrevolutionist
Found BiSize:2258:dendrochronologically
Found BiSize:2440:electroencephalographic
Found BiSize:4893:pneumonoultramicroscopicsilicovolcanoconiosis
Biggest Size:46umonoultTotal Words:71885covolcanoconiosis
So it seems I'm misusing printf(). Below is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define WORD_LENGTH 100 // Max number of characters per word
// data1 struct carries information about the dictionary file; preprocess() initializes it
struct data1
{
int numRows;
int maxWordSize;
char* biggestWord;
int maxASCIIHash;
char* biggestHash;
};
int asciiHash(char* wordToHash);
struct data1 preprocess(char* fileName);
int main(int argc, char* argv[]){
//Diagnostics Purposes; Not used for algorithm
printf("Preprocessing: %s\n",argv[1]);
struct data1 file = preprocess(argv[1]);
printf("Biggest Word:%s\t Size:%d\tTotal Words:%d\n", file.biggestWord, file.maxWordSize, file.numRows);
//printf("Biggest hashed word (by ASCII sum):%s\tSize: %d\n", file.biggestHash, file.maxASCIIHash);
//printf("**%s**", file.biggestHash);
return 0;
}
int asciiHash(char* word)
{
int runningSum = 0;
int i;
for(i=0; i<strlen(word); i++)
{
runningSum += *(word+i);
}
return runningSum;
}
struct data1 preprocess(char* fName)
{
static struct data1 textFile = {.numRows = 0, .maxWordSize = 0, .maxASCIIHash = 0};
textFile.biggestWord = (char*) malloc(WORD_LENGTH*sizeof(char));
textFile.biggestHash = (char*) malloc(WORD_LENGTH*sizeof(char));
char* str = (char*) malloc(WORD_LENGTH*sizeof(char));
FILE* fp = fopen(fName, "r");
while( strtok(fgets(str, WORD_LENGTH, fp), "\n") != NULL)
{
// If found a larger hash
int hashed = asciiHash(str);
if(hashed > textFile.maxASCIIHash)
{
textFile.maxASCIIHash = hashed; // Update max hash size found
strcpy(textFile.biggestHash, str); // Update biggest hash string
printf("Found Bigger Hash:%s\tSize:%d\n", textFile.biggestHash, textFile.maxASCIIHash);
}
// If found a larger word
if( strlen(str) > textFile.maxWordSize)
{
textFile.maxWordSize = strlen(str); // Update biggest word size
strcpy(textFile.biggestWord, str); // Update biggest word
}
textFile.numRows++;
}
fclose(fp);
free(str);
return textFile;
}

You forgot to remove the \r after reading. It is in your input because (1) your source file comes from a Windows machine (or at least one which uses \r\n line endings), and (2) the fopen mode "r" does not translate line endings on your OS (presumably one where text mode leaves \r\n untouched, unlike Windows).
This results in the weird output as follows:
Found Bigger Hash:text\r\tSize:123
– see the position of the \r? So what happens when outputting this string, you get at first
Found Bigger Hash:text
and then the cursor gets repositioned to the start of the line by \r. Next, a tab is output – not by printing spaces but merely moving the cursor to the 8th position:
1234567↓
Found Bigger Hash:text
and the rest of the string is printed over the one already shown:
Found BiSize:123h:text
Possible solutions:
Open your file in "rt" ("text") mode (the t is a non-standard extension that only some C libraries act on), and/or
Check for, and remove, the \r code as well as \n.
I'd go for both. strchr is pretty cheap and will make your code a bit more foolproof.
(Also, please simplify your fgets line by splitting it up into several distinct operations.)
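A minimal sketch of the second suggestion, with the fgets call split out as recommended. The helper names strip_eol and read_line are made up for illustration; strcspn returns the index of the first '\r' or '\n' (or the string length if neither is present), so the same line handles both Unix and Windows line endings:

```c
#include <stdio.h>
#include <string.h>

/* Cut the string at the first '\r' or '\n', whichever comes first.
   If neither is present, strcspn returns strlen(s) and nothing changes. */
void strip_eol(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}

/* Read one line into buf and strip its line ending.
   Returns 1 on success, 0 on end of file or read error. */
int read_line(FILE *fp, char *buf, int size)
{
    if (fgets(buf, size, fp) == NULL)
        return 0;
    strip_eol(buf);
    return 1;
}
```

With this in place the loop in preprocess() could read `while (read_line(fp, str, WORD_LENGTH))`, and the hashed string no longer carries a stray \r.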

Your statement
while( strtok(fgets(str, WORD_LENGTH, fp), "\n") != NULL)
takes no account of the return value from fgets() or the way strtok() works.
The way to do this is something like
char *fptr, *sptr;
while ((fptr = fgets(str, WORD_LENGTH, fp)) != NULL) {
sptr = strtok(fptr, "\n");
while (sptr != NULL) {
printf ("%s,", sptr);
sptr = strtok (NULL, "\n");
}
printf("\n");
}
Note that after the first call to strtok(), subsequent calls on the same sequence must pass the parameter NULL.

Related

Getting strange strings in C after reading getc from file

I am getting strange strings after the first iteration. I suspect it could be because of string termination, but I am not sure how to fix it. Or I might be using malloc the wrong way.
I am happy for any hints.
#include <stdio.h>
#include <memory.h>
#include <malloc.h>
#include <ctype.h>
#include "file_reader.h"
/**
* Opens a text file and reads the file. The text of the file is stored
* in memory in blocks of size blockSize. The linked list with the text is
* returned by the function. Each block should contain only complete words.
* If a word is split by the end of the block, the last letters should be
* moved into the next text block. Each text block must be NULL-terminated.
* If the reading of the file fails, the program should return a meaningful
* error message.
*/
int getFileSize(FILE* file) {
FILE* endOfFile = file;
fseek(endOfFile, 0, SEEK_END);
long int size = ftell(file);
fseek(file, 0, SEEK_SET);
return (int) size;
}
LinkedList* read_text_file(const char* filename, int blockSize) {
int globalByteCounter = 0;
LinkedList* list = LinkedList_create();
int blockByteCounter;
FILE* fp = fopen(filename, "r");
int fileSize = getFileSize(fp);
char* tokPointer = malloc(sizeof(getc(fp)));
char* block = malloc(sizeof strcat("",""));
//Loop for blocks in list
while (globalByteCounter <= fileSize) {
blockByteCounter = 0;
char* word = malloc(sizeof(blockSize));
//loop for each block
while(blockByteCounter<blockSize) {
char tok;
//Building a word
do {
strcat(word, tokPointer);
tok = (char) getc(fp);
tokPointer=&tok;
blockByteCounter++;
}while (isalpha(tok));
//Does this word still fit the block?
if (blockByteCounter + strlen(word) < blockSize) {
strcat(block, word);
//Reset the word and append the special character
word = strcpy(word,tokPointer);
} else {
strcpy(block,word);
}
}
globalByteCounter += blockByteCounter;
LinkedList_append(list, block);
free(word);
}
LinkedList_append(list,block);
fclose(fp);
free(block);
free(tokPointer);
return list;
}
There are multiple issues with the code. Let me tackle a few of them:
sizeof(getc(fp))
This is the same as applying sizeof on the return type of getc. In your case, what you are doing here is sizeof(int). That's not what you want.
Assuming that you have a text file, where the size of what you want to read is a number in ASCII, what you are looking for is the good old fscanf.
Similar here:
strcat("","")
but actually worse. strcat("a", "b") does not return "ab". It attempts to concatenate "b" onto "a" and returns the address of "a", which is pretty bad: not only does it fail to do what you want, it also attempts to modify the string "a". You can't modify string literals.
blockByteCounter is not initialized.
And you got your hunch right:
char* word = malloc(sizeof(blockSize));
If you don't initialize word as an empty string, when you try to concatenate tokPointer onto it you'll run through a non-terminated string. Not only that, but tokPointer is also not initialized!
I'm also not sure why you are trying to use strcat to build a word. You don't need all those pointers. Once you know the required size of your buffer, you can 1) simply use fscanf to read one word; or 2) use fgetc with a good old simple counter i to put each letter into the buffer array, and then terminate it with 0 before printing.
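To make the strcat point above concrete, here is a small sketch of concatenating into storage you actually own. The function name concat_new is hypothetical; the key step is allocating strlen(a) + strlen(b) + 1 bytes rather than taking sizeof of an expression:

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated string holding a followed by b,
   or NULL if the allocation fails. The caller frees the result. */
char *concat_new(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *out = malloc(la + lb + 1);    /* +1 for the terminating '\0' */
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);        /* also copies b's terminator */
    return out;
}
```

Neither input is modified, so passing string literals is safe here.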

counting the number of strings in a text file containing numbers as well

I wanted to only count the number of strings in a text file, containing numbers as well. But the code below, counts even the numbers in the file as strings. How do I rectify the problem?
int count;
char *temp;
FILE *fp;
fp = fopen("multiplexyz.txt" ,"r" );
while(fscanf(fp,"%s",temp) != EOF )
{
count++;
}
printf("%d ",count);
return 0;
}
Well, first up, using the temp pointer without having backing storage for it is going to cause you a world of pain.
I'd suggest, as a start, using something like char temp[1000] instead, keeping in mind that's still a bit risky if you have words more than a thousand or so characters long (that's a different issue to the one you're asking about so I'll mention it but not spend too much time on fixing it).
Secondly, it appears you want to count words with numbers (like alpha7 or pi/2). If that's the case, you simply need to check temp after reading the "word" and increment count only if it matches a "non-numeric" pattern.
That could be as simple as just not incrementing if the word consists only of digits, or it could be complicated if you want to handle decimals, exponential formats and so on.
But the bottom line remains the same:
while(fscanf(fp,"%s",temp) != EOF )
{
if (! isANumber(temp))
count++;
}
with a suitable definition of isANumber. For example, for unsigned integers only, something like this would be a good start:
int isANumber (char *str) {
// Empty string is not a number.
if (*str == '\0')
return 0;
// Check every character.
while (*str != '\0') {
// If non-digit, it's not a number.
if (! isdigit (*str))
return 0;
str++;
}
// If all characters were digits, it was a number.
return 1;
}
For more complex checking, you can use the strto* calls in C, giving them the temp buffer and ensuring you use the endptr method to ensure the entire string is scanned. Off the top of my head, so not well tested, that would go something like:
int isANumber (char *str) {
// Empty string is not a number.
if (*str == '\0')
return 0;
// Use strtold to get a long double.
char *endPtr;
long double d = strtold (str, &endPtr);
// Characters unconsumed, not number (things like 42b).
if (*endPtr != '\0')
return 0;
// Was a long double, so number.
return 1;
}
The only thing you need to watch out for there is that certain strings like NaN or +Inf are considered a number by strtold so you may need extra checks for that.
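One possible form of that extra check, as a sketch rather than a tested drop-in: it uses strtod instead of strtold for brevity and rejects anything that does not convert to a finite value. The name isFiniteNumber is made up for this example:

```c
#include <math.h>
#include <stdlib.h>

/* Returns 1 if str is, in its entirety, a finite number as parsed by strtod.
   Returns 0 for empty strings, trailing junk (42b), NaN and infinities. */
int isFiniteNumber(const char *str)
{
    char *endPtr;
    double d;

    if (*str == '\0')
        return 0;
    d = strtod(str, &endPtr);
    if (endPtr == str || *endPtr != '\0')   /* nothing parsed, or leftovers */
        return 0;
    return isfinite(d) ? 1 : 0;             /* rejects NaN, +Inf, -Inf */
}
```

Note that values which overflow a double (such as 1e999) come back from strtod as infinity, so they are rejected too.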
Inside your while loop, loop through the string to check whether any of its characters are digits. Something like:
for (char *p = temp; *p != '\0'; p++) {
if (isdigit((unsigned char)*p))
break;
}
(Don't copy this code verbatim; adapt it to your program.)
I find strpbrk to be one of the most helpful functions for searching for several needles in a haystack. Your set of needles is the numeric characters "0123456789"; any line read from your file that contains one of them is counted. I also prefer POSIX getline for a line count due to its proper handling of files whose last line has a non-POSIX line ending (wc -l, for example, does not count a last line that lacks a POSIX line end, '\n'). That said, a small function that searches each line for characters contained in a string trm passed as a parameter could be written as:
/** open and read each line in 'fn' returning the number of lines
* containing any of the characters in 'trm'.
*/
size_t nlines (char *fn, char *trm)
{
if (!fn) return 0;
size_t lines = 0, n = 0;
char *buf = NULL;
FILE *fp = fopen (fn, "r");
if (!fp) return 0;
while (getline (&buf, &n, fp) != -1)
if (strpbrk (buf, trm))
lines++;
fclose (fp);
free (buf);
return lines;
}
Simply pass the filename of interest and the terms to search for in each line. A short test code with a default term of "0123456789" that takes the filename as the first parameter and the term as the second could be written as follows:
#include <stdio.h> /* printf */
#include <stdlib.h> /* free */
#include <string.h> /* strpbrk */
size_t nlines (char *fn, char *trm);
int main (int argc, char **argv) {
char *fn = argc > 1 ? argv[1] : NULL;
char *srch = argc > 2 ? argv[2] : "0123456789";
if (!fn) return 1;
printf ("%zu %s\n", nlines (fn, srch), fn);
return 0;
}
/** open and read each line in 'fn' returning the number of lines
* containing any of the characters in 'trm'.
*/
size_t nlines (char *fn, char *trm)
{
if (!fn) return 0;
size_t lines = 0, n = 0;
char *buf = NULL;
FILE *fp = fopen (fn, "r");
if (!fp) return 0;
while (getline (&buf, &n, fp) != -1)
if (strpbrk (buf, trm))
lines++;
fclose (fp);
free (buf);
return lines;
}
Give it a try and see if this is what you are expecting, if not, just let me know and I am glad to help further.
Example Input File
$ cat dat/linewno.txt
The quick brown fox
jumps over 3 lazy dogs
who sleep in the sun
with a temp of 101
Example Use/Output
$ ./bin/getline_nlines_nums dat/linewno.txt
2 dat/linewno.txt
$ wc -l dat/linewno.txt
4 dat/linewno.txt

How do I count occurrences of a list of strings and output them to a new file?

I have been given three '.txt' files.
The first is a list of words.
The second is a document to search.
The third is a blank document that will have my output written to it.
I'm supposed to take each word in the first file, search the second file and print the number of occurrences in the third file as "wordX = numOccurences."
I've got a good function that will return the wordCount, and it returns it correctly for the first word, but then I get a zero for all the remaining words.
I've tried to dereference everything, and I think I've come to a standstill. There's something wrong with the "pointer talk."
I have yet to start outputting the words to a new file, but that printf statement should be a print to file statement in append mode. Easy enough.
Here is the working wordCount function - it works if I just give it a single word, like "testing," but if I give it an array I want to iterate through, it just returns 0.
int countWord(char* filePath, char* word){ //Not mine. This is a working prototype function from SO, returns word count of particular word
FILE *fp;
int count = 0;
int ch, len;
if(NULL==(fp=fopen(filePath, "r")))
return -1;
len = strlen(word);
for(;;){
int i;
if(EOF==(ch=fgetc(fp))) break;
if((char)ch != *word) continue;
for(i=1;i<len;++i){
if(EOF==(ch = fgetc(fp))) goto end;
if((char)ch != word[i]){
fseek(fp, 1-i, SEEK_CUR);
goto next;
}
}
++count;
next: ;
}
end:
fclose(fp);
return count;
}
This is my part of the program, trying to call the function while the loop gets all the words from the first file. The loop IS grabbing the words, because it prints them, but wordCount isn't accepting anything beyond the first word.
int main(){
FILE *ptr_file;
char words[100];
ptr_file = fopen("searchWords.txt", "r");
if(!ptr_file)
return -1;
while( fgets(words, 100, ptr_file)!=NULL )
{
int wordCount = 0;
char key[100] = &*words;
wordCount = countWord("document.txt", words);
printf("%s = %d\n", words, wordCount);
}
fclose(ptr_file);
return 0;
}
fgets reads the \n too. That is the problem. To quote:
A newline character makes fgets stop reading, but it is considered a valid character by the function and included in the string copied to str.
To solve this, strip the trailing newline right after reading:
while( fgets(words, 100, ptr_file)!=NULL )
{
size_t len = strlen(words);
if (len > 0 && words[len-1] == '\n')
words[len-1] = '\0';
An immediate problem: fgets doesn't strip end-of-line from the string, so whatever you pass to countWord has an embedded newline.
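To see why the embedded newline produces a count of zero, here is a simplified in-memory stand-in for countWord (an assumption for illustration: it searches a string with strstr instead of reading a file, and it counts overlapping matches, which the original does not):

```c
#include <string.h>

/* Count occurrences of word inside text. An empty word counts as zero. */
int count_in_text(const char *text, const char *word)
{
    int count = 0;
    const char *p;

    if (*word == '\0')
        return 0;
    for (p = strstr(text, word); p != NULL; p = strstr(p + 1, word))
        count++;
    return count;
}
```

Searching "testing one, testing two" for "testing\n" finds nothing, because the document never contains the word followed by a newline at that spot; after stripping the newline from the key, the same search finds both occurrences.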

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set the delimiters:
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
static const int WORD = 1, DELIM = 2, BUFLEN = 1024;
int state = WORD, ptr = 0, c;
char buffer[BUFLEN], *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (strchr(digits, c)) {
if (WORD == state) {
buffer[ptr++] = c;
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use lexical analyser generators such as flex. They can do the job of defining state transitions and other parts of lexer generation for you.
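The same two-state idea, extended from digits to the letters-and-numbers words the question asks about, might look like the sketch below. It works on an in-memory string to keep it short; split_words and MAX_WORD are hypothetical names, and overlong words are simply truncated:

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD 32   /* hypothetical per-word capacity, including the '\0' */

enum state { IN_DELIM, IN_WORD };

/* Split text into alphanumeric words using a two-state machine.
   Returns the number of words stored in words[] (at most max_words). */
size_t split_words(const char *text, char words[][MAX_WORD], size_t max_words)
{
    enum state st = IN_DELIM;
    size_t count = 0, len = 0;

    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (isalnum(c)) {
            if (st == IN_DELIM) {         /* delimiter -> word: start a new word */
                if (count == max_words)
                    break;
                count++;
                len = 0;
                st = IN_WORD;
            }
            if (len < MAX_WORD - 1) {     /* characters that do not fit are dropped */
                words[count - 1][len++] = (char)c;
                words[count - 1][len] = '\0';
            }
        } else {
            st = IN_DELIM;                /* word -> delimiter: current word is done */
        }
    }
    return count;
}
```

Because every non-alphanumeric character takes the same branch, there is no need to list the hundreds of delimiters explicitly.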
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
while ((c = fgetc(file)) != EOF && i < (wordSize - 1) && isalnum(c))
{
word[i++] = c;
}
word[i] = 0;
return c != EOF;
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
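As a small illustration of the first point in this answer (test the read itself rather than feof), the loop from the question could be driven by the return value of fscanf. This is only a sketch; count_tokens is a made-up name and the 100-byte buffer is an assumption:

```c
#include <stdio.h>

/* Count whitespace-separated tokens in fp. The loop stops as soon as
   fscanf fails to convert a token, which covers both EOF and read errors,
   so no iteration runs on stale data. */
int count_tokens(FILE *fp)
{
    char token[100];
    int count = 0;

    while (fscanf(fp, "%99s", token) == 1)   /* width keeps token from overflowing */
        count++;
    return count;
}
```

The same pattern applies to fgets and fgetc: compare the result of the read call in the loop condition and use feof only afterwards, if you need to tell end of file from an error.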
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char * p = strtok(line, ","); // assuming a , delimiter
while(p)
{
printf("%s\n", p);
p = strtok(NULL, ",");
}

Parsing text in C

I have a file like this:
...
words 13
more words 21
even more words 4
...
(General format is a string of non-digits, then a space, then any number of digits and a newline)
and I'd like to parse every line, putting the words into one field of the structure, and the number into the other. Right now I am using an ugly hack of reading the line while the chars are not numbers, then reading the rest. I believe there's a clearer way.
Edit: You can use pNum-buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++.
int len = pNum-buf; // pNum still points at the last space here
strncpy(newBuf, buf, len);
newBuf[len] = '\0';
You could read the entire line into a buffer and then use:
char *pNum;
if (pNum = strrchr(buf, ' ')) {
pNum++;
}
to get a pointer to the number field.
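Putting the strrchr idea together with a conversion step, a self-contained sketch could look like this. The function name split_last_field and its parameters are hypothetical, and it assumes the trailing newline has already been removed from buf:

```c
#include <stdlib.h>
#include <string.h>

/* Split "some words 13" at its last space.
   words receives the text before that space, *number the value after it.
   Returns 1 on success, 0 if the line is malformed or words is too small. */
int split_last_field(const char *buf, char *words, size_t wordsSize, long *number)
{
    const char *pNum = strrchr(buf, ' ');
    char *end;
    size_t len;

    if (pNum == NULL)
        return 0;                        /* no space at all */
    len = (size_t)(pNum - buf);          /* length of the part before the last space */
    if (len >= wordsSize)
        return 0;                        /* would not fit, including the '\0' */
    *number = strtol(pNum + 1, &end, 10);
    if (end == pNum + 1 || *end != '\0')
        return 0;                        /* no digits, or trailing junk */
    memcpy(words, buf, len);
    words[len] = '\0';
    return 1;
}
```

Unlike the in-place variants, this leaves the input line untouched and copies the word part into a separate buffer.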
fscanf(file, "%s %d", word, &value);
This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.
Edit
Ooops, I forgot that you had spaces between the words.
In that case, I'd do the following. (Note that it truncates the original text in 'line')
// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while(*p != '\0')
{
if (*p == ' ')
lastSpace = p;
p++;
}
if (lastSpace == NULL)
return("parse error");
// Replace the last space in the line with a NUL
*lastSpace = '\0';
// Advance past the NUL to the first character of the number field
lastSpace++;
char *word = line;
int number = atoi(lastSpace);
You can solve this using stdlib functions, but the above is likely to be more efficient as you're only searching for the characters you are interested in.
Given the description, I think I'd use a variant of this (now tested) C99 code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
struct word_number
{
char word[128];
long number;
};
int read_word_number(FILE *fp, struct word_number *wnp)
{
char buffer[140];
if (fgets(buffer, sizeof(buffer), fp) == 0)
return EOF;
size_t len = strlen(buffer);
if (buffer[len-1] != '\n') // Error if line too long to fit
return EOF;
buffer[--len] = '\0';
char *num = &buffer[len-1];
while (num > buffer && !isspace((unsigned char)*num))
num--;
if (num == buffer) // No space in input data
return EOF;
char *end;
wnp->number = strtol(num+1, &end, 0);
if (*end != '\0') // Invalid number as last word on line
return EOF;
*num = '\0';
if (num - buffer >= sizeof(wnp->word)) // Non-number part too long
return EOF;
memcpy(wnp->word, buffer, num - buffer + 1); // +1 copies the terminating '\0' written above
return(0);
}
int main(void)
{
struct word_number wn;
while (read_word_number(stdin, &wn) != EOF)
printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
return(0);
}
You could improve the error reporting by returning different values for different problems.
You could make it work with dynamically allocated memory for the word portion of the lines.
You could make it work with longer lines than I allow.
You could scan backwards over digits instead of non-spaces - but this allows the user to write "abc 0x123" and the hex value is handled correctly.
You might prefer to ensure there are no digits in the word part; this code does not care.
You could try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string - just look at the first character of the token).
Assuming that the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf("%d") to get the number, and then calculate how many characters that number occupies at the end of the text string.
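One way to make the sscanf idea workable is a scanset that stops at the first digit. This is a variation on the suggestion above rather than exactly what it describes, and it relies on the stated format (a run of non-digits, a space, then unsigned digits). The name scan_words_number and the size WORDS_MAX are assumptions for this sketch:

```c
#include <stdio.h>
#include <string.h>

#define WORDS_MAX 128   /* hypothetical buffer size; the 127 in the format must match */

/* Parse "more words 21" into words ("more words") and *value (21).
   Returns 1 if both fields were converted, 0 otherwise. */
int scan_words_number(const char *line, char words[WORDS_MAX], int *value)
{
    size_t len;

    /* %127[^0-9] takes everything up to the first digit; %d reads the number. */
    if (sscanf(line, "%127[^0-9]%d", words, value) != 2)
        return 0;

    /* The scanset also swallowed the separating space(s); trim them. */
    len = strlen(words);
    while (len > 0 && words[len - 1] == ' ')
        words[--len] = '\0';
    return 1;
}
```

This breaks down if the word part may itself contain digits, in which case scanning for the last space is the safer approach.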
Depending on how complex your strings become you may want to use the PCRE library. At least that way you can compile a perl'ish regular expression to split your lines. It may be overkill though.
Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:
#include <stdlib.h>
#include <stdio.h>
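#include <string.h>
/* Illustrative limits (assumed values; adjust to your data) */
#define MAX_LINE_LENGTH 256
#define MAX_STR_SIZE 64
#define MAX_NUM_STRINGS 16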
/**
* Read the next line from the file, splitting the tokens into
* multiple strings and a single integer. Assumes input lines
* never exceed MAX_LINE_LENGTH and each individual string never
* exceeds MAX_STR_SIZE. Otherwise things get a little more
* interesting. Also assumes that the integer is the last
* thing on each line.
*/
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
char buffer[MAX_LINE_LENGTH];
int rval = 1;
if (fgets(buffer, sizeof buffer, in))
{
char *token = strtok(buffer, " ");
*numStrings = 0;
while (token)
{
char *chk;
*value = (int) strtol(token, &chk, 10);
if (*chk != 0 && *chk != '\n')
{
strcpy(strs[(*numStrings)++], token);
}
token = strtok(NULL, " ");
}
}
else
{
/**
* fgets() hit either EOF or error; either way return 0
*/
rval = 0;
}
return rval;
}
/**
* sample main
*/
int main(void)
{
FILE *input;
char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
int numStrings;
int value;
input = fopen("datafile.txt", "r");
if (input)
{
while (getNextLine(input, strings, &numStrings, &value))
{
/**
* Do something with strings and value here
*/
}
fclose(input);
}
return 0;
}

Resources