Program not working for large files in C - c

I'm using the following program in C to filter a log file with about 200,000 lines. But the program stops responding after about 12000 lines. Any explanations why does this happen and any solution to it?
The code is compiled in GCC (windows).
PS: The code is executing properly and giving desired output for small files.
#include<stdio.h>
#include<string.h>
int check(char *url)
{
//some code to filter the data and return either 0 or 1 depending upon input
}
int main()
{
FILE *fpi, *fpo;
fpi=fopen("access.log","r");
fpo=fopen("edited\\filter.txt","w");
char date[11],time[9],ip[16],url[500],temp[3];
while(!feof(fpi))
{
printf(".");
fscanf(fpi," %s %s %s %s %s %s",date,time,temp,ip,temp,url);
if(check(url))
fprintf(fpo,"%s %s %s %s %s %s\n",date,time,temp,ip,temp,url);
}
fclose(fpi);
fclose(fpo);
printf("\n\n\nDONE! :)");
return 0;
}

It is possible that one of the lines in the input file contains a field that is larger than the string variable you pass to fscanf(). It might result in a buffer overflow, which later results in an infinite loop somewhere. Just a speculation. I suggest you delimit %s in the fscanf() format string with the maximum length of the output string variable.
For example, this will make sure that there are no buffer overflows and that the resulted strings terminate:
fscanf(fpi," %10s %8s %2s %15s %49s %2s", date, time, temp, ip, temp, url);
date[10] = '\0';
time[8] = '\0';
ip[15] = '\0';
temp[2] = '\0';
url[499] = '\0';
Also, you are reading temp twice. The latter read will override the former. Is this what you intended?
Another improvement, assuming that the input file is line-terminated, and each log is in a separate line, is to use fgets() in order to read a line and only then use sscanf() on the intermediate buffer. This way you ensure that no formatting errors extend beyond a single line. Also, sscanf returns the number of read items, in your case - 6. It's would be safer to check the return value.

Related

Write a program that takes a user's input and compares it to words in a file

I am in school and got an assignment to write a C program that takes an input from a user then scans a file and returns how many times that word shows up in a file. I feel like I got it 90% done, but for some reason I can't get the while loop. When I run the program it crashes at the while loop. Any help or guidance would be greatly appreciated.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <windows.h>
int main() {
char input[50], file[50], word[50];
int wordcount;
printf("Enter a string to search for\n");
scanf("%s", input);
printf("Enter a file location to open\n");
scanf("%s", file);
FILE * fp;
fp = fopen("%s", "r", file);
while (fscanf(fp, "%s", word) != EOF) {
if (strcmp(word, input)) {
printf("found the word %s\n", input);
wordcount++;
}
}
printf("The world %s shows up %d times\n", input, wordcount);
system("pause");
}
You have 2 problems:
fp = fopen("%s", "r", file);
is incorrect, fopen expects only two arguments, not three. The correct version
is
fp = fopen(file, "r");
Note that there is no feature in the C language that allows you to construct
strings from variables like this "%s", variable1. This only works for function
like printf that read a format and interpret the format base on a fix set of
rules you can see here.
The second problem is this:
if (strcmp(word, input))
strcmp is used to compared two strings, however it return 0 when the strings
are equal, non-zero otherwise. So the correct check should be
if(strcmp(word, input) == 0)
{
printf("found the word %s\n", input);
wordcount++;
}
One last thing: when you read a string with scanf, you should limit the amount
of characters to be read, otherwise you will overflow the buffer and this yield
undefined behaviour which could lead to a segfault.
input is a char[50], so it can hold at most 49 characters, the better
scanf call would be
scanf("%49s", input);
with this you are making sure not to write beyond the bounds of the array.
Fotenotes
1The string "%s" has no real meaning in the C language, like any
other string it is merly a sequence of characters that ends with the
'\0'-terminating character. The memory layout for this strings is
+---+---+----+
| % | s | \0 |
+---+---+----+
The printf family of functions however give certains sequences of characters
(the ones beginning with %) a well defined meaning. They're used to determine the type of the variable that should
be used when printing as well as other format options. See the printf documentation for more information about that. You have to
remember however, that this type of constructs only works with printf because
printf was design to work this way.
If you need to construct a string using values of other variables, then you need
to have an array with enough space and use a function like sprintf. For
example:
const char *base = "records";
int series = 8;
char fn[100];
sprintf(fn, "%s%d.dat", base, series);
// now fn has the string "records8.dat"
FILE *fp = fopen(fn, "r");
...
But in your case this is unnecessary because the whole filename was already
stored in variable file, so construction a new string based on file is not
needed.
You are trying to open a file named "%s", which I'm pretty sure does not exist. If you had checked the return from fopen, you could have figured it out yourself.

Won't read from file to struct

I've been sitting with this problem for 2 days and I can't figure out what I'm doing wrong. I've tried debugging (kind of? Still kind of new), followed this link: https://ericlippert.com/2014/03/05/how-to-debug-small-programs/ And I've tried Google and all kinds of things. Basically I'm reading from a file with this format:
R1 Fre 17/07/2015 18.00 FCN - SDR 0 - 2 3.211
and I have to make the program read this into a struct, but when I try printing the information it comes out all wrong. My code looks like this:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_INPUT 198
typedef struct game{
char weekday[4],
home_team[4],
away_team[4];
int round,
hour,
minute,
day,
month,
year,
home_goals,
away_goals,
spectators;}game;
game make_game(FILE *superliga);
int main(void){
int input_number,
number_of_games = 198,
i = 0;
game tied[MAX_INPUT];
FILE *superliga;
superliga = fopen("superliga-2015-2016.txt", "r");
for(i = 0; i < number_of_games; ++i){
tied[i] = make_game(superliga);
printf("R%d %s %d/%d/%d %d.%d %s - %s %d - %d %d\n",
tied[i].round, tied[i].weekday, tied[i].day, tied[i].month,
tied[i].year, tied[i].hour, tied[i].minute, tied[i].home_team,
tied[i].away_team, tied[i].home_goals, tied[i].away_goals,
tied[i].spectators);}
fclose(superliga);
return 0;
}
game make_game(FILE *superliga){
double spect;
struct game game_info;
fscanf(superliga, "R%d %s %d/%d/%d %d.%d %s - %s %d - %d %lf\n",
&game_info.round, game_info.weekday, &game_info.day, &game_info.month,
&game_info.year, &game_info.hour, &game_info.minute, game_info.home_team,
game_info.away_team, &game_info.home_goals, &game_info.away_goals,
&spect);
game_info.spectators = spect * 1000;
return game_info;
}
The problem is in your file. It starts with whitespaces, not with R's as you stated in the control string.
Check the return value of fscanf() and you'll see that it's zero everytime.
If you add a leading whitespace to your fscanf() call, your problem will be solved, like this:
fscanf(superliga, " R%d %s %d/%d/%d %d.%d %s - %s %d - %d %lf\n",
&game_info.round, game_info.weekday, &game_info.day, &game_info.month,
&game_info.year, &game_info.hour, &game_info.minute, game_info.home_team,
game_info.away_team, &game_info.home_goals, &game_info.away_goals,
&spect);
If each line in the file is a separate record, you should read each line as a string, then try to parse each string.
(Note that this also has the added feature of speculative parsing: you can try parsing the line in several different formats, and accept the one that parses correctly. I like to use this when I accept e.g. vector inputs, so that the user can use x y z, x, y, z, x/y/z, (x,y,z), [x,y,z], <x y z>, <x,y,z>, and so on, depending on what they like. It's only one additional scanf per format, after all.)
To read lines, you can use fgets() into a local buffer. The local buffer must be long enough. If the program is to run on POSIX.1 machines only (i.e., not on Windows), then you can use getline() instead, which can dynamically reallocate the given buffer as needed, so you're not limited to any specific line length.
To parse the string, use sscanf().
Note that all tabs, spaces, and newlines in the pattern in all of the scanf family of functions are treated exactly the same: they indicate any number of any type of whitespace. In other words, \n does not mean "and then a newline"; it means the same as a space, i.e. "and possibly some whitespace here". However, all conversions except %c and %[ automatically skip any leading whitespace; so, with the exception of a space before one of those two, the spaces in the pattern are only meaningful to us humans, they do not have any functional effect in the scanning.
All scanf family of functions return the number of successful conversions. (The only exception is the "conversion" %n, which yields the number of characters consumed; some implementations include it in the conversion count, and some others do not.) If end of input occurs prior to the first conversion, or a read error occurs, or the input does not match with the fixed part of the pattern, the functions will return EOF.
Even if you suppress saving the result of a conversion -- for example, if you have a word in the input you don't need, you can convert but discard it with %*s --, it is counted. So, for example sscanf(line, " %*d %*s %*d") returns 3 if the line starts with an integer, followed by a word (anything that is not a newline nor contains whitespace), followed by an integer.
Rather than have the function return the parsed structure, pass a pointer to the structure (and the file handle to read from), and return a status code. I prefer 0 for success, and nonzero for failure, but feel free to change that.
In other words, I'd suggest you change your read function into
#ifndef GAME_LINE_MAX
#define GAME_LINE_MAX 1022
#endif
int read_game(game *one, FILE *in)
{
char buffer[GAME_LINE_MAX + 2]; /* + '\n' + '\0' */
char *line;
/* Sanity check: no NULL pointers accepted! */
if (!one || !in)
return -1;
/* Paranoid check: Fail if read error has already occurred. */
if (ferror(in))
return -1;
/* Read the line */
line = fgets(buffer, sizeof buffer, in);
if (!line)
return -1;
/* Parse the game; pattern from OP's example: */
if (sscanf(line, "R%d %3s %d/%d/%d %d.%d %3s - %3s %d - %d %d\n",
&(one->round), one->weekday,
&(one->day), &(one->month), &(one->year),
&(one->hour), &(one->minute)
one->home_team,
one->away_team,
&(one->home_goals),
&(one->away_goals),
&(one->spectators)) < 12)
return -1; /* Line not formatted like above */
/* Spectators in the file are in units of 1000; convert: */
one->spectators *= 1000;
/* Success. */
return 0;
}
To use the above function in a loop, reading games one after another from standard input (stdin):
game g;
while (!read_game(&g, stdin)) {
/* Do something with current game stats, g */
}
if (ferror(stdin)) {
/* Read error occurred! */
} else
if (!feof(stdin)) {
/* Not all data was read/parsed! */
}
The two if clauses above are to check if there was a real read error (as in, a problem with the hardware or something like that), and whether there was unread/unparsed data (not at end of file), respectively.
There are two differences in the scanning pattern compared to the OP: First, all strings parsed are limited to 3 characters, because the structure has only room for 3+1 each. The one character is reserved for the end of string '\0', which is not counted in the maximum length for %s. Second, I parse the spectator count directly, and just multiply the field by 1000 if successful.
Also note how I used one->weekday, one->home_team, and one->away_team to refer to the character arrays. This works, because an array variable can be used as if it was a pointer to the first element in that array. (Given char a[5];, a and &a and &(a[0]) can all be used to refer to the first element in the array a). I like to use this "raw form" when scanning, because it makes it easier to match them to %s conversions, and ensure the pattern matches the parameters.

Storing String Inside a String?

My problem is when I try to save the string (series[0]) Inside (c[0])
and I display it, it always ignore the last digit.
For Example the value of (series[0]) = "1-620"
So I save this value inside (c[0])
and ask the program to display (c[0]), it displays "1-62" and ignores the last digit which is "0". How can I solve this?
This is my code:
#include <stdio.h>
int main(void)
{
int price[20],i=0,comic,j=0;
char name,id,book[20],els[20],*series[20],*c[20];
FILE *rent= fopen("read.txt","r");
while(!feof(rent))
{
fscanf(rent,"%s%s%s%d",&book[i],&els[i],&series[i],&price[i]);
printf("1.%s %s %s %d",&book[i],&els[i],&series[i],price[i]);
i++;
}
c[0]=series[0];
printf("\n%s",&c[0]);
return 0;
}
The use of fscanf and printf is wrong :
fscanf(rent,"%s%s%s%d",&book[i],&els[i],&series[i],&price[i]);
Should be:
fscanf(rent,"%c%c%s%d",&book[i],&els[i],series[i],&price[i]);
You have used the reference operator on a char pointer when scanf expecting a char pointer, also you read a string to book and else instead of one character.
printf("1.%s %s %s %d",&book[i],&els[i],&series[i],price[i]);
Should be:
printf("1.%c %c %s %d",book[i],els[i],series[i],price[i]);
And:
printf("\n%s",&c[0]);
Should be:
printf("\n%s",c[0]);
c is an array of char * so c[i] can point to a string and that is what you want to send to printf function.
*Keep in mind that you have to allocate (using malloc) a place in memory for all the strings you read before sending them to scanf:
e.g:
c[0] = (char*)malloc(sizeof(char)*lengthOfString+1);
and only after this you can read characters in to it.
or you can use a fixed size double character array:
c[10][20];
Now c is an array of 20 strings that can be up to 9 characters long.
Amongst other problems, at the end you have:
printf("\n%s",&c[0]);
There are multiple problems there. The serious one is that c[0] is a char *, so you're passing the address of a char * — a char ** — to printf() but the %s format expects a char *. The minor problem is that you should terminate lines of output with newline.
In general, you have a mess with your memory allocation. You haven't allocated space for char *series[20] pointers to point at, so you get undefined behaviour when you use it.
You need to make sure you've allocated enough space to store the data, and it is fairly clear that you have not done that. One minor difficulty is working out what the data looks like, but it seems to be a series of lines each with 3 words and 1 number. This code does that job a bit more reliably:
#include <stdio.h>
int main(void)
{
int price[20];
int i;
char book[20][32];
char els[20][32];
char series[20][20];
const char filename[] = "read.txt";
FILE *rent = fopen(filename, "r");
if (rent == 0)
{
fprintf(stderr, "Failed to open file '%s' for reading\n", filename);
return 1;
}
for (i = 0; i < 20; i++)
{
if (fscanf(rent, "%31s%31s%19s%d", book[i], els[i], series[i], &price[i]) != 4)
break;
printf("%d. %s %s %s %d\n", i, book[i], els[i], series[i], price[i]);
}
printf("%d titles read\n", i);
fclose(rent);
return 0;
}
There are endless ways this could be tweaked, but as written, it ensures no overflow of the buffers (by the counting loop and input conversion specifications including the length), detects when there is an I/O problem or EOF, and prints data with newlines at the end of the line. It checks and reports if it fails to open the file (including the name of the file — very important when the name isn't hard-coded and a good idea even when it is), and closes the file before exiting.
Since you didn't provide any data, I created some random data:
Tixrpsywuqpgdyc Yeiasuldknhxkghfpgvl 1-967 8944
Guxmuvtadlggwjvpwqpu Sosnaqwvrbvud 1-595 3536
Supdaltswctxrbaodmerben Oedxjwnwxlcvpwgwfiopmpavseirb 1-220 9698
Hujpaffaocnr Teagmuethvinxxvs 1-917 9742
Daojgyzfjwzvqjrpgp Vigudvipdlbjkqjm 1-424 4206
Sebuhzgsqpyidpquzjxswbccqbruqf Vuhssjvcjjylcevcisdzedkzlp 1-581 3451
Doeraxdmyqcbbzyp Litbetmttcgfldbhqqfdxqi 1-221 2485
Raqqctfdlhrmhtzusntvgbvotpk Iowdcqlwgljwlfvwhfmw 1-367 3505
Kooqkvabwemxoocjfaa Hicgkztiqvqdjjx 1-466 435
Lowywyzzkkrazfyjuggidsqfvzzqb Qiginniroivqymgseushahzlrywe 1-704 5514
The output from the code above on that data is:
0. Tixrpsywuqpgdyc Yeiasuldknhxkghfpgvl 1-967 8944
1. Guxmuvtadlggwjvpwqpu Sosnaqwvrbvud 1-595 3536
2. Supdaltswctxrbaodmerben Oedxjwnwxlcvpwgwfiopmpavseirb 1-220 9698
3. Hujpaffaocnr Teagmuethvinxxvs 1-917 9742
4. Daojgyzfjwzvqjrpgp Vigudvipdlbjkqjm 1-424 4206
5. Sebuhzgsqpyidpquzjxswbccqbruqf Vuhssjvcjjylcevcisdzedkzlp 1-581 3451
6. Doeraxdmyqcbbzyp Litbetmttcgfldbhqqfdxqi 1-221 2485
7. Raqqctfdlhrmhtzusntvgbvotpk Iowdcqlwgljwlfvwhfmw 1-367 3505
8. Kooqkvabwemxoocjfaa Hicgkztiqvqdjjx 1-466 435
9. Lowywyzzkkrazfyjuggidsqfvzzqb Qiginniroivqymgseushahzlrywe 1-704 5514
10 titles read

Reading values from CSV file into variables

I am trying to write a simple piece of code to read values from a CSV file with a max of 100 entries into an array of structs.
Example of a line of the CSV file:
1,Mr,James,Quigley,Director,200000,0
I use the following code to read in the values, but when I print out the values they are incorrect
for(i = 0; i < 3; i++) /*just assuming number of entries here to demonstrate problem*/
{
fscanf(f, "%d,%s,%s,%s,%s,%d,%d", &inArray[i].ID, inArray[i].salutation, inArray[i].firstName, inArray[i].surName, inArray[i].position, &inArray[i].sal, &inArray[i].deleted);
}
Then when I print out the first name, the values are all assigned to the first name:
for(j = 0; j < 3; j++) /* test by printing values*/
{
printf("Employee name is %s\n", inArray[j].firstName);
}
Gives ames,Quigley,Director,200000,0 and so on in that way. I am sure it's how i format the fscanf line but I can't get it to work.
Here is my struct I'm reading into:
typedef struct Employee
{
int ID;
char salutation[4];
char firstName[21];
char surName[31];
char position[16];
int sal;
int deleted;
} Employee;
This is because a string %s can contain the comma, so it gets scanned into the first string. There's no "look-ahead" in the scanf() formatting specifier, the fact that the %s is followed by a comma in the format specification string means nothing.
Use character groups (search the manual for [).
const int got = fscanf(f, "%d,%[^,],%[^,],%[^,],%[^,],%d,%d", &inArray[i].ID,
inArray[i].salutation, inArray[i].firstName,
inArray[i].surName, inArray[i].position, &inArray[i].sal,
&inArray[i].deleted);
And learn to check the return value, since I/O calls can fail! Don't depend on the data being valid unless got is 7.
To make your program read the entire file (multiple records, i.e. lines), I would recommend loading entire lines into a (large) fixed-size buffer with fgets(), then using sscanf() on that buffer to parse out the column values. That is much easier and will ensure that you really do scan separate lines, calling fscanf() in a loop will not, since to fscanf() a linefeed is just whitespace.
Might as well post my comment as an answer:
%s reads a full word by default.
It finds the %d, the integer part, then the ,, and then it has to read a string. , is considered valid in a word (it is not a whitespace), so it reads until the end of the line (there is no whitespace until then), not until the first comma... And the rest remains empty. (From this answer)
You have to change the separator with specifying a regex:
fscanf(f, "%d,%[^,],%[^,],%[^,],%[^,],%d,%d", &inArray[i].ID, inArray[i].salutation, inArray[i].firstName, inArray[i].surName, inArray[i].position, &inArray[i].sal, &inArray[i].deleted);
Instead of %s, use %[^,], which means "grab all chars, and stop when found a ,".
EDIT
%[^,]s is bad, it would need a literal s after the end of the scanset... Thanks #MichaelPotter
(From Changing the scanf() delimiter and Reading values from CSV file into variables )

sscanf function changes the content of another string

I am having problems reading strings with sscanf. I have dumbed down the code to focus on the problem. Below is a function in the whole code that is supposed to open a file and read something. But sscanf is acting strangely. For instance I declare a string called atm with the content 'ATOM'. Before the sscanf it prints this string as ATOM while after it is null. What could be the problem? I assume it must be an allocation problem but I could not find it. I tried some suggestions on other topics like replacing %s with other things but it did not help.
void Get (struct protein p, int mode, int type)
{
FILE *fd; //input file
char name[100]="1CMA"; //array for input file name
char string[600]; //the array where each line of the data file is stored when reading
char atm[100]="ATOM";
char begin[4];
int index1 =0;
fd = fopen(name, "r"); // open the input file
if(fd==NULL) {
printf("Error: can't open file.\n");
return 1;
}
if( type==0 ) { //pdb file type
if( mode==0 ) {
while( fgets(string, 600, fd)!=NULL ) {
printf("1 %s\n",atm);
sscanf (string, "%4s", begin );
printf("2 %s \n",atm);
}
}
}
fclose(fd);
free(fd);
free(name);
}
The string begin isn't big enough to hold the four characters that sscanf will read and its \0 terminator. If the \0 is written into atm (depending on where the strings are in memory), atm would be modified. From the sscanf manpage, about the s directive:
s    Matches a sequence of non-white-space characters; the next pointer must be a pointer to character array that is long enough to hold the input sequence and the terminating null byte ('\0'), which is added automatically. The input string stops at white space or at the maximum field width, whichever occurs first.
I was able to reproduce this behavior on my machine, though the exact positioning of the strings in memory was a bit different. By printing the addresses of the strings, though, it is easy to figure exactly what's happening. Here's a minimal example:
#include<stdio.h>
int main() {
char begin[2];
char atm[100]="ATOM";
printf("begin: %p\n", begin);
printf("begin+16: %p\n", begin+16);
printf("atom: %p\n", atm);
printf("1 %s\n",atm);
sscanf("AAAABBBBCCCCDDDD", "%16s", begin);
printf("2 %s \n",atm);
return 0;
}
This produces the output:
$ ./a.out
begin: 0x7fffffffe120
begin+16: 0x7fffffffe130
atom: 0x7fffffffe130
1 ATOM
2
I printed the values of the pointers to figure out how big a string it would take to overflow into atm. Since (on my machine) atom begins at begin+16, reading sixteen characters into begin puts a null terminator at begin+16, which is the first character of atm, so now atm has length 0.

Resources