code in C being killed when reading a 250MB file - c

I am trying to process a 250MB file using a script in C.
The file is basically a dataset and I want to read just some of the columns and (more importantly) break one of them (which is originally a string) into a sequence of characters.
However, even though I have plenty of RAM available, the code is killed by Konsole (on KDE Neon) every time I run it.
The source is available below:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
FILE *arquivo;
char *line = NULL;
size_t len = 0;
int i = 0;
int j;
int k;
char *vetor[500];
int acertos[45];
FILE *licmat = fopen("licmat.csv", "w");
//creating the header
fprintf(licmat,"CO_CATEGAD,CO_UF_CURSO,ACERTO09,ACERTO10,ACERTO11,ACERTO12,ACERTO13,ACERTO14,ACERTO15,ACERTO16,ACERTO17,ACERTO18,ACERTO19,ACERTO20,ACERTO21,ACERTO22,ACERTO23,ACERTO24,ACERTO25,ACERTO26,ACERTO27,ACERTO28,ACERTO29,ACERTO30,ACERTO31,ACERTO32,ACERTO33,ACERTO34,ACERTO35\n");
if ((arquivo = fopen("MICRODADOS_ENADE_2017.csv", "r")) == NULL) {
printf ("\nError");
exit(0);
}
//reading one line at a time
while (getline(&line, &len, arquivo)) {
char *ptr = strsep(&line,";");
j=0;
//breaking the line into a vector based on ;
while(ptr != NULL)
{
vetor[j]=ptr;
j=j+1;
ptr = strsep(&line,";");
}
//filtering based on content
if (strcmp(vetor[4],"702")==0 && strcmp(vetor[33],"555")==0) {
//copying some info
fprintf(licmat,"%s,%s,",vetor[2],vetor[8]);
//breaking the string (32) into isolated characters
for (k=0;k<27;k=k+1) {
fprintf(licmat,"%c", vetor[32][k]);
if (k<26) {
fprintf(licmat,",");
}
}
fprintf(licmat,"\n");
}
i=i+1;
}
free(line);
fclose(arquivo);
fclose(licmat);
}
The output is perfect up to the point when the script is killed. The output file is just 640KB long and has about 10000 lines only.
What could be the issue?

It looks to me like you're mishandling the memory buffer managed by getline() - which allocates/reallocates as needed - by the use of strsep(), which seems to manipulate that same pointer value.
Once line has been updated to reflect some other element on the line, it's no longer pointing to the start of allocated memory, and then boom the next time getline() needs to do anything with it.
Use a different variable to pass to strsep():
while (getline(&line, &len, arquivo) > 0) { // getline() returns -1 at EOF or on error
char *parseline = line;
char *ptr = strsep(&parseline,";");
// do the same thing later
The key thing here: you're not allowed to muck with the value of line other than to free() it at the end (which you do), and you can't let any other routine do it either.
Edit: updated to reflect getline() returning <0 on error (h/t to #user3121023)
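A minimal sketch of the whole loop with that fix applied, assuming glibc or another libc that provides getline() and strsep(). The function name count_matching and its col/want parameters are made up for illustration and are not from the question. It also checks how many fields were actually found before indexing into the array, since a short line would otherwise leave stale pointers in it:

```c
#define _GNU_SOURCE   /* assumption: glibc; exposes getline() and strsep() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 500

/* Counts the lines of `in` whose ';'-separated field number `col`
 * equals `want`. Hypothetical helper, for illustration only. */
long count_matching(FILE *in, size_t col, const char *want)
{
    char *line = NULL;          /* owned by getline(); we never change it */
    size_t cap = 0;
    char *fields[MAX_FIELDS];
    long matches = 0;

    while (getline(&line, &cap, in) != -1) {
        line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending */

        char *cursor = line;    /* strsep() advances this copy, not `line` */
        size_t n = 0;
        char *tok;
        while (n < MAX_FIELDS && (tok = strsep(&cursor, ";")) != NULL)
            fields[n++] = tok;

        /* Only look at a field that exists on this line. */
        if (n > col && strcmp(fields[col], want) == 0)
            matches++;
    }
    free(line);                 /* still the pointer getline() allocated */
    return matches;
}
```

Because `line` is only ever touched by getline() and free(), the buffer can be reused and grown safely for every line of the file.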

Related

Trying to read an unknown string length from a file using fgetc()

So yeah, I've seen many similar questions to this one, but thought I'd try solving it my way. It compiles fine, but I get huge blocks of garbage text when I run it.
I'm trying to read a string of unknown size from a file. My idea was to allocate pts with a size of 2 (one char plus the null terminator) and then grow the char array for every char that exceeds its size.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main()
{
char *pts = NULL;
int temp = 0;
pts = malloc(2 * sizeof(char));
FILE *fp = fopen("txtfile", "r");
while (fgetc(fp) != EOF) {
if (strlen(pts) == temp) {
pts = realloc(pts, sizeof(char));
}
pts[temp] = fgetc(fp);
temp++;
}
printf("the full string is a s follows : %s\n", pts);
free(pts);
fclose(fp);
return 0;
}
You probably want something like this:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define CHUNK_SIZE 1000 // initial buffer size
int main()
{
int ch; // you need int, not char for EOF
int size = CHUNK_SIZE;
char *pts = malloc(CHUNK_SIZE);
FILE* fp = fopen("txtfile", "r");
int i = 0;
while ((ch = fgetc(fp)) != EOF) // read one char until EOF
{
pts[i++] = ch; // add char into buffer
if (i == size) // if buffer full ...
{
size += CHUNK_SIZE; // increase buffer size
pts = realloc(pts, size); // reallocate new size
}
}
pts[i] = 0; // add NUL terminator
printf("the full string is as follows : %s\n", pts);
free(pts);
fclose(fp);
return 0;
}
Disclaimers:
this is untested code, it may not work, but it shows the idea
there is absolutely no error checking for brevity, you should add this.
there is room for other improvements, it can probably be done even more elegantly
Leaving aside for now the question of if you should do this at all:
You're pretty close with this solution, but there are a few mistakes:
while (fgetc(fp) != EOF) {
This line reads one char from the file and then discards it after comparing it against EOF. You'll need to save that byte to add to your buffer. Something like while ((tmp = fgetc(fp)) != EOF) should work, with tmp declared as an int so that EOF can be told apart from a valid byte.
pts = realloc(pts, sizeof(char));
Check the documentation for realloc: the second parameter is the new total size in bytes, not the amount to grow by.
pts = malloc(2 * sizeof(char));
You'll need to zero this memory after acquiring it. You probably also want to zero any memory given to you by realloc, or you may lose the null off the end of your string and strlen will be incorrect.
But as I alluded to earlier, using realloc in a loop like this when you've got a fair idea of the size of the buffer already is generally going to be non-idiomatic C design. Get the size of the file ahead of time and allocate enough space for all the data in your buffer. You can still realloc if you go over the size of the buffer, but do so using chunks of memory instead of one byte at a time.
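The three fixes above, plus growing in chunks, can be sketched as follows. This is an illustration under my own assumptions rather than a drop-in replacement: the name read_stream, the starting capacity of 64 and the doubling strategy are all arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the rest of `fp` into a NUL-terminated heap buffer.
 * Returns NULL if an allocation fails; the caller frees the result.
 * Hypothetical helper, for illustration only. */
char *read_stream(FILE *fp, size_t *out_len)
{
    size_t cap = 64;            /* arbitrary starting capacity */
    size_t len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    int ch;                     /* int, so EOF differs from every valid byte */
    while ((ch = fgetc(fp)) != EOF) {
        if (len + 1 == cap) {   /* keep one slot free for the terminator */
            size_t new_cap = cap * 2;
            char *tmp = realloc(buf, new_cap);  /* new TOTAL size */
            if (tmp == NULL) {
                free(buf);      /* realloc failure leaves buf valid */
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = (char)ch;
    }
    buf[len] = '\0';            /* terminate explicitly, no zeroing needed */
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

Tracking the length in a variable also removes the need to call strlen() on a buffer that is not yet terminated.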
Probably the most efficient way (as mentioned in the comment by Fiddling Bits) is to read the whole file in one go, after first getting the file's size:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/stat.h>
int main()
{
size_t nchars = 0; // Declare here and set to zero...
// ... so we can optionally try using the "stat" function, if the O/S supports it...
struct stat st;
if (stat("txtfile", &st) == 0) nchars = st.st_size;
FILE* fp = fopen("txtfile", "rb"); // Make sure we open in BINARY mode!
if (nchars == 0) // This code will be used if the "stat" function is unavailable or failed ...
{
fseek(fp, 0, SEEK_END); // Go to end of file (NOTE: SEEK_END may not be implemented - but PROBABLY is!)
// while (fgetc(fp) != EOF) {} // If your system doesn't implement SEEK_END, you can do this instead:
nchars = (size_t)(ftell(fp)); // File size in bytes; the extra byte for the NUL terminator is added in the calloc call
}
char* pts = calloc(nchars + 1, sizeof(char));
if (pts != NULL)
{
fseek(fp, 0, SEEK_SET); // Return to start of file...
fread(pts, sizeof(char), nchars, fp); // ... and read one great big chunk!
printf("the full string is as follows : %s\n", pts);
free(pts);
}
else
{
printf("the file is too big for me to handle (%zu bytes)!", nchars);
}
fclose(fp);
return 0;
}
On the issue of the use of SEEK_END, see this cppreference page, where it states:
Library implementations are allowed to not meaningfully support SEEK_END (therefore, code using it has no real standard portability).
On whether or not you will be able to use the stat function, see this Wikipedia page. (But it is now available in MSVC on Windows!)

String / char * concatenate, C

I'm trying to open a file (Myfile.txt) and concatenate each line into a single buffer, but I'm getting unexpected output. The problem is that my buffer is not getting updated with the latest concatenated lines. Is anything missing in my code?
Myfile.txt (The file to open and read)
Good morning line-001:
Good morning line-002:
Good morning line-003:
Good morning line-004:
Good morning line-005:
.
.
.
Mycode.c
#include <stdio.h>
#include <string.h>
int main(int argc, const char * argv[])
{
/* Define a temporary variable */
char Mybuff[100]; // (i dont want to fix this size, any option?)
char *line = NULL;
size_t len=0;
FILE *fp;
fp =fopen("Myfile.txt","r");
if(fp==NULL)
{
printf("the file couldn't exist\n");
return;
}
while (getline(&line, &len, fp) != -1 )
{
//Any function to concatinate the strings, here the "line"
strcat(Mybuff,line);
}
fclose(fp);
printf("Mybuff is: [%s]\n", Mybuff);
return 0;
}
I'm expecting my output to be:
Mybuff is: [Good morning line-001:Good morning line-002:Good morning line-003:Good morning line-004:Good morning line-005:]
But I'm getting a segmentation fault (a run-time error) and garbage values. Is there anything I can do? Thanks.
Specify MyBuff as a pointer, and use dynamic memory allocation.
#include <stdlib.h> /* for dynamic memory allocation functions */
char *MyBuff = calloc(1,1); /* allocate one character, initialised to zero */
size_t length = 1;
while (getline(&line, &len, fp) != -1 )
{
size_t newlength = length + strlen(line); /* length already counts the terminator */
char *temp = realloc(MyBuff, newlength);
if (temp == NULL)
{
/* Allocation failed. Have a tantrum or take recovery action */
}
else
{
MyBuff = temp;
length = newlength;
strcat(MyBuff, line); /* append the line just read */
}
}
/* Do whatever is needed with MyBuff */
free(MyBuff);
/* Also, don't forget to release memory allocated by getline() */
The above will leave newlines in MyBuff for each line read by getline(). I'll leave removing those as an exercise.
Note: getline() is POSIX (originally a GNU extension), not standard C. A function like fgets() is available in standard C for reading lines from a file, although it doesn't allocate memory like getline() does.
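One possible take on that exercise, as a sketch only: it assumes a POSIX system for getline(), and the name join_lines is my own. It tracks the used length itself and appends with memcpy() instead of strcat(), so the growing buffer is not rescanned from the start on every line.

```c
#define _POSIX_C_SOURCE 200809L   /* assumption: POSIX system, for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Joins every line of `fp` into one heap string, without the line endings.
 * Returns NULL if an allocation fails; the caller frees the result.
 * Hypothetical helper, for illustration only. */
char *join_lines(FILE *fp)
{
    char *joined = calloc(1, 1);    /* starts as the empty string */
    size_t used = 0;                /* length, excluding the terminator */
    char *line = NULL;
    size_t cap = 0;

    if (joined == NULL)
        return NULL;

    while (getline(&line, &cap, fp) != -1) {
        size_t n = strcspn(line, "\r\n");   /* length without the line ending */
        char *tmp = realloc(joined, used + n + 1);
        if (tmp == NULL) {
            free(joined);
            joined = NULL;
            break;
        }
        joined = tmp;
        memcpy(joined + used, line, n);     /* append at the known end */
        used += n;
        joined[used] = '\0';
    }
    free(line);                     /* release getline()'s buffer */
    return joined;
}
```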

Data entry into array of character pointers in C

this is my first question asked on here so if I'm not following the formatting rules here please forgive me. I am writing a program in C which requires me to read a few lines from a file. I am attempting to put each line into a cstring. I have declared a 2D character array called buf which is to hold each of the 5 lines from the file. The relevant code is shown below
#include <stdlib.h>
#include <sys/types.h>
#include <sys/file.h>
#include <sys/socket.h>
#include <sys/un.h> /* UNIX domain header */
void FillBuffersForSender();
char buf[5][2000]; //Buffer for 5 frames of output
int main()
{
FillBuffersForSender();
return 0;
}
void FillBuffersForSender(){
FILE *fp;
int line = 0;
char* temp = NULL;
size_t len = 0;
ssize_t read;
fp = fopen("frames.txt", "r");
printf("At the beginning of Fill Buffers loop.\n");
//while ((read = getline(&temp, &len, fp)) != -1){
while(line < 5){
//fprintf(stderr, "Read in: %s\n", temp);
fgets(temp, 2000, fp);
strcpy(buf[line], temp);
line++;
fprintf(stderr, "Line contains: %s.\n", temp);
temp = NULL;
}
while(line != 0){
fprintf(stderr, "Line contains: %s.\n", buf[line]);
line--;
}
}
The line
strcpy(buf[line], temp);
is causing a segmentation fault. I have tried this numerous ways, and cannot seem to get it to work. I am not used to C, but have been tasked with writing a bidirectional sliding window protocol in it. I keep having problems with super basic issues like this! If this were in C++, I'd be done already. Any help anyone could provide would be incredible. Thank you.
temp needs to point to an allocated buffer that fgets can write into.
In C programming, error checking is an important part of every program (in fact sometimes it seems like there's more error handling code than functional code). The code should check the return value from every function to make sure that it worked, e.g. if fopen returns NULL then it wasn't able to open the file, likewise if fgets returns NULL it wasn't able to read a line.
Also, the code needs to clean up after itself. For example, there is no destructor that closes a file when the file pointer goes out of scope, so the code needs to call fclose explicitly to close the file when it's finished with the file.
Finally, note that many of the C library functions have quirks that need to be understood, and properly handled. You can learn about these quirks by reading the man pages for the functions. For example, the fgets function will leave the newline character \n at the end of each line that it reads. But the last line of a file may not have a newline character. So when using fgets, it's good practice to strip the newline.
With all that in mind, the code should look like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 5
#define MAXLENGTH 2000
static char buffer[MAXLINE][MAXLENGTH];
void FillBufferForSender(void)
{
char *filename = "frames.txt";
FILE *fp;
if ((fp = fopen(filename, "r")) == NULL)
{
printf("file '%s' does not exist\n", filename);
exit(1);
}
for (int i = 0; i < MAXLINE; i++)
{
// read a line
if (fgets( buffer[i], MAXLENGTH, fp ) == NULL)
{
printf("file does not have %d lines\n", MAXLINE);
exit(1);
}
// strip the newline, if any
size_t newline = strcspn(buffer[i], "\n");
buffer[i][newline] = '\0';
}
fclose(fp);
}
int main(void)
{
FillBufferForSender();
for (int i = 0; i < MAXLINE; i++)
printf("%s\n", buffer[i]);
}
Note: for an explanation of how strcspn is used to strip the newline, see this answer.
When it comes to C you have to think about the memory. Where is the memory for a pointer that has NULL assigned to it? How can we copy something to somewhere that we have no space for?

Reading data from a text file in C?

So I'm pretty new at reading data from a text file in C. I'm used to getting input using scanf or hard coding.
I am trying to learn how to not only read data from a text file but manipulate that data. For example, say a text file called bst.txt had the following information used to perform operations on a binary search tree:
insert 10
insert 13
insert 5
insert 7
insert 20
delete 5
delete 10
....
With that example, I would have the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
FILE *fptr;
char *charptr;
char temp[50];
fptr = fopen("bst.txt", "r");
while(fgets(temp, 50, fptr) != NULL)
{
charptr = strtok(temp, " ");
while(charptr != NULL)
{
charptr = strtok(NULL, " ");
}
}
return 0;
}
I know that within the first while loop strtok() splits each line in the text file and within the second while loop strtok() splits off when the program recognizes a space, which in this case would separate the operations from the integers.
So my main question is, after, for example, the word "insert" is separated from the integer "10", how do I get the program to continue like this:
if(_____ == "insert")
{
//read integer from input file and call insert function, i.e. insert(10);
}
I need to fill in the blank.
Any help would be greatly appreciated!
If I were doing what you're doing, I would be doing it that way :)
I see a lot of people getting upvoted (not here, I mean on SO generally) for recommending that people use functions like scanf() and strtok() despite the fact that these functions are uniformly considered evil, not just because they're not thread-safe, but because they modify their arguments in ways that are hard to predict, and are a giant pain in the ass to debug.
If you're malloc()ing an input buffer for reading from a file, always make it at least 4kB — that's the smallest page the kernel can give you anyway, so unless you're doing a bazillion stupid little 100-byte malloc()s, you might as well — and don't be afraid to allocate 10x or 100x that if that makes life easy.
So, for these kinds of problems where you're dealing with little text files of input data, here's what you do:
malloc() yourself a fine big buffer that's big enough to slurp in the whole file with buckets and buckets of headroom
open the file, slurp the whole damn thing in with read(), and close it
record how many bytes you read in n_chars (or whatever)
do one pass through the buffer and 1) replace all the newlines with NULs and 2) record the start of each line (occurs after a newline!) into successive positions in a lines array (e.g. char **lines; lines=malloc(n_chars*sizeof(char *)): there can't be more lines than bytes!)
(optional) as you go, advance your start-of-line pointers to skip leading whitespace
(optional) as you go, overwrite trailing whitespace with NULs
keep a count of the lines as you go and save it in n_lines
remember to free() that buffer when you're done with it
Now, what do you have? You have an array of strings that are the lines of your file (optionally with each line stripped of leading and trailing whitespace) and you can do what the hell you like with it.
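The splitting pass from the steps above might look roughly like this. It is a sketch under stated assumptions: the name split_lines is mine, the buffer is assumed to have already been filled (with read(), fread() or anything else), and it must have one spare writable byte at buf[n_chars] for the final terminator. The optional whitespace trimming is left out.

```c
#include <stdlib.h>

/* Splits `buf` in place: every '\n' becomes '\0' and the start of each
 * line is recorded. `buf` must have room for n_chars + 1 bytes.
 * Returns a malloc'ed array of line pointers (caller frees it), or NULL.
 * Hypothetical helper, for illustration only. */
char **split_lines(char *buf, size_t n_chars, size_t *n_lines)
{
    /* There can't be more lines than bytes; +1 covers an empty buffer. */
    char **lines = malloc((n_chars + 1) * sizeof *lines);
    if (lines == NULL)
        return NULL;

    size_t count = 0;
    char *start = buf;
    buf[n_chars] = '\0';            /* terminate a last line with no newline */

    for (size_t i = 0; i < n_chars; i++) {
        if (buf[i] == '\n') {
            buf[i] = '\0';          /* end the current line */
            lines[count++] = start;
            start = buf + i + 1;    /* next line starts after the newline */
        }
    }
    if (start < buf + n_chars)      /* trailing text without a newline */
        lines[count++] = start;

    *n_lines = count;
    return lines;
}
```

The pointers in the returned array point into buf, so free the array first and the buffer only once you are done with every line.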
So what do you do?
Go through the array of lines one-by-one, like this:
for(i=0; i<n_lines; i++) {
if( '\0'==*lines[i] || '#' == *lines[i] )
continue;
// More code
}
Already you have ignored empty lines and lines that start with a "#". Your config file now has comments!
long n;
int len;
char *endp; // strtol() sets this to the first character after the number
for(i=0; i<n_lines; i++) {
if( '\0'==*lines[i] || '#' == *lines[i] )
continue;
// More code
len = strlen("insert");
if( 0== strncmp(lines[i], "insert", len) ) {
n = strtol(lines[i]+len+1, &endp, 10);
// error checking
tree_insert( (int)n );
continue;
}
len = strlen("delete");
if( 0== strncmp(lines[i], "delete", len) ) {
n = strtol(lines[i]+len+1, &endp, 10);
// error checking
tree_delete( (int)n );
}
}
Now, you can probably see 10 ways of making this code better. Me too. How about a struct that contains a keyword and a function pointer to the appropriate tree function?
Other ideas? Knock yourself out!
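For what it's worth, the struct-plus-function-pointer idea could be sketched like this. The tree_insert and tree_delete bodies here are placeholders that only count calls (the real ones would touch your tree), and dispatch_line is a name I made up:

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder tree functions: they only count calls, for illustration. */
static int n_inserts, n_deletes;
static void tree_insert(int value) { (void)value; n_inserts++; }
static void tree_delete(int value) { (void)value; n_deletes++; }

/* One table entry per keyword. */
struct command {
    const char *keyword;
    void (*handler)(int value);
};

static const struct command commands[] = {
    { "insert", tree_insert },
    { "delete", tree_delete },
};

/* Returns 1 if the line matched a keyword and carried a number, else 0. */
int dispatch_line(const char *line)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        size_t len = strlen(commands[i].keyword);
        if (strncmp(line, commands[i].keyword, len) != 0 || line[len] != ' ')
            continue;               /* not this keyword */

        char *endp;
        long n = strtol(line + len + 1, &endp, 10);
        if (endp == line + len + 1) /* no digits after the keyword */
            return 0;

        commands[i].handler((int)n);
        return 1;
    }
    return 0;                       /* unknown keyword */
}
```

Adding a new operation is then a one-line change to the table instead of another if block in the loop.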
You can call them as follows. For example, I have used printf here, but you can replace it with your insert/delete functions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
FILE *fptr;
char *charptr;
char temp[50];
fptr = fopen("bst.txt", "r");
while(fgets(temp, 50, fptr) != NULL)
{
charptr = strtok(temp, " ");
if(strcmp(charptr,"insert")==0)
{
charptr = strtok(NULL, " ");
printf("insert num %d\n",atoi(charptr));
}
else if(strcmp(charptr,"delete")==0)
{
charptr = strtok(NULL, " ");
printf("delete num %d\n",atoi(charptr));
}
}
return 0;
}
I think the best way to read formatted strings from a file is fscanf. The following example shows how to parse the file; you could store charptr and value for further operations:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
FILE *fptr;
char charptr[50];
int value;
fptr = fopen("bst.txt", "r");
while (fscanf(fptr, "%49s%d", charptr, &value) == 2) // both fields must be read; %49s keeps the word inside charptr[50]
{
printf("%s: %d\n", charptr, value);
}
return 0;
}
Try this code:
#include <stdio.h>
#include <string.h>
int main(){
FILE *fp;
char character[50];
int value;
fp = fopen("input.txt", "r");
while (fscanf(fp, "%49s%d", character, &value) == 2)
{
if(strcmp(character,"insert")==0){
insert(value); // call your own function here; value holds the number read from the file (10 for the first line)
}
}
return 0;
}

fgets() not reading from a text file?

I have a function loadsets() (short for load settings) which is supposed to load settings from a text file named Progsets.txt. loadsets() returns 0 on success, and -1 when a fatal error is detected. However, the part of the code which actually reads from Progsets.txt (the three fgets() calls) seems to fail every time and return the null pointer, hence not loading anything at all but a bunch of nulls. Is there something wrong with my code? fp is a valid pointer when I ran the code, and I was able to open it for reading. So what's wrong?
This code is for loading the default text color of my very basic text editor program using cmd.
headers:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <Windows.h>
#define ARR_SIZE 100
struct FINSETS
{
char color[ARR_SIZE + 1];
char title[ARR_SIZE + 1];
char maxchars[ARR_SIZE + 1];
} SETTINGS;
loadsets():
int loadsets(int* pMAXCHARS) // load settings from a text file
{
FILE *fp = fopen("C:\\Typify\\Settings (do not modify)\\Progsets.txt", "r");
char *color = (char*) malloc(sizeof(char*) * ARR_SIZE);
char *title = (char*) malloc(sizeof(char*) * ARR_SIZE);
char *maxchars = (char*) malloc(sizeof(char*) * ARR_SIZE);
char com1[ARR_SIZE + 1] = "color ";
char com2[ARR_SIZE + 1] = "title ";
int i = 0;
int j = 0;
int k = 0;
int found = 0;
while (k < ARR_SIZE + 1) // fill strings with '\0'
{
color[k] = title[k] = maxchars[k] = '\0';
SETTINGS.color[k] = SETTINGS.maxchars[k] = SETTINGS.title[k] = '\0';
k++;
}
if (!fp) // check for reading errors
{
fprintf(stderr, "Error: Unable to load settings. Make sure that Progsets.txt exists and has not been modified.\a\n\n");
return -1; // fatal error
}
if (!size(fp)) // see if Progsets.txt is not a zero-byte file (it shouldn't be)
{
fprintf(stderr, "Error: Progsets.txt has been modified. Please copy the contents of Defsets.txt to Progsets.txt to manually reset to default settings.\a\n\n");
free(color);
free(title);
free(maxchars);
return -1; // fatal error
}
// PROBLEMATIC CODE:
fgets(color, ARR_SIZE, fp); // RETURNS NULL (INSTEAD OF READING FROM THE FILE)
fgets(title, ARR_SIZE, fp); // RETURNS NULL (INSTEAD OF READING FROM THE FILE)
fgets(maxchars, ARR_SIZE, fp); // RETURNS NULL (INSTEAD OF READING FROM THE FILE)
// END OF PROBLEMATIC CODE:
system(strcat(com1, SETTINGS.color)); // set color of cmd
system(strcat(com2, SETTINGS.title)); // set title of cmd
*pMAXCHARS = atoi(SETTINGS.maxchars);
// cleanup
fclose(fp);
free(color);
free(title);
free(maxchars);
return 0; // success
}
Progsets.txt:
COLOR=$0a;
TITLE=$Typify!;
MAXCHARS=$10000;
EDIT: Here is the definition of the size() function. Since I'm just working with ASCII text files, I assume that every character is one byte and the file size in bytes can be worked out by counting the number of characters. Anything suspicious?
size():
int size(FILE* fp)
{
int size = 0;
int c;
while ((c = fgetc(fp)) != EOF)
{
size++;
}
return size;
}
The problem lies in your use of the size() function. It repeatedly calls fgetc() on the file handle until it gets to the end of the file, incrementing a value to track the number of bytes in the file.
That's not a bad approach (though I'm sure there are better ones that don't involve inefficient character-based I/O) but it does have one fatal flaw that you seem to have overlooked.
After you've called it, you've read the file all the way to the end so that any further reads, such as:
fgets(color, ARR_SIZE, fp);
will simply fail since you're already at the end of the file. You may want to consider something like rewind() before returning from size() - that will put the file pointer back to the start of the file so that you can read it again.
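A minimal sketch of that change, with the function renamed to stream_size (my own name, to avoid confusion with the original) and returning long instead of int:

```c
#include <stdio.h>

/* Counts the bytes in `fp`, then puts the position back at the start
 * so that later reads see the file contents again. */
long stream_size(FILE *fp)
{
    long count = 0;
    int c;

    rewind(fp);                 /* count from the beginning */
    while ((c = fgetc(fp)) != EOF)
        count++;
    rewind(fp);                 /* back to the start; also clears the EOF flag */
    return count;
}
```

With this in place, the fgets() calls that follow the size check start reading from the first line of the file again.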
