C: efficient way to read a file of 20000000 lines

I'm trying to read a huge dataset of 20 million lines. Each line holds one large number (I'm storing the numbers in unsigned long long variables), for example: 1774251443, 8453058335, 19672843924, and so on.
I wrote a simple function to do this, shown below:
void read(char pathToDataset[], void **arrayToFill, int arrayLength) {
FILE *dataset = fopen(pathToDataset, "r");
if (dataset == NULL ) {
printf("Error while opening the file.\n");
exit(0); // exit failure, it closes the program
}
int i = 0;
/* Prof. suggestion: do a malloc RIGHT HERE, for allocate a
* space in memory in which store the element
* to insert in the array
*/
while (i < arrayLength && fscanf(dataset, "%llu", (unsigned long long *)&arrayToFill[i]) != EOF) {
// ONLY FOR DEBUG, it will print
//printf("line: %d.\n", i); 20ML of lines!
/* Prof. suggestion: do another malloc here,
* for each element to be read.
*/
i++;
}
printf("Read %d lines", i);
fclose(dataset);
}
The parameter arrayToFill is of type void** because of the exercise goal: every function has to work on a generic type, and the array could potentially be filled with any type of data (in this example, huge numbers, but it could contain long strings, integers and so on).
I don't understand why I have to make 2 malloc calls. Isn't a single one enough?

For your first question, think of malloc as a request for memory to store N objects, each of size S. When you have the parameters void ** arrayToFill, int arrayLength, you are saying this array will contain arrayLength pointers, each of size sizeof(void*). That is the first allocation and call to malloc.
But the members of that array are pointers, which are meant to point to arrays or to the memory of some other object. The first call to malloc only allocates memory to store the void* of each array member; the memory for each individual member of the array needs its own malloc() call.
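The two levels of allocation described above can be sketched as follows. This is only an illustration under assumptions: the helper names alloc_generic_array and free_generic_array are made up here, and the element size is passed in by the caller so the array stays generic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates an array of `length` generic slots (the
 * first malloc) and gives every slot its own storage of `elemSize` bytes
 * (one malloc per element). Returns NULL if any allocation fails. */
void **alloc_generic_array(size_t length, size_t elemSize) {
    assert(length > 0 && elemSize > 0);
    void **array = malloc(length * sizeof *array);   /* room for the pointers */
    if (array == NULL)
        return NULL;
    for (size_t i = 0; i < length; i++) {
        array[i] = malloc(elemSize);                 /* room for one element */
        if (array[i] == NULL) {
            while (i > 0)
                free(array[--i]);                    /* undo earlier allocations */
            free(array);
            return NULL;
        }
    }
    return array;
}

/* Releases everything alloc_generic_array() handed out. */
void free_generic_array(void **array, size_t length) {
    for (size_t i = 0; i < length; i++)
        free(array[i]);
    free(array);
}
```

With storage set up like this, the read in the question would target the element itself, for example fscanf(dataset, "%llu", (unsigned long long *)arrayToFill[i]), without the & in front of arrayToFill[i].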
Efficient Line Reading
For your other question, making lots of small allocations of memory, and then later on freeing them (assuming you would do so, otherwise you would leak a lot of memory), is very slow. For I/O-related tasks, however, the cost depends more on the number of calls than on the amount of memory you are allocating.
Have your program read the entire file into memory, and allocate an array of unsigned long long for 20 million, or however many integers you expect to handle. This way, you can parse through the file contents with the strtoull function from <stdlib.h> (strtol returns a long, which may be too small for values such as 19672843924), and copy the resulting values one by one into your large array.
This way, you only need two or three large memory allocations and deallocations.
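A minimal sketch of that approach, assuming the file holds whitespace-separated decimal numbers. The names read_whole_file and parse_numbers are hypothetical helpers introduced for this example, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parses up to `max` whitespace-separated unsigned
 * numbers from the NUL-terminated string `text` into `out`.
 * Returns how many numbers were stored. */
size_t parse_numbers(const char *text, unsigned long long *out, size_t max) {
    assert(text != NULL && out != NULL);
    size_t count = 0;
    const char *p = text;
    while (count < max) {
        char *end;
        errno = 0;
        unsigned long long value = strtoull(p, &end, 10);
        if (end == p || errno == ERANGE)   /* nothing left to parse, or overflow */
            break;
        out[count++] = value;
        p = end;                           /* continue after the parsed number */
    }
    return count;
}

/* Hypothetical helper: reads the whole file at `path` into one malloc'd,
 * NUL-terminated buffer. Returns NULL on failure; the caller frees it. */
char *read_whole_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);          /* one large allocation */
            if (buf != NULL) {
                size_t got = fread(buf, 1, (size_t)size, f);
                buf[got] = '\0';                     /* terminate for strtoull */
            }
        }
    }
    fclose(f);
    return buf;
}
```

A caller would allocate the destination array once (for example malloc(20000000 * sizeof(unsigned long long))), call read_whole_file, pass the buffer to parse_numbers, and then free the buffer, which keeps the number of allocations at two.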

I've come up with this POSIX solution, see if it helps
#include <unistd.h> //for read, write, lseek
#include <stdio.h> //fprintf
#include <fcntl.h> //for open
#include <string.h> //
#include <stdlib.h> // for exit and define
#include <sys/types.h>
#include <sys/stat.h>
int main(int argc, char * argv[])
{
int fd; // file descriptor
char * buffer; //pointer for the malloc
if(argc < 2)
{
fprintf(stderr, "Insert the file name as parameter\n");
exit(EXIT_FAILURE);
}
if((fd = open(argv[1], O_RDONLY)) == -1)// opens the file in read-only mode
{
fprintf(stderr, "Can't open file\n");
exit(EXIT_FAILURE);
}
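off_t bytes = lseek(fd, 0, SEEK_END); // looks at how many bytes the file has
if(bytes == -1 || lseek(fd, 0, SEEK_SET) == -1) // returns the file pointer to the start position
{
fprintf(stderr, "Can't seek in file\n");
exit(EXIT_FAILURE);
}
buffer = malloc((size_t)bytes + 1); // allocates enough memory for the file plus a terminating '\0'
if(buffer == NULL)
{
fprintf(stderr, "Out of memory\n");
exit(EXIT_FAILURE);
}
ssize_t r = read(fd, buffer, (size_t)bytes); // reads the file; may return fewer bytes than requested
if(r == -1)
{
fprintf(stderr, "Error reading\n");
exit(EXIT_FAILURE);
}
buffer[r] = '\0'; // terminates the buffer so that it can be printed with %s
fprintf(stdout, "\n%s", buffer); // prints the file
free(buffer);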
close(fd);
exit(EXIT_SUCCESS);
}

Related

I need to split a file (for now text file) into multiple buffer C

I'm trying to read a file and split it into multiple buffers.
This is what I came up with:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define PKT_SIZE 2048;
#define PATH "directory of some kind"
int main() {
char filepath[200] = PATH;
FILE *packet;
int size = PKT_SIZE;
char *buffer[size];
int i=0;
//OPEN FILE
if((packet = fopen(filepath, "r")) == NULL){ //I'm trying with a txt file, then i'll change it to 'rb'
printf("Error Opening File\n");
return -1;
}
//READ FILE
while(*fgets((char *) *buffer[i], (int) strlen(buffer[i]), packet) != NULL) { //read the file and cycling insert the fgets into the buffer i
printf("Create %d buffer\n", i);
i++;
}
fclose(packet);
return 0;
}
Now, when I run this program, I get a SIGSEGV error. I managed to work out that the error definitely comes from:
*fgets((char *) *buffer[i], (int) strlen(buffer[i]), packet) != NULL
Do you have any suggestions?
*fgets((char *) *buffer[i], (int) strlen(buffer[i]), packet)
This line has several problems.
buffer[i] is just an uninitialized pointer pointing nowhere.
*buffer[i] is of type char, but you need to pass a char*.
strlen does not return the size of the buffer. Calling it on an uninitialized pointer value is undefined behavior.
Also, dereferencing whatever fgets returns is bad: when fgets returns NULL, the dereference invokes undefined behavior.
There are many solutions to this, ranging from dynamic memory allocation to using
char buffer[size][MAXSIZE];. If you go that way, you can get input like this:
#define MAXSIZE 100
...
char buffer[size][MAXSIZE];
while(i < size && fgets(buffer[i], sizeof(buffer[i]), packet)!=NULL){...
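char* buffer[size] is an array of size char* pointers which are uninitialized. You must allocate memory to these before using them or your program will explode in a ball of fire.
The fix is to allocate (after removing the stray semicolon from #define PKT_SIZE 2048;, which would otherwise be pasted into every use of the macro):
for (size_t i = 0; i < size; ++i) {
buffer[i] = malloc(PKT_SIZE); // check each result for NULL before using it
}
Then pass PKT_SIZE, not strlen(buffer[i]), as the size argument to fgets.
You're going to be responsible for that memory going forward, too, so don't forget to free it later.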
Allocating an arbitrary number of buffers is pretty wasteful. It's usually better to use some kind of simple linked-list type structure and append chunks as necessary. This avoids pointless over-allocation of memory.
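One way the linked-list idea could look is sketched below. The struct chunk type, the CHUNK_SIZE constant (set to the 2048 from the question) and the function names are all assumptions made for this example. It uses fread rather than fgets so that it also works once the file is opened in "rb" mode.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 2048   /* assumed packet size, taken from the question */

/* Hypothetical node type: one chunk of file data plus a link to the next. */
struct chunk {
    size_t used;                      /* number of valid bytes in data */
    struct chunk *next;
    unsigned char data[CHUNK_SIZE];
};

/* Reads `in` to the end, appending one node per CHUNK_SIZE bytes.
 * Returns the head of the list (NULL for an empty stream) and stores the
 * number of chunks in *count. If an allocation fails, the chunks read so
 * far are returned. */
struct chunk *read_chunks(FILE *in, size_t *count) {
    assert(in != NULL && count != NULL);
    struct chunk *head = NULL, **tail = &head;
    *count = 0;
    for (;;) {
        struct chunk *c = malloc(sizeof *c);
        if (c == NULL)
            break;
        c->used = fread(c->data, 1, CHUNK_SIZE, in);
        c->next = NULL;
        if (c->used == 0) {           /* end of file or read error */
            free(c);
            break;
        }
        *tail = c;                    /* append at the end of the list */
        tail = &c->next;
        (*count)++;
    }
    return head;
}

/* Frees every node of a list built by read_chunks(). */
void free_chunks(struct chunk *head) {
    while (head != NULL) {
        struct chunk *next = head->next;
        free(head);
        head = next;
    }
}
```

Only as many chunks as the file needs are allocated, and the last one records in used how many of its bytes are valid.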

How to make a void pointer to read a given part of a binary file

I have a binary file which contains 3 different structs and a Christmas text. The first line of the binary file holds an int which represents the size of a package inside the file. A package contains the 3 structs, the Christmas text and the size.
The structs are declared in a file called framehdr.h and the binary file I'm reading is called TCPdump.
Now I am trying to create a program that will read one package at a time and then extract the text.
I have started with something like this:
#pragma warning(disable: 4996)
#include <stdio.h>
#include <stdlib.h>
#include "framehdr.h"
#include <crtdbg.h>
int main()
{
_CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
FILE *fileOpen;
char *buffer;
size_t dataInFile;
long filesize;
// The three structs
struct ethernet_hdr ethHdr;
struct ip_hdr ipHdr;
struct tcp_hdr tcpHDr;
fileOpen = fopen("C:\\Users\\Viktor\\source\\repos\\Laboration_3\\Laboration_3\\TCPdump", "rb");
if (fileOpen == NULL)
{
printf("Error\n");
}
else
{
printf("Success\n");
}
char lenOf[10];
size_t nr;
// Reads until \n comes
fgets(lenOf, sizeof(lenOf), fileOpen);
sscanf(lenOf, "%d", &nr);
// Withdraw the size of a package and check if it's correct
printf("Value: %d\n", nr);
printf("Adress: %d\n", &nr);
void *ptr;
fread(&ptr, nr, 1, fileOpen);
int resEth = 14;
printf("resEth: %d\n", resEth);
int resIP = IP_HL((struct ip_hdr*)ptr);
printf("ResIP: %d\n", resIP);
int resTcp = TH_OFF((struct tcp_hdr*)ptr);
printf("tcpIP: %d\n", resTcp);
int res = resEth + resIP + resTcp;
printf("Total: %d", res);
fclose(fileOpen);
//free(buffer);
system("pause");
return 0;
}
I know that the first struct, ethernet, will always have a size of 14, but I need to get the sizes of the other 2, and I'm supposed to use IP_HL and TH_OFF for that.
But my problem is that I can't seem to read the entire package into one void * with fread. I get nothing in my *ptr,
which in turn makes the code break when I try to convert the void * to one of the struct pointers.
What am I doing wrong with the void *?
Two problems:
First, you should not really use text functions when reading binary files. Binary files don't really have "lines" in the sense that text files do.
Secondly, with
void *ptr;
fread(&ptr, nr, 1, fileOpen);
you are passing a pointer to the pointer variable itself; you never read anything into a separate block of memory and make ptr point to it. What happens now is that fread reads nr bytes from the file and writes them to the memory at &ptr, that is, over the pointer variable. This leads to undefined behavior if nr > sizeof ptr (as the data is then written out of bounds), and in any case leaves ptr holding file bytes rather than a valid address.
You have to allocate nr bytes of memory, and then pass a pointer to the first element of that:
char data[nr];
fread(data, nr, 1, fileOpen);
You should also get into the habit of checking for errors. What if the fread function fails? Or the file is truncated and there aren't nr bytes left to read?
You can detect these conditions by checking what fread returns.
And don't stop at fread: other functions can fail too. Your program, for example, only prints "Error" when fopen fails and then carries on using the NULL file pointer.
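As a rough sketch of the allocation and the error check together, the hypothetical helper read_package below (the name is invented for this example) either returns a complete package or NULL:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads exactly `size` bytes from `in` into freshly
 * allocated memory. Returns NULL if the allocation fails or the file ends
 * early; the caller frees the result. */
void *read_package(FILE *in, size_t size) {
    assert(in != NULL && size > 0);
    void *data = malloc(size);
    if (data == NULL)
        return NULL;
    /* With an element size of 1, fread returns the number of bytes read. */
    if (fread(data, 1, size, in) != size) {   /* truncated file or I/O error */
        free(data);
        return NULL;
    }
    return data;
}
```

The returned pointer can then be assigned to ptr and cast to the header struct types. Where each header starts inside the package depends on the layout in framehdr.h, which is not shown here, so any offsets would need to be checked against that file. Note also that sscanf with %d expects an int *, so nr should be declared as int (or read with %zu if it stays a size_t).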

Segmentation Fault on fputs

I am pretty new to C and memory allocation in general. Basically, what I am trying to do is copy the contents of an input file of unknown size and reverse its contents using recursion. I feel that I am very close, but I keep getting a segmentation fault when I try to write out what I presume to be the reversed contents of the file (I presume because I think I am doing it right...).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int recursive_back(char **lines, int lineNumber, FILE *input) {
char *input_line = malloc(sizeof(char) * 1000);
lines = realloc(lines, (lineNumber) * 1000 * sizeof(char));
if(fgets(input_line, 201, input) == NULL) {
*(lines + lineNumber) = input_line;
return 1;
}
else {
printf("%d\n", lineNumber);
return (1+recursive_back(lines, ++lineNumber, input));
}
}
void backward (FILE *input, FILE *output, int debugflag ) {
int i;
char **lines; //store lines in here
lines = malloc(1000 * sizeof(char *) ); //1000 lines
if(lines == NULL) { //if malloc failed
fprintf(stderr, "malloc of lines failed\n");
exit(1);
}
int finalLineCount, lineCount;
finalLineCount = recursive_back(lines, 0, input);
printf("test %d\n", finalLineCount);
for(i = finalLineCount; i > 0; i--) {
fputs(*(lines+i), output); //segfault here
}
}
I am using a simple input file to test the code. My input file is 6 lines long that says "This is a test input file". The actual input files are being opened in another function and passed over to the backward function. I have verified that the other functions in my program work since I have been playing around with different options. These two functions are the only functions that I am having trouble with. What am I doing wrong?
Your problem is here:
lines = realloc(lines, (lineNumber) * 1000 * sizeof(char));
exactly as @ooga said. There are at least three separate things wrong with it:
You are reallocating the memory block pointed to by recursive_back()'s local variable lines, and storing the new address (supposing that the reallocation succeeds) back into that local variable. The new location is not necessarily the same as the old, but the only pointer to it is a local variable that goes out of scope at the end of recursive_back(). The caller's corresponding variable is not changed (including when the caller is recursive_back() itself), and therefore can no longer be relied upon to be a valid pointer after recursive_back() returns.
You allocate space using the wrong type. lines has type char **, so the object it points to has type char *, but you are reserving space based on the size of char instead.
You are not reserving enough space, at least on the first call, when lineNumber is zero. On that call the space requested is exactly zero bytes; what realloc() does with a zero size is implementation-defined, and it may simply free the memory pointed to by lines. On subsequent calls, the space allocated is always one line's worth less than you think you are allocating.
It looks like the realloc() is altogether unnecessary if you can rely on the input to have at most 1000 lines, so you should consider just removing it. If you genuinely do need to be able to reallocate in a way that the caller will see, then the caller needs to pass a pointer to its variable, so that recursive_back() can modify it via that pointer.
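If reallocation is needed, passing a pointer to the caller's variable could look like the sketch below. The function append_line and its count/capacity bookkeeping are assumptions made for this example, and it grows the array iteratively instead of recursively.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends a copy of `line` to the array *lines, which
 * currently holds *count entries and has room for *capacity. The array is
 * grown through the caller's pointer, so the caller sees the new address.
 * Returns 0 on success, -1 if memory runs out (the array stays valid). */
int append_line(char ***lines, size_t *count, size_t *capacity, const char *line) {
    assert(lines != NULL && count != NULL && capacity != NULL && line != NULL);
    if (*count == *capacity) {
        size_t newCap = *capacity ? *capacity * 2 : 16;
        /* sizeof **lines is sizeof(char *): the element type, not char */
        char **grown = realloc(*lines, newCap * sizeof **lines);
        if (grown == NULL)
            return -1;
        *lines = grown;               /* update the caller's variable */
        *capacity = newCap;
    }
    char *copy = malloc(strlen(line) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, line);
    (*lines)[(*count)++] = copy;
    return 0;
}
```

A caller would start with char **lines = NULL; size_t count = 0, capacity = 0; and call append_line(&lines, &count, &capacity, input_line) for each line fgets returns. The lines can then be written in reverse with for (size_t i = count; i-- > 0; ) fputs(lines[i], output);. The loop in the question, by contrast, starts at lines[finalLineCount] and never reaches lines[0].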

Beginner C : Dynamic memory allocation

I'm switching to C from Java, and I'm having some trouble grasping memory management.
Say I have a function check_malloc that behaves as such:
// Checks if malloc() succeeds.
void *check_malloc(size_t amount){
void *tpt;
/* Allocates a memory block in amount bytes. */
tpt = malloc( amount );
/* Checks if it was successful. */
if ( tpt == NULL ){
fprintf(stderr, "No memory of %lu bytes\n", amount);
exit(1);
}
return tpt;
}
I also have the following variables to work with:
FILE *f = fopen("abc.txt", "r"); // Pointer to a file with "mynameisbob" on the first line and
// "123456789" on the second line
char *pname; // Pointer to a string for storing the name
}
My goal is to use check_malloc to dynamically allocate memory so that the string pointed to by pname is just the right size for storing "mynameisbob", which is the only thing on the first line of the text file.
Here is my (failed) attempt:
int main(int argc, char *argv[]){
FILE *f = fopen("abc.txt", "r"); // A file with "mynameisbob" on the first line and
// "123456789" on the second line
char *pname; // Pointer to a string for storing the name
char currentline[150]; // Char array for storing current line of file
while(!feof(f)){
fgets(currentline,100,f);
pname = &currentline;
}
But I know this probably isn't the way to go about this, because I need to use my nice check_malloc function.
Additionally, in my actual text file there is a "<" symbol before the name on the first line. But I just want pname to point to a string saying "mynameisbob" without the "<" symbol. This isn't that important now; it just reinforces that I can't simply set the pointer to point straight to currentline.
Can anyone help me fix my thinking on this one? Thanks a lot.
In C you need to copy the chars, not the "strings" (which are just pointers). Check out strcpy() and strlen(). Use strlen() to determine how long the line that fgets has read actually is, then use your check_malloc() to allocate exactly that (plus 1 for the terminating '\0'). Then copy the chars over with strcpy().
There are several problems in your code, see my comments in this example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Checks if malloc() succeeds.
void *check_malloc (size_t amount) {
void *tpt;
/* Allocates a memory block in amount bytes. */
tpt = malloc( amount );
/* Checks if it was successful. */
if (tpt == NULL) {
fprintf (stderr, "No memory of %zu bytes\n", amount); // %zu is the conversion for size_t
exit (EXIT_FAILURE);
}
return tpt;
}
// To avoid subtle errors I have defined buffer size here
#define BUFFER_SIZE 150
// I have used the (void) version of main () here. While not strictly necessary, you were not using argc and argv anyway, so they can be left out in this case
int main (void) {
// It might be a good idea to make the filename a char[] as well, but I leave that as an exercise to the reader.
FILE *f = fopen("abc.txt", "r"); // A file with "mynameisbob" on the first line and
// "123456789" on the second line
// You have to check whether the file was *actually opened*
if (f == NULL) {
fprintf (stderr, "Could not open file abc.txt\n"); // '"...%s\n", filename);' might be better.
exit (EXIT_FAILURE);
}
char *pname; // Pointer to a string for storing the name
char currentline[BUFFER_SIZE]; // Char array for storing current line of file
// Loop on the result of fgets rather than on feof(): feof only becomes true after a read has already failed
char *res;
while ((res = fgets (currentline, BUFFER_SIZE, f)) != NULL) {
// fgets returns NULL at end of file (when no characters were read) or on a read error
size_t read = strlen (res);
// The line might have been empty
if (read) {
// Better use "sizeof *varname", while char is always 1 byte it is a good practice
pname = check_malloc ((read + 1) * sizeof *pname); // + 1 because we have to provide an extra char for '\0'
strcpy (pname, currentline); // You have to use strcpy to copy the contents of the string (including the terminating '\0') rather than just assigning the pointer
// What was allocated must be freed again
free (pname);
}
}
fclose(f); // Always close everything you open!
return EXIT_SUCCESS;
}
Actually you really don't have to use pname in this simple case, because currentline already contains the line, but since you're trying to learn about memory management this should give you a general idea of how things work.
In your code you had this line:
pname = &currentline;
There are two problems here:
As already mentioned in my code, assigning currentline to pname only copies the pointer, not the contents.
The correct assignment would be pname = currentline (without the address operator &). In that expression the array currentline is converted to a pointer to its first element, of type char *, whereas &currentline is a pointer to the whole array, of type char (*)[150], which does not match the type of pname.

How to malloc properly when dealing with 2-D pointers and what are some of the advantages of using a 2-D pointer array?

I am currently working on solving a maze. So far I have read the maze from a text file and stored it in a 1-D pointer; however, when I try to store it in a 2-D pointer array, I keep getting a segmentation fault. My second question: what are some advantages of using a 2-D pointer array? I do not seem to understand how to implement them properly. This is my first time using 2-D pointers, so I'm not great at it yet, but I would like to improve. Thank you so much for the help in advance :)
Here is what I have done so far:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mazegen.h"
#define BUFFERSIZE 500
int main(int argc, char**argv)
{
char*readFile;
char**storeMaze;
FILE*fp;
int i;
i=0;
readFile = malloc(sizeof(char)*(BUFFERSIZE)+1);
if(argc != 2)
{
printf("Error opening file, incorrect format. <programNam <inputfileName>\n");
exit(0);
}
else
{
fp = fopen(argv[1], "r");
if(fp == NULL)
{
printf("Empty File. Error. Exiting Program.\n");
exit(0);
}
while(fgets(readFile,sizeof(readFile),fp) != NULL)
{
storeMaze = malloc(sizeof(char*)*(strlen(readFile)+1));
strcpy(storeMaze[i], readFile);
}
}
free(readFile);
fclose(fp);
return 0;
}
You've dynamically allocated space for fgets() to read into, but you then pass the wrong size to it. There's no reason to use malloc() for the line buffer unless you're on an unusually small machine (say less than 8 MiB of main memory; yes, I mean megabytes).
char readLine[4096];
while (fgets(readLine, sizeof(readLine), inputFile) != NULL)
Or, if you insist on malloc(), pass the size you actually allocated (BUFFERSIZE + 1 in your code) in the call to fgets().
sizeof(readFile) is the size of the pointer, sizeof(char *), which is 4 if you're compiling on a 32-bit system. Hence you got up to three characters and a null from the input for each call to fgets().
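The question's loop has a second problem that the fgets size does not explain: storeMaze is replaced by a new allocation on every iteration, storeMaze[i] is never given memory of its own before strcpy writes to it, and i is never incremented. A possible way to store one row per line is sketched below; store_row is a hypothetical helper written for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores a copy of `line` as row number `row` of the
 * maze. *maze is grown to hold row + 1 row pointers, then the row itself
 * gets strlen(line) + 1 bytes. Returns 0 on success, -1 on failure. */
int store_row(char ***maze, size_t row, const char *line) {
    assert(maze != NULL && line != NULL);
    char **grown = realloc(*maze, (row + 1) * sizeof **maze);  /* the row pointers */
    if (grown == NULL)
        return -1;
    *maze = grown;
    grown[row] = malloc(strlen(line) + 1);                     /* the characters */
    if (grown[row] == NULL)
        return -1;
    strcpy(grown[row], line);
    return 0;
}
```

Inside the reading loop this would be called as store_row(&storeMaze, i++, readFile), with storeMaze initialized to NULL beforehand. The main advantage of the 2-D layout is that a cell can be addressed directly as storeMaze[r][c], and rows may have different lengths.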

Resources