Parsing Large File in C


For a class, I've been given the task of writing radix sort in parallel using pthreads, openmp, and MPI. My language of choice in this case is C -- I don't know C++ too well.
Anyways, the way I'm going about reading a text file is causing a segmentation fault at around 500MB file size. The files are line separated 32 bit numbers:
12351
1235234
12
53421
1234
I know C, but I don't know it well; I use things I know, and in this case the things I know are terribly inefficient. My code for reading the text file is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
int main(int argc, char **argv){
if(argc != 4) {
printf("rs_pthreads requires three arguments to run\n");
return -1;
}
char *fileName=argv[1];
uint32_t radixBits=atoi(argv[2]);
uint32_t numThreads=atoi(argv[3]);
if(radixBits > 32){
printf("radixBitx cannot be greater than 32\n");
return -1;
}
FILE *fileForReading = fopen( fileName, "r" );
if(fileForReading == NULL){
perror("Failed to open the file\n");
return -1;
}
char* charBuff = malloc(1024);
if(charBuff == NULL){
perror("Error with malloc for charBuff");
return -1;
}
uint32_t numNumbers = 0;
while(fgetc(fileForReading) != EOF){
numNumbers++;
fgets(charBuff, 1024, fileForReading);
}
uint32_t numbersToSort[numNumbers];
rewind(fileForReading);
int location;
for(location = 0; location < numNumbers; location++){
fgets(charBuff, 1024, fileForReading);
numbersToSort[location] = atoi(charBuff);
}
At a file of 50 million numbers (~500MB), I'm getting a segmentation fault at rewind of all places. My knowledge of how file streams work is almost non-existent. My guess is it's trying to malloc without enough memory or something, but I don't know.
So, I've got a two parter here: How is rewind segmentation faulting? Am I just doing a poor job before rewind and not checking some system call I should be?
And, what is a more efficient way to read in an arbitrary amount of numbers from a text file?
Any help is appreciated.

I think the most likely cause here is (ironically enough) a stack overflow. Your numbersToSort array is allocated on the stack, and the stack has a fixed size (varies by compiler and operating system, but 1 MB is a typical number). You should dynamically allocate numbersToSort on the heap (which has much more available space) using malloc():
uint32_t *numbersToSort = malloc(sizeof(uint32_t) * numNumbers);
Don't forget to deallocate it later:
free(numbersToSort);
I would also point out that your first-pass loop, which is intended to count the number of lines, will fail if there are any blank lines. This is because on a blank line, the first character is '\n', and fgetc() will consume it; the next call to fgets() will then be reading the following line, and you'll have skipped the blank one in your count.

The problem is in this line
uint32_t numbersToSort[numNumbers];
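You are attempting to allocate a huge array on the stack. The default stack is typically only a few megabytes (for example 8 MB on Linux, 1 MB on Windows), while 50 million uint32_t values need about 200 MB. (Variable-length arrays like this are also not allowed before C99.) So you can try this: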
uint32_t *numbersToSort; /* Declare it with other declarations */
/* Remove uint32_t numbersToSort[numNumbers]; */
/* Add the code below */
numbersToSort = malloc(sizeof(uint32_t) * numNumbers);
if (!numbersToSort) {
/* No memory; do cleanup and bail out */
return 1;
}
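On the second part of the question (a more efficient way to read an arbitrary amount of numbers), one option is to skip the counting pass entirely and grow a heap buffer as you read. A minimal sketch, assuming one unsigned decimal number per line and lines shorter than 64 characters; read_numbers is a hypothetical helper name, not something from the question:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one unsigned number per line from `in` into a
 * heap array whose capacity doubles whenever it fills up. Blank or
 * non-numeric lines are skipped. Returns the array (caller frees it) and
 * stores the element count in *count; returns NULL if allocation fails. */
uint32_t *read_numbers(FILE *in, size_t *count)
{
    size_t cap = 1024, n = 0;
    uint32_t *nums;
    char line[64];

    assert(in != NULL && count != NULL);
    nums = malloc(cap * sizeof *nums);
    if (nums == NULL)
        return NULL;
    while (fgets(line, sizeof line, in) != NULL) {
        char *end;
        unsigned long v = strtoul(line, &end, 10);
        if (end == line)          /* no digits on this line: skip it */
            continue;
        if (n == cap) {           /* buffer full: double the capacity */
            uint32_t *tmp = realloc(nums, 2 * cap * sizeof *nums);
            if (tmp == NULL) {
                free(nums);
                return NULL;
            }
            nums = tmp;
            cap *= 2;
        }
        nums[n++] = (uint32_t)v;
    }
    *count = n;
    return nums;
}
```

You would call it as read_numbers(fileForReading, &numNumbers) with numNumbers declared as size_t, and free() the result when done. Doubling keeps the number of realloc calls logarithmic in the number of lines, and because the file is read only once there is no rewind at all.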

Related

reading data from large file into struct in C

I am a beginner at C programming. I need to efficiently read millions of records from a file into a struct. Below is an example of the input file.
2,33.1609992980957,26.59000015258789,8.003999710083008
5,15.85200023651123,13.036999702453613,31.801000595092773
8,10.907999992370605,32.000999450683594,1.8459999561309814
11,28.3700008392334,31.650999069213867,13.107999801635742
My current code, shown below, prints "Error file", suggesting that fopen returned NULL, even though the file exists and has data.
#include<stdio.h>
#include<stdlib.h>
struct O_DATA
{
int index;
float x;
float y;
float z;
};
int main ()
{
FILE *infile ;
struct O_DATA input;
infile = fopen("input.dat", "r");
if (infile == NULL);
{
fprintf(stderr,"\nError file\n");
exit(1);
}
while(fread(&input, sizeof(struct O_DATA), 1, infile))
printf("Index = %d X= %f Y=%f Z=%f", input.index , input.x , input.y , input.z);
fclose(infile);
return 0;
}
I need to efficiently read and store data from an input file to process it further. Any help would be really appreciated. Thanks in advance.

First figure out how to convert one line of text to data
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct my_data
{
unsigned int index;
float x;
float y;
float z;
};
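/* Alternative 1: parse the whole line with a single sscanf call */
struct my_data *
deserialize_data_sscanf(struct my_data *data, const char *input)
{
/* sscanf returns the number of fields converted; we expect all 4 */
if(sscanf(input, "%u,%f,%f,%f", &data->index, &data->x, &data->y, &data->z) != 4)
return NULL;
return data;
}
/* Alternative 2: split the line with strtok and convert each token */
struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)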
{
char *p;
struct my_data tmp;
char *str = strdup(input); /* make a copy of the input line because we modify it */
if (!str) { /* I couldn't make a copy so I'll die */
return NULL;
}
p = strtok (str, separators); /* use line for first call to strtok */
if (!p) goto err;
tmp.index = strtoul (p, NULL, 0); /* convert text to integer */
p = strtok (NULL, separators); /* strtok remembers line */
if (!p) goto err;
tmp.x = atof(p);
p = strtok (NULL, separators);
if (!p) goto err;
tmp.y = atof(p);
p = strtok (NULL, separators);
if (!p) goto err;
tmp.z = atof(p);
memcpy(data, &tmp, sizeof(tmp)); /* copy values out */
goto out;
err:
data = NULL;
out:
free (str);
return data;
}
int main() {
struct my_data somedata;
deserialize_data(&somedata, "1,2.5,3.12,7.955", ",");
printf("index: %u, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
}
Combine it with reading lines from a file:
just the main function here (insert the rest from the previous example)
int
main(int argc, char *argv[])
{
FILE *stream;
char *line = NULL;
size_t len = 0;
ssize_t nread;
struct my_data somedata;
if (argc != 2) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
exit(EXIT_FAILURE);
}
stream = fopen(argv[1], "r");
if (stream == NULL) {
perror("fopen");
exit(EXIT_FAILURE);
}
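while ((nread = getline(&line, &len, stream)) != -1) {
if (deserialize_data(&somedata, line, ",") == NULL) { /* don't print garbage for a bad line */
fprintf(stderr, "skipping malformed line: %s", line);
continue;
}
printf("index: %u, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
}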
free(line);
fclose(stream);
exit(EXIT_SUCCESS);
}

You've got an incorrect ; after your if (infile == NULL) test - try removing that...
[Edit: 2nd by 9 secs! :-)]

if (infile == NULL);
{ /* floating block */ }
The above if is a complete statement that does nothing regardless of the value of infile. The "floating" block is executed no matter what infile contains.
Remove the semicolon to 'attach' the "floating" block to the if
if (infile == NULL)
{ /* if block */ }

You already have solid responses in regard to syntax/structs/etc, but I will offer another method for reading the data in the file itself: I like Martin York's CSVIterator solution. This is my go-to approach for CSV processing because it requires less code to implement and has the added benefit of being easily modifiable (i.e., you can edit the CSVRow and CSVIterator defs depending on your needs).
Here's a mostly complete example using Martin's unedited code without structs or classes. In my opinion, and especially so as a beginner, it is easier to start developing your code with simpler techniques. As your code begins to take shape, it is much clearer why and where you need to implement more abstract/advanced devices.
Note this would technically need to be compiled with C++11 or greater because of my use of std::stod (and maybe some other stuff too I am forgetting), so take that into consideration:
//your includes
//...
#include <array>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include"wherever_CSVIterator_is.h"
int main (int argc, char* argv[])
{
std::vector<std::array<double, 3>> saved; //one triplet per row, stored by value
std::vector<int> indices;
std::ifstream file(argv[1]);
for (CSVIterator loop(file); loop != CSVIterator(); ++loop) { //loop over rows
indices.push_back(std::stoi((*loop)[0])); //store int index first, always col 0
std::array<double, 3> tmp{}; //since we know the shape of your input data
for (std::size_t k=1; k < (*loop).size() && k <= 3; k++) { //loop across columns
tmp[k-1] = std::stod((*loop)[k]); //save double values now
}
saved.push_back(tmp); //copies the triplet into the vector
}
/*now we have two vectors of the same 'size'
(let's pretend I wrote a check here to confirm this is true),
so we loop through them together and access with something like:*/
for (std::size_t j=0; j < indices.size(); j++) {
const std::array<double, 3>& row = saved.at(j); //the triplet for row j
printf("\nindex: %d |", indices.at(j));
for (int k=0; k < 3; k++) {
printf(" %4.3f ", row[k]);
}
printf("\n");
}
}
Less fuss to write. Note that the triplets are stored by value: pushing a pointer to a single tmp buffer instead would leave every entry pointing at the same, last-overwritten row. Some unnecessary copying is present, but we benefit from using std::vector containers in lieu of knowing exactly how much memory we need to allocate.

Don't give an example of input file. Specify your input file format -at least on paper or in comments- e.g. in EBNF notation (since your example is textual... it is not a binary file). Decide if the numbers have to be in different lines (or if you might accept a file with a single huge line made of million bytes; read about the Comma Separated Values format). Then, code some parser for that format. In your case, it is likely that some very simple recursive descent parsing is enough (and your particular parser won't even use recursion).
Read more about <stdio.h> and its routines. Take time to carefully read that documentation. Since your input is textual, not binary, you don't need fread. Notice that input routines can fail, and you should handle the failure case.
Of course, fopen can fail (e.g. because your working directory is not what you believe it is). You had better use perror or errno to find out more about the cause of the failure. So at least code:
infile = fopen("input.dat", "r");
if (infile == NULL) {
perror("fopen input.dat");
exit(EXIT_FAILURE);
}
Notice that semi-colons (or their absence) are very important in C (no semi-colon after condition of if). Read again the basic syntax of C language. Read about How to debug small programs. Enable all warnings and debug info when compiling (with GCC, compile with gcc -Wall -g at least). The compiler warnings are very useful!
Remember that fscanf doesn't handle the end of line (newline) differently from a space character. So if the input has to be on different lines, you need to read every line separately.
You'll probably read every line using fgets (or getline) and parse every line individually. You could do that parsing with the help of sscanf (perhaps the %n could be useful) - and you want to use the return count of sscanf. You could also perhaps use strtok and/or strtod to do such a parsing.
Make sure that your parsing and your entire program is correct. With current computers (they are very fast, and most of the time your input file sits in the page cache) it is very likely that it would be fast enough. A million lines can be read pretty quickly (if on Linux, you could compare your parsing time with the time used by wc to count the lines of your file). On my computer (a powerful Linux desktop with AMD2970WX processor -it has lots of cores, but your program uses only one-, 64Gbytes of RAM, and SSD disk) a million lines can be read (by wc) in less than 30 milliseconds, so I am guessing your entire program should run in less than half a second, if given a million lines of input, and if the further processing is simple (in linear time).
You are likely to fill a large array of struct O_DATA and that array should probably be dynamically allocated, and reallocated when needed. Read more about C dynamic memory allocation. Read carefully about C memory management routines. They could fail, and you need to handle that failure (even if it is very unlikely to happen). You certainly don't want to re-allocate that array at every loop. You probably could allocate it in some geometrical progression (e.g. if the size of that array is size, you'll call realloc or a new malloc for some int newsize = 4*size/3 + 10; only when the old size is too small). Of course, your array will generally be a bit larger than what is really needed, but memory is quite cheap and you are allowed to "lose" some of it.
But StackOverflow is not a "do my homework" site. I gave some advice above, but you should do your homework.
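For what it's worth, the line-by-line parsing and the geometric growth described above could look roughly like this. It is only a sketch under assumptions: load_records is a made-up name, lines are assumed to be shorter than 256 characters, and malformed lines are simply skipped:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct O_DATA {
    int index;
    float x, y, z;
};

/* Hypothetical helper: parses "index,x,y,z" lines from `in` into a heap
 * array that grows geometrically. On success returns 0 and stores the
 * array in *out (caller frees it) and the record count in *count.
 * Returns -1 if memory allocation fails. */
int load_records(FILE *in, struct O_DATA **out, size_t *count)
{
    struct O_DATA *recs = NULL;
    size_t size = 0, n = 0;
    char line[256];

    assert(in != NULL && out != NULL && count != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        struct O_DATA r;
        /* sscanf returns how many fields it converted; we need all 4 */
        if (sscanf(line, "%d,%f,%f,%f", &r.index, &r.x, &r.y, &r.z) != 4)
            continue;
        if (n == size) {                       /* array is full: grow it */
            size_t newsize = 4 * size / 3 + 10;
            struct O_DATA *tmp = realloc(recs, newsize * sizeof *recs);
            if (tmp == NULL) {
                free(recs);
                return -1;
            }
            recs = tmp;
            size = newsize;
        }
        recs[n++] = r;
    }
    *out = recs;    /* stays NULL if the input had no valid records */
    *count = n;
    return 0;
}
```

The array ends up somewhat larger than needed, which is the trade-off mentioned above; a final realloc down to n elements would trim it if that matters.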

Exploit a buffer overflow with canary protection

I'm trying to exploit this simple program for homework:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BUFSIZE 1024
typedef struct {
char flag_content[BUFSIZE];
char guess[47];
unsigned canary;
char flag_file[BUFSIZE];
} st_t;
int main(){
FILE *f = NULL;
st_t st = {
.flag_file = "fake_flag",
.canary = 0x4249b876
};
printf("Guess the flag!\n");
scanf("%s", st.guess);
f = fopen(st.flag_file, "rb");
if (f == NULL){
printf("flag error\n");
exit(-1);
}
if (fread(st.flag_content, BUFSIZE, 1, f) < 0){
printf("flag error\n");
exit(-1);
}
if (st.canary != 0x4249b876) {
printf("canary error\n");
exit(-1);
}
if (!strcmp(st.guess, st.flag_content)){
printf("You guessed it right!\n");
} else {
printf("Sorry but the flag is \"%s\"\n", st.flag_content);
}
exit(1);
}
The purpose is to modify the st.flag_file inserting "flag.txt" instead of "fake_flag.txt" to read its content.
It's easy to find out that there is a buffer overflow, but there is also a canary, so I write this exploit as input of the scanf:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv¸IBflag.txt
I found online that the hexadecimal 0x4249b876 is translated into v¸IB
but when I run the code from my terminal this is the output
./mission1
Guess the flag!
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv¸IBflag.txt
canary error
and doing a check with gdb I found out that the variable st.flag_file = "flag.txt" that is correct but the variable st.canary = 0x4249b8c2
I cannot understand. Why?

The problem you are having comes from the alignment requirements of the struct. In particular, between the guess and the canary there is extra padding that the compiler is inserting. Try to dump the entire structure and/or the addresses of the members and you will see the padding.
The end result is that you will need more than 47 bytes (A) to reach the canary, typically one more. So instead of, e.g.:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\x76\xb8\x49\x42flag
You will need:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\x76\xb8\x49\x42flag
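Take note, as well, that escaping the characters for the canary might be a good idea (to avoid problems with encoding etc.) and makes it more readable (to compare against the actual canary). Encoding is in fact the second half of what you saw: a UTF-8 terminal sends ¸ as the two bytes 0xC2 0xB8. With your input, the v (0x76) filled the padding byte and the next four bytes, 0xC2 0xB8 0x49 0x42, landed on the canary, which on a little-endian machine reads as exactly the 0x4249b8c2 that gdb showed.
Off-topic: note that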
fread(st.flag_content, BUFSIZE, 1, f) < 0
is always false since fread returns a size_t (unsigned).
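If you would rather measure the padding than guess it, offsetof from <stddef.h> gives the member offsets directly. A small sketch that reuses the struct from the question; the two function names are made up for the illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BUFSIZE 1024

typedef struct {
    char flag_content[BUFSIZE];
    char guess[47];
    unsigned canary;
    char flag_file[BUFSIZE];
} st_t;

/* Number of bytes that have to be written, starting at guess[0], before
 * the next byte lands on the first byte of the canary. */
size_t bytes_before_canary(void)
{
    return offsetof(st_t, canary) - offsetof(st_t, guess);
}

/* Prints where the compiler actually placed the members. */
void print_layout(void)
{
    /* the filler can never be shorter than the guess array itself */
    assert(bytes_before_canary() >= sizeof(((st_t *)0)->guess));
    printf("guess     at offset %zu\n", offsetof(st_t, guess));
    printf("canary    at offset %zu\n", offsetof(st_t, canary));
    printf("flag_file at offset %zu\n", offsetof(st_t, flag_file));
    printf("filler bytes needed: %zu\n", bytes_before_canary());
}
```

On a platform where unsigned is 4 bytes with 4-byte alignment, guess ends at offset 1071 and the canary is placed at 1072, so the filler is 48 bytes rather than 47.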

How do i read a file backwards using read() in c? [duplicate]

This question already has answers here:
Reading a text file backwards in C
(5 answers)
Closed 9 years ago.
I am supposed to create a program that takes a given file and creates a file with reversed txt. I wanted to know is there a way i can start the read() from the end of the file and copy it to the first byte in the created file if I dont know the exact size of the file?
Also i have googled this and came across many examples with fread, fopen, etc. However i cant use those for this project i can only use read, open, lseek, write, and close.
here is my code so far its not much but just for reference:
#include<stdio.h>
#include<unistd.h>
int main (int argc, char *argv[])
{
if(argc != 2)/*argc should be 2 for correct execution*/
{
printf("usage: %s filename",argv[0[]);}
}
else
{
int file1 = open(argv[1], O_RDWR);
if(file1 == -1){
printf("\nfailed to open file.");
return 1;
}
int reversefile = open(argv[2], O_RDWR | O_CREAT);
int size = lseek(argv[1], 0, SEEK_END);
char *file2[size+1];
int count=size;
int i = 0
while(read(file1, file2[count], 0) != 0)
{
file2[i]=*read(file1, file2[count], 0);
write(reversefile, file2[i], size+1);
count--;
i++;
lseek(argv[2], i, SEEK_SET);
}

I doubt that most filesystems offer a dedicated "read backwards" operation, and for the same reason most languages probably don't include any special feature for it. For a regular file you don't have to read everything to get to the end, though: lseek(fd, 0, SEEK_END) jumps there directly and returns the file size.
Just come up with something. Try to read the whole file in memory. If it is too big, dump the beginning, reversed, into a temporary file and keep reading... In the end combine all temporary files into one. Also, you could probably do something smart with manual low-level manipulation of disk sectors, or at least with low-level programming directly against the file system. Looks like this is not what you are after, though.

Why don't you try fseek to navigate inside the file? This function is contained in stdio.h, just like fopen and fclose.
Another idea would be to implement a simple stack...

This has no error checking == really bad
get file size using stat
create a buffer with malloc
fread the file into the buffer
set a pointer to the end of the file
print each character going backwards thru the buffer.
If you get creative with google you can get several examples just like this.
IMO the assistance you are getting so far is not really even good hints.
This appears to be schoolwork, so beware of copying. Do some reading about the calls used here. stat (fstat) fread (read)
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
int main(int argc, char **argv)
{
struct stat st;
char *buf;
char *p;
FILE *in=fopen(argv[1],"r");
fstat(fileno(in), &st); // get file size in bytes
buf=malloc(st.st_size +2); // buffer for file
memset(buf, 0x0, st.st_size +2 );
fread(buf, st.st_size, 1, in); // fill the buffer
p=buf+st.st_size; // one past the last byte that was read
while(p>buf) // print traversing backwards, starting at the last real byte
printf("%c", *--p);
fclose(in);
return 0;
}
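Since the question is restricted to open, read, lseek, write and close, here is a rough sketch of the same idea with only those calls. It walks the input from the end in fixed-size chunks, so the file never has to fit in memory. reverse_fd is a hypothetical name and the 4096-byte chunk size is an arbitrary choice:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copies the bytes of the seekable descriptor `in`
 * to `out` in reverse order. Returns 0 on success, -1 on any error. */
int reverse_fd(int in, int out)
{
    char buf[4096];
    off_t pos;

    assert(in >= 0 && out >= 0);
    pos = lseek(in, 0, SEEK_END);          /* offset of end == file size */
    if (pos == (off_t)-1)
        return -1;
    while (pos > 0) {
        size_t chunk = pos < (off_t)sizeof buf ? (size_t)pos : sizeof buf;
        size_t got = 0, put = 0;

        pos -= (off_t)chunk;               /* step back one chunk */
        if (lseek(in, pos, SEEK_SET) == (off_t)-1)
            return -1;
        while (got < chunk) {              /* read() may return short counts */
            ssize_t r = read(in, buf + got, chunk - got);
            if (r <= 0)
                return -1;
            got += (size_t)r;
        }
        for (size_t i = 0, j = chunk - 1; i < j; i++, j--) {
            char t = buf[i];               /* reverse the chunk in place */
            buf[i] = buf[j];
            buf[j] = t;
        }
        while (put < chunk) {              /* write() may be partial too */
            ssize_t w = write(out, buf + put, chunk - put);
            if (w <= 0)
                return -1;
            put += (size_t)w;
        }
    }
    return 0;
}
```

In main you would open argv[1] with O_RDONLY and the output with O_WRONLY | O_CREAT | O_TRUNC plus a mode such as 0644, call reverse_fd(file1, reversefile), and close both descriptors. Note that this reverses bytes, so multi-byte characters would come out scrambled.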

File get contents in C

What is the best way to get the contents of a file into a single character array?
I have read this question:
Easiest way to get file's contents in C
But from the comments, I've seen that the solution isn't great for large files. I do have access to the stat function. If the file size is over 4 gb, should I just return an error?
The contents of the file is encrypted and since it's supplied by the user it could be as large as anyone would want it to be. I want it to return an error and not crash if the file is too big. The main purpose of populating the character array with the contents of a file, is to compare it to another character array and also (if needed and configured to do so) to log both of these to a log file (or multiple log files if necessary).

You may use fstat(2) from sys/stat.h. Here is a little function that gets the size of the file, returns NULL (with errno set to EFBIG) if it is 4 GB or larger, and otherwise allocates memory and reads the whole file into it. The returned char * is NUL-terminated, contains the contents of the whole file, and should be free'd after use.
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <fcntl.h>
#define FILE_MAX_BYTES (4000000000LL)
char *loadlfile(const char *path)
{
int file_descr;
FILE *fp;
struct stat buf;
char *buffer;
size_t nread;
if ((file_descr = open(path, O_RDONLY)) == -1)
return NULL;
if (fstat(file_descr, &buf) == -1) {
close(file_descr);
return NULL;
}
// Refuse oversized files before malloc(3) is invoked at all.
// buf.st_size is of off_t, so it is cast for the comparison.
if ((long long)buf.st_size >= FILE_MAX_BYTES - 1) {
close(file_descr);
errno = EFBIG;
return NULL;
}
if (NULL == (buffer = malloc((size_t)buf.st_size + 1))) {
close(file_descr);
return NULL;
}
if (NULL == (fp = fdopen(file_descr, "rb"))) {
free(buffer);
close(file_descr);
return NULL;
}
// Read in one call; a char must never be compared against EOF.
nread = fread(buffer, 1, (size_t)buf.st_size, fp);
buffer[nread] = '\0';
fclose(fp); // this also closes file_descr, so no separate close()
return buffer;
}
The size check comes before malloc(3) on purpose: malloc can be expensive at times, and there is no point asking for a block of several gigabytes only to give up afterwards. If you would rather apply the limit only on machines that are not 64-bit, wrap the check in a preprocessor test; a very broad list of pre-defined macros for various architectures can be found at http://sourceforge.net/p/predef/wiki/Home/.

From that guy's code just do, if I understand your question correctly:
char * buffer = 0;
long length;
FILE * f = fopen (filename, "rb");
if (f)
{
fseek (f, 0, SEEK_END);
length = ftell (f);
if(length < 0 || length > MY_MAX_SIZE) {
fclose (f); /* don't leak the stream on the error path */
return -1;
}
fseek (f, 0, SEEK_SET);
buffer = malloc (length);
if (buffer)
{
fread (buffer, 1, length, f);
}
fclose (f);
}
if (buffer)
{
// start to process your data / extract strings here...
}
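Putting the pieces from both answers together (size from fstat, an explicit upper limit, a NULL return instead of a crash), a compact version might look like the sketch below. slurp and its max_bytes parameter are names invented for this example, and fileno is POSIX rather than standard C:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: loads the file at `path` into a NUL-terminated heap
 * buffer (caller frees it) and stores the number of bytes read in *len.
 * Returns NULL if the file cannot be opened, is larger than max_bytes,
 * or memory cannot be allocated. */
char *slurp(const char *path, size_t max_bytes, size_t *len)
{
    struct stat st;
    char *buf = NULL;
    FILE *f;

    assert(path != NULL && len != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    /* Check the size first so an oversized file never reaches malloc. */
    if (fstat(fileno(f), &st) == 0 && st.st_size >= 0 &&
        (unsigned long long)st.st_size <= max_bytes &&
        (buf = malloc((size_t)st.st_size + 1)) != NULL) {
        *len = fread(buf, 1, (size_t)st.st_size, f);
        buf[*len] = '\0';
    }
    fclose(f);
    return buf;
}
```

Because the data is encrypted and may contain zero bytes, compare two buffers with memcmp and the stored lengths rather than strcmp. Keeping max_bytes below SIZE_MAX avoids overflow in the size + 1 allocation.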

Why can't my program save a large amount (>2GB) to a file?

I am having trouble trying to figure out why my program cannot save more than 2GB of data to a file. I cannot tell if this is a programming or environment (OS) problem. Here is my source code:
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64
#include <math.h>
#include <time.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*-------------------------------------*/
//for file mapping in Linux
#include<fcntl.h>
#include<unistd.h>
#include<sys/stat.h>
#include<sys/time.h>
#include<sys/mman.h>
#include<sys/types.h>
/*-------------------------------------*/
#define PERMS 0600
#define NEW(type) (type *) malloc(sizeof(type))
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
void write_result(char *filename, char *data, long long length){
int fd, fq;
fd = open(filename, O_RDWR|O_CREAT|O_LARGEFILE, 0644);
if (fd < 0) {
perror(filename);
return -1;
}
if (ftruncate(fd, length) < 0)
{
printf("[%d]-ftruncate64 error: %s/n", errno, strerror(errno));
close(fd);
return 0;
}
fq = write (fd, data,length);
close(fd);
return;
}
main()
{
long long offset = 3000000000; // 3GB
char * ttt;
ttt = (char *)malloc(sizeof(char) *offset);
printf("length->%lld\n",strlen(ttt)); // length=0
memset (ttt,1,offset);
printf("length->%lld\n",strlen(ttt)); // length=3GB
write_result("test.big",ttt,offset);
return 1;
}
According to my test, the program can generate a file large than 2GB and can allocate such large memory as well.
The weird thing happened when I tried to write data into the file. I checked the file and it is empty, which is supposed to be filled with 1.
Can any one be kind and help me with this?

You need to read a little more about C strings and what malloc and calloc do.
In your original main ttt pointed to whatever garbage was in memory when malloc was called. This means a nul terminator (the end marker of a C String, which is binary 0) could be anywhere in the garbage returned by malloc.
Also, since malloc does not touch every byte of the allocated memory (and you're asking for a lot) you could get sparse memory which means the memory is not actually physically available until it is read or written.
calloc allocates and fills the allocated memory with 0. It is a little more prone to fail because of this (it touches every byte allocated, so if the OS left the allocation sparse it will not be sparse after calloc fills it.)
Here's your code with fixes for the above issues.
You should also always check the return value from write and react accordingly. I'll leave that to you...
main()
{
long long offset = 3000000000; // 3GB
char * ttt;
//ttt = (char *)malloc(sizeof(char) *offset);
ttt = (char *)calloc( sizeof( char ), offset ); // instead of malloc( ... )
if( !ttt )
{
puts( "calloc failed, bye bye now!" );
exit( 87 );
}
printf("length->%zu\n",strlen(ttt)); // length=0 (This now works as expected if calloc does not fail); %zu is the format for size_t
memset( ttt, 1, offset );
ttt[offset - 1] = 0; // Now it's nul terminated and the printf below will work
printf("length->%zu\n",strlen(ttt)); // length=3GB minus the terminator
write_result("test.big",ttt,offset);
return 1;
}
Note to Linux gurus... I know sparse may not be the correct term. Please correct me if I'm wrong as it's been a while since I've been buried in Linux minutiae. :)
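On checking the return value of write: a single call is allowed to write fewer bytes than requested, and on Linux one write() transfers at most 0x7ffff000 bytes (just under 2 GB), so a 3 GB buffer cannot go out in one call. The usual remedy is a loop. A sketch, with write_all being a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write() until all `length` bytes of
 * `data` have been written to fd. Returns 0 on success and -1 on error,
 * with errno left as set by write(). */
int write_all(int fd, const char *data, size_t length)
{
    size_t done = 0;

    assert(fd >= 0 && data != NULL);
    while (done < length) {
        ssize_t n = write(fd, data + done, length - done);
        if (n < 0) {
            if (errno == EINTR)   /* interrupted before any byte was written */
                continue;
            return -1;
        }
        done += (size_t)n;        /* partial write: continue after it */
    }
    return 0;
}
```

In write_result this would replace the bare write(fd, data, length) call, with the result checked and reported through perror on failure.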

Looks like you're hitting the internal file system's limitation for the iDevice: ios - Enterprise app with more than resource files of size 2GB
2Gb+ files are simply not possible. If you need to store such amount of data you should consider using some other tools or write the file chunk manager.

I'm going to go out on a limb here and say that your problem may lie in memset().
The best thing to do here is, I think, to verify the buffer after memset()-ing it:
for (unsigned long i = 0; i < 3000000000; i++) {
if (ttt[i] != 1) { printf("error in data at location %lu", i); break; }
}
Once you've validated that the data you're trying to write is correct, then you should look into writing a smaller file such as 1GB and see if you have the same problems. Eliminate each and every possible variable and you will find the answer.
