I am a beginner to C programming. I need to efficiently read millions of lines from a file into structs. Below is an example of the input file.
2,33.1609992980957,26.59000015258789,8.003999710083008
5,15.85200023651123,13.036999702453613,31.801000595092773
8,10.907999992370605,32.000999450683594,1.8459999561309814
11,28.3700008392334,31.650999069213867,13.107999801635742
My current code is shown below. It prints the "Error file" message, which suggests the file pointer is NULL, but the file does have data.
#include<stdio.h>
#include<stdlib.h>
struct O_DATA
{
int index;
float x;
float y;
float z;
};
int main ()
{
FILE *infile ;
struct O_DATA input;
infile = fopen("input.dat", "r");
if (infile == NULL);
{
fprintf(stderr,"\nError file\n");
exit(1);
}
while(fread(&input, sizeof(struct O_DATA), 1, infile))
printf("Index = %d X= %f Y=%f Z=%f", input.index , input.x , input.y , input.z);
fclose(infile);
return 0;
}
I need to efficiently read and store data from an input file in order to process it further. Any help would be really appreciated. Thanks in advance.
First figure out how to convert one line of text to data
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct my_data
{
unsigned int index;
float x;
float y;
float z;
};
struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)
{
    if (sscanf(input, "%u,%f,%f,%f", &data->index, &data->x, &data->y, &data->z) != 4)
        return NULL;
return data;
}
Or the same conversion done with strtok, which actually makes use of the separators argument:
struct my_data *
deserialize_data(struct my_data *data, const char *input, const char *separators)
{
char *p;
struct my_data tmp;
char *str = strdup(input); /* make a copy of the input line because we modify it */
if (!str) { /* I couldn't make a copy so I'll die */
return NULL;
}
p = strtok (str, separators); /* use line for first call to strtok */
if (!p) goto err;
tmp.index = strtoul (p, NULL, 0); /* convert text to integer */
p = strtok (NULL, separators); /* strtok remembers line */
if (!p) goto err;
tmp.x = atof(p);
p = strtok (NULL, separators);
if (!p) goto err;
tmp.y = atof(p);
p = strtok (NULL, separators);
if (!p) goto err;
tmp.z = atof(p);
memcpy(data, &tmp, sizeof(tmp)); /* copy values out */
goto out;
err:
data = NULL;
out:
free (str);
return data;
}
int main() {
struct my_data somedata;
deserialize_data(&somedata, "1,2.5,3.12,7.955", ",");
printf("index: %d, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
}
Combine it with reading lines from a file:
just the main function here (insert the rest from the previous example)
int
main(int argc, char *argv[])
{
FILE *stream;
char *line = NULL;
size_t len = 0;
ssize_t nread;
struct my_data somedata;
if (argc != 2) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
exit(EXIT_FAILURE);
}
stream = fopen(argv[1], "r");
if (stream == NULL) {
perror("fopen");
exit(EXIT_FAILURE);
}
while ((nread = getline(&line, &len, stream)) != -1) {
        if (deserialize_data(&somedata, line, ",") == NULL)
            continue;   /* skip lines that fail to parse */
        printf("index: %d, x: %2f, y: %2f, z: %2f\n", somedata.index, somedata.x, somedata.y, somedata.z);
}
free(line);
fclose(stream);
exit(EXIT_SUCCESS);
}
You've got an incorrect ; after your if (infile == NULL) test - try removing that...
[Edit: 2nd by 9 secs! :-)]
if (infile == NULL);
{ /* floating block */ }
The above if is a complete statement that does nothing regardless of the value of infile. The "floating" block is executed no matter what infile contains.
Remove the semicolon to 'attach' the "floating" block to the if
if (infile == NULL)
{ /* if block */ }
You already have solid responses in regard to syntax/structs/etc, but I will offer another method for reading the data in the file itself: I like Martin York's CSVIterator solution. This is my go-to approach for CSV processing because it requires less code to implement and has the added benefit of being easily modifiable (i.e., you can edit the CSVRow and CSVIterator defs depending on your needs).
Here's a mostly complete example using Martin's unedited code without structs or classes. In my opinion, and especially so as a beginner, it is easier to start developing your code with simpler techniques. As your code begins to take shape, it is much clearer why and where you need to implement more abstract/advanced devices.
Note this would technically need to be compiled with C++11 or greater because of my use of std::stod (and maybe some other stuff too I am forgetting), so take that into consideration:
//your includes, e.g.:
#include <vector>
#include <fstream>
#include <string>
#include <cstdio>
#include "wherever_CSVIterator_is.h"
int main (int argc, char* argv[])
{
    int index;
    std::vector<double*> saved;   //one heap-allocated triple of doubles per row
    std::vector<int> indices;
std::ifstream file(argv[1]);
for (CSVIterator loop(file); loop != CSVIterator(); ++loop) { //loop over rows
        index = std::stoi((*loop)[0]);      //int index, always col 0
        indices.push_back(index);           //store it first
        double* row = new double[3];        //since we know the shape of your input data
        for (int k = 1; k < (int)(*loop).size(); k++) { //loop across columns
            row[k-1] = std::stod((*loop)[k]);           //save double values now
        }
        saved.push_back(row);
}
/*now we have two vectors of the same 'size'
(let's pretend I wrote a check here to confirm this is true),
so we loop through them together and access with something like:*/
for (int j=0; j < (int)indices.size(); j++) {
double* saved_ptr = saved.at(j); //get pointer to first elem of each triplet
printf("\nindex: %g |", indices.at(j));
for (int k=0; k < 3; k++) {
printf(" %4.3f ", saved_ptr[k]);
}
printf("\n");
}
}
Less fuss to write, but more dangerous (the rows are raw new[] allocations that nothing here ever frees, so this leaks unless you clean them up). Also some unnecessary copying is present, but we benefit from using std::vector containers in lieu of knowing exactly how much memory we need to allocate.
Don't just give an example of the input file. Specify your input file format (at least on paper or in comments), e.g. in EBNF notation (since your example is textual, it is not a binary file). Decide whether the numbers have to be on different lines (or whether you would accept a file with a single huge line made of millions of bytes; read about the Comma Separated Values format). Then, code a parser for that format. In your case, some very simple recursive descent parsing is likely enough (and your particular parser won't even use recursion).
Read more about <stdio.h> and its routines. Take time to carefully read that documentation. Since your input is textual, not binary, you don't need fread. Notice that input routines can fail, and you should handle the failure case.
Of course, fopen can fail (e.g. because your working directory is not what you believe it is). You'd better use perror or errno to find out more about the cause of the failure. So at least code:
infile = fopen("input.dat", "r");
if (infile == NULL) {
perror("fopen input.dat");
exit(EXIT_FAILURE);
}
Notice that semicolons (or their absence) are very important in C (there should be no semicolon after the condition of an if). Read the basic syntax of the C language again. Read about How to debug small programs. Enable all warnings and debug info when compiling (with GCC, compile with gcc -Wall -g at least). The compiler warnings are very useful!
Remember that fscanf doesn't treat the end of line (newline) differently from a space character. So if the input is supposed to be on separate lines, you need to read each line separately.
You'll probably read every line using fgets (or getline) and parse each line individually. You could do that parsing with the help of sscanf (perhaps %n could be useful), and you want to use the return count of sscanf. You could also use strtok and/or strtod to do such parsing.
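For instance, a minimal sketch of that approach for the line format shown in the question (an integer index followed by three floats, comma separated), reusing the question's struct O_DATA:
#include <stdio.h>
#include <stdlib.h>
struct O_DATA { int index; float x, y, z; };
int main(void)
{
    FILE *infile = fopen("input.dat", "r");
    if (infile == NULL) {
        perror("fopen input.dat");
        exit(EXIT_FAILURE);
    }
    char line[256];
    struct O_DATA d;
    while (fgets(line, sizeof line, infile) != NULL) {
        int pos = 0;
        /* use the return count of sscanf; %n records how much of the line was consumed */
        if (sscanf(line, "%d,%f,%f,%f %n", &d.index, &d.x, &d.y, &d.z, &pos) == 4
                && line[pos] == '\0')
            printf("Index = %d X = %f Y = %f Z = %f\n", d.index, d.x, d.y, d.z);
        else
            fprintf(stderr, "bad line: %s", line);
    }
    fclose(infile);
    return 0;
}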
Make sure that your parsing and your entire program are correct. With current computers (they are very fast, and most of the time your input file sits in the page cache) it is very likely to be fast enough. A million lines can be read pretty quickly (if on Linux, you could compare your parsing time with the time wc needs to count the lines of your file). On my computer (a powerful Linux desktop with an AMD 2970WX processor (it has lots of cores, but your program uses only one), 64 GB of RAM, and an SSD) a million lines can be read (by wc) in less than 30 milliseconds, so I am guessing your entire program should run in less than half a second on a million lines of input, if the further processing is simple (in linear time).
You are likely to fill a large array of struct O_DATA and that array should probably be dynamically allocated, and reallocated when needed. Read more about C dynamic memory allocation. Read carefully about C memory management routines. They could fail, and you need to handle that failure (even if it is very unlikely to happen). You certainly don't want to re-allocate that array at every loop. You probably could allocate it in some geometrical progression (e.g. if the size of that array is size, you'll call realloc or a new malloc for some int newsize = 4*size/3 + 10; only when the old size is too small). Of course, your array will generally be a bit larger than what is really needed, but memory is quite cheap and you are allowed to "lose" some of it.
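A minimal sketch of that reallocation pattern, reusing the question's struct O_DATA and the growth formula suggested above (error handling is left to the caller):
#include <stdlib.h>
struct O_DATA { int index; float x, y, z; };
/* Append one record to a dynamically grown array.
   *parr, *psize and *pused are updated in place; returns 0 on success, -1 on failure. */
static int append_record(struct O_DATA **parr, size_t *psize, size_t *pused,
                         struct O_DATA rec)
{
    if (*pused == *psize) {                     /* array is full: grow it geometrically */
        size_t newsize = 4 * *psize / 3 + 10;
        struct O_DATA *tmp = realloc(*parr, newsize * sizeof *tmp);
        if (tmp == NULL)
            return -1;                          /* the old array is still valid */
        *parr = tmp;
        *psize = newsize;
    }
    (*parr)[(*pused)++] = rec;
    return 0;
}
Each successfully parsed line would then be stored with something like append_record(&arr, &size, &used, input); and the whole array released once with free(arr) when you are done.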
But StackOverflow is not a "do my homework" site. I gave some advice above, but you should do your homework.
Related
My application needs to read a few thousand lines from a large csv file of around 300GB with billions of lines, where each line contains several numbers. The data look like this:
1, 34, 56, 67, 678, 23462, ...
2, 3, 6, 8, 34, 5
23,547, 648, 34657 ...
...
...
I tried fgets to read the file line by line in C, but it took really, really long; even wc -l in Linux took quite a while just to read all of the lines.
I also tried to write all the data to an sqlite3 database based on the logic of the application. However, the data structure there is different from the csv file above: it has 100 billion lines, with only two numbers per line. I then created two indices on top of them, which resulted in a 2.5 TB database, while it was 1 TB without indices. Since the indices are larger than the data, and a query has to read the whole 1.5 TB of indices, I think it doesn't make sense to use the database method, right?
So I would like to ask: what is the quickest way to read a few specific lines from a large csv file with billions of lines, in C or Python? And by the way, is there any formula or rule of thumb for estimating the time it takes to read the file, given the amount of RAM?
environment: linux, RAM 200GB, C, python
Requirements
huge csv file, several hundred GB in size
each line contains several numbers
the program must extract several thousand lines per run
the program works several times with the same file, only different lines should be extracted
Since lines in the csv files have a variable length, you would have to read the entire file to get the data of the required lines. Sequential reading of the entire file would still be very slow - even if you optimized the file reading as much as possible. A good indicator is actually the runtime of wc -l, as mentioned already by the OP in the question.
Instead, one should optimize on the algorithmic level. A one-time preprocessing of the data is necessary, which then allows fast access to certain lines - without reading the whole file.
There are several possible ways, for example:
Using a database with an index
programmatic creation of an index file (association of line numbers with file offsets)
convert the csv file into a binary file with fixed format
The OP's test shows that approach 1) led to 1.5 TB of indices. Method 2), creating a small program that associates a line number with a file offset, is certainly also a possibility. Finally, approach 3 allows the file offset for a line number to be calculated without the need for a separate index file. This approach is especially useful if the maximum number of numbers per line is known. Otherwise, approach 2 and approach 3 are very similar.
Approach 3 is explained in more detail below. There may be additional requirements that require the approach to be slightly modified, but the following should get things started.
A one-time pre-processing is necessary. The textual csv lines are converted into int arrays, using a fixed record format to store the ints in binary form in a separate file. To then read a particular line n, you can simply calculate the file offset, e.g. with line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE);. Finally, jump to this offset with fseeko(fp, offset, SEEK_SET); and read MAX_ELEMENTS_PER_LINE ints. So you only need to read the data that you actually want to process.
This has not only the advantage that the program runs much faster, it also requires very little main memory.
Test case
A test file with 3,000,000,000 lines was created. Each line contains up to 10 random int numbers, separated by a comma.
In this case this gave a csv file with about 342 GB of data.
A quick test with
time wc -l numbers.csv
gives
187.14s user 74.55s system 96% cpu 4:31.48 total
This means that it would take a total of at least 4.5 minutes if a sequential file read approach were used.
For one-time preprocessing, a converter program reads each line and stores 10 binary ints per line. The converted file is called 'numbers_bin'. A quick test with access to the data of 10,000 randomly selected rows:
time demo numbers_bin
gives
0.03s user 0.20s system 5% cpu 4.105 total
So instead of 4.5 minutes, it takes 4.1 seconds for this specific example data. That is more than a factor of 65 faster.
Source Code
This approach may sound more complicated than it actually is.
Let's start with the converter program. It reads the csv file and creates a binary fixed format file.
The interesting part takes place in the function pre_process: there a line is read in a loop with 'getline', the numbers are extracted with 'strtok' and 'strtol' and put into an int array initialized with 0. Finally this array is written to the output file with 'fwrite'.
Errors during the conversion result in a message on stderr and the program is terminated.
convert.c
#include "data.h"
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
static void pre_process(FILE *in, FILE *out) {
int *block = get_buffer();
char *line = NULL;
size_t line_capp = 0;
while (getline(&line, &line_capp, in) > 0) {
line[strcspn(line, "\n")] = '\0';
memset(block, 0, sizeof(int) * MAX_ELEMENTS_PER_LINE);
char *token;
char *ptr = line;
int i = 0;
while ((token = strtok(ptr, ", ")) != NULL) {
if (i >= MAX_ELEMENTS_PER_LINE) {
fprintf(stderr, "too many elements in line");
exit(EXIT_FAILURE);
}
char *end_ptr;
errno = 0;
long val = strtol(token, &end_ptr, 10);
if (val > INT_MAX || val < INT_MIN || errno || *end_ptr != '\0' || end_ptr == token) {
fprintf(stderr, "value error with '%s'\n", token);
exit(EXIT_FAILURE);
}
ptr = NULL;
block[i] = (int) val;
i++;
}
fwrite(block, sizeof(int), MAX_ELEMENTS_PER_LINE, out);
}
free(block);
free(line);
}
static void one_off_pre_processing(const char *csv_in, const char *bin_out) {
FILE *in = get_file(csv_in, "rb");
FILE *out = get_file(bin_out, "wb");
pre_process(in, out);
fclose(in);
fclose(out);
}
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "usage: convert <in> <out>\n");
exit(EXIT_FAILURE);
}
one_off_pre_processing(argv[1], argv[2]);
return EXIT_SUCCESS;
}
Data.h
A few auxiliary functions are used. They are more or less self-explanatory.
#ifndef DATA_H
#define DATA_H
#include <stdio.h>
#include <stdint.h>
#define NUM_LINES 3000000000LL
#define MAX_ELEMENTS_PER_LINE 10
void read_data(FILE *fp, uint64_t line_nr, int *block);
FILE *get_file(const char *const file_name, char *mode);
int *get_buffer();
#endif //DATA_H
Data.c
#include "data.h"
#include <stdlib.h>
void read_data(FILE *fp, uint64_t line_nr, int *block) {
off_t offset = line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE);
fseeko(fp, offset, SEEK_SET);
if(fread(block, sizeof(int), MAX_ELEMENTS_PER_LINE, fp) != MAX_ELEMENTS_PER_LINE) {
fprintf(stderr, "data read error for line %lld", line_nr);
exit(EXIT_FAILURE);
}
}
FILE *get_file(const char *const file_name, char *mode) {
FILE *fp;
if ((fp = fopen(file_name, mode)) == NULL) {
perror(file_name);
exit(EXIT_FAILURE);
}
return fp;
}
int *get_buffer() {
int *block = malloc(sizeof(int) * MAX_ELEMENTS_PER_LINE);
if(block == NULL) {
perror("malloc failed");
exit(EXIT_FAILURE);
}
return block;
}
demo.c
And finally a demo program that reads the data for 10,000 randomly determined lines.
The function request_lines determines 10,000 random lines. The lines are sorted with qsort. The data for these lines is read. Some lines of the code are commented out. If you comment them out, the read data is output to the debug console.
#include "data.h"
#include <stdlib.h>
#include <assert.h>
#include <sys/stat.h>
static int comp(const void *lhs, const void *rhs) {
uint64_t l = *((uint64_t *) lhs);
uint64_t r = *((uint64_t *) rhs);
if (l > r) return 1;
if (l < r) return -1;
return 0;
}
static uint64_t *request_lines(uint64_t num_lines, int num_request_lines) {
assert(num_lines < UINT32_MAX);
uint64_t *request_lines = malloc(sizeof(*request_lines) * num_request_lines);
for (int i = 0; i < num_request_lines; i++) {
request_lines[i] = arc4random_uniform(num_lines);
}
qsort(request_lines, num_request_lines, sizeof(*request_lines), comp);
return request_lines;
}
#define REQUEST_LINES 10000
int main(int argc, char *argv[]) {
if (argc != 2) {
fprintf(stderr, "usage: demo <file>\n");
exit(EXIT_FAILURE);
}
struct stat stat_buf;
if (stat(argv[1], &stat_buf) == -1) {
perror(argv[1]);
exit(EXIT_FAILURE);
}
uint64_t num_lines = stat_buf.st_size / (MAX_ELEMENTS_PER_LINE * sizeof(int));
FILE *bin = get_file(argv[1], "rb");
int *block = get_buffer();
uint64_t *requests = request_lines(num_lines, REQUEST_LINES);
for (int i = 0; i < REQUEST_LINES; i++) {
read_data(bin, requests[i], block);
//do sth with the data,
//uncomment the following lines to output the data to the console
// printf("%llu: ", requests[i]);
// for (int x = 0; x < MAX_ELEMENTS_PER_LINE; x++) {
// printf("'%d' ", block[x]);
// }
// printf("\n");
}
free(requests);
free(block);
fclose(bin);
return EXIT_SUCCESS;
}
Summary
This approach provides much faster results than reading through the entire file sequentially (4 seconds instead of 4.5 minutes per run for the sample data). It also requires very little main memory.
The prerequisite is the one-time pre-processing of the data into a binary format. This conversion is quite time-consuming, but the data for certain rows can be read very quickly afterwards using a query program.
I want to implement a function in my program that sends a .txt file to my email with some tasks that I have to do during the day. Here's the code:
void txtCreator(){
/**file.dat and file.txt, respectively**/
FILE *fp, *fp1;
/**struct that contain the events**/
struct evento *display = (struct evento *)malloc(sizeof(struct evento));
char buffer[48];
char email_events[] = {"dd_mm.txt"};//filename.txt
char msg[]={"Nao ha eventos disponiveis para hoje!\n"};
int count=0;
time_t rawtime;
time(&rawtime);
struct tm timenow = *localtime(&rawtime);
strftime(buffer, 48, "%d_%m", &timenow);
fp = fopen(file_name, "rb");
fp1 = fopen(email_events, "w");
if(strcmp(buffer, email_events)!=0){
strcpy(email_events, buffer);
while(fread(display, sizeof(struct evento), 1, fp)==1){
if (feof(fp) || fp==NULL){
break;
}
else if(display->dia==timenow.tm_mday && display->mes==timenow.tm_mon+1){
fwrite(display, sizeof(struct evento), 1, fp1);
fprintf(fp1, "%s", "\n");
count++;
}
}
}
if(count==0){
fprintf(fp1, "%s", msg);
}
fclose(fp);
fclose(fp1);
}
Everything is working just fine, but there are two problems:
1-
strcpy(email_events, buffer);
is not working, and:
2-
when I create the .txt file, it looks like this:
test ¹0(¹€(.v™ ™ °'¹8¹uguese_Brazil.12
it shows the event name (test) correctly, but the date is not working.
I've tried a lot of things, but nothing works.
Sorry for the bad english, not my native language.
when I create the .txt file, it looks like this:
test ¹0(¹€(.v™ ™ °'¹8¹uguese_Brazil.12
Let's address this first: you're not writing text to your .txt file. You're writing a struct. It's going to look like garbage.
For example, let's say display->dia is 19. This means the binary number 19 will be written to the file, not the text "19". Read back as text, byte 19 is garbage (a control character); byte 10 is a newline; byte 65 is A.
If your intent is to dump the structs into a file, assuming struct evento has no pointers, you're good. In fact you probably shouldn't add a newline, it will interfere with reading the file by the size of the struct.
If your intent is to produce a human readable text file, you need to translate each piece of the struct into text. For example, if you wanted to write the day and month as text...
fprintf(fp1, "%d_%d", display->dia, display->mes);
I'll assume that going forward.
strcpy(email_events, buffer); is not working
At first glance it looks like your strcpy is backwards: it's strcpy(dest, src), and presumably you want to copy email_events into buffer: strcpy(buffer, email_events).
Looking further, your code does nothing with either buffer or email_events after that. The strcpy is pointless.
Going even further, buffer is the day and month, like 19_07. email_events is always dd_mm.txt. Those will never match, so strcmp(buffer, email_events)!=0 will always be true, making the if check pointless.
I'm not sure what the intent of buffer and email_events is, but it appears to be an attempt to create a filename based on the current date? This can be done much more simply with one better-named variable, outfile.
time_t rawtime;
time(&rawtime);
struct tm timenow = *localtime(&rawtime);
char outfile[20];
strftime(outfile, 20, "%d_%m.txt", &timenow);
Moving along to other problems, you don't check that fp1 opened.
You do eventually check fp but you check it after you've already read from a possibly null file pointer. If you're compiling with an address sanitizer (which you should) it will cause an error. Causing an error when using a null pointer is good, it will solve many a mystery memory problem for you.
It's much easier, more robust, and address-sanitizer friendly to check immediately. We can also do a better job naming the file pointers to avoid confusing the input with the output: in and out.
FILE *in = fopen(file_name, "rb");
if( in == NULL ) {
perror(file_name);
exit(1);
}
And since you're reading binary with rb, you should be writing binary with wb. This only matters on Windows, but you might as well be consistent.
FILE *out = fopen(outfile, "wb");
if( out == NULL ) {
perror(outfile);
exit(1);
}
There's no need to check feof(fp), while(fread(display, sizeof(struct evento), 1, fp)==1) will already exit the loop at end of file when it fails to read. In general, explicitly checking for the end of file leads to subtle problems.
The read/write loop is now much simpler.
while(fread(display, sizeof(struct evento), 1, in)==1){
if(display->dia==timenow.tm_mday && display->mes==timenow.tm_mon+1) {
fprintf(out, "%d_%d\n", display->dia, display->mes);
count++;
}
}
Putting it all together...
void txtCreator(){
const char *no_events_found_msg = "Nao ha eventos disponiveis para hoje!\n";
// No need to cast the result of malloc, it just invites mistakes.
struct evento *display = malloc(sizeof(struct evento));
// Generate the output filename directly, no strcmp and strcpy necessary.
time_t rawtime;
time(&rawtime);
struct tm timenow = *localtime(&rawtime);
char outfile[20];
    strftime(outfile, sizeof(outfile), "%d_%m.txt", &timenow);
    // Immediately make sure the files are open, and report errors right away.
FILE *in = fopen(file_name, "rb");
if( in == NULL ) {
perror(file_name);
exit(1);
}
FILE *out = fopen(outfile, "wb");
if( out == NULL ) {
perror(outfile);
exit(1);
}
// Now that we know the files are open, reading and writing is much simpler.
int count=0;
while(fread(display, sizeof(struct evento), 1, in)==1){
if(display->dia==timenow.tm_mday && display->mes==timenow.tm_mon+1) {
fprintf(out, "%d_%d\n", display->dia, display->mes);
count++;
}
}
if(count==0){
fprintf(out, "%s", no_events_found_msg);
}
fclose(in);
fclose(out);
}
Note that I've used a style which declares variables in place. This makes the code easier to read, limits the scope of each variable, and it avoids declaring variables which you never use.
Assuming that you mean to copy email_events into your buffer (since you assigned it a static string), your strcpy parameters are backwards.
Below is the declaration of strcpy
char *strcpy(char *dest, const char *src);
You probably meant:
strcpy(buffer, email_events);
I have to save some graph data (an array of structs) into a text file. I made a working program using fprintf, but for extra points I need it to be faster. I have spent a couple of hours googling whether there is anything faster and tried to use fwrite (but I wasn't able to fwrite as text); I cannot really find any other functions.
This is my write function using fprintf:
void save_txt(const graph_t * const graph, const char *fname)
{
int count = graph->num_edges, i = 0;
FILE *f = fopen(fname, "w");
while (count > 0) {
int r = fprintf(f, "%d %d %d\n", (graph->edges[i].from), (graph->edges[i].to), (graph->edges[i].cost));
i++;
if (r >= 6) {
count -= 1;
} else {
break;
}
}
if (f) {
fclose(f);
}
}
I would try setting a write buffer on the stream, and experimenting with different buffer sizes (e.g. 1K, 2K, 4K, 8K and so on). Notice that by default your file already uses a buffer of BUFSIZ bytes, and that might already be enough.
#define BUFFERSIZE 0x1000
void save_txt(const graph_t * const graph, const char *fname)
{
int count = graph->num_edges, i = 0;
    char buf[BUFFERSIZE];
FILE *f = fopen(fname, "w");
setvbuf(f, buf, _IOFBF, BUFFERSIZE);
...
The output file f is born with the default BUFSIZ cache, so it might benefit from a larger fully buffered write cache.
Of course this assumes that you're writing to a relatively slow medium and that the time spent saving is relevant; otherwise, whatever is slowing you down is not here, and therefore increasing save performances won't help you appreciably.
There are instrumentations like prof and gprof that can help you determine where your program is spending the most time.
One, much more awkward, possibility is merging Kiwi's answer with a buffered write call to avoid the code in printf that verifies which format to use, since you already know this, and to use as few I/O calls as possible (even just one if BUFFERSIZE is larger than your destination file's length).
// These variables must now be global, declared outside save_txt.
char kiwiBuf[BUFFERSIZE];
size_t kiwiPtr = 0;
FILE *f;
void flushBuffer(void);   /* forward declaration; defined below */
void my_putchar(char c) {
kiwiBuf[kiwiPtr++] = c;
// Is the buffer full?
if (kiwiPtr == BUFFERSIZE) {
// Yes, empty the buffer into the file.
flushBuffer();
}
}
void flushBuffer() {
if (kiwiPtr) {
fwrite(kiwiBuf, kiwiPtr, 1, f);
kiwiPtr = 0;
}
}
You need now to flush the buffer before close:
void save_txt(const graph_t * const graph, const char *fname)
{
int i, count = graph->num_edges;
f = fopen(fname, "w");
if (NULL == f) {
fprintf(stderr, "Error opening %s\n", fname);
exit(-1);
}
for (i = 0; i < count; i++) {
my_put_nbr(graph->edges[i].from);
my_putchar(' ');
my_put_nbr(graph->edges[i].to);
my_putchar(' ');
my_put_nbr(graph->edges[i].cost);
my_putchar('\n');
}
flushBuffer();
fclose(f);
}
UPDATE
By declaring the my_putchar function as inline and with a 4K buffer, the above code (modified with a mock of graph reading from an array of random integers) is around 6x faster than fprintf on
Linux mintaka 4.12.8-1-default #1 SMP PREEMPT Thu Aug 17 05:30:12 UTC 2017 (4d7933a) x86_64 x86_64 x86_64 GNU/Linux
gcc version 7.1.1 20170629 [gcc-7-branch revision 249772] (SUSE Linux)
About 2x of that seems to come from buffering. Andrew Henle made me notice an error in my code: I was comparing results to a baseline of unbuffered output, but fopen uses by default a BUFSIZ value and on my system BUFSIZ is 8192. So basically I've "discovered" just that:
there is no advantage on a 8K buffer, 4K is enough
my original suggestion of using _IOFBF is utterly worthless, as the system already does it for you. This in turn means that Kiwi's answer is the most correct, because, as Andrew pointed out, it avoids printf's checks and conversions.
Also, the overall increase (google Amdahl's Law) depends on what fraction of the processing time goes into saving. Clearly, if one hour of computation requires one second of saving, doubling the saving speed saves you half a second, while increasing the computation speed by 1% saves you 36 seconds, or 72 times more.
My own sample code was designed to be completely save-oriented with very large graphs; in this situation, any small improvement in writing speed reaps potentially huge rewards, which might be unrealistic in the real-world case.
Also (in answer to a comment), while using a small enough buffer will slow saving down, it is not at all certain that using a larger buffer will help. Say the whole graph generates 1.2 KB of output in its entirety; then of course any buffer size above 1.2 KB will yield no improvement. Actually, allocating more memory might negatively impact performance.
I would write a small function, say print_graph(int, int, int), and call write directly in it, or something like the following, with my_putchar being a write call:
int my_put_nbr(int nb)
{
if (nb < 0)
{
my_putchar('-');
nb = -nb;
}
if (nb <= 9)
my_putchar(nb + 48);
else
{
my_put_nbr(nb / 10);
my_put_nbr(nb % 10);
}
return (0);
}
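Here my_putchar could be, for example, a thin wrapper around the write system call (a sketch only; unbuffered, so it pays to combine it with the buffered my_putchar from the previous answer):
#include <unistd.h>
/* my_putchar as a bare write(2) call; 1 is stdout, or use fileno(f) for the output file */
void my_putchar(char c)
{
    if (write(1, &c, 1) != 1) {
        /* short write or error; handle as needed */
    }
}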
I had to be 1.3x faster than fprintf; here is code that worked for me. I have to say that I had to submit it multiple times; sometimes I passed only 1 out of 5 tests with the same code. In conclusion, it is faster than fprintf, but not reliably 1.3 times faster.
void save_txt(const graph_t * const graph, const char *fname)
{
int count = graph->num_edges, i = 0;
char c = '\n';
char d = ' ';
char buffer[15];
FILE *f = fopen(fname, "w");
while (count > 0) {
itoa(graph->edges[i].from,buffer,10);
fputs(buffer, f);
putc(d, f);
itoa(graph->edges[i].to,buffer,10);
fputs(buffer, f);
putc(d, f);
itoa(graph->edges[i].cost,buffer,10);
fputs(buffer, f);
putc(c, f);
i++;
count -= 1;
}
if (f) {
fclose(f);
}
}
For a class, I've been given the task of writing radix sort in parallel using pthreads, openmp, and MPI. My language of choice in this case is C -- I don't know C++ too well.
Anyways, the way I'm going about reading a text file is causing a segmentation fault at around 500MB file size. The files are line separated 32 bit numbers:
12351
1235234
12
53421
1234
I know C, but I don't know it well; I use things I know, and in this case the things I know are terribly inefficient. My code for reading the text file is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
int main(int argc, char **argv){
if(argc != 4) {
printf("rs_pthreads requires three arguments to run\n");
return -1;
}
char *fileName=argv[1];
uint32_t radixBits=atoi(argv[2]);
uint32_t numThreads=atoi(argv[3]);
if(radixBits > 32){
printf("radixBitx cannot be greater than 32\n");
return -1;
}
FILE *fileForReading = fopen( fileName, "r" );
if(fileForReading == NULL){
perror("Failed to open the file\n");
return -1;
}
char* charBuff = malloc(1024);
if(charBuff == NULL){
perror("Error with malloc for charBuff");
return -1;
}
uint32_t numNumbers = 0;
while(fgetc(fileForReading) != EOF){
numNumbers++;
fgets(charBuff, 1024, fileForReading);
}
uint32_t numbersToSort[numNumbers];
rewind(fileForReading);
int location;
for(location = 0; location < numNumbers; location++){
fgets(charBuff, 1024, fileForReading);
numbersToSort[location] = atoi(charBuff);
}
With a file of 50 million numbers (~500MB), I'm getting a segmentation fault at rewind, of all places. My knowledge of how file streams work is almost non-existent. My guess is that it's trying to malloc without enough memory or something, but I don't know.
So, I've got a two-parter here: how is rewind segfaulting? Am I just doing a poor job before rewind and failing to check some system call I should be checking?
And, what is a more efficient way to read in an arbitrary amount of numbers from a text file?
Any help is appreciated.
I think the most likely cause here is (ironically enough) a stack overflow. Your numbersToSort array is allocated on the stack, and the stack has a fixed size (varies by compiler and operating system, but 1 MB is a typical number). You should dynamically allocate numbersToSort on the heap (which has much more available space) using malloc():
uint32_t *numbersToSort = malloc(sizeof(uint32_t) * numNumbers);
Don't forget to deallocate it later:
free(numbersToSort);
I would also point out that your first-pass loop, which is intended to count the number of lines, will fail if there are any blank lines. This is because on a blank line, the first character is '\n', and fgetc() will consume it; the next call to fgets() will then be reading the following line, and you'll have skipped the blank one in your count.
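One way to avoid that miscount (a sketch, not the only possible fix) is to do the counting pass with fgets() alone, so no character is consumed separately:
/* counting pass using only fgets(), so blank lines are not silently skipped */
uint32_t numNumbers = 0;
while (fgets(charBuff, 1024, fileForReading) != NULL)
    numNumbers++;
This counts blank lines as entries too, so you may still want to skip lines that contain no digits before calling atoi on them.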
The problem is in this line
uint32_t numbersToSort[numNumbers];
You are attempting to allocate a huge array on the stack, and the stack size is limited (typically a few megabytes; moreover, older C standards don't allow variable-length arrays). So you can try this:
uint32_t *numbersToSort; /* Declare it with other declarations */
/* Remove uint32_t numbersToSort[numNumbers]; */
/* Add the code below */
numbersToSort = malloc(sizeof(uint32_t) * numNumbers);
if (!numbersToSort) {
/* No memory; do cleanup and bail out */
return 1;
}
Hmm, I wonder whether there is a way to read a FILE faster than using fscanf().
For example suppose that i have this text
4
55 k
52 o
24 l
523 i
First I want to read the first number, which gives the number of following lines.
Let this number be called N.
After that, I want to read N lines, each of which has an integer and a character.
With fscanf it would be like this:
fscanf(fin,"%d %c",&a,&c);
You do almost no processing, so the bottleneck is probably the file system throughput. However, you should measure first whether it really is. If you don't want to use a profiler, you can just measure the running time of your application. The size of the input file divided by the running time can be used to check whether you've reached the file system throughput limit.
Then if you are far away from aforementioned limit you probably need to optimize the way you read the file. It may be better to read it in larger chunks using fread() and then process the buffer stored in memory with sscanf().
You can also parse the buffer yourself, which would be faster than *scanf().
[edit]
Especially for Drakosha:
$ time ./main1
Good entries: 10000000
real 0m3.732s
user 0m3.531s
sys 0m0.109s
$ time ./main2
Good entries: 10000000
real 0m0.605s
user 0m0.496s
sys 0m0.094s
So the optimized version runs at ~127 MB/s, which may be my file system's throughput limit, or maybe the OS caches the file in RAM. The original version runs at ~20 MB/s.
Tested with a 80MB file:
10000000
1234 a
1234 a
...
main1.c
#include <stdio.h>
int ok = 0;
void processEntry(int a, char c) {
if (a == 1234 && c == 'a') {
++ok;
}
}
int main(int argc, char **argv) {
FILE *f = fopen("data.txt", "r");
int total = 0;
int a;
char c;
int i = 0;
fscanf(f, "%d", &total);
for (i = 0; i < total; ++i) {
if (2 != fscanf(f, "%d %c", &a, &c)) {
fclose(f);
return 1;
}
processEntry(a, c);
}
fclose(f);
printf("Good entries: %d\n", ok);
return (ok == total) ? 0 : 1;
}
main2.c
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>   /* isdigit, isalpha */
int ok = 0;
void processEntry(int a, char c) {
if (a == 1234 && c == 'a') {
++ok;
}
}
int main(int argc, char **argv) {
FILE *f = fopen("data.txt", "r");
int total = 0;
int a;
char c;
int i = 0;
char *numberPtr = NULL;
char buf[2048];
size_t toProcess = sizeof(buf);
int state = 0;
int fileLength, lengthLeft;
fseek(f, 0, SEEK_END);
fileLength = ftell(f);
fseek(f, 0, SEEK_SET);
fscanf(f, "%d", &total); // read the first line
lengthLeft = fileLength - ftell(f);
// read other lines using FSM
do {
if (lengthLeft < sizeof(buf)) {
fread(buf, lengthLeft, 1, f);
toProcess = lengthLeft;
} else {
fread(buf, sizeof(buf), 1, f);
toProcess = sizeof(buf);
}
lengthLeft -= toProcess;
for (i = 0; i < toProcess; ++i) {
switch (state) {
case 0:
if (isdigit(buf[i])) {
state = 1;
a = buf[i] - '0';
}
break;
case 1:
if (isdigit(buf[i])) {
a = a * 10 + buf[i] - '0';
} else {
state = 2;
}
break;
case 2:
if (isalpha(buf[i])) {
state = 0;
c = buf[i];
processEntry(a, c);
}
break;
}
}
} while (toProcess == sizeof(buf));
fclose(f);
printf("Good entries: %d\n", ok);
return (ok == total) ? 0 : 1;
}
It is unlikely that you can significantly speed up the actual reading of the data. Most of the time here will be spent on transferring the data from disk to memory, which is unavoidable.
You might get a little speed-up by replacing the fscanf call with fgets and then manually parsing the string (with strtol) to bypass the format-string parsing that fscanf has to do, but don't expect any huge savings.
In the end, it is usually not worth it to heavily optimise I/O operations, because they will typically be dominated by the time it takes to transfer the actual data to/from the hardware/peripherals.
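A sketch of that fgets-plus-strtol replacement for the fscanf call above (assuming the same per-line format of an integer followed by a single character):
#include <stdio.h>
#include <stdlib.h>
/* parse one "number letter" line with fgets + strtol instead of fscanf;
   returns 1 on success, 0 on end of file or error */
static int read_entry(FILE *f, int *a, char *c)
{
    char line[64];
    if (fgets(line, sizeof line, f) == NULL)
        return 0;
    char *end;
    *a = (int) strtol(line, &end, 10);  /* the number */
    while (*end == ' ')                 /* skip the separating spaces */
        end++;
    *c = *end;                          /* the character */
    return 1;
}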
As usual, start with profiling to make sure this part is indeed a bottleneck. Actually, the filesystem cache should make the small reads you are doing fairly cheap; however, reading larger parts of the file into memory and then operating on that memory might be (a little) faster.
In the (I believe extremely improbable) case that you need to save every CPU cycle, you could write your own fscanf variant, since you know the format of the string and only need to support that one case. But this improvement would also bring small gains, especially on modern CPUs.
The input looks like the input used in various programming contests. In that case, optimize the algorithm, not the reading.
fgets() or fgetc() are faster, as they don't need to drag the whole formatting/variable-argument-list ballet of fscanf() into the program. Either of those two functions will leave you doing the character-to-integer conversion by hand, however. Still, the program as a whole will be much faster.
There is not much hope of reading the file faster, since that ultimately comes down to system calls. But there are many ways to parse it faster than scanf with specialised code.
Check out read and fread. Since you are practising for programming contests, you can ignore all warnings about the disk I/O bottleneck, because the files may be in memory or piped from other processes generating tests on the fly.
Put your test files into /dev/shm (the newer tmpfs location) or write a test generator and pipe its output.
I've found in programming contests that parsing numbers atoi-style can give a big performance boost over scanf/fscanf (atoi might not be available, so be prepared to implement it by hand; it's easy).
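A hand-rolled parser of that kind can be as simple as the following sketch (non-negative decimal numbers only, no overflow checking):
#include <ctype.h>
/* atoi-style parser: reads one non-negative decimal number and advances *p past it */
static int parse_uint(const char **p)
{
    int n = 0;
    while (isdigit((unsigned char)**p)) {
        n = n * 10 + (**p - '0');
        (*p)++;
    }
    return n;
}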