I am trying to read a file, count how many bytes it contains, round that up to the nearest GB, and then double the file size. However, is there a way to read the file and then do all of this back into the same file?
Here is what I have so far. It creates a new file with the new contents, but I'm not sure if my logic is correct.
Also, do you create a constant like BYTE with #define?
So far, as a test case, I have just used an int and set it equal to 50.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// #define BYTE 50

int main()
{
    FILE *fp1, *fp2;
    int ch1;
    clock_t elapsed;
    char fname1[40], fname2[40];
    char a;

    printf("Enter name of the file:");
    fgets(fname1, 40, stdin);
    while (fname1[strlen(fname1) - 1] == '\n')
    {
        fname1[strlen(fname1) - 1] = '\0';
    }
    fp1 = fopen(fname1, "r");
    if (fp1 == NULL)
    {
        printf("Cannot open %s for reading\n", fname1);
        exit(1);
    }
    printf("This program will round up the current file into highest GB, and then double it");
    elapsed = clock(); // get starting time
    ch1 = getc(fp1);   // read a value from the file
    int num = 50;
    int bytes = 0;
    while (1) // keep reading; only end when the file reaches its end
    {
        ch1 = getc(fp1);
        bytes++;
        if (ch1 == EOF) // if the file reaches the end, then it's over!
        {
            break;
        }
    }
    // 1,000,000,000 bytes in a GB
    int nextInt = bytes % num;
    // example: 2.0GB 2,000,000,000 - 1.3GB 1,300,000,000 = 700,000,000 OR same thing as 2,000,000,000 % 1,300,000,000 = 700,000,000
    int counter = 0;
    printf("Enter name of the file you would like to create:");
    fgets(fname2, 40, stdin);
    while (fname2[strlen(fname2) - 1] == '\n')
    {
        fname2[strlen(fname2) - 1] = '\0';
    }
    fp2 = fopen(fname2, "w");
    if (fp1 == NULL)
    {
        printf("Cannot open %s for reading\n", fname2);
        exit(1);
    }
    if (fp2 == NULL)
    {
        puts("Not able to open this file");
        fclose(fp1);
        exit(1);
    }
    while (counter != nextInt)
    {
        a = fgetc(fp1);
        fputc(a, fp2);
        counter++;
    }
    fclose(fp1); // close files
    fclose(fp2);
    printf("Total number of bytes in the file %u: ", bytes);
    printf("Round up the next GB %d: ", nextInt);
    elapsed = clock() - elapsed; // elapsed time
    printf("That took %.4f seconds\n", (float)elapsed / CLOCKS_PER_SEC);
    return 0;
}
You increment bytes before you check for EOF, so you have an off-by-one error.
However, reading a file byte by byte is a slow way of finding its size. Using standard C, you may be able to use fseek() to the end followed by ftell(), at least if you're on a 64-bit Unix-like machine where long is 64 bits. Otherwise you're working too close to the limits of what fits in 32-bit values, and using a plain int for bytes is going to run into trouble.
Alternatively, and better, you can use stat() or fstat() to get the exact size directly.
When it comes to doubling the size of the file, you could simply seek to the new end position and write a byte at that position. However, that does not allocate all the disk space (on a Unix machine); it will be a sparse file.
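A minimal sketch of that seek-and-extend approach (extend_to_size and its error handling are my own illustration, untested):

#include <stdio.h>

// Extend the file to new_size bytes by seeking past the old end and
// writing a single byte; on Unix the gap becomes a sparse hole.
int extend_to_size(const char *name, long new_size)
{
    FILE *fp = fopen(name, "r+b");      // update mode, binary
    if (fp == NULL)
        return -1;
    if (fseek(fp, new_size - 1, SEEK_SET) != 0 || fputc(0, fp) == EOF)
    {
        fclose(fp);
        return -1;
    }
    return fclose(fp);
}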
To rewrite the data, you need to know how your system handles two open file streams on a single file. On Unix-like systems, you can open the original file once for reading and once for writing in append mode. You could then read large chunks (64 KiB? 256 KiB?) of data at a time from the reading stream and write them to the appending stream. However, you need to keep track of how much data to copy, because the reading side won't conveniently hit EOF while you keep appending to the file.
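A sketch of that chunked rewrite, assuming a Unix-like system where the same file can safely be open twice (double_by_append is an illustrative name, untested):

#include <stdio.h>

// Append a copy of the file's first `size` bytes to its own end,
// reading 64 KiB at a time and counting down so we never rely on EOF.
int double_by_append(const char *name, long size)
{
    char buf[64 * 1024];
    FILE *in = fopen(name, "rb");
    FILE *out = fopen(name, "ab");      // same file, append mode
    long left = size;
    if (in == NULL || out == NULL)
    {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while (left > 0)
    {
        size_t want = (left < (long)sizeof(buf)) ? (size_t)left : sizeof(buf);
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            break;                      // unexpected short read
        fwrite(buf, 1, got, out);
        left -= (long)got;
    }
    fclose(in);
    fclose(out);
    return (left == 0) ? 0 : -1;
}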
Your code is going to write a lot of 0xFF bytes to the tail of the file on most systems (where EOF is recorded as -1).
Note that there are gibibytes GiB (2^30 = 1,073,741,824 bytes) and gigabytes GB (officially 10^9 = 1,000,000,000 bytes, but not infrequently used to mean GiB). See Wikipedia on Binary prefix, etc.
You're working way too hard. I'll assume your OS is Windows or Linux.
On Windows, _stat will get the exact length of a file; on Linux it's stat. Both read this from file system metadata, so it's almost instantaneous.
On Windows, _chsize will extend the file to any number of bytes; on Linux it's ftruncate. The OS will be writing zeros to the extension, so it will be a fast write indeed.
In all cases it's simple to find the documentation by searching.
The code will be straight-line (no loops), about 10 lines.
Rounding up to the next GB is simply done with
#define GIGA ((size_t)1 << 30)
size_t new_size = (old_size + GIGA - 1) & ~(GIGA - 1);
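Put together, a minimal Linux sketch of the whole job might look like this (untested; the Windows version would use _stat and _chsize instead, and I've used off_t rather than size_t so large files work on 32-bit systems too):

#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define GIGA ((off_t)1 << 30)

int main(int argc, char **argv)
{
    struct stat st;
    if (argc < 2 || stat(argv[1], &st) != 0)   // exact size from metadata
        return 1;
    off_t new_size = 2 * ((st.st_size + GIGA - 1) & ~(GIGA - 1));
    int fd = open(argv[1], O_WRONLY);
    if (fd < 0 || ftruncate(fd, new_size) != 0)  // OS zero-fills the extension
        return 1;
    close(fd);
    printf("old %lld, new %lld\n", (long long)st.st_size, (long long)new_size);
    return 0;
}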
I'm trying to move content from one file to another.
My code:
char *path = extractFileName(args[1]);
if (path == 0)
    return -1;
FILE *input = fopen(path, "r");
rewind(input);
fseek(input, 0L, SEEK_END);
long sz = ftell(input);
printf("sz: %ld\n", sz);
rewind(input);
size_t a;
FILE *result = fopen("result.mp3", "w");
size_t counter = 0;
char buffer[128];
while ((a = fread(&buffer[0], 1, 128, input)) != 0) {
    fwrite(&buffer[0], 1, a, result);
    counter += a;
}
printf("%d\n", counter);
printf("ferror input: %d\n", ferror(input));
printf("feof input: %d\n", feof(input));
After execution it prints
sz: 6675688
25662
ferror input: 0
feof input: 16
As far as I know, this means that C knows the size of the input file is 6675688 bytes, but it returns EOF when I try to read more than 25662 bytes. What am I doing wrong?
Since your output filename is result.mp3, it's a safe bet you're dealing with non-textual data, which means you should be opening your files in binary mode: "rb" and "wb" respectively. If you're running this code on Windows, not doing that would explain the behavior you're seeing: on that platform, reading a particular byte (0x1A) in text mode causes the stream to signal end of file even when it's not actually the end. Using binary mode will fix it. On other OSes it's a no-op, but it still clues the reader into your intentions and the type of data you're expecting to work with, and is thus a good idea even when it's not strictly needed.
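For example, a minimal binary-mode version of your copy loop might look like this (copy_file is just an illustrative wrapper):

#include <stdio.h>

int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");    // "b": no text-mode translation
    FILE *out = fopen(dst, "wb");
    char buffer[4096];
    size_t n;
    if (in == NULL || out == NULL) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while ((n = fread(buffer, 1, sizeof buffer, in)) != 0)
        fwrite(buffer, 1, n, out);
    fclose(in);
    return fclose(out);
}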
I have to save some graph data (an array of structs) into a text file. I made a working program using fprintf, but for extra points I need it to be faster. I have spent a couple of hours googling whether there is anything faster and tried to use fwrite (but I wasn't able to fwrite as text), and I can't really find any other functions.
This is my write function using fprintf:
void save_txt(const graph_t * const graph, const char *fname)
{
    int count = graph->num_edges, i = 0;
    FILE *f = fopen(fname, "w");
    while (count > 0) {
        int r = fprintf(f, "%d %d %d\n", (graph->edges[i].from), (graph->edges[i].to), (graph->edges[i].cost));
        i++;
        if (r >= 6) {
            count -= 1;
        } else {
            break;
        }
    }
    if (f) {
        fclose(f);
    }
}
I would try setting a write buffer on the stream and experimenting with different buffer sizes (e.g. 1K, 2K, 4K, 8K and so on). Notice that by default your file is already using a buffer of BUFSIZ bytes, and that might already be enough.
#define BUFFERSIZE 0x1000

void save_txt(const graph_t * const graph, const char *fname)
{
    int count = graph->num_edges, i = 0;
    char buf[BUFFERSIZE];
    FILE *f = fopen(fname, "w");
    setvbuf(f, buf, _IOFBF, BUFFERSIZE);
    ...
The output stream f starts out with the default BUFSIZ buffer, so it might benefit from a larger, fully buffered write cache.
Of course this assumes that you're writing to a relatively slow medium and that the time spent saving is relevant; otherwise, whatever is slowing you down is not here, and increasing save performance won't help you appreciably.
There are profiling tools like prof and gprof that can help you determine where your program spends the most time.
One, much more awkward, possibility is merging Kiwi's answer with buffered write calls, to avoid the code in printf that works out which format to use (you already know this) and to issue as few I/O calls as possible (even just one, if BUFFERSIZE is larger than your destination file's length).
// These variables must now be global, declared outside save_txt.
char kiwiBuf[BUFFERSIZE];
size_t kiwiPtr = 0;
FILE *f;

void flushBuffer(void);   // forward declaration, defined below

void my_putchar(char c) {
    kiwiBuf[kiwiPtr++] = c;
    // Is the buffer full?
    if (kiwiPtr == BUFFERSIZE) {
        // Yes, empty the buffer into the file.
        flushBuffer();
    }
}

void flushBuffer(void) {
    if (kiwiPtr) {
        fwrite(kiwiBuf, kiwiPtr, 1, f);
        kiwiPtr = 0;
    }
}
You now need to flush the buffer before closing:
void save_txt(const graph_t * const graph, const char *fname)
{
    int i, count = graph->num_edges;
    f = fopen(fname, "w");
    if (NULL == f) {
        fprintf(stderr, "Error opening %s\n", fname);
        exit(-1);
    }
    for (i = 0; i < count; i++) {
        my_put_nbr(graph->edges[i].from);
        my_putchar(' ');
        my_put_nbr(graph->edges[i].to);
        my_putchar(' ');
        my_put_nbr(graph->edges[i].cost);
        my_putchar('\n');
    }
    flushBuffer();
    fclose(f);
}
UPDATE
By declaring the my_putchar function as inline and with a 4K buffer, the above code (modified with a mock of graph reading from an array of random integers) is around 6x faster than fprintf on
Linux mintaka 4.12.8-1-default #1 SMP PREEMPT Thu Aug 17 05:30:12 UTC 2017 (4d7933a) x86_64 x86_64 x86_64 GNU/Linux
gcc version 7.1.1 20170629 [gcc-7-branch revision 249772] (SUSE Linux)
About 2x of that seems to come from buffering. Andrew Henle made me notice an error in my code: I was comparing results against a baseline of unbuffered output, but fopen uses a BUFSIZ-byte buffer by default, and on my system BUFSIZ is 8192. So basically I've "discovered" just this:
there is no advantage to an 8K buffer; 4K is enough
my original suggestion of using _IOFBF is utterly worthless, as the system already does it for you. This in turn means that Kiwi's answer is the most correct, since, as Andrew pointed out, it avoids printf's checks and conversions.
Also, the overall improvement (google Amdahl's Law) depends on what fraction of the processing time goes into saving. Clearly, if one hour of computation requires one second of saving, doubling the saving speed gains you half a second, while increasing the computation speed by 1% gains you 36 seconds, or 72 times more.
My own sample code was designed to be completely save-oriented, with very large graphs; in this situation any small improvement in writing speed reaps potentially huge rewards, which might be unrealistic in a real-world case.
Also (in answer to a comment), while using a small enough buffer will slow saving, it is not at all certain that using a larger buffer will help. Say the whole graph generates 1.2 KB of output in its entirety; then of course any buffer above 1.2 KB will yield no improvement. Actually, allocating more memory might even hurt performance.
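For the buffer-size experiment, a rough harness might look like this sketch (save_txt_stream is a hypothetical variant of save_txt that writes to an already-open stream; graph_t is the type from the question):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Hypothetical variant of save_txt that writes to an open stream.
void save_txt_stream(const graph_t *graph, FILE *f);

void benchmark_buffers(const graph_t *graph, const char *fname)
{
    static const size_t sizes[] = { 1024, 2048, 4096, 8192, 16384 };
    size_t i;
    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
    {
        FILE *f = fopen(fname, "w");
        char *buf = malloc(sizes[i]);
        clock_t t;
        if (f == NULL || buf == NULL)
            break;
        setvbuf(f, buf, _IOFBF, sizes[i]);  // must precede any other I/O
        t = clock();
        save_txt_stream(graph, f);
        fclose(f);                          // flushes and stops using buf
        free(buf);                          // free only after fclose
        printf("%zu-byte buffer: %.3f s\n", sizes[i],
               (double)(clock() - t) / CLOCKS_PER_SEC);
    }
}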
I would write a small function, say print_graph(int, int, int), and call write directly in it, or something like this with my_putchar being a write call:
int my_put_nbr(int nb)
{
    if (nb < 0)
    {
        my_putchar('-');
        nb = -nb;
    }
    if (nb <= 9)
        my_putchar(nb + 48);
    else
    {
        my_put_nbr(nb / 10);
        my_put_nbr(nb % 10);
    }
    return (0);
}
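For instance, my_putchar as a direct write call could be as simple as this POSIX sketch (out_fd is an illustrative variable; in practice it would come from open()):

#include <unistd.h>

static int out_fd = 1;      // file descriptor, e.g. obtained from open()

void my_putchar(char c)
{
    write(out_fd, &c, 1);   // one unbuffered byte per call: simple but slow
}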
I had to be 1.3x faster than fprintf; here is the code that worked for me. I have to say that I had to submit it multiple times; sometimes I passed only 1 out of 5 tests with the same code. In conclusion, it is faster than fprintf, but not reliably 1.3x faster.
void save_txt(const graph_t * const graph, const char *fname)
{
    int count = graph->num_edges, i = 0;
    char c = '\n';
    char d = ' ';
    char buffer[15];
    FILE *f = fopen(fname, "w");
    while (count > 0) {
        itoa(graph->edges[i].from, buffer, 10);   // note: itoa is non-standard
        fputs(buffer, f);
        putc(d, f);
        itoa(graph->edges[i].to, buffer, 10);
        fputs(buffer, f);
        putc(d, f);
        itoa(graph->edges[i].cost, buffer, 10);
        fputs(buffer, f);
        putc(c, f);
        i++;
        count -= 1;
    }
    if (f) {
        fclose(f);
    }
}
I am trying to use fread and fwrite to read and write data pertaining to a structure in a file. Here's my code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

typedef struct book book;

struct book
{
    char title[200];
    char auth[200];
    char publi[200];
    int p_year;
    int price;
    int edition;
    int isbn;
};

int main()
{
    int i;
    FILE *fp = fopen("this.dat", "w");
    book *a = calloc(1000000, sizeof(book));
    srand(time(NULL));
    for (i = 0; i < 1000000; i++)
    {
        a[i].price = rand() % 1000;
        a[i].p_year = 1500 + rand() % 518;
        a[i].isbn = 10000 + rand() % 100000;
        a[i].edition = i % 15;
        strcpy(a[i].title, "title");
        strcpy(a[i].auth, "author");
        strcpy(a[i].publi, "publication");
    }
    if ((i = fwrite(a, sizeof(*a), 1000000, fp)) != 1000000)
    {
        printf("ERROR - Only %d records written\n", i);
        printf("feof:%d\nferror:%d", feof(fp), ferror(fp));
        return EXIT_FAILURE;
    }
    if (ferror(fp))
    {
        printf("ERROR");
        return EXIT_FAILURE;
    }
    if (fclose(fp) != 0)
    {
        printf("ERROR while closing the stream");
        return EXIT_FAILURE;
    }
    if ((fp = fopen("this.dat", "r")) == NULL)
    {
        printf("ERROR reopening");
        return EXIT_FAILURE;
    }
    if ((i = fread(a, sizeof(book), 100, fp)) != 100)
    {
        printf("ERROR - Only %d records read\n", i);
        printf("feof:%d\nferror:%d", feof(fp), ferror(fp));
        return EXIT_FAILURE;
    }
    if (ferror(fp))
    {
        printf("~ERROR");
        return EXIT_FAILURE;
    }
    for (i = 0; i < 100; i++)
        printf("price:%d\nedition:%d\nisbn:%d\np_year:%d\n\n\n", a[i].price, a[i].edition, a[i].isbn, a[i].p_year);
    fclose(fp);
    return EXIT_SUCCESS;
}
The thing is, occasionally it executes successfully, but most of the time it doesn't: I get an error while reading back from the file using fread. It ends up reading a variable number of records each time, and fewer than it's supposed to (i.e. 100). The following is one of the outputs of an unsuccessful run of the program:
ERROR - Only 25 records read
feof:16
ferror:0
Question 1: Why is EOF reached after reading just 25 records when more than 25 were written? (I've tried using rewind/fseek after reopening the file, but the issue still persisted.)
Question 2: In such cases, is it normal for the data contained in the array a beyond a[x-1] to get corrupted when x (<100) records are read? Would the data still have been corrupted beyond a[99] even if 100 records had been successfully read? (I know the data gets corrupted, since trying to print the fields of elements of array a beyond the xth element results in inappropriate values, like price > 1000 or price < 0 and so on.)
You shouldn't open your files in text mode when reading/writing binary structures.
While it has no effect on Linux/Unix, on Windows it has serious consequences, and it makes your files non-shareable between Windows and Linux.
Depending on the data, LF <=> CR/LF conversion can corrupt or shift the data (removing a carriage return or inserting one).
In text mode on Windows, each LF byte (ASCII 10) is replaced by CR+LF bytes (ASCII 13+10) when writing (and the reverse when reading: 13+10 => 10). Such 10 bytes can occur in the data, for instance when writing the year 1802 (hex: 0x70A) as binary.
Solution: use binary mode:
if((fp = fopen("this.dat","rb")) == NULL)
and
FILE* fp = fopen("this.dat","wb");
Note: In "text" mode, specifying a block size doesn't work since the size depends on the data. That probably answers your second question: last 100th record read is corrupt because you're reading too few bytes. I'm not sure about the details but since the system adds/removes bytes when writing/reading, block size can be buggy.
I have a text file (unsigned short values) as follows
abc.txt
2311
1231
1232
54523
32423
I'm reading this file in my function using a while loop and storing the values in a buffer, as follows
while (!feof(ref))
{
    fscanf(ref, "%d\n", &buffer[count]);
    count++;
}
It is taking too much time to read a large file. Is there any way to optimize the fscanf operation?
This is because secondary memory (disk) access is slower than primary memory (RAM) access. First dump the file into primary memory using fread() in binary mode, then parse the integers out of memory one by one.
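A sketch of that idea, with minimal error handling (load_and_parse and its signature are my own illustration):

#include <stdio.h>
#include <stdlib.h>

// Slurp the whole file into memory with one fread(), then pull the
// numbers out of the buffer with strtol().
long load_and_parse(const char *fname, unsigned short *out, long max)
{
    FILE *fp = fopen(fname, "rb");
    if (fp == NULL)
        return -1;
    fseek(fp, 0, SEEK_END);
    long len = ftell(fp);
    rewind(fp);
    char *data = malloc(len + 1);
    if (data == NULL) {
        fclose(fp);
        return -1;
    }
    fread(data, 1, len, fp);
    data[len] = '\0';               // terminate so strtol can't run off
    fclose(fp);

    long count = 0;
    char *p = data, *end;
    while (count < max) {
        long v = strtol(p, &end, 10);
        if (end == p)               // no digits left
            break;
        out[count++] = (unsigned short)v;
        p = end;
    }
    free(data);
    return count;
}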
A common way is to read a larger chunk of the file into a large memory buffer and then parse the data out of that buffer.
Another way is to memory-map the file instead; the OS then puts the file into your process's virtual address space, so you can read it as if reading from memory.
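On POSIX systems, the memory-mapping variant might look like this sketch (map_file is an illustrative name):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only; the kernel pages it in on demand and the
// returned buffer can be parsed in place. munmap() it when done.
char *map_file(const char *fname, size_t *len_out)
{
    int fd = open(fname, O_RDONLY);
    struct stat st;
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      // the mapping stays valid after close
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}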
Use a local buffer and read blocks of data using fread() in binary mode. Parse your text data and continue with the next block.
Tune your buffer size properly: maybe 64K or 1MB in size; it depends on your application.
#include <stdio.h>

#define BUFFER_SIZE 1024

int main()
{
    unsigned char buffer[BUFFER_SIZE];
    FILE *source = fopen("myfile", "rb");
    size_t n;
    size_t count = 0;
    if (source)
    {
        while ((n = fread(buffer, 1, BUFFER_SIZE, source)) > 0)
        {
            count += n;
            // here parse the data in buffer[0..n-1]
        }
        fclose(source);
    }
    return 0;
}
This may be faster: if each line has only one number, atoi() is a lot faster than using fscanf().
#define BUFLEN 128
#define ARRAY_SIZE 12345

int myarray[ARRAY_SIZE];
char buffer[BUFLEN];
FILE *fp = fopen(...);
int index = 0;

do
{
    if (fgets(buffer, BUFLEN, fp) == NULL)
        break;
    myarray[index++] = atoi(buffer);
    if (index >= ARRAY_SIZE)
        break;
} while (!feof(fp));
...hastily typed in code, not compiled or run ;)
You can improve the file reading by setting a stream buffer e.g.
#define STRMBUF_SIZE (64*1024)
char strmbuf[STRMBUF_SIZE];
setvbuf( fp, strmbuf,_IOFBF,STRMBUF_SIZE);
Hmm, I wonder whether there is a way to read a FILE faster than using fscanf().
For example, suppose that I have this text:
4
55 k
52 o
24 l
523 i
First I want to read the first number, which gives the number of following lines.
Let this number be called N.
After N, I want to read N lines, each of which has an integer and a character.
With fscanf it would be like this:
fscanf(fin,"%d %c",&a,&c);
You do almost no processing, so the bottleneck is probably the file system throughput. However, you should first measure whether it really is. If you don't want to use a profiler, you can just measure the running time of your application; the size of the input file divided by the running time tells you whether you've reached the file system's throughput limit.
Then, if you are far away from that limit, you probably need to optimize the way you read the file. It may be better to read it in larger chunks using fread() and then process the buffer stored in memory with sscanf().
You can also parse the buffer yourself, which would be faster than *scanf().
[edit]
Especially for Drakosha:
$ time ./main1
Good entries: 10000000
real 0m3.732s
user 0m3.531s
sys 0m0.109s
$ time ./main2
Good entries: 10000000
real 0m0.605s
user 0m0.496s
sys 0m0.094s
So the optimized version manages ~127 MB/s, which may be my file system's bottleneck, or maybe the OS caches the file in RAM. The original version is ~20 MB/s.
Tested with an 80 MB file:
10000000
1234 a
1234 a
...
main1.c
#include <stdio.h>

int ok = 0;

void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a;
    char c;
    int i = 0;
    fscanf(f, "%d", &total);
    for (i = 0; i < total; ++i) {
        if (2 != fscanf(f, "%d %c", &a, &c)) {
            fclose(f);
            return 1;
        }
        processEntry(a, c);
    }
    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}
main2.c
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>   // for isdigit()/isalpha()

int ok = 0;

void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a;
    char c;
    int i = 0;
    char buf[2048];
    size_t toProcess = sizeof(buf);
    int state = 0;
    int fileLength, lengthLeft;
    fseek(f, 0, SEEK_END);
    fileLength = ftell(f);
    fseek(f, 0, SEEK_SET);
    fscanf(f, "%d", &total); // read the first line
    lengthLeft = fileLength - ftell(f);
    // read the other lines using an FSM
    do {
        if (lengthLeft < (int)sizeof(buf)) {
            fread(buf, lengthLeft, 1, f);
            toProcess = lengthLeft;
        } else {
            fread(buf, sizeof(buf), 1, f);
            toProcess = sizeof(buf);
        }
        lengthLeft -= toProcess;
        for (i = 0; i < (int)toProcess; ++i) {
            switch (state) {
            case 0:
                if (isdigit((unsigned char)buf[i])) {
                    state = 1;
                    a = buf[i] - '0';
                }
                break;
            case 1:
                if (isdigit((unsigned char)buf[i])) {
                    a = a * 10 + buf[i] - '0';
                } else {
                    state = 2;
                }
                break;
            case 2:
                if (isalpha((unsigned char)buf[i])) {
                    state = 0;
                    c = buf[i];
                    processEntry(a, c);
                }
                break;
            }
        }
    } while (toProcess == sizeof(buf));
    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}
It is unlikely you can significantly speed up the actual reading of the data. Most of the time here will be spent on transferring the data from disk to memory, which is unavoidable.
You might get a little speed-up by replacing the fscanf call with fgets and then manually parsing the string (with strtol) to bypass the format-string parsing that fscanf has to do, but don't expect any huge savings.
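For this question's input format, such a replacement might look like the following sketch (read_entry is an illustrative name):

#include <stdio.h>
#include <stdlib.h>

// Read one "number letter" line; returns 1 on success, 0 at end of input.
int read_entry(FILE *f, int *a, char *c)
{
    char line[64], *end;
    if (fgets(line, sizeof line, f) == NULL)
        return 0;
    *a = (int)strtol(line, &end, 10);   // parse the leading integer
    while (*end == ' ')
        end++;                          // skip the separator
    *c = *end;
    return 1;
}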
In the end, it is usually not worth it to heavily optimise I/O operations, because they will typically be dominated by the time it takes to transfer the actual data to/from the hardware/peripherals.
As usual, start with profiling to make sure this part is indeed a bottleneck. Actually, the file system cache should make the small reads you are doing fairly cheap; however, reading larger parts of the file into memory and then operating on that memory might be (a little) faster.
In the (I believe extremely improbable) case that you need to save every CPU cycle, you could write your own fscanf variant, since you know the format of the string and only need to support that one case. But this improvement would bring low gains too, especially on modern CPUs.
The input looks like typical programming-contest input. In that case, optimize the algorithm, not the reading.
fgets() or fgetc() are faster, as they don't need to drag fscanf()'s whole formatting/variable-argument-list ballet into the program. Either of those two functions will leave you with a manual character-to-integer conversion, however. Still, the program as a whole will be much faster.
There is not much hope of reading the file faster, as that comes down to a system call. But there are many ways to parse it faster than scanf with specialised code.
Check out read and fread. Since you're practicing for programming contests, you can ignore all warnings about the disk I/O bottleneck, because the files may be in memory or be pipes from other processes generating tests "on-the-fly".
Put your tests into /dev/shm (a tmpfs mount), or write a test generator and pipe its output in.
I've found in programming contests that parsing numbers in the manner of atoi gives a large performance boost over scanf/fscanf (a plain atoi might not be applicable, so be prepared to implement it by hand - it's easy).
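A hand-rolled parser in that style might look like this sketch (parse_uint is an illustrative name):

// Skip whitespace, accumulate decimal digits, and return a pointer
// just past the number, so calls can be chained across a buffer.
static const char *parse_uint(const char *p, unsigned *out)
{
    unsigned v = 0;
    while (*p == ' ' || *p == '\n')
        p++;                        // skip separators
    while (*p >= '0' && *p <= '9')
        v = v * 10 + (unsigned)(*p++ - '0');
    *out = v;
    return p;
}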