Same program is 10 times slower on windows

Same program is 10 times slower on windows - c

I wrote a simple c program that copies 10 million bytes from a file and pastes them in reverse order on another file (this is done one byte at a time, I know it's not efficient but it's just to make some tests), I don't understand why on linux it takes 2.5 seconds while on windows it takes more than 20 seconds. I run the same program changing only the paths.
I use windows 10 and archlinux, the files are on an ntfs partition.
code on windows
#include <stdio.h>
#include <time.h>
void get_nth_byte(FILE *fp, int nth_index,unsigned char* output){
fseek(fp,nth_index,SEEK_SET);
fread(output, sizeof(unsigned char), 1,fp);
}
int main() {
clock_t begin = clock();
//
FILE* input = fopen( "C:\\Users\\piero\\Desktop\\input.txt","rb");
FILE* output = fopen("C:\\Users\\piero\\Desktop\\output.txt","wb");
unsigned char byte;
for (int i = 10000000; i > 0; i--) {
get_nth_byte(input,i,&byte);
fwrite(&byte, sizeof(unsigned char),1,output);
}
//
clock_t end = clock();
double result = (double) (end - begin)/CLOCKS_PER_SEC;
printf("%f",result);
return 0;
}
code on linux
#include <stdio.h>
#include <time.h>
void get_nth_byte(FILE *fp, int nth_index,unsigned char* output){
fseek(fp,nth_index,SEEK_SET);
fread(output, sizeof(unsigned char), 1,fp);
}
int main() {
clock_t begin = clock();
//
FILE* input = fopen( "/run/media/piero/Windows/Users/piero/Desktop/input.txt","rb");
FILE* output = fopen("/run/media/piero/Windows/Users/piero/Desktop/output.txt","wb");
unsigned char byte;
for (int i = 10000000; i > 0; i--) {
get_nth_byte(input,i,&byte);
fwrite(&byte, sizeof(unsigned char),1,output);
}
//
clock_t end = clock();
double result = (double) (end - begin)/CLOCKS_PER_SEC;
printf("%f",result);
return 0;
}
output on linux : 2.224549
output on windows : 25.349647
UPDATE
I solved the problem by using cygwin rather than mingwin, now it takes about 4.3 seconds

This is a great demonstration of how it's not the code we write that runs, it's the executable that the compiler makes from the code that runs.
It is possible that your Windows C compiler is not as advanced as your Linux C compiler, and is not optimizing your code as well as it could, or it's possible that the libraries that the Windows compiler is linking to for fread() and fwrite() are slower than the equivalent libraries in the Linux system.
If I had to put up my best guess, the Linux C compiler probably noticed that it would be more efficient to read more than one byte at a time, and it could do that without affecting the semantics of your program, and the Windows compiler either didn't infer the same, or wasn't able to optimize in the same way due to some underlying proprietary filesystem thing that only Microsoft engineers understand.
I can't say for sure without a peek at the disassembled binaries

One of the strengths of Unix/Linux is that files are designed to be treated as streams of bytes, with it being maximally easy and efficient to seek to the n'th byte using fseek or lseek.
Non-Unix operating systems, such as Windows, tend to have to work much harder to implement those seek operations. In the worst case, they may actually need to read through the file, counting characters as they go.
Your code opens both files in binary mode, and this should reduce the need for the fseek implementation to perform any expensive emulations. In text mode, a 10x performance penalty for heavy fseek use wouldn't surprise me. I'm much more surprised you're seeing it in binary mode.
[Disclaimer: strictly speaking, in text mode fseek is not defined as seeking to an arbitrary byte offset at all, but rather, only to a position defined by the number returned by a previous call to ftell. If an implementation takes advantage of that freedom, it can reduce the performance penalty for text-mode fseek operations, also, but it then means that code like yours, that constructs positions to seek to on the assumption that they're pure byte offsets, may not work at all.]

Related

How to use write() or fwrite() for writing data to terminal (stdout)?

I am trying to speed up my C program to spit out data faster.
Currently I am using printf() to give some data to the outside world. It is a continuous stream of data, therefore I am unable to use return(data).
How can I use write() or fwrite() to give the data out to the console instead of file?
Overall my setup consist of program written in C and its output goes to the python script, where the data is processed further. I form a pipe:
./program_in_c | script_in_python
This gives additional benefit on Raspberry Pi by using more of processor's cores.

#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
write() writes up to count bytes from the buffer starting at buf to
the file referred to by the file descriptor fd.
the standard output file descriptor is: 1 in linux at least!
concern using flush the stdoutput buffer as well, before calling to write system call to ensure that all previous garabge was cleaned
fflush(stdout); // Will now print everything in the stdout buffer
write(1, buf, count);
using fwrite:
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
The function fwrite() writes nmemb items of data, each size bytes
long, to the stream pointed to by stream, obtaining them from the
location given by ptr.
fflush(stdout);
int buf[8];
fwrite(buf, sizeof(int), sizeof(buf), stdout);
Please refare to man pages for further reading, in the links below:
fwrite
write

Well, there's little or no win in trying to overcome the already used buffering system of the stdio.h package. If you try to use fwrite() with larger buffers, you'll probably win no more time, and use more memory than is necessary, as stdio.h selects the best buffer size appropiate to the filesystem where the data is to be written.
A simple program like the following will show that speed is of no concern, as stdio is already buffering output.
#include <stdio.h>
int
main()
{
int c;
while((c = getchar()) >= 0)
putchar(c);
}
If you try the above and below programs:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main()
{
char buffer[512];
int n;
while((n = read(0, buffer, sizeof buffer)) > 0)
write(1, buffer, n);
if (n < 0) {
perror("read");
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
You will see that there's no significative difference or, even, the first program will be faster, despite it is doing I/O on a per character basis. (as B. Kernighan & Dennis Ritchie wrote it in her first edition of "The C programming language") Most probably the first program will win.
The calls to read() and write() involve a system call each, with a buffer size decided by you. The individual getchar() and putchar() calls don't. They just store the received chars in a memory buffer, as you print them, whose size has been decided by the stdio.h library implementation, based on the filesystem, and it flushes the buffer, once it is full of data. If you grow the buffer size in the second program, you'll see that you get better results increasing it up to a point, but after that you'll see no more increment in speed. The number of calls made to the library is insignificant with respect to the time involved in doing the actual I/O, and selecting a very large buffer, will eat much memory from your system (and a Raspberry Pi memory is limited in this sense, to 1Gb or ram) If you end making swap due to a so large buffer, you'll lose the battle completely.
Most filesystems have a preferred buffer size, because the kernel does write ahead (the kernel reads more than what you asked for, on sequential reads, in prevision that you'll continue reading more after you consumed the data) and this affects the optimum buffer size. For that, the stat(2) system call tells you what is the optimum buffer size, and stdio uses that when it selects the actual buffer size.
Don't think you are going to get better (or much better) than the program listed first above. Even if you use large enough buffers.
What is not correct (or valid) is to intermix calls that do buffering (like all the stdio package's) with basic system calls (like read(2) or write(2) ---as I've seen recommending you to use fflush(3) after write(2), which is totally incoherent--- that do not buffer the data) there's no earn (and probably you'll get your output incorrectly ordered, if you do part of the calls using printf(3) and part using write(2) (this happens more in pipelines like you plan to do, because the buffers are not line oriented ---another characteristic of buffered output in stdio---)
Finally, I recomend you to read "The Unix programming environment" by Dennis Ritchie and Rob Pike. It will teach you a lot of unix, but one very good thing is that it will teach you to use perfectly the stdio package and the unix filesystem calls for reading and writing. With a little of luck you'll find it in .pdf on internet.
The next program shows you the effect of buffering:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main()
{
int i;
char *sep = "";
for (i = 0; i < 10; i++) {
printf("%s%d", sep, i);
sep = ", ";
sleep(1);
}
printf("\n");
}
One would assume you are going to see (on the terminal) the program, writing the numbers 0 to 9, separated by , and paced on one second intervals.
But due to the buffering, what you observe is quite different, you'll see how your program waits for 10 seconds without writing anything at all on the terminal, and at the end, writes everything in one shot, including the final line end, when the program terminates, and the shell shows you the prompt again.
If you change the program to this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main()
{
int i;
char *sep = "";
for (i = 0; i < 10; i++) {
printf("%s%d", sep, i);
fflush(stdout);
sep = ", ";
sleep(1);
}
printf("\n");
}
You'll see the expected output, because you have told stdio to flush the buffer at each loop pass. In both programs you did 10 calls to printf(3), but there was only one write(2) at the end to write the full buffer. In the second version you forced stdio to do one such write(2) after each printf, and that showed the data out as the program passed through the loop.
Be careful, because another characteristic of stdio can be confounding you, as printf(3), when you print to a terminal device, flushes the output at each \n, but when you run it through a pipe, it does it only when the buffer fills up completely. This saves system calls (in FreeBSD, for example, the buffer size selected by stdio is around 32kb, large enough to force two blocks to write(2) and optimum (you'll not get better going above that size)

The console output in C works almost the same way as a file. Once you have included stdio.h, you can write on the console output, named stdout (for "standard output"). In the end, the following statement:
printf("hello world!\n");
is the same as:
char str[] = "hello world\n";
fwrite(str, sizeof(char), sizeof(str) - 1, stdout);
fflush(stdout);

Fast I/O in c, stdin/out

In a coding competition specified at this link there is a task where you need to read much data on stdin, do some calculations and present a whole lot of data on stdout.
In my benchmarking it is almost only i/o that takes time although I have tried optimizing it as much as possible.
What you have as input is a string (1 <= len <= 100'000) and q rows of pair of int where q also is 1 <= q <= 100'000.
I benchmarked my code on a 100 times larger dataset (len = 10M, q = 10M) and this is the result:
Activity time accumulated
Read text: 0.004 0.004
Read numbers: 0.146 0.150
Parse numbers: 0.200 0.350
Calc answers: 0.001 0.351
Format output: 0.037 0.388
Print output: 0.143 0.531
By implementing my own formating and number parsing inline i managed to get the time down to 1/3 of the time when using printf and scanf.
However when I uploaded my solution to the competitions webpage my solution took 1.88 seconds (I think that is the total time over 22 datasets). When I look in the high-score there are several implementations (in c++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?
I guess that I could speed it up a bit by using 2 threads, then I can start calculating and writing to stdout while still reading from stdin. This will however decrease the time to min(0.150, 0.143) in a theoretical best case on my large dataset. I'm still nowhere close to the highscore..
In the image below you can see the statistics of the consumed time.
The program gets compiled by the website with this options:
gcc -g -O2 -std=gnu99 -static my_file.c -lm
and timed like this:
time ./a.out < sample.in > sample.out
My code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_LEN (100000 + 1)
#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2*ROW_LEN)
int main(int argc, char *argv[])
{
int ret = 1;
// Set custom buffers for stdin and out
char stdout_buf[16384];
setvbuf(stdout, stdout_buf, _IOFBF, 16384);
char stdin_buf[16384];
setvbuf(stdin, stdin_buf, _IOFBF, 16384);
// Read stdin to buffer
char *buf = malloc(MAX_LEN);
if (!buf) {
printf("Failed to allocate buffer");
return 1;
}
if (!fgets(buf, MAX_LEN, stdin))
goto EXIT_A;
// Get the num tests
int m ;
scanf("%d\n", &m);
char *num_buf = malloc(DOUBLE_ROW_LEN);
if (!num_buf) {
printf("Failed to allocate num_buffer");
goto EXIT_A;
}
int *nn;
int *start = calloc(m, sizeof(int));
int *stop = calloc(m, sizeof(int));
int *staptr = start;
int *stpptr = stop;
char *cptr;
for(int i=0; i<m; i++) {
fgets(num_buf, DOUBLE_ROW_LEN, stdin);
nn = staptr++;
cptr = num_buf-1;
while(*(++cptr) > '\n') {
if (*cptr == ' ')
nn = stpptr++;
else
*nn = *nn*10 + *cptr-'0';
}
}
// Count for each test
char *buf_end = strchr(buf, '\0');
int len, shift;
char outbuf[ROW_LEN];
char *ptr_l, *ptr_r, *out;
for(int i=0; i<m; i++) {
ptr_l = buf + start[i];
ptr_r = buf + stop[i];
while(ptr_r < buf_end && *ptr_l == *ptr_r) {
++ptr_l;
++ptr_r;
}
// Print length of same sequence
shift = len = (int)(ptr_l - (buf + start[i]));
out = outbuf;
do {
out++;
shift /= 10;
} while (shift);
*out = '\0';
do {
*(--out) = "0123456789"[len%10];
len /= 10;
} while(len);
puts(outbuf);
}
ret = 0;
free(start);
free(stop);
EXIT_A:
free(buf);
return ret;
}

Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.
I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.
How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and avoid work done only in debug mode). The total time for 22 trials is 50 milliseconds, meaning each trial averages 2.25 ms? We're down near the threshold of measurability.
Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.
Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.
That said, a few observations may help.
You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input array may be up to 100,000 elements, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14. Total input, maximum is 100,000 + 1 + 6 + 1 + 100,000 * 14: under 16 KB. You are allowed 1 GB of memory.
I just allocated a single 16 KB buffer, and read it in all at once with read(2). Then I made a single pass over that input.
You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those help. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.
One ridiculous aspect of the way they measure performance is that they use gcc -g. That means assert(3) is invoked in code that is measured for performance! I couldn't get under 4 seconds on test 22 until I removed the my asserts.
In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit, and you can dispense with dynamic memory and tuning stdio. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.

You should allocate all your buffers continuously.
Allocate a buffer which is the size of all your buffers (num_buff, start, stop) then rearrange the points to the corresponding offsets by their size.
This can reduce your cache miss \ page faults.
Since the read and the write operation seems to consume a lot of time you should consider adding threads. One thread should deal with I\O and another should deal with the computation. (It is worth checking if another thread for prints could speed things up as well). Make sure you don't use any locks while doing this.

Answering this question is tricky because optimization heavily depends on the problem you have.
One idea is to look at the content of the file you are trying to read and see if there patterns or things that you can use in your favor.
The code you wrote is a "general" solution for reading from a file, executing something and then writing to a file. But if you the file is not randomly generated each time and the content is always the same why not try to write a solution for that file?
On the other hand, you could try to use low-level system functions. One that comes to my thinking is mmap which allows you to map a file directly to memory and access that memory instead of using scanf and fgets.
Another thing I found that might help is in your solutin you are having two while loops, why not try and use only one? Another thing would be to do some Asynchronous I/O reading, so instead of reading the whole file in a loop, and then doing the calculation in another loop, you can try and read a portion at the beginning, start processing it async and continue reading.
This link might help for the async part

Why is my Memory dumping soo slow?

The idea behind this program is to simply access the ram and download the data from it to a txt file.
Later Ill convert the txt file to jpeg and hopefully it will be readable .
However when I try and read from the RAM using NEW[] it takes waaaaaay to long to actually copy all the values into the file?
Isnt it suppose to be really fast? I mean I save pictures everyday and it doesn't even take a second?
Is there some other method I can use to dump memory to a file?
#include <stdio.h>
#include <stdlib.h>
#include <hw/pci.h>
#include <hw/inout.h>
#include <sys/mman.h>
main()
{
FILE *fp;
fp = fopen ("test.txt","w+d");
int NumberOfPciCards = 3;
struct pci_dev_info info[NumberOfPciCards];
void *PciDeviceHandler1,*PciDeviceHandler2,*PciDeviceHandler3;
uint32_t *Buffer;
int *BusNumb; //int Buffer;
uint32_t counter =0;
int i;
int r;
int y;
volatile uint32_t *NEW,*NEW2;
uintptr_t iobase;
volatile uint32_t *regbase;
NEW = (uint32_t *)malloc(sizeof(uint32_t));
NEW2 = (uint32_t *)malloc(sizeof(uint32_t));
Buffer = (uint32_t *)malloc(sizeof(uint32_t));
BusNumb = (int*)malloc(sizeof(int));
printf ("\n 1");
for (r=0;r<NumberOfPciCards;r++)
{
memset(&info[r], 0, sizeof(info[r]));
}
printf ("\n 2");
//Here the attach takes place.
for (r=0;r<NumberOfPciCards;r++)
{
(pci_attach(r) < 0) ? FuncPrint(1,r) : FuncPrint(0,r);
}
printf ("\n 3");
info[0].VendorId = 0x8086; //Wont be using this one
info[0].DeviceId = 0x3582; //Or this one
info[1].VendorId = 0x10B5; //WIll only be using this one PLX 9054 chip
info[1].DeviceId = 0x9054; //Also PLX 9054
info[2].VendorId = 0x8086; //Not used
info[2].DeviceId = 0x24cb; //Not used
printf ("\n 4");
//I attached the device and give it a handler and set some setting.
if ((PciDeviceHandler1 = pci_attach_device(0,PCI_SHARE|PCI_INIT_ALL, 0, &info[1])) == 0)
{
perror("pci_attach_device fail");
exit(EXIT_FAILURE);
}
for (i = 0; i < 6; i++)
//This just prints out some details of the card.
{
if (info[1].BaseAddressSize[i] > 0)
printf("Aperture %d: "
"Base 0x%llx Length %d bytes Type %s\n", i,
PCI_IS_MEM(info[1].CpuBaseAddress[i]) ? PCI_MEM_ADDR(info[1].CpuBaseAddress[i]) : PCI_IO_ADDR(info[1].CpuBaseAddress[i]),
info[1].BaseAddressSize[i],PCI_IS_MEM(info[1].CpuBaseAddress[i]) ? "MEM" : "IO");
}
printf("\nEnd of Device random info dump---\n");
printf("\nNEWs Address : %d\n",*(int*)NEW);
//Not sure if this is a legitimate way of memory allocation but I cant see to read the ram any other way.
NEW = mmap_device_memory(NULL, info[1].BaseAddressSize[3],PROT_READ|PROT_WRITE|PROT_NOCACHE, 0,info[1].CpuBaseAddress[3]);
//Here is where things are starting to get messy and REALLY long to just run through all the ram and dump it.
//Is there some other way I can dump the data in the ram into a file?
while (counter!=info[1].BaseAddressSize[3])
{
fprintf(fp, "%x",NEW[counter]);
counter++;
}
fclose(fp);
printf("0x%x",*Buffer);
}

A few issues that I can see:
You are writing blocks of 4 bytes - that's quite inefficient. The stream buffering in the C library may help with that to a degree, but using larger blocks would still be more efficient.
Even worse, you are writing out the memory dump in hexadecimal notation, rather than the bytes themselves. That conversion is very CPU-intensive, not to mention that the size of the output is essentially doubled. You would be better off writing raw binary data using e.g. fwrite().
Depending on the specifics of your system (is this on QNX?), reading from I/O-mapped memory may be slower than reading directly from physical memory, especially if your PCI device has to act as a relay. What exactly is it that you are doing?
In any case I would suggest using a profiler to actually find out where your program is spending most of its time. Even a rudimentary system monitor would allow you to determine if your program is CPU-bound or I/O-bound.
As it is, "waaaaaay to long" is hardly a valid measurement. How much data is being copied? How long does it take? Where is the output file located?
P.S.: I also have some concerns w.r.t. what you are trying to do, but that is slightly off-topic for this question...

For fastest speed: write the data in binary form and use the open() / write() / close() API-s. Since your data is already available in a contiguous block of (virtual) memory it is a waste to copy it to a temporary buffer (used by the fwrite(), fprintf(), etc. API-s).
The code using write() will be similar to:
int fd = open("filename.bin", O_RDWR|O_CREAT, S_IRWXU);
write(fd, (void*)NEW, 4*info[1].BaseAddressSize[3]);
close(fd);
You will need to add error handling and make sure that the buffer size is specified correctly.
To reiterate, you get the speed-up from:
avoiding the conversion from binary to ASCII (as pointed out by others above)
avoiding many calls to libc
reducing the number of system-calls (from inside libc)
eliminating the overhead of copying data to a temporary buffer inside the fwrite()/fprintf() and related functions (buffering would be useful if your data arrived in small chunks, including the case of converting to ASCII in 4 byte units)
I intentionally ignore commenting on other parts of your code as it is apparently not intended to be production quality yet and your question is focused on how to speed up writing data to a file.

fwrite() alternative for large files on 32-bit system

I'm trying to generate large files (4-8 GB) with C code.
Now I use fopen() with 'wb' parameters to open file binary and fwrite() function in for loop to write bytes to file. I'm writing one byte in every loop iteration. There is no problem until the file is larger or equal to 4294967296 bytes (4096 MB). It looks like some memory limit in 32-bit OS, because when it writes to that opened file, it is still in RAM. Am I right? The symptom is that the created file has smaller size than I want. The difference is 4096 MB, e.g. when I want 6000 MB file, it creates 6000 MB - 4096 MB = 1904 MB file.
Could you suggest other way to do that task?
Regards :)
Part of code:
unsigned long long int number_of_data = (unsigned int)atoi(argv[1])*1024*1024; //MB
char x[1]={atoi(argv[2])};
fp=fopen(strcat(argv[3],".bin"),"wb");
for(i=0;i<number_of_data;i++) {
fwrite(x, sizeof(x[0]), sizeof(x[0]), fp);
}
fclose(fp);

fwrite is not the problem here. The problem is the value you are calculating for number_of_data.
You need to be careful of any unintentional 32-bit casting when dealing with 64-bit integers. When I define them, I normally do it in a number of discrete steps, being careful at each step:
unsigned long long int number_of_data = atoi(argv[1]); // Should be good for up to 2,147,483,647 MB (2TB)
number_of_data *= 1024*1024; // Convert to MB
The assignment operator (*=) will be acting on the l-value (the unsigned long long int), so you can trust it to be acting on a 64-bit value.
This may look unoptimised, but a decent compiler will remove any unnecessary steps.

You should not have any problem creating large files on Windows but I have noticed that if you use a 32 bit version of seek on the file it then seems to decide it is a 32 bit file and thus cannot be larger that 4GB. I have had success using _open, _lseeki64 and _write when working with >4GB files on Windows. For instance:
static void
create_file_simple(const TCHAR *filename, __int64 size)
{
int omode = _O_WRONLY | _O_CREAT | _O_TRUNC;
int fd = _topen(filename, omode, _S_IREAD | _S_IWRITE);
_lseeki64(fd, size, SEEK_SET);
_write(fd, "ABCD", 4);
_close(fd);
}
The above will create a file over 4GB without issue. However, it can be slow as when you call _write() there the file system has to actually allocate the disk blocks for you. You may find it faster to create a sparse file if you have to fill it up randomly. If you will fill the file sequentially from the beginning then the above code will be fine. Note that if you really want to use the buffered IO provided by fwrite you can obtain a FILE* from a C library file descriptor using fdopen().
(In case anyone is wondering, the TCHAR, _topen and underscore prefixes are all MSVC++ quirks).
UPDATE
The original question is using sequential output for N bytes of value V. So a simple program that should actually produce the file desired is:
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <io.h>
#include <tchar.h>
int
_tmain(int argc, TCHAR *argv[])
{
__int64 n = 0, r = 0, size = 0x100000000LL; /* 4GB */
char v = 'A';
int fd = _topen(argv[1], _O_WRONLY | _O_CREAT| _O_TRUNC, _S_IREAD | _S_IWRITE);
while (r != -1 && n < count) {
r = _write(fd, &v, sizeof(value));
if (r >= 0) n += r;
}
_close(fd);
return 0;
}
However, this will be really slow as we are only writing one byte at a time. That is something that can be improved by using a larger buffer or using buffered I/O by calling fdopen on the descriptor (fd) and switching to fwrite.

Yuo have no problem with fwrite(). The problem seems to be your
unsigned long long int number_of_data = (unsigned int)atoi(argv[1])*1024*1024; //MB
which indeed should be rather something like
uint16_t number_of_data = atoll(argv[1])*1024ULL*1024ULL;
unsigned long long would still be ok, but unsigned int * int * int will give you a unsinged int no matter how large your target variable is.

File size lookup in C

I was wondering if there was any significant performance increase in using sys/stat.h versus fseek() and and ftell()?

Choosing between fstat() and the fseek()/ftell() combination, there isn't going to be much difference. The single function call should be slightly quicker than the double function call, but the difference won't be great.
Choosing between stat() and the combination isn't a very fair comparison. For the combination calls, the hard work was done when the file was opened, so the inode information is readily available. The stat() call has to parse the file path and then report what it finds. It should almost always be slower - unless you recently opened the file anyway so the kernel has most of the information cached. Even so, the pathname lookup required by stat() is likely to make it slower than the combination.

If you're not sure, try it!
I just coded this test. I generated 10,000 files of 2KB each, and iterated over all of them, asking for their file size.
Results on my machine by measuring with the "time" command and doing an average of 10 runs:
fseek/fclose version: 0.22 secs
stat version: 0.06 secs
So, the winner (at least on my machine): stat!
Here's the test code:
#include <stdio.h>
#include <sys/stat.h>
#if 0
size_t getFileSize(const char * filename)
{
struct stat st;
stat(filename, &st);
return st.st_size;
}
#else
size_t getFileSize(const char * filename)
{
FILE * fd=fopen(filename, "rb");
if(!fd)
printf("ERROR on file %s\n", filename);
fseek(fd, 0, SEEK_END);
size_t size = ftell(fd);
fclose(fd);
return size;
}
#endif
int main()
{
char buf[256];
int i, n;
for(i=0; i<10000; ++i)
{
sprintf(buf, "file_%d", i);
if(getFileSize(buf)!= 2048)
printf("WRONG!\n");
}
return 0;
}

Logically, one would assume that fseek() when prompted to seek to the end of the file uses stat to know how far to seek, or rather, where the end of the file is.
This would make fseek slower than using the facilities directly, and it also requires you to fopen the file in the first place.
Still, any performance difference is likely to be negligible, and if you need to open the file for some reason anyway, fseek/ftell likely improves the readability of your code significantly.

For stat.h you mainly want to use it to tell the stats of the file. Like if you want to tell if it's a file or a directory, etc.
However, if you want to do manipulations with the file, then you'll probably want to use ftell() and fseek(). That is you're actually doing manipulations on the file stream itself.
So in terms of performance, it's really what you need.
Hope it helps :) Cheers!

Depending on the circumstances, stat() can be hundred of times faster then seek()/tell(). I am currently toying around with sshfs/FUSE and getting the file size of a few thousand files with seek()/tell() takes well over a minute, doing it with stat() takes a second. So the difference is pretty huge when working over sshfs/FUSE.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight