fwrite() alternative for large files on 32-bit system - c

I'm trying to generate large files (4-8 GB) with C code.
I open the file with fopen() in "wb" mode and write bytes with fwrite() in a for loop, one byte per iteration. There is no problem until the file reaches 4294967296 bytes (4096 MB). It looks like some memory limit of the 32-bit OS, as if the data written to the open file were still held in RAM. Am I right? The symptom is that the created file is smaller than intended, by exactly 4096 MB: e.g. when I want a 6000 MB file, it creates a 6000 MB - 4096 MB = 1904 MB file.
Could you suggest another way to do this task?
Regards :)
Part of code:
unsigned long long int number_of_data = (unsigned int)atoi(argv[1])*1024*1024; //MB
char x[1] = {atoi(argv[2])};
fp = fopen(strcat(argv[3], ".bin"), "wb");
for (i = 0; i < number_of_data; i++) {
    fwrite(x, sizeof(x[0]), sizeof(x[0]), fp);
}
fclose(fp);
fclose(fp);

fwrite is not the problem here. The problem is the value you are calculating for number_of_data.
You need to be careful of any unintentional 32-bit casting when dealing with 64-bit integers. When I define them, I normally do it in a number of discrete steps, being careful at each step:
unsigned long long int number_of_data = atoi(argv[1]); // Good for up to 2,147,483,647 MB (about 2 PB)
number_of_data *= 1024*1024; // Convert MB to bytes
The compound assignment operator (*=) acts on the l-value (the unsigned long long int), so you can trust it to perform 64-bit arithmetic.
This may look unoptimised, but a decent compiler will remove any unnecessary steps.
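The same fix can also be written in one line; a minimal sketch (my own, not from the code above), forcing the promotion before the multiply:
/* The cast (or the ULL suffixes) promotes the arithmetic to 64 bits
   before the multiplication can overflow. */
unsigned long long number_of_data =
    (unsigned long long)atoi(argv[1]) * 1024ULL * 1024ULL;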

You should not have any problem creating large files on Windows, but I have noticed that if you use a 32-bit version of seek on the file, it then seems to decide it is a 32-bit file and thus cannot be larger than 4GB. I have had success using _open, _lseeki64 and _write when working with >4GB files on Windows. For instance:
static void
create_file_simple(const TCHAR *filename, __int64 size)
{
    int omode = _O_WRONLY | _O_CREAT | _O_TRUNC;
    int fd = _topen(filename, omode, _S_IREAD | _S_IWRITE);
    _lseeki64(fd, size, SEEK_SET); /* 64-bit seek to the intended end */
    _write(fd, "ABCD", 4);         /* forces the file to be extended */
    _close(fd);
}
The above will create a file over 4GB without issue. However, it can be slow, because when you call _write() there, the file system has to actually allocate the disk blocks for you. You may find it faster to create a sparse file if you have to fill it up randomly. If you will fill the file sequentially from the beginning, then the above code will be fine. Note that if you really want to use the buffered I/O provided by fwrite, you can obtain a FILE* from a C library file descriptor using fdopen().
(In case anyone is wondering, the TCHAR, _topen and underscore prefixes are all MSVC++ quirks).
UPDATE
The original question is using sequential output for N bytes of value V. So a simple program that should actually produce the file desired is:
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <io.h>
#include <tchar.h>

int
_tmain(int argc, TCHAR *argv[])
{
    __int64 n = 0, r = 0, size = 0x100000000LL; /* 4GB */
    char v = 'A';
    int fd = _topen(argv[1], _O_WRONLY | _O_CREAT | _O_TRUNC, _S_IREAD | _S_IWRITE);
    while (r != -1 && n < size) { /* stop on error or once size bytes are written */
        r = _write(fd, &v, sizeof(v));
        if (r >= 0) n += r;
    }
    _close(fd);
    return 0;
}
However, this will be really slow as we are only writing one byte at a time. That is something that can be improved by using a larger buffer or using buffered I/O by calling fdopen on the descriptor (fd) and switching to fwrite.
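A minimal sketch of that fdopen variant (my own illustration; it assumes the fd and variables from the program above, and _fdopen is the MSVC spelling of POSIX fdopen):
/* Wrap the raw descriptor in a buffered stream, then replace the
   _write loop with buffered single-byte output. */
FILE *fp = _fdopen(fd, "wb");
if (fp != NULL) {
    for (__int64 i = 0; i < size; i++)
        fputc(v, fp); /* buffered: the C library batches the syscalls */
    fclose(fp);       /* flushes and closes the underlying fd too */
}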

You have no problem with fwrite(). The problem seems to be your
unsigned long long int number_of_data = (unsigned int)atoi(argv[1])*1024*1024; //MB
which indeed should rather be something like
uint64_t number_of_data = atoll(argv[1])*1024ULL*1024ULL;
unsigned long long would still be OK, but unsigned int * int * int will give you an unsigned int, no matter how large your target variable is.

Same program is 10 times slower on Windows

I wrote a simple C program that copies 10 million bytes from a file and writes them in reverse order to another file (this is done one byte at a time; I know it's not efficient, but it's just for testing). I don't understand why it takes 2.5 seconds on Linux but more than 20 seconds on Windows. I run the same program, changing only the paths.
I use Windows 10 and Arch Linux; the files are on an NTFS partition.
Code on Windows:
#include <stdio.h>
#include <time.h>

void get_nth_byte(FILE *fp, int nth_index, unsigned char *output) {
    fseek(fp, nth_index, SEEK_SET);
    fread(output, sizeof(unsigned char), 1, fp);
}

int main() {
    clock_t begin = clock();
    //
    FILE *input = fopen("C:\\Users\\piero\\Desktop\\input.txt", "rb");
    FILE *output = fopen("C:\\Users\\piero\\Desktop\\output.txt", "wb");
    unsigned char byte;
    for (int i = 10000000; i > 0; i--) {
        get_nth_byte(input, i, &byte);
        fwrite(&byte, sizeof(unsigned char), 1, output);
    }
    //
    clock_t end = clock();
    double result = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f", result);
    return 0;
}
Code on Linux (identical except for the fopen paths):
FILE* input = fopen( "/run/media/piero/Windows/Users/piero/Desktop/input.txt","rb");
FILE* output = fopen("/run/media/piero/Windows/Users/piero/Desktop/output.txt","wb");
Output on Linux: 2.224549
Output on Windows: 25.349647
UPDATE
I solved the problem by using Cygwin rather than MinGW; now it takes about 4.3 seconds.
This is a great demonstration of how it's not the code we write that runs, it's the executable that the compiler makes from the code that runs.
It is possible that your Windows C compiler is not as advanced as your Linux C compiler, and is not optimizing your code as well as it could, or it's possible that the libraries that the Windows compiler is linking to for fread() and fwrite() are slower than the equivalent libraries in the Linux system.
If I had to venture my best guess, the Linux C compiler probably noticed that it would be more efficient to read more than one byte at a time, and that it could do so without affecting the semantics of your program, while the Windows compiler either didn't infer the same, or wasn't able to optimize in the same way due to some underlying proprietary filesystem thing that only Microsoft engineers understand.
I can't say for sure without a peek at the disassembled binaries.
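A minimal sketch of that batching idea (my own illustration; the buffer size is arbitrary): read a block at a time and emit its bytes in reverse, stepping backwards through the file.
#include <stdio.h>

#define CHUNK 65536

/* Copy `size` bytes of `in` to `out` in reverse order, one chunk at a time. */
void reverse_copy(FILE *in, FILE *out, long size) {
    static unsigned char buf[CHUNK];
    long pos = size;
    while (pos > 0) {
        long n = pos < CHUNK ? pos : CHUNK;
        pos -= n;
        fseek(in, pos, SEEK_SET);
        if (fread(buf, 1, (size_t)n, in) != (size_t)n)
            break; /* short read: give up in this sketch */
        for (long i = n - 1; i >= 0; i--)
            fputc(buf[i], out);
    }
}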
One of the strengths of Unix/Linux is that files are designed to be treated as streams of bytes, with it being maximally easy and efficient to seek to the n'th byte using fseek or lseek.
Non-Unix operating systems, such as Windows, tend to have to work much harder to implement those seek operations. In the worst case, they may actually need to read through the file, counting characters as they go.
Your code opens both files in binary mode, and this should reduce the need for the fseek implementation to perform any expensive emulations. In text mode, a 10x performance penalty for heavy fseek use wouldn't surprise me. I'm much more surprised you're seeing it in binary mode.
[Disclaimer: strictly speaking, in text mode fseek is not defined as seeking to an arbitrary byte offset at all, but rather, only to a position defined by the number returned by a previous call to ftell. If an implementation takes advantage of that freedom, it can reduce the performance penalty for text-mode fseek operations, also, but it then means that code like yours, that constructs positions to seek to on the assumption that they're pure byte offsets, may not work at all.]

Code to read WAV file header producing strange results? C

I need to read the header variables from a wave file and display what they are. I am using the following code, but my output has numbers far too large. I've searched for solutions for hours. Help would be much appreciated! Thanks. I got the wave soundfile format from https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
Output:
Wav file header information:
Filesize 3884 bytes
RIFF header RIFF
WAVE header WAVE
Subchunk1ID fmt
Chunk Size (based on bits used) 604962816
Subchunk1Size 268435456
Sampling Rate 288030720
Bits Per Sample 2048
AudioFormat 256
Number of channels 2048
Byte Rate 288030720
Subchunk2ID
Subchunk2Size 1684108385
Here is the source:
#include <stdio.h>
#include <stdlib.h>

typedef struct WAV_HEADER
{
    char RIFF[4];
    int ChunkSize;
    char WAVE[4];
    char fmt[4];
    int Subchunk1Size;
    short int AudioFormat;
    short int NumOfChan;
    int SamplesPerSec;
    int bytesPerSec;
    short int blockAlign;
    short int bitsPerSample;
    int Subchunk2Size;
    char Subchunk2ID[4];
} wav_hdr;

int getFileSize(FILE *inFile);

int main(int argc, char *argv[])
{
    //check startup conditions
    if (argc >= 2); //we have enough arguments -- continue
    else { printf("\nUSAGE: program requires a filename as an argument -- please try again\n"); exit(0); }
    wav_hdr wavHeader;
    FILE *wavFile;
    int headerSize = sizeof(wav_hdr), filelength = 0;
    wavFile = fopen(argv[1], "r");
    if (wavFile == NULL)
    {
        printf("Unable to open wave file\n");
        exit(EXIT_FAILURE);
    }
    fread(&wavHeader, headerSize, 1, wavFile);
    filelength = getFileSize(wavFile);
    fclose(wavFile);
    printf("\nWav file header information:\n");
    printf("Filesize\t\t\t%d bytes\n", filelength);
    printf("RIFF header\t\t\t%c%c%c%c\n", wavHeader.RIFF[0], wavHeader.RIFF[1], wavHeader.RIFF[2], wavHeader.RIFF[3]);
    printf("WAVE header\t\t\t%c%c%c%c\n", wavHeader.WAVE[0], wavHeader.WAVE[1], wavHeader.WAVE[2], wavHeader.WAVE[3]);
    printf("Subchunk1ID\t\t\t%c%c%c%c\n", wavHeader.fmt[0], wavHeader.fmt[1], wavHeader.fmt[2], wavHeader.fmt[3]);
    printf("Chunk Size (based on bits used)\t%d\n", wavHeader.ChunkSize);
    printf("Subchunk1Size\t\t\t%d\n", wavHeader.Subchunk1Size);
    printf("Sampling Rate\t\t\t%d\n", wavHeader.SamplesPerSec);   //Sampling frequency of the wav file
    printf("Bits Per Sample\t\t\t%d\n", wavHeader.bitsPerSample); //Number of bits used per sample
    printf("AudioFormat\t\t\t%d\n", wavHeader.AudioFormat);
    printf("Number of channels\t\t%d\n", wavHeader.bitsPerSample); //Number of channels (mono=1/stereo=2)
    printf("Byte Rate\t\t\t%d\n", wavHeader.bytesPerSec);          //Number of bytes per second
    printf("Subchunk2ID\t\t\t%c%c%c%c\n", wavHeader.Subchunk2ID[0], wavHeader.Subchunk2ID[1], wavHeader.Subchunk2ID[2], wavHeader.Subchunk2ID[3]);
    printf("Subchunk2Size\t\t\t%d\n", wavHeader.Subchunk2Size);
    printf("\n");
    return 0;
}

int getFileSize(FILE *inFile)
{
    int fileSize = 0;
    fseek(inFile, 0, SEEK_END);
    fileSize = ftell(inFile);
    fseek(inFile, 0, SEEK_SET);
    return fileSize;
}
So, your code basically works -- if you compile it with the same compiler and O/S that the author of the file format spec was using (32-bit Windows). You're hoping that your compiler has laid out your struct exactly as you need to match the file bytes. For example, I can compile and run it on win32 and read a WAV file perfectly -- right up to the variable part of the header whose variability you failed to code for.
Having written a great deal of code to manipulate a variety of file formats, I would advise you give up on trying to read into structs and instead make a few simple utility functions for things like "read next 4 bytes and turn them into an int".
Notice things like the "extra format bytes". Parts of the file format depend on the values in previous parts of the file format. That's why you generally need to think of it as a dynamic reading process rather than one big read to grab the headers. It's not hard to keep the result highly portable C that will work between operating systems without relying on O/S specific things like stat() or adding library dependencies for things like htonl() -- should portability (even portability to a different compiler or even just different compiler options on the same O/S) be desirable.
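A minimal sketch of such a helper (my own illustration; WAV fields are little-endian):
#include <stdio.h>

/* Read the next 4 bytes and assemble a little-endian 32-bit value,
   independent of the host's struct layout, padding, and byte order. */
static int read_le32(FILE *f, unsigned int *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return -1; /* short read */
    *out = (unsigned int)b[0]
         | ((unsigned int)b[1] << 8)
         | ((unsigned int)b[2] << 16)
         | ((unsigned int)b[3] << 24);
    return 0;
}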
It seems like you noticed the endian issue, but the way to handle it is with htonl, ntohl, htons, and ntohs. This is part of your number problem.
Read here:
http://www.beej.us/guide/bgnet/output/html/multipage/htonsman.html
Note there are a lot of other posts here on WAV files. Have you considered reading them?
Also, there are standard ways to get file information, like size, either through the Windows API on Windows or stat on Linux/Unix.
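A minimal sketch of the stat() route on Linux/Unix (my own illustration; on Windows, GetFileSizeEx plays the same role):
#include <sys/stat.h>

/* Return the size of the file at `path` in bytes, or -1 on error. */
long long file_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}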

Why can't my program save a large amount (>2GB) to a file?

I am having trouble trying to figure out why my program cannot save more than 2GB of data to a file. I cannot tell if this is a programming or environment (OS) problem. Here is my source code:
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64
#include <math.h>
#include <time.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*-------------------------------------*/
//for file mapping in Linux
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>
/*-------------------------------------*/
#define PERMS 0600
#define NEW(type) (type *) malloc(sizeof(type))
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

void write_result(char *filename, char *data, long long length) {
    int fd, fq;
    fd = open(filename, O_RDWR | O_CREAT | O_LARGEFILE, 0644);
    if (fd < 0) {
        perror(filename);
        return;
    }
    if (ftruncate(fd, length) < 0) {
        printf("[%d]-ftruncate64 error: %s\n", errno, strerror(errno));
        close(fd);
        return;
    }
    fq = write(fd, data, length);
    close(fd);
    return;
}

int main() {
    long long offset = 3000000000; // 3GB
    char *ttt;
    ttt = (char *)malloc(sizeof(char) * offset);
    printf("length->%lld\n", strlen(ttt)); // length=0
    memset(ttt, 1, offset);
    printf("length->%lld\n", strlen(ttt)); // length=3GB
    write_result("test.big", ttt, offset);
    return 1;
}
According to my tests, the program can generate a file larger than 2GB and can allocate that much memory as well.
The weird thing happens when I try to write data into the file: I checked the file, and it is empty, though it is supposed to be filled with 1s.
Can anyone be kind enough to help me with this?
You need to read a little more about C strings and what malloc and calloc do.
In your original main, ttt pointed to whatever garbage was in memory when malloc was called. This means a nul terminator (the end marker of a C string, which is binary 0) could be anywhere in the garbage returned by malloc.
Also, since malloc does not touch every byte of the allocated memory (and you're asking for a lot), you could get sparse memory, which means the memory is not actually physically available until it is read or written.
calloc allocates and fills the allocated memory with 0. It is a little more prone to fail because of this (it touches every byte allocated, so if the OS left the allocation sparse, it will not be sparse after calloc fills it).
Here's your code with fixes for the above issues.
You should also always check the return value from write and react accordingly. I'll leave the details to you, but there's a sketch of the idea after the code below...
int main() {
    long long offset = 3000000000; // 3GB
    char *ttt;
    //ttt = (char *)malloc(sizeof(char) *offset);
    ttt = (char *)calloc(sizeof(char), offset); // instead of malloc( ... )
    if (!ttt) {
        puts("calloc failed, bye bye now!");
        exit(87);
    }
    printf("length->%lld\n", strlen(ttt)); // length=0 (this now works as expected if calloc does not fail)
    memset(ttt, 1, offset);
    ttt[offset - 1] = 0; // now it's nul terminated and the printf below will work
    printf("length->%lld\n", strlen(ttt)); // length=3GB-1 (the last byte is the nul)
    write_result("test.big", ttt, offset);
    return 1;
}
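As a sketch of the return-value checking left as an exercise above (write_all is my own, hypothetical helper; write may legally write fewer bytes than requested, especially for counts this large):
#include <unistd.h>

/* Write all `len` bytes of `buf` to `fd`, retrying on partial writes.
   Returns 0 on success, -1 on error (the caller inspects errno). */
int write_all(int fd, const char *buf, long long len)
{
    long long done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, (size_t)(len - done));
        if (n < 0)
            return -1;
        done += n;
    }
    return 0;
}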
Note to Linux gurus... I know sparse may not be the correct term. Please correct me if I'm wrong as it's been a while since I've been buried in Linux minutiae. :)
Looks like you're hitting the internal file system limitation of the iDevice: ios - Enterprise app with resource files of size more than 2GB
2GB+ files are simply not possible there. If you need to store that amount of data, you should consider using some other tool or writing a file chunk manager.
I'm going to go out on a limb here and say that your problem may lie in memset().
The best thing to do here, I think, is to validate the buffer after memset()ing it:
for (unsigned long i = 0; i < 3000000000UL; i++) {
    if (ttt[i] != 1) { printf("error in data at location %lu\n", i); break; }
}
Once you've validated that the data you're trying to write is correct, then you should look into writing a smaller file such as 1GB and see if you have the same problems. Eliminate each and every possible variable and you will find the answer.

how to retrieve the large file

I am working on an application in which I need to compare 10^8 alphanumeric entries. To retrieve the entries from the file (the file size is 1.5 GB) and then compare them, I need to take less than 5 minutes, but retrieval alone is already exceeding 5 minutes. What would be an effective way to do this? I need to work with the file only. Please suggest a way out.
I'm working on Windows with 3GB of RAM and a 100GB hard disk.
Read a part of the file, sort it, write it to a temporary file.
Merge-sort the resulting files.
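A minimal sketch of the merge step for two sorted runs (my own illustration; it assumes newline-terminated entries shorter than LINE_MAX_LEN):
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 128

/* Merge two files whose lines are already sorted into `out`. */
void merge_runs(FILE *a, FILE *b, FILE *out)
{
    char la[LINE_MAX_LEN], lb[LINE_MAX_LEN];
    char *pa = fgets(la, sizeof la, a);
    char *pb = fgets(lb, sizeof lb, b);
    while (pa && pb) {
        if (strcmp(la, lb) <= 0) {
            fputs(la, out);
            pa = fgets(la, sizeof la, a);
        } else {
            fputs(lb, out);
            pb = fgets(lb, sizeof lb, b);
        }
    }
    while (pa) { fputs(la, out); pa = fgets(la, sizeof la, a); }
    while (pb) { fputs(lb, out); pb = fgets(lb, sizeof lb, b); }
}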
Error handling and header includes are not included. You need to provide DataType and cmpfunc, samples are provided. You should be able to deduce the core workings from this snippet:
#define _GNU_SOURCE /* for O_LARGEFILE */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

typedef char DataType; // is this alphanumeric?

int cmpfunc(const void *left, const void *right)
{
    /* descending order, as in the original snippet */
    return *(const DataType *)right - *(const DataType *)left;
}

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDWR | O_LARGEFILE);
    if (fd == -1)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    DataType *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 1;
    qsort(data, st.st_size / sizeof(*data), sizeof(*data), cmpfunc);
    if (0 != msync(data, st.st_size, MS_SYNC))
        return 1;
    if (-1 == munmap(data, st.st_size))
        return 1;
    if (0 != close(fd))
        return 1;
    return 0;
}
I can't imagine you can get much faster than this. Be sure you have enough virtual memory address space (1.5GB is pushing it, but will probably just work on 32-bit Linux; you'll be able to manage this on any 64-bit OS). Note that this code is "limited" to working on a POSIX compliant system.
In terms of C and efficiency, this approach puts the entire operation in the hands of the OS, and the excellent qsort algorithm.
If retrieving time is exceeding 5 min it seems that you need to look at how you are reading this file. One thing that has caused bad performance for me is that a C implementation sometimes uses thread-safe I/O operations by default, and you can gain some speed by using thread-unsafe I/O.
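A minimal sketch of that idea (my own illustration, assuming a POSIX system where the _unlocked variants skip the per-call stream lock):
#include <stdio.h>

/* Sum all bytes of `f` using the lock-free getc variant. */
long sum_bytes(FILE *f)
{
    long sum = 0;
    int c;
    flockfile(f); /* take the stream lock once, up front */
    while ((c = getc_unlocked(f)) != EOF)
        sum += c;
    funlockfile(f);
    return sum;
}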
What kind of computer will this be run on? Many computers nowadays have several gigabytes of memory, so perhaps it will work to just read it all into memory and then sort it there (with, for example, qsort)?

2GB limit on file size when using fwrite in C?

I have a short C program that writes into a file until there is no more space on disk:
#include <stdio.h>

int main(void) {
    char c[] = "abcdefghij";
    size_t rez;
    FILE *f = fopen("filldisk.dat", "wb");
    while (1) {
        rez = fwrite(c, 1, sizeof(c), f);
        if (!rez) break;
    }
    fclose(f);
    return 0;
}
When I run the program (in Linux), it stops when the file reaches 2GB.
Is there an internal limitation, due to the FILE structure, or something?
Thanks.
On a 32-bit system (i.e. the OS is 32 bits), by default, fopen and co. are limited to 32-bit sizes/offsets/etc. You need to enable large file support, or use the *64 variants:
http://www.gnu.org/software/libc/manual/html_node/Opening-Streams.html#index-fopen64-931
Your filesystem also needs to support this, but apart from FAT and other primitive filesystems, all of them support creating files > 2GB.
it stops when the file reaches 2GB. Is there an internal limitation, due to the FILE structure, or something?
This is due to the libc (the standard C library), which on an x86 (IA-32) Linux system defaults to the 32-bit functions provided by glibc (GNU's C Library). So by default the file stream size is limited to 2^(32-1) bytes.
For using Large File Support, see the web page.
#define _FILE_OFFSET_BITS 64
/* or more commonly add -D_FILE_OFFSET_BITS=64 to CFLAGS */
#include <stdio.h>

int main(void) {
    char c[] = "abcdefghij";
    size_t rez;
    FILE *f = fopen("filldisk.dat", "wb");
    while (1) {
        rez = fwrite(c, 1, sizeof(c), f);
        if (rez < sizeof(c)) { break; }
    }
    fclose(f);
    return 0;
}
Note: most systems expect fopen (and off_t) to be based on a 2^31 file size limit. Replacing them with fopen64 and off64_t makes the 64-bit support explicit and, depending on usage, might be the best way to go; however, it is not recommended in general, as they are non-standard.
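A minimal sketch of those explicit 64-bit interfaces (my own illustration; glibc-specific, as noted above):
#define _LARGEFILE64_SOURCE /* exposes the *64 interfaces in glibc */
#include <stdio.h>
#include <sys/types.h>

int main(void) {
    FILE *f = fopen64("filldisk.dat", "wb");
    if (!f)
        return 1;
    /* ... write data ... */
    if (fseeko64(f, 0, SEEK_END) == 0) {
        off64_t end = ftello64(f); /* 64-bit offset regardless of build flags */
        (void)end;
    }
    fclose(f);
    return 0;
}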
