Writing large binary file part by part

Writing large binary file part by part - c

I am trying to do a fairly easy task but am stumped by how the FILE* works. So I am writing a binary file (las file) which has a header and then the points in binary format. Since this is a huge file, I am doing it part by part. But the problem comes when the file pointer writes almost 3/4th of the file correctly and then gives entirely wrong file position pointer.
//struct points written to
struct las_points{
float x;
float y;
float z;
float time;
};
int writepoints(){
outfilename = "pointcloud_13.las";
fout = fopen(outfilename,"wb+");
// numPts - total number of points to be processed
// numRec - number of records processed every time
numRec = numPts/4;
las_points *lp;
// ending size of the file
int endSize = 0,j=0;
// header is written
// endSize = 233
for(i=0;i<numPts;i+=numRec){
lp = (las_points*)malloc(sizeof(las_points)*numRec);
// some processing done
printf("Iteration - %d\n",j);
lasWritePoints(lp,fout,numRec,endSize);
fflush(fout);
free(lp);
lp = NULL;
}
fclose(fout);
}
int lasWritePoints(las_points*lp, FILE* fout,int numRec,int &endSize){
if(!fout || !lp){
printf("File couldnt be written \n");
exit(1);
}
printf("endSize - %d \n",endSize);
fseek(fout,endSize,SEEK_SET);
for(int i=0;i<numRec;i++) {
fwrite(&lp[i].x,sizeof(float),1,fout);
fwrite(&lp[i].y,sizeof(float),1,fout);
fwrite(&lp[i].z,sizeof(float),1,fout);
fseek(fout,8,SEEK_CUR);
fwrite(&lp[i].time,sizeof(float),1,fout);
}
endSize = ftell(fout);
printf("endSize - %d \n",endSize);
}
Only the writing of the binary file is reproduced. The problem is that for the first four iterations for a file, it runs smoothly. Then at the end of last iteration, the endSize it gives is lesser than the beginning endSize.
Output:
Total size of file to be read: 1258456752
Iteration : 0
endSize : 233
endSize : 550575041
Iteration : 1
endSize : 550575041
endSize : 1101149849
Iteration : 2
endSize : 1101149849
endSize : 1651724657
Iteration : 3
endSize : 1651724657
endSize : 54815783
Can someone point out what I am doing wrong?

You are writing more bytes than can be represented by a 32-bit int (about 2 billion = 2 GB). Use a long to store the results of ftell():
long endSize = ftell(fout);
printf("endSize - %ld\n", endSize);

You need to dereference endSize.
You are assigning the file size/position to a pointer resulting in your unexpected behavior.
endSize = ftell(fout);
s/b
*endSize = ftell(fout);
This is one of the most common mistakes in C.

You are probably getting stuck with the 32-bit versions of file access functions. What platform/compiler/file system are you using?
If you are on:
Linux using gcc try adding -D_FILE_OFFSET_BITS=64 to your compiler command
Windows using mingw try using fseeko and ftello
a 32-bit Windows with Visual C++ try using _open (not quite a drop-in replacement)
a file system that has a 2GB file size limit then my condolences

Related

Reading a specific sector from a hard drive using C on Linux (GNU/Linux )

I know that a hard drive are system's files (/dev/sdXX) –then treated like files-, I have questions about this:
I tried those following lines of code but nothing positive
-------first attempt----------
int numSecteur=2;
char secteur [512];
FILE* disqueF=fopen("/dev/sda","r"); //tried "rb" and sda1 ...every thing
fseek(disqueF, numSecteur*512,SEEK_SET);
fread(secteur, 512, 1, disqueF);
fclose(disqueF);
-------Second attempt----------
int i=open("/dev/sda1",O_RDONLY);
lseek(i, 0, SEEK_SET);
read(i,secteur,512);
close(i);
------printing the results----------
printf("hex : %04x\n",secteur);
printf("string : %s\n",secteur);
Why the size is of the file /dev/sda1 is just 8 KBytes ?
How the data is stored (binary or hex….) "for the printing"
Please, I need some clues, and if someone need more details just he ask.
Thanks a lot.
Ps: Running kali 2 64bits ”debian” on VMware and i am RooT.

I tried those following lines of code but nothing positive
That's not a question.
Why the size is of the file /dev/sda1 is just 8 KBytes ?
That isn't the file size, it's the device number. (There are two parts, so the device number of sda1 is 8,1)
How the data is stored (binary or hex….) "for the printing"
Data isn't "stored for the printing". Data is stored (in electrical voltages representing binary, but you don't need to know that), and you can print it any way you want.

You cannot printf() array of chat like that, you need right print result. For example as hex dump:
for (int i = 0; i < sizeof(secteur); i++) {
printf ("%02x ", secteur[i]);
if ((i + 1) % 16 == 0)
printf ("\n");
}

gnu FORTRAN unformatted file record markers stored as 64-bit width?

I have a legacy code and some unformatted data files that it reads, and it worked with gnu-4.1.2. I don't have access to the method that originally generated these data files. When I compile this code with a newer gnu compiler (gnu-4.7.2) and attempt to load the old data files on a different computer, it is having difficulty reading them. I start by opening the file and reading in the first record which consists of three 32-bit integers:
open(unit, file='data.bin', form='unformatted', status='old')
read(unit) x,y,z
I am expecting these three integers here to describe x,y,z spans so that next it can load a 3D matrix of float values with those same dimensions. However, instead it's loading a 0 for the first value, then the next two are offset.
Expecting:
x=26, y=127, z=97 (1A, 7F, 61 in hex)
Loaded:
x=0, y=26, z=127 (0, 1A, 7F in hex)
When I checked the data file in a hex editor, I think I figured out what was happening.
The first record marker in this case has a value of 12 (0C in hex) since it's reading three integers at 4 bytes each. This marker is stored both before and after the record. However, I notice that the 32bits immediately after each record marker is 00000000. So either the record markers are treated as 64bit integers (little-Endian) or there is a 32-bit zero padding after each record marker. Either way, the code generated with the new compiler is reading the record markers as 32-bit integers and not expecting any padding. This effectively intrudes/corrupts the data being read in.
Is there an easy way to fix this non-portable issue? The old and new hardware are 64 bit architecture and so is the executable I compiled. If I try to use the older compiler version again will it solve the problem, or is it hardware dependent? I'd prefer to use the newer compilers because they are more efficient, and I really don't want to edit the source code to open all the files as access='stream' and manually read in a trailing 0 integer after each record marker, both before and after each record.
P.S. I could probably write a C++ code to alter the data files and remove these zero paddings if there is no easier alternative.

See the -frecord-marker= option in the gfortran manual. With -frecord-marker=8 you can read the old style unformatted sequential files produced by older versions of gfortran.

Seeing as how Fortran doesn't have a standardization on this, I opted to convert the data files to a new format that uses 32-bit wide record lengths instead of 64-bit wide. In case anyone needs to do this in the future I've included some Visual C++ code here that worked for me and should be easily modifiable to C or another language. I have also uploaded a Windows executable (fortrec.zip) here.
CFile OldFortFile, OutFile;
const int BUFLEN = 1024*20;
char pbuf[BUFLEN];
int i, iIn, iRecLen, iRecLen2, iLen, iRead, iError = 0;
CString strInDir = "C:\folder\";
CString strIn = "file.dat";
CString strOutDir = "C:\folder\fortnew\"
system("mkdir \"" + strOutDir + "\""); //create a subdir to hold the output files
strIn = strInDir + strIn;
strOut = strOutDir + strIn;
if(OldFortFile.Open(strIn,CFile::modeRead|CFile::typeBinary)) {
if(OutFile.Open(strOut,CFile::modeCreate|CFile::modeWrite|CFile::typeBinary)) {
while(true) {
iRead = OldFortFile.Read(&iRecLen, sizeof(iRecLen)); //Read the record's raw data
if (iRead < sizeof(iRecLen)) //end of file reached
break;
OutFile.Write(&iRecLen, sizeof(iRecLen));//Write the record's raw data
OldFortFile.Read(&iIn, sizeof(iIn));
if (iIn != 0) {//this is the padding we need to ignore, ensure it's always zero
//Padding not found
iError++;
break;
}
i = iRecLen;
while (i > 0) {
iLen = (i > BUFLEN) ? BUFLEN : i;
OldFortFile.Read(&pbuf[0], iLen);
OutFile.Write(&pbuf[0], iLen);
i -= iLen;
}
if (i != 0) { //Buffer length mismatch
iError++;
break;
}
OldFortFile.Read(&iRecLen2, sizeof(iRecLen2));
if (iRecLen != iRecLen2) {//ensure we have reached the end of the record proeprly
//Record length mismatch
iError++;
break;
}
OutFile.Write(&iRecLen2, sizeof(iRecLen));
OldFortFile.Read(&iIn, sizeof(iIn));
if (iIn != 0) {//this is the padding we need to ignore, ensure it's always zero
//Padding not found
break;
}
}
OutFile.Close();
OldFortFile.Close();
}
else { //Could not create the ouput file.
OldFortFile.Close();
return;
}
}
else { //Could not open the input file
}
if (iError == 0)
//File successfully converted
else
//Encountered error

Create an array of values from different text files in C

I'm working in C on 64-bit Ubuntu 14.04.
I have a number of .txt files, each containing lines of floating point values (1 value per line). The lines represent parts of a complex sample, and they're stored as real(a1) \n imag(a1) \n real(a2) \n imag(a2), if that makes sense.
In a specific scenario there are 4 text files each containing 32768 samples (thus 65536 values), but I need to make the final version dynamic to accommodate up to 32 files (the maximum samples per file would not exceed 32768 though). I'll only be reading the first 19800 samples (depending on other things) though, since the entire signal is contained in those 39600 points (19800 samples).
A common abstraction is to represent the files / samples as a matrix, where columns represent return signals and rows represent the value of each signal at a sampling instant, up until the maximum duration.
What I'm trying to do is take the first sample from each return signal and move it into an array of double-precision floating point values to do some work on, move on to the second sample for each signal (which will overwrite the previous array) and do some work on them, and so forth, until the last row of samples have been processed.
Is there a way in which I can dynamically open files for each signal (depending on the number of pulses I'm using in that particular instance), read the first sample from each file into a buffer and ship that off to be processed. On the next iteration, the file pointers will all be aligned to the second sample, it would then move those into an array and ship it off again, until the desired amount of samples (19800 in our hypothetical case) has been reached.
I can read samples just fine from the files using fscanf:
rx_length = 19800;
int x;
float buf;
double *range_samples = calloc(num_pulses, 2 * sizeof(range_samples));
for (i=0; i < 2 * rx_length; i++){
x = fscanf(pulse_file, "%f", &buf);
*(range_samples) = buf;
}
All that needs to happen (in my mind) is that I need to cycle both sample# and pulse# (in that order), so when finished with one pulse it would move on to the next set of samples for the next pulse, and so forth. What I don't know how to do is to somehow declare file pointers for all return signal files, when the number of them can vary inbetween calls (e.g. do the whole thing for 4 pulses, and on the next call it can be 16 or 64).
If there are any ideas / comments / suggestions I would love to hear them.
Thanks.

I would make the code you posted a function that takes an array of file names as an argument:
void doPulse( const char **file_names, const int size )
{
FILE *file = 0;
// declare your other variables
for ( int i = 0; i < size; ++i )
{
file = fopen( file_names[i] );
// make sure file is open
// do the work on that file
fclose( file );
file = 0;
}
}

What you need is a generator. It would be reasonably easy in C++, but as you tagged C, I can imagine a function, taking a custom struct (the state of the object) as parameter. It could be something like (pseudo code) :
struct GtorState {
char *files[];
int filesIndex;
FILE *currentFile;
};
void gtorInit(GtorState *state, char **files) {
// loads the array of file into state, set index to 0, and open first file
}
int nextValue(GtorState *state, double *real, double *imag) {
// read 2 values from currentFile and affect them to real and imag
// if eof, close currentFile and open files[++currentIndex]
// if real and imag were found returns 0, else 1 if eof on last file, 2 if error
}
Then you main program could contain :
GtorState state;
// initialize the list of files to process
gtorInit(&state, files);
double real, imag);
int cr;
while (0 == (cr = nextValue(&state, &real, &imag)) {
// process (real, imag)
}
if (cr == 2) {
// process (at least display) error
}
Alternatively, your main program could iterate the values of the different files and call a function with state analog of the above generator that processes the values, and at the end uses the state of the processing function to get the results.

Tried a slightly different approach and it's working really well.
In stead of reading from the different files each time I want to do something, I read the entire contents of each file into a 2D array range_phase_data[sample_number][pulse_number], and then access different parts of the array depending on which range bin I'm currently working on.
Here's an excerpt:
#define REAL(z,i) ((z)[2*(i)])
#define IMAG(z,i) ((z)[2*(i)+1])
for (i=0; i<rx_length; i++){
printf("\t[%s] Range bin %i. Samples %i to %i.\n", __FUNCTION__, i, 2*i, 2*i+1);
for (j=0; j<num_pulses; j++){
REAL(fft_buf, j) = range_phase_data[2*i][j];
IMAG(fft_buf, j) = range_phase_data[2*i+1][j];
}
printf("\t[%s] Range bin %i done, ready to FFT.\n", __FUNCTION__, i);
// do stuff with the data
}
This alleviates the need to dynamically allocate file pointers and in stead just opens the files one at a time and writes the data to the corresponding column in the matrix.
Cheers.

reading a raw audio file as Matlab does in C

I have the following small script that I want to write in C :
`%% getting the spectgrum
clear, clc ;
fileName ='M0.raw'
[x,fs] = audioread(fileName);
[xPSD,f] = pwelch(x,hanning(8192),0,8192*4 ,fs);
plot(f,10*log10(abs(xPSD)));
xlim([0 22e3]);
absxPSD = abs(xPSD);
save('absXPSD.txt','absxPSD','-ascii');
save('xPSD.txt','xPSD','-ascii');
save('xValues.txt','x','-ascii');
save('frequency.txt','f','-ascii');`
without goning in details, I have a problem getting the correct result, when I checked I figured out that the data that I read is wrong here's sample that read the raw file to compare it with what Matalb reads :
#include <stdio.h>
#include <stdlib.h>
int main (){
FILE* inp =NULL;
FILE* oup =NULL;
double value =0;
inp = fopen("M0.raw","r");
oup = fopen("checks.txt","w+");
UPDATE
after LoPiTaL's answer I've tried to jump over the RIFF header which is 44Byte length0 using fseek
fseek (inp,352,SEEK_SET);// that didn't help getting the right result !!
if( inp == NULL || oup==NULL){
printf(" error at file opning \n");
return -1;
}
while (!(feof(inp))){
fread(&value,sizeof(double),1,inp);
printf(" %f \n ",value);
fprintf(oup,"%f\n",value);
}
fclose(inp);
fclose(oup);
return 0;
}
and the result that I get is :
-28083683309813134333858080554409220100578902032859386180468433149049781495379346137536863936326139303879846829175766826833343673613788446579155215033623707200818670767132304934425064429529496303287641688697019947073799877821581901737052884168025721481955133510652655692037990001524306465271815108431928360960.000000
0.000000
0.000000
0.000000
0.000000
0.000000
-20701636078248669570005757343846586744027511881225108933223144646890577802102653022204406730988428912367583701134782419138464527797567258583836429190479797597328678189654150340845........................................................................
and my aim is to get those value :
-1.0162354e-02
-9.3688965e-03
-7.5073242e-03
-1.9531250e-03
3.7231445e-03
1.3549805e-02
2.3223877e-02
3.2867432e-02
4.4830322e-02
5.5114746e-02
6.7291260e-02
7.7636719e-02
8.8562012e-02
9.5794678e-02
1.0055542e-01
1.0415649e-01
1.0351563e-01
1.0235596e-01
9.8785400e-02
9.1796875e-02
8.3648682e-02
7.1594238e-02
the audio file is mono an is 16bit resolution , any idea how can solve this ? thanks for any help

You must open the file in binary mode, for starters. Otherwise you get text mode, which can do translations of line endings, for instance. Not good with binary data.
Binary mode:
inp = fopen("M0.raw", "rb");
^
|
muy
importante

Sure, you cannot read an audio file as is and hope that the data is as you think it is.
Ignoring any coded audio file, which of course you have to decode prior to read it, lets focus in the RAW audio files:
RAW audio files usually are WAV files. WAV files have a .RIFF header at the beginning of the file, which obviously you would have to ignore before reading audio data.
http://en.wikipedia.org/wiki/Resource_Interchange_File_Format
After you have removed the RIFF header, then the data starts.
As you stated, the data is encoded as 16 bit resolution. 16 bit resolution means that 0x0000 is 0.0 and 0xFFFF is 1.0, and the size of the data is only two bytes!
So you have to read two bytes at a time (i.e. with a signed short) and then do the conversion to the range 0 to 1:
signed short ss;
double value;
FILE* inp =NULL;
inp = fopen("M0.raw","rb"); //As stated in other answer, use binary mode!
fseek (inp,44,SEEK_SET); // Only 44 bytes!!
//We already have discarded the header here....
while (fread(&ss, sizeof(signed short) ,1 , inp) == 1){
//Now we have to convert from signed short to double:
value=((double)ss)/(unsigned)0xFFFF;
//Print the results:
printf(" %f \n ",value);
fprintf(oup,"%f\n",value);
}
Of course, the function "audioread" from Matlab already does all of this for you, so you don't have to care about the encoding, as in your example, your particular data is in 16 bit, but if you use any other file, it could be 8, 16, 24 or 32, even could be differential or be encoded despite being a WAV file (see the RIFF header for more information).

What could cause a Labwindows/CVI C program to hate the number 2573?

Using Windows
So I'm reading from a binary file a list of unsigned int data values. The file contains a number of datasets listed sequentially. Here's the function to read a single dataset from a char* pointing to the start of it:
function read_dataset(char* stream, t_dataset *dataset){
//...some init, including setting dataset->size;
for(i=0;i<dataset->size;i++){
dataset->samples[i] = *((unsigned int *) stream);
stream += sizeof(unsigned int);
}
//...
}
Where read_dataset in such a context as this:
//...
char buff[10000];
t_dataset* dataset = malloc( sizeof( *dataset) );
unsigned long offset = 0;
for(i=0;i<number_of_datasets; i++){
fseek(fd_in, offset, SEEK_SET);
if( (n = fread(buff, sizeof(char), sizeof(*dataset), fd_in)) != sizeof(*dataset) ){
break;
}
read_dataset(buff, *dataset);
// Do something with dataset here. It's screwed up before this, I checked.
offset += profileSize;
}
//...
Everything goes swimmingly until my loop reads the number 2573. All of a sudden it starts spitting out random and huge numbers.
For example, what should be
...
1831
2229
2406
2637
2609
2573
2523
2247
...
becomes
...
1831
2229
2406
2637
2609
0xDB00000A
0xC7000009
0xB2000008
...
If you think those hex numbers look suspicious, you're right. Turns out the hex values for the values that were changed are really familiar:
2573 -> 0xA0D
2523 -> 0x9DB
2247 -> 0x8C7
So apparently this number 2573 causes my stream pointer to gain a byte. This remains until the next dataset is loaded and parsed, and god forbid it contain a number 2573. I have checked a number of spots where this happens, and each one I've checked began on 2573.
I admit I'm not so talented in the world of C. What could cause this is completely and entirely opaque to me.

You don't specify how you obtained the bytes in memory (pointed to by stream), nor what platform you're running on, but I wouldn't be surprised to find your on Windows, and you used the C stdio library call fopen(filename "r"); Try using fopen(filename, "rb");. On Windows (and MS-DOS), fopen() translates MS-DOS line endings "\r\n" (hex 0x0D 0x0A) in the file to Unix style "\n", unless you append "b" to the file mode to indicate binary.

A couple of irrelevant points.
sizeof(*dataset) doesn't do what you think it does.
There is no need to use seek on every read
I don't understand how you are calling a function that only takes one parameter but you are giving it two (or at least I don't understand why your compiler doesn't object)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight