Reading a particular line by line number in a very large file - database

The file will not fit into memory. It is over 100GB and I want to access specific lines by line number. I do not want to count line by line until I reach it.
I have read http://docstore.mik.ua/orelly/perl/cookbook/ch08_09.htm
When I built an index using the following methods, the line return works up to a certain point. Once the line number is very large, the line being returned is the same. When I go to the specific line in the file the same line is returned. It seems to work for line numbers 1 through 350000 (approximately);
# usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
sub build_index {
my $data_file = shift;
my $index_file = shift;
my $offset = 0;
while (<$data_file>) {
print $index_file pack("N", $offset);
$offset = tell($data_file);
}
}
# usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
# returns line or undef if LINE_NUMBER was out of range
sub line_with_index {
my $data_file = shift;
my $index_file = shift;
my $line_number = shift;
my $size; # size of an index entry
my $i_offset; # offset into the index of the entry
my $entry; # index entry
my $d_offset; # offset into the data file
$size = length(pack("N", 0));
$i_offset = $size * ($line_number-1);
seek($index_file, $i_offset, 0) or return;
read($index_file, $entry, $size);
$d_offset = unpack("N", $entry);
seek($data_file, $d_offset, 0);
return scalar(<$data_file>);
}
I've also tried using the DB_file method, but it seems to take a very long time to do the tie. I also don't really understand what it means for "DB_RECNO access method ties an array to a file, one line per array element." Tie does not read the file into the array correct?

pack N creates a 32-bit integer. The maximum 32-bit integer is 4GB, so using that to store indexes into a file that's 100GB in size won't work.
Some builds use 64-bit integers. On those, you could use j.
Some builds use 32-bit integers. tell returns a floating-point number on those, allowing you to index files up to 8,388,608 GB in size losslessly. On those, you should use F.
Portable code would look as follows:
use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
...
print $index_file pack($off_t, $offset);
...
Note: I'm assuming the index file is only used by the same Perl that built it (or at least one with with the same integer size, seek size and machine endianness). Let me know if that assumption doesn't hold for you.

Related

gnu FORTRAN unformatted file record markers stored as 64-bit width?

I have a legacy code and some unformatted data files that it reads, and it worked with gnu-4.1.2. I don't have access to the method that originally generated these data files. When I compile this code with a newer gnu compiler (gnu-4.7.2) and attempt to load the old data files on a different computer, it is having difficulty reading them. I start by opening the file and reading in the first record which consists of three 32-bit integers:
open(unit, file='data.bin', form='unformatted', status='old')
read(unit) x,y,z
I am expecting these three integers here to describe x,y,z spans so that next it can load a 3D matrix of float values with those same dimensions. However, instead it's loading a 0 for the first value, then the next two are offset.
Expecting:
x=26, y=127, z=97 (1A, 7F, 61 in hex)
Loaded:
x=0, y=26, z=127 (0, 1A, 7F in hex)
When I checked the data file in a hex editor, I think I figured out what was happening.
The first record marker in this case has a value of 12 (0C in hex) since it's reading three integers at 4 bytes each. This marker is stored both before and after the record. However, I notice that the 32bits immediately after each record marker is 00000000. So either the record markers are treated as 64bit integers (little-Endian) or there is a 32-bit zero padding after each record marker. Either way, the code generated with the new compiler is reading the record markers as 32-bit integers and not expecting any padding. This effectively intrudes/corrupts the data being read in.
Is there an easy way to fix this non-portable issue? The old and new hardware are 64 bit architecture and so is the executable I compiled. If I try to use the older compiler version again will it solve the problem, or is it hardware dependent? I'd prefer to use the newer compilers because they are more efficient, and I really don't want to edit the source code to open all the files as access='stream' and manually read in a trailing 0 integer after each record marker, both before and after each record.
P.S. I could probably write a C++ code to alter the data files and remove these zero paddings if there is no easier alternative.
See the -frecord-marker= option in the gfortran manual. With -frecord-marker=8 you can read the old style unformatted sequential files produced by older versions of gfortran.
Seeing as how Fortran doesn't have a standardization on this, I opted to convert the data files to a new format that uses 32-bit wide record lengths instead of 64-bit wide. In case anyone needs to do this in the future I've included some Visual C++ code here that worked for me and should be easily modifiable to C or another language. I have also uploaded a Windows executable (fortrec.zip) here.
CFile OldFortFile, OutFile;
const int BUFLEN = 1024*20;
char pbuf[BUFLEN];
int i, iIn, iRecLen, iRecLen2, iLen, iRead, iError = 0;
CString strInDir = "C:\folder\";
CString strIn = "file.dat";
CString strOutDir = "C:\folder\fortnew\"
system("mkdir \"" + strOutDir + "\""); //create a subdir to hold the output files
strIn = strInDir + strIn;
strOut = strOutDir + strIn;
if(OldFortFile.Open(strIn,CFile::modeRead|CFile::typeBinary)) {
if(OutFile.Open(strOut,CFile::modeCreate|CFile::modeWrite|CFile::typeBinary)) {
while(true) {
iRead = OldFortFile.Read(&iRecLen, sizeof(iRecLen)); //Read the record's raw data
if (iRead < sizeof(iRecLen)) //end of file reached
break;
OutFile.Write(&iRecLen, sizeof(iRecLen));//Write the record's raw data
OldFortFile.Read(&iIn, sizeof(iIn));
if (iIn != 0) {//this is the padding we need to ignore, ensure it's always zero
//Padding not found
iError++;
break;
}
i = iRecLen;
while (i > 0) {
iLen = (i > BUFLEN) ? BUFLEN : i;
OldFortFile.Read(&pbuf[0], iLen);
OutFile.Write(&pbuf[0], iLen);
i -= iLen;
}
if (i != 0) { //Buffer length mismatch
iError++;
break;
}
OldFortFile.Read(&iRecLen2, sizeof(iRecLen2));
if (iRecLen != iRecLen2) {//ensure we have reached the end of the record proeprly
//Record length mismatch
iError++;
break;
}
OutFile.Write(&iRecLen2, sizeof(iRecLen));
OldFortFile.Read(&iIn, sizeof(iIn));
if (iIn != 0) {//this is the padding we need to ignore, ensure it's always zero
//Padding not found
break;
}
}
OutFile.Close();
OldFortFile.Close();
}
else { //Could not create the ouput file.
OldFortFile.Close();
return;
}
}
else { //Could not open the input file
}
if (iError == 0)
//File successfully converted
else
//Encountered error

Converting code from Matlab to C?

I need to write a code that will:
Read a file that is a source and it is "emitting" 8-bytes signs (letters, numbers). The code needs to add +1 in the accumulator that is the size of 2^8^n every time a sign occurs. After I can calculate the entropy. The input parameters are the file and "n" that is 8 bytes * n (from 3 to 5).
Example:
if we read from file 10001110|11110000|11111111|00001011|10101010, for n = 3, we put +1 in the accumulator at 100011101111000011111111 and proceed with +1 at 111100001111111100001011 and so on...
The main problem is speed of this process. I need to read files that are max 50MB. Programming language can be anything from C, Matlab or Java.
Code so far in Matlab for n = 1. I need a good implementation for accumulator, so its not ful size from the very beginning...
load B.mat; %just some small part of file
file = dec2bin(B);
file = str2num(file);
[fileSize, ~] = size(file);
%Choose n
n = 1;
%Make accumulator
acu10 = linspace(0, 2^8^n-1, 2^8^n);
acu = dec2bin(acu10);
acu = str2num(acu);
index = zeros(size(acu10))';
%go through file and add +1 in accumulator
for i = 1:n:fileSize
for j = 1:size(acu)
if acu(j,:) == file(i);
index(j) = index(j) + 1;
end
end
end

Create an array of values from different text files in C

I'm working in C on 64-bit Ubuntu 14.04.
I have a number of .txt files, each containing lines of floating point values (1 value per line). The lines represent parts of a complex sample, and they're stored as real(a1) \n imag(a1) \n real(a2) \n imag(a2), if that makes sense.
In a specific scenario there are 4 text files each containing 32768 samples (thus 65536 values), but I need to make the final version dynamic to accommodate up to 32 files (the maximum samples per file would not exceed 32768 though). I'll only be reading the first 19800 samples (depending on other things) though, since the entire signal is contained in those 39600 points (19800 samples).
A common abstraction is to represent the files / samples as a matrix, where columns represent return signals and rows represent the value of each signal at a sampling instant, up until the maximum duration.
What I'm trying to do is take the first sample from each return signal and move it into an array of double-precision floating point values to do some work on, move on to the second sample for each signal (which will overwrite the previous array) and do some work on them, and so forth, until the last row of samples have been processed.
Is there a way in which I can dynamically open files for each signal (depending on the number of pulses I'm using in that particular instance), read the first sample from each file into a buffer and ship that off to be processed. On the next iteration, the file pointers will all be aligned to the second sample, it would then move those into an array and ship it off again, until the desired amount of samples (19800 in our hypothetical case) has been reached.
I can read samples just fine from the files using fscanf:
rx_length = 19800;
int x;
float buf;
double *range_samples = calloc(num_pulses, 2 * sizeof(range_samples));
for (i=0; i < 2 * rx_length; i++){
x = fscanf(pulse_file, "%f", &buf);
*(range_samples) = buf;
}
All that needs to happen (in my mind) is that I need to cycle both sample# and pulse# (in that order), so when finished with one pulse it would move on to the next set of samples for the next pulse, and so forth. What I don't know how to do is to somehow declare file pointers for all return signal files, when the number of them can vary inbetween calls (e.g. do the whole thing for 4 pulses, and on the next call it can be 16 or 64).
If there are any ideas / comments / suggestions I would love to hear them.
Thanks.
I would make the code you posted a function that takes an array of file names as an argument:
void doPulse( const char **file_names, const int size )
{
FILE *file = 0;
// declare your other variables
for ( int i = 0; i < size; ++i )
{
file = fopen( file_names[i] );
// make sure file is open
// do the work on that file
fclose( file );
file = 0;
}
}
What you need is a generator. It would be reasonably easy in C++, but as you tagged C, I can imagine a function, taking a custom struct (the state of the object) as parameter. It could be something like (pseudo code) :
struct GtorState {
char *files[];
int filesIndex;
FILE *currentFile;
};
void gtorInit(GtorState *state, char **files) {
// loads the array of file into state, set index to 0, and open first file
}
int nextValue(GtorState *state, double *real, double *imag) {
// read 2 values from currentFile and affect them to real and imag
// if eof, close currentFile and open files[++currentIndex]
// if real and imag were found returns 0, else 1 if eof on last file, 2 if error
}
Then you main program could contain :
GtorState state;
// initialize the list of files to process
gtorInit(&state, files);
double real, imag);
int cr;
while (0 == (cr = nextValue(&state, &real, &imag)) {
// process (real, imag)
}
if (cr == 2) {
// process (at least display) error
}
Alternatively, your main program could iterate the values of the different files and call a function with state analog of the above generator that processes the values, and at the end uses the state of the processing function to get the results.
Tried a slightly different approach and it's working really well.
In stead of reading from the different files each time I want to do something, I read the entire contents of each file into a 2D array range_phase_data[sample_number][pulse_number], and then access different parts of the array depending on which range bin I'm currently working on.
Here's an excerpt:
#define REAL(z,i) ((z)[2*(i)])
#define IMAG(z,i) ((z)[2*(i)+1])
for (i=0; i<rx_length; i++){
printf("\t[%s] Range bin %i. Samples %i to %i.\n", __FUNCTION__, i, 2*i, 2*i+1);
for (j=0; j<num_pulses; j++){
REAL(fft_buf, j) = range_phase_data[2*i][j];
IMAG(fft_buf, j) = range_phase_data[2*i+1][j];
}
printf("\t[%s] Range bin %i done, ready to FFT.\n", __FUNCTION__, i);
// do stuff with the data
}
This alleviates the need to dynamically allocate file pointers and in stead just opens the files one at a time and writes the data to the corresponding column in the matrix.
Cheers.

C: Write to a specific line in the text file without searching

Hello I have file with text:
14
5 4
45 854
14
4
47 5
I need to write a text to a specific line. For example to the line number 4 (Doesn't matter whether I will append the text or rewrite the whole line):
14
5 4
45 854
14 new_text
4
47 5
I have found function fseek(). But in the documentation is written
fseek(file pointer,offset, position);
"Offset specifies the number of positions (bytes) to be moved from the location specified bt the position."
But I do not know the number of bites. I only know the number of lines. How to do that? Thank you
You can't do that, (text) files are not line-addressable.
Also, you can't insert data in the middle of a file.
The best way is to "spool" to a new file, i.e. read the input line by line, and write that to a new file which is the output. You can then easily keep track of which line you're on, and do whatever you want.
I will assume that you are going to be doing this many times for a single file, as such you would be better indexing the position of each newline char, for example you could use a function like this:
long *LinePosFind(int FileDes)
{
long * LinePosArr = malloc(500 * sizeof(long));
char TmpChar;
long LinesRead = 0;
long CharsRead = 0;
while(1 == read(FileDes, &TmpChar, 1))
{
if (!(LinesRead % 500)
{
LinePosArr = realloc(LinePosArr, (LinesRead + 500) * sizeof(long));
}
if (TmpChar == '\n')
{
LinePosArr[LinesRead++] = CharsRead;
}
CharsRead++;
}
return LinePosArr;
}
Then you can save the index of all the newlines for repeated use.
After this you can use it like so:
long *LineIndex = LinePosFind(FileDes);
long FourthLine = LineIndex[3];
Note I have not checked this code, just written from my head so it may need fixes, also, you should add some error checking for the malloc and read and realloc if you are using the code in production.

zero padding a signal in MATLAB

I have an audio signal of length 12769. I'm trying to perform STFT on it by breaking it into small windows of 1024 samples. This gives me with 12 exact windows while there are 481 points remaining. Since i need 543 (1024 - 481) more points to make up 1024 samples, i used the following code to zero pad.
f = [a zeros(1,542)];
where a is the audio file.
However i get an error saying
??? Error using ==> horzcat
CAT arguments dimensions are not consistent.
How can I overcome this?
Your vector a is a column vector and cannot be concatenated with row vector zeros(1,542). Use zeros(542,1) instead.
However, it is much easier to just use
f = a;
f(1024*ceil(end/1024)) = 0;
MATLAB will zero pad the vector up to element 1024, and it is independent of the array being column or row.
You can either remove the excess 481 samples using
Total_Samples = length(a);
for i=1 : Total_Samples-481
a_new[i] = a[i];
or you could add an additional 543 Zero samples by using
Total_Samples = length(a);
for i=Total_Samples+1 : Total_Samples+543
a[i] = 0 ;

Resources