I need to write a code that will:
Read a file that is a source and it is "emitting" 8-bytes signs (letters, numbers). The code needs to add +1 in the accumulator that is the size of 2^8^n every time a sign occurs. After I can calculate the entropy. The input parameters are the file and "n" that is 8 bytes * n (from 3 to 5).
Example:
if we read from file 10001110|11110000|11111111|00001011|10101010, for n = 3, we put +1 in the accumulator at 100011101111000011111111 and proceed with +1 at 111100001111111100001011 and so on...
The main problem is speed of this process. I need to read files that are max 50MB. Programming language can be anything from C, Matlab or Java.
Code so far in Matlab for n = 1. I need a good implementation for accumulator, so its not ful size from the very beginning...
load B.mat; %just some small part of file
file = dec2bin(B);
file = str2num(file);
[fileSize, ~] = size(file);
%Choose n
n = 1;
%Make accumulator
acu10 = linspace(0, 2^8^n-1, 2^8^n);
acu = dec2bin(acu10);
acu = str2num(acu);
index = zeros(size(acu10))';
%go through file and add +1 in accumulator
for i = 1:n:fileSize
for j = 1:size(acu)
if acu(j,:) == file(i);
index(j) = index(j) + 1;
end
end
end
Related
I have an array containing the Fourier transform of an input audio signal (the amplitudes corresponding to various frequencies). I wish to select specific ranges of the signal without using the inbuilt functions. I have performed the following simple operations for the same :
[audio_in,audio_freq_sampl]=audioread('F:\Signals and Systems\Take Me Home Country Roads (John Denver Cover).wav');
Length_audio=length(audio_in);
df=audio_freq_sampl/Length_audio;
frequency_audio=-audio_freq_sampl/2:df:audio_freq_sampl/2-df;
figure
FFT_audio_in=fft(audio_in);
n = length(FFT_audio_in);
init = 30000;
fin = 40000;
conji= mod((n-init+2),n) ;
conjf= mod((n-fin+2),n) ;
fs_1(1:n) = 0.0 ;
fs_1(init:fin) = FFT_audio_in(init:fin);
fs_1(conji:conjf) = FFT_audio_in(conji:conjf);
plot(frequency_audio,abs(fs_1));
As we can see here, there is only one peak. the other one should be visible at the other end of the plot in the range.
The song can be found here - https://www.youtube.com/watch?v=WF046Z5tPJE
The song must be converted to a .wav file before reading it.
The above code should give me a plot containing two small peaks corresponding to the ranges of frequencies - (init , fin) and (conji , conjf). However, I am getting a peak only corresponding to the 1st range. Both these ranges are within the size of the array - FFT_audio_in.
The error lies in the following lines of code :
n = length(FFT_audio_in);
init = 30000;
fin = 40000;
conji= mod((n-init+2),n) ;
conjf= mod((n-fin+2),n) ;
fs_1(1:n) = 0.0 ;
fs_1(init:fin) = FFT_audio_in(init:fin);
fs_1(conji:conjf) = FFT_audio_in(conji:conjf);
Turns out that in the above lines of code, conji < conjf. So in the last line of the code, I am spanning through the vector with the initial point being larger than the final point. However, with the following lines of code, this issue is resolved :
n = length(FFT_audio_in);
init = n/4;
fin = n/4 + 10000;
conji= mod((n-init+2),n) ;
conjf= mod((n-fin+2),n) ;
init_2 = min(conji , conjf);
fin_2 = max(conji , conjf);
fs_1(1:n) = 0.0 ;
fs_1(init:fin) = FFT_audio_in(init:fin);
fs_1(init_2:fin_2) = FFT_audio_in(init_2:fin_2);
I'm writing a code to solve Ax=b using MATLAB's x=A\B. I believe my problem lies within getting the data from the files into the array. Right now, the solution vector is coming out to be a load of 0's
The matrices I'm using have 10 rows respectively. They are aligned correctly in the text files.
% solve a linear system Ax = b by reading A and b from input file
% and then writing x on output file.
clear;
clc;
input_filename = 'my_input.txt';
output_filename = 'my_output.txt';
% read data from file
fileID = fopen('a_matrix.txt', 'r');
formatSpec = '%d %f';
sizeA = [10 Inf];
A = load('b_matrix.txt');
A = A'
file2ID = fopen('b_matrix.txt','r');
formatSpec2 = '%d %f';
sizeB = [10 Inf];
b = load('b_matrix.txt');
fclose(file2ID);
b = b'
% solve the linear system
x = A\b;
% write output data on file
dlmwrite('my_output.txt',x,'delimiter',',','precision',4);
% print screen
fprintf('Solution vector is: \n');
fprintf('%4.2f \n', x);
I answered my own question but I felt the need to share in case anyone else has similar troubles.
% solve a linear system Ax = b by reading A and b from input file
% and then writing x on output file.
clear;
clc;
input_filename = 'my_input.txt';
output_filename = 'my_output.txt';
% read data from file
f = textread('a_matrix.txt', '%f');
vals = reshape(f, 11, []).';
A = vals(:,1:10);
b = vals(:,11);
% solve the linear system
x = A\b;
% write output data on file
dlmwrite('my_output.txt',x,'delimiter',',','precision',4);
% print screen
fprintf('Solution vector is: \n');
fprintf('%4.2f \n', x);
I ended up combining the 'a' and 'b' matrix into a single text file for simplicity. Now, MATLAB reads data in by columns, so it is necessary to use 'reshape' in order to fit the data within the array correctly. Then, I filtered out the information from the single matrix by columns, using the 'vals' function as seen in my code. The 'A' matrix is essentially all numbers in columns 1 through 10, while the 'B' matrix is the 11th (and final) column.
Using MATLAB's x=A\b function, I was able to solve the linear system of equations.
The Overview
I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.
I am migrating working code from higher-level calls, because I have a stream of bytes coming in and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes that contains a group of tokens of interest — my input is logically divided into groups of these chunks).
A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.
Within a set, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.
The Problem
The issue is that my bz_stream structure instance, which keeps tracks of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to claim to be writing more compressed bytes than the total bytes in the final, compressed file.
For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).
My rough pseudocode is in two parts.
The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):
bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;
/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0);
/* bzError checking... */
do
{
/* read some bytes into myBuffer... */
/* compress bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
/* error checking... */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError == BZ_OK);
}
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);
Now we wrap up the output:
/* read in the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`... */
/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
/* bzError error checking... */
/* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError != BZ_STREAM_END);
/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);
/* bzError checking... */
The Questions
Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?
I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.
Alternatively, am I not understanding the correct use of the bz_stream state flags?
For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
Likewise, can the following statement compress data, so long as there are at least some bytes are available to access from the bzStream.next_in pointer (BZ_RUN), and then the stream is wrapped up when there are no more bytes available (BZ_FINISH)?
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?
There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.
In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly:
bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
The rest of the calculations follow.
The file will not fit into memory. It is over 100GB and I want to access specific lines by line number. I do not want to count line by line until I reach it.
I have read http://docstore.mik.ua/orelly/perl/cookbook/ch08_09.htm
When I built an index using the following methods, the line return works up to a certain point. Once the line number is very large, the line being returned is the same. When I go to the specific line in the file the same line is returned. It seems to work for line numbers 1 through 350000 (approximately);
# usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
sub build_index {
my $data_file = shift;
my $index_file = shift;
my $offset = 0;
while (<$data_file>) {
print $index_file pack("N", $offset);
$offset = tell($data_file);
}
}
# usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
# returns line or undef if LINE_NUMBER was out of range
sub line_with_index {
my $data_file = shift;
my $index_file = shift;
my $line_number = shift;
my $size; # size of an index entry
my $i_offset; # offset into the index of the entry
my $entry; # index entry
my $d_offset; # offset into the data file
$size = length(pack("N", 0));
$i_offset = $size * ($line_number-1);
seek($index_file, $i_offset, 0) or return;
read($index_file, $entry, $size);
$d_offset = unpack("N", $entry);
seek($data_file, $d_offset, 0);
return scalar(<$data_file>);
}
I've also tried using the DB_file method, but it seems to take a very long time to do the tie. I also don't really understand what it means for "DB_RECNO access method ties an array to a file, one line per array element." Tie does not read the file into the array correct?
pack N creates a 32-bit integer. The maximum 32-bit integer is 4GB, so using that to store indexes into a file that's 100GB in size won't work.
Some builds use 64-bit integers. On those, you could use j.
Some builds use 32-bit integers. tell returns a floating-point number on those, allowing you to index files up to 8,388,608 GB in size losslessly. On those, you should use F.
Portable code would look as follows:
use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
...
print $index_file pack($off_t, $offset);
...
Note: I'm assuming the index file is only used by the same Perl that built it (or at least one with with the same integer size, seek size and machine endianness). Let me know if that assumption doesn't hold for you.
I've got some code, and I've been trying to make some minor tweaks to it. It used to use fgets to load in a single character from a line, and use it to colour points in a 3D plot. So it would read
a
p
p
n
c
and then use other data files to assign what x, y, z points to give these. The result is a really pretty 3D plot.
I've edited the input file so it reads
0
1
1
0
2
2
0
and I want it to colour numbers the same colour.
This is where I've gotten so far with the code:
function PlotCluster(mcStep)
clear all
filename = input('Please enter filename: ', 's');
disp('Loading hopping site coordinates ...')
load x.dat
load y.dat
load z.dat
temp = z;
z = x;
x = temp;
n_sites = length(x);
disp('Loading hopping site types ...')
fp = fopen([filename]);
data = load(filename); %# Load the data
% Plot the devices
% ----------------
disp('Plotting the sample surface ...')
figure
disp('Hello world!')
ia = data == 0;
in = data == 1;
ip = data == 2;
disp('Hello Again')
plot3(x(ia),y(ia),z(ia),'b.') %,'MarkerSize',4)
hold on
plot3(x(ic),y(ic),z(ic),'b.') %,'MarkerSize',4)
plot3(x(in),y(in),z(in),'g.') %,'MarkerSize',4)
plot3(x(ip),y(ip),z(ip),'r.') %,'MarkerSize',4)
daspect([1 1 1])
set(gca,'Projection','Perspective')
set(gca,'FontSize',16)
axis tight
xlabel('z (nm)','FontSize',18)
ylabel('y (nm)','FontSize',18)
zlabel('x (nm)','FontSize',18)
%title(['Metropolis Monte Carlo step ' num2str(mcStep)])
view([126.5 23])
My issue is I'm getting this error
Index exceeds matrix dimensions.
Error in PlotCluster (line 34)
plot3(x(ia),y(ia),z(ia),'b.') %,'MarkerSize',4)
And I don't see why ia would go out of bounds of the x array. Is it to do with changing the fgets to a load statement? It was the only way to get it read the correct numbers in (not 49s and 50s which was very odd.)
The main bits that are sticking me are these lines (where the number used to correspond to 'a','n','p' etc)
ia = data == 0;
in = data == 1;
ip = data == 2;
They look like implied if statements with assignment from data to ia etc. where ia becomes an array. But I'm not sure.
Any help understanding this would be greatly appreciated.
I've fixed the issue, I hadn't updated my input correctly. To clear this up for anyone who comes to this question: ia = data ==0 means 'Make an array the same size as data, and fill it with 1 or 0 depending on if the logic (data == 0) is true or false'