I have an array of size 12*200000 as a result of some computation.
I want to save this to a .txt file.
I use the following method.
Let's just say that the array is X.
fid = fopen('filename.txt','w');
fprintf(fid,'%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n',X');
The size of the file should be around 40 MB.
It saves as a 40 MB file, but after some time (almost immediately) it changes to a file of around 3 MB containing only some seemingly random values from the actual array. What is happening? It was working fine until a few days ago. Help.
Is there a way to get a specific line inside a text file without iterating line-by-line in C?
For example, I have this text file names.txt, which contains the following names:
John
James
Julia
Jasmine
and I want to access 'Julia' right away without iterating through 'John' and 'James'. Something like: just give the index value '2' or '3' to access 'Julia' directly.
Is there a way to do this in C?
I just want to know how, because I need to deal with a very large text file, something like 3 billion lines, and I want to access a specific line in it right away; iterating line by line is very slow.
You have to iterate through all the lines at least once. During this pass, before reading each line, you record its position in the file and save it to an array or to another file (usually called an index file). That file should have a fixed record size suitable for storing the position of a line in the text file.
Later, when you want to access a given line, you either use the array to get the position (the line number is the array index) or the index file (you seek to the offset line number × record size) and read the position. Once you have the position, you can seek to it in the text file and read the line.
Each time the text file is updated, you must rebuild the array or index file.
There are other ways to do this, but you would need to explain the context better.
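To make that concrete, here is a minimal sketch in C. It is my illustration, not part of the answer above: the file names names.txt and names.idx, the 4096-character line limit, and the choice of one long byte offset per index record are all assumptions.
#include <stdio.h>

/* Build an index file holding one long (byte offset) per line of the text file. */
static int build_index(const char *textpath, const char *idxpath)
{
    FILE *txt = fopen(textpath, "r");
    FILE *idx = fopen(idxpath, "wb");
    if (!txt || !idx) {
        if (txt) fclose(txt);
        if (idx) fclose(idx);
        return -1;
    }

    char line[4096];
    long offset = ftell(txt);              /* offset of the line about to be read */
    while (fgets(line, sizeof line, txt)) {
        fwrite(&offset, sizeof offset, 1, idx);
        offset = ftell(txt);
    }
    fclose(txt);
    fclose(idx);
    return 0;
}

/* Fetch line `lineno` (0-based): one seek in the index, one seek in the text file. */
static int get_line(const char *textpath, const char *idxpath,
                    long lineno, char *out, int outsz)
{
    FILE *idx = fopen(idxpath, "rb");
    FILE *txt = fopen(textpath, "r");
    long offset;
    int ok = idx && txt
          && fseek(idx, lineno * (long)sizeof offset, SEEK_SET) == 0
          && fread(&offset, sizeof offset, 1, idx) == 1
          && fseek(txt, offset, SEEK_SET) == 0
          && fgets(out, outsz, txt) != NULL;
    if (idx) fclose(idx);
    if (txt) fclose(txt);
    return ok ? 0 : -1;
}

int main(void)
{
    char line[4096];
    build_index("names.txt", "names.idx");            /* the one-time pass */
    if (get_line("names.txt", "names.idx", 2, line, sizeof line) == 0)
        printf("%s", line);                           /* prints "Julia" for the sample file */
    return 0;
}
Because every index record has the same fixed size, locating line n is a single fseek into the index followed by a single fseek into the text file, so the lookup cost no longer depends on where the line sits in a 3-billion-line file.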
This question already has answers here:
Shell command to find lines common in two files
(12 answers)
grep a large list against a large file
(4 answers)
Closed 5 years ago.
I really need help finding the fastest way to display the number of times each element of a 500000-element unidimensional array occurs inside the DATA file. So it's a word count of every element in the huge array.
I need every line of the ArrayDataFile searched for in the DATA file. I generate the array with declare, readarray the ArrayDataFile into it, and then search the DATA file in my Documents folder with a for loop. Is this the best code for this job? The array elements are searched for in the DATA file, but it is a list of 500000 items. The following is an exact replica of the DATA file contents up to around line 40 or so; the real file to be used is 600000 lines long. I need to optimize this search so as to be able to search the DATA file as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (they are all unique string elements) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
#!/bin/bash
declare -a Array
readarray -t Array < ArrayDataFile   # -t strips the trailing newline from each element
for each in "${Array[@]}"            # iterate over every array element
do
LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
In order to quickly search for a 500,000-element unidimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the code for search speed? I need the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is fastest, be it SED, AWK, GREP or anything else.
Is BASH the best way at all? I've heard it's slow.
The DATA file is just a huge string of numbers, 21313541321 etc., as I have now clarified. The ArrayDataFile is the one that has 500000 items. These are read into an array by readarray in order to search the DATA file for them one by one and then write the results, on a per-line basis, into a new file. My specific question is about a search against a LARGE STRING, not an indexed file or a file sorted per line, nor do I want the lines in which my array elements from the ArrayDataFile were found or anything like that. What I want is to search a large string of data for every occurrence of every array element (taken from the ArrayDataFile) and print the results on the same lines as the elements appear in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file using the code provided in this post. I could not use the solutions from the linked posts for my query, and my issue is not resolved by those answers; at least I have not been able to extrapolate a working solution for my sample code from those specific posts.
I am trying to implement external merge sort for my DBMS project. I have 3 files, each with 20 pages, and my buffer size is 20 pages.
I have sorted each of these, so all three 20-page files are sorted. Now, while merging, I need to bring in 6 pages of each file (6x3 = 18 pages) and keep 1 page to write the sorted output. And this has to be done 4 times to get the whole file completely sorted.
But I am finding it difficult to merge all these files. Are there any steps for performing the merge of the 3 files while making sure that every page is brought within the buffer size? Any recursive function?
All the files' contents are stored in an array in a[fileno][pageno] format,
e.g. a[1][20] = 5 means that page number 20 of file 1 holds the value 5.
Assume that a page of a file holds an integer.
Assuming you do a 3-way merge, that's 3 inputs and 1 output, and it only has to be done once. Divide the buffer into 4 parts of 5 pages each. Start by reading the first 5 pages of each of the 3 files, each into its own 5-page buffer. Start a 3-way merge by comparing the first records in each of the 3 buffers and moving the smallest to the output buffer. When the output buffer is filled (5 pages), write it out and continue. When an input buffer is emptied, read in the next 5 pages for that file.
When the end of one of the three input files is reached, the code switches to a 2-way merge. To simplify the code, copy the file-related parameters into the parameters for file 0 and file 1. If file 2 goes empty first, nothing needs to be done. If file 1 goes empty first, copy file 2's parameters to file 1. If file 0 goes empty first, copy file 1's parameters to file 0, then file 2's parameters to file 1. Then do the 2-way merge using file 0 and file 1.
When the end of one of the two remaining input files is reached, the code switches to just copying the remaining file. Again, if file 0 goes empty first, copy file 1's parameters to file 0, so that the copy code always works with file 0.
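A minimal sketch of that buffered merge in C, for illustration only: it assumes each record is a plain int, each sorted run is its own binary file of ints, and PAGE_INTS and PAGES_PER_BUF are made-up constants standing in for the 5-page quarters of the 20-page buffer pool. Instead of copying file parameters down to file 0 and file 1, it compacts the array of live runs, which has the same effect of degrading from a 3-way merge to a 2-way merge to a plain copy.
#include <stdio.h>
#include <stdlib.h>

#define PAGE_INTS     1024                 /* hypothetical ints per page */
#define PAGES_PER_BUF 5                    /* 20-page pool split 4 ways  */
#define BUF_INTS      (PAGE_INTS * PAGES_PER_BUF)
#define NRUNS         3

typedef struct {
    FILE  *fp;
    int    buf[BUF_INTS];
    size_t count;                          /* ints currently in buf      */
    size_t pos;                            /* next int to consume        */
} Run;

/* Refill one run's 5-page input buffer; returns 0 when the run is exhausted. */
static int refill(Run *r)
{
    r->count = fread(r->buf, sizeof(int), BUF_INTS, r->fp);
    r->pos   = 0;
    return r->count > 0;
}

int main(int argc, char **argv)
{
    if (argc != NRUNS + 2) {
        fprintf(stderr, "usage: %s run1 run2 run3 out\n", argv[0]);
        return 1;
    }

    Run runs[NRUNS];
    int live = 0;                          /* number of runs with data left */
    for (int i = 0; i < NRUNS; i++) {
        runs[live].fp = fopen(argv[i + 1], "rb");
        if (!runs[live].fp) { perror(argv[i + 1]); return 1; }
        if (refill(&runs[live]))
            live++;
        else
            fclose(runs[live].fp);         /* empty run: drop it right away */
    }

    FILE *out = fopen(argv[NRUNS + 1], "wb");
    if (!out) { perror(argv[NRUNS + 1]); return 1; }

    int    outbuf[BUF_INTS];               /* the 5-page output buffer      */
    size_t outlen = 0;

    while (live > 0) {
        /* pick the run whose next record is smallest: this is the 3-way,
           then 2-way, then plain-copy merge as the runs drain one by one   */
        int min = 0;
        for (int i = 1; i < live; i++)
            if (runs[i].buf[runs[i].pos] < runs[min].buf[runs[min].pos])
                min = i;

        outbuf[outlen++] = runs[min].buf[runs[min].pos++];
        if (outlen == BUF_INTS) {          /* output buffer full: flush it  */
            fwrite(outbuf, sizeof(int), outlen, out);
            outlen = 0;
        }

        if (runs[min].pos == runs[min].count && !refill(&runs[min])) {
            fclose(runs[min].fp);          /* run finished: compact the list */
            runs[min] = runs[live - 1];
            live--;
        }
    }

    fwrite(outbuf, sizeof(int), outlen, out);   /* flush whatever is left    */
    fclose(out);
    return 0;
}
Since each input buffer corresponds to 5 pages, a refill is exactly the "read the next 5 pages of that file" step from the description, and with 20 pages per run it happens at most a handful of times per file.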
I have 79 .mat files, each containing a 264*264 array named "CM". I want to combine them all into a single 264*264*79 matrix, but I don't know how.
files=dir('*.mat') %// load all filenames from the directory ending on .mat
for ii = numel(files):-1:1 %// let the loop run backwards
load(files(ii).name);
A(:,:,ii) = CM; %// assumed they are actually all equivalently called CM
end
The dir command gets a list of all files in the pwd (present working directory) ending on .mat. The for loop runs backwards, so as to initialise the storage variable A to its maximum size on the first iteration, improving efficiency. Within the loop, load a file and then store its array in A. Finally, A will be a [264 264 79] array.
I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has the exact same structure, which is 7 compound (int,double,double) datasets and a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets of unlimited max size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours.
Would you have a suggestion for improving this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent resizing the file, as I was resizing at each step, so I am now first going through all my files and getting their lengths (they are variable).
Then I create the global h5 file, setting the total length to the sum of all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, while before it was taking much more.
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
concat all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make text file listing the files to concatenate in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5
inputfilepath=ARGS[1]
outputfilepath=ARGS[2]
f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    global firstit, data               # needed when this runs as a top-level script in recent Julia
    r = strip(line, ['\n'])
    println(r)
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(data, datai; dims=4)  # in this case concatenating on the 4th dimension
    end
end
close(f)
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5