External Merge Sort - database

I am trying to implement external merge sort for my DBMS project. I have 3 files, each with 20 pages, and my buffer size is 20 pages.
I have sorted each of these, so all three 20-page files are sorted. Now, while merging, I need to bring in 6 pages of each file (6x3 = 18 pages) plus 1 page to write the sorted output, and this has to be done 4 times to get the whole file completely sorted.
But I am finding it difficult to merge all these files. Any steps for how to perform a merge of 3 files while making sure that every page fits within the buffer? Any recursive function?
All the file contents are stored in an array in a[fileno][pageno] format,
e.g. a[1][20] = 5 means page 20 of file 1 holds the value 5.
Assume each page of a file holds a single integer.

Assuming you do a 3-way merge, that's 3 inputs and 1 output, and it only has to be done once. Divide the buffer into 4 parts of 5 pages each. Start by reading the first 5 pages of each of the 3 files, each into its own 5-page buffer. Start a 3-way merge by comparing the first records in each of the 3 input buffers and moving the smallest to the output buffer. When the output buffer is filled (5 pages), write it out and continue. When an input buffer is emptied, read in the next 5 pages of that file.
When the end of one of the three input files is reached, the code switches to a 2-way merge. To simplify the code, copy the file-related parameters into the parameters for file 0 and file 1. If file 2 goes empty first, nothing needs to be done. If file 1 goes empty first, copy file 2's parameters to file 1. If file 0 goes empty first, copy file 1's parameters to file 0, then file 2's parameters to file 1. Then do the 2-way merge using file 0 and file 1.
When the end of one of the two input files is reached, the code switches to just copying the remaining file. Again, if file 0 goes empty first, copy file 1's parameters to file 0, so that the copy code always works with file 0.
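The code below is not from the original answer; it is a minimal Python sketch of the buffered 3-way merge just described, assuming each page holds a single integer and the sorted runs live in a[fileno][pageno] as in the question (0-indexed here, unlike the 1-indexed example above). Instead of copying file parameters when a run empties, it simply drops that input from the active list, which has the same effect as falling back to a 2-way merge and then a plain copy. The helper read_pages is a stand-in for real page I/O through the buffer manager.

BUF_PAGES = 5          # 20-page buffer split into 4 parts of 5 pages each
NUM_FILES = 3
FILE_PAGES = 20

def read_pages(a, fileno, start, count):
    # Stand-in for "read the next `count` pages of this run into its buffer".
    return a[fileno][start:min(start + count, FILE_PAGES)]

def merge_runs(a):
    bufs = [read_pages(a, f, 0, BUF_PAGES) for f in range(NUM_FILES)]
    next_page = [BUF_PAGES] * NUM_FILES   # next page to fetch per file
    pos = [0] * NUM_FILES                 # cursor inside each input buffer
    out, result = [], []                  # 5-page output buffer, merged run

    active = [f for f in range(NUM_FILES) if bufs[f]]
    while active:
        # 3-way compare: pick the input whose current record is smallest
        f = min(active, key=lambda i: bufs[i][pos[i]])
        out.append(bufs[f][pos[f]])
        pos[f] += 1

        # Input buffer emptied: refill from "disk", or drop this input
        # (this is where the answer above switches to a 2-way merge).
        if pos[f] == len(bufs[f]):
            if next_page[f] < FILE_PAGES:
                bufs[f] = read_pages(a, f, next_page[f], BUF_PAGES)
                next_page[f] += len(bufs[f])
                pos[f] = 0
            else:
                active.remove(f)

        # Output buffer filled (5 pages): write it out and continue.
        if len(out) == BUF_PAGES:
            result.extend(out)   # stands in for writing 5 pages to disk
            out = []

    result.extend(out)           # flush whatever is left
    return result

Calling merge_runs(a), where each a[fileno] is one of the three sorted 20-page runs, returns all 60 values in order; in a real DBMS the list operations would become page reads and writes.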

Related

Fastest way to search a 500 thousand part array in BASH? [duplicate]

This question already has answers here:
Shell command to find lines common in two files
grep a large list against a large file
I really need help finding the fastest way to display the number of times each part of a 500,000-part unidimensional array occurs inside the DATA file. So it's a word count of every element of the huge array.
I need every line of the ArrayDataFile searched for in the DATA file. I generate the array with declare and readarray, and then search the DATA file in my Documents folder with a for loop. Is this the best code for this job? The array elements are searched for in the DATA file, but it's a 500,000-part list of items. The following is an exact replica of the DATA file contents up to about line number 40; the real file to be used is 600,000 lines long. I need to optimize this search so that I can search the DATA file as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (they are all unique string elements) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
#!/bin/bash
# Read the search strings into an array; -t strips the trailing newlines
declare -a Array
readarray -t Array < ArrayDataFile
for each in "${Array[@]}"
do
  # Count how many times this element occurs in the DATA file
  LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
In order to quickly search a 500,000-part unidimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize this code for speed? I need to display the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is fastest, be it SED, AWK, GREP or any other method.
Is BASH the best way at all? I've heard it's slow.
To clarify: the DATA file is just one huge string of numbers (21313541321 etc.), and the ArrayDataFile is the one that has 500,000 items. Those items are read into an array by readarray in order to search the DATA file one by one, with the results written line by line into a new file. My specific question is about a search against a LARGE STRING, not an indexed file or a per-line sorted file; nor do I want the lines in which my array elements from the ArrayDataFile were found, or anything like that. What I want is to search the large string of data for every occurrence of every array element (taken from the ArrayDataFile) and print the counts on the same lines as the elements appear in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file using the code provided in this post. I could not use the solutions from the linked duplicates; my issue is not resolved by those answers, or at least I have not been able to extrapolate a working solution for my sample code from those posts.
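Not part of the original post, but for illustration: since the bottleneck is launching a separate fgrep process for each of the 500,000 elements, one possible alternative (assuming Python is available) is to load the DATA string once and count every element inside a single program. A minimal sketch, reusing the file names from the question; str.count counts non-overlapping occurrences, which matches what fgrep -o | wc -l reports for a one-line file:

with open('/home/USER/Documents/DATA') as f:
    data = f.read().strip()                  # the whole string of digits

with open('ArrayDataFile') as patterns, \
     open('GrepResultsOutputFile', 'w') as out:
    for line in patterns:
        needle = line.strip()
        # one count per input line, in the same order as ArrayDataFile
        out.write(str(data.count(needle)) + '\n')

The output keeps one count per line, in the same order as the ArrayDataFile, as requested.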

Saving m*n array to txt file in matlab

I have an array of size 12*200000 as a result of some computation.
I want to save this to a .txt file.
I use the following method.
Let's just say that the array is X.
fid = fopen('filename.txt','w');
fprintf(fid,'%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n',X');
The size of the file should be around 40 MB.
It saves as a 40 MB file, but after some time (almost immediately) it changes to a file of around 3 MB with some random values from the actual array. What is happening? It was working fine until a few days ago. Help.

Write to and replace a file multiple times in Fortran

I'm trying to run a code that takes a particularly long time. In order for it to complete, I've split the time-step loops as shown below, so that the data can be dumped and then re-read for the next loop:
do 10 n1 = 1, 10
  OPEN(unit=11, file='Temperature', status='replace')
  if (n1.eq.1) then
    (set initial conditions)
  elseif (n1.gt.1) then
    READ(11,*) (reads the T values from 11)
  endif
  do 20 n = 1, 10000
    (all the calculations for new T values)
    WRITE(11,*) (overwrites the T values in 11 - the file isn't empty to begin with)
20 continue
10 continue
My issue is that this only works for the first two n1 time steps: after it has replaced file 11 once, it no longer replaces it and just reiterates the values that are already in there.
Is there something wrong with the open statement? Is there a way to replace file 11 more than once in the same code?
Your program will execute the open statement 10 times, each time with status = 'replace'. On the first go round presumably the file does not exist so the open statement causes the creation of a new, empty, file. On the second go round the file does exist so the open statement causes the file to be deleted and a new, empty, file of the same name to be created. Any attempt to read from that file is likely to cause issues.
I would lift the initial file opening out of the loop and restructure the code along these lines:
open(unit=11, file='Temperature', status='replace')
(set initial conditions)
(write first data set into file)
do n1 = 2, 10
  rewind(11)
  read(11,*) (reads the T values from 11)
  ! do stuff
  close(11)   ! Not strictly necessary but aids comprehension of intent
  ! Now re-open the file and replace it
  open(unit=11, file='Temperature', status='replace')
  do n = 1, 10000
    (all the calculations for new T values)
    write(11,*) (writes the new T values to 11)
  end do
end do
but there are any number of other ways to restructure the code; choose one that suits you.
In passing: passing data from one iteration to the next by writing and re-reading a file is likely to be very slow. I'd only use it for checkpointing, to support restarting a failed execution.

Unix : script as proxy to a file

Hi: is there a way to create a file whose contents are generated dynamically when it is read?
I wanted to create 3 versions of the same file (one with 10 lines, one with 100 lines, one with all of the lines). Thus, I don't see any need for these to be static; rather, it would be best if they were proxies for a head/tail/cat command.
The purpose of this is unit testing: I want a unit test to run on a small portion of the full input file used in production. However, since the code only runs on full files (it's actually a Hadoop map/reduce application), I want to provide a truncated version of the whole data set without duplicating information.
UPDATE: An Example
more myActualFile.txt
1
2
3
4
5
more myProxyFile2.txt
1
2
more myProxyFile4.txt
1
2
3
4
etc. So the proxy files are differently named files whose content is provided dynamically by simply taking the first n lines of the main file.
This is hacky, but... One way is to use named pipes, and a looping shell script to generate the content (one per named pipe). This script would look like:
mkfifo thenamedpipe   # the pipe has to exist before anything can read it
while true; do
  (
    for i in $(seq "$linenr"); do echo something; done
  ) > thenamedpipe
done
Your script would then read from that named pipe.
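Not part of the answer above, but the same named-pipe idea sketched in Python, in case a small script per proxy file is easier to adapt; SOURCE, PIPE and N are illustrative assumptions based on the example in the question.

import itertools
import os

SOURCE = "myActualFile.txt"   # the full file from the question
PIPE = "myProxyFile2.txt"     # the truncated "proxy" view (hypothetical name)
N = 2                         # how many lines this proxy should expose

if not os.path.exists(PIPE):
    os.mkfifo(PIPE)

while True:
    # Opening a FIFO for writing blocks here until some reader opens it,
    # so the first N lines are regenerated for every read of the proxy.
    with open(PIPE, "w") as out, open(SOURCE) as src:
        out.writelines(itertools.islice(src, N))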
Another solution, if you are ready to dig into low level stuff, is FUSE.

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has exactly the same structure: 7 compound (int, double, double) datasets and a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets of unlimited maximum size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours. Would you have a suggestion for improving this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent resizing the file, since I was resizing at each step, so I now first go through all my files and get their lengths (which are variable).
Then I create the global h5file, setting the total length to the sum of all the file lengths.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole thing should take less than 2 hours, whereas before it was taking much more.
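This is not code from the post, just a minimal h5py sketch of the two-pass idea described above, under some assumptions: the dataset name "samples", the input pattern "part-*.h5" and the output file name are hypothetical, and only one of the 7 datasets is shown (the real script would repeat the copy loop for each dataset).

import glob
import h5py

files = sorted(glob.glob("part-*.h5"))   # hypothetical input file names
dset_name = "samples"                    # one of the 7 compound datasets

# Pass 1: read each file's length so the output can be sized once, up front.
lengths = []
for fn in files:
    with h5py.File(fn, "r") as f:
        lengths.append(f[dset_name].shape[0])
total = sum(lengths)

# Pass 2: create the full-size dataset once, then copy each file into its slot.
with h5py.File("concatenated.h5", "w") as out:
    with h5py.File(files[0], "r") as f0:
        dtype = f0[dset_name].dtype
    dset = out.create_dataset(dset_name, shape=(total,), dtype=dtype)
    offset = 0
    for fn, n in zip(files, lengths):
        with h5py.File(fn, "r") as f:
            dset[offset:offset + n] = f[dset_name][:]
        offset += n

Copying in chunks rather than whole datasets at once would keep memory bounded, but the key point is that the output is never resized inside the loop.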
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5

inputfilepath = ARGS[1]
outputfilepath = ARGS[2]

f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    r = strip(line, ['\n'])
    print(r, "\n")
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(4, data, datai)  # in this case concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
