Concatenate a large number of HDF5 files - dataset

I have about 500 HDF5 files, each of about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets of unlimited max size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
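(For concreteness, here is a minimal h5py sketch of the append-with-resize pattern described above, not the original script; the file and dataset names are hypothetical, and only one of the 7 compound datasets is shown.)
import glob
import h5py
import numpy as np

dt = np.dtype([('i', 'i4'), ('x', 'f8'), ('y', 'f8')])  # compound (int, double, double)

with h5py.File('combined.h5', 'w') as out:
    # resizable dataset: chunked storage with unlimited max length
    dset = out.create_dataset('/data', shape=(0,), maxshape=(None,), chunks=True, dtype=dt)
    for path in sorted(glob.glob('part_*.h5')):
        with h5py.File(path, 'r') as f:
            chunk = f['/data'][:]
        n = dset.shape[0]
        dset.resize((n + len(chunk),))   # resize the global dataset for every input file
        dset[n:] = chunk                 # append this file's samples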

I found that most of the time was spent in resizing the file, as I was resizing at each step, so I am now first going through all my files and getting their lengths (which are variable).
Then I create the global HDF5 file, setting the total length to the sum of all the individual lengths.
Only after this phase do I fill the file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, while before it was taking much more.
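(Again only as a sketch, here is what that two-pass approach looks like in h5py; names are hypothetical and only one dataset is shown.)
import glob
import h5py
import numpy as np

files = sorted(glob.glob('part_*.h5'))                   # hypothetical input file names
dt = np.dtype([('i', 'i4'), ('x', 'f8'), ('y', 'f8')])   # compound (int, double, double)

# Pass 1: sum the lengths so the output dataset can be created at its final size.
lengths = []
for path in files:
    with h5py.File(path, 'r') as f:
        lengths.append(f['/data'].shape[0])
total = sum(lengths)

# Pass 2: create the full-size dataset once, then copy each file into its slice.
with h5py.File('combined.h5', 'w') as out:
    dset = out.create_dataset('/data', shape=(total,), dtype=dt)
    offset = 0
    for path, n in zip(files, lengths):
        with h5py.File(path, 'r') as f:
            dset[offset:offset + n] = f['/data'][:]
        offset += n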

I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated HDF5 files.

Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
# Usage: julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
using HDF5
inputfilepath = ARGS[1]    # text file listing the HDF5 files to concatenate
outputfilepath = ARGS[2]   # path of the concatenated output file
firstit = true
data = []
for line in eachline(inputfilepath)
    global data, firstit            # assign to the script-level variables
    r = strip(line)                 # drop any surrounding whitespace
    println(r)
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(data, datai; dims=4)  # in this case concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5

Related

Fastest way to search a 500 thousand part array in BASH? [duplicate]

This question already has answers here:
Shell command to find lines common in two files
grep a large list against a large file
I really need help finding the fastest way to display the number of times each part of a 500,000 part unidimensional array occurs inside the DATA file. So it's essentially a count of every element in the huge array.
I need every line of the ArrayDataFile to be searched for in the DATA file. I generate the array with declare -a and readarray from the ArrayDataFile, then loop over it searching a DATA file in my Documents folder. Is this the best code for this job? The array elements are searched for in the DATA file, but it is a 500,000 item list. The following is an exact replica of the file contents up to line number 40 or so; the real file to be used is 600,000 lines long. I need to optimize this search so as to be able to search the DATA file as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (they are all unique string elements) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
The BASH script I have been able to put together to accomplish this:
#!/bin/bash
declare -a Array
readarray -t Array < ArrayDataFile   # -t strips the trailing newline from each element
for each in "${Array[@]}"            # [@] expands to every element ([#] was a typo)
do
    LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
To quickly search a 500,000 part unidimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the code for speed? I need the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is the fastest is fine, be it SED, AWK, GREP or anything else.
Is BASH even the best tool for this at all? I have heard it is slow.
The DATA file is just a huge string of numbers (21313541321 etc.), as I have now clarified, and the ArrayDataFile is the one that has 500,000 items. These are read into an array by readarray in order to search the DATA file one by one, with the results going line by line into a new file. My specific question is about searching against a LARGE STRING, not an indexed file or a per-line sorted file; nor do I want the lines in which my array elements were found, or anything like that. What I want is to search a large string of data for the number of times every array element (taken from the ArrayDataFile) occurs, and print the results on the same lines as the elements in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file with the code above. I could not use the solutions in the linked questions; my issue is not resolved by those answers, or at least I have not been able to extrapolate a working solution for my sample code from those posts.
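(Not part of the original question, but as an illustration of one way to avoid launching fgrep 500,000 times: read the DATA string into memory once and count each pattern against it. A minimal Python sketch; the file names are taken from the question, everything else is an assumption.)
with open('/home/USER/Documents/DATA') as f:
    data = f.read()                          # load the big string once

with open('ArrayDataFile') as patterns, open('GrepResultsOutputFile', 'w') as out:
    for line in patterns:
        pattern = line.strip()
        # str.count counts non-overlapping occurrences, like fgrep -o | wc -l
        out.write(f"{data.count(pattern)}\n")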

Reading many (1000+) files with dlmread - Loop with varying filenames?

I'm very new to MATLAB, or coding for that matter.
I'm running a simulation which outputs thousands of files. These files are .vtk and are read correctly by dlmread.
I tried reading one of them, defining it as a matrix and extracting column vectors out of this matrix. This works fine. What I need now is to read not only one of them, but all of them. The filenames vary by a number, for example cover1000.vtk, cover2000.vtk, ..., cover1200000.vtk.
I want all of them to be read with dlmread and stored as different matrices. How do I do that? Here is what I have right now, working with one file at a time:
A_1000 = dlmread('cover1000.vtk') % matrix containing the values from the vtk file
fx_1000 = A_1000(1:20,1) % extract a vector with specific values
fx_ave_1000 = sum(fx_1000)/length(fx_1000) % average of the values in the extracted vector
I'm thinking of a loop, but how do I create a loop with varying file names?
Also, I've read that a loop is not the best idea and that cell arrays would be better, but I have absolutely no idea how to implement any of this.
Thanks for your help!
cheers
You can use the function dir to list all the vtk files in your directory, then loop over those files.
filename = dir('*.vtk');   % list all the vtk files in your current directory
for ii = 1:length(filename)
    A = dlmread(filename(ii).name);            % matrix containing the values from the vtk file
    fx{ii} = A(1:20,1);                        % extract the vector with specific values
    fx_ave{ii} = sum(fx{ii})/length(fx{ii});   % average of the values in the extracted vector
end
The results are now stored in two cells: fx and fx_ave.

Deleting specific files from a directory

I am trying to delete all files from a directory apart from two (which will be erased, then re-written). One of these files, not to be deleted, contains the names of all files in the folder/directory (the other contains the number of files in the directory).
I believe there are (possibly?) two solutions:
Read the names of each file from the un-deleted file and delete them individually until only the final two files remain,
or...
Because all the other files end in .txt, I could use some sort of filter which would only delete files with that ending.
Which of these two would be most efficient, and how could it be done?
Any help would be appreciated.
You are going to end up deleting files one by one, regardless of which method you use. Any optimizations you make are going to be minuscule. Without actually timing your algorithms, I'd say they'd both take about the same amount of time (and this would vary from one computer to the next, based on CPU speed, HDD type, etc.). So, instead of debating that, I'll provide you with code for both the ways you've mentioned:
Method 1:
import os
def deleteAll(infilepath):
    with open(infilepath) as infile:
        for line in infile:
            os.remove(line.strip())   # strip the trailing newline before deleting
Method 2:
import os
def deleteAll():
    blacklist = set(['names/of/files/to/be/deleted', 'number/of/files'])
    for fname in (f for f in os.listdir('.') if f not in blacklist):
        os.remove(fname)

Python reading of files

I am new to Python and I am facing my first problems.
I have to read some .dat files (100 of them), and each file contains a set of 5000 power traces. The total amount of memory taken by the files is almost 10 GB, so I cannot read the files all together because I would fill the RAM. So the np.fromfile method with a for loop over all the files is not useful.
I would like to use memory mapping, reading just a few files at a time, but I need to handle the data at the same time.
Do you have any suggestions?
Cheers
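(No answer is recorded here, but as an illustration of the memory-mapping idea: a minimal numpy sketch, assuming each .dat file is raw binary with a known dtype and trace length; the file pattern, dtype and shape below are hypothetical.)
import glob
import numpy as np

N_TRACES, TRACE_LEN = 5000, 10000    # hypothetical layout of one .dat file

for path in sorted(glob.glob('traces_*.dat')):
    # np.memmap keeps the data on disk; slices are read lazily, so RAM use stays small
    traces = np.memmap(path, dtype=np.float32, mode='r', shape=(N_TRACES, TRACE_LEN))
    # process one file (or a few traces) at a time, e.g. an average trace:
    mean_trace = traces.mean(axis=0)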

Excel Files, C, and misery

Alright, so, I haven't programmed anything useful in ages - last time I did was a year ago and as you can imagine my knowledge of programming is seriously rusty. (last thing I 'programmed' was a ren'py game over the weekend. One can imagine the limited uses of this. The most advanced C program I wrote was a tic-tac-toe game a year ago. So yeah.)
Anyway, I've been given a job to write a program that takes two Excel files, both of which have a list of items, each associated with an ID. The program needs to search both files for IDs and, if the IDs match, create a new file with the matched IDs and items. This is insanely beyond my limited C capabilities.
If anyone could help, I would seriously appreciate it.
(also, if this is not possible with C, I'll do my best to work with any other languages)
Export the two files to .csv format and write a script to process the two files. For example, in PHP, you have built-in CSV read/write capabilities.
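(As a sketch of that csv-matching idea, in Python rather than PHP, purely for illustration; the file names and column layout below are assumptions.)
import csv

# Assumed layout: each exported CSV has the ID in column 0 and the item in column 1.
with open('file2.csv', newline='') as f:
    items_in_file2 = {row[0]: row[1] for row in csv.reader(f) if row}

with open('file1.csv', newline='') as f, open('matched.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row and row[0] in items_in_file2:
            # write the matched ID together with the items from both files
            writer.writerow([row[0], row[1], items_in_file2[row[0]]])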
You can do this with VBA: create a macro in one of the files which iterates over the cells in your column in file 1, compares them to the cells in file 2, and writes them to a new .xls file if they match.
Dana points out that the VLOOKUP function will do this quite easily.
Install GnuWin32
Output the Excel files as text (csv, for example)
sort each file with the -u option to remove duplicates if needed
concatenate and sort the 2 files
count unique IDs with uniq -c
filter out lines with a count of 1 with grep
remove the count, leaving the ID and whatever else you need, with cut
If you know Java then you can use Apache POI for your project. You can use the examples given on the Apache POI website to accomplish your task.
Apache POI Excel Documentation: http://poi.apache.org/spreadsheet/quick-guide.html
If you absolutely have to do this on xls/xlsx files from a process, you probably need a copy of Excel controlled by COM automation. You can do this in VB/VBA/C#/C++, whatever, some easier than others. Google for 'Excel automation'.
Rgds,
Martin
Not C, but you may be able to cobble something together very quickly using xlsperl.
It has come in handy for me in the past.
