HDF gzip compression vs. ASCII gzip compression - c

I have a 2D matrix with 1100x1600 data points. Initially, I stored it in an ASCII file, which I tar-zipped using the command
tar -cvzf ascii_file.tar.gz ascii_file
Now, I wanted to switch to HDF5 files, but they are too large, at least in the way I am using them... First, I write the array into an HDF5 file using the C functions
H5Fcreate, H5Screate_simple, H5Dcreate, H5Dwrite
in that order. The data is not compressed within the HDF5 file, and it is relatively large, so I compressed it using the command
h5repack --filter=GZIP=9 hdf5_file hdf5_file.gzipped
Unfortunately, this HDF5 file with the gzipped content is still larger than the compressed ASCII file, by roughly a factor of four; see the following table:
file                   size (bytes)
-----------------------------------
ascii_file                  5721600
ascii_file.tar.gz            287408
hdf5_file                   7042144
hdf5_file.gzipped           1117033
Now my question(s): Why is the gzipped ASCII file so much smaller, and is there a way to make the HDF5 file smaller?
Thanks.

Well, after reading Mark Adler's comment, I realized that this question is somewhat misguided: in the ASCII case, the values are truncated after a certain number of digits, whereas in the HDF5 case the "real" values ("real" = whatever precision the data type I am using has) are stored.
There was, however, one way to further reduce the size of my HDF5 file: applying the shuffle filter using the option
--filter=SHUF
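For completeness, here is a rough sketch (an editorial illustration, not from the original post) of how the shuffle and deflate filters could be applied directly when the dataset is created, so that the extra h5repack pass is not needed. The dataset name, chunk size, and element type below are assumptions; note that HDF5 filters require a chunked layout:

#include "hdf5.h"

int main(void)
{
    /* 1100 x 1600 matrix, as in the question; element type assumed here */
    static float data[1100][1600];
    hsize_t dims[2]  = {1100, 1600};
    hsize_t chunk[2] = {110, 160};      /* chunking is required for any filter */

    hid_t file  = H5Fcreate("hdf5_file", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* dataset creation property list: shuffle first, then gzip level 9 */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_shuffle(dcpl);
    H5Pset_deflate(dcpl, 9);

    hid_t dset = H5Dcreate2(file, "/matrix", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

This should produce roughly the same file as running h5repack with both --filter=SHUF and --filter=GZIP=9 on the uncompressed output.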

Related

Sorting huge volumes of data using a serialized binary search tree

I have 50 GB of structured key/value data like the following, stored in a text file (input.txt); the keys and values are 63-bit unsigned integers:
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
I need to sort this data by key in increasing order, keeping only the minimum value for each key. The result must be written to another text file (data.out) in under 30 minutes. For example, for the sample above the result must look like this:
2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
I decided on the following:
I will create a BST with the keys from input.txt and their minimum values, but this tree would be more than 50 GB; in other words, I have both time and memory limitations at this point.
So I will use another text file (tree.txt) and serialize the BST into that file.
After that, I will traverse the tree in order and write the ordered data into the data.out file.
My problem is mostly with the serialization and deserialization part. How can I serialize this kind of data? I also want to perform INSERT operations on the serialized data, because my data is bigger than memory and I cannot do this entirely in memory; effectively, I want to use text files as memory.
By the way, I am very new to this kind of thing. If there is a flaw in my algorithm steps, please warn me. Any thoughts, techniques and code samples would be helpful.
OS: Linux
Language: C
RAM: 6 GB
Note: I am not allowed to use built-in functions like sort and merge.
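As an editorial sketch of one way to approach the serialization question (not from the original thread): if the tree is kept in a binary file of fixed-size node records rather than a text file, every node can be addressed by its record index, which makes reading, updating in place, and appending new nodes straightforward. The file name and field layout below are assumptions:

#include <stdio.h>
#include <stdint.h>

/* One on-disk BST node: fixed size, addressed by its record index. */
struct node {
    uint64_t key;
    uint64_t min_value;
    int64_t  left;    /* record index of the left child, -1 if none  */
    int64_t  right;   /* record index of the right child, -1 if none */
};

/* Write node number idx into the (hypothetical) tree file. */
static int write_node(FILE *f, int64_t idx, const struct node *n)
{
    if (fseek(f, (long)(idx * (int64_t)sizeof *n), SEEK_SET) != 0) return -1;
    return fwrite(n, sizeof *n, 1, f) == 1 ? 0 : -1;
}

/* Read node number idx back from the tree file. */
static int read_node(FILE *f, int64_t idx, struct node *n)
{
    if (fseek(f, (long)(idx * (int64_t)sizeof *n), SEEK_SET) != 0) return -1;
    return fread(n, sizeof *n, 1, f) == 1 ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("tree.bin", "w+b");   /* hypothetical tree file */
    if (!f) return 1;

    /* sample key/value pair taken from the question */
    struct node root = { 3633223656935182015ULL, 2473242774832902432ULL, -1, -1 };
    write_node(f, 0, &root);              /* the root lives at record index 0 */

    struct node back;
    read_node(f, 0, &back);
    printf("%llu %llu\n", (unsigned long long)back.key, (unsigned long long)back.min_value);

    fclose(f);
    return 0;
}

With such a layout, an INSERT reads the handful of nodes along the search path, appends the new node's record at the end of the file, and rewrites the single parent record whose left or right index changes, so only a few nodes are ever held in memory at once.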
Considering that your file seems to have a constant line size of around 40 characters, giving around 1,250,000,000 lines in total, I'd split the input file into smaller ones with the command:
split -l 2500000 biginput.txt
then I'd sort each of them
for f in x*; do sort -n $f > s$f; done
and finally I'd merge them by
sort -m sx* > bigoutput.txt
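Since the question rules out the built-in sort and merge, here is a rough C sketch (an editorial illustration, with placeholder file names) of how the final merge step could be written by hand for two already-sorted chunk files, while also collapsing duplicate keys to their minimum value; a k-way merge over all the chunks follows the same pattern:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Read one "key value" line; returns 1 on success, 0 on end of file. */
static int read_pair(FILE *f, uint64_t *k, uint64_t *v)
{
    return fscanf(f, "%" SCNu64 " %" SCNu64, k, v) == 2;
}

int main(void)
{
    FILE *a   = fopen("sx00", "r");          /* first sorted chunk  */
    FILE *b   = fopen("sx01", "r");          /* second sorted chunk */
    FILE *out = fopen("merged.out", "w");
    if (!a || !b || !out) return 1;

    uint64_t ka, va, kb, vb;
    int has_a = read_pair(a, &ka, &va);
    int has_b = read_pair(b, &kb, &vb);

    uint64_t cur_key = 0, cur_min = 0;
    int have_cur = 0;

    while (has_a || has_b) {
        uint64_t k, v;
        /* take the smaller head of the two streams */
        if (has_a && (!has_b || ka <= kb)) {
            k = ka; v = va;
            has_a = read_pair(a, &ka, &va);
        } else {
            k = kb; v = vb;
            has_b = read_pair(b, &kb, &vb);
        }
        /* collapse duplicate keys, keeping the minimum value */
        if (have_cur && k == cur_key) {
            if (v < cur_min) cur_min = v;
        } else {
            if (have_cur)
                fprintf(out, "%" PRIu64 " %" PRIu64 "\n", cur_key, cur_min);
            cur_key = k; cur_min = v; have_cur = 1;
        }
    }
    if (have_cur)
        fprintf(out, "%" PRIu64 " %" PRIu64 "\n", cur_key, cur_min);

    fclose(a); fclose(b); fclose(out);
    return 0;
}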

Maximum Length For Filename

What is the maximum length allowed for filenames? And does the maximum differ between operating systems? I'm asking because I have trouble creating or deleting files, and I suspect the errors are because of long file names.
1. Creating:
I wrote a program that reads an XML source and saves a copy of the file. The XML contains hundreds of <Document> elements, each with the child nodes <Name> and <Format>, and the saved file is named based on what I read in the XML. For example, given the snippet below, I will save a file called test.txt
<Document>
<Name>test</Name>
<Format>.txt</Format>
</Document>
I declared a counter in my code and found out that not all files were successfully saved. After going through the large XML file, I found that the program fails to save files whose <Name> is about a paragraph long. I modified my code to save under a different name whenever <Name> is longer than 15 characters, and it went through with no problem. So I think the issue was that the filenames were too long.
2. Deleting
I found a random file on my computer and was not able to delete it. The error says the file name is too long, even after I rename the file to one character. The file doesn't take up much space, but it is annoying having it there doing nothing.
So my overall question is: What is the maximum and minimum length for filenames? Does it differ based on the operating system? And how can I delete the file I mentioned in 2?
It depends on the filesystem. Have a look here: http://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
255 characters is a common maximum length these days.
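As a small addition (not part of the original answer), on POSIX systems the limits for a particular directory can also be queried at run time with pathconf:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* maximum length of a single file name component in this directory */
    long name_max = pathconf(".", _PC_NAME_MAX);
    /* maximum length of a relative path starting from this directory */
    long path_max = pathconf(".", _PC_PATH_MAX);

    printf("NAME_MAX for '.': %ld\n", name_max);   /* typically 255 */
    printf("PATH_MAX for '.': %ld\n", path_max);   /* typically 4096 on Linux */
    return 0;
}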

Widechar sorting and loading from file

I'm currently trying to solve a problem where I need to load rows from a file and then sort them in the right order.
If I manually assign letters to an array of wint_t and then sort them, everything works just fine with any encoding: http://pastebin.com/85eycH15.
But if I read the very same letters from a file and then try to sort them, it works with only one encoding (cs_CZ.utf8); with the rest it doesn't read the letters properly or just skips them: http://pastebin.com/3C8r9W5T.
I highly appreciate any help.
I assume that the only encoding with which you get your expected result is the one used for your data file. Re-encode the data file in another encoding and you'll get your expected result for the new encoding and not the others.
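To illustrate the usual pattern (a sketch with a hypothetical input file, not the poster's code): the process locale has to be set, and has to match the file's encoding, before fgetwc can decode the bytes; wcscoll then compares according to that locale's collation rules:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

/* Compare two wide characters using the current locale's collation order. */
static int cmp_wc(const void *a, const void *b)
{
    wchar_t s1[2] = { *(const wchar_t *)a, L'\0' };
    wchar_t s2[2] = { *(const wchar_t *)b, L'\0' };
    return wcscoll(s1, s2);
}

int main(void)
{
    /* Take the locale from the environment; it must match the file's
       encoding (e.g. LC_ALL=cs_CZ.UTF-8 for a UTF-8 encoded file). */
    setlocale(LC_ALL, "");

    FILE *f = fopen("letters.txt", "r");    /* hypothetical input file */
    if (!f) { perror("fopen"); return 1; }

    wchar_t buf[1024];
    size_t n = 0;
    wint_t wc;
    while ((wc = fgetwc(f)) != WEOF && n < 1024)
        if (wc != L'\n')                    /* skip line breaks */
            buf[n++] = (wchar_t)wc;
    fclose(f);

    qsort(buf, n, sizeof buf[0], cmp_wc);

    for (size_t i = 0; i < n; i++)
        printf("%lc\n", (wint_t)buf[i]);
    return 0;
}

If the letters come out garbled or get skipped, the locale in the environment and the file's actual encoding almost certainly disagree, which matches the behaviour described in the question.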

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets, with unlimited maximum dimensions
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the datasets in the global file
appends the data
This obviously takes many hours. Would you have a suggestion on how to improve this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (which are variable).
Then I create the global h5 file, setting the total length to the sum of all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, whereas before it was taking much longer.
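For reference, a rough C sketch of the same two-pass idea (lengths first, data second), simplified to a single one-dimensional dataset of doubles with placeholder file and dataset names; the real case with 7 compound datasets would repeat the same pattern per dataset:

#include <stdio.h>
#include <stdlib.h>
#include "hdf5.h"

#define NFILES 500

int main(void)
{
    char name[64];
    hsize_t lengths[NFILES], total = 0;

    /* Pass 1: read the length of each input dataset. */
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "part_%03d.h5", i);
        hid_t f = H5Fopen(name, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t d = H5Dopen2(f, "/data", H5P_DEFAULT);
        hid_t s = H5Dget_space(d);
        H5Sget_simple_extent_dims(s, &lengths[i], NULL);
        total += lengths[i];
        H5Sclose(s); H5Dclose(d); H5Fclose(f);
    }

    /* Create the output dataset once, already at its final size. */
    hid_t out_f = H5Fcreate("concatenated.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t out_s = H5Screate_simple(1, &total, NULL);
    hid_t out_d = H5Dcreate2(out_f, "/data", H5T_NATIVE_DOUBLE, out_s,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Pass 2: copy each input into its slice of the output. */
    hsize_t offset = 0;
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "part_%03d.h5", i);
        hid_t f = H5Fopen(name, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t d = H5Dopen2(f, "/data", H5P_DEFAULT);

        double *buf = malloc(lengths[i] * sizeof *buf);
        H5Dread(d, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        hid_t mem_s  = H5Screate_simple(1, &lengths[i], NULL);
        hid_t file_s = H5Dget_space(out_d);
        H5Sselect_hyperslab(file_s, H5S_SELECT_SET, &offset, NULL, &lengths[i], NULL);
        H5Dwrite(out_d, H5T_NATIVE_DOUBLE, mem_s, file_s, H5P_DEFAULT, buf);

        offset += lengths[i];
        free(buf);
        H5Sclose(mem_s); H5Sclose(file_s); H5Dclose(d); H5Fclose(f);
    }

    H5Dclose(out_d); H5Sclose(out_s); H5Fclose(out_f);
    return 0;
}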
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots).
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated HDF5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5

inputfilepath  = ARGS[1]   # text file listing the HDF5 files, one per line
outputfilepath = ARGS[2]   # path of the concatenated output file

firstit = true
data = []
for line in eachline(inputfilepath)   # eachline drops the trailing newline itself
    r = strip(line)
    println(r)
    datai = h5read(r, "/data")
    if firstit
        global data = datai
        global firstit = false
    else
        # in this case concatenating on the 4th dimension
        global data = cat(data, datai; dims=4)
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5

Check whether it is the same file even after moving, renaming etc

This problem might be a common one, but since I don't know the terms associated with it, I couldn't search for it (unless Google accepted entire paragraphs as search queries).
I have a file - it can be a text file, an MP3 file, a video clip, or even a HUGE mkv file.
I have access to this file, and now I have to process it in some way so that I get some kind of value or unique identifier - a hash, or something - which I store somewhere. This "hash" has to be small - several bytes. It shouldn't be half the file size!
Later on, when I am presented with a file again, I have to verify whether it is the same original file using the value I got in step 1. I will NOT have access to the original file this time. All I will have is the value from step 1.
This algorithm should return true if the second file contains the exact same data - every single bit - as the first file (basically the same file), even if the file name, attributes, location, etc. have all changed.
Basically, I need to know whether I am dealing with the same file, even if it has been moved or renamed and has had all its attributes changed - but without having access to both files at the same time.
This has to be OS or FileSystem independent.
Is there a way to accomplish this?
What you're looking for are cryptographic hash algorithms. Read about them:
http://en.wikipedia.org/wiki/SHA-1
http://en.wikipedia.org/wiki/MD5
All robust languages and libraries offer support for calculating hashes.
Your dilemma is simple. Get an MD5 hash (or one from whatever algorithm can produce a one-way hash) every time you process the file.
Here it is in simple steps:
Step 1: Load file stream into a byte array
Step 2: Obtain MD5 hash from byte array
Step 3: Check whether your DB already contains the hash.
Step 4: Return false if it does not exist.
Step 5: Return true if it is found.
Step 6: If it does not exist, process the file.
Step 7: Save the hash.
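As an illustration of those steps (the choice of OpenSSL and SHA-1 here is an assumption; any cryptographic hash library and algorithm would do), a minimal sketch that hashes a file's contents in a streaming fashion, so even a huge mkv never has to fit in memory:

#include <stdio.h>
#include <openssl/evp.h>

/* Compute the SHA-1 digest of a file's contents, reading it in chunks. */
static int hash_file(const char *path, unsigned char *digest, unsigned int *digest_len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha1(), NULL);

    unsigned char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);

    EVP_DigestFinal_ex(ctx, digest, digest_len);
    EVP_MD_CTX_free(ctx);
    fclose(f);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len;
    if (hash_file(argv[1], digest, &len) != 0) return 1;

    /* Print the digest as hex; this is the small value to store and compare later. */
    for (unsigned int i = 0; i < len; i++)
        printf("%02x", digest[i]);
    printf("\n");
    return 0;
}

Build with -lcrypto. The printed digest depends only on the file's bytes, so it stays the same across renames, moves, and attribute changes, and it is only 20 bytes for SHA-1 regardless of the file size.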