I have 50 GB of structured key/value data like this, stored in a text file (input.txt; keys and values are 63-bit unsigned integers):
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
I need to sort this data by key in increasing order, keeping only the minimum value for each key. The result must be written to another text file (data.out) in under 30 minutes. For the sample above, the result must look like this:
2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
My plan is:
I will build a BST from the keys in input.txt, each with its minimum value, but this tree would be larger than 50 GB. That is where I run into the time and memory limits.
So I will use another text file (tree.txt) and serialize the BST into that file.
After that, I will traverse the tree in order and write the sorted data into the data.out file.
My problem is mostly with the serialization and deserialization part. How can I serialize this type of data? I also want to run the INSERT operation on the serialized data, because my data is bigger than memory and I can't do this in memory. In effect, I want to use text files as memory.
By the way, I am very new to this kind of thing. If there is a flaw in my algorithm steps, please warn me. Any thoughts, techniques and code samples would be helpful.
OS: Linux
Language: C
Note: I am not allowed to use built-in functions like sort and merge.

Considering that your file seems to have a roughly constant line length of around 40 characters, which gives around 1,250,000,000 lines in total, I'd split the input file into smaller files with the command:
split -l 2500000 biginput.txt
then I'd sort each of them
for f in x*; do sort -n "$f" > "s$f"; done
and finally I'd merge them by
sort -n -m sx* > bigoutput.txt
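The split / sort / merge steps above are an external merge sort. Below is a minimal Python sketch of the same idea that also keeps only the minimum value per key, as the question requires. It is an illustration of the algorithm, not a solution within the question's constraints: it is not C, it uses the built-in sorted and heapq.merge, and the function names and the lines_per_chunk default are assumptions made for this example.

```python
import heapq
import itertools
import os
import tempfile


def write_sorted_chunks(input_path, lines_per_chunk, tmp_dir):
    """Phase 1: read a chunk that fits in memory, keep the minimum value
    per key, and write the chunk to disk sorted by key."""
    chunk_paths = []
    with open(input_path) as src:
        while True:
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            best = {}
            for line in lines:
                if not line.strip():
                    continue  # tolerate blank lines
                key, value = map(int, line.split())
                if key not in best or value < best[key]:
                    best[key] = value
            path = os.path.join(tmp_dir, f"chunk{len(chunk_paths)}.txt")
            with open(path, "w") as out:
                for key in sorted(best):
                    out.write(f"{key} {best[key]}\n")
            chunk_paths.append(path)
    return chunk_paths


def read_pairs(path):
    """Yield (key, value) integer pairs from a sorted chunk file."""
    with open(path) as f:
        for line in f:
            key, value = line.split()
            yield int(key), int(value)


def merge_chunks(chunk_paths, output_path):
    """Phase 2: k-way merge of the sorted chunks. Pairs arrive ordered by
    (key, value), so the first pair seen for a key holds its minimum."""
    merged = heapq.merge(*(read_pairs(p) for p in chunk_paths))
    last_key = None
    with open(output_path, "w") as out:
        for key, value in merged:
            if key != last_key:
                out.write(f"{key} {value}\n")
                last_key = key


def external_sort(input_path, output_path, lines_per_chunk=2_500_000):
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunks = write_sorted_chunks(input_path, lines_per_chunk, tmp_dir)
        merge_chunks(chunks, output_path)
```

Only one chunk is held in memory at a time during the first phase, and the merge phase holds one line per chunk, so memory use is bounded by lines_per_chunk rather than by the input size.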


I really need help finding the fastest way to count how many times each element of a 500,000-element one-dimensional array occurs inside the DATA file. In effect, it is a word count for every element of the huge array.
Every line of the ArrayDataFile has to be searched for in the DATA file. I declare the array with declare, fill it with readarray, and then search the DATA file in my Documents folder in a for loop. Is this the best code for the job? Each array element is searched for in the DATA file, but the list has 500,000 items. The sample below reproduces the DATA file contents up to about line 40; the real file is 600,000 lines long. I need to optimize this search so that it runs as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (they are all unique string elements) is
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
declare -a Array
# -t strips the trailing newline from each element
readarray -t Array < ArrayDataFile
for each in "${Array[@]}"
do
    LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
What is the best way to optimize this code for speed when searching for a 500,000-element one-dimensional array taken from the lines of the ArrayDataFile? I need one line of output per array element in the GrepResultsOutputFile. It does not have to be the same code; any method is fine as long as it is the fastest, be it sed, awk, grep or anything else.
Is Bash the best tool for this at all? I have heard it is slow.
The data file is just a huge string of numbers 21313541321 etc etc as I have now clarified. Also the ArrayDataFile is the one that has 500000 items. These are taken into an array by readarray in order to search the DATA file one by one and then get results in a per line basis into a new file. My specific question is regarding a search against a LARGE STRING, not an INDEXED file or a per line sorted file, nor do I want the lines in which my array elements from ArrayDataFile were found or anything like that. What I want is to search a large string of data for every time that every array element (taken from the ArrayDataFile) happened and print the results in the same line as they are located in the Array Data File so that I can keep everything together and do further operations. The only operation that really takes long is the actual Searching of the DATA file by utilizing the code provided in this post. I could not utilize those solutions for my query and my issue is not resolved with those answers. At least I have not been able to extrapolate a working solution for my sample code from those specific posts.
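One likely cost in the loop above is that it starts a new grep process and re-reads the DATA file for each of the 500,000 elements. A minimal Python sketch of an alternative is shown below: it reads DATA into memory once and counts each element with str.count, which counts non-overlapping occurrences in the same way as grep -o piped to wc -l. The function name is hypothetical, and the sketch assumes the DATA file fits in memory and that no array element contains a newline.

```python
def count_occurrences(data_path, patterns_path, output_path):
    """Write one count per line of patterns_path, in the same order."""
    # Read the searched file once instead of once per pattern.
    with open(data_path) as f:
        data = f.read()

    with open(patterns_path) as patterns, open(output_path, "w") as out:
        for line in patterns:
            pattern = line.rstrip("\n")
            # str.count counts non-overlapping occurrences; an empty
            # pattern is reported as 0 rather than len(data) + 1.
            count = data.count(pattern) if pattern else 0
            out.write(f"{count}\n")
```

A call such as count_occurrences("DATA", "ArrayDataFile", "GrepResultsOutputFile") would keep the counts aligned line by line with the ArrayDataFile. This still scans the data once per element, so it is a simple baseline rather than the fastest possible approach.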

HDF gzip compression vs. ASCII gzip compression

I have a 2D matrix with 1100x1600 data points. Initially, I stored it in an ascii-file which I tar-zipped using the command
tar -cvzf ascii_file.tar.gz ascii_file
Now, I wanted to switch to hdf5 files, but they are too large, at least in the way I am using them... First, I write the array into an hdf5-file using the c-procedures
H5Fcreate, H5Screate_simple, H5Dcreate, H5Dwrite
in that order. The data is not compressed within the hdf-file and it is relatively large, so I compressed it using the command
h5repack --filter=GZIP=9 hdf5_file hdf5_file.gzipped
Unfortunately, this HDF file with the gzipped content is still larger than the compressed ASCII file by a factor of about 4 (1117033 / 287408 ≈ 3.9); see the following table:
file                 size (bytes)
ascii_file           5721600
ascii_file.tar.gz    287408
hdf5_file            7042144
hdf5_file.gzipped    1117033
Now my question(s): Why is the gzipped ascii-file so much smaller and is there a way to make the hdf-file smaller?
Well, after reading Mark Adler's comment, I realized that this question is somewhat stupid: in the ASCII case, the values are truncated after a certain number of digits, whereas in the HDF case the "real" values ("real" = whatever precision the data type I am using has) are stored.
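The effect of truncation on compressed size can be reproduced without HDF5. The following minimal sketch uses synthetic random values (an assumption; the real matrix is not available here) and compares the zlib-compressed size of a text rendering truncated to a few digits with that of the full-precision binary doubles. The function name and its parameters are hypothetical.

```python
import random
import struct
import zlib


def compressed_sizes(n_values=100_000, digits=3, seed=0):
    """Return (compressed text size, compressed binary size) in bytes."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]

    # Text form: each value is truncated to `digits` decimal places.
    as_text = "\n".join(f"{v:.{digits}f}" for v in values).encode("ascii")
    # Binary form: every value keeps its full 64-bit precision.
    as_binary = struct.pack(f"<{n_values}d", *values)

    return len(zlib.compress(as_text, 9)), len(zlib.compress(as_binary, 9))


text_size, binary_size = compressed_sizes()
print("compressed text:  ", text_size)
print("compressed binary:", binary_size)
```

The truncated text carries far less information per value than the full mantissa of a double, so it compresses to a smaller size even though the uncompressed files are of similar magnitude.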
There was, however, one possibility to further reduce the size of my hdf file: by applying the shuffle filter using the option

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has the same exact structure, which is 7 compound (int,double,double) datasets and variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets, each with unlimited maximum size
opens all the files in sequence
checks the number of samples in each file (as it is variable)
resizes the global file
appends the data
This obviously takes many hours.
Would you have a suggestion for improving this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough in C programming to implement something myself, I would need a tool already written.
I found that most of the time was spent resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (they are variable).
Then I create the global h5file, setting the total length to the sum of the lengths of all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours in total, while before it was taking much longer.
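The two-pass idea described above (measure every file first, allocate once, then fill) comes down to computing where each file's samples land in the final dataset. A minimal sketch of that bookkeeping in plain Python is given below; plan_slices is a hypothetical helper, and the h5py usage mentioned in the comments is an assumption about how the result would be applied, not the original script.

```python
from itertools import accumulate


def plan_slices(lengths):
    """Given the number of samples in each input file, return the total
    length and the (start, stop) range each file occupies in the output."""
    stops = list(accumulate(lengths))
    starts = [0] + stops[:-1]
    total = stops[-1] if stops else 0
    return total, list(zip(starts, stops))


# Pass 1 would collect the per-file lengths; example values here.
total, slices = plan_slices([3, 5, 2])
print(total)   # 10
print(slices)  # [(0, 3), (3, 8), (8, 10)]

# Pass 2 (assumed h5py usage): create each output dataset once with
# shape (total,), then for file i write its data into out[start:stop].
```

Because the output is sized once, no resize is needed while the data is being copied.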
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop read each line of txt file and use label$i = h5read(original_filepath$i, "/label")
concat all the labels label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make text file listing the files to concatenate in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a julia script to concatenate multiple files into one file:
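# concatenate_HDF5.jl
using HDF5

function concatenate(inputfilepath, outputfilepath)
    data = nothing
    for line in eachline(inputfilepath)
        r = String(strip(line))
        datai = h5read(r, "/data")
        # The first file initializes the array; later files are appended
        # along the 4th dimension (in this case)
        data = data === nothing ? datai : cat(data, datai; dims=4)
    end
    h5write(outputfilepath, "/data", data)
end

concatenate(ARGS[1], ARGS[2])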
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5

Check whether it is the same file even after moving, renaming etc

This problem might be a common one, but since I don't know the terms associated with it, I couldn't search for it (unless Google accepted entire paragraphs as search queries).
I have a file - Can be a text file, or an MP3 file or a video clip or even a HUGE mkv file.
I have access to this file and now I have to process it in some way so that I get some kind of value or unique identifier: a hash, or something similar. I store it somewhere. This "hash" has to be small, several bytes. It shouldn't be half the file size!
Later on, when I am presented with a file again, I have to verify whether it was the same original file using that value I got in step 1. I will NOT have access to the original file this time. All I have will be that value from step 1.
This algorithm should return true if the second file contains the exact same data - every single bit - as the first file (basically the same file) even if the file name, attributes, location etc have all changed.
Basically I need to know whether I am dealing with the same file, even if it has been moved or renamed and has had all its attributes changed, without having access to both files at the same time.
This has to be OS or FileSystem independent.
Is there a way to accomplish this?
What you're looking for are cryptographic hash algorithms. Read about them:
All robust languages and libraries offer support for calculating hashes.
Your dilemma is simple. Compute an MD5 hash (or use any other algorithm that produces a one-way hash) every time you process the file.
Here it is in simple steps:
Step 1: Load file stream into a byte array
Step 2: Obtain MD5 hash from byte array
Step 3: Check your db if it already contains hash.
Step 4: return false if not exist
Step 5: return true if found
Step 6: If not exist process file
Step 7: Save hash
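The hashing part of these steps can be sketched in a few lines of Python with the standard hashlib module. This is a minimal sketch, not a complete solution: the function names are hypothetical, the file is read in chunks rather than loaded into one byte array so that huge files do not need to fit in memory, and SHA-256 is used as the default instead of MD5 (an assumption made here; passing algorithm="md5" gives the MD5 variant the answers mention).

```python
import hashlib


def file_fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash only the file's contents; name, location and attributes
    do not influence the result."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_file(path, stored_fingerprint, algorithm="sha256"):
    """Compare a file against a fingerprint stored earlier."""
    return file_fingerprint(path, algorithm) == stored_fingerprint
```

The stored value is a short fixed-length string (64 hex characters for SHA-256), regardless of the file size. Two different files could in principle share a fingerprint, but with a cryptographic hash this is not expected to happen by accident.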

MD5 implementation in C for an XML file

I need to implement an MD5 checksum in order to verify the MD5 checksum embedded in an XML file received from our client; the checksum covers the whole file, including all XML tags. The received MD5 checksum is 32 hexadecimal digits long.
We need to set the MD5 checksum field to 0 in the received XML file prior to the checksum calculation, and we have to independently calculate and verify the MD5 checksum value of the received XML file.
Our application is implemented in C. Please assist me on how to implement this.
This depends directly on the library used for XML parsing. It is tricky, however, because you can't simply embed the MD5 in the XML file itself: embedding the checksum changes the file and therefore its checksum, unless the checksum is computed only from specific elements. As I understand it, you receive the MD5 independently? Is it calculated from the whole file, or only from the tags/content?
MD5 Public Domain code link -
XML library for C -
Exact solutions depend on the code used.
Based on your comment you need to do the following steps:
load the XML file (possibly even as plain text)
read the MD5
substitute the MD5 in the file with zero, and write the file down (or better, keep it in memory)
run MD5 on the pure file data and compare it with the value stored before
There are public-domain implementations of MD5 that you should use, instead of writing your own. I hear that Colin Plumb's version is widely used.
Don't reinvent the wheel, use a proven existing solution:
Incidentally that was the first link that came up when I googled "md5 c implementation".
This is rather nasty. The approach suggested seems to imply you need to parse the XML document into something like a DOM tree, find the MD5 checksum and store it for future reference. Then you would replace the checksum with 0 before re-serializing the document and calculating its MD5 hash. This all sounds doable but potentially tricky. The major difficulty I see is that your new serialization of the document may not be the same as the original one, and differences that are irrelevant to XML, like the use of single or double quotes around attribute values, added line breaks or even a different encoding, will cause the hashes to differ. If you go down this route you'll need to make sure your app and the procedure used to create the document in the first place make the same choices. For this sort of problem, canonical XML is the standard solution.
However, I would do something different. With any luck it should be quite easy to write a regular expression to locate the MD5 hash in the file and replace it with 0. You can then use this to grab the hash and replace it with 0 in the XML file before recalculating the hash. This sidesteps all the possible issues with parsing, changing and re-serializing the XML document. To illustrate, I'm going to assume the hash '33d4046bea07e89134aecfcaf7e73015' lives in the XML file like this:
<docRoot xmlns='some-irrelevant-uri'>
<myData>Blar blar</myData>
<myExtraData number='1'/>
<docHash MD5="33d4046bea07e89134aecfcaf7e73015" />
<evenMoreOfMyData number='34'/>
</docRoot>
(which I've called hash.xml), that the MD5 should be replaced by 32 zeros (so the hash is correct) and illustrate the procedure on a shell command line using perl, md5 and bash. (Hopefully translating this into C won't be too hard given the existence of regular expression and hashing libraries.)
Breaking down the problem, you first need to be able to find the hash that is in the file:
perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml
(this works by looking for the start of the MD5 attribute of the docHash element, allowing for possible other attributes, and then grabbing the next 32 hex characters. If it finds them it bungs them in the magic $_ variable, if not it sets $_ to be empty, then the value of $_ gets printed for each line. This results in the string "33d4046bea07e89134aecfcaf7e73015" being printed.)
Then you need to calculate the hash of the file with the hash replaced with zeros:
perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5
(where the regular expression is almost the same, but this time the hex characters are replaced by zeros and the whole file is printed. The MD5 of this is then calculated by piping the result through an md5 hashing program.) Putting this together with a bit of bash gives:
if [ "$(perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml)" = "$(perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5)" ] ; then echo OK; else echo ERROR; fi
which executes those two small commands, compares the output and prints "OK" if the outputs match or "ERROR" if they don't. Obviously this is just a simple prototype, and it is in the wrong language, but I think it illustrates the most straightforward solution.
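The same prototype can be written without shelling out. The following minimal Python sketch mirrors the approach above with the standard re and hashlib modules: locate the 32 hex digits, replace them with 32 zeros, hash the result and compare. The docHash element and MD5 attribute names come from the example document above; the function name is hypothetical, and unlike the Perl one-liners this pattern accepts either quote style.

```python
import hashlib
import re

# Captures the opening of the MD5 attribute, its quote character and the
# 32 hex digits. Element and attribute names follow the example document.
HASH_ATTR = re.compile(
    rb"""(<docHash\b[^>]*?\bMD5=)(["'])([a-fA-F0-9]{32})\2"""
)


def verify_embedded_md5(xml_bytes):
    """Return True if the MD5 stored in the document matches the MD5 of
    the document with that field replaced by 32 zeros."""
    match = HASH_ATTR.search(xml_bytes)
    if match is None:
        return False  # no checksum field found

    claimed = match.group(3).decode("ascii").lower()
    # Only the 32 hex digits are overwritten, so the rest of the file
    # stays byte-for-byte identical to what was received.
    zeroed = (xml_bytes[:match.start(3)]
              + b"0" * 32
              + xml_bytes[match.end(3):])
    return hashlib.md5(zeroed).hexdigest() == claimed
```

Working on the raw bytes avoids the re-serialization problem described earlier, since the document is never parsed or rewritten apart from the checksum field itself.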
Incidentally, why do you put the hash inside the XML document? As far as I can see it doesn't have any advantage compared to passing the hash along on a side channel (even something as simple as in a second file called documentname.md5) and makes the hash validation more difficult.
Check out these examples of how to use the XMLDSIG standard with .NET:
How to: Sign XML Documents with Digital Signatures
How to: Verify the Digital Signatures of XML Documents
You should perhaps also consider changing the setting for preserving whitespace.
