How to efficiently append data to an HDF5 table in C?

I'm failing to save a large dataset of float values in an HDF5 file efficiently.
The data acquisition works as follows:
A fixed array of 'ray data' (coordinates, directions, wavelength, intensity, etc.) is created and sent to an external ray-tracing program (it's about 2500 values).
In return I get the same array but with changed data.
I now want to save the new coordinates in an HDF5 file as a simple table for further processing.
These steps are repeated many times (about 80,000).
I followed the example of the HDF5 Group (http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend_write.c), but unfortunately the solution is quite slow.
Before writing the data directly into an HDF5 file I used a simple CSV file; that takes about 80 s for 100 repetitions, whereas appending to the HDF5 file takes 160 s.
The 'pseudo' code looks like this:
// n is a large number, e.g. 80000
for (i = 0; i < n; ++i)
{
    /* create an array of rays for tracing */
    rays = createArray(i);
    /* trace the rays */
    traceRays(&rays);
    /* write results to the HDF5 file; m is a number around 2500 */
    for (j = 0; j < m; j++)
    {
        buffer.x = rays[j].x;
        buffer.y = rays[j].y;
        // this seems to be slow:
        H5TBappend_records(h5file, tablename, 1, dst_size, dst_offset, dst_sizes, &buffer);
        // this is fast:
        sprintf(szBuffer, "%15.6E,%14.6E\n", rays[j].x, rays[j].y);
        fputs(szBuffer, outputFile);
    }
}
I could imagine that it has something to do with the overhead of extending the table at each step ?
Any help would be appreciated.
cheers,
Julian

You can get very good performance using the low level API of HDF5. I explain how to do it in this detailed answer.
Basically you need to either use a fixed-size dataset if you know its final size in advance (best case scenario), or use a chunked dataset which you can extend at will (a bit more code, more overhead, and choosing a good chunk size is critical for performance). In any case, then you can let the HDF5 library buffer the writes for you. It should be very fast.
In your case you probably want to create a compound datatype to hold each record of your table. Your dataset would then be a 1D array of your compound datatype.
NB: The methods used in the example code you linked to are correct. If it didn't work for you, that might be because your chunk size was too small.
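As a sketch of what this looks like in practice, here is the chunked, extendable dataset approach in Python with h5py (names, field layout, and sizes are illustrative; in C the same pattern uses H5Dset_extent/H5Dwrite, or simply passing a whole batch of records to one H5TBappend_records call instead of calling it once per record):

```python
# Sketch (assumes the h5py and numpy packages): an extendable chunked
# dataset of compound records, appended one 2500-record batch at a time
# instead of one record per call. File and field names are made up.
import numpy as np
import h5py

ray_dtype = np.dtype([("x", "f8"), ("y", "f8")])  # compound record type

with h5py.File("rays.h5", "w") as f:
    # maxshape=(None,) makes the dataset extendable; the chunk size
    # (here one full batch of 2500 records) is critical for performance.
    dset = f.create_dataset("rays", shape=(0,), maxshape=(None,),
                            dtype=ray_dtype, chunks=(2500,))
    for i in range(10):                    # stands in for the 80,000 iterations
        batch = np.zeros(2500, dtype=ray_dtype)
        batch["x"] = i                     # traced results would go here
        n = dset.shape[0]
        dset.resize((n + len(batch),))     # extend once per batch...
        dset[n:] = batch                   # ...and write all 2500 records at once
```

The point is that the dataset is extended and written once per batch, not once per record, and the chunk size matches the batch size, so the library can buffer writes efficiently.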

Related

Shuffle Dask array chunks from hdf5 file

I have a very large array stored in an HDF5 file. I am trying to load it and manage it as a Dask array.
At the moment my challenge is that I need to shuffle this array from time to time within a process, and shuffling an array bigger than memory is a challenge in itself.
So what I am trying to do, without success, is to shuffle the Dask array chunks.
# Prepare data
import h5py
import dask.array as da

f = h5py.File('Data.hdf5', 'r')
dset = f['/Data']
dk_array = da.from_array(dset, chunks=dset.chunks)
So given the context above, how can I shuffle the chunks?
If your array is tabular in nature then you might consider adding a column of random data (see da.concatenate and da.random), turning it into a dask.dataframe, and setting that column as the index.
As a warning, this will be somewhat slow as it will need to do an on-disk shuffle.
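The trick in the answer above boils down to "shuffle by sorting on a random key". Here is a minimal in-memory illustration of that idea with only the standard library (dask.dataframe's set_index applies the same idea out of core):

```python
# Attach a random key to each row, then sort by that key: the result is
# a uniform shuffle. This is what setting a random column as the index
# of a dask.dataframe does, except dask performs the sort on disk.
import random

data = list(range(1000))
keyed = [(random.random(), x) for x in data]   # the extra "random column"
shuffled = [x for _, x in sorted(keyed)]       # "set_index" on that column

print(sorted(shuffled) == data)  # True: a permutation of the original
```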

repeating a vector many times without repmat MATLAB

I have a very large vector in column format, and I want to repeat this vector multiple times. The simple method that works for small arrays is repmat, but I am running out of memory. I also tried bsxfun with no success; MATLAB gives me a memory error for using ones. Any idea how to do that?
Here is the simple code (just for demonstration):
t=linspace(0,1000,89759)';
tt=repmat(t,1,length(t));
or using bsxfun:
tt=bsxfun(@times,t, ones(length(t),length(t)));
The problem here is simply too much data, it does not have to do with the repmat function itself. To verify that it is too much data, you can simply try creating a matrix of ones of that size with a clear workspace to reproduce the error. On my system, I get this error:
>> clear
>> a = ones(89759,89759)
Error using ones
Requested 89759x89759 (60.0GB) array exceeds maximum array size preference. Creation of arrays greater than
this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference
panel for more information.
So you fundamentally need to reduce the amount of data you are handling.
Also, I should note that plots will hold onto references to the data, so even if you try plotting this "in chunks", then you will still run into the same problem. So again, you fundamentally need to reduce the amount of data you are handling.
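If the repeated matrix is only consumed by some downstream computation, one way to "reduce the data" is to never materialize it at all: generate the repeated columns a block at a time and process each block. A small stand-in sketch of that chunking idea (in Python rather than MATLAB, with tiny sizes):

```python
# Every column of tt would be a copy of t, so the full 60 GB matrix never
# needs to exist: yield a few columns at a time and consume them as they
# come. Sizes here are tiny stand-ins for the real 89759-element vector.
n = 8
t = [float(i) for i in range(n)]   # stand-in for the real column vector

def column_chunks(vec, ncols, chunk):
    """Yield the implicit repmat(vec, 1, ncols) a few columns at a time."""
    for start in range(0, ncols, chunk):
        width = min(chunk, ncols - start)
        yield [list(vec) for _ in range(width)]  # one block of columns

# Consume the blocks (here: a running sum) without ever holding them all.
total = sum(sum(col) for block in column_chunks(t, n, 3) for col in block)
print(total == sum(t) * n)  # True: same result, tiny peak memory
```

This only helps if the consumer really works block by block; as the answer notes, anything that holds references to all the data (such as a plot) defeats it.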

Appending data to the same dataset in hdf5 in matlab

I have to put a huge amount of data together into a single dataset in HDF5. Now, the thing is, if you try:
>> hdf5write('hd', '/dataset1', [1;2;3])
>> hdf5write('hd', '/dataset1', [4;5;6], 'WriteMode', 'append')
??? Error using ==> hdf5writec
writeH5Dset: Dataset names must be unique when appending data.
As you can see, hdf5write complains when you try to append data to the same dataset. I've looked around, and one possible workaround is to read the existing data from the dataset first, then concatenate it with the new data inside MATLAB. This is not a problem for small data, of course, but here we are talking about gigabytes of data, and MATLAB starts running out of memory.
Because of this, what are my available options in this case?
Note: we do not have h5write function in our matlab version.
I believe the 'append' mode is to add datasets to an existing file.
hdf5write does not appear to support appending to existing datasets. Without the newer h5write function, your best bet would be to write a small utility with the low-level HDF5 library functions that are exposed with the H5* package functions.
To get you started, the doc page has an example on how to append to a dataset.
You cannot do it with hdf5write, however if your version of Matlab is not too old, you can do it with h5create and h5write. This example is drawn from the doc of h5write:
Append data to an unlimited data set.
h5create('myfile.h5','/DS3',[20 Inf],'ChunkSize',[5 5]);
for j = 1:10
    data = j*ones(20,1);
    start = [1 j];
    count = [20 1];
    h5write('myfile.h5','/DS3',data,start,count);
end
h5disp('myfile.h5');
For older versions of Matlab, it should be possible to do it using the Matlab's HDF5 low level API.

redis memory efficiency

I want to load data with 4 columns and 80 million rows from MySQL into Redis, so that I can reduce fetching latency.
However, when I try to load all the data, it becomes 5 times larger.
The original data was 3 GB (when exported to CSV format), but when I load it into Redis, it takes 15 GB... that's too large for our system.
I also tried different datatypes:
1) 'table_name:row_number:column_name' -> string
2) 'table_name:row_number' -> hash
but all of them take too much memory.
Am I missing something?
Added: my data has 4 columns: user id (primary key), count, created time, and a date.
The most memory-efficient way is storing values as a JSON array, and splitting your keys such that you can store them using a ziplist-encoded hash.
Encode your data as, say, a JSON array, so you have key/value pairs like user:1234567 -> [21,'25-05-2012','14-06-2010'].
Split your keys into two parts, such that the second part has about 100 possibilities. For example, user:12345 and 67
Store this combined key in a hash like this hset user:12345 67 <json>
To retrieve user details for user id 9876523, simply do hget user:98765 23 and parse the json array
Make sure to adjust the settings hash-max-ziplist-entries and hash-max-ziplist-value
Instagram wrote a great blog post explaining this technique, so I will skip explaining why this is memory efficient.
Instead, I can tell you the disadvantages of this technique.
You cannot access or update a single attribute on a user; you have to rewrite the entire record.
You'd have to fetch the entire json object always even if you only care about some fields.
Finally, you have to write this logic on splitting keys, which is added maintenance.
As always, this is a trade-off. Identify your access patterns and see if such a structure makes sense. If not, you'd have to buy more memory.
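The key-splitting step from the answer above can be sketched in a few lines (the "user:" prefix and the 100-way split into a two-digit field are the assumptions from the answer; the redis client calls in the comments are illustrative):

```python
# Split a numeric user id into a hash key plus a two-digit field, so that
# each hash holds ~100 entries and stays ziplist-encoded.
import json

def split_key(user_id):
    """user 9876523 -> hash key 'user:98765', field '23'."""
    s = str(user_id)
    return "user:" + s[:-2], s[-2:]

def encode(count, created, date):
    """Pack one row's remaining columns as a JSON array."""
    return json.dumps([count, created, date])

key, field = split_key(9876523)
print(key, field)  # user:98765 23
# With a real redis client you would then do something like:
#   r.hset(key, field, encode(21, '25-05-2012', '14-06-2010'))
#   json.loads(r.hget(*split_key(9876523)))
```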
+1. One more idea that may free some memory in this case: key compression based on a crumbs dictionary, plus base62 encoding for stored integers.
That shrinks user:12345 60 to 'u:3d7' 'Y', which takes half the memory for storing the key.
And with custom compression of the data, packing it not into an array but into one long integer (it's possible to convert [21,'25-05-2012','14-06-2010'] into such an integer: 212505201214062010 — the last two parts have fixed length, so packing/unpacking such a value is straightforward),
the whole set of keys/values now takes 1.75 times less space.
If your codebase is Ruby-based, I may suggest the me-redis gem, which seamlessly implements all the ideas from Sripathi's answer plus the ones given here.
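For the curious, the base62 part of this comment can be sketched in a few lines (Python here; the alphabet order below — digits, then lowercase, then uppercase — reproduces the 'u:3d7' example for 12345):

```python
# Base62-encode a non-negative integer: shorter keys mean fewer bytes
# per entry in Redis. Alphabet order is an assumption that happens to
# match the '12345 -> 3d7' example in the comment above.
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def b62(n):
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(b62(12345))  # 3d7
```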

Avoiding Memcache "1000000 bytes in length" limit on values

My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration = 1 day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something you should be doing with Memcache? If so, what are the best practices for avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking an object's size in memory doesn't seem like too good an idea, because they're probably processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i//chunksize)] = serialized[i : i+chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    # assumes at most 32 chunks, i.e. values up to roughly 30 MB
    result = memcache.get_multi(['%s.%s' % (key, i) for i in xrange(32)])
    # sort chunks by their numeric index, not lexicographically
    # (otherwise 'key.10' would sort before 'key.2')
    chunks = sorted(result.items(), key=lambda kv: int(kv[0].rsplit('.', 1)[1]))
    serialized = ''.join([v for k, v in chunks if v is not None])
    return pickle.loads(serialized)
I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, I serialize my raw objects using Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is then known, I can cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array in a container object, and store that object instead of the raw object.
Each container object has a pointer to the next memcache key from which the next byte-array chunk can be fetched, or null if there are no more chunks to fetch from memcache (i.e., just like a linked list). I then re-merge the chunks of byte arrays into one large byte array and deserialize it using Java's deserializer.
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists from your dataset: a lighter one with just the most essential information, which can fit into 1 MB, and another one divided into several parts with the full information. On the light list you can apply the most essential operations, for example filtering by employee name or pagination. And then, when needed, you can load only the parts of the heavy dataset that you really need.
Granted, these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items, or whatever number is safe, load them all, and merge.
If you know how large will the objects be you can use the memcached option to allow larger objects:
memcached -I 10m
This will allow objects up to 10MB.
