Sparse array support in HDF5 - sparse-matrix

I need to store a 512^3 array on disk in some way and I'm currently using HDF5. Since the array is sparse a lot of disk space gets wasted.
Does HDF5 provide any support for sparse arrays?

One workaround is to create the dataset with a compression option. For example, in Python using h5py:
import h5py
f = h5py.File('my.h5', 'w')
d = f.create_dataset('a', dtype='f', shape=(512, 512, 512), fillvalue=-999.,
                     compression='gzip', compression_opts=9)
d[3, 4, 5] = 6
f.close()
The resulting file is 4.5 KB. Without compression, this same file would be about 512 MB. That's a compression of 99.999%, because most of the data are -999. (or whatever fillvalue you want).
The equivalent can be achieved with the C++ HDF5 API by calling H5::DSetCreatPropList::setDeflate with a level of 9; an example is shown in h5group.cpp.
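If you are working with the plain C API instead, a rough (untested) sketch of the same idea follows; the chunk size, file name, and dataset name here are arbitrary placeholders, not anything prescribed:

#include "hdf5.h"

int main(void)
{
    hsize_t dims[3]  = {512, 512, 512};
    hsize_t chunk[3] = {64, 64, 64};      /* chunk size is a guess; tune for your access pattern */
    float   fill     = -999.f;

    hid_t file  = H5Fcreate("my.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);

    /* dataset creation property list: chunking is required for compression */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_deflate(dcpl, 9);                          /* gzip level 9 */
    H5Pset_fill_value(dcpl, H5T_NATIVE_FLOAT, &fill); /* same role as fillvalue=-999. above */

    hid_t dset = H5Dcreate2(file, "a", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}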

Chunked datasets (H5D_CHUNKED) allow sparse storage, but depending on your data, the overhead may be significant.
Take a typical array, try it both sparse and non-sparse, and compare the file sizes; then you will see whether it is really worth it.

HDF5 provides indexed storage: http://www.hdfgroup.org/HDF5/doc/TechNotes/RawDStorage.html

Related

How to get HDF5 chunk dimensions in existing file?

I'm writing some code to remove parts of an existing HDF5 file (some dimensions, some datasets, etc.) with the C HDF5 API. I'd like the new HDF5 file to come with the same chunk size as my existing HDF5 file, but I can't seem to find anywhere that I can retrieve current chunk dimensions. There is the H5P_GET_CHUNK function, but that retrieves chunk dimensions on dataset creation only. There is also the H5D_GET_CHUNK_STORAGE_SIZE function, which only retrieves total size (not dimensions).
Is there a way to retrieve chunk dimensions (not just total size) from an existing dataset that I'm missing?
You may want to have a look at HDFql as it is a (much) simpler way to manipulate HDF5 files, in particular to get the storage dimensions (i.e. chunk size) of a certain dataset. As an example, if you want to get the storage dimensions of a dataset named dset do the following:
SHOW STORAGE DIMENSION dset
Additional information about this operation may be found in section 6.7.15 of the reference manual.
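If you would rather stay with the plain C HDF5 API, the creation property list of an already existing dataset can be recovered with H5Dget_create_plist and then queried with H5Pget_chunk. A rough, untested sketch (the file and dataset names are placeholders):

#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    hid_t file = H5Fopen("my.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "dset", H5P_DEFAULT);

    /* the dataset carries a copy of the property list it was created with */
    hid_t dcpl = H5Dget_create_plist(dset);

    if (H5Pget_layout(dcpl) == H5D_CHUNKED) {
        hsize_t chunk_dims[H5S_MAX_RANK];
        int rank = H5Pget_chunk(dcpl, H5S_MAX_RANK, chunk_dims);
        for (int i = 0; i < rank; i++)
            printf("chunk dim %d: %llu\n", i, (unsigned long long)chunk_dims[i]);
    }

    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}

Note that H5Pget_chunk returns the chunk rank, so the same call also tells you how many entries of chunk_dims are meaningful.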

Random lookups in large sparse arrays?

I'm using HDF5 to store massive sparse arrays in coordinate format (basically, an M x 3 array which stores the value, x index and y index for each non-zero element).
This is great for processing the whole dataset in an iterative manner, but I am struggling with random lookups based on index values.
E.g., given a 100x100 matrix, I might store the non-zero elements like so:
[[1,2,3,4,5], // Data values
[13, 14, 55, 67, 80], // X-indices
[45, 12, 43, 55, 12]] // Y-indices
I then wish to get all the data values between 10<x<32 and 10<y<32, for example. With the current format, all I can do is iterate through the x and y index arrays looking for matching indices. This is very, very slow, with multiple reads from disk (my real data typically has a size of 200000x200000 with perhaps 10000000 non-zero elements).
Is there a better way to store large (larger than RAM) sparse matrices and support rapid index-based lookups?
I'm using HDF5, but happy to be pointed in other directions.
First, let's suppose that, as your example hints but you don't state conclusively, you store the elements in order sorted by x first and by y second.
One easy technique for more rapid lookup would be to store an x-index-index, a vector of tuples (following your example this might be [(10,1),(20,null),(30,null),(40,null),(50,3),...]) pointing to locations in the x-index vector at which runs of elements start. If this index-index fits comfortably in RAM you could get away with reading it from disk only once at the start of your computation.
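As a toy illustration of that index-index idea (not HDF5-specific, using the small example arrays above and a bucket width of 10; all names here are made up), the lookup might look roughly like this in C:

#include <stdio.h>

#define BUCKET   10   /* width of each x "run" in the index-index */
#define NBUCKETS 10   /* enough buckets for the 100x100 example   */
#define EMPTY    -1   /* stands in for the "null" entries above   */

/* the three COO columns, already loaded into memory */
static int x_index[] = {13, 14, 55, 67, 80};
static int y_index[] = {45, 12, 43, 55, 12};
static int data[]    = { 1,  2,  3,  4,  5};
static int nnz = 5;

int main(void)
{
    /* build the index-index once: first position of each bucket of x values */
    long run_start[NBUCKETS];
    for (int b = 0; b < NBUCKETS; b++) run_start[b] = EMPTY;
    for (long i = nnz - 1; i >= 0; i--) run_start[x_index[i] / BUCKET] = i;

    /* query 10 < x < 32: jump to the first relevant run, then scan until x is too big */
    int xlo = 10, xhi = 32, ylo = 10, yhi = 32;
    long start = EMPTY;
    for (int b = (xlo + 1) / BUCKET; b < NBUCKETS && start == EMPTY; b++)
        start = run_start[b];

    for (long i = (start == EMPTY ? nnz : start); i < nnz && x_index[i] < xhi; i++)
        if (x_index[i] > xlo && y_index[i] > ylo && y_index[i] < yhi)
            printf("x=%d y=%d value=%d\n", x_index[i], y_index[i], data[i]);
    return 0;
}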
Of course, this only supports rapid location of x indices, and then a scan for the y. If you need to support rapid location of both you're into the realm of spatial indexing, and HDF5 might not be the best on-disk storage you could choose.
One thought that does occur, though, would be to define a z-order curve across your array and to store the elements in your HDF5 file in that order. To supplement that you'd want to define a z-index which would identify the location of the start of the elements in each 'tile' of the array. This all begins to get a bit hairy; I suggest you look at the Wikipedia article on z-order curves and do some head scratching.
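If you do go down the z-order route, the core ingredient is a function that interleaves the bits of the x and y indices to form the sort key; a small sketch, assuming 32-bit indices (plenty for 200000x200000):

#include <stdint.h>
#include <stdio.h>

/* spread the lower 32 bits of v so that each bit lands in an even position */
static uint64_t spread_bits(uint64_t v)
{
    v &= 0xffffffffULL;
    v = (v | (v << 16)) & 0x0000ffff0000ffffULL;
    v = (v | (v << 8))  & 0x00ff00ff00ff00ffULL;
    v = (v | (v << 4))  & 0x0f0f0f0f0f0f0f0fULL;
    v = (v | (v << 2))  & 0x3333333333333333ULL;
    v = (v | (v << 1))  & 0x5555555555555555ULL;
    return v;
}

/* z-order (Morton) key: sort/store the COO triplets by this key instead of by x,y */
static uint64_t morton_key(uint32_t x, uint32_t y)
{
    return (spread_bits(x) << 1) | spread_bits(y);
}

int main(void)
{
    printf("%llu\n", (unsigned long long)morton_key(13, 45)); /* key for the first element above */
    return 0;
}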
Finally, in case it's not crystal clear, I've looked at this only from the point of view of reading values out of the file. All the suggestions I've made make creating and updating the file more difficult.
Finally, finally, you're not the first person to think about effective and efficient indexing for sparse arrays and your favourite search engine will throw up some useful resources for your study.

How to efficiently append data to an HDF5 table in C?

I'm failing to save a large dataset of float values in an HDF5 file efficiently.
The data acquisition works as follows:
A fixed array of 'ray data' (coordinates, directions, wavelength, intensity, etc.) is created and sent to an external ray-tracing program (it's about 2500 values).
In return I get the same array but with changed data.
I now want to save the new coordinates in an HDF5 for further processing as a simple table.
These steps are repeated many times (about 80 000).
I followed the example of the HDF5 Group http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend_write.c, but unfortunately the solution is quite slow.
Before writing the data directly into an HDF5 file I used a simple CSV file; that takes about 80 s for 100 repetitions, whereas appending to the HDF5 file takes 160 s.
The 'pseudo' code looks like this:
//n is a large number e.g. 80000
for (i = 0; i < n; ++i)
{
    /* create an array of rays for tracing */
    rays = createArray(i);
    /* trace the rays */
    traceRays(&rays);
    /* write results to hdf5 file, m is a number around 2500 */
    for (j = 0; j < m; j++)
    {
        buffer.x = rays[j].x;
        buffer.y = rays[j].y;
        // this seems to be slow:
        H5TBappend_records(h5file, tablename, 1, dst_size, dst_offset, dst_sizes, &buffer);
        // this is fast:
        sprintf(szBuffer, "%15.6E,%14.6E\n", rays[j].x, rays[j].y);
        fputs(szBuffer, outputFile);
    }
}
I could imagine that it has something to do with the overhead of extending the table at each step?
Any help would be appreciated.
cheers,
Julian
You can get very good performance using the low level API of HDF5. I explain how to do it in this detailed answer.
Basically you need to either use a fixed-size dataset if you know its final size in advance (best case scenario), or use a chunked dataset which you can extend at will (a bit more code, more overhead, and choosing a good chunk size is critical for performance). In any case, then you can let the HDF5 library buffer the writes for you. It should be very fast.
In your case you probably want to create a compound datatype to hold each record of your table. Your dataset would then be a 1D array of your compound datatype.
NB: The methods used in the example code you linked to are correct. If it didn't work for you, that might be because your chunk size was too small.
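To make the block-writing idea concrete, here is a rough, untested C sketch along those lines, using a fixed-size 1D dataset of a compound type; the struct fields, the file and dataset names, and the fill step are placeholders for your real ray data:

#include "hdf5.h"

typedef struct { double x; double y; } Record;   /* one row of the table */

int main(void)
{
    const hsize_t n = 80000, m = 2500;           /* numbers taken from the question */
    Record buffer[2500];                         /* one full block per trace run */

    /* compound type matching the struct layout */
    hid_t rec_t = H5Tcreate(H5T_COMPOUND, sizeof(Record));
    H5Tinsert(rec_t, "x", HOFFSET(Record, x), H5T_NATIVE_DOUBLE);
    H5Tinsert(rec_t, "y", HOFFSET(Record, y), H5T_NATIVE_DOUBLE);

    /* fixed-size 1D dataset: the final size n*m is known in advance */
    hsize_t total = n * m;
    hid_t file  = H5Fcreate("rays.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &total, NULL);
    hid_t dset  = H5Dcreate2(file, "rays", rec_t, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t memspace = H5Screate_simple(1, &m, NULL);
    for (hsize_t i = 0; i < n; i++) {
        /* ... fill buffer[0..m-1] from the ray tracer here ... */

        /* write the whole block of m records in one call instead of m single appends */
        hsize_t start = i * m;
        hid_t filespace = H5Dget_space(dset);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &m, NULL);
        H5Dwrite(dset, rec_t, memspace, filespace, H5P_DEFAULT, buffer);
        H5Sclose(filespace);
    }

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(rec_t);
    H5Fclose(file);
    return 0;
}

Writing one block of ~2500 records per call (or buffering several blocks before writing) removes most of the per-call overhead that makes appending records one at a time slow.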

Appending data to the same dataset in hdf5 in matlab

I have to put all of this huge data together into a single dataset in HDF5. Now, the thing is, if you try:
>> hdf5write('hd', '/dataset1', [1;2;3])
>> hdf5write('hd', '/dataset1', [4;5;6], 'WriteMode', 'append')
??? Error using ==> hdf5writec
writeH5Dset: Dataset names must be unique when appending data.
As you can see, hdf5write will complain when you try to append data to the same dataset. I've looked around and one possible workaround is to read your data from the dataset first, then concatenate the data in the MATLAB environment. This is not a problem for small data, of course. But in this case we are talking about gigabytes of data, and MATLAB starts throwing out-of-memory errors.
Because of this, what are my available options in this case?
Note: we do not have the h5write function in our MATLAB version.
I believe the 'append' mode is to add datasets to an existing file.
hdf5write does not appear to support appending to existing datasets. Without the newer h5write function, your best bet would be to write a small utility with the low-level HDF5 library functions that are exposed with the H5* package functions.
To get you started, the doc page has an example on how to append to a dataset.
You cannot do it with hdf5write; however, if your version of MATLAB is not too old, you can do it with h5create and h5write. This example is drawn from the documentation of h5write:
Append data to an unlimited data set.
h5create('myfile.h5','/DS3',[20 Inf],'ChunkSize',[5 5]);
for j = 1:10
data = j*ones(20,1);
start = [1 j];
count = [20 1];
h5write('myfile.h5','/DS3',data,start,count);
end
h5disp('myfile.h5');
For older versions of MATLAB, it should be possible to do it using MATLAB's low-level HDF5 API.

Avoiding Memcache "1000000 bytes in length" limit on values

My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration=1day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something that you should be doing with Memcache? If so, what are best practices in avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking for an object's size in memory doesn't seem like too good an idea because they're probably being processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i // chunksize)] = serialized[i:i + chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    keys = ['%s.%s' % (key, i) for i in xrange(32)]
    result = memcache.get_multi(keys)
    # join the chunks back in numeric order (a plain string sort would put '10' before '2')
    serialized = ''.join([result[k] for k in keys if result.get(k) is not None])
    return pickle.loads(serialized)
I frequently store objects with the size of several megabytes on the memcache. I cannot comment on whether this is a good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our app engine instances.
Since I am using Java, what I did was serialize my raw objects using Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is now known, I could cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array in a container object, and store that object instead of the raw objects.
Each container object has a pointer to the next memcache key from which the next byte-array chunk can be fetched, or null if there are no more chunks to fetch from memcache (i.e., just like a linked list). I then re-merge the chunks of byte arrays into a large byte array and deserialize it using Java's deserializer.
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists of your dataset: one lighter list with just the most essential information, which can fit into 1 MB, and another list, divided into several parts, with the full information. On the light list you will be able to apply the most essential operations, for example filtering by employee name or pagination. Then, when the heavy dataset is needed, you will be able to load only the parts you really need.
But these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items (or whatever number is safe), load them all, and merge.
If you know how large the objects will be, you can use the memcached option that allows larger objects:
memcached -I 10m
This will allow objects up to 10MB.
