I'm writing some code to remove parts of an existing HDF5 file (some dimensions, some datasets, etc.) with the C HDF5 API. I'd like the new HDF5 file to use the same chunk size as my existing HDF5 file, but I can't find anywhere to retrieve the current chunk dimensions. There is the H5P_GET_CHUNK function, but that retrieves chunk dimensions at dataset creation only. There is also the H5D_GET_CHUNK_STORAGE_SIZE function, but that only retrieves the total size (not the dimensions).
Is there a way to retrieve chunk dimensions (not just total size) from an existing dataset that I'm missing?
You may want to have a look at HDFql, as it is a (much) simpler way to manipulate HDF5 files, in particular to get the storage dimensions (i.e. chunk size) of a certain dataset. As an example, if you want to get the storage dimensions of a dataset named dset, do the following:
SHOW STORAGE DIMENSION dset
Additional information about this operation may be found in section 6.7.15 of the reference manual.
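Incidentally, if you prefer to stay within the plain C API, the chunk dimensions of an existing dataset can also be read from its dataset creation property list. Here is a minimal sketch (error checking omitted; the file and dataset names are placeholders for your own):

#include <stdio.h>
#include <hdf5.h>

int main(void)
{
    /* placeholder names: open your own file and dataset here */
    hid_t file = H5Fopen("file.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "dset", H5P_DEFAULT);

    /* the creation property list of an existing dataset still carries its chunk layout */
    hid_t dcpl = H5Dget_create_plist(dset);
    if (H5Pget_layout(dcpl) == H5D_CHUNKED) {
        hsize_t chunk_dims[H5S_MAX_RANK];
        int rank = H5Pget_chunk(dcpl, H5S_MAX_RANK, chunk_dims);
        for (int i = 0; i < rank; i++)
            printf("chunk dim %d: %llu\n", i, (unsigned long long)chunk_dims[i]);
    }

    H5Pclose(dcpl);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}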
How do I store Matlab arrays located in a 'struct within struct within struct' into a database so that I can then retrieve the fields and arrays?
More detail on why I need this is below:
I have tons of data saved as .mat files. The hassle is that I need to load a complete .mat file before I can start manipulating and plotting the data in it. If the file is large, just loading it into memory becomes quite a task.
These .mat files result from the analysis of raw electrical measurement data of transistors. All .mat files have the same structure, but each file corresponds to a different and unique transistor.
Now say I want to compare a certain parameter in all transistors that are common to A and B: I have to manually search for and load all the .mat files I need and then try to do the comparison. There is no simple way to merge all of these .mat files into a single .mat file (since they all have the same variable names but with different data). Even if that were possible, there is no way I know of to query specific entries from .mat files.
I do not see a way of easily doing that without a structured database from which I can query specific entries. Then I could use any programming language (continue with Matlab or switch to Python) to conveniently do the comparison, plotting, etc. without the hassle of the scattered .mat files.
The problem is that the data in the .mat files are structured as structs and large arrays. From what I know, storing that in a simple SQL database is not a straightforward task. I looked into HDF5, but from the examples I saw I would have to use a lot of low-level commands to store those structs in an HDF file, and I am not sure whether I can load parts of the HDF file into Matlab/Python or whether I also have to load the whole file into memory first.
The goal here is to merge all existing (and to-be-created) .mat files (with their compound data structure of structs and arrays) into a single database file from which I can query specific entries. Is there a database solution that can preserve the structure of my complex data? Is HDF the way to go? Or is there a simple solution I am missing?
EDIT:
Example of the data I need to save and retrieve:
All(16).rf.SS(3,2).data
where All is an array of structs with 7 fields. The rf field of each element is a struct containing arrays, integers, strings and structs. One of those structs, named SS, is in turn an array of structs, each containing a 2x2 array named data.
Merge .mat files into one data structure
In general it's not correct that "there is no simple way to merge ... .mat files into a single .mat file (since they all have the same variable names but with different data)".
Let's say you have two files, data1.mat and data2.mat and each one contains two variables, a and b. You can do:
>> s = load('data1')
s =
struct with fields:
a: 'foo'
b: 3
>> s(2) = load('data2')
s =
1×2 struct array with fields:
a
b
Now you have a struct array (see note below). You can access the data in it like this:
>> s(1).a
ans =
'foo'
>> s(2).a
ans =
'bar'
But you can also get all the values at once for each field, as a comma-separated list, which you can assign to a cell array or matrix:
>> s.a
ans =
'foo'
ans =
'bar'
>> allAs = {s.a}
allAs =
1×2 cell array
{'foo'} {'bar'}
>> allBs = [s.b]
allBs =
3 4
Note: Annoyingly, it seems you have to create the struct with the correct fields before you can assign to it using indexing. In other words
s = struct;
s(1) = load('data1')
won't work, but
s = struct('a', [], 'b', [])
s(1) = load('data1')
is OK.
Build an index to the .mat files
If you don't need to be able to search on all of the data in each .mat file, just certain fields, you could build an index in MATLAB containing just the relevant metadata from each .mat file plus a reference (e.g. filename) to the file itself. This is less robust as a long-term solution as you have to make sure the index is kept in sync with the files, but should be less work to set up.
Flatten the data structure into a database-compatible table
If you really want to keep everything in a database, then you can convert your data structure into a tabular form where any multi-dimensional elements such as structs or arrays are 'flattened' into a table row with one scalar value per (suitably-named) table variable.
For example if you have a struct s with fields s.a and s.b, and s.b is a 2 x 2 matrix, you might call the variables s_a, s_b_1_1, s_b_1_2, s_b_2_1 and s_b_2_2 - probably not the ideal database design, but you get the idea.
You should be able to adapt the code in this answer and/or the MATLAB File Exchange submissions flattenstruct2cell and flatten-nested-cell-arrays to suit your needs.
I'm failing to save a large dataset of float values in an HDF5 file efficiently.
The data acquisition works as follows:
A fixed array of 'ray data' (coordinates, directions, wavelength, intensity, etc.) is created and sent to an external ray-tracing program (it's about 2500 values).
In return I get the same array but with changed data.
I now want to save the new coordinates in an HDF5 file, as a simple table, for further processing.
These steps are repeated many times (about 80 000).
I followed the HDF Group's example http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend_write.c, but unfortunately the solution is quite slow.
Before writing the data directly into an HDF5 file I used a simple CSV file; that takes about 80 sec for 100 repetitions, whereas appending to the HDF5 file takes 160 sec.
The 'pseudo' code looks like this:
/* n is a large number, e.g. 80000 */
for (i = 0; i < n; ++i)
{
    /* create an array of rays for tracing */
    rays = createArray(i);
    /* trace the rays */
    traceRays(&rays);
    /* write results to the HDF5 file, m is a number around 2500 */
    for (j = 0; j < m; j++)
    {
        buffer.x = rays[j].x;
        buffer.y = rays[j].y;
        /* this seems to be slow: */
        H5TBappend_records(h5file, tablename, 1, dst_size, dst_offset, dst_sizes, &buffer);
        /* this is fast: */
        sprintf(szBuffer, "%15.6E,%14.6E\n", rays[j].x, rays[j].y);
        fputs(szBuffer, outputFile);
    }
}
I imagine it has something to do with the overhead of extending the table at each step?
Any help would be appreciated.
cheers,
Julian
You can get very good performance using the low-level API of HDF5. I explain how to do it in this detailed answer.
Basically you need to either use a fixed-size dataset if you know its final size in advance (the best-case scenario), or use a chunked dataset which you can extend at will (a bit more code, more overhead, and choosing a good chunk size is critical for performance). In either case, you can then let the HDF5 library buffer the writes for you. It should be very fast.
In your case you probably want to create a compound datatype to hold each record of your table. Your dataset would then be a 1D array of your compound datatype.
NB: The methods used in the example code you linked to are correct. If it didn't work for you, that might be because your chunk size was too small.
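To make this concrete, here is a minimal sketch along those lines (error checking omitted; the file name, dataset name, field names and chunk size are placeholder choices based on your pseudo code): a chunked, extendable 1-D dataset of a compound type, extended and written one batch of 2500 records per iteration instead of one record at a time:

#include <hdf5.h>

typedef struct { double x; double y; } ray_t;

int main(void)
{
    /* compound type matching the in-memory record layout */
    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(ray_t));
    H5Tinsert(rtype, "x", HOFFSET(ray_t, x), H5T_NATIVE_DOUBLE);
    H5Tinsert(rtype, "y", HOFFSET(ray_t, y), H5T_NATIVE_DOUBLE);

    /* extendable 1-D dataset with chunked storage */
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {4096};
    hid_t file  = H5Fcreate("rays.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset  = H5Dcreate2(file, "rays", rtype, space, H5P_DEFAULT, dcpl, H5P_DEFAULT);

    enum { m = 2500, n = 80000 };
    ray_t batch[m];
    hsize_t offset[1] = {0}, count[1] = {m};
    for (int i = 0; i < n; i++) {
        /* ... fill batch[] from the ray tracer ... */

        /* grow the dataset by one batch and write the whole batch in one call */
        hsize_t newsize[1] = {offset[0] + m};
        H5Dset_extent(dset, newsize);
        hid_t filespace = H5Dget_space(dset);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(1, count, NULL);
        H5Dwrite(dset, rtype, memspace, filespace, H5P_DEFAULT, batch);
        H5Sclose(memspace);
        H5Sclose(filespace);
        offset[0] += m;
    }

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(rtype);
    H5Fclose(file);
    return 0;
}

Writing a whole batch per iteration amortizes the cost of extending the dataset and of each H5Dwrite call, and lets the HDF5 library buffer the I/O; the chunk size of 4096 records is just a starting point to tune.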
I have to put all of the huge data together into a single dataset in HDF5. Now, the thing is, if you try:
>> hdf5write('hd', '/dataset1', [1;2;3])
>> hdf5write('hd', '/dataset1', [4;5;6], 'WriteMode', 'append')
??? Error using ==> hdf5writec
writeH5Dset: Dataset names must be unique when appending data.
As you can see, hdf5write will complain when you try to append data to the same dataset. I've looked around and seen that one possible workaround is to read your data back from the dataset first, then concatenate the data right in the Matlab environment. That is not a problem for small data, of course, but in this case we are talking about gigabytes of data, and Matlab starts yelling "out of memory".
Given this, what are my available options?
Note: we do not have the h5write function in our version of Matlab.
I believe the 'append' mode is to add datasets to an existing file.
hdf5write does not appear to support appending to existing datasets. Without the newer h5write function, your best bet would be to write a small utility with the low-level HDF5 library functions that are exposed with the H5* package functions.
To get you started, the doc page has an example of how to append to a dataset.
You cannot do it with hdf5write; however, if your version of Matlab is not too old, you can do it with h5create and h5write. This example is drawn from the documentation of h5write:
Append data to an unlimited data set.
h5create('myfile.h5','/DS3',[20 Inf],'ChunkSize',[5 5]);
for j = 1:10
data = j*ones(20,1);
start = [1 j];
count = [20 1];
h5write('myfile.h5','/DS3',data,start,count);
end
h5disp('myfile.h5');
For older versions of Matlab, it should be possible to do it using the Matlab's HDF5 low level API.
My model has different entities that I'd like to calculate only once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration = 1 day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something that you should be doing with Memcache? If so, what are best practices in avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking for an object's size in memory doesn't seem like too good an idea because they're probably being processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    # pickle the object, then split the byte string into memcache-sized chunks
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i//chunksize)] = serialized[i : i+chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    keys = ['%s.%s' % (key, i) for i in xrange(32)]
    result = memcache.get_multi(keys)
    # re-join the chunks in numeric order (a plain sort would put 'key.10' before 'key.2')
    serialized = ''.join([result[k] for k in keys if result.get(k) is not None])
    return pickle.loads(serialized)
I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, what I did was serialize my raw objects using Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is then known, I can cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array in a container object and store that object instead of the raw objects.
Each container object has a pointer to the next memcache key where the next byte-array chunk can be fetched, or null if there are no more chunks to fetch from memcache (i.e. just like a linked list). I then re-merge the chunks of byte arrays into one large byte array and deserialize it using Java's deserializer.
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists from your dataset: one light list with just the most essential information, which can fit into 1 MB, and another list with the full information, divided into several parts. On the light list you will be able to apply the most essential operations, for example filtering by employee name or pagination. And when the heavy dataset is needed, you will be able to load only the parts you really need.
But these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items, or whatever number is safe, and load them all and merge.
If you know how large the objects will be, you can use the memcached option that allows larger objects:
memcached -I 10m
This will allow objects up to 10MB.
Java - How do you read binary objects into an object array without knowing the size beforehand? For example, I don't know how many "clients" are in a binary file, so how do I read them into an array without knowing the count in advance? I know I could probably use a Vector, but I have to use an array.
When you create an ArrayList, it creates a T[] of the reserved size.
When you add one too many items, it makes a new, larger T[] and uses System.arraycopy to move the contents.
For an unbounded number of possible inputs, this is the best you can do. You can even read the source of ArrayList to watch it being done.
Another possible solution applies when you can guarantee a maximum possible size, even if you don't know what the actual size is. You make an array as big as the maximum, put things into it. When done, create the final array of the actual size, and copy.
You can create a new array when you run out of space, then use arraycopy to copy the old elements to the new.
Use something from the collections library, like a Vector or an ArrayList.