Appending data to the same dataset in an HDF5 file in MATLAB

I need to combine a huge amount of data into a single dataset in an HDF5 file. The thing is, if you try:
>> hdf5write('hd', '/dataset1', [1;2;3])
>> hdf5write('hd', '/dataset1', [4;5;6], 'WriteMode', 'append')
??? Error using ==> hdf5writec
writeH5Dset: Dataset names must be unique when appending data.
As you can see, hdf5write complains when you try to append data to an existing dataset. One workaround I've seen is to read the existing data from the dataset first, then concatenate the new data right in the MATLAB environment. That is not a problem for small data, of course, but here we are talking about gigabytes of data, and MATLAB starts yelling "out of memory".
Given this, what are my options in this case?
Note: our MATLAB version does not have the h5write function.

I believe the 'append' mode is to add datasets to an existing file.
hdf5write does not appear to support appending to existing datasets. Without the newer h5write function, your best bet is to write a small utility using the low-level HDF5 library functions exposed through the H5* package functions.
To get you started, the doc page has an example of how to append to a dataset.
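A minimal sketch of that approach, assuming the dataset was created with an unlimited first dimension and chunking (datasets created by hdf5write are not extendable); the file and dataset names are illustrative:
newData = [4; 5; 6];
fid = H5F.open('hd.h5', 'H5F_ACC_RDWR', 'H5P_DEFAULT');
dset = H5D.open(fid, '/dataset1');
% Get the current extent of the 1-D dataset, then grow it.
space = H5D.get_space(dset);
[~, dims] = H5S.get_simple_extent_dims(space);
H5S.close(space);
H5D.set_extent(dset, dims + numel(newData));
% Select the newly added region and write into it.
space = H5D.get_space(dset);
H5S.select_hyperslab(space, 'H5S_SELECT_SET', dims, [], numel(newData), []);
memspace = H5S.create_simple(1, numel(newData), []);
H5D.write(dset, 'H5ML_DEFAULT', memspace, space, 'H5P_DEFAULT', newData);
H5S.close(memspace); H5S.close(space); H5D.close(dset); H5F.close(fid);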

You cannot do it with hdf5write, but if your version of MATLAB is not too old, you can do it with h5create and h5write. This example is drawn from the h5write documentation:
Append data to an unlimited data set.
h5create('myfile.h5','/DS3',[20 Inf],'ChunkSize',[5 5]);
for j = 1:10
    data = j*ones(20,1);
    start = [1 j];
    count = [20 1];
    h5write('myfile.h5','/DS3',data,start,count);
end
h5disp('myfile.h5');
For older versions of MATLAB, it should be possible to do it using MATLAB's low-level HDF5 API, as sketched below.
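For reference, a minimal low-level sketch of creating such an extendable (chunked, unlimited) dataset, which hdf5write cannot produce; names and sizes are illustrative:
fid = H5F.create('myfile.h5', 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
% An unlimited first dimension lets the dataset grow later.
unlimited = H5ML.get_constant_value('H5S_UNLIMITED');
space = H5S.create_simple(1, 0, unlimited);
dcpl = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, 1024); % chunking is required for extendable datasets
dset = H5D.create(fid, '/DS3', 'H5T_NATIVE_DOUBLE', space, dcpl);
H5D.close(dset); H5P.close(dcpl); H5S.close(space); H5F.close(fid);
Appending then works exactly as in the low-level sketch in the previous answer.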

Related

MATLAB strings in arrays

I know that I am pretty confused about arrays and strings and have tried a bunch of things but I am still stumped. I have groups of data that I am pulling into various arrays. For example I have site locations coming from one source. Numerous cores can be at a single location. The cores can have multiple depths. So I am pulling all this data together in various ways and pushing it out into a single excel file for each core. I create a filename based on location id and core name and year the core was sampled. So it might look like ‘ID_14_CORE_Bu-2-MT-1991.xlsx’ and I am storing it to use with a xlswrite statement in a variable called “filename.” This is all working fine.
But now I want to keep track of what files I have created and when I created them in another EXCEL file. So I was trying to store the location, filename and the date it was processed into some sort of array so that I can use the xlswrite statement to push it all out after I have processed all the locations/cores/layers that might occur in the original input files.
As I start the program and look in the original input files I can figure out how many cores I have so I wanted to create some sort of array to hold the location, filename and the date together. I have tried to use a cell array (a = cell(numcores,3)) but that does not seem to work. I think I am understanding that the filename is actually a string array so each of the letters is trying to be assigned to a separate cell instead of just the cell in the second column.
I also have had problems trying to push the three values out to the summary EXCEL file as each core is being processed but MATLAB tends to treat single dimensional arrays as a row rather than a column so I am kind of confused there.
Below is what I want the array to end up like... but since I am developing the filename on the fly, this seems more challenging.
ArraytoExcel = ["14", "ID_14_CORE_Bu-2-MT-1991.xlsx", "1/1/2018";
                "14", "ID_14_CORE_Bu-3-MT-1991.xlsx", "1/1/2018";
                "13", "ID_13_CORE_Tail_33-1992.xlsx", "1/1/2018"]
Maybe I am just going about this the wrong way. Any suggestions would help.
Your question is a little confusing, but I think you want to do something like the following. The variables inside my example are static, but from your question it sounds like you already have these figured out.
numcores = 5; % ... or however you determine what you are processing
ArraytoExcel = cell(numcores,3);
for ii = 1:numcores
    % These three values need to be determined by you in the loop,
    % not static as in this example.
    coreID = '14';
    filename = 'ID_14_CORE_Bu-2-MT-1991.xlsx';
    dataProc = datestr(now,'mm/dd/yyyy');
    ArraytoExcel(ii,:) = {coreID,filename,dataProc};
end
xlswrite('YourOutput.xls',ArraytoExcel)
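On the row-versus-column point from the question: a 1-by-3 cell always comes out as a single spreadsheet row, so if you would rather append each record to the summary file as it is produced, a sketch like this works (the file name and sheet are illustrative):
row = {coreID, filename, dataProc}; % 1x3 cell: one spreadsheet row
xlswrite('Summary.xls', row, 1, sprintf('A%d', ii)); % place it on row ii
Note that every xlswrite call reopens the file, so collecting everything in the cell array first, as above, is usually faster.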

Modelica Flexible Array Size - Error: Failed to expand the variable

I still have problems when using arrays of undefined size in Modelica.
It would be really helpful to understand what the underlying problem is.
I have read a lot (known size at compile time, functions, constructors and destructors, records, and so on), but I am still not sure what the proper way to use flexible arrays in Modelica is.
Can somebody give a good explanation, maybe using the following example?
Given data:
x y
1 7
2 1
3 1
4 7
Modelica model which works fine:
model Test_FixedMatrix
  parameter Real A[4,2] = [1,7;2,1;3,1;4,7];
  parameter Integer nA = size(A,1);
  parameter Integer iA[:] = Modelica.Math.BooleanVectors.index( {A[i,2] == 1 for i in 1:nA});
  parameter Real Aslice[:,:] = A[iA,:];
end Test_FixedMatrix;
The array iA[:] gives the indices of all rows having y = 1, and Aslice[:,:] collects all these rows in one matrix.
Now instead of using a given matrix, I want to read the data from a file. Using the Modelica library ExternData (https://github.com/tbeu/ExternData) it is possible to read the data from an xlsx-file:
model Test_MatrixFromFile
  ExternData.XLSXFile xlsxfile(fileName="data/ExampleData.xlsx");
  parameter Real B[:,:] = xlsxfile.getRealArray2D("A2","Tabelle1",4,2);
  parameter Integer nB = size(B,1);
  parameter Integer iB[:] = Modelica.Math.BooleanVectors.index( {B[i,2] == 1 for i in 1:nB});
  parameter Real Bslice[:,:] = B[iB,:];
end Test_MatrixFromFile;
The xlsx-file contains the same 4x2 data shown above (sheet Tabelle1, starting at cell A2).
It's exactly the same code, but this time I get the error message "Failed to expand the variable".
(If I delete the two lines with iB and Bslice then it works, so the problem is not the xlsxfile.getRealArray2D function.)
Application example and further questions:
It would be really nice to be able to just read a data file into Modelica and then prepare the data directly within the Modelica code, without using any other software. Of course I can prepare all the data using e.g. a Python script and then use TimeTable components... but sometimes this is not really convenient.
Is there another way to read in data (txt or csv) without specifying the size of the table in it?
What would it look like if I wanted to read in data from different files, e.g. every 10 seconds?
I also found the Modelica functions readFile and countLines in Modelica.Utilities.Streams. Is it maybe only possible to do this using external functions?
I assume the fundamental issue is that Modelica.Math.BooleanVectors.index returns a vector of variable size. The size depends on the input of the function, which in this case comes from the xlsx file; the size of iB therefore depends on the content of the xlsx file, which can change independently of the model itself. This is exactly the "known size at compile time" problem you mentioned: the size of iB depends on something that is not fixed at translation time.
Another thing I learned from my experience with Dymola is that it behaves differently if functions like ExternData.XLSXFile are translated before use. This means opening the function in Dymola (double-clicking it in the package browser) and pressing the translate button (or F9). This generates the respective .exe file in your working directory and makes Dymola more "confident" that the outputs of the function are parameters rather than variables. I don't know the implementation details, so I can't be more precise, but it is worth a try.

How to efficiently append data to an HDF5 table in C?

I'm failing to save a large dataset of float values in an HDF5 file efficiently.
The data acquisition works as follows:
A fixed array of 'ray data' (coordinates, directions, wavelength, intensity, etc.) is created and sent to an external ray-tracing program (it is about 2500 values).
In return I get the same array but with changed data.
I now want to save the new coordinates in an HDF5 file, as a simple table, for further processing.
These steps are repeated many times (about 80 000).
I followed the example of the HDF5 Group (http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend_write.c), but unfortunately the solution is quite slow.
Before writing the data directly into an HDF5 file, I used a simple CSV file; that takes about 80 s for 100 repetitions, whereas appending to the HDF5 file takes 160 s.
The 'pseudo' code looks like this:
/* n is a large number, e.g. 80000 */
for (i = 0; i < n; ++i)
{
    /* create an array of rays for tracing */
    rays = createArray(i);
    /* trace the rays */
    traceRays(&rays);
    /* write results to the HDF5 file; m is a number around 2500 */
    for (j = 0; j < m; j++)
    {
        buffer.x = rays[j].x;
        buffer.y = rays[j].y;
        /* this seems to be slow: */
        H5TBappend_records(h5file, tablename, 1, dst_size, dst_offset, dst_sizes, &buffer);
        /* this is fast: */
        sprintf(szBuffer, "%15.6E,%14.6E\n", rays[j].x, rays[j].y);
        fputs(szBuffer, outputFile);
    }
}
I could imagine that it has something to do with the overhead of extending the table at each step?
Any help would be appreciated.
cheers,
Julian
You can get very good performance using the low level API of HDF5. I explain how to do it in this detailed answer.
Basically you need to either use a fixed-size dataset if you know its final size in advance (the best case), or a chunked dataset which you can extend at will (a bit more code, more overhead, and choosing a good chunk size is critical for performance). In either case, you can then let the HDF5 library buffer the writes for you. It should be very fast.
In your case you probably want to create a compound datatype to hold each record of your table. Your dataset would then be a 1D array of your compound datatype.
NB: The methods used in the example code you linked to are correct. If it didn't work for you, that may be because your chunk size was too small.

Saving parts of Matlab cell array

I am using Matlab for some data collection, and I want to save the data after each trial (just in case something goes wrong). The data is organized as a cell array of cell arrays, basically in the format
data{target}{trial} = zeros(1000,19)
But the actual data gets up to >150 MB by the end of the collection, so saving everything after each trial becomes prohibitively slow.
So now I am looking at opting for the matfile approach (http://www.mathworks.de/de/help/matlab/ref/matfile.html), which would allow me to only save parts of the data. The problem: this doesn't support cells of cell arrays, which means I couldn't change/update the data for a single trial; I would have to re-save the entire target's data (100 trials).
So, my question:
Is there another different method I can use to save parts of the cell array to speed up saving?
(OR)
Is there a better way to format my data that would work with this saving process?
A not very elegant but possibly effective solution is to use trial as part of the variable name. That is, instead of a cell array of cell arrays (data{target}{trial}), use separate cell arrays such as data_1{target}, data_2{target}, where 1, 2, ... are the values of the trial counter.
You could do that with eval: for example
trial = 1; % change this value in a for loop
eval(['data_' num2str(trial) '{target} = zeros(1000,19);']); % fills data_1{target}
You can then save the data for each trial in a different file. For example, this
eval([ 'save temp_save_file_' num2str(trial) ' data_' num2str(trial)])
saves data_1 in file temp_save_file_1, etc.
Update:
Actually it does appear to be possible to index into cell arrays with matfile, just not into cells nested inside cell arrays. Hence, if you store your data slightly differently, it seems you can use matfile to update only part of it. See this example:
x = cell(3,4);
save x;
matObj = matfile('x.mat','writable',true);
matObj.x(3,4) = {eye(10)};
Note that this gives me a version warning, but it seems to work.
Hope this does the trick. However, still look into the next part of my answer as it may help you even more.
For calculations it is usually not required to save to disk after every iteration. An easy way to get a speedup (at the cost of a little more risk) is to save only after every n trials.
Like this for example:
maxTrial = 99;
saveEvery = 10;
for trial = 1:maxTrial
    myFun; % do your calculations here
    if trial == maxTrial || mod(trial, saveEvery) == 0
        save % put your save command here
    end
end
If your data is always at (or within) a certain size, you can also choose to store it in a matrix rather than a cell array; then you can use indexing to save only part of the file, as sketched below.
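A minimal sketch of that idea, assuming the fixed 1000-by-19 trial size from the question; the counts, file name, and variable names are illustrative:
nTargets = 8; nTrials = 100;           % hypothetical counts
data = zeros(1000, 19, nTrials, nTargets);
save('data.mat', 'data', '-v7.3');     % the -v7.3 format is required by matfile
m = matfile('data.mat', 'Writable', true);
for target = 1:nTargets
    for trial = 1:nTrials
        newTrialData = rand(1000, 19); % stand-in for one trial's data
        % Update only this slice on disk. Explicit ranges are used because
        % matfile indexing is more restrictive than normal indexing.
        m.data(1:1000, 1:19, trial, target) = newTrialData;
    end
end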
In response to @Luis I will post another way to deal with the situation.
It is indeed an option to save data in named variables or files, but saving a named variable in a named file seems too much.
If you only change the name of the file, you can save everything without using eval.
Assuming you are dealing with trial t:
filename = ['temp_save_file_' num2str(t)];
If you really want, you can use sprintf to zero-pad the number, writing it as 001 for example.
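A one-line sketch of that:
filename = sprintf('temp_save_file_%03d', t); % gives temp_save_file_001, temp_save_file_002, ...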
Now you can simply use this:
save(filename, 'myData')
To read the data back, initialize your total data:
totalData = {}; % initialize your total data
and then, inside a loop, construct each filename again and read the files as you wrote them:
load(filename)
totalData{t} = myData;

Inserting a Matlab Float Array into postgresql float[] column

I am using JDBC to access a postgresql database through Matlab, and have gotten hung up when trying to insert an array of values that I would rather store as an array instead of individual values. The Matlab code that I'm using is as follows:
insertCommand = 'INSERT INTO neuron (classifier_id, threshold, weights, neuron_num) VALUES (?,?,?,?)';
statementObject = dbHandle.prepareStatement(insertCommand);
statementObject.setObject(1,1);
statementObject.setObject(2,output_thresholds(1));
statementObject.setArray(3,dbHandle.createArrayOf('"float8"',outputnodes(1,:)));
statementObject.setObject(4,1);
statementObject.execute;
close(statementObject);
Everything functions properly except for the line dealing with Arrays. The object outputnodes is a <5x23> double matrix, so I'm attempting to put the first <1x23> into my table.
I've tried several different combinations of names and quotes for the '"float8"' part of the createArrayOf call, but I always get this error:
??? Java exception occurred:
org.postgresql.util.PSQLException: Unable to find server array type for provided name "float8".
at org.postgresql.jdbc4.AbstractJdbc4Connection.createArrayOf(AbstractJdbc4Connection.java:82)
at org.postgresql.jdbc4.Jdbc4Connection.createArrayOf(Jdbc4Connection.java:19)
Error in ==> Databasetest at 22
statementObject.setArray(3,dbHandle.createArrayOf('"float8"',outputnodes(1,:)));
Performance of JDBC connector for arrays
I'd like to note that when you have to export rather big volumes of data containing arrays, JDBC may not be the best choice. First, its performance degrades due to the overhead of converting native MATLAB arrays into org.postgresql.jdbc.PgArray objects. Second, this may lead to a shortage of Java heap memory (and simply increasing the Java heap size may not be a panacea). Both points can be seen in the following picture, illustrating the performance of the datainsert method from the MATLAB Database Toolbox (which works with PostgreSQL exactly through a direct JDBC connection):
The blue graph displays the performance of the batchParamExec command from the PgMex library (see https://pgmex.alliedtesting.com/#batchparamexec for details). The endpoint of the red graph corresponds to a certain maximum data volume passed into the database by datainsert without any error; a data volume greater than that maximum causes an "out of Java heap memory" problem (the Java heap size is specified at the top of the figure). For further details of the experiments, please see the following paper with full benchmarking results for data insertion.
Example reworked
As can be seen, PgMex, being based on libpq (the official C application programming interface to PostgreSQL), has greater performance and is able to process volumes of at least 2 GB.
Using this library, your code can be rewritten as follows. We assume that all the parameters marked by <> are properly filled in, that the table neuron already exists in the database with fields classifier_id of type int4, threshold of float8, weights of float8[], and neuron_num of int4, and that the variables classifierIdVec, output_thresholds, outputnodes, and neuronNumVec are already defined as numerical arrays of the sizes shown in the comments in the code below. (If the types of the table fields are different, you need to fix the last command of the code accordingly.)
% Create the database connection
dbConn = com.allied.pgmex.pgmexec('connect',[...
'host=<yourhost> dbname=<yourdb> port=<yourport> '...
'user=<your_postgres_username> password=<your_postgres_password>']);
insertCommand = ['INSERT INTO neuron '...
'(classifier_id, threshold, weights, neuron_num) VALUES ($1,$2,$3,$4)'];
SData = struct();
SData.classifier_id = classifierIdVec(:); % [nTuples x 1]
SData.threshold = output_thresholds(:); % [nTuples x 1]
SData.weights = outputnodes; % [nTuples x nWeights]
SData.neuron_num = neuronNumVec; % [nTuples x 1]
com.allied.pgmex.pgmexec('batchParamExec',dbConn,insertCommand,...
'%int4 %float8 %float8[] %int4',SData);
It should be noted that outputnodes does not need to be cut along its rows into separate arrays, because the rows are all the same length. If the arrays for different tuples have different sizes, they must instead be passed as a column cell array, with each cell containing its own array for the corresponding tuple.
EDIT: Currently PgMex has free academic licensing.
I was getting confused by the documentation, which uses double quotes everywhere; MATLAB doesn't allow those, and using only single quotes actually resolved this. The correct line was:
statementObject.setArray(3,dbHandle.createArrayOf('float8',outputnodes(1,:)));
instead of
statementObject.setArray(3,dbHandle.createArrayOf('"float8"',outputnodes(1,:)));
I originally thought the problem was that the alias I was using for double precision was incorrect, but as Craig pointed out in the comment above, this isn't the case.
