I am using Matlab for some data collection, and I want to save the data after each trial (just in case something goes wrong). The data is organized as a cell array of cell arrays, basically in the format
data{target}{trial} = zeros(1000,19)
But the actual data gets up to >150 MB by the end of the collection, so saving everything after each trial becomes prohibitively slow.
So now I am considering the matfile approach (http://www.mathworks.de/de/help/matlab/ref/matfile.html), which would allow me to save only parts of the data. The problem: matfile doesn't support cells of cell arrays, which means I couldn't change/update the data for a single trial; I would have to re-save the entire target's data (100 trials).
So, my question:
Is there another different method I can use to save parts of the cell array to speed up saving?
(OR)
Is there a better way to format my data that would work with this saving process?
A not very elegant but possibly effective solution is to use trial as part of the variable name. That is, use not a cell array of cell arrays (data{target}{trial}), but just different cell arrays such as data_1{target}, data_2{target}, where 1, 2 are the values of the trial counter.
You could do that with eval: for example
trial = 1; % change this value in a for loop
eval([ 'data_' num2str(trial) '{target} = zeros(1000,19);']); % fill data_1{target}
You can then save the data for each trial in a different file. For example, this
eval([ 'save temp_save_file_' num2str(trial) ' data_' num2str(trial)])
saves data_1 in file temp_save_file_1, etc.
Update:
Actually, it does appear to be possible to index into cell arrays with matfile, just not into cells nested inside cell arrays. Hence, if you store your data slightly differently, it seems you can use matfile to update only part of it. See this example:
x = cell(3,4);
save x;
matObj = matfile('x.mat','Writable',true);
matObj.x(3,4) = {eye(10)};
Note that this gives me a version warning, but it seems to work.
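The warning most likely appears because plain save creates an older-format MAT-file, while matfile is designed for the v7.3 (HDF5-based) format that supports efficient partial reading and writing. Creating the file in v7.3 format should avoid it:
save('x.mat', 'x', '-v7.3');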
Hope this does the trick. However, still look into the next part of my answer as it may help you even more.
For calculations it is usually not required to save to disk after every iteration. An easy way to get a speedup (at the cost of a little more risk) is to save only after every n trials.
Like this for example:
maxTrial = 99;
saveEvery = 10;
for trial = 1:maxTrial
    myFun; % Do your calculations here
    if trial == maxTrial || mod(trial, saveEvery) == 0
        save % Put your save command here
    end
end
If your data is always at (or within) a certain size, you can also choose to store your data in a matrix rather than a cell array; then you can use indexing to save only part of the file.
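For example, if every trial is a fixed 1000-by-19 matrix, you could keep each target's trials in one 3-D array and update a single trial in place through matfile. A minimal sketch (file and variable names are illustrative; the question mentions 100 trials per target):
matObj = matfile('target_01.mat', 'Writable', true);
matObj.trials = zeros(1000, 19, 100);        % write the full array once up front
% ... after each trial:
matObj.trials(:, :, trial) = newTrialData;   % only this slice is written to disk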
In response to @Luis, I will post another way to deal with the situation.
It is indeed an option to save data in named variables or files, but saving a named variable in a named file seems like overkill.
If you only change the name of the file, you can save everything without using eval:
assuming you are dealing with trial 't':
filename = ['temp_save_file_' num2str(t)];
If you really want, you can zero-pad the trial number so it reads 001, for example.
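For example, sprintf handles the concatenation and the zero-padding in one call:
filename = sprintf('temp_save_file_%03d', t);  % gives temp_save_file_001 for t = 1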
Now you can simply use this:
save(filename, 'myData') % note: save takes the variable's name as a string
To read the data back, construct the filename again and do something like this:
totalData = {}; %Initialize your total data
And then read them as you wrote them (inside a loop):
load(filename)
totalData{t} = myData;
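Putting both halves together, a minimal sketch (runTrial is a hypothetical stand-in for your data collection; names are illustrative):
nTrials = 100;                         % assumed trial count
for t = 1:nTrials
    myData = runTrial(t);              % hypothetical: produces this trial's data
    filename = ['temp_save_file_' num2str(t)];
    save(filename, 'myData');
end
% Later, read everything back in the same order:
totalData = cell(1, nTrials);
for t = 1:nTrials
    filename = ['temp_save_file_' num2str(t)];
    S = load(filename);                % load into a struct to avoid name clashes
    totalData{t} = S.myData;
end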
I know that I am pretty confused about arrays and strings and have tried a bunch of things, but I am still stumped. I have groups of data that I am pulling into various arrays. For example, I have site locations coming from one source. Numerous cores can be at a single location, and the cores can have multiple depths. So I am pulling all this data together in various ways and pushing it out into a single Excel file for each core. I create a filename based on location id, core name, and the year the core was sampled, so it might look like 'ID_14_CORE_Bu-2-MT-1991.xlsx', and I am storing it to use with an xlswrite statement in a variable called "filename". This is all working fine.
But now I want to keep track of what files I have created and when I created them in another EXCEL file. So I was trying to store the location, filename and the date it was processed into some sort of array so that I can use the xlswrite statement to push it all out after I have processed all the locations/cores/layers that might occur in the original input files.
As I start the program and look in the original input files I can figure out how many cores I have so I wanted to create some sort of array to hold the location, filename and the date together. I have tried to use a cell array (a = cell(numcores,3)) but that does not seem to work. I think I am understanding that the filename is actually a string array so each of the letters is trying to be assigned to a separate cell instead of just the cell in the second column.
I also have had problems trying to push the three values out to the summary EXCEL file as each core is being processed but MATLAB tends to treat single dimensional arrays as a row rather than a column so I am kind of confused there.
Below is what I want the array to end up like... but since I am developing the filename on the fly, this seems more challenging.
ArraytoExcel = ["14", "ID_14_CORE_Bu-2-MT-1991.xlsx", "1/1/2018";
                "14", "ID_14_CORE_Bu-3-MT-1991.xlsx", "1/1/2018";
                "13", "ID_13_CORE_Tail_33-1992.xlsx", "1/1/2018"]
Maybe I am just going about this the wrong way. Any suggestions would help.
Your question is a little confusing, but I think you want to do something like the following. The variables in my example are static, but from your question it sounds like you have already figured out how to determine them.
numcores = 5; % ... or however you determine what you are processing
ArraytoExcel = cell(numcores, 3);
for ii = 1:numcores
    % These 3 things will need to be determined by you in the loop
    % and not be static like in this example.
    coreID = '14';
    filename = 'ID_14_CORE_Bu-2-MT-1991.xlsx';
    dataProc = datestr(now, 'mm/dd/yyyy');
    ArraytoExcel(ii,:) = {coreID, filename, dataProc};
end
xlswrite('YourOutput.xls',ArraytoExcel)
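As a side note, on R2019a and newer, writecell is the recommended replacement for xlswrite and accepts the same cell array:
writecell(ArraytoExcel, 'YourOutput.xls')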
I still have problems when using arrays of undefined size in Modelica.
It would be really helpful to understand the underlying problem.
I have read a lot (known size at compile time, functions, constructors and destructors, records and so on), but I am still not sure what the proper way to use flexible arrays in Modelica is.
Can somebody give a good explanation, maybe using the following example?
Given data:
x y
1 7
2 1
3 1
4 7
Modelica model which works fine:
model Test_FixedMatrix
  parameter Real A[4,2] = [1,7;2,1;3,1;4,7];
  parameter Integer nA = size(A,1);
  parameter Integer iA[:] = Modelica.Math.BooleanVectors.index({A[i,2] == 1 for i in 1:nA});
  parameter Real Aslice[:,:] = A[iA,:];
end Test_FixedMatrix;
The array iA[:] gives the indices of all rows having y = 1, and Aslice[:,:] collects all these rows in one matrix.
Now instead of using a given matrix, I want to read the data from a file. Using the Modelica library ExternData (https://github.com/tbeu/ExternData) it is possible to read the data from an xlsx-file:
model Test_MatrixFromFile
  ExternData.XLSXFile xlsxfile(fileName="data/ExampleData.xlsx");
  parameter Real B[:,:] = xlsxfile.getRealArray2D("A2","Tabelle1",4,2);
  parameter Integer nB = size(B,1);
  parameter Integer iB[:] = Modelica.Math.BooleanVectors.index({B[i,2] == 1 for i in 1:nB});
  parameter Real Bslice[:,:] = B[iB,:];
end Test_MatrixFromFile;
The xlsx-file contains the same 4×2 data as the matrix above, read from sheet Tabelle1 starting at cell A2.
It's exactly the same code, but this time I get an error message.
(If I delete the two lines with iB and Bslice it works, so the problem is not the xlsxfile.getRealArray2D function.)
Application example and further questions:
It would be really nice to be able to just read a data file into Modelica and then prepare the data directly within the Modelica code, without using any other software. Of course I can prepare all the data using e.g. a Python script and then use TimeTable components, but sometimes this is not really convenient.
Is there another way to read in data (txt or csv) without specifying the size of the table in it?
How would it look like, if I want to read-in data from different files e.g. every 10 seconds?
I also found the Modelica functions readFile and countLines in Modelica.Utilities.Streams. Is it maybe only possible to do this using external functions?
I assume the fundamental issue is that Modelica.Math.BooleanVectors.index returns a vector of variable size. The size depends on the input of the function, which in the given case comes from the xlsx file. Therefore the size of iB depends on the content of the xlsx file, which can change independently of the model itself. This is exactly the "known size at compile time" problem you originally mentioned: the size of iB depends on something that is not a parameter.
Another thing that I learned from my experience with Dymola is that it behaves differently if functions like ExternData.XLSXFile are translated before usage. This means that you need to open the function in Dymola (double-clicking it in the package browser) and press the translate button (or F9). This will generate the respective .exe file in your working directory and make Dymola more "confident" in outputs of the function being parameters rather than variables. I don't know the details of the implementations and therefore can't give exact details, but it is worth a try.
I have a very simple problem, but I am wondering if there is a simpler way to solve it (there must be).
I have a matrix which is 10 by 10 and contains doubles. I need to create a time series from those data points.
The way I am doing it is as follows: I create a 3D array whose third dimension is time, and every day I add the new data to the array by increasing the time dimension by one.
Here is the code:
TS_updated = zeros(size(TS_Current)+[0,0,1]);
TS_updated(:,:,1:end-1) = TS_Current;
TS_updated(:,:,end) = TS_New;
where TS_Current is the existing 3D array representing the time series and TS_New is the new data from today, which I need to append to the time series.
Is there a quicker way to append the last element, as with a 2D array:
TS_updated = [TS_Current;TS_New];
Or even maybe a smarter way to store the time series?
You can also use
TS(:,:,end+1) = TS_new;
And you might also want to preallocate if you intend to extend the series more often than once per day: you can start with any length and double the allocated space whenever that limit is reached, as in the sketch below.
There is no clearly better way of arranging the data that I can see. You might flatten it to 100-by-Time instead of 10-by-10-by-Time, but whether that helps depends on how you use the data.
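A minimal sketch of that doubling strategy (names are illustrative):
capacity = 64;                  % initial guess; any starting length works
TS = zeros(10, 10, capacity);
nDays = 0;
% each day, when TS_New arrives:
nDays = nDays + 1;
if nDays > capacity
    capacity = 2*capacity;
    TS(10, 10, capacity) = 0;   % grow once; MATLAB zero-pads the new slices
end
TS(:, :, nDays) = TS_New;
% read back with TS(:, :, 1:nDays)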
Use the cat function (documentation) in the third dimension:
TS_updated = cat(3, TS_Current, TS_New);
You could include error checking first by using
% Check that dimensions 1 and 2 are consistent first
if size(TS_Current,1) == size(TS_New,1) && size(TS_Current,2) == size(TS_New,2)
    % Now concatenate along the third dimension
    TS_updated = cat(3, TS_Current, TS_New);
else
    error('New time series has incorrect dimensions')
end
You want to concatenate in the 3rd dimension?
A = ones(3,3,2);
B = rand(3,3);
C = cat(3, A, B)
I have a variable that is created by a loop. The variable is large enough and in a complicated enough form that I want to save the variable each time it comes out of the loop with a different name.
PM25 is my variable, but I want to save it as PM25_year, in which the year changes based on str = fname(13:end).
PM25 = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]); % Reshape and permute to achieve the right shape. Each face of the 3D should be one day
str = fname(13:end); % The year
% Third dimension is organized so that the data for each site is on a face
save('PM25_str', 'PM25_Daily_US.mat', '-append')
The str would be a year, like 2008. So the variable saved would be PM25_2008, then PM25_2009, etc. as it is created.
Defining new variables based on data isn't considered best practice, but you can store your data more efficiently using a cell array. You can store even a large, complicated variable like your PM25 variable within a single cell. Here's how you could go about doing it:
Place your PM25 data for each year into the cell array C using your loop:
for i = 1:numberOfYears
    % ... compute this year's PM25 here, as in your existing loop ...
    C{i} = PM25;
end
Resulting in something like this:
C = { PM25_2005, PM25_2006, PM25_2007 };
Now let's say you want to obtain your variable for the year 2006. This is easy (assuming you aren't skipping years). The first year of your data will correspond to position 1, the second year to position 2, etc. So to find the index of the year you want:
minYear = 2005;
yearDesired = 2006;
index = yearDesired - minYear + 1;
PM25_2006 = C{index};
You can do this using eval, but note that it's often not considered good practice. eval may be a security risk, as it allows user input to be executed as code. A better way to do this may be to use a cell array or an array of objects.
That said, I think this will do what you want:
for year = 2008:2014
    eval(sprintf('PM25_%d = permute(reshape(E'',[c,r/nlay,nlay]),[2,1,3]);', year));
    save('PM25_Daily_US.mat', sprintf('PM25_%d', year), '-append');
end
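If you would rather avoid eval altogether, a sketch using dynamic field names and the -struct option of save produces the same file contents (this assumes E, c, r, and nlay are in scope, as in the question):
S = struct();
for year = 2008:2014
    PM25 = permute(reshape(E', [c, r/nlay, nlay]), [2, 1, 3]);
    S.(sprintf('PM25_%d', year)) = PM25;    % the field name carries the year
end
save('PM25_Daily_US.mat', '-struct', 'S');  % each field is saved as its own variable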
I do not recommend setting variables like this, since there is no way to track them and it completely prevents the error checking that MATLAB does beforehand; this kind of code is handled entirely at runtime.
Anyway, in case you have a really good reason for doing this, I recommend the function assignin:
assignin('caller', ['myvar',num2str(1)], 63);
If I want to store some strings or matrices of different sizes in a single variable, I can think of two options: I could make a struct array and have one of the fields hold the data,
structArray(structIndex).structField
or I could use a cell array,
cellArray{cellIndex}
but is there a general rule-of-thumb of when to use which data structure? I'd like to know if there are downsides to using one or the other in certain situations.
In my opinion it's more a matter of convenience and code clarity. Ask yourself whether you would prefer to refer to your variable's elements by number or by name; then use a cell array in the former case and a struct array in the latter. Think of it as the difference between a table with and without headers.
By the way, you can easily convert between structures and cells with the CELL2STRUCT and STRUCT2CELL functions.
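A quick round trip with those two functions:
s = struct('a', 1, 'b', 2);
c = struct2cell(s);            % 2x1 cell: {1; 2}
f = fieldnames(s);
s2 = cell2struct(c, f, 1);     % rebuilds the original struct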
If you use it for computation within a function, I suggest you use cell arrays, since they're more convenient to handle, thanks e.g. to CELLFUN.
However, if you use it to store data (and return output), it's better to return structures, since the field names are (should be) self-documenting, so you don't need to remember what information you had in column 7 of your cell array. Also, you can easily include a field 'help' in your structure where you can put some additional explanation of the fields, if necessary.
Structures are also useful for data storage since you can, if you want to update your code at a later date, replace them with objects without needing to change your code (at least if you pre-assigned your structure). They have the same syntax, but objects will allow you to add more functionality, such as dependent properties (i.e. properties that are calculated on the fly based on other properties).
Finally, note that cells and structures add a few bytes of overhead to every field. Thus, if you want to use them to handle large amounts of data, you're much better off to use structures/cells containing arrays, rather than having large arrays of structures/cells where the fields/elements only contain scalars.
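You can see that overhead yourself with whos; a small sketch (exact byte counts vary by MATLAB release):
aos = struct('x', num2cell(1:1e5));  % array of 100000 structs, one scalar each
soa = struct('x', 1:1e5);            % one struct holding a single 1x100000 array
whos aos soa                         % aos needs far more memory for the same data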
This code suggests that cell arrays may be roughly twice as fast as structs for assignment and retrieval. I did not separate the two operations. One could easily modify the code to do that.
Running "whos" afterwards suggests that they use very similar amounts of memory.
My goal was to make a "list of lists" in python terminology. Perhaps an "array of arrays".
I hope this is interesting/useful!
%%%%%%%%%%%%%% StructVsCell.m %%%%%%%%%%%%%%%
clear all
M = 100;  % number of repetitions
N = 2^10; % size of cell array and struct
cell_time = zeros(1, M);
struct_time = zeros(1, M);
for m = 1:M
    % Fill up a template cell array with
    % randomly sized matrices containing
    % random elements.
    template{N} = 0; % grow the cell array to its full size up front
    for n = 1:N
        r1 = round(24*rand());
        r2 = round(24*rand());
        r3 = rand(round(r2*rand()), round(r1*rand()));
        template{n} = r3;
    end
    % Make a cell array equivalent to the template.
    cell_array = template;
    % Create a struct with the same data.
    structure = struct('data', 0);
    for n = 1:N
        structure(n).data = template{n};
    end
    % Time cell array assignment and retrieval.
    tic;
    for n = 1:N
        data = cell_array{n};
        cell_array{n} = data';
    end
    cell_time(m) = toc;
    % Time struct assignment and retrieval.
    tic;
    for n = 1:N
        data = structure(n).data;
        structure(n).data = data';
    end
    struct_time(m) = toc;
end
str = sprintf('cell array: %0.4f', mean(cell_time));
disp(str);
str = sprintf('struct: %0.4f', mean(struct_time));
disp(str);
str = sprintf('struct_time / cell_time: %0.4f', mean(struct_time)/mean(cell_time));
disp(str);
% Check memory use
whos
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
First and foremost, I second yuk's answer. Clarity is generally more important in the long run.
However, you may have two more options depending on how irregularly shaped your data is:
Option 3: structScalar.structField(fieldIndex)
Option 4: structScalar.structField{cellIndex}
Among the four, #3 has the least memory overhead for large numbers of elements (it minimizes the total number of matrices), and by large numbers I mean >100,000. If your code lends itself to vectorizing on structField, it is probably a performance win, too. If you can't collect each element of structField into a single matrix, option 4 has the notational benefits without the memory & performance advantages of option 3. Both of these options make it easier to use arrayfun or cellfun on the entire dataset, at the expense of requiring you to add or remove elements from each field individually. The choice depends on how you use your data, which brings us back to yuk's answer -- choose what makes for the clearest code.
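For concreteness, minimal sketches of options 3 and 4 (sizes and names are illustrative):
% Option 3: scalar struct, field holds one big matrix (lowest overhead, vectorizable)
s3.structField = zeros(100000, 1);
s3.structField(42) = 3.14;           % numeric indexing into a single array
% Option 4: scalar struct, field holds a cell array (irregular element sizes allowed)
s4.structField = cell(100000, 1);
s4.structField{42} = 'a string';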