Parsing NumPy arrays from pandas data frame cells

I'm rather new to Pandas, and I think I have messed up with my data files.
I have stored some pandas data frames to CSV files. The data frames contained NumPy arrays stored in a single column. I know that this is not recommended; however, because the arrays have an indefinite number of elements (varying row by row), I stored them in a single column, since managing column names and column order was getting tedious otherwise. Initially, my notion was that I would not need those arrays for my data analysis, because they contain raw data stored only for completeness. It was only later that I realized I would have to go back to the raw data to extract some relevant values. Luckily I saved it initially, but reading it back from the CSV files proved to be difficult.
Everything works fine as long as I have the original data frame, but when I read the data frame back from CSV, the columns that contain the arrays come back as strings instead of NumPy arrays.
I have used the converters option of pandas.read_csv together with numpy.fromstring and some regular expressions to parse the NumPy arrays from the strings. However, it is slow (the data frames contain approximately 400k rows).
So, preferably, I would like to convert the data once and save the data frames to a file format that maintains the NumPy arrays in the cells and can be read back directly as NumPy arrays. What would be the best file format to use if it is possible? Or what would be the best way to do it otherwise?
Your suggestions would be appreciated.
For completeness, here is my converter code:
import re
import numpy as np
import pandas as pd

def parseArray(s):
    # Strip the square brackets, turn runs of spaces into commas,
    # then let NumPy parse the comma-separated values.
    s = re.sub(r'\[', '', s)
    s = re.sub(r'\]', '', s)
    s = re.sub(r' +', ',', s)
    return np.fromstring(s, sep=',')

testruns = pd.read_csv("datafiles/parse_test.csv",
                       converters={'swarmBest': parseArray})
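For what it's worth, a leaner converter along the same lines may be a bit quicker, since it skips the regex passes (a sketch using the same numpy.fromstring approach):
def parseArrayFast(s):
    # Drop the enclosing brackets and parse the whitespace-separated floats.
    return np.fromstring(s.strip('[]'), sep=' ')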
Without the converter, the 'swarmBest' column is read back as a string:
'[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]'
With the converter I can do, for example:
testarray = testruns['swarmBest'][0]
print(testarray)
print(testarray[0])
Output:
[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]
1095.56629
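One format that does round-trip NumPy arrays stored in object cells is pandas' pickle support; a minimal sketch (with the caveat that pickle files are Python-specific and not guaranteed stable across pandas versions):
testruns.to_pickle("datafiles/parse_test.pkl")
restored = pd.read_pickle("datafiles/parse_test.pkl")
print(type(restored['swarmBest'][0]))  # <class 'numpy.ndarray'>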

Related

Working with CSV image data to perform CNN in Julia - format problem

I am trying to build a convolutional neural network on the MNIST sign language dataset. It is provided in CSV format where each row is one picture and there are 784 columns, each referring to a single pixel (the pictures are 28x28).
My problem is that in order to run the algorithm I need to convert my data to a different format, the same as the format of the built-in Fashion-MNIST dataset, which is:
Array{Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2},1}
I would like to end up with the following format, where my data is joined with the encoded labels:
Array{Tuple{Array{Float32,4},Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}},1}
I was trying to use the reshape function to convert it to a 4-dimensional array, but all I get is:
7172×28×28×1 Array{Float64,4}
My labels are in the following (correct) format:
25×7172 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}
I understand that the proper data format is an array of arrays, while my data is a simple array with 4 dimensions, but I can't figure out how to change that.
I am new to Julia and some of the code I am using has been written by someone else.

Setting one CSV as an array to compare data from another CSV

I am new to Python and have been overcomplicating the coding on a project, so I am starting with much smaller data sets in order to learn the process. My boss is having me compare two CSV files. The first CSV contains only the data 1,2,3,4,5,6, all in a single column. He wants me to set this CSV file as an array so I can compare the second CSV against it. The second CSV contains the data 3,5,6, all in a single column. The code should result in a printout of 1,2,4, since those are the values not found in both CSV files.
I originally tried to write code that imports both CSV files and compares the data without setting it as an array, but this did not work, so the first CSV file needs to be set as an array. The problem is I am not sure exactly how to do this. This is what I have so far; any help anyone could give me would be greatly appreciated. I have been working on this project for a week now and am at a total loss, even with this simplified form.
import csv

temp_list = []
with open('1.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        temp_list.append(row[0])  # collect the single-column values
In terms of pseudo-code, what you need to do here is import both CSV files into two separate arrays, Array A and Array B for example.
Now what you need to do is compare each index position in one array to each index position in the other array.
You need to create a nested loop, where the outer loop chooses an index position in A and the inner loop chooses a position in B.
After you check one index in A against each position in B, and no positions are the same, I suggest adding this value to a third array, C; you can track whether a match was found by using a boolean flag. When your code is done, C will hold any values that don't exist in both A and B. A sketch of this follows below.
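A minimal sketch of that approach, assuming each file holds one value per row and the second file is named '2.csv':
import csv

def read_column(path):
    # Read a single-column CSV into a flat list of values.
    with open(path, 'r', newline='') as f:
        return [row[0] for row in csv.reader(f) if row]

a = read_column('1.csv')
b = read_column('2.csv')

c = []
for value in a:
    found = False               # the boolean flag described above
    for other in b:
        if value == other:
            found = True
            break
    if not found:
        c.append(value)         # in A but not in B

print(c)  # for the example data: ['1', '2', '4']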
I suggest following these tutorials to learn more about Python syntax:
https://www.w3schools.com/python/
Good luck

Can you store and query compound data of Matlab arrays and structures in a database?

How do I store Matlab arrays located in a 'struct within struct within struct' into a database so that I can then retrieve the fields and arrays?
More detail on why I need this below:
I have tons of data saved as .mat files. The hassle is that I need to load a complete .mat file to begin manipulating and plotting the data in it. If that file is large, it becomes quite a task just to load it into memory.
These .mat files result from the analysis of raw electrical measurement data of transistors. All .mat files have the same structure, but each file corresponds to a different and unique transistor.
Now say I want to compare a certain parameter across all transistors that are common to A and B: I have to manually search for and load all the .mat files I need and then try to do the comparison. There is no simple way to merge all of these .mat files into a single .mat file (since they all have the same variable names but different data). Even if that were possible, there is no way I know of to query specific entries from .mat files.
I do not see a way of doing that easily without a structured database from which I can query specific entries. Then I could use any programming language (continue with Matlab or switch to Python) to conveniently do the comparison, plotting, etc. without the hassle of the scattered .mat files.
The problem is that the data in the .mat files are structured in structs and large arrays. From what I know, storing that in a simple SQL database is not a straightforward task. I looked at using HDF5, but from the examples I saw, I would have to use a lot of low-level commands to store those structs in an HDF file, and I am not sure whether I can load parts of the HDF file into Matlab/Python or whether I also have to load the whole file into memory first.
The goal here is to merge all existing (and to-be-created) .mat files (with their compound data structure of structs and arrays) into a single database file from which I can query specific entries. Is there a database solution that can preserve the structure of my complex data? Is HDF the way to go? Or is there a simple solution I am missing?
EDIT:
Example on data I need to save and retrieve:
All(16).rf.SS(3,2).data
Where All is an array of structs with 7 fields. Each rf field is a struct containing arrays, integers, strings and structs. One of those structs is named SS, which in turn is an array of structs, each containing a 2x2 array named data.
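On the partial-loading question: from Python at least, h5py can read a slice of a dataset without loading the whole file into memory; a minimal sketch (file and dataset names are hypothetical):
import h5py

with h5py.File('transistors.h5', 'r') as f:
    # Only the requested slice is read from disk.
    subset = f['/measurements/ss_data'][:100]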
Merge .mat files into one data structure
In general it's not correct that "there is no simple way to merge ... .mat files into a single .mat file (since they all have the same variable names but with different data)".
Let's say you have two files, data1.mat and data2.mat and each one contains two variables, a and b. You can do:
>> s = load('data1')
s =
struct with fields:
a: 'foo'
b: 3
>> s(2) = load('data2')
s =
1×2 struct array with fields:
a
b
Now you have a struct array (see note below). You can access the data in it like this:
>> s(1).a
ans =
'foo'
>> s(2).a
ans =
'bar'
But you can also get all the values at once for each field, as a comma-separated list, which you can assign to a cell array or matrix:
>> s.a
ans =
'foo'
ans =
'bar'
>> allAs = {s.a}
allAs =
1×2 cell array
{'foo'} {'bar'}
>> allBs = [s.b]
allBs =
3 4
Note: Annoyingly, it seems you have to create the struct with the correct fields before you can assign to it using indexing. In other words
s = struct;
s(1) = load('data1')
won't work, but
s = struct('a', [], 'b', [])
s(1) = load('data1')
is OK.
Build an index to the .mat files
If you don't need to be able to search on all of the data in each .mat file, just certain fields, you could build an index in MATLAB containing just the relevant metadata from each .mat file plus a reference (e.g. filename) to the file itself. This is less robust as a long-term solution as you have to make sure the index is kept in sync with the files, but should be less work to set up.
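If you end up building that index from Python instead, a rough sketch of the same idea (the 'name' metadata field is hypothetical):
import glob
from scipy.io import loadmat

index = []
for path in glob.glob('data/*.mat'):
    contents = loadmat(path, squeeze_me=True, struct_as_record=False)
    # Record just the searchable metadata plus a reference to the file.
    index.append({'file': path, 'device': contents['All'][0].name})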
Flatten the data structure into a database-compatible table
If you really want to keep everything in a database, then you can convert your data structure into a tabular form where any multi-dimensional elements such as structs or arrays are 'flattened' into a table row with one scalar value per (suitably-named) table variable.
For example if you have a struct s with fields s.a and s.b, and s.b is a 2 x 2 matrix, you might call the variables s_a, s_b_1_1, s_b_1_2, s_b_2_1 and s_b_2_2 - probably not the ideal database design, but you get the idea.
You should be able to adapt the code in this answer and/or the MATLAB File Exchange submissions flattenstruct2cell and flatten-nested-cell-arrays to suit your needs.
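If you do end up in Python for this, a minimal sketch of that flattening idea (a hypothetical helper, just to illustrate the naming scheme):
def flatten(prefix, value, out):
    # Recursively flatten dicts (struct-like) and nested lists (array-like)
    # into scalar entries with path-encoded names such as s_b_1_1.
    if isinstance(value, dict):
        for key, sub in value.items():
            flatten(prefix + '_' + key, sub, out)
    elif isinstance(value, (list, tuple)):
        for i, sub in enumerate(value, start=1):
            flatten(prefix + '_' + str(i), sub, out)
    else:
        out[prefix] = value
    return out

row = flatten('s', {'a': 1.5, 'b': [[1, 2], [3, 4]]}, {})
# row == {'s_a': 1.5, 's_b_1_1': 1, 's_b_1_2': 2, 's_b_2_1': 3, 's_b_2_2': 4}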

MATLAB strings in arrays

I know that I am pretty confused about arrays and strings and have tried a bunch of things, but I am still stumped. I have groups of data that I am pulling into various arrays. For example, I have site locations coming from one source. Numerous cores can be at a single location, and the cores can have multiple depths. So I am pulling all this data together in various ways and pushing it out into a single Excel file for each core. I create a filename based on location id, core name and the year the core was sampled, so it might look like 'ID_14_CORE_Bu-2-MT-1991.xlsx', and I store it in a variable called "filename" for use with an xlswrite statement. This is all working fine.
But now I want to keep track of which files I have created and when I created them in another Excel file. So I was trying to store the location, the filename and the date it was processed in some sort of array, so that I can use xlswrite to push it all out after I have processed all the locations/cores/layers that might occur in the original input files.
As I start the program and look at the original input files, I can figure out how many cores I have, so I wanted to create some sort of array to hold the location, filename and date together. I have tried to use a cell array (a = cell(numcores,3)), but that does not seem to work. I think what is happening is that the filename is actually a character array, so each of its letters is being assigned to a separate cell instead of just the cell in the second column.
I have also had problems trying to push the three values out to the summary Excel file as each core is processed, but MATLAB tends to treat one-dimensional arrays as a row rather than a column, so I am kind of confused there.
Below is what I want the array to end up like... but since I am developing the filename on the fly, this seems to be more challenging.
ArraytoExcel = ["14", "ID_14_CORE_Bu-2-MT-1991.xlsx", "1/1/2018";
                "14", "ID_14_CORE_Bu-3-MT-1991.xlsx", "1/1/2018";
                "13", "ID_13_CORE_Tail_33-1992.xlsx", "1/1/2018"]
Maybe I am just going about this the wrong way. Any suggestions would help.
Your question is a little confusing, but I think you want to do something like the following. The variables inside my example are static, but from your question it sounds like you have already figured out how to determine them.
numcores = 5;  % ... or however you determine what you are processing
ArraytoExcel = cell(numcores, 3);
for ii = 1:numcores
    % These 3 things will need to be determined by you in the loop
    % and not be static like in this example.
    coreID = '14';
    filename = 'ID_14_CORE_Bu-2-MT-1991.xlsx';
    dataProc = datestr(now, 'mm/dd/yyyy');
    ArraytoExcel(ii,:) = {coreID, filename, dataProc};
end
xlswrite('YourOutput.xls', ArraytoExcel)

Shuffle Dask array chunks from hdf5 file

I have a very large array stored in an HDF5 file. I am trying to load it and manage it as a Dask array.
At the moment my challenge is that I need to shuffle this array from time to time in a process; shuffling an array bigger than memory is a challenge in itself.
So what I am trying to do, without success, is to shuffle the Dask array's chunks.
# Prepare data
import h5py
import dask.array as da

f = h5py.File('Data.hdf5', 'r')
dset = f['/Data']
dk_array = da.from_array(dset, chunks=dset.chunks)
So, given the context above, how can I shuffle the chunks?
If your array is tabular in nature, then you might consider adding a column of random data (see da.concatenate and da.random), turning it into a dask.dataframe, and setting that column as the index; a sketch follows below.
As a warning, this will be somewhat slow as it will need to do an on-disk shuffle.
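A minimal sketch of that approach, assuming dk_array is a 2-D floating-point array whose rows are the records to shuffle:
import dask.array as da
import dask.dataframe as dd

# Append a column of random keys, then let set_index shuffle by that key.
nrows = dk_array.shape[0]
keys = da.random.random((nrows, 1), chunks=(dk_array.chunks[0], 1))
combined = da.concatenate([dk_array, keys], axis=1)

ddf = dd.from_dask_array(combined)
shuffled = ddf.set_index(ddf.columns[-1])  # performs the on-disk shuffle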
