Working with CSV image data to perform CNN in Julia - format problem - arrays

I am trying to build a convolutional neural network on the MNIST sign language dataset. It is provided in CSV format, where each row is one picture and each of the 784 columns refers to a single pixel (the pictures are 28x28).
My problem is that in order to run the algorithm I need to convert my data to a different format, the same format as the built-in Fashion MNIST ML dataset, which is:
Array{Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2},1}
I would like to end up with the following format, where my data is joined with the encoded labels:
Array{Tuple{Array{Float32,4},Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}},1}
I tried to use the reshape function to convert it to a 4-dimensional array, but all I get is:
7172×28×28×1 Array{Float64,4}
My labels are in the following (correct) format:
25×7172 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}
I understand that somehow the proper data format is an array in an array while my data is a simple array with 4 dimensions, but I can't figure out how to change that.
I am new to Julia and some of the code I am using has been written by someone else.

Related

Parsing NumPy arrays from pandas data frame cells

I'm rather new to Pandas, and I think I have messed up with my data files.
I have stored some pandas data frames to CSV files. The data frames contained NumPy arrays stored in a single column. I know that it is not recommended to do so. However, because the arrays have an indefinite number of elements (varying row by row), I stored them in a single column. Column names and column order were getting a bit tedious otherwise. Initially, my notion was that I would not need those arrays for my data analysis because they contain raw data just stored for completeness. It was only later that I realized that I would have to go back to the raw data to extract some relevant data. Lucky for me that I saved it initially, but reading it back from the CSV files proved to be difficult.
Everything works fine, as long as I have the original data frame, but when I read the data frame back from CSV, the columns that contain the arrays are read back as strings instead of NumPy arrays.
I have used the converters option of pandas.read_csv and the numpy.fromstring function with some regular expressions to parse the NumPy arrays from the strings. However, it is slow (the data frames contain approx. 400k rows).
So, preferably, I would like to convert the data once and save the data frames to a file format that maintains the NumPy arrays in the cells and can be read back directly as NumPy arrays. What would be the best file format to use if it is possible? Or what would be the best way to do it otherwise?
Your suggestions would be appreciated.
For completeness, here is my converter code:
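import re
import numpy as np
import pandas as pd

def parseArray(s):
    s = re.sub(r'\[', '', s)          # drop the opening bracket
    s = re.sub(r'\]', '', s)          # drop the closing bracket
    s = re.sub(r' +', ',', s)         # turn runs of spaces into commas
    s = np.fromstring(s, sep=',')     # parse the comma-separated text
    return s

testruns = pd.read_csv("datafiles/parse_test.csv", converters={'swarmBest': parseArray})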
Without the converter, the 'swarmBest' column is read back as a string:
'[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]'
With the converter I can do for example:
testarray = testruns['swarmBest'][0]
print(testarray)
print(testarray[0])
Output:
[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]
1095.56629
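As a side note on the string format shown above, here is a minimal sketch of parsing it with plain string methods instead of regular expressions. The function name parse_bracketed_floats is made up for illustration, and it returns a plain list rather than a NumPy array. Whether it is actually faster on 400k rows would have to be measured.

```python
def parse_bracketed_floats(text):
    """Parse a string like '[1.0 2.5 3.0]' into a list of floats.

    Hypothetical helper: assumes values are separated by whitespace
    and wrapped in a single pair of square brackets.
    """
    inner = text.strip().strip('[]')
    # str.split() with no argument splits on any run of whitespace
    return [float(token) for token in inner.split()]


sample = '[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]'
parsed = parse_bracketed_floats(sample)
print(parsed[0])
```

A function like this could be passed in the converters dictionary in place of parseArray, optionally wrapping the result in np.array.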

Creating a dummy MNIST dataset

I want to create a subset (dummy) of the MNIST dataset. I want to create it in a format similar to the one described on the official MNIST page (FILE FORMATS FOR THE MNIST DATABASE section in http://yann.lecun.com/exdb/mnist/). I want to add the magic number and the other dimensions for my dummy dataset.
I am not able to understand how to create the IDX binary format from the numpy arrays or CSV (MNIST images after extraction, from which I want to subset).
The PyPI package idx2numpy helped me to solve the problem. I converted the idx to numpy, took a subset of the data and then converted the subset back to idx format.
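For anyone who wants to see what the header looks like without a library, here is a rough sketch that writes the IDX layout by hand with the standard struct module. It assumes unsigned-byte pixels and a three-dimensional image array (count, rows, columns). The function names are invented for this example and are not part of idx2numpy.

```python
import struct


def encode_idx_ubyte(images):
    """Encode a list of 2-D pixel grids (values 0..255) as IDX bytes.

    Header layout: two zero bytes, a type code (0x08 = unsigned byte),
    the number of dimensions, then each dimension size as a big-endian
    32-bit unsigned integer.
    """
    count = len(images)
    rows = len(images[0])
    cols = len(images[0][0])
    header = struct.pack(">BBBB", 0, 0, 0x08, 3)
    header += struct.pack(">III", count, rows, cols)
    body = bytes(pixel for image in images for row in image for pixel in row)
    return header + body


def decode_idx_header(blob):
    """Return (type_code, dims) read back from IDX bytes."""
    _, _, type_code, ndim = struct.unpack(">BBBB", blob[:4])
    dims = struct.unpack(">" + "I" * ndim, blob[4:4 + 4 * ndim])
    return type_code, dims


# One dummy 2x2 "image" stands in for a real 28x28 MNIST picture
dummy_images = [[[0, 255], [128, 7]]]
blob = encode_idx_ubyte(dummy_images)
type_code, dims = decode_idx_header(blob)
```

The resulting bytes can be written to a file opened in 'wb' mode. A label file would use one dimension instead of three.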

Setting one CSV as an array to compare data from another CSV

I am new to Python and am overcomplicating the coding on a project, so I am starting with much smaller data sets in order to learn the process. My boss is having me compare two CSV files. The first CSV only contains the data 1,2,3,4,5,6 all in a single column. He wants me to set this CSV file as an array so I can compare the second CSV against it. The second CSV contains the data 3,5,6 all in a single column. The code should result in a printout of 1,2,4, as these are the only values not found in both CSV files.
I originally tried to write a code to import both CSV files and compare data without setting it as an array but this did not work so the first CSV file needs to be set as an array. The problem is I am not sure exactly how to do this with an array. This is what I have so far, any help anyone could give me would be greatly appreciated. I have been working on this project for a week now and am at a total loss, even with this simplified form.
import csv
temp_list = []
with open('1.csv','rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
In terms of pseudocode, what you need to do here is import both CSV files into two separate arrays, Array A and Array B for example.
Now what you need to do is compare each index position in one array, to each index position in the other array.
You need to create a nested loop, where the outer loop will choose an index position in A and then inner loop chooses a position in B.
After you check one index in A with each position in B, and no positions are the same, I suggest adding this value into a third array, C. You can check which positions are the same by using a boolean flag. When your code is done, C will have any values that don't exist in both A and B.
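The nested loop described above could look roughly like this. It is only a sketch: the helper names are made up, and in-memory text stands in for the two CSV files so the example runs on its own.

```python
import csv
import io


def read_column(handle):
    """Read the first column of a CSV file-like object into a list."""
    return [row[0] for row in csv.reader(handle) if row]


def values_missing_from(a, b):
    """Return the items of a that never appear in b (nested-loop version)."""
    c = []
    for item_a in a:              # outer loop picks a position in A
        found = False             # boolean flag for "seen in B"
        for item_b in b:          # inner loop scans every position in B
            if item_a == item_b:
                found = True
                break
        if not found:
            c.append(item_a)
    return c


# In-memory stand-ins for the two files; with real files, pass
# open('1.csv', newline='') and the second file's handle instead.
array_a = read_column(io.StringIO("1\n2\n3\n4\n5\n6\n"))
array_b = read_column(io.StringIO("3\n5\n6\n"))
array_c = values_missing_from(array_a, array_b)
print(",".join(array_c))
```

Note that the values are compared as strings here, since that is what the csv module returns.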
I suggest following these tutorials to learn more about python syntax:
https://www.w3schools.com/python/
Good luck

Why does pyserial read as b'number'

I want to save those values in file.txt. When the program saves them, it writes b'number'. I'd like to plot those values, but I can't with b'number'; I just want the number saved.
You need to convert your input data into numbers, because in your case pyserial is returning raw bytes rather than ready-made numeric values. Byte order also matters: you should specify whether it is 'big' or 'little' endian.
If you are using Python 3.x, your job can be done this way:
values = [int.from_bytes(binary_number, 'little') for binary_number in binary_data]
And plot your values.
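To make the byte-order point concrete, here is a small sketch with made-up sample bytes instead of a real serial port. It also shows the other common case, where the device sends the digits as ASCII text; which one applies depends on the device, so treat both as assumptions.

```python
# Case 1: the device sends raw binary integers (two bytes each here).
binary_data = [b"\x01\x00", b"\x2a\x00", b"\x00\x01"]
little_values = [int.from_bytes(chunk, "little") for chunk in binary_data]
big_values = [int.from_bytes(chunk, "big") for chunk in binary_data]

# Case 2: the device sends digits as text lines, e.g. b'12\r\n'.
text_lines = [b"12\r\n", b"7\r\n"]
text_values = [int(line.decode("ascii").strip()) for line in text_lines]

print(little_values, big_values, text_values)
```

Writing str(value) for each converted number to the file, one per line, gives plain numbers without the b'...' wrapper.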
Hope this helps.

Array multiplication in Excel

In my excel document I have two sheets. The first is a data set and the second is a matrix of the relationship between two of the variables in my data set. Each possibility of the variable is a column in my matrix. I'm trying to get the sum of the products of the elements in two different arrays. Right now I'm using the formula {=SUM(N3:N20 * F3:F20)} and manually changing the columns each time. But my data set is over 800 items...
Ideally I'd like to know how to write a program that reads the value of the variable in my dataset, looks up the correct columns in the matrix, multiplies them together, sums the products, and puts the result in the correct place in my data set. However, just knowing the result for all the possible combinations of columns would also save me a lot of time. It's an 18x18 matrix. Thanks for any feedback!
Your question is a little ambiguous, but as far as I understand it, you want to multiply different pairs of columns in the same sheet and put the results into the next sheet. Is that so? If so, please post images of your work (all sheets). This is possible in Excel alone, without any VBA code. Thanks.
You can also use =SUMPRODUCT(N3:N20,F3:F20) for your formula instead of {=SUM(N3:N20 * F3:F20)}
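If stepping outside Excel is an option, the "all possible combinations of columns" part can be sketched in a few lines of Python. This assumes the matrix has been exported as a list of rows. The function names are invented for illustration, and a tiny 2x3 grid stands in for the real 18x18 matrix.

```python
def sumproduct(xs, ys):
    """Same idea as Excel's SUMPRODUCT for two equal-length ranges."""
    return sum(x * y for x, y in zip(xs, ys))


def all_column_sumproducts(rows):
    """Return a table where entry [i][j] is SUMPRODUCT(column i, column j)."""
    columns = list(zip(*rows))    # transpose the rows into columns
    return [[sumproduct(ci, cj) for cj in columns] for ci in columns]


# Small stand-in matrix: 2 rows, 3 columns
rows = [
    [1, 2, 3],
    [4, 5, 6],
]
table = all_column_sumproducts(rows)
for line in table:
    print(line)
```

The resulting table is symmetric, so each pair of columns only needs to be looked up once.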

Resources