Load text from a CSV file with int and float columns into an ndarray

I have a CSV file as input:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
It has a mix of int and float columns.
When I tried to import the file using numpy.loadtxt, I got a 2D array with every column as float.
r = np.loadtxt(open("text.csv", "rb"), delimiter=",", skiprows=0)
The output I received looks like:
array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],
[ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],
[ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],
...,
[ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],
[ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],
[ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])
This is almost perfect: a 2D array with each row as a list instead of a tuple.
But looking at the datatypes, every column is treated as float, which is not correct.
What I am asking: is there any way I can get output like this?
Desired output
array([[ 6 , 148 , 72 , ..., 0.627, 50 , 1 ],
[ 1 , 85 , 66 , ..., 0.351, 31 , 0 ],
[ 8 , 183 , 64 , ..., 0.672, 32 , 1 ],
...,
[ 5 , 121 , 72 , ..., 0.245, 30 , 0 ],
[ 1 , 126 , 60 , ..., 0.349, 47 , 1 ],
[ 1 , 93 , 70 , ..., 0.315, 23 , 0 ]])
I did try this approach:
r = np.loadtxt(open("F:/idm/compressed/ANN-CI1/Diabetes.csv", "rb"), delimiter=",", skiprows=0, dtype=[('f0',int),('f1',int),('f2',int),('f3',int),('f4',int),('f5',float),('f6',float),('f7',int),('f8',int)])
Output
array([( 6, 148, 72, 35, 0, 33.6, 0.627, 50, 1),
( 1, 85, 66, 29, 0, 26.6, 0.351, 31, 0),
( 8, 183, 64, 0, 0, 23.3, 0.672, 32, 1),
( 1, 89, 66, 23, 94, 28.1, 0.167, 21, 0),
...,
( 1, 126, 60, 0, 0, 30.1, 0.349, 47, 1),
( 1, 93, 70, 31, 0, 30.4, 0.315, 23, 0)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4','<i4'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<i4'), ('f8', '<i4')])
Here you can see the dtype solves the type problem, but now the array is not in the form I require:
[[col1,col2,...,coln],] instead of [(col1,col2,...,coln),]
Thanks
------------------EDIT------------------------
The reason I am asking: I am giving this 2D array as input to my binary classification network. When all values are int and in [[ ]] shape it converges, but in the current case with mixed types the output is either 0. or 1., with very high learning error.
See https://github.com/naitikshukla/MachineLearning/blob/master/neural/demo_ann.py for the complete code.
In the input section, if I comment out my current input and uncomment lines 69-88, then the output will be both 0 and 1.
So I wanted to change the input to the correct datatypes and see if that solves my issue.
There are very good explanations below for why this is not possible; I will look for a workaround and see if I can use the current input for training and prediction.

It's impossible to create a NumPy array like [[col1,col2,...,coln],] that contains different types of values.
A NumPy array is homogeneous; in other words, it contains only values of a single type.
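In fact, handing np.array a mix of ints and floats silently upcasts everything to float, as a quick check shows:
import numpy as np
np.array([1, 2.5, 3]).dtype   # dtype('float64') -- the ints were promoted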
In [31]: from io import StringIO
In [32]: sio = StringIO('''6,148,72,35,0,33.6,0.627,50,1
...: 1,85,66,29,0,26.6,0.351,31,0
...: 8,183,64,0,0,23.3,0.672,32,1
...: 1,89,66,23,94,28.1,0.167,21,0''')
In [33]: r = np.loadtxt(sio, delimiter=",", skiprows=0)
In [34]: r.shape
Out[34]: (4, 9)
In [41]: r.dtype
Out[41]: dtype('float64')
The line above creates a 2D array of floats, and its shape is 4x9.
In [36]: r = np.loadtxt(sio, delimiter=",", skiprows=0, dtype=[('f0',int),('f1'
...: ,int),('f2',int),('f3',int),('f4',int),('f5',float),('f6',float),('f7'
...: ,int),('f8',int)])
In [38]: r.shape
Out[38]: (4,)
In [45]: r.dtype
Out[45]: dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<i4'), ('f8', '<i4')])
This code creates a 1D structured array. Each element of the array is a structure containing 9 items. It is still homogeneous: every element shares the same compound dtype.
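As a possible workaround for the edit above, the structured array can be flattened back into a plain 2D [[ ]] array, necessarily with a single dtype, so the int fields become exact floats. A minimal sketch, assuming a NumPy recent enough to have numpy.lib.recfunctions.structured_to_unstructured:
import numpy as np
import numpy.lib.recfunctions as rfn

r = np.loadtxt("text.csv", delimiter=",",
               dtype=[('f0', int), ('f1', int), ('f2', int), ('f3', int),
                      ('f4', int), ('f5', float), ('f6', float),
                      ('f7', int), ('f8', int)])
r2d = rfn.structured_to_unstructured(r, dtype=np.float64)
# r2d has shape (n_rows, 9); the int fields are now floats, but their
# values are exact, so r2d[:, :5].astype(int) would recover them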

In the first case you get a 2D array of floats. In the second, a 1D array with a structured dtype, a mix of ints and floats. What were columns in the first are now named fields. The structured records are displayed with () instead of [].
Both forms are valid and useful. It just depends on what you need to do.
The structured form is more useful when some of the fields are strings or other things that don't fit the integer/float pattern. Usually you can work with the integers as floats without any loss of functionality.
What exactly is wrong with the first case, the all floats? Which is most important - named columns, or ranges of columns (e.g. 0:5, 5:8)?


How to generate a numpy array with random values that are all different from each other

I've just started to learn about Python libraries and today I got stuck at a numpy exercise.
Let's say I want to generate a 5x5 random array whose values are all different from each other. Further, its values have to range from 0 to 100.
I have already looked this up but found no suitable solution to my problem. Please see the posts I consulted before turning to you:
Numpy: Get random set of rows from 2D array
Numpy Random 2D Array
Here is my attempt to solve the exercise:
import numpy as np
M = np.random.randint(1, 101, size=25)
for x in M:
    for y in M:
        if x in M == y in M:
            M = np.random.randint(1, 101, size=25)
print(M)
By doing this, all I get is a value error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thus, my second attempt has been the following:
M = np.random.randint(1, 101, size=25)
a = M[x]
b = M[y]
for a.any in M:
    for b.any in M:
        if a.any in M == b.any in M:
            M = np.random.randint(1, 101, size=25)
print(M)
Once again, I got an error: AttributeError: 'numpy.int64' object attribute 'any' is read-only.
Could you please let me know what I am doing wrong? I've spent days going over this but nothing else comes to mind :(
Thank you so much!
Not sure if this will be ok for all your needs, but it will work for your example:
np.random.choice(np.arange(100, dtype=np.int32), size=(5, 5), replace=False)
You can use
np.random.random((5,5))
to generate an array of random numbers from 0 to 1, with shape (5,5).
Then just multiply by 100 to get the numbers between 0 and 100:
100*np.random.random((5,5))
Your code gives an error because of this line:
if x in M == y in M:
That syntax doesn't work. And M == y is a comparison between an array and a number, which is why you get ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
You already defined x and y as elements of M in the for loops, so just write:
if x == y:
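Putting that fix together, a corrected version of the regenerate-until-unique idea might look like this (a sketch that compares positions rather than values' membership, so equal values at different positions count as duplicates):
import numpy as np

def has_duplicates(M):
    # compare every pair of distinct positions
    for i in range(M.size):
        for j in range(i + 1, M.size):
            if M[i] == M[j]:
                return True
    return False

M = np.random.randint(1, 101, size=25)
while has_duplicates(M):          # redraw until all 25 values differ
    M = np.random.randint(1, 101, size=25)
print(M.reshape(5, 5))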
While I think choice without replacement is the easy way to go, here's a way implementing your approach. I use np.unique to test whether all the numbers are different.
In [158]: M = np.random.randint(1,101, size=25)
In [159]: x = np.unique(M)
In [160]: x.shape
Out[160]: (23,)
In [161]: cnt = 1
     ...: while x.shape[0] < 25:
     ...:     M = np.random.randint(1,101,size=25)
     ...:     x = np.unique(M)
     ...:     cnt += 1
     ...:
In [162]: cnt
Out[162]: 34
In [163]: M
Out[163]:
array([ 70, 76, 27, 98, 81, 92, 97, 38, 49, 7, 2, 55, 85,
89, 32, 51, 20, 100, 9, 91, 53, 3, 11, 63, 21])
Note that I had to generate 34 random sets before I got a unique one. That is a lot of work!
unique works by sorting the array, and then looking for adjacent duplicates.
A numpy whole-array approach to testing whether there are any duplicates is:
For this unique set:
In [165]: (M[:,None]==M).sum(axis=0)
Out[165]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
but for another randint array:
In [166]: M1 = np.random.randint(1,101, size=25)
In [168]: (M1[:,None]==M1).sum(axis=0)
Out[168]:
array([1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1])
This M[:,None]==M is roughly the equivalent of
for x in M:
    for y in M:
        test x == y
I'm using sum to count how many True values there are in each column. any just identifies if there is a True.
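So a single boolean test for any duplicate, following on from M1 above, could be:
((M1[:, None] == M1).sum(axis=0) > 1).any()   # True for M1, False for the unique M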

Numpy Sum Rows of 2D Array uniquely (no sequence duplicates)

I have the following array
import numpy as np
single_array =
[[ 1 80 80 80]
[ 2 80 80 89]
[ 3 52 50 90]
[ 4 39 34 54]
[ 5 37 47 32]
[ 6 42 42 27]
[ 7 42 52 27]
[ 8 38 33 28]
[ 9 42 37 42]]
and want to create another array with all unique sums of 2 rows within this single_array so that 1+2 and 2+1 are treated as duplicates and are only included once.
First I would like to update the 0th column of the array to multiply each value by 10 (so I can identify which pair of rows each sum came from), then I want to add up every 2 rows and append them to the new array.
Output should look like this:
double_array=
[[12 160 160 169]
[13 132 130 170]
[14 119 114 134]
...
[98 80 70 70]]
Can I use itertools.combinations to get a 3D array with two unique combinations and then add the rows on the corresponding 3rd axis?
This
import numpy as np
from itertools import combinations
single_array = np.array(
[[ 1, 80, 80, 80],
[ 2, 80, 80, 89],
[ 3, 52, 50, 90],
[ 4, 39, 34, 54],
[ 5, 37, 47, 32],
[ 6, 42, 42, 27],
[ 7, 42, 52, 27],
[ 8, 38, 33, 28],
[ 9, 42, 37, 42]]
)
np.vstack([single_array[i] * np.array([10, 1, 1, 1]) + single_array[j]
for i, j in combinations(range(single_array.shape[0]), 2)])
does what you ask for in terms of specified input and output; I'm not sure if it's what you actually need. I don't think it will scale to big inputs.
A 3D array to find this sum would be ragged (first "layer" would be 9 deep, next one 8, etc.); you could maybe get around this with NaNs or masking. It also wouldn't scale that well for big inputs: you'd be allocating twice as much memory as you need, and then have to index out ragged layers to get your final output.
If you have to do this fast for big arrays, I suggest a pre-allocated output array and a for-loop with Numba:
from numba import jit

@jit(nopython=True)
def unique_row_sums(a):
    n = a.shape[0]
    b = np.empty((n*(n-1)//2, a.shape[1]))
    s = np.array([10, 1, 1, 1])
    k = 0
    for i in range(n):
        for j in range(i+1, n):
            b[k] = s * a[i] + a[j]
            k += 1
    return b
In my not-too-careful testing with IPython's %timeit, this took about 4µs versus 152µs for the itertools-based version with your data, and should scale better.
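For reference, usage would be something like this, assuming single_array as defined above:
double_array = unique_row_sums(single_array)
# double_array[0] is [ 12. 160. 160. 169.], matching the desired output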

How to convert 1D numpy array (made using .genfromtxt() method) to a 2D array?

I am new to numpy and I am trying to generate an array from a CSV file. I was informed that the .genfromtxt method works well for generating an array and automatically detecting and ascribing dtypes. The function seemingly did this without flaws until I checked the shape of the array.
import numpy as np
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=",", dtype = None, names = True)
taxi.shape
[out]: (89560,)
I believe this shows me that my dataset is now a 1D array. The tutorial I am working on in class has a final result of taxi.shape as (89560, 15), but they used a long, tedious for loop and then converted certain columns to floats. I want to learn a more efficient way.
The first few lines of the array are
array([(2016, 1, 1, 5, 0, 2, 4, 21. , 2037, 52. , 0.8, 5.54, 11.65, 69.99, 1),
(2016, 1, 1, 5, 0, 2, 1, 16.29, 1520, 45. , 1.3, 0. , 8. , 54.3 , 1),
(2016, 1, 1, 5, 0, 2, 6, 12.7 , 1462, 36.5, 1.3, 0. , 0. , 37.8 , 2),
(2016, 1, 1, 5, 0, 2, 6, 8.7 , 1210, 26. , 1.3, 0. , 5.46, 32.76, 1),
(2016, 1, 1, 5, 0, 2, 6, 5.56, 759, 17.5, 1.3, 0. , 0. , 18.8 , 2),
(2016, 1, 1, 5, 0, 4, 2, 21.45, 2004, 52. , 0.8, 0. , 52.8 , 105.6 , 1),
(2016, 1, 1, 5, 0, 2, 6, 8.45, 927, 24.5, 1.3, 0. , 6.45, 32.25, 1),
(2016, 1, 1, 5, 0, 2, 6, 7.3 , 731, 21.5, 1.3, 0. , 0. , 22.8 , 2),
(2016, 1, 1, 5, 0, 2, 5, 36.3 , 2562, 109.5, 0.8, 11.08, 10. , 131.38, 1),
(2016, 1, 1, 5, 0, 6, 2, 12.46, 1351, 36. , 1.3, 0. , 0. , 37.3 , 2)],
So I can see from the results that each row has 15 comma separations (i.e. 15 columns), but the shape tells me that there are only 89560 rows and no columns. Am I reading this wrong? Is there a way I can transform the shape of my taxi array to reflect the true number of columns (i.e. 15) as they are in the CSV file?
Any and all help is appreciated
You can use this function to convert your structured array to an unstructured one with your desired data type (assuming all fields are of the same data type; if not, keeping it structured is better):
import numpy.lib.recfunctions as rfn
taxi = rfn.structured_to_unstructured(taxi, dtype=np.float64)
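A small self-contained check of what this does (hypothetical data, not the taxi file):
import numpy as np
import numpy.lib.recfunctions as rfn

s = np.array([(1, 2.5), (3, 4.5)], dtype=[('a', 'i4'), ('b', 'f8')])
u = rfn.structured_to_unstructured(s, dtype=np.float64)
print(u.shape)   # (2, 2) -- a plain 2D float array, one column per field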

Efficient data sifting for unique values (Python)

I have a 2D Numpy array that consists of (X,Y,Z,A) values, where (X,Y,Z) are Cartesian coordinates in 3D space, and A is some value at that location. As an example..
__X__|__Y__|__Z__|__A__
 13  |  7  | 21  | 1.5
  9  |  2  |  7  | 0.5
 15  |  3  |  9  | 1.1
 13  |  7  | 21  | 0.9
 13  |  7  | 21  | 1.7
 15  |  3  |  9  | 1.1
Is there an efficient way to find all the unique combinations of (X,Y), and add their values? For example, the total for (13,7) would be (1.5+0.9+1.7), or 4.1.
scipy.sparse matrix takes this kind of information, but for just 2d
sparse.coo_matrix((data, (row, col)))
where row and col are indices like your X,Y and Z. It sums duplicates.
The first step to doing that is a lexical sort of the indices. That puts points with matching coordinates next to each other.
The actual grouping and summing is done, I believe, in compiled code. Part of the difficulty in doing this fast in numpy terms is that there will be a variable number of elements in each group. Some will be unique, others might have 3 or more.
Python's itertools has a groupby tool. Pandas also has grouping functions. I can also imagine using a defaultdict to group and sum values.
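For instance, a defaultdict version of the grouping (a sketch, using the sample table above) could be:
from collections import defaultdict
import numpy as np

data = np.array([[13, 7, 21, 1.5],
                 [ 9, 2,  7, 0.5],
                 [15, 3,  9, 1.1],
                 [13, 7, 21, 0.9],
                 [13, 7, 21, 1.7],
                 [15, 3,  9, 1.1]])

totals = defaultdict(float)
for x, y, z, a in data:
    totals[(x, y)] += a       # group on (X, Y), summing A
# totals[(13.0, 7.0)] -> 4.1  (1.5 + 0.9 + 1.7)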
The ufunc reduceat might also work, though it's easier to use in 1d than 2 or 3.
If you are ignoring the Z, the sparse coo_matrix approach may be easiest.
In [1]: from scipy import sparse
In [2]: X=np.array([13,9,15,13,13,15])
In [3]: Y=np.array([7,2,3,7,7,3])
In [4]: A=np.array([1.5,0.5,1.1,0.9,1.7,1.1])
In [5]: M=sparse.coo_matrix((A,(X,Y)))
In [15]: M.sum_duplicates()
In [16]: M.data
Out[16]: array([ 0.5, 2.2, 4.1])
In [17]: M.row
Out[17]: array([ 9, 15, 13])
In [18]: M.col
Out[18]: array([2, 3, 7])
In [19]: M
Out[19]:
<16x8 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
Here's what I had in mind with lexsort
In [32]: Z=np.array([21,7,9,21,21,9])
In [33]: xyz=np.stack([X,Y,Z],1)
In [34]: idx=np.lexsort([X,Y,Z])
In [35]: idx
Out[35]: array([1, 2, 5, 0, 3, 4], dtype=int32)
In [36]: xyz[idx,:]
Out[36]:
array([[ 9, 2, 7],
[15, 3, 9],
[15, 3, 9],
[13, 7, 21],
[13, 7, 21],
[13, 7, 21]])
In [37]: A[idx]
Out[37]: array([ 0.5, 1.1, 1.1, 1.5, 0.9, 1.7])
When sorted like this it becomes more evident that the Z coordinate is 'redundant', at least for this purpose.
Using reduceat to sum groups:
In [40]: np.add.reduceat(A[idx],[0,1,3])
Out[40]: array([ 0.5, 2.2, 4.1])
(for now I just eyeballed the [0,1,3] list)
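Those group-start indices could also be computed rather than eyeballed; one sketch, working on the lexsorted coordinates:
srt = xyz[idx, :]
# a new group starts at 0 and wherever a sorted row differs from the previous one
starts = np.concatenate(([0], np.nonzero((srt[1:] != srt[:-1]).any(axis=1))[0] + 1))
np.add.reduceat(A[idx], starts)    # array([ 0.5, 2.2, 4.1])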
Approach #1
Get a view of each row, thus converting each row into a single scalar, then use np.unique to tag each row with an ID in the range 0..n-1 (with n as the number of unique rows), and finally use np.bincount to sum the last column within each ID group.
Here's the implementation -
def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def groupby_cols_view(x):
    a = x[:,:2].astype(int)
    a1D = get_row_view(a)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]
Approach #2
Same as approach #1, but instead of working with views, we generate an equivalent linear index for each row, thus reducing each row to a scalar. The rest of the workflow is the same as in the first approach.
The implementation -
def groupby_cols_linearindex(x):
    a = x[:,:2].astype(int)
    a1D = a[:,0] + a[:,1]*(a[:,0].max() - a[:,1].min() + 1)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]
Sample runs
In [80]: data
Out[80]:
array([[ 2. , 5. , 1. , 0.40756048],
[ 3. , 4. , 6. , 0.78945661],
[ 1. , 3. , 0. , 0.03943097],
[ 2. , 5. , 7. , 0.43663582],
[ 4. , 5. , 0. , 0.14919507],
[ 1. , 3. , 3. , 0.03680583],
[ 1. , 4. , 8. , 0.36504428],
[ 3. , 4. , 2. , 0.8598825 ]])
In [81]: groupby_cols_view(data)
Out[81]:
array([[ 1. , 3. , 0.0762368 ],
[ 1. , 4. , 0.36504428],
[ 2. , 5. , 0.8441963 ],
[ 3. , 4. , 1.64933911],
[ 4. , 5. , 0.14919507]])
In [82]: groupby_cols_linearindex(data)
Out[82]:
array([[ 1. , 3. , 0.0762368 ],
[ 1. , 4. , 0.36504428],
[ 3. , 4. , 1.64933911],
[ 2. , 5. , 0.8441963 ],
[ 4. , 5. , 0.14919507]])

Easy way of printing two numpy arrays with each element in a different line?

Let's say I have a 1D numpy array x and another one y = x ** 2.
I am looking for an easier alternative to
for i in range(x.size):
    print(x[i], y[i])
With one array one can do print(*x, sep = '\n') which is easier than a for loop. I'm thinking of something like converting x and y to arrays of strings and then adding them up into an array z and then using print(*z, sep = '\n'). However, I tried to do that but numpy gives an error when the add operation is performed.
Edit: This is the function I use for this
def to_str(*args):
    return '\n'.join([' '.join([str(ls[i]) for ls in args])
                      for i in range(len(args[0]))]) + '\n'
>>> x = np.arange(10)
>>> y = x ** 2
>>> print(to_str(x,y))
0 0
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
>>>
or if something quick and dirty is enough:
print(np.array((x,y)).T)
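As an aside, the string-conversion idea from the question can be made to work: plain + on NumPy string arrays is what raises the error, but np.char.add does elementwise concatenation. A sketch:
z = np.char.add(np.char.add(x.astype(str), ' '), y.astype(str))
print(*z, sep='\n')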
You could do something along these lines -
# Input arrays
In [238]: x
Out[238]: array([14, 85, 79, 89, 41])
In [239]: y
Out[239]: array([13, 79, 13, 79, 11])
# Join arrays with " "
In [240]: z = [" ".join(item) for item in np.column_stack((x,y)).astype(str)]
# Finally print it
In [241]: print(*z, sep='\n')
14 13
85 79
79 13
89 79
41 11
# Original approach for printing
In [242]: for i in range(x.size):
     ...:     print(x[i], y[i])
     ...:
14 13
85 79
79 13
89 79
41 11
To make things a bit more compact, np.column_stack((x,y)) could be replaced by np.vstack((x,y)).T.
There are a few other methods to create z, as listed below -
z = [str(i)[1:-1] for i in zip(x,y)] # Prints commas between elems
z = [str(i)[1:-1] for i in np.column_stack((x,y))]
z = [str(i)[1:-1] for i in np.vstack((x,y)).T]
Here is one way without loop:
print(np.array2string(np.column_stack((x, y)),separator=',').replace(' [ ','').replace('],', '').strip('[ ]'))
Demo:
In [86]: x
Out[86]: array([0, 1, 2, 3, 4])
In [87]: y
Out[87]: array([ 0, 1, 4, 9, 16])
In [85]: print(np.array2string(np.column_stack((x, y)),separator=',').replace(' [ ','').replace('],', '').strip('[ ]'))
0, 0
1, 1
2, 4
3, 9
4,16
There are 2 issues - combining the 2 arrays, and printing the result
In [1]: a = np.arange(4)
In [2]: b = a**2
In [3]: ab = [a,b] # join arrays in a simple list
In [4]: ab
Out[4]: [array([0, 1, 2, 3]), array([0, 1, 4, 9])]
In [6]: list(zip(*ab)) # 'transpose' that list
Out[6]: [(0, 0), (1, 1), (2, 4), (3, 9)]
That zip(*) is a useful tool or idiom.
I could use your print(*a, sep...) method with this
In [11]: print(*list(zip(*ab)), sep='\n')
(0, 0)
(1, 1)
(2, 4)
(3, 9)
Using sep is a neat py3 trick, but is rarely used. I'm not even sure how to do the equivalent with the older py2 print statement.
But if we convert the list of arrays into a 2d array we have more options.
In [12]: arr = np.array(ab)
In [13]: arr
Out[13]:
array([[0, 1, 2, 3],
[0, 1, 4, 9]])
In [14]: np.vstack(ab) # does the same thing
Out[14]:
array([[0, 1, 2, 3],
[0, 1, 4, 9]])
For simply looking at the 2 arrays together this arr is quite useful. And if the lines get too long, transpose it:
In [15]: arr.T
Out[15]:
array([[0, 0],
[1, 1],
[2, 4],
[3, 9]])
In [16]: print(arr.T)
[[0 0]
[1 1]
[2 4]
[3 9]]
Note that the array print format is different from that for nested lists. That's intentional.
The brackets seldom get in the way of understanding the display; they even help when the array becomes 3D or higher.
For printing a file that can be read by other programs, np.savetxt is quite useful. It lets me specify the delimiter, and the format for each column or line.
In [17]: np.savetxt('test.csv', arr.T, delimiter=',',fmt='%10d')
In ipython I can look at the file with a simple system call:
In [18]: cat test.csv
0, 0
1, 1
2, 4
3, 9
I can omit the delimiter parameter.
I can reload it with loadtxt
In [20]: np.loadtxt('test.csv',delimiter=',',dtype=int)
Out[20]:
array([[0, 0],
[1, 1],
[2, 4],
[3, 9]])
In Py3 it is hard to write savetxt to the screen. It operates on a byte string file, and sys.stdout is unicode. In Py2 np.savetxt(sys.stdout, ...) might work.
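One workaround in Py3, assuming the byte-stream behavior just described, is to hand savetxt the underlying buffer:
import sys
import numpy as np

arr = np.array([[0, 1, 2, 3], [0, 1, 4, 9]])
np.savetxt(sys.stdout.buffer, arr.T, delimiter=',', fmt='%10d')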
savetxt is not sophisticated. In this example, it is essentially doing a fwrite equivalent of:
In [21]: for row in arr.T:
    ...:     print('%10d,%10d' % tuple(row))
    ...:
0, 0
1, 1
2, 4
3, 9
