How to handle large files in Python? - arrays

I am new to Python. I previously asked another question: How to arrange three lists in such a way that rows with a greater sum of corresponding elements appear first? Now the problem is the following:
I am working with a large text file, which has 419040 rows and 6 columns containing floats. I am taking the first 3 columns to generate those three lists, so each of the lists I am actually working with has 419040 entries. While I was running the Python code to extract the three columns into three lists, the Python shell stopped responding; I suspect the large number of entries is the cause. I used this code:
file = open("file_location", "r")
a = []
b = []
c = []
for line in file:
    x = line.split(" ")
    a.append(float(x[0]))
    b.append(float(x[1]))
    c.append(float(x[2]))
Note: for small files this code runs perfectly.
To avoid this problem I am using the following code:
import numpy as np
a = []
b = []
c = []
a,b,c = np.genfromtxt('file_location',usecols = [0,1,2], unpack=True)
But when I run the code given in the answers to my previous question, the same problem occurs. So what would be the corresponding code using numpy? Or is there any other solution?

If you're going to use numpy, then I suggest using ndarrays rather than lists. You can use loadtxt since you don't have to handle missing data. I assume it'll also be faster.
a = np.loadtxt('file.txt', usecols=(0, 1, 2))
a is now a two-dimensional array, stored as an np.ndarray datatype. It should look like:
>>> a
array([[  1,  20, 400],
       [  5,  30, 500],
       [  3,  50, 100],
       [  2,  40, 300],
       [  4,  10, 200]])
However, you now need to re-do what you did in the previous question, but using numpy arrays rather than lists. This can be easily achieved like so:
>>> b = a.sum(axis=1)
>>> b
array([421, 535, 153, 342, 214])
>>> i = np.argsort(b)[::-1]
>>> i
array([1, 0, 3, 4, 2])
>>> a[i, :]
array([[  5,  30, 500],
       [  1,  20, 400],
       [  2,  40, 300],
       [  4,  10, 200],
       [  3,  50, 100]])
The steps involved are described in a little greater detail here.
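Putting the pieces together, here's a minimal end-to-end sketch; the filename 'file.txt' is a placeholder for your actual file location:
import numpy as np

# load only the first three columns as one (N, 3) float array
a = np.loadtxt('file.txt', usecols=(0, 1, 2))

# order the rows so those with the largest row-sum come first
order = np.argsort(a.sum(axis=1))[::-1]
a_sorted = a[order, :]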

Related

extract blocks of columns (as separated subarrays) indicated by 1D binary array

Based on a 1D binary mask, for example np.array([0,0,0,1,1,1,0,0,1,1,0]), I would like to extract the columns of another array, indicated by the 1's in the binary mask, as sub-arrays/separate blocks, like [9, 3.5, 7] and [2.8, 9.1] (I am just making up the numbers to illustrate the point).
So far, what I have (again, just a demo to illustrate the goal, not the data this operation will be performed on):
arr = torch.from_numpy(np.array([0,0,0,1,1,1,0,0,1,1,0]))
split_idx = torch.where(torch.diff(arr) == 1)[0]+1
torch.tensor_split(arr, split_idx.tolist())
The output is:
(tensor([0, 0, 0]),
tensor([1, 1, 1]),
tensor([0, 0]),
tensor([1, 1]),
tensor([0]))
What I would like to have in the end is:
(tensor([1, 1, 1]),
tensor([1, 1]))
Do you know how to implement this, preferably in pytorch? numpy functions are also fine. A million thanks in advance!!
You can construct your tensor of slice indices with your approach. The only thing missing is the indices marking the end of each slice. You can do something like:
>>> slices = arr.diff().abs().nonzero().flatten() + 1
>>> slices
tensor([ 3,  6,  8, 10])
Then apply tensor_split and slice to keep only every other element:
>>> torch.tensor_split(arr, slices)[1::2]
(tensor([1, 1, 1]), tensor([1, 1]))
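For reference, the whole thing as a runnable sketch. Note that keeping [1::2] relies on the mask starting with a 0; if it started with a 1, the 1-blocks would sit at the even positions instead:
import numpy as np
import torch

arr = torch.from_numpy(np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]))

# boundaries where the mask flips 0 -> 1 or 1 -> 0
slices = arr.diff().abs().nonzero().flatten() + 1

# the mask starts with 0, so the 1-blocks are at the odd positions
blocks = torch.tensor_split(arr, slices)[1::2]
print(blocks)  # (tensor([1, 1, 1]), tensor([1, 1]))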

How do Python lists and Numpy arrays work?

How does an array work differently from a Python list?
a=[0,1,2,3,4,5,6,7,8,9]
b=2
a==b
gives
False
But
a=np.array([0,1,2,3,4,5,6,7,8,9])
b=2
a==b
gives
array([False, False, True, False, False, False, False, False, False, False])
This happens because the __eq__ method is defined differently on numpy arrays compared to default Python lists.
Numpy was designed for various purposes, mainly for data science usage, which makes this method definition a very useful (and very fast) choice.
In other words, np.array and lists are different animals. Which to use depends on what you're trying to achieve, although for some purposes it doesn't matter much, as they share some similarities.
First, you don't need to add a ; at the end of your lines in Python.
Then,
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = 2
a == b
will check whether a is equal to b as a whole object, meaning it's exactly the same (same type, and same content for a list); a list is never equal to an int, so the result here is False.
You can use:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = 2
b in a  # True if b is an element of a, False otherwise
for example. There are several such methods in Python (which you can read about here).
With a numpy array it's a bit different, because == between an np.array and an int/float will check, for each element of the array, whether the element equals the number, and will give you the result for every element. As mentioned by Kevin (in a comment), this is called broadcasting.
This will perform this calculation:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = 2
result_list = []
for value in a:
    result_list.append(b == value)
print(result_list)
which can be more useful in some cases. Don't forget that numpy, because its core is implemented in C, is faster than the pure-Python loop I wrote here (especially for large arrays/lists).
Numpy returns the equality element-wise, see here.
Python compares two lists element-wise internally but returns a single boolean: if it finds two corresponding elements that differ (or the lengths differ) it returns False, otherwise True. Comparing a list to an int returns False outright, since the types differ. See here, paragraph 5.8.
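A small sketch contrasting the two behaviours (nothing here beyond the standard library and numpy):
import numpy as np

# Lists: == yields one bool for the whole comparison
print([0, 1, 2] == [0, 1, 2])   # True  (same length, all elements equal)
print([0, 1, 2] == 2)           # False (different types)

# Arrays: == broadcasts and yields one bool per element
print(np.array([0, 1, 2]) == 2)  # [False False  True]

# Collapse an element-wise result back to a single bool
print((np.array([0, 1, 2]) == np.array([0, 1, 2])).all())  # True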

Pythonic way to assign 3rd Dimension of Numpy array to 1D Array

I'm trying to flatten an image that's been converted to a 3D numpy array into three separate 1D arrays, representing RGB channels.
The image array is shaped (HEIGHT, WIDTH, RGB), and I've tried in vain to use both index slicing and unzipping to just return the 3rd dimension values.
Ideally, three separate arrays would represent the RGB channels.
Example:
print(image)
[
[ [56, 6, 3], [23, 32, 53], [27, 33, 56] ],
[ [57, 2, 3], [23, 246, 49], [29, 253, 58] ]
]
red_channel, green_channel, blue_channel = get_third(image)
print(red_channel)
[56, 23, 27, 57, 23, 29]
I've thought of just using a nested for loop to iterate over the first two dimensions and then add each RGB array to a list or whatnot, but it's my understanding that this would be both inefficient and a bit of an eyesore.
Thanks in advance!
EDIT
Clarification: By unzipping I mean using the star operator (*) within the zip function, like so:
zip(*image)
Also to clarify, I don't intend to retain the width and height; I essentially just want to flatten the array and return the values along the 3rd dimension.
red_channel, green_channel, blue_channel = np.transpose(np.reshape(image, (-1, 3)))
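A quick check against the example from the question: reshape collapses height and width into one axis of RGB triples, and transpose then separates the channels:
import numpy as np

image = np.array([
    [[56,  6,  3], [23,  32, 53], [27,  33, 56]],
    [[57,  2,  3], [23, 246, 49], [29, 253, 58]],
])

red_channel, green_channel, blue_channel = np.transpose(np.reshape(image, (-1, 3)))
print(red_channel)  # [56 23 27 57 23 29]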

Confusion with Fancy indexing (for non-fancy people)

Let's assume a multi-dimensional array
import numpy as np
foo = np.random.rand(102,43,35,51)
I know that the last two dimensions represent a 2D space (35, 51), in which I would like to index a range of rows of one column
Let's say I want to have rows 8 to 30 of column 0
From my understanding of indexing I should call
foo[0][0][8::30][0]
Knowing my data, though (unlike the random data used here), this is not what I expected.
I could try this, which does work but looks ridiculous:
foo[0][0][[8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],0]
Now from what I can find in this documentation I can also use
something like:
foo[0][0][[8,30],0]
which only gives me the values of rows 8 and 30
while this:
foo[0][0][[8::30],0]
gives an error
File "<ipython-input-568-cc49fe1424d1>", line 1
foo[0][0][[8::30],0]
^
SyntaxError: invalid syntax
I don't understand why the :: argument cannot be passed here. What, then, is the way to indicate a range in your indexing syntax?
So I guess my overall question is: what would be the proper pythonic equivalent of this syntax:
foo[0][0][[8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],0]
Instead of
foo[0][0][8::30][0]
try
foo[0, 0, 8:30, 0]
The foo[0][0] part is the same as foo[0, 0, :, :], selecting a 2d array (35 x 51). But foo[0][0][8::30] selects a subset of those rows.
Consider what happens when I use 0::30 on a 2d array:
In [490]: np.zeros((35,51))[0::30].shape
Out[490]: (2, 51)
In [491]: np.arange(35)[0::30]
Out[491]: array([ 0, 30])
The 30 is the step, not the stop value of the slice.
The last [0] then picks the first of those rows. The end result is the same as foo[0, 0, 0, :].
It is better, in most cases, to index multiple dimensions with the comma syntax. And if you want the first 30 rows use 0:30, not 0::30 (that's basic slicing notation, applicable to lists as well as arrays).
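A quick sanity check of the comma-style slice; note that 8:30 stops before row 30, so use 8:31 if you want row 30 included:
import numpy as np

foo = np.random.rand(102, 43, 35, 51)
print(foo[0, 0, 8:30, 0].shape)  # (22,)  rows 8..29 of column 0
print(foo[0, 0, 8:31, 0].shape)  # (23,)  rows 8..30 inclusive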
As for:
foo[0][0][[8::30],0]
simplify it to x[[8::30], 0]. The Python interpreter accepts [1:2:3, 0], translating it to the tuple (slice(1, 2, 3), 0) and passing it to a __getitem__ method. But the colon syntax is accepted only in a very specific context: here the interpreter is treating the inner set of brackets as a list, and colons are not accepted there.
foo[0,0,[1,2,3],0]
is ok, because the inner brackets are a list, and the numpy getitem can handle those.
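You can see exactly what the interpreter passes to __getitem__ with a tiny probe class (a throwaway illustration, not numpy code):
class Show:
    def __getitem__(self, key):
        return key

s = Show()
print(s[1:2:3, 0])      # (slice(1, 2, 3), 0)
print(s[[1, 2, 3], 0])  # ([1, 2, 3], 0)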
numpy has a tool for converting a slice notation into a list of numbers. Play with that if it is still confusing:
In [495]: np.r_[8:30]
Out[495]:
array([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29])
In [496]: np.r_[8::30]
Out[496]: array([0])
In [497]: np.r_[8:30:2]
Out[497]: array([ 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

Python 2.7: looping over 1D fibers in a multidimensional Numpy array

I am looking for a way to loop over 1D fibers (row, column, and multi-dimensional equivalents) along any dimension in a 3+-dimensional array.
In a 2D array this is fairly trivial since the fibers are rows and columns, so just saying for row in A gets the job done. But for 3D arrays for example, this expression iterates over 2D slices, not 1D fibers.
A working solution is the one below:
import numpy as np
A = np.arange(27).reshape((3,3,3))
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print func(A[fiber_index])
However, I am wondering whether there is something that is:
More idiomatic
Faster
Hope you can help!
I think you might be looking for numpy.apply_along_axis
In [10]: def my_func(x):
    ...:     return x**2 + x
In [11]: np.apply_along_axis(my_func, 2, A)
Out[11]:
array([[[  0,   2,   6],
        [ 12,  20,  30],
        [ 42,  56,  72]],

       [[ 90, 110, 132],
        [156, 182, 210],
        [240, 272, 306]],

       [[342, 380, 420],
        [462, 506, 552],
        [600, 650, 702]]])
Although many NumPy functions (including sum) have their own axis argument to specify which axis to use:
In [12]: np.sum(A, axis=2)
Out[12]:
array([[ 3, 12, 21],
[30, 39, 48],
[57, 66, 75]])
numpy provides a number of different ways of looping over 1 or more dimensions.
Your example:
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print fiber_index
    print A[fiber_index]
produces something like:
(0, 0)
[0 1 2]
(0, 1)
[3 4 5]
(0, 2)
[6 7 8]
...
This generates all index combinations over the first 2 dimensions, giving your function the 1D fiber on the last.
Look at the code for ndindex. It's instructive. I tried to extract its essence in https://stackoverflow.com/a/25097271/901925.
It uses as_strided to generate a dummy matrix over which an nditer iterates. It uses the 'multi_index' mode to generate an index set, rather than elements of that dummy. The iteration itself is done with a __next__ method. This is the same style of indexing that is currently used in numpy compiled code.
http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html
Iterating Over Arrays has good explanation, including an example of doing so in cython.
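Here is a stripped-down sketch of that idea, using only public numpy APIs (as_strided plus nditer's 'multi_index' mode):
import numpy as np
from numpy.lib.stride_tricks import as_strided

def my_ndindex(shape):
    # a dummy array that costs no memory: every stride is 0
    dummy = as_strided(np.zeros(1), shape=shape, strides=(0,) * len(shape))
    it = np.nditer(dummy, flags=['multi_index'])
    while not it.finished:
        yield it.multi_index
        it.iternext()

print(list(my_ndindex((2, 3))))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]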
Many functions, among them sum, max, product, let you specify which axis (axes) you want to iterate over. Your example, with sum, can be written as:
np.sum(A, axis=-1)
np.sum(A, axis=(1,2)) # sum over 2 axes
An equivalent is
np.add.reduce(A, axis=-1)
np.add is a ufunc, and reduce specifies an iteration mode. There are many other ufuncs, and other iteration modes - accumulate, reduceat. You can also define your own ufunc.
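A couple of quick illustrations of that machinery (all standard numpy calls):
import numpy as np

A = np.arange(27).reshape(3, 3, 3)

print(np.add.reduce(A, axis=-1))      # same as A.sum(axis=-1)
print(np.add.accumulate(A, axis=-1))  # running sums along the last axis
print(np.maximum.reduce(A, axis=-1))  # same as A.max(axis=-1)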
xnx suggests
np.apply_along_axis(np.sum, 2, A)
It's worth digging through apply_along_axis to see how it steps through the dimensions of A. In your example, it steps over all possible i,j in a while loop, calculating:
outarr[(i,j)] = np.sum(A[(i, j, slice(None))])
Including slice objects in the indexing tuple is a nice trick. Note that it edits a list, and then converts it to a tuple for indexing. That's because tuples are immutable.
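That trick in isolation: build the index as a mutable list, then convert it to a tuple for the actual indexing:
import numpy as np

A = np.arange(27).reshape(3, 3, 3)

idx = [0, 0, slice(None)]  # mutable while you assemble it
print(A[tuple(idx)])       # same as A[0, 0, :] -> [0 1 2]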
Your iteration can be applied along any axis by rolling that axis to the end. This is a 'cheap' operation since it just changes the strides.
def with_ndindex(A, func, ax=-1):
    # apply func along axis ax
    A = np.rollaxis(A, ax, A.ndim)  # roll ax to end (changes strides)
    shape = A.shape[:-1]
    B = np.empty(shape, dtype=A.dtype)
    for ii in np.ndindex(shape):
        B[ii] = func(A[ii])
    return B
I did some timings on 3x3x3, 10x10x10 and 100x100x100 A arrays. This np.ndindex approach is consistently a third faster than the apply_along_axis approach. Direct use of np.sum(A, -1) is much faster.
So if func is limited to operating on a 1D fiber (unlike sum), then the ndindex approach is a good choice.
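Example usage, checking the helper against the built-in reduction:
A = np.arange(27).reshape(3, 3, 3)

print(with_ndindex(A, np.sum, ax=0))
print(np.sum(A, axis=0))  # same result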
