import numpy as np
import pandas as pd

df = pd.DataFrame({
    'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
    'odd1': pd.Series([1, 1, 2, 1, 2]),
    'odd2': pd.Series([6, 7, 8, 9, 10])})
grpd = df.groupby(['clients', 'odd1']).agg({
    'odd2': lambda x: x / float(x.sum())
})
print(grpd)
The desired result is:
A 1 0.619047619
2 0.380952381
B 1 0.473684211
2 0.526315789
I have browsed around, but I still don't understand how lambdas that operate on the whole array (e.g. x.sum()) work. Furthermore, I still don't see what x refers to in x.sum() with respect to the grouped columns.
You can do:
>>> df.groupby(['clients', 'odd1'])['odd2'].sum() / df.groupby('clients')['odd2'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: odd2, dtype: float64
or alternatively, use .transform to obtain values based on clients grouping and then sum for each clients and odd1 grouping:
>>> df['val'] = df['odd2'] / df.groupby('clients')['odd2'].transform('sum')
>>> df
clients odd1 odd2 val
0 A 1 6 0.286
1 A 1 7 0.333
2 A 2 8 0.381
3 B 1 9 0.474
4 B 2 10 0.526
>>> df.groupby(['clients', 'odd1'])['val'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: val, dtype: float64
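As for what x is: inside .agg, the function is called once per group, with x bound to that group's 'odd2' values as a Series, so x.sum() is the within-group sum. A minimal sketch (rebuilding the same df) that makes this visible:

```python
import pandas as pd

df = pd.DataFrame({
    'clients': ['A', 'A', 'A', 'B', 'B'],
    'odd1':    [1, 1, 2, 1, 2],
    'odd2':    [6, 7, 8, 9, 10]})

# The lambda is invoked once per (clients, odd1) group; x is the Series
# of 'odd2' values belonging to that group.
seen = df.groupby(['clients', 'odd1'])['odd2'].agg(lambda x: list(x))
print(seen)
# (A, 1) -> [6, 7]; (A, 2) -> [8]; (B, 1) -> [9]; (B, 2) -> [10]
```

Returning a list here is only for inspection; in the real computation each group's Series gets divided by its own sum.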
I am trying to create a simple algorithm in Dart, but I think the programming language doesn't matter; it is more about the algorithm. I am trying to make 2 lists of numbers depending on "row" and "column", for example:
col_1  col_2
1      2
3      4
5      6
7      8
9      10
=> I need an algorithm that gives me 2 lists of numbers:
first list: 2, 3, 6, 7, 10, ...
second list: 4, 5, 8, 9, ...
But this must also work when the "columns" change like that:
col_1  col_2  col_3
1      2      3
4      5      6
7      8      9
This time the first list must be: 3, 4, 9, ... and the second list: 6, 7, ...
Does anyone have an idea how I could achieve this with a simple calculation, or an algorithm depending on the "maximum" amount of numbers?
You're probably looking for the modulo operator %.
Directly, it can create a repeating series based upon the remainder of dividing by a value.
For example, you could place your values into the Nth list in a list of lists (whew):
>>> cols = [[],[],[]]
>>> for x in range(12):
... cols[x % 3].append(x)
...
>>> cols
[[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
The remainder cycles as the input grows, for example with 5:
0%5  --> 0
1%5  --> 1
...
11%5 --> 1
To massage values that are offset (for example, 1-based instead of 0-based), you can do a little math
>>> for x in range(12):
... print("x={} is in column {}".format(x + 1, (x % 3) + 1 ))
...
x=1 is in column 1
x=2 is in column 2
x=3 is in column 3
x=4 is in column 1
x=5 is in column 2
x=6 is in column 3
x=7 is in column 1
x=8 is in column 2
x=9 is in column 3
x=10 is in column 1
x=11 is in column 2
x=12 is in column 3
Note you may need to do some extra work to either
- know how many values you need to fill all the columns
- count and stop when you have enough rows
Examples are in Python because I'm most familiar with it
Note that work like [[],[],[]] should largely be avoided if a better collection is available (perhaps a dict of lists with column-name keys or Pandas DataFrame), but it makes for a good illustration
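Both notes reduce to a little arithmetic: with a known grid, rows * n_cols values fill the columns exactly, so the loop can stop there. A sketch, assuming a 4-row, 3-column grid with 1-based values:

```python
# Fill a 4-row x 3-column grid, stopping exactly when every column
# has `rows` entries.
rows, n_cols = 4, 3
cols = [[] for _ in range(n_cols)]
for x in range(rows * n_cols):      # rows * n_cols values fill the grid
    cols[x % n_cols].append(x + 1)  # x + 1 gives 1-based values
print(cols)
# [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```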
With a little discussion, https://oeis.org/A042964 seems to be the solution; it suggests the following test for set membership
binomial(m+2, m) mod 2 = 0
in Python, this could be
>>> from scipy.special import binom # 3rd party math library
>>> for m in range(1, 15):
... print("{} belongs in list {}".format(m, 1 if binom(m+2, m) % 2 == 0 else 2))
...
1 belongs in list 2
2 belongs in list 1
3 belongs in list 1
4 belongs in list 2
5 belongs in list 2
6 belongs in list 1
7 belongs in list 1
8 belongs in list 2
9 belongs in list 2
10 belongs in list 1
11 belongs in list 1
12 belongs in list 2
13 belongs in list 2
14 belongs in list 1
If only the first list is desired, entries can be directly generated by the formula a(n)=2*n+2-n%2 (refer to PROG on OEIS page), rather than testing values for membership
>>> a = lambda n: 2*n+2-n%2
>>> [a(n) for n in range(10)]
[2, 3, 6, 7, 10, 11, 14, 15, 18, 19]
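By the same pattern, the second list (4, 5, 8, 9, ...) appears to follow a shifted formula, b(n) = 2*n + 4 - n % 2. Note this one is derived here by inspection, not taken from the OEIS page:

```python
# Hypothetical complement of a(n) above: generates 4, 5, 8, 9, 12, 13, ...
b = lambda n: 2 * n + 4 - n % 2
print([b(n) for n in range(10)])
# [4, 5, 8, 9, 12, 13, 16, 17, 20, 21]
```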
For example, I have a dataframe like this:
import pandas as pd
df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
print(df)
a b
0 1 2.0
1 3 4.0
I want to get a dataframe as follows :
a b
0 [1,3] [2,4]
One approach -

df_out = pd.DataFrame([df.values.T.astype(int).tolist()], columns=df.columns)

To retrieve back -

import numpy as np

N = len(df_out.columns)
arr_back = np.concatenate(np.concatenate(df_out.values)).reshape(N, -1).T
df_back = pd.DataFrame(arr_back, columns=df_out.columns)
Sample run -
In [164]: df
Out[164]:
a b
0 1 2.0
1 3 4.0
2 5 6.0
In [165]: df_out
Out[165]:
a b
0 [1, 3, 5] [2, 4, 6]
In [166]: df_back
Out[166]:
a b
0 1 2
1 3 4
2 5 6
What I am trying to do is select the 1st element of each cell, regardless of the number of columns or rows (they may change based on user-defined criteria), and make a new pandas dataframe from the data. My actual data structure is similar to what I have listed below.
0 1 2
0 [1, 2] [2, 3] [3, 6]
1 [4, 2] [1, 4] [4, 6]
2 [1, 2] [2, 3] [3, 6]
3 [4, 2] [1, 4] [4, 6]
I want the new dataframe to look like:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
The code below generates a data set similar to mine and attempts, without success, to do what I want (d). It also mimics, with success, what I have seen in a similar question (c; however, that covers only one column). The link to the similar but different question is here: Python Pandas: selecting element in array column
import pandas as pd
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame((zz.columns.values))
b = pd.DataFrame.transpose(a)
c = zz[0].str[0]  # this gives the 1st value for each cell in column 0
d = zz[[b[0]].values].str[0]  # attempt to get the 1st value for each cell in all columns
You can use apply and for selecting first value of list use indexing with str:
print (zz.apply(lambda x: x.str[0]))
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
Another solution with stack and unstack:
print (zz.stack().str[0].unstack())
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I would use applymap, which applies the same function to each individual cell in your DataFrame:
df.applymap(lambda x: x[0])
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I'm a big fan of stack + unstack.
However, #jezrael already put that answer down... so +1 from me.
That said, here is a quicker way: slicing a numpy array.
pd.DataFrame(
np.array(zz.values.tolist())[:, :, 0],
zz.index, zz.columns
)
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
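To see why the slice works: np.array(zz.values.tolist()) stacks the cell pairs into a 3-D array of shape (rows, columns, 2), and [:, :, 0] picks the first element of every pair. A small sketch using the same zz:

```python
import numpy as np
import pandas as pd

zz = pd.DataFrame([[[1, 2], [2, 3], [3, 6]], [[4, 2], [1, 4], [4, 6]],
                   [[1, 2], [2, 3], [3, 6]], [[4, 2], [1, 4], [4, 6]]])

# tolist() turns the object-dtype cells into nested lists, which numpy
# stacks into a regular (rows, columns, 2) integer array.
arr = np.array(zz.values.tolist())
print(arr.shape)    # (4, 3, 2)
print(arr[:, :, 0]) # first element of each pair
```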
How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series. I also understand that this is not an efficient way of using Pandas.
Use:
s = pd.Series([0,1,2])
for i in s:
    print(i)
0
1
2
DataFrame:
df = pd.DataFrame({'a':[0,1,2], 'b':[4,5,8]})
print (df)
a b
0 0 4
1 1 5
2 2 8
for i, s in df.iterrows():
    print(s)
a 0
b 4
Name: 0, dtype: int64
a 1
b 5
Name: 1, dtype: int64
a 2
b 8
Name: 2, dtype: int64
How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series.
The second entry in the tuple is a Series:
In [9]: df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})
In [10]: for r in df.iterrows():
   ....:     print(r[1], type(r[1]))
   ....:
a 0
b 2
Name: 0, dtype: int64 <class 'pandas.core.series.Series'>
a 1
b 3
Name: 1, dtype: int64 <class 'pandas.core.series.Series'>
a 2
b 4
Name: 2, dtype: int64 <class 'pandas.core.series.Series'>
a 3
b 5
Name: 3, dtype: int64 <class 'pandas.core.series.Series'>
I also understand that this is not an efficient way of using Pandas.
That is true, in general, but the question is a bit too general. You'll need to specify why you're trying to iterate over the DataFrame.
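For instance, when a row loop is genuinely needed, itertuples is usually faster than iterrows and preserves dtypes; when it is not, a vectorized expression beats both. A minimal sketch, with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 8]})

# itertuples yields lightweight namedtuples instead of building
# a Series object for every row
total = 0
for row in df.itertuples(index=False):
    total += row.a + row.b

# the vectorized equivalent avoids the Python-level loop entirely
assert total == (df['a'] + df['b']).sum()
print(total)  # 20
```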
I have a dataframe with three columns: X, Y, and counts, where counts is the number of occurrences in which X and Y appear together. My goal is to transform this from a dataframe to a two-dimensional array where X supplies the row names, Y the column names, and the counts make up the records in the table.
Is this possible? I can elaborate if needed.
To get the same result as a pivot table, you can also perform a groupby operation and then unstack one of the columns:
import numpy as np
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'black'] * 2,
'vehicle': ['car', 'truck'] * 3,
'value': np.arange(1, 7)})
>>> df
color value vehicle
0 red 1 car
1 blue 2 truck
2 black 3 car
3 red 4 truck
4 blue 5 car
5 black 6 truck
>>> df.groupby(['color', 'vehicle']).sum().unstack('vehicle')
value
vehicle car truck
color
black 3 6
blue 5 2
red 1 4
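The same reshape can also be done in one call with pivot_table, which groups, aggregates duplicate (index, column) pairs, and unstacks in a single step. A sketch reusing the df above, with sum as the aggregator to match the groupby result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'black'] * 2,
                   'vehicle': ['car', 'truck'] * 3,
                   'value': np.arange(1, 7)})

# index/columns/values name the reshape; aggfunc handles duplicates
out = df.pivot_table(index='color', columns='vehicle',
                     values='value', aggfunc='sum')
print(out)
# vehicle  car  truck
# color
# black      3      6
# blue       5      2
# red        1      4
```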
Here is an IPython session that may be a good simulation of what you are trying to do:
In [17]: import pandas as pd
In [18]: from random import randint
In [19]: x = ['a', 'b', 'c'] * 4
In [20]: y = ['i', 'j', 'k', 'l'] * 3
In [21]: counts = [randint(10, 20) for i in range(12)]
In [22]: df = pd.DataFrame(dict(x=x, y=y, counts=counts))
In [23]: df.head()
Out[23]:
counts x y
0 16 a i
1 10 b j
2 16 c k
3 15 a l
4 19 b i
In [24]: df.pivot(index='x', columns='y', values='counts')
Out[24]:
y i j k l
x
a 16 14 18 15
b 19 10 15 20
c 10 18 16 16
In [25]: df.pivot(index='x', columns='y', values='counts').values
Out[25]:
array([[16, 14, 18, 15],
[19, 10, 15, 20],
[10, 18, 16, 16]], dtype=int64)