pandas dataframe values with numpy array

pandas dataframe values with numpy array - arrays

For example i have a dataframe like this :
import pandas as pd
df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
print df
a b
0 1 2.0
1 3 4.0
I want to get a dataframe as follows :
a b
0 [1,3] [2,4]

One approach -
df_out = pd.DataFrame([df.values.T.astype(int).tolist()], columns=df.columns)
To retrieve back -
N = len(df_out.columns)
arr_back = np.concatenate(np.concatenate(df_out.values)).reshape(N,-1).T
df_back = pd.DataFrame(arr_back, columns=df_out.columns)
Sample run -
In [164]: df
Out[164]:
a b
0 1 2.0
1 3 4.0
2 5 6.0
In [165]: df_out
Out[165]:
a b
0 [1, 3, 5] [2, 4, 6]
In [166]: df_back
Out[166]:
a b
0 1 2
1 3 4
2 5 6

Related

Extract pattern and subsequent n elements from array and count number of occurences

I have an array of doubles like this:
C = [1 2 3 4 0 3 2 5 6 7 1 2 3 4 150 30]
i want to find the pattern [1 2 3 4] within the array and then store the 2 values after that pattern with it like:
A = [1 2 3 4 0 3]
B = [1 2 3 4 150 30]
i can find the pattern like this but i don't know how to get and store 2 values after that with the previous one.
And after finding A, B if i want to find the number of occurrences of each arrays within array C how can i do that?
indices = cellfun(#(c) strfind(c,pattern), C, 'UniformOutput', false);
Thanks!

Assuming you're fine with a cell array output, this works fine:
C = [1 2 3 4 0 3 2 5 6 7 1 2 3 4 150 30 42 1 2 3 4 0 3]
p = [1 2 3 4]
n = 2
% full patttern length - 1
dn = numel(p) + n - 1
%// find indices
ind = strfind(C,p)
%// pre check if pattern at end of array
if ind(end)+ dn > numel(C), k = -1; else k = 0; end
%// extracting
temp = arrayfun(#(x) C(x:x+dn), ind(1:end+k) , 'uni', 0)
%// post processing
[out, ~, idx] = unique(vertcat(temp{:}),'rows','stable')
occ = histcounts(idx).'
If the array C ends with at least n elements after the last occurrence of the pattern p, you can use the short form:
out = arrayfun(#(x) C(x:x+n+numel(p)-1), strfind(C,p) , 'uni', 0)
out =
1 2 3 4 0 3
1 2 3 4 150 30
occ =
2
1

A simple solution can be:
C = [1 2 3 4 0 3 2 5 6 7 1 2 3 4 150 30];
pattern = [1 2 3 4];
numberOfAddition = 2;
outputs = zeros(length(A),length(pattern)+ numberOfAddition); % preallocation
numberOfFoundPattern = 1;
lengthOfConsider = length(C) - length(pattern) - numberOfAddition;
for i = 1:lengthOfConsider
if(sum(C(i:i+length(pattern)) - pattern) == 0) % find pattern
outputs(numberOfFoundPattern,:) = C(i:i+length(pattern)+numberOfAddition);
numberOfFoundPattern = numberOfFoundPattern + 1;
end
end
outputs = outputs(1:numberOfFoundPattern - 1,:);

Python Pandas: selecting 1st element in array in all cells

What I am trying to do is select the 1st element of each cell regardless of the number of columns or rows (they may change based on user defined criteria) and make a new pandas dataframe from the data. My actual data structure is similar to what I have listed below.
0 1 2
0 [1, 2] [2, 3] [3, 6]
1 [4, 2] [1, 4] [4, 6]
2 [1, 2] [2, 3] [3, 6]
3 [4, 2] [1, 4] [4, 6]
I want the new dataframe to look like:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
The code below generates a data set similar to mine and attempts to do what I want to do in my code without success (d), and mimics what I have seen in a similar question with success(c ; however, only one column). The link to the similar, but different question is here :Python Pandas: selecting element in array column
import pandas as pd
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame((zz.columns.values))
b = pd.DataFrame.transpose(a)
c =zz[0].str[0] # this will give the 1st value for each cell in columns 0
d= zz[[b[0]].values].str[0] #attempt to get 1st value for each cell in all columns

You can use apply and for selecting first value of list use indexing with str:
print (zz.apply(lambda x: x.str[0]))
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
Another solution with stack and unstack:
print (zz.stack().str[0].unstack())
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4

I would use applymap which applies the same function to each individual cell in your DataFrame
df.applymap(lambda x: x[0])
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4

I'm a big fan of stack + unstack
However, #jezrael already put that answer down... so + 1 from me.
That said, here is a quicker way. By slicing a numpy array
pd.DataFrame(
np.array(zz.values.tolist())[:, :, 0],
zz.index, zz.columns
)
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
timing

Extract blocks of certain number from array

I have a vector and I would like to extract all the 4's from it:
x = [1 1 1 4 4 5 5 4 6 1 2 4 4 4 9 8 4 4 4 4]
so that I will get 4 vectors or a cell containing the 4 blocks of 4's:
[4 4], [4], [4 4 4], [4 4 4 4]
Thanks!

You can create cells from the appropriate ranges using arrayfun:
x = [1 1 1 4 4 5 5 4 6 1 2 4 4 4 9 8 4 4 4 4];
x = [0, x, 0]; D = diff (x==4); % pad array and diff its mask
A = find (D == 1); B = find (D == -1); % find inflection points
out = arrayfun (# (a,b) {x(a+1 : b)}, A, B) % collect ranges in cells

This should be pretty fast, using accumarray:
X = 4;
%// data
x = [1 1 1 4 4 5 5 4 6 1 2 4 4 4 9 8 4 4 4 4]
%// mask all 4
mask = x(:) == X
%// get subs for accumarray
subs = cumsum( diff( [0; mask] ) > 0 )
%// filter data and sort into cell array
out = accumarray( subs(mask), x(mask), [], #(y) {y} )

idx=find(x==4);
for (i= 1:length(idx))
if (i==1 || idx(i-1)!=idx(i)-1)if(i!=1) printf(",") endif; printf("[") endif;
printf("4");
if (i<length(idx)&&idx(i+1)==idx(i)+1) printf(",") else printf("]") endif
endfor
Note this won't give the actual vectors, but it will give the output you wanted. The above is OCtave code. I am pretty sure changing endfor and endif to end would work in MAtlab, but without testing in matlab, I am not positive.[edited in light of comment]

with regionprops we can set property PixelValues so the function returns 1s instead of 4s
x = [1 1 1 4 4 5 5 4 6 1 2 4 4 4 9 8 4 4 4 4]
{regionprops(x==4,'PixelValues').PixelValues}
if we set property PixelIdxList the function returns a cell of indices of 4s:
{regionprops(x==4,'PixelIdxList').PixelIdxList}
Update(without image processing toolbox):
this way we can get number of elements of each connected components:
c = cumsum(x~=4)
h=hist(,c(1):c(end));
h(1)=h(1)+~c(1);
result = h(h~=1)-1

understand functions that operate on whole array in groupby aggregation

import numpy as np
import pandas as pd
df = pd.DataFrame({
'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
'odd1': pd.Series([1, 1, 2, 1, 2]),
'odd2': pd.Series([6, 7, 8, 9, 10])})
grpd = df.groupby(['clients', 'odd1']).agg({
'odd2': lambda x: x/float(x.sum())
})
print grpd
The desired result is:
A 1 0.619047619
2 0.380952381
B 1 0.473684211
2 0.526316
I have browsed around but I still don't understand how having lambdas that operate on the whole array, f.ex. x.sum() work. Furthermore, I still miss the point on what x is in x.sum() wrt to the grouped columns.

You can do:
>>> df.groupby(['clients', 'odd1'])['odd2'].sum() / df.groupby('clients')['odd2'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: odd2, dtype: float64
or alternatively, use .transform to obtain values based on clients grouping and then sum for each clients and odd1 grouping:
>>> df['val'] = df['odd2'] / df.groupby('clients')['odd2'].transform('sum')
>>> df
clients odd1 odd2 val
0 A 1 6 0.286
1 A 1 7 0.333
2 A 2 8 0.381
3 B 1 9 0.474
4 B 2 10 0.526
>>> df.groupby(['clients', 'odd1'])['val'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: val, dtype: float64

storing value against variable name "QW1I5K20" in an array element Q[1,5,20] using R

I have an excel file (.csv) with a sorted column of variable names such as "QW1I1K5" and numerical values against them.
this list goes on for
W from 1 to 15
I from 1 to 4
K from 1 to 30
total elements = 15*4*30 = 1800
I want to store the numerical values against these variables in an array whose indices are derived from the variable name .
for example QW1I1K5 has a value 11 . this must be stored in an array element Q[1,1,5] = 11 ( index set of [1,1,5] corresponds to W1 , I1 , K5)

May be this helps
Q <- array(dat$Col2, dim=c(15,4,30))
dat$Col2[dat$Col1=='QW1I1K5']
#[1] 34
Q[1,1,5]
#[1] 34
dat$Col2[dat$Col1=='QW4I3K8']
#[1] 38
Q[4,3,8]
#[1] 38
If you want the index along with the values
library(reshape2)
d1 <- melt(Q)
head(d1,3)
# Var1 Var2 Var3 value
#1 1 1 1 12
#2 2 1 1 9
#3 3 1 1 29
Q[1,1,1]
#[1] 12
Q[3,1,1]
#[1] 29
Update
Suppose, your data is in the order as you described in the comments, which will be dat1
indx <- read.table(text=gsub('[^0-9]+', ' ', dat1$Col1), header=FALSE)
dat2 <- dat1[do.call(order, indx[,3:1]),]
Q1 <- array(dat2$Col2,dim=c(15,4,30))
Q1[1,1,2]
#[1] 20
dat2$Col2[dat2$Col1=='QW1I1K2']
#[1] 20
data
Col1 <- do.call(paste,c(expand.grid('QW', 1:15, 'I', 1:4, 'K',1:30),
list(sep='')))
set.seed(24)
dat <- data.frame(Col1, Col2=sample(1:40, 1800,replace=TRUE))
dat1 <- dat[order(as.numeric(gsub('[^0-9]+', '', dat$Col1))),]
row.names(dat1) <- NULL

I would suggest looking at using "data.table" and setting your key to the split columns. You can use cSplit from my "splitstackshape" function to easily split the column.
Sample Data:
df <- data.frame(
V1 = c("QW1I1K1", "QW1I1K2", "QW1I1K3",
"QW1I1K4", "QW2I1K5", "QW2I3K2"),
V2 = c(15, 20, 5, 6, 7, 9))
df
# V1 V2
# 1 QW1I1K1 15
# 2 QW1I1K2 20
# 3 QW1I1K3 5
# 4 QW1I1K4 6
# 5 QW2I1K5 7
# 6 QW2I3K2 9
Splitting the column:
library(splitstackshape)
out <- cSplit(df, "V1", "[A-Z]+", fixed = FALSE)
setnames(out, c("V2", "W", "I", "K"))
setcolorder(out, c("W", "I", "K", "V2"))
setkey(out, W, I, K)
out
# W I K V2
# 1: 1 1 1 15
# 2: 1 1 2 20
# 3: 1 1 3 5
# 4: 1 1 4 6
# 5: 2 1 5 7
# 6: 2 3 2 9
Extracting rows:
out[J(1, 1, 4)]
# W I K V2
# 1: 1 1 4 6
out[J(2, 3, 2)]
# W I K V2
# 1: 2 3 2 9

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

pandas dataframe values with numpy array - arrays

For example i have a dataframe like this : import pandas as pd df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b']) print df a b 0 1 2.0 1 3 4.0 I want to get a dataframe as follows : a b 0 [1,3] [2,4]

Related

Extract pattern and subsequent n elements from array and count number of occurences

Python Pandas: selecting 1st element in array in all cells

Extract blocks of certain number from array

understand functions that operate on whole array in groupby aggregation

storing value against variable name "QW1I5K20" in an array element Q[1,5,20] using R

Categories

Resources