How to iterate the rows of a DataFrame as Series in Pandas?

How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series. I also understand that this is not an efficient way of using Pandas.

Use:

import pandas as pd

s = pd.Series([0, 1, 2])
for i in s:
    print(i)

0
1
2
DataFrame:

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 8]})
print(df)

   a  b
0  0  4
1  1  5
2  2  8

for i, s in df.iterrows():
    print(s)

a    0
b    4
Name: 0, dtype: int64
a    1
b    5
Name: 1, dtype: int64
a    2
b    8
Name: 2, dtype: int64

How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series.
The second entry in the tuple is a Series:
In [9]: df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})
In [10]: for r in df.iterrows():
   ....:     print(r[1], type(r[1]))
   ....:
a    0
b    2
Name: 0, dtype: int64 <class 'pandas.core.series.Series'>
a    1
b    3
Name: 1, dtype: int64 <class 'pandas.core.series.Series'>
a    2
b    4
Name: 2, dtype: int64 <class 'pandas.core.series.Series'>
a    3
b    5
Name: 3, dtype: int64 <class 'pandas.core.series.Series'>
I also understand that this is not an efficient way of using Pandas.
That is true in general, but the question as posed is too broad: you'd need to say why you want to iterate over the DataFrame, since most row-wise tasks have a faster itertuples()-based or vectorized equivalent, as sketched below.
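A minimal sketch of the usual alternatives, assuming the toy df from the first answer: itertuples() yields lightweight named tuples (not Series) and is typically much faster than iterrows(), and a vectorized expression avoids Python-level iteration entirely.

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 8]})

# itertuples(): each row is a namedtuple; access columns as attributes
for row in df.itertuples():
    print(row.Index, row.a, row.b)

# vectorized: the whole column is computed at once, no explicit loop
df['total'] = df['a'] + df['b']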

Related

Remove a value from an array and decrease its size

I have an array filled with some values. After running this code:
array = zeros(10)
for i in 1:10
    array[i] = 2*i + 3
end
The array looks like:
10-element Array{Float64,1}:
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
I would like to obtain, for example, the following array by removing the third value:
9-element Array{Float64,1}:
5.0
7.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
How can I do that?
EDIT
If I have an array (and not a vector), like here:
a = [1 2 3 4 5]
1×5 Array{Int64,2}:
1 2 3 4 5
The deleteat! solution proposed above does not work:
a = deleteat!([1 2 3 4 5], 1)
ERROR: MethodError: no method matching deleteat!(::Array{Int64,2}, ::Int64)
You might have used a 2d row vector where a 1d column vector was required.
Note the difference between 1d column vector [1,2,3] and 2d row vector [1 2 3].
You can convert to a column vector with the vec() function.
Closest candidates are:
deleteat!(::Array{T,1} where T, ::Integer) at array.jl:875
deleteat!(::Array{T,1} where T, ::Any) at array.jl:913
deleteat!(::BitArray{1}, ::Integer) at bitarray.jl:961
I don't want a column vector. I would want:
1×4 Array{Int64,2}:
2 3 4 5
Is it possible?
To make that clear: Vector{T} in Julia is just a synonym for Array{T, 1}; arrays of all ranks are simply called arrays.
But this seems to be a Matlab-inherited misconception. In Julia, you construct a Matrix{T}, i.e. an Array{T, 2}, by using spaces in the literal:
julia> a = [1 2 3 4 5]
1×5 Array{Int64,2}:
1 2 3 4 5
Deleting from a matrix does not make sense in general, since you can't trivially "fix the shape" in a rectangular layout.
A Vector or Array{T, 1} can be written using commas:
julia> a = [1, 2, 3, 4, 5]
5-element Array{Int64,1}:
1
2
3
4
5
And on this, deleteat! works:
julia> deleteat!(a, 1)
4-element Array{Int64,1}:
2
3
4
5
For completeness, there's also a third variant, the RowVector, which results from a transposition:
julia> a'
1×4 RowVector{Int64,Array{Int64,1}}:
2 3 4 5
From this you also can't delete.
deleteat! is only defined for one-dimensional arrays; the docs list it as fully implemented by:
Vector (a.k.a. 1-dimensional Array)
BitVector (a.k.a. 1-dimensional BitArray)
A RowVector (2-dimensional) won't work.
But there is a workaround using this trick:
julia> deleteat!(a[1,:], 1)' # mind the ' -> transposes it back to a row vector.
1×4 RowVector{Int64,Array{Int64,1}}:
2 3 4 5
Of course, this wouldn't work for an array with 2 or more rows.
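As a cross-language aside (a hypothetical numpy sketch, not from the original thread): numpy's delete takes an axis argument and returns a copy with the shape fixed up, so the 1-d/2-d distinction is handled uniformly there.

import numpy as np

a = np.array([[1, 2, 3, 4, 5]])  # 2-d "row vector", shape (1, 5)
print(np.delete(a, 0, axis=1))   # drop column 0 -> [[2 3 4 5]], shape (1, 4)

v = np.array([1, 2, 3, 4, 5])    # 1-d vector, shape (5,)
print(np.delete(v, 0))           # drop element 0 -> [2 3 4 5], shape (4,)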

pandas dataframe values with numpy array

For example, I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
print(df)
a b
0 1 2.0
1 3 4.0
I want to get a DataFrame as follows:
a b
0 [1,3] [2,4]
One approach, with the needed imports -

import numpy as np
import pandas as pd

df_out = pd.DataFrame([df.values.T.astype(int).tolist()], columns=df.columns)

To retrieve the original frame back -

N = len(df_out.columns)
arr_back = np.concatenate(np.concatenate(df_out.values)).reshape(N, -1).T
df_back = pd.DataFrame(arr_back, columns=df_out.columns)
Sample run -
In [164]: df
Out[164]:
a b
0 1 2.0
1 3 4.0
2 5 6.0
In [165]: df_out
Out[165]:
a b
0 [1, 3, 5] [2, 4, 6]
In [166]: df_back
Out[166]:
a b
0 1 2
1 3 4
2 5 6
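A simpler alternative sketch (assuming the same df as in the question): build the one-row frame directly with a dict comprehension, which skips the transpose/astype round trip and keeps each column's own dtype.

import pandas as pd

df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
df_out = pd.DataFrame({c: [df[c].tolist()] for c in df.columns})
print(df_out)
#         a           b
# 0  [1, 3]  [2.0, 4.0]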

understand functions that operate on whole array in groupby aggregation

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
    'odd1': pd.Series([1, 1, 2, 1, 2]),
    'odd2': pd.Series([6, 7, 8, 9, 10])})

grpd = df.groupby(['clients', 'odd1']).agg({
    'odd2': lambda x: x / float(x.sum())
})
print(grpd)
The desired result is:

A 1    0.619047619
  2    0.380952381
B 1    0.473684211
  2    0.526316
I have browsed around, but I still don't understand how lambdas that operate on the whole array, e.g. x.sum(), work. Furthermore, I don't understand what x is in x.sum() with respect to the grouped columns.
x inside the lambda is each group's odd2 column, passed in as a Series; the problem with your attempt is that .agg expects each function to reduce a group to a single value, while your lambda returns a Series the same length as the group. You can instead do:
>>> df.groupby(['clients', 'odd1'])['odd2'].sum() / df.groupby('clients')['odd2'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: odd2, dtype: float64
or, alternatively, use .transform to broadcast the per-clients sums back onto the original rows, then sum within each clients and odd1 group:
>>> df['val'] = df['odd2'] / df.groupby('clients')['odd2'].transform('sum')
>>> df
clients odd1 odd2 val
0 A 1 6 0.286
1 A 1 7 0.333
2 A 2 8 0.381
3 B 1 9 0.474
4 B 2 10 0.526
>>> df.groupby(['clients', 'odd1'])['val'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: val, dtype: float64
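To see concretely what x is (a small sketch using the df defined in the question): iterating over the GroupBy object shows the per-group odd2 Series that .agg would pass to the lambda.

for key, group in df.groupby(['clients', 'odd1']):
    print(key, group['odd2'].tolist())
# ('A', 1) [6, 7]
# ('A', 2) [8]
# ('B', 1) [9]
# ('B', 2) [10]
# inside .agg({'odd2': f}), f receives each of these 'odd2' Series as x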

storing value against variable name "QW1I5K20" in an array element Q[1,5,20] using R

I have an Excel file (.csv) with a sorted column of variable names such as "QW1I1K5" and numerical values against them.
The list runs over:
W from 1 to 15
I from 1 to 4
K from 1 to 30
total elements = 15*4*30 = 1800
I want to store the numerical values against these variables in an array whose indices are derived from the variable name.
For example, QW1I1K5 has the value 11. This must be stored in the array element Q[1,1,5] = 11 (the index set [1,1,5] corresponds to W1, I1, K5).
Maybe this helps:
Q <- array(dat$Col2, dim=c(15,4,30))
dat$Col2[dat$Col1=='QW1I1K5']
#[1] 34
Q[1,1,5]
#[1] 34
dat$Col2[dat$Col1=='QW4I3K8']
#[1] 38
Q[4,3,8]
#[1] 38
If you want the indices along with the values:
library(reshape2)
d1 <- melt(Q)
head(d1,3)
# Var1 Var2 Var3 value
#1 1 1 1 12
#2 2 1 1 9
#3 3 1 1 29
Q[1,1,1]
#[1] 12
Q[3,1,1]
#[1] 29
Update
Suppose your data is in the order you described in the comments; call it dat1:
indx <- read.table(text=gsub('[^0-9]+', ' ', dat1$Col1), header=FALSE)
dat2 <- dat1[do.call(order, indx[,3:1]),]
Q1 <- array(dat2$Col2,dim=c(15,4,30))
Q1[1,1,2]
#[1] 20
dat2$Col2[dat2$Col1=='QW1I1K2']
#[1] 20
data
Col1 <- do.call(paste,c(expand.grid('QW', 1:15, 'I', 1:4, 'K',1:30),
list(sep='')))
set.seed(24)
dat <- data.frame(Col1, Col2=sample(1:40, 1800,replace=TRUE))
dat1 <- dat[order(as.numeric(gsub('[^0-9]+', '', dat$Col1))),]
row.names(dat1) <- NULL
I would suggest looking at using "data.table" and setting your key to the split columns. You can use cSplit from my "splitstackshape" package to easily split the column.
Sample Data:
df <- data.frame(
V1 = c("QW1I1K1", "QW1I1K2", "QW1I1K3",
"QW1I1K4", "QW2I1K5", "QW2I3K2"),
V2 = c(15, 20, 5, 6, 7, 9))
df
# V1 V2
# 1 QW1I1K1 15
# 2 QW1I1K2 20
# 3 QW1I1K3 5
# 4 QW1I1K4 6
# 5 QW2I1K5 7
# 6 QW2I3K2 9
Splitting the column:
library(splitstackshape)
out <- cSplit(df, "V1", "[A-Z]+", fixed = FALSE)
setnames(out, c("V2", "W", "I", "K"))
setcolorder(out, c("W", "I", "K", "V2"))
setkey(out, W, I, K)
out
# W I K V2
# 1: 1 1 1 15
# 2: 1 1 2 20
# 3: 1 1 3 5
# 4: 1 1 4 6
# 5: 2 1 5 7
# 6: 2 3 2 9
Extracting rows:
out[J(1, 1, 4)]
# W I K V2
# 1: 1 1 4 6
out[J(2, 3, 2)]
# W I K V2
# 1: 2 3 2 9
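As a cross-language aside (a hypothetical Python sketch, not part of the original R answers), the name-to-index mapping itself is just a regex away: pull the three integers out of each name and use them, 1-based, as array indices.

import re
import numpy as np

Q = np.zeros((15, 4, 30))

def store(name, value, arr):
    # e.g. "QW1I5K20" -> (1, 5, 20)
    w, i, k = map(int, re.match(r'QW(\d+)I(\d+)K(\d+)', name).groups())
    arr[w - 1, i - 1, k - 1] = value  # numpy is 0-based; the names are 1-based

store('QW1I5K20', 11.0, Q)
print(Q[0, 4, 19])  # 11.0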

How do I perform a function on multiple rows of data which are factored by the column they are in, using R?

I have a table in a file with many rows which I have read into R using
data <- read.table("path/to/data.txt", header=TRUE, sep="\t", row.names=1)
A1 A2 A3 B1 B2 B3
Row1 1 3 2 3 2 6
Row2 3 2 1 3 6 7
...
I have then read this into a frame using
df <- data.frame(data)
I would like to apply a function to compare the A samples against the B samples for each row,
function(A, B)
but I am unsure how to select only the A's and only the B's from the data frame for each row. Is there a way to do this all at once for the whole table? Do I have to read the data into a separate frame, or can I work straight from the initial read.table result?
Note that read.table() already returns a data frame, so the data.frame(data) call is redundant. Try this:
set.seed(001) # Generating some data
DF <- data.frame(A1=sample(1:9, 10, T),
                 A2=sample(1:9, 10, T),
                 A3=sample(1:9, 10, T),
                 B1=sample(1:9, 10, T),
                 B2=sample(1:9, 10, T),
                 B3=sample(1:9, 10, T))
sampA <- DF[, grep('A', names(DF))] # Sample with the A columns
sampB <- DF[, grep('B', names(DF))] # Sample with the B columns
lapply(1:nrow(DF), function(i){
    wilcox.test(as.numeric(sampA[i,]), as.numeric(sampB[i,]), exact=FALSE)
}) # Performing the test row by row
The result looks like this:
[[1]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 3, p-value = 0.6579
alternative hypothesis: true location shift is not equal to 0
[[2]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 0, p-value = 0.0722
alternative hypothesis: true location shift is not equal to 0
[[3]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 6, p-value = 0.6579
alternative hypothesis: true location shift is not equal to 0
Only the first 3 results are shown here; the complete list has length 10, since DF has 10 rows.
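A rough Python equivalent of the same row-wise comparison (a sketch; scipy's mannwhitneyu is the two-sample Wilcoxon rank-sum test, the counterpart of R's wilcox.test here): select the A and B columns by name and test row by row.

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 10, size=(10, 6)),
                  columns=['A1', 'A2', 'A3', 'B1', 'B2', 'B3'])

sampA = df.filter(like='A')  # columns whose names contain "A"
sampB = df.filter(like='B')  # columns whose names contain "B"

results = [mannwhitneyu(sampA.iloc[i], sampB.iloc[i])
           for i in range(len(df))]
print(results[0])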
