How to iterate the rows of a DataFrame as Series in Pandas?

How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series. I also understand that this is not an efficient way of using Pandas.

Use:

import pandas as pd

s = pd.Series([0, 1, 2])
for i in s:
    print(i)

0
1
2
DataFrame:

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 8]})
print(df)

   a  b
0  0  4
1  1  5
2  2  8

for i, s in df.iterrows():
    print(s)

a    0
b    4
Name: 0, dtype: int64
a    1
b    5
Name: 1, dtype: int64
a    2
b    8
Name: 2, dtype: int64

How can I iterate over rows in a DataFrame? For some reason iterrows() is returning tuples rather than Series.
The second entry in the tuple is a Series:
In [9]: df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})
In [10]: for r in df.iterrows():
   ....:     print(r[1], type(r[1]))
   ....:
a    0
b    2
Name: 0, dtype: int64 <class 'pandas.core.series.Series'>
a    1
b    3
Name: 1, dtype: int64 <class 'pandas.core.series.Series'>
a    2
b    4
Name: 2, dtype: int64 <class 'pandas.core.series.Series'>
a    3
b    5
Name: 3, dtype: int64 <class 'pandas.core.series.Series'>
I also understand that this is not an efficient way of using Pandas.
That is true in general, but the question as posed is too broad: you'd need to say why you want to iterate over the DataFrame, since most row-wise tasks have a faster itertuples()-based or vectorized equivalent, as sketched below.
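A minimal sketch of the usual alternatives, assuming the toy df from the first answer: itertuples() yields lightweight named tuples (not Series) and is typically much faster than iterrows(), and a vectorized expression avoids Python-level iteration entirely.

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 8]})

# itertuples(): each row is a namedtuple; access columns as attributes
for row in df.itertuples():
    print(row.Index, row.a, row.b)

# vectorized: the whole column is computed at once, no explicit loop
df['total'] = df['a'] + df['b']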

Related

Remove a value from an array and decrease its size

I have an array filled with some values. After running this code:
array = zeros(10)
for i in 1:10
    array[i] = 2*i + 3
end
The array looks like:
10-element Array{Float64,1}:
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
I would like to obtain, for example, the following array by removing the third value:
9-element Array{Float64,1}:
5.0
7.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
How can I do that?
EDIT
If I have an array (and not a vector), like here:
a = [1 2 3 4 5]
1×5 Array{Int64,2}:
1 2 3 4 5
The deleteat! solution proposed above does not work:
a = deleteat!([1 2 3 4 5], 1)
ERROR: MethodError: no method matching deleteat!(::Array{Int64,2}, ::Int64)
You might have used a 2d row vector where a 1d column vector was required.
Note the difference between 1d column vector [1,2,3] and 2d row vector [1 2 3].
You can convert to a column vector with the vec() function.
Closest candidates are:
deleteat!(::Array{T,1} where T, ::Integer) at array.jl:875
deleteat!(::Array{T,1} where T, ::Any) at array.jl:913
deleteat!(::BitArray{1}, ::Integer) at bitarray.jl:961
I don't want a column vector. I would want:
1×4 Array{Int64,2}:
2 3 4 5
Is it possible?
To make that clear: Vector{T} in Julia is just a synonym for Array{T, 1}; arrays of all ranks are simply called arrays.
But this seems to be a Matlab-inherited misconception. In Julia, you construct a Matrix{T}, i.e. an Array{T, 2}, by using spaces in the literal:
julia> a = [1 2 3 4 5]
1×5 Array{Int64,2}:
1 2 3 4 5
Deleting from a matrix does not make sense in general, since you can't trivially "fix the shape" in a rectangular layout.
A Vector or Array{T, 1} can be written using commas:
julia> a = [1, 2, 3, 4, 5]
5-element Array{Int64,1}:
1
2
3
4
5
And on this, deleteat! works:
julia> deleteat!(a, 1)
4-element Array{Int64,1}:
2
3
4
5
For completeness, there's also a third variant, the RowVector, which results from a transposition:
julia> a'
1×4 RowVector{Int64,Array{Int64,1}}:
2 3 4 5
From this you also can't delete.
deleteat! is only defined for one-dimensional arrays; the docs list it as fully implemented by:
Vector (a.k.a. 1-dimensional Array)
BitVector (a.k.a. 1-dimensional BitArray)
A RowVector (2-dimensional) won't work.
But there is a workaround using this trick:
julia> deleteat!(a[1,:], 1)' # mind the ' -> transposes it back to a row vector.
1×4 RowVector{Int64,Array{Int64,1}}:
2 3 4 5
Of course, this wouldn't work for an array with 2 or more rows.
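As a cross-language aside (a hypothetical numpy sketch, not from the original thread): numpy's delete takes an axis argument and returns a copy with the shape fixed up, so the 1-d/2-d distinction is handled uniformly there.

import numpy as np

a = np.array([[1, 2, 3, 4, 5]])  # 2-d "row vector", shape (1, 5)
print(np.delete(a, 0, axis=1))   # drop column 0 -> [[2 3 4 5]], shape (1, 4)

v = np.array([1, 2, 3, 4, 5])    # 1-d vector, shape (5,)
print(np.delete(v, 0))           # drop element 0 -> [2 3 4 5], shape (4,)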

pandas dataframe values with numpy array

For example, I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
print(df)
a b
0 1 2.0
1 3 4.0
I want to get a DataFrame as follows:
a b
0 [1,3] [2,4]
One approach, with the needed imports -

import numpy as np
import pandas as pd

df_out = pd.DataFrame([df.values.T.astype(int).tolist()], columns=df.columns)

To retrieve the original frame back -

N = len(df_out.columns)
arr_back = np.concatenate(np.concatenate(df_out.values)).reshape(N, -1).T
df_back = pd.DataFrame(arr_back, columns=df_out.columns)
Sample run -
In [164]: df
Out[164]:
a b
0 1 2.0
1 3 4.0
2 5 6.0
In [165]: df_out
Out[165]:
a b
0 [1, 3, 5] [2, 4, 6]
In [166]: df_back
Out[166]:
a b
0 1 2
1 3 4
2 5 6
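A simpler alternative sketch (assuming the same df as in the question): build the one-row frame directly with a dict comprehension, which skips the transpose/astype round trip and keeps each column's own dtype.

import pandas as pd

df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
df_out = pd.DataFrame({c: [df[c].tolist()] for c in df.columns})
print(df_out)
#         a           b
# 0  [1, 3]  [2.0, 4.0]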

understand functions that operate on whole array in groupby aggregation

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
    'odd1': pd.Series([1, 1, 2, 1, 2]),
    'odd2': pd.Series([6, 7, 8, 9, 10])})

grpd = df.groupby(['clients', 'odd1']).agg({
    'odd2': lambda x: x / float(x.sum())
})
print(grpd)
The desired result is:

A 1    0.619047619
  2    0.380952381
B 1    0.473684211
  2    0.526316
I have browsed around, but I still don't understand how lambdas that operate on the whole array, e.g. x.sum(), work. Furthermore, I don't understand what x is in x.sum() with respect to the grouped columns.
x inside the lambda is each group's odd2 column, passed in as a Series; the problem with your attempt is that .agg expects each function to reduce a group to a single value, while your lambda returns a Series the same length as the group. You can instead do:
>>> df.groupby(['clients', 'odd1'])['odd2'].sum() / df.groupby('clients')['odd2'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: odd2, dtype: float64
or, alternatively, use .transform to broadcast the per-clients sums back onto the original rows, then sum within each clients and odd1 group:
>>> df['val'] = df['odd2'] / df.groupby('clients')['odd2'].transform('sum')
>>> df
clients odd1 odd2 val
0 A 1 6 0.286
1 A 1 7 0.333
2 A 2 8 0.381
3 B 1 9 0.474
4 B 2 10 0.526
>>> df.groupby(['clients', 'odd1'])['val'].sum()
clients odd1
A 1 0.619
2 0.381
B 1 0.474
2 0.526
Name: val, dtype: float64
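To see concretely what x is (a small sketch using the df defined in the question): iterating over the GroupBy object shows the per-group odd2 Series that .agg would pass to the lambda.

for key, group in df.groupby(['clients', 'odd1']):
    print(key, group['odd2'].tolist())
# ('A', 1) [6, 7]
# ('A', 2) [8]
# ('B', 1) [9]
# ('B', 2) [10]
# inside .agg({'odd2': f}), f receives each of these 'odd2' Series as x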

storing value against variable name "QW1I5K20" in an array element Q[1,5,20] using R

I have an Excel file (.csv) with a sorted column of variable names such as "QW1I1K5" and numerical values against them.
The list runs over:
W from 1 to 15
I from 1 to 4
K from 1 to 30
total elements = 15*4*30 = 1800
I want to store the numerical values against these variables in an array whose indices are derived from the variable name.
For example, QW1I1K5 has the value 11. This must be stored in the array element Q[1,1,5] = 11 (the index set [1,1,5] corresponds to W1, I1, K5).
Maybe this helps:
Q <- array(dat$Col2, dim=c(15,4,30))
dat$Col2[dat$Col1=='QW1I1K5']
#[1] 34
Q[1,1,5]
#[1] 34
dat$Col2[dat$Col1=='QW4I3K8']
#[1] 38
Q[4,3,8]
#[1] 38
If you want the indices along with the values:
library(reshape2)
d1 <- melt(Q)
head(d1,3)
# Var1 Var2 Var3 value
#1 1 1 1 12
#2 2 1 1 9
#3 3 1 1 29
Q[1,1,1]
#[1] 12
Q[3,1,1]
#[1] 29
Update
Suppose your data is in the order you described in the comments; call it dat1:
indx <- read.table(text=gsub('[^0-9]+', ' ', dat1$Col1), header=FALSE)
dat2 <- dat1[do.call(order, indx[,3:1]),]
Q1 <- array(dat2$Col2,dim=c(15,4,30))
Q1[1,1,2]
#[1] 20
dat2$Col2[dat2$Col1=='QW1I1K2']
#[1] 20
data
Col1 <- do.call(paste,c(expand.grid('QW', 1:15, 'I', 1:4, 'K',1:30),
list(sep='')))
set.seed(24)
dat <- data.frame(Col1, Col2=sample(1:40, 1800,replace=TRUE))
dat1 <- dat[order(as.numeric(gsub('[^0-9]+', '', dat$Col1))),]
row.names(dat1) <- NULL
I would suggest looking at using "data.table" and setting your key to the split columns. You can use cSplit from my "splitstackshape" package to easily split the column.
Sample Data:
df <- data.frame(
V1 = c("QW1I1K1", "QW1I1K2", "QW1I1K3",
"QW1I1K4", "QW2I1K5", "QW2I3K2"),
V2 = c(15, 20, 5, 6, 7, 9))
df
# V1 V2
# 1 QW1I1K1 15
# 2 QW1I1K2 20
# 3 QW1I1K3 5
# 4 QW1I1K4 6
# 5 QW2I1K5 7
# 6 QW2I3K2 9
Splitting the column:
library(splitstackshape)
out <- cSplit(df, "V1", "[A-Z]+", fixed = FALSE)
setnames(out, c("V2", "W", "I", "K"))
setcolorder(out, c("W", "I", "K", "V2"))
setkey(out, W, I, K)
out
# W I K V2
# 1: 1 1 1 15
# 2: 1 1 2 20
# 3: 1 1 3 5
# 4: 1 1 4 6
# 5: 2 1 5 7
# 6: 2 3 2 9
Extracting rows:
out[J(1, 1, 4)]
# W I K V2
# 1: 1 1 4 6
out[J(2, 3, 2)]
# W I K V2
# 1: 2 3 2 9
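As a cross-language aside (a hypothetical Python sketch, not part of the original R answers), the name-to-index mapping itself is just a regex away: pull the three integers out of each name and use them, 1-based, as array indices.

import re
import numpy as np

Q = np.zeros((15, 4, 30))

def store(name, value, arr):
    # e.g. "QW1I5K20" -> (1, 5, 20)
    w, i, k = map(int, re.match(r'QW(\d+)I(\d+)K(\d+)', name).groups())
    arr[w - 1, i - 1, k - 1] = value  # numpy is 0-based; the names are 1-based

store('QW1I5K20', 11.0, Q)
print(Q[0, 4, 19])  # 11.0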

How do I perform a function on multiple rows of data which are factored by the column they are in, using R?

I have a table in a file with many rows which I have read into R using
data <- read.table("path/to/data.txt", header=TRUE, sep="\t", row.names=1)
A1 A2 A3 B1 B2 B3
Row1 1 3 2 3 2 6
Row2 3 2 1 3 6 7
...
I have then read this into a frame using
df <- data.frame(data)
I would like to apply a function to compare the A samples against the B samples for each row,
function(A, B)
but I am unsure how to select only the A's and only the B's from the data frame for each row. Is there a way to do this all at once for the whole table? Do I have to read the data into a separate frame, or can I work straight from the initial read.table result?
Note that read.table() already returns a data frame, so the data.frame(data) call is redundant. Try this:
set.seed(001) # Generating some data
DF <- data.frame(A1=sample(1:9, 10, T),
                 A2=sample(1:9, 10, T),
                 A3=sample(1:9, 10, T),
                 B1=sample(1:9, 10, T),
                 B2=sample(1:9, 10, T),
                 B3=sample(1:9, 10, T))
sampA <- DF[, grep('A', names(DF))] # Sample with the A columns
sampB <- DF[, grep('B', names(DF))] # Sample with the B columns
lapply(1:nrow(DF), function(i){
    wilcox.test(as.numeric(sampA[i,]), as.numeric(sampB[i,]), exact=FALSE)
}) # Performing the test row by row
The result looks like this:
[[1]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 3, p-value = 0.6579
alternative hypothesis: true location shift is not equal to 0
[[2]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 0, p-value = 0.0722
alternative hypothesis: true location shift is not equal to 0
[[3]]
Wilcoxon rank sum test with continuity correction
data: as.numeric(sampA[i, ]) and as.numeric(sampB[i, ])
W = 6, p-value = 0.6579
alternative hypothesis: true location shift is not equal to 0
Only the first 3 results are shown here; the complete list has length 10, since DF has 10 rows.
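A rough Python equivalent of the same row-wise comparison (a sketch; scipy's mannwhitneyu is the two-sample Wilcoxon rank-sum test, the counterpart of R's wilcox.test here): select the A and B columns by name and test row by row.

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 10, size=(10, 6)),
                  columns=['A1', 'A2', 'A3', 'B1', 'B2', 'B3'])

sampA = df.filter(like='A')  # columns whose names contain "A"
sampB = df.filter(like='B')  # columns whose names contain "B"

results = [mannwhitneyu(sampA.iloc[i], sampB.iloc[i])
           for i in range(len(df))]
print(results[0])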
