I have a pandas dataframe with a column that looks like this:
'Column name'
NaN
[11am-2am]
NaN
[9am-10pm]
NaN
[10:30am-10:30pm]
I am trying to put all rows in the same format, such as [10:30am-10:00pm]
working_hours_daily=schedule['Daily']  # column name is 'Daily'
c=lambda x: str(x)
b=lambda x: str(x).replace('-',',').replace('am',':00am').replace('pm',':00pm').split(',')
times_daily.apply(c)
open_hours_daily=[]
for i in (range(0,len(times_daily))):
    if ":" not in times_daily:
        working_hours_daily=times_daily.apply(b)
        print (working_hours_daily)
        open_hours_daily.append(working_hours_daily)
The idea is to apply b only when ":" is not in the string, which is why I am using the not in syntax.
But the code does not respect that condition and applies b to all rows.
Some rows turn out fine: [['11:00am, 2:00am']]
but others, which already contain ':', turn out like this: [['10:30:00am, 10:30:00pm']]
Any help would be much appreciated.
Camille
This should work:
# Python strings have no .contains method, so test membership with "in"
b = lambda x: str(x) if ':' in str(x) else str(x).replace('-',',').replace('am',':00am').replace('pm',':00pm').split(',')
times_daily.apply(b)
If you could please post a sample dataset, that would be great, so I can debug this code.
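As far as I can tell, the condition in the question never looks at the individual strings: ":" not in times_daily tests membership against the Series as a whole (its index labels), so it is true on every pass and b is applied to every row. Below is a minimal sketch of a per-value alternative using only the standard library. It assumes each non-missing cell is a plain string such as '[11am-2am]'; the names normalize_hours and BARE_HOUR are made up for illustration.

```python
import math
import re

# An hour with no minutes part, such as '11am'. The lookbehind skips the
# '30am' in '10:30am' and the second digit of a two-digit hour.
BARE_HOUR = re.compile(r"(?<![:\d])(\d{1,2})(am|pm)")


def normalize_hours(value):
    """Add ':00' to every time that lacks minutes; leave NaN untouched."""
    if isinstance(value, float) and math.isnan(value):
        return value
    return BARE_HOUR.sub(r"\1:00\2", str(value))


# Sample values copied from the question (assumed to be plain strings).
rows = [float("nan"), "[11am-2am]", "[9am-10pm]", "[10:30am-10:30pm]"]
normalized = [normalize_hours(row) for row in rows]
print(normalized[1:])
```

With pandas, the same function could be applied row by row with times_daily.apply(normalize_hours), which avoids the loop and the explicit ":" check entirely.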
Hi everyone,
I have 2 tables: the 3rd column of Table 1 is Value 1 and the 3rd column of Table 2 is Value 2. I combined the 2 tables by first expanding both so that all the columns are aligned, as shown in the screenshot above (Column E to Column H).
The formulas in the yellow cells are:
Cell E4 : =QUERY(A4:C10,"Select A,B,C,' ' label ' ' 'Value 2' ")
Cell E12 : =QUERY(A12:C20,"Select A,B,' ',C label ' ' 'Value 1' ")
Cell K7 : =QUERY({E5:H10;E13:H17},"Select * where Col1 is not null",0)
Cell P7 : =ArrayFormula(IF(ISBLANK(M7:M12),100,M7:M12))
In column P, I want to return 100 as Value 1 if the cell in Column M is blank. So I should get 2, 34, 55, 100, 100, 100 in column P, but right now the formula still returns 3 blank cells.
I suspect that is because the QUERY function I used earlier makes the cell non-blank even though it still looks blank. Is there any trick I can use to find the blank cells in column M and column N (preferably without touching the QUERY formula), since ISBLANK() is not working in this case?
Any help or advice will be greatly appreciated!
That makes sense. You can't use ISBLANK because the cell is not blank: remember that QUERY inserted a space.
My first attempt was:
=ARRAYFORMULA(IF(ISBLANK(TRIM(M7:M12)), 100, M7:M12))
but ISBLANK is so strict that it does not treat even the empty string left by TRIM as blank.
Update:
=ARRAYFORMULA(IF(TRIM(M7:M12)="", 100, M7:M12))
My data looks like this. I want a bar chart showing the % of rows where the Y/N? column contains "Y" and, of those rows, the percentage where the % column contains a value of 7 or above and the percentage where it contains a value below 7.
Something like...
IF [C2:C19] = Y, THEN COUNTIF [B2:B19] >=7
and
IF [C2:C19] = Y, THEN COUNTIF [B2:B19] <7
Sorry if this is unclear or an obvious question!
One way to do this is to create three different measures: first a COUNTROWS with a filter on "Y", and then similar measures for the other two counts. You can then bring these values into your bar chart.
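To make the three counts concrete, here is a rough sketch of the arithmetic in plain Python rather than as measures. The rows are invented for illustration, since the actual data was only shown as an image, and the variable names are hypothetical.

```python
# Hypothetical rows: (value from the "%" column, flag from the "Y/N?" column).
rows = [(9, "Y"), (4, "Y"), (7, "N"), (8, "Y"), (2, "N"), (6, "Y")]

# First count: the rows flagged "Y".
yes_values = [value for value, flag in rows if flag == "Y"]

# Second and third counts: "Y" rows at 7 or above, and "Y" rows below 7.
high = sum(1 for value in yes_values if value >= 7)
low = len(yes_values) - high

# Shares to plot: "Y" rows out of all rows, then each bucket out of the "Y" rows.
share_yes = len(yes_values) / len(rows)
share_high = high / len(yes_values)
share_low = low / len(yes_values)
print(share_yes, share_high, share_low)
```

The two bucket shares are taken relative to the "Y" rows only, which matches the "of those rows" wording in the question.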
I have the code below, which takes 24 columns (hours) of data and combines them into a single column for each row in a data frame:
# Adds all of the values into column twentyfourhours with "," as the separator.
agg_bluetooth_data$twentyfourhours <- paste(agg_bluetooth_data[,1],
agg_bluetooth_data[,2], agg_bluetooth_data[,3], agg_bluetooth_data[,4],
agg_bluetooth_data[,5], agg_bluetooth_data[,6], agg_bluetooth_data[,7],
agg_bluetooth_data[,8], agg_bluetooth_data[,9], agg_bluetooth_data[,10],
agg_bluetooth_data[,11], agg_bluetooth_data[,12], agg_bluetooth_data[,13],
agg_bluetooth_data[,14], agg_bluetooth_data[,15], agg_bluetooth_data[,16],
agg_bluetooth_data[,17], agg_bluetooth_data[,18], agg_bluetooth_data[,19],
agg_bluetooth_data[,20], agg_bluetooth_data[,21], agg_bluetooth_data[,22],
agg_bluetooth_data[,23], agg_bluetooth_data[,24], sep=",")
However, after this I still have to write more lines of code to remove spaces, add brackets around the result, and delete the original columns. None of this is difficult to do, but I feel like there should be shorter, cleaner code to get the results I am looking for. Does anyone have any suggestions?
There is a built-in function to do rowSums. It looks like you want an analogous rowPaste function. We can do this with apply:
# create example dataset
df <- data.frame(
v=1:10,
x=letters[1:10],
y=letters[6:15],
z=letters[11:20],
stringsAsFactors = FALSE
)
# rowPaste columns 2 through 4
apply(df[, 2:4], 1, paste, collapse=",")
Another option, using @Dan Y's data (it might be helpful if you posted a subset of your data using dput, though).
library(tidyr)
library(dplyr)
df %>%
unite('new_col', v, x, y, z, sep = ',')
    new_col
1   1,a,f,k
2   2,b,g,l
3   3,c,h,m
4   4,d,i,n
5   5,e,j,o
6   6,f,k,p
7   7,g,l,q
8   8,h,m,r
9   9,i,n,s
10  10,j,o,t
You can then perform the necessary edits with mutate. There's also a fair amount of flexibility in the column selections within the unite call. Check out the "Useful Functions" section of the select documentation.
I have a dataset like this
id category value
1 A NaN
2 B NaN
3 A 10.5
5 A 2.0
6 B 1.0
I want to fill the NaN values with the mean of their respective category, as shown below:
id category value
1 A 4.16
2 B 0.5
3 A 10.5
5 A 2.0
6 B 1.0
I first tried to calculate the mean value of each category using groupBy:
val df2 = dataFrame.groupBy(category).agg(mean(value)).rdd.map{
case r:Row => (r.getAs[String](category),r.get(1))
}.collect().toMap
println(df2)
I got a map of each category and its mean value. Output: Map(A -> 4.16, B -> 0.5)
Next I tried an update query in Spark SQL to fill the column, but it seems Spark SQL doesn't support update queries. I also tried to fill the null values within the DataFrame, but failed to do so.
What can I do? We can do the same in pandas, as shown in Pandas: How to fill null values with mean of a groupby?
But how can I do it using a Spark DataFrame?
The simplest solution would be to use groupby and join:
val df2 = df.filter(!(isnan($"value"))).groupBy("category").agg(avg($"value").as("avg"))
df.join(df2, "category").withColumn("value", when(col("value").isNaN, $"avg").otherwise($"value")).drop("avg")
Note that a category whose values are all NaN is dropped from the result, because the inner join finds no matching average for it.
Indeed, you cannot update DataFrames, but you can transform them using functions like select and join. In this case, you can keep the grouping result as a DataFrame and join it (on category column) to the original one, then perform the mapping that would replace NaNs with the mean values:
import org.apache.spark.sql.functions._
import spark.implicits._
// calculate mean per category, skipping NaN rows so they do not turn the mean into NaN:
val meanPerCategory = dataFrame
  .filter(!isnan($"value"))
  .groupBy("category")
  .agg(mean("value") as "mean")
// use join, select and "nanvl" function to replace NaNs with the mean values:
val result = dataFrame
  .join(meanPerCategory, "category")
  .select($"category", $"id", nanvl($"value", $"mean") as "value")
// show() returns Unit, so call it separately to keep result as a DataFrame
result.show()
I stumbled upon the same problem and came across this post, but I tried a different solution using window functions. The code below was tested on PySpark 2.4.3 (window functions have been available since Spark 1.4). I believe this is a slightly cleaner solution.
This post is quite old, but I hope this answer will be helpful to others.
from pyspark.sql import Window
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,"A", None), (2,"B", None), (3,"A",10.5), (5,"A",2.0), (6,"B",1.0)], ['id', 'category', 'value'])
category_window = Window.partitionBy("category")
value_mean = mean("value0").over(category_window)
result = df\
.withColumn("value0", coalesce("value", lit(0)))\
.withColumn("value_mean", value_mean)\
.withColumn("new_value", coalesce("value", "value_mean"))\
.select("id", "category", "new_value")
result.show()
Output will be as expected (in question):
id category new_value
1 A 4.166666666666667
2 B 0.5
3 A 10.5
5 A 2
6 B 1
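One detail worth spelling out: the expected values in the question (4.16 for A, 0.5 for B) only come out if the missing rows are counted as 0 in the average, which is what the value0 column above does. Averaging only the non-missing rows, as the filter-based answer does, gives 6.25 and 1.0 instead. The plain-Python sketch below reproduces both calculations on the question's data; category_means and fill_missing are hypothetical helper names, and None stands in for the missing values.

```python
from collections import defaultdict

# (id, category, value) rows from the question; None marks a missing value.
rows = [(1, "A", None), (2, "B", None), (3, "A", 10.5), (5, "A", 2.0), (6, "B", 1.0)]


def category_means(rows, missing_as_zero):
    """Mean per category, either counting missing values as 0 or skipping them."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _id, category, value in rows:
        if value is None:
            if not missing_as_zero:
                continue
            value = 0.0
        totals[category] += value
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


def fill_missing(rows, means):
    """Replace each missing value with the mean of its category."""
    return [(i, c, means[c] if v is None else v) for i, c, v in rows]


zero_filled = category_means(rows, missing_as_zero=True)
excluded = category_means(rows, missing_as_zero=False)
print(zero_filled)
print(excluded)
print(fill_missing(rows, zero_filled))
```

Which of the two means is wanted depends on the use case, so it is worth checking before choosing between the answers.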
I have a Google sheet with fixed number of columns and dynamic rows.
I would like to use COUNTA to count the fields with a value (non-blank) in the current row.
I found a formula here, but I don't understand it and can't get it to work.
ArrayFormula(MMULT( LEN(A1:E)>0 ; TRANSPOSE(SIGN(COLUMN(A1:E1)))))
Sheets gives me the error: "Function MMULT parameter 1 expects number values. But 'TRUE' is a boolean and cannot be coerced to a number."
The formula should work if you convert the booleans (true or false) returned by LEN(A1:E)>0 to numbers (1 or 0), as Barry already mentioned. This can be done quite easily by wrapping the output of the LEN()-function in an N-function or by preceding it with '--'. So, assuming your data starts in row 2, see if this works:
=ArrayFormula(MMULT( --(LEN(A2:E)>0) , TRANSPOSE(COLUMN(A2:E2)^0)))
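Since the question mentions not understanding the formula, here is a small Python sketch of what the MMULT trick computes. The first argument becomes a matrix of 1s and 0s (cell filled or not), the second is a column of ones (any column number raised to the power 0 is 1), and their matrix product is one count per row. The cell contents and the mmult helper are made up for illustration.

```python
# Hypothetical sheet contents for columns A to E; '' stands for an empty cell.
cells = [
    ["a", "", "c", "", "e"],
    ["", "", "", "", ""],
    ["x", "y", "z", "w", "v"],
]

# --(LEN(A2:E)>0): 1 where a cell has content, 0 otherwise.
flags = [[1 if len(str(cell)) > 0 else 0 for cell in row] for row in cells]

# TRANSPOSE(COLUMN(A2:E2)^0): a column vector of ones, one entry per column.
ones = [[1] for _ in range(len(cells[0]))]


def mmult(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# One row per sheet row, holding the number of non-empty cells in that row.
counts = mmult(flags, ones)
print(counts)
```

Multiplying by a column of ones is just a way to sum each row inside a single array formula, which is why the booleans have to be converted to numbers first.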
An alternative way would be to use COUNTIF()
=ArrayFormula(COUNTIF(IF(A2:E<>"", row(A2:A),),row(A2:A)))
and probably even a combination should work:
=ArrayFormula(MMULT( --(A2:E<>"") , TRANSPOSE(COLUMN(A2:E2)^0)))
If you also want to include a header row, try:
=ArrayFormula(if(row(A:A)=1, "Header", MMULT( --(LEN(A:E)>0) , TRANSPOSE(COLUMN(A1:E1)^0))))
or
=ArrayFormula(if(row(A:A)=1, "Header", MMULT( --(A:E<>"") , TRANSPOSE(COLUMN(A1:E1)^0))))
or
=ArrayFormula(if(row(A:A)=1, "Header", COUNTIF(IF(not(isblank(A:E)), row(A:A),),row(A:A))))
EDIT: (after new question in comments)
If you want to sum the values, you can do that with MMULT() too:
=ArrayFormula(if(row(A:A)=1, "Header", MMULT(if(A1:E<>"", A1:E,0), transpose(column(A1:E1)^0))))
or using sumif:
=ArrayFormula(if(row(A:A)=1, "Header", sumif(IF(COLUMN(A1:E1),ROW(A1:A)),ROW(A1:A),A1:E)))
NOTE: if you want to limit the output to let's say the last row that has values in col A, try:
=ArrayFormula(if(row(A:A)=1, "Header", IF(LEN(A1:A), MMULT(if(A1:E<>"", A1:E,0), transpose(column(A1:E1)^0)),)))
or, again with sumif()
=ArrayFormula(if(row(A:A)=1, "Header", if(len(A1:A), sumif(IF(COLUMN(A1:E1),ROW(A1:A)),ROW(A1:A),A1:E),)))
That formula seems a little complex for what you describe. Can't you just use this formula, copied down?
=COUNTA(A1:E1)
...but specifically addressing your question, you need to change this part
LEN(A1:E)>0
...so that it returns numbers - try
IF(LEN(A1:E)>0;1;0)