Replace Null Values of a Column with Mean of Another Categorical Column in Spark DataFrame - database

I have a dataset like this
id category value
1 A NaN
2 B NaN
3 A 10.5
5 A 2.0
6 B 1.0
I want to fill the NaN values with the mean of their respective category, as shown below:
id category value
1 A 4.16
2 B 0.5
3 A 10.5
5 A 2.0
6 B 1.0
First, I tried to calculate the mean value of each category using groupBy:
val df2 = dataFrame.groupBy(category).agg(mean(value)).rdd.map{
case r:Row => (r.getAs[String](category),r.get(1))
}.collect().toMap
println(df2)
I got a map of each category and its respective mean value. Output: Map(A -> 4.16, B -> 0.5)
Now I tried an UPDATE query in Spark SQL to fill the column, but it seems Spark SQL doesn't support UPDATE queries. I also tried to fill the null values within the DataFrame directly, but failed to do so.
What can I do? The same can be done in pandas, as shown in Pandas: How to fill null values with mean of a groupby?
But how can I do it using a Spark DataFrame?

The simplest solution would be to use groupby and join:
val df2 = df.filter(!(isnan($"value"))).groupBy("category").agg(avg($"value").as("avg"))
df.join(df2, "category").withColumn("value", when(col("value").isNaN, $"avg").otherwise($"value")).drop("avg")
Note that if there is a category whose values are all NaN, it will be removed from the result.
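If you need to keep such categories rather than drop them, a left join preserves their rows; since they have no computed average, you can fall back to a default. Here is a minimal PySpark sketch of the same groupby-and-join idea, assuming a DataFrame df with "category" and "value" columns and using 0.0 as a hypothetical fallback:
from pyspark.sql import functions as F

avgs = df.filter(~F.isnan("value")).groupBy("category").agg(F.avg("value").alias("avg"))
# A left join keeps categories whose values are all NaN; their "avg" is null,
# so coalesce falls back to the chosen default instead of dropping the rows.
filled = (df.join(avgs, "category", "left")
            .withColumn("value",
                        F.when(F.isnan("value"), F.coalesce(F.col("avg"), F.lit(0.0)))
                         .otherwise(F.col("value")))
            .drop("avg"))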

Indeed, you cannot update DataFrames, but you can transform them using functions like select and join. In this case, you can keep the grouping result as a DataFrame and join it (on category column) to the original one, then perform the mapping that would replace NaNs with the mean values:
import org.apache.spark.sql.functions._
import spark.implicits._
// calculate mean per category:
val meanPerCategory = dataFrame.groupBy("category").agg(mean("value") as "mean")
// use join, select and "nanvl" function to replace NaNs with the mean values:
val result = dataFrame
  .join(meanPerCategory, "category")
  .select($"category", $"id", nanvl($"value", $"mean") as "value")
result.show()

I stumbled upon the same problem and came across this post, but tried a different solution, i.e. using window functions. The code below is tested on PySpark 2.4.3 (window functions are available from Spark 1.4). I believe this is a bit cleaner solution.
This post is quite old, but I hope this answer will be helpful to others.
from pyspark.sql import Window
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,"A", None), (2,"B", None), (3,"A",10.5), (5,"A",2.0), (6,"B",1.0)], ['id', 'category', 'value'])
category_window = Window.partitionBy("category")
# mean of value0 (nulls replaced by 0) per category, so every row counts towards the denominator
value_mean = mean("value0").over(category_window)
result = df\
.withColumn("value0", coalesce("value", lit(0)))\
.withColumn("value_mean", value_mean)\
.withColumn("new_value", coalesce("value", "value_mean"))\
.select("id", "category", "new_value")
result.show()
The output will be as expected (as in the question):
id category new_value
1 A 4.166666666666667
2 B 0.5
3 A 10.5
5 A 2
6 B 1
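Note that this fills nulls with the category sum divided by the total row count (nulls counted as 0), which is what the question's expected output shows (4.1667 for A). If you instead want the mean of only the non-null values (6.25 for A), mean already ignores nulls, so the value0 step can be skipped; a minimal sketch reusing the same df:
from pyspark.sql import Window
from pyspark.sql.functions import mean, coalesce

# mean ignores nulls, so no zero substitution is needed for this variant
nonnull_mean = mean("value").over(Window.partitionBy("category"))
result2 = df.withColumn("new_value", coalesce("value", nonnull_mean)).select("id", "category", "new_value")
result2.show()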

Related

Alternative solutions to an array search in PostgreSQL

I am not sure whether my database design is good for this tricky case, and I am also asking for help with how the query for this could look.
I plan a query with the following table:
search_array | value | id
-----------------------+-------+----
{XYa,YZb,WQb} | b | 1
{XYa,YZb,WQb,RSc,QZa} | a | 2
{XYc,YZa} | c | 3
{XYb} | a | 4
{RSa} | c | 5
There are 5 main elements in the search_array: XY, YZ, WQ, RS, QZ, and 3 values: a, b, c that are concatenated to each element.
Each row has also one value: a, b or c.
My aim is to find all rows that fit a specific row in this sense: first, it should be checked whether they have any main elements in common in their search_arrays (marked yellow in the example).
As an example:
Row id 4 and row id 5 wouldn't match because XY != RS.
Rows id 1, 2 and 3 would match twice, because they all have XY and YZ.
Rows id 1 and 2 would even match three times, because they also have WQ in common.
And second: if there is a main element match, it should be 'cross-checked' whether the lowercase letter after the main element fits the value of the other row.
As an example: the only match for row id 1 in the table would be row id 4, because they both search for XY and the lowercase letters after the elements match the values of the two rows.
Another match would be rows id 2 and 5, with RS, search c to value c and search a to value a (marked green and orange).
My idea was to cut the search_array elements into two parts in the query with the RIGHT and LEFT string functions, but I don't know how to combine the subqueries for this search.
Or would a completely different solution be faster? Like splitting the search array into another table with the columns 'foreign key' to the main table, 'main element' and 'searched_value'. I am not sure whether this is the best solution, because the program would constantly have to switch to the main table to find two rows out of 3 million rows and compare their searched_values to the values.
Thank you very much for your answers and your time!
You'll have to represent the data in a normalized fashion. I'll do it in a WITH clause, but it would be better to store the data in this fashion to begin with.
WITH unravel AS (
SELECT t.id, t.value,
substr(u.val, 1, 2) AS arr_main,
substr(u.val, 3, 1) AS arr_val
FROM mytable AS t
CROSS JOIN LATERAL unnest(t.search_array) AS u(val)
)
SELECT a.id AS first_id,
a.value AS first_value,
b.id AS second_id,
b.value AS second_value,
a.arr_main AS main_element
FROM unravel AS a
JOIN unravel AS b
ON a.arr_main = b.arr_main
AND a.arr_val = b.value
AND b.arr_val = a.value;

Is there a more effective way to combine 24 columns into a single column as an array in R

I have the code below, which works to take 24 columns (hours) of data and combine them into a single column array for each row in a dataframe:
# Adds all of the values into column twentyfourhours with "," as the separator.
agg_bluetooth_data$twentyfourhours <- paste(agg_bluetooth_data[,1],
agg_bluetooth_data[,2], agg_bluetooth_data[,3], agg_bluetooth_data[,4],
agg_bluetooth_data[,5], agg_bluetooth_data[,6], agg_bluetooth_data[,7],
agg_bluetooth_data[,8], agg_bluetooth_data[,9], agg_bluetooth_data[,10],
agg_bluetooth_data[,11], agg_bluetooth_data[,12], agg_bluetooth_data[,13],
agg_bluetooth_data[,14], agg_bluetooth_data[,15], agg_bluetooth_data[,16],
agg_bluetooth_data[,17], agg_bluetooth_data[,18], agg_bluetooth_data[,19],
agg_bluetooth_data[,20], agg_bluetooth_data[,21], agg_bluetooth_data[,22],
agg_bluetooth_data[,23], agg_bluetooth_data[,24], sep=",")
However, after this I still have to write more lines of code to remove spaces, add brackets around it, and delete the columns. None of this is difficult to do, but I feel like there should be a shorter/cleaner code to use to get the results I am looking for. Does anyone have any suggestions?
There is a built-in rowSums function. It looks like you want an analogous rowPaste function. We can do this with apply:
# create example dataset
df <- data.frame(
v=1:10,
x=letters[1:10],
y=letters[6:15],
z=letters[11:20],
stringsAsFactors = FALSE
)
# rowPaste columns 2 through 4
apply(df[, 2:4], 1, paste, collapse=",")
Another option, using @Dan Y's data (it might be helpful if you posted a subset of your data using dput, though).
library(tidyr)
library(dplyr)
df %>%
unite('new_col', v, x, y, z, sep = ',')
new_col
1 1,a,f,k
2 2,b,g,l
3 3,c,h,m
4 4,d,i,n
5 5,e,j,o
6 6,f,k,p
7 7,g,l,q
8 8,h,m,r
9 9,i,n,s
10 10,j,o,t
You can then perform the necessary edits with mutate. There's also a fair amount of flexibility in the column selections within the unite call. Check out the "Useful Functions" section of the select documentation.

Counting rows in a table based on multiple array criteria

I am trying to count rows in a table based on multiple criteria in different columns of that table. The criteria are not directly in the formula though; they are arrays which I would like to refer to (and not list them in the formula directly).
Range table example:
Group Status
a 1
b 4
b 6
c 4
a 6
d 5
d 4
a 2
b 2
d 3
b 2
c 1
c 2
c 1
a 4
b 3
Criteria/arrays example:
group
a
b
status
1
2
3
I am able to do this if I only have one array to search through a range (1 column) in that table:
{=SUM(COUNTIFS(data[Group],group[group]))}
Returns "9" as expected (=sum of rows in the group column of the data table which match any values in group[group])
But if I add a second different range and a different array I get an incorrect result:
{=SUM(COUNTIFS(data[Group],group[group], data[Status],status[status]))}
Returns "3" but should return "5" (=sum of rows which have 1, 2 or 3 in the status column and a or b in the group column)
I searched and googled for various ideas related to using sumproduct or defining arrays within the formula instead of classifying the whole formula as an array but I was not able to get expected results via those means.
Thank you for your help.
Because your group and status criteria are a different number of values (2 values for group, but 3 values for status), I'm not sure you can do this in a single formula. Best way I know of to do this would be to use a helper column (which can be hidden if preferred).
Put this array formula in a helper column and copy down the length of your data (array formulas must be confirmed with Ctrl+Shift+Enter):
=AND(OR(data[@Group]=group[group]),OR(data[@Status]=status[status]))
And then get the count with: =COUNTIF(helpercolumn,TRUE)
You could use a slightly different approach, using Power Query / Power Pivot.
Name your tables Data, Group and Status, then create the following query, named Filtered Data:
let
tbData = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
tbGroup = Excel.CurrentWorkbook(){[Name="Group"]}[Content],
tbStatus = Excel.CurrentWorkbook(){[Name="Status"]}[Content],
#"Merged Group" = Table.NestedJoin(tbData,{"Group"},tbGroup,{"Group"},"tbGroup",JoinKind.Inner),
#"Merged Status" = Table.NestedJoin(#"Merged Group",{"Status"},tbStatus,{"Status"},"Merged Status",JoinKind.Inner),
#"Removed Columns" = Table.RemoveColumns(#"Merged Status",{"tbGroup", "Merged Status"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Columns",{{"Status", type number}})
in
#"Changed Type"
Load To as connection only, and tick Load to Data Model
Now create a DAX measure:
Status Sum:=SUM ( 'Filtered Data'[Status] )
You can then use the following formula on your worksheet, to get the Sum of Status values, for rows matching the criteria specified in the Group and Status tables:
=CUBEVALUE("ThisWorkbookDataModel","[Measures].[Status Sum]")
Simply refresh the data connection to update the value.

PySpark replace Null with Array

After a join by ID, my data frame looks as follows:
ID | Features | Vector
1 | (50,[...] | Array[1.1,2.3,...]
2 | (50,[...] | Null
I ended up with Null values for some IDs in the column 'Vector'. I would like to replace these Null values by an array of zeros with 300 dimensions (same format as non-null vector entries). df.fillna does not work here since it's an array I would like to insert. Any idea how to accomplish this in PySpark?
---edit---
Similarly to this post my current approach:
df_joined = id_feat_vec.join(new_vec_df, "id", how="left_outer")
fill_with_vector = udf(lambda x: x if x is not None else np.zeros(300),
ArrayType(DoubleType()))
df_new = df_joined.withColumn("vector", fill_with_vector("vector"))
Unfortunately with little success:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0in stage 848.0 failed 4 times, most recent failure: Lost task 0.3 in stage 848.0 (TID 692199, 10.179.224.107, executor 16): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-193-e55fed27fcd8> in <module>()
5 a = df_joined.withColumn("vector", fill_with_vector("vector"))
6
----> 7 a.show()
/databricks/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
Updated: I couldn't get the SQL expression form to create an array of doubles. 'array(0.0, ...)' appears to create an array of Decimal types. But, using the python functions you can get it to properly create an array of doubles.
The general idea is to use the when/otherwise functions to selectively update only the rows you want. You can define the literal value you want ahead of time as a column and then dump that in the "THEN" clause.
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType([StructField("f1", LongType()), StructField("f2", ArrayType(DoubleType(), False))])
data = [(1, [10.0, 11.0]), (2, None), (3, None)]
df = sqlContext.createDataFrame(sc.parallelize(data), schema)
# Create a column object storing the value you want in the NULL case
num_elements = 300
null_value = array([lit(0.0)] * num_elements)
# If you want a different type you can change it like this
# null_value = null_value.cast('array<float>')
# Keep the value when there is one, replace it when it's null
df2 = df.withColumn('f2', when(df['f2'].isNull(), null_value).otherwise(df['f2']))
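For what it's worth, the same replacement can also be written with coalesce, which returns the first non-null of its arguments; a minimal sketch assuming the same df and null_value defined above:
from pyspark.sql.functions import coalesce

# f2 is kept where present, otherwise the zero-array literal is used
df2 = df.withColumn('f2', coalesce(df['f2'], null_value))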
You could try to make an update request on your dataset with a WHERE clause, replacing every NULL in the Vector column with an array.
Are you using Spark SQL and DataFrames?

Split array into chunks based on timestamp in Haskell

I have an array of records (a custom data type) in Haskell which I want to aggregate based on each record's timestamp. In very general terms, each record looks like this:
data Record = Record { event :: String,
                       time  :: Double,
                       from  :: Int,
                       to    :: Int
                     } deriving (Show, Eq)
I used a Double for the timestamp since that is the same format used in the tracefile.
And I parse them from a CSV file into an array of records: [Record]
Now I'm looking to get an approximation of instantaneous events / time. So I want to split the array into several arrays based on the timestamp (say, every 1 second) and then fold across each smaller array.
The problem is I can't figure out how to split an array based on the value of a record. Looking on Hoogle I found several functions like splitEvery and splitWhen, but I'm lost. I considered using splitWhen to break up the list when, say, (mod time 0.1) == 0, but even if that worked it would remove the elements it's splitting on (which I don't want to do).
I should note that the records are NOT evenly spaced in time. E.g. the timestamp on sequential records is not going to differ by a fixed amount.
I am more than willing to store the data in a different format if you can suggest one that would make this sort of work easier.
A quick sample of the data I'm parsing (from a ns2 simulation):
r 0.114 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.240 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.914 2 1 tcp 1000 ________ 2 5.0 1.0 0 3
If you have [Record] and you want to group them by a specific condition, you can use Data.List.groupBy. I'm assuming that for your time :: Double, 1 second is the base unit, so time = 1 is 1 second, time = 100 is 100 seconds, etc, so adjust this to whatever system you're actually using:
import Data.List
import Data.Function (on)
isInSameClockSecond :: Record -> Record -> Bool
isInSameClockSecond = (==) `on` (floor . time :: Record -> Integer)
-- The type signature is given for floor . time to remove any ambiguity
-- due to floor's polymorphic type signature.
groupBySameClockSecond :: [Record] -> [[Record]]
groupBySameClockSecond = groupBy isInSameClockSecond
