pyspark: remove empty values from all columns and replace them with null - loops

I am trying to clean missing values out of my dataset.
The rows contain values like this:
ID   A       B
1            324
2    Breda
3    null    34556
I would like to see null in A1 and B2 (and so on) without cleaning the data column by column; I would like to loop over all columns without specifying the column names.
I have found this code, but the last line returns an error. My table name is custom:
def replaceEmptyCols(columns: Array[String]): Array[Column] = {
  columns.map(c => {
    when(col(c) == "", null).otherwise(col(c)).alias(c)
  })
}
custom.select(replaceEmptyCols(custom.columns):_*).show()
The error is:
SyntaxError: invalid syntax (, line 6)
File "<command-447346330485202>", line 6
custom.select(replaceEmptyCols(custom.columns):_*).show()
^
SyntaxError: invalid syntax

Maybe you are looking for something like this?
custom = spark.createDataFrame(
    [('1', '', '324'),
     ('2', 'Breda', ''),
     ('3', None, '34556')],
    ['ID', 'A', 'B']
)
custom.show()
# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1|     |  324|
# |  2|Breda|     |
# |  3| null|34556|
# +---+-----+-----+
import pyspark.sql.functions as F

def replaceEmptyCols(df, columns: list):
    for c in columns:
        # empty strings and nulls both become real nulls
        df = df.withColumn(
            c,
            F.when((F.col(c) == '') | F.col(c).isNull(), F.lit(None)).otherwise(F.col(c))
        )
    return df
replaceEmptyCols(custom, [c for c in custom.columns if c not in ['ID']]).show()
# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1| null|  324|
# |  2|Breda| null|
# |  3| null|34556|
# +---+-----+-----+
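As a side note, the Scala varargs pattern select(cols: _*) that the question attempts has a direct PySpark analogue: build a list of column expressions and unpack it with *. A minimal sketch of that single-select variant, reusing the custom dataframe above (the helper name replace_empty_cols is just for illustration):
import pyspark.sql.functions as F

def replace_empty_cols(df, keep=('ID',)):
    # one expression per column: '' or null becomes a real null, everything else is kept
    exprs = [
        F.col(c) if c in keep
        else F.when((F.col(c) == '') | F.col(c).isNull(), F.lit(None)).otherwise(F.col(c)).alias(c)
        for c in df.columns
    ]
    # Python's * unpacking plays the role of Scala's `: _*`
    return df.select(*exprs)

replace_empty_cols(custom).show()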

Related

Extract value from complex array of map type to string

I have a dataframe like below.
No   comp_value
1    [[ -> 10]]
2    [[ -> 35]]
The schema of the comp_value column is:
comp_value: array (nullable = true)
 |-- element: map (containsNull = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
I would like to convert the comp_value from complex type to string using PySpark. Is there a way to achieve this?
Expected output:
No   comp_value
1    10
2    35
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, ' [[ -> 10]]'),
     (2, '[[ -> 35]]')],
    ['No', 'v'])
df.show()
Replace the square brackets, trim the surrounding whitespace, split on whitespace to get a list, and take the element you want by slicing the list:
new = df.withColumn('comp_value', F.split(F.trim(F.regexp_replace('v', r'\[|\]', '')), r'\s')[1])
new.show()
+---+-----------+----------+
| No|          v|comp_value|
+---+-----------+----------+
|  1| [[ -> 10]]|        10|
|  2| [[ -> 35]]|        35|
+---+-----------+----------+
I will assume your data looks like this:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 10),
     (2, 35)],
    ['No', 'v'])
df = df.select('No', F.array(F.create_map(F.lit(''), 'v')).alias('comp_value'))
df.show()
# +---+----------+
# |No |comp_value|
# +---+----------+
# |1 |[{ -> 10}]|
# |2 |[{ -> 35}]|
# +---+----------+
You can extract values inside the array by referencing them with an index (in this case [0]), and you can extract values from the map by referencing its key (in this case ['']).
df2 = df.select('No', F.col('comp_value')[0][''].cast('string').alias('comp_value'))
df2.show()
# +---+----------+
# |No |comp_value|
# +---+----------+
# | 1| 10|
# | 2| 35|
# +---+----------+
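For what it's worth, here is a small sketch of an alternative that does not depend on knowing the map key (assuming Spark 2.3+, where map_values is available), applied to the df built above; df3 is just an illustrative name:
# comp_value[0] is the first (and only) map in the array;
# map_values returns its values as an array, and [0] takes the first one
df3 = df.select(
    'No',
    F.map_values(F.col('comp_value')[0])[0].cast('string').alias('comp_value')
)
df3.show()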

Spark: GroupBy and collect_list while filtering by another column

I have the following dataframe
+-----+-----+------+
|group|label|active|
+-----+-----+------+
|    a|    1|     y|
|    a|    2|     y|
|    a|    1|     n|
|    b|    1|     y|
|    b|    1|     n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" values, while filtering on the value in the active column.
The expected result would be
+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a    | [1,2]   | [1]     | [2]      |
|b    | [1]     | [1]     | []       |
+-----+---------+---------+----------+
I can easily filter for the "y" label with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't know whether it's possible to aggregate and filter simultaneously, or how to get the difference of the two sets.
You can do a pivot, and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label")).withColumn(
  "difference",
  array_union(
    array_except(col("n"), col("y")),
    array_except(col("y"), col("n"))
  )
)
df2.show
+-----+---+------+----------+
|group|  n|     y|difference|
+-----+---+------+----------+
|    b|[1]|   [1]|        []|
|    a|[1]|[1, 2]|       [2]|
+-----+---+------+----------+
Thanks @mck for the help. I have found an alternative way to solve the question, namely filtering with when during the aggregation:
df
  .groupBy("group")
  .agg(
    collect_set(when($"active" === "y", $"label")).as("labelyes"),
    collect_set(when($"active" === "n", $"label")).as("labelno")
  )
  .withColumn("diff", array_except($"labelyes", $"labelno"))

Compare rows of an array column with the headers of another data frame using Scala and Spark

I am using Scala and Spark.
I have two data frames.
The first one is like following:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second one the data frame headers are
num1, num2, a, b, c, d
I have created a case class by adding all the possible header columns.
Now, by matching on the num1 and num2 columns, I want to check whether the array in the arr column contains each of the header columns of the second data frame.
If it does, the value should be 1, otherwise 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, that would contain whether or not the array contains that value.
If so, you can use the array_contains function like this:
val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")
df
  .select(Seq("num1", "num2").map(col) ++
    values.map(x => array_contains('arr, x).cast("int") as x): _*)
  .show
+----+----+---+---+---+---+
|num1|num2|  a|  b|  c|  d|
+----+----+---+---+---+---+
|  25|  10|  1|  0|  1|  0|
|  35|  15|  1|  1|  0|  1|
+----+----+---+---+---+---+
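The same idea translates to PySpark; a sketch (the boolean returned by array_contains is cast to int to get the 1/0 columns):
from pyspark.sql import functions as F

values = ['a', 'b', 'c', 'd']
df = spark.createDataFrame(
    [(25, 10, ['a', 'c']), (35, 15, ['a', 'b', 'd'])],
    ['num1', 'num2', 'arr'])

df.select('num1', 'num2',
          *[F.array_contains('arr', v).cast('int').alias(v) for v in values]).show()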

How to convert a row (array of strings) to a dataframe column [duplicate]

I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus', max_dist=50), Row(value=u'normal bus', max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20), Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 & 4) into multiple rows while retaining the 'id', to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+
Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title"))
## +---+----------------+
## | id|           title|
## +---+----------------+
## |  1|     [1000,cars]|
## |  2|  [50,horse bus]|
## |  2|[100,normal bus]|
## |  3| [5000,Airplane]|
## |  4|   [20,Bicycles]|
## |  4| [80,Motorbikes]|
## |  5|      [15,Trams]|
## +---+----------------+
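One caveat worth adding to this answer: explode drops rows whose array is null or empty. If those ids need to be kept, explode_outer (available in PySpark since roughly the 2.3 releases) behaves the same way but preserves them with a null title:
from pyspark.sql.functions import explode_outer

documents.withColumn("title", explode_outer("title"))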
Ok, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects because I couldn't find a way to append to a Row object.
That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
def add_id(row):
    it_list = []
    for i in range(0, len(row[1])):
        sm_list = []
        for j in row[1][i]:
            sm_list.append(j)
        sm_list.append(row[0])
        it_list.append(sm_list)
    return it_list

# DataFrame.flatMap only exists in old Spark versions; going through .rdd works everywhere
with_id = documents.rdd.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
When I run df.show(), I get:
+---+----------------+
| id|           title|
+---+----------------+
|  1|     [cars,1000]|
|  2|  [horse bus,50]|
|  2|[normal bus,100]|
|  3| [Airplane,5000]|
|  4|   [Bicycles,20]|
|  4| [Motorbikes,80]|
|  5|      [Trams,15]|
+---+----------------+
I am using the Spark Dataset API, and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID","explode(finished_chunk) as chunks");
Note: the explode method of the Dataset API is deprecated as of Spark 2.4.5, and the documentation suggests using select (shown above) or flatMap instead.

Apache PySpark: How to create a column with an array containing n elements

I have a dataframe with one column of type integer.
I want to create a new column containing an array with n elements (n being the number from the first column).
For example:
from pyspark.sql.types import StructType, StructField, IntegerType

x = spark.createDataFrame([(1,), (2,)], StructType([StructField("myInt", IntegerType(), True)]))
+-----+
|myInt|
+-----+
|    1|
|    2|
|    3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+
Note, It doesn't actually matter what the values inside of the arrays are, it's just the count that matters.
It'd be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt|             myArr|
+-----+------------------+
|    1|            [item]|
|    2|      [item, item]|
|    3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+
Use udf:
from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+------+
# |myInt| myArr|
# +-----+------+
# |    1|   [1]|
# |    2|[2, 2]|
# +-----+------+
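If array_repeat cannot take a column for the count on your Spark version (as far as I recall, it only accepts a column in newer releases), sequence (Spark 2.4+) is another UDF-free option; since only the element count matters, an array like [1, ..., n] works just as well:
import pyspark.sql.functions as F

# myInt = 1 -> [1], myInt = 2 -> [1, 2], myInt = 3 -> [1, 2, 3]
x.withColumn('myArr', F.sequence(F.lit(1), F.col('myInt'))).show()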
