How to convert a row (array of strings) to a dataframe column [duplicate]

I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus', max_dist=50), Row(value=u'normal bus', max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20), Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 and 4) into multiple rows while retaining the id, to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+

Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title")).show()
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
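If you also want the struct fields as ordinary top-level columns after the explode, a small follow-up select does the flattening (a sketch building on the answer above, not part of the original answer):
from pyspark.sql.functions import col, explode

# Explode the array of structs, then pull the struct fields up as plain columns.
flat = (documents
        .withColumn("title", explode("title"))
        .select(col("id"),
                col("title.value").alias("value"),
                col("title.max_dist").alias("max_dist")))
flat.show()  # each row now carries id, value and max_dist as separate columns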

Ok, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects, because I couldn't find a way to append to a Row object.
That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
def add_id(row):
    # For each struct in the title array, copy its fields into a plain list
    # and append the row's id, producing one list per title entry.
    it_list = []
    for i in range(0, len(row[1])):
        sm_list = []
        for j in row[1][i]:
            sm_list.append(j)
        sm_list.append(row[0])
        it_list.append(sm_list)
    return it_list

# On Spark 2.x+ use documents.rdd.flatMap(...), since DataFrames no longer expose flatMap directly.
with_id = documents.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+
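For completeness, the same shape can also be produced without dropping to the RDD API at all; here is a sketch (not part of this answer) that builds on the explode approach above and rebuilds the struct in (value, max_dist) order:
from pyspark.sql.functions import col, explode, struct

# Explode, then rebuild the struct with its fields in (value, max_dist) order
# so the rows render the same way as the output above.
(documents
 .withColumn("title", explode("title"))
 .withColumn("title", struct(col("title.value").alias("value"),
                             col("title.max_dist").alias("max_dist")))
 .show())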

I am using the Spark Dataset API, and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID", "explode(finished_chunk) as chunks");
Note: the explode method of the Dataset API is deprecated as of Spark 2.4.5, and the documentation suggests using select (shown above) or flatMap instead.
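The same select-based form also works from PySpark; a minimal sketch against the documents dataframe from the question:
# selectExpr accepts the same SQL explode expression used in the Dataset API answer.
documents.selectExpr("id", "explode(title) AS title").show(truncate=False)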

Related

pyspark remove empty values from all columns and replace it with null

I am trying to clean missing values from my dataset. The rows contain values like:
| ID | A     | B     |
| -- | ----- | ----- |
| 1  |       | 324   |
| 2  | Breda |       |
| 3  | null  | 34556 |
I would like to see null in A1 and B2, and so on, without cleaning the data column by column. I would like to loop over all the columns without specifying their names.
I have found this code, but the last line returns an error. My table name is custom:
def replaceEmptyCols(columns: Array[String]): Array[Column] = {
  columns.map(c => {
    when(col(c) == "", null).otherwise(col(c)).alias(c)
  })
}
custom.select(replaceEmptyCols(custom.columns): _*).show()
The error is :
SyntaxError: invalid syntax (, line 6)
File "<command-447346330485202>", line 6
custom.select(replaceEmptyCols(custom.columns):_*).show()
^
SyntaxError: invalid syntax
Maybe you are looking for something like this?
custom = spark.createDataFrame(
    [('1', '', '324'),
     ('2', 'Breda', ''),
     ('3', None, '34556')],
    ['ID', 'A', 'B']
)
custom.show()
# +---+-----+-----+
# | ID| A| B|
# +---+-----+-----+
# | 1| | 324|
# | 2|Breda| |
# | 3| null|34556|
# +---+-----+-----+
import pyspark.sql.functions as F
from pyspark.sql.types import *
def replaceEmptyCols(df, columns):
    # Replace empty strings and real nulls with the literal string 'null'.
    for c in columns:
        df = df.withColumn(c, F.when((F.col(c) == '') | F.col(c).isNull(), F.lit('null'))
                              .otherwise(F.col(c)))
    return df
replaceEmptyCols(custom, [c for c in custom.columns if c not in ['ID']]).show()
# +---+-----+-----+
# | ID| A| B|
# +---+-----+-----+
# | 1| null| 324|
# | 2|Breda| null|
# | 3| null|34556|
# +---+-----+-----+
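If you prefer a single select over repeated withColumn calls (closer in spirit to the Scala snippet in the question), a sketch along these lines should give the same result; replace_empty_cols_select is just an illustrative name, and it keeps the answer's behaviour of writing the literal string 'null':
import pyspark.sql.functions as F

def replace_empty_cols_select(df, columns):
    # One expression per column: empty strings and real nulls become the string
    # 'null' for the listed columns; all other columns pass through unchanged.
    exprs = [F.when((F.col(c) == '') | F.col(c).isNull(), F.lit('null'))
              .otherwise(F.col(c)).alias(c) if c in columns else F.col(c)
             for c in df.columns]
    return df.select(*exprs)

replace_empty_cols_select(custom, [c for c in custom.columns if c != 'ID']).show()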

Retrieve an array from a dataframe using Scala/Spark

I have a dataframe, dfFood, with some info that I would like to extract and analyze.
The dataframe looks like this:
|amount| description| food| id| name|period|rule|typeFood|version|
| ---- | ------------ | ------------- | - | --- | ---- | -- | ------ | ----- |
| 100|des rule name2|[Chicken, Fish]| 2|name2| 2022| 2| [1, 2]| 2|
| 55|des rule name3| [Vegetables]| 3|name3| 2022| 3| [3]| 3|
| 13|des rule name4| [Ramen]| 4|name4| 2022| 4| [4]| 4|
I want to read all the rows and analyze some fields in them. I'm using foreach to extract the dataframe info:
dfFood.foreach(row => {
  println(row)
  val food = row(2)
  println(food)
})
That code prints this:
[100,des rule name2,WrappedArray(Chicken, Fish),2,name2,2022,2,WrappedArray(1, 2),2]
WrappedArray(Chicken, Fish)
[55,des rule name3,WrappedArray(Vegetables),3,name3,2022,3,WrappedArray(3),3]
WrappedArray(Vegetables)
[13,des rule name4,WrappedArray(Ramen),4,name4,2022,4,WrappedArray(4),4]
WrappedArray(Ramen)
I'm getting the food field with row(2), but I would like to get it by name, using something like row("food").
I would also like to turn that WrappedArray into a list so I can analyze all the data inside the array.

PySpark concat two columns in order

I would like to concatenate two columns, but in such a way that the values are ordered.
For example, given a dataframe like this:
|-------------------|-----------------|
| column_1 | column_2 |
|-------------------|-----------------|
| aaa | bbb |
|-------------------|-----------------|
| bbb | aaa |
|-------------------|-----------------|
I want the result to be a dataframe like this:
|-------------------|-----------------|-----------------|
| column_1 | column_2 | concated_cols |
|-------------------|-----------------|-----------------|
| aaa | bbb | aaabbb |
|-------------------|-----------------|-----------------|
| bbb | aaa | aaabbb |
|-------------------|-----------------|-----------------|
For Spark >= 2.4:
from pyspark.sql import functions as F
df.withColumn(
"concated_cols",
F.array_join(F.array_sort(F.array(F.col("column_1"), F.col("column_2"))), ""),
).show()
For Spark <= 2.3, with a simple UDF:
from pyspark.sql import functions as F

@F.udf
def concat(*cols):
    return "".join(sorted(cols))

df.withColumn("concated_cols", concat(F.col("column_1"), F.col("column_2"))).show()
+--------+--------+-------------+
|column_1|column_2|concated_cols|
+--------+--------+-------------+
| aaa| bbb| aaabbb|
| bbb| aaa| aaabbb|
+--------+--------+-------------+
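For exactly two columns there is also a UDF-free option that works on older Spark versions: order the pair with least and greatest before concatenating (an alternative sketch, not taken from the answer above):
from pyspark.sql import functions as F

# least/greatest pick the smaller/larger of the two values per row, so the
# concatenation comes out the same regardless of which column holds which value.
df.withColumn(
    "concated_cols",
    F.concat(F.least("column_1", "column_2"), F.greatest("column_1", "column_2")),
).show()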

Compare rows of an array column with the headers of another data frame using Scala and Spark

I am using Scala and Spark.
I have two data frames.
The first one looks like this:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second one, the data frame headers are
num1, num2, a, b, c, d
I have created a case class containing all the possible header columns.
Now, by matching on the columns num1 and num2, I want to check whether
the array in the arr column contains each header of the second data frame.
If it does, the value should be 1, otherwise 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, indicating whether or not the array contains that value.
If so, you can use the array_contains function like this:
import org.apache.spark.sql.functions._
import spark.implicits._  // for .toDF and the 'arr column syntax

val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")

df
  .select(Seq("num1", "num2").map(col) ++
    // cast the boolean to int so the output shows 1/0 instead of true/false
    values.map(x => array_contains('arr, x).cast("int") as x): _*)
  .show
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+

Apache pyspark How to create a column with array containing n elements

I have a dataframe with one column of type integer.
I want to create a new column with an array containing n elements, where n is the number from the first column.
For example:
from pyspark.sql.types import StructType, StructField, IntegerType
x = spark.createDataFrame([(1,), (2,)], StructType([StructField("myInt", IntegerType(), True)]))
+-----+
|myInt|
+-----+
| 1|
| 2|
| 3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Note: it doesn't actually matter what the values inside the arrays are; only the count matters.
It would be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt| myArr|
+-----+------------------+
| 1| [item]|
| 2| [item, item]|
| 3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat (available since Spark 2.4) instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
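The same call can also be written as a SQL expression through selectExpr, which is sometimes convenient inside larger select lists (a sketch equivalent to the line above):
# array_repeat as a SQL expression; the count argument can itself be a column.
x.selectExpr("myInt", "array_repeat(myInt, myInt) AS myArr").show()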
Use a udf:
from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+------+
# |myInt| myArr|
# +-----+------+
# | 1| [1]|
# | 2|[2, 2]|
# +-----+------+
