Retrieve an array from a dataframe using Scala/Spark - arrays

I have a dataframe, dfFood, with some info which I would like to extract and analyze.
The dataframe looks like this:
|amount| description| food| id| name|period|rule|typeFood|version|
| ---- | ------------ | ------------- | - | --- | ---- | -- | ------ | ----- |
| 100|des rule name2|[Chicken, Fish]| 2|name2| 2022| 2| [1, 2]| 2|
| 55|des rule name3| [Vegetables]| 3|name3| 2022| 3| [3]| 3|
| 13|des rule name4| [Ramen]| 4|name4| 2022| 4| [4]| 4|
I want to read all the rows and analyze some fields in those rows. I'm using foreach to extract the dataframe info:
dfFood.foreach(row => {
  println(row)
  val food = row(2)
  println(food)
})
That code prints this:
[100,des rule name2,WrappedArray(Chicken, Fish),2,name2,2022,2,WrappedArray(1, 2),2]
WrappedArray(Chicken, Fish)
[55,des rule name3,WrappedArray(Vegetables),3,name3,2022,3,WrappedArray(3),3]
WrappedArray(Vegetables)
[13,des rule name4,WrappedArray(Ramen),4,name4,2022,4,WrappedArray(4),4]
WrappedArray(Ramen)
I'm getting the food field using row(2), but I would like to get it by name, using something like row("food").
I would also like to transform that WrappedArray into a list, and then analyze all the data inside it.
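For what it's worth, here is a minimal sketch of both steps, assuming the schema shown above (where food is an array<string> column): Row.getAs addresses a field by name, and the returned Seq (a WrappedArray at runtime) converts to a plain List with toList.
import org.apache.spark.sql.Row

dfFood.foreach((row: Row) => {
  // Address the field by name instead of by position; the element type is
  // assumed to be string based on the sample data above.
  val food: List[String] = row.getAs[Seq[String]]("food").toList
  food.foreach(item => println(item))
})
row.fieldIndex("food") is another option if you only want to look up the positional index once and keep the existing row(i) style. Also note that foreach runs on the executors, so on a cluster the println output goes to the executor logs rather than the driver console.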

Related

BigQuery removes fields with Postgres Array during datastream ingestion

I have this table named student_classes:
| id | name | class_ids |
| ----| ---------| -----------|
| 1 | Rebecca | {1,2,3} |
| 2 | Roy | {1,3,4} |
| 3 | Ted | {2,4,5} |
name is type text / string
class_ids is type integer[]
I created a datastream from PostgreSQL to BigQuery (following these instructions), but when I looked at the table's schema in BigQuery the class_ids field was gone and I am not sure why.
I was expecting class_ids would get ingested into BigQuery instead of getting dropped.

How to aggregate column values into array after groupBy?

I want to group by name and collect the color values into an array. I have tried the following, but it didn't help:
val uid = flatten(collect_list($"color")).alias("color")
val df00 = df_a.groupBy($"name")
  .agg(uid)
I have a dataframe with the following values:
---------------
|name |color |
---------------
|gaurav| red |
|harsh |black |
|nitin |yellow|
|gaurav|white |
|harsha|blue |
---------------
I want to group by name and store the color values into an array using Scala, to get a result like this:
----------------------
|name | color |
----------------------
|gaurav| [red,white] |
|harsh | [black,blue]|
|nitin | [yellow] |
----------------------
Use collect_list
The code is shown below:
import org.apache.spark.sql.functions._
df.groupBy($"name").agg(collect_list($"color").as("color_list")).show
Hope it helps!!
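For completeness, a self-contained sketch of that approach against the sample rows from the question (column names taken from the question). Note that the flatten from the original attempt isn't needed here, because color is a plain string column and collect_list already yields one array per name:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("collect-list-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data copied from the question
val df = Seq(
  ("gaurav", "red"),
  ("harsh", "black"),
  ("nitin", "yellow"),
  ("gaurav", "white"),
  ("harsha", "blue")
).toDF("name", "color")

// One array of colors per name; collect_set would drop duplicates instead
df.groupBy($"name")
  .agg(collect_list($"color").as("color_list"))
  .show(truncate = false)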

How to convert a row (array of strings) to a dataframe column [duplicate]

I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 & 4) to multiple rows while retaining the 'id', to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+
Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title")).show()
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
Ok, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects because I couldn't find a way to append to a Row object.
That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
def add_id(row):
    # For each title struct in row[1], copy its values into a plain list and
    # append the row's id (row[0]) to it
    it_list = []
    for i in range(0, len(row[1])):
        sm_list = []
        for j in row[1][i]:
            sm_list.append(j)
        sm_list.append(row[0])
        it_list.append(sm_list)
    return it_list

# Note: on Spark 2.x+ DataFrames no longer expose flatMap/map directly in Python;
# documents.rdd.flatMap(...) would be needed there.
with_id = documents.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+
I am using the Spark Dataset API, and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID","explode(finished_chunk) as chunks");
Note: the explode method of the Dataset API is deprecated as of Spark 2.4.5, and the documentation suggests using select (shown above) or flatMap instead.
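As a rough Scala sketch of that replacement, the same projection can be written with the explode column function through select; the column names ID and finished_chunk are taken from the snippet above, and the sample data is invented purely for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("explode-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Invented sample data with the same shape: an ID column and an array column
val initialDataset = Seq(
  ("a", Seq("chunk1", "chunk2")),
  ("b", Seq("chunk3"))
).toDF("ID", "finished_chunk")

// Equivalent of the selectExpr call above, written with the column-function API
val explodedDataset = initialDataset.select(col("ID"), explode(col("finished_chunk")).as("chunks"))
explodedDataset.show()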

Apache PySpark: how to create a column with an array containing n elements

I have a dataframe with one column of type integer.
I want to create a new column with an array containing n elements (n being the number from the first column).
For example:
from pyspark.sql.types import StructType, StructField, IntegerType
x = spark.createDataFrame([(1,), (2,), (3,)], StructType([StructField("myInt", IntegerType(), True)]))
+-----+
|myInt|
+-----+
| 1|
| 2|
| 3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Note: it doesn't actually matter what the values inside the arrays are; it's just the count that matters.
It'd be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt| myArr|
+-----+------------------+
| 1| [item]|
| 2| [item, item]|
| 3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Use udf:
from pyspark.sql.functions import *
#udf("array<int>")
def rep_(x):
return [x for _ in range(x)]
x.withColumn("myArr", rep_("myInt")).show()
# +-----+------+
# |myInt| myArr|
# +-----+------+
# | 1| [1]|
# | 2|[2, 2]|
# +-----+------+

How can we validate tabular data in robot framework?

In Cucumber, we can directly validate database table content in tabular format by specifying the values in the following format:
| Type | Code | Amount |
| A | HIGH | 27.72 |
| B | LOW | 9.28 |
| C | LOW | 4.43 |
Do we have something similar in Robot Framework? I need to run a query on the DB, and the output looks like the table given above.
No, there is nothing built in to do exactly what you say. However, it's fairly straightforward to write a keyword that takes a table of data and compares it to another table of data.
For example, you could write a keyword that takes the result of the query and then rows of information (though the rows must all have exactly the same number of columns):
| | ${ResultOfQuery}= | <do the database query>
| | Database should contain | ${ResultOfQuery}
| | ... | #Type | Code | Amount
| | ... | A | HIGH | 27.72
| | ... | B | LOW | 9.28
| | ... | C | LOW | 4.43
Then it's just a matter of iterating over all of the arguments three at a time, and checking if the data has that value. It would look something like this:
*** Keywords ***
| Database should contain
| | [Arguments] | ${actual} | @{expected}
| | :FOR | ${type} | ${code} | ${amount} | IN | @{expected}
| | | <verify that the values are in ${actual}>
Even easier might be to write a python-based keyword, which makes it a bit easier to iterate over datasets.
