Scala: Turn Array into DataFrame or RDD

I am currently working in IntelliJ with Maven.
Is there a way to turn an array into a DataFrame or RDD, with the first element of the array used as the header?
I'm fine with turning the array into a List, as long as it can be converted into a DataFrame or RDD.
Example:
input
val input = Array("Name, Number", "John, 9070", "Sara, 8041")
output
+----+------+
|Name|Number|
+----+------+
|John| 9070 |
|Sara| 8041 |
+----+------+

import org.apache.spark.sql.SparkSession

val ss = SparkSession
  .builder
  .master("local[*]")
  .appName("test")
  .getOrCreate()

import ss.implicits._  // needed for toDF on the RDD

val input = Array("Name, Number", "John, 9070", "Sara, 8041")
val header = input.head.split(", ")
val data = input.tail

val rdd = ss.sparkContext.parallelize(data)
val df = rdd
  .map(_.split(", "))
  .map(fields => (fields(0), fields(1)))
  .toDF(header: _*)

df.show(false)
+----+------+
|Name|Number|
+----+------+
|John|9070  |
|Sara|8041  |
+----+------+
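The question also mentions RDDs; if you want one instead of (or alongside) the DataFrame, a minimal sketch reusing the same ss session and input array from above might look like this:

// Drop back from the DataFrame built above to an RDD of Rows
val rowRdd = df.rdd

// Or skip the DataFrame entirely and keep a plain RDD of (Name, Number) tuples
val tupleRdd = ss.sparkContext
  .parallelize(input.tail)
  .map { line =>
    val fields = line.split(", ")
    (fields(0), fields(1))
  }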

Related

Parsing a string into a Dataframe

I have the following data
100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!
1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!
In a file, I have multiple lines of the above data. Within each line, records are delimited by ///n, and inside each record the fields are delimited by ///t. Now, I need to parse this into a DataFrame. So for the two lines above, since each line has 2 records with 6 fields, there should be 4 rows of 6 columns in the DataFrame. Each record follows the same format.
I tried parsing this using a combination of split and map, but did not get the correct output.
You can process it using string transformations, like:
// Sample of input data (assuming a SparkSession named spark, with spark.implicits._ imported)
import spark.implicits._

val str1 = "100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!"
val str2 = "1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!"
val df = Seq(str1, str2).toDF

// Process: split each line into records on ///n, then each record into its six fields on ///t
val output = df.as[String].flatMap { line =>
  line.split("///n").toList.map { record =>
    val fields = record.split("///t").toList
    (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
  }
}.toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")
Result:
+--------+--------+--------+--------+----------+------------+
|column_1|column_2|column_3|column_4| column_5| column_6|
+--------+--------+--------+--------+----------+------------+
| 100| 1001| 2| 0.119|2342342342| Hi |
| |there! |
| 103| 1002| 2| 0.119|2342342342|Hello there!|
| 1010| 10077| 2| 0.119|2342342342| Hi |
| | there!|
| 1044| 1003| 2| 0.119|2342342342|Hello there!|
+--------+--------+--------+--------+----------+------------+
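Since the question says the data actually lives in a file, the same transformation can start from spark.read.textFile, which yields one string per line. A short sketch, with spark.implicits._ in scope as above and data.txt used only as a placeholder path:

// Each line of the file becomes one element of a Dataset[String]
val lines = spark.read.textFile("data.txt")

val parsed = lines.flatMap { line =>
  line.split("///n").toList.map { record =>
    val fields = record.split("///t")
    (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
  }
}.toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")

parsed.show(false)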

How to convert a column with String to Array[String] in Scala/Spark?

I have a data frame:
+--------------------------------------+------------------------------------------------------------+
|item |item_codes |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+
And the schema looks like this:
root
|-- item: string (nullable = true)
|-- item_codes: string (nullable = true)
How can I convert the item_codes string column to Array[String] in Scala?
You can remove quotes/square brackets using regexp_replace, followed by a split to generate the ArrayType column:
import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._

val df = Seq(
  ("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")

df.
  withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
  show(false)
// +----+------------------------------------------------------+
// |item|item_codes |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
You can use the split method after doing some "preprocessing":
import org.apache.spark.sql.functions.{expr, split}

val col_names = Seq("item", "item_codes")
val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))
val df = spark.createDataFrame(data).toDF(col_names: _*)

// chop off the first 2 and last 2 characters, then split at ","
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """",""""))
  .printSchema
If your format can change, you might be more flexible using a regexp, as leo suggests: chop off everything that is not a digit or a comma, and split at "," (see the sketch below).
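A minimal sketch of that regexp variant, assuming the same df and the functions import from above:

import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Keep only digits and commas, then split on ","
df.withColumn("item_codes", split(regexp_replace(col("item_codes"), "[^0-9,]", ""), ","))
  .show(false)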

convert JSON text string to Pandas, but each row cell ends up as an array of values inside

I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different bands comes in a text string, and I can't seem to extract it out properly.
Below is the JSON data I extract.
I convert it into a pandas DataFrame using this code
data = pd.DataFrame(r.json()['prices'])
and get them like this
I need to extract (for example) the data in the closePrice column, so that I can do data analysis and cleansing on it.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) when I convert the JSON to a DataFrame, extract the prices within closePrice, bidPrice etc. into individual columns, OR
b) if they are already saved in the DataFrame, parse the text strings within them, so that I can extract the prices (e.g. the bid, ask and lastTraded) from the text string?
A relatively brute-force way, based on links from other Stack Overflow answers:
# imports assumed for this snippet
import requests
import pandas as pd

# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# convert the object list into individual columns
dat2 = pd.DataFrame()
dat2[['bidC','askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index= dat1.index)
dat2[['bidH','askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index= dat1.index)
dat2[['bidL','askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index= dat1.index)
dat2[['bidO','askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index= dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
to get the equivalent result.
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |

Spark dataset from List

I need to create a Spark Dataset for ML. I have an array of 100 Double values and I want to add them to a dataset of 100 columns (each column for one value).
How can I do it?
Thanks
EDIT: CODE
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import scala.collection.mutable.ListBuffer
import sess.implicits._

val values = new ListBuffer[Double]()
// Values population process ....
val ds = values.toDS()
ds.show()
And the output shows as:
+--------+
| value|
+--------+
| 27242.0|
| 33883.0|
| 69727.0|
| 20851.0|
| 27740.0|
| 18747.0|
There are plenty of ways to meet your requirement. One of them is to form a schema, convert the array of 100 doubles into an RDD[Row], and finally use the createDataFrame API to form a DataFrame.
// necessary imports
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
import org.apache.spark.sql.SQLContext

// forming an array of 100 doubles
var values = new ListBuffer[Double]()
for (x <- 1 to 100) {
  values = values :+ x.toDouble
}

// creating a schema for the 100 doubles (one column per value)
val schema = StructType(values.map(value => StructField(("col" + value).replace(".", "_"), DoubleType, true)))

// finally creating the dataframe of 100 doubles, with each value in its own column
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row.fromSeq(values.toSeq))), schema)
df.show(false)
which should give you
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|col1_0|col2_0|col3_0|col4_0|col5_0|col6_0|col7_0|col8_0|col9_0|col10_0|col11_0|col12_0|col13_0|col14_0|col15_0|col16_0|col17_0|col18_0|col19_0|col20_0|col21_0|col22_0|col23_0|col24_0|col25_0|col26_0|col27_0|col28_0|col29_0|col30_0|col31_0|col32_0|col33_0|col34_0|col35_0|col36_0|col37_0|col38_0|col39_0|col40_0|col41_0|col42_0|col43_0|col44_0|col45_0|col46_0|col47_0|col48_0|col49_0|col50_0|col51_0|col52_0|col53_0|col54_0|col55_0|col56_0|col57_0|col58_0|col59_0|col60_0|col61_0|col62_0|col63_0|col64_0|col65_0|col66_0|col67_0|col68_0|col69_0|col70_0|col71_0|col72_0|col73_0|col74_0|col75_0|col76_0|col77_0|col78_0|col79_0|col80_0|col81_0|col82_0|col83_0|col84_0|col85_0|col86_0|col87_0|col88_0|col89_0|col90_0|col91_0|col92_0|col93_0|col94_0|col95_0|col96_0|col97_0|col98_0|col99_0|col100_0|
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|1.0 |2.0 |3.0 |4.0 |5.0 |6.0 |7.0 |8.0 |9.0 |10.0 |11.0 |12.0 |13.0 |14.0 |15.0 |16.0 |17.0 |18.0 |19.0 |20.0 |21.0 |22.0 |23.0 |24.0 |25.0 |26.0 |27.0 |28.0 |29.0 |30.0 |31.0 |32.0 |33.0 |34.0 |35.0 |36.0 |37.0 |38.0 |39.0 |40.0 |41.0 |42.0 |43.0 |44.0 |45.0 |46.0 |47.0 |48.0 |49.0 |50.0 |51.0 |52.0 |53.0 |54.0 |55.0 |56.0 |57.0 |58.0 |59.0 |60.0 |61.0 |62.0 |63.0 |64.0 |65.0 |66.0 |67.0 |68.0 |69.0 |70.0 |71.0 |72.0 |73.0 |74.0 |75.0 |76.0 |77.0 |78.0 |79.0 |80.0 |81.0 |82.0 |83.0 |84.0 |85.0 |86.0 |87.0 |88.0 |89.0 |90.0 |91.0 |92.0 |93.0 |94.0 |95.0 |96.0 |97.0 |98.0 |99.0 |100.0 |
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
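Since the dataset is meant for ML, note that Spark ML estimators usually expect a single vector column rather than 100 separate Double columns. A hedged sketch of assembling the columns produced above into a features vector, where df is the 100-column DataFrame from the previous snippet:

import org.apache.spark.ml.feature.VectorAssembler

// Collapse the 100 Double columns into one "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("features")

val mlReady = assembler.transform(df)
mlReady.select("features").show(false)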

How to sum values of a struct in a nested array in a Spark dataframe?

This is in Spark 2.1. Given this input file:
order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And the following dataframes:
import org.apache.spark.sql.functions.{struct, collect_list}

val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg(collect_list($"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|array |
+------+---------------------------+
|1 |[[1,202.3,1], [2,343.99,1]]|
|2 |[[3,399.99,2]] |
+------+---------------------------+
and structure:
root
|-- userId: long (nullable = true)
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
Now assuming I am given df3:
I would like to compute sum of array.price for each userId, taking advantage of having the array per userId rows.
I would add this computation as a new column in the resulting dataframe, as if I had done df3.withColumn("sum", lit(0)), but with lit(0) replaced by my computation.
I assumed it would be straightforward, but I am stuck on both. I didn't find any way to access the array as a whole to do the computation per row (with a foldLeft, for example).
I would like to compute sum of array.price for each userId, taking advantage of having the array
Unfortunately having an array works against you here. Neither Spark SQL nor the DataFrame DSL provides tools that could be used directly to handle this task on an array of arbitrary size without decomposing (explode) it first.
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
or convert to statically typed Dataset:
df3
  .as[(Long, Seq[(Long, Double, Long)])]
  .map { case (id, xs) => (id, xs, xs.map(_._2).sum) }
  .toDF("userId", "array", "totalPrice")
  .show
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
As mentioned above, you can decompose and aggregate:
import org.apache.spark.sql.functions.{sum, first, explode}

df3
  .withColumn("price", explode($"array.price"))
  .groupBy($"userId")
  .agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)| array|
+------+----------+--------------------+
| 1| 546.29|[[1,202.3,1], [2,...|
| 2| 399.99| [[3,399.99,2]]|
+------+----------+--------------------+
but it is expensive and doesn't use the existing structure.
There is an ugly trick you could use:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}

val totalPrice = (0 to df3.agg(max(size($"array"))).as[Int].first)
  .map(i => coalesce($"array.price".getItem(i), lit(0.0)))
  .foldLeft(lit(0.0))(_ + _)

df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
but it is more a curiosity than a real solution.
Spark 2.4.0 and above
You can now use the AGGREGATE functionality.
df3.createOrReplaceTempView("orders")

// the zero value is written as 0D (a double) so it matches the type of item.price
spark.sql(
  """
    |SELECT
    |  *,
    |  AGGREGATE(`array`, 0D, (accumulator, item) -> accumulator + item.price) AS totalPrice
    |FROM
    |  orders
    |""".stripMargin).show()
