Convert a JSON text string to pandas, but each row cell ends up as an array of values

I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different price bands come bundled inside each cell, and I can't seem to extract them properly.
The JSON data I extract has the shape shown in the sample printed in the json_normalize answer below. I convert it into a pandas DataFrame using this code
data = pd.DataFrame(r.json()['prices'])
and each band column (closePrice, highPrice, lowPrice, openPrice) then holds a whole {'ask': ..., 'bid': ..., 'lastTraded': ...} object per cell.
I need to extract (for example) the data in the closePrice column so that I can do data analysis and cleansing on it.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns=lambda x: "string" + str(x + 1))
but it doesn't really work: the cells hold dicts rather than strings, so the .str accessor just returns NaN.
Is there any way to either
a) extract the prices within closePrice, openPrice, etc. into individual columns when I convert the JSON to a DataFrame, OR
b) if they are already saved in the DataFrame as objects, extract the prices (e.g. the bid, ask and lastTraded) within them into separate columns?

A relatively brute-force way, pieced together from other Stack Overflow answers:
import pandas as pd
import requests

# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)  # url and the login payload `data` are defined elsewhere
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# expand each dict-valued column into individual columns;
# pd.DataFrame on a list of dicts orders the columns by dict key: ask, bid, lastTraded
dat2 = pd.DataFrame()
dat2[['askC', 'bidC', 'lastC']] = pd.DataFrame(dat1.closePrice.values.tolist(), index=dat1.index)
dat2[['askH', 'bidH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index=dat1.index)
dat2[['askL', 'bidL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index=dat1.index)
dat2[['askO', 'bidO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index=dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
This gives the equivalent of the flattened output shown in the answer below.

Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |
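If the dotted column names are inconvenient downstream, json_normalize also accepts a sep argument (for example pd.json_normalize(data['prices'], sep='_') yields closePrice_ask and so on), and columns can always be renamed afterwards with df.rename.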

Related

Create a list of lists from two columns in a data frame - Scala

I have a data frame where passengerId and path are Strings. The path represents the flight path of the passenger, so passenger 10096 started in country CO and traveled to country BM. I need to find the longest run of consecutive flights each passenger took without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
(1,List("UK","IR","AT","UK","CH","PK")),
(2,List("CG","IR")),
(3,List("CG","IR","SG","BE","UK")),
(4,List("CG","IR","NO","UK","SG","UK","IR","TJ","AT")),
(5,List("CG","IR"))
)
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.
If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of consecutive flights without going to the UK, you could do something like this:
import org.apache.spark.sql.functions.{explode, max, size, split}
import spark.implicits._ // for the 'path symbol syntax; assumes a SparkSession named spark

df
  // split the string on 'uk' and generate one row per sub-journey
  .withColumn("path", explode(split('path, ",?uk,?")))
  // compute the size of each sub-journey
  .withColumn("path_size", size(split('path, ",")))
  // retrieve the longest one per passenger
  .groupBy("passengerId")
  .agg(max('path_size) as "max_path_size")
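Note that the ",?uk,?" pattern consumes the commas around each uk as well, so the surviving sub-journeys split cleanly on "," when their size is computed; a passenger who never visits the UK simply keeps one sub-journey equal to the full path.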

Calculate total number of orders in different statuses

I want to create a simple dashboard showing the number of orders in different statuses (New/Cancelled/Finished/etc.).
Where should I implement these criteria? If I add a filter in the Cube Browser, it applies to the whole dashboard. Should I do it in a KPI? Or should I add a calculated column with 1/0 values?
My expected output is something like:
--------------------------------------
| Total | New | Finished | Cancelled |
--------------------------------------
| 1000 | 100 | 800 | 100 |
--------------------------------------
I'd use measures for that, something like:
CountTotal = COUNT('Orders'[OrderID])
CountNew = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "New")
CountFinished = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Finished")
CountCancelled = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Cancelled")
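Each measure is evaluated in its visual's own filter context, and the Status condition inside CALCULATE overrides any incoming filter on that column, so all four can sit side by side in a single table or card visual without a dashboard-wide filter.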

Parsing a string into a Dataframe

I have the following data
100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!
1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!
In a file, I have multiple lines of data in the above format. Within each line, records are delimited by ///n and, inside each record, the columns are delimited by ///t. Each line holds two records, and each record has six columns. Now I need to parse this into a DataFrame: for the above two lines, there should be 4 rows of 6 columns in the DataFrame. Each record follows the same format.
I tried parsing this using a combination of split and map but did not get the correct output.
You can process it using string transformations, like:
import spark.implicits._ // assumes a SparkSession named spark; needed for toDF and as[String]

// Sample of input data
val str1 = "100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!"
val str2 = "1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!"
val df = Seq(str1, str2).toDF

// Process: one output row per ///n-delimited record, one tuple field per ///t-delimited column
val output = df.as[String].flatMap { row =>
  row.split("///n").toSeq.map { record =>
    val fields = record.split("///t").toList
    (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
  }
}.toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")
Result:
+--------+--------+--------+--------+----------+------------+
|column_1|column_2|column_3|column_4| column_5| column_6|
+--------+--------+--------+--------+----------+------------+
| 100| 1001| 2| 0.119|2342342342| Hi |
| |there! |
| 103| 1002| 2| 0.119|2342342342|Hello there!|
| 1010| 10077| 2| 0.119|2342342342| Hi |
| | there!|
| 1044| 1003| 2| 0.119|2342342342|Hello there!|
+--------+--------+--------+--------+----------+------------+
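Since the question says the records live in a file, here is a minimal sketch of the same transformation reading from disk; the file path is hypothetical and a SparkSession named spark is assumed:
import spark.implicits._

// each physical line holds ///n-delimited records of ///t-delimited columns
val output = spark.read.textFile("/path/to/data.txt") // hypothetical path
  .flatMap { line =>
    line.split("///n").toSeq.map { record =>
      val f = record.split("///t")
      (f(0), f(1), f(2), f(3), f(4), f(5))
    }
  }
  .toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")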

Scala: Turn Array into DataFrame or RDD

I am currently working in IntelliJ with Maven.
Is there a way to turn an array into a dataframe or RDD with the first portion of the array as a header?
I'm fine with turning the array into a List, as long as it can be converted into a dataframe or RDD.
Example:
input
val input = Array("Name, Number", "John, 9070", "Sara, 8041")
output
+----+------+
|Name|Number|
+----+------+
|John| 9070 |
|Sara| 8041 |
+----+------+
import org.apache.spark.sql.SparkSession

val ss = SparkSession
  .builder
  .master("local[*]")
  .appName("test")
  .getOrCreate()
import ss.implicits._ // needed for toDF

val input = Array("Name, Number", "John, 9070", "Sara, 8041")
val header = input.head.split(", ")
val data = input.tail

val rdd = ss.sparkContext.parallelize(data)
val df = rdd.map(x => (x.split(",")(0), x.split(",")(1))).toDF(header: _*)
df.show(false)
+----+------+
|Name|Number|
+----+------+
|John| 9070 |
|Sara| 8041 |
+----+------+
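Note that splitting on "," alone keeps the space after each comma (hence the padded " 9070 " in the output). If that is unwanted, a small variant of the map, as a sketch, trims every field:
val df2 = rdd.map(_.split(",").map(_.trim)).map(a => (a(0), a(1))).toDF(header: _*)
df2.show(false)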

Spark dataset from List

I need to create a Spark Dataset for ML. I have an array of 100 Double values and I want to add them to a dataset of 100 columns (each column for one value).
How can I do it?
Thanks
EDIT: CODE
import sess.implicits._ // sess is the SparkSession
import scala.collection.mutable.ListBuffer

val values = new ListBuffer[Double]()
// values population process ...
val ds = values.toDS()
ds.show()
And the output shows as:
+--------+
| value|
+--------+
| 27242.0|
| 33883.0|
| 69727.0|
| 20851.0|
| 27740.0|
| 18747.0|
There are plenty of ways to meet your requirement. One of them is to form a schema, convert the array of 100 doubles to an RDD[Row], and finally use the createDataFrame API to form a DataFrame.
// necessary imports
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
import org.apache.spark.sql.SQLContext
// forming array of 100 doubles
var values = new ListBuffer[Double]()
for(x <- 1 to 100){
values = values :+ x.toDouble
}
//creating schema for the 100 doubles
val schema = StructType(values.map(value => StructField(("col"+value).replace(".", "_"), DoubleType, true)))
// finally creating the dataframe of 100 doubles with one value per column
// (assumes the Spark shell's sqlContext and sc)
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row.fromSeq(values.toSeq))), schema)
df.show(false)
which should give you
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|col1_0|col2_0|col3_0|col4_0|col5_0|col6_0|col7_0|col8_0|col9_0|col10_0|col11_0|col12_0|col13_0|col14_0|col15_0|col16_0|col17_0|col18_0|col19_0|col20_0|col21_0|col22_0|col23_0|col24_0|col25_0|col26_0|col27_0|col28_0|col29_0|col30_0|col31_0|col32_0|col33_0|col34_0|col35_0|col36_0|col37_0|col38_0|col39_0|col40_0|col41_0|col42_0|col43_0|col44_0|col45_0|col46_0|col47_0|col48_0|col49_0|col50_0|col51_0|col52_0|col53_0|col54_0|col55_0|col56_0|col57_0|col58_0|col59_0|col60_0|col61_0|col62_0|col63_0|col64_0|col65_0|col66_0|col67_0|col68_0|col69_0|col70_0|col71_0|col72_0|col73_0|col74_0|col75_0|col76_0|col77_0|col78_0|col79_0|col80_0|col81_0|col82_0|col83_0|col84_0|col85_0|col86_0|col87_0|col88_0|col89_0|col90_0|col91_0|col92_0|col93_0|col94_0|col95_0|col96_0|col97_0|col98_0|col99_0|col100_0|
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|1.0 |2.0 |3.0 |4.0 |5.0 |6.0 |7.0 |8.0 |9.0 |10.0 |11.0 |12.0 |13.0 |14.0 |15.0 |16.0 |17.0 |18.0 |19.0 |20.0 |21.0 |22.0 |23.0 |24.0 |25.0 |26.0 |27.0 |28.0 |29.0 |30.0 |31.0 |32.0 |33.0 |34.0 |35.0 |36.0 |37.0 |38.0 |39.0 |40.0 |41.0 |42.0 |43.0 |44.0 |45.0 |46.0 |47.0 |48.0 |49.0 |50.0 |51.0 |52.0 |53.0 |54.0 |55.0 |56.0 |57.0 |58.0 |59.0 |60.0 |61.0 |62.0 |63.0 |64.0 |65.0 |66.0 |67.0 |68.0 |69.0 |70.0 |71.0 |72.0 |73.0 |74.0 |75.0 |76.0 |77.0 |78.0 |79.0 |80.0 |81.0 |82.0 |83.0 |84.0 |85.0 |86.0 |87.0 |88.0 |89.0 |90.0 |91.0 |92.0 |93.0 |94.0 |95.0 |96.0 |97.0 |98.0 |99.0 |100.0 |
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
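As an aside, if the 100 values are meant for Spark ML, most of its algorithms expect the features in a single vector column rather than 100 scalar columns. A minimal sketch under that assumption, reusing the values buffer and the implicits of an existing SparkSession:
import org.apache.spark.ml.linalg.Vectors

// pack all 100 doubles into one ML vector column
val dfVec = Seq(Tuple1(Vectors.dense(values.toArray))).toDF("features")
dfVec.show(false)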
