Create a list of lists from two columns in a data frame - Scala - arrays

I have a data frame where passengerId and path are Strings. The path represents the flight path of the passenger so passenger 10096 started in country CO and traveled to country BM. I need to find out the longest amount of flights each passenger has without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
(1,List("UK","IR","AT","UK","CH","PK")),
(2,List("CG","IR")),
(3,List("CG","IR","SG","BE","UK")),
(4,List("CG","IR","NO","UK","SG","UK","IR","TJ","AT")),
(5,List("CG","IR"))
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.

If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of steps without going to the UK, you could do something like this:
df
// split the string on 'uk' and generate one row per sub journey
.withColumn("path", explode(split('path, ",?uk,?")))
// compute the size of each sub journey
.withColumn("path_size", size(split('path, ",")))
// retrieve the longest one
.groupBy("passengerId")
.agg(max('path_size) as "max_path_size")

Related

Calculate total number of orders in different statuses

I want to create simple dashboard where I want to show the number of orders in different statuses. The statuses can be New/Cancelled/Finished/etc
Where should I implement these criteria? If I add filter in the Cube Browser then it applies for the whole dashboard. Should I do that in KPI? Or should I add calculated column with 1/0 values?
My expected output is something like:
--------------------------------------
| Total | New | Finished | Cancelled |
--------------------------------------
| 1000 | 100 | 800 | 100 |
--------------------------------------
I'd use measures for that, something like:
CountTotal = COUNT('Orders'[OrderID])
CountNew = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "New")
CountFinished = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Finished")
CountCancelled = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Cancelled")

Parsing a string into a Dataframe

I have the following data
100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello
there!
1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!
In a file, I have multiple lines of of the above formatted data. Each line is delimited by ///n and ///t. For each line, there are four records that are delimited by ///n. Inside each record, there are four columns that are delimited by ///t. Now, I need to parse this into a Dataframe. So basically for the above two lines; since each line has 2 records with 6 columns; there should be 12 records in the Dataframe. Each record follows the same format.
I tried parsing this using a combination of split and amp but did not get the correct output
You can process it using string transformations, like:
// Sample of input data
val str1 = "100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!"
val str2 = "1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!"
val df = Seq(str1, str2).toDF
// Process:
val output = df.as[String].flatMap(row=>{
val fields = row.split("///n").map(record=>{
val fields = record.split("///t").toList
(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
}).toList
fields
}).toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")
Result:
+--------+--------+--------+--------+----------+------------+
|column_1|column_2|column_3|column_4| column_5| column_6|
+--------+--------+--------+--------+----------+------------+
| 100| 1001| 2| 0.119|2342342342| Hi |
| |there! |
| 103| 1002| 2| 0.119|2342342342|Hello there!|
| 1010| 10077| 2| 0.119|2342342342| Hi |
| | there!|
| 1044| 1003| 2| 0.119|2342342342|Hello there!|
+--------+--------+--------+--------+----------+------------+

Import .csv into MongoDB creating arrays with objects

I am starting to familiarize with MongoDB and now i have to import a relative large .csv file into a Mongo collection. So far I was able to import the .csv file without big problems (either with mongoimport or through Studio 3T). However, I would like to create some arrays for each specific document. Maybe by explaining with an example my data I could make it clearer.
worker_id | gender | birth | employer_id | start | end | type |
---------------------------------------------------------------------------
CAD1213 | M | 1990 | WRF1119 | 15jun2018 | 31dec2018 | 11.1.1 |
CAD1213 | M | 1990 | WAC1134 | 1jan2019 | 31dec2019 | 12.1.1 |
CAF1456 | F | 1972 | WAC1134 | 15aug1988 | 17sep2016 | 55.0.1 |
CAR5567 | F | 1991 | WER9280 | 15jun2018 | 31dec2018 | 1.1.1 |
CAR5578 | F | 1989 | WHI1176 | 15jun2011 | 1nov2011 | 12.2.5 |
CAR5578 | F | 1989 | WHI1176 | 2nov2011 | 31dec2012 | 12.2.5 |
This is just a mere example with few fields (columns). I have data for job contracts where: worker_id is the the identification of the worker, gender the sex, birth refers to the birth year of the worker, employer_id is the identification of each employer, start and end the beginning and the end of each specific work contract, respectively, while type is the specific work contract. The whole database obviously is characterize by more fields.
With the classic .csv import, for each line I would obtain a different document with specific _ids and each column would represent the fields. However, I would like to have as '_id' (although not necessarily) the value of the worker_id. As you can see, some worker_id are repeated in the column (in a file of over 17 millions of row you can imagine how many replica I could have), therefore it would be impossibile to select this field as _id.
Here it comes what I am struggling to obtain. I would like to create an array (for example called contracts) which contains information of each specific contract, hence objects, in this case the columns employer_id, start, end, and type. For some workers we would have just an element in the array (CAF1456 and CAR5567 in the example) while for others more than one (CAD1213 and CAR5578 in the example). Therefore, an hypothetical document would be like the following one:
_id: "CAD1213"
gender: "M"
birth: "1990"
contracts: Array
0: Object
employer_id: "WRF1119"
start: "15jun2018"
end: "31dec2018"
type: "11.1.1"
1: Object
employer_id: "WAC1134"
start: "1jan1988"
end: "31dec2019"
type: "12.1.1"
_id: "CAF1456"
gender: "F"
birth: "1972"
contracts: Array
0: Object
employer_id: "WAC1134"
start: "15aug1988"
end: "17sep2016"
type: "55.0.1"
_id: "CAR5567"
gender: "F"
birth: "1991"
contracts: Array
0: Object
employer_id: "WER9280"
start: "15jun2018"
end: "31dec2018"
type: "1.1.1"
_id: "CAR5578"
gender: "F"
birth: "1989"
contracts: Array
0: Object
employer_id: "WHI1176"
start: "15jun2011"
end: "1nov2011"
type: "12.2.5"
1: Object
employer_id: "WHI1176"
start: "2nov2011"
end: "31dec2012"
type: "12.2.5"
As far as I know this procedure cannot be performed directly with mongoimport, then it has to be performed after importing the whole .csv document (I guess). I hope someone could provide me some hints, suggestions or links where I can find some help in achieving this structure. Moreover, if other structures of the documents could be more suitable for my example, I am eager to know.
Thank you in advance.

convert JSON text string to Pandas, but each row cell ends up as an array of values inside

I manage to extract a time-series of prices from a web-portal. The data arrives in a json format, and I convert them into a pandas dataFrame.
Unfortunately, the data for the different bands come in a text string, and I can't seem to extract them out properly.
The below is the json data I extract
I convert them into a pandas dataframe using this code
data = pd.DataFrame(r.json()['prices'])
and get them like this
I need to extract (for example) the data in the column ClosePrice out, so that I can do data analysis and cleansing on them.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) when I convert the json to dataFrame, such that the prices within the closePrice, bidPrice etc are extracted in individual columns OR
b) if they were saved in the dataFrame, extract the text strings within them, such that I can extract the prices (e.g. the bid, ask and lastTraded) within the text string?
A relatively brute force way, using links from other stackOverflow.
# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# convert the object list into individual columns
dat2 = pd.DataFrame()
dat2[['bidC','askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index= dat1.index)
dat2[['bidH','askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index= dat1.index)
dat2[['bidL','askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index= dat1.index)
dat2[['bidO','askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index= dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
get the equivalent below
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |

Count unique rows of column in Array of object

Using
Ruby 1.9.3-p194
Rails 3.2.8
Here's what I need.
Count the different human resources (human_resource_id) and divide this by the total number of assignments (assignment_id).
So, the answer for the dummy-data as given below should be:
1.5 assignments per human resource
But I just don't know where to go anymore.
Here's what I tried:
Table name: Assignments
id | human_resource_id | assignment_id | assignment_start_date | assignment_expected_end_date
80101780 | 20200132 | 80101780 | 2012-10-25 | 2012-10-31
80101300 | 20200132 | 80101300 | 2012-07-07 | 2012-07-31
80101308 | 21100066 | 80101308 | 2012-07-09 | 2012-07-17
At first I need to make a selection for the period I need to 'look' at. This is always from max a year ago.
a = Assignment.find(:all, :conditions => { :assignment_expected_end_date => (DateTime.now - 1.year)..DateTimenow })
=> [
#<Assignment id: 80101780, human_resource_id: "20200132", assignment_id: "80101780", assignment_start_date: "2012-10-25", assignment_expected_end_date: "2012-10-31">,
#<Assignment id: 80101300, human_resource_id: "20200132", assignment_id: "80101300", assignment_start_date: "2012-07-07", assignment_expected_end_date: "2012-07-31">,
#<Assignment id: 80101308, human_resource_id: "21100066", assignment_id: "80101308", assignment_start_date: "2012-07-09", assignment_expected_end_date: "2012-07-17">
]
foo = a.group_by(&:human_resource_id)
Now I got a beautiful 'Array of hash of object' and I just don't know what to do next.
Can someone help me?
You can try to execute the request in SQL :
ActiveRecord::Base.connection.select_value('SELECT count(distinct human_resource_id) / count(distinct assignment_id) AS ratio FROM assignments');
You could do something like
human_resource_count = assignments.collect{|a| a.human_resource_id}.uniq.count
assignment_count = assignments.collect{|a| a.assignment_id}.uniq.count
result = human_resource_count/assignment_count

Resources