Convert Array form (as String) to Column in Pyspark

I have a df with this form:
+--+-----------------------------------+
|ID|ESTRUC_COMP                        |
+--+-----------------------------------+
|4A|{'AP': '201', 'BQ': '2'}           |
|8B|{'AP': '501', 'BQ': '1', 'IN': '5'}|
+--+-----------------------------------+
And I need something like this:
+--+-----------------------------------+---+--+--+
|ID|ESTRUC_COMP                        |AP |BQ|IN|
+--+-----------------------------------+---+--+--+
|4A|{'AP': '201', 'BQ': '2'}           |201|2 |  |
|8B|{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1 |5 |
+--+-----------------------------------+---+--+--+
But, ESTRUC_COMP is a String.
root
|-- ID: string (nullable = true)
|-- ESTRUC_COMP: string (nullable = true)
How can I perform this transformation? Thank you in advance.
Boris

Since you're using Spark 1.6, you can't use pyspark.sql.functions.from_json() - you're going to have to use a udf.
This question is very similar to PySpark “explode” dict in column, but I accept it's not a dupe for 2 reasons:
Your string column is not valid JSON (because of single quotes)
You want the keys to become columns
Nevertheless, the first step is to basically follow the same steps in the linked post with a minor tweak to the parse() function which replaces single quotes with double quotes:
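from pyspark.sql.functions import udf, explode, first
from pyspark.sql.types import MapType, StringType
import json

def parse(s):
    # Swap single quotes for double quotes so the string becomes valid JSON.
    try:
        return json.loads(s.replace("'", '"'))
    except (ValueError, AttributeError):
        # Malformed JSON or a null input: return None so the map is null.
        return None

parse_udf = udf(parse, MapType(StringType(), StringType()))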
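As an aside, the quote replacement breaks when a value itself contains an apostrophe. Assuming the strings are Python-style dict literals (the sample rows suggest this, but the question does not confirm it), ast.literal_eval from the standard library is one possible alternative. The sketch below is plain Python; parse_literal is a hypothetical name, not part of the original answer.

```python
import ast

def parse_literal(s):
    """Parse a Python-dict-style string into a dict of strings, or None on failure."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError, TypeError):
        # Null input or text that is not a valid Python literal.
        return None
    if not isinstance(value, dict):
        return None
    # Normalise keys and values to strings to match MapType(StringType(), StringType()).
    return {str(k): str(v) for k, v in value.items()}

print(parse_literal("{'AP': '201', 'BQ': '2'}"))
print(parse_literal("{'name': \"O'Brien\"}"))
print(parse_literal(None))
```

It could be wrapped with udf() in the same way as parse().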
Now you can parse the string and call pyspark.sql.functions.explode():
df.select("ID", explode(parse_udf("ESTRUC_COMP"))).show()
#+---+---+-----+
#| ID|key|value|
#+---+---+-----+
#| 4A| BQ|    2|
#| 4A| AP|  201|
#| 8B| IN|    5|
#| 8B| BQ|    1|
#| 8B| AP|  501|
#+---+---+-----+
Finally, pivot() to get the keys as the columns. You can use first() as the aggregate function because we know that the key-value relationship is one-to-one for each ID.
df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID","ESTRUC_COMP").pivot("key").agg(first("value")).show(truncate=False)
#+---+-----------------------------------+---+---+----+
#|ID |ESTRUC_COMP                        |AP |BQ |IN  |
#+---+-----------------------------------+---+---+----+
#|4A |{'AP': '201', 'BQ': '2'}           |201|2  |null|
#|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1  |5   |
#+---+-----------------------------------+---+---+----+
Of course, since I defined the udf to return MapType(StringType(), StringType()), all of your resultant columns are going to be strings. You can either cast them or modify the udf accordingly.
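For the second option, a minimal sketch of a parser that returns integer values might look like the following. The name parse_int is hypothetical, and the sketch assumes every value is an integer-like string. Registered as udf(parse_int, MapType(StringType(), IntegerType())), it would presumably give integer columns after the pivot, but that Spark-side step is an assumption and is not run here.

```python
import json

def parse_int(s):
    """Hypothetical variant of parse(): same quote swap, but values become ints."""
    try:
        raw = json.loads(s.replace("'", '"'))
        return {key: int(value) for key, value in raw.items()}
    except (ValueError, AttributeError, TypeError):
        # Bad JSON, null input, or a value that is not an integer string.
        return None

print(parse_int("{'AP': '501', 'BQ': '1', 'IN': '5'}"))
```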

Related

JQ using not with IN does not work or have any effect?

This code works as expected:
jq --argjson BL "${BL}" '.rows[] | select(.cells[] | .value | IN($BL[]))'
It returns a list of elements that contain a value in $BL.
I want to return all those that are not in $BL, so I use | not.
It returns the exact same result as without the | not; it seems to make no difference.
jq --argjson BL "${BL}" '.rows[] | select(.cells[] | .value | IN($BL[]) | not)'
Using the following returned nothing at all:
jq --argjson BL "${BL}" '.rows[] | select(.cells[] | .value | IN($BL[]|not))'
Is there a simple thing I'm missing about using IN with not?
For reference, $BL is an array of email addresses. I'm trying to make an API call and return all elements that don't have an email listed in $BL.

Your select receives a series of boolean values, one for each item in the .cells array. Using not inverts all of them, which means if you had a mixed set of boolean values, it would still be mixed, and in either case select would take those being evaluated to true.
The solution is to use any or all to aggregate these boolean values. Without any sample data, I assume you are looking for
.rows[] | select(any(.cells[]; .value | IN($BL[])) | not)
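The difference between the per-cell test and the aggregated test can be mimicked outside jq. The Python sketch below uses made-up rows and a made-up blocklist (the question gave no sample data) and only illustrates the logic, not jq itself.

```python
# Hypothetical sample data; the question did not include any.
rows = [
    {"id": 1, "cells": [{"value": "a@example.com"}, {"value": "x"}]},
    {"id": 2, "cells": [{"value": "y"}, {"value": "z"}]},
]
blocked = {"a@example.com"}

# Per-cell test, like select(.cells[] | .value | IN($BL[]) | not):
# a row is emitted once for every cell that is NOT blocked,
# so row 1 still appears because of its second cell.
per_cell = [
    row["id"]
    for row in rows
    for cell in row["cells"]
    if cell["value"] not in blocked
]

# Aggregated test, like select(any(.cells[]; .value | IN($BL[])) | not):
# a row is emitted only if none of its cells is blocked.
aggregated = [
    row["id"]
    for row in rows
    if not any(cell["value"] in blocked for cell in row["cells"])
]

print(per_cell)    # [1, 2, 2]
print(aggregated)  # [2]
```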

Is there a documented list of Snowflake query types?

I am working with the view SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY. It would be extremely helpful to have an exhaustive list of query types that might appear in the column QUERY_TYPE, with the type of commands that generate them. For example, does a PUT command generate a PUT query type? Or is it something like "LOAD"?
If anyone knows where such a list can be found, please post a link. Snowflake's documentation of the view does not provide any list.

Thanks all who have answered so far. Since the consensus is that no such list exists, here is a merge of the entries provided so far with the values found in my own database. Please keep posting additional answers if your DB contains entries not found below. This way, sooner or later, we will have a fairly complete list:
QUERY_TYPE
CREATE_USER
REVOKE
DROP_CONSTRAINT
RENAME_SCHEMA
UPDATE
CREATE_VIEW
CREATE_TASK
RENAME_TABLE
INSERT
ALTER_TABLE_ADD_COLUMN
RENAME_COLUMN
MERGE
BEGIN_TRANSACTION
ALTER_VIEW_MODIFY_SECURITY
GRANT
ALTER_SESSION
DELETE
DROP_ROLE
DESCRIBE
UNKNOWN
TRUNCATE_TABLE
DROP
SHOW
ALTER_WAREHOUSE_SUSPEND
GET_FILES
UNLOAD
CREATE_NETWORK_POLICY
ALTER_TABLE_DROP_COLUMN
CREATE
REMOVE_FILES
ALTER
ALTER_USER
PUT_FILES
COPY
ALTER_ACCOUNT
DROP_TASK
CREATE_CONSTRAINT
DESCRIBE_QUERY
SELECT
RENAME_USER
COMMIT
RENAME_VIEW
USE
CREATE_TABLE
ALTER_NETWORK_POLICY
CREATE_ROLE
ALTER_TABLE_MODIFY_COLUMN
SET
ALTER_USER_ABORT_ALL_JOBS
ROLLBACK
LIST_FILES
UNSET
CREATE_TABLE_AS_SELECT
DROP_USER
ALTER_WAREHOUSE_RESUME
QUERY_TYPE
ALTER_PIPE
ALTER_ROLE
ALTER_TABLE
ALTER_TABLE_DROP_CLUSTERING_KEY
ALTER_USER_RESET_PASSWORD
CREATE_EXTERNAL_TABLE
CREATE_MASKING_POLICY
CREATE_SEQUENCE
CREATE_STREAM
DROP_STREAM
RENAME_DATABASE
RENAME_FILE_FORMAT
RENAME_ROLE
RENAME_WAREHOUSE
RESTORE

By the looks of it, there is no complete list of the query types that show up in this view. The best I can do is give you a list from my own database, which still doesn't contain things like ALTER ROLE. To answer your other question, a PUT command shows up as PUT_FILES, by the looks of it:
select distinct query_type from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY;
+-------------------------+
|QUERY_TYPE |
+-------------------------+
|ALTER |
|ALTER_SESSION |
|ALTER_TABLE_ADD_COLUMN |
|ALTER_TABLE_DROP_COLUMN |
|ALTER_TABLE_MODIFY_COLUMN|
|ALTER_USER |
|ALTER_WAREHOUSE_RESUME |
|ALTER_WAREHOUSE_SUSPEND |
|BEGIN_TRANSACTION |
|COMMIT |
|COPY |
|CREATE |
|CREATE_CONSTRAINT |
|CREATE_EXTERNAL_TABLE |
|CREATE_MASKING_POLICY |
|CREATE_ROLE |
|CREATE_SEQUENCE |
|CREATE_STREAM |
|CREATE_TABLE |
|CREATE_TABLE_AS_SELECT |
|CREATE_USER |
|CREATE_VIEW |
|DELETE |
|DESCRIBE |
|DESCRIBE_QUERY |
|DROP |
|DROP_CONSTRAINT |
|DROP_STREAM |
|DROP_USER |
|GET_FILES |
|GRANT |
|INSERT |
|LIST_FILES |
|MERGE |
|PUT_FILES |
|REMOVE_FILES |
|RENAME_COLUMN |
|RENAME_DATABASE |
|RENAME_TABLE |
|RESTORE |
|REVOKE |
|ROLLBACK |
|SELECT |
|SET |
|SHOW |
|TRUNCATE_TABLE |
|UNKNOWN |
|UNLOAD |
|UPDATE |
|USE |
+-------------------------+

Added ours ... 16 extras ... pass it on :-)
QUERY_TYPE
ALTER
ALTER_ACCOUNT
ALTER_PIPE
ALTER_ROLE
ALTER_SESSION
ALTER_TABLE
ALTER_TABLE_ADD_COLUMN
ALTER_TABLE_DROP_CLUSTERING_KEY
ALTER_TABLE_DROP_COLUMN
ALTER_TABLE_MODIFY_COLUMN
ALTER_USER
ALTER_USER_ABORT_ALL_JOBS
ALTER_USER_RESET_PASSWORD
ALTER_WAREHOUSE_RESUME
ALTER_WAREHOUSE_SUSPEND
BEGIN_TRANSACTION
COMMIT
COPY
CREATE
CREATE_CONSTRAINT
CREATE_EXTERNAL_TABLE
CREATE_MASKING_POLICY
CREATE_NETWORK_POLICY
CREATE_ROLE
CREATE_SEQUENCE
CREATE_STREAM
CREATE_TABLE
CREATE_TABLE_AS_SELECT
CREATE_TASK
CREATE_USER
CREATE_VIEW
DELETE
DESCRIBE
DESCRIBE_QUERY
DROP
DROP_CONSTRAINT
DROP_ROLE
DROP_STREAM
DROP_TASK
DROP_USER
GET_FILES
GRANT
INSERT
LIST_FILES
MERGE
PUT_FILES
REMOVE_FILES
RENAME_COLUMN
RENAME_DATABASE
RENAME_FILE_FORMAT
RENAME_ROLE
RENAME_SCHEMA
RENAME_TABLE
RENAME_USER
RENAME_VIEW
RENAME_WAREHOUSE
RESTORE
REVOKE
ROLLBACK
SELECT
SET
SHOW
TRUNCATE_TABLE
UNKNOWN
UNLOAD
UNSET
UPDATE
USE

Here are some additional ones:
ALTER_AUTO_RECLUSTER
ALTER_SET_TAG
ALTER_TABLE_MODIFY_CONSTRAINT
ALTER_UNSET_TAG
CALL
DROP_SESSION_POLICY
RECLUSTER

jq condensing sub array permutation query

I want to extract a CSV with one row for each sub-array item,
given a JSON array whose elements contain a sub-array, e.g. like this one:
[
  {
    "foo": 108,
    "bar": ["a","b"]
  },
  {
    "foo": 201,
    "bar": ["c","d"]
  }
]
It is possible to fetch the data by using an intermediate object:
.[] | { "y": .foo, "x": .bar[] } | [.y,.x] | @csv
https://jqplay.org/s/922RlkbFNA
But I'd like to express it in a less elaborate form.
However, the following does not work:
.[] | [ (.foo, .bar[]) ] | @csv
PS: I struggle to find a fitting headline

In three lines:
.[]
| [.foo] + (.bar[]|[.])
| @csv
or maybe less obscurely:
.[]
| .bar[] as $bar
| [.foo, $bar]
| @csv
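For comparison, the same "one row per sub-array item" expansion can be sketched in Python with the standard csv module. The function name rows_to_csv is hypothetical, and QUOTE_NONNUMERIC is used here only to approximate how @csv quotes strings but not numbers.

```python
import csv
import io
import json

def rows_to_csv(data):
    """Emit one CSV row per (foo, bar item) pair, like `.bar[] as $bar | [.foo, $bar]`."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for item in data:
        for bar in item["bar"]:
            writer.writerow([item["foo"], bar])
    return buf.getvalue()

data = json.loads('[{"foo": 108, "bar": ["a","b"]}, {"foo": 201, "bar": ["c","d"]}]')
print(rows_to_csv(data), end="")
# 108,"a"
# 108,"b"
# 201,"c"
# 201,"d"
```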

splunk query taking long time to return the value, can we eliminate append

I initially used inputlookup to get the output, and the query returned results in a fraction of a second. Now I want to use the source files as input instead, but the Splunk query takes a long time to return output.
Please suggest a way to reduce the run time.
I am thinking of removing the multiple append commands.
index=csvlookups source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_usage.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_dpt_capacity.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_forecasts.csv"
| eval Date=strftime(strptime(Date,"%m/%d/%Y"),"%Y-%m-%d")
| sort Date, CLLI
| rename CLLI as Office
| search Office="CLGRAB21DS1"
| stats sum(Usage) as Usage by Office, Date
| append
[ search index=csvlookups source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_usage.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_dpt_capacity.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_forecasts.csv"
| eval Date=strftime(strptime(Date,"%m/%d/%Y"),"%Y-%m-%d")
| reverse
| search Office="CLGRAB21DS1" AND Type="SIP PBX"
| fields Date NB_RTU
| fields - _raw _time ]
| sort Date
| fillnull value="CLGRAB21DS1" Office
| filldown Usage
| filldown NB_RTU
| fillnull value=0 Usage
| eval _time = strptime(Date, "%Y-%m-%d")
| eval latest_time = if("now" == "now", now(), relative_time(now(), "now"))
| where ((_time >= relative_time(now(), "-3y@h")) AND (_time <= latest_time))
| fields - latest_time Date
| append
[ gentimes start=-1
| eval Date=strftime(mvrange(now(),now()+60*60*24*365*3,"1mon"),"%F")
| mvexpand Date
| fields Date
| append
[ search index=csvlookups source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_usage.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_dpt_capacity.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_forecasts.csv"
| rename "Expected Date of Addition" as edate
| eval edate=strftime(strptime(edate,"%m/%d/%Y"),"%Y-%m-%d")
| rename edate as "Expected Date of Addition"
| table Contact Customer "Expected Date of Addition" "Number of Channels" Switch
| reverse
| search Customer = "Regular Usage" AND Switch = "CLGRAB21DS1"
| rename "Number of Channels" as val
| return $val ]
| reverse
| filldown search
| rename search as Usage
| where Date != ""
| reverse
| append
[ search index=csvlookups source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_usage.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_dpt_capacity.csv" OR source="F:\\SplunkMonitor\\csvlookups\\Core_Network\\lookup_table_sip_pbx_forecasts.csv"
| rename "Expected Date of Addition" as edate
| eval edate=strftime(strptime(edate,"%m/%d/%Y"),"%Y-%m-%d")
| rename edate as "Expected Date of Addition"
| table Contact Customer "Expected Date of Addition" "Number of Channels" Switch
| reverse
| search Customer != "Regular Usage" AND Switch = "CLGRAB21DS1"
| rename "Expected Date of Addition" as Date
| eval _time=strptime(Date, "%Y-%m-%d")
| rename "Number of Channels" as Forecast
| stats sum(Forecast) as Forecast by Date]
| sort Date
| rename Switch as Office
| eval Forecast1 = if(isnull(Forecast),Usage,Forecast)
| fields - Usage Forecast
| streamstats sum(Forecast1) as Forecast
| fields - Forecast1
| eval Date=strptime(Date, "%Y-%m-%d")
| eval Date=if(Date < now(), now(), Date) ]
| filldown Usage
| filldown Office
| eval Forecast = Forecast + Usage
| eval Usage = if(Forecast >= 0,NULL,Usage)
| eval _time=if(isnull(_time), Date, _time)
| timechart limit=0 span=1w max(Usage) as Usage, max(NB_RTU) as NB_RTU, max(Forecast) as Forecast by Office
| rename "NB_RTU: CLGRAB21DS1" as "RTU's Purchased", "Usage: CLGRAB21DS1" as "Usage", "Forecast: CLGRAB21DS1" as "Forecast"
| filldown "RTU's Purchased" |sort -Forecast

Definitely an expensive query you don't want to run often or over large timeranges. In your first append, why are you using reverse? Are you trying to get latest time and earliest time which is why you used the append? You could use earliest and latest for this and eliminate the first subsearch. You could also consider eventstats instead of stats on that first search since you'll still retain the raw data.
You're also summing by _time, so you should think about binning your _time spans (e.g. | bin Date span=1h). Also, why are you using filldown? I'm guessing you want to grab values from different rows and need the rows to match? If so, use streamstats for this.

If inputlookup was working well you should stick with that as you won't get much faster.
It's hard to give specific advice about your query without knowing more about the data and your end goals. In general:
Filter early. Make your base query (before the first '|') as specific as possible. Run your where and search clauses as soon as you can.
Use fields instead of table. It's more efficient.
Sort only when necessary. Usually, it's not necessary.
Fewer appends is better.

jq - How to concatenate an array in json

Struggling with formatting of data in jq. I have 2 issues.
Need to take the last array .rental_methods and concatenate them into 1 line, colon separated.
@csv doesn't seem to work with my query. I get the error string ("5343") cannot be csv-formatted, only array
jq command is this (without the | @csv)
jq --arg LOC "$LOC" '.last_updated as $lu | .data[]|.[]| $lu, .station_id, .name, .region_id, .address, .rental_methods[]'
JSON:
{
  "last_updated": 1539122087,
  "ttl": 60,
  "data": {
    "stations": [{
        "station_id": "5343",
        "name": "Lot",
        "region_id": "461",
        "address": "Austin",
        "rental_methods": [
          "KEY",
          "APPLEPAY",
          "ANDROIDPAY",
          "TRANSITCARD",
          "ACCOUNTNUMBER",
          "PHONE"
        ]
      }
    ]
  }
}
I'd like the output to end up as:
1539122087,5343,Lot,461,Austin,KEY:APPLEPAY:ANDROIDPAY:TRANSITCARD:ACCOUNTNUMBER:PHONE:,

Using @csv:
jq -r '.last_updated as $lu
  | .data[][]
  | [$lu, .station_id, .name, .region_id, .address, (.rental_methods | join(":")) ]
  | @csv'
What you were probably missing with @csv before was an array constructor around the list of things you wanted in the CSV record.

You could repair your jq filter as follows:
.last_updated as $lu
| .data[][]
| [$lu, .station_id, .name, .region_id, .address,
(.rental_methods | join(":"))]
| @csv
With your JSON, this would produce:
1539122087,"5343","Lot","461","Austin","KEY:APPLEPAY:ANDROIDPAY:TRANSITCARD:ACCOUNTNUMBER:PHONE"
... which is not quite what you've said you want. Changing the last line to:
map(tostring) | join(",")
results in:
1539122087,5343,Lot,461,Austin,KEY:APPLEPAY:ANDROIDPAY:TRANSITCARD:ACCOUNTNUMBER:PHONE
This is exactly what you've indicated you want except for the terminating punctuation, which you can easily add (e.g. by appending + "," to the program above) if so desired.
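For readers more comfortable outside jq, the same flattening can be sketched in Python. The function name station_lines is hypothetical; the logic mirrors the filter above (join the rental methods with ":", then join all fields with ",") using the JSON from the question.

```python
import json

def station_lines(doc):
    """Build one comma-separated line per station, like map(tostring) | join(",")."""
    last_updated = doc["last_updated"]
    lines = []
    # .data[][] iterates over every list under "data", then over each station.
    for stations in doc["data"].values():
        for s in stations:
            fields = [
                last_updated,
                s["station_id"],
                s["name"],
                s["region_id"],
                s["address"],
                ":".join(s["rental_methods"]),
            ]
            lines.append(",".join(str(f) for f in fields))
    return lines

doc = json.loads("""
{"last_updated": 1539122087, "ttl": 60,
 "data": {"stations": [{"station_id": "5343", "name": "Lot",
   "region_id": "461", "address": "Austin",
   "rental_methods": ["KEY", "APPLEPAY", "ANDROIDPAY",
                      "TRANSITCARD", "ACCOUNTNUMBER", "PHONE"]}]}}
""")
for line in station_lines(doc):
    print(line)
# 1539122087,5343,Lot,461,Austin,KEY:APPLEPAY:ANDROIDPAY:TRANSITCARD:ACCOUNTNUMBER:PHONE
```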
