PySpark: concatenate two columns in sorted order

I would like to concatenate two columns so that the values are joined in sorted order, regardless of which column each value is in.
For example, I have a dataframe like this:
|-------------------|-----------------|
| column_1 | column_2 |
|-------------------|-----------------|
| aaa | bbb |
|-------------------|-----------------|
| bbb | aaa |
|-------------------|-----------------|
I want to get a dataframe like this:
|-------------------|-----------------|-----------------|
| column_1 | column_2 | concated_cols |
|-------------------|-----------------|-----------------|
| aaa | bbb | aaabbb |
|-------------------|-----------------|-----------------|
| bbb | aaa | aaabbb |
|-------------------|-----------------|-----------------|

For Spark >= 2.4:
from pyspark.sql import functions as F
df.withColumn(
    "concated_cols",
    F.array_join(F.array_sort(F.array(F.col("column_1"), F.col("column_2"))), ""),
).show()
For Spark <= 2.3, use a simple UDF:
from pyspark.sql import functions as F
@F.udf
def concat(*cols):
    return "".join(sorted(cols))
df.withColumn("concated_cols", concat(F.col("column_1"), F.col("column_2"))).show()
+--------+--------+-------------+
|column_1|column_2|concated_cols|
+--------+--------+-------------+
| aaa| bbb| aaabbb|
| bbb| aaa| aaabbb|
+--------+--------+-------------+
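The UDF body is plain Python, so its behaviour can be checked without a Spark session. Below is a minimal sketch, assuming that null values should be skipped rather than raise an error. That null policy and the name concat_sorted are assumptions, not part of the answer above (in Python 3, sorted() raises a TypeError when it has to compare None with a string).

```python
def concat_sorted(*cols):
    """Join the values in sorted order, skipping None (an assumed null policy)."""
    present = [c for c in cols if c is not None]
    return "".join(sorted(present))


# Rows from the question: the order of the values within a row should not matter.
rows = [("aaa", "bbb"), ("bbb", "aaa")]
results = [concat_sorted(a, b) for a, b in rows]
print(results)
```

Wrapping this function with F.udf should give the same column as the UDF above for non-null inputs.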


Search row for array match of a specific word, and return the dates from column A that match

Objective:
Look up each Class ID (ignoring the trailing **) and return the corresponding date in B2:B.
Then look up the Class Event and return the corresponding date in C2:C.
I have tried combinations of HLOOKUP, VLOOKUP, INDEX & MATCH, and QUERY, but cannot get it to work correctly.
My Sheet:
Column | A | B? | C? | D | E | F |
Row1 | [Class ID's] | [Class ID Date] | [Class Event Date] | [Dates] | [Name1] | [Name2] |
Row2 | Class ID1 | 01/02/2021 | 01/04/2021 | 01/01/2021 | | Class ID3** |
Row3 | Class ID2 | 01/08/2021 | 01/09/2021 | 01/02/2021 | Class ID1** | |
Row4 | Class ID3 | 01/01/2021 | 01/07/2021 | 01/03/2021 | | Class ID4** |
Row5 | Class ID4 | 01/03/2021 | 01/09/2021 | 01/04/2021 | Class Event | |
Row6 | Class ID5 | * Formula #1 * | * Formula #2 * | 01/05/2021 | | |
Row7 | Class ID6 | | | 01/06/2021 | | |
Row8 | Class ID7 | | | 01/07/2021 | | Class Event |
Row9 | Class ID8 | | | 01/08/2021 | Class ID2** | |
Row10 | Class ID9 | | | 01/09/2021 | Class Event | Class Event |
Row11 | Class ID10 | | | 01/10/2021 | | |
Formula #1 (Column: B)
Find the class ID and return the dates into range B2:B (working, but not efficient):
=IFERROR(INDEX($D$2:$D,MATCH($A2,E$2:E,0)),INDEX($D$2:$D,MATCH($A2,F$2:F,0)))
...and so on for each column (there are 60 columns).
Formula #2 (Column: C)
Find the class ID, search the column, find the first instance of "Class Event", and return the date into range C2:C:
="I have absolutely no clue for this one"
Is this even possible in Google Sheets?
I can use Excel if needed (but preferably not, as this sheet pulls data from another Google Sheet).
C2:
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B, {F2:F, E2:E; G2:G, E2:E}, 2, 0)))
D2:
As the dataset is currently laid out, there is nothing to pair events with specific IDs. It would be possible if you had
Class ID4 Event
instead of just
Class Event
update 1:
C2 for 60 columns would be:
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B,
SPLIT(FLATTEN(IF(F2:BN="",,F2:BN&"×"&E2:E)), "×"), 2, 0)))
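The idea behind this formula (flatten the grid row by row into label/date pairs, then take the first pair that matches each ID) can be modelled in plain Python. This is only a sketch against the layout in the question (dates in D, names in E:F), not the shifted layout the formulas use. The function name first_date_by_label and the stripping of the "**" suffix are assumptions.

```python
def first_date_by_label(dates, grid):
    """Map each non-empty cell label to the date of the first row it appears in."""
    lookup = {}
    for date, row in zip(dates, grid):           # row by row, like FLATTEN
        for cell in row:
            if not cell:
                continue                         # skip blanks, like IF(range="",,...)
            label = cell.split("**")[0].strip()  # assumed: drop the "**" suffix
            lookup.setdefault(label, date)       # keep the first match, like VLOOKUP
    return lookup


# Column D (dates) and columns E:F (names) from the question's sheet.
dates = [f"01/{day:02d}/2021" for day in range(1, 11)]
grid = [
    ["", "Class ID3**"],
    ["Class ID1**", ""],
    ["", "Class ID4**"],
    ["Class Event", ""],
    ["", ""],
    ["", ""],
    ["", "Class Event"],
    ["Class ID2**", ""],
    ["Class Event", "Class Event"],
    ["", ""],
]
lookup = first_date_by_label(dates, grid)
class_ids = ["Class ID1", "Class ID2", "Class ID3", "Class ID4", "Class ID5"]
column_b = [lookup.get(class_id, "") for class_id in class_ids]
print(column_b)
```

An ID that never appears in the grid gets an empty string here, which plays the role of IFNA in the formula.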
update 2:
Try this in D2 (it will work only if a class event follows each class ID):
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B, {
QUERY(SPLIT(FLATTEN(IF(TRANSPOSE(F2:BN)="",,
TRANSPOSE(F2:BN)&"×"&TRANSPOSE(E2:E))), "×"),
"select Col1 where Col2 is not null", 0), {
QUERY(QUERY(SPLIT(FLATTEN(IF(TRANSPOSE(F2:BN)="",,
TRANSPOSE(F2:BN)&"×"&TRANSPOSE(E2:E))), "×"),
"select Col2 where Col2 is not null", 0),
"offset 1", 0); ""}}, 2, 0)))

How to return first not empty cell from importrange values?

My Google Sheets document contains data like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | | c | | x | |
+---+---+---+---+---+---+
| 2 | | r | | 4 | |
+---+---+---+---+---+---+
| 3 | | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
Columns B and D contain data provided by the IMPORTRANGE function, and the source data is stored in different files.
I would like to fill column A with the first non-empty value in each row. In other words, the desired result looks like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | r | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
I tried the ISBLANK function, but apparently an imported cell is not blank even when its value is empty, so this function doesn't work in my case. Then I tried the QUERY function in two variants:
1) =QUERY({B1;D1}; "select Col1 where Col1 is not null limit 1"; 0), but the result is wrong when the row contains cells with numbers. The result of this query is the following:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | 4 | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
2) =QUERY({B1;D1};"select Col1 where Col1 <> '' limit 1"; 0) / =QUERY({B1;D1};"select Col1 where Col1 != '' limit 1"; 0), and this doesn't work at all; the result is always #N/A.
Also, I would like to avoid nested IFs and JavaScript scripts if possible, as a solution with the QUERY function suits my case best because it is easy to extend to other columns without deeper programming knowledge. Is there a simple way to do this with just QUERY that I am missing, or do I have to use IFs/JavaScript?
try:
=ARRAYFORMULA(SUBSTITUTE(INDEX(IFERROR(SPLIT(TRIM(TRANSPOSE(QUERY(
TRANSPOSE(SUBSTITUTE(B:G, " ", "♦")),,99^99))), " ")),,1), "♦", " "))
selective columns:
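The rule being asked for (take the first cell in each row that is not empty, treating an imported empty string as empty) can be modelled in plain Python. This is a sketch of the rule only, not a translation of the formula above. The function name first_non_empty and its optional columns argument, which restricts the search to chosen columns, are assumptions.

```python
def first_non_empty(row, columns=None):
    """Return the first cell that is not None or blank, or "" if there is none."""
    cells = row if columns is None else [row[i] for i in columns]
    for cell in cells:
        if cell is not None and str(cell).strip() != "":
            return cell
    return ""


# Columns B..E from the question; "" stands for an empty imported cell.
rows = [
    ["c", "", "x", ""],
    ["r", "", 4, ""],
    ["", "", "m", ""],
    ["", "", "", ""],
]
column_a = [first_non_empty(row) for row in rows]
# Only look at columns B and D (indexes 0 and 2 of each row).
column_a_selected = [first_non_empty(row, columns=[0, 2]) for row in rows]
print(column_a)
```

Unlike the first QUERY attempt in the question, mixed text and numbers in one row do not change which cell is picked.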

How to get data to other sheet if cell have number value only

I have a Google Sheet with some data, like below:
Sheet one
Column A | | Column B
=================================
Hello | | *1*
World! | | p
Foo | | *3*
Bar | | L
Bar1 | | *0*
In sheet 2, I want only the rows that have numbers:
Sheet two
Column A | | Column B
=================================
Hello | | *1*
Foo | | *3*
Bar1 | | *0*
Hope you understand what I want.
try:
=FILTER(Sheet1!A1:B, ISNUMBER(Sheet1!B1:B*1))
or maybe:
=FILTER(Sheet1!A1:B, ISNUMBER(REGEXEXTRACT(Sheet1!B1:B, "\*(\d+)\*")*1))
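Both formulas boil down to keeping the rows whose second column holds a number. Here is a rough Python model of that filter. The pattern is an assumption that accepts digits with or without literal asterisks around them (so it is not identical to the regular expression above), and the name rows_with_numbers is hypothetical.

```python
import re

# Assumed pattern: digits only, optionally wrapped in literal asterisks.
NUMERIC = re.compile(r"\*?\d+\*?")


def rows_with_numbers(rows):
    """Keep the (label, value) rows whose value is numeric."""
    return [row for row in rows if NUMERIC.fullmatch(str(row[1]).strip())]


# Sheet one from the question.
sheet_one = [
    ("Hello", "*1*"),
    ("World!", "p"),
    ("Foo", "*3*"),
    ("Bar", "L"),
    ("Bar1", "*0*"),
]
sheet_two = rows_with_numbers(sheet_one)
print(sheet_two)
```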

How to convert a row (array of strings) to a dataframe column [duplicate]

I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus', max_dist=50), Row(value=u'normal bus', max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20), Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 & 4) to multiple rows while retaining the 'id', to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+
Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title"))
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
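To see what explode does to each row, here is a small pure-Python model that needs no Spark session. It is a sketch of the semantics only: rows are plain dicts and the name explode_rows is hypothetical. Like explode, it emits nothing for a row whose array is null or empty (explode_outer keeps such rows with a null instead).

```python
def explode_rows(rows, column):
    """Yield one copy of each row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column] or []:  # None or [] yields no rows
            yield {**row, column: element}


# The question's data as plain dicts; each title is a (max_dist, value) tuple.
documents = [
    {"id": 1, "title": [(1000, "cars")]},
    {"id": 2, "title": [(50, "horse bus"), (100, "normal bus")]},
    {"id": 3, "title": [(5000, "Airplane")]},
    {"id": 4, "title": [(20, "Bicycles"), (80, "Motorbikes")]},
    {"id": 5, "title": [(15, "Trams")]},
]
exploded = list(explode_rows(documents, "title"))
for row in exploded:
    print(row["id"], row["title"])
```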
OK, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects because I couldn't find a way to append to a Row object.
That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
def add_id(row):
    # row[0] is the id, row[1] is the list of title Rows
    it_list = []
    for i in range(0, len(row[1])):
        sm_list = []
        for j in row[1][i]:
            sm_list.append(j)
        sm_list.append(row[0])  # append the id to each flattened title
        it_list.append(sm_list)
    return it_list
with_id = documents.rdd.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+
I am using the Spark Dataset API, and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID","explode(finished_chunk) as chunks");
Note: The explode method of the Dataset API is deprecated in Spark 2.4.5, and the documentation suggests using select (shown above) or flatMap instead.

Hive Query: working with String Array

I have a Hive table with the following schema:
hive>desc books;
gen_id int
author array<string>
rating double
genres array<string>
hive>select * from books;
| gen_id | rating | author |genres
+----------------+-------------+---------------+----------
| 1 | 10 | ["A","B"] | ["X","Y"]
| 2 | 20 | ["C","A"] | ["Z","X"]
| 3 | 30 | ["D"] | ["X"]
Is there a query where I can perform some SELECT operation and that returns individual rows, like this:
| gen_id | rating | SplitData
+-------------+---------------+-------------
| 1 | 10 | "A"
| 1 | 10 | "B"
| 1 | 10 | "X"
| 1 | 10 | "Y"
| 2 | 20 | "C"
| 2 | 20 | "A"
| 2 | 20 | "Z"
| 2 | 20 | "X"
| 3 | 30 | "D"
| 3 | 30 | "X"
Can someone guide me on how to get this result? Thanks in advance for any help.
You need to use LATERAL VIEW with explode, i.e.:
SELECT
    gen_id,
    rating,
    SplitData
FROM (
    SELECT
        gen_id,
        rating,
        array (ex_author,ed_genres) AS ar_SplitData
    FROM
        books
    LATERAL VIEW explode(books.author) exploded_authors AS ex_author
    LATERAL VIEW explode(books.genres) exploded_genres AS ed_genres
) tab
LATERAL VIEW explode(tab.ar_SplitData) exploded_SplitData AS SplitData;
I had no chance to test it, but it should show you the general path. GL!
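As far as I can tell, chaining two LATERAL VIEW explode clauses pairs every author with every genre, so a query of this shape may return repeated values. A plain-Python model (not Hive, and both function names are hypothetical) compares the result asked for in the question with what the cross product followed by a second explode would give:

```python
from itertools import product

# The books table from the question as plain dicts.
books = [
    {"gen_id": 1, "rating": 10, "author": ["A", "B"], "genres": ["X", "Y"]},
    {"gen_id": 2, "rating": 20, "author": ["C", "A"], "genres": ["Z", "X"]},
    {"gen_id": 3, "rating": 30, "author": ["D"], "genres": ["X"]},
]


def split_data(rows):
    """One output row per author, then per genre (the result asked for)."""
    return [
        (row["gen_id"], row["rating"], value)
        for row in rows
        for value in row["author"] + row["genres"]
    ]


def cross_then_explode(rows):
    """Model of two chained lateral views, then exploding each (author, genre) pair."""
    return [
        (row["gen_id"], row["rating"], value)
        for row in rows
        for pair in product(row["author"], row["genres"])
        for value in pair
    ]


wanted = split_data(books)
chained = cross_then_explode(books)
print(len(wanted), len(chained))
```

In Hive, one way to get the first shape might be to explode each array in its own SELECT and combine the two with UNION ALL. I have not tested that either.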
