Sort by a key, but value has more than one element using Scala - arrays

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop (see below), but I'm wondering if there is a shorter way to do this that takes better advantage of Spark and Scala (i.e. can I decrease computation time?).
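// Split each line, trim the fields, and skip the header row (assuming the file starts with the header shown above)
val names = sc.textFile("names.csv").map(l => l.split(",").map(_.trim)).filter(x => x(0) != "Year")
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length-1) {
  val county = uniqueCounty(i).toString;
  // Number is column index 3 (there is no index 4); convert it to Int so that + adds instead of concatenating
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1),l(3).toInt)).reduceByKey((a,b) => a + b).sortBy(-_._2);
  println("County:" + county + eachCounty.first)
}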

Here is the solution using RDD. I am assuming you need top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort by total in descending order and take only the top name per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
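The two steps of this answer (sum per county/name key, then pick the largest total per county) can be sketched without Spark. This is only a plain-Python illustration of the logic on the sample rows from the question; the function name `top_name_per_county` is made up for the example and is not part of any Spark API.

```python
from collections import defaultdict

# Sample rows from the question: (year, name, county, number)
rows = [
    (2000, "JOHN", "KINGS", 50),
    (2000, "BOB", "KINGS", 40),
    (2000, "MARY", "NASSAU", 60),
    (2001, "JOHN", "KINGS", 14),
    (2001, "JANE", "KINGS", 30),
    (2001, "BOB", "NASSAU", 45),
]

def top_name_per_county(rows):
    # Step 1: sum the numbers per (county, name) key, like reduceByKey(_+_)
    totals = defaultdict(int)
    for _year, name, county, number in rows:
        totals[(county, name)] += number
    # Step 2: keep the (name, total) pair with the largest total in each county
    best = {}
    for (county, name), total in totals.items():
        if county not in best or total > best[county][1]:
            best[county] = (name, total)
    return best

result = top_name_per_county(rows)
print(result)
```

Step 2 is a per-key maximum, so in Spark it could probably also be written as a second reduceByKey that keeps the larger of two (name, total) pairs, which avoids building and sorting a full list per county.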

You could use the spark-csv and the Dataframe API. If you are using the new version of Spark (2.0) it is slightly different. Spark 2.0 has a native csv data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
import java.io.File

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS|    50|
|2000| BOB| KINGS|    40|
|2000|MARY|NASSAU|    60|
|2001|JOHN| KINGS|    14|
|2001|JANE| KINGS|    30|
|2001| BOB|NASSAU|    45|
+----+----+------+------+
The DataFrame API provides a set of operations for structured data manipulation. A few basic operations are enough to get your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU|         60|
| KINGS|         50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the max() function used inside agg().
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN|              2|
|NASSAU|MARY|              1|
+------+----+---------------+

Related

Webscraping to a DataFrame

I am trying to get information from a website, and into a Dataframe, but I'm having some trouble.
I have extracted the data, but I'm trying to merge two dataframes, and reshape them into one. Here is what I have:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.civilaviation.gov.in/'
resp = requests.get(url)
soup = BeautifulSoup(resp.content.decode(), 'html.parser')
div = soup.find('div', {'class':'airport-col vande-bharat-col'})
div2 = soup.find('div', {'class':'airport-col airport-widget'})
div['class'] = 'Domestic traffic'
div2['class'] = 'International traffic'
dom = div.get_text()
intl = div2.get_text()
def str2frame(estr, sep = '\n', lineterm = '\n\n\n\n\n', set_header = True):
    dat = [x.split(sep) for x in estr.split(lineterm)][0:-1]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df
df1 = str2frame(dom)
df2 = str2frame(intl)
df1.rename(columns={"अन्तर्देशीय यातायात Domestic traffic On 29 Jan 2023":"Domestic Traffic"}, inplace=True)
df2.rename(columns={"अंतर्राष्ट्रीय यातायात International traffic On 29 Jan 2023":"International Traffic"}, inplace=True)
So now I get two separate DataFrames with all the information I want, but not in the format I want. Each DataFrame has shape (6, 2) (one of the columns is blank), and I need them merged into one DataFrame of shape (2, 6). At the moment I get:
Domestic Traffic
1 Departing flights 2,967
2 Departing Pax 4,24,224
3 Arriving flights 2,960
4 Arriving Pax 4,18,697
5 Aircraft movements 5,927
6 Airport footfalls 8,42,921
I would like to see two rows, one for domestic and one for international traffic, and each column based on the given values. I apologize if my question or my coding is unclear. This is my first time asking a question on this forum. Thank you for your help.
Not sure if this is the expected result but you could transform and concat your data:
pd.concat([
    df1.set_index(df1.columns[0]).T,
    df2.set_index(df2.columns[0]).T
]).reset_index()
Output
  | 0 | Departing flights | Departing Pax | Arriving flights | Arriving Pax | Aircraft movements | Airport footfalls
0 | अन्तर्देशीय यातायात Domestic traffic On 30 Jan 2023 | 2,862 | 4,07,957 | 2,864 | 4,04,799 | 5,726 | 8,12,756
1 | अंतर्राष्ट्रीय यातायात International traffic On 30 Jan 2023 | 433 | 90,957 | 516 | 82,036 | 949 | 1,72,993
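The reshaping idea (one row per traffic type, one column per label) can also be sketched in plain Python, which makes it easy to turn the comma-grouped strings such as 4,07,957 into integers along the way. This is only an illustration using the values from the output above; the helper `to_row` and the "Traffic" key are names invented for the example, not part of pandas.

```python
# (label, value) pairs as they appear in the scraped text, values still strings
domestic = [
    ("Departing flights", "2,862"), ("Departing Pax", "4,07,957"),
    ("Arriving flights", "2,864"), ("Arriving Pax", "4,04,799"),
    ("Aircraft movements", "5,726"), ("Airport footfalls", "8,12,756"),
]
international = [
    ("Departing flights", "433"), ("Departing Pax", "90,957"),
    ("Arriving flights", "516"), ("Arriving Pax", "82,036"),
    ("Aircraft movements", "949"), ("Airport footfalls", "1,72,993"),
]

def to_row(traffic, pairs):
    # One dict per traffic type; strip the grouping commas before converting
    row = {"Traffic": traffic}
    for label, value in pairs:
        row[label] = int(value.replace(",", ""))
    return row

table = [to_row("Domestic", domestic), to_row("International", international)]
for row in table:
    print(row)
```

A list of dicts like `table` can be passed straight to `pd.DataFrame(table)` if a DataFrame is still wanted at the end.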

How to compare two array of string columns in Pyspark

I want to compare two arrays and filter the data frame
condition_1 = "AAA"
condition_2 = ["AAA","BBB","CCC"]
My Spark data frame has a column holding an array of strings:
df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))
# to filter a string condition_1 with the array column
df = df.filter(
    F.col('array_column').isin(condition_1) &
    # second filter here
)
But how can I filter on condition_2 in a similar way, since both are arrays?
Code I tried:
df = df.filter(
    F.col('array_column').isin(condition_1) &
    any(x in condition_2 for x in F.col('array_column'))
)
But I get an error - Column is not iterable.
I also tried - bool(set(F.col('array_column')).intersection(condition_2))
But still have the same error. Can anyone help me with this?
Hope I got your question right; it wasn't entirely clear. Use PySpark's array functions.
Data
condition_1 = 'AAA'
condition_2 = ["AAA","BBB","CCC"]
df=spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+------------------+
| ID|Company_Id|  value|      array_column|
+---+----------+-------+------------------+
| 1A|   3412asd|value-1|   [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C|   9800bvd|value-3|             [AAA]|
| 3C|   9800bvd|value-1|  [AAA, YYY, CCCC]|
+---+----------+-------+------------------+
Code
from pyspark.sql.functions import array, array_contains, array_intersect, col, lit, size
df.where((array_contains(col('array_column'), lit(condition_1)))&(size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0)).show(truncate=False)
Outcome
+---+----------+-------+----------------+
|ID |Company_Id|value  |array_column    |
+---+----------+-------+----------------+
|1A |3412asd   |value-1|[XXX, YYY, AAA] |
|3C |9800bvd   |value-3|[AAA]           |
|3C |9800bvd   |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+
How it works
condition_1: get a boolean selection of the rows where the column contains the string
array_contains(col('array_column'), lit(condition_1))
condition_2: this happens in stages
1. Intersect the column with the list
array_intersect(col('array_column'),array([lit(x) for x in condition_2]))
2. Get the size of the outcome of step 1
size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))
3. Check that the intersection contains at least one item
size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0
Finally, chain condition_1 and condition_2 using the & operator and pass the result into the df.where() or df.filter() method.
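The same two conditions can be checked by hand in plain Python, which may help when reasoning about what the Spark expression should return. This is only a sketch of the logic on the sample rows, not PySpark code; the helper name `keep` is invented for the example.

```python
condition_1 = "AAA"
condition_2 = ["AAA", "BBB", "CCC"]

# Sample rows: (ID, Company_Id, value, array_column)
rows = [
    ("1A", "3412asd", "value-1", ["XXX", "YYY", "AAA"]),
    ("2B", "2345tyu", "value-2", ["DDD", "YFFFYY", "GGG"]),
    ("3C", "9800bvd", "value-3", ["AAA"]),
    ("3C", "9800bvd", "value-1", ["AAA", "YYY", "CCCC"]),
]

def keep(array_column):
    # condition_1: the array contains the single string
    contains_c1 = condition_1 in array_column
    # condition_2: the intersection with the list has at least one item
    overlap = set(array_column) & set(condition_2)
    return contains_c1 and len(overlap) != 0

kept = [row for row in rows if keep(row[3])]
for row in kept:
    print(row)
```

With this particular sample, condition_1 is itself a member of condition_2, so every row that passes the first check also passes the second; the second check only matters when the two conditions hold different values.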

Compare two arrays from two different dataframes in Pyspark

I have two dataframes; each has an array(string) column.
I am trying to create a new data frame that keeps only the rows where one of the array elements in a row matches an element in the other.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('3',['QQQ']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+---------------------+
| ID|Company_Id|  value|         array_column|
+---+----------+-------+---------------------+
| 1A|   3412asd|value-1|      [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C|   9800bvd|value-3|                [AAA]|
| 3C|   9800bvd|value-1|     [AAA, YYY, CCCC]|
+---+----------+-------+---------------------+
Code I tried:
The main idea is to use rdd.toLocalIterator() as there are some other functions inside the same for loop that depend on these filters
for x in main_df.rdd.toLocalIterator():
    a = x["refer_array_col"]
    b = x["No"]
    some_x_filter = F.col('array_column').isin(b)
    final_df = df.filter(
        # filter 1
        some_x_filter &
        # second filter is to compare 'a' with array_column - I tried using F.array_contains
        (F.array_contains(F.col('array_column'), F.lit(a)))
    )
some_x_filter works in a similar way: it compares a string value against a column holding an array of strings.
But now a contains a list of strings, and I am unable to compare it with array_column.
With my code I get an error from array_contains:
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what I can use for the second filter instead?
From what I understood based on our conversation in the comments, your requirement is essentially to compare an array column with a Python list.
This should do the job:
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))
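To make the intended comparison concrete, here is a plain-Python sketch of "keep the rows of the second dataframe whose array shares at least one element with the reference array", run once per row of the first dataframe. It assumes that this overlap check is what the question is after; `overlapping_rows` is a name invented for the example. In Spark 2.4 and later, pyspark.sql.functions.arrays_overlap expresses the same check on two array columns and may be worth trying together with the F.array(...) column built above.

```python
# Rows of main_df: (No, refer_array_col)
main_rows = [
    ("1", ["YYY", "MZA"]),
    ("2", ["XXX", "YYY"]),
    ("3", ["QQQ"]),
    ("4", ["RRR", "ZZZ", "BBB1"]),
]
# Rows of df: (ID, Company_Id, value, array_column)
df_rows = [
    ("1A", "3412asd", "value-1", ["XXX", "YYY", "AAA"]),
    ("2B", "2345tyu", "value-2", ["DDD", "YFFFYY", "GGG", "1"]),
    ("3C", "9800bvd", "value-3", ["AAA"]),
    ("3C", "9800bvd", "value-1", ["AAA", "YYY", "CCCC"]),
]

def overlapping_rows(refer_array, rows):
    # Keep rows whose array_column shares at least one element with refer_array
    wanted = set(refer_array)
    return [row for row in rows if wanted & set(row[3])]

# For each main_df row, list the (ID, value) pairs of the matching df rows
matches = {
    no: [(row[0], row[2]) for row in overlapping_rows(refer, df_rows)]
    for no, refer in main_rows
}
print(matches)
```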

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid using both though, just need the code run as efficiently as possible). I'm using Spark 2.4.4
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
regexp_replace works at the character level.
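A small illustration of why an empty pattern gives odd results: an empty regular expression matches at every position in the string, so the replacement is inserted between all characters. The sketch below uses Python's re module rather than Spark; Spark relies on Java regular expressions, which I would expect to behave the same way in this case, but that is an assumption worth checking.

```python
import re

# The empty pattern matches before each character and at the end of the string
replaced = re.sub("", "ZZZ", "AAA")
print(replaced)

# On an empty string there is exactly one (empty) match
replaced_empty = re.sub("", "ZZZ", "")
print(replaced_empty)
```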
I could not get it to work with transform either, but with help from the first answerer I used a UDF, which was not that easy.
Here is my example with my own data; you can tailor it to yours.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

# Replace None elements with con_str; a null array becomes [con_str]
concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)
arrayData = [
('James',['Java','Scala']),
('Michael',['Spark','Java',None]),
('Robert',['CSharp','']),
('Washington',None),
('Jefferson',['1','2'])]
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
|      name|    knownLanguages|
+----------+------------------+
|     James|     [Java, Scala]|
|   Michael|[Spark, Java, ZZZ]|
|    Robert|        [CSharp, ]|
|Washington|             [ZZZ]|
| Jefferson|            [1, 2]|
+----------+------------------+
Quite tough this, had some help from the first answerer.
I have an idea, but I'm not sure how efficient it is.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values as a string with a delimiter , (hoping you don't have it in your string but you can use something else). I use the null_replacement option to fill the null values. Then I split according to the same delimiter.
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn(
    "array_list_2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
SQL built-in transform is not working for me, so I couldn't try it but hopefully you'll have the result you wanted.
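The difference between the two ideas (join then split, versus replacing element by element) shows up on empty arrays. The sketch below models both in plain Python on rows like those in the question; it is not Spark code, the function names are invented for the example, and it treats both None and empty strings as "missing" because the question does not say which one the data contains.

```python
def join_split(arr, default="ZZZ", sep=","):
    # Model of the join-then-split idea: fill None while joining, then split again
    joined = sep.join(default if x is None else x for x in arr)
    return joined.split(sep)

def fill_missing(arr, default="ZZZ"):
    # Element-wise replacement; an empty or null array becomes [default]
    if not arr:
        return [default]
    return [default if x is None or x == "" else x for x in arr]

samples = [
    ["AAA", None, "JLT"],
    ["DFG"],
    [],
    [None, None, None, "HKJ"],
]
for arr in samples:
    print(arr, join_split(arr), fill_missing(arr))
```

In this model the join-then-split version returns a list holding one empty string for an empty array, so the rows shown as [] in the question would need separate handling to end up as [ZZZ]; it is worth confirming whether Spark behaves the same way before relying on either approach.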

Remove garbage (#,$) values from any string and drop records that contain only garbage (#,$) values, with multiple occurrences in multiple columns

I tried the code below to drop records that contain garbage values with multiple occurrences in multiple columns, but I also want to remove the garbage characters from strings that contain them, again with multiple occurrences across multiple columns.
Sample Code :-
filter_list = ['$','#','%','#','!','^','&','*','null']
def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in
        filter_list] for elt in x]))
    return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example, "original" is my original dataframe and "compulsory_fields" is my array of column names (it holds multiple columns).
Sample Input :-
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample Output :-
id name salary
3 Jaydeep 8000
4 priya 4000
Your requirements are not completely clear to me, but it seems you want to output records that are valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters udf that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from functools import reduce  # reduce is not a builtin in Python 3
from itertools import chain
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import BooleanType,StringType
rdd = sc.parallelize((
('#','Yogita','1000'),
('2', 'Neha', '##'),
('3', '#Jay$deep##','8000'),
('4', 'Priya', '40$00&'),
('5', 'Bhavana', '$$%&^'),
('6', '$%','$$&&'))
)
original = rdd.toDF(['id','name','salary'])
filter_list = ['$','#','%','#','!','^','&','*','null']
compulsory_fields = ['id','name','salary']
def clean_special_characters(input_string):
    cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c)==1})
    if cleaned_input == '':
        return 'null'
    return cleaned_input
clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))
def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in
        filter_list] for elt in x]))
    return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id|   name|salary|
+---+-------+------+
|  3|Jaydeep|  8000|
|  4|  Priya|  4000|
+---+-------+------+
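The clean-then-filter idea can be tried out on the sample input without Spark. The sketch below is plain Python, not the UDF code above: it strips the single-character garbage symbols with str.translate and then drops any row in which a field ends up empty. The names `strip_table` and `clean_row` are invented for the example, and the 'null' entry of filter_list is left out for simplicity.

```python
# Single-character garbage symbols from filter_list
garbage = "$#%!^&*"
# Mapping a code point to None makes str.translate delete that character
strip_table = {ord(c): None for c in garbage}

rows = [
    ("#", "Yogita", "1000"),
    ("2", "Neha", "##"),
    ("3", "#Jay$deep##", "8000"),
    ("4", "Priya", "40$00&"),
    ("5", "Bhavana", "$$%&^"),
    ("6", "$%", "$$&&"),
]

def clean_row(row):
    # Remove garbage characters from every field of the row
    return tuple(field.translate(strip_table) for field in row)

cleaned = [clean_row(row) for row in rows]
# Keep only rows where every field still has content after cleaning
kept = [row for row in cleaned if all(field != "" for field in row)]
for row in kept:
    print(row)
```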
