I have a dataframe whose columns hold dictionaries containing multiple lists, and I would like to create a new dataframe by extracting a certain element, 'Student'.
{"Student":["Grad","School"], "Comments": "Finished Education"}
{"Student":["New"], "Comments": "Started Education", "Location" : ["USA", "China", "Australia"]}
{"Student": ["Middle", "School"], "ID" : ["1000", "2000"]}
{"Student": ["Med","School"]}
Expected output:
Grad, School
Middle, School
Med, School
I have tried to read the dataframe into a dictionary, but was unable to retrieve only the 'Student' element from the dictionary.
data_dict = {}
df = df.toPandas()
for column in df.columns:
    data_dict[column] = df[column].values.tolist()
Student = [data for data in data_dict.values()]

First of all, what you have are not dictionaries and lists. In a Spark dataframe, what you call a dictionary is a struct, and what you call a list is an array. You can inspect the data types with df.printSchema().
You can extract the "Student" fields from the structs, then add everything to an array in order to finally explode.
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[((["Grad","School"], "Finished Education"),(['New'], "Started Education", ["USA", "China", "Australia"]),(["Middle", "School"], ["1000", "2000"]),(["Med","School"],))],
'`1` struct<Student:array<string>,Comments:string>, `2` struct<Student:array<string>,Comments:string,Location:array<string>>, `3` struct<Student:array<string>,ID:array<string>>, `4` struct<Student:array<string>>')
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
# +----------------+
# |         Student|
# +----------------+
# |  [Grad, School]|
# |           [New]|
# |[Middle, School]|
# |   [Med, School]|
# +----------------+
Another working option:
to_melt = [f"`{c}`.Student" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) Student")
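If the data is small enough to have been collected to the driver (as the toPandas() attempt in the question does), the same extraction can be sketched in plain Python. This only illustrates the logic and is not Spark code; the `row` dictionary and the helper name `extract_student` are made up for the example and mirror the sample data from the question.

```python
# Hypothetical stand-in for one collected row: column name -> struct-like dict.
row = {
    "1": {"Student": ["Grad", "School"], "Comments": "Finished Education"},
    "2": {"Student": ["New"], "Comments": "Started Education",
          "Location": ["USA", "China", "Australia"]},
    "3": {"Student": ["Middle", "School"], "ID": ["1000", "2000"]},
    "4": {"Student": ["Med", "School"]},
}


def extract_student(row):
    """Return the 'Student' list of every struct-like value, in column order."""
    return [struct["Student"] for struct in row.values() if "Student" in struct]


students = extract_student(row)
for s in students:
    print(", ".join(s))
```

Each printed line corresponds to one row of the exploded 'Student' column in the Spark version.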


Compare two arrays from two different dataframes in Pyspark

I have two dataframes, each of which has an array(string) column.
I am trying to create a new dataframe that keeps only the rows where at least one element of the array in a row matches an element of the other array.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
| ID|Company_Id| value| array_column |
| 1A| 3412asd|value-1| [XXX, YYY, AAA] |
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C| 9800bvd|value-3| [AAA] |
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC] |
Code I tried:
The main idea is to use rdd.toLocalIterator(), because other functions inside the same for loop depend on these filters.
for x in main_df.rdd.toLocalIterator():
    a = x["refer_array_col"]  # a Python list, e.g. ['YYY', 'MZA']
    b = x["No"]
    some_x_filter = F.col('array_column').isin(b)
    final_df = df.filter(
        # filter 1
        some_x_filter &
        # second filter is to compare 'a' with array_column - I tried using F.array_contains
        # (this is the line that fails: F.lit cannot take a Python list)
        (F.array_contains(F.col('array_column'), F.lit(a)))
    )
some_x_filter works in a similar way: it compares a string value against an array-of-strings column.
But now a contains a list of strings, and I am unable to compare it with array_column.
With my code I get an error from array_contains:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what I can use for the second filter instead?
Based on our conversation in the comments, your requirement is essentially to compare an array column with a Python list.
Building an array column from the list (b in the line below) would do the job:
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))
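To then keep only the rows whose array shares at least one element with the list, Spark 2.4+ provides arrays_overlap in pyspark.sql.functions, which could be applied to array_column and the asArray column. The snippet below is a plain-Python model of that check on the sample rows from the question, not Spark code; the function name `overlaps` is made up for illustration, and Spark's handling of nulls differs from this sketch.

```python
def overlaps(left, right):
    """Return True if the two lists share at least one element."""
    return not set(left).isdisjoint(right)


# (ID, array_column) pairs taken from the second dataframe in the question.
rows = [
    ("1A", ["XXX", "YYY", "AAA"]),
    ("2B", ["DDD", "YFFFYY", "GGG", "1"]),
    ("3C", ["AAA"]),
    ("3C", ["AAA", "YYY", "CCCC"]),
]
reference = ["YYY", "MZA"]  # refer_array_col of the first main_df row

matching = [row for row in rows if overlaps(row[1], reference)]
print(matching)
```

Only the rows containing 'YYY' survive, which is the behaviour the second filter in the question is after.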

How to create a new array of substrings from string array column in a spark dataframe

I have a spark dataframe. One of the columns is an array type consisting of an array of text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique left 8 characters of those strings.
|-- arr_agent: array (nullable = true)
| |-- element: string (containsNull = true)
example data from column "arr_agent":
What I need to have in the new column:
I already tried to define a udf that does it for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T
def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return list(out_arr)  # the UDF must return a value matching ArrayType(StringType())
make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
Throws an error
AnalysisException: grouping expressions sequence is empty,
Any tips would be appreciated.
You can solve this with the higher-order function transform (available from Spark 2.4) combined with substring, and then apply array_distinct to the result:
from pyspark.sql import functions as F
n = 8
out = df.withColumn("New",F.expr(f"array_distinct(transform(arr_agent,x->substring(x,0,{n})))"))
|arr_agent |New |
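As a plain-Python model of what that expression computes for each row (the first n characters of every string, with duplicates removed), consider the sketch below. It is not Spark code; the function name and the sample strings are invented for illustration. The same logic could also serve as the body of the UDF from the question, called without collect_set, since collect_set is an aggregate function and is what triggers the AnalysisException.

```python
def unique_prefixes(strings, n=8):
    """Return the distinct first-n-character prefixes, keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(s[:n] for s in strings))


# Hypothetical sample values for one row of arr_agent.
sample = ["Mozilla/5.0 (X11)", "Mozilla/5.0 (Windows)", "curl/7.68.0", "Opera"]
print(unique_prefixes(sample))
```

Strings shorter than n characters are kept whole, which matches how slicing (and substring) behaves on short inputs.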

parsing a JSON string Pyspark dataframe column that has string of array in one of the columns

I am trying to read a JSON file and parse 'jsonString' and the underlying fields which includes array into a pyspark dataframe.
Here is the contents of json file.
[{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"},
{"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\"modifiedBy\":\"value14\"}","transactionId": "value15", "tableName": "X1"},
{"jsonString": "{\"uid\":\"value21\",\"adUsername\":\"value23\",\"modifiedBy\":\"value24\"}","transactionId": "value25", "tableName": "X2"}]
I am able to parse contents of string 'jsonString' and select required columns using the below logic
df = spark.read.json('path.json',multiLine=True)
df = df.withColumn('courseCertifications', explode(array(get_json_object(df['jsonString'],'$.courseCertifications'))))
Now my end goal is to parse field "courseType" from "courseCertifications" and create one row per instance.
I am using below logic to get "courseType"
df = df.withColumn('new',get_json_object(df.courseCertifications, '$[*].courseType'))
I am able to get the contents of "courseType" but as a string as shown below
My end goal is to create a dataframe with columns transactionId, jsonString.uid, jsonString.adUsername, jsonString.courseCertifications.uid, jsonString.courseCertifications.courseType
I need to retain all the rows and create multiple rows one per array instances of courseCertifications.uid/courseCertifications.courseType.
An elegant way to solve this is to define the schema of the JSON string and then parse it with the from_json function:
import pyspark.sql.functions as f
from pyspark.shell import spark
from pyspark.sql.types import ArrayType, StringType, StructType, StructField
df = spark.read.json('your_path', multiLine=True)
schema = StructType([
    StructField('uid', StringType()),
    StructField('adUsername', StringType()),
    StructField('modifiedBy', StringType()),
    # courseCertifications is an array of structs, so wrap the fields in a StructType
    StructField('courseCertifications', ArrayType(StructType([
        StructField('uid', StringType()),
        StructField('courseType', StringType())
    ])))
])
df = df \
    .withColumn('tmp', f.from_json(df.jsonString, schema)) \
    .withColumn('adUsername', f.col('tmp').adUsername) \
    .withColumn('uid', f.col('tmp').uid) \
    .withColumn('modifiedBy', f.col('tmp').modifiedBy) \
    .withColumn('tmp', f.explode(f.col('tmp').courseCertifications)) \
    .withColumn('course_uid', f.col('tmp').uid) \
    .withColumn('course_type', f.col('tmp').courseType) \
    .drop('jsonString', 'tmp')
|transactionId|uid |adUsername|modifiedBy|course_uid|course_type|
|value5 |value1|value3 |value4 |value2 |TRAINING |
|value5 |value1|value3 |value4 |TEST |TRAINING |
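Note that explode drops rows whose array is null, which is why only the record with courseCertifications appears in the output. If all rows must be retained, as the question asks, explode_outer (also in pyspark.sql.functions) is likely the better fit. The sketch below models that row-retaining behaviour in plain Python with the standard json module. It is not Spark code; the `flatten` helper and the trimmed sample records are assumptions made for illustration.

```python
import json

# Two trimmed records modelled on the question's file: one with certifications, one without.
records = [
    {"transactionId": "value5",
     "jsonString": json.dumps({
         "uid": "value1", "adUsername": "value3", "modifiedBy": "value4",
         "courseCertifications": [
             {"uid": "value2", "courseType": "TRAINING"},
             {"uid": "TEST", "courseType": "TRAINING"}]})},
    {"transactionId": "value15",
     "jsonString": json.dumps({
         "uid": "value11", "adUsername": "value13", "modifiedBy": "value14"})},
]


def flatten(record):
    """Yield one flat dict per certification, or a single dict with None fields if there are none."""
    parsed = json.loads(record["jsonString"])
    base = {
        "transactionId": record["transactionId"],
        "uid": parsed.get("uid"),
        "adUsername": parsed.get("adUsername"),
        "modifiedBy": parsed.get("modifiedBy"),
    }
    # A missing or empty array still produces one output row, like explode_outer.
    certs = parsed.get("courseCertifications") or [None]
    for cert in certs:
        cert = cert or {}
        yield {**base, "course_uid": cert.get("uid"), "course_type": cert.get("courseType")}


rows = [flat for record in records for flat in flatten(record)]
for r in rows:
    print(r)
```

The record without certifications yields a single row whose course_uid and course_type are None, instead of disappearing.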

Scala: Turn Array into DataFrame or RDD

I am currently working on IntelliJ in Maven.
Is there a way to turn an array into a dataframe or RDD with the first portion of the array as a header?
I'm fine with turning the array into a List, as long as it can be converted into a dataframe or RDD.
val input = Array("Name, Number", "John, 9070", "Sara, 8041")
|John| 9070 |
|Sara| 8041 |
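import org.apache.spark.sql.SparkSession
// Build (or reuse) a session; master("local[*]") assumes a local run from the IDE
val ss = SparkSession.builder().master("local[*]").getOrCreate()
import ss.implicits._ // needed for toDF
val input = Array("Name, Number", "John, 9070", "Sara, 8041")
val header = input.head.split(", ")
val data = input.tail
val rdd = ss.sparkContext.parallelize(data)
// Split each row once, on the same ", " separator as the header
val df = rdd.map(_.split(", ")).map(a => (a(0), a(1))).toDF(header: _*)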
|John| 9070 |
|Sara| 8041 |

Sort by a key, but value has more than one element using Scala

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop. Refer to below. But I'm wondering if there is shorter way to do this that utilizes Spark and Scala duality. (i.e. can I decrease computation time?)
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length-1) {
  val county = uniqueCounty(i).toString;
  // Number is column index 3; parsing it as Int assumes the header row has been removed
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).trim.toInt)).reduceByKey((a,b) => a + b).sortBy(-_._2);
  println("County:" + county + eachCounty.first)
}
Here is the solution using RDD. I am assuming you need top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort descending by count and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
You could use spark-csv and the DataFrame API. If you are using the newer version of Spark (2.0), it is slightly different: Spark 2.0 has a native CSV data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
import java.io.File
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
Gives output:
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to get your result.
import org.apache.spark.sql.functions._
Gives output:
|NASSAU| 60|
| KINGS| 50|
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the agg() function.
More information about Dataframes API
For correct output:
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
