Scala: Delete empty array values from a Spark DataFrame

I'm new to Scala. Given a DataFrame named df as follows:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
| []| []| []| []|
+-------+-------+-------+-------+
I'd like to delete rows in which all columns are empty arrays (the 4th row here).
The expected result is:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
+-------+-------+-------+-------+
I tried isNotNull, for example val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show(), but it still shows all rows.
I found a Python solution using a Hive UDF (from link), but I had a hard time converting it to valid Scala code. I would like to use a Scala command similar to the following:
val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3","Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
Any help would be appreciated. Thank you.

isNotNull and isNull will not work here because they test for null values. Your example DataFrame contains empty arrays, not nulls, and the two are different.
One option: create a new column holding the length of each array, then filter out the rows where that length is zero.
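import org.apache.spark.sql.functions.size
import spark.implicits._ // enables the $"..." syntax; assumes a SparkSession named spark

// Note: && keeps only rows where every array is non-empty;
// use || instead to drop only rows where all four arrays are empty.
val dfFil = df
  .withColumn("arrayLengthColOne", size($"Column1"))
  .withColumn("arrayLengthColTwo", size($"Column2"))
  .withColumn("arrayLengthColThree", size($"Column3"))
  .withColumn("arrayLengthColFour", size($"Column4"))
  .filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
    && $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
  .drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")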
Original DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
| []| []| []| []|
+-------+-------+-------+-------+
New DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
+-------+-------+-------+-------+
You could also write a function that builds the same condition for every column instead of listing the columns by hand.
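The pseudo-query from the question can be generated from a list of column names. Below is a minimal sketch in plain Python (string building only, no Spark required). The helper name nonempty_filter and its require parameter are hypothetical; SIZE and the view name targetDf are taken from the question.

```python
def nonempty_filter(columns, require="all"):
    """Build a Spark SQL WHERE clause on array sizes.

    require="all": keep rows where every listed array is non-empty (AND).
    require="any": keep rows where at least one array is non-empty (OR),
                   i.e. drop only the rows in which all arrays are empty.
    """
    if require not in ("all", "any"):
        raise ValueError("require must be 'all' or 'any'")
    joiner = " AND " if require == "all" else " OR "
    return joiner.join("SIZE({0}) > 0".format(c) for c in columns)


cols = ["Column1", "Column2", "Column3", "Column4"]
query = "SELECT * FROM targetDf WHERE " + nonempty_filter(cols, "any")
print(query)
```

The resulting string could then be passed to a SQL call once df has been registered as a view named targetDf. Whether "all" or "any" is the right choice depends on whether a row with only some empty arrays should survive.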

Another approach (in addition to the accepted answer) would be to use Datasets.
For example, by having a case class:
case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  def isEmpty: Boolean = ...
}
You can represent your source as a typed structure:
import spark.implicits._ // needed to provide an implicit encoder/data mapper
val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset
You can then filter like this:
source.filter(element => !element.isEmpty) // calling class's instance method

Related

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid using both though, just need the code run as efficiently as possible). I'm using Spark 2.4.4
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
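regexp_replace works at the character level: an empty pattern matches at every position inside each string, so it cannot be used to detect a missing element.
I could not get it to work with transform either, but with help from the first answerer I used a UDF. It was not that easy.
Here is an example with my own data; adapt it as needed.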
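To see why the empty pattern produces nonsense, here is a small plain-Python sketch. Python's re module stands in for the regex engine Spark uses, and the two helper names are hypothetical. It also shows the per-element replacement that the question is actually after, treating both None and the empty string as missing.

```python
import re


def naive_regex_fill(value, default="ZZZ"):
    # Models what regexp_replace(x, '', 'ZZZ') does to a non-null string:
    # the empty pattern matches before, between and after every character.
    return re.sub("", default, value)


def fill_missing(arr, default="ZZZ"):
    # Element-wise replacement: None and "" both count as missing here.
    return [default if x is None or x == "" else x for x in arr]


print(naive_regex_fill("AAA"))              # ZZZAZZZAZZZAZZZ
print(fill_missing(["AAA", None, "JLT"]))   # ['AAA', 'ZZZ', 'JLT']
```

Whether the missing elements in the real data are nulls or empty strings is not clear from the question, which is why the sketch handles both.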
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

# Replace null elements with con_str; a null or empty array becomes [con_str].
concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)
arrayData = [
    ('James', ['Java', 'Scala']),
    ('Michael', ['Spark', 'Java', None]),
    ('Robert', ['CSharp', '']),
    ('Washington', None),
    ('Jefferson', ['1', '2'])]
df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
This was quite tough; I had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values into a string with a delimiter, here a comma (hopefully it does not occur in your strings; otherwise pick another delimiter). The null_replacement argument fills in the null values. Then I split on the same delimiter.
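The join-then-split idea can be modelled in plain Python to see where it might go wrong. This is only a sketch of the logic, not Spark code: join_then_split is a hypothetical helper, and str.split stands in for Spark's split, which takes a regular expression.

```python
def join_then_split(arr, delimiter=",", null_replacement="ZZZ"):
    # Model of split(array_join(arr, delimiter, null_replacement), delimiter):
    # None values become the replacement text before joining.
    joined = delimiter.join(null_replacement if x is None else x for x in arr)
    return joined.split(delimiter)


print(join_then_split(["AAA", None, "JLT"]))  # ['AAA', 'ZZZ', 'JLT']
print(join_then_split([]))                    # [''] rather than ['ZZZ']
print(join_then_split(["A,B", None]))         # ['A', 'B', 'ZZZ']
```

In this model an empty array does not become ['ZZZ'], and an element that contains the delimiter is split in two, so both cases are worth checking against real data before relying on the approach.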
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn(
    "array_list_2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
The built-in SQL transform function is not working for me, so I couldn't test it, but hopefully it gives the result you want.

How to map over a Spark array without exploding it?

My case is that I have an array column that I'd like to filter. Consider the following:
+------------------------------------------------------+
| column|
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix4-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix5-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+
I'd like to keep only the entries containing prefix-4, prefix-5, prefix-6, prefix-7, [...], so chaining "or" conditions does not scale here.
Of course, I can just:
val prefixesList = List("prefix-4", "prefix-5", "prefix-6", "prefix-7")
df
  .withColumn("prefix", explode($"column"))
  .withColumn("prefix", split($"prefix", "\\-").getItem(0))
  .withColumn("filterColumn", $"prefix".isInCollection(prefixesList))
But that involves exploding, which I want to avoid. My plan right now is to define an array column from prefixesList and then use array_intersect to filter on it. For this to work, though, I have to get rid of the -whatever part (which is, obviously, different for each entry). If this were a Scala array, I could easily map over it; since it is a Spark array, I don't know whether that is possible.
TL;DR I have a dataframe containing an array column. I'm trying to manipulate it and filter it without exploding (because, if I do explode, I'll have to manipulate it later to reverse the explode, and I'd like to avoid it).
Can I achieve that without exploding? If so, how?
Not sure if I understood your question correctly: you want to keep all lines that do not contain any of the prefixes in prefixesList?
If so, you can write your own filter function:
import org.apache.spark.sql.Row

// Returns false as soon as any element starts with one of the prefixes.
def filterPrefixes(row: Row): Boolean = {
  val prefixes = Seq("prefix4", "prefix5", "prefix6", "prefix7")
  !row.getSeq[String](0).exists(s => prefixes.exists(p => s.startsWith(p)))
}
and then use it as argument for the filter call:
df.filter(filterPrefixes _)
.show(false)
prints
+------------------------------------------------------+
|column |
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+
It's relatively trivial to convert the Dataframe to a Dataset[Array[String]], and map over those arrays as whole elements. The basic idea is that you can iterate over your list of arrays easily, without having to flatten the entire dataset.
val df = Seq(Seq("prefix1-whatever", "prefix2-whatever", "prefix4-whatever"),
  Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever"),
  Seq("prefix1-whatever", "prefix2-whatever", "prefix5-whatever"),
  Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever")
).toDF("column")
val pl = List("prefix4", "prefix5", "prefix6", "prefix7")
// Keep only the elements whose prefix (the text before the first "-") is in pl.
val df2 = df.as[Array[String]]
  .map(a => a.filter(s => pl.contains(s.split("-")(0))))
  .toDF("column")
df2.show(false)
The above code results in:
+------------------+
|column |
+------------------+
|[prefix4-whatever]|
|[] |
|[prefix5-whatever]|
|[] |
+------------------+
I'm not entirely sure how this would compare performance-wise to actually flattening and recombining the data set. Doing this misses any Catalyst optimizations, but avoids a lot of unnecessary shuffling of data.
P.S. I corrected for a minor issue in your prefix list, since "prefix-N" didn't match the data's pattern.
You can achieve it using the SQL API. If you want to keep only rows that contain any of the values prefix-4, prefix-5, prefix-6, prefix-7, you could use the arrays_overlap function. Otherwise, if you want to keep rows that contain all of your values, you could try array_intersect and then check whether its size equals the number of your values.
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions.{array_intersect, arrays_overlap, lit, size}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import spark.implicits._ // assumes a SparkSession named spark

val df = Seq(
  Seq("prefix1-a", "prefix2-b", "prefix3-c", "prefix4-d"),
  Seq("prefix4-e", "prefix5-f", "prefix6-g", "prefix7-h", "prefix8-i"),
  Seq("prefix6-a", "prefix7-b", "prefix8-c", "prefix9-d"),
  Seq("prefix8-d", "prefix9-e", "prefix10-c", "prefix12-a")
).toDF("arr")
val schema = StructType(Seq(
  StructField("arr", ArrayType.apply(StringType)),
  StructField("arr2", ArrayType.apply(StringType))
))
val encoder = RowEncoder(schema)
// arr2 holds the part of each element before the "-"
val df2 = df.map(s =>
  (s.getSeq[String](0).toArray, s.getSeq[String](0).map(s => s.substring(0, s.indexOf("-"))).toArray)
).map(s => RowFactory.create(s._1, s._2))(encoder)
val prefixesList = Array("prefix4", "prefix5", "prefix6", "prefix7")
val prefixesListSize = prefixesList.size
val prefixesListCol = lit(prefixesList)
df2.select('arr, 'arr2,
  arrays_overlap('arr2, prefixesListCol).as("OR"),
  (size(array_intersect('arr2, prefixesListCol)) === prefixesListSize).as("AND")
).show(false)
it will give you:
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|arr |arr2 |OR |AND |
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|[prefix1-a, prefix2-b, prefix3-c, prefix4-d] |[prefix1, prefix2, prefix3, prefix4] |true |false|
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|true |true |
|[prefix6-a, prefix7-b, prefix8-c, prefix9-d] |[prefix6, prefix7, prefix8, prefix9] |true |false|
|[prefix8-d, prefix9-e, prefix10-c, prefix12-a] |[prefix8, prefix9, prefix10, prefix12] |false|false|
+-------------------------------------------------------+---------------------------------------------+-----+-----+
so finally you can use:
df2.filter(size(array_intersect('arr2,prefixesListCol)) === prefixesListSize).show(false)
and you will get below result:
+-------------------------------------------------------+---------------------------------------------+
|arr |arr2 |
+-------------------------------------------------------+---------------------------------------------+
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|
+-------------------------------------------------------+---------------------------------------------+
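The difference between the "any" check (arrays_overlap) and the "all" check (array_intersect plus a size comparison) can be verified with a plain-Python model of the same four rows. This is only a sketch of the logic using sets; the function names are hypothetical and null handling is ignored.

```python
def prefixes_of(values):
    # "prefix4-d" -> "prefix4" (the text before the first "-")
    return [v.split("-", 1)[0] for v in values]


def matches_any(values, wanted):
    # models arrays_overlap(arr2, prefixesListCol)
    return len(set(prefixes_of(values)) & set(wanted)) > 0


def matches_all(values, wanted):
    # models size(array_intersect(arr2, prefixesListCol)) === prefixesListSize
    return len(set(prefixes_of(values)) & set(wanted)) == len(set(wanted))


wanted = ["prefix4", "prefix5", "prefix6", "prefix7"]
rows = [
    ["prefix1-a", "prefix2-b", "prefix3-c", "prefix4-d"],
    ["prefix4-e", "prefix5-f", "prefix6-g", "prefix7-h", "prefix8-i"],
    ["prefix6-a", "prefix7-b", "prefix8-c", "prefix9-d"],
    ["prefix8-d", "prefix9-e", "prefix10-c", "prefix12-a"],
]
flags = [(matches_any(r, wanted), matches_all(r, wanted)) for r in rows]
print(flags)  # [(True, False), (True, True), (True, False), (False, False)]
```

The flags agree with the OR and AND columns in the table above: only the second row contains all four prefixes.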

How to sum values of a struct in a nested array in a Spark dataframe?

This is in Spark 2.1. Given this input file:
order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And the following dataframes:
val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg( collect_list( $"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|array |
+------+---------------------------+
|1 |[[1,202.3,1], [2,343.99,1]]|
|2 |[[3,399.99,2]] |
+------+---------------------------+
and structure:
root
|-- userId: long (nullable = true)
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
Now assuming I am given df3:
I would like to compute sum of array.price for each userId, taking advantage of having the array per userId rows.
I would add this computation as a new column in the resulting dataframe, as if I had done df3.withColumn("sum", lit(0)), but with lit(0) replaced by my computation.
I assumed this would be straightforward, but I am stuck. I didn't find any way to access the array as a whole to do the computation per row (with a foldLeft, for example).
I would like to compute sum of array.price for each userId, taking advantage of having the array
Unfortunately, having an array works against you here. Neither Spark SQL nor the DataFrame DSL provides tools that could be used directly to handle this task on arrays of arbitrary size without decomposing (explode) first.
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
or convert to statically typed Dataset:
df3
.as[(Long, Seq[(Long, Double, Long)])]
.map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }
.toDF("userId", "array", "totalPrice").show
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
As mentioned above, you can decompose and aggregate:
import org.apache.spark.sql.functions.{explode, first, sum}
df3
  .withColumn("price", explode($"array.price"))
  .groupBy($"userId")
  .agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)| array|
+------+----------+--------------------+
| 1| 546.29|[[1,202.3,1], [2,...|
| 2| 399.99| [[3,399.99,2]]|
+------+----------+--------------------+
but it is expensive and doesn't use the existing structure.
There is an ugly trick you could use:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}
// until (rather than to) stops at the last valid index of the longest array
val totalPrice = (0 until df3.agg(max(size($"array"))).as[Int].first)
  .map(i => coalesce($"array.price".getItem(i), lit(0.0)))
  .foldLeft(lit(0.0))(_ + _)
df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
but it is more a curiosity than a real solution.
Spark 2.4.0 and above
You can now use the AGGREGATE functionality.
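df3.createOrReplaceTempView("orders")
// The start value is cast to DOUBLE so that its type matches the lambda's result type.
spark.sql(
  """
    |SELECT
    |  *,
    |  AGGREGATE(`array`, CAST(0.0 AS DOUBLE), (accumulator, item) -> accumulator + item.price) AS totalPrice
    |FROM
    |  orders
    |""".stripMargin).show()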
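AGGREGATE is a left fold over the array: it starts from the zero value and applies the lambda once per element. A plain-Python sketch of the same computation with functools.reduce, using the three orders from order.json (the dictionary layout and the helper name total_price are assumptions made for illustration):

```python
from functools import reduce

# The grouped orders, keyed by userId, as plain Python data.
orders_by_user = {
    1: [
        {"id": 1, "price": 202.30, "userid": 1},
        {"id": 2, "price": 343.99, "userid": 1},
    ],
    2: [{"id": 3, "price": 399.99, "userid": 2}],
}


def total_price(items):
    # Same shape as AGGREGATE(array, zero, (accumulator, item) -> accumulator + item.price)
    return reduce(lambda accumulator, item: accumulator + item["price"], items, 0.0)


totals = {user: total_price(items) for user, items in orders_by_user.items()}
print(totals)
```

Because the fold runs inside each row, no explode and no regrouping is needed.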

selecting a range of elements in an array spark sql

I use spark-shell for the operations below.
I recently loaded a table with an array column in spark-sql.
Here is the DDL:
create table test_emp_arr (
  dept_id string,
  dept_nm string,
  emp_details Array<string>
)
the data looks something like this
+-------+-------+-------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------------------+
| 10|Finance|[Jon, Snow, Castle, Black, Ned]|
| 20| IT| [Ned, is, no, more]|
+-------+-------+-------------------------------+
I can query the emp_details column something like this :
sqlContext.sql("select emp_details[0] from emp_details").show
Problem
I want to query a range of elements in the collection :
Expected query to work
sqlContext.sql("select emp_details[0-2] from emp_details").show
or
sqlContext.sql("select emp_details[0:2] from emp_details").show
Expected output
+-------------------+
| emp_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon","Snow","Castle","Black")
I can get the elements in the range 0 to 2 using
emp_details.slice(0,3)
which returns
Array(Jon, Snow, Castle)
I am not able to apply the above operation of the array in spark-sql.
Thanks
Since Spark 2.4 you can use the slice function. In Python:
pyspark.sql.functions.slice(x, start, length)
Collection function: returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
...
New in version 2.4.
from pyspark.sql.functions import slice
df = spark.createDataFrame([
(10, "Finance", ["Jon", "Snow", "Castle", "Black", "Ned"]),
(20, "IT", ["Ned", "is", "no", "more"])
], ("dept_id", "dept_nm", "emp_details"))
df.select(slice("emp_details", 1, 3).alias("empt_details")).show()
+-------------------+
| empt_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In Scala
def slice(x: Column, start: Int, length: Int): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
import org.apache.spark.sql.functions.slice
val df = Seq(
(10, "Finance", Seq("Jon", "Snow", "Castle", "Black", "Ned")),
(20, "IT", Seq("Ned", "is", "no", "more"))
).toDF("dept_id", "dept_nm", "emp_details")
df.select(slice($"emp_details", 1, 3) as "empt_details").show
+-------------------+
| empt_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
The same thing can be of course done in SQL
SELECT slice(emp_details, 1, 3) AS emp_details FROM df
Important:
Please note that, unlike Seq.slice, values are indexed from one (not zero), and the second argument is a length, not an end position.
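The two conventions are easy to mix up, so here is a plain-Python sketch of the translation. Both helper names are hypothetical, and the model only covers a positive start (the documentation quoted above also allows a negative start, which is not modelled here).

```python
def spark_like_slice(arr, start, length):
    """Model of slice(x, start, length) for a positive, 1-based start."""
    if start < 1 or length < 0:
        raise ValueError("this sketch only models start >= 1 and length >= 0")
    return arr[start - 1:start - 1 + length]


def from_scala_slice(from_index, until_index):
    """Translate Seq.slice(from, until) arguments into (start, length)."""
    return from_index + 1, until_index - from_index


emp = ["Jon", "Snow", "Castle", "Black", "Ned"]
start, length = from_scala_slice(0, 3)       # emp_details.slice(0, 3) in Scala
print(start, length)                         # 1 3
print(spark_like_slice(emp, start, length))  # ['Jon', 'Snow', 'Castle']
```

So the Scala call emp_details.slice(0, 3) from the question corresponds to slice(emp_details, 1, 3), which is what the examples above use.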
Here is a solution using a User Defined Function which has the advantage of working for any slice size you want. It simply builds a UDF function around the scala builtin slice method :
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val slice = udf((array : Seq[String], from : Int, to : Int) => array.slice(from,to))
Example with a sample of your data :
val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show
Produces the expected output
+--------------------+-------------------+
| emp_details| slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+
You can also register the UDF in your sqlContext and use it like this
sqlContext.udf.register("slice", (array : Seq[String], from : Int, to : Int) => array.slice(from,to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'),slice(array('Jon','Snow','Castle','Black','Ned'),0,3)")
You won't need lit anymore with this solution
Edit2: For those who want to avoid a udf at the expense of readability ;-)
If you really want to do it in one step, you will have to use Scala to create a lambda function returning a sequence of Columns and wrap it in an array. This is a bit involved, but it's one step:
val df = List(List("Jon", "Snow", "Castle", "Black", "Ned")).toDF("emp_details")
df.withColumn("slice", array((0 until 3).map(i => $"emp_details"(i)):_*)).show(false)
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
The :_* works a bit of magic to pass a list to a so-called variadic function (array in this case, which constructs the SQL array). But I would advise against using this solution as is. Put the lambda function in a named function
def slice(from: Int, to: Int) = array((from until to).map(i => $"emp_details"(i)):_*)
for code readability. Note that in general, sticking to Column expressions (without using udf) gives better performance.
Edit: To do it in a SQL statement (as you ask in your question...), following the same logic, you would generate the SQL query using Scala logic (not saying it's the most readable):
def sliceSql(column: String, from: Int, to: Int): String = "Array(" + (from until to).map(i => column + "[" + i.toString + "]").mkString(",") + ")"
val sqlQuery = "select emp_details, " + sliceSql("emp_details", 0, 3) + " as slice from emp_details"
sqlContext.sql(sqlQuery).show
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
Note that you can replace until with to in order to specify the last element taken rather than the element at which the iteration stops.
You can use the array function to build a new array out of the three values:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val input = sqlContext.sql("select emp_details from emp_details")
val arr: Column = col("emp_details")
val result = input.select(array(arr(0), arr(1), arr(2)) as "emp_details")
result.show()
// +-------------------+
// | emp_details|
// +-------------------+
// |[Jon, Snow, Castle]|
// | [Ned, is, no]|
// +-------------------+
Use the selectExpr() and split() functions in Apache Spark (this assumes emp_details is stored as a comma-separated string rather than as an array).
For example:
fs.selectExpr("split(emp_details, ',')[0] as e1", "split(emp_details, ',')[1] as e2", "split(emp_details, ',')[2] as e3")
Here is my generic slice UDF, which supports arrays of any type. It is a little bit ugly because you need to know the element type in advance.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
def arraySlice(arr: Seq[AnyRef], from: Int, until: Int): Seq[AnyRef] =
  if (arr == null) null else arr.slice(from, until)
def slice(elemType: DataType): UserDefinedFunction =
  udf(arraySlice _, ArrayType(elemType))
// UDF arguments must be Columns, so the integer bounds are wrapped in lit
fs.select(slice(StringType)($"emp_details", lit(1), lit(2)))
For those of you stuck using Spark < 2.4 and don't have the slice function, here is a solution in pySpark (Scala would be very similar) that does not use udfs. Instead it uses the spark sql functions concat_ws, substring_index, and split.
This will only work with string arrays. To make it work with arrays of other types, you will have to cast them into strings first, then cast back to the original type after you have 'sliced' the array.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.master('yarn')
.appName("array_slice")
.getOrCreate()
)
emp_details = [
["Jon", "Snow", "Castle", "Black", "Ned"],
["Ned", "is", "no", "more"]
]
df1 = spark.createDataFrame(
[tuple([emp]) for emp in emp_details],
["emp_details"]
)
df1.show(truncate=False)
+-------------------------------+
|emp_details |
+-------------------------------+
|[Jon, Snow, Castle, Black, Ned]|
|[Ned, is, no, more] |
+-------------------------------+
last_string = 2
df2 = (
df1
.withColumn('last_string', (F.lit(last_string)))
.withColumn('concat', F.concat_ws(" ", F.col('emp_details')))
.withColumn('slice', F.expr("substring_index(concat, ' ', last_string + 1)" ))
.withColumn('slice', F.split(F.col('slice'), ' '))
.select('emp_details', 'slice')
)
df2.show(truncate=False)
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
|[Ned, is, no, more] |[Ned, is, no] |
+-------------------------------+-------------------+
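The string-based trick above can be modelled in plain Python to check its logic and its main limitation. This is a sketch, not Spark code: substring_index here is a hypothetical re-implementation for a positive count, and slice_via_strings is a hypothetical helper name.

```python
def substring_index(text, delim, count):
    # Model of the SQL function for count > 0:
    # everything before the count-th occurrence of delim.
    return delim.join(text.split(delim)[:count])


def slice_via_strings(arr, last_index, delim=" "):
    joined = delim.join(arr)                               # concat_ws(' ', emp_details)
    head = substring_index(joined, delim, last_index + 1)  # keep the first last_index + 1 items
    return head.split(delim)                               # split(slice, ' ')


print(slice_via_strings(["Jon", "Snow", "Castle", "Black", "Ned"], 2))
# ['Jon', 'Snow', 'Castle']
print(slice_via_strings(["Jon Snow", "Castle", "Black"], 1))
# ['Jon', 'Snow']: the element containing a space was split apart
```

The second call shows why the delimiter must not occur inside any element; a delimiter that cannot appear in the data would be a safer choice than a space.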
Use nested split:
split(split(concat_ws(',',emp_details),concat(',',emp_details[3]))[0],',')
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.builder().getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1d637673
scala> val df = spark.read.json("file:///Users/gengmei/Desktop/test/test.json")
18/12/11 10:09:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [dept_id: bigint, dept_nm: string ... 1 more field]
scala> df.createOrReplaceTempView("raw_data")
scala> df.show()
+-------+-------+--------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+--------------------+
| 10|Finance|[Jon, Snow, Castl...|
| 20| IT| [Ned, is, no, more]|
+-------+-------+--------------------+
scala> val df2 = spark.sql(
| s"""
| |select dept_id,dept_nm,split(split(concat_ws(',',emp_details),concat(',',emp_details[3]))[0],',') as emp_details from raw_data
| """)
df2: org.apache.spark.sql.DataFrame = [dept_id: bigint, dept_nm: string ... 1 more field]
scala> df2.show()
+-------+-------+-------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------+
| 10|Finance|[Jon, Snow, Castle]|
| 20| IT| [Ned, is, no]|
+-------+-------+-------------------+

Subset with loop over an array of strings

I have my code like this:
for (i in 1:b) {
carteraR[[i]]=subset(carteraR[[i]],RUN.FONDO=="8026" | RUN.FONDO=="8036" | RUN.FONDO=="8048" | RUN.FONDO=="8057" | RUN.FONDO=="8059" | RUN.FONDO=="8072" | RUN.FONDO=="8094" |
RUN.FONDO=="8107" | RUN.FONDO=="8110" | RUN.FONDO=="8115" | RUN.FONDO=="8130" | RUN.FONDO=="8230" | RUN.FONDO=="8248" | RUN.FONDO=="8257" | RUN.FONDO=="8319")
}
Where b=length(carteraR), and class(carteraR[[i]])=data.frame. RUN.FONDO is one of the column headers of these data frames. This code works fine, but I want to save some lines.
What I want is something like:
for (i in 1:b) {
for (j in 1:length(A)){
carteraR[[i]]=subset(carteraR[[i]],RUN.FONDO==A[j])
}
}
where A = "8026" "8036" "8048" "8057" ... "8319", etc.
What should the code look like?
Thanks
Like this:
carteraR <- lapply(carteraR, subset, RUN.FONDO %in% A)
Just be aware that there can be risks in using subset programmatically (see "Why is `[` better than `subset`?"). This usage is fine, though.

Resources