I want to convert a set of rows in a SQL Server database (in the form of rules) into a single if-else condition without hardcoding any values in the code. The code will be written in Scala, and I am trying to figure out the logic to do this but could not think of a good approach.
Sample SQL Server Rows:
TAG | CONDITION | MIN VALUE | MAX VALUE | STATUS
ABC | = | 0 | NULL | GOOD
ABC | = | 1 | NULL | BAD
ABC | = | 2 | NULL | ERROR
ABC | >= | 3 | NULL | IGNORE
Similar to tag ABC, there could be any number of tags, the conditions will vary by tag, and each tag will have its conditions spread across multiple rows. If anyone has dealt with a similar problem and has any suggestions, that would be appreciated.
The question doesn't seem clear to me as currently written. What do you mean by "a single if-else condition without hardcoding any values in the code"?
Would the following work?
sealed trait Condition
object Eq extends Condition // =
object Ge extends Condition // >=
sealed trait Status
object Good extends Status
object Bad extends Status
object Error extends Status
object Ignore extends Status
case class Rule(tag: String,
                condition: Condition,
                min: Int,
                max: Int,
                status: Status)

def handle(input: Int, rules: List[Rule]): Status =
  rules
    .view // lazily iterate the rules
    .filter { // find matching rules
      case Rule(_, Eq, x, _, _) if input == x => true
      case Rule(_, Ge, x, _, _) if input >= x => true
      case _ => false
    }
    .map { matchingRule => matchingRule.status } // return the status
    .head // find the status of the first matching rule, or throw
// Tests
val rules = List(
  Rule("abc", Eq, 0, 0, Good),
  Rule("abc", Eq, 1, 0, Bad),
  Rule("abc", Eq, 2, 0, Error),
  Rule("abc", Ge, 3, 0, Ignore))
assert(handle(0, rules) == Good)
assert(handle(1, rules) == Bad)
assert(handle(2, rules) == Error)
assert(handle(3, rules) == Ignore)
assert(handle(4, rules) == Ignore)
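If the goal is to avoid hardcoding any rule values, the rules themselves can be loaded from SQL Server at runtime and mapped onto Rule. Below is a minimal JDBC sketch, not something from the question: the connection string, the RULES table name, and the MIN_VALUE/MAX_VALUE/CONDITION/STATUS column names are assumptions you would adjust to your schema.

import java.sql.DriverManager

// Hypothetical connection string and table/column names -- adjust to your environment.
val url = "jdbc:sqlserver://localhost;databaseName=rulesdb;user=<user>;password=<password>"

def loadRules(tag: String): List[Rule] = {
  val conn = DriverManager.getConnection(url)
  try {
    val stmt = conn.prepareStatement(
      "SELECT TAG, CONDITION, MIN_VALUE, MAX_VALUE, STATUS FROM RULES WHERE TAG = ?")
    stmt.setString(1, tag)
    val rs = stmt.executeQuery()
    val loaded = scala.collection.mutable.ListBuffer.empty[Rule]
    while (rs.next()) {
      val condition = rs.getString("CONDITION").trim match {
        case "="  => Eq
        case ">=" => Ge
      }
      val status = rs.getString("STATUS").trim.toUpperCase match {
        case "GOOD"   => Good
        case "BAD"    => Bad
        case "ERROR"  => Error
        case "IGNORE" => Ignore
      }
      // getInt returns 0 for NULL columns, which matches the placeholder max used in the tests above.
      loaded += Rule(rs.getString("TAG"), condition, rs.getInt("MIN_VALUE"), rs.getInt("MAX_VALUE"), status)
    }
    loaded.toList
  } finally conn.close()
}

With that in place, handle(input, loadRules("ABC")) evaluates an input against whatever rules are currently stored in the table, with no values hardcoded in the Scala code.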
Related
I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid both, but I need the code to run as efficiently as possible). I'm using Spark 2.4.4.
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
The regexp_replace works at the character level.
I could not get it to work with transform either, but with help from the first answerer I used a UDF - not that easy.
Here is an example with my data; you can tailor it.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)
arrayData = [
('James',['Java','Scala']),
('Michael',['Spark','Java',None]),
('Robert',['CSharp','']),
('Washington',None),
('Jefferson',['1','2'])]
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
This was quite tough; I had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values into a single string with a comma delimiter (hoping you don't have commas in your strings, but you can use something else). I use the null_replacement option to fill the null values. Then I split on the same delimiter.
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn(
"array_list_2", F.expr(" transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
The SQL built-in transform is not working for me, so I couldn't try it, but hopefully you'll get the result you wanted.
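If transform does work in your environment (it is available as a SQL higher-order function from Spark 2.4), one variant worth trying also treats empty-string elements as missing, since the blank entries in the sample may be empty strings rather than nulls. A sketch, shown here in Scala; the same expression string can be passed to F.expr from PySpark:

import org.apache.spark.sql.functions.expr

// Replace both NULL and empty-string elements with 'ZZZ'.
val result = df.withColumn(
  "array_list2",
  expr("transform(array_list, x -> CASE WHEN x IS NULL OR x = '' THEN 'ZZZ' ELSE x END)"))

Note that an empty input array stays empty with this approach; only individual missing elements are replaced.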
My case is that I have an array column that I'd like to filter. Consider the following:
+------------------------------------------------------+
| column|
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix4-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix5-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+
I'd like to filter only the entries containing prefix-4, prefix-5, prefix-6, prefix-7, [...]. So using an "or" statement is not scalable here.
Of course, I can just:
val prefixesList = List("prefix-4", "prefix-5", "prefix-6", "prefix-7")
df
.withColumn("prefix", explode($"column"))
.withColumn("prefix", split($"prefix", "\\-").getItem(0))
.withColumn("filterColumn", $"prefix".inInCollection(prefixesList))
But that involves exploding, which I want to avoid. My plan right now is to define an array column from prefixesList, and then use array_intersect to filter it - for this to work, though, I have to get rid of the -whatever part (which is, obviously, different for each entry). If this were a Scala array, I could easily map over it. But since it is a Spark array column, I don't know if that is possible.
TL;DR I have a dataframe containing an array column. I'm trying to manipulate it and filter it without exploding (because, if I do explode, I'll have to manipulate it later to reverse the explode, and I'd like to avoid it).
Can I achieve that without exploding? If so, how?
Not sure if I understood your question correctly: you want to keep all lines that do not contain any of the prefixes in prefixesList?
If so, you can write your own filter function:
import org.apache.spark.sql.Row

def filterPrefixes(row: Row): Boolean = {
  for (s <- row.getSeq[String](0)) {
    for (p <- Seq("prefix4", "prefix5", "prefix6", "prefix7")) {
      if (s.startsWith(p)) {
        return false
      }
    }
  }
  return true
}
and then use it as argument for the filter call:
df.filter(filterPrefixes _)
.show(false)
prints
+------------------------------------------------------+
|column |
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+
It's relatively trivial to convert the Dataframe to a Dataset[Array[String]], and map over those arrays as whole elements. The basic idea is that you can iterate over your list of arrays easily, without having to flatten the entire dataset.
val df = Seq(Seq("prefix1-whatever", "prefix2-whatever", "prefix4-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix5-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever")
).toDF("column")
val pl = List("prefix4", "prefix5", "prefix6", "prefix7")
val df2 = df.as[Array[String]].map(a => {
a.flatMap(s => {
val start = s.split("-")(0)
if(pl.contains(start)) {
Some(s)
} else {
None
}
})
}).toDF("column")
df2.show(false)
The above code results in:
+------------------+
|column |
+------------------+
|[prefix4-whatever]|
|[] |
|[prefix5-whatever]|
|[] |
+------------------+
I'm not entirely sure how this would compare performance-wise to actually flattening and recombining the data set. Doing this misses any Catalyst optimizations, but avoids a lot of unnecessary shuffling of data.
P.S. I corrected for a minor issue in your prefix list, since "prefix-N" didn't match the data's pattern.
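An alternative that also avoids explode but stays in the untyped DataFrame API (and therefore keeps Catalyst in play) is the filter higher-order function available in Spark 2.4+ SQL. A sketch, assuming the prefix list can be inlined into the expression string:

import org.apache.spark.sql.functions.expr

val prefixes = List("prefix4", "prefix5", "prefix6", "prefix7")
val prefixArraySql = prefixes.map(p => s"'$p'").mkString("array(", ", ", ")")

// Keep only the array elements whose part before '-' appears in the prefix list.
val filtered = df.withColumn(
  "column",
  expr(s"filter(column, x -> array_contains($prefixArraySql, split(x, '-')[0]))"))

filtered.show(false)

This produces the same element-level result as the typed map above, without converting to a Dataset[Array[String]].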
You can achieve it using the SQL API. If you want to keep only rows that contain any of the values prefix-4, prefix-5, prefix-6, prefix-7, you could use the arrays_overlap function. Otherwise, if you want to keep rows that contain all of your values, you could try array_intersect and then check whether its size is equal to the count of your values.
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val df = Seq(
  Seq("prefix1-a", "prefix2-b", "prefix3-c", "prefix4-d"),
  Seq("prefix4-e", "prefix5-f", "prefix6-g", "prefix7-h", "prefix8-i"),
  Seq("prefix6-a", "prefix7-b", "prefix8-c", "prefix9-d"),
  Seq("prefix8-d", "prefix9-e", "prefix10-c", "prefix12-a")
).toDF("arr")

val schema = StructType(Seq(
  StructField("arr", ArrayType.apply(StringType)),
  StructField("arr2", ArrayType.apply(StringType))
))
val encoder = RowEncoder(schema)

val df2 = df.map(s =>
  (s.getSeq[String](0).toArray, s.getSeq[String](0).map(s => s.substring(0, s.indexOf("-"))).toArray)
).map(s => RowFactory.create(s._1, s._2))(encoder)

val prefixesList = Array("prefix4", "prefix5", "prefix6", "prefix7")
val prefixesListSize = prefixesList.size
val prefixesListCol = lit(prefixesList)

df2.select('arr, 'arr2,
  arrays_overlap('arr2, prefixesListCol).as("OR"),
  (size(array_intersect('arr2, prefixesListCol)) === prefixesListSize).as("AND")
).show(false)
it will give you:
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|arr |arr2 |OR |AND |
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|[prefix1-a, prefix2-b, prefix3-c, prefix4-d] |[prefix1, prefix2, prefix3, prefix4] |true |false|
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|true |true |
|[prefix6-a, prefix7-b, prefix8-c, prefix9-d] |[prefix6, prefix7, prefix8, prefix9] |true |false|
|[prefix8-d, prefix9-e, prefix10-c, prefix12-a] |[prefix8, prefix9, prefix10, prefix12] |false|false|
+-------------------------------------------------------+---------------------------------------------+-----+-----+
so finally you can use:
df2.filter(size(array_intersect('arr2,prefixesListCol)) === prefixesListSize).show(false)
and you will get below result:
+-------------------------------------------------------+---------------------------------------------+
|arr |arr2 |
+-------------------------------------------------------+---------------------------------------------+
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|
+-------------------------------------------------------+---------------------------------------------+
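As a side note, on Spark 2.4+ the arr2 column above can also be derived without the typed map and the explicit RowEncoder by using the transform higher-order function; a small sketch (df2Alt is just an illustrative name):

import org.apache.spark.sql.functions.expr

// Build the prefix-only array directly from arr.
val df2Alt = df.withColumn("arr2", expr("transform(arr, x -> split(x, '-')[0])"))
df2Alt.show(false)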
I have seen this question earlier here and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work.
I want to create a new column in an existing Spark DataFrame by applying some rules. Here is what I wrote. iris_spark is the data frame with a categorical variable iris_class that has three distinct categories.
from pyspark.sql import functions as F
iris_spark_df = iris_spark.withColumn(
"Class",
F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor',1)).otherwise(2))
Throws the following error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))
TypeError: when() takes exactly 2 arguments (3 given)
Any idea why?
The correct structure is either:
(when(col("iris_class") == 'Iris-setosa', 0)
.when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2))
which is equivalent to
CASE
WHEN (iris_class = 'Iris-setosa') THEN 0
WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
or:
(when(col("iris_class") == 'Iris-setosa', 0)
.otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2)))
which is equivalent to:
CASE WHEN (iris_class = 'Iris-setosa') THEN 0
ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
END
with general syntax:
when(condition, value).when(...)
or
when(condition, value).otherwise(...)
You probably mixed up things with Hive IF conditional:
IF(condition, if-true, if-false)
which can be used only in raw SQL with Hive support.
Conditional statement In Spark
Using “when otherwise” on DataFrame
Using “case when” on DataFrame
Using && and || operator
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
import spark.sqlContext.implicits._

val data = List(("James ", "", "Smith", "36636", "M", 60000),
  ("Michael ", "Rose", "", "40288", "M", 70000),
  ("Robert ", "", "Williams", "42114", "", 400000),
  ("Maria ", "Anne", "Jones", "39192", "F", 500000),
  ("Jen", "Mary", "Brown", "", "F", 0))

val cols = Seq("first_name", "middle_name", "last_name", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(cols: _*)
1. Using “when otherwise” on DataFrame
Replace the value of gender with a new value:
val df1 = df.withColumn("new_gender", when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown"))
val df2 = df.select(col("*"), when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown").alias("new_gender"))
2. Using “case when” on DataFrame
val df3 = df.withColumn("new_gender",
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end"))
Alternatively,
val df4 = df.select(col("*"),
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end").alias("new_gender"))
3. Using && and || operator
val dataDF = Seq(
(66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4"
)).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
when(col("code") === "a" || col("code") === "d", "A")
.when(col("code") === "b" && col("amt") === "4", "B")
.otherwise("A1"))
.show()
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
There are different ways you can achieve if-then-else.
Using the when function in the DataFrame API.
You can specify the list of conditions in when and also specify with otherwise what value you need. You can use this expression in nested form as well.
Using the expr function.
With expr you can pass a SQL expression. See the example below, where we create a new column "quarter" based on the month column.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
from pyspark.sql.functions import expr

newdf = df.withColumn("quarter", expr(cond))
Using the selectExpr function.
We can also use selectExpr, the variant of select that takes SQL expressions. See the example below.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
newdf = df.selectExpr("*", cond)
You can use this:
if(exp1, exp2, exp3) inside spark.sql()
where exp1 is the condition; if it is true you get exp2, else exp3.
Now, the tricky thing with nested if-else is that you need to wrap every exp inside parentheses ("()"), otherwise it will raise an error.
Example:
if((1>2), (if((2>3), True, False)), (False))
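As an illustrative, self-contained sketch of that syntax (the data, the temp view name t, and the column names are made up here, not taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nested-if").getOrCreate()
import spark.implicits._

// Hypothetical data registered as a temp view just for the example.
Seq(4, 1, 0).toDF("amt").createOrReplaceTempView("t")

// Every if(...) -- including the nested one -- is wrapped in its own parentheses.
spark.sql(
  "SELECT amt, if((amt > 3), ('high'), (if((amt > 0), ('low'), ('non-positive')))) AS label FROM t"
).show()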
Let's say I have a table with the following columns and values:
| BucketID | Value |
|:-----------|------------:|
| 1 | 3 |
| 1 | 2 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 5 |
Let's pretend I want to partition over BucketId and multiply the values in the partition:
SELECT DISTINCT BucketId, MULT(Value) OVER (PARTITION BY BucketId)
Now, there is no built-in Aggregate MULT function so I am writing my own using SQL CLR:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[SqlUserDefinedAggregate(Format.Native, Name = "MULT")]
public struct MULT
{
    private int runningSum;

    public void Init()
    {
        runningSum = 1;
    }

    public void Accumulate(SqlInt32 value)
    {
        runningSum *= (int)value;
    }

    public void Merge(MULT other)
    {
        runningSum *= other.runningSum;
    }

    public SqlInt32 Terminate()
    {
        return new SqlInt32(runningSum);
    }
}
What my question boils down to is this: let's assume I hit a 0 in Accumulate or Merge; there is no point continuing on. If that is the case, how can I return 0 as soon as I hit 0?
You cannot force a termination since there is no way to control the workflow. SQL Server will call the Terminate() method as each group finishes processing its set.
However, since the state of the UDA is maintained across each row processed in a group, you can simply check runningSum to see if it's already 0 and if so, skip any computations. This would save some slight amount of time processing.
In the Accumulate() method, the first step will be to check if runningSum is 0, and if true, simply return;. You should also check to see if value is NULL (something you are not currently checking for). After that, check the incoming value to see if it is 0 and if true, then set runningSum to 0 and return;.
public void Accumulate(SqlInt32 value)
{
    if (runningSum == 0 || value.IsNull)
    {
        return;
    }

    if (value.Value == 0)
    {
        runningSum = 0;
        return;
    }

    runningSum *= value.Value;
}
Note: Do not cast value to int. All Sql* types have a Value property that returns the expected native .NET type.
Finally, in the Merge method, check runningSum to see if it is 0 and, if true, simply return;. Then check other.runningSum to see if it is 0, and if true, set runningSum to 0.
I am starting with Python behave and got stuck when trying to access context - it's not available. Here is my code:
Here is the Feature file:
Feature: Company staff inventory
Scenario: count company staff
Given a set of employees:
| name | dept |
| Paul | IT |
| Mary | IT |
| Pete | Dev |
When we count the number of employees in each department
Then we will find two people in IT
And we will find one employee in Dev
Here is the Steps file:
from behave import *

@given('a set of employees')
def step_impl(context):
    assert context is True

@when('we count the number of employees in each department')
def step_impl(context):
    context.res = dict()
    for row in context.table:
        for k, v in row:
            if k not in context.res:
                context.res[k] = 1
            else:
                context.res[k] += 1

@then('we will find two people in IT')
def step_impl(context):
    assert context.res['IT'] == 2

@then('we will find one employee in Dev')
def step_impl(context):
    assert context.res['Dev'] == 1
Here's the Traceback:
Traceback (most recent call last):
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1456, in run
match.run(runner.context)
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "steps/count_staff.py", line 5, in step_impl
assert context is True
AssertionError
The context object is there. It is your assertion that is problematic:
assert context is True
fails because context is not True. The only thing that can satisfy x is True is an x that is literally the object True. context has not been set to True; it is an object. Note that testing whether context is True is not the same as testing whether the true branch of if context: would be taken: the latter is equivalent to testing whether bool(context) is True.
There's no point in testing whether context is available. It is always there.