PySpark: Convert T-SQL Case When Then statement to PySpark [duplicate]

I have seen this question earlier here and took lessons from it. However, I am not sure why I am getting an error when I feel it should work.
I want to create a new column in an existing Spark DataFrame based on some rules. Here is what I wrote. iris_spark is the DataFrame with a categorical variable iris_class that has three distinct categories.
from pyspark.sql import functions as F
iris_spark_df = iris_spark.withColumn(
"Class",
F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor',1)).otherwise(2))
Throws the following error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))
TypeError: when() takes exactly 2 arguments (3 given)
Any idea why?

The correct structure is either:
(when(col("iris_class") == 'Iris-setosa', 0)
.when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2))
which is equivalent to
CASE
WHEN (iris_class = 'Iris-setosa') THEN 0
WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
or:
(when(col("iris_class") == 'Iris-setosa', 0)
.otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2)))
which is equivalent to:
CASE WHEN (iris_class = 'Iris-setosa') THEN 0
ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
END
with general syntax:
when(condition, value).when(...)
or
when(condition, value).otherwise(...)
You probably mixed things up with the Hive IF conditional:
IF(condition, if-true, if-false)
which can be used only in raw SQL with Hive support.
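Applied to the snippet from the question, a corrected version would look like this (a sketch keeping the original names):
from pyspark.sql import functions as F

# chain when() calls instead of passing a third argument to when()
iris_spark_df = iris_spark.withColumn(
    "Class",
    F.when(iris_spark.iris_class == 'Iris-setosa', 0)
     .when(iris_spark.iris_class == 'Iris-versicolor', 1)
     .otherwise(2))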

Conditional statements in Spark
Using “when otherwise” on DataFrame
Using “case when” on DataFrame
Using && and || operator
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
import spark.implicits._
val data = List(("James ","","Smith","36636","M",60000),
("Michael ","Rose","","40288","M",70000),
("Robert ","","Williams","42114","",400000),
("Maria ","Anne","Jones","39192","F",500000),
("Jen","Mary","Brown","","F",0))
val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)
1. Using “when otherwise” on DataFrame
Replace the value of gender with a new value:
val df1 = df.withColumn("new_gender", when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown"))
val df2 = df.select(col("*"), when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown").alias("new_gender"))
2. Using “case when” on DataFrame
val df3 = df.withColumn("new_gender",
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end"))
Alternatively,
val df4 = df.select(col("*"),
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end").alias("new_gender"))
3. Using && and || operator
val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
when(col("code") === "a" || col("code") === "d", "A")
.when(col("code") === "b" && col("amt") === "4", "B")
.otherwise("A1"))
.show()
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
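For readers coming from the PySpark question above, the same when/otherwise pattern translates almost directly (a sketch, assuming a PySpark DataFrame df with the same gender column):
from pyspark.sql.functions import when, col

df1 = df.withColumn("new_gender",
    when(col("gender") == "M", "Male")
    .when(col("gender") == "F", "Female")
    .otherwise("Unknown"))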

There are different ways you can achieve if-then-else.
Using when function in DataFrame API.
You can chain multiple conditions with when and supply a default value with otherwise. The expression can be nested as well.
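The same quarter derivation shown with expr below can be written with chained when calls (a sketch, assuming a numeric month column on df):
from pyspark.sql.functions import when, col

newdf = df.withColumn("quarter",
    when(col("month") > 9, "Q4")
    .when(col("month") > 6, "Q3")
    .when(col("month") > 3, "Q2")
    .when(col("month") > 0, "Q1"))  # month <= 0 is left as null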
expr function.
Using "expr" function you can pass SQL expression in expr. PFB example. Here we are creating new column "quarter" based on month column.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
from pyspark.sql.functions import expr
newdf = df.withColumn("quarter", expr(cond))
selectExpr function.
We can also use selectExpr, the variant of select that takes SQL expressions. See the example below, which reuses the same cond as above.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
newdf = df.selectExpr("*", cond)

You can use this:
if(exp1, exp2, exp3) inside spark.sql()
where exp1 is the condition; if it is true you get exp2, else exp3.
Note that with nested if-else you need to wrap every expression in parentheses ("()"), or it will raise an error.
example:
if((1>2), (if((2>3), True, False)), (False))
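For instance, a minimal usage sketch (the view name t is hypothetical; the columns follow the earlier id/amt example):
# register a view so the IF() expression can run in raw SQL
df.createOrReplaceTempView("t")
spark.sql("SELECT id, IF(amt > 3, 'high', 'low') AS level FROM t").show()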

Related

How to compare two array of string columns in Pyspark

I want to compare two arrays and filter the data frame
condition_1 = "AAA"
condition_2 = ["AAA","BBB","CCC"]
My Spark data frame has a column with an array of strings:
df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))
# to filter a string condition_1 with the array column
df = df.filter(
F.col('array_column').isin(condition_1) &
# second filter here
)
But how can I filter with condition_2 in a similar way, since they are both arrays?
Code I tried:
df = df.filter(
F.col('array_column').isin(condition_1) &
any(x in condition_2 for x in F.col('array_column'))
)
But I get an error - Column is not iterable.
I also tried bool(set(F.col('array_column')).intersection(condition_2)),
but I still get the same error. Can anyone help me with this?
Hope I got your question right; it wasn't entirely clear. Use PySpark's array functions.
Data
condition_1 = 'AAA'
condition_2 = ["AAA","BBB","CCC"]
df=spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+------------------+
| ID|Company_Id| value| array_column|
+---+----------+-------+------------------+
| 1A| 3412asd|value-1| [XXX, YYY, AAA]|
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C| 9800bvd|value-3| [AAA]|
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC]|
+---+----------+-------+------------------+
Code
from pyspark.sql.functions import array, array_contains, array_intersect, col, lit, size

df.where(
    array_contains(col('array_column'), lit(condition_1)) &
    (size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0)
).show(truncate=False)
Outcome
+---+----------+-------+----------------+
|ID |Company_Id|value |array_column |
+---+----------+-------+----------------+
|1A |3412asd |value-1|[XXX, YYY, AAA] |
|3C |9800bvd |value-3|[AAA] |
|3C |9800bvd |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+
How it works
condition_1: get a boolean mask of rows whose array column contains the string
array_contains(col('array_column'), lit(condition_1))
condition_2: this happens in stages:
Intersect the column with the list:
array_intersect(col('array_column'), array([lit(x) for x in condition_2]))
Get the size of the outcome of the step above:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2])))
Check that the intersection contains at least one item:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0
Finally, chain condition_1 and condition_2 using the & operator and pass the result into the df.where() or df.filter() method, as in the sketch below.
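Putting it together with named conditions (a sketch using the same names as above):
from pyspark.sql.functions import array, array_contains, array_intersect, col, lit, size

cond_a = array_contains(col('array_column'), lit(condition_1))
cond_b = size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0
df.where(cond_a & cond_b).show(truncate=False)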

Convert DB rules to if-else conditions?

I want to convert a set of rows in a SQL Server database (in the form of rules) to a single if-else condition without hardcoding any values in the code. The code will be written in Scala, and I am trying to figure out the logic to do this but could not think of a good approach.
Sample SQL Server Rows:
TAG | CONDITION | MIN VALUE | MAX VALUE | STATUS
ABC | = | 0 | NULL | GOOD
ABC | = | 1 | NULL | BAD
ABC | = | 2 | NULL | ERROR
ABC | >= | 3 | NULL | IGNORE
Similar to tag ABC, there could be any number of tags; the conditions will vary with the tag column, and each tag will have conditions in multiple rows. If anyone has dealt with a similar problem, any suggestions would be appreciated.
The question doesn't seem clear to me as currently written. What do you mean by "a single if-else condition without hardcoding any values in the code"?
Would the following work?
sealed trait Condition
object Eq extends Condition // =
object Ge extends Condition // >=
sealed trait Status
object Good extends Status
object Bad extends Status
object Error extends Status
object Ignore extends Status
case class Rule(tag: String,
                condition: Condition,
                min: Int,
                max: Int,
                status: Status)

def handle(input: Int, rules: List[Rule]): Status =
  rules
    .view // lazily iterate the rules
    .filter { // find matching rules
      case Rule(_, Eq, x, _, _) if input == x => true
      case Rule(_, Ge, x, _, _) if input >= x => true
      case _ => false
    }
    .map { matchingRule => matchingRule.status } // return the status
    .head // find the status of the first matching rule, or throw
// Tests
val rules = List(
Rule("abc", Eq, 0, 0, Good),
Rule("abc", Eq, 1, 0, Bad),
Rule("abc", Eq, 2, 0, Error),
Rule("abc", Ge, 3, 0, Ignore))
assert(handle(0, rules) == Good)
assert(handle(1, rules) == Bad)
assert(handle(2, rules) == Error)
assert(handle(3, rules) == Ignore)
assert(handle(4, rules) == Ignore)

Lua: getting values from a table by using a string

How would I go about compiling values from a table using a string?
i.e.
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
If I was for example to request "1ABC3", how would I get it to output 1 1 2 3 3?
Greatly appreciate any response.
Try this:
s="1ABC3z9"
t=s:gsub(".",function (x)
local y=tonumber(x)
if y~=nil then
y=NumberDef[y]
else
y=TextDef[x:lower()]
end
return (y or x).." "
end)
print(t)
This may be simplified if you combine the two tables into one.
You can access values in a Lua table like so:
TableName["IndexNameOrNumber"]
Using your example:
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
print(NumberDef[2])--Will print 2
print(TextDef["c"])--will print 3
If you wish to access all values of a Lua table, you can loop through them like so (similar to a foreach in C#):
for i,v in next, TextDef do
print(i, v)
end
--Output:
--c 3
--a 1
--b 2
So to answer your request, you would request those values like so:
print(NumberDef[1], TextDef["a"], TextDef["b"], TextDef["c"], NumberDef[3])--Will print 1 1 2 3 3
One more point: if you're interested in concatenating Lua strings, this can be accomplished like so:
string1 = string2 .. string3
Example:
local StringValue1 = "I"
local StringValue2 = "Love"
local StringValue3 = StringValue1 .. " " .. StringValue2 .. " Memes!"
print(StringValue3) -- Will print "I Love Memes!"
UPDATE
I whipped up a quick example you could use to handle what you're looking for. It goes through the inputted string and checks each of the two tables for the value you requested. If it exists, it appends it to a string value and prints the final product at the end.
local StringInput = "1abc3" -- Your request to find
local CombineString = "" --To combine the request
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
for i=1, #StringInput do --Foreach character in the string inputted do this
local CurrentCharacter = StringInput:sub(i,i); --get the current character from the loop position
local Num = tonumber(CurrentCharacter)--If possible convert to number
if TextDef[CurrentCharacter] then--if it exists in text def array then combine it
CombineString = CombineString .. TextDef[CurrentCharacter]
end
if NumberDef[Num] then --if it exists in number def array then combine it
CombineString = CombineString .. NumberDef[Num]
end
end
print("Combined: ", CombineString) --print the final product.

Remove garbage (#,$) values from any string and drop records that contain only garbage (#,$) values with multiple occurrences in multiple columns

I tried the code below to drop records that contain garbage values with multiple occurrences in multiple columns, but I also want to remove garbage values from strings with multiple occurrences in multiple columns.
Sample code:
filter_list = ['$','#','%','#','!','^','&','*','null']

def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list] for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)

filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example "original" is my original dataframe and "compulsory_fields" this is my array(it stores as multiple columns).
Sample input:
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample output:
id name salary
3 Jaydeep 8000
4 Priya 4000
Your requirements are not completely clear to me, but it seems you want to output records that are valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters udf that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from functools import reduce
from itertools import chain
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import BooleanType, StringType
rdd = sc.parallelize((
('#','Yogita','1000'),
('2', 'Neha', '##'),
('3', '#Jay$deep##','8000'),
('4', 'Priya', '40$00&'),
('5', 'Bhavana', '$$%&^'),
('6', '$%','$$&&'))
)
original = rdd.toDF(['id','name','salary'])
filter_list = ['$','#','%','#','!','^','&','*','null']
compulsory_fields = ['id','name','salary']
def clean_special_characters(input_string):
    cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c) == 1})
    if cleaned_input == '':
        return 'null'
    return cleaned_input
clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))
def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list] for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 3|Jaydeep| 8000|
| 4| Priya| 4000|
+---+-------+------+
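As a side note, regexp_replace is imported above but never used. A UDF-free sketch of the same cleanup, starting from the raw DataFrame, is shown below (an alternative approach; it assumes the single-character garbage symbols above and drops rows where any compulsory field becomes empty):
from pyspark.sql.functions import regexp_replace, col

cleaned = original
for c in compulsory_fields:
    # strip the special characters natively; $, ^ and & are literal inside [...]
    cleaned = cleaned.withColumn(c, regexp_replace(col(c), '[$#%!^&*]', ''))
# keep only rows where every compulsory field still has content
cleaned = cleaned.filter(' AND '.join("{} != ''".format(c) for c in compulsory_fields))
cleaned.show()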

Fastest and most effective way of comparing two arrays of hashes of different formats

I have two arrays of hashes with the formats:
hash1
[{:root => root_value, :child1 => child1_value, :subchild1 => subchild1_value, :bases => [hit1, hit2, hit3]}...]
hash2
[{:path => "root_value/child1_value/subchild1_value", :hit1_exist => 't', :hit2_exist => 't', :hit3_exist => 'f'}...]
If I do this:
def sample
results = nil
project = Project.find(params[:project_id])
testrun_query = "SELECT root_name, suite_name, case_name, ic_name, executed_platforms FROM testrun_caches WHERE start_date >= '#{params[:start_date]}' AND start_date < '#{params[:end_date]}' AND project_id = #{params[:project_id]} AND result <> 'SKIP' AND result <> 'N/A'"
if !params[:platform].nil? && params[:platform] != [""]
#yell_and_log "platform not nil"
platform_query = nil
params[:platform].each do |platform|
if platform_query.nil?
platform_query = " AND (executed_platforms LIKE '%#{platform.to_s},%'"
else
platform_query += " OR executed_platforms LIKE '%#{platform.to_s},%'"
end
end
testrun_query += platform_query + ")"
end
if !params[:location].nil? && !params[:location].empty?
#yell_and_log "location not nil"
testrun_query += " AND location LIKE '#{params[:location].to_s}%'"
end
testrun_query += " GROUP BY root_name, suite_name, case_name, ic_name, executed_platforms ORDER BY root_name, suite_name, case_name, ic_name"
ic_query = "SELECT ics.path, memberships.pts8210, memberships.sv6, memberships.sv7, memberships.pts14k, memberships.pts22k, memberships.pts24k, memberships.spb32, memberships.spb64, memberships.sde, projects.name FROM ics INNER JOIN memberships on memberships.ic_id = ics.id INNER JOIN test_groups ON test_groups.id = memberships.test_group_id INNER JOIN projects ON test_groups.project_id = projects.id WHERE deleted = 'false' AND (memberships.pts8210 = true OR memberships.sv6 = true OR memberships.sv7 = true OR memberships.pts14k = true OR memberships.pts22k = true OR memberships.pts24k = true OR memberships.spb32 = true OR memberships.spb64 = true OR memberships.sde = true) AND projects.name = '#{project.name}' GROUP BY path, memberships.pts8210, memberships.sv6, memberships.sv7, memberships.pts14k, memberships.pts22k, memberships.pts24k, memberships.spb32, memberships.spb64, memberships.sde, projects.name ORDER BY ics.path"
if params[:ic_type] == "never_run"
runtest = TestrunCache.connection.select_all(testrun_query)
alltest = TrsIc.connection.select_all(ic_query)
(alltest.length).times do |i|
#exec_pltfrm = test['executed_platforms'].split(",")
unfinishedtest = comparison(runtest[i],alltest[i])
yell_and_log("test = #{unfinishedtest}")
yell_and_log("#{runtest[i]}")
yell_and_log("#{alltest[i]}")
end
end
end
I get in my log:
test = true
array of hash 1 = {"root_name"=>"BSDPLATFORM", "suite_name"=>"cli", "case_name"=>"functional", "ic_name"=>"cli_sanity_test", "executed_platforms"=>"pts22k,pts24k,sv7,"}
array of hash 2 = {"path"=>"BSDPLATFORM/cli/functional/cli_sanity_test", "pts8210"=>"f", "sv6"=>"f", "sv7"=>"t", "pts14k"=>nil, "pts22k"=>"t", "pts24k"=>"t", "spb32"=>nil, "spb64"=>nil, "sde"=>nil, "name"=>"pts_6_20"}
test = false
array of hash 1 = {"root_name"=>"BSDPLATFORM", "suite_name"=>"infrastructure", "case_name"=>"bypass_pts14k_copper", "ic_name"=>"ic_packet_9", "executed_platforms"=>"sv6,"}
array of hash 2 = {"path"=>"BSDPLATFORM/infrastructure/build/copyrights", "pts8210"=>"f", "sv6"=>"t", "sv7"=>"t", "pts14k"=>"f", "pts22k"=>"t", "pts24k"=>"t", "spb32"=>"f", "spb64"=>nil, "sde"=>nil, "name"=>"pts_6_20"}
test = false
array of hash 1 = {"root_name"=>"BSDPLATFORM", "suite_name"=>"infrastructure", "case_name"=>"bypass_pts14k_copper", "ic_name"=>"ic_status_1", "executed_platforms"=>"sv6,"}
array of hash 2 = {"path"=>"BSDPLATFORM/infrastructure/build/ic_1", "pts8210"=>"f", "sv6"=>"t", "sv7"=>"t", "pts14k"=>"f", "pts22k"=>"t", "pts24k"=>"t", "spb32"=>"f", "spb64"=>nil, "sde"=>nil, "name"=>"pts_6_20"}
test = false
array of hash 1 = {"root_name"=>"BSDPLATFORM", "suite_name"=>"infrastructure", "case_name"=>"bypass_pts14k_copper", "ic_name"=>"ic_status_2", "executed_platforms"=>"sv6,"}
array of hash 2 = {"path"=>"BSDPLATFORM/infrastructure/build/ic_files", "pts8210"=>"f", "sv6"=>"t", "sv7"=>"f", "pts14k"=>"f", "pts22k"=>"t", "pts24k"=>"t", "spb32"=>"f", "spb64"=>nil, "sde"=>nil, "name"=>"pts_6_20"}
So only the first pair matches; the rest become misaligned, and I get one result instead of 4230.
I would like some way to match by path and root/suite/case/ic, and then compare the executed platforms in array of hashes 1 against the platforms set to true in array of hashes 2.
Not sure if this is fastest, and I wrote this based on your original question that didn't provide sample code, but:
def compare(h1, h2)
(h2[:path] == "#{h1[:root]}/#{h1[:child1]}/#{h1[:subchild1]}") && \
(h2[:hit1_exist] == ((h1[:bases][0] == nil) ? 'f' : 't')) && \
(h2[:hit2_exist] == ((h1[:bases][1] == nil) ? 'f' : 't')) && \
(h2[:hit3_exist] == ((h1[:bases][2] == nil) ? 'f' : 't'))
end
def compare_arr(h1a, h2a)
(h1a.length).times do |i|
compare(h1a[i],h2a[i])
end
end
Test:
require "benchmark"
h1a = []
h2a = []
def rstr
# from http://stackoverflow.com/a/88341/178651
(0...2).map{65.+(rand(26)).chr}.join
end
def rnil
rand(2) > 0 ? '' : nil
end
10000.times do
h1a << {:root => rstr(), :child1 => rstr(), :subchild1 => rstr(), :bases => [rnil,rnil,rnil]}
h2a << {:path => "#{rstr()}/#{rstr()}/#{rstr()}", :hit1_exist => 't', :hit2_exist => 't', :hit3_exist => 'f'}
end
Benchmark.measure do
compare_arr(h1a,h2a)
end
Results:
=> 0.020000 0.000000 0.020000 ( 0.024039)
Now that I'm looking at your code, I think it could be optimized by removing array creations and the splits and joins, which create arrays and strings that need to be garbage collected; that also slows things down, but not by as much as you mention.
Your database queries may be slow. Run explain/analyze or similar on them to see why each is slow, optimize/reduce your queries, add indexes where needed, etc. Also, check cpu and memory utilization, etc. It might not just be the code.
But, there are some definite things that need to be fixed. You also have several risks of SQL injection attack, e.g.:
... start_date >= '#{params[:start_date]}' AND start_date < '#{params[:end_date]}' AND project_id = #{params[:project_id]} ...
Anywhere that params and variables are put directly into the SQL may be a danger. You'll want to make sure to use prepared statements or at least SQL escape the values. Read this all the way through: http://guides.rubyonrails.org/active_record_querying.html
([element_being_tested].each do |el|
  hash_array_1.zip(hash_array_2).reject do |x, y|
    x[el] == y[el]
  end
end).each {|x, y| puts (x[:bases] | y[:bases])}
Enumerate the hash elements to test.
[element_being_tested].each do |el|
Then iterate through the hash arrays themselves, zipped pairwise, comparing the given hashes by the elements defined by the outer loop and rejecting those not appropriately equal. (The == may actually need to be !=, but you can figure that much out.)
hash_array_1.zip(hash_array_2).reject do |x, y|
  x[el] == y[el]
end
Finally, you again compare the hashes taking the set union of their elements.
.each {|x, y| puts (x[:bases] | y[:bases])}
You may need to test the code. It's meant as a demonstration rather than production code, because I wasn't sure I read your code right. Please post a larger sample of the source, including the data structures in question, if this answer is unsatisfactory.
Regarding speed: if you're iterating through a large data set and comparing many elements, there's probably nothing you can do. Perhaps you can invert the loops I presented and make the hash arrays the outer loop. You're not going to get lightning speed here in Ruby (or really any language) if the data structure is large.
