I have seen this question earlier here and I have took lessons from that. However I am not sure why I am getting an error when I feel it should work.
I want to create a new column in existing Spark DataFrame by some rules. Here is what I wrote. iris_spark is the data frame with a categorical variable iris_spark with three distinct categories.
from pyspark.sql import functions as F
iris_spark_df = iris_spark.withColumn(
"Class",
F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor',1)).otherwise(2))
Throws the following error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))
TypeError: when() takes exactly 2 arguments (3 given)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))
TypeError: when() takes exactly 2 arguments (3 given)
Any idea why?
Correct structure is either:
(when(col("iris_class") == 'Iris-setosa', 0)
.when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2))
which is equivalent to
CASE
WHEN (iris_class = 'Iris-setosa') THEN 0
WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
or:
(when(col("iris_class") == 'Iris-setosa', 0)
.otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2)))
which is equivalent to:
CASE WHEN (iris_class = 'Iris-setosa') THEN 0
ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1
ELSE 2
END
END
with general syntax:
when(condition, value).when(...)
or
when(condition, value).otherwise(...)
You probably mixed up things with Hive IF conditional:
IF(condition, if-true, if-false)
which can be used only in raw SQL with Hive support.
Conditional statement In Spark
Using “when otherwise” on DataFrame
Using “case when” on DataFrame
Using && and || operator
import org.apache.spark.sql.functions.{when, _}
import spark.sqlContext.implicits._
val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
val data = List(("James ","","Smith","36636","M",60000),
("Michael ","Rose","","40288","M",70000),
("Robert ","","Williams","42114","",400000),
("Maria ","Anne","Jones","39192","F",500000),
("Jen","Mary","Brown","","F",0))
val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)
1. Using “when otherwise” on DataFrame
Replace the value of gender with new value
val df1 = df.withColumn("new_gender", when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown"))
val df2 = df.select(col("*"), when(col("gender") === "M","Male")
.when(col("gender") === "F","Female")
.otherwise("Unknown").alias("new_gender"))
2. Using “case when” on DataFrame
val df3 = df.withColumn("new_gender",
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end"))
Alternatively,
val df4 = df.select(col("*"),
expr("case when gender = 'M' then 'Male' " +
"when gender = 'F' then 'Female' " +
"else 'Unknown' end").alias("new_gender"))
3. Using && and || operator
val dataDF = Seq(
(66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4"
)).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
when(col("code") === "a" || col("code") === "d", "A")
.when(col("code") === "b" && col("amt") === "4", "B")
.otherwise("A1"))
.show()
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
There are different ways you can achieve if-then-else.
Using when function in DataFrame API.
You can specify the list of conditions in when and also can specify otherwise what value you need. You can use this expression in nested form as well.
expr function.
Using "expr" function you can pass SQL expression in expr. PFB example. Here we are creating new column "quarter" based on month column.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
newdf = df.withColumn("quarter", expr(cond))
selectExpr function.
We can also use the variant of select function which can take SQL expression. PFB example.
cond = """case when month > 9 then 'Q4'
else case when month > 6 then 'Q3'
else case when month > 3 then 'Q2'
else case when month > 0 then 'Q1'
end
end
end
end as quarter"""
newdf = df.selectExpr("*", cond)
you can use this:
if(exp1, exp2, exp3) inside spark.sql()
where exp1 is condition and if true give me exp2, else give me exp3.
now the funny thing with nested if-else is. you need to pass every exp inside
brackets {"()"}
else it will raise error.
example:
if((1>2), (if (2>3), True, False), (False))
I try to make query for tag search.
tags: how many tags ex.3
q: array of tags ex.['foo','hoo','poo']
def queryByTags(cls, tags, q):
def one():
qry = models.Card.query(models.Card.tags_value == q[0])
return qry
def two():
qry = models.Card.query(ndb.AND(models.Card.tags_value == q[0],
models.Card.tags_value == q[1]))
return qry
def three():
qry = models.Card.query(ndb.AND(models.Card.tags_value == q[0],
models.Card.tags_value == q[1],
models.Card.tags_value == q[2]))
return qry
tags_len = {1: one,
2: two,
3: three,
}
return tags_len[tags]()
This method can use up to 3 tags. I can copy code myself and extend it until 7,8,9...
It is very sad way...
Is there any smart way?
In pseudo python-ndb (I didn't run my code but you'll get it) I would say that a way would be to do:
cards_count = Card.query().filter(tags_value==q[0])\
.filter(tags_value==q[1])\
.filter(tags_value==q[2]).count()
or if iterating dynamic array (unknown length)
cards_count = Card.query()
for value in q:
q = q.filter(tags_value==value)
cards_count = q.count()
Here is my selection code from db:
$q = $this->db->like('Autor1' or 'Autor2' or 'Autor3' or 'Autor4', $vyraz)
->where('stav', 1)
->order_by('id', 'desc')
->limit($limit)
->offset($offset)
->get('knihy');
return $q->result();
Where $vyraz = "Zuzana Šidlíková";
And the error is:
Nastala chyba databázy
Error Number: 1054
Unknown column '1' in 'where clause'
SELECT * FROM (\knihy`) WHERE `stav` = 1 AND `1` LIKE '%Zuzana Ĺ idlĂková%' ORDER BY `id` desc LIMIT 9
Filename: C:\wamp\www\artbooks\system\database\DB_driver.php
Line Number: 330
Can you help me solve this problem?
Your syntax is wrong for what you're trying to do, but still technically valid, because this:
'Autor1' or 'Autor2' or 'Autor3' or 'Autor4'
...is actually a valid PHP expression which evaluates to TRUE (because all non-empty strings are "truthy"), which when cast to a string or echoed comes out as 1, so the DB class is looking to match on a column called "1".
Example:
function like($arg1, $arg2)
{
return "WHERE $arg1 LIKE '%$arg2%'";
}
$vyraz = 'Zuzana Šidlíková';
echo like('Autor1' or 'Autor2' or 'Autor3' or 'Autor4', $vyraz);
// Output: WHERE 1 LIKE '%Zuzana Šidlíková%'
Anyways, here's what you need:
$q = $this->db
->like('Autor1', $vyraz)
->or_like('Autor2', $vyraz)
->or_like('Autor3', $vyraz)
->or_like('Autor4', $vyraz)
->where('stav', 1)
->order_by('id', 'desc')
->limit($limit)
->offset($offset)
->get('knihy');
In Drupal 7 simple updates work like this:
$num_updated = db_update('joke')
->fields(array(
'punchline' => 'Take my wife please!',
))
->condition('nid', 3, '>=')
->execute();
But what if I have multiple conditions( say nid >=3 and uid >=2). Writing something like:
$num_updated = db_update('joke')
->fields(array(
'punchline' => 'Take my wife please!',
))
->condition('nid', 3, '>=')
->condition('uid', 2, '>=')
->execute();
does not seem to work. Any ideas?
What you have written will do the equivalent of:
'...WHERE (NID >= 3 AND UID >= 2)"
If you wanted an OR conditional you need to nest the statements as such (no, it's not very intuitive):
->condition(db_or()->condition('NID'', 3, '>=')->condition('UID', 2, '>='))
There is also db_and() [default for chaining multiple condition methods] and db_xor() that you can use when nesting.
In my django app, especially on the admin side, I do a few def's with my models:
def get_flora(self):
return self.flora.all()
def targeted_flora(self):
return u"%s" % (self.get_flora())
whereas flora is a ManyToManyField, however, sometimes ForeignKey fields are used also.
I do this to provide a utility 'get' function for the model, and then the second def is to provide django admin with a a friendlier field name to populate the tabular/list view.
Perhaps a two part question here:
1. Is this a good workflow/method for doing such things, and
2. The resultant string output in admin looks something like:
[<Species: pittosporum>, <Species: pinus radiata>]
Naturally enough, but how to make it look like:
pittosporum & pinus radiata
or, if there were three;
pittosporum, pinus radiata & erharta ercta
Super thanks!
Sounds like you want something like this:
def targeted_flora(self):
names= [f.name for f in self.get_flora()] # or however you get a floras name
if len(names) == 1:
return names[0]
else:
return ', '.join(names[:-1]) + ' & ' + names[-1]
This works btw:
def punctuated_object_list(objects, field):
if field:
field_list = [getattr(f, field) for f in objects]
else:
field_list = [str(f) for f in objects]
if len(field_list) > 0:
if len(field_list) == 1:
return field_list[0]
else:
return ', '.join(field_list[:-1]) + ' & ' + field_list[-1]
else:
return u''