I am just starting with Python behave and got stuck when trying to access context: it appears not to be available.
Here is the feature file:
Feature: Company staff inventory
  Scenario: count company staff
    Given a set of employees:
      | name | dept |
      | Paul | IT   |
      | Mary | IT   |
      | Pete | Dev  |
    When we count the number of employees in each department
    Then we will find two people in IT
    And we will find one employee in Dev
Here is the Steps file:
from behave import *

@given('a set of employees')
def step_impl(context):
    assert context is True

@when('we count the number of employees in each department')
def step_impl(context):
    context.res = dict()
    for row in context.table:
        for k, v in row:
            if k not in context.res:
                context.res[k] = 1
            else:
                context.res[k] += 1

@then('we will find two people in IT')
def step_impl(context):
    assert context.res['IT'] == 2

@then('we will find one employee in Dev')
def step_impl(context):
    assert context.res['Dev'] == 1
Here's the Traceback:
Traceback (most recent call last):
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/", line 1456, in run
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/", line 1903, in run
self.func(context, *args, **kwargs)
File "steps/", line 5, in step_impl
assert context is True

The context object is there. It is your assertion that is problematic:
assert context is True
fails because context is not True. The only thing that can satisfy x is True is x being the value True itself. context has not been set to True; it is an object. Note that testing whether context is True is not the same as testing which branch would be taken if you wrote if context:. The latter is equivalent to testing whether bool(context) is True.
There's no point in testing whether context is available. It is always there.
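To make the distinction concrete, here is a minimal sketch in plain Python. The Context class is a hypothetical stand-in for the object behave passes to each step, not behave's actual class; it only illustrates identity versus truthiness.

```python
class Context:
    """Hypothetical stand-in for the object behave passes to a step."""
    pass


context = Context()

# Identity test: only the singleton True satisfies "x is True".
identity_check = context is True

# Truthiness test: this is what "if context:" evaluates.
# A plain object without __bool__ or __len__ is truthy.
truthiness_check = bool(context)

# If an existence check is really wanted, compare against None instead.
exists_check = context is not None

print(identity_check, truthiness_check, exists_check)
```

With this stand-in, the identity check is False while the other two are True, which is why the original assertion fails even though the context is present.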


How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network that uses triplet mining for training. For it to work properly, my batches need to contain several images of the same class. The problem I'm currently facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When shuffling the dataset using the command below, the odds of getting several pictures of the same class in one batch are really low, making the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I would need instead is a way of generating batches that contain a specific number of pictures for every class represented in this batch. As an example, let's say here that I want 12 classes per batch, there would be 4 pictures for each of them.
Another problem I'm facing is how to reshuffle this dataset at the end of every epoch, so that I get different batches that still satisfy the condition above, that is, 12 classes with 4 pictures for each one of them.
Is there any proper way to do it? I can't really find one. Please let me know if I'm unclear, and if you need further details.
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
counter = 0.

# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))

def custom_shuffle(train_dataset, batch_size, samples_per_class = 4, iterations_in_epoch = 100, database='market'):
    assert batch_size%samples_per_class==0, F'batch size must be a {samples_per_class} multiple.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)] # Every element of this array is a dataset of one class
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
        ) # Which classes will be in batch
        choice =[choice for _ in range(4)], axis=0)) # Exactly 4 picture from each class in the batch
        batch =, choice)
        if i==0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want, but the returned dataset is extremely slow to iterate, making model training impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function. However, when doing so, it raises the following error:
Traceback (most recent call last):
File "", line 137, in <module>
File "", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 1923, in _call_flat
return self._build_call_outputs(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
I don't understand this error and don't see how to fix it.
Is there something I'm doing wrong?
PS: I'm aware the lack of minimal code to reproduce this behavior makes it hard to debug, I'll try to provide some as soon as possible.
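The batching scheme described in the question (12 classes per batch, 4 pictures per class) can be sketched independently of TensorFlow as an index sampler. This is a minimal sketch in plain Python, not a tf.data solution and not the asker's method: pk_batches and its parameters are hypothetical names, labels are assumed to be available as a list, and classes with fewer than samples_per_class pictures are simply skipped.

```python
import random
from collections import defaultdict


def pk_batches(labels, classes_per_batch=12, samples_per_class=4, seed=None):
    """Yield lists of dataset indices, each holding exactly
    classes_per_batch classes with samples_per_class indices per class."""
    rng = random.Random(seed)

    # Group dataset indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: classes that are too small are left out entirely.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= samples_per_class]
    rng.shuffle(eligible)

    # Walk over the shuffled classes in groups; a trailing incomplete group is dropped.
    last_start = len(eligible) - classes_per_batch
    for start in range(0, last_start + 1, classes_per_batch):
        batch = []
        for c in eligible[start:start + classes_per_batch]:
            batch.extend(rng.sample(by_class[c], samples_per_class))
        yield batch
```

Calling pk_batches again at the start of each epoch (with a different seed, or none) gives new batches that still satisfy the constraint. Each class is used at most once per pass in this sketch, so one pass covers only samples_per_class pictures per class. The index lists could then drive the actual image loading, for example through tf.data.Dataset.from_generator.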

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
ID |array_list
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

DataFrame.transform = transform

df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows, and I am trying to avoid using explode or a UDF (I'm not sure it's possible to avoid both, though; I just need the code to run as efficiently as possible). I'm using Spark 2.4.4.
This is what I would like my output to look like:
ID |array_list | array_list2
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
regexp_replace works at the character level within each string, which is why the empty pattern produces nonsense.
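The character-level behaviour can be reproduced with Python's re module. This is only an analogy: Spark's regexp_replace uses Java regular expressions, and I am assuming the two engines agree on how an empty pattern matches.

```python
import re

# An empty pattern matches at every position between characters,
# so the replacement is inserted around each character.
replaced = re.sub("", "ZZZ", "AAA")
print(replaced)  # ZZZAZZZAZZZAZZZ

# Anchoring the pattern restricts the match to strings that are entirely empty.
only_empty = re.sub(r"^$", "ZZZ", "")
untouched = re.sub(r"^$", "ZZZ", "AAA")
print(only_empty, untouched)  # ZZZ AAA
```

Even an anchored pattern only helps with empty strings; a null element has no characters to match against, so it needs a null-aware function instead.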
I could not get it to work with transform either, but with help from the first answerer I used a UDF, which was not that easy.
Here is my example with my own data; you can tailor it to yours.
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

# Replace null elements with con_str; a null or empty array becomes [con_str]
concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)

arrayData = [
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
| name| knownLanguages|
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
Quite tough this, had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values as a string with a delimiter , (hoping you don't have it in your string but you can use something else). I use the null_replacement option to fill the null values. Then I split according to the same delimiter.
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn("array_list_2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))"))
SQL built-in transform is not working for me, so I couldn't try it but hopefully you'll have the result you wanted.
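One detail worth noting: transform over an empty array returns an empty array, whereas the desired output in the question maps [] to [ZZZ]. The sketch below models the intended per-row result in plain Python and gives a possible built-in-only Spark SQL expression as a string. fill_missing and SPARK_EXPR are hypothetical names, and the SQL expression is an untested assumption that relies on size() returning -1 for a null array, as I understand Spark 2.4 does by default.

```python
def fill_missing(arr, default="ZZZ"):
    """Pure-Python model of the desired result for one row (hypothetical helper)."""
    if not arr:
        # None or empty array: the question wants a single default element.
        return [default]
    # Treat both null and empty-string elements as missing.
    return [default if x is None or x == "" else x for x in arr]


# Untested assumption: the same logic with Spark SQL built-ins only (no UDF, no explode).
SPARK_EXPR = (
    "CASE WHEN size(array_list) <= 0 THEN array('ZZZ') "
    "ELSE transform(array_list, "
    "x -> CASE WHEN x IS NULL OR x = '' THEN 'ZZZ' ELSE x END) END"
)

print(fill_missing(["AAA", None, "JLT"]))
print(fill_missing([]))
```

If the expression behaves as assumed, it would be applied with df.withColumn("array_list2", F.expr(SPARK_EXPR)).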

Looping in Pyspark causes sparkException

In Zeppelin with pyspark.
Before I found the correct way of doing things (Last over a Window), I had a loop that extended the value of a previous row to itself one by one (I know loops are bad practice). However, after running a couple hundred times it fails with a nullPointerException before reaching the best case condition=0.
To get around the error, (before I discovered the last command), I had the loop run a few hundred times for a midpoint condition=1000, dump the results. Run it again with condition=500, rinse and repeat until I hit condition=0.
def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF =
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
    return myDF

myData =
myData = extendTarget(myData, 20, 0)
I expect it to take a stupidly long time (since I'm doing it wrong), but I do not expect it to fail with an exception.
Output (given inputs (myData, 20, 0)):
Py4JJavaError: An error occurred while calling o26814.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 1539.0 failed 4 times, most recent failure: Lost task 32.3 in stage 1539.0 (TID XXXX, ip-XXXX, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_XXXX_0001_01_000033 on host: ip-XXXX. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_XXXX_0001_01_000033
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Container exited with a non-zero exit code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
at sun.reflect.GeneratedMethodAccessor388.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o26814.count.\n', JavaObject id=o26815), <traceback object at 0x7efc521b11b8>)
I can only guess that this has something to do with memory or cache, even though I am reusing all variable names. If it is a memory problem, is there a garbage-collect or clear-cache/memory command that I can put beside the print command to allow it to loop forever?
Again, I know it's bad practice to use loops, especially if they go on for what seems like forever, but sometimes the better/smarter code doesn't present itself when I need it, so in the interim I hack it however I can.
Answered in a comment by Cronoik
Could you please try the following (this will keep the execution plan small): myDF = myDF.select("id", "myTime", col("targetNew").alias("target")).checkpoint() (replace the corresponding line, and set the checkpoint directory beforehand via spark.setCheckpointDir). – cronoik Sep 9 at 4:05
I tested it out and it worked! The first few times it failed though, mostly due to my misunderstanding of how checkpoint works. Would you like to put your answer as an answer? Some things I had to figure out that would be nice as part of the answer: spark.sparkContext.setCheckpointDir("path/") sets the checkpoint path; myDF.checkpoint() on its own does not work, you have to assign the checkpoint back with myDF = myDF.checkpoint(). Also, I found the NullPointerException: if I tell the loop to run 200 times before counting, it throws a NullPointerException and the SparkContext ends (it must be restarted). – Ranald Fong Sep 10 at 2:44
Also, having the checkpoint inside each loop caused it to run slowly, as it was probably checkpointing on every loop run. What really sped it up was putting the checkpoint right before the print, allowing it to loop in memory a number of times (I used 100) before checkpointing. Thanks cronoik, if you put your answer as an answer I'll mark it as my answer! – Ranald Fong Sep 10 at 2:53
Edit: Commenter wanted details
The point is to carry a known value forward from one row over the rows that follow it, replacing all unknowns.
Input:
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | unknown|
| a | . | . |
| a | 5:00 | unknown|
| a | 5:01 | 2 |
Expected output:
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | 1 |
| a | . | 1 |
| a | 5:00 | 1 |
| a | 5:01 | 2 |
Code was changed to use Checkpoints
def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF =
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
        myDF = myDF.checkpoint()
    return myDF

myData =
myData = extendTarget(myData, 20, 0)
At the expense of HDD space for the checkpoints, this allows the loop to go on forever! But in general, do not use loops (doubly so for infinite loops); use first and last with ignorenulls instead.
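For reference, the loop-free approach mentioned above could look roughly like the sketch below. forward_fill is a plain-Python model of the intended result; extend_target_with_window is a possible PySpark version using last with ignorenulls over a window. Both function names are hypothetical, the PySpark part is untested here, and it assumes the column names id, myTime and target from the question.

```python
def forward_fill(rows, unknown="unknown"):
    """Plain-Python model: rows are (id, time, target) tuples,
    already sorted by time within each id."""
    last_seen = {}
    filled = []
    for row_id, time, target in rows:
        if target != unknown:
            last_seen[row_id] = target
        # Leading unknowns with nothing before them stay as they are.
        filled.append((row_id, time, last_seen.get(row_id, target)))
    return filled


def extend_target_with_window(df):
    """Untested PySpark sketch of the same idea, without any loop."""
    from pyspark.sql import Window, functions as F  # imported lazily on purpose

    # All rows of the same id, from the first row up to the current one.
    w = (Window.partitionBy("id").orderBy("myTime")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    # Turn "unknown" into null so that ignorenulls can skip it.
    known = F.when(F.col("target") != "unknown", F.col("target"))
    filled = F.last(known, ignorenulls=True).over(w)
    # Fall back to the original value when no known value precedes the row.
    return df.withColumn("target", F.coalesce(filled, F.col("target")))
```

Because this is a single window expression, the execution plan does not grow with the number of unknown rows, so no checkpointing should be needed.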

evaluating test dataset using eval() in LightGBM

I have trained a ranking model with LightGBM with the objective 'lambdarank'.
I want to evaluate my model to get the nDCG score for my test dataset using the best iteration, but I have not been able to use either lightgbm.Booster.eval() or lightgbm.Booster.eval_train().
First, I have created 3 dataset instances, namely the train set, valid set and test set:
lgb_train = lgb.Dataset(x_train, y_train, group=query_train, free_raw_data=False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train, group=query_valid, free_raw_data=False)
lgb_test = lgb.Dataset(x_test, y_test, group=query_test)
I then train my model using lgb_train and lgb_valid:
gbm = lgb.train(params,
valid_sets=[lgb_train, lgb_valid],
When I call the eval() or eval_train() functions after training, they raise errors:
AttributeError Traceback (most recent call last)
<ipython-input-122-7ff5ef5136b8> in <module>()
----> 1 gbm.eval(data=lgb_test,name='test')
/usr/local/lib/python3.6/dist-packages/lightgbm/ in eval(self, data,
name, feval)
1925 raise TypeError("Can only eval for Dataset instance")
1926 data_idx = -1
-> 1927 if data is self.train_set:
1928 data_idx = 0
1929 else:
AttributeError: 'Booster' object has no attribute 'train_set'
ValueError Traceback (most recent call last)
<ipython-input-123-0ce5fa3139f5> in <module>()
----> 1 gbm.eval_train()
/usr/local/lib/python3.6/dist-packages/lightgbm/ in eval_train(self,
1956 List with evaluation results.
1957 """
-> 1958 return self.__inner_eval(self.__train_data_name, 0, feval)
1960 def eval_valid(self, feval=None):
/usr/local/lib/python3.6/dist-packages/lightgbm/ in
__inner_eval(self, data_name, data_idx, feval)
2352 """Evaluate training or validation data."""
2353 if data_idx >= self.__num_dataset:
-> 2354 raise ValueError("Data_idx should be smaller than number
of dataset")
2355 self.__get_eval_info()
2356 ret = []
ValueError: Data_idx should be smaller than number of dataset
And when I call the eval_valid() function, it returns an empty list.
Can anyone tell me how to evaluate a LightGBM model and get the nDCG score using test set properly? Thanks.
If you add keep_training_booster=True as an argument to your lgb.train, the returned booster object would be able to execute eval and eval_train (though eval_valid would still return an empty list for some reason even when valid_sets is provided in lgb.train).
Documentation says:
keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning.
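As an independent cross-check on whatever Booster.eval reports, nDCG can also be computed by hand from the model's predicted scores. The sketch below is plain Python and does not call LightGBM at all. It assumes a gain of 2^label - 1 and a log2 position discount, which is my understanding of LightGBM's defaults but is not verified here. All function names are hypothetical, and a query whose labels are all zero is scored 0.0 by choice.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(labels, scores, k):
    """nDCG@k for one query, ranking documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(labels, scores, group_sizes, k):
    """Average nDCG@k over queries; group_sizes has the same layout
    as the group argument passed to lgb.Dataset."""
    total, start = 0.0, 0
    for size in group_sizes:
        total += ndcg_at_k(labels[start:start + size], scores[start:start + size], k)
        start += size
    return total / len(group_sizes)
```

Here the scores would come from something like gbm.predict(x_test), with labels taken from y_test and group_sizes from query_test.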

Convert DB rules to if-else conditions?

I want to convert a set of rows in SQL Server database (In the form of rules) to a single if-else condition without hardcoding any values in the code. The code will be written in Scala and I am trying to figure out the logic to do this but could not think of a good approach.
Sample SQL Server Rows:
ABC | = | 0 | NULL | GOOD
ABC | = | 1 | NULL | BAD
ABC | = | 2 | NULL | ERROR
ABC | >= | 3 | NULL | IGNORE
Similar to tag ABC, there could be any number of tags and the conditions will vary with the tag column and each tag will have conditions in multiple rows. If anyone has dealt with a similar problem and has any suggestions that would be appreciated.
The question doesn't seem clear to me as currently written. What do you mean by "a single if-else condition without hardcoding any values in the code"?
Would the following work?
sealed trait Condition
object Eq extends Condition // =
object Ge extends Condition // >=

sealed trait Status
object Good extends Status
object Bad extends Status
object Error extends Status
object Ignore extends Status

case class Rule(tag: String,
                condition: Condition,
                min: Int,
                max: Int,
                status: Status)

def handle(input: Int, rules: List[Rule]): Status =
  rules
    .view // lazily iterate the rules
    .filter { // find matching rules
      case Rule(_, Eq, x, _, _) if input == x => true
      case Rule(_, Ge, x, _, _) if input >= x => true
      case _ => false
    }
    .map { matchingRule => matchingRule.status } // return the status
    .head // find the status of the first matching rule, or throw

// Tests
val rules = List(
  Rule("abc", Eq, 0, 0, Good),
  Rule("abc", Eq, 1, 0, Bad),
  Rule("abc", Eq, 2, 0, Error),
  Rule("abc", Ge, 3, 0, Ignore))

assert(handle(0, rules) == Good)
assert(handle(1, rules) == Bad)
assert(handle(2, rules) == Error)
assert(handle(3, rules) == Ignore)
assert(handle(4, rules) == Ignore)
