Looping in Pyspark causes sparkException - loops

In Zeppelin with pyspark.
Before I found the correct way of doing things (Last over a Window), I had a loop that extended the value of a previous row to itself one by one (I know loops are bad practice). However, after running a couple hundred times it fails with a nullPointerException before reaching the best case condition=0.
To get around the error, (before I discovered the last command), I had the loop run a few hundred times for a midpoint condition=1000, dump the results. Run it again with condition=500, rinse and repeat until I hit condition=0.
def extendTarget(myDF, loop, lessThan):
i = myDF.filter(col("target") == "unknown").count()
while (i > lessThan):
cc = loop
while (cc > 0):
myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
myDF = myDF.select(
"id",
"myTime",
col("targetNew").alias("target"))
cc = cc - 1
i = myDF.filter(col("target") == "unknown").count()
print i
return myDF
myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)
I expect it to take a stupid long amount of time (since I'm doing it wrong), but do not expect it to exception out with
Output (given inputs (myData, 20, 0)
38160
22130
11375
6625
5085
4522
4216
3936
3662
3419
3202
Error
Py4JJavaError: An error occurred while calling o26814.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 1539.0 failed 4 times, most recent failure: Lost task 32.3 in stage 1539.0 (TID XXXX, ip-XXXX, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_XXXX_0001_01_000033 on host: ip-XXXX. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_XXXX_0001_01_000033
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 50
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
at sun.reflect.GeneratedMethodAccessor388.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o26814.count.\n', JavaObject id=o26815), <traceback object at 0x7efc521b11b8>)
I can only guess that this has something to do with memory or cache. Even though I am reusing all variable names. If it is a memory problem, is there a garbage collect, or clear cache/memory command that I can put beside the print command to allow it to loop forever?
Again I know its bad practice to use loops, especially if they go on for what seems like forever, but sometimes the better\smarter code doesn't present itself when I need it so during the interim I hack it however I can.

Answered in a comment by Cronoik
Could you please try the following (this will keep the execution plan small): myDF = myDF.select("id", "myTime", col("targetNew").alias("target")).checkpoint() (replace the corrosponding line before and and the checkpoint directory before via spark.setCheckpointDir). – cronoik Sep 9 at 4:05
I tested it out and it worked! The first few times it failed though, mostly due to my misunderstanding of how checkpoint works. Would you like to put your answer as an answer? Some things I had to figure out that would be nice as part of the answer spark.sparkContext.setCheckpointDir("path/") - to set checkpointPath myDF.checkpoint() does not work, you have to assign that checkpoint back myDF = myDF.checkpoint() Also, I found the nullpointer exception, if I tell the loop to run 200 times before counting it does a nullpointerException and sparkContext ends. (must be restarted) – Ranald Fong Sep 10 at 2:44
Also, having the checkpoint inside each loop caused it to run slowly as it was probably checkpointing every loop run. What really sped it up was putting the checkpoint right before the print, allowing it to loop within memory x amt of times (i used 100) before checkpointing. Thanks cronoik, if you put your answer as an answer I'll mark it my answer! – Ranald Fong Sep 10 at 2:53
Edit: Commenter wanted details
Point is to extend a value from one row until the end, replacing all unknowns
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | unknown|
| a | . | . |
| a | 5:00 | unknown|
| a | 5:01 | 2 |
to
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | 1 |
| a | . | 1 |
| a | 5:00 | 1 |
| a | 5:01 | 2 |
Code was changed to use Checkpoints
spark.sparkContext.setCheckpointDir(".../myCheckpointsPath/")
def extendTarget(myDF, loop, lessThan):
i = myDF.filter(col("target") == "unknown").count()
while (i > lessThan):
cc = loop
while (cc > 0):
myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
myDF = myDF.select(
"id",
"myTime",
col("targetNew").alias("target"))
cc = cc - 1
i = myDF.filter(col("target") == "unknown").count()
print i
myDF = myDF.checkpoint()
return myDF
myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)
At the expense of HDD space for the checkpoints, allows the loop to go on forever! But in general, do not use Loops(doubly so for infinite loops), use First and Last with ignore Nulls instead.

Related

flink table API not processing records

I read json data from Kafka and tried to process the data with flink table API.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
tEnv.executeSql(
"create table inputTable(" +
"`src_ip` STRING," +
"`src_port` STRING," +
"`bytes_from_src` BIGINT," +
"`pkts_from_src` BIGINT," +
"`ts` TIMESTAMP(2) METADATA FROM 'timestamp'," +
"WATERMARK FOR ts AS ts" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'test'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'testGroup'," +
"'scan.startup.mode' = 'earliest-offset'," +
"'format' = 'json'," +
"'json.fail-on-missing-field' = 'true'," +
"'json.ignore-parse-errors' = 'false'" +
")");
Table inputTable = tEnv.from("inputTable");
inputTable.printSchema();
inputTable.execute().print();
Table windowedTable = inputTable
.window(Tumble.over(lit(5).seconds()).on($("ts")).as("w"))
.groupBy($("w"), $("src_ip"))
.select($("w").start().as("window_start"),
$("src_ip"),
$("src_ip").count().as("src_ip_count"),
$("bytes_from_src").avg().as("bytes_from_src_mean")
);
windowedTable.execute().print();
There are 4 records in Kafka. The flink program prints out the schema info and the inputTable as the following:
Connected to the target VM, address: '127.0.0.1:62348', transport: 'socket'
(
`src_ip` STRING,
`src_port` STRING,
`bytes_from_src` BIGINT,
`pkts_from_src` BIGINT,
`ts` TIMESTAMP(2) *ROWTIME* METADATA FROM 'timestamp',
WATERMARK FOR `ts`: TIMESTAMP(2) AS `ts`
)
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| op | src_ip | src_port | bytes_from_src | pkts_from_src | ts |
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| +I | 44.38.5.31 | 53159 | 120 | 3 | 2021-08-13 14:59:56.00 |
| +I | 44.38.132.51 | 39409 | 100 | 2 | 2021-08-13 14:58:11.00 |
| +I | 44.38.4.44 | 56758 | 336 | 6 | 2021-08-13 14:59:14.00 |
| +I | 44.38.5.34 | 40001 | 80 | 2 | 2021-08-13 14:57:04.00 |
After that, nothing is printed out. The program did not exit. I am running the flink within IDEA. At this point, it seems like a black box. There is no output, and I do not know how to trace a flink program.
If I commented out the line inputTable.execute().print();, the schema info is printed out, but nothing after that and the program does not exit.
The flink version used is 1.14.2.
I believe those records are being processed, and are being added to the window. But event time windows are triggered by watermarks, and the watermark isn't becoming large enough to trigger the window. To get this to work you need to process an event with a timestamp past the end of the window -- i.e., 2021-08-13 15:00:00.00 or larger.
For debugging, the Flink web dashboard is helpful in situations like this. You can see if events are being processed, examine the watermarks, etc. See Flink webui when running from IDE for help in setting it up.

Calculate total number of orders in different statuses

I want to create simple dashboard where I want to show the number of orders in different statuses. The statuses can be New/Cancelled/Finished/etc
Where should I implement these criteria? If I add filter in the Cube Browser then it applies for the whole dashboard. Should I do that in KPI? Or should I add calculated column with 1/0 values?
My expected output is something like:
--------------------------------------
| Total | New | Finished | Cancelled |
--------------------------------------
| 1000 | 100 | 800 | 100 |
--------------------------------------
I'd use measures for that, something like:
CountTotal = COUNT('Orders'[OrderID])
CountNew = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "New")
CountFinished = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Finished")
CountCancelled = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Cancelled")

How can i get the logs of past months in postgres DB .

Issue :
Someone has added a junk column in one of my table.I want to figure it out from the logs as when and from where this activity has been performed.
Please Help regarding this issue.
Make sure enable logging in postgresql.conf
1.log_destination = 'stderr' #log_destination = 'stderr,csvlog,syslog'
2.logging_collector = on #need restart
3.log_directory = 'pg_log'
4.log_file_name = 'postgresql-%Y-%m-%d_%H%M%S.log'
5.log_rotation_age = 1d
6.log_rotation_size = 10MB
7.log_min_error_statement = error
8.log_min_duration_statement = 5000 # -1 = disable ; 0 = ALL ; 5000 = 5sec
9.log_line_prefix = '|%m|%r|%d|%u|%e|'
10.log_statment = 'ddl' # 'none' | 'ddl' | 'mod' | 'all'
#prefer 'ddl' because the log output will be 'ddl' and 'query min duration'
If you don't enable it, make sure enable it now.
if you don't have log the last attempt is pg_xlogdump your xlog file under pg_xlog and look for DDL

context is unavailable in python behave

I am starting with Python behave and got stuck when trying to access context - it's not available. Here is my code:
Here is the Feature file:
Feature: Company staff inventory
Scenario: count company staff
Given a set of employees:
| name | dept |
| Paul | IT |
| Mary | IT |
| Pete | Dev |
When we count the number of employees in each department
Then we will find two people in IT
And we will find one employee in Dev
Here is the Steps file:
from behave import *
#given('a set of employees')
def step_impl(context):
assert context is True
#when('we count the number of employees in each department')
def step_impl(context):
context.res = dict()
for row in context.table:
for k, v in row:
if k not in context.res:
context.res[k] = 1
else:
context.res[k] += 1
#then('we will find two people in IT')
def step_impl(context):
assert context.res['IT'] == 2
#then('we will find one employee in Dev')
def step_impl(context):
assert context.res['Dev'] == 1
Here's the Traceback:
Traceback (most recent call last):
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1456, in run
match.run(runner.context)
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "steps/count_staff.py", line 5, in step_impl
assert context is True
AssertionError
The context object is there. It is your assertion that is problematic:
assert context is True
fails because context is not True. The only thing that can satisfy x is True is where x has been set to True. context has not been set to True. It is an object. Note that testing whether context is True is not the same as testing whether if you were to write if context: the true branch would be taken or the false branch. The latter is equivalent to testing whether bool(context) is True.
There's no point in testing whether context is available. It is always there.

Count unique rows of column in Array of object

Using
Ruby 1.9.3-p194
Rails 3.2.8
Here's what I need.
Count the different human resources (human_resource_id) and divide this by the total number of assignments (assignment_id).
So, the answer for the dummy-data as given below should be:
1.5 assignments per human resource
But I just don't know where to go anymore.
Here's what I tried:
Table name: Assignments
id | human_resource_id | assignment_id | assignment_start_date | assignment_expected_end_date
80101780 | 20200132 | 80101780 | 2012-10-25 | 2012-10-31
80101300 | 20200132 | 80101300 | 2012-07-07 | 2012-07-31
80101308 | 21100066 | 80101308 | 2012-07-09 | 2012-07-17
At first I need to make a selection for the period I need to 'look' at. This is always from max a year ago.
a = Assignment.find(:all, :conditions => { :assignment_expected_end_date => (DateTime.now - 1.year)..DateTimenow })
=> [
#<Assignment id: 80101780, human_resource_id: "20200132", assignment_id: "80101780", assignment_start_date: "2012-10-25", assignment_expected_end_date: "2012-10-31">,
#<Assignment id: 80101300, human_resource_id: "20200132", assignment_id: "80101300", assignment_start_date: "2012-07-07", assignment_expected_end_date: "2012-07-31">,
#<Assignment id: 80101308, human_resource_id: "21100066", assignment_id: "80101308", assignment_start_date: "2012-07-09", assignment_expected_end_date: "2012-07-17">
]
foo = a.group_by(&:human_resource_id)
Now I got a beautiful 'Array of hash of object' and I just don't know what to do next.
Can someone help me?
You can try to execute the request in SQL :
ActiveRecord::Base.connection.select_value('SELECT count(distinct human_resource_id) / count(distinct assignment_id) AS ratio FROM assignments');
You could do something like
human_resource_count = assignments.collect{|a| a.human_resource_id}.uniq.count
assignment_count = assignments.collect{|a| a.assignment_id}.uniq.count
result = human_resource_count/assignment_count

Resources