I read JSON data from Kafka and try to process it with the Flink Table API.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
tEnv.executeSql(
"create table inputTable(" +
"`src_ip` STRING," +
"`src_port` STRING," +
"`bytes_from_src` BIGINT," +
"`pkts_from_src` BIGINT," +
"`ts` TIMESTAMP(2) METADATA FROM 'timestamp'," +
"WATERMARK FOR ts AS ts" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'test'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'testGroup'," +
"'scan.startup.mode' = 'earliest-offset'," +
"'format' = 'json'," +
"'json.fail-on-missing-field' = 'true'," +
"'json.ignore-parse-errors' = 'false'" +
")");
Table inputTable = tEnv.from("inputTable");
inputTable.printSchema();
inputTable.execute().print();
Table windowedTable = inputTable
.window(Tumble.over(lit(5).seconds()).on($("ts")).as("w"))
.groupBy($("w"), $("src_ip"))
.select($("w").start().as("window_start"),
$("src_ip"),
$("src_ip").count().as("src_ip_count"),
$("bytes_from_src").avg().as("bytes_from_src_mean")
);
windowedTable.execute().print();
There are 4 records in Kafka. The Flink program prints the schema info and the contents of inputTable as follows:
Connected to the target VM, address: '127.0.0.1:62348', transport: 'socket'
(
`src_ip` STRING,
`src_port` STRING,
`bytes_from_src` BIGINT,
`pkts_from_src` BIGINT,
`ts` TIMESTAMP(2) *ROWTIME* METADATA FROM 'timestamp',
WATERMARK FOR `ts`: TIMESTAMP(2) AS `ts`
)
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| op | src_ip | src_port | bytes_from_src | pkts_from_src | ts |
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| +I | 44.38.5.31 | 53159 | 120 | 3 | 2021-08-13 14:59:56.00 |
| +I | 44.38.132.51 | 39409 | 100 | 2 | 2021-08-13 14:58:11.00 |
| +I | 44.38.4.44 | 56758 | 336 | 6 | 2021-08-13 14:59:14.00 |
| +I | 44.38.5.34 | 40001 | 80 | 2 | 2021-08-13 14:57:04.00 |
After that, nothing is printed and the program does not exit. I am running Flink inside IDEA. At this point it feels like a black box: there is no output, and I do not know how to trace a Flink program.
If I comment out the line inputTable.execute().print();, the schema is printed, but nothing after that, and the program still does not exit.
The Flink version used is 1.14.2.
I believe those records are being processed, and are being added to the window. But event time windows are triggered by watermarks, and the watermark isn't becoming large enough to trigger the window. To get this to work you need to process an event with a timestamp past the end of the window -- i.e., 2021-08-13 15:00:00.00 or larger.
For debugging, the Flink web dashboard is helpful in situations like this. You can see if events are being processed, examine the watermarks, etc. See Flink webui when running from IDE for help in setting it up.
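For example, a minimal way to get the dashboard from inside the IDE is to create the local environment with the web UI enabled (a sketch; it assumes the flink-runtime-web dependency is on the classpath):
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Start a local cluster with the web dashboard (http://localhost:8081 by default)
// instead of the plain local environment.
Configuration conf = new Configuration();
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);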
I have a DataFrame where passengerId and path are Strings. The path represents the flight path of the passenger, so passenger 10096 started in country CO and traveled to country BM. I need to find the longest run of flights each passenger has taken without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
(1,List("UK","IR","AT","UK","CH","PK")),
(2,List("CG","IR")),
(3,List("CG","IR","SG","BE","UK")),
(4,List("CG","IR","NO","UK","SG","UK","IR","TJ","AT")),
(5,List("CG","IR"))
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.
If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of steps without going to the UK, you could do something like this:
df
// split the string on 'uk' and generate one row per sub journey
.withColumn("path", explode(split('path, ",?uk,?")))
// compute the size of each sub journey
.withColumn("path_size", size(split('path, ",")))
// retrieve the longest one
.groupBy("passengerId")
.agg(max('path_size) as "max_path_size")
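As a quick sanity check of how the ",?uk,?" split behaves on a couple of sample paths (assuming a SparkSession named spark is in scope):
import spark.implicits._
import org.apache.spark.sql.functions.split

// Each path is cut wherever 'uk' occurs, leaving only the uk-free sub journeys.
Seq("sg,cn,uk,co,bm", "pk,uk").toDF("path")
  .select(split($"path", ",?uk,?") as "pieces")
  .show(false)
// roughly: [sg,cn, co,bm] for the first path, and [pk, ] (with a trailing empty piece) for the second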
I want to create a simple dashboard showing the number of orders in each status. The statuses can be New/Cancelled/Finished/etc.
Where should I implement these criteria? If I add a filter in the Cube Browser, it applies to the whole dashboard. Should I do it in a KPI? Or should I add a calculated column with 1/0 values?
My expected output is something like:
--------------------------------------
| Total | New | Finished | Cancelled |
--------------------------------------
| 1000 | 100 | 800 | 100 |
--------------------------------------
I'd use measures for that, something like:
CountTotal = COUNT('Orders'[OrderID])
CountNew = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "New")
CountFinished = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Finished")
CountCancelled = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Cancelled")
In Zeppelin with PySpark.
Before I found the correct way of doing things (last over a window), I had a loop that extended the value of the previous row into the current one, one step at a time (I know loops are bad practice). However, after running a couple hundred times it fails with a NullPointerException before reaching the stopping condition of 0.
To get around the error (before I discovered last), I had the loop run a few hundred times down to a midpoint condition of 1000, dump the results, run it again with a condition of 500, and rinse and repeat until I hit 0.
def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF = myDF.select(
                "id",
                "myTime",
                col("targetNew").alias("target"))
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
    return myDF
myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)
I expect it to take a stupidly long time (since I'm doing it wrong), but I do not expect it to fail with the exception below.
Output (given inputs (myData, 20, 0)):
38160
22130
11375
6625
5085
4522
4216
3936
3662
3419
3202
Error
Py4JJavaError: An error occurred while calling o26814.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 1539.0 failed 4 times, most recent failure: Lost task 32.3 in stage 1539.0 (TID XXXX, ip-XXXX, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_XXXX_0001_01_000033 on host: ip-XXXX. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_XXXX_0001_01_000033
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 50
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
at sun.reflect.GeneratedMethodAccessor388.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o26814.count.\n', JavaObject id=o26815), <traceback object at 0x7efc521b11b8>)
I can only guess that this has something to do with memory or caching, even though I am reusing all the variable names. If it is a memory problem, is there a garbage-collect or clear-cache command that I can put next to the print statement to allow it to loop forever?
Again, I know it's bad practice to use loops, especially ones that seem to go on forever, but sometimes the better/smarter code doesn't present itself when I need it, so in the interim I hack it however I can.
Answered in a comment by Cronoik
Could you please try the following (this will keep the execution plan small): myDF = myDF.select("id", "myTime", col("targetNew").alias("target")).checkpoint() (replace the corresponding line above, and set the checkpoint directory beforehand via spark.setCheckpointDir). – cronoik Sep 9 at 4:05
I tested it out and it worked! The first few times it failed, though, mostly due to my misunderstanding of how checkpoint works. Would you like to post your answer as an answer? Some things I had to figure out that would be nice as part of the answer: spark.sparkContext.setCheckpointDir("path/") sets the checkpoint path, and myDF.checkpoint() on its own does not work; you have to assign the checkpoint back with myDF = myDF.checkpoint(). Also, about the NullPointerException: if I tell the loop to run 200 times before counting, it throws a NullPointerException and the SparkContext ends (it must be restarted). – Ranald Fong Sep 10 at 2:44
Also, having the checkpoint inside the inner loop made it run slowly, since it was probably checkpointing on every iteration. What really sped it up was putting the checkpoint right before the print, letting it loop in memory a number of times (I used 100) before checkpointing. Thanks cronoik; if you post your answer as an answer I'll mark it as the accepted one! – Ranald Fong Sep 10 at 2:53
Edit: Commenter wanted details
The point is to carry a value forward from the row where it is known, replacing all the unknowns until the next known value:
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | unknown|
| a | . | . |
| a | 5:00 | unknown|
| a | 5:01 | 2 |
to
| id | time | target |
| a | 1:00 | 1 |
| a | 1:01 | 1 |
| a | . | 1 |
| a | 5:00 | 1 |
| a | 5:01 | 2 |
The code was changed to use checkpoints:
spark.sparkContext.setCheckpointDir(".../myCheckpointsPath/")
def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF = myDF.select(
                "id",
                "myTime",
                col("targetNew").alias("target"))
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
        myDF = myDF.checkpoint()
    return myDF
myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)
At the expense of disk space for the checkpoints, this allows the loop to go on forever! But in general, do not use loops (doubly so when they run practically forever); use first and last with ignorenulls instead.
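For reference, a rough sketch of that non-loop approach (last over a window with ignorenulls), reusing the column names from the question; the "unknown" markers are turned into nulls first so that last can skip over them:
from pyspark.sql.functions import col, last, when
from pyspark.sql.window import Window

# Carry the most recent known target forward within each id, ordered by time.
w = (Window.partitionBy("id")
           .orderBy("myTime")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

myData = spark.read.load(myPath)
filled = (myData
          .withColumn("target", when(col("target") == "unknown", None).otherwise(col("target")))
          .withColumn("target", last("target", ignorenulls=True).over(w)))
filled.write.parquet(myPathPart1)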
Using
Ruby 1.9.3-p194
Rails 3.2.8
Here's what I need.
Count the different human resources (human_resource_id) and divide this by the total number of assignments (assignment_id).
So, the answer for the dummy-data as given below should be:
1.5 assignments per human resource
But I just don't know where to go anymore.
Here's what I tried:
Table name: Assignments
id | human_resource_id | assignment_id | assignment_start_date | assignment_expected_end_date
80101780 | 20200132 | 80101780 | 2012-10-25 | 2012-10-31
80101300 | 20200132 | 80101300 | 2012-07-07 | 2012-07-31
80101308 | 21100066 | 80101308 | 2012-07-09 | 2012-07-17
First I need to select the period to look at, which is always at most a year back.
a = Assignment.find(:all, :conditions => { :assignment_expected_end_date => (DateTime.now - 1.year)..DateTime.now })
=> [
#<Assignment id: 80101780, human_resource_id: "20200132", assignment_id: "80101780", assignment_start_date: "2012-10-25", assignment_expected_end_date: "2012-10-31">,
#<Assignment id: 80101300, human_resource_id: "20200132", assignment_id: "80101300", assignment_start_date: "2012-07-07", assignment_expected_end_date: "2012-07-31">,
#<Assignment id: 80101308, human_resource_id: "21100066", assignment_id: "80101308", assignment_start_date: "2012-07-09", assignment_expected_end_date: "2012-07-17">
]
foo = a.group_by(&:human_resource_id)
Now I got a beautiful 'Array of hash of object' and I just don't know what to do next.
Can someone help me?
You can try executing the query directly in SQL:
ActiveRecord::Base.connection.select_value('SELECT count(distinct human_resource_id) / count(distinct assignment_id) AS ratio FROM assignments');
(Depending on the database, you may need to cast one of the counts to a float so the division is not truncated to an integer.)
You could do something like
human_resource_count = assignments.collect{|a| a.human_resource_id}.uniq.count
assignment_count = assignments.collect{|a| a.assignment_id}.uniq.count
# to_f avoids truncating the ratio with integer division
result = human_resource_count.to_f / assignment_count
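If you prefer to push the distinct counts into the database instead of loading the rows, something along these lines should also work (a sketch based on the Rails 3.2 count options; the scope mirrors the date condition above):
scope = Assignment.where(:assignment_expected_end_date => (DateTime.now - 1.year)..DateTime.now)

human_resource_count = scope.count(:human_resource_id, :distinct => true)
assignment_count = scope.count(:assignment_id, :distinct => true)

# to_f avoids integer division
result = human_resource_count.to_f / assignment_count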
I am developing a photo album web application with two tables, USER_INFORMATION and USER_PHOTO. USER_INFORMATION contains one record per user, while USER_PHOTO can contain several photos for a single user. I want to fetch the user information along with the user's image paths and store it in a Java POJO list, so that I can display it with the display tag in Struts 2.
The tables look like this:
USER_PHOTO
=======================
| IMAGEPATH | USER_ID |
=======================
| xyz       | 1       |
| sdas      | 1       |
| qwewq     | 2       |
| asaa      | 1       |
| 121       | 3       |
=======================

USER_INFORMATION
==================================
| USER_NAME | AGE | ADDRESS | ID |
==================================
| abs       | 34  | sdas    | 1  |
| asddd     | 22  | asda    | 2  |
| sadl      | 121 | asd     | 3  |
==================================
I somewhat managed to get the user information displayed correctly by using a HashSet and the display tag. The code goes like this:
public ArrayList loadData() throws ClassNotFoundException, SQLException {
    ArrayList<UserData> userList = new ArrayList<UserData>();
    try {
        String name;
        String fatherName;
        int Id;
        int age;
        String address;
        String query = "SELECT NAME,FATHERNAME,AGE,ADDRESS,ID FROM USER_INFORMATION ,USER_PHOTO WHERE ID=USER_ID";
        ps = con.prepareStatement(query);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            name = rs.getString(1);
            fatherName = rs.getString(2);
            age = rs.getInt(3);
            address = rs.getString(4);
            //Id = rs.getInt(5);
            UserData list = new UserData();
            list.setName(name);
            list.setFatherName(fatherName);
            list.setAge(age);
            list.setAddress(address);
            //list.setPhot(Id);
            userList.add(list);
        }
        ps.close();
        con.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    ArrayList al = new ArrayList();
    HashSet hs = new HashSet();
    hs.addAll(userList);
    al.clear();
    al.addAll(hs);
    //return (userList);
    return al;
}
And my display tag in the JSP goes like this:
<display:table id="data" name="sessionScope.UserForm.userList" requestURI="/userAction.do" pagesize="15" >
<display:column property="name" title="NAME" sortable="true" />
<display:column property="fatherName" title="User Name" sortable="true" />
<display:column property="age" title="AGE" sortable="true" />
<display:column property="address" title="ADDRESS" sortable="true" />
</display:table>
As you can see, I am displaying only the user information, not the photos the user uploaded; I want to display the uploaded file names as well. I created another table to store the photo information. If I include the IMAGEPATH attribute in my SQL query, there will be duplicate rows: if a user has uploaded 5 photos, the display tag will show five records with the same user information but different image paths.
So can anyone tell me how to display only one record of user information together with the user's multiple image paths, so that the user information is not repeated?
Thanks In Advance
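One common way to get a single row per user is to include IMAGEPATH in the join but collapse the duplicate rows in Java, keyed by the user id. A rough sketch (hypothetical: it assumes UserData gets a new List<String> imagePaths property with getter/setter, and it needs java.util.Map, LinkedHashMap and ArrayList imported):
// Hypothetical sketch: collapse the joined rows so each user appears once,
// collecting all of that user's image paths into one list.
String query = "SELECT NAME, FATHERNAME, AGE, ADDRESS, ID, IMAGEPATH "
             + "FROM USER_INFORMATION, USER_PHOTO WHERE ID = USER_ID";
Map<Integer, UserData> usersById = new LinkedHashMap<Integer, UserData>();
ps = con.prepareStatement(query);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    int id = rs.getInt("ID");
    UserData user = usersById.get(id);
    if (user == null) {
        user = new UserData();
        user.setName(rs.getString("NAME"));
        user.setFatherName(rs.getString("FATHERNAME"));
        user.setAge(rs.getInt("AGE"));
        user.setAddress(rs.getString("ADDRESS"));
        user.setImagePaths(new ArrayList<String>());   // assumed new field on UserData
        usersById.put(id, user);
    }
    user.getImagePaths().add(rs.getString("IMAGEPATH"));
}
ArrayList<UserData> userList = new ArrayList<UserData>(usersById.values());
In the JSP, the imagePaths list can then be rendered inside a single display:column per user, so the user information appears only once.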