I am trying to use ContinuousEventTimeTrigger with timeWindow:
val timed_stamped_stream = likes_stream.map({ p =>
    val x = p.split(",")
    (get_timestamp(x(0)), x(1).trim.toLong, x(2).trim.toLong)
  })
  .assignTimestamps(new TimestampExtractor[(Long, Long, Long)] {
    override def getCurrentWatermark: Long = Long.MinValue
    override def extractWatermark(t: (Long, Long, Long), l: Long): Long = t._1
    override def extractTimestamp(t: (Long, Long, Long), l: Long): Long = t._1
  })
  .keyBy(2)
  .timeWindow(Time.of(5, TimeUnit.SECONDS))

timed_stamped_stream
  .trigger(ContinuousEventTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)))
  .sum(1)
  .print()
But I am getting a java.lang.StackOverflowError when executing it on a Flink streaming cluster, with the following stack trace:
java.lang.StackOverflowError
at java.util.HashMap.putVal(HashMap.java:628)
at java.util.HashMap.put(HashMap.java:611)
at java.util.HashSet.add(HashSet.java:219)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.registerEventTimeTimer(WindowOperator.java:532)
at org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger.onEventTime(ContinuousEventTimeTrigger.java:61)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:558)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onEventTime(WindowOperator.java:562)
Can anyone help?
I am using Flink 1.12.2, and I wrote a simple Flink application to try the integration between the DataStream and Table APIs. The code is as follows:
package org.example

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{EnvironmentSettings, FieldExpression}
import org.apache.flink.table.api.bridge.scala._

object FlinkDemo {
  def main(args: Array[String]): Unit = {
    implicit val typeInfo = TypeInformation.of(classOf[Int])
    val fsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings)
    fsEnv.setParallelism(2)

    val ds: DataStream[Int] = fsEnv.fromElements(Seq(1, 2, 3, 4, 5): _*)
    fsTableEnv.createTemporaryView("t", ds)
    fsTableEnv.from("t").printSchema()

    val table = fsTableEnv.sqlQuery("select f0 from t")
    table.toAppendStream[Int].print()
    fsEnv.execute()
  }
}
The result is very surprising: it is not 1,2,3,4,5 but the following. I am not sure where the problem is; could you please help me out?
2> org.apache.flink.table.data.binary.BinaryRowData@4fe4bbf0
1> org.apache.flink.table.data.binary.BinaryRowData@5759f99e
2> org.apache.flink.table.data.binary.BinaryRowData@564770ca
1> org.apache.flink.table.data.binary.BinaryRowData@d206e547
1> org.apache.flink.table.data.binary.BinaryRowData@f8248f4
I have a Beam pipeline written in Python that, when deployed to a Flink runner, doesn't make use of the parallelism correctly.
There is unbounded data coming in through a Kafka connector, and I want it to be split and read in parallel.
My understanding is that the runner should split up the work, but as shown in the image only one subtask is used; the other 5 subtasks finished instantly, leaving the single running one to do all the work.
The pipeline settings are:
options = PipelineOptions([
    "--runner=PortableRunner",
    "--sdk_worker_parallelism=3",
    "--artifact_endpoint=localhost:8098",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--checkpointing_interval=30000",
])
options._all_options['parallelism'] = 3
Is this a missing config on the Flink runner or something that can be configured in the BEAM pipeline?
The full pipeline:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions([
    "--runner=PortableRunner",
    "--sdk_worker_parallelism=3",
    "--artifact_endpoint=localhost:8098",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--checkpointing_interval=30000",
])
options._all_options['parallelism'] = 3


class CountProvider(beam.RestrictionProvider):
    def __init__(self, initial_split_size=5):
        self._initial_split_size = initial_split_size
        self.OffsetRestrictionTracker = None

    def imports(self):
        if self.OffsetRestrictionTracker is not None:
            return
        from apache_beam.io.restriction_trackers import OffsetRestrictionTracker, OffsetRange
        self.OffsetRestrictionTracker = OffsetRestrictionTracker
        self.OffsetRange = OffsetRange

    def initial_restriction(self, element):
        self.imports()
        return self.OffsetRange(0, 10)

    def create_tracker(self, restriction):
        self.imports()
        return self.OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size() * 100_000

    def split(self, element, restriction):
        self.imports()
        if restriction.start + 1 >= restriction.stop:
            yield self.OffsetRange(restriction.start, restriction.stop)
        else:
            last_val = restriction.start
            for i in range(1, self._initial_split_size):
                next_stop = i * (restriction.start + restriction.stop) // self._initial_split_size
                yield self.OffsetRange(last_val, next_stop)
                last_val = next_stop
            yield self.OffsetRange(last_val, restriction.stop)


class CountFn(beam.DoFn):
    def setup(self):
        print("setup")

    def process(self, element, tracker=beam.DoFn.RestrictionParam(CountProvider())):
        res = tracker.current_restriction()
        print(f"Current Restriction {res.start}, {res.stop}")
        for i in range(res.start, res.stop):
            if not tracker.try_claim(i):
                return
            for j in range(10_000):
                yield i, j

    def get_initial_restriction(self, filename):
        return (0, 10)

    def teardown(self):
        print("Teardown")


p = beam.Pipeline(options=options)
out = (p | f'Create' >> beam.Create([tuple()])
         | f'Gen Data' >> beam.ParDo(CountFn())
         | beam.Map(print))

result = p.run()
result.wait_until_finish()
I'm trying to consume some Kafka topics through Flink streams.
Below is the code for the stream.
object VehicleFuelEventStream {

  def sink[IN](hosts: String, table: String, ds: DataStream[IN]): CassandraSink[IN] = {
    CassandraSink.addSink(ds)
      .setQuery(s"INSERT INTO global.$table values (id, vehicle_id, consumption, from, to, version) values (?, ?, ?, ?, ?, ?);")
      .setClusterBuilder(new ClusterBuilder() {
        override def buildCluster(builder: Cluster.Builder): Cluster = {
          builder.addContactPoints(hosts.split(","): _*).build()
        }
      }).build()
  }

  def dataStream[T: TypeInformation](env: StreamExecutionEnvironment,
                                     source: SourceFunction[String],
                                     flinkConfigs: FlinkStreamProcessorConfigs,
                                     windowFunction: WindowFunction[VehicleFuelEvent, (String, Double, SortedMap[Date, Double], Long, Long), String, TimeWindow]): DataStream[(String, Double, SortedMap[Date, Double], Long, Long)] = {
    val fuelEventStream = env.addSource(source)
      .map(e => EventSerialiserRepo.fromJsonStr[VehicleFuelEvent](e))
      .filter(_.isSuccess)
      .map(_.get)
      .assignTimestampsAndWatermarks(VehicleFuelEventTimestampExtractor.withWatermarkDelay(flinkConfigs.windowWatermarkDelay))
      .keyBy(_.vehicleId) // key by VehicleId
      .timeWindow(Time.minutes(flinkConfigs.windowTimeWidth.toMinutes))
      .apply(windowFunction)
    fuelEventStream
  }
}
The stream is triggered when the Play framework creates its dependencies on startup via Google Guice, as below.
@Singleton
class VehicleEventKafkaConsumer @Inject()(conf: Configuration,
                                          lifecycle: ApplicationLifecycle,
                                          repoFactory: StorageFactory,
                                          cache: CacheApi,
                                          cassandra: Cassandra,
                                          fleetConfigs: FleetConfigManager) {

  private val kafkaConfigs = KafkaConsumerConfigs(conf)
  private val flinkConfigs = FlinkStreamProcessorConfigs(conf)
  private val topicsWithClassTags = getClassTagsForTopics
  private val cassandraConfigs = CassandraConfigs(conf)
  private val repoCache = mutable.HashMap.empty[String, CachedSubjectStatePersistor]
  private val props = new Properties()
  // add props

  private val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  env.enableCheckpointing(flinkConfigs.checkpointingInterval.toMillis)

  if (flinkConfigs.enabled) {
    topicsWithClassTags.toList
      .map {
        case (topic, tag) if tag.runtimeClass.isAssignableFrom(classOf[VehicleFuelEvent]) =>
          Logger.info(s"starting for - $topic and $tag")
          val source = new FlinkKafkaConsumer09[String](topic, new SimpleStringSchema(), props)
          val fuelEventStream = VehicleFuelEventStream.dataStream[String](env, source, flinkConfigs, new VehicleFuelEventWindowFunction)
          VehicleFuelEventStream.sink(cassandraConfigs.hosts, flinkConfigs.cassandraTable, fuelEventStream)
        case (topic, _) =>
          Logger.info(s"no stream processor found for topic $topic")
      }
    Logger.info("starting flink stream processors")
    env.execute("flink vehicle event processors")
  } else
    Logger.info("Flink stream processor is disabled!")
}
I get the below errors on application startup.
03/13/2018 05:47:23 TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink(4/4) switched to RUNNING
2018-03-13 05:47:23,262 - [info] o.a.f.r.t.Task - TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink (3/4) (d5124be9bcef94bd0e305c4b4546b055) switched from RUNNING to FAILED.
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load user class: streams.fuelevent.VehicleFuelEventWindowFunction
ClassLoader info: URL ClassLoader:
Class not resolvable through given classloader.
at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:232)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:95)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:231)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
2018-03-13 05:47:23,262 - [info] o.a.f.r.t.Task - TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink (2/4) (6b3de15a4f6
.....
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Could not instantiate outputs in order.
at org.apache.flink.streaming.api.graph.StreamConfig.getOutEdgesInOrder(StreamConfig.java:394)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:103)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:231)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: streams.fuelevent.VehicleFuelEventStream$$anonfun$4
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:128)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
2018-03-13 05:47:23,336 - [info] o.a.f.r.t.Task - Source: Custom Source -> Map -> Filter -> Map -> Timestamps/Watermarks (4/4) (97b2313c985592fdec0ac4f7fba8062f) switched from RUNNING to FAILED.
dependencies
// kafka
libraryDependencies += "org.apache.kafka" %% "kafka" % "1.0.0"
//flink
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.4.2"
libraryDependencies += "org.apache.flink" %% "flink-connector-kafka-0.9" % "1.4.0"
libraryDependencies += "org.apache.flink" %% "flink-connector-cassandra" % "1.4.2"
Any help is appreciated to solve this issue.
It looks like Flink can't find the class streams.fuelevent.VehicleFuelEventStream.
Is that class on the Flink classpath, or inside your jar file?
The Flink docs could shed some light on this: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html
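If the job is submitted to a remote Flink cluster from inside the Play application, one option is to create the environment with the jar that contains your user classes, so they get shipped along with the job. This is only a minimal sketch under that assumption; the host, port and jar path below are placeholders, not values from the question:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Sketch: ship the assembled job jar (which must contain the
// streams.fuelevent.* classes) to the remote JobManager.
val env = StreamExecutionEnvironment.createRemoteEnvironment(
  "jobmanager-host",          // placeholder host
  6123,                       // placeholder JobManager RPC port
  "/path/to/job-assembly.jar" // placeholder path to a fat jar containing your classes
)
Alternatively, building a fat jar (for example with sbt-assembly) and submitting it with the Flink CLI achieves the same effect.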
I am using Scala on Spark 2.1.0 GraphX. I have an array as shown below:
scala> TEMP1Vertex.take(5)
res46: Array[org.apache.spark.graphx.VertexId] = Array(-1895512637, -1745667420, -1448961741, -1352361520, -1286348803)
If I had to filter the edge table for a single value, let's say for source ID -1895512637:
val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == -1895512637}
scala> TEMP1Edge.take(5)
res52: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(-1895512637,-2105158920,89), Edge(-1895512637,-2020727043,3), Edge(-1895512637,-1963423298,449), Edge(-1895512637,-1855207100,214), Edge(-1895512637,-1852287689,339))
scala> TEMP1Edge.count
17/04/03 10:20:31 WARN Executor: 1 block locks were not released by TID = 1436:[rdd_36_2]
res53: Long = 126
But when I pass an array which contains a set of unique source IDs, the code runs successfully but it doesn't return any values as shown below:
scala> val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == TEMP1Vertex}
TEMP1Edge: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[929] at filter at <console>:56
scala> TEMP1Edge.take(5)
17/04/03 10:29:07 WARN Executor: 1 block locks were not released by TID = 1471:
[rdd_36_5]
res60: Array[org.apache.spark.graphx.Edge[Int]] = Array()
scala> TEMP1Edge.count
17/04/03 10:29:10 WARN Executor: 1 block locks were not released by TID = 1477:
[rdd_36_5]
res61: Long = 0
I suppose that TEMP1Vertex is of type Array[VertexId], in which case src == TEMP1Vertex compares a single VertexId against the whole array and never matches, so I think your code should be:
val TEMP1Edge = graph.edges.filter {
  case Edge(src, _, _) => TEMP1Vertex.contains(src)
}
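A side note of my own (not part of the original answer): if TEMP1Vertex is large, broadcasting it as a Set keeps the per-edge membership test cheap. A minimal sketch, assuming the same spark-shell session so sc and the GraphX imports are in scope:
// Broadcast the vertex ids once and do a constant-time Set lookup per edge
val vertexIds = sc.broadcast(TEMP1Vertex.toSet)
val TEMP1Edge = graph.edges.filter { case Edge(src, _, _) => vertexIds.value.contains(src) }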
I have two streams:
Measurements
WhoMeasured (metadata about who took the measurement)
These are the case classes for them:
case class Measurement(var value: Int, var who_measured_id: Int)
case class WhoMeasured(var who_measured_id: Int, var name: String)
The Measurement stream has a lot of data. The WhoMeasured stream has little. In fact, for each who_measured_id in the WhoMeasured stream, only 1 name is relevant, so old elements can be discarded if one with the same who_measured_id arrives. This is essentially a HashTable that gets filled by the WhoMeasured stream.
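Just to illustrate the semantics described above (plain Scala, not Flink code, and only my own restatement of the paragraph): per who_measured_id the latest name wins, and measurements look names up by that id.
// Last-value-wins lookup: WhoMeasured updates it, Measurements read from it
val names = scala.collection.mutable.Map.empty[Int, String]
def onWhoMeasured(w: WhoMeasured): Unit = names(w.who_measured_id) = w.name
def nameFor(whoMeasuredId: Int): Option[String] = names.get(whoMeasuredId)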
In my custom window function
class WFunc extends WindowFunction[Measurement, Long, Int, TimeWindow] {
  override def apply(key: Int, window: TimeWindow, input: Iterable[Measurement], out: Collector[Long]): Unit = {
    // Here I need access to the WhoMeasured stream to get the name of the person who took a measurement
    // The following two are equivalent since I keyed by who_measured_id
    val name_who_measured = magic(key)
    val name_who_measured = magic(input.head.who_measured_id)
  }
}
This is my job. As you might see, there is something missing: the combination of the two streams.
val who_measured_stream = who_measured_source
  .keyBy(w => w.who_measured_id)
  .countWindow(1)

val measurement_stream = measurements_source
  .keyBy(m => m.who_measured_id)
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply(new WFunc)
So in essence this is sort of a lookup table that gets updated when new elements in the WhoMeasured stream arrive.
So the question is: How to achieve such a lookup from one WindowedStream into another?
Follow Up:
After implementing it the way Fabian suggested, the job always fails with a serialization issue:
[info] Loading project definition from /home/jgroeger/Code/MeasurementJob/project
[info] Set current project to MeasurementJob (in build file:/home/jgroeger/Code/MeasurementJob/)
[info] Compiling 8 Scala sources to /home/jgroeger/Code/MeasurementJob/target/scala-2.11/classes...
[info] Running de.company.project.Main dev MeasurementJob
[error] Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.
[error] at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:100)
[error] at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1478)
[error] at org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:161)
[error] at org.apache.flink.streaming.api.datastream.ConnectedStreams.flatMap(ConnectedStreams.java:230)
[error] at org.apache.flink.streaming.api.scala.ConnectedStreams.flatMap(ConnectedStreams.scala:127)
[error] at de.company.project.jobs.MeasurementJob.run(MeasurementJob.scala:139)
[error] at de.company.project.Main$.main(Main.scala:55)
[error] at de.company.project.Main.main(Main.scala)
[error] Caused by: java.io.NotSerializableException: de.company.project.jobs.MeasurementJob
[error] at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
[error] at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[error] at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[error] at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[error] at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[error] at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
[error] at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:301)
[error] at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:81)
[error] ... 7 more
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last MeasurementJob/compile:run for the full output.
[error] (MeasurementJob/compile:run) Nonzero exit code returned from runner: 1
[error] Total time: 9 s, completed Nov 15, 2016 2:28:46 PM
Process finished with exit code 1
The error message:
The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.
However, the only field my JoiningCoFlatMap has is the suggested ValueState.
The signature looks like this:
class JoiningCoFlatMap extends RichCoFlatMapFunction[Measurement, WhoMeasured, (Measurement, String)] {
I think what you want to do is a window operation followed by a join.
You can implement the join of a high-volume stream and a low-volume, update-by-key stream using a stateful CoFlatMapFunction, as in the example below:
val measures: DataStream[Measurement] = ???
val who: DataStream[WhoMeasured] = ???

val agg: DataStream[(Int, Long)] = measures
  .keyBy(_.who_measured_id)
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply( (id: Int, w: TimeWindow, v: Iterable[Measurement], out: Collector[(Int, Long)]) => {
    // do your aggregation
  })

val joined: DataStream[(Int, Long, String)] = agg
  .keyBy(_._1) // who_measured_id
  .connect(who.keyBy(_.who_measured_id))
  .flatMap(new JoiningCoFlatMap)

// CoFlatMapFunction
class JoiningCoFlatMap extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  var names: ValueState[String] = _

  override def open(conf: Configuration): Unit = {
    val stateDescriptor = new ValueStateDescriptor[String](
      "whoMeasuredName",
      classOf[String],
      "" // default value
    )
    names = getRuntimeContext.getState(stateDescriptor)
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    // join with state
    out.collect( (a._1, a._2, names.value()) )
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    // update state
    names.update(w.name)
  }
}
A note on the implementation: A CoFlatMapFunction cannot decide which input to process, i.e., flatMap1 and flatMap2 are called depending on what data arrives at the operator; this cannot be controlled by the function. That is a problem when initializing the state: in the beginning, the state might not have the correct name for an arriving Measurement object and would return the default value instead. You can avoid that by buffering the measurements and joining them once the first update for the key arrives from the who stream. You'll need another state for that, as in the sketch below.
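A minimal sketch of that buffering variant (my own extension of the answer above, not code from the original post). It assumes the same (Int, Long) aggregate type and a Flink version where keyed ListState is available through the RuntimeContext:
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._ // for createTypeInformation
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Buffers measurements per key until the first WhoMeasured update arrives,
// then flushes the buffer; afterwards it joins directly against the name state.
class BufferingJoiningCoFlatMap
    extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  private var name: ValueState[String] = _
  private var pending: ListState[(Int, Long)] = _

  override def open(conf: Configuration): Unit = {
    name = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("whoMeasuredName", classOf[String]))
    pending = getRuntimeContext.getListState(
      new ListStateDescriptor[(Int, Long)]("pendingMeasurements", createTypeInformation[(Int, Long)]))
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    val n = name.value()
    if (n == null) pending.add(a)      // no name yet: park the measurement
    else out.collect((a._1, a._2, n))  // name known: join immediately
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    name.update(w.name)
    val buffered = pending.get()
    if (buffered != null) {
      buffered.asScala.foreach(a => out.collect((a._1, a._2, w.name))) // flush the backlog
      pending.clear()
    }
  }
}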