Flink CEP is not deterministic - apache-flink

I have the following code running locally without a cluster:
val count = new AtomicInteger()
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val text: DataStream[String] = env.readTextFile("file:///flink/data2")
val mapped: DataStream[Map[String, Any]] = text.map((x: String) => Map("user" -> x.split(",")(0), "val" -> x.split(",")(1)))
val pattern: ...
CEP.pattern(mapped, pattern).select(eventMap => {
println("Found: " + (patternName, eventMap))
count.incrementAndGet()
})
env.execute()
println(count)
My data is a CSV file in the following format (user, val):
1,1
1,2
1,3
2,1
2,2
2,3
...
I am trying to detect events of the pattern event(val=1) -> event(val=2) -> event(val=3). When I run this on a large input stream, with a set number of events that I know exist in the stream, I get an inconsistent count of events detected, almost always fewer than the number of events actually in the stream. If I set env.setParallelism(1) (as I have done on line 3 of the code), all events are detected.
I assume the problem is that multiple threads are processing the events from the stream when the parallelism is > 1, which means that while one thread has event(val=1) -> event(val=2), event(val=3) might be sent to a different thread and the whole pattern might not get detected.
Is there something I'm missing here? I cannot lose any patterns in the stream, but setting parallelism to 1 seems to defeat the purpose of having a system like Flink to detect events.
Update:
I have tried keying the stream using:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
Though this prevents events of different users from interfering with each other:
1,1
2,2
1,3
This does not prevent Flink from sending the events to the node out of order, which means that the non-determinism still exists.

Most probably the problem lies in applying the keyBy operator after the map operator.
So, instead of:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
There should be:
val mapped: KeyedStream[Map[String, Any]] = text.keyBy((m) => m.get("user")).map(...)
I know this is an old question, but maybe it helps someone.

Have you thought about keying your stream with the userid (your first value)? Flink guarantees that all events of one key get to the same processing node.
Of course, that only helps if you want to detect a pattern of val=1 -> val=2 -> val=3 per user.
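For illustration, here is a minimal sketch of that per-user setup in the Scala API: key by the user field first, then hand the keyed stream to CEP so the val=1 -> val=2 -> val=3 sequence is matched independently per user. The pattern and field handling below are assumptions standing in for the elided val pattern: ... above, not the asker's actual definitions.
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text: DataStream[String] = env.readTextFile("file:///flink/data2")

// Parse "user,val" lines into tuples instead of Map[String, Any]
val events: DataStream[(String, String)] = text.map { line =>
  val parts = line.split(",")
  (parts(0), parts(1))
}

// Key by user so every event of one user reaches the same parallel instance
val keyed: KeyedStream[(String, String), String] = events.keyBy(_._1)

// val=1 -> val=2 -> val=3, evaluated per key
val pattern = Pattern
  .begin[(String, String)]("one").where(_._2 == "1")
  .next("two").where(_._2 == "2")
  .next("three").where(_._2 == "3")

CEP.pattern(keyed, pattern)
  .select(eventMap => "Found: " + eventMap)
  .print()

env.execute()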

Related

For bounded data, how do I get Flink to "trigger" once flatmap has finished outputting all its data

I've explicitly set "batch mode" in Flink's StreamExecutionEnvironment settings, as I'm working with bounded data.
The bounded data passes through a flatmap, and the flatmap is windowed using GlobalWindows. Since the data is bounded, there is a FINITE (though initially unknown) number of elements that will be output by the Collector.collect() calls in the FlatMap. I'd like to trigger a Reduce() function. However, I can't figure out how to tell Flink the following: once the FlatMap has finished outputting all its elements, proceed with the remainder of the code, e.g. do the reduce. (From the documentation, GlobalWindows always use the NeverTrigger, so I presume I need to specify a trigger explicitly.) (Note: the CountTrigger won't work, I believe, since I don't know a priori the number of elements that the flatmap will output.)
Bonus: Technically, the reduce operation can start as soon as the flatmap starts producing output. I'm not sure exactly how Flink works, but ideally the reduce starts right away and only "completes" after the window closes. (And the window should close, in the case of bounded data, once the flatmap stops outputting data.)
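For reference, setting batch mode as described above could look like the following minimal sketch (assuming the RuntimeExecutionMode API introduced in Flink 1.12; the sketch is in Scala and is not taken from the question's codebase):
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// In BATCH mode the runtime knows the input is bounded and emits the final
// watermark (Long.MaxValue) once each input is exhausted, which is what an
// end-of-input trigger (see the sketch after the skeleton below) can key off.
env.setRuntimeMode(RuntimeExecutionMode.BATCH)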
===
Edit #1:
Per @kkrugler, here's the skeleton code:
sosCleavedFeaturesEtc
.flatMap((Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, List<ImmutableFeatureV2>, Integer> tuple4, Collector<Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, Integer, Integer>> out) -> {
...
IntStream.range(0, numBlocksForClustering + 1)
.forEach(blockIdx -> out.collect(Tuple4.of(rtMapper, unmodifiableLstCleavedFeatures, diaWindowNum, blockIdx)));
})
.flatMap((Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, Integer, Integer> tuple4, Collector<Tuple2<Float2FloatAVLTreeMap, Cluster>> out) -> {
...
setClusters
.stream()
.filter(cluster -> cluster.getClusterSize() >= minFeaturesInCluster)
.forEach(e -> out.collect(Tuple2.of(rtMapper, e)));
})
.map(tuple -> {
...
})
.filter(repFeature -> {
...
})
.windowAll(GlobalWindows.create())
...trigger??...
.aggregate(...});
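As a hedged sketch (in Scala) of what could fill the ...trigger??... slot: a custom trigger that registers an event-time timer at Long.MaxValue and fires once the bounded input has been fully consumed. The class name EndOfInputTrigger is invented for illustration; this is one possible approach, not a confirmed solution.
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

class EndOfInputTrigger[T] extends Trigger[T, GlobalWindow] {
  override def onElement(element: T, timestamp: Long, window: GlobalWindow,
      ctx: Trigger.TriggerContext): TriggerResult = {
    // The final watermark of a bounded job is Long.MaxValue, so a timer at
    // Long.MaxValue only fires after the flatmap has emitted everything.
    ctx.registerEventTimeTimer(Long.MaxValue)
    TriggerResult.CONTINUE
  }

  override def onEventTime(time: Long, window: GlobalWindow,
      ctx: Trigger.TriggerContext): TriggerResult =
    if (time == Long.MaxValue) TriggerResult.FIRE else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: GlobalWindow,
      ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.deleteEventTimeTimer(Long.MaxValue)
}
It would then be plugged in as .windowAll(GlobalWindows.create()).trigger(new EndOfInputTrigger()).aggregate(...), assuming the final watermark is propagated to the window operator (it is in batch mode and for bounded streaming sources).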

How might I implement a map of maps in Flink keyed state that supports fast insert, lookup and iteration of nested maps?

I'd like to write a Flink streaming operator that maintains say 1500-2000 maps per key, with each map containing perhaps 100,000s of elements of ~100B. Most records will trigger inserts and reads, but I’d also like to support occasional fast iteration of entire nested maps.
I've written a KeyedProcessFunction that creates 1500 RocksDB-backed MapStates per key, and tested it by generating a stream of records with a single distinct key, but I find things perform poorly. Just initialising all of them takes on the order of several minutes, and once data begin to flow, async incremental checkpoints frequently fail due to timeouts. Is this a reasonable approach? If not, what alternative(s) should I consider?
Thanks!
Functionally my code is along the lines of:
val stream = env.fromCollection(new Iterator[(Int, String)] with Serializable {
override def hasNext: Boolean = true
override def next(): (Int, String) = {
(1, randomString())
}
})
stream
.keyBy(_._1)
.process(new KFP())
.writeUsingOutputFormat(...)
class KFP extends KeyedProcessFunction[Int, (Int, String), String] {
var states: Array[MapState[Int, String]] = _
override def processElement(
value: (Int, String),
ctx: KeyedProcessFunction[Int, (Int, String), String]#Context,
out: Collector[String]
): Unit = {
if (states(0).isEmpty) {
// insert 0-300,000 random strings <= 100B
}
val state = states(random.nextInt(1500))
// Read from R random keys in state
// Write to W random keys in state
// With probability 0.01 iterate entire contents of state
if (random.nextInt(100) == 0) {
state.iterator().forEachRemaining {
// do something trivial
}
}
}
override def open(parameters: Configuration): Unit = {
states = (0 until 1500).map { stateId =>
getRuntimeContext.getMapState(new MapStateDescriptor[Int, String](stateId.toString, classOf[Int], classOf[String]))
}.toArray
}
}
There's nothing in what you've described that's an obvious explanation for poor performance. You are already doing the most important thing, which is to use MapState<K, V> rather than ValueState<Map<K, V>>. This way each key/value pair in the map is a separate RocksDB object, rather than the entire Map being one RocksDB object that has to go through ser/de for every access/update for any of its entries.
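To make that comparison concrete, here is a small illustrative sketch (the descriptor names are made up):
import org.apache.flink.api.common.state.{MapStateDescriptor, ValueStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation

// MapState: each nested key/value pair is its own RocksDB entry, so a point
// read or write only (de)serializes that single entry.
val perEntry = new MapStateDescriptor[Int, String](
  "nested-map", classOf[Int], classOf[String])

// ValueState[Map[...]]: the whole map is one RocksDB value and goes through
// ser/de on every access or update -- avoid this for large maps.
val wholeMap = new ValueStateDescriptor[Map[Int, String]](
  "whole-map", createTypeInformation[Map[Int, String]])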
To understand the performance better, the next step might be to enable the RocksDB native metrics, and study those for clues. RocksDB is quite tunable, and better performance may be achievable. E.g., you can tune for your expected mix of reads and writes, and if you are trying to access keys that don't exist, then you should enable bloom filters (which are turned off by default).
The RocksDB state backend has to go through ser/de for every state access/update, which is certainly expensive. You should consider whether you can optimize the serializer; some serializers can be 2-5x faster than others. (Some benchmarks.)
Also, you may want to investigate the new spillable heap state backend that is being developed. See https://flink-packages.org/packages/spillable-state-backend-for-flink, https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend, and https://issues.apache.org/jira/browse/FLINK-12692. Early benchmarking suggests this state backend is significantly faster than RocksDB, as it keeps its working state as objects on the heap and spills cold objects to disk. (How much this would help probably depends on how often you have to iterate.)
And if you don't need to spill to disk, the FsStateBackend would be faster still.
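For reference, switching between those backends could look like this sketch (class names follow the Flink 1.x releases this answer is written against; newer releases renamed them to HashMapStateBackend and EmbeddedRocksDBStateBackend; the paths are placeholders):
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Heap-based working state, checkpointed to a filesystem: fast, but all live
// state has to fit in memory.
env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints"))

// Or: RocksDB-backed state that spills to local disk, with incremental
// checkpoints enabled (roughly the setup described in the question).
env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints", true))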

How to execute a collection of statements in Tiberius?

I could not figure out how to iterate over a collection and execute statements one by one with Tiberius.
My current code looks like this (simplified):
use futures::Future;
use futures_state_stream::StateStream;
use tokio::executor::current_thread;
use tiberius::SqlConnection;
fn find_files(files: &mut Vec<String>) {
files.push(String::from("file1.txt"));
files.push(String::from("file2.txt"));
files.push(String::from("file3.txt"));
}
fn main() {
let mut files: Vec<String> = Vec::new();
find_files(&mut files);
let future = SqlConnection::connect(CONN_STR)
.and_then(|conn| {
conn.simple_exec("CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) );")
})
.and_then(|(_, conn)| {
for k in files.iter() {
let sql = format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k);
&conn.simple_exec(sql);
}
Ok(())
});
current_thread::block_on_all(future).unwrap();
}
I got the following error message:
error[E0382]: use of moved value: `conn`
--> src/main.rs:23:18
|
20 | .and_then(|(_, conn)| {
| ---- move occurs because `conn` has type `tiberius::SqlConnection<std::boxed::Box<dyn tiberius::BoxableIo>>`, which does not implement the `Copy` trait
...
23 | &conn.simple_exec(sql);
| ^^^^ value moved here, in previous iteration of loop
I'm new to Rust. I know there is something wrong with my use of the conn variable, but nothing I've tried works.
There are actually two questions here:
The header question: how to perform multiple sequential statements using Tiberius?
The specific question of why an error message comes from a specific bit of code.
I will answer them separately.
Multiple statements
There are many ways to skin a cat. In TDS (the underlying protocol that Tiberius implements) it is possible to execute several statements in a single command; they just need to be delimited by semicolons. In Tiberius, the response from such an execution is represented as a stream of futures, one for each statement.
So if your chain of statements is not too big to fit into one command, just build one string and send it over:
fn main() {
let mut files: Vec<String> = Vec::new();
find_files(&mut files);
let stmts = vec![
String::from(
"CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) )")]
.into_iter()
.chain(files.iter().map(|k|
format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k)))
.collect::<Vec<_>>()
.join(";");
let future
= SqlConnection::connect(std::env::var("CONN_STR").unwrap().as_str())
.and_then(|conn|
conn.simple_exec(stmts)
.into_stream()
.and_then(|future| future)
.for_each(|_| Ok(())));
current_thread::block_on_all(future).unwrap();
}
There is some simple boilerplate in that example.
simple_exec returns an ExecResult, a wrapper around the individual statements' future results. Calling into_stream() on that provides a stream of those futures.
That stream of results needs to be forced to be carried out; one way of doing that is to call and_then, which awaits each future and does something with it.
We don't actually care about the results here, so we just do a no-op for_each.
But say there are a lot of statements, more than can fit in a single TDS command; then there is a need to issue them separately (another case is when the statements themselves depend on earlier ones). A version of that problem is solved in How do I iterate over a Vec of functions returning Futures in Rust?
Then finally, what is your specific error? Well, conn is consumed by simple_exec, so it cannot be used afterwards; that is what the error tells you. If you want to use the connection after that execution is done, you have to use the Future it returns, which wraps the mutated connection. I defer to the link above for one way to do that.

Action to spawn multiple further actions in Gatling scenario

Background
I'm currently working on a capability-analysis set of stress-testing tools, for which I'm using Gatling.
Part of this involves loading up Elasticsearch with scroll queries followed by update API calls.
What I want to achieve
Step 1: Run the scroll initiator and save the _scroll_id where it can be used by further scroll queries
Step 2: Run a scroll query on repeat; as part of each scroll query, make a modification to each hit returned and index it back into Elasticsearch, effectively spawning up to 1000 actions from the one scroll query action and having the results sampled.
Step 1 is easy. Step 2 not so much.
What I've tried
I'm currently trying to achieve this via a ResponseTransformer that parses the JSON-formatted results, makes a modification to each hit, and fires off a thread per hit that attempts another exec(http(...).post(...) etc.) to index the changes back into Elasticsearch.
Basically, I think I'm going about it the wrong way. The indexing threads never get run, let alone sampled by Gatling.
Here's the main body of my scroll query action:
...
val pool = Executors.newFixedThreadPool(parallelism)
val query = exec(http("Scroll Query")
.get(s"/_search/scroll")
.body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
.check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
.transformResponse { case response if response.isReceived =>
new ResponseWrapper(response) {
val responseJson = JSON.parseFull(response.body.string)
// Get the hits and
val hits = responseJson.get.asInstanceOf[Map[String, Any]]("hits").asInstanceOf[Map[String,Any]]("hits").asInstanceOf[List[Map[String, Any]]]
for (hit <- hits) {
val id = hit.get("_id").get.asInstanceOf[String]
val immutableSource = hit.get("_source").get.asInstanceOf[Map[String, Any]]
val source = collection.mutable.Map(immutableSource.toSeq: _*) // Make the map mutable
source("newfield") = "testvalue" // Make a modification
Thread.sleep(pause) // Pause to simulate topology throughput
pool.execute(new DocumentIndexer(index, doctype, id, source)) // Create a new thread that executes the index request
}
}
}) // Make some mods and re-index into elasticsearch
...
DocumentIndexer looks like this:
class DocumentIndexer(index: String, doctype: String, id: String, source: scala.collection.mutable.Map[String, Any]) extends Runnable {
...
val httpConf = http
.baseURL(s"http://$host:$port/${index}/${doctype}/${id}")
.acceptHeader("application/json")
.doNotTrackHeader("1")
.disableWarmUp
override def run() {
val json = new ObjectMapper().writeValueAsString(source)
exec(http(s"Index ${id}")
.post("/_update")
.body(StringBody(json)).asJSON)
}
}
Questions
Is this even possible using gatling?
How can I achieve what I want to achieve?
Thanks for any help/suggestions!
It's possible to achieve this by using jsonPath to extract the JSON hit array and save the elements into the session; then, using a foreach in the action chain and exec-ing the index task inside the loop, you can perform the indexing accordingly.
i.e.:
ScrollQuery
...
val query = exec(http("Scroll Query")
.get(s"/_search/scroll")
.body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
.check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
.check(jsonPath("$.hits.hits[*]").ofType[Map[String,Any]].findAll.saveAs("hitsJson")) // Save a List of hit Maps into the session
)
...
Simulation
...
val scrollQueries = scenario("Enrichment Topologies").exec(ScrollQueryInitiator.query, repeat(numberOfPagesToScrollThrough, "scrollQueryCounter"){
exec(ScrollQuery.query, pause(10 seconds).foreach("${hitsJson}", "hit"){ exec(HitProcessor.query) })
})
...
HitProcessor
...
def getBody(session: Session): String = {
val hit = session("hit").as[Map[String,Any]]
val id = hit("_id").asInstanceOf[String]
val source = mapAsScalaMap(hit("_source").asInstanceOf[java.util.LinkedHashMap[String,Any]])
source.put("newfield", "testvalue")
val sourceJson = new ObjectMapper().writeValueAsString(mapAsJavaMap(source))
val json = s"""{"doc":${sourceJson}}"""
json
}
def getId(session: Session): String = {
val hit = session("hit").as[Map[String,Any]]
val id = URLEncoder.encode(hit("_id").asInstanceOf[String], "UTF-8")
val uri = s"/${index}/${doctype}/${id}/_update"
uri
}
val query = exec(http(s"Index Item")
.post(session => getId(session))
.body(StringBody(session => getBody(session))).asJSON)
...
Disclaimer: this code still needs optimising! And I haven't actually learnt much Scala yet. Feel free to comment with better solutions.
Having done this, what I really want to achieve now is to parallelise a given number of the indexing tasks, i.e. I get 1000 hits back and want to execute an index task for each individual hit, but rather than just iterating over them and doing them one after another, I want to do 10 at a time concurrently.
However, I think this is a separate question, really, so I'll present it as such.

Merging streams when _any_ of the substreams has a value ready

From the Akka Streams documentation, it looks like all stream merging options (merge, mergeSorted, mergePreferred, zipN, zipWithN) work by waiting until all merged streams have a new element ready, then applying the merge strategy (combining the elements into a tuple, applying a zip function, etc.).
This works well for offline processing (e.g. reading the data from files or HTTP and combining it), but it introduces latency in online processing. I need to merge streams of data produced by e.g. multiple Websocket connections, and deliver updates in the merged stream as soon as any of the source streams produces a value. Example: if there are source streams A and B, here's what should be in the merged stream:
Output stream starts with some initial value, e.g. (None, None).
(A:1) (B:<not ready>) -> (Some(1), None)
(A:2) (B:<not ready>) -> (Some(2), None)
(A:3) (B:1) -> (Some(3), Some(1))
(A:3) (B:2) -> (Some(3), Some(2))
etc. Again, a new value appears in the output stream immediately, whenever any of the source streams produces a value.
Is there any combinator to achieve that?
As stated in the comments, Merge and MergePreferred stages do emit elements downstream even if not all upstreams have an element available.
From your example it looks like you are looking for zipping sources though. And yes, Zip emits the zipped tuple downstream only when it has elements to zip from all its upstreams. To overcome this you can 'lift' your sources to produce Options, and make them emit None whenever there is nothing else to emit. The source wrapper can look like this:
def asOption[In, Mat](source: Source[In, Mat]): Source[Option[In], Mat] =
Source.fromGraph(GraphDSL.create(source.map(Option(_))) {
implicit builder: GraphDSL.Builder[Mat] => src =>
import GraphDSL.Implicits._
val noneSource = Source.repeat(None)
val merge = builder.add(MergePreferred[Option[In]](1))
src ~> merge.preferred
noneSource ~> merge.in(0)
SourceShape(merge.out)
})
At this point you can zip your sources as you would normally.
val src1: Source[Int, NotUsed] = ???
val src2: Source[Int, NotUsed] = ???
val zipped = asOption(src1) zip asOption(src2)
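To round the example out, a small hedged demo of driving the zipped stream: the ticking sources below are stand-ins for the real Websocket feeds, and Akka 2.6 is assumed (where an implicit ActorSystem is enough to materialize a stream).
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object MergeAnyDemo extends App {
  implicit val system: ActorSystem = ActorSystem("merge-any-demo")

  // Two ticking counters at different rates stand in for the Websocket sources.
  val fast: Source[Int, NotUsed] =
    Source.tick(0.millis, 100.millis, ()).zipWithIndex.map(_._2.toInt)
      .mapMaterializedValue(_ => NotUsed)
  val slow: Source[Int, NotUsed] =
    Source.tick(0.millis, 250.millis, ()).zipWithIndex.map(_._2.toInt)
      .mapMaterializedValue(_ => NotUsed)

  // A side with nothing ready when the zip pulls shows up as None; a freshly
  // produced element shows up as Some(value).
  asOption(fast).zip(asOption(slow)).runWith(Sink.foreach(println))
}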
