Flink app-level barriers

I'm trying to figure out the proper way to build a kind of barrier when merging multiple streams in Flink.
Let's say I have 4 keyed streams, each calculating some aggregated statistics over batches of data. Next, I want to combine the results of these 4 streams into one stream (Y) and perform some additional computation on the 4 received summaries.
The problem is how to make node Y wait until it has received all the summaries with X=N before going forward with X=N+1.
In the picture, node 3 sent its summary for X=N later than node 4 sent its summary for X=N+1,
so node Y must wait until it has received the node 3 summary, while somehow caching the summaries with X=N+1 from the other nodes.
I couldn't find anything similar in the documentation, so I'd really appreciate any hints.

I figured out this task can be solved by simply doing the following:
.keyBy(X)
.countWindow(4)
.fold(...)
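
To make the idea behind keying by X and counting 4 elements more concrete, here is a minimal plain-Java sketch of the buffering logic. It is not Flink code: the class SummaryBarrier, its offer method, and the String summaries are hypothetical names chosen for illustration. Like a count-based window, it counts elements per X rather than checking which node sent them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration (not Flink API) of "group by X, release after N summaries". */
final class SummaryBarrier {
    private final int expected; // summaries required per X (4 in the question)
    private final Map<Long, List<String>> pending = new HashMap<>();

    SummaryBarrier(int expected) {
        this.expected = expected;
    }

    /** Buffers one summary; returns the full group for x once it is complete, otherwise null. */
    List<String> offer(long x, String summary) {
        List<String> group = pending.computeIfAbsent(x, key -> new ArrayList<>());
        group.add(summary);
        if (group.size() < expected) {
            return null; // still waiting for summaries from other nodes
        }
        pending.remove(x); // release the buffered state for this x
        return group;
    }

    public static void main(String[] args) {
        SummaryBarrier barrier = new SummaryBarrier(4);
        long n = 7; // arbitrary batch number
        barrier.offer(n, "node1");
        barrier.offer(n, "node2");
        barrier.offer(n, "node4");
        // Node 4 is already ahead while node 3 has not reported batch n yet.
        System.out.println(barrier.offer(n + 1, "node4")); // batch n+1 is only buffered
        System.out.println(barrier.offer(n, "node3"));     // batch n is now complete
    }
}
```

Note that groups are released as soon as they are complete, so a later X could in principle be emitted before an earlier one if its summaries happen to arrive first.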

Related

Does Flink's windowing operation process elements at the end of window or does it do a rolling processing?

I am having some trouble understanding how windowing is implemented internally in Flink and could not find any article that explains this in depth. In my mind, there are two ways this can be done. Consider a simple window word count as below:
env.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.groupBy(0)
.window(Time.of(500, TimeUnit.SECONDS)).sum(1)
Method 1: Store all events for 500 seconds and at the end of the window, process all of them by applying the sum operation on the stored events.
Method 2: We use a counter to store a rolling sum for every window. As each event in a window comes, we do not store the individual events but keep adding 1 to previously stored counter and output the result at the end of the window.
Could someone kindly help me understand which of the above methods (or maybe a different approach) Flink actually uses? There are pros and cons to both approaches, and it is important to understand them in order to configure the resources for the cluster correctly.
E.g., Method 1 seems very close to batch processing and might have issues such as a spike in processing at every 500-second interval while sitting idle otherwise, while Method 2 would need to maintain a common counter between all task managers.

sum is a reducing function, as mentioned here (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#reducefunction). Internally, Flink applies the reduce function to each input element as it arrives and saves only the reduced result in a ReducingState.
For other window functions, such as windows.apply(WindowFunction), there is no incremental aggregation, so all input elements are saved in a ListState.
This document (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#window-functions) on window functions describes how the elements are handled internally in Flink.
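
The difference between the two kinds of window state can be illustrated with a small plain-Java sketch. It does not use Flink classes: ReducingWindow and BufferingWindow are hypothetical stand-ins for a window backed by reducing state and one backed by list state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Plain-Java illustration (not Flink API) of incremental vs. buffered window state. */
final class WindowStateSketch {

    /** Keeps a single accumulated value, updated on every element (Method 2 in the question). */
    static final class ReducingWindow {
        private final BinaryOperator<Integer> reducer;
        private Integer accumulated; // null until the first element arrives

        ReducingWindow(BinaryOperator<Integer> reducer) {
            this.reducer = reducer;
        }

        void add(int value) {
            if (accumulated == null) {
                accumulated = value;
            } else {
                accumulated = reducer.apply(accumulated, value);
            }
        }

        int storedElements() {
            return accumulated == null ? 0 : 1;
        }

        Integer fire() {
            return accumulated; // the result is already computed when the window ends
        }
    }

    /** Keeps every element and computes the result only when the window fires (Method 1). */
    static final class BufferingWindow {
        private final List<Integer> buffer = new ArrayList<>();

        void add(int value) {
            buffer.add(value);
        }

        int storedElements() {
            return buffer.size();
        }

        int fire() {
            int sum = 0;
            for (int v : buffer) {
                sum += v;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        ReducingWindow reducing = new ReducingWindow(Integer::sum);
        BufferingWindow buffering = new BufferingWindow();
        for (int i = 1; i <= 5; i++) {
            reducing.add(i);
            buffering.add(i);
        }
        System.out.println(reducing.fire() + " from " + reducing.storedElements() + " stored value(s)");
        System.out.println(buffering.fire() + " from " + buffering.storedElements() + " stored value(s)");
    }
}
```

Both variants produce the same sum; they differ in how much state is held until the window fires and in when the work is done.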

Too big tsfile / datastore in Apache IoTDB database (version 0.11.2)

We are using Apache IoTDB Server version 0.11.2 and observe a data directory / TsFiles that are far bigger than they should be (about 130 sensors with 4 million double values per sensor, but the files take about 200 GB).
Are there known issues, or do you have any ideas what could cause this or how to track it down?
The only thing we could think of is some merge artifacts, as we write many data points out of order, so merging has to happen frequently.
Are there any ideas or tools for debugging / inspecting the TsFiles to get an idea of what's happening here?
Any help or hints are appreciated!

This may be due to the compaction strategy.
You could fix this in two ways (you don't need both):
(1) upgrade to version 0.12.2
(2) set the following in iotdb-engine.properties: force_full_merge=true
The reason is:
Unsequenced data compaction in version 0.11.2 has two strategies.
E.g.,
Chunks in a sequence TsFile: [1,3], [4,5]
Chunks in an unsequence TsFile: [2]
(I use [1,3] to indicate the timestamps of the 2 data points in a Chunk)
(1) When using full merge (rewriting all data), we get a tidy sequence file: [1,2,3],[4,5]
(2) However, to speed up compaction, append merge is used by default, and then we get a sequence TsFile: [1,3],[4,5],[1,2,3]. In this TsFile, [1,3] has no metadata at the end of the file; it is garbage data.
So, if you have lots of out-of-order data merged frequently, this will happen (you get a very big TsFile).
The big TsFile will be tidied up after a new compaction.
You could also use TsFileSketchTool or example/tsfile/TsFileSequenceRead to see the structure of the TsFile.
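
The arithmetic of the example above can be reproduced with a toy model in plain Java. This is only a sketch of the reasoning, not IoTDB code: MergeSketch, findChunk, and storedPoints are hypothetical names, chunks are modeled as sets of timestamps, and "file size" is simply the number of points physically stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model (not IoTDB code) of full merge vs. append merge for one out-of-order point. */
final class MergeSketch {

    /** Index of the first chunk whose [first, last] time range contains t, or -1 if none does. */
    static int findChunk(List<TreeSet<Long>> chunks, long t) {
        for (int i = 0; i < chunks.size(); i++) {
            TreeSet<Long> chunk = chunks.get(i);
            if (!chunk.isEmpty() && chunk.first() <= t && t <= chunk.last()) {
                return i;
            }
        }
        return -1;
    }

    /** Number of points physically stored in the file after merging one late point. */
    static int storedPoints(List<TreeSet<Long>> sequenceChunks, long lateTimestamp, boolean fullMerge) {
        List<TreeSet<Long>> file = new ArrayList<>();
        for (TreeSet<Long> chunk : sequenceChunks) {
            file.add(new TreeSet<>(chunk)); // copy so the input is not modified
        }
        int index = findChunk(file, lateTimestamp);
        if (index < 0) {
            throw new IllegalArgumentException("no overlapping chunk in this toy model");
        }
        TreeSet<Long> merged = new TreeSet<>(file.get(index));
        merged.add(lateTimestamp);
        if (fullMerge) {
            file.set(index, merged); // the old chunk is rewritten in place
        } else {
            file.add(merged);        // the old chunk stays behind as garbage
        }
        int total = 0;
        for (TreeSet<Long> chunk : file) {
            total += chunk.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<TreeSet<Long>> chunks = new ArrayList<>();
        chunks.add(new TreeSet<>(Arrays.asList(1L, 3L)));
        chunks.add(new TreeSet<>(Arrays.asList(4L, 5L)));
        System.out.println("full merge:   " + storedPoints(chunks, 2L, true));
        System.out.println("append merge: " + storedPoints(chunks, 2L, false));
    }
}
```

In this model the full merge stores 5 points and the append merge stores 7, of which only 5 are live; repeating this for many out-of-order writes is what makes the file grow.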

Starvation of one of 2 streams in ConnectedStreams

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function that receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(updatedTileMapVersion);
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation. Even though there are a elements ready to be processed, they are not processed in processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by calling Thread.sleep() after every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(updatedTileMapVersion);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
processConnectedStreams is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely and high parallelism is overkill for the stream A elements.
Is it possible to solve this without setting the parallelism to more than the number of b elements we send every second?

It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.

Using JGraphT to Manage Ordering of Dependent Tasks

I have a list of tasks that have dependencies between them and I was considering how I could use JGraphT to manage the ordering of the tasks. I would set up the graph as a directed graph and remove vertices as I processed them (or should I mask them?). I could use TopologicalOrderIterator if I were only going to execute one task at a time, but I'm hoping to parallelize the tasks. I could get a TopologicalOrderIterator and check Graphs.vertexHasPredecessors until I find as many as I want to execute at once, but ideally there would be something like Graphs.getVerticesWithNoPredecessors. I see that Netflix provides a utility to get leaf vertices, so I could reverse the graph and use that, but it's probably not worth it. Can anyone point me to a better way? Thanks!

A topological order may not necessarily be what you want. Here's an example of why not. Take the topological ordering of tasks [1,2,3,4] and the arcs (1,3), (2,4). That is, task 1 needs to be completed before task 3, and similarly task 2 before task 4. Let's also assume that task 1 takes a really long time to complete. We can start processing tasks 1 and 2 in parallel, but we cannot start 3 before 1 completes. Even though task 2 completes, we cannot start task 4, because task 3 is the next task in our ordering and it is blocked by 1.
Here's what you could do. Create an array dep[] which tracks the number of unfulfilled dependencies per task. dep[i]==0 means that all dependencies for task i have been fulfilled, so we can now perform task i. If dep[i]>0, we cannot perform task i yet. Let's assume that there is a task j which needs to be performed prior to task i. As soon as we complete task j, we can decrement the number of unfulfilled dependencies of task i, i.e. dep[i]=dep[i]-1. Again, if dep[i]==0, we are now ready to process task i.
So in short, the algorithm in pseudocode would look like this:
Initialize the dep[] array.
Start processing in parallel all tasks i with dep[i]==0.
If a task i completes, decrement dep[j] for all tasks j which depend on i. If task j has dep[j]==0, start processing it.
You could certainly use a directed graph to model the dependencies. Each time you complete a task, you could simply iterate over the outgoing neighbors (in JGraphT, use the successorsOf(vertex) function). The DAG can also be used to check feasibility: if the graph contains a cycle, you have a problem in your dependencies. However, if you don't need this heavy machinery, I would simply create a 2-dimensional array where for each task i you store the tasks that are dependent on i.
The resulting algorithm runs in O(n+m) time, where n is the number of tasks and m the number of arcs (dependencies). So this is very efficient.
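
The counting scheme described above could be sketched in plain Java as follows. This is one possible shape under stated assumptions, not a definitive implementation: DependencyTracker and its methods are hypothetical names, tasks are identified by strings, maps replace the dep[] array, and the class is not thread-safe, so calls to complete from worker threads would need external synchronization.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Tracks unfulfilled dependency counts and reports which tasks become runnable. */
final class DependencyTracker {
    // For each task, the tasks that depend on it (outgoing arcs).
    private final Map<String, List<String>> dependents = new LinkedHashMap<>();
    // For each task, the number of unfulfilled dependencies (the dep[] array).
    private final Map<String, Integer> dep = new LinkedHashMap<>();

    void addTask(String task) {
        dep.putIfAbsent(task, 0);
        dependents.putIfAbsent(task, new ArrayList<>());
    }

    /** Records that `before` must complete before `after` may start. */
    void addDependency(String before, String after) {
        addTask(before);
        addTask(after);
        dependents.get(before).add(after);
        dep.merge(after, 1, Integer::sum);
    }

    /** Tasks that have no dependencies at all; these can start immediately. */
    List<String> initiallyReady() {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : dep.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        return ready;
    }

    /** Marks `task` as completed and returns the tasks that became ready as a result. */
    List<String> complete(String task) {
        List<String> newlyReady = new ArrayList<>();
        for (String next : dependents.get(task)) {
            if (dep.merge(next, -1, Integer::sum) == 0) {
                newlyReady.add(next);
            }
        }
        return newlyReady;
    }

    public static void main(String[] args) {
        DependencyTracker tracker = new DependencyTracker();
        tracker.addDependency("1", "3");
        tracker.addDependency("2", "4");
        System.out.println(tracker.initiallyReady()); // tasks 1 and 2 can run in parallel
        System.out.println(tracker.complete("2"));    // task 4 is unblocked without waiting for task 1
        System.out.println(tracker.complete("1"));    // task 3 is unblocked
    }
}
```

In a parallel setting, each list returned by complete would be handed to an executor, so a task starts as soon as its own dependencies are done rather than when its turn in a fixed ordering comes up.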

MPI - sending and receiving rows of matrix

I am writing a program in C with MPI that will run on 2-4 processes. During elimination, I send each row below the current row to a different process using cyclic mapping.
E.g., for a 5 * 6 matrix with active row 0 and 4 processes, I send row 1 to process 1, row 2 to process 2, row 3 to process 3, and row 4 to process 1 again. In these processes I want to do some computation and return some values to process 0, which will add them into its original matrix. My question is: which Send and Recv calls should I use?
Here is some theoretical code:
if (master) {
    for (...) {
        // sending values to other processes
    }
    for (...) {
        // receiving values from other processes
    }
} else { // for other non-master processes
    // receive value from master
    // do some computing
    // send value back to master
}
If I use only a simple blocking send, I don't know what will happen on process 1, because it gets rows 1 and 4: before I call recv in the master, my value for row 1 in process 1 will be overwritten by row 4, so I will lose one set of data, am I right? My question is therefore: what kind of MPI_Send should I use?

You might be able to make use of nonblocking if, say, proc 1 can start doing something with row 1 while waiting for row 4. But blocking should be fine to start with, too.
There is a lot of synchronization built into the algorithm. Everyone has to work based on the current row. So the receiving processes will need to know how much work to expect for each iteration of this procedure. That's easy to do if they know the total number of rows and which iteration they're currently working on. So if they're expecting two rows, they could do two blocking recvs, or launch 2 nonblocking recvs, wait on one, and start processing the first row right away. But it's probably easiest to get blocking working first.
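
The bookkeeping this requires (how many rows a given worker should expect in one iteration) is plain arithmetic, sketched here in Java rather than as MPI code. The names CyclicMapping, ownerOf, and expectedRows are hypothetical, and the mapping is an assumption based on the example in the question: rank 0 is the master, and the rows below the active row are dealt out to ranks 1, 2, ... in turn, starting again at rank 1 for every active row.

```java
/** Arithmetic sketch (not MPI code) of the cyclic row-to-worker mapping from the question. */
final class CyclicMapping {

    /** Worker rank (1..processes-1) that receives `row` while `activeRow` is being eliminated. */
    static int ownerOf(int row, int activeRow, int processes) {
        int workers = processes - 1; // rank 0 is the master and receives no rows
        return (row - activeRow - 1) % workers + 1;
    }

    /** Number of rows the worker `rank` should expect to receive in this iteration. */
    static int expectedRows(int rank, int activeRow, int totalRows, int processes) {
        int count = 0;
        for (int row = activeRow + 1; row < totalRows; row++) {
            if (ownerOf(row, activeRow, processes) == rank) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int totalRows = 5;
        int processes = 4;
        int activeRow = 0;
        for (int rank = 1; rank < processes; rank++) {
            System.out.println("rank " + rank + " expects "
                    + expectedRows(rank, activeRow, totalRows, processes) + " row(s)");
        }
    }
}
```

With the numbers from the question (5 rows, 4 processes, active row 0), rank 1 expects two rows and ranks 2 and 3 expect one each, which tells each worker how many receives to post.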
You may find it advantageous even at the start to have the master process do isends, so that all the sends can be "launched" simultaneously; then a waitall can process them in whatever order.
But better than this one-to-many communication would probably be to use a scatterv, which one could expect to proceed more efficiently.
Something that has been raised in many answers and comments to this series of questions, but which I don't think you've ever actually addressed: I really hope that you're just working through this project for educational purposes. It would be completely, totally crazy to actually implement your own parallel linear algebra solvers when there are tuned and tested tools like ScaLAPACK, PETSc, and PLAPACK out there, free for the download.