I wrote a Flink program that calculates the number of events per keyed window from a simple Kafka stream. It works great, fast and accurate. When the source stops, I would like to get 0 as the result of the calculation for each window, but no result is emitted. The function simply does not execute. I assume this is because of Flink's lazy operation behavior.
Any recommendation?
I encountered the same situation. Filling the holes in your database with another process is a solution.
However, I found it easier to union your main stream with a custom periodic source that emits dummies, whose only role is to trigger window creation. When doing this, you have to make sure that the dummies are ignored in computations.
Here is how to code a periodic source (you may not need a RichParallelSourceFunction, however; a SourceFunction can be enough).
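A minimal sketch of such a source, assuming the dummy element is just a Long timestamp (in practice it would have to match, or be mapped to, the type of your main stream before the union):

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    // Sketch: a source that emits a dummy element (here just a Long timestamp)
    // once per second, only so that windows keep firing when the real source
    // is silent. Downstream window functions must filter these dummies out.
    public class PeriodicDummySource implements SourceFunction<Long> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            while (running) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(System.currentTimeMillis());
                }
                Thread.sleep(1000L);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

The dummy stream would then be unioned with the main stream (e.g. mainStream.union(dummyStream)) once both share a common type.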
I see that there are a lot of discussions going on about adding support for watermarks per key. But does Flink support per-partition watermarks?
Currently, the minimum of all the watermarks (of the non-idle partitions) is taken into account. Because of this, the last hanging records in a window are stuck as well (even when the watermark is incremented using periodicEmit).
Any info on this is really appreciated!
Some of the sources, such as the FlinkKafkaConsumer, support per-partition watermarking. You get this by calling assignTimestampsAndWatermarks on the source, rather than on the stream produced by the source.
What this does is that each consumer instance tracks the maximum timestamp within each partition, and takes as its watermark the minimum of these maximums, less the configured bounded out-of-orderness. Idle partitions will be ignored, if you configure it to do so.
Not only does this yield more accurate watermarking, but if your events are in-order within each partition, this also makes it possible to take advantage of the WatermarkStrategy.forMonotonousTimestamps() strategy.
See Watermark Strategies and the Kafka Connector for more details.
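As a rough sketch (the topic name, broker address, and out-of-orderness duration are assumptions for illustration), per-partition watermarking on the consumer can look like this:

    import java.time.Duration;
    import java.util.Properties;

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class PerPartitionWatermarks {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
            props.setProperty("group.id", "example");                 // assumption

            FlinkKafkaConsumer<String> consumer =
                    new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);

            // Calling assignTimestampsAndWatermarks on the consumer itself (not on
            // the stream it produces) enables per-partition watermarking. Event
            // timestamps default to the Kafka record timestamps here.
            consumer.assignTimestampsAndWatermarks(
                    WatermarkStrategy
                            .<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                            .withIdleness(Duration.ofMinutes(1)));

            env.addSource(consumer).print();
            env.execute();
        }
    }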
As for why the last window isn't being triggered, this is related to watermarking, but not to per-partition watermarking. The problem is simply that windows are triggered by watermarks, and the watermarks are trailing behind the timestamps in the events. So the watermarks can never catch up to the final events, and can never trigger the last window.
This isn't a problem for unbounded streaming jobs, since they never stop and never have a last window. And it isn't a problem for batch jobs, since they are aware of all of the data. But for bounded streaming jobs, you need to do something to work around this issue. Broadly speaking, what you must do is to inform Flink that the input stream has ended -- whenever the Flink sources detect that they have reached the end of an event-time-based input stream, they emit one last watermark whose value is MAX_WATERMARK, and this will trigger any open windows.
One way to do this is to use a KafkaDeserializationSchema with an implementation of isEndOfStream that returns true when the job reaches its end.
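A sketch of such a schema, assuming the end of the stream is signalled by a special marker value (how the end is detected depends on your data):

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    // Stops the stream when a special marker record is seen. Once every consumer
    // instance has reached the end, the source finishes and the final
    // MAX_WATERMARK fires any windows that are still open.
    public class BoundedStringSchema implements KafkaDeserializationSchema<String> {

        private static final String END_MARKER = "END_OF_STREAM"; // assumed marker value

        @Override
        public boolean isEndOfStream(String nextElement) {
            return END_MARKER.equals(nextElement);
        }

        @Override
        public String deserialize(ConsumerRecord<byte[], byte[]> record) {
            return new String(record.value(), StandardCharsets.UTF_8);
        }

        @Override
        public TypeInformation<String> getProducedType() {
            return Types.STRING;
        }
    }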
My Flink application generates output (complex) events based on the processing of (simple) input events. The generated output events are to be consumed by other external services. My application works with event-time semantics, so I am a bit in doubt regarding what I should use as the output events' timestamp.
Should I use:
the processing time at the moment of generating them?
the event time (given by the watermark value)?
both? (*)
For my use case, I am using both for now. But maybe you can come up with examples/justifications for each of the given options.
(*) In the case of using both, what naming would you use for the two fields? Something along the lines of event_time and processing_time seems to leak implementation details of my app to the external services...
There is no general answer to your question. It often depends on downstream requirements. Let's look at two simple cases:
A typical data processing pipeline is ingesting some kind of movement event (e.g., sensor data, click on web page, search request) and enriches it with master data (e.g., sensor calibration data, user profiles, geographic information) through joins. Then the resulting event should clearly have the same time as the input event.
A second pipeline is aggregating the events from the first pipeline on a 15 min tumbling window and simply counts them. Then fair options would be to use the start of the window or the time of the first event, the end of the window or the time of the last event, or both of these. Using the start/end of a window means that the resulting signal is always defined. Using the first/last event timestamp is more precise when you actually want to see in the aggregates when things happened. Usually, that also means you probably want a finer window resolution (1 min instead of 15 min). Whether you use the start or the end of a window is often more a matter of taste, and you are usually safer including both.
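For the second case, a sketch of a window function that stamps the count with both the window start and end could look like this (the tuple layout and the String input type are just illustrative assumptions):

    import org.apache.flink.api.java.tuple.Tuple4;
    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    // Emits (key, windowStart, windowEnd, count) per key and window, so the
    // downstream consumer receives a signal that is defined for every interval.
    public class WindowedCount
            extends ProcessWindowFunction<String, Tuple4<String, Long, Long, Long>, String, TimeWindow> {

        @Override
        public void process(String key,
                            Context ctx,
                            Iterable<String> events,
                            Collector<Tuple4<String, Long, Long, Long>> out) {
            long count = 0;
            for (String ignored : events) {
                count++;
            }
            out.collect(Tuple4.of(key, ctx.window().getStart(), ctx.window().getEnd(), count));
        }
    }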
In neither of these cases is processing time relevant at all. In fact, if your input has event time, I'd argue that there is no good reason for processing time. The main reason is that you cannot do meaningful reprocessing with processing time.
You can still add processing time, but for a different reason: to measure the end-to-end latency of a very complex data analytics pipeline including multiple technologies and jobs.
This is a two-question topic about Flink streaming, based on experiments I did myself, and I need some clarification. The questions are:
When we use windows on a KeyedStream in Flink, are the computations of the apply function asynchronous? Specifically, will Flink create separate windows per key and process these windows independently of one another?
Assume that we use the apply function (do some computations) on a windowed stream, which will then create a DataStream. If we do some transformations on the resulting DataStream, will Flink hold the entire WindowedStream in memory? And will Flink wait until all the apply functions of the WindowedStream are finished before moving on to the transformations on the resulting stream?
In all the experiments I used event time and read the data from a file. I have observed the behavior described above in my experiments and need some clarification.
Ad. 1: Yes, each key is processed independently. This is also how window computations are parallelised.
Ad. 2: Flink will keep window state until the window can be emitted (plus some extra time in case of allowedLateness). Once the results for a window are emitted (in your case, forwarded to the next operator), the state can be cleared.
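As an illustration (a sketch, not your exact pipeline; the tuple layout and window sizes are assumptions):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class KeyedWindowExample {

        // Each key's windows are evaluated independently (which is how the work
        // is parallelised); window state is kept until the window fires and the
        // allowedLateness period has passed, and is then cleared.
        public static DataStream<Tuple2<String, Long>> countsPerKey(
                DataStream<Tuple2<String, Long>> events) {
            return events
                    .keyBy(value -> value.f0)
                    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                    .allowedLateness(Time.seconds(30))
                    .sum(1);
        }
    }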
I want to make a "TRAP AGENT" library. The trap agent library keeps track of various parameters of the client system. If a parameter of the client system goes above its threshold, the trap agent library on the client side notifies the server about that parameter. For example, if CPU usage exceeds its threshold, it will notify the server that CPU usage has been exceeded. I have to measure 50-100 parameters (like memory usage, network usage, etc.) on the client side.
Now I have the basic idea about the design, but I am stuck on the overall library design.
I have thought of the solutions below:
I can create a thread for each parameter (i.e. each thread will monitor a single parameter).
I can create a process for each parameter (i.e. each process will monitor a single parameter).
I can classify the parameters into groups, e.g. the data usage parameter falls into a network group and the CPU and memory usage parameters fall into a system group, and then create a thread for each group.
Now the 1st solution looks better than the 2nd. But if I adopt the 1st solution, it may fail when I want to upgrade my library from 100 to 1000 parameters, because I would have to create 1000 threads, which is not good design (I think so; correct me if I am wrong).
The 3rd solution is good, but the response time will be high since many parameters will be monitored in a single thread.
Is there any better approach?
In general, it's a bad idea to spawn threads 1-to-1 for any logical mapping in your code. You can quickly exhaust the available threads of the system.
In .NET this is very elegantly handled using thread pools:
Thread vs ThreadPool
Here is a C++ discussion, but the concept is the same:
Thread pooling in C++11
Processes are also high overhead on Windows. Both designs sound like they would ironically be quite taxing on the very resources you are trying to monitor.
Threads (and processes) give you parallelism where you need it. For example, letting the GUI be responsive while some background task is running. But if you are just monitoring in the background and reporting to a server, why require so much parallelism?
You could just run each check, one after the other, in a tight event loop in one single thread. If you are worried about not sampling the values as often, I'd say that's actually a benefit. It does not help to consume 50% CPU just to monitor your CPU. If you are spot-checking values once every few seconds, that is probably fine resolution.
In fact, high resolution is of no help if you are reporting to a server. You don't want to denial-of-service-attack your server by making an HTTP call to it multiple times a second once some value triggers.
NOTE: this doesn't mean you can't have a pluggable architecture. You could create some base class that represents checking a resource and then create subclasses for each specific type. Your event loop could iterate over an array or list of objects, calling each one successively and aggregating the results. At the end of the loop you report back to the server if any are out of range.
You may want to add logic to stop checking (or at least stop reporting back to the server) for some "cool down period" once a trap hits. You don't want to tax your server or spam your logs.
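A sketch of that design in Java (the class and method names such as ResourceCheck, CpuUsageCheck, and reportToServer are made up for illustration; the original discussion is about .NET/C++, but the structure is the same):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Base class for a pluggable check; subclasses sample one parameter each.
    abstract class ResourceCheck {
        final String name;
        final double threshold;
        Instant lastReport = Instant.EPOCH;

        ResourceCheck(String name, double threshold) {
            this.name = name;
            this.threshold = threshold;
        }

        /** Returns the current value of the monitored parameter. */
        abstract double sample();
    }

    class CpuUsageCheck extends ResourceCheck {
        CpuUsageCheck(double threshold) { super("cpu", threshold); }

        @Override
        double sample() {
            return 0.0; // placeholder: read the actual CPU usage from the OS here
        }
    }

    public class TrapAgent {
        private static final Duration COOL_DOWN = Duration.ofMinutes(5);

        public static void main(String[] args) throws InterruptedException {
            List<ResourceCheck> checks = List.of(new CpuUsageCheck(0.9));

            while (true) {
                for (ResourceCheck check : checks) {
                    double value = check.sample();
                    Instant now = Instant.now();
                    // Cool-down: skip reporting if this check alerted recently.
                    if (value > check.threshold
                            && now.isAfter(check.lastReport.plus(COOL_DOWN))) {
                        reportToServer(check.name, value);
                        check.lastReport = now;
                    }
                }
                Thread.sleep(5_000); // spot-check every few seconds
            }
        }

        private static void reportToServer(String name, double value) {
            // Placeholder for the actual HTTP/RPC call to the server.
            System.out.printf("ALERT %s=%.2f%n", name, value);
        }
    }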
You can follow the methodology below:
1. You can have two threads: one thread dedicated to measuring the emergency parameters, and a second thread monitoring the non-emergency parameters.
Hence the response time for the emergency parameters will be lower.
2. You can define 3 threads. The first thread will monitor the high-priority (emergency) parameters, the second thread will monitor the intermediate-priority parameters, and the last thread will monitor the lowest-priority parameters.
So the overall response time will be improved compared to the first solution.
3. If response time is not a concern, then you can monitor all the parameters in a single thread. But in this case the response time becomes worse when you upgrade your library to monitor 100 to 1000 parameters.
So in the 1st case there will be a longer response time for the non-emergency parameters, while in the 3rd case the response time will definitely be very high.
So solution 2 is better.
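A minimal sketch of solution 2, using one scheduled task per priority group (the intervals and the checkXxxParams methods are hypothetical):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Three scheduled tasks: the emergency group is sampled most often, the
    // lower-priority groups less often, so emergency response time stays low.
    public class PriorityMonitor {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(3);

            scheduler.scheduleAtFixedRate(PriorityMonitor::checkEmergencyParams, 0, 1, TimeUnit.SECONDS);
            scheduler.scheduleAtFixedRate(PriorityMonitor::checkIntermediateParams, 0, 10, TimeUnit.SECONDS);
            scheduler.scheduleAtFixedRate(PriorityMonitor::checkLowPriorityParams, 0, 60, TimeUnit.SECONDS);
        }

        private static void checkEmergencyParams() { /* sample and report high-priority parameters */ }
        private static void checkIntermediateParams() { /* sample and report intermediate parameters */ }
        private static void checkLowPriorityParams() { /* sample and report low-priority parameters */ }
    }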
I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". So a good strategy/pattern to use here is a thread pool, with each of N dedicated threads holding a connection and picking up write requests as they come in and as the thread finishes the previous one. The number of threads in the pool that gives the best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and from which the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data -- maybe by a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)