Apache Flink: Window Functions and the beginning of time

In a WindowAssigner, an element gets assigned to one or more TimeWindow instances. In the case of a sliding event time window, this happens in SlidingEventTimeWindows#assignWindows (see the footnote).
For a window with size=5 and slide=1, an element with timestamp=0 gets assigned to the following windows:
Window(start=0, end=5)
Window(start=-1, end=4)
Window(start=-2, end=3)
Window(start=-3, end=2)
Window(start=-4, end=1)
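The assignment above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in plain Python, not Flink code: the function name `assign_sliding_windows` and its parameters are made up for illustration, and it only models the idea of walking backwards from the latest window start one slide at a time.

```python
def assign_sliding_windows(timestamp, size, slide, offset=0):
    """Return the (start, end) windows that contain `timestamp`, newest first."""
    # Start of the latest window that can still contain the timestamp.
    last_start = timestamp - (timestamp - offset) % slide
    windows = []
    start = last_start
    # Step back one slide at a time while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return windows


print(assign_sliding_windows(0, size=5, slide=1))
# [(0, 5), (-1, 4), (-2, 3), (-3, 2), (-4, 1)]
```

Nothing in this arithmetic knows about a beginning of time, which is why four of the five windows start at negative timestamps.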
In one picture:
                            +-> Beginning of time
                            |
                            |
+----------------------------------------------+
| size = 5                  +--+ element       |
| slide = 1                 |                  |
|                           v                  |
| t=[ 0,5[ Window 1         XXXXX              |
| t=[-1,4[ Window 2        XXXXX               |
| t=[-2,3[ Window 3       XXXXX                |
| t=[-3,2[ Window 4      XXXXX                 |
| t=[-4,1[ Window 5     XXXXX                  |
|                                              |
| time(-4 to +4) ----                          |
|                       432101234              |
+---------------------------+------------------+
                            |
                            |
                            |
                            +
Is there a way to tell Flink that there is a beginning of time, and that no windows exist before it? If not, where should I start looking to change that? In the above case, Flink should have only one window (t=[0,5[, Window 1) for the first element. Like this:
                            +-> Beginning of time
                            |
                            |
+-----------------------------------------------+
| size = 5                  +--+ element        |
| slide = 1                 |                   |
|                           v                   |
| t=[ 0,5[ Window 1         XXXXX               |
| t=[ 1,6[ Window 2          XXXXX              |
| t=[ 2,7[ Window 3           XXXXX             |
| t=[ 3,8[ Window 4            XXXXX            |
| t=[ 4,9[ Window 5             XXXXX           |
|                                               |
| time(-4 to +8) ----                           |
|                       4321012345678           |
+---------------------------+-------------------+
                            |
                            |
                            |
                            +
This restriction only matters near the start. Once an element's timestamp is at least size - slide past the beginning of time, it again falls into the full number of windows; in the above case, every such element is inside 5 windows.
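To make the desired behaviour concrete, here is a small sketch in plain Python (again not Flink code; `windows_after_origin` and the `origin` parameter are hypothetical names) that counts how many windows an element would belong to if windows starting before the beginning of time did not exist.

```python
def windows_after_origin(timestamp, size, slide, origin=0):
    """Window starts containing `timestamp`, ignoring windows that begin before `origin`."""
    starts = []
    start = timestamp - (timestamp - origin) % slide
    while start > timestamp - size and start >= origin:
        starts.append(start)
        start -= slide
    return starts


for t in range(7):
    print(t, len(windows_after_origin(t, size=5, slide=1)))
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# 5 5
# 6 5
```

From timestamp 4 onwards the count stays at 5, so the cut-off at the origin no longer changes anything.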
Footnotes:
org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows#assignWindows

At the moment there is no way to specify the valid time interval of a Flink job. Such a setting could also be problematic, given that you might want to run your job on historic data as well.
What you could do, though, is manually filter out the windows that start before the beginning of time:
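import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val startTime = 1 // beginning of time, in event-time milliseconds
val windowLength = 2
val slide = 1

// (value, timestamp) pairs: the second field is used as the event timestamp
val input = env.fromElements((1, 1), (2, 2), (3, 3))
  .assignAscendingTimestamps(x => x._2)

val windowed = input
  .timeWindowAll(Time.milliseconds(windowLength), Time.milliseconds(slide))
  .apply { (window: TimeWindow, iterable: Iterable[(Int, Int)], collector: Collector[Int]) =>
    if (window.getStart >= startTime) {
      collector.collect(iterable.map(_._1).reduce(_ + _))
    } else {
      // discard early windows
    }
  }

windowed.print()
env.execute()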
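The effect of that filter can be checked by hand. The sketch below is a plain Python model of the same idea (it does not use Flink; `sliding_window_sums` is a hypothetical helper): every element is added to all sliding windows that contain it, and windows starting before `start_time` are dropped at the end.

```python
from collections import defaultdict


def sliding_window_sums(elements, size, slide, start_time):
    """elements: iterable of (value, timestamp). Returns {(start, end): sum}."""
    sums = defaultdict(int)
    for value, ts in elements:
        start = ts - ts % slide
        while start > ts - size:
            sums[(start, start + size)] += value
            start -= slide
    # Discard windows that begin before the configured beginning of time.
    return {w: s for w, s in sorted(sums.items()) if w[0] >= start_time}


result = sliding_window_sums([(1, 1), (2, 2), (3, 3)], size=2, slide=1, start_time=1)
print(result)
# {(1, 3): 3, (2, 4): 5, (3, 5): 3}
```

The window [0, 2), which would have contained only the first element, is the one that gets filtered out.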

I might have found a better workaround for this issue.
The idea is to set the first watermark far enough into the future that there will be enough data for your windows. Early windows will still be created, but they will be discarded.
Here is a proof of concept for AssignerWithPeriodicWatermarks[T]:
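import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class WMG[T](wait: Long) extends AssignerWithPeriodicWatermarks[T] {
  var t: Option[Long] = None // last timestamp seen
  var firstTime = true       // true until the first watermark has been emitted

  // Keeps the timestamp that is already attached to the element
  override def extractTimestamp(el: T, prevTs: Long): Long = {
    t = Some(prevTs)
    prevTs
  }

  override def getCurrentWatermark(): Watermark = (t, firstTime) match {
    case (None, _) => null // no element seen yet, so no watermark
    case (Some(v), false) => new Watermark(v)
    case (Some(v), true) =>
      // the first watermark is pushed `wait` milliseconds ahead
      firstTime = false
      new Watermark(v + wait)
  }
}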
`wait` is the size of your first window.
It seems to work correctly, but I don't understand Flink well enough to be sure.
Update: Unfortunately, it doesn't work (and I no longer see why it should): there are always a few keys in the keyed stream with "early windows". So in the end I'm just filtering out the wrong windows with something like:
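val s = (winSize / winStep).intValue
// The state counts the results seen so far for the key; results are dropped until the count reaches s.
// A guard is needed here: a bare `case Some(s)` would shadow s and match every value.
kstream.flatMapWithState((in: StreamOut, state: Option[Int]) =>
  state match {
    case None => (Seq.empty[StreamOut], Some(1))
    case Some(v) if v >= s => (Seq(in), Some(v))
    case Some(v) => (Seq.empty[StreamOut], Some(v + 1))
  })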

Related

Merge multiple tabs in Google Sheets and add a column for where the data came from

I have a spreadsheet which contains multiple tabs with similar layouts. I want to use a formula to merge these into a single tab which has a new column naming the tab it came from.
Example
Tab: Area A
| Item | Status |
|------|-------------|
| Foo | Blocked |
| Bar | In Progress |
Tab: Area B
| Item | Status |
|--------|-----------|
| Foobar | Completed |
Tab: Merged
| Area | Item | Status |
|------|--------|-------------|
| A | Foo | Blocked |
| A | Bar | In Progress |
| B | Foobar | Completed |
Merging without new column
I can merge the data without the additional column, using this formula:
=ARRAYFORMULA(SORT({'Area A'!A2:B; 'Area B'!A2:B}))
Which looks like this:
| Item   | Status      |
|--------|-------------|
| Foo    | Blocked     |
| Bar    | In Progress |
| Foobar | Completed   |
Adding the Area column
What's missing from the above formula is the addition of the area column. This would be possible by cross-referencing the item in every tab using a vlookup and labelling it. But that wouldn't be very efficient and some updates are already slow to re-calculate in this document. I expect this to have approx. 40 tabs with 10,000 rows in total to merge.
Eg:
=IFS(NOT(ISERROR(VLOOKUP(B2,'Area A'!A$2:A,1,FALSE))), "A", NOT(ISERROR(VLOOKUP(B2,'Area B'!A$2:A,1,FALSE))), "B")
Is there a better way to do this?
I'd like something like this, but it doesn't work as the constant I'm adding doesn't match the number of rows it needs to be:
=ARRAYFORMULA(SORT({{"A",'Area A'!A2:B}; {"B", 'Area B'!A2:B}}))

You can borrow an empty column (X in this example) and do:
=ARRAYFORMULA(SORT({{'Area A'!X2:X&"A", 'Area A'!A2:B};
{'Area B'!X2:X&"B", 'Area B'!A2:B}}))
Or you can prepend the label to the first column and then split it:
=ARRAYFORMULA(QUERY(SORT({SPLIT(
{"A♦"&'Area A'!A2:A;
"B♦"&'Area B'!A2:A}, "♦"),
{'Area A'!B2:B;
'Area B'!B2:B}}), "where Col2 is not null", 0))
see: https://stackoverflow.com/a/63496191/5632629
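For readers who want to check the expected result outside of Sheets, the same merge can be sketched in plain Python. This is only an illustration of the logic (label each row with its tab, skip blank rows, sort); `merge_tabs` and the `tabs` dictionary are made-up names, and sorting by area and then item is a choice of this sketch.

```python
def merge_tabs(tabs):
    """tabs: {area label: list of (item, status) rows}. Returns labelled, sorted rows."""
    merged = [
        (area, item, status)
        for area, rows in tabs.items()
        for item, status in rows
        if item  # skip blank rows, similar to "where Col2 is not null"
    ]
    return sorted(merged)


tabs = {
    "A": [("Foo", "Blocked"), ("Bar", "In Progress"), ("", "")],
    "B": [("Foobar", "Completed")],
}
for row in merge_tabs(tabs):
    print(row)
# ('A', 'Bar', 'In Progress')
# ('A', 'Foo', 'Blocked')
# ('B', 'Foobar', 'Completed')
```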

Extracting actions from child component to parent

I've been trying to build a react based application, and I am stumped with how to solve this situation:
Let's say I have a component called "SimpleTable" that provides as the name says, a simple table - the props for it are "headers" (array), "rows" (array) and "paginated" (bool)
If the paginated is false, then we just give a simple long table - if it is true, we chunk the rows to multiple small tables, and provide buttons to toggle prev/next - so far so simple.
Now comes the challenging part - I want to sometimes have the table as just a table in my code, and sometimes I want to wrap it in a "card" element. I have introduced a new prop to the component called "asCard" (bool), which changes the output of the HTML, and changes where the prev/next buttons are placed.
Is there a way to reverse this, so that instead of having "asCard" in my component, I would have a wrapper component that I can put anything in, and it can "extract" actions from the child component, and place them in a different position - this way I could have many different components, and would not have to worry about having "asCard" on each of them.
What I am thinking of is a "Card" component with a function called something like "extractAction", which it passes to the child component. The child component would check for a prop called "handleExtractAction" and, if it is present, pass its action element to it instead of rendering that element in its own output. But I am not sure whether this is overly complicated, or whether there is a more sensible way of doing it.
EDIT:
I'll try and add a visual example of what I am talking about
SimpleTable with pagination:
< >
item 1
------
item 2
------
item 3
------
item 4
------
item 5
------
SimpleTable inside a Card, with basic parent>child setup:
------------------------
| card title |
------------------------
| |
| < > |
| |
| item 1 |
| ------ |
| |
| item 2 |
| ------ |
| |
| item 3 |
| ------ |
| |
| item 4 |
| ------ |
| |
| item 5 |
| ------ |
------------------------
And the result that I would want to have instead, without having to use "asCard" in each custom component I create.
------------------------
| card title < >|
------------------------
| |
| item 1 |
| ------ |
| |
| item 2 |
| ------ |
| |
| item 3 |
| ------ |
| |
| item 4 |
| ------ |
| |
| item 5 |
| ------ |
------------------------

Why 'Lost connection to MySQL server during query' errors occur time-to-time (timeouts raised, pinging before query)

I'm creating an experiment, in which the server side is a libuv based C/C++ application that runs queries on a mysql server (it's on localhost).
In most cases it just works, but sometimes I get 'Lost connection to MySQL server during query' in various places in the process.
I have tried raising all timeouts, but I think they are unrelated: if the server gets bombarded with requests (like every second), the same error gets thrown.
+-------------------------------------+----------+
| Variable_name | Value |
+-------------------------------------+----------+
| connect_timeout | 31536000 |
| deadlock_timeout_long | 50000000 |
| deadlock_timeout_short | 10000 |
| delayed_insert_timeout | 300 |
| innodb_flush_log_at_timeout | 1 |
| innodb_lock_wait_timeout | 50 |
| innodb_print_lock_wait_timeout_info | OFF |
| innodb_rollback_on_timeout | OFF |
| interactive_timeout | 31536000 |
| lock_wait_timeout | 31536000 |
| net_read_timeout | 31536000 |
| net_write_timeout | 31536000 |
| slave_net_timeout | 31536000 |
| thread_pool_idle_timeout | 60 |
| wait_timeout | 31536000 |
+-------------------------------------+----------+
I'm pinging the server before doing queries.
// this->con was set up somewhere before, just like down below in the retry section
char query[] = "SELECT * FROM data WHERE id = 12;";
mysql_ping(this->con);
if (!mysql_query(this->con, query)) {
    // OK
    return 0;
} else {
    // reconnect usually helps
    // here I get the error message
    mysql_close(this->con);
    this->con = mysql_init(NULL);
    if (mysql_real_connect(this->con, this->host.c_str(), this->user.c_str(), this->password.c_str(), this->db.c_str(), 0, NULL, 0) == NULL) {
        // no DB, goodbye
        exit(6);
    }
    // retry
    if (!mysql_query(this->con, query)) {
        return 0;
    } else {
        // DB fail
        return 1;
    }
}
I have tried the reconnect option, the problem is the same.
In my understanding this flow should be possible with single-threaded libuv and mysql:
1. set up db connection
2. run event loop
-> make queries based on IO events, get results
3. event loop ends
4. db close
What am I missing?

How do I trigger a form level event when a user control has the focus?

I have a Windows Form C#/.Net project. There is a form with several user controls on it. Each user control is docked so that it takes up the whole form. When a button on one user control is clicked, that user control is hidden and another one appears. This was done by a coworker and is in production and can't be changed (that is, I absolutely can't get rid of the user controls and can't change them unless really necessary). I want a panel (or maybe another user control) on the form that can be brought up on some event (like hitting a certain key combination like CTRL-Q). However, if I put a KeyDown event on the form, it never gets triggered because one of the user controls always has the focus. I could put my new panel on each user control and have KeyDown events on each of them, but I'm not really supposed to change the existing user controls and I don't really want multiple instances of this panel and multiple events. How can I trigger a form level event when one of the user controls has the focus?
Here is how the form, user control, and panels are laid out. The user controls are actually docked to fill the entire form; I staggered them in this illustration so you could see them all.
----------------------------------------------------------------------
| form1 |
| |
| --------------------------------------------------------------- |
| | userControl1 | |
| | | |
| | | |
| | -------------------------------------------------------- | |
| | | userControl2 | | |
| | | | | |
| | | | | |
| | | -------------------------------------------------- | | |
| | | | userControl3 | | | |
| | | | | | | |
| | | --------------------------------------------------- | | |
| | | | | |
| | -------------------------------------------------------- | |
| | | |
| --------------------------------------------------------------- |
| |
| |
| ------------------------------------------------------------------ |
| | panel or user control on top of everything, visible on demand | |
| ------------------------------------------------------------------ |
| |
----------------------------------------------------------------------
What I want is for this event from form1 to get triggered no matter what user control is active:
private void form1_KeyDown(object sender, KeyEventArgs e)
{
    if (e.Control && e.KeyCode == Keys.Q)
    {
        panel1.Visible = true;
    }
}

If you want to use the KeyDown or KeyUp events on a form, always make sure that the form's KeyPreview property is set to true.
With this property set, the form receives the keystroke first, before the focused control does.
See also this
EDIT
Another way is to stop using the KeyDown/KeyUp events for this and to override ProcessCmdKey on the form, as @Jimi suggested in his comment:
protected override bool ProcessCmdKey(ref Message msg, Keys keyData)
{
    bool Result = true;
    // Parentheses are required: == binds tighter than | in C#
    if (keyData == (Keys.Control | Keys.Q))
    {
        panel1.Visible = true;
    }
    else
    {
        Result = base.ProcessCmdKey(ref msg, keyData);
    }
    return Result;
}

Flink windowing: aggregate and output to sink

We have a stream of data where each element is of this type:
id: String
type: Type
amount: Integer
We want to aggregate this stream and output the sum of amount once per week.
Current solution:
An example Flink pipeline would look like this:
stream.keyBy(type)
    .window(TumblingProcessingTimeWindows.of(Time.days(7)))
    .reduce(sumAmount())
    .addSink(someOutput());
For input
| id | type | amount |
| 1 | CAT | 10 |
| 2 | DOG | 20 |
| 3 | CAT | 5 |
| 4 | DOG | 15 |
| 5 | DOG | 50 |
if the window ends between record 3 and 4 our output would be:
| TYPE | sumAmount |
| CAT | 15 | (id 1 and id 3 added together)
| DOG | 20 | (only id 2 has been 'summed')
Id 4 and 5 would still be inside the flink pipeline and will be outputted next week.
Thus next week our total output would be:
| TYPE | sumAmount |
| CAT | 15 | (of last week)
| DOG | 20 | (of last week)
| DOG | 65 | (id 4 and id 5 added together)
New requirement:
We now also want to know in which week each record was processed. In other words, our new output should be:
| TYPE | sumAmount | weekNumber |
| CAT | 15 | 1 |
| DOG | 20 | 1 |
| DOG | 65 | 2 |
but we also want an additional output like this:
| id | weekNumber |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
How to handle this?
Does Flink have any way to achieve this? I would imagine an aggregate function that sums the amounts but also outputs each record with the current week number, for example, but I can't find a way to do this in the docs.
(Note: we process about 100 million records a week, so ideally we would only keep the aggregates in Flink's state during the week, not all individual records.)
EDIT:
I went for the solution described by Anton below:
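// getSideOutput is defined on SingleOutputStreamOperator, not on a plain DataStream
SingleOutputStreamOperator<Element> elements =
    stream.keyBy(type)
        .process(myKeyedProcessFunction());
elements.addSink(outputElements());
elements.getSideOutput(outputTag)
    .addSink(outputAggregates());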
And the KeyedProcessFunction looks something like:
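class MyKeyedProcessFunction extends KeyedProcessFunction<Type, Element, Element> {
    // Both states still need to be initialised in open() via getRuntimeContext().getState(...)
    private ValueState<ZonedDateTime> state; // start of the current aggregation period
    private ValueState<Integer> sum;         // running sum for the current key

    @Override
    public void processElement(Element e, Context c, Collector<Element> out) throws Exception {
        if (state.value() == null) {
            // first element of a new period for this key
            state.update(ZonedDateTime.now());
            sum.update(0);
            long nowPlus7Days = c.timerService().currentProcessingTime() + 7L * 24 * 60 * 60 * 1000;
            c.timerService().registerProcessingTimeTimer(nowPlus7Days);
        }
        e.addAggregationId(state.value());
        sum.update(sum.value() + e.getAmount());
        out.collect(e); // main output: the element tagged with its aggregation period
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext c, Collector<Element> out) throws Exception {
        c.output(outputTag, sum.value()); // side output: the aggregate for the finished period
        state.update(null);               // the next element starts a new period
    }
}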

There's a variant of the reduce method that takes a ProcessWindowFunction as a second argument. You would use it like this:
stream.keyBy(type)
    .window(TumblingProcessingTimeWindows.of(Time.days(7)))
    .reduce(sumAmount(), new WrapWithWeek())
    .addSink(someOutput());

private static class WrapWithWeek
        extends ProcessWindowFunction<Event, Tuple3<Type, Long, Long>, Type, TimeWindow> {
    @Override
    public void process(Type key,
                        Context context,
                        Iterable<Event> reducedEvents,
                        Collector<Tuple3<Type, Long, Long>> out) {
        // The iterable holds exactly one element: the pre-aggregated result of sumAmount().
        // getAmount() is assumed to exist on Event, as in the question's Element type.
        long sum = reducedEvents.iterator().next().getAmount();
        out.collect(new Tuple3<Type, Long, Long>(key, context.window().getStart(), sum));
    }
}
Normally a ProcessWindowFunction is passed an Iterable holding all of the events collected by the window, but if you are using a reduce or aggregate function to pre-aggregate the window result, then only that single value is passed into the Iterable. The documentation for this is here but the example in the docs currently has a small bug which I've fixed in my example here.
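The idea of pre-aggregating and then wrapping the single result with window metadata can be modelled without Flink. The sketch below is plain Python with hypothetical names (`tumbling_reduce`, timestamps counted in days, a window of 7 days); it keeps one running value per key and window, which is the reason only one element reaches the wrapping step.

```python
def tumbling_reduce(events, window_size, reduce_fn):
    """events: iterable of (key, timestamp, amount).

    Keeps a single running value per (key, window) and tags it with the window start.
    """
    state = {}
    for key, ts, amount in events:
        window_start = ts - ts % window_size
        slot = (key, window_start)
        # Incremental aggregation: only the reduced value is stored, never the raw events.
        state[slot] = amount if slot not in state else reduce_fn(state[slot], amount)
    return [(key, start, total) for (key, start), total in state.items()]


events = [("CAT", 1, 10), ("DOG", 2, 20), ("CAT", 3, 5), ("DOG", 8, 15), ("DOG", 9, 50)]
print(tumbling_reduce(events, window_size=7, reduce_fn=lambda a, b: a + b))
# [('CAT', 0, 15), ('DOG', 0, 20), ('DOG', 7, 65)]
```

The window start plays the role of the week identifier here, matching the sums from the question's example.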
But given the new requirement for the second output, I suggest you abandon the idea of doing this with Windows, and instead use a keyed ProcessFunction. You'll need two pieces of per-key ValueState: one that's counting up by weeks, and another to store the sum. You'll need a timer that fires once a week: when it fires, it should emit the type, sum, and week number, and then increment the week number. Meanwhile the process element method will simply output the ID of each incoming event along with the value of the week counter.
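As a rough illustration of that design, here is a plain Python model (not Flink code). The class `WeeklyAggregator` and its method names are invented for this sketch, the weekly timer is simulated by calling `on_timer()` by hand, and it simplifies by using one shared week counter instead of per-key state and by emitting aggregates only for keys that saw records in that week.

```python
class WeeklyAggregator:
    """Models a keyed process function with a week counter and per-key sums."""

    def __init__(self):
        self.week = 1
        self.sums = {}  # per-key running sum for the current week

    def process_element(self, record_id, key, amount):
        # Only the aggregate is kept in state; the record is tagged and passed on.
        self.sums[key] = self.sums.get(key, 0) + amount
        return (record_id, self.week)

    def on_timer(self):
        # Emit (type, sum, week) for every key seen this week, then start a new week.
        out = [(key, total, self.week) for key, total in self.sums.items()]
        self.sums.clear()
        self.week += 1
        return out


agg = WeeklyAggregator()
records = [agg.process_element(1, "CAT", 10),
           agg.process_element(2, "DOG", 20),
           agg.process_element(3, "CAT", 5)]
aggregates = agg.on_timer()  # end of week 1
records += [agg.process_element(4, "DOG", 15),
            agg.process_element(5, "DOG", 50)]
aggregates += agg.on_timer()  # end of week 2

print(records)
# [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
print(aggregates)
# [('CAT', 15, 1), ('DOG', 20, 1), ('DOG', 65, 2)]
```

Both outputs match the tables from the question, and the state holds only one number per key during the week.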
