Flink SlidingEventTimeWindows doesn't work as expected - apache-flink

I have a stream execution configured as
object FlinkSlidingEventTimeExample extends App {

  case class Trx(timestamp: Long, id: String, trx: String, count: Int)

  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()

  val watermarkS1 = WatermarkStrategy
    .forBoundedOutOfOrderness[Trx](Duration.ofSeconds(15))
    .withTimestampAssigner(new SerializableTimestampAssigner[Trx] {
      override def extractTimestamp(element: Trx, recordTimestamp: Long): Long = element.timestamp
    })

  val s1 = env.socketTextStream("localhost", 9999)
    .flatMap(l => l.split(" "))
    .map(l => Trx(timestamp = l.split(",")(0).toLong, id = l.split(",")(1), trx = l.split(",")(2), count = 1))
    .assignTimestampsAndWatermarks(watermarkS1)
    .keyBy(l => l.id)
    .window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5))) // Not working
    //.window(SlidingProcessingTimeWindows.of(Time.seconds(20), Time.seconds(5))) // Working
    .sum("count")
    .print

  env.execute("FlinkSlidingEventTimeExample")
}
I have already defined a watermark, but I couldn't figure out why it is not producing anything. Does anyone have any ideas? My Flink version is 1.14.0.
My build.sbt is like below:
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.14.0"
libraryDependencies += "org.apache.flink" %% "flink-runtime-web" % "1.14.0"
libraryDependencies += "org.apache.flink" %% "flink-clients" % "1.14.0"
libraryDependencies += "org.apache.flink" % "flink-queryable-state-runtime" % "1.14.0"
I am entering input data from the socket (port 9999) like below:
1640375790000,1,trx1
1640375815000,1,trx2
1640375841000,1,trx3
1640375741000,1,trx4
I tried giving timestamps larger than the window size, but it's still not working.
Flink Web UI screenshots ("web-ui", "watermarks") not reproduced here.

Earlier answer deleted; it was based on faulty assumptions about the setup.
When event-time windows fail to produce results, it's always something to do with watermarking.
The timestamps in your input correspond to
December 24, 2021 19:56:30
December 24, 2021 19:56:55
December 24, 2021 19:57:21
December 24, 2021 19:55:41
so there's more than enough data to trigger the closure of several sliding windows. For example, trx2's timestamp is large enough to generate a watermark that will close these windows containing 19:56:30:
19:56:15 - 19:56:34.999
19:56:20 - 19:56:39.999
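To make the window arithmetic concrete, here is a minimal Scala sketch. It is not the Flink API, just the same window-start math, assuming the 20-second size and 5-second slide from the question and no window offset; it lists the sliding windows a given event timestamp falls into:
def slidingWindowsFor(ts: Long, size: Long = 20000L, slide: Long = 5000L): Seq[(Long, Long)] =
  Iterator.iterate(ts - Math.floorMod(ts, slide))(_ - slide) // latest window start containing ts, then step back one slide at a time
    .takeWhile(_ > ts - size)                                // a window [start, start + size) contains ts only while start > ts - size
    .map(start => (start, start + size))
    .toSeq

// trx1 (1640375790000 = 19:56:30) falls into four 20-second windows; trx2's watermark
// (19:56:55 - 15s - 1ms = 19:56:39.999) is enough to fire the two that end at 19:56:34.999 and 19:56:39.999.
slidingWindowsFor(1640375790000L).foreach { case (start, end) =>
  println(s"$start - ${end - 1}")
}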
However, there is a rebalance in your execution graph between the socket source and the task that follows it (the one doing flatmap -> map -> watermarks). Each of your four events goes to a different parallel instance of the watermark strategy, and some instances aren't receiving any events at all. That's why no watermarks are being generated.
What you want to do instead is to chain the input parsing and watermark generation to the source at the same parallelism, so that everything from the source through the watermarking runs as one chained task.
This code will do that:
env
  .socketTextStream("localhost", 9999)
  .map(l => {
    val input = l.split(",")
    Trx(timestamp = input(0).toLong, id = input(1), trx = input(2), count = 1)
  })
  .setParallelism(1)
  .assignTimestampsAndWatermarks(watermarkS1)
  .setParallelism(1)
  .keyBy(l => l.id)
  .window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
  .sum("count")
  .print
In general it's not necessary to do watermarking at a parallelism of one, but it is necessary that every instance of the watermark generator either has enough events to work with, or is configured with withIdleness. (And if every instance is idle then you won't get any results either.)
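For reference, here is a sketch of what the strategy from the question could look like with withIdleness added (the one-minute timeout is an arbitrary value chosen for illustration, not something from the original code):
val watermarkS1WithIdleness = WatermarkStrategy
  .forBoundedOutOfOrderness[Trx](Duration.ofSeconds(15))
  .withTimestampAssigner(new SerializableTimestampAssigner[Trx] {
    override def extractTimestamp(element: Trx, recordTimestamp: Long): Long = element.timestamp
  })
  .withIdleness(Duration.ofMinutes(1)) // an instance that sees no events for 1 minute stops holding back the overall watermark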

Related

When creating a tensor with an array of timestamps, the numbers are incorrect

Looking for some kind of solution to this issue: I'm trying to create a tensor from an array of timestamps
[
1612892067115,
],
but here is what happens
tf.tensor([1612892067115]).arraySync()
> [ 1612892078080 ]
as you can see, the result is incorrect.
Somebody pointed out that I may need to use the datatype int64, but this doesn't seem to exist in tfjs 😭
I have also tried dividing my timestamps down to small floats, but I get a similar result
tf.tensor([1.612892067115, 1.612892068341]).arraySync()
[ 1.6128920316696167, 1.6128920316696167 ]
If you know a way to work around using timestamps in a tensor, please help :)
:edit:
As an attempted workaround, I tried to remove my year, month, and date from my timestamp
Here are my subsequent input values:
[
56969701,
56969685,
56969669,
56969646,
56969607,
56969602
]
and their outputs:
[
56969700,
56969684,
56969668,
56969648,
56969608,
56969600
]
as you can see, they are still incorrect, and should be well within the acceptable range
I found a solution that worked for me:
Since I only require a subset of the timestamp (just the date / hour / minute / second / ms) for my purposes, I simply truncate out the year / month:
export const subts = (ts: number) => {
  // a sub timestamp which can be used over the period of a month
  const yearMonth = +new Date(new Date().getFullYear(), new Date().getMonth())
  return ts - yearMonth
}
then I can use this with:
subTimestamps = timestamps.map(ts => subts(ts))
const x_vals = tf.tensor(subTimestamps, [subTimestamps.length], 'int32')
now all my results work as expected.
Currently only int32 is supported by tensorflow.js (there is no int64), and your data has gone outside the range that int32 can represent.
Until int64 is supported, this can be solved by using a relative timestamp. A timestamp in JS is the number of ms that have elapsed since 1 January 1970. A relative timestamp can be created by choosing another origin and computing the difference in ms elapsed since that date. That way, we get a smaller number that can be represented using int32. The best origin to take is the starting date of the records.
const a = Date.now() // too large for int32, so a tensor built directly from it would be inaccurate
const origin = new Date("02/01/2021").getTime()
const relative = a - origin
const tensor = tf.tensor(relative, undefined, 'int32')
// get back the data
const data = tensor.dataSync()[0]
// get back the initial date
const initialDate = new Date(data + origin)
In other scenarios, if the ms precision is not of interest, using the number of seconds that have elapsed since the start would be even better; seconds since the epoch is known as Unix time.

How to set the coreSize parameter in Hystrix?

I'm working on a Spring Boot project about electronic contracts, and it has an interface, raiseContract(). Considering that the traffic on this interface will be large in the future, my leader asked me to use Hystrix to protect it, and I have not used it before. I am learning it and trying to use it on the interface. I use the thread-pool isolation strategy, and I don't know how to set the coreSize parameter in threadPoolProperties to a reasonable value. In other words, I want to know what I should base this setting on.
I did a lot of research, but I did not find the answer. All the answers are about the meaning of coreSize, maxQueueSize, etc.
Here is my code:
@HystrixCommand(
    groupKey = "contractGroup",
    commandKey = "raiseContract",
    fallbackMethod = "raiseContractFallback",
    threadPoolProperties = {
        @HystrixProperty(name = "coreSize", value = "20"),
        @HystrixProperty(name = "maxQueueSize", value = "150"),
        @HystrixProperty(name = "queueSizeRejectionThreshold", value = "100")},
    commandProperties = {
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "15000"),
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "5"),
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "3000"),
        @HystrixProperty(name = "fallback.isolation.semaphore.maxConcurrentRequests", value = "20")
    })
As you are already aware, there are 3 main thread-pool configurations:
coreSize: Number of threads that will be maintained in the pool
maxSize: Defines how many extra threads are allowed in case the need arises.
maxQueueSize: Queue size of the tasks
Now let's start with an example. Assume there is a service using Hystrix, call it HystrixService, for which coreSize = maxSize = n and maxQueueSize = -1 (the default case). This means that at most n tasks will be executed at a time; any extra task that comes in will be rejected (the fallback will be executed).
So, in the ideal scenario, you have to ensure that HystrixService doesn't reject any request coming to it. You need to know at most how many concurrent requests there can be on HystrixService. If the throughput on HystrixService is 10 requests per second and its latency is 2 seconds, then by the time it responds to the first 10 requests, 10 more requests will have arrived, i.e. total concurrent requests = 2 × 10 = 20. So coreSize in this case should be 20.
This is the same as what the Hystrix documentation suggests:
coreSize = Peak Request per sec × P99 latency + some breathing room
Now, you can keep maxSize and maxQueueSize a bit higher, so that requests aren't rejected when there are sudden throughput spikes on your service.

Unable to extract/list all event logs on a Watson Assistant workspace

Please help. I was trying to call the Watson Assistant endpoint
https://gateway.watsonplatform.net/assistant/api/v1/workspaces/myworkspace/logs?version=2018-09-20 to get all the list of events
and filter by date range using this params
var param =
{ workspace_id: '{myworkspace}',
page_limit: 100000,
filter: 'response_timestamp%3C2018-17-12,response_timestamp%3E2019-01-01'}
Apparently I got the empty response below:
{
"logs": [],
"pagination": {}
}
A couple of things to check:
1. You have 2018-17-12, which in the YYYY-MM-DD format used by the filter translates to "the 12th day of the 17th month of 2018" and is not a valid date.
2. Assuming the date was meant to be valid, your search says "documents that are before 17th Dec 2018 and after 1st Jan 2019", which would return no documents.
3. Logs are only generated when you call the message() method through the API. So check your logging page in the tooling to see if you even have logs.
4. If you have a Lite account, logs are only stored for 7 days and then deleted. To keep logs longer you need to upgrade to a Standard account.
Although not directly related to your issue, be aware that page_limit has an upper hard-coded limit (IIRC 200-300?). So you may ask for 100,000 records, but it won't give them to you.
This is sample Python code (unsupported) that uses pagination to read the logs:
from watson_developer_cloud import AssistantV1
from urllib.parse import urlparse, parse_qs  # needed below to pull the cursor out of the pagination URL

username = '...'
password = '...'
workspace_id = '....'
url = '...'
version = '2018-09-20'

c = AssistantV1(url=url, version=version, username=username, password=password)

totalpages = 999
pagelimit = 200

logs = []
page_count = 1
cursor = None
count = 0

x = {'pagination': 'DUMMY'}
while x['pagination']:
    if page_count > totalpages:
        break
    print('Reading page {}. '.format(page_count), end='')
    x = c.list_logs(workspace_id=workspace_id, cursor=cursor, page_limit=pagelimit)
    if x is None:
        break
    print('Status: {}'.format(x.get_status_code()))
    x = x.get_result()

    logs.append(x['logs'])
    count = count + len(x['logs'])
    page_count = page_count + 1

    if 'pagination' in x and 'next_url' in x['pagination']:
        p = x['pagination']['next_url']
        u = urlparse(p)
        query = parse_qs(u.query)
        cursor = query['cursor'][0]
Your logs object should contain the logs.
I believe the limit is 500, and then we return a pagination URL so you can get the next 500. I don't think this is the issue, but it's good to know once you start getting logs back.

Run another DAG with TriggerDagRunOperator multiple times

I have a DAG (DAG1) where I copy a bunch of files. I would then like to kick off another DAG (DAG2) for each file that was copied. As the number of files copied will vary per DAG1 run, I would like to essentially loop over the files and call DAG2 with the appropriate parameters.
e.g.:
with DAG('DAG1',
         description="copy files over",
         schedule_interval="* * * * *",
         max_active_runs=1
         ) as dag:

    t_rsync = RsyncOperator(task_id='rsync_data',
                            source='/source/',
                            target='/destination/')

    t_trigger_preprocessing = TriggerDagRunOperator(task_id='trigger_preprocessing',
                                                    trigger_dag_id='DAG2',
                                                    python_callable=trigger)

    t_rsync >> t_trigger_preprocessing
I was hoping to use the python_callable trigger to pull the relevant xcom data from t_rsync and then trigger DAG2, but it's not clear to me how to do this.
I would prefer to put the logic of calling DAG2 here to simplify the contents of DAG2 (and also provide stacking schematics with the max_active_runs).
I ended up writing my own operator:
# import paths as of Airflow 1.10.x, where DagRunOrder and the python_callable
# argument of TriggerDagRunOperator still exist
from airflow import settings
from airflow.exceptions import AirflowSkipException
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.utils.db import create_session
from airflow.utils.state import State


class TriggerMultipleDagRunOperator(TriggerDagRunOperator):
    def execute(self, context):
        count = 0
        for dro in self.python_callable(context):
            if dro:
                with create_session() as session:
                    dbag = DagBag(settings.DAGS_FOLDER)
                    trigger_dag = dbag.get_dag(self.trigger_dag_id)
                    dr = trigger_dag.create_dagrun(
                        run_id=dro.run_id,
                        state=State.RUNNING,
                        conf=dro.payload,
                        external_trigger=True)
                    session.add(dr)
                    session.commit()
                    count = count + 1
            else:
                self.log.info("Criteria not met, moving on")
        if count == 0:
            raise AirflowSkipException('No external dags triggered')
with a python_callable like
def trigger_preprocessing(context):
    # 'found' is built elsewhere in the original DAG and is not shown here
    for base_filename, _ in found.items():
        exp = context['ti'].xcom_pull(task_ids='parse_config', key='experiment')
        run_id = '%s__%s' % (exp['microscope'], datetime.utcnow().replace(microsecond=0).isoformat())
        dro = DagRunOrder(run_id=run_id)
        d = {
            'directory': context['ti'].xcom_pull(task_ids='parse_config', key='experiment_directory'),
            'base': base_filename,
            'experiment': exp['name'],
        }
        LOG.info('triggering dag %s with %s' % (run_id, d))
        dro.payload = d
        yield dro
    return
and then tie it all together with:
t_trigger_preprocessing = TriggerMultipleDagRunOperator(task_id='trigger_preprocessing',
                                                        trigger_dag_id='preprocessing',
                                                        python_callable=trigger_preprocessing)

How to loop through table based on unique date in MATLAB

I have this table named BondData which contains the following:
Settlement Maturity Price Coupon
8/27/2016 1/12/2017 106.901 9.250
8/27/2019 1/27/2017 104.79 7.000
8/28/2016 3/30/2017 106.144 7.500
8/28/2016 4/27/2017 105.847 7.000
8/29/2016 9/4/2017 110.779 9.125
For each day in this table, I want to perform a certain task: assign several values to variables and perform the necessary computations. The logic is like:
do while Settlement is the same
m_settle=current_row_settlement_value
m_maturity=current_row_maturity_value
and so on...
my_computation_here...
end
It's like I want to loop through my settlement dates and perform the task for as long as the date stays the same.
EDIT: Just to clarify my issue, I am implementing yield curve fitting using the Nelson-Siegel and Svensson models. Here is my code so far:
function NS_SV_Models()
    load bondsdata
    BondData=table(Settlement,Maturity,Price,Coupon);
    BondData.Settlement = categorical(BondData.Settlement);
    Settlements = categories(BondData.Settlement); % get all unique Settlement
    for k = 1:numel(Settlements)
        rows = BondData.Settlement==Settlements(k);
        Bonds.Settle = Settlements(k); % current_row_settlement_value
        Bonds.Maturity = BondData.Maturity(rows); % current_row_maturity_value
        Bonds.Prices=BondData.Price(rows);
        Bonds.Coupon=BondData.Coupon(rows);
        Settle = Bonds.Settle;
        Maturity = Bonds.Maturity;
        CleanPrice = Bonds.Prices;
        CouponRate = Bonds.Coupon;
        Instruments = [Settle Maturity CleanPrice CouponRate];
        Yield = bndyield(CleanPrice,CouponRate,Settle,Maturity);
        NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
        SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
        NSModel.Parameters
        SVModel.Parameters
    end
end
Again, my main objective is to get each model's parameters (beta0, beta1, beta2, etc.) on a per-day basis. I am getting an error at Instruments = [Settle Maturity CleanPrice CouponRate]; because Settle contains only one record (8/27/2016), and it's supposed to have two since there are two rows for this date. Also, I noticed that Maturity, CleanPrice and CouponRate contain all records; they should only contain the respective data for each day.
Hope I made my issue clearer now. By the way, I am using MATLAB R2015a.
Use a categorical array. Here is your function (without its header line; all rows I can't run are commented out):
BondData = table(datetime(Settlement),datetime(Maturity),Price,Coupon,...
    'VariableNames',{'Settlement','Maturity','Price','Coupon'});
BondData.Settlement = categorical(BondData.Settlement);
Settlements = categories(BondData.Settlement); % get all unique Settlement
for k = 1:numel(Settlements)
    rows = BondData.Settlement==Settlements(k);
    Settle = BondData.Settlement(rows); % current_row_settlement_value
    Mature = BondData.Maturity(rows); % current_row_maturity_value
    CleanPrice = BondData.Price(rows);
    CouponRate = BondData.Coupon(rows);
    Instruments = [datenum(char(Settle)) datenum(char(Mature))...
        CleanPrice CouponRate];
    % Yield = bndyield(CleanPrice,CouponRate,Settle,Mature);
    %
    % NSModel = IRFunctionCurve.fitNelsonSiegel('Zero',Settlements(k),Instruments);
    % SVModel = IRFunctionCurve.fitSvensson('Zero',Settlements(k),Instruments);
    %
    % NSModel.Parameters
    % SVModel.Parameters
end
Keep in mind the following:
You cannot concatenate different types of variables as you try to do in Instruments = [Settle Maturity CleanPrice CouponRate];
There is no need for the structure Bonds; you don't really need it (e.g. Settle = Bonds.Settle;).
Use the relevant functions to convert between a datetime object and strings or numbers, for instance datenum(char(Settle)) in the code above. I don't know what kind of input you need to pass to the functions that follow.
