Flink Dashboard Throughput doesn't add up - apache-flink

I have two operators, a source and a map. The incoming throughput of the map is stuck at just above 6K messages/s, whereas the message count reaches the size of the whole stream (~350K) in under 20s (see duration). 350000/20 means a throughput of at least 17500 messages/s, not the 6000 that Flink suggests. What's going on here?
as shown in the picture:
start time = 13:10:29
all messages are already read by = 13:10:46 (less than 20s)

I checked the Flink library code and it seems that the numRecordsOutPerSecond statistic (as well as the other similar ones) operates on a window. This means it displays the average throughput of only the last X seconds, not the average throughput of the whole execution.
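For illustration only (this is not Flink's actual metric code, and the real window length may differ), here is a Python sketch of a meter that reports (count_now - count_window_ago) / window. With a 60-second window, a ~350K-record stream consumed in ~17 seconds shows up as roughly 6K records/s, because the window still contains idle seconds:

class WindowedMeter:
    """Rate over a fixed trailing window, one counter snapshot per second."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.snapshots = [0] * (window_seconds + 1)
        self.idx = 0
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def tick(self):
        # called once per second: remember the current counter value
        self.idx = (self.idx + 1) % len(self.snapshots)
        self.snapshots[self.idx] = self.count

    def rate(self):
        oldest = self.snapshots[(self.idx + 1) % len(self.snapshots)]
        return (self.snapshots[self.idx] - oldest) / self.window

meter = WindowedMeter(60)
for _ in range(17):            # ~350K records arrive within 17 seconds
    meter.mark(350_000 // 17)
    meter.tick()

print(meter.rate())            # ~5.8K/s: the 60 s window is mostly "empty"
print(350_000 / 17)            # ~20.6K/s: the actual burst throughput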

Related

How to measure the amount of time taken to do any kind of operation inside apps such as Maya, Creo Parametric, Adobe Premiere, etc

I want to measure the amount of time taken to do any kind of operation inside apps such as Creo Parametric 5.0, Adobe Premiere Pro, Maya, Adobe Creative, Lightroom CC or any other design app.
The idea is to measure the performance (time taken per operation) to catch performance issues.
When you create your library of actions, you can create a decorator that logs and times every action, so you can monitor what's going on.
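For example (a minimal sketch, assuming the app exposes a Python scripting API; the action name, the logger setup and the open_file action are placeholders, not any specific app's API):

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("action-timer")

def timed(action_name):
    """Wrap an action so its wall-clock duration is logged on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                log.info("%s took %.3f s", action_name, time.perf_counter() - start)
        return wrapper
    return decorator

@timed("open_file")                # hypothetical action from your own library
def open_file(path):
    time.sleep(0.1)                # stand-in for the real operation
    return path

open_file("example.prt")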
A method which is hacky and time-consuming but generic (even for software that doesn't provide (good) scripting) could be to record your screen at high speed (such as 60 fps). Then you look at the recording and count the frames between giving the order (click, key press) and the result (updated display).
The precision will be on the order of 1 / recording frequency (about 16 ms if recording at 60 fps). A drawback is that you are likely to measure more than just the operation you are interested in; for instance, if you want to benchmark loading a file, you will also measure the time it took to render it afterwards (which may, or should, be negligible).
I was able to apply this method using https://github.com/SerpentAI/D3DShot (increase the frame buffer size, which by default lasts 1 second). Note that the frame numbers go backward in time when exporting to files.
It may be possible to make this method less hacky by using computer vision algorithms so you don't have to count frames manually.
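A rough sketch of that idea (an assumption layered on top of this answer rather than something D3DShot provides; the frames are assumed to be H x W x C uint8 NumPy arrays, and the threshold needs tuning per app):

import numpy as np

def frames_until_change(frames, threshold=5.0):
    """Index of the first frame that differs noticeably from frame 0 (the frame where the input was given)."""
    baseline = frames[0].astype(np.int16)
    for i, frame in enumerate(frames[1:], start=1):
        if np.abs(frame.astype(np.int16) - baseline).mean() > threshold:
            return i
    return None

# At 60 fps each frame is ~16.7 ms, so:
# elapsed_ms = frames_until_change(frames) * 1000 / 60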

Method to Correlate Time Series Arrays of Differing Lengths

I am attempting to correlate the time series from 4 separate tilt monitors that sample every 5 minutes. The time series have slightly different start and end times, and the resulting arrays are slightly different lengths, though they span almost the same period of time (differing by ~3 mins). My goal is to correlate each of these time series with a single "wind speed" time series that covers the same period of time as the tilt monitors and also samples every 5 minutes, but likewise has a slightly different array length, origin time, and end time.
The different array lengths in the tilt measurements are due to instrument error. There are some times within each of the arrays where the instrument missed a measurement and so the sample interval is 10 minutes.
My arrays sizes look something like this:
Tilt_a = 6236x2
Tilt_b = 6310x2
Tilt_c = 6304x2
Tilt_d = 6309x2
Wind_speed = 6283x2
I am using MATLAB to do the correlation. I imagine that I will need to re-sample the data using something like interp1, but I do not know how to reconcile the origin and end times. Is there a method that comes to mind for handling a situation such as this one? Or a function that allows correlating arrays of differing lengths?
For the different time windows you're analyzing, you could either trim them all so that they start and end at the same time, or you could just review them over their raw intervals and make your comparisons over the windows that overlap.
As far as the sampling interval, you can use the resample command to make your comparisons more uniform.
https://www.mathworks.com/help/signal/ref/resample.html
Extending the first concept, you could use resample to define new vectors with the start time and end time and interval all synchronized, then continue with your analysis.
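To make that concrete, here is the same idea sketched in Python/NumPy (in MATLAB the equivalents would be interp1 or resample plus corrcoef); it assumes each array has a time column and a value column, i.e. [time, value]:

import numpy as np

def correlate_on_common_grid(a, b, step_s=300):
    """Interpolate two [time, value] series onto a shared 5-minute grid and correlate them."""
    t0 = max(a[0, 0], b[0, 0])        # latest common start time
    t1 = min(a[-1, 0], b[-1, 0])      # earliest common end time
    grid = np.arange(t0, t1, step_s)  # shared, evenly spaced time stamps

    a_i = np.interp(grid, a[:, 0], a[:, 1])
    b_i = np.interp(grid, b[:, 0], b[:, 1])
    return np.corrcoef(a_i, b_i)[0, 1]

# e.g. correlate_on_common_grid(Tilt_a, Wind_speed)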

Gatling: difference between Response Time Percentiles and Latency Percentiles over time

On my Gatling reports, I noticed that "Response Time Percentiles" and "Latency Percentiles over time" charts are quite identical. In which way are they different?
I saw this post, which makes me even more unsure:
Latency Percentiles over Time (OK) – same as Response Time Percentiles
over Time (OK), but showing the time needed for the server to process
the request, although it is incorrectly called latency. By definition
Latency + Process Time = Response time. So this graphic is supposed to
give the time needed for a request to reach the server. Checking
real-life graphics I think this graphic shows not the Latency, but the
real Process Time. You can get an idea of the real Latency by taking
one and the same second from Response Time Percentiles over Time (OK)
and subtract values from current graphs for the same second.
Thanks in advance for your help.
Latency basically tells you how long it takes to receive the first packet for each page request throughout the duration of your load test. If you look at this chart in the Gatling documentation, the first spike is just before 21:30:20 on the x-axis and tells you that 100% of the pages requested took longer than 1000 milliseconds to get the first packet from source to destination, but that number fell significantly after 21:30:20.

Is there an easy way to get the percentage of successful reads of last x minutes?

I have a setup with a BeagleBone Black which communicates with its I²C slaves every second and reads data from them. Sometimes the I²C readout fails, though, and I want to get statistics about these failures.
I would like to implement an algorithm which displays the percentage of successful communications over the last 5 minutes (up to 24 hours) and updates that value constantly. If I implemented that 'normally', with an array storing success/no success for every second, that would mean a lot of wasted RAM/CPU load for a minor feature (especially if I wanted to see the statistics of the last 24 hours).
Does someone know a good way to do that, or can anyone point me in the right direction?
Why don't you just implement a low-pass filter? For every successful transfer you push in a 1, for every failed one a 0; the result is a number between 0 and 1. Assuming that your transfers happen periodically, this works well -- you just have to adjust the cutoff frequency of that filter to your desired "averaging duration".
However, I can't follow your RAM argument: assuming you store one byte representing success or failure per transfer, which you say happens every second, you end up with 86400 bytes per day -- roughly 85 KB/day, which is really negligible.
EDIT: cutoff frequency is a term from signal theory and describes the highest (or lowest) frequency that passes a low-pass (or high-pass) filter.
Implementing a low-pass filter is trivial; something like this (Python-flavoured pseudocode):

new_val = 1.0  # init as if no transfer had ever failed
alpha = 0.001  # smoothing factor: smaller alpha = longer averaging

while True:
    old_val = new_val
    # placeholder: returns 1 on success, 0 on failure
    success = do_transfer_and_return_1_on_success_or_0_on_failure()
    new_val = alpha * success + (1 - alpha) * old_val
That's a single-tap IIR (infinite impulse response) filter; single tap because there's only one alpha and thus, only one number that is stored as state.
EDIT2: the value of alpha defines the behaviour of this filter: the smaller alpha is, the longer the filter "remembers" past transfers, i.e. the longer your effective averaging duration.
EDIT3: you can use a filter design tool to give you the right alpha; just set your low-pass filter's cutoff frequency to something like 0.5/integrationLengthInSamples, select an order of 0 for the IIR and use an elliptic design method (most tools default to Butterworth, but 0-order Butterworths don't do a thing).
I'd use scipy and convert the resulting (b,a) tuple (a will be 1, here) to the correct form for this feedback form.
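A simpler shortcut (my assumption, not part of the filter-design route above) is the standard exponential-moving-average relation between alpha and a desired averaging time constant:

import math

def alpha_for_time_constant(sample_period_s, time_constant_s):
    """Smoothing factor so the filter forgets with time constant tau (EMA rule of thumb)."""
    return 1.0 - math.exp(-sample_period_s / time_constant_s)

# one transfer per second, ~5-minute averaging window
print(alpha_for_time_constant(1.0, 300.0))   # ~0.0033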
UPDATE: in light of the OP's comment 'determine a trend of which devices are failing', I would recommend the geometric average that Marcus Müller put forward.
ACCURATE METHOD
The method below is aimed at obtaining 'well defined' statistics for performance over time that are also useful for 'after the fact' analysis.
Notice that the geometric average 'looks back' over a number of recent messages rather than over a fixed time period.
Maintain a rolling array of 24*60/5 = 288 'prior success rates' (SR[i] with i = -1, -2, ..., -288), each representing a 5-minute interval in the preceding 24 hours.
That will consume about 2.5K if the elements are 64-bit doubles.
To 'effect' constant updating use an Estimated 'Current' Success Rate as follows:
ECSR = (t*S/M + (300-t)*SR[-1]) / 300
where S and M are the counts of successes and messages in the current (partially complete) period, and SR[-1] is the previous (now complete) bucket.
t is the number of seconds expired of the current bucket.
NB: When you start up you need to use 300*S/M/t.
In essence the approximation assumes the error rate was steady over the preceding 5 - 10 minutes.
To 'effect' a 24-hour look-back you can either 'shuffle' the data down (by copy or memcpy()) at the end of each 5-minute interval, or implement a circular array by keeping track of the current bucket index.
NB: For many management/diagnostic purposes intervals of 15 minutes are often entirely adequate. You might want to make the 'grain' configurable.
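A sketch of that bookkeeping in Python (the bucket length and names mirror the description above; treat it as an illustration, not a drop-in implementation):

class SuccessRateTracker:
    """Rolling 24 h of 5-minute success-rate buckets plus an estimated current rate."""
    BUCKET_S = 300                      # 5-minute buckets
    N_BUCKETS = 24 * 60 // 5            # 288 buckets = 24 hours

    def __init__(self, now):
        self.rates = [None] * self.N_BUCKETS   # circular array of prior success rates
        self.head = 0                          # slot the next completed bucket goes into
        self.bucket_start = now
        self.successes = 0                     # S: successes in the current bucket
        self.messages = 0                      # M: messages in the current bucket

    def record(self, ok, now):
        if now - self.bucket_start >= self.BUCKET_S:
            # close the current bucket and start a new one
            self.rates[self.head] = (self.successes / self.messages) if self.messages else 1.0
            self.head = (self.head + 1) % self.N_BUCKETS
            self.bucket_start, self.successes, self.messages = now, 0, 0
        self.messages += 1
        self.successes += 1 if ok else 0

    def estimated_current_rate(self, now):
        # ECSR = (t*S/M + (300-t)*SR[-1]) / 300
        t = now - self.bucket_start
        current = self.successes / self.messages if self.messages else 1.0
        prev = self.rates[(self.head - 1) % self.N_BUCKETS]
        if prev is None:
            return current      # start-up: no completed bucket yet (the answer suggests 300*S/M/t)
        return (t * current + (self.BUCKET_S - t) * prev) / self.BUCKET_S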

App Engine: Can I set a req/min rate instead of req/sec in queue.yaml?

Google has a 15000/min limit on the number of reads and writes. To stay under this limit, I calculated 15000/min == 250/sec, so my queue config is:
name: mapreduce-queue
rate: 200/s
max_concurrent_requests: 200
Can I directly set a rate of 15000/min in queue.yaml? I used 200/s because 15000/min == 250/sec, adjusted down to leave room for bursts. Also, I feel like I should not need the max_concurrent_requests limit at all?
Yes you can.
However, use 15000/m instead of 15000/min
From the docs:
rate (push queues only)
How often tasks are processed on this queue. The value is a number
followed by a slash and a unit of time, where the unit is s for
seconds, m for minutes, h for hours, or d for days. For example, the
value 5/m says tasks will be processed at a rate of 5 times per
minute.
If the number is 0 (such as 0/s), the queue is considered "paused,"
and no tasks are processed.
and
max_concurrent_requests (push queues only)
Sets the maximum number of tasks that can be executed at any given
time in the specified queue. The value is an integer. By default, this
directive is unset and there is no limit on the maximum number of
concurrent tasks. One use of this directive is to prevent too many
tasks from running at once or to prevent datastore contention.
Restricting the maximum number of concurrent tasks gives you more
control over your queue's rate of execution. For example, you can
constrain the number of instances that are running the queue's tasks.
Limiting the number of concurrent requests in a given queue allows you
to make resources available for other queues or online processing.
It seems to me that for your situation, max_concurrent_requests is something you don't want to leave out.
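Putting that together, the queue config could read something like this (keeping the OP's max_concurrent_requests value of 200 purely as an example):

name: mapreduce-queue
rate: 15000/m
max_concurrent_requests: 200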
