We have a data stream that continuously dumps data into our data lake. Is there a good solution with minimal running cost to get a 10% random sample of the data?
I'm currently using the code below, but the total will drift above 10% as new batches arrive. I also tried calculating 10 batches of 100 records each with a mean of 0.1, but that resulted in ~32% sampling.
select id,
(uniform(0::float, 1::float, random(1)) < .10)::boolean as sampling
from temp_hh_mstr;
Before that, I thought of sampling with Snowflake's TABLESAMPLE, subtracting the IDs already in the sample from the table's total count. That requires a recalculation every time a batch arrives, which increases the cost.
Some additional references I've been looking at:
Wilson Score Interval With Continuity Correction
Binomial Confidence Interval
I have two operators, a source and a map. The incoming throughput of the map is stuck at just above 6K messages/s, whereas the message count reaches the size of the whole stream (~350K) in under 20s (see duration). 350000/20 means that I have a throughput of at least 17500 messages/s, not 6000 as Flink suggests! What's going on here?
as shown in the picture:
start time = 13:10:29
all messages are already read by = 13:10:46 (less than 20s)
I checked the Flink library code, and it seems that the numRecordsOutPerSecond statistic (as well as the other similar ones) operates on a window. This means that it displays the average throughput of the last X seconds only, not the average throughput of the whole execution.
I'm required to calculate the median of many parameters received from a Kafka stream over a 15 min time window.
I couldn't find any built-in function for that, but I have found a way using a custom WindowFunction.
My questions are:
Is it a difficult task for Flink? The data can be very large.
If the data gets to gigabytes, will Flink store everything in memory until the end of the time window? (One of the arguments of the apply method in a WindowFunction implementation is an Iterable, a collection of all the data that arrived during the time window.)
thanks
Your question contains several aspects, but let me answer the most fundamental one:
Is this a hard task for Flink, why is this not a standard example?
Yes, the median is a hard concept, as the only way to determine it is to keep the full data.
Many statistics don't need the full data to be calculated. For instance:
If you have the total sum, you can take the previous total sum and add the latest observation.
If you have the total count, you add 1 and have the new total count
If you have the average, under the hood you can just keep track of the total sum and count, and at any point calculate the new average based on an observation.
This can even be done with more complicated metrics, like the standard deviation.
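To illustrate the incremental statistics listed above, here is a minimal sketch in plain Python (not Flink code). It keeps only a few numbers of state and updates the standard deviation with Welford's method, which is one common choice and not something the answer prescribes. The class name `RunningStats` is made up for this example.

```python
import math


class RunningStats:
    """Count, sum, mean and standard deviation, updated one observation at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.count += 1                       # new count = old count + 1
        self.total += x                       # new sum = old sum + observation
        delta = x - self.mean
        self.mean += delta / self.count       # mean updated without old data
        self._m2 += delta * (x - self.mean)   # Welford's update

    def std(self):
        # Population standard deviation; 0.0 before any observation arrives.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.count, stats.total, stats.mean, stats.std())
```

The state stays the same size no matter how many observations arrive, which is exactly what the median lacks.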
However, there is no shortcut for determining the median. The only way to know what the median is after adding a new observation is to look at all observations and figure out which one is in the middle.
As such, it is a challenging metric, and the size of the incoming data will need to be handled. There may be approximate estimators in the works, like this: https://issues.apache.org/jira/browse/FLINK-2147
Alternatively, you could look at how your data is distributed, and perhaps estimate the median from metrics like mean, skew, and kurtosis.
A final solution I could come up with, if you only need to know approximately what the value is, is to pick a few 'candidates' and count the fraction of observations below them. The one closest to 50% would then be a reasonable estimate.
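The candidate idea can be sketched as follows. This is plain Python rather than a Flink WindowFunction, and the class name `CandidateMedian` and the candidate values are assumptions chosen for illustration. The estimate can only be as precise as the spacing of the candidates.

```python
class CandidateMedian:
    """Estimate the median by counting observations below fixed candidates."""

    def __init__(self, candidates):
        self.candidates = sorted(candidates)
        self.below = [0] * len(self.candidates)  # one counter per candidate
        self.count = 0

    def add(self, x):
        self.count += 1
        for i, candidate in enumerate(self.candidates):
            if x < candidate:
                self.below[i] += 1

    def estimate(self):
        if self.count == 0:
            return None
        # Pick the candidate whose "fraction below" is closest to 50%.
        best = min(
            range(len(self.candidates)),
            key=lambda i: abs(self.below[i] / self.count - 0.5),
        )
        return self.candidates[best]


est = CandidateMedian(candidates=[10, 20, 30, 40, 50])
for value in range(1, 61):  # observations 1..60, true median 30.5
    est.add(value)
print(est.estimate())  # 30
```

Memory use is one counter per candidate, independent of how much data arrives in the window.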
A hard disk system has the following parameters :
Number of tracks = 500
Number of sectors/track = 100
Number of bytes /sector = 500
Time taken by the head to move from one track to adjacent track = 1 ms
Rotation speed = 600 rpm.
What is the average time taken for transferring 250 bytes from the disk?
I wanted to know how the average seek time is calculated.
My Approach
Avg. time to transfer = Avg. seek time + Avg. rotational delay + Data transfer time
Avg Seek Time
given that : time to move between successive tracks is 1 ms
time to move from track 1 to track 1 : 0ms
time to move from track 1 to track 2 : 1ms
time to move from track 1 to track 3 : 2ms
..
..
time to move from track 1 to track 500 : 499 ms
Avg Seek time = (0 + 1 + 2 + ... + 499) / 500 ms = 249.5 ms
But after reading the answer given here: "Why is average disk seek time one-third of the full seek time?"
I'm confused about my approach.
My question is:
Is my approach correct?
If not, please explain the correct way to calculate the average seek time.
If yes, please explain why we are not considering the average over every possible pair of tracks (as mentioned in the above link).
There are a lot more than 500 possible seek times. Your method only accounts for seeks starting at track 1.
What about seeks starting from track 2? Or from track 285?
I wouldn't say your approach is wrong, but it's certainly incomplete.
As is pointed out in the link you're referring to in this question, the average time is calculated as the average distance from ANY track to ANY track. So you have to add the sub-sums for every other starting track to the one you are already using, and then divide the total by the number of (start, target) pairs. This works out to approximately N/3, where N is the distance between track 0 and the last track.
For example, the average distance from track 249 to ANY other track is: [image: middle average sum]
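A brute-force check of the "any track to any track" average, as a minimal sketch. It assumes every (start, target) pair is equally likely, including pairs where no movement is needed, and uses the question's 500 tracks at 1 ms per track. The function name is made up for this example.

```python
def average_seek_distance(num_tracks):
    """Mean |start - target| over every ordered pair of tracks."""
    total = sum(
        abs(start - target)
        for start in range(num_tracks)
        for target in range(num_tracks)
    )
    return total / num_tracks ** 2


N = 500
avg_all_pairs = average_seek_distance(N)   # about 166.7 tracks, close to N/3
avg_from_first = sum(range(N)) / N         # 249.5, seeks starting at track 1 only
print(avg_all_pairs, avg_from_first)
```

At 1 ms per track this gives roughly 166.7 ms, compared with the 249.5 ms obtained when only seeks from the first track are counted.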
Your calculation gives the average track seek; you need to add the sector seek to that.
When seeking for a read operation, the head is positioned on (a) a track, at a given (b) sector.
The (average) seek time is the time taken to switch from that position to any other position, in terms of both (a) track and (b) sector.
When positioned, the read can start.
The disk RPM comes into play here: if it spins at 600 rpm and has 100 sectors per track, it passes over sectors at
60000 ms (one minute, because rpm = per minute) / 600 rpm (disk spin speed) / 100 sectors (per track) = 1 ms (to move from one sector to the next adjacent one)
Normally, you would have to consider that as you change tracks, the disk is still spinning and thus account for the sector offset change. But since we are interested only in the average, this cancels out (hopefully).
So, to your 249.5 ms average track seek time, you need to add the result of the same formula applied to sectors:
(0 + 1 + ... + 100) / 100 * 1 ms (time per sector) = 50.5 ms
Thus, the average seek time for both track and sector is 300 ms.
I have a setup with a BeagleBone Black which communicates over I²C with its slaves every second and reads data from them. Sometimes the I²C readout fails, though, and I want to get statistics about these failures.
I would like to implement an algorithm which displays the percentage of successful communications over the last 5 minutes (up to 24 hours) and updates that value constantly. If I implemented that 'normally', with an array storing success/no success for every second, that would mean a lot of wasted RAM/CPU load for a minor feature (especially if I want to see the statistics of the last 24 hours).
Does someone know a good way to do that, or can anyone point me in the right direction?
Why don't you just implement a low-pass filter? For every successful transfer, you push in a 1, for every failed one a 0; the result is a number between 0 and 1. Assuming that your transfers happen periodically, this works well -- and you just have to adjust the cutoff frequency of that filter to your desired "averaging duration".
However, I can't follow your RAM argument: assuming you store one byte representing success or failure per transfer, which you say happens every second, you end up with 86400B per day -- 85KB/day is really negligible.
EDIT Cutoff frequency is something from signal theory and describes the highest or lowest frequency that passes a low or high pass filter.
Implementing a low-pass filter is trivial; something like (pseudocode):
new_val = 1.0  # init with no failed transfers
alpha = 0.001  # weight given to the newest transfer
while True:
    old_val = new_val
    success = do_transfer_and_return_1_on_success_or_0_on_failure()
    new_val = alpha * success + (1 - alpha) * old_val
That's a single-tap IIR (infinite impulse response) filter; single tap because there's only one alpha and thus, only one number that is stored as state.
EDIT2: the value of alpha defines the behaviour of this filter.
EDIT3: you can use a filter design tool to give you the right alpha; just set your low-pass filter's cutoff frequency to something like 0.5/integrationLengthInSamples, select an order of 0 for the IIR and use an elliptic design method (most tools default to Butterworth, but 0-order Butterworths don't do a thing).
I'd use scipy and convert the resulting (b,a) tuple (a will be 1, here) to the correct form for this feedback form.
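As a rough alternative to a filter design tool, alpha can be tied directly to a desired "memory" length. The sketch below assumes you want the weight of an old sample to have decayed to 1/e after a given number of transfers; that rule of thumb is an assumption of this example, not the design procedure described above. The helper names are hypothetical.

```python
import math


def alpha_for_window(samples):
    """Alpha for which a sample's weight decays to 1/e after `samples` updates."""
    return 1.0 - math.exp(-1.0 / samples)


def update(old_val, success, alpha):
    """One step of the single-tap IIR filter: success is 1 or 0."""
    return alpha * success + (1.0 - alpha) * old_val


# Roughly a 5-minute memory at one transfer per second.
alpha = alpha_for_window(300)

rate = 1.0
for _ in range(300):
    rate = update(rate, 0, alpha)  # simulate 300 consecutive failures
print(alpha, rate)  # rate has dropped from 1.0 to about 1/e
```

A smaller alpha gives a longer memory and a smoother but slower-reacting value; the state is still a single number.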
UPDATE In light of the comment by the OP 'determine a trend of which devices are failing' I would recommend the geometric average that Marcus Müller ꕺꕺ put forward.
ACCURATE METHOD
The method below is aimed at obtaining 'well defined' statistics for performance over time that are also useful for 'after the fact' analysis.
Notice that geometric average has a 'look back' over recent messages rather than fixed time period.
Maintain a rolling array of 24*60/5 = 288 'prior success rates' (SR[i] with i=-1, -2,...,-288) each representing a 5 minute interval in the preceding 24 hours.
That will consume about 2.3 KB (288 * 8 bytes) if the elements are 64-bit doubles.
To 'effect' constant updating use an Estimated 'Current' Success Rate as follows:
ECSR = (t*S/M+(300-t)*SR[-1])/300
Where S and M are the counts of successes and messages in the current (partially complete) period. SR[-1] is the previous (now complete) bucket.
t is the number of seconds expired of the current bucket.
NB: When you start up you need to use 300*S/M/t.
In essence the approximation assumes the error rate was steady over the preceding 5 - 10 minutes.
To 'effect' a 24 hour look back you can either 'shuffle' the data down (by copy or memcpy()) at the end of each 5 minute interval or implement a 'circular array' by keeping track of the current bucket index.
NB: For many management/diagnostic purposes intervals of 15 minutes are often entirely adequate. You might want to make the 'grain' configurable.
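The bucket scheme above can be sketched as follows, using the circular-array variant. This is an illustrative sketch, not the answer's exact implementation: the class and method names are made up, `record` is assumed to be called exactly once per second, and at start-up (before any bucket has completed) the sketch simply returns S/M for the partial bucket, which is an assumption of this example.

```python
class RollingSuccessRate:
    def __init__(self, bucket_seconds=300, num_buckets=288):
        self.bucket_seconds = bucket_seconds  # the configurable 'grain'
        self.rates = [None] * num_buckets     # circular array of SR values
        self.index = 0                        # slot for the next completed bucket
        self.successes = 0                    # S in the current bucket
        self.messages = 0                     # M in the current bucket
        self.elapsed = 0                      # t, seconds into the current bucket

    def record(self, success):
        """Call once per second with the outcome of that second's transfer."""
        self.messages += 1
        self.successes += 1 if success else 0
        self.elapsed += 1
        if self.elapsed == self.bucket_seconds:
            # Bucket complete: store its rate and advance the circular index.
            self.rates[self.index] = self.successes / self.messages
            self.index = (self.index + 1) % len(self.rates)
            self.successes = self.messages = self.elapsed = 0

    def current(self):
        """ECSR = (t*S/M + (width - t)*SR[-1]) / width."""
        previous = self.rates[self.index - 1]
        partial = self.successes / self.messages if self.messages else None
        if previous is None:
            return partial   # start-up: no completed bucket yet
        if partial is None:
            return previous  # a bucket has just rolled over
        t, width = self.elapsed, self.bucket_seconds
        return (t * partial + (width - t) * previous) / width

    def lookback(self):
        """Mean of all completed buckets (up to 24 hours with the defaults)."""
        done = [r for r in self.rates if r is not None]
        return sum(done) / len(done) if done else None


tracker = RollingSuccessRate()
for second in range(450):
    # First 5 minutes: every 10th transfer fails. Afterwards: all succeed.
    tracker.record(success=(second >= 300 or second % 10 != 0))
print(tracker.current(), tracker.lookback())
```

In the usage example the completed bucket has a rate of 0.9 and the half-finished one a rate of 1.0, so the blended estimate sits halfway between them.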
My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e. at certain times each day several different processes have to dump their output data into the same table. Speed of writing is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So for simplicity let's assume that I have only one process writing data once a day and it has to write up to X items into the table (each item < 1KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes/second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second that you are going to expect, not the average over the day. So, if you have a single process that runs once a day and makes X number of writes, of Y size (in KB, rounded up), over Z number of seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
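The formula can be written as a small helper. This is a minimal sketch: the function name is made up, and it only does the arithmetic described above, rounding the item size up to whole KB and the result up to a whole number of writes per second.

```python
import math


def required_write_capacity(num_items, item_size_kb, window_seconds):
    """capacity = (X * Y) / Z, with Y rounded up to whole KB."""
    units_per_item = math.ceil(item_size_kb)
    return math.ceil(num_items * units_per_item / window_seconds)


burst = required_write_capacity(100_000, 0.8, 100)        # the example above
spread = required_write_capacity(100_000, 0.8, 24 * 3600)  # same load over a day
print(burst, spread)  # 1000 2
```

The second call shows how much lower the requirement becomes when the same writes are spread over the whole day instead of a 100-second burst.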
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.