I'm required to calculate the median of many parameters received from a Kafka stream, over a 15-minute time window.
I couldn't find any built-in function for that, but I have found a way using a custom WindowFunction.
My questions are:
Is this a difficult task for Flink? The data can be very large.
If the data gets to gigabytes, will Flink store everything in memory until the end of the time window? (One of the arguments of the WindowFunction's apply method is an Iterable - a collection of all the data which arrived during the time window.)
Thanks
Your question contains several aspects, but let me answer the most fundamental one:
Is this a hard task for Flink, why is this not a standard example?
Yes, the median is a hard metric, as the only way to determine it exactly is to keep the full data.
Many statistics don't need the full data to be calculated. For instance:
If you have the total sum, you can take the previous total sum and add the latest observation.
If you have the total count, you add 1 and have the new total count.
If you have the average, you can under the hood just keep track of the total sum and count, and at any point calculate the new average after an observation (a sketch of this follows below).
This can even be done with more complicated metrics, like the standard deviation.
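For instance, a running average needs only two numbers of state; a minimal sketch in Python:

class RunningAverage:
    # Keeps only the running sum and count: O(1) state, however many observations arrive
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def average(self):
        return self.total / self.count if self.count else 0.0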
However, there is no such shortcut for the median. The only way to know the median after adding a new observation is to look at all observations and figure out which one is in the middle.
As such, it is a challenging metric, and the size of the incoming data will need to be handled. There may be approximate solutions in the works, like this one: https://issues.apache.org/jira/browse/FLINK-2147
Alternatively, you could look at how your data is distributed, and perhaps estimate the median from metrics like the mean, skew, and kurtosis.
A final solution I could come up with, if you only need to know approximately what the value should be, is to pick a few 'candidates' and count the fraction of observations below each of them. The candidate closest to 50% would then be a reasonable estimate.
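A minimal sketch of that candidate idea in Python (the candidate values are placeholders you would spread over the expected value range):

def estimate_median(stream, candidates):
    # One counter per candidate: how many observations fall below it.
    # State is O(len(candidates)), independent of the stream size.
    below = [0] * len(candidates)
    total = 0
    for value in stream:
        total += 1
        for i, candidate in enumerate(candidates):
            if value < candidate:
                below[i] += 1
    # The candidate whose fraction-below is closest to 50% wins
    best = min(range(len(candidates)), key=lambda i: abs(below[i] / total - 0.5))
    return candidates[best]

For example, estimate_median(values, candidates=[10, 50, 100, 500, 1000]) would report whichever of those five values sits closest to the true median.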
Related
I have come across a problem.
I'm not asking for help with how to construct what I'm searching for, but only for a pointer toward what I'm looking for! 😊
The thing I want to create is some sort of ‘Sorting Algorithm/Mechanism’.
Example:
Imagine I have a database with over 1000 pictures of different vehicles.
A person sees a vehicle and tries to gather as much information and detail about that vehicle as he can, such as:
Shape
number of wheels
number and shape of windows
number and shape of light(s)
number and shape of exhaust(s)
Etc…
He then gives me all information about that vehicle he saw. BUT! Without telling me anything about:
Make and model.
…
I will now take that information and tell my database to sort every vehicle, so that it arranges all 1000 vehicles by best match, based on the description it has been given.
But it should NOT exclude any vehicle!
So…
If the person tells me that the vehicle only has 4 wheels, but in reality it has 5 (he might not have seen the fifth wheel), it should just get a bad score on the number of wheels.
But if every other aspect matches that vehicle perfectly, it will still get a high score.
That way we don't exclude the vehicle that he has seen, and we still have a chance to find the correct vehicle.
The whole point of this mechanism is, as said, to narrow things down, so that instead of looking through 1000 vehicles we only need to look through the best matches, which is hopefully 10 to maybe 50 vehicles out of 1000.
I tried to describe it the best I could in a language that isn’t ‘my father’s tongue’. So bear with me.
Again, I'm not looking for anybody to tell me how to make this algorithm; I'm pretty sure nobody even wants to, or has the time to, do that for me without getting paid somehow...
But I just need to know where to look to learn and understand how to create this mess of a mechanism.
Kind regards
Gent!
Assuming that all your pictures have been indexed with the relevant fields (number of wheels, window shapes...), and given that they are not too numerous (a thousand is peanuts for a computer), you can proceed as follows:
for every criterion, weight the possible discrepancies (e.g. one wheel too many costs 5, one wheel too few costs 10, a bad window shape costs 8...). Do this in a coherent way so that the costs of the criteria are well balanced.
to perform a search, evaluate the total discrepancy cost of every car, sort the values in increasing order, and report the first ten.
Technically, what you are after is called a "nearest neighbor search" in a high dimensional space. This problem has been well studied. There are fast solutions but they are extremely complex, and in your case are absolutely not worth using.
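A minimal sketch of that weighted-discrepancy scoring in Python (the criteria, costs, and field names are invented for illustration):

# Invented per-criterion costs; balance them against each other
COST_WHEEL_EXTRA = 5     # one wheel too many
COST_WHEEL_MISSING = 10  # one wheel too few
COST_WINDOW_SHAPE = 8    # wrong window shape

def discrepancy(description, vehicle):
    cost = 0
    diff = vehicle["n_wheels"] - description["n_wheels"]
    cost += COST_WHEEL_EXTRA * max(diff, 0) + COST_WHEEL_MISSING * max(-diff, 0)
    if description["window_shape"] != vehicle["window_shape"]:
        cost += COST_WINDOW_SHAPE
    return cost

def best_matches(description, database, top=10):
    # Score every vehicle, sort by increasing cost, and report the first ten
    return sorted(database, key=lambda v: discrepancy(description, v))[:top]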
The default way of doing this, for example in artificial intelligence, is to encode all properties as a vector and apply a weight to each property. The distance can then be calculated using any metric you like; in your case the Manhattan distance should be fine. As a sketch (in Python):
def distance(first_car, second_car):
    # Weighted Manhattan distance; add one term per additional attribute (...)
    return (abs(first_car.n_wheels - second_car.n_wheels) * wheels_weight
            + abs(first_car.n_windows - second_car.n_windows) * windows_weight)
This works fine for simple properties like the number of wheels. For more complex properties like the shape of a window you'll probably need to split it up into multiple attributes depending on your requirements on similarity.
Weights are usually picked in such a way as to normalize all values, if their range is known. Optionally an additional factor can be multiplied to increase the impact of a specific attribute on the overall distance.
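For example, a minimal way to derive such weights, assuming the value ranges are known up front (the attributes and ranges here are invented):

# Invented attributes with assumed known value ranges
RANGES = {"n_wheels": (2, 18), "n_windows": (0, 30)}

def weight(attribute, importance=1.0):
    lo, hi = RANGES[attribute]
    # Dividing by the range maps each attribute difference into [0, 1];
    # 'importance' then scales its impact on the overall distance.
    return importance / (hi - lo)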
We plan to use Apache Flink to perform real-time aggregations on multiple types of objects.
We need to support several types of aggregations such as sum, max, min, average etc. - nothing special so far
Our requirement is to output the data to Kafka, where one message contains multiple aggregated values for multiple object attributes.
For example, the message should include the sum, max, and average values for attribute A and also the sum and min values of attribute B for the last 10 minutes.
My question is what is the best way to implement such a requirement with Flink?
We thought about using a custom window function that runs on all objects at the end of the window, calculates all required values itself, and outputs a new object that holds all of these aggregated values.
What concerns us about this solution is the effect on memory consumption of having to hold all the window data in memory until the window fires (we will have many such windows open at the same time).
Any suggestions / comments are highly appreciated!
Thanks
The best approach would be to use incremental aggregation to compute the count, sum, min, and max for each window -- and you can compute the average in your window function, given the sum and count. In this way the only state you'll need to keep is these four values (the count, sum, min, and max), rather than having to buffer the entire stream for processing at the end of the window.
This example from the documentation should be enough to get you started.
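For illustration, here is a minimal sketch of that accumulator in plain Python; in Flink itself the same shape maps onto an AggregateFunction (createAccumulator/add/getResult/merge):

class Accumulator:
    # Per-window state: four numbers instead of the whole buffered stream
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    def result(self):
        # The average falls out of the sum and the count when the window fires
        return {"sum": self.total, "min": self.min, "max": self.max,
                "avg": self.total / self.count if self.count else 0.0}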
I have a 2 column vector with times and speeds of a subset of data, like so:
5 40
10 37
15 34
20 39
And so on. I want to take the Fourier transform of the speeds to get a frequency. How would I go about doing this with a fast Fourier transform (FFT)?
If my vector name is sampleData, I have tried
fft(sampleData);
but that gives me a vector of real and imaginary numbers. To be able to get sensible data to plot, how would I go about doing this?
The Fourier transform will yield a complex vector: when you fft, you get a vector of frequencies, each with a spectral phase. These phases can be extremely important! (They contain most of the information of the time-domain signal; you won't see interference effects without them, etc.) If you want to plot the power spectrum, you can
plot(abs(fft(sampleData)));
To complete the story, you'll probably need to fftshift, and also produce a frequency vector. Here's more elaborate code:
% Assuming 'time' is the 1st column and 'sampleData' is the 2nd column,
% both row vectors of the same length:
N = length(sampleData);
win = window(@hamming, N)';   % Hamming window, transposed to a row vector
dt = mean(diff(time));        % sampling interval
df = 1/(N*dt);                % frequency resolution (df = 1/max_T)
if mod(N,2) == 0
    f_vec = df*((1:N)-1-N/2);   % frequency vector for EVEN-length input
else
    f_vec = df*((1:N)-0.5-N/2); % frequency vector for ODD-length input
end
fft_data = fftshift(fft(fftshift(sampleData.*win)));
plot(f_vec, abs(fft_data))
I would recommend that you back up and think about what you are trying to accomplish, and whether an FFT is an appropriate tool for your situation. You say that you "want to ... get a frequency", but what exactly do you mean by that? Do you know that this data has exactly one frequency component, and want to know what the frequency is? Do you want to know both the frequency and phase of the component? Do you just want to get a rough idea of how many discrete frequency components are present? Are you interested in the spectrum of the noise in your measurement? There are many questions you can ask about "frequencies" in a data set, and whether or not an FFT and/or power spectrum is the best approach to getting an answer depends on the question.
In a comment above you asked "Is there some way to correlate the power spectrum to the time values?" This strikes me as a confused question, but also makes me think that maybe the question you are really trying to answer is "I have a signal whose frequency varies with time, and I want to get an estimate of the frequency vs time". I'm sure I've seen a question along those lines within the past few months here on SO, so I would search for that.
Given is an array of 320 elements (int16), which represent an audio signal (16-bit LPCM) of 20 ms duration. I am looking for a most simple and very fast method which should decide whether this array contains active audio (like speech or music), but not noise or silence. I don't need a very high quality of the decision, but it must be very fast.
It occurred to me first to sum all the squares or absolute values of the elements and compare the sum with a threshold, but such a method is very slow on my system, even though it is O(n).
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing so far is to use a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update it with the square of the most recent sample. To keep your running total from growing and growing over time, apply an exponential decay. As a sketch (in Python):
DECAY_CONSTANT = 0.999  # some suitable value smaller than 1

total = 0.0
for current_sample in samples:
    # Exponential decay keeps the running total from growing without bound
    total *= DECAY_CONSTANT
    # Add in the square of the latest sample
    total += current_sample * current_sample
    if total > threshold:
        pass  # active audio detected; do something
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
You might try calculating two simple "statistics". The first would be spread (max - min): silence will have a very low spread. The second would be variety: divide the range of possible values into, say, 16 equal brackets, and as you go through the elements, determine which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should favor some of them while neglecting others.
This should be possible to do in just one pass through the array, and you do not need complicated arithmetic, just some addition and comparison of values.
Also consider some approximation; for example, take only every fourth value, thus reducing the number of checked elements to 80. For an audio signal, this should be okay.
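A minimal one-pass sketch of both statistics in Python (the final thresholds are invented placeholders you would tune):

def looks_active(samples, brackets=16, step=4):
    # One pass over int16 samples: spread (max - min) plus a coarse histogram
    lo, hi = 32767, -32768
    counts = [0] * brackets
    for s in samples[::step]:  # approximation: check only every 4th sample
        lo = min(lo, s)
        hi = max(hi, s)
        counts[(s + 32768) * brackets // 65536] += 1
    occupied = sum(1 for c in counts if c)
    # Tiny spread suggests silence; evenly filled brackets suggest noise
    return (hi - lo) > 1000 and occupied <= brackets // 2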
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change of the cube of the running average over about 120 ms. When there is silence (that is, only noise), the expression should hover around zero. As soon as the rate starts increasing over a couple of runs, you probably have some action going on.
rate = cur_avg^3 - prev_avg^3
I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bit shift instead. Hope this helps.
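A minimal sketch of that idea in Python, assuming 20 ms frames so that a smoothing factor of about 1/6 approximates a 120 ms running average (the factor is an assumption to tune):

ALPHA = 1 / 6  # assumed: ~120 ms running average built from 20 ms frames

def cube_rate(frame, prev_avg):
    # Feed one 20 ms frame at a time; returns (rate, updated running average).
    # The rate hovers near zero in silence and jumps when activity starts.
    level = sum(abs(s) for s in frame) / len(frame)
    cur_avg = (1 - ALPHA) * prev_avg + ALPHA * level
    return cur_avg ** 3 - prev_avg ** 3, cur_avg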
Clearly the complexity has to be at least O(n). Some simple algorithm that calculates a value range is probably good enough for the moment, but I would search for Voice Activity Detection on the web and look for related code samples.
I have a number of tracks recorded by a GPS, which more formally can be described as a number of line strings.
Now, some of the recorded tracks might be recordings of the same route, but because of inaccuracies in the GPS system, the fact that the recordings were made on separate occasions, and the fact that they might have been recorded travelling at different speeds, they won't match up perfectly. Still, they look close enough when viewed on a map by a human to determine that it's actually the same route that has been recorded.
I want to find an algorithm that calculates the similarity between two line strings. I have come up with some home-grown methods to do this, but would like to know if this is a problem that already has good algorithms to solve it.
How would you calculate the similarity, given that similar means represents the same path on a map?
Edit: For those unsure of what I'm talking about, please look at this link for a definition of what a line string is: http://msdn.microsoft.com/en-us/library/bb895372.aspx - I'm not asking about character strings.
Compute the Fréchet distance on each pair of tracks. The distance can be used to gauge the similarity of your tracks.
Math alert: Fréchet was a pioneer in the field of metric spaces, which is relevant to your problem.
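A minimal sketch of the discrete variant in Python (a common approximation of the continuous Fréchet distance, and usually adequate for sampled GPS tracks):

from math import dist  # Python 3.8+

def discrete_frechet(p, q):
    # Discrete Fréchet distance between two polylines given as lists of
    # (x, y) tuples; classic dynamic program, O(len(p) * len(q)).
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    ca[0][0] = dist(p[0], q[0])
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            best = min(ca[x][y] for x, y in ((i-1, j), (i-1, j-1), (i, j-1))
                       if x >= 0 and y >= 0)
            ca[i][j] = max(best, dist(p[i], q[j]))
    return ca[n-1][m-1]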
I would add a buffer around the first line based on the estimated probable error, and then determine if the second line fits entirely within the buffer.
To determine "same route," create the minimal set of normalized path vectors, calculate the total power differences and compare the total to a quality measure.
Normalize the GPS waypoints on total path length,
walk the vectors of the paths together, creating a new set of path vectors for each path based upon the shortest vector at each waypoint,
calculate the total power differences between endpoints of each vector in the normalized paths weighting for vector length, and
compare against a quality measure.
Tune the power of the differences (start with, say, squared differences) and the quality measure (say as a percent of the total power differences) visually. This algorithm produces a continuous quality measure of the path match as well as a binary result (Are the paths the same?)
Paul Tomblin said: "I would add a buffer around the first line based on the estimated probable error, and then determine if the second line fits entirely within the buffer."
You could modify the algorithm as the normalized vector endpoints are compared. You could determine if any endpoint difference was above a certain size (implementing Paul's buffer idea) or perhaps, if the endpoints were outside the "buffer," use that fact to ignore that endpoint difference, allowing a comparison ignoring side trips.
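A minimal sketch of the normalize-and-compare idea described above, simplified to equal-spacing resampling along each path (the point count and quality threshold are placeholders):

from math import dist

def resample(path, n=100):
    # Resample a polyline to n points evenly spaced along its total length
    cum = [0.0]
    for a, b in zip(path, path[1:]):
        cum.append(cum[-1] + dist(a, b))
    step = cum[-1] / (n - 1)
    out, seg = [], 0
    for i in range(n):
        target = i * step
        while seg < len(path) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = (target - cum[seg]) / span if span else 0.0
        out.append(tuple(p + t * (q - p) for p, q in zip(path[seg], path[seg + 1])))
    return out

def same_route(a, b, n=100, quality=25.0):
    # Total squared differences between corresponding resampled points,
    # averaged per point and compared against a tuned quality measure
    total = sum(dist(p, q) ** 2 for p, q in zip(resample(a, n), resample(b, n)))
    return total / n < quality ** 2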
You could walk along each point (Pa) of LineString A and measure the distance from Pa to the nearest line-segment of LineString B, averaging each of these distances.
This is not a quick or perfect method, but it should be able to give you a useful number and is pretty quick to implement.
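A minimal sketch of that average in Python (assuming 2D tracks given as lists of (x, y) tuples):

from math import dist, hypot

def point_segment_distance(p, a, b):
    # Distance from point p to the line segment a-b
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return dist(p, a)
    # Project p onto the segment, clamping to the endpoints
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def average_distance(track_a, track_b):
    # For every point of A, distance to the nearest segment of B, averaged
    return sum(
        min(point_segment_distance(p, a, b) for a, b in zip(track_b, track_b[1:]))
        for p in track_a
    ) / len(track_a)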
Do the line strings start and finish at similar points, or are they of very different extents?
If you consider a single line string to be a sequence of [x,y] points (or [x,y,z] points), then you could compute the similarity between each pair of line strings using the Needleman-Wunsch algorithm. As described in the referenced Wikipedia article, the Needleman-Wunsch algorithm requires a "similarity matrix" which defines the distance between a pair of points. However, it would be easy to use a function instead of a matrix. In your case you could simply use the 2D Euclidean distance function (or a 3D Euclidean function if your points have elevation) to provide the distance between each pair of points.
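A minimal sketch of that adaptation in Python, replacing the similarity matrix with the 2D Euclidean distance and using an invented gap penalty you would tune:

from math import dist

def alignment_cost(p, q, gap=10.0):
    # Needleman-Wunsch-style global alignment of two point sequences, using
    # Euclidean distance as the substitution cost; lower means more similar.
    n, m = len(p), len(q)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i-1][j-1] + dist(p[i-1], q[j-1]),  # match points
                           dp[i-1][j] + gap,                     # skip a point of p
                           dp[i][j-1] + gap)                     # skip a point of q
    return dp[n][m]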
I actually side with the person (Aaron F) who said that you might be interested in the Levenshtein distance problem (and cited this [1]). His answer seems to me to be the best so far.
More specifically, Levenshtein distance (also called edit distance) does not measure strictly the character-by-character distance, but also allows you to perform insertions and deletions. The best algorithms for this distance measure run in quadratic time (pretty slow if your strings are long), but computational biologists have pretty good heuristics for it, which might be of interest to you on their own. Check out BLAST and FASTA.
In your problem, it seems that you are dealing with differences between strings of numbers, and you care about the numbers. If you give more information, I might be able to direct you to the right variant of BLAST/FASTA/etc for your purposes. In any case, you might consider adapting BLAST and FASTA for your needs. They're quite simple.
[1]: http://en.wikipedia.org/wiki/Levenshtein_distance, http://www.nist.gov/dads/HTML/Levenshtein.html