How can you exclude a large large initial value from a running delta calculation? - google-data-studio

I'm trying to use a running delta calculation to graph how much additional storage is used per hour, from a field that contains how much storage is used. Let's say I have a field like disk_space_used_mb. If I have the values 50000, 50100, 50300, the running delta would be 50000, 100, 200, but I don't really care about the first value, and it throw off my graph. I can of course set the max value of the y axis manually, but that isn't dynamic.
How can I prevent this first large value from throwing off my graph? is there a way to force that to 0?
Here's an example of why this is a problem (with different numbers):

Sadly, this is currently not possible and it is a very common problem when plotting running delta.
To workaround, if your initial value is static, you can create a new calculated field where you subtract the initial value from all rows (so the initial value will be always zero). But obviously, this is not an elegant solution and your chart Y-axis values will be different from the real values.
But if the initial value can be changed by the user (it is dynamic), you're really out of lucky. The only solution I can imagine is to search for an alternative visualization that support this feature or develop your own visualization.
The second option probably solves your problem, but the development of community visualizations is far from being an easy task.

Related

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values change.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.

flink calculate median on stream

I'm required to calculate median of many parameters received from a kafka stream for 15 min time window.
i couldn't find any built in function for that, but I have found a way using custom WindowFunction.
my questions are:
is it a difficult task for flink? the data can be very large.
if the data gets to giga bytes, will flink store everything in memory until the end of the time window? (one of the arguments of apply WindowFunction implementation is Iterable - a collection of all data which came during the time window )
thanks
Your question contains several aspects, but let me answer the most fundamental one:
Is this a hard task for Flink, why is this not a standard example?
Yes, the median is a hard concept, as the only way to determine it is to keep the full data.
Many statistics don't need the full data to be calculated. For instance:
If you have the total sum, you can take the previous total sum and add the latest observation.
If you have the total count, you add 1 and have the new total count
If you have the average, under the hood you can just keep track of the total sum and count, and at any point calculate the new average based on an observation.
This can even be done with more complicated metrics, like the standard deviation.
However, there is no shortcut for determining the median, the only way to know what the median is after adding a new observation, is by looking at all observations and then figuring out what the middle one is.
As such, it is a challenging metric and the size of the data that comes in will need to be handled. As mentioned there may be estimates in the workings like this: https://issues.apache.org/jira/browse/FLINK-2147
Alternately, you could look at how your data is distributed, and perhaps estimate the median with metrics like Mean, Skew, and Kurtosis.
A final solution I could come up with, is if you need to know approximately what the value should be, is to pick a few 'candidates' and count the fractin of observations below them. The one closest to 50% would then be a reasonable estimate.

Solr Distance Filter from a Radius Field

Hi I am very new to Solr queries (like a few hours), so please excuse me if this is a naive question, but is there a way on the geo filter to set the radius from a field.
{!geofilt pt=35.3459327,-97.4705935 sfield=locs_field_location$latlon d=fs_radius}
Or do a subquery to return the value of that field fs_field_job_search_radius and place it in there. I can return the value from the field list so I was hoping it could go in there, in some method.
This is similar to this Filtering by distance vs. field value in Solr but I do not know if he got it working or where I would need to start to write a function as was suggested. Also this is on a Solr server I do not control. It is controlled by my hosting company, so I do not know if I can even create functions. Thanks.
Took a work around, but I got what I was trying to accomplish I believe.
fq={!frange l=0 h=12742}sub(radius_field,geodist(field,point))
The 12742 is the diameter of the earth in km as I still needed a hard number for that, but I doubt most are searching in space. So basically we subtract the distance from radius_field to find out if it is in range.
radius_field - distance
If the results are a positive number than it is within range. If it is a negative number than it is not. Please let me know if I screwed up my logic. Thanks.

Matlab's bvp4c: output arrays not always the same length as the initial guess

The Matlab function bvp4c solves boundary value problems. It takes a differential equation, boundary conditions and an initial guess as input, and returns a structure array containing arrays of x, y and yp (which stands for "y prime", or y').
The length of the output arrays should be the same as that of the initial guess, but I found that it isn't always. I have checked the dimensions of the input (the initial guess, always 1x101 double for x and 16x101 double for y) and the output (sometimes 1x101 double for x and 16x101 double for y and yp as it should be, but often different values, such as 1x91 double and 16x91 double or 1x175 double and 16x175 double).
Looking at the output array x when its length is off, some extra values are squeezed in, or some are taken out. For example, the initial guess has 100 positions between x=0 and x=1, and the x array should be [0 0.01 0.02 ... 1], but sometimes a new position like 0.015 shows up.
Question: Why does this happen, and how can this be solved?
"The length of the output arrays should be the same as that of the initial guess ...." This is incorrect.
As described in the bvp4c documentation, sol.x contains a "[mesh] selected by bvp4c" with an "[approximation] to y(x) at the mesh points of sol.x". In order to evaluate bvp4c's solution on your mesh, use deval.
Why does bvp4c choose a mesh? Quoting from the cited paper1, which you can get in full here if you have a MathWorks account:
Because BVPs can have more than one solution, BVP codes require users to supply a guess for the solution desired. The guess includes a guess for an initial mesh that reveals the behavior of the desired solution. The codes then adapt the mesh so as to obtain an accurate numerical solution with a modest number of mesh points.
Because a steady BVP generally has a global behavior strongly dependent on its boundary values, the spatial mesh between the two boundaries may need to be refined in order to properly approximate the desired solution with the locally chosen basis functions for the method. However, there may also be portions of the mesh that do not need to be refined and can even be coarsened in some cases to maintain a reasonably small residual and accurate approximation. Therefore, for general efficiency, the guess mesh is adaptively refined or coarsened depending on some locally chosen metric (since bvp4c is collocation based, the metric is probably point-based or division-integrated based) such that the mesh returned by bvp4c is, in some sense, adequate enough for generic interpolation within the boundaries.
I'll also note that this is different from numerically solving IVPs since their state is not global across the entire time integration locus and only depends on the current state to the next time-step, and possibly previous time steps if using a multi-step method or solving a delay differential equation, which makes the refinement inherently local. This local behavior of IVPs is what allows functions like ode45 to return a solution at pre-selected time values because it can locally refine the solution at the selected point while performing the time march (this is known as dense output).
1 Shampine, L.F., M.W. Reichelt, and J. Kierzenka, "Solving Boundary Value Problems for Ordinary Differential Equations in MATLAB with bvp4c".

Most simple and fast method for audio activity detection?

Given is an array of 320 elements (int16), which represent an audio signal (16-bit LPCM) of 20 ms duration. I am looking for a most simple and very fast method which should decide whether this array contains active audio (like speech or music), but not noise or silence. I don't need a very high quality of the decision, but it must be very fast.
It occurred to me first to add all squares or absolute values of the elements and compare their sum with a threshold, but such a method is very slow on my system, even if it is O(n).
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing so far is to use a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update that with the square of the most recent sample. To avoid your running total from growing and growing over time, add an exponential decay. In pseudocode:
decay_constant=0.999; // Some suitable value smaller than 1
total=0;
for t=1,...
// Exponential decay
total=total*decay_constant;
// Add in latest sample
total+=current_sample;
if total>threshold
// do something
end
end
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
You might try calculating two simple "statistics" - first would be spread (max-min). Silence will have very low spread. Second would be variety - divide the range of possible values into say 16 brackets (= value range) and as you go through the elements, determine in which bracket that element goes. Noise will have similar numbers for all brackets, whereas music or speech should prefer some of them while neglecting others.
This should be possible to do in just one pass through the array and you do not need complicated arithmetics, just some addition and comparison of values.
Also consider some approximation, for example take only each fourth value, thus reducing the number of checked elements to 80. For audio signal, this should be okay.
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change in the cube of the running average over about 120ms. When there is silence (only noise that is) the expression should be hovering around zero. As soon as the rate starts increasing over a couple of runs, you probably have some action going on.
rate = cur_avg^3 - prev_avg^3
I used a cube because the square just wasn't agressive enough. If the cube is to slow for you, try using the square and a bitshift instead. Hope this helps.
It is clearly that the complexity should be at least O(n). Probably some simple algorithms that calculate some value range are good for the moment but I would look for Voice Activity Detection on web and for related code samples.

Resources