Calculated field for Percentage difference in data studio - google-data-studio

I need to get the percentage difference between different time periods for the same raw data coming from Analytics. I'm unable to do so because the metrics from the data source don't contain any time period, and in order to create a calculated field I'm supposed to use the metrics from the data source. How should I go about creating the percentage difference in this scenario?
Feel free to ask follow-up questions.

If I understand you correctly, you can't create a calculated metric to achieve what you are after. The only built-in option is "comparison to period", but I'm guessing from what you've described that it isn't suitable either.
The only way to achieve this would be to organise the data in your data source into something like:
Date | metric value | metric value to compare to
But I appreciate this may not be as flexible as you'd like.
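For instance, once the comparison value sits on the same row, a calculated field roughly like the following (the field names are hypothetical) would give the percentage difference:
(metric_value - metric_value_to_compare_to) / metric_value_to_compare_to * 100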

Related

Does Apache Superset support Weighted Averages?

I'm trying to use Apache Superset to create a dashboard that will display the average rate of X/Y at different entities such that the time grain can be changed on the fly. However, all I have available as raw data is daily totals of X and Y for the entities in question.
It would be simple to do if I could just get a line chart that displayed sum(X)/sum(Y) as its own metric, where the sum range would change with the time grain, but that doesn't seem to be supported.
Creating a function in SQLAlchemy that calculates the daily rates and then uses that as the raw data is also an insufficient solution, since taking the average of that over different time ranges would not be properly weighted.
Is there a workaround I'm not seeing?
Is there a way to use Druid or some other tool to make displaying a quotient over a variable range possible?
My current best solution is to just set up different charts for each time grain size (day, month, quarter, year), but that's extremely inelegant and I'm hoping to do better.
There are multiple ways to do this. One is using the Metric editor as shown below; in this case the metric definition is stored as part of the chart.
Another way is to define a metric in the "datasource editor", where the metric is stored with the datasource definition and becomes reusable for any chart using that datasource, as shown here.
Side note: depending on the database you use, you may have to CAST from, say, an integer to a numeric type as I did in the example, or multiply by 100, in order to get a useful result.
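As a rough illustration, assuming the daily totals live in columns named x_total and y_total (hypothetical names), the metric expression in either editor would be an ordinary SQL aggregate such as:
SUM(CAST(x_total AS NUMERIC)) / SUM(CAST(y_total AS NUMERIC))
Because both sums are re-evaluated for whatever time grain the chart uses, the ratio stays correctly weighted instead of being an average of daily rates.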

How to make this query using Prometheus?

I'm really new to Prometheus, and for the moment I want to run some test queries to become a bit more familiar with it.
So the query container_last_seen[10s] returns an array:
container_last_seen{container_label_com_docker_compose_config_hash="dc8a2ab1347ad16ab37ff0ad03f3a00f86b381ea2d85d45a11367331526c3640",container_label_com_docker_compose_container_number="1",container_label_com_docker_compose_oneoff="False",container_label_com_docker_compose_project="dockprom",container_label_com_docker_compose_service="cadvisor",container_label_com_docker_compose_version="1.10.0",container_label_org_label_schema_group="monitoring",id="/docker/2b448d19a33b50411941a55435b03f5a4af19e3b3e9581054a67e4da3363ef19",image="google/cadvisor:v0.24.1",instance="cadvisor:8080",job="cadvisor",name="cadvisor"}
And I want to get only the name attribute.
So my idea was to do something like this:
container_last_seen[10s][name]
But I get a parse error. So how can I write this query?
It may seem a little counterintuitive for this purpose, but the aggregation operators allow reducing labels with the by and without clauses.
sum by(name) (container_last_seen{..criteria..})
should get you closer to what you want by returning series with only the name label.
I think you want to go a little further though: you don't want values and you don't want the object part, you just want strings. Unfortunately Prometheus deals with numeric metrics that can have labels, and specifically not string metrics.
While it requires additional software, it is officially recommended by Prometheus, so I will mention it here as it gets you very close to what I believe is your desired solution:
If you were to chart that query in Grafana, either with all the labels or just the name label, the legend format {{name}} should get you exactly what you want. Grafana also provides label_values to help with filtering for this purpose.
Lastly, if this is not the right direction for you, the ELK/EFK stack may be a better fit for string-heavy metrics. There are projects like prometheus-es-exporter that can report the results of Elasticsearch queries as metrics.
This is not possible, as labels like 'name' are separate from the metric value. You should look at the JSON that the query and query_range endpoints return to see how this is exposed.
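For illustration, an abridged (and hypothetical) response from the /api/v1/query endpoint looks roughly like this; the labels sit in the metric object, separate from the numeric value:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": { "name": "cadvisor", "instance": "cadvisor:8080", "job": "cadvisor" },
        "value": [ 1435781451.781, "1.6e+09" ]
      }
    ]
  }
}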

Which kind of DBs calculate rate per minute statistics?

I have a use case where I want to design a hashtag ranking system: the 10 most popular hashtags should be selected. My idea is to store something like this:
[hashtag, rateofhitsperminute, rateofhitsper5minutes]
Then I will query to find the 10 most popular #hashtags, i.e. those whose rate of hits per minute is highest.
My question is: what sort of database can I use to provide statistics like 'rateofhitsperminute'?
What is a good way to calculate such a value and store it in a DB? Do some DBs offer these features?
First of all, "rate of hits per minute" is calculated as:
[hits during period]/[length of period]
So the rate will vary depending on how long the period is. (The last minute? The last 10 minutes? Since the hits started being recorded? Since the hashtag was first used?)
So what you really want to store is the count of hits, not the rate. It is better to either:
Store the hashtags and their hit counts during a certain period (less memory/cpu required but less flexible)
OR the timestamp and hashtag of each hit (more memory/cpu required but more flexible)
Now it is a matter of selecting the time period of interest, and querying the database to find the top 10 hashtags with the most hits during that period.
If you need to display the rate, use the formula above, but notice it does not change the order of the top hashtags because the period is the same for every hashtag.
You can apply the algorithm above to almost any DB. You can even do it without using a database (just use a programming language's builtin hashmap).
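As a minimal sketch of the no-database approach, counting hits in memory with Python's built-in structures might look like this (the hit data is made up):
from collections import Counter

# each hit is recorded as (hashtag, unix timestamp)
hits = [("java", 1417689045), ("python", 1417689046), ("java", 1417689050)]

# period of interest, e.g. the last minute
period_start, period_end = 1417689000, 1417689060

# count hits per hashtag within the period
counts = Counter(tag for tag, ts in hits if period_start <= ts < period_end)

# top 10 hashtags by hit count; divide by the period length if a rate is needed
print(counts.most_common(10))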
If performance is a concern and there will be many different hashtags, I suggest using an OLAP database. OLAP databases are specially designed for top-k queries (over a certain time period) like this.
Having said that, here is an example of how to accomplish your use case in Solr: Solr as an Analytics Platform. Solr is not an OLAP database, but this example uses Solr like an OLAP DB and seems to be the easiest to implement and adapt to your use case:
Your Solr schema would look like:
<fields>
  <field name="hashtag" type="string"/>
  <field name="hit_date" type="date"/>
</fields>
An example document would be:
{
  "hashtag": "java",
  "hit_date": "2012-12-04T10:30:45Z"
}
A query you could use would be:
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hashtag&facet.mincount=1&facet.limit=10&facet.range=hit_date&facet.range.end=2013-01-01T00:00:00Z&facet.range.start=2012-01-01T00:00:00Z
Finally, here are some advanced resources related to this question:
Similar question: Implementing twitter and facebook like hashtags
What is the best way to compute trending topics or tags? An interesting idea I got from these answers is to use the derivative of the hit counts over time to calculate the "instantaneous" hit rate.
HyperLogLog can be used to estimate the hit counts if an approximate calculation is acceptable.
Look into Sliding-Window Top-K if you want to get really academic on this topic.
No database has rate-per-minute statistics just built in, but in any modern database you could quite easily calculate rate per minute or any other derived values you need.
Your question is like asking which kind of car can drive from New York to LA: no car can drive itself or refuel itself along the way (I should be careful with this analogy, because I guess cars are almost doing this now!), but you could drive any car you like from New York to LA. Some will be more comfortable, some more fuel efficient and some faster than others, but you're going to have to do the driving and refueling.
You can use InfluxDB. It's well suited for your use case, since it was created to handle time series data (for example "hits per minute").
In your case, every time there is a hit, you could send a record containing the name of the hashtag and a timestamp.
The data is queryable, and there are already tools that can help you process or visualize it (like Grafana).
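A minimal sketch with the influxdb Python client, assuming a database named hashtags and a measurement named hashtag_hits (all names hypothetical):
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="hashtags")

# write one point per hit: the hashtag as a tag, the hit time as the timestamp
client.write_points([{
    "measurement": "hashtag_hits",
    "tags": {"hashtag": "java"},
    "time": "2015-12-04T10:30:45Z",
    "fields": {"value": 1},
}])
An InfluxQL query along the lines of SELECT COUNT("value") FROM hashtag_hits WHERE time > now() - 1m GROUP BY "hashtag" would then give per-minute hit counts.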
If you are happy with a large data set, you could store and calculate this information yourself.
I believe Mongo is fairly fast when it comes to index-based queries, so you could structure something like this.
Every time a tag is "hit" or accessed, you store this information as a row:
[Tag][Timestamp]
Storing it in this fashion allows you, first of all, to run simple group, count and sort operations, which gives you your first requirement: calculating the 10 most popular tags.
With the information in this format you can then perform further queries by tag and timestamp to count the number of hits for a specific tag between times X and Y, which gives you your hits per period.
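A sketch of that aggregation with pymongo, assuming documents of the form {"tag": ..., "ts": ...} in a collection named tag_hits (names are made up):
from datetime import datetime
from pymongo import MongoClient

coll = MongoClient()["analytics"]["tag_hits"]

# top 10 tags by hit count between times X and Y
pipeline = [
    {"$match": {"ts": {"$gte": datetime(2015, 1, 1), "$lt": datetime(2015, 1, 2)}}},
    {"$group": {"_id": "$tag", "hits": {"$sum": 1}}},
    {"$sort": {"hits": -1}},
    {"$limit": 10},
]
print(list(coll.aggregate(pipeline)))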
Benefits of doing it this way:
High information granularity depending on time frames supplied via query
These queries are rather fast in mongoDB or similar databases even on large data sets
Negatives of doing it this way:
You have to store many rows of data
You have to perform queries to retrieve the information you need rather than returning a single data row

Using an array of strings that have the same meaning as a "lookup_value" in the MATCH() excel function

I have a large table of market values with rows labeled with each asset name and each column representing each month between 2000 and 2014.
The table I have is currently blank, and I want to use an index/match function to look up the data corresponding to each date/asset combination in data that was submitted to me. My problem is that this submitted data is slightly inconsistent in how it names assets at different points in time. One year may have an asset called Goldman Sachs Strategic and another year may label the same asset as GS Strategic Income.
I would like to allow the user of the spreadsheet to record names that are actually equivalent to each other. On a sheet called "equivalent names" I would like cell A1 to be Goldman Sachs Strategic and B1 to be GS Strategic Income. I was hoping there was a way to make A1:B1 an array that could then be used as a lookup value within my index/match function.
I realize this probably is not the way to approach this problem, but I have a very limited dictionary of solutions because I have limited experience coding and using Excel. I was hoping someone could point me in the direction of a solution that would actually work, because I am assuming inconsistent data is a problem many people have dealt with before. Thanks a lot for any help that you can offer!
I created some random data to test with.
I then created the following table on the second worksheet.
The last 5 columns (2010, 2011, 2012, 2013, 2014) represent the year-to-year names. Using these names (which may or may not be blank), we simply use a series of SUMIF() functions added together.
Because you didn't provide specific data, I understand that this answer might not fully fit your question, so if there is anything that is incorrect, let me know. Otherwise, I hope this helps.
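As a rough illustration (all references are hypothetical), if the alternative names for the asset in a given row sit in columns F:H of the lookup sheet, and the submitted data has asset names in column A and values in column B, the lookup cell would add one SUMIF per possible name:
=SUMIF(Submitted!$A:$A, $F2, Submitted!$B:$B) + SUMIF(Submitted!$A:$A, $G2, Submitted!$B:$B) + SUMIF(Submitted!$A:$A, $H2, Submitted!$B:$B)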

Storing History in a Database

With regard to storing history within a database, is it better to use a DateEnd (Ex. 1) or a Duration (Ex. 2)?
Or feel free to suggest another approach that would be more effective.
Are there other changes that I should make to one of these examples if one proves to be the correct approach? The DB being used is MySQL, although I don't think that has a bearing on the approach here.
There are two perspectives on this one - firstly, what's the business domain? In your example, you've used "subscription" - these are often sold as "monthly", "weekly", etc. All other things being equal, I prefer my database to align with business concepts when possible. You might even go so far as to create a "subscription_type" table, and derive the duration of the subscription from the type.
That often clashes with the need for your database to perform. From that point of view, I'd work out what the most common queries are going to be, and see if you can make your database design work with the minimal amount of type conversion or calculation possible. Finding all records where the subscription expires on a given date, for instance, is a lot easier (and probably faster) if you can ask for dateEnd < targetDate, rather than calculating the date by adding the duration to the start date.
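To make the contrast concrete, here is a hedged MySQL sketch (the column names are assumptions, since the examples aren't reproduced here):
-- end date stored: a straightforward, index-friendly comparison
SELECT * FROM subscription WHERE dateEnd = '2015-01-31';

-- only duration stored: every row needs a calculation before it can be compared
SELECT * FROM subscription
WHERE DATE_ADD(dateStart, INTERVAL durationDays DAY) = '2015-01-31';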
If you have start date and end date, you can always (or should always be able to) compute duration. If you have start date and duration, you can always (or should always be able to) compute end date.
You can also record all three and enforce a row constraint to the effect that they cannot "mismatch".
However, one very frequent kind of usage of the "end date" datum is to filter out rows that are "not current": something like WHERE END_DATE > CURRENTSYSTEMDATE(). If you have that kind of usage, then it is probably not advisable to "leave out" the end date.
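If all three are recorded, the mismatch constraint could be sketched like this (MySQL enforces CHECK constraints only from 8.0.16 onward, and the column names are assumptions):
ALTER TABLE subscription
  ADD CONSTRAINT chk_subscription_dates
  CHECK (dateEnd = DATE_ADD(dateStart, INTERVAL durationDays DAY));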
I would store the end date as opposed to the duration. You can calculate the duration when needed. It seems to make more sense to store measurement points rather than measurements.
Derived values such as duration don't need to be stored in the database.
Your approach to capturing the history of a row will work in some cases, but to truly catch all changes to all rows you will likely need another table, something like subscription_hist, updated by a trigger on insert and update of the subscription. Then you would have to keep track only of last_modified_date and last_person_changed_by; everything else could be derived. This way you could see what happened to the row over time. I don't have any pictures to clarify, but this method, if implemented carefully, would allow for full point-in-time recovery of data, as well as maintaining historical information.
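A minimal sketch of such a trigger in MySQL, with entirely assumed table and column names:
-- history table mirroring the columns to audit, plus the tracking columns
CREATE TABLE subscription_hist (
  subscription_id        INT,
  dateStart              DATE,
  dateEnd                DATE,
  last_modified_date     DATETIME,
  last_person_changed_by VARCHAR(100)
);

-- copy the previous state of the row whenever it is updated
CREATE TRIGGER subscription_audit
AFTER UPDATE ON subscription
FOR EACH ROW
  INSERT INTO subscription_hist (subscription_id, dateStart, dateEnd, last_modified_date, last_person_changed_by)
  VALUES (OLD.id, OLD.dateStart, OLD.dateEnd, NOW(), CURRENT_USER());
A similar AFTER INSERT trigger (using NEW instead of OLD) would capture newly created rows.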
