Can OpenTSDB perform 'clever' queries?

I really need to know whether the OpenTSDB database can perform certain clever queries. There are not many examples shared.
For example, I need the average of one dataset's values, with zero values excluded.
Right now I have something like this:
http://localhost:4242/api/query?start=1480892400&end=1483657140&m=sum:32d-avg:site.availability{siteId=73}&arrays=true&ms
This query takes start and end timestamps, a 32-day downsample with 'avg' as the downsampling function, 'sum' as the aggregator, the metric name, a tag, and the response format flags.
My time series data looks like:
[
[1483142484722, 210],
[1483142548883, 203],
[1483142609002, 0]
]
Etc...
This query returns a single value as expected: the average of all values in my dataset. I need a query that returns the average of all values except '0' values; I don't want zero values to affect my calculations.
Can we do something like that in OpenTSDB?

Related

Best way to handle time consuming queries in InfluxDB

We have an API that queries an Influx database, and a report feature was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in the cache.
The most time-consuming part of the query is probably the comparisons and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In cases like this, where you allow the user to select by day, I would suggest having another table that stores the result (min, max, and avg) for each day as a document. This table can be populated by a job that runs after the end of each day.
You can also consider changing the granularity from per day to per week or per month, based on how you plot the values, and add more fields, such as tagname in your case.
The reason this is superior to a cache: a cache only stores the result of a specific query, so every different combination still has to be computed in real time. With pre-aggregation, the cumulative results are already available, leaving a much smaller dataset to compute over.
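A minimal sketch of such a rollup job in InfluxQL, assuming InfluxDB 1.x and hypothetical names for the source measurement (my_measurement) and the target (daily_rollup):
-- Run once per day (e.g. from cron) to materialize the last day's aggregates.
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "daily_rollup"
FROM "my_measurement"
WHERE "field" = 'myFieldValue' AND time >= now() - 1d
GROUP BY time(1d), "tagname"
Report queries then read from daily_rollup, which holds one point per day per tag instead of the raw series.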
Based on your query, I assume you are using InfluxDB 1.x. You could try Continuous Queries (CQs), which are InfluxQL queries that run automatically and periodically on real-time data and store the results in a specified measurement.
In your case, for each report, you could generate a CQ and let your users query it.
e.g.:
Step 1: Create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' // note that the time filter is not here
GROUP BY time(1h), "tagname" // here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since InfluxDB runs these aggregates continuously at the specified interval, you are effectively trading space for time.

Google Data Studio : how to obtain a SUM related to a COUNT_DISTINCT?

I have a dataset including 3 columns :
ID transac (The unique ID of the transaction - Dimension)
Source (The source of the transaction - Dimension)
Amount € (The amount of the transaction - Stat)
[screenshot of my dataset]
To count the number of transactions (for one or more sources), I use the COUNT_DISTINCT function.
I want to sum the transaction amounts (for one or more sources), but I don't want to add up the amounts of transactions that share the same ID!
Is there a way to do this calculation with a Data Studio function?
Thanks for your answers. :-)
EDIT: I saw that this type of calculation can be done via SQL, and I would like to do it in Data Studio (so that I don't have to pre-calculate the amounts per source).
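For reference, a minimal sketch of that SQL approach (the table and column names transactions, id, source, and amount are hypothetical):
-- Deduplicate by transaction ID first, then sum per source.
SELECT source, SUM(amount) AS total_amount
FROM (SELECT DISTINCT id, source, amount FROM transactions) AS deduplicated
GROUP BY source;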
IMO, your dataset contains wrong data. Each value should be relative only to its own line, but this is not the case: if the total is 20, each line should describe that line's contribution to the total. With 4 sources, each line should be 5, or some other values that sum to 20.
To solve this in Data Studio, you would need something like the CALCULATE function in Power BI, but Data Studio doesn't currently support such a feature.
But there are some options to consider to repair your data:
If you're sure there are always 4 sources, just create a new calculated field with the expression Amount/4 and SUM it. It is not an elegant solution, but it works.
If your data source is Google Sheets, you can easily repair the data using formulas, like in this example:
Link to spreadsheet
For this spreadsheet, I used this formula in the adjusted_amount column: =C2/COUNTIF(A:A,A2). With this column in Data Studio, just use the usual SUM aggregation function to summarize it correctly.

Extract data by day from SQL Server

I need to get all the values from a SQL Server database by day (24 hours). I have a timestamps column in the TestAllData table, and I want to select only the data that corresponds to a specific day.
For instance, there are timestamps of DateTime type like '2019-03-19 12:26:03.002', '2019-03-19 17:31:09.024' and '2019-04-10 14:45:12.015', so I want to load the data for the day 2019-03-19 and, separately, for the day 2019-04-10. Basically, I need to get the DateTime values that share the same date.
Is it possible to use functions like DatePart or DateDiff for this?
And how can I solve this problem overall?
In this case, I do not know the exact difference in hours between a timestamp and the end of the day (because there are various timestamps within one day), so I need to extract the day itself from the timestamp. After that, I need to group the data by day and retrieve it block by block. For example:
'2019-03-19' - 1200 records
'2019-04-10' - 3500 records
'2019-05-12' - 10000 records and so on
I'm looking for a more generic solution that does not supply a timestamp (like '2019-03-19') as a boundary or in a WHERE clause, because the problem is not about simply filtering the data by some date!
UPDATE: In my dataset, I have about 1,000,000 records and more than 100 unique dates. I was thinking about extracting the set of unique dates and then running a query in a loop, filtering the data by each day in turn. It would look like this:
select * from TestAllData where dayColumn = '2019-03-19'
select * from TestAllData where dayColumn = '2019-04-10'
select * from TestAllData where dayColumn = '2019-05-12'
...
I might use these queries in my code, running them in a loop from a Scala function. However, I am not sure whether it would be acceptable, performance-wise, to run a separate query to extract the unique dates.
Depending on whether you want to be able to work with all the dates (rather than just a subset), one of the easiest ways to achieve this is with a cast:
;with cte as (SELECT cast(my_datetime as date) as my_date, * from TestAllData)
SELECT * FROM cte where my_date = '2019-02-14'
Note that when casting datetime to date, times are truncated, i.e. just the date part is extracted.
As I say, though, whether this is efficient depends on your needs, as all datetime values from all records will be cast to date before the data is filtered. If you want to select several dates (as opposed to just one or two), however, it may prove quicker overall, as it reads the whole table once and then gives you a column on which you can filter much more efficiently.
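Building on the same cast, a sketch of getting the per-day record counts the question asks for (same hypothetical my_datetime column as above):
-- One pass over the table: one row per calendar day with its record count.
;with cte as (SELECT cast(my_datetime as date) as my_date, * from TestAllData)
SELECT my_date, COUNT(*) AS records
FROM cte
GROUP BY my_date
ORDER BY my_date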
If this is a permanent requirement, though, I would probably use a persisted computed column, which effectively would mean that the casting is done once initially and then only again if the corresponding value changed. For a large table I would also strongly consider an index on the computed column.
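A sketch of that approach, again assuming the timestamp column is named my_datetime:
-- Persisted computed column: the cast is stored and maintained by the engine.
ALTER TABLE TestAllData ADD dayColumn AS CAST(my_datetime AS date) PERSISTED;
-- Index it so per-day filters and groupings become seeks rather than scans.
CREATE INDEX IX_TestAllData_dayColumn ON TestAllData (dayColumn);
After that, queries like the ones in the question (select * from TestAllData where dayColumn = '2019-03-19') can use the index directly.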

Generating Working Hours using SQL Server Query

I have this data and I need to generate a query that will give the output below
You can do this kind of grouping of rows with 2 separate row_number()s: one over all the data, ordered by date, and a second one partitioned by code and ordered by date. To separate the groups from the data, use the difference between these 2 row_number()s; when it changes, it's a new block of data. You can then use that difference in a GROUP BY and take the minimum / maximum dates for each block.
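A minimal sketch of the technique, with hypothetical table and column names (WorkSlots, code, dt), since the original data was only shown as an image:
-- rn_all numbers every row by date; rn_code numbers rows within each code.
-- Their difference is constant within a consecutive run of the same code,
-- so it identifies each block.
;with numbered as (
    SELECT code, dt,
           row_number() over (order by dt) as rn_all,
           row_number() over (partition by code order by dt) as rn_code
    FROM WorkSlots
)
SELECT code, min(dt) as block_start, max(dt) as block_end
FROM numbered
GROUP BY code, rn_all - rn_code
ORDER BY block_start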
For the final layout you can use PIVOT or SUM + CASE; most likely you will want a new row_number to get the rows aligned properly. Depending on whether you can have missing / non-matching data, you will probably need additional checks.

Sum field by date range

I'm using Solr 3.6 and I'm rather stuck trying to perform a special query.
I'm using facets by date range, with facet.date.gap set to +1DAY. Of course, the facet returns the count of docs in each date range, but I also need the sum of a particular field over the same ranges used in the facet. Essentially, I need to count how many votes I have daily, weekly, monthly, whatever... it depends on the gap parameter.
Any ideas? Should I use the group.query or facet.query?
One suggestion I have is to treat the weeks and days separately and index them; for example, today is part of week 24. Another suggestion is not to rule out multiple searches to service one request: one to calculate all the other facets, and one to return counts for the given date range (based on the search results from the first query).
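If the day (or week) is indexed as its own field, the StatsComponent available in Solr 3.x may also be worth a look; a hedged sketch, where votes and day are hypothetical field names:
http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=votes&stats.facet=day
This returns the sum (along with count, min, max, etc.) of votes broken down by each value of day, which covers the sum-per-bucket part without a second application-side pass.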
