Best way to handle time consuming queries in InfluxDB - database

We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in a cache.
The most time-consuming part of the query is probably the comparisons and averages we compute, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?

In cases like this, where the user selects a range in days, I would suggest a separate table that stores the result (min, max and avg) for each day as a document. A job can populate this table after the end of each day.
You could also change the granularity from one document per day to one per week or per month, depending on how you plot the values. You can also add more fields, such as tagname in your case.
Why this is better than a cache: a cache only stores the result of one specific query, so every new combination of parameters still has to be computed in real time. Here, the pre-aggregated results are already available, and any remaining computation runs over a much smaller dataset.
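A minimal sketch of such an end-of-day job in Python, assuming raw points arrive as (timestamp, tagname, value) tuples. The names `Rollup` and `daily_rollups` are hypothetical and not part of any InfluxDB API. It stores the sum and count rather than the average itself, so that an average over a longer range can still be derived from the daily rows.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Rollup:
    """One pre-aggregated row per (day, tagname)."""
    minimum: float
    maximum: float
    total: float  # sum of values; the mean is total / count
    count: int


def daily_rollups(points):
    """Aggregate raw (timestamp, tagname, value) points into per-day rows."""
    rows = {}
    for ts, tag, value in points:
        key = (ts.date(), tag)
        row = rows.get(key)
        if row is None:
            rows[key] = Rollup(value, value, value, 1)
        else:
            row.minimum = min(row.minimum, value)
            row.maximum = max(row.maximum, value)
            row.total += value
            row.count += 1
    return rows


# Made-up sample data for illustration only.
points = [
    (datetime(2024, 1, 1, 9, 0), "a", 2.0),
    (datetime(2024, 1, 1, 17, 0), "a", 6.0),
    (datetime(2024, 1, 2, 9, 0), "a", 10.0),
]
rollups = daily_rollups(points)
print(rollups[(date(2024, 1, 1), "a")])
```

A report over many weeks then only has to read one small row per day and tag instead of every raw point.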

Based on your query, I assume you are using InfluxDB v1.x. You could try Continuous Queries (CQs), which are InfluxQL queries that run automatically and periodically on real-time data and store their results in a specified measurement.
In your case, you could create a CQ for each report and let your users query its output.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' -- note that the time filter is not here
GROUP BY time(1h), "tagname" -- time(1h) sets the aggregation interval
END
Step 2: Query the measurement that the CQ writes to
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.
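One caveat if the report needs a single figure per tag for the whole range: the min and max of the hourly rows combine directly, but a plain average of hourly means is only exact when every hour holds the same number of points. A minimal Python sketch, assuming the CQ is extended to also store count("value") AS "count" (an addition that is not in the query above); `combine_buckets` is a hypothetical helper.

```python
def combine_buckets(buckets):
    """Merge (mean, min, max, count) bucket rows into one summary for the range."""
    total_count = sum(b["count"] for b in buckets)
    # Weight each bucket's mean by its point count to recover the true mean.
    weighted_sum = sum(b["mean"] * b["count"] for b in buckets)
    return {
        "mean": weighted_sum / total_count,
        "min": min(b["min"] for b in buckets),
        "max": max(b["max"] for b in buckets),
        "count": total_count,
    }


# Made-up hourly rows: one busy hour and one quiet hour.
hourly = [
    {"mean": 10.0, "min": 8.0, "max": 12.0, "count": 90},
    {"mean": 20.0, "min": 15.0, "max": 30.0, "count": 10},
]
summary = combine_buckets(hourly)
naive_mean = sum(b["mean"] for b in hourly) / len(hourly)
print(summary["mean"], naive_mean)
```

With these sample rows the weighted mean and the naive mean of means differ noticeably, which is why keeping the count alongside the mean is worth the extra field.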

Related

Anylogic: How to create plot from database table?

In my AnyLogic model I successfully create plots of datasets that count the number of trucks arriving from terminals each hour in my simulation. Now I want to add the actual (observed) number of trucks arriving at a terminal, to compare my simulation to these numbers. I added these numbers to a database table (see picture below). Is there a simple way of adding this data to the plot?
I tried it by creating a variable that reads the database table for every hour and adding that to a dataset (like can be seen in the pictures below), but this did not work unfortunately (the plot was empty).
Maybe simply delete the variable and fill the dataset at the start of the model by looping through the database table. Use the database query wizard to create the for-loop. Something like this should work:
int numEntries = (int) selectFrom(observed_arrivals).count();
DataSet myDataSet = new DataSet(numEntries);
List<Tuple> rows = selectFrom(observed_arrivals).list();
for (Tuple row : rows) {
    myDataSet.add(row.get( observed_arrivals.hour ), row.get( observed_arrivals.terminal_a ));
}
myChart.addDataSet(myDataSet);
You don't explain why it "didn't work" (what errors/problems did you get?), nor where you defined these elements.
(1) Since you want both observed (empirical) and simulated arrivals per terminal, datasets for each should be in the Terminal agent. And then the replicated plot (in Main) can have two data entries referring to data sets terminals(index).observedArrivals and terminals(index).simulatedArrivals or whatever you name them.
(2) Using getHourOfDay to add to the observed dataset is wrong because that just returns 0-23 (i.e., the hour in the current day for the current model date). Your database table looks like it has hours since model start, so you just want time(HOUR) to get the model time in elapsed hours (irrespective of what the model time unit is). Or possibly time(HOUR) - 1 if you only want to update the empirical arrivals for the hour at the end of that hour (i.e., at the same time that you updated the simulated arrivals).
(3) Using a Variable to get the database value each hour doesn't work because a variable's initial value is only evaluated once at model initialisation. You want an hourly cyclic Event in Terminal instead which adds the relevant row's value. (You need to use the Insert Database Query wizard to generate the relevant Java code for the query you need in the event's action.)
(4) Because you have a database table with specifically-named columns for each terminal (columns terminal_a and presumably terminal_b etc.) that makes it slightly more awkward. (This isn't proper relational table design where, instead of 4 columns for the 4 terminals, you'd instead have two columns for terminal_id and observed_value with a row for each time period and terminal combination.)
So your database query expression (in your Terminal agents) will need to use the SQL format (not the QueryDSL format) so that you can 'stitch in' the correct column name into the SQL.

Extract data by day from SQL Server

I need to get all the values from a SQL Server database by day (24 hours). I have a timestamp column in the TestAllData table and I want to select the data that corresponds to a specific day.
For instance, there are timestamps of DateTime type like '2019-03-19 12:26:03.002', '2019-03-19 17:31:09.024' and '2019-04-10 14:45:12.015', so I want to load the data for the day 2019-03-19 and separately for the day 2019-04-10. Basically, I need to get the DateTime values that share the same date.
Is it possible to use functions like DatePart or DateDiff for that?
And how can I solve this kind of problem in general?
As in this case, I do not know the exact difference in hours between a timestamp and the end of the day (because there are various timestamps for 1 day) and I need to extract the day itself from the timestamp. After that, I need to group the data by days or something like this and get block by block. For example:
'2019-03-19' - 1200 records
'2019-04-10' - 3500 records
'2019-05-12' - 10000 records and so on
I'm looking for a more generic solution not supplying a timestamp (like '2019-03-19') as a boundary or in a where clause because the problem is not about simply filtering the data by some date!!
UPDATE: In my dataset, I have about 1,000,000 records and more than 100 unique dates. I was thinking about extracting the set of unique dates and then kind of run a query in the loop where the data would be filtered by the provided day. It would look in such a way:
select * from TestAllData where dayColumn = '2019-03-19'
select * from TestAllData where dayColumn = '2019-04-10'
select * from TestAllData where dayColumn = '2019-05-12'
...
I might use this query in my code and run it in a loop from a Scala function. However, I am not sure whether running a separate query to extract the unique dates would be acceptable in terms of performance.
Depending on whether you want to be able to work with all the dates (rather than just a subset), one of the easiest ways to achieve this is with a cast:
;with cte as (SELECT cast(my_datetime as date) as my_date, * from TestAllData)
SELECT * FROM cte where my_date = '2019-02-14'
Note that when casting datetime to date, the time is truncated, i.e. only the date part is kept.
As I said, though, whether this is efficient depends on your needs, because the datetime values of all records are cast to date before the data is filtered. If you want to select several dates (as opposed to just one or two), however, it may prove quicker overall, as it reads the whole table once and then gives you a column on which you can filter much more efficiently.
If this is a permanent requirement, though, I would probably use a persisted computed column, which effectively would mean that the casting is done once initially and then only again if the corresponding value changed. For a large table I would also strongly consider an index on the computed column.
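If the rows end up in application code anyway, the per-day blocks can also be built in one pass over a single result set instead of running one query per date. A minimal Python sketch, assuming each row is a (timestamp, payload) tuple already fetched from TestAllData; `group_by_day` is a hypothetical helper, and the same idea carries over to a groupBy on a Scala collection. It assumes the result set fits in memory.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(rows):
    """Bucket (timestamp, payload) rows by the calendar date of the timestamp."""
    blocks = defaultdict(list)
    for ts, payload in rows:
        blocks[ts.date()].append((ts, payload))
    return dict(blocks)


# Sample rows using the timestamps from the question; payloads are made up.
rows = [
    (datetime(2019, 3, 19, 12, 26, 3, 2000), "r1"),
    (datetime(2019, 3, 19, 17, 31, 9, 24000), "r2"),
    (datetime(2019, 4, 10, 14, 45, 12, 15000), "r3"),
]
for day, block in sorted(group_by_day(rows).items()):
    print(day, len(block))
```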

Flink DataStream - execute SQL query on a window, do orderBy

I'm simulating a streaming task using Flink DataStream and I want to execute an SQL query on each window.
Let's say this is the query
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time translating it to Flink. From my understanding, to calculate the average I need to do it manually using .apply() and a WindowFunction. But how do I calculate the sum then? Also manually in the same WindowFunction?
I'm also wondering if it is possible to do order by on the whole window?
Below is the pseudocode of what I thought of so far. Any help would be appreciated! Thanks!
employeesStream
    .filter(new FilterFunction() ....) // where clause
    .keyBy(nameIndex, ageIndex) // group by??
    .timeWindow(Time.seconds(10), Time.seconds(1))
    .apply(new WindowFunction() ....) // calculate average (and sum?)
    // order by??
I checked the Table API, but it seems that not many operations are supported for streaming, e.g. orderBy.
Ordering in streaming is not trivial. How do you want to sort something that is never ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness to start sorting. Thanks to event-time and watermarks, it is possible to sort a stream if you know that you have seen all values until a certain time (aka watermarks).
Event-time sort has been introduced recently and will be part of Flink 1.4 Table API. See here for an example.
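On the sum and average part of the question: both can share one accumulator, so they can be computed in the same window function. The following is a plain-Python simulation of a single window's contents, not Flink API code; `aggregate_window` and the record layout (name, age, days, salary) are assumptions for illustration. Because a closed window is finite, its grouped results can be sorted before they are emitted. In Flink itself, keyed windows fire per key, so sorting across keys would likely need a further non-keyed step over the same window, which is not shown here.

```python
def aggregate_window(records):
    """WHERE age > 25, GROUP BY (name, age), SUM(days), AVG(salary), ORDER BY name, age."""
    acc = {}  # (name, age) -> [sum_days, sum_salary, count]
    for name, age, days, salary in records:
        if age <= 25:
            continue  # where clause
        state = acc.setdefault((name, age), [0, 0.0, 0])
        state[0] += days
        state[1] += salary
        state[2] += 1
    # Sorting is possible here because the window is complete and finite.
    return [
        (name, age, sum_days, sum_salary / count)
        for (name, age), (sum_days, sum_salary, count) in sorted(acc.items())
    ]


# Made-up contents of one window.
window = [
    ("bob", 40, 3, 100.0),
    ("ann", 30, 5, 200.0),
    ("ann", 30, 2, 300.0),
    ("cy", 20, 9, 50.0),
]
print(aggregate_window(window))
```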

Tableau – Using Nested Aggregations to Establish a Weekday/Hour Baseline

Background Information: We have an incident time tracker that tracks how long each user spends with a representative before the issue can be closed. We want to determine the average volume of incidents that are being handled for each hour. To say this in another way: We want to get an hourly baseline for each day of the week that will show us the average total call length within the specific time period. Eg: We want to average the total length of every call on Monday from 9AM-10AM for all the weeks in the database, and the same for other hourly intervals.
The simplest way to think of this is that I want AVG(SUM) for the specific time periods, but Tableau does not allow me to do this.
Tableau Output:
This is the desired, target visualization that I am looking for from Tableau.
SQL Query:
I have written a SQL query that returns the answer:
We are looking at two columns: start_time (time stamp) and interval_seconds(float)
In the inner query I use the hour_start function which truncates the date/time value to the hour start, so I can group by the hour and day of the week in the outer query.
SQL Results:
Question:
Is there a way to solve this problem ENTIRELY in Tableau that would get me the result that I am looking for without having to write any SQL code?
Files Stored on Drive
CSV File:
https://drive.google.com/open?id=0B4nMLxIVTDc7NEtqWlpHdVozRXc
Tableau Worksheet:
https://drive.google.com/open?id=0B4nMLxIVTDc7M3A4Q0JxbGdlTE0
You can use Level of Detail expressions to compute the SUM(interval_seconds) at the hour level and then use AVG to calculate the number you are looking for.
I created a couple of calculations:
hour, defined as DATETRUNC('hour',[start_time]), which should be equivalent to your hour_start(start_time);
interval_hours, defined as {FIXED [hour] : SUM([interval_seconds])/3600 }, which calculates the aggregate for each start_time truncated to the hour.
After this, you simply calculate AVG(interval_hours) and use it in your view.
I put a workbook in dropbox: https://www.dropbox.com/s/3hfvz8w529g9f46/Interval%20Time%20Baseline.twbx?dl=0
Although the chart looks similar to yours, the numbers I came up with are somewhat different from the "SQL Results" you show. Was the data you provided slightly different?
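To make the nested aggregation concrete, here is the AVG(SUM) the question asks for as a plain-Python sketch: first total interval_seconds per hour bucket (the role of the FIXED expression), then average those totals per weekday and hour of day, counting each hour bucket once. The sample rows and the helper name `weekday_hour_baseline` are made up for illustration, and this is not a statement about how Tableau evaluates the calculation internally.

```python
from collections import defaultdict
from datetime import datetime


def weekday_hour_baseline(calls):
    """Return {(weekday, hour): average of per-hour total hours}; weekday 0 is Monday."""
    # Inner step: total hours per calendar hour (start_time truncated to the hour).
    per_hour = defaultdict(float)
    for start_time, interval_seconds in calls:
        bucket = start_time.replace(minute=0, second=0, microsecond=0)
        per_hour[bucket] += interval_seconds / 3600

    # Outer step: average the hourly totals across weeks for each weekday/hour slot.
    grouped = defaultdict(list)
    for bucket, hours in per_hour.items():
        grouped[(bucket.weekday(), bucket.hour)].append(hours)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Two Mondays (2 and 9 January 2017), both in the 9AM-10AM slot.
calls = [
    (datetime(2017, 1, 2, 9, 5), 1800.0),
    (datetime(2017, 1, 2, 9, 40), 1800.0),
    (datetime(2017, 1, 9, 9, 15), 7200.0),
]
baseline = weekday_hour_baseline(calls)
print(baseline)
```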

strange appengine query result

What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any record and I am sure there are records meeting the where clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is it normal? Furthermore, if I modify this query by removing the category filter, I am getting the correct results. This is definitely weird.
I also asked on google groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about creating the timestamps that go into the query. What code are you using to create them? Apparently that's important, because tweaking the value slightly affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing POSIX timestamps with microsecond precision, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
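To illustrate the point about dates being represented as midnight: an upper bound built from a bare date excludes everything later on that day, so a half-open range ending at the next midnight is usually safer than `<=`. A minimal Python sketch, assuming UTC timestamps; `to_micros` and `day_range` are hypothetical helpers, not datastore API calls, and whether this is what happens in the query above depends on code that has not been shown.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Microseconds since 1970-01-01 UTC, using integer arithmetic."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def day_range(first_day, last_day):
    """Half-open [start, end) bounds that cover both calendar days in full."""
    start = datetime(first_day.year, first_day.month, first_day.day, tzinfo=timezone.utc)
    end = datetime(last_day.year, last_day.month, last_day.day, tzinfo=timezone.utc)
    return to_micros(start), to_micros(end + timedelta(days=1))


start, end = day_range(date(2012, 6, 6), date(2012, 6, 11))
event = to_micros(datetime(2012, 6, 11, 15, 30, tzinfo=timezone.utc))
midnight_11 = to_micros(datetime(2012, 6, 11, tzinfo=timezone.utc))

print(start <= event < end)   # the half-open range includes the afternoon event
print(event <= midnight_11)   # a "<= 11 June" bound taken at midnight does not
```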
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
"if I modify this query by removing the category filter, I am getting the correct results"
this probably means that the category property was not indexed at the time you wrote the matching records to the datastore. You have to rewrite those records to the datastore if you want them added to the newly created index.
