OK, this is pretty simple, but I'm drawing a blank and can't even think of the right combination of words to search for the answer.
I have a T-SQL table with start and end times, a task, and a new/repeat flag.
I want to pull the average duration between start and end, both when the record is new and when it is a repeat. I'll be grouping on the task.
My result would look like Task - NewDurationAverage - RepeatDurationAverage.
Cheers in advance.
Your query should be something like this:
-- AVG is needed around DATEDIFF because each subquery groups by TaskId;
-- TaskId is qualified in the outer SELECT because both subqueries expose it.
SELECT NewTasks.TaskId, NewDurationAverage, RepeatDurationAverage FROM
(SELECT TaskId, AVG(DATEDIFF(hh, TaskStart, TaskEnd)) as NewDurationAverage
FROM Task WHERE IsNew=1 GROUP BY TaskId) NewTasks
LEFT OUTER JOIN
(SELECT TaskId, AVG(DATEDIFF(hh, TaskStart, TaskEnd)) as RepeatDurationAverage
FROM Task WHERE IsRepeat=1 GROUP BY TaskId) RepeatTasks
ON NewTasks.TaskId=RepeatTasks.TaskId
You need to follow the steps below:
Find the difference between the start and end date/time columns, for example with the DATEDIFF function
Perform the AVG on the calculated value
Convert the result to whatever format you need
Depending on your needs, DATEDIFF can return the time difference in the unit you choose (days, minutes, nanoseconds, etc.), so you have to decide how precise the results should be. DATEDIFF returns an integer, so a smaller unit loses less precision.
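The steps above can also be done in a single pass with conditional aggregation. Here is a minimal, runnable sketch using Python's built-in sqlite3 module rather than SQL Server. It assumes a single IsNew flag column (1 = new, 0 = repeat), which is a guess at the schema. SQLite has no DATEDIFF, so it uses strftime('%s', ...) to get seconds instead. In T-SQL the same shape would presumably be AVG(CASE WHEN IsNew = 1 THEN DATEDIFF(minute, TaskStart, TaskEnd) END).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one IsNew flag instead of separate IsNew/IsRepeat columns.
conn.execute(
    "CREATE TABLE Task (TaskId TEXT, TaskStart TEXT, TaskEnd TEXT, IsNew INTEGER)"
)
rows = [
    ("A", "2019-03-19 09:00:00", "2019-03-19 09:30:00", 1),  # new, 30 min
    ("A", "2019-03-19 10:00:00", "2019-03-19 10:50:00", 1),  # new, 50 min
    ("A", "2019-03-19 11:00:00", "2019-03-19 11:10:00", 0),  # repeat, 10 min
    ("B", "2019-03-19 09:00:00", "2019-03-19 10:00:00", 0),  # repeat, 60 min
]
conn.executemany("INSERT INTO Task VALUES (?, ?, ?, ?)", rows)

# CASE yields NULL for rows of the other kind, and AVG ignores NULLs,
# so each column averages only its own kind of record.
query = """
SELECT TaskId,
       AVG(CASE WHEN IsNew = 1 THEN minutes END) AS NewDurationAverage,
       AVG(CASE WHEN IsNew = 0 THEN minutes END) AS RepeatDurationAverage
FROM (
    SELECT TaskId, IsNew,
           (strftime('%s', TaskEnd) - strftime('%s', TaskStart)) / 60.0 AS minutes
    FROM Task
)
GROUP BY TaskId
ORDER BY TaskId
"""
result = conn.execute(query).fetchall()
for task_id, new_avg, repeat_avg in result:
    print(task_id, new_avg, repeat_avg)
```

Unlike the join of two subqueries, this keeps tasks that only have repeat records. The missing average comes back as NULL (None in Python).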
We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in cache.
The most time-consuming part of the query is probably comparison and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In cases like this, where the user selects a range in days, I would suggest having another table that stores the result (min, max and avg) for each day as a document. A job can populate this table after the end of each day.
You can also consider changing the document from per day to per week or per month, depending on how you plot the values. You can also add more fields, such as tagname in your case.
Why this is superior to using a cache: a cache stores the result of one specific query, so every new combination of dates still has to be computed in real time. Here, the cumulative results are already available, and the dataset left to compute over is much smaller.
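One detail worth sketching: if each daily document stores only the average, the averages of several days cannot be combined exactly, because days with more points should weigh more. The plain-Python sketch below therefore keeps a count and a sum per day and derives the mean when the report is built. The function names and the (timestamp, tag, value) point shape are made up for illustration and are not part of any InfluxDB API.

```python
from datetime import date, datetime


def rollup_daily(points):
    """Collapse (timestamp, tag, value) points into one summary per (day, tag)."""
    daily = {}
    for ts, tag, value in points:
        key = (ts.date(), tag)
        summary = daily.get(key)
        if summary is None:
            daily[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            summary["count"] += 1
            summary["sum"] += value
            summary["min"] = min(summary["min"], value)
            summary["max"] = max(summary["max"], value)
    return daily


def report(daily, start, end):
    """Combine daily summaries for start <= day < end into per-tag mean/min/max."""
    merged = {}
    for (day, tag), s in daily.items():
        if not (start <= day < end):
            continue
        m = merged.setdefault(
            tag, {"count": 0, "sum": 0.0, "min": s["min"], "max": s["max"]}
        )
        m["count"] += s["count"]
        m["sum"] += s["sum"]
        m["min"] = min(m["min"], s["min"])
        m["max"] = max(m["max"], s["max"])
    return {
        tag: {"mean": m["sum"] / m["count"], "min": m["min"], "max": m["max"]}
        for tag, m in merged.items()
    }


points = [
    (datetime(2019, 3, 19, 12, 0), "a", 1.0),
    (datetime(2019, 3, 19, 17, 0), "a", 2.0),
    (datetime(2019, 3, 19, 18, 0), "a", 3.0),
    (datetime(2019, 3, 20, 9, 0), "a", 10.0),
]
daily = rollup_daily(points)
summary = report(daily, date(2019, 3, 19), date(2019, 3, 21))
print(summary)
```

With this data the true mean over both days is 4.0, whereas averaging the two daily means (2.0 and 10.0) would give 6.0.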
Based on your query, I assume you are using InfluxDB v1.x. You could try Continuous Queries (CQs), which are InfluxQL queries that run automatically and periodically on real-time data and store the results in a specified measurement.
In your case, you could create a CQ for each report and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' -- note that the time filter is not here
GROUP BY time(1h), "tagname" -- here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.
I need to get all the values from a SQL Server database by day (24 hours). I have a timestamp column in the TestAllData table and I want to select only the data that corresponds to a specific day.
For instance, there are timestamps of DateTime type like '2019-03-19 12:26:03.002', '2019-03-19 17:31:09.024' and '2019-04-10 14:45:12.015', so I want to load the data for the day 2019-03-19 and separately for the day 2019-04-10. Basically, I need to get the DateTime values that share the same date.
Is it possible to use functions like DatePart or DateDiff for that?
And how can I solve this kind of problem in general?
In this case, I do not know the exact difference in hours between a timestamp and the end of the day (because there are various timestamps for one day), so I need to extract the day itself from the timestamp. After that, I need to group the data by day, or something like that, and get it block by block. For example:
'2019-03-19' - 1200 records
'2019-04-10' - 3500 records
'2019-05-12' - 10000 records and so on
I'm looking for a more generic solution that does not supply a date (like '2019-03-19') as a boundary or in a where clause, because the problem is not about simply filtering the data by some date!
UPDATE: In my dataset, I have about 1,000,000 records and more than 100 unique dates. I was thinking about extracting the set of unique dates and then running a query in a loop, where the data would be filtered by the provided day. It would look like this:
select * from TestAllData where dayColumn = '2019-03-19'
select * from TestAllData where dayColumn = '2019-04-10'
select * from TestAllData where dayColumn = '2019-05-12'
...
I might use this query in my code, so I may run it in a loop from a Scala function. However, I am not sure whether running a separate query to extract the unique dates would be acceptable in terms of performance.
Depending on whether you want to be able to work with all the dates (rather than just a subset), one of the easiest ways to achieve this is with a cast:
;with cte as (SELECT cast(my_datetime as date) as my_date, * from TestAllData)
SELECT * FROM cte where my_date = '2019-02-14'
Note that when casting datetime to date, times are truncated, i.e. just the date part is extracted.
As I say, though, whether this is efficient depends on your needs, as all datetime values from all records will be cast to date before the data is filtered. If you want to select several dates (as opposed to just one or two), however, it may prove quicker overall, as it reads the whole table once and then gives you a column on which you can filter much more efficiently.
If this is a permanent requirement, though, I would probably use a persisted computed column, which effectively would mean that the casting is done once initially and then only again if the corresponding value changed. For a large table I would also strongly consider an index on the computed column.
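For the "block by block" part of the question, one alternative to running a query per date is to read the rows once, ordered by timestamp, and split them into per-day blocks on the client. Below is a minimal Python sketch of that idea. The rows are hard-coded stand-ins for a result set, and blocks_by_day is a hypothetical helper. It assumes the rows arrive sorted by timestamp (for example from an ORDER BY), because itertools.groupby only groups adjacent items.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical rows as (timestamp, payload), standing in for the result of a
# single query over TestAllData that is ordered by the timestamp column.
rows = [
    (datetime(2019, 3, 19, 12, 26, 3, 2000), "r1"),
    (datetime(2019, 3, 19, 17, 31, 9, 24000), "r2"),
    (datetime(2019, 4, 10, 14, 45, 12, 15000), "r3"),
]


def blocks_by_day(ordered_rows):
    """Yield (day, rows_for_that_day) from rows already sorted by timestamp."""
    for day, group in groupby(ordered_rows, key=lambda row: row[0].date()):
        yield day, list(group)


counts = {day.isoformat(): len(block) for day, block in blocks_by_day(rows)}
print(counts)
```

This scans the data once and holds one day's block in memory at a time, so it avoids both the separate unique-dates query and the loop of filtered queries.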
I need to find expired credit cards in a table. The fields expire_year and expire_month are integer values.
I was thinking something like this could work:
select *
from CREDITCARD
where CURRENT_TIMESTAMP > DATEFROMPARTS(EXPIRE_YEAR, EXPIRE_MONTH, 1);
The problem with this is that the definition of expired would be the first day of the next month. Therefore I need to find a way to write EXPIRE_MONTH + 1. But this is also no good, as the month might be December, in which case I'd be looking for month number 13. In such cases, I'd need to bump the EXPIRE_YEAR instead, and set EXPIRE_MONTH to 1.
I've tried to google the solution, but my issue seems a bit too specific. In Java this would be easy enough to solve, but my SQL knowledge is limited to fairly simple queries.
Something like this (DATEADD handles the rollover from December to January of the next year for you):
SELECT DATEADD(month, 1, DATEFROMPARTS(EXPIRE_YEAR, EXPIRE_MONTH, 1)) FROM CREDITCARD
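To make the rollover explicit, here is the same rule written out by hand in Python. It is only a sketch of the logic behind the expression above, with hypothetical function names. It treats a card as expired from the first day of the month after its expiry month.

```python
from datetime import date


def first_day_after_expiry(expire_year, expire_month):
    """First day of the month following the expiry month."""
    if expire_month == 12:
        # December rolls over into January of the next year.
        return date(expire_year + 1, 1, 1)
    return date(expire_year, expire_month + 1, 1)


def is_expired(expire_year, expire_month, today):
    """A card is valid through the last day of its expiry month."""
    return today >= first_day_after_expiry(expire_year, expire_month)


print(first_day_after_expiry(2019, 12))
print(is_expired(2019, 3, date(2019, 3, 31)))
print(is_expired(2019, 3, date(2019, 4, 1)))
```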
I'm simulating a streaming task using Flink DataStream and I want to execute a SQL query on each window.
Let's say this is the query:
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time translating it to Flink. From my understanding, to calculate the average I need to do it manually using .apply() and a WindowFunction. But how do I calculate the sum then? Also manually in the same WindowFunction?
I'm also wondering whether it is possible to do an order by on the whole window.
Below is the pseudocode of what I thought of so far. Any help would be appreciated! Thanks!
employeesStream
.filter(new FilterFunction() ....) // where clause
.keyBy(nameIndex, ageIndex) // group by??
.timeWindow(Time.seconds(10), Time.seconds(1))
.apply(new WindowFunction() ....) // calculate average (and sum?)
// order by??
I checked the Table API, but it seems that not many operations are supported for streaming, e.g. orderBy.
Ordering in streaming is not trivial. How do you want to sort something that is never ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness to start sorting. Thanks to event-time and watermarks, it is possible to sort a stream if you know that you have seen all values until a certain time (aka watermarks).
Event-time sort has been introduced recently and will be part of Flink 1.4 Table API. See here for an example.
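To illustrate the point about completeness: once a window has closed, its contents are a finite batch, so the sum and the average can be accumulated together in one pass and the grouped results sorted afterwards. The sketch below is plain Python, not the Flink API. aggregate_window is a hypothetical function that receives the records of one already-closed window as (name, age, days, salary) tuples.

```python
from collections import defaultdict


def aggregate_window(records, min_age=25):
    """Filter, group by (name, age), compute sum(days) and avg(salary), then sort."""
    # key -> [sum_days, sum_salary, count]
    acc = defaultdict(lambda: [0, 0.0, 0])
    for name, age, days, salary in records:
        if age > min_age:  # WHERE age > 25
            entry = acc[(name, age)]  # GROUP BY name, age
            entry[0] += days
            entry[1] += salary
            entry[2] += 1
    # Sorting is possible here because the window is complete and finite.
    return [
        (name, age, sum_days, sum_salary / count)
        for (name, age), (sum_days, sum_salary, count) in sorted(acc.items())
    ]


window = [
    ("bob", 30, 5, 100.0),
    ("alice", 40, 2, 200.0),
    ("bob", 30, 1, 300.0),
    ("carl", 20, 9, 50.0),  # filtered out by the age condition
]
result = aggregate_window(window)
print(result)
```

The same accumulator carries both aggregates, which is the shape a single window function would presumably take: one pass over the window's elements, one output row per key.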
Background Information: We have an incident time tracker that tracks how long each user spends with a representative before the issue can be closed. We want to determine the average volume of incidents that are being handled for each hour. To say this in another way: We want to get an hourly baseline for each day of the week that will show us the average total call length within the specific time period. Eg: We want to average the total length of every call on Monday from 9AM-10AM for all the weeks in the database, and the same for other hourly intervals.
The simplest way to think of this is that I want AVG(SUM) for the specific time periods, but Tableau does not allow me to do this.
Tableau Output:
This is the desired, target visualization that I am looking for from Tableau.
SQL Query:
I have written a SQL query that returns the answer:
We are looking at two columns: start_time (timestamp) and interval_seconds (float).
In the inner query I use the hour_start function which truncates the date/time value to the hour start, so I can group by the hour and day of the week in the outer query.
SQL Results:
Question:
Is there a way to solve this problem ENTIRELY in Tableau that would get me the result that I am looking for without having to write any SQL code?
Files Stored on Drive
CSV File:
https://drive.google.com/open?id=0B4nMLxIVTDc7NEtqWlpHdVozRXc
Tableau Worksheet:
https://drive.google.com/open?id=0B4nMLxIVTDc7M3A4Q0JxbGdlTE0
You can use Level of Detail expressions to compute the SUM(interval_seconds) at the hour level and then use AVG to calculate the number you are looking for.
I created a couple of calculations:
hour which is defined as: DATETRUNC('hour',[start_time])
this should be equivalent to your hour_start(start_time).
and interval_hours which is defined as {FIXED [hour] : SUM([interval_seconds])/3600 }
This calculates the aggregate for each start_time truncated to the hour.
After this, you simply calculate AVG(interval_hours) and use it in your view.
I put a workbook in dropbox: https://www.dropbox.com/s/3hfvz8w529g9f46/Interval%20Time%20Baseline.twbx?dl=0
Although the chart looks similar to yours, the numbers I came up with are somewhat different from the "SQL Results" you show. Was the data you provided slightly different?
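For anyone who wants to cross-check the numbers outside Tableau, the two-level calculation described above (a SUM per truncated hour, then an AVG of those sums per weekday and hour of day) can be sketched in a few lines of Python. hourly_baseline is a hypothetical helper. It assumes the input is a list of (start_time, interval_seconds) pairs like the two columns in the question, and it uses weekday numbers where 0 is Monday.

```python
from collections import defaultdict
from datetime import datetime


def hourly_baseline(calls):
    """Average, per (weekday, hour of day), of the total call hours in each hour."""
    # Step 1: total call time per truncated hour, like the FIXED [hour] calculation.
    per_hour = defaultdict(float)
    for start_time, interval_seconds in calls:
        hour = start_time.replace(minute=0, second=0, microsecond=0)
        per_hour[hour] += interval_seconds / 3600

    # Step 2: average those hourly totals across all weeks.
    buckets = defaultdict(list)
    for hour, total in per_hour.items():
        buckets[(hour.weekday(), hour.hour)].append(total)
    return {key: sum(totals) / len(totals) for key, totals in buckets.items()}


calls = [
    (datetime(2019, 3, 18, 9, 5), 1800.0),   # Monday, 0.5 h
    (datetime(2019, 3, 18, 9, 40), 1800.0),  # same Monday hour, 0.5 h
    (datetime(2019, 3, 25, 9, 10), 10800.0), # following Monday, 3 h
]
baseline = hourly_baseline(calls)
print(baseline)
```

Note that hours with no calls at all contribute nothing to the average here, which may be one source of small differences between two implementations.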