I have two tasks, T1 and T2. T1 runs every day at 7:00 and is the predecessor of T2.
I am able to see the task execution history for T1; however, the history for T2 doesn't show up in the results.
Is there a way I can track whether the second task executed or not?
Best,
Vijay Prakash
Update:
Here is the query I used to query task execution status:
select *
from table(information_schema.task_history(
    result_limit => 10,
    task_name => 'T2'));
Can you give us a hint of how you are attempting to track tasks?
You should be able to track tasks and more using combinations of the queries below.
SELECT *
FROM TABLE(information_schema.task_history(RESULT_LIMIT => 10000));
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY_BY_WAREHOUSE(WAREHOUSE_NAME => 'WH_FOR_TASKS', RESULT_LIMIT => 10000));
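For example, a sketch that correlates the two via QUERY_ID to see the underlying query for each task run (column names taken from the Snowflake documentation, so verify against your account):
-- join each task run to the query it executed
SELECT t.name, t.state, t.scheduled_time, q.total_elapsed_time
FROM TABLE(information_schema.task_history(RESULT_LIMIT => 10000)) t
JOIN TABLE(information_schema.QUERY_HISTORY_BY_WAREHOUSE(WAREHOUSE_NAME => 'WH_FOR_TASKS', RESULT_LIMIT => 10000)) q
  ON t.query_id = q.query_id;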
Thanks.
The root cause of the above issue was that the order of starting the tasks was incorrect: I had resumed the first task (T1) before the second one (T2), which is why the second task was never executed.
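For anyone hitting the same thing, the order that works is to resume the child task before the root task, e.g.:
-- resume the child first, then the root
alter task T2 resume;
alter task T1 resume;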
Related
I run the following query in Snowflake to see which queries are candidates for inefficiency improvements:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history())
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
However, I frequently see the seconds a query took to run change from day to day. How can I modify this query in Snowflake to get all the historical runs from task history and average their seconds to get a fuller picture with less variance?
The documentation says it pulls the last 7 days, but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks runs every 12 hours). I'd like to get the average seconds each task took over the last 30 days and sort by that.
Task history
RESULT_LIMIT => integer
A number specifying the maximum number of rows returned by the function.
Default: 100.
The function is returning only the default 100 rows, which is why you see only about 2 days of history. To get more rows, RESULT_LIMIT should be set explicitly:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history(RESULT_LIMIT => 10000))
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
For a longer range, such as your 30 days, ACCOUNT_USAGE.TASK_HISTORY provides data for the last 365 days:
SELECT datediff(second,scheduled_time,query_start_time) as second, *
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
ORDER BY datediff(second,scheduled_time,query_start_time) DESC;
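To get the 30-day averages you asked for, a sketch along these lines should work (the NAME column identifies the task in ACCOUNT_USAGE.TASK_HISTORY):
-- average delay per task over the last 30 days, slowest first
SELECT name,
       AVG(datediff(second, scheduled_time, query_start_time)) AS avg_seconds
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
  AND scheduled_time >= dateadd(day, -30, current_timestamp())
GROUP BY name
ORDER BY avg_seconds DESC;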
We have an API that queries an InfluxDB database, and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in a cache.
The most time-consuming part of the query is probably the comparisons and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In cases like this where you allow the user to select a range in days, I would suggest having another table that stores the result (min, max, and avg) of each day as a document. This table can be populated by a job that runs after the end of each day.
You can also consider changing the granularity from one document per day to per week or per month, based on how you plot the values, and you can add more fields, such as tagname in your case.
The reason this is superior to using a cache: a cache only stores the result of a specific query, so every different combination still has to be computed in real time. With pre-aggregated documents, the cumulative results are already available, leaving a much smaller dataset to compute over.
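As a rough sketch, the daily job could run something like this in InfluxQL (the "daily_stats" target measurement is an assumed name):
-- roll up the previous day's raw points into one document per day and tagname
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "daily_stats"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' AND time >= now() - 1d
GROUP BY time(1d), "tagname"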
Based on your query, I assume you are using InfluxDB v1.x. You could try Continuous Queries, which are InfluxQL queries that run automatically and periodically on real-time data and store the query results in a specified measurement.
In your case, for each report, you could generate a CQ and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
  SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
  INTO "mean_min_max"
  FROM $MEASUREMENT
  WHERE "field" = 'myFieldValue' -- note that the time filter is not here
  GROUP BY time(1h), "tagname" -- here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.
I have a use case where I have 2 input topics in Kafka.
Topic schema:
eventName, ingestion_time (will be used as watermark), orderType, orderCountry
Data for first topic:
{"eventName": "orderCreated", "userId":123, "ingestionTime": "1665042169543", "orderType":"ecommerce","orderCountry": "UK"}
Data for second topic:
{"eventName": "orderSucess", "userId":123, "ingestionTime": "1665042189543", "orderType":"ecommerce","orderCountry": "USA"}
I want to get all userIds per orderType and orderCountry where the user performs the first event but not the second one within a window of 5 minutes, for a maximum of 2 events per user per orderType and orderCountry (i.e. up to 10 minutes only).
I have unioned both topics' data and created a view on top of it, and I am trying to use Flink CEP SQL to get my output, but somehow I am not able to figure it out.
SELECT *
FROM union_event_table
MATCH_RECOGNIZE(
PARTITION BY orderType,orderCountry
ORDER BY ingestion_time
MEASURES
A.userId AS userId,
A.orderType AS orderType,
A.orderCountry AS orderCountry
ONE ROW PER MATCH
PATTERN (A not followed B) WITHIN INTERVAL '5' MINUTES
DEFINE
A As A.eventName = 'orderCreated'
B AS B.eventName = 'orderSucess'
)
The first thing I am not able to figure out is what to use in place of A not followed B in SQL. The other thing is how to restrict the output for a userId to a maximum of 2 events per orderType and orderCountry: if a user doesn't perform the 2nd event after the 1st event in 2 consecutive 5-minute windows, the state of that user should be removed, so that I don't get output for that user for the same orderType and orderCountry again.
I don't believe this is possible using MATCH_RECOGNIZE. This could, however, be implemented with the DataStream CEP library by using its capability to send timed out patterns to a side output.
This could also be solved at a lower level by using a KeyedProcessFunction. The long ride alerts exercise from the Apache Flink Training repo is an example of that -- you can jump straight away to the solution if you want.
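If you want to stay in SQL, a different technique than CEP may also work: a left outer interval join that emits orderCreated events with no matching orderSucess within 5 minutes. This is only a sketch against the unioned view from the question, and it does not implement the 2-events-per-user state cleanup, for which the approaches above are better suited:
-- emit created orders with no success event within 5 minutes (interval join sketch)
SELECT c.userId, c.orderType, c.orderCountry
FROM (SELECT * FROM union_event_table WHERE eventName = 'orderCreated') c
LEFT JOIN (SELECT * FROM union_event_table WHERE eventName = 'orderSucess') s
  ON c.userId = s.userId
 AND c.orderType = s.orderType
 AND c.orderCountry = s.orderCountry
 AND s.ingestion_time BETWEEN c.ingestion_time AND c.ingestion_time + INTERVAL '5' MINUTE
WHERE s.userId IS NULL;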
I have 2 SQL queries doing the same thing; the first query takes 13 seconds to execute while the second takes 1 second. Any reason why?
Not all the IDs in ProcessMessages will necessarily have data in ProcessMessageDetails.
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Id = 4 and Isdone = 0 )
I have a clustered index on t1.ProcessMessageId (PK) and a non-clustered index on t2.ProcessMessageId (FK).
I would need the actual execution plans to tell you exactly what SQL Server is doing behind the scenes, but I can tell you these queries aren't doing the exact same thing.
The first query is going through and finding all of the items that meet the conditions for t1 and finding all of the items for t2 and then finding which ones match and joining them together.
The second one says: first find all of the items that meet my criteria from t1, and then find the items in t2 that have one of those IDs.
Depending on your statistics, available indexes, hardware, and table sizes, SQL Server may decide to do different types of scans or seeks to pick data for each part of the query, and it may also decide to join the data together in different ways.
The answer to your question is really simple: the first query generates more rows than the second query, so it takes more time to search through them. That is the reason your first query took 13 seconds and the second one only 1 second.
So it is generally suggested that you apply your conditions before making your join; otherwise the number of rows will increase and you will need more time to search through them once joined.
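For illustration, a sketch of the same join with the filter pushed into a derived table first (same tables and columns as in the question):
-- filter ProcessMessages first, then join only the matching rows
SELECT *
FROM (SELECT ProcessMessageId
      FROM dbo.ProcessMessages
      WHERE Id = 4 AND Isdone = 0) t1
JOIN dbo.ProcessMessageDetails t2
  ON t1.ProcessMessageId = t2.ProcessMessageId;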
Is there a way to speed up a query with a few INTERSECT operations? The query looks something like this:
SELECT TOP(2000) vk_key FROM (
    SELECT vk_key FROM mat_property_det
    WHERE [type] = 'mp'
      AND (ISNULL(mat_property_det.value_max, ISNULL(mat_property_det.value_min, 9999)) <= 980
           OR ISNULL(mat_property_det.value, 9999) <= 980)
      AND mat_property_det.property_id = 6
    INTERSECT
    SELECT vk_key FROM search_advance_mat
    WHERE 1=1 AND (search_advance_mat.group_id = 101)
    INTERSECT
    SELECT vk_key FROM V_search_advance_subgroup_en
    WHERE CONTAINS((Subgroup_desc, Comment, [Application], HeatTreatment), ' "plates*"')
) T
We don't know in advance how many intersections we will have, and we can't replace the INTERSECT with e.g. an inner join, because the query is generated by the application according to the user's search parameters.
Here is an execution plan:
Any help or advice would be appreciated!
Help/Advice: Focus on the Table Spool (Lazy Spool) in your execution plan, as it is consuming 36% of the query effort. You should be able to hover the mouse over that area of the plan and discover the table/view/function/etc. involved.
As MSDN states for lazy spools, the spool operator gets a row from its input operator and stores it in the spool, rather than consuming all rows at once, which means every row is being scanned, one at a time.
Do everything you can to remove the Table Spool and/or improve the performance in this specific area of the query/execution plan and the entire query will benefit.
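One hedged example of the kind of change that can eliminate a lazy spool (a sketch only; whether it helps depends on what your plan is actually spooling) is to materialize the most expensive leg into an indexed temp table before intersecting, using the same tables as above:
-- materialize the filtered property rows once, instead of letting the plan spool them
SELECT vk_key
INTO #mp_keys
FROM mat_property_det
WHERE [type] = 'mp'
  AND (ISNULL(value_max, ISNULL(value_min, 9999)) <= 980 OR ISNULL(value, 9999) <= 980)
  AND property_id = 6;

CREATE CLUSTERED INDEX IX_mp_keys ON #mp_keys (vk_key);

-- then run the original intersections against the temp table
SELECT TOP(2000) vk_key FROM (
    SELECT vk_key FROM #mp_keys
    INTERSECT
    SELECT vk_key FROM search_advance_mat WHERE group_id = 101
    INTERSECT
    SELECT vk_key FROM V_search_advance_subgroup_en
    WHERE CONTAINS((Subgroup_desc, Comment, [Application], HeatTreatment), ' "plates*"')
) T;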