external function timeout and retry behavior - snowflake-cloud-data-platform

Viewing the Azure Function logs, an external function appears to retry a query after ~30 seconds if no response is received. Queries that complete in less than 30 seconds are returned to Snowflake successfully.
The Azure Function is set up with a consumption plan, and the default timeout is 5 minutes.
Any plans, thoughts, or opinions about changing this behavior? For example:
making the retry interval configurable
backing off to allow additional time (maybe also configurable)
Things I appreciate:
documented guidance - including sending less data, and keeping the function and Snowflake in the same region - check
this is a preview feature, so a change may already be afoot
there is a balance between performance and actually getting the data
Shout out to the function documentation folks, nice work.
Cheers,

Related

Best way for running long Python scripts on GCP

We are starting a new project in our company where we basically run a few Python scripts for each client, twice a day.
So the idea is: twice a day a Cloud Function will be triggered, and the function will trigger the Python script for each client, creating new instances of App Engine / Cloud Run or any other serverless service Google offers.
At the beginning we thought of using Cloud Functions, but very quickly we found out they are not suited for long-running Python scripts. The scripts will eventually calculate and collect different information for each client and write it to Firebase.
The flow of the process would be: Cloud Function triggered -> function triggers a GCP instance for each client -> script runs for each client -> output is saved to Firebase.
What would be the recommended way to do this without a dedicated server, and which GCP serverless services would fit best?
There are a lot of great answers! The key here is to decouple and to distribute the processing.
For decoupling you can use Cloud Tasks (where you can add flow control with rate limits, or postpone a task into the future) or Pub/Sub (a simpler message queueing solution).
And Cloud Run is required to run processing that takes up to 15 minutes. But you will have to fine-tune it (see my tips below).
So, to summarize the process:
You have to trigger a Cloud Function twice a day. You can use Cloud Scheduler for that.
The triggered Cloud Function gets the list of clients (from a database?) and, for each client, creates a task in Cloud Tasks (or a message in Pub/Sub).
Each task (or message) calls an HTTP endpoint on Cloud Run that performs the processing for that client. Set the timeout to 30 minutes on Cloud Run.
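As a rough illustration of the fan-out step (my sketch, not part of the original answer), here is how the Cloud Function might create one Cloud Task per client with the google-cloud-tasks client; the project, region, queue name, Cloud Run URL, and get_client_ids() helper are all placeholders.
```python
# Sketch: a Cloud Function (triggered by Cloud Scheduler) that enqueues one
# Cloud Task per client. All identifiers below are placeholders.
import json
from google.cloud import tasks_v2

def fan_out(request):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "europe-west1", "client-processing")

    for client_id in get_client_ids():  # hypothetical: read the client list from your database
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://my-processor-xyz.a.run.app/process",  # Cloud Run endpoint
                "headers": {"Content-Type": "application/json"},
                "body": json.dumps({"client_id": client_id}).encode(),
            }
        }
        client.create_task(request={"parent": parent, "task": task})
    return "enqueued", 200
```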
However, if your processing is compute intensive, you have to tune Cloud Run. If the processing takes 15 minutes for 1 client on 1 vCPU, that means you can't process more than 1 client per CPU if you don't want to hit the timeout (2 clients on the same CPU can take about 30 minutes in total, and you can hit the timeout). For that, I recommend you set the concurrency parameter of Cloud Run to 1, to process only one request at a time (of course, if you set 2 or 4 CPUs on Cloud Run you can also increase the concurrency parameter to 2 or 4 to allow parallel processing on the same instance, but on different CPUs).
If the processing is not CPU intensive (you perform API calls and wait for the answers) it's harder to say. Try a concurrency of 5, 10, 30, ... and observe the behaviour/latency of the processed requests. No worries: with Cloud Tasks and Pub/Sub you can set retry policies in case of timeout.
Last thing: is your processing idempotent? I mean, if you run the same process twice for the same client, is the result still correct, or is it a problem? Try to make the solution idempotent to overcome retry issues and, more generally, the issues that can happen in distributed computing (including replays).
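To make that last point concrete (my illustration, not the answerer's): when the output is written to Firebase, deriving the document ID deterministically from the client means a retried or replayed task overwrites the same document instead of creating a duplicate. A sketch, assuming Firestore via the google-cloud-firestore client, with placeholder collection and field names:
```python
# Idempotent write sketch: the document ID is derived from the client and the
# run date, so replaying the same task simply overwrites the same document.
from google.cloud import firestore

def save_result(client_id: str, run_date: str, payload: dict) -> None:
    db = firestore.Client()
    doc_id = f"{client_id}_{run_date}"  # deterministic ID per client per run
    db.collection("client_results").document(doc_id).set(payload)  # set() overwrites rather than appends
```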
#NoCommandLine's answer is a good recommendation, and Cloud Run is also a good option if you want longer-running operations, as the timeout can be set between 5 minutes (the default) and 60 minutes. You can set or update the request timeout through the Cloud Console, the command line, or YAML.
Meanwhile, the execution time for a Cloud Function is only 1 minute by default and can be set to a maximum of 9 minutes.
You can check out the full documentation below:
Requesting Timeout for Cloud Run
Requesting Timeout for Cloud Function
You can also check a related SO question through this link.
You can execute "long" running Google App Engine (GAE) Tasks using Cloud Tasks.
How long (which is why I have it in quotes) depends on the kind of scaling that you are using for your GAE Project Instance. Instances which are set to 'automatic scaling' are limited to a maximum of 10 minutes while instances which are set to 'manual' or 'basic' have up to 24 hours execution time.
From the earlier link
....all workers must send an HTTP response code (200-299) to the Cloud
Tasks service, in this instance before a deadline based on the
instance scaling type of the service: 10 minutes for automatic scaling
or up to 24 hours for manual scaling. If a different response is sent,
or no response, the task is retried....
Adding an update (there seems to be some confusion between 30 mins vs 24 hours):
Standard HTTP Requests have a maximum execution time of 30 minutes (source) while GAE Endpoints can run for up to 24 hours if you're using manual scaling (source)
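To illustrate the quoted behaviour (my sketch, not the answerer's code): the task worker just has to finish its work and return a 2xx status before the deadline for its scaling type; anything else, or no response, triggers a retry. A minimal Flask handler, with the route and the work function made up for the example:
```python
# Sketch of a Cloud Tasks worker on GAE. Cloud Tasks treats the task as done
# only if this handler returns a 2xx before the deadline (10 minutes for
# automatic scaling, up to 24 hours for manual/basic scaling); otherwise it retries.
from flask import Flask, request

app = Flask(__name__)

@app.route("/tasks/process-client", methods=["POST"])
def process_client():
    payload = request.get_json(silent=True) or {}
    do_long_running_work(payload.get("client_id"))  # hypothetical long-running job
    return "", 200  # 2xx = success, no retry
```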

How can I check if the system time of the DB server is correct?

I have got a bug case from the service desk which was the result of different system times on the application server (JBoss) and the DB (Oracle) server. As a result, the timeouts were misleading.
It doesn't happen often, but for the future it would be better if the app server could raise an alarm about a bad time on the DB server before it results in deeper problems.
Of course, I can simply run
SELECT CURRENT_TIMESTAMP FROM dual
and compare it against the local time. But it is likely that sending the query and getting its result will take a noticeable amount of time, and I may classify a good time as a bad one, or vice versa.
I can also measure the time from sending the query to the return of the result. But that only works correctly on a good network without lag, and if the time on the DB server is wrong, it is quite likely that the network around the DB server is not OK either. Queues on the DB server can make the send and receive times noticeably unequal.
What is the best way you know to check the time on the DB server?
Constraints: precision of 5 seconds,
false alarms < 10%.
To be optimized (minimized): missed alarms.
Maybe I am reinventing the wheel and JBoss and/or Oracle already have a tool for this? (I could not find one.)
Have a program on the app server record the current time there, then query the database time (CURRENT_TIMESTAMP), and record the app server's current time again after the query returns.
Confirm that the DB time is between the two times on the app server (with any tolerance you need). You can include a separate check on how long it took to get the response from the DB, but it should be trivial.
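A minimal sketch of that check from the app server side (my illustration; it assumes an Oracle connection through the python-oracledb driver and selects the DB time in UTC so both sides can be compared directly):
```python
# Sketch: bracket the DB timestamp between two app-server timestamps and flag
# the DB clock only if it falls outside that window plus a tolerance.
from datetime import datetime, timedelta, timezone
import oracledb  # assumed driver; any DB-API connection works the same way

TOLERANCE = timedelta(seconds=5)

def db_clock_ok(conn) -> bool:
    t_before = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        # Selecting AT TIME ZONE 'UTC' lets us treat the returned value as UTC.
        cur.execute("SELECT SYSTIMESTAMP AT TIME ZONE 'UTC' FROM dual")
        db_time = cur.fetchone()[0].replace(tzinfo=timezone.utc)
    t_after = datetime.now(timezone.utc)
    # Accept the DB clock if it lies within the request window, widened by the
    # tolerance on both sides; the window width itself is the round-trip time.
    return (t_before - TOLERANCE) <= db_time <= (t_after + TOLERANCE)
```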
If the environment is some form of VM, issues are most likely to arise when the VM is started or resumed from a pause. There might be situations where a clock is running fast or slow so recording the times would allow you to look for trends in either direction and allow you to take preemptive action.

Why is it that when the same query is executed using ExecuteReader() in vb .net twice, it returns two very different response times?

Whenever a user clicks the GetReport button, a request goes to the server, where the SQL is built in the back end and a connection is established with the database. When the ExecuteReader() function is executed, it returns data with varying response times.
There are 12 servers in the production environment, and the settings are such that when there is no response from the back end for more than 60 seconds, the connection is removed, and hence a blank screen appears in the UI.
In my code the SQL is built and the connection is established, and when the ExecuteReader() function is executed, it sometimes returns data only after the 60-second interval; by then, as per the server settings, the connection has been removed, leading to the blank screen.
If the ExecuteReader() function returns data within 60 seconds, the functionality works fine. The problem occurs only when the ExecuteReader() function does not retrieve data within 60 seconds.
The problem is that the ExecuteReader() function sometimes returns data within 2 seconds for the same SQL and sometimes takes 2 minutes to retrieve the data.
Please suggest why there is such variation in response time for the same query, and how I should proceed in this situation, as we are not able to increase the allowed response time in production because of security issues.
The code is in VB.NET.
You said it yourself:
how should I be proceeding in this situation as we are not able to increase the response time in production because of security issues.
There's nothing you can do
If, however, you do suddenly gain the permissions to modify the query that is being run, or reconfigure the resource provision of the production system, post back here with a screenshot of the execution plan and we can tell you any potential performance bottlenecks.
Dan's comment pretty much covers why a database query might be slow; it's usually for a similar reason that YouTube is slower to buffer at 7pm - the parents got home from work at 6, the kids screamed at them for an hour wanting to go on YouTube while the parents desperately tried to engage them in something more educational or physically active, and the parents finally gave in, wanting some peace and quiet :) - resource provision/supply and demand across the entire chain between you and YouTube.

DeadlineExceededError in GAE + BQ despite several steps taken to avoid it

I have several BigQuery queries, each taking around 10-30 seconds to run, that I have been trying to execute from Google App Engine. At one or more places in the call stack, an HTTP request is being killed with a DeadlineExceededError. Sometimes the DeadlineExceededError (unsure which kind) is raised as is, and sometimes it is translated into an HTTPException.
Following leads found in different SO posts, I have taken various steps to avoid the timeout:
Run the query in a task that is added to a GAE TaskQueue, setting the task_age_limit to 10m. (1)
Pass a timeoutMs flag to getQueryResults (called on a job object in Google's Python API) using a value of 599 * 1000 ~ 10 minutes. (2)
Just before the call to getQueryResults, call urlfetch.set_default_fetch_deadline(60), every time, in an attempt to ensure that the setting is local to the thread that is making the call. (3, 4, 5)
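For reference, a rough sketch of how the three steps above might fit together (my reconstruction, not the poster's code; it assumes the legacy GAE Python runtime and the discovery-based BigQuery client, and the handler path, queue name, and job wiring are placeholders):
```python
# Rough sketch of the steps listed above (all names are illustrative).
from google.appengine.api import taskqueue, urlfetch

def kick_off_query():
    # (1) Run the work in a push task; the queue is configured with task_age_limit: 10m.
    taskqueue.add(url="/tasks/run-query", queue_name="bq-queries")

def fetch_results(bq_service, project_id, job_id):
    # (3) Raise the urlfetch deadline just before the call, on the calling thread.
    urlfetch.set_default_fetch_deadline(60)
    # (2) Ask BigQuery to hold the HTTP response open for up to ~10 minutes.
    return bq_service.jobs().getQueryResults(
        projectId=project_id,
        jobId=job_id,
        timeoutMs=599 * 1000,
    ).execute()
```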
A gist of the relevant part of a typical stack trace can be found here. In a typical task execution, there will be a number of failures and then finally, perhaps, a success.
This answer seems to be saying that a urlfetch call will not be allowed to exceed 60 seconds on GAE, in any context (including a task). I doubt the queries are exceeding the hard limit in my case, so I'm probably missing an important step. Has anyone run into a similar situation and figured out what was going on?

Prioritizing I/O for a specific query request in SQL server

Sorry for the long introduction but before I can ask my question, I think giving the background would help understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes too long to respond to requests that are supposed to run really fast - for example, more than 20 seconds for a select request that queries a table with only 22 rows. We went through many potential problem areas, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, like removing indexes that are written frequently but rarely read, or adding NOLOCK to our select queries to reduce table locking (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components for bottlenecks in CPU, memory, or the disk subsystem, and found that hardware-wise we are OK as well. And since the spikes occur only occasionally, it is really hard to reproduce the error in production or development, because most of the time when we rerun the same query it yields the response times we expect - short ones, not the long ones experienced earlier.
Having said that, I have been suspicious about I/O all along, although it does not appear to be a bottleneck. I think I was finally able to reproduce the error after running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests against that table but also in requests that query other tables. And since the DB and the server are shared with other applications we use, and long-running queries on that server and database are a common scenario for us, my suspicion of an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize the requests coming from the web services so that they are processed even while other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process and found that SQL Server 2008 has a feature called 'Resource Governor' that allows requests to be prioritized.
However, since I am neither an expert on Resource Governor nor a DBA, I would like to hear from people who have used or are using Resource Governor, and to know whether I can prioritize I/O for a specific login or a specific stored procedure (for example, if an I/O-intensive process is running at the time we receive a web service request, can SQL Server stop, or slow down, I/O activity for that process and give priority to the request we just received?).
Thanks in advance to anyone who spends time reading or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure, but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: I am almost 100 percent sure that the application servers are not causing the error; there is no bottleneck we can identify there.
Update 1:
I'll try to answer the questions that gbn asked below as well as I can. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly job that defrags indexes every Friday. In addition to that, Auto Create Statistics and Auto Update Statistics are enabled. The spikes also occur at times other than when the fragmentation job runs.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front-end application that accesses the same database, and to my knowledge resource-intensive queries need to be run periodically; however, I don't know how to get the weekly or daily write volume to the DB.
3) Have you profiled Recompilation and statistics update events?
Sorry, I was not able to figure this one out. I didn't understand what you are asking with this question. Can you provide more information, if possible?
My first thought is that statistics are being updated because the data-change threshold is reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
In response to question 3) of your update to the original question, take a look at the following reference on SQL Server Pedia. It provides an explanation of what query recompiles are and also goes on to explain how you can monitor for these events. What I believe gbn is asking (feel free to correct me, sir :-) ) is whether you are seeing recompile events prior to the slow execution of the troublesome query. You can look for this using SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan
