Query compilation and provisioning times - snowflake-cloud-data-platform

What does it mean when COMPILATION_TIME, QUEUED_PROVISIONING_TIME, or both are much longer than usual?
I have a query that runs every couple of minutes. It usually takes less than 200 milliseconds for compilation and 0 for provisioning. In the last couple of days there were 2 instances where the values were more than 4000 for compilation and more than 100000 for provisioning.
Does that mean the warehouse was being resumed and there was a hiccup?

COMPILATION_TIME:
The SQL is parsed and simplified, and the table metadata is loaded. Thus a compile for select a,b,c from table_name will be fractionally faster than for select * from table_name, because the metadata is not needed from every partition to know the final shape.
Heavily fragmented tables can give poor compile performance, as there is more metadata to load. Fragmentation comes from many small writes/deletes/updates.
Very large INSERT statements can give horrible compile performance. We did a lift-and-shift and did all data loading via INSERT; just avoid it.
PROVISIONING_TIME is the amount of time needed to set up the hardware. This occurs for two main reasons. The first is that you are turning on 3X, 4X, 5X, or 6X servers, and it can take minutes just to allocate that volume of servers.
The second is a failure. Sometimes around releases there can be a little instability, where a query fails on the "new" release and is rolled back to older instances, which you would see in the profile as 1, 1001. There have also been occasional problems in the provisioning infrastructure (I have not seen one for a few years, but I am not monitoring for it at present).
But if you see this on an ongoing basis, I would expect it to be mostly for the first reason.

The compilation process involves query parsing, semantic checks, query rewrite components, reading object metadata, table pruning, evaluating certain heuristics such as filter push-downs, plan generation based on cost-based optimization, etc., all of which counts toward COMPILATION_TIME.
QUEUED_PROVISIONING_TIME is the time (in milliseconds) spent in the warehouse queue, waiting for the warehouse compute resources to provision, due to warehouse creation, resume, or resize.
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
To understand in detail why the query has recently been taking a long time, the query ID needs to be analysed. You can raise a case with Snowflake support, quoting the problematic query ID, to have the details checked.
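To pick out the query IDs worth sending to support, the outliers can be filtered from exported history rows. This is a minimal sketch, assuming the rows have already been fetched from QUERY_HISTORY; the `QueryRun` type, the sample IDs, and the thresholds (taken from the "usual" values in the question) are illustrative assumptions.

```python
from typing import Iterable, List, NamedTuple


class QueryRun(NamedTuple):
    # Hypothetical container for three QUERY_HISTORY columns.
    query_id: str
    compilation_ms: int
    queued_provisioning_ms: int


def flag_slow_runs(
    runs: Iterable[QueryRun],
    compilation_limit_ms: int = 200,   # "usually less than 200 ms" in the question
    provisioning_limit_ms: int = 0,    # "usually 0" in the question
) -> List[QueryRun]:
    """Return runs whose compile or provisioning time exceeds the usual baseline."""
    return [
        r for r in runs
        if r.compilation_ms > compilation_limit_ms
        or r.queued_provisioning_ms > provisioning_limit_ms
    ]


# Made-up sample rows for illustration only.
runs = [
    QueryRun("q-001", 150, 0),
    QueryRun("q-002", 4200, 105000),
    QueryRun("q-003", 180, 0),
]
slow_ids = [r.query_id for r in flag_slow_runs(runs)]
print(slow_ids)
```

If the flagged runs line up with warehouse resume or resize events, that points at provisioning rather than at the query itself.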


Table containing TEXT column growing continuously

We've got a table in a production system which (for legacy reasons) is running SQL 2005 (9.0.5266) and contains a TEXT column (along with a few other columns of various datatypes).
All of a sudden (since a week ago) we noticed the size of this one table increasing linearly by 10-15GB per day (whereas previously it has always remained at a constant size). The table is a queue for a messaging system, and as such the data in it completely refreshes itself every few seconds. At any one time there could be anywhere from 0 to around 1000 rows, but it fluctuates rapidly as messages are inserted, and sent (at which point they're deleted).
We can't find anything that was changed on the day the growth started - so have no obvious potential cause identified at this stage.
One "obvious" culprit is the TEXT column, so we checked to see if any massive values were now being stored, but (using DATALENGTH) we found no single rows above around 32k. We've run CHECKDB, updated space usage, rebuilt all indexes, etc. Nothing reduces the size (and CHECKDB showed no errors).
We've queried sys.allocation_units and the size increase is definitely LOB_DATA (which shows total_pages and used_pages increasing together at a constant rate).
To reduce the database size last night we simply created a new table alongside the one in question (which is luckily referenced via a view by the application), dropped the old table, and renamed the new one. We left last night, taking comfort in the fact that we'd alleviated the space issues, and that we had a backup of the dodgy table to investigate further today.
However, this morning the table size is already up to 14GB (and growing), while there are only the usual ~500 rows in the table, and MAX(DATALENGTH(text_column)) is only showing around 35k.
Any ideas as to what could be causing this "runaway" growth, or anything else that we could try or query to get more information about what exactly is using the space?
Cheers,
Dave
This is a general problem in dealing with queues. The article linked talks about Service Broker queues, but the issue is the same for ordinary tables used as queues. If you have a busy system with generous resources (CPU, memory, disk IO) and you push a queue on this system to high throughput, then a large portion of these resources will be used to handle the two operations: enqueue (i.e. INSERT) and dequeue (i.e. DELETE). However, the full lifecycle of the record requires three operations: INSERT, DELETE and ghost purge. They cost roughly the same in terms of CPU/memory/disk IO needs, so if you use that queue for, say, 90% of the system resources then you should allocate 30% of resources to each. But only the first two are under your control (i.e. explicit statements running in user sessions). The third one, the ghost purge, is a background process controlled by SQL Server, and there is no chance the ghost cleanup process will be allowed to consume 30% of resources. This is a fundamental issue and, if you push pedal-to-the-metal for a long enough time, you will hit it. Once ghost records accumulate and pass a system/workload-specific threshold, performance will degrade quickly and the symptoms will spiral to abysmal performance (a vicious feedback loop forms).
Luckily, since you do not use Service Broker queues but real tables as queues, you have some better tools at your disposal, like ALTER INDEX ... REORGANIZE and ALTER TABLE ... REBUILD. By far the best solution is an online index/table rebuild. SQL Server 2012 supports online operations on tables containing BLOBs and you can leverage that. Of course you would have to get rid of the deprecated TEXT type and use VARCHAR(MAX), but that goes without saying.
As a side note:
If you have pages with nothing but ghost records on them, then you
will not read those pages again and they won't get marked for cleanup
This is incorrect. Pages with nothing but ghosts will be detected and purged by scans. As I said, the issue is not detection, it is resources. If you push your system enough, you will race ahead of the ghost cleanup and it will never catch up.
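The "race ahead of the ghost cleanup" argument can be illustrated with a toy model. This is only a sketch under simplified assumptions: the rates are made-up numbers, and real ghost cleanup is not a fixed-rate process.

```python
def ghost_backlog(deletes_per_sec: int, purges_per_sec: int, seconds: int) -> int:
    """Toy model: ghost records left after `seconds` of steady load.

    Every DELETE leaves one ghost record; the background cleanup removes at
    most `purges_per_sec` ghosts each second. Both rates are hypothetical.
    """
    backlog = 0
    for _ in range(seconds):
        backlog += deletes_per_sec               # new ghosts from dequeues
        backlog -= min(backlog, purges_per_sec)  # cleanup does what it can
    return backlog


keeping_up = ghost_backlog(deletes_per_sec=500, purges_per_sec=500, seconds=3600)
falling_behind = ghost_backlog(deletes_per_sec=500, purges_per_sec=450, seconds=3600)
print(keeping_up, falling_behind)
```

In this model even a 10% shortfall in cleanup capacity never recovers while the load lasts; the backlog only drains once the delete rate drops below the purge rate.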
Early this morning I restarted the SQL service on the instance with this "problem queue table". It appears that this has fixed the issue. Immediately following the restart, I monitored the LOB_DATA page-in-use count, and it started dropping straight away. It was being cleaned up quite slowly, so probably took around an hour or two to reclaim the 60+GB of space being held (I went to bed after I'd made sure all was well).
At the moment the table is back to normal as far as in-use allocations (hovering around <100 pages), and is not showing any signs of re-growing.
Given the fact that we have used this table in the same way (i.e. as a queue) for at least 10 years, and it has had busier periods than what we've had over the past week or two, I would've been surprised if it was the issue described by Remus above (although I understand how that can occur; I guess this specific queue just isn't quite busy enough to swamp the ghost cleanup process?). Very strange...
Thanks again for the help guys!

Need recommendations on pushing the envelope with SqlBulkCopy on SQL Server

I am designing an application, one aspect of which is that it is supposed to be able to receive massive amounts of data into a SQL database. I designed the database structure as a single table with a bigint identity, something like this one:
CREATE TABLE MainTable
(
_id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
field1, field2, ...
)
I will omit how I intend to perform queries, since it is irrelevant to the question I have.
I have written a prototype, which inserts data into this table using SqlBulkCopy. It seemed to work very well in the lab. I was able to insert tens of millions of records at a rate of ~3K records/sec (the full record itself is rather large, ~4K). Since the only index on this table is an autoincrementing bigint, I have not seen a slowdown even after a significant number of rows was pushed.
Considering that the lab SQL server was a virtual machine with a relatively weak configuration (4 GB RAM, disk subsystem shared with other VMs), I was expecting to get significantly better throughput on a physical machine, but it didn't happen, or let's say the performance increase was negligible. I could maybe get 25% faster inserts on the physical machine. Even after I configured a 3-drive RAID0, which performed 3 times faster than a single drive (measured by benchmarking software), I got no improvement. Basically: a faster drive subsystem, a dedicated physical CPU and double the RAM almost didn't translate into any performance gain.
I then repeated the test using biggest instance on Azure (8 cores, 16Gb), and I got the same result. So, adding more cores did not change insert speed.
At this time I have played around with the following software parameters without any significant performance gain:
Modifying the SqlBulkCopy.BatchSize parameter
Inserting from multiple threads simultaneously, and adjusting the number of threads
Using the table lock option on SqlBulkCopy
Eliminating network latency by inserting from a local process using the shared memory driver
I am trying to increase performance at least 2-3 times, and my original idea was that throwing more hardware at it would get things done, but so far it hasn't.
So, can someone recommend:
What resource could be suspected as the bottleneck here? How can I confirm it?
Is there a methodology I could try to get reliably scalable bulk insert improvement, considering there is a single SQL Server system?
UPDATE I am certain that the load app is not the problem. It creates records in a temporary queue in a separate thread, so when there is an insert it goes like this (simplified):
// ===> start logging time
int batchCount = (queue.Count - 1) / targetBatchSize + 1;
Enumerable.Range(0, batchCount).AsParallel()
    .WithDegreeOfParallelism(MAX_DEGREE_OF_PARALLELISM).ForAll(i =>
    {
        // each worker takes its own slice of the queue
        var batch = queue.Skip(i * targetBatchSize).Take(targetBatchSize);
        var data = MYRECORDTYPE.MakeDataTable(batch);
        var bcp = GetBulkCopy();
        bcp.WriteToServer(data);
    });
// ===> end logging time
Timings are logged, and the part that creates the queue never takes any significant chunk of the time.
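The batch arithmetic in that snippet is a ceiling division followed by slicing. A minimal sketch of the same idea in Python, with hypothetical helper names (`batch_count`, `split_batches`) that are not part of the original code:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_count(item_count: int, batch_size: int) -> int:
    """Number of batches needed to cover item_count items (ceiling division).

    Same formula as (queue.Count - 1) / targetBatchSize + 1. Python's floor
    division gives 0 batches for an empty queue, whereas C# integer division
    truncates toward zero, so the original yields one (empty) batch in that
    case when targetBatchSize > 1.
    """
    return (item_count - 1) // batch_size + 1


def split_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Slice items into consecutive batches, like Skip(i * n).Take(n)."""
    return [
        items[i * batch_size:(i + 1) * batch_size]
        for i in range(batch_count(len(items), batch_size))
    ]


batches = split_batches(list(range(7)), 3)
print(batches)
```

Each slice would then be handed to its own worker, which is what the ForAll body does with `WriteToServer`.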
UPDATE2 I have implemented collecting how long each operation in that cycle takes and the layout is as follows:
queue.Skip().Take() - negligible
MakeDataTable(batch) - 10%
GetBulkCopy() - negligible
WriteToServer(data) - 90%
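Those percentages put a bound on what client-side tuning can achieve. A rough sketch using the standard Amdahl's-law formula, assuming the 10%/90% split above stays representative:

```python
def overall_speedup(fraction: float, step_speedup: float) -> float:
    """Amdahl's law: overall speedup when a step taking `fraction` of the
    total time is made `step_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / step_speedup)


# Making MakeDataTable (10%) infinitely fast gains at most about 11%.
client_side_best_case = overall_speedup(0.10, float("inf"))

# To double overall throughput, WriteToServer (90%) must get 2.25x faster.
doubled = overall_speedup(0.90, 2.25)

print(round(client_side_best_case, 3), round(doubled, 3))
```

So the 2-3x target can only come from whatever is limiting `WriteToServer`, which is why the answers focus on server-side waits.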
UPDATE3 I am designing for standard version of SQL, so I cannot rely on partitioning, since it's only available in Enterprise version. But I tried a variant of partitioning scheme:
created 16 filegroups (G0 to G15),
made 16 tables for insertion only (T0 to T15) each bound to its individual group. Tables are with no indexes at all, not even clustered int identity.
threads that insert data will cycle through all 16 tables each. This makes it almost a guarantee that each bulk insert operation uses its own table
That did yield ~20% improvement in bulk insert. CPU cores, LAN interface, Drive I/O were not maximized, and used at around 25% of max capacity.
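The cycling scheme can be sketched as follows. The table names T0 to T15 come from the description above; starting each worker at a different offset is an assumption added here so that concurrent workers tend to hit different tables.

```python
from itertools import cycle, islice
from typing import Iterator, List

TABLES: List[str] = [f"T{i}" for i in range(16)]  # T0 .. T15, one per filegroup


def table_sequence(worker_index: int, tables: List[str] = TABLES) -> Iterator[str]:
    """Endless round-robin over the staging tables for one worker.

    Assumption: worker k starts at table k (mod 16), so workers running in
    lock-step do not target the same table at the same time.
    """
    start = worker_index % len(tables)
    return cycle(tables[start:] + tables[:start])


first_three_for_worker_2 = list(islice(table_sequence(2), 3))
wrap_around = list(islice(table_sequence(15), 2))
print(first_three_for_worker_2, wrap_around)
```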
UPDATE4 I think it is now as good as it gets. I was able to push inserts to a reasonable speed using the following techniques:
Each bulk insert goes into its own table, then results are merged into main one
Tables are recreated fresh for every bulk insert, table locks are used
Used IDataReader implementation from here instead of DataTable.
Bulk inserts done from multiple clients
Each client is accessing SQL using individual gigabit VLAN
Side processes accessing the main table use NOLOCK option
I examined sys.dm_os_wait_stats, and sys.dm_os_latch_stats to eliminate contentions
I am having a hard time deciding who should get credit for answering the question. To those of you who don't get an "answered", I apologize; it was a really tough decision, and I thank you all.
UPDATE5: The following item could use some optimization:
Used IDataReader implementation from here instead of DataTable.
Unless you run your program on a machine with a massive CPU core count, it could use some refactoring. Since it uses reflection to generate the get/set methods, that becomes a major load on the CPUs. If performance is key, you gain a lot by coding the IDataReader manually, so that it is compiled instead of using reflection.
For recommendations on tuning SQL Server for bulk loads, see the Data Loading and Performance Guide paper from MS, and also Guidelines for Optimising Bulk Import from Books Online. Although they focus on bulk loading from SQL Server, most of the advice applies to bulk loading using the client API. These papers apply to SQL 2008; you don't say which SQL Server version you're targeting.
Both have quite a lot of information which it's worth going through in detail. However, some highlights:
Minimally log the bulk operation. Use bulk-logged or simple recovery.
You may need to enable trace flag 610 (but see the caveats on doing this)
Tune the batch size
Consider partitioning the target table
Consider dropping indexes during bulk load
Nicely summarised in this flow chart from Data Loading and Performance Guide:
As others have said, you need to collect some performance counters to establish the source of the bottleneck, since your experiments suggest that IO might not be the limitation.
Data Loading and Performance Guide includes a list of SQL wait types and performance counters to monitor (there are no anchors in the document to link to but this is about 75% through the document, in the section "Optimizing Bulk Load")
UPDATE
It took me a while to find the link, but this SQLBits talk by Thomas Kejser is also well worth watching - the slides are available if you don't have time to watch the whole thing. It repeats some of the material linked here but also covers a couple of other suggestions for how to deal with high incidences of particular performance counters.
It seems you have done a lot; however, I am not sure if you have had a chance to study Alberto Ferrari's SqlBulkCopy Performance Analysis report, which describes several factors to consider for SqlBulkCopy performance. I would say many of the things discussed in that paper are still worth trying first.
I am not sure why you are not getting 100% utilization on CPU, IO or memory. But if you simply want to improve your bulk load speeds, here is something to consider:
Partition your data file into different files. Or, if the data comes from different sources, simply create different data files.
Then run multiple bulk inserts simultaneously.
Depending on your situation the above may not be feasible; but if you can then I am sure it should improve your load speeds.
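The partition-then-load-in-parallel idea looks roughly like this. It is a sketch only: `load_partition` is a hypothetical stand-in for a real bulk insert call and simply reports how many rows it was given.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(rows: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split rows into `parts` near-equal partitions (every parts-th row)."""
    return [rows[i::parts] for i in range(parts)]


def load_partition(rows: Sequence[int]) -> int:
    """Hypothetical stand-in for one bulk insert; returns rows 'loaded'."""
    return len(rows)


def parallel_load(rows: Sequence[int], parts: int = 4) -> int:
    """Run one loader per partition concurrently and total the row counts."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(load_partition, partition(rows, parts)))


loaded = parallel_load(list(range(1000)), parts=4)
print(loaded)
```

Whether this helps depends on the target: concurrent loads into a single table can still serialize on locks, which is why the question's author ended up loading each stream into its own table.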

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. EG: Could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
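A back-of-the-envelope calculation shows how those limits bite at the scale asked about. The 10K DML and 250K-batches-per-day figures are the ones quoted above; the scope size of 200 records per batch is an assumption made here for illustration and should be checked against your org's configuration.

```python
DML_ROWS_PER_EXECUTION = 10_000   # "10K DML limit" quoted above
BATCHES_PER_24H = 250_000         # "250K batches in 24 hours" quoted above
ASSUMED_SCOPE_SIZE = 200          # assumption: records handled per batch


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def executions_needed(records: int) -> int:
    """Separate executions needed to apply DML to every record."""
    return ceil_div(records, DML_ROWS_PER_EXECUTION)


def days_needed(records: int, scope_size: int = ASSUMED_SCOPE_SIZE) -> int:
    """Days of Batch Apex needed to touch every record once."""
    return ceil_div(records, scope_size * BATCHES_PER_24H)


print(executions_needed(250_000_000))  # executions for 250 million records
print(days_needed(250_000_000))        # days under the assumed scope size
print(days_needed(1_000_000_000))      # days for a billion records
```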
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since salesforce has upwards of a dozen server clusters, each with their own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But, in the past I've seen performance issues start to creep in around 2M records. One possible remedy is you can ask salesforce to index fields that you plan to filter on.

Terrible SQL reads performance (culprit update stats?)

I'm running on SQL Server 2008 R2 and am trying to fine-tune performance. I did everything I could from:
Code review of SQL code
Create or remove indexes as I think appropriate
Auto create stats ON
Auto update stats ON
Auto update stats async ON
I have a 24/7 system that constantly stores data. Sometimes we do reads and that's where the issue is. Sometimes the reads take a couple of seconds or less (which would be expected and acceptable to us). Other times, the reads take several seconds that could amount to a minute before the stored procedure completes and we render data on the UI.
If we do the read again, it would be faster. The SQL profiler would trace the particular stored procedure or query that took several seconds. We would zoom into that stored procedure, and do everything we can do to optimize it if we can.
I also traced the auto stats event and the recompile event. It's hard to tell if a stat is being updated causing the read to take a long time, or if a recompile caused it. Sometimes, I see that the profiler traced a recompile of the read query that took several unacceptable minutes, other times it doesn't trace a recompile.
I tried to prevent the query optimizer from blocking the read until it recompiles or updates stats by using OPTION (USE PLAN) with the plan XML, etc. But I ran into compile errors complaining that the query plan XML isn't valid; that could be true because the query is quite involved: select + joins that involve a local table variable. I sort of hacked the XML and maybe that's why it was deemed invalid. So I gave up on using the plan hint.
We tried periodic (every 15 minutes) manual running update stats in order to keep stats up-to-date as much as we can, but that hurt performance. updatestats blocks writes, and I'm sure even reads; updatestats seemed to maintain a bunch of statistics and on average it was taking around 80-90 seconds. A read that waits that long is unacceptable.
So the idea is to let the reads happen and prevent a situation when a recompile/update stat blocks it, correct? Does it make sense to disable auto statistics altogether? Or perhaps disable auto create statistics after deleting all the auto created stats?
This goes against Microsoft recommendations perhaps, since they enable auto create statistics and auto update statistics by default, and performance may suffer, but any ideas/hints you can give would be appreciated.
From what you are explaining, it looks like the below (all or some) might be happening.
You are doing physical reads. The quick way you avoid this is by increasing the amount of RAM you throw at the box. You haven't mentioned the hardware specs of your server. Please add details.
If you trace the SQL calls then you can easily figure out why the RECOMPILE happened. Look at the EventSubClass to figure out the reason and work towards resolving that.
ref: http://msdn.microsoft.com/en-us/library/ms187105.aspx
You mentioned table variables. These are notorious for causing performance issues when not used in the right place. If you use table variables in a JOIN, a parallel plan is out of the question and there are no stats either. I am not sure how and where you are using them, but try replacing them with temp tables. And starting from SQL Server 2005, you will get only a statement-level recompilation at worst, not the complete SP recompile that happened in 2000.
You mentioned Update Stats ASYNC option and this won't block the query.
What are the TOP WAIT STATS on this server? Have you identified the expensive procedures based on CPU, Logical reads & execution count?
Have you looked the Page Life Expectancy, amount of IO using virtual file stats DMV?
Updating Stats every 15 minutes is NOT a good plan. How often is data inserted into the system? What is the sample rate you are using? What is your index maintenance strategy?
Have you looked at the missing indexes DMV?
The script below contains a bunch of good queries for identifying problems in a more granular fashion.
ref: http://dl.dropbox.com/u/13748067/SQL%20Server%202008%20Diagnostic%20Information%20Queries%20%28April%202011%29.sql
There are so many other things to look at but the above is a good starting point.
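On the wait stats point: the counters in sys.dm_os_wait_stats are cumulative, so what matters is the change over the slow period rather than the raw totals. A minimal sketch of ranking by delta, assuming two snapshots of wait_time_ms have already been collected into dictionaries (the numbers are made up):

```python
from typing import Dict, List, Tuple


def top_wait_deltas(
    before: Dict[str, int], after: Dict[str, int], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank wait types by the increase in wait_time_ms between two snapshots."""
    deltas = {wait: after[wait] - before.get(wait, 0) for wait in after}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return [(wait, delta) for wait, delta in ranked[:top_n] if delta > 0]


# Made-up snapshot values (wait_time_ms per wait_type) for illustration.
snapshot_before = {"PAGEIOLATCH_SH": 1000, "WRITELOG": 5000, "CXPACKET": 200}
snapshot_after = {"PAGEIOLATCH_SH": 9000, "WRITELOG": 5600, "CXPACKET": 200}

top_waits = top_wait_deltas(snapshot_before, snapshot_after, top_n=2)
print(top_waits)
```

A dominant PAGEIOLATCH delta during slow reads would support the physical-reads theory in the first point.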
OK, here is my take on this, IMHO:
DBCC INDEXDEFRAG is worth trying and is an ONLINE function, hence it can be used on a live system.
You could be reaching the maximum capacity of your architectural design. You can scale up, which can always help, but more likely you will have to change the architecture to achieve better scalability, sacrificing simplicity.
A common trick is partitioning. You are writing to a table whose index distribution looks nothing like it did a few hours ago, hence the degrading performance. Since this table takes massive writes, it could be divided into a table for the daily writes and one for the rest of the data, with nightly batches moving rows across.
More and more people are converting to CQRS. You might be next. This solves the problem by separating reads from writes (a very simplistic explanation).

Detecting/Monitoring for parameter sniffing problems

Are there any tools to specifically monitor/detect for parameter sniffing problems as opposed to those which report queries that take a long time?
I have just got hit with a parameter sniffing problem. (It wasn't too serious as it caused a report to take about 2 minutes to run instead of a few seconds if properly cached and maybe 30 seconds if recompiled. And since the report is usually only run a few times per month, it is not really a problem).
However, since I wrote the report and I knew what it did, I was curious and went investigating and using SQL Profiler, I could see a section in the query plan where the number of estimated rows was 1, but the actual number of rows was several hundred thousand.
So, it struck me, that if SQL has these figures, (or at least can get these figures), that perhaps there is some way of getting sql to track and report which plans were significantly out.
You've got a couple of questions in there:
Are there any tools to specifically monitor/detect for parameter sniffing problems as opposed to those which report queries that take a long time?
To catch this, you need to monitor the procedure cache to find out when a query's execution plan changes from good to bad. SQL Server 2008 made this a lot easier by adding query_hash and query_plan_hash fields to sys.dm_exec_query_stats. You can compare the current query plan to past ones for the same query_hash, and when it changes, compare the number of logical reads or amount of worker time from the old query to the new one. If it skyrockets, you might have a parameter sniffing problem.
Then again, someone might have just eliminated an index or changed the code in a UDF that's being called or a change in MAXDOP or any one of a million settings that influence query plan behavior.
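The comparison described above can be sketched over two snapshots of sys.dm_exec_query_stats. This is an illustration only: the `PlanSample` type, the sample hashes, and the 10x threshold are assumptions, and a flagged query is a candidate to investigate rather than proof of parameter sniffing.

```python
from typing import Iterable, List, NamedTuple


class PlanSample(NamedTuple):
    # Hypothetical row built from sys.dm_exec_query_stats columns.
    query_hash: str
    query_plan_hash: str
    avg_logical_reads: float  # total_logical_reads / execution_count


def plan_regressions(
    previous: Iterable[PlanSample],
    current: Iterable[PlanSample],
    ratio: float = 10.0,  # assumed threshold for "skyrockets"
) -> List[str]:
    """Return query hashes whose plan changed AND whose reads jumped by `ratio`."""
    before = {s.query_hash: s for s in previous}
    flagged = []
    for now in current:
        old = before.get(now.query_hash)
        if old is None or old.query_plan_hash == now.query_plan_hash:
            continue  # new query, or same plan as before
        if now.avg_logical_reads > ratio * max(old.avg_logical_reads, 1.0):
            flagged.append(now.query_hash)
    return flagged


# Made-up snapshots for illustration.
earlier = [PlanSample("0xA1", "0xP1", 120.0), PlanSample("0xB2", "0xP7", 900.0)]
later = [PlanSample("0xA1", "0xP2", 45000.0), PlanSample("0xB2", "0xP7", 950.0)]

suspects = plan_regressions(earlier, later)
print(suspects)
```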
What you want is a single dashboard that shows the most resource-consuming queries in aggregate (because you might have this problem on a query that's called extremely frequently, but consumes tiny amounts of resources each time) and then shows you changes in its execution plan over time, plus lays over system and database level changes. Quest Foglight Performance Analysis does this. (I used to work for Quest, so I know the product, but I'm not shilling here.) Note that Quest sells a separate product, Foglight, that has nothing to do with Performance Analysis. I'm not aware of any other product that goes into this level of detail.
I could see a section in the query plan where the number of estimated rows was 1, but the actual number of rows was several hundred thousand.
That's not necessarily parameter sniffing - that could be bad stats or table variable usage, for example. To catch this kind of issue, I like the free SQL Sentry Plan Advisor tool. In the Top Operations tab, it highlights variances between estimated and actual rows.
Now, that's only for one plan at a time, and you have to know the plan first. You want to do this 24/7, right? Sure you do - but it's computationally intensive. The procedure cache can be huge (I've got clients with >100GB of procedure cache), and it's all unindexed XML. To compare estimated vs actual rows, you have to shred all that XML - and keep in mind that the procedure cache can be constantly changing under load.
What you really want is a product that could very rapidly dump the entire procedure cache into a database, throw XML indexes on it, and then compare estimates versus actual rows. I can imagine a script doing that, but I haven't seen one yet.
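Once the estimated and actual row counts have been extracted from the plan XML, the comparison itself is simple. A sketch assuming the shredding has already produced (operator, estimated, actual) tuples; the operator names, row counts, and the 100x factor are illustrative:

```python
from typing import Iterable, List, Tuple

Operator = Tuple[str, float, float]  # (operator name, estimated rows, actual rows)


def misestimated(operators: Iterable[Operator], factor: float = 100.0) -> List[str]:
    """Return operators whose estimate is off by at least `factor` either way."""
    flagged = []
    for name, estimated, actual in operators:
        # Clamp to 1 row so zero-row estimates do not divide by zero.
        low, high = sorted((max(estimated, 1.0), max(actual, 1.0)))
        if high / low >= factor:
            flagged.append(name)
    return flagged


# Made-up operators; the first mirrors the "1 estimated, several hundred
# thousand actual" case from the question.
plan_operators = [
    ("Index Seek", 1.0, 300000.0),
    ("Hash Match", 1200.0, 1500.0),
]

bad_estimates = misestimated(plan_operators)
print(bad_estimates)
```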
You said
"estimated rows was 1, but the actual number of rows was several hundred thousand."
This can be caused by table variables which don't have statistics.
Detecting parameter sniffing is difficult, but you can check whether it is happening by running sp_updatestats. If the problem disappears, it's most likely parameter sniffing. If it doesn't, then you have other problems, such as table variables that are too large.
We use parameter masking consistently now (system was developed on SQL Server 2000). We don't need it 99.9+ % of the time but the < 0.1% justifies it because of user confidence + support overhead it entails.
You can set up a trace to record the query text of all batches / stored procedures that have a duration > N seconds.
You obviously need to tailor N for your system (and probably add rules to exclude batch jobs that take a long time even during normal execution), but this should identify which queries offer the poorest performance and will also record any queries (along with their parameters) which have abnormally long execution times - potentially the result of a parameter sniffing problem.
See How to create a SQL trace using T-SQL on how to create a trace using T-SQL. This will give better performance than using SQL Profiler as this only captures the events that you set trace events for (SQL Profiler reportedly captures all events and then filters them in the application).
