I'm processing a 260M row, ~1,500 column table in chunks through a model in Python. Using the connector, I grab a chunk of 100,000 records each time, using LIMIT and OFFSET to churn through the table: after each chunk I increase the OFFSET by the chunk size. As the OFFSET grows, the query time grows with it, to the point where each chunk takes in excess of 45 minutes to grab toward the end. Here is a mock-up of my query:
SELECT ~50_fields
FROM mytable
WHERE a_couple_conditions
ORDER BY my_primary_key
LIMIT 100000 OFFSET #########
Given the performance, this is a particularly bad way to run this. I read that I might be able to use RESULT_SCAN to speed it up, but the docs said that I would still need to use ORDER BY against it, which seems to me like it may defeat the purpose. I actually don't care what order the records come into my process, just that I process each row exactly once.
Is there a way to get these queries running in a decent amount of time, or should I look into doing something like increasing the LIMIT dramatically for each chunk, then breaking it down further in my program? Any ideas or best practices on getting Snowflake to play ball?
What if you tried something like this?
SELECT ~50_fields, row_number() OVER (ORDER BY my_primary_key) as row_cnt
FROM mytable
WHERE a_couple_conditions;
and then loop through:
SELECT ~50_fields
FROM table(result_scan(query_id))
WHERE row_cnt BETWEEN x and xx;
where query_id is the query_id from the first statement. The initial select might take a long time to order the entire table, but the remaining chunks should be very quick and will not take longer and longer as you go.
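To make that concrete, here is a minimal sketch of the two steps with literal chunk bounds. The query ID placeholder is hypothetical; your client would capture it after running the first statement (for example via cursor.sfqid in snowflake-connector-python) and substitute it into each chunk query.
-- Step 1: compute the ordered row numbers once; capture this statement's query ID.
SELECT ~50_fields, row_number() OVER (ORDER BY my_primary_key) AS row_cnt
FROM mytable
WHERE a_couple_conditions;

-- Step 2: slice the cached result; only the BETWEEN bounds change per chunk,
-- so there is no re-sort and no growing OFFSET cost.
SELECT ~50_fields
FROM table(result_scan('<query_id_from_step_1>'))
WHERE row_cnt BETWEEN 1 AND 100000;

SELECT ~50_fields
FROM table(result_scan('<query_id_from_step_1>'))
WHERE row_cnt BETWEEN 100001 AND 200000;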
All Python SQL clients that I'm aware of allow you to process the output of a query in batches. As an example, here's how snowflake-connector-python allows you to retrieve result batches from a query:
from snowflake.connector import connect

with connect(...) as conn:
    with conn.cursor() as cur:
        # Execute a query.
        cur.execute('select seq4() as n from table(generator(rowcount => 100000));')
        # Iterate over a list of PyArrow tables for result batches.
        for table_for_batch in cur.fetch_arrow_batches():
            my_pyarrow_table_processing_function(table_for_batch)
With Snowflake in particular, the batch size can be controlled in megabytes (but not in rows, sadly) using the parameter CLIENT_RESULT_CHUNK_SIZE.
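If you want smaller (or larger) batches, a minimal sketch of setting that session parameter before executing the query; the value of 100 is just an illustration, so check the Snowflake docs for the permitted range:
-- Value is in MB; smaller values produce more, smaller result chunks
-- for the client to fetch (e.g. via fetch_arrow_batches()).
ALTER SESSION SET CLIENT_RESULT_CHUNK_SIZE = 100;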
I have an Assets table with ~165,000 rows in it. However, the Assets make up "Collections" and each Collection may have ~10,000 items, which I want to save a "rank" for so users can see where a given asset ranks within the collection.
The rank can change (based on an internal score), so it needs to be updated periodically (a few times an hour).
That's currently being done on a per-collection basis with this:
UPDATE assets a
SET rank = a2.seqnum
FROM
(SELECT a2.*,
row_number() OVER (
ORDER BY elo_rating DESC) AS seqnum
FROM assets a2
WHERE a2.collection_id = #{collection_id} ) a2
WHERE a2.id = a.id;
However, that's causing the size of the table to double (i.e. 1GB to 2GB) roughly every 24 hours.
A VACUUM FULL clears this up, but that doesn't feel like a real solution.
Can the query be adjusted to not create so much (what I assume is) temporary storage?
Running PostgreSQL 13.
Every update writes a new row version in Postgres. So (aside from TOASTed columns) updating every row in the table roughly doubles its size. That's what you observe. Dead tuples can later be cleaned up to shrink the physical size of the table - that's what VACUUM FULL does, expensively. See:
Are TOAST rows written for UPDATEs not changing the TOASTable column?
Alternatively, you might just not run VACUUM FULL and keep the table at ~ twice its minimal physical size. If you run plain VACUUM (without FULL!) often enough - and if you don't have long-running transactions blocking that - Postgres will have marked dead tuples in the free-space map by the time the next UPDATE kicks in and can reuse the disk space, thus staying at ~ twice its minimal size. That's probably cheaper than shrinking and re-growing the table all the time, as the most expensive part is typically to physically grow the table. Be sure to have aggressive autovacuum settings for the table (a sketch follows the links below). See:
Aggressive Autovacuum on PostgreSQL
VACUUM returning disk space to operating system
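For example, a minimal sketch of per-table autovacuum storage parameters; the specific numbers are assumptions to tune for your workload, not recommendations:
-- Make autovacuum kick in much earlier on this heavily updated table,
-- so dead tuples are recycled before the next bulk UPDATE needs the space.
ALTER TABLE assets SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- trigger at ~1% dead tuples instead of the 20% default
  autovacuum_vacuum_threshold    = 1000,
  autovacuum_vacuum_cost_delay   = 1       -- ms; a lower delay lets vacuum finish faster
);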
Probably better yet, break out the ranking into a minimal separate 1:1 table (a.k.a. "vertical partitioning"), so that only minimal rows have to be written "a few times an hour" - probably including the elo_rating you mention in the query, which seems to change at least as frequently (?).
(LEFT) JOIN to the main table in queries. While that adds considerable overhead, it may still be (substantially) cheaper. Depends on the complete picture, most importantly the average row size in table assets and the typical load apart from your costly updates. A minimal sketch follows the links below.
See:
Many columns vs few tables - performance wise
UPDATE or INSERT & DELETE? Which is better for storage / performance with large text columns?
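A minimal sketch of that 1:1 split, assuming id is the primary key of assets and that elo_rating moves to the narrow table (table names and data types are illustrative):
-- Narrow 1:1 side table that absorbs the frequent writes.
CREATE TABLE asset_rank (
  asset_id   bigint PRIMARY KEY REFERENCES assets(id),
  elo_rating numeric NOT NULL,
  rank       integer
);

-- The periodic update now rewrites only these small rows, not the wide asset rows.
UPDATE asset_rank ar
SET    rank = sub.seqnum
FROM  (SELECT ar2.asset_id,
              row_number() OVER (ORDER BY ar2.elo_rating DESC) AS seqnum
       FROM   asset_rank ar2
       JOIN   assets a ON a.id = ar2.asset_id
       WHERE  a.collection_id = #{collection_id}) sub
WHERE  ar.asset_id = sub.asset_id;

-- Read path: (LEFT) JOIN back to the wide table when the rank is needed.
SELECT a.*, ar.rank
FROM   assets a
LEFT   JOIN asset_rank ar ON ar.asset_id = a.id;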
When you use the TOP clause in a Snowflake query, does the engine stop searching for rows once it has enough to satisfy the TOP X that needs to be returned?
I think it depends on the rest of your query. For example, if you use TOP 10 but don't supply an order by, then yes, it will stop as soon as 10 records have been returned, but your results are non-deterministic.
If you do use an order by, then the entire query has to be executed first before the top 10 results can be returned, but your results will be deterministic.
Here is a real example. If I run a select on the SAMPLE_DATA.TPCH_SF10000.CUSTOMER table with a limit 10, it returns in 1.8 seconds (no caching). This table has 1,500,000,000 rows in it. If I then check the query plan, it has only scanned a tiny portion of the table: 1 out of 6,971 partitions.
You can see that it will return when 10 records have been streamed back from the initial table scan since there is nothing more it has to do.
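For reference, a sketch of the two variants being compared; the sample database and column names may differ in your account (the share is often mounted as SNOWFLAKE_SAMPLE_DATA, and C_ACCTBAL is one of the TPC-H customer columns):
-- No ORDER BY: the scan can stop as soon as 10 rows have been produced
-- (non-deterministic result, only a handful of micro-partitions touched).
SELECT *
FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
LIMIT 10;

-- With ORDER BY: all qualifying rows have to be read and sorted before
-- the top 10 are known (deterministic result, far more partitions scanned).
SELECT *
FROM SAMPLE_DATA.TPCH_SF10000.CUSTOMER
ORDER BY c_acctbal DESC
LIMIT 10;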
From my testing and understanding, it does not stop. You can typically see in the execution plan that the last step is the "limit" step, applied after full processing. Also, if you take a query that runs for, say, 20 seconds without a LIMIT (or similar) and add the LIMIT, you will typically not see any difference in the execution time (but be aware of fetch time). I typically run query performance testing in the UI to avoid issues with client-side tools that can mislead you due to limits on fetching and/or use of cursors.
Is the number of result sets limited that a stored procedure can return in SQL Server? Or is there any other component between server and a .Net Client using sqlncli11 limiting it? I'm thinking of really large numbers like 100000 result sets.
I couldn't find a specific answer to this in the Microsoft docs or here on SO.
My use case:
A stored procedure that iterates over a cursor and produces around 100 rows each iteration. I could collect all the rows in a temp table first, but since this is a long-running operation I want the client to start processing the results sooner. Also, the temp table can get quite large, and the execution plan shows 98% of the cost on the INSERT INTO part.
I'm thinking of really large numbers like 100000 result sets.
Ah, I hope you have a LOT of time.
100k result sets means 100k SELECT statements.
Just switching from one result set to the next will - taken together - take a long time. At 1 ms per switch, that is 100 seconds.
Is the number of result sets limited that a stored procedure can return in SQL Server?
Not to my knowledge. Remember, result sets are not part of any real metadata - there is a stream of data, an end marker, then the next stream. The number of result sets a procedure returns is not defined (it can vary).
Also the temp table can get quite large
I have seen temp tables with hundreds of GB.
and the execution plans shows 98% cost on the INSERT INTO part.
That basically indicates that there is otherwise not a lot happening. Note that unless you are optimizing, the relative cost is not relevant - the absolute cost is.
Have you considered a middle ground? Collect data and return result sets in groups of, say, 100 result sets (a rough sketch below).
But yes, staging into a temp table has a lot of overhead. It also means you cannot start returning data BEFORE all processing is finished. That can be a bummer. Your approach will allow processing to start while the SP is still working on more data.
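A rough T-SQL sketch of that middle ground; the driving query, the @batch columns, and dbo.produce_rows() are hypothetical stand-ins for whatever each cursor iteration produces:
DECLARE @batch TABLE (item_id int, payload nvarchar(100));
DECLARE @i int = 0, @key int;

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT some_key FROM dbo.work_queue;           -- hypothetical driving set
OPEN c;
FETCH NEXT FROM c INTO @key;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Accumulate the ~100 rows this iteration produces (hypothetical TVF).
    INSERT INTO @batch (item_id, payload)
    SELECT item_id, payload FROM dbo.produce_rows(@key);

    SET @i += 1;
    IF @i % 100 = 0                                -- flush one result set per 100 iterations
    BEGIN
        SELECT item_id, payload FROM @batch;
        DELETE FROM @batch;
    END;

    FETCH NEXT FROM c INTO @key;
END;
CLOSE c;
DEALLOCATE c;

IF EXISTS (SELECT 1 FROM @batch)
    SELECT item_id, payload FROM @batch;           -- final partial group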
I have tried the SQL query below.
SELECT
sql_id,
child_number,
sql_fulltext,
elapsed_time,
executions,
round(elapsed_time_avg) elapsed_time_avg
FROM
(
SELECT
command_type,
sql_id,
child_number,
sql_fulltext,
elapsed_time,
cpu_time,
disk_reads,
executions,
( elapsed_time / executions ) elapsed_time_avg
FROM
v$sql
WHERE
executions > 0
order by elapsed_time_avg desc
)
where rownum <=10;
I expect the top 10 most expensive queries in the database to be consistent over time, but when I run the same query again after a while, the SQL_IDs (the results) change.
Your approach is correct. (However, I suggest sorting by ELAPSED_TIME instead of an average, since it's the total run time that matters most. A million fast queries can be worse than one slow query.) But you just have to keep in mind that queries will disappear from V$SQL as they age out of the shared pool, and it's hard to predict exactly how long something will stay in the shared pool.
You might want to look at the active session history, in V$ACTIVE_SESSION_HISTORY, which usually stores many hours' worth of data, and then at DBA_HIST_ACTIVE_SESS_HISTORY, which stores 8 days of data by default. You'll have to adjust your queries, since those two views don't store sums; they store a row for each wait, so you'll need to count the number of rows per SQL_ID to estimate the wait time. (V$ACTIVE_SESSION_HISTORY samples once per second, DBA_HIST_ACTIVE_SESS_HISTORY once every 10 seconds.)
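For example, a minimal sketch of estimating the busiest statements from the last hour of ASH data by counting samples per SQL_ID (note that querying the ASH/AWR views requires the Diagnostics Pack license):
SELECT *
FROM (
    SELECT sql_id,
           COUNT(*) AS ash_samples    -- each sample is roughly 1 second of DB time
    FROM   v$active_session_history
    WHERE  sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR
    AND    sql_id IS NOT NULL
    GROUP BY sql_id
    ORDER BY COUNT(*) DESC
)
WHERE rownum <= 10;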
One of the most important things to realize about tuning SQL is that you're not looking for perfection. You don't want to trace every single statement, or you'll go crazy. If you sample the system every X seconds and a statement doesn't show up, then you almost certainly don't care about that statement. It's fine if slow statements disappear from the top N list.
I want to use full-text search for an autocomplete service, which means I need it to work fast - two seconds max.
The search results are drawn from different tables and so I created a view that joins them together.
The SQL function that I'm using is FREETEXTTABLE().
The query runs very slowly, sometimes up to 40 seconds.
To optimize the query execution time, I made sure the base table has a clustered index on an integer column (and not a GUID).
I have two questions:
First, any additional ideas about how to make the full text search faster? (not including upgrading the hardware...)
Second, how come each time after I rebuild the full-text catalog, the search query runs very fast (less than one second), but only for the first run? The second time I run the query it takes a few seconds more, and it's all downhill from there... any idea why this happens?
The reason why your query is very fast the first time after rebuilding the catalog might be very simple:
When you delete the catalog and rebuild it, the indexes have to be rebuilt, which takes some time. If you run a query before the rebuild is finished, the query is faster simply because there is less data. You should also notice that your query result contains fewer rows.
So testing the query speed only makes sense after rebuilding of the indexes is finished.
The following select might come in handy to check the size (and also the fragmentation) of the indexes. When the size stops growing, the rebuild of the indexes is finished ;)
-- Compute fragmentation information for all full-text indexes on the database
SELECT c.fulltext_catalog_id, c.name AS fulltext_catalog_name, i.change_tracking_state,
i.object_id, OBJECT_SCHEMA_NAME(i.object_id) + '.' + OBJECT_NAME(i.object_id) AS object_name,
f.num_fragments, f.fulltext_mb, f.largest_fragment_mb,
100.0 * (f.fulltext_mb - f.largest_fragment_mb) / NULLIF(f.fulltext_mb, 0) AS fulltext_fragmentation_in_percent
FROM sys.fulltext_catalogs c
JOIN sys.fulltext_indexes i
ON i.fulltext_catalog_id = c.fulltext_catalog_id
JOIN (
-- Compute fragment data for each table with a full-text index
SELECT table_id,
COUNT(*) AS num_fragments,
CONVERT(DECIMAL(9,2), SUM(data_size/(1024.*1024.))) AS fulltext_mb,
CONVERT(DECIMAL(9,2), MAX(data_size/(1024.*1024.))) AS largest_fragment_mb
FROM sys.fulltext_index_fragments
GROUP BY table_id
) f
ON f.table_id = i.object_id
Here's a good resource to check out. However if you really want to improve performance you'll have to think about upgrading your hardware. (I saw a significant performance increase by moving my data and full text index files to separate read-optimized disks and by moving logs and tempdb to separate write-optimized disks -- a total of 4 extra disks plus 1 more for the OS and SQL Server binaries.)
Some other non-hardware solutions I recommend:
Customize the built-in stop word list to define more stop words, thereby reducing the size of your full text index (see the sketch after this list).
Change the file structure of tempdb. See here and here.
If your view performs more than 1 call to FREETEXTTABLE then consider changing your data structure so that the view only has to make 1 call.
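For the first item, a minimal sketch of a custom stoplist; the stoplist name, table name, and added words are illustrative:
-- Start from the system stoplist and add domain-specific noise words.
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST MyStoplist ADD 'lorem' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST MyStoplist ADD 'ipsum' LANGUAGE 'English';

-- Attach the stoplist to the full-text index (triggers a repopulation by default).
ALTER FULLTEXT INDEX ON dbo.MySearchableTable SET STOPLIST MyStoplist;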
However, none of these by themselves is likely to be the silver bullet you're looking for. I suspect there may be other factors here (maybe a poorly performing server, network latency, resource contention on the server...), especially since you said the full text searches get slower with each execution, which is the opposite of what I've seen in my experience.