Slow Query Performance on Large Table - query-optimization

I have a table with 56 million rows.
This table handles a high load of UPSERTs every 5 minutes because it loads streaming data from Kafka: approximately 200-500k updates per load.
When I run a SELECT with an ORDER BY against one of the timestamp columns, it takes 5-7 minutes to return a result.
I tried a clustering key on that column, but because of the heavy DML on the table and the high cardinality of the column itself, clustering was ineffective and costly.
So far, the only thing that has significantly reduced query time (to about 15 seconds) is increasing the warehouse size from Small to X-Large.
I am not convinced that the only solution is to increase the warehouse size. Any advice here would be great!

Clustering on date(timestamp) (or something that's lower cardinality) would be more effective, although because of the volume of updates it will still be expensive.
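To illustrate why truncating to a date lowers cardinality, here is a minimal sketch in plain Python. The data is hypothetical (one event per minute for 30 days) and it says nothing about Snowflake's internals; it only counts distinct values of the raw timestamp versus the date.

```python
from datetime import datetime, timedelta

# Hypothetical data: one event per minute over 30 days.
start = datetime(2024, 1, 1)
events = [start + timedelta(minutes=i) for i in range(30 * 24 * 60)]

# Cardinality of the raw timestamp vs. the timestamp truncated to a date.
distinct_timestamps = len(set(events))
distinct_dates = len({ts.date() for ts in events})

print(distinct_timestamps, distinct_dates)
```

The fewer distinct key values there are, the more rows share a key, so rows with the same date can sit together instead of every row needing its own position in the ordering.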
At a happy hour event, I heard from a Snowflake user who achieved acceptable results in a similar(ish) scenario by clustering on late-arriving facts (e.g. iff(event_date < current_date, true, false)). However, I think they were INSERTing, not UPSERTing; in the latter case the micro-partitions have to be rewritten anyway, so it might not help much.
There are other things to consider too.
Inspect the query plan to confirm that ordering is the problem (e.g. is a lot of time spent on ordering?). Without seeing your actual query, I wonder if the majority of the time is spent on the table scan (when it is fetching the data from remote storage). If a larger warehouse improves performance, this is likely the case, since every added node in the cluster means more micro-partitions can be read concurrently.

Are you running against:
A true timestamp column?
A JSON column cast as a timestamp but with no additional function?
How many fields are in the JSON?
What is the relative ratio of UPDATEs to INSERTs?
Have you looked at the cluster statistics?

Related

Large partition count performance impact

I have tables that have millions of partitions.
Should I reduce the partition count for performance?
In my experience with Spark applications and the Hive query system, too many partitions were bad for performance.
If you do not have auto clustering on the table, it will not be auto defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is loaded to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table) and running some comparable queries to see the difference in compile time.
We have a number of tables for which sorting (and thus auto clustering) does not make sense, because the usage pattern is always a full-table scan. We simply rebuild those tables on a schedule to keep the partition count down, and for us the rebuild cost is worth the performance gain.
As with everything Snowflake you should run a test, and see how it is for you. And monitor hot spots as they can and do change.
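A toy model of the fragmentation argument above, as a rough sketch only. It assumes the worst case where every small write lands in its own partition, and TARGET_ROWS_PER_PARTITION is a made-up capacity, not a Snowflake figure.

```python
import math

# Assumed partition capacity in rows; purely illustrative, not a Snowflake number.
TARGET_ROWS_PER_PARTITION = 100_000


def partitions_after_small_writes(num_writes: int, rows_per_write: int) -> int:
    """Worst case: each write creates its own partition(s) and nothing merges them."""
    return num_writes * math.ceil(rows_per_write / TARGET_ROWS_PER_PARTITION)


def partitions_after_rebuild(total_rows: int) -> int:
    """A full rebuild packs all rows into as few partitions as the capacity allows."""
    return math.ceil(total_rows / TARGET_ROWS_PER_PARTITION)


writes, rows_each = 50_000, 200  # hypothetical workload: many tiny loads
fragmented = partitions_after_small_writes(writes, rows_each)
rebuilt = partitions_after_rebuild(writes * rows_each)

print(fragmented, rebuilt)
```

Under these assumptions the same 10 million rows occupy 50,000 partitions when written in small batches and 100 after a rebuild, which is the kind of gap that would show up in compile time.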
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
It says:
Micro-partitioning is automatically performed on all Snowflake tables.
Tables are transparently partitioned using the ordering of the data as
it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
https://docs.snowflake.com/en/user-guide/table-considerations.html#when-to-set-a-clustering-key
If you read the above link, you can see that it is not a must to define clustering on even large tables to get a good query performance!
As for the original question about reducing the partition count, I also have to say that clustering does not always reduce the number of partitions, but that is another story.

WHERE clause vs Smaller table

Is there a meaningful difference (or a rule of thumb for a given table size) for query time of a table with a WHERE clause limiting the result set compared to a smaller table which is equal to the size of the post-WHERE, limited result set?
For example:
Your table has records with timestamps spanning many years. You run a query that contains a WHERE clause limiting your result to the last 10 days only.
Your table has only 10 days of data, and you run the same query as above (obviously without the WHERE clause since it's not necessary in this case).
Should I expect a query performance difference in the two scenarios above? Note that I'm using Redshift. Obviously there is a $$ cost savings of storing less data, which is one benefit of scenario 2. Any others?
It depends entirely on the table and the indexes (in the case of Redshift, the sort key). Traditionally, if you have a descending index on the timestamp and use the timestamp in the WHERE clause, the query engine will pretty quickly find the records it needs and stop looking.
There may still be some benefit from having fewer records, perhaps even from maintaining two tables, but duplicating data should be a very last resort, used only if testing shows that the performance benefit is real and necessary.
In Redshift, the answer is yes: it is always quicker to query a smaller table than to use a WHERE clause on a larger table. This is because Redshift will generally scan all of the rows in the table, or at least those rows which are not excluded by the distribution/sort key optimisations.
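To make the "rows not excluded by the sort key" point concrete, here is a simplified sketch of min/max block pruning in plain Python. It is not Redshift code: BLOCK_SIZE, the day numbers and the row counts are all hypothetical, and real engines track more than this.

```python
import random

BLOCK_SIZE = 1_000  # hypothetical number of rows stored per block


def build_zone_map(values):
    """Record the (min, max) of each fixed-size block of rows."""
    blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, lower_bound):
    """A block can be skipped only if its max is below the filter's lower bound."""
    return sum(1 for _, block_max in zone_map if block_max >= lower_bound)


random.seed(0)
days = list(range(3650)) * 100   # ~10 years of day numbers, 100 rows per day
cutoff = 3640                    # WHERE day >= cutoff, i.e. the last 10 days

total_blocks = len(build_zone_map(days))
sorted_scan = blocks_to_scan(build_zone_map(sorted(days)), cutoff)

shuffled = days[:]
random.shuffle(shuffled)
unsorted_scan = blocks_to_scan(build_zone_map(shuffled), cutoff)

print(total_blocks, sorted_scan, unsorted_scan)
```

When the rows are stored in timestamp order, the filter touches a single block; when they are stored in random order, almost every block contains at least one matching row and has to be read. In the sorted case the large table behaves much like the small one for this query.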
Let's also address the other important aspects of this question.
In almost all cases Redshift storage is cheap - that is because storage is usually not the deciding factor when capacity planning a Redshift cluster. It is more about getting the performance you need for the queries that you want to run.
You can improve the performance of Redshift queries in 4 ways:
Increase the size of the cluster.
Tune the query.
Alter the definition of the Redshift tables, taking into account contents and usage patterns. Sort and distribution keys can make a big difference. Compression types should also be considered.
Implement Redshift performance management, to give priority to higher-priority queries.

Performance of 100M Row Table (Oracle 11g)

We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields are numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
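A back-of-the-envelope check of that projection, as a sketch under two simplifying assumptions that are not stated in the requirements: claims are incepted evenly across the months, and every claim keeps getting a month-end valuation up to the current date.

```python
def projected_rows(claims_per_year: int, years: int) -> int:
    """Rows = one valuation per claim per month-end since inception.

    With claims spread evenly, the claims incepted m months ago each have m
    valuations, so the total is (claims_per_year / 12) * (1 + 2 + ... + months).
    Integer arithmetic avoids floating-point rounding.
    """
    months = years * 12
    return claims_per_year * months * (months + 1) // 24


low = projected_rows(50_000, 12)
high = projected_rows(100_000, 12)
print(f"{low:,} to {high:,} rows")
```

Under these assumptions 12 years gives roughly 43 to 87 million rows, which is consistent with "on the order of 100 million". Because the total grows with the square of the number of months retained, keeping more years pushes it toward the higher figure quickly.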
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with +100M row tables and can offer tips on how to manage?
Do I leave the database technology problem to IT to solve or should I seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: Expect this to "just work" if you leave the tech problem to IT, especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some Hints:
On the monthly refresh, load the table without non-primary indices, then create them. Search for the sweet spot: how many index creations in parallel work best. In a project with much less data (ca. 10M rows) this reduced load time by 70% compared to the naive "create table with all indices, then load data" approach.
Try to get a grip on the number and complexity of concurrent queries: this influences your hardware decisions (less concurrency = less IO, more CPU).
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32GB. Trade cheap disks for 64GB of RAM and never ever have an IO bottleneck.
Make sure you set the tablespace to read-only.
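The payload estimate in the hints above can be checked quickly. This sketch counts only the raw value fields, using the figures from that hint; keys, per-row overhead and indexes are deliberately left out.

```python
FIELDS = 20            # numeric value fields per row, as assumed in the hint
BYTES_PER_FIELD = 8    # 64 bits
ROWS = 200_000_000

payload_bytes = FIELDS * BYTES_PER_FIELD * ROWS
payload_gb = payload_bytes / 1e9  # decimal gigabytes

print(f"{payload_gb:.0f} GB")
```

That confirms the 32GB figure, so 64GB of RAM leaves headroom for the keys and indexes that this number does not include.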
You could consider anchor modeling approach to store changes only.
Considering that so many rows are expected to be repeats (~95%), bringing the row count from 100M down to only 5M removes most of your concerns. At this point it is mostly a cache consideration: if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though Oracle is in the drop-down too. It can still be used as a code helper.
In terms of supporting code, you will need (minimum)
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
claim_key
valuation_date
ClaimValue
claim_key (fk->Claim.claim_key)
value_key
value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
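A minimal sketch of that two-table layout, using SQLite through Python only so that it runs anywhere (the question is about Oracle). The column types, the snake_case table names and the value names such as paid_loss are illustrative assumptions, and the sketch assumes that claim_key identifies a single Claim row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim (
    claim_key      INTEGER PRIMARY KEY,
    valuation_date TEXT NOT NULL
);
CREATE TABLE claim_value (
    claim_key INTEGER NOT NULL REFERENCES claim (claim_key),
    value_key TEXT    NOT NULL,   -- which measure this row holds
    value     REAL    NOT NULL,
    PRIMARY KEY (claim_key, value_key)
);
""")

# Only the values that actually exist are stored; there are no empty columns.
conn.execute("INSERT INTO claim VALUES (1, '2024-01-31')")
conn.executemany(
    "INSERT INTO claim_value VALUES (?, ?, ?)",
    [(1, "paid_loss", 1200.0), (1, "case_reserve", 300.0)],  # hypothetical measures
)

rows = conn.execute(
    "SELECT value_key, value FROM claim_value WHERE claim_key = ? ORDER BY value_key",
    (1,),
).fetchall()
print(rows)
```

Adding a new measure is then just a new value_key, with no ALTER TABLE. The trade-off is that analytical queries across many measures need a pivot or several joins.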
Using partitioning, and applying the partition key in every query that you run, will give further performance improvements.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if you know that the table is going to be very, very big, try not to apply many constraints on the table and instead handle them in the logic before you perform the operation. Also avoid having many columns on the table, to prevent row chaining issues.

Your first gut feeling on this SqlServer design question

We have 2 tables. One holds measurements, the other holds timestamps (one for every minute).
Every measurement holds a FK to a timestamp.
We have 8M (million) measurements, and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement was received via the replication process, lookup the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (105,000 measurements) we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the joins might actually be faster because there are literally 6 million fewer timestamps to read from the disk.
*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
If the query just retrieves the data for the given date ranges, there will be a merge join, that is, a range scan for each of the two tables. Since the timestamp table presumably contains only the timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries I won't expect a significant difference in performance.
I'd suggest you keep a properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options; there are lots of things that could matter in your particular case.
Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on the database design you have but also on the hardware and on whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget about dumping the cache when you test. Test multiple times and, if possible, under a typical query load.
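A small harness along the lines of "try both", sketched with SQLite through Python rather than SQL Server, so it shows only the shape of the comparison and not realistic timings. All table, column and index names are made up for illustration, and the data is one day of minutes with 4 hypothetical sensors reporting every 5 minutes.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE timestamps (ts_id INTEGER PRIMARY KEY, stamp TEXT NOT NULL);
CREATE TABLE measurement (
    id INTEGER PRIMARY KEY,
    ts_id INTEGER NOT NULL REFERENCES timestamps (ts_id),
    value REAL NOT NULL
);
-- Denormalized variant: the timestamp is copied onto each measurement.
CREATE TABLE measurement_flat (id INTEGER PRIMARY KEY, stamp TEXT NOT NULL, value REAL NOT NULL);
CREATE INDEX idx_timestamps_stamp ON timestamps (stamp);
CREATE INDEX idx_measurement_ts_id ON measurement (ts_id);
CREATE INDEX idx_flat_stamp ON measurement_flat (stamp);
""")

start = datetime(2024, 1, 1)
minutes = [(i + 1, (start + timedelta(minutes=i)).isoformat(sep=" ")) for i in range(1440)]
conn.executemany("INSERT INTO timestamps VALUES (?, ?)", minutes)

m_id = 0
for ts_id, stamp in minutes[::5]:          # a measurement batch every 5 minutes
    for sensor in range(4):                # 4 hypothetical sensors per batch
        m_id += 1
        conn.execute("INSERT INTO measurement VALUES (?, ?, ?)", (m_id, ts_id, float(sensor)))
        conn.execute("INSERT INTO measurement_flat VALUES (?, ?, ?)", (m_id, stamp, float(sensor)))

lo, hi = "2024-01-01 06:00:00", "2024-01-01 12:00:00"
joined = conn.execute(
    "SELECT m.id, t.stamp, m.value FROM measurement m "
    "JOIN timestamps t ON t.ts_id = m.ts_id "
    "WHERE t.stamp >= ? AND t.stamp < ? ORDER BY m.id", (lo, hi)).fetchall()
flat = conn.execute(
    "SELECT id, stamp, value FROM measurement_flat "
    "WHERE stamp >= ? AND stamp < ? ORDER BY id", (lo, hi)).fetchall()

print(len(joined), joined == flat)
```

Both designs return the same rows, so the decision comes down to measured insert and select times on your own data and hardware. Wrapping each query in a timer and scaling the row counts up to the 8M/2M sizes from the question would be the next step.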

DB non-clustered Index on event log date DESC a bad idea?

We have a SQL table that is populated with events from our website (mostly error logging and the like.) The table has several text fields that contain all of the information about the type of event, and a date/time field that shows when the event was logged. The table is fairly large and grows by around 10-100 records per day.
Obviously, when going through this log, we are often looking for the most recent items, so I figured an obvious way to improve our search times would be to add an index to the date field. I figured that while either ASC or DESC would be great, DESC would be better since that's the way we're searching most of the time. Our DB guy said "no way": it would be really bad, because the index table would rapidly become fragmented.
I could see why you wouldn't want to have a clustered index on date DESC, because you'd constantly be trying to insert at the beginning...but I thought with a non-clustered index it would be okay, since the records wouldn't need to be moved around. But what he's saying also makes sense...still would have to move indexes around.
But how much? And how big of a hit would it be? And even if it isn't much of a hit, maybe it's still not worth it because the performance on occasional selects just couldn't improve that much? Thoughts?
I don't think it's a bad idea - quite the contrary!
Not knowing your database system, I can't really be sure why your DB guy would think this would be a bad idea. And even so - even an ascending index on the date will be quite beneficial already (at least in the case of SQL Server).
In this case, if you do frequently query by date and usually will retrieve the most recent ones, this seems like a perfect index to me! Maybe you could make it even better by adding the second most likely selection criteria (log application? log type?) to it, so that if you specify both the date and that second criteria, the search scope would be even more limited within the index.
If I were you, I would try a few sample queries against the table without this index, and then add the non-clustered index on your logdate: first with ASC, and test how your queries perform (check out their execution plans!), then try the index with DESC, and possibly try the index with LogDate and an additional criteria field, too. See how the performance looks.
Marc
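The "check the execution plan before and after" step can be sketched like this. It uses SQLite through Python purely as a stand-in, since the question does not name the database system. The table, column and index names are hypothetical, and the plan text is SQLite's, not what SQL Server would show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log ("
    "id INTEGER PRIMARY KEY, log_date TEXT NOT NULL, log_type TEXT NOT NULL, message TEXT)"
)
conn.executemany(
    "INSERT INTO event_log (log_date, log_type, message) VALUES (?, ?, ?)",
    [(f"2024-01-{d:02d} 12:00:00", "error" if d % 2 else "info", f"event {d}")
     for d in range(1, 31)],
)

query = (
    "SELECT id, log_date, message FROM event_log "
    "WHERE log_date >= ? ORDER BY log_date DESC LIMIT 5"
)
params = ("2024-01-20",)


def plan(sql, args):
    """Return SQLite's query plan as one string (the detail text is column 3)."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, args))


before = plan(query, params)
conn.execute("CREATE INDEX idx_event_log_date ON event_log (log_date DESC)")
after = plan(query, params)

print("before:", before)
print("after: ", after)
newest = conn.execute(query, params).fetchone()
```

Repeating the same comparison with an ASC index, and with a composite index on the date plus a second column such as log_type, follows the same pattern.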
Indexes speed up some queries but slow down all loads. Whether or not an index gives an overall performance improvement depends on how much it speeds up your actual query workload and how much it slows down your actual loading workload (as well as deletes and updates that modify the indexed column).
In many (probably most) applications that involve storing event data, there is a huge amount of loading going on and relatively little querying, which is primarily summary-type queries that don't benefit from indexes. In these sorts of applications, indexes often do more harm than good.
In many such applications, it is possible to do loads during off hours, so even if the index gives an overall slowdown, it might be worth it to increase query speed, because someone is waiting for the query output but no one waits for the load to complete. However, the index can get so large that it overruns the file cache, and each insert then has to read and write a different leaf page from disk. At this point, loads start to require a linear number of random-access disk reads and writes, which can cause a load to take all day.
