Costs vs consistent gets - database

What does it indicate when a query has a low cost in the explain plan but a high consistent gets count in autotrace? In this case the cost was in the hundreds and the consistent gets were in the millions.

The cost can represent two different things depending on the version and whether or not you are running with CPU-based costing.
Briefly, the cost represents the amount of time the optimizer expects the query to take, expressed in units of the time a single block read takes. For example, if Oracle expects a single block read to take 1 ms and the query to take 20 ms, then the cost is 20.
Consistent gets do not match this exactly, for a number of reasons: the cost includes non-consistent (current) gets (e.g. reading and writing temp data), the cost includes CPU time, and a consistent get can be satisfied by a multiblock read instead of a single block read and hence take a different amount of time. Oracle can also get the estimate completely wrong, and the query can end up needing far more or far fewer consistent gets than the estimate suggested.
A useful method that can help explain disconnects between the predicted execution plan and the actual performance is "cardinality feedback". See this presentation: http://www.centrexcc.com/Tuning%20by%20Cardinality%20Feedback.ppt.pdf
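For reference, a minimal SQL*Plus sketch of getting the two numbers for the same statement (the table and predicate are placeholders):

    -- Optimizer's estimate: the COST column in the plan output.
    EXPLAIN PLAN FOR
      SELECT *
      FROM   orders            -- placeholder table
      WHERE  customer_id = 42; -- placeholder predicate
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- Actual work: run the statement under autotrace and read "consistent gets".
    SET AUTOTRACE TRACEONLY STATISTICS
    SELECT *
    FROM   orders
    WHERE  customer_id = 42;
    SET AUTOTRACE OFF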

At best, the cost is the optimizer's estimate of the number of I/Os a query will perform. So, at best, the cost is only likely to be accurate if the optimizer has found a very good plan - and if the optimizer's estimate is correct and the plan is ideal, you are generally never going to bother looking at the plan, because the query is going to perform reasonably well.
Consistent gets, however, are a measure of the work the query actually performed, so they are a much more accurate benchmark to use.
Although there are many, many things that can influence the cost, and a few things that can influence the number of consistent gets, a very low cost combined with a very high number of consistent gets usually means the optimizer is working with poor estimates of the cardinality of the various steps (the ROWS column in the PLAN_TABLE tells you the expected number of rows returned by each step). That may indicate that you have missing or outdated statistics, that you are missing some histograms, that your initialization parameters or system statistics are wrong in some way, or that the CBO has trouble estimating the cardinality of your results for some other reason.
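If stale or missing statistics are the suspect, a minimal sketch of refreshing them, with placeholder schema and table names, looks like this:

    -- Re-gather table (and index) statistics, letting Oracle decide where histograms help.
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname    => 'APP_OWNER',                  -- placeholder schema
        tabname    => 'ORDERS',                     -- placeholder table
        method_opt => 'FOR ALL COLUMNS SIZE AUTO',  -- let Oracle pick histogram candidates
        cascade    => TRUE);                        -- refresh index statistics too
    END;
    /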
What version of Oracle are you using?

Related

Query performance issue in my SELECT query in Oracle database

My query had an ORDERED hint; it was giving the cost and cardinality shown below.
When I removed the ORDERED hint, it started giving the cost and cardinality shown below.
In terms of performance, which plan is better? I can post more details, including the query, if required. I am not asking anybody to do my work for me, but even the smallest suggestion would be really helpful.
Impossible to say which is faster based on cost alone. Cost is only the amount of work the optimizer estimates it will take to execute a query a certain way. This will depend on your statistics and your query (and the optimizer's math). If your statistics don't represent the data, or your query has filters the optimizer can't estimate, you're going to get a misleading cost calculation. What you need to remember is Garbage In, Garbage Out, i.e. bad stats will give you a bad plan.
If you’re putting hints in, generally that means the execution plan that the optimizer came up with wasn’t deemed good enough. In those cases, you’re essentially saying that Oracle’s cost calculation was wrong - so we definitely shouldn’t use it to see which query is faster.
Luckily, you have everything you need to determine which query is faster - you have your database and the queries, you just need to execute them and see.
I suspect neither is particularly fast, but if you want to improve them you're going to need to look at where the work really goes when executing them. The final costs in those plans are very high, so maybe the optimizer has correctly identified an unavoidable (given how the query is written and what structures exist) high-cost operation. Reading over the execution plan yourself and considering how much effort each step would take is always a good idea.
The easy way to begin tuning would be to get the Row Source Execution Statistics for a complete execution and target the parts of the plan responsible for the most actual time. See parts 3 and 4 of https://ctandrewsayer.wordpress.com/2017/03/21/4-easy-lessons-to-enhance-your-performance-diagnostics/ for how to do that - if nothing else, it will give you something you can share so that concrete advice can be given (if you do share it, don't forget to include the full query).
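As a minimal sketch of that approach (the table and predicate are placeholders): run the statement once with the GATHER_PLAN_STATISTICS hint, then pull the plan with actual row counts and timings per step.

    -- 1) Execute the statement with runtime statistics collection enabled.
    SELECT /*+ GATHER_PLAN_STATISTICS */ o.*
    FROM   orders o              -- placeholder table
    WHERE  o.status = 'OPEN';    -- placeholder predicate

    -- 2) Show the plan of the last statement in this session, with estimated (E-Rows)
    --    vs actual (A-Rows) rows and the actual time spent in each step.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));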
Normally a cost comparison is enough to say whether using a hint makes sense. Usually hints make things worse when statistics are gathered properly.
So, the one with the lower query cost is better.
I always look at CPU usage, logical reads (reads from RAM) and physical reads (reads from disk). The better option uses fewer resources.

OPTION (OPTIMIZE FOR (@AccountID=148))

I recently saw this statement in a large, complex production query that is executed a couple of hundred times per minute. It seems the original author was trying to optimize the query for the case where AccountID = 148. There are a total of 811 different account ID values that occur in the table. The value 148 accounts for 9.5% of all rows in the table (~60.1M rows total), the highest share of any account value.
I've never come across anyone doing something like this. It seems to me this only has significant value if, more often than not, the @AccountID parameter is equal to 148. Otherwise, the query plan could assume more rows are being returned than actually are, and a scan might be performed instead of a seek.
So, is there any practical value to doing this in this particular scenario?
Assume the extreme case where account 148 alone covers 10% of the table and every other account covers just 0.001% each. Given that that account represents 10% of your data, it also stands to reason it will be searched for more often than the other accounts. Now imagine that for any other account a nested loop join over a small number of rows would be really fast, but for account 148 it would be hideously slow and a hash join would be the superior choice.
Further imagine that, by some stroke of bad luck, the first query that comes in after a reboot/plan recycle is for an account other than 148. You are now stuck with a plan that performs extremely poorly 10% of the time, even though it also happens to be really good for the other data. In this case, you may well want the optimizer to stick with the plan that isn't a disaster 10% of the time, even if that means it's slightly less than optimal the other times. This is where OPTIMIZE FOR comes in.
Alternatives are OPTIMIZE FOR UNKNOWN (if your distribution is far more even, but you need to protect against an accidental plan lock-in from unrepresentative parameters), RECOMPILE (which can have a considerable performance impact for frequently executed queries), explicit hints (FORCESEEK, OPTION (HASH JOIN), etc.), adding fine-grained custom statistics (CREATE STATISTICS) and splitting queries (IF @AccountID = 148 ...), possibly in combination with a filtered index. But aside from all these, OPTIMIZE FOR @Param = <specific non-representative value> certainly has a place.
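For illustration only, a minimal sketch of how the hint reads in a parameterized query; the procedure, table and column names here are hypothetical:

    -- The plan is compiled as if @AccountID were 148, whatever value is actually passed in.
    CREATE OR ALTER PROCEDURE dbo.GetOrdersByAccount
        @AccountID int
    AS
    BEGIN
        SELECT OrderID, OrderDate, Amount
        FROM   dbo.Orders
        WHERE  AccountID = @AccountID
        OPTION (OPTIMIZE FOR (@AccountID = 148));
        -- Alternatives discussed above:
        --   OPTION (OPTIMIZE FOR UNKNOWN)  -- plan on average density, ignore the sniffed value
        --   OPTION (RECOMPILE)             -- new plan per execution, costly at this call rate
    END;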

Does scalability with databases have a limit?

Assuming that the larger a database gets, the longer it will take to SELECT rows, won't a database eventually take too long (i.e. annoying to users) to traverse regardless of how optimized it is?
Is it simply a matter of the increasing time being so negligible that there is only a theoretical limit, but no realistic one?
Well, yes, in a manner of speaking. Generally, the more data you have, the longer it will take to find what you're looking for.
There are ways to dramatically reduce that time (indexing, sharding, etc), and you can always add more hardware. Indexing especially saves you from scanning the whole table to find your result. If you've got a simple B-tree index, the worst case should be O(log n).
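As a trivial illustration with made-up names: an indexed lookup walks the B-tree to the matching rows instead of scanning the whole table.

    -- Without an index on email, this predicate forces a full table scan.
    -- With the index, the lookup is roughly O(log n) page reads.
    CREATE INDEX idx_users_email ON users (email);

    SELECT id, name
    FROM   users
    WHERE  email = 'someone@example.com';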
Apart from theoretical limits, there are also practical ones, for example maximum number of rows per table, but these days those limits are so high that you can almost ignore them.
I wouldn't worry about it. If you're using a decent DBMS and decent hardware... with realistic amounts of data, you can always find a way to return a result in an acceptable amount of time. If you do reach the limits, chances are that you're making money from what you've got stored, and then you can always hire a pro to help you out ;)

Processing large amounts of data quickly

I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.
Currently, I'm using MySQL with a user-defined function to score and rank the data, and then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance of around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.
I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:
The data for each row that is relevant to the scoring process is about 50 bytes per item.
Inserts and updates to the DB are negligible.
Each score is independent of the others, so scores can be computed in parallel.
Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
The method should scale well for multiple simultaneous users
The fewer computing resources this requires, in terms of number of servers, the better.
Thanks
A feasible approach seems to be to load (and later update) all the data into RAM, about 1 GB (20M rows x 50 bytes), and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.
The scoring must be relatively simple for this approach, because your requirements leave only about a tenth of a microsecond per row (2 seconds / 20M rows = 100 ns) for scoring and ranking without parallelization or optimization.
If you could post the query you are having issues with, that would help.
In the meantime, here are some general things to check.
Make sure you have indexes created on the relevant columns.
Make sure your queries are optimized, and use joins instead of inner (sub)queries where possible, as in the sketch below.
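As a rough sketch of the second point, with made-up tables (the rewrite is equivalent here because customers.id is unique):

    -- Before: an IN-subquery that older MySQL versions may execute inefficiently.
    SELECT o.id, o.total
    FROM   orders o
    WHERE  o.customer_id IN (SELECT c.id FROM customers c WHERE c.region = 'EU');

    -- After: a plain join that can use indexes on customers.id and orders.customer_id.
    SELECT o.id, o.total
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  c.region = 'EU';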
Based on your criteria, the possibility of improving performance would depend on whether or not you can use the input criteria to pre-filter the number of rows for which you need to calculate scores. I.e. if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filtering first would improve performance. If none of the parameters have that characteristic, then you may need either much more hardware or a database with higher performance.
I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approaches), you should try for some serious hardware optimization. Max out the memory on your SQL servers, and try to fit your tables into memory where possible. Use an SSD for your table / index storage, for speedy deserialization. If you're clustered, crank up the networking to the highest feasible network speeds.

How do you determine a SQL Fill Factor value

Usually when I'm creating indexes on tables, I generally guess what the Fill Factor should be based on an educated guess of how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very high write) to 95% (very high read).
It's a bit of an art form.
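For reference, a minimal SQL Server sketch of setting a fill factor and then checking how full the leaf pages actually end up (the index and table names are hypothetical):

    -- Rebuild a hypothetical index leaving ~10% free space on each leaf page.
    ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 90);

    -- After a representative write workload, check page fullness and fragmentation.
    SELECT index_id,
           avg_page_space_used_in_percent,
           avg_fragmentation_in_percent
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'DETAILED');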
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.
