Needed index to make this SQL query run faster - sql-server

here is the query I am stuck with:
SELECT *
FROM   customers
WHERE  salesmanid = @salesrep
   OR  telephonenum IN (SELECT telephonenum
                        FROM   salesmancustomers
                        WHERE  salesmanname = @salesrepname)
ORDER BY customernum
It is SLOW and pegging my CPU at 99%. I know an index would help, but I'm not sure what kind, or whether it should be 2 indexes or 1 index with both columns included.

Probably three indexes, each on a single column (sketched below). This is assuming that your queries are all quite selective relative to the size of the tables.
It would help if you told us what your table schemas are, along with details of existing indexes (your PKs will get a clustered index by default if you don't specify otherwise) and some details about table size / selectivity.
Customers
    SalesmanId
    TelephoneNum
SalesmanCustomers
    SalesmanName
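A minimal sketch of those indexes, assuming the table and column names above (the index names are illustrative):
CREATE NONCLUSTERED INDEX IX_Customers_SalesmanId ON Customers (SalesmanId);
CREATE NONCLUSTERED INDEX IX_Customers_TelephoneNum ON Customers (TelephoneNum);
CREATE NONCLUSTERED INDEX IX_SalesmanCustomers_SalesmanName ON SalesmanCustomers (SalesmanName);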

Take a look at the Query Execution Plan and see if there are any table scans going on. This will help you identify what indexes you need.

I suppose that in addition to the columns suggested by @Martin, an index on CustomerNum is also required, since it's used in the ORDER BY clause.
If you have lots of records, the ORDER BY is something that takes a lot of time. You can also try running the query without the ORDER BY and see how long it takes.

Related

Optimization Strategies to perform aggregate on 4 Bil records

So I have a table of 4.7 billion records on which I want to perform a GROUP BY count in Postgres using pgAdmin 4.
Obviously this is going to take a lot of time, and I want to speed up the process as much as possible.
Example query:
UPDATE Target_Table TT
SET    qty = AA.cnt
FROM  (SELECT col_1, COUNT(col_1) AS cnt
       FROM   Very_Large_Table
       GROUP  BY col_1) AS AA
WHERE  AA.col_1 = TT.col_1
  AND  AA.cnt <> TT.qty;
I have freshly created/analyzed indexes on the column col_1, yet the process still takes 2 hours.
I tried parallel hints by adding /*+ PARALLEL (very_large_table 6) */ to the SELECT, but it seems a different syntax is required, as the explain plan still shows 2 workers.
I cannot create partitioning on this table.
Any help is greatly appreciated, as I am out of ideas now. This is choking the system, and other applications are getting impacted.
Edit: Thanks everyone for all the help, but I am looking for ideas to mitigate the problem, as I am quite sure nothing I write/change directly in pgAdmin will help me here.
Sometimes there are situations where the DB doesn't have any functionality or capability to help us solve the problem. In those cases, we have to think of logical workarounds. For example, suppose we need the number of records in a table. If we don't need an exact count and an approximate count is enough, we can get very high performance by reading this count from the information schema. So, if calculating the count of a table with 4 billion records takes 1-5 minutes, from the information schema we can get it in about 1 millisecond.
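In Postgres specifically, the usual place to read that estimate is the pg_class catalog; a minimal sketch, assuming the table from the question is named very_large_table:
SELECT reltuples::bigint AS approx_rows
FROM   pg_class
WHERE  relname = 'very_large_table';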
Now, the information schema will not help us in the matter you wrote about, because it only gives the record count of the entire table; the grouping prevents us here. But you can use a materialized view if you don't need exact counts. You can refresh this materialized view on a schedule every night and use it during the day.
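A minimal sketch of that materialized view, using the table and column names from the question (the unique index is only needed if you want to REFRESH ... CONCURRENTLY, which lets queries keep reading the old data during the refresh):
CREATE MATERIALIZED VIEW very_large_table_counts AS
SELECT col_1, COUNT(*) AS cnt
FROM   Very_Large_Table
GROUP  BY col_1;

CREATE UNIQUE INDEX ON very_large_table_counts (col_1);

-- scheduled nightly, e.g. from cron:
REFRESH MATERIALIZED VIEW CONCURRENTLY very_large_table_counts;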
Again, after getting to know the issue in depth and knowing the business logic in detail, we could think of different alternative solutions. All I can say is that in all DBs, counting is a slow-running process on very large tables.
From your question, I guess you need to run queries on your Target_Table where each row of your result set shows how many rows in Very_Large_Table have the same value of col_1 as the present row. Your plan seems to be to populate your qty column with that number.
With respect, your present approach is impractical.
As you have discovered, it takes too long to populate qty.
The dbms write-locks your tables while you populate qty; otherwise insertions, updates, and deletions would interfere with the UPDATE query in your question. The locks interfere with other workloads on your dbms.
Your qty values will become stale between times you run your query. Therefore you must treat them as approximate values in your application.
There are other ways to generate the qty value you need, when querying. For example, this retrieves your qty at the time of your query.
SELECT TT.*, COALESCE(q.qty, 0) AS qty
FROM   Target_Table TT
LEFT JOIN (
       SELECT col_1, COUNT(*) AS qty
       FROM   Very_Large_Table
       GROUP  BY col_1
) q ON TT.col_1 = q.col_1
WHERE  something;
It seems very likely that you always, or almost always, use a WHERE something filter clause when querying. I suspect it's very rare to query the whole table. The presence of a filter, any filter, reduces the number of col_1 values for which you need qty. If the filter is selective enough it reduces that number to a workable value. This approach has the benefit that qty is always up to date.
This index accelerates the query I propose.
CREATE INDEX col_1 ON Very_Large_Table (col_1);
There's another, more elaborate, approach. If you decide to go this direction, give it a try and ask another question if you need help.
Create a separate table Col_1_Counts_From_Very_Large_Table with col_1 and qty columns. Put indexes on (col_1, qty) and (qty, col_1).
Populate it, during some system downtime, with a query like the one in your question.
Create triggers for insert, update, and delete on your Very_Large_Table. Have those triggers update the qty values in the appropriate rows in Col_1_Counts_From_Very_Large_Table. This keeps your counts table up-to-date whenever somebody changes rows in Very_Large_Table.
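A rough sketch of such triggers, assuming Postgres 11+ and a unique index on Col_1_Counts_From_Very_Large_Table (col_1) so ON CONFLICT can be used (the function and trigger names are illustrative, not a drop-in implementation):
CREATE OR REPLACE FUNCTION maintain_col_1_counts() RETURNS trigger AS $$
BEGIN
    -- add one to the count for the new value on INSERT, or on UPDATE of col_1
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO Col_1_Counts_From_Very_Large_Table (col_1, qty)
        VALUES (NEW.col_1, 1)
        ON CONFLICT (col_1) DO UPDATE
            SET qty = Col_1_Counts_From_Very_Large_Table.qty + 1;
    END IF;
    -- subtract one from the count for the old value on DELETE, or on UPDATE of col_1
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE Col_1_Counts_From_Very_Large_Table
        SET    qty = qty - 1
        WHERE  col_1 = OLD.col_1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_maintain_col_1_counts
AFTER INSERT OR DELETE OR UPDATE OF col_1 ON Very_Large_Table
FOR EACH ROW EXECUTE FUNCTION maintain_col_1_counts();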
Then when you run your query do it like this:
SELECT TT.*, COALESCE(q.qty, 0) AS qty
FROM   Target_Table TT
LEFT JOIN Col_1_Counts_From_Very_Large_Table q ON TT.col_1 = q.col_1
WHERE  something;
This approach will be accurate and fast at query time, at the cost of trigger complexity and small slowdowns on inserts, updates, and deletes to your very large table. It also lets you do things like find MAX(qty) or ORDER BY qty DESC without dimming the lights in your data center.

SQL Server - what kind of index should I create?

I need to make queries such as
SELECT
    Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM
    myTable
WHERE
    Url LIKE '%/myController/%'
GROUP BY
    Url
run as fast as possible.
The columns selected and grouped are almost always the same, the only difference being an extra column in the SELECT and GROUP BY (the column tenantId).
What kind of index should I create to help me run this scenario?
Edit 1:
If I change my base query to '/myController/%' (note there's no % at the beginning), would it be better?
This is a query that cannot be sped up with an index. The DBMS cannot know beforehand how many records will match the condition. It may be 100% or 0.001%. There is no clue for the DBMS to guess this. And access via an index only makes sense when a small percentage of rows gets selected.
Moreover, how can such an index be structured and useful? Think of a telephone book and you want to find all names that contain 'a' or 'rs' or 'ems' or whatever. How would you order the names in the book to find all these and all other thinkable letter combinations quickly? It simply cannot be done.
So the DBMS will read the whole table record for record, no matter whether you provide an index or not.
There may be one exception: with an index on Url and TS, you'd have both columns in the index. So the DBMS might decide to read the whole index rather than the whole table. This may make sense, for instance, when the table has hundreds of columns or when the table is very fragmented, or whatever. I don't know; a table is usually much easier to read sequentially than an index. You can still just try, of course. It doesn't really hurt to create an index; either the DBMS uses it for a query or it doesn't.
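A minimal sketch of such an index, assuming the column names from the question (the index name is illustrative); it can turn the full table scan into a scan of a much narrower index, though every row is still examined:
CREATE INDEX IX_myTable_Url_TS ON myTable (Url) INCLUDE (TS);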
Columnstore indexes can be quite fast at such tasks (aggregates over full scans). But even they will have trouble handling a LIKE '%/myController/%' predicate. I recommend you parse the URL once into an additional computed field that holds the controller extracted from the URL. But the truth is that looking at the global time spent on a response URL reveals very little information: it will contain data since the beginning of time, long since made obsolete by newer deployments, and it will not capture recent trends. A filter based on time, say per hour or per day, now that is a very useful analysis. And such a filter can be served excellently by a columnstore, because of the natural time order and segment elimination.
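A hedged sketch of such a columnstore index (nonclustered; the column names are taken from the question, and you would add a date/time column here if the table has one):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_myTable
ON myTable (Url, TS, tenantId);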
Based on your posted query, you should have an index on the Url column. In general, columns which are involved in WHERE, HAVING, ORDER BY, and JOIN ON conditions should be indexed.
You should get the generated query plan for the said query and see where it's taking the most time. Also, based on the datatype of the Url column, you may consider having a FULLTEXT index on that column.

Does the ORDER BY clause slow down the query?

Does the ORDER BY clause slow down query performance? How much does it affect the query when the column is indexed versus when it is not?
@JamesZ is correct. There are many things to consider when adding an ORDER BY clause to your query. For instance, a SELECT TOP 10 * FROM dbo.Table ORDER BY field against, let's say, 10,000,000 rows would cause the query to spill into tempdb as it spools the entire table there, and only after sorting by your non-indexed field would it return the 10 rows. If you did the same SELECT without the sort, the results would return almost immediately.
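A quick way to see this for yourself (hypothetical table and column names): compare the actual execution plans and timings of these two statements; the first carries a Sort operator that must consume every row before TOP can return anything:
SELECT TOP (10) * FROM dbo.SomeTable ORDER BY NonIndexedField; -- sorts all rows first
SELECT TOP (10) * FROM dbo.SomeTable;                          -- returns the first 10 rows it finds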
It's very important to know how your tables are indexed before issuing an ORDER BY clause. Ctrl+M (Include Actual Execution Plan) is your friend in SSMS.
If there is a Sort operator in the query plan (and there is more than 1 row), yes, it has an effect. If the data is already in the order you need it (either from the clustered index, or from a non-clustered index that has all the fields the query needs in the correct order), the sort might not be needed, but other operations in the plan might still require sorting to make sure the data ends up in the correct order.
How much does it affect the query? Well, test it: take the sort away and compare the performance.
It is better to test. You can use the actual execution plan.
Here is a simple ORDER BY where the additional cost is 4 times the original one.

Is My T-SQL Query Written Efficiently?

SELECT o.oxxxxID,
m.mxxxx,
txxxx,
exxxxName,
paxxxxe,
fxxxxe,
pxxxx,
axxxx,
nxxxx,
nxxxx,
nxxxx,
ixxxx,
CONVERT(VARCHAR, o.dateCreated, 103)
FROM Offer o INNER JOIN Mxxxx m ON o.mxxxxID = m.mxxxxID
INNER JOIN EXXXX e ON e.exxxxID = o.exxxxID
INNER JOIN PXXXX p ON p.pxxxxxxID = o.pxxxxID
INNER JOIN Fxxxx f ON f.fxxxxxID = o.fxxxxxID
WHERE o.cxxxxID = 11
The above query is expected to be executed via the website by approximately 1000 visitors daily. Is it badly written, with a high chance of causing poor performance? If yes, can you please suggest how to improve it.
NOTE: every table has only one index (the primary key).
Looks good to me.
Now for the performance piece you need to make sure you have the proper indexes covering the columns you are filtering and joining (Foreign Keys, etc).
A good start would be to look at the actual execution plan or, the easy route, run it through the Index Tuning Wizard.
The actual execution plan in SQL 2008 (perhaps 2005 as well) will already give you missing-index hints at the top.
It's hard to tell without knowing the content of the data, but it looks like a perfectly valid SQL statement. The many joins will likely degrade performance a bit, but you can use a few strategies for improving performance... I have a few ideas.
Indexed views can often improve performance.
Stored procedures will optimize the query for you and save the optimized query plan.
Or, if possible, create a one-off table that's not live but contains the data from this statement, only in a non-normalized format. This one-off table would need to be updated regularly, but you can get some huge performance boosts using this strategy if it's possible in your situation.
For general performance issues and ideas, this is a good place to start, if you haven't already: http://msdn.microsoft.com/en-us/library/ff647793.aspx
This one is very good as well: http://technet.microsoft.com/en-us/magazine/2006.01.boostperformance.aspx
That would depend mostly on the keys and indexes defined on the tables. If you could provide those, a better answer could be given. While the query looks OK (other than the xxx's in all the names), if you're joining on fields with no indexes, or the field in the WHERE clause has no index, then you may run into performance issues on larger data sets.
It looks pretty good to me. Probably the only improvement I might make is to output o.datecreated as is and let the client format it.
You could also add indexes to the join columns.
There may also be a potential to create an indexed view if performance is an issue and space isn't.
Actually, your query looks perfectly well written. The only point we can't know is whether indexes and keys exist on the columns you are using in the JOINs and the WHERE statement. Other than that, I don't see anything that can be improved.
If you only have single indexes on the primary keys, then it is unlikely the indexes will be covering for all the data output in your select statement. So what will happen is that the query can efficiently locate the rows for each primary key but it will need to use bookmark lookups to find the data rows and extract the additional columns.
So, although the query itself is probably fine (except for the date conversion) as long as all these columns are truly needed in the output, the execution plan could probably be improved by adding additional columns to your indexes. A clustered index key is not allowed to have included columns, and the clustered index is probably also what enforces your primary key, and you are unlikely to want to add other columns to your primary key, so this would mean creating an additional non-clustered index with the PK column first and then including additional columns.
At this point the indexes will cover the query and it will not need to do the bookmark lookups. Note that the indexes need to support the most common usage scenarios and that the more indexes you add, the slower your write performance will be, since all the indexes will need to be updated.
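A hedged sketch of what that could look like, using the obfuscated names from the query above (which columns actually belong in each INCLUDE list depends on your real schema):
CREATE NONCLUSTERED INDEX IX_Mxxxx_Covering
ON Mxxxx (mxxxxID)
INCLUDE (mxxxx);

CREATE NONCLUSTERED INDEX IX_Offer_cxxxxID
ON Offer (cxxxxID)
INCLUDE (mxxxxID, exxxxID, pxxxxID, fxxxxxID, dateCreated);
The first follows the pattern described above (PK column first, output columns included); the second covers the WHERE o.cxxxxID = 11 filter on the Offer table itself.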
In addition, you might also want to review your constraints, since these can be used by the optimizer to eliminate joins if a table is not used for any output columns when the optimizer can determine there will not be an outer join or cross join which would eliminate or multiply rows.

Can Multiple Indexes Work Together?

Suppose I have a database table with two fields, "foo" and "bar". Neither of them is unique, but each of them is indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all rows where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n), where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results searched. This would be O(m), where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways: it can convert the b-tree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8*0.1 = 8% of the table rows. However, this may not be correct: the query may actually return 10% of the rows or even 0% of the rows, depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table, it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimation correct.
In 11g you can collect multicolumn statistics to help with this situation, I believe. In 9i and 10g you can use dynamic sampling to get a very good estimate of the number of rows to be retrieved.
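For the 11g route, a sketch of multicolumn (extended) statistics on the two columns; the table and column names come from the question, and the exact DBMS_STATS calls are worth checking against the documentation for your version:
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(USER, 'SOMETABLE', '(FOO, BAR)') FROM DUAL;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SOMETABLE', method_opt => 'FOR COLUMNS (FOO, BAR) SIZE AUTO')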
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the seminal points but quite often oversimplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented: to handle the times when unknown groups of columns will be used in a WHERE clause.
Just in case someone says that BMIs are for low-cardinality columns only and may not apply to your case: low is probably not as small as you think. The only real issue is concurrency of DML against the table; it must be single-threaded or rare for this to work.
Yes, you can give "hints" to Oracle with the query. These hints are disguised as comments ("/*+ HINT */") to the database and are mainly vendor specific, so a hint for one database will not work on another database.
I would use index hints here, the first hint for the small table. See here.
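For example, a sketch of such a hint (ix_bar is a hypothetical name for the index on bar):
SELECT /*+ INDEX(sometable ix_bar) */ *
FROM sometable
WHERE foo = 'hello' AND bar = 'world';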
On the other hand, if you often search over these two fields, why not create an index on both of them? I do not have the exact syntax at hand, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the combination is unique, then you simply create a unique index, which should be lightning fast.
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots'o' very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also covers bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as the one in the example.
It's better than that.
Index Seeks are always quicker than full table scans. So behind the scenes Oracle (and SQL server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (seeing that it's an inner join), and it will iterate the shorter range to find the matches with the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX, IGNORE INDEX, or FORCE INDEX (see here for more details). For best performance, though, you should use a combined index.
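For instance, a sketch of the MySQL form (idx_bar is a hypothetical index name):
SELECT * FROM sometable USE INDEX (idx_bar)
WHERE foo = 'hello' AND bar = 'world';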
