I have this webapplication tool which queries over data and shows it in a grid. Now a lot of people use it so it has to be quite performant.
The thing is, I needed to add a couple of extra fields through joins and now it takes forever for the query to run.
If I in sql server run the following query:
select top 100 *
from bam_Prestatie_AllInstances p
join bam_Zending_AllRelationships r on p.ActivityID = r.ReferenceData
join bam_Zending_AllInstances z on r.ActivityID = z.ActivityID
where p.PrestatieZendingOntvangen >= '2010-01-26' and p.PrestatieZendingOntvangen < '2010-01-27'
This takes about 35-55seconds, which is waaay too long. Because this is only a small one.
If I remove one of the two date checks it only takes 1second. If I remove the two joins it also takes only 1 second.
When I use a queryplan on this I can see that 100% of the time is spend on the indexing of the PrestatieZendingOntvangen field. If I set this field to be indexed, nothing changes.
Anybody have an idea what to do?
Because my clients are starting to complain about time-outs etc.
Thanks
Besides the obvious question of an index on the bam_Prestatie_AllInstances.PrestatieZendingOntvangen column, also check if you have indices for the foreign key columns:
p.ActivityID (table: bam_Prestatie_AllInstances)
r.ReferenceData (table: bam_Zending_AllRelationships)
r.ActivityID (table: bam_Zending_AllRelationships)
z.ActivityID (table: bam_Zending_AllInstance)
Indexing the foreign key fields can help speed up JOINs on those fields quite a bit!
Also, as has been mentioned already: try to limit your fields being selected by specifying a specific list of fields - rather than using SELECT * - especially if you join several tables, just the sheer number of columns you select (multiplied by the number of rows you select) can cause massive data transfer - and if you don't need all those columns, that's just wasted bandwidth!
Specify the fields you want to retrieve, rather than *
Specify either Inner Join or Outer Join
Try between?
where p.PrestatieZendingOntvangen
between '2010-01-26 00:00:00' and '2010-01-27 23:00:00'
Have you placed an indexes on the date fields in your Where clause.
If not, I would create a INDEX on that fields to see if it makes any differences to yourv time.
Of course, Indexes will take up more disk space so you will have to consider the impact of that extra index.
EDIT:
The others have also made good points about specifing what columns you require in the Select instead of * (wildcard), and placing more indexes on foreign keys etc.
Someone from DB background can clear my doubt on this.
I think, you should specify date in the style in which the DB will be able to understand it.
for e.g. Assuming, the date is stored in mm/dd/yyyy style inside the table & your query tries to put in a different style of date for comparison (yyyy-mm-dd), the performance will go down.
Am I being too naive, when I assume this?
How many columns does bam_Prestatie_AllInstances and the other tables have? It looks like you are oulling all columns and that can definitely be a performance issue.
Have you tried to select specific columns from specific tables such as:
select top 100 p.column1, p.column2, p.column3
Instead of querying all columns as you are currently doing:
select top 100 *
Related
I need help using SUM and GROUP BY in SQL Server.
I am generating the query based on 5 tables. I have tried in SQL Server.
Some parts of the query are working, but when I advance the query, I get the wrong results/data.
The problem is that the data is processed twice instead of once on every group by field, e.g. farmer_ID, where the farmer bears or has two or more records.
This happens when i add more tables to the join - on one or two tables, the sum values are okay.
Hence I get farmer_sales = 200 instead of 100.
Kindly let me know how I can get some help
Thanks
David
You can use the outer join (Left or Right) and choose the table that have only one record for each item
Another solution you can use Keyword Distinct before the column name
Nobody here will be able to help without table definitions, the requirements & the query.
With these issues I find it helpful to write the query so you get the proper rows you need. It sounds like you have either dodgy data or incomplete join conditions but it's not possible to tell that without the above. You can debug your data by finding a problem (e.g. farmer_sales) and working through the raw data & query from there. You will have an incomplete PK/FK relationship in your query or a missing constraint allowing bad data. Or you have misunderstood the requirement or the requirement does not makes sense to the data model.
Once you have the query working correctly you can add the aggregations.
One bit of general advice I can give is that adding DISTINCT is almost always the wrong approach.
My question concerns Oracle 11g and the use of indexes in SQL queries.
In my database, there is a table that is structured as followed:
Table tab (
rowid NUMBER(11),
unique_id_string VARCHAR2(2000),
year NUMBER(4),
dynamic_col_1 NUMBER(11),
dynamic_col_1_text NVARCHAR2(2000)
) TABLESPACE tabspace_data;
I have created two indexes:
CREATE INDEX Index_dyn_col1 ON tab (dynamic_col_1, dynamic_col_1_text) TABLESPACE tabspace_index;
CREATE INDEX Index_unique_id_year ON tab (unique_id_string, year) TABLESPACE tabspace_index;
The table contains around 1 to 2 million records. I extract the data from it by executing the following SQL command:
SELECT distinct
"sub_select"."dynamic_col_1" "AS_dynamic_col_1","sub_select"."dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM
(
SELECT "tab".* FROM "tab"
where "tab".year = 2011
) "sub_select"
Unfortunately, the query needs around 1 hour to execute, although I created the both indexes described above.
The explain plan shows that Oracle uses a "Table Full Access", i.e. a full table scan. Why is the index not used?
As an experiment, I tested the following SQL command:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
Even in this case, the index is not used and a full table scan is performed.
In my real database, the table contains more indexed columns like "dynamic_col_1" and "dynamic_col_1_text".
The whole index file has a size of about 50 GB.
A few more informations:
The database is Oracle 11g installed on my local computer.
I use Windows 7 Enterprise 64bit.
The whole index is split over 3 dbf files with about 50GB size.
I would really be glad, if someone could tell me how to make Oracle use the index in the first query.
Because the first query is used by another program to extract the data from the database, it can hardly be changed. So it would be good to tweak the table instead.
Thanks in advance.
[01.10.2011: UPDATE]
I think I've found the solution for the problem. Both columns dynamic_col_1 and dynamic_col_1_text are nullable. After altering the table to prohibit "NULL"-values in both columns and adding a new index solely for the column year, Oracle performs a Fast Index Scan.
The advantage is that the query takes now about 5 seconds to execute and not 1 hour as before.
Are you sure that an index access would be faster than a full table scan? As a very rough estimate, full table scans are 20 times faster than reading an index. If tab has more than 5% of the data in 2011 it's not surprising that Oracle would use a full table scan. And as #Dan and #Ollie mentioned, with year as the second column this will make the index even slower.
If the index really is faster, than the issue is probably bad statistics. There are hundreds of ways the statistics could be bad. Very briefly, here's what I'd look at first:
Run an explain plan with and without and index hint. Are the cardinalities off by 10x or more? Are the times off by 10x or more?
If the cardinality is off, make sure there are up to date stats on the table and index and you're using a reasonable ESTIMATE_PERCENT (DBMS_STATS.AUTO_SAMPLE_SIZE is almost always the best for 11g).
If the time is off, check your workload statistics.
Are you using parallelism? Oracle always assumes a near linear improvement for parallelism, but on a desktop with one hard drive you probably won't see any improvement at all.
Also, this isn't really relevant to your problem, but you may want to avoid using quoted identifiers. Once you use them you have to use them everywhere, and it generally makes your tables and queries painful to work with.
Your index should be:
CREATE INDEX Index_year
ON tab (year)
TABLESPACE tabspace_index;
Also, your query could just be:
SELECT DISTINCT
dynamic_col_1 "AS_dynamic_col_1",
dynamic_col_1_text "AS_dynamic_col_1_text"
FROM tab
WHERE year = 2011;
If your index was created solely for this query though, you could create it including the two fetched columns as well, then the optimiser would not have to go to the table for the query data, it could retrieve it directly from the index making your query more efficient again.
Hope it helps...
I don't have an Oracle instance on hand so this is somewhat guesswork, but my inclination is to say it's because you have the compound index in the wrong order. If you had year as the first column in the index it might use it.
Your second test query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
would not use the index because you have no WHERE clause, so you're asking Oracle to read every row in the table. In that situation the full table scan is the faster access method.
Also, as other posters have mentioned, your index on YEAR has it in the second column. Oracle can use this index by performing a skip scan, but there is a performance hit for doing so, and depending on the size of your table Oracle may just decide to use the FTS again.
I don't know if it's relevant, but I tested the following query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
WHERE "dynamic_col_1" = 123 AND "dynamic_col_1_text" = 'abc'
The explain plan for that query show that Oracle uses an index scan in this scenario.
The columns dynamic_col_1 and dynamic_col_1_text are nullable. Does this have an effect on the usage of the index?
01.10.2011: UPDATE]
I think I've found the solution for the problem. Both columns dynamic_col_1 and dynamic_col_1_text are nullable. After altering the table to prohibit "NULL"-values in both columns and adding a new index solely for the column year, Oracle performs a Fast Index Scan. The advantage is that the query takes now about 5 seconds to execute and not 1 hour as before.
Try this:
1) Create an index on year field (see Ollie answer).
2) And then use this query:
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID (SELECT ID FROM tab WHERE year=2011)
or
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID (SELECT ID FROM tab WHERE year=2011)
GROUP BY dynamic_col_1, dynamic_col_1_text
Maybe it will help you.
Few days ago I wrote one query and it gets executes quickly but now a days it takes 1 hrs.
This query run on my SQL7 server and it takes about 10 seconds.
This query exists on another SQL7 server and until last week it took about
10 seconds.
The configuration of both servers are same. Only the hardware is different.
Now, on the second server this query takes about 30 minutes to extract the s
ame details, but anybody has changed any details.
If I execute this query without Where, it'll show me the details in 7
seconds.
This query still takes about same time if Where is problem
Without seeing the query and probably the data I can't do a lot other than offer tips.
Can you put more constraints on the query. If you can reduce the amount of data involved then this will speed up the query.
Look at the columns used in your joins, where and having clauses and order by. Check that the tables that the columns belong to contain indices for these columns.
Do you need to use the user defined function or can it be done another way?
Are you using subquerys? If so can these be pulled out into separate views?
Hope this helps.
Without knowing how much data is going into your tables, and not knowing your schema, it's hard to give a definitive answer but things to look at:
Try running UPDATE STATS or DBCC REINDEX.
Do you have any indexes on the tables? If not, try adding indexes to columns used in WHERE clauses and JOIN predicates.
Avoid cross table OR clauses (i.e, where you do WHERE table1.col1 = #somevalue OR table2.col2 = #someothervalue). SQL can't use indexes effectively with this construct and you may get better performance by splitting the query into two and UNION'ing the results.
What do your functions (UDFs) do and how are you using them? It's worth noting that dropping them in the columns part of a query gets expensive as the function is executed per row returned: thus if a function does a select against the database, then you end up running n + 1 queries against the database (where n = number of rows returned in the main select). Try and engineer the function out if possible.
Make sure your JOINs are correct -- where you're using a LEFT JOIN, revisit the logic and see if it needs to be a LEFT or whether it can be turned into an INNER JOIN. Sometimes people use LEFT JOINs, but when you examine the logic in the rest of the query, it can sometimes be apparent that the LEFT JOIN gives you nothing (because, for example, someone may had added a WHERE col IS NOT NULL predicate against the joined table). INNER JOINs can be faster, so it's worth reviewing all of these.
It would be a lot easier to suggest things if we could see the query.
We have a table with 6 million records, and then we have a SQL which need around 7 minutes to query the result. I think the SQL cannot be optimized any more.
The query time causes our weblogic to throw the max stuck thread exception.
Is there any recommendation for me to handle this problem ?
Following is the query, but it's hard for me to change it,
SELECT * FROM table1
WHERE trim(StudentID) IN ('354354','0')
AND concat(concat(substr(table1.LogDate,7,10),'/'),substr(table1.LogDate,1,5))
BETWEEN '2009/02/02' AND '2009/03/02'
AND TerminalType='1'
AND RecStatus='0' ORDER BY StudentID, LogDate DESC, LogTime
However, I know it's time consuming for using strings to compare dates, but someone wrote before I can not change the table structure...
LogDate was defined as a string, and the format is mm/dd/yyyy, so we need to substring and concat it than we can use between ... and ... I think it's hard to optimize here.
The odds are that this query is doing a full-file scan, because you're WHERE conditions are unlikely to be able to take advantage of any indexes.
Is LogDate a date field or a text field? If it's a date field, then don't do the substr's and concat's. Just say "LogDate between '2009-02-02' and '2009-02-03' or whatever the date range is. If it's defined as a text field you should seriously consider redefining it to a date field. (If your date really is text and is written mm/dd/yyyy then your ORDER BY ... LOGDATE DESC is not going to give useful results if the dates span more than one year.)
Is it necessary to do the trim on StudentID? It is far better to clean up your data before putting it in the database then to try to clean it up every time you retrieve it.
If LogDate is defined as a date and you can trim studentid on input, then create indexes on one or both fields and the query time should fall dramatically.
Or if you want a quick and dirty solution, create an index on "trim(studentid)".
If that doesn't help, give us more info about your table layouts and indexes.
SELECT * ... WHERE trim(StudentID) IN ('354354','0')
If this is normal construct, then you need a function based index. Because without it you force the DB server to perform full table scan.
As a rule of thumb, you should avoid as much as possible use of functions in the WHERE clause. The trim(StundentID), substr(table1.LogDate,7,10) prevent DB servers from using any index or applying any optimization to the query. Try to use the native data types as much as possible e.g. DATE instead of VARCHAR for the LogDate. StudentID should be also managed properly in the client software by e.g. triming the data before INSERT/UPDATE.
If your database supports it, you might want to try a materialized view.
If not, it might be worth thinking about implementing something similar yourself, by having a scheduled job that runs a query that does the expensive trims and concats and refreshes a table with the results so that you can run a query against the better table and avoid the expensive stuff. Or use triggers to maintain such a table.
But the query time cause our weblogic to throw the max stuck thread exception.
If the query takes 7 minutes and cannot be made faster, you have to stop running this query real-time. Can you change your application to query a cached results table that you periodically refresh?
As an emergency stop-gap before that, you can implement a latch (in Java) that allows only one thread at a time to execute this query. A second thread would immediately fail with an error (instead of bringing the whole system down). That is probably not making users of this query happy, but at least it protects everyone else.
I updated the query, could you give me some advices ?
Those string manipulations make indexing pretty much impossible. Are you sure you cannot at least get rid of the "trim"? Is there really redundant whitespace in the actual data? If so, you could narrow down on just a single student_id, which should speed things up a lot.
You want a composite index on (student_id, log_date), and hopefully the complex log_date condition can still be resolved using a index range scan (for a given student id).
Without any further information about what kind of query you are executing and wheter you are using indexes or not it is hard to give any specific information.
But here are a few general tips.
Make sure you use indexes on the columns you often filter/order by.
If it is only a certain query that is way too slow, than perhaps you can prevent yourself from executing that query by automatically generating the results while the database changes. For example, instead of a count() you can usually keep a count stored somewhere.
Try to remove the trim() from the query by automatically calling trim() on your data before/while inserting it into the table. That way you can simply use an index to find the StudentID.
Also, the date filter should be possible natively in your database. Without knowing which database it might be more difficult, but something like this should probably work: LogDate BETWEEN '2009-02-02' AND '2009-02-02'
If you also add an index on all of these columns together (i.e. StudentID, LogDate, TerminalType, RecStatus and EmployeeID than it should be lightning fast.
Without knowing what database you are using and what is your table structure, its very difficult to suggest any improvement but queries can be improved by using indexes, hints, etc.
In your query the following part
concat(concat(substr(table1.LogDate,7,10),'/'), substr(table1.LogDate,1,5)) BETWEEN '2009/02/02' AND '2009/02/02'
is too funny. BETWEEN '2009/02/02' AND '2009/02/02' ?? Man, what are yuu trying to do?
Can you post your table structure here?
And 6 million records is not a big thing anyway.
It is told a lot your problem is in date field. You definitely need to change your date from a string field to a native date type. If it is a legacy field that is used in your app in this exact way - you may still create a to_date(logdate, 'DD/MM/YYYY') function-based index that transforms your "string" date into a "date" date, and allows a fast already mentioned between search without modifying your table data.
This should speed things up a lot.
With the little information you have provided, my hunch is that the following clause gives us a clue:
... WHERE trim(StudentID) IN ('354354','0')
If you have large numbers of records with unidentified student (i.e. studentID=0) an index on studentID would be very imbalanced.
Of the 6 million records, how many have studentId=0?
Your main problem is that your query is treating everything as a string.
If LogDate is a Date WITHOUT a time component, you want something like the following
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,0)
AND table1.LogDate = :SearchDate
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
If LogDate has a time component, and SearchDate does NOT have a time component, then something like this. (The .99999 will set the time to 1 second before midnight)
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,:StudentId0)
AND table1.LogDate BETWEEN :SearchDate AND :SearchDate+0.99999
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
Note the use of bind variables for the parameters that change between calls. It won't make the query much faster, but it is 'best practice'.
Depending on your calling language, you may need to add TO_DATE, etc, to cast the incoming bind variable into a Date type.
If StudentID is a char (usually the reason for using trim()) you may be able to get better performance by padding the variables instead of trimming the field, like this (assuming StudentID is a char(10)):
StudentID IN (lpad('354354',10),lpad('0',10))
This will allow the index on StudentID to be used, if one exists.
I have a bunch (750K) of records in one table that I have to see they're in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .net app that loads the source records, then runs through and searches on a like (using a tsql function) to determine if there are any records. If yes, the source table is updated with a positive. If not, the record is left alone.
the app processes about 1000 records an hour. When I did this as a cursor sproc on sql server, I pretty much got the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE source_table s SET some_field = true WHERE EXISTS
(
SELECT target_join_field FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one. you can use a trigger to extract the data into the column at the time it is inserted.
If you have data you can match on you need neither like '%somestuff%' which can't use indexes or a cursor both of which are performance killers. This should bea set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using t-SQl and I would attempt the regular expression route. Not knowing how many different prharses and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance especially when you start taking about that size of record. Learn to think in sets not record by record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query becomes cancelled) will be huge. This will drastically slow down your query. As such it is probably better to proces them in blocks of 1,000 rows or such like.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern) This would simplify the query condition to (b.field like '%' + a.field)
Still, an index wouldn't help here either though as the important characters are at the end. So, bizarrely, it could well be worth while storing the characters of both tables in reverse order. The index on you temporary table would then come in to use.
It may seem very wastefull to spent that much time, but in this case a small benefit would yield a greate reward. (A few hours work to halve the comparisons from 750billion to 375billion, for example. And if you can get the index in to play you could reduce this a thousand fold thanks to index being tree searches, not just ordered tables...)
Second Idea
Assuming you do copy the target table into a temp table, you may benefit extra from processing them in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table. Such that after all 750,000 records have been checked, the target table is now [for example] half the size that it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + #source_field + '%'
IF (##row_count = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
Try this --
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation, or a regular task that needs to be completed regularly.
Not knowing all the details of you problem, if it was me, at this was a one-time (or infrequent operation, which it sounds like it is), I'd probably extract out just the pertinent fields from the two tables including the primary key from the source table and export them down to a local machine as text files. The files sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like 'C'/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into the sql server and use it as a basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of you sql, you may be trying to do 750,000 table scans against a multi-million records table.
Tell us more about the problem.
Holy smoke, what great responses!
system is on disconnected network, so I can't copy paste, but here's the retype
Current UDF:
Create function CountInTrim
(#caseno varchar255)
returns int
as
Begin
declare #reccount int
select #reccount = count(recId) from targettable where title like '%' + #caseNo +'%'
return #reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, developers of either system are no longer available, and while I have some sql experience, I am by no means an expert.
I parsed the case numbers from the crazy way the old system had to make the source table, and that's the only thing in common with the new system, the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.