Faceting on multiple collections is slow - Solr

I see a severe performance degradation when running the same query against multiple collections.
The query:
http://localhost:8983/solr/select?
q=[text]&
fq=((l_id:([some id]) AND l_departed:false))&
qf=field1 field2 field3 field4 field5 field6&
sort=field1 asc, id asc,&
rows=0&
indent=true&
facet=on&
facet.pivot=field1,field2,field3&
facet.missing=true&
facet.mincount=1
If I use the collection parameter to query all collections at once, it becomes very slow, even though the collections share the same schema.
The difference is roughly 4 ms for one collection versus 4000 ms for all of them.
Can you help me understand why?
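For reference, the multi-collection request just adds the collection parameter; it might look something like this (the collection names here are hypothetical):
http://localhost:8983/solr/collection1/select?
q=[text]&
collection=collection1,collection2,collection3&
(same fq, qf, sort, and facet parameters as above)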

It seems that upgrading Solr from version 7.2 to at least 8 fixes the performance issue, without changing the query at all.

Related

SQL Server ignores proper index with included column

I have tblClaims(ClaimID, ValidityTo, ...) and tblClaimServices(ClaimServiceId, ClaimID, ValidityTo, ...) with an obvious foreign key on ClaimID. ValidityTo is used for history, so current data has ValidityTo = null.
These tables have respectively 3 million and 13 million rows.
The query:
select * from tblClaimServices where ClaimID=1234567 and ValidityTo is null
takes 5 seconds to execute!
Querying ... where ClaimID=1234567 is instantaneous.
Note that we're not actually doing select *; we specify almost all columns, since the query comes from an ORM (Django).
The execution plan shows that it's using a clustered index on (ClaimServiceID, ValidityTo) and then working hard to find the ClaimID within those rows. That's insane! ValidityTo is null for 98% of the rows.
We created an index on (ClaimID, ValidityTo), but it wasn't used. We then created a filtered index on ClaimID with an included column for ValidityTo:
CREATE NONCLUSTERED INDEX idx_test1 ON tblClaimServices (ClaimID) INCLUDE (ValidityTo) WHERE ValidityTo IS NULL
But it wasn't used either. (So it takes 5 seconds to find 0 to 10 rows.)
However, using a hint:
FROM tblClaimServices WITH (INDEX(idx_test1))
does work great. Instant results.
Now, I can't, and don't want to, include hints. SQL Server should be able to use an index that specific! Adding hints would also mean updating an old app that uses an ORM, which would be a major pain, and it would make the app fragile or very slow in other queries.
How can I get SQL Server to use that proper index on its own?
I discovered that this strange behavior disappears when the database is in 2012 compatibility mode. Under a more recent compatibility level, the optimizer avoids using the index involving the ValidityTo date column.
We do have a similar field, an integer, that points from soft-deleted records to the current one. Replacing the date condition (IS NULL) with a condition on that integer makes all these queries use the index properly and return results immediately.
I am still not 100% sure why the index isn't used for ValidityTo, but my problem is solved.
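A sketch of that rewrite, with a hypothetical CurrentRecordID column standing in for the integer field described above:
-- hypothetical integer column: 0 on current rows, otherwise the ID of the replacing row
CREATE NONCLUSTERED INDEX idx_claimservices_current
    ON tblClaimServices (ClaimID)
    INCLUDE (ValidityTo)
    WHERE CurrentRecordID = 0;

SELECT ClaimServiceID, ClaimID, ValidityTo
FROM tblClaimServices
WHERE ClaimID = 1234567
  AND CurrentRecordID = 0;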

How to exclude index from query

I recently added an index to a table with about 20 million rows to improve performance for some queries. That worked well. The problem is that once a day, several statistics are generated, and with that index in place, one of those queries now takes too long (it went from a couple of minutes to timing out after 30 minutes).
I looked at the table hints and only saw how to force the use of an index, not how to exclude one. Did I miss something? Is there a way to force an index not to be used by the execution plan? I'd prefer to keep the index, but I will remove it if there is no way to exclude it during the nightly statistics generation.
It is hard to know without seeing a query, but it sounds like a case where the following may help:
SELECT * FROM MyTable WITH (INDEX(0))
WHERE MyColumn = 'MyValue'
If that doesn't work for you, then you may need to post some additional information on what your query contains.
If you do
SELECT *
FROM [Table] WITH (INDEX(0))
WHERE "IndexColumn" = 0
tablescan will be triggered.

SQL Server 2008 index approach when using Year()

I have a table 'customertransactions' with a column 'transactiondate' of type DateTime.
I will be querying it with the following:
SELECT SUM(balance) AS totalbal FROM customertransactions WHERE accountcode=?
AND (MONTH(GETDATE())-MONTH(transactiondate)+12*(YEAR(GETDATE())-YEAR(transactiondate)))>= 3
... obviously passing a sanitised parameter for 'accountcode'.
My question is: how do I best create an index to optimise that?
Thanks.
Primarily, I would consider indexing accountcode. Additionally, if you can rewrite your date clause so that it is sargable, then you may benefit from indexing transactiondate as well.
As always, consider the cardinality of your data, and examine the query plan when adding indexes. There are no hard and fast rules.
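A minimal sketch of that rewrite, assuming the intent of the original clause is "the transaction is at least three calendar months old" (the index name is made up):
-- composite index: the filter column first, the date second, the summed column included
CREATE NONCLUSTERED INDEX ix_custtrans_account_date
    ON customertransactions (accountcode, transactiondate)
    INCLUDE (balance);

-- sargable equivalent: strictly before the first day of the month two months back
SELECT SUM(balance) AS totalbal
FROM customertransactions
WHERE accountcode = ?
  AND transactiondate < DATEADD(month, DATEDIFF(month, 0, GETDATE()) - 2, 0);
With the functions moved off transactiondate and onto GETDATE(), the optimizer can use an index seek on (accountcode, transactiondate) instead of scanning the table.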

How can I handle this time-consuming SQL?

We have a table with 6 million records and a SQL query that needs around 7 minutes to return its result. I don't think the query can be optimized any further.
The query time causes our WebLogic server to throw a max stuck thread exception.
Do you have any recommendations for handling this problem?
The query follows; unfortunately, it's hard for me to change it:
SELECT * FROM table1
WHERE trim(StudentID) IN ('354354','0')
AND concat(concat(substr(table1.LogDate,7,10),'/'),substr(table1.LogDate,1,5))
BETWEEN '2009/02/02' AND '2009/03/02'
AND TerminalType='1'
AND RecStatus='0' ORDER BY StudentID, LogDate DESC, LogTime
I know it's slow to compare dates as strings, but as someone wrote before, I cannot change the table structure...
LogDate was defined as a string in mm/dd/yyyy format, so we need to substring and concat it before we can use BETWEEN ... AND .... I think it's hard to optimize here.
The odds are that this query is doing a full table scan, because your WHERE conditions are unlikely to be able to take advantage of any indexes.
Is LogDate a date field or a text field? If it's a date field, then don't do the substrings and concats. Just say LogDate BETWEEN '2009-02-02' AND '2009-02-03', or whatever the date range is. If it's defined as a text field, you should seriously consider redefining it to a date field. (If your date really is text written as mm/dd/yyyy, then your ORDER BY ... LogDate DESC is not going to give useful results if the dates span more than one year.)
Is it necessary to do the trim on StudentID? It is far better to clean up your data before putting it in the database than to try to clean it up every time you retrieve it.
If LogDate is defined as a date and you can trim studentid on input, then create indexes on one or both fields and the query time should fall dramatically.
Or if you want a quick and dirty solution, create an index on "trim(studentid)".
If that doesn't help, give us more info about your table layouts and indexes.
SELECT * ... WHERE trim(StudentID) IN ('354354','0')
If this is a normal construct, then you need a function-based index. Without it, you force the DB server to perform a full table scan.
As a rule of thumb, you should avoid the use of functions in the WHERE clause as much as possible. The trim(StudentID) and substr(table1.LogDate,7,10) calls prevent the DB server from using any index or applying any optimization to the query. Try to use native data types as much as possible, e.g. DATE instead of VARCHAR for LogDate. StudentID should also be managed properly in the client software, e.g. by trimming the data before INSERT/UPDATE.
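A sketch of such indexes, assuming an Oracle-style database (the syntax matches the trim/substr/concat functions in the query; the index names are made up):
-- index the trimmed value so "trim(StudentID) IN (...)" can seek instead of scan
CREATE INDEX idx_t1_trim_sid ON table1 (TRIM(StudentID));

-- index the exact rearranged-date expression used in the query,
-- so the BETWEEN predicate can be resolved with an index range scan
CREATE INDEX idx_t1_logdate_ymd
    ON table1 (CONCAT(CONCAT(SUBSTR(LogDate, 7, 10), '/'), SUBSTR(LogDate, 1, 5)));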
If your database supports it, you might want to try a materialized view.
If not, it might be worth implementing something similar yourself: have a scheduled job run a query that does the expensive trims and concats and refreshes a table with the results, so that you can query the better table and avoid the expensive work. Or use triggers to maintain such a table.
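For example, a sketch of such a view, assuming Oracle syntax and the mm/dd/yyyy format described in the question (all names are hypothetical):
-- precompute the expensive trim/date conversion once, on refresh
CREATE MATERIALIZED VIEW table1_clean
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
AS
SELECT TRIM(StudentID) AS student_id,
       TO_DATE(LogDate, 'MM/DD/YYYY') AS log_date,
       LogTime, TerminalType, RecStatus
FROM table1;

-- index the cleaned columns so the report query becomes a range scan
CREATE INDEX idx_t1c_sid_date ON table1_clean (student_id, log_date);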
But the query time causes our WebLogic to throw the max stuck thread exception.
If the query takes 7 minutes and cannot be made faster, you have to stop running it in real time. Can you change your application to query a cached results table that you periodically refresh?
As an emergency stop-gap before that, you can implement a latch (in Java) that allows only one thread at a time to execute this query. A second thread would immediately fail with an error (instead of bringing the whole system down). That is probably not making users of this query happy, but at least it protects everyone else.
I updated the query; could you give me some advice?
Those string manipulations make indexing pretty much impossible. Are you sure you cannot at least get rid of the trim? Is there really redundant whitespace in the actual data? If so, you could narrow down on just a single student_id, which should speed things up a lot.
You want a composite index on (student_id, log_date), and hopefully the complex log_date condition can still be resolved using an index range scan (for a given student id).
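A sketch of that index, using the table's actual column names (the index name is made up):
CREATE INDEX idx_t1_sid_logdate ON table1 (StudentID, LogDate);
This only pays off if the query can compare StudentID and LogDate directly, i.e. without the trim/substr wrappers discussed above.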
Without any further information about what kind of query you are executing and whether you are using indexes, it is hard to give specific advice.
But here are a few general tips.
Make sure you use indexes on the columns you often filter/order by.
If only a certain query is way too slow, then perhaps you can avoid executing that query at all by maintaining its results automatically as the database changes. For example, instead of a count() you can usually keep a running count stored somewhere.
Try to remove the trim() from the query by automatically calling trim() on your data before/while inserting it into the table. That way you can simply use an index to find the StudentID.
Also, the date filter should be possible natively in your database. Without knowing which database you use it's hard to say, but something like this should probably work: LogDate BETWEEN '2009-02-02' AND '2009-02-02'
If you also add an index on all of these columns together (i.e. StudentID, LogDate, TerminalType, RecStatus and EmployeeID), then it should be lightning fast.
Without knowing what database you are using and what your table structure is, it's very difficult to suggest any improvement, but queries can generally be improved by using indexes, hints, etc.
In your query the following part
concat(concat(substr(table1.LogDate,7,10),'/'), substr(table1.LogDate,1,5)) BETWEEN '2009/02/02' AND '2009/02/02'
is too funny. BETWEEN '2009/02/02' AND '2009/02/02'?? Man, what are you trying to do?
Can you post your table structure here?
And 6 million records is not a big deal anyway.
As others have said, your problem is in the date field. You definitely need to change your date from a string field to a native date type. If it is a legacy field that your app uses in exactly this way, you can still create a to_date(LogDate, 'MM/DD/YYYY') function-based index that turns your "string" date into a real date and allows the fast BETWEEN search already mentioned, without modifying your table data.
This should speed things up a lot.
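A sketch of that index and the matching predicate, assuming Oracle and the mm/dd/yyyy format stated in the question (the index name is made up; some databases restrict which functions may appear in an index):
CREATE INDEX idx_t1_logdate
    ON table1 (TO_DATE(LogDate, 'MM/DD/YYYY'));

SELECT *
FROM table1
WHERE TO_DATE(LogDate, 'MM/DD/YYYY') BETWEEN DATE '2009-02-02' AND DATE '2009-03-02';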
With the little information you have provided, my hunch is that the following clause gives us a clue:
... WHERE trim(StudentID) IN ('354354','0')
If you have large numbers of records with an unidentified student (i.e. StudentID = 0), an index on StudentID would be very unbalanced.
Of the 6 million records, how many have studentId=0?
Your main problem is that your query is treating everything as a string.
If LogDate is a Date WITHOUT a time component, you want something like the following
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,0)
AND table1.LogDate = :SearchDate
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
If LogDate has a time component, and SearchDate does NOT have a time component, then something like this. (The .99999 will set the time to 1 second before midnight)
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,:StudentId0)
AND table1.LogDate BETWEEN :SearchDate AND :SearchDate+0.99999
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
Note the use of bind variables for the parameters that change between calls. It won't make the query much faster, but it is 'best practice'.
Depending on your calling language, you may need to add TO_DATE, etc, to cast the incoming bind variable into a Date type.
If StudentID is a char (usually the reason for using trim()) you may be able to get better performance by padding the variables instead of trimming the field, like this (assuming StudentID is a char(10)):
StudentID IN (lpad('354354',10),lpad('0',10))
This will allow the index on StudentID to be used, if one exists.

What fields should be indexed on a given table?

I have a table with a lot of records (more than 2 million). It's a transaction table, but I need a report with a lot of joins. What's the best practice for indexing that table? The report is consuming too much time.
I'm paging the table using the stored-procedure paging method, but I need an index because when I export the report I have to run the entire query without pagination, and to get the total record count I need to select everything.
Any help?
The SQL Server 2008 Management Studio query tool, if you turn on "Include Actual Execution Plan", will tell you what indexes a given query needs to run fast. (Assuming there's an obvious missing index that is making the query run unusually slow, that is.)
[Screenshot: SQL Server 2008 Management Studio query window - http://img208.imageshack.us/img208/4108/image4sy8.png]
We use this all the time at Stack Overflow; it's one of the best features of SQL 2008. It works against older SQL instances as well: just install the SQL 2008 tools and point them at a SQL 2005 instance. Not sure if it works on anything earlier, though.
As others have noted, you can also do this manually, but it takes a bit of trial and error. You'll want indexes on fields that are used in ORDER BY and WHERE clauses.
key fields have to be everything in the where clause ???
No, that would be overkill. Indexing a field really only works if a) your WHERE clause is selective enough (that is, it selects only about 1-2% of the rows; an index on a "Gender" field that can have only two or three possible values is pointless), and b) your WHERE clause doesn't involve function calls or other magic.
In your case, TBL.Status might be a candidate - how many possible values are there? You select the values '1' and '2'; if there are hundreds of possible values, then it's a good choice.
On a side note: this clause here: (TBL.Login IS NULL AND TBL.Login <> 'dev') is pretty pointless - if the value of TBL.Login IS NULL, then it's definitely not 'dev', so just the IS NULL check will be more than sufficient.
The other field you might want to consider putting an index on is TBL.Date, since you seem to select a range of dates here - that might be a good choice.
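A sketch of an index along those lines (the index name is made up; verify the choice against your actual execution plan):
CREATE NONCLUSTERED INDEX idx_tbl_status_date
    ON TABLENAME (Status, [Date]);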
Also, on a general note: whenever possible, do not use SELECT * to grab your fields. This causes a lot of overhead for SQL Server. Specify your columns, and select only the ones you really need, not all of them for the heck of it.
Check your queries and find which fields are used to filter them. Those are usually the best candidates!
SQL Server has a 'Database Engine Tuning Advisor' that could help you. This does not exist for SQL Server Express, but does for all other versions of SQL Server.
Load your query in a query window.
On the menu, click Query -> Analyze Query in Database Engine Tuning Advisor.
The tuning advisor will identify indexes that could be added to your table(s) to improve performance. In my experience, the tuning advisor doesn't always help, but most of the time it does. It's where I suggest you start.
OK, this is the query I'm doing:
SELECT
    TBL.*
FROM
    FOREINGDATABASE..TABLENAME TBL
LEFT JOIN Status S
    ON TBL.Status = S.Number
WHERE
    (TBL.ID = CASE @Reference WHEN 0 THEN TBL.ID ELSE @Reference END) AND
    TBL.Date >= @FechaInicial AND
    TBL.Date <= @FechaFinal AND
    (TBL.Channel = CASE @Canal WHEN '' THEN TBL.Channel ELSE @Canal END) AND
    (TBL.DocType = CASE @TipoDocumento WHEN '' THEN TBL.DocType ELSE @TipoDocumento END) AND
    (TBL.Document = CASE @NumDocumento WHEN '' THEN TBL.Document ELSE @NumDocumento END) AND
    (TBL.Login = CASE @Login WHEN '' THEN TBL.Login ELSE @Login END) AND
    (TBL.Login IS NULL AND TBL.Login <> 'dev') AND
    TBL.Status IN ('1','2')
key fields have to be everything in the where clause ???
If I am not mistaken (please correct me if I am), I think you should create a non-clustered index on the fields used in the conditions of the WHERE clause. (Maybe this can be useful as a starting point to get some candidates for the indexes.)
Good luck!
If an index scan is performed instead of a seek, the cause might be that the fields are not in the correct order in the index.
Put indexes on all columns that you're joining and filtering on.
The use of indexes is also determined by the selectivity of the indexed column.
The best way would be to show us your query so we can try to improve it.
