Solr Keyword Search

I have a Solr index which contains approximately 10 million web discussion threads. Solr operates in a reader-writer setup. Another process queries Solr with different keyword queries. Keywords can be in the following formats:
A
A AND B AND C.....
A AND B AND C.... AND Z NOT AA NOT AB NOT AC......
The final Solr query looks something like this:
text:( "Keyword A" OR "Keyword B" OR "Keyword C" ...) AND source: (source1 OR source2 OR source3...) AND date:[date1 TO date2]
There are around 100 such different combinations queried on Solr. Which combination is selected depends on the number of results each query returned.
The queries take a long time, sometimes minutes (2-15 min). Caching is of little help, as the scheduling thread rarely picks up the same query back to back.
How can I reduce the time taken by these Solr queries?
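One standard restructuring worth trying (my suggestion, not something the original setup is confirmed to lack): move the source and date clauses out of q and into separate fq parameters. Filter queries skip scoring and are cached independently in Solr's filterCache, so even though the keyword list changes on every run, the recurring source and date filters can be reused across queries:
q=text:("Keyword A" OR "Keyword B" OR "Keyword C")
&fq=source:(source1 OR source2 OR source3)
&fq=date:[date1 TO date2]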

Related

Solr query long execution time for multiple ORs

I have a question regarding Solr queries.
My query basically contains thousands of OR conditions for authors (author:name1 OR author:name2 OR author:name3 OR author:name4 ...).
The execution time on my index is huge (around 15 sec). When I tag all the associated documents with a custom field and value like authorlist:1 and then change my query to just search for authorlist:1, it executes in 78 ms.
Can somebody please explain why there is such a big difference in execution time (maybe the query parser?) and whether there is a way to speed this up?
Thanks for the help.
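The gap is expected: a query with thousands of OR clauses builds a BooleanQuery that must look up, iterate, and score the postings of every single author term, while authorlist:1 is one term lookup. If retagging documents isn't practical, a sketch of a middle ground (assuming Solr 4.10+, where the terms query parser is available) is to pass the names as a comma-separated list to {!terms}, which builds a constant-score set filter instead of a scored BooleanQuery:
`fq={!terms f=author}name1,name2,name3,name4`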

Solr Relevance Search boost

I am using Solr 5.0.0, and I have a question about relevance boosting:
If I search for words like laptop table, is there any way to boost results where the search words appear before words like by, with, without, etc.?
I used this query:
?defType=dismax
&q=foo bar
&bq=(*:* -by)^999
But this negatively boosts any document containing the word by or with, etc. How can I avoid this problem?
For example, if I search for laptop table, then with the above query the result DGB Cooling Laptop Table by GDB won't be boosted.
I just need to boost the search words when they appear before certain words like by, with, etc.
Is it possible?
In your example you want ...laptop table by... results to score higher than laptop table results without by. And you want ...by laptop table... to be omitted entirely.
Let's call them:
Q1: ...laptop table by... (let's boost by 2 for the exercise)
Q2: ...laptop table... (let's not boost at all)
Q3: ...by laptop table... (we want to omit this)
So, your query in the abstract is: (Q1^2 OR Q2) NOT Q3
The dismax parser may be obscuring your goals by doing too much on your behalf. At least for this piece, you should consider using the Standard (Lucene) Query Parser:
`q=("laptop table by"^2 OR "laptop table") NOT "by laptop table"`
If you want to allow for slop (i.e. extra words between the query and 'by' or 'with') and preserve the order of terms as above, you should look into the ComplexPhraseQueryParser (Solr 4.8+). Yonik Seeley has a nice post about this.
Then you could do something like this:
`q=({!complexphrase inOrder=true}"laptop table by"~2^2 OR "laptop table") NOT {!complexphrase inOrder=true}"by laptop table"~2`
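One caveat (my note, not part of the original answer): some Solr versions do not honor a {!...} parser switch in the middle of a query string. The nested-query hook _query_ is the safer form, with the inner phrase quotes escaped:
`q=(_query_:"{!complexphrase inOrder=true}\"laptop table by\"~2"^2 OR "laptop table") NOT _query_:"{!complexphrase inOrder=true}\"by laptop table\"~2"`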

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's primary key. For some searches, we need to be able to limit the search to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search, and it is rare enough to call it "never" that any two searches share the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together (maxBooleanClauses, which defaults to 1,024).
The problem with this approach (besides being clunky) is that it seems to perform at O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. With 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps to about 75,000ms. We want it to be on the order of 1,000-2,000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum match (mm=0) to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: Dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches, so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation, not as (what we assume is) naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
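One hedged sketch (assuming Solr 4.10+, which postdates some of the links above): the terms query parser takes the whole ID list as a single comma-separated value and builds a set-membership filter rather than a scored BooleanQuery, sidestepping both the maxBooleanClauses limit and the per-clause scoring overhead:
`fq={!terms f=flrid method=termsFilter}123,125,139,...,103049934`
Send the request as a POST body rather than a GET URL to avoid URL length limits with lists this large.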

Different scores from Solr 1 vs Solr 4 Dismax Handler

I've migrated my Solr 1.4 index to Solr 4.0 using this method, and I've kept my solrconfig.xml and schema.xml as unchanged as possible while still being functional.
I'm using the DisjunctionMaxQuery (dismax / solr.DisMaxRequestHandler) requestHandler and comparing my search results between Solr 1.4 and Solr 4. Using ?debugQuery=on in the URL, I can see that the parsedQuery portion is virtually the same between Solr versions, yet the generated scores are different. (The explain portion is different, but the calculation is long and obtuse.)
Example query: q=foo
Example response:
Solr 1.4:
title: "foo (32-bit)"
score: 3.8850176
Solr 4.0:
title: "foo (32-bit)"
score: 2.1525226
Despite having the same request handler and identical indices, what would be causing this significant difference in scores?
If the explain portion is different, then different calculations are being used to compute the scores, so they are going to differ. Scores are fairly arbitrary anyway and are basically only meaningful for comparison within the single result set of one query; in other words, it doesn't make sense to compare scores from one query to the scores of another. The same probably applies to different versions of Solr, especially if the way the calculations are done has changed.
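As an aside (not from the original answer): if you want to rule out one source of drift, Solr 4.x lets you pin the similarity implementation explicitly in schema.xml, e.g. the classic TF-IDF scoring:
`<similarity class="solr.DefaultSimilarityFactory"/>`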

Solr * vs *:* query performance

We're running Solr 3.4 and have a relatively small index of 90,000 documents or so. These documents are split over several logical sources, and so each search will have an applied filter query for a particular source, e.g:
?q=<query>&fq=source:<source>
where source is a classic string field. We're using edismax and have a default search field of text.
We are currently seeing q=* taking on average 20 times longer to run than q=*:*. The difference is quite noticeable, with *:* taking 100ms and * taking up to 3500ms. A search for a common word in the document set (matching nearly 50% of all documents) will return a result in less than 200ms.
Looking at the queries with debugQuery on, we can see that * is parsed to a DisjunctionMaxQuery((text:*)), while *:* is parsed to a MatchAllDocsQuery(*:*). This makes sense, but I still don't feel like it accounts for a slowdown of this magnitude (a slowdown of 2000% over something that matches 50% of the documents).
What could be causing this? Is there anything we can tweak?
When you pass just *, you are asking Solr to expand the wildcard against every term in the field and match each one, which is a lot of work. When you use *:*, you are asking Solr to give you everything and skip matching entirely.
Solr/Lucene is optimized to make *:* fast and efficient!
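Restating that as queries (source:forum is just an illustrative value): when you want every document that passes the filter, pair the filter with *:* rather than a bare wildcard:
q=*&fq=source:forum (slow: expands text:* against every term in the text field)
q=*:*&fq=source:forum (fast: MatchAllDocsQuery, no term enumeration)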
