Solr * vs *:* query performance

Solr * vs *:* query performance - solr

We're running Solr 3.4 and have a relatively small index of 90,000 documents or so. These documents are split over several logical sources, and so each search will have an applied filter query for a particular source, e.g:
?q=<query>&fq=source:<source>
where source is a classic string field. We're using edismax and have a default search field text.
We are currently seeing q=* taking on average 20 times longer to run than q=*:*. The difference is quite noticeable, with *:* taking 100ms and * taking up to 3500ms. A search for a common word in the document set (matching nearly 50% of all documents) will return a result in less than 200ms.
Looking at the queries with debugQuery on, we can see that * is parsed to a DisjunctionMaxQuery((text:*)), while *:* is parsed to a MatchAllDocsQuery(*:*). This makes sense, but I still don't feel like it accounts for a slowdown of this magnitude (a slowdown of 2000% over something that matches 50% of the documents).
What could be causing this? Is there anything we can tweak?

When you are passing just * you are ordering to check every value in the field and match it against * and that is a lot to do. However when you are using * : * you are asking Solr to give you everything and skip any matching.
Solr/Lucene is optimized to do * : * fast and efficient!

Related

Getiing a certain number of docs from Solr

I need to get only n first documents sorted by prevId field from Solr (and not getting all the docs but cut to rows value) It seems to have poor performance and moreover it returns me the wrong value of found docs.Is where any way to do it from SOLR gui
or raw request?

numFound is the total number of documents that matches your query in the index (which in this case is all the documents in the index), it's not the number of documents returned.
You can enable docValues on your field if sorting is slow for that field - but caching usually helps a lot when doing multiple sorts (as long as your index hasn't been modified in between). That being said, your query took 285ms on the Solr side, so maybe the slowness you're experiencing comes from somewhere else than Solr?
Different output formats (&wt=json etc.) might also be more efficient for deserializing in your language of choice (.. and for display in your browser, which does a lot of syntax highlighting for XML).

Improve Slow Solr Query Performance

We are using Solr on Windows with multiple collections. Collections are having multiple stored and indexed fields with appx 200k documents. The use case is for e-commerce website search. The size of index is appx. 200 MB
While the normal search takes less than few ms, the query where I need to find all data for multiple categories are taking somewhere around 1100ms to 1200ms. The query includes appx. 400 categories with OR something like..
Category:(5 OR 33 OR 312 OR 1192 OR 1193 OR 1196 OR .....)
I have increased Heap Size to 4gb, and configured Solr cache value to be on higher size, this reduced the query time from 2000ms to 1100ms, but we are looking for more improvement.
I also found following on Solr UI:
lockFactory=org.apache.lucene.store.NativeFSLockFactory#56761b2a; maxCacheMB=48.0 maxMergeSizeMB=4.0
But not sure does that impact? And if Yes, how to change that?
Can you advise what else we can do? Let me know if you need more details.
Thank you in anticipation.

you should add your full request so it's easier to give some advice. But, from your sentence "The query includes appx. 400 categories with OR something like.." I understand you are putting your huge clause in q param? That is not the right approach.
Instead use q=* :* and put your clause in fq. This way, it will be cached, and subsequent queries will be much faster. If you get a good cache hit rate, queries will be significantly faster.
As a second thing you might try (but go first with the above) could be transforming the big OR clause into a (or a combination of) range clauses, as:
Category:[5 TO 1190] OR Category:[1192 TO 1196]
If your type is an tint, and you can transform the clause into ranges combination by significantly decreasing it's size, it might work too

Lucene comparing document contents

I am trying to compare the contents of documents using solr. I do this by simply using the entire document contents as a query. This works until the documents get large. A document can contain as many as 15k words or more. This results in a max boolean clause exception which has a default value of 1024. Now I could of course increase this value, but even if I increase it to 5k then it will remain impossible to compare documents with large contents.
Is Lucene even suitable for such tasks? And if so, what should I do to accomplish said requirements. If not, what would be an alternative way of comparing the contents of one document with other documents?

I think MoreLikeThis. MoreLikeThis prunes a documents contents to it's higher frequency terms, and just searches with those, which gets around the high numbers of terms (and improving performance). If you are searching for documents similar to an external source:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(someReader, "contents");
Hits hits = indexsearcher.search(query);
Or if searching for a document already in the index:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(documentNumber);
Hits hits = indexsearcher.search(query);
Solr also includes a MoreLikeThis handler.

Solr facet performance

I am working with Solr facet fields and come across a performance problem I don't understand. Consider these two queries:
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
The only difference is an empty facet.prefix in the first query.
The first query returns after some 20 seconds (QTime 20000 in the result) while the second one takes only 80 msec (QTime 80). Why is this?
And as side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT.
This is with Solr 1.4.

From this doc:http://docs.lucidworks.com/display/solr/Faceting
The facet.prefix parameter limits the terms on which to facet to those
starting with the given string prefix.
that means that you facet by less terms.
Now, I'm quite sure the faceting time is included in the Qtime (as seems demonstrated by this post: http://www.mail-archive.com/solr-user#lucene.apache.org/msg39859.html).
So that means less terms, less time.

Maybe not facet on CONTENT as this probably has many different terms and makes no sense faceting on. Try faceting on a category field or some other field with less unique terms.

Have you tried executing them in the opposite order after a fresh restart of Solr server?
Usually the first query takes more time and if the next queries happen to have more in common with any of the previous, there'd be cache-hits and response time would be incredible.
In addition, please note that 'enum' is more suitable for facet-fields with less number of unique terms within.
Also, try increasing filter-cache. to a really big number and check your cache-hit ratio at
SOLR_DOMAIN:PORT/solr/#/collection1/plugins/cache?entry=fieldValueCache,filterCache

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)

Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.

You can check upon DIH Special Commands which has a $docBoost command
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems no $fieldBoost Command.
For you case though, if you are using DefaultSimilarity, shorter fields are boosted higher then longer fields in the Score calculation.
You can surely implement your own Simiarity class with a changed TF (Term Frequency) and LengthNorm Calculation as your needs.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Solr * vs : query performance - solr

When you are passing just * you are ordering to check every value in the field and match it against * and that is a lot to do. However when you are using * : * you are asking Solr to give you everything and skip any matching. Solr/Lucene is optimized to do * : * fast and efficient!

Related

Getiing a certain number of docs from Solr

Improve Slow Solr Query Performance

Lucene comparing document contents

Solr facet performance

SOLR index time boost depending on the field value

Categories

Resources