Getting a certain number of docs from Solr

I need to get only the first n documents sorted by the prevId field from Solr (i.e. not all matching docs, just the result cut down to the rows value). It seems to have poor performance, and moreover it returns the wrong value for the number of found docs. Is there any way to do this from the Solr GUI or with a raw request?

numFound is the total number of documents that match your query in the index (which in this case is all the documents in the index); it's not the number of documents returned.
You can enable docValues on your field if sorting is slow for that field - but caching usually helps a lot when doing multiple sorts (as long as your index hasn't been modified in between). That being said, your query took 285ms on the Solr side, so maybe the slowness you're experiencing comes from somewhere other than Solr?
Different output formats (&wt=json etc.) might also be more efficient to deserialize in your language of choice (... and to display in your browser, which does a lot of syntax highlighting for XML).
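If it helps, here is a minimal SolrJ sketch that asks for only the first 10 documents sorted by prevId and reads numFound separately from the number actually returned (it assumes a recent SolrJ and a core named mycore - both just placeholders; the raw-request equivalent is simply q=*:*&sort=prevId asc&rows=10):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class FirstNDocs {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name - adjust to your setup.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setSort("prevId", SolrQuery.ORDER.asc); // sort by prevId
        query.setRows(10);                            // only the first 10 docs are returned

        QueryResponse response = client.query(query);
        SolrDocumentList results = response.getResults();

        // numFound counts every match in the index; results.size() is what was actually returned.
        System.out.println("numFound=" + results.getNumFound() + ", returned=" + results.size());
        client.close();
    }
}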

Related

Solr 7.5 Limiting Facet with certain terms can not order by count?

I would like to facet on the location.authorIds field of my Solr authors collection, but also limit the facet to certain terms.
I did this just like in the Solr 7.5 documentation, http://lucene.apache.org/solr/guide/7_5/faceting.html, but the results returned from Solr are not sorted by count. Does anyone know why?
Here is my Solr query:
http://127.0.0.1:8080/solr/authors/select?q=*:*&fl=handle&fq=search.resourcetype:2&start=0&rows=0&facet.field={!terms=27J001,J027,J132,J107,J225,J141,J092,J191,J224,J198,J062,J051,J143,J208,J119,J031,J057,J030,J134,J144,J158,J058,J181,J222,J153,J002,J203,J012,J045,J014,J186,J011,J064,J065,J147,J112,J192,J167,J066,J135,J096,J082,J075,J009,J193,J217,J168,J121,J059,J034,J213,J148,J169,J133,J013,J161,J093,J097,J162,J021,J170,J171,J083,J187,J178,J077,J194,J078,J098,J067,J047,J052,J172,J005,J113,J079,J099,J114,J100,J115,J068,J173,J084,J214,J101,J060,J025,J122,J195,J188,J196,J116,J102,J159,J197,J029,J094,J123,J053,J043,J189,J124,J015,J085,J174,J004,J044,J182,J088,J007}location.authorIds&facet=true&facet.sort=count
Here are the results; you can see they are not sorted by count.
After testing and debugging the code, I could clearly see that it is a bug in the code.
org.apache.solr.request.SimpleFacets#getListedTermCounts
for (String term : terms) {
  int count = searcher.numDocs(ft.getFieldQuery(null, sf, term), parsed.docs);
  res.add(term, count);
}
This is clearly wrong, since it just iterates over the list and returns the counts in the exact order in which the terms were specified in the query parameters.
I've created a Solr issue - https://issues.apache.org/jira/browse/SOLR-13156 - and provided a patch. It has been committed and will be available in an upcoming Solr release.
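For reference, a small self-contained illustration (not the actual committed patch) of what facet.sort=count implies: the per-term counts have to be re-sorted by count, descending, before being returned.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetSortByCount {
    public static void main(String[] args) {
        // Hypothetical counts as they come back from the per-term lookups,
        // in the order the terms were listed in the {!terms=...} parameter.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("J027", 3);
        counts.put("J132", 12);
        counts.put("J107", 7);

        // facet.sort=count means: order the entries by count, descending.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        for (Map.Entry<String, Integer> e : entries) {
            System.out.println(e.getKey() + ": " + e.getValue()); // J132: 12, J107: 7, J027: 3
        }
    }
}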

Lucene comparing document contents

I am trying to compare the contents of documents using Solr. I do this by simply using the entire document contents as a query. This works until the documents get large. A document can contain 15k words or more, which results in a "too many boolean clauses" exception (the limit defaults to 1024). Now I could of course increase this limit, but even if I raise it to 5k it will remain impossible to compare documents with large contents.
Is Lucene even suitable for such tasks? If so, what should I do to accomplish this? If not, what would be an alternative way of comparing the contents of one document with other documents?
I think MoreLikeThis is what you're after. MoreLikeThis prunes a document's contents down to its higher-frequency terms and searches with just those, which gets around the huge number of terms (and improves performance). If you are searching for documents similar to an external source:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(someReader, "contents");
Hits hits = indexsearcher.search(query);
Or if searching for a document already in the index:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(documentNumber);
Hits hits = indexsearcher.search(query);
Solr also includes a MoreLikeThis handler.
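If you go the handler route instead of the Lucene API, a rough SolrJ sketch could look like the following (it assumes the /mlt handler is enabled in solrconfig.xml and that the field is called contents - adjust both to your setup):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltHandlerExample {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Find documents similar to the document with id 123, based on the contents field.
        SolrQuery query = new SolrQuery("id:123");
        query.setRequestHandler("/mlt");   // the MoreLikeThis request handler
        query.set("mlt.fl", "contents");   // field(s) to base similarity on
        query.set("mlt.mintf", "2");       // ignore terms occurring fewer than 2 times in the source doc
        query.set("mlt.mindf", "5");       // ignore terms appearing in fewer than 5 docs

        QueryResponse response = client.query(query);
        System.out.println("Similar docs found: " + response.getResults().getNumFound());
        client.close();
    }
}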

Displaying information about Solr search results

I am struggling with a little problem where I have to display relevant information about the result set returned from Solr, but I can't figure out how to calculate it without iterating over the results (bad).
Basically I am storing my documents with a state field, and while the search is supposed to return all documents, the UI has to show "Found 15 entities, 5 are in state A, 3 in state B and 8 in C".
At the moment I am using the rather brittle approach of running the query three times with additional scoping by type, but I'd rather get that information from the single query I am displaying. (There have been some edge cases where the numbers don't add up, and since Solr can return facets I guess there has to be a way to use that functionality here.)
I am using Solr 3.5 from Rails with the sunspot gem.
As you mention yourself, you can use facets for this by setting
facet=true&facet.field=state
I'm not familiar with the sunspot gem, but looking at the documentation, you can use facets like this (assuming Entity is your searchable):
Entity.search do
  facet :state
end
This should return the states of all entities matched by your query, along with the number of entities in each state. The Sunspot documentation tells me you can read these facets in the following way:
search.facet(:state).rows.each do |facet|
  puts "State #{facet.value} has #{facet.count} entities"
end
Essentially there are three main sets of functions you can use to garner stats from Solr.
The first is faceting:
http://wiki.apache.org/solr/SimpleFacetParameters
There is also grouping (field collapsing):
https://wiki.apache.org/solr/FieldCollapsing
And the stats package:
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
Although stats, facets and grouping may eventually be replaced by the analytics (OLAP-style) package, which is aimed at Solr 5.0.0:
https://issues.apache.org/jira/browse/SOLR-5302
Good luck.
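For completeness, here is a minimal SolrJ sketch of the faceting approach that produces the "Found 15 entities, 5 are in state A ..." numbers from a single query (core and field names are placeholders; with Sunspot you would use the snippet above instead):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StateCounts {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/entities").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);          // we only need the counts, not the documents
        query.setFacet(true);
        query.addFacetField("state");
        query.setFacetMinCount(1);

        QueryResponse response = client.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " entities");

        FacetField state = response.getFacetField("state");
        for (FacetField.Count c : state.getValues()) {
            System.out.println(c.getCount() + " are in state " + c.getName());
        }
        client.close();
    }
}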

Solr facet performance

I am working with Solr facet fields and have come across a performance problem I don't understand. Consider these two queries:
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
The only difference is an empty facet.prefix in the first query.
The first query returns after some 20 seconds (QTime 20000 in the result) while the second one takes only 80 msec (QTime 80). Why is this?
And as side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT.
This is with Solr 1.4.
From this doc: http://docs.lucidworks.com/display/solr/Faceting
The facet.prefix parameter limits the terms on which to facet to those starting with the given string prefix.
That means you facet over fewer terms.
Now, I'm quite sure the faceting time is included in the QTime (as seems to be demonstrated by this post: http://www.mail-archive.com/solr-user@lucene.apache.org/msg39859.html).
So fewer terms means less time.
Maybe don't facet on CONTENT, as it probably has many distinct terms and makes little sense to facet on. Try faceting on a category field or some other field with fewer unique terms.
Have you tried executing them in the opposite order after a fresh restart of the Solr server?
Usually the first query takes more time, and if the following queries have enough in common with any of the previous ones, there will be cache hits and response times will improve dramatically.
In addition, please note that 'enum' is more suitable for facet fields with a smaller number of unique terms.
Also, try increasing the filter cache to a really big number and check your cache-hit ratio at
SOLR_DOMAIN:PORT/solr/#/collection1/plugins/cache?entry=fieldValueCache,filterCache
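To rule cache warming in or out, here is a small SolrJ sketch that runs both variants and prints the QTime Solr reports, so you can compare them in either order (it assumes a recent SolrJ client and a placeholder core name; older installations used CommonsHttpSolrServer instead of HttpSolrClient):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPrefixTiming {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Run the facet query twice: once without a prefix, once with facet.prefix=a.
        for (String prefix : new String[] {"", "a"}) {
            SolrQuery query = new SolrQuery("word");
            query.setRows(0);
            query.setFacet(true);
            query.addFacetField("CONTENT");
            query.setFacetPrefix(prefix);
            query.setFacetLimit(10);
            query.setFacetMinCount(1);
            query.set("facet.method", "enum");

            QueryResponse response = client.query(query);
            System.out.println("facet.prefix='" + prefix + "' -> QTime=" + response.getQTime() + " ms");
        }
        client.close();
    }
}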

Solr index-time boost depending on the field value

Is it possible to boost a document at indexing time depending on a field value?
I'm indexing a text field pulled from a database. I would like to boost shorter results over longer ones, so the boost value should depend on the length of the text field.
This is needed to alter the standard Solr behavior, which in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query, even though the other item contains the search term twice.
The other solution to the problem would be the ability to 'switch off' the feature that scores multiple matches higher at query time. I don't know if that is possible either...
Doing this at index time is important, as the hardware I run Solr on is not too powerful, and trying to boost at query time fails with an OutOfMemoryError. (Even if I could work around that by increasing the memory available to Java, I prefer to stay on the safe side and build the index in the most efficient way possible.)
Yes and no - how you do it depends on how you're indexing your documents.
As far as I know, there's no way of resolving this purely on the Solr server side at the moment.
If you're using the regular XML-based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document, depending on the length of the text field.
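As an illustration, if you happen to submit documents with SolrJ instead of hand-built XML, the equivalent of that boost attribute looked roughly like this (index-time boosts only exist up to Solr 6.x and were removed in 7.0; the URL, field names and scale factor are placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BoostByLength {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        String text = "ABCD";
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        // Shorter text gets a larger boost; the 100f scale factor is arbitrary.
        float boost = 100f / (float) Math.sqrt(text.length());
        doc.addField("text", text, boost);   // field-level index-time boost (pre-Solr 7 only)

        client.add(doc);
        client.commit();
        client.close();
    }
}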
You can check the DIH Special Commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems to be no $fieldBoost command.
For your case though, if you are using DefaultSimilarity, shorter fields are already scored higher than longer fields in the score calculation.
You can certainly implement your own Similarity class with a changed TF (term frequency) and lengthNorm calculation to suit your needs.
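As a starting point, here is a rough sketch of such a Similarity for Lucene/Solr 4.x (method signatures differ between Lucene versions, so treat it as illustrative only): flattening tf() removes the advantage of multiple matches, and lengthNorm() keeps favouring shorter fields.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Register in schema.xml, e.g. <similarity class="com.example.ShortDocSimilarity"/>
// (or via a SimilarityFactory, depending on the Solr version).
public class ShortDocSimilarity extends DefaultSimilarity {

    // Ignore how often a term matches within a document:
    // one match scores the same as five matches.
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // Keep the preference for shorter fields.
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1.0f / (float) Math.sqrt(state.getLength());
    }
}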
