Strange Solr spring behaviour - solr

When I built my query with Spring data, it produce the following query:
http://localhost:8983/solr/pride_projects/select?q=accession:*PXD*+OR+accession:*PRD*+AND+publication_date:[2012\-12\-31T00\:00\:00.000Z+TO+2012\-12\-31T00\:00\:00.000Z]
This gives me no results. I have manually change the query in my Solr to:
http://localhost:8983/solr/pride_projects/select?q=accession:*PXD*+OR+accession:*PRD*+publication_date:[2012\-12\-31T00\:00\:00.000Z+TO+2012\-12\-31T00\:00\:00.000Z]
Here the output:
{
accession: "PXD000002",
project_title: "The human spermatozoa proteome",
project_description: "The human spermatozoa proteome was in depth
characterized using shotgun iterative GeLC-MS/MS method with peptide
exclusion lists.",
project_sample_protocol: "This LC-MS/MS analysis was repeated twice
by digested band using an identified peptide exclusion list,
generated by Proteome Discoverer, from the previous LC-MS/MS runs of
the same sample.",
submission_date: "2012-01-02T00:00:00Z",
Removing the last AND and now the query works. I would expect similar behavior with both queries, but not.
Any ideas?

There is no reason the result should be the same when you remove an AND requirement - the default behavior for your setup is probably that all clauses are optional except when you are specific about requiring it (through AND).
Since your publication_date interval only matches a single millisecond, it doesn't match any documents - so in your last query it's being ignored (and would affect score if a document matched).
2012-12-31T00:00:00.000Z TO 2012-12-31T00:00:00.000Z
.. the start of your interval is the same as the end of your interval, but since you used [ and ] (which means that the value itself is included) you would get a match for a document indexed with exactly that millisecond.
You probably meant to filter for a far wider range.

Related

Nested document searches in Solr complex parentFilter syntax

We are adding nested documents to our Solr index. For this purpose, we've added a solr_record_type field to each record, but there will be an interval while we are updating the index where the original documents will have null in this field. We would like to treat all of the original documents as root documents.
In our Solr index, solr_record_type equals 1 and the child types are represented by 2-4. So, in order to get backwards compatibility with what is currently returned by queries, I added this fq parameter:
-solr_record_type:[2 TO 4]
However, I am having trouble composing the parentFilter in the child transformer. For the fl field I've tried:
*,[child parentFilter="-solr_record_type:[2 TO 4]"]
This doesn't work because it then omits the _childDocuments_ section from the results for some reason. I don't know why. I need some way to specify that the parent filter is either "null or 1" or "anything but 2, 3, and 4". How can I do this?
I was unable to find a definitive reference for syntax for the parentFilter, only very simple examples.
A negative query needs to be prefixed with what it's going to remove the documents from. Think of it as the intersection between the two sets, and if you only have the set which are "these documents should be removed", you have nothing to remove them from.
The regular query parser (and the edismax handlers) append the set of all documents, *:* automagically in front of negative queries for you, so it appears to work - until you start with longer AND and OR statements involving negative queries, where you suddenly need to prefix *:* as well.
The same is the case in the parentFilter syntax - there is no inherent set of all documents automagically prefixed internally, so if you have a negative query, you'll have to add it yourself.
*,[child parentFilter="*:* -solr_record_type:[2 TO 4]"]

solr function query by example

Can someone explain with example that how Solr function query is used.
I could not find any concrete example which shows the result difference with function queries and without function queries.
I want something with example URL and what is shows in response result.
A function query is a query that invokes a function on one (or more) of the fields available. You add a function query if the value you have in a field has to be processed to get the value you want - just as you'd do in a mathematical sense.
Showing "the difference between a query with function queries and without" isn't really possible, as they don't do the same thing. You pick one (or both) depending on what you need.
An adopted example from the reference manual - Lets imagine we have a set of documents that describe users, and these users have two fields - mails_read and mails_received. To get anyone that has read less than 50% of their mails, we can apply a filter query as a function (with the frange query parser) (fq here means filter query - the frange is what makes it a function query):
fq={!frange l=0 u=0.5}div(mails_read,mails_received)
Otherwise we'd be limited to receive those who just had read a specific range of emails or that had received a specific range of emails - or we'd have to index a value that kept the updated value for mails_read / mails_received each time we updated the document (which is a perfectly valid strategy, and usually more efficient).
Another example is to use a function query for boosting documents, and the most common one is to boost by recency (i.e. that a more recent document receives a larger boost):
bf=recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1)
This applies the recip function to the difference (expressed in milliseconds) between the mydatefield field and the current hour.
recip: Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
Yet another fine use case is to use the special _val_ field - if you query against this magic field with a function, the value returned by the function will be used as the score of the document (instead of affecting it through boosting or limiting the resulting set of documents as a query).
_val_:"div(popularity, price)"
.. would give the score of the document based on the result of the division (what the values represent is up to you).

How to perform an exact search in Solr

I implementing Solr search using an API. When I call it using the parameters as, "Chillout Lounge", it returns me the collection which are same/similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns me results which don't have any of these three words.(in the DB there are values which have these 3 values, but they are not returned.)
According to me, Solr uses Fuzzy search, but when it is done it should return me some values, which will have at least one these value.
Or what could be the possible changes I should to my schema.XML, such that is would give me proper values.
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will make any document that contain at least one of the terms be returned. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus, be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what's considered a match and not.
You'll have to add a proper example to get a more detailed answer.

No result return by Solr when query contains word that is not in the collection

I am trying to set up Solr but encountered the problem mentioned in the title. I just downloaded Solr and used the built-in example. When I used a query with words occurred in the example documents, such as "ipod". Solr worked properly. However, when I added some words that are not in these documents, such as "what". Solr does not return anything. For me, it is weird since the relevance scores should be computed to query terms separately and added up. Non-existing query term should not affect the ranking (even though the coord norm is affected, thus the scores of documents will change).
Could anyone tell me what might be the issue? Thanks.
There are several ways of configuring how you want this behavior. I'll assume that you're using the edismax query handler for these examples, although some of these also apply to the standard lucene query parser.
The reason for not always wanting "ipod what" to retrieve the same subset sa "ipod" is that you'll get a poor result set and user experience for terms that are more general than "ipod" (i.e. searching for "microsoft windows" will not be perceived as a good search result if you're showing only general hits for anything about windows - it's usually better to say "we didn't find anything" in those cases). It all depends on your use case.
First, you can do it yourself, by applying either AND or OR between terms to get the exact kind of matching you're looking for.
You can use q.op to configure wether each term should be AND-ed together (all required) or OR-ed together (any one is sufficient). This overrides the (now deprecated) value from <solrQueryParser defaultOperator=".."/> in schema.xml.
For (e)dismax, there's the mm parameter, which allows you do more specific, but in a general way, handling of how you want matches to be performed. mm allows you to say "at least 50% of the terms should match" or "if there's only two terms, both should match, but any over that should be optional" or "match everything up to four, and 75% after that".

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
You can check upon DIH Special Commands which has a $docBoost command
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems no $fieldBoost Command.
For you case though, if you are using DefaultSimilarity, shorter fields are boosted higher then longer fields in the Score calculation.
You can surely implement your own Simiarity class with a changed TF (Term Frequency) and LengthNorm Calculation as your needs.

Resources