Carrot: different clusters for the same query - solr

When I issue the same match-all query (*:*) repeatedly, I get different clusters and scores each time. What could be the reason?
First try:
label: "В Минске"
score: 52.79549568196028
Second try:
label: "В Минске"
score: 54.74385944060893
Third try:
label: "В Минске"
score: 48.884082925408734
Document ids inside the clusters also differ. The clusters themselves change: in one response I get a cluster "тысячами евро", in the next one it is gone and a new cluster appears: "Тысячами Долларов".
Is there a Carrot2 parameter that would make clusters stable for a given query? Could it be desiredClusterCountBase?
The Solr index is the same for all cases. Algorithm used: org.carrot2.clustering.lingo.LingoClusteringAlgorithm with StopWordLabelFilter.enabled=false and clustering.rows=1000.

It looks like I found the reason:
The index contained a duplicate of each document, with only one difference: one copy had a publication date and the other did not.
At the same time, my date filter did not work correctly because publication dates were stamped incorrectly on the documents, so the ranking function with reciprocal rank could return different documents in the top 1000 each time (this part is hard to debug without looking into the Solr source code).
The clustering module therefore received slightly different sets of documents, so the clusters changed. The most prominent clusters (by size) were still stable and only their scores changed, while less prominent clusters could be replaced by other less prominent clusters between requests.
I still don't know whether this is a bug, but removing all documents from the index and re-adding them with the correct publication date solved the issue.
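Separately from fixing the data, an explicit tie-breaker in the sort may help keep the set of documents handed to clustering stable when many documents share the same score. The following is a minimal sketch, not the fix used above. It assumes a core named mycore (hypothetical), a unique key field called id, and that the clustering component is switched on with clustering=true and clusters the returned rows.

```python
from urllib.parse import urlencode

def stable_clustering_url(base_url, rows=1000):
    """Build a match-all request whose result order is fully deterministic."""
    params = [
        ("q", "*:*"),
        ("rows", rows),
        # Secondary sort on the unique key breaks ties between equal scores,
        # so the same top `rows` documents come back on every request.
        ("sort", "score desc,id asc"),
        # Assumption: clustering is enabled per request with this flag.
        ("clustering", "true"),
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical core name; replace with your own.
url = stable_clustering_url("http://localhost:8983/solr/mycore/select")
print(url)
```

If the clusters still change with a deterministic sort, the input documents are the same and the variation has to come from somewhere else, which narrows down the search.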

Related

Strange Solr spring behaviour

When I build my query with Spring Data, it produces the following query:
http://localhost:8983/solr/pride_projects/select?q=accession:*PXD*+OR+accession:*PRD*+AND+publication_date:[2012\-12\-31T00\:00\:00.000Z+TO+2012\-12\-31T00\:00\:00.000Z]
This gives me no results. I manually changed the query to:
http://localhost:8983/solr/pride_projects/select?q=accession:*PXD*+OR+accession:*PRD*+publication_date:[2012\-12\-31T00\:00\:00.000Z+TO+2012\-12\-31T00\:00\:00.000Z]
Here is the output:
{
accession: "PXD000002",
project_title: "The human spermatozoa proteome",
project_description: "The human spermatozoa proteome was in depth
characterized using shotgun iterative GeLC-MS/MS method with peptide
exclusion lists.",
project_sample_protocol: "This LC-MS/MS analysis was repeated twice
by digested band using an identified peptide exclusion list,
generated by Proteome Discoverer, from the previous LC-MS/MS runs of
the same sample.",
submission_date: "2012-01-02T00:00:00Z",
After removing the last AND, the query works. I would expect both queries to behave the same way, but they don't.
Any ideas?
There is no reason the result should be the same when you remove an AND requirement. The default behavior for your setup is probably that all clauses are optional unless you explicitly require them (through AND).
Since your publication_date interval covers only a single millisecond, it matches no documents. In your last query the clause is optional, so it is effectively ignored (it would only affect the score if a document matched).
2012-12-31T00:00:00.000Z TO 2012-12-31T00:00:00.000Z
The start of your interval is the same as its end. Because you used [ and ] (which include the boundary values), you would only get a match for a document indexed with exactly that millisecond.
You probably meant to filter on a far wider range.
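As an illustration of a wider range, here is a small Python sketch that builds a range covering one whole UTC day and groups the accession clauses explicitly. The helper names are hypothetical, the grouping assumes the intent was "either accession pattern, restricted to that day", and the mixed [ ... } brackets (inclusive start, exclusive end) assume Solr 4.0 or later.

```python
from datetime import date, datetime, timedelta, timezone

SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def solr_day_range(field, day):
    """Return a range clause covering the whole UTC day `day`."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    # "[" includes the start instant, "}" excludes the end instant.
    return (field + ":[" + start.strftime(SOLR_DATE_FORMAT)
            + " TO " + end.strftime(SOLR_DATE_FORMAT) + "}")

def build_query(day):
    """Require the date clause and group the two accession alternatives."""
    accession = "(accession:*PXD* OR accession:*PRD*)"
    return accession + " AND " + solr_day_range("publication_date", day)

query = build_query(date(2012, 12, 31))
print(query)
```

The parentheses make the precedence explicit instead of relying on how the parser combines OR and AND within a single flat expression.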

Solr spatial search advanced! Solr field value in the query? Solr 4.10

Let's say we have a Solr document representing a building, with multiple location fields. Every building document has at least one location field, which holds the building's own location. All other location fields are dynamic and represent facilities around the building.
Let's say these facilities are typed, for example: 1 - schools, 2 - parks, 3 - parking lots.
Each building may therefore have a variety of these facilities. Some buildings may point to the same facility type at the same location, while others may point to the same type at different locations.
In essence we have:
building: {
...
main_location: "lat:long",
facility_1_location: "lat:long",
facility_2_location: "lat:long",
...
}
How do I construct a query to find all buildings that have a facility of type "schools" (type 1) within a 5 kilometer radius?
One potential solution is to use sub-queries, where each sub-query takes the main_location of one building and queries against facility_1_location. However, the query will grow very quickly if we have a lot of buildings stored.
Another solution would be to use the document's own main_location field value in the query, but I am not sure whether that's possible in Solr. I tried and searched for it, but couldn't find a solution.
Are there any experts on this? I am using Solr 4.10.

How do I create a Solr query that returns results even if one field in my query has no matches?

Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
will match either clause, with a scoring preference for documents that match both. See the Lucene query syntax documentation.
You can also use the bq (boost query) parameter, which does not affect matching and only affects scores. Note that bq is a parameter of the dismax and edismax query parsers.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
Play with the boost factor at the end to get the right balance between the matching score and the boost score.
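With many attributes, the boost clauses can be generated instead of written by hand. The sketch below is one possible way to do that in Python. The function name and the list of attribute boosts are hypothetical, and it assumes the edismax parser is available (defType=edismax), since bq is only honored by dismax-style parsers.

```python
from urllib.parse import urlencode

def recommendation_url(base_url, main_query, boosts):
    """Build a request where `main_query` decides matching and each
    (clause, weight) pair in `boosts` only raises the score."""
    params = [("defType", "edismax"), ("q", main_query)]
    for clause, weight in boosts:
        # bq may be repeated; each one adds an optional boosting clause.
        params.append(("bq", f"{clause}^{weight}"))
    return base_url + "?" + urlencode(params)

url = recommendation_url(
    "http://localhost:8983/solr/persons/select",
    "industry:manufacturing",
    [("City:Oakland", 2)],
)
print(url)
```

Both persons from the example would match on industry, and the one in Oakland would be expected to rank higher because of the boost.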

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure I understand exactly what you want to achieve, but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you want both parts to be boosted on the same scale (I am not sure whether this is already Solr's default behaviour), you could use dismax or edismax and change your query to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
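For illustration, the explainOther request parameters could be assembled like this. It is a sketch only: the helper name is hypothetical and the unique key field is assumed to be called id.

```python
def explain_other_params(query, doc_ids, id_field="id"):
    """Return request parameters that ask Solr to explain how the
    documents in `doc_ids` score against `query`."""
    # OR list of the interesting document IDs, e.g. id:(1 OR 2 OR 4)
    other = id_field + ":(" + " OR ".join(str(d) for d in doc_ids) + ")"
    return {
        "q": query,
        "debugQuery": "true",   # explainOther is ignored without this
        "explainOther": other,
    }

params = explain_other_params("somequery", [1, 2, 4])
print(params)
```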
Another option would be to make use of a pseudo-field calculated using the query() function to report how any set of results scores on some other query or queries. So if, for example, the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. Then the documents' scores against query_B would be included in the output (as bscore).
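A sketch of such a request in Python follows. Instead of embedding the second query inline, it passes it through a separate request parameter and dereferences it with $qb, which avoids nested quoting. The parameter name qb, the helper name, and the core name mycore are hypothetical, and the dismax sub-query is assumed to rely on the handler's default qf or df.

```python
from urllib.parse import urlencode

def score_against_other_query(base_url, main_query, other_query, alias="bscore"):
    """Run `main_query` and report each hit's score on `other_query`
    as a pseudo-field named `alias`."""
    params = [
        ("q", main_query),
        # Pseudo-field: alias:function, with the query taken from $qb.
        ("fl", f"*,score,{alias}:query($qb)"),
        # Assumption: dismax uses the handler's default qf/df here.
        ("qb", "{!dismax}" + other_query),
    ]
    return base_url + "?" + urlencode(params)

url = score_against_other_query(
    "http://localhost:8983/solr/mycore/select", "query_A", "query_B"
)
print(url)
```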
Finally, the result-grouping functionality can be used both to collect the top-N for one query and to collect scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
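The grouping request described above can be put together as follows. This is a sketch under the same assumptions as the parameters quoted in the text; the helper name and the core name mycore are hypothetical.

```python
from urllib.parse import urlencode

def grouped_query_url(base_url, main_query, group_queries):
    """Build a request that returns one group per entry in `group_queries`,
    all ranked by `main_query`."""
    params = [("q", main_query), ("group", "true")]
    # group.query is repeated once per group.
    params += [("group.query", gq) for gq in group_queries]
    return base_url + "?" + urlencode(params)

url = grouped_query_url(
    "http://localhost:8983/solr/mycore/select",
    "query_B",
    ["query_B", "query_A"],
)
print(url)
```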

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then query for documents with a specific value aggregation in a specific time range? For example, "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
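The first approach from the quoted post amounts to a pair of client-side helpers. The sketch below shows one possible encoding in Python, using the "01234:Chris Hostetter" format from the post. The function names are hypothetical, and it assumes ids never contain a colon.

```python
def encode_facet_value(item_id, label):
    """Pack an id and its label into one facet field value."""
    if ":" in item_id:
        # The first colon is the separator, so ids must not contain one.
        raise ValueError("id must not contain ':'")
    return f"{item_id}:{label}"

def decode_facet_value(value):
    """Split a facet value back into (id, label)."""
    # Split only on the first colon so labels may contain colons.
    item_id, label = value.split(":", 1)
    return item_id, label

encoded = encode_facet_value("01234", "Chris Hostetter")
print(encoded)
print(decode_facet_value(encoded))
```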
