I am doing address matching in SOLR and for the most part it is working fine. I have a situation where I would like to return the same value for the following two cases:
10 SMITH COURT REDBANK PLAINS QLD
10 SMITH CT REDBANK PLAINS QLD
The street type abbreviation CT = COURT.
One option I have tried is to have both the records in SOLR, but that just leads to duplication of a lot of data. I have ~30 million records, but these could be halved if there is a way in SOLR to match as explained above.
Any suggestions how to handle this issue?
Synonyms allow users to find documents through multiple terms that might not have been used in the original document definition.
You can try using solr.SynonymGraphFilterFactory, with CT and COURT listed as synonyms in its synonyms file.
For more details on the synonym filter please refer to the documentation.
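As a rough sketch of what the synonym approach needs, the snippet below generates entries in the comma-separated format used by Solr synonym files, where the terms on one line are treated as equivalent. The abbreviation table and the function name synonym_lines are made up for illustration. The filter still has to be added to the field type's analyzer chain in the schema, which is not shown here.

```python
def synonym_lines(abbreviations):
    """Return one 'ABBR, FULL' line per street type, sorted for stable output."""
    return [f"{abbr}, {full}" for abbr, full in sorted(abbreviations.items())]


# Hypothetical street-type table; extend it with the abbreviations in your data.
STREET_TYPES = {"CT": "COURT", "RD": "ROAD", "AVE": "AVENUE"}

if __name__ == "__main__":
    # Each printed line is one entry for the synonyms file.
    print("\n".join(synonym_lines(STREET_TYPES)))
```

With a mapping like this only one form of each address has to be indexed, since both CT and COURT resolve to the same terms at analysis time.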
We noticed the following issue with 'categoryname' field search in WebSphere Commerce, and we are trying to understand whether it is a data setup issue or whether Commerce Search/SOLR is not designed to work with this type of scenario.
We have more than 100 catalogs that are site- and customer-specific. Customers get their own catalog/category/items when they log in, and there is no issue with category browsing or order placement. However, we have an issue with the OOB keyword search, since the OOB IBM_findProductsBySearchTerm profile includes 'categoryname' as part of 'defaultSearch' when making SOLR calls, along with name, shortDesc, keyword, and a few other fields.
As a result, we are seeing too many unwanted results that are not relevant to the given search keyword, because a match is found in the category name(s) of some other customer's catalog. We do see correct results if I comment out the lines below in the wc-search.xml file, but this prevents searching categories in the current catalog as well.
<_config:field name="defaultSearch"/>
<_config:field name="categoryname"/>
For example, the following categories match the keyword 'candy' but are not part of the current site and catalog (the site with catalog D). How do we prevent these from being scored during keyword search while still using categoryname search?
Rubys' candy -- in catalog A
Smith dairy stuff -- in catalog B
Kitchen Utensils -- in catalog C
Candy supplies -- in catalog E
Prep kits -- in catalog D, no items in this category have word 'candy' in it.
Basically, we are getting items from the 'Prep kits' category as well for the site with catalog D in keyword search, because other catalogs' categories have the word 'candy' in them. In a nutshell, we get too many non-relevant results the moment the 'categoryname' field is used in wc-search.xml or in a direct SOLR query (qf=categoryname).
I believe the issue is because the categoryname is indexed as wc_text and multivalued with comma separated data across all catalogs in the system.
What kind of customization needs to be done to fix this issue, so that the search would return relevant results?
Thanks
There is nothing OOB for this, since the categoryname index data has no catalog_id visibility. I solved the issue by adding a dynamic, multivalued categoryname_ field and using it to replace the existing categoryname qf in a custom ExpressionProvider class. This limits keyword searches to the categories of the current catalog(s) and returns correct results.
Let's say we have a Solr document representing a building, with multiple location fields. Every building document has at least one location field, which indicates the building's own location. All other location fields are dynamic and represent facilities around the building.
Let's say that these facilities are type-based, for example: 1 - schools, 2 - parks, 3 - parking lots.
Each building may therefore have a variety of these facilities. Some buildings may point to a facility of the same type at the same location, while others may point to the same type but at a different location.
In essence we have:
building: {
...
main_location: "lat:long",
facility_1_location: "lat:long",
facility_2_location: "lat:long",
...
}
How do we construct a query if we want to find all buildings that have a facility of type "schools" (or "1") within a 5 kilometer radius?
One potential solution is to use sub-queries, where each sub-query takes the main_location of a building and queries against facility_1_location. However, the query will grow in size very rapidly if we have a lot of buildings stored.
Another solution would be to use the document's own main_location field to construct the query, but I am not sure whether that is possible in Solr. I tried and searched for it, but I couldn't find a solution.
Are there any experts on this? I am using Solr 4.10.
Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
will match either clause, with a scoring preference for documents that match both. See the Lucene query syntax documentation.
You can also use the bq (boost query) parameter, which does not affect matching and only affects the scores.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
Play with the boost factor at the end to get the right balance between the matching score and the boosting score.
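A minimal sketch of building such a request with proper URL encoding is shown below. The helper name build_select_url is made up for illustration. One assumption worth checking against your setup: bq is a parameter of the DisMax/eDisMax query parsers, so the sketch adds defType=edismax, which the URL above does not include.

```python
from urllib.parse import urlencode


def build_select_url(base_url, main_query, boost_query):
    """Build a /select URL where bq only influences scoring, not matching."""
    params = {
        "q": main_query,        # decides which documents match
        "defType": "edismax",   # assumption: bq needs a (e)dismax parser
        "bq": boost_query,      # adds score to matching docs that also match this
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url(
        "http://localhost:8983/solr/persons",
        "industry:manufacturing",
        "City:Oakland^2",
    ))
```

Both persons from the example match q, and only the one in Oakland receives the extra boost.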
I am trying to compare two documents in Solr (say Doc A and Doc B) based on a common "name" field, using a Solr query. When I query with A.name, I get document B in the results with a relevancy score of, say, SCR1. If I do it the other way round, i.e. query with B.name, I get document A somewhere in the results, but this time the score is not the same SCR1.
I believe this happens because the number of terms in A.name and B.name is different, so the similarity score is not the same. Is that the reason for this difference?
Is there any way I can get the same score in both directions (as described above)?
Is it not possible to compare the scores of any two queries?
Is it possible to do this with the native Lucene APIs?
To answer your second question, scores from two different queries must not be compared.
A similar question was posted in the java-users lucene mailing list.
Here's a link to it: Compare scores across queries
An explanation is given there as to why one must not do that.
I'm not quite sure I'm clear on the queries you are referring to, but let's say the situation is something like this:
Doc A: Name = "Carlos Fernando Luís Maria Víctor Miguel Rafael Gabriel Gonzaga Xavier Francisco de Assis José Simão de Bragança, Sabóia Bourbon e Saxe-Coburgo-Gotha"
Doc B: Name = "Tomás António Gonzaga"
If you search for "gonzaga", Doc B will be given the higher score: there is one match in each name, but Doc B has a much shorter name, with only three terms, and shorter fields are weighted more heavily. This is the lengthNorm referred to in the TFIDFSimilarity documentation.
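To get a feel for the size of this effect, here is a small sketch of the classic TF-IDF length normalization, 1 / sqrt(number of terms in the field). It ignores field boosts and the precision lost when Lucene stores the norm in a compact encoded form, and the term count of 23 for the long name is an assumption about how it would be analyzed.

```python
import math


def length_norm(num_terms):
    """Classic TF-IDF length normalization: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


if __name__ == "__main__":
    short_name = length_norm(3)   # "Tomás António Gonzaga"
    long_name = length_norm(23)   # the long name, assuming 23 analyzed terms
    # The shorter field gets the larger norm, so its single match counts more.
    print(short_name, long_name, short_name / long_name)
```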
There are other factors though. If we just chuck each name into the queryparser, and see what comes up, something like:
Query queryA = queryparser.parse(docA.name);
Query queryB = queryparser.parse(docB.name);
Then the queries generated are much different:
name:carlos name:fernando name:luis name:maria name:victor name:miguel name:rafael name:gabriel name:gonzaga name:xavier name:francisco name:de name:assis name:jose name:simao name:de name:braganca name:saboia name:bourbon name:e name:saxe name:coburgo name:gotha
vs
name:tomas name:antonio name:gonzaga
There is a wealth of reasons why these would generate different scores: the lengthNorm discussed above; the coord factor, which boosts results that match more query terms and would very likely come into play; tf, which weights documents with more matches for a term more heavily; idf, which prefers terms that appear less frequently across the entire index; and so on.
Scores are only relevant to the result set of a single query run. A change to the query or to the state of the index can lead to different scores, and they are not intended to be comparable. You can use IndexSearcher.explain to understand how a score was calculated.
I am attempting to do a join on two fields that have the same name (company_id) but are from different entities to query a document based on a field it does not have.
For example, I have a sales entity and a company entity, where the sales entity holds a company id and the company entity holds the name of the company.
For size reasons, I cannot do this join at index time.
I wish to get the names of the companies that have a sale over x.
I attempted both of the following:
q={!join+from=company_id+to=company_id}sales:[100 TO *]
and
fq={!join+from=company_id+to=company_id}sales:[100 TO *]
For the fq one I just specified *:* as the q parameter.
In both cases I got results, but the results did not have sales in that range.
How can I fix this?
Using Solr 4.4
Note: This appears to work with only one entity involved.
By "different entities", are you referring to two Solr cores?
In that case you have to use a slightly different syntax:
http://localhost:8983/solr/<coreTO>/select?q={!join from=docId to=id fromIndex=<coreFROM>}query
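The local-params part of such a query contains braces, spaces and brackets, so it helps to let a library do the URL encoding. Below is a minimal sketch. The helper name and the core names sales and companies are hypothetical, chosen to mirror the question.

```python
from urllib.parse import urlencode


def build_cross_core_join_url(base_url, to_core, from_core,
                              from_field, to_field, inner_query):
    """Build a /select URL on to_core that joins in matches from from_core."""
    join = (f"{{!join from={from_field} to={to_field} "
            f"fromIndex={from_core}}}{inner_query}")
    # urlencode percent-encodes the braces, '!' and ':' and turns spaces into '+'.
    return f"{base_url}/{to_core}/select?{urlencode({'q': join})}"


if __name__ == "__main__":
    print(build_cross_core_join_url(
        "http://localhost:8983/solr", "companies", "sales",
        "company_id", "company_id", "sales:[100 TO *]",
    ))
```

The inner query runs against the fromIndex core, and the documents returned come from the core named in the URL path.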
From this link Solr-join
I have found the solution.
According to this:
The join operation is done on a term basis, so the "from" and "to" fields must use compatible field types. For example: joining between a StrField and a TrieIntField will not work, likewise joining between a StrField and a TextField that uses LowerCaseFilterFactory will only work for values that are already lower cased in the string field.
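The effect of joining on a term basis can be illustrated with a toy model, sketched below. It is not Solr code: the two index_* functions are hypothetical stand-ins for a string field that keeps the value as one unmodified term and a text field that lowercases and splits on whitespace.

```python
def index_str_field(value):
    """StrField-like: the whole value is indexed as a single, unmodified term."""
    return [value]


def index_lowercased_text(value):
    """TextField-like stand-in: lowercase, then split on whitespace."""
    return value.lower().split()


def join_matches(from_terms, to_terms):
    """Only terms that are identical on both sides can join."""
    return set(from_terms) & set(to_terms)


if __name__ == "__main__":
    # Mixed case in the string field never equals the lowercased text term.
    print(join_matches(index_str_field("ACME"), index_lowercased_text("ACME")))
    # A value that is already lowercase produces the same term on both sides.
    print(join_matches(index_str_field("acme"), index_lowercased_text("acme")))
```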