Solr - how do I index barcode - solr

I have a document that contains the following data:
car {
id: guid
name: string
sku: list<barcode>
}
The barcodes don't follow a fixed pattern. A barcode can look like either of the following:
ABCD-EF34GD-JOHN
ABCD-C08-YUVF
I want to index my documents so that a search for:
1. ABCD will return both.
2. AB will return both.
3. JO will return ABCD-EF34GD-JOHN, but not a car with the name john.
4. If the ID (which is indexed) contains "ABCD", I don't want the document to be returned (the user doesn't see the ID).
So far I have defined car and sku as text_en.
But I don't get items 2 and 3.
Is there a better way to define the sku attribute?
My query is:
http://....:8983/solr/vault/select?q=ABCD&qf=Name+SKU&defType=edismax
Thanks.

What you are trying to do here is actually a wildcard search on the tokens separated by the dash ("-").
An easy (but slower) way is to add a star (*) at the end of your word in the query, like this:
http://....:8983/solr/vault/select?q=AB*&qf=Name+SKU&defType=edismax
Another option is to change the field type that you use for indexing and apply an n-gram filter. With the edge n-gram filter linked below, a token is created for each leading prefix of the word you are indexing. For example: ABCD => AB, ABC, ABCD
The search will then find what you are looking for and will be very fast, but the index will be much bigger and the indexing time will also increase noticeably.
You can find more info here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
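To make the edge n-gram idea concrete, here is a minimal Python sketch of what such an analysis chain does to a barcode. It is not Solr code: the function names, the split on "-", the lowercasing and the gram sizes (2 to 15) are assumptions chosen for illustration. In Solr, the equivalent work would be done by the tokenizer and filters configured on the field type.

```python
# Illustration only: mimics edge n-gram analysis in plain Python.
# Splitting on "-", lowercasing and the gram sizes are assumptions.

def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the leading prefixes of a token, from min_gram to max_gram characters."""
    token = token.lower()
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(barcode, min_gram=2, max_gram=15):
    """Collect the prefixes of every dash-separated part of a barcode."""
    terms = set()
    for part in barcode.split("-"):
        terms.update(edge_ngrams(part, min_gram, max_gram))
    return terms


def matches(query, barcode):
    """A query matches when it equals one of the indexed prefixes (no query-time n-grams)."""
    return query.lower() in index_terms(barcode)


BARCODES = ["ABCD-EF34GD-JOHN", "ABCD-C08-YUVF"]

if __name__ == "__main__":
    for query in ("ABCD", "AB", "JO"):
        print(query, [b for b in BARCODES if matches(query, b)])
```

Because only the sku field would be analyzed this way, a query for JO should not match a car whose name is john, as long as the name field keeps a plain text type without n-grams.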

Related

Object with longer field is returned against an object with a short field

Let's say that we have an index with two objects:
{
"name": "iPhone 6s Plus big screen, super fast, ultra responsive, blah blah"
}
and:
{
"name" : "iPhone 6s Plus"
}
Now, when I search for iPhone 6s Plus, the first object is ranked higher, which does not make sense, since the first object contains more words (or noise) than the second object for the given query. In other words, a term appearing in a short name field should carry more ranking weight than the same term appearing in a long name field.
Algolia uses a TF/IDF algorithm, which takes the field-length norm into account, so the second object should have a higher score than the first one.
So why does the first object have a higher score than the second one? Is there a settings option that I am missing?
I found the answer, but I am not sure if it is 100% correct, or if there is a better way to achieve this.
Log in to Algolia -> Select Your Instance -> Go to the Ranking tab.
In the Ranking Formula, add a new row. The new row should have the attribute type {{the name of the column; in this example it is "title"}} and be set to Ascending.
With that, you will achieve what we are looking for.
One option is to break up the value into two different attributes, one for just the product name and another for the description. Doing that also lets you prioritize the product name in your searchable attributes, which would lead to better relevance in most cases.

Solr what is the difference between query using q and df?

I just did two things.
q -> iphone
df -> brand
and
q -> brand:iphone
Both return the same result.
The first one looks for the string iphone in the brand field. The second one returns documents whose brand field has the value iphone.
What is the purpose of the df parameter?
There really isn't any difference - but to show WHEN it would be different, you'll have to consider the case when you query a different field than the one provided in df.
q=model:foo&df=brand
This would lead to foo being matched against values in the field model, while brand is ignored. If the person writing the query however didn't specify a field, brand would be searched.
Most of the time you'd want to use the edismax or dismax query type (defType=edismax) to be able to create more suitable rules for which fields to query and the weight between the fields, and to handle how most people use a search field:
defType=edismax&q=foo&qf=brand^10 model
.. would search the fields brand and model for foo, and give a tenfold increase in score if the hit is in the brand field compared to the model field. Just q=foo&qf=brand would replicate your first query, and since edismax also supports parts of the lucene syntax, q=brand:foo&qf=model should also work.
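The three variants discussed above can be put side by side as request URLs. This is only a sketch: the host, port and core name in BASE_URL are hypothetical placeholders, and the parameter values are the ones used in the answer.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own Solr endpoint.
BASE_URL = "http://localhost:8983/solr/mycore/select"


def select_url(params):
    """Build a select URL with properly encoded query parameters."""
    return BASE_URL + "?" + urlencode(params)


# The field is given explicitly in q, so df=brand is ignored for this term.
explicit_field = select_url({"q": "model:foo", "df": "brand"})

# No field in q, so the default field (df) is searched.
default_field = select_url({"q": "foo", "df": "brand"})

# edismax: search brand and model, weighting hits in brand ten times higher.
weighted = select_url({"defType": "edismax", "q": "foo", "qf": "brand^10 model"})

if __name__ == "__main__":
    for url in (explicit_field, default_field, weighted):
        print(url)
```

Encoding the parameters with urlencode avoids problems with characters such as ":", "^" and spaces in the query string.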

Faceted on multiple values of the same field in haystack

I am using Haystack and Solr, and I am trying to implement faceted search on one field for multiple values. For example, I am faceting on the "author" field.
john 3
kevin 2
sam 2
I want to facet on "john" OR "sam". How can I format the URL for it?
http://localhost:8000/search/?q=*&selected_facets=author_exact:john +OR+ selected_facets=author_exact:sam
If you want to limit the resulting set of documents to those containing either john or sam, use a fq:
fq=author:sam OR author:john
If you want to only generate facets on certain values or queries, use facet.query:
facet.query=author:sam OR author:john
You will have to use OR with narrow() in your view/form (the exact implementation depends on which view/form you are using).
Since getting the list of selected_facets simply involves:
self.request.GET.getlist('selected_facets')
How you implement that in your URL is entirely up to you.
You could join the values with some kind of separator and then split them apart:
localhost:8000/search/?q=*&selected_facets=author_exact:john|sam
for x in selected_facets:
    field_name, value = x.split(':', 1)
    # Only entries that carry several values joined by "|" need splitting.
    if "|" not in value:
        continue
    values = value.split('|')
You could also do it this way:
localhost:8000/search/?q=*&selected_facets=author_exact:john&selected_facets=author_exact:sam
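from collections import defaultdict

# defaultdict(list) creates the list on first use, so append() cannot raise KeyError.
facet_dict = defaultdict(list)
for x in selected_facets:
    field_name, value = x.split(':', 1)
    facet_dict[field_name].append(value)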
Then in haystack:
sqs.narrow('author_exact:(john OR sam)')
So basically there are no strict rules/standards for how to implement multiple values in the url for faceting.
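Both URL styles above can be handled by one small helper that groups the values per field and produces the strings to pass to narrow(). This is a sketch under assumptions: build_narrow_queries is a hypothetical name, the "|" separator is the one from the first example, and the values are not escaped, so it only suits simple values such as the author names shown here.

```python
from collections import defaultdict


def build_narrow_queries(selected_facets):
    """Group 'field:value' entries by field and OR the values of each field together."""
    grouped = defaultdict(list)
    for facet in selected_facets:
        if ":" not in facet:
            continue  # skip malformed entries
        field_name, value = facet.split(":", 1)
        # Accept both repeated parameters and "|"-separated values.
        for single in value.split("|"):
            if single and single not in grouped[field_name]:
                grouped[field_name].append(single)

    queries = []
    for field_name, values in grouped.items():
        if len(values) == 1:
            queries.append(f"{field_name}:{values[0]}")
        else:
            queries.append(f"{field_name}:({' OR '.join(values)})")
    return queries


if __name__ == "__main__":
    print(build_narrow_queries(["author_exact:john", "author_exact:sam"]))
    print(build_narrow_queries(["author_exact:john|sam"]))
```

In the view, each returned string would then be applied in turn, for example with sqs = sqs.narrow(query) inside a loop over the result.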

How to boost AND in a solr query?

Suppose a user enters a two-word input for search. Since the default boolean operator applied is OR, all entries containing either or both words appear.
What I would like to know is whether entries that specifically meet the AND condition could be boosted.
In case of multiple words, can words be specified to imply specific constraints in searching, or to boost a few parameters when these words are present? For example, if the input is "with x and y without z", can I make Solr interpret it as (x AND y) AND (NOT z), or at least boost those entries which partially or fully meet the requirement?
EDIT:
I have tried using boost with edismax as shown here:
$query = $client->createSelect(); //create search query
$query->setQuery('memberType:'.$searchQuery.' firstName:'.$searchQuery.' gender:'.$searchQuery); //mention the fields to be searched and the search query/queries
$edismax = $query->getEDisMax();
$edismax->setQueryFields('firstName memberType^3 gender^2'); //boost fields
$query->setStart($start)->setRows($rows); //vary the bracketed values to change the results starting point and the number of rows displayed; use variables instead of constants
$query->setFields(array('id', 'firstName', 'lastName', 'eid', 'gender', 'memberType')); //set return fields
//$query->addSort('id', $query::SORT_ASC); //sort field and customisations
$resultSet = $client->select($query);
When I search for a name with a particular member type, like "sanjay candidate", I expect the order to be: entries matching both sanjay and candidate, then all users who are candidates, and then all users who are named sanjay. Instead I get entries matching sanjay and candidate, then all who are sanjay, and then all candidates.
I am not able to figure out what the issue may be, or whether I can provide more customized boosting.
If you are using eDisMax, you have a whole collection of boosting options: phrase boosts, bigram boosts, a separate boosting query, and so on. Read through the wiki page and experiment. You should not need to do any custom coding for this scenario.
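As one possible starting point for that experimentation, the sketch below builds request parameters that keep the default OR matching but add a boosting query (the bq parameter) that only matches when every word is found in at least one field. The helper names are hypothetical, the boost factor 10 is arbitrary, and the input is assumed to be simple alphanumeric words with no escaping; whether the resulting order matches your expectations has to be checked against your own index.

```python
def all_terms_boost_query(user_input, fields, boost):
    """Build a Lucene-syntax clause requiring every word to match in some field."""
    terms = user_input.split()
    if not terms:
        return ""
    clauses = []
    for term in terms:
        per_field = " ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"+({per_field})")
    return f"({' '.join(clauses)})^{boost}"


def edismax_params(user_input):
    """Assemble eDisMax parameters; field names and weights are taken from the question."""
    fields = ["firstName", "memberType", "gender"]
    params = {
        "defType": "edismax",
        "q": user_input,
        "qf": "firstName memberType^3 gender^2",
    }
    boost_query = all_terms_boost_query(user_input, fields, 10)  # 10 is an arbitrary boost
    if boost_query:
        params["bq"] = boost_query
    return params


if __name__ == "__main__":
    print(edismax_params("sanjay candidate"))
```

Here q carries the plain user input without field prefixes, so the weights in qf decide which fields are searched.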

Using multivalued field in map function

I'm working on implementing Solr in a project and right now I'm stuck on a specific search including an arr field. The thing is:
I'd like to search sub-id's on an object, these sub-id's are stored in a multivalue field, e.g.:
<arr name="SubIds">
<int>12272</int>
<int>12304</int>
<int>12306</int>
</arr>
The query (or part of the query) that I want to use is as follows:
map(SubIds,i,i,1,0)
When I, for example, put 12304 in place of 'i' in the map function above, I would expect the function to return 1. If I entered 12345, it should return 0. The problem is that when I run this query it returns 0, as if to say "There's no number 12304 in this field, so I return 0".
When I remove the 0 from my map function, I can see the actual value returned to me (1 when the value is 12304, otherwise the value itself); in this case that is 12306! I've tried this with several different multivalued fields, but the result is the same; it looks like the function is checking the last value in the multivalued field against the ID I filled in.
Is this true? And if so, is there any way to look through the whole arr and only return 0 when the value doesn't exist anywhere in the multivalued field?
Edit: It's just a hunch, but could it be that the map() function automatically orders the arr list when it sees that all the items are of type int (for example)? That could mean that map returns the first number (the highest), which would (in my example) be 12306, not 12304...
Thanks!
... It looks like function queries don't work with multivalued fields ...
http://lucene.472066.n3.nabble.com/Using-multivalued-field-in-map-function-td3318843.html#a3322023:
Function queries don't work with multivalued field.
http://wiki.apache.org/solr/FunctionQuery#Vector_Functions
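The observed results are consistent with the function receiving a single value per document. The sketch below mimics the semantics of map(x,min,max,target,default) in plain Python; solr_map is a hypothetical name, and the assumption that only 12306 reaches the function comes from the observation in the question, not from Solr internals.

```python
def solr_map(x, min_value, max_value, target, default=None):
    """Mimic map(): values within [min, max] become target; others become default, or stay as-is."""
    if min_value <= x <= max_value:
        return target
    return x if default is None else default


# Assumption: the function only sees one value of the multivalued field (12306 here).
seen_value = 12306

with_default = solr_map(seen_value, 12304, 12304, 1, 0)   # default given
without_default = solr_map(seen_value, 12304, 12304, 1)   # no default: value passes through

if __name__ == "__main__":
    print(with_default, without_default)
```

With a single visible value of 12306, the variant with a default yields 0 and the variant without one yields 12306, which is what the question reports.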
Given the following case, is there anybody who has a better idea on how I can query the wanted data?
I've got a website full of blog posts, and every blog post has an owner;
this owner is referred to through his/her ID. For example: BloggerId
= 123. It's also possible that the blog has multiple co-writers, who
are also referred to by their BloggerId, but these IDs are stored in
the multivalued field, in my previous example SubIds.
When searching for a specific blogger, one searches the BloggerId.
Search results are influenced by a number of variables: the
country/state/more specific geographical data, the blog category, etc.
For this I use a faceted query. Next I want to make some results more
important, depending on the BloggerId. I tried to do this with the
following query:
?q={!func}map(sum(map(BloggerId,12304,12304,2,0),map(BloggerId,12304,12304,1,0)),3,3,2)&fl=*,score&facet.field=Country&f.Country.facet.limit=6&facet.field=State&fq=(BlogCategory:internet%20OR%20BlogCategory:sports&sort=score%20desc,Top%20desc,%20SortPriority%20asc&start=0&omitHeader=true
In the resulting list, blogs written by BloggerId 12304 should be on
top of the list, followed by the blogs where BloggerId 12304 was
co-writer. After that, all other blogs that follow the criteria but
aren't written (or co-written) by BloggerId 12304.
Maybe I could make this multivalued field a string field (where IDs are separated by ";") and query my value, but if anyone has a better idea, you're always welcome!
In the end I chose to add a string-valued field that uses whitespace to separate the different values. After that I used the solr.WhitespaceTokenizerFactory class to quickly scan the string for occurrences of a specific ID.
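The idea behind that workaround can be sketched in plain Python: the IDs are joined into one whitespace-separated string at index time, and a lookup then compares whole tokens rather than substrings. The function names and the priority values (2 for the owner, 1 for a co-writer, 0 otherwise) are assumptions for illustration, based on the ordering described in the question; this is not the actual Solr configuration.

```python
def to_indexed_string(ids):
    """Join the sub-IDs into the single whitespace-separated value stored in the index."""
    return " ".join(str(i) for i in ids)


def whitespace_tokens(text):
    """Split on whitespace, similar in spirit to a whitespace tokenizer."""
    return text.split()


def blog_priority(blogger_id, sub_ids_field, wanted_id):
    """Return 2 for the owner, 1 for a co-writer, 0 for everyone else."""
    wanted = str(wanted_id)
    if str(blogger_id) == wanted:
        return 2
    if wanted in whitespace_tokens(sub_ids_field):
        return 1
    return 0


if __name__ == "__main__":
    field = to_indexed_string([12272, 12304, 12306])
    print(field, blog_priority(123, field, 12304))
```

Matching on whole tokens matters here: a search for 1230 must not match the stored ID 12304.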

Resources