I have a search in Solr that is returning about 1500 documents. These documents are basically products. For example, I have a bunch of womens shoes in my dataset. My dataset has a wide variety of shoes for women, but it also has some very similar results, for instance, size 11 womens nike trainers, size 10 womens nike trainers, etc. Now, when I search for womens shoes, Solr scoring causes a certain set of these results to bubble to the top that are all very similar. For instance, all the colors of one particular shoe model might come to the top. They are definitely different products, but I would prefer to get a wider variety of results than just every color of nike trainer shoes.
Does anyone have any suggestions? Note, I don't want to eliminate all the individually colored products. When someone searches for blue womens nike trainers, I want them to get the blue model as the top result. I'm using the dismax query as my main query. What I would like to do is basically boost on some kind of "uniqueness of name compared to other results" factor.
You could either collapse on a field such as color or model:
http://wiki.apache.org/solr/FieldCollapsing
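As a sketch, assuming your documents share a model_id field identifying the base product (that field name is just an illustration), a collapsed query could look like:

```
q=womens shoes&group=true&group.field=model_id&group.limit=1
```

This returns only the top-scoring document per model, so the result list shows one color variant of each shoe instead of all of them.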
or you can use near duplicate detection when indexing:
http://wiki.apache.org/solr/Deduplication
http://karussell.wordpress.com/2010/12/23/detect-stolen-and-duplicate-tweets-with-solr/
the latter algorithm is implemented in jetwick for tweets, so it should work for titles, but it is not performant enough for big documents (i.e. it only works as plagiarism detection for 'short' strings). For long text you'll need locality-sensitive hashing:
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
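For the deduplication route, a minimal solrconfig.xml sketch might look like the following (the signature field name and the source fields are placeholders; TextProfileSignature does fuzzy near-duplicate matching, while MD5Signature would be exact):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- keep the duplicates in the index; just mark them with the same signature -->
    <bool name="overwriteDupes">false</bool>
    <!-- fields used to compute the near-duplicate signature -->
    <str name="fields">name</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

At query time you could then collapse or filter on the signature field to keep only one of each near-identical group.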
I am using Solr with MongoDB for search in one of my projects. I must say, Solr is very powerful.
Currently, I am looking for a way to assign different scores to different keywords when the query has multiple words.
e.g. if a user searches for black doll house,
the weight of black should be greater than doll, and the weight of doll greater than house:
black > doll > house
Is it possible to implement this in Solr? If yes, how?
You can give a separate weight to each term in the standard lucene query syntax (searching in a field named text):
text:black^10 text:doll^5 text:house
This will give black ten times as much weight as house, and doll five times as much weight as house (but only half the weight of black). You'll have to tweak the weights to get the results you're looking for. If you want to keep the plain text in the q= parameter with (e)dismax as the query parser, you can use bq to apply these boosts separately from the query itself.
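A sketch of the bq variant (the field name and boost values are just examples to tweak):

```
q=black doll house&defType=edismax&qf=text&bq=text:black^10 text:doll^5
```

The bq clauses contribute additively to the score of matching documents without changing which documents match, so the user's query stays untouched while black-heavy documents rank higher.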
Did you try boosting the terms in the query? You can specify a different boost value for each term in the query.
Example: if you transform your query to:
textfield:black^6 textfield:doll^5 textfield:house^2
you get results where the top documents match black, then doll, then house.
Solr multiplies the term weight by the boost value: here black with 6, doll with 5, and house with 2.
I'm building a search engine which provides a list of cab drivers. We have some requirements:
The user is searching for the cheapest cab driver to bring him from place A to place B. He can go from any place to any place.
The default formula would be distance * price per mile.
But there are also special prices, e.g. AMSTERDAM to THE HAGUE is always 100 EUR.
The price per mile is season-based: winter and summer have different prices.
Faceted search based on attributes, like: is there champagne / luxury / a male or female driver / etc.
The user wants to sort on cheapest ride, but also on distance.
What would be the best approach to fit all these requirements? I've tried Solr but have not found a good solution for putting the price model in there. Any ideas?
The shop I'm working on sells clothing. Each item of clothing comes in multiple varieties. For example, Shirt A might come in: Red Large, Red Medium, Blue Large, Blue Medium, White Large, and White Medium.
At first I added each variety as a Solr doc. So for the above product I added 6 Solr docs, each with the same product ID. I got Solr to group the results by product ID and everything worked perfectly.
However, the facet counts were all variety counts, not product counts. For example, limiting it to the one product above (if that were the only product in the system), the facet counts would show:
Red (2)
Blue (2)
White (2)
Which was correct: there were 2 documents added for each color. But what I really want to see is this:
Red (1)
Blue (1)
White (1)
As there is only 1 product for each color.
So now I'm thinking that in order to do that I need to make each Solr document a product.
In that case I would add the product once, add the field "color" 3 times (once red, once blue, once white), and add the field "size" multiple times as well. But now Solr doesn't really know which size goes with which color. Maybe I only have white in small.
What is the correct way to go about this to make the facet counts as they should be?
Turns out I could do this using grouping (field collapsing):
http://wiki.apache.org/solr/FieldCollapsing#Request_Parameters
specifically these parameters added to the query:
group=true
group.field=product_id
group.limit=100
group.facet=true
group.ngroups=true
group.facet is the one that really makes the facets work with the groups the way I wanted them to.
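Putting those parameters together, a full request might look like this (assuming a product_id field and a color facet field, as in the example above):

```
q=shirt&group=true&group.field=product_id&group.limit=100&group.facet=true&group.ngroups=true&facet=true&facet.field=color
```

With group.facet=true, facet counts are computed per group rather than per document, which is what turns Red (2) into Red (1).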
I think that you have 2 options.
Option 1:
Once you get the list of facet values (Red, Blue & White in the given example), then fire the original query again with each facet value as a filter. For example, if the original query was q=xyz&group.field=ProductID then fire q=xyz&group.field=ProductID&group.ngroups=true&fq=color:Red. The ngroups value in the response will give you the required count for Red. Similarly, fire a separate query for Blue and White.
Option 2:
Create a separate field called Product_Color which combines the ProductID and the color. For example, if a product has ID ABC123 and color Red, then Product_Color will be ABC123_Red. Now, to get the facets for color, fire a separate query which groups by Product_Color instead of ProductID and you will get the facets with the correct values. Remember to set group.truncate=true for this to work.
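A sketch of that second query, using the field names from the example above:

```
q=xyz&group=true&group.field=Product_Color&group.truncate=true&facet=true&facet.field=color
```

With group.truncate=true, facet counts are based only on the most relevant document of each group, so each product/color combination is counted once.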
You can try looking into facet pivots, which would allow you to keep a single document per product and get a tree-like facet with proper counts and filtering.
I have some (imperfectly) clustered string data, where the items in one cluster might look like this:
[
Yellow ripe banana very tasty,
Yellow ripe banana with little dots,
Green apple with little dots,
Green ripe banana - from the market,
Yellow ripe banana,
Nice yellow ripe banana,
Cool yellow ripe banana - my favourite,
Yellow ripe,
Yellow ripe
],
where the optimal title would be 'Yellow ripe banana'.
Currently, I am using simple heuristics - choosing the most common name, or the shortest if there is a tie - with the help of SQL GROUP BY. My data contains a large number of such clusters; they change frequently, and every time a fruit is added to or removed from a cluster, the title for the cluster has to be re-calculated.
I would like to improve two things:
(1) Efficiency - e.g., compare the new fruit name to the title of the cluster only, and avoid grouping / phrase clustering of all fruit titles each time.
(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which repeats 2 times and is the most common complete name; however, as a phrase, 'Yellow ripe banana' is the most common in the given set.
I am thinking of using Solr + Carrot2 (got no experience with the second). At this point, I do not need to cluster the documents - they are already clustered based on other parameters - I only need to choose the central phrase as the center/title of the cluster.
Any input is very appreciated, thanks!
Solr provides an analysis component called a ShingleFilter that you can use to create tokens from groups of adjacent words. If you put that in your analysis chain (i.e. apply it to incoming documents when you index them), and then compute facets for the resulting field with a query restricted to the "fruit cluster", you will be able to get a list of all distinct shingles along with their occurrence frequencies - I think you can even retrieve them sorted by frequency - which I think you can easily use to derive the title you want. Then when you add a new fruit, its shingles will automatically be included in the facet computations the next time around.
Just a bit more concrete version of this proposal:
create two fields: fruit_shingle, and cluster_id.
Configure fruit_shingle with the ShingleFilter and any other processing you might want (like tokenizing at word boundaries with maybe StandardTokenizer, prior to the ShingleFilter).
Configure cluster_id as a unique id, using whatever data you use to identify the clusters.
For each new fruit, store its text in fruit_shingle and its id in cluster_id.
Then retrieve facets for a query like cluster_id:&lt;id&gt;, and you will get a list of words, word pairs, word triplets, etc. (shingles). You can configure the ShingleFilter to have a maximum shingle length, I believe. Sort the facets by some combination of length and/or frequency that you deem appropriate and use that as the "title" of the fruit cluster.
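A minimal schema.xml sketch of such a field type (the type name and shingle sizes are just assumptions to tune):

```xml
<fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on word boundaries first -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word pairs and triplets alongside the single words -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Faceting on a fruit_shingle field of this type, restricted with a filter on cluster_id and sorted with facet.sort=count, then returns the shingles of that cluster ordered by frequency.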
I have a field in a table in which I will be storing different kinds of data, like: X-Large, Medium, Small... or I might store: 22-March-2009, 1 Year, 2 Years, 3 Years... or 06 Months, 12 Months, 1 Year, or I might store: "33-36", "37-40"... And that data is not fixed; I might need to add new categories in the future.
The obvious choice of a data type would be nvarchar(length), but any other suggestions? Is there a way around this?
Sounds like you're trying to store a "size". Maybe you need a "Size" table with those values in it (X-Large, Medium, Small, 1 Year, etc.) and an ID field that goes in the other table.
Why you would also want to store a date in the same field is a bit confusing to me. Are you sure you shouldn't have two different fields there?
ETA:
Based on your comment, I would suggest creating a couple additional tables:
SizeType - Would define the type of "size" you were working with (e.g. childrens clothing, childrens shoes, mens shoes, womens shoes, mens shirts, mens pants, womens shirts, womens pants, etc.). Would have two columns - an ID and a Description.
Size - Would define the individual sizes (e.g. "Size 5", XL, 33-34, 0-6 Months, etc.). Would have three columns - an ID, a Description, and the corresponding SizeType ID from SizeType.
Now on your product table, you would put the ID from the size table. This gives you some flexibility in terms of adding new sizes, figuring out which sizes go with which type of products, etc. You could break it down further as well to make the design even better, but I don't want to overcomplicate things here.
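A rough sketch of that design in SQL (table and column names are just illustrative, using SQL Server syntax since the question mentions nvarchar):

```sql
CREATE TABLE SizeType (
    ID          INT IDENTITY PRIMARY KEY,
    Description NVARCHAR(50) NOT NULL      -- e.g. 'womens shoes', 'mens shirts'
);

CREATE TABLE Size (
    ID          INT IDENTITY PRIMARY KEY,
    Description NVARCHAR(50) NOT NULL,     -- e.g. 'XL', '33-34', '0-6 Months'
    SizeTypeID  INT NOT NULL REFERENCES SizeType(ID)
);

CREATE TABLE Product (
    ID          INT IDENTITY PRIMARY KEY,
    Name        NVARCHAR(100) NOT NULL,
    SizeID      INT NOT NULL REFERENCES Size(ID)
);
```

Adding a new size category is then just an INSERT into SizeType and Size, with no schema change.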
No matter what you do, such a database design does not look good.
Still, you can use a BLOB data type to just store any data in a column, or a text type if it's text (that way search will work better, understanding upper and lower case and such).
nvarchar(max) would work. Otherwise, you might have multiple columns, one for each possible type. That would keep you from converting things like double-precision numbers into strings and back.
nvarchar(max) if the data is restricted to strings of less than 2Gb.
ntext if you need to allow for strings of more than 2Gb.
binary or image if you need to store binary data.