Ecommerce and Vespa: filtering a customer's wishlist

Suppose an ecommerce site has wishlist functionality. Some users wishlist a lot of products (on the scale of tens of thousands), and the total number of products is in the millions. We want to let a customer filter the products in their wishlist just like in the catalogue.
When we looked into implementing this with Elasticsearch, the best approach we found was a terms lookup: we create a document for each user's wishlist and use it to filter down to the products we need; all further filtering etc. is then done as usual. The problem is that Elasticsearch cannot sort those products the way the wishlist document specifies them, and we want to sort by wishlisting time.
This is when I decided to look into Vespa, but after reading the documentation I still have no idea what the best way to implement this is. To my RDBMS-infected mind it looks like a problem needing a "join". :)
The data cardinalities are roughly:
millions of products
hundreds of thousands of users
tens of thousands of wishlisted items per user
So... any ideas how to implement or pointers on what to read?

In Vespa you need two document types for this (that is, if you want to store the wishlist in Vespa; it's not a requirement):
Retrieve the wishlist for the given user, either from Vespa using its get API or from another storage solution.
Retrieve and rank using a DotProductItem over the product id field along with a rank profile:
/search/?yql=select * from products where dotProduct(product_id, {"a":3, "b":2})&ranking=wishlist&hits=10
In this case a is a more recently wishlisted product than b. The rank profile to go with it:
rank-profile wishlist {
    first-phase {
        expression: rawScore(product_id)
    }
}
You can also use WAND to speed up the search and retrieve only the most recent/top ranking hits in the wishlist. The above example retrieves all and ranks all.
See
https://docs.vespa.ai/en/reference/query-language-reference.html
https://docs.vespa.ai/en/using-wand-with-vespa.html
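For illustration, the query above could be assembled from a stored wishlist like this (a Python sketch; the `products` schema, `product_id` field and `wishlist` rank profile are the ones assumed in this answer):

```python
# Build the dotProduct query URL from a user's wishlist. Weights are
# wishlisting timestamps, so more recently wishlisted products rank higher.
from urllib.parse import urlencode

def wishlist_query(wishlist, hits=10):
    """wishlist: dict mapping product id -> wishlisting timestamp (the weight)."""
    weights = ", ".join(f'"{pid}": {ts}' for pid, ts in sorted(wishlist.items()))
    yql = f"select * from products where dotProduct(product_id, {{{weights}}})"
    return "/search/?" + urlencode({"yql": yql, "ranking": "wishlist", "hits": hits})

url = wishlist_query({"a": 3, "b": 2})
```

In production you would URL-encode the query exactly like this anyway, so building it programmatically avoids escaping mistakes.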

You can do this with a WeightedSet item, where the item tokens are the product ids and the weight is the timestamp you want to sort by; see https://docs.vespa.ai/documentation/reference/query-language-reference.html#weightedset, or see https://docs.vespa.ai/documentation/multivalue-query-operators.html#weightedset-example to create it in a Searcher (recommended).
To rank by the timestamp, use a rank profile which just ranks by the match weight, e.g. attributeMatch(name).totalWeight.
(Regarding sorting, you could also just retrieve all the matches in a Searcher, re-sort them in code, drop those below the fold, and then fill() with the summary data. That scales fine to a few tens of thousands of matches, as long as you do it before filling.)
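The re-sort-in-code idea can be sketched in plain Python (names are illustrative; in Vespa this logic would live in a Searcher operating on the result hits):

```python
# Rank matched product ids by their wishlist timestamp (the weightedset
# weight), keep only the page being requested, and fetch summary data
# only for that page afterwards.
def resort_and_trim(matched_ids, wishlist, offset=0, page_size=10):
    """matched_ids: product ids that survived filtering.
    wishlist: dict mapping product id -> wishlisting timestamp."""
    ordered = sorted(matched_ids, key=lambda pid: wishlist[pid], reverse=True)
    return ordered[offset:offset + page_size]  # fill() summaries only for these

page = resort_and_trim(["p1", "p2", "p3"], {"p1": 100, "p2": 300, "p3": 200}, page_size=2)
# page == ["p2", "p3"]
```

The point is that the expensive summary fill happens after trimming, not before.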

Related

Pinning documents to specific positions in Azure Cognitive Search

We have a business requirement for a retail site wherein, for non-search product listing pages, the business wants to fix the position of certain products.
E.g.
Url - /nav/category-id
Current result - it displays all the products under that category-id
Say, P1,P3,P5,P7,P2,P4...
Requirement - business wants to fix first 3 position with P4,P3,P2...
Rest of the products should maintain their original order.
What we are doing currently: we give a very high boost, in descending order, to the pinned products, and it seems to work for now.
Our concern: even with a very high boost, we think we can never guarantee the order because of conflicting boosts coming from the scoring profile.
E.g. business might have pinned a new product, but our scoring profile is trying to boost products with good sales numbers.
So what would be the best way to achieve this?
When we tried fixing the positions externally by making two calls to Cognitive Search, one to get only the pinned products and a second to get the other products, we just opened a can of bugs: discrepancies in facet counts, wrong counts when paginating, and overall the solution became very complex.
Making two calls is a good approach here. One call should be responsible for the regular results, pagination, refiners, counts, etc. The secondary call should just populate the first N places.
If this complicates your application, consider changing your implementation: the frontend should be able to have individual components that each request external services to populate content on the page.
The most performant would be to submit both requests in parallel. They should populate content on the page as responses come back. Alternatively, you could make two sequential requests in the backend and then prepare the result by injecting up to three boosted items. We have a similar use case where we present up to three top-selling products above the regular results. When the page loads, two simultaneous requests are submitted to search, and the frontend renders two separate components. You must decide which search response you render your refiners, result counts, and pagination from.
Sure, you could fiddle with boosting rules, but I agree with your concern. It cannot be guaranteed and will probably fail at some point.
Update: With two queries in parallel, your pinned products will be duplicated. I.e. the pinned products will also be part of the main query responsible for presenting results, refiners, and pagination. This ensures correct refiner counts, hit count, filtering, and pagination.
If repeating the pinned items is a concern, you can run two queries in sequence. In the first query, you get up to three pinned items. In the second query, you do not filter out these three items from your query (to ensure the counts are correct). Instead, you skip presenting the pinned items again.
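That sequential flow can be sketched as follows, using the product ids from the question; the main query's counts and pagination stay untouched, and pinned items are simply skipped when they reappear:

```python
# Merge up-to-three pinned items with the main result page. The main
# query is NOT filtered (so refiner and hit counts stay correct); we
# just avoid presenting a pinned item twice.
def merge_pinned(pinned, main_results, page_size):
    pinned = pinned[:3]                      # business pins at most three
    seen = set(pinned)
    rest = [r for r in main_results if r not in seen]
    return (pinned + rest)[:page_size]

merged = merge_pinned(["P4", "P3", "P2"], ["P1", "P3", "P5", "P7", "P2", "P4"], 6)
# merged == ["P4", "P3", "P2", "P1", "P5", "P7"]
```

Note that the remaining products keep their original relative order, which is exactly the requirement.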

In Azure Search, is there a way to scope facet counts to the top parameter?

If I have an index with 10,000,000 documents and search text and ask to retrieve the top 1,000 items, is there a way to scope the facets to those 1,000 items?
My current problem is:
We have a very large index with a few different facets, including manufacturer. If I search for a product (WD-40 for instance), it matches a lot of different documents and document fields. It finds the product, and it is the top-scoring match, but because they only make 1 or 2 products, the manufacturer doesn't show up as a top facet option, since facets are sorted by count.
Is there a way to scope the facets to the top X documents? Or is there a way to only grab documents which are above a certain @search.score?
The purpose of a refiner is to give users options to narrow down the result set. I would say the $top parameter and the returned facets work as they should. Trying to limit the refiners to the top 1,000 results is a bad idea when you think about it: you'll end up with confusing usability and recall issues.
Your query for WD-40 returns a large result set. So large that there are 155347 unique manufacturers listed. I'm guessing you have several million hits. The intent of that query is to return the products called WD-40 (my assumption). But, since you search all properties in all types of content, you end up with various products like doors, hinges, and bikes that might have some text saying that "put some WD-40 on it to stop squeaks".
I'm guessing that most of the hits you get are irrelevant. Thus, you should either limit the scope of your initial query by default. For example, limit to searching only the title property. Or add a filter to exclude a category of documents (like manuals, price lists, etc.).
You could also consider submitting different queries from your frontend application. One narrowly scoped query that retrieves the refiners and another, broader query that returns the results.
I don't have a relevant data set to test on, but I believe the $top parameter might do what you want. See this link:
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#top-optional
That said, there are other approaches to solve your use case.
Normalize your data
I don't know how clean your data is. But, for any data set of this size, it's common that the manufacturer name is not consistent. For example, your manufacturer may be listed as
WD40 Company
WD-40 Company
WDFC
WD 40
WD-40 Inc.
...
Normalizing will greatly reduce the number of values in your refiners. It's probably not enough for your use case, but still worth doing.
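As a rough illustration, a minimal normalization pass might strip punctuation, spacing variants, and common legal suffixes before indexing the facet value (a curated synonym map would still be needed for abbreviations like "WDFC"):

```python
# Collapse common manufacturer-name variants to one facet key.
# SUFFIXES and the regex are illustrative, not exhaustive.
import re

SUFFIXES = {"inc", "company", "co", "corp", "ltd"}

def normalize_manufacturer(name):
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower().replace("-", " ")).split()
    tokens = [t for t in tokens if t not in SUFFIXES]
    return "".join(tokens)  # also collapses "WD 40" vs "WD40" spacing variants

variants = ["WD40 Company", "WD-40 Company", "WD 40", "WD-40 Inc."]
normalized = {normalize_manufacturer(v) for v in variants}
# normalized == {"wd40"}
```

You would store the normalized value in a separate facetable field and keep the original for display.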
Consider adding more refiners
When you have a refiner with too many options, it's always a good idea to consider adding more refiners with coarse values. For example a category, or perhaps a simple refiner that splits the results in two, like a "physical vs. digital" product as a first choice. Or consumer vs. professional, in stock vs. on back-order. This pattern lets users quickly reduce the result set without having to use the brand refiner.
Categorize your refiner with too many options
In your case, your manufacturer refiner contained too many options. I have seen examples where people add a search box within the refiner. Another option is to group your refiner options in buckets. For text values like a manufacturer, you could generate a new property with the first character of the manufacturer's name. That way you could present a refiner that lets users select a manufacturer from A-Z.
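A first-character bucket can be derived at indexing time; a small sketch (the "0-9" fallback for names not starting with a letter is an assumption):

```python
# Derive an extra facetable property: the first character of the
# manufacturer name, so users can narrow to A-Z before picking a brand.
def manufacturer_bucket(name):
    first = name.strip()[:1].upper()
    return first if first.isalpha() else "0-9"

bucket = manufacturer_bucket("WD-40 Company")
# bucket == "W"
```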

How to store results of an advanced search?

Currently I am trying to add new functionality to my system in which users will be able to see custom lists of products.
To create these lists I will most likely use an algorithm or some criteria to gather data from my database, or sometimes use hand-picked items.
I wonder what the best way to do that is, in terms of storage and computational time. I was thinking about using an object in my model (something like CustomList) that stores some attributes describing the list, which is basically filled with products that are the results of an advanced search in my database. It would store a query string or something like that, so the list can be reprocessed periodically, and if it's personalized for a specific user, run for every user that requests it.
Example of a query (in natural language): "Select all items that are cheaper than 15 dollars and are designed for the gender of the user X"
I don't know if there is a better way to do that. I wonder how Spotify builds their personalized and custom lists (like Discover Weekly, Running Musics, Sleepy Monday, et cetera).
Should I store a query string in an attribute of this model object? Should I do all of that without a model object (on the fly)? What are the best options? How do big companies do that?
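One way to realize the idea in the question, storing the list's criteria rather than its results and re-evaluating per user on demand, might look like this (all field names are illustrative):

```python
# A CustomList stores structured criteria, not a result set. The same
# spec can be re-run periodically or per requesting user.
CUSTOM_LIST = {"max_price": 15.0, "match_user_gender": True}

def evaluate_list(spec, products, user):
    results = [p for p in products if p["price"] < spec["max_price"]]
    if spec.get("match_user_gender"):
        results = [p for p in results if p["gender"] == user["gender"]]
    return results

products = [
    {"id": 1, "price": 9.99, "gender": "f"},
    {"id": 2, "price": 19.99, "gender": "f"},
    {"id": 3, "price": 12.50, "gender": "m"},
]
hits = evaluate_list(CUSTOM_LIST, products, {"gender": "f"})
# only product 1 is under $15 and matches the user's gender
```

Storing criteria keeps the lists fresh as the catalogue changes; caching the evaluated results with a TTL is a common optimization on top.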

Filtering Functionality Similar to Ebay SQL Count Issue

I am stuck on a database problem for a client; I'm wondering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site and the database is quite large (500,000-plus records). The scenario is this: a user goes to our website, types in 'laptop' and clicks search. This brings up the first page of several thousand results. What I want to do then is
filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the database are like this - the primary key is an integer ID
ID Title Manufacturer Colour
The key part of the question is how to get the counts in various categories in an efficient manner. The only way I currently know how to do it is with separate queries. However, should we wish to filter by further categories then this will become very slow - especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is whether I can get the results as a single table using some kind of join or union, and whether this would be faster than my current method of issuing multiple queries with COUNT(*). Thanks for your help; if you require any further information, please ask. PS: I am wondering how sites like eBay and Amazon manage to do this so fast. To understand my problem better, go to eBay and type in laptop: you will
see a number of filters on the left, which is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g. to get functionality equivalent to eBay I would need about 10 queries, and I'm sure that will be slow. I was thinking of creating an intermediate table with all the counts; however, it would have to be continuously updated to reflect changes to the database, which would be a problem with multiple updates per minute. Thanks.
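As an aside, the per-column counts produced by the two GROUP BY queries can also be computed in a single scan in application code, which is essentially what faceted-search engines do in memory; a sketch:

```python
# Count every facet column in one pass over the (already filtered)
# result rows, instead of one GROUP BY query per column.
from collections import Counter

def facet_counts(rows, facet_columns):
    counts = {col: Counter() for col in facet_columns}
    for row in rows:                      # a single scan covers all facets
        for col in facet_columns:
            counts[col][row[col]] += 1
    return counts

rows = [
    {"Manufacturer": "Dell", "Colour": "Black"},
    {"Manufacturer": "Acer", "Colour": "Black"},
    {"Manufacturer": "Dell", "Colour": "Silver"},
]
counts = facet_counts(rows, ["Manufacturer", "Colour"])
# counts["Manufacturer"]["Dell"] == 2
```

This only helps if the filtered row set fits in memory; for millions of rows, pre-aggregation (as the answer below suggests) is the usual approach.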
The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog thru stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search, though; if so, how will you manage multiple keywords that result in an unexpected intersection of different categories?
One way is to save the keywords searched along with the counts and a time stamp. The next time someone searches on the same keywords, check the intermediate table: if the time stamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new time stamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records.
In this case, you might get some quirky front-end count behavior: it might say "10 results" in a particular category, but when the user drills down they actually find 9 or 11. It's happened to me on different sites as a customer, and it's really not a big deal.
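The time-stamped caching scheme described above can be sketched like this (a dict stands in for the intermediate table):

```python
# Cache per-keyword facet counts; recompute only when the cached entry
# is older than the threshold. A dict stands in for the database table.
import time

TTL_SECONDS = 300  # 5 minutes, as suggested above
_cache = {}        # keyword -> (timestamp, counts)

def cached_counts(keyword, compute_counts, now=None):
    now = time.time() if now is None else now
    entry = _cache.get(keyword)
    if entry and now - entry[0] < TTL_SECONDS:
        return entry[1]                 # fresh enough: skip the expensive queries
    counts = compute_counts(keyword)    # expensive aggregation over products
    _cache[keyword] = (now, counts)
    return counts
```

The slightly-stale counts this produces are exactly the "10 results, actually 9 or 11" quirk mentioned above.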
BTW, I used to work for a well-known e-commerce company and we did things like this.

Indexes for google app engine data models

We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any such data model where combinations of fields are used to filter, sort, and search entities. Finally, the number of entities in such a data model can obviously run into the millions, considering that there are many cities and we could store hourly data instead of daily.
Can anyone recommend a way to model this data that still allows all these queries to run, while staying well under the 200-index limit? The write cost in this model is not as big a deal, but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on and sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
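To see why one index per field suffices: each single-property index yields entity keys in sorted order, and the datastore intersects those streams on the fly (a "zigzag" merge join). A miniature sketch of the intersection:

```python
# Intersect several sorted lists of entity keys, the way a merge join
# combines single-property equality indexes without a composite index.
def merge_join(*sorted_key_lists):
    result = []
    positions = [0] * len(sorted_key_lists)
    while all(p < len(l) for p, l in zip(positions, sorted_key_lists)):
        values = [l[p] for p, l in zip(positions, sorted_key_lists)]
        hi = max(values)
        if all(v == hi for v in values):
            result.append(hi)                      # key present in every index
            positions = [p + 1 for p in positions]
        else:
            # advance every cursor that is behind the current maximum
            positions = [p + (1 if l[p] < hi else 0)
                         for p, l in zip(positions, sorted_key_lists)]
    return result

keys = merge_join([1, 3, 5, 7], [3, 4, 5, 8], [2, 3, 5, 9])
# keys == [3, 5]
```

The real datastore does this over index scans rather than in-memory lists, but the cursor-advancing logic is the same idea.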
I know it seems counter-intuitive but you can use a full-text search system that supports categories (properties/whatever) to do something like this as long as you are primarily using equality filters. There are ways to get inequality filters to work but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.
