Pinning documents to specific positions in Azure Cognitive Search

We have a business requirement for a retail site: on non-search product listing pages, the business wants to fix the positions of certain products.
E.g.
Url - /nav/category-id
Current result - it displays all the products under that category-id
Say, P1,P3,P5,P7,P2,P4...
Requirement - the business wants to fix the first 3 positions as P4, P3, P2...
Rest of the products should maintain their original order.
What we are doing currently - we give a very high boost, in descending order, to the pinned products, and it seems to work for now.
Our concern - even with a very high boost, we think we can never guarantee the order, because of conflicting boosts coming from the scoring profile.
E.g. the business might have pinned a new product, but our scoring profile is trying to boost products with good sales numbers.
So what would be the best way to achieve this?
When we tried fixing the positions externally by making 2 calls to Cognitive Search (one to get only the pinned products, then a second to get the other products), we just opened a can of bugs: discrepancies in the facet counts, wrong counts when paginating, and an overall solution that became very complex.

Making two calls is a good approach here. One call should be responsible for the regular results, pagination, refiners, counts, etc. The secondary call should just populate the first N places.
If this complicates your application, you may have to change your implementation. The frontend should support individual components that each request content from any number of external services to populate parts of the page.
The most performant option is to submit both requests in parallel and populate content on the page as the responses come back. Alternatively, you could make two sequential requests in the backend and then prepare the result by injecting up to three boosted items. We have a similar use case where we present up to three top-selling products above the regular results. When the page loads, two simultaneous requests are submitted to search, and the frontend renders two separate components. You must decide which search response you render your refiners, result counts, and pagination from.
Sure, you could fiddle with boosting rules, but I agree with your concern. It cannot be guaranteed and will probably fail at some point.
Update: With two queries in parallel, your pinned products will be duplicated. I.e. the pinned products will also be part of the main query responsible for presenting results, refiners, and pagination. This ensures correct refiner counts, hit count, filtering, and pagination.
If repeating the pinned items is a concern, you can run two queries in sequence. In the first query, you get up to three pinned items. In the second query, you do not filter out these three items from your query (to ensure the counts are correct). Instead, you skip presenting the pinned items again.
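For illustration, here is a minimal Python sketch of the sequential variant using the azure-search-documents SDK. The endpoint, index name, field names (id, categoryId, brand) and the source of pinned_ids are assumptions, not part of the answer above:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumed service endpoint, index name and API key.
client = SearchClient("https://<service>.search.windows.net",
                      "products-index",
                      AzureKeyCredential("<api-key>"))

def listing_page(category_id, pinned_ids, page, page_size=20):
    # Query 1: fetch the pinned products by id; order is restored manually.
    by_id = {}
    if pinned_ids:
        flt = " or ".join("id eq '%s'" % pid for pid in pinned_ids)
        by_id = {d["id"]: d for d in client.search(search_text="*", filter=flt)}
    pinned_docs = [by_id[pid] for pid in pinned_ids if pid in by_id]

    # Query 2: the regular query alone drives facets, counts and pagination.
    # Pinned items are NOT filtered out, so all counts stay correct; they
    # are merely skipped when rendering, to avoid showing them twice.
    results = client.search(search_text="*",
                            filter="categoryId eq '%s'" % category_id,
                            facets=["brand"],
                            include_total_count=True,
                            skip=page * page_size,
                            top=page_size)
    regular = [d for d in results if d["id"] not in by_id]
    return pinned_docs, regular, results.get_count(), results.get_facets()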

Related

Solr - Get more next "cursorMark" indices to allow a real pagination

I'm developing an application whose database is managed by Solr v8.1.
I need to create a pagination system, and I have read that cursors are advised for this type of operation.
The problem is: if I want to create a pagination system that shows the end user more than 1 next page, how can I do this?
Normally Solr returns only 1 nextCursor index, but what about the next 2/3/4 or more pages? Is this possible? Is the same behaviour possible for previous cursors?
Checking the documentation, it seems that continually fetching with the next cursor is mandatory, but I don't think this is a smart solution.
Thanks
Sounds like what you want is regular pagination if those are important features. CursorMarks are (very) useful for certain use cases, but might not give you any additional performance in your case.
You can however use cursorMarks, but a cursorMark won't tell you how far into a result set you've come or how many rows are left, only how many rows there are in total (you can still keep track of your position manually in your UI). The cursorMark only tells Solr "this is the last entry I showed, so start returning values from here". This is useful for deep pagination across a cluster with many nodes, as it greatly reduces the number of results that have to be fetched from each node in the cluster.
If you decide to use a cursorMark, keep track of the current offset, the page size and the page number in your URL. You won't be able to let people skip directly to page X, but you can at least show how many results remain (this is the same strategy as applied by Gmail).
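A minimal sketch of that bookkeeping in Python, assuming plain HTTP calls to a local Solr core named mycore (the core name, field names and page size are placeholders):

import requests

SOLR = "http://localhost:8983/solr/mycore/select"  # assumed core name

def fetch_page(cursor="*", rows=10):
    # The sort must be deterministic and include the uniqueKey (here: id)
    # as a tie-breaker, otherwise Solr rejects the cursorMark.
    params = {"q": "*:*", "sort": "id asc", "rows": rows,
              "cursorMark": cursor}
    data = requests.get(SOLR, params=params).json()
    # numFound gives the total; page number and offset are tracked by you.
    return data["response"]["docs"], data["nextCursorMark"]

# Keep every cursor you have seen, so "previous page" can replay one.
cursors = ["*"]
docs, nxt = fetch_page(cursors[-1])
cursors.append(nxt)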

How to fetch thousands of rows from a database without slowing down?

I want an auto-search option in a textbox, with the data fetched from the database. I have thousands of rows in my database table (almost 8-10,000 rows). I know how to achieve this, but since I am fetching thousands of rows it will take a lot of time. How can I achieve this without slowing down? Should I follow some methodology other than simple fetching? I am using Oracle SQL Developer for the database.
Besides the obvious solutions involving indexes and caching: if this is web technology then, depending on your tool, you can sometimes set a minimum length before the server call is made. Here is a jQuery UI example: https://api.jqueryui.com/autocomplete/#option-minLength
"The minimum number of characters a user must type before a search is performed. Zero is useful for local data with just a few items, but a higher value should be used when a single character search could match a few thousand items."
It depends on your web interface, but you can use two techniques:
1. Paginate your data: if your requirements are to accept empty values and to show all the results, load them in blocks of a predefined size. Google, for example, paginates search results. On Oracle, pagination is done using the ROWNUM pseudo-column (see this response, and the sketch after this list). Beware: you must first issue a query with an ORDER BY and then enclose it in a new one that uses ROWNUM. Other databases, which use the LIMIT keyword, behave differently. If you apply the pagination technique to a drop-down, you end up with an infinite scroll (see this response for example).
2. Limit your data by imposing a filter that caps the number of rows returned; your search displays results only after the user has typed at least n characters in the field.
You can combine 1 & 2, but unless you find an existing web component (a jQuery one, for example) it may be a difficult task if you don't have JavaScript knowledge.
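As a rough illustration of technique 1, here is the nested ROWNUM pattern driven from Python; the driver choice and the table and column names are assumptions:

import cx_Oracle  # assumed driver; any DB-API driver works the same way

PAGE_SQL = """
SELECT * FROM (
    SELECT t.*, ROWNUM rn FROM (
        -- The ORDER BY must live in the innermost query: ROWNUM is
        -- assigned before sorting within a single query block.
        SELECT id, name FROM products WHERE name LIKE :prefix ORDER BY name
    ) t WHERE ROWNUM <= :last_row
) WHERE rn > :first_row
"""

def fetch_page(conn, prefix, page, page_size=20):
    cur = conn.cursor()
    cur.execute(PAGE_SQL, prefix=prefix + '%',
                last_row=(page + 1) * page_size,
                first_row=page * page_size)
    return cur.fetchall()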

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

I've been trying to solve this problem for a week and couldn't come up with any solutions in all my research, so I thought I'd ask you all.
I have a "Product" table and a "productSent" table; here's a quick schema to help explain:
class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

class productSent(ndb.Model):
    # the key name here is md5(Product Key + UUID)
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest rated product that they've never seen before--fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database--I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that once the user has blown past the first 1,000 products, query performance really starts to slow down, because I'm literally pulling 1,000+ results, checking whether each has been sent by querying against the productSent table (doing a key-name lookup to speed things up), and looping until 15 new ones have been found.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The growing index size will increase storage costs
Has anyone dealt with this problem before? (I'm sure someone has!) Any tips on the best way to structure it?
Update
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative, and sort by that. Of course, items that have a rating of 0 and get a 'like' (making them positive) would still have a chance of being out of order or skipped by the cursor, but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent you could speed the query up even more, since you will only have to query against one Model.
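A rough ndb sketch of that idea; the model and helper names are mine, with the rating denormalised onto the join entity as suggested above:

class ProductNotSent(ndb.Model):
    uuId = ndb.KeyProperty(kind=userData)
    pId = ndb.KeyProperty(kind=Product)
    rating = ndb.IntegerProperty()  # denormalised copy of Product.rating

def top_unseen(user_key, n=15):
    # One single-model query: no join against productSent, and no
    # client-side loop over 1,000+ results. Needs a composite index
    # on (uuId, -rating).
    qry = ProductNotSent.query(ProductNotSent.uuId == user_key) \
                        .order(-ProductNotSent.rating)
    return qry.fetch(n)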
Another idea would be to limit the number of productNotSent entities per user. So each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You also may want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (only did a quick perusal of your question), but you may consider using Json TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dump()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer rec updates, etc.
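A quick sketch of that approach, with an assumed model name and helper functions:

import json
from google.appengine.ext import ndb

class UserProductBlob(ndb.Model):  # assumed model name
    # JSON-encoded list of lightweight items, e.g. [{"id": 123, "r": 9}];
    # keep entries small to stay clear of entity size and memory limits.
    payload = ndb.TextProperty()

def save_blob(user_id, items):
    UserProductBlob(id=user_id, payload=json.dumps(items)).put()

def load_blob(user_id):
    rec = UserProductBlob.get_by_id(user_id)  # a key lookup, not a query
    return json.loads(rec.payload) if rec else []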
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach will, though, be a big, big quantum measure lower in cost than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your Json objects are fairly simple and lightweight to avoid this (e.g. do not include product descriptions, sub-objects, etc. in the Json item, just the basics such as product number). HTH, -stevep

looking for ideas/alternatives to providing a page/item count/navigation of items matching a GAE datastore query

I like the datastore simplicity, scalability and ease of use; and the enhancements found in the new ndb library are fabulous.
As I understand datastore best practices, one should not write code to provide item and/or page counts of matching query results when the number of items that match a query is large; because the only way to do this is to retrieve all the results which is resource intensive.
However, in many applications, including ours, it is a common desire to see a count of matching items and provide the user with the ability to navigate to a specific page of those results. The datastore paging issue is further complicated by the requirement to work around limitations of fetch(limit, offset=X) as outlined in the article Paging Through Large Datasets. To support the recommended approach, the data must include a uniquely valued column that can be ordered in the way the results are to be displayed. This column will define a starting value for each page of results; saving it, we can fetch the corresponding page efficiently, allowing navigation to a specific or next page as requested. Therefore, if you want to show results ordered in multiple ways, several such columns may need to be maintained.
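A minimal ndb illustration of that technique, assuming a Product model with a uniquely valued sort_name property serving as the ordering column:

def next_page(last_sort_name, page_size=20):
    # Filter past the last value shown instead of using offset; the cost
    # stays constant no matter how deep the user pages.
    qry = Product.query(Product.sort_name > last_sort_name) \
                 .order(Product.sort_name)
    return qry.fetch(page_size)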
It should be noted that as of SDK v1.3.1, Query Cursors are the recommended way to do datastore paging. They have some limitations, including lack of support for IN and != filter operators. Currently some of our important queries use IN, but we'll try writing them using OR for use with query cursors.
Following the guidelines suggested, a user could be given a (Next) and (Prev) navigation buttons, as well as specific page buttons as navigation proceeded. For example if the user pressed (Next) 3 times, the app could show the following buttons, remembering the unique starting record or cursor for each to keep the navigation efficient: (Prev) (Page-1) (Page-2) (Page-3) (Page-4) (Next).
Some have suggested keeping track of counts separately, but this approach isn't practical when users will be allowed to query on a rich set of fields that will vary the results returned.
I'm looking for insights on these issues in general and the following questions specifically:
What navigational options of query results do you provide in your datastore apps to work around these limitations?
If providing users with efficient result counts and page navigation of the entire query result set is a priority, should use of the datastore be abandoned in favor of the GAE MySQL solution now being offered?
Are there any upcoming changes in the Bigtable architecture or datastore implementation that will provide additional capability for counting the results of a query efficiently?
Many thanks in advance for your help.
It all depends on how many results you typically get. E.g., by passing .count() a suitable limit, you can provide an exact count when the number of items is, say, <= 100, and show "many" if there are more. It sounds like you cannot pre-compute all possible counts, but you could at least cache them, thereby saving many datastore ops.
Using NDB, the most efficient approach may either be to request the first page of entities using fetch_page(), and then using the resulting cursor as a starting point for a count() call; or alternatively, you may be better off running the fetch() of the first page and the count() concurrently using its async facilities. The second option may be your only choice if your query does not support cursors. Most IN / OR queries don't currently support cursors, but they do if you order by __key__.
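Sketching the concurrent variant, with an assumed Product model and page size:

qry = Product.query().order(-Product.rating)

# Kick off both RPCs before blocking on either result.
page_future = qry.fetch_page_async(20)
count_future = qry.count_async(limit=1000)  # exact only up to 1000

items, cursor, more = page_future.get_result()
total = count_future.get_result()
label = str(total) if total < 1000 else "many"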
In terms of UI options, I think it's sufficient to offer next and previous page options; the "Gooooooogle" UI that affords skipping ahead several pages is cute but I almost never use it myself. (To implement "previous page", reverse the order of the query and use the same cursor you used for the current page. I'm pretty sure this is guaranteed to work.)
Maybe just aim for this style of paging:
(first)(Prev)(Page1)(Page2)(Page3)....(Last)(next)
That way the total number is not required - you only need your code to know that there are enough results for another 3+ pages. With a page size of 10 items, you just need to know there are 30+ items.
If you have 60 items (enough for 6 pages) and you're already on page 4, your code would look forward and realise there are only another 20 records to go, so you could then show the last page number:
(first)(Prev)(Page4)(Page5)(Page6)(next)(last)
Basically, for each fetch of the current page, just fetch enough records for another 3 pages of data, count them to see how many more pages you actually have, then display your pager accordingly.
Also, if you just fetch the keys, it will be more efficient than fetching the full items.
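One way to sketch that look-ahead in ndb (the query, cursor handling and page size are assumed):

PAGE_SIZE = 10

def pages_ahead(qry, cursor):
    # Fetch keys only, for up to 3 more pages; much cheaper than entities.
    keys, _, more = qry.fetch_page(PAGE_SIZE * 3, start_cursor=cursor,
                                   keys_only=True)
    extra = (len(keys) + PAGE_SIZE - 1) // PAGE_SIZE  # full/partial pages
    return extra, more  # more == True means results beyond these 3 pages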
hope that makes some sense!!?? :)
I notice that gmail is ready with some counts - it can tell you how many total emails you've received, and how many are in your inbox, etc - but on other counts, like full-text searches it says you're looking at "1-20 of many" or "1-20 of about 130". Do you really need to display counts for every query, or could you pre-calculate just the important ones?
Since the question was "looking for ideas/alternatives to providing a page", maybe the very simple alternative of fetching 10 pages' worth of keys_only items, then handling navigation within this set, is worth considering.
I have elaborated on this in answering a similar question; you will find sample code there:
Backward pagination with cursor is working but missing an item
The sample code would be more appropriate for this question. Here is a piece of it:
def session_list():
    # Flask view: 'request' and 'render_template' come from Flask.
    page = request.args.get('page', 0, type=int)
    # Fetch up to 100 keys only - much cheaper than full entities.
    sessions_keys = Session.query().order(-Session.time_opened).fetch(100, keys_only=True)
    # generic_list_paging will select the proper sublist.
    sessions_keys, paging = generic_list_paging(sessions_keys, page)
    # Only the entities for the current page are actually fetched.
    sessions = [sk.get() for sk in sessions_keys]
    return render_template('generic_list.html', objects=sessions, paging=paging)
See the referenced question for more code.
Of course, if the result set is potentially huge, some limit on the fetch must still be given; the hard limit is 1000 items, I think. Obviously, if the result is more than some 10 pages long, the user will be asked to refine the query by adding criteria.
Dealing with paging within a few hundred keys_only items is really so much simpler that it's definitely worth considering. It makes it quite easy to provide direct page navigation, as mentioned in the question. The actual entity items are only fetched for the current page; the rest are only keys, so it's not so costly. And you might consider keeping the keys_only result set in memcache for a few minutes, so that a user quickly browsing through pages will not require the same query to be performed again.

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort=sort_path+desc&rows=N&start=12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that Solr range queries combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
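A quick sketch of that range-plus-sort approach from Python; the core name is assumed, and a curly brace makes the bound exclusive so the anchor section itself is not repeated:

import requests

SOLR = "http://localhost:8983/solr/sections/select"  # assumed core name

def sections_after(path, n):
    params = {"q": "*:*",
              "fq": 'sort_path:{"%s" TO *]' % path,  # exclusive lower bound
              "sort": "sort_path asc", "rows": n}
    return requests.get(SOLR, params=params).json()["response"]["docs"]

def sections_before(path, n):
    params = {"q": "*:*",
              "fq": 'sort_path:[* TO "%s"}' % path,  # exclusive upper bound
              "sort": "sort_path desc", "rows": n}
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    return list(reversed(docs))  # restore ascending order for display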
This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot less documents (therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.
