looking for ideas/alternatives to providing a page/item count/navigation of items matching a GAE datastore query - google-app-engine

I like the datastore simplicity, scalability and ease of use; and the enhancements found in the new ndb library are fabulous.
As I understand datastore best practices, one should not write code to provide item and/or page counts of matching query results when the number of items that match a query is large; because the only way to do this is to retrieve all the results which is resource intensive.
However, in many applications, including ours, it is a common desire to see a count of matching items and provide the user with the ability to navigate to a specific page of those results. The datastore paging issue is further complicated by the requirement to work around limitations of fetch(limit, offset=X) as outlined in the article Paging Through Large Datasets. To support the recommended approach, the data must include a uniquely valued column that can be ordered in the way the results are to be displayed. This column will define a starting value for each page of results; saving it, we can fetch the corresponding page efficiently, allowing navigation to a specific or next page as requested. Therefore, if you want to show results ordered in multiple ways, several such columns may need to be maintained.
It should be noted that as of SDK v1.3.1, Query Cursors are the recommended way to do datastore paging. They have some limitations, including lack of support for IN and != filter operators. Currently some of our important queries use IN, but we'll try writing them using OR for use with query cursors.
Following the suggested guidelines, a user could be given (Next) and (Prev) navigation buttons, as well as specific page buttons as navigation proceeds. For example, if the user pressed (Next) 3 times, the app could show the following buttons, remembering the unique starting record or cursor for each to keep the navigation efficient: (Prev) (Page-1) (Page-2) (Page-3) (Page-4) (Next).
Some have suggested keeping track of counts separately, but this approach isn't practical when users will be allowed to query on a rich set of fields that will vary the results returned.
I'm looking for insights on these issues in general and the following questions specifically:
What navigational options of query results do you provide in your datastore apps to work around these limitations?
If providing users with efficient result counts and page navigation of the entire query result set is a priority, should use of the datastore be abandoned in favor of the GAE MySQL solution now being offered?
Are there any upcoming changes in the Bigtable architecture or datastore implementation that will provide additional capability for counting results of a query efficiently?
Many thanks in advance for your help.

It all depends on how many results you typically get. E.g. by passing .count() a suitable limit you can provide an exact count if the #items is e.g. <= 100 and "many" if there are more. It sounds like you cannot pre-compute all possible counts, but at least you could cache them, thereby saving many datastore ops.
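A minimal sketch of the capped-count idea above. The names `count_label` and `make_fake_count` are hypothetical; `make_fake_count` stands in for a datastore call such as a query's `count()` with a limit, so the sketch runs without App Engine.

```python
def count_label(count_fn, cap=100):
    """Return an exact count up to `cap`, or "many" beyond it.

    `count_fn(limit)` must return at most `limit`, like a datastore
    count with a limit. Asking for cap + 1 distinguishes "exactly cap"
    from "more than cap" while still bounding the work done.
    """
    n = count_fn(cap + 1)
    return str(n) if n <= cap else "many"


def make_fake_count(total):
    """Hypothetical stand-in for a query with `total` matching items."""
    return lambda limit: min(total, limit)


print(count_label(make_fake_count(42)))    # 42
print(count_label(make_fake_count(100)))   # 100
print(count_label(make_fake_count(5000)))  # many
```

The label itself could then be cached per query, as suggested above, so that repeated views of the same result set do not repeat the count.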
Using NDB, the most efficient approach may either be to request the first page of entities using fetch_page(), and then using the resulting cursor as a starting point for a count() call; or alternatively, you may be better off running the fetch() of the first page and the count() concurrently using its async facilities. The second option may be your only choice if your query does not support cursors. Most IN / OR queries don't currently support cursors, but they do if you order by __key__.
In terms of UI options, I think it's sufficient to offer next and previous page options; the "Gooooooogle" UI that affords skipping ahead several pages is cute but I almost never use it myself. (To implement "previous page", reverse the order of the query and use the same cursor you used for the current page. I'm pretty sure this is guaranteed to work.)

Maybe just aim for this style of paging:
(first)(Prev)(Page1)(Page2)(Page3)....(Last)(next)
That way the total number is not required - your code only needs to know that there are enough results for another 3+ pages. With a page size of 10 items per page, you just need to know there are 30+ items.
If you have 60 items (enough for 6 pages) and you're already on page 4, your code would look forward and realise there are only another 20 records to go, so you could then show the last page number:
(first)(Prev)(Page4)(Page5)(Page6)(next)(last)
Basically, for each fetch for the current page, just fetch enough records for another 3 pages of data, count them to see how many more pages you actually have, then display your pager accordingly.
Also, if you just fetch the keys, it will be more efficient than fetching extra items.
hope that makes some sense!!?? :)
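The look-ahead pager described in this answer could be sketched roughly as follows. The function name `pager_pages` and its parameters are hypothetical, and the sketch assumes the caller has already counted the items (or keys) found after the current page, fetched with a limit of three pages plus one extra item.

```python
def pager_pages(current_page, page_size, items_ahead, lookahead_pages=3):
    """Work out which page buttons to show from the current page onward.

    items_ahead: number of items found after the current page, fetched
    with a limit of lookahead_pages * page_size + 1 (the extra item only
    signals that even more pages exist).
    Returns (page_numbers, last_page); last_page is None when the end of
    the result set has not been reached yet.
    """
    limit = lookahead_pages * page_size
    more_beyond = items_ahead > limit
    visible = min(items_ahead, limit)
    pages_ahead = (visible + page_size - 1) // page_size  # ceiling division
    page_numbers = list(range(current_page, current_page + pages_ahead + 1))
    last_page = None if more_beyond else current_page + pages_ahead
    return page_numbers, last_page


# 60 items, 10 per page, currently on page 4: 20 items remain.
print(pager_pages(4, 10, 20))   # ([4, 5, 6], 6)
# On page 1 the look-ahead hits its limit, so the last page is unknown.
print(pager_pages(1, 10, 31))   # ([1, 2, 3, 4], None)
```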

I notice that gmail is ready with some counts - it can tell you how many total emails you've received, and how many are in your inbox, etc - but on other counts, like full-text searches it says you're looking at "1-20 of many" or "1-20 of about 130". Do you really need to display counts for every query, or could you pre-calculate just the important ones?

Since the question was "looking for ideas/alternatives to providing a page", maybe the very simple alternative of fetching 10 pages worth of key_only items, then handling navigation through within this set is worth considering.
I have elaborated on this in answering a similar question, you will find sample code there :
Backward pagination with cursor is working but missing an item
The sample code would be more appropriate for this question. Here is a piece of it:
def session_list():
    page = request.args.get('page', 0, type=int)
    sessions_keys = Session.query().order(-Session.time_opened).fetch(100, keys_only=True)
    sessions_keys, paging = generic_list_paging(sessions_keys, page)
    # generic_list_paging will select the proper sublist.
    sessions = [ sk.get() for sk in sessions_keys ]
    return render_template('generic_list.html', objects=sessions, paging=paging)
See the referenced question for more code.
Of course, if the result set is potentially huge, some limit to the fetch must still be given, the hard limit being 1000 items I think. Obviously, if the result is more than some 10 pages long, the user will be asked to refine by adding criteria.
Dealing with paging within a few hundred keys_only items is really so much simpler that it's definitely worth considering. It makes it quite easy to provide direct page navigation as mentioned in the question. The actual entity items are only fetched for the current page; the rest is only keys, so it's not so costly. And you may consider keeping the keys_only result set in memcache for a few minutes so that a user quickly browsing through pages will not require the same query to be performed again.
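The actual `generic_list_paging` helper lives in the referenced question and is not reproduced here. As a rough guess at its shape, assuming a zero-based page number, a `page_size` parameter, and a `paging` dictionary whose keys are all hypothetical, it might look like this:

```python
def generic_list_paging(items, page, page_size=10):
    """Select the sublist for a zero-based `page` and describe the pager."""
    page_count = max(1, (len(items) + page_size - 1) // page_size)
    page = min(max(page, 0), page_count - 1)  # clamp out-of-range requests
    start = page * page_size
    paging = {
        "page": page,
        "page_count": page_count,
        "has_prev": page > 0,
        "has_next": page < page_count - 1,
    }
    return items[start:start + page_size], paging


keys = list(range(95))        # stand-in for up to 100 keys_only results
sublist, paging = generic_list_paging(keys, 9)
print(sublist)                # [90, 91, 92, 93, 94]
print(paging["page_count"])   # 10
```

Because the whole key list is in memory, jumping straight to any page is just a slice, which is what makes direct page navigation cheap with this approach.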

Related

Pinning documents to specific positions in Azure Cognitive Search

We have a business requirement for a retail site: on non-search product listing pages, the business wants to fix the position of certain products.
E.g.
Url - /nav/category-id
Current result - it displays all the products under that category-id
Say, P1,P3,P5,P7,P2,P4...
Requirement - the business wants to fix the first 3 positions with P4,P3,P2...
Rest of the products should maintain their original order.
What we are doing currently - we are giving a very high boost in descending order to the required pinned products and it seems to work for now.
Our concern - even with a very high boost, we think we can never guarantee the order because of conflicting boosts coming from the scoring profile.
E.g. the business might have pinned a new product, but our scoring profile is trying to boost products with good sales numbers.
So what would be the best way to achieve this?
When we tried fixing the positions externally by making 2 calls to Cognitive Search, one to get only the pinned products and a second to get the other products, we just opened a can of bugs: discrepancies in facet counts, wrong counts when paginating, and overall the solution became very complex.
Making two calls is a good approach here. One call should be responsible for regular results, pagination, refiners, counts, etc. The secondary call should just populate the first N places.
If this complicates your application, you have to change your implementation. From the frontend, you should have the possibility to have individual components that request any number of external services to populate components with content on the page.
The most performant would be to submit both requests in parallel. They should populate content on the page as responses come back. Alternatively, you could make two sequential requests in the backend and then prepare the result by injecting up to three boosted items. We have a similar use case where we present up to three top-selling products above the regular results. When the page loads, two simultaneous requests are submitted to search, and the frontend renders two separate components. You must decide which search response you render your refiners, result counts, and pagination from.
Sure, you could fiddle with boosting rules, but I agree with your concern. It cannot be guaranteed and will probably fail at some point.
Update: With two queries in parallel, your pinned products will be duplicated. I.e. the pinned products will also be part of the main query responsible for presenting results, refiners, and pagination. This ensures correct refiner counts, hit count, filtering, and pagination.
If repeating the pinned items is a concern, you can run two queries in sequence. In the first query, you get up to three pinned items. In the second query, you do not filter out these three items from your query (to ensure the counts are correct). Instead, you skip presenting the pinned items again.
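The sequential variant could be sketched as below. This is only an illustration of the merge step, not Azure Cognitive Search code: `merge_pinned`, the `id` field, and the `page` parameter are hypothetical, and both lists are assumed to have been fetched already.

```python
def merge_pinned(pinned, main_results, key=lambda item: item["id"], page=1):
    """Put pinned items first on page 1 and never show them twice.

    `pinned` is the response of the first query (up to N pinned items).
    `main_results` is one page of the unfiltered main query, which still
    drives facets, hit counts and pagination.
    """
    pinned_ids = {key(item) for item in pinned}
    rest = [item for item in main_results if key(item) not in pinned_ids]
    return (list(pinned) + rest) if page == 1 else rest


main = [{"id": p} for p in ["P1", "P3", "P5", "P7", "P2", "P4"]]
pinned = [{"id": p} for p in ["P4", "P3", "P2"]]
print([item["id"] for item in merge_pinned(pinned, main)])
# ['P4', 'P3', 'P2', 'P1', 'P5', 'P7']
```

One consequence of this design is that pages can differ slightly in length, since a pinned item is dropped from whichever page of the main results it naturally falls on.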

Solr - Get more next "cursorMark" indices to allow a real pagination

I'm developing an application whose database is managed by Solr v8.1.
I need to build a pagination system, and I have read that cursors are recommended for this kind of operation.
The problem is: if I want the pagination system to show the end user more than one next page, how can I do this?
Normally Solr returns only one nextCursorMark value, but what about the next 2, 3, 4 or more pages? Is this possible? Is it possible to get the same behaviour for previous cursors?
According to the documentation, fetching page after page with each next cursor seems to be mandatory, but I don't think that is a smart solution.
Thanks
Sounds like what you want is regular pagination if those are important features. CursorMarks are (very) useful for certain use cases, but might not give you any additional performance in your case.
You can still use cursorMarks, but a cursorMark won't tell you how far into a result set you've come or how many rows are left; the response only tells you how many rows there are in total (you can still keep track of your position manually in your UI). The cursorMark only tells Solr "this is the last entry I showed, so start returning values from here..". This is useful for deep pagination across a cluster with many nodes, as it greatly reduces the number of results that must be fetched from each node in the cluster.
If you decide to use a cursorMark, keep track of the current offset, the page size and the page number in your URL. You won't be able to let people skip directly to page X, but you can at least show how many results remain (this is the same strategy as applied by Gmail).
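A small sketch of the bookkeeping this implies, assuming the page number and page size travel in your own URL and the total comes from the `numFound` value in the Solr response. The function name `page_status` and the returned keys are hypothetical.

```python
def page_status(page_number, page_size, num_found):
    """Describe where a 1-based page sits within num_found results."""
    offset = (page_number - 1) * page_size
    shown = max(0, min(page_size, num_found - offset))
    if shown == 0:
        return {"first": 0, "last": 0, "remaining": 0}
    return {
        "first": offset + 1,
        "last": offset + shown,
        "remaining": num_found - offset - shown,
    }


status = page_status(page_number=3, page_size=20, num_found=130)
print("%(first)d-%(last)d shown, %(remaining)d remaining" % status)
# 41-60 shown, 70 remaining
```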

How to fetch thousands of data from database without getting slow down?

I want an autocomplete search option in a textbox, with the data fetched from a database. My database table has thousands of rows (almost 8,000-10,000). I know how to implement this, but fetching that many rows will take a lot of time. How can I achieve this without the slowdown? Should I use some approach other than a simple fetch? I am using Oracle SQL Developer for the database.
Besides the obvious solutions involving indexes and caching, if this is web technology and depending on your tool you can sometimes set a minimum length before the server call is made. Here is a jquery UI example: https://api.jqueryui.com/autocomplete/#option-minLength
"The minimum number of characters a user must type before a search is performed. Zero is useful for local data with just a few items, but a higher value should be used when a single character search could match a few thousand items."
It depends on your web interface, but you can use two techniques:
1. Paginate your data: if your requirements are to accept empty values and to show all the results, load them in blocks of a predefined size. Google, for example, paginates search results. On Oracle, pagination is done using the rownum pseudocolumn (see this response). Beware: you must first issue a query with an order by and then enclose it in a new one that uses rownum. Other databases that use the limit keyword behave differently. If you apply the pagination technique to a drop-down you end up with an infinite scroll (see this response for example).
2. Limit your data by imposing a filter that limits the number of rows returned; your search displays results only after the user has typed at least n characters in the field.
You can combine 1 & 2, but unless you find an existing web component (a jQuery one, for example) it may be a difficult task if you don't know JavaScript.
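The "order first, then apply rownum in an enclosing query" rule can be sketched as a small query builder. The function name, the bind names `:min_row` and `:max_row`, and the `products` table in the example are all hypothetical; the nesting itself is the commonly used Oracle pattern.

```python
def rownum_page_query(inner_sql, page, page_size):
    """Wrap an ordered query so Oracle returns one 1-based page of it.

    The ORDER BY must live in `inner_sql`: within a single query block
    ROWNUM is assigned before sorting, so it is applied in the outer
    queries, once the rows are already in order.
    """
    sql = (
        "SELECT * FROM ("
        " SELECT q.*, ROWNUM rn FROM (" + inner_sql + ") q"
        " WHERE ROWNUM <= :max_row"
        ") WHERE rn > :min_row"
    )
    binds = {"min_row": (page - 1) * page_size, "max_row": page * page_size}
    return sql, binds


sql, binds = rownum_page_query(
    "SELECT name FROM products ORDER BY name", page=3, page_size=20
)
print(sql)
print(binds)   # {'min_row': 40, 'max_row': 60}
```

The returned SQL and bind values would then be passed to whatever database driver you use. Oracle 12c and later also accept an OFFSET ... FETCH NEXT clause, which avoids the nesting.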

Choosing the right model for storing and querying data?

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models and I need help choosing the right one.
All the data is represented in two classes User.class and Word.class.
User: couple of string with user data (username, email.....)
Word: two strings
Which is better:
1. Search in 10 000 000 entities for the 100 I need. For instance, every Word entity has a string property owner and I query (owner = ‘John’).
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query in 1000 users for the one I need and then call a method like getWords() that returns a List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know:
Is your application making unnecessary RPC calls? Should it be caching
data instead of making repeated RPC calls to get the same data? Will
your application perform better if multiple requests are executed in
parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in app engine - CPU, datastore reads, datastore writes etc etc etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word class as a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000 I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort_order=sort_path+desc&rows=N&start:12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that solr ranges combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path:desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
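A rough sketch of how the front end might build these two requests. The function name `section_page_params` and the use of `fq` with a match-all `q` are assumptions. One deliberate change from the ranges above: the curly brace makes the bound exclusive, so the boundary section itself is not returned again.

```python
def section_page_params(sort_path, rows, direction="next"):
    """Build Solr params for the N sections after or before sort_path."""
    # Quote the value and escape backslashes and double quotes inside it.
    quoted = '"' + sort_path.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if direction == "next":
        fq, sort = "sort_path:{%s TO *]" % quoted, "sort_path asc"
    else:
        # Results come back nearest-first; reverse them before display.
        fq, sort = "sort_path:[* TO %s}" % quoted, "sort_path desc"
    return {"q": "*:*", "fq": fq, "sort": sort, "rows": rows}


print(section_page_params("/A/B/C", 10))
print(section_page_params("/A/B/C", 10, direction="prev"))
```

Because each request is anchored on a sort_path value rather than a row offset, documents added or removed elsewhere in the ordering do not shift the page the user sees next.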
This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get far fewer documents (and therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.
