How to fetch thousands of rows from a database without slowing down?

I want an auto-search option in a textbox, with the suggestions fetched from the database. My database table holds thousands of rows (roughly 8,000-10,000). I know how to implement the basic version, but since I am fetching thousands of rows, it takes a lot of time. How can I achieve this without the search slowing down? Should I follow some methodology other than simple fetching? I am using Oracle SQL Developer for the database.

Besides the obvious solutions involving indexes and caching: if this is web technology, and depending on your tool, you can often set a minimum input length before the server call is made. Here is a jQuery UI example: https://api.jqueryui.com/autocomplete/#option-minLength
"The minimum number of characters a user must type before a search is performed. Zero is useful for local data with just a few items, but a higher value should be used when a single character search could match a few thousand items."

It depends on your web interface, but you can use two techniques (a combined sketch follows this list):
1. Paginate your data: if your requirements are to accept empty values and to show all the results, load them in blocks of a predefined size. Google, for example, paginates search results. On Oracle, pagination is done with the ROWNUM pseudocolumn (see this response). Beware: you must first issue a query with an ORDER BY and then enclose it in an outer query that applies ROWNUM. Databases that use the LIMIT keyword behave differently. If you apply the pagination technique to a drop-down, you end up with an infinite scroll (see this response for an example).
2. Limit your data by imposing a filter that bounds the number of rows returned; your search displays results only after the user has typed at least n characters in the field.
You can combine 1 & 2, but unless you find an existing web component (a jQuery one, for example) it may be a difficult task if you don't have JavaScript knowledge.
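Here is a minimal server-side sketch combining both ideas. It assumes the python-oracledb DB-API driver and a hypothetical items table with a name column; adjust the names to your schema:

import oracledb

PAGE_SIZE = 20
MIN_PREFIX_LEN = 3   # technique 2: don't hit the database below this length

def search(conn, prefix, page=0):
    if len(prefix) < MIN_PREFIX_LEN:
        return []
    first_row = page * PAGE_SIZE
    last_row = first_row + PAGE_SIZE
    # Technique 1: classic Oracle ROWNUM pagination. The innermost query is
    # ordered first, then ROWNUM is applied by the enclosing queries.
    sql = """
        SELECT id, name FROM (
            SELECT t.*, ROWNUM rn FROM (
                SELECT id, name FROM items
                WHERE name LIKE :prefix || '%'
                ORDER BY name
            ) t
            WHERE ROWNUM <= :last_row
        )
        WHERE rn > :first_row
    """
    with conn.cursor() as cur:
        cur.execute(sql, prefix=prefix, last_row=last_row, first_row=first_row)
        return cur.fetchall()

With an index on name, the LIKE 'prefix%' predicate plus the ROWNUM stop-key keeps each call cheap regardless of table size.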

Related

Solr - Get more next "cursorMark" indices to allow a real pagination

I'm developing an application whose database is managed by Solr v8.1.
I need to create a pagination system, and I have read that cursors are advised for this type of operation.
The problem is: if I want to create a pagination system that shows the end user more than one next page, how can I do this?
Normally Solr returns only one nextCursorMark index, but what about the next 2/3/4 or more pages? Is this possible? Is the same behaviour possible for previous cursors?
Checking the documentation, it seems that continuing to fetch using the next cursor is mandatory, but I don't think that is a smart solution.
Thanks
Sounds like what you want is regular pagination if those are important features. CursorMarks are (very) useful for certain use cases, but might not give you any additional performance in your case.
You can however use cursorMarks, but a cursorMark won't tell you how far into a result set you've come or how many rows are left - only how many rows there are in total (you can still keep track of your position manually in your UI). The cursorMark only tells Solr "this is the last entry I showed, so start returning values from here..". This is useful for deep pagination across a cluster with many nodes, as it greatly reduces the number of results that have to be fetched from each node in the cluster.
If you decide to use a cursorMark, keep track of the current offset, the page size and the page number in your URL. You won't be able to let people skip directly to page X, but you can at least show how many results that remain (this is the same strategy as applied by Gmail).
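For reference, a minimal cursorMark loop - a sketch, assuming a core named mycore with uniqueKey field id, and the requests library; the page counter lives entirely client-side:

import requests

SOLR_URL = "http://localhost:8983/solr/mycore/select"

def iterate_pages(query, page_size=10):
    cursor = "*"                           # "*" starts a fresh cursor
    page = 1
    while True:
        resp = requests.get(SOLR_URL, params={
            "q": query,
            "rows": page_size,
            "sort": "score desc, id asc",  # sort must end on the uniqueKey
            "cursorMark": cursor,
        }).json()
        yield page, resp["response"]["docs"], resp["response"]["numFound"]
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:          # unchanged cursor: no more rows
            return
        cursor, page = next_cursor, page + 1

Storing each page's cursor as you go gives you "previous page" for free: re-issue the stored cursor instead of walking forward again.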

CouchBase view get for multiple ranges

I'm evaluating Couchbase for an application and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? I.e. I want to retrieve items with view keys 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, in Couchbase there is no way to get values from multiple ranges with one view query. Maybe there are (or will be) some features in Couchbase N1QL that can do it, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and is designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
There is also another way (sketched below):
1. Determine the minimum and maximum values of your ranges (0..81902 in your example).
2. Query a view that returns only document ids and the value the range is based on, without including all docs in the result.
3. On the client side, filter the array of results from the previous step according to your ranges (0-10, 50-100, 5238-81902), and then use getMulti with the document ids left in the array.
I don't know your data structure, so try both ways, test them, and choose the one that best fits your needs.
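A minimal sketch of steps 1-3, assuming the legacy Couchbase Python SDK 2.x view API and a hypothetical design document mydesign with a by_value view that emits the numeric value as its key:

from couchbase.bucket import Bucket

RANGES = [(0, 10), (50, 100), (5238, 81902)]

def fetch_ranges(bucket, ranges):
    lo = min(a for a, _ in ranges)        # step 1: overall minimum (0)
    hi = max(b for _, b in ranges)        # step 1: overall maximum (81902)
    # Step 2: one wide view query; rows carry only key and docid, no docs.
    rows = bucket.query("mydesign", "by_value", mapkey_range=(lo, hi))
    # Step 3: client-side filtering, then one batched multi-get.
    wanted = [row.docid for row in rows
              if any(a <= row.key <= b for a, b in ranges)]
    return bucket.get_multi(wanted)

Whether this beats 100 small range queries depends on how sparse your ranges are within the overall min..max span.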

How to efficiently check if a result set changed and serve it to a web application for syndication

Here is the scenario:
I am handling a SQL Server database with a stored procedure which returns headers for web feed items (RSS/Atom) that I serve through a web application.
This stored procedure, when called by a Service Broker task running at a given interval, should verify whether there has been a significant change in the underlying data - if so, it triggers the resource-intensive activity of formatting the feed item header through a call to the web application, which retrieves the data, formats it, and returns it to the SQL database.
There the header is stored, ready for an RSS feed update request from the client.
Now, trying to design this to be as efficient as possible, I still have a couple of decision points I'd like your suggestions on.
My tentative approach in the stored procedure would be (a sketch of the hash check follows this list):
1. gather the data in an in-memory table,
2. create a subquery with the signature columns which change with the information,
3. convert them to XML with FOR XML AUTO,
4. hash the result with MD5 (with HASHBYTES or fn_repl_hash_binary depending on the size of the result),
5. verify whether the hash matches the one stored in the table where I keep the HTML waiting for feed requests,
6. if the hash matches, do nothing; otherwise proceed with the updates.
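A minimal sketch of steps 2-6 via pyodbc, assuming a hypothetical feed_items source and a feed_cache table holding the last hash per feed. Note that HASHBYTES caps its input at 8000 bytes before SQL Server 2016, hence the fn_repl_hash_binary fallback mentioned above:

import pyodbc

SIGNATURE_SQL = """
SELECT HASHBYTES('MD5', CAST((
    SELECT id, title, updated_at        -- the "signature" columns
    FROM feed_items
    FOR XML AUTO
) AS NVARCHAR(MAX))) AS current_hash;
"""

def feed_changed(conn, feed_id):
    cur = conn.cursor()
    current = cur.execute(SIGNATURE_SQL).fetchone().current_hash
    row = cur.execute(
        "SELECT content_hash FROM feed_cache WHERE feed_id = ?", feed_id
    ).fetchone()
    return row is None or row.content_hash != current

A cheaper, XML-free alternative for the packing question below is to aggregate per-row checksums, e.g. SELECT CHECKSUM_AGG(BINARY_CHECKSUM(id, title, updated_at)) FROM feed_items, at the cost of a weaker (collision-prone, order-insensitive) signature.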
My first doubt is the best way to check whether the base data have changed.
Converting to XML inflates the data significantly - which slows hashing - and I am not using the XML for anything besides hashing: is there a better way to perform the check, or to pack all the data together for hashing (something CSV-like)?
The query merges and aggregates data from multiple tables, so I cannot rely on table timestamps: a change in those tables is not necessarily a change in the result set.
The second point is: what is the best way to serve the data to the web app for reformatting?
- I might push the data through a CLR function to the web application to get the data formatted, but this is synchronous and, for multiple feed items, would create an unsustainable delay.
- Or I might save the result set instead and trigger multiple asynchronous calls through Service Broker. The web app could then retrieve the stored data instead of re-running the expensive query that produced it.
Since I have different formats depending on the feed item category, I cannot use the same table layout for all of them - so storing to a table is going to be hard. I might serialize to XML instead.
But would this provide any significant gain compared to re-running the query?
For the efficient caching bit, have a look at query notifications. The tricky bit in implementing this in your case is that you've stated "significant change", whereas query notifications trigger on any change. But the basic idea is that your application subscribes to a query; when the results of that query change, a message is sent to the application and it does whatever it is programmed to do (typically refreshing cached data).
As for serving the data to your app, there's a saying in the business: "don't go borrowing trouble". Which is to say, if the default method of serving data (i.e. a result set without fancy formatting) isn't causing you a problem, don't change it. Change it only if and when it causes a significant enough headache that your time is best spent there.

looking for ideas/alternatives to providing a page/item count/navigation of items matching a GAE datastore query

I like the datastore simplicity, scalability and ease of use; and the enhancements found in the new ndb library are fabulous.
As I understand datastore best practices, one should not write code to provide item and/or page counts of matching query results when the number of items that match a query is large; because the only way to do this is to retrieve all the results which is resource intensive.
However, in many applications, including ours, it is a common desire to see a count of matching items and provide the user with the ability to navigate to a specific page of those results. The datastore paging issue is further complicated by the requirement to work around limitations of fetch(limit, offset=X) as outlined in the article Paging Through Large Datasets. To support the recommended approach, the data must include a uniquely valued column that can be ordered in the way the results are to be displayed. This column will define a starting value for each page of results; saving it, we can fetch the corresponding page efficiently, allowing navigation to a specific or next page as requested. Therefore, if you want to show results ordered in multiple ways, several such columns may need to be maintained.
It should be noted that as of SDK v1.3.1, Query Cursors are the recommended way to do datastore paging. They have some limitations, including lack of support for IN and != filter operators. Currently some of our important queries use IN, but we'll try writing them using OR for use with query cursors.
Following the guidelines suggested, a user could be given a (Next) and (Prev) navigation buttons, as well as specific page buttons as navigation proceeded. For example if the user pressed (Next) 3 times, the app could show the following buttons, remembering the unique starting record or cursor for each to keep the navigation efficient: (Prev) (Page-1) (Page-2) (Page-3) (Page-4) (Next).
Some have suggested keeping track of counts separately, but this approach isn't practical when users will be allowed to query on a rich set of fields that will vary the results returned.
I'm looking for insights on these issues in general, and on the following questions specifically:
1. What navigational options over query results do you provide in your datastore apps to work around these limitations?
2. If providing users with efficient result counts and page navigation of the entire query result set is a priority, should the datastore be abandoned in favor of the GAE MySQL solution now being offered?
3. Are there any upcoming changes in the BigTable architecture or datastore implementation that will provide additional capability for counting the results of a query efficiently?
Many thanks in advance for your help.
It all depends on how many results you typically get. E.g., by passing .count() a suitable limit you can provide an exact count when the number of items is, say, <= 100, and "many" when there are more. It sounds like you cannot pre-compute all possible counts, but you could at least cache them, saving many datastore ops.
Using NDB, the most efficient approach may either be to request the first page of entities using fetch_page(), and then using the resulting cursor as a starting point for a count() call; or alternatively, you may be better off running the fetch() of the first page and the count() concurrently using its async facilities. The second option may be your only choice if your query does not support cursors. Most IN / OR queries don't currently support cursors, but they do if you order by __key__.
In terms of UI options, I think it's sufficient to offer next and previous page options; the "Gooooooogle" UI that affords skipping ahead several pages is cute but I almost never use it myself. (To implement "previous page", reverse the order of the query and use the same cursor you used for the current page. I'm pretty sure this is guaranteed to work.)
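A minimal sketch of that concurrent fetch/count approach, assuming a hypothetical ndb model Item and a cap beyond which the UI just says "many":

from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

PAGE_SIZE = 20
COUNT_CAP = 100    # show an exact count up to here, "many" beyond it

def first_page_with_count(query):
    # Kick off both RPCs before blocking on either result.
    page_future = query.fetch_page_async(PAGE_SIZE)
    count_future = query.count_async(limit=COUNT_CAP + 1)
    items, cursor, more = page_future.get_result()
    n = count_future.get_result()
    label = "many" if n > COUNT_CAP else str(n)
    return items, cursor, more, label

Usage: items, cursor, more, label = first_page_with_count(Item.query().order(Item.name)); keep the returned cursor for the "next page" link.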
Maybe just aim for this style of paging:
(first)(Prev)(Page1)(Page2)(Page3)....(Last)(next)
That way the total number is not required - your code only needs to know that there are enough results for another 3+ pages. With a page size of 10 items per page, you just need to know there are 30+ items.
If you have 60 items (enough for 6 pages) and you're already on page 4, your code would look ahead, realise there are only another 20 records to go, and could then show the last page number:
(first)(Prev)(Page4)(Page5)(Page6)(next)(last)
Basically, for each fetch of the current page, just fetch enough records for another 3 pages of data, count them to see how many more pages you actually have, then display your pager accordingly (a sketch follows).
Also, if you fetch just the keys, it will be more efficient than fetching the extra items in full.
Hope that makes some sense! :)
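A minimal sketch of that look-ahead, assuming an ndb query and keys-only fetching for the pages you are merely counting:

PAGE_SIZE = 10
LOOKAHEAD_PAGES = 3

def pager_info(query, cursor=None):
    # Fetch keys for the current page plus three more pages' worth.
    keys, next_cursor, more = query.fetch_page(
        PAGE_SIZE * (1 + LOOKAHEAD_PAGES),
        start_cursor=cursor,
        keys_only=True)
    page_keys = keys[:PAGE_SIZE]             # only these become entities
    lookahead = len(keys) - len(page_keys)   # how far ahead we could see
    extra_pages = (lookahead + PAGE_SIZE - 1) // PAGE_SIZE
    reached_end = not more                   # True: safe to show (last)
    return page_keys, extra_pages, reached_end, next_cursor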
I notice that Gmail is ready with some counts - it can tell you how many total emails you've received, how many are in your inbox, etc. - but on other counts, like full-text searches, it says you're looking at "1-20 of many" or "1-20 of about 130". Do you really need to display counts for every query, or could you pre-calculate just the important ones?
Since the question was "looking for ideas/alternatives to providing a page", maybe the very simple alternative of fetching 10 pages' worth of keys-only items and handling navigation within this set is worth considering.
I have elaborated on this in an answer to a similar question, where you will find sample code:
Backward pagination with cursor is working but missing an item
The sample code would be more appropriate for this question. Here is a piece of it:
def session_list():
    # Flask handler: the requested page number comes from the query string.
    page = request.args.get('page', 0, type=int)
    # Fetch up to 100 keys only - much cheaper than fetching full entities.
    sessions_keys = Session.query().order(-Session.time_opened).fetch(
        100, keys_only=True)
    # generic_list_paging will select the proper sublist for the current page.
    sessions_keys, paging = generic_list_paging(sessions_keys, page)
    # Materialize entities for the current page only.
    sessions = [sk.get() for sk in sessions_keys]
    return render_template('generic_list.html', objects=sessions, paging=paging)
See the referenced question for more code.
Of course, if the result set is potentially huge, some limit on the fetch must still be set; the hard limit used to be 1000 items, I think. Obviously, if the result is more than some 10 pages long, the user should be asked to refine the query by adding criteria.
Dealing with paging within a few hundred keys-only items is really so much simpler that it's definitely worth considering. It makes it quite easy to provide direct page navigation as mentioned in the question. The actual entity items are only fetched for the current page; the rest are only keys, so it's not so costly. And you might consider keeping the keys-only result set in memcache for a few minutes, so that a user quickly browsing through pages will not trigger the same query again (a sketch follows).
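A minimal sketch of that memcache idea, reusing the Session model from the sample above; the cache key is assumed to encode the query:

from google.appengine.api import memcache

def cached_session_keys():
    keys = memcache.get('session_keys')
    if keys is None:
        keys = Session.query().order(-Session.time_opened).fetch(
            100, keys_only=True)
        memcache.set('session_keys', keys, time=300)  # 5-minute TTL
    return keys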

Paging of Query Result Set

Greetings Overflowers,
I'm wondering if there is a way to query some kind of database and fetch only a certain window of the full result set, without having to actually go through all of it.
For example, if I query my database and want only results number 100 to 200, would the database fetch all the results (say 0 to 1000) that match my query and later filter out anything outside my specified window frame?
Actually, I'm working on a full-text search problem (not really relational DB stuff).
So how about Google and other search engines - do they get the full result and then filter, or do they have direct access to only the needed window frame?
Thank you all!
Your question is probably best answered in two parts.
For a traditional relational database, the query you execute contains a number of WHERE clauses, which cause the database engine to limit the number of results it returns. So if you specify a WHERE clause that limits between two values of the primary key,
SELECT * FROM table WHERE id > 99 AND id < 201;
you'll get what you're asking for.
For a search engine, the query you make will always be paginated - using various techniques, the results are pre-split into pages and a few are cached; other pages are generated on demand. So if you want results 100 to 200, only the needed pages are ever fetched.
Fetching everything and then filtering is not efficient: large data sources never want to load all their data into memory and slice it - you only want to load what's needed (see the sketch below).
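For the relational case without a dense key to range over, most engines can also return just the window directly, e.g. with LIMIT/OFFSET (MySQL, PostgreSQL, SQLite) or OFFSET ... FETCH (SQL Server, Oracle 12c+). A runnable sketch using sqlite3 with a throwaway articles table:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)')
conn.executemany('INSERT INTO articles (title) VALUES (?)',
                 [('article %d' % i,) for i in range(1000)])
# Rows 100..200 (1-based) are the only ones materialized and returned;
# the engine still walks past the first 99 rows, but never sends them.
window = conn.execute(
    'SELECT id, title FROM articles ORDER BY id LIMIT 101 OFFSET 99'
).fetchall()
print(len(window))   # 101 rows: positions 100 through 200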
