Paginated searching... does performance degrade heavily after N records? - sql-server

I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder: does performance start to degrade with paginated searches after N records (I'm specifically wondering about ROW_NUMBER() in SQL Server or a similar feature in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is specifically put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50

Yes. High offsets are slow and inefficient.
The only way to find the records at an offset is to compute all the records that come before it and then discard them.
(I don't know ROW_NUMBER(), but in MySQL this would be LIMIT. So
SELECT * FROM table LIMIT 1999,20
)
In the above example, the first 1999 records have to be fetched and then discarded. Generally the database can't skip ahead or use indexes to jump right to the correct location in the data, because normally there is a 'WHERE' clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so it's quite feasible to cache them. An arbitrary search query has far too many variations to cache, making this impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets)
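One alternative that does avoid the discarded rows is keyset (or "seek") pagination. It is not something the answer above describes, and it only supports "next page" navigation over a unique, indexed sort key rather than jumping to an arbitrary page number. A minimal sketch, using Python's built-in sqlite3 and a made-up `items` table purely for illustration:

```python
import sqlite3

# Hypothetical table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO items (id, title) VALUES (?, ?)",
    [(i, f"item {i}") for i in range(1, 5001)],
)

PAGE_SIZE = 20


def page_by_offset(offset):
    # The engine has to walk past `offset` rows and throw them away.
    return conn.execute(
        "SELECT id, title FROM items ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


def page_after(last_seen_id):
    # Keyset pagination: the WHERE clause lets the primary key index
    # seek straight to the first row of the page.
    return conn.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()


offset_page = page_by_offset(1999)
keyset_page = page_after(1999)
print(offset_page == keyset_page)
```

Both calls return the same 20 rows here because the ids are contiguous; the difference is in how much work the engine does to reach them. The caller has to remember the last id it saw instead of a page number.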
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back of the envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well; if indexes can be used, the difference is less pronounced and harder to see. But in a production system running lots of queries, a 1 or 2 ms difference is huge.)
Update: (to show an indexed query)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)

It's a TOP clause designed to limit the amount of physical reads that the database has to perform, which limits the amount of time that the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google has to limit the volume returned so other users aren't affected by generic searches. What's faster, returning 1,000 results or 82,000,000,000 results?

Related

Hasura Timeout, Queries with Inconsistent speed Issues

I have a table with about 2 million rows. If I search by user email ID (using _ilike), it takes more than 15 to 20 seconds to respond without an index (or sometimes I get a timeout error). With an index I get a response within a second, though there are still times when it takes 15 to 20 seconds (say 2 out of 10 times).
I have the following questions:
Most of the time there is a timeout when we search for an email ID that is not present in the table. Why is that? Is this expected behavior?
How much space and what DB configuration is Hasura expected to need for approximately 2 million rows?
Is a btree index better for _ilike searches, or is a gin index the better solution?
Are there any other suggestions to improve the performance of the query, besides indexing?
Even the basic query to get the row count is pretty slow. Is there a way to improve it?
`userTable_aggregate {
aggregate {
count
}
}`
Note: Data is added to userTable every hour (approximately 100 rows, let's say).
Thank you for taking time to answer my questions

Table Scan very high "actual rows" when filter placed on different table

I have a query, that I did not write, that takes 2.5 minutes to run. I am trying to optimize it without being able to modify the underlying tables, i.e. no new indexes can be added.
During my optimization troubleshooting I commented out a filter and all of a sudden my query ran in .5 seconds. I have screwed with the formatting and placing of that filter and if it is there the query takes 2.5 minutes, without it .5 seconds. The biggest problem is that the filter is not on the table that is being table-scanned (With over 300k records), it is on a table with 300 records.
The "Actual Execution Plan" of both the 0:0:0.5 vs 0:2:30 are identical down to the exact percentage costs of all steps:
Execution Plan
The only difference is that on the table-scanned table the "Actual Number of Rows" on the 2.5 min query shows 3.7 million rows. The table only has 300k rows. Where the .5 sec query shows Actual Number of Rows as 2,063. The filter is actually being placed on the FS_EDIPartner table that only has 300 rows.
With the filter I get the correct 51 records, but it takes 2.5 minutes to return. Without the filter I get duplication, so I get 2,796 rows, and only take half a second to return.
I cannot figure out why adding the filter to a table with 300 rows and a correct index is causing the Table scan of a different table to have such a significant difference in actual number of rows. I am even doing the "Table scan" table as a sub-query to filter its records down from 300k to 17k prior to doing the join. Here is the actual query in its current state, sorry the tables don't make a lot of sense, I could not reproduce this behavior in test data.
SELECT dbo.FS_ARInvoiceHeader.CustomerID
, dbo.FS_EDIPartner.PartnerID
, dbo.FS_ARInvoiceHeader.InvoiceNumber
, dbo.FS_ARInvoiceHeader.InvoiceDate
, dbo.FS_ARInvoiceHeader.InvoiceType
, dbo.FS_ARInvoiceHeader.CONumber
, dbo.FS_EDIPartner.InternalTransactionSetCode
, docs.DocumentName
, dbo.FS_ARInvoiceHeader.InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
INNER JOIN dbo.FS_EDIPartner ON dbo.FS_ARInvoiceHeader.CustomerID = dbo.FS_EDIPartner.CustomerID
LEFT JOIN (Select DocumentName
FROM GentranDatabase.dbo.ZNW_Documents
WHERE DATEADD(SECOND,TimeCreated,'1970-1-1') > '2016-06-01'
AND TransactionSetID = '810') docs on dbo.FS_ARInvoiceHeader.InvoiceNumber = docs.DocumentName COLLATE Latin1_General_BIN
WHERE docs.DocumentName IS NULL
AND dbo.FS_ARInvoiceHeader.InvoiceType = 'I'
AND dbo.FS_ARInvoiceHeader.InvoiceStatus <> 'Y'
--AND (dbo.FS_EDIPartner.InternalTransactionSetCode = '810')
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'CB%'))
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'DM%'))
AND InvoiceDate > '2016-06-01'
The commented-out line in the WHERE clause is the culprit; uncommenting it causes the 2.5 minute run.
It could be that the table statistics may have gotten out of whack. These include the number of records tables have which is used to choose the best query plan. Try running this and running the query again:
EXEC sp_updatestats
Using @jeremy's comment as a guideline, which pointed out that the Actual Number of Rows was not my problem but rather the number of executions, I figured out that the Hash Match plan took .5 seconds and the Nested Loop plan took 2.5 minutes. Trying to force the Hash Match using LEFT HASH JOIN was inconsistent depending on what the other filters were set to; changing dates sometimes took it from .5 seconds to 30 seconds. So forcing the Hash (which is highly discouraged anyway) wasn't a good solution. Finally I resorted to moving the poorly performing view to a Stored Procedure and splitting out both of the tables that were related to the poor performance into Table Variables, then joining those table variables. This gave the most consistently good performance. On average the SP returns in less than 1 second, which is far better than the 2.5 minutes it started at.
@Jeremy gets the credit, but since his wasn't an answer, I thought I would document what was actually done in case someone else stumbles across this later.
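To illustrate why the number of executions matters more than the size of the table: in an actual execution plan, the row count on the inner side of a nested loop is the total across every execution of that operator, so it can far exceed the table's own row count. The following is only a toy model in Python with made-up data, not how SQL Server implements either operator:

```python
def nested_loop_join(outer, inner, key):
    """Join by rescanning the inner input once per outer row."""
    rows_read = 0
    matches = []
    for o in outer:
        for i in inner:  # one full pass over `inner` per outer row
            rows_read += 1
            if o[key] == i[key]:
                matches.append((o, i))
    return matches, rows_read


def hash_join(outer, inner, key):
    """Join by reading the inner input once into a hash table."""
    rows_read = 0
    table = {}
    for i in inner:  # single pass over `inner`
        rows_read += 1
        table.setdefault(i[key], []).append(i)
    matches = [(o, i) for o in outer for i in table.get(o[key], [])]
    return matches, rows_read


# Made-up inputs: 50 outer rows, 1000 inner rows.
outer_rows = [{"k": n} for n in range(50)]
inner_rows = [{"k": n % 100} for n in range(1000)]

nl_matches, nl_rows_read = nested_loop_join(outer_rows, inner_rows, "k")
hj_matches, hj_rows_read = hash_join(outer_rows, inner_rows, "k")
print(len(nl_matches), nl_rows_read, hj_rows_read)
```

Both functions return the same matches, but the nested loop reads the 1000-row input 50 times over while the hash join reads it once.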

How to grab filter elements from a large search result when just using smaller pages in the application?

I have a search query that returns a bunch of records but we're using paging so we only return back 10, 25, or 50 page subsets of the data. Basically the stored procedure goes along these lines.
WITH search_results AS
(
SELECT model, brandname, msrp,
ROW_NUMBER() OVER (ORDER BY @sortExpression) as rowNumber
FROM models
WHERE ...criteria....
)
SELECT * FROM search_results
WHERE rowNumber BETWEEN ((@pageNumber-1)*@pageSize)+1 and ((@pageNumber-1)*@pageSize)+@pageSize
When I use small pages my sproc comes back very quickly, usually in under a second. However, sometimes our users will enter criteria that may return a few thousand to potentially ten thousand records. They'll page through and just grab a few at a time, but the actual search result is large. The sproc runs quickly when my page size is small, but when I increase it, it takes a few seconds, which is too long.
This is all fine, since I am using smaller pages. The problem is that part of our solution is a filter. This filter lists all of the brands, categories, and 4 price range quadrants for the full search results. So when they click filter, it takes the lowest and the highest price, breaks the span into 4 equal-sized groupings, and shows them on the form with checkboxes. The user can then check the ranges, brands and categories they want to filter on. This re-submits the search with new criteria.
I'm not sure how to return a full set of brands, categories and highest/lowest price without running the main procedure (in the WITH) twice. Does it make sense to dump all of that into a temporary table and then return back multiple recordsets to my business object? The results, the brand list, the category list, and then the MIN and MAX prices? Is there a pattern for returning back filter information for search results like this?
The answers are: no, there's no pattern; and maybe, for the temp table. Try putting the raw big result in a temp table and using it to return multiple record sets. Test it and see if it works better. Doing it this way you are (in general) using more memory and less CPU. In the tuning business there are sometimes trade-offs where you can exchange memory/IO/CPU use to speed things up.
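As a rough sketch of the temp table idea, here is the shape of it using Python's built-in sqlite3 rather than a SQL Server stored procedure. The `category` column, the sample data and the `msrp >= 120` criteria are assumptions made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; the real models table will differ.
conn.execute(
    "CREATE TABLE models (model TEXT, brandname TEXT, category TEXT, msrp REAL)"
)
conn.executemany(
    "INSERT INTO models VALUES (?, ?, ?, ?)",
    [(f"m{i}", f"brand{i % 3}", f"cat{i % 2}", 100.0 + i) for i in range(100)],
)


def search(page_number, page_size):
    # Run the expensive search once and keep the full result in a temp table.
    conn.execute("DROP TABLE IF EXISTS temp.search_results")
    conn.execute(
        "CREATE TEMP TABLE search_results AS "
        "SELECT model, brandname, category, msrp FROM models WHERE msrp >= 120"
    )
    # Record set 1: the requested page only.
    page = conn.execute(
        "SELECT model, brandname, msrp FROM search_results "
        "ORDER BY msrp LIMIT ? OFFSET ?",
        (page_size, (page_number - 1) * page_size),
    ).fetchall()
    # Record sets 2 and 3: filter values taken from the full result.
    brands = [r[0] for r in conn.execute(
        "SELECT DISTINCT brandname FROM search_results ORDER BY brandname")]
    categories = [r[0] for r in conn.execute(
        "SELECT DISTINCT category FROM search_results ORDER BY category")]
    # Record set 4: lowest/highest price split into 4 equal-sized ranges.
    low, high = conn.execute(
        "SELECT MIN(msrp), MAX(msrp) FROM search_results").fetchone()
    step = (high - low) / 4
    price_ranges = [(low + n * step, low + (n + 1) * step) for n in range(4)]
    return page, brands, categories, price_ranges


page, brands, categories, price_ranges = search(page_number=2, page_size=10)
print(len(page), brands, categories, price_ranges)
```

The search criteria are evaluated once; every other query reads the already-filtered temp table. Whether that is faster than running the search twice is something to measure, as the answer says.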

SOLR faceting slower than manual count?

I'm trying to get a SOLR range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two-three seconds on average.
But when I add a date range facet to that query (I want to see how many products were added each day for the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and perform manual counting in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that SOLR is counting fields on some larger document pool than the 700 documents resulting from the "q=" parameter. Or maybe I should filter documents in another way?
I have tried changing the filterCache size and it works, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
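The faceted query above is hard to read in its URL-encoded form. A small sketch that decodes it with Python's standard library (the query string is copied from the question; nothing else is assumed):

```python
from urllib.parse import parse_qs

# Query string exactly as posted in the question.
query_string = (
    "start=0&rows=0"
    "&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]"
    "&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22"
    "&facet=true&facet.limit=-1&facet.sort=count"
    "&facet.range=productDate"
    "&facet.range.start=NOW%2FDAY-5000DAYS"
    "&facet.range.end=NOW%2FDAY%2B1DAY"
    "&facet.range.gap=%2B1DAY"
)

# parse_qs turns '+' into a space and %XX escapes into characters.
params = parse_qs(query_string)
for name in sorted(params):
    print(f"{name} = {params[name][0]}")
```

Decoded, the range facet runs from NOW/DAY-5000DAYS to NOW/DAY+1DAY in +1DAY steps, which asks for a count in each of 5,001 daily buckets, in addition to the separate facet.query on productDate.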

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows. But when we are getting >10-20k rows we are getting SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing this line increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4 digit alpha-numeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table and issued to each customer as a bar-code, and the bar-code scanning app at the door will have the same list of 5000. The reason for using a 4 digit alpha-numeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want a large number of characters. Customers love the last bit btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,615 (there are 36^4 = 1,679,616 unique four-digit alphanumeric codes, ignoring case, so 2.6 million rows must contain some duplicates) and convert them to your four-digit codes.
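A minimal sketch of generating the codes directly, in Python. The alphabet (digits plus uppercase letters) and the function names are assumptions for illustration; the real system may use a different character set:

```python
import random

# Assumed alphabet: 10 digits + 26 uppercase letters, case ignored.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODE_LENGTH = 4
TOTAL_CODES = len(ALPHABET) ** CODE_LENGTH  # 36**4 = 1,679,616


def to_code(number):
    """Convert an integer in [0, TOTAL_CODES) to a 4-character base-36 code."""
    chars = []
    for _ in range(CODE_LENGTH):
        number, digit = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def random_codes(count):
    # random.sample draws without replacement, so the codes are unique.
    return [to_code(n) for n in random.sample(range(TOTAL_CODES), count)]


codes = random_codes(100_000)
print(len(codes), len(set(codes)))
```

Because the numbers are sampled without replacement from the whole code space, no table scan or sort is involved.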
You don't have to sort.
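DECLARE @RandomNumber int
DECLARE @Threshold float
-- @RandomNumber holds the total row count
SELECT @RandomNumber = COUNT(*) FROM customers
-- 50000.0 forces float division; 50000 / @RandomNumber would truncate to 0
SELECT @Threshold = 50000.0 / @RandomNumber
-- A bare rand() is evaluated once per query, so seed it per row; only the
-- rows that pass the filter (about 50000) are then sorted
SELECT TOP 50000 * FROM customers WHERE rand(checksum(newid())) < @Threshold ORDER BY newid()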
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
One thought is to break the process down into steps. Add a GUID column to the table, then run an UPDATE statement to fill it with GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,2,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,3,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,4,1))as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code converts each alphanumeric venue code to an integer, splits the entire table into 500,000 buckets, and takes the top 50000 rows that fall in buckets 0 through 50000. You can play with the number after the % sign (500,000) and with the BETWEEN range. This should randomize it for you. I'm not sure whether the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY, there is no guarantee of the order (if you have multiple CPUs and threading).
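To see what that WHERE clause actually computes, here is the same arithmetic as a Python sketch (the function names are made up; `ord` plays the role of ASCII() and string concatenation the role of the varchar casts):

```python
def bucket(venuecode, buckets=500_000):
    """Concatenate the ASCII codes of the first 4 characters, then take the modulus."""
    digits = "".join(str(ord(ch)) for ch in venuecode[:4])
    return int(digits) % buckets


def is_selected(venuecode, low=0, high=50_000):
    # Mirrors "... % 500000 between 0 and 50000" (BETWEEN is inclusive).
    return low <= bucket(venuecode) <= high


for code in ("AB12", "A2ZZ"):
    print(code, bucket(code), is_selected(code))
```

Note that the result is deterministic: the same code always lands in the same bucket, so repeated runs return the same rows. For codes made only of digits and uppercase letters, each character contributes two digits, and since 500,000 divides 1,000,000 the modulus only depends on the last three characters.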
