Will a large offset trigger millions of reads in TSQL? - sql-server

I have a system that needs to suck up an entire MS SQL database. Currently it does so with something like:
select top 1000 from table where id > 0 order by id;
Then, for the next chunk:
select top 1000 from table where id > 1000 order by id;
And then:
select top 1000 from table where id > 2000 order by id;
And so forth.
In MySQL, I've learned that doing LIMIT and OFFSET queries is brutally slow because the database has to first sort the results, then scan over the OFFSET count. When that count gets big, life starts to suck as the read count skyrockets.
My question is this: does the same problem apply to TOP? Put another way, can I expect a really high read count when I run these queries on a database with, say 10,000,000 records, at the time when id > 9,999,000? If so, are there any ways to handle this better?

It will be very fast if ID is indexed. If that column is not indexed then it would case a full table scan.
I would suggest the following in addition:
select * from table where id > 0 and id <= 1000 order by id ;
This way if you don't have all records you don't have bugs (duplicates).

Related

How to determine if count is greater than threshold in most efficient way in SQL server?

I want to select if a user produced more than 1000 logs. Given these queries I let SQL Server Studio display estimated execution plan.
select count(*) from tbl_logs where id_user = 3
select 1 from tbl_logs where id_user = 3 having count(1) > 1000
I thought the second one should be better because it can return as soon as SQL Server found 1000 rows. Whereas the first one returns the actual count of rows.
Also when I profile the queries they are equal in terms of Reads, CPU and Duration.
What would be the most efficient query for my task?
This query should also improve the performance :
select 1 from tbl_logs order by 1 offset (1000) rows fetch next (1) rows only
You get a 1 when more than 1.000 rows exists, and an empty dataset when they doesn't.
It only fetches the first 1.001 rows, as Alexander's answer does, but after that it has the advantage that it doesn't need to re-count the rows already fetched.
If you want the result to be exactly 1 or 0, then you could read it like this:
with Row_1001 as (
select 1 as Row_1001 from tbl_logs order by 1 offset (1000) rows fetch next (1) rows only
)
select count(*) as More_Than_1000_Rows_Exist from Row_1001
I think, some performance improvement can be achieved this way:
select 1 from (
select top 1001 1 as val from tbl_logs where id_user = 3
) cnt
having count(*) > 1000
In this example, derived query will fetch only first 1001 rows (if they are exists) and outer query will perform a logical check on count.
However, it will not lead to reduction of reads if table tbl_logs is tiny, so index seek uses so small index that only few pages to be fetched

Fastest way to process Millions of Rows in SQL Server for a Chart

We are logging realtime data every second to a SQL Server database and we want to generate charts from 10 Million rows or more. At the moment we use something like the code below. The goal is to get at least 1000-2000 values to pass into the chart.
In the query below, we take an avg of every next n'th rows depending on the count of data we pick out from the LargeTable. This works fine up to 200.000 selected rows, but then it is way too slow.
SELECT
AVG(X),
AVG(Y)
FROM
(SELECT
X, Y,
(Id / #AvgCount) AS [Group]
FROM
[LargeTable]
WHERE
Timestmp > #From
AND Timestmp < #Till) j
GROUP BY
[Group]
ORDER BY
X;
Now we tried to select out only every n'th row from LargeTable and then make an average of this data to get more performance, but it takes nearly the same time.
SELECT
X, Y
FROM
(SELECT
X, Y,
ROW_NUMBER() OVER (ORDER BY Id) AS rownr
FROM
LargeTable
WHERE
Timestmp >= #From
AND Timestmp <= #Till) a
WHERE
a.rownr % (#count / 10000) = 0;
It is only pseudo code! We have indexes on all relevant columns.
Are there better and faster ways to get chart data?
I think on two approaches to improve the performance of the charts:
Trying to improve the performance of the queries.
Reducing the amount of data needed to be read.
It's almost impossible for me to improve the performance of the queries without the full DDL and execution plans. So I'm suggesting you to reduce the amount of data to be read.
The key is summarizing groups at a given granularity level as the data comes and storing it in a separate table like the following:
CREATE TABLE SummarizedData
(
int GroupId PRIMARY KEY,
FromDate datetime,
ToDate datetime,
SumX float,
SumY float,
GroupCount
)
IdGroup should be equals to Id/100 or Id/1000 depending on how much granularity you want in groups. With larger groups you get more coarse granularity but more efficient charts.
I'm assuming LargeTable Id column increases monotonically, so you can store the last Id that has been processed in another table called SummaryProcessExecutions
You would need a stored procedure ExecuteSummaryProcess that:
Read LastProcessedId from SummaryProcessExecutions
Read the Last Id on large table and store it into #NewLastProcessedId variable
Summarize all rows from LargeTable with Id > #LastProcessedId and Id <= #NewLastProcessedId and store the results into SummarizedData table
Store #NewLastProcessedId variable into SummaryProcessExecutions table
You can execute ExecuteSummaryProcess stored procedure frequently in a SQL Server Agent Job.
I believe that grouping by date would be a better choice than grouping by Id. It would simplify things. The SummarizedData GroupId column would not be related to LargeTable Id and you would not need to update SummarizedData rows, you would only need to insert rows.
Since the time to scan the table increases with the number of rows in it, I assume there is no index on Timestmp column. An index like the one bellow may speed up you query:
CREATE NONCLUSTERED INDEX [IDX_Timestmp] ON [LargeTable](Timestmp) INCLUDE(X, Y, Id)
Please note, that creation of such index may take significant amount of time, and it will impact your inserts too.

sql select performance (Microsoft)

I have a very big table with many rows (50 million) and more than 500 columns. Its indexes are period and client. I need to keep for a period the client and another column (not an index). It takes too much time. So I'm trying to understand why:
If I do:
select count(*)
from table
where cd_periodo=201602
It takes less than 1 sec and returns the number 2 million.
If I select into a temp table the period it also takes no time (2 secs)
select cd_periodo
into #table
from table
where cd_periodo=201602
But if I select another column that it's not part of an index it takes more than 3 minutes.
select not_index_column
into #table
from table
where cd_periodo=201602
Why is this happening? I'm not doing any filter on the column.
When you select an indexed column, the reader doesn't have to process and go into the entire table and read the entire row. The index helps the reader to select the value without having to actually get the row.
When you select a nonindexed column, the opposite of what I said happens, and the reader have to read the whole table in order to get the value from this column.

Improving OFFSET performance in PostgreSQL

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.
Adding an index on the ORDER BY column makes a massive difference to performance (when used in combination with a small LIMIT). On a 500,000 row table, I saw a 10,000x improvement adding the index, as long as there was a small LIMIT.
However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.
It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."
You might want a computed index.
Let's create a table:
create table sales(day date, amount real);
And fill it with some random stuff:
insert into sales
select current_date + s.a as day, random()*100 as amount
from generate_series(1,20);
Index it by day, nothing special here:
create index sales_by_day on sales(day);
Create a row position function. There are other approaches, this one is the simplest:
create or replace function sales_pos (date) returns bigint
as 'select count(day) from sales where day <= $1;'
language sql immutable;
Check if it works (don't call it like this on large datasets though):
select sales_pos(day), day, amount from sales;
sales_pos | day | amount
-----------+------------+----------
1 | 2011-07-08 | 41.6135
2 | 2011-07-09 | 19.0663
3 | 2011-07-10 | 12.3715
..................
Now the tricky part: add another index computed on the sales_pos function values:
create index sales_by_pos on sales using btree(sales_pos(day));
Here is how you use it. 5 is your "offset", 10 is the "limit":
select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
day | amount
------------+---------
2011-07-12 | 94.3042
2011-07-13 | 12.9532
2011-07-14 | 74.7261
...............
It is fast, because when you call it like this, Postgres uses precalculated values from the index:
explain select * from sales
where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using sales_by_pos on sales (cost=0.50..8.77 rows=1 width=8)
Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))
Hope it helps.
I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.
SELECT *
FROM massive_table
WHERE id IN (
SELECT id
FROM massive_table
WHERE ...
LIMIT 50
OFFSET 500000
);
The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.
Instead of using an OFFSET, a very efficient trick is to use a temporary table:
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY myID), myID
FROM mytable;
For 10 000 000 rows it needs about 10s to be created.
Then you want to use SELECT or UPDATE your table, you simply:
SELECT * FROM mytable INNER JOIN (SELECT just_index.myId FROM just_index WHERE row_number >= *your offset* LIMIT 1000000) indexes ON mytable.myID = indexes.myID
Filtering mytable with only just_index is more efficient (in my case) with a INNER JOIN than with a WHERE myID IN (SELECT ...)
This way you don't have to store the last myId value, you simply replace the offset with a WHERE clause, that uses indexes
It seems that optimizing for large
OFFSETs (especially in pagination
use-cases) isn't that unusual.
It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.
But anyway . . .
Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order a single column, and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.
recently i worked over a problem like this, and i wrote a blog about how face that problem. is very like, i hope be helpfull for any one.
i use lazy list approach with partial adquisition. i Replaced the limit and offset or the pagination of query to a manual pagination.
In my example, the select returns 10 millions of records, i get them and insert them in a "temporal table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
after that, i can paginate without count each row but using the sequence assigned:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From java perspective, i implemented this pagination through partial adquisition with a lazy list. this is, a list that extends from Abstract list and implements get() method. The get method can use a data access interface to continue get next set of data and release the memory heap:
#Override
public E get(int index) {
if (bufferParcial.size() <= (index - lastIndexRoulette))
{
lastIndexRoulette = index;
bufferParcial.removeAll(bufferParcial);
bufferParcial = new ArrayList<E>();
bufferParcial.addAll(daoInterface.getBufferParcial());
if (bufferParcial.isEmpty())
{
return null;
}
}
return bufferParcial.get(index - lastIndexRoulette);<br>
}
by other hand, the data access interface use query to paginate and implements one method to iterate progressively, each 25000 records to complete it all.
results for this approach can be seen here
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

MS Access row number, specify an index

Is there a way in MS access to return a dataset between a specific index?
So lets say my dataset is:
rank | first_name | age
1 Max 23
2 Bob 40
3 Sid 25
4 Billy 18
5 Sally 19
But I only want to return those records between 'rank' 2 and 4, so my results set is Bob, Sid and Billy? However, Rank is not part of the table, and this should be generated when the query is run. Why don't I use an autogenerated number, because if a record is deleted, this will be inconsistent, and what if I wanted the results in reverse!
This obviously very simple, and the reason I ask is because I am working on a product catalogue and I am looking for a more efficient way of paging through the returned dataset, so if I only return 1 page worth of data from the database this is obviously going to be quicker then return a complete set of 3000 records and then having to subselect from that set!
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after comment, that rank is not existing in structure:
Select top 100 * from table;
And if you want to choose subsequent results, you can choose the ID of the last record from the first query, say it was ID 101, and use a WHERE clause to get the next 100;
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly e.g. I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
The bad news is the Access data engine is poorly optimized for this kind of query; in my experience, performace will start to noticeably degrade beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows to an ADO recordset then finding the page (then perhaps fabricate smaller ADO recordset consisting of just that page's worth of rows) than it is to perform a correlated subquery to only fetch the number of rows for the page.
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASCENDING
If I understand you correctly, there is ionly first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query. Unless you do something like
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table, and make it an auto incrementing integer (see Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you are running an update query, or which rows are being deleted when you run a delete query. For example, if you had 2 people named Max, and they were both 23, how you delete 1 row without deleting the other. If you had another auto incrementing unique column in there, you could specify the unique ID in your query to delete only one.
[ADDITION]
Upon reading your comment, If you add an autoincrement field, and want to read 3 rows, and you know the ID of the first row you want to read, then you can use "TOP" to read 3 rows.
Assuming your data looks like this
ID | first_name | age
1 Max 23
2 Bob 40
6 Sid 25
8 Billy 18
15 Sally 19
You can wuery Bob, Sid and Billy with the following QUERY.
SELECT TOP 3 FirstName, Age
From Table
WHERE ID >= 2
ORDER BY ID

Resources