SQL Large Table select random row strategy - sql-server

I would like to select a random row from a very large table (10 million records), so the most common strategies, such as RAND() and NEWID(), don't seem practical.
I have tried the following strategy and would like to know if it is a reasonable approach.
Create a new field called 'RandomSort' of type uniqueidentifier
At the end of each hour/day, run UPDATE ... SET RandomSort = NEWID() against the entire table
Each time I need to query, do a TOP 10 ... ORDER BY RandomSort
It does get the job done (better than ORDER BY NEWID()), but I'm not sure whether this is best practice.
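A minimal sketch of that scheme (table and index names are placeholders, not from the original question):
-- One-time setup: add the column and index it so the TOP query can read it in order.
ALTER TABLE dbo.MyLargeTable ADD RandomSort uniqueidentifier;
CREATE NONCLUSTERED INDEX IX_MyLargeTable_RandomSort ON dbo.MyLargeTable (RandomSort);
GO
-- Scheduled job (hourly/daily): reshuffle the sort keys for the whole table.
UPDATE dbo.MyLargeTable SET RandomSort = NEWID();
GO
-- At query time: a cheap, index-ordered read of 10 "random" rows.
SELECT TOP (10) * FROM dbo.MyLargeTable ORDER BY RandomSort;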

Add an identity column 'rowid' (int or bigint depending on your table size) and create a unique non-clustered index on it.
The following query uses the NEWID() function to return approximately one percent of the rows of the table:
SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The rowid column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
In fact, I believe you could use any indexed column in your table.
If you just want to pick a single random row:
SELECT TOP 1 * FROM table
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table)
This works in constant time, provided the rowid column is indexed. Note: this assumes that rowid is uniformly distributed in the range 0..MAX(rowid), hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).

SQL Server index on optional columns

In my scenario I have a table with a lot of optional columns (20 columns in total, say from col00 to col19; every column contains a non-nullable integer).
When a column contains 0 it is considered empty; any other value has a meaning.
Any subset of those 20 columns could be queried, so I might query for col01 = int1 and col17 = int2.
I need to improve the performance of such queries, but I don't know how to create a representative index.
Sure, I could monitor the table for a while and see which column subsets are searched most, but that is not a satisfactory solution for me (the table is regenerated every few months, and the "tags" encoded this way may change).
I think the best you'll be able to do is to index every column by itself, then use the set operator INTERSECT... in a subquery of your where clause.
INTERSECT returns distinct rows that are output by both the left and right input queries. So if you select the primary key of the table in the INTERSECT, you should have a good subquery that can be used in a where-clause. This will require you to re-write your queries, however.
Example:
SELECT *
FROM tablename
WHERE primary_key IN (
SELECT primary_key FROM tablename WHERE col01 = int1
INTERSECT
SELECT primary_key FROM tablename WHERE col17 = int2
)
That should be sargable, provided col01 and col17 each have their own index.
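A sketch of the per-column indexes this relies on (index names are illustrative):
-- One single-column index per optional column, so each branch of the INTERSECT can seek.
CREATE NONCLUSTERED INDEX IX_tablename_col01 ON tablename (col01);
CREATE NONCLUSTERED INDEX IX_tablename_col17 ON tablename (col17);
-- ...and so on for col00 through col19 as needed.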

Fastest way to process Millions of Rows in SQL Server for a Chart

We are logging real-time data every second to a SQL Server database and we want to generate charts from 10 million rows or more. At the moment we use something like the code below. The goal is to get at least 1000-2000 values to pass into the chart.
In the query below, we take the average of groups of n consecutive rows, where n depends on how much data we pick out of LargeTable. This works fine up to about 200,000 selected rows, but beyond that it is way too slow.
SELECT
AVG(X),
AVG(Y)
FROM
(SELECT
X, Y,
(Id / @AvgCount) AS [Group]
FROM
[LargeTable]
WHERE
Timestmp > @From
AND Timestmp < @Till) j
GROUP BY
[Group]
ORDER BY
[Group];
We also tried selecting only every n'th row from LargeTable and then averaging that data to get more performance, but it takes nearly the same time.
SELECT
X, Y
FROM
(SELECT
X, Y,
ROW_NUMBER() OVER (ORDER BY Id) AS rownr
FROM
LargeTable
WHERE
Timestmp >= @From
AND Timestmp <= @Till) a
WHERE
a.rownr % (@count / 10000) = 0;
It is only pseudo code! We have indexes on all relevant columns.
Are there better and faster ways to get chart data?
I can think of two approaches to improve the performance of the charts:
Trying to improve the performance of the queries.
Reducing the amount of data needed to be read.
It's almost impossible for me to improve the performance of the queries without the full DDL and execution plans, so I suggest reducing the amount of data to be read.
The key is to summarize the data at a given granularity level as it arrives and store it in a separate table like the following:
CREATE TABLE SummarizedData
(
GroupId int PRIMARY KEY,
FromDate datetime,
ToDate datetime,
SumX float,
SumY float,
GroupCount int
);
GroupId should be equal to Id/100 or Id/1000, depending on how much granularity you want in the groups. With larger groups you get coarser granularity but more efficient charts.
I'm assuming the LargeTable Id column increases monotonically, so you can store the last Id that has been processed in another table called SummaryProcessExecutions.
You would need a stored procedure ExecuteSummaryProcess that:
Read LastProcessedId from SummaryProcessExecutions
Read the last Id in the large table and store it in a @NewLastProcessedId variable
Summarize all rows from LargeTable with Id > @LastProcessedId and Id <= @NewLastProcessedId and store the results in the SummarizedData table
Store the @NewLastProcessedId value in the SummaryProcessExecutions table
You can execute the ExecuteSummaryProcess stored procedure frequently from a SQL Server Agent job; a sketch is shown below.
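A minimal sketch of such a procedure, assuming the table names above and a LastProcessedId column in SummaryProcessExecutions (an illustration, not a tested implementation):
CREATE PROCEDURE dbo.ExecuteSummaryProcess
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @LastProcessedId bigint, @NewLastProcessedId bigint;

    -- 1. Read the last Id that has already been summarized.
    SELECT @LastProcessedId = LastProcessedId FROM dbo.SummaryProcessExecutions;

    -- 2. Read the current last Id in the large table.
    SELECT @NewLastProcessedId = MAX(Id) FROM dbo.LargeTable;

    -- 3. Summarize the new rows into groups of 1000 Ids.
    --    (A group that straddles two runs would need an update/merge; omitted for brevity.)
    INSERT INTO dbo.SummarizedData (GroupId, FromDate, ToDate, SumX, SumY, GroupCount)
    SELECT Id / 1000,
           MIN(Timestmp), MAX(Timestmp),
           SUM(X), SUM(Y),
           COUNT(*)
    FROM dbo.LargeTable
    WHERE Id > @LastProcessedId AND Id <= @NewLastProcessedId
    GROUP BY Id / 1000;

    -- 4. Remember how far we got.
    UPDATE dbo.SummaryProcessExecutions SET LastProcessedId = @NewLastProcessedId;
END;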
I believe that grouping by date would be a better choice than grouping by Id. It would simplify things: the SummarizedData GroupId column would not be related to the LargeTable Id, and you would not need to update SummarizedData rows, only insert them.
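For example, a hypothetical date-based variant could bucket rows per minute, inserting only buckets that are fully in the past (again just a sketch, with the epoch and bucket size chosen arbitrarily):
DECLARE @lastBucket int, @currentBucket int;
-- Buckets are minutes since an arbitrary epoch; only fully elapsed minutes are summarized.
SELECT @currentBucket = DATEDIFF(MINUTE, '2000-01-01', GETDATE());
SELECT @lastBucket = ISNULL(MAX(GroupId), -1) FROM dbo.SummarizedData;

INSERT INTO dbo.SummarizedData (GroupId, FromDate, ToDate, SumX, SumY, GroupCount)
SELECT DATEDIFF(MINUTE, '2000-01-01', Timestmp),
       MIN(Timestmp), MAX(Timestmp),
       SUM(X), SUM(Y),
       COUNT(*)
FROM dbo.LargeTable
WHERE DATEDIFF(MINUTE, '2000-01-01', Timestmp) > @lastBucket
  AND DATEDIFF(MINUTE, '2000-01-01', Timestmp) < @currentBucket
GROUP BY DATEDIFF(MINUTE, '2000-01-01', Timestmp);
-- In practice, express the WHERE as a Timestmp range so an index on Timestmp can be used.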
Since the time to scan the table increases with the number of rows in it, I assume there is no index on the Timestmp column. An index like the one below may speed up your query:
CREATE NONCLUSTERED INDEX [IDX_Timestmp] ON [LargeTable](Timestmp) INCLUDE(X, Y, Id)
Please note that creating such an index may take a significant amount of time, and it will impact your inserts too.

Can I prevent a computed column from changing its value if the formula changes?

I have a computed column in MS SQL 2005 that does some VAT calculations. The website uses invoices that can only be generated once and rely on the value in the computed column to work out the VAT.
Unfortunately, a bug was found which means that the calculated VAT value was off by a few cents. Not a huge problem, but we can't change the previously computed values, as these need to be honoured on the invoices.
tldr;
How do I change the calculation for a computed column without re-calculating the values that have already been calculated?
Long story short, you can't, because the computed column definition applies to all rows in the table. But why are you using a computed column here anyway? If (when) the VAT rate changes, the rules for applying it to goods and services change, and the number of invoices increases over time, then a computed column becomes more and more awkward as a solution.
It would be a lot simpler and safer to calculate the VAT once, store it in a column and then just don't update the value. You can use permissions, triggers and/or auditing to ensure that the value is not changed after being entered.
So I would add a new, non-computed column, copy the values from the computed column and drop the computed column (see this question). Plus whatever application development you need to do to actually calculate the values in the first place, of course. It's some extra work but since you've found a bug you have to fix it anyway.
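A minimal sketch of that migration (the table and column names are placeholders):
-- 1. Add a plain (non-computed) column to hold the frozen values.
ALTER TABLE dbo.Invoices ADD VatAmountFixed decimal(18, 2) NULL;
GO
-- 2. Copy the values currently produced by the computed column.
UPDATE dbo.Invoices SET VatAmountFixed = VatAmount;
-- 3. Drop the computed column and, if desired, rename the new one to take its place.
ALTER TABLE dbo.Invoices DROP COLUMN VatAmount;
EXEC sp_rename 'dbo.Invoices.VatAmountFixed', 'VatAmount', 'COLUMN';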
If there is a datestamp on the row, your formula can vary based on that value. Note that you will have to drop the computed column and re-add it. This will require minimal integration; however, I think you should eventually plan on stamping this value instead of using the calculation.
CREATE TABLE dbo.mytable ( low int, high int, insertdate datetime, myavg AS (low + high)/2 ) ;
insert into dbo.mytable values (1,10,'10-10-10')
insert into dbo.mytable values (1,10,'10-10-10')
insert into dbo.mytable values (2,20,'11-11-11')
alter table dbo.mytable add myavgplusone as ( case when insertdate < '11-1-11' then (low + high)/2 else ((low + high)/2)+1 end)
select * from dbo.mytable

"order by newid()" - how does it work?

I know that if I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except that it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The Compute Scalar operator adds the NEWID() column for each row (2,506 rows in the table in my example query), then the rows are sorted by this column and the top 100 are selected.
SQL Server doesn't actually need to fully sort the set beyond position 100, so it uses a TOP N Sort operator, which attempts to perform the entire sort operation in memory (for small values of N).
In general it works like this:
All rows from mytable are "looped" over
NEWID() is executed for each row
The rows are sorted according to the random value from NEWID()
The first 100 rows are selected
As MSDN says:
NEWID() creates a unique value of type uniqueidentifier.
and your table will be sorted by these random values.
Use select top 100 randid = newid(), * from mytable order by randid and the generated values will then be visible in the output.
I have a non-critical query which uses NEWID() and joins many tables. It returns about 10k rows in about 3 seconds. So NEWID() may be OK in cases where the performance hit is acceptable, but it is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > bad performance!
Method 2, Better but Strange: TABLESAMPLE > many gotchas and is not really random!
Method 3, Best but Requires Code: Random Primary Key > fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > only performs properly with a clustered index.
More on method 3:
Get the maximum Id in the table, generate a random number, and look for that Id. For the top N rows, call the code below N times, or generate N random numbers and use them in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
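For comparison, a minimal sketch of method 4 (OFFSET-FETCH, SQL Server 2012+) against the same dbo.Users table, ordered by its clustered index:
-- Pick a random row position, then skip straight to it.
DECLARE @row BIGINT;
SELECT @row = ABS(CHECKSUM(NEWID())) % COUNT(*) FROM dbo.Users;

SELECT *
FROM dbo.Users
ORDER BY Id
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;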

Random record from a database table (T-SQL)

Is there a succinct way to retrieve a random record from a sql server table?
I would like to randomize my unit test data, so am looking for a simple way to select a random id from a table. In English, the select would be "Select one id from the table where the id is a random number between the lowest id in the table and the highest id in the table."
I can't figure out a way to do it without having to run the query, test for a null value, and then re-run if null.
Ideas?
Is there a succinct way to retrieve a random record from a sql server table?
Yes
SELECT TOP 1 * FROM table ORDER BY NEWID()
Explanation
A NEWID() is generated for each row and the table is then sorted by it. The first record is returned (i.e. the record with the "lowest" GUID).
Notes
GUIDs are generated as pseudo-random numbers since version four:
The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.
The algorithm is as follows:
Set the two most significant bits (bits 6 and 7) of the
clock_seq_hi_and_reserved to zero and one, respectively.
Set the four most significant bits (bits 12 through 15) of the
time_hi_and_version field to the 4-bit version number from
Section 4.1.3.
Set all the other bits to randomly (or pseudo-randomly) chosen
values.
—A Universally Unique IDentifier (UUID) URN Namespace - RFC 4122
The alternative SELECT TOP 1 * FROM table ORDER BY RAND() will not work as one would think. RAND() returns one single value per query, thus all rows will share the same value.
While GUID values are pseudo-random, you will need a better PRNG for the more demanding applications.
Typical performance is less than 10 seconds for around 1,000,000 rows, of course depending on the system. Note that the sort on NEWID() cannot use an index, so performance will be relatively limited.
On larger tables you can also use TABLESAMPLE for this to avoid scanning the whole table.
SELECT TOP 1 *
FROM YourTable
TABLESAMPLE (1000 ROWS)
ORDER BY NEWID()
The ORDER BY NEWID is still required to avoid just returning rows that appear first on the data page.
The number of rows to sample needs to be chosen carefully for the size and definition of the table, and you might consider retry logic if no row is returned. The maths behind this and why the technique is not suited to small tables is discussed here
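A minimal sketch of that retry logic, assuming YourTable has an Id column and is not empty:
DECLARE @id int = NULL;
-- Keep sampling until the 1000-row sample actually contains a row.
WHILE @id IS NULL
BEGIN
    SELECT TOP 1 @id = Id
    FROM YourTable
    TABLESAMPLE (1000 ROWS)
    ORDER BY NEWID();
END;
SELECT * FROM YourTable WHERE Id = @id;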
You can also try your method: get a random Id between MIN(Id) and MAX(Id), and then
SELECT TOP 1 * FROM table WHERE Id >= @yourrandomid
It will always get you one row.
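A hypothetical sketch of that approach, assuming an integer Id column (the table name is bracketed because TABLE is a reserved word):
DECLARE @yourrandomid int;
-- Random value in the range MIN(Id)..MAX(Id).
SELECT @yourrandomid = MIN(Id) + ABS(CHECKSUM(NEWID())) % (MAX(Id) - MIN(Id) + 1)
FROM [table];

SELECT TOP 1 * FROM [table] WHERE Id >= @yourrandomid;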
If you want to select from a large data set, the best way that I know of is:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM
(keycol1, NEWID())) as int))
% 100) < 10
Source: MSDN
I was looking to improve on the methods I had tried and came across this post. I realize it's old, but this method is not listed. I am creating and applying test data; this shows the method for "address" in a stored procedure called with @st (two-character state):
Create Table ##TmpAddress (id Int Identity(1,1), street VarChar(50), city VarChar(50), st VarChar(2), zip VarChar(5))
Insert Into ##TmpAddress(street, city, st, zip)
Select street, city, st, zip
From tbl_Address (NOLOCK)
Where st = @st
-- unseeded RAND() will return the same number when called in rapid succession so
-- here, I seed it with a guaranteed different number each time. @@ROWCOUNT is the count from the most recent table operation.
Set @csr = Ceiling(RAND(convert(varbinary, newid())) * @@ROWCOUNT)
Select street, city, st, Right(('00000' + ltrim(zip)),5) As zip
From ##tmpAddress (NOLOCK)
Where id = @csr
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
Source: http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
This is further explained below:
How does this work? Let's split out the WHERE clause and explain it.
The CHECKSUM function is calculating a checksum over the items in the
list. It is arguable over whether SalesOrderID is even required, since
NEWID() is a function that returns a new random GUID, so multiplying a
random figure by a constant should result in a random figure in any case.
Indeed, excluding SalesOrderID seems to make no difference. If you are
a keen statistician and can justify the inclusion of this, please use
the comments section below and let me know why I'm wrong!
The CHECKSUM function returns a VARBINARY. Performing a bitwise AND
operation with 0x7fffffff, which is the equivalent of (111111111...)
in binary, yields a decimal value that is effectively a representation
of a random string of 0s and 1s. Dividing by the co-efficient
0x7fffffff effectively normalizes this decimal figure to a figure
between 0 and 1. Then to decide whether each row merits inclusion in
the final result set, a threshold of 1/x is used (in this case, 0.01)
where x is the percentage of the data to retrieve as a sample.
Source: https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling
