Random record from a database table (T-SQL) - sql-server

Is there a succinct way to retrieve a random record from a sql server table?
I would like to randomize my unit test data, so am looking for a simple way to select a random id from a table. In English, the select would be "Select one id from the table where the id is a random number between the lowest id in the table and the highest id in the table."
I can't figure out a way to do it without having to run the query, test for a null value, and then re-run if null.
Ideas?

Is there a succinct way to retrieve a random record from a sql server table?
Yes
SELECT TOP 1 * FROM table ORDER BY NEWID()
Explanation
A NEWID() is generated for each row and the table is then sorted by it. The first record is returned (i.e. the record with the "lowest" GUID).
Notes
GUIDs have been generated from pseudo-random numbers since UUID version 4:
The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.
The algorithm is as follows:
Set the two most significant bits (bits 6 and 7) of the
clock_seq_hi_and_reserved to zero and one, respectively.
Set the four most significant bits (bits 12 through 15) of the
time_hi_and_version field to the 4-bit version number from
Section 4.1.3.
Set all the other bits to randomly (or pseudo-randomly) chosen
values.
—A Universally Unique IDentifier (UUID) URN Namespace - RFC 4122
The alternative SELECT TOP 1 * FROM table ORDER BY RAND() will not work as one would think. RAND() returns one single value per query, thus all rows will share the same value.
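A quick way to see the difference is to project both values (a minimal sketch; sys.objects is used only as a convenient row source):
SELECT TOP 3 RAND() AS rand_same_for_every_row, NEWID() AS newid_differs_per_row
FROM sys.objects
Every row shows the same RAND() value, while each row gets its own NEWID() value, which is why only NEWID() is usable for per-row randomization.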
While GUID values are pseudo-random, you will need a better PRNG for more demanding applications.
Typical performance is less than 10 seconds for around 1,000,000 rows, depending of course on the system. Note that the sort cannot use an index, so performance will be relatively limited.

On larger tables you can also use TABLESAMPLE for this to avoid scanning the whole table.
SELECT TOP 1 *
FROM YourTable
TABLESAMPLE (1000 ROWS)
ORDER BY NEWID()
The ORDER BY NEWID() is still required to avoid just returning rows that appear first in the data pages.
The number to use needs to be chosen carefully for the size and definition of the table, and you might consider retry logic in case no row is returned. The maths behind this, and why the technique is not suited to small tables, is discussed here
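Retry logic could look roughly like this (a hedged sketch; it assumes YourTable has an integer Id column):
DECLARE @picked TABLE (Id INT);
WHILE NOT EXISTS (SELECT 1 FROM @picked)
BEGIN
    INSERT INTO @picked (Id)
    SELECT TOP 1 Id
    FROM YourTable TABLESAMPLE (1000 ROWS)
    ORDER BY NEWID();
END
SELECT Id FROM @picked;
If TABLESAMPLE happens to select no pages, the loop simply tries again.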

You can also try your original approach: get a random Id between MIN(Id) and MAX(Id) and then
SELECT TOP 1 * FROM table WHERE Id >= @yourrandomid
It will always get you one row.
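A sketch of computing that random id in T-SQL (YourTable stands in for the table in question and Id for its integer key):
DECLARE @yourrandomid INT;
SELECT @yourrandomid = MIN(Id) + ABS(CHECKSUM(NEWID())) % (MAX(Id) - MIN(Id) + 1)
FROM YourTable;
SELECT TOP 1 * FROM YourTable WHERE Id >= @yourrandomid;
Note that rows following large gaps in the Id sequence are more likely to be picked, so the result is not uniformly distributed.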

If you want to sample a large amount of data, the best way that I know of is:
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(keycol1, NEWID())) AS int)) % 100) < 10
Source: MSDN

I was looking to improve on the methods I had tried and came across this post. I realize it's old, but this method is not listed. I am creating and applying test data; this shows the method for "address" in a stored procedure called with @st (two-character state).
Create Table ##TmpAddress (id Int Identity(1,1), street VarChar(50), city VarChar(50), st VarChar(2), zip VarChar(5))
Insert Into ##TmpAddress (street, city, st, zip)
Select street, city, st, zip
From tbl_Address (NOLOCK)
Where st = @st
-- Unseeded RAND() will return the same number when called in rapid succession, so
-- here it is seeded with a guaranteed different number each time. @@ROWCOUNT is the row count from the most recent table operation.
Set @csr = Ceiling(RAND(convert(varbinary, newid())) * @@ROWCOUNT)
Select street, city, st, Right(('00000' + ltrim(zip)), 5) As zip
From ##TmpAddress (NOLOCK)
Where id = @csr

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that
NEWID() evaluates once per row to achieve sampling on a per-row basis.
The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
Source: http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
This is further explained below:
How does this work? Let's split out the WHERE clause and explain it.
The CHECKSUM function is calculating a checksum over the items in the list. It is arguable whether SalesOrderID is even required, since NEWID() is a function that returns a new random GUID, so multiplying a random figure by a constant should result in a random value in any case. Indeed, excluding SalesOrderID seems to make no difference. If you are a keen statistician and can justify the inclusion of this, please use the comments section below and let me know why I'm wrong!
The CHECKSUM function returns a VARBINARY. Performing a bitwise AND operation with 0x7fffffff, which is the equivalent of (0111111...) in binary, yields a decimal value that is effectively a representation of a random string of 0s and 1s. Dividing by the coefficient 0x7fffffff effectively normalizes this decimal figure to a figure between 0 and 1. Then, to decide whether each row merits inclusion in the final result set, a threshold of 1/x is used (in this case, 0.01), where the aim is to keep roughly 1 in every x rows (here 1 in 100, i.e. a 1 percent sample).
Source: https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling
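To see the per-row value the WHERE clause compares against, you can simply project it (a small sketch against the same Sales.SalesOrderDetail table used above):
SELECT TOP 5
    SalesOrderID,
    CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST(0x7fffffff AS int) AS normalized_value
FROM Sales.SalesOrderDetail;
Each row gets its own pseudo-random value between 0 and 1, and the filter keeps a row only when that value falls below the chosen sampling fraction.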

Related

SQL Server: random number in WHERE clause

As far as I am aware, the only way to get a random value in a SELECT statement is by using the newid() function, as the rand() function doesn’t generate new values for each row.
This leads to the following awkward construction to get a random number from, say 0 - 9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them. Apparently newid() is only evaluated once, instead of once for each row.
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for a fixed number of rows at random. In the above example I could have used:
select top 50 percent * from table order by newid();
which will get me what I am looking for.
The question remains, how can I use a random number in the WHERE clause. For example, is it possible to do something like this?
select *
from table
where code={random number};
Here is one way to get around the problem
SELECT *
FROM (SELECT *,
Abs(Checksum(Newid())) % 10 AS ran
FROM yourtable) a
WHERE ran > 4;
For some reason, newid() in the WHERE clause is executed only once and the result is compared as a constant.
When I check the execution plans, your query is missing a Compute Scalar operator, whereas my query has a Compute Scalar present in its execution plan.
The function newid() is calculated only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them
You may get all of the rows or none of them, since NEWID() is executed once per query when you use it in the WHERE clause. This is explained here by Conor Cunningham, and the technical term for it is runtime constants.
You can look at your execution plan and look for the expression below:
Const ConstValue
which, as you can see, is calculated once and used throughout; in the end you are doing just a boolean comparison, so you will end up with all rows or none.
You have to use a CTE like the one stated in another answer, or use TOP with ORDER BY NEWID(), or TABLESAMPLE, to return random rows.
You may find the TABLESAMPLE option more helpful, since it may not go through all the table data to get only a sample set of rows, unlike NEWID().
Below is one example on a table having 1,000,000 rows:
select * from Orders
TABLESAMPLE (50 PERCENT)
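For reference, the CTE variant mentioned above would look roughly like this (a sketch; the Orders table from the example is assumed):
;WITH randomized AS
(
    SELECT *, ABS(CHECKSUM(NEWID())) % 10 AS ran
    FROM Orders
)
SELECT *
FROM randomized
WHERE ran > 4;
Because the random value is produced in the SELECT list of the CTE, it is evaluated once per row, so roughly half the rows are returned.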

SQL Large Table select random row strategy

I would like to select a random row from a very large table (10 million records), so the most common strategies such as RAND() and NEWID() don't seem practical.
I have tried the following strategy and would like to know if this is the most ideal way.
Create a new field called 'RandomSort' of type UniqueIdentifier
At the end of each hour/day, run an Update RandomSort = NewID() against the entire table
Each time I need to query, do a Top 10 Order By RandomSort
It does get the job done (better than ORDER BY NewID()), but I'm not sure if this is best practice.
Add an identity column 'rowid' (int or bigint depending on your table size) and create a unique non-clustered index on it.
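A sketch of that setup (the index name is an assumption):
ALTER TABLE MyTable ADD rowid BIGINT IDENTITY(1,1);
CREATE UNIQUE NONCLUSTERED INDEX IX_MyTable_rowid ON MyTable (rowid);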
The following query uses the NEWID() function to return approximately one percent of the rows of the table:
SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The rowid column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
In fact you could use any indexed column in your table (I believe).
If you just want to pick a single random row:
SELECT TOP 1 * FROM table
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table)
This works in constant time, provided the rowid column is indexed. Note: this assumes that rowid is uniformly distributed in the range 0..MAX(rowid), hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).

SQL Server Insert Random

I have this table:
Create Table Person
(
Consecutive Int Identity(1,1),
Identification Varchar(15) Primary Key
)
The Identification column can contain letters and numbers, and it is optional, i.e. the customer can enter it or not; if not, a number is created automatically. How can I insert a random number that does not already exist, preferably a low number?
An example could be:
Select Random From Person Where Random Not Exists In Identification
This is my code:
Select Min(Convert(Integer,Identification)) - 1
From Person
Where IsNumeric(Identification) = 1
Or
Select Max(Convert(Integer,Identification)) + 1
From Person
Where IsNumeric(Identification) = 1
This works well, but if the customer enters a high number, for example 1000 or higher, then the numbering will continue from there and could eventually hit an overflow error.
And if there is no numeric Identification greater than 0, the MIN approach will simply produce -1, -2, -3, and so on.
Thanks in advance..
I agree with what M.Ali said, but you can make use of the code below; still, I don't recommend going beyond what M.Ali said.
The loop will continue until a random number is generated which is not already in your table. You can change the precision to 5 digits by changing 1000 to 10000, and so on.
DECLARE @I INT = 0
DECLARE @RANDOM INT;
WHILE (@I = 0)
BEGIN
    SELECT @RANDOM = 1000 + (CONVERT(INT, CRYPT_GEN_RANDOM(3)) % 1000);
    IF NOT EXISTS (SELECT Identification FROM YOURTABLE WHERE Identification = CAST(@RANDOM AS VARCHAR(4)))
    BEGIN
        -- Do your stuff here
        BREAK;
    END
    ELSE
    BEGIN
        -- The ELSE part
    END
END
Maintaining a random number in a VARCHAR(15), which depends on the end user's input, can be a very expensive approach when you also want it to be unique.
Imagine a scenario where you have a decent number of rows, say 10,000, in this table and a user comes in trying to insert a random number; chances are the user may have to try 5, 10 or even 15 times to get a unique random value.
On each failed attempt a call is made to the server and a search is done on the table (the more rows, the more expensive this query becomes), and the more failed attempts a user makes, the poorer the application experience will be.
Would you ever go back to an application (web/windows) where just to register you had to struggle this many times? Obviously not.
The moral of the story is: if you are asking a user to enter some random value, do not expect users to maintain your database integrity and keep that column unique. Take control and pair that value with another column that will definitely be random; in your case it can be the Identity column. Or, alternatively, you can generate the value for the user yourself, using a GUID.
select count(*) +1 from Person
This generates a logical ID for Identification that sets the ID to what it 'would have been' with a simple incrementor.
However, you then cannot delete records; instead you must deactivate them, or clear the row.
Alternately, have a separate (hidden) column that only auto-increments, and if Identification is left empty, use the value from the hidden column. Same result, but less risk if deletion is relevant.
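A hedged sketch of the count-based fallback when the customer leaves the value blank (@CustomerValue is a hypothetical parameter):
DECLARE @CustomerValue VARCHAR(15) = NULL;
INSERT INTO Person (Identification)
SELECT COALESCE(@CustomerValue, CAST(COUNT(*) + 1 AS VARCHAR(15)))
FROM Person;
As noted above, this only stays unique if rows are never deleted.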

Improving OFFSET performance in PostgreSQL

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.
Adding an index on the ORDER BY column makes a massive difference to performance (when used in combination with a small LIMIT). On a 500,000 row table, I saw a 10,000x improvement adding the index, as long as there was a small LIMIT.
However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.
It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."
You might want a computed index.
Let's create a table:
create table sales(day date, amount real);
And fill it with some random stuff:
insert into sales
select current_date + s.a as day, random()*100 as amount
from generate_series(1,20) as s(a);
Index it by day, nothing special here:
create index sales_by_day on sales(day);
Create a row position function. There are other approaches, this one is the simplest:
create or replace function sales_pos (date) returns bigint
as 'select count(day) from sales where day <= $1;'
language sql immutable;
Check if it works (don't call it like this on large datasets though):
select sales_pos(day), day, amount from sales;
sales_pos | day | amount
-----------+------------+----------
1 | 2011-07-08 | 41.6135
2 | 2011-07-09 | 19.0663
3 | 2011-07-10 | 12.3715
..................
Now the tricky part: add another index computed on the sales_pos function values:
create index sales_by_pos on sales using btree(sales_pos(day));
Here is how you use it. 5 is your "offset", 10 is the "limit":
select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
day | amount
------------+---------
2011-07-12 | 94.3042
2011-07-13 | 12.9532
2011-07-14 | 74.7261
...............
It is fast, because when you call it like this, Postgres uses precalculated values from the index:
explain select * from sales
where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using sales_by_pos on sales (cost=0.50..8.77 rows=1 width=8)
Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))
Hope it helps.
I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.
SELECT *
FROM massive_table
WHERE id IN (
SELECT id
FROM massive_table
WHERE ...
LIMIT 50
OFFSET 500000
);
The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.
Instead of using an OFFSET, a very efficient trick is to use a temporary table:
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY myID), myID
FROM mytable;
For 10 000 000 rows it needs about 10s to be created.
Then, when you want to SELECT or UPDATE your table, you simply:
SELECT * FROM mytable
INNER JOIN (SELECT just_index.myId FROM just_index WHERE row_number >= *your offset* LIMIT 1000000) indexes
ON mytable.myID = indexes.myID
Filtering mytable with only just_index is more efficient (in my case) with an INNER JOIN than with a WHERE myID IN (SELECT ...).
This way you don't have to store the last myID value; you simply replace the offset with a WHERE clause that uses indexes.
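If the temporary table is queried repeatedly, it may also be worth indexing the row number column (a hedged addition, not part of the original trick; the index name is an assumption):
CREATE INDEX just_index_rn ON just_index (row_number);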
It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.
But anyway . . .
Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order a single column, and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.
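A sketch of that keyset-style approach (created_at is a hypothetical ordering column and 50 the page size; :last_value_seen is a bind-parameter placeholder):
SELECT *
FROM massive_table
WHERE created_at > :last_value_seen  -- last value remembered from the previous page
ORDER BY created_at
LIMIT 50;
Because the WHERE clause can use the index on the ordering column, the cost no longer grows with the page number.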
Recently I worked on a problem like this and wrote a blog post about how to tackle it; I hope it is helpful to someone.
I use a lazy-list approach with partial fetching, replacing the LIMIT and OFFSET pagination of the query with manual pagination.
In my example, the select returns 10 million records; I get them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial fetching with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release memory:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette))
    {
        lastIndexRoulette = index;
        bufferParcial.removeAll(bufferParcial);
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty())
        {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses the query to paginate and implements a method to iterate progressively, 25,000 records at a time, until all of them are processed.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except that it doesn't necessarily need to fully sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.
SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)
In general it works like this:
All rows from mytable are scanned
NEWID() is executed for each row
The rows are sorted according to the random number from NEWID()
The first 100 rows are selected
as MSDN says:
NewID() Creates a unique value of type
uniqueidentifier.
and your table will be sorted by these random values.
Use select top 100 randid = newid(), * from mytable order by randid and it will become clear.
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
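A hedged sketch for N rows that simply repeats the single-row lookup (duplicates are possible and would need handling; the table and column names follow the example above):
DECLARE @results TABLE (Id INT);
DECLARE @i INT = 0;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
WHILE @i < 5  /* N = 5 random rows */
BEGIN
    INSERT INTO @results (Id)
    SELECT TOP 1 u.Id
    FROM dbo.Users AS u
    WHERE u.Id >= ABS(CHECKSUM(NEWID())) % @maxid;
    SET @i += 1;
END
SELECT u.*
FROM dbo.Users AS u
JOIN @results AS r ON r.Id = u.Id;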
