"order by newid()" - how does it work? - sql-server

I know that if I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?

I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except that it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The Compute Scalar operator adds the NEWID() column for each row (2,506 rows in the table in my example query), then the rows are sorted by this column and the top 100 are selected.
SQL Server doesn't actually need to fully sort the rows below position 100, so it uses a Top N Sort operator, which attempts to perform the entire sort operation in memory (for small values of N).
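If you want to see these operators without opening the graphical plan, a quick sketch using SHOWPLAN_TEXT (this displays the estimated plan as text and does not execute the query; it must be the only statement in its batch, hence the GO separators):
SET SHOWPLAN_TEXT ON;
GO
SELECT TOP 100 * FROM master..spt_values ORDER BY NEWID();
GO
SET SHOWPLAN_TEXT OFF;
GO
In the output you should see the Compute Scalar and the Sort (TOP N) operators described above.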

In general it works like this:
All rows from mytable are "looped" over
NEWID() is executed for each row
The rows are sorted according to the random value from NEWID()
The first 100 rows are selected

As MSDN says:
NewID() creates a unique value of type uniqueidentifier.
Your table will be sorted by these random values.

Use
select top 100 randid = newid(), * from mytable order by randid
and you will see the random value that each row is sorted by.

I have an unimportant query which uses NEWID() and joins many tables. It returns about 10k rows in about 3 seconds. So NEWID() may be acceptable in cases like this, where performance is not too bad and the impact is small. But NEWID() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
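For comparison, here is a minimal sketch of method 4 (OFFSET-FETCH, SQL Server 2012+), reusing the same hypothetical dbo.Users table; it assumes a clustered index on Id:
/* Skip a random number of rows and fetch one */
DECLARE @rows BIGINT = (SELECT COUNT(*) FROM dbo.Users);
DECLARE @skip BIGINT = ABS(CHECKSUM(NEWID())) % @rows;
SELECT *
FROM dbo.Users
ORDER BY Id
OFFSET @skip ROWS FETCH NEXT 1 ROWS ONLY;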

Related

SQL Server: random number in WHERE clause

As far as I am aware, the only way to get a random value in a SELECT statement is by using the newid() function, as the RAND() function doesn't generate new values for each row.
This leads to the following awkward construction to get a random number from, say, 0-9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them. Apparently newid() is only evaluated once, instead of once for each row.
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for a fixed number of rows at random. In the above example I could have used:
select top 50 percent * from table order by newid();
which will get me what I am looking for.
The question remains, how can I use a random number in the WHERE clause. For example, is it possible to do something like this?
select *
from table
where code={random number};
Here is one way to get around the problem:
SELECT *
FROM (SELECT *,
             Abs(Checksum(Newid())) % 10 AS ran
      FROM yourtable) a
WHERE ran > 4;
For some reason, when newid() appears in the WHERE clause it is executed only once, and every row is compared against that constant.
When I check the execution plans, your query is missing the Compute Scalar operator, whereas my query has it present.
The function newid() is calculated only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them
You may get all of the rows or none of them, since NEWID() is executed once per query when you use it in the WHERE clause. This is explained here by Conor Cunningham; the technical term for this is "runtime constants".
You can look at your execution plan and look for the expression below:
Const ConstValue
which, as you can see, is calculated once and used throughout; in the end you are doing just a boolean comparison, so you will end up with all rows or none.
You have to use a CTE like the one stated in another answer, or use TOP with ORDER BY NEWID(), or TABLESAMPLE, to return random rows.
You may find the TABLESAMPLE option more helpful, since it may not need to go through all the table data to get a sample set of rows, unlike NEWID().
Below is one example on a table having 1,000,000 rows:
select * from Orders
TABLESAMPLE (50 PERCENT)

SQL Large Table select random row strategy

I would like to select a random row from a very large table (10 million records), so the most common strategies, such as RAND() and NEWID(), don't seem practical.
I have tried the following strategy and would like to know if this is the most ideal way.
Create a new field called 'RandomSort' of type uniqueidentifier
At the end of each hour/day, do an Update ... Set RandomSort = NewID() on the entire table
Each time I need to query, I can do a Top 10 Order By RandomSort
It does get the job done (better than ORDER BY NewID), but I'm not sure if it's the best practice so far?
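For reference, a minimal sketch of the strategy I'm describing; the table name and index are placeholders:
-- One-time setup: add the shuffle column (an index on it helps the TOP query)
ALTER TABLE dbo.MyBigTable ADD RandomSort UNIQUEIDENTIFIER;
CREATE NONCLUSTERED INDEX IX_MyBigTable_RandomSort ON dbo.MyBigTable (RandomSort);
-- Scheduled job (hourly/daily): reshuffle the sort keys
UPDATE dbo.MyBigTable SET RandomSort = NEWID();
-- Query time: cheap "random" top 10 until the next reshuffle
SELECT TOP 10 * FROM dbo.MyBigTable ORDER BY RandomSort;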
Add an identity column 'rowid' (int or bigint depending on your table size) and create a unique non-clustered index on it.
The following query uses the NEWID() function to return approximately one percent of the rows of the table:
SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The rowId column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
In fact, you could use any indexed column in your table (I believe).
If you just want to pick a single random row:
SELECT TOP 1 * FROM table
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table)
This works in constant time, provided the rowid column is indexed. Note: this assumes that rowid is uniformly distributed in the range 0..MAX(rowid), hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).

selecting random rows based on weight on another row

I need to select random rows from a table based on a weight in another column: if the user enters 50, I need to select 50 random rows from the table, with the higher-weighted rows being returned more often. I saw the NEWID() approach for selecting n random rows, and this link
Random Weighted Choice in T-SQL
where we can select one row based on the weight in another column, but I need to select several rows based on the user's input number. Would the best way be to use the suggested answer in the above link and loop over it n times (though I think it would return the same row), or is there another easy solution?
My table is like this:
ID Name Freq
1 aaa 50
2 bbb 30
3 ccc 10
So when the user enters 50, I need to return 50 random names, with more aaa and bbb than ccc; it might be 25 aaa, 15 bbb and 10 ccc. Anything close to this will work too. I saw this answer, but when I executed it against my DB it ran for 5 minutes with no results:
SQL : select one row randomly, but taking into account a weight
I think the difficult part here is getting any individual row to potentially appear more than once. I'd look into doing something like the following:
1) Build a temp table, duplicating records according to their frequency (I'm sure there's a better way of doing this, but the first answer that came to my mind was a simple while loop... This particular one really only works if the frequency values are integers; a set-based alternative is sketched after step 3.)
create table #dup
(
    id int,
    nm varchar(10)
)
declare @curr int, @maxFreq int
select @curr = 0, @maxFreq = max(freq)
from tbl
while @curr < @maxFreq
begin
    insert into #dup
    select id, nm
    from tbl
    where freq > @curr
    set @curr = @curr + 1
end
2) Select your top records, ordered by a random value
select top 10 *
from #dup
order by newID()
3) Cleanup
drop table #dup
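If the WHILE loop in step 1 is too slow, a possible set-based alternative is the following sketch, which assumes integer frequencies of at most 2048 and uses master..spt_values as an ad-hoc numbers table:
-- One copy of each row per unit of freq: numbers 0..freq-1 join once each
insert into #dup (id, nm)
select t.id, t.nm
from tbl t
join master..spt_values n
  on n.type = 'P' and n.number < t.freq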
Maybe you could try something like the following in your SQL:
ORDER BY Freq * rand()
Rows with a higher Freq value should in theory get returned more often than those with a lower Freq value. It seems a bit hackish but it might work! (One caveat, noted elsewhere on this page: an unseeded RAND() returns a single value per query, so this can degenerate into ordering by Freq alone; seeding it per row, e.g. ORDER BY Freq * RAND(CHECKSUM(NEWID())), forces per-row evaluation.)

Preserving ORDER BY in SELECT INTO

I have a T-SQL query that takes data from one table and copies it into a new table but only rows meeting a certain condition:
SELECT VibeFGEvents.*
INTO VibeFGEventsAfterStudyStart
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id
The code using the table relies on its order, and the copy above does not preserve the order I expected. I.e. the rows in the new table VibeFGEventsAfterStudyStart are not monotonically increasing in the VibeFGEventsAfterStudyStart.id column copied from VibeFGEvents.id.
In T-SQL how might I preserve the ordering of the rows from VibeFGEvents in VibeFGEventsAfterStudyStart?
I know this is a bit old, but I needed to do something similar. I wanted to insert the contents of one table into another, but in a random order. I found that I could do this by using select top n and order by newid(). Without the 'top n', order was not preserved and the second table had rows in the same order as the first. However, with 'top n', the order (random in my case) was preserved. I used a value of 'n' that was greater than the number of rows. So my query was along the lines of:
insert Table2 (T2Col1, T2Col2)
select top 10000 T1Col1, T1Col2
from Table1
order by newid()
What for?
The point is: data in a table is not ordered. In SQL Server the intrinsic storage order of a table is that of the (if defined) clustered index.
The order in which data is inserted is basically irrelevant. It is forgotten the moment the data is written into the table.
As such, nothing is gained even if you achieve this. If you need an order when dealing with data, you HAVE to put an ORDER BY clause on the select that reads it. Anything else is random: the order you get data in is not determined and may change.
So it makes no sense to impose a specific order on the insert as you try to do.
SQL 101: sets have no order.
Just add top to your sql with a number that is greater than the actual number of rows:
SELECT top 25000 *
into spx_copy
from SPX
order by date
I've found a specific scenario where we want the new table to be created with a specific order in the columns' content:
The amount of rows is very big (from 200 to 2000 million rows), so we are using SELECT INTO instead of CREATE TABLE + INSERT because it needs to be loaded as fast as possible (minimal logging). We have tested using trace flag 610 for loading an already created empty table with a clustered index, but it still takes longer than the following approach.
We need the data to be ordered by specific columns for query performances, so we are creating a CLUSTERED INDEX just after the table is loaded. We discarded creating a non-clustered index because it would need another read for the data that's not included in the ordered columns from the index, and we discarded creating a full-covering non-clustered index because it would practically double the amount of space needed to hold the table.
It happens that if you manage to somehow create the table with columns already "ordered", creating the clustered index (with the same order) takes a lot less time than when the data isn't ordered. And sometimes (you will have to test your case), ordering the rows in the SELECT INTO is faster than loading without order and creating the clustered index later.
The problem is that SQL Server 2012+ will ignore the ORDER BY column list when doing INSERT INTO or when doing SELECT INTO. It will consider the ORDER BY columns if you specify an IDENTITY column on the SELECT INTO or if the inserted table has an IDENTITY column, but just to determine the identity values and not the actual storage order in the underlying table. In this case, it's likely that the sort will happen but not guaranteed as it's highly dependent on the execution plan.
A trick we have found is that doing a SELECT INTO with the result of a UNION ALL makes the engine perform a SORT (not always an explicit SORT operator, sometimes a MERGE JOIN CONCATENATION, etc.) if you have an ORDER BY list. This way the select into already creates the new table in the order we are going to create the clustered index later and thus the index takes less time to create.
So you can rewrite this query:
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
ORDER BY -- ORDER BY is ignored!
FirstColumn,
SecondColumn
to
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
UNION ALL
-- A "fake" row to be deleted
SELECT
FirstColumn = 0,
SecondColumn = 0
ORDER BY
FirstColumn,
SecondColumn
We have used this trick a few times, but I can't guarantee it will always sort. I'm just posting this as a possible workaround in case someone has a similar scenario.
You cannot do this with ORDER BY, but if you create a clustered index on VibeFGEvents.id after your SELECT INTO, the table will be sorted on disk by VibeFGEvents.id.
I've made a test on MS SQL 2012, and it clearly shows that insert into ... select ... order by makes sense. Here is what I did:
create table tmp1 (id int not null identity, name sysname);
create table tmp2 (id int not null identity, name sysname);
insert into tmp1 (name) values ('Apple');
insert into tmp1 (name) values ('Carrot');
insert into tmp1 (name) values ('Pineapple');
insert into tmp1 (name) values ('Orange');
insert into tmp1 (name) values ('Kiwi');
insert into tmp1 (name) values ('Ananas');
insert into tmp1 (name) values ('Banana');
insert into tmp1 (name) values ('Blackberry');
select * from tmp1 order by id;
And I got this list:
1 Apple
2 Carrot
3 Pineapple
4 Orange
5 Kiwi
6 Ananas
7 Banana
8 Blackberry
No surprises here. Then I made a copy from tmp1 to tmp2 this way:
insert into tmp2 (name)
select name
from tmp1
order by id;
select * from tmp2 order by id;
I got exactly the same order as before, Apple to Blackberry.
Now reverse the order to test it:
delete from tmp2;
insert into tmp2 (name)
select name
from tmp1
order by id desc;
select * from tmp2 order by id;
9 Blackberry
10 Banana
11 Ananas
12 Kiwi
13 Orange
14 Pineapple
15 Carrot
16 Apple
So the order in tmp2 is reversed too; order by makes sense when there is an identity column in the target table!
The reason one would desire this (a specific order) is that you cannot define the order in a subquery. The idea is that if you create a table variable and THEN run a query against that table variable, you would think you would retain the order (say, to concatenate rows that must be in order, for XML or JSON), but you can't.
So, what do you do?
The answer is to force SQL Server to order it by using TOP in your select (just pick a number high enough to cover all your rows), as sketched below.
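A minimal sketch of that trick, with illustrative names (strictly speaking, the outer query's order is still only guaranteed by an outer ORDER BY, but the TOP makes the inner ORDER BY legal and forces the sort):
DECLARE @t TABLE (id INT, val VARCHAR(10));
INSERT INTO @t VALUES (2, 'b'), (1, 'a'), (3, 'c');
SELECT *
FROM (
    SELECT TOP 2147483647 *  -- a number higher than any possible row count
    FROM @t
    ORDER BY id
) AS ordered;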
I have run into the same issue. One reason I have needed to preserve the order is when I try to use ROLLUP to get a weighted average based on the raw data rather than an average of what is in that column. For instance, say I want to see the average profit based on the number of units sold by four store locations. I can do this very easily by creating the equation Profit / #Units = Avg. Now I include a ROLLUP in my GROUP BY so that I can also see the average across all locations. Now I think to myself, "This is good info, but I want to see it in order of best average to worst and keep the overall at the bottom (or top) of the list." The ROLLUP will fail you in this, so you take a different approach.
Why not create row numbers based on the sequence (order) you need to preserve?
SELECT OrderBy = ROW_NUMBER() OVER(PARTITION BY 'field you want to count' ORDER BY 'field(s) you want to use ORDER BY')
, VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
Now you can use the OrderBy field from your table to set the order of values. I removed the ORDER BY statement from the query above since it does not affect how the data is loaded to the table.
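Hypothetical usage once the new table is loaded:
SELECT *
FROM VibeFGEventsAfterStudyStart
ORDER BY OrderBy;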
I found this approach helpful to solve this problem:
WITH ordered as
(
SELECT TOP 1000
[Month]
FROM SourceTable
GROUP BY [Month]
ORDER BY [Month]
)
INSERT INTO DestinationTable (MonthStart)
(
SELECT * from ordered
)
Try using INSERT INTO instead of SELECT INTO
INSERT INTO VibeFGEventsAfterStudyStart
SELECT VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id

Random record from a database table (T-SQL)

Is there a succinct way to retrieve a random record from a sql server table?
I would like to randomize my unit test data, so am looking for a simple way to select a random id from a table. In English, the select would be "Select one id from the table where the id is a random number between the lowest id in the table and the highest id in the table."
I can't figure out a way to do it without having to run the query, test for a null value, then re-run if null.
Ideas?
Is there a succinct way to retrieve a random record from a sql server table?
Yes
SELECT TOP 1 * FROM table ORDER BY NEWID()
Explanation
A NEWID() is generated for each row and the table is then sorted by it. The first record is returned (i.e. the record with the "lowest" GUID).
Notes
GUIDs are generated as pseudo-random numbers since version four:
The version 4 UUID is meant for generating UUIDs from truly-random or pseudo-random numbers.
The algorithm is as follows:
Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.
Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.
Set all the other bits to randomly (or pseudo-randomly) chosen values.
—A Universally Unique IDentifier (UUID) URN Namespace - RFC 4122
The alternative SELECT TOP 1 * FROM table ORDER BY RAND() will not work as one would think. RAND() returns one single value per query, thus all rows will share the same value.
While GUID values are pseudo-random, you will need a better PRNG for the more demanding applications.
Typical performance is less than 10 seconds for around 1,000,000 rows, depending of course on the system. Note that it's impossible to hit an index, so performance will be relatively limited.
On larger tables you can also use TABLESAMPLE for this to avoid scanning the whole table.
SELECT TOP 1 *
FROM YourTable
TABLESAMPLE (1000 ROWS)
ORDER BY NEWID()
The ORDER BY NEWID is still required to avoid just returning rows that appear first on the data page.
The number of rows to sample needs to be chosen carefully for the size and definition of the table, and you might consider retry logic in case no row is returned. The maths behind this, and why the technique is not suited to small tables, is discussed here
Also try your method of getting a random Id between MIN(Id) and MAX(Id), and then:
SELECT TOP 1 * FROM table WHERE Id >= @yourrandomid
It will always get you one row.
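A fuller sketch of that approach, with assumed table and column names:
DECLARE @randomid INT;
SELECT @randomid = minid + ABS(CHECKSUM(NEWID())) % (maxid - minid + 1)
FROM (SELECT MIN(Id) AS minid, MAX(Id) AS maxid FROM dbo.MyTable) AS r;
SELECT TOP 1 * FROM dbo.MyTable WHERE Id >= @randomid;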
If you want to select large data the best way that I know of is:
SELECT * FROM Table1
WHERE (ABS(CAST(BINARY_CHECKSUM(keycol1, NEWID()) AS int)) % 100) < 10
Source: MSDN
I was looking to improve on the methods I had tried and came across this post. I realize it's old, but this method is not listed. I am creating and applying test data; this shows the method for "address" in a stored procedure called with @st (two-char state).
Declare @csr Int  -- assumed declared in the original procedure
Create Table ##TmpAddress (id Int Identity(1,1), street VarChar(50), city VarChar(50), st VarChar(2), zip VarChar(5))
Insert Into ##TmpAddress (street, city, st, zip)
Select street, city, st, zip
From tbl_Address (NOLOCK)
Where st = @st
-- unseeded RAND() will return the same number when called in rapid succession, so here
-- I seed it with a guaranteed-different number each time. @@ROWCOUNT is the row count
-- from the most recent table operation (the Insert above).
Set @csr = Ceiling(RAND(convert(varbinary, newid())) * @@ROWCOUNT)
Select street, city, st, Right(('00000' + ltrim(zip)), 5) As zip
From ##TmpAddress (NOLOCK)
Where id = @csr
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
Source: http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
This is further explained below:
How does this work? Let's split out the WHERE clause and explain it.
The CHECKSUM function is calculating a checksum over the items in the list. It is arguable whether SalesOrderID is even required, since NEWID() is a function that returns a new random GUID, so multiplying a random figure by a constant should result in a random in any case. Indeed, excluding SalesOrderID seems to make no difference. If you are a keen statistician and can justify the inclusion of this, please use the comments section below and let me know why I'm wrong!
The CHECKSUM function returns a VARBINARY. Performing a bitwise AND operation with 0x7fffffff, which is the equivalent of (111111111...) in binary, yields a decimal value that is effectively a representation of a random string of 0s and 1s. Dividing by the coefficient 0x7fffffff effectively normalizes this decimal figure to a figure between 0 and 1. Then to decide whether each row merits inclusion in the final result set, a threshold of 1/x is used (in this case, 0.01) where x is the percentage of the data to retrieve as a sample.
Source: https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling
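As a quick sanity check of the normalization described above, the computed value should always land between 0 and 1; this sketch uses master..spt_values as a stand-in table:
SELECT TOP 10
    CAST(CHECKSUM(NEWID(), number) & 0x7fffffff AS float)
        / CAST(0x7fffffff AS int) AS normalized
FROM master..spt_values;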
