SQL Server: random number in WHERE clause - sql-server

As far as I am aware, the only way to get a random value in a SELECT statement is by using the newid() function, as the random() function doesn’t generate new values for each row.
This leads to the following awkward construction to get a random number from, say 0 - 9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have though that I would get roughly half the rows. Instead I get I get all or none of them. Apparently newid() is only evaluated once, instead of for each row.
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for fixed number of rows at random. In the above example I could have used:
select top 50 percent from table order by newid();
which will get me what I am looking for.
The question remains, how can I use a random number in the WHERE clause. For example, is it possible to do something like this?
select *
from table
where code={random number};

Here is one way to get around the problem
SELECT *
FROM (SELECT *,
Abs(Checksum(Newid())) % 10 AS ran
FROM yourtable) a
WHERE ran > 4;
for some reason newid() in where clause it is executed only once and it is checked with the constant.
When I check the execution plan your query is missing compute scalar where as my query has compute scalar present in execution plan.

The function newid() is calculate only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;

I should have though that I would get roughly half the rows. Instead I get I get all or none of them
You may get all of the rows or none of them ,since NEWID() is executed once per query when you use it in where clause..This is explained here by Conor Cunnigham and the technical term for this is called RumTimeConstants
You can look at your execution plan and look out for below expression
Const ConstValue
which you can see is calculated once and used throughout and finally you are doing just a boolean comparison,so you will end up with all rows or none
you have to use CTE Like the one stated in another answer or use Top with order by newid() or tablesample to return random rows
you may find Tablesample option more helpfull,since this may not go though all the table data to get only sample set of rows,unlike Newid()
below is one example on a table having 1000000 rows
select * from Orders
TABLESAMPLE (50 PERCENT)
plan

Related

Access SubQuery: SHOW TOP (count form select query) Table

Is it possible to use a Count() or number from another Select query to SELECT TOP a number of rows in a different query?
Below is a sample of the update query I'm trying to use but would like to take the count from another query to replace "10".
...
WHERE Frames.Package IN (
SELECT TOP 10 Frames
FROM Frames.Package WHERE Package = "100"
ORDER BY Frames.ReferenceNumber
)
So for example, i've tried to do
SELECT TOP SelectQuery.RecordCount Frames
Sample SelectQuery.RecordCount
SELECT COUNT(Frames.Package) AS RecordCount
FROM Frames
HAVING Frames.Package = "100";
Any assistance would be appreciated...
Access does not support using a parameter for SELECT TOP. You must write a literal value into the text of the SQL statement.
From another answer: Select TOP N not working in MS Access with parameter
On that note, your two queries appear to be just interchanging HAVING and WHERE clauses to get the record count. It doesn't seem to be doing anything more, thus why bother with the TOP clause and simply SELECT * FROM Frames WHERE [..]?
Am I missing something?

PostgreSQL Inserted rows differ from select

I have a problem with an INSERT in PostgreSQL. I have this query:
INSERT INTO track_segments(tid, gdid1, gdid2, distance, speed)
SELECT * FROM (
SELECT DISTINCT ON (pga.gdid)
pga.tid as ntid,
pga.gdid as gdid1, pgb.gdid as gdid2,
ST_Distance(pga.geopoint, pgb.geopoint) AS segdist,
(ST_Distance(pga.geopoint, pgb.geopoint) / EXTRACT(EPOCH FROM (pgb.timestamp - pga.timestamp + interval '0.1 second'))) as speed
FROM fl_pure_geodata AS pga
LEFT OUTER JOIN fl_pure_geodata AS pgb ON (pga.timestamp < pgb.timestamp AND pga.tid = pgb.tid)
ORDER BY pga.gdid ASC) AS sq
WHERE sq.gdid2 IS NOT NULL;
to fill a table with pairwise connected segements of geopoints. When I run the SELECT alone I get the correct pairs, but when I use it in the statement above, then some are paired the wrong way or not at all. Here's what I mean:
result of SELECT alone:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;11;34.105058803;31.0045989118182
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.099603143;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.331326565;21.2102968772727
result after INSERT with the same SELECT:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;12;122.574;17.2639603638028
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.0996;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.3313;21.2102968772727
What be the cause of that? It's exactly the same SELECT statement for the INSERT, so why does it give different results?
DISTINCT ON (pga.gdid) can pick any row from a set with equal pga.gdid. You can get different result even by execution the same query for several times. Add additional ordering to get consistent results. something like: pga.gdid ASC, pgb.gdid ASC
BTW You may want to order by pga.gdid ASC, pgb.timestamp - pga.timestamp ASC to get the "next" point.
BTW2 It may be easier to use lead() or lag() window functions to calculate differences between current row and next/previous. This way you wont need a self join and will likely get better performance.
You are ordering your query results only by the column pga.gdid, which is the same in all the rows, so postgres will order the results in a different way each time you do the select query.

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just
trying to understand how it would help
in the random selection. Is it that
(1) the select statement will select
EVERYTHING from mytable, (2) for each
row selected, tack on a
uniqueidentifier generated by NewID(),
(3) sort the rows by this
uniqueidentifier and (4) pick off the
top 100 from the sorted list?
Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.
SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)
In general it works like this:
All rows from mytable is "looped"
NEWID() is executed for each row
The rows are sorted according to random number from NEWID()
100 first row are selected
as MSDN says:
NewID() Creates a unique value of type
uniqueidentifier.
and your table will be sorted by this random values.
use select top 100 randid = newid(), * from mytable order by randid
you will be clarified then..
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really
random!
Method 3, Best but Requires Code: Random Primary Key >
Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered
index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE #rand BIGINT;
DECLARE #maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT #rand = ABS((CHECKSUM(NEWID()))) % #maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= #rand;

SQL Server Reference a Calculated Column

I have a select statement with calculated columns and I would like to use the value of one calculated column in another. Is this possible? Here is a contrived example to show what I am trying to do.
SELECT [calcval1] = CASE Statement, [calcval2] = [calcval1] * .25
No.
All the results of a single row from a select are atomic. That is, you can view them all as if they occur in parallel and cannot depend on each other.
If you're referring to computed columns, then you need to update the formula's input for the result to change during a select.
Think of computed columns as macros or mini-views which inject a little calculation whenever you call them.
For example, these columns will be identical, always:
-- assume that 'Calc' is a computed column equal to Salaray*.25
SELECT Calc, Salary*.25 Calc2 FROM YourTable
Also keep in mind that the persisted option doesn't change any of this. It keeps the value around which is nice for indexing, but the atomicity doesn't change.
Unfortunately not really, but a workaround that is sometimes worth it is
SELECT [calcval1], [calcval1] * .25 AS [calcval2]
FROM (SELECT [calcval1] = CASE Statement FROM whatever WHERE whatever)
Yes it's possible.
Use the WITH Statement for nested selects:
Two ways I can think of to do that. First understand that the calval1 column does not exist as far as SQL Server is concerned until the statement has run, therefore it cannot be directly used as showning your example. So you can put the calculation in there twice, once for calval1 and once as substitution for calcval1 in the calval2 calculation.
The other way is to make a derived table with calval1 in it and then calculate calval2 outside the derived table something like:
select calcval1*.25 as calval2, calval1, field1, field2
from (select casestament as cavlval1, field1, field2 from my table) a
You'll need to test both for performance.
You should use an outer apply instead of a subselect:
select V.calc,V.calc*0.25 from FOO outer apply (select case Statement as calc) V
You can't "reset" the value of a calculated column in a Select clause, if that's what you're trying to do... The value of a calculated column is based on the calculated column formulae. Which CAN include the value of another calculated column.... but you canlt reset the formulae in a Select clause... if all you want to do is "output" the value based on two calculated columns, (as the syntax in your question reads" Then the
"[calcval2]"
in
SELECT [calcval1] = CASE Statement, [calcval2] = [calcval1] * .25
would just become a column alias in the output of the Select Clause.
or are you asking how to define the formulae for one calculated column to be based on another?

Use of SET ROWCOUNT in SQL Server - Limiting result set

I have a sql statement that consists of multiple SELECT statements. I want to limit the total number of rows coming back to let's say 1000 rows. I thought that using the SET ROWCOUNT 1000 directive would do this...but it does not. For example:
SET ROWCOUNT 1000
select orderId from TableA
select name from TableB
My initial thought was that SET ROWCOUNT would apply to the entire batch, not the individual statements within it. The behavior I'm seeing is it will limit the first select to 1000 and then the second one to 1000 for a total of 2000 rows returned. Is there any way to have the 1000 limit applied to the batch as a whole?
Not in one statement. You're going to have to subtract ##ROWCOUNT from the total rows you want after each statement, and use a variable (say, "#RowsLeft") to store the remaining rows you want. You can then SELECT TOP #RowsLeft from each individual query...
And how would you ever see any records from the second query if the first always returns more than 1000 if you were able to do this in a batch?
If the queries are simliar enough you could try to do this through a union and use the rowcount on that as it would only be one query at that point. If the queries are differnt in the columns returned I'm not sure what you would get by limiting the entire group to 1000 rows because the meanings would be different. From a user perspective I'd rather consistently get 500 orders and 500 customer names than 998 orers and 2 names one day and 210 orders and 790 names the next. It would be impossible to use the application especially if you happened to be most interested in the information in the second query.
Use TOP not ROWCOUNT
http://msdn.microsoft.com/en-us/library/ms189463.aspx
You trying to get 1000 rows MAX from all tables right?
I think other methods may fill up with from the top queries first, and you may never get results from the lower ones.
The requirement sounds odd. Unless you are unioning or joining the data from the two selects, to consider them as one so that you apply a max rows simply does not make sense, since they are unrelated queries at that point. If you really need to do this, try:
select top 1000 from (
select orderId, null as name, 'TableA' as Source from TableA
union all
select null as orderID, name, 'TableB' as Source from TableB
) a order by Source
SET ROWCOUNT applies to each individual query. In your given example, it's applied twice, once to each SELECT statement, since each statement is its own batch (they're not grouped or unioned or anything, and so execute completely separately).
#RedFilter's approach seems the most likely to give you what you want.
Untested and doesn't make use of ROWCOUNT, but could give you an idea?
Assumes col1 in TableA and TableB are the same type.
SELECT TOP 1000 *
FROM (select orderId
from TableA
UNION ALL
select name from TableB) t
The following worked for me:
CREATE PROCEDURE selectTopN
(
#numberOfRecords int
)
AS
SELECT TOP (#numberOfRecords) * FROM Customers
GO
this is your solution :
TOP (Transact-SQL)
and about ##RowCount you can read this Link :
SET ROWCOUNT (Transact-SQL)
Important
Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL).
I think two way will work.!

Resources