Random sample of n rows - snowflake-cloud-data-platform

I would like to sample n rows from a table at random. Alas
SELECT * FROME testtable sample (10 rows);
as the docs suggest gives me:
SQL compilation error: Sampling with sample missing tag for parameter seed.
Tagging on SEED(123) gives me
SQL compilation error: Sampling with sample wrong number of arguments for parameter seed.

If I correctly understand, you want to randomly select 10 rows from a table.
The following can help to achieve this result:
select * from testtable
order by RANDOM(123)
limit 10;

Seems to work okay for me. Are you sure that's a table you are sampling on and not a view / materialised view or anything?
If you have access to the snowflake_sample_data database could you try this and see?
select
*
from snowflake_sample_data.tpch_sf1.customer sample (10 rows)
;

Related

Sqlite giving same output and not considering where clause order

I want to get the entry from database using sqlite3 using sqlite_prepare_v2() in c interface and used to get the out put not in the expected order.
want to get the entry in the exact order in which size is given in the where clause but behavior is as below:
sqlite> select id,size from audio where size in (9,16,8);
3|9
4|8
5|16
sqlite>
sqlite> select id,size from audio where size in (8,9,16);
3|9
4|8
5|16
sqlite>
sqlite> select id,size from audio where size in (16,8,9);
3|9
4|8
5|16
For the first query, I am expecting the ouput to be in other order of size inside the where clause, but is seems sqlite is by default always giving the output in the order of the id. here id is the primary key field. Is sqlite using any default indexing on primary key field making this to happen? or is there a way to get the output from the sqlite as it is given in the where clause? Please let me know. Thanks.
The order of the in() clause is irrelevant to the order of the select results.
In order to get the select results in the desired order, you must either add an explicit order by clause that fits your need, such as order by size in the second case, or you must issue 3 separate select statements in the desired order.

SQL Server: random number in WHERE clause

As far as I am aware, the only way to get a random value in a SELECT statement is by using the newid() function, as the random() function doesn’t generate new values for each row.
This leads to the following awkward construction to get a random number from, say 0 - 9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have though that I would get roughly half the rows. Instead I get I get all or none of them. Apparently newid() is only evaluated once, instead of for each row.
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for fixed number of rows at random. In the above example I could have used:
select top 50 percent from table order by newid();
which will get me what I am looking for.
The question remains, how can I use a random number in the WHERE clause. For example, is it possible to do something like this?
select *
from table
where code={random number};
Here is one way to get around the problem
SELECT *
FROM (SELECT *,
Abs(Checksum(Newid())) % 10 AS ran
FROM yourtable) a
WHERE ran > 4;
for some reason newid() in where clause it is executed only once and it is checked with the constant.
When I check the execution plan your query is missing compute scalar where as my query has compute scalar present in execution plan.
The function newid() is calculate only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;
I should have though that I would get roughly half the rows. Instead I get I get all or none of them
You may get all of the rows or none of them ,since NEWID() is executed once per query when you use it in where clause..This is explained here by Conor Cunnigham and the technical term for this is called RumTimeConstants
You can look at your execution plan and look out for below expression
Const ConstValue
which you can see is calculated once and used throughout and finally you are doing just a boolean comparison,so you will end up with all rows or none
you have to use CTE Like the one stated in another answer or use Top with order by newid() or tablesample to return random rows
you may find Tablesample option more helpfull,since this may not go though all the table data to get only sample set of rows,unlike Newid()
below is one example on a table having 1000000 rows
select * from Orders
TABLESAMPLE (50 PERCENT)
plan

Access SubQuery: SHOW TOP (count form select query) Table

Is it possible to use a Count() or number from another Select query to SELECT TOP a number of rows in a different query?
Below is a sample of the update query I'm trying to use but would like to take the count from another query to replace "10".
...
WHERE Frames.Package IN (
SELECT TOP 10 Frames
FROM Frames.Package WHERE Package = "100"
ORDER BY Frames.ReferenceNumber
)
So for example, i've tried to do
SELECT TOP SelectQuery.RecordCount Frames
Sample SelectQuery.RecordCount
SELECT COUNT(Frames.Package) AS RecordCount
FROM Frames
HAVING Frames.Package = "100";
Any assistance would be appreciated...
Access does not support using a parameter for SELECT TOP. You must write a literal value into the text of the SQL statement.
From another answer: Select TOP N not working in MS Access with parameter
On that note, your two queries appear to be just interchanging HAVING and WHERE clauses to get the record count. It doesn't seem to be doing anything more, thus why bother with the TOP clause and simply SELECT * FROM Frames WHERE [..]?
Am I missing something?

SQL Server pulling data from multiple tables

I'm having a problem trying to pull a specific data from two tables. According to the textbook its:
Select *
From terra..retailsales and terra..retailaccount
Where retailaccountid in retailsales = 2345678
Get date range from = 3/01/2014 to 6/30/2015
However, when running the code it produces an syntax error within the in. Yet to me the whole code looks wrong. Can someone help me. I would like to get this to work in order to do my assignment. It's driving me nuts! I contacted the prof and he said that the code in the book is correct, but I think he's wrong.
Can someone help?
The code you provided is not TSQL - actually looks more like some kind of pseudocode.
Just guessing at your column names here, but if I've got it right your query should look something like this:-
SELECT * FROM terra..retailsales
WHERE retailaccountid = 2345678
AND [date range] BETWEEN '20140301' AND '20150630'
Not sure where the 2nd table comes into this though.
You can JOIN two table, like this:
SELECT *
FROM terra..retailsales RS
INNER JOIN terra..retailaccount RC
ON RS.retailaccountid = RC.ID
WHERE RS.retailaccountid = 2345678
AND [date] BETWEEN '20140301' AND '20150630'
Your provided code is very confusing. I see the [terra..retailsales] table, but have no idea what your other table is. Are you sure you need to get your data from two tables?
What is the syntax error you're receiving? Can you paste the exact code you're trying in a code block? Not much of that makes any sense.
In order to pull data from two tables, you could union those tables in a CTE (common table expression), throw them both into a temp table, or join them in a select statement. If the format of both tables is identical, then why do you have two of them?
You're missing the column name where you want to compare a [date] to "3/01/2014 to 6/30/2015". You can use getDate() to return the current time.
Select *
FROM [terra..retailsales]
Where [retailaccountid] = 2345678
AND [<DateColumn>] BETWEEN '3/01/2014' AND '6/30/2015'
You don't need to re-specify your table name again in line "Where retailaccountid in retailsales = 2345678". It will just assume that the retailaccountid is from retailsales.

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just
trying to understand how it would help
in the random selection. Is it that
(1) the select statement will select
EVERYTHING from mytable, (2) for each
row selected, tack on a
uniqueidentifier generated by NewID(),
(3) sort the rows by this
uniqueidentifier and (4) pick off the
top 100 from the sorted list?
Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.
SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)
In general it works like this:
All rows from mytable is "looped"
NEWID() is executed for each row
The rows are sorted according to random number from NEWID()
100 first row are selected
as MSDN says:
NewID() Creates a unique value of type
uniqueidentifier.
and your table will be sorted by this random values.
use select top 100 randid = newid(), * from mytable order by randid
you will be clarified then..
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really
random!
Method 3, Best but Requires Code: Random Primary Key >
Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered
index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE #rand BIGINT;
DECLARE #maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT #rand = ABS((CHECKSUM(NEWID()))) % #maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= #rand;

Resources