In Snowflake what is the difference between SELECT TOP and SELECT ... LIMIT - snowflake-cloud-data-platform

Is there a difference between SELECT TOP n col and SELECT col ... LIMIT n?
Both seem to return the same results. For example:
SELECT TOP 5 C_ACCTBAL FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";
SELECT C_ACCTBAL FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" LIMIT 5;
both return:
C_ACCTBAL
711.56
121.65
7498.12
2866.83
794.47

According to documentation there is no difference
TOP and LIMIT are equivalent.
https://docs.snowflake.com/en/sql-reference/constructs/top_n.html
--- EDIT ---
BUT just as @Robert Long said, LIMIT supports OFFSET but TOP does not.

With the basic usage pattern in the OP, there is no difference.
However, LIMIT also supports the use of OFFSET, which will skip rows first.
So, we might have:
SELECT TOP 5 C_ACCTBAL FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";
which returns
C_ACCTBAL
711.56
121.65
7498.12
2866.83
794.47
..as in the OP, but:
SELECT C_ACCTBAL FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" LIMIT 2 OFFSET 3;
will return
C_ACCTBAL
2866.83
794.47
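The effect of OFFSET can be pictured as list slicing. Below is a minimal Python sketch, assuming the five balances arrive in the order shown above (without an ORDER BY, the row order is not guaranteed). The helper names `top_n` and `limit_offset` are illustrative only and are not Snowflake functions.

```python
# The five C_ACCTBAL values from the output above, in the order shown.
rows = [711.56, 121.65, 7498.12, 2866.83, 794.47]

def top_n(rows, n):
    """Mimic SELECT TOP n / LIMIT n: keep the first n rows."""
    return rows[:n]

def limit_offset(rows, limit, offset=0):
    """Mimic LIMIT <limit> OFFSET <offset>: skip rows, then keep up to limit."""
    return rows[offset:offset + limit]

print(top_n(rows, 5))            # same rows as LIMIT 5
print(limit_offset(rows, 2, 3))  # skips three rows, keeps the next two
```

With an offset of 0 the two helpers return the same rows, which matches the equivalence of TOP and LIMIT in the basic case.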

Related

Max Recursion Exhausted before Statement Completion

I know this has been asked and answered a few times here, but I can't seem to find the answer to my specific problem. Here's the recursive query:
CTE as (
SELECT
ZipCode
,Age
,[Population]
,Deaths
,DeathRate
,Death_Proportion
,DeathProbablity
,SurvivalProbablity
,PersonsAlive
FROM ProbabilityTable
WHERE Age = 0
UNION ALL
SELECT
p.ZipCode
,p.Age
,p.[Population]
,p.Deaths
,p.DeathRate
,p.Death_Proportion
,p.DeathProbablity
,p.SurvivalProbablity
,LAG(c.PersonsAlive,1) OVER(PARTITION BY p.ZipCode ORDER BY p.Age) * p.SurvivalProbablity
FROM ProbabilityTable p
INNER JOIN CTE c
ON p.ZipCode = c.ZipCode
and p.Age = c.Age
WHERE p.Age < 86
)
In the ProbabilityTable, PersonsAlive is set to 100,000 when Age = 0. What I'm looking to do with the recursive CTE is multiply the previous value of PersonsAlive by the current SurvivalProbability to calculate the PersonsAlive of that Age. Age goes up to 85, so that's why I have my termination clause set at 86.
I've tried tweaking the recursive part of the query a number of times (and also setting PersonsAlive to 100,000 in the anchor part) but I can't figure it out. This is my first attempt at a recursive query and even with some course work it's not clicking for me.
EDIT
Here is the updated code that actually runs:
CTE as (
SELECT
ZipCode
,Age
,[Population]
,Deaths
,DeathRate
,Death_Proportion
,DeathProbablity
,SurvivalProbablity
,PersonsAlive
FROM ProbabilityTable
WHERE Age = 0
UNION ALL
SELECT
p.ZipCode
,p.Age
,p.[Population]
,p.Deaths
,p.DeathRate
,p.Death_Proportion
,p.DeathProbablity
,p.SurvivalProbablity
,LAG(c.PersonsAlive,1) OVER(PARTITION BY p.ZipCode ORDER BY p.Age) * p.SurvivalProbablity
FROM ProbabilityTable p
INNER JOIN CTE c
ON p.ZipCode = c.ZipCode
and p.Age = c.Age + 1
WHERE p.Age < 6
)
And here are the results it returns:
What I want the results to be for PersonsAlive is as follows:
So with each iteration of the CTE, it needs to reference the previous row of PersonsAlive and the current row of SurvivalProbability to calculate PersonsAlive
It's hard to test this without your raw data but I think your issue is you're lagging over the previous row, causing your frame of reference to be 2 rows back.
When you're using a recursive CTE, you already have access to the previous row, via CTE c. When you do LAG(c.PersonsAlive,1) you're actually telling it to look at PersonsAlive from 2 rows back from the current row (lagging 1 row back from the previous row).
On the first recursive pass there is only 1 row back, so the LAG() function returns NULL by default because no row exists 2 rows back at that point. This is why every row in your results has NULL for the PersonsAlive column, except for the first row (the anchor row from the first half of your UNION ALL clause). So if you remove the LAG() function and instead just do c.PersonsAlive * p.SurvivalProbablity, you should get all of the expected PersonsAlive values.
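The recurrence being asked for (previous PersonsAlive times the current survival probability) can be sketched outside SQL. This is a minimal Python illustration and not the poster's query. The function name and the survival probabilities are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def persons_alive_by_age(start, survival_probabilities):
    """Return PersonsAlive per age.

    Age 0 keeps the starting value; every later age is the previous
    age's PersonsAlive multiplied by that age's survival probability.
    """
    alive = [start]
    for p in survival_probabilities[1:]:
        alive.append(alive[-1] * p)
    return alive

# Hypothetical survival probabilities for ages 0..3 (not from the question).
survival = [1.0, 0.5, 0.5, 0.75]
alive = persons_alive_by_age(100_000, survival)
for age, value in enumerate(alive):
    print(age, value)
```

Each step reads only the value computed one step earlier, which is the role that CTE c plays in the recursive member.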
That being said, a recursive CTE seems like overkill here and you probably can just use the LAG() window function in a static call on your ProbabilityTable like so:
SELECT
ZipCode,
Age,
[Population],
Deaths,
DeathRate,
Death_Proportion,
DeathProbablity,
SurvivalProbablity,
ISNULL(LAG(PersonsAlive,1) OVER (PARTITION BY ZipCode ORDER BY Age), PersonsAlive) AS PersonsAlive
FROM ProbabilityTable
As I mentioned, I can't really test this, so please let me know if you run into any issues, and I'll help you accordingly.
Recursive CTEs are good for tree-like problems, e.g. when you need to compare multiple child rows to their parent, or interact with multiple levels of the tree simultaneously. Window functions like LAG() allow you to interact with any single row at a time relative to the current row. Your problem seems to be the latter kind.

BigQuery - Random numbers repeating when generated inside arrays

I made a BigQuery query which involves generating an array of random numbers for each row. I use the random numbers to decide which elements to include from an array that exists in my source table.
I had a lot of trouble getting the arrays of random numbers to not repeat themselves over every single row. I found a workaround, but is this expected behavior? I'll post two "methods" (one with desired results, one with bad results) below. Note that both methods work fine if you don't use an array, but just generate a single random number.
Method 1 (BAD Results):
SELECT
(
SELECT
ARRAY(
SELECT AS STRUCT
RAND() AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
)
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Method 2 (GOOD results):
SELECT
(
SELECT
ARRAY(
SELECT AS STRUCT
RAND() AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
FROM (SELECT NULL FROM UNNEST([0]))
)
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Example Results - Method 1 (BAD):
Row 1
0.5431173080158003
0.5585452983410205
...
Row 2
0.5431173080158003
0.5585452983410205
...
Example Results - Method 2 (GOOD):
Row 1
0.49639706531271377
0.1604380522058521
...
Row 2
0.7971869432989377
0.9815667330115473
...
EDIT: See below for some alternative examples that are similar, after Yun Zhang's theory about subqueries. Your solution was useful for the problem I posted, but note that there are still some cases I am finding baffling. Also, although I agree that you are probably correct about the subqueries being tied to the problem: shouldn't a subquery (especially one without a FROM clause) be less likely to have its results re-used than selecting a "normal" value? People talk about performance issues with subqueries sometimes, because they are supposedly calculated one time for each row, even if the results may be the same.
Do you agree that this seems like it may be a bug?
The below examples show that it is not necessarily even creating an array of randoms that is the problem -- even performing a sub-select that just happens to have an unrelated array in it can cause problems with RAND(). The problem goes away by eliminating the sub-select, by choosing just the random value from the sub-select, or by including a value inside the array which varies by row. Weird !!!
BAD
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT 1) AS a)
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
FIX #1 - No subselect
SELECT
STRUCT(RAND() AS r, ARRAY(SELECT 1) AS a)
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
FIX #2 - Select only r
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT 1) AS a).r
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
Fix #3 - Array contains "u"
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT u) AS a).r
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
I haven't worked out why the first query doesn't work, but I have a simpler version that works for you:
SELECT (
SELECT array_agg(RAND()) AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Update: I later realized that the problem is with ARRAY(subquery). As long as you can avoid using it for your case (as in my query above), you should be fine.

Unusual SQL Server query with "select top 1 @arastr = k"

select top 1 @arastr = k
from #m
where datalength(k) = (select max(datalength(k)) from #m)
What does this query do, and what is the point of select top 1 @arastr = k? This query is taken from a stored procedure that has been working for 7-8 years, so there is nothing wrong with the query, but I could not understand what it does.
(#m is a temp table which is created in the early part of the query.)
The query selects one arbitrary value (since TOP is used without an ORDER BY clause) from the column k in the temporary table #m and assigns it to the variable @arastr (which was presumably declared earlier). The value selected will be any one of those matching the longest string in the table, with length measured in bytes by the DATALENGTH function.
This is a quite common (but a little old-fashioned) way to get the value of k into the (previously declared!) variable @arastr for later use.
The function DATALENGTH measures the length in bytes of, e.g., a VARCHAR.
With TOP 1 you get only one result row in any case, the one with the "longest" k; its value is in @arastr afterwards.
EDIT: As pointed out by @jpw, the choice will be arbitrary if more than one k has the same (longest) length.
Without knowing what #m looks like and what kind of data is in k, I cannot tell you any more.
It probably makes more sense if it looks like this:
SET @arastr = (SELECT TOP 1 k
FROM #m
WHERE DATALENGTH(k) = (SELECT MAX(DATALENGTH(k)) FROM #m))

SubString of value with several 0s

The data in the table looks like this
ID Value
1 5006049
2 5006050
How do I select a substring so that I get
R6049
R6050
Keeping in mind that the values are sequential starting from
5000001 = R1
to
5999999 = R999999
Just subtract:
SELECT 'R' + CAST(VALUE - 5000000 as VARCHAR(6))
FROM table
I think as easily as this:
Select 'R'+Substring(convert(VARCHAR(7), Value), 4,7)
Which will give R0001 (do you want the zeros?)
If you don't want the zeros / only looking to remove the top digit:
Select 'R'+ convert(VARCHAR(6),Value - 5000000)
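The two approaches can be compared on the values from the question. This is a rough Python sketch, assuming SUBSTRING(..., 4, 7) means "start at the 4th character, take up to 7 characters". The function names are made up for illustration.

```python
def by_subtraction(value):
    """Mirror 'R' + CAST(Value - 5000000 AS VARCHAR(6))."""
    return "R" + str(value - 5_000_000)

def by_substring(value):
    """Mirror 'R' + SUBSTRING(CONVERT(VARCHAR(7), Value), 4, 7).

    SQL positions are 1-based, so position 4 is Python index 3.
    """
    return "R" + str(value)[3:3 + 7]

for value in (5006049, 5006050, 5000001, 5999999):
    print(value, by_subtraction(value), by_substring(value))
```

Under these assumptions the substring version keeps leading zeros and drops digits once the number after the leading 5 needs more than four places, so the subtraction version looks like the closer match for the 5999999 = R999999 requirement.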

How do I generate a random number for each row in a T-SQL select?

I need a different random number for each row in my table. The following seemingly obvious code uses the same random value for each row.
SELECT table_name, RAND() magic_number
FROM information_schema.tables
I'd like to get an INT or a FLOAT out of this. The rest of the story is I'm going to use this random number to create a random date offset from a known date, e.g. 1-14 days offset from a start date.
This is for Microsoft SQL Server 2000.
Take a look at SQL Server - Set based random numbers which has a very detailed explanation.
To summarize, the following code generates a random number between 0 and 13 inclusive with a uniform distribution:
ABS(CHECKSUM(NewId())) % 14
To change your range, just change the number at the end of the expression. Be extra careful if you need a range that includes both positive and negative numbers. If you do it wrong, it's possible to double-count the number 0.
A small warning for the math nuts in the room: there is a very slight bias in this code. CHECKSUM() results in numbers that are uniform across the entire range of the sql Int datatype, or at least as near so as my (the editor) testing can show. However, there will be some bias when CHECKSUM() produces a number at the very top end of that range. Any time you get a number between the maximum possible integer and the last exact multiple of the size of your desired range (14 in this case) before that maximum integer, those results are favored over the remaining portion of your range that cannot be produced from that last multiple of 14.
As an example, imagine the entire range of the Int type is only 19. 19 is the largest possible integer you can hold. When CHECKSUM() results in 14-19, these correspond to results 0-5. Those numbers would be heavily favored over 6-13, because CHECKSUM() is twice as likely to generate them. It's easier to demonstrate this visually. Below is the entire possible set of results for our imaginary integer range:
Checksum Integer:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Range Result:      0  1  2  3  4  5  6  7  8  9 10 11 12 13  0  1  2  3  4  5
You can see here that there are more chances to produce some numbers than others: bias. Thankfully, the actual range of the Int type is much larger... so much so that in most cases the bias is nearly undetectable. However, it is something to be aware of if you ever find yourself doing this for serious security code.
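The imaginary 0-19 integer range from this example is small enough to count exhaustively. The Python sketch below only illustrates the modulo arithmetic and does not model CHECKSUM() itself. `modulo_counts` is a hypothetical helper name.

```python
from collections import Counter

def modulo_counts(max_int, modulus):
    """Count how often each result occurs when every integer
    from 0 to max_int inclusive is reduced with % modulus."""
    return Counter(i % modulus for i in range(max_int + 1))

counts = modulo_counts(19, 14)
for result in range(14):
    print(result, counts[result])
```

Results 0 through 5 are each reachable from two inputs, while 6 through 13 are each reachable from one, which is the bias described above.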
When referenced in a single query, rand() is evaluated only once, so every row gets the same number.
I'd suggest using convert(varbinary,newid()) as the seed argument:
SELECT table_name, 1.0 + floor(14 * RAND(convert(varbinary, newid()))) magic_number
FROM information_schema.tables
newid() is guaranteed to return a different value each time it's called, even within the same batch, so using it as a seed will prompt rand() to give a different value each time.
Edited to get a random whole number from 1 to 14.
RAND(CHECKSUM(NEWID()))
The above will generate a (pseudo-) random number between 0 and 1, exclusive. If used in a select, because the seed value changes for each row, it will generate a new random number for each row (it is not guaranteed to generate a unique number per row however).
Example when combined with an upper limit of 10 (produces numbers 1 - 10):
CAST(RAND(CHECKSUM(NEWID())) * 10 as INT) + 1
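The integer scaling in this expression can be sanity-checked outside the database. This is a minimal Python sketch, assuming only that the source value r lies in [0, 1). `scale_to_int` is a hypothetical helper name, and Python's `random.random()` stands in for RAND(CHECKSUM(NEWID())).

```python
import random

def scale_to_int(r, low, high):
    """Map r in [0, 1) to an integer in [low, high] inclusive.

    Mirrors CAST(r * n AS INT) + low, where n = high - low + 1.
    """
    n = high - low + 1        # how many distinct integers are wanted
    return int(r * n) + low   # int() truncates, like CAST(... AS INT)

# The smallest and (nearly) largest r give the two ends of the range.
print(scale_to_int(0.0, 1, 10))
print(scale_to_int(0.999999, 1, 10))

# Random draws stay inside the range.
samples = [scale_to_int(random.random(), 1, 10) for _ in range(10_000)]
print(min(samples), max(samples))
```

The multiplier is the count of values wanted (10 here), and the added constant is the lower bound.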
Transact-SQL Documentation:
CAST(): https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql
RAND(): http://msdn.microsoft.com/en-us/library/ms177610.aspx
CHECKSUM(): http://msdn.microsoft.com/en-us/library/ms189788.aspx
NEWID(): https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql
Random number generation between 1000 and 9999 inclusive:
FLOOR(RAND(CHECKSUM(NEWID()))*(9999-1000+1)+1000)
The "+1" is needed to include the upper bound (9999 in this example).
This is an old question, but this answer has not been provided previously, and hopefully it will be useful for someone finding these results through a search engine.
With SQL Server 2008, a new function has been introduced, CRYPT_GEN_RANDOM(8), which uses CryptoAPI to produce a cryptographically strong random number, returned as VARBINARY(8000). Here's the documentation page: https://learn.microsoft.com/en-us/sql/t-sql/functions/crypt-gen-random-transact-sql
So to get a random number, you can simply call the function and cast it to the necessary type:
select CAST(CRYPT_GEN_RANDOM(8) AS bigint)
or to get a float between -1 and +1, you could do something like this:
select CAST(CRYPT_GEN_RANDOM(8) AS bigint) % 1000000000 / 1000000000.0
The Rand() function will generate the same random number, if used in a table SELECT query. Same applies if you use a seed to the Rand function. An alternative way to do it, is using this:
SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]
Got the information from here, which explains the problem very well.
Do you have an integer value in each row that you could pass as a seed to the RAND function?
To get an integer between 1 and 14 I believe this would work:
FLOOR( RAND(<yourseed>) * 14) + 1
If you need to preserve your seed so that it generates the "same" random data every time, you can do the following:
1. Create a view that returns select rand()
if object_id('cr_sample_randView') is not null
begin
drop view cr_sample_randView
end
go
create view cr_sample_randView
as
select rand() as random_number
go
2. Create a UDF that selects the value from the view.
if object_id('cr_sample_fnPerRowRand') is not null
begin
drop function cr_sample_fnPerRowRand
end
go
create function cr_sample_fnPerRowRand()
returns float
as
begin
declare @returnValue float
select @returnValue = random_number from cr_sample_randView
return @returnValue
end
go
3. Before selecting your data, seed the rand() function, and then use the UDF in your select statement.
select rand(200); -- seed the rand() function
with cte(id) as
(select row_number() over(order by object_id) from sys.all_objects)
select
id,
dbo.cr_sample_fnPerRowRand()
from cte
where id <= 1000 -- limit the results to 1000 random numbers
select round(rand(checksum(newid()))*(10)+20,2)
Here the random number will come in between 20 and 30.
round will give two decimal place maximum.
If you want negative numbers you can do it with
select round(rand(checksum(newid()))*(10)-60,2)
Then the min value will be -60 and max will be -50.
Try using a seed value in RAND(seedInt). RAND() executes only once per statement, which is why you see the same number each time.
If you don't need it to be an integer, but any random unique identifier, you can use newid()
SELECT table_name, newid() magic_number
FROM information_schema.tables
You would need to call RAND() for each row. Here is a good example
https://web.archive.org/web/20090216200320/http://dotnet.org.za/calmyourself/archive/2007/04/13/sql-rand-trap-same-value-per-row.aspx
The problem I sometimes have with the selected "Answer" is that the distribution isn't always even. If you need a very even distribution of random 1 - 14 among lots of rows, you can do something like this (my database has 511 tables, so this works; if you have fewer rows than the span of your random numbers, this does not work well):
SELECT table_name, ntile(14) over(order by newId()) randomNumber
FROM information_schema.tables
This kind of does the opposite of normal random solutions in the sense that it keeps the numbers sequenced and randomizes the other column.
Remember, I have 511 tables in my database (which is pertinent only b/c we're selecting from the information_schema). If I take the previous query and put it into a temp table #X, and then run this query on the resulting data:
select randomNumber, count(*) ct from #X
group by randomNumber
I get this result, showing me that my random number is VERY evenly distributed among the many rows:
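As a rough illustration of why NTILE spreads the numbers so evenly, here is a Python sketch that assumes the same 511 rows and 14 buckets. `ntile` is a hypothetical re-implementation of the bucket assignment (bucket sizes differ by at most one, with the larger buckets first), and `random.shuffle` stands in for ORDER BY NEWID().

```python
import random
from collections import Counter

def ntile(num_rows, buckets):
    """Return a bucket number (1..buckets) for each of num_rows ordered rows.

    Bucket sizes differ by at most one; the larger buckets come first.
    """
    base, extra = divmod(num_rows, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        assignment.extend([bucket] * size)
    return assignment

rows = [f"table_{i}" for i in range(511)]  # placeholder table names
random.shuffle(rows)                        # random row order
random_number = dict(zip(rows, ntile(len(rows), 14)))

counts = Counter(random_number.values())
for bucket in range(1, 15):
    print(bucket, counts[bucket])
```

Which rows land in which bucket changes on every run, but the bucket sizes do not: 511 rows over 14 buckets always gives sizes of 37 or 36.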
It's as easy as:
DECLARE @rv FLOAT;
SELECT @rv = rand();
And this will put a random number between 0-99 into a table:
CREATE TABLE R
(
Number int
)
DECLARE @rv FLOAT;
SELECT @rv = rand();
INSERT INTO dbo.R
(Number)
values((@rv * 100));
SELECT * FROM R
select ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) as [Randomizer]
has always worked for me
Use newid()
select newid()
or possibly this
select binary_checksum(newid())
If you want to generate a random number between 1 and 14 inclusive:
SELECT CONVERT(int, RAND() * 14 + 1)
OR
SELECT ABS(CHECKSUM(NewId())) % 14 + 1
DROP VIEW IF EXISTS vwGetNewNumber;
GO
Create View vwGetNewNumber
as
Select CAST(RAND(CHECKSUM(NEWID())) * 62 as INT) + 1 as NextID,
'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'as alpha_num;
---------------CTDE_GENERATE_PUBLIC_KEY -----------------
DROP FUNCTION IF EXISTS CTDE_GENERATE_PUBLIC_KEY;
GO
create function CTDE_GENERATE_PUBLIC_KEY()
RETURNS NVARCHAR(32)
AS
BEGIN
DECLARE @private_key NVARCHAR(32);
set @private_key = dbo.CTDE_GENERATE_32_BIT_KEY();
return @private_key;
END;
go
---------------CTDE_GENERATE_32_BIT_KEY -----------------
DROP FUNCTION IF EXISTS CTDE_GENERATE_32_BIT_KEY;
GO
CREATE function CTDE_GENERATE_32_BIT_KEY()
RETURNS NVARCHAR(32)
AS
BEGIN
DECLARE @public_key NVARCHAR(32);
DECLARE @alpha_num NVARCHAR(62);
DECLARE @start_index INT = 0;
DECLARE @i INT = 0;
select top 1 @alpha_num = alpha_num from vwGetNewNumber;
WHILE @i < 32
BEGIN
select top 1 @start_index = NextID from vwGetNewNumber;
set @public_key = concat (substring(@alpha_num,@start_index,1),@public_key);
set @i = @i + 1;
END;
return @public_key;
END;
select dbo.CTDE_GENERATE_PUBLIC_KEY() public_key;
Update my_table set my_field = CEILING((RAND(CAST(NEWID() AS varbinary)) * 10))
Number between 1 and 10.
Try this:
SELECT RAND(convert(varbinary, newid()))*(b-a)+a magic_number
Where a is the lower number and b is the upper number
If you need a specific number of random numbers, you can use a recursive CTE:
;WITH A AS (
SELECT 1 X, RAND() R
UNION ALL
SELECT X + 1, RAND(R*100000) --Change the seed
FROM A
WHERE X < 1000 --How many random numbers you need
)
SELECT
X
, RAND_BETWEEN_1_AND_14 = FLOOR(R * 14 + 1)
FROM A
OPTION (MAXRECURSION 0) --If you need more than 100 numbers
