Sampling calculation based on IDs in Snowflake

I've been facing an issue while sampling rows by ID from a table. When I sample over the whole population with a threshold of 0.1, the results are uneven: I get a sample count of 3 for CLI_1 and 0 for CLI_2. For example:
Table creation query:
CREATE OR REPLACE TEMP TABLE tmp.cg_test AS
SELECT
column1::varchar ID ,
column2::boolean sample_flag ,
column3::varchar term_id
FROM
( VALUES
('ID_1', NULL, 'CLI_1'),
('ID_2', NULL, 'CLI_1'),
('ID_3', NULL, 'CLI_1'),
('ID_4', NULL, 'CLI_1'),
('ID_5', NULL, 'CLI_1'),
('ID_6', NULL, 'CLI_1'),
('ID_7', NULL, 'CLI_1'),
('ID_8', NULL, 'CLI_1'),
('ID_9', NULL, 'CLI_1'),
('ID_10', NULL,'CLI_1'),
('ID_11', NULL,'CLI_2'),
('ID_12', NULL,'CLI_2'),
('ID_13', NULL,'CLI_2'),
('ID_14', NULL,'CLI_2'),
('ID_15', NULL,'CLI_2'),
('ID_16', NULL,'CLI_2'),
('ID_17', NULL,'CLI_2'),
('ID_18', NULL,'CLI_2'),
('ID_19', NULL,'CLI_2'),
('ID_20', NULL,'CLI_2')
);
Sampling Query:
CREATE OR REPLACE TEMP TABLE tmp.sample_calc_cli AS
Select id,
uniform(0::float, 1::float, random(1)) AS random_val,
-- uniform(0::float, 1::float, random(1)) OVER (PARTITION BY term_id) AS cli_sampling_value,
(uniform(0::float, 1::float, random(1))<.1)::boolean AS sample_flag,
-- ((uniform(0::float, 1::float, random(1)) < 0.1) OVER (PARTITION BY term_id))::boolean as sample_flag, -- error
term_id
from tmp.cg_test
GROUP BY id, term_id ORDER BY term_id asc;
Count check for sample:
SELECT term_id,
count(term_id) AS total_cli,
count_if(sample_flag= true) AS sample_count
FROM tmp.sample_calc_cli GROUP BY term_id;
Result
TERM_ID
TOTAL_CLI
SAMPLE_COUNT
CLI_1
10
3
CLI_2
10
0
Uncommenting the lines that apply uniform() with OVER (PARTITION BY term_id) raises this error:
SQL Error [2060] [42601]: SQL compilation error:
Invalid function type [UNIFORM] for window function.
What could be the possible way to rectify this issue?

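Your query draws a random value for each row and flags the row when the value is below 0.1, so the number of flagged rows per term_id depends on the random values the seed produces. If you use 23 as the seed number, you can see that you get 2 samples from CLI_2: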
CREATE OR REPLACE TEMP TABLE sample_calc_cli AS
Select id,
uniform(0::float, 1::float, random(23)) AS random_val,
(uniform(0::float, 1::float, random(23))<.1)::boolean AS sample_flag,
term_id
from cg_test
GROUP BY id, term_id ORDER BY term_id asc;
If you need to pick an equal number of sample rows from each term_id, you may use this:
SELECT *
FROM sample_calc_cli
qualify row_number() over (partition by term_id order by random() ) < 3;
This one will pick 2 rows for each term_id.
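If the goal is a fixed fraction per term_id instead of a fixed row count, one option is to rank the rows of each group in random order and keep the first 10%. With the original query every row independently has a 10% chance of being flagged, so a group of 10 rows ends up with no flagged rows about 35% of the time (0.9^10 ≈ 0.349). The following is a minimal sketch, assuming the cg_test table from the question; rn and group_size are illustrative names, not part of the original queries.

```sql
-- Sketch: keep roughly 10% of the rows of every term_id.
-- rn and group_size are illustrative column names.
SELECT id, term_id
FROM (
    SELECT id,
           term_id,
           -- random position of the row inside its term_id group
           ROW_NUMBER() OVER (PARTITION BY term_id ORDER BY RANDOM()) AS rn,
           -- number of rows in the same term_id group
           COUNT(*) OVER (PARTITION BY term_id) AS group_size
    FROM cg_test
) ranked
WHERE rn <= CEIL(0.1 * group_size);
```

CEIL rounds up, so every non-empty group contributes at least one row. RANDOM() is called without a seed here, so each run picks different rows.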

Related

Select only the most recent datarows [duplicate]

This question already has answers here:
Get top 1 row of each group
(19 answers)
Closed 1 year ago.
I have a table that takes multiple entries for specific products, you can create a sample like this:
CREATE TABLE test(
[coltimestamp] [datetime] NOT NULL,
[col2] [int] NOT NULL,
[col3] [int] NULL,
[col4] [int] NULL,
[col5] [int] NULL)
GO
Insert Into test
values ('2021-12-06 12:31:59.000',1,8,5321,1234),
('2021-12-06 12:31:59.000',7,8,4047,1111),
('2021-12-06 14:38:07.000',7,8,3521,1111),
('2021-12-06 12:31:59.000',10,8,3239,1234),
('2021-12-06 12:31:59.000',27,8,3804,1234),
('2021-12-06 14:38:07.000',27,8,3957,1234)
You can view col2 as the product number if you like.
What I need is a query that returns one row per col2 value; where col2 is not unique, it must choose the row with the most recent timestamp.
In other words, I need the most recent entry for each product.
So for the sample data the result has two fewer rows: the rows with the older timestamp for col2 = 7 and col2 = 27 are removed.
Thanks in advance.
Assign a row number with ROW_NUMBER() within each col2 value, in descending order of timestamp, and keep only the first row of each group.
;with cte as(
Select rn=row_number() over(partition by col2 order by coltimestamp desc),*
From test
)
Select * from cte
Where rn=1;
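The same result can also be reached without a window function, by joining each row to the latest timestamp of its col2 group. This is a sketch against the test table from the question; the alias latest and the column name max_ts are illustrative.

```sql
-- Sketch: most recent row per col2 via a grouped subquery.
-- latest and max_ts are illustrative names.
SELECT t.*
FROM test AS t
JOIN (
    -- newest timestamp for every product number
    SELECT col2, MAX(coltimestamp) AS max_ts
    FROM test
    GROUP BY col2
) AS latest
    ON latest.col2 = t.col2
   AND latest.max_ts = t.coltimestamp;
```

Unlike the ROW_NUMBER() version, this returns more than one row for a product when several rows share the same latest timestamp.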

SQL Group By Multiple Columns Having More Than One Unique Value for Grouping Column

I am looking for a way to group by two columns where the first grouping column has more than one unique value for the second grouping column. Below is a sample table with sample data.
CREATE TABLE [dbo].[MyTable](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Type] [varchar](20) NOT NULL,
[UnitOfMeasure] [varchar](20) NULL,
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
) ON [PRimary]
) ON [PRimary];
INSERT INTO [MyTable] (Type, UnitOfMeasure)
VALUES ('height', 'cm')
, ('distance', 'km')
, ('weight', 'kg')
, ('Glucose', 'mg/dL')
, ('weight', 'kg')
, ('Duration', 'hours')
, ('Glucose', 'mg/dL')
, ('Glucose', 'mg/dL')
, ('height', 'cm')
, ('Allergy', 'kUnits/L')
, ('Volume', 'mL')
, ('height', 'inch')
, ('height', 'cm')
, ('Chloride', 'mmol/L')
, ('Volume', 'cup')
, ('distance', 'km')
, ('Volume', 'cup')
, ('Duration', 'hours')
, ('Chloride', 'mmol/L')
, ('Duration', 'minutes');
The desired output is as follows.
Type UnitOfMeasure
Duration hours
Duration minutes
height cm
height inch
Volume cup
Volume mL
This output includes Duration because it has two units of measure. However, it includes neither weight nor Chloride, because each of them has only a single unit of measure.
You can use a CTE to get a DISTINCT COUNT, and then use an EXISTS with a further DISTINCT. I expect this to be a little expensive though, and ideally you probably want to address those duplicate rows you have.
WITH Counts AS(
SELECT [Type],
COUNT(DISTINCT UnitOfMeasure) AS DistinctMeasures
FROM dbo.MyTable
GROUP BY [Type])
SELECT DISTINCT
[Type],
UnitOfMeasure
FROM dbo.MyTable MT
WHERE EXISTS (SELECT 1
FROM Counts C
WHERE C.[Type] = MT.[Type]
AND C.DistinctMeasures > 1);
You can do it with EXISTS:
SELECT DISTINCT t.[Type], t.[UnitOfMeasure]
FROM [MyTable] t
WHERE EXISTS (
SELECT 1 FROM [MyTable]
WHERE [Type] = t.[Type] AND [UnitOfMeasure] <> t.[UnitOfMeasure]
)
See the demo.
Results:
> Type | UnitOfMeasure
> :------- | :------------
> Duration | hours
> Duration | minutes
> height | cm
> height | inch
> Volume | cup
> Volume | mL
You can do this with window functions only. Just compare the min and max unit per type: if they differ, then you know you have at least two distinct values, and you can retain the corresponding rows:
select distinct type, unitofmeasure
from (
select t.*,
min(unitofmeasure) over(partition by type) min_unit,
max(unitofmeasure) over(partition by type) max_unit
from mytable t
) t
where min_unit <> max_unit
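A more compact variant of the counting approach filters the types with a HAVING clause and an IN subquery. This is only a sketch against the MyTable definition from the question; with the sample data it should return the same six rows as the desired output.

```sql
-- Sketch: keep only the types that have more than one distinct unit.
SELECT DISTINCT [Type], UnitOfMeasure
FROM dbo.MyTable
WHERE [Type] IN (
    SELECT [Type]
    FROM dbo.MyTable
    GROUP BY [Type]
    -- COUNT(DISTINCT ...) ignores NULL units
    HAVING COUNT(DISTINCT UnitOfMeasure) > 1
)
ORDER BY [Type], UnitOfMeasure;
```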

SQL Server 2016 Compare values from multiple columns in multiple rows in single table

USE dev_db
GO
CREATE TABLE T1_VALS
(
[SITE_ID] [int] NULL,
[LATITUDE] [numeric](10, 6) NULL,
[UNIQUE_ID] [int] NULL,
[COLLECT_RANK] [int] NULL,
[CREATED_RANK] [int] NULL,
[UNIQUE_ID_RANK] [int] NULL,
[UPDATE_FLAG] [int] NULL
)
GO
INSERT INTO T1_VALS
(SITE_ID,LATITUDE,UNIQUE_ID,COLLECT_RANK,CREATED_RANK,UNIQUE_ID_RANK)
VALUES
(207442,40.900470,59664,1,1,1),
(207442,40.900280,61320,1,1,2),
(204314,40.245220,48685,1,2,2),
(204314,40.245910,59977,1,1,1),
(202416,39.449530,9295,1,1,2),
(202416,39.449680,62264,1,1,1)
I generated the COLLECT_RANK and CREATED_RANK columns from two date columns (not shown here) and the UNIQUE_ID_RANK column from the UNIQUE_ID column shown here.
I used a SELECT with an OVER clause and a ranking function to generate these columns. A _RANK value of 1 means the latest date or the greatest UNIQUE_ID value. I thought my solution would be pretty straightforward using these rank values via array and cursor processing, but I seem to have painted myself into a corner.
My problem: I need to choose LONGITUDE value and its UNIQUE_ID based upon the following business rules and set the update value, (1), for that record in its UPDATE_FLAG column.
(1) Select the record with the most recent Collection Date (i.e. RANK value = 1) for a given SITE_ID.
(2) If multiple records exist with the same Collection Date (i.e. same RANK value), select the record with the most recent Created Date (RANK value = 1) for a given SITE_ID.
(3) If multiple records exist with the same Created Date, select the record with the highest Unique ID for a given SITE_ID (i.e. RANK value = 1).
Your suggestions would be most appreciated.
I think you can use top and order by:
select top 1 t1.*
from t1_vals t1
order by collect_rank asc, created_rank asc, unique_id desc;
If you want this for sites, which might be what your question is asking, then use row_number():
select t1.*
from (select t1.*,
row_number() over (partition by site_id order by collect_rank asc, created_rank asc, unique_id desc) as seqnum
from t1_vals t1
) t1
where seqnum = 1;
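The question also asks to set UPDATE_FLAG to 1 on the chosen record. In SQL Server this can be done by updating through a CTE that carries the row number. This is a minimal sketch, assuming the column names from the CREATE TABLE statement in the question; ranked and seqnum are illustrative names.

```sql
-- Sketch: flag the winning row of every SITE_ID.
-- ranked and seqnum are illustrative names.
WITH ranked AS (
    SELECT UPDATE_FLAG,
           ROW_NUMBER() OVER (
               PARTITION BY SITE_ID
               ORDER BY COLLECT_RANK ASC, CREATED_RANK ASC, UNIQUE_ID DESC
           ) AS seqnum
    FROM T1_VALS
)
UPDATE ranked
SET UPDATE_FLAG = 1
WHERE seqnum = 1;
```

The statement before the WITH keyword has to be terminated with a semicolon. Rows that are not selected keep their current UPDATE_FLAG value, so reset the column first if the update is run more than once.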

GROUP BY error with TOP in T-SQL / SQL Server

I have this table in my SQL Server database.
CREATE TABLE [dbo].[CODIFICHE_FARMACI]
(
[Principio_Attivo] [nvarchar](250) NULL,
[LanguageID] [nvarchar](50) NOT NULL,
[Codice] [nvarchar](50) NOT NULL,
[Confezione_rif] [nvarchar](1000) NULL,
[ATC] [nvarchar](100) NULL,
[Farmaco] [nvarchar](1000) NULL,
[Confezione] [nvarchar](1000) NULL,
[Ditta] [nvarchar](100) NULL,
CONSTRAINT [PK_CODIFICHE_FARMACI]
PRIMARY KEY CLUSTERED ([LanguageID] ASC, [Codice] ASC)
)
Now I want to extract the first 60 records from this table, grouped by the Farmaco column.
So I wrote this query:
SELECT TOP 60 *
FROM CODIFICHE_FARMACI
GROUP BY Farmaco
But I have this strange error:
La colonna 'CODIFICHE_FARMACI.Principio_Attivo' non è valida nell'elenco di selezione perché non è inclusa né in una funzione di aggregazione né nella clausola GROUP BY.
In English:
The column 'CODIFICHE_FARMACI.Principio_Attivo' is invalid in the select list because it is not included in either an aggregate function or the GROUP BY clause.
EDIT: with this query, I get this result.
As you can see, values in the Farmaco column are repeated (ABBA and ABESART each appear twice).
EDIT: as a result I want:
|FARMACO|
ABBA
ABESART
ABILIFY
If you want to select the first 60 Farmaco values while showing only distinct values, you can try using SELECT DISTINCT:
SELECT DISTINCT TOP 60 Farmaco
FROM [dbo].[CODIFICHE_FARMACI]
ORDER BY Farmaco
Note that if you really have records that are duplicate then it implies your data is not normalized. Possibly, the duplicates only are the same with regard to certain columns, but not others.
SELECT TOP 60 cf.Farmaco
FROM CODIFICHE_FARMACI AS cf
GROUP BY cf.Farmaco
ORDER BY cf.Farmaco
When you use GROUP BY, you get the distinct values of the column(s) listed in the GROUP BY clause (in your case Farmaco).
First the FROM clause is evaluated; the rows retrieved from CODIFICHE_FARMACI, aliased as cf, are then grouped by cf.Farmaco.
The SELECT clause retrieves only the cf.Farmaco values, and ORDER BY sorts them in ascending order (the default for ORDER BY). Finally, TOP 60 keeps only the first 60 rows of the result.
SELECT * fails together with a GROUP BY clause because every column in the SELECT list must either appear in the GROUP BY clause or be wrapped in an aggregate function.
You can also add columns that are the result of aggregate functions such as SUM, MIN and MAX to the SELECT list.
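To illustrate the rule about aggregates, here is a sketch that returns other information next to each Farmaco value without triggering the error. The output column names NumeroRighe and PrimoCodice are made up for this example; Codice comes from the table definition in the question.

```sql
-- Sketch: non-grouped columns are allowed once they are aggregated.
-- NumeroRighe and PrimoCodice are illustrative output names.
SELECT TOP (60)
       cf.Farmaco,
       COUNT(*)       AS NumeroRighe,  -- rows per Farmaco
       MIN(cf.Codice) AS PrimoCodice   -- lowest Codice per Farmaco
FROM dbo.CODIFICHE_FARMACI AS cf
GROUP BY cf.Farmaco
ORDER BY cf.Farmaco;
```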
Try this:
SELECT Top 60
a.Farmaco
FROM [dbo].[CODIFICHE_FARMACI] A
GROUP BY A.Farmaco

Copy Distinct Records Based on 3 Cols

I have loads of data in a table called Temp, and this data contains duplicates.
Not entire rows, but rows with the same data in 3 columns: HouseNo, DateofYear, TimeOfDay.
I want to copy only the distinct rows from "Temp" into another table, "ThermData."
Basically, what I want to do is copy all the distinct rows from Temp to ThermData where distinct(HouseNo,DateofYear,TimeOfDay). Something like that.
I know we can't do that directly. Is there an alternative way to do it?
Please help me out. I have tried lots of things but haven't solved it.
Sample data. The repeated values look like this:
I want to delete the duplicate rows based on the values of HouseNo,DateofYear,TimeOfDay
HouseNo DateofYear TimeOfDay Count
102 10/1/2009 0:00:02 AM 2
102 10/1/2009 1:00:02 AM 2
102 10/1/2009 10:00:02 AM 2
Here is a Northwind example based on the Orders table.
There are duplicates based on the (EmployeeID , ShipCity , ShipCountry) columns.
If you only execute the code between these 2 lines:
/* Run everything below this line to show crux of the fix */
/* Run everything above this line to show crux of the fix */
you'll see how it works. Basically:
(1) You run a GROUP BY on the 3 columns of interest. (derived1Duplicates)
(2) Then you join back to the table using these 3 columns. (on ords.EmployeeID = derived1Duplicates.EmployeeID and ords.ShipCity = derived1Duplicates.ShipCity and ords.ShipCountry = derived1Duplicates.ShipCountry)
(3) Then for each group, you tag them with Cardinal numbers (1,2,3,4,etc) (using ROW_NUMBER())
(4) Then you keep the row in each group that has the cardinal number of "1". (where derived2DuplicatedEliminated.RowIDByGroupBy = 1)
Use Northwind
GO
declare @DestinationVariableTable table (
NotNeededButForFunRowIDByGroupBy int not null ,
NotNeededButForFunDuplicateCount int not null ,
[OrderID] [int] NOT NULL,
[CustomerID] [nchar](5) NULL,
[EmployeeID] [int] NULL,
[OrderDate] [datetime] NULL,
[RequiredDate] [datetime] NULL,
[ShippedDate] [datetime] NULL,
[ShipVia] [int] NULL,
[Freight] [money] NULL,
[ShipName] [nvarchar](40) NULL,
[ShipAddress] [nvarchar](60) NULL,
[ShipCity] [nvarchar](15) NULL,
[ShipRegion] [nvarchar](15) NULL,
[ShipPostalCode] [nvarchar](10) NULL,
[ShipCountry] [nvarchar](15) NULL
)
INSERT INTO @DestinationVariableTable (NotNeededButForFunRowIDByGroupBy , NotNeededButForFunDuplicateCount , OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry )
Select RowIDByGroupBy , MyDuplicateCount , OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
From
(
/* Run everything below this line to show crux of the fix */
Select
RowIDByGroupBy = ROW_NUMBER() OVER(PARTITION BY ords.EmployeeID , ords.ShipCity , ords.ShipCountry ORDER BY ords.OrderID )
, derived1Duplicates.MyDuplicateCount
, ords.*
from
[dbo].[Orders] ords
join
(
select EmployeeID , ShipCity , ShipCountry , COUNT(*) as MyDuplicateCount from [dbo].[Orders] GROUP BY EmployeeID , ShipCity , ShipCountry /*HAVING COUNT(*) > 1*/
) as derived1Duplicates
on ords.EmployeeID = derived1Duplicates.EmployeeID and ords.ShipCity = derived1Duplicates.ShipCity and ords.ShipCountry = derived1Duplicates.ShipCountry
/* Run everything above this line to show crux of the fix */
)
as derived2DuplicatedEliminated
where derived2DuplicatedEliminated.RowIDByGroupBy = 1
select * from @DestinationVariableTable
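Applied to the tables named in the question, the same idea can be written more briefly. This is only a sketch under assumptions: ThermData is assumed to have the same columns as Temp, and Reading is a hypothetical column standing in for whatever non-key data Temp holds, since the question does not list it.

```sql
-- Sketch: copy one row per (HouseNo, DateofYear, TimeOfDay) from Temp to ThermData.
-- Reading is a hypothetical column; replace it with the real non-key columns.
WITH numbered AS (
    SELECT HouseNo,
           DateofYear,
           TimeOfDay,
           Reading,
           ROW_NUMBER() OVER (
               PARTITION BY HouseNo, DateofYear, TimeOfDay
               ORDER BY (SELECT NULL)   -- no preference among duplicates
           ) AS rn
    FROM Temp
)
INSERT INTO ThermData (HouseNo, DateofYear, TimeOfDay, Reading)
SELECT HouseNo, DateofYear, TimeOfDay, Reading
FROM numbered
WHERE rn = 1;
```

ORDER BY (SELECT NULL) leaves the choice among duplicates up to the server. Replace it with a real column if a specific row, such as the earliest one, should be kept.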
