How to generate samples with no intersection using seed? - snowflake-cloud-data-platform

I am trying to generate unique samples of data using SAMPLE() and SEED().
This means samples should not intersect with each other. Every sample should be unique.
But I am not reaching that goal: with the approach below, Snowflake generates samples that share about 10% of their rows with the previous sample:
CREATE TABLE employee_sample_10_1 as
SELECT employee_id FROM employee SAMPLE(10) SEED(1);
CREATE TABLE employee_sample_10_2 as
SELECT employee_id FROM employee SAMPLE(10) SEED(2);
CREATE TABLE employee_sample_10_3 as
SELECT employee_id FROM employee SAMPLE(10) SEED(3);
CREATE TABLE employee_sample_10_4 as
SELECT employee_id FROM employee SAMPLE(10) SEED(4);
CREATE TABLE employee_sample_10_5 as
SELECT employee_id FROM employee SAMPLE(10) SEED(5);
CREATE TABLE employee_sample_10_6 as
SELECT employee_id FROM employee SAMPLE(10) SEED(6);
CREATE TABLE employee_sample_10_7 as
SELECT employee_id FROM employee SAMPLE(10) SEED(7);
CREATE TABLE employee_sample_10_8 as
SELECT employee_id FROM employee SAMPLE(10) SEED(8);
CREATE TABLE employee_sample_10_9 as
SELECT employee_id FROM employee SAMPLE(10) SEED(9);
CREATE TABLE employee_sample_10_0 as
SELECT employee_id FROM employee SAMPLE(10) SEED(0);
Ideally, these sample tables should not intersect, and their row counts should add up to the number of rows in the original table EMPLOYEE.
In fact, neither condition holds. The row counts don't add up and, moreover, each sample table intersects with its neighbours by roughly 10%.
SELECT employee_id FROM employee_sample_10_1 INTERSECT SELECT employee_id FROM employee_sample_10_2; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_2 INTERSECT SELECT employee_id FROM employee_sample_10_3; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_3 INTERSECT SELECT employee_id FROM employee_sample_10_4; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_4 INTERSECT SELECT employee_id FROM employee_sample_10_5; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_5 INTERSECT SELECT employee_id FROM employee_sample_10_6; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_6 INTERSECT SELECT employee_id FROM employee_sample_10_7; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_7 INTERSECT SELECT employee_id FROM employee_sample_10_8; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_8 INTERSECT SELECT employee_id FROM employee_sample_10_9; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_9 INTERSECT SELECT employee_id FROM employee_sample_10_0; --getting ~10% of data intersect, need 0%
Question: How to make SAMPLE() and SEED() produce only unique sets of values?

You can assign a random employee group to each row, create a temp table from that result set (or use the result_scan method if you prefer), and then insert into the 10 tables accordingly, stripping out the column used to assign the groups if necessary.
Here's a sample using the CUSTOMER table in the Snowflake sample data:
create temp table RANDOM_EMPLOYEE_GROUPS as
select uniform(0, 9, random()) as RANDOM_GROUP, *
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";
select * from RANDOM_EMPLOYEE_GROUPS limit 100;
-- Insert into the 10 tables from there.
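The reason this avoids overlap is that samples taken with different seeds are drawn without regard to each other, whereas a row that carries exactly one group number can only land in one table. Below is a minimal sketch of that idea, run against an in-memory SQLite database from Python rather than Snowflake. The 1000-row `employee` table is made up for illustration, and `(random() % 10 + 10) % 10` is an assumed SQLite stand-in for Snowflake's `uniform(0, 9, random())`.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")

# Hypothetical source table with 1000 employees.
conn.execute("CREATE TABLE employee (employee_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO employee VALUES (?)", [(i,) for i in range(1, 1001)])

# Give every row exactly one group number in 0..9.
conn.execute(
    """
    CREATE TEMP TABLE random_employee_groups AS
    SELECT (random() % 10 + 10) % 10 AS random_group, employee_id
    FROM employee
    """
)

# One sample table per group, without the helper column.
for g in range(10):
    conn.execute(
        f"CREATE TABLE employee_sample_10_{g} AS "
        f"SELECT employee_id FROM random_employee_groups WHERE random_group = {g}"
    )


def ids(g):
    """Return the set of employee ids stored in sample table g."""
    rows = conn.execute(f"SELECT employee_id FROM employee_sample_10_{g}")
    return {row[0] for row in rows}


samples = [ids(g) for g in range(10)]
total_rows = sum(len(s) for s in samples)
max_overlap = max(len(a & b) for a, b in combinations(samples, 2))
print(total_rows, max_overlap)
```

The ten tables are disjoint and together cover the whole source table, but their sizes are only approximately 10% each. If exactly equal sizes matter, bucketing on something like `ntile(10) over (order by random())` may be worth trying instead.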

Related

Create a separate table based on select condition query in snowflake

I am using a select query with a condition to remove duplicates. The query is below:
select * from (
select LOCATIONID, OBSERVATION_TIME_UTC, max(ROW_KEY) ROW_KEY from OLD_TABLE group by LOCATIONID, OBSERVATION_TIME_UTC
)
Here it displays only 3 columns (LOCATIONID, OBSERVATION_TIME_UTC, and ROW_KEY) out of 15.
I want to create a separate table that has all the columns, with the column order unchanged.
I tried the query below:
create or replace table NEW_TABLE as
select * from (
select LOCATIONID, OBSERVATION_TIME_UTC, max(ROW_KEY) ROW_KEY from OLD_TABLE group by LOCATIONID, OBSERVATION_TIME_UTC
)
But the above query gives only 3 columns, whereas I need the new table to hold the data as it is (it should have all the columns).
Could someone correct my query, please?
QUALIFY can be used to keep the row with the highest ROW_KEY per LOCATIONID and OBSERVATION_TIME_UTC:
-- create or replace table new_table as
select *
from old_table
qualify row_number() over (partition by LOCATIONID, OBSERVATION_TIME_UTC
order by ROW_KEY desc) = 1;
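For readers without a Snowflake account, the same "keep the newest row per key, with all columns" logic can be tried out locally. The sketch below uses Python's built-in sqlite3 module. SQLite has no QUALIFY clause, so the row_number() filter is moved into an outer WHERE. The `temperature` column and the sample rows are hypothetical, and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE old_table ("
    "locationid INTEGER, observation_time_utc TEXT, row_key INTEGER, temperature REAL)"
)
conn.executemany(
    "INSERT INTO old_table VALUES (?, ?, ?, ?)",
    [
        (1, "2021-01-01 00:00", 10, 5.0),  # duplicate key, older row
        (1, "2021-01-01 00:00", 11, 5.5),  # duplicate key, newest row
        (2, "2021-01-01 00:00", 12, 7.0),
    ],
)

# Rank rows inside each (locationid, observation_time_utc) group and keep
# rank 1; listing the columns explicitly drops the helper column rn while
# preserving the original column order.
conn.execute(
    """
    CREATE TABLE new_table AS
    SELECT locationid, observation_time_utc, row_key, temperature
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY locationid, observation_time_utc
                   ORDER BY row_key DESC
               ) AS rn
        FROM old_table
    )
    WHERE rn = 1
    """
)

deduped = conn.execute("SELECT * FROM new_table ORDER BY locationid").fetchall()
print(deduped)
```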

SQL Server - More efficient way of joining three tables without parent table

This is an issue I have been working on for a while. I have three tables, all of which share 3 of the same columns, but there are rows that are unique to each table. I would like to combine all of the tables without duplicating rows. I have a working solution, but I feel like it might not be the most efficient. I tried using joins but found that without a parent table, I wasn't getting the expected number of results. My solution, which does yield the correct number of results (I've cut some columns for simplicity):
--Create table
CREATE TABLE #Temp
(
ID,
Date
)
-- Insert rows that are only in db1
INSERT INTO #Temp
SELECT
ID,
Date
FROM test.dbo.db1
-- Do not include rows shared by db1 and db2
EXCEPT
(
SELECT
ID,
Date
FROM test.dbo.db2
INTERSECT
SELECT
ID,
Date
FROM test.dbo.db1
)
EXCEPT
-- And not in db1 and db3
(
SELECT
ID,
Date
FROM test.dbo.db1
INTERSECT
SELECT
ID,
Date
FROM test.dbo.db3
)
EXCEPT
-- And not in db1, db2 and db3
** Code where I intersect all 3 tables
I repeat the above steps for all three tables and then add the intersections for each combined ID/Date (db1+db2+db3, db1+db2, etc...)
Does anyone know of a way to do this that is more direct and to the point? I have tried doing a full join of all of them but without a parent table with all of the ID's, I found the ID's that only appear in the other two tables don't show up.
SELECT
ID,
Date
FROM test.dbo.db1
UNION
SELECT
ID,
Date
FROM test.dbo.db2
UNION
SELECT
ID,
Date
FROM test.dbo.db3
The UNION takes care of removing duplicates.
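A quick way to convince yourself is to compare UNION with UNION ALL on overlapping data. The sketch below uses Python's sqlite3 module with three small made-up tables named after the ones in the question. The rows are hypothetical and the dates are stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("db1", "db2", "db3"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER, date TEXT)")

conn.executemany("INSERT INTO db1 VALUES (?, ?)",
                 [(1, "2020-01-01"), (2, "2020-01-02")])
conn.executemany("INSERT INTO db2 VALUES (?, ?)",
                 [(2, "2020-01-02"), (3, "2020-01-03")])
conn.executemany("INSERT INTO db3 VALUES (?, ?)",
                 [(1, "2020-01-01"), (3, "2020-01-03"), (4, "2020-01-04")])

# UNION removes duplicate (id, date) pairs across all three inputs.
combined = conn.execute(
    "SELECT id, date FROM db1 "
    "UNION SELECT id, date FROM db2 "
    "UNION SELECT id, date FROM db3 "
    "ORDER BY id"
).fetchall()

# UNION ALL keeps every input row, duplicates included.
all_rows = conn.execute(
    "SELECT id, date FROM db1 "
    "UNION ALL SELECT id, date FROM db2 "
    "UNION ALL SELECT id, date FROM db3"
).fetchall()

print(len(combined), len(all_rows))
```

Rows that exist only in db2 or db3 show up as well, which is what the full-join attempt was missing.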

Adding Foreign keys from multiple tables, how to ensure sequence

So this is basically my current task.
2 Tables have received a certain number of auto-generated data.
Table_1 id number, identitynumber number, name varchar2, sex varchar2, birthday date;
Table_2 id number, identification number, manufacturer varchar2, typ varchar2;
The ID values on both tables are the primary keys of each table. Now I gotta insert the data from these 2 tables into a 3rd table, that'll use these ids as foreign keys. This table should also receive some auto-generated data.
Table_3 id number, plate varchar2, id_table1 number, id_table2 number, from date, until date;
I planned on using an insert with a select to query the required data:
insert into table_3 (id, plate, id_table1, id_table2, from, until)
select function_randomID as id,
generate_randomPlate as plate,
(select t1.id
from table_1) as id_table1,
(select t2.id
from table_2) as id_table2,
generate_date as from,
generate_date as until
from dual;
Now, I know the selects for both IDs are incorrect, and this is precisely the question.
I don't know what condition I need to put into those selects, in order to get a single row and add that into the third table.
Sorry if I didn't ask it in a more succinct way. Hopefully it's clear enough now to be understood.
Now I gotta insert the data from these 2 tables into a 3rd table,
that'll use these ids as foreign keys. This table should also receive
some auto-generated data.
Assuming your auto-generated values come from functions, you can try this:
INSERT INTO table_3 (id,
plate,
id_table1,
id_table2,
fm, -- FROM is a reserved keyword
until)
SELECT function_randomID AS id,
generate_randomPlate AS plate,
tb1.id,
tb2.id,
generate_date AS fm,
generate_date AS until
FROM table_1 tb1
CROSS JOIN table_2 tb2 ;
--ON tb1.id = tb2.id;
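Keep in mind what a CROSS JOIN produces: one row in table_3 for every combination of a table_1 row and a table_2 row. The sketch below shows that behaviour with Python's sqlite3 module. The tables are cut down to a few hypothetical columns and rows, and the auto-assigned integer primary key stands in for the random-ID function from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE table_2 (id INTEGER PRIMARY KEY, manufacturer TEXT)")
conn.execute(
    "CREATE TABLE table_3 ("
    "id INTEGER PRIMARY KEY, "
    "id_table1 INTEGER REFERENCES table_1(id), "
    "id_table2 INTEGER REFERENCES table_2(id))"
)

conn.executemany("INSERT INTO table_1 VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO table_2 VALUES (?, ?)",
                 [(10, "Audi"), (20, "Ford")])

# Every table_1 id is paired with every table_2 id: 3 x 2 = 6 rows.
conn.execute(
    "INSERT INTO table_3 (id_table1, id_table2) "
    "SELECT tb1.id, tb2.id FROM table_1 tb1 CROSS JOIN table_2 tb2"
)

pairs = conn.execute(
    "SELECT id_table1, id_table2 FROM table_3 ORDER BY id_table1, id_table2"
).fetchall()
print(pairs)
```

If each table_1 row should be linked to only one table_2 row, the cross join inserts too many rows and a join condition or a row-limiting step would be needed instead.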

Merge columns and add values based on duplicate column

I have rows in my SQL Server table that I would like to merge based on duplicate PurchaseDate column values. By merging, I would also like to sum the Amount values.
ID CustomerID Amount PurchaseDate TimeStamp
1 113 20 2015-10-01 0x0000000000029817
2 113 30 2015-10-01 0x0000000000029818
Based on the example above, I would like to have a single row where the values for the Amount column are summed up.
ID CustomerID Amount PurchaseDate TimeStamp
2 113 50 2015-10-01 0x0000000000029818
I'm not certain how I should go about this whether I should:
Create a new row with the new values or;
Update the latest added row and add the Amount to that row
But first I'd like to know how to get rows with duplicate PurchaseDate column values.
UPDATE: I have here a delete script for old values
DELETE FROM Table WHERE ID NOT IN (SELECT MAX(ID) FROM Table GROUP BY CustomerID, PurchaseDate)
I suggest updating the last inserted row:
UPDATE T
SET Amount = X.Amount
FROM Table T INNER JOIN (
SELECT MAX(ID) AS ID, SUM(Amount) AS Amount
FROM Table
GROUP BY CustomerID, PurchaseDate) X ON T.ID = X.ID
In this case I'd also suggest removing the old rows afterwards, for example with your delete script.
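The two steps can be tried end to end on the example rows from the question. This is a sketch using Python's sqlite3 module rather than SQL Server: the table is named `purchases` (a hypothetical name, since `Table` is only a placeholder), the TimeStamp column is left out, and the update is written with a correlated subquery instead of UPDATE ... FROM so it does not depend on a recent SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER, purchase_date TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 113, 20, "2015-10-01"),
        (2, 113, 30, "2015-10-01"),
        (3, 114, 5, "2015-10-02"),  # extra row with no duplicate
    ],
)

# Step 1: write the group total into the newest row of each group.
conn.execute(
    """
    UPDATE purchases
    SET amount = (
        SELECT SUM(p.amount)
        FROM purchases p
        WHERE p.customer_id = purchases.customer_id
          AND p.purchase_date = purchases.purchase_date
    )
    WHERE id IN (
        SELECT MAX(id) FROM purchases GROUP BY customer_id, purchase_date
    )
    """
)

# Step 2: remove the older rows, as in the delete script from the question.
conn.execute(
    """
    DELETE FROM purchases
    WHERE id NOT IN (
        SELECT MAX(id) FROM purchases GROUP BY customer_id, purchase_date
    )
    """
)

merged = conn.execute("SELECT * FROM purchases ORDER BY id").fetchall()
print(merged)
```

The order matters: running the delete first would discard the amounts that still need to be added up.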

Joining two tables having duplicate data in both columns on the base of which we are joining

I have two tables. The first table has a column named CardName, which contains duplicate values. The same column also exists in the second table, which additionally has a column named amount for each CardName. What I want is to select the distinct CardName values from the first table and, for each one, take the sum of all the amounts in the second table whose CardName is in the first table. BUT the first table's CardName values should be distinct.
What should I do?
select name,sum(amount) from tableB
where name in (select distinct name from TableA)
group by name
Use the DISTINCT keyword. DISTINCT gives you only the unique names from TableA, and the outer query returns the name and the sum from tableB for those names.
Refer to this: http://technet.microsoft.com/en-us/library/ms187831(v=sql.105).aspx
UPDATE, based on your comment below:
with cte (name) as
(
select distinct name from TableA
)
select cte.name, ISNULL(sum(B.amount), 0) as total_amount
from cte
left join TableB as B on cte.name = B.name
group by cte.name
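To illustrate why the names have to be made distinct before joining, the sketch below compares a direct join with the DISTINCT-then-LEFT-JOIN form, using Python's sqlite3 module. The table contents are made up, and COALESCE is used in place of SQL Server's ISNULL because SQLite does not provide ISNULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (name TEXT)")
conn.execute("CREATE TABLE table_b (name TEXT, amount INTEGER)")

# 'visa' appears twice in table_a; 'diners' has no rows in table_b.
conn.executemany("INSERT INTO table_a VALUES (?)",
                 [("visa",), ("visa",), ("amex",), ("diners",)])
conn.executemany("INSERT INTO table_b VALUES (?, ?)",
                 [("visa", 10), ("visa", 5), ("amex", 7), ("master", 100)])

# Joining the raw tables counts each table_b row once per duplicate in table_a.
naive = conn.execute(
    "SELECT a.name, SUM(b.amount) FROM table_a a "
    "JOIN table_b b ON a.name = b.name "
    "GROUP BY a.name ORDER BY a.name"
).fetchall()

# De-duplicating first gives the real totals, with 0 for unmatched names.
totals = conn.execute(
    """
    WITH cte (name) AS (SELECT DISTINCT name FROM table_a)
    SELECT cte.name, COALESCE(SUM(b.amount), 0)
    FROM cte
    LEFT JOIN table_b b ON cte.name = b.name
    GROUP BY cte.name
    ORDER BY cte.name
    """
).fetchall()

print(naive)
print(totals)
```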
