Year Based Primary Key? - sql-server

How can I create a Primary Key in SQL Server 2005/2008 with the format:
CurrentYear + auto-increment?
Example: the current year is 2010; in a new table, the ID should start at 1, so: 20101, 20102, 20103, 20104, 20105... and so on.

The cleanest solution is to create a composite primary key consisting of, for example, Year and Counter columns.

Not sure exactly what you are trying to accomplish by doing that, but it makes a lot more sense to do this with two fields.
If the combination of the two must be the PK for some reason, just span it across both columns. However, it seems unnecessary since the identity part will be unique exclusive of the year.
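For instance, a minimal sketch of spanning the key across both columns (table and column names are illustrative):
CREATE TABLE Orders
( OrderYear INT NOT NULL DEFAULT (YEAR(GETDATE()))
, [Counter] INT IDENTITY(1,1) NOT NULL
, CONSTRAINT PK_Orders PRIMARY KEY (OrderYear, [Counter])
);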

This technically meets the needs of what you requested:
CREATE TABLE #test
( seeded_column INT IDENTITY(1,1) NOT NULL
, year_column INT NOT NULL DEFAULT(YEAR(GETDATE()))
, calculated_column AS CONVERT(BIGINT, CONVERT(CHAR(4), year_column, 120) + CONVERT(VARCHAR(MAX), seeded_column)) PERSISTED PRIMARY KEY
, test VARCHAR(MAX) NOT NULL);
INSERT INTO #test (test)
SELECT 'Badda'
UNION ALL
SELECT 'Cadda'
UNION ALL
SELECT 'Dadda'
UNION ALL
SELECT 'Fadda'
UNION ALL
SELECT 'Gadda'
UNION ALL
SELECT 'Hadda'
UNION ALL
SELECT 'Jadda'
UNION ALL
SELECT 'Kadda'
UNION ALL
SELECT 'Ladda'
UNION ALL
SELECT 'Madda'
UNION ALL
SELECT 'Nadda'
UNION ALL
SELECT 'Padda';
SELECT *
FROM #test;
DROP TABLE #test;

You have to write a trigger for this :)
Have a separate table for storing the last number used (I don't know whether SQL Server has an equivalent of Oracle's sequences).
OR
You can get the last inserted item and extract the number from it.
THEN
You can get the current year from SELECT DATEPART(yyyy, GetDate());
The trigger would be an ON INSERT trigger in which you combine the year with the next number and update the column.
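A rough sketch of that trigger idea (all names are hypothetical, and concurrent inserts are not handled here):
CREATE TABLE LastNumber (LastValue INT NOT NULL);
INSERT INTO LastNumber VALUES (0);
GO
CREATE TRIGGER trg_YearKey ON MyTable INSTEAD OF INSERT
AS
BEGIN
    DECLARE @rows INT, @last INT;
    SELECT @rows = COUNT(*) FROM inserted;
    -- reserve a block of counter values for this insert
    UPDATE LastNumber SET @last = LastValue = LastValue + @rows;
    -- prepend the current year to each reserved counter value
    INSERT INTO MyTable (ID, Payload)
    SELECT CAST(CAST(DATEPART(yyyy, GETDATE()) AS CHAR(4))
              + CAST(@last - @rows + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR(10)) AS BIGINT)
         , Payload
    FROM inserted;
END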

Related

How to delete Duplicate records in snowflake database table

How do I delete duplicate records from a Snowflake table? Thanks
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table, because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't alter the original table's structure
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key, then you cannot delete that way. At that point you can do a
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name) AS rn
FROM table_name
)
WHERE rn = 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; their use is primarily with ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of having update timestamp columns in a Snowflake data warehouse.
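For example, if such an update timestamp column exists, a sketch of keeping the newest copy of each id (column names assumed):
INSERT INTO new_table
SELECT * FROM old_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;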
This has been bothering me for some time as well. Since Snowflake added support for QUALIFY, you can now create a deduplicated table with a single statement, without subselects:
CREATE TABLE fruit (id number, name text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, name ORDER BY id, name) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and lose the table history, primary keys, foreign keys and such.
Based on the above ideas, the following query worked perfectly in my case:
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit WHERE ID = 1 AND Name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal rows, but there isn't any in snowflake, see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes, either, so your only option is to create a new table and swap.
An additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: these are a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
A simple UNION eliminates duplicates, for the use case of deduplicating on all columns with no PKs.
In any case, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD etc. Hunting for raw magic to delete duplicates after the fact is wrong in principle; SCD with a high-resolution timestamp solves any such problem.
Do you want to fix a massive load of duplicates? Then add a column like a batch id and remove all records loaded in that batch, as sketched below.
It's like staying healthy; you have two approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle, with no need for the gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps. Learn to put pressure upstream on data producers instead of forever cleaning up everyone else's mess.
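A sketch of the batch-id idea (table and column names are hypothetical):
-- every load stamps its rows with a batch id, so a bad load
-- can be removed wholesale and then re-ingested cleanly
DELETE FROM some_table WHERE batch_id = 'load_2021_06_01';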
The following solution is effective if you are looking at one or a few columns as the primary key reference for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create table with dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
)
-- join back to the original table on that field, excluding the ID in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

How can I keep the order of column values in a union select?

I am doing a bulk insert into a table using SELECT and UNION. I need the order of the SELECT values to be unchanged when calling the INSERT, but it seems that the values are being inserted in ascending order rather than the order I specify.
For example, the below insert statement
declare @QuestionOptionMapping table
(
[ID] [int] IDENTITY(1,1)
, [QuestionOptionID] int
, [RateCode] varchar(50)
)
insert into @QuestionOptionMapping (
RateCode
)
select
'PD0116'
union
select
'PL0090'
union
select
'PL0091'
union
select
'DD0026'
union
select
'DD0025'
SELECT * FROM @QuestionOptionMapping
renders the data as
(5 row(s) affected)
ID QuestionOptionID RateCode
----------- ---------------- --------------------------------------------------
1 NULL DD0025
2 NULL DD0026
3 NULL PD0116
4 NULL PL0090
5 NULL PL0091
(5 row(s) affected)
How can the select of the inserted data return the same order as when it was inserted?
SQL Server stores your rows as an unordered set. The data points may or may not be contiguous, and they may or may not be in the "order" the data was specified in your insert statements.
When you query the data, the engine will retrieve the rows in the most efficient order, as determined by the optimizer. There is no guarantee that the order will be the same every time you query the data.
The only way to guarantee the order of your result set is to include an explicit ORDER BY clause with your SELECT statement.
See this answer for a much more in-depth discussion of why this is the case: Default row order in SELECT query - SQL Server 2008 vs SQL 2012
By using the SELECT/UNION option for your INSERT statement, you're creating an unordered set that SQL Server ingests as a set, not as a series of inputs. Separate your inserts into discrete statements if you need them to have the IDENTITY values applied in order. Better yet, if the row numbering matters, don't leave it to chance. Explicitly number the rows on insert.
SQL tables do represent unordered sets. However, the identity column on an insert will follow the ordering of the order by.
Your data is getting out of order because of the duplicate elimination in the union. However, I would suggest writing the query to explicitly sort the data:
insert into @QuestionOptionMapping (RateCode)
select ratecode
from (values (1, 'PD0116'),
(2, 'PL0090'),
(3, 'PL0091'),
(4, 'DD0026'),
(5, 'DD0025')
) v(ord, ratecode)
order by ord;
Then be sure to use order by for the select:
select qom.*
from @QuestionOptionMapping qom
order by id;
Note that this also uses the values() table constructor, which is a very handy syntax.
If you're not selecting from tables?
Then you could insert VALUES, instead of a select with unions.
insert into @QuestionOptionMapping (RateCode) values
('PD0116')
,('PL0090')
,('PL0091')
,('DD0026')
,('DD0025')
Or in your query, change all the UNION to UNION ALL.
The difference between a UNION and a UNION ALL is that a UNION will remove duplicate rows.
While UNION ALL just stitches the result sets from the selects together.
And for UNION to find those duplicates, internally it first has to sort them.
But a UNION ALL doesn't care about uniqueness, so it doesn't need to sort.
A third option would be to simply change from one insert statement to multiple insert statements, one insert per value, thus avoiding UNION completely.
But that anti-golf-coding method is also the most verbose.
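For completeness, that would look like this; each statement takes its identity value in turn, so the order is preserved:
insert into @QuestionOptionMapping (RateCode) values ('PD0116');
insert into @QuestionOptionMapping (RateCode) values ('PL0090');
insert into @QuestionOptionMapping (RateCode) values ('PL0091');
insert into @QuestionOptionMapping (RateCode) values ('DD0026');
insert into @QuestionOptionMapping (RateCode) values ('DD0025');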
Your problem is that you are not inserting them in the order you think. UNION returns distinct values only, and it will typically sort the values to facilitate the distinct. Run the select statement alone and you will see.
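For example (the sorting is an implementation detail of the duplicate removal, not a guarantee):
select 'PD0116' union select 'PL0090' union select 'DD0026'
-- typically returns DD0026, PD0116, PL0090 rather than the written order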
If you insert using values then order is preserved:
insert into @QuestionOptionMapping (RateCode) values
('PD0116'), ('PL0090'), ('PL0091'), ('DD0026'), ('DD0025')
select * from @QuestionOptionMapping order by ID

T-SQL, using UNION to dedupe, but retain an additional column

I am using T-SQL to return records from the database (there are multiple criteria, but the list of unique ID's must be distinct), in short, the T-SQL looks like this:
SELECT
t1.ID,
[query1mark] = 1
WHERE criteria1 = 1
UNION
SELECT
t1.ID,
[query2mark] = 1
WHERE criteria2 = 1
I would like to be able to use UNION to de-dupe on the ID field (the data has to be unique on the ID field), whilst retaining the derived column "query1mark" or "query2mark" to show which query each row came from. In my real-world case, there are 5 queries that need to be de-duped against each other, so I need an efficient solution if possible.
EDIT: Additionally, the results from the first query need to be prioritised over those from the second query, and the results from the second query need to be prioritised over those from the third query. As I understand it, this behaviour is inherent when using UNION, as it will only add records from the queries below the UNION statement.
Is Union the best solution for this, and if not, what can I use?
Thanks
What about this:
DECLARE @DataSource TABLE
(
[ID] INT
,[criteria] INT
);
INSERT INTO @DataSource ([ID], [criteria])
VALUES (1, 1)
,(1, 2)
,(2, 1)
,(3, 1)
,(3, 2)
,(4, 2);
WITH DataSource ([ID], [query_mark], [RowID]) AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria] ASC)
FROM @DataSource
)
SELECT [id], [query_mark]
FROM DataSource
WHERE [RowID] = 1;
The idea is to create a sequence over the duplicated elements in each group. The duplicates are ordered by the criteria field, but you can change the logic if you need to - for example, to show the biggest criteria. The group is defined using the PARTITION BY [ID] clause, which numbers the items within each [ID] group. Then, in the select, we only need to show one record per group: [RowID] = 1.
You can use TOP 1 WITH TIES:
SELECT top 1 with ties * FROM yourtable
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria])
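Applied to the @DataSource sample above, that would be (a sketch):
SELECT TOP 1 WITH TIES [ID], [criteria] AS [query_mark]
FROM @DataSource
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria]);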

In SQL Server, how do I set the next value for an autoincrement field to an arbitrary value like you can in Postgres?

Is it possible to set the next value of an autoincrement field in SQL Server like you can do in Postgres?
For the curious, here's the whole backstory. My company used to use Postgres, which allows you to easily set the next value of an autoincrement field to an arbitrary value.
New company bought old company, and now we're importing Postgres data to SQL Server. Somehow the autoincremented AcctID field on Accounts got set to a 9-digit number even though there are thousands of 8 digit numbers to be had. Apparently someone did this a while back in Postgres for some now unknown reason.
So now in the new SQL Server database, new accounts are having 9-digit account ids, but the client's accounting software can't deal with 9-digit account numbers, so any new accounts they add can't be processed by their accounting department until this gets resolved.
Of course, there are up to 72 different tables which can have dependencies on the AcctID field of Accounts, and the client created about 360 new accounts before they realized the problems involved, so saving that data, truncating the table, and reinserting the data would be an onerous task.
Much better would be to set the autoincrement value of AcctID to the last 8-digit value + 1. Then at least they'd be able to add new accounts while a solution to the 9-digit accounts was being worked on. In fact they claim they only need 3 of the 360 accounts they've added.
So is it possible to reset the autoincrement value of a field in SQL Server like you can do in Postgres?
In SQL Server you can reset an autoincrement column like this:
dbcc checkident ( table_name, RESEED, new_value )
You can check the DBCC CHECKIDENT documentation on MSDN for details.
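For example (hypothetical numbers; RESEED sets the current identity value, so the next row inserted gets new_value + 1):
-- make the next AcctID come out as 12345679
DBCC CHECKIDENT ('Accounts', RESEED, 12345678);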
You can do this too:
CREATE TABLE #myTable
(
    ID INT IDENTITY,
    abc VARCHAR(20)
)
-- seed some rows; the identity values run 1 through 9
INSERT INTO #myTable (abc)
VALUES ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('cba')
SELECT *
FROM #myTable
-- Jump identities by inserting an explicit value
SET IDENTITY_INSERT #myTable ON
INSERT INTO #myTable
( id, abc )
VALUES ( 50, 'cbd' )
SELECT *
FROM #myTable
SET IDENTITY_INSERT #myTable OFF
-- Back to contiguous: later identities continue from 50
INSERT INTO #myTable (abc)
VALUES ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('abc'), ('cba')
SELECT *
FROM #myTable
DROP TABLE #myTable

T-SQL, Insert into with MAX()+1 in subquery doesn't increment, alternatives?

I have a query where I need to "batch" insert rows into a table with a primary key without identity.
--TableA
--PK int (Primary key, no-identity)
--CustNo int
INSERT INTO TableA (PK,CustNo)
SELECT (SELECT MAX(PK)+1 AS PK FROM TableA), CustNo
FROM Customers
(simplified example - please don't comment about possible concurrency issues :-))
The problem is that it doesn't increment the PK "for each" processed row, and I get a primary key violation.
I know how to do it with a cursor/while loop, but I would like to avoid that, and solve it in a set-based kind of manner, if that's even possible ?
(running SQL Server 2008 Standard)
Declare @i int;
Select @i = max(pk) + 1 from tablea;
INSERT INTO TableA (PK, custno)
Select row_number() over(order by custno) + @i, CustNo
FROM Customers
+1 to Michael Buen, but I have one suggestion:
The table "tablea" can be empty, so we should write:
Select @i = isnull(max(pk),0) + 1 from tablea;
This will prevent a null error when trying to use this code.
The problem, as you have seen, is that they all get the same row number: the MAX(PK) + 1 is the same for every row.
Try converting it to MAX(PK) + ROW_NUMBER().
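A minimal sketch of that suggestion, folding in the ISNULL guard mentioned above for an empty table:
INSERT INTO TableA (PK, CustNo)
SELECT (SELECT ISNULL(MAX(PK), 0) FROM TableA)
       + ROW_NUMBER() OVER (ORDER BY CustNo)
     , CustNo
FROM Customers;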
I'm working on the basis that you know why this is a bad idea, and that your question is simplified for the purpose of getting an answer rather than being how you would actually wish to solve the problem.
You can:
;with T(NPK, CustNo) as (
select row_number() over (order by CustNo), CustNo from Customers
)
insert into TableA (PK, CustNo)
select NPK, custno from T
order by CustNo
I have a suggestion for you, buddy: a better practice in SQL is to use a SEQUENCE (available in SQL Server 2012 and later), and guess what, it's VERY easy to do. Just copy and paste mine:
CREATE SEQUENCE SEQ_TABLEA AS INTEGER
START WITH 1
INCREMENT BY 1
MAXVALUE 2147483647
MINVALUE 1
NO CYCLE
and use it like this (note that SQL Server's syntax is NEXT VALUE FOR, not Oracle-style .NEXTVAL):
INSERT INTO TableA (PK, CustNo) VALUES (NEXT VALUE FOR SEQ_TABLEA, 123)
Hope this tip is able to help ya!
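You can also attach the sequence to the column as a default, so inserts don't need to mention it:
ALTER TABLE TableA
ADD CONSTRAINT DF_TableA_PK DEFAULT (NEXT VALUE FOR SEQ_TABLEA) FOR PK;
INSERT INTO TableA (CustNo) VALUES (123);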
