How do I delete the duplicate records from a Snowflake table? Thanks.
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table. This is because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
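The delete-and-reinsert pattern above can be tried out locally. What follows is only a minimal sketch: it uses Python's built-in sqlite3 as a stand-in for Snowflake, names the columns id and name instead of using $1, $2, and replaces Snowflake's DELETE ... USING with a correlated EXISTS, all of which are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; the transaction is opened explicitly below
cur = conn.cursor()

cur.execute("CREATE TABLE some_table (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, "Apple"), (1, "Apple"), (2, "Apple"), (3, "Orange"), (3, "Orange")],
)

# Find all duplicated rows, keeping one copy per group.
cur.execute(
    """
    CREATE TEMP TABLE duplicate_holder AS
    SELECT id, name FROM some_table GROUP BY id, name HAVING COUNT(*) > 1
    """
)

# Delete every copy of a duplicated row, then insert a single copy back.
cur.execute("BEGIN")
cur.execute(
    """
    DELETE FROM some_table
    WHERE EXISTS (
        SELECT 1 FROM duplicate_holder d
        WHERE d.id = some_table.id AND d.name = some_table.name
    )
    """
)
cur.execute("INSERT INTO some_table SELECT id, name FROM duplicate_holder")
cur.execute("COMMIT")

deduped_rows = cur.execute(
    "SELECT id, name FROM some_table ORDER BY id, name"
).fetchall()
print(deduped_rows)
```

Rows that were never duplicated (here the row 2, Apple) are not touched at all, which is the point of this approach.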
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete the extra rows by key:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key, then you cannot delete that way. In that case, create a deduplicated copy of the table
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id, name) AS rn
FROM table_name
)
WHERE rn = 1;
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name;
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; they are mainly useful for ERD tools.
Snowflake does not have something like a ROWID either, so there is no built-in way to single out one of several identical rows for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. Since Snowflake has added support for QUALIFY, you can now create a deduped table with a single statement and no subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and lose table history, primary keys, foreign keys and such.
Based on the above ideas, the following query worked perfectly in my case.
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit where ID = 1 and Name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal row identifiers, but there isn't one in Snowflake; see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
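The keep-the-newest rule used by the DELETE above can be sketched outside the database as well. This is an illustrative Python version only; the timestamps are made-up sample values and keep_newest is a hypothetical helper name, not something from the answer.

```python
from datetime import datetime

# (ID, Name, UPDATED_AT); the timestamps are invented for the example.
fruit = [
    (1, "Apple", datetime(2024, 1, 1, 12, 0, 0)),
    (2, "Apple", datetime(2024, 1, 1, 12, 0, 0)),
    (3, "Orange", datetime(2024, 1, 1, 12, 0, 0)),
    (1, "Apple", datetime(2024, 1, 2, 9, 30, 0)),
    (3, "Orange", datetime(2024, 1, 2, 9, 30, 0)),
]


def keep_newest(rows):
    """Keep, for each ID, only the row with the latest UPDATED_AT."""
    newest = {}
    for row in rows:
        row_id, _name, updated_at = row
        # Same effect as deleting rows with rn > 1 when ordering by UPDATED_AT DESC.
        if row_id not in newest or updated_at > newest[row_id][2]:
            newest[row_id] = row
    return sorted(newest.values())


newest_fruit = keep_newest(fruit)
for row in newest_fruit:
    print(row)
```

As with the SQL version, two rows that share both the ID and the timestamp cannot be told apart by this rule.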
A simple UNION eliminates duplicates in the use case of comparing all columns with no PKs.
Anyway, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD etc.
Looking for a raw, magic "best way to delete" is wrong in principle; using SCD with a high-resolution timestamp solves any such problem.
Do you want to fix a massive load of duplicates? Then add a column such as a batch ID and remove all records loaded by that batch.
It's like being healthy; you have two approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle and no need for a gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps. Learn to put pressure upstream on data producers instead of living like Jesus Christ, trying to clean up everyone's mess.
The following solution is effective if you are looking at one or few columns as primary key references for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create table with dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
);
-- join back to the original table on the field, excluding the ID kept in the duplicate table, and delete.
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id;
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1
I have to write an SP that can perform partial updates on our databases; the changes are stored in a record of the PU table. A Values field contains all values, delimited by a fixed delimiter. A Table field refers to a Schemes table, which contains the column names for each table in a similar fashion in a Columns field.
Now for my SP I need to split the Values field and the Columns field into a temp table with Column/Value pairs. This happens for each record in the PU table.
An example:
Our PU table looks something like this:
CREATE TABLE [dbo].[PU](
[Table] [nvarchar](50) NOT NULL,
[Values] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','John Doe;26');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Jane Doe;22');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mike Johnson;20');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mary Jane;24');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Mathematics');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','English');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Geography');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus A;Schools Road 1;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus B;Schools Road 31;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus C;Schools Road 22;Educationville');
And we have a Schemes table similar to this:
CREATE TABLE [dbo].[Schemes](
[Table] [nvarchar](50) NOT NULL,
[Columns] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Person','[Name];[Age]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Course','[Name]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Campus','[Name];[Address];[City]');
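As a result, the first record of the PU table should result in a temp table like:
Column|Value
[Name]|John Doe
[Age]|26
The 5th will have:
Column|Value
[Name]|Mathematics
Finally, the 8th PU record should result in:
Column|Value
[Name]|Campus A
[Address]|Schools Road 1
[City]|Educationville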
You get the idea.
I tried to use the following query to create the temp tables, but alas it fails when there's more than one value in the PU record:
DECLARE @Fields TABLE
(
[Column] INT,
[Value] VARCHAR(MAX)
)
INSERT INTO @Fields
SELECT TOP 1
(SELECT Value FROM STRING_SPLIT([dbo].[Schemes].[Columns], ';')),
(SELECT Value FROM STRING_SPLIT([dbo].[PU].[Values], ';'))
FROM [dbo].[PU] INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]
TOP 1 correctly gets the first PU record as each PU record is removed once processed.
The error is:
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
In the case of a Person record, the splits are indeed returning 2 values/columns at a time; I just want to store the values in 2 records instead of getting an error.
Any help on rewriting the above query?
Also do note that the data is just generic nonsense. Being able to have 2 fields that both have delimited values, always equal in amount (e.g. a 'person' in the PU table will always have 2 delimited values in the field), and break them up in several column/header rows is the point of the question.
UPDATE: Working implementation
Based on the (accepted) answer of Sean Lange, I was able to work out the following implementation to overcome the issue:
As I need to reuse it, the combine column/value functionality is performed by a new function, declared as such:
CREATE FUNCTION [dbo].[JoinDelimitedColumnValue]
(@splitValues VARCHAR(8000), @splitColumns VARCHAR(8000),@pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH MyValues AS
(
SELECT ColumnPosition = x.ItemNumber,
ColumnValue = x.Item
FROM dbo.DelimitedSplit8K(@splitValues, @pDelimiter) x
)
, ColumnData AS
(
SELECT ColumnPosition = x.ItemNumber,
ColumnName = x.Item
FROM dbo.DelimitedSplit8K(@splitColumns, @pDelimiter) x
)
SELECT cd.ColumnName,
v.ColumnValue
FROM MyValues v
JOIN ColumnData cd ON cd.ColumnPosition = v.ColumnPosition
;
In case of the above sample data, I'd call this function with the following SQL:
DECLARE @FieldValues VARCHAR(8000), @FieldColumns VARCHAR(8000)
SELECT TOP 1 @FieldValues=[dbo].[PU].[Values], @FieldColumns=[dbo].[Schemes].[Columns] FROM [dbo].[PU] INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]
INSERT INTO @Fields
SELECT [Column] = x.[ColumnName],[Value] = x.[ColumnValue] FROM [dbo].[JoinDelimitedColumnValue](@FieldValues, @FieldColumns, @Delimiter) x
This data structure makes this way more complicated than it should be. You can leverage the splitter from Jeff Moden here. http://www.sqlservercentral.com/articles/Tally+Table/72993/ The main difference of that splitter and all the others is that his returns the ordinal position of each element. Why all the other splitters don't do this is beyond me. For things like this it is needed. You have two sets of delimited data and you must ensure that they are both reassembled in the correct order.
The biggest issue I see is that you don't have anything in your main table to function as an anchor for ordering the results correctly. You need something, even an identity, to ensure the output rows stay "together". To accomplish this I just added an identity to the PU table.
alter table PU add RowOrder int identity not null
Now that we have an anchor this is still a little cumbersome for what should be a simple query but it is achievable.
Something like this will now work.
with MyValues as
(
select p.[Table]
, ColumnPosition = x.ItemNumber
, ColumnValue = x.Item
, RowOrder
from PU p
cross apply dbo.DelimitedSplit8K(p.[Values], ';') x
)
, ColumnData as
(
select ColumnName = replace(replace(x.Item, ']', ''), '[', '')
, ColumnPosition = x.ItemNumber
, s.[Table]
from Schemes s
cross apply dbo.DelimitedSplit8K(s.Columns, ';') x
)
select cd.[Table]
, v.ColumnValue
, cd.ColumnName
from MyValues v
join ColumnData cd on cd.[Table] = v.[Table]
and cd.ColumnPosition = v.ColumnPosition
order by v.RowOrder
, v.ColumnPosition
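The core idea of the query above is that both delimited strings are split with their ordinal positions and then matched position by position. A minimal sketch of that pairing in Python follows; pair_columns_with_values is a hypothetical helper name, and the length check is an added assumption rather than part of the answer.

```python
def pair_columns_with_values(columns, values, delimiter=";"):
    """Pair each column name with the value at the same ordinal position."""
    # Strip the square brackets, like the nested replace() calls in the query.
    names = [name.strip("[]") for name in columns.split(delimiter)]
    items = values.split(delimiter)
    if len(names) != len(items):
        raise ValueError("columns and values must have the same number of items")
    # zip() matches on position, which plays the role of ItemNumber in the join.
    return list(zip(names, items))


person_pairs = pair_columns_with_values("[Name];[Age]", "John Doe;26")
campus_pairs = pair_columns_with_values(
    "[Name];[Address];[City]", "Campus A;Schools Road 1;Educationville"
)
print(person_pairs)
print(campus_pairs)
```

Without a reliable position for each element, the two lists could not be reassembled in the correct order, which is why a splitter that returns the ordinal is needed.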
I recommend not storing values like this in the first place. I recommend having a key value in the tables and preferably not using Table and Columns as a composite key. I also recommend avoiding reserved words. I don't know what version of SQL you are using, so I am going to assume you are using a fairly recent version of Microsoft SQL Server that will support my provided stored procedure.
Here is an overview of the solution:
1) You need to convert both the PU and the Schema table into a table where you will have each "column" value in the list of columns isolated in their own row. If you can store the data in this format rather than the provided format, you will be a little better off.
What I mean is
Table|Columns
Person|Jane Doe;22
needs to be converted to
Table|Column|OrderInList
Person|Jane Doe|1
Person|22|2
There are multiple ways to do this, but I prefer an XML trick that I picked up. You can find multiple split string examples online, so I will not focus on that. Use whatever gives you the best performance. Unfortunately, you might not be able to get away from this table-valued function.
Update:
Thanks to Shnugo's performance enhancement comment, I have updated my xml splitter to give you the row number which reduces some of my code. I do the exact same thing to the Schema list.
2) Since the new Schema table and the new PU table now have the order each column appears, the PU table and the schema table can be joined on the "Table" and the OrderInList
CREATE FUNCTION [dbo].[fnSplitStrings_XML]
(
@List NVARCHAR(MAX),
@Delimiter VARCHAR(255)
)
RETURNS TABLE
AS
RETURN
(
SELECT y.i.value('(./text())[1]', 'nvarchar(4000)') AS Item,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) as RowNumber
FROM
(
SELECT CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.') AS x
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
CREATE Procedure uspGetColumnValues
as
Begin
--Split each value in PU
select p.[Table],p.[Values],a.[Item],CHARINDEX(a.Item,p.[Values]) as LocationInStringForSorting,a.RowNumber
into #PuWithOrder
from PU p
cross apply [fnSplitStrings_XML](p.[Values],';') a --use whatever string split function is working best for you (performance wise)
--Split each value in Schema
select s.[Table],s.[Columns],a.[Item],CHARINDEX(a.Item,s.[Columns]) as LocationInStringForSorting,a.RowNumber
into #SchemaWithOrder
from Schemes s
cross apply [fnSplitStrings_XML](s.[Columns],';') a --use whatever string split function is working best for you (performance wise)
DECLARE @Fields TABLE --If this is an ETL process, maybe make this a permanent table with an auto incrementing Id and reference this table in all steps after this.
(
[Table] NVARCHAR(50),
[Columns] NVARCHAR(MAX),
[Column] VARCHAR(MAX),
[Value] VARCHAR(MAX),
OrderInList int
)
INSERT INTO @Fields([Table],[Columns],[Column],[Value],OrderInList)
Select pu.[Table],pu.[Values] as [Columns],s.Item as [Column],pu.Item as [Value],pu.RowNumber
from #PuWithOrder pu
join #SchemaWithOrder s on pu.[Table]=s.[Table] and pu.RowNumber=s.RowNumber
Select [Table],[Columns],[Column],[Value],OrderInList
from @Fields
order by [Table],[Columns],OrderInList
END
GO
EXEC uspGetColumnValues
GO
Update:
Since your working implementation is a table-valued function, I have another recommendation. The problem I see is that you're using a table-valued function which ultimately handles one record at a time. You are going to have better performance with set-based operations and batching as needed. With a table-valued function, you are likely going to be looping through each row. If this is some sort of ETL process, your team will be better off if you have a stored procedure that processes the rows in bulk. It might make sense to stage the results into a better table that your team can work with downstream rather than have them use a potentially slow table-valued function.
SQL Fiddle: http://sqlfiddle.com/#!6/52c67/1
CREATE TABLE MailingList (EmployeeId INT, Email VARCHAR(50))
INSERT INTO MailingList VALUES (1, 'bob@co.com')
INSERT INTO MailingList VALUES (2, 'jill@co.com')
INSERT INTO MailingList VALUES (3, 'frank@co.com')
INSERT INTO MailingList VALUES (4, 'fred@co.com')
Now I get a list of EmployeeIds from somewhere: 1,2,3,4,5
I need to check which of these employeeIds are NOT in the Mailinglist table. I expect to get the result "5" in this case, as it is NOT in the mailinglist table.
What is the easiest way to do this?
Is there an easier way than generating a temporary table, inserting the values 1,2,3,4,5, and then doing either a select ... where not in (select ...) or getting the same result with a join? Basically, I want to avoid creating a temporary table and inserting the data, and just work with the list 1,2,3,4,5.
Everyone is on the right track here with the idea of an ANTI JOIN. It's worth noting, however, that the answers proposed will not always produce the exact same results, and each solution has different performance implications. What MatBailie is proposing is how to do an ANTI JOIN; what Alexander is proposing is how to do an ANTI SEMI JOIN.
Alexander is more on the right track IMO as what we're looking for is an ANTI SEMI JOIN; a LEFT ANTI SEMI JOIN, to be specific, with your list of employeeIds from "somewhere" as the Left table and MailingList as the Right table.
An ANTI JOIN returns records that exist in this set that don't exist in that set. By set I'm referring to a table, view, subquery, etc. By "this" set I'm referring to the LEFT table and by "that" set I'm referring to RIGHT table. A SEMI JOIN is where only one matching row from the LEFT table is returned. In other words, A SEMI join returns a distinct set.
Now I get a list of EmployeeIds from somewhere
Using the sample data provided, let's say that by "somewhere" you are talking about a table. (I'm including the number 5 twice to demonstrate the difference between an ANTI JOIN and an ANTI SEMI JOIN.)
CREATE TABLE dbo.somewhere (employeeId int);
INSERT dbo.somewhere VALUES (1),(2),(3),(4),(5),(5);
You could do a LEFT ANTI JOIN using NOT IN or NOT EXISTS
-- ANTI JOIN USING NOT IN
SELECT somewhere.EmployeeId--, <other columns>
FROM dbo.somewhere
WHERE somewhere.EmployeeId NOT IN (SELECT EmployeeId FROM dbo.MailingList); -- EXCLUDE IDs that are in MailingList
-- ANTI JOIN USING NOT EXISTS
SELECT somewhere.EmployeeId--, <other columns>
FROM dbo.somewhere
WHERE NOT EXISTS
(
SELECT EmployeeId
FROM dbo.MailingList ML
WHERE ML.EmployeeId = somewhere.employeeId
);
Note that each of these returns the number 5 twice. If you only needed it once, you would use EXCEPT to perform an ANTI SEMI JOIN like so:
SELECT somewhere.EmployeeId
FROM dbo.somewhere
EXCEPT -- SET OPERATOR (SET OPERATORS INCLUDE: UNION, UNION ALL, EXCEPT, INTERSECT)
SELECT EmployeeId
FROM dbo.MailingList; -- EXCLUDE IDs that are in MailingList
EXCEPT is a Set Operator like UNION and INTERSECT. Set operators return a unique result set. (The one exception to this being UNION ALL). If you wanted a unique result set using NOT IN or NOT EXISTS you would also need to include DISTINCT or GROUP BY all the columns which you want to be unique.
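The difference in result shape between NOT EXISTS and EXCEPT can be checked quickly on a small engine. The sketch below uses Python's built-in sqlite3 with the sample data from this thread; sqlite3 is only a stand-in here, not SQL Server, so the dbo schema prefix is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE MailingList (EmployeeId INTEGER, Email TEXT);
    INSERT INTO MailingList VALUES
        (1, 'bob@co.com'), (2, 'jill@co.com'),
        (3, 'frank@co.com'), (4, 'fred@co.com');
    CREATE TABLE somewhere (employeeId INTEGER);
    INSERT INTO somewhere VALUES (1), (2), (3), (4), (5), (5);
    """
)

# NOT EXISTS keeps every unmatched row from the left table, duplicates included.
anti_join = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.employeeId
        FROM somewhere s
        WHERE NOT EXISTS (
            SELECT 1 FROM MailingList ml WHERE ml.EmployeeId = s.employeeId
        )
        """
    )
]

# EXCEPT is a set operator, so the unmatched value comes back only once.
anti_semi_join = [
    row[0]
    for row in conn.execute(
        "SELECT employeeId FROM somewhere EXCEPT SELECT EmployeeId FROM MailingList"
    )
]

print(anti_join)
print(anti_semi_join)
```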
If by "somewhere" you are talking about a comma-delimited list or XML or JSON file/fragment then you would first need to turn that list, XML, JSON or whatever into the LEFT table. Using SQL Server 2016's string_split (or another "splitter" function) you would do this:
-- "somewhere" is a csv, list or array
DECLARE @somewhere varchar(1000) = '1,2,3,4,5';
-- ANTI JOIN WITH NOT IN
SELECT EmployeeId = [value]
FROM string_split(@somewhere, ',')
WHERE [value] NOT IN (SELECT EmployeeId FROM dbo.MailingList);
-- ANTI SEMI JOIN WITH NOT IN
SELECT DISTINCT EmployeeId = [value]
FROM string_split(@somewhere, ',')
WHERE [value] NOT IN (SELECT EmployeeId FROM dbo.MailingList);
-- ANTI SEMI JOIN WITH EXCEPT
SELECT EmployeeId = [value]
FROM string_split(@somewhere, ',')
EXCEPT
SELECT EmployeeId FROM dbo.MailingList;
GO
.. or if it were XML, one option would look like this:
-- "somewhere" is XML
DECLARE @somewhere XML =
'<employees>
<employee>1</employee>
<employee>2</employee>
<employee>3</employee>
<employee>4</employee>
<employee>5</employee>
</employees>';
-- ANTI SEMI JOIN using EXCEPT
SELECT employeeId = emp.id.value('.', 'int')
FROM (VALUES (@somewhere)) s(empid)
CROSS APPLY empid.nodes('/employees/employee') emp(id)
EXCEPT
SELECT employeeId
FROM dbo.MailingList;
Lastly, you want an index on EmployeeId in your mailing list table. In my examples you would want an index on dbo.somewhere as well. If you are doing SEMI joins, then you want those indexes to be unique.
You can use the EXCEPT operator.
Example:
SELECT *
FROM
(
SELECT 1 AS Id
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
) AS t
EXCEPT
SELECT EmployeeId FROM MailingList
You don't seem to be asking about the logic, just about how to "best" represent the set {1,2,3,4,5}.
One answer is a temporary table, as you mentioned.
Another is a sub-query or a CTE with a bunch of UNION ALL statements.
Another would be to use VALUES (1), (2), (3), (4), (5) in either a CTE or sub-query.
But there is a glaring point here. If you have a table with an EmployeeID field, then surely you have an Employee table? That being the case you should be able to "derive" your set of 5 employees from there?
(SELECT id FROM employee WHERE manager_id = 666)
or...
(SELECT id FROM employee WHERE staff_ref IN ('111', '222', '333', '444', '555'))
etc, etc...
EDIT:
As for the actual logic once you have your set representing your 5 employees, you can do an "anti-join" using LEFT JOIN and IS NULL...
SELECT
Employee.*
FROM
Employee
LEFT JOIN
MailingList
ON MailingList.list_id = 789
AND MailingList.employee_id = Employee.id
WHERE
Employee.manager_id = 666
AND MailingList.employee_id IS NULL
=> Employees with manager #666 but not on mailing list #789
I am using T-SQL to return records from the database (there are multiple criteria, but the list of unique ID's must be distinct), in short, the T-SQL looks like this:
SELECT
t1.ID,
[query1mark] = 1
WHERE criteria1 = 1
UNION
SELECT
t1.ID,
[query2mark] = 1
WHERE criteria2 = 1
I would like to be able to use Union to de-dupe on the ID field (the data has to be unique on the ID field), whilst retaining the derived column "query1mark" or "query2mark" to highlight which query it additionally came from. In my real-world case, there are 5 queries that need to be de-duped against each other, so I need an efficient solution if possible.
EDIT: Additionally, the results from the first query need to be prioritised over those from the second query, and the results from the second query need to be prioritised over those from the third query, as I understand, this feature is inherent when using Union, as it will only add records from below the Union statement.
Is Union the best solution for this, and if not, what can I use?
Thanks
What about this:
DECLARE @DataSource TABLE
(
[ID] INT
,[criteria] INT
);
INSERT INTO @DataSource ([ID], [criteria])
VALUES (1, 1)
,(1, 2)
,(2, 1)
,(3, 1)
,(3, 2)
,(4, 2);
WITH DataSource ([ID], [query_mark], [RowID]) AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria] ASC)
FROM @DataSource
)
SELECT [id], [query_mark]
FROM DataSource
WHERE [RowID] = 1;
The idea is to create a sequence over all duplicated elements within each group. The duplicates are ordered by the criteria field, but you can change the logic if you need to, for example to show the biggest criteria. The group is defined using the PARTITION BY [ID] clause, which means the items are ordered within each [ID] group. Then, in the select, we only need to show one record per group: [RowID] = 1.
You can use TOP 1 WITH TIES:
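The effect of ROW_NUMBER() with PARTITION BY [ID] ORDER BY [criteria], followed by the row-number-equals-1 filter, can be sketched in plain Python. This is an illustration only; first_per_id is a hypothetical helper name, and the data is the sample from the query above.

```python
# (ID, criteria) pairs, as inserted into the table variable above.
data_source = [(1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (4, 2)]


def first_per_id(rows):
    """For each ID keep the row with the smallest criteria (row number 1)."""
    best = {}
    for row_id, criteria in rows:
        # A lower criteria value wins, like ORDER BY [criteria] ASC.
        if row_id not in best or criteria < best[row_id]:
            best[row_id] = criteria
    return sorted(best.items())


query_marks = first_per_id(data_source)
print(query_marks)
```

An ID that appears under both criteria is reported once, tagged with the higher-priority (lower-numbered) query.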
SELECT top 1 with ties * FROM yourtable
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria])
I have data from a SQL query that looks like the table on the left but I want to use SSRS to make it look like the table on the right. Is this possible?
I'm willing to modify the SQL if necessary; assume it is currently of the form
SELECT
Name
,[Capital Visited]
From
Trips
You can use CROSS APPLY for this combined with XML PATH functions as follows.
It looks a bit lengthy but it includes sample data matching your sample so you can test it before you apply it to your real tables.
DECLARE @trips TABLE([Name] varchar(50), [Capital Visited] varchar(50))
INSERT INTO @trips
VALUES
('Joe', 'London'),
('Fred', 'Tokyo'),
('Joe', 'Berlin'),
('Bob', 'Paris'),
('Fred', 'London'),
('Fred', 'Madrid'),
('Bob', 'Rome')
/* Uncomment below to check the table data looks as expected */
-- SELECT [Name] ,[Capital Visited] From @trips
SELECT DISTINCT
[Name], cx.Capitals
FROM
@trips t
CROSS APPLY ( SELECT Stuff(
(
SELECT ', ' + [Capital Visited]
FROM @trips WHERE [Name] = t.[Name]
FOR XML PATH('')
), 1, 2, '') AS Capitals
) cx
This gives you the following results
Name Capitals
Bob Paris, Rome
Fred Tokyo, London, Madrid
Joe London, Berlin
Rather than explaining the answer in full, I'll point to a reasonable explanation: "How Stuff and 'For Xml Path' work in SQL Server".
As an SSRS solution, you can combine Join and LookupSet to concatenate all the values in a group. Within a table grouped by NAME, use this expression:
=Join(LookupSet(Fields!NAME.Value, Fields!NAME.Value, Fields!CAPITAL_VISITED.Value, "DataSet1"), ", ")
LookupSet() gets you all the values from a dataset based on a primary key and returns them as an array. (It's typically used to get values from a different dataset, but in this case we use just one.) Join() then concatenates all the values found in the array using a chosen delimiter (", ").
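What the Join(LookupSet(...)) expression computes for each group can be sketched in Python. This is an analogy only, not SSRS code: lookup_set is a hypothetical stand-in for LookupSet(), and the rows are the sample data from the SQL answer above.

```python
trips = [
    ("Joe", "London"), ("Fred", "Tokyo"), ("Joe", "Berlin"), ("Bob", "Paris"),
    ("Fred", "London"), ("Fred", "Madrid"), ("Bob", "Rome"),
]


def lookup_set(key, rows):
    """Return every capital whose name matches the key, in dataset order."""
    return [capital for name, capital in rows if name == key]


# One entry per group (name), with its values joined by the chosen delimiter.
capitals_by_name = {
    name: ", ".join(lookup_set(name, trips))
    for name in sorted({name for name, _ in trips})
}
print(capitals_by_name)
```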