SQL Server 2008: produce table of unique entries - sql-server

I have the following problem. I have a table with a few hundred thousand records, which has the following identifiers (for simplicity)
MemberID SchemeName BenefitID BenefitAmount
10 ABC 1 10000
10 ABC 1 2000
10 ABC 2 5000
10 A.B.C 3 11000
What I need to do is to convert this into a single record that looks like this:
MemberID SchemeName B1 B2 B3
10 ABC 12000 5000 11000
The problem, of course, is that I need to group by SchemeName. For most records this won't be an issue, but for some the SchemeName is spelled inconsistently (for example "ABC" and "A.B.C"), so the rows wouldn't be grouped together properly. I don't particularly care whether the converted table uses "ABC" or "A.B.C" as the scheme name, as long as it uses just one of them.
I'd love to hear your suggestions.
Thanks
Karl
(Using SQL Server 2008)

Based on the limited info in the original question, give this a try:
DECLARE @YourTable table(MemberID int, SchemeName varchar(10), BenefitID int, BenefitAmount int)
INSERT INTO @YourTable VALUES (10,'ABC' ,1,10000)
INSERT INTO @YourTable VALUES (10,'ABC' ,1,2000)
INSERT INTO @YourTable VALUES (10,'ABC' ,2,5000)
INSERT INTO @YourTable VALUES (10,'A.B.C',3,11000)
INSERT INTO @YourTable VALUES (11,'ABC' ,1,10000)
INSERT INTO @YourTable VALUES (11,'ABC' ,1,2000)
INSERT INTO @YourTable VALUES (11,'ABC' ,2,5000)
INSERT INTO @YourTable VALUES (11,'A.B.C',3,11000)
INSERT INTO @YourTable VALUES (10,'mnp',3,11000)
INSERT INTO @YourTable VALUES (11,'mnp' ,1,10000)
INSERT INTO @YourTable VALUES (11,'mnp' ,1,2000)
INSERT INTO @YourTable VALUES (11,'mnp' ,2,5000)
INSERT INTO @YourTable VALUES (11,'mnp',3,11000)
-- strip the dots so 'A.B.C' and 'ABC' fall into the same group
SELECT
MemberID, REPLACE(SchemeName,'.','') AS SchemeName
,SUM(CASE WHEN BenefitID=1 THEN BenefitAmount ELSE 0 END) AS B1
,SUM(CASE WHEN BenefitID=2 THEN BenefitAmount ELSE 0 END) AS B2
,SUM(CASE WHEN BenefitID=3 THEN BenefitAmount ELSE 0 END) AS B3
FROM @YourTable
GROUP BY MemberID, REPLACE(SchemeName,'.','')
ORDER BY MemberID, REPLACE(SchemeName,'.','')
OUTPUT:
MemberID SchemeName B1 B2 B3
----------- ----------- ----------- ----------- -----------
10 ABC 12000 5000 11000
10 mnp 0 0 11000
11 ABC 12000 5000 11000
11 mnp 12000 5000 11000
(4 row(s) affected)

It looks like PIVOT can help here.
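A minimal sketch of what a PIVOT version might look like, assuming the four sample rows from the question and assuming that stripping the dots from SchemeName is enough to merge the duplicate names. The table variable @Benefits and the aliases src and p are illustrative names, not part of the original question.

```sql
-- Illustrative table variable holding the sample rows from the question.
DECLARE @Benefits table (MemberID int, SchemeName varchar(10), BenefitID int, BenefitAmount int);
INSERT INTO @Benefits VALUES (10, 'ABC',   1, 10000);
INSERT INTO @Benefits VALUES (10, 'ABC',   1, 2000);
INSERT INTO @Benefits VALUES (10, 'ABC',   2, 5000);
INSERT INTO @Benefits VALUES (10, 'A.B.C', 3, 11000);

-- PIVOT groups by every column of src that is not named in the PIVOT clause,
-- so the scheme name is normalised inside the derived table first.
SELECT MemberID,
       SchemeName,
       ISNULL([1], 0) AS B1,   -- a benefit with no rows pivots to NULL
       ISNULL([2], 0) AS B2,
       ISNULL([3], 0) AS B3
FROM (
    SELECT MemberID,
           REPLACE(SchemeName, '.', '') AS SchemeName,
           BenefitID,
           BenefitAmount
    FROM @Benefits
) AS src
PIVOT (
    SUM(BenefitAmount) FOR BenefitID IN ([1], [2], [3])
) AS p
ORDER BY MemberID, SchemeName;
```

For the sample rows this should return a single row: 10, ABC, 12000, 5000, 11000. The list of BenefitID values has to be written out in the IN clause, so a data set with many benefit types would need a longer list or dynamic SQL.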

The scheme name issue will have to be dealt with manually, since the names can differ in arbitrary ways. First and foremost, this indicates a problem with how you are allowing data entry: you should not have these duplicate scheme names.
However, since you do, I think the best thing is to create a cross-reference table with two columns, something like recordedscheme and controllingscheme. Select the distinct scheme names to create a list of possible values and insert them into the first column. Go through the list and decide which scheme name you want to use for each one (most will be the same as the recorded name). Once this is done, you can join to this table in your query. This will work for the current dataset; however, you need to fix whatever is causing the scheme names to get duplicated before going further. You will also want to set things up so that when a scheme name is added, your table is populated with the new name in both columns. Then, if it later turns out that a new one is a duplicate, all you have to do is write a quick update to the second column showing which one it really is, and you are done.
The alternative is to actually update the bad scheme names in the data set to the correct one. Depending on how many records you have to update and in how many tables, this might be a performance issue. This, too, only helps with querying the data right now and doesn't address how to fix the data going forward.
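The cross-reference approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the table variables @Benefits and @SchemeXref and the column names RecordedScheme and ControllingScheme are made up for the example, and in practice the cross-reference would be a permanent table.

```sql
-- Illustrative copy of the sample rows from the question.
DECLARE @Benefits table (MemberID int, SchemeName varchar(10), BenefitID int, BenefitAmount int);
INSERT INTO @Benefits VALUES (10, 'ABC',   1, 10000);
INSERT INTO @Benefits VALUES (10, 'ABC',   1, 2000);
INSERT INTO @Benefits VALUES (10, 'ABC',   2, 5000);
INSERT INTO @Benefits VALUES (10, 'A.B.C', 3, 11000);

-- Hypothetical cross-reference: one row per name as it was recorded.
DECLARE @SchemeXref table (
    RecordedScheme    varchar(10) NOT NULL PRIMARY KEY,
    ControllingScheme varchar(10) NOT NULL
);

-- Seed it so every recorded name initially maps to itself.
INSERT INTO @SchemeXref (RecordedScheme, ControllingScheme)
SELECT DISTINCT SchemeName, SchemeName
FROM @Benefits;

-- Manual review step: point each duplicate at the name you want to keep.
UPDATE @SchemeXref
SET ControllingScheme = 'ABC'
WHERE RecordedScheme = 'A.B.C';

-- Group by the controlling name instead of the recorded one.
SELECT b.MemberID,
       x.ControllingScheme AS SchemeName,
       SUM(CASE WHEN b.BenefitID = 1 THEN b.BenefitAmount ELSE 0 END) AS B1,
       SUM(CASE WHEN b.BenefitID = 2 THEN b.BenefitAmount ELSE 0 END) AS B2,
       SUM(CASE WHEN b.BenefitID = 3 THEN b.BenefitAmount ELSE 0 END) AS B3
FROM @Benefits AS b
INNER JOIN @SchemeXref AS x
        ON x.RecordedScheme = b.SchemeName
GROUP BY b.MemberID, x.ControllingScheme
ORDER BY b.MemberID, x.ControllingScheme;
```

With the sample rows this should give one row: 10, ABC, 12000, 5000, 11000. Unlike a REPLACE on the dots, the mapping also covers duplicates that differ in ways no string function would catch, at the cost of maintaining the table by hand.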

Related

How does updating rows from a subquery work in SQL Server?

How does SQL Server know which rows to update when updating from a subquery rather than a table?
Say I have a table with three columns defined like below:
CREATE TABLE A (
AId int IDENTITY (1,1) PRIMARY KEY,
AExternalId int NULL,
ASequence int NULL
)
I want to update the column ASequence by sequential numbers within groups of AExternalId where ASequence is NULL.
For example, having inserted four different AExternalId's (or groups),
INSERT INTO A ([AExternalId]) VALUES (1001)
INSERT INTO A ([AExternalId]) VALUES (1002)
INSERT INTO A ([AExternalId]) VALUES (1002)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 10)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 20)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 30)
the table looks like this:
AId         AExternalId ASequence
----------- ----------- -----------
1           1001        NULL
2           1002        NULL
3           1002        NULL
4           1003        NULL
5           1003        NULL
6           1003        NULL
7           1004        10
8           1004        20
9           1004        30
After the update, the table should look like this:
AId         AExternalId ASequence
----------- ----------- -----------
1           1001        1
2           1002        1
3           1002        2
4           1003        1
5           1003        2
6           1003        3
7           1004        10
8           1004        20
9           1004        30
Every AId within a group of AExternalId values now has a sequential number (except for the rows that already had a sequence).
I can achieve this by running the following query:
UPDATE t1
SET t1.[ASequence] = t1.[CalcSequence]
FROM (
SELECT AId, AExternalId, ASequence, ROW_NUMBER() OVER (PARTITION BY [AExternalId] ORDER BY [AExternalId], [AId] ASC) AS [CalcSequence]
FROM [A]
WHERE (ASequence IS NULL) AND (AExternalId IS NOT NULL)
) t1
The question is, why (or rather how) does this work as there is no table specified and no condition for the update?
I have been taught that an update without condition will update all rows in a table but in this case there is no table specified (only in the subquery).
Does this work because I am updating the resulting rows from the inner select? If so, how are rows "matched" so that the update is made on the correct row?
Is this an example of a Correlated Subquery?
I've tried to read up on those but failed to understand if this applies here. Many texts on Correlated Subqueries talk about performance issues and that the correlated subquery requires values from its outer query which does not seem to fit this example.
An alternative way of achieving the same result is by using an INNER JOIN:
UPDATE t1
SET t1.[ASequence] = t2.[CalcSequence]
FROM [A] t1
INNER JOIN (
SELECT AId, AExternalId, ASequence, ROW_NUMBER() OVER (PARTITION BY [AExternalId] ORDER BY [AExternalId], [AId] ASC) AS [CalcSequence]
FROM [A]
WHERE (ASequence IS NULL) AND (AExternalId IS NOT NULL)
) t2 ON t2.AId = t1.AId
I have compared the results of both queries and they are identical. Performance-wise, the first query seems to be a bit faster and consumes less resources.
The second query (with inner join) feels more "familiar", more "correct" but I would really like to understand how the first one works.
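One way to read the first query, offered as a sketch rather than an authoritative explanation: SQL Server treats the derived table t1 much like an updatable view. Every row of t1 comes from exactly one row of A, so assigning to t1.ASequence writes to that underlying row, and only the rows that pass the inner WHERE are touched, which is why no outer condition is needed. It is not a correlated subquery, since the inner SELECT never refers to the outer statement. The same update can be written with a CTE, here using a table variable @A as an illustrative stand-in for the real table:

```sql
-- Illustrative stand-in for table A, with a reduced set of rows.
DECLARE @A table (
    AId         int IDENTITY(1,1) PRIMARY KEY,
    AExternalId int NULL,
    ASequence   int NULL
);
INSERT INTO @A (AExternalId) VALUES (1002);
INSERT INTO @A (AExternalId) VALUES (1002);
INSERT INTO @A (AExternalId, ASequence) VALUES (1004, 10);

-- The CTE exposes a column of the base table (ASequence) next to a computed
-- column (CalcSequence). Updating the CTE updates the base rows it selects.
WITH t1 AS (
    SELECT ASequence,
           ROW_NUMBER() OVER (PARTITION BY AExternalId ORDER BY AId) AS CalcSequence
    FROM @A
    WHERE ASequence IS NULL
      AND AExternalId IS NOT NULL
)
UPDATE t1
SET ASequence = CalcSequence;

-- Rows 1 and 2 now carry 1 and 2; row 3 keeps its existing value 10.
SELECT AId, AExternalId, ASequence
FROM @A
ORDER BY AId;
```

This form only works while the assigned columns all belong to a single base table; the computed CalcSequence column can be read but not assigned to.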

Find duplicate rows and show only the earliest

I have the following table:
respid, uploadtime
I need a query that finds every respid that occurs more than once and shows all of its records except the latest one (by uploadtime).
Example:
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01
4 2016-01-01
In this case the query should return four records (the latest, 4 2016-01-01, is excluded).
Thank you very much.
Use ROW_NUMBER:
WITH cte AS (
SELECT respid, uploadtime,
ROW_NUMBER() OVER (PARTITION BY respid ORDER BY uploadtime DESC) rn
FROM yourTable
)
SELECT respid, uploadtime
FROM cte
WHERE rn > 1
ORDER BY respid, uploadtime;
The logic here is to show all records except those having the first row number value, which would be the latest records for each respid group.
If I interpreted your question correctly, then you want to see all records where respid occurs multiple times, but exclude the last duplicate.
Translated to SQL, this could read "show all records that have a later record for the same respid". That is exactly what the solution below does: for every row in the result, a later record with the same respid must exist.
Sample data
declare @MyTable table
(
respid int,
uploadtime date
);
insert into @MyTable (respid, uploadtime) values
(4, '2014-01-01'),
(4, '2014-06-01'),
(4, '2015-01-01'),
(4, '2015-06-01'),
(4, '2016-01-01'), --> last duplicate of respid=4, not part of result
(5, '2020-01-01'); --> has no duplicate, not part of result
Solution
select mt.respid, mt.uploadtime
from @MyTable mt
where exists ( select top 1 'x'
from @MyTable mt2
where mt2.respid = mt.respid
and mt2.uploadtime > mt.uploadtime );
Result
respid uploadtime
----------- ----------
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01

SQL Server - Update Column with Handling Duplicate and Unique Rows Based Upon Timestamp

I'm working with SQL Server 2005 and looking to export some data from a table I have. However, before doing that I need to update a status column based on a field called "VisitNumber", which can contain multiple entries with the same value. My table is set up in the following manner. There are more columns, but I am only showing what's relevant to my issue.
ID Name MyReport VisitNumber DateTimeStamp Status
-- --------- -------- ----------- ----------------------- ------
1 Test John Test123 123 2014-01-01 05.00.00.000
2 Test John Test456 123 2014-01-01 07.00.00.000
3 Test Sue Test123 555 2014-01-02 08.00.00.000
4 Test Ann Test123 888 2014-01-02 09.00.00.000
5 Test Ann Test456 888 2014-01-02 10.00.00.000
6 Test Ann Test789 888 2014-01-02 11.00.00.000
Field Notes
ID column is a unique ID in incremental numbers
MyReport is a text value and can actually be thousands of characters. Shortened for simplicity. In my scenario the text would be completely different
Rest of fields are varchar
My Goal
I need to set the status according to two conditions:
* If a VisitNumber occurs only once, set the status column to "F"
* If a VisitNumber occurs more than once, set "F" only on the row with the earliest timestamp and set "A" on the others
So going back to my table, here is the expectation
ID Name MyReport VisitNumber DateTimeStamp Status
-- --------- -------- ----------- ----------------------- ------
1 Test John Test123 123 2014-01-01 05.00.00.000 F
2 Test John Test456 123 2014-01-01 07.00.00.000 A
3 Test Sue Test123 555 2014-01-02 08.00.00.000 F
4 Test Ann Test123 888 2014-01-02 09.00.00.000 F
5 Test Ann Test456 888 2014-01-02 10.00.00.000 A
6 Test Ann Test789 888 2014-01-02 11.00.00.000 A
I was thinking I could handle this by splitting out each kind of duplicate (groups of 2, 3, 4, 5 and so on), then updating every other row (or every 3rd, 4th, 5th row), then deleting those from the original table and combining them again to export the data in SSIS. But I suspect there is a much more efficient way of handling it.
Any thoughts? I can accomplish this by updating the table directly in SQL for this status column and then export normally through SSIS. Or if there is some way I can manipulate the column for the exact conditions I need, I can do it all in SSIS. I am just not sure how to proceed with this.
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY VisitNumber ORDER BY DateTimeStamp) rn from MyTable
)
UPDATE cte
SET [status] = (CASE WHEN rn = 1 THEN 'F' ELSE 'A' END)
I put together a test script to check the results. For your purposes, use the update statements and replace the temp table with your table name.
create table #temp1 (id int, [name] varchar(50), myreport varchar(50), visitnumber varchar(50), dts datetime, [status] varchar(1))
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (1,'Test John','Test123','123','2014-01-01 05:00')
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (2,'Test John','Test456','123','2014-01-01 07:00')
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (3,'Test Sue','Test123','555','2014-01-01 08:00')
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (4,'Test Ann','Test123','888','2014-01-01 09:00')
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (5,'Test Ann','Test456','888','2014-01-01 10:00')
insert into #temp1 (id,[name],myreport,visitnumber, dts) values (6,'Test Ann','Test789','888','2014-01-01 11:00')
select * from #temp1;
update #temp1 set status = 'F'
where id in (
select id from #temp1 t1
join (select min(dts) as mindts, visitnumber
from #temp1
group by visitNumber) t2
on t1.visitnumber = t2.visitnumber
and t1.dts = t2.mindts)
update #temp1 set status = 'A'
where id not in (
select id from #temp1 t1
join (select min(dts) as mindts, visitnumber
from #temp1
group by visitNumber) t2
on t1.visitnumber = t2.visitnumber
and t1.dts = t2.mindts)
select * from #temp1;
drop table #temp1
Hope this helps

Auto running number ID with format xxxx/year number (9999/12) in SQL Server stored procedure

I have one table (Stock_ID, Stock_Name). I want to write a stored procedure in SQL Server that generates Stock_ID as a running number in the format xxxx/12 (xxxx = a number from 0001 to 9999; 12 = the last 2 digits of the current year).
When the year changes, the running number should reset, e.g. to 0001/13.
What do you intend to do when you hit more than 9999 in a single year? It may sound impossible, but over the years I've had to deal with many "it will never happen" data-related design mess-ups from code-first, design-later developers. These are major pains, depending on how many places you need to fix these items, which are usually primary keys and foreign keys used all over.
This looks like a system requirement to SHOW the data this way, but it is the developer's responsibility to design the internals of the application. The way you store it and the way you display it don't need to be identical. I'd split it into two columns, using an int for the number portion and a tinyint for the 2-digit year portion. You can use a computed column for quick and easy display (persist it and index it if necessary), where you pad with leading zeros and add the slash. Throw in a check constraint on the year portion to make sure it stays within a reasonable range. You can make the number portion an identity and have a job reseed it back to 1 every New Year's Eve.
try it out:
--drop table YourTable
--create the basic table
CREATE TABLE YourTable
(YourNumber int identity(1,1) not null
,YourYear tinyint not null
,YourData varchar(10)
,CHECK (YourYear>=12 and YourYear<=25) --optional check constraint
)
--add the persisted computed column
ALTER TABLE YourTable ADD YourFormattedNumber AS ISNULL(RIGHT('0000'+CONVERT(varchar(10),YourNumber),4)+'/'+RIGHT(CONVERT(varchar(10),YourYear),2),'/') PERSISTED
--make the persisted computed column the primary key
ALTER TABLE YourTable ADD CONSTRAINT PK_YourTable PRIMARY KEY CLUSTERED (YourFormattedNumber)
sample data:
--insert rows in 2012
insert into YourTable values (12,'aaaa')
insert into YourTable values (12,'bbbb')
insert into YourTable values (12,'cccc')
--new years eve job run this
DBCC CHECKIDENT (YourTable, RESEED, 0)
--insert rows in 2013
insert into YourTable values (13,'aaaa')
insert into YourTable values (13,'bbbb')
select * from YourTable order by YourYear,YourNumber
OUTPUT:
YourNumber YourYear YourData YourFormattedNumber
----------- -------- ---------- -------------------
1 12 aaaa 0001/12
2 12 bbbb 0002/12
3 12 cccc 0003/12
1 13 aaaa 0001/13
2 13 bbbb 0002/13
(5 row(s) affected)
to handle the possibility of more than 9999 rows per year try a different computed column calculation:
CREATE TABLE YourTable
(YourNumber int identity(9998,1) not null --<<<notice the identity starting point, so it hits 9999 quicker for this simple test
,YourYear tinyint not null
,YourData varchar(10)
)
--handles more than 9999 values per year
ALTER TABLE YourTable ADD YourFormattedNumber AS ISNULL(RIGHT(REPLICATE('0',CASE WHEN LEN(CONVERT(varchar(10),YourNumber))<4 THEN 4 ELSE 1 END)+CONVERT(varchar(10),YourNumber),CASE WHEN LEN(CONVERT(varchar(10),YourNumber))<4 THEN 4 ELSE LEN(CONVERT(varchar(10),YourNumber)) END)+'/'+RIGHT(CONVERT(varchar(10),YourYear),2),'/') PERSISTED
ALTER TABLE YourTable ADD CONSTRAINT PK_YourTable PRIMARY KEY CLUSTERED (YourFormattedNumber)
sample data:
insert into YourTable values (12,'aaaa')
insert into YourTable values (12,'bbbb')
insert into YourTable values (12,'cccc')
DBCC CHECKIDENT (YourTable, RESEED, 0) --new years eve job run this
insert into YourTable values (13,'aaaa')
insert into YourTable values (13,'bbbb')
select * from YourTable order by YourYear,YourNumber
OUTPUT:
YourNumber YourYear YourData YourFormattedNumber
----------- -------- ---------- --------------------
9998 12 aaaa 9998/12
9999 12 bbbb 9999/12
10000 12 cccc 10000/12
1 13 aaaa 0001/13
2 13 bbbb 0002/13
(5 row(s) affected)
This might help:
DECLARE @tbl TABLE(Stock_ID INT,Stock_Name VARCHAR(100))
INSERT INTO @tbl
SELECT 1,'Test'
UNION ALL
SELECT 2,'Test2'
DECLARE @ShortDate VARCHAR(2)=RIGHT(CAST(YEAR(GETDATE()) AS VARCHAR(4)),2)
;WITH CTE AS
(
SELECT
CAST(ROW_NUMBER() OVER(ORDER BY tbl.Stock_ID) AS VARCHAR(4)) AS RowNbr,
tbl.Stock_ID,
tbl.Stock_Name
FROM
@tbl AS tbl
)
SELECT
REPLICATE('0', 4-LEN(RowNbr))+CTE.RowNbr+'/'+@ShortDate AS YourColumn,
CTE.Stock_ID,
CTE.Stock_Name
FROM
CTE
From memory, this is a way to get the next id:
declare @maxid int
declare @yy char(2)
-- last two digits of the current year; YEAR() returns an int, so convert it before taking RIGHT
select @yy = right(convert(varchar(4), year(getdate())), 2)
-- MAX over zero rows returns NULL, so ISNULL makes the first id of a new year 1
select @maxid = isnull(max(convert(int, substring(Stock_Id, 1, 4))), 0) + 1
from YourTable -- placeholder: use your own table name here
where substring(Stock_Id, 6, 2) = @yy
declare @nextid varchar(7)
select @nextid = right('0000' + convert(varchar(10), @maxid), 4) + '/' + @yy

Select Max from each Subset

I'm banging my head here. I feel pretty stupid because I'm sure I've done something like this before, but can't for the life of me remember how. One of those days I guess >.<
Say I have the following data: ---> and a query which returns this: ---> But I want this:
ID FirstID ID FirstID ID FirstID
-- ------- -- ------- -- -------
1 1 1 1 7 1
2 1 3 3 3 3
3 3 4 4 6 4
4 4 5 5 5 5
5 5
6 4
7 1
Notice that my query returns the records where ID = FirstID, but I want it to return the Max(ID) for each subset of unique FirstID. Sounds simple enough right? That's what I thought, but I keep getting back just record #7. Here's my query (the one that returns the second block of figures above) with some test code to make your life easier. I need this to give me the results in the far right block. It should be noted that this is a self-joining table where FirstID is a foreign key to ID. Thanks :)
declare @MyTable table (ID int, FirstID int)
insert into @MyTable values (1,1),(2,1),(3,3),(4,4),(5,5),(6,4),(7,1)
select ID, FirstID
from @MyTable
where ID = FirstID
Does this work?
declare @MyTable table (ID int, FirstID int)
insert into @MyTable values (1,1),(2,1),(3,3),(4,4),(5,5),(6,4),(7,1)
Select FirstID, Max (Id) ID
From @MyTable
Group BY FirstID
Results in
FirstID ID
----------- -----------
1 7
3 3
4 6
5 5
In SQL Server 2005 and later, aggregate functions such as MAX, MIN, SUM and COUNT accept an OVER (PARTITION BY ...) clause.
Please try the following example:
select
Distinct FirstID, Max(ID) OVER (PARTITION BY FirstID) MaxID
from @MyTable
You can find an example at http://www.kodyaz.com/t-sql/sql-count-function-with-partition-by-clause.aspx
Following your comment, I modified the same query so that the rows and columns come out in exactly the order you asked for:
select Distinct
Max(ID) OVER (PARTITION BY FirstID) ID,
FirstID
from @MyTable
order by FirstID
