Change where statement to exclude duplicates

I have a stored procedure that I use to import orders into a survey software's database. Currently it imports all the orders from the previous day, but a recent change in the business plan requires these surveys to be sent out hourly. The imports come from a view in a different database. attribute_1 is the order number, which will be unique per survey, so we can use that to limit the imports.
How do I change this to not pull in duplicates?
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
Insert INTO SurveyTable
(SurveyTable.firstname,
SurveyTable.token,
SurveyTable.email,
SurveyTable.emailstatus,
SurveyTable.language,
SurveyTable.remindersent,
SurveyTable.attribute_1,
SurveyTable.attribute_2)
select
location,
cast( '' as xml ).value('xs:base64Binary(sql:column( "token" ) )', 'nvarchar(MAX)' ),
email,
emailstatus,
[language],
remindersent,
attribute_1,
attribute_2
from
(
select
RTRIM([Closed_Orders_For_Survey].[Location]) location,
crypt_gen_random(12) as token,
[Closed_Orders_For_Survey].[Email] email,
'OK' emailstatus,
'en' [language],
'N' remindersent,
[Closed_Orders_For_Survey].[Order Number] attribute_1,
CONVERT(VARCHAR(10), [Closed_Orders_For_Survey].[Invoice Date],110) attribute_2
from
[Closed_Orders_For_Survey]
where
[Closed_Orders_For_Survey].[Order Date] >= dateadd(DAY, -1, Convert(date, GETDATE()))
) as x
END
P.S. The token is a generated unique string used in creating the survey URL. We decided not to use the order number as the token because this would be too predictable and would enable people to change their URL to fill out other users' surveys.

This is not perfectly optimal for very large data sets (you'd want to make sure you have good indexes on attribute_1), but you can do a 'where not exists' clause to filter out any orders from the inner query that already have been inserted:
select
location,
cast( '' as xml ).value('xs:base64Binary(sql:column( "token" ) )', 'nvarchar(MAX)' ),
email,
emailstatus,
[language],
remindersent,
attribute_1,
attribute_2
from
(
select
RTRIM([Closed_Orders_For_Survey].[Location]) location,
crypt_gen_random(12) as token,
[Closed_Orders_For_Survey].[Email] email,
'OK' emailstatus,
'en' [language],
'N' remindersent,
[Closed_Orders_For_Survey].[Order Number] attribute_1,
CONVERT(VARCHAR(10), [Closed_Orders_For_Survey].[Invoice Date],110) attribute_2
from
[Closed_Orders_For_Survey]
where
[Closed_Orders_For_Survey].[Order Date] >= dateadd(DAY, -1, Convert(date, GETDATE()))
) as x
where
not exists (
select attribute_1
from SurveyTable
where
attribute_1 = x.attribute_1
)
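
The lookup inside the not exists clause probes SurveyTable by attribute_1 once per candidate row, so an index on that column is what keeps the check cheap as the table grows. A minimal sketch, assuming attribute_1 is stored in an indexable (non-MAX) type; the index name is hypothetical:

```sql
-- Hypothetical index name; assumes attribute_1 is not a (MAX) or text type.
CREATE NONCLUSTERED INDEX IX_SurveyTable_attribute_1
    ON SurveyTable (attribute_1);
```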

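This might seem silly, but if the token is a primary key, then the attempt to insert a duplicate token will simply fail. Trap that specific error in your application (and vigorously ignore it) and you're golden.
If it must be all one transaction, you can always add an "and token not in (select token from table)" subselect (potentially inefficient, but should work).
Note that in the procedure above the token is generated randomly on every run, so for this to catch duplicate orders the key (and the subselect) would have to be on attribute_1, the order number, rather than on token.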
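
One caveat with relying on a key violation: in SQL Server a multi-row INSERT ... SELECT that hits a duplicate key fails as a whole, so the new orders in the same batch are rejected too. A hedged sketch of one way around that, assuming uniqueness is enforced on attribute_1 (the order number) and that the existing rows contain no duplicates; the index name is hypothetical:

```sql
-- Hypothetical index name. Creation fails if SurveyTable already
-- contains duplicate attribute_1 values.
CREATE UNIQUE NONCLUSTERED INDEX UX_SurveyTable_attribute_1
    ON SurveyTable (attribute_1)
    WITH (IGNORE_DUP_KEY = ON);
```

With IGNORE_DUP_KEY = ON, rows that would violate the index are discarded with a warning instead of an error, and the remaining rows of the same INSERT are still written.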

Related

Easier way to count users in T-SQL

I'm using this query in SQL Server 2016 to determine how many users have logged into my system.
The users.lastaccess column contains a unix timestamp, so I use DATEDIFF() to convert it to a yyyy-mm-dd hh:mm:ss date.
SELECT
COUNT(*) AS user_logins
FROM
(SELECT
ROW_NUMBER() OVER(ORDER BY lastaccess DESC) AS Row
FROM
users
WHERE
lastaccess > DATEDIFF(s, '1970-01-01 02:00:00', (SELECT Convert(DateTime, DATEDIFF(DAY, 0, GETDATE()))))
) AS t
The result is a simple number, e.g. 75, representing the number of users who have been authenticated on the system.

The following code returns the count of users. It uses cast to drop the time-of-day from the value returned by GetDate and uses ISO 8601 for the base date/time of the unix system.
select Count(*) as User_Logins
from Users
where LastAccess > DateDiff( s, '1970-01-01T02:00:00', Cast( GetDate() as Date ) );
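
The same comparison can be made easier to read by computing the cutoff once. A sketch, assuming LastAccess is an integer column holding seconds since the epoch and keeping the 02:00:00 offset from the question as-is; the variable names are illustrative:

```sql
-- Midnight at the start of the current day.
DECLARE @MidnightToday date = CAST(GETDATE() AS date);

-- Seconds between the base date/time used in the question and midnight today.
DECLARE @CutoffSeconds int =
    DATEDIFF(SECOND, '1970-01-01T02:00:00', @MidnightToday);

SELECT COUNT(*) AS User_Logins
FROM Users
WHERE LastAccess > @CutoffSeconds;
```

Because the column is compared with a plain integer value, an index on LastAccess can still be used for the filter.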

Why do you need a subquery and a ROW_NUMBER() window function at all? And what is that oddball date-based WHERE clause? What are you really checking for: the fact that lastaccess is not null/empty?
Just use:
SELECT
COUNT(*) AS user_logins
FROM
dbo.users
WHERE
-- your WHERE condition isn't very clear - please add code as needed
-- but *DO NOT* convert dates to string to compare! Compare proper dates!
lastaccess IS NOT NULL
Also: if you have a non-nullable, narrow, fixed-width column in your dbo.Users table, you should have a nonclustered index on this (e.g. on lastaccess - is that column nullable?) - that could speed things up quite a bit

Repeat Customers with multiple purchases on the same day count as 1

I am trying to wrap my head around this problem. I was asked to create a report that shows repeat customers in our database.
One of the requirements is that if a customer has more than 1 order on a specific date, it only counts as 1.
Then, if they have more than 1 purchase date, they count as a repeat customer.
Searching on here, I found this, which works for finding the customers with more than 1 purchase on a specific purchase date.
SELECT DISTINCT s.[CustomerName], cast(s.PurchaseDate as date) AS PurchaseDate
FROM Reports.vw_Repeat s WHERE s.PurchaseDate <> ''
GROUP BY s.[CustomerName] , cast(s.PurchaseDate as date)
HAVING COUNT(*) > 1;
This MSSQL code works like it should, by showing customers who had more than 1 purchase on the same date.
My problem is what the best approach would be to join this into another query (this is where I need help) that then shows a complete repeat customer list, where customers with more than 1 purchase date would be returned.
I am using MSSQL. Any help would be greatly appreciated.

You're close. You need to move DISTINCT into the COUNT in your HAVING clause, because you want to include only customers that have more than 1 distinct purchase date.
Also, group only by the customer, because the different dates have to be part of the same group for COUNT(DISTINCT ...) to work.
SELECT s.[CustomerName], COUNT(distinct cast(s.PurchaseDate as date))
FROM Reports.vw_Repeat s WHERE s.PurchaseDate <> ''
GROUP BY s.[CustomerName]
HAVING COUNT(distinct cast(s.PurchaseDate as date)) > 1;
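
To see why this counts purchase days rather than orders, here is a small self-contained sketch. The table variable and its rows are made-up sample data, not taken from the question's view:

```sql
-- Made-up sample data: Cust1 has two orders on one day,
-- Cust2 has three orders spread over two days.
DECLARE @Purchases TABLE (CustomerName varchar(25), PurchaseDate varchar(25));

INSERT INTO @Purchases (CustomerName, PurchaseDate) VALUES
    ('Cust1', '2017-11-16'),
    ('Cust1', '2017-11-16'),
    ('Cust2', '2017-11-16'),
    ('Cust2', '2017-11-17'),
    ('Cust2', '2017-11-17');

SELECT CustomerName,
       COUNT(DISTINCT CAST(PurchaseDate AS date)) AS PurchaseDays
FROM @Purchases
WHERE PurchaseDate <> ''
GROUP BY CustomerName
HAVING COUNT(DISTINCT CAST(PurchaseDate AS date)) > 1;
-- Returns one row: Cust2, 2
```

Cust1 is left out because both orders fall on the same date, which collapses to a single distinct purchase day.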

If you want to pass a parameter to a query and join the result, that's what table-valued functions are for. When you join it, you use CROSS APPLY or OUTER APPLY instead of an INNER JOIN or a LEFT JOIN.
Also, I think this goes without saying, but when you check if PurchaseDate is empty:
WHERE s.PurchaseDate <> ''
There could be issues there: it implies it's a varchar field instead of a datetime (yes?), and it doesn't handle null values. You might, at least, want to replace that with ISNULL(s.PurchaseDate, '') <> ''. If it's actually a datetime, use IS NOT NULL instead of <> ''.
(Edited to add sample data and DDL statements. I recommend adding these to SQL posts to assist answerers. Also, I made purchasedate a varchar instead of a datetime because of the string comparison in the query.)
https://technet.microsoft.com/en-us/library/ms191165(v=sql.105).aspx
CREATE TABLE company (company_name VARCHAR(25))
INSERT INTO company VALUES ('Company1'), ('Company2')
CREATE TABLE vw_repeat (customername VARCHAR(25), purchasedate VARCHAR(25), company VARCHAR(25))
INSERT INTO vw_repeat VALUES ('Cust1', '11/16/2017', 'Company1')
INSERT INTO vw_repeat VALUES ('Cust1', '11/16/2017', 'Company1')
INSERT INTO vw_repeat VALUES ('Cust2', '11/16/2017', 'Company2')
GO
-- CREATE FUNCTION must be the first statement in its batch, hence the GO above.
CREATE FUNCTION [dbo].tf_customers
(
@company varchar(25)
)
RETURNS TABLE AS RETURN
(
SELECT s.[CustomerName], cast(s.PurchaseDate as date) PurchaseDate
FROM vw_Repeat s
WHERE s.PurchaseDate <> '' AND s.Company = @company
GROUP BY s.[CustomerName] , cast(s.PurchaseDate as date)
HAVING COUNT(*) > 1
)
GO
SELECT *
FROM company c
CROSS APPLY dbo.tf_customers(c.company_name)

First, thanks to everyone for the help.
@MaxSzczurek suggested I use table-valued functions. After looking into this more, I ended up using just a temporary table first to get the DISTINCT purchase dates for each customer. I then loaded that into another temp table RIGHT JOINed to the main table. This gave me the result I was looking for. It's a little (lot) ugly, but it works.

A tricky query to update the table with values of another table

This question is an extension of my previous one available at Unable know the exception in query.
This time I have another table named Breaks.
It looks like the one below.
I'm able to get the column sum using the query below.
SELECT DATEADD(SECOND, SUM(DATEDIFF(SECOND, '19000101', TotalBreakTime)), '19000101') as t
FROM BreaksTable
where USERID = 0138039 AND CONVERT(Date, StartTime) = CONVERT(Date, GETDATE());
My second table looks like below.
This time, I want to update the breaks column with the sum of TotalBreakTime from the Breaks table (the first screenshot), restricted to rows for the current day.
I'm really unable to understand how to do this.

You need MERGE:
MERGE SecondTable as target
USING (
SELECT USERID,
SUM(DATEDIFF(SECOND, '19000101', TotalBreakTime)) as ColumnWithBreaksCount
FROM BreaksTable
where CONVERT(Date, StartTime) = CONVERT(Date, GETDATE())
GROUP BY USERID) as source
ON target.USERID = source.USERID
WHEN MATCHED THEN UPDATE
SET BREAKS = source.ColumnWithBreaksCount;
But this will work only if you have only one row for each USERID in your SecondTable; otherwise you need to add another key column to the ON part of the query to make the rows unique.
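
For comparison, the same update can be written as a joined UPDATE ... FROM, which is a different statement from MERGE but reaches the same result here. A sketch, assuming BREAKS is a numeric column holding seconds and reusing the table and column names from the question; TotalBreakSeconds is an illustrative alias:

```sql
UPDATE st
SET st.BREAKS = b.TotalBreakSeconds
FROM SecondTable AS st
INNER JOIN (
    -- Today's break time per user, in seconds.
    SELECT USERID,
           SUM(DATEDIFF(SECOND, '19000101', TotalBreakTime)) AS TotalBreakSeconds
    FROM BreaksTable
    WHERE CONVERT(date, StartTime) = CONVERT(date, GETDATE())
    GROUP BY USERID
) AS b
    ON b.USERID = st.USERID;
```

Rows in SecondTable for users with no breaks today are left untouched, because the inner join only matches users that appear in the aggregated subquery.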

SQL Server Selecting multiple LIKE values in Report Builder

I have a report where I'm trying to allow the user to select multiple predetermined LIKE values from a drop down list for their results in report builder. Is there a way I can do this? I have tried to use LIKE IN() but those two keywords don't seem to work together. Here is the code that I have. The code I have only works if I select one option.
DECLARE @Warehouse nvarchar(10)
DECLARE @Location nvarchar(10)
SET @Warehouse = 'Warehouse1'
SET @Location = 'IB'
SELECT Part
, Tag
, CurrentLocation AS 'Location'
, TotalQty
, DateTimeCreated
, datediff(hour, DateTimeCreated, getdate()) AS 'Hours in Location'
, [User] AS 'Last User'
FROM table1
WHERE datediff(hour, DateTimeCreated, getdate())>=1
AND Warehouse IN(@Warehouse)
AND(CurrentLocation LIKE '%' + @Location + '%')
ORDER BY 'Hours in Location' DESC, CurrentLocation

This would be best handled with a stored procedure. Here is a high-level description of how it would work. Each of the techniques can be learned at a low-level with some astute googling:
For your report dataset, call the stored procedure, pass your multi-valued parameter to a varchar parameter in the stored proc. For the rest of this answer, we'll call that parameter @MVList.
In the stored proc, @MVList will be received as a comma-delimited string of all the options the user chose in the parameter list.
Write your SELECT query from Table1, JOINing to a Table-Valued Function that splits the @MVList (google SQL Split Function to get pre-written code), and produces a table with one row for each value that the user chose.
For the JOIN condition, instead of equals, do a LIKE:
INNER JOIN MySplitFunction(@MVList, ',') AS SplitFunction
ON Table1.CurrentLocation LIKE '%'+SplitFunction.value+'%'
The result of the query will be the IN/LIKE result you are looking for.
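
On SQL Server 2016 or later (database compatibility level 130 and up), the built-in STRING_SPLIT function can stand in for a hand-written split function. A minimal sketch of the split-then-LIKE idea, with the caveat that it uses EXISTS rather than the INNER JOIN described above, so that a row matching two of the selected values is returned only once; the sample list 'IB,OB' is made up:

```sql
-- Made-up sample value; in the report this would be the multi-valued
-- parameter passed in as one comma-delimited string.
DECLARE @MVList nvarchar(200) = N'IB,OB';

SELECT t.Part, t.Tag, t.CurrentLocation
FROM table1 AS t
WHERE EXISTS (
    SELECT 1
    FROM STRING_SPLIT(@MVList, N',') AS s
    WHERE t.CurrentLocation LIKE N'%' + s.value + N'%'
);
```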

Thank you for your responses. This is what I ended up doing that fixed my problem.
SELECT Part
, Tag
, CurrentLocation AS 'Location'
, TotalQty
, DateTimeCreated
, datediff(hour, DateTimeCreated, getdate()) AS 'Hours in Location'
, [User] AS 'Last User'
FROM table1
WHERE datediff(hour, DateTimeCreated, getdate())>=1
AND Warehouse in (@Warehouse)
AND LEFT(CurrentLocation,2) IN(@Location)
ORDER BY 'Hours in Location' DESC, CurrentLocation

Row_Number() CTE Performance when using ORDER BY CASE

I have a table I'd like to do paging and ordering on and was able to get a query similar to the following to do the work (the real query is much more involved with joins and such).
WITH NumberedPosts (PostID, RowNum) AS
(
SELECT PostID, ROW_NUMBER() OVER (ORDER BY
CASE WHEN @sortCol = 'User' THEN [User] END DESC,
CASE WHEN @sortCol = 'Date' THEN [Date] END DESC,
CASE WHEN @sortCol = 'Email' THEN Email END DESC) as RowNum
FROM Post
)
INSERT INTO #temp(PostID, [User], [Date], Email)
SELECT Post.PostID, [User], [Date], Email
FROM Post
INNER JOIN NumberedPosts ON NumberedPosts.PostID = Post.PostID
WHERE NumberedPosts.RowNum BETWEEN @start and (@start + @pageSize)
The trouble is that performance is severely degraded when using the CASE expressions (at least a 10x slowdown) compared to a normal ORDER BY Date desc clause. Looking at the query plan, it appears that all columns are still being sorted, even if they do not match the @sortCol qualifier.
Is there a way to get this to execute at near 'native' speed? Is dynamic SQL the best candidate for this problem? Thanks!

There shouldn't be any reason to query the post table twice. You can go the dynamic route and address those issues on performance or create 3 queries determined by the @sortCol parameter. Redundant code except for the row_num and order by parts, but sometimes you give up maintainability if speed is critical.
If @sortCol = 'User'
Begin
Select... Order by User
End
If @sortCol = 'Date'
Begin
Select .... Order by Date
end
If @sortCol = 'Email'
Begin
Select... Order by Email
End

Better to do this with either three hardcoded queries (in appropriate IF statements based on @sortCol) or dynamic SQL.
You might be able to do a trick with UNION ALL of three different queries (based on a base CTE which does all your JOINs), where only one returns rows for @sortCol, but I'd have to profile it before recommending it:
WITH BasePosts(PostID, [User], [Date], Email) AS (
SELECT PostID, [User], [Date], Email
FROM Posts -- This is your complicated query
)
,NumberedPosts (PostID, [User], [Date], Email, RowNum) AS
(
SELECT PostID, [User], [Date], Email, ROW_NUMBER() OVER (ORDER BY [User] DESC)
FROM BasePosts
WHERE @sortCol = 'User'
UNION ALL
SELECT PostID, [User], [Date], Email, ROW_NUMBER() OVER (ORDER BY [Date] DESC)
FROM BasePosts
WHERE @sortCol = 'Date'
UNION ALL
SELECT PostID, [User], [Date], Email, ROW_NUMBER() OVER (ORDER BY Email DESC)
FROM BasePosts
WHERE @sortCol = 'Email'
)
INSERT INTO #temp(PostID, [User], [Date], Email)
SELECT PostID, [User], [Date], Email
FROM NumberedPosts
WHERE NumberedPosts.RowNum BETWEEN @start and (@start + @pageSize)

I would definitely go down the dynamic SQL route (using sp_executesql with parameters to avoid any injection attacks). Using the CASE approach you're immediately stopping SQL Server from using any relevant indexes that would assist in the sorting process.
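
A minimal sketch of the dynamic SQL route, assuming the Post table and column names from the question. The sort key is mapped onto a fixed whitelist before it is concatenated, and the paging values are passed as real parameters to sp_executesql; the sample values and the @orderBy and @sql variable names are illustrative:

```sql
-- Illustrative inputs; in practice these are the procedure's parameters.
DECLARE @sortCol sysname = N'Date', @start int = 1, @pageSize int = 20;

-- Map the requested sort key onto a fixed whitelist of column names.
DECLARE @orderBy sysname = CASE @sortCol
        WHEN N'User'  THEN N'User'
        WHEN N'Date'  THEN N'Date'
        WHEN N'Email' THEN N'Email'
    END;

IF @orderBy IS NULL
BEGIN
    RAISERROR(N'Unsupported sort column.', 16, 1);
    RETURN;
END;

-- Only the whitelisted, quoted column name is concatenated into the text.
DECLARE @sql nvarchar(max) = N'
WITH NumberedPosts AS (
    SELECT PostID, [User], [Date], Email,
           ROW_NUMBER() OVER (ORDER BY ' + QUOTENAME(@orderBy) + N' DESC) AS RowNum
    FROM Post
)
SELECT PostID, [User], [Date], Email
FROM NumberedPosts
WHERE RowNum BETWEEN @start AND (@start + @pageSize);';

EXEC sys.sp_executesql
    @sql,
    N'@start int, @pageSize int',
    @start = @start,
    @pageSize = @pageSize;
```

Each sort column produces its own statement text, so each gets its own cached plan and can use an index on that column. If the rows need to land in #temp as in the question, a temp table created by the caller is visible inside the dynamic batch, so the final SELECT can be turned into an INSERT ... SELECT.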

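This should work, but I'm not sure if it improves performance. Note that a single CASE expression returns one data type, so this only works if User, Date and Email have compatible types; if Date is a date type, the string columns are implicitly converted to it and the query can fail:
WITH NumberedPosts (PostID, RowNum) AS
(
SELECT PostID, ROW_NUMBER() OVER (ORDER BY
CASE WHEN @sortCol = 'User' THEN [User]
WHEN @sortCol = 'Date' THEN [Date]
WHEN @sortCol = 'Email' THEN Email
END DESC) as RowNum
FROM Post
)
INSERT INTO #temp(PostID, [User], [Date], Email)
SELECT Post.PostID, [User], [Date], Email
FROM Post
INNER JOIN NumberedPosts ON NumberedPosts.PostID = Post.PostID
WHERE NumberedPosts.RowNum BETWEEN @start and (@start + @pageSize)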