Adding an index to SQL Server - sql-server

I have a query that gets run often. It's a dynamic SQL query because the sort order changes:
SELECT userID, ROW_NUMBER() OVER (ORDER BY created) AS rownumber
FROM users
WHERE divisionID = @divisionID AND isenrolled = 1
The ORDER BY column inside the OVER clause can be either:
userid
created
Should I create an index for:
divisionID + isenrolled
divisionID + isenrolled + each_sort_by_option ?
Where should I put indexes for this table?

I'd start with
CREATE INDEX IX_SOQuestion ON dbo.users (divisionID, isenrolled) INCLUDE (userID, created)
The created ranking is unrelated to the WHERE clause, so you may as well just INCLUDE it (so the index is covering) rather than put it in the key columns. An internal sort would be needed anyway, so why make the key bigger?
Other sort columns could be included too
userid is only needed for output, so INCLUDE it
perhaps move isenrolled into the INCLUDE list if it's a bit column; it has only 2 states (OK, 3 with NULL), so it's fairly pointless to add to the key columns
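A minimal sketch of that variant, assuming isenrolled is a bit column (the index name is made up):
-- Variant: only divisionID in the key; the low-selectivity bit column and the
-- output columns are carried as INCLUDEd (leaf-level only) columns.
CREATE INDEX IX_users_divisionID ON dbo.users (divisionID) INCLUDE (isenrolled, userID, created)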

Start with divisionID + isenrolled + userID as it will always be used
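As a sketch (the index name is an assumption):
CREATE INDEX IX_users_division_enrolled ON dbo.users (divisionID, isenrolled, userID)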

Related

SQL Server : Text Search Pattern for Performance

I have a requirement in which I periodically have to check 40k names against a table of 70k names (on Azure SQL Server).
The table has 2 relevant columns:
FIRSTNAME (nvarchar(15))
LASTNAME (nvarchar(20))
Name matches must be exact first and last name match.
Naively, my first approach would be to run 40k SELECT ... WHERE firstname='xxx' AND lastname='yyy' queries, but I have to believe there is a more performant way of doing it. On the surface, without an index that sounds like roughly 2.8 billion string comparisons (40k probes against 70k rows). Obviously, the columns are short enough that I can index them, but surely there is something more I could do?
My first question is, what's the most efficient way to handle a problem like this in SQL Server?
My second question is, does anyone with experience with something like this have any idea how long 40k text searches across 70k rows would take, even just as an order of magnitude? I.e. am I looking at minutes, hours, days, etc.?
Thanks in advance for any insights.
An index which contains both the FIRSTNAME and LASTNAME columns should be enough; if possible, make it clustered.
CREATE CLUSTERED INDEX [idx_yourTable] ON yourTable (
FIRSTNAME ASC,
LASTNAME ASC
)
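With that index in place, the whole check can be done as a single set-based join instead of 40k individual queries. A minimal sketch, assuming the 40k probe names are loaded into a staging table such as #NamesToCheck with the same two columns:
-- #NamesToCheck is an assumed staging table holding the 40k names to look up.
SELECT t.FIRSTNAME, t.LASTNAME
FROM yourTable AS t
INNER JOIN #NamesToCheck AS n
    ON n.FIRSTNAME = t.FIRSTNAME
    AND n.LASTNAME = t.LASTNAME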
If you are not able to create an index on your table, then you can retrieve all the data into temp tables and create indexes on those instead.
DROP TABLE IF EXISTS #T_Local
DROP TABLE IF EXISTS #T_Azure
SELECT
ID
-- A separator is used to avoid cases like
-- 'FirstName' + 'LastName' = 'FirstNameLast' + 'Name'
,FIRSTNAME + '|' + LASTNAME AS [FULL_NAME]
,FIRSTNAME
,LASTNAME
INTO #T_Local
FROM server1.DB1.dbo.YourTable
SELECT
ID
,FIRSTNAME + '|' + LASTNAME AS [FULL_NAME]
,FIRSTNAME
,LASTNAME
INTO #T_Azure
FROM server2.DB1.dbo.YourTable
CREATE CLUSTERED INDEX [idx_t_local] ON #T_Local (
[FULL_NAME] ASC)
CREATE CLUSTERED INDEX [idx_t_azure] ON #T_Azure (
[FULL_NAME] ASC)
SELECT
tl.ID AS [ID_Local]
,tl.FIRSTNAME AS [FIRSTNAME_Local]
,tl.LASTNAME AS [LASTNAME_Local]
,ta.ID AS [ID_Azure]
,ta.FIRSTNAME AS [FIRSTNAME_Azure]
,ta.LASTNAME AS [LASTNAME_Azure]
FROM #T_Local tl
INNER JOIN #T_Azure ta
ON tl.FULL_NAME = ta.FULL_NAME
Finally, 40k and 70k records are not that much data, so you should not see serious performance issues even without a proper index.

How to get rid of duplicates with T-SQL

Hi, I have a login table that has some duplicated usernames.
Yes I know I should have put a constraint on it, but it's a bit too late for that now!
So essentially what I want to do is to first identify the duplicates. I can't just delete them since I can't be too sure which account is the correct one. The accounts have the same username and both of them have roughly the same information with a few small variances.
Is there any way to efficiently script it so that I can add "_duplicate" to only one of the accounts per duplicate?
You can use ROW_NUMBER with a PARTITION BY in the OVER() clause to find the duplicates and an updateable CTE to change the values accordingly:
DECLARE @dummyTable TABLE(ID INT IDENTITY, UserName VARCHAR(100));
INSERT INTO @dummyTable VALUES('Peter'),('Tom'),('Jane'),('Victoria')
,('Peter'),('Jane')
,('Peter');
WITH UpdateableCTE AS
(
SELECT t.UserName AS OldValue
,t.UserName + CASE WHEN ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY ID)=1 THEN '' ELSE '_duplicate' END AS NewValue
FROM @dummyTable AS t
)
UPDATE UpdateableCTE SET OldValue = NewValue;
SELECT * FROM @dummyTable;
The result
ID UserName
1 Peter
2 Tom
3 Jane
4 Victoria
5 Peter_duplicate
6 Jane_duplicate
7 Peter_duplicate
You might include ROW_NUMBER() as another column to find each duplicate's ordinal. If you add a sort clause that numbers the earliest (or most current) row with 1, it should be easy to find and correct the duplicates.
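For instance, a quick inspection query along those lines (reusing the table variable above):
-- List each row together with its ordinal within its UserName group.
SELECT ID, UserName,
       ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY ID) AS DuplicateOrdinal
FROM @dummyTable
ORDER BY UserName, ID;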
Once you've cleaned this mess, you should ensure not to get new dups. But you know this already :-D
There is no easy way to get rid of this nightmare. Some manual actions required.
First identify duplicates.
select * from dbo.users
where username in
    (select username from dbo.users
     group by username
     having count(*) > 1)
Next, identify "useless" users (for example those who registered but never placed any order).
Rerun the query above. From the resulting list, find duplicates that are really the same person (by email, for example) and combine them into a single record. If they did something useful previously (for example placed orders), first reassign those orders to the user that survives, then remove the others.
Continue with other criteria until you get rid of all duplicates.
Then set a unique constraint on the username field. It is also a good idea to set a unique constraint on the email field.
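As a sketch (the constraint names and the email column name are assumptions):
ALTER TABLE dbo.users ADD CONSTRAINT UQ_users_username UNIQUE (username);
ALTER TABLE dbo.users ADD CONSTRAINT UQ_users_email UNIQUE (email);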
Again, it is not easy and not automatic.
In this case, where the duplicates and the original rows have some variance, it is practically impossible to select the non-duplicate rows automatically, since you cannot tell which one is real and which is the duplicate.
I think the best thing to do is to correct your data and then fix whatever is producing these slightly varying duplicates.

Checking existance of dynamic value before update/insert

I am trying to mass update a table column with values, but I need the query to check whether each value already exists. If it does, it should make the relevant changes before checking again and updating the table.
The database primarily holds staff information, and I need to create a unique username. The script to create the username is:
select upper(LEFT(first_name,1))+LEFT(surname,3)+'1'
from staff_test
If this was used for an example user, it would generate a username of ABit1 for the user Andrew Bithell. What I need it to do is check whether there already is an ABit1 username in the STAFF_TEST table and, if so, change Andrew's username to ABit2, as the usernames have to be unique, before it moves on to the next user.
I have created another table which lists all the current usernames, splitting the existing usernames into 2 columns, so they display in this table as:
column1 | column2
--------+--------
ABit    | 1
I have experimented with a function and I am now thinking a Merge statement might be the way to go.
Any suggestions are welcomed.
Using ROW_NUMBER you can generate all the unique names at once:
select
    upper(LEFT(first_name,1)) + LEFT(surname,3)
        + rtrim(row_number() over (partition by upper(LEFT(first_name,1)) + LEFT(surname,3)
                                   order by first_name, surname)) as username
    ,first_name
    ,surname
from staff_test
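If the goal is to write these back, here is a minimal sketch using an updateable CTE (it assumes staff_test has a username column to populate):
;WITH numbered AS
(
    select
        username
        ,upper(LEFT(first_name,1)) + LEFT(surname,3)
            + rtrim(row_number() over (partition by upper(LEFT(first_name,1)) + LEFT(surname,3)
                                       order by first_name, surname)) as new_username
    from staff_test
)
UPDATE numbered SET username = new_username;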
Perform an up front check to see if there are any clashes:
SELECT UPPER(LEFT(first_name, 1)) + LEFT(surname, 3) + '1' AS username ,
COUNT(1) counter
FROM staff_test
GROUP BY UPPER(LEFT(first_name, 1)) + LEFT(surname, 3) + '1'
HAVING COUNT(1) > 1
ORDER BY COUNT(1) DESC
This will return each generated username from your staff table, grouped by that username, along with a count of how many occurrences there are of each.
You can sanitize the data if that's what you're looking to do; otherwise I would suggest appending an Id column value, or some other unique value per record, instead of '1' on the end.
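For example, appending the staff table's own Id (the staff_id column name is an assumption):
-- Assumes staff_test has a numeric staff_id column that is unique per row.
SELECT UPPER(LEFT(first_name, 1)) + LEFT(surname, 3) + RTRIM(staff_id) AS username
FROM staff_test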

Primary Key on a temp-table messes up the results

This is a "Why does this happen?" question.
I have the following script:
DECLARE @sql_stmt nvarchar(max)
SET @sql_stmt = '
select top 100000 id as id
from dat.sev_sales_event
order by id
'
DECLARE @preResult TABLE ( sales_event_id INT NOT NULL PRIMARY KEY)
INSERT INTO @preResult(sales_event_id)
EXEC sp_executesql @sql_stmt
SELECT * FROM @preResult
If I run this script, the results may vary each time it's executed.
By simply removing "PRIMARY KEY" from the table variable, the results stay stable.
Can someone explain the theory behind this behaviour?
Kind regards
Jürgen
The order of rows in a database table has no inherent meaning.
If you want your results to be ordered, then you must specify an ORDER BY clause.
This is true irrespective of whether there is a PRIMARY KEY or not.
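For the script above, that simply means ordering the final select explicitly, for example:
-- Deterministic output requires an explicit ORDER BY on the outer query.
SELECT sales_event_id
FROM @preResult
ORDER BY sales_event_id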
The following scripts illustrate the issue nicely
Expecting order without ORDER BY (1).sql - gvee.co.uk
Expecting order without ORDER BY (2).sql - gvee.co.uk
Expecting order without ORDER BY (3).sql - gvee.co.uk
Are you sure the result set is different or just in a different order?
Adding a primary key to the table variable creates a clustered index, so a scan will often return the rows in key order and the contents appear 'stable'; removing it removes that incidental ordering. Without an ORDER BY, though, neither order is actually guaranteed.

Generate Unique hash for a field in SQL Server

I'm in the process of writing a Membership Provider for use with our existing membership base. I use EF 4.1 for all of my database access, and one of the issues I'm running into is that when the DB was originally set up, the relationships were done programmatically instead of in the db. One of the relationships needs to be made on a column that isn't required for all of our users, but in order to make the relationship it does need to be unique (from my understanding).
My solution, which I believe will work, is to do an MD5 hash of the userid field (which is unique, so it would/should guarantee a unique value in that field). The part that I'm having issues with on SQL Server is the query that would do this WITHOUT replacing the existing values stored in the employeeNum field (the one in question).
So in a nutshell my question is: what is the best way to get a unique value into the employeeNum field (possibly based on an MD5 hash of the userid field) for all the rows in which a value isn't already present? Also, to a minor/major extent, does this sound like a good plan?
If your question is just how to generate a hash value for userid, you can do it this way using a computed column (or generate this value as part of the insert process). It isn't clear to me whether you know about the HASHBYTES function or what other criteria you're looking at when you say "best."
DECLARE @foo TABLE
(
userid INT,
hash1 AS HASHBYTES('MD5', CONVERT(VARCHAR(12), userid)),
hash2 AS HASHBYTES('SHA1', CONVERT(VARCHAR(12), userid))
);
INSERT @foo(userid) SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 500;
SELECT userid, hash1, hash2 FROM @foo;
Results:
userid hash1 hash2
------ ---------------------------------- ------------------------------------------
1 0xC4CA4238A0B923820DCC509A6F75849B 0x356A192B7913B04C54574D18C28D46E6395428AB
2 0xC81E728D9D4C2F636F067F89CC14862C 0xDA4B9237BACCCDF19C0760CAB7AEC4A8359010B0
500 0xCEE631121C2EC9232F3A2F028AD5C89B 0xF83A383C0FA81F295D057F8F5ED0BA4610947817
In SQL Server 2012, I highly recommend at least SHA2_256 instead of either of the above. (You forgot to mention what version you're using - always useful information.)
All that said, I still want to call attention to the point I made in the comments: the "best" solution here is to fix the model. If employeeNum is optional, EF shouldn't be made to think it is required or unique, and it shouldn't be used in relationships if it is not, in fact, some kind of identifier. Why would a user care about collisions between employeeNum and userid if you're using the right attribute for the relationship in the first place?
EDIT as requested by OP
So what is wrong with saying UPDATE table SET EmployeeNum = 1000000 + UserID WHERE EmployeeNum IS NULL? If the existing EmployeeNum values stay below 1,000,000 then you've guaranteed no collisions and you've avoided hashing altogether.
You could generate similar padding if employeeNum might contain a string, but again is it EF that promotes these horrible column names? Why would a column with a Num suffix contain anything but a number?
You could also use a uniqueidentifier column, setting its default value to (newid()).
Create a new column EmployeeNum as uniqueidentifier, then:
UPDATE Employees SET EmployeeNum = newid()
Then set as primary key.
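A minimal sketch of those steps (the Employees table name comes from the statement above; the constraint names are assumptions, and it presumes the table has no primary key yet):
-- Add the column with a default, backfill existing rows, then tighten and key it.
ALTER TABLE Employees ADD EmployeeNum uniqueidentifier NULL
    CONSTRAINT DF_Employees_EmployeeNum DEFAULT (NEWID())
GO
-- Backfill rows that existed before the column was added.
UPDATE Employees SET EmployeeNum = NEWID() WHERE EmployeeNum IS NULL
GO
ALTER TABLE Employees ALTER COLUMN EmployeeNum uniqueidentifier NOT NULL
ALTER TABLE Employees ADD CONSTRAINT PK_Employees PRIMARY KEY (EmployeeNum)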
UPDATE EMPLOYEE
SET EMPLOYEENUM = HASHBYTES('SHA1', CAST(USERID AS VARCHAR(20)))
WHERE EMPLOYEENUM IS NULL
