Comparing two rows in SQL Server

Scenario
A very large query returns many fields from multiple joined tables.
Some records appear to be duplicated.
You run some checks and some grouping, then focus on a couple of records for further investigation.
Still, there are too many fields to check each value by hand.
Question
Is there any built-in function that compares two records, returning TRUE if they match, and otherwise FALSE together with the set of non-matching fields?

The CHECKSUM function should help identify matching rows
SELECT CHECKSUM(*) FROM table
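For example, a minimal sketch that compares two specific rows by checksum (YourTable and the id values are placeholders; CHECKSUM can collide, so an equal checksum is a strong hint rather than proof that the rows match):
-- Compare the checksums of two rows identified by hypothetical id values
SELECT CASE
         WHEN (SELECT CHECKSUM(*) FROM YourTable WHERE id = 1)
            = (SELECT CHECKSUM(*) FROM YourTable WHERE id = 2)
         THEN 'Rows probably match'
         ELSE 'Rows differ'
       END AS comparison;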

Maybe this is what you are looking for:
SELECT <<ColumnList>>, COUNT(*) AS cnt
FROM YourTable
GROUP BY <<ColumnList>>
HAVING COUNT(*) > 1
Just developing on the suggestion provided by Podiluska to find the records which are duplicates:
SELECT CHECKSUM(*)
FROM YourTable
GROUP BY CHECKSUM(*)
HAVING COUNT(*) > 1
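If your version complains about grouping on CHECKSUM(*) directly, a derived table gives the same result (a sketch, with YourTable as a placeholder; remember CHECKSUM can produce false matches):
SELECT t.row_checksum, COUNT(*) AS occurrences
FROM (SELECT CHECKSUM(*) AS row_checksum FROM YourTable) AS t
GROUP BY t.row_checksum
HAVING COUNT(*) > 1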

I would suggest using the HASHBYTES function to compare rows. It is better than CHECKSUM because collisions are far less likely.
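A rough sketch of the idea (YourTable and col1..col3 are placeholders; HASHBYTES takes a single value, so the columns have to be concatenated first, and the SHA2_256 algorithm needs SQL Server 2012 or later):
-- Hash each row's columns into one value that can be grouped or compared
SELECT HASHBYTES('SHA2_256',
                 CONCAT(col1, '|', col2, '|', col3)) AS row_hash
FROM YourTable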
What about creating a ROW_NUMBER partitioned by all the columns and then selecting all the rows that have rn of 2 or higher? This method is not slow, it gives you exact results, and it returns the full data of each duplicated row. I would go with this instead of relying on hashing techniques.
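A sketch of that approach, with placeholder columns col1..col3 (partition by every column you treat as part of a duplicate):
;WITH numbered AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY col1) AS rn
    FROM YourTable
)
SELECT *
FROM numbered
WHERE rn >= 2  -- every copy beyond the first of each combination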

Related

Is There a Way to Use count in a subquery

As you can see in the link of the ER diagram, I have two tables, Department and Admissions. My goal is to print out only the ReshNum of the Department that has the most Admissions.
My first attempt was this:
select top 1 count(*) as Number_of_Adm, Department.ReshNum
from Admission
inner join Department on Admission.DepartmentId = Department.id
group by Department.ReshNum
order by Number_of_Adm;
It's pretty straightforward: it counts the rows, groups them by department, and prints out the top answer after ordering by the highest count. My problem is that it prints both the count and the ReshNum.
I'm trying to print only the ReshNum (the name of the branch/serial number). I've read up on subqueries to try to get the count in a subquery and then pick the branch out from that query, but I can't get it to work.
Any tips?
You just need to select the column you need and move the count to the order by criteria.
Using column aliases also helps make your query easier to follow, especially with more columns & tables in the query.
You also say you want the most, so I assume you'll need to order descending.
select top (1) d.ReshNum
from Admission a
inner join Department d
on a.DepartmentId = d.id
group by d.ReshNum
Order By count(*) desc;
Great question! Stu's answer is probably the optimal way, depending on your indexes.
Just for posterity, since your inquiry includes how to make a subquery work, here is an alternative using a subquery. As far as I can tell, at least on my database, SQL Query Optimizer plans the two queries out with about the same performance on either version.
Subqueries can be really useful in tons of scenarios, like when you want to add another field to display and group by without having to add every single other field on the table in the group by clause.
SELECT TOP 1 x.ReshNum /*, x.DepartmentName*/
FROM
(
SELECT count(*) AS CountOfAdmissions, d.ReshNum /*, d.DepartmentName*/
FROM Admission a
INNER JOIN Department d ON a.DepartmentId = d.Id
GROUP BY d.ReshNum /*, d.DepartmentName*/
/*HAVING count(*) > 50*/
) x
ORDER BY CountOfAdmissions DESC
How it works:
You need to wrap your subquery in parenthesis and give that subquery an alias. Above, I have it aliased as x just outside the closing parenthesis as an arbitrary identifier. You could certainly alias it as depts or mySubQuery or whatever reads well in the resulting overall query to you.
Next, it's important to notice that while the group by clause can be included inside the subquery, the order by clause cannot. So you have to keep the order by clause on the outside of the query, which means you are actually ordering the results of the subquery, not the results of the actual table. That can be great for performance, because the result of your subquery is likely to be vastly smaller than the whole table; however, it will not use your table's index that way, so depending on how your indexes are, that bonus may wash out or even be worse than ordering without a subquery.
Last, one of the benefits of this kind of subquery approach is that you could easily throw in another field if you want, like the department name for example, without costing very much in performance. Above I have that hypothetical department name field commented out between the /* and */ flags. Note that it is referenced with the d table alias on the inside of the subquery, but uses the subquery's x alias outside of the subquery.
Just as a bonus, in case it comes up, also commented out above is a having clause that you might be able to use. Just to show what could be done inside the subquery.
Cheers

SQL Server: one table fulltext index vs multiple table fulltext index

I have a search procedure which has to search in five tables for the same string. I wonder which is better regarding read performance?
To have all tables combined in one table, then add a full text index on it
To create full text indexes in all those tables and issue a query on all of them, then to combine the results
I think something to consider when working with performance is that reading data once is almost always faster than reading it, moving it, and then reading it again.
So, from your question: if you are combining all the tables into a single table (say a temporary table or table variable), that will most likely be slow, because it has to query all that data and move it (depending on how much data you are working with, this may or may not make much of a difference). Also, regarding your table structure, indexing on strings only really becomes efficient when the same string shows up a number of times throughout the tables.
For example, if you are searching on months (Jan, Feb, Mar, etc.), an index will be great because the value can only be one of 12 options; the more a value is repeated and the fewer options there are to choose from, the better an index works. Whereas if you are searching on, say, user-entered values ("I am writing my life story...") and you want a word inside that string, an index may not make any difference.
So assuming your query is searching on, in my example months, then you could do something like this:
SELECT value
FROM (
SELECT value FROM table1
UNION
SELECT value FROM table2
UNION
SELECT value FROM table3
UNION
SELECT value FROM table4
UNION
SELECT value FROM table5
) t
WHERE t.value = 'Jan'
This will combine your results into a single result set without moving data around. As well the interpreter will be able to find the most efficient way to query each table using the provided index on each table.
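If each table keeps its own full-text index instead, the same shape should work with CONTAINS in place of the equality test (a sketch; the table and column names are placeholders, and CONTAINS requires a full-text index on that column):
SELECT value FROM table1 WHERE CONTAINS(value, 'Jan')
UNION ALL
SELECT value FROM table2 WHERE CONTAINS(value, 'Jan')
UNION ALL
SELECT value FROM table3 WHERE CONTAINS(value, 'Jan')
UNION ALL keeps duplicates across the tables; swap in UNION if you need them removed, as in the query above.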

SQL: How to bring back 20 columns in a select statement, with a unique on a single column only?

I have a huge select statement, with multiple inner joins, that brings back 20 columns of info.
Is there some way to filter the result to do a unique (or distinct) based on a single column only?
Another way of thinking about it is that when it does a join, it should only grab the first result of the join for a single ID, then stop and move on to joining on the next ID.
I've successfully used group by and distinct, but these require you to specify many columns, not just one column, which appears to slow the query down by an order of magnitude.
Update
The answer by #Martin Smith works perfectly.
When I updated the query to use this technique:
It more than doubled in speed (1663ms down to 740ms)
It used less T-SQL code (no need to add lots of parameters to the GROUP BY clause).
It's more maintainable.
Caveat (very minor)
Note that you should only use the answer from #Martin Smith if you are absolutely sure that the rows that will be eliminated will always be duplicates, or else this query will be non-deterministic (i.e. it could bring back different results from run to run).
This is not an issue with GROUP BY, as the TSQL syntax parser will prevent this from ever occurring, i.e. it will only let you bring back results where there is no possibility of duplicates.
You can use row_number for this
WITH T AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY YourCol ORDER BY YourOtherCol) AS RN,
--Rest of your query here
)
SELECT *
FROM T
WHERE RN=1
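For illustration, a filled-in version with hypothetical table and column names might look like this (the PARTITION BY column is the single column you want to be unique on, and the ORDER BY decides which duplicate wins):
WITH T AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY o.CustomerId
                              ORDER BY o.OrderDate DESC) AS RN,
           o.OrderId, o.CustomerId, o.OrderDate, o.Total   -- hypothetical columns
    FROM Orders AS o
    INNER JOIN Customers AS c ON c.Id = o.CustomerId
)
SELECT *
FROM T
WHERE RN = 1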

SQL Server Select Query

I have to write a query to get the following data as a result.
I have four columns in my database. ID is not null, all others can have null values.
EMP_ID  EMP_FIRST_NAME  EMP_LAST_NAME  EMP_PHONE
1       John            Williams       +123456789
2       Rodney                         +124568937
3       Jackson                        +124578963
4       Joyce           Nancy
Now I have to write a query which returns the columns which are not null.
I do not want to specify the column name in my query.
I mean, I want to use SELECT * FROM TABLE WHERE - and add the filter, but I do not want to specify the column name after the WHERE clause.
This question may be foolish but correct me wherever necessary. I'm new to SQL and working on a project with c# and sql.
The reason I do not want to use the column names is that I have more than 250 columns and 1500 rows. If I select any row, at least one column will have a null value. I want to select the row, but the columns that have null values for that particular row should not appear in the result.
Please advise. Thank you in advance.
Regards,
Vinay S
Every row returned from a SQL query must contain exactly the same columns as the other rows in the set. There is no way to select only those columns which do not return null unless all of the results in the set have the same null columns and you specify that in your select clause (not your where clause).
To Anders Abels's comment on your question, you could avoid a good deal of the query complexity by separating your data into tables which serve common purposes (called normalizing).
For example, you could put names in one table (Employee_ID, First_Name, Last_Name, Middle_Name, Title), places in another (Address_ID, Address_Name, Street, City, State), relationships in another, then tiny 2-4 column tables which link them all together. Structuring your data this way avoids duplication of individual facts, like, "who is John Williams's supervisor and how do I contact that person."
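A rough sketch of that layout (the table and column definitions here are illustrative, not a prescription):
CREATE TABLE Employee (
    Employee_ID  int IDENTITY PRIMARY KEY,
    First_Name   nvarchar(50),
    Last_Name    nvarchar(50),
    Middle_Name  nvarchar(50),
    Title        nvarchar(50)
);

CREATE TABLE Address (
    Address_ID   int IDENTITY PRIMARY KEY,
    Address_Name nvarchar(100),
    Street       nvarchar(100),
    City         nvarchar(50),
    State        nvarchar(50)
);

-- A small linking table ties employees to addresses
CREATE TABLE Employee_Address (
    Employee_ID int NOT NULL REFERENCES Employee(Employee_ID),
    Address_ID  int NOT NULL REFERENCES Address(Address_ID),
    PRIMARY KEY (Employee_ID, Address_ID)
);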
Your question reads:
I want to get all the columns that don't have a null value.
And at the same time:
But I don't want to specify column names in the WHERE clause.
These are conflicting goals. Your only option is to use the sys.tables and sys.columns DMVs to build a series of dynamic SQL statements. In the end, this is going to be more work than just writing one query by hand the first time.
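For what it's worth, a minimal sketch of that dynamic approach (the table name is hypothetical, and this only builds the full column list; filtering out a given row's null columns would still need a statement built per row):
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build a comma-separated column list from the catalog views
SELECT @cols = COALESCE(@cols + ', ', '') + QUOTENAME(c.name)
FROM sys.columns AS c
JOIN sys.tables  AS t ON t.object_id = c.object_id
WHERE t.name = 'Employees';          -- hypothetical table name

SET @sql = N'SELECT ' + @cols + N' FROM dbo.Employees;';
EXEC sys.sp_executesql @sql;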
You can do this with a dynamic PIVOT / UNPIVOT approach, assuming your version of SQL Server supports it (you'll need SQL Server 2005 or better), which would be based on the concepts found in these links:
Dynamic Pivot
PIVOT / UNPIVOT
Effectively, you'll select a row, transform your columns into rows in a pivot table, filter out the NULL entries, and then unpivot it back into a single row. It's going to be ugly and complex code, though.
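A minimal sketch of the unpivot step for one row (the table name is hypothetical; the values are cast to a common type because UNPIVOT requires matching types, and UNPIVOT drops NULL entries on its own):
SELECT u.ColumnName, u.ColumnValue
FROM (
    SELECT CAST(EMP_FIRST_NAME AS varchar(100)) AS EMP_FIRST_NAME,
           CAST(EMP_LAST_NAME  AS varchar(100)) AS EMP_LAST_NAME,
           CAST(EMP_PHONE      AS varchar(100)) AS EMP_PHONE
    FROM Employees                   -- hypothetical table name
    WHERE EMP_ID = 2
) AS src
UNPIVOT (
    ColumnValue FOR ColumnName IN (EMP_FIRST_NAME, EMP_LAST_NAME, EMP_PHONE)
) AS u;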

In SQL Server, what is the most efficient way to compare records to other records for duplicates within a given range of values?

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
Records that have the same customer_id and product_id, but quantities that differ by no more than 20 units, and sale dates within five days of each other, should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, then you might be looking at a simple ETL process to do the heavy lifting for you: the load to the database should be manageable in the sense that you will be loading to your ETL environment, running transformations/checks/comparisons, and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I am thinking "efficient" will mean adding indexes on the fields whose contents you are comparing. I'm not sure offhand whether one mega-join is what you need, or whether listing the primary keys of the suspect records into a holding table to report problems later would be enough; i.e., do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
    and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose
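For example, criterion 2 might look like this (a sketch: same customer, product, and quantity, sale dates within five days, using the assumed pkid to skip self-matches):
-- 2.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.customer_id = o2.customer_id
    and o.product_id = o2.product_id
    and o.quantity = o2.quantity
    and o.pkid <> o2.pkid
    and abs(datediff(day, o.sale_date, o2.sale_date)) <= 5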
You can do this with a single CASE expression. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Note that the conditions of the third test have to be checked before the second test.
With InputData As
(
Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
, Case
When O.sale_date = O2.sale_date Then 1
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
And Abs( O.quantity - O2.quantity ) <= 20 Then 3
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
Else 0
End As MarkedForReview
From Orders As O
Left Join Orders As O2
On O2.order_id <> O.order_id
And O2.customer_id = O.customer_id
And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id
From InputData
Where MarkedForReview <> 0
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.
