In SQL server what is the difference between using = and join? - sql-server

We have been learning SQL Server programming in Database Systems class. The professor goes exceptionally fast and is not very open to asking questions. I did ask him this, but he just advised me to review the code he'd given (which doesn't actually answer the question).
When making a query, what is the difference between using the term JOIN and using the "=" operator? For example, I have the following query:
SELECT VENDOR_NAME, ITEM_NAME, QTY
FROM VENDOR, VENDOR_ORDER, INVENTORY
WHERE VENDOR.VENDOR_ID = VENDOR_ORDER.VENDOR_ID
AND VENDOR_ORDER.INV_ID = INVENTORY.INV_ID
ORDER BY VENDOR_NAME
In class the professor has used the following code:
SELECT DISTINCT CUS_CODE, CUS_LNAME, CUS_FNAME
FROM CUSTOMER JOIN INVOICE USING (CUS_CODE)
JOIN LINE USING (INV_NUMBER)
JOIN PRODUCT USING (P_CODE)
WHERE P_DESCRIPT = 'Claw hammer';
It seems to me that using a join is performing the same function as the "=" is in mine? Am I correct or is there a difference that I am unaware of?
Edit:
Trying to use Inner Join based on things I've found on Google. I ended up with the following.
SELECT VENDOR_NAME, ITEM_NAME, QTY
FROM VENDOR, VENDOR_ORDER, INVENTORY
INNER JOIN VENDOR_ORDER USING (VENDOR_ID)
INNER JOIN INVENTORY USING (INV_ID)
ORDER BY VENDOR_NAME
Now I get the error message: "VENDOR_ID" is not a recognized table hints option. If it is intended as a parameter to a table-valued function or to the CHANGETABLE function, ensure that your database compatibility mode is set to 90.
I'm using 2014, so my compatibility level is 120.

The difference between what you are doing (in your first example) and what your professor is doing is that you are creating a set of all possible combinations of the rows in those tables, then narrowing your results to the ones that match the way you want them to. He is creating a set of only the rows that match the way you want them to in the first place.
If your tables were:
Table1
ID1
1
2
3
Table2
ID2
1
2
3
Your query starts with basically a cross join:
Select * from Table1, Table2
ID1 ID2
1 1
2 1
3 1
1 2
2 2
3 2
1 3
2 3
3 3
Then narrows that result set down by applying the where ID1 = ID2
ID1 ID2
1 1
2 2
3 3
Conceptually this is wasteful (although the query optimizer will often produce the same plan for both forms), and it is somewhat difficult to read in more complex examples, as people have mentioned in the comments.
Your professor is building the criteria to relate the two tables into the join itself, so he is effectively skipping the first step. In our example tables, this would be Select * from Table1 join Table2 on ID1 = ID2.
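To see that both forms return the same rows for the example tables above, here is a minimal sketch. It uses Python's built-in sqlite3 module and an in-memory SQLite database purely as a stand-in for SQL Server (that substitution is my assumption, not part of the question); the table and column names are the ones from the example.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER);
    CREATE TABLE Table2 (ID2 INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1), (2), (3);
""")

# Comma-separated tables with no WHERE clause: every combination of rows.
cross_rows = conn.execute("SELECT * FROM Table1, Table2").fetchall()

# Old style: comma-separated tables, narrowed down in the WHERE clause.
where_rows = conn.execute(
    "SELECT * FROM Table1, Table2 WHERE ID1 = ID2 ORDER BY ID1"
).fetchall()

# Explicit join: the matching rule is part of the JOIN itself.
join_rows = conn.execute(
    "SELECT * FROM Table1 JOIN Table2 ON ID1 = ID2 ORDER BY ID1"
).fetchall()

print(len(cross_rows))  # number of rows in the unfiltered combination
print(where_rows)
print(join_rows)
```

The unfiltered query returns 3 x 3 rows, while the two filtered forms return the same three matching pairs.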
There are several types of joins in SQL, which differ in how they handle cases where a value exists in one of your tables but has no match in the other table. See the traditional Venn diagram explanation at http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins.
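As a small illustration of how join types differ when a value has no match, the sketch below compares an INNER JOIN with a LEFT JOIN. It again uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for SQL Server, and the sample values (Table2 holding 2, 3, 4 instead of 1, 2, 3) are made up for the demonstration.

```python
import sqlite3

# SQLite stand-in; the sample data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER);
    CREATE TABLE Table2 (ID2 INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (2), (3), (4);
""")

# INNER JOIN keeps only the rows that have a match on both sides.
inner_rows = conn.execute(
    "SELECT ID1, ID2 FROM Table1 INNER JOIN Table2 ON ID1 = ID2 ORDER BY ID1"
).fetchall()

# LEFT JOIN keeps every row of Table1; unmatched rows get NULL (None) for ID2.
left_rows = conn.execute(
    "SELECT ID1, ID2 FROM Table1 LEFT JOIN Table2 ON ID1 = ID2 ORDER BY ID1"
).fetchall()

print(inner_rows)
print(left_rows)
```

The value 1 exists only in Table1, so it disappears from the inner join but survives the left join with an empty ID2.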

Don't worry, it's your professor's issue, not yours. Make sure you give appropriate feedback at the end of the course ;)
Hang in there.
So here is some info:
So the first issue is: your professor should not be teaching you USING because it has limited implementation (it definitely won't work in SQL Server) and IMHO it's a bad idea because you should explicitly list join columns.
Here are some queries that will work in SQL Server. Let's build them up bit by bit. I will need to make some assumptions.
First just join vendor to vendor order:
SELECT VENDOR.VENDOR_NAME, VENDOR_ORDER.QTY
FROM VENDOR
INNER JOIN
VENDOR_ORDER
ON VENDOR.VENDOR_ID = VENDOR_ORDER.VENDOR_ID
By using inner join we match these two tables on VENDOR_ID
If you have seven records in VENDOR_ORDER with VENDOR_ID = 7, and one record in table VENDOR then the result of this will be.... 7 records, with the data from the VENDOR table repeating seven times.
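The row-repetition effect described above can be checked with a quick sketch. It runs against an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server; the column layout of VENDOR and VENDOR_ORDER (in particular that they join on VENDOR_ID) and the vendor name 'Acme' are assumptions for the demonstration.

```python
import sqlite3

# SQLite stand-in; table layouts and sample data are assumed, not from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE VENDOR (VENDOR_ID INTEGER, VENDOR_NAME TEXT);
    CREATE TABLE VENDOR_ORDER (VENDOR_ID INTEGER, QTY INTEGER);
    INSERT INTO VENDOR VALUES (7, 'Acme');
""")

# Seven orders that all belong to vendor 7.
conn.executemany(
    "INSERT INTO VENDOR_ORDER VALUES (?, ?)",
    [(7, qty) for qty in range(1, 8)],
)

rows = conn.execute("""
    SELECT VENDOR.VENDOR_NAME, VENDOR_ORDER.QTY
    FROM VENDOR
    INNER JOIN VENDOR_ORDER
        ON VENDOR.VENDOR_ID = VENDOR_ORDER.VENDOR_ID
    ORDER BY VENDOR_ORDER.QTY
""").fetchall()

# One vendor row matched against seven order rows.
vendor_names = {name for name, _ in rows}
print(len(rows), vendor_names)
```

Each of the seven order rows is paired with the single vendor row, so the vendor name repeats on every result row.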
Now to that, join in inventory
SELECT VENDOR.VENDOR_NAME, INVENTORY.ITEM_NAME, VENDOR_ORDER.QTY
FROM VENDOR
INNER JOIN
VENDOR_ORDER
ON VENDOR.VENDOR_ID = VENDOR_ORDER.VENDOR_ID
INNER JOIN
INVENTORY ON INVENTORY.INV_ID = VENDOR_ORDER.INV_ID
ORDER BY VENDOR.VENDOR_NAME
This 'INNER JOIN' syntax is the modern version (often referred to as SQL-92). Having a comma-separated list after the FROM clause is 'old school'.
Both methods work the same way, but the old-school way causes ambiguities if you start using outer joins. So get into the habit of doing it the new way.
Lastly, to neaten things up you can use an 'alias', which means you give each table a shorter name and then use that. I've also added in the inventory ID (INV_ID) so you can get an idea of what's going on:
SELECT V.VENDOR_NAME, I.ITEM_NAME, ORD.INV_ID, ORD.QTY
FROM VENDOR As V
INNER JOIN
VENDOR_ORDER As ORD
ON V.VENDOR_ID = ORD.VENDOR_ID
INNER JOIN
INVENTORY As I ON I.INV_ID = ORD.INV_ID
ORDER BY V.VENDOR_NAME

Related

Is there an equivalent to OR clause in CONTAINSTABLE - FULL TEXT INDEX

I am trying to find a solution in order to improve the String searching process and I selected FULL-TEXT INDEX Strategy.
However, after implementing it, I still can see there is a performance hit when it comes to search by using multiple strings using multiple Full-Text Index tables with OR clauses.
(E.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE expecting a performance improvement.
Now, I am facing an issue with CONTAINSTABLE when it comes to joining tables with a LEFT JOIN
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want Query 1 to behave like an OR clause in SQL Server. As far as I understand, the first CONTAINSTABLE in Query 1 is joined to the Building table and the rest of the rows are ignored, so the CONTAINSTABLE on the Person table only sees data that has already been filtered by the keyword in the Building table.
If the keyword = Building, I want to match the keyword in both tables, regardless of whether a matching record is saved in both of them. Having a matching record in either table is enough.
Summary
Query 2 performs well but becomes slow as the number of words in the indexes grows. Query 1 seems optimized (according to multiple online resources and the MS documentation);
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for another optimization method are also welcome.
Thank you.
Hard to say definitively without your full data set, but here are a couple of options to explore.
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? Does performance improve if you use the search term without the wildcards (%)? If you want a word that matches a prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words, i.e. words that recur many times in your data but are meaningless, like "the" or perhaps some business jargon. Adding these to your stop list means your full text index will ignore them, making your index smaller and thus faster.
The query below will list indexed words with the most frequent at the top
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, are tricky. OR clauses generally perform poorly, even when using simple equality operators.
I'd try either doing two queries and union the results like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
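To illustrate the OR-to-UNION rewrite suggested above, here is a rough sketch. Full-text search is not available in this setup, so it swaps CONTAINS for a plain LIKE predicate and runs against an in-memory SQLite database through Python's sqlite3 module; the Building/Person layout and the sample rows are hypothetical and much simpler than the schema in the question. It only demonstrates that the two query shapes return the same rows, not any performance difference.

```python
import sqlite3

# SQLite stand-in with LIKE instead of CONTAINS; schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Building (ID INTEGER, Name TEXT);
    CREATE TABLE Person (ID INTEGER, BuildingID INTEGER, FirstName TEXT);
    INSERT INTO Building VALUES (1, 'Gayan Tower'), (2, 'Main Hall'), (3, 'Annex');
    INSERT INTO Person VALUES (1, 1, 'Mary'), (2, 2, 'John'), (3, 3, 'Anne');
""")

base = """
    SELECT B.Name, P.FirstName
    FROM Building AS B
    INNER JOIN Person AS P ON P.BuildingID = B.ID
"""

# Single query with an OR across predicates on two different tables.
or_rows = conn.execute(
    base + " WHERE B.Name LIKE '%Gayan%' OR P.FirstName LIKE '%John%'"
).fetchall()

# Same logic split into two simpler queries whose results are combined.
union_rows = conn.execute(
    base + " WHERE B.Name LIKE '%Gayan%' UNION "
    + base + " WHERE P.FirstName LIKE '%John%'"
).fetchall()

print(sorted(or_rows))
print(sorted(union_rows))
```

Note that UNION removes duplicate rows, so the two forms only match exactly when the OR query does not return duplicates of its own.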
OR, and this is much more work, you could load all your data into a giant denormalized table with all your columns. Then apply a full text index to that table and adjust your search criteria accordingly. It'd probably be the fastest method for searching, but then you'd have to ensure the data is kept in sync between the denormalized table and the underlying normalized tables.
SELECT B.*,P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON ... -- join condition between the two tables
CREATE FULLTEXT INDEX ON DenormalizedTable ...
etc...

SQL query inside a query

Allow me to share my query in an informal way (not following the proper syntax) as I'm a newbie - my apologies:
select * from table where
(
(category = Clothes)
OR
(category = games)
)
AND
(
(Payment Method = Cash) OR (Credit Card)
)
This is one part from my query. The other part is that from the output of the above, I don’t want to show the records meeting these criteria:
Category = Clothes
Branch = B3 OR B4 OR B5
Customer = Chris
VIP Level = 2 OR 3 OR 4 OR 5
SQL is not part of my job but I’m doing it to ease things for me. So you can consider me a newbie. I searched online, maybe I missed the solution.
Thank you,
HimaTech
There's a few ways of doing this (specifically within SQL - not looking at MDX here).
Probably the easiest to understand way would be to get the dataset that you want to exclude as a subquery, and use the not exists/not in command.
SELECT * FROM table
WHERE category IN ('clothes', 'games')
AND payment_method IN ('cash', 'credit card')
AND id NOT IN (
-- this is the subquery containing the results to exclude
SELECT id FROM table
WHERE category = 'clothes' [AND/OR]
branch IN ('B3', 'B4', 'B5') [AND/OR]
customer = 'Chris' [AND/OR]
vip_level IN (2, 3, 4, 5)
)
Another way you could do it is to left join the results you want to exclude on to the overall results, and exclude these results using IS NULL like so:
SELECT t1.*
FROM table AS t1
LEFT JOIN
(SELECT id FROM table
WHERE customer = 'chris' AND ...) -- results to exclude
AS t2 ON t1.id = t2.id
WHERE t2.id IS NULL
AND ... -- any other criteria
The trick here is that when doing a left join, if there is no result from the join then the value is null. But this is certainly more difficult to get your head around.
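Both exclusion techniques can be compared side by side with a small sketch. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in; the table name sales, its columns and the sample rows are hypothetical, and the exclusion criteria are combined with AND, which is only one possible reading of the question.

```python
import sqlite3

# SQLite stand-in; table name, columns and rows are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER, category TEXT, branch TEXT,
                        customer TEXT, vip_level INTEGER, payment_method TEXT);
    INSERT INTO sales VALUES
        (1, 'clothes', 'B3', 'Chris', 2, 'cash'),
        (2, 'clothes', 'B1', 'Chris', 2, 'cash'),
        (3, 'games',   'B3', 'Dana',  1, 'credit card'),
        (4, 'food',    'B1', 'Chris', 1, 'cash'),
        (5, 'clothes', 'B4', 'Chris', 5, 'credit card');
""")

# Rows to exclude (criteria joined with AND, an assumption).
exclude = """
    SELECT id FROM sales
    WHERE category = 'clothes'
      AND branch IN ('B3', 'B4', 'B5')
      AND customer = 'Chris'
      AND vip_level IN (2, 3, 4, 5)
"""

# Variant 1: NOT IN with a subquery.
not_in_ids = [r[0] for r in conn.execute(f"""
    SELECT id FROM sales
    WHERE category IN ('clothes', 'games')
      AND payment_method IN ('cash', 'credit card')
      AND id NOT IN ({exclude})
    ORDER BY id
""")]

# Variant 2: LEFT JOIN the excluded set and keep rows without a match.
left_join_ids = [r[0] for r in conn.execute(f"""
    SELECT t1.id
    FROM sales AS t1
    LEFT JOIN ({exclude}) AS t2 ON t1.id = t2.id
    WHERE t2.id IS NULL
      AND t1.category IN ('clothes', 'games')
      AND t1.payment_method IN ('cash', 'credit card')
    ORDER BY t1.id
""")]

print(not_in_ids, left_join_ids)
```

Rows 1 and 5 meet every exclusion criterion and row 4 fails the category filter, so both variants keep the same remaining rows.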
There will also be different performance impacts from doing it either way, so it may be worth looking into it. This is probably a good place to start:
What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?

SQL Server query returns rows more than expected

I have a SQL Server query that returns more rows than I expected:
select
b.isbn, l.lend_no, s.first_name
from
dbo.books b, dbo.lending l, dbo.students s
where
(l.act between '4/16/2013' and '4/16/2013')
and (l.stat ='close')
What I want to do is get the isbn, lend_no and student name for books whose return date is between the given dates and whose lending status is closed. My lending table has only 2 lendings that were returned on the given date, but the query gives me 304 rows.
Your current query produces the cartesian product of the three tables, which causes the unexpected result. You need to define the relationships, i.e. how the tables should be joined. For example:
select b.isbn, l.lend_no, s.first_name
from dbo.books b
INNER JOIN dbo.lending l
ON b.ColName = l.ColName -- << define condition here
INNER JOIN dbo.students s
ON ...... -- << define condition here
where l.act between '4/16/2013' and '4/16/2013' and
l.stat ='close'
To further gain more knowledge about joins, kindly visit the link below:
Visual Representation of SQL Joins
You're not defining any join conditions between the tables, so you'll get a cartesian product.
Try something like this instead:
SELECT
b.isbn, l.lend_no, s.first_name
FROM
dbo.books b
INNER JOIN
dbo.lending l ON l.Book_id = b.Book_id -- just guessing here
INNER JOIN
dbo.students s ON l.student_id = s.student_id -- just guessing here
WHERE
l.act BETWEEN '20130416' AND '20130416'
AND l.stat = 'close'
Define the join conditions as needed - I don't know your tables, you'll have to find out what columns link the two tables respectively.
I also used the proper ANSI JOIN syntax - don't just list a bunch of tables separated by commas; that style was superseded by the explicit JOIN syntax in the SQL standard over 20 years ago (SQL-92).
Also: I would always use the ISO-8601 date format YYYYMMDD to be safe - this is the only format that works on all versions of SQL Server and with all language, regional and dateformat settings.
Your FROM clause isn't doing what you expect it to. By specifying the tables that way without any join conditions you are doing a cross join, which is giving you a cartesian product. You need to be using the proper table join syntax.
This is a great explanation of table joins: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
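The row explosion described in these answers is easy to reproduce. The sketch below uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server; the key columns (book_id, student_id) and the sample data are guesses, just as the join columns in the answers above are.

```python
import sqlite3

# SQLite stand-in; key columns and sample rows are assumed for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER, isbn TEXT);
    CREATE TABLE students (student_id INTEGER, first_name TEXT);
    CREATE TABLE lending (lend_no INTEGER, book_id INTEGER,
                          student_id INTEGER, stat TEXT);
    INSERT INTO books VALUES (1, 'isbn-1'), (2, 'isbn-2'), (3, 'isbn-3'), (4, 'isbn-4');
    INSERT INTO students VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO lending VALUES (100, 1, 2, 'close'), (101, 3, 1, 'close');
""")

# No join conditions: every book x every lending x every student.
cartesian = conn.execute("""
    SELECT b.isbn, l.lend_no, s.first_name
    FROM books b, lending l, students s
    WHERE l.stat = 'close'
""").fetchall()

# Join conditions tie each lending to its own book and student.
joined = conn.execute("""
    SELECT b.isbn, l.lend_no, s.first_name
    FROM books b
    INNER JOIN lending l ON l.book_id = b.book_id
    INNER JOIN students s ON s.student_id = l.student_id
    WHERE l.stat = 'close'
    ORDER BY l.lend_no
""").fetchall()

print(len(cartesian), len(joined))
print(joined)
```

With 4 books, 2 closed lendings and 3 students, the unjoined query returns 4 x 2 x 3 rows, while the joined query returns one row per lending.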

How to improve SQL Query Performance

I have the following DB Structure (simplified):
Payments
----------------------
Id | int
InvoiceId | int
Active | bit
Processed | bit
Invoices
----------------------
Id | int
CustomerOrderId | int
CustomerOrders
------------------------------------
Id | int
ApprovalDate | DateTime
ExternalStoreOrderNumber | nvarchar
Each Customer Order has an Invoice and each Invoice can have multiple Payments.
The ExternalStoreOrderNumber is a reference to the order from the external partner store we imported the order from and the ApprovalDate the timestamp when that import happened.
Now we have the problem that we had a wrong import and need to move some payments to other invoices (several hundred, so too many to do by hand) according to the following logic:
Search the Invoice of the Order which has the same external number as the current one but starts with 0 instead of the current digit.
To do that I created the following query:
UPDATE DB.dbo.Payments
SET InvoiceId=
(SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
WHERE I.CustomerOrderId=
(SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O
WHERE O.ExternalOrderNumber='0'+SUBSTRING(
(SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
WHERE OO.Id=I.CustomerOrderId), 1, 10000)))
WHERE Id IN (
SELECT P.Id
FROM DB.dbo.Payments AS P
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00')
I started that query on a test system using the live data (~250.000 rows in each table) and it has now been running for 16 hours. Did I do something completely wrong in the query, or is there a way to speed it up a little?
It is not required to be really fast, as it is a one-time task, but several hours seems long to me, and as I want to learn for the (hopefully not happening) next time, I would like some feedback on how to improve it.
You might as well kill the query. Your update subquery is completely un-correlated to the table being updated. From the looks of it, when it completes, EVERY SINGLE dbo.payments record will have the same value.
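The effect of an uncorrelated subquery in an UPDATE can be shown with a tiny sketch. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server, with LIMIT 1 playing the role of TOP 1; the two cut-down tables and their rows are made up for the demonstration.

```python
import sqlite3

# SQLite stand-in; simplified tables and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Invoices (Id INTEGER PRIMARY KEY);
    CREATE TABLE Payments (Id INTEGER PRIMARY KEY, InvoiceId INTEGER);
    INSERT INTO Invoices VALUES (10), (20), (30);
    INSERT INTO Payments VALUES (1, 10), (2, 20), (3, 30);
""")

# The subquery never refers to the Payments row being updated, so it yields
# one value that is written to every row the UPDATE touches.
conn.execute("""
    UPDATE Payments
    SET InvoiceId = (SELECT I.Id FROM Invoices AS I ORDER BY I.Id DESC LIMIT 1)
""")

invoice_ids = [r[0] for r in conn.execute(
    "SELECT InvoiceId FROM Payments ORDER BY Id"
)]
print(invoice_ids)
```

All three payments end up pointing at the same invoice, which is the outcome the answer warns about.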
To break down your query, you might find that the subquery runs fine on its own.
SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
WHERE I.CustomerOrderId=
(SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O
WHERE O.ExternalOrderNumber='0'+SUBSTRING(
(SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
WHERE OO.Id=I.CustomerOrderId), 1, 10000))
That is always a BIG worry.
The next thing is that it is running this row-by-row for every record in the table.
You are also double-dipping into payments, by selecting from where ... the id is from a join involving itself. You can reference a table for update in the JOIN clause using this pattern:
UPDATE P
....
FROM DB.dbo.Payments AS P
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
Moving on, another mistake is to use TOP without ORDER BY. That's asking for random results. If you know there's only one result, you wouldn't even need TOP. In this case, maybe you're ok with randomly choosing one from many possible matches. Since you have three levels of TOP(1) without ORDER BY, you might as well just mash them all up (join) and take a single TOP(1) across all of them. That would make it look like this
SET InvoiceId=
(SELECT TOP 1 I.Id
FROM DB.dbo.Invoices AS I
JOIN DB.dbo.CustomerOrders AS O
ON I.CustomerOrderId=O.Id
JOIN DB.dbo.CustomerOrders AS OO
ON O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber,1,100)
AND OO.Id=I.CustomerOrderId)
However, as I mentioned very early on, this is not being correlated to the main FROM clause at all. We move the entire search into the main query so that we can make use of JOIN-based set operations rather than row-by-row subqueries.
Before I show the final query (fully commented): I think your SUBSTRING is supposed to address the logic "but starts with 0 instead of the current digit". However, if that means what I read it to mean, then for an order number '5678' you're looking for '0678', which would also mean that SUBSTRING should be using 2,10000 instead of 1,10000.
UPDATE P
SET InvoiceId=II.Id
FROM DB.dbo.Payments AS P
-- invoices for payments
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
-- orders for invoices
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
-- another order with '0' as leading digit
JOIN DB.dbo.CustomerOrders AS OO
ON OO.ExternalOrderNumber='0'+substring(O.ExternalOrderNumber,2,1000)
-- invoices for this other order
JOIN DB.dbo.Invoices AS II ON OO.Id=II.CustomerOrderId
-- conditions for the Payments records
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
It is worth noting that SQL Server allows UPDATE ..FROM ..JOIN, which is less widely supported by other DBMS, e.g. Oracle. One reason is that for a single row in Payments (the update target), there could be many candidate II.Id values to choose from across all the joined rows. You will get an arbitrary one of the possible II.Id values.
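The SUBSTRING start position questioned in this answer can be checked in isolation. This sketch evaluates both variants for the sample order number '5678' using SQLite through Python's sqlite3 module; SQLite's substr and the || concatenation operator stand in for T-SQL's SUBSTRING and +, which is an assumption of equivalent behaviour for this simple case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Start position 1 keeps the whole number; start position 2 drops the first digit.
from_one, from_two = conn.execute("""
    SELECT '0' || substr('5678', 1, 10000),
           '0' || substr('5678', 2, 10000)
""").fetchone()

print(from_one)  # '0' prepended to the full number
print(from_two)  # leading digit replaced by '0'
```

Only the variant starting at position 2 replaces the leading digit rather than prepending to it.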
I think something like this will be more efficient, if I understood your query right. As I wrote it by hand and didn't run it, it may have some syntax errors.
UPDATE DB.dbo.Payments
set InvoiceId=(SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
inner join DB.dbo.CustomerOrders AS O ON I.CustomerOrderId=O.Id
inner join DB.dbo.CustomerOrders AS OO On OO.Id=I.CustomerOrderId
and O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber, 1, 10000))
FROM DB.dbo.Payments
JOIN DB.dbo.Invoices AS I ON I.Id=Payments.InvoiceId and
Payments.Active=0
AND Payments.Processed=0
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
AND O.ApprovalDate='2012-07-19 00:00:00'
Try to re-write using JOINs. This will highlight some of the problems. Will the following function do just the same? (The queries are somewhat different, but I guess this is roughly what you're trying to do)
UPDATE P
SET InvoiceId= I.Id
FROM DB.dbo.Payments AS P
CROSS JOIN DB.dbo.Invoices AS I
INNER JOIN DB.dbo.CustomerOrders AS O
ON I.CustomerOrderId = O.Id
INNER JOIN DB.dbo.CustomerOrders AS OO
ON O.ExternalOrderNumber = '0' + SUBSTRING(OO.ExternalOrderNumber, 1, 10000)
AND OO.Id = I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'
As you see, two problems stand out:
The unconditional join between Payments and Invoices (of course, you've cut this off with a TOP 1 statement, but set-wise it's still unconditional) - I'm not really sure if this really is a problem in your query. It will be in mine though :).
The join on a 10000-character column (SUBSTRING), embodied in a condition. This is highly inefficient.
If you need a one-time speedup, just take the queries on each table, try to store the in-between-results in temporary tables, create indices on those temporary tables and use the temporary tables to perform the update.

Getting repetitive column names by adding a prefix to the repeated column name in SQL Server 2005

How can I write a stored procedure in SQL Server 2005 so that I can display the repeated column names with a prefix added to them?
Example: If I have 'Others' as the column name belonging to a multiple categories mapped to another table having columns as 'MyColumn','YourColumn'. I need to join these two tables so that my output should be 'M_Others' and 'Y_Others'. I can use a case but I am not sure of any other repeated columns in the table. How to write that dynamically to know the repetitions ?
Thanks In Advance
You should use aliases in the projection of the query: (bogus example, showing the usage)
SELECT c.CustomerID AS Customers_CustomerID, o.CustomerID AS Orders_CustomerID
FROM Customers c INNER JOIN Orders o ON c.CustomerID = o.CustomerID
You can't dynamically change the column names without using dynamic SQL.
You have to explicitly alias them. There is no way to change "A_Others" or "B_Others" in this query:
SELECT
A.Others AS A_Others,
B.Others AS B_Others
FROM
TableA A
JOIN
TableB B ON A.KeyCol = B.KeyCol
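To confirm what the explicit aliases do to the result set, here is a short sketch that runs the query above and inspects the returned column names. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server 2005; the table contents are made up.

```python
import sqlite3

# SQLite stand-in; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (KeyCol INTEGER, Others TEXT);
    CREATE TABLE TableB (KeyCol INTEGER, Others TEXT);
    INSERT INTO TableA VALUES (1, 'from A');
    INSERT INTO TableB VALUES (1, 'from B');
""")

cur = conn.execute("""
    SELECT
        A.Others AS A_Others,
        B.Others AS B_Others
    FROM TableA A
    JOIN TableB B ON A.KeyCol = B.KeyCol
""")

# cursor.description holds one entry per result column; item 0 is its name.
names = [col[0] for col in cur.description]
rows = cur.fetchall()

print(names)
print(rows)
```

The two identically named source columns come back under distinct, prefixed names, so the caller can tell them apart.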
If the repeated columns contain the same data (i.e. they are the join fields), you should not be sending both in the query anyway, as this is a poor practice and is wasteful of both server and network resources. You should not use select * in queries on production, especially if there are joins. If you are properly writing SQL code, you would alias as you go along when there are two columns with the same name that mean different things (for instance if you joined twice to the person table, once to get the doctor name and once to get the patient name). Doing this dynamically from system tables would not only be inefficient but could end up giving you a big security hole, depending on how badly you wrote the code. You want to save five minutes or less in development by permanently affecting performance for every user and possibly negatively impacting data security. This is what database people refer to as a bad thing.
select n.id_pk,
(case when groupcount.n_count > 1 then substring(m.name, 1, 1) + '_' + n.name
else n.name end)
from test_table1 m
left join test_table2 n on m.id_pk = n.id_fk
left join (select name, count(name) as n_count
from test_table2 group by name)
groupcount on n.name = groupcount.name

Resources