So we are migrating from Informix to SQL Server, and I have noticed that in Informix the queries are written in this manner:
select [col1],[col2],[col3],[col4],[col5]
from tableA, tableB
where tableA.[col1] = tableB.[gustavs_custom_chrome_id]
Whereas all the queries I write in SQL Server are written as:
select [col1],[col2],[col3],[col4],[col5]
from tableA
inner join tableB on tableA.[col1] = tableB.[gustavs_custom_chrome_id]
Now, my first thought was: that first query is bad. It probably creates this huge record set and then whittles it down to the actual record set using the WHERE clause. Therefore, it's bad for performance. And it's non-ANSI. So it's doubly bad.
However, after some googling, it seems that they both are, in theory, pretty much the same. And they both are ANSI compliant.
So my questions are:
Do both queries perform the same? I.e., do they run just as fast and always give the same answer?
Are both really ANSI-compliant?
Are there any outstanding reasons why I should push for one style over another? Or should I just leave good enough alone?
Note: These are just examples of the queries. I've seen some queries (of the first kind) join up to 5 tables at a time.
Well, "better" is subjective. There is some style here. But I'll address your questions directly.
Both perform the same
Both are ANSI-compliant.
The problem with the first example is that
it is very easy to inadvertently derive the cross product (since it is easier to leave out join criteria; a sketch follows this list)
it also becomes difficult to debug the join criteria as you add more and more tables to the join
since the old-style outer join (*=) syntax has been deprecated (it has long been documented to return incorrect results), when you need to introduce outer joins, you need to mix new style and old style joins ... why promote inconsistency?
while it's not exactly the authority on best practices, Microsoft recommends explicit INNER/OUTER JOIN syntax
with the latter method:
you are using consistent join syntax regardless of inner / outer
it is tougher (not impossible) to accidentally derive the cross product
isolating the join criteria from the filter criteria can make debugging easier
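To make the cross-product point concrete, here is a minimal sketch (the table and column names are hypothetical, not from the question):
-- Old style: forgetting the join predicate still parses and runs,
-- silently returning every combination of rows (a cross product).
SELECT o.OrderID, c.CompanyName
FROM Orders o, Customers c;        -- oops: no WHERE o.CustomerID = c.CustomerID

-- Explicit style: leaving out the ON clause is a syntax error,
-- so the same mistake is caught immediately.
SELECT o.OrderID, c.CompanyName
FROM Orders o
INNER JOIN Customers c ON o.CustomerID = c.CustomerID;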
I wrote the post Kevin pointed to.
My Database Professor told us to use:
SELECT A.a1, B.b1 FROM A, B WHERE A.a2 = B.b2;
Rather than:
SELECT A.a1, B.b1 FROM A INNER JOIN B ON A.a2 = B.b2;
Supposedly Oracle doesn't like the JOIN syntax, because the JOIN syntax is harder to optimize than the WHERE restriction of the Cartesian product.
I can't imagine why this should be the case. The only performance issue could be that the parser needs to parse a few more characters, but that is negligible in my eyes.
I found these Stack Overflow questions:
Is there an Oracle official recommendation on the use of explicit ANSI JOINs vs implicit joins?
Explicit vs implicit SQL joins
And this sentence in the Oracle documentation: https://docs.oracle.com/cd/B19306_01/server.102/b14200/queries006.htm
Oracle recommends that you use the FROM clause OUTER JOIN syntax rather than the Oracle join operator.
Can someone give me an up-to-date recommendation from Oracle, with a link? She doesn't accept Stack Overflow (anyone can answer here), and the 10g documentation is outdated in her eyes.
If I am wrong and Oracle really doesn't like JOINs now, then that's also OK, but I can't find any articles. I just want to know who is right.
Thanks a lot to everyone who can help me!
Your professor should speak with Gordon Linoff, who is a computer science professor at Columbia University. Gordon, and most SQL enthusiasts on this site, will almost always tell you to use explicit join syntax. The reasons for this are many, including (but not limited to):
Explicit joins make it easy to see what the actual join logic is. Implicit joins, on the other hand, obfuscate the join logic, by spreading it out across both the FROM and WHERE clauses.
The ANSI-92 standard recommends using modern explicit joins, and in fact deprecated the implicit join which your professor seems to be pushing.
Regarding performance, as far as I know, both versions of the query you wrote would be optimized to the same thing under the hood. You can always check the execution plans of both, but I doubt you would see a significant difference very often.
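If you want to see this for yourself, one way (just a sketch, assuming the A and B tables from the question actually exist) is to ask Oracle for the plan of each variant and compare the output:
EXPLAIN PLAN FOR
SELECT A.a1, B.b1 FROM A, B WHERE A.a2 = B.b2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR
SELECT A.a1, B.b1 FROM A INNER JOIN B ON A.a2 = B.b2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);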
An average SQL query you will encounter in real business has 7-8 joins with 12-16 join conditions. One in every 10 or 20 queries may involve nested joins or other more advanced cases.
Explicit join syntax is simply far easier to maintain, debug and develop. And those factors are critical for business software - the faster and safer the better.
Implicit joins are somewhat easier to code if you create statements dynamically through application code. Perhaps there are other uses that I am unaware of.
As with many non-trivial things there is no simple yes / no answer.
The first thing is, for trivial queries (such as the example in your question) it doesn't matter which syntax you use. The classic syntax is even more compact for simple queries.
But for non-trivial queries (say, more than five joins) you will learn the benefits of the ANSI syntax. The main benefit is that the join predicates are separated and distinct from the WHERE condition.
Simple example – this is a complete valid query in the pre-ANSI syntax
SELECT A.a1, B.b1
FROM A, B
WHERE A.a1 = B.b1 and
A.a1 = B.b1(+);
Is it an inner or an outer join? Furthermore, if this construct is scattered among 10 other join conditions in the WHERE clause, it is very easy to misread.
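For comparison, the two readings a reviewer might have in mind look roughly like this in ANSI syntax (which one the original actually means is exactly the problem):
-- Reading 1: a plain inner join between A and B
SELECT A.a1, B.b1
FROM A
INNER JOIN B ON A.a1 = B.b1;

-- Reading 2: a left outer join that preserves rows of A with no match in B
SELECT A.a1, B.b1
FROM A
LEFT OUTER JOIN B ON A.a1 = B.b1;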
Anyway, it would be very naïve to assume that those two syntax options are only syntactic sugar and that the resulting execution plan is identical for all queries, all data and all Oracle versions.
Yes, there were times (around Oracle 10) when you had to be careful. But with versions 12 and 18 I do not see a reason to be defensive, and I'm convinced it is safe to use the ANSI syntax, for the above reasons of better overview and readability.
Final remark for your professor: if you get into the position of optimizing the WHERE restriction of a Cartesian product, you typically already have a performance problem. Make a thought experiment with a Cartesian product of four tables of 1,000 rows each: the intermediate result is 1,000^4 = 10^12 rows, a trillion rows…
There are rare occasions when the optimiser suffers from a bug when using the explicit JOIN syntax as opposed to the implicit one. For example, I once could not get a join elimination optimisation to kick in on Oracle 12c when using explicit joins, whereas the join was properly eliminated with the implicit join syntax. When working with views querying views querying views, lack of join elimination can indeed cause performance issues. I've explained the concept of join elimination in a blog post, here.
That was a bug (and a rare one at that, these days), and not a good reason to avoid the explicit join syntax in general. I think in current versions of Oracle, there's no reason in favour of one or the other syntax other than personal taste, when join trees are simple. With complex join trees, the explicit syntax tends to be superior, as it is more clear, and some relationships (e.g. full outer joins or joins with complex join predicates) are not possible otherwise. But neither of these arguments is about performance.
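For readers unfamiliar with the term, here is a rough sketch of what join elimination looks like (the tables are hypothetical, assuming orders.customer_id is a NOT NULL foreign key referencing customers.id):
-- A view joins the two tables.
CREATE VIEW order_info AS
SELECT o.id, o.order_date, o.customer_id, c.name AS customer_name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- Only columns coming from "orders" are referenced, so the optimizer
-- may eliminate the join to "customers" entirely (the FK guarantees
-- exactly one matching row).
SELECT id, order_date
FROM order_info;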
My DBA tells me that I should always use OPTION (FORCE ORDER) in my SQL statements when accessing a particular set of views. I understand this is to prevent the server vetoing the order of his joins.
Fair enough - it's worth while keeping the DBA happy and I am happy to comply.
However, I would like to write a couple of views in my own schema, but this isn't supported apparently.
How then, can I achieve the same when writing my views, ie having OPTION (FORCE ORDER) being enforced?
Thanks
Fred
Blindly appending OPTION (FORCE ORDER) onto all queries that reference a particular view is extremely poor blanket advice.
OPTION (FORCE ORDER) is a query hint and these are not valid inside a view - you would need to put it on the outer level on all queries referencing your own views.
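In other words, the hint has to be appended to each outer statement that references the view; a sketch (the view name dbo.MyView is made up):
-- The hint cannot live inside the view definition; it goes on the
-- statement that queries the view.
SELECT v.Foo,
       v.Bar
FROM dbo.MyView AS v
WHERE v.Foo > 0
OPTION (FORCE ORDER);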
It is valid to use join hints inside views, though, and:
If a join hint is specified for any two tables, the query optimizer automatically enforces the join order for all joined tables in the query, based on the position of the ON keywords.
So
SELECT v1.Foo,
v2.Bar
FROM v1
INNER HASH JOIN v2
ON v1.x = v2.x;
Would enforce the join order inside v1 and v2 (as well as enforcing the join ordering and algorithm between them).
But I would not recommend this. These types of hints should only be used in an extremely targeted manner in a last resort after not being able to get a satisfactory plan any other way. Not as a matter of policy without even testing alternatives.
Can somebody explain what the big difference is between these two methods?
Is there a misunderstanding of database theory among programmers? Can somebody point to a good article on the question, or simply say what the difference between these methods is in PostgreSQL?
Did you mean SELECT * FROM table1, table2 vs SELECT * FROM table1 JOIN table2 ON condition?
The PostgreSQL optimizer makes these queries run at the same speed, but JOIN is more transparent and usable. Also, you can use LEFT/RIGHT JOIN.
In the PostgreSQL documentation there is a related topic. Explicit joins can give you more control over the execution order of statements using the join_collapse_limit GUC. Take a look at this page.
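As a rough sketch of that control (the table and column names are placeholders):
-- With join_collapse_limit = 1, PostgreSQL keeps explicit JOINs in the
-- order they are written instead of reordering them during planning.
SET join_collapse_limit = 1;

SELECT *
FROM table1
JOIN table2 ON table2.t1_id = table1.id
JOIN table3 ON table3.t2_id = table2.id;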
There are also all the other already mentioned advantages in readability and maintainability.
I have my business logic in ~7000 lines of T-SQL stored procedures, and most of them use the following JOIN syntax:
SELECT A.A, B.B, C.C
FROM aaa AS A, bbb AS B, ccc AS C
WHERE
A.B = B.ID
AND B.C = C.ID
AND C.ID = @param
Will I get a performance gain if I replace such queries with this:
SELECT A.A, B.B, C.C
FROM aaa AS A
JOIN bbb AS B
ON A.B = B.ID
JOIN ccc AS C
ON B.C = C.ID
AND C.ID = @param
Or are they the same?
The two queries are the same, except the second is ANSI-92 SQL syntax and the first is the older SQL syntax which didn't incorporate the join clause. They should produce exactly the same internal query plan, although you may like to check.
You should use the ANSI-92 syntax for several reasons:
The use of the JOIN clause separates the relationship logic from the filter logic (the WHERE) and is thus cleaner and easier to understand.
It doesn't matter with this particular query, but there are a few circumstances where the older outer join syntax (using +) is ambiguous and the query results are hence implementation-dependent, or the query cannot be resolved at all. These do not occur with ANSI-92.
It's good practice, as most developers and DBAs use ANSI-92 nowadays and you should follow the standard. Certainly all modern query tools will generate ANSI-92.
As pointed out by @gbn, it does tend to avoid accidental cross joins.
Myself, I resisted ANSI-92 for some time, as there is a slight conceptual advantage to the old syntax: it's easier to envisage the SQL as a mass Cartesian join of all tables used followed by a filtering operation - a mental technique that can be useful for grasping what a SQL query is doing. However, I decided a few years ago that I needed to move with the times, and after a relatively short adjustment period I now strongly prefer it - predominantly because of the first reason given above. The only place one should depart from the ANSI-92 syntax, or rather not use the option, is with natural joins, which are implicitly dangerous.
The second construct is known as the "infixed join syntax" in the SQL community. The first construct AFAIK doesn't have a widely accepted name, so let's call it the 'old style' inner join syntax.
The usual arguments go like this:
Pros of the 'Traditional' syntax: the predicates are physically grouped together in the WHERE clause, in whatever order, which makes the query generally, and n-ary relationships particularly, easier to read and understand (the ON clauses of the infixed syntax can spread out the predicates, so you have to look for the appearance of one table or column over a visual distance).
Cons of the 'Traditional' syntax: there is no parse error when omitting one of the 'join' predicates, and the result is a Cartesian product (known as a CROSS JOIN in the infixed syntax); such an error can be tricky to detect and debug. Also, 'join' predicates and 'filtering' predicates are physically grouped together in the WHERE clause, which can cause them to be confused for one another.
The two queries are equal - the first uses non-ANSI JOIN syntax, the second uses ANSI JOIN syntax. I recommend sticking with the ANSI JOIN syntax.
And yes, LEFT OUTER JOINs (which, by the way, are also ANSI JOIN syntax) are what you want to use when there's a possibility that the table you're joining to might not contain any matching records.
Reference: Conditional Joins in SQL Server
OK, they execute the same. That's agreed.
Unlike many, I use the older convention. That SQL-92 is "easier to understand" is debatable. Having written programming languages for pushing 40 years (gulp), I know that 'easy to read' begins first, before any other convention, with 'visual acuity' (a misapplied term here, but it's the best phrase I can use).
When reading SQL, the FIRST thing your mind cares about is which tables are involved and then which table (mostly) defines the grain. Then you care about relevant constraints on the data, then the attributes selected. While SQL-92 mostly separates these ideas out, there are so many noise words that the mind's eye has to interpret and deal with, and that makes reading the SQL slower.
SELECT Mgt.attrib_a AS attrib_a
,Sta.attrib_b AS attrib_b
,Stb.attrib_c AS attrib_c
FROM Main_Grain_Table Mgt
,Surrounding_TabA Sta
,Surrounding_tabB Stb
WHERE Mgt.sta_join_col = Sta.sta_join_col
AND Mgt.stb_join_col = Stb.stb_join_col
AND Mgt.bus_logic_col = 'TIGHT'
Visual Acuity!
Put the commas for new attributes in front; it makes commenting code easier too.
Use a specific case for functions and keywords
Use a specific case for tables
Use a specific case for attributes
Vertically line up operators and operations
Make the first table(s) in the FROM represent the grain of the data
Make the first lines of the WHERE be join constraints and let the specific, tight constraints float to the bottom.
Select a 3-character alias for ALL tables in your database and use the alias EVERYWHERE you reference the table. You should use that alias as a prefix for (many) indexes on that table as well.
Six of one, half a dozen of the other, right? Maybe. But even if you're using the ANSI-92 convention (as I have, and in some cases will continue to do), use visual acuity principles: vertical alignment to let your mind's eye avert to the places you want to see and easily avoid things (particularly noise words) you don't need.
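For what it's worth, the same query written in ANSI-92 style but keeping the vertical alignment argued for above might look like this (just a sketch):
SELECT Mgt.attrib_a AS attrib_a
      ,Sta.attrib_b AS attrib_b
      ,Stb.attrib_c AS attrib_c
  FROM Main_Grain_Table Mgt
  JOIN Surrounding_TabA Sta
    ON Mgt.sta_join_col = Sta.sta_join_col
  JOIN Surrounding_tabB Stb
    ON Mgt.stb_join_col = Stb.stb_join_col
 WHERE Mgt.bus_logic_col = 'TIGHT'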
Execute both and check their query plans. They should be equal.
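In SQL Server, one way to do that (a sketch using the tables from the question; the type of @param is assumed to be INT) is to ask for the estimated plan of each statement:
SET SHOWPLAN_TEXT ON;
GO
DECLARE @param INT;
-- old-style join
SELECT A.A, B.B, C.C
FROM aaa AS A, bbb AS B, ccc AS C
WHERE A.B = B.ID AND B.C = C.ID AND C.ID = @param;
-- explicit join
SELECT A.A, B.B, C.C
FROM aaa AS A
JOIN bbb AS B ON A.B = B.ID
JOIN ccc AS C ON B.C = C.ID AND C.ID = @param;
GO
SET SHOWPLAN_TEXT OFF;
GO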
In my mind the FROM clause is where I decide what columns I need in the rows for my SELECT clause to work on. It is where a business rule is expressed that will bring onto the same row, values needed in calculations. The business rule can be customers who have invoices, resulting in rows of invoices including the customer responsible. It could also be venues in the same postcode as clients, resulting in a list of venues and clients that are close together.
It is where I work out the centricity of the rows in my result set. After all, we are simply shown the metaphor of a list in RDBMSs, each list having a topic (the entity) and each row being an instance of the entity. If the row centricity is understood, the entity of the result set is understood.
The WHERE clause, which conceptually executes after the rows are defined in the from clause, culls rows not required (or includes rows that are required) for the SELECT clause to work on.
Because join logic can be expressed in both the FROM clause and the WHERE clause, and because the clauses exist to divide and conquer complex logic, I choose to put join logic that involves values in columns in the FROM clause because that is essentially expressing a business rule that is supported by matching values in columns.
i.e. I won't write a WHERE clause like this:
WHERE Column1 = Column2
I will put that in the FROM clause like this:
ON Column1 = Column2
Likewise, if a column is to be compared to external values (values that may or may not be in a column) such as comparing a postcode to a specific postcode, I will put that in the WHERE clause because I am essentially saying I only want rows like this.
i.e. I won't write a FROM clause like this:
ON PostCode = '1234'
I will put that in the WHERE clause like this:
WHERE PostCode = '1234'
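Putting the two conventions together (Venue and Client here are hypothetical tables matching the postcode example above):
-- The column-to-column business rule goes in the ON clause;
-- the comparison against an external value goes in the WHERE clause.
SELECT V.VenueName,
       C.ClientName
FROM Venue AS V
INNER JOIN Client AS C
        ON V.PostCode = C.PostCode      -- venues and clients that are close together
WHERE V.PostCode = '1234';              -- only rows for this specific postcode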
ANSI syntax enforces neither predicate placement in the proper clause (be that ON or WHERE), nor the affinity of the ON clause to the adjacent table reference. A developer is free to write a mess like this:
SELECT
C.FullName,
C.CustomerCode,
O.OrderDate,
O.OrderTotal,
OD.ExtendedShippingNotes
FROM
Customer C
CROSS JOIN Order O
INNER JOIN OrderDetail OD
ON C.CustomerID = O.CustomerID
AND C.CustomerStatus = 'Preferred'
AND O.OrderTotal > 1000.0
WHERE
O.OrderID = OD.OrderID;
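For contrast, a tidier arrangement of that same (hypothetical) query, with each ON clause next to the table it joins and the filters moved to the WHERE clause, would be roughly:
SELECT
    C.FullName,
    C.CustomerCode,
    O.OrderDate,
    O.OrderTotal,
    OD.ExtendedShippingNotes
FROM
    Customer C
    INNER JOIN [Order] O            -- Order is a reserved word, hence the brackets
        ON O.CustomerID = C.CustomerID
    INNER JOIN OrderDetail OD
        ON OD.OrderID = O.OrderID
WHERE
    C.CustomerStatus = 'Preferred'
    AND O.OrderTotal > 1000.0;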
Speaking of query tools that "will generate ANSI-92", I'm commenting here because one generated:
SELECT 1
FROM DEPARTMENTS C
JOIN EMPLOYEES A
JOIN JOBS B
ON C.DEPARTMENT_ID = A.DEPARTMENT_ID
ON A.JOB_ID = B.JOB_ID
The only syntax that escapes the conventional "restrict-project-Cartesian product" pattern is the outer join. This operation is more complicated because it is not associative (both with itself and with normal join). One has to judiciously parenthesize queries with outer joins, at least. However, it is an exotic operation; if you are using it too often, I suggest taking a relational database class.
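A sketch of why the grouping matters once outer and inner joins are mixed (the tables and key columns are hypothetical):
-- Grouping 1: the final INNER JOIN throws away A rows that found no B,
-- because their B.id is NULL.
SELECT *
FROM (A LEFT JOIN B ON B.a_id = A.id)
     INNER JOIN C ON C.b_id = B.id;

-- Grouping 2: B and C are joined first, and the outer join then
-- preserves every A row regardless.
SELECT *
FROM A LEFT JOIN (B INNER JOIN C ON C.b_id = B.id)
       ON B.a_id = A.id;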
I'm using MS SQL Server 2005.
What's the best schema for a wiki-like system, where users edit/revise a submission and the system keeps track of these submissions?
Let's say we're doing a simple wiki-based system. It will keep track of each revision plus views and the latest activity of each revision. In other screens, the system will list "Latest Submissions" and "Most Viewed", plus search by title.
My current schema (and I know it's bad) uses a single table. When I need to see the "Latest Submissions" I sort by "LatestActivity", group by "DocumentTitle", then take the first N records. I assume a lot of grouping (especially grouping on nvarchar) is bad news. For listing the most viewed I do the same: sort by views, group by name, take the first N records. Most of the time, I will also be doing a "WHERE DocumentName LIKE '%QUERY-HERE%'".
My current schema is "Version 1", see below:
(Schema diagram: http://www.anaimi.com/junk/schemaquestion.png)
I assume this is not acceptable, so I'm trying to come up with another, more performant design. How does Version 2 sound to you? In version two I get the advantage of grouping on WikiHeadId, which is a number - I'm assuming grouping on a number is better than on an nvarchar.
Or there is the extreme case, version 3, where I do no grouping at all, but which has several disadvantages such as duplicating values, maintaining those values in code, etc.
Or is there a better/known schema for such systems?
Thanks.
(Moved from Server Fault - I think it's a development question more than an IT question.)
Firstly (and out of curiosity) how does the current schema indicate what the current version is? Do you just have multiple 'WikiDocument' entries with the same DocumentTitle?
I'm also not clear on why you need a 'LastActivity' at a Version level. I don't see how 'LastActivity' fits with the concept of a 'Version' -- in most wikis, the 'versions' are write-once: if you modify a version, then you're creating a new version, so the concept of a last-updated type value on the version is meaningless -- it's really just 'datecreated'.
Really, the 'natural' schema for your design is #2. Personally, I'm a bit of a fan of the old DB axiom 'normalize until it hurts, then denormalize until it works'. #2 is a cleaner, nicer design (simple, with no duplication), and if you have no urgent reason to denormalize to version 3, I wouldn't bother.
Ultimately, it comes down to this: are you worrying about a 'more performant' design because you've observed performance problems, or because you hypothetically might have some? There's no real reason #2 shouldn't perform well. Grouping isn't necessarily bad news in SQL Server -- in fact, if there's an appropriate covering index for the query, it can perform extremely well, because it can just navigate to a particular level in the index to find the grouped values, then use the remaining columns of the index for the MIN/MAX/whatever. Grouping by NVARCHAR isn't particularly bad -- if it's not observed to be a problem, don't fret about it, though (non-binary) collations can make it a little tricky -- but in version 2, where you need to GROUP BY, you can do it by WikiHeadId, right?
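Purely as a sketch of the covering-index idea (the column names are guessed from the question text, since the actual schema is only in the linked image):
-- Covers "latest activity per document title" queries without having to
-- touch the base table for Views.
CREATE INDEX IX_WikiDocument_Title_Activity
    ON WikiDocument (DocumentTitle, LatestActivity DESC)
    INCLUDE (Views);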
One thing that may make life easier, if you do a lot of operations on the current version (as I assume you would), is to add an FK back from the head table to the body table, indicating the current version. If you want to view the current versions with the highest number of hits, with #2 as it stands now it might be:
SELECT TOP ...
FROM WikiHead
INNER JOIN
    (SELECT WikiHeadId, MAX(WikiBodyVersion) /* or LastUpdated? */ AS Latest
     FROM WikiBody GROUP BY WikiHeadId) AS LatestVersions
    ON WikiHead.WikiHeadId = LatestVersions.WikiHeadId
INNER JOIN WikiBody ON
    (LatestVersions.WikiHeadId = WikiBody.WikiHeadId)
    AND (WikiBody.WikiBodyVersion = LatestVersions.Latest)
ORDER BY
    Views DESC
or alternatively
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiBody.WikiBodyVersion =
(SELECT MAX(WikiBodyVersion) FROM WikiBody WHERE WikiBody.WikiHeadId = WikiHead.WikiHeadId))
...
both of which are icky. If the WikiHead keeps a pointer to the current version, it's just
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiHead.Latest = WikiBody.WikiBodyVersion)
...
or whatever, which may be a useful denormalization just because it makes your life easier, not for performance.
Check this out.
It's the database schema for MediaWiki, which Wikipedia is based on.
It looks pretty well documented and would be an interesting read for you.
From this page.