Does writing the full path in SELECT statements enhance performance in SQL Server? - sql-server

Is query performance impacted by writing the full path in the query? And what is the best practice when writing such queries, assuming the script is far more complex and longer than the following?
Example #1:
SELECT Databasename.Tablename.NameofColumn
FROM databasename.tablename
Example #2:
SELECT NameofColumn
FROM tablename
OR using aliases - example #3:
SELECT t.NameofColumn
FROM tablename t

There are a number of considerations when you're writing queries that are going to be released into a production environment, and how and when to use fully qualified names is one of those considerations.
A fully qualified table name has four parts: [Server].[Database].[Schema].[Table]. You missed Schema in your examples above, but it's actually the one that makes the most difference. SQL Server will allow you to have objects with the same name in different schemas; so you could have dbo.myTable and staging.myTable in the same database. SQL Server doesn't care, but your query probably does.
Even if there aren't identically named objects, adding the schema still helps the engine find the object you're querying a little bit faster, so there's your performance boost, albeit a small one, and only in the compile/execution plan step.
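As a rough sketch (the table and column names here are invented), this is the ambiguity that schema-qualifying removes, and the form the engine resolves fastest:

-- Two tables with the same name in different schemas (hypothetical names).
CREATE SCHEMA staging;
GO
CREATE TABLE dbo.myTable (Id int, Amount decimal(10,2));
CREATE TABLE staging.myTable (Id int, Amount decimal(10,2));
GO
-- Ambiguous: resolved against the caller's default schema (usually dbo).
SELECT Id, Amount FROM myTable;
-- Explicit: always hits the table you meant, and skips the extra name-resolution step.
SELECT Id, Amount FROM staging.myTable;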
Besides performance, though, you need to worry about readability for your own sake when you need to revisit your code, and conventionality for when somebody else needs to look at your code. Conventions vary slightly from shop to shop, but here are a couple of generalities that will at least make your code easier to look at, say, on Stack Overflow.
1. Use table aliases.
This gets almost unreadable after about three column names:
SELECT
SchemaName.Tablename.NameofColumn1,
SchemaName.Tablename.NameofColumn2,
SchemaName.Tablename.NameofColumn3
FROM SchemaName.TableName
This is just easier on the brain:
SELECT
tn.NameofColumn1,
tn.NameofColumn2,
tn.NameofColumn3
FROM SchemaName.TableName as tn
2. Put the alias in front of every column reference, everywhere in your query.
There should never be any ambiguity about which table a particular column is coming from, either for you, when you're trying to troubleshoot it at 3:00 AM, or for anyone else, when you're sipping margaritas on the beach and your buddy's on call for you.
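For example (hypothetical schema, table, and column names), even in a simple two-table join the prefixes remove any guesswork about where a column comes from:

SELECT
    o.OrderId,
    o.OrderDate,
    c.CustomerName
FROM Sales.Orders AS o
INNER JOIN Sales.Customers AS c
    ON c.CustomerId = o.CustomerId
WHERE o.OrderDate >= '20240101';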
3. Make your aliases meaningful.
Again, it's about keeping things straight in your head later on. Aaron Bertrand wrote the definitive post on it almost ten years ago now.
4. Include the database name in the FROM clause if you want, but...
If you have to restore a database using a different name, your procedures won't run. In my shop, we prefer a USE statement at the top of each proc. Fewer places to change a name if need be.
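For what it's worth, a sketch of that pattern (database, schema, and object names here are placeholders): the USE sits at the top of the deployment script, and the procedure body sticks to schema-qualified, database-free names:

USE MyReportingDb;  -- placeholder name; the only line to touch after a restore under a new name
GO
CREATE PROCEDURE dbo.usp_GetOrders
AS
BEGIN
    SELECT o.OrderId, o.OrderDate
    FROM dbo.Orders AS o;  -- schema-qualified, no database name
END
GO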
tl;dr
Your example #3 is pretty close. Just add the table schema to the FROM clause.
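In other words, assuming the table lives in the dbo schema, example #3 becomes:

SELECT t.NameofColumn
FROM dbo.tablename AS t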

Related

Stored procedure with multiple selects - interaction with client tool?

Suppose I have a stored procedure as follows:
create procedure p_x
as
begin
select 'a','b','c'
select 'c','d','e'
select 'e','f','g'
end
go
This is of course not the real code, but it illustrates enough to be able to ask my questions.
I'm looking for the best performance and the best practices to deal with it.
How will the client tool (e.g. Informatica Data Quality) calling this procedure react?
Will it receive 3 separate results, just the last query result, or all results at once?
Will each separate query be sent to the client directly (and will the procedure halt until it completes)? Or is this done after the procedure has finished?
Is it good practice to work this way? I was looking at exchanging an OUTPUT table-type parameter instead, but if I'm correct (based on other posts) that doesn't seem possible - table-valued parameters can only be used as input.
Is there a performance impact to working this way? And if so, what is the most efficient way to do this (e.g. sending just one result back to the client)?
You would be better served by posting your question to the Informatica forums. They should be able to answer your questions precisely and accurately. But I'll give it a go.
How will the tool react? Don't know, but tools that support using stored procedures as a data source will often assume, and consume, a single resultset (the first one). Any others will be ignored. Go ask in their forums.
Will it receive 3 ...? Roughly the same question and answer as the first.
Will each separate query ...? Your procedure produces three resultsets. How the client consumes them is, again, something you should ask in their forums. The procedure itself will not "halt" waiting for the client to do anything.
Is it good practice...? Not in my opinion. Nor is posting a complete nonsense procedure a useful tool for discussing the pros/cons of this approach. Can it be a useful thing to do? Likely. But it is not often done IME. In addition, you are dealing with a tool with which you are not familiar. The simpler you keep things, the better off you are in the long run, regardless of your tools.
A procedure is a unit of work and should do one "thing". If it produces multiple resultsets, one can argue that it ceases to do a single thing since, logically, each resultset represents a set of different (even if related) things. And typically one would expect to see some relationship among the resultsets. If there are no relationships, then the resultsets are obviously different things which violates the idea of a procedure. You might want to review the topic of coupling and cohesion. But I think I see a bigger issue - which I'll address with the next item.
Is there a performance impact ...? This can't really be answered. Performance is always, ALWAYS specific to a particular situation (query, schema, etc.). Based on that last sentence, I think you have not made the adjustment to thinking in terms of sets - something that is critical to writing efficient SQL. Rather, I'll guess that you are thinking in terms of a loop which includes a select statement, where each iteration produces a set of (perhaps 1, but who knows) rows. If you think you have the "option" to produce just one resultset of 3 rows vs. 3 resultsets of 1 row, then you are most likely stuck in RBAR land. Regardless, this can't really be answered. It is also a question for the Informatica people.
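That said, if the goal really is to hand the client a single resultset, one possible sketch (reusing the placeholder selects from the question, and assuming the three queries can share a column list) is to stack them with UNION ALL and tag each row with its origin:

CREATE PROCEDURE p_x_single
AS
BEGIN
    SELECT 'set1' AS SourceSet, 'a' AS Col1, 'b' AS Col2, 'c' AS Col3
    UNION ALL
    SELECT 'set2', 'c', 'd', 'e'
    UNION ALL
    SELECT 'set3', 'e', 'f', 'g';
END
GO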

What is better - add an optional parameter to an existing SP, or add a new SP?

I have a production SQL-Server DB (reporting) that has many Stored Procedures.
The SPs are publicly exposed to the external world in different ways:
- some users have direct access to the SP,
- some are exposed via a WebService,
- while others are encapsulated as interfaces through a DCOM layer.
The user base is large and we do not know exactly which user-set uses which method of accessing the DB.
We get frequent requests (about one every other month) from user sets to modify an existing SP by adding one column, or a group of columns, to the existing output, all else remaining the same.
We initially started doing this by modifying the existing SP and adding the newly requested columns to the end of the output. But this broke the custom tools built by some other user bases as their tool had the number of columns hardcoded, so adding a column meant they had to modify their tool as well.
Also, for some columns complex logic is required to get the column into the report, which meant the SP's performance degraded, affecting all users - even those who did not need the new column.
We are thinking of various ways to fix this:
1 Default Parameters to control flow
Update the existing SP and gate the new functionality behind a flag added as a default parameter that controls the code path. The new code path is called only when the parameter is set to true; by default it is false (see the sketch after the lists below).
Advantage
New Object is not required.
Ongoing maintenance is not affected.
Testing overhead remains under control.
Disadvantage
Since an existing SP is modified, it will need testing of existing functionality as well as new functionality.
Since we have no inkling on how the client tools are calling the SPs we can never be sure that we have not broken anything.
It will be difficult to handle if the same report gets modified again with more requests – it will mean more flags, and the code will become unreadable.
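For illustration, option 1 would look roughly like this (procedure, table, and column names are invented):

CREATE PROCEDURE dbo.usp_SalesReport
    @IncludeNewColumns bit = 0  -- defaults to the old behaviour
AS
BEGIN
    IF @IncludeNewColumns = 0
        SELECT s.SaleId, s.Amount
        FROM dbo.Sales AS s;
    ELSE
        SELECT s.SaleId, s.Amount, s.ExpensiveDerivedColumn
        FROM dbo.Sales AS s;
END
GO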
2 New Stored procedure
A new stored procedure will be created for any requirement which changes the signature (input/output) of the SP. The new SP will call the original stored procedure for the existing output and add the logic for the new requirement on top of it.
Advantage
The benefit is that there is no impact on the existing procedure, hence no testing is required for the old logic.
Disadvantage
New objects need to be created in the database whenever changes are requested, which adds database maintenance overhead.
Will the execution plan change based on adding a new parameter? If yes then this could adversely affect users who did not request the new column.
Considering that an SP is a public interface to the DB, and that interfaces should be immutable, should we go for option 2?
What is the best practice, or does it depend on the case, and what should be the main driving factors when choosing an option?
Thanks in advance!
Quoting from a disadvantage for your first option:
It will be difficult to handle if the same report gets modified again with more requests – it will mean more flags, and the code will become unreadable.
Personally I feel this is the biggest reason not to modify an existing stored procedure to accommodate the new columns.
When bugs come up with a stored procedure that has several branches, it can become very difficult to debug. Also as you hinted at, the execution plan can change with branching/if statements. (sql using different execution plans when running a query and when running that query inside a stored procedure?)
This is very similar to object oriented coding and your instinct is correct that it's best to extend existing objects instead of modify them.
I would go for approach #2. You will have more objects, but at least when an issue comes up, you will be able to know the affected stored procedure has limited scope/impact.
Over time I've learned to grow objects/data structures horizontally, not vertically. In other words, just make something new, don't keep making existing things bigger and bigger and bigger.
Ok. #2. Definitely. No doubt.
#1 says: "change the existing procedure", causing things to break. No way that's a good thing! Your customers will hate you. Your code just gets more complex meaning it is harder and harder to avoid breaking things leading to more hatred. It will go horribly slowly, and be impossible to tune. And so on.
For #2 you have a stable interface. No hatred. Yay! Seriously, "yay" as in "I still have a job!" as opposed to "boo, I got fired for annoying the hell out of my customers". Seriously. Never ever do #1 for that reason alone. You know this is true. You know it!
Having said that, record what people are doing. Take a user-id as a parameter. Log it. Know your users. Find the ones using old crappy code and ask them nicely to upgrade if necessary.
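A sketch of what that logging could look like (table, procedure, and parameter names are placeholders, not your actual objects):

CREATE TABLE dbo.ProcUsageLog
(
    LogId    int IDENTITY(1,1) PRIMARY KEY,
    ProcName sysname NOT NULL,
    UserId   nvarchar(128) NOT NULL,
    CalledAt datetime2 NOT NULL DEFAULT SYSDATETIME()
);
GO
CREATE PROCEDURE dbo.usp_GetReport
    @UserId nvarchar(128)  -- who is calling
AS
BEGIN
    INSERT dbo.ProcUsageLog (ProcName, UserId)
    VALUES (OBJECT_NAME(@@PROCID), @UserId);

    SELECT 'report rows go here' AS Placeholder;  -- existing report query, unchanged
END
GO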
Your reason given to avoid number 2 is proliferation. But that is only a problem if you don't test stuff. If you do test stuff properly, then proliferation is happening anyway, in your tests. And you can always tune things in #2 if you have to, or at least isolate performance problems.
If the fatter procedure is really great, then retrofit the skinny version with a slimmer version of the fat one. In SQL this is tricky, but copy/paste and cutting down your select column list works. Generally I just don't bother to do this. Life is too short. Having really good test code is a much better investment of time, and data schemas rarely change in ways that break existing queries.
Okay. Rant over. Serious message. Do #2, or at the very least do NOT do #1 or you will get yourself fired, or hated, or both. I can't think of a better reason than that.
Easier to go with #2. Nullable SP parameters can create some very difficult-to-locate bugs. Although, I do employ them from time to time.
Especially when you start getting into joins on NULLs and ANSI settings. The way you write the query can change the results dramatically. KISS (keep it simple, stupid).
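A tiny example of the kind of surprise meant here (invented values): with the default SET ANSI_NULLS ON, NULL never equals anything, so rows silently drop out of = comparisons and join predicates:

DECLARE @flag bit = NULL;

-- With ANSI_NULLS ON (the default), this returns no row:
SELECT 'matched' AS Result WHERE @flag = NULL;

-- You have to ask for NULL explicitly:
SELECT 'matched' AS Result WHERE @flag IS NULL;

-- The same applies to joins: rows whose key column is NULL will not match anything on the other side.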
Also, if it's a parameterized search for reporting or displaying, I might consider a super-fast fetch of data into a LINQ-able object. Then you can search an in-memory list rather than re-fetching from the database.
#2 could be a better option than #1, particularly considering the third disadvantage of #1, since requirements keep changing most of the time. I feel this way because the disadvantages outweigh the advantages on either side.
I would also vote for #2. I've seen a few stored procedures which take #1 to the extreme: the SP has a parameter @Option and a few parameters @param1, @param2, .... The net effect is a single stored procedure that tries to play the role of many stored procedures.
The main disadvantage to #2 is that there are more stored procedures. It may be more difficult to find the one you're looking for, but I think that is a small price to pay for the other advantages you get.
I want to make sure also, that you don't just copy and paste the original stored procedure and add some columns. I've also seen too many of those. If you are only adding a few columns, you can call the original stored procedure and join in the new columns. This will definitely incur a performance penalty if those columns were readily available before, but you won't have to change your original stored procedure (refactoring to allow for good performance and no duplication of the code), nor will you have to maintain two copies of the code (copy and paste for performance).
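A rough sketch of that shape (all object names here are invented): capture the original procedure's output, then join on only the new columns:

CREATE PROCEDURE dbo.usp_SalesReport_v2
AS
BEGIN
    -- Capture the existing procedure's resultset as-is.
    CREATE TABLE #original
    (
        SaleId int NOT NULL,
        Amount decimal(10,2) NOT NULL
    );

    INSERT #original (SaleId, Amount)
    EXEC dbo.usp_SalesReport;

    -- Add only the newly requested columns on top.
    SELECT o.SaleId,
           o.Amount,
           x.NewColumn
    FROM #original AS o
    LEFT JOIN dbo.NewColumnSource AS x
        ON x.SaleId = o.SaleId;
END
GO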
I am going to suggest a couple of other options based on the options you gave.
Alternative option #1: Add another parameter, but instead of making it a default flag, base the parameter on the customer name. That way Customer A can get his specialized report and Customer B can get his slightly different customized report. This adds a ton of work, as updates to the 'Main' portion would have to be copied to all the specialty customer versions.
You could do this with branching 'if' statements.
Alternative option #2: Add new stored procedures, appending the customer's name to each procedure's name. Maintenance-wise this might be a little more difficult, but it will achieve the same end result: each customer gets his own report type.
Option #2 is the one to choose.
You yourself listed the advantages and disadvantages.
As you add new objects to the database based on requirement changes, add only the objects that are necessary, so that your new SP doesn't grow bigger and become difficult to maintain.

Searching a nvarchar(max) field

Our application connects to a SQL Server database. A column of type nvarchar(max) has been added and must be included in the search. The number of records in this DB is only in the tens of thousands, and there are only a few hundred people using the application. I'm told to explore Full-Text Search; is this necessary?
This is like asking, I work 5 miles away, and I was told to consider buying a car. Is this necessary? Too many variables to give you a simple and correct answer to your question. For example, is it a nice walk? Is there public transit available? Is your schedule flexible? Do you have to go far for lunch or run errands after work?
Full-Text Search can help if your typical searches are going to be WHERE col LIKE '%foo%' - but whether it is necessary depends on how large this column will get, whether your searches are true wildcard searches, your tolerance for millisecond vs. nanosecond queries, the amount of concurrency, even seemingly extraneous stuff like whether the data is always in memory and can be searched more efficiently.
The better answer is that you should try it. Populate a table with a copy of your data, add a full-text index, and see if your typical workload improves by using full-text queries instead of LIKE. It probably will, but there's no way for us to know for sure even if you add more specifics than ballpark row counts.
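A rough outline of that experiment (table, catalog, and column names are placeholders; full-text also requires a single-column unique index, assumed here to be the primary key PK_DocumentsCopy):

-- One-time setup on the copied table.
CREATE FULLTEXT CATALOG SearchCatalog;
CREATE FULLTEXT INDEX ON dbo.DocumentsCopy (BigTextColumn)
    KEY INDEX PK_DocumentsCopy
    ON SearchCatalog;
GO

-- Your current pattern...
SELECT DocumentId FROM dbo.DocumentsCopy
WHERE BigTextColumn LIKE '%invoice%';

-- ...versus the full-text version. Note CONTAINS matches whole words
-- (or "prefix*" terms), not arbitrary substrings, so verify the results too.
SELECT DocumentId FROM dbo.DocumentsCopy
WHERE CONTAINS(BigTextColumn, N'invoice');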
In a similar situation I ended up making a table structure that was more search friendly and indexable, then setting up a batch job to copy records from the live database to the reporting one.
In my case the original data didn't come close to needing an nvarchar(max) column so I could get away with that. Your mileage may vary. In any case, the answer is "try a few things and see what works for you".

Can joins between views and tables hurt performance?

I am new to SQL Server,
and my manager has given me a job where I have to find out the performance of a query in SQL Server 2008.
That query is very complex, having joins between views and tables. I read on the internet that joins between views and tables can hurt performance.
Can any expert help me with this? Any good links where I can learn about this? How do I measure query performance in SQL Server?
Look at the query plan - it could be anything. Missing indexes on one of the underlying view tables, missing indexes on the join table or something else.
Take a look at this article by Gail Shaw about finding performance issues in SQL Server - part 1, part 2.
A view (that isn't indexed/materialised) is just a macro: no more, no less
That is, it expands into the outer query. So a join with 3 views (with 4, 5, and 6 joins respectively) becomes a single query with 15 JOINs.
This is a common "I can reuse it" mistake: DRY doesn't usually apply to SQL code.
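A tiny illustration (invented names): the optimizer treats the two queries below as the same statement, because the view's definition is expanded in place:

CREATE VIEW dbo.vActiveOrders AS
    SELECT o.OrderId, o.CustomerId, o.Total
    FROM dbo.Orders AS o
    WHERE o.Status = 'Active';
GO

-- Querying the view...
SELECT v.OrderId, c.CustomerName
FROM dbo.vActiveOrders AS v
INNER JOIN dbo.Customers AS c ON c.CustomerId = v.CustomerId;

-- ...is compiled as if you had written the expansion by hand.
SELECT o.OrderId, c.CustomerName
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
WHERE o.Status = 'Active';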
Otherwise, see Oded's answer
Oded is correct, you should definitely start with the query plan. That said, there are a couple of things that you can already see at a high level, for example:
CONVERT(VARCHAR(8000), CN.Note) LIKE '%T B9997%'
LIKE searches with a wildcard at the front are bad news in terms of performance, because of the way indexes work. If you think about it, it's easy to find all people in the phone book whose name starts with "Smi". If you try to find all people who have "mit" anywhere in their name, you will find that you need to read the entire phone book. SQL Server does the same thing - this is called a full table scan, and is typically quite slow. The other issue is that the left side of the condition uses a function to modify the column (specifically, converting it to a varchar). Essentially, this again means that SQL Server cannot use an index, even if there was one for the column CN.Note.
My guess is that the column is a text column, and that you will not be allowed to change the filter logic to remove the wildcard at the beginning of the search. In this case, I would recommend looking into Full-Text Search / Indexing functionality. By enabling full text indexing, and using specific keywords such as CONTAINS, you should get better performance.
Again (as with all performance optimisation scenarios), you should still start with the query plan to see if this really is the biggest problem with the query.
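For illustration only (this assumes a full-text index already exists on the underlying Note column, and note that the semantics differ: full text matches on words or "prefix*" terms, not arbitrary substrings), the predicate might change from:

WHERE CONVERT(VARCHAR(8000), CN.Note) LIKE '%T B9997%'  -- scans every row

to something like:

WHERE CONTAINS(CN.Note, N'"T B9997"')  -- can use the full-text index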

High-performance wiki-schema

I'm using MS SQL Server 2005.
What's the best schema for a wiki-like system, where users edit/revise a submission and the system keeps track of these submissions?
Let's say we're doing a simple wiki-based system. It will keep track of each revision plus the views and latest activity of each revision. In other screens, the system will list "Latest Submissions" and "Most Viewed", plus search by title.
My current schema (and I know it's bad) uses a single table. When I need to see the "Latest Submissions" I sort by "LatestActivity", group by "DocumentTitle", then take the first N records. I assume a lot of grouping (especially grouping on nvarchar) is bad news. For listing the most viewed I do the same: sort by views, group by name, take the first N records. Most of the time, I will also be doing a "WHERE DocumentName LIKE '%QUERY-HERE%'".
My current schema is "Version 1", see below:
(Schema diagram: http://www.anaimi.com/junk/schemaquestion.png)
I assume this is not acceptable, so I'm trying to come up with another, more performant design. How does Version 2 sound to you? In version 2 I get the advantage of grouping on WikiHeadId, which is a number - I'm assuming grouping over a number is better than over an nvarchar.
Or there is the extreme case, version 3, where I do no grouping at all, but which has several disadvantages such as duplicating values, maintaining these values in code, etc.
Or is there a better/known schema for such systems?
Thanks.
(Moved from ServerFault - I think it's a development question more than an IT question.)
Firstly (and out of curiosity) how does the current schema indicate what the current version is? Do you just have multiple 'WikiDocument' entries with the same DocumentTitle?
I'm also not clear on why you need a 'LastActivity' at a Version level. I don't see how 'LastActivity' fits with the concept of a 'Version' -- in most wikis, the 'versions' are write-once: if you modify a version, then you're creating a new version, so the concept of a last-updated type value on the version is meaningless -- it's really just 'datecreated'.
Really, the 'natural' schema for your design is #2. Personally, I'm a bit of a fan of the old DB axiom 'normalize until it hurts, then denormalize until it works'. #2 is a cleaner, nicer design (simple, with no duplication), and if you have no urgent reason to denormalize to version 3, I wouldn't bother.
Ultimately, it comes down to this: are you worrying about a 'more performant' design because you've observed performance problems, or because you hypothetically might have some? There's no real reason #2 shouldn't perform well. Grouping isn't necessarily bad news in SQL Server -- in fact, if there's an appropriate covering index for the query, it can perform extremely well, because it can just navigate to a particular level in the index to find the grouped values, then use the remaining columns of the index for the MIN/MAX/whatever. Grouping by NVARCHAR isn't particularly bad -- if it's not observed to be a problem, don't fret about it, though (non-binary) collations can make it a little tricky -- but in version 2, where you need to GROUP BY, you can do it by WikiHeadId, right?
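For concreteness, here's a guess at the shape of version 2's two tables (column names are inferred from the discussion; the actual diagram isn't reproduced here):

CREATE TABLE WikiHead
(
    WikiHeadId    int IDENTITY(1,1) PRIMARY KEY,
    DocumentTitle nvarchar(200) NOT NULL
);

CREATE TABLE WikiBody
(
    WikiHeadId      int NOT NULL REFERENCES WikiHead (WikiHeadId),
    WikiBodyVersion int NOT NULL,
    Body            nvarchar(max) NOT NULL,
    Views           int NOT NULL DEFAULT 0,
    LastUpdated     datetime NOT NULL,
    PRIMARY KEY (WikiHeadId, WikiBodyVersion)
);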
One thing that may make life easier, if you do a lot of operations on the current version (as I assume you would), is to add an FK back from the head table to the body table, indicating the current version. If you want to view the current versions with the highest number of hits, with #2 as it stands now it might be:
SELECT TOP ...
FROM WikiHead
INNER JOIN
    (SELECT WikiHeadId, MAX(WikiBodyVersion) /* or LastUpdated? */ AS Latest
     FROM WikiBody GROUP BY WikiHeadId) AS LatestVersions
    ON (WikiHead.WikiHeadId = LatestVersions.WikiHeadId)
INNER JOIN WikiBody ON
    (LatestVersions.WikiHeadId = WikiBody.WikiHeadId)
    AND (WikiBody.WikiBodyVersion = LatestVersions.Latest)
ORDER BY
    Views DESC
or alternatively
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiBody.WikiBodyVersion =
    (SELECT MAX(WikiBodyVersion) FROM WikiBody WHERE WikiBody.WikiHeadId = WikiHead.WikiHeadId))
...
both of which are icky. If the WikiHead keeps a pointer to the current version, it's just
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiHead.Latest = WikiBody.WikiBodyVersion)
...
or whatever, which may be a useful denormalization just because it makes your life easier, not for performance.
Check this out.
It's the database schema for MediaWiki, which Wikipedia is based on.
It looks pretty well documented and would be an interesting read for you.
From this page.
