I have seen several patterns used to 'overcome' the lack of constants in SQL Server, but none of them seem to satisfy both performance and readability / maintainability concerns.
In the below example, assuming that we have an integral 'status' classification on our table, the options seem to be:
Just hard-code it, and possibly just 'comment' the status:
-- StatusId 87 = Loaded
SELECT ... FROM [Table] WHERE StatusId = 87;
Using a lookup table for states, and then joining to this table so that the WHERE clause references the friendly name.
SubQuery:
SELECT ...
FROM [Table]
WHERE
StatusId = (SELECT StatusId FROM TableStatus WHERE StatusName = 'Loaded');
or joined
SELECT ...
FROM [Table] t INNER JOIN TableStatus ts On t.StatusId = ts.StatusId
WHERE ts.StatusName = 'Loaded';
A bunch of scalar UDFs defined to return constants, viz.
CREATE FUNCTION dbo.LoadedStatus()
RETURNS INT
AS
BEGIN
    RETURN 87;
END;
and then
SELECT ... FROM [Table] WHERE StatusId = dbo.LoadedStatus();
(IMO this causes a lot of pollution in the database - this might be OK in an Oracle package wrapper)
And similar patterns with table-valued functions holding the constants as rows or columns, which are CROSS APPLYed back to [Table].
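For illustration, a minimal sketch of that last pattern (the function name and the second status value are made up):
CREATE FUNCTION dbo.StatusConstants()
RETURNS TABLE
AS
RETURN (SELECT 87 AS Loaded, 88 AS Cancelled); -- 88/Cancelled is just an invented example value
GO

SELECT t.*
FROM [Table] t
CROSS APPLY dbo.StatusConstants() s
WHERE t.StatusId = s.Loaded;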
How have other SO users solved this common issue?
Edit: Bounty - Does anyone have a best-practice method for maintaining $(variables) in DBProj DDL / schema scripts, as per Remus' answer and comment?
Hard-coded. With SQL, performance trumps maintainability.
The difference in the execution plan between using a constant that the optimizer can inspect at plan generation time and using any form of indirection (UDF, JOIN, sub-query) is often dramatic. SQL 'compilation' is an extraordinary process (in the sense that it is not 'ordinary' like, say, IL code generation) in that the result is determined not only by the language construct being compiled (i.e. the actual text of the query) but also by the data schema (existing indexes) and the actual data in those indexes (statistics). When a hard-coded value is used, the optimizer can produce a better plan because it can actually check the value against the index statistics and get an estimate of the result.
Another consideration is that a SQL application is not code only, but by a large margin is code and data. 'Refactoring' a SQL program is ... different. Where in a C# program one can change a constant or enum, recompile and happily run the application, in SQL one cannot do so because the value is likely present in millions of records in the database and changing the constant value implies also changing GBs of data, often online while new operations occur.
Just because the value is hard-coded in the queries and procedures seen by the server does not necessarily mean the value has to be hard coded in the original project source code. There are various code generation tools that can take care of this. Consider something as trivial as leveraging the sqlcmd scripting variables:
defines.sql:
:setvar STATUS_LOADED 87
somesource.sql:
:r defines.sql
SELECT ... FROM [Table] WHERE StatusId = $(STATUS_LOADED);
someothersource.sql:
:r defines.sql
UPDATE [Table] SET StatusId = $(STATUS_LOADED) WHERE ...;
While I agree with Remus Rusanu, IMO maintainability of the code (and thus readability, least astonishment, etc.) trumps other concerns unless the performance difference is significant enough to warrant doing otherwise. Thus, the following query loses on readability:
Select ..
From Table
Where StatusId = 87
In general, when I have system dependent values which will be referenced in code (perhaps mimicked in an enumeration by name), I use string primary keys for the tables in which they are kept. Contrast this to user-changeable data in which I generally use surrogate keys. The use of a primary key that requires entry helps (albeit not perfectly) to indicate to other developers that this value is not meant to be arbitrary.
Thus, my "Status" table would look like:
Create Table Status
(
Code varchar(6) Not Null Primary Key
, ...
)
And the query becomes:
Select ...
From Table
Where StatusCode = 'Loaded'
This makes the query more readable, it does not require a join to the Status table, and does not require the use of a magic number (or GUID). Using user-defined functions, IMO, is bad practice. Beyond the performance implications, no developer would expect UDFs to be used in this manner, and thus it violates the least-astonishment criterion. You would almost be compelled to have a UDF for each constant value; otherwise, what are you passing into the function: a name? a magic value? If a name, you might as well keep the name in a table and use it directly in the query. If a magic value, you are back to the original problem.
I have been using the scalar function option in our DB and it works fine; in my view it is the best solution here.
If there are several values related to one item, then build a lookup table instead; for example, if you load a combobox or any other control with static values, a lookup table is the best way to do it.
You can also add more fields to your status table that act as unique markers or groupers for status values.
For example, if you add an isLoaded field to your status table, record 87 could be the only one with the field's value set, and you can test for the value of the isLoaded field instead of the hard-coded 87 or status description.
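A rough sketch of that idea, assuming the lookup table from the question is named TableStatus:
ALTER TABLE TableStatus ADD IsLoaded bit NOT NULL DEFAULT 0;
UPDATE TableStatus SET IsLoaded = 1 WHERE StatusId = 87;

SELECT t.*
FROM [Table] t
INNER JOIN TableStatus ts ON t.StatusId = ts.StatusId
WHERE ts.IsLoaded = 1;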
Related
Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From here, I then have a MERGE process that takes data from the RAW table, compares every column in the RAW table to what exists in the auditable, system-versioned staging table, and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables that have 450,000 records take SQL server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract (what would you call this? a GUID? UUID? hash?). We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to be the same (nvarchar(max)), and NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(..., 'UNKNOWN') statements into my code too.
So - are there better methods for improving the performance of the merge here? I thought that I could use a row-updated timestamp column as a single value to compare instead of the checksum, but I am not certain that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches where developers use a concatenation function (shown below) to combine many column values together. This seems code-intensive, and computing and casting all those columns looks expensive too.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which hashbytes method makes the most sense here? (I'm only comparing one RAW row to the corresponding STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable
(C1 VARCHAR(10),
C2 VARCHAR(10),
C3 VARCHAR(10)
);
INSERT INTO MyTable
(C1,C2,C3)
VALUES
(NULL,NULL,NULL),
(NULL,NULL,3),
(NULL,2,3),
(1,2,3);
SELECT
HASHBYTES('SHA2_256',
CONCAT(C1,'-',
C2,'-',
C3)) AS HashbytesValueNoCastNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(CAST(C1 as varchar(max)),'-',
CAST(C2 as varchar(max)),'-',
CAST(C3 as varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN')) AS StringValue,
CONCAT(C1,'-',C2,'-',C3) AS ConcatString,
C1,
C2,
C3
FROM
MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
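A minimal sketch of that approach, with made-up names (RawTable, StageTable, three data columns, and a RowHash varbinary(32) column stored on the STAGE table and kept populated):
MERGE StageTable AS stg
USING
(
    SELECT Id, C1, C2, C3,
           HASHBYTES('SHA2_256',
               CONCAT(ISNULL(CAST(C1 AS nvarchar(max)), N'~'), N'|',
                      ISNULL(CAST(C2 AS nvarchar(max)), N'~'), N'|',
                      ISNULL(CAST(C3 AS nvarchar(max)), N'~'))) AS RowHash
    FROM RawTable
) AS src
    ON stg.Id = src.Id
WHEN MATCHED AND stg.RowHash <> src.RowHash THEN   -- only touch rows whose hash changed
    UPDATE SET C1 = src.C1, C2 = src.C2, C3 = src.C3, RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, C1, C2, C3, RowHash)
    VALUES (src.Id, src.C1, src.C2, src.C3, src.RowHash);
Whether this actually beats the full column-by-column comparison is something to verify against your own tables and indexes.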
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly, with CHECKSUM you'll probably eventually miss a change or two, with HASHBYTES you probably won't ever miss a change. See remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace NULLs, else the rows (1, NULL, 'A') and (1, 'A', NULL) would get the same hash. The NULL replacement and the delimiter should both be something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar, e.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)
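As a usage sketch (RawTable, StageTable, and a stored RowHash column are assumed names), change detection against a stored hash could then look like:
SELECT r.Id
FROM RawTable r
CROSS APPLY (SELECT r.* FOR JSON PATH) rj(RowJSON)   -- same pattern as above
INNER JOIN StageTable s ON s.Id = r.Id
WHERE s.RowHash <> HASHBYTES('SHA2_256', rj.RowJSON);
Because FOR JSON includes the column names in its output, the (1, NULL, 'A') vs (1, 'A', NULL) ambiguity does not arise here.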
DBMSes only allow values as parameters for prepared statements. However, table, column, and field names are not allowed with prepared statements. For example:
String sql = "Select * from TABLE1 order by ?";
PreparedStatement st = conn.prepareStatement(sql);
st.setString(1, "column_name_1");
Such a statement is not allowed. What is the reason for DBMSes not to support field names in prepared statements?
There are basically two reasons that I am aware of:
Although details vary per database system, conceptually, when a statement is prepared, it is compiled and checked for correctness (do all tables and columns exist). The server generates a plan which describes which tables to access, which fields to retrieve, indexes to use, etc.
This means that at prepare time the database system must know which tables and fields it needs to access, therefore parameterization of tables and fields is not possible. And even if it were technically possible, it would be inefficient, because statement compilation would need to be deferred until execution, essentially throwing away one of the primary reasons for using prepared statements: reusable query plans for better performance.
And consider this: if table names and field names were allowed to be parameterized, why not function names, query fragments, etc.?
Not allowing parameterization of objects prevents ambiguity. For example, in a query with a where column1 = ?, if you set the parameter to peter, would that be a column name or a string value? That is hard to decide, and preventing that ambiguity would make the API harder to use, while the use case for even allowing such parameterization is almost non-existent (and in my experience, the need for such parameterization almost always stems from bad database design).
Allowing parameterization of objects is almost equivalent to just dynamically generating the query and executing it (see also point 1), so why not just forego the additional complexity and disallow parameterization of objects instead.
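To make point 1 concrete, here is a hedged sketch (SQL Server syntax; TABLE1 comes from the question, while Id, the variable names, and the helper logic are just illustrations): if you need a dynamic sort column you generate the statement yourself, validate and quote the identifier, and still pass the values as real parameters.
DECLARE @orderBy sysname = N'column_name_1';  -- supplied by the application
DECLARE @minId int = 10;                      -- an ordinary value parameter

-- whitelist the identifier against the catalog before using it
IF NOT EXISTS (SELECT 1 FROM sys.columns
               WHERE object_id = OBJECT_ID(N'dbo.TABLE1') AND name = @orderBy)
    THROW 50000, 'Unknown sort column.', 1;

DECLARE @sql nvarchar(max) =
    N'SELECT * FROM dbo.TABLE1 WHERE Id >= @minId ORDER BY ' + QUOTENAME(@orderBy) + N';';

EXEC sys.sp_executesql @sql, N'@minId int', @minId = @minId;
The same idea applies from JDBC: build the SQL string with a whitelisted identifier and keep the actual values as ? parameters.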
I have a fairly complex query that does a direct comparison with @EventId, if provided, and it is fast since it grabs the clustered index row. However, sometimes I have to do a group of these Event IDs, and the second line of the WHERE clause below takes almost 30 seconds to run. I figured it would work the same way as looking up the primary key. Is there a reason why it's so much slower?
DECLARE @EventIds TABLE(Id INT NOT NULL);
WHERE
(@EventId IS NULL OR (ev.Id = @EventId)) AND
(NOT EXISTS(SELECT 1 FROM @EventIds) OR ev.Id IN (SELECT * FROM @EventIds))
There's no really good reason to have the expression
NOT EXISTS(SELECT 1 FROM @EventIds) OR ev.Id IN (SELECT * FROM @EventIds)
The first expression, even if true, doesn't preclude the evaluation of the second expression because SQL Server doesn't shortcut boolean expressions.
Second, table variables have been known to cause bad execution plans due to incorrect statistics and row counts. Please refer to this essay on the difference between table variables and temporary tables, topics: Cardinality and No column statistics.
It might help to add the following query hint at the end of the query:
OPTION(RECOMPILE);
Yes this recompiles the plan each time, but if you're getting horrible performance the small additional compile time doesn't matter that much.
This query hint is also recommended when you have optional filters, as you do with @EventId.
It may also help to define a primary key on Id in the @EventIds table variable. This would allow an index seek instead of a table scan.
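Putting those suggestions together, a sketch (the table name dbo.Events and the surrounding query shape are assumed from the fragment above):
DECLARE @EventIds TABLE (Id INT NOT NULL PRIMARY KEY);  -- the PK allows a seek

SELECT ev.*
FROM dbo.Events AS ev          -- assumed table name
WHERE (@EventId IS NULL OR ev.Id = @EventId)
  AND (NOT EXISTS (SELECT 1 FROM @EventIds)
       OR ev.Id IN (SELECT Id FROM @EventIds))
OPTION (RECOMPILE);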
Possible Duplicate:
Why does the Execution Plan include a user-defined function call for a computed column that is persisted?
In SQL Server 2008 I'm running the SQL profiler on a long running query and can see that a persisted computed column is being repeatedly recalculated. I've noticed this before and anecdotally I'd say that this seems to occur on more complex queries and/or tables with at least a few thousand rows.
This recalculation is definitely the cause of the long execution as it speeds up dramatically if I comment out that one column from the returned results (The field is computed by running an XPath against an Xml field).
EDIT: Offending SQL has the following structure:
DECLARE @OrderBy nvarchar(50);
SELECT
A.[Id],
CASE
WHEN @OrderBy = 'Col1' THEN A.[ComputedCol1]
WHEN @OrderBy = 'Col2' THEN SC.[ComputedCol2]
ELSE SC.[ComputedCol3]
END AS [Order]
FROM
[Stuff] AS A
INNER JOIN
[StuffCode] AS SC
ON
A.[Code] = SC.[Code]
All columns are nvarchar(50) except for ComputedCol3 which is nvarchar(250).
The query optimizer always tries to pick the cheapest plan, but it may not make the right choice. By persisting a column you are putting it in the main table (in the clustered index or the heap) but in order to pull out these values, normal data access paths are still required.
This means that the engine may choose other indexes instead of the main table to satisfy the query, and it could choose to recalculate the computed column if it thinks doing so combined with its chosen I/O access pattern will cost less. In general, a fair amount of CPU is cheaper than a little I/O, but no internal analysis of the cost of the expression is done, so if your column calls an expensive UDF it may make the wrong decision.
Putting an index on the column could make a difference. Note that you don't have to make the column persisted to put an index on it. If, after creating the index, the engine is still making mistakes, check whether proper statistics are being collected and frequently updated on all the indexes on the table.
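For example, a minimal sketch (made-up table): a computed column can be indexed without being persisted, as long as its expression is deterministic and precise.
CREATE TABLE dbo.OrderLines
(
    OrderLineId int IDENTITY PRIMARY KEY,
    Quantity    int NOT NULL,
    UnitPrice   decimal(10,2) NOT NULL,
    LineTotal   AS (Quantity * UnitPrice)   -- computed, not PERSISTED
);

CREATE INDEX IX_OrderLines_LineTotal ON dbo.OrderLines (LineTotal);
If the column is computed through a UDF, as in the question, the UDF additionally has to be schema-bound and deterministic before SQL Server will allow the index.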
It would help us help you if you posted the structure of your table (just the important columns) and the definitions of any indexes, along with some ideas of what the execution plan looks like when things go badly.
One thing to consider is that it may actually be better to recompute the column in some cases, so make sure that it's really correct to force the engine to go get it before doing so.
I have a process that is performing badly due to full table scans on a particular table. I have computed statistics, rebuilt existing indices and tried adding new indices for this table but this hasn't solved the issue.
Can an implicit type conversion stop an index from being used? What about other reasons? The cost of the full table scan is around 1,000 greater than the index lookup should be.
EDIT:
SQL statement:
select unique_key
from src_table
where natural_key1 = :1
and natural_key2 = :2
and natural_key3 = :3;
Cardinality of natural_key1 is high, but there is a type conversion.
The other parts of the natural key are low cardinality, and bitmap indices are not enabled.
Table size is around 1,000,000 records.
Java code (not easily modifiable):
ps.setLong(1, oid);
This conflicts with the column datatype: varchar2
An implicit conversion can prevent an index from being used by the optimizer. Consider:
SQL> CREATE TABLE a (ID VARCHAR2(10) PRIMARY KEY);
Table created
SQL> insert into a select rownum from dual connect by rownum <= 1e6;
1000000 rows inserted
This is a simple table, but the datatype is not 'right'; i.e. if you query it like this, it will full scan:
SQL> select * from a where id = 100;
ID
----------
100
This query is in fact equivalent to:
select * from a where to_number(id) = 100;
It cannot use the index since we indexed id and not to_number(id). If we want to use the index we will have to be explicit:
select * from a where id = '100';
In reply to pakr's comment:
There are lots of rules concerning implicit conversions. One good place to start is the documentation. Among other things, we learn that:
During SELECT FROM operations, Oracle converts the data from the column to the type of the target variable.
It means that when implicit conversion occurs in a "WHERE column = variable" clause, Oracle will convert the datatype of the column and NOT of the variable, thereby preventing an index from being used. This is why you should always use the right datatypes or explicitly convert the variable.
From the Oracle doc:
Oracle recommends that you specify explicit conversions, rather than rely on implicit or automatic conversions, for these reasons:
SQL statements are easier to understand when you use explicit datatype conversion functions.
Implicit datatype conversion can have a negative impact on performance, especially if the datatype of a column value is converted to that of a constant rather than the other way around.
Implicit conversion depends on the context in which it occurs and may not work the same way in every case. For example, implicit conversion from a datetime value to a VARCHAR2 value may return an unexpected year depending on the value of the NLS_DATE_FORMAT parameter.
Algorithms for implicit conversion are subject to change across software releases and among Oracle products. Behavior of explicit conversions is more predictable.
Make your condition sargable, that is, compare the field itself to a constant expression.
This is bad:
SELECT *
FROM mytable
WHERE TRUNC(date_col) = TO_DATE('2009.07.21', 'YYYY.MM.DD')
, since it cannot use the index. Oracle cannot reverse the TRUNC() function to get the range bounds.
This is good:
SELECT *
FROM mytable
WHERE date_col >= TO_DATE('2009.07.21', 'YYYY.MM.DD')
AND date_col < TO_DATE('2009.07.22', 'YYYY.MM.DD')
To get rid of implicit conversion, well, use explicit conversion:
This is bad:
SELECT *
FROM mytable
WHERE guid = '794AB5396AE5473DA75A9BF8C4AA1F74'
-- This uses implicit conversion. In fact this is RAWTOHEX(guid) = '794AB5396AE5473DA75A9BF8C4AA1F74'
This is good:
SELECT *
FROM mytable
WHERE guid = HEXTORAW('794AB5396AE5473DA75A9BF8C4AA1F74')
Update:
This query:
SELECT unique_key
FROM src_table
WHERE natural_key1 = :1
AND natural_key2 = :2
AND natural_key3 = :3
heavily depends on the type of your fields.
Explicitly cast your variables to the field type, for example from string.
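E.g. (same columns as the question, Oracle syntax), casting on the variable side keeps the index on natural_key1 usable:
SELECT unique_key
FROM   src_table
WHERE  natural_key1 = TO_CHAR(:1)   -- convert the bind, not the column
AND    natural_key2 = :2
AND    natural_key3 = :3;
The cleaner fix, of course, is to bind the correct type from Java in the first place (ps.setString rather than ps.setLong).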
You could use a function-based index.
Your query is:
select
unique_key
from
src_table
where
natural_key1 = :1
In your case the index isn't being used because natural_key1 is a varchar2 and :1 is a number. Oracle is converting your query to:
select
unique_key
from
src_table
where
to_number(natural_key1) = :1
So... put on an index for to_number(natural_key1):
create index ix_src_table_fnk1 on src_table(to_number(natural_key1));
Your query will now use the ix_src_table_fnk1 index.
Of course, better to get your Java programmers to do it properly in the first place.
What happens to your query if you run it with an explicit conversion around the argument (e.g., to_char(:1) or to_number(:1) as appropriate)? If doing so makes your query run fast, you have your answer.
However, if your query still runs slow with the explicit conversion, there may be another issue. You don't mention which version of Oracle you're running; if your high-cardinality column (natural_key1) has values with a very skewed distribution, you may be using a query plan that was generated when the query was first run with an unfavorable value for :1.
For example, if your table of 1 million rows had 400,000 rows with natural_key1 = 1234, and the remaining 600,000 were unique (or nearly so), the optimizer would not choose the index if your query constrained on natural_key1 = 1234. Since you're using bind variables, if that was the first time you ran the query, the optimizer would choose that plan for all subsequent runs.
One way to test this theory would be to run this command before running your test statement:
alter system flush shared_pool;
This will remove all query plans from the optimizer's brain, so the next statement run will be optimized fresh. Alternatively, you could run the statement as straight SQL with literals, no bind variables. If it ran well in either case, you'd know your problem was due to plan corruption.
If that is the case, you don't want to use that alter system command in production - you'll probably ruin the rest of your system's performance if you run it regularly, but you could get around it by using dynamic sql instead of bind variables, or if it is possible to determine ahead of time that :1 is non-selective, use a slightly different query for the nonselective cases (such as re-ordering the conditions in the WHERE clause, which will cause the optimizer to use a different plan).
Finally, you can try adding an index hint to your query, e.g.:
SELECT /*+ INDEX(src_table,<name of index for natural_key1>) */
unique_key
FROM src_table
WHERE natural_key1 = :1
AND natural_key2 = :2
AND natural_key3 = :3;
I'm not a big fan of index hints - they're a pretty fragile method of programming. If the name changed on the index down the road, you'd never know it until your query started to perform poorly, plus you're potentially shooting yourself in the foot if server upgrades or data distribution changes result in the optimizer being able to choose an even better plan.