DBMSes only allow values as parameters for prepared statements; table, column, and field names are not allowed. For example:
String sql = "Select * from TABLE1 order by ?";
PreparedStatement st = conn.prepareStatement(sql);
st.setString(1, "column_name_1");
Such a statement is not allowed. What is the reason DBMSes do not support field names as parameters in prepared statements?
There are basically two reasons that I am aware of:
1. Although details vary per database system, conceptually, when a statement is prepared, it is compiled and checked for correctness (do all tables and columns exist?). The server generates a plan which describes which tables to access, which fields to retrieve, which indexes to use, etc.
This means that at prepare time the database system must know which tables and fields it needs to access, therefore parameterization of tables and fields is not possible. And even if it were technically possible, it would be inefficient, because statement compilation would need to be deferred until execution, essentially throwing away one of the primary reasons for using prepared statements: reusable query plans for better performance.
And consider this: if table names and field names were allowed to be parameterized, why not function names, query fragments, etc.?
2. Not allowing parameterization of objects prevents ambiguity. For example, in a query with where column1 = ?, if you set the parameter to peter, would that be a column name or a string value? That is hard to decide, and resolving that ambiguity would make the API harder to use, while the use case for allowing such parameterization is almost non-existent (and in my experience, the need for such parameterization almost always stems from bad database design).
Allowing parameterization of objects is almost equivalent to just dynamically generating the query and executing it (see also point 1), so why not forgo the additional complexity and disallow parameterization of objects instead?
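If the sort column really has to vary, the usual workaround is the dynamic generation mentioned above, with the identifier checked against a fixed list before it is spliced into the query text. Here is a minimal sketch of that idea; the class name OrderByBuilder, the method selectOrderedBy, and the second column name are made up for illustration and are not part of JDBC:

```java
import java.util.Set;

// Hypothetical helper: class, method, and column names are illustrative only.
final class OrderByBuilder {
    // Columns the application is willing to sort by (assumed names).
    private static final Set<String> SORTABLE = Set.of("column_name_1", "column_name_2");

    // Returns the query text with a validated column name spliced in.
    // Values would still be bound with '?' placeholders as usual.
    static String selectOrderedBy(String column) {
        if (!SORTABLE.contains(column)) {
            throw new IllegalArgumentException("Unsupported sort column: " + column);
        }
        return "SELECT * FROM TABLE1 ORDER BY " + column;
    }
}
```

The resulting string can then be passed to conn.prepareStatement(...). Each distinct column produces a distinct statement text, so each one gets its own plan, which is exactly the trade-off described in point 1.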
I'm using MSSQL 2012.
To handle special characters when searching through LINQ, I found that I could change the collation of the column to *_CI_AI, but before changing it I would like to know what it impacts and where.
This might not be so easy...
If this column takes part in indexes and constraints you will have to drop them, change the collation and recreate them.
One very painful point with collations is the fact that tempdb uses, by default, the default collation of the server instance. We once had a project where, after such a step, certain statements ran into errors. This happened when a stored procedure created a #table and used such a column in any kind of comparison (in a WHERE or JOIN predicate).
You can specify the collation manually in any statement with a COLLATE clause, so it will be possible to get things working, but the effort might be huge...
Some related answers:
https://stackoverflow.com/a/39101572/5089204
https://stackoverflow.com/a/35840417/5089204
UPDATE: a list of effects / impacts
Sorting might change (a sorted list could appear in a different order).
Comparisons will be less restrictive with _CI_AI: "Peter" equals "peter". Sometimes this is OK (most of the time actually), but not always (imagine a password). In cases where "Pétè" should be the same as "Pete" this helps...
Joins on string columns might match differently (a problem if ProductCode "aBx5" is not the same code as "ABx5").
Check constraints might be less restrictive (you force values "A", "B" or "C" and suddenly you may insert "a", "b" and "c"...).
You might run (this can be very annoying!) into collation errors in connection with temp tables. This can break existing code...
With simple text columns this should not be a problem...
I am new to PostgreSQL and I am trying to figure out some details about stored procedures (which I think are actually called functions in pgsql) when used in a multiple-schema environment.
The application I have in mind involves a multi-tenant DB design where one schema is used for each tenant and all schemata, which have the same table structure and names, are part of the same database. As far as I know from DBs in general, stored procedures/functions are pre-compiled and therefore faster, so I would like to use them for performing operations on each schema's tables by sending the required parameters from the application server instead of sending a list of SQL commands. In addition, I would like to have a SINGLE set of functions that implement all the SELECT (including JOIN type), INSERT, UPDATE, etc. operations on the tables of each schema. This will make it easy to change each function and avoid SQL code replication and redundancy. As I found out, it is possible to create a set of functions in a schema s0 and then create s1, s2, ... schemata (all having the same tables) that use these functions.
For example, I can create a template schema named s0 (identical to all others) and create a SQL or PL/pgSQL function that belongs to this schema and contains operations on the schema's tables. In this function, the table names are written without the schema prefix, i.e.
first_table and not s0.first_table
An example function could be:
CREATE FUNCTION sel() RETURNS BIGINT
AS 'SELECT count(a) from first_table;'
LANGUAGE SQL;
As I have tested, this function works well. If I move to schema s1 by entering:
set search_path to s1;
and then call the function again, the function acts upon s1 schema's identically named table first_table.
The function could also take a path parameter, so that it can be called with a schema name, and include a command to change the search_path, similar to this:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
SELECT set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table;
$$ LANGUAGE sql;
as shown in the proposed solution in PostgreSQL: how do I set the search_path from inside a function?
However, when I tried this and called the function for a schema, I noticed that the second SELECT of the function was executed before the first, which led to executing the second SELECT on the wrong schema! This was really unexpected. Does anybody know the explanation for this behavior?
In order to bypass this issue, I created a plpgsql function that does the same thing and it worked without any execution order issues:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
DECLARE result BIGINT;
BEGIN
PERFORM set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table INTO result;
RETURN result;
END
$$ LANGUAGE plpgsql;
So, now some questions about performance this time:
1) Apart from a) having the selection of schema to operate and the specified operation on the schema in one transaction which is necessary for my multi-tenant implementation, and b) teaming together SQL commands and avoiding some extra data exchange between the application server and the DB server which is beneficial, do the Postgresql functions have any performance benefits over executing the same code in separate SQL commands?
2) In the described multi-tenant scenario with many schemata and one DB, does a function that is defined once and called for any schema identical to the one in which it is defined lose any of its performance benefits (if any)?
3) Is there any difference in performance between an SQL function and a PL/pgSQL function that contains the same operations?
Before I answer your questions, a remark about your SQL function.
It does not fail because the statements are executed in a wrong order, but because both queries are parsed before the first one is executed. The error message you get is somewhat like
ERROR: relation "first_table" does not exist
[...]
CONTEXT: SQL function "doboth" during startup
Note the “during startup”.
Answers
1) You may experience a slight performance boost, particularly if the SQL statements are complicated, because the plans of SQL statements in a PL/pgSQL function are cached for the duration of a database session or until they are invalidated.
If the plan for the query is cached by the PL/pgSQL function, but the SQL statement calling the function has to be planned every time, you might actually be worse off from a performance angle because of the overhead of executing the function.
2) Whenever you call the function with a different schema name, the query plan will be invalidated and has to be created anew. So if you change the schema name for every invocation, you won't gain anything.
3) SQL functions don't cache query plans, so they don't perform better than the plain SQL query.
Note, however, that the gains from caching simple SQL statements in functions are not tremendous.
Use functions that just act as containers for SQL statements only if it makes life simpler for you, otherwise use plain SQL.
Do not focus only on performance during design, but on a good architecture and a simple design.
If the same statements keep repeating over and over, you might gain more performance using prepared statements than using functions.
Firstly, I do not really believe there can be any issues with line execution order in functions. If you have any issues, it's your code not working, not Postgres.
Secondly, multi-tenant behavior is well implemented with set search_path to s1, s0;. There is usually no need for switching anything inside procedures.
Thirdly, there are no performance benefits in using stored procedures except for minimizing data flows between DB and the application. If you consider a query like SELECT count(*) FROM mytable WHERE somecolumn = $1 there is absolutely nothing you can optimize before you know the value of $1.
And finally, no, there is no significant difference between functions in SQL and PL/pgSQL. Most of the time is still consumed by reading through tables, so focus on perfecting that.
Hope that clarifies the situation. Also, you may want to consider the security benefits of stored procedures. Just a hint.
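For the search_path approach mentioned above, the application server has to turn a tenant identifier into a search_path value. Since a schema name is an identifier, it is worth validating it on the client before using it. A rough sketch follows; the class TenantSearchPath, the method forTenant, and the naming rule (lower-case letters, digits, underscore) are assumptions for illustration, not anything PostgreSQL prescribes:

```java
import java.util.regex.Pattern;

// Hypothetical helper for building a search_path value per tenant.
final class TenantSearchPath {
    // Assumed naming rule: lower-case letters, digits and underscores,
    // not starting with a digit. Adjust to your own schema naming scheme.
    private static final Pattern SCHEMA_NAME = Pattern.compile("[a-z_][a-z0-9_]*");

    // Builds a value such as "s1, s0": tenant schema first, shared schema second.
    static String forTenant(String tenantSchema, String sharedSchema) {
        for (String schema : new String[] {tenantSchema, sharedSchema}) {
            if (schema == null || !SCHEMA_NAME.matcher(schema).matches()) {
                throw new IllegalArgumentException("Invalid schema name: " + schema);
            }
        }
        return tenantSchema + ", " + sharedSchema;
    }
}
```

Because set_config takes its new value as an ordinary text argument, the result could be bound as a normal parameter to SELECT set_config('search_path', ?, false), the same function the question already uses, instead of being concatenated into a SET statement.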
I am creating a Java function that needs to use a SQL query with a lot of joins before doing a full scan of its result. Instead of hard-coding a lot of joins I decided to create a view with this complex query. Then the Java function just uses the following query to get this result:
SELECT * FROM VW_####
So the program is working fine, but I want to make it faster since this SELECT command is taking a lot of time. After taking a look at its execution plan I created some indexes and made it about 30% faster, but I want to make it faster still.
The problem is that every operation in the execution plan has a cost between 0% and 4% except one operation, a clustered-index insert that has about 50% of the execution cost. I think that the system is using a temporary table to store the view's data, but an index on this view isn't useful for me because I need all rows from it.
So what can I do to optimize that insert into the CWT_PrimaryKey? I think that I can't turn off that index because it seems to be part of SQL Server's internals. I read somewhere that this operation could appear when you use cursors, but I don't think I am using one (or does the view use one?).
The command to create the view is something simple (no T-SQL, no OPTION, etc) like:
create view VW_#### as SELECTS AND JOINS HERE
And here is a picture of the problematic part from the execution plan: http://imgur.com/PO0ZnBU
EDIT: More details:
Well, the query to create the problematic view is a big query that joins a lot of tables. Based on a single parameter, the Java client modifies the query string before creating it. This view represents a "data unit" from a legacy database migrated to SQL Server that didn't have any foreign or primary keys, so our team chose to follow this strategy. Because of that the view has more than 50 columns and it is made from the join of seven other views.
Main view's query (with a lot of Portuguese words): http://pastebin.com/Jh5vQxzA
The other views (from VW_Sintese1 until VW_Sintese7) are created like this one but without using extra views, they just use joins with the tables that contain the data requested by the main view.
Then the Java client creates a PreparedStatement with the query "Select * from VW_Sintese####" and executes it using the method executeQuery, something like:
String query = "Select * from VW_Sintese####";
PreparedStatement ps = myConn.prepareStatement(query,ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = ps.executeQuery();
And then the program goes on until the end.
Thanks for the attention.
First: you should post the code of the view along with whatever is using the view, for the reasons given in the rest of this answer.
Second: the definition of a view in SQL Server is substituted into any query that references it. In other words, you created a view, but since (I'm assuming) it isn't an indexed view, it is the same as writing the original, long SELECT statement. SQL Server kind of just swaps it out in the DML statement.
From Microsoft's 'Querying Microsoft SQL Server 2012': T-SQL supports the following table expressions: derived tables, common table expressions (CTEs), views, inline table-valued functions.
And a direct quote:
It’s important to note that, from a performance standpoint, when SQL Server optimizes queries involving table expressions, it first unnests the table expression’s logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression’s result in an internal work table and then interact with that work table. This means that table expressions don’t have a performance side to them—neither good nor bad—just no side.
This is a long way of reinforcing the first statement: please include the SQL code in the view and what you're actually using as the SELECT statement. Otherwise, we can't help much :) Cheers!
Edit: Okay, so you've created a view (no performance gain there) that does 4-5 LEFT JOIN on to the main view (again, you're not helping yourself out much here by eliminating rows, etc.). If there are search arguments you can use to filter down the resultset to fewer rows, you should have those in here. And lastly, you're ordering all of this at the top, so your query engine will have to get those views, join them up to a massive SELECT statement, figure out the correct order, and (I'm guessing here) the result count is HUGE and SQL's db engine is ordering it in some kind of temporary table.
The short answer: get less data (fewer columns and only the rows you need); don't order the results if the resultset is very large, just get the data to the client and then sort it there.
Again, if you want more help, you'll need to post table schemas and index strategies for all tables that are in the query (including the views that are joined) and you'll need to include all view definitions (including the views that are joined).
I would like to know how comparisons for the IN clause work in a DB. In this case, I am interested in SQL Server and Oracle.
I thought of two comparison models: binary search and hashing. Can someone tell me which method SQL Server follows?
SQL Server's IN clause is basically shorthand for a wordier WHERE clause.
...WHERE column IN (1,2,3,4)
is shorthand for
...WHERE Column = 1
OR Column = 2
OR column = 3
OR column = 4
AFAIK there is no other logic applied that would be different from a standard WHERE clause.
It depends on the query plan the optimizer chooses.
If there is a unique index on the column you're comparing against and you are providing relatively few values in the IN list in comparison to the number of rows in the table, it's likely that the optimizer would choose to probe the index to find out the handful of rows in the table that needed to be examined. If, on the other hand, the IN clause is a query that returns a relatively large number of rows in comparison to the number of rows in the table, it is likely that the optimizer would choose to do some sort of join using one of the many join methods the database engine understands. If the IN list is relatively non-selective (i.e. something like GENDER IN ('Male','Female')), the optimizer may choose to do a simple string comparison for each row as a final processing step.
And, of course, different versions of each database with different statistics may choose different query plans that result in different algorithms to evaluate the same IN list.
IN is the same as EXISTS in SQL Server usually. They will give a similar plan.
Saying that, IN is shorthand for OR..OR as JNK mentioned.
For more than you possibly ever needed to know, see Quassnoi's blog entry
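FYI: The OR shorthand leads to another important difference: NOT IN behaves very differently from NOT EXISTS / OUTER JOIN, because NOT IN fails on NULLs in the list. A NULL in the list makes each comparison against it UNKNOWN, so the NOT IN predicate can never evaluate to true and no rows are returned.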
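To see why a NULL in the list breaks NOT IN, it helps to expand it the same way as IN and follow SQL's three-valued logic by hand. The following is only a small model of those rules, not anything the database exposes; the class ThreeValuedLogic and its methods are invented for this illustration, and a Java null stands in for SQL UNKNOWN / NULL:

```java
import java.util.List;

// Toy model of SQL three-valued logic: TRUE, FALSE, or null (UNKNOWN).
final class ThreeValuedLogic {
    // SQL equality: a NULL on either side yields UNKNOWN.
    static Boolean eq(Integer a, Integer b) {
        if (a == null || b == null) return null;
        return a.equals(b);
    }

    // SQL OR: TRUE wins; otherwise UNKNOWN if either side is UNKNOWN.
    static Boolean or(Boolean a, Boolean b) {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) return true;
        if (a == null || b == null) return null;
        return false;
    }

    // SQL NOT: UNKNOWN stays UNKNOWN.
    static Boolean not(Boolean a) {
        return a == null ? null : !a;
    }

    // x IN (v1, v2, ...) expands to x = v1 OR x = v2 OR ...
    static Boolean in(Integer x, List<Integer> values) {
        Boolean result = false;
        for (Integer v : values) {
            result = or(result, eq(x, v));
        }
        return result;
    }

    // x NOT IN (...) is NOT (x IN (...)).
    static Boolean notIn(Integer x, List<Integer> values) {
        return not(in(x, values));
    }
}
```

With this model, notIn(3, Arrays.asList(1, 2)) is TRUE, but notIn(3, Arrays.asList(1, null)) is UNKNOWN (null), and a WHERE clause only keeps rows for which the predicate is TRUE.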
I have seen several patterns used to 'overcome' the lack of constants in SQL Server, but none of them seem to satisfy both performance and readability / maintainability concerns.
In the below example, assuming that we have an integral 'status' classification on our table, the options seem to be:
Just to hard code it, and possibly just 'comment' the status
-- StatusId 87 = Loaded
SELECT ... FROM [Table] WHERE StatusId = 87;
Using a lookup table for states, and then joining to this table so that the WHERE clause references the friendly name.
SubQuery:
SELECT ...
FROM [Table]
WHERE
StatusId = (SELECT StatusId FROM TableStatus WHERE StatusName = 'Loaded');
or joined
SELECT ...
FROM [Table] t INNER JOIN TableStatus ts On t.StatusId = ts.StatusId
WHERE ts.StatusName = 'Loaded';
A bunch of scalar UDF's defined which return constants, viz
CREATE Function LoadedStatus()
RETURNS INT
AS
BEGIN
RETURN 87
END;
and then
SELECT ... FROM [Table] WHERE StatusId = dbo.LoadedStatus(); -- scalar UDFs need a schema-qualified name in SQL Server (dbo assumed here)
(IMO this causes a lot of pollution in the database - this might be OK in an Oracle package wrapper)
And similar patterns with Table Valued Functions holding the constants with values as rows or columns, which are CROSS APPLIED back to [Table]
How have other SO users solved this common issue?
Edit : Bounty - Does anyone have a best practice method for maintaining $(variables) in DBProj DDL / Schema scripts as per Remus answer and comment?
Hard coded. With SQL, performance trumps maintainability.
The consequences in the execution plan between using a constant that the optimizer can inspect at plan generation time vs. using any form of indirection (UDF, JOIN, sub-query) are often dramatic. SQL 'compilation' is an extraordinary process (in the sense that it is not 'ordinary' like, say, IL code generation) in that the result is determined not only by the language construct being compiled (i.e. the actual text of the query) but also by the data schema (existing indexes) and the actual data in those indexes (statistics). When a hard-coded value is used, the optimizer can give a better plan because it can actually check the value against the index statistics and get an estimate of the result.
Another consideration is that a SQL application is not code only, but by a large margin is code and data. 'Refactoring' a SQL program is ... different. Where in a C# program one can change a constant or enum, recompile and happily run the application, in SQL one cannot do so because the value is likely present in millions of records in the database and changing the constant value implies also changing GBs of data, often online while new operations occur.
Just because the value is hard-coded in the queries and procedures seen by the server does not necessarily mean the value has to be hard coded in the original project source code. There are various code generation tools that can take care of this. Consider something as trivial as leveraging the sqlcmd scripting variables:
defines.sql:
:setvar STATUS_LOADED 87
somesource.sql:
:r defines.sql
SELECT ... FROM [Table] WHERE StatusId = $(STATUS_LOADED);
someothersource.sql:
:r defines.sql
UPDATE [Table] SET StatusId = $(STATUS_LOADED) WHERE ...;
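If the scripts are deployed from application code rather than through sqlcmd, the same substitution could be done in a small build or deployment step. This is only a rough sketch of that idea: the class SqlConstants and the method expand are made up, and it handles nothing beyond plain $(NAME) tokens, so it is not a reimplementation of sqlcmd:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessor that inlines named constants into SQL text.
final class SqlConstants {
    // Matches tokens of the form $(NAME).
    private static final Pattern VARIABLE = Pattern.compile("\\$\\((\\w+)\\)");

    // Replaces every $(NAME) with its defined value; fails on unknown names
    // so a typo never reaches the server as literal text.
    static String expand(String sql, Map<String, String> defines) {
        Matcher m = VARIABLE.matcher(sql);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = defines.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Undefined variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling expand("... WHERE StatusId = $(STATUS_LOADED);", Map.of("STATUS_LOADED", "87")) yields "... WHERE StatusId = 87;", so the server still sees a literal it can check against statistics while the source keeps the readable name.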
While I agree with Remus Rusanu, IMO, maintainability of the code (and thus readability, least astonishment, etc.) trumps other concerns unless the performance difference is significant enough to warrant doing otherwise. Thus, the following query loses on readability:
Select ..
From Table
Where StatusId = 87
In general, when I have system dependent values which will be referenced in code (perhaps mimicked in an enumeration by name), I use string primary keys for the tables in which they are kept. Contrast this to user-changeable data in which I generally use surrogate keys. The use of a primary key that requires entry helps (albeit not perfectly) to indicate to other developers that this value is not meant to be arbitrary.
Thus, my "Status" table would look like:
Create Table Status
(
Code varchar(6) Not Null Primary Key
, ...
)
Select ...
From Table
Where StatusCode = 'Loaded'
This makes the query more readable, it does not require a join to the Status table, and it does not require the use of a magic number (or GUID). Using user-defined functions is, IMO, a bad practice. Beyond the performance implications, no developer would ever expect UDFs to be used in this manner, and thus it violates the principle of least astonishment. You would almost be compelled to have a UDF for each constant value; otherwise, what are you passing into the function: a name? A magic value? If a name, you might as well keep the name in a table and use it directly in the query. If a magic value, you are back to the original problem.
I have been using the scalar function option in our DB and it works fine; in my view it is the best solution.
If more values are related to one item, then make a lookup table. For example, if you load a combo box or any other control with static values, a lookup is the best way to do this.
You can also add more fields to your status table that act as unique markers or groupers for status values.
For example, if you add an isLoaded field to your status table, record 87 could be the only one with the field's value set, and you can test for the value of the isLoaded field instead of the hard-coded 87 or status description.