LINQ to SQL Take w/o Skip Causes Multiple SQL Statements - sql-server

I have a LINQ to SQL query:
from at in Context.Transaction
select new {
at.Amount,
at.PostingDate,
Details =
from tb in at.TransactionDetail
select new {
Amount = tb.Amount,
Description = tb.Desc
}
}
This results in one SQL statement being executed. All is good.
However, if I attempt to return known types from this query, even if they have the same structure as the anonymous types, I get one SQL statement executed for the top level and then an additional SQL statement for each "child" set.
Is there any way to get LINQ to SQL to issue one SQL statement and use known types?
EDIT: I must have another issue. When I plugged a very simplistic (but still hieararchical) version of my query into LINQPad and used freshly created known types with just 2 or 3 members, I did get one SQL statement. I will post and update when I know more.
EDIT 2: This appears to be due to a bug in Take. See my answer below for details.

First - some reasoning for the Take bug.
If you just Take, the query translator just uses top. Top10 will not give the right answer if cardinality is broken by joining in a child collection. So the query translator doesn't join in the child collection (instead it requeries for the children).
If you Skip and Take, then the query translator kicks in with some RowNumber logic over the parent rows... these rownumbers let it take 10 parents, even if that's really 50 records due to each parent having 5 children.
If you Skip(0) and Take, Skip is removed as a non-operation by the translator - it's just like you never said Skip.
This is going to be a hard conceptual leap to from where you are (calling Skip and Take) to a "simple workaround". What we need to do - is force the translation to occur at a point where the translator can't remove Skip(0) as a non-operation. We need to call Skip, and supply the skipped number at a later point.
DataClasses1DataContext myDC = new DataClasses1DataContext();
//setting up log so we can see what's going on
myDC.Log = Console.Out;
//hierarchical query - not important
var query = myDC.Options.Select(option => new{
ID = option.ParentID,
Others = myDC.Options.Select(option2 => new{
ID = option2.ParentID
})
});
//request translation of the query! Important!
var compQuery = System.Data.Linq.CompiledQuery
.Compile<DataClasses1DataContext, int, int, System.Collections.IEnumerable>
( (dc, skip, take) => query.Skip(skip).Take(take) );
//now run the query and specify that 0 rows are to be skipped.
compQuery.Invoke(myDC, 0, 10);
This produces the following query:
SELECT [t1].[ParentID], [t2].[ParentID] AS [ParentID2], (
SELECT COUNT(*)
FROM [dbo].[Option] AS [t3]
) AS [value]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[ID]) AS [ROW_NUMBER], [t0].[ParentID]
FROM [dbo].[Option] AS [t0]
) AS [t1]
LEFT OUTER JOIN [dbo].[Option] AS [t2] ON 1=1
WHERE [t1].[ROW_NUMBER] BETWEEN #p0 + 1 AND #p1 + #p2
ORDER BY [t1].[ROW_NUMBER], [t2].[ID]
-- #p0: Input Int (Size = 0; Prec = 0; Scale = 0) [0]
-- #p1: Input Int (Size = 0; Prec = 0; Scale = 0) [0]
-- #p2: Input Int (Size = 0; Prec = 0; Scale = 0) [10]
-- Context: SqlProvider(Sql2005) Model: AttributedMetaModel Build: 3.5.30729.1
And here's where we win!
WHERE [t1].[ROW_NUMBER] BETWEEN #p0 + 1 AND #p1 + #p2

I've now determined this is the result of a horrible bug. The anonymous versus known type turned out not to be the cause. The real cause is Take.
The following result in 1 SQL statement:
query.Skip(1).Take(10).ToList();
query.ToList();
However, the following exhibit the one sql statement per parent row problem.
query.Skip(0).Take(10).ToList();
query.Take(10).ToList();
Can anyone think of any simple workarounds for this?
EDIT: The only workaround I've come up with is to check to see if I'm on the first page (IE Skip(0)) and then make two calls, one with Take(1) and the other with Skip(1).Take(pageSize - 1) and addRange the lists together.

I've not had a chance to try this but given that the anonymous type isn't part of LINQ rather a C# construct I wonder if you could use:
from at in Context.Transaction
select new KnownType(
at.Amount,
at.PostingDate,
Details =
from tb in at.TransactionDetail
select KnownSubType(
Amount = tb.Amount,
Description = tb.Desc
)
}
Obviously Details would need to be an IEnumerable collection.
I could be miles wide on this but it might at least give you a new line of thought to pursue which can't hurt so please excuse my rambling.

Related

Is there any way to optimize this query?

I need to optimize the following query:
IF object_id('tempdb..#TAB001') IS NOT NULL
DROP TABLE #TAB001;
select *
into #TAB001
from dbo.uvw_TAB001
where 1 = 1
and isnull(COD_CUSTOMER,'') = isnull(#cod_customer,isnull(COD_CUSTOMER,''))
and isnull(TAXCODE,'') = isnull(#taxcode, isnull(TAXCODE,''))
and isnull(SURNAME,'') = isnull(#surname,isnull(SURNAME,''))
and isnull(VATCODE,'') = isnull(#vatCode,isnull(VATCODE,''))
The goal is to improve the performance of this query.
It is currently quite fast but I would like to speed it up even more.
This query has the optional parameters for which it is necessary to make a query that regardless of whether all or 1 parameter is set, returns results in the shortest possible time.
What you have here is known as a "catch-all" or "kitchen sink" query, which need a little helping hand sometimes.
Firstly, you need to get rid of those ISNULLs; they are making the query non-SARGable. Also, I would suggest getting rid of the SELECT * and limiting the query to the columns you need.
Then, finally, we can add OPTION (RECOMPILE) to the query; why is discussed in the articles I linked above. This gives you the following:
SELECT * --Replace with Column Names
INTO #TAB001 --Do you actually need to do this?
FROM dbo.uvw_TAB001
--Removed WHERE 1 = 1 as it's always true, thus pointless
WHERE (COD_CUSTOMER = #cod_customer OR #cod_customer IS NULL)
AND (TAXCODE = #taxcode OR #taxcode IS NULL)
AND (SURNAME = #surname OR #surname IS NULL)
AND (VATCODE = #vatCode OR #vatCode IS NULL)
OPTION (RECOMPILE);
Note I am assuming that when a variable (for example #cod_customer) has the value NULL you mean that the variable should be "ignored" and not matched against NULL.
If you actually want {Column} = #{Variable} including NULL then use SQL with the format below instead:
({Column} = #{Variable} OR ({Column} IS NULL AND #{Variable} IS NULL))

SQL Server query using case statement IN Clause doesn't work [duplicate]

What are the best workarounds for using a SQL IN clause with instances of java.sql.PreparedStatement, which is not supported for multiple values due to SQL injection attack security issues: One ? placeholder represents one value, rather than a list of values.
Consider the following SQL statement:
SELECT my_column FROM my_table where search_column IN (?)
Using preparedStatement.setString( 1, "'A', 'B', 'C'" ); is essentially a non-working attempt at a workaround of the reasons for using ? in the first place.
What workarounds are available?
An analysis of the various options available, and the pros and cons of each is available in Jeanne Boyarsky's Batching Select Statements in JDBC entry on JavaRanch Journal.
The suggested options are:
Prepare SELECT my_column FROM my_table WHERE search_column = ?, execute it for each value and UNION the results client-side. Requires only one prepared statement. Slow and painful.
Prepare SELECT my_column FROM my_table WHERE search_column IN (?,?,?) and execute it. Requires one prepared statement per size-of-IN-list. Fast and obvious.
Prepare SELECT my_column FROM my_table WHERE search_column = ? ; SELECT my_column FROM my_table WHERE search_column = ? ; ... and execute it. [Or use UNION ALL in place of those semicolons. --ed] Requires one prepared statement per size-of-IN-list. Stupidly slow, strictly worse than WHERE search_column IN (?,?,?), so I don't know why the blogger even suggested it.
Use a stored procedure to construct the result set.
Prepare N different size-of-IN-list queries; say, with 2, 10, and 50 values. To search for an IN-list with 6 different values, populate the size-10 query so that it looks like SELECT my_column FROM my_table WHERE search_column IN (1,2,3,4,5,6,6,6,6,6). Any decent server will optimize out the duplicate values before running the query.
None of these options are ideal.
The best option if you are using JDBC4 and a server that supports x = ANY(y), is to use PreparedStatement.setArray as described in Boris's anwser.
There doesn't seem to be any way to make setArray work with IN-lists, though.
Sometimes SQL statements are loaded at runtime (e.g., from a properties file) but require a variable number of parameters. In such cases, first define the query:
query=SELECT * FROM table t WHERE t.column IN (?)
Next, load the query. Then determine the number of parameters prior to running it. Once the parameter count is known, run:
sql = any( sql, count );
For example:
/**
* Converts a SQL statement containing exactly one IN clause to an IN clause
* using multiple comma-delimited parameters.
*
* #param sql The SQL statement string with one IN clause.
* #param params The number of parameters the SQL statement requires.
* #return The SQL statement with (?) replaced with multiple parameter
* placeholders.
*/
public static String any(String sql, final int params) {
// Create a comma-delimited list based on the number of parameters.
final StringBuilder sb = new StringBuilder(
String.join(", ", Collections.nCopies(possibleValue.size(), "?")));
// For more than 1 parameter, replace the single parameter with
// multiple parameter placeholders.
if (sb.length() > 1) {
sql = sql.replace("(?)", "(" + sb + ")");
}
// Return the modified comma-delimited list of parameters.
return sql;
}
For certain databases where passing an array via the JDBC 4 specification is unsupported, this method can facilitate transforming the slow = ? into the faster IN (?) clause condition, which can then be expanded by calling the any method.
Solution for PostgreSQL:
final PreparedStatement statement = connection.prepareStatement(
"SELECT my_column FROM my_table where search_column = ANY (?)"
);
final String[] values = getValues();
statement.setArray(1, connection.createArrayOf("text", values));
try (ResultSet rs = statement.executeQuery()) {
while(rs.next()) {
// do some...
}
}
or
final PreparedStatement statement = connection.prepareStatement(
"SELECT my_column FROM my_table " +
"where search_column IN (SELECT * FROM unnest(?))"
);
final String[] values = getValues();
statement.setArray(1, connection.createArrayOf("text", values));
try (ResultSet rs = statement.executeQuery()) {
while(rs.next()) {
// do some...
}
}
No simple way AFAIK.
If the target is to keep statement cache ratio high (i.e to not create a statement per every parameter count), you may do the following:
create a statement with a few (e.g. 10) parameters:
... WHERE A IN (?,?,?,?,?,?,?,?,?,?) ...
Bind all actuall parameters
setString(1,"foo");
setString(2,"bar");
Bind the rest as NULL
setNull(3,Types.VARCHAR)
...
setNull(10,Types.VARCHAR)
NULL never matches anything, so it gets optimized out by the SQL plan builder.
The logic is easy to automate when you pass a List into a DAO function:
while( i < param.size() ) {
ps.setString(i+1,param.get(i));
i++;
}
while( i < MAX_PARAMS ) {
ps.setNull(i+1,Types.VARCHAR);
i++;
}
You can use Collections.nCopies to generate a collection of placeholders and join them using String.join:
List<String> params = getParams();
String placeHolders = String.join(",", Collections.nCopies(params.size(), "?"));
String sql = "select * from your_table where some_column in (" + placeHolders + ")";
try ( Connection connection = getConnection();
PreparedStatement ps = connection.prepareStatement(sql)) {
int i = 1;
for (String param : params) {
ps.setString(i++, param);
}
/*
* Execute query/do stuff
*/
}
An unpleasant work-around, but certainly feasible is to use a nested query. Create a temporary table MYVALUES with a column in it. Insert your list of values into the MYVALUES table. Then execute
select my_column from my_table where search_column in ( SELECT value FROM MYVALUES )
Ugly, but a viable alternative if your list of values is very large.
This technique has the added advantage of potentially better query plans from the optimizer (check a page for multiple values, tablescan only once instead once per value, etc) may save on overhead if your database doesn't cache prepared statements. Your "INSERTS" would need to be done in batch and the MYVALUES table may need to be tweaked to have minimal locking or other high-overhead protections.
Limitations of the in() operator is the root of all evil.
It works for trivial cases, and you can extend it with "automatic generation of the prepared statement" however it is always having its limits.
if you're creating a statement with variable number of parameters, that will make an sql parse overhead at each call
on many platforms, the number of parameters of in() operator are limited
on all platforms, total SQL text size is limited, making impossible for sending down 2000 placeholders for the in params
sending down bind variables of 1000-10k is not possible, as the JDBC driver is having its limitations
The in() approach can be good enough for some cases, but not rocket proof :)
The rocket-proof solution is to pass the arbitrary number of parameters in a separate call (by passing a clob of params, for example), and then have a view (or any other way) to represent them in SQL and use in your where criteria.
A brute-force variant is here http://tkyte.blogspot.hu/2006/06/varying-in-lists.html
However if you can use PL/SQL, this mess can become pretty neat.
function getCustomers(in_customerIdList clob) return sys_refcursor is
begin
aux_in_list.parse(in_customerIdList);
open res for
select *
from customer c,
in_list v
where c.customer_id=v.token;
return res;
end;
Then you can pass arbitrary number of comma separated customer ids in the parameter, and:
will get no parse delay, as the SQL for select is stable
no pipelined functions complexity - it is just one query
the SQL is using a simple join, instead of an IN operator, which is quite fast
after all, it is a good rule of thumb of not hitting the database with any plain select or DML, since it is Oracle, which offers lightyears of more than MySQL or similar simple database engines. PL/SQL allows you to hide the storage model from your application domain model in an effective way.
The trick here is:
we need a call which accepts the long string, and store somewhere where the db session can access to it (e.g. simple package variable, or dbms_session.set_context)
then we need a view which can parse this to rows
and then you have a view which contains the ids you're querying, so all you need is a simple join to the table queried.
The view looks like:
create or replace view in_list
as
select
trim( substr (txt,
instr (txt, ',', 1, level ) + 1,
instr (txt, ',', 1, level+1)
- instr (txt, ',', 1, level) -1 ) ) as token
from (select ','||aux_in_list.getpayload||',' txt from dual)
connect by level <= length(aux_in_list.getpayload)-length(replace(aux_in_list.getpayload,',',''))+1
where aux_in_list.getpayload refers to the original input string.
A possible approach would be to pass pl/sql arrays (supported by Oracle only), however you can't use those in pure SQL, therefore a conversion step is always needed. The conversion can not be done in SQL, so after all, passing a clob with all parameters in string and converting it witin a view is the most efficient solution.
Here's how I solved it in my own application. Ideally, you should use a StringBuilder instead of using + for Strings.
String inParenthesis = "(?";
for(int i = 1;i < myList.size();i++) {
inParenthesis += ", ?";
}
inParenthesis += ")";
try(PreparedStatement statement = SQLite.connection.prepareStatement(
String.format("UPDATE table SET value='WINNER' WHERE startTime=? AND name=? AND traderIdx=? AND someValue IN %s", inParenthesis))) {
int x = 1;
statement.setLong(x++, race.startTime);
statement.setString(x++, race.name);
statement.setInt(x++, traderIdx);
for(String str : race.betFair.winners) {
statement.setString(x++, str);
}
int effected = statement.executeUpdate();
}
Using a variable like x above instead of concrete numbers helps a lot if you decide to change the query at a later time.
I've never tried it, but would .setArray() do what you're looking for?
Update: Evidently not. setArray only seems to work with a java.sql.Array that comes from an ARRAY column that you've retrieved from a previous query, or a subquery with an ARRAY column.
My workaround is:
create or replace type split_tbl as table of varchar(32767);
/
create or replace function split
(
p_list varchar2,
p_del varchar2 := ','
) return split_tbl pipelined
is
l_idx pls_integer;
l_list varchar2(32767) := p_list;
l_value varchar2(32767);
begin
loop
l_idx := instr(l_list,p_del);
if l_idx > 0 then
pipe row(substr(l_list,1,l_idx-1));
l_list := substr(l_list,l_idx+length(p_del));
else
pipe row(l_list);
exit;
end if;
end loop;
return;
end split;
/
Now you can use one variable to obtain some values in a table:
select * from table(split('one,two,three'))
one
two
three
select * from TABLE1 where COL1 in (select * from table(split('value1,value2')))
value1 AAA
value2 BBB
So, the prepared statement could be:
"select * from TABLE where COL in (select * from table(split(?)))"
Regards,
Javier Ibanez
I suppose you could (using basic string manipulation) generate the query string in the PreparedStatement to have a number of ?'s matching the number of items in your list.
Of course if you're doing that you're just a step away from generating a giant chained OR in your query, but without having the right number of ? in the query string, I don't see how else you can work around this.
You could use setArray method as mentioned in this javadoc:
PreparedStatement statement = connection.prepareStatement("Select * from emp where field in (?)");
Array array = statement.getConnection().createArrayOf("VARCHAR", new Object[]{"E1", "E2","E3"});
statement.setArray(1, array);
ResultSet rs = statement.executeQuery();
Here's a complete solution in Java to create the prepared statement for you:
/*usage:
Util u = new Util(500); //500 items per bracket.
String sqlBefore = "select * from myTable where (";
List<Integer> values = new ArrayList<Integer>(Arrays.asList(1,2,4,5));
string sqlAfter = ") and foo = 'bar'";
PreparedStatement ps = u.prepareStatements(sqlBefore, values, sqlAfter, connection, "someId");
*/
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
public class Util {
private int numValuesInClause;
public Util(int numValuesInClause) {
super();
this.numValuesInClause = numValuesInClause;
}
public int getNumValuesInClause() {
return numValuesInClause;
}
public void setNumValuesInClause(int numValuesInClause) {
this.numValuesInClause = numValuesInClause;
}
/** Split a given list into a list of lists for the given size of numValuesInClause*/
public List<List<Integer>> splitList(
List<Integer> values) {
List<List<Integer>> newList = new ArrayList<List<Integer>>();
while (values.size() > numValuesInClause) {
List<Integer> sublist = values.subList(0,numValuesInClause);
List<Integer> values2 = values.subList(numValuesInClause, values.size());
values = values2;
newList.add( sublist);
}
newList.add(values);
return newList;
}
/**
* Generates a series of split out in clause statements.
* #param sqlBefore ""select * from dual where ("
* #param values [1,2,3,4,5,6,7,8,9,10]
* #param "sqlAfter ) and id = 5"
* #return "select * from dual where (id in (1,2,3) or id in (4,5,6) or id in (7,8,9) or id in (10)"
*/
public String genInClauseSql(String sqlBefore, List<Integer> values,
String sqlAfter, String identifier)
{
List<List<Integer>> newLists = splitList(values);
String stmt = sqlBefore;
/* now generate the in clause for each list */
int j = 0; /* keep track of list:newLists index */
for (List<Integer> list : newLists) {
stmt = stmt + identifier +" in (";
StringBuilder innerBuilder = new StringBuilder();
for (int i = 0; i < list.size(); i++) {
innerBuilder.append("?,");
}
String inClause = innerBuilder.deleteCharAt(
innerBuilder.length() - 1).toString();
stmt = stmt + inClause;
stmt = stmt + ")";
if (++j < newLists.size()) {
stmt = stmt + " OR ";
}
}
stmt = stmt + sqlAfter;
return stmt;
}
/**
* Method to convert your SQL and a list of ID into a safe prepared
* statements
*
* #throws SQLException
*/
public PreparedStatement prepareStatements(String sqlBefore,
ArrayList<Integer> values, String sqlAfter, Connection c, String identifier)
throws SQLException {
/* First split our potentially big list into lots of lists */
String stmt = genInClauseSql(sqlBefore, values, sqlAfter, identifier);
PreparedStatement ps = c.prepareStatement(stmt);
int i = 1;
for (int val : values)
{
ps.setInt(i++, val);
}
return ps;
}
}
Spring allows passing java.util.Lists to NamedParameterJdbcTemplate , which automates the generation of (?, ?, ?, ..., ?), as appropriate for the number of arguments.
For Oracle, this blog posting discusses the use of oracle.sql.ARRAY (Connection.createArrayOf doesn't work with Oracle). For this you have to modify your SQL statement:
SELECT my_column FROM my_table where search_column IN (select COLUMN_VALUE from table(?))
The oracle table function transforms the passed array into a table like value usable in the IN statement.
try using the instr function?
select my_column from my_table where instr(?, ','||search_column||',') > 0
then
ps.setString(1, ",A,B,C,");
Admittedly this is a bit of a dirty hack, but it does reduce the opportunities for sql injection. Works in oracle anyway.
Sormula supports SQL IN operator by allowing you to supply a java.util.Collection object as a parameter. It creates a prepared statement with a ? for each of the elements the collection. See Example 4 (SQL in example is a comment to clarify what is created but is not used by Sormula).
Generate the query string in the PreparedStatement to have a number of ?'s matching the number of items in your list. Here's an example:
public void myQuery(List<String> items, int other) {
...
String q4in = generateQsForIn(items.size());
String sql = "select * from stuff where foo in ( " + q4in + " ) and bar = ?";
PreparedStatement ps = connection.prepareStatement(sql);
int i = 1;
for (String item : items) {
ps.setString(i++, item);
}
ps.setInt(i++, other);
ResultSet rs = ps.executeQuery();
...
}
private String generateQsForIn(int numQs) {
String items = "";
for (int i = 0; i < numQs; i++) {
if (i != 0) items += ", ";
items += "?";
}
return items;
}
instead of using
SELECT my_column FROM my_table where search_column IN (?)
use the Sql Statement as
select id, name from users where id in (?, ?, ?)
and
preparedStatement.setString( 1, 'A');
preparedStatement.setString( 2,'B');
preparedStatement.setString( 3, 'C');
or use a stored procedure this would be the best solution, since the sql statements will be compiled and stored in DataBase server
I came across a number of limitations related to prepared statement:
The prepared statements are cached only inside the same session (Postgres), so it will really work only with connection pooling
A lot of different prepared statements as proposed by #BalusC may cause the cache to overfill and previously cached statements will be dropped
The query has to be optimized and use indices. Sounds obvious, however e.g. the ANY(ARRAY...) statement proposed by #Boris in one of the top answers cannot use indices and query will be slow despite caching
The prepared statement caches the query plan as well and the actual values of any parameters specified in the statement are unavailable.
Among the proposed solutions I would choose the one that doesn't decrease the query performance and makes the less number of queries. This will be the #4 (batching few queries) from the #Don link or specifying NULL values for unneeded '?' marks as proposed by #Vladimir Dyuzhev
SetArray is the best solution but its not available for many older drivers. The following workaround can be used in java8
String baseQuery ="SELECT my_column FROM my_table where search_column IN (%s)"
String markersString = inputArray.stream().map(e -> "?").collect(joining(","));
String sqlQuery = String.format(baseSQL, markersString);
//Now create Prepared Statement and use loop to Set entries
int index=1;
for (String input : inputArray) {
preparedStatement.setString(index++, input);
}
This solution is better than other ugly while loop solutions where the query string is built by manual iterations
I just worked out a PostgreSQL-specific option for this. It's a bit of a hack, and comes with its own pros and cons and limitations, but it seems to work and isn't limited to a specific development language, platform, or PG driver.
The trick of course is to find a way to pass an arbitrary length collection of values as a single parameter, and have the db recognize it as multiple values. The solution I have working is to construct a delimited string from the values in the collection, pass that string as a single parameter, and use string_to_array() with the requisite casting for PostgreSQL to properly make use of it.
So if you want to search for "foo", "blah", and "abc", you might concatenate them together into a single string as: 'foo,blah,abc'. Here's the straight SQL:
select column from table
where search_column = any (string_to_array('foo,blah,abc', ',')::text[]);
You would obviously change the explicit cast to whatever you wanted your resulting value array to be -- int, text, uuid, etc. And because the function is taking a single string value (or two I suppose, if you want to customize the delimiter as well), you can pass it as a parameter in a prepared statement:
select column from table
where search_column = any (string_to_array($1, ',')::text[]);
This is even flexible enough to support things like LIKE comparisons:
select column from table
where search_column like any (string_to_array('foo%,blah%,abc%', ',')::text[]);
Again, no question it's a hack, but it works and allows you to still use pre-compiled prepared statements that take *ahem* discrete parameters, with the accompanying security and (maybe) performance benefits. Is it advisable and actually performant? Naturally, it depends, as you've got string parsing and possibly casting going on before your query even runs. If you're expecting to send three, five, a few dozen values, sure, it's probably fine. A few thousand? Yeah, maybe not so much. YMMV, limitations and exclusions apply, no warranty express or implied.
But it works.
No one else seems to have suggested using an off-the-shelf query builder yet, like jOOQ or QueryDSL or even Criteria Query that manage dynamic IN lists out of the box, possibly including the management of all edge cases that may arise, such as:
Running into Oracle's maximum of 1000 elements per IN list (irrespective of the number of bind values)
Running into any driver's maximum number of bind values, which I've documented in this answer
Running into cursor cache contention problems because too many distinct SQL strings are "hard parsed" and execution plans cannot be cached anymore (jOOQ and since recently also Hibernate work around this by offering IN list padding)
(Disclaimer: I work for the company behind jOOQ)
Just for completeness: So long as the set of values is not too large, you could also simply string-construct a statement like
... WHERE tab.col = ? OR tab.col = ? OR tab.col = ?
which you could then pass to prepare(), and then use setXXX() in a loop to set all the values. This looks yucky, but many "big" commercial systems routinely do this kind of thing until they hit DB-specific limits, such as 32 KB (I think it is) for statements in Oracle.
Of course you need to ensure that the set will never be unreasonably large, or do error trapping in the event that it is.
Following Adam's idea. Make your prepared statement sort of select my_column from my_table where search_column in (#)
Create a String x and fill it with a number of "?,?,?" depending on your list of values
Then just change the # in the query for your new String x an populate
There are different alternative approaches that we can use for IN clause in PreparedStatement.
Using Single Queries - slowest performance and resource intensive
Using StoredProcedure - Fastest but database specific
Creating dynamic query for PreparedStatement - Good Performance but doesn't get benefit of caching and PreparedStatement is recompiled every time.
Use NULL in PreparedStatement queries - Optimal performance, works great when you know the limit of IN clause arguments. If there is no limit, then you can execute queries in batch.
Sample code snippet is;
int i = 1;
for(; i <=ids.length; i++){
ps.setInt(i, ids[i-1]);
}
//set null for remaining ones
for(; i<=PARAM_SIZE;i++){
ps.setNull(i, java.sql.Types.INTEGER);
}
You can check more details about these alternative approaches here.
For some situations regexp might help.
Here is an example I've checked on Oracle, and it works.
select * from my_table where REGEXP_LIKE (search_column, 'value1|value2')
But there is a number of drawbacks with it:
Any column it applied should be converted to varchar/char, at least implicitly.
Need to be careful with special characters.
It can slow down performance - in my case IN version uses index and range scan, and REGEXP version do full scan.
After examining various solutions in different forums and not finding a good solution, I feel the below hack I came up with, is the easiest to follow and code:
Example: Suppose you have multiple parameters to pass in the 'IN' clause. Just put a dummy String inside the 'IN' clause, say, "PARAM" do denote the list of parameters that will be coming in the place of this dummy String.
select * from TABLE_A where ATTR IN (PARAM);
You can collect all the parameters into a single String variable in your Java code. This can be done as follows:
String param1 = "X";
String param2 = "Y";
String param1 = param1.append(",").append(param2);
You can append all your parameters separated by commas into a single String variable, 'param1', in our case.
After collecting all the parameters into a single String you can just replace the dummy text in your query, i.e., "PARAM" in this case, with the parameter String, i.e., param1. Here is what you need to do:
String query = query.replaceFirst("PARAM",param1); where we have the value of query as
query = "select * from TABLE_A where ATTR IN (PARAM)";
You can now execute your query using the executeQuery() method. Just make sure that you don't have the word "PARAM" in your query anywhere. You can use a combination of special characters and alphabets instead of the word "PARAM" in order to make sure that there is no possibility of such a word coming in the query. Hope you got the solution.
Note: Though this is not a prepared query, it does the work that I wanted my code to do.
Just for completeness and because I did not see anyone else suggest it:
Before implementing any of the complicated suggestions above consider if SQL injection is indeed a problem in your scenario.
In many cases the value provided to IN (...) is a list of ids that have been generated in a way that you can be sure that no injection is possible... (e.g. the results of a previous select some_id from some_table where some_condition.)
If that is the case you might just concatenate this value and not use the services or the prepared statement for it or use them for other parameters of this query.
query="select f1,f2 from t1 where f3=? and f2 in (" + sListOfIds + ");";
PreparedStatement doesn't provide any good way to deal with SQL IN clause. Per http://www.javaranch.com/journal/200510/Journal200510.jsp#a2 "You can't substitute things that are meant to become part of the SQL statement. This is necessary because if the SQL itself can change, the driver can't precompile the statement. It also has the nice side effect of preventing SQL injection attacks." I ended up using following approach:
String query = "SELECT my_column FROM my_table where search_column IN ($searchColumns)";
query = query.replace("$searchColumns", "'A', 'B', 'C'");
Statement stmt = connection.createStatement();
boolean hasResults = stmt.execute(query);
do {
if (hasResults)
return stmt.getResultSet();
hasResults = stmt.getMoreResults();
} while (hasResults || stmt.getUpdateCount() != -1);
OK, so I couldn't remember exactly how (or where) I did this before so I came to stack overflow to quickly find the answer. I was surprised I couldn't.
So, how I got around the IN problem a long time ago was with a statement like this:
where myColumn in ( select regexp_substr(:myList,'[^,]+', 1, level) from dual connect by regexp_substr(:myList, '[^,]+', 1, level) is not null)
set the myList parameter as a comma delimited string: A,B,C,D...
Note: You have to set the parameter twice!
This is not the ideal practice, yet it's simple and works well for me most of the time.
where ? like concat( "%|", TABLE_ID , "|%" )
Then you pass through ? the IDs in this way: |1|,|2|,|3|,...|

Entity Framework updates with wrong values after insert

This issue is discovered because I have an object with a field calculated off the ID, which contains the ID as part of it with a prefix and a checksum digit. It is a requirement that these calculated values are unique, but they also cannot be random, so this seemed the best way to do it.
The code in question looks like this:
entity = new Entity() { /* values */ };
context.SaveChanges(); //generate the ID field
entity.CALCULATED_FIELD = CalculateField(prefix, entity.ID);
This works just fine in 99% of cases, but occasionally we get a value in the database which looks like:
ID: 1234
CALCULATED_FIELD : prefix000{1233}8
EXPECTED: prefix000{1234}3
With the parts in the braces being calculated from the ID column.
The fact that the calculated field is incorrect is bad enough, but the implication is that upon doing a savechanges, there is no guarantee that the row returned to Entity Framework is the one which was originally worked on! I am looking into using a stored procedure on insert in order to fix the generated field problem, but in the long run we're going to have lots of bad data if we keep working on the wrong rows.
When I told entity framework to map the table to stored procedures it generated the following boilerplate code:
INSERT [dbo].[tableName](fields...)
VALUES(values...)
DECLARE #ID int
SELECT #ID = [ID]
FROM [dbo].[tableName]
WHERE ##ROWCOUNT > 0 AND [ID] = scope_identity()
SELECT t0.[ID]
FROM [dbo].[tableName] as t0
WHERE ##ROWCOUNT > 0 AND t0.[ID] = #ID
The best idea I can come up with is that an extra insert could occur before scope_identity() is called. We are migrating this system from using stored procedures where we used ##IDENTITY in place instead, could there be a difference there?
EDIT: CalculateField:
public static string CalculateField(string prefix, int ID)
{
var calculated = prefix.PadRight(17 - ID.ToString().Length)
.Replace(" ", "0") + ID.ToString();
var multiplier = 3;
var sum = 0;
foreach (char c in calculated.ToCharArray().Reverse())
{
sum += multiplier * int.Parse(c.ToString());
multiplier = 4 - multiplier;
}
if (sum % 10 == 0) { return calculated + "0"; }
return calculated + (10 - (sum % 10)).ToString();
}
UPDATE: Changing the called method from static to an instance method and only running it later after additional changed were made instead of straight after creation appears to have solved the problem, for reasons I can't comprehend. I'm leaving the question open for now since I don't yet have a large enough sample to be completely sure the problem is resolved, and also because I have no explanation for what really changed.

SQL Threadsafe UPDATE TOP 1 for FIFO Queue

I have a table of invoices being prepared, and then ready for printing.
[STATUS] column is Draft, Print, Printing, Printed
I need to get the ID of the first (FIFO) record to be printed, and change the record status. The operation must be threadsafe so that another process does not select the same InvoiceID
Can I do this (looks atomic to me, but maybe not ...):
1:
WITH CTE AS
(
SELECT TOP(1) [InvoiceID], [Status]
FROM INVOICES
WHERE [Status] = 'Print'
ORDER BY [PrintRequestedDate], [InvoiceID]
)
UPDATE CTE
SET [Status] = 'Printing'
, #InvoiceID = [InvoiceID]
... perform operations using #InvoiceID ...
UPDATE INVOICES
SET [Status] = 'Printed'
WHERE [InvoiceID] = #InvoiceID
or must I use this (for the first statement)
2:
UPDATE INVOICES
SET [Status] = 'Printing'
, #InvoiceID = [InvoiceID]
WHERE [InvoiceID] =
(
SELECT TOP 1 [InvoiceID]
FROM INVOICES WITH (UPDLOCK)
WHERE [Status] = 'Print'
ORDER BY [PrintRequestedDate], [InvoiceID]
)
... perform operations using #InvoiceID ... etc.
(I cannot hold a transaction open from changing status to "Printing" until the end of the process, i.e. when status is finally changed to "Printed").
EDIT:
In case it matters the DB is READ_COMMITTED_SNAPSHOT
I can hold a transaction for both UPDATE STATUS to "Printing" AND get the ID. But I cannot continue to keep transaction open all the way through to changing the status to "Printed". This is an SSRS report, and it makes several different queries to SQL to get various bits of the invoice, and it might crash/whatever, leaving the transaction open.
#Gordon Linoff "If you want a queue" The FIFO sequence is not critical, I would just like invoices that are requested first to be printed first ... "more or less" (don't want any unnecessary complexity ...)
#Martin Smith "looks like a usual table as queue requirement" - yes, exactly that, thanks for the very useful link.
SOLUTION:
The solution I am adopting is from comments:
#lad2025 pointed me to SQL Server Process Queue Race Condition which uses WITH (ROWLOCK, READPAST, UPDLOCK) and #MartinSmith explained what the Isolation issue is and pointed me at Using tables as Queues - which talks about exactly what I am trying to do.
I have not grasped why UPDATE TOP 1 is safe, and UPDATE MyTable SET xxx = yyy WHERE MyColumn = (SELECT TOP 1 SomeColumn FROM SomeTable ORDER BY AnotherColumn) (without Isolation Hints) is not, and I ought to educate myself, but I'm happy just to put the isolation hints in my code and get on with something else :)
Thanks for all the help.
My concern would be duplicate [InvoiceID]
Multiple print requests for the same [InvoiceID]
On the first update ONE row gets set [Status] = 'Printing'
On the second update all [InvoiceID] rows get set [Status] = 'Printed'
This would even set rows with status = 'draft'
Maybe that is what you want
Another process could pick up the same [InvoiceID] before the set [Status] = 'Print'
So some duplicates will print and some will not
I go with comments on use the update lock
This is non-deterministic but you could just take top (1) and skip the order by. You will tend to get the most recent row but it is not guaranteed. If you clear the queue then you get em all.
This demonstrates you can lose 'draft' = 1
declare #invID int;
declare #T table (iden int identity primary key, invID int, status tinyint);
insert into #T values (1, 2), (5, 1), (3, 1), (4, 1), (4, 2), (2, 1), (1, 1), (5, 2), (5, 2);
declare #iden int;
select * from #t order by iden;
declare #rowcount int = 1;
while (#ROWCOUNT > 0)
begin
update top (1) t
set t.status = 3, #invID = t.invID, #iden = t.iden
from #t t
where t.status = '2';
set #rowcount = ##ROWCOUNT;
if(#rowcount > 0)
begin
select #invID, #iden;
-- do stuff
update t
set t.status = 4
from #t t
where t.invID = #invID; -- t.iden = #iden;
select * from #T order by iden;
end
end
ATOMicity of a Single Statement
I think your code's fine as it is. i.e. Because you have a single statement which updates the status to printing as soon as the statement runs the status is updated; so anything running before which searches for print would have updated the same record to printing before your process saw it; so your process would pick a subsequent record, or any process hitting it after your statement had run would see it as printing so would not pick it up`. There isn't really a scenario where a record could pick it up whilst the statement's running since as discussed a single SQL statement should be atomic.
Disclaimer
That said, I'm not enough of an expert to say whether explicit lock hints would help; in my opinion they're not needed as the above is atomic, but others in the comments are likely better informed than me. However, running a test (albeit where database and both threads are run on the same machine) I couldn't create a race condition... maybe if the clients were on different machines / if there was a lot more concurrency you'd be more likely to see issues.
My hope is that others have interpreted your question differently, hence the disagreement.
Attempt to Disprove Myself
Here's the code I used to try to cause a race condition; you can drop it into LINQPad 5, select language C# Program, tweak the connectionstring (and optionally any statements) as required, then run:
const long NoOfRecordsToTest = 1000000;
const string ConnectionString = "Server=.;Database=Play;Trusted_Connection=True;"; //assumes a database called "play"
const string DropFifoQueueTable = #"
if object_id('FIFOQueue') is not null
drop table FIFOQueue";
const string CreateFifoQueueTable = #"
create table FIFOQueue
(
Id bigint not null identity (1,1) primary key clustered
, Processed bit default (0) --0=queued, null=processing, 1=processed
)";
const string GenerateDummyData = #"
with cte as
(
select 1 x
union all
select x + 1
from cte
where x < #NoRowsToGenerate
)
insert FIFOQueue(processed)
select 0
from cte
option (maxrecursion 0)
";
const string GetNextFromQueue = #"
with singleRecord as
(
select top (1) Id, Processed
from FIFOQueue --with(updlock, rowlock, readpast) --optionally include this per comment discussions
where processed = 0
order by Id
)
update singleRecord
set processed = null
output inserted.Id";
//we don't really need this last bit for our demo; I've included in case the discussion turns to this..
const string MarkRecordProcessed = #"
update FIFOQueue
set Processed = 1
where Id = #Id";
void Main()
{
SetupTestDatabase();
var task1 = Task<IList<long>>.Factory.StartNew(() => ExampleTaskForced(1));
var task2 = Task<IList<long>>.Factory.StartNew(() => ExampleTaskForced(2));
Task.WaitAll(task1, task2);
foreach (var processedByBothThreads in task1.Result.Intersect(task2.Result))
{
Console.WriteLine("Both threads processed id: {0}", processedByBothThreads);
}
Console.WriteLine("done");
}
static void SetupTestDatabase()
{
RunSql<int>(new SqlCommand(DropFifoQueueTable), cmd => cmd.ExecuteNonQuery());
RunSql<int>(new SqlCommand(CreateFifoQueueTable), cmd => cmd.ExecuteNonQuery());
var generateData = new SqlCommand(GenerateDummyData);
var param = generateData.Parameters.Add("#NoRowsToGenerate",SqlDbType.BigInt);
param.Value = NoOfRecordsToTest;
RunSql<int>(generateData, cmd => cmd.ExecuteNonQuery());
}
static IList<long> ExampleTaskForced(int threadId) => new List<long>(ExampleTask(threadId)); //needed to ensure prevent lazy loadling from causing issues with our tests
static IEnumerable<long> ExampleTask(int threadId)
{
long? x;
while ((x = ProcessNextInQueue(threadId)).HasValue)
{
yield return x.Value;
}
//yield return 55; //optionally return a fake result just to prove that were there a duplicate we'd catch it
}
static long? ProcessNextInQueue(int threadId)
{
var id = RunSql<long?>(new SqlCommand(GetNextFromQueue), cmd => (long?)cmd.ExecuteScalar());
//Debug.WriteLine("Thread {0} is processing id {1}", threadId, id?.ToString() ?? "[null]"); //if you want to see how we're doing uncomment this line (commented out to improve performance / increase the likelihood of a collision
/* then if we wanted to do the second bit we could include this
if(id.HasValue) {
var markProcessed = new SqlCommand(MarkRecordProcessed);
var param = markProcessed.Parameters.Add("#Id",SqlDbType.BigInt);
param.Value = id.Value;
RunSql<int>(markProcessed, cmd => cmd.ExecuteNonQuery());
}
*/
return id;
}
static T RunSql<T>(SqlCommand command, Func<SqlCommand,T> callback)
{
try
{
using (var connection = new SqlConnection(ConnectionString))
{
command.Connection = connection;
command.Connection.Open();
return (T)callback(command);
}
}
catch (Exception e)
{
Debug.WriteLine(e.ToString());
throw;
}
}
Other comments
The above discussion's really only talking about multiple threads taking the next record from the queue whilst avoiding any single record being picked up by multiple threads. There are a few other points...
Race Condition outside of SQL
Per our discussion, if FIFO is mandatory there are other things to worry about. i.e. whilst your threads will pick up each record in order after that it's up to them. e.g. Thread 1 gets record 10 then Thread 2 gets record 11. Now Thread 2 sends record 11 to the printer before Thread 1 sends record 10. If they're going to the same printer, your prints would be out of order. If they're different printers, not a problem; all prints on any printer are sequential. I'm going to assume the latter.
Exception Handling
If any exception occurs in a thread that's processing something (i.e. so the thread's record is printing) then consideration should be made on how to handle this. One option is to keep that thread retrying; though that may be indefinite if it's some fundamental fault. Another is to put the record to some error status to be handled by another process / where it's accepted that this record won't appear in order. Finally, if the order of invoices in the queue is an ideal rather than a hard requirement you could have the owning thread put the status back to print so that it or another thread can pick up that record to retry (though again, if there's something fundamentally wrong with the record, this may block up the queue).
My recommendation here is the error status; as then you have more visibility on the issue / can dedicate another process to dealing with issues.
Crash Handling
Another issue is that because your update to printing is not held within a transaction, if the server crashes you leave the record with this status in the database, and when your system comes back online it's ignored. Ways to avoid that are to include a column saying which thread is processing it; so that when the system comes back up that thread can resume where it left off, or to include a date stamp so that after some period of time any records with status printing which "timeout" can be swept up / reset to Error or Print statuses as required.
WITH CTE AS
(
SELECT TOP(1) [InvoiceID], [Status], [ThreadId]
FROM INVOICES
WHERE [Status] = 'Print'
OR ([Status] = 'Printing' and [ThreadId] = #ThreadId) --handle previous crash
ORDER BY [PrintRequestedDate], [InvoiceID]
)
UPDATE CTE
SET [Status] = 'Printing'
, [ThreadId] = #ThreadId
OUTPUT Inserted.[InvoiceID]
Awareness of Other Processes
We've mostly been focussed on the printing element; but other processes may also be interacting with your Invoices table. We can probably assume that aside from creating the initial Draft record and updating it to Print once ready for printing, those processes won't touch the Status field. However, the same records could be locked by completely unrelated processes. If we want to ensure FIFO we must not use the ReadPast hint, as some records may have status Print but be locked, so we'd skip over them despite them having the earlier PrintRequestedDate. However, if we want things to print as quickly as possible, with them being in order when otherwise not inconvenient, including ReadPast will allow our print process to skip the locked records and carry on, coming back to handle them once they're released.
Similarly another process may lock our record when it's at Printing status, so we can't update it to mark it as complete. Again, if we want to avoid this from causing a holdup we can make use of the ThreadId column to allow our thread to leave the record at status Printing and come back to clean it up later when it's not locked. Obviously this assumes that the ThreadId column is only used by our print process.
Have a dedicated print queue table
To avoid some of the issues with unrelated processes locking your invoices, move the Status field into its own table; so you only need to read from the invoices table; not update it.
This will also have the advantage that (if you don't care about print history) you can delete the record once done, so you'll get better performance (as you won't have to search through the entire invoices table to find those which are ready to print). That said, there's another option for this one (if you're on SQL2008 or above).
Use Filtered Index
Since the Status column will be updated several times it's not a great candidate for an index; i.e. as the record's position in the index jumps from branch to branch as the status progresses.
However, since we're filtering on it it's also something that would really benefit from having an index.
To get around this contradiction, one option's to use a filtered index; i.e. only index those records we're interested in for our print process; so we maintain a small index for a big benefit.
create nonclustered index ixf_Invoices_PrintStatusAndDate
on dbo.Invoices ([Status], [PrintRequestedDate])
include ([InvoiceId]) --just so we don't have to go off to the main table for this, but can get all we need form the index
where [Status] in ('Print','Printing')
Use "Enums" / Reference Tables
I suspect your example uses strings to keep the demo code simple, however including this for completeness.
Using strings in databases can make things hard to support. Rather than having status be a string value, instead use an ID from a related Statuses table.
create table Statuses
(
ID smallint not null primary key clustered --not identity since we'll likely mirror this enum in code / may want to have defined ids
,Name
)
go
insert Statuses
values
(0, 'Draft')
, (1,'Print')
, (2,'Printing')
, (3,'Printed')
create table Invoices
(
--...
, StatusId smallint foreign key references Statuses(Id)
--...
)

Yet another subquery issue

Hello from an absolute beginner in SQL!
I have a field I want to populate based on another table. For this I have written this query, which fails with: Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
The statement has been terminated.
oK, here goes:
Update kre.CustomerOrderLineCopy
SET DepNo = (SELECT customerordercopy.DepNo
FROM kre.CustomerOrderCopy , kre.CustomerOrderLineCopy
WHERE CustomerOrderLineCopy.OrderCopyNo =kre.CustomerOrderCopy.OrderCopyNo)
WHERE CustomerOrderLineCopy.OrderCopyNo = (SELECT CustomerOrderCopy.OrderCopyNo
FROM kre.CustomerOrderCopy, kre.CustomerOrderLineCopy
WHERE kre.CustomerOrderLineCopy.OrderCopyNo = kre.CustomerOrderCopy.OrderCopyNo)
What I'm trying to do is to change DepNo in CustomerOrderLineCopy, with the value in DepNo in CustomerOrderCopy - based on the same OrderCopyNo in both tables.
I'm open for all suggestion.
Thanks,
ohalvors
If you just join the tables together the update is easier:
UPDATE A SET A.DepNo = B.DepNo
FROM kre.CustomerOrderLineCopy A
INNER JOIN kre.CustomerOrderCopy B ON A.OrderCopyNo = B.OrderCopyNo
The problem is that at least one of your sub queries return more than one value. Think about this:
tablePerson(name, age)
Adam, 11
Eva, 11
Sven 22
update tablePerson
set name = (select name from tablePerson where age = 11)
where name = 'Sven'
Which is equivalent to: set Sven's name to Adam and Eva. Which is not possible.
If you want to use sub queries, either make sure your sub queries can only return one value or force one value by using:
select top 1 xxx from ...
This may be enough to quieten it down:
Update kre.CustomerOrderLineCopy
SET DepNo = (SELECT customerordercopy.DepNo
FROM kre.CustomerOrderCopy --, kre.CustomerOrderLineCopy
WHERE CustomerOrderLineCopy.OrderCopyNo =kre.CustomerOrderCopy.OrderCopyNo)
WHERE CustomerOrderLineCopy.OrderCopyNo = (SELECT CustomerOrderCopy.OrderCopyNo
FROM kre.CustomerOrderCopy --, kre.CustomerOrderLineCopy
WHERE kre.CustomerOrderLineCopy.OrderCopyNo = kre.CustomerOrderCopy.OrderCopyNo)
(Where I've commented out kre.CustomerOrderLineCopy in the subqueries) That is, you were hopefully trying to correlate these subqueries with the outer table - not introduce another instance of kre.CustomerOrderLineCopy.
If you still get an error, then you still have multiple rows in kre.CustomerOrderCopy which have the same OrderCopyNo. If that's so, you need to give us (and SQL Server) the rules that you want to apply for how to select which row you want to use.
The danger of switching to the FROM ... JOIN form shown in #Avitus's answer is that it will no longer report if there are multiple matching rows - it will just silently pick one of them - which one is never made clear.
Now I look at the query again, I'm not sure it even needs a WHERE clause now. I think this is the same:
Update kre.CustomerOrderLineCopy
SET DepNo = (
SELECT customerordercopy.DepNo
FROM kre.CustomerOrderCopy
WHERE CustomerOrderLineCopy.OrderCopyNo = kre.CustomerOrderCopy.OrderCopyNo)

Resources