JDBC batched updates: good key retrieval strategy - database

I insert a lot of data into a table with an autogenerated key using the batchUpdate functionality of JDBC. Because JDBC doesn't say anything about the combination of batchUpdate and getGeneratedKeys, I need some database-independent workaround.
My ideas:
1. Somehow pull the next handed-out sequence values from the database before inserting and then use the keys manually. But JDBC hasn't got a getTheNextFutureKeys(howMany). So how can this be done? Is pulling keys e.g. in Oracle also transaction safe, so that no two transactions can ever pull the same set of future keys?
2. Add an extra column with a fake id that is only valid during the transaction.
3. Use all the other columns as a secondary key to fetch the generated key. This isn't really 3NF-conformant...
Are there better ideas or how can I use idea 1 in a generalized way?

Partial answer
Is pulling keys e.g. in Oracle also transaction safe?
Yes, getting values from a sequence is transaction safe, by which I mean even if you roll back your transaction, a sequence value returned by the DB won't be returned again under any circumstances.
So you can prefetch the IDs from a sequence and use them in the batch insert.
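A minimal sketch of how that could look with plain JDBC against Oracle. Everything here is an assumption for illustration: the sequence AUTHOR_SEQ, an AUTHORS table with an explicit ID column, an open java.sql.Connection named conn, and the batchSize/lastNames/firstNames/homes inputs.
List<Long> ids = new ArrayList<>();
try (PreparedStatement nextVal = conn.prepareStatement(
        "SELECT AUTHOR_SEQ.NEXTVAL FROM DUAL")) {
    for (int i = 0; i < batchSize; i++) {          // one round trip per key; cheap but not free
        try (ResultSet rs = nextVal.executeQuery()) {
            rs.next();
            ids.add(rs.getLong(1));
        }
    }
}
try (PreparedStatement insert = conn.prepareStatement(
        "INSERT INTO AUTHORS (ID, LAST, FIRST, HOME) VALUES (?, ?, ?, ?)")) {
    for (int i = 0; i < batchSize; i++) {
        insert.setLong(1, ids.get(i));             // key assigned up front from the sequence
        insert.setString(2, lastNames[i]);
        insert.setString(3, firstNames[i]);
        insert.setString(4, homes[i]);
        insert.addBatch();
    }
    insert.executeBatch();
}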

I had never run into this, so I dived into it a little.
First of all, there is a way to retrieve the generated ids from a JDBC statement:
String sql = "INSERT INTO AUTHORS (LAST, FIRST, HOME) VALUES " +
"'PARKER', 'DOROTHY', 'USA', keyColumn";
int rows = stmt.executeUpdate(sql,
Statement.RETURN_GENERATED_KEYS);
ResultSet rs = stmt.getGeneratedKeys();
if (rs.next()) {
ResultSetMetaData rsmd = rs.getMetaData();
int colCount = rsmd.getColumnCount();
do {
for (int i = 1; i <= colCount; i++) {
String key = rs.getString(i);
System.out.println("key " + i + "is " + key);
}
}
while (rs.next();)
}
else {
System.out.println("There are no generated keys.");
}
See this: http://download.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/statement.html#1000569
Also, theoretically this could be combined with the JDBC batchUpdate.
However, this combination seems to be rather non-trivial; on this, please refer to this thread.
I suggest trying this and, if you do not succeed, falling back to pre-fetching from the sequence.
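A hedged sketch of that combination. Whether getGeneratedKeys() returns anything useful after executeBatch() is driver-specific, so this needs to be tested against your driver; conn and the authors collection (with an Author type) are assumed placeholders.
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO AUTHORS (LAST, FIRST, HOME) VALUES (?, ?, ?)",
        Statement.RETURN_GENERATED_KEYS)) {
    for (Author a : authors) {                     // 'authors' is an assumed collection
        ps.setString(1, a.getLast());
        ps.setString(2, a.getFirst());
        ps.setString(3, a.getHome());
        ps.addBatch();
    }
    ps.executeBatch();
    // Some drivers expose one row per batched insert here, others return nothing at all.
    try (ResultSet keys = ps.getGeneratedKeys()) {
        while (keys.next()) {
            System.out.println("generated key: " + keys.getLong(1));
        }
    }
}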

getGeneratedKeys() will also work with a batch update, as far as I remember.
It returns a ResultSet with all newly created ids - not just a single value.
But that requires that the ID is populated through a trigger during the INSERT operation.

Related

Dapper-Plus BulkInsert - How to return number of rows affected?

In Dapper-Plus, is there a way to return the number of rows affected in the database? This is my code:
using (SqlConnection connection = new SqlConnection(Environment.GetEnvironmentVariable("sqldb_connection")))
{
    connection.BulkInsert(myList);
}
I see you can do it for inserting a single row, but I can't find that functionality on the Dapper Plus bulk insert.
Since Dapper Plus allows chaining multiple methods, the method doesn't directly return this value.
However, you can get it with the following code:
var resultInfo = new Z.BulkOperations.ResultInfo();

connection.UseBulkOptions(options => {
    options.UseRowsAffected = true;
    options.ResultInfo = resultInfo;
}).BulkInsert(orders);

// Show RowsAffected
Console.WriteLine("Rows Inserted: " + resultInfo.RowsAffectedInserted);
Console.WriteLine("Rows Affected: " + resultInfo.RowsAffected);
Fiddle: https://dotnetfiddle.net/mOMNng
Keep in mind that using that option will make the bulk operations slightly slower.
EDIT: Answer comment
will it make it as slow as using the regular dapper insert method or is this way still faster?
It will still be way faster than regular Insert.

Common strategy in handling concurrent global 'inventory' updates

To give a simplified example:
I have a database with one table: names, which has 1 million records each containing a common boy or girl's name, and more added every day.
I have an application server that takes as input HTTP requests from parents using my website 'Name Chooser'. With each request, I need to pick a name from the DB and return it, and then NOT give that name to another parent. The server is concurrent so it can handle a high volume of requests, yet it has to respect "unique name per request" and still be highly available.
What are the major components and strategies for an architecture of this use case?
From what I understand, you have two operations: Adding a name and Choosing a name.
I have a couple of questions:
Question 1: Do parents choose names only, or do they also add names?
Question 2: If they add names, does that mean that when a name is added it should also be marked as already chosen?
Assuming that you don't want to make all name-selection requests wait for one another (by locking or queueing them):
One solution to resolve concurrency in the case of choosing a name only is to use an Optimistic Offline Lock.
The most common implementation of this is to add a version field to your table and increment this version when you mark a name as chosen. You will need DB support for this, but most databases offer a mechanism for it. MongoDB adds a version field to its documents by default. For an RDBMS (like a SQL database) you have to add this field yourself.
You haven't specified what technology you are using, so I will give an example using pseudocode for a SQL DB. For MongoDB you can check how the DB makes these checks for you.
NameRecord {
    id,
    name,
    parentID,
    version,
    isChosen,

    function chooseForParent(parentID) {
        if (this.isChosen) {
            throw Error/Exception;
        }
        this.parentID = parentID;
        this.isChosen = true;
        this.version++;
    }
}

NameRecordRepository {
    function getByName(name) { ... }

    function save(record) {
        var oldVersion = record.version - 1;
        var query = "UPDATE records SET .....
                     WHERE id = {record.id} AND version = {oldVersion}";
        var rowsCount = db.execute(query);
        if (rowsCount == 0) {
            throw ConcurrencyViolation;
        }
    }
}

// somewhere else in an object or module or whatever...
function chooseName(parentID, name) {
    var record = NameRecordRepository.getByName(name);
    record.chooseForParent(parentID);
    NameRecordRepository.save(record);
}
Before this object is saved to the DB, a version comparison must be performed. SQL provides a way to execute a query based on some condition and return the number of affected rows. In our case we check whether the version in the database is still the same as the old one before updating. If it isn't, that means someone else has updated the record.
In this simple case you can even remove the version field and use the isChosen flag in your SQL query like this:
var query = "UPDATE records SET .....
WHERE id = {record.id} AND isChosend = false";
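For illustration, a minimal JDBC version of that conditional update; the names table, its columns, and the conn, parentId and nameId variables are all assumptions.
String sql = "UPDATE names SET parent_id = ?, is_chosen = 1 "
           + "WHERE id = ? AND is_chosen = 0";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setLong(1, parentId);
    ps.setLong(2, nameId);
    int affected = ps.executeUpdate();
    if (affected == 0) {
        // another request chose this name first (or the id does not exist)
        throw new IllegalStateException("Concurrency violation for name id " + nameId);
    }
}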
When adding a new name to the database, you will need a unique constraint, which will solve the concurrency issues.

RDBMS - one to one -> check if entry exists and insert/update OR always delete and only insert?

Unfortunately I have no better idea how to name this question, so if you have a better suggestion go ahead and edit it :-)
For years I have been using this approach when trying to insert/update a one-to-one resource: perform a DELETE to make sure there is no row with such a PK, and then perform an INSERT only. I always thought this was the best approach both for performance and for simplicity. When talking about performance I mean both sides - the DB and the application layer (i.e. executing a DELETE query seems to be less expensive than executing a SELECT and checking the result, especially considering that data is transferred in both directions).
But of course there are other approaches, like INSERT ... ON DUPLICATE KEY UPDATE ..., IF EXISTS (SELECT ...) UPDATE ... ELSE INSERT ..., or UPDATE ...; IF ROWCOUNT = 0 INSERT ... (depending on the underlying RDBMS), and of course doing the same on the application layer, i.e. first checking whether the entry exists and performing an UPDATE if it does, an INSERT otherwise (or performing an UPDATE and checking the number of affected rows; if it is zero, performing an INSERT - which brings another complication, since an UPDATE that does not change the underlying resource also reports zero affected rows, so the following INSERT would then fail with a duplicate PK error)...
I am now curious what the best approach is. By best I mean considering performance, best practice, etc...
Using ON DUPLICATE KEY UPDATE is the only sane thing to do; it's simpler and its intention is clear - it's easy to read. It will also perform much better, since the index entry doesn't need to be changed.
Using DELETE then INSERT is vulnerable to a race condition if two processes try to update the same key simultaneously, which would result in a unique-key violation for one of the processes. It's also much slower, because not only must a row be physically deleted and a new one inserted, but the index entry must also be deleted and then inserted.
Using IF EXISTS is not a good choice either, because you must do it from within a stored procedure, so it's locked to that invocation choice and can't be ported into an application. Plus, it's just an attempt at replicating the ON DUPLICATE KEY UPDATE built-in command, so it will never be as efficient.
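For reference, a plain-JDBC sketch of such an upsert (MySQL syntax; the profile table, its columns, and the conn/userId/nickname/country variables are made up for illustration). Note that the value parameters appear twice, once for the INSERT part and once for the UPDATE part, which is exactly what the helper further below avoids writing by hand.
String sql = "INSERT INTO profile (user_id, nickname, country) VALUES (?, ?, ?) "
           + "ON DUPLICATE KEY UPDATE nickname = ?, country = ?";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setLong(1, userId);          // key, not repeated
    ps.setString(2, nickname);      // INSERT values
    ps.setString(3, country);
    ps.setString(4, nickname);      // same values again for the UPDATE branch
    ps.setString(5, country);
    ps.executeUpdate();
}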
Responding to the comment regarding programmer performance, i.e. efficiency, which IMHO is very important too.
If you want to avoid duplicating parameters, add a little refactoring for convenience:
void applyTwice(PreparedStatement stmt, int fromIndex, Object... values) throws SQLException {
    for (int i = 0; i < values.length; i++) {
        stmt.setObject(i + fromIndex, values[i]);
        stmt.setObject(i + fromIndex + values.length, values[i]);
    }
}
Calling this like:
applyTwice(stmt, 3, "foo", "bar", 99);
will effectively do this:
stmt.setObject(3, "foo");
stmt.setObject(4, "bar");
stmt.setObject(5, 99);
stmt.setObject(6, "foo");
stmt.setObject(7, "bar");
stmt.setObject(8, 99);
The fromIndex parameter allows for non-repeated parameters like ids etc that aren't repeated in the query.
You could also make a simple method that applies a single value at multiple indexes:
void apply(PreparedStatement stmt, Object value, int... indexes) throws SQLException {
    for (int i = 0; i < indexes.length; i++) {
        stmt.setObject(indexes[i], value);
    }
}
Which you would call like:
apply(stmt, "foo", 3, 6);

How to solve "Batch update returned unexpected row count from update; actual row count: 0; expected: 1" problem?

Getting this every time I attempt to CREATE a particular entity ... just want to know how I should go about figuring out the cause.
I'm using Fluent NHibernate auto-mapping, so perhaps I haven't set a convention appropriately and/or need to override something in one or more mapping files. I've gone through a number of posts on the web regarding this problem and am having a hard time figuring out exactly why it is happening in my case.
The object I'm saving is pretty simple. It is a "Person" object that references a "Company" entity and has a collection of "Address" entities. UPDATES work fine on existing Person objects that are already in the database.
Any suggestions?
The error means that the SQL INSERT statement is being executed, but the ROWCOUNT being returned by SQL Server after it runs is 0, not 1 as expected.
There are several causes, from incorrect mappings to UPDATE/INSERT triggers that have the row count turned off.
Your best bet is to profile the SQL statements and see what happens. To do that, either turn on NHibernate SQL logging or use the SQL profiler. Once you have the SQL you may know the cause; if not, try running the SQL manually and see what happens.
Also, I suggest you post your mapping, as it will help people spot any issues.
This can happen when trigger(s) execute additional DML (data modification) queries which affect the row counts. My solution was to add the following at the top of my trigger:
SET NOCOUNT ON;
This may occur because of an auto-increment primary key. To solve this problem, do not insert the auto-increment value with the data set; insert the data without the primary key.
When targeting a view with an INSTEAD OF trigger it can be next to impossible to get the correct row count. After delving a bit into the source I found out that you can make a custom persister which makes NHibernate ignore the count checks.
public class SingleTableNoResultCheckEntityPersister : SingleTableEntityPersister
{
    public SingleTableNoResultCheckEntityPersister(PersistentClass persistentClass, ICacheConcurrencyStrategy cache, ISessionFactoryImplementor factory, IMapping mapping)
        : base(persistentClass, cache, factory, mapping)
    {
        for (int i = 0; i < this.insertResultCheckStyles.Length; i++)
        {
            this.insertResultCheckStyles[i] = ExecuteUpdateResultCheckStyle.None;
        }
        for (int i = 0; i < this.updateResultCheckStyles.Length; i++)
        {
            this.updateResultCheckStyles[i] = ExecuteUpdateResultCheckStyle.None;
        }
        for (int i = 0; i < this.deleteResultCheckStyles.Length; i++)
        {
            this.deleteResultCheckStyles[i] = ExecuteUpdateResultCheckStyle.None;
        }
    }
}
If you get a row count greater than 1, it's usually because you have duplicate rows in a table with the same IDs. Check your table for duplicated records; deleting the duplicates will fix it.
NHibernate.AdoNet.TooManyRowsAffectedException: 'Batch update returned
unexpected row count from update; actual row count: 2; expected: 1'

Hitting the 2100 parameter limit (SQL Server) when using Contains()

from f in CUSTOMERS
where depts.Contains(f.DEPT_ID)
select f.NAME
depts is a list (IEnumerable<int>) of department ids
This query works fine until you pass a large list (say around 3000 dept ids)... then I get this error:
The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.
I changed my query to:
var dept_ids = string.Join(" ", depts.ToStringArray());
from f in CUSTOMERS
where dept_ids.IndexOf(Convert.ToString(f.DEPT_id)) != -1
select f.NAME
Using IndexOf() fixed the error but made the query slow. Is there any other way to solve this? Thanks so much.
My solution (Guids is a list of ids you would like to filter by):
List<MyTestEntity> result = new List<MyTestEntity>();
for (int i = 0; i < Math.Ceiling((double)Guids.Count / 2000); i++)
{
    var nextGuids = Guids.Skip(i * 2000).Take(2000);
    result.AddRange(db.Tests.Where(x => nextGuids.Contains(x.Id)));
}
this.DataContext = result;
Why not write the query in SQL and attach your entity?
It's been a while since I worked in LINQ, but here goes:
IQuery q = Session.CreateQuery(@"
    select *
    from customerTable f
    where f.DEPT_id in (" + string.Join(",", depts.ToStringArray()) + ")");
q.AttachEntity(CUSTOMER);
Of course, you will need to protect against injection, but that shouldn't be too hard.
You will want to check out the LINQKit project, since somewhere within it there is a technique for batching up such statements to solve this issue. I believe the idea is to use the PredicateBuilder to break the local collection into smaller chunks, but I haven't reviewed the solution in detail because I've instead been looking for a more natural way to handle this.
Unfortunately it appears from Microsoft's response to my suggestion to fix this behavior that there are no plans set to have this addressed for .NET Framework 4.0 or even subsequent service packs.
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=475984
UPDATE:
I've opened up some discussion regarding whether this was going to be fixed for LINQ to SQL or the ADO.NET Entity Framework on the MSDN forums. Please see these posts for more information regarding these topics and to see the temporary workaround that I've come up with using XML and a SQL UDF.
I had a similar problem, and found two ways to fix it:
1. the Intersect method
2. a join on the IDs
To get values that are NOT in the list, I used the Except method or a left join.
Update
EntityFramework 6.2 runs the following query successfully:
var employeeIDs = Enumerable.Range(3, 5000);
var orders =
from order in Orders
where employeeIDs.Contains((int)order.EmployeeID)
select order;
Your post was from a while ago, but perhaps someone will benefit from this. Entity Framework does a lot of query caching; every time you send in a different parameter count, that gets added to the cache. Using a "Contains" call will cause SQL to generate a clause like "WHERE x IN (@p1, @p2, ..., @pn)", and bloat the EF cache.
Recently I looked for a new way to handle this, and I found that you can create an entire table of data as a parameter. Here's how to do it:
First, you'll need to create a custom table type, so run this in SQL Server (in my case I called the custom type "TableId"):
CREATE TYPE [dbo].[TableId] AS TABLE(
    Id [int] PRIMARY KEY
)
Then, in C#, you can create a DataTable and load it into a structured parameter that matches the type. You can add as many data rows as you want:
DataTable dt = new DataTable();
dt.Columns.Add("id", typeof(int));
This is an arbitrary list of IDs to search on. You can make the list as large as you want:
dt.Rows.Add(24262);
dt.Rows.Add(24267);
dt.Rows.Add(24264);
Create an SqlParameter using the custom table type and your data table:
SqlParameter tableParameter = new SqlParameter("@id", SqlDbType.Structured);
tableParameter.TypeName = "dbo.TableId";
tableParameter.Value = dt;
Then you can call a bit of SQL from your context that joins your existing table to the values from your table parameter. This will give you all records that match your ID list:
var items = context.Dailies.FromSqlRaw<Dailies>("SELECT * FROM dbo.Dailies d INNER JOIN @id id ON d.Daily_ID = id.id", tableParameter).AsNoTracking().ToList();
You could always partition your list of depts into smaller sets before you pass them as parameters to the IN statement generated by LINQ. See here:
Divide a large IEnumerable into smaller IEnumerable of a fix amount of item
