Batch insert with EclipseLink not working

Batch insert with EclipseLink not working - database

I am using EclipseLink-2.6.1 with Amazon RDS database instance. The following code is used to insert new entities into database:
tx = em.getTransaction();
tx.begin();
for (T item : persistCollection) {
em.merge(item);
}
tx.commit();
The object which is being persisted has composite primary key (not a generated one). Locally, queries run super fast, but when inserting into remote DB it is really slow process (~20 times slower). I have tried to implement JDBC batch writing but had no success with it (eclipselink.jdbc.batch-writing and rewriteBatchedStatements=true). When logging queries which are being executed I only see lots of SELECTS and not one INSERT (SELECTS are probably here because objects are detached at first).
My question is how to proceed on this problem? (I would like to have batch writing and then see how the performance changes, but any help is appreciated)
Thank you!
Edit:
When using em.persist(item) loop is finished almost instantly but after tx.commit() there are lots (I guess for every persisted item) queries like :
[EL Fine]: sql: ServerSession(925803196) Connection(60187547) SELECT NAME FROM TICKER WHERE (NAME = ?), bind => [AA]
My model has #ManyToOne relationship with ticker_name. Why are there again so many slow SELECT queries?

Related

Why does postgres lock one table when inserting into another

My source tables called Event sitting in a different database and it has millions of rows. Each event can have an action of DELETE, UPDATE or NEW.
We have a Java process that goes through these events in the order they were created and do all sort of rules and then insert the results into multiple tables for look up, analyse etc..
I am using JdbcTemplate and using batchUpdate to delete and upsert to Postgres DB in a sequential order right now, but I'd like to be able to parallel too. Each batch is 1,000 entities to be insert/upserted or deleted.
However, currently even doing in a sequential manner, Postgres locks queries somehow which I don't know much about and why.
Here are some of the codes
entityService.deleteBatch(deletedEntities);
indexingService.deleteBatch(deletedEntities);
...
entityService.updateBatch(allActiveEntities);
indexingService.updateBatch(....);
Each of these services are doing insert/delete into different tables. They are in one transaction though.
The following query
SELECT
activity.pid,
activity.usename,
activity.query,
blocking.pid AS blocking_id,
blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));
returns
Query being blocked: "insert INTO ENTITY (reference, seq, data) VALUES($1, $2, $3) ON CONFLICT ON CONSTRAINT ENTITY_c DO UPDATE SET data = $4",
Blockking query: delete from ENTITY_INDEX where reference = $1
There are no foreign constraints between these tables. And we do have indexes so that we can run queries for our processing as part of the process.
Why would one completely different table can block the other tables? And how can we go about resolving this?

Your query is misleading.
What it shows as “blocking query” is really the last statement that ran in the blocking transaction.
It was probably a previous statement in the same transaction that caused entity (or rather a row in it) to be locked.

High volume inserts for SQL Server

I'm looking for some advice on how to implement a process for mass inserts, like to the tune of 400 records per second. The data is coming from an outside real time trigger and the app will get notified when a data change happens. When that data change happens, I need to consume it.
I've looked at several different implementations for doing batch processing including using datatables/sqlbulkcopy or writing to csv and consuming.
What can you recommend?

400 inserts per second doesn't feel like it should present any major challenge. It depends on what you're inserting, if there are any indexes which could have page splits due to inserts, and if you have any extra logic going on during your insert proc or script.
If you want to insert them one by one, I would recommend just building a barebones stored procedure which does a simple insert of it's parameters into a staging table with no indexes, constraints, anything. That will allow you to very quickly get the data into the database, and you can have a separate process come through every minute or something and work off the rows in batches.
Alternatively, you could have your application store up records until you reach a certain number, and then insert them into the database with a proc using a table-valued parameter. Then you'll only have one insert of however many rows you chose to batch up. The cost of that should be pretty trivial. Do note however that if your application crashes before it's inserted enough rows, those will be lost.
SqlBulkCopy is a powerful tool, but as the name suggests, it's built more for bulk loading of tables. If you have a constant stream of insert requests coming in, I would not recommend using it to load up your data. That might be a good approach if you want to batch up a LOT of requests to load all at once, but not as a recurring and frequent activity.

This works pretty well for me. I can't guarantee you 400 per sec tho:
private async Task BulkInsert(string tableName, DataTable dt)
{
if (dt == null)
return;
using (SqlBulkCopy bulkCopy = new SqlBulkCopy("./sqlserver..."))
{
bulkCopy.DestinationTableName = tableName;
await bulkCopy.WriteToServerAsync(dt);
}
}

Trying to query data from an enormous SQL Server table using EF 5

I am working with a SQL Server table that contains 80 million (80,000,000) rows. Data space = 198,000 MB. Not surprisingly, queries against this table often churn or timeout. To add to the issues, the table rows get updated fairly frequently and new rows also get added on a regular basis. It thus continues to grow like a viral outbreak.
My issue is that I would like to write Entity Framework 5 LINQ to Entities queries to grab rows from this monster table. As I've tried, timeouts have become outright epidemic. A few more things: the table's primary key is indexed and it has non-clustered indexes on 4 of its 19 columns.
So far, I am writing simple LINQ queries that use Transaction Scope and Read Uncommitted Isolation Level. I have tried increasing both the command timeout and the connection timeout. I have written queries that return FirstOrDefault() or a collection, such as the following, which attempts to grab a single ID (an int) from seven days before the current date:
public int GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
using (var txn = new TransactionScope(TransactionScopeOption.Required, new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
{
var GetId = from te in _repo.GetTEvents()
where te.cr_date > sevenDaysAgo
orderby te.cr_date
select te.id;
return GetId.FirstOrDefault();
}
}
and
public IEnumerable<int> GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
using (var txn = new TransactionScope(TransactionScopeOption.Required, new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
{
var GetId = from te in _repo.GetTEvents()
where te.cr_date > sevenDaysAgo
orderby te.cr_date
select te.id;
return GetId.Take(1);
}
}
Each query times out repeatedly regardless of the timeout settings. I'm using the repository pattern with Unity DI and fetching the table with IQueryable<> calls. I'm also limiting the repository call to eight days from the current date (hoping to only grab the needed subset of this mammoth table). I'm using Visual Studio 2013 with Update 5 targeting .NET v4.5 and SQL Server 2008 R2.
I generated the SQL statement that EF generates and it didn't look incredibly more complicated than the LINQ statements above. And my brain hurts.
So, have I reached some sort of tolerance limit for EF? Is the table simply too big? Should I revert to Stored Procedures/domain methods when querying this table? Are there other options I should explore? There was some discussion around removing some of the table's rows, but that probably won't happen anytime soon. I did read a little about paging, but I'm not sure if that would help or not. Any thoughts or ideas would be appreciated! Thank you!

As I can see you only selecting data and don't change it. So why do you need to use TransactionScope? You need it only when you have 2 or more SaveChanges() in your code and you want them to be in one transaction. So get rid of it.
Another thing that i whould use in your case is disable change tracking and auto detection of changes on your context. But be carefull if you don't rectreade your context on each request. It can presist old data.
To do it you should write this lines near your context initialization:
context.ObjectTrackingEnabled = false;
context.DeferredLoadingEnabled = false;
The other thing that you should think about is pagenation and Cache. But as i can see in your example you trying to get only one row. So can't say anything particular.
I recommend you to read this article to further optimisation.

It's not easy to say if you have to go with stored procedures or EF since we speak for a monster. :-)
The first thing I would do is to run the query in SSMS displaying the Actual Execution Plan. Sometimes it provides information about indexes missing that might increase performance.
From you example, I 'm pretty sure you need an index on that date column.
In other words, -if you have access- be sure that table design is optimal for that amount of data.
My thought is that if a simple query hangs, what more EF can do?

Large Data Service Architecture

Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?

One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)

We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.

Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.

Consolidating NHibernate open sessions to the DB (coupling NHibernate and DB sessions)

We use NHibernate for ORM, and at the initialization phase of our program we need to load many instances of some class T from the DB.
In our application, the following code, which extracts all these instances, takes forever:
public IList<T> GetAllWithoutTransaction()
{
using (ISession session = GetSession())
{
IList<T> entities = session
.CreateCriteria(typeof(T))
.List<T>();
return entities;
}
}
}
Using the NHibernate log I found that the actual SQL queries the framework uses are:
{
Load a bunch of rows from a few tables in the DB (one SELECT statement).
for each instance of class T
{
Load all the data for this instance of class T from the abovementioned rows
(3 SELECT statements).
}
}
The 3 select statements are coupled, i.e. the second is dependent on the first, and the third on the first two.
As you can see, the number of SELECT statements is in the millions, giving us a huge overhead which results directly from all those calls to the DB (each of which entails "open DB session", "close DB session", ...), even though we are using a single NHibernate session.
My question is this:
I would somehow like to consolidate all those SELECT statements into one big SELECT statement. We are not running multithreaded, and are not altering the DB in any way at the init phase.
One way of doing this would be to define my own object and map it using NHibernate, which will be loaded quickly and load everything in one query, but it would require that we implement the join operations used in those statements ourselves, and worse - breaks the ORM abstraction.
Is there any way of doing this by some configuration?
Thanks guys!

This is known as the SELECT N+1 problem. You need to decide where you're going to place your joins (FetchMode.Eager)

If you can write the query as a single query in SQL, you can get NHibernate to execute it as a single query (usually without breaking the abstraction).
It appears likely that you have some relationships / classes setup to lazy load when what you really want in this scenario is eager loading.
There is plenty of good information about this in the NHibernate docs. You may want to start here:
http://www.nhforge.org/doc/nh/en/index.html#performance-fetching